Accumulo >> mail # user >> Determining the cause of a tablet server failure


Mike Hugo 2013-02-27, 17:10
John Vines 2013-02-27, 17:12
Eric Newton 2013-02-27, 17:15
Adam Fuchs 2013-02-27, 17:24
Mike Hugo 2013-02-27, 20:17
Adam Fuchs 2013-02-27, 20:27
John Vines 2013-02-27, 20:32
Christopher 2013-02-27, 22:46

Re: Determining the cause of a tablet server failure
Ditto. I don't like the idea of sprinkling additional stuff into log4j, but
I am in favor of trying to make it easier to recognize when tservers die
due to OOMs if there are more suggestions.
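
One option that leaves log4j alone is to check the .err and .out files after the fact. Below is a minimal sketch in Python, assuming the log directory holds plain-text files ending in .err and .out and that an OOM leaves the string "OutOfMemoryError" somewhere in them. The function name find_oom_evidence and its parameters are made up for illustration and are not part of Accumulo.

```python
from pathlib import Path


def find_oom_evidence(log_dir, needle="OutOfMemoryError", suffixes=(".err", ".out")):
    """Return (file name, line number, line) for each line containing `needle`.

    Only regular files whose suffix is in `suffixes` are scanned. Both the
    search string and the suffixes are assumptions; adjust them to your setup.
    """
    hits = []
    for path in sorted(Path(log_dir).iterdir()):
        if not path.is_file() or path.suffix not in suffixes:
            continue
        # errors="replace" keeps the scan going if a file has odd bytes in it.
        with path.open(errors="replace") as fh:
            for lineno, line in enumerate(fh, start=1):
                if needle in line:
                    hits.append((path.name, lineno, line.rstrip("\n")))
    return hits
```

Pointing find_oom_evidence at $ACCUMULO_HOME/logs returns a list of matches. An empty list only means the string did not appear in those files; it is not proof that no OOM happened.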

On Wednesday, February 27, 2013, Christopher wrote:

> I agree with John Vines.
>
> --
> Christopher L Tubbs II
> http://gravatar.com/ctubbsii
>
>
> On Wed, Feb 27, 2013 at 12:32 PM, John Vines <[EMAIL PROTECTED]> wrote:
> > I don't like the idea of blending manual logging with log4j in a single
> > file. It's in the .err file already, I don't think anything else is
> > necessary.
> >
> >
> >
> > On Wed, Feb 27, 2013 at 3:27 PM, Adam Fuchs <[EMAIL PROTECTED]> wrote:
> >>
> >> So, question for the community: inside bin/accumulo we have:
> >>   -XX:OnOutOfMemoryError="kill -9 %p"
> >> Should this also append a log message? Something like:
> >>   -XX:OnOutOfMemoryError="kill -9 %p; echo 'ran out of memory' >> logfilename"
> >> Is this necessary, or should the OutOfMemoryError still find its way
> >> to the regular log?
> >>
> >> Adam
> >>
> >>
> >>
> >> On Wed, Feb 27, 2013 at 3:17 PM, Mike Hugo <[EMAIL PROTECTED]> wrote:
> >>>
> >>> I'm chalking this up to a misconfigured server.  It looks like during
> >>> the install on this server the accumulo-env.sh file was copied from the
> >>> examples, but rather than editing it to set JAVA_HOME, HADOOP_HOME,
> >>> and ZOOKEEPER_HOME, the entire file contents were replaced with
> >>> those env variables.
> >>>
> >>> I'm assuming this caused us to pick up the default (?) _OPTS settings
> >>> rather than the correct ones we should have been getting from the
> >>> examples based on our server memory capacity.  So we had a bunch of
> >>> Accumulo-related Java processes all running with memory settings that
> >>> were way out of whack from what they should have been.
> >>>
> >>> To solve it I copied in the files from the conf/examples directory
> >>> again, made sure everything was set up correctly, and restarted
> >>> everything.
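
A quick sanity check can catch this kind of stripped-down accumulo-env.sh before the servers are started. Below is a minimal sketch in Python, assuming the file uses ordinary NAME=value shell assignments. The helper names assigned_variables and check_env are hypothetical, and the regular expression is a rough heuristic rather than a shell parser.

```python
import re

# Rough heuristic: an all-caps shell variable name immediately followed by "=".
ASSIGNMENT = re.compile(r"\b([A-Z][A-Z0-9_]*)=")


def assigned_variables(env_text):
    """Return the set of variable names assigned on non-comment lines."""
    names = set()
    for line in env_text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue
        names.update(ASSIGNMENT.findall(stripped))
    return names


def check_env(env_text, required=("JAVA_HOME", "HADOOP_HOME", "ZOOKEEPER_HOME")):
    """Return (missing required names, sorted names ending in _OPTS)."""
    names = assigned_variables(env_text)
    missing = [name for name in required if name not in names]
    opts = sorted(name for name in names if name.endswith("_OPTS"))
    return missing, opts
```

Reading the file's text and passing it to check_env gives two lists. An empty second list means the file sets no _OPTS variables at all, which is the situation described above.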
> >>>
> >>> We never did see anything in our log files or .out / .err logs
> >>> indicating the source of the problem, but the above is my best guess
> >>> as to what was going on.
> >>>
> >>> Thanks again for all the tips and pointers!
> >>>
> >>> Mike
> >>>
> >>>
> >>> On Wed, Feb 27, 2013 at 11:24 AM, Adam Fuchs <[EMAIL PROTECTED]> wrote:
> >>>>
> >>>> There are a few primary reasons why your tablet server would die:
> >>>> 1. Lost lock in ZooKeeper. If the tablet server and ZooKeeper can't
> >>>> communicate with each other then the lock will time out and the
> >>>> tablet server will kill itself. This should show up as several
> >>>> messages in the tserver log. If this happens when a tablet server is
> >>>> really busy (lots of threads doing stuff) then the log message about
> >>>> the lost lock can be pretty far back in the queue. Java garbage
> >>>> collection can cause long pauses that inhibit the tserver/ZooKeeper
> >>>> messages. ZooKeeper can also get overwhelmed and behave poorly if
> >>>> the server it's running on swaps it out.
> >>>> 2. Problems talking with the master. If a tablet server is too slow in
> >>>> communicating with the master then the master will try to kill it.
> >>>> This should show up in the master log, and also will be noted in the
> >>>> tserver log.
> >>>> 3. Out of memory. If the tserver JVM runs out of memory it will
> >>>> terminate. As John mentioned, this will be in the .err or .out files
> >>>> in the log directory.
> >>>>
> >>>> Adam
> >>>>
> >>>>
> >>>>
> >>>> On Wed, Feb 27, 2013 at 12:10 PM, Mike Hugo <[EMAIL PROTECTED]> wrote:
> >>>>>
> >>>>> After running an ingest process via map reduce for about an hour or
> >>>>> so, one of our tservers fails.  It happens pretty consistently; we're
> >>>>> able to replicate it without too much difficulty.
> >>>>>
> >>>>> I'm looking in the $ACCUMULO_HOME/logs directory for clues as to why
> >>>>> the tserver fails, but I'm not seeing much that points to a cause