Accumulo >> mail # user >> Determining the cause of a tablet server failure


Re: Determining the cause of a tablet server failure
There are a few primary reasons why your tablet server would die:
1. Lost lock in ZooKeeper. If the tablet server and ZooKeeper can't
communicate with each other, the lock will time out and the tablet
server will kill itself. This should show up as several messages in the
tserver log. If it happens while a tablet server is really busy (lots of
threads doing work), the log message about the lost lock can be pretty
far back in the log output. Java garbage collection can cause long pauses
that delay the tserver/ZooKeeper messages. ZooKeeper can also get
overwhelmed and behave poorly if the server it's running on swaps it out.
2. Problems talking with the master. If a tablet server is too slow in
communicating with the master, the master will try to kill it. This
should show up in the master log and will also be noted in the tserver log.
3. Out of memory. If the tserver JVM runs out of memory it will terminate.
As John mentioned, this will be in the .err or .out files in the log
directory.
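
To make the checks above concrete, here is a minimal sketch of a script
that scans every file in a log directory for signs of the three causes.
The search patterns are assumptions, not the exact wording Accumulo logs
(that varies by version), so adjust them to what your logs actually
contain. Only "OutOfMemoryError" is a standard JVM string. The names
SIGNATURES and scan_logs are made up for this example.

```python
import re
from pathlib import Path

# Hypothetical search patterns: the real log wording depends on the
# Accumulo version, so treat these as starting points only.
SIGNATURES = {
    "lost_zookeeper_lock": re.compile(r"lost.*lock|lock.*lost", re.IGNORECASE),
    "killed_by_master": re.compile(r"\b(halt|kill)", re.IGNORECASE),
    "out_of_memory": re.compile(r"OutOfMemoryError"),
}


def scan_logs(log_dir, signatures=SIGNATURES):
    """Return {cause: [(file name, line number, line text), ...]}."""
    hits = {name: [] for name in signatures}
    # Look at every regular file so .log, .err and .out are all covered.
    for path in sorted(Path(log_dir).iterdir()):
        if not path.is_file():
            continue
        with path.open(errors="replace") as handle:
            for lineno, line in enumerate(handle, start=1):
                for name, pattern in signatures.items():
                    if pattern.search(line):
                        hits[name].append((path.name, lineno, line.rstrip()))
    return hits
```

Calling scan_logs() on the directory that holds your tserver and master
logs returns the matching lines grouped by suspected cause. An empty
result only means none of the assumed patterns matched, not that the
cause is something else.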

Adam

On Wed, Feb 27, 2013 at 12:10 PM, Mike Hugo <[EMAIL PROTECTED]> wrote:

> After running an ingest process via map reduce for about an hour or so,
> one of our tservers fails.  It happens pretty consistently; we're able to
> replicate it without too much difficulty.
>
> I'm looking in the $ACCUMULO_HOME/logs directory for clues as to why the
> tserver fails, but I'm not seeing much that points to a cause of the
> tserver going offline.   One minute it's there, the next it's offline.
>  There are some warnings about the swappiness as well as a large row that
> cannot be split, but other than that, not much else to go on.
>
> Is there anything that could help me figure out *why* the tserver died?
>  I'm guessing it's something in our client code or a config that's not
> correct on the server, but it'd be really nice to have a hint before we
> start randomly changing things to see what will fix it.
>
> Thanks,
>
> Mike
>