Accumulo, mail # user - Determining the cause of a tablet server failure


Re: Determining the cause of a tablet server failure
Adam Fuchs 2013-02-27, 20:27
So, question for the community: inside bin/accumulo we have:
  -XX:OnOutOfMemoryError="kill -9 %p"
Should this also append a log message? Something like:
  -XX:OnOutOfMemoryError="kill -9 %p; echo 'ran out of memory' >> logfilename"
Is this necessary, or should the OutOfMemoryError still find its way to
the regular log?

Adam
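
A minimal sketch of what the logging variant could look like, with the log
line written before the kill, on the assumption that the JVM runs the
semicolon-separated commands in order and that nothing after "kill -9" gets a
chance to run. OOM_LOG and its default path are hypothetical names for
illustration, not something bin/accumulo defines:

```shell
# Hypothetical log location; adjust to wherever the tserver logs live.
OOM_LOG="${OOM_LOG:-/tmp/accumulo-oom.log}"

# Log first, then kill. %p is left literal here so the JVM can substitute its pid.
OOM_OPT="-XX:OnOutOfMemoryError=echo 'ran out of memory, pid %p' >> ${OOM_LOG}; kill -9 %p"

# Show the flag as it would be passed (as a single argument) to java.
printf '%s\n' "$OOM_OPT"
```

The whole option has to reach java as one argument, so it needs to stay
quoted wherever the launcher script expands it.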

On Wed, Feb 27, 2013 at 3:17 PM, Mike Hugo <[EMAIL PROTECTED]> wrote:

> I'm chalking this up to a mis-configured server.  It looks like during the
> install on this server the accumulo-env.sh file was copied from the
> examples, but rather than editing it to set the JAVA_HOME,
> HADOOP_HOME, and ZOOKEEPER_HOME, the entire file contents were replaced
> with those env variables.
>
> I'm assuming this caused us to pick up the default (?)  _OPTS settings
> rather than the correct ones we should have been getting based on our
> server memory capacity from the examples.  So we had a bunch of accumulo
> related java processes all running with memory settings that were way out
> of whack from what they should have been.
>
> To solve it I copied in the files from the conf/examples directory again
> and made sure everything was set up correctly and restarted everything.
>
> We never did see anything in our log files or .out / .err logs indicating
> the source of the problem, but the above is my best guess as to what was
> going on.
>
> Thanks again for all the tips and pointers!
>
> Mike
>
>
> On Wed, Feb 27, 2013 at 11:24 AM, Adam Fuchs <[EMAIL PROTECTED]> wrote:
>
>> There are a few primary reasons why your tablet server would die:
>> 1. Lost lock in Zookeeper. If the tablet server and zookeeper can't
>> communicate with each other then the lock will time out and the tablet
>> server will kill itself. This should show up as several messages in the
>> tserver log. If this happens when a tablet server is really busy (lots of
>> threads doing stuff) then the log message about the lost lock can be pretty
>> far back in the queue. Java garbage collection can cause long pauses that
>> inhibit the tserver/zookeeper messages. Zookeeper can also get overwhelmed
>> and behave poorly if the server it's running on swaps it out.
>> 2. Problems talking with the master. If a tablet server is too slow in
>> communicating with the master then the master will try to kill it. This
>> should show up in the master log, and also will be noted in the tserver log.
>> 3. Out of memory. If the tserver JVM runs out of memory it will
>> terminate. As John mentioned, this will be in the .err or .out files in the
>> log directory.
>>
>> Adam
>>
>>
>>
>> On Wed, Feb 27, 2013 at 12:10 PM, Mike Hugo <[EMAIL PROTECTED]> wrote:
>>
>>> After running an ingest process via map reduce for about an hour or so,
>>> one of our tservers fails.  It happens pretty consistently; we're able to
>>> replicate it without too much difficulty.
>>>
>>> I'm looking in the $ACCUMULO_HOME/logs directory for clues as to why the
>>> tserver fails, but I'm not seeing much that points to a cause of the
>>> tserver going offline.   One minute it's there, the next it's offline.
>>>  There are some warnings about the swappiness as well as a large row that
>>> cannot be split but other than that, not much else to go on.
>>>
>>> Is there anything that could help me figure out *why* the tserver died?
>>>  I'm guessing it's something in our client code or a config that's not
>>> correct on the server, but it'd be really nice to have a hint before we
>>> start randomly changing things to see what will fix it.
>>>
>>> Thanks,
>>>
>>> Mike
>>>
>>
>>
>
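
For the out-of-memory case discussed in this thread, a quick check of the
.out and .err files might look like the sketch below. It assumes the JVM's
standard "java.lang.OutOfMemoryError" text ends up in those files; the
function name scan_for_oom is made up for illustration and is not part of
Accumulo:

```shell
# scan_for_oom DIR: print the names of .out/.err files under DIR that
# mention OutOfMemoryError. Exits non-zero when nothing matches.
scan_for_oom() {
    log_dir="$1"
    # -l prints only matching file names; 2>/dev/null hides errors from
    # glob patterns that match no files.
    grep -l 'OutOfMemoryError' "$log_dir"/*.out "$log_dir"/*.err 2>/dev/null
}

# Example call (commented out so the snippet has no side effects):
# scan_for_oom "$ACCUMULO_HOME/logs"
```

An empty result only rules out a logged OutOfMemoryError; a lost ZooKeeper
lock or a kill from the master would show up in the tserver and master logs
instead.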