Accumulo >> mail # user >> Dead Tablet Server


Re: Dead Tablet Server
On Tue, Sep 17, 2013 at 10:23 AM, Ott, Charles H. <[EMAIL PROTECTED]> wrote:

> Forgive my ignorance with this, But I have not yet had a tablet failure
> that I have been able to recover without restarting the entire accumulo
> cluster.
>
> I have 3 Tablets, 2 Online, 1 dead.  Using Accumulo 1.4.3
>
> The tablet error reports:
>
> Uncaught exception in TabletServer.main, exiting
>          java.lang.RuntimeException: java.lang.RuntimeException: Too many retries, exiting.
>                  at org.apache.accumulo.server.tabletserver.TabletServer.announceExistence(TabletServer.java:2684)
>                  at org.apache.accumulo.server.tabletserver.TabletServer.run(TabletServer.java:2703)
>                  at org.apache.accumulo.server.tabletserver.TabletServer.main(TabletServer.java:3168)
>                  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>                  at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>                  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>                  at java.lang.reflect.Method.invoke(Method.java:597)
>                  at org.apache.accumulo.start.Main$1.run(Main.java:89)
>                  at java.lang.Thread.run(Thread.java:662)
>          Caused by: java.lang.RuntimeException: Too many retries, exiting.
>                  at org.apache.accumulo.server.tabletserver.TabletServer.announceExistence(TabletServer.java:2681)
>                  ... 8 more
>
Looking at the code, the tablet server couldn't obtain a lock for itself
(using its IP:port). I would start looking there. You could use zkCli.sh
provided by ZooKeeper and look in
/accumulo/${instance_id}/tservers/${ip}:${port} to see if there is another
server which already has the lock somehow.
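
As a rough sketch of that ZooKeeper check: the path layout is the one given above, while the instance id, addresses, and port are placeholders and the helper names are hypothetical. The script only builds the path and prints a zkCli.sh command you could run by hand; it does not talk to ZooKeeper itself.

```python
def tserver_lock_path(instance_id: str, ip: str, port: int) -> str:
    """Build the ZooKeeper path for one tablet server's lock node."""
    return f"/accumulo/{instance_id}/tservers/{ip}:{port}"


def zkcli_command(zk_server: str, path: str) -> str:
    """Return a zkCli.sh invocation that lists the children of `path`."""
    return f"zkCli.sh -server {zk_server} ls {path}"


if __name__ == "__main__":
    # Placeholder values (assumptions): substitute your own instance id,
    # tablet server address, and ZooKeeper host:port.
    path = tserver_lock_path("INSTANCE_ID", "10.0.0.5", 9997)
    print(zkcli_command("localhost:2181", path))
```

My understanding is that a held lock shows up as a child node under that path, so a non-empty listing while your tablet server process is down would be worth a closer look.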

> The recovery portion of the Admin guide says that recovery is performed by
> asking the loggers to copy their write-ahead logs into HDFS.  The logs are
> copied, sorted and then tablets can find missing updates.  Once complete
> the tablets involved should return to an ‘online’ state.
>
> I am not sure how to ask the loggers to copy their write-ahead logs into
> hdfs.  Is this the same as using the flush shell command?  If so, the flush
> command needs a pattern of tables or a table name.  Would I want to perform
> something like, ‘accumulo flush -p .+’ to flush all of the table data to
> HDFS?
>

You shouldn't have to do anything manually here. The loggers should handle
this for you as part of their normal operation. If you are missing WALs,
the most likely cause is that the logger process doesn't have enough memory
to perform the copy and sort. You can verify this by checking the
logger*.out file for an OutOfMemoryError (OOME).
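
If it helps, here is a minimal sketch of that check. It assumes the logger*.out files sit in a single log directory (often the logs directory under your Accumulo install, but that location is an assumption about your setup), and the function name is hypothetical.

```python
import glob
import os


def find_oom_lines(log_dir: str, pattern: str = "logger*.out"):
    """Return (file, line_number, text) for each line mentioning OutOfMemoryError."""
    hits = []
    for path in sorted(glob.glob(os.path.join(log_dir, pattern))):
        # errors="replace" keeps the scan going if a log has odd bytes in it.
        with open(path, errors="replace") as handle:
            for number, line in enumerate(handle, start=1):
                if "OutOfMemoryError" in line:
                    hits.append((path, number, line.rstrip()))
    return hits


if __name__ == "__main__":
    # ACCUMULO_LOG_DIR falling back to ./logs is an assumption; adjust as needed.
    for path, number, text in find_oom_lines(os.environ.get("ACCUMULO_LOG_DIR", "logs")):
        print(f"{path}:{number}: {text}")
```

No output means no OutOfMemoryError was found in those files, which would point away from logger memory as the cause.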
>
> Another concern is that the Tablet Server process was no longer running on
> the server.  I logged into that server and ran “start-here.sh”.  The tablet
> server is now running, but it is still reported as ‘dead’ to the monitor.
>

Can you determine from the monitor if that tablet server is actually
hosting tablets? 1.4.3 had a couple of bugs around the master not updating
its internal state for nodes in the failed state. Check the Tablet Server
page and see if there's an entry in the table of servers.
>
> Thanks in advance,
>
> Charles