Accumulo, mail # user - Using Accumulo as input to a MapReduce job frequently hangs due to lost Zookeeper connection


Re: Using Accumulo as input to a MapReduce job frequently hangs due to lost Zookeeper connection
John Vines 2012-08-16, 19:24
Your work on the clone table failed because you took it offline. The
table needs to be online in order to run a job against it.
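
For reference, bringing the clone back online first should let the job run
against it.  A minimal sketch with the Java TableOperations API (the class
name, instance name, zookeeper quorum, credentials, and the table name
"mytable_clone" below are all placeholders, and connector setup differs
slightly across Accumulo versions):

    import org.apache.accumulo.core.client.Connector;
    import org.apache.accumulo.core.client.Instance;
    import org.apache.accumulo.core.client.ZooKeeperInstance;

    public class BringCloneOnline {
      public static void main(String[] args) throws Exception {
        // placeholder instance name, zookeeper quorum, and credentials
        Instance inst = new ZooKeeperInstance("myinstance", "localhost:2181");
        Connector conn = inst.getConnector("root", "secret".getBytes());

        // the table must be online before a normal scan (or MR job) can read it
        conn.tableOperations().online("mytable_clone");
      }
    }

(Some Accumulo versions also let the input format scan an offline table
directly, but that is a separate, version-dependent option.)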

John

On Thu, Aug 16, 2012 at 2:36 PM, Arjumand Bonhomme <[EMAIL PROTECTED]> wrote:

> Jim, William, Adam: Thanks for your help with this; I'm running out of
> ideas to Google for answers.
>
> I'm going to try to answer your questions:
>
> 1) I had already adjusted maxClientCnxns in zoo.cfg to be 100.  I didn't
> see anything in the log indicating that it was shutting down connections
> due to reaching a connection limit.  However, throughout the zookeeper
> logs, even before the hadoop job was run, I did see lots of lines like the
> one below.  Up until this point, I've assumed it's innocuous and unrelated
> to my issue:
> INFO  [Thread-371:NIOServerCnxn@1435] - Closed socket connection for
> client /127.0.0.1:59158 (no session established for client)
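
For anyone following along, this is the relevant line in zoo.cfg (100 is the
value mentioned above; maxClientCnxns is a per-client-IP limit, and 0
disables it):

    # zoo.cfg - cap on concurrent connections from a single client IP
    maxClientCnxns=100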
>
> 2) I had also already adjusted the dfs.datanode.max.xcievers property to
> 4096 in hdfs-site.xml.  An inspection of the log shows that this limit is
> not being reached.  It had been reached at some point before I increased
> the value from the default; I raised it while troubleshooting, prior to
> posting to this list, but it didn't appear to have a noticeable effect on
> the behavior of the hadoop job.
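
The corresponding hdfs-site.xml entry, for reference (4096 is the value
mentioned above):

    <!-- hdfs-site.xml: limit on concurrent data transfer threads per datanode -->
    <property>
      <name>dfs.datanode.max.xcievers</name>
      <value>4096</value>
    </property>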
>
> 3) I'm writing out to a sequence file, so accumulo is only being used for
> input.  As a side note, at one point during my troubleshooting, I
> compacted, cloned, and then took the cloned table offline and tried to use
> that instead.  That failed immediately, without processing any records.
> From the stacktrace, it appeared as though the iterator was trying to use
> one of the files from the original table (from what I understand about
> cloning, this is normal because no changes had been made to the original
> table), but it said the file did not exist.  I was, however, able to find
> the file on hdfs, so I just gave up on that approach.  Also, in the normal
> case, i.e. using the original table online, nothing is writing to the
> table while the hadoop job is running.
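
For context, the compact/clone/offline sequence described above looks roughly
like this with the Java TableOperations API (a sketch only: the class name,
instance name, credentials, "mytable", and "mytable_clone" are placeholders,
and exact signatures vary a bit between Accumulo versions):

    import java.util.Collections;
    import org.apache.accumulo.core.client.Connector;
    import org.apache.accumulo.core.client.ZooKeeperInstance;
    import org.apache.hadoop.io.Text;

    public class CloneAndOffline {
      public static void main(String[] args) throws Exception {
        // placeholder instance name, zookeeper quorum, and credentials
        Connector conn = new ZooKeeperInstance("myinstance", "localhost:2181")
            .getConnector("root", "secret".getBytes());

        // flush and major-compact the source table over its full range
        conn.tableOperations().compact("mytable", (Text) null, (Text) null, true, true);

        // clone it (flushing first), with no per-table property overrides
        conn.tableOperations().clone("mytable", "mytable_clone", true,
            Collections.<String, String>emptyMap(), Collections.<String>emptySet());

        // take the clone offline
        conn.tableOperations().offline("mytable_clone");
      }
    }

The Accumulo shell's compact, clonetable, and offline commands do the same
thing interactively.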
>
> 4) My original open file limit was the OS default of 256, so I upped it
> to 1024 and made another attempt.  The behavior was the same as before.
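
For reference, the per-process limit can be checked and raised from the shell
before relaunching the processes (1024 matches the value above; making the
change permanent is OS-specific):

    # check the current soft limit on open files, then raise it for this shell
    ulimit -n
    ulimit -n 1024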
>
> I'm including a snippet from the tserver debug log.  It looks like an
> expired session might be the root of the problem, but I'm not sure what
> would cause that:
> 16 13:47:26,284 [tabletserver.TabletServer] DEBUG: ScanSess tid
> 127.0.0.1:49730 !0 0 entries in 0.00 secs, nbTimes = [1 1 1.00 1]
> 16 13:47:26,619 [tabletserver.TabletServer] DEBUG: UpSess 127.0.0.1:49719
> 45 in 0.006s, at=[0 0 0.00 1] ft=0.004s(pt=0.001s lt=0.002s ct=0.001s)
> 16 13:47:31,317 [tabletserver.TabletServer] DEBUG: ScanSess tid
> 127.0.0.1:49740 !0 0 entries in 0.00 secs, nbTimes = [1 1 1.00 1]
> 16 13:47:36,350 [tabletserver.TabletServer] DEBUG: ScanSess tid
> 127.0.0.1:49750 !0 0 entries in 0.00 secs, nbTimes = [1 1 1.00 1]
> 16 13:47:41,377 [tabletserver.TabletServer] DEBUG: ScanSess tid
> 127.0.0.1:49760 !0 0 entries in 0.00 secs, nbTimes = [1 1 1.00 1]
> 16 13:47:46,278 [tabletserver.TabletServer] DEBUG: ScanSess tid
> 127.0.0.1:49771 !0 4 entries in 0.02 secs, nbTimes = [17 17 17.00 1]
> 16 13:47:46,305 [tabletserver.TabletServer] DEBUG: ScanSess tid
> 127.0.0.1:49771 !0 2 entries in 0.01 secs, nbTimes = [14 14 14.00 1]
> 16 13:47:46,406 [tabletserver.TabletServer] DEBUG: ScanSess tid
> 127.0.0.1:49782 !0 0 entries in 0.00 secs, nbTimes = [1 1 1.00 1]
> 16 13:47:47,487 [tabletserver.TabletServer] DEBUG: gc ParNew=0.07(+0.01)
> secs ConcurrentMarkSweep=0.00(+0.00) secs freemem=111,234,160(+14,477,728)
> totalmem=132,055,040
> 16 13:47:51,452 [tabletserver.TabletServer] DEBUG: ScanSess tid
> 127.0.0.1:50970 !0 0 entries in 0.00 secs, nbTimes = [2 2 2.00 1]
> 16 13:47:51,462 [tabletserver.TabletServer] DEBUG: ScanSess tid
> 127.0.0.1:50970 !0 8 entries in 0.00 secs, nbTimes = [3 3 3.00 1]
> 16 13:47:51,474 [tabletserver.TabletServer] DEBUG: ScanSess tid