Accumulo >> mail # user >> Using Accumulo as input to a MapReduce job frequently hangs due to lost Zookeeper connection


Re: Using Accumulo as input to a MapReduce job frequently hangs due to lost Zookeeper connection
Your work on the clone table failed because you took it offline. The
table needs to be online in order to run a job against it.
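To use the clone at all, it first has to be brought back online; a minimal sketch against the Java client API, assuming hypothetical instance, zookeeper, credential, and table names (a live Accumulo cluster is required for this to run):

```java
import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.Instance;
import org.apache.accumulo.core.client.ZooKeeperInstance;
import org.apache.accumulo.core.client.admin.TableOperations;

public class OnlineClone {
    public static void main(String[] args) throws Exception {
        // Hypothetical instance name and zookeeper quorum
        Instance inst = new ZooKeeperInstance("myInstance", "localhost:2181");
        // Hypothetical credentials
        Connector conn = inst.getConnector("root", "secret".getBytes());

        // Bring the offline clone back online before running a job against it
        TableOperations ops = conn.tableOperations();
        ops.online("myTableClone");
    }
}
```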

John

On Thu, Aug 16, 2012 at 2:36 PM, Arjumand Bonhomme <[EMAIL PROTECTED]> wrote:

> Jim, William, Adam: Thanks for your help with this, I'm running out of
> ideas to google search for answers.
>
> I'm going to try to answer your questions:
>
> 1) I had already adjusted maxClientCnxns in zoo.cfg to be 100.  I
> didn't see anything in the log indicating that it was shutting down
> connections due to reaching a connection limit.  However, throughout the
> zookeeper logs, even before the hadoop job was run, I did see lots of
> lines like the following.  Up to this point, I've assumed it's innocuous
> and unrelated to my issue.
> INFO  [Thread-371:NIOServerCnxn@1435] - Closed socket connection for
> client /127.0.0.1:59158 (no session established for client)
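For reference, the connection limit in point 1 is a per-client-IP cap set in zoo.cfg; a sketch of the relevant line (the exact default varies by ZooKeeper release, and was as low as 10 in the 3.3 line):

```
# zoo.cfg: maximum concurrent connections a single client IP may hold
# (0 = unlimited). Raised from the low default to 100 here.
maxClientCnxns=100
```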
>
> 2) I had also already adjusted the dfs.datanode.max.xcievers property to
> 4096 in hdfs-site.xml.  An investigation of the log shows that this limit
> is not being reached now, though it had been reached at some point before
> I increased the value from the default.  I made that change while
> troubleshooting, but prior to posting to this list; it didn't appear to
> have a noticeable effect on the behavior of the hadoop job.
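The datanode transceiver setting from point 2 lives in hdfs-site.xml; a sketch of the property as it would appear there (the property name really is spelled "xcievers" in Hadoop of this era):

```
<property>
  <name>dfs.datanode.max.xcievers</name>
  <value>4096</value>
  <!-- Upper bound on concurrent DataNode transceiver threads; the
       historical default of 256 is easily exhausted by Accumulo/HBase
       workloads with many open files. -->
</property>
```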
>
> 3) I'm writing out to a sequence file, so accumulo is only being used for
> input.  As a side note, at one point during my troubleshooting, I
> compacted, cloned, and then took the cloned table offline and tried to use
> that instead.  That failed immediately, without processing any records.
>  From the stacktrace, it appeared as though the iterator was trying to use
> one of the files for the original table (from what I understand about
> cloning, this is normal b/c no changes had been made to the original table)
> but said it did not exist.  I was, however, able to find the file on hdfs.
>  So I just gave up on that.  Also, under the normal case, ie using the
> original table online, nothing is writing to the table while the hadoop job
> is running.
>
> 4) My original open file limit was the os default of 256.  So I upped it
> to 1024, and performed another attempt.  The behavior was the same as
> before.
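The open file limit from point 4 is the shell's ulimit; a quick sketch of checking and raising the soft limit for the current session (a permanent change typically goes through /etc/security/limits.conf or the OS's launch configuration, depending on platform):

```shell
# Show the current soft limit on open file descriptors
ulimit -Sn

# Raise it for this shell and its children; cannot exceed the hard
# limit, which `ulimit -Hn` reports
ulimit -Sn 1024
ulimit -Sn
```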
>
> I'm including a snippet from the tserver debug log.  It looks like an
> expired session might be the root of the problem, but I'm not sure what
> would cause that:
> 16 13:47:26,284 [tabletserver.TabletServer] DEBUG: ScanSess tid
> 127.0.0.1:49730 !0 0 entries in 0.00 secs, nbTimes = [1 1 1.00 1]
> 16 13:47:26,619 [tabletserver.TabletServer] DEBUG: UpSess 127.0.0.1:49719
> 45 in 0.006s, at=[0 0 0.00 1] ft=0.004s(pt=0.001s lt=0.002s ct=0.001s)
> 16 13:47:31,317 [tabletserver.TabletServer] DEBUG: ScanSess tid
> 127.0.0.1:49740 !0 0 entries in 0.00 secs, nbTimes = [1 1 1.00 1]
> 16 13:47:36,350 [tabletserver.TabletServer] DEBUG: ScanSess tid
> 127.0.0.1:49750 !0 0 entries in 0.00 secs, nbTimes = [1 1 1.00 1]
> 16 13:47:41,377 [tabletserver.TabletServer] DEBUG: ScanSess tid
> 127.0.0.1:49760 !0 0 entries in 0.00 secs, nbTimes = [1 1 1.00 1]
> 16 13:47:46,278 [tabletserver.TabletServer] DEBUG: ScanSess tid
> 127.0.0.1:49771 !0 4 entries in 0.02 secs, nbTimes = [17 17 17.00 1]
> 16 13:47:46,305 [tabletserver.TabletServer] DEBUG: ScanSess tid
> 127.0.0.1:49771 !0 2 entries in 0.01 secs, nbTimes = [14 14 14.00 1]
> 16 13:47:46,406 [tabletserver.TabletServer] DEBUG: ScanSess tid
> 127.0.0.1:49782 !0 0 entries in 0.00 secs, nbTimes = [1 1 1.00 1]
> 16 13:47:47,487 [tabletserver.TabletServer] DEBUG: gc ParNew=0.07(+0.01)
> secs ConcurrentMarkSweep=0.00(+0.00) secs freemem=111,234,160(+14,477,728)
> totalmem=132,055,040
> 16 13:47:51,452 [tabletserver.TabletServer] DEBUG: ScanSess tid
> 127.0.0.1:50970 !0 0 entries in 0.00 secs, nbTimes = [2 2 2.00 1]
> 16 13:47:51,462 [tabletserver.TabletServer] DEBUG: ScanSess tid
> 127.0.0.1:50970 !0 8 entries in 0.00 secs, nbTimes = [3 3 3.00 1]
> 16 13:47:51,474 [tabletserver.TabletServer] DEBUG: ScanSess tid