Accumulo, mail # user - Using Accumulo as input to a MapReduce job frequently hangs due to lost Zookeeper connection


Re: Using Accumulo as input to a MapReduce job frequently hangs due to lost Zookeeper connection
Jim Klucar 2012-08-16, 11:22
Just shooting from the hip here.

ZooKeeper's maxClientCnxns setting in zoo.cfg should be increased from the
default to something like 100. Check the ZooKeeper log file to see if it is
shutting down connections.
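
For reference, the setting is a plain key=value line in zoo.cfg, for example maxClientCnxns=100. As far as I know it caps concurrent connections per client IP address, which matters in pseudo-distributed mode because every client connects from localhost. Below is a minimal sketch for checking what a given zoo.cfg currently sets. The helper names read_zoo_cfg and max_client_cnxns are made up for illustration, and the path in the usage note is a placeholder for wherever your install keeps the file.

```python
def read_zoo_cfg(path):
    """Parse key=value lines from a zoo.cfg-style file into a dict."""
    settings = {}
    with open(path) as f:
        for line in f:
            line = line.strip()
            # Skip blank lines, comments, and anything that is not key=value.
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, value = line.split("=", 1)
            settings[key.strip()] = value.strip()
    return settings


def max_client_cnxns(path):
    """Return the configured maxClientCnxns as an int, or None if unset."""
    value = read_zoo_cfg(path).get("maxClientCnxns")
    return int(value) if value is not None else None
```

Calling max_client_cnxns("/path/to/zoo.cfg") returns None when the line is absent, in which case ZooKeeper falls back to its built-in default.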

Check what your max open files setting is for your OS with 'ulimit -n'
and increase it if necessary.
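
If it is easier to check from a script, the same limit can be read with Python's standard resource module (Unix only). This is only a sketch: it reports the limit for the Python process itself, which is inherited from the shell that launched it, so the ZooKeeper and Accumulo daemons may be running with different values. The function name open_file_limits is made up for illustration.

```python
import resource


def open_file_limits():
    """Return the (soft, hard) open-file-descriptor limits for this process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


if __name__ == "__main__":
    soft, hard = open_file_limits()
    # The soft limit is the value that 'ulimit -n' reports in a shell.
    print("soft limit:", soft)
    print("hard limit:", hard)
```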

On Aug 16, 2012, at 4:00 AM, Arjumand Bonhomme <[EMAIL PROTECTED]> wrote:

Hello,

I'm fairly new to both Accumulo and Hadoop, so I think my problem may be
due to poor configuration on my part, but I'm running out of ideas.

I'm running this on a Mac laptop, with Hadoop (hadoop-0.20.2 from cdh3u4)
in pseudo-distributed mode.
The ZooKeeper version is zookeeper-3.3.5 from cdh3u4.
I'm using the 1.4.1 release of Accumulo with a configuration copied from
"conf/examples/512MB/standalone".

I've got a Map task that is using an accumulo table as the input.
I'm fetching all rows, but just a single column family, that has hundreds
or even thousands of different column qualifiers.
The table has a SummingCombiner installed for the given column family.

The task runs fine at first, but after ~9-15K records (I print the record
count to the console every 1K records), it hangs and the following messages
are printed to the console where I'm running the job:
12/08/16 02:57:08 INFO zookeeper.ClientCnxn: Unable to read additional data
from server sessionid 0x1392cc35b460d1c, likely server has closed socket,
closing socket connection and attempting reconnect
12/08/16 02:57:08 INFO zookeeper.ClientCnxn: Opening socket connection to
server localhost/fe80:0:0:0:0:0:0:1%1:2181
12/08/16 02:57:08 INFO zookeeper.ClientCnxn: Socket connection established
to localhost/fe80:0:0:0:0:0:0:1%1:2181, initiating session
12/08/16 02:57:08 INFO zookeeper.ClientCnxn: Unable to reconnect to
ZooKeeper service, session 0x1392cc35b460d1c has expired, closing socket
connection
12/08/16 02:57:08 INFO zookeeper.ClientCnxn: EventThread shut down
12/08/16 02:57:10 INFO zookeeper.ZooKeeper: Initiating client connection,
connectString=localhost sessionTimeout=30000
watcher=org.apache.accumulo.core.zookeeper.ZooSession$AccumuloWatcher@32f5c51c
12/08/16 02:57:10 INFO zookeeper.ClientCnxn: Opening socket connection to
server localhost/0:0:0:0:0:0:0:1:2181
12/08/16 02:57:10 INFO zookeeper.ClientCnxn: Socket connection established
to localhost/0:0:0:0:0:0:0:1:2181, initiating session
12/08/16 02:57:10 INFO zookeeper.ClientCnxn: Session establishment complete
on server localhost/0:0:0:0:0:0:0:1:2181, sessionid = 0x1392cc35b460d25,
negotiated timeout = 30000
12/08/16 02:57:11 INFO mapred.LocalJobRunner:
12/08/16 02:57:14 INFO mapred.LocalJobRunner:
12/08/16 02:57:17 INFO mapred.LocalJobRunner:
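
To get a feel for how often sessions expire over a longer run, console output like the above can be scanned for the "has expired" messages. The following is a rough sketch, assuming the output has been saved to a text file or string. The function name expired_sessions is made up for illustration, and the pattern only covers the message wording shown in this post.

```python
import re

# Matches the ZooKeeper client message seen in the console output, e.g.
# "... session 0x1392cc35b460d1c has expired, closing socket connection"
EXPIRED = re.compile(r"session (0x[0-9a-f]+) has expired")


def expired_sessions(log_text):
    """Return the session ids reported as expired, in order of appearance."""
    # Long log lines may be wrapped (as in this email), so collapse all
    # whitespace into single spaces before matching.
    flat = " ".join(log_text.split())
    return EXPIRED.findall(flat)
```

Counting the returned ids against the elapsed job time gives a crude expiry rate to compare before and after a configuration change.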

Sometimes the messages contain a stack trace like the one below:
12/08/16 01:57:40 WARN zookeeper.ClientCnxn: Session 0x1392cc35b460b40 for
server localhost/fe80:0:0:0:0:0:0:1%1:2181, unexpected error, closing
socket connection and attempting reconnect
java.io.IOException: Connection reset by peer
    at sun.nio.ch.FileDispatcher.read0(Native Method)
    at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:21)
    at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:198)
    at sun.nio.ch.IOUtil.read(IOUtil.java:166)
    at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:245)
    at org.apache.zookeeper.ClientCnxn$SendThread.doIO(ClientCnxn.java:856)
    at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1154)
12/08/16 01:57:40 INFO zookeeper.ClientCnxn: Opening socket connection to
server localhost/127.0.0.1:2181
12/08/16 01:57:40 INFO zookeeper.ClientCnxn: Socket connection established
to localhost/127.0.0.1:2181, initiating session
12/08/16 01:57:40 INFO zookeeper.ClientCnxn: Unable to reconnect to
ZooKeeper service, session 0x1392cc35b460b40 has expired, closing socket
connection
12/08/16 01:57:40 INFO zookeeper.ClientCnxn: EventThread shut down
12/08/16 01:57:41 INFO zookeeper.ZooKeeper: Initiating client connection,
connectString=localhost sessionTimeout=30000
watcher=org.apache.accumulo.core.zookeeper.ZooSession$AccumuloWatcher@684a26e8
12/08/16 01:57:41 INFO zookeeper.ClientCnxn: Opening socket connection to
server localhost/fe80:0:0:0:0:0:0:1%1:2181
12/08/16 01:57:41 INFO zookeeper.ClientCnxn: Socket connection established
to localhost/fe80:0:0:0:0:0:0:1%1:2181, initiating session
12/08/16 01:57:41 INFO zookeeper.ClientCnxn: Session establishment complete
on server localhost/fe80:0:0:0:0:0:0:1%1:2181, sessionid 0x1392cc35b460b46,
negotiated timeout = 30000

I've poked through the logs in accumulo, and I've noticed that when it
hangs, the following is written to the "logger_HOSTNAME.debug.log" file:
16 03:29:46,332 [logger.LogService] DEBUG: event null None Disconnected
16 03:29:47,248 [zookeeper.ZooSession] DEBUG: Session expired, state of
current session : Expired
16 03:29:47,248 [logger.LogService] DEBUG: event null None Expired
16 03:29:47,249 [logger.LogService] WARN : Logger lost zookeeper
registration at null
16 03:29:47,452 [logger.LogService] INFO : Logger shutting down
16 03:29:47,453 [logger.LogWriter] INFO : Shutting down

I've noticed that if I make the map task print out the record count more
frequently (i.e., every 10 records), it seems to be able to get through more
records than when I only print every 1K records. My assumption was that
this had something to do with more time being spent in the map task, and
not fetching data from accumulo.  There was at least one occasion where I
printed to the console for every record, and in that situation it managed
to process 47K records, although I have been unable to repeat that behavior.

I've also noticed that if I stop and start accumulo, the map-reduce job
will pick up where it left off, but seems to fail more quickly.

Could someone make some suggestions as to what my problem might be? It
would be greatly appreciated.  If you need any additional information from
me, just let me know.  I'd paste my config files, driver setup, and example
data into this post, but I think