Arjumand Bonhomme
2012-08-16, 07:59
Jim Klucar
2012-08-16, 11:22
William Slacum
2012-08-16, 11:24
Adam Fuchs
2012-08-16, 13:32
Arjumand Bonhomme
2012-08-16, 18:36
John Vines
2012-08-16, 19:24
Arjumand Bonhomme
2012-08-16, 19:48
Arjumand Bonhomme
2012-08-17, 02:10
David Medinets
2012-08-17, 02:33
Arjumand Bonhomme
2012-08-17, 03:14
Arjumand Bonhomme
2012-08-20, 17:00
Keith Turner
2012-08-20, 17:34
David Medinets
2012-08-21, 00:26
Keith Turner
2012-08-21, 12:23
ameet kini
2012-10-10, 14:22
Billie Rinaldi
2012-10-11, 18:57
ameet kini
2012-10-17, 14:10
ameet kini
2012-10-17, 14:13
-
Using Accumulo as input to a MapReduce job frequently hangs due to lost Zookeeper connection
Arjumand Bonhomme 2012-08-16, 07:59
Hello,

I'm fairly new to both Accumulo and Hadoop, so I think my problem may be due to poor configuration on my part, but I'm running out of ideas.

I'm running this on a Mac laptop, with Hadoop (hadoop-0.20.2 from cdh3u4) in pseudo-distributed mode.
ZooKeeper version: zookeeper-3.3.5 from cdh3u4.
I'm using the 1.4.1 release of Accumulo with a configuration copied from "conf/examples/512MB/standalone".

I've got a Map task that is using an Accumulo table as the input. I'm fetching all rows, but just a single column family, which has hundreds or even thousands of different column qualifiers. The table has a SummingCombiner installed for the given column family.

The task runs fine at first, but after ~9-15K records (I print the record count to the console every 1K records), it hangs and the following messages are printed to the console where I'm running the job:

12/08/16 02:57:08 INFO zookeeper.ClientCnxn: Unable to read additional data from server sessionid 0x1392cc35b460d1c, likely server has closed socket, closing socket connection and attempting reconnect
12/08/16 02:57:08 INFO zookeeper.ClientCnxn: Opening socket connection to server localhost/fe80:0:0:0:0:0:0:1%1:2181
12/08/16 02:57:08 INFO zookeeper.ClientCnxn: Socket connection established to localhost/fe80:0:0:0:0:0:0:1%1:2181, initiating session
12/08/16 02:57:08 INFO zookeeper.ClientCnxn: Unable to reconnect to ZooKeeper service, session 0x1392cc35b460d1c has expired, closing socket connection
12/08/16 02:57:08 INFO zookeeper.ClientCnxn: EventThread shut down
12/08/16 02:57:10 INFO zookeeper.ZooKeeper: Initiating client connection, connectString=localhost sessionTimeout=30000 watcher=org.apache.accumulo.core.zookeeper.ZooSession$AccumuloWatcher@32f5c51c
12/08/16 02:57:10 INFO zookeeper.ClientCnxn: Opening socket connection to server localhost/0:0:0:0:0:0:0:1:2181
12/08/16 02:57:10 INFO zookeeper.ClientCnxn: Socket connection established to localhost/0:0:0:0:0:0:0:1:2181, initiating session
12/08/16 02:57:10 INFO zookeeper.ClientCnxn: Session establishment complete on server localhost/0:0:0:0:0:0:0:1:2181, sessionid = 0x1392cc35b460d25, negotiated timeout = 30000
12/08/16 02:57:11 INFO mapred.LocalJobRunner:
12/08/16 02:57:14 INFO mapred.LocalJobRunner:
12/08/16 02:57:17 INFO mapred.LocalJobRunner:

Sometimes the messages contain a stack trace like the one below:

12/08/16 01:57:40 WARN zookeeper.ClientCnxn: Session 0x1392cc35b460b40 for server localhost/fe80:0:0:0:0:0:0:1%1:2181, unexpected error, closing socket connection and attempting reconnect
java.io.IOException: Connection reset by peer
    at sun.nio.ch.FileDispatcher.read0(Native Method)
    at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:21)
    at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:198)
    at sun.nio.ch.IOUtil.read(IOUtil.java:166)
    at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:245)
    at org.apache.zookeeper.ClientCnxn$SendThread.doIO(ClientCnxn.java:856)
    at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1154)
12/08/16 01:57:40 INFO zookeeper.ClientCnxn: Opening socket connection to server localhost/127.0.0.1:2181
12/08/16 01:57:40 INFO zookeeper.ClientCnxn: Socket connection established to localhost/127.0.0.1:2181, initiating session
12/08/16 01:57:40 INFO zookeeper.ClientCnxn: Unable to reconnect to ZooKeeper service, session 0x1392cc35b460b40 has expired, closing socket connection
12/08/16 01:57:40 INFO zookeeper.ClientCnxn: EventThread shut down
12/08/16 01:57:41 INFO zookeeper.ZooKeeper: Initiating client connection, connectString=localhost sessionTimeout=30000 watcher=org.apache.accumulo.core.zookeeper.ZooSession$AccumuloWatcher@684a26e8
12/08/16 01:57:41 INFO zookeeper.ClientCnxn: Opening socket connection to server localhost/fe80:0:0:0:0:0:0:1%1:2181
12/08/16 01:57:41 INFO zookeeper.ClientCnxn: Socket connection established to localhost/fe80:0:0:0:0:0:0:1%1:2181, initiating session
12/08/16 01:57:41 INFO zookeeper.ClientCnxn: Session establishment complete on server localhost/fe80:0:0:0:0:0:0:1%1:2181, sessionid 0x1392cc35b460b46, negotiated timeout = 30000

I've poked through the Accumulo logs, and I've noticed that when it hangs, the following is written to the "logger_HOSTNAME.debug.log" file:

16 03:29:46,332 [logger.LogService] DEBUG: event null None Disconnected
16 03:29:47,248 [zookeeper.ZooSession] DEBUG: Session expired, state of current session : Expired
16 03:29:47,248 [logger.LogService] DEBUG: event null None Expired
16 03:29:47,249 [logger.LogService] WARN : Logger lost zookeeper registration at null
16 03:29:47,452 [logger.LogService] INFO : Logger shutting down
16 03:29:47,453 [logger.LogWriter] INFO : Shutting down

I've noticed that if I make the map task print out the record count more frequently (i.e. every 10 records), it seems to be able to get through more records than when I only print every 1K records. My assumption was that this had something to do with more time being spent in the map task, and less in fetching data from Accumulo. There was at least one occasion where I printed to the console for every record, and in that situation it managed to process 47K records, although I have been unable to repeat that behavior.

I've also noticed that if I stop and start Accumulo, the MapReduce job will pick up where it left off, but seems to fail more quickly.

Could someone make some suggestions as to what my problem might be? It would be greatly appreciated. If you need any additional information from me, just let me know. I'd paste my config files, driver setup, and example data into this post, but I think it's probably long enough already.

Thanks in advance,
-Arjumand
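
When triaging console output like the above, it can help to pull the expired session ids out of the client log so they can be matched against the ZooKeeper server log. The following is only a minimal sketch: the class and method names are hypothetical, and the pattern assumes nothing beyond the "session 0x... has expired" wording quoted in this message.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Accumulo, Hadoop, or ZooKeeper.
class ExpiredSessionScan {
    // Matches the client message quoted above, e.g.
    // "... session 0x1392cc35b460d1c has expired, closing socket connection"
    private static final Pattern EXPIRED =
            Pattern.compile("session (0x[0-9a-fA-F]+) has expired");

    /** Returns the session ids reported as expired, in order of appearance. */
    static List<String> expiredSessions(Iterable<String> logLines) {
        List<String> ids = new ArrayList<>();
        for (String line : logLines) {
            Matcher m = EXPIRED.matcher(line);
            while (m.find()) {
                ids.add(m.group(1));
            }
        }
        return ids;
    }

    public static void main(String[] args) {
        // Two lines copied from the console output in this message.
        List<String> sample = Arrays.asList(
                "12/08/16 02:57:08 INFO zookeeper.ClientCnxn: Unable to reconnect to ZooKeeper service, "
                        + "session 0x1392cc35b460d1c has expired, closing socket connection",
                "12/08/16 02:57:08 INFO zookeeper.ClientCnxn: EventThread shut down");
        System.out.println(expiredSessions(sample));
    }
}
```

The extracted ids can then be searched for in the ZooKeeper server log to see what the server recorded for the same sessions.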
-
Re: Using Accumulo as input to a MapReduce job frequently hangs due to lost Zookeeper connection
Jim Klucar 2012-08-16, 11:22
Just shooting from the hip here.

Zookeeper maxclientcxns in zoo.cfg should be increased from the default to something like 100. Check the zookeeper log file to see if it is shutting down connections.

Check what your max open files setting is for your OS with 'ulimit -n' and increase it if necessary.
-
Re: Using Accumulo as input to a MapReduce job frequently hangs due to lost Zookeeper connection
William Slacum 2012-08-16, 11:24
What does your TServer debug log say? Also, are you writing back out to Accumulo?

To follow up on what Jim said, you can check the ZooKeeper log to see if max connections is being hit. You may also want to check what your max xceivers is set to for HDFS, and check your Accumulo and HDFS logs to see if it is mentioned.
-
Re: Using Accumulo as input to a MapReduce job frequently hangs due to lost Zookeeper connection
Adam Fuchs 2012-08-16, 13:32
That was going to be my suggestion as well, except the zookeeper property is maxclientcnxns.

Cheers,
Adam
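
A quick way to confirm what zoo.cfg actually sets is sketched below. It assumes zoo.cfg is in Java properties format and that the key is spelled "maxClientCnxns" (with that capitalization), as in the ZooKeeper administrator's guide. The class name and the -1 "not set" convention are hypothetical. The lookup is deliberately exact, so a misspelled key is reported as not set rather than silently accepted.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Hypothetical helper, not part of ZooKeeper.
class ZooCfgCheck {
    /**
     * Returns the value of maxClientCnxns from the given zoo.cfg text,
     * or -1 if that exact key is not present.
     */
    static int maxClientCnxns(String zooCfgText) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(zooCfgText));
        } catch (IOException e) {
            // Not expected when reading from an in-memory string.
            throw new UncheckedIOException(e);
        }
        String value = props.getProperty("maxClientCnxns");
        return value == null ? -1 : Integer.parseInt(value.trim());
    }

    public static void main(String[] args) {
        // Illustrative file contents; the values here are assumptions.
        String sample = "tickTime=2000\nclientPort=2181\nmaxClientCnxns=100\n";
        System.out.println(maxClientCnxns(sample));
    }
}
```

In practice the text would be read from the zoo.cfg that the running ZooKeeper server was started with, which is worth double-checking when several ZooKeeper installs exist on one machine.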
-
Re: Using Accumulo as input to a MapReduce job frequently hangs due to lost Zookeeper connection
Arjumand Bonhomme 2012-08-16, 18:36
Jim, William, Adam: Thanks for your help with this, I'm running out of ideas to google search for answers.

I'm going to try to answer your questions:

1) I had already adjusted the maxclientcxns in zoo.cfg to be 100. I didn't see anything in the log indicating that it was shutting down connections due to reaching a connection limit. However, throughout the ZooKeeper logs, even before the Hadoop job was run, I did see lots of lines like the one below. Until now, I've assumed they're innocuous and unrelated to my issue.

INFO [Thread-371:NIOServerCnxn@1435] - Closed socket connection for client /127.0.0.1:59158 (no session established for client)

2) I had also already adjusted the dfs.datanode.max.xcievers property to 4096 in hdfs-site.xml. An investigation of the log shows that this limit is not being reached. It was being reached at some point before I increased the value from the default. I increased the value while troubleshooting, but prior to posting to this list; it didn't appear to have a noticeable effect on the behavior of the Hadoop job.

3) I'm writing out to a sequence file, so Accumulo is only being used for input. As a side note, at one point during my troubleshooting, I compacted, cloned, and then took the cloned table offline and tried to use that instead. That failed immediately, without processing any records. From the stack trace, it appeared as though the iterator was trying to use one of the files for the original table (from what I understand about cloning, this is normal because no changes had been made to the original table) but said it did not exist. I was, however, able to find the file on HDFS. So I just gave up on that. Also, in the normal case, i.e. using the original table online, nothing is writing to the table while the Hadoop job is running.

4) My original open file limit was the OS default of 256. So I upped it to 1024, and performed another attempt. The behavior was the same as before.

I'm including a snippet from the tserver debug log. It looks like an expired session might be the root of the problem, but I'm not sure what would cause that:

16 13:47:26,284 [tabletserver.TabletServer] DEBUG: ScanSess tid 127.0.0.1:49730 !0 0 entries in 0.00 secs, nbTimes = [1 1 1.00 1]
16 13:47:26,619 [tabletserver.TabletServer] DEBUG: UpSess 127.0.0.1:4971945 in 0.006s, at=[0 0 0.00 1] ft=0.004s(pt=0.001s lt=0.002s ct=0.001s)
16 13:47:31,317 [tabletserver.TabletServer] DEBUG: ScanSess tid 127.0.0.1:49740 !0 0 entries in 0.00 secs, nbTimes = [1 1 1.00 1]
16 13:47:36,350 [tabletserver.TabletServer] DEBUG: ScanSess tid 127.0.0.1:49750 !0 0 entries in 0.00 secs, nbTimes = [1 1 1.00 1]
16 13:47:41,377 [tabletserver.TabletServer] DEBUG: ScanSess tid 127.0.0.1:49760 !0 0 entries in 0.00 secs, nbTimes = [1 1 1.00 1]
16 13:47:46,278 [tabletserver.TabletServer] DEBUG: ScanSess tid 127.0.0.1:49771 !0 4 entries in 0.02 secs, nbTimes = [17 17 17.00 1]
16 13:47:46,305 [tabletserver.TabletServer] DEBUG: ScanSess tid 127.0.0.1:49771 !0 2 entries in 0.01 secs, nbTimes = [14 14 14.00 1]
16 13:47:46,406 [tabletserver.TabletServer] DEBUG: ScanSess tid 127.0.0.1:49782 !0 0 entries in 0.00 secs, nbTimes = [1 1 1.00 1]
16 13:47:47,487 [tabletserver.TabletServer] DEBUG: gc ParNew=0.07(+0.01) secs ConcurrentMarkSweep=0.00(+0.00) secs freemem=111,234,160(+14,477,728) totalmem=132,055,040
16 13:47:51,452 [tabletserver.TabletServer] DEBUG: ScanSess tid 127.0.0.1:50970 !0 0 entries in 0.00 secs, nbTimes = [2 2 2.00 1]
16 13:47:51,462 [tabletserver.TabletServer] DEBUG: ScanSess tid 127.0.0.1:50970 !0 8 entries in 0.00 secs, nbTimes = [3 3 3.00 1]
16 13:47:51,474 [tabletserver.TabletServer] DEBUG: ScanSess tid 127.0.0.1:50977 !0 0 entries in 0.01 secs, nbTimes = [2 2 2.00 1]
16 13:47:51,477 [tabletserver.TabletServer] DEBUG: ScanSess tid 127.0.0.1:50970 !0 26 entries in 0.01 secs, nbTimes = [11 11 11.00 1]
16 13:47:51,494 [tabletserver.LargestFirstMemoryManager] DEBUG: IDLE minor compaction chosen
16 13:47:51,495 [tabletserver.LargestFirstMemoryManager] DEBUG: COMPACTING !0<;~ total = 120,840 ingestMemory = 120,840
16 13:47:51,495 [tabletserver.LargestFirstMemoryManager] DEBUG: chosenMem 2,416 chosenIT = 300.23 load 3,044
16 13:47:51,498 [tabletserver.Tablet] DEBUG: MinC initiate lock 0.00 secs
16 13:47:51,502 [tabletserver.MinorCompactor] DEBUG: Begin minor compaction /accumulo/tables/!0/default_tablet/F000051i.rf_tmp !0<;~
16 13:47:51,525 [tabletserver.TabletServer] DEBUG: ScanSess tid 127.0.0.1:50970 !0 1 entries in 0.00 secs, nbTimes = [2 2 2.00 1]
16 13:47:51,532 [tabletserver.TabletServer] DEBUG: ScanSess tid 127.0.0.1:50970 !0 1 entries in 0.00 secs, nbTimes = [3 3 3.00 1]
16 13:47:51,538 [tabletserver.TabletServer] DEBUG: ScanSess tid 127.0.0.1:50970 !0 0 entries in 0.00 secs, nbTimes = [2 2 2.00 1]
16 13:47:51,750 [tabletserver.LargestFirstMemoryManager] DEBUG: BEFORE compactionThreshold = 0.605 maxObserved = 120,840
16 13:47:51,750 [tabletserver.LargestFirstMemoryManager] DEBUG: AFTER compactionThreshold = 0.666
16 13:47:51,942 [tabletserver.Compactor] DEBUG: Compaction !0<;~ 24 read | 4 written | 4,800 entries/sec | 0.005 secs
16 13:47:51,950 [tabletserver.Tablet] DEBUG: Logs for memory compacted: !0<;~ 127.0.0.1:11224/770e1b91-351a-45ac-8992-c0fb602ac51c
16 13:47:51,956 [log.TabletServerLogger] DEBUG: wrote MinC finish 35: writeTime:1ms
16 13:47:51,956 [tabletserver.Tablet] TABLET_HIST: !0<;~ MinC [memory] -> /default_tablet/F000051i.rf
16 13:47:51,957 [tabletserver.Tablet] DEBUG: MinC finish lock 0.00 secs !0<;~
16 13:47:52,650 [tabletserver.TabletServer] DEBUG: UpSess 127.0.0.1:5148043 in 0.009s, at=[0 0 0.00 1] ft=0.005s(pt=0.001s lt=0.003s ct=0.001s)
16 13:47:53,233 [cache.LruBlockCache] DEBUG: Cache Stats: Sizes: Total=0.0769043MB (80640), Free=19.923096MB (20890880), Max=20.0MB (20971520), Counts: Blocks=33, Access=43, Hit=10, Miss=33, Evictions=0, Evicted=0, Ratios: Hit Ratio=23.255814611911774%, Miss Ratio=76.74418687820435%, Evicted/Run=NaN, Duplicate Reads
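
The hdfs-site.xml setting mentioned in point 2 can be read back with a short sketch like the following. It assumes the usual Hadoop layout of <property> elements with <name> and <value> children. The class and method names are hypothetical, and the sketch parses the XML directly rather than using Hadoop's own Configuration class.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper, not part of Hadoop.
class HdfsSiteCheck {
    /** Returns the value of the named property, or null if the file does not set it. */
    static String propertyValue(String siteXml, String name) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(siteXml.getBytes(StandardCharsets.UTF_8)));
            NodeList props = doc.getElementsByTagName("property");
            for (int i = 0; i < props.getLength(); i++) {
                Element prop = (Element) props.item(i);
                if (name.equals(childText(prop, "name"))) {
                    return childText(prop, "value");
                }
            }
            return null;
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse configuration XML", e);
        }
    }

    // Text of the first descendant element with the given tag, or null if there is none.
    private static String childText(Element parent, String tag) {
        NodeList nodes = parent.getElementsByTagName(tag);
        return nodes.getLength() == 0 ? null : nodes.item(0).getTextContent().trim();
    }

    public static void main(String[] args) {
        // Sample content mirroring the setting described in point 2.
        String sample = "<configuration><property>"
                + "<name>dfs.datanode.max.xcievers</name><value>4096</value>"
                + "</property></configuration>";
        System.out.println(propertyValue(sample, "dfs.datanode.max.xcievers"));
    }
}
```

A null result means the file does not set the property, in which case the datanode would be running with its built-in default.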
-
Re: Using Accumulo as input to a MapReduce job frequently hangs due to lost Zookeeper connection
John Vines 2012-08-16, 19:24
Your work on the cloned table failed because you took it offline. The table needs to be online in order to run a job against it.

John
-
Re: Using Accumulo as input to a MapReduce job frequently hangs due to lost Zookeeper connection
Arjumand Bonhomme 2012-08-16, 19:48
This isn't my original concern, so it's not terribly important, but I'm not sure I understand, and I'd like to learn as much as possible.

Why wouldn't it work with the cloned table offline? I followed the example laid out in org.apache.accumulo.examples.simple.mapreduce.UniqueColumns, using the "AccumuloInputFormat.setScanOffline(job.getConfiguration(), true);" setup.

Can you give me some detail as to why it definitely would not have worked offline? Is this method no longer supported?

Thanks!
-Arjumand
-
Re: Using Accumulo as input to a MapReduce job frequently hangs due to lost Zookeeper connectionArjumand Bonhomme 2012-08-17, 02:10
Thanks for the input, no worries about the offline scanner bit. Although if
I can ever get past my current issue, I'd love to learn why the offline scanner failed as well. So I took a look at my various other settings, and watched the paging while everything was running, and it doesn't appear (at least not to me) to be related to swapping. I'm using the 512MB accumulo configuration, and I'm running the hadoop task with a 256MB heap. I'm running hadoop with the standard pseudo-distributed config, except for bumping the dfs.datanode.max.xcievers property to 4096. I'm running zookeeper with the OOB config, except for the adjustment to maxclientcxns in zoo.cfg increased to be 100 I did a reboot of my machine, then began another series of tests. This time I was watching vm_stat while everything was running. I noticed the same behavior I've been witnessing all along. I'm including the stacktrace that was emitted to the tserver log when it hung as it usually does. 6 20:58:12,311 [zookeeper.ZooLock] DEBUG: event null None Disconnected 16 20:58:13,375 [zookeeper.ZooCache] WARN : Zookeeper error, will retry org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /accumulo/56cdf8d0-4881-4ec1-8bf2-728a2b8d0da7/tables/1/conf/table.compaction.minor.idle at org.apache.zookeeper.KeeperException.create(KeeperException.java:118) at org.apache.zookeeper.KeeperException.create(KeeperException.java:42) at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:815) at org.apache.accumulo.core.zookeeper.ZooCache$2.run(ZooCache.java:208) at org.apache.accumulo.core.zookeeper.ZooCache.retry(ZooCache.java:130) at org.apache.accumulo.core.zookeeper.ZooCache.get(ZooCache.java:233) at org.apache.accumulo.core.zookeeper.ZooCache.get(ZooCache.java:188) at org.apache.accumulo.server.conf.TableConfiguration.get(TableConfiguration.java:120) at org.apache.accumulo.server.conf.TableConfiguration.get(TableConfiguration.java:108) at org.apache.accumulo.core.conf.AccumuloConfiguration.getTimeInMillis(AccumuloConfiguration.java:70) at 
org.apache.accumulo.server.tabletserver.LargestFirstMemoryManager.getMemoryManagementActions(LargestFirstMemoryManager.java:95) at org.apache.accumulo.server.tabletserver.TabletServerResourceManager$MemoryManagementFramework.manageMemory(TabletServerResourceManager.java:312) at org.apache.accumulo.server.tabletserver.TabletServerResourceManager$MemoryManagementFramework.access$200(TabletServerResourceManager.java:228) at org.apache.accumulo.server.tabletserver.TabletServerResourceManager$MemoryManagementFramework$2.run(TabletServerResourceManager.java:252) at org.apache.accumulo.core.util.LoggingRunnable.run(LoggingRunnable.java:34) at java.lang.Thread.run(Thread.java:680) 16 20:58:13,375 [zookeeper.ZooCache] WARN : Zookeeper error, will retry org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /accumulo/56cdf8d0-4881-4ec1-8bf2-728a2b8d0da7/config/table.scan.max.memory at org.apache.zookeeper.KeeperException.create(KeeperException.java:118) at org.apache.zookeeper.KeeperException.create(KeeperException.java:42) at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:815) at org.apache.accumulo.core.zookeeper.ZooCache$2.run(ZooCache.java:208) at org.apache.accumulo.core.zookeeper.ZooCache.retry(ZooCache.java:130) at org.apache.accumulo.core.zookeeper.ZooCache.get(ZooCache.java:233) at org.apache.accumulo.core.zookeeper.ZooCache.get(ZooCache.java:188) at org.apache.accumulo.server.conf.ZooConfiguration.get(ZooConfiguration.java:109) at org.apache.accumulo.server.conf.ZooConfiguration._get(ZooConfiguration.java:76) at org.apache.accumulo.server.conf.ZooConfiguration.get(ZooConfiguration.java:103) at org.apache.accumulo.core.conf.AccumuloConfiguration.getMemoryInBytes(AccumuloConfiguration.java:47) at org.apache.accumulo.server.tabletserver.TabletServer$ThriftClientHandler$LookupTask.run(TabletServer.java:960) at 
org.apache.accumulo.server.tabletserver.TabletServerResourceManager.executeReadAhead(TabletServerResourceManager.java:699) at org.apache.accumulo.server.tabletserver.TabletServer$ThriftClientHandler.continueMultiScan(TabletServer.java:1316) at org.apache.accumulo.server.tabletserver.TabletServer$ThriftClientHandler.startMultiScan(TabletServer.java:1284) at sun.reflect.GeneratedMethodAccessor2.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.accumulo.cloudtrace.instrument.thrift.TraceWrap$1.invoke(TraceWrap.java:59) at $Proxy1.startMultiScan(Unknown Source) at org.apache.accumulo.core.tabletserver.thrift.TabletClientService$Processor$startMultiScan.process(TabletClientService.java:2164) at org.apache.accumulo.core.tabletserver.thrift.TabletClientService$Processor.process(TabletClientService.java:2037) at org.apache.accumulo.server.util.TServerUtils$TimedProcessor.process(TServerUtils.java:154) at org.apache.thrift.server.TNonblockingServer$FrameBuffer.invoke(TNonblockingServer.java:631) at org.apache.accumulo.server.util.TServerUtils$THsHaServer$Invocation.run(TServerUtils.java:202) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) at org.apache.accumulo.core.util.LoggingRunnable.run(LoggingRunnable.java:34) at java.lang.Thread.run(Thread.java:680) 16 20:58:13,379 [zookeeper.ZooLock] DEBUG: event null None Expired 16 20:58:13,393 [tabletserver.TabletServer] FATAL: Lost tablet server lock (reason = SESSION_EXPIRED), exiting. Below is the matching stacktrace emitted by the hadoop job. 
12/08/16 20:57:14 INFO zookeeper.ZooKeeper: Initiating client connection, connectString=localhost sessionTimeout=30000 watcher=org.apache.accumulo.core.zookeeper.ZooSession$AccumuloWatcher@47124746 12/08/16 20:57:14 INFO zookeeper.ClientCnxn: Openi
-
Re: Using Accumulo as input to a MapReduce job frequently hangs due to lost Zookeeper connectionDavid Medinets 2012-08-17, 02:33
Would it help if you had another Accumulo instance to test your code
against? Sometimes starting fresh helps. I could lend you my server. Or you could buy a linode server for about $20. In 60 minutes, you'd have a working brand-new Accumulo instance.
-
Re: Using Accumulo as input to a MapReduce job frequently hangs due to lost Zookeeper connectionArjumand Bonhomme 2012-08-17, 03:14
That's a decent thought. I have a linode instance from an old project that
I've never de-commissioned. I may look into that pretty soon.

On Thu, Aug 16, 2012 at 10:33 PM, David Medinets <[EMAIL PROTECTED]> wrote:
> Would it help if you had another Accumulo instance to test your code
> against? Sometimes starting fresh helps. I could lend you my server.
> Or you could buy a linode server for about $20. In 60 minutes, you'd
> have a working brand-new Accumulo instance.
-
Re: Using Accumulo as input to a MapReduce job frequently hangs due to lost Zookeeper connectionArjumand Bonhomme 2012-08-20, 17:00
Hey guys,
I'm back with some additional information and I'm hoping that you might be able to help me finally get past this issue.

So I changed my config from the 512MB version to the 1GB version without any noticeable improvement. The job still became stuck at roughly 10K input records, and accumulo's zookeeper session timed out as usual. After thinking about this for a while it occurred to me that I had previously used accumulo as an input source on over 30K records with no issues at all (it was writing back out to accumulo as well). So I tried to figure out why this new job could never reliably get past 10K records. The only real difference was that the new job was using a Hadoop MapFile as a lookup table for record values. So after commenting out the use of the MapFile and using a single hard-coded value, I ran the job again and it made it through the entire input very quickly, ~3.5M entries in just over a minute. So after this I assumed the MapFile was slowing my job down too much for the accumulo input scanner to keep its connection with zookeeper. The MapFile was generated by reading some data from one of my tables in accumulo, so the next thing I tried (which I figured would be very bad in practice) was to read the lookup data directly from accumulo; I re-wrote the mapper to do a scanner lookup against the other accumulo table for each input record. This immediately worked much better than the MapFile. It was pretty slow, but it managed to get up to ~92K records before it failed. Accumulo once again lost its zookeeper session due to timeout.

I recognize that this is not exactly an accumulo issue, but I was hoping that you might be able to provide me some guidance as to how to ultimately get past this issue. I'm using this against a small sample of what the actual input will be, and I have ~3.5M input records and ~30K values in my lookup table. Both of these values will likely increase in size substantially when this is run against the actual input.
Any suggestions about how to approach this problem will be greatly appreciated.

Thanks,
-Arjumand

BTW, while I recognize this would not be the appropriate way to address my problem, I was wondering if there was a reason why the org.apache.accumulo.core.client.ZooKeeperInstance constructor allows you to specify/request a specific session timeout, but the same thing is not available on the .setZooKeeperInstance() methods of the AccumuloInputFormat/AccumuloOutputFormat classes?

On Thu, Aug 16, 2012 at 10:10 PM, Arjumand Bonhomme <[EMAIL PROTECTED]> wrote:
> Thanks for the input, no worries about the offline scanner bit. Although
> if I can ever get past my current issue, I'd love to learn why the offline
> scanner failed as well.
>
> So I took a look at my various other settings, and watched the paging
> while everything was running, and it doesn't appear (at least not to me) to
> be related to swapping.
> I'm using the 512MB accumulo configuration, and I'm running the hadoop
> task with a 256MB heap. I'm running hadoop with the standard
> pseudo-distributed config, except for bumping the dfs.datanode.max.xcievers
> property to 4096. I'm running zookeeper with the OOB config, except for
> the adjustment to maxclientcxns in zoo.cfg increased to be 100
>
> I did a reboot of my machine, then began another series of tests. This
> time I was watching vm_stat while everything was running.
>
> I noticed the same behavior I've been witnessing all along. I'm including
> the stacktrace that was emitted to the tserver log when it hung as it
> usually does.
> 6 20:58:12,311 [zookeeper.ZooLock] DEBUG: event null None Disconnected
> 16 20:58:13,375 [zookeeper.ZooCache] WARN : Zookeeper error, will retry
> org.apache.zookeeper.KeeperException$SessionExpiredException:
> KeeperErrorCode = Session expired for
> /accumulo/56cdf8d0-4881-4ec1-8bf2-728a2b8d0da7/tables/1/conf/table.compaction.minor.idle
> at org.apache.zookeeper.KeeperException.create(KeeperException.java
-
Re: Using Accumulo as input to a MapReduce job frequently hangs due to lost Zookeeper connectionKeith Turner 2012-08-20, 17:34
Arjumand
It sounds like you are joining two data sets? Reading one data set from Accumulo and then doing a lookup into the other dataset for each input to the mapper? This is a good approach if one data set is small and fits into memory. If this is not the case, then you may want to consider another join strategy. Below are some of the options I know of for joining two data sets, D1 and D2:

* Read D1 into memory on each mapper, and do lookups into the D1 map as each item is read from D2. Good approach when D1 is small.
* Set D1 and D2 as inputs to the map reduce job and join in the reduce phase. Good approach when D1 and D2 are large.
* If D1 is an Accumulo table, use the batch scanner to look up elements in D2. Good approach when D1 is already in a table and D2 is relatively small. If D2 is larger, could do batch scans in reducers or maybe mappers.
* Compute a bloom filter for D1, load the bloom filter into memory for each mapper. Check existence in D1 bloom filter as each item is read from D2. This can work well when lots of things in D2 do not exist in D1. However this is just a filter step that cuts down on what you need to join. A join will still need to be done.

The approach you choose for joining depends on the relative size of your two data sets. The goal is to batch work and avoid single lookups into Accumulo or a Map File if possible.

Keith

On Mon, Aug 20, 2012 at 1:00 PM, Arjumand Bonhomme <[EMAIL PROTECTED]> wrote: > Hey guys, > > I'm back with some additional information and I'm hoping that you might be > able to help me finally get past this issue. > > > So I changed my config from the 512MB version to the 1GB version without any > noticeable improvement. The job still became stuck at roughly 10K input > records, and accumulo's zookeeper session timed out as usual.
After > thinking about this for a while it occurred to me that I had previously used > accumulo as an input source on over 30K records with no issues at all (it > was writing back out to accumulo as well). So I tried to figure out why > this new job could never reliably get past 10K records. The only real > difference was that the new job was using a Hadoop MapFile as a lookup table > for record values. So after commenting out of the use of the MapFile and > using a single hard-coded value, I ran the job again and it made it through > the entire input very quickly, ~3.5M entries in just over a minute. So > after this I assumed the MapFile was slowing my job down too much for the > accumulo input scanner to keep its connection with zookeeper. The MapFile > was generated by reading some data from one of my tables in accumulo, so > the next thing I tried (which I figured would be very bad in practice) was > to read the lookup data directly from accumulo; I re-wrote the mapper to do > a scanner lookup against the other accumulo table for each input record. > This immediately worked much better than the MapFile. It was pretty slow, > but it managed to get up to ~92K records before it failed. Accumulo once > again lost its zookeeper session due to timeout. > > I recognize that this is not exactly an accumulo issue, but I was hoping > that you might be able to provide me some guidance as to how to ultimately > get past this issue. I'm using this against a small sample of what the > actual input will be, and I have ~3.5M input records and ~30K values in my > lookup table. Both of these values will likely increase in size > substantially when this is run against the actual input. > > Any suggestions about how to approach this problem will be greatly > appreciated. 
> > > Thanks, > -Arjumand > > > BTW, While I recognize this would not be the appropriate way to address my > problem, I was wondering if there was a reason why the > org.apache.accumulo.core.client.ZooKeeperInstance constructor allows you > specify/request a specific session timeout, but the same thing is not > available on the .setZooKeeperInstance() methods of the > AccumuloInputFormat/AccumuloOutputFormat classe?
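As a rough illustration of the first option in Keith's list (load the small data set into memory once, then probe it for every input record), here is a minimal sketch in plain Java. It deliberately uses no Hadoop or Accumulo classes. The class name `InMemoryJoin`, the `loadLookup` and `join` helpers, and the sample keys are all hypothetical stand-ins, with D1 playing the role of the small lookup table and D2 the mapper input.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a map-side join; not part of any Accumulo or Hadoop API.
class InMemoryJoin {

    // Build the lookup map once from the small data set D1 (key, value pairs).
    static Map<String, String> loadLookup(List<String[]> d1) {
        Map<String, String> lookup = new HashMap<>();
        for (String[] kv : d1) {
            lookup.put(kv[0], kv[1]);
        }
        return lookup;
    }

    // Probe the in-memory map for each D2 record; records with no match are skipped.
    static List<String> join(Map<String, String> lookup, List<String[]> d2) {
        List<String> out = new ArrayList<>();
        for (String[] rec : d2) {
            String match = lookup.get(rec[0]);
            if (match != null) {
                out.add(rec[0] + "\t" + rec[1] + "\t" + match);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Made-up sample data for illustration only.
        List<String[]> d1 = Arrays.asList(
                new String[] {"k1", "alpha"},
                new String[] {"k2", "beta"});
        List<String[]> d2 = Arrays.asList(
                new String[] {"k1", "10"},
                new String[] {"k3", "30"},
                new String[] {"k2", "20"});
        for (String line : join(loadLookup(d1), d2)) {
            System.out.println(line);
        }
    }
}
```

In a real mapper the map would presumably be built once per task rather than once per record, which is what removes the per-record MapFile or scanner lookup described earlier in the thread. This only works while D1 fits comfortably in the task's heap.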
-
Re: Using Accumulo as input to a MapReduce job frequently hangs due to lost Zookeeper connectionDavid Medinets 2012-08-21, 00:26
Can you use a new table to join and then scan the new table? Use the
foreign key as the rowid. Basically create your own materialized view.
-
Re: Using Accumulo as input to a MapReduce job frequently hangs due to lost Zookeeper connectionKeith Turner 2012-08-21, 12:23
Yeah, that would certainly work.
You could run two map only jobs (could run concurrently). A job that reads D1 and writes to Table3 and a job that reads D2 and writes Table3. Map reduce may be faster, unless you want the final result in Accumulo in which case this may be faster. The two map reduce jobs could also produce files to bulk import into table3. Keith On Mon, Aug 20, 2012 at 8:26 PM, David Medinets <[EMAIL PROTECTED]> wrote: > Can you use a new table to join and then scan the new table? Use the foreign > key as the rowid. Basically create your own materialized view.
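The "materialized view" idea above can be pictured with a small plain-Java sketch. It assumes Table3 behaves like a sorted map from row id to columns, which is only a simplification of an Accumulo table. The class name `MaterializedJoin`, the `write` helper, and the column names `d1:name` and `d2:count` are made up for illustration.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical model of "Table3": row id -> (column -> value), sorted by row id.
class MaterializedJoin {

    // Stand-in for a mutation written by either job; both jobs use the join key as the row id.
    static void write(Map<String, Map<String, String>> table3,
                      String rowId, String column, String value) {
        Map<String, String> row = table3.get(rowId);
        if (row == null) {
            row = new TreeMap<>();
            table3.put(rowId, row);
        }
        row.put(column, value);
    }

    public static void main(String[] args) {
        Map<String, Map<String, String>> table3 = new TreeMap<>();

        // "Job 1": reads D1 and writes its values under the shared row id.
        write(table3, "k1", "d1:name", "alpha");
        write(table3, "k2", "d1:name", "beta");

        // "Job 2": reads D2 and writes to the same rows, in any order.
        write(table3, "k2", "d2:count", "20");
        write(table3, "k1", "d2:count", "10");

        // Scanning the new table now yields each key with both sides of the join.
        for (Map.Entry<String, Map<String, String>> row : table3.entrySet()) {
            System.out.println(row.getKey() + " " + row.getValue());
        }
    }
}
```

Because both writers key on the same row id, neither job needs to look anything up in the other data set; the join falls out of the table's sort order when the new table is scanned.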
-
Re: Using Accumulo as input to a MapReduce job frequently hangs due to lost Zookeeper connectionameet kini 2012-10-10, 14:22
I have a related problem where I need to do a 1-1 join (every row in
table A joins with a unique row in table B and vice versa). My join key is the row id of the table. In the past, I've used Hadoop's CompositeInputFormat to do a map-side join over data in HDFS (described here http://www.congiu.com/joins-in-hadoop-using-compositeinputformat/) My tables in Accumulo seem to fit the eligibility criteria of CompositeInputFormat: both tables are sorted by the join key, since the join key is the row id in my case, and the tables are partitioned the same way (i.e., same split points). Has anyone tried using CompositeInputFormat over Accumulo tables? Is it possible to configure CompositeInputFormat with AccumuloInputFormat? Thanks, Ameet On Tue, Aug 21, 2012 at 8:23 AM, Keith Turner <[EMAIL PROTECTED]> wrote: > Yeah, that would certainly work. > > You could run two map only jobs (could run concurrently). A job that > reads D1 and writes to Table3 and a job that reads D2 and writes > Table3. Map reduce may be faster, unless you want the final result > in Accumulo in which case this may be faster. The two map reduce jobs > could also produce files to bulk import into table3. > > Keith > > On Mon, Aug 20, 2012 at 8:26 PM, David Medinets > <[EMAIL PROTECTED]> wrote: >> Can you use a new table to join and then scan the new table? Use the foreign >> key as the rowid. Basically create your own materialized view.
-
Re: Using Accumulo as input to a MapReduce job frequently hangs due to lost Zookeeper connectionBillie Rinaldi 2012-10-11, 18:57
On Wed, Oct 10, 2012 at 7:22 AM, ameet kini <[EMAIL PROTECTED]> wrote:
> I have a related problem where I need to do a 1-1 join (every row in
> table A joins with a unique row in table B and vice versa). My join
> key is the row id of the table. In the past, I've used Hadoop's
> CompositeInputFormat to do a map-side join over data in HDFS
> (described here
> http://www.congiu.com/joins-in-hadoop-using-compositeinputformat/) My
> tables in Accumulo seem to fit the eligibility criteria of
> CompositeInputFormat: both tables are sorted by the join key, since
> the join key is the row id in my case, and the tables are partitioned
> the same way (i.e., same split points).
>
> Has anyone tried using CompositeInputFormat over Accumulo tables? Is
> it possible to configure CompositeInputFormat with
> AccumuloInputFormat?
>

I haven't tried it. If you do, let us know how it works out.

Billie

>
> Thanks,
> Ameet
>
>
> On Tue, Aug 21, 2012 at 8:23 AM, Keith Turner <[EMAIL PROTECTED]> wrote:
> > Yeah, that would certainly work.
> >
> > You could run two map only jobs (could run concurrently). A job that
> > reads D1 and writes to Table3 and a job that reads D2 and writes
> > Table3. Map reduce may be faster, unless you want the final result
> > in Accumulo in which case this may be faster. The two map reduce jobs
> > could also produce files to bulk import into table3.
> >
> > Keith
> >
> > On Mon, Aug 20, 2012 at 8:26 PM, David Medinets
> > <[EMAIL PROTECTED]> wrote:
> >> Can you use a new table to join and then scan the new table? Use the foreign
> >> key as the rowid. Basically create your own materialized view.
>
-
Re: Using Accumulo as input to a MapReduce job frequently hangs due to lost Zookeeper connectionameet kini 2012-10-17, 14:10
Turns out that my assumption of tables being partitioned the same way may
be too restrictive. I need to account for join partitions not being co-located on the same tablet server. So the CompositeInputFormat is not applicable as I'd initially thought. That said, I hadn't gotten very far with it, and in particular, couldn't for the life of me figure out how to configure the mapred.join.expr to work on Accumulo's rfile directory structure. I ended up extending AccumuloInputFormat to do the join. The record reader would read table A using AccumuloInputFormat's scannerIterator and issue BatchScanner lookups to get table B's matching records, similar to Keith's suggestion above. Thanks, Ameet That said, I had spent some time trying to configure it with AccumuloInputFormat, and couldn't get very far because I couldn't figure out how to write a mapred.join.expr which would work directly on the underlying rfiles in Accumulo. Even if I flush/compact the table so I end up with exactly 1 rfile per tablet, the mapred.join.expr is On Thu, Oct 11, 2012 at 2:57 PM, Billie Rinaldi <[EMAIL PROTECTED]> wrote: > On Wed, Oct 10, 2012 at 7:22 AM, ameet kini <[EMAIL PROTECTED]> wrote: > >> I have a related problem where I need to do a 1-1 join (every row in >> table A joins with a unique row in table B and vice versa). My join >> key is the row id of the table. In the past, I've used Hadoop's >> CompositeInputFormat to do a map-side join over data in HDFS >> (described here >> http://www.congiu.com/joins-in-hadoop-using-compositeinputformat/) My >> tables in Accumulo seem to fit the eligibility criteria of >> CompositeInputFormat: both tables are sorted by the join key, since >> the join key is the row id in my case, and the tables are partitioned >> the same way (i.e., same split points). >> >> Has anyone tried using CompositeInputFormat over Accumulo tables? Is >> it possible to configure CompositeInputFormat with >> AccumuloInputFormat? >> > > I haven't tried it. If you do, let us know how it works out. 
> > Billie > > >> >> Thanks, >> Ameet >> >> >> On Tue, Aug 21, 2012 at 8:23 AM, Keith Turner <[EMAIL PROTECTED]> wrote: >> > Yeah, that would certainly work. >> > >> > You could run two map only jobs (could run concurrently). A job that >> > reads D1 and writes to Table3 and a job that reads D2 and writes >> > Table3. Map reduce may be faster, unless you want the final result >> > in Accumulo in which case this may be faster. The two map reduce jobs >> > could also produce files to bulk import into table3. >> > >> > Keith >> > >> > On Mon, Aug 20, 2012 at 8:26 PM, David Medinets >> > <[EMAIL PROTECTED]> wrote: >> >> Can you use a new table to join and then scan the new table? Use the >> foreign >> >> key as the rowid. Basically create your own materialized view. >> > >
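The pattern Ameet describes (stream table A, and fetch the matching rows of table B in batches rather than one at a time) might look roughly like the following plain-Java sketch. Nothing here is Accumulo code: `lookupBatch` is a hypothetical stand-in for a batched scan against table B, table B is modeled as an in-memory map, and the `roundTrips` counter exists only to make the effect of batching visible.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of a batched lookup join; not the AccumuloInputFormat extension itself.
class BatchedLookupJoin {

    // Number of simulated lookups issued against table B.
    static int roundTrips = 0;

    // Stand-in for one batched scan: fetch all requested row ids in a single call.
    static Map<String, String> lookupBatch(Map<String, String> tableB, Collection<String> rowIds) {
        roundTrips++;
        Map<String, String> found = new HashMap<>();
        for (String id : rowIds) {
            String value = tableB.get(id);
            if (value != null) {
                found.put(id, value);
            }
        }
        return found;
    }

    // Resolve the buffered A records against B with one lookup, then empty the buffer.
    static void flush(List<String[]> buffer, Map<String, String> tableB, List<String> out) {
        if (buffer.isEmpty()) {
            return;
        }
        List<String> ids = new ArrayList<>();
        for (String[] rec : buffer) {
            ids.add(rec[0]);
        }
        Map<String, String> found = lookupBatch(tableB, ids);
        for (String[] rec : buffer) {
            String match = found.get(rec[0]);
            if (match != null) {
                out.add(rec[0] + "\t" + rec[1] + "\t" + match);
            }
        }
        buffer.clear();
    }

    // Stream table A, buffering records and issuing one lookup per full batch.
    static List<String> join(List<String[]> tableA, Map<String, String> tableB, int batchSize) {
        List<String> out = new ArrayList<>();
        List<String[]> buffer = new ArrayList<>();
        for (String[] rec : tableA) {
            buffer.add(rec);
            if (buffer.size() == batchSize) {
                flush(buffer, tableB, out);
            }
        }
        flush(buffer, tableB, out); // handle the final partial batch
        return out;
    }

    public static void main(String[] args) {
        // Made-up sample data for illustration only.
        Map<String, String> tableB = new TreeMap<>();
        tableB.put("r1", "x1");
        tableB.put("r2", "x2");
        tableB.put("r4", "x4");
        List<String[]> tableA = Arrays.asList(
                new String[] {"r1", "a"},
                new String[] {"r2", "b"},
                new String[] {"r3", "c"},
                new String[] {"r4", "d"});
        for (String line : join(tableA, tableB, 2)) {
            System.out.println(line);
        }
        System.out.println("lookups issued: " + roundTrips);
    }
}
```

The batch size is the main tuning knob in a sketch like this: larger batches mean fewer lookups but more records held in memory before any output is produced.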
-
Re: Using Accumulo as input to a MapReduce job frequently hangs due to lost Zookeeper connectionameet kini 2012-10-17, 14:13
My previous post had some stale text after my signature - sorry. Reposting
after chopping the stale text off. Turns out that my assumption of tables being partitioned the same way may be too restrictive. I need to account for join partitions not being co-located on the same tablet server. So the CompositeInputFormat is not applicable as I'd initially thought. That said, I hadn't gotten very far with it, and in particular, couldn't for the life of me figure out how to configure the mapred.join.expr to work on Accumulo's rfile directory structure. I ended up extending AccumuloInputFormat to do the join. The record reader would read table A using AccumuloInputFormat's scannerIterator and issue BatchScanner lookups to get table B's matching records, similar to Keith's suggestion above. Thanks, Ameet On Wed, Oct 17, 2012 at 10:10 AM, ameet kini <[EMAIL PROTECTED]> wrote: > > Turns out that my assumption of tables being partitioned the same way may > be too restrictive. I need to account for join partitions not being > co-located on the same tablet server. So the CompositeInputFormat is not > applicable as I'd initially thought. That said, I hadn't gotten very far > with it, and in particular, couldn't for the life of me figure out how to > configure the mapred.join.expr to work on Accumulo's rfile directory > structure. > > I ended up extending AccumuloInputFormat to do the join. The record reader > would read table A using AccumuloInputFormat's scannerIterator and issue > BatchScanner lookups to get table B's matching records, similar to Keith's > suggestion above. > > Thanks, > Ameet > > > > > > > On Thu, Oct 11, 2012 at 2:57 PM, Billie Rinaldi <[EMAIL PROTECTED]> wrote: > >> On Wed, Oct 10, 2012 at 7:22 AM, ameet kini <[EMAIL PROTECTED]> wrote: >> >>> I have a related problem where I need to do a 1-1 join (every row in >>> table A joins with a unique row in table B and vice versa). My join >>> key is the row id of the table. 
In the past, I've used Hadoop's >>> CompositeInputFormat to do a map-side join over data in HDFS >>> (described here >>> http://www.congiu.com/joins-in-hadoop-using-compositeinputformat/) My >>> tables in Accumulo seem to fit the eligibility criteria of >>> CompositeInputFormat: both tables are sorted by the join key, since >>> the join key is the row id in my case, and the tables are partitioned >>> the same way (i.e., same split points). >>> >>> Has anyone tried using CompositeInputFormat over Accumulo tables? Is >>> it possible to configure CompositeInputFormat with >>> AccumuloInputFormat? >>> >> >> I haven't tried it. If you do, let us know how it works out. >> >> Billie >> >> >>> >>> Thanks, >>> Ameet >>> >>> >>> On Tue, Aug 21, 2012 at 8:23 AM, Keith Turner <[EMAIL PROTECTED]> wrote: >>> > Yeah, that would certainly work. >>> > >>> > You could run two map only jobs (could run concurrently). A job that >>> > reads D1 and writes to Table3 and a job that reads D2 and writes >>> > Table3. Map reduce may be faster, unless you want the final result >>> > in Accumulo in which case this may be faster. The two map reduce jobs >>> > could also produce files to bulk import into table3. >>> > >>> > Keith >>> > >>> > On Mon, Aug 20, 2012 at 8:26 PM, David Medinets >>> > <[EMAIL PROTECTED]> wrote: >>> >> Can you use a new table to join and then scan the new table? Use the >>> foreign >>> >> key as the rowid. Basically create your own materialized view. >>> >> >> > |