|
|
-
Never ending distributed log split
Jean-Marc Spaggiari 2012-08-03, 13:33
Hi,
I'm using HBase 0.94.0.
I stopped the cluster for some maintenance, and I'm have some troubles to restart it.
I'm getting one line every about
Start Time Description State Status Fri Aug 03 08:59:54 EDT 2012 Doing distributed log split in [hdfs://node3:9000/hbase/.logs/latitude,60020,1343908057839-splitting, hdfs://node3:9000/hbase/.logs/latitude,60020,1343998595290-splitting, hdfs://node3:9000/hbase/.logs/node1,60020,1343908057567-splitting, hdfs://node3:9000/hbase/.logs/node1,60020,1343939284240-splitting, hdfs://node3:9000/hbase/.logs/node1,60020,1343998593757-splitting, hdfs://node3:9000/hbase/.logs/node2,60020,1343908059614-splitting, hdfs://node3:9000/hbase/.logs/node2,60020,1343939286369-splitting, hdfs://node3:9000/hbase/.logs/node2,60020,1343998595830-splitting, hdfs://node3:9000/hbase/.logs/node3,60020,1343908054414-splitting, hdfs://node3:9000/hbase/.logs/node3,60020,1343939282294-splitting, hdfs://node3:9000/hbase/.logs/node3,60020,1343998590612-splitting, hdfs://node3:9000/hbase/.logs/node4,60020,1343908056186-splitting, hdfs://node3:9000/hbase/.logs/node4,60020,1343939282889-splitting, hdfs://node3:9000/hbase/.logs/node4,60020,1343998592129-splitting, hdfs://node3:9000/hbase/.logs/node5,60020,1343908059158-splitting, hdfs://node3:9000/hbase/.logs/node5,60020,1343998594856-splitting, hdfs://node3:9000/hbase/.logs/phenom,60020,1343908053256-splitting, hdfs://node3:9000/hbase/.logs/phenom,60020,1343939281065-splitting, hdfs://node3:9000/hbase/.logs/phenom,60020,1343998580375-splitting] RUNNING (since 3sec ago) Waiting for distributed tasks to finish. scheduled=1 done=0 error=0 (since 0sec ago)
If I let it run, it will run like that for hours. Adding lines and lines and lines until I stop it. On the master logs, I can see that: 2012-08-03 09:02:49,788 INFO org.apache.hadoop.hbase.master.SplitLogManager: task /hbase/splitlog/hdfs%3A%2F%2Fnode3%3A9000%2Fhbase%2F.logs%2Fnode1%2C60020%2C1343908057567-splitting%2Fnode1%252C60020%252C1343908057567.1343914548297 entered state err node4,60020,1343998592129 2012-08-03 09:02:49,788 WARN org.apache.hadoop.hbase.master.SplitLogManager: Error splitting /hbase/splitlog/hdfs%3A%2F%2Fnode3%3A9000%2Fhbase%2F.logs%2Fnode1%2C60020%2C1343908057567-splitting%2Fnode1%252C60020%252C1343908057567.1343914548297 2012-08-03 09:02:49,788 WARN org.apache.hadoop.hbase.master.SplitLogManager: error while splitting logs in [hdfs://node3:9000/hbase/.logs/latitude,60020,1343908057839-splitting, hdfs://node3:9000/hbase/.logs/latitude,60020,1343998595290-splitting, hdfs://node3:9000/hbase/.logs/node1,60020,1343908057567-splitting, hdfs://node3:9000/hbase/.logs/node1,60020,1343939284240-splitting, hdfs://node3:9000/hbase/.logs/node1,60020,1343998593757-splitting, hdfs://node3:9000/hbase/.logs/node2,60020,1343908059614-splitting, hdfs://node3:9000/hbase/.logs/node2,60020,1343939286369-splitting, hdfs://node3:9000/hbase/.logs/node2,60020,1343998595830-splitting, hdfs://node3:9000/hbase/.logs/node3,60020,1343908054414-splitting, hdfs://node3:9000/hbase/.logs/node3,60020,1343939282294-splitting, hdfs://node3:9000/hbase/.logs/node3,60020,1343998590612-splitting, hdfs://node3:9000/hbase/.logs/node4,60020,1343908056186-splitting, hdfs://node3:9000/hbase/.logs/node4,60020,1343939282889-splitting, hdfs://node3:9000/hbase/.logs/node4,60020,1343998592129-splitting, hdfs://node3:9000/hbase/.logs/node5,60020,1343908059158-splitting, hdfs://node3:9000/hbase/.logs/node5,60020,1343998594856-splitting, hdfs://node3:9000/hbase/.logs/phenom,60020,1343908053256-splitting, hdfs://node3:9000/hbase/.logs/phenom,60020,1343939281065-splitting, hdfs://node3:9000/hbase/.logs/phenom,60020,1343998580375-splitting] installed = 1 but only 0 done 2012-08-03 09:02:49,788 WARN org.apache.hadoop.hbase.master.MasterFileSystem: Failed splitting of [latitude,60020,1343908057839, latitude,60020,1343998595290, node1,60020,1343908057567, node1,60020,1343939284240, node1,60020,1343998593757, node2,60020,1343908059614, node2,60020,1343939286369, node2,60020,1343998595830, node3,60020,1343908054414, node3,60020,1343939282294, node3,60020,1343998590612, node4,60020,1343908056186, node4,60020,1343939282889, node4,60020,1343998592129, node5,60020,1343908059158, node5,60020,1343998594856, phenom,60020,1343908053256, phenom,60020,1343939281065, phenom,60020,1343998580375] java.io.IOException: error or interrupt while splitting logs in [hdfs://node3:9000/hbase/.logs/latitude,60020,1343908057839-splitting, hdfs://node3:9000/hbase/.logs/latitude,60020,1343998595290-splitting, hdfs://node3:9000/hbase/.logs/node1,60020,1343908057567-splitting, hdfs://node3:9000/hbase/.logs/node1,60020,1343939284240-splitting, hdfs://node3:9000/hbase/.logs/node1,60020,1343998593757-splitting, hdfs://node3:9000/hbase/.logs/node2,60020,1343908059614-splitting, hdfs://node3:9000/hbase/.logs/node2,60020,1343939286369-splitting, hdfs://node3:9000/hbase/.logs/node2,60020,1343998595830-splitting, hdfs://node3:9000/hbase/.logs/node3,60020,1343908054414-splitting, hdfs://node3:9000/hbase/.logs/node3,60020,1343939282294-splitting, hdfs://node3:9000/hbase/.logs/node3,60020,1343998590612-splitting, hdfs://node3:9000/hbase/.logs/node4,60020,1343908056186-splitting, hdfs://node3:9000/hbase/.logs/node4,60020,1343939282889-splitting, hdfs://node3:9000/hbase/.logs/node4,60020,1343998592129-splitting, hdfs://node3:9000/hbase/.logs/node5,60020,1343908059158-splitting, hdfs://node3:9000/hbase/.logs/node5,60020,1343998594856-splitting, hdfs://node3:9000/hbase/.logs/phenom,60020,1343908053256-splitting, hdfs://node3:9000/hbase/.logs/phenom,60020,1343939281065-splitting, hdfs://node3:9000/hbase/.logs/phenom,60020,1343998580375-splitting] Task = installed = 1 done = 0 error = 1 at org.apache.hadoop.hbase.master.SplitLogManager.splitLogDistributed(SplitLogManager.java:269) at org.apache.hadoop.hbase.master.MasterFileSystem.splitLog(MasterFileSystem.java:277) at org.apache.hadoop.hbase.master.MasterFileSystem.splitLogAf
-
Re: Never ending distributed log split
Jean-Marc Spaggiari 2012-08-03, 13:43
Here us the complete log. And seems it's every 30 seconds and not every 20 seconds... http://pastebin.com/gMiURnnj2012/8/3, Jean-Marc Spaggiari <[EMAIL PROTECTED]>: > Hi, > > I'm using HBase 0.94.0. > > I stopped the cluster for some maintenance, and I'm have some troubles > to restart it. > > I'm getting one line every about > > Start Time Description State Status > Fri Aug 03 08:59:54 EDT 2012 Doing distributed log split in > [hdfs://node3:9000/hbase/.logs/latitude,60020,1343908057839-splitting, > hdfs://node3:9000/hbase/.logs/latitude,60020,1343998595290-splitting, > hdfs://node3:9000/hbase/.logs/node1,60020,1343908057567-splitting, > hdfs://node3:9000/hbase/.logs/node1,60020,1343939284240-splitting, > hdfs://node3:9000/hbase/.logs/node1,60020,1343998593757-splitting, > hdfs://node3:9000/hbase/.logs/node2,60020,1343908059614-splitting, > hdfs://node3:9000/hbase/.logs/node2,60020,1343939286369-splitting, > hdfs://node3:9000/hbase/.logs/node2,60020,1343998595830-splitting, > hdfs://node3:9000/hbase/.logs/node3,60020,1343908054414-splitting, > hdfs://node3:9000/hbase/.logs/node3,60020,1343939282294-splitting, > hdfs://node3:9000/hbase/.logs/node3,60020,1343998590612-splitting, > hdfs://node3:9000/hbase/.logs/node4,60020,1343908056186-splitting, > hdfs://node3:9000/hbase/.logs/node4,60020,1343939282889-splitting, > hdfs://node3:9000/hbase/.logs/node4,60020,1343998592129-splitting, > hdfs://node3:9000/hbase/.logs/node5,60020,1343908059158-splitting, > hdfs://node3:9000/hbase/.logs/node5,60020,1343998594856-splitting, > hdfs://node3:9000/hbase/.logs/phenom,60020,1343908053256-splitting, > hdfs://node3:9000/hbase/.logs/phenom,60020,1343939281065-splitting, > hdfs://node3:9000/hbase/.logs/phenom,60020,1343998580375-splitting] > RUNNING (since 3sec ago) Waiting for distributed tasks to finish. > scheduled=1 done=0 error=0 (since 0sec ago) > > If I let it run, it will run like that for hours. Adding lines and > lines and lines until I stop it. > > > On the master logs, I can see that: > 2012-08-03 09:02:49,788 INFO > org.apache.hadoop.hbase.master.SplitLogManager: task > /hbase/splitlog/hdfs%3A%2F%2Fnode3%3A9000%2Fhbase%2F.logs%2Fnode1%2C60020%2C1343908057567-splitting%2Fnode1%252C60020%252C1343908057567.1343914548297 > entered state err node4,60020,1343998592129 > 2012-08-03 09:02:49,788 WARN > org.apache.hadoop.hbase.master.SplitLogManager: Error splitting > /hbase/splitlog/hdfs%3A%2F%2Fnode3%3A9000%2Fhbase%2F.logs%2Fnode1%2C60020%2C1343908057567-splitting%2Fnode1%252C60020%252C1343908057567.1343914548297 > 2012-08-03 09:02:49,788 WARN > org.apache.hadoop.hbase.master.SplitLogManager: error while splitting > logs in > [hdfs://node3:9000/hbase/.logs/latitude,60020,1343908057839-splitting, > hdfs://node3:9000/hbase/.logs/latitude,60020,1343998595290-splitting, > hdfs://node3:9000/hbase/.logs/node1,60020,1343908057567-splitting, > hdfs://node3:9000/hbase/.logs/node1,60020,1343939284240-splitting, > hdfs://node3:9000/hbase/.logs/node1,60020,1343998593757-splitting, > hdfs://node3:9000/hbase/.logs/node2,60020,1343908059614-splitting, > hdfs://node3:9000/hbase/.logs/node2,60020,1343939286369-splitting, > hdfs://node3:9000/hbase/.logs/node2,60020,1343998595830-splitting, > hdfs://node3:9000/hbase/.logs/node3,60020,1343908054414-splitting, > hdfs://node3:9000/hbase/.logs/node3,60020,1343939282294-splitting, > hdfs://node3:9000/hbase/.logs/node3,60020,1343998590612-splitting, > hdfs://node3:9000/hbase/.logs/node4,60020,1343908056186-splitting, > hdfs://node3:9000/hbase/.logs/node4,60020,1343939282889-splitting, > hdfs://node3:9000/hbase/.logs/node4,60020,1343998592129-splitting, > hdfs://node3:9000/hbase/.logs/node5,60020,1343908059158-splitting, > hdfs://node3:9000/hbase/.logs/node5,60020,1343998594856-splitting, > hdfs://node3:9000/hbase/.logs/phenom,60020,1343908053256-splitting, > hdfs://node3:9000/hbase/.logs/phenom,60020,1343939281065-splitting, > hdfs://node3:9000/hbase/.logs/phenom,60020,1343998580375-splitting] > installed = 1 but only 0 done
-
Re: Never ending distributed log split
Jean-Marc Spaggiari 2012-08-03, 15:15
Me again ;) I did some more investigation. And I found that: http://pastebin.com/Bedm6LdySeems that no region is serving my logs. That's strange because all my servers are up and fsck is telling me that FS is clean. Can I just delete those files? What's the impact of such delete? I don't really worrie about loosing some data. It's a test environment. But I really need it to start again. Thanks, JM 2012/8/3, Jean-Marc Spaggiari <[EMAIL PROTECTED]>: > Here us the complete log. And seems it's every 30 seconds and not > every 20 seconds... > > http://pastebin.com/gMiURnnj> > 2012/8/3, Jean-Marc Spaggiari <[EMAIL PROTECTED]>: >> Hi, >> >> I'm using HBase 0.94.0. >> >> I stopped the cluster for some maintenance, and I'm have some troubles >> to restart it. >> >> I'm getting one line every about >> >> Start Time Description State Status >> Fri Aug 03 08:59:54 EDT 2012 Doing distributed log split in >> [hdfs://node3:9000/hbase/.logs/latitude,60020,1343908057839-splitting, >> hdfs://node3:9000/hbase/.logs/latitude,60020,1343998595290-splitting, >> hdfs://node3:9000/hbase/.logs/node1,60020,1343908057567-splitting, >> hdfs://node3:9000/hbase/.logs/node1,60020,1343939284240-splitting, >> hdfs://node3:9000/hbase/.logs/node1,60020,1343998593757-splitting, >> hdfs://node3:9000/hbase/.logs/node2,60020,1343908059614-splitting, >> hdfs://node3:9000/hbase/.logs/node2,60020,1343939286369-splitting, >> hdfs://node3:9000/hbase/.logs/node2,60020,1343998595830-splitting, >> hdfs://node3:9000/hbase/.logs/node3,60020,1343908054414-splitting, >> hdfs://node3:9000/hbase/.logs/node3,60020,1343939282294-splitting, >> hdfs://node3:9000/hbase/.logs/node3,60020,1343998590612-splitting, >> hdfs://node3:9000/hbase/.logs/node4,60020,1343908056186-splitting, >> hdfs://node3:9000/hbase/.logs/node4,60020,1343939282889-splitting, >> hdfs://node3:9000/hbase/.logs/node4,60020,1343998592129-splitting, >> hdfs://node3:9000/hbase/.logs/node5,60020,1343908059158-splitting, >> hdfs://node3:9000/hbase/.logs/node5,60020,1343998594856-splitting, >> hdfs://node3:9000/hbase/.logs/phenom,60020,1343908053256-splitting, >> hdfs://node3:9000/hbase/.logs/phenom,60020,1343939281065-splitting, >> hdfs://node3:9000/hbase/.logs/phenom,60020,1343998580375-splitting] >> RUNNING (since 3sec ago) Waiting for distributed tasks to finish. >> scheduled=1 done=0 error=0 (since 0sec ago) >> >> If I let it run, it will run like that for hours. Adding lines and >> lines and lines until I stop it. >> >> >> On the master logs, I can see that: >> 2012-08-03 09:02:49,788 INFO >> org.apache.hadoop.hbase.master.SplitLogManager: task >> /hbase/splitlog/hdfs%3A%2F%2Fnode3%3A9000%2Fhbase%2F.logs%2Fnode1%2C60020%2C1343908057567-splitting%2Fnode1%252C60020%252C1343908057567.1343914548297 >> entered state err node4,60020,1343998592129 >> 2012-08-03 09:02:49,788 WARN >> org.apache.hadoop.hbase.master.SplitLogManager: Error splitting >> /hbase/splitlog/hdfs%3A%2F%2Fnode3%3A9000%2Fhbase%2F.logs%2Fnode1%2C60020%2C1343908057567-splitting%2Fnode1%252C60020%252C1343908057567.1343914548297 >> 2012-08-03 09:02:49,788 WARN >> org.apache.hadoop.hbase.master.SplitLogManager: error while splitting >> logs in >> [hdfs://node3:9000/hbase/.logs/latitude,60020,1343908057839-splitting, >> hdfs://node3:9000/hbase/.logs/latitude,60020,1343998595290-splitting, >> hdfs://node3:9000/hbase/.logs/node1,60020,1343908057567-splitting, >> hdfs://node3:9000/hbase/.logs/node1,60020,1343939284240-splitting, >> hdfs://node3:9000/hbase/.logs/node1,60020,1343998593757-splitting, >> hdfs://node3:9000/hbase/.logs/node2,60020,1343908059614-splitting, >> hdfs://node3:9000/hbase/.logs/node2,60020,1343939286369-splitting, >> hdfs://node3:9000/hbase/.logs/node2,60020,1343998595830-splitting, >> hdfs://node3:9000/hbase/.logs/node3,60020,1343908054414-splitting, >> hdfs://node3:9000/hbase/.logs/node3,60020,1343939282294-splitting, >> hdfs://node3:9000/hbase/.logs/node3,60020,1343998590612-splitting, >> hdfs://node3:9000/hbase/.logs/node4,60020,1343908056186-splitting,
-
Re: Never ending distributed log split
Jean-Daniel Cryans 2012-08-03, 18:29
On Fri, Aug 3, 2012 at 8:15 AM, Jean-Marc Spaggiari <[EMAIL PROTECTED]> wrote: > Me again ;) > > I did some more investigation. It would really help to see the region server log although the fsck output might be enough. BTW you'll find 0.94.1 RC1 here: http://people.apache.org/~larsh/hbase-0.94.1-rc1/> > And I found that: > > http://pastebin.com/Bedm6Ldy> > Seems that no region is serving my logs. That's strange because all my > servers are up and fsck is telling me that FS is clean. I don't get the "Seems that no region is serving my logs" part. A region doesn't serve logs, it serves HFiles. You meant to say DataNode? > > Can I just delete those files? What's the impact of such delete? I > don't really worrie about loosing some data. It's a test environment. > But I really need it to start again. I wonder if it's related to: https://issues.apache.org/jira/browse/HBASE-6401Did you remove a datanode from the cluster as part of the maintenance? If you want you can probably move that folder aside but whatever was in those logs is lost (if there ever was anything) until it gets replayed properly. Kinda weird that a file wouldn't have any blocks like that, would be interesting to see the log of the region server that created it. J-D
-
Re: Never ending distributed log split
Jean-Marc Spaggiari 2012-08-03, 20:15
2012/8/3, Jean-Daniel Cryans <[EMAIL PROTECTED]>: > On Fri, Aug 3, 2012 at 8:15 AM, Jean-Marc Spaggiari > <[EMAIL PROTECTED]> wrote: >> Me again ;) >> >> I did some more investigation. > > It would really help to see the region server log although the fsck > output might be enough. I looked under evey directory and only one is containing a file. http://pastebin.com/8Fea2EnAIt seems to be related to node1. On this server, seems that everything is started correctly: hadoop@node1:~$ /usr/local/jdk1.7.0_05/bin/jps 2211 DataNode 2938 Jps 2136 TaskTracker hbase@node1:~$ /usr/local/jdk1.7.0_05/bin/jps 2419 HRegionServer 3708 Jps On the Node1 region server logs, I can see the same information, which is, the file is not hosted anywhere. 2012-08-03 15:01:31,216 WARN org.apache.hadoop.hdfs.DFSClient: DFS Read: java.io.IOException: Could not obtain block: blk_4965382127800577452_15852 file=/hbase/.logs/node1,60020,1343908057567-splitting/node1%2C60020%2C1343908057567.1343914548297 at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.chooseDataNode(DFSClient.java:2266) at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.blockSeekTo(DFSClient.java:2060) at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.read(DFSClient.java:2221) at java.io.DataInputStream.read(DataInputStream.java:149) at java.io.DataInputStream.readFully(DataInputStream.java:195) at java.io.DataInputStream.readFully(DataInputStream.java:169) at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1508) at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1486) at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1475) at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1470) at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader$WALReader.<init>(SequenceFileLogReader.java:55) at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader.init(SequenceFileLogReader.java:175) at org.apache.hadoop.hbase.regionserver.wal.HLog.getReader(HLog.java:688) at org.apache.hadoop.hbase.regionserver.wal.HLogSplitter.getReader(HLogSplitter.java:850) at org.apache.hadoop.hbase.regionserver.wal.HLogSplitter.getReader(HLogSplitter.java:763) at org.apache.hadoop.hbase.regionserver.wal.HLogSplitter.splitLogFileToTemp(HLogSplitter.java:384) at org.apache.hadoop.hbase.regionserver.wal.HLogSplitter.splitLogFileToTemp(HLogSplitter.java:351) at org.apache.hadoop.hbase.regionserver.SplitLogWorker$1.exec(SplitLogWorker.java:113) at org.apache.hadoop.hbase.regionserver.SplitLogWorker.grabTask(SplitLogWorker.java:266) at org.apache.hadoop.hbase.regionserver.SplitLogWorker.taskLoop(SplitLogWorker.java:197) at org.apache.hadoop.hbase.regionserver.SplitLogWorker.run(SplitLogWorker.java:165) at java.lang.Thread.run(Thread.java:722) > BTW you'll find 0.94.1 RC1 here: > http://people.apache.org/~larsh/hbase-0.94.1-rc1/Super, thanks! I will most probably try it instead of the 0.94.0 >> And I found that: >> >> http://pastebin.com/Bedm6Ldy>> >> Seems that no region is serving my logs. That's strange because all my >> servers are up and fsck is telling me that FS is clean. > > I don't get the "Seems that no region is serving my logs" part. A > region doesn't serve logs, it serves HFiles. You meant to say > DataNode? I was talking about the files under /hbase/.logs . Base on the directory name I thought it was some logs. What ever this file is supposed to be for, it seems it's not served by any datanode. >> Can I just delete those files? What's the impact of such delete? I >> don't really worrie about loosing some data. It's a test environment. >> But I really need it to start again. > > I wonder if it's related to: > https://issues.apache.org/jira/browse/HBASE-6401> > Did you remove a datanode from the cluster as part of the maintenance? It might be related to this Jira. You, I stopped all the datanodes for the maintenance (Had to work on the power suply...). I had to do that promptly so I "just" stopped everything with init 0. That's fine. Nothing was appening in the cluster for hours. So I'm not really expecting to loose anything. So I will try to delete the file... Here are the logs where we can see the file creation: http://pastebin.com/HBc28zab Nothing weird in it I think. When I removed the file, the region server crashed and had to be restarted. Restart was not working: 2012-08-03 16:07:49,119 WARN org.apache.hadoop.hbase.regionserver.HRegionServer: remote error telling master we are up org.apache.hadoop.ipc.RemoteException: org.apache.hadoop.hbase.PleaseHoldException: Server serverName=node1,60020 2012-08-03 16:07:46,112 WARN org.apache.hadoop.hbase.regionserver.HRegionServer: remote error telling master we are up org.apache.hadoop.ipc.RemoteException: org.apache.hadoop.hbase.PleaseHoldException: Server serverName=node1,60020,1344024290513 rejected; we already have node1,60020,1343998593757 registered with same hostname and port at org.apache.hadoop.hbase.master.ServerManager.checkAlreadySameHostPort(ServerManager.java:194) at org.apache.hadoop.hbase.master.ServerManager.regionServerStartup(ServerManager.java:153) at org.apache.hadoop.hbase.master.HMaster.regionServerStartup(HMaster.java:860) at sun.reflect.GeneratedMethodAccessor19.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:601) at org.apache.hadoop.hbase.ipc.WritableRpcEngine$Server.call(WritableRpcEngine.java:364) at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1376) at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:918) at org.apache.hadoop.hbase.ipc.WritableRpcEngine$Invoker.invoke(WritableRpcEngine.jav
-
Never ending distributed log split
Jean-Marc Spaggiari 2013-06-02, 15:09
My HBase was in a bad state recently. HBCK did a slow but good job and everything is now almost stable. However, I still have one log split which is not working. Every minute, the SplitLogManager try to split the log, fails, and retry. It's always the same file. It's assigned to different nodes, but all failed, and it's starting again and again. 2013-06-02 10:44:20,298 DEBUG org.apache.hadoop.hbase.master.SplitLogManager: Scheduling batch of logs to split 2013-06-02 10:44:20,298 INFO org.apache.hadoop.hbase.master.SplitLogManager: started splitting logs in [hdfs://node3:9000/hbase/.logs/node7,60020,1370118961527-splitting] 2013-06-02 10:44:20,298 DEBUG org.apache.hadoop.hbase.master.SplitLogManager: wait for status of task /hbase/splitlog/hdfs%3A%2F%2Fnode3%3A9000%2Fhbase%2F.logs%2Fnode7%2C60020%2C1370118961527-splitting%2Fnode7%252C60020%252C1370118961527.1370122562614 to change to DELETED 2013-06-02 10:44:20,315 DEBUG org.apache.hadoop.hbase.master.SplitLogManager$DeleteAsyncCallback: deleted /hbase/splitlog/hdfs%3A%2F%2Fnode3%3A9000%2Fhbase%2F.logs%2Fnode7%2C60020%2C1370118961527-splitting%2Fnode7%252C60020%252C1370118961527.1370122562614 2013-06-02 10:44:20,329 DEBUG org.apache.hadoop.hbase.master.SplitLogManager: put up splitlog task at znode /hbase/splitlog/hdfs%3A%2F%2Fnode3%3A9000%2Fhbase%2F.logs%2Fnode7%2C60020%2C1370118961527-splitting%2Fnode7%252C60020%252C1370118961527.1370122562614 2013-06-02 10:44:20,341 DEBUG org.apache.hadoop.hbase.master.SplitLogManager: put up splitlog task at znode /hbase/splitlog/hdfs%3A%2F%2Fnode3%3A9000%2Fhbase%2F.logs%2Fnode7%2C60020%2C1370118961527-splitting%2Fnode7%252C60020%252C1370118961527.1370129764666 2013-06-02 10:44:20,344 DEBUG org.apache.hadoop.hbase.master.SplitLogManager: task not yet acquired /hbase/splitlog/hdfs%3A%2F%2Fnode3%3A9000%2Fhbase%2F.logs%2Fnode7%2C60020%2C1370118961527-splitting%2Fnode7%252C60020%252C1370118961527.1370122562614 ver = 0 2013-06-02 10:44:20,346 DEBUG org.apache.hadoop.hbase.master.SplitLogManager: task not yet acquired /hbase/splitlog/hdfs%3A%2F%2Fnode3%3A9000%2Fhbase%2F.logs%2Fnode7%2C60020%2C1370118961527-splitting%2Fnode7%252C60020%252C1370118961527.1370129764666 ver = 0 2013-06-02 10:44:20,384 INFO org.apache.hadoop.hbase.master.SplitLogManager: task /hbase/splitlog/hdfs%3A%2F%2Fnode3%3A9000%2Fhbase%2F.logs%2Fnode7%2C60020%2C1370118961527-splitting%2Fnode7%252C60020%252C1370118961527.1370122562614 acquired by node1,60020,1370136472290 2013-06-02 10:44:20,410 INFO org.apache.hadoop.hbase.master.SplitLogManager: task /hbase/splitlog/hdfs%3A%2F%2Fnode3%3A9000%2Fhbase%2F.logs%2Fnode7%2C60020%2C1370118961527-splitting%2Fnode7%252C60020%252C1370118961527.1370129764666 acquired by node4,60020,1370136467255 2013-06-02 10:44:20,497 TRACE org.apache.hadoop.hbase.master.SplitLogManager: Skipping the resubmit of last_update = 1370184260384 last_version = 1 cur_worker_name node1,60020,1370136472290 status = in_progress incarnation = 0 resubmits = 0 batch = installed = 2 done = 0 error = 0 because the server node1,60020,1370136472290 is not marked as dead, we waited for 113 while the timeout is 300000 2013-06-02 10:44:20,497 TRACE org.apache.hadoop.hbase.master.SplitLogManager: Skipping the resubmit of last_update = 1370184260410 last_version = 1 cur_worker_name node4,60020,1370136467255 status = in_progress incarnation = 0 resubmits = 0 batch = installed = 2 done = 0 error = 0 because the server node4,60020,1370136467255 is not marked as dead, we waited for 87 while the timeout is 300000 2013-06-02 10:44:20,497 DEBUG org.apache.hadoop.hbase.master.SplitLogManager: total tasks = 2 unassigned = 0 2013-06-02 10:44:21,497 TRACE org.apache.hadoop.hbase.master.SplitLogManager: Skipping the resubmit of last_update = 1370184261377 last_version = 2 cur_worker_name node1,60020,1370136472290 status = in_progress incarnation = 0 resubmits = 0 batch = installed = 2 done = 0 error = 0 because the server node1,60020,1370136472290 is not marked as dead, we waited for 120 while the timeout is 300000 2013-06-02 10:44:21,497 TRACE org.apache.hadoop.hbase.master.SplitLogManager: Skipping the resubmit of last_update = 1370184261410 last_version = 2 cur_worker_name node4,60020,1370136467255 status = in_progress incarnation = 0 resubmits = 0 batch = installed = 2 done = 0 error = 0 because the server node4,60020,1370136467255 is not marked as dead, we waited for 87 while the timeout is 300000 2013-06-02 10:44:21,497 DEBUG org.apache.hadoop.hbase.master.SplitLogManager: total tasks = 2 unassigned = 0 2013-06-02 10:44:21,708 DEBUG org.apache.hadoop.hbase.master.ServerManager: REPORT: Server node7,60020,1370136467731 came back up, removed it from the dead servers list 2013-06-02 10:44:22,497 TRACE org.apache.hadoop.hbase.master.SplitLogManager: Skipping the resubmit of last_update = 1370184261377 last_version = 2 cur_worker_name node1,60020,1370136472290 status = in_progress incarnation = 0 resubmits = 0 batch = installed = 2 done = 0 error = 0 because the server node1,60020,1370136472290 is not marked as dead, we waited for 1120 while the timeout is 300000 2013-06-02 10:44:22,497 TRACE org.apache.hadoop.hbase.master.SplitLogManager: Skipping the resubmit of last_update = 1370184261410 last_version = 2 cur_worker_name node4,60020,1370136467255 status = in_progress incarnation = 0 resubmits = 0 batch = installed = 2 done = 0 error = 0 because the server node4,60020,1370136467255 is not marked as dead, we waited for 1087 while the timeout is 300000 2013-06-02 10:44:22,497 DEBUG org.apache.hadoop.hbase.master.SplitLogManager: total tasks = 2 unassigned = 0 2013-06-02 10:44:23,497 TRACE org.apache.hadoop.hbase.master.SplitLogManager: Skipping the resubmit of last_update = 1370184261377 last_version = 2 cur_worker_name node1,60020,1370136472290 status = in_progress incarnation = 0 resubmits = 0 batch = installed = 2 done = 0 error = 0 because the server node1,60020,1370136472290 is not marked as dead, we waited for 2120 while the timeout
|
|