HBase >> mail # user >> HBase stuck because HDFS fails to replicate blocks


Re: HBase stuck because HDFS fails to replicate blocks
I like the way you dug down into multiple logs and presented the
information, but this looks more like GC than an HDFS failure. In your
region server log, go back to the first FATAL and see if it got a session
expired from ZK, along with other messages about a client not being able to
talk to a server for some amount of time. If that's the case, then what you
are seeing is the result of IO fencing by the master.
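A quick way to run that check from the command line, in the same grep style as the commands later in this thread. The sample log lines below are illustrative assumptions modeled on HBase 0.94-era messages (the long-GC warning from the internal Sleeper and a FATAL abort on session expiry), not copied from a real log; point the greps at your actual regionserver log instead.

```shell
# Illustrative only: a tiny fake region server log with message shapes
# assumed from HBase 0.94-era logging (not a real log excerpt).
cat > /tmp/regionserver.sample <<'EOF'
2013-10-02 13:33:00 WARN  util.Sleeper: We slept 41123ms instead of 3000ms, this is likely due to a long garbage collecting pause
2013-10-02 13:33:37 FATAL regionserver.HRegionServer: ABORTING region server datanode1,60020,1380637389766: session expired
EOF

# 1) First FATAL: if it mentions a ZK session expiry, a long GC pause
#    (not HDFS) is the likely root cause, and the master has fenced
#    this server's WAL.
grep -m1 'FATAL' /tmp/regionserver.sample

# 2) Count long-GC "We slept" warnings leading up to the abort.
grep -c 'We slept' /tmp/regionserver.sample
```

If the FATAL and the session expiry line up with the block-recovery timestamps in the DFSClient warnings, the HDFS errors are a symptom of the fencing, not the cause.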

J-D
On Wed, Oct 2, 2013 at 10:15 AM, Ionut Ignatescu
<[EMAIL PROTECTED]>wrote:

> Hi,
>
> I have a Hadoop & HBase cluster that runs Hadoop 1.1.2 and HBase 0.94.7.
> I noticed an issue that stops the cluster from running normally.
> My use case: I have several MR jobs that read data from one HBase table in
> the map phase and write data to 3 different tables during the reduce
> phase. I create the table handlers on my own; I don't use
> TableOutputFormat. The only way out I found is to restart the region
> server daemon on the region server with problems.
>
> On namenode:
> cat namenode.2013-10-02 | grep "blk_3136705509461132997_43329"
> Wed Oct 02 13:32:17 2013 GMT namenode 3852-0@namenode:0 [INFO] (IPC Server
> handler 29 on 22700) org.apache.hadoop.hdfs.StateChange: BLOCK*
> NameSystem.allocateBlock:
>
> /hbase/.logs/datanode1,60020,1380637389766/datanode1%2C60020%2C1380637389766.1380720737247.
> blk_3136705509461132997_43329
> Wed Oct 02 13:33:38 2013 GMT namenode 3852-0@namenode:0 [INFO] (IPC Server
> handler 13 on 22700) org.apache.hadoop.hdfs.server.namenode.FSNamesystem:
> commitBlockSynchronization(lastblock=blk_3136705509461132997_43329,
> newgenerationstamp=43366, newlength=40045568, newtargets=[
> 10.81.18.101:50010],
> closeFile=false, deleteBlock=false)
>
> On region server:
> cat regionserver.2013-10-02 | grep "1380720737247"
> Wed Oct 02 13:32:17 2013 GMT regionserver 5854-0@datanode1:0 [INFO]
> (regionserver60020.logRoller)
> org.apache.hadoop.hbase.regionserver.wal.HLog: Roll
>
> /hbase/.logs/datanode1,60020,1380637389766/datanode1%2C60020%2C1380637389766.1380720701436,
> entries=149, filesize=63934833.  for
>
> /hbase/.logs/datanode1,60020,1380637389766/datanode1%2C60020%2C1380637389766.1380720737247
> Wed Oct 02 13:33:37 2013 GMT regionserver 5854-0@datanode1:0 [WARN]
> (DataStreamer for file
>
> /hbase/.logs/datanode1,60020,1380637389766/datanode1%2C60020%2C1380637389766.1380720737247
> block blk_3136705509461132997_43329) org.apache.hadoop.hdfs.DFSClient:
> Error Recovery for block blk_3136705509461132997_43329 bad datanode[0]
> 10.80.40.176:50010
> Wed Oct 02 13:33:37 2013 GMT regionserver 5854-0@datanode1:0 [WARN]
> (DataStreamer for file
>
> /hbase/.logs/datanode1,60020,1380637389766/datanode1%2C60020%2C1380637389766.1380720737247
> block blk_3136705509461132997_43329) org.apache.hadoop.hdfs.DFSClient:
> Error Recovery for block blk_3136705509461132997_43329 in pipeline
> 10.80.40.176:50010, 10.81.111.8:50010, 10.81.18.101:50010: bad datanode
> 10.80.40.176:50010
> Wed Oct 02 13:33:43 2013 GMT regionserver 5854-0@datanode1:0 [INFO]
> (regionserver60020.logRoller) org.apache.hadoop.hdfs.DFSClient: Could not
> complete file
>
> /hbase/.logs/datanode1,60020,1380637389766/datanode1%2C60020%2C1380637389766.1380720737247
> retrying...
> Wed Oct 02 13:33:43 2013 GMT regionserver 5854-0@datanode1:0 [INFO]
> (regionserver60020.logRoller) org.apache.hadoop.hdfs.DFSClient: Could not
> complete file
>
> /hbase/.logs/datanode1,60020,1380637389766/datanode1%2C60020%2C1380637389766.1380720737247
> retrying...
> Wed Oct 02 13:33:44 2013 GMT regionserver 5854-0@datanode1:0 [INFO]
> (regionserver60020.logRoller) org.apache.hadoop.hdfs.DFSClient: Could not
> complete file
>
> /hbase/.logs/datanode1,60020,1380637389766/datanode1%2C60020%2C1380637389766.1380720737247
> retrying...
>
> cat regionserver.2013-10-02 | grep "1380720737247" | grep 'Could not
> complete' | wc -l
> 5640
>
>
> In datanode logs, that runs on the same host with region server:
> cat datanode.2013-10-02 | grep "blk_3136705509461132997_43329"
> Wed Oct 02 13:32:17 2013 GMT datanode 5651-0@datanode1:0 [INFO]