HBase user mailing list: Occasional regionserver crashes following socket errors writing to HDFS


Eran Kutner 2012-05-10, 08:17
dva 2012-08-30, 06:26
Stack 2012-08-30, 22:36
Stack 2012-05-11, 05:07
Igal Shilman 2012-05-10, 09:25
Re: Occasional regionserver crashes following socket errors writing to HDFS
Thanks Igal, but we already have that setting. These are the relevant
settings from hdfs-site.xml:
  <property>
    <name>dfs.datanode.max.xcievers</name>
    <value>65536</value>
  </property>
  <property>
    <name>dfs.datanode.handler.count</name>
    <value>10</value>
  </property>
  <property>
    <name>dfs.datanode.socket.write.timeout</name>
    <value>0</value>
  </property>

Other ideas?

-eran
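
(A minimal sketch, not from the original exchange: one way to double-check
which values a JVM actually resolves for the keys above, assuming the stock
org.apache.hadoop.conf.Configuration API and an hdfs-site.xml available on the
classpath. The class name and the addResource call are illustrative; the
DataNode process itself may read a different configuration directory.)

import org.apache.hadoop.conf.Configuration;

public class PrintHdfsSettings {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        conf.addResource("hdfs-site.xml");          // picked up from the classpath if present
        String[] keys = {
            "dfs.datanode.max.xcievers",            // note the historical misspelling
            "dfs.datanode.handler.count",
            "dfs.datanode.socket.write.timeout"     // 0 = never time out on socket writes
        };
        for (String key : keys) {
            // get(key, default) returns the default only if no loaded resource defines the key
            System.out.println(key + " = " + conf.get(key, "<not set>"));
        }
    }
}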

On Thu, May 10, 2012 at 12:25 PM, Igal Shilman <[EMAIL PROTECTED]> wrote:

> Hi Eran,
> Do you have dfs.datanode.socket.write.timeout set in hdfs-site.xml?
> (We have set this to zero in our cluster, which means waiting as long as
> necessary for the write to complete)
>
> Igal.
>
> On Thu, May 10, 2012 at 11:17 AM, Eran Kutner <[EMAIL PROTECTED]> wrote:
>
> > Hi,
> > We're seeing occasional regionserver crashes during heavy write operations
> > to HBase (at the reduce phase of large M/R jobs). I have increased the file
> > descriptors, HDFS xceivers, and HDFS threads to the recommended settings,
> > and actually well above them.
> >
> > Here is an example of the HBase log (showing only errors):
> >
> > 2012-05-10 03:34:54,291 WARN org.apache.hadoop.hdfs.DFSClient:
> > DFSOutputStream ResponseProcessor exception  for block
> > blk_-8928911185099340956_5189425java.io.IOException: Bad response 1 for
> > block blk_-8928911185099340956_5189425 from datanode 10.1.104.6:50010
> >        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$ResponseProcessor.run(DFSClient.java:2986)
> >
> > 2012-05-10 03:34:54,494 WARN org.apache.hadoop.hdfs.DFSClient: DataStreamer
> > Exception: java.io.InterruptedIOException: Interruped while waiting for IO
> > on channel java.nio.channels.SocketChannel[connected
> > local=/10.1.104.9:59642 remote=/10.1.104.9:50010]. 0 millis timeout left.
> >        at org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(SocketIOWithTimeout.java:349)
> >        at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:157)
> >        at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:146)
> >        at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:107)
> >        at java.io.BufferedOutputStream.write(BufferedOutputStream.java:105)
> >        at java.io.DataOutputStream.write(DataOutputStream.java:90)
> >        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2848)
> >
> > 2012-05-10 03:34:54,531 WARN org.apache.hadoop.hdfs.DFSClient: Error
> > Recovery for block blk_-8928911185099340956_5189425 bad datanode[2]
> > 10.1.104.6:50010
> > 2012-05-10 03:34:54,531 WARN org.apache.hadoop.hdfs.DFSClient: Error
> > Recovery for block blk_-8928911185099340956_5189425 in pipeline
> > 10.1.104.9:50010, 10.1.104.8:50010, 10.1.104.6:50010: bad datanode
> > 10.1.104.6:50010
> > 2012-05-10 03:48:30,174 FATAL
> > org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region server
> > serverName=hadoop1-s09.farm-ny.gigya.com,60020,1336476100422,
> > load=(requests=15741, regions=789, usedHeap=6822, maxHeap=7983):
> > regionserver:60020-0x2372c0e8a2f0008 regionserver:60020-0x2372c0e8a2f0008
> > received expired from ZooKeeper, aborting
> > org.apache.zookeeper.KeeperException$SessionExpiredException:
> > KeeperErrorCode = Session expired
> >        at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.connectionEvent(ZooKeeperWatcher.java:352)
> >        at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:270)
> >        at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:531)
> >        at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:507)
> > java.io.InterruptedIOException: Aborting compaction of store properties in
> > region
> > gs_users,6155551|QoCW/euBIKuMat/nRC5Xtw==,1334983658004.878522ea91f41cd76b903ea06ccd17f9.
> > because user requested stop.
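
(A note on the FATAL line above, with a minimal sketch rather than anything
from the thread itself: the abort is the regionserver reacting to its
ZooKeeper session expiring, which happens when the process is stalled, e.g.
by the blocked HDFS writes or a long GC pause, for longer than the configured
session timeout. The sketch below assumes the standard HBaseConfiguration
API; the key zookeeper.session.timeout is standard, but the fallback value in
the code is only illustrative and the effective default depends on the HBase
version.)

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class PrintZkSessionTimeout {
    public static void main(String[] args) {
        // HBaseConfiguration.create() loads hbase-default.xml and hbase-site.xml
        Configuration conf = HBaseConfiguration.create();
        // If the regionserver is unresponsive for longer than this, ZooKeeper
        // expires its session and the server aborts, as in the log above.
        int timeoutMs = conf.getInt("zookeeper.session.timeout", 180000); // fallback is illustrative
        System.out.println("zookeeper.session.timeout = " + timeoutMs + " ms");
    }
}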
Michel Segel 2012-05-10, 11:53
Eran Kutner 2012-05-10, 12:22
Michael Segel 2012-05-10, 13:26
Dave Revell 2012-05-10, 17:31
Michael Segel 2012-05-10, 18:30
Dave Revell 2012-05-10, 18:41
Michael Segel 2012-05-10, 18:59
Eran Kutner 2012-05-10, 19:17
Michael Segel 2012-05-10, 19:50
Stack 2012-05-10, 21:57
Michael Segel 2012-05-11, 02:46
Stack 2012-05-11, 03:34
Michael Segel 2012-05-11, 01:28
Stack 2012-05-11, 03:28
Michael Segel 2012-05-11, 03:44
Stack 2012-05-11, 03:53
Stack 2012-05-11, 05:12
Michael Segel 2012-05-11, 11:36
Eran Kutner 2012-05-24, 11:15
Michael Segel 2012-05-24, 12:13
Stack 2012-05-24, 23:39
Dave Revell 2012-05-25, 19:52
Stack 2012-05-11, 05:08