HBase, mail # user - Occasional regionserver crashes following socket errors writing to HDFS

Re: Occasional regionserver crashes following socket errors writing to HDFS
Michael Segel 2012-05-10, 13:26

So the issue is that you have a lot of regions on a region server, where the max file size is the default.
On your input to HBase, you have a couple of issues.

1) Your data is most likely sorted. (Not good on inserts; sorted input means all of your writes hammer one region at a time.)
2) You will want to increase your region size from the default (256MB) to something like 1-2GB.
3) You probably don't have MSLAB set up or the GC tuned.
4) Google dfs.balance.bandwidthPerSec. I believe it's also used by HBase when it needs to move regions.
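For reference, points 2 thru 4 map to config along these lines. This is a sketch from memory of 0.90-era property names; check the defaults and names against your own version before copying:

```xml
<!-- hbase-site.xml: raise the max region file size from 256MB to ~1GB -->
<property>
  <name>hbase.hregion.max.filesize</name>
  <value>1073741824</value>
</property>

<!-- hbase-site.xml: enable MSLAB to cut memstore heap fragmentation -->
<property>
  <name>hbase.hregion.memstore.mslab.enabled</name>
  <value>true</value>
</property>

<!-- hdfs-site.xml: per-datanode balancer bandwidth cap, in bytes/sec -->
<property>
  <name>dfs.balance.bandwidthPerSec</name>
  <value>10485760</value>
</property>
```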
Speaking of which, what happens when HBase decides to move a region? Does it make a copy on the new RS, point to the new RS once it's there, and then remove the old region?
I'm assuming you're writing out of your reducer straight to HBase.
Are you writing your job with one reducer, or did you set up multiple reducers? You may want to experiment with multiple reducers ...

Again, here's the issue: you don't need a reducer when writing to HBase. You would be better served by refactoring your job so the mapper writes to HBase directly.
Think about it. (Really, think about it. If you really don't see it, face a white wall with a six-pack of beer, start drinking, and focus on the question of why I would say you don't need a reducer on a map job.) ;-) Note: if you don't drink, go to the gym, get on a treadmill, and run at a good pace. Put your body into a zone and then work through the problem.
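As an untested sketch of the map-only shape (the class, table, and column names here are made up for illustration; the real wiring is TableOutputFormat plus zero reducers, so rows go straight from the mapper to HBase with no shuffle or sort):

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableOutputFormat;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class MapOnlyHBaseLoad {

  // Mapper emits Puts directly; TableOutputFormat writes them to HBase.
  static class LoadMapper
      extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {
    @Override
    protected void map(LongWritable key, Text line, Context ctx)
        throws IOException, InterruptedException {
      // Hypothetical input: tab-separated "rowkey<TAB>value" lines.
      String[] f = line.toString().split("\t", 2);
      Put put = new Put(Bytes.toBytes(f[0]));
      put.add(Bytes.toBytes("cf"), Bytes.toBytes("v"), Bytes.toBytes(f[1]));
      ctx.write(new ImmutableBytesWritable(put.getRow()), put);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    conf.set(TableOutputFormat.OUTPUT_TABLE, "my_table"); // target table
    Job job = new Job(conf, "map-only HBase load");
    job.setJarByClass(MapOnlyHBaseLoad.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    job.setMapperClass(LoadMapper.class);
    job.setOutputFormatClass(TableOutputFormat.class);
    job.setOutputKeyClass(ImmutableBytesWritable.class);
    job.setOutputValueClass(Put.class);
    job.setNumReduceTasks(0); // the whole point: no reduce phase at all
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

With zero reducers there is no shuffle/sort barrier, so writes are spread across mappers as the data is read instead of being funneled through a handful of reduce tasks at the end of the job.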

On May 10, 2012, at 7:22 AM, Eran Kutner wrote:

> Hi Mike,
> Not sure I understand the question about the reducer. I'm using a reducer
> because my M/R jobs require one and I want to write the result to HBase.
> I have two tables I'm writing to; one uses the default file size
> (256MB if I remember correctly), the other 512MB.
> There are ~700 regions on each server.
> Didn't know there was a bandwidth limit; is it on HDFS or HBase? How can it
> be configured?
> -eran
> On Thu, May 10, 2012 at 2:53 PM, Michel Segel <[EMAIL PROTECTED]> wrote:
>> Silly question...
>> Why are you using a reducer when working w HBase?
>> Second silly question... What is the max file size of your table that you
>> are writing to?
>> Third silly question... How many regions are on each of your region servers
>> Fourth silly question ... There is this bandwidth setting... Default is
>> 10MB...  Did you modify it?
>> Sent from a remote device. Please excuse any typos...
>> Mike Segel
>> On May 10, 2012, at 6:33 AM, Eran Kutner <[EMAIL PROTECTED]> wrote:
>>> Thanks Igal, but we already have that setting. These are the relevant
>>> settings from hdfs-site.xml:
>>> <property>
>>>   <name>dfs.datanode.max.xcievers</name>
>>>   <value>65536</value>
>>> </property>
>>> <property>
>>>   <name>dfs.datanode.handler.count</name>
>>>   <value>10</value>
>>> </property>
>>> <property>
>>>   <name>dfs.datanode.socket.write.timeout</name>
>>>   <value>0</value>
>>> </property>
>>> Other ideas?
>>> -eran
>>> On Thu, May 10, 2012 at 12:25 PM, Igal Shilman <[EMAIL PROTECTED]> wrote:
>>>> Hi Eran,
>>>> Do you have: dfs.datanode.socket.write.timeout set in hdfs-site.xml ?
>>>> (We have set this to zero in our cluster, which means waiting as long as
>>>> necessary for the write to complete)
>>>> Igal.
>>>> On Thu, May 10, 2012 at 11:17 AM, Eran Kutner <[EMAIL PROTECTED]> wrote:
>>>>> Hi,
>>>>> We're seeing occasional regionserver crashes during heavy write operations
>>>>> to HBase (at the reduce phase of large M/R jobs). I have increased the file
>>>>> descriptors, HDFS xceivers, and HDFS threads to the recommended settings,
>>>>> and actually way above.
>>>>> Here is an example of the HBase log (showing only errors):
>>>>> 2012-05-10 03:34:54,291 WARN org.apache.hadoop.hdfs.DFSClient:
>>>>> DFSOutputStream ResponseProcessor exception  for block
>>>>> blk_-8928911185099340956_5189425java.io.IOException: Bad response 1 for
>>>>> block blk_-8928911185099340956_5189425 from datanode