Without having looked at the logs, this sounds like the common case of overloading a single region due to your sequential row keys. Either hash the keys, or salt them - but the best bet here is to use the bulk loading feature of HBase (http://hbase.apache.org/bulk-loads.html). That bypasses this problem and lets you continue to use sequential keys.
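If bulk loading is not an option, here is a rough sketch of what salting could look like for this job. It is only an illustration in Python, not HBase client code: the bucket count, the key layout, and the helper names `salted_row_key` and `all_row_keys_for_date` are assumptions made up for this example. The bucket is derived from the document id rather than from the date, because every tweet from one day shares the same date and would otherwise still land in a single row.

```python
import hashlib

# Hypothetical bucket count; something close to the number of regions
# you want the writes spread over is a reasonable starting point.
NUM_BUCKETS = 16


def salted_row_key(tweet_date: str, doc_id: str, num_buckets: int = NUM_BUCKETS) -> bytes:
    """Build a row key of the form b"<bucket>-<date>".

    The bucket comes from a hash of the document id, so writes for one
    date are spread over num_buckets rows instead of a single hot row.
    """
    digest = hashlib.md5(doc_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % num_buckets
    return f"{bucket:02d}-{tweet_date}".encode("utf-8")


def all_row_keys_for_date(tweet_date: str, num_buckets: int = NUM_BUCKETS) -> list:
    """List every salted row key that may hold data for the given date."""
    return [f"{b:02d}-{tweet_date}".encode("utf-8") for b in range(num_buckets)]
```

The trade-off is on the read side: to collect everything for one date you have to fetch all of the rows returned by `all_row_keys_for_date` and merge them yourself. For the salt to actually spread the load, the table also needs regions split along the bucket prefixes.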
On Dec 1, 2011, at 12:21 PM, edward choi wrote:
> Hi Lars,
> Okay here goes some details.
> There are 21 tasktrackers/datanodes/regionservers
> There is one Jobtracker/namenode/master
> Three zookeepers.
> There are about 200 million tweets in Hbase.
> My mapreduce code is to aggregate tweets by their generated date.
> So in the map stage, I write out tweet date as the key, and document id as
> the value (document id is randomly generated by hash algorithm)
> In the reduce stage, I put the data into a table. The key (which is the
> tweet date) is the table rowid, and the values (which are document ids) are
> the column values.
> Now, the map stage is fine. I get to 100% map. But during the reduce stage,
> one of my regionservers fails.
> I don't know what the exact symptom is. I just get:
>> 1 action: servers with issues: lp171.etri.re.kr:60020,
> About "some node always die" <== scratch this.
> To be precise,
> I narrowed down the range of data that I wanted to process.
> I tried to put only tweets that were generated on 2011/11/22.
> Now the reduce code will produce a row with "20111122" as the rowid, and a
> bunch of document ids as the column values. (I use a 32-byte string as the
> document id. I append 1000 document ids per column.)
> So the region that my data will be inserted into will have "20111122" between
> the Start Key and End Key.
> The regionserver that contains that specific region fails. That is the
> point. If I move that region to another regionserver using hbase shell,
> then that regionserver fails.
> With the same log output.
> After 4 failures, the job is force-cancelled and the put operation was not
> Now, even with the failure, the regionserver is still online. It is not
> dead(sorry for my use of word 'die').
> I have pasted Jobtracker log, tasktracker(one that failed) log,
> regionserver(one that failed) log using PasteBin.
> The job started at 2011-12-01 17:14:43 and was killed at 2011-12-01
> JobTracker Log
> http://pastebin.com/embed_js.php?i=n6sp8Fyi
> TaskTracker Log
> http://pastebin.com/embed_js.php?i=RMFc41D5
> RegionServer Log
> http://pastebin.com/embed_js.php?i=UpKF8HwN
> And finally, according to the logs I pasted, I see other lines with DEBUG
> or INFO. So I thought this was okay.
> Is there a way to change WARN level log to some other level log? If you'd
> let me know, I will paste another set of logs.
> 2011/12/1 Lars George <[EMAIL PROTECTED]>
>> Hi Ed,
>> You need to be more precise I am afraid. First of all what does "some node
>> always dies" mean? Is the process gone? Which process is gone?
>> And the "error" you pasted is a WARN level log that *might* indicate some
>> trouble, but is *not* the reason the "node has died". Please elaborate.
>> Also consider posting the last few hundred lines of the process logs to
>> pastebin so that someone can look at it.
>> On Dec 1, 2011, at 9:48 AM, edward choi wrote:
>>> I've had a problem that has been killing me for some days now.
>>> I am using the CDH3 update 2 version of Hadoop and HBase.
>>> When I do a large amount of bulk loading into Hbase, some node always
>>> It's not just one particular node.
>>> But one of the many nodes eventually fails to serve.
>>> I set 4 gigs of heap space for the master and the regionservers. I monitored
>>> the processes, and when a node fails, it has not used all of its heap yet.
>>> So it is not a heap space problem.