Re: Constant error when putting large data into HBase
Lars George 2011-12-01, 13:38
Hi Ed,
Without having looked at the logs, this sounds like the common case of overloading a single region due to your sequential row keys. Either hash the keys, or salt them - but the best bet here is to use the bulk loading feature of HBase (http://hbase.apache.org/bulk-loads.html). That bypasses this problem and lets you continue to use sequential keys.

Lars

On Dec 1, 2011, at 12:21 PM, edward choi wrote:

> Hi Lars,
>
> Okay here goes some details.
> There are 21 tasktrackers/datanodes/regionservers
> There is one Jobtracker/namenode/master
> Three zookeepers.
>
> There are about 200 million tweets in Hbase.
> My mapreduce code is to aggregate tweets by their generated date.
> So in the map stage, I write out tweet date as the key, and document id as
> the value (document id is randomly generated by hash algorithm)
> In the reduce stage, I put the data into a table. The key(which is the
> tweet date) is the table rowid, and values(which are document id's) as the
> column values.
>
> Now, map stage is fine. I get to 100% map. But during reduce stage, one of
> my regionserver fails.
> I don't know what the exact symptom is. I just get:
>> org.apache.hadoop.hbase.client.RetriesExhaustedWithDetailsException: Failed
>> 1 action: servers with issues: lp171.etri.re.kr:60020,
>
> About "some node always die" <== scratch this.
>
> To be precise,
> I narrowed down the range of data that I wanted to process.
> I tried to put tweets that was generated only at 2011/11/22.
> Now the reduce code will produce a row with "20111122" as the rowid, and a
> bunch of document id's as the column value. (I use 32byte string as the
> document id. I append 1000 document id for a single column)
> So the region that my data will be inserted will have "20111122" between
> the Start Key and End Key.
> The regionserver that contains that specific region fails. That is the
> point. If I move that region to another regionserver using hbase shell,
> then that regionserver fails.
> With the same log output.
> After 4 failures, the job is force-cancelled and the put operation was not
> done.
>
> Now, even with the failure, the regionserver is still online. It is not
> dead(sorry for my use of word 'die').
>
> I have pasted Jobtracker log, tasktracker(one that failed) log,
> regionserver(one that failed) log using PasteBin.
> The job started at 2011-12-01 17:14:43 and was killed at 2011-12-01
> 17:20:07.
>
> JobTracker Log
> <script src="http://pastebin.com/embed_js.php?i=n6sp8Fyi"></script>
>
> TaskTracker Log
> <script src="http://pastebin.com/embed_js.php?i=RMFc41D5"></script>
>
> RegionServer Log
> <script src="http://pastebin.com/embed_js.php?i=UpKF8HwN"></script>
>
> And finally, according to the logs I pasted, I see other lines with DEBUG
> or INFO. So I thought this was okay.
> Is there a way to change WARN level log to some other level log? If you'd
> let me know, I will paste another set of logs.
>
> Thanks,
> Ed
>
> 2011/12/1 Lars George <[EMAIL PROTECTED]>
>
>> Hi Ed,
>>
>> You need to be more precise I am afraid. First of all what does "some node
>> always dies" mean? Is the process gone? Which process is gone?
>> And the "error" you pasted is a WARN level log that *might* indicate some
>> trouble, but is *not* the reason the "node has died". Please elaborate.
>>
>> Also consider posting the last few hundred lines of the process logs to
>> pastebin so that someone can look at it.
>>
>> Thanks,
>> Lars
>>
>>
>> On Dec 1, 2011, at 9:48 AM, edward choi wrote:
>>
>>> Hi,
>>> I've had a problem that has been killing for some days now.
>>> I am using CDH3 update2 version of Hadoop and Hbase.
>>> When I do a large amount of bulk loading into Hbase, some node always die.
>>> It's not just one particular node.
>>> But one of many nodes fail to serve eventually.
>>>
>>> I set 4 gigs of heap space for master, and regionservers. I monitored the
>>> process and when any node fails, it has not used all the heaps yet.
>>> So it is not a heap space problem.
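
To illustrate the key-salting option mentioned in the reply at the top of this thread, here is a minimal Python sketch of one possible scheme. It is not taken from the thread: the bucket count, the `NN-` prefix format, the choice to derive the salt from the document id, and both function names are assumptions made purely for illustration. The idea is that writes for a single date such as "20111122" get spread over several row keys instead of all landing on one row in one region.

```python
import hashlib
from typing import List

# Hypothetical bucket count; in practice it would be tuned to the cluster.
NUM_BUCKETS = 16


def salted_row_key(date: str, doc_id: str, num_buckets: int = NUM_BUCKETS) -> bytes:
    """Build a row key of the form b'NN-<date>'.

    The two-digit bucket NN is derived from a hash of the document id,
    so ids written for the same date are scattered over num_buckets rows.
    """
    digest = hashlib.md5(doc_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % num_buckets
    return f"{bucket:02d}-{date}".encode("utf-8")


def all_row_keys_for_date(date: str, num_buckets: int = NUM_BUCKETS) -> List[bytes]:
    """List every salted row key a reader must fetch to see one full date."""
    return [f"{b:02d}-{date}".encode("utf-8") for b in range(num_buckets)]


if __name__ == "__main__":
    print(salted_row_key("20111122", "example-document-id"))
    print(len(all_row_keys_for_date("20111122")))
```

The trade-off in a scheme like this is on the read side: fetching everything for one date means reading every bucket and merging the results. The salt also only helps if the table's regions are split along the prefixes, so that different buckets can be served by different region servers.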