There is not much you can do on the HBase side; too much is simply too much. In the past I have lowered the number of slots per MR node so that fewer threads hit HBase at once. Sorry that I misread the part about the keys already being hashed. In that case, all you can try is bulk loading, which gives much better performance when you load a large amount of data in one go. If you have to trickle data in, it will not help. But if you have a job that needs to complete, and part of that job is inserting data into HBase, you could just as well output HFiles and then bulk load them (which is very fast).
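For the trickle case, one way to slow the writers down from the client side is to pace the puts yourself. What follows is only a minimal sketch in Python, not HBase client code: `Throttle`, `max_ops_per_sec`, `throttled_write` and `write_batch` are hypothetical names made up for illustration, and the cap applies to each writer, not to each region.

```python
import time


class Throttle:
    """Caps the average operation rate of one writer by sleeping between batches."""

    def __init__(self, max_ops_per_sec, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / max_ops_per_sec  # seconds budgeted per operation
        self.clock = clock
        self.sleep = sleep
        self.next_allowed = clock()  # earliest time the next batch may start

    def acquire(self, n=1):
        """Wait until a batch of n operations may start; return the seconds slept."""
        now = self.clock()
        wait = max(0.0, self.next_allowed - now)
        if wait > 0:
            self.sleep(wait)
        start = max(now, self.next_allowed)
        # Reserve the time budget that this batch consumes.
        self.next_allowed = start + n * self.min_interval
        return wait


def throttled_write(batches, write_batch, throttle):
    """Send each batch through write_batch (hypothetical), pacing with throttle."""
    for batch in batches:
        throttle.acquire(len(batch))
        write_batch(batch)
```

In a reducer you would create one `Throttle` per task and call `acquire` before each batch of puts. With, say, `Throttle(5000)` and 40 concurrent reduce tasks, the cluster would still see up to 200,000 puts per second in total, so the number of slots and the per-task cap have to be chosen together.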
On Dec 1, 2011, at 2:58 PM, edward choi wrote:
> Thanks Lars,
> I am already familiar with the sequential key problem.
> That is why I am using a hash-generated random string as the document id.
> But I guess I was still pushing the cluster too much.
> Maybe I am inserting tweet documents too fast?
> Since a single tweet is only 140 bytes, puts are performed really fast.
> So I am guessing maybe random keys alone are not cutting it..?
> I am counting 20,000 requests per region when I perform mapreduce loading.
> Is that too much to handle?
> Is there a way to deliberately slow down the input process?
> I am reading from a 21-node HDFS cluster and writing to a 21-node HBase
> cluster, so the process speed and the sheer volume of data transaction is
> Can I set a limit on the requests per region? Say, a maximum of 5000 requests?
> I really want to know just how far I can push HBase.
> But I guess developers would say everything depends on the use case.
> I thought about using the bulk loading feature but I kinda got lazy and just
> went with the random string rowid.
> If parameter meddling doesn't pan out, I'll have no choice but to try
> the bulk loading feature.
> Thanks for the reply.
> 2011/12/1 Lars George <[EMAIL PROTECTED]>
>> Hi Ed,
>> Without having looked at the logs, this sounds like the common case of
>> overloading a single region due to your sequential row keys. Either hash
>> the keys, or salt them - but the best bet here is to use the bulk loading
>> feature of HBase (http://hbase.apache.org/bulk-loads.html). That bypasses
>> this problem and lets you continue to use sequential keys.
>> On Dec 1, 2011, at 12:21 PM, edward choi wrote:
>>> Hi Lars,
>>> Okay here goes some details.
>>> There are 21 tasktrackers/datanodes/regionservers
>>> There is one Jobtracker/namenode/master
>>> Three zookeepers.
>>> There are about 200 million tweets in HBase.
>>> My mapreduce code aggregates tweets by the date they were generated.
>>> So in the map stage, I write out the tweet date as the key, and the document id as
>>> the value (the document id is randomly generated by a hash algorithm).
>>> In the reduce stage, I put the data into a table. The key (which is the
>>> tweet date) is the table rowid, and the values (which are document ids) are the
>>> column values.
>>> Now, the map stage is fine. I get to 100% map. But during the reduce stage, one of
>>> my regionservers fails.
>>> I don't know what the exact symptom is. I just get:
>>>> 1 action: servers with issues: lp171.etri.re.kr:60020,
>>> About "some node always die" <== scratch this.
>>> To be precise,
>>> I narrowed down the range of data that I wanted to process.
>>> I tried to put only the tweets that were generated on 2011/11/22.
>>> Now the reduce code will produce a row with "20111122" as the rowid, and
>>> a bunch of document ids as the column value. (I use a 32-byte string as the
>>> document id. I append 1000 document ids to a single column.)
>>> So the region that my data will be inserted into will have "20111122" between
>>> the Start Key and End Key.
>>> The regionserver that contains that specific region fails. That is the
>>> point. If I move that region to another regionserver using hbase shell,
>>> then that regionserver fails.