HBase, mail # user - using date as key

Re: using date as key
Cosmin Lehene 2011-03-28, 07:04

If you already know the key distribution you can create all the regions in advance.
Are you inserting a single day or multiple days?
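
To illustrate what "create all the regions in advance" could look like, here is a minimal sketch in Python that picks region boundaries from a sample of row keys. The function name split_keys, the sample data and the date_key layout with a zero-padded numeric key are all assumptions for illustration; the boundaries it returns would still have to be handed to whatever mechanism you use to pre-create the regions.

```python
def split_keys(sample_keys, num_regions):
    """Return up to num_regions - 1 boundary keys that cut the sorted
    sample into roughly equal-sized ranges (hypothetical helper)."""
    if num_regions < 1:
        raise ValueError("num_regions must be at least 1")
    keys = sorted(sample_keys)
    if not keys:
        return []
    boundaries = []
    for i in range(1, num_regions):
        idx = i * len(keys) // num_regions
        candidate = keys[idx]
        # Skip the very first key (it would create an empty first region)
        # and skip repeats, which happen when the sample is small.
        if idx > 0 and (not boundaries or candidate != boundaries[-1]):
            boundaries.append(candidate)
    return boundaries


# Assumed sample: one day of row keys in date_key form.
sample = ["20110327_%04d" % i for i in range(100)]
print(split_keys(sample, 4))
```

This only works as well as the sample reflects the real distribution, which is why knowing the distribution up front matters.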

5X is a good improvement. Here are some more hints:

Hadoop sorts the reduce keys before the actual reduce phase. This means that if your keys start with the date, all reducers will work through consecutive days in the same order, so they all insert into the same regions at the same time. If you need to avoid hot regions and the key component of your date_key is evenly distributed among days, you can emit key_date from the mappers instead of date_key and then reassemble the row key correctly in the reducers. This way you'll have an even distribution of inserts across your pre-created regions.
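
A small sketch of the key swap, in plain Python rather than actual MapReduce code. It assumes the separator is "_" and that the date part itself contains no separator (e.g. 20110327); the function names to_shuffle_key and to_row_key are made up for illustration. Sorting stands in for the shuffle sort that Hadoop does before the reduce phase.

```python
SEP = "_"  # assumed separator between the date and the key component


def to_shuffle_key(row_key):
    """Mapper side: turn date_key into key_date."""
    date, key = row_key.split(SEP, 1)  # date comes first, split once
    return key + SEP + date


def to_row_key(shuffle_key):
    """Reducer side: turn key_date back into date_key."""
    key, date = shuffle_key.rsplit(SEP, 1)  # date comes last, split once
    return date + SEP + key


rows = ["20110326_a", "20110326_b", "20110327_a", "20110327_b"]

# Sorted as date_key: every row of the first day comes before the second day.
order_by_date = sorted(rows)

# Sorted as key_date, then reassembled: consecutive inserts alternate
# between days, so they land on different pre-created regions.
order_by_key = [to_row_key(k) for k in sorted(to_shuffle_key(r) for r in rows)]

print(order_by_date)
print(order_by_key)
```

The rows that end up in HBase are identical in both cases; only the order in which the reducers write them changes.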


On Mar 27, 2011, at 8:00 PM, Lior Schachter wrote:

> Hi,
> Last week I consulted the forum about HBase insertion optimization when the
> key format is date_key.
> This key format is very good for efficient scans but creates a hotspot on a
> single region when inserting millions of rows.
> I would like to share and get feedback on the solution we found:
> 1. Insert one day. After the regions split, see the start-end row of each server
> (this is done once to see the key distribution).
> 2. Now, before inserting a day, programmatically create empty regions with
> the start-end keys from step 1 (by creating rows in the meta table).
> Assuming the row key distribution of a day does not change dramatically, the
> reducers can insert into multiple regions (thus avoiding hotspotting).
> Applying this method improved insert performance by a factor of 5 or so.
> Lior