Re: using date as key
We insert a single day (about 10 million rows), but also support inserting
consecutive days.

We actually add the date to the key only in the reducer phase (the date
comes from the configuration), so our mappers emit the key only.
I wonder whether using the TotalOrderPartitioner would give us some further
improvement. Will test it soon.
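
A minimal sketch of the idea, in plain Python rather than Hadoop code, assuming made-up split points and keys: a range partitioner in the spirit of TotalOrderPartitioner sends each mapper key to the reducer whose key range contains it, and the reducer then prepends the configured date. The names range_partition and make_row_key are hypothetical, not Hadoop or HBase APIs.

```python
from bisect import bisect_right


def range_partition(key, split_points):
    """Return the index of the partition whose key range contains `key`.

    `split_points` must be sorted; n split points define n + 1 partitions.
    A key equal to a split point falls into the partition that starts there.
    """
    return bisect_right(split_points, key)


def make_row_key(date, key):
    """Reducer-side step: prepend the configured date to the mapper's key."""
    return f"{date}_{key}"


# Hypothetical values, for illustration only.
split_points = ["g", "n", "t"]  # 4 partitions: [..g), [g..n), [n..t), [t..]
mapper_keys = ["alpha", "golf", "mike", "november", "zulu"]
date = "20110328"  # in the real job this would come from the configuration

for key in mapper_keys:
    reducer = range_partition(key, split_points)
    print(reducer, make_row_key(date, key))
```

If the split points match the boundaries of the pre-created regions (with the date prefix stripped), each reducer would write into its own contiguous key range instead of all reducers hitting the same region.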

On Mon, Mar 28, 2011 at 9:04 AM, Cosmin Lehene <[EMAIL PROTECTED]> wrote:

> Lior,
> If you already know the key distribution you can create all the regions in
> advance.
> Are you inserting a single day or multiple days?
> 5X is a good improvement. Here are some more hints:
> Hadoop sorts the reduce keys before the actual reduce phase. This
> means that if your keys start with the date you'll get all reducers
> inserting for consecutive days. If you need to avoid hot regions and the key
> component of your date_key is evenly distributed among days, then you can
> emit key_date from mappers instead of date_key and then reassemble them
> correctly in reducers. This way you'll have an even distribution of inserts
> on your pre-created regions.
> Cosmin
> On Mar 27, 2011, at 8:00 PM, Lior Schachter wrote:
> > Hi,
> > Last week I consulted the forum about HBase insertion optimization when the
> > key format is: date_key.
> > This key format is very good for efficient scans but creates a hotspot on a
> > single region when inserting millions of rows.
> >
> > I would like to share and get feedback on the solution we found:
> > 1. Insert one day. After the regions split, note the start-end row of each
> > server (this is done once to see the key distribution).
> > 2. Now, before inserting a day, programmatically create empty regions with
> > the start-end keys from step 1 (by creating rows in the meta table).
> > Assuming the row-key distribution of a day does not change dramatically, the
> > reducers can insert into multiple regions (thus avoiding hotspotting).
> >
> > Applying this method improved insert performance by a factor of 5 or so.
> >
> > Lior
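
The two steps in the original message can be sketched roughly as follows, assuming row keys of the form date_key with a fixed-width date and assuming the region start keys seen after loading one day have already been collected into a list. The function name and sample keys are hypothetical. How the empty regions are then created (the thread describes writing rows into the meta table) is left out.

```python
def split_keys_for_day(observed_start_keys, old_date, new_date):
    """Re-stamp region start keys observed for `old_date` with `new_date`.

    Keys that do not belong to `old_date` (for example the empty start key
    of the table's first region) are skipped.
    """
    prefix = old_date + "_"
    splits = []
    for start in observed_start_keys:
        if not start.startswith(prefix):
            continue
        splits.append(new_date + "_" + start[len(prefix):])
    return sorted(splits)


# Hypothetical region start keys recorded after inserting 2011-03-27.
observed = ["", "20110327_g", "20110327_n", "20110327_t"]

for split in split_keys_for_day(observed, "20110327", "20110328"):
    print(split)
```

This only holds up under the assumption stated in the message, namely that the key distribution within a day does not change dramatically from one day to the next.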