HBase >> mail # user >> using date as key


Re: using date as key
Lior,

If you already know the key distribution you can create all the regions in advance.
Are you inserting a single day or multiple days?
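
If the key distribution is known from a sample (for example one day's worth of row keys), the region boundaries can be derived from it. The following is only a minimal sketch in plain Python: the function name `split_points`, the sample keys, and the region count are all hypothetical, and it does not call HBase at all. The resulting boundary keys would still have to be handed to whatever region pre-creation mechanism you use.

```python
def split_points(sample_keys, num_regions):
    """Pick up to num_regions - 1 boundary keys that cut the sorted sample into equal parts."""
    keys = sorted(sample_keys)
    if num_regions < 2 or not keys:
        return []
    # Boundary i sits at the i/num_regions quantile of the sorted sample.
    points = [keys[(i * len(keys)) // num_regions] for i in range(1, num_regions)]
    # Drop duplicate boundaries (possible with small or skewed samples).
    return sorted(set(points))


# Hypothetical sample: one day's row keys in date_key form.
sample = ["20110327_user%04d" % n for n in range(1000)]
splits = split_points(sample, 4)
```

With an evenly spread sample like this one, each of the four resulting ranges covers about a quarter of the rows. A skewed sample would simply move the boundaries toward the dense part of the key space.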

5X is a good improvement. Here are some more hints:

Hadoop sorts the reduce keys before the actual reduce phase. This means that if your keys start with the date, you'll get all reducers inserting for consecutive days. If you need to avoid hot regions and the key component of your date_key is evenly distributed among days, you can emit key_date from the mappers instead of date_key and then reassemble it as date_key in the reducers. This way you'll have an even distribution of inserts across your pre-created regions.
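
The key swap described above can be sketched roughly as follows. This is plain Python that only imitates the sort step, not actual MapReduce or HBase code. The `_` separator, the function names, and the sample rows are assumptions made for illustration.

```python
SEP = "_"  # assumed separator; the date part must not contain it


def to_shuffle_key(row_key):
    """date_key -> key_date, as the mapper would emit it."""
    date, key = row_key.split(SEP, 1)
    return key + SEP + date


def to_row_key(shuffle_key):
    """key_date -> date_key, rebuilt in the reducer just before the insert."""
    key, date = shuffle_key.rsplit(SEP, 1)
    return date + SEP + key


# Hypothetical rows covering two days.
rows = ["20110326_a", "20110326_b", "20110327_a", "20110327_b"]

# The framework sorts on the emitted key, so the order is driven by the key part.
shuffled = sorted(to_shuffle_key(r) for r in rows)

# The reducer restores the real row key; inserts now alternate between days.
inserts = [to_row_key(k) for k in shuffled]
```

Sorting on date_key directly would instead produce every row of 20110326 before any row of 20110327, which is the pattern that concentrates inserts on one day's regions at a time.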

Cosmin

On Mar 27, 2011, at 8:00 PM, Lior Schachter wrote:

> Hi,
> Last week I consulted the forum about HBase insertion optimization when the
> key format is: date_key.
> This key format is very good for efficient scans but creates a hotspot on a
> single region when inserting millions of rows.
>
> I would like to share and get feedback on the solution we found:
> 1. Insert one day. After the regions split, see the start-end row of each server
> (this is done once to see the key distribution).
> 2. Now, before inserting a day, programmatically create empty regions with
> the start-end keys from step 1 (by creating rows in the meta-table).
> Assuming the row-key distribution of a day does not change dramatically, the
> reducers can insert to multiple regions (thus avoiding hotspotting).
>
> Applying this method improved insert performance by a factor of 5 or so.
>
> Lior