HBase, mail # user - [ANN]: HBaseWD: Distribute Sequential Writes in HBase

Alex Baranau 2011-04-19, 17:25
Hello guys,

I'd like to introduce a new small Java project/lib around HBase: HBaseWD. It
aims to help distribute the write load across region servers when writing
records with sequential row keys (sequential because of the row key nature).
It implements the solution which was discussed several times on this mailing
list (e.g. here: http://search-hadoop.com/m/gNRA82No5Wk).

Please find the sources at https://github.com/sematext/HBaseWD (there's also
a jar of the current version for convenience). It is very easy to make use
of it: e.g. I added it to one existing project with 1+2 lines of code (one
where I write to HBase and two for configuring the MapReduce job).

Any feedback is highly appreciated!

Please find below the short intro to the lib [1].

Alex Baranau
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch - Hadoop - HBase


HBaseWD stands for Distributing (sequential) Writes. It was inspired by
discussions on the HBase mailing lists around the problem of choosing between:
* writing records with sequential row keys (e.g. time-series data with row
  keys built from a timestamp)
* using random unique IDs for records

The first approach makes it possible to perform fast range scans with the
help of start/stop keys on a Scanner, but creates a single-region-server
hot-spotting problem when writing data (as row keys arrive in sequence, all
records are written into a single region at a time).

The second approach aims for the fastest write performance by distributing
new records over random regions, but makes fast range scans over the
written data impossible.

The suggested approach sits between the two above and proved to perform
well: it distributes records over the cluster during writes while still
allowing range scans over the data. HBaseWD provides a very simple API,
which makes it easy to use with existing code.

Please refer to the unit-tests for lib usage info, as they are meant to act
as usage examples.

Brief Usage Info (Examples):

Distributing records with sequential keys, written into up to
Byte.MAX_VALUE buckets:

    byte bucketsCount = (byte) 32; // distributing into 32 buckets
    RowKeyDistributor keyDistributor =
        new RowKeyDistributorByOneBytePrefix(bucketsCount);
    for (int i = 0; i < 100; i++) {
      Put put = new Put(keyDistributor.getDistributedKey(originalKey));
      ... // add values
      hTable.put(put);
    }
Performing a range scan over the written data (internally <bucketsCount>
scans are performed):

    Scan scan = new Scan(startKey, stopKey);
    ResultScanner rs = DistributedScanner.create(hTable, scan, keyDistributor);
    for (Result current : rs) {
      ... // process the result
    }
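The per-bucket scans above can be presented to the caller as one key-ordered
stream by merging their results. The following standalone sketch shows that
idea as a k-way merge over plain sorted lists of keys (real HBase scanners
are replaced by lists purely for illustration):

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Standalone sketch: merging several individually sorted per-bucket result
// streams (here, plain sorted key lists) into one globally sorted stream --
// the idea behind running <bucketsCount> scans but exposing one iterator.
public class MergeScansSketch {
    static List<String> merge(List<List<String>> buckets) {
        // Each queue entry is {bucketIndex, positionInBucket}, ordered by
        // the key it currently points at.
        PriorityQueue<int[]> pq = new PriorityQueue<>(
            Comparator.comparing((int[] e) -> buckets.get(e[0]).get(e[1])));
        for (int b = 0; b < buckets.size(); b++) {
            if (!buckets.get(b).isEmpty()) pq.add(new int[] {b, 0});
        }
        List<String> merged = new ArrayList<>();
        while (!pq.isEmpty()) {
            int[] e = pq.poll();
            merged.add(buckets.get(e[0]).get(e[1]));
            if (e[1] + 1 < buckets.get(e[0]).size()) {
                pq.add(new int[] {e[0], e[1] + 1});
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        List<List<String>> buckets = List.of(
            List.of("a1", "c1"), List.of("b1", "d1"), List.of("a2"));
        System.out.println(merge(buckets)); // [a1, a2, b1, c1, d1]
    }
}
```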

Performing mapreduce job over written data chunk specified by Scan:

    Configuration conf = HBaseConfiguration.create();
    Job job = new Job(conf, "testMapreduceJob");

    Scan scan = new Scan(startKey, stopKey);

    TableMapReduceUtil.initTableMapperJob("table", scan,
      RowCounterMapper.class, ImmutableBytesWritable.class, Result.class,
      job);

    // Substituting standard TableInputFormat which was set in
    // TableMapReduceUtil.initTableMapperJob(...)
    job.setInputFormatClass(WdTableInputFormat.class);
    keyDistributor.addInfo(job.getConfiguration());

Extending Row Keys Distributing Patterns:

HBaseWD is designed to be flexible and to support custom row key
distribution approaches. To define custom row key distribution logic, just
extend the AbstractRowKeyDistributor abstract class, which is really very
simple:

    public abstract class AbstractRowKeyDistributor implements Parametrizable {
      public abstract byte[] getDistributedKey(byte[] originalKey);
      public abstract byte[] getOriginalKey(byte[] adjustedKey);
      public abstract byte[][] getAllDistributedKeys(byte[] originalKey);
      ... // some utility methods
    }