Re: MR job for creating splits
I took a very similar approach and it worked fine for me.  Just spot-check
the regions afterward to make sure they're lexicographically sorted.  I used
ImmutableBytesWritable as my key, and the default Hadoop sorting for that
turned out to sort lexicographically, as required.  Our HBase rows varied in
size, so instead of counting the number of rows, we tallied up
KeyValue.getLength() for each KeyValue in a row until the size reached a
certain limit.
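
In case a sketch helps, below is roughly what that looks like.  This is a
hedged sketch, not our exact code: the class names are just for illustration,
it assumes the input is an existing HBase table read through TableMapper
(your input format may differ), and SPLIT_SIZE_BYTES is a made-up threshold
you'd tune for your table.  Run it with a single reducer so the keys arrive
in sorted order.

import java.io.IOException;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Reducer;

public class SplitCalculator {

  // Mapper: sum KeyValue.getLength() over each row and emit (rowKey, rowSize).
  public static class RowSizeMapper
      extends TableMapper<ImmutableBytesWritable, LongWritable> {
    @Override
    protected void map(ImmutableBytesWritable rowKey, Result row, Context context)
        throws IOException, InterruptedException {
      long rowSize = 0;
      for (KeyValue kv : row.raw()) {
        rowSize += kv.getLength();
      }
      context.write(rowKey, new LongWritable(rowSize));
    }
  }

  // Reducer (exactly one): keys arrive lexicographically sorted, so each key
  // written out is a valid region start key.  Tally row sizes and emit a
  // split point whenever the running total crosses the limit.
  public static class SplitPointReducer
      extends Reducer<ImmutableBytesWritable, LongWritable,
                      ImmutableBytesWritable, NullWritable> {

    private static final long SPLIT_SIZE_BYTES = 1L << 30;  // made-up: ~1 GB per region
    private long bytesSinceLastSplit = 0;

    @Override
    protected void reduce(ImmutableBytesWritable rowKey,
        Iterable<LongWritable> rowSizes, Context context)
        throws IOException, InterruptedException {
      for (LongWritable size : rowSizes) {
        bytesSinceLastSplit += size.get();
      }
      if (bytesSinceLastSplit >= SPLIT_SIZE_BYTES) {
        context.write(rowKey, NullWritable.get());  // next region's start key
        bytesSinceLastSplit = 0;
      }
    }
  }
}

The reducer's output is then your ordered list of split keys, which you can
pass to HBaseAdmin.createTable(descriptor, splitKeys) to pre-create the
regions.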

On Sat, May 12, 2012 at 7:21 PM, Something Something <
[EMAIL PROTECTED]> wrote:

> Hello,
>
> This is really a MapReduce question, but the output from this will be used
> to create regions for an HBase table.  Here's what I want to do:
>
> Take an input file that contains data about users.
> Sort this file by a key (which consists of a few fields from the row).
> After every x # of rows, write the key.
>
>
> Here's how I was going to structure my MapReduce:
>
> public class Splitter {
>
>    static int counter;
>
>    private Mapper {
>        map() {
>            Build key by concatenating fields
>            Write key
>            increment counter;
>        }
>    }
>
>    //  # of reducers will be set to 1.  My understanding is that this will
> send the lines to the reducer in sorted order, one at a time - is this a
> correct assumption?
>    private Reducer {
>         static long i;
>         reduce() {
>             static long splitSize = counter / 300;  //  300 is the desired # of regions
>             if (i == 0 || i == splitSize) {
>                 Write key;  // this will be used as a 'startkey'.
>                 i = 0;
>             }
>             i++;
>         }
>    }
> }
>
> To summarize, there are 2 questions:
>
> 1)  I am passing the # of rows processed by the Mapper to the Reducer via a
> static counter.  Would this work?  Is there a better way?
> 2)  If I set the # of reducers to 1, would the lines be sent to the reducer
> in sorted order, one at a time?
>
> Thanks in advance for the help.
>