HBase, mail # user - Re: MR job for creating splits


Re: MR job for creating splits
Bryan Beaudreault 2012-05-13, 02:07
I took a very similar approach and it worked fine for me.  Just spot-check
the regions afterward to make sure they are sorted lexicographically.  I used
ImmutableBytesWritable as my key, and Hadoop's default sorting for that type
turned out to sort lexicographically, as required.  Our HBase rows varied in
size, so instead of counting rows, we tallied up KeyValue.getLength() for
each KeyValue in a row until the total size reached a certain limit.
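
The size-based cut described above can be sketched roughly as follows.  This
is only an illustration in plain Java, without the Hadoop or HBase APIs: the
class SizeBasedSplits, the method splitKeys, and the 100-byte limit are all
hypothetical names and values, and the per-row sizes stand in for the summed
KeyValue.getLength() of each row.  It assumes the rows arrive in unsigned
lexicographic byte order, as they would at a single reducer.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

// Hypothetical helper for illustration; not part of Hadoop or HBase.
final class SizeBasedSplits {

    // Lexicographic order over unsigned bytes (Arrays.compareUnsigned needs Java 9+).
    static final Comparator<byte[]> LEXICOGRAPHIC = Arrays::compareUnsigned;

    /**
     * Walks the rows in sorted key order and returns the row keys at which a
     * new region should start.  A cut is made at the first row seen after the
     * accumulated size has reached maxRegionBytes.  The first region's start
     * key is not returned, since it is implicitly the empty key.
     */
    static List<byte[]> splitKeys(SortedMap<byte[], Long> rowSizes, long maxRegionBytes) {
        List<byte[]> splits = new ArrayList<>();
        long accumulated = 0;
        for (Map.Entry<byte[], Long> row : rowSizes.entrySet()) {
            if (accumulated >= maxRegionBytes) {
                splits.add(row.getKey()); // this row becomes a region start key
                accumulated = 0;
            }
            accumulated += row.getValue();
        }
        return splits;
    }

    private static byte[] key(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Assumed sample data: row key -> total bytes of that row's KeyValues.
        TreeMap<byte[], Long> rows = new TreeMap<>(LEXICOGRAPHIC);
        rows.put(key("d"), 120L);
        rows.put(key("a"), 40L);
        rows.put(key("c"), 10L);
        rows.put(key("e"), 30L);
        rows.put(key("b"), 70L);

        for (byte[] split : splitKeys(rows, 100L)) {
            System.out.println(new String(split, StandardCharsets.UTF_8));
        }
    }
}
```

In a real job the accumulation would live in the single reducer, which
already receives keys in sorted order, so no row count has to be passed from
the mappers.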

On Sat, May 12, 2012 at 7:21 PM, Something Something <
[EMAIL PROTECTED]> wrote:

> Hello,
>
> This is really a MapReduce question, but the output from this will be used
> to create regions for an HBase table.  Here's what I want to do:
>
> Take an input file that contains data about users.
> Sort this file by a key (which consists of a few fields from the row)
> After every x # of rows write the key.
>
>
> Here's how I was going to structure my MapReduce:
>
> public Splitter {
>
>    static int counter;
>
>    private Mapper {
>        map() {
>            Build key by concatenating fields
>            Write key
>            increment counter;
>        }
>    }
>
>    //  # of reducers will be set to 1.  My understanding is that this
>    //  will send the lines to reducer in sorted order one at a time -
>    //  is this a correct assumption?
>    private Reducer {
>         static long i;
>         reduce() {
>             static long splitSize = counter / 300;  //  300 is region size
>             if (i == 0 || i == splitSize) {
>                 Write key;  // this will be used as a 'startkey'.
>                  i = 0;
>             }
>             i++;
>         }
>    }
> }
>
> To summarize, there are 2 questions:
>
> 1)  I am passing # of rows processed by Mapper to Reducer via a static
> counter.  Would this work?  Is there a better way?
> 2)  If I set # of reducers to 1, would the lines be sent to reducer in
> sorted order one at a time?
>
> Thanks in advance for the help.
>