HBase, mail # user - Hbase sequential row merging in MapReduce job


Re: Hbase sequential row merging in MapReduce job
Michael Segel 2012-10-19, 14:43
Ouch...

That could get very nasty. You may end up with a lot of uneven splits.

Suppose your 'metric1' spans 3 regions; 'metric2' fits in 1 region, but it's still in the same split as the tail of 'metric1'; 'metric3' is in two regions; and 'metric4' is also in two regions, starting in the region that holds the end of 'metric3' and continuing into a different region.
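
To make the unevenness concrete, here is a rough sketch in plain Java (no HBase classes). It assumes row keys look like "metricX_suffix" with no '_' inside the metric name, and that the region start keys arrive sorted. The class name, the helper names and the region boundaries in main are all made up for illustration. It rounds each region start key down to its metric prefix and drops duplicates, so that no metric straddles a split boundary.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class PrefixAlignedSplits {
    // Hypothetical helper: the part of a row key before the first '_'.
    static String prefixOf(String rowKey) {
        int i = rowKey.indexOf('_');
        return i < 0 ? rowKey : rowKey.substring(0, i);
    }

    // Rounds every (sorted) region start key down to its metric prefix and
    // drops consecutive duplicates. Each returned key starts one split.
    static List<String> alignedStartKeys(List<String> sortedRegionStartKeys) {
        List<String> out = new ArrayList<>();
        for (String start : sortedRegionStartKeys) {
            String aligned = prefixOf(start);
            if (out.isEmpty() || !out.get(out.size() - 1).equals(aligned)) {
                out.add(aligned);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Invented region start keys that mimic the layout described above.
        List<String> regions = Arrays.asList(
            "", "metric1_2005", "metric1_2010",
            "metric3_2000", "metric3_2011", "metric4_2012");
        System.out.println(alignedStartKeys(regions));
    }
}
```

With these invented boundaries, six regions collapse into four splits, and the split that starts at "metric1" covers three regions on its own. In a real job the same idea would have to live in a custom input format, which is not shown here.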

I don't think you want to split regions on a map job.
You may want to consider an alternative.
Something like using an identity mapper to pull the rows and then (dare I say it...) use a reducer.
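
One possible reading of this, sketched in plain Java without the Hadoop API: the map side emits the metric prefix as the key, and the shuffle then hands each reducer call every row for one metric. Strictly speaking that is no longer an identity mapper; keeping a true identity mapper would need a custom partitioner and grouping comparator instead. The TreeMap below only simulates the shuffle, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.SortedMap;
import java.util.TreeMap;

class PrefixGrouping {
    // Hypothetical helper: the part of a row key before the first '_'.
    static String prefixOf(String rowKey) {
        int i = rowKey.indexOf('_');
        return i < 0 ? rowKey : rowKey.substring(0, i);
    }

    // "Map" step: key each row by its metric prefix.
    // "Shuffle" step (simulated): collect the rows that share a key, so the
    // "reduce" step would see one metric with all of its rows.
    static SortedMap<String, List<String>> groupByPrefix(List<String> rowKeys) {
        SortedMap<String, List<String>> groups = new TreeMap<>();
        for (String rowKey : rowKeys) {
            groups.computeIfAbsent(prefixOf(rowKey), k -> new ArrayList<>()).add(rowKey);
        }
        return groups;
    }

    public static void main(String[] args) {
        List<String> rows = Arrays.asList(
            "metric1_2010", "metric1_2011", "metric1_2012",
            "metric2_2011", "metric2_2012");
        // Each entry corresponds to one reduce call in this simulation.
        groupByPrefix(rows).forEach((metric, keys) ->
            System.out.println(metric + " -> " + keys));
    }
}
```

The trade-off is that every row crosses the network in the shuffle, but the grouping no longer depends on where the region boundaries fall.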

Another alternative is to think about using an inverted table where there is a row for each 'metricX' and a column for each rowkey.
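
As a rough picture of that layout, the sketch below uses nested sorted maps in plain Java as a stand-in for the table: the outer key plays the role of the new row key ('metricX') and the inner key plays the role of the column qualifier (the original row key). The class name, the method name and the single string value per cell are assumptions for illustration only.

```java
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

class InvertedTableSketch {
    // Turns rows keyed "metricX_suffix" into one wide row per metric,
    // with the original row key used as the column qualifier.
    static SortedMap<String, SortedMap<String, String>> invert(Map<String, String> rows) {
        SortedMap<String, SortedMap<String, String>> wide = new TreeMap<>();
        for (Map.Entry<String, String> e : rows.entrySet()) {
            String rowKey = e.getKey();
            int i = rowKey.indexOf('_');
            String metric = i < 0 ? rowKey : rowKey.substring(0, i);
            wide.computeIfAbsent(metric, k -> new TreeMap<>()).put(rowKey, e.getValue());
        }
        return wide;
    }

    public static void main(String[] args) {
        Map<String, String> rows = new TreeMap<>();
        rows.put("metric1_2010", "a");
        rows.put("metric1_2011", "b");
        rows.put("metric2_2011", "c");
        System.out.println(invert(rows));
    }
}
```

Because a single HBase row always lives in one region, a mapper reading the inverted table gets all of a metric's data in one call, at the cost of potentially very wide rows.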

Just some food for thought.

HTH

-Mike

On Oct 19, 2012, at 9:22 AM, Doug Meil <[EMAIL PROTECTED]> wrote:

>
> As long as you know your keyspace, you should be able to create your own
> splits.  See TableInputFormatBase for the default implementation (which is
> 1 input split per region).
>
> On 10/19/12 9:32 AM, "Eric Czech" <[EMAIL PROTECTED]> wrote:
>
>> Hi everyone,
>>
>> Is there any way to create an InputSplit for a MapReduce job (reading from
>> an HBase table) that guarantees sequential rows with some shared key
>> prefix will end up in the same mapper?
>>
>> For example, if I have sequential keys like this:
>>
>> metric1_2010,
>> metric1_2011,
>> metric1_2012,
>> metric2_2011,
>> metric2_2012,
>> ...
>>
>> I want a mapper that will definitely see all the rows with keys that start
>> with "metric1".
>>
>> Is there a way to do this?
>>
>> Thank you!