HBase sequential row merging in MapReduce job
Eric Czech 2012-10-19, 13:32
Hi everyone,
Is there any way to create an InputSplit for a MapReduce job (reading from an HBase table) that guarantees sequential rows with some shared key prefix will end up in the same mapper?
For example, if I have sequential keys like this:
metric1_2010, metric1_2011, metric1_2012, metric2_2011, metric2_2012, ...
I want a mapper that will definitely see all the rows with keys that start with "metric1".
Is there a way to do this?
Thank you!
Re: HBase sequential row merging in MapReduce job
Doug Meil 2012-10-19, 14:22
As long as you know your keyspace, you should be able to create your own splits. See TableInputFormatBase for the default implementation (which creates one input split per region).
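A minimal sketch of the boundary logic this suggestion implies, written in plain Java so it runs without HBase on the classpath. It takes the region start keys, snaps each one down to the start of its prefix group, and collapses duplicates, so that no group of keys sharing a prefix is divided between two splits. The class and method names, the `_` delimiter, and the sample region boundaries are all assumptions made for illustration. In a real job this logic would sit in a `getSplits` override of a `TableInputFormatBase` subclass, and it assumes ASCII keys so that Java string ordering matches HBase's byte ordering.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

// Hypothetical helper, not part of HBase: computes prefix-aligned split boundaries.
final class PrefixAlignedSplits {

    // Snap a region boundary down to the start of its prefix group,
    // e.g. "metric1_2011" -> "metric1_". Keys without the delimiter are kept as-is.
    static String snapToPrefix(String key, char delimiter) {
        int i = key.indexOf(delimiter);
        return i < 0 ? key : key.substring(0, i + 1);
    }

    // Returns sorted, distinct split start keys. "" marks the start of the table.
    static List<String> alignedBoundaries(List<String> regionStartKeys, char delimiter) {
        TreeSet<String> boundaries = new TreeSet<>();
        boundaries.add("");
        for (String key : regionStartKeys) {
            boundaries.add(snapToPrefix(key, delimiter));
        }
        return new ArrayList<>(boundaries);
    }

    // Turns boundaries into half-open [start, end) ranges; "" as end means "to end of table".
    static List<String[]> toRanges(List<String> boundaries) {
        List<String[]> ranges = new ArrayList<>();
        for (int i = 0; i < boundaries.size(); i++) {
            String end = i + 1 < boundaries.size() ? boundaries.get(i + 1) : "";
            ranges.add(new String[] {boundaries.get(i), end});
        }
        return ranges;
    }

    public static void main(String[] args) {
        // Assumed layout: "metric1" is spread over three regions.
        List<String> regionStarts =
                Arrays.asList("", "metric1_2011", "metric1_2012", "metric3_2010");
        for (String[] range : toRanges(alignedBoundaries(regionStarts, '_'))) {
            System.out.println("[" + range[0] + ", " + range[1] + ")");
        }
    }
}
```

A range produced this way can cover several regions, so the mapper that reads it may pull data from more than one region server, and the resulting splits can differ a lot in size.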
Re: HBase sequential row merging in MapReduce job
Michael Segel 2012-10-19, 14:43
Ouch...
That could get very nasty. You may end up with a lot of uneven splits.
Suppose your 'metric1' spans 3 regions, 'metric2' fits in 1 but is still in the same split as 'metric1', 'metric3' is in two regions, and 'metric4' is also in two regions, starting in the region where 'metric3' ends and continuing into a different one.
I don't think you want to split regions on a map job. You may want to consider an alternative. Something like using an identity mapper to pull the rows and then (dare I say it...) use a reducer.
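To make the reducer idea concrete, here is a plain-Java simulation of the grouping a shuffle would perform. It is not Hadoop code. It assumes one reading of the suggestion, in which the map side emits the key prefix (everything before an assumed `_` delimiter) as the output key, so that each reduce call receives every row for one metric. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical simulation of map + shuffle; a real job would use Hadoop's Mapper and Reducer.
final class PrefixShuffleSketch {

    // What the map side would emit as its output key: the part before the delimiter.
    static String prefixOf(String rowKey, char delimiter) {
        int i = rowKey.indexOf(delimiter);
        return i < 0 ? rowKey : rowKey.substring(0, i);
    }

    // Groups row keys by prefix, which is what each reduce call would receive.
    static Map<String, List<String>> groupByPrefix(List<String> rowKeys, char delimiter) {
        Map<String, List<String>> groups = new TreeMap<>();
        for (String rowKey : rowKeys) {
            groups.computeIfAbsent(prefixOf(rowKey, delimiter), k -> new ArrayList<>())
                  .add(rowKey);
        }
        return groups;
    }

    public static void main(String[] args) {
        // Input order does not matter: rows may arrive from any mapper.
        List<String> rows = Arrays.asList(
                "metric2_2011", "metric1_2010", "metric1_2012", "metric2_2012", "metric1_2011");
        groupByPrefix(rows, '_')
                .forEach((prefix, keys) -> System.out.println(prefix + " -> " + keys));
    }
}
```

The grouping holds no matter which mapper read which region, which is why this route does not need custom splits. The cost is the shuffle itself.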
Another alternative is to think about using an inverted table where there is a row for each 'metricX' and a column for each rowkey.
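As a rough illustration of that layout, the sketch below reshapes "tall" row keys such as `metric1_2010` into one wide row per metric, modelled with nested sorted maps rather than a real HBase table. Using the part after the delimiter as the column name is an assumption made here, as are the class name, the `_` delimiter, and the sample values.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

// Hypothetical model of a wide-row layout: row = metric, column = rest of the old row key.
final class WideRowSketch {

    static SortedMap<String, SortedMap<String, String>> toWideRows(
            Map<String, String> tallRows, char delimiter) {
        SortedMap<String, SortedMap<String, String>> wide = new TreeMap<>();
        for (Map.Entry<String, String> entry : tallRows.entrySet()) {
            String key = entry.getKey();
            int i = key.indexOf(delimiter);
            String row = i < 0 ? key : key.substring(0, i);       // e.g. "metric1"
            String column = i < 0 ? "" : key.substring(i + 1);    // e.g. "2010"
            wide.computeIfAbsent(row, k -> new TreeMap<>()).put(column, entry.getValue());
        }
        return wide;
    }

    public static void main(String[] args) {
        Map<String, String> tall = new LinkedHashMap<>();
        tall.put("metric1_2010", "a");
        tall.put("metric1_2011", "b");
        tall.put("metric2_2011", "c");
        toWideRows(tall, '_')
                .forEach((row, columns) -> System.out.println(row + " -> " + columns));
    }
}
```

Because an HBase row always lives in a single region, a mapper that reads a row in this layout sees every observation for that metric at once. The trade-off is that each row grows with the number of observations.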
Just some food for thought.
HTH
-Mike
Re: HBase sequential row merging in MapReduce job
Eric Czech 2012-10-19, 16:25
Well, it looks like I might be able to make it work in TableInputFormatBase if I parse the start and end keys and add the logic there (thanks Doug).
I definitely want to avoid the reduce step, and since I am storing time series data, I can probably just live without putting any part of the date in the key. We would still never have more than 50k observations per row that way, but I hate leaving open the possibility of a row growing too large.
Thanks Michael.