Home | About | Sematext search-lucene.com search-hadoop.com
 Search Hadoop and all its subprojects:

Switch to Threaded View
HBase, mail # user - Controlling TableMapReduceUtil table split points


Copy link to this message
-
Re: Controlling TableMapReduceUtil table split points
Ted Yu 2013-01-06, 16:11
David:
The determination of splits is done in TableInputFormatBase.getSplits()
where table.getStartEndKeys() is called to get the boundaries of regions.
You can take a look and see how you can customize the splits.

bq. how do I make sure that a user's history is not split across several
Mapper instances?

If events for one user are processed by a single mapper, I think you would
continue to see timeouts in your map/reduce job.

Cheers

On Sun, Jan 6, 2013 at 4:37 AM, David Koch <[EMAIL PROTECTED]> wrote:

> Hello,
>
> Is there a way to override the method which is used by TableMapReduceUtil
> to split a HBase table across several mapper instances when running a Map
> Reduce over it?
>
> Our current data model is:
>
> Row-key: <user_id>
> Qualifier: <timestamp>
> Value: <event>
>
> The distribution of "events per <user_id>" is long-tailed. Some users have
> in the order of 10^6 events, whereas the mean is somewhat around 100.
>
> These occasional big rows cause problems when running M/R jobs over the
> table (timeouts, etc.). Also, according to the HBase book they are
> detrimental to the region splitting process. We currently use server-side
> filters to not retrieve these large rows but it's not a nice solution.
>
> To overcome this, I was thinking about using a "tall and narrow" key design
> (term from HBase book), meaning in our case that rowkeys are formed by a
> composite of <user_id>_<tmst>.
>
> However, and hence my original question, how would I process this table
> "user-wise" in a HBase M/R job? Specifically, how do I make sure that a
> user's history is not split across several Mapper instances?
>
> Thank you,
>
> /David
>