HBase >> mail # user >> Controlling TableMapReduceUtil table split points


David Koch 2013-01-06, 12:37
Re: Controlling TableMapReduceUtil table split points
David:
The determination of splits is done in TableInputFormatBase.getSplits()
where table.getStartEndKeys() is called to get the boundaries of regions.
You can take a look and see how you can customize the splits.

> how do I make sure that a user's history is not split across several
> Mapper instances?

If all events for one user are processed by a single mapper, I think you
would continue to see timeouts in your map/reduce job, since a user with
around 10^6 events would still land in one map task.
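
As a rough illustration of what customizing the splits could look like for
the <user_id>_<tmst> key design from the original message: the sketch below
takes the sorted region start keys (the kind of boundaries that
table.getStartEndKeys() reports) and moves each boundary back to the start of
the user it falls in, so that no user's rows straddle two splits. The class
and method names are hypothetical, keys are modeled as ASCII strings rather
than byte[], and user ids are assumed never to contain the '_' delimiter. A
real override of TableInputFormatBase.getSplits() would still have to turn
these ranges into input splits. Note that a very large user still ends up in
a single split, so this does not address the timeout concern above, and the
aligned splits may no longer match region boundaries exactly.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper, not part of HBase: aligns split boundaries to users. */
final class UserAlignedSplits {
    // Assumption: rowkeys look like "<user_id>_<tmst>" and user ids contain no '_'.
    static final char DELIM = '_';

    /** Truncates a boundary key to "<user_id>_", the smallest possible key of that user. */
    static String alignToUser(String boundaryKey) {
        int i = boundaryKey.indexOf(DELIM);
        // Keys without a delimiter (such as the empty table-start key) are kept as-is.
        return i < 0 ? boundaryKey : boundaryKey.substring(0, i + 1);
    }

    /**
     * Takes the sorted region start keys (the first one is "" for the table start)
     * and returns [start, end) ranges whose boundaries fall between users.
     * An empty end key means "until the end of the table".
     */
    static List<String[]> alignedRanges(List<String> regionStartKeys) {
        List<String> starts = new ArrayList<>();
        for (String key : regionStartKeys) {
            String aligned = alignToUser(key);
            // Regions that begin inside the same user collapse into one boundary.
            if (starts.isEmpty() || !starts.get(starts.size() - 1).equals(aligned)) {
                starts.add(aligned);
            }
        }
        List<String[]> ranges = new ArrayList<>();
        for (int i = 0; i < starts.size(); i++) {
            String end = (i + 1 < starts.size()) ? starts.get(i + 1) : "";
            ranges.add(new String[] {starts.get(i), end});
        }
        return ranges;
    }
}
```

For example, region start keys "", "u5_100", "u5_200" and "u9_050" would
collapse into three ranges, with all rows of u5 in the second one.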

Cheers

On Sun, Jan 6, 2013 at 4:37 AM, David Koch <[EMAIL PROTECTED]> wrote:

> Hello,
>
> Is there a way to override the method that TableMapReduceUtil uses
> to split an HBase table across several mapper instances when running a
> MapReduce job over it?
>
> Our current data model is:
>
> Row-key: <user_id>
> Qualifier: <timestamp>
> Value: <event>
>
> The distribution of "events per <user_id>" is long-tailed. Some users have
> on the order of 10^6 events, whereas the mean is around 100.
>
> These occasional big rows cause problems when running M/R jobs over the
> table (timeouts, etc.). Also, according to the HBase book they are
> detrimental to the region splitting process. We currently use server-side
> filters to skip these large rows, but that is not a nice solution.
>
> To overcome this, I was thinking about using a "tall and narrow" key design
> (term from HBase book), meaning in our case that rowkeys are formed by a
> composite of <user_id>_<tmst>.
>
> However, and hence my original question, how would I process this table
> "user-wise" in an HBase M/R job? Specifically, how do I make sure that a
> user's history is not split across several Mapper instances?
>
> Thank you,
>
> /David
>
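
To make the "tall and narrow" design mentioned in the quoted message a little
more concrete, here is a minimal sketch of how a composite <user_id>_<tmst>
rowkey might be built and how one user's rows could be bounded for a scan.
Everything here is an assumption for illustration: the class and method
names are hypothetical, the timestamp is taken to be non-negative epoch
milliseconds zero-padded to 13 digits so that keys sort chronologically, user
ids are assumed to be ASCII without '_', and keys are shown as strings
rather than the byte[] that HBase actually stores.

```java
/** Hypothetical helper, not part of HBase: composite rowkeys for per-user events. */
final class EventRowKey {
    static final char DELIM = '_';
    // Assumption: epoch milliseconds fit in 13 decimal digits.
    static final int TS_WIDTH = 13;

    /** Builds "<user_id>_<zero-padded tmst>" so one user's events sort by time. */
    static String rowKey(String userId, long epochMillis) {
        if (userId.indexOf(DELIM) >= 0) {
            throw new IllegalArgumentException("user id must not contain '" + DELIM + "'");
        }
        if (epochMillis < 0) {
            throw new IllegalArgumentException("timestamp must be non-negative");
        }
        return userId + DELIM + String.format("%0" + TS_WIDTH + "d", epochMillis);
    }

    /**
     * Returns {startKey, stopKey} covering exactly one user's rows:
     * start is "<user_id>_" (inclusive), stop is "<user_id>" followed by the
     * character right after '_' (exclusive).
     */
    static String[] userScanRange(String userId) {
        String start = userId + DELIM;
        String stop = userId + (char) (DELIM + 1);
        return new String[] {start, stop};
    }
}
```

With keys like these, all rows for one user are contiguous, which is what
makes it possible to reason about user boundaries when choosing split points.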
David Koch 2013-01-06, 16:37
Ted Yu 2013-01-06, 16:47
Dhaval Shah 2013-01-06, 17:29
David Koch 2013-01-06, 17:53
Dhaval Shah 2013-01-22, 23:10