Home | About | Sematext search-lucene.com search-hadoop.com
 Search Hadoop and all its subprojects:

Switch to Plain View
HBase >> mail # user >> Controlling TableMapReduceUtil table split points

Copy link to this message
Controlling TableMapReduceUtil table split points

Is there a way to override the method which is used by TableMapReduceUtil
to split a HBase table across several mapper instances when running a Map
Reduce over it?

Our current data model is:

Row-key: <user_id>
Qualifier: <timestamp>
Value: <event>

The distribution of "events per <user_id>" is long-tailed. Some users have
in the order of 10^6 events, whereas the mean is somewhat around 100.

These occasional big rows cause problems when running M/R jobs over the
table (timeouts, etc.). Also, according to the HBase book they are
detrimental to the region splitting process. We currently use server-side
filters to not retrieve these large rows but it's not a nice solution.

To overcome this, I was thinking about using a "tall and narrow" key design
(term from HBase book), meaning in our case that rowkeys are formed by a
composite of <user_id>_<tmst>.

However, and hence my original question, how would I process this table
"user-wise" in a HBase M/R job? Specifically, how do I make sure that a
user's history is not split across several Mapper instances?

Thank you,

Ted Yu 2013-01-06, 16:11
David Koch 2013-01-06, 16:37
Ted Yu 2013-01-06, 16:47
Dhaval Shah 2013-01-06, 17:29
David Koch 2013-01-06, 17:53
Dhaval Shah 2013-01-22, 23:10