HBase >> mail # user >> Controlling TableMapReduceUtil table split points


Controlling TableMapReduceUtil table split points
Hello,

Is there a way to override the method that TableMapReduceUtil uses to
split an HBase table across several mapper instances when running a
MapReduce job over it?

Our current data model is:

Row-key: <user_id>
Qualifier: <timestamp>
Value: <event>

The distribution of "events per <user_id>" is long-tailed. Some users have
on the order of 10^6 events, whereas the mean is around 100.

These occasional big rows cause problems when running M/R jobs over the
table (timeouts, etc.). Also, according to the HBase book, they are
detrimental to the region splitting process. We currently use server-side
filters to skip these large rows, but it's not a nice solution.

To overcome this, I was thinking about using a "tall and narrow" key design
(term from the HBase book), meaning in our case that row keys are a
composite of <user_id>_<tmst>.
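
To make the composite key concrete, here is a minimal sketch in plain Java (no HBase classes). The class name CompositeRowKey, the '_' separator handling and the 19-digit zero padding are assumptions for illustration, not something taken from the HBase book. The padding is there so that the lexicographic order of the keys matches the chronological order of the events for each user. The sketch also assumes the timestamp is a non-negative long.

```java
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Hypothetical helper for building <user_id>_<tmst> row keys.
final class CompositeRowKey {
    static final char SEP = '_';

    // Builds "<user_id>_<tmst>" with the timestamp zero-padded to 19 digits
    // (the width of Long.MAX_VALUE) so that string order equals time order.
    static String rowKey(String userId, long timestampMillis) {
        if (timestampMillis < 0) {
            throw new IllegalArgumentException("timestamp must be non-negative");
        }
        return userId + SEP + String.format(Locale.ROOT, "%019d", timestampMillis);
    }

    // Recovers the user id: everything before the last separator.
    static String userIdOf(String rowKey) {
        int i = rowKey.lastIndexOf(SEP);
        if (i < 0) {
            throw new IllegalArgumentException("not a composite key: " + rowKey);
        }
        return rowKey.substring(0, i);
    }

    // Row keys are stored as bytes; UTF-8 is an assumption here.
    static byte[] toBytes(String rowKey) {
        return rowKey.getBytes(StandardCharsets.UTF_8);
    }
}
```

Since all keys of one user share the prefix "<user_id>_", they sort next to each other, so a user's events stay contiguous in the table even though they are now spread over many rows.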

However, and hence my original question, how would I process this table
"user-wise" in an HBase M/R job? Specifically, how do I make sure that a
user's history is not split across several Mapper instances?
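
One possible approach, sketched below in plain Java, is to move each split boundary down to the start of the user it falls in, and to merge boundaries that land in the same user. The class UserAlignedSplits and its methods are hypothetical. The sketch assumes the region start keys are sorted, that the first one is the empty string (table start), and that user ids do not contain the '_' separator. In a real job this logic might live in a custom TableInputFormat subclass that overrides getSplits. That wiring is not shown here and is an assumption about the setup, not a tested recipe.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: align split boundaries with user boundaries.
final class UserAlignedSplits {
    static final char SEP = '_';

    // Moves a boundary such as "u5_0003" down to "u5_", the smallest
    // possible row key prefix of that user.
    static String snapToUserStart(String boundaryKey) {
        int i = boundaryKey.lastIndexOf(SEP);
        return i < 0 ? boundaryKey : boundaryKey.substring(0, i + 1);
    }

    // Returns [start, end) key ranges. An empty string as start means
    // "table start", as end it means "table end".
    static List<String[]> alignedRanges(List<String> sortedRegionStartKeys) {
        List<String> starts = new ArrayList<>();
        for (String key : sortedRegionStartKeys) {
            String snapped = key.isEmpty() ? key : snapToUserStart(key);
            // Boundaries inside the same user collapse into one split.
            if (starts.isEmpty() || !starts.get(starts.size() - 1).equals(snapped)) {
                starts.add(snapped);
            }
        }
        List<String[]> ranges = new ArrayList<>();
        for (int i = 0; i < starts.size(); i++) {
            String end = i + 1 < starts.size() ? starts.get(i + 1) : "";
            ranges.add(new String[] {starts.get(i), end});
        }
        return ranges;
    }
}
```

The trade-off is that a user whose rows span several regions ends up in a single, larger split, so some mappers read from more than one region and lose part of their data locality.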

Thank you,

/David
Ted Yu 2013-01-06, 16:11
David Koch 2013-01-06, 16:37
Ted Yu 2013-01-06, 16:47
Dhaval Shah 2013-01-06, 17:29
David Koch 2013-01-06, 17:53
Dhaval Shah 2013-01-22, 23:10