Pig >> mail # user >> HBaseStorage STORE method comparison

HBaseStorage STORE method comparison
Hi folks --

For a very sparse HBase table (2 column families, 1000s of columns) what's
the expected performance difference in using HBaseStorage with the
following two STORE methods?  Note that in our use case, there are only a
handful of unique rowkeys (approx 10).

1) GROUP BY the 1000s of columns by rowkey, and write only 10 very wide
rows into HBase
2) Skip the GROUP BY, and just write the raw data as is.  Conceptually,
this seems like a rewrite on the 10 rowkeys, but we're writing a different
column each time.
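
To make the comparison concrete, here is a rough sketch of why both
methods should end up with the same cells. It is a toy in-memory model in
Python, not the HBase client API or HBaseStorage itself: the put() helper,
the sample records, and the column names are all made up for illustration.
The only HBase behaviour it assumes is that writes to different columns of
the same rowkey are merged into the row rather than replacing it.

```python
from collections import defaultdict

# Hypothetical raw records: (rowkey, "family:qualifier", value)
raw = [
    ("row1", "cf1:a", 1),
    ("row1", "cf1:b", 2),
    ("row2", "cf2:a", 3),
    ("row1", "cf2:c", 4),
]

def put(table, rowkey, cells):
    """Toy stand-in for an HBase Put: merge the given cells into the row."""
    table[rowkey].update(cells)

def store_grouped(records):
    """Method 1: group columns by rowkey, then write one wide row per key."""
    grouped = defaultdict(dict)
    for rowkey, column, value in records:
        grouped[rowkey][column] = value
    table, puts = defaultdict(dict), 0
    for rowkey, cells in grouped.items():
        put(table, rowkey, cells)
        puts += 1
    return dict(table), puts

def store_raw(records):
    """Method 2: skip the grouping and write one narrow put per record."""
    table, puts = defaultdict(dict), 0
    for rowkey, column, value in records:
        put(table, rowkey, {column: value})
        puts += 1
    return dict(table), puts

grouped_table, grouped_puts = store_grouped(raw)
raw_table, raw_puts = store_raw(raw)
```

In this model the final table contents are identical for both methods;
what differs is the number of puts (one per rowkey versus one per record)
and the extra grouping pass that method 1 needs before it can write.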

Originally our processing used method #1, but I just modified it to
use method #2, and I'm seeing a decent performance increase.  I think much
of the difference is the overhead of launching another Hadoop job, since
GROUP BY is a blocking operator.  Any thoughts?