fastest might be to use local mode, and avoid even the first map only job :)
You are right, for 10 keys it does not really matter. Even doing 1000s of
updates to the same row in #2 is still an in-memory update for HBase. The
actual cost of the HBase put() is probably slightly higher for #2, but it is a
negligible part of the rest of the overhead.
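For illustration, the two approaches might look roughly like this in Pig Latin (table, relation, and column-family names below are hypothetical, and the step that collapses columns into one wide row in #1 is only sketched):

    -- #2: store the raw data as-is; each tuple carries a map holding
    -- the single column being written, so each STORE'd tuple becomes
    -- one put() against one of the ~10 rowkeys
    raw = LOAD 'input' AS (rowkey:chararray, cols:map[]);
    STORE raw INTO 'hbase://mytable'
        USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('cf1:*');

    -- #1: GROUP first (a blocking operator, so an extra MapReduce job),
    -- then write one very wide row per rowkey
    grouped = GROUP raw BY rowkey;
    -- ...collapse each bag into a single column->value map per key...
    STORE wide INTO 'hbase://mytable'
        USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('cf1:*');

The 'cf1:*' form tells HBaseStorage to expand a map field into dynamically named columns under that family, which is what a sparse schema with 1000s of columns needs in either approach.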
On Tue, Mar 6, 2012 at 10:24 AM, Norbert Burger <[EMAIL PROTECTED]> wrote:
> Hi folks --
> For a very sparse HBase table (2 column families, 1000s of columns), what's
> the expected performance difference in using HBaseStorage with the
> following two STORE methods? Note that in our use case, there are only a
> handful of unique rowkeys (approx 10).
> 1) GROUP BY the 1000s of columns by rowkey, and write only 10 very wide
> rows into HBase
> 2) Skip the GROUP BY, and just write the raw data as-is. Conceptually,
> this seems like repeated rewrites of the same 10 rowkeys, but we're writing
> a different column each time.
> Originally our processing was using approach #1, but I just modified it to
> use method #2, and I'm seeing a decent performance increase. I think much
> of the difference is the overhead of launching another Hadoop job, since
> GROUP BY is a blocking operator. Any thoughts?