Pig, mail # user - HBaseStorage STORE method comparison


Re: HBaseStorage STORE method comparison
Raghu Angadi 2012-03-07, 08:22
The fastest option might be to use local mode and avoid even the first map-only job :)

You are right, for 10 keys it does not really matter. Even doing 1000s of
updates to the same row in #2 is still an in-memory update for HBase. The
actual cost of the HBase put() calls is probably slightly higher for #2, but
it is a negligible part of the overall overhead.

On Tue, Mar 6, 2012 at 10:24 AM, Norbert Burger <[EMAIL PROTECTED]> wrote:

> Hi folks --
>
> For a very sparse HBase table (2 column families, 1000s of columns) what's
> the expected performance difference in using HBaseStorage with the
> following two STORE methods?  Note that in our use case, there are only a
> handful of unique rowkeys (approx 10).
>
> 1) GROUP BY the 1000s of columns by rowkey, and write only 10 very wide
> rows into HBase
> 2) Skip the GROUP BY, and just write the raw data as is.  Conceptually,
> this seems like a rewrite on the 10 rowkeys, but we're writing a different
> column each time.
>
> Originally our processing was using approach #1, but I just modified it to
> use method #2, and I'm seeing a decent performance increase.  I think much
> of the difference is the overhead of launching another Hadoop job, since
> GROUP BY is a blocking operator.  Any thoughts?
>
> Norbert
>
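
To make the comparison above concrete, here is a toy sketch in Python. It is not Pig or the HBase client API: the table is just a nested dict standing in for HBase's in-memory row/column map, and the names store_grouped, store_raw and records are made up for illustration. It only shows that both approaches leave the same cells in the table and differ in the number of put calls, which is the part described above as negligible next to the job overhead.

```python
from collections import defaultdict


def store_grouped(records):
    """Approach #1: group cells by rowkey first, then write one wide row per key."""
    grouped = defaultdict(dict)
    for rowkey, column, value in records:
        grouped[rowkey][column] = value

    table, puts = {}, 0
    for rowkey, cells in grouped.items():
        # One put carrying every column for this rowkey.
        table.setdefault(rowkey, {}).update(cells)
        puts += 1
    return table, puts


def store_raw(records):
    """Approach #2: skip the grouping and write each cell as it arrives."""
    table, puts = {}, 0
    for rowkey, column, value in records:
        # Repeated puts to the same rowkey add a new column each time;
        # nothing already stored in the row is overwritten.
        table.setdefault(rowkey, {})[column] = value
        puts += 1
    return table, puts


# Hypothetical data: 5000 cells spread over 10 rowkeys, each cell a distinct column.
records = [(f"row{i % 10}", f"cf1:col{i}", i) for i in range(5000)]

grouped_table, grouped_puts = store_grouped(records)
raw_table, raw_puts = store_raw(records)

print(grouped_table == raw_table)   # same final contents either way
print(grouped_puts, raw_puts)       # put counts differ
```

In this model approach #2 issues many more puts, but each one is a cheap in-memory update, whereas approach #1 pays for the grouping step up front. In Pig that grouping step is the GROUP BY, which is what adds the extra Hadoop job.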