

Re: HBaseStorage STORE method comparison
The fastest might be to use local mode and avoid even the first map-only job :)

You are right, for 10 keys it does not really matter. Even doing 1000s of
updates to the same row in #2 is still an in-memory update for HBase. The
actual cost of HBase put() is probably slightly higher for #2, but it is a
negligible part of the overall overhead.
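
To make the comparison concrete, here is a rough sketch in plain Python. It
is only a toy in-memory model, not the HBase client API or what HBaseStorage
does internally: ToyTable, store_grouped, store_raw and the sample records
are all made up for illustration. It shows that both methods leave the table
in the same final state, and that #2 simply issues more, smaller puts.

```python
from collections import defaultdict

# Hypothetical raw records: (rowkey, "family:qualifier", value).
records = [
    ("row1", "cf1:a", 1),
    ("row2", "cf1:a", 2),
    ("row1", "cf2:b", 3),
    ("row1", "cf1:c", 4),
    ("row2", "cf2:d", 5),
]


class ToyTable:
    """In-memory stand-in for a table (assumption, not HBase); counts puts."""

    def __init__(self):
        self.rows = defaultdict(dict)
        self.puts = 0

    def put(self, rowkey, cells):
        # Merge the given cells into the row, like writing new columns.
        self.rows[rowkey].update(cells)
        self.puts += 1


def store_grouped(records):
    # Method #1: group all columns by rowkey first, then one wide put per row.
    grouped = defaultdict(dict)
    for rowkey, column, value in records:
        grouped[rowkey][column] = value
    table = ToyTable()
    for rowkey, cells in grouped.items():
        table.put(rowkey, cells)
    return table


def store_raw(records):
    # Method #2: no grouping, one single-column put per raw record.
    table = ToyTable()
    for rowkey, column, value in records:
        table.put(rowkey, {column: value})
    return table


grouped = store_grouped(records)
raw = store_raw(records)
print(grouped.rows == raw.rows, grouped.puts, raw.puts)
```

The grouping step in method #1 is what costs the extra Hadoop job in Pig;
the model above only counts puts and says nothing about real timings.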

On Tue, Mar 6, 2012 at 10:24 AM, Norbert Burger <[EMAIL PROTECTED]> wrote:

> Hi folks --
>
> For a very sparse HBase table (2 column families, 1000s of columns) what's
> the expected performance difference in using HBaseStorage with the
> following two STORE methods?  Note that in our use case, there are only a
> handful of unique rowkeys (approx 10).
>
> 1) GROUP BY the 1000s of columns by rowkey, and write only 10 very wide
> rows into HBase
> 2) Skip the GROUP BY, and just write the raw data as is.  Conceptually,
> this seems like a rewrite on the 10 rowkeys, but we're writing a different
> column each time.
>
> Originally our processing was using approach #1, but I just modified it to
> use method #2, and I'm seeing a decent performance increase.  I think much
> of the difference is the overhead of launching another Hadoop job, since
> GROUP BY is a blocking operator.  Any thoughts?
>
> Norbert
>