HBase >> mail # user >> Explosion in datasize using HBase as a MR sink


Re: Explosion in datasize using HBase as a MR sink
Did you presplit Table2.1?

From the master log, do you see region splits happening during the MR job run?

Thanks
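
For reference, presplitting means creating the table with a list of split keys up front, so the job writes into existing regions instead of forcing splits as data arrives. Below is a minimal sketch of computing such keys. It assumes row keys are uniformly distributed fixed-width hex strings, which the thread does not state; `hex_split_keys` and its parameters are hypothetical names used only for illustration.

```python
def hex_split_keys(num_regions, width=8):
    """Return num_regions - 1 evenly spaced, zero-padded hex split keys.

    Assumption (not from the thread): row keys are hex strings of
    `width` characters, spread uniformly over the key space.
    """
    if num_regions < 2:
        # A single region needs no split points.
        return []
    key_space = 16 ** width
    return [
        format(i * key_space // num_regions, "0{}x".format(width))
        for i in range(1, num_regions)
    ]


if __name__ == "__main__":
    # Four regions over a two-character hex key space.
    print(hex_split_keys(4, width=2))  # prints ['40', '80', 'c0']
```

The resulting keys would then be supplied when the table is created, before the MR job runs. If the real row keys are not uniformly distributed, the split points should be taken from a sample of the actual keys instead.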

On Wed, May 29, 2013 at 8:28 AM, Rob <[EMAIL PROTECTED]> wrote:

>
> We're moving from ingesting our data via the Thrift API to inserting our
> records via a MapReduce job. For the MR job I've used the exact same job
> setup as in HBase: The Definitive Guide, page 309. We're running CDH 4.0.1
> with HBase 0.92.1.
>
> We are parsing data from HBase Table1 into HBase Table2. Table1 holds
> unparsed data; Table2 holds the parsed data, stored as protobufs. This works
> fine via the Thrift API (in Python), but it doesn't scale, so we want
> to move to an MR job. Both T1 and T2 contain 100M records. Current
> stats, using 2 GB region sizes:
>
> Table1: 130 regions, taking up 134 GB of space
> Table2: 28 regions, taking up 39.3 GB of space
>
> The problem arises when I take a sample of 6M records from Table1 and
> MapReduce those into a new Table2.1. Those 6M records suddenly get spread
> over 178 regions, taking up 217.5 GB of disk space.
>
> Both T2 and T2.1 have the following simple schema:
>         create 'Table2', {NAME => 'data', COMPRESSION => 'SNAPPY', VERSIONS => 1}
>
> I can retrieve and parse records from both T2 and T2.1, so the data is
> there and validated, but I can't figure out why the explosion in size is
> happening. Triggering a major compaction makes little difference (about
> 2 GB in total size). I understand that Snappy compression is applied when
> the region servers create store files and HFiles, so the data should be
> compressed as soon as it is written.
>
> Any thoughts?
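
To put the reported numbers in perspective, here is a rough back-of-the-envelope calculation of on-disk bytes per record for each table. It uses only the figures quoted above and assumes decimal gigabytes (10^9 bytes); the helper name `bytes_per_record` is made up for this sketch.

```python
GB = 10 ** 9  # assumption: sizes in the post are decimal gigabytes

# Figures as quoted in the message above.
tables = {
    "Table1": {"records": 100 * 10 ** 6, "regions": 130, "size_gb": 134.0},
    "Table2": {"records": 100 * 10 ** 6, "regions": 28, "size_gb": 39.3},
    "Table2.1": {"records": 6 * 10 ** 6, "regions": 178, "size_gb": 217.5},
}


def bytes_per_record(size_gb, records):
    """Average on-disk bytes per record for a table."""
    return size_gb * GB / records


if __name__ == "__main__":
    for name, t in tables.items():
        per_record = bytes_per_record(t["size_gb"], t["records"])
        per_region_gb = t["size_gb"] / t["regions"]
        print("{}: {:.0f} bytes/record, {:.2f} GB/region".format(
            name, per_record, per_region_gb))

    t2 = bytes_per_record(39.3, 100 * 10 ** 6)
    t21 = bytes_per_record(217.5, 6 * 10 ** 6)
    print("Table2.1 uses about {:.0f}x more space per record than Table2".format(
        t21 / t2))
```

On these figures Table2 averages roughly 393 bytes per record while Table2.1 averages roughly 36 KB per record, a difference of about two orders of magnitude for what should be the same parsed data.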