HBase user mailing list: Explosion in datasize using HBase as a MR sink


Rob 2013-05-29, 15:28
Ted Yu 2013-05-29, 16:20
Rob 2013-05-29, 19:27

Re: Explosion in datasize using HBase as a MR sink
bq. but does that account for the sizes?

No. It should not.

Can you tell us more about your MR job?

I assume that you have run RowCounter on Table2.1 to verify the number of
rows matches 6M records.
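
For reference, RowCounter can be run from the command line like this:

    hbase org.apache.hadoop.hbase.mapreduce.RowCounter Table2.1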

Cheers

On Wed, May 29, 2013 at 12:27 PM, Rob <[EMAIL PROTECTED]> wrote:

> No, I did not presplit, and yes, splits happen during the job run.
>
> I know pre-splitting is a best practice, but does that account for the
> sizes?
>
> On May 29, 2013, at 18:20, Ted Yu <[EMAIL PROTECTED]> wrote:
>
> > Did you presplit Table2.1?
> >
> > From the master log, do you see region splitting happen during the MR
> > job run?
> >
> > Thanks
> >
> > On Wed, May 29, 2013 at 8:28 AM, Rob <[EMAIL PROTECTED]> wrote:
> >
> >>
> >> We're moving from ingesting our data via the Thrift API to inserting
> >> our records via a MapReduce job. For the MR job I've used the exact
> >> same job setup from HBase DefG, page 309. We're running CDH4.0.1,
> >> HBase 0.92.1.
> >>
> >> We are parsing data from an HBase Table1 into an HBase Table2; Table1
> >> is unparsed data, Table2 is parsed and stored as a protobuf. This works
> >> fine when doing it via the Thrift API (in Python), but that doesn't
> >> scale, so we want to move to using an MR job. Both T1 and T2 contain
> >> 100M records. Current stats, using 2GB region sizes:
> >>
> >> Table1: 130 regions, taking up 134GB of space
> >> Table2: 28 regions, taking up 39.3GB of space
> >>
> >> The problem arises when I take a sample of 6M records from Table1 and
> >> MR those into a new Table2.1. Those 6M records suddenly get spread over
> >> 178 regions, taking up 217.5GB of disk space.
> >>
> >> Both T2 and T2.1 have the following simple schema:
> >>        create 'Table2', {NAME => 'data', COMPRESSION => 'SNAPPY',
> >> VERSIONS => 1}
> >>
> >> I can retrieve and parse records from both T2 and T2.1, so the data is
> >> there and validated, but I can't seem to figure out why the explosion
> >> in size is happening. Triggering a major compaction does not make much
> >> of a difference (about 2GB in total size). I understand that Snappy
> >> compression gets applied directly when RegionServers write store files
> >> (HFiles), so compression should already be in effect.
> >>
> >> Any thoughts?
>
>
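
The "job setup from HBase DefG, page 309" mentioned above is, broadly, a
TableMapReduceUtil table-to-table job. The sketch below is only an
approximation of that setup, assuming CDH4 / HBase 0.92 APIs; the class
name, mapper body, and column qualifiers are placeholders, not the actual
job code:

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    import org.apache.hadoop.hbase.mapreduce.IdentityTableReducer;
    import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
    import org.apache.hadoop.hbase.mapreduce.TableMapper;
    import org.apache.hadoop.hbase.util.Bytes;
    import org.apache.hadoop.mapreduce.Job;

    public class ParseTableJob {

      /** Reads each row from Table1 and emits one Put destined for Table2.1. */
      static class ParseMapper extends TableMapper<ImmutableBytesWritable, Put> {
        @Override
        protected void map(ImmutableBytesWritable row, Result value, Context context)
            throws IOException, InterruptedException {
          // Placeholder parse step: the real job would build the protobuf value
          // here; the family/qualifier names below are made up for illustration.
          byte[] parsed = value.getValue(Bytes.toBytes("data"), Bytes.toBytes("raw"));
          if (parsed == null) {
            return;
          }
          Put put = new Put(row.get());
          put.add(Bytes.toBytes("data"), Bytes.toBytes("pb"), parsed);
          context.write(row, put);
        }
      }

      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        Job job = new Job(conf, "parse Table1 into Table2.1");
        job.setJarByClass(ParseTableJob.class);

        Scan scan = new Scan();
        scan.setCaching(500);        // larger scanner batches for a full-table read
        scan.setCacheBlocks(false);  // don't churn the block cache during the scan

        // Source table feeds the mapper; the sink table is wired up via
        // TableOutputFormat by initTableReducerJob.
        TableMapReduceUtil.initTableMapperJob("Table1", scan, ParseMapper.class,
            ImmutableBytesWritable.class, Put.class, job);
        TableMapReduceUtil.initTableReducerJob("Table2.1", IdentityTableReducer.class, job);
        job.setNumReduceTasks(0);    // map-only: Puts go straight to the sink table

        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

Pre-splitting the sink table up front (for example with
HBaseAdmin.createTable(descriptor, splitKeys)) avoids region splits during
the job run, although, as noted above, that alone should not explain the
size difference. Assuming the default hbase.rootdir, on-disk usage per
table can also be compared directly with hadoop fs -du /hbase/<table>.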
Rob 2013-05-29, 20:44
Stack 2013-05-30, 02:51
Rob Verkuylen 2013-05-30, 19:52
Asaf Mesika 2013-05-31, 20:02
Rob Verkuylen 2013-06-04, 19:58
Stack 2013-06-04, 23:07