HBase, mail # user - Explosion in datasize using HBase as a MR sink


Re: Explosion in datasize using HBase as a MR sink
Ted Yu 2013-05-29, 19:32
bq. but does that account for the sizes?

No. It should not.

Can you tell us more about your MR job?

I assume that you have run RowCounter on Table2.1 to verify the number of
rows matches 6M records.
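
For reference, the RowCounter job that ships with HBase can be run from the
command line; the ROWS counter in the finished job's output gives the row count
(the table name below is Table2.1, as above):

    hbase org.apache.hadoop.hbase.mapreduce.RowCounter Table2.1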

Cheers

On Wed, May 29, 2013 at 12:27 PM, Rob <[EMAIL PROTECTED]> wrote:

> No, I did not presplit, and yes, splits do happen during the job run.
>
> I know pre-splitting is a best practice, but does that account for the
> sizes?
>
> On May 29, 2013, at 18:20, Ted Yu <[EMAIL PROTECTED]> wrote:
>
> > Did you presplit Table2.1?
> >
> > From the master log, do you see region splitting happening during the MR
> > job run?
> >
> > Thanks
> >
> > On Wed, May 29, 2013 at 8:28 AM, Rob <[EMAIL PROTECTED]> wrote:
> >
> >>
> >> We're moving from ingesting our data via the Thrift API to inserting our
> >> records via a MapReduce job. For the MR job I've used the exact same job
> >> setup from HBase: The Definitive Guide, page 309 (a sketch of that pattern
> >> appears at the end of this thread). We're running CDH 4.0.1, HBase 0.92.1.
> >>
> >> We are parsing data from HBase Table1 into HBase Table2: Table1 holds the
> >> unparsed data, Table2 holds the parsed data stored as a protobuf. This
> >> works fine via the Thrift API (in Python), but it doesn't scale, so we want
> >> to move to an MR job. Both T1 and T2 contain 100M records. Current
> >> stats, using 2GB region sizes:
> >>
> >> Table1: 130 regions, taking up 134 GB of space
> >> Table2: 28 regions, taking up 39.3 GB of space
> >>
> >> The problem arises when I take a sample of 6M records from Table1 and MR
> >> them into a new Table2.1. Those 6M records suddenly get spread over 178
> >> regions taking up 217.5 GB of disk space.
> >>
> >> Both T2 and T2.1 have the following simple schema:
> >>        create 'Table2', {NAME => 'data', COMPRESSION => 'SNAPPY', VERSIONS => 1}
> >>
> >> I can retrieve and parse records from both T2 and T2.1, so the data is
> >> there and validated, but I can't figure out why the explosion in size is
> >> happening. Triggering a major compaction makes little difference (about
> >> 2 GB in total size). I understand that Snappy compression is applied
> >> directly when the RegionServers write their store files (HFiles), so the
> >> output should already be compressed.
> >>
> >> Any thoughts?
>
>
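
For reference, below is a minimal sketch of the Table1-to-Table2 MapReduce
pattern the original post refers to (the setup from HBase: The Definitive
Guide, page 309), written against the 0.92-era HBase client API and including
a pre-split of the sink table as suggested above. The class name, column
qualifiers, split keys, and the parse step are placeholders, not the poster's
actual code.

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.HTableDescriptor;
    import org.apache.hadoop.hbase.client.HBaseAdmin;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    import org.apache.hadoop.hbase.io.hfile.Compression;
    import org.apache.hadoop.hbase.mapreduce.IdentityTableReducer;
    import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
    import org.apache.hadoop.hbase.mapreduce.TableMapper;
    import org.apache.hadoop.hbase.util.Bytes;
    import org.apache.hadoop.mapreduce.Job;

    public class ParseTableJob {

      // Reads every row of the source table and writes one Put per row to the sink.
      static class ParseMapper extends TableMapper<ImmutableBytesWritable, Put> {
        @Override
        protected void map(ImmutableBytesWritable row, Result columns, Context context)
            throws IOException, InterruptedException {
          // Placeholder parse step: the real job would build the protobuf here.
          byte[] raw = columns.getValue(Bytes.toBytes("data"), Bytes.toBytes("raw"));
          if (raw == null) {
            return;
          }
          Put put = new Put(row.get());
          put.add(Bytes.toBytes("data"), Bytes.toBytes("pb"), raw);
          context.write(row, put);
        }
      }

      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();

        // Pre-split the sink table before the job starts so writes are spread over
        // several regions from the beginning; the split keys are illustrative only.
        HBaseAdmin admin = new HBaseAdmin(conf);
        if (!admin.tableExists("Table2.1")) {
          HTableDescriptor desc = new HTableDescriptor("Table2.1");
          HColumnDescriptor data = new HColumnDescriptor("data");
          data.setMaxVersions(1);
          data.setCompressionType(Compression.Algorithm.SNAPPY);
          desc.addFamily(data);
          byte[][] splits = { Bytes.toBytes("2000000"), Bytes.toBytes("4000000") };
          admin.createTable(desc, splits);
        }

        Scan scan = new Scan();
        scan.setCaching(500);        // bigger scanner batches for the mappers
        scan.setCacheBlocks(false);  // don't churn the block cache with a full scan

        Job job = new Job(conf, "Parse Table1 into Table2.1");
        job.setJarByClass(ParseTableJob.class);
        TableMapReduceUtil.initTableMapperJob("Table1", scan, ParseMapper.class,
            ImmutableBytesWritable.class, Put.class, job);
        TableMapReduceUtil.initTableReducerJob("Table2.1", IdentityTableReducer.class, job);
        job.setNumReduceTasks(0);    // map-only: mappers write straight to the sink table
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }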