HBase, mail # dev - HBase Map/Reduce Data Ingest Performance


Upender K. Nimbekar 2012-12-17, 15:34
Ted Yu 2012-12-17, 17:45
Upender K. Nimbekar 2012-12-17, 19:11
Re: HBase Map/Reduce Data Ingest Performance
Ted Yu 2012-12-18, 00:52
I think the second approach is better.

Cheers
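For reference, the second approach boils down to a driver along the lines of the rough sketch below, written against the 0.90-era API. The table name, paths, and mapper class are placeholders, and in the stock API the loader class is spelled LoadIncrementalHFiles with doBulkLoad() as its entry point:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat;
import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class BulkIngestDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = new Job(conf, "hfile-ingest");
    job.setJarByClass(BulkIngestDriver.class);
    job.setInputFormatClass(TextInputFormat.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));

    // Hypothetical mapper that emits (row key, KeyValue) in sorted order;
    // one possible shape for it is sketched further down this page.
    job.setMapperClass(SortingHFileMapper.class);
    job.setMapOutputKeyClass(ImmutableBytesWritable.class);
    job.setMapOutputValueClass(KeyValue.class);

    // The table must already exist, pre-split into the target regions.
    HTable table = new HTable(conf, "my_table");
    // Wires the output format, compression and partitioning to the table's
    // current regions, and sets reducers = number of regions ...
    HFileOutputFormat.configureIncrementalLoad(job, table);
    // ... which we then override to make the job map-only.
    job.setNumReduceTasks(0);

    Path hfileDir = new Path(args[1]);
    FileOutputFormat.setOutputPath(job, hfileDir);

    if (job.waitForCompletion(true)) {
      // Same work as the completebulkload tool, but run from the driver.
      new LoadIncrementalHFiles(conf).doBulkLoad(hfileDir, table);
    }
  }
}

Calling configureIncrementalLoad() first and then forcing the reducer count back to 0 keeps the HFileOutputFormat wiring while skipping the sort/shuffle phase entirely.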

On Mon, Dec 17, 2012 at 11:11 AM, Upender K. Nimbekar <
[EMAIL PROTECTED]> wrote:

> Sure. I can try that. Just curious, out of these 2 strategies, which one do
> you think is better? Do you have any experience of trying one or the other?
>
> Thanks
> Upen
>
> On Mon, Dec 17, 2012 at 12:45 PM, Ted Yu <[EMAIL PROTECTED]> wrote:
>
> > Thanks for sharing your experiences.
> >
> > Have you considered upgrading to HBase 0.92 or 0.94?
> > There have been several bug fixes / enhancements
> > to the LoadIncrementHFiles.bulkLoad() API in newer HBase releases.
> >
> > Cheers
> >
> > On Mon, Dec 17, 2012 at 7:34 AM, Upender K. Nimbekar <
> > [EMAIL PROTECTED]> wrote:
> >
> > > Hi All,
> > > I have a question about improving Map/Reduce job performance while
> > > ingesting a huge amount of data into HBase using HFileOutputFormat.
> > > Here is what we are using:
> > >
> > > 1) *Cloudera hadoop-0.20.2-cdh3u*
> > > 2) *hbase-0.90.4-cdh3u2*
> > >
> > > I've used 2 different strategies as described below:
> > >
> > > *Strategy#1:* Pre-split the regions with 10 regions per region
> > > server, and then kick off the Hadoop job with
> > > HFileOutputFormat.configureIncrementalLoad. This mechanism creates
> > > reduce tasks equal to the number of regions (10 per region server).
> > > We used the "hash" of each record as the key of the map output.
> > > Each mapper finished in an acceptable amount of time, but the
> > > reduce tasks took forever: first the copy/shuffle phase took a
> > > considerable amount of time, and then the sort phase took forever
> > > to finish.
> > > We tried to address this issue by constructing the key as
> > > "fixedhash1"_"hash2", where "fixedhash1" is fixed for all the
> > > records of a given mapper. The idea was to reduce shuffling /
> > > copying from each mapper. But even this solution didn't save us any
> > > time, and the reduce step still took a significant amount of time.
> > > I played with adjusting the number of pre-split regions in both
> > > directions, but to no avail.
> > > This led us to Strategy#2, where we got rid of the reduce step.
> > >
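As a concrete illustration of that pre-split step, here is a rough sketch of creating the table programmatically with evenly spaced split points via HBaseAdmin. The table and family names are placeholders, and the hex split points assume hex-encoded hash row keys; the RegionSplitter mentioned further down is the command-line route to the same thing.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.util.Bytes;

public class PreSplitTable {
  public static void main(String[] args) throws IOException {
    Configuration conf = HBaseConfiguration.create();
    HBaseAdmin admin = new HBaseAdmin(conf);

    HTableDescriptor desc = new HTableDescriptor("my_table");  // placeholder name
    desc.addFamily(new HColumnDescriptor("f"));                // placeholder family

    int regions = 10 * 6;  // e.g. 10 regions per region server on a 6-node cluster
    byte[][] splits = new byte[regions - 1][];
    for (int i = 1; i < regions; i++) {
      // Evenly spaced 4-character hex prefixes, so the boundaries line up
      // with hex-encoded hash row keys. regions - 1 split keys => regions regions.
      splits[i - 1] = Bytes.toBytes(String.format("%04x", i * 0x10000 / regions));
    }
    admin.createTable(desc, splits);
  }
}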
> > > *QUESTION:* Is there anything I could have done better in this
> > > strategy to make the reduce step finish faster? Do I need to
> > > produce row keys differently than "hash1"_"hash2" of the text? Is
> > > this a known issue with CDH3 or HBase 0.90? Please help me
> > > troubleshoot.
> > >
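A sketch of what that "fixedhash1"_"hash2" row-key scheme could look like; the hash function (MD5 here) and prefix length are assumptions, since the thread doesn't say what was actually used:

import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.hbase.util.MD5Hash;

public final class SaltedKeys {
  private SaltedKeys() {}

  /**
   * Builds a row key from a prefix that is constant for one mapper
   * (e.g. derived from the task attempt id) plus a per-record hash.
   */
  public static byte[] rowKey(String mapperSalt, String record) {
    String prefix = MD5Hash.getMD5AsHex(Bytes.toBytes(mapperSalt)).substring(0, 8);
    String suffix = MD5Hash.getMD5AsHex(Bytes.toBytes(record));
    return Bytes.toBytes(prefix + "_" + suffix);
  }
}

Note that because the prefix is constant for a given mapper, all of that mapper's output still falls into one narrow key range and therefore lands on only one or two reducers under the TotalOrderPartitioner; the data still crosses the network, which may be part of why this variant didn't shorten the shuffle.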
> > > *Strategy#2:* Pre-split the regions with 10 regions per region
> > > server, and then kick off the Hadoop job with
> > > HFileOutputFormat.configureIncrementalLoad, but set the number of
> > > reducers to 0. In this (current) strategy, I pre-sorted all the
> > > mapper input using a TreeSet before writing the output. With the
> > > number of reducers = 0, the mappers write directly to HFiles. This
> > > was cool because the map/reduce job (no reduce phase, actually)
> > > finished very fast, and we noticed the HFiles got written very
> > > quickly. Then I used the *hbase.utils.LoadIncrementHFiles.bulkLoad()*
> > > API to move the HFiles into HBase; I called this method on
> > > successful completion of the job in the driver class. This is
> > > working much better than Strategy#1 in terms of performance, but
> > > the bulkLoad() call in the driver sometimes takes longer if there
> > > is a huge amount of data.
> > >
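The map-side pre-sort described above could look roughly like the sketch below, assuming one cell per row and input that fits in the mapper's heap; the class and column names are illustrative, and a mapper along these lines is what the driver sketch near the top of the page plugs in as SortingHFileMapper.

import java.io.IOException;
import java.util.Map;
import java.util.TreeMap;

import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class SortingHFileMapper
    extends Mapper<LongWritable, Text, ImmutableBytesWritable, KeyValue> {

  private static final byte[] FAMILY = Bytes.toBytes("f");
  private static final byte[] QUALIFIER = Bytes.toBytes("q");

  // HFileOutputFormat requires keys in sorted order; with zero reducers the
  // map output goes straight to it, so we buffer and sort in the mapper.
  private final TreeMap<byte[], KeyValue> buffer =
      new TreeMap<byte[], KeyValue>(Bytes.BYTES_COMPARATOR);

  @Override
  protected void map(LongWritable offset, Text line, Context context) {
    byte[] row = Bytes.toBytes(line.toString());  // real code would hash the record
    buffer.put(row, new KeyValue(row, FAMILY, QUALIFIER, Bytes.toBytes(line.toString())));
  }

  @Override
  protected void cleanup(Context context) throws IOException, InterruptedException {
    // Emit everything in row-key order once the split has been consumed.
    for (Map.Entry<byte[], KeyValue> e : buffer.entrySet()) {
      context.write(new ImmutableBytesWritable(e.getKey()), e.getValue());
    }
  }
}

One thing worth keeping in mind: since each mapper writes its own HFiles, those files usually span many region boundaries, and the loader has to split them at load time, which may be part of why the driver-side bulkLoad() call drags on large data sets.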
> > > *QUESTION:* Is there any way to make the bulkLoad() call run
> > > faster? Can I call this API from the mapper directly, instead of
> > > waiting for the whole job to finish first? I've used the HBase
> > > "completebulkload" utility, but it has two issues. First, I do not
> > > see any performance improvement with it. Second, it needs to be run
> > > separately from the Hadoop job driver class, and we wanted to
> > > integrate the two pieces. So we used
> > > *hbase.utils.LoadIncrementHFiles.bulkLoad()*.
> > > Also, we used the HBase RegionSplitter to pre-split the regions. But hbase
Upender K. Nimbekar 2012-12-18, 02:30
Ted Yu 2012-12-18, 03:28
Nick Dimiduk 2012-12-18, 17:31
Upender K. Nimbekar 2012-12-18, 19:06
Jean-Daniel Cryans 2012-12-18, 19:17
Nick Dimiduk 2012-12-18, 19:20
lars hofhansl 2012-12-19, 07:07
lars hofhansl 2012-12-19, 07:10