HBase dev mailing list: HBase Map/Reduce Data Ingest Performance


Upender K. Nimbekar 2012-12-17, 15:34
Ted Yu 2012-12-17, 17:45
Upender K. Nimbekar 2012-12-17, 19:11
Re: HBase Map/Reduce Data Ingest Performance
I think the second approach is better.

Cheers

On Mon, Dec 17, 2012 at 11:11 AM, Upender K. Nimbekar <
[EMAIL PROTECTED]> wrote:

> Sure. I can try that. Just curious, out of these 2 strategies, which one do
> you think is better? Do you have any experience of trying one or the other?
>
> Thanks
> Upen
>
> On Mon, Dec 17, 2012 at 12:45 PM, Ted Yu <[EMAIL PROTECTED]> wrote:
>
> > Thanks for sharing your experiences.
> >
> > Have you considered upgrading to HBase 0.92 or 0.94?
> > There have been several bug fixes / enhancements
> > to the LoadIncrementalHFiles.doBulkLoad() API in newer HBase releases.
> >
> > Cheers
> >
> > On Mon, Dec 17, 2012 at 7:34 AM, Upender K. Nimbekar <
> > [EMAIL PROTECTED]> wrote:
> >
> > > Hi All,
> > > I have a question about improving Map/Reduce job performance while
> > > ingesting huge amounts of data into HBase using HFileOutputFormat. Here
> > > is what we are using:
> > >
> > > 1) *Cloudera hadoop-0.20.2-cdh3u*
> > > 2) *hbase-0.90.4-cdh3u2*
> > >
> > > I've used 2 different strategies as described below:
> > >
> > > *Strategy#1:* Pre-split the table with 10 regions per region server,
> > > and then kick off the Hadoop job with
> > > HFileOutputFormat.configureIncrementalLoad. This mechanism creates
> > > reduce tasks equal to the number of regions (region servers * 10). We
> > > used the "hash" of each record as the key of the map output. With this,
> > > each mapper finished in an acceptable amount of time, but the reduce
> > > tasks took forever to finish. We found that first the copy/shuffle
> > > phase took a considerable amount of time, and then the sort phase took
> > > forever to finish.
> > > We tried to address this by constructing the key as
> > > "fixedhash1"_"hash2", where "fixedhash1" is fixed for all the records
> > > of a given mapper. The idea was to reduce shuffling / copying from each
> > > mapper. But even this solution didn't save us any time, and the reduce
> > > step still took a significant amount of time to finish. I played with
> > > adjusting the number of pre-split regions in both directions, but to no
> > > avail.
> > > This led us to move to Strategy#2, where we got rid of the reduce step.
> > >
> > > *QUESTION:* Is there anything I could have done better in this strategy
> > > to make the reduce step finish faster? Do I need to produce row keys
> > > differently than "hash1"_"hash2" of the text? Is this a known issue
> > > with CDH3 or HBase 0.90? Please help me troubleshoot.
> > >
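For readers following along, here is a rough sketch of the Strategy#1 setup described above, under stated assumptions: the class names, the table name "my_table", the column family "cf" / qualifier "q", and the MD5-of-record row key are placeholders, not details from the original post, and the table is assumed to be pre-split already.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.hbase.util.MD5Hash;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class IngestDriver {

  // Hypothetical mapper: one KeyValue per input line, keyed by an MD5 hash
  // of the record (the "hash" row key mentioned in the post).
  public static class IngestMapper
      extends Mapper<LongWritable, Text, ImmutableBytesWritable, KeyValue> {
    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
        throws IOException, InterruptedException {
      byte[] row = Bytes.toBytes(MD5Hash.getMD5AsHex(Bytes.toBytes(line.toString())));
      KeyValue kv = new KeyValue(row, Bytes.toBytes("cf"),
          Bytes.toBytes("q"), Bytes.toBytes(line.toString()));
      ctx.write(new ImmutableBytesWritable(row), kv);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = new Job(conf, "hbase-bulk-ingest");
    job.setJarByClass(IngestDriver.class);
    job.setMapperClass(IngestMapper.class);
    job.setMapOutputKeyClass(ImmutableBytesWritable.class);
    job.setMapOutputValueClass(KeyValue.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    // Wires up TotalOrderPartitioner, the KeyValue sort reducer and
    // HFileOutputFormat, with one reduce task per region of the
    // (already pre-split) table.
    HTable table = new HTable(conf, "my_table");
    HFileOutputFormat.configureIncrementalLoad(job, table);

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}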
> > > *Strategy#2:* Pre-split the table with 10 regions per region server,
> > > and then kick off the Hadoop job with
> > > HFileOutputFormat.configureIncrementalLoad, but set the number of
> > > reducers to 0. In this (current) strategy, I pre-sorted all the mapper
> > > input using a TreeSet before writing it to the output. With the number
> > > of reducers = 0, this resulted in the mappers writing directly to
> > > HFiles. This was cool because the map/reduce job (no reduce phase,
> > > actually) finished very fast, and we noticed the HFiles got written
> > > very quickly. Then I used the *LoadIncrementalHFiles.doBulkLoad()* API
> > > to move the HFiles into HBase. I called this method on successful
> > > completion of the job in the driver class. This is working much better
> > > than Strategy#1 in terms of performance, but the doBulkLoad() call in
> > > the driver sometimes takes longer if there is a huge amount of data.
> > >
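A rough sketch of the map-only variant described above, again with placeholder names: with no reduce phase, each mapper must hand HFileOutputFormat its KeyValues already sorted by row key, so this mapper buffers its split in a sorted map (the TreeSet idea from the post) and flushes it in cleanup().

import java.io.IOException;
import java.util.Map;
import java.util.TreeMap;

import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.hbase.util.MD5Hash;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// With no reduce phase, each mapper writes its HFiles directly, so the
// KeyValues it emits must already be in row-key order.
public class SortedHFileMapper
    extends Mapper<LongWritable, Text, ImmutableBytesWritable, KeyValue> {

  // TreeMap keeps the buffered rows sorted so cleanup() can emit them in order.
  private final TreeMap<byte[], KeyValue> buffer =
      new TreeMap<byte[], KeyValue>(Bytes.BYTES_COMPARATOR);

  @Override
  protected void map(LongWritable offset, Text line, Context ctx) {
    byte[] row = Bytes.toBytes(MD5Hash.getMD5AsHex(Bytes.toBytes(line.toString())));
    buffer.put(row, new KeyValue(row, Bytes.toBytes("cf"),
        Bytes.toBytes("q"), Bytes.toBytes(line.toString())));
  }

  @Override
  protected void cleanup(Context ctx) throws IOException, InterruptedException {
    for (Map.Entry<byte[], KeyValue> e : buffer.entrySet()) {
      ctx.write(new ImmutableBytesWritable(e.getKey()), e.getValue());
    }
  }
}

In the driver, the only change from the Strategy#1 sketch would be calling job.setNumReduceTasks(0) after configureIncrementalLoad; the obvious trade-off is that a mapper's whole input split has to fit in its heap.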
> > > *QUESTION:* Is there any way to make the doBulkLoad() call run faster?
> > > Can I call this API from the mapper directly, instead of waiting for
> > > the whole job to finish first? I've used the HBase "completebulkload"
> > > utility, but it has two issues. First, I do not see any performance
> > > improvement with it. Second, it needs to be run separately from the
> > > Hadoop job driver class, and we wanted to integrate the two pieces.
> > > So we used *LoadIncrementalHFiles.doBulkLoad()*.
> > > Also, we used the HBase RegionSplitter to pre-split the regions. But hbase
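Finally, a minimal sketch of the bulk-load step described in the post, run from the driver once job.waitForCompletion() returns. In 0.90-era HBase the tool behind "completebulkload" is org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles; the table name and HFile directory here are placeholders.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles;

public class BulkLoadStep {
  // Called from the job driver after job.waitForCompletion(true) succeeds;
  // moves the freshly written HFiles into the table's regions (files that
  // span a region boundary get split first).
  public static void load(Path hfileDir, String tableName) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, tableName);
    new LoadIncrementalHFiles(conf).doBulkLoad(hfileDir, table);
  }
}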
Upender K. Nimbekar 2012-12-18, 02:30
Ted Yu 2012-12-18, 03:28
Nick Dimiduk 2012-12-18, 17:31
Upender K. Nimbekar 2012-12-18, 19:06
Jean-Daniel Cryans 2012-12-18, 19:17
Nick Dimiduk 2012-12-18, 19:20
lars hofhansl 2012-12-19, 07:07
lars hofhansl 2012-12-19, 07:10