Upender K. Nimbekar 2012-12-17, 15:34
Re: HBase Map/Reduce Data Ingest Performance
Thanks for sharing your experiences.

Have you considered upgrading to HBase 0.92 or 0.94?
There have been several bug fixes and enhancements
to the LoadIncrementalHFiles.doBulkLoad() API in newer HBase releases.

Cheers

On Mon, Dec 17, 2012 at 7:34 AM, Upender K. Nimbekar <
[EMAIL PROTECTED]> wrote:

> Hi All,
> I have a question about improving Map/Reduce job performance while
> ingesting a huge amount of data into HBase using HFileOutputFormat. Here is
> what we are using:
>
> 1) *Cloudera hadoop-0.20.2-cdh3u*
> 2) *hbase-0.90.4-cdh3u2*
>
> I've used 2 different strategies as described below:
>
> *Strategy#1:* Pre-split the table with 10 regions per region
> server, then kick off the Hadoop job with
> HFileOutputFormat.configureIncrementalLoad. This mechanism creates one
> reduce task per region, i.e. 10 per region server. We used the "hash" of
> each record as the map output key. With this, each mapper finished
> in an acceptable amount of time, but the reduce tasks took forever to
> finish. We found that first the copy/shuffle phase took a considerable
> amount of time, and then the sort phase took forever to finish.
> We tried to address this issue by constructing the key as
> "fixedhash1"_"hash2", where "fixedhash1" is fixed for all the records of a
> given mapper. The idea was to reduce shuffling / copying from each mapper.
> But even this solution didn't save us any time, and the reduce step still
> took a significant amount of time to finish. I played with adjusting the
> number of pre-split regions in both directions, but to no avail.
> This led us to move to Strategy#2, where we got rid of the reduce step.
>
> *QUESTION:* Is there anything I could have done better in this strategy to
> make the reduce step finish faster? Do I need to produce row keys differently
> than "hash1"_"hash2" of the text? Is it a known issue with CDH3 or
> HBase 0.90? Please help me troubleshoot.
>
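To make the key-design question concrete, here is a minimal sketch of how a total-order partitioner routes map output keys to reducers when there is one reducer per region, as configureIncrementalLoad sets up. It uses only the JDK; the class name, the split keys and the sample row keys are all hypothetical, and plain String comparison stands in for HBase's byte-wise ordering (equivalent for lowercase ASCII hex).

```java
import java.util.Arrays;

/** Hypothetical sketch: map a row key to the region (and reducer) whose key range contains it. */
class RegionPartitionSketch {

    /** splitKeys must be sorted; region i covers [splitKeys[i-1], splitKeys[i]). */
    static int regionFor(String rowKey, String[] splitKeys) {
        int pos = Arrays.binarySearch(splitKeys, rowKey);
        // Found: the key equals a split point, which is the start key of region pos + 1.
        // Not found: binarySearch returns -(insertionPoint) - 1, and the insertion
        // point is exactly the index of the containing region.
        return pos >= 0 ? pos + 1 : -(pos + 1);
    }

    /** Counts how many keys land in each of the splitKeys.length + 1 regions. */
    static int[] histogram(String[] rowKeys, String[] splitKeys) {
        int[] counts = new int[splitKeys.length + 1];
        for (String key : rowKeys) {
            counts[regionFor(key, splitKeys)]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        String[] splits = {"40", "80", "c0"};                 // assumed: 4 pre-split regions
        String[] hashed = {"1f_a", "5b_b", "9e_c", "e2_d"};   // assumed: fully hashed keys
        String[] fixed  = {"5b_1f", "5b_5b", "5b_9e", "5b_e2"}; // assumed: one mapper, fixed prefix
        System.out.println(Arrays.toString(histogram(hashed, splits))); // prints [1, 1, 1, 1]
        System.out.println(Arrays.toString(histogram(fixed, splits)));  // prints [0, 4, 0, 0]
    }
}
```

Under these assumptions, a fixed per-mapper prefix sends all of a mapper's output to a single reducer. That cuts the shuffle fan-out, but it may also concentrate the sort work on a few reducers when there are fewer distinct prefixes than regions.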
> *Strategy#2:* Pre-split the table with 10 regions per region
> server, then kick off the Hadoop job with
> HFileOutputFormat.configureIncrementalLoad, but set the number of reducers
> = 0. In this strategy (current), I pre-sorted all the mapper input using a
> TreeSet before writing to output. With the number of reducers = 0, the
> mappers write directly to HFiles. This was cool because the map/reduce job
> (no reduce phase, actually) finished very fast and we noticed the HFiles got
> written very quickly. Then I used the
> *LoadIncrementalHFiles.doBulkLoad()* API to move the HFiles into HBase.
> I called this method on successful completion of the job in the
> driver class. This works much better than Strategy#1 in terms of
> performance. But the doBulkLoad() call in the driver sometimes takes longer
> if there is a huge amount of data.
>
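The pre-sort step matters because an HFile must be written with keys in ascending order, and HBase orders row keys as unsigned bytes, lexicographically. The following is a minimal JDK-only sketch of such a pre-sort with a TreeSet; the class and method names are hypothetical, and a real mapper would sort full KeyValues rather than bare row keys and would have to fit its input in memory.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.TreeSet;

/** Hypothetical sketch: pre-sort row keys in unsigned lexicographic byte order. */
class RowKeyOrderSketch {

    /** Compares byte arrays as unsigned bytes, shorter array first on a shared prefix. */
    static final Comparator<byte[]> UNSIGNED_BYTES = (a, b) -> {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int diff = (a[i] & 0xff) - (b[i] & 0xff); // mask to avoid signed comparison
            if (diff != 0) {
                return diff;
            }
        }
        return a.length - b.length;
    };

    /** Returns the keys in ascending order; duplicate keys collapse because a TreeSet is a set. */
    static List<byte[]> sortedKeys(List<byte[]> keys) {
        TreeSet<byte[]> sorted = new TreeSet<>(UNSIGNED_BYTES);
        sorted.addAll(keys);
        return new ArrayList<>(sorted);
    }

    /** Checks that every key is strictly greater than its predecessor. */
    static boolean isStrictlyAscending(List<byte[]> keys) {
        for (int i = 1; i < keys.size(); i++) {
            if (UNSIGNED_BYTES.compare(keys.get(i - 1), keys.get(i)) >= 0) {
                return false;
            }
        }
        return true;
    }
}
```

The unsigned mask is the design point here: with Java's signed bytes, a key starting with 0x80 would sort before one starting with 0x7f, which is the opposite of the byte order the sketch is meant to reproduce.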
> *QUESTION:* Is there any way to make doBulkLoad() run faster? Can I call
> this API from the mapper directly, instead of waiting for the whole job to
> finish first? I've used the HBase "completebulkload" utility, but it has two
> issues. First, I do not see any performance improvement with it.
> Second, it needs to be run separately from the Hadoop job driver class, and
> we wanted to integrate both pieces. So we used
> *LoadIncrementalHFiles.doBulkLoad()*.
>
> Also, we used the HBase RegionSplitter to pre-split the regions. But the
> HBase 0.90 version doesn't have the option to pass ALGORITHM. Is that
> something we need to worry about?
>
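On the pre-split question: when row keys start with a hex-encoded hash, evenly spaced split points over the hex key space are one reasonable choice. The sketch below computes them with the JDK only. It is similar in spirit to a hex-string split algorithm but is not RegionSplitter's actual code; the class name, the method and its parameters are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: evenly spaced split keys for row keys that begin with a hex hash. */
class HexPreSplitSketch {

    /**
     * Returns numRegions - 1 split keys, each a lowercase hex string of the given width,
     * dividing the key space [0, 16^width) into numRegions roughly equal ranges.
     */
    static List<String> hexSplits(int numRegions, int width) {
        if (width < 1 || width > 8) {
            throw new IllegalArgumentException("width must be between 1 and 8");
        }
        long keySpace = 1L << (4 * width); // 16^width distinct prefixes
        if (numRegions < 1 || numRegions > keySpace) {
            throw new IllegalArgumentException("numRegions must be between 1 and " + keySpace);
        }
        List<String> splits = new ArrayList<>();
        for (int i = 1; i < numRegions; i++) {
            long boundary = keySpace * i / numRegions; // cannot overflow for width <= 8
            splits.add(String.format("%0" + width + "x", boundary));
        }
        return splits;
    }

    public static void main(String[] args) {
        int regionServers = 3; // assumed cluster size, for illustration only
        // 10 regions per region server, as in the strategies described above.
        System.out.println(hexSplits(10 * regionServers, 4));
    }
}
```

For example, `hexSplits(4, 2)` yields the split keys `40`, `80` and `c0`. Whether uniform splits are good enough depends on how evenly the hash prefix is actually distributed in the data.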
> Please help point me in the right direction to address this problem.
>
> Thanks
> Upen
>