HBase >> mail # dev >> HBase Map/Reduce Data Ingest Performance


Upender K. Nimbekar 2012-12-17, 15:34
Ted Yu 2012-12-17, 17:45
Upender K. Nimbekar 2012-12-17, 19:11
Ted Yu 2012-12-18, 00:52
Upender K. Nimbekar 2012-12-18, 02:30
Re: HBase Map/Reduce Data Ingest Performance
Experts from Cloudera would be more familiar with security in
hadoop-0.20.2-cdh3u

If you can show us the exception (e.g. via pastebin), that would help
find the root cause.

Cheers

On Mon, Dec 17, 2012 at 6:30 PM, Upender K. Nimbekar <
[EMAIL PROTECTED]> wrote:

> Thanks! I'm calling doBulkLoad() from the mapper's cleanup() method, but
> running into permission issues when the hbase user tries to import the
> HFiles into HBase. Not sure if there is a way to change the target HDFS
> file permissions via HFileOutputFormat.
>
>
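For reference, a common workaround for this kind of permission failure (my own sketch against the 0.90-era API, not something confirmed in this thread; the directory path and table name are hypothetical) is to widen permissions on the HFile output directory before calling doBulkLoad(), since HFileOutputFormat itself exposes no permission setting:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.permission.FsPermission;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles;

public class BulkLoadWithChmod {
    // Sketch only: open up the HFile directory (and the per-family
    // subdirectories beneath it) so the hbase user can read and move the
    // files, then run the bulk load. Requires a live cluster; untested here.
    static void bulkLoad(Configuration conf, Path hfileDir, String table)
            throws Exception {
        FileSystem fs = FileSystem.get(conf);
        FsPermission open = new FsPermission((short) 0777);
        fs.setPermission(hfileDir, open);
        for (FileStatus family : fs.listStatus(hfileDir)) {
            fs.setPermission(family.getPath(), open);
            for (FileStatus hfile : fs.listStatus(family.getPath())) {
                fs.setPermission(hfile.getPath(), open);
            }
        }
        new LoadIncrementalHFiles(conf).doBulkLoad(
                hfileDir, new HTable(conf, table));
    }
}
```

Opening the files to 0777 is the blunt version; tightening it to grant only the hbase user access would be the production-grade variant.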
> On Mon, Dec 17, 2012 at 7:52 PM, Ted Yu <[EMAIL PROTECTED]> wrote:
>
> > I think the second approach is better.
> >
> > Cheers
> >
> > On Mon, Dec 17, 2012 at 11:11 AM, Upender K. Nimbekar <
> > [EMAIL PROTECTED]> wrote:
> >
> > > Sure. I can try that. Just curious, out of these two strategies, which
> > > one do you think is better? Do you have any experience trying one or
> > > the other?
> > >
> > > Thanks
> > > Upen
> > >
> > > On Mon, Dec 17, 2012 at 12:45 PM, Ted Yu <[EMAIL PROTECTED]> wrote:
> > >
> > > > Thanks for sharing your experiences.
> > > >
> > > > Have you considered upgrading to HBase 0.92 or 0.94?
> > > > There have been several bug fixes / enhancements to the
> > > > LoadIncrementalHFiles.doBulkLoad() API in newer HBase releases.
> > > >
> > > > Cheers
> > > >
> > > > On Mon, Dec 17, 2012 at 7:34 AM, Upender K. Nimbekar <
> > > > [EMAIL PROTECTED]> wrote:
> > > >
> > > > > Hi All,
> > > > > I have a question about improving Map/Reduce job performance while
> > > > > ingesting a huge amount of data into HBase using HFileOutputFormat.
> > > > > Here is what we are using:
> > > > >
> > > > > 1) *Cloudera hadoop-0.20.2-cdh3u*
> > > > > 2) *hbase-0.90.4-cdh3u2*
> > > > >
> > > > > I've used 2 different strategies as described below:
> > > > >
> > > > > *Strategy#1:* Pre-split the table with 10 regions per region
> > > > > server, then kick off the hadoop job with
> > > > > HFileOutputFormat.configureIncrementalLoad(). This mechanism
> > > > > creates reduce tasks equal to the number of regions (10 per region
> > > > > server). We used the hash of each record as the key of the map
> > > > > output. With this, each mapper finished in an acceptable amount of
> > > > > time, but the reduce tasks took forever: first the copy/shuffle
> > > > > phase took a considerable amount of time, and then the sort phase
> > > > > took forever to finish.
> > > > > We tried to address this by constructing the key as
> > > > > "fixedhash1"_"hash2", where "fixedhash1" is fixed for all the
> > > > > records of a given mapper. The idea was to reduce shuffling /
> > > > > copying from each mapper. But even this didn't save us any time,
> > > > > and the reduce step still took a significant amount of time to
> > > > > finish. I played with adjusting the number of pre-split regions in
> > > > > both directions, but to no avail.
> > > > > This led us to Strategy#2, where we got rid of the reduce step.
> > > > >
> > > > > *QUESTION:* Is there anything I could have done better in this
> > > > > strategy to make the reduce step finish faster? Do I need to
> > > > > produce row keys differently than "hash1"_"hash2" of the text? Is
> > > > > it a known issue with CDH3 or HBase 0.90? Please help me
> > > > > troubleshoot.
> > > > >
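To make the key scheme above concrete, here is a minimal, self-contained sketch in plain Java (the use of MD5 and all names here are my own illustration, not code from this thread) of the "fixedhash1"_"hash2" idea: a fixed per-mapper prefix plus a per-record hash, so all of one mapper's output shares a prefix and sorts into a contiguous key range.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class CompositeRowKey {
    // Hex-encoded MD5 of the input string (MD5 chosen only for illustration).
    static String md5Hex(String s) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            StringBuilder sb = new StringBuilder();
            for (byte b : md.digest(s.getBytes(StandardCharsets.UTF_8))) {
                sb.append(String.format("%02x", b));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new RuntimeException(e);
        }
    }

    // "fixedhash1" is constant for a given mapper (e.g. derived from its
    // input split id); "hash2" varies per record; "_" separates the parts.
    static String rowKey(String mapperId, String record) {
        return md5Hex(mapperId) + "_" + md5Hex(record);
    }

    public static void main(String[] args) {
        String k1 = rowKey("split-0001", "record A");
        String k2 = rowKey("split-0001", "record B");
        // Same mapper => same 32-char prefix, so the keys sort together.
        System.out.println(k1.substring(0, 32).equals(k2.substring(0, 32)));
    }
}
```

Note the trade-off this implies: a shared prefix concentrates one mapper's output on few reducers (less shuffle fan-out, but also less sort parallelism), which may be why it saved no time here.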
> > > > > *Strategy#2:* Pre-split the table with 10 regions per region
> > > > > server, then kick off the hadoop job with
> > > > > HFileOutputFormat.configureIncrementalLoad(), but set the number
> > > > > of reducers to 0. In this (current) strategy, I pre-sorted all the
> > > > > mapper input using a TreeSet before writing the output. With the
> > > > > number of reducers = 0, the mappers wrote directly to HFiles. This
> > > > > was cool because the map/reduce job (with no reduce phase,
> > > > > actually) finished very fast, and we noticed the HFiles
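A minimal stand-alone sketch of the pre-sorting idea in Strategy#2 (plain Java collections standing in for the mapper's KeyValue output; class and method names here are illustrative, not from this thread): buffer records in a sorted structure so they are emitted in row-key order, which writing HFiles directly from the map phase requires when there are no reducers to sort for you.

```java
import java.util.TreeMap;

public class PresortedMapperBuffer {
    // TreeMap keeps entries ordered by key, mimicking the TreeSet pre-sort
    // described above; with zero reducers, output must arrive in key order.
    private final TreeMap<String, String> buffer = new TreeMap<>();

    void write(String rowKey, String value) {
        buffer.put(rowKey, value);
    }

    // In a real mapper this flush would happen in cleanup(), emitting each
    // buffered entry to the task context in sorted order.
    String[] flushInKeyOrder() {
        return buffer.keySet().toArray(new String[0]);
    }

    public static void main(String[] args) {
        PresortedMapperBuffer b = new PresortedMapperBuffer();
        b.write("row-03", "v3");
        b.write("row-01", "v1");
        b.write("row-02", "v2");
        // Emitted in sorted row-key order regardless of insertion order.
        for (String k : b.flushInKeyOrder()) {
            System.out.println(k);
        }
    }
}
```

The caveat, of course, is memory: the buffer holds one mapper's whole output, so this only works when each input split's records fit in the task's heap.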
Nick Dimiduk 2012-12-18, 17:31
Upender K. Nimbekar 2012-12-18, 19:06
Jean-Daniel Cryans 2012-12-18, 19:17
Nick Dimiduk 2012-12-18, 19:20
lars hofhansl 2012-12-19, 07:07
lars hofhansl 2012-12-19, 07:10