Re: HBase Map/Reduce Data Ingest Performance
Hi Upender,

I think you misinterpreted what Nick was saying.
Personally, if I start something with "Dumb question", what I mean is "please forgive me if you had already thought about this; I'm just making sure in case you missed it". I think Nick meant it the same way.
We're pretty friendly folks here (mostly ;-) ).
-- Lars

________________________________
 From: Upender K. Nimbekar <[EMAIL PROTECTED]>
To: [EMAIL PROTECTED]
Sent: Tuesday, December 18, 2012 11:06 AM
Subject: Re: HBase Map/Reduce Data Ingest Performance
 
I would like to request that you maintain respect for the people asking questions
on this forum. Let's not start the thread off in the wrong direction.
I wish it were a dumb question. I did chmod 777 prior to calling bulkLoad.
The chmod call succeeded, but the bulkLoad call still threw an exception. However, it does
work if I do the chmod and bulkLoad() from the Hadoop driver after the job is
finished.
BTW, the HBase user needs WRITE permission, not just read, because it creates some
_tmp directories.
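
A minimal sketch of that driver-side flow, assuming a hypothetical output path
(/user/etl/hfiles) and table name ("my_table"); the recursive chmod and the
LoadIncrementalHFiles call run only after the MapReduce job has finished:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.permission.FsPermission;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles;

public class BulkLoadDriver {

  // Recursively open up the HFile directory so the hbase user can read the
  // files and create its _tmp subdirectories during the bulk load.
  static void chmodRecursive(FileSystem fs, Path path) throws Exception {
    fs.setPermission(path, new FsPermission((short) 0777));
    for (FileStatus child : fs.listStatus(path)) {
      if (child.isDir()) {
        chmodRecursive(fs, child.getPath());
      } else {
        fs.setPermission(child.getPath(), new FsPermission((short) 0777));
      }
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    FileSystem fs = FileSystem.get(conf);
    Path hfileDir = new Path("/user/etl/hfiles");  // hypothetical HFileOutputFormat output dir

    // ... run the MapReduce job that writes the HFiles and wait for it to complete ...

    chmodRecursive(fs, hfileDir);                              // chmod 777 after the job finishes
    LoadIncrementalHFiles loader = new LoadIncrementalHFiles(conf);
    loader.doBulkLoad(hfileDir, new HTable(conf, "my_table")); // hypothetical table name
  }
}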

Upen

On Tue, Dec 18, 2012 at 12:31 PM, Nick Dimiduk <[EMAIL PROTECTED]> wrote:

> Dumb question: what's the filesystem permissions of your generated HFiles?
> Can the HBase process read them? Maybe a simple chmod or chown will get you
> the rest of the way there.
>
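
A quick sketch of that check, assuming a hypothetical output directory; it just
prints owner, group and mode for each generated file so you can see whether the
hbase process can read them:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HFilePermissionCheck {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    Path outDir = new Path("/user/etl/hfiles");  // hypothetical HFile output dir
    for (FileStatus status : fs.listStatus(outDir)) {
      System.out.println(status.getPath() + "  owner=" + status.getOwner()
          + " group=" + status.getGroup() + " mode=" + status.getPermission());
    }
  }
}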
> On Mon, Dec 17, 2012 at 6:30 PM, Upender K. Nimbekar <
>  [EMAIL PROTECTED]> wrote:
>
> > Thanks! I'm calling doBulkLoad() from the mapper cleanup() method. But I'm
> > running into permission issues when the hbase user tries to import the HFiles
> > into HBase. Not sure if there is a way to change the target HDFS file
> > permissions via HFileOutputFormat.
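
One possible workaround, as a hedged sketch that is not from this thread: relax
the HDFS umask for the job that generates the HFiles so they come out readable
and writable by the hbase user. The property name differs across Hadoop versions
(both are shown), and the job name and table name are hypothetical:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat;
import org.apache.hadoop.mapreduce.Job;

public class HFileJobSetup {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    // Relax the umask so the generated HFiles are accessible to the hbase user.
    conf.set("dfs.umaskmode", "000");              // Hadoop 0.20 / 1.x era key
    conf.set("fs.permissions.umask-mode", "000");  // Hadoop 2.x and later key
    Job job = new Job(conf, "hfile-generation");   // hypothetical job name
    // ... set the mapper, input format and output path here ...
    HFileOutputFormat.configureIncrementalLoad(job, new HTable(conf, "my_table"));
    job.waitForCompletion(true);
  }
}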
> >
> >
> > On Mon, Dec 17, 2012 at 7:52 PM, Ted Yu <[EMAIL PROTECTED]> wrote:
> >
> > > I think the second approach is better.
> > >
> > > Cheers
> > >
> > > On Mon, Dec 17, 2012 at 11:11 AM, Upender K. Nimbekar <
> > > [EMAIL PROTECTED]> wrote:
> > >
> > > > Sure. I can try that. Just curious, out of these 2 strategies, which one
> > > > do you think is better? Do you have any experience of trying one or the
> > > > other?
> > > >
> > > > Thanks
> > > > Upen
> > > >
> > > > On Mon, Dec 17, 2012 at 12:45 PM, Ted Yu <[EMAIL PROTECTED]> wrote:
> > > >
> > > > > Thanks for sharing your experiences.
> > > > >
> > > > > Have you considered upgrading to HBase 0.92 or 0.94?
> > > > > There have been several bug fixes / enhancements
> > > > > to the LoadIncrementalHFiles.bulkLoad() API in newer HBase releases.
> > > > >
> > > > > Cheers
> > > > >
> > > > > On Mon, Dec 17, 2012 at 7:34 AM, Upender K. Nimbekar <
> > > > > [EMAIL PROTECTED]> wrote:
> > > > >
> > > > > > Hi All,
> > > > > > I have a question about improving the Map / Reduce job performance while
> > > > > > ingesting huge amounts of data into HBase using HFileOutputFormat. Here is
> > > > > > what we are using:
> > > > > >
> > > > > > 1) *Cloudera hadoop-0.20.2-cdh3u*
> > > > > > 2) *hbase-0.90.40cdh3u2*
> > > > > >
> > > > > > I've used 2 different strategies as described below:
> > > > > >
> > > > > > *Strategy#1:* Pre-split the regions, with 10 regions per region
> > > > > > server, and then subsequently kick off the hadoop job with
> > > > > > HFileOutputFormat.configureIncrementalLoad. This mechanism does create
> > > > > > reduce tasks equal to the number of regions * 10. We used the "hash" of
> > > > > > each record as the key of the map output. This resulted in each mapper
> > > > > > finishing in an acceptable amount of time. But the reduce tasks take
> > > > > > forever to finish. We found that first the copy/shuffle phase took a
> > > > > > considerable amount of time and then the sort phase took forever to finish.
> > > > > > We tried to address this issue by constructing the key as
> > > > > > "fixedhash1"_"hash2" where "fixedhash1" is fixed for all the records of a
> > > > > > given mapper. The idea was to reduce shuffling / copying from each
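
For reference, a minimal sketch of the Strategy #1 setup described above, with
hypothetical table name, column family, split boundaries, region count and paths;
the split boundaries would have to bracket the hashed row keys the mappers emit:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class PresplitAndGenerateHFiles {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();

    // Pre-split the table: 100 regions stands in for "10 regions per region
    // server" on a hypothetical 10-node cluster.
    HBaseAdmin admin = new HBaseAdmin(conf);
    HTableDescriptor desc = new HTableDescriptor("my_table");   // hypothetical table
    desc.addFamily(new HColumnDescriptor("cf"));                // hypothetical family
    admin.createTable(desc, Bytes.toBytes("00000000"), Bytes.toBytes("ffffffff"), 100);

    // Job that writes the HFiles. configureIncrementalLoad wires in the reducer,
    // the partitioner and total-order sorting to match the table's region
    // boundaries, so there is one reduce task per region.
    Job job = new Job(conf, "hfile-generation");                // hypothetical job name
    // job.setMapperClass(...): a mapper emitting (ImmutableBytesWritable hashKey, KeyValue)
    job.setMapOutputKeyClass(ImmutableBytesWritable.class);
    job.setMapOutputValueClass(KeyValue.class);
    HFileOutputFormat.configureIncrementalLoad(job, new HTable(conf, "my_table"));
    FileOutputFormat.setOutputPath(job, new Path("/user/etl/hfiles"));  // hypothetical path
    job.waitForCompletion(true);
  }
}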