HBase >> mail # dev >> HBase Map/Reduce Data Ingest Performance


Re: HBase Map/Reduce Data Ingest Performance
Please forgive my poor choice of words; I meant no disrespect.

-n

On Tue, Dec 18, 2012 at 11:06 AM, Upender K. Nimbekar <[EMAIL PROTECTED]> wrote:

> I would like to request that you maintain respect for people asking
> questions on this forum. Let's not start the thread in the wrong direction.
> I wish it were a dumb question. I did chmod 777 prior to calling bulkLoad.
> The chmod call succeeded, but the bulkLoad call still threw an exception.
> However, it does work if I do the chmod and bulkLoad() from the Hadoop
> driver after the job is finished.
> BTW, the HBase user needs WRITE permission, not just READ, because it
> creates some _tmp directories.
>
> Upen
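
A minimal sketch of the approach Upen describes (chmod the generated HFiles, then call bulkLoad from the Hadoop driver after the job finishes), assuming 0.90-era APIs; the paths, table name, and recursive-chmod helper are illustrative assumptions, not code from this thread:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.permission.FsPermission;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles;

public class BulkLoadDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Path hfileDir = new Path("/tmp/hfile-output"); // hypothetical output dir

    // ... run the HFileOutputFormat job to completion here ...

    // chmod 777 recursively so the hbase user can read the HFiles and
    // create its _tmp directories during the load (it needs WRITE, per above).
    FileSystem fs = FileSystem.get(conf);
    chmodRecursive(fs, hfileDir, new FsPermission((short) 0777));

    // Bulk load from the driver, after the job has finished.
    LoadIncrementalHFiles loader = new LoadIncrementalHFiles(conf);
    loader.doBulkLoad(hfileDir, new HTable(conf, "my_table")); // hypothetical table
  }

  private static void chmodRecursive(FileSystem fs, Path p, FsPermission perm)
      throws IOException {
    fs.setPermission(p, perm);
    if (fs.getFileStatus(p).isDir()) {
      for (FileStatus child : fs.listStatus(p)) {
        chmodRecursive(fs, child.getPath(), perm);
      }
    }
  }
}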
>
> On Tue, Dec 18, 2012 at 12:31 PM, Nick Dimiduk <[EMAIL PROTECTED]> wrote:
>
> > Dumb question: what are the filesystem permissions of your generated
> > HFiles? Can the HBase process read them? Maybe a simple chmod or chown
> > will get you the rest of the way there.
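
A quick way to inspect the permissions Nick is asking about, reusing the hypothetical /tmp/hfile-output directory from the sketch above:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class CheckHFilePerms {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    FileSystem fs = FileSystem.get(conf);
    // Print mode, owner, and group for each generated partition dir/HFile.
    for (FileStatus st : fs.listStatus(new Path("/tmp/hfile-output"))) {
      System.out.println(st.getPermission() + " " + st.getOwner() + " "
          + st.getGroup() + " " + st.getPath());
    }
  }
}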
> >
> > On Mon, Dec 17, 2012 at 6:30 PM, Upender K. Nimbekar <[EMAIL PROTECTED]> wrote:
> >
> > > Thanks! I'm calling doBulkLoad() from the mapper's cleanup() method, but
> > > running into permission issues when the hbase user tries to import the
> > > HFile into HBase. Not sure if there is a way to change the target HDFS
> > > file permissions via HFileOutputFormat.
> > >
> > >
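
On changing the permissions the job writes with: one knob worth checking is the HDFS umask. The config key name varies by Hadoop version ("dfs.umask", with a decimal value, on plain 0.20.x; "dfs.umaskmode" on later releases), so treat this sketch as an assumption to verify against the cluster rather than a confirmed fix:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.mapreduce.Job;

public class UmaskConfig {
  public static Job createJob() throws Exception {
    Configuration conf = HBaseConfiguration.create();
    // Assumption: CDH3-era key name; plain 0.20.x used "dfs.umask" instead.
    // Verify which key your cluster honors before relying on this.
    conf.set("dfs.umaskmode", "000"); // make the job's output world-accessible
    return new Job(conf, "hfile-ingest"); // hypothetical job name
  }
}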
> > > On Mon, Dec 17, 2012 at 7:52 PM, Ted Yu <[EMAIL PROTECTED]> wrote:
> > >
> > > > I think the second approach is better.
> > > >
> > > > Cheers
> > > >
> > > > On Mon, Dec 17, 2012 at 11:11 AM, Upender K. Nimbekar <[EMAIL PROTECTED]> wrote:
> > > >
> > > > > Sure. I can try that. Just curious: out of these 2 strategies, which
> > > > > one do you think is better? Do you have any experience trying one or
> > > > > the other?
> > > > >
> > > > > Thanks
> > > > > Upen
> > > > >
> > > > > On Mon, Dec 17, 2012 at 12:45 PM, Ted Yu <[EMAIL PROTECTED]> wrote:
> > > > >
> > > > > > Thanks for sharing your experiences.
> > > > > >
> > > > > > Have you considered upgrading to HBase 0.92 or 0.94?
> > > > > > There have been several bug fixes / enhancements to the
> > > > > > LoadIncrementalHFiles.doBulkLoad() API in newer HBase releases.
> > > > > >
> > > > > > Cheers
> > > > > >
> > > > > > On Mon, Dec 17, 2012 at 7:34 AM, Upender K. Nimbekar <[EMAIL PROTECTED]> wrote:
> > > > > >
> > > > > > > Hi All,
> > > > > > > I have a question about improving Map/Reduce job performance while
> > > > > > > ingesting a huge amount of data into HBase using HFileOutputFormat.
> > > > > > > Here is what we are using:
> > > > > > >
> > > > > > > 1) *Cloudera hadoop-0.20.2-cdh3u*
> > > > > > > 2) *hbase-0.90.4-cdh3u2*
> > > > > > >
> > > > > > > I've used 2 different strategies, as described below:
> > > > > > >
> > > > > > > *Strategy#1:* Pre-split the table with 10 regions per region
> > > > > > > server, and then kick off the Hadoop job with
> > > > > > > HFileOutputFormat.configureIncrementalLoad (see the sketch after
> > > > > > > this message). This mechanism creates reduce tasks equal to the
> > > > > > > number of regions (10 per region server). We used the "hash" of
> > > > > > > each record as the key of the map output. This resulted in each
> > > > > > > mapper finishing in an acceptable amount of time, but the reduce
> > > > > > > tasks took forever to finish. We found that first the copy/shuffle
> > > > > > > phase took a considerable amount of time, and then the sort phase
> > > > > > > took forever to finish.
> > > > > > > We tried to address this issue by constructing the key as
> > > > > > > "fixedhash1"_"hash2", where "fixedhash1" is fixed for all the
> > > > > > > records of a given mapper. The idea was to reduce shuffling /
> > > > > > > copying from each mapper. But even this solution didn't save us
> > > > > > > any time, and the reduce step still took a significant amount of
> > > > > > > time to finish. I played with adjusting the
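
A minimal sketch of the Strategy#1 setup described above (pre-split table plus HFileOutputFormat.configureIncrementalLoad), assuming 0.90-era APIs; the table name, column family, split points, and two-digit hash prefix are illustrative assumptions, not details from this thread:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.mapreduce.Job;

public class PreSplitIngest {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();

    // Pre-split the table: with e.g. 10 region servers and 10 regions per
    // server, 99 split points yield 100 regions. Split points must cover
    // the key space the mappers emit, here a two-digit hash prefix
    // (matching the "fixedhash1"_"hash2" idea above).
    byte[][] splits = new byte[99][];
    for (int i = 1; i <= 99; i++) {
      splits[i - 1] = Bytes.toBytes(String.format("%02d", i));
    }
    HTableDescriptor desc = new HTableDescriptor("ingest_table"); // hypothetical
    desc.addFamily(new HColumnDescriptor("d"));
    new HBaseAdmin(conf).createTable(desc, splits);

    // configureIncrementalLoad wires up TotalOrderPartitioner and sets one
    // reduce task per region, so the reducer count follows the pre-split.
    Job job = new Job(conf, "hfile-ingest");
    HTable table = new HTable(conf, "ingest_table");
    HFileOutputFormat.configureIncrementalLoad(job, table);
    // ... set the input path and mapper, submit the job, then bulk load
    // the output as discussed upthread ...
  }
}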