HBase, mail # dev - HBase Map/Reduce Data Ingest Performance


Re: HBase Map/Reduce Data Ingest Performance
Jean-Daniel Cryans 2012-12-18, 19:17
I don't think Nick was being disrespectful, usually when people prefix
a question with "Dumb question" it means that they think their own
question is dumb but they feel like asking it anyway in case something
basic wasn't covered.

J-D

On Tue, Dec 18, 2012 at 11:06 AM, Upender K. Nimbekar
<[EMAIL PROTECTED]> wrote:
> I would like to request that you maintain respect for people asking questions
> on this forum. Let's not start the thread in the wrong direction.
> I wish it were a dumb question. I did chmod 777 prior to calling bulkLoad.
> The chmod call succeeded, but the bulkLoad call still threw an exception. However,
> it does work if I do the chmod and bulkLoad() from the Hadoop driver after the job
> is finished.
> BTW, the HBase user needs WRITE permission, not just read, because it creates some
> _tmp directories.
>
> Upen
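
For reference, a minimal driver-side sketch of the approach Upen describes as working: after the MR job completes, recursively open up permissions on the HFile output directory and then bulk load it. The output path and table name here are assumptions, not taken from the thread.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.permission.FsPermission;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles;

public class DriverSideBulkLoad {

  // Recursively chmod 777 so the hbase user can read the HFiles and create
  // its _tmp directories during the bulk load.
  static void chmodRecursive(FileSystem fs, Path p) throws Exception {
    fs.setPermission(p, new FsPermission((short) 0777));
    if (fs.getFileStatus(p).isDir()) {
      for (FileStatus child : fs.listStatus(p)) {
        chmodRecursive(fs, child.getPath());
      }
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    FileSystem fs = FileSystem.get(conf);
    Path hfileDir = new Path("/user/mrjob/hfile-output"); // assumed HFileOutputFormat output dir

    // Run only after the MR job has finished writing the HFiles.
    chmodRecursive(fs, hfileDir);
    HTable table = new HTable(conf, "ingest_table");      // assumed target table
    new LoadIncrementalHFiles(conf).doBulkLoad(hfileDir, table);
  }
}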
>
> On Tue, Dec 18, 2012 at 12:31 PM, Nick Dimiduk <[EMAIL PROTECTED]> wrote:
>
>> Dumb question: what are the filesystem permissions of your generated HFiles?
>> Can the HBase process read them? Maybe a simple chmod or chown will get you
>> the rest of the way there.
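
One quick way to check (a sketch, not from the thread; the output path below is an assumption) is to list the generated files with their owner and permission bits via the Hadoop FileSystem API:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CheckHFilePermissions {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    // Assumed location of the HFileOutputFormat output; adjust to your job's output dir.
    Path hfileDir = new Path("/user/mrjob/hfile-output");
    for (FileStatus status : fs.listStatus(hfileDir)) {
      // Print owner, group and permission bits so you can tell whether the
      // hbase user can read (and write) these paths.
      System.out.println(status.getPath() + "  " + status.getOwner() + ":"
          + status.getGroup() + "  " + status.getPermission());
    }
  }
}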
>>
>> On Mon, Dec 17, 2012 at 6:30 PM, Upender K. Nimbekar <
>>  [EMAIL PROTECTED]> wrote:
>>
>> > Thanks! I'm calling doBulkLoad() from the mapper cleanup() method, but running
>> > into permission issues while the hbase user tries to import the HFiles into HBase.
>> > Not sure if there is a way to change the target HDFS file permissions via
>> > HFileOutputFormat.
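
One possibility, offered only as an assumption rather than something confirmed in the thread, is to relax the HDFS umask on the job configuration so the files written by HFileOutputFormat come out readable and writable by the hbase user. The property name differs across Hadoop versions, so this is only a sketch:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.mapreduce.Job;

public class PermissiveUmaskJob {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    // Relax the umask so files this job creates (including the HFiles) are not
    // restricted to the submitting user. Property name depends on the Hadoop version.
    conf.set("fs.permissions.umask-mode", "000"); // newer Hadoop releases
    // conf.set("dfs.umaskmode", "000");          // 0.20.x-era property name
    Job job = new Job(conf, "hfile-ingest");
    // ... remaining job setup (mapper, HFileOutputFormat, paths) omitted
  }
}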
>> >
>> >
>> > On Mon, Dec 17, 2012 at 7:52 PM, Ted Yu <[EMAIL PROTECTED]> wrote:
>> >
>> > > I think the second approach is better.
>> > >
>> > > Cheers
>> > >
>> > > On Mon, Dec 17, 2012 at 11:11 AM, Upender K. Nimbekar <
>> > > [EMAIL PROTECTED]> wrote:
>> > >
>> > > > Sure, I can try that. Just curious: out of these 2 strategies, which one do
>> > > > you think is better? Do you have any experience trying one or the other?
>> > > >
>> > > > Thanks
>> > > > Upen
>> > > >
>> > > > On Mon, Dec 17, 2012 at 12:45 PM, Ted Yu <[EMAIL PROTECTED]>
>> wrote:
>> > > >
>> > > > > Thanks for sharing your experiences.
>> > > > >
>> > > > > Have you considered upgrading to HBase 0.92 or 0.94 ?
>> > > > > There have been several bug fixes / enhancements
>> > > > > to the LoadIncrementalHFiles.doBulkLoad() API in newer HBase releases.
>> > > > >
>> > > > > Cheers
>> > > > >
>> > > > > On Mon, Dec 17, 2012 at 7:34 AM, Upender K. Nimbekar <
>> > > > > [EMAIL PROTECTED]> wrote:
>> > > > >
>> > > > > > Hi All,
>> > > > > > I have a question about improving Map/Reduce job performance while
>> > > > > > ingesting a huge amount of data into HBase using HFileOutputFormat.
>> > > > > > Here is what we are using:
>> > > > > >
>> > > > > > 1) *Cloudera hadoop-0.20.2-cdh3u*
>> > > > > > 2) *hbase-0.90.4-cdh3u2*
>> > > > > >
>> > > > > > I've used 2 different strategies as described below:
>> > > > > >
>> > > > > > *Strategy#1:* Pre-split the table with 10 regions per region server,
>> > > > > > and then kick off the Hadoop job with
>> > > > > > HFileOutputFormat.configureIncrementalLoad. This mechanism does create
>> > > > > > reduce tasks equal to the number of regions * 10. We used the "hash" of
>> > > > > > each record as the key of the map output. This resulted in each mapper
>> > > > > > finishing in an acceptable amount of time, but the reduce tasks take
>> > > > > > forever to finish. We found that first the copy/shuffle phase took a
>> > > > > > considerable amount of time and then the sort phase took forever to finish.
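
For reference, a minimal driver-side skeleton of Strategy #1 as described above: pre-split the table, then let HFileOutputFormat.configureIncrementalLoad wire up the partitioner and reducer so there is one reduce task per region. The table name, column family, split points and mapper are assumptions, not taken from the thread.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class PreSplitIngestDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();

    // Pre-split the table up front; configureIncrementalLoad then creates
    // one reduce task per region of the target table.
    HTableDescriptor desc = new HTableDescriptor("ingest_table"); // assumed table name
    desc.addFamily(new HColumnDescriptor("cf"));                  // assumed column family
    byte[][] splitKeys = { Bytes.toBytes("4"), Bytes.toBytes("8"),
                           Bytes.toBytes("c") };                  // assumed hash-range split points
    new HBaseAdmin(conf).createTable(desc, splitKeys);

    Job job = new Job(conf, "hfile-ingest");
    job.setJarByClass(PreSplitIngestDriver.class);
    // job.setMapperClass(HashKeyMapper.class);               // application-specific mapper emitting
    job.setMapOutputKeyClass(ImmutableBytesWritable.class);   // the hashed row key ...
    job.setMapOutputValueClass(Put.class);                    // ... and a Put per record
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    // Sets TotalOrderPartitioner, the sorting reducer, and the number of
    // reducers based on the table's current region boundaries.
    HTable table = new HTable(conf, "ingest_table");
    HFileOutputFormat.configureIncrementalLoad(job, table);

    job.waitForCompletion(true);
  }
}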
>> > > > > > We tried to address the shuffle/sort slowness by constructing the key as
>> > > > > > "fixedhash1"_"hash2", where "fixedhash1" is fixed for all the records of a
>> > > > > > given mapper. The idea was to reduce shuffling/copying from each mapper.
>> > > > > > But even this solution didn't save us any time, and the reduce step took