Re: HBase Map/Reduce Data Ingest Performance
I don't think Nick was being disrespectful; usually when people prefix
a question with "Dumb question", it means they think their own question
is dumb but they feel like asking it anyway in case something basic
wasn't covered.

J-D

On Tue, Dec 18, 2012 at 11:06 AM, Upender K. Nimbekar
<[EMAIL PROTECTED]> wrote:
> I would like to request that you maintain respect for the people asking
> questions on this forum. Let's not start the thread in the wrong direction.
> I wish it were a dumb question. I did chmod 777 prior to calling bulkLoad.
> The chmod call succeeded, but the bulkLoad call still threw an exception.
> However, it does work if I do the chmod and bulkLoad() from the Hadoop
> driver after the job is finished.
> BTW, the HBase user needs WRITE permission and NOT just READ, because it
> creates some _tmp directories.
>
> Upen
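
[A minimal sketch of the driver-side workflow described above, assuming the
HBase 0.90-era LoadIncrementalHFiles.doBulkLoad() API; the output path and
table name are hypothetical:]

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.fs.permission.FsPermission;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles;

    public class DriverSideBulkLoad {
      // The "chmod 777 prior to calling bulkLoad" step, done recursively so
      // the hbase user can also write its _tmp directories under the tree.
      static void chmodRecursive(FileSystem fs, Path path) throws Exception {
        fs.setPermission(path, new FsPermission((short) 0777));
        if (fs.getFileStatus(path).isDir()) {
          for (FileStatus child : fs.listStatus(path)) {
            chmodRecursive(fs, child.getPath());
          }
        }
      }

      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        Path hfileDir = new Path("/user/upen/hfile-output"); // hypothetical
        // Run after the MR job has finished writing its HFiles.
        chmodRecursive(FileSystem.get(conf), hfileDir);
        // Import the completed HFiles into the live table.
        new LoadIncrementalHFiles(conf).doBulkLoad(hfileDir,
            new HTable(conf, "mytable")); // table name hypothetical
      }
    }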
>
> On Tue, Dec 18, 2012 at 12:31 PM, Nick Dimiduk <[EMAIL PROTECTED]> wrote:
>
>> Dumb question: what are the filesystem permissions on your generated
>> HFiles? Can the HBase process read them? Maybe a simple chmod or chown
>> will get you the rest of the way there.
>>
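
[A quick way to answer that question from code, as a sketch; the output
path is hypothetical:]

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class CheckHFilePermissions {
      public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // Print permission, owner:group and path for each generated file,
        // to see whether the hbase user can read (and write) them.
        for (FileStatus st : fs.listStatus(new Path("/user/upen/hfile-output"))) {
          System.out.println(st.getPermission() + " " + st.getOwner() + ":"
              + st.getGroup() + " " + st.getPath());
        }
      }
    }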
>> On Mon, Dec 17, 2012 at 6:30 PM, Upender K. Nimbekar <
>>  [EMAIL PROTECTED]> wrote:
>>
>> > Thanks! I'm calling doBulkLoad() from the mapper cleanup() method, but
>> > running into permission issues when the hbase user tries to import the
>> > HFile into HBase. Not sure if there is a way to change the target HDFS
>> > file permission via HFileOutputFormat.
>> >
>> >
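
[For reference, a minimal sketch of the pattern described here: calling
doBulkLoad() from a mapper's cleanup() hook. The call runs as the job's
user, which is where the permission mismatch with the hbase user comes in.
Table name and HFile path are hypothetical:]

    import java.io.IOException;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hbase.KeyValue;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class BulkLoadingMapper
        extends Mapper<LongWritable, Text, ImmutableBytesWritable, KeyValue> {

      @Override
      protected void cleanup(Context context)
          throws IOException, InterruptedException {
        try {
          // Import the HFiles this job wrote; runs under the job's user.
          Path hfileDir = new Path("/user/upen/hfile-output"); // hypothetical
          HTable table = new HTable(context.getConfiguration(), "mytable");
          new LoadIncrementalHFiles(context.getConfiguration())
              .doBulkLoad(hfileDir, table);
        } catch (Exception e) {
          throw new IOException(e);
        }
      }
    }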
>> > On Mon, Dec 17, 2012 at 7:52 PM, Ted Yu <[EMAIL PROTECTED]> wrote:
>> >
>> > > I think the second approach is better.
>> > >
>> > > Cheers
>> > >
>> > > On Mon, Dec 17, 2012 at 11:11 AM, Upender K. Nimbekar <
>> > > [EMAIL PROTECTED]> wrote:
>> > >
>> > > > Sure. I can try that. Just curious, out of these 2 strategies, which
>> > > > one do you think is better? Do you have any experience trying one or
>> > > > the other?
>> > > >
>> > > > Thanks
>> > > > Upen
>> > > >
>> > > > On Mon, Dec 17, 2012 at 12:45 PM, Ted Yu <[EMAIL PROTECTED]>
>> wrote:
>> > > >
>> > > > > Thanks for sharing your experiences.
>> > > > >
>> > > > > Have you considered upgrading to HBase 0.92 or 0.94?
>> > > > > There have been several bug fixes / enhancements to the
>> > > > > LoadIncrementalHFiles.doBulkLoad() API in newer HBase releases.
>> > > > >
>> > > > > Cheers
>> > > > >
>> > > > > On Mon, Dec 17, 2012 at 7:34 AM, Upender K. Nimbekar <
>> > > > > [EMAIL PROTECTED]> wrote:
>> > > > >
>> > > > > > Hi All,
>> > > > > > I have a question about improving Map/Reduce job performance
>> > > > > > while ingesting a huge amount of data into HBase using
>> > > > > > HFileOutputFormat. Here is what we are using:
>> > > > > >
>> > > > > > 1) *Cloudera hadoop-0.20.2-cdh3u*
>> > > > > > 2) *hbase-0.90.4-cdh3u2*
>> > > > > >
>> > > > > > I've used 2 different strategies as described below:
>> > > > > >
>> > > > > > *Strategy#1:* Pre-split the table with 10 regions per region
>> > > > > > server, and then subsequently kick off the hadoop job with
>> > > > > > HFileOutputFormat.configureIncrementalLoad. This mechanism does
>> > > > > > create reduce tasks equal to the number of regions (region
>> > > > > > servers * 10). We used the "hash" of each record as the key of
>> > > > > > the map output. This resulted in each mapper finishing in an
>> > > > > > acceptable amount of time, but the reduce tasks took forever to
>> > > > > > finish. We found that first the copy/shuffle phase took a
>> > > > > > considerable amount of time, and then the sort phase took
>> > > > > > forever to finish.
>> > > > > > We tried to address this issue by constructing the key as
>> > > > > > "fixedhash1"_"hash2", where "fixedhash1" is fixed for all the
>> > > > > > records of a given mapper. The idea was to reduce shuffling /
>> > > > > > copying from each mapper. But even this solution didn't save us
>> > > > > > any time, and the reduce step took
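
[A minimal sketch of the Strategy #1 setup described above, assuming the
HBase 0.90-era API; the table name, column family, region count, and hash
key range are all hypothetical:]

    import java.util.Arrays;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.HTableDescriptor;
    import org.apache.hadoop.hbase.client.HBaseAdmin;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat;
    import org.apache.hadoop.hbase.util.Bytes;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class BulkLoadJobSetup {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();

        // Pre-split: e.g. 10 region servers * 10 regions each = 100 regions.
        // Bytes.split() returns the endpoints too, so pass only the interior
        // points as split keys. Assumes evenly distributed hex hash keys.
        int numRegions = 100;
        byte[][] points = Bytes.split(Bytes.toBytes("00000000"),
                                      Bytes.toBytes("ffffffff"),
                                      numRegions - 1);
        byte[][] splits = Arrays.copyOfRange(points, 1, points.length - 1);
        HTableDescriptor desc = new HTableDescriptor("mytable"); // hypothetical
        desc.addFamily(new HColumnDescriptor("cf"));
        new HBaseAdmin(conf).createTable(desc, splits);

        // configureIncrementalLoad() wires in the total-order partitioner,
        // the KeyValue-sorting reducer, and one reduce task per region.
        Job job = new Job(conf, "hbase-bulk-ingest");
        job.setJarByClass(BulkLoadJobSetup.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        HFileOutputFormat.configureIncrementalLoad(job, new HTable(conf, "mytable"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }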