HBase, mail # user - HBase Performance Improvements?


Re: HBase Performance Improvements?
Something Something 2012-05-16, 07:40
Hello Bryan & Oliver,

I am using suggestions from both of you to do the bulk upload.  The
problem I am running into is that the job that uses
'HFileOutputFormat.configureIncrementalLoad' is taking very long to
complete.  One thing I noticed is that it's using only 1 Reducer.

When I looked at the source code for HFileOutputFormat, I noticed that the
no. of Reducers is determined by this:

    List<ImmutableBytesWritable> startKeys = getRegionStartKeys(table);
    LOG.info("Configuring " + startKeys.size() + " reduce partitions " +
        "to match current region count");
    job.setNumReduceTasks(startKeys.size());

When I look at the log I see this:

12/05/16 03:11:02 INFO mapreduce.HFileOutputFormat: Configuring 301 reduce
partitions to match current region count

which implies that the regions were created successfully.  But shouldn't
this set the number of Reducers to 301?  What am I missing?
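
For reference, a minimal driver along the lines being discussed might
look like the sketch below (the mapper, table name, and column
family/qualifier are hypothetical; the API is the 0.9x-era one quoted
above).  The configureIncrementalLoad call is what prints the log line
above and sets the reduce count to the region count:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class BulkLoadDriver {

  // Hypothetical mapper: parses "rowkey<TAB>value" lines into KeyValues.
  public static class TsvToKeyValueMapper
      extends Mapper<LongWritable, Text, ImmutableBytesWritable, KeyValue> {
    private static final byte[] FAMILY = Bytes.toBytes("f");     // assumed family
    private static final byte[] QUALIFIER = Bytes.toBytes("q");  // assumed qualifier

    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
        throws IOException, InterruptedException {
      String[] parts = line.toString().split("\t", 2);
      if (parts.length < 2) return;  // skip malformed lines
      byte[] row = Bytes.toBytes(parts[0]);
      ctx.write(new ImmutableBytesWritable(row),
          new KeyValue(row, FAMILY, QUALIFIER, Bytes.toBytes(parts[1])));
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = new Job(conf, "hfile-bulk-load");
    job.setJarByClass(BulkLoadDriver.class);
    job.setMapperClass(TsvToKeyValueMapper.class);
    job.setMapOutputKeyClass(ImmutableBytesWritable.class);
    job.setMapOutputValueClass(KeyValue.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    // Inspects the (pre-split) table, installs a TotalOrderPartitioner
    // over the region start keys, and sets the number of reduce tasks
    // to the region count -- 301 regions should mean 301 reducers.
    HTable table = new HTable(conf, "my_table");  // hypothetical table name
    HFileOutputFormat.configureIncrementalLoad(job, table);

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}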

Thanks for your help.

On Thu, May 10, 2012 at 9:04 AM, Bryan Beaudreault <[EMAIL PROTECTED]
> wrote:

> I don't think there is.  You need to have a table seeded with the right
> regions in order to run the bulk loader jobs.
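
A minimal sketch of that seeding step, assuming the split keys have
already been computed (table and column family names are hypothetical):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.util.Bytes;

public class PreSplitTable {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HBaseAdmin admin = new HBaseAdmin(conf);
    HTableDescriptor desc = new HTableDescriptor("my_table");  // hypothetical
    desc.addFamily(new HColumnDescriptor("f"));                // assumed family
    // In practice the split keys come from the split-computing job
    // discussed below; these two are placeholders.
    byte[][] splitKeys = new byte[][] {
        Bytes.toBytes("row-0333"), Bytes.toBytes("row-0666") };
    admin.createTable(desc, splitKeys);  // table is created pre-split
    admin.close();
  }
}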
>
> My machines are sufficiently fast that it did not take that long to sort.
>  One thing I did do to speed this up was add a mapper to the job that
> generates the splits, which would calculate the size of each KeyValue.  So
> instead of passing around the KeyValues I would pass around just the size
> of the KeyValues.  You could do a similar thing with the Puts.  Here are my
> keys/values for the job in full:
>
> Mapper:
>
> KeyIn: ImmutableBytesWritable
> ValueIn: KeyValue
>
> KeyOut: ImmutableBytesWritable
> ValueOut: IntWritable
>
> Reducer:
>
> KeyIn: ImmutableBytesWritable
> ValueIn: IntWritable
>
> At this point I would just add up the ints from the IntWritable.  This cuts
> down drastically on the amount of data passed around in the sort.
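
For illustration, the mapper of that size-only pass might look like the
sketch below (the class name is hypothetical; it assumes the input is
already (rowkey, KeyValue) pairs, e.g. read back from a SequenceFile):

import java.io.IOException;

import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapreduce.Mapper;

// Emits the serialized size of each KeyValue instead of the KeyValue
// itself, so the shuffle/sort moves one small int per cell rather than
// the full cell contents.
public class KeyValueSizeMapper extends
    Mapper<ImmutableBytesWritable, KeyValue, ImmutableBytesWritable, IntWritable> {

  private final IntWritable size = new IntWritable();

  @Override
  protected void map(ImmutableBytesWritable row, KeyValue kv, Context ctx)
      throws IOException, InterruptedException {
    size.set(kv.getLength());  // length of the serialized KeyValue
    ctx.write(row, size);
  }
}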
>
> Hope this helps.  If it is still too slow you might have to experiment with
> using many reducers and making sure you don't have holes or regions that
> are too big due to the way the keys are partitioned.  I was lucky enough to
> not have to go that far.
>
>
> On Thu, May 10, 2012 at 11:55 AM, Something Something <
> [EMAIL PROTECTED]> wrote:
>
> > I am beginning to get a sinking feeling about this :(  But I won't give
> > up!
> >
> > Problem is that when I use one Reducer the job runs for a long time.  I
> > killed it after about an hour.  Keep in mind, we do have a decent cluster
> > size.  The Map stage completes in a minute & when I set the no. of
> > reducers to 0 (which is not what we want) the job completes in 12
> > minutes.  In other words, sorting is taking very, very long!  What could
> > be the problem?
> >
> > Is there no other way to do the bulk upload without first *learning* the
> > data?
> >
> > On Thu, May 10, 2012 at 7:15 AM, Bryan Beaudreault <
> > [EMAIL PROTECTED]
> > > wrote:
> >
> > > Since our Key was ImmutableBytesWritable (representing a rowKey) and
> > > the Value was KeyValue, there could be many KeyValues per row key
> > > (thus values per hadoop key in the reducer).  So yes, what we did is
> > > very much the same as what you described.  Hadoop will sort the
> > > ImmutableBytesWritable keys before sending them to the reducer.  This
> > > is the primary sort.  We then loop the values for each key, adding up
> > > the size of each KeyValue until we reach the region size.  Each time
> > > that happens we record the rowKey from the hadoop key and use that as
> > > the start key for a new region.
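
A sketch of that reducer logic (the class name and the 1 GB target are
assumptions; it also relies on the job running with a single reducer,
as discussed above, so the running total is global):

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapreduce.Reducer;

// Keys arrive in row-key order.  Accumulate per-row KeyValue sizes;
// whenever the running total crosses the target region size, record the
// current row key as the start key of the next region.
public class RegionSplitReducer extends
    Reducer<ImmutableBytesWritable, IntWritable, ImmutableBytesWritable, IntWritable> {

  private static final long TARGET_REGION_SIZE = 1L << 30;  // assumed 1 GB target

  private long runningSize = 0;
  private final List<byte[]> startKeys = new ArrayList<byte[]>();

  @Override
  protected void reduce(ImmutableBytesWritable row, Iterable<IntWritable> sizes,
      Context ctx) throws IOException, InterruptedException {
    for (IntWritable s : sizes) {
      runningSize += s.get();
    }
    if (runningSize >= TARGET_REGION_SIZE) {
      startKeys.add(row.copyBytes());  // this row starts the next region
      runningSize = 0;
    }
  }

  @Override
  protected void cleanup(Context ctx) throws IOException, InterruptedException {
    // Emit the collected start keys; they can then be used to pre-create
    // the table's regions before running the real bulk-load job.
    for (byte[] key : startKeys) {
      ctx.write(new ImmutableBytesWritable(key), new IntWritable(0));
    }
  }
}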
> > >
> > > Secondary sort is not necessary unless the order of the values
> > > matters for you.  In this case (with the row key as the reducer key),
> > > I don't think that matters.
> > >
> > > On Thu, May 10, 2012 at 3:22 AM, Something Something <
> > > [EMAIL PROTECTED]> wrote:
> > >
> > > > Thank you Tim & Bryan for the responses.  Sorry for the delayed
> > > > response.  Got busy with other things.