HBase user mailing list: HBase Performance Improvements?


Re: HBase Performance Improvements?
Hello Bryan & Oliver,

I am using suggestions from both of you to do the bulk upload.  The problem
I am running into is that the job that uses
'HFileOutputFormat.configureIncrementalLoad' is taking very long to complete.
One thing I noticed is that it's using only 1 Reducer.

When I looked at the source code for HFileOutputFormat, I noticed that the
number of Reducers is determined by this:

    List<ImmutableBytesWritable> startKeys = getRegionStartKeys(table);
    LOG.info("Configuring " + startKeys.size() + " reduce partitions " +
        "to match current region count");
    job.setNumReduceTasks(startKeys.size());

When I look at the log I see this:

12/05/16 03:11:02 INFO mapreduce.HFileOutputFormat: Configuring 301 reduce
partitions to match current region count

which implies that the regions were created successfully.  But shouldn't
this set the number of Reducers to 301?  What am I missing?
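
For reference, a minimal driver along these lines looks roughly like the sketch
below (the table name, the paths, and the line-parsing mapper are placeholders,
not the actual job):

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat;
    import org.apache.hadoop.hbase.util.Bytes;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class BulkLoadDriver {

      // Placeholder mapper: assumes tab-separated lines of
      // rowkey, family, qualifier, value and turns each line into a Put.
      public static class MyPutMapper
          extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {
        @Override
        protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
          String[] fields = line.toString().split("\t");
          byte[] row = Bytes.toBytes(fields[0]);
          Put put = new Put(row);
          put.add(Bytes.toBytes(fields[1]), Bytes.toBytes(fields[2]),
              Bytes.toBytes(fields[3]));
          context.write(new ImmutableBytesWritable(row), put);
        }
      }

      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        // The target table must already exist, pre-split into regions.
        HTable table = new HTable(conf, "my_table");

        Job job = new Job(conf, "bulk-load-prepare");
        job.setJarByClass(BulkLoadDriver.class);
        job.setMapperClass(MyPutMapper.class);
        job.setMapOutputKeyClass(ImmutableBytesWritable.class);
        job.setMapOutputValueClass(Put.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Wires in the TotalOrderPartitioner and sets the number of reduce
        // tasks to the table's current region count (301 in the log above).
        HFileOutputFormat.configureIncrementalLoad(job, table);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

With Put as the map output value class, configureIncrementalLoad also plugs in
the matching sort reducer, so nothing in the driver sets the reducer or the
reducer count explicitly.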

Thanks for your help.

On Thu, May 10, 2012 at 9:04 AM, Bryan Beaudreault <[EMAIL PROTECTED]> wrote:

> I don't think there is.  You need to have a table seeded with the right
> regions in order to run the bulk loader jobs.
>
> My machines are sufficiently fast that it did not take that long to sort.
>  One thing I did do to speed this up was add a mapper to the job that
> generates the splits, which would calculate the size of each KeyValue.  So
> instead of passing around the KeyValues I would pass around just the size
> of the KeyValues.  You could do a similar thing with the Puts.  Here are my
> keys/values for the job in full:
>
> Mapper:
>
> KeyIn: ImmutableBytesWritable
> ValueIn: KeyValue
>
> KeyOut: ImmutableBytesWritable
> ValueOut: IntWritable
>
> Reducer:
>
> KeyIn: ImmutableBytesWritable
> ValueIn: IntWritable
>
> At this point I would just add up the ints from the IntWritable.  This cuts
> down drastically on the amount of data passed around in the sort.
>
> Hope this helps.  If it is still too slow you might have to experiment with
> using many reducers and making sure you don't have holes or regions that
> are too big due to the way the keys are partitioned.  I was lucky enough to
> not have to go that far.
>
>
> On Thu, May 10, 2012 at 11:55 AM, Something Something <[EMAIL PROTECTED]> wrote:
>
> > I am beginning to get a sinking feeling about this :(  But I won't give up!
> >
> > Problem is that when I use one Reducer the job runs for a long time.  I
> > killed it after about an hour.  Keep in mind, we do have a decent cluster
> > size.  The Map stage completes in a minute & when I set the number of
> > reducers to 0 (which is not what we want) the job completes in 12 minutes.
> > In other words, sorting is taking very, very long!  What could be the problem?
> >
> > Is there no other way to do the bulk upload without first *learning* the
> > data?
> >
> > On Thu, May 10, 2012 at 7:15 AM, Bryan Beaudreault <[EMAIL PROTECTED]> wrote:
> >
> > > Since our Key was ImmutableBytesWritable (representing a rowKey) and the
> > > Value was KeyValue, there could be many KeyValues per row key (thus values
> > > per hadoop key in the reducer).  So yes, what we did is very much the same
> > > as what you described.  Hadoop will sort the ImmutableBytesWritable keys
> > > before sending them to the reducer.  This is the primary sort.  We then
> > > loop the values for each key, adding up the size of each KeyValue until we
> > > reach the region size.  Each time that happens we record the rowKey from
> > > the hadoop key and use that as the start key for a new region.
> > >
> > > Secondary sort is not necessary unless the order of the values matters for
> > > you.  In this case (with the row key as the reducer key), I don't think
> > > that matters.
> > >
> > > On Thu, May 10, 2012 at 3:22 AM, Something Something <[EMAIL PROTECTED]> wrote:
> > >
> > > > Thank you Tim & Bryan for the responses.  Sorry for the delayed response.
> > > > Got busy with other things.
>
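
For what it's worth, here is a rough sketch of how I understand the
split-calculation job Bryan describes above (the class names, the target region
size, and the NullWritable output are my own guesses, not his actual code):

    import java.io.IOException;

    import org.apache.hadoop.hbase.KeyValue;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Mapper: replace each KeyValue with just its serialized length, so the
    // shuffle/sort only has to move a small IntWritable per cell.
    class KeyValueSizeMapper
        extends Mapper<ImmutableBytesWritable, KeyValue, ImmutableBytesWritable, IntWritable> {

      private final IntWritable size = new IntWritable();

      @Override
      protected void map(ImmutableBytesWritable rowKey, KeyValue kv, Context context)
          throws IOException, InterruptedException {
        size.set(kv.getLength());
        context.write(rowKey, size);
      }
    }

    // Reducer: row keys arrive in sorted order; add up the sizes and emit a
    // new region start key whenever the running total crosses the target size.
    class RegionSplitReducer
        extends Reducer<ImmutableBytesWritable, IntWritable, ImmutableBytesWritable, NullWritable> {

      private static final long TARGET_REGION_BYTES = 1024L * 1024 * 1024;  // assumed 1 GB per region

      private long bytesInCurrentRegion = 0;

      @Override
      protected void reduce(ImmutableBytesWritable rowKey, Iterable<IntWritable> sizes, Context context)
          throws IOException, InterruptedException {
        for (IntWritable size : sizes) {
          bytesInCurrentRegion += size.get();
        }
        if (bytesInCurrentRegion >= TARGET_REGION_BYTES) {
          context.write(rowKey, NullWritable.get());  // this rowKey becomes a region start key
          bytesInCurrentRegion = 0;
        }
      }
    }

Because every row key has to reach a single reducer in globally sorted order,
this job runs with one reduce task, which is why the sort is the slow part
discussed above.  The emitted keys would then be used to pre-create the table's
regions (for example with HBaseAdmin.createTable(tableDescriptor, splitKeys))
before running the HFileOutputFormat job.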