HBase >> mail # user >> HBase Performance Improvements?


Re: HBase Performance Improvements?
Since our key was ImmutableBytesWritable (representing a rowKey) and the
value was KeyValue, there could be many KeyValues per row key (and thus
many values per hadoop key in the reducer).  So yes, what we did is very
much the same as what you described.  Hadoop will sort the
ImmutableBytesWritable keys before sending them to the reducer; this is
the primary sort.  We then loop over the values for each key, adding up
the size of each KeyValue until we reach the target region size.  Each
time that happens we record the rowKey from the hadoop key and use it as
the start key for a new region.

A secondary sort is not necessary unless the order of the values matters
to you.  In this case (with the row key as the reducer key), I don't
think it does.
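The core of that reducer-side accumulation can be sketched without the Hadoop plumbing. This is a hypothetical, self-contained illustration (the method name, the use of String keys instead of byte arrays, and the per-row sizes are all invented for the example): walk the row keys in sorted order, add up each row's serialized KeyValue bytes, and record a row key as the next region's start key every time the running total crosses the target region size.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the split-point logic described above.
// In the real job the keys are byte arrays (ImmutableBytesWritable)
// and the sizes come from KeyValue.getLength(); Strings and longs
// stand in here to keep the example self-contained.
public class SplitPoints {

  static List<String> computeStartKeys(List<String> sortedRowKeys,
                                       List<Long> rowSizes,
                                       long targetRegionBytes) {
    List<String> startKeys = new ArrayList<>();
    long bytesInRegion = 0;
    for (int i = 0; i < sortedRowKeys.size(); i++) {
      bytesInRegion += rowSizes.get(i);
      if (bytesInRegion >= targetRegionBytes) {
        // This row key becomes the start key of the next region.
        startKeys.add(sortedRowKeys.get(i));
        bytesInRegion = 0;
      }
    }
    return startKeys;
  }
}
```

With four rows of 10 bytes each and a 20-byte target, this yields start keys at the second and fourth rows, which matches the "record the rowKey each time the size threshold is reached" description above.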

On Thu, May 10, 2012 at 3:22 AM, Something Something <
[EMAIL PROTECTED]> wrote:

> Thank you Tim & Bryan for the responses.  Sorry for the delayed reply.
> Got busy with other things.
>
> Bryan - I decided to focus on the region split problem first.  The
> challenge here is to find the correct start key for each region, right?
> Here are the steps I could think of:
>
> 1)  Sort the keys.
> 2)  Count how many keys & divide by # of regions we want to create.  (e.g.
> 300).  This gives us # of keys in a region (region size).
> 3)  Loop thru the sorted keys & every time region size is reached, write
> down region # & starting key.  This info can later be used to create the
> table.
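The start keys recorded in step 3 can be fed straight into table creation. A hedged sketch using the old HBaseAdmin API (the table name, family name, and the loadStartKeys helper are invented; HBaseAdmin.createTable accepts an HTableDescriptor plus an array of split keys):

```java
// Hypothetical: create a pre-split table from the start keys computed
// in step 3.  "my_table", "f", and loadStartKeys() are made-up names.
Configuration conf = HBaseConfiguration.create();
HBaseAdmin admin = new HBaseAdmin(conf);

HTableDescriptor desc = new HTableDescriptor("my_table");
desc.addFamily(new HColumnDescriptor("f"));

// loadStartKeys() would read the start keys recorded by the MR job.
byte[][] splitKeys = loadStartKeys();

// One region per split key, plus the initial region before the first key.
admin.createTable(desc, splitKeys);
```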
>
> Honestly, I am not sure what you mean by "hadoop does this automatically".
> If you used a single reducer, did you use secondary sort
> (setOutputValueGroupingComparator) to sort the keys?  Did you loop thru the
> *values* to find regions?  Would appreciate it if you would describe this
> MR job.  Thanks.
>
>
> On Wed, May 9, 2012 at 8:25 AM, Bryan Beaudreault
> <[EMAIL PROTECTED]>wrote:
>
> > I also recently had this problem, trying to index 6+ billion records into
> > HBase.  The job would take about 4 hours before it brought down the
> > entire cluster, at only around 60% complete.
> >
> > After trying a bunch of things, we went to bulk loading.  This is
> > actually pretty easy, though the hardest part is that you need to have
> > a table ready with the region splits you are going to use.  Region
> > splits aside, there are 2 steps:
> >
> > 1) Change your job so that, instead of executing your Puts, it just
> > outputs them using context.write.  Put is Writable.  (We used
> > ImmutableBytesWritable as the key, representing the rowKey.)
> > 2) Add another job that reads that output and configure it
> > using HFileOutputFormat.configureIncrementalLoad(Job job, HTable table);
> >  This will add the right reducer.
> >
> > Once those two have run, you can finalize the process using the
> > completebulkload tool documented at
> > http://hbase.apache.org/bulk-loads.html
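A rough sketch of what the second job's driver might look like (hedged: the job name, class name, paths, and table name are all invented; this assumes the old HBase mapreduce API in which HFileOutputFormat.configureIncrementalLoad takes a Job and an HTable):

```java
// Hypothetical driver for the second job described above.
// "PutSortJob", the paths, and "my_table" are made up for illustration.
Configuration conf = HBaseConfiguration.create();
Job job = Job.getInstance(conf, "PutSortJob");
job.setJarByClass(PutSortJob.class);

FileInputFormat.addInputPath(job, new Path("/staging/puts"));      // job 1 output
FileOutputFormat.setOutputPath(job, new Path("/staging/hfiles"));  // HFiles to bulk load

// Identity mapper: pass the (rowKey, Put) pairs straight through.
job.setMapperClass(Mapper.class);

HTable table = new HTable(conf, "my_table");
// Sets the reducer, partitioner, and output format so the emitted HFiles
// line up with the table's current region boundaries.
HFileOutputFormat.configureIncrementalLoad(job, table);

job.waitForCompletion(true);
```

After this job finishes, the completebulkload tool linked above moves the generated HFiles into the table's regions.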
> >
> > For the region splits problem, we created another job which sorted all
> > of the puts by the key (hadoop does this automatically) and had a single
> > reducer.  It stepped through all of the Puts, adding up the total size
> > until it reached some threshold.  When it did, it recorded the byte
> > array and used it as the start key of the next region.  We used the
> > result of this job to create a new table.  There is probably a better
> > way to do this, but it takes like 20 minutes to write.
> >
> > This whole process took less than an hour, with the bulk load part only
> > taking 15 minutes.  Much better!
> >
> > On Wed, May 9, 2012 at 11:08 AM, Something Something <
> > [EMAIL PROTECTED]> wrote:
> >
> > > Hey Oliver,
> > >
> > > Thanks a "billion" for the response -:)  I will take any code you can
> > > provide even if it's a hack!  I will even send you an Amazon gift
> > > card - not that you care or need it -:)
> > >
> > > Can you share some performance statistics?  Thanks again.
> > >
> > >
> > > On Wed, May 9, 2012 at 8:02 AM, Oliver Meyn (GBIF) <[EMAIL PROTECTED]>
> > wrote:
> > >
> > > > Heya Something,
> > > >
> > > > I had a similar task recently and by far the best way to go about