Re: HBase Performance Improvements?
Thank you Tim & Bryan for the responses.  Sorry for the delayed reply; I
got busy with other things.

Bryan - I decided to focus on the region split problem first.  The
challenge here is to find the correct start key for each region, right?
Here are the steps I could think of:

1)  Sort the keys.
2)  Count the keys & divide by the # of regions we want to create (e.g.
300).  This gives us the # of keys per region (region size).
3)  Loop through the sorted keys & every time the region size is reached,
record the region # & its starting key.  This info can later be used to
create the table (rough sketch below).
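
Something like this is what I have in mind for step 3 (just a rough sketch,
assuming the sorted keys fit in memory as a List; class and method names are
made up):

    import java.util.ArrayList;
    import java.util.List;

    // Rough sketch (made-up names): pick a start key for each region from an
    // already-sorted list of row keys.
    public class SplitKeyPicker {
      public static List<byte[]> pickStartKeys(List<byte[]> sortedKeys, int numRegions) {
        List<byte[]> startKeys = new ArrayList<byte[]>();
        int regionSize = sortedKeys.size() / numRegions;      // keys per region (step 2)
        for (int i = regionSize; i < sortedKeys.size(); i += regionSize) {
          startKeys.add(sortedKeys.get(i));                   // first key of the next region (step 3)
        }
        // startKeys could then be turned into the byte[][] splitKeys argument
        // of HBaseAdmin.createTable(tableDescriptor, splitKeys).
        return startKeys;
      }
    }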

Honestly, I am not sure what you mean by "hadoop does this automatically".
If you used a single reducer, did you use secondary sort
(setOutputValueGroupingComparator) to sort the keys?  Did you loop through
the *values* to find regions?  I would appreciate it if you could describe
this MR job.  Thanks.
On Wed, May 9, 2012 at 8:25 AM, Bryan Beaudreault
<[EMAIL PROTECTED]> wrote:

> I also recently had this problem, trying to index 6+ billion records into
> HBase.  The job would take about 4 hours before it brought down the entire
> cluster, at only around 60% complete.
>
> After trying a bunch of things, we went to bulk loading.  This is actually
> pretty easy, though the hardest part is that you need to have a table ready
> with the region splits you are going to use.  Region splits aside, there
> are 2 steps:
>
> 1) Change your job so that instead of executing your Puts, it just outputs
> them using context.write.  Put is Writable. (We used ImmutableBytesWritable
> as the key, representing the rowKey.)
> 2) Add another job that reads that output as its input and configure it
> using HFileOutputFormat.configureIncrementalLoad(Job job, HTable table);
>  This will add the right reducer.  (Rough sketch of both steps below.)
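>
> To make it concrete, here is a sketch of the step-2 driver (from memory,
> not our actual code; paths and table name come from args, and step 1 is
> assumed to have written a SequenceFile of (ImmutableBytesWritable, Put)
> pairs):
>
>   import org.apache.hadoop.conf.Configuration;
>   import org.apache.hadoop.fs.Path;
>   import org.apache.hadoop.hbase.HBaseConfiguration;
>   import org.apache.hadoop.hbase.client.HTable;
>   import org.apache.hadoop.hbase.client.Put;
>   import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
>   import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat;
>   import org.apache.hadoop.mapreduce.Job;
>   import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
>   import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
>   import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
>
>   public class HFilePrepJob {
>     public static void main(String[] args) throws Exception {
>       // Step 1 happens in your existing job: its mapper does
>       //   context.write(new ImmutableBytesWritable(rowKey), put);
>       // instead of table.put(put), and writes to a SequenceFile.
>
>       // Step 2: read those (rowkey, Put) pairs and write HFiles.
>       Configuration conf = HBaseConfiguration.create();
>       Job job = new Job(conf, "hfile-prep");
>       job.setJarByClass(HFilePrepJob.class);
>       job.setInputFormatClass(SequenceFileInputFormat.class); // default identity mapper passes pairs through
>       job.setMapOutputKeyClass(ImmutableBytesWritable.class);
>       job.setMapOutputValueClass(Put.class);
>       FileInputFormat.addInputPath(job, new Path(args[0]));   // output dir of step 1
>       FileOutputFormat.setOutputPath(job, new Path(args[1])); // where the HFiles go
>       HTable table = new HTable(conf, args[2]);               // must already exist with the region splits
>       HFileOutputFormat.configureIncrementalLoad(job, table); // wires in the right reducer + partitioner
>       System.exit(job.waitForCompletion(true) ? 0 : 1);
>     }
>   }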
>
> Once those two have run, you can finalize the process using the
> completebulkload tool documented at
> http://hbase.apache.org/bulk-loads.html
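>
> The invocation ends up looking something like this (jar version and paths
> are whatever you have; see the doc above for the exact form):
>
>   hadoop jar hbase-VERSION.jar completebulkload /path/to/hfile/output my_table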
>
> For the region splits problem, we created another job which sorted all of
> the puts by the key (hadoop does this automatically) and had a single
> reducer.  It stepped through all of the Puts, accumulating the total size
> until it reached some threshold.  When it did, it recorded the byte array
> and used that as the start of the next region.  We used the result of this
> job to create a new table.  There is probably a better way to do this, but
> it takes like 20 minutes to write.
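>
> The reducer is roughly this (a sketch from memory; the threshold and the
> output format are made up):
>
>   import java.io.IOException;
>   import org.apache.hadoop.hbase.client.Put;
>   import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
>   import org.apache.hadoop.hbase.util.Bytes;
>   import org.apache.hadoop.io.NullWritable;
>   import org.apache.hadoop.io.Text;
>   import org.apache.hadoop.mapreduce.Reducer;
>
>   // Single reducer: keys arrive sorted, so accumulate Put sizes and emit a
>   // row key whenever the running total crosses the region-size threshold.
>   public class RegionBoundaryReducer
>       extends Reducer<ImmutableBytesWritable, Put, NullWritable, Text> {
>     private static final long THRESHOLD = 1024L * 1024 * 1024;  // ~1 GB per region, made up
>     private long accumulated = 0;
>
>     @Override
>     protected void reduce(ImmutableBytesWritable rowKey, Iterable<Put> puts, Context context)
>         throws IOException, InterruptedException {
>       for (Put put : puts) {
>         accumulated += put.heapSize();                     // rough size of this Put
>       }
>       if (accumulated >= THRESHOLD) {
>         // This row key becomes the start key of the next region.
>         context.write(NullWritable.get(), new Text(Bytes.toStringBinary(rowKey.get())));
>         accumulated = 0;
>       }
>     }
>   }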
>
> This whole process took less than an hour, with the bulk load part only
> taking 15 minutes.  Much better!
>
> On Wed, May 9, 2012 at 11:08 AM, Something Something <
> [EMAIL PROTECTED]> wrote:
>
> > Hey Oliver,
> >
> > Thanks a "billion" for the response -:)  I will take any code you can
> > provide even if it's a hack!  I will even send you an Amazon gift card -
> > not that you care or need it -:)
> >
> > Can you share some performance statistics?  Thanks again.
> >
> >
> > On Wed, May 9, 2012 at 8:02 AM, Oliver Meyn (GBIF) <[EMAIL PROTECTED]>
> > wrote:
> >
> > > Heya Something,
> > >
> > > I had a similar task recently and by far the best way to go about this
> > > is with bulk loading after pre-splitting your target table.  As you know
> > > ImportTsv doesn't understand Avro files so I hacked together my own
> > > ImportAvro class to create the HFiles that I eventually moved into HBase
> > > with completebulkload.  I haven't committed my class anywhere because
> > > it's a pretty ugly hack, but I'm happy to share it with you as a starting
> > > point.  Doing billions of puts will just drive you crazy.
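> > >
> > > The wiring is roughly this (just the job setup, not the real class; the
> > > record schema, mapper, paths, and table name are placeholders):
> > >
> > >   Configuration conf = HBaseConfiguration.create();
> > >   Job job = new Job(conf, "ImportAvro");
> > >   job.setInputFormatClass(AvroKeyInputFormat.class);          // read the Avro files
> > >   AvroJob.setInputKeySchema(job, MyRecord.getClassSchema());  // placeholder record schema
> > >   job.setMapperClass(AvroToPutMapper.class);                  // placeholder: emits (ImmutableBytesWritable, Put)
> > >   job.setMapOutputKeyClass(ImmutableBytesWritable.class);
> > >   job.setMapOutputValueClass(Put.class);
> > >   FileInputFormat.addInputPath(job, new Path("/data/avro"));
> > >   FileOutputFormat.setOutputPath(job, new Path("/data/hfiles"));
> > >   HTable table = new HTable(conf, "target_table");            // pre-split table
> > >   HFileOutputFormat.configureIncrementalLoad(job, table);     // HFiles for completebulkload
> > >   job.waitForCompletion(true);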
> > >
> > > Cheers,
> > > Oliver
> > >
> > > On 2012-05-09, at 4:51 PM, Something Something wrote:
> > >
> > > > I ran the following MR job that reads AVRO files & puts them on HBase.
> > > > The files have tons of data (billions).  We have a fairly decent size
> > > > cluster.  When I ran this MR job, it brought down HBase.  When I
> > > > commented out the Puts on HBase, the job completed in 45 seconds (yes
> > > > that's seconds).