Pig, mail # user - querying salted Hbase tables


Norbert Burger 2011-09-12, 14:51
Dmitriy Ryaboy 2011-09-13, 14:43
Re: querying salted Hbase tables
Norbert Burger 2011-09-13, 15:08
We tried using multiple LOADs because we want to minimize the data loaded
and take advantage of the pushdown filter support for -gte and -lte in
HBaseStorage.  At the same time, a salted key schema forces different key
prefixes, so we ended up with 14 LOADs, one for each salted region.
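
To make the "one LOAD per salt" idea concrete, here is a rough sketch in Python that expands a single time window into 14 key ranges and prints the matching Pig statements. Several things here are assumptions rather than facts from our setup: the table name `events` and column spec `d:payload` are made up, timestamps are assumed to be fixed-width (10-digit) so that string order matches numeric order, session IDs are assumed to be plain ASCII alphanumerics (so `~` sorts after them), and the HBaseStorage option string should be checked against your Pig version.

```python
SALT_BUCKETS = 14  # salts 00..13, as in the rowkey format described in this thread


def salted_ranges(start_ts, end_ts, buckets=SALT_BUCKETS):
    """Return one (salt, gte_key, lte_key) triple per salt prefix."""
    ranges = []
    for salt in range(buckets):
        prefix = f"{salt:02d}"
        gte = f"{prefix}-{start_ts}"
        # Assumption: '~' sorts after every character used in session IDs,
        # so this bound still includes keys starting with "<salt>-<end_ts>-".
        lte = f"{prefix}-{end_ts}-~"
        ranges.append((prefix, gte, lte))
    return ranges


def pig_loads(table, columns, start_ts, end_ts):
    """Build one Pig LOAD per salt plus a UNION over all of them."""
    lines = []
    aliases = []
    for prefix, gte, lte in salted_ranges(start_ts, end_ts):
        alias = f"salt_{prefix}"
        aliases.append(alias)
        lines.append(
            f"{alias} = LOAD 'hbase://{table}' USING "
            f"org.apache.pig.backend.hadoop.hbase.HBaseStorage("
            f"'{columns}', '-loadKey true -gte {gte} -lte {lte}');"
        )
    lines.append(f"events = UNION {', '.join(aliases)};")
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical table and column names, one-day window.
    print(pig_loads("events", "d:payload", 1315800000, 1315886400))
```

Generating the script this way keeps the 14 ranges consistent with each other, but it does not change how many scanners Pig opens, so it would not by itself address the timeouts.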

After some research, it looks like the Mozilla folks solved this issue in
Socorro by writing a custom LoadFunc:
https://github.com/mozilla-metrics/akela/blob/master/src/main/java/com/mozilla/pig/load/HBaseMultiScanLoader.java

The custom LoadFunc seems cleaner, since we can manipulate the 14 HBase
scanners directly, but at the cost of writing some Java glue code.  Even so,
should we expect the 14 Pig LOADs to work as well?

I'll check and see why the scanners are timing out.  We do have automatic
splitting turned on, but the region size is high enough (1 GB) that they
shouldn't be splitting often.  The HBase rebalancer is probably turned on -
would this be enough to cause the timeouts?

Norbert

On Tue, Sep 13, 2011 at 10:43 AM, Dmitriy Ryaboy <[EMAIL PROTECTED]> wrote:

> Why not just one load?
>
> Check why the scanners are timing out. Are the regions splitting under you
> while you scan? Do you have the hbase rebalancer turned on?
>
> On Sep 12, 2011, at 7:51 AM, Norbert Burger <[EMAIL PROTECTED]>
> wrote:
>
> > Folks -- we have a timeseries-based table we recently converted to a salted
> > key schema [1] in order to avoid region hotspotting.  The rowkey format is:
> >
> > salt-timestamp-sessionid-eventtype, where:
> >
> > salt has the form 00..13, and the timestamp is a Unix timestamp (epoch
> > based).
> >
> > With the version 0.10.0 HBaseStorage, what's the recommended way to LOAD a
> > salted schema from Pig?  Initially, I thought we'd just fire off multiple
> > LOADs, one for each region (in our case, up to 14), but we're frequently
> > hitting ScannerTimeoutExceptions with this approach, even on a sample
> > script that does nothing but LOADs.
> >
> > Is there a better way?
> >
> > Thanks,
> > Norbert
> >
> > [1] http://ofps.oreilly.com/titles/9781449396107/advanced.html#ch09_id2336987
>
Dmitriy Ryaboy 2011-09-13, 16:35