Re: Fanning out hbase queries in parallel
Michel Segel 2011-07-25, 14:37
Which release(s) have coprocessors enabled?
Sent from a remote device. Please excuse any typos...
On Jul 24, 2011, at 11:03 PM, Sonal Goyal <[EMAIL PROTECTED]> wrote:
> Hi Paul,
> Have you taken a look at HBase coprocessors? I think you will find them
> Best Regards,
> <https://github.com/sonalgoyal/hiho>Hadoop ETL and Data
> Nube Technologies <http://www.nubetech.co>
> On Mon, Jul 25, 2011 at 8:13 AM, Paul Nickerson <[EMAIL PROTECTED]> wrote:
>> I would like to implement a multidimensional query system that aggregates
>> large amounts of data on-the-fly by fanning out queries in parallel. It
>> should be fast enough for interactive exploration of the data and extensible
>> enough to take sets of hundreds or thousands of dimensions with high
>> cardinality, and aggregate them from high granularity to low granularity.
>> Dimensions and their values are stored in the row key. For instance, row
>> keys look like this
>> and each row contains numerical values within their column families, such
>> as plays=100, versioned by the date of calculation.
>> A user wants the top "Foo" values with blah=123, sorted in descending order
>> by total plays in July. My current thinking is that a query would be executed
>> by grouping all Foo-prefixed row keys by region server and sending the query
>> to each of those. Each region server iterates through all of its row keys that
>> start with Foo=something,blah=, and passes the query on to all regions
>> containing blahs that equal 123, which then contain play counts. Matching
>> row keys, as well as the sum of all their play values within July, are
>> passed back up the chain and sorted/truncated when possible.
>> It seems quite complicated and would involve either modifying the HBase
>> source code or, at the very least, using the deep internals of the API. Does
>> this seem like a practical solution, or could someone offer some ideas?
>> Thank you!
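
For readers following the thread, the fan-out described in Paul's message can be sketched as a scatter/gather over in-memory data. This is only an illustration, not HBase code: the `REGIONS` list stands in for region servers, the row key layout `Foo=<value>,blah=<value>` is an assumption based on the `Foo=something,blah=` prefix in the message, and the function names (`parse_key`, `scan_region`, `fan_out_top_n`) are hypothetical.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
import heapq

# Hypothetical stand-in for region servers: each "region" is a list of
# (row_key, {calculation_date: plays}) pairs. The key layout is assumed.
REGIONS = [
    [("Foo=a,blah=123", {"2011-07-01": 100, "2011-07-15": 50, "2011-06-30": 999}),
     ("Foo=a,blah=456", {"2011-07-02": 500})],
    [("Foo=b,blah=123", {"2011-07-03": 300}),
     ("Foo=c,blah=123", {"2011-07-04": 20, "2011-07-05": 5})],
]

def parse_key(row_key):
    """Split 'Foo=a,blah=123' into {'Foo': 'a', 'blah': '123'}."""
    return dict(part.split("=", 1) for part in row_key.split(","))

def scan_region(region, group_dim, filters, date_prefix):
    """Local step: filter rows and sum plays per value of group_dim."""
    partial = defaultdict(int)
    for row_key, versions in region:
        dims = parse_key(row_key)
        if group_dim not in dims:
            continue
        if any(dims.get(k) != v for k, v in filters.items()):
            continue
        # Only versions whose date falls in the requested period are summed.
        plays = sum(n for date, n in versions.items() if date.startswith(date_prefix))
        partial[dims[group_dim]] += plays
    return dict(partial)

def fan_out_top_n(regions, group_dim, filters, date_prefix, n):
    """Scatter the query to every region in parallel, then merge and rank."""
    with ThreadPoolExecutor(max_workers=len(regions)) as pool:
        partials = list(pool.map(
            lambda r: scan_region(r, group_dim, filters, date_prefix), regions))
    merged = defaultdict(int)
    for partial in partials:
        for value, plays in partial.items():
            merged[value] += plays
    # Sort descending by total plays and keep the top n.
    return heapq.nlargest(n, merged.items(), key=lambda kv: kv[1])

top = fan_out_top_n(REGIONS, "Foo", {"blah": "123"}, "2011-07", 2)
print(top)
```

Note that the sketch merges all partial sums before truncating. A given Foo value may have rows in more than one region, so cutting each region's result to its local top N before merging could drop a value that ranks highly overall.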