HBase >> mail # user >> Fanning out hbase queries in parallel


Paul Nickerson 2011-07-25, 02:43
Sonal Goyal 2011-07-25, 04:03
Paul Nickerson 2011-07-25, 04:45
Re: Fanning out hbase queries in parallel
Which release(s) have coprocessors enabled?

Sent from a remote device. Please excuse any typos...

Mike Segel

On Jul 24, 2011, at 11:03 PM, Sonal Goyal <[EMAIL PROTECTED]> wrote:

> Hi Paul,
>
> Have you taken a look at HBase coprocessors? I think you will find them
> useful.
>
> Best Regards,
> Sonal
> Hadoop ETL and Data Integration <https://github.com/sonalgoyal/hiho>
> Nube Technologies <http://www.nubetech.co>
> <http://in.linkedin.com/in/sonalgoyal>
>
> On Mon, Jul 25, 2011 at 8:13 AM, Paul Nickerson <[EMAIL PROTECTED]> wrote:
>
>>
>> I would like to implement a multidimensional query system that aggregates
>> large amounts of data on-the-fly by fanning out queries in parallel. It
>> should be fast enough for interactive exploration of the data and extensible
>> enough to take sets of hundreds or thousands of dimensions with high
>> cardinality, and aggregate them from high granularity to low granularity.
>> Dimensions and their values are stored in the row key. For instance, row
>> keys look like this:
>> Foo=bar,blah=123
>> and each row contains numerical values within its column families, such
>> as plays=100, versioned by the date of calculation.
>> Suppose a user wants the top "Foo" values with blah=123, sorted in
>> descending order by total plays in July. My current thinking is that a
>> query would be executed by grouping all Foo-prefixed row keys by region
>> server and sending the query to each of those servers. Each region server
>> iterates through all of its row keys that start with Foo=something,blah=,
>> and passes the query on to all regions containing blahs that equal 123,
>> which then contain play counts. Matching row keys, as well as the sum of
>> all their play values within July, are passed back up the chain and
>> sorted/truncated when possible.
>>
>>
>> It seems quite complicated and would involve either modifying the HBase
>> source code or, at the very least, using the deep internals of the API.
>> Does this seem like a practical solution, or could someone offer some ideas?
>>
>>
>> Thank you!
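
The fan-out described in the quoted message can be sketched roughly as below. This is only an illustration under stated assumptions, not HBase code: regions are simulated as in-memory dicts, and the names REGIONS, parse_key, scan_region and top_values are hypothetical. Each "region" filters its own row keys, sums plays inside the date range, and returns a truncated local result; the caller merges the partial results and truncates again.

```python
from concurrent.futures import ThreadPoolExecutor
from datetime import date
import heapq

# Hypothetical in-memory stand-in for regions (not the HBase API): each region
# maps a row key to a list of (calculation_date, plays) versions.
REGIONS = [
    {
        "Foo=a,blah=123": [(date(2011, 7, 1), 100), (date(2011, 7, 15), 50)],
        "Foo=a,blah=456": [(date(2011, 7, 2), 999)],
        "Foo=b,blah=123": [(date(2011, 6, 30), 500), (date(2011, 7, 3), 20)],
    },
    {
        "Foo=c,blah=123": [(date(2011, 7, 10), 300)],
        "Foo=d,blah=123": [(date(2011, 7, 20), 70), (date(2011, 7, 21), 90)],
    },
]


def parse_key(row_key):
    """Split 'Foo=bar,blah=123' into {'Foo': 'bar', 'blah': '123'}."""
    return dict(part.split("=", 1) for part in row_key.split(","))


def scan_region(region, group_dim, filters, start, end, limit):
    """Region-local step: filter rows, sum plays in [start, end], keep the top rows."""
    totals = []
    for row_key, versions in region.items():
        dims = parse_key(row_key)
        if group_dim not in dims:
            continue
        if any(dims.get(name) != value for name, value in filters.items()):
            continue
        total = sum(plays for day, plays in versions if start <= day <= end)
        totals.append((total, dims[group_dim]))
    return heapq.nlargest(limit, totals)


def top_values(regions, group_dim, filters, start, end, limit):
    """Client-side step: query every region in parallel, then merge and truncate."""
    with ThreadPoolExecutor() as pool:
        partials = pool.map(
            lambda r: scan_region(r, group_dim, filters, start, end, limit),
            regions,
        )
        merged = [item for partial in partials for item in partial]
    return [(value, total) for total, value in heapq.nlargest(limit, merged)]


result = top_values(
    REGIONS, "Foo", {"blah": "123"}, date(2011, 7, 1), date(2011, 7, 31), 2
)
print(result)
```

Truncating inside each region is only exact here because every matching Foo value lives in a single row. If row keys carried further dimensions, the same Foo value could appear in several rows or regions, and the partial sums would have to be merged per value before any truncation.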
Gary Helmling 2011-07-25, 18:02
Paul Nickerson 2011-07-25, 20:23
Stack 2011-07-25, 21:25
Gary Helmling 2011-07-25, 21:26