Paul Nickerson 2011-07-25, 02:43
Have you taken a look at HBase coprocessors? I think you will find them
<https://github.com/sonalgoyal/hiho>Hadoop ETL and Data
Nube Technologies <http://www.nubetech.co>
On Mon, Jul 25, 2011 at 8:13 AM, Paul Nickerson <[EMAIL PROTECTED]> wrote:
> I would like to implement a multidimensional query system that aggregates
> large amounts of data on-the-fly by fanning out queries in parallel. It
> should be fast enough for interactive exploration of the data and extensible
> enough to take sets of hundreds or thousands of dimensions with high
> cardinality, and aggregate them from high granularity to low granularity.
> Dimensions and their values are stored in the row key. For instance, row
> keys look like this
> and each row contains numerical values within their column families, such
> as plays=100, versioned by the date of calculation.
> A user wants the top "Foo" values with blah=123, sorted in descending order by
> total plays in July. My current thinking is that a query would be executed by
> grouping all Foo-prefixed row keys by region server and sending the query to
> each of those. Each region server iterates through all of its row keys that
> start with Foo=something,blah=, and passes the query on to all regions
> containing blahs that equal 123, which then contain play counts. Matching
> row keys, as well as the sum of all their play values within July, are
> passed back up the chain and sorted/truncated when possible.
> It seems quite complicated and would involve either modifying the HBase source
> code or, at the very least, using the deep internals of the API. Does this
> seem like a practical solution, or could someone offer some ideas?
> Thank you!
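
The fan-out query described in the quoted message can be sketched roughly as below. This is a minimal in-memory illustration in Python, not HBase code. The `REGIONS` list, the `Foo=<value>,blah=<value>` row-key layout, the sample play counts, and the function names `parse_key`, `scan_region`, and `top_n` are all assumptions made for the example; the original row-key sample is missing from the message. Each "region" computes a partial sum per Foo value in parallel, and the caller merges the partials before sorting and truncating.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from datetime import date

# Hypothetical stand-in for region servers: each region maps a row key to a
# list of (calculation_date, plays) cells. The key layout is an assumption.
REGIONS = [
    {
        "Foo=a,blah=123": [(date(2011, 7, 1), 100), (date(2011, 7, 2), 50)],
        "Foo=a,blah=999": [(date(2011, 7, 1), 500)],
    },
    {
        "Foo=b,blah=123": [(date(2011, 7, 3), 400), (date(2011, 6, 30), 1000)],
        "Foo=c,blah=123": [(date(2011, 7, 4), 20)],
    },
]


def parse_key(row_key):
    """Split 'Foo=a,blah=123' into {'Foo': 'a', 'blah': '123'}."""
    return dict(part.split("=", 1) for part in row_key.split(","))


def scan_region(region, group_dim, filters, start, end):
    """Partial aggregation for one region: sum plays per group_dim value."""
    partial = Counter()
    for row_key, cells in region.items():
        dims = parse_key(row_key)
        if group_dim not in dims:
            continue
        # Skip rows whose dimensions do not match every filter.
        if any(dims.get(k) != v for k, v in filters.items()):
            continue
        # Only count cells whose calculation date falls in the range.
        partial[dims[group_dim]] += sum(
            plays for day, plays in cells if start <= day <= end
        )
    return partial


def top_n(regions, group_dim, filters, start, end, n):
    """Fan the query out to every region, merge partial sums, keep the top n."""
    with ThreadPoolExecutor() as pool:
        partials = pool.map(
            lambda r: scan_region(r, group_dim, filters, start, end), regions
        )
        total = Counter()
        for partial in partials:
            total.update(partial)  # adds counts key by key
    return total.most_common(n)  # sorted by total plays, descending


result = top_n(
    REGIONS, "Foo", {"blah": "123"}, date(2011, 7, 1), date(2011, 7, 31), 2
)
print(result)
```

Note that this sketch merges the full partial results before truncating. Truncating inside each region is only safe if all rows for a given Foo value live in the same region; otherwise a value whose plays are spread across regions could be dropped too early.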
Paul Nickerson 2011-07-25, 04:45
Michel Segel 2011-07-25, 14:37
Gary Helmling 2011-07-25, 18:02
Paul Nickerson 2011-07-25, 20:23
Stack 2011-07-25, 21:25
Gary Helmling 2011-07-25, 21:26