-
HBase for ad-hoc aggregate queries
kfarmer 2012-01-11, 18:59
I'm taking a look at moving our datastore from Oracle to HBase, and trying to understand how HBase could be used for ad-hoc aggregation queries across our data.

My understanding is that MapReduce is more of a batch framework, so if we want a query to come back to the user within a few seconds, that won't work because of the overhead of running MR and because the MR jobs write back to a new table. Is that correct?

Instead, should we be pre-aggregating data into separate tables as we load it, and then, when a user queries, just do a scan on these pre-aggregated tables?

Thanks.
--
View this message in context: http://old.nabble.com/HBase-for-ad-hoc-aggregate-queries-tp33123313p33123313.html
Sent from the HBase User mailing list archive at Nabble.com.
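A minimal sketch of the pre-aggregation idea raised in the question, assuming a single measure summed per region and day. The Python dicts are hypothetical stand-ins for HBase tables, and the names `load_fact`, `query_total`, and the `region|day` row-key layout are illustrative assumptions, not anything from the thread. It only shows the shape of the approach: update a small aggregate table at load time, then answer queries with a short key-range scan.

```python
from collections import defaultdict

# Hypothetical in-memory stand-ins for two HBase tables; a real deployment
# would write through an HBase client instead of Python dicts.
raw_table = {}                      # row key -> raw fact
daily_totals = defaultdict(float)   # "region|day" -> running sum


def load_fact(fact_id, region, day, amount):
    """Store the raw fact and update the pre-aggregated total at load time."""
    raw_table[fact_id] = (region, day, amount)
    daily_totals[f"{region}|{day}"] += amount


def query_total(region, start_day, end_day):
    """Answer a query by scanning a key range of the small aggregate table."""
    lo, hi = f"{region}|{start_day}", f"{region}|{end_day}"
    return sum(v for k, v in sorted(daily_totals.items()) if lo <= k <= hi)


load_fact("f1", "emea", "2012-01-10", 5.0)
load_fact("f2", "emea", "2012-01-11", 7.5)
load_fact("f3", "apac", "2012-01-11", 2.0)
print(query_total("emea", "2012-01-10", "2012-01-11"))
```

The trade-off is that only the aggregates chosen at load time can be answered quickly; a truly ad-hoc grouping that was not anticipated still needs a scan of the raw data or a batch job.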
-
Re: HBase for ad-hoc aggregate queries
Doug Meil 2012-01-11, 19:05
re: "My understanding is MapReduce is more of a batch framework,"
Yes.

re: "and because the MR jobs write back to a new table."
They can write wherever they need to write (HDFS, HBase, etc.).

You will probably want to check out the Architecture, Data Model, and MapReduce chapters of the HBase Book/RefGuide: http://hbase.apache.org/book.html
-
Re: HBase for ad-hoc aggregate queries
kisalay 2012-01-11, 19:10
You can have a look at OpenTSDB, which does aggregations on the data: http://opentsdb.net/

Also, you can use endpoint coprocessors to do aggregations on a per-region basis and then merge the results: http://hbase-coprocessor-experiments.blogspot.com/2011/05/extending.html

Both of these approaches give you alternatives to traditional MR.
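The per-region-then-merge pattern mentioned for endpoint coprocessors can be sketched as follows. This is plain Python, not the HBase coprocessor API: the `regions` list, `region_partial`, and `merge_partials` are hypothetical names used only to illustrate that each region returns a small partial result and the client combines them.

```python
# Hypothetical regions: each holds a contiguous slice of rows (row key -> value).
regions = [
    {"row-001": 4, "row-002": 6},
    {"row-101": 10},
    {"row-201": 1, "row-202": 2, "row-203": 3},
]


def region_partial(rows):
    """Runs next to the data: reduce one region to a small (sum, count) pair."""
    values = list(rows.values())
    return sum(values), len(values)


def merge_partials(partials):
    """Runs on the client: combine the per-region pairs into one result."""
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    mean = total / count if count else None
    return total, count, mean


partials = [region_partial(r) for r in regions]
total, count, mean = merge_partials(partials)
print(total, count, mean)
```

Returning (sum, count) rather than a per-region mean is deliberate: sums and counts can be merged exactly, whereas averaging the regional averages would weight small regions too heavily.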
-
Re: HBase for ad-hoc aggregate queries
Ian Varley 2012-01-11, 19:15
And in case no one else says it ...

> I'm taking a look at moving our datastore from Oracle to HBase

This is a questionable project in the general case. HBase is not a relational store and lacks indexes, transactions, isolation, easy ad-hoc querying, and nearly everything else you get from Oracle. It may work for specific cases, but it's not usually prudent to think of it as "simply" converting from one database to another.
-
Re: HBase for ad-hoc aggregate queries
Dmitriy Lyubimov 2012-01-11, 19:48
IMO you will never get the same flexibility. There are also numerous differences in the data modelling approach (TTL, the requirement for uniformly distributed ids in order to scale query volume, etc.).

The most flexibility we have reached so far w.r.t. aggregation queries is an OLAP-ish model (see the link on the HBase wiki, supported projects, HBase-Lattice).

This is for aggregating really high qps RT fact streams, and the list of current limitations is huge, but it serves our purpose so far.

The most obvious benefits are that queries are fast (because of precomputed cuboids in a lattice, similar to the cuboid lattice approach in ROLAP), the incremental compilation cycle is short (one can grow and update the cube just a few minutes after a fact is fed into the system), and one can scale compilation horizontally for high-volume fact feeds. There's a fairly limited query language and a basic set of aggregate functions (along with some weighted time series aggregates as well).

The most severe limitation right now is the lack of a commonly used multidimensional query dialect such as MDX, which prevents use of the widely used UI pivoting exploratory clients such as Excel, JPivot, or Tableau. So it is either custom UI integration, or custom data source providers for canned reports with tools like Pentaho and Jasper, or some RT decisioning framework that doesn't require any UI at all and can use the Java API. I also plan to enable R to run queries against it (because I personally don't believe in doing ML or analytics using Excel).

-d
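To illustrate what "precomputed cuboids in a lattice" means in general, here is a small generic sketch. It is not HBase-Lattice's implementation or API; the dimension names, the `clicks` measure, and the functions `build_lattice` and `query` are hypothetical. Each subset of dimensions gets its own precomputed group-by table (a cuboid), so a query becomes a lookup rather than a scan of the facts.

```python
from collections import defaultdict
from itertools import combinations

DIMENSIONS = ("country", "device")  # hypothetical dimension names


def build_lattice(facts):
    """Precompute one cuboid (group-by table) per subset of dimensions."""
    lattice = {}
    for size in range(len(DIMENSIONS) + 1):
        for dims in combinations(DIMENSIONS, size):
            cuboid = defaultdict(int)
            for fact in facts:
                key = tuple(fact[d] for d in dims)
                cuboid[key] += fact["clicks"]
            lattice[dims] = dict(cuboid)
    return lattice


def query(lattice, **filters):
    """Look up a precomputed cell instead of scanning the facts."""
    dims = tuple(d for d in DIMENSIONS if d in filters)
    key = tuple(filters[d] for d in dims)
    return lattice[dims].get(key, 0)


facts = [
    {"country": "US", "device": "phone", "clicks": 3},
    {"country": "US", "device": "tablet", "clicks": 2},
    {"country": "DE", "device": "phone", "clicks": 4},
]
lattice = build_lattice(facts)
print(query(lattice, country="US"))
print(query(lattice, device="phone"))
print(query(lattice))
```

The number of cuboids grows as 2^n with the number of dimensions, which is one reason such systems restrict the query language and the set of aggregate functions.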
-
Re: HBase for ad-hoc aggregate queries
Dmitriy Lyubimov 2012-01-11, 19:52
Bottom line, IMO you have to consider how your data is organized. For 90% of a relational schema (but perhaps 10% of the volume), the move to HBase-based solutions is not warranted. However, for 10% of the schema (and 90% of the volume), you may consider using HBase-based solutions. Most typically, these are time series data feeds.

-d