-
HBase => replication => Hive
Otis Gospodnetic 2011-03-11, 06:43
Hi,

Since HBase has a mechanism to replicate edit logs to another HBase cluster, I was wondering whether people think it would be possible to implement HBase => Hive replication (and really make the destination pluggable later on)?

I'm asking because, while one can integrate Hive and HBase by creating external tables in Hive that actually point to tables in HBase, apparently Hive queries against them run about 5x slower than queries that go against normal Hive tables.

And because all HBase export options work on one table at a time and do not produce point-in-time snapshots of the whole table, exporting data from HBase and importing it into Hive doesn't sound like a viable option.

Thanks,
Otis
----
Sematext :: http://sematext.com/ :: Solr - Lucene - Hadoop
Hadoop ecosystem search :: http://search-hadoop.com/
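To make the "pluggable destination" idea concrete, here is a minimal Python sketch. Everything in it is hypothetical: `Edit`, `ReplicationSink`, `InMemorySink`, `DelimitedFileSink` and `ship` are illustrative names, not HBase or Hive APIs, and the tab-separated output merely stands in for something Hive could read as a text table.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Edit:
    """One hypothetical log entry: a cell write on a table."""
    table: str
    row: str
    column: str
    value: str
    timestamp: int


class ReplicationSink(ABC):
    """Hypothetical destination interface; not an HBase API."""

    @abstractmethod
    def apply(self, edits: List[Edit]) -> None:
        ...


class InMemorySink(ReplicationSink):
    """Stand-in for another cluster: keeps the newest value per cell."""

    def __init__(self) -> None:
        self.cells = {}

    def apply(self, edits: List[Edit]) -> None:
        for e in edits:
            key = (e.table, e.row, e.column)
            current = self.cells.get(key)
            if current is None or e.timestamp >= current[0]:
                self.cells[key] = (e.timestamp, e.value)


class DelimitedFileSink(ReplicationSink):
    """Stand-in for a Hive-readable destination: appends tab-separated lines."""

    def __init__(self) -> None:
        self.lines = []

    def apply(self, edits: List[Edit]) -> None:
        for e in edits:
            fields = [e.table, e.row, e.column, e.value, str(e.timestamp)]
            self.lines.append("\t".join(fields))


def ship(edits: List[Edit], sinks: List[ReplicationSink]) -> None:
    """Send the same batch of edits to every configured sink."""
    for sink in sinks:
        sink.apply(edits)
```

Note that the append-only sink keeps every version of a cell, so whoever reads those lines has to decide which version wins.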
-
Re: HBase => replication => Hive
Andrew Purtell 2011-03-11, 08:52
Pardon, I'm not as familiar with this area as I should be, but

> apparently Hive queries run about x5
> slower than queries that go against normal Hive tables.

Is this not a reasonable place to start? Why is this?

> I was wondering if people think it would be possible to
> implement HBase=>Hive replication?

This strikes me as non-trivial. If putting in this level of effort, why not look into the Hive/HBase integration? Maybe there is something HBase can do to make it faster?

Best regards,

- Andy

Problems worthy of attack prove their worth by hitting back.
- Piet Hein (via Tom White)
-
Re: HBase => replication => Hive
Lars George 2011-03-11, 16:40
Hi,

I found the opposite. It depends on the queries, but if you are not doing a full table scan, the direct HBase handler approach is actually faster, as it is more fine-grained than the usual Hive partition granularity of a day or so. The scan can make use of row range selection and column families, reducing the scanned data tremendously. Add time and bloom filters if applicable and the result is awesome.

Lars
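The granularity argument can be illustrated with a small Python model. This is only a sketch under stated assumptions: the row-key layout (`YYYYMMDDHH-NNN`) and the helpers `range_scan` and `partition_scan` are made up for the example, and a sorted Python list stands in for a table sorted by row key.

```python
from bisect import bisect_left


def range_scan(sorted_keys, start, stop):
    """Return keys in [start, stop) using binary search on a sorted key list."""
    lo = bisect_left(sorted_keys, start)
    hi = bisect_left(sorted_keys, stop)
    return sorted_keys[lo:hi]


def partition_scan(sorted_keys, day):
    """Model a day-partitioned table: every key of that day is read."""
    return [k for k in sorted_keys if k.startswith(day)]


# Hypothetical key layout: day + hour, then a per-hour sequence number.
keys = sorted(
    f"{day}{hour:02d}-{n:03d}"
    for day in ("20110310", "20110311")
    for hour in range(24)
    for n in range(10)
)

one_hour = range_scan(keys, "2011031109", "2011031110")
whole_day = partition_scan(keys, "20110311")
print(len(one_hour), len(whole_day))
```

A query that only needs one hour touches the rows of that hour with a row-range scan, whereas a day-level partition forces a read of the whole day before filtering.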
-
Re: HBase => replication => Hive
Jean-Daniel Cryans 2011-03-11, 17:58
Given Hive's support for RCFile, I'm pretty sure that writing a handler for HFiles would be pretty easy. That may work in your situation... or not.

I personally found Hive's performance very adequate for ad hoc querying. We replicate some prod data on demand using CopyTable, but we're moving to multi-slave replication so that our prod data is streamed live to the MR cluster. We also have our MySQL data in there.

J-D
-
Re: HBase => replication => Hive
Amandeep Khurana 2011-03-11, 18:50
So, you essentially want to dump HBase tables into sequence files/RC files/text files and read them from Hive?

How do you plan to handle updates, deletes, IVS etc. if you use the log edits to replicate from HBase to these files? Getting Hive to talk to HFiles gives you the same problem. Isn't it easier to take a snapshot of the table when you actually want to run queries on it?

In my prelim testing, I did see Hive-HBase full table scans run slower than direct Hive table scans, but I don't remember the numbers offhand.
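The update/delete concern can be sketched in a few lines of Python. This is a deliberately simplified model, not HBase's actual versioning or tombstone semantics: the tuple layout `(timestamp, op, row, column, value)` and the function `materialize` are assumptions made for illustration.

```python
def materialize(edits):
    """Collapse an edit log into the latest visible value per (row, column).

    Each edit is a tuple (timestamp, op, row, column, value), where op is
    "put" or "delete". Edits are replayed in timestamp order.
    """
    state = {}
    for ts, op, row, col, value in sorted(edits, key=lambda e: e[0]):
        if op == "put":
            state[(row, col)] = value       # an update overwrites the old value
        elif op == "delete":
            state.pop((row, col), None)     # a delete hides earlier puts
        else:
            raise ValueError(f"unknown op: {op}")
    return state


log = [
    (1, "put", "r1", "cf:a", "x"),
    (2, "put", "r1", "cf:a", "y"),
    (3, "put", "r2", "cf:a", "z"),
    (4, "delete", "r2", "cf:a", None),
]
print(materialize(log))
```

Files that simply accumulate shipped edits hold all four entries, so a query over them would see stale and deleted values unless something performs a merge like this at read time or as a periodic rewrite.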
-
Re: HBase => replication => Hive
Otis Gospodnetic 2011-03-11, 19:09
Hi,

> So, you essentially want to dump HBase tables into sequence files/RC
> files/text files and read it from Hive?

I think that's a Q for J-D. What I had in mind was not periodic dumps, because that means data in Hive would always be behind data in HBase, but more real-time replication a la http://hbase.apache.org/replication.html, except with Hive being on the right side of that pretty picture.

> How do you plan to handle updates, deletes, IVS etc if you use the log
> edits to replicate from hbase to these files? Getting Hive to talk to
> HFiles gives you the same problem.. Isn't it easier to take a snapshot
> of the table when you actually want to run queries on it?

The thing is, it looks like there is no way to take a snapshot of an HBase table: http://blog.sematext.com/2011/03/11/hbase-backup-options/

> In my prelim testing, I did see Hive-HBase full table scans slower than
> direct Hive table scans but I don't remember the numbers off hand.

This is what made me start this particular thread: http://search-hadoop.com/m/rMdPh9rFlY1

Otis
-
Re: HBase => replication => Hive
Otis Gospodnetic 2011-03-11, 19:13
Hi,

> Pardon, I'm not as familiar with this area as I should, but
>
> > apparently Hive queries run about x5
> > slower than queries that go against normal Hive tables.
>
> Is this not a reasonable place to start? Why is this?

Reasonable? I don't know. :) That's really the first thing I was hoping to find out. J-D's reaction makes it sound like this is not unreasonable.

> > I was wondering if people think it would be possible to
> > implement HBase=>Hive replication?
>
> This strikes me as non trivial. If doing this level of effort, why not look
> into the Hive/HBase integration? Maybe there is something HBase can do to make
> it faster?

At this point I don't know how trivial or non-trivial it is yet. But I thought that if John Sichi, who strikes me as a pretty smart fellow, says he's seeing a x5 performance loss, and he's the one who worked on the integration, getting from 5 to 4 or lower may be non-trivial. HBase => Hive is terra incognita so, who knows, maybe it's easy to do. :)

Otis
----
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/
-
Re: HBase => replication => Hive
Andrew Purtell 2011-03-11, 19:51
See Lars George's response.

What I hear is that full table scans that don't take advantage of any HBase features for predicate push down, blooms, etc. are slower. I can buy that. And I'd say don't do it that way.

Isn't the best way to go to first look at the underlying cause of the slowdown? I don't have much insight into that, so I don't know the probability of getting an improvement. But it seems the level of effort for doing some kind of continuous export via replication would be at least as high as digging in there for a bit.

- Andy
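For readers unfamiliar with "blooms", here is a minimal Python sketch of how a bloom filter lets a read skip storage files that cannot contain a row key. It is a toy model under stated assumptions: `BloomFilter` and `files_to_read` are hypothetical names, the sizes are arbitrary, and nothing here reflects how HBase actually builds or stores its filters.

```python
import hashlib


class BloomFilter:
    """Toy bloom filter: may report false positives, never false negatives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for clarity

    def _positions(self, key):
        # Derive several bit positions from salted SHA-256 digests of the key.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        return all(self.bits[pos] for pos in self._positions(key))


def files_to_read(files, row_key):
    """files: list of (name, BloomFilter). Return names that cannot be ruled out."""
    return [name for name, bloom in files if bloom.might_contain(row_key)]


bloom_a = BloomFilter()
bloom_a.add("row-42")
bloom_b = BloomFilter()  # nothing added, so every lookup is ruled out

print(files_to_read([("file-a", bloom_a), ("file-b", bloom_b)], "row-42"))
```

A lookup that consults the filters first only opens the files that might hold the key, which is the kind of saving a plain full table scan never gets.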