-
Performance: hive+hbase integration query against the row_key
Shengjie Min 2012-09-11, 13:40
Hi,

I am trying to get Hive working on top of my HBase table, following the guide below:
https://cwiki.apache.org/Hive/hbaseintegration.html

    CREATE EXTERNAL TABLE hive_hbase_test (key string, a string, b string, c string)
    STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
    WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf:a,cf:b,cf:c")
    TBLPROPERTIES ("hbase.table.name" = "test");

This table definition makes my mapping look roughly like this:

    hive_hbase_test  vs  test
    Hive key       -  HBase row_key
    Hive column a  -  HBase cf:a
    Hive column b  -  HBase cf:b
    Hive column c  -  HBase cf:c

From my understanding of how HBaseStorageHandler works, it is supposed to take advantage of the HBase row_key index as much as possible. So I would expect:

1. If you run a Hive query against the row key, like "select * from hive_hbase_test where key='blabla'", it would use the HBase row_key index, which gives you a very quick, nearly real-time response just like HBase does.

2. Of course, if you run a Hive query against a column, like "select * from hive_hbase_test where a='blabla'", it filters on a specific column, so it probably uses MapReduce because there is nothing on the HBase side that can be used.

From my test, query 1 doesn't seem fast at all; it still takes ages:

    select * from hive_hbase_test where key='blabla'    36 secs
    vs
    get 'test', 'blabla'                                less than 1 sec

That is still a huge difference.

Has anybody tried this before? Is there any way I can do some sort of query plan analysis on a Hive query? Or am I not mapping the Hive table to the HBase table correctly?

--
All the best,
Shengjie Min
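On the query-plan question: Hive's EXPLAIN statement prints the plan for a query without running it, which is one way to see whether a query compiles to a MapReduce stage. A minimal sketch, reusing the table from the message above; the plan output differs between Hive versions, so none is shown here.

```sql
-- Show the stages Hive would run for the row-key lookup.
EXPLAIN
SELECT * FROM hive_hbase_test WHERE key = 'blabla';

-- EXPLAIN EXTENDED prints additional detail about each stage.
EXPLAIN EXTENDED
SELECT * FROM hive_hbase_test WHERE key = 'blabla';
```

Comparing this plan with the one for "where a='blabla'" should show whether Hive treats the key predicate any differently from a predicate on an ordinary column.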
-
Re: Performance: hive+hbase integration query against the row_key
bharath vissapragada 2012-09-11, 14:00
Hey,

Hive does all kinds of parsing, metadata lookups, query tree building and so on before executing the query. I am not sure whether all of this was included in those 36 seconds!

Also, what Hive does is build a scan object with ranges based on the predicates on the key column (and mappers too), rather than issue a direct "get" call as the hbase shell does. This might incur some overhead too!

--
Regards,
Bharath .V
w: http://researchweb.iiit.ac.in/~bharath.v
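Whether the key predicate actually reaches HBase as a scan range depends on predicate pushdown into the storage handler. A hedged sketch of how one might check this, assuming a Hive version whose HBase handler supports pushdown on the key column; hive.optimize.ppd and hive.optimize.ppd.storage are standard Hive settings, but how the pushed-down filter appears in the plan is version-dependent and is not shown here.

```sql
-- Enable predicate pushdown, including pushdown into storage handlers.
SET hive.optimize.ppd=true;
SET hive.optimize.ppd.storage=true;

-- Inspect the plan for the row-key lookup with pushdown enabled.
EXPLAIN
SELECT * FROM hive_hbase_test WHERE key = 'blabla';
```

Even when the predicate is pushed down, the query may still be submitted as a MapReduce job, so job startup cost can remain.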
-
Re: Performance: hive+hbase integration query against the row_key
Alan Gates 2012-09-12, 01:20
On Sep 11, 2012, at 7:00 AM, bharath vissapragada wrote:

> Also what hive does is, it builds a scan object with ranges based on predicates (and mappers too ) on key column and not a direct "get" call as in hbase shell. This might incur some overhead too!

Since Hive does this in a MapReduce job, it definitely incurs overhead. It does not run directly against HBase as you might wish it did here.

Alan.
-
RE: Performance: hive+hbase integration query against the row_key
ashok.samal@... 2012-09-12, 03:25
After loading the data into Hive tables, the files get automatically deleted from HDFS. How can I stop that?

Thanks,
Ashok
-
Re: Performance: hive+hbase integration query against the row_key
Bejoy KS 2012-09-12, 21:09
Hi Ashok,

'LOAD DATA INPATH ..' issues an HDFS move under the hood, which is why the original data is no longer present at its HDFS location after the load operation. If you want to preserve the data in some HDFS location and use the same data with Hive, why not create an external table and point it to the required HDFS location?

Regards,
Bejoy KS
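A minimal sketch of the external-table approach, assuming tab-delimited text files in a hypothetical HDFS directory /data/raw_events; the table name, columns and path are illustrative and do not come from the thread.

```sql
-- Hypothetical table over files that already sit in HDFS.
-- Hive reads them in place; nothing is moved into the warehouse directory.
CREATE EXTERNAL TABLE raw_events (
  event_id STRING,
  payload  STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION '/data/raw_events';

-- Query the files through the table as usual.
SELECT COUNT(*) FROM raw_events;
```

Dropping an external table removes only its metadata; the files under the LOCATION stay where they are.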
-
RE: Performance: hive+hbase integration query against the row_key
ashok.samal@... 2012-09-12, 21:12
Yes Bejoy, I did that today and it's working. But I was thinking we could achieve it by setting some property. Is there anything like that?

Thanks,
Ashok
-
Re: Performance: hive+hbase integration query against the row_key
Bejoy KS 2012-09-12, 21:33
Hi Ashok,

AFAIK, there is no property that will get you this functionality on the fly.

Regards,
Bejoy KS