Hive >> mail # user >> Performance: hive+hbase integration query against the row_key

Re: Performance: hive+hbase integration query against the row_key
Hi Ashok

'LOAD DATA INPATH ..' issues an HDFS move under the hood, which is why the original data is no longer at its HDFS location after the load operation. If you want to preserve the data in some HDFS location and use it with Hive, why not create an external table and point it to the required HDFS location?
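
A minimal sketch of the external-table approach, for illustration only. The table name, columns, field delimiter and HDFS path below are all hypothetical placeholders, so adjust them to match your data:

```sql
-- Hypothetical names and path: replace with your own schema and directory.
-- LOCATION must point to an HDFS directory, not to a single file.
CREATE EXTERNAL TABLE my_data (id STRING, payload STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION '/path/to/existing/hdfs/dir';

-- Hive reads the files in place; no LOAD DATA step is needed.
SELECT * FROM my_data LIMIT 10;
```

With an external table the files stay where they are, and dropping the table removes only the Hive metadata, not the data.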
Bejoy KS
Sent: Wednesday, September 12, 2012 8:55 AM
Subject: RE: Performance: hive+hbase integration query against the row_key
After loading the data into Hive tables, the files get automatically deleted from HDFS... how do I stop that?


-----Original Message-----
From: Alan Gates [mailto:[EMAIL PROTECTED]]
Sent: 12 September 2012 06:51
Subject: Re: Performance: hive+hbase integration query against the row_key
On Sep 11, 2012, at 7:00 AM, bharath vissapragada wrote:

> Hey,
> Hive does all kinds of parsing, metadata lookups, query tree building and so on before executing the query. Not sure whether all of this was included in those 36 seconds!
> Also, Hive builds a scan object (and mappers too) with ranges based on the predicates on the key column, rather than issuing a direct "get" call as the hbase shell does. This might incur some overhead too!

Since Hive does this in a MapReduce job it definitely incurs overhead.  It does not run directly against HBase as you might wish it did here.
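
One way to check this for a given query is Hive's EXPLAIN statement, which prints the plan without running the query. A sketch using the table from the original message (the exact output format varies by Hive version, so no sample output is shown here):

```sql
-- Prints the stage plan for the row-key lookup instead of executing it.
EXPLAIN
SELECT * FROM hive_hbase_test WHERE key = 'blabla';
```

If the plan contains a Map Reduce stage, the query pays job-startup cost no matter how selective the key predicate is.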


> On Tue, Sep 11, 2012 at 7:10 PM, Shengjie Min <[EMAIL PROTECTED]> wrote:
> Hi,
> I am trying to get hive working on top of my hbase table following the guide below:
> https://cwiki.apache.org/Hive/hbaseintegration.html
> CREATE EXTERNAL TABLE hive_hbase_test (key string, a string, b string, c string)
> STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
> WITH SERDEPROPERTIES ("hbase.columns.mapping"=":key,cf:a,cf:b,cf:c") TBLPROPERTIES ("hbase.table.name"="test");
> this hive table creation makes my mapping roughly look like this:
> hive_hbase_test  VS   test
> Hive key  -   hbase row_key
> Hive column a -  hbase cf:a
> Hive column b  -  hbase cf:b
> Hive column c  -  hbase cf:c
> From my understanding of how HBaseStorageHandler works, it's supposed to take advantage of the hbase row_key index as much as possible. So I would expect:
> 1. If you do a hive query against the row key, like "select * from hive_hbase_test where key='blabla'", it would utilize the hbase row_key index, which gives you a very quick, nearly real-time response just like hbase does.
> 2. Of course, if you do a hive query against a column, like "select * from hive_hbase_test where a='blabla'", it queries against a specific column, so it probably uses mapred because there is nothing on the HBase side that can be utilized.
> From my test, query 1 doesn't seem fast at all; it still takes ages:
> select * from hive_hbase_test where key='blabla'   36secs
> vs
> get 'test', 'blabla'      less than 1 sec
> still shows a huge difference.
> Has anybody tried this before? Is there any way I can do some sort of query plan analysis on a hive query? Or am I not mapping the hive table to the hbase table correctly?
> --
> All the best,
> Shengjie Min
> --
> Regards,
> Bharath .V
> w:http://researchweb.iiit.ac.in/~bharath.v