Hao Ren 2013-08-01, 14:45
You need to set scanner caching, otherwise each call to next() will be a network round trip (RTT).
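A minimal sketch of one way to try this from the Hive side: raise the HBase client scanner caching for the session before running the query. `hbase.client.scanner.caching` is the standard HBase client property. Whether Hive forwards a session-level value to the scan depends on the Hive version, and the value 1000 is only an illustrative assumption, to be tuned against row size and available memory.

```sql
-- Assumption: session-level SET values are passed through to the HBase scan
-- issued by HBaseStorageHandler; verify this on your Hive version.
-- 1000 is an illustrative value, not a figure from this thread.
SET hbase.client.scanner.caching=1000;

-- Re-run the slow query to compare timings.
SELECT count(*) FROM hbase_table;
```

With a caching value of N, a scan over R rows needs roughly R/N round trips instead of R, which is why a caching value of 1 is so costly for a full-table count.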
From: Hao Ren <[EMAIL PROTECTED]>
To: [EMAIL PROTECTED]
Sent: Thursday, August 1, 2013 7:45 AM
Subject: Why HBase integration with Hive makes Hive slow
I have a cluster (1 master + 3 slaves) on which there are Hive, HBase, and
To run a daily row-level update routine, we need to integrate
HBase with Hive, but the performance is poor.
E.g. there are 2 tables in Hive:
hbase_table: an HBase table created via Hive
hive_table: a native Hive table
Both hold the same data set.
select count(*) from hbase_table; ===> takes 500 s
select count(*) from hive_table; ===> takes 6 s
I have tried a lot of queries on the two tables, but hbase_table is
always very slow.
To be clear, I created hbase_table as below:
CREATE TABLE hbase_table (
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" =
TBLPROPERTIES("hbase.table.name" = "table_test")
My HBase runs in pseudo-distributed mode.
My guess is that at the beginning of query execution Hive loads the data
from HBase, and the SerDe step takes a long time.
Could someone tell me how to improve this poor performance?
Is this caused by a wrongly configured integration?
Is fully-distributed mode needed here?
Thank you in advance for your time.
Hao Ren 2013-08-02, 15:04
Andrew Purtell 2013-08-02, 18:41