HBase >> mail # user >> Hive+HBase performance is much poorer than Hive+HDFS


Hive+HBase performance is much poorer than Hive+HDFS
Hi all,

I have run some performance tests on Hive+HBase. The table is a plain 2D
table with about 160M rows (each row has 7 small columns) and 32
regions. There is only one column family, and all regions were major
compacted to a single store file before the test.

On a cluster with 11 task trackers (each with 4 map slots and 1 reduce
slot; these servers also act as region servers), a simple query in Hive
   select count(*) from table where column3='Y';
needs ~1700 seconds to finish.

However, after using a CTAS statement to create an internal table (stored
as a sequence file), the same query needs only 43 seconds to finish.

So Hive+HBase is about 40X slower than Hive+HDFS.

Hive+HBase does run fewer map tasks (32 vs 223), but since only 44 map
slots are available, I don't think that is the main cause.
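
As a rough back-of-the-envelope check of the slot argument, the sketch below
compares how much parallelism each job can actually use. It assumes every map
slot runs one task at a time and that tasks within a job take roughly equal
time; the helper names (map_waves, peak_parallelism) are only illustrative and
not part of Hive or HBase.

```python
import math

# Figures taken from the description above.
MAP_SLOTS = 11 * 4            # 11 task trackers x 4 map slots each
HBASE_TASKS, HDFS_TASKS = 32, 223
HBASE_SECS, HDFS_SECS = 1700, 43


def map_waves(num_tasks, slots=MAP_SLOTS):
    """Scheduling rounds needed if each slot runs one map task at a time."""
    return math.ceil(num_tasks / slots)


def peak_parallelism(num_tasks, slots=MAP_SLOTS):
    """Largest number of map tasks that can run at the same moment."""
    return min(num_tasks, slots)


# Observed slowdown of the HBase-backed query.
slowdown = HBASE_SECS / HDFS_SECS

# Best-case advantage the HDFS job gets purely from using more slots.
parallelism_gap = peak_parallelism(HDFS_TASKS) / peak_parallelism(HBASE_TASKS)

print(f"map slots:        {MAP_SLOTS}")
print(f"HBase job waves:  {map_waves(HBASE_TASKS)}")
print(f"HDFS job waves:   {map_waves(HDFS_TASKS)}")
print(f"observed slowdown: {slowdown:.1f}x")
print(f"parallelism gap:   {parallelism_gap:.3f}x")
```

Under these assumptions the idle slots can account for a factor of well under
2, which is far from the observed slowdown.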

I studied the source code of the HBase scan implementation. It seems to
me that, in my case, the scan reads the HFile in much the same way as a
sequence file is read (sequentially, one key/value pair after another).
So, in theory, the performance should be quite similar.

Can anyone explain the 40X slowdown?

Thanks
Weihua