
Hive+HBase performance is much poorer than Hive+HDFS
Hi all,

I have run some performance tests on Hive+HBase. The table is a normal 2D
table with about 160M rows (each row with 7 small columns) and 32
regions. There is only one column family, and every region was
major-compacted down to a single store file before the test.

On a cluster with 11 task trackers (each with 4 map slots and 1 reduce
slot; these servers also act as region servers), a simple Hive query
   select count(*) from table where column3='Y';
needs ~1700 seconds to finish.

But after using a CTAS statement to copy the data into an internal table
(stored as a sequence file), the same statement needs only 43 seconds.

So Hive+HBase is about 40X slower than Hive+HDFS.

Hive+HBase does run fewer map tasks (32 vs 223), but since there are
only 44 map slots available, I don't think that is the main cause.
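
A quick back-of-the-envelope check of these numbers, as a rough sketch only. It assumes every map task in a wave takes about the same time and ignores scheduling and startup overhead; the function names are just for illustration.

```python
import math

ROWS = 160_000_000        # approximate row count from the test table
MAP_SLOTS = 11 * 4        # 11 task trackers x 4 map slots each

def waves(tasks, slots=MAP_SLOTS):
    """Number of scheduling waves needed to run `tasks` map tasks."""
    return math.ceil(tasks / slots)

def concurrency(tasks, slots=MAP_SLOTS):
    """Maximum number of map tasks that can run at the same time."""
    return min(tasks, slots)

hbase_secs, hdfs_secs = 1700, 43

# Observed end-to-end slowdown of Hive+HBase relative to Hive+HDFS.
slowdown = hbase_secs / hdfs_secs

# Best-case advantage Hive+HDFS gets purely from using more slots at once.
parallelism_gap = concurrency(223) / concurrency(32)

print(waves(32), waves(223))         # 1 6
print(round(slowdown, 1))            # 39.5
print(round(parallelism_gap, 3))     # 1.375
print(round(ROWS / hbase_secs))      # aggregate rows/s, Hive+HBase
print(round(ROWS / hdfs_secs))       # aggregate rows/s, Hive+HDFS
```

Under those assumptions, the 32 HBase map tasks leave 12 slots idle, which by itself would explain a gap of at most about 1.4X, not 40X.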

I studied the source code of the HBase scan implementation. It seems to
me that, in my case, the scan reads the HFile in much the same way as a
sequence file is read (sequentially, one key/value pair at a time). So,
in theory, the performance should be quite similar.

Can anyone explain the 40X slowdown?