HBase-Hive integration performance issues
Hi,

I am running Hive and HBase on Amazon EC2. Following the tutorial at
https://cwiki.apache.org/confluence/display/Hive/HBaseIntegration, I
managed to create an HBase table from Hive and insert data into it.

It works, but performance is low. To be specific, inserting 1.3 GB
(50 M rows, 3 columns) takes 30 minutes. That is far from what I
expected, say 100 s.
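
To put numbers on that gap, here is a quick back-of-the-envelope calculation. It is only a sketch: it assumes "1.3 GB" means 1.3 * 10^9 bytes, and the helper name `throughput` is made up for illustration.

```python
# Back-of-the-envelope throughput for the insert described above.
# Assumption: "1.3 GB" is taken as 1.3 * 10**9 bytes.
ROWS = 50_000_000
SIZE_BYTES = 1.3e9


def throughput(seconds):
    """Return (rows per second, megabytes per second) for a given duration."""
    return ROWS / seconds, SIZE_BYTES / seconds / 1e6


observed = throughput(30 * 60)  # measured: 30 minutes
expected = throughput(100)      # hoped for: 100 seconds

print("observed: %.0f rows/s, %.2f MB/s" % observed)
print("expected: %.0f rows/s, %.2f MB/s" % expected)
print("gap: %.0fx" % ((30 * 60) / 100))
```

So the load currently runs at roughly 27,800 rows/s (about 0.72 MB/s), while the 100 s target would need 500,000 rows/s (13 MB/s), an 18x difference.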

My EC2 cluster contains 3 slaves and 1 master whose instance
type is medium (http://aws.amazon.com/ec2/instance-types/#instance-type).

Hadoop 1.0.4 is installed on my cluster. HBase is in pseudo-distributed
mode. A region server is running on the master. HDFS is used as storage.

Here are some configuration files:

*// hive-site.xml*

<configuration>

     <property>
         <name>hbase.zookeeper.quorum</name>
         <value>ip-10-178-13-39.ec2.internal</value>
     </property>

     <property>
         <name>hive.aux.jars.path</name>
         <value>/root/hive/build/dist/lib/hive-hbase-handler-0.9.0-amplab-4.jar,/root/hive/build/dist/lib/hbase-0.92.0.jar,/root/hive/build/dist/lib/zookeeper-3.4.3.jar,/root/hive/build/dist/lib/guava-r09.jar</value>
     </property>

     <property>
         <name>hbase.client.scanner.caching</name>
         <value>10000</value>
     </property>

</configuration>

*// hbase-site.xml*

<configuration>

     <property>
         <name>hbase.rootdir</name>
         <value>hdfs://ec2-54-226-206-28.compute-1.amazonaws.com:9010/hbase</value>
     </property>

     <property>
         <name>hbase.cluster.distributed</name>
         <value>true</value>
     </property>

     <property>
         <name>hbase.zookeeper.quorum</name>
         <value>ip-10-178-13-39.ec2.internal</value>
     </property>

     <property>
         <name>hbase.client.scanner.caching</name>
         <value>10000</value>
     </property>

</configuration>

*To help me understand, I have some questions:*
1) To improve read performance, I have set
hbase.client.scanner.caching to 10000, but I don't know how to improve
write performance. Is there some basic configuration I should change?
2) Does the distribution mode matter? Does fully-distributed mode have
better write performance than pseudo-distributed mode?
3) If the number of region servers is increased, will write
performance improve?
4) In pseudo-distributed mode (one HBase daemon on the master), when writing
data from Hive to an HBase table, is the master the only entry point to HBase?
I don't think passing all data through the master is efficient. I wonder
whether it is possible to write data in parallel from Hive to HBase
directly using MapReduce?
5) Will HBase bulk loading help a lot?

I am new to HBase, but I really want to use HBase in production.

Any help is highly appreciated! =)

Hao

--
Hao Ren
ClaraVista
www.claravista.fr
