HBase, mail # user - HBase-Hive integration performance issues


Hao Ren 2013-08-27, 13:51
Hi,

I am running Hive and HBase on Amazon EC2. By following the tutorial at
https://cwiki.apache.org/confluence/display/Hive/HBaseIntegration , I
managed to create an HBase table from Hive and insert data into it.

It works, but performance is low. To be specific, inserting 1.3 GB
(50 M rows, 3 columns) takes 30 minutes. That is far from what I
expected, which was about 100 s.
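
For scale, the figures above can be turned into throughput numbers. This is only a rough sketch: it assumes the 1.3 figure means decimal gigabytes (1 GB = 1000 MB), and the names in it are illustrative, not taken from any Hive or HBase API.

```python
# Figures quoted in this message; units are an assumption (1 GB = 1000 MB).
SIZE_MB = 1300            # 1.3 GB of input data
ROWS = 50_000_000         # 50 M rows
OBSERVED_S = 30 * 60      # observed load time: 30 minutes
EXPECTED_S = 100          # load time I was hoping for


def throughput(seconds):
    """Return (rows per second, MB per second) for a load lasting `seconds`."""
    return ROWS / seconds, SIZE_MB / seconds


observed = throughput(OBSERVED_S)
expected = throughput(EXPECTED_S)
slowdown = OBSERVED_S / EXPECTED_S

print("observed: %.0f rows/s, %.2f MB/s" % observed)
print("expected: %.0f rows/s, %.2f MB/s" % expected)
print("observed load is %.0fx slower than expected" % slowdown)
```

So the observed rate is on the order of tens of thousands of rows per second, while the 100 s target would need roughly half a million rows per second.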

My EC2 cluster contains 3 slaves and 1 master whose instance type is
medium (http://aws.amazon.com/ec2/instance-types/#instance-type).

Hadoop 1.0.4 is installed on my cluster. HBase is in pseudo-distributed
mode. A region server is running on the master. HDFS is used as storage.

Here are some configuration files:

*// hive-site.xml*

<configuration>

     <property>
         <name>hbase.zookeeper.quorum</name>
         <value>ip-10-178-13-39.ec2.internal</value>
     </property>

     <property>
         <name>hive.aux.jars.path</name>
         <value>/root/hive/build/dist/lib/hive-hbase-handler-0.9.0-amplab-4.jar,/root/hive/build/dist/lib/hbase-0.92.0.jar,/root/hive/build/dist/lib/zookeeper-3.4.3.jar,/root/hive/build/dist/lib/guava-r09.jar</value>
     </property>

     <property>
         <name>hbase.client.scanner.caching</name>
         <value>10000</value>
     </property>

</configuration>

*// hbase-site.xml*

<configuration>

     <property>
         <name>hbase.rootdir</name>
         <value>hdfs://ec2-54-226-206-28.compute-1.amazonaws.com:9010/hbase</value>
     </property>

     <property>
         <name>hbase.cluster.distributed</name>
         <value>true</value>
     </property>

     <property>
         <name>hbase.zookeeper.quorum</name>
         <value>ip-10-178-13-39.ec2.internal</value>
     </property>

     <property>
         <name>hbase.client.scanner.caching</name>
         <value>10000</value>
     </property>

</configuration>

*To help me understand, I have some questions:*
1) To improve read performance, I have set
hbase.client.scanner.caching to 10000, but I don't know how to improve
write performance. Is there some basic configuration for that?
2) Does the deployment mode matter? Does fully-distributed mode give
better write performance than pseudo-distributed mode?
3) If the number of region servers is increased, will write
performance improve?
4) In pseudo-distributed mode (one HBase daemon on the master), when
writing data from Hive to an HBase table, is the master the only entry
point to HBase? I don't think passing all data through the master is
efficient. Is it possible to write data from Hive to HBase directly
and in parallel using MapReduce?
5) Will HBase bulk loading help a lot?

I am new to HBase, but I really want to use it in production.

Any help is highly appreciated! =)

Hao

--
Hao Ren
ClaraVista
www.claravista.fr
