HBase, mail # user - HBase-Hive integration performance issues


Hao Ren 2013-08-27, 13:51
Matt Davies 2013-08-27, 14:13
Re: HBase-Hive integration performance issues
Hao Ren 2013-08-27, 15:35
Matt,

Thank you for the lightning reply.

I will try out what you suggested over the next few days and report back
on the issue in detail.

Thank you again. Your suggestions show me the way. =)

Hao

On 27/08/2013 16:13, Matt Davies wrote:
> Hao,
>
> A couple thoughts here.
>
> This could be related to many things.
> 1. Did you pre-split your regions? If not, you could be hot-spotting on a
> single server and then waiting for the region to split. If that is the
> case, you could actually only be using a single server for much of your
> load (if not all; it depends on the region size you have configured). While
> it was running, did you see one system take the full load (via top, ganglia,
> or some other tool)?
>
> 2. The memory on each of these systems is quite low: 1.7 or 3.7 GB,
> depending on whether it is compute or memory. Either way, it is way low, and
> I'd expect you to be doing a lot of swapping. You'll need 1 GB for each
> daemon, which leaves you very little room for the OS (at 3.7 GB). Do you
> see swapping?  What are your JVM parameters?
>
> 3. Do these same 4 servers run your Hadoop infrastructure and the hive
> query? If so, the system is woefully underpowered if you expect to see
> production-like speed.  Running a Hive query on top of an HBase cluster
> with so few resources will just not work out well in the end ;)
>
>
> -Matt
>
>
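Pre-splitting, as raised in point 1 above, means creating the table with its region boundaries already defined so that writes are spread across regions from the start. The sketch below only computes evenly spaced split keys. It assumes row keys are uniformly distributed hexadecimal strings, which the thread does not state; `hex_split_points`, `hbase_shell_create`, `demo_table`, and `cf` are hypothetical names used for illustration.

```python
def hex_split_points(num_regions: int, prefix_len: int = 4) -> list[str]:
    """Return num_regions - 1 evenly spaced hex keys of length prefix_len.

    Assumption: row keys are uniformly distributed hex strings, so equal
    slices of the key space receive roughly equal write load.
    """
    if num_regions < 2:
        return []  # a single region needs no split points
    keyspace = 16 ** prefix_len
    return [
        format(i * keyspace // num_regions, f"0{prefix_len}x")
        for i in range(1, num_regions)
    ]


def hbase_shell_create(table: str, family: str, splits: list[str]) -> str:
    """Build an HBase shell `create ... SPLITS => [...]` statement as text."""
    keys = ", ".join(repr(key) for key in splits)
    return f"create '{table}', '{family}', SPLITS => [{keys}]"


if __name__ == "__main__":
    # Hypothetical table and column family names.
    splits = hex_split_points(4)
    print(hbase_shell_create("demo_table", "cf", splits))
```

With only one region server, as described in the original question, pre-splitting spreads writes across regions but not across machines, so it would not help much on its own.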
> On Tue, Aug 27, 2013 at 7:51 AM, Hao Ren <[EMAIL PROTECTED]> wrote:
>
>> Hi,
>>
>> I am running Hive and HBase on Amazon EC2. By following the tutorial:
>> https://cwiki.apache.org/confluence/display/Hive/HBaseIntegration, I managed to create an HBase table from Hive and insert data into it.
>>
>> It works, but performance is poor. To be specific, inserting 1.3 GB (50
>> M rows, 3 columns) takes 30 mins. That is far from what I expected, say 100 s.
>>
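To put the figures above in perspective, here is a small worked calculation of the observed and hoped-for write rates. It uses only the numbers quoted in the message and assumes decimal units (1 GB = 10^9 bytes); the variable names are illustrative.

```python
# Figures taken from the message; decimal units are an assumption.
ROWS = 50_000_000
SIZE_BYTES = 1_300_000_000   # 1.3 GB
OBSERVED_SECONDS = 30 * 60   # 30 minutes
TARGET_SECONDS = 100

observed_rows_per_s = ROWS / OBSERVED_SECONDS
target_rows_per_s = ROWS / TARGET_SECONDS
observed_mb_per_s = SIZE_BYTES / 1e6 / OBSERVED_SECONDS
target_mb_per_s = SIZE_BYTES / 1e6 / TARGET_SECONDS
bytes_per_row = SIZE_BYTES / ROWS
required_speedup = OBSERVED_SECONDS / TARGET_SECONDS

if __name__ == "__main__":
    print(f"observed: {observed_rows_per_s:,.0f} rows/s, {observed_mb_per_s:.2f} MB/s")
    print(f"target:   {target_rows_per_s:,.0f} rows/s, {target_mb_per_s:.2f} MB/s")
    print(f"average row size: {bytes_per_row:.0f} bytes")
    print(f"required speedup: {required_speedup:.0f}x")
```

The rows are small (about 26 bytes each on average), so the per-row write cost matters far more here than raw data volume.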
>> Actually, my EC2 cluster contains 3 slaves and 1 master whose instance
>> type is medium (http://aws.amazon.com/ec2/instance-types/#instance-type).
>>
>> Hadoop 1.0.4 is installed on my cluster. HBase is in pseudo-distributed
>> mode. A region server is running on the master. HDFS is used as storage.
>>
>> Here are some configuration files:
>>
>> *// hive-site.xml*
>>
>> <configuration>
>>
>>      <property>
>>          <name>hbase.zookeeper.quorum</name>
>>          <value>ip-10-178-13-39.ec2.internal</value>
>>      </property>
>>
>>      <property>
>>          <name>hive.aux.jars.path</name>
>>          <value>/root/hive/build/dist/lib/hive-hbase-handler-0.9.0-amplab-4.jar,/root/hive/build/dist/lib/hbase-0.92.0.jar,/root/hive/build/dist/lib/zookeeper-3.4.3.jar,/root/hive/build/dist/lib/guava-r09.jar</value>
>>      </property>
>>
>>      <property>
>>          <name>hbase.client.scanner.caching</name>
>>          <value>10000</value>
>>      </property>
>>
>> </configuration>
>>
>> *// hbase-site.xml*
>>
>> <configuration>
>>
>>      <property>
>>          <name>hbase.rootdir</name>
>>          <value>hdfs://ec2-54-226-206-28.compute-1.amazonaws.com:9010/hbase</value>
>>      </property>
>>
>>      <property>
>>          <name>hbase.cluster.distributed</name>
>>          <value>true</value>
>>      </property>
>>
>>      <property>
>>          <name>hbase.zookeeper.quorum</name>
>>          <value>ip-10-178-13-39.ec2.internal</value>
>>      </property>
>>
>>      <property>
>>          <name>hbase.client.scanner.caching</name>
>>          <value>10000</value>
>>      </property>
>>
>> </configuration>
>>
>> *For understanding, I have some questions:*
>> 1) To improve read performance, I have set
>> hbase.client.scanner.caching to 10000. But I don't know how to improve
>> write performance. Is there some basic configuration I should change?
>> 2) Does the distributed mode matter ? Does fully-distributed mode have
Hao Ren
ClaraVista
www.claravista.fr