HBase >> mail # user >> HBase-Hive integration performance issues


Re: HBase-Hive integration performance issues
Matt,

Thank you for the lightning reply.

I will try out what you suggested over the next few days, and then I can
report back on the issue in detail.

Thank you again. Your suggestions show me the way. =)

Hao

On 27/08/2013 16:13, Matt Davies wrote:
> Hao,
>
> A couple of thoughts here.
>
> This could be related to many things.
> 1. Did you pre-split your regions? If not, you could be hot-spotting on a
> single server and then waiting for the region to split. If that is the
> case, you could actually only be using a single server for much of your
> load (if not all; it depends on the region size you have configured). While
> running, did you see one system take the full load (via top, ganglia, or
> some other tool)?
>
> 2. The memory on each of these systems is quite low: 1.7 GB or 3.7 GB,
> depending on whether it is compute or memory. Either way, it is very low,
> and I'd expect you to be doing a lot of swapping. You'll need 1 GB for each
> daemon, which leaves you very little room for the OS (at 3.7 GB). Do you
> see swapping? What are your JVM parameters?
>
> 3. Do these same 4 servers run your Hadoop infrastructure and the Hive
> query? If so, the system is woefully underpowered if you expect to see
> production-like speed. Running a Hive query on top of an HBase cluster
> with so few resources will just not work out well in the end ;)
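
[Inline note on point 1] A minimal sketch of how split points for a pre-split table could be computed, assuming row keys start with a uniformly distributed two-character hex prefix (nothing in this thread says they do). The table name "hive_hbase_test" and column family "cf" are hypothetical. The script only prints an HBase shell `create ... SPLITS => [...]` command; it does not talk to the cluster.

```python
def hex_split_points(num_regions, width=2):
    """Return num_regions - 1 evenly spaced hex split keys of the given width.

    Assumes row keys begin with a uniformly distributed hex prefix.
    """
    if num_regions < 2:
        return []
    key_space = 16 ** width
    if num_regions > key_space:
        raise ValueError("more regions than distinct key prefixes")
    return [
        format(i * key_space // num_regions, "0{}x".format(width))
        for i in range(1, num_regions)
    ]


def hbase_create_command(table, family, splits):
    """Build an HBase shell command that creates a pre-split table."""
    quoted = ", ".join("'{}'".format(s) for s in splits)
    return "create '{}', '{}', SPLITS => [{}]".format(table, family, quoted)


if __name__ == "__main__":
    # Hypothetical table and column family names, four regions.
    print(hbase_create_command("hive_hbase_test", "cf", hex_split_points(4)))
```

If the real row keys are not hex-prefixed, the split points would have to be derived from a sample of the actual keys instead.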
>
>
> -Matt
>
>
> On Tue, Aug 27, 2013 at 7:51 AM, Hao Ren <[EMAIL PROTECTED]> wrote:
>
>> Hi,
>>
>> I am running Hive and HBase on Amazon EC2. By following the tutorial
>> https://cwiki.apache.org/confluence/display/Hive/HBaseIntegration,
>> I managed to create an HBase table from Hive and insert data into it.
>>
>> It works, but performance is low. Specifically, inserting 1.3 GB
>> (50 M rows, 3 columns) takes 30 minutes. That is far from what I
>> expected, say 100 s.
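
[Inline note] To put the numbers above side by side, here is a small sketch that converts them into throughput. The only assumption is that 1 GB is counted as 1024 MB; the row count, data size, and durations are the ones quoted in the message.

```python
ROWS = 50_000_000        # 50 M rows, from the message
SIZE_MB = 1.3 * 1024     # 1.3 GB, assuming 1 GB = 1024 MB
OBSERVED_S = 30 * 60     # observed: 30 minutes
EXPECTED_S = 100         # expected: about 100 seconds


def throughput(rows, size_mb, seconds):
    """Return (rows per second, MB per second) for a load of the given duration."""
    return rows / seconds, size_mb / seconds


obs_rows, obs_mb = throughput(ROWS, SIZE_MB, OBSERVED_S)
exp_rows, exp_mb = throughput(ROWS, SIZE_MB, EXPECTED_S)
slowdown = OBSERVED_S / EXPECTED_S

print("observed: {:,.0f} rows/s, {:.2f} MB/s".format(obs_rows, obs_mb))
print("expected: {:,.0f} rows/s, {:.2f} MB/s".format(exp_rows, exp_mb))
print("gap: {:.0f}x".format(slowdown))
```

The observed rate is roughly 28 thousand rows per second, against a target of 500 thousand, an 18x gap.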
>>
>> My EC2 cluster contains 3 slaves and 1 master whose instance
>> type is medium (http://aws.amazon.com/ec2/instance-types/#instance-type).
>>
>> Hadoop 1.0.4 is installed on my cluster. HBase is in pseudo-distributed
>> mode. A region server is running on the master. HDFS is used as storage.
>>
>> Here are some configuration files:
>>
>> *// hive-site.xml*
>>
>> <configuration>
>>
>>      <property>
>>          <name>hbase.zookeeper.quorum</name>
>>          <value>ip-10-178-13-39.ec2.internal</value>
>>      </property>
>>
>>      <property>
>>          <name>hive.aux.jars.path</name>
>>          <value>/root/hive/build/dist/lib/hive-hbase-handler-0.9.0-amplab-4.jar,/root/hive/build/dist/lib/hbase-0.92.0.jar,/root/hive/build/dist/lib/zookeeper-3.4.3.jar,/root/hive/build/dist/lib/guava-r09.jar</value>
>>      </property>
>>
>>      <property>
>>          <name>hbase.client.scanner.caching</name>
>>          <value>10000</value>
>>      </property>
>>
>> </configuration>
>>
>> *// hbase-site.xml*
>>
>> <configuration>
>>
>>      <property>
>>          <name>hbase.rootdir</name>
>>          <value>hdfs://ec2-54-226-206-28.compute-1.amazonaws.com:9010/hbase</value>
>>      </property>
>>
>>      <property>
>>          <name>hbase.cluster.distributed</name>
>>          <value>true</value>
>>      </property>
>>
>>      <property>
>>          <name>hbase.zookeeper.quorum</name>
>>          <value>ip-10-178-13-39.ec2.internal</value>
>>      </property>
>>
>>      <property>
>>          <name>hbase.client.scanner.caching</name>
>>          <value>10000</value>
>>      </property>
>>
>> </configuration>
>>
>> *To understand this better, I have some questions:*
>> 1) To improve read performance, I have set
>> hbase.client.scanner.caching to 10000, but I don't know how to improve
>> write performance. Is there some basic configuration I should change?
>> 2) Does the distributed mode matter? Does fully-distributed mode have
Hao Ren
ClaraVista
www.claravista.fr