HBase, mail # user - Loading data from Hive to HBase takes too long


Re: Loading data from Hive to HBase takes too long
Hao Ren 2013-08-20, 08:27
Hi, Lars

Thank you for your reply, and sorry for the lack of clarity.

Actually, the HBase daemon is running only on the master, i.e. on a single
server. It uses HDFS as its storage.
The input data is on EBS. It is written to HBase, which runs on top of HDFS,
which is itself backed by EBS.

The only tuning I did is:

<property>
          <name>hbase.client.scanner.caching</name>
          <value>10000</value>
</property>

That makes count(*) fast.

When loading to HDFS directly, it finishes in less than 10 minutes.

In addition, when loading other data sets with a different schema, about 700 MB in size, into HBase, it takes only a few minutes.
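
To put the numbers in this thread side by side, here is a rough sketch in Python. It assumes exactly 8 hours for the HBase load and exactly 10 minutes for the direct HDFS load; the thread only gives "about 8 hours" and "less than 10 mins", so the HDFS rate and the slowdown factor are lower bounds. The table size is the value reported by `hadoop dfs -dus`. The function and variable names are made up for illustration.

```python
# Figures quoted in the thread; the two durations are rounded assumptions.
TABLE_BYTES = 1_318_012_108      # `hadoop dfs -dus` output for the 'test' table
HBASE_LOAD_SECONDS = 8 * 3600    # "about 8 hours" through the HBase storage handler
HDFS_LOAD_SECONDS = 10 * 60      # "less than 10 mins" straight to HDFS (upper bound)


def throughput_kib_per_s(num_bytes: int, seconds: int) -> float:
    """Average throughput in KiB/s for moving num_bytes in the given time."""
    return num_bytes / seconds / 1024


hbase_rate = throughput_kib_per_s(TABLE_BYTES, HBASE_LOAD_SECONDS)
hdfs_rate = throughput_kib_per_s(TABLE_BYTES, HDFS_LOAD_SECONDS)
slowdown = HBASE_LOAD_SECONDS / HDFS_LOAD_SECONDS

print(f"HBase load: {hbase_rate:.1f} KiB/s")
print(f"HDFS load:  {hdfs_rate:.1f} KiB/s (at least)")
print(f"Slowdown:   {slowdown:.0f}x (at least)")
```

Under these assumptions the HBase path averages roughly 45 KiB/s, against about 2 MiB/s or more for the direct HDFS load.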

Thank you again.

Hao.

On 20/08/2013 01:51, lars hofhansl wrote:
> Hi Hao,
>
> how do you run HBase in pseudo distributed mode, yet with 3 slaves?
> Where is the data written in EC2? EBS or local storage?
> Did you do any other tuning at the HBase or HDFS level (server side)?
>
> If your replication level is still set to 3 you're seeing somewhat of a worst case scenario, where each node gets 100% of all writes, and the speed is always dominated by your slowest machine.
> How does Hive perform here when you write to HDFS directly?
>
> Sorry, many questions :)
>
> -- Lars
>
> ________________________________
> From: Hao Ren <[EMAIL PROTECTED]>
> To: [EMAIL PROTECTED]
> Sent: Monday, August 19, 2013 1:50 AM
> Subject: Re: Loading data from Hive to HBase takes too long
>
>
> Update:
>
> There are 1 master and 3 slaves in my cluster.
> They are all m1.medium instances.
>
> Instance Family:          General purpose
> Instance Type:            m1.medium
> Processor Arch:           32-bit or 64-bit
> vCPU:                     1
> ECU:                      2
> Memory (GiB):             3.75
> Instance Storage (GB):    1 x 410
> EBS-optimized Available:  -
> Network Performance:      Moderate
>
>
> On 19/08/2013 10:44, Hao Ren wrote:
>> Update:
>>
>> I messed up some queries, here are the right ones:
>>
>> CREATE TABLE hbase_table (
>> material_id int,
>> new_id_client int,
>> last_purchase_date int)
>> STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
>> WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf1:idclt,cf1:dt_last_purchase")
>> TBLPROPERTIES("hbase.table.name" = "test");
>>
>> insert OVERWRITE TABLE hbase_table
>> select * from test;  -- takes a long time (about 8 hours)
>>
>> # bin/hadoop dfs -dus /user/hive/warehouse/test
>> hdfs://ec2-54-234-17-36.compute-1.amazonaws.com:9010/user/hive/warehouse/test
>> 1318012108
>>
>> the table 'test' is just about 1.3 GB.
>>
>>
>>
>> On 19/08/2013 10:40, Hao Ren wrote:
>>> Hi,
>>>
>>> I am running Hive and HBase on the same Amazon EC2 cluster, where
>>> HBase is in pseudo-distributed mode.
>>>
>>> After integrating HBase with Hive, I find that it takes a long time
>>> to run an "insert overwrite" query from Hive in order to load
>>> data into the related HBase table.
>>>
>>> In fact, the size of the data is only about 1.3 GB. I don't think that's normal.
>>>
>>> Maybe there is something wrong with my configuration.
>>>
>>> Here are some queries:
>>>
>>> CREATE TABLE hbase_table (
>>> material_id int,
>>> new_id_client int,
>>> last_purchase_date int)
>>> STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
>>> WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf1:idclt,cf1:dt_last_purchase")
>>> TBLPROPERTIES("hbase.table.name" = "test");
>>>
>>> insert OVERWRITE TABLE t_LIGNES_DERN_VENTES
>>> select * from test;  -- takes a long time (about 8 hours)
>>>
>>>
>>> Here are some configuration files for my cluster:
>>>
>>> # cat hive/conf/hive-site.xml
>>>
>>> <?xml version="1.0"?>
>>> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
>>>
>>> <configuration>
>>>
>>>       <property>
>>>           <name>hbase.zookeeper.quorum</name>
>>>           <value>ip-10-159-41-177.ec2.internal</value>
>>>       </property>
>>>
>>>       <property>
>>>           <name>hive.aux.jars.path</name>
>>> <value>/root/hive/build/dist/lib/hive-hbase-handler-0.9.0-amplab-4.jar,/root/hive/build/dist/lib/hbase-0.92.0.jar,/root/hive/build/dist/lib/zookeeper-3.4.3.jar,/root/hive/build/dist/lib/guava-r09.jar</value>
Hao Ren
ClaraVista
www.claravista.fr