HBase, mail # user - Loading data from Hive to HBase takes too long


Re: Loading data from Hive to HBase takes too long
Hao Ren 2013-08-19, 08:50
Update:

My cluster has 1 master and 3 slaves.
They are all m1.medium instances:

Instance family:          General purpose
Instance type:            m1.medium
Processor arch:           32-bit or 64-bit
vCPU:                     1
ECU:                      2
Memory (GiB):             3.75
Instance storage (GB):    1 x 410
EBS-optimized available:  -
Network performance:      Moderate
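
For a sense of scale, here is a rough back-of-the-envelope sketch using the two figures from the quoted message below (1318012108 bytes reported by `hadoop dfs -dus`, and a runtime of about 8 hours). The variable names are illustrative only, and since the runtime is approximate, so is the result:

```python
# Figures taken from the thread: size reported by `hadoop dfs -dus`
# and the reported load time ("about 8 hours", so this is an estimate).
size_bytes = 1318012108
elapsed_hours = 8

elapsed_seconds = elapsed_hours * 3600
throughput_bytes_s = size_bytes / elapsed_seconds
throughput_kib_s = throughput_bytes_s / 1024

print(f"{size_bytes / 1e9:.2f} GB in {elapsed_hours} h "
      f"is roughly {throughput_kib_s:.1f} KiB/s")
```

That works out to an effective load rate on the order of 45 KB/s for the whole cluster, which is why the load time looks abnormal to me.
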
On 19/08/2013 10:44, Hao Ren wrote:
> Update:
>
> I messed up some queries, here are the right ones:
>
> CREATE TABLE hbase_table (
> material_id int,
> new_id_client int,
> last_purchase_date int)
> STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
> WITH SERDEPROPERTIES ("hbase.columns.mapping" =
> ":key,cf1:idclt,cf1:dt_last_purchase")
> TBLPROPERTIES("hbase.table.name" = "test");
>
> insert OVERWRITE TABLE hbase_table
> select * from test;  -- takes a long time (about 8 hours)
>
> # bin/hadoop dfs -dus /user/hive/warehouse/test
> hdfs://ec2-54-234-17-36.compute-1.amazonaws.com:9010/user/hive/warehouse/test
> 1318012108
>
> The table 'test' is only about 1.3 GB.
>
>
>
> On 19/08/2013 10:40, Hao Ren wrote:
>> Hi,
>>
>> I am running Hive and HBase on the same Amazon EC2 cluster, where
>> HBase is in pseudo-distributed mode.
>>
>> After integrating HBase with Hive, I find that it takes a long time
>> to run an "insert overwrite" query from Hive in order to load
>> data into the related HBase table.
>>
>> The data is only about 1.3 GB, so I don't think this is normal.
>>
>> Maybe there is something wrong with my configuration.
>>
>> Here are some queries:
>>
>> CREATE TABLE hbase_table (
>> material_id int,
>> new_id_client int,
>> last_purchase_date int)
>> STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
>> WITH SERDEPROPERTIES ("hbase.columns.mapping" =
>> ":key,cf1:idclt,cf1:dt_last_purchase")
>> TBLPROPERTIES("hbase.table.name" = "test");
>>
>> insert OVERWRITE TABLE t_LIGNES_DERN_VENTES
>> select * from test;  -- takes a long time (about 8 hours)
>>
>>
>> Here are some configuration files for my cluster:
>>
>> # cat hive/conf/hive-site.xml
>>
>> <?xml version="1.0"?>
>> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
>>
>> <configuration>
>>
>>     <property>
>>         <name>hbase.zookeeper.quorum</name>
>>         <value>ip-10-159-41-177.ec2.internal</value>
>>     </property>
>>
>>     <property>
>>         <name>hive.aux.jars.path</name>
>>         <value>/root/hive/build/dist/lib/hive-hbase-handler-0.9.0-amplab-4.jar,/root/hive/build/dist/lib/hbase-0.92.0.jar,/root/hive/build/dist/lib/zookeeper-3.4.3.jar,/root/hive/build/dist/lib/guava-r09.jar</value>
>>     </property>
>>
>>     <property>
>>         <name>hbase.client.scanner.caching</name>
>>         <value>10000</value>
>>     </property>
>>
>> </configuration>
>>
>> # cat hbase-0.92.0/conf/hbase-site.xml
>>
>> <?xml version="1.0"?>
>> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
>>
>> <configuration>
>>
>>     <property>
>>         <name>hbase.rootdir</name>
>>         <value>hdfs://ec2-54-234-17-36.compute-1.amazonaws.com:9010/hbase</value>
>>     </property>
>>
>>     <property>
>>         <name>hbase.cluster.distributed</name>
>>         <value>true</value>
>>     </property>
>>
>>     <property>
>>         <name>hbase.zookeeper.quorum</name>
>>         <value>ip-10-159-41-177.ec2.internal</value>
>>     </property>
>>
>>     <property>
>>         <name>hbase.client.scanner.caching</name>
>>         <value>10000</value>
>>     </property>
>>
>> </configuration>
>>
>> Any help is highly appreciated!
>>
>> Thank you.
>>
>> Hao
>>
>
>
--
Hao Ren
ClaraVista
www.claravista.fr