Hive, mail # user - Hive-645 is slow to insert query results to mysql


Lu, Wei 2012-03-09, 02:57
RE: Hive-645 is slow to insert query results to mysql
Viral Bajaria 2012-03-09, 17:49
Hey Wei,

I have used the UDF before and found it is only useful for small summary
results, not for big datasets, because of the fault-tolerant nature of
MapReduce: if a failed task is retried, its rows get inserted again, so
unless you have a well-defined primary key you will end up with more rows
than your query results. And you are correct in saying that this is not a
bulk insert, since the UDF executes in the SELECT statement and therefore
processes each returned row individually.
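To illustrate the per-row versus batched difference, here is a minimal sketch in Python, with an in-memory SQLite database standing in for MySQL; the dc(t, c) table mirrors the one in the thread, but everything else here is illustrative only:

```python
import sqlite3

# In-memory SQLite stands in for MySQL for illustration purposes.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dc (t TEXT, c INTEGER)")

rows = [("2012-03-08 12:%02d" % i, i) for i in range(60)]

# Per-row inserts: one statement per row, which is effectively what a
# UDF firing once per SELECT row does against the database.
for t, c in rows:
    conn.execute("INSERT INTO dc (t, c) VALUES (?, ?)", (t, c))

# Batched insert: all rows handed to the driver in one call, letting it
# amortize the per-statement overhead.
conn.execute("DELETE FROM dc")
conn.executemany("INSERT INTO dc (t, c) VALUES (?, ?)", rows)
conn.commit()

print(conn.execute("SELECT COUNT(*) FROM dc").fetchone()[0])  # 60
```

Against a remote MySQL server the gap is much larger than with local SQLite, since every per-row statement also pays a network round trip.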

You can try your solution using Sqoop; it seems to be the most common way
of getting data out into DBs, though I have not used it personally.

Viral
------------------------------
From: Lu, Wei
Sent: 3/8/2012 6:58 PM
To: [EMAIL PROTECTED]
Subject: Hive-645 is slow to insert query results to mysql

Hi,

I recently tried the HIVE-645 feature to save query results directly to a
MySQL table. The feature can be found here:
https://issues.apache.org/jira/browse/HIVE-645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel.

The query I tried looks like this:

hive>CREATE TEMPORARY FUNCTION dboutput AS
'org.apache.hadoop.hive.contrib.genericudf.example.GenericUDFDBOutput';

hive>SELECT dboutput('jdbc:mysql://localhost/hdfs_db','uid','pwd','INSERT
INTO dc(t,c) VALUES (?,?)',requestbegintime,count(1)) FROM impressions2
GROUP BY requestbegintime;

It works, but the reduce tasks are very slow:

Task:        task_201203081246_0001_r_000000
Complete:    100.00%
Status:      reduce > reduce
Start Time:  8-Mar-2012 12:47:06
Finish Time: 8-Mar-2012 13:33:54 (46mins, 47sec)
Errors:      -
Counters:    11

I set the number of reducers to 4, but it is still very slow (in the end
171,667 rows were inserted into MySQL).

I suspect the reduce tasks do not insert data into MySQL in batch mode. Can
anyone give me some suggestions to improve the performance?

PS: I think it might be better to first save the results to HDFS and then
use Sqoop to load the data into MySQL, right?
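That two-step approach could look roughly like the sketch below. This is an assumption-laden outline, not a tested recipe: the HDFS export path is a placeholder, the credentials are the ones from the query above, and the MySQL table dc must already exist with matching columns.

```shell
# Step 1: write the aggregate to HDFS instead of calling dboutput per row
hive -e "INSERT OVERWRITE DIRECTORY '/tmp/dc_export'
         SELECT requestbegintime, count(1) FROM impressions2
         GROUP BY requestbegintime;"

# Step 2: bulk-load the exported files into MySQL with Sqoop.
# Hive writes \001 (Ctrl-A) as its default field delimiter.
sqoop export \
  --connect jdbc:mysql://localhost/hdfs_db \
  --username uid --password pwd \
  --table dc \
  --export-dir /tmp/dc_export \
  --input-fields-terminated-by '\001'
```

Sqoop export runs its own MapReduce job that issues batched inserts in parallel, which avoids the one-row-at-a-time pattern of the UDF.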

Regards,

Wei
Bejoy Ks 2012-03-09, 18:11
Lu, Wei 2012-03-12, 01:17