I have used the UDF before and concluded that it is only useful for
summary results, not for big datasets, because of the fault-tolerant
nature of map/reduce: a failed or speculatively executed task is run
again, and its inserts are run again with it. Without a well-defined
primary key on the target table you can end up with more rows than your
query returned. And you are correct in saying that this is not a bulk
insert: the UDF is evaluated as part of the SELECT statement, so it
processes each returned row individually.
You can try your solution using Sqoop. It seems to be the most common way
of getting data out into DBs, though I have not used it personally.
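
A minimal sketch of the duplicate-row issue described above. It uses
Python's built-in sqlite3 as a stand-in for MySQL, and the rows and the
"task runs twice" simulation are assumptions made for illustration, not
output from the actual job. SQLite's INSERT OR REPLACE plays the role that
REPLACE INTO or INSERT ... ON DUPLICATE KEY UPDATE would play in MySQL.

```python
import sqlite3

# Hypothetical aggregated rows, standing in for (requestbegintime, count(1)).
ROWS = [("2012-03-08 13:00", 10), ("2012-03-08 13:01", 7)]

def run_task(conn, sql, rows):
    """Simulate one reduce task issuing one INSERT per row."""
    for row in rows:
        conn.execute(sql, row)
    conn.commit()

def count_after_retry(create_sql, insert_sql):
    """Run the same task twice (as a retry would) and count stored rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(create_sql)
    run_task(conn, insert_sql, ROWS)  # first attempt
    run_task(conn, insert_sql, ROWS)  # the same task executed again
    (n,) = conn.execute("SELECT COUNT(*) FROM dc").fetchone()
    conn.close()
    return n

# No primary key: the second run duplicates every row.
no_key = count_after_retry(
    "CREATE TABLE dc (t TEXT, c INTEGER)",
    "INSERT INTO dc(t, c) VALUES (?, ?)",
)

# Primary key plus an upsert: the second run overwrites instead.
with_key = count_after_retry(
    "CREATE TABLE dc (t TEXT PRIMARY KEY, c INTEGER)",
    "INSERT OR REPLACE INTO dc(t, c) VALUES (?, ?)",
)

print(no_key, with_key)  # prints "4 2"
```

With the key in place, running the task twice leaves the table in the same
state as running it once, which is what makes the retry harmless.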
From: Lu, Wei
Sent: 3/8/2012 6:58 PM
To: [EMAIL PROTECTED]
Subject: Hive-645 is slow to insert query results to mysql
I recently tried the Hive-645 feature to save query results directly to a
MySQL table. The feature can be found here:
The query I tried looks like this:
hive>CREATE TEMPORARY FUNCTION dboutput AS
INTO dc(t,c) VALUES (?,?)',requestbegintime,count(1)) FROM impressions2
GROUP BY requestbegintime;
It works, but the reduce tasks are very slow:
reduce > reduce
8-Mar-2012 13:33:54 (46mins, 47sec)
I set the number of reducers to 4, but it is still very slow (in the end,
171,667 rows were inserted into MySQL).
I guess the reduce tasks do not insert data into MySQL in batch mode. Can
anyone give me some suggestions to improve the performance?
PS: I think it might be better to first save the results to HDFS and then
use Sqoop to load the data into MySQL, right?
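
To illustrate the batch-mode guess above, here is a rough sketch of the two
insert patterns. It again uses Python's built-in sqlite3 instead of MySQL,
and the function names, the batch size, and the generated rows are all
hypothetical. It does not show how the dboutput UDF is implemented; it only
contrasts one statement and commit per row with chunked inserts. Against a
remote MySQL server the per-row pattern would also pay a network round trip
for every row.

```python
import sqlite3

def make_table():
    """Create an in-memory table shaped like the dc(t, c) target."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE dc (t TEXT, c INTEGER)")
    return conn

def insert_row_by_row(conn, rows):
    """One statement and one commit per row (the slow pattern)."""
    for row in rows:
        conn.execute("INSERT INTO dc(t, c) VALUES (?, ?)", row)
        conn.commit()

def insert_in_batches(conn, rows, batch_size=1000):
    """Send rows in chunks, committing once per chunk."""
    for start in range(0, len(rows), batch_size):
        chunk = rows[start:start + batch_size]
        conn.executemany("INSERT INTO dc(t, c) VALUES (?, ?)", chunk)
        conn.commit()

def row_count(conn):
    """Return the number of rows currently stored in dc."""
    return conn.execute("SELECT COUNT(*) FROM dc").fetchone()[0]

rows = [(f"t{i}", i) for i in range(2500)]  # hypothetical data
slow, fast = make_table(), make_table()
insert_row_by_row(slow, rows)
insert_in_batches(fast, rows)
```

Both tables end up with the same rows; the difference is only in how many
statements and commits are issued to get there.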