Re: Lifecycle and Configuration of a hive UDF
Added a tiny blurb here: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-UDFinternals
Comments/suggestions welcome!

Thanks for bringing it up, Justin.


Mark Grover, Business Intelligence Analyst
OANDA Corporation

----- Original Message -----
From: "Justin Coffey" <[EMAIL PROTECTED]>
Sent: Monday, April 23, 2012 5:19:15 AM
Subject: Re: Lifecycle and Configuration of a hive UDF

Hello All,
Thank you very much for the responses. I can confirm that the lag function implementation works in my case:
create temporary function lag as 'com.example.hive.udf.Lag';
select session_id, hit_datetime_gmt, lag(hit_datetime_gmt, session_id)
from (
select session_id, hit_datetime_gmt
from omni2
where visit_day = '2012-01-12' and session_id is not null
distribute by session_id
sort by session_id, hit_datetime_gmt
) X
distribute by session_id
limit 1000
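For readers wondering what a lag-style UDF keeps between rows, here is a minimal Java sketch. It is not the source of com.example.hive.udf.Lag; the class name LagSketch and its fields are hypothetical. In Hive the class would presumably extend org.apache.hadoop.hive.ql.exec.UDF, which is left out here so the example compiles without the Hive jars. It assumes rows arrive grouped by key and already sorted, which is what the distribute by / sort by subquery is meant to provide.

```java
// Hypothetical sketch of the per-task state behind lag(value, key).
// Assumption: a real Hive UDF would extend org.apache.hadoop.hive.ql.exec.UDF;
// that dependency is omitted to keep the example self-contained.
final class LagSketch {
    private String lastKey;   // key (e.g. session_id) of the previous row
    private String lastValue; // value (e.g. hit_datetime_gmt) of the previous row

    // Returns the previous row's value if it had the same key, otherwise null.
    // Only meaningful when rows are fed in grouped by key and sorted.
    public String evaluate(String value, String key) {
        String previous = (key != null && key.equals(lastKey)) ? lastValue : null;
        lastKey = key;
        lastValue = value;
        return previous;
    }
}
```

Because the result depends on the previous call, the same instance has to see every row of a key in order, so the output is only correct when the function runs after the sort.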
For the rank it looks like:

create temporary function rank as 'com.example.hadoop.hive.udf.Rank';
select user_id, time, rank(user_id) as rank
from (
select user_id, time
from log
where day = '2012-04-01' and hour = 7
distribute by user_id
sort by user_id, time
) X
distribute by user_id
limit 2000
As mentioned by others, this appears to force the UDF to be executed reduce-side. At least, I can't figure out how it would work otherwise, because only one MapReduce job is created (with multiple reducers).
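To make the dependence on row order concrete, here is a minimal Java sketch of the state a rank-style UDF might hold. It is an assumption about the general shape, not the actual com.example.hadoop.hive.udf.Rank source, and the name RankSketch is hypothetical. The Hive base class (org.apache.hadoop.hive.ql.exec.UDF) is again omitted so the code stands alone. Note that it simply numbers rows within a key; the tie handling of SQL RANK() is not modeled.

```java
// Hypothetical sketch: numbers rows 1, 2, 3, ... within each run of equal keys.
// Assumption: in Hive this would extend org.apache.hadoop.hive.ql.exec.UDF.
final class RankSketch {
    private String lastKey; // key (e.g. user_id) of the previous row
    private int counter;    // position of the previous row within its key

    // Restarts at 1 whenever the key changes, otherwise increments.
    public int evaluate(String key) {
        if (key == null || !key.equals(lastKey)) {
            counter = 0;
            lastKey = key;
        }
        return ++counter;
    }
}
```

If this ran map-side, before the distribute by / sort by, rows of one user could be split across tasks or arrive unsorted, and the counter would restart at arbitrary points.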
As a note to the documentation maintainers, it might be nice to have the procedural workflow of UDFs, UDTFs, and UDAFs documented in the wiki. I know it is logical that an aggregation function runs reducer-side, but I think there is sufficient complexity in a SQL-to-MapReduce translator that it is worth the effort to document this explicitly for aggregations and the other function types (or please just bludgeon me over the head if I happened to miss it).
Not to be pedantic, but for example, the UDAF case study doc does not even mention the word "reduce":
Thanks again for all the pointers!
On Fri, Apr 20, 2012 at 8:18 PM, Alex Kozlov < [EMAIL PROTECTED] > wrote:
You might also look at http://www.quora.com/Hive-computing/How-are-SQL-type-analytic-and-windowing-functions-accomplished-in-Hadoop-Hive for a way to utilize secondary sort for analytic windowing functions.

RANK() OVER(...) will require grouping and sorting. While it can be done in the mapper or reducer stage, it is better to utilize Hadoop's shuffle properties to accomplish both of them. The disadvantage may be that you can compute only one RANK() in a MapReduce job.
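As a rough illustration of the grouping-plus-sorting idea, the Java sketch below imitates a shuffle in memory: rows are partitioned by key, each partition is sorted by key and time, and a single pass assigns positions. Everything here (ShuffleRankSketch, Row, run, the hash partitioning) is a hypothetical simplification and not Hadoop's actual shuffle code.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical in-memory imitation of "distribute by key, sort by key, time".
final class ShuffleRankSketch {
    // One input row; the field names are assumptions for illustration.
    static final class Row {
        final String userId;
        final long time;
        int rank; // filled in by the simulated reducer
        Row(String userId, long time) { this.userId = userId; this.time = time; }
    }

    static List<List<Row>> run(List<Row> input, int numReducers) {
        List<List<Row>> reducers = new ArrayList<>();
        for (int i = 0; i < numReducers; i++) reducers.add(new ArrayList<>());

        // "Distribute by": all rows of one user land in the same partition.
        for (Row r : input) {
            reducers.get(Math.floorMod(r.userId.hashCode(), numReducers)).add(r);
        }

        for (List<Row> part : reducers) {
            // "Sort by": order each partition by key, then by time.
            part.sort(Comparator.comparing((Row r) -> r.userId)
                                .thenComparingLong(r -> r.time));
            // Single streaming pass: restart the counter when the key changes.
            String lastKey = null;
            int counter = 0;
            for (Row r : part) {
                if (!r.userId.equals(lastKey)) { counter = 0; lastKey = r.userId; }
                r.rank = ++counter;
            }
        }
        return reducers;
    }
}
```

Since each partition is sorted only once, by one key and one ordering, a second ranking over a different ordering would need another sort, which is one way to read the one-RANK()-per-job limitation mentioned above.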


Alex K
On Fri, Apr 20, 2012 at 10:54 AM, Philip Tromans < [EMAIL PROTECTED] > wrote:
Have a read of the thread "Lag function in Hive", linked from:


There's an example of how to force a function to run reduce-side. I've
written a UDF which replicates RANK () OVER (...), but it requires the
syntactic sugar given in the thread. I'd like to make changes to the
hive query planner at some point, so that you can annotate a UDF with
a "run on reducer" hint, and after that I'd happily open source
everything. If you want more details of how to implement your own
partitionedRowNumber() UDF then I'd be happy to elaborate.



On 20 April 2012 18:35, Mark Grover < [EMAIL PROTECTED] > wrote:
> Hi Rajan and Justin,
> As per my understanding, the scope of a UDF is only one row of data at a time. Therefore, it can be done all map side without the need for the reducer being involved. Now, depending on where you are storing the result of the query, your query may have reducers that do something.
> A simple query like Rajan mentioned
> select MyUDF(field1,field2) from table;