Hive, mail # user - Lag function in Hive


Re: Lag function in Hive
Mark Grover 2012-04-11, 13:31
Hi Karan,
The error you get when creating the temporary function typically happens when there is a typo in the class name (com.example.hive.udf.Lag, in this case).

Can you ensure that the jar was properly built and contains the Lag class in the com.example.hive.udf package?
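One way to run that check from Java is sketched below. It is only an illustration, not part of the original exchange: the class name JarClassCheck and its method containsClass are hypothetical, and the default jar path and class name are simply the ones quoted in this thread. It relies on the fact that a class a.b.C is stored in a jar under the entry a/b/C.class.

```java
import java.io.IOException;
import java.util.jar.JarFile;

// Hypothetical helper (not from the thread): checks whether a jar holds a given class.
public final class JarClassCheck {

    /** Returns true if the jar at jarPath has an entry for the fully qualified class name. */
    public static boolean containsClass(String jarPath, String className) throws IOException {
        // A class a.b.C is stored in a jar as the entry a/b/C.class.
        String entryName = className.replace('.', '/') + ".class";
        try (JarFile jar = new JarFile(jarPath)) {
            return jar.getJarEntry(entryName) != null;
        }
    }

    public static void main(String[] args) throws IOException {
        // Defaults are the path and class name quoted in the thread; pass your own as arguments.
        String jarPath = args.length > 0 ? args[0] : "/users/unix/singhka/Analytics.jar";
        String className = args.length > 1 ? args[1] : "com.example.hive.udf.Lag";
        boolean found = containsClass(jarPath, className);
        System.out.println(className + (found ? " found in " : " NOT found in ") + jarPath);
    }
}
```

If the class is reported as missing, the likely causes are a wrong package declaration in the source file or a jar built from the wrong directory level.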

Mark

Mark Grover, Business Intelligence Analyst
OANDA Corporation

www: oanda.com www: fxtrade.com
e: [EMAIL PROTECTED]

"Best Trading Platform" - World Finance's Forex Awards 2009.
"The One to Watch" - Treasury Today's Adam Smith Awards 2009.
----- Original Message -----
From: "karanveer singh" <[EMAIL PROTECTED]>
To: [EMAIL PROTECTED]
Sent: Wednesday, April 11, 2012 4:15:59 AM
Subject: RE: Lag function in Hive

Rob n all -

I tried the code below and created the jar file. To add the jar to the class path, I do the following:

hive> add jar /users/unix/singhka/Analytics.jar;

The above seems to have worked fine as I see the resource added but when I go ahead and create a function, I get the following error. Any ideas what the issue can be?

hive> create temporary function lag as 'com.example.hive.udf.Lag';
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.FunctionTask
Regards,
-----Original Message-----
From: Hamilton, Robert (Austin) [mailto:[EMAIL PROTECTED]]
Sent: 10 April 2012 20:32
To: [EMAIL PROTECTED]
Subject: RE: Lag function in Hive

You can write a custom UDF -

Here is one that I have played around with, along with some test SQL. It comes with no warranty :)

Sorry I can't really share the test data, but hopefully you get the idea. To run it, compile the Lag class, jar it up into Analytics.jar, put the jar on the CLASSPATH (you may need to deploy it to all the nodes on the cluster) and run the hive command below.

Note that the "distribute by" and "sort by" are critical. Also, the sub-select is just an artifice to make sure the UDF runs in the reducer (so that its input is sorted). Maybe the hive experts can suggest a better way for that to work...

#
# use live clickstream test data from 2012-01-12
#
hive -e "add jar Analytics.jar;

create temporary function lag as 'com.example.hive.udf.Lag';
select session_id,hit_datetime_gmt,lag(hit_datetime_gmt,session_id)
from (select session_id,hit_datetime_gmt from omni2 where visit_day='2012-01-12' and session_id is not null
distribute by session_id
sort by session_id,hit_datetime_gmt ) X
distribute by session_id limit 1000
"

------------------------ Contents of Lag.java -----------------------------------------
package com.example.hive.udf;
import org.apache.hadoop.hive.ql.exec.UDF;

public final class Lag extends UDF {
    // State carried between calls; only meaningful when rows arrive
    // grouped by groupKey and sorted within each group.
    private String lastKey;
    private String lastGroup;

    public String evaluate(String key, String groupKey) {
        // Start of a new group (or a null group): there is no previous row.
        // Note that the group comparison is case-insensitive.
        if (groupKey == null || !groupKey.equalsIgnoreCase(this.lastGroup)) {
            this.lastKey = null;
        }
        String previous = this.lastKey;
        this.lastKey = key;
        this.lastGroup = groupKey;
        return previous;
    }
}

Result of test run:

1326326437-26270601625187049522752846106448274394       2012-01-12 00:00:37     NULL
1326326437-26270601625187049522752846106448274394       2012-01-12 00:00:59     2012-01-12 00:00:37
1326326437-26270601625187049522752846106448274394       2012-01-12 00:01:05     2012-01-12 00:00:59
1326326437-26270601625187049522752846106448274394       2012-01-12 00:01:07     2012-01-12 00:01:05
1326326437-26270601625187049522752846106448274394       2012-01-12 00:01:11     2012-01-12 00:01:07
1326326437-26270601625187049522752846106448274394       2012-01-12 00:01:12     2012-01-12 00:01:11
1326326437-26270601625187049522752846106448274394       2012-01-12 00:01:24     2012-01-12 00:01:12
1326326437-26270601625187049522752846106448274394       2012-01-12 00:01:32     2012-01-12 00:01:24
1326326437-26270601625187049522752846106448274394       2012-01-12 00:01:45     2012-01-12 00:01:32
1326326437-26270601625187049522752846106448274394       2012-01-12 00:01:48     2012-01-12 00:01:45
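
To see why the ordering matters, the same state machine can be run outside Hive. The sketch below is an illustration only: LagSketch and lagAll are hypothetical names, the rows are made-up values rather than the test data, and the evaluate() logic mirrors the Lag class in this thread without the Hive dependency. It suggests that rows sorted by group give the expected previous value, while interleaved groups reset the state on every row.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the Lag UDF, with no Hive dependency.
public final class LagSketch {
    private String lastKey;
    private String lastGroup;

    /** Returns the key seen on the previous call, or null at the start of a group. */
    public String evaluate(String key, String groupKey) {
        if (groupKey == null || !groupKey.equalsIgnoreCase(lastGroup)) {
            lastKey = null; // new group: no previous row
        }
        String previous = lastKey;
        lastKey = key;
        lastGroup = groupKey;
        return previous;
    }

    /** Applies evaluate() to rows of {group, value} in the order given. */
    public static List<String> lagAll(String[][] rows) {
        LagSketch lag = new LagSketch();
        List<String> out = new ArrayList<>();
        for (String[] row : rows) {
            out.add(lag.evaluate(row[1], row[0]));
        }
        return out;
    }

    public static void main(String[] args) {
        // Made-up rows: {session, time}.
        String[][] sorted = {
            {"s1", "00:00:37"}, {"s1", "00:00:59"}, {"s2", "00:00:10"}, {"s2", "00:00:20"}};
        String[][] interleaved = {
            {"s1", "00:00:37"}, {"s2", "00:00:10"}, {"s1", "00:00:59"}, {"s2", "00:00:20"}};
        System.out.println(lagAll(sorted));      // [null, 00:00:37, null, 00:00:10]
        System.out.println(lagAll(interleaved)); // [null, null, null, null]
    }
}
```

In the interleaved case every row looks like the start of a new group, so the function never returns a previous value.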

From: Philip Tromans [mailto:[EMAIL PROTECTED]]
Sent: Tuesday, April 10, 2012 9:18 AM
To: [EMAIL PROTECTED]
Subject: Re: Lag function in Hive

Hi Karan,

To the best of my knowledge, there isn't one. It's also unlikely to happen because it's hard to parallelise in a map-reduce way: it requires knowing where you are in a result set and who your neighbours are, and they in turn need to be present on the same node as you, which is difficult to guarantee.
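
The co-location requirement can be pictured with a small hash-partitioning sketch. This is an assumption-laden illustration, not Hive's implementation: PartitionSketch, partitionFor and distribute are hypothetical names, and the rows are invented. The point it tries to show is that when rows are routed by a hash of the group key, all rows of one group end up in the same bucket, which is what a neighbour-aware function would need.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of routing rows to partitions by group key.
public final class PartitionSketch {

    /** Picks a partition for a key; equal keys always map to the same partition. */
    public static int partitionFor(String key, int numPartitions) {
        // Mask the sign bit so the result is non-negative even for negative hash codes.
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    /** Splits rows of {group, value} into numPartitions buckets by group key. */
    public static List<List<String[]>> distribute(String[][] rows, int numPartitions) {
        List<List<String[]>> partitions = new ArrayList<>();
        for (int i = 0; i < numPartitions; i++) {
            partitions.add(new ArrayList<>());
        }
        for (String[] row : rows) {
            partitions.get(partitionFor(row[0], numPartitions)).add(row);
        }
        return partitions;
    }

    public static void main(String[] args) {
        // Invented rows: {session, time}.
        String[][] rows = {
            {"s1", "00:00:37"}, {"s2", "00:00:10"}, {"s1", "00:00:59"}, {"s3", "00:00:05"}};
        List<List<String[]>> partitions = distribute(rows, 2);
        for (int i = 0; i < partitions.size(); i++) {
            for (String[] row : partitions.get(i)) {
                System.out.println("partition " + i + ": " + row[0] + " " + row[1]);
            }
        }
    }
}
```

Routing alone does not order the rows inside a bucket, so each bucket would still need to be sorted before any previous-row logic is applied.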

Cheers,

Phil.

On 10 April 2012 14:44,  <[EMAIL PROTECTED]> wrote: