Pig >> mail # user >> Jython UDFs, Tuples and Stringconversions


Re: Jython UDFs, Tuples and String Conversions
Hi,

Please try this:

1. I used a tab-separated input file as follows:

cheolsoo@localhost:~/workspace/pig-svn $cat tag_count_ts_pro_userpair
('a','b','c','d') 3 {('2012-03-04 10:10:10'),('2013-03-04 10:10:11')}

2. My udf is as follows:

import datetime

@outputSchema("days_from_start:bag{t:tuple(cnt:int)}")
def daysFromStart(startDate, aBagOfDates):
    if aBagOfDates is None:
        return None
    result = []
    for someDate in aBagOfDates:
        if someDate is None:
            continue
        someDate = ''.join(someDate)
        # 21 characters = "'YYYY-MM-DD HH:MM:SS'" including the quotes
        if len(someDate) == 21:
            result.append(diffTime(startDate, someDate))
    return result

@outputSchema("diff:int")
def diffTime(dateFrom, dateTil):
    dateSmall = datetime.datetime.strptime(dateFrom, "%Y-%m-%d %H:%M:%S")
    # dateTil[1:-1] strips the surrounding single quotes before parsing
    dateBig = datetime.datetime.strptime(dateTil[1:-1], "%Y-%m-%d %H:%M:%S")
    delta = dateBig - dateSmall
    return delta.days

3. My pig script is as follows:

register 'udf.py' using jython as moins;

x = load 'tag_count_ts_pro_userpair' using PigStorage('\t')
    as (group:(), cnt:int, times:{(chararray)});
y = foreach x generate *, moins.daysFromStart('2011-06-01 00:00:00', times);
dump y;

This returns:

(('a','b','c','d'),3,{('2012-03-04 10:10:10'),('2013-03-04 10:10:11')},{(277),(642)})
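
As a sanity check outside Pig, the 277 in that last bag can be reproduced in plain Python: each chararray arrives with its surrounding single quotes (21 characters for these timestamps), which is why the UDF checks the length against 21 and strips the quotes with `[1:-1]` before parsing. A minimal sketch:

```python
import datetime

# The chararray keeps its single quotes, e.g. 21 characters here:
s = "'2012-03-04 10:10:10'"
assert len(s) == 21

# Strip the quotes ([1:-1]) before parsing, as diffTime() does:
until = datetime.datetime.strptime(s[1:-1], "%Y-%m-%d %H:%M:%S")
start = datetime.datetime.strptime("2011-06-01 00:00:00", "%Y-%m-%d %H:%M:%S")
print((until - start).days)  # 277, matching the first entry of the result bag
```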

Thanks,
Cheolsoo

On Mon, Oct 1, 2012 at 7:42 AM, Björn-Elmar Macek <[EMAIL PROTECTED]> wrote:

> Hi,
>
> I am currently writing a Pig script that works with bags of timestamp
> tuples. So I am basically working on a data structure like this:
> (tuple(chararray), int, bag{tuple(chararray)})
>
> for example:
> ( ('a','b','c','d'), 3, {('2012-03-04 10:10:10'), ('2012-03-04 10:10:11')} )
>
> When loading the data I add a schema, so Pig knows what data is coming in:
> x = load 'tag_count_ts_pro_userpair' as (group:tuple(), cnt:int, times:bag{});
>
> I then want to change the content of the times bag by replacing every
> timestamp with an integer, based on the time distance to a certain date,
> which I do with the following UDFs:
> ###### myUDF.py ##############
> from org.apache.pig.scripting import *
> import datetime
> import math
>
>
> @outputSchema("days_from_start:bag{t:tuple(cnt:int)}")
> def daysFromStart(startDate, aBagOfDates):
>         if aBagOfDates is None: return None;
>         result=[]
>         for somedate in aBagOfDates:
>             if somedate is None: continue
>             aDateString = ''.join(somedate)
>             #ALTERNATIVELY I USED ALSO: aDateString = ''.join(somedate[0])
>             #                       or: aDateString = ''.join(somedate[1])
>             if len(aDateString==16): result.append(diffTime(startDate, aDateString))
>         return result
>
>
> @outputSchema("diff:int")
> def diffTime(dateFrom,dateTil):
>     dateSmall = datetime.datetime.strptime(dateFrom, "%Y-%m-%d %H:%M:%S");
>     dateBig = datetime.datetime.strptime(dateTil, "%Y-%m-%d %H:%M:%S");
>     delta = dateBig-dateSmall
>     return delta.days
>
> ##########################
>
> I do this by executing the following command in the Grunt shell:
> y = foreach x generate *, moins.daysFromStart('2011-06-01 00:00:00', times);
>
> But when I try to store y, I get the following error message:
>
> ######## LOG #############
> 2012-10-01 16:35:03,499 [main] ERROR org.apache.pig.tools.pigstats.SimplePigStats
> - ERROR 2997: Unable to recreate exception from backed error:
> org.apache.pig.backend.executionengine.ExecException: ERROR 0: Error
> executing function
>     at org.apache.pig.scripting.jython.JythonFunction.exec(JythonFunction.java:106)
>     at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:216)
>     at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:258)
>     at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.getNext(PhysicalOperator.java:316)
>     at org.apache.pig.backend.hadoop.executionengine.physicalLayer.
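
[Editor's note] The "ERROR 0: Error executing function" above most likely comes from the misplaced parenthesis in the original UDF, `if len(aDateString==16):`. That expression compares the string to 16 (always False) and then calls len() on a bool, which raises a TypeError inside the Jython function; the working version in the reply writes the check as `len(someDate)==21`. A small sketch of the difference:

```python
s = "'2012-03-04 10:10:10'"  # a chararray value, quotes included

# Buggy form: s == 16 evaluates to False, and len(False) raises TypeError
try:
    len(s == 16)
except TypeError as exc:
    print("fails:", exc)

# Intended form, with the parenthesis in the right place
print(len(s) == 21)  # prints True
```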