Pig >> mail # user >> Jython UDFs, Tuples and String Conversions


Re: Jython UDFs, Tuples and String Conversions
Hi,

Please try this:

1. I used a tab-separated input file as follows:

cheolsoo@localhost:~/workspace/pig-svn $cat tag_count_ts_pro_userpair
('a','b','c','d') 3 {('2012-03-04 10:10:10'),('2013-03-04 10:10:11')}

2. My udf is as follows:

import datetime

@outputSchema("days_from_start:bag{t:tuple(cnt:int)}")
def daysFromStart(startDate, aBagOfDates):
        if aBagOfDates is None: return None
        result=[]
        for someDate in aBagOfDates:
            if someDate is None: continue
            someDate = ''.join(someDate)
            if len(someDate)==21: result.append(diffTime(startDate, someDate))
        return result

@outputSchema("diff:int")
def diffTime(dateFrom, dateTil):
    dateSmall = datetime.datetime.strptime(dateFrom, "%Y-%m-%d %H:%M:%S")
    dateBig = datetime.datetime.strptime(dateTil[1:-1], "%Y-%m-%d %H:%M:%S")
    delta = dateBig - dateSmall
    return delta.days
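
Outside of Pig, the two UDFs above can be exercised as plain Python (the `datetime` module behaves the same under Jython). The sketch below models the bag as a list of one-field tuples holding the sample values from the input file; note that the chararrays still carry their single quotes after PigStorage parsing, which is why the length check is 21 and the `[1:-1]` slice strips them:

```python
import datetime

def diff_time(date_from, date_til):
    date_small = datetime.datetime.strptime(date_from, "%Y-%m-%d %H:%M:%S")
    # date_til arrives as "'2012-03-04 10:10:10'" -- [1:-1] drops the quotes
    date_big = datetime.datetime.strptime(date_til[1:-1], "%Y-%m-%d %H:%M:%S")
    return (date_big - date_small).days

def days_from_start(start_date, bag_of_dates):
    if bag_of_dates is None:
        return None
    result = []
    for some_date in bag_of_dates:
        if some_date is None:
            continue
        # each bag element is a one-field tuple; join flattens it to a string
        some_date = ''.join(some_date)
        if len(some_date) == 21:  # quote + 19-char timestamp + quote
            result.append(diff_time(start_date, some_date))
    return result

bag = [("'2012-03-04 10:10:10'",), ("'2013-03-04 10:10:11'",)]
print(days_from_start('2011-06-01 00:00:00', bag))  # [277, 642]
```

This matches the `{(277),(642)}` bag in the dump output below.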

3. My pig script is as follows:

register 'udf.py' using jython as moins;

x = load 'tag_count_ts_pro_userpair' using PigStorage('\t') as (group:(), cnt:int, times:{(chararray)});
y = foreach x generate *, moins.daysFromStart('2011-06-01 00:00:00', times);
dump y;

This returns:

(('a','b','c','d'),3,{('2012-03-04 10:10:10'),('2013-03-04 10:10:11')},{(277),(642)})

Thanks,
Cheolsoo

On Mon, Oct 1, 2012 at 7:42 AM, Björn-Elmar Macek <[EMAIL PROTECTED]> wrote:

> Hi,
>
> I am currently writing a Pig script that works with bags of timestamp
> tuples. So I am basically working on a data structure like this:
> (tuple(chararray), int, bag{tuple(chararray)})
>
> for example:
> ( ('a','b','c','d'), 3, {('2012-03-04 10:10:10'), ('2012-03-04 10:10:11')}
> )
>
> When loading the data I add a schema, so Pig knows what data is coming in:
> x = load 'tag_count_ts_pro_userpair' as (group:tuple(), cnt:int, times:bag{});
>
> I then want to change the content of the times-bag, by replacing every
> timestamp with an integer, based on the time distance to a certain date,
> which I do with the following UDFs:
> ###### myUDF.py ##############
> from org.apache.pig.scripting import *
> import datetime
> import math
>
>
> @outputSchema("days_from_start:bag{t:tuple(cnt:int)}")
> def daysFromStart(startDate, aBagOfDates):
>         if aBagOfDates is None: return None;
>         result=[]
>         for somedate in aBagOfDates:
>             if somedate is None: continue
>             aDateString = ''.join(somedate)
>             # ALTERNATIVELY I USED ALSO: aDateString = ''.join(somedate[0]) // aDateString = ''.join(somedate[1])
>             if len(aDateString==16): result.append(diffTime(startDate, aDateString))
>         return result
>
>
> @outputSchema("diff:int")
> def diffTime(dateFrom,dateTil):
>     dateSmall = datetime.datetime.strptime(dateFrom, "%Y-%m-%d %H:%M:%S");
>     dateBig = datetime.datetime.strptime(dateTil, "%Y-%m-%d %H:%M:%S");
>     delta = dateBig-dateSmall
>     return delta.days
>
> ##########################
>
> I do this by executing the following command in the grunt:
> y = foreach x generate *, moins.daysFromStart('2011-06-01 00:00:00', times);
>
> But when I try to store y, I get the following error message:
>
> ######## LOG #############
> 2012-10-01 16:35:03,499 [main] ERROR org.apache.pig.tools.pigstats.SimplePigStats
> - ERROR 2997: Unable to recreate exception from backed error:
> org.apache.pig.backend.executionengine.ExecException: ERROR 0: Error executing function
>     at org.apache.pig.scripting.jython.JythonFunction.exec(JythonFunction.java:106)
>     at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:216)
>     at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:258)
>     at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.getNext(PhysicalOperator.java:316)
>     at org.apache.pig.backend.hadoop.executionengine.physicalLayer.
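
For what it's worth, one likely source of the "ERROR 0: Error executing function" above is a misplaced parenthesis in the original UDF: `len(aDateString==16)` evaluates the comparison first and then calls `len()` on a boolean, which raises a TypeError inside the Jython function. A minimal sketch of both that failure and the intended check (note the string is 21 characters, not 16, because the single quotes survive PigStorage's parsing):

```python
a_date_string = "'2012-03-04 10:10:10'"  # quotes are part of the chararray

try:
    len(a_date_string == 16)  # comparison runs first -> len(False)
except TypeError as err:
    print(err)  # object of type 'bool' has no len()

# What was intended -- and why the corrected UDF compares against 21:
print(len(a_date_string))        # 21
print(len(a_date_string) == 21)  # True
```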