Pig, mail # user - Jython UDFs, Tuples and Stringconversions


Re: Jython UDFs, Tuples and Stringconversions
Björn-Elmar Macek 2012-10-02, 08:13
Hi Cheolsoo,

Ah, thank you for the modifications: the output is what I expect it to
be. I will have to look up the array slicing construct [1:-1], though. I was
able to solve the issue by adding a complete schema, just as you did with

times:{(chararray)}

Thank you a lot for your time and insight!
Björn
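
For readers wondering about the [1:-1] construct mentioned above: it is an ordinary Python slice that drops the first and last character of a string. The sketch below is plain Python, not a Pig UDF, and it assumes that each bag element reaches the UDF as a one-field tuple whose string still carries the single quotes from the input file. The names FMT, diff_days and bag are hypothetical stand-ins chosen for illustration. Under that assumption the quoted timestamp is 21 characters long, which would explain both the len(...)==21 check and the slice in the quoted UDF.

```python
import datetime

FMT = "%Y-%m-%d %H:%M:%S"  # same format string as in the quoted UDF


def diff_days(date_from, quoted_date):
    """Whole days between date_from and a timestamp wrapped in single quotes."""
    start = datetime.datetime.strptime(date_from, FMT)
    # [1:-1] strips the leading and trailing quote character before parsing.
    end = datetime.datetime.strptime(quoted_date[1:-1], FMT)
    return (end - start).days


# Hypothetical stand-in for the bag handed to the UDF (assumption):
# a list of one-field tuples, quotes included.
bag = [("'2012-03-04 10:10:10'",), ("'2013-03-04 10:10:11'",)]

field = ''.join(bag[0])   # one-field tuple -> plain string
print(len(field))         # 21: 19 timestamp characters plus 2 quotes
print(field[1:-1])        # 2012-03-04 10:10:10

days = [diff_days('2011-06-01 00:00:00', ''.join(t)) for t in bag]
print(days)               # [277, 642]
```

The two values agree with the {(277),(642)} bag in the quoted dump output.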
On 01.10.2012 23:11, Cheolsoo Park wrote:
> Hi,
>
> Please try this:
>
> 1. I used a tab-separated input file as follows:
>
> cheolsoo@localhost:~/workspace/pig-svn $cat tag_count_ts_pro_userpair
> ('a','b','c','d') 3 {('2012-03-04 10:10:10'),('2013-03-04 10:10:11')}
>
> 2. My udf is as follows:
>
> import datetime
>
> @outputSchema("days_from_start:bag{t:tuple(cnt:int)}")
> def daysFromStart(startDate, aBagOfDates):
>          if aBagOfDates is None: return None
>          result=[]
>          for someDate in aBagOfDates:
>              if someDate is None: continue
>              someDate = ''.join(someDate)
>              if len(someDate)==21: result.append(diffTime(startDate, someDate))
>          return result
>
> @outputSchema("diff:int")
> def diffTime(dateFrom, dateTil):
>      dateSmall = datetime.datetime.strptime(dateFrom, "%Y-%m-%d %H:%M:%S")
>      dateBig = datetime.datetime.strptime(dateTil[1:-1], "%Y-%m-%d %H:%M:%S")
>      delta = dateBig - dateSmall
>      return delta.days
>
> 3. My pig script is as follows:
>
> register 'udf.py' using jython as moins;
>
> x = load 'tag_count_ts_pro_userpair' using PigStorage('\t') as (group:(),
> cnt:int, times:{(chararray)});
> y = foreach x generate *, moins.daysFromStart('2011-06-01 00:00:00', times);
> dump y;
>
> This returns:
>
> (('a','b','c','d'),3,{('2012-03-04 10:10:10'),('2013-03-04 10:10:11')},{(277),(642)})
>
> Thanks,
> Cheolsoo
>
> On Mon, Oct 1, 2012 at 7:42 AM, Björn-Elmar Macek <[EMAIL PROTECTED]> wrote:
>
>> Hi,
>>
>> I am currently writing a Pig script that works with bags of timestamp
>> tuples. So I am basically working on a data structure like this:
>> (tuple(chararray), int, bag{tuple(chararray)})
>>
>> for example:
>> ( ('a','b','c','d'), 3, {('2012-03-04 10:10:10'), ('2012-03-04 10:10:11')}
>> )
>>
>> When loading the data I add a schema, so Pig knows what data is coming in:
>> x = load 'tag_count_ts_pro_userpair' as (group:tuple(),cnt:int,times:bag{});
>>
>> I then want to change the content of the times bag by replacing every
>> timestamp with an integer based on the time distance to a certain date,
>> which I do with the following UDFs:
>> ###### myUDF.py ##############
>> from org.apache.pig.scripting import *
>> import datetime
>> import math
>>
>>
>> @outputSchema("days_from_start:bag{t:tuple(cnt:int)}")
>> def daysFromStart(startDate, aBagOfDates):
>>          if aBagOfDates is None: return None;
>>          result=[]
>>          for somedate in aBagOfDates:
>>              if somedate is None: continue
>>              aDateString = ''.join(somedate)
>>              # ALTERNATIVELY I ALSO USED: aDateString = ''.join(somedate[0]) // aDateString = ''.join(somedate[1])
>>              if len(aDateString==16): result.append(diffTime(startDate, aDateString))
>>          return result
>>
>>
>> @outputSchema("diff:int")
>> def diffTime(dateFrom,dateTil):
>>      dateSmall = datetime.datetime.strptime(dateFrom,"%Y-%m-%d %H:%M:%S");
>>      dateBig = datetime.datetime.strptime(dateTil,"%Y-%m-%d %H:%M:%S");
>>      delta = dateBig-dateSmall
>>      return delta.days
>>
>> ##########################
>>
>> I do this by executing the following command in the Grunt shell:
>> y = foreach x generate *, moins.daysFromStart('2011-06-01 00:00:00', times);
>>
>> But when I try to store y, I get the following error message:
>>
>> ######## LOG #############
>> 2012-10-01 16:35:03,499 [main] ERROR org.apache.pig.tools.pigstats.SimplePigStats
>> - ERROR 2997: Unable to recreate exception from backed error:
>> org.apache.pig.backend.executionengine.ExecException: ERROR 0: Error
>> executing function
>>      at org.apache.pig.scripting.jython.JythonFunction.exec(