Home | About | Sematext search-lucene.com search-hadoop.com
NEW: Monitor These Apps!
elasticsearch, apache solr, apache hbase, hadoop, redis, casssandra, amazon cloudwatch, mysql, memcached, apache kafka, apache zookeeper, apache storm, ubuntu, centOS, red hat, debian, puppet labs, java, senseiDB
 Search Hadoop and all its subprojects:

Switch to Plain View
Pig >> mail # user >> Jython UDFs, Tuples and Stringconversions


+
Björn-Elmar Macek 2012-10-01, 14:42
+
Cheolsoo Park 2012-10-01, 21:11
Copy link to this message
-
Re: Jython UDFs, Tuples and Stringconversions
Hi Cheilsoo,

ahh thank you for the modifications: the output is what i expect it to
be. I will have to look up the Arrayconstruct [1:-1]  tho. I could solve
the issue by adding a complete schema just as you did with

times:{(chararray)}

.

Thank you alot for your time and insight!
Bj�rn
Am 01.10.2012 23:11, schrieb Cheolsoo Park:
> Hi,
>
> Please try this:
>
> 1. I used a tab-separated input file as follows:
>
> cheolsoo@localhost:~/workspace/pig-svn $cat tag_count_ts_pro_userpair
> ('a','b','c','d') 3 {('2012-03-04 10:10:10'),('2013-03-04 10:10:11')}
>
> 2. My udf is as follows:
>
> import datetime
>
> @outputSchema("days_from_start:bag{t:tuple(cnt:int)}")
> def daysFromStart(startDate, aBagOfDates):
>          if aBagOfDates is None: return None
>          result=[]
>          for someDate in aBagOfDates:
>              if someDate is None: continue
>              someDate = ''.join(someDate)
>              if len(someDate)==21: result.append(diffTime(startDate,
> someDate))
>          return result
>
> @outputSchema("diff:int")
> def diffTime(dateFrom, dateTil):
>      dateSmall = datetime.datetime.strptime(dateFrom, "%Y-%m-%d %H:%M:%S")
>      dateBig = datetime.datetime.strptime(dateTil[1:-1], "%Y-%m-%d %H:%M:%S")
>      delta = dateBig - dateSmall
>      return delta.days
>
> 3. My pig script is as follows:
>
> register 'udf.py' using jython as moins;
>
> x = load 'tag_count_ts_pro_userpair' using PigStorage('\t') as (group:(),
> cnt:int, times:{(chararray)});
> y = foreach x generate *, moins.daysFromStart('2011-06-01 00:00:00', times);
> dump y;
>
> This returns:
>
> (('a','b','c','d'),3,{('2012-03-04 10:10:10'),('2013-03-04
> 10:10:11')},{(277),(642)})
>
> Thanks,
> Cheolsoo
>
> On Mon, Oct 1, 2012 at 7:42 AM, Bj�rn-Elmar Macek <[EMAIL PROTECTED]>wrote:
>
>> Hi,
>>
>> i am currently writing a PIG script that works with a bags of timestamp
>> tuples. So i am basically working on a datastructure like this:
>> (tuple(chararray)), int, bag{tuple(chararray)})
>>
>> for example:
>> ( ('a','b','c','d'), 3, {('2012-03-04 10:10:10'), ('2012-03-04 10:10:11')}
>> )
>>
>> When loading the data i add a schema, so pig knows what data is coming in:
>> x = load 'tag_count_ts_pro_userpair' as (group:tuple(),cnt:int,times:**
>> bag{});
>>
>> I then want to change the content of the times-bag, by replacing every
>> timestamp with an integer, based on the time distance to a certain date,
>> which i do with the follwing UDFs:
>> ###### myUDF.py ##############
>> from org.apache.pig.scripting import *
>> import datetime
>> import math
>>
>>
>> @outputSchema("days_from_**start:bag{t:tuple(cnt:int)}")
>> def daysFromStart(startDate, aBagOfDates):
>>          if aBagOfDates is None: return None;
>>          result=[]
>>          for somedate in aBagOfDates:
>>              if somedate is None: continue
>>              aDateString = ''.join(somedate)
>>              #ALTERNATIVELY I USED ALSO: aDateString = ''.join(somedate[0])
>> // aDateString = ''.join(somedate[1])
>>              if len(aDateString==16): result.append(diffTime(**startDate,
>> aDateString))
>>          return result
>>
>>
>> @outputSchema("diff:int")
>> def diffTime(dateFrom,dateTil):
>>      dateSmall = datetime.datetime.strptime(**dateFrom,"%Y-%m-%d
>> %H:%M:%S");
>>      dateBig = datetime.datetime.strptime(**dateTil,"%Y-%m-%d %H:%M:%S");
>>      delta = dateBig-dateSmall
>>      return delta.days
>>
>> ##########################
>>
>> I do this by executing the following command in the grunt:
>> y = foreach x generate *, moins.daysFromStart('2011-06-**01 00:00:00',
>> times);
>>
>> But when i try to store y, i get the following error message:
>>
>> ######## LOG #############
>> 2012-10-01 16:35:03,499 [main] ERROR org.apache.pig.tools.pigstats.**SimplePigStats
>> - ERROR 2997: Unable to recreate exception from backed error:
>> org.apache.pig.backend.**executionengine.ExecException: ERROR 0: Error
>> executing function
>>      at org.apache.pig.scripting.**jython.JythonFunction.exec(**
NEW: Monitor These Apps!
elasticsearch, apache solr, apache hbase, hadoop, redis, casssandra, amazon cloudwatch, mysql, memcached, apache kafka, apache zookeeper, apache storm, ubuntu, centOS, red hat, debian, puppet labs, java, senseiDB