Pig >> mail # user >> Python UDF got problems converting Strings to Integers


Re: Python UDF got problems converting Strings to Integers
Hi Cheolsoo,

this is because I have a 24-dimensional tuple, and the definition alone
is a pain. It makes my code unreadable and harder to interpret or fix:
imagine how many errors you can make there.

I would prefer to solve this issue within Python, so that my Pig calls do
not get too complicated and possibly messy.
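
To make the "solve it within Python" idea concrete, here is a minimal sketch of a conversion helper plus the summing loop. It is plain Python meant to be tried outside Pig, so the @outputSchema decorator is left out. The names to_int and agg_histo are hypothetical, and the set of input types handled (ints, text, raw bytes, or a sequence of byte values) is an assumption about what a field might look like, not a statement about what Pig actually passes to a Jython UDF.

```python
def to_int(field):
    """Best-effort conversion of one histogram field to an int (hypothetical helper)."""
    if isinstance(field, int):
        return field
    if isinstance(field, (bytes, bytearray)):
        # Raw ASCII bytes: decode to text before parsing.
        return int(bytes(field).decode("ascii"))
    if isinstance(field, str):
        return int(field)
    # Assumption: anything else is a sequence of byte values,
    # e.g. [50, 51] for the text "23". Mask to handle signed bytes.
    return int("".join(chr(b & 0xFF) for b in field))


def agg_histo(histogram_set):
    """Element-wise sum of equally long histograms; None for empty input."""
    if not histogram_set:
        return None
    hist_len = len(histogram_set[0])
    result = [0] * hist_len
    for histogram in histogram_set:
        for i in range(hist_len):
            result[i] += to_int(histogram[i])
    return tuple(result)


print(agg_histo([(b"1", b"23", b"45"), (b"0", b"0", b"0")]))
```

With a helper like this, the Pig script would not need to declare all 24 fields as int, at the cost of doing the type handling by hand in the UDF.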

Thanks,
Björn-Elmar
On 31.10.12 05:59, Cheolsoo Park wrote:
> Hi,
>
> First of all, why can't you pass a tuple of integers to your udf in the
> first place? Because then you don't have to cast strings to integers inside
> your udf.
>
> Here is how I got your udf working.
>
> cheolsoo@localhost:~/workspace/pig-trunk $cat 1.txt
> 1,2,3
> 4,5,6
>
> cheolsoo@localhost:~/workspace/pig-trunk $cat test.pig
> register 'test.py' using jython as myfuncs;
> a = load '1.txt' using PigStorage(',') as (i:int, j:int, k:int); -- declare as integers
> b = group a all;
> c = foreach b generate myfuncs.aggHisto(a);
> dump c;
>
> @outputSchema("res_histo:tuple()")
> def aggHisto(aHistogramSet):
>      if aHistogramSet is None:
>          return None
>
>      hist_len = len(aHistogramSet[0])
>      result = [0] * hist_len
>      print(aHistogramSet)
>
>      for aHistogram in aHistogramSet:
>          for i in range(0, hist_len):
>              result[i] = result[i] + aHistogram[i]  # vector addition
>      return tuple(result)
>
> I get the following result:
> ((5,7,9))
>
> Thanks,
> Cheolsoo
>
> On Tue, Oct 30, 2012 at 10:22 AM, Björn-Elmar Macek <[EMAIL PROTECTED]> wrote:
>
>> Hi all,
>>
>> I have a UDF that sums up histograms given in the form of tuples. The
>> function I wrote looks like this:
>>
>> @outputSchema("res_histo:tuple()")
>> def aggHisto(aHistogramSet):
>>     if aHistogramSet is None: return None;
>>     hist_len = len(aHistogramSet[0])
>>     result=[0]*hist_len
>>
>>     for aHistogram in aHistogramSet:
>>         for i in range(0,hist_len):
>>             value = int(''.join(map(str,aHistogram[i])));
>>             result[i] = result[i] + (value)
>>     return tuple(result)
>>
>> So for the input {(1,23,45),(0,0,0)} I should get the output (1,23,45),
>> but instead I get: (49,5051,52,5353)
>> I experimented with this for some time and found that the line
>> "value = int(''.join(map(str,aHistogram[i])));" does not convert
>> the "23" to 23. Instead, it takes every single digit, starting with the
>> most significant one, and adds 48 to it: 2+48=50 and 3+48=51, resulting
>> in 5051.
>>
>> Why does this happen? Can anybody help me here?
>>
>> Best regards,
>> Elmar
>>
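
The "+48" pattern in the original question matches the ASCII codes of the digits: "0" is 48, so "2" is 50 and "3" is 51. One plausible reading, offered as a guess and not as a confirmed diagnosis of Pig's Jython bridge, is that the field reaches the UDF as raw bytes instead of text, so map(str, ...) stringifies each byte value. The sketch below reproduces that effect in plain Python, with bytearray(b"23") as an assumed stand-in for such a field.

```python
# Assumed stand-in for how an untyped field might reach the UDF.
raw = bytearray(b"23")

# Iterating over raw bytes yields integer byte values, not characters.
byte_values = list(raw)

# str() turns each byte value into "50" and "51"; joined, they read "5051".
observed = int("".join(map(str, byte_values)))

# Decoding the bytes to text first gives the intended number.
intended = int(raw.decode("ascii"))

print(byte_values, observed, intended)
```

If this reading is right, either declaring the fields as int in the load statement or decoding the bytes before calling int() would avoid the problem.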