Re: Python UDF got problems converting Strings to Integers
Hi Cheolsoo,

This is because I have a 24-dimensional tuple, and the schema definition alone
is a pain. It makes my code unreadable and harder to interpret or fix:
imagine how many errors you can make there.

I would prefer to solve this within Python, so my Pig calls do not
get too complicated and possibly messy.
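
A minimal sketch of what such an in-Python conversion might look like. It rests on an assumption inferred from the symptom in the quoted message below: 48 is the ASCII code of '0', so the UDF seems to receive each field as a sequence of byte values (50, 51 for "23") rather than as a string or an integer. Whether Pig's Jython bridge really hands the field over this way has not been verified here. The names to_int and agg_histo are hypothetical, and the @outputSchema decorator is left out so the sketch runs as plain Python.

```python
def to_int(field):
    """Convert one histogram field to an int.

    Assumption: a field loaded without a schema may arrive as a sequence
    of byte values (e.g. [50, 51] for "23") instead of a str or an int.
    """
    if isinstance(field, int):
        return field
    if isinstance(field, str):
        return int(field)
    # Treat anything else as byte values; the mask guards against signed bytes.
    return int(''.join(chr(b & 0xFF) for b in field))


def agg_histo(histogram_set):
    """Element-wise sum of a collection of equally long histograms."""
    if not histogram_set:
        return None
    hist_len = len(histogram_set[0])
    result = [0] * hist_len
    for histogram in histogram_set:
        for i in range(hist_len):
            result[i] += to_int(histogram[i])
    return tuple(result)


# Reproducing the symptom: str() of each byte value yields its ASCII code.
raw = bytearray(b"23")
buggy = int(''.join(map(str, raw)))  # "50" + "51" joined, then parsed
fixed = to_int(raw)                  # bytes decoded as characters first
```

With this helper the Pig script would not need a 24-field typed schema, at the cost of doing the type handling in the UDF.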

Thanks,
Björn-Elmar
On 31.10.12 05:59, Cheolsoo Park wrote:
> Hi,
>
> First of all, why can't you pass a tuple of integers to your UDF in the
> first place? Then you don't have to cast strings to integers inside
> your UDF.
>
> Here is how I got your udf working.
>
> cheolsoo@localhost:~/workspace/pig-trunk $cat 1.txt
> 1,2,3
> 4,5,6
>
> cheolsoo@localhost:~/workspace/pig-trunk $cat test.pig
> register 'test.py' using jython as myfuncs;
> a = load '1.txt' using PigStorage(',') as (i:int, j:int, k:int); -- declare as integers
> b = group a all;
> c = foreach b generate myfuncs.aggHisto(a);
> dump c;
>
> @outputSchema("res_histo:tuple()")
> def aggHisto(aHistogramSet):
>      if aHistogramSet is None:
>          return None
>
>      hist_len = len(aHistogramSet[0])
>      result = [0] * hist_len
>      print(aHistogramSet)
>
>      for aHistogram in aHistogramSet:
>          for i in range(0, hist_len):
>              result[i] = result[i] + aHistogram[i]  # vector addition
>      return tuple(result)
>
> I get the following result:
> ((5,7,9))
>
> Thanks,
> Cheolsoo
>
> On Tue, Oct 30, 2012 at 10:22 AM, Björn-Elmar Macek <[EMAIL PROTECTED]> wrote:
>
>> Hi all,
>>
>> I have a UDF that sums up histograms in the form of tuples. The function I
>> wrote looks like this:
>>
>> @outputSchema("res_histo:tuple()")
>> def aggHisto(aHistogramSet):
>>     if aHistogramSet is None:
>>         return None
>>     hist_len = len(aHistogramSet[0])
>>     result = [0] * hist_len
>>
>>     for aHistogram in aHistogramSet:
>>         for i in range(0, hist_len):
>>             value = int(''.join(map(str, aHistogram[i])))
>>             result[i] = result[i] + value
>>     return tuple(result)
>>
>> So for the input {(1,23,45),(0,0,0)} I SHOULD get the output (1,23,45),
>> but instead I get (49,5051,52,5353).
>> I played around with this for some time and found out that the program does
>> the following:
>> The line "value = int(''.join(map(str, aHistogram[i])))" does not
>> convert the "23" to 23. Instead,
>> it takes every single digit, starting with the most significant one, and
>> adds 48 to it: 2+48=50 and 3+48=51, resulting in 5051.
>>
>> Why does this happen? Can anybody help me here?
>>
>> Best regards,
>> Elmar
>>