Pig >> mail # user >> Python UDF got problems converting Strings to Integers


Björn-Elmar Macek 2012-10-30, 17:22
Cheolsoo Park 2012-10-31, 04:59
Björn-Elmar Macek 2012-10-31, 09:36
Re: Python UDF got problems converting Strings to Integers
OK, I got it solved after realizing what happens internally. The
solution looks like this:
@outputSchema("res_histo:tuple()")
def aggHisto(aHistogramSet):
    if aHistogramSet is None:
        return None
    hist_len = len(aHistogramSet[0])
    result = [0] * hist_len

    for aHistogram in aHistogramSet:
        for i in range(hist_len):
            # Each field arrives as an array holding one code per digit.
            value = aHistogram[i]
            tmp_conv = ''
            for j in range(len(value)):
                # Subtracting 48 (the ASCII code of '0') recovers the digit.
                tmp_conv = tmp_conv + str(int(value[j]) - 48)
            result[i] = result[i] + int(tmp_conv)

    return tuple(result)

It is important to know that aHistogram[i] is of type array. If left
untouched and returned by the function, it correctly displays the value
of the histogram tuple at position i. However, any direct conversion to
int or string does not work as expected. If you access the individual
positions (value[j]), you get the j-th digit of the integer, counted
from the most significant one, but increased by 48. The code above
restores the information encoded in this array. It is not a clean
solution and looks more like a hack, but at least it does the trick.
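
The offset of 48 is the ASCII code of the character '0', so each array element appears to be the character code of one digit. Below is a minimal sketch of the same decoding in plain Python, assuming each field arrives as a sequence of such codes. The names decode_field and agg_histo and the sample lists are made up for illustration and are not part of Pig's API.

```python
def decode_field(codes):
    """Turn a sequence of ASCII codes, e.g. [50, 51], into the int 23."""
    # chr() maps each code back to its character ('2', '3'); joining the
    # characters gives the decimal string that int() can parse.
    return int(''.join(chr(c) for c in codes))


def agg_histo(histograms):
    """Element-wise sum of histograms whose fields are ASCII-code sequences."""
    result = [0] * len(histograms[0])
    for histogram in histograms:
        for i, field in enumerate(histogram):
            result[i] += decode_field(field)
    return tuple(result)


# Plain lists stand in for the arrays seen inside the UDF (an assumption).
sample = [([49], [50, 51], [52, 53]), ([48], [48], [48])]
print(agg_histo(sample))  # (1, 23, 45)
```

Unlike subtracting 48 from every element, decoding through chr() would also cope with a leading minus sign (code 45), should negative values ever occur.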

Thanks,
Björn-Elmar

On 31.10.12 10:36, Björn-Elmar Macek wrote:
> Hi Cheolsoo,
>
> this is because I have a 24-dimensional tuple, and the definition alone
> is a pain. It makes my code unreadable and harder to interpret or fix:
> imagine how many errors you can make there.
>
> I would prefer to solve this issue within Python, so that my Pig calls
> do not get too complicated and possibly messy.
>
> Thanks,
> Björn-Elmar
>
>
> On 31.10.12 05:59, Cheolsoo Park wrote:
>> Hi,
>>
>> First of all, why can't you pass a tuple of integers to your UDF in the
>> first place? Then you would not have to cast strings to integers inside
>> your UDF.
>>
>> Here is how I got your udf working.
>>
>> cheolsoo@localhost:~/workspace/pig-trunk $cat 1.txt
>> 1,2,3
>> 4,5,6
>>
>> cheolsoo@localhost:~/workspace/pig-trunk $cat test.pig
>> register 'test.py' using jython as myfuncs;
>> -- declare the fields as integers
>> a = load '1.txt' using PigStorage(',') as (i:int, j:int, k:int);
>> b = group a all;
>> c = foreach b generate myfuncs.aggHisto(a);
>> dump c;
>>
>> @outputSchema("res_histo:tuple()")
>> def aggHisto(aHistogramSet):
>>     if aHistogramSet is None:
>>         return None
>>
>>     hist_len = len(aHistogramSet[0])
>>     result = [0] * hist_len
>>     print(aHistogramSet)  # debug output
>>
>>     for aHistogram in aHistogramSet:
>>         for i in range(0, hist_len):
>>             result[i] = result[i] + aHistogram[i]  # vector addition
>>     return tuple(result)
>>
>> I get the following result:
>> ((5,7,9))
>>
>> Thanks,
>> Cheolsoo
>>
>> On Tue, Oct 30, 2012 at 10:22 AM, Björn-Elmar Macek
>> <[EMAIL PROTECTED]> wrote:
>>
>>> Hi all,
>>>
>>> I have a UDF that sums up histograms in the form of tuples. The function
>>> I wrote looks like this:
>>>
>>> @outputSchema("res_histo:tuple()")
>>> def aggHisto(aHistogramSet):
>>>     if aHistogramSet is None: return None;
>>>     hist_len = len(aHistogramSet[0])
>>>     result=[0]*hist_len
>>>
>>>     for aHistogram in aHistogramSet:
>>>         for i in range(0,hist_len):
>>>             value = int(''.join(map(str,aHistogram[i])));
>>>             result[i] = result[i] + (value)
>>>     return tuple(result)
>>>
>>> So for the input {(1,23,45),(0,0,0)} I SHOULD get the following
>>> output: (1,23,45)
>>> But instead I get: (49,5051,52,5353)
>>> I played around with this for some time and found out that the program
>>> does the following:
>>> The line "value = int(''.join(map(str,aHistogram[i])));" does not
>>> convert the "23" to 23. Instead, it takes every single digit, starting
>>> with the most significant one, and adds 48 to it: 2+48=50 and 3+48=51,
>>> resulting in 5051.
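
The effect described above can be reproduced outside Pig. Here is a small sketch, assuming the field "23" reaches the UDF as the character codes of its digits. The list below is a hypothetical stand-in for that array, not the actual Pig data type.

```python
# Character codes of the digits '2' and '3' (hypothetical stand-in for
# the array the UDF receives for the field "23").
field = [ord('2'), ord('3')]  # [50, 51]

# map(str, ...) stringifies each code rather than each digit, so the
# joined string is "5051" instead of "23".
joined = ''.join(map(str, field))
value = int(joined)
print(value)
```

Seen this way, the "+48" is not an addition performed by the program: 48 is simply the character code of '0', and the codes are being printed as numbers.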