Python UDF got problems converting Strings to Integers
Björn-Elmar Macek 2012-10-30, 17:22
Hi all,
I have a UDF that sums up histograms given as tuples. The function I wrote looks like this:
    @outputSchema("res_histo:tuple()")
    def aggHisto(aHistogramSet):
        if aHistogramSet is None: return None;
        hist_len = len(aHistogramSet[0])
        result=[0]*hist_len

        for aHistogram in aHistogramSet:
            for i in range(0,hist_len):
                value = int(''.join(map(str,aHistogram[i])));
                result[i] = result[i] + (value)
        return tuple(result)
So for the input {(1,23,45),(0,0,0)} I SHOULD get the output (1,23,45), but instead I get (49,5051,52,5353).
I played around with this for some time and found that the line "value = int(''.join(map(str,aHistogram[i])));" does not convert the "23" to 23. Instead, it takes every single digit, starting with the most significant one, and adds 48 to it: 2+48=50 and 3+48=51, resulting in 5051.
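The arithmetic described above can be reproduced outside Pig. 48 is the ASCII code of the character '0', so the numbers match what happens when a field holds the character codes of its digits rather than the digits themselves. The sketch below assumes (this is not confirmed in the thread) that the field reaches the UDF as an array of byte values; `field` is a stand-in built by hand.

```python
from array import array

# Assumption: the field arrives as an array of byte values, one per character.
field = array('b', b'23')  # holds the ASCII codes of '2' and '3'

# map(str, ...) turns each byte value into its decimal text,
# so the join concatenates the codes instead of the digits.
joined = ''.join(map(str, field))
value = int(joined)

print(list(field), joined, value)
```

Under that assumption, int() never sees the text "23"; it sees the two character codes written one after the other.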
Why does this happen? Can anybody help me here?
Best regards, Elmar
Re: Python UDF got problems converting Strings to Integers
Cheolsoo Park 2012-10-31, 04:59
Hi,
First of all, why can't you pass a tuple of integers to your UDF in the first place? Then you don't have to cast strings to integers inside your UDF.
Here is how I got your UDF working.
    cheolsoo@localhost:~/workspace/pig-trunk $ cat 1.txt
    1,2,3
    4,5,6

    cheolsoo@localhost:~/workspace/pig-trunk $ cat test.pig
    register 'test.py' using jython as myfuncs;
    a = load '1.txt' using PigStorage(',') as (i:int, j:int, k:int); -- declare as integers
    b = group a all;
    c = foreach b generate myfuncs.aggHisto(a);
    dump c;

    @outputSchema("res_histo:tuple()")
    def aggHisto(aHistogramSet):
        if aHistogramSet is None:
            return None

        hist_len = len(aHistogramSet[0])
        result = [0] * hist_len
        print(aHistogramSet)

        for aHistogram in aHistogramSet:
            for i in range(0, hist_len):
                result[i] = result[i] + aHistogram[i]  # vector addition
        return tuple(result)
I get the following result: ((5,7,9))
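For reference, the same vector addition can be written more compactly once the fields really are integers. This is only a plain-Python sketch without the Pig decorator; `sum_histograms` is a made-up name, and whether a Pig bag can be unpacked with `zip(*...)` under Jython has not been checked here.

```python
def sum_histograms(histograms):
    """Element-wise sum of equally long integer tuples."""
    if not histograms:
        return None
    # zip(*histograms) groups the i-th entries of all tuples together.
    return tuple(sum(column) for column in zip(*histograms))


print(sum_histograms([(1, 2, 3), (4, 5, 6)]))  # (5, 7, 9)
```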
Thanks, Cheolsoo
Re: Python UDF got problems converting Strings to Integers
Björn-Elmar Macek 2012-10-31, 09:36
Hi Cheolsoo,
This is because I have a 24-dimensional tuple, and the schema definition alone is a pain. It makes my code unreadable and harder to interpret or fix: imagine how many errors you can make there.
I would prefer to solve this issue within Python, so that my Pig calls do not get too complicated and possibly messy.
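If typing out 24 field declarations by hand is the main obstacle, one option is to generate the schema text instead. A minimal sketch, assuming all 24 fields are integers; the field names h0 to h23 are invented for illustration, and the file name is taken from the example earlier in the thread.

```python
FIELD_COUNT = 24  # dimensionality of the histogram tuple

# Build "h0:int, h1:int, ..., h23:int" instead of writing it out manually.
schema = ', '.join('h%d:int' % i for i in range(FIELD_COUNT))
load_stmt = "a = load '1.txt' using PigStorage(',') as (%s);" % schema

print(load_stmt)
```

The printed statement could then be pasted into the Pig script, which keeps the casting out of the UDF.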
Thanks, Björn-Elmar
Re: Python UDF got problems converting Strings to Integers
Björn-Elmar Macek 2012-10-31, 10:49
OK, I got it solved after realizing what happens internally. The solution looks like this:

    @outputSchema("res_histo:tuple()")
    def aggHisto(aHistogramSet):
        if aHistogramSet is None:
            return None
        hist_len = len(aHistogramSet[0])
        result = [0] * hist_len

        for aHistogram in aHistogramSet:
            for i in range(0, hist_len):
                value = aHistogram[i]
                val_len = len(value)
                tmp_conv = ''
                # Rebuild the decimal text digit by digit:
                # each entry is the digit increased by 48.
                for j in range(0, val_len):
                    tmp_conv = tmp_conv + str(int(value[j]) - 48)
                value2 = int(tmp_conv)
                result[i] = result[i] + value2

        return tuple(result)
It is important to know that aHistogram[i] is of type array. If it is left untouched and returned by the function, it properly displays the value of the histogram tuple at position i. Any direct conversion to int or string does not work the way it is supposed to. If you access the individual positions (value[j]), you get the j-th most significant digit of the integer, but increased by 48. The code above restores the information encoded in this array. It is not a clean solution and looks more like a hack, but at least it does the trick.
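Since 48 is the ASCII code of '0', the array entries look like character codes, and the decoding step can be sketched a little more directly with chr(). This is a plain-Python sketch under that assumption, not a tested Pig UDF; `decode_int`, `sum_encoded_histograms` and `encode` are hypothetical names, and `encode` exists only to build sample input.

```python
from array import array


def decode_int(field):
    """Turn an array of ASCII codes such as [50, 51] into the int 23."""
    text = ''.join(chr(code) for code in field)
    return int(text)


def sum_encoded_histograms(histograms):
    """Element-wise sum of tuples whose fields are ASCII-code arrays."""
    if not histograms:
        return None
    hist_len = len(histograms[0])
    result = [0] * hist_len
    for histogram in histograms:
        for i in range(hist_len):
            result[i] += decode_int(histogram[i])
    return tuple(result)


def encode(number):
    """Test helper: mimic the assumed encoding of a field."""
    return array('b', str(number).encode('ascii'))


bag = [tuple(encode(n) for n in (1, 23, 45)),
       tuple(encode(n) for n in (0, 0, 0))]
print(sum_encoded_histograms(bag))  # (1, 23, 45)
```

Going through chr() instead of subtracting 48 also copes with a leading minus sign, because the character '-' is decoded as text before int() is applied.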
Thanks, Björn-Elmar