Pig, mail # user - aggregate functions in python

aggregate functions in python
Björn-Elmar Macek 2012-10-05, 14:59

Hi there,

I am currently trying to implement a function in Python that can be used
for aggregation. I know that Java might be the better choice because of the
Algebraic interface and its benefits for MapReduce, but I would like to keep it
simple for the moment.

What I currently have is a data structure containing lines like the following:

(somebody, hadoop, (1,0,3,5,1,2))

The first column is named AUTHOR, the second is named TAG, and the third is a
histogram called HIST.
I now want to group those values by TAG. The result looks like this:

(hadoop, {(somebody, hadoop, (1,0,3,5,1,2)), ... ,
(somebodyCompletlyDifferent, hadoop, (2,0,3,5,6,3))})

I now want to create an aggregate function that takes a bag of
histograms and returns a final histogram containing the element-wise sum
of all dimensions. In our case:
(1,0,3,5,1,2) "+" (2,0,3,5,6,3) "=" (3,0,6,10,7,5)
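
The element-wise sum above can be sketched in plain Python, independent of Pig. This is only an illustration, assuming every histogram is a tuple of ints of equal length; the name sum_histograms is made up for this example.

```python
def sum_histograms(histograms):
    """Return the element-wise sum of equal-length integer tuples."""
    histograms = list(histograms)
    if not histograms:
        return None
    # zip(*...) groups the i-th entries of all histograms together.
    return tuple(sum(column) for column in zip(*histograms))


print(sum_histograms([(1, 0, 3, 5, 1, 2), (2, 0, 3, 5, 6, 3)]))
# prints (3, 0, 6, 10, 7, 5)
```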

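The code for this function looks like this:
def aggHisto(aHistogramSet):
         # Nothing to aggregate for a missing or empty bag.
         if not aHistogramSet:
             return None
         # Each bag entry is a tuple wrapping the histogram tuple, hence the [0].
         hist_len = len(aHistogramSet[0][0])
         # Running totals, one per histogram bin.
         result = [0] * hist_len

         for aHistogram in aHistogramSet:
             # range(hist_len) covers every bin, including the last one.
             for i in range(hist_len):
                 value = int(aHistogram[0][i])  # this line raises the TypeError
                 result[i] = result[i] + value

         return tuple(result)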

My problem is that the computation fails with an error saying:
value = int(aHistogram[0][i])
TypeError: int() argument must be a string or number

The strange thing is: when this function simply returns the first value it
sees without trying to cast it to an int, it looks like an int in the
result. BUT if I omit the "cast", I get an error message saying that
"+ is not defined for int and array.array"
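
One way to narrow this down would be to look at the runtime type of each histogram entry before converting it. The helper below is a hypothetical debugging aid written for this example (describe_types is not part of Pig or Jython), and it only reports types; it does not claim to identify the cause.

```python
def describe_types(histogram):
    """Return the type name of every entry in one histogram."""
    return [type(value).__name__ for value in histogram]


# Mixed sample data, purely for illustration.
print(describe_types((1, "2", 3.0)))
# prints ['int', 'str', 'float']
```

If the entries turn out not to be ints or strings, that would be consistent with both error messages.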

It already took some time to realize that the bag does NOT contain the
tuples representing the histogram, but a tuple containing the
histo-tuple. That's also why I had to add "[0]" to "aHistogram[i]".
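
To make that nesting concrete, here is a small stand-alone sketch that models the bag as a list of one-field tuples, each wrapping a histogram tuple. The sample data and the name unwrap_histograms are assumptions for illustration; this is plain Python, not necessarily what Pig passes to the UDF.

```python
def unwrap_histograms(bag):
    """Yield the inner histogram from each single-field tuple in the bag."""
    for row in bag:
        # Each row is a 1-tuple of the form (histogram,), so row[0] is the histogram.
        yield row[0]


# Assumed shape of the bag: a list of tuples, each holding one histogram tuple.
bag = [((1, 0, 3, 5, 1, 2),), ((2, 0, 3, 5, 6, 3),)]
inner = list(unwrap_histograms(bag))
print(inner[0][2])
# prints 3
```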

Did I overlook an important point?

Best regards,