aggregate functions in python
Björn-Elmar Macek 2012-10-05, 14:59
Hi there,

I am currently trying to implement a function in Python that can be used for aggregation. I know that Java might be the better choice because of the Algebraic interface and its benefits for MapReduce, but I would like to keep it simple for the moment.

What I currently have is a data structure containing lines like the following:

(somebody, hadoop, (1,0,3,5,1,2))

The first column is named AUTHOR, the second is named TAG, and the third is a histogram called HIST. I now want to group those values by TAG. The result looks like this:

(hadoop, {(somebody, hadoop, (1,0,3,5,1,2)), ... , (somebodyCompletlyDifferent, hadoop, (2,0,3,5,6,3))})

I now want to create an aggregate function that takes a bag of histograms and returns a final histogram containing the pairwise (element-wise) sum of all dimensions. In our case:

(1,0,3,5,1,2) "+" (2,0,3,5,6,3) "=" (3,0,6,10,7,5)

The code for this function looks like this:

###########
@outputSchema("t:tuple()")
def aggHisto(aHistogramSet):
    if aHistogramSet is None:
        return None;
    hist_len = len(aHistogramSet[0][0])
    result=[0]*hist_len
    for aHistogram in aHistogramSet:
        for i in range(0,hist_len-1):
            value = int(aHistogram[0][i])
            result[i] = result[i] + value
    return tuple(result)
#############

My problem is that the computation fails with an error saying:

value = int(aHistogram[0][i])
TypeError: int() argument must be a string or number

The strange thing is: when this function simply returns the first value it sees without trying to cast it to an int, it looks like an int in the result. BUT if I omit the "cast", I get an error message saying that "+ is not defined for int and array.array".

It already took me some time to realize that the bag does NOT contain the tuples representing the histogram, but a tuple containing the histo-tuple. That's also why I had to add "[0]" to "aHistogram[i]".

Did I overlook an important point?

Best regards,
Elmar
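For reference, the intended aggregation can be sketched in plain Python, outside of Pig. This is only a sketch under stated assumptions: the name agg_histo is hypothetical, the bag is modelled as a list, and each bag element is assumed to be a one-field tuple wrapping the histogram tuple, as described in the post. It does not reproduce or explain the TypeError, which depends on the types Pig actually hands to the UDF. One detail it does show: range(0, hist_len-1) in the posted loop stops one bin short, whereas range(hist_len) covers every bin.

```python
def agg_histo(histogram_bag):
    """Sum histograms bin by bin.

    Assumption: each bag element is a one-field tuple wrapping the
    histogram tuple, e.g. ((1, 0, 3, 5, 1, 2),).
    """
    if not histogram_bag:
        return None
    # All histograms are assumed to have the same number of bins.
    hist_len = len(histogram_bag[0][0])
    result = [0] * hist_len
    for wrapped in histogram_bag:
        histogram = wrapped[0]  # unwrap the inner histogram tuple
        for i in range(hist_len):  # includes the last bin
            result[i] += int(histogram[i])
    return tuple(result)


# The two histograms from the example in the post.
bag = [((1, 0, 3, 5, 1, 2),), ((2, 0, 3, 5, 6, 3),)]
print(agg_histo(bag))  # (3, 0, 6, 10, 7, 5)
```

Inside Pig, the same function would still need the @outputSchema decorator from the original code, and the int() conversion only works if the bin values arrive as numbers or strings.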