Pig >> mail # user >> aggregate functions in python


aggregate functions in python

Hi there,

I am currently trying to implement a function in Python that can be used
for aggregation. I know that Java might be the better choice because of the
Algebraic interface and its benefits for MR, but I would like to keep it
simple for the moment.

What I currently have is a data structure containing lines like the following:

(somebody, hadoop, (1,0,3,5,1,2))

The first column is named AUTHOR, the second is named TAG and the third is a
histogram called HIST.
I now want to group those values by TAG. The result looks like this:

(hadoop, {(somebody, hadoop, (1,0,3,5,1,2)), ... ,
(somebodyCompletlyDifferent, hadoop, (2,0,3,5,6,3))})

I now want to create an aggregate function that takes a bag of
histograms and returns a final histogram containing the element-wise sum
over all dimensions. In our case:
(1,0,3,5,1,2) "+" (2,0,3,5,6,3) "=" (3,0,6,10,7,5)
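
As a point of reference, here is a minimal sketch of the intended element-wise sum in plain Python, outside of Pig. It assumes every histogram is an ordinary tuple of ints of the same length; the name sum_histograms is hypothetical and only used for illustration.

```python
def sum_histograms(histograms):
    """Return the element-wise sum of equally long histograms as a tuple."""
    histograms = list(histograms)
    if not histograms:
        # Nothing to aggregate.
        return None
    # zip(*histograms) groups the i-th bins of all histograms together,
    # so each "column" can be summed on its own.
    return tuple(sum(column) for column in zip(*histograms))


print(sum_histograms([(1, 0, 3, 5, 1, 2), (2, 0, 3, 5, 6, 3)]))
# prints (3, 0, 6, 10, 7, 5)
```

Using zip avoids index arithmetic altogether, so no bin can be skipped by an off-by-one range.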

The code for this function looks like this:
###########
@outputSchema("t:tuple()")
def aggHisto(aHistogramSet):
    # Nothing to aggregate.
    if aHistogramSet is None:
        return None
    # Each bag element is a tuple wrapping one histogram tuple.
    hist_len = len(aHistogramSet[0][0])
    result = [0] * hist_len

    for aHistogram in aHistogramSet:
        # Visit every bin of the histogram.
        for i in range(hist_len):
            value = int(aHistogram[0][i])  # this line raises the TypeError
            result[i] = result[i] + value

    return tuple(result)
#############

My problem is that the computation fails with an error saying:
value = int(aHistogram[0][i])
TypeError: int() argument must be a string or number

The strange thing is: when this function simply returns the first value it
sees without trying to cast it to an int, it looks like an int in the
result. BUT if I omit the "cast", I get an error message saying that
"+ is not defined for int and array.array"
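
The array.array in that message suggests that the individual values may not be arriving as Python ints. One possible explanation, which is an assumption and not confirmed anywhere in this thread, is that each value is handed over as an array.array holding the ASCII digits of the number. Under that assumption, a defensive conversion could be sketched as follows. The helper name to_int is hypothetical, and the sketch is written for standalone Python 3; older Python 2 style interpreters spell tobytes() as tostring().

```python
from array import array


def to_int(value):
    """Convert a value to int, decoding array.array input first.

    Assumption: an array.array value holds the ASCII digits of a number.
    """
    if isinstance(value, array):
        # Turn the raw bytes into text before parsing.
        value = value.tobytes().decode("ascii")
    return int(value)


print(to_int(array("b", b"5")))  # array holding the ASCII digit "5"
print(to_int("7"))               # plain string
print(to_int(3))                 # already an int
```

Printing type(aHistogram[0][i]) inside the function would show which kind of value actually arrives, before settling on any conversion.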

It already took me some time to realize that the bag does NOT contain the
tuples representing the histograms directly, but tuples that each contain a
histo-tuple. That's also why I had to add "[0]" to "aHistogram[i]".
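
The nesting described above can be illustrated with plain Python data. This is only a stand-in for what the function appears to receive, built from ordinary lists and tuples instead of Pig objects; the name unwrap is hypothetical.

```python
# Stand-in for the bag: a list of 1-tuples, each wrapping one histogram tuple.
bag = [((1, 0, 3, 5, 1, 2),), ((2, 0, 3, 5, 6, 3),)]


def unwrap(wrapped_bag):
    """Strip the outer 1-tuple from every bag element."""
    return [wrapped[0] for wrapped in wrapped_bag]


first = bag[0]
print(len(first))   # the outer tuple holds a single field
print(first[0])     # the histogram itself
print(unwrap(bag))  # list of bare histogram tuples
```

With the outer layer removed once up front, the inner loop can index the histogram directly instead of repeating "[0]" on every access.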

Did I overlook an important point?

Best regards,
Elmar