Using average function is really slow
James Newhaven 2012-07-04, 17:37
I am using the built-in org.apache.pig.builtin.AVG function. I have a set
of 100,000 items that I want to average.
The relevant Pig Latin is below:
L = FOREACH K GENERATE AVG(I.productcost), AVG(I.deliverycost);
STORE L INTO 'output' USING PigStorage (',');
In the Hadoop Admin Console, I can see several jobs that finish quickly (I
can see they all use many map and reduce tasks).
However, eventually Hadoop executes a job with a single map and reduce task
which is taking forever to finish (it has been running for several hours so
far). All the map and reduce tasks report 100% complete, but I can see that
one of the statistics called "Map output records" is slowly increasing and
the job status remains as 'Running'.
Could anyone provide any advice on how I could go about diagnosing the
cause of this problem? I suspect the average function is taking a long time
to execute, but I thought calculating the average of 100,000 items would
not take that long.
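
A possible first diagnostic step, sketched with the alias names K and L from the script above (how K is built is not shown in this post, so its schema is an assumption): Pig's DESCRIBE and EXPLAIN operators can show the shape of the data and the plan behind the slow job.

```pig
-- Print the schema of K to confirm what the bag I actually contains
-- (assumption: K holds a bag named I, as the AVG(I.productcost) call implies).
DESCRIBE K;

-- Print the logical, physical and MapReduce plans Pig generates for L,
-- which shows which operators end up in the final single-task job.
EXPLAIN L;
```

If the plan shows a cross product or a join feeding the final job, that step rather than AVG itself may be what produces the growing "Map output records" count.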