Pig >> mail # user >> Correlation function out of memory error


I have a file with a few hundred columns of doubles, and I am
interested in building a correlation matrix over the columns:

REGISTER piggybank.jar;  -- COR is the Piggybank correlation UDF
DEFINE COR org.apache.pig.piggybank.evaluation.stats.COR();

A = load 'myData' using PigStorage(':');
B = group A all;
D = foreach B generate group, COR(A.$0, A.$1, A.$2);
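For reference, the per-pair statistic COR computes can be sketched in plain Python. The toy rows below stand in for the ':'-delimited 'myData' file and are purely illustrative:

```python
import math
from itertools import combinations

def pearson(xs, ys):
    # Plain Pearson correlation; this mirrors what COR computes for one column pair.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Toy stand-in for a ':'-delimited file with three double columns.
rows = [(1.0, 2.0, 3.0), (2.0, 4.1, 2.9), (3.0, 6.2, 1.0), (4.0, 7.9, 0.5)]
cols = list(zip(*rows))

# One correlation per distinct column pair: N*(N-1)/2 results in total.
matrix = {(i, j): pearson(cols[i], cols[j])
          for i, j in combinations(range(len(cols)), 2)}
```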

For N parameters, the COR function generates N(N-1)/2 correlations.
This is fine as long as N is less than 100: COR(A.$0,A.$1, .... A.$100);
However, once N goes past 100 or 200 I hit an out-of-memory error (the
exact threshold depends, of course, on how much RAM you have):

883 [main] ERROR org.apache.pig.tools.pigstats.SimplePigStats - ERROR
6016: Out of memory.
893 [main] ERROR org.apache.pig.tools.pigstats.PigStatsUtil - 1 map
reduce job(s) failed!
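To put numbers on the quadratic growth described above, here is a quick Python sketch of the pair count; the sample N values are illustrative:

```python
def num_pairs(n):
    # Distinct column pairs COR must evaluate for n columns: n*(n-1)/2.
    return n * (n - 1) // 2

# The pair count (and the partial results held in memory) grows quadratically:
for n in (100, 200, 300):
    print(n, num_pairs(n))  # 100 -> 4950, 200 -> 19900, 300 -> 44850
```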

My file is less than 50 MB, so Pig always runs with a single mapper.

The behavior is the same whether I run the script locally (pig -x
local) or on Amazon Elastic MapReduce with multiple instances assigned
to the job.

Is there a way to run the correlation function over a large number of
parameters?

Thank you in advance!