Correlation function out of memory error
Hi,

I have a file with a few hundred columns of doubles, and I am
interested in creating a correlation matrix for the columns:

-- COR comes from Piggybank; the jar path below is just a placeholder
REGISTER piggybank.jar;
DEFINE COR org.apache.pig.piggybank.evaluation.stats.COR();
A = load 'myData' using PigStorage(':');
B = group A all;
D = foreach B generate group, COR(A.$0, A.$1, A.$2);

For N parameters, the COR function will generate N(N-1)/2 correlations.
This is fine as long as N is less than 100: COR(A.$0, A.$1, .... A.$100);
However, once N is more than 100 or 200, I get an out-of-memory error
(this of course depends on the amount of RAM you have):

883 [main] ERROR org.apache.pig.tools.pigstats.SimplePigStats - ERROR 6016: Out of memory.
893 [main] ERROR org.apache.pig.tools.pigstats.PigStatsUtil - 1 map reduce job(s) failed!
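
To put numbers on the N(N-1)/2 growth: N = 100 already means 100*99/2 = 4,950
correlations, and N = 200 means 200*199/2 = 19,900.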

My file is less than 50 MB, so Pig always runs the job with only one mapper.

The behavior is the same whether I run the script locally (pig -x
local) or on Amazon Elastic MapReduce with multiple instances assigned
to the job.

Is there a way to run the correlation function for a large number of
parameters?
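
The only workaround I can think of (untested, so I do not know whether it
actually avoids the error) is to issue one COR call per pair of columns, so
that each call only has to buffer two columns instead of all N, at the cost
of writing N(N-1)/2 statements. A rough sketch, assuming the Piggybank COR
UDF and a placeholder jar path:

REGISTER piggybank.jar;
DEFINE COR org.apache.pig.piggybank.evaluation.stats.COR();
A = load 'myData' using PigStorage(':');
B = group A all;
-- one statement per pair of columns, N(N-1)/2 statements in total
C_0_1 = foreach B generate group, COR(A.$0, A.$1);
C_0_2 = foreach B generate group, COR(A.$0, A.$2);
C_1_2 = foreach B generate group, COR(A.$1, A.$2);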

Thank you in advance!

-Houssam
Replies:
Johnny Zhang 2013-02-22, 22:18
Johnny Zhang 2013-02-22, 23:04
Johnny Zhang 2013-02-23, 03:10
Houssam 2013-02-25, 08:30