-
Correlation function out of memory error
Houssam H. 2013-02-21, 19:39
Hi,
I have a file with a few hundred columns of doubles, and I am interested in creating a correlation matrix for the columns:

    A = load 'myData' using PigStorage(':');
    B = group A all;
    D = foreach B generate group,COR(A.$0,A.$1,A.$2);

For N parameters, the COR function will generate N(N-1)/2 correlations. This is fine as long as N is less than 100:

    COR(A.$0,A.$1, .... A.$100);

However, once N is more than 100 or 200 I get an out of memory error (of course this depends on the amount of RAM you have):

    883 [main] ERROR org.apache.pig.tools.pigstats.SimplePigStats - ERROR 6016: Out of memory.
    893 [main] ERROR org.apache.pig.tools.pigstats.PigStatsUtil - 1 map reduce job(s) failed!

My file is smaller than 50 MB, so Pig always runs with only one mapper. The behavior was the same whether I ran the script locally (pig -x local) or on Amazon Elastic MapReduce with multiple instances assigned to the job.

Is there a way to run the correlation function for a large number of parameters?

Thank you in advance!
-Houssam
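To get a feel for how quickly the output of COR grows, the pair count from the formula above can be worked out for the column counts that come up in this thread. This is plain arithmetic only; it says nothing about how much memory COR needs per pair, which the thread does not establish.

```latex
\[
  P(N) = \binom{N}{2} = \frac{N(N-1)}{2}
\]
\[
  P(100) = 4\,950, \qquad
  P(200) = 19\,900, \qquad
  P(300) = 44\,850, \qquad
  P(500) = 124\,750
\]
```

Going from 100 to 500 columns multiplies the number of correlations by roughly 25, so the pair count grows quadratically in the number of columns.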
-
Re: Correlation function out of memory error
Johnny Zhang 2013-02-22, 22:18
Hi, Houssam:
What's the error in your Pig log file? I was trying to reproduce it with 1000 rows, 500 columns.

    A = load 'random.txt' using PigStorage(':') as (f1:double,f2:double,.........,f500:double);
    B = group A all;
    D = foreach B generate group,COR(A.$0,A.$1,A.$2,A.$3,.......A.$499);
    dump D;

The exception in the Pig log file is:

Backend error message
---------------------
Error: java.lang.OutOfMemoryError: *GC overhead limit exceeded*
    at java.lang.Double.valueOf(Double.java:492)
    at org.apache.pig.data.BinInterSedes.readDatum(BinInterSedes.java:390)
    at org.apache.pig.data.BinInterSedes.readDatum(BinInterSedes.java:313)
    at org.apache.pig.data.utils.SedesHelper.readGenericTuple(SedesHelper.java:144)
    at org.apache.pig.data.BinInterSedes.readDatum(BinInterSedes.java:344)
    at org.apache.pig.data.BinInterSedes.readDatum(BinInterSedes.java:313)
    at org.apache.pig.data.utils.SedesHelper.readGenericTuple(SedesHelper.java:144)
    at org.apache.pig.data.BinInterSedes.readDatum(BinInterSedes.java:344)
    at org.apache.pig.data.BinInterSedes.readDatum(BinInterSedes.java:313)
    at org.apache.pig.data.InternalCachedBag$CachedBagIterator.hasNext(InternalCachedBag.java:208)
    at org.apache.pig.builtin.COR.combine(COR.java:258)
    at org.apache.pig.builtin.COR$Intermed.exec(COR.java:171)
    at org.apache.pig.builtin.COR$Intermed.exec(COR.java:164)
    at org.apache.pig.backend.hadoop.executionengine.physi

Backend error message
---------------------
Error: java.lang.OutOfMemoryError: Java heap space
    at java.lang.Double.valueOf(Double.java:492)
    at org.apache.pig.data.BinInterSedes.readDatum(BinInterSedes.java:390)
    at org.apache.pig.data.BinInterSedes.readDatum(BinInterSedes.java:313)
    at org.apache.pig.data.utils.SedesHelper.readGenericTuple(SedesHelper.java:144)
    at org.apache.pig.data.BinInterSedes.readDatum(BinInterSedes.java:344)
    at org.apache.pig.data.BinInterSedes.readDatum(BinInterSedes.java:313)
    at org.apache.pig.data.utils.SedesHelper.readGenericTuple(SedesHelper.java:144)
    at org.apache.pig.data.BinInterSedes.readDatum(BinInterSedes.java:344)
    at org.apache.pig.data.BinInterSedes.readDatum(BinInterSedes.java:313)
    at org.apache.pig.data.InternalCachedBag$CachedBagIterator.hasNext(InternalCachedBag.java:208)
    at org.apache.pig.builtin.COR.combine(COR.java:258)
    at org.apache.pig.builtin.COR$Intermed.exec(COR.java:171)
    at org.apache.pig.builtin.COR$Intermed.exec(COR.java:164)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.ex

Backend error message
---------------------
Error: java.lang.OutOfMemoryError: GC overhead limit exceeded
    at java.util.ArrayList.<init>(ArrayList.java:112)
    at org.apache.pig.data.DefaultTuple.<init>(DefaultTuple.java:67)
    at org.apache.pig.data.BinSedesTuple.<init>(BinSedesTuple.java:67)
    at org.apache.pig.data.BinSedesTupleFactory.newTuple(BinSedesTupleFactory.java:38)
    at org.apache.pig.data.utils.SedesHelper.readGenericTuple(SedesHelper.java:142)
    at org.apache.pig.data.BinInterSedes.readDatum(BinInterSedes.java:344)
    at org.apache.pig.data.BinInterSedes.readDatum(BinInterSedes.java:313)
    at org.apache.pig.data.utils.SedesHelper.readGenericTuple(SedesHelper.java:144)
    at org.apache.pig.data.BinInterSedes.readDatum(BinInterSedes.java:344)
    at org.apache.pig.data.BinInterSedes.readDatum(BinInterSedes.java:313)
    at org.apache.pig.data.InternalCachedBag$CachedBagIterator.hasNext(InternalCachedBag.java:208)
    at org.apache.pig.builtin.COR.combine(COR.java:258)
    at org.apache.pig.builtin.COR$Intermed.exec(COR.java:171)
    at org.apache.pig.builtin.COR$Inte

Backend error message
---------------------
Error: java.lang.OutOfMemoryError: Java heap space
    at org.apache.pig.data.BinSedesTupleFactory.newTuple(BinSedesTupleFactory.java:38)
    at org.apache.pig.data.utils.SedesHelper.readGenericTuple(SedesHelper.java:142)
    at org.apache.pig.data.BinInterSedes.readDatum(BinInterSedes.java:344)
    at org.apache.pig.data.BinInterSedes.readDatum(BinInterSedes.java:313)
    at org.apache.pig.data.utils.SedesHelper.readGenericTuple(SedesHelper.java:144)
    at org.apache.pig.data.BinInterSedes.readDatum(BinInterSedes.java:344)
    at org.apache.pig.data.BinInterSedes.readDatum(BinInterSedes.java:313)
    at org.apache.pig.data.InternalCachedBag$CachedBagIterator.hasNext(InternalCachedBag.java:208)
    at org.apache.pig.builtin.COR.combine(COR.java:258)
    at org.apache.pig.builtin.COR$Intermed.exec(COR.java:171)
    at org.apache.pig.builtin.COR$Intermed.exec(COR.java:164)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:337)
    at org.apache.pig.backend.hadoop.executionengin

Error message from task (map) task_201302211102_0561_m_000000
ERROR 6016: Out of memory.
org.apache.pig.backend.executionengine.ExecException: ERROR 6016: Out of memory.
    at java.lang.Double.valueOf(Double.java:492)
    at org.apache.pig.data.BinInterSedes.readDatum(BinInterSedes.java:390)
    at org.apache.pig.data.BinInterSedes.readDatum(BinInterSedes.java:313)
    at org.apache.pig.data.utils.SedesHelper.readGenericTuple(SedesHelper.java:144)
    at org.apache.pig.data.BinInterSedes.readDatum(BinInterSedes.java:344)
    at org.apache.pig.data.BinInterSedes.readDatum(BinInterSedes.java:313)
    at org.apache.pig.data.utils.SedesHelper.readGenericTuple(SedesHelper.java:144)
    at org.apache.pig.data.BinInterSedes.readDatum(BinInterSedes.java:344)
    at org.apache.pig.data.BinInterSedes.readDatum(BinInterSedes.java:313)
    at org.apache.pig.data.InternalCachedBag$CachedBagIterator.hasNext(InternalCachedBag.java:208)
    at org.apache.pig.builtin.COR.combine(COR.java:258)
    at org.apache.pig.builtin.COR$Intermed.exec(COR.java:171)
    at org.apache.pig.builtin.COR$Intermed.exec(COR.java:164)
Caused by: java.lang.OutOfMemoryError: GC overhead limit exceeded
    ... 13 more
===============================================================================
Error message from task (map) task_201302211102_0561_m_000000
ERROR 6016: Out of memory.
org.apache.pig.backend.executionengine.ExecException: ERROR 6016: Out of memory.
    at java.lang.D
-
Re: Correlation function out of memory error
Johnny Zhang 2013-02-22, 23:04
Hi, Houssam:
Could you try making your HDFS block size smaller and also setting 'SET pig.noSplitCombination false;' in Pig, so that the number of mappers equals the number of file blocks?

The OOM seems to happen in the COR function when it tries to combine different chunks of data in the mapper, so more mappers may help. I will try it when I get a cluster to play with.

Johnny
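The suggestion in the message above could be sketched as a pair of settings placed at the top of the Pig script, before the load statement. This is a minimal sketch under assumptions, not a verified fix: pig.splitCombination is the switch named in the Pig performance documentation for turning split combination on or off, and is used here in place of the pig.noSplitCombination name from the message; mapred.max.split.size is a swapped-in alternative to shrinking the HDFS block size, and its 1 MB value is purely illustrative. Property names and behavior vary across Pig and Hadoop versions.

```pig
-- Sketch only; check both properties against your Pig and Hadoop versions.

-- Stop Pig from merging small input splits into a single mapper.
-- Assumption: pig.splitCombination (default true) is the documented switch;
-- the message above refers to the related pig.noSplitCombination name.
SET pig.splitCombination false;

-- Assumption: cap each input split at 1 MB so that a file under 50 MB
-- is divided among many mappers. The value is illustrative, not from the thread.
SET mapred.max.split.size 1048576;
```

Whether more mappers actually avoid the out of memory error depends on how many rows each mapper ends up combining, so the split size would need tuning against the real data.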
-
Re: Correlation function out of memory error
Johnny Zhang 2013-02-23, 03:10
Hi, Houssam:
I think the workaround above works: increase the number of mappers (the two steps mentioned in my last email). I just verified it by running the same query against 1 mapper, with 500 columns but only a few rows, and it passes. I guess this means that if you can increase the number of mappers enough that each mapper takes fewer rows, a mapper can survive the COR calculation for a huge number of columns.

There may be a point where the number of columns is so large that a mapper will still crash even when it is assigned a single row of data. I haven't tested this limit yet, but I think it is much bigger than 500. Hope this is helpful.

Johnny
-
Re: Correlation function out of memory error
Houssam 2013-02-25, 08:30
Hi Johnny,
Thank you for your help. Yes indeed, setting mapred.min.split.size to 1 or 10 MB greatly increased the number of mappers and thus made the job complete successfully. For the reducers, however, we can only have as many reducers as machines running (by setting default_parallel), and this is a huge bottleneck.

As a comparison benchmark: the correlation matrix for 300 columns and 10k rows was computed in 9 minutes on 3 AWS high-memory extra large instances. The same calculation was done using MATLAB on a laptop in 0.1 second. I know this is an unfair comparison, because the correlation calculation lends itself to vectorization and MATLAB was reading its data from RAM, but it goes to show that Hadoop is not a solution for every problem ;)

-Houssam.
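For reference, the two settings named in the message above might look like this at the top of a Pig script. This is a sketch of what the message reports rather than a tested configuration: the 1 MB figure is one of the two values mentioned, and the reducer count of 3 is a hypothetical value chosen only to match the three instances in the benchmark.

```pig
-- Sketch only; values are illustrative.

-- Split size setting reported in the message above (1 MB; 10 MB was also tried).
SET mapred.min.split.size 1048576;

-- Hypothetical: one reducer per instance in the three-instance benchmark.
SET default_parallel 3;
```

How the split size property interacts with the HDFS block size and with Pig's split combination depends on the Hadoop and Pig versions in use, so the resulting mapper count is worth confirming in the job tracker before relying on it.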