Re: Correlation function out of memory error
Johnny Zhang 2013-02-23, 03:10
Hi, Houssam:
I think the above workaround works: increase the number of mappers (the two
steps mentioned in my last email). I just verified it by running the same query
against one mapper with 500 columns but only a few rows, and it passed. I take
this to mean that if you can make the number of mappers big enough that each
mapper takes fewer rows, each mapper can survive the COR calculation for a huge
number of columns.

I think that as long as each mapper doesn't get many rows, it can survive a
huge number of columns. There may be a point where the number of columns is so
large that even a mapper assigned a single row of data will still crash. I
haven't tested that limit yet, but I think it is much bigger than 500. Hope
this helps.
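
For reference, here is a minimal sketch of those two steps applied to a small
COR query. This is only a sketch: the file names, the 16 MB block size, and the
three columns are placeholders, and it assumes that rewriting the input with a
smaller block size and setting pig.noSplitCombination to true yields one mapper
per HDFS block.

-- Step 1: rewrite the input with a smaller HDFS block size, e.g. from a shell:
--   hadoop fs -D dfs.block.size=16777216 -put random.txt random_small_blocks.txt
-- Step 2: tell Pig not to combine small splits, so the mapper count equals the block count
SET pig.noSplitCombination true;
A = load 'random_small_blocks.txt' using PigStorage(':') as (f1:double, f2:double, f3:double);
B = group A all;
D = foreach B generate group, COR(A.$0, A.$1, A.$2);
dump D;

The idea is that each mapper then feeds fewer rows into COR's combine step,
which is where the heap runs out in the stack traces quoted below.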

Johnny
On Fri, Feb 22, 2013 at 3:04 PM, Johnny Zhang <[EMAIL PROTECTED]> wrote:

> Hi, Houssam:
> Can you try making your HDFS block size smaller and also running 'SET
> pig.noSplitCombination true;' in Pig? (so that the number of mappers will be
> equal to the number of file blocks)
>
> The OOM seems to happen in the COR function when it is trying to combine
> different data chunks together in the mapper. So more mappers may help. I will try
> it when I get a cluster to play with.
>
> Johnny
>
>
> On Fri, Feb 22, 2013 at 2:18 PM, Johnny Zhang <[EMAIL PROTECTED]> wrote:
>
>> Hi, Houssam:
>> What's the error in your pig log file? I was trying to reproduce it with
>> 1000 rows and 500 columns.
>> A = load 'random.txt' using PigStorage(':') as
>> (f1:double,f2:double,.........,f500:double);
>> B = group A all;
>> D = foreach B generate group,COR(A.$0,A.$1,A.$2,A.$3,.......A.$499);
>> dump D;
>>
>> The exception in the pig log file is:
>> Backend error message
>> ---------------------
>> Error: java.lang.OutOfMemoryError: GC overhead limit exceeded
>> at java.lang.Double.valueOf(Double.java:492)
>>  at org.apache.pig.data.BinInterSedes.readDatum(BinInterSedes.java:390)
>> at org.apache.pig.data.BinInterSedes.readDatum(BinInterSedes.java:313)
>>  at
>> org.apache.pig.data.utils.SedesHelper.readGenericTuple(SedesHelper.java:144)
>> at org.apache.pig.data.BinInterSedes.readDatum(BinInterSedes.java:344)
>>  at org.apache.pig.data.BinInterSedes.readDatum(BinInterSedes.java:313)
>> at
>> org.apache.pig.data.utils.SedesHelper.readGenericTuple(SedesHelper.java:144)
>>  at org.apache.pig.data.BinInterSedes.readDatum(BinInterSedes.java:344)
>> at org.apache.pig.data.BinInterSedes.readDatum(BinInterSedes.java:313)
>>  at
>> org.apache.pig.data.InternalCachedBag$CachedBagIterator.hasNext(InternalCachedBag.java:208)
>> at org.apache.pig.builtin.COR.combine(COR.java:258)
>>  at org.apache.pig.builtin.COR$Intermed.exec(COR.java:171)
>> at org.apache.pig.builtin.COR$Intermed.exec(COR.java:164)
>>  at org.apache.pig.backend.hadoop.executionengine.physi
>>
>> Backend error message
>> ---------------------
>> Error: java.lang.OutOfMemoryError: Java heap space
>>  at java.lang.Double.valueOf(Double.java:492)
>> at org.apache.pig.data.BinInterSedes.readDatum(BinInterSedes.java:390)
>>  at org.apache.pig.data.BinInterSedes.readDatum(BinInterSedes.java:313)
>> at
>> org.apache.pig.data.utils.SedesHelper.readGenericTuple(SedesHelper.java:144)
>>  at org.apache.pig.data.BinInterSedes.readDatum(BinInterSedes.java:344)
>> at org.apache.pig.data.BinInterSedes.readDatum(BinInterSedes.java:313)
>>  at
>> org.apache.pig.data.utils.SedesHelper.readGenericTuple(SedesHelper.java:144)
>> at org.apache.pig.data.BinInterSedes.readDatum(BinInterSedes.java:344)
>>  at org.apache.pig.data.BinInterSedes.readDatum(BinInterSedes.java:313)
>> at
>> org.apache.pig.data.InternalCachedBag$CachedBagIterator.hasNext(InternalCachedBag.java:208)
>>  at org.apache.pig.builtin.COR.combine(COR.java:258)
>> at org.apache.pig.builtin.COR$Intermed.exec(COR.java:171)
>>  at org.apache.pig.builtin.COR$Intermed.exec(COR.java:164)
>> at org.apache.pig.backend.hadoop.executionengine.physicalLayer.ex
>>
>> Backend error message
>> ---------------------
>> Error: java.lang.OutOfMemoryError: GC overhead limit exceeded
>> at java.util.ArrayList.<init>(ArrayList.java:112)