Pig >> mail # user >> Java heap error


Re: Java heap error
Hi Syed,
Disabling the combiner in the pig query should get this working.
As I mentioned, it looks like the combiner is being used in your query. You can confirm that by running EXPLAIN on your query and checking the MapReduce plan. For some reason the system property 'pig.exec.nocombiner' is not getting set to 'true' in Pig. Could it be a typo in the command-line argument you are adding to disable it (-Dpig.exec.nocombiner=true)?
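For reference, a minimal sketch of both ways to pass the property (assuming a Pig version whose grunt SET command accepts arbitrary properties; the script name below is only illustrative):

-- inside the script, before the first statement:
set pig.exec.nocombiner 'true';
-- or on the command line:
-- pig -Dpig.exec.nocombiner=true myscript.pig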
-Thejas
On 7/29/10 11:10 AM, "Syed Wasti" <[EMAIL PROTECTED]> wrote:

Hi Thejas,
It is from the same script which I shared earlier; I will paste it here again. The error I see is in the same map-reduce job, where it fails with an OOME.
I have a similar script where I am calling the MAX, MIN and SUM functions on the grouped data, and it fails with similar errors.

AA = LOAD 'table1' USING PigStorage('\t') as
(ID,b,c,d,e,f,g,h,i,j,k,l,m,n,o);

AB = FOREACH AA GENERATE ID, e, f, n,o;

AC = FILTER AB BY o == 1;

AD = GROUP AC BY (ID, b);

AE = FOREACH AD { A = DISTINCT AC.d;
        GENERATE group.ID, (chararray) 'S' AS type, group.b, (int)
COUNT_STAR(AC) AS cnt, (int) COUNT(A) AS cnt_distinct; }

The same steps are repeated to load 5 different tables and then a UNION is
done on them.

Final_res = UNION AE, AF, AG, AH, AI;

The actual number of columns will be 15; here I am showing it with one table.

Final_table =   FOREACH Final_res GENERATE ID,
                (type == 'S' AND b == 1?cnt:0) AS 12_tmp,
                (type == 'S' AND b == 2?cnt:0) AS 13_tmp,
                (type == 'S' AND b == 1?cnt_distinct:0) AS 12_distinct_tmp,
                (type == 'S' AND b == 2?cnt_distinct:0) AS 13_distinct_tmp;

It works fine until here; it is only after adding this last part of the query that it starts throwing heap errors.

grp_id =    GROUP Final_table BY ID;

Final_data = FOREACH grp_id GENERATE group AS ID,
SUM(Final_table.12_tmp), SUM(Final_table.13_tmp),
SUM(Final_table.12_distinct_tmp), SUM(Final_table.13_distinct_tmp);

STORE Final_data;

Regards
Syed Wasti
From: [EMAIL PROTECTED]
To: [EMAIL PROTECTED]; [EMAIL PROTECTED]
Date: Wed, 28 Jul 2010 17:29:21 -0700
Subject: Re: Java heap error

From the 2nd stack trace it looks like the combiner did not get disabled. You can verify that by looking at the MapReduce plan in the explain output.

It looks like for some reason the system property 'pig.exec.nocombiner' is not getting set to 'true'.
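Roughly what to look for (a sketch; the exact layout of the explain output varies by Pig version):

explain Final_data;
-- In the MapReduce plan section of the output, a job that still runs the
-- combiner shows a non-empty "Combine Plan" holding the initial/intermediate
-- stages of the algebraic functions (COUNT, SUM). Once pig.exec.nocombiner
-- takes effect, that section should be empty for the job.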

Can you send the other Pig script that errors out with "Error: GC overhead limit exceeded"?

-Thejas

On 7/27/10 11:27 PM, "Syed Wasti" <[EMAIL PROTECTED]> wrote:

Thank you Thejas for the response.

I want to share my feedback after trying all the recommended options.

I tried increasing the heap size, built Pig from the trunk, and disabled the combiner by setting the property you recommended. None of this worked and I am still seeing the same errors; the only approach that works for me is using the UDF I created.

Another case where it errors out with "Error: GC overhead limit exceeded" is in the reduce jobs, when they are in the state of copying map outputs. The job just hangs there for a long time (over 30 minutes) and finally errors out.

I tried changing some parameters which I thought should be related, but it didn't help. Do you think this could be related to the newly created JIRA, or would you recommend any properties that I should try?
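For example, are these the kind of properties I should be looking at for the copy/merge phase? (Hadoop 0.20-era names; the values are only placeholders, and the script name is illustrative.)

pig -Dmapred.child.java.opts=-Xmx1024m \
    -Dmapred.job.shuffle.input.buffer.percent=0.50 \
    -Dmapred.job.shuffle.merge.percent=0.66 \
    -Dmapred.reduce.parallel.copies=5 \
    myscript.pig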

If it helps, I am pasting the stack trace of my map task failures when running the script with the combiner disabled. Thanks.

Regards

Syed Wasti

Error: java.lang.OutOfMemoryError: GC overhead limit exceeded
    at java.util.ArrayList.&lt;init&gt;(ArrayList.java:112)
    at org.apache.pig.data.DefaultTuple.&lt;init&gt;(DefaultTuple.java:60)
    at org.apache.pig.data.BinSedesTuple.&lt;init&gt;(BinSedesTuple.java:66)
    at org.apache.pig.data.BinSedesTupleFactory.newTuple(BinSedesTupleFactory.java:37)
    at org.apache.pig.data.BinInterSedes.readTuple(BinInterSedes.java:100)
    at org.apache.pig.data.BinInterSedes.readDatum(BinInterSedes.java:267)
    at org.apache.pig.data.BinInterSedes.readDatum(BinInterSedes.java:250)
    at org.apache.pig.data.BinInterSedes.addColsToTuple(BinInterSedes.java:568)
    at org.apache.pig.data.BinSedesTuple.readFields(BinSedesTuple.java:48)
    at org.apache.pig.impl.io.PigNullableWritable.readFields(PigNullableWritable.java:114)
    at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:67)
    at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:40)
    at org.apache.hadoop.mapreduce.ReduceContext.nextKeyValue(ReduceContext.java:116)
    at org.apache.hadoop.mapreduce.ReduceContext.nextKey(ReduceContext.java:92)
    at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:175)
    at org.apache.hadoop.mapred.Task$NewCombinerRunner.combine(Task.java:1222)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:1265)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.access$1800(MapTask.java:686)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer$SpillThread.run(MapTask.java:1173)

Error: java.lang.OutOfMemoryError: GC overhead limit exceeded
    at java.util.AbstractList.iterator(AbstractList.java:273)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.processInput(POUserFunc.java:148)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:203)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:262)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:343)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:291)
    at org.apache.pig.backend.hadoop.executionen