Pig, mail # user - Java heap error


Re: Java heap error
Ashutosh Chauhan 2010-07-09, 00:45
I would recommend the following things, in this order:

1) Increasing the heap size should help.
2) It seems you are on 0.7. There are a couple of memory fixes we have
committed, both on the 0.7 branch and on trunk. Those should help as
well, so build Pig from either trunk or the 0.7 branch and use that.
3) Only if these don't help, try tuning the parameter
pig.cachedbag.memusage. By default it is set to 0.1; lowering it
should help. Try 0.05, then 0.02, and then further down. The downside
is that the lower you go, the slower your query will run.
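For reference, a minimal sketch of how suggestions 1 and 3 might be applied when launching a script. The script name and the heap value are placeholders, and mapred.child.java.opts is the usual Hadoop-side task-heap setting rather than anything Pig-specific; depending on your setup you may need to put these properties in pig.properties or PIG_OPTS instead of on the command line:

```shell
# 1) Give the map/reduce task JVMs a larger heap (1024m is an example value).
pig -Dmapred.child.java.opts=-Xmx1024m myscript.pig

# 3) Lower Pig's cached-bag memory fraction from its 0.1 default.
pig -Dpig.cachedbag.memusage=0.05 myscript.pig
```

Lowering pig.cachedbag.memusage makes Pig spill bags to disk sooner, which trades execution speed for a smaller memory footprint.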

Let us know if these changes get your query to completion.

Ashutosh

On Thu, Jul 8, 2010 at 15:48, Syed Wasti <[EMAIL PROTECTED]> wrote:
> Thanks Ashutosh. Is there any workaround for this? Will increasing the heap
> size help?
>
>
> On 7/8/10 1:59 PM, "Ashutosh Chauhan" <[EMAIL PROTECTED]> wrote:
>
>> Syed,
>>
>> You are likely hit by https://issues.apache.org/jira/browse/PIG-1442 .
>> Your query and stacktrace look very similar to the one in the jira
>> ticket. This may get fixed by the 0.8 release.
>>
>> Ashutosh
>>
>> On Thu, Jul 8, 2010 at 13:42, Syed Wasti <[EMAIL PROTECTED]> wrote:
>>> Sorry about the delay; I was held up with different things.
>>> Here is the script and the errors below;
>>>
>>> AA = LOAD 'table1' USING PigStorage('\t') as
>>> (ID,b,c,d,e,f,g,h,i,j,k,l,m,n,o);
>>>
>>> AB = FOREACH AA GENERATE ID, e, f, n,o;
>>>
>>> AC = FILTER AB BY o == 1;
>>>
>>> AD = GROUP AC BY (ID, b);
>>>
>>> AE = FOREACH AD { A = DISTINCT AC.d;
>>>        GENERATE group.ID, (chararray) 'S' AS type, group.b, (int)
>>> COUNT_STAR(AC) AS cnt, (int) COUNT(A) AS cnt_distinct; }
>>>
>>> The same steps are repeated to load 5 different tables and then a UNION is
>>> done on them.
>>>
>>> Final_res = UNION AE, AF, AG, AH, AI;
>>>
>>> The actual number of columns will be 15; here I am showing it with one table.
>>>
>>> Final_table =   FOREACH Final_res GENERATE ID,
>>>                (type == 'S' AND b == 1?cnt:0) AS 12_tmp,
>>>                (type == 'S' AND b == 2?cnt:0) AS 13_tmp,
>>>                (type == 'S' AND b == 1?cnt_distinct:0) AS 12_distinct_tmp,
>>>                (type == 'S' AND b == 2?cnt_distinct:0) AS 13_distinct_tmp;
>>>
>>> It works fine until here; it is only after adding this last part of the
>>> query that it starts throwing heap errors.
>>>
>>> grp_id =    GROUP Final_table BY ID;
>>>
>>> Final_data = FOREACH grp_id GENERATE group AS ID,
>>> SUM(Final_table.12_tmp), SUM(Final_table.13_tmp),
>>> SUM(Final_table.12_distinct_tmp), SUM(Final_table.13_distinct_tmp);
>>>
>>> STORE Final_data;
>>>
>>>
>>> Error: java.lang.OutOfMemoryError: Java heap space
>>>  at java.util.ArrayList.<init>(ArrayList.java:112)
>>>  at org.apache.pig.data.DefaultTuple.<init>(DefaultTuple.java:63)
>>>  at org.apache.pig.data.DefaultTupleFactory.newTuple(DefaultTupleFactory.java:35)
>>>  at org.apache.pig.data.DataReaderWriter.bytesToTuple(DataReaderWriter.java:55)
>>>  at org.apache.pig.data.DataReaderWriter.readDatum(DataReaderWriter.java:136)
>>>  at org.apache.pig.data.DataReaderWriter.readDatum(DataReaderWriter.java:130)
>>>  at org.apache.pig.data.DefaultTuple.readFields(DefaultTuple.java:289)
>>>  at org.apache.pig.impl.io.PigNullableWritable.readFields(PigNullableWritable.java:114)
>>>  at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:67)
>>>  at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:40)
>>>  at org.apache.hadoop.mapreduce.ReduceContext.nextKeyValue(ReduceContext.java:116)
>>>  at org.apache.hadoop.mapreduce.ReduceContext.nextKey(ReduceContext.java:92)
>>>  at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:175)
>>>  at org.apache.hadoop.mapred.Task$NewCombinerRunner.combine(Task.java:1217)
>>>  at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:1227)
>>>  at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.access$1800(MapTask.java:64