Java heap error
Syed Wasti 2010-07-07, 21:09
I am running my Pig scripts on our QA cluster (4 datanodes, see below), which has the Cloudera CDH2 release installed and a global maximum heap of -Xmx4096m. I keep getting OutOfMemory errors (see below) in my map and reduce jobs whenever I run my script against a large data set, where it produces around 600 maps. I am looking for tips on the best configuration for Pig to get rid of these errors. Thanks.

Error: GC overhead limit exceeded
Error: java.lang.OutOfMemoryError: Java heap space

Regards
Syed
Re: Java heap error
Ashutosh Chauhan 2010-07-08, 00:50
Syed,

One-line stack traces aren't much help :) Please provide the full stack trace and the Pig script that produced it, and we can take a look.

Ashutosh
Re: Java heap error
Syed Wasti 2010-07-08, 20:42
Sorry about the delay, I was held up with other things.
Here are the script and the errors:

    AA = LOAD 'table1' USING PigStorage('\t') as (ID,b,c,d,e,f,g,h,i,j,k,l,m,n,o);
    AB = FOREACH AA GENERATE ID, e, f, n,o;
    AC = FILTER AB BY o == 1;
    AD = GROUP AC BY (ID, b);
    AE = FOREACH AD { A = DISTINCT AC.d;
        GENERATE group.ID, (chararray) 'S' AS type, group.b,
            (int) COUNT_STAR(filt) AS cnt, (int) COUNT(A) AS cnt_distinct; }

The same steps are repeated to load 5 different tables, and then a UNION is done on them.

    Final_res = UNION AE, AF, AG, AH, AI;

The actual number of columns will be 15; here I am showing just one table.

    Final_table = FOREACH Final_res GENERATE ID,
        (type == 'S' AND b == 1?cnt:0) AS 12_tmp,
        (type == 'S' AND b == 2?cnt:0) AS 13_tmp,
        (type == 'S' AND b == 1?cnt_distinct:0) AS 12_distinct_tmp,
        (type == 'S' AND b == 2?cnt_distinct:0) AS 13_distinct_tmp;

It works fine up to this point; it only starts throwing heap errors after I add this last part of the query.

    grp_id = GROUP Final_table BY ID;
    Final_data = FOREACH grp_id GENERATE group AS ID,
        SUM(Final_table.12_tmp), SUM(Final_table.13_tmp),
        SUM(Final_table.12_distinct_tmp), SUM(Final_table.13_distinct_tmp);
    STORE Final_data;

Error: java.lang.OutOfMemoryError: Java heap space
    at java.util.ArrayList.<init>(ArrayList.java:112)
    at org.apache.pig.data.DefaultTuple.<init>(DefaultTuple.java:63)
    at org.apache.pig.data.DefaultTupleFactory.newTuple(DefaultTupleFactory.java:35)
    at org.apache.pig.data.DataReaderWriter.bytesToTuple(DataReaderWriter.java:55)
    at org.apache.pig.data.DataReaderWriter.readDatum(DataReaderWriter.java:136)
    at org.apache.pig.data.DataReaderWriter.readDatum(DataReaderWriter.java:130)
    at org.apache.pig.data.DefaultTuple.readFields(DefaultTuple.java:289)
    at org.apache.pig.impl.io.PigNullableWritable.readFields(PigNullableWritable.java:114)
    at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:67)
    at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:40)
    at org.apache.hadoop.mapreduce.ReduceContext.nextKeyValue(ReduceContext.java:116)
    at org.apache.hadoop.mapreduce.ReduceContext.nextKey(ReduceContext.java:92)
    at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:175)
    at org.apache.hadoop.mapred.Task$NewCombinerRunner.combine(Task.java:1217)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:1227)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.access$1800(MapTask.java:648)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer$SpillThread.run(MapTask.java:1135)

Error: java.lang.OutOfMemoryError: Java heap space
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POCombinerPackage.createDataBag(POCombinerPackage.java:139)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POCombinerPackage.getNext(POCombinerPackage.java:148)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigCombiner$Combine.processOnePackageOutput(PigCombiner.java:168)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigCombiner$Combine.reduce(PigCombiner.java:159)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigCombiner$Combine.reduce(PigCombiner.java:50)
    at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:176)
    at org.apache.hadoop.mapred.Task$NewCombinerRunner.combine(Task.java:1217)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:1227)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.access$1800(MapTask.java:648)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer$SpillThread.run(MapTask.java:1135)

Error: java.lang.OutOfMemoryError: GC overhead limit exceeded
    at java.util.AbstractList.iterator(AbstractList.java:273)
    at org.apache.pig.data.DefaultTuple.getMemorySize(DefaultTuple.java:185)
    at org.apache.pig.data.InternalCachedBag.add(InternalCachedBag.java:89)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POCombinerPackage.getNext(POCombinerPackage.java:168)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigCombiner$Combine.processOnePackageOutput(PigCombiner.java:168)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigCombiner$Combine.reduce(PigCombiner.java:159)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigCombiner$Combine.reduce(PigCombiner.java:50)
    at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:176)
    at org.apache.hadoop.mapred.Task$NewCombinerRunner.combine(Task.java:1217)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:1227)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.access$1800(MapTask.java:648)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer$SpillThread.run(MapTask.java:1135)

Error: GC overhead limit exceeded

Error: java.lang.OutOfMemoryError: GC overhead limit exceeded
    at org.apache.pig.data.DefaultTupleFactory.newTuple(DefaultTupleFactory.java:35)
    at org.apache.pig.data.DataReaderWriter.bytesToTuple(DataReaderWriter.java:55)
    at org.apache.pig.data.DataReaderWriter.readDatum(DataReaderWriter.java:136)
    at org.apache.pig.data.DataReaderWriter.readDatum(DataReaderWriter.java:130)
    at org.apache.pig.data.DefaultTuple.readFields(DefaultTuple.java:289)
    at org.apache.pig.impl.io.PigNullableWritable.readFields(PigNullableWritable.java:114)
    at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:67)
    at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:40)
    at org.apache.hadoop.mapreduce.ReduceContext.nextKeyValue(ReduceContext.java:116)
    at org.apache.hadoop.mapreduce.ReduceContext.nex
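The Final_table and Final_data steps in the script amount to a conditional sum: each row adds its cnt or cnt_distinct to the column selected by (type, b) and 0 to the others, and the totals are then taken per ID. A minimal Java sketch of that logic, outside Pig, follows. The class ConditionalSumPivot, the Row type and the sample values are hypothetical and exist only for illustration; they are not part of Pig or of the original script.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration class; not part of Pig.
final class ConditionalSumPivot {

    /** One row of the UNION result: ID, type, b, cnt, cnt_distinct. */
    static final class Row {
        final String id;
        final String type;
        final int b;
        final int cnt;
        final int cntDistinct;

        Row(String id, String type, int b, int cnt, int cntDistinct) {
            this.id = id;
            this.type = type;
            this.b = b;
            this.cnt = cnt;
            this.cntDistinct = cntDistinct;
        }
    }

    /**
     * Returns, per ID, four totals in this order:
     * cnt where b == 1, cnt where b == 2,
     * cnt_distinct where b == 1, cnt_distinct where b == 2.
     * Only rows of type "S" contribute; other rows add 0.
     */
    static Map<String, int[]> pivot(Iterable<Row> rows) {
        Map<String, int[]> totals = new LinkedHashMap<>();
        for (Row r : rows) {
            // Every ID gets an entry, even if all its rows contribute 0.
            int[] t = totals.computeIfAbsent(r.id, k -> new int[4]);
            if ("S".equals(r.type)) {
                if (r.b == 1) {
                    t[0] += r.cnt;
                    t[2] += r.cntDistinct;
                } else if (r.b == 2) {
                    t[1] += r.cnt;
                    t[3] += r.cntDistinct;
                }
            }
        }
        return totals;
    }

    public static void main(String[] args) {
        List<Row> rows = Arrays.asList(
                new Row("u1", "S", 1, 3, 2),
                new Row("u1", "S", 2, 5, 4),
                new Row("u2", "X", 1, 9, 9));
        for (Map.Entry<String, int[]> e : pivot(rows).entrySet()) {
            System.out.println(e.getKey() + " " + Arrays.toString(e.getValue()));
        }
    }
}
```

The sketch keeps one small array per ID, which is the same amount of state the per-ID SUM needs to produce its result.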
Re: Java heap error
Ashutosh Chauhan 2010-07-08, 20:59
Syed,

You are likely hitting https://issues.apache.org/jira/browse/PIG-1442 . Your query and stack trace look very similar to the ones in that JIRA ticket. This may get fixed by the 0.8 release.

Ashutosh
Re: Java heap error
Syed Wasti 2010-07-08, 22:48
Thanks Ashutosh. Is there any workaround for this? Will increasing the heap size help?
Re: Java heap error
Ashutosh Chauhan 2010-07-09, 00:45
I recommend the following, in this order:

1) Increasing the heap size should help.
2) It seems you are on 0.7. We have committed a couple of memory fixes on both the 0.7 branch and trunk. Those should help as well, so build Pig from either trunk or the 0.7 branch and use that.
3) Only if these don't help, try tuning the parameter pig.cachedbag.memusage. It defaults to 0.1; lowering it should help. Try 0.05, then 0.02, and then lower still. The downside is that the lower you go, the slower your query will run.

Let us know if these changes get your query to completion.

Ashutosh
Re: Java heap error
Ashutosh Chauhan 2010-07-09, 00:58
Aah, I forgot to say how to set the parameter in 3). When launching Pig, pass it as a -D command-line switch, as follows:

    pig -Dpig.cachedbag.memusage=0.02f myscript.pig
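To get a feel for what lowering pig.cachedbag.memusage means, here is a rough sketch of the arithmetic. It assumes the value is a fraction of the task JVM's maximum heap that cached bags may hold in memory before spilling to disk; that reading is an assumption about Pig's internals, and the helper CachedBagBudget is hypothetical, not a Pig class. The 4096 MB figure is the heap size mentioned at the start of the thread; the heap actually given to each task may differ.

```java
// Hypothetical helper for illustration; not part of Pig.
final class CachedBagBudget {

    /**
     * Estimates how many bytes cached bags may keep in memory,
     * assuming the setting is a plain fraction of the maximum heap.
     */
    static long budgetBytes(long maxHeapBytes, double memUsageFraction) {
        if (maxHeapBytes < 0 || memUsageFraction < 0.0 || memUsageFraction > 1.0) {
            throw new IllegalArgumentException("heap must be >= 0 and fraction within [0, 1]");
        }
        return (long) (maxHeapBytes * memUsageFraction);
    }

    public static void main(String[] args) {
        long heapBytes = 4096L * 1024 * 1024; // -Xmx4096m
        double[] fractions = {0.1, 0.05, 0.02}; // default, then the suggested lower values
        for (double fraction : fractions) {
            double megabytes = budgetBytes(heapBytes, fraction) / (1024.0 * 1024.0);
            System.out.printf("fraction %.2f -> about %.1f MB%n", fraction, megabytes);
        }
    }
}
```

Under that assumption, a 4096 MB heap gives roughly 410 MB at 0.1, 205 MB at 0.05 and 82 MB at 0.02, which is why each step down leaves more headroom but makes bags spill sooner.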
Re: Java heap error
Syed Wasti 2010-07-09, 19:50
Hi Ashutosh,

I have not tried options 2 and 3 yet; I will work on those sometime next week. Increasing the heap size did not help at first. With the increased heap size in place, I wrote a UDF to do the SUM on the grouped data for the last step of my script, and my query now completes without any errors.

Syed
Re: Java heap error
Ashutosh Chauhan 2010-07-09, 21:32
Hi Syed,

Do you mean that your query fails with an OOME if you use Pig's builtin SUM, but succeeds if you use your own SUM UDF? If so, that's interesting. I have a hunch about why that is the case, but I would like to confirm it. Would you mind sharing your SUM UDF?

Ashutosh
Re: Java heap error
Syed Wasti 2010-07-09, 23:01
Yes Ashutosh, that is the case, and here is the code for the UDF. Let me know what you find.

public class GroupSum extends EvalFunc<DataBag> {
    TupleFactory mTupleFactory;
    BagFactory mBagFactory;

    public GroupSum() {
        this.mTupleFactory = TupleFactory.getInstance();
        this.mBagFactory = BagFactory.getInstance();
    }

    public DataBag exec(Tuple input) throws IOException {
        if (input.size() < 0) {
            int errCode = 2107;
            String msg = "GroupSum expects one input but received "
                + input.size() + " inputs. \n";
            throw new ExecException(msg, errCode);
        }
        try {
            DataBag output = this.mBagFactory.newDefaultBag();
            Object o1 = input.get(0);
            if (o1 instanceof DataBag) {
                DataBag bag1 = (DataBag) o1;
                if (bag1.size() == 1L) {
                    return bag1;
                }
                sumBag(bag1, output);
            }
            return output;
        } catch (ExecException ee) {
            throw ee;
        }
    }

    private void sumBag(DataBag o1, DataBag emitTo) throws IOException {
        Iterator<?> i1 = o1.iterator();
        Tuple row = null;
        Tuple firstRow = null;
        int fld1 = 0, fld2 = 0, fld3 = 0, fld4 = 0, fld5 = 0;
        int cnt = 0;
        while (i1.hasNext()) {
            row = (Tuple) i1.next();
            if (cnt == 0) {
                firstRow = row;
            }
            fld1 += (Integer) row.get(1);
            fld2 += (Integer) row.get(2);
            fld3 += (Integer) row.get(3);
            fld4 += (Integer) row.get(4);
            fld5 += (Integer) row.get(5);
            cnt++;
        }
        // field 0 has the id in it.
        firstRow.set(1, fld1);
        firstRow.set(2, fld2);
        firstRow.set(3, fld3);
        firstRow.set(4, fld4);
        firstRow.set(5, fld5);
        emitTo.add(firstRow);
    }

    public Schema outputSchema(Schema input) {
        try {
            Schema tupleSchema = new Schema();
            tupleSchema.add(input.getField(0));
            tupleSchema.setTwoLevelAccessRequired(true);
            return tupleSchema;
        } catch (Exception e) {
        }
        return null;
    }
}

On 7/9/10 2:32 PM, "Ashutosh Chauhan" <[EMAIL PROTECTED]> wrote:

> Hi Syed,
>
> Do you mean your query fails with OOME if you use Pig's builtin SUM,
> but succeeds if you use your own SUM UDF? If that is so, thats
> interesting. I have a hunch, why that is the case, but would like to
> confirm. Would you mind sharing your SUM UDF.
>
> Ashutosh
> On Fri, Jul 9, 2010 at 12:50, Syed Wasti <[EMAIL PROTECTED]> wrote:
>> Hi Ashutosh,
>> Did not try option 2 and 3, I shall work sometime next week on that.
>> But increasing the heap size did not help initially, with the increased heap
>> size I came up with a UDF to do the SUM on the grouped data for the last
>> step in my script and it completes my query without any errors now.
>>
>> Syed
>>
>>
>> On 7/8/10 5:58 PM, "Ashutosh Chauhan" <[EMAIL PROTECTED]> wrote:
>>
>>> Aah.. forgot to tell how to set that param in 3). While launching
>>> pig, provide it as -D cmd line switch, as follows:
>>> pig -Dpig.cachedbag.memusage=0.02f myscript.pig
>>>
>>> On Thu, Jul 8, 2010 at 17:45, Ashutosh Chauhan
>>> <[EMAIL PROTECTED]> wrote:
>>>> I will recommend following things in the order:
>>>>
>>>> 1) Increasing heap size should help.
>>>> 2) It seems you are on 0.7. There are couple of memory fixes we have
>>>> committed both on 0.7 branch as well as on trunk. Those should help as
>>>> well. So, build Pig either from trunk or 0.7 branch and use that.
>>>> 3) Only if these dont help, you can try tuning the param
>>>> pig.cachedbag.memusage. By default, it is set at 0.1, lowering it
>>>> should help. Try with 0.05, 0.02 and then further down. Downside is,
>>>> as you go lower and lower, it will make your query go slower.
>>>>>>> org.apache.pig.data.DataReaderWriter.readDatum(DataReaderWriter.java:136)
>>>>>>> org.apache.pig.data.DataReaderWriter.readDatum(DataReaderWriter.java:130)
>>>>>>> org.apache.pig.data.DataReaderWriter.readDatum(DataReaderWriter.java:136)
>>>>>>> org.apache.pig.data.DataReaderWriter.readDatum(DataReaderWriter.java:130)
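The summation in sumBag can be tried outside Pig. The following is only a stand-alone sketch under stated assumptions: it uses plain Java arrays instead of Pig's Tuple and DataBag, and the names GroupSumSketch and sumRows are hypothetical, not part of the UDF above. It keeps the same idea (field 0 is the id, every other field is added up across the rows of one group) and returns null for an empty group, a case the posted code does not guard against. Note also that the posted check `input.size() < 0` can never be true, since a size is never negative; a check such as `input.size() != 1` would match the error message.

```java
import java.util.Arrays;
import java.util.List;

class GroupSumSketch {
    // Sums fields 1..n-1 across all rows of one group; field 0 (the id)
    // is copied from the first row. Rows are plain int[] here, standing in
    // for Pig tuples (an assumption made for illustration only).
    static int[] sumRows(List<int[]> rows) {
        if (rows.isEmpty()) {
            return null; // nothing to sum for an empty group
        }
        int[] out = rows.get(0).clone();
        for (int i = 1; i < rows.size(); i++) {
            int[] row = rows.get(i);
            for (int f = 1; f < out.length; f++) {
                out[f] += row[f];
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<int[]> group = Arrays.asList(
            new int[] {7, 1, 2, 3, 4, 5},
            new int[] {7, 10, 20, 30, 40, 50});
        System.out.println(Arrays.toString(sumRows(group)));
    }
}
```

Only one output row per group is kept in memory at a time, which is the same single-pass shape as the UDF.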
-
Re: Java heap error
Thejas M Nair 2010-07-23, 20:15
Hi Syed,
I think the problem you faced is the same as the one described in the newly created jira, https://issues.apache.org/jira/browse/PIG-1516 . As a workaround, you can disable the combiner (see the above jira). This is what you have done indirectly, by using a new sum UDF that does not implement the algebraic interface. I will be submitting a patch soon for the 0.8 release.

-Thejas
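To make the distinction above concrete, here is a rough stand-alone sketch in plain Java. It does not use Pig's classes or its Algebraic interface; the names CombinerSketch, wholeGroupSum, partialSums and mergePartials are hypothetical and chosen only for illustration. It contrasts a function that needs the whole group at once (the shape of the GroupSum UDF, which Pig cannot run in the combiner) with a sum split into partial sums that are merged later (the shape an algebraic function such as the builtin SUM allows).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class CombinerSketch {
    // Non-algebraic shape: one call sees every value of the group,
    // so nothing can be pre-aggregated on the map side.
    static long wholeGroupSum(List<Integer> group) {
        long total = 0;
        for (int v : group) {
            total += v;
        }
        return total;
    }

    // Algebraic shape, combiner side: each chunk of the group is
    // collapsed to a single partial sum before it is shuffled.
    static List<Long> partialSums(List<List<Integer>> chunks) {
        List<Long> partials = new ArrayList<>();
        for (List<Integer> chunk : chunks) {
            partials.add(wholeGroupSum(chunk));
        }
        return partials;
    }

    // Algebraic shape, reduce side: only the partial sums are merged.
    static long mergePartials(List<Long> partials) {
        long total = 0;
        for (long p : partials) {
            total += p;
        }
        return total;
    }

    public static void main(String[] args) {
        List<Integer> all = Arrays.asList(1, 2, 3, 4, 5);
        List<List<Integer>> chunks =
            Arrays.asList(Arrays.asList(1, 2, 3), Arrays.asList(4, 5));
        // Both lines print 15: the two shapes agree on the result
        // and differ only in where the work happens.
        System.out.println(wholeGroupSum(all));
        System.out.println(mergePartials(partialSums(chunks)));
    }
}
```

Both paths give the same answer, so swapping one for the other changes where memory is used rather than what is computed, which is consistent with the workaround described in this message.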
-
RE: Java heap error
Syed Wasti 2010-07-28, 06:27
Thank you Thejas for the response. I want to share my feedback after trying all the recommended options.

I tried increasing the heap size, built Pig from trunk, and disabled the combiner by setting the property you recommended. None of this worked and I still see the same errors; the only thing that works for me is the UDF I created.

Another case where it errors out with "Error: GC overhead limit exceeded" is in the reduce jobs, while they are copying map outputs. The job hangs there for a long time (over 30 minutes) and finally errors out. I tried changing some parameters that I thought were related, but it didn't help. Do you think this is related to the newly created jira, or would you recommend any properties that I should try?

If it helps, I am pasting the stack traces of my map job failures when running the script with the combiner disabled. Thanks.

Regards
Syed Wasti

Error: java.lang.OutOfMemoryError: GC overhead limit exceeded
    at java.util.ArrayList.<init>(ArrayList.java:112)
    at org.apache.pig.data.DefaultTuple.<init>(DefaultTuple.java:60)
    at org.apache.pig.data.BinSedesTuple.<init>(BinSedesTuple.java:66)
    at org.apache.pig.data.BinSedesTupleFactory.newTuple(BinSedesTupleFactory.java:37)
    at org.apache.pig.data.BinInterSedes.readTuple(BinInterSedes.java:100)
    at org.apache.pig.data.BinInterSedes.readDatum(BinInterSedes.java:267)
    at org.apache.pig.data.BinInterSedes.readDatum(BinInterSedes.java:250)
    at org.apache.pig.data.BinInterSedes.addColsToTuple(BinInterSedes.java:568)
    at org.apache.pig.data.BinSedesTuple.readFields(BinSedesTuple.java:48)
    at org.apache.pig.impl.io.PigNullableWritable.readFields(PigNullableWritable.java:114)
    at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:67)
    at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:40)
    at org.apache.hadoop.mapreduce.ReduceContext.nextKeyValue(ReduceContext.java:116)
    at org.apache.hadoop.mapreduce.ReduceContext.nextKey(ReduceContext.java:92)
    at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:175)
    at org.apache.hadoop.mapred.Task$NewCombinerRunner.combine(Task.java:1222)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:1265)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.access$1800(MapTask.java:686)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer$SpillThread.run(MapTask.java:1173)

Error: java.lang.OutOfMemoryError: GC overhead limit exceeded
    at java.util.AbstractList.iterator(AbstractList.java:273)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.processInput(POUserFunc.java:148)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:203)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:262)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:343)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:291)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:276)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLocalRearrange.getNext(POLocalRearrange.java:259)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigCombiner$Combine.processOnePackageOutput(PigCombiner.java:184)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigCombiner$Combine.reduce(PigCombiner.java:162)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigCombiner$Combine.reduce(PigCombiner.java:51)
    at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:176)
    at org.apache.hadoop.mapred.Task$NewCombinerRunner.combine(Task.java:1222)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:1265)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.access$1800(MapTask.java:686)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer$SpillThread.run(MapTask.java:1173)
-
Re: Java heap error
Thejas M Nair 2010-07-29, 00:29
From the 2nd stack trace it looks like the combiner did not get disabled. You can verify that by looking at the MapReduce plan in the explain output.
It looks like for some reason the system property 'pig.exec.nocombiner' is not getting set to 'true'. Can you send the other pig script that errors out with "Error: GC overhead limit exceeded"?

-Thejas
-
RE: Java heap error
Syed Wasti 2010-07-29, 18:10
Hi Thejas,
It is from the same script which I shared earlier; I will paste it here again. The error I see is in the same map reduce job where it fails with OOME. I have a similar script where I call the MAX, MIN and SUM functions on the grouped data, and it fails with similar errors.

AA = LOAD 'table1' USING PigStorage('\t') as
    (ID,b,c,d,e,f,g,h,i,j,k,l,m,n,o);

AB = FOREACH AA GENERATE ID, e, f, n,o;

AC = FILTER AB BY o == 1;

AD = GROUP AC BY (ID, b);

AE = FOREACH AD { A = DISTINCT AC.d;
    GENERATE group.ID, (chararray) 'S' AS type, group.b, (int)
    COUNT_STAR(filt) AS cnt, (int) COUNT(A) AS cnt_distinct; }

The same steps are repeated to load 5 different tables and then a UNION is done on them.

Final_res = UNION AE, AF, AG, AH, AI;

The actual number of columns will be 15; here I am showing it with one table.

Final_table = FOREACH Final_res GENERATE ID,
    (type == 'S' AND b == 1?cnt:0) AS 12_tmp,
    (type == 'S' AND b == 2?cnt:0) AS 13_tmp,
    (type == 'S' AND b == 1?cnt_distinct:0) AS 12_distinct_tmp,
    (type == 'S' AND b == 2?cnt_distinct:0) AS 13_distinct_tmp;

It works fine until here; it is only after adding this last part of the query that it starts throwing heap errors.

grp_id = GROUP Final_table BY ID;

Final_data = FOREACH grp_reg_id GENERATE group AS ID
    SUM(Final_table.12_tmp), SUM(Final_table.13_tmp),
    SUM(Final_table.12_distinct_tmp), SUM(Final_table.13_distinct_tmp);

STORE Final_data;

Regards
Syed Wasti
-
Re: Java heap error
Thejas M Nair 2010-07-29, 19:38
Hi Syed,
Disabling the combiner in the pig query should get this working. As I mentioned, it looks like the combiner is being used in your query. You can confirm that by running explain on your query and checking the MR plan. For some reason the system property 'pig.exec.nocombiner' is not getting set to 'true' in pig. Could it be a typo in the cmdline argument you are adding to disable it (-Dpig.exec.nocombiner=true)?

-Thejas
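The typo theory is easy to picture with a small stand-alone sketch. It assumes only that the flag is looked up by its exact name with a default of "false"; how Pig itself reads the property is not shown in this thread, and the class NoCombinerFlagSketch and the method combinerDisabled are hypothetical names used for illustration.

```java
import java.util.Properties;

class NoCombinerFlagSketch {
    // Property name as quoted in the thread.
    static final String KEY = "pig.exec.nocombiner";

    // Returns true only when the property exists under the exact key
    // and its value is "true" (case-insensitive). A misspelled key on
    // the command line simply leaves this lookup at its default.
    static boolean combinerDisabled(Properties props) {
        return Boolean.parseBoolean(props.getProperty(KEY, "false"));
    }

    public static void main(String[] args) {
        // Run with: java -Dpig.exec.nocombiner=true NoCombinerFlagSketch
        System.out.println(combinerDisabled(System.getProperties()));
    }
}
```

A `-D` switch with a misspelled name is accepted silently by the JVM, so a wrong key gives no error and the lookup falls back to the default; checking the explain output, as suggested above, is the way to see whether the combiner is really off.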
|