-
java.lang.OutOfMemoryError when using TOP udf
Ruslan Al-fakikh 2011-11-17, 14:13
Hey guys,

I am getting java.lang.OutOfMemoryError when using the TOP udf. It seems that the udf tries to process all the data in memory.

Is there a workaround for TOP? Or maybe there is some other way of getting the top results? I cannot use LIMIT, since I need 5% of the data, not a constant number of rows.

I am using:
Apache Pig version 0.8.1-cdh3u2 (rexported)

The stack trace is:

[2011-11-16 12:34:55] INFO (CodecPool.java:128) - Got brand-new decompressor
[2011-11-16 12:34:55] INFO (Merger.java:473) - Down to the last merge-pass, with 21 segments left of total size: 2057257173 bytes
[2011-11-16 12:34:55] INFO (SpillableMemoryManager.java:154) - first memory handler call- Usage threshold init = 175308800(171200K) used 373454552(364701K) committed = 524288000(512000K) max = 524288000(512000K)
[2011-11-16 12:36:22] INFO (SpillableMemoryManager.java:167) - first memory handler call - Collection threshold init = 175308800(171200K) used 496500704(484863K) committed = 524288000(512000K) max = 524288000(512000K)
[2011-11-16 12:37:28] INFO (TaskLogsTruncater.java:69) - Initializing logs' truncater with mapRetainSize=-1 and reduceRetainSize=-1
[2011-11-16 12:37:28] FATAL (Child.java:318) - Error running child : java.lang.OutOfMemoryError: Java heap space
    at java.util.Arrays.copyOfRange(Arrays.java:3209)
    at java.lang.String.<init>(String.java:215)
    at java.io.DataInputStream.readUTF(DataInputStream.java:644)
    at java.io.DataInputStream.readUTF(DataInputStream.java:547)
    at org.apache.pig.data.BinInterSedes.readCharArray(BinInterSedes.java:210)
    at org.apache.pig.data.BinInterSedes.readDatum(BinInterSedes.java:333)
    at org.apache.pig.data.BinInterSedes.readDatum(BinInterSedes.java:251)
    at org.apache.pig.data.BinInterSedes.addColsToTuple(BinInterSedes.java:555)
    at org.apache.pig.data.BinSedesTuple.readFields(BinSedesTuple.java:64)
    at org.apache.pig.data.InternalCachedBag$CachedBagIterator.hasNext(InternalCachedBag.java:237)
    at org.apache.pig.builtin.TOP.updateTop(TOP.java:139)
    at org.apache.pig.builtin.TOP.exec(TOP.java:116)
    at org.apache.pig.builtin.TOP.exec(TOP.java:65)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:245)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:287)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:338)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:290)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:276)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:240)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Reduce.runPipeline(PigMapReduce.java:434)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Reduce.processOnePackageOutput(PigMapReduce.java:402)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Reduce.reduce(PigMapReduce.java:382)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Reduce.reduce(PigMapReduce.java:251)
    at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:176)
    at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:572)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:414)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:270)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1127)
    at org.apache.hadoop.mapred.Child.main(Child.java:264)

stderr logs

Exception in thread "Low Memory Detector" java.lang.OutOfMemoryError: Java heap space
    at sun.management.MemoryNotifInfoCompositeData.getCompositeData(MemoryNotifInfoCompositeData.java:42)
    at sun.management.MemoryNotifInfoCompositeData.toCompositeData(MemoryNotifInfoCompositeData.java:36)
    at sun.management.MemoryImpl.createNotification(MemoryImpl.java:168)
    at sun.management.MemoryPoolImpl$CollectionSensor.triggerAction(MemoryPoolImpl.java:300)
    at sun.management.Sensor.trigger(Sensor.java:120)

Thanks in advance!
-
Re: java.lang.OutOfMemoryError when using TOP udf
Dmitriy Ryaboy 2011-11-17, 16:43
The TOP udf does not try to process all data in memory if the algebraic optimization can be applied. It does, of course, need to keep the top n values in memory. Can you confirm that algebraic mode is being used?
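
For readers unfamiliar with the term, the algebraic mode mentioned above can be pictured with a small Python sketch. This is only an illustration of the idea, not Pig's Java implementation: the function names `partial_top` and `merge_tops` and the sample data are made up. Each chunk of input is reduced to its own top n values first, and only those small partial results are merged, so no single step has to hold the whole input.

```python
import heapq

def partial_top(chunk, n):
    """Map/combine side (hypothetical name): keep only the n largest values of one chunk."""
    return heapq.nlargest(n, chunk)

def merge_tops(partials, n):
    """Reduce side (hypothetical name): merge partial results, again keeping only n values."""
    merged = []
    for part in partials:
        merged.extend(part)
    return heapq.nlargest(n, merged)

# Made-up input, split into three chunks as if handled by three map tasks.
chunks = [[5, 1, 9], [7, 3, 8], [2, 6, 4]]
n = 2
partials = [partial_top(chunk, n) for chunk in chunks]
top = merge_tops(partials, n)
print(top)
```

The merge step only ever sees at most n values per chunk, which is why the memory needed depends on n rather than on the size of the input.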
-
Re: java.lang.OutOfMemoryError when using TOP udf
pablomar 2011-11-17, 17:59
According to the stack trace, the algebraic mode is not being used. It says:

updateTop(TOP.java:139)
exec(TOP.java:116)
-
Re: java.lang.OutOfMemoryError when using TOP udf
Dmitriy Ryaboy 2011-11-17, 20:07
Ok, so it's something in the rest of the script that's causing this to happen. Ruslan, if you send your script, I can probably figure out why (usually, it's using another, non-algebraic udf in your foreach, or for Pig 0.8, generating a constant in the foreach).

D
-
Re: java.lang.OutOfMemoryError when using TOP udf
Ruslan Al-Fakikh 2011-11-21, 14:11
Hey Dmitriy,

I attached the script. It is not a plain Pig script, because I do some preprocessing before submitting it to the cluster, but the general idea of what I submit should be clear.

Thanks in advance!

Best Regards,
Ruslan Al-Fakikh
-
Re: java.lang.OutOfMemoryError when using TOP udf
Dmitriy Ryaboy 2011-11-21, 16:32
Ruslan, I think the mailing list is set to reject attachments -- can you post it as a GitHub gist or something similar, and send a link?

D
-
RE: java.lang.OutOfMemoryError when using TOP udf
Ruslan Al-fakikh 2011-11-21, 17:10
Ok. Here it is:
https://gist.github.com/1383266
-
Re: java.lang.OutOfMemoryError when using TOP udf
Jonathan Coveney 2011-11-21, 18:22
Internally, TOP uses a priority queue. It tries to be smart about pulling off excess elements, but if you ask it for enough elements, it can blow up, because the priority queue is going to hold n elements, where n is the ranking you want. This is consistent with the stack trace, which died in updateTop, which is where elements are added to the priority queue.

Ruslan, how large are the limits you're setting? I.e., (int)(count * (double) ($THIRD_LEVELS_PERCENTAGE + $BOTS_PERCENTAGE) )

As far as TOP's implementation goes, I imagine you could get around the issue by using a sorted data bag, but that might be much slower. Hmm.
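
The priority-queue behaviour described in this message can be sketched in Python as follows. This is an assumption-laden illustration, not the code of Pig's TOP udf: `top_n` and `limit_for` are hypothetical names, the input is made up, and the 0.05 fraction simply mirrors the "5% of the data" mentioned at the start of the thread. The point it shows is that the queue never holds more than n values, so memory use grows with n itself.

```python
import heapq

def top_n(values, n):
    """Return the n largest values while holding at most n of them at a time."""
    if n <= 0:
        return []
    heap = []  # min-heap: heap[0] is the smallest value kept so far
    for v in values:
        if len(heap) < n:
            heapq.heappush(heap, v)
        elif v > heap[0]:
            # Excess element: drop the current smallest and keep v instead.
            heapq.heapreplace(heap, v)
    return sorted(heap, reverse=True)

def limit_for(count, fraction):
    """Hypothetical helper: number of rows to keep for a fraction of the input."""
    return int(count * fraction)

rows = list(range(1, 101))      # 100 made-up values
n = limit_for(len(rows), 0.05)  # 5% of 100 rows
result = top_n(rows, n)
print(n, result)
```

When n is a percentage of a very large row count, n itself becomes large, and a queue of n full tuples may no longer fit in the task's heap.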
-
Re: java.lang.OutOfMemoryError when using TOP udfpablomar 2011-11-21, 20:53
I might be wrong, but it seems the error comes from
while(itr.hasNext())
not from the add to the queue,
so I don't think it is related to the number of elements in the queue
... maybe the field length?
-
Re: java.lang.OutOfMemoryError when using TOP udfJonathan Coveney 2011-11-21, 21:53
You're right pablomar...hmm

Ruslan: are you running this in MR mode on a cluster, or locally?

I'm noticing this:
[2011-11-16 12:34:55] INFO (SpillableMemoryManager.java:154) - first memory handler call- Usage threshold init = 175308800(171200K) used 373454552(364701K) committed = 524288000(512000K) max = 524288000(512000K)

It looks like your max memory is 512MB. I've had issues with bag spilling with less than 1GB allocated (-Xmx1024m).
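As a quick sanity check on the heap figure in that log line, the byte counts can be converted to MiB. This is only an illustrative sketch: the class and method names (HeapFigures, toMiB) are made up for the example and are not part of Pig or Hadoop, and the two constants are copied from the SpillableMemoryManager line quoted in this message.

```java
// Illustrative helper; HeapFigures and toMiB are hypothetical names, not Pig or Hadoop APIs.
class HeapFigures {
    // Converts a byte count to whole mebibytes (1 MiB = 1024 * 1024 bytes).
    static long toMiB(long bytes) {
        return bytes / (1024L * 1024L);
    }

    public static void main(String[] args) {
        long maxFromLog = 524288000L;   // "max = 524288000(512000K)" in the log line
        long usedFromLog = 373454552L;  // "used 373454552(364701K)" in the log line
        System.out.println("max heap from log: " + toMiB(maxFromLog) + " MiB");
        System.out.println("used at first handler call: " + toMiB(usedFromLog) + " MiB");
        // What the JVM running this snippet reports for its own maximum heap.
        System.out.println("this JVM: " + toMiB(Runtime.getRuntime().maxMemory()) + " MiB");
    }
}
```

With the numbers from the log, the maximum comes out at 500 MiB (512000K), so the task JVM was running slightly under 512MB and well under the 1GB mentioned above.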
-
Re: java.lang.OutOfMemoryError when using TOP udfDmitriy Ryaboy 2011-11-21, 22:20
Ok so this:

thirdLevelsTopVisitorsWithBots = FOREACH thirdLevelsByCategory {
    count = COUNT(thirdLevelsSummed);
    result = TOP( (int)(count * (double) ($THIRD_LEVELS_PERCENTAGE + $BOTS_PERCENTAGE) ), 3, thirdLevelsSummed);
    GENERATE FLATTEN(result);
}

requires "count" to be calculated before TOP can be applied. Since count can't be calculated until the reduce side, naturally, TOP can't start working on the map side (as it doesn't know its arguments yet).

Try generating the counts * ($TLP + $BP) separately, joining them in (I am guessing you have no more than a few K categories -- in that case, you can do a replicated join), and then do the group and TOP on the result.
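The point about TOP needing its arguments up front can be sketched in plain Java. This is not Pig's TOP implementation. It only assumes that, when n is known in advance, each input split can be reduced to its own top n and the partial results merged afterwards, which is the general shape a map-side (algebraic) evaluation relies on. All names here (PartialTopN, topN, merge) are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical illustration of partial top-n aggregation; not Pig code.
class PartialTopN {
    // Keeps the n largest values using a min-heap that never holds more than n elements.
    static List<Long> topN(Iterable<Long> values, int n) {
        List<Long> result = new ArrayList<>();
        if (n <= 0) {
            return result;
        }
        PriorityQueue<Long> heap = new PriorityQueue<>();
        for (long v : values) {
            if (heap.size() < n) {
                heap.add(v);
            } else if (v > heap.peek()) {
                heap.poll();   // drop the current smallest of the kept values
                heap.add(v);
            }
        }
        result.addAll(heap);
        Collections.sort(result, Collections.reverseOrder());
        return result;
    }

    // Merges per-split partial results; this works only because every split used the same n.
    static List<Long> merge(List<List<Long>> partials, int n) {
        List<Long> all = new ArrayList<>();
        for (List<Long> partial : partials) {
            all.addAll(partial);
        }
        return topN(all, n);
    }

    public static void main(String[] args) {
        List<Long> splitA = Arrays.asList(5L, 1L, 9L);
        List<Long> splitB = Arrays.asList(7L, 3L, 8L);
        int n = 2;  // known before any data is read
        List<List<Long>> partials = new ArrayList<>();
        partials.add(topN(splitA, n));
        partials.add(topN(splitB, n));
        System.out.println(merge(partials, n));
    }
}
```

When n depends on the COUNT of the whole group, as in the script above, no split can choose its own n, so every tuple has to reach the reducer before any filtering can start.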
-
RE: java.lang.OutOfMemoryError when using TOP udfRuslan Al-fakikh 2011-11-22, 15:08
Jonathan,

I am running it on the Prod cluster in MR mode, not locally. I started to see the issue when the input size grew. A few days ago I found a workaround by setting this property:
mapred.child.java.opts=-Xmx1024m
But I think this is a temporary solution and the job will fail when the input size grows again.

Dmitriy,

Thanks a lot for the investigation. I'll try it.
-
Re: java.lang.OutOfMemoryError when using TOP udfpablomar 2011-11-23, 03:10
just a guess .. could it be possible that the Bag is kept in memory instead of being spilled to disk?
Browsing the code of InternalCachedBag, I saw:

    private void init(int bagCount, float percent) {
        factory = TupleFactory.getInstance();
        mContents = new ArrayList<Tuple>();

        long max = Runtime.getRuntime().maxMemory();
        maxMemUsage = (long)(((float)max * percent) / (float)bagCount);
        cacheLimit = Integer.MAX_VALUE;

        // set limit to 0, if memusage is 0 or really really small.
        // then all tuples are put into disk
        if (maxMemUsage < 1) {
            cacheLimit = 0;
        }

        addDone = false;
    }

My guess is that cacheLimit was set to Integer.MAX_VALUE and it's trying to keep everything in memory when memory is not big enough, but not so small that cacheLimit gets reset to 0.
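The arithmetic in that init() method can be replayed on its own. The sketch below is a standalone restatement of the two expressions from the quoted snippet, not the Pig class itself. The heap size is the 524288000-byte maximum from the log earlier in the thread, while the percent and bagCount values are assumptions chosen only for illustration (the thread does not say what the job used). The names CacheLimitSketch, maxMemUsage and cacheLimit are hypothetical.

```java
// Standalone restatement of the arithmetic in the quoted init(); not the Pig class itself.
class CacheLimitSketch {
    // Same expression as in the quoted code: heap * percent / bagCount, truncated to long.
    static long maxMemUsage(long maxHeapBytes, float percent, int bagCount) {
        return (long) (((float) maxHeapBytes * percent) / (float) bagCount);
    }

    // Same rule as in the quoted code: the limit drops to 0 only when the budget is below 1 byte.
    static int cacheLimit(long maxMemUsage) {
        return maxMemUsage < 1 ? 0 : Integer.MAX_VALUE;
    }

    public static void main(String[] args) {
        long heap = 524288000L;  // max heap reported in the log earlier in the thread
        float percent = 0.1f;    // assumed fraction, for illustration only
        int bagCount = 1;        // assumed number of bags sharing the budget
        long usage = maxMemUsage(heap, percent, bagCount);
        System.out.println("maxMemUsage = " + usage + " bytes, cacheLimit = " + cacheLimit(usage));
    }
}
```

For any realistic heap size the budget stays far above 1 byte, so cacheLimit remains Integer.MAX_VALUE, which matches the guess above. Whether maxMemUsage is consulted elsewhere in the class to decide when to spill is not visible in the quoted snippet.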
-
Re: java.lang.OutOfMemoryError when using TOP udfJonathan Coveney 2011-11-23, 07:45
I have seen issues with spilling when the JVM had less than 1GB of heap. Once I allocated enough RAM, no issues. It seems unlikely to me that the bag implementation fails on this because it's such a common use and nobody has reported an error, and running with less than 1GB of heap is definitely not recommended. Very curious if the error crops up again.
-
RE: java.lang.OutOfMemoryError when using TOP udfRuslan Al-fakikh 2011-11-24, 11:55
Hm. Interesting. Yeah, I really haven't seen the error after setting mapred.child.java.opts=-Xmx1024m.
Probably I won't have to fix the Pig script:)
-
RE: java.lang.OutOfMemoryError when using TOP udfRuslan Al-fakikh 2011-12-15, 14:57
Hey guys,
Another problem appeared after setting
mapred.child.java.opts=-Xmx1024m
Guys, do you have any idea?

The job started to fail with:

[2011-12-11 05:05:25] ERROR (LogUtils.java:173) - Backend error message
Error: java.lang.OutOfMemoryError
        at java.net.Inet6AddressImpl.lookupAllHostAddr(Native Method)
        at java.net.InetAddress$1.lookupAllHostAddr(InetAddress.java:850)
        at java.net.InetAddress.getAddressFromNameService(InetAddress.java:1201)
        at java.net.InetAddress.getAllByName0(InetAddress.java:1154)
        at java.net.InetAddress.getAllByName(InetAddress.java:1084)
        ...

And sometimes with this:

java.lang.OutOfMemoryError
        at java.net.Inet4AddressImpl.lookupAllHostAddr(Native Method)
        at java.net.InetAddress$1.lookupAllHostAddr(InetAddress.java:850)
        at java.net.InetAddress.getAddressFromNameService(InetAddress.java:1201)

Also sometimes the stack trace is:

[2011-11-29 05:10:00] ERROR (LogUtils.java:173) - Backend error message
Error: java.lang.NoClassDefFoundError: java/net/SocketOutputStream
        at java.net.PlainSocketImpl.getOutputStream(PlainSocketImpl.java:426)
        at java.net.Socket$3.run(Socket.java:839)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.net.Socket.getOutputStream(Socket.java:836)
        at sun.net.www.http.HttpClient.openServer(HttpClient.java:396)
        at sun.net.www.http.HttpClient.openServer(HttpClient.java:529)
        at sun.net.www.http.HttpClient.<init>(HttpClient.java:233)
        at sun.net.www.http.HttpClient.New(HttpClient.java:306)
        at sun.net.www.http.HttpClient.New(HttpClient.java:323)
        at sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:860)
        at sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:801)
        at sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:726)
        at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getInputStream(ReduceTask.java:1541)
        at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$
[2011-11-29 05:10:00] ERROR (PigStats.java:673) - ERROR 2997: Unable to recreate exception from backed error: Error: java.lang.NoClassDefFoundError: java/net/SocketOutputStream
[2011-11-29 05:10:00] ERROR (PigStatsUtil.java:181) - 1 map reduce job(s) failed!

And sometimes the message is:

java.lang.Throwable: Child Error
        at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:242)
Caused by: java.io.IOException: Task process exit with nonzero status of 134.
        at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:229)

and if I go deeper into this message:

[2011-11-30 04:49:27] INFO (ReduceTask.java:2260) - Interleaved on-disk merge complete: 0 files left.
[2011-11-30 04:49:27] INFO (ReduceTask.java:2265) - In-memory merge complete: 30 files left.
[2011-11-30 04:49:27] INFO (Merger.java:390) - Merging 30 sorted segments
[2011-11-30 04:49:27] INFO (Merger.java:473) - Down to the last merge-pass, with 1 segments left of total size: 13499327 bytes
[2011-11-30 04:49:27] INFO (CodecPool.java:103) - Got brand-new compressor
[2011-11-30 04:49:27] INFO (ReduceTask.java:2386) - Merged 30 segments, 13499385 bytes to disk to satisfy reduce memory limit
[2011-11-30 04:49:27] INFO (ReduceTask.java:2406) - Merging 1 files, 4225815 bytes from disk
[2011-11-30 04:49:27] INFO (ReduceTask.java:2420) - Merging 0 segments, 0 bytes from memory into reduce
[2011-11-30 04:49:27] INFO (Merger.java:390) - Merging 1 sorted segments
[2011-11-30 04:49:27] INFO (Merger.java:473) - Down to the last merge-pass, with 1 segments left of total size: 4225811 bytes
#
# A fatal error has been detected by the Java Runtime Environment:
#
# java.lang.OutOfMemoryError: requested 35632 bytes for Chunk::new. Out of swap space?
#
# Internal Error (allocation.cpp:215), pid=7290, tid=1099024704
# Error: Chunk::new
#
# JRE version: 6.0_20-b02
# Java VM: Java HotSpot(TM) 64-Bit Server VM (16.3-b01 mixed mode linux-amd64 )
# An error report file with more information is saved as:
# /hadoop1/mapred/local/taskTracker/hdfs/jobcache/job_201111300833_1325/attempt_201111300833_1325_r_000021_0/work/hs_err_pid7290.log
#
# If you would like to submit a bug report, please visit:
# http://java.sun.com/webapps/bugreport/crash.jsp

And I saw this:

[2011-11-30 04:49:28] INFO (ReduceTask.java:2150) - Task attempt_201111300833_1325_r_000011_0: Failed fetch #1 from attempt_201111300833_1325_m_000001_0
[2011-11-30 04:49:28] FATAL (Task.java:280) - attempt_201111300833_1325_r_000011_0 : Map output copy failure : java.lang.OutOfMemoryError
        at java.net.Inet4AddressImpl.lookupAllHostAddr(Native Method)
        at java.net.InetAddress$1.lookupAllHostAddr(InetAddress.java:850)
        at java.net.InetAddress.getAddressFromNameService(InetAddress.java:1201)
-
RE: java.lang.OutOfMemoryError when using TOP udfRuslan Al-fakikh 2011-12-16, 13:32
Dmitriy,
You wrote:

> Ok so this:
>
> thirdLevelsTopVisitorsWithBots = FOREACH thirdLevelsByCategory {
>     count = COUNT(thirdLevelsSummed);
>     result = TOP( (int)(count * (double) ($THIRD_LEVELS_PERCENTAGE + $BOTS_PERCENTAGE) ), 3, thirdLevelsSummed);
>     GENERATE FLATTEN(result);
> }
>
> requires "count" to be calculated before TOP can be applied. Since count can't be calculated until the reduce side, naturally, TOP can't start working on the map side (as it doesn't know its arguments yet). Try generating the counts * ($TLP + $BP) separately, joining them in (I am guessing you have no more than a few K categories -- in that case, you can do a replicated join), and then do group and TOP on.

Probably I didn't understand your logic correctly. What I did is change this:

thirdLevelsTopVisitorsWithBots = FOREACH thirdLevelsByCategory {
    count = COUNT(thirdLevelsSummed);
    result = TOP( (int)(count * (double) ($THIRD_LEVELS_PERCENTAGE + $BOTS_PERCENTAGE) ), 3, thirdLevelsSummed);
    GENERATE FLATTEN(result);
}

to this:

thirdLevelsTopNumberCounted = FOREACH thirdLevelsByCategory GENERATE
    group,
    thirdLevelsSummed,
    (int)( COUNT(thirdLevelsSummed) * (double) ($THIRD_LEVELS_PERCENTAGE + $BOTS_PERCENTAGE) ) AS TopNumber;

thirdLevelsTopVisitorsWithBots = FOREACH thirdLevelsTopNumberCounted GENERATE
    FLATTEN(TOP(TopNumber, 3, thirdLevelsSummed));

So I removed the COUNT from the nested block. It didn't help. Probably you meant the JOIN ... USING 'replicated' statement, but I didn't get how I can apply it here.

Thanks
-
Re: java.lang.OutOfMemoryError when using TOP udfDmitriy Ryaboy 2011-12-16, 20:15
I meant the latter, an actual join statement. So, generate the counts,
join them to the original relation, then group again and do TOP.

D
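The count, join, group, TOP sequence suggested above can be sketched outside Pig. What follows is a minimal Python illustration of the data flow only, not Pig code and not the script from this thread; the row layout `(category, visitor, visits)`, the function name, and the sample data are all hypothetical.

```python
import heapq
from collections import Counter, defaultdict


def top_fraction_per_category(rows, fraction):
    """Return the top `fraction` of rows in each category, ranked by visits.

    rows: list of (category, visitor, visits) tuples (hypothetical layout).
    """
    # Step 1, "generate the counts": one small number N per category.
    counts = Counter(category for category, _, _ in rows)
    top_n = {category: int(n * fraction) for category, n in counts.items()}

    # Step 2, "join, then group again": every row is matched with its
    # category's N, which is now known before the group is scanned.
    # For brevity this sketch still collects each group in a list.
    groups = defaultdict(list)
    for category, visitor, visits in rows:
        groups[category].append((visits, visitor))

    # Step 3, "do TOP": keep the N largest members of each group.
    return {
        category: heapq.nlargest(top_n[category], members)
        for category, members in groups.items()
    }


rows = [
    ("news", "a", 5), ("news", "b", 9), ("news", "c", 1), ("news", "d", 7),
    ("shop", "e", 2), ("shop", "f", 8),
]
result = top_fraction_per_category(rows, 0.5)
print(result)
```

The point of the ordering is that N no longer depends on a value computed inside the same group pass, which is what prevented TOP from starting early in the original script.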
-
RE: java.lang.OutOfMemoryError when using TOP udfRuslan Al-fakikh 2011-12-22, 01:37
Hey guys
I did it according to the advice and moved the TOP execution to the map phase, and now I am getting the same error, but this time it comes from the map phase. Any help much appreciated!

Here is my current code:
https://gist.github.com/1508511

Error stack trace:

[2011-12-21 08:17:46] FATAL (Child.java:318) - Error running child : java.lang.OutOfMemoryError: Java heap space
    at java.io.DataInputStream.readUTF(DataInputStream.java:644)
    at java.io.DataInputStream.readUTF(DataInputStream.java:547)
    at org.apache.pig.data.BinInterSedes.readCharArray(BinInterSedes.java:210)
    at org.apache.pig.data.BinInterSedes.readDatum(BinInterSedes.java:333)
    at org.apache.pig.data.BinInterSedes.readDatum(BinInterSedes.java:251)
    at org.apache.pig.data.BinInterSedes.addColsToTuple(BinInterSedes.java:555)
    at org.apache.pig.data.BinSedesTuple.readFields(BinSedesTuple.java:64)
    at org.apache.pig.data.DefaultDataBag$DefaultDataBagIterator.readFromFile(DefaultDataBag.java:244)
    at org.apache.pig.data.DefaultDataBag$DefaultDataBagIterator.next(DefaultDataBag.java:231)
    at org.apache.pig.data.DefaultDataBag$DefaultDataBagIterator.hasNext(DefaultDataBag.java:157)
    at org.apache.pig.builtin.TOP.updateTop(TOP.java:139)
    at org.apache.pig.builtin.TOP.exec(TOP.java:116)
    at org.apache.pig.builtin.TOP.exec(TOP.java:65)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:245)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:287)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:338)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:290)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:276)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:240)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:276)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLocalRearrange.getNext(POLocalRearrange.java:256)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:237)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:232)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:53)
    at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:647)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:323)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:270)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1127)
    at org.apache.hadoop.mapred.Child.main(Child.java:264)
-
RE: java.lang.OutOfMemoryError when using TOP udfRuslan Al-fakikh 2011-12-27, 15:48
Actually I fixed it. I had to use an additional grouping to make it really Algebraic. But now I see OutOfMemory during Map merge:
[2011-12-27 08:44:07] FATAL (Child.java:318) - Error running child : java.lang.OutOfMemoryError: Java heap space
    at org.apache.hadoop.mapred.IFile$Reader.readNextBlock(IFile.java:355)
    at org.apache.hadoop.mapred.IFile$Reader.next(IFile.java:417)
    at org.apache.hadoop.mapred.Merger$Segment.next(Merger.java:220)
    at org.apache.hadoop.mapred.Merger$MergeQueue.merge(Merger.java:420)
    at org.apache.hadoop.mapred.Merger$MergeQueue.merge(Merger.java:381)
    at org.apache.hadoop.mapred.Merger.merge(Merger.java:77)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.mergeParts(MapTask.java:1548)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:1180)
    at org.apache.hadoop.mapred.MapTask$NewOutputCollector.close(MapTask.java:582)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:649)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:323)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:270)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1127)
    at org.apache.hadoop.mapred.Child.main(Child.java:264)

Can anyone help?

Thanks in advance!
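For readers unfamiliar with what "Algebraic" buys here: an algebraic function can be computed on pieces of the input and the partial results merged later, so map tasks and combiners can shrink the data before the reducer sees it. Below is a minimal Python sketch of that idea for top-k. The function names echo the three stages of Pig's Algebraic interface (initial, intermediate, final), but this is an illustration under that assumption, not the source of the TOP UDF, and the sample values are made up.

```python
import heapq


def initial(chunk, k):
    """Map side: reduce one chunk of values to its k largest."""
    return heapq.nlargest(k, chunk)


def intermediate(partials, k):
    """Combiner: merge several partial results, keeping only k values."""
    return heapq.nlargest(k, (v for partial in partials for v in partial))


def final(partials, k):
    """Reduce side: the same merge, producing the final answer."""
    return intermediate(partials, k)


values = [4, 17, 8, 15, 23, 42, 16, 1, 9]
chunks = [values[0:3], values[3:6], values[6:9]]
k = 3

# Each chunk is reduced independently, then the partials are merged in stages.
partials = [initial(chunk, k) for chunk in chunks]
merged = final([intermediate(partials[:2], k), partials[2]], k)
print(merged)
```

Note that every partial result can still hold up to k values, so when k is very large the partials themselves remain large; splitting the work does not remove that cost.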
-
Re: java.lang.OutOfMemoryError when using TOP udfJonathan Coveney 2011-12-28, 19:18
How large is TopNumber? I imagine that if your TopNumber is large enough,
the UDF could still fail if the TopNumber # of values can't fit in the priority queue it puts together. Although in that final merge it could be smarter about it... will have to check the code when I get a chance to see if they are.
-
RE: java.lang.OutOfMemoryError when using TOP udfRuslan Al-fakikh 2011-12-28, 22:21
The TopNumber is about 100 000
-
RE: java.lang.OutOfMemoryError when using TOP udfRuslan Al-fakikh 2012-01-06, 03:14
According to my calculations the biggest TOP number is 2380324
Could that be the reason for the failures in the map tasks?
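A rough back-of-the-envelope check helps with this question. The sketch below multiplies the reported TOP number by a per-tuple size and compares the result with the 1024 MB heap set earlier in the thread via mapred.child.java.opts. The per-tuple sizes are pure assumptions chosen for illustration; the real footprint depends on the number and types of columns and on JVM object overhead, neither of which is given in the thread.

```python
def estimated_heap_mb(tuple_count, bytes_per_tuple):
    """Rough size, in MiB, of holding tuple_count tuples in memory at once."""
    return tuple_count * bytes_per_tuple / (1024 * 1024)


TOP_NUMBER = 2380324  # largest TOP argument reported in this thread
HEAP_MB = 1024        # from mapred.child.java.opts=-Xmx1024m

# Assumed per-tuple footprints in bytes (hypothetical values).
for bytes_per_tuple in (100, 250, 500):
    needed = estimated_heap_mb(TOP_NUMBER, bytes_per_tuple)
    print(f"{bytes_per_tuple} B/tuple -> {needed:.0f} MiB of {HEAP_MB} MiB heap")
```

Under these assumptions a few hundred bytes per tuple is already enough to approach or exceed the whole heap, before counting anything else the task needs.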
-
Re: java.lang.OutOfMemoryError when using TOP udfJonathan Coveney 2012-01-06, 04:10
Ruslan, I took a look and it is being reasonable. I do think that that is the issue: the way it works is by holding a priority queue of however many items you care about, adding one, then popping the bottom one. If it has to hold almost 3M objects in memory, memory issues are a real likelihood. A couple of things you can do:

- have fewer columns, i.e. only do "TOP" on the things you really care about
- more memory (don't you love that?)

Others may have other suggestions.
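The "hold a priority queue, add one, pop the bottom one" behaviour described in this reply can be sketched as a bounded min-heap. This is a minimal Python illustration of the general technique, not the actual TOP.java implementation; the function name and sample data are hypothetical.

```python
import heapq


def top_n(values, n):
    """Return the n largest values, holding at most n of them at any time."""
    if n <= 0:
        return []
    heap = []  # min-heap: heap[0] is the smallest value currently kept
    for value in values:
        if len(heap) < n:
            heapq.heappush(heap, value)
        else:
            # Add the new value, then pop the smallest ("the bottom one").
            heapq.heappushpop(heap, value)
    return sorted(heap, reverse=True)


print(top_n([5, 1, 9, 3, 7, 8], 3))
```

The input is streamed, but the heap itself always grows to n entries, so memory use scales with n rather than with the input size. That is why a TOP number in the millions is expensive however the rest of the job is arranged.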