Building a databag in a UDF requiring tons of memory?
We're admittedly on an older version of Pig (0.8.0-cdh3u0), but we're building a databag in our UDF and getting OOM exceptions even with 6 GB of heap. Specifically, we're marshaling data prior to writing it to Cassandra using our ToCassandraBag UDF; one of its inputs is a databag, and that bag has about 1.6 million entries to begin with. The line it's OOMing on is:
https://github.com/jeromatron/pygmalion/blob/master/udf/src/main/java/org/pygmalion/udf/ToCassandraBag.java#L69
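
For anyone reading without the source handy, the shape of that UDF is roughly the following (a stripped-down sketch, not the real pygmalion code; the Pig API calls are the ones from the stack trace below, but the class and field names here are made up):

    import java.io.IOException;

    import org.apache.pig.EvalFunc;
    import org.apache.pig.data.BagFactory;
    import org.apache.pig.data.DataBag;
    import org.apache.pig.data.Tuple;

    public class BagCopyingUdf extends EvalFunc<DataBag> {
        private static final BagFactory BAG_FACTORY = BagFactory.getInstance();

        @Override
        public DataBag exec(Tuple input) throws IOException {
            // One of the input fields is a databag (~1.6 million tuples in our case).
            DataBag inputBag = (DataBag) input.get(0);
            DataBag outputBag = BAG_FACTORY.newDefaultBag();
            // This is the addAll() that shows up in the trace:
            // DefaultAbstractBag.addAll -> DefaultDataBagIterator -> readUTF
            outputBag.addAll(inputBag);
            return outputBag;
        }
    }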
The error we're getting is:
FATAL apache.hadoop.mapred.Child - Error running child : java.lang.OutOfMemoryError: Java heap space
    at java.util.Arrays.copyOfRange(Arrays.java:3209)
    at java.lang.String.<init>(String.java:215)
    at java.io.DataInputStream.readUTF(DataInputStream.java:644)
    at java.io.DataInputStream.readUTF(DataInputStream.java:547)
    at org.apache.pig.data.BinInterSedes.readCharArray(BinInterSedes.java:210)
    at org.apache.pig.data.BinInterSedes.readDatum(BinInterSedes.java:333)
    at org.apache.pig.data.BinInterSedes.readDatum(BinInterSedes.java:251)
    at org.apache.pig.data.BinInterSedes.addColsToTuple(BinInterSedes.java:555)
    at org.apache.pig.data.BinSedesTuple.readFields(BinSedesTuple.java:64)
    at org.apache.pig.data.DefaultDataBag$DefaultDataBagIterator.readFromFile(DefaultDataBag.java:244)
    at org.apache.pig.data.DefaultDataBag$DefaultDataBagIterator.next(DefaultDataBag.java:231)
    at org.apache.pig.data.DefaultDataBag$DefaultDataBagIterator.hasNext(DefaultDataBag.java:157)
    at org.apache.pig.data.DefaultAbstractBag.addAll(DefaultAbstractBag.java:96)
    at org.pygmalion.udf.ToCassandraBag.exec(ToCassandraBag.java:69)
    at org.pygmalion.udf.ToCassandraBag.exec(ToCassandraBag.java:30)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:229)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:273)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:343)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:291)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:236)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:231)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:53)
    at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:646)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:322)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:268)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1115)
    at org.apache.hadoop.mapred.Child.main(Child.java:262)

When we give that task 12 GB, the error goes away, but I'm having a hard time seeing 1) why it would require that much memory in the first place, and 2) why it wouldn't spill to disk before getting that far.
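For reference, the only change that makes the job pass is the bigger task heap, set roughly like this in the script (a sketch; I'm assuming your Pig build supports SET for arbitrary Hadoop job properties, otherwise the same property can go in mapred-site.xml or on the pig command line):

    -- workaround only: 12 GB heap per task, which seems excessive for a ~1.6M-entry bag
    set mapred.child.java.opts '-Xmx12g';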

Any thoughts on this? Is it some kind of memory problem with that version of Pig, or something we're doing wrong?