Pig user mailing list: Memory issue with datafu StreamingQuantile in apache pig


Re: Memory issue with datafu StreamingQuantile in apache pig
Hi,

Looking at the stack trace, it looks like the task is failing
during initialization because it can't load the JobConf into memory. Pig
uses JobConf heavily: for example, it serializes the entire MR plan,
stores it in the JobConf, and passes it to the back-end. I don't see any
workaround other than either breaking the script into smaller ones or
increasing the heap size of the MR task processes.
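
For the second option, a sketch of what raising the heap might look like in
your script (the 6144m value is only an illustration, not a recommendation;
size it to what your nodes can actually spare):

```pig
-- Sketch: give the MR child tasks a larger heap than the 4 GB in the
-- original script. The exact -Xmx value depends on your cluster's memory.
SET mapred.child.java.opts '-Xmx6144m -Dfile.encoding=UTF8';
```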

I don't know what your script looks like, but I would try to break it down.
You might not have to call Quantile on every column in one script if you can
factor the other parts out into independent scripts. This is just a wild
guess, so please take it with a grain of salt.
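
As an illustration of the splitting idea, each smaller script could compute
quantiles for only a few columns (the file names, input path, and column
names below are hypothetical):

```pig
-- Sketch: one of several smaller scripts, each covering a few columns.
-- quantiles_part1.pig (hypothetical file name)
define Quantile1 datafu.pig.stats.StreamingQuantile('1000');

-- Load only the columns this script needs, to keep the plan small.
B = LOAD 'input_data' AS (c1, c2);
G = GROUP B ALL;
Quants = FOREACH G GENERATE Quantile1(B.c1) AS q1, Quantile1(B.c2) AS q2;
STORE Quants INTO 'quantiles_part1_out';
```

The other columns would then be handled by sibling scripts
(quantiles_part2.pig, etc.), each with its own, smaller serialized plan.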

You might also want to ask your question on the Datafu user group.

Thanks,
Cheolsoo

On Tue, Mar 5, 2013 at 2:45 PM, Zheng Ziyi <[EMAIL PROTECTED]> wrote:

> Hello,
>
>
> I have a pig script to compute 1000 quantiles of multiple columns. I run
> into an issue with java heap memory. Here is my pig script.
>
> SET mapred.child.java.opts ' -Xmx4096m -Dfile.encoding=UTF8
> -Djava.library.path=/apollo/env/TrafficAnalyticsHadoop/lib';
>
> define Quantile1 datafu.pig.stats.StreamingQuantile('1000');
>
> ....
>
> -- B has 50 columns
>
> G = GROUP B ALL; Quants = FOREACH G GENERATE Quantile1(B.$1) AS q1,
> Quantile1(B.$2) AS q2;
>
> ....
>
> The error:
>
> [main] ERROR org.apache.pig.tools.grunt.GruntParser - ERROR 2997: Unable to
> recreate exception from backed error: Error initializing
> attempt_201301282343_0526_m_000000_0: java.lang.OutOfMemoryError: Java heap space
>     at java.util.Arrays.copyOf(Arrays.java:2882)
>     at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:100)
>     at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:390)
>     at java.lang.StringBuffer.append(StringBuffer.java:224)
>     at com.sun.org.apache.xerces.internal.dom.DeferredDocumentImpl.getNodeValueString(DeferredDocumentImpl.java:1167)
>     at com.sun.org.apache.xerces.internal.dom.DeferredDocumentImpl.getNodeValueString(DeferredDocumentImpl.java:1120)
>     at com.sun.org.apache.xerces.internal.dom.DeferredTextImpl.synchronizeData(DeferredTextImpl.java:93)
>     at com.sun.org.apache.xerces.internal.dom.CharacterDataImpl.getData(CharacterDataImpl.java:160)
>     at org.apache.hadoop.conf.Configuration.loadResource(Configuration.java:1231)
>     at org.apache.hadoop.conf.Configuration.loadResources(Configuration.java:1129)
>     at org.apache.hadoop.conf.Configuration.getProps(Configuration.java:1063)
>     at org.apache.hadoop.conf.Configuration.get(Configuration.java:416)
>     at org.apache.hadoop.mapred.JobConf.checkAndWarnDeprecation(JobConf.java:1910)
>     at org.apache.hadoop.mapred.JobConf.<init>(JobConf.java:378)
>     at org.apache.hadoop.mapred.DefaultTaskController.initializeJob(DefaultTaskController.java:186)
>     at org.apache.hadoop.mapred.TaskTracker$4.run(TaskTracker.java:1226)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at javax.security.auth.Subject.doAs(Subject.java:396)
>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1093)
>     at org.apache.hadoop.mapred.TaskTracker.initializeJob(TaskTracker.java:1201)
>     at org.apache.hadoop.mapred.TaskTracker.localizeJob(TaskTracker.java:1116)
>     at org.apache.hadoop.mapred.TaskTracker$5.run(TaskTracker.java:2404)
>     at java.lang.Thread.run(Thread.java:662)
>
> It works fine if I change the code to
>
> Quants = FOREACH G GENERATE Quantile1(B.$1) AS q1;
>
> But it is very annoying to maintain multiple pig scripts for all 50
> columns. Is that the only way to do it? Is this the correct way to use
> StreamingQuantile on multiple columns? Do I really need more than 4G of
> memory?
>
> Thanks in advance!
> Ziyi
>