Re: java.lang.OutOfMemoryError when using TOP udf
Just a guess ... could it be that the Bag is kept in memory instead
of being spilled to disk?
Browsing the code of InternalCachedBag, I saw:
private void init(int bagCount, float percent) {
    factory = TupleFactory.getInstance();
    mContents = new ArrayList<Tuple>();

    long max = Runtime.getRuntime().maxMemory();
    maxMemUsage = (long)(((float)max * percent) / (float)bagCount);
    cacheLimit = Integer.MAX_VALUE;

    // set limit to 0, if memusage is 0 or really really small.
    // then all tuples are put into disk
    if (maxMemUsage < 1) {
        cacheLimit = 0;
    }

    addDone = false;
}

My guess is that cacheLimit was set to Integer.MAX_VALUE and the bag is
trying to keep everything in memory: maxMemUsage is too small to hold all
the tuples, but not small enough (< 1) to have cacheLimit reset to 0.
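
To make the guess concrete, here is a minimal sketch of the init() arithmetic
with assumed example values (the 512 MB heap matches the log quoted further
down; percent and bagCount are hypothetical, not taken from this job):

    // Sketch of the init() arithmetic above with assumed values; not Pig source.
    public class CacheLimitMath {
        public static void main(String[] args) {
            long max = 512L * 1024 * 1024; // maxMemory() on a -Xmx512m JVM (assumed)
            float percent = 0.1f;          // hypothetical per-bag memory fraction
            int bagCount = 20;             // hypothetical number of live bags

            long maxMemUsage = (long) (((float) max * percent) / (float) bagCount);
            System.out.println(maxMemUsage); // ~2.6 MB: >= 1, so cacheLimit stays
                                             // Integer.MAX_VALUE and tuples are not
                                             // forced straight to disk
        }
    }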
On Tue, Nov 22, 2011 at 10:08 AM, Ruslan Al-fakikh <
[EMAIL PROTECTED]> wrote:

> Jonathan,
>
> I am running it on the prod cluster in MR mode, not locally. I started to see
> the issue when the input size grew. A few days ago I found a workaround of
> setting this property:
> mapred.child.java.opts=-Xmx1024m
> But I think this is a temporary solution and the job will fail when the
> input size grows again.
>
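
For reference, that property can also be applied per script with Pig's set
command instead of cluster-wide; a minimal sketch, reusing the value from the
message above:

    -- set the child JVM heap for this job only (value from the thread)
    set mapred.child.java.opts '-Xmx1024m';
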
> Dmitriy,
>
> Thanks a lot for the investigation. I'll try it.
>
> -----Original Message-----
> From: Dmitriy Ryaboy [mailto:[EMAIL PROTECTED]]
> Sent: November 22, 2011 2:21
> To: [EMAIL PROTECTED]
> Subject: Re: java.lang.OutOfMemoryError when using TOP udf
>
> Ok so this:
>
> thirdLevelsTopVisitorsWithBots = FOREACH thirdLevelsByCategory {
>     count = COUNT(thirdLevelsSummed);
>     result = TOP((int)(count * (double)($THIRD_LEVELS_PERCENTAGE + $BOTS_PERCENTAGE)),
>                  3, thirdLevelsSummed);
>     GENERATE FLATTEN(result);
> }
>
> requires "count" to be calculated before TOP can be applied. Since count
> can't be calculated until the reduce side, naturally, TOP can't start
> working on the map side (as it doesn't know its arguments yet).
>
> Try generating the counts * ($TLP + $BP) separately, joining them in (I am
> guessing you have no more than a few K categories -- in that case, you can
> do a replicated join), and then doing the group and TOP.
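
A rough sketch of that restructuring in Pig, with assumed relation and field
names (categoryCounts, category, and topN are illustrative, not from the
original script; the TOP column index is also assumed and may shift after the
join):

    -- 1) compute each group's limit separately
    categoryCounts = FOREACH thirdLevelsByCategory GENERATE
                         group AS category,
                         (int)(COUNT(thirdLevelsSummed) *
                               (double)($THIRD_LEVELS_PERCENTAGE + $BOTS_PERCENTAGE)) AS topN;

    -- 2) replicated join: categoryCounts must fit in memory on each mapper,
    --    which is fine for a few K categories
    joined = JOIN thirdLevelsSummed BY category,
                  categoryCounts BY category USING 'replicated';

    -- 3) group again and apply TOP with the precomputed limit; every tuple in
    --    a group carries the same topN, so MAX simply picks that value
    regrouped = GROUP joined BY thirdLevelsSummed::category;
    topVisitors = FOREACH regrouped {
        n = MAX(joined.topN);
        result = TOP((int)n, 3, joined);
        GENERATE FLATTEN(result);
    }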
>
> On Mon, Nov 21, 2011 at 1:53 PM, Jonathan Coveney <[EMAIL PROTECTED]>
> wrote:
> > You're right pablomar...hmm
> >
> > Ruslan: are you running this in mr mode on a cluster, or locally?
> >
> > I'm noticing this:
> > [2011-11-16 12:34:55] INFO  (SpillableMemoryManager.java:154) - first
> > memory handler call- Usage threshold init = 175308800(171200K) used =
> > 373454552(364701K) committed = 524288000(512000K) max = 524288000(512000K)
> >
> > It looks like your max memory is 512MB. I've had issues with bag
> > spilling with less than 1GB allocated (-Xmx1024m).
> >
> > 2011/11/21 pablomar <[EMAIL PROTECTED]>
> >
> >> I might be wrong, but it seems the error comes from
> >> while (itr.hasNext()),
> >> not from the add to the queue,
> >> so I don't think it is related to the number of elements in the queue
> >> ... maybe the field length?
> >>
> >> On 11/21/11, Jonathan Coveney <[EMAIL PROTECTED]> wrote:
> >> > Internally, TOP is using a priority queue. It tries to be smart
> >> > about pulling off excess elements, but if you ask it for enough
> >> > elements, it can blow up, because the priority queue is going to
> >> > have n elements, where n is the ranking you want. This is consistent
> >> > with the stack trace, which died on updateTop, which is when elements
> >> > are added to the priority queue.
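
For intuition, a minimal sketch of the bounded priority-queue pattern
described above (illustrative only; this is not Pig's actual TOP source, just
the usual min-heap top-N idiom behind a method like updateTop):

    import java.util.PriorityQueue;

    // Keep the n largest values seen so far in a min-heap: the heap always
    // holds up to n elements, so a very large n means a very large heap.
    class TopN {
        static void updateTop(PriorityQueue<Long> heap, int n, long value) {
            if (heap.size() < n) {
                heap.add(value);             // still filling toward n elements
            } else if (value > heap.peek()) {
                heap.poll();                 // evict the current minimum
                heap.add(value);             // keep only the n largest
            }
        }
    }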
> >> >
> >> > Ruslan, how large are the limits you're setting? i.e.
> >> > (int)(count * (double)($THIRD_LEVELS_PERCENTAGE + $BOTS_PERCENTAGE))
> >> >
> >> > As far as TOP's implementation, I imagine you could get around the
> >> > issue by