Pig user mailing list: java.lang.OutOfMemoryError when using TOP udf


Ruslan Al-fakikh 2011-11-17, 14:13
Dmitriy Ryaboy 2011-11-17, 16:43
pablomar 2011-11-17, 17:59
Dmitriy Ryaboy 2011-11-17, 20:07
Ruslan Al-Fakikh 2011-11-21, 14:11
Dmitriy Ryaboy 2011-11-21, 16:32
Ruslan Al-fakikh 2011-11-21, 17:10
Jonathan Coveney 2011-11-21, 18:22
Re: java.lang.OutOfMemoryError when using TOP udf
I might be wrong, but it seems the error comes from
while(itr.hasNext())
not from the add to the queue,
so I don't think it is related to the number of elements in the queue
... maybe the field length?

On 11/21/11, Jonathan Coveney <[EMAIL PROTECTED]> wrote:
> Internally, TOP is using a priority queue. It tries to be smart about
> pulling off excess elements, but if you ask it for enough elements, it can
> blow up, because the priority queue is going to have n elements, where n is
> the ranking you want. This is consistent with the stack trace, which died
> on updateTop, which is when elements are added to the priority queue.
>
> Ruslan, how large are the limits you're setting? i.e. (int)(count * (double)
> ($THIRD_LEVELS_PERCENTAGE + $BOTS_PERCENTAGE) )
>
> As far as TOP's implementation goes, I imagine you could get around the
> issue by using a sorted data bag, but that might be much slower. Hmm.
>
> 2011/11/21 Ruslan Al-fakikh <[EMAIL PROTECTED]>
>
>> Ok. Here it is:
>> https://gist.github.com/1383266
>>
>> -----Original Message-----
>> From: Dmitriy Ryaboy [mailto:[EMAIL PROTECTED]]
>> Sent: 21 November 2011, 20:32
>> To: [EMAIL PROTECTED]
>> Subject: Re: java.lang.OutOfMemoryError when using TOP udf
>>
>> Ruslan, I think the mailing list is set to reject attachments -- can you
>> post it as a github gist or something similar, and send a link?
>>
>> D
>>
>> On Mon, Nov 21, 2011 at 6:11 AM, Ruslan Al-Fakikh
>> <[EMAIL PROTECTED]> wrote:
>> > Hey Dmitriy,
>> >
>> > I attached the script. It is not a plain Pig script, because I do
>> > some preprocessing before submitting it to the cluster, but the
>> > general idea of what I submit is clear.
>> >
>> > Thanks in advance!
>> >
>> > On Fri, Nov 18, 2011 at 12:07 AM, Dmitriy Ryaboy <[EMAIL PROTECTED]>
>> wrote:
>> >> Ok, so it's something in the rest of the script that's causing this
>> >> to happen. Ruslan, if you send your script, I can probably figure out
>> >> why (usually, it's using another, non-algebraic udf in your foreach,
>> >> or for pig 0.8, generating a constant in the foreach).
>> >>
>> >> D
>> >>
>> >> On Thu, Nov 17, 2011 at 9:59 AM, pablomar
>> >> <[EMAIL PROTECTED]> wrote:
>> >>> According to the stack trace, the algebraic mode is not being used;
>> >>> it says:
>> >>> updateTop(Top.java:139)
>> >>> exec(Top.java:116)
>> >>>
>> >>> On 11/17/11, Dmitriy Ryaboy <[EMAIL PROTECTED]> wrote:
>> >>>> The top udf does not try to process all data in memory if the
>> >>>> algebraic optimization can be applied. It does need to keep the
>> >>>> top n numbers in memory, of course. Can you confirm algebraic mode
>> >>>> is used?
>> >>>>
>> >>>> On Nov 17, 2011, at 6:13 AM, "Ruslan Al-fakikh"
>> >>>> <[EMAIL PROTECTED]>
>> >>>> wrote:
>> >>>>
>> >>>>> Hey guys,
>> >>>>>
>> >>>>>
>> >>>>>
>> >>>>> I encounter java.lang.OutOfMemoryError when using the TOP udf. It
>> >>>>> seems that the udf tries to process all data in memory.
>> >>>>>
>> >>>>> Is there a workaround for TOP? Or maybe there is some other way of
>> >>>>> getting top results? I cannot use LIMIT since I need the top 5% of
>> >>>>> the data, not a constant number of rows.
>> >>>>>
>> >>>>>
>> >>>>>
>> >>>>> I am using:
>> >>>>>
>> >>>>> Apache Pig version 0.8.1-cdh3u2 (rexported)
>> >>>>>
>> >>>>>
>> >>>>>
>> >>>>> The stack trace is:
>> >>>>>
>> >>>>> [2011-11-16 12:34:55] INFO  (CodecPool.java:128) - Got brand-new
>> >>>>> decompressor
>> >>>>>
>> >>>>> [2011-11-16 12:34:55] INFO  (Merger.java:473) - Down to the last
>> >>>>> merge-pass, with 21 segments left of total size: 2057257173 bytes
>> >>>>>
>> >>>>> [2011-11-16 12:34:55] INFO  (SpillableMemoryManager.java:154) -
>> >>>>> first memory handler call - Usage threshold init = 175308800(171200K)
>> >>>>> used = 373454552(364701K) committed = 524288000(512000K) max =
>> >>>>> 524288000(512000K)
>> >>>>>
>> >>>>> [2011-11-16 12:36:22] INFO  (SpillableMemoryManager.java:167) -
>> >>>>> first memory handler call - Collection threshold init = 175308800(171200K)
>> >>>>> used = 496500704(484863K) committed = 524288000(512000K) max =
>> >>>>> 524288000(512000K)
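
Jonathan's description above is the crux of the failure: a bounded
priority queue still holds all n of the elements you ask for. The sketch
below is a hypothetical, minimal Java illustration of that bounded-queue
technique, not Pig's actual Top.java; the class and method names are
invented, and the 5% figure mirrors the percentage-based limit in
Ruslan's script.

import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical sketch, not Pig's Top.java: a min-heap bounded top-n
// selector. The queue never exceeds n elements, but it does hold all n,
// so memory grows linearly with the requested rank.
public class TopNSketch {

    // Return the n largest values from the input (order unspecified).
    static List<Long> topN(Iterable<Long> values, int n) {
        PriorityQueue<Long> queue = new PriorityQueue<>(Math.max(1, n));
        for (long v : values) {              // roughly the updateTop step
            if (queue.size() < n) {
                queue.add(v);                // filling phase: up to n entries
            } else if (!queue.isEmpty() && v > queue.peek()) {
                queue.poll();                // evict the current minimum
                queue.add(v);                // keep only the n largest
            }
        }
        return new ArrayList<>(queue);
    }

    public static void main(String[] args) {
        long count = 100_000_000L;           // hypothetical input row count
        // Mirrors (int)(count * percentage) from the script under discussion:
        int n = (int) (count * 0.05);        // 5,000,000 queue entries
        // With real Pig tuples instead of plain longs, that many entries
        // can easily exhaust a 512MB heap like the one in the trace above.
        System.out.println("queue would hold up to " + n + " elements");
    }
}

Even when the algebraic optimization kicks in, each partial (map-side)
result is itself an n-element queue, which is why a rank derived from a
percentage of a large input hurts in either mode.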
Jonathan Coveney 2011-11-21, 21:53
Dmitriy Ryaboy 2011-11-21, 22:20
Ruslan Al-fakikh 2011-11-22, 15:08
pablomar 2011-11-23, 03:10
Jonathan Coveney 2011-11-23, 07:45
Ruslan Al-fakikh 2011-11-24, 11:55
Ruslan Al-fakikh 2011-12-15, 14:57
Ruslan Al-fakikh 2011-12-16, 13:32
Dmitriy Ryaboy 2011-12-16, 20:15
Ruslan Al-fakikh 2011-12-22, 01:37
Ruslan Al-fakikh 2011-12-27, 15:48
Jonathan Coveney 2011-12-28, 19:18
Ruslan Al-fakikh 2012-01-06, 03:14
Jonathan Coveney 2012-01-06, 04:10
Ruslan Al-fakikh 2011-12-28, 22:21