Pig >> mail # user >> java.lang.OutOfMemoryError when using TOP udf


Thread:
  Ruslan Al-Fakikh   2011-11-17, 14:13
  Dmitriy Ryaboy     2011-11-17, 16:43
  pablomar           2011-11-17, 17:59
  Dmitriy Ryaboy     2011-11-17, 20:07
  Ruslan Al-Fakikh   2011-11-21, 14:11
  Dmitriy Ryaboy     2011-11-21, 16:32
  Ruslan Al-Fakikh   2011-11-21, 17:10
  Jonathan Coveney   2011-11-21, 18:22

Re: java.lang.OutOfMemoryError when using TOP udf
I might be wrong, but it seems the error comes from
while (itr.hasNext())
not from the add to the queue,
so I don't think it is related to the number of elements in the queue
... maybe the field length?
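
The bounded priority-queue pattern discussed in the quoted messages below can be sketched as follows. This is an illustrative reimplementation, not Pig's actual Top source; the class and method names are made up, and `updateTop` is only named after the method in the stack trace. It shows why memory use is proportional to n, the requested rank:

```java
import java.util.PriorityQueue;

// Hypothetical sketch of a top-n accumulator. A min-heap holds at most
// n elements; anything smaller than the current minimum is discarded.
// Memory therefore grows with n, so asking for a very large n (e.g.
// 5% of a big relation) can still exhaust the heap.
public class TopN {
    public static PriorityQueue<Long> updateTop(PriorityQueue<Long> heap,
                                                int n, long value) {
        if (heap.size() < n) {
            heap.add(value);
        } else if (heap.peek() < value) {
            heap.poll();      // evict the current minimum
            heap.add(value);  // keep only the n largest seen so far
        }
        return heap;
    }

    public static void main(String[] args) {
        PriorityQueue<Long> heap = new PriorityQueue<>();
        int n = 3;
        for (long v : new long[] {5, 1, 9, 7, 3, 8}) {
            heap = updateTop(heap, n, v);
        }
        System.out.println(heap.size());  // never exceeds n: prints 3
        System.out.println(heap.peek());  // smallest of the top 3: prints 7
    }
}
```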

On 11/21/11, Jonathan Coveney <[EMAIL PROTECTED]> wrote:
> Internally, TOP is using a priority queue. It tries to be smart about
> pulling off excess elements, but if you ask it for enough elements, it can
> blow up, because the priority queue is going to have n elements, where n is
> the ranking you want. This is consistent with the stack trace, which died
> in updateTop, which is where elements are added to the priority queue.
>
> Ruslan, how large are the limits you're setting? I.e. (int)(count * (double)
> ($THIRD_LEVELS_PERCENTAGE + $BOTS_PERCENTAGE) )
>
> As far as TOP's implementation, I imagine you could get around the issue by
> using a sorted data bag, but that might be much slower. hmm.
>
> 2011/11/21 Ruslan Al-fakikh <[EMAIL PROTECTED]>
>
>> Ok. Here it is:
>> https://gist.github.com/1383266
>>
>> -----Original Message-----
>> From: Dmitriy Ryaboy [mailto:[EMAIL PROTECTED]]
>> Sent: 21 November 2011, 20:32
>> To: [EMAIL PROTECTED]
>> Subject: Re: java.lang.OutOfMemoryError when using TOP udf
>>
>> Ruslan, I think the mailing list is set to reject attachments -- can you
>> post it as a github gist or something similar, and send a link?
>>
>> D
>>
>> On Mon, Nov 21, 2011 at 6:11 AM, Ruslan Al-Fakikh
>> <[EMAIL PROTECTED]> wrote:
>> > Hey Dmitriy,
>> >
>> > I attached the script. It is not a plain Pig script, because I do
>> > some preprocessing before submitting it to the cluster, but the general
>> > idea of what I submit is clear.
>> >
>> > Thanks in advance!
>> >
>> > On Fri, Nov 18, 2011 at 12:07 AM, Dmitriy Ryaboy <[EMAIL PROTECTED]>
>> > wrote:
>> >> Ok, so it's something in the rest of the script that's causing this
>> >> to happen. Ruslan, if you send your script, I can probably figure out
>> >> why (usually, it's using another, non-algebraic udf in your foreach,
>> >> or for pig 0.8, generating a constant in the foreach).
>> >>
>> >> D
>> >>
>> >> On Thu, Nov 17, 2011 at 9:59 AM, pablomar
>> >> <[EMAIL PROTECTED]> wrote:
>> >>> according to the stack trace, the algebraic is not being used; it
>> >>> says:
>> >>> updateTop(Top.java:139)
>> >>> exec(Top.java:116)
>> >>>
>> >>> On 11/17/11, Dmitriy Ryaboy <[EMAIL PROTECTED]> wrote:
>> >>>> The top udf does not try to process all data in memory if the
>> >>>> algebraic optimization can be applied. It does need to keep the
>> >>>> topn numbers in memory of course. Can you confirm algebraic mode is
>> >>>> used?
>> >>>>
>> >>>> On Nov 17, 2011, at 6:13 AM, "Ruslan Al-fakikh"
>> >>>> <[EMAIL PROTECTED]>
>> >>>> wrote:
>> >>>>
>> >>>>> Hey guys,
>> >>>>>
>> >>>>>
>> >>>>>
>> >>>>> I encounter java.lang.OutOfMemoryError when using the TOP udf. It
>> >>>>> seems that the udf tries to process all the data in memory.
>> >>>>>
>> >>>>> Is there a workaround for TOP? Or maybe there is some other way of
>> >>>>> getting top results? I cannot use LIMIT since I need the top 5% of
>> >>>>> the data, not a constant number of rows.
>> >>>>>
>> >>>>>
>> >>>>>
>> >>>>> I am using:
>> >>>>>
>> >>>>> Apache Pig version 0.8.1-cdh3u2 (rexported)
>> >>>>>
>> >>>>>
>> >>>>>
>> >>>>> The stack trace is:
>> >>>>>
>> >>>>> [2011-11-16 12:34:55] INFO  (CodecPool.java:128) - Got brand-new
>> >>>>> decompressor
>> >>>>>
>> >>>>> [2011-11-16 12:34:55] INFO  (Merger.java:473) - Down to the last
>> >>>>> merge-pass, with 21 segments left of total size: 2057257173 bytes
>> >>>>>
>> >>>>> [2011-11-16 12:34:55] INFO  (SpillableMemoryManager.java:154) -
>> >>>>> first memory handler call - Usage threshold init =
>> >>>>> 175308800(171200K) used = 373454552(364701K) committed =
>> >>>>> 524288000(512000K) max = 524288000(512000K)
>> >>>>>
>> >>>>> [2011-11-16 12:36:22] INFO  (SpillableMemoryManager.java:167) -
>> >>>>> first memory handler call - Collection threshold init =
>> >>>>> 175308800(171200K) used = 496500704(484863K) committed =
>> >>>>> 524288000(512000K) max = 524288000(512000K)
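
Ruslan's underlying problem (top 5% of rows rather than a fixed LIMIT, on Pig 0.8 where LIMIT only accepts a constant) can also be approached without TOP. A hedged two-pass sketch, assuming illustrative relation and field names (events, score) and a threshold score computed between passes:

```pig
-- Pass 1: count the rows and materialize the sorted scores, so the
-- 95th-percentile score can be picked out in a small follow-up step
-- (no single bag of the top 5% ever has to fit in one reducer's memory).
events  = LOAD 'input' AS (id:chararray, score:long);
grouped = GROUP events ALL;
total   = FOREACH grouped GENERATE COUNT(events) AS n;
sorted  = ORDER events BY score DESC;
STORE sorted INTO 'sorted_by_score';

-- Pass 2: with $THRESHOLD (the score at row floor(0.05 * n)) substituted
-- by the same kind of preprocessing Ruslan already uses for his script:
top5pct = FILTER events BY score >= $THRESHOLD;
```

This trades the single in-memory priority queue for an extra MapReduce pass, which may be acceptable when n is a large fraction of the data.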
Later messages in this thread:
  Jonathan Coveney   2011-11-21, 21:53
  Dmitriy Ryaboy     2011-11-21, 22:20
  Ruslan Al-Fakikh   2011-11-22, 15:08
  pablomar           2011-11-23, 03:10
  Jonathan Coveney   2011-11-23, 07:45
  Ruslan Al-Fakikh   2011-11-24, 11:55
  Ruslan Al-Fakikh   2011-12-15, 14:57
  Ruslan Al-Fakikh   2011-12-16, 13:32
  Dmitriy Ryaboy     2011-12-16, 20:15
  Ruslan Al-Fakikh   2011-12-22, 01:37
  Ruslan Al-Fakikh   2011-12-27, 15:48
  Jonathan Coveney   2011-12-28, 19:18
  Ruslan Al-Fakikh   2011-12-28, 22:21
  Ruslan Al-Fakikh   2012-01-06, 03:14
  Jonathan Coveney   2012-01-06, 04:10