Pig, mail # user - UDF problem: Java Heap space


Aniket Mokashi 2011-02-24, 03:49
Dmitriy Ryaboy 2011-02-24, 03:56
Jai Krishna 2011-02-24, 08:58
Aniket Mokashi 2011-02-24, 23:49
Dmitriy Ryaboy 2011-02-25, 00:13
Daniel Dai 2011-02-25, 00:25

Re: UDF problem: Java Heap space
Aniket Mokashi 2011-02-25, 00:47
This is a map-side UDF.
The Pig script loads a log file and grabs the contents inside square brackets:
a = load; b = foreach a generate F(a); dump b;

I see the following on the tasktrackers:
2011-02-23 18:01:25,992 INFO
org.apache.pig.impl.util.SpillableMemoryManager: first memory handler call
- Collection threshold init = 5439488(5312K) used = 409337824(399743K)
committed = 534118400(521600K) max = 715849728(699072K)
2011-02-23 18:01:26,102 INFO
org.apache.pig.impl.util.SpillableMemoryManager: first memory handler
call- Usage threshold init = 5439488(5312K) used = 546751088(533936K)
committed = 671547392(655808K) max = 715849728(699072K)

I am trying out some changes in the UDF to see if they work.

Thanks,
Aniket
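
[Editorial note] A side observation on the loop quoted later in this thread: when a '[' has no matching ']', indexOf(']', startInd) returns -1, the `startInd = endInd` step makes startInd negative, and String.indexOf(int, int) treats a negative fromIndex as zero, so the scan restarts from the beginning of the string and never terminates. Below is a minimal runnable reconstruction; the variable names come from the thread, but the break on the unmatched-'[' case and the `endInd + 1` advance are editorial assumptions, not the poster's code:

```java
import java.util.ArrayList;
import java.util.List;

public class BracketParser {
    // Collects every "[...]" substring of someLog, e.g. "x[a]y[b]" -> ["[a]", "[b]"]
    static List<String> extractCategories(String someLog) {
        List<String> cats = new ArrayList<>();
        int startInd = 0;
        while ((startInd = someLog.indexOf('[', startInd)) >= 0) {
            int endInd = someLog.indexOf(']', startInd);
            if (endInd < 0) {
                break; // unmatched '[': stop rather than rescanning from index 0 forever
            }
            cats.add(someLog.substring(startInd, endInd + 1));
            startInd = endInd + 1; // advance past the ']' before searching again
        }
        return cats;
    }

    public static void main(String[] args) {
        System.out.println(extractCategories("x[a]y[b]z[a]")); // [[a], [b], [a]]
        System.out.println(extractCategories("broken[a tail")); // []
    }
}
```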

On Thu, February 24, 2011 7:25 pm, Daniel Dai wrote:
> Hi, Aniket,
> What is your Pig script? Is the UDF in map side or reduce side?
>
>
> Daniel
>
>
> Dmitriy Ryaboy wrote:
>
>> That's a max of 3.3K single-character strings. Even with the Java
>> overhead, that shouldn't be more than a meg, right? None of these should
>> make it out of young gen, assuming the list "cats" doesn't stick around
>> outside the UDF.
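
[Editorial note] A quick back-of-the-envelope check of the estimate above; the per-String byte figure is an assumption, roughly object header plus fields plus a one-char backing array on a JVM of that era:

```java
public class StringOverheadEstimate {
    // Assumed per-String cost in bytes: header + fields + 1-char char[] array
    static final long BYTES_PER_STRING = 48;

    static long estimateBytes(long count) {
        return count * BYTES_PER_STRING;
    }

    public static void main(String[] args) {
        // "a max of 3.3K single-character strings"
        System.out.println(estimateBytes(3300)); // 158400 bytes, well under a megabyte
    }
}
```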
>>
>> On Thu, Feb 24, 2011 at 3:49 PM, Aniket Mokashi
>> <[EMAIL PROTECTED]>wrote:
>>
>>
>>
>>> Hi Jai,
>>>
>>>
>>> Thanks for your email. I suspect it's the Strings-in-a-tight-loop
>>> issue, as you suggested. I have a loop in my UDF that does the
>>> following:
>>>
>>> while ((startInd = someLog.indexOf('[', startInd)) > 0) {
>>>     endInd = someLog.indexOf(']', startInd);
>>>     if (endInd > 0) {
>>>         category = someLog.substring(startInd, endInd + 1);
>>>         cats.add(category);
>>>     }
>>>     startInd = endInd;
>>> }
>>>
>>>
>>> My jobs are failing in both local and MR mode. The UDF works fine for a
>>> smaller input (a few lines). Also, I checked that the length of someLog
>>> doesn't exceed 10000.
>>>
>>> Thanks,
>>> Aniket
>>>
>>>
>>>
>>> On Thu, February 24, 2011 3:58 am, Jai Krishna wrote:
>>>
>>>
>>>> Sharing the code would be useful, as mentioned. Also of help would be
>>>> the heap settings that the JVM had.
>>>>
>>>> However, off the top of my head, one common situation (esp. in text
>>>>  processing/tokenizing) is instantiating Strings in a tight loop.
>>>>
>>>> Besides, you could also exercise your UDF in a local JVM and take a
>>>> heap dump / profile it. If your heap is less than 512M, you could
>>>> use basic profiling via hprof/hat (see
>>>> http://java.sun.com/developer/technicalArticles/Programming/HPROF.html).
>>>>
>>>>
>>>> Thanks,
>>>> Jai
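
[Editorial note] Jai's hprof suggestion could look roughly like this on the command line; the driver class and jar names are hypothetical stand-ins, while the hprof options (heap=sites, heap=dump, format=b) are standard for JVMs of that era:

```shell
# Allocation-site profile: writes java.hprof.txt, sites sorted by live bytes
java -agentlib:hprof=heap=sites,depth=10 \
     -cp pig.jar:myudf.jar LocalUdfDriver sample.log

# Or take a binary heap dump and browse it with jhat at http://localhost:7000/
java -agentlib:hprof=heap=dump,format=b,file=udf.hprof \
     -cp pig.jar:myudf.jar LocalUdfDriver sample.log
jhat udf.hprof
```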
>>>>
>>>>
>>>>
>>>>
>>>> On 2/24/11 9:26 AM, "Dmitriy Ryaboy" <[EMAIL PROTECTED]> wrote:
>>>>
>>>>
>>>>
>>>> Aniket, share the code?
>>>> It really depends on how you create them.
>>>>
>>>>
>>>>
>>>> -D
>>>>
>>>>
>>>>
>>>> On Wed, Feb 23, 2011 at 7:49 PM, Aniket Mokashi
>>>> <[EMAIL PROTECTED]>wrote:
>>>>
>>>>
>>>>
>>>>
>>>>> I've written a simple UDF that parses a chararray (which looks
>>>>> like ...[a].....[b]...[a]...) to capture the stuff inside brackets and
>>>>> return it as a String "a=2;b=1;" and so on. The input chararrays are
>>>>> rarely more than 1000 characters and never more than 100000 (I've
>>>>> added log.warn in my UDF to ensure this). But I still see a Java
>>>>> heap error while running this UDF (even in local mode, the job
>>>>> simply fails). My assumption is that the maps and lists I use locally
>>>>> will be collected by the GC. Am I missing something?
>>>>>
>>>>> Thanks,
>>>>> Aniket
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>
>>>
>
>
>
Aniket Mokashi 2011-02-25, 01:26