Pig, mail # user - What is the best way to do counting in pig?


Re: What is the best way to do counting in pig?
Haitao Yao 2012-07-10, 05:06
My reducers get 512 MB (-Xms512M -Xmx512M).
In my case, the reducer does not hit an OOM when I invoke spill manually.
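
For reference, the heap settings above can be checked from inside the JVM. The following is only a minimal sketch in plain Java (the class and method names are hypothetical, not part of Pig): it reports the maximum heap, which should be roughly the -Xmx value, and the heap currently in use, computed as totalMemory() minus freeMemory(). Note that freeMemory() alone only measures the unused part of the heap the JVM has claimed so far, so it can rise or fall without the bag's footprint changing.

```java
final class HeapReport {
    /** Bytes currently in use: the heap the JVM has claimed minus the unused part of it. */
    static long usedBytes(Runtime rt) {
        return rt.totalMemory() - rt.freeMemory();
    }

    /** Formats a byte count as whole mebibytes (rounded down). */
    static String mib(long bytes) {
        return (bytes / (1024L * 1024L)) + " MiB";
    }

    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();
        // maxMemory() is the upper bound the heap may grow to (roughly -Xmx).
        System.out.println("max heap:       " + mib(rt.maxMemory()));
        // totalMemory() is what the JVM has claimed from the OS so far.
        System.out.println("committed heap: " + mib(rt.totalMemory()));
        System.out.println("used heap:      " + mib(usedBytes(rt)));
    }
}
```

Logging the used-heap figure before and after a spill would show more directly whether the spill released anything.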

Can you explain more about your solution?
And will it fit in a reducer process with a 512 MB heap?
Thanks very much.

Haitao Yao
[EMAIL PROTECTED]
weibo: @haitao_yao
Skype:  haitao.yao.final

On 2012-7-10, at 12:26 PM, Jonathan Coveney wrote:

> I have something in the mix that should reduce bag memory :) Question: how
> much memory are your reducers getting? In my experience, you'll get OOMs
> on spilling if you have allocated less than a gig to the JVM.
>
> 2012/7/9 Haitao Yao <[EMAIL PROTECTED]>
>
>> I have encountered a similar problem: I got an OOM while running the
>> reducer.
>> I think the reason is that the data bag generated after group all is too
>> big to fit into the reducer's memory.
>>
>> I have written a new COUNT implementation that explicitly invokes
>> System.gc() and spill after the COUNT function finishes its job, but it
>> still gets an OOM.
>>
>> Here's the code of the new COUNT implementation:
>>        @Override
>>        public Long exec(Tuple input) throws IOException {
>>                DataBag bag = (DataBag)input.get(0);
>>                // Delegate the actual counting to the superclass.
>>                Long result = super.exec(input);
>>                LOG.warn(" before spill data bag memory : " + Runtime.getRuntime().freeMemory());
>>                // Ask the bag to spill to disk, then request a GC.
>>                bag.spill();
>>                System.gc();
>>                LOG.warn(" after spill data bag memory : " + Runtime.getRuntime().freeMemory());
>>                LOG.warn("big bag size: " + bag.size() + ", hashcode: " + bag.hashCode());
>>                return result;
>>        }
>>
>>
>> I think we have to redesign the data bag implementation so that it
>> consumes less memory.
>>
>>
>>
>> Haitao Yao
>> [EMAIL PROTECTED]
>> weibo: @haitao_yao
>> Skype:  haitao.yao.final
>>
>> On 2012-7-10, at 6:54 AM, Sheng Guo wrote:
>>
>>> the pig script:
>>>
>>> longDesc = load '/user/xx/filtered_chunk' USING AvroStorage();
>>>
>>> grpall = group longDesc all;
>>> cnt = foreach grpall generate COUNT(longDesc) as allNumber;
>>> explain cnt;
>>>
>>>
>>> the explain output for the relation:
>>>
>>> #-----------------------------------------------
>>> # New Logical Plan:
>>> #-----------------------------------------------
>>> cnt: (Name: LOStore Schema: allNumber#65:long)
>>> |
>>> |---cnt: (Name: LOForEach Schema: allNumber#65:long)
>>>   |   |
>>>   |   (Name: LOGenerate[false] Schema: allNumber#65:long)ColumnPrune:InputUids=[63]ColumnPrune:OutputUids=[65]
>>>   |   |   |
>>>   |   |   (Name: UserFunc(org.apache.pig.builtin.COUNT) Type: long Uid: 65)
>>>   |   |   |
>>>   |   |   |---longDesc:(Name: Project Type: bag Uid: 63 Input: 0 Column: (*))
>>>   |   |
>>>   |   |---longDesc: (Name: LOInnerLoad[1] Schema: DISCUSSION_ID#41:long,COMMENT_COUNT#42:long,UNIQUE_COMMENTER_COUNT#43:long,ACTIVE_COMMENT_COUNT#44:long,LAST_ACTIVITY_AT#45:long,SUBJECT#46:chararray,SUBJECT_CHUNKS#47:chararray,LOCALE#48:chararray,STATE#49:chararray,DETAIL#50:chararray,DETAIL_CHUNKS#51:chararray,TOPIC_TITLE#52:chararray,TOPIC_TITLE_CHUNKS#53:chararray,TOPIC_DESCRIPTION#54:chararray,TOPIC_DESCRIPTION_CHUNKS#55:chararray,TOPIC_ATTRIBUTES#56:chararray)
>>>   |
>>>   |---grpall: (Name: LOCogroup Schema: group#62:chararray,longDesc#63:bag{#64:tuple(DISCUSSION_ID#41:long,COMMENT_COUNT#42:long,UNIQUE_COMMENTER_COUNT#43:long,ACTIVE_COMMENT_COUNT#44:long,LAST_ACTIVITY_AT#45:long,SUBJECT#46:chararray,SUBJECT_CHUNKS#47:chararray,LOCALE#48:chararray,STATE#49:chararray,DETAIL#50:chararray,DETAIL_CHUNKS#51:chararray,TOPIC_TITLE#52:chararray,TOPIC_TITLE_CHUNKS#53:chararray,TOPIC_DESCRIPTION#54:chararray,TOPIC_DESCRIPTION_CHUNKS#55:chararray,TOPIC_ATTRIBUTES#56:chararray)})
>>>       |   |
>>>       |   (Name: Constant Type: chararray Uid: 62)
>>>       |
>>>       |---longDesc: (Name: LOLoad Schema: DISCUSSION_ID#41:long,COMMENT_COUNT#42:long,UNIQUE_COMMENTER_COUNT#43:long,ACTIVE_COMMENT_COUNT#44:long,LAST_ACTIVITY_AT#45:long,SUBJECT#46:chararray,SUBJECT_CHUNKS#47:chararray,LOCALE#48:chararray,STATE#49:chararray,DETAIL#50:chararray,DETAIL_CHUNKS#51:chararray,TOPIC_TITLE#52:chararray,TOPIC_TITLE_CHUNKS#53:chararray,TOPIC_DESCRIPTION#54:chararray,TOPIC_DESCRIPTION_CHUNKS#55:chararray,TOPIC_ATTRIBUTES#56:chararray)RequiredFields:null
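
The plan above applies COUNT to the whole longDesc bag produced by group all. A count does not need the full bag in memory, because it can be assembled from partial counts; this is the idea behind Pig's Algebraic interface, which the builtin COUNT implements. The sketch below illustrates that idea in plain Java with no Pig dependencies. The class and method names are hypothetical, and whether the combiner actually kicks in for this particular script is not something the thread establishes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

final class PartialCount {
    /** Counts one chunk by streaming over it; only a single long is retained. */
    static long countChunk(Iterator<?> records) {
        long n = 0;
        while (records.hasNext()) {
            records.next();
            n++;
        }
        return n;
    }

    /** Adds up partial counts, as a final (reduce-side) step would. */
    static long combine(Iterable<Long> partials) {
        long total = 0;
        for (long p : partials) {
            total += p;
        }
        return total;
    }

    public static void main(String[] args) {
        // Three chunks stand in for the input seen by three separate tasks.
        List<List<String>> chunks = Arrays.asList(
                Arrays.asList("a", "b", "c"),
                Arrays.asList("d", "e"),
                Arrays.asList("f"));
        List<Long> partials = new ArrayList<>();
        for (List<String> chunk : chunks) {
            partials.add(countChunk(chunk.iterator()));
        }
        System.out.println(combine(partials)); // prints 6
    }
}
```

With this shape, the final step only ever sees one number per chunk, so its memory use does not grow with the number of records.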