Pig >> mail # user >> What is the best way to do counting in pig?


Re: What is the best way to do counting in pig?
I have run into a similar problem: I got an OOM (out-of-memory error) while running the reducer.
I think the reason is that the data bag generated after the group all is too big to fit into the reducer's memory.

I wrote a new COUNT implementation that explicitly spills the bag and invokes System.gc() after the COUNT function finishes its job, but it still gets an OOM.

Here's the code of the new COUNT implementation:
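@Override
public Long exec(Tuple input) throws IOException {
    DataBag bag = (DataBag) input.get(0);
    // Delegate the actual counting to the parent COUNT implementation.
    Long result = super.exec(input);
    // Note: Runtime.freeMemory() reports free JVM heap, not the size of the bag.
    LOG.warn(" before spill data bag memory : " + Runtime.getRuntime().freeMemory());
    // Ask the bag to spill to disk, then request (not force) a garbage collection.
    bag.spill();
    System.gc();
    LOG.warn(" after spill data bag memory : " + Runtime.getRuntime().freeMemory());
    LOG.warn("big bag size: " + bag.size() + ", hashcode: " + bag.hashCode());
    return result;
}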
I think we have to redesign the data bag implementation so that it consumes less memory.
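
For illustration, a count does not in principle need the whole bag in memory: each chunk of input can be counted by streaming over it, and only the small per-chunk counts need to be summed at the end. The sketch below shows that idea in plain Java with no Pig classes; the class and method names (PartialCountSketch, partialCount, combine) are hypothetical and are not part of Pig's API. Pig's built-in COUNT implements the Algebraic interface for this kind of combiner-style evaluation, but this sketch is not Pig's actual implementation and does not show whether the combiner was used in the job discussed here.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical illustration only: plain Java, no Pig classes involved.
class PartialCountSketch {

    // "Map/combiner" side: count one chunk by streaming over it,
    // holding a single long rather than the records themselves.
    static long partialCount(Iterator<?> records) {
        long n = 0;
        while (records.hasNext()) {
            records.next();
            n++;
        }
        return n;
    }

    // "Reduce" side: the input is one small number per chunk,
    // so memory use does not grow with the total record count.
    static long combine(Iterable<Long> partials) {
        long total = 0;
        for (long p : partials) {
            total += p;
        }
        return total;
    }

    public static void main(String[] args) {
        // Two made-up chunks standing in for two map tasks' inputs.
        List<String> chunkA = Arrays.asList("r1", "r2", "r3");
        List<String> chunkB = Arrays.asList("r4", "r5");
        List<Long> partials = Arrays.asList(
                partialCount(chunkA.iterator()),
                partialCount(chunkB.iterator()));
        System.out.println(combine(partials)); // prints 5
    }
}
```

The design point is that the reducer only ever sees one number per chunk, so its memory footprint is independent of how many records were counted.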

Haitao Yao
[EMAIL PROTECTED]
weibo: @haitao_yao
Skype:  haitao.yao.final

On 2012-7-10, at 6:54 AM, Sheng Guo wrote:

> the pig script:
>
> longDesc = load '/user/xx/filtered_chunk' USING AvroStorage();
>
> grpall = group longDesc all;
> cnt = foreach grpall generate COUNT(longDesc) as allNumber;
> explain cnt;
>
>
> the explain output for the relation:
>
> #-----------------------------------------------
> # New Logical Plan:
> #-----------------------------------------------
> cnt: (Name: LOStore Schema: allNumber#65:long)
> |
> |---cnt: (Name: LOForEach Schema: allNumber#65:long)
>    |   |
>    |   (Name: LOGenerate[false] Schema:
> allNumber#65:long)ColumnPrune:InputUids=[63]ColumnPrune:OutputUids=[65]
>    |   |   |
>    |   |   (Name: UserFunc(org.apache.pig.builtin.COUNT) Type: long Uid:
> 65)
>    |   |   |
>    |   |   |---longDesc:(Name: Project Type: bag Uid: 63 Input: 0 Column:
> (*))
>    |   |
>    |   |---longDesc: (Name: LOInnerLoad[1] Schema:
> DISCUSSION_ID#41:long,COMMENT_COUNT#42:long,UNIQUE_COMMENTER_COUNT#43:long,ACTIVE_COMMENT_COUNT#44:long,LAST_ACTIVITY_AT#45:long,SUBJECT#46:chararray,SUBJECT_CHUNKS#47:chararray,LOCALE#48:chararray,STATE#49:chararray,DETAIL#50:chararray,DETAIL_CHUNKS#51:chararray,TOPIC_TITLE#52:chararray,TOPIC_TITLE_CHUNKS#53:chararray,TOPIC_DESCRIPTION#54:chararray,TOPIC_DESCRIPTION_CHUNKS#55:chararray,TOPIC_ATTRIBUTES#56:chararray)
>    |
>    |---grpall: (Name: LOCogroup Schema:
> group#62:chararray,longDesc#63:bag{#64:tuple(DISCUSSION_ID#41:long,COMMENT_COUNT#42:long,UNIQUE_COMMENTER_COUNT#43:long,ACTIVE_COMMENT_COUNT#44:long,LAST_ACTIVITY_AT#45:long,SUBJECT#46:chararray,SUBJECT_CHUNKS#47:chararray,LOCALE#48:chararray,STATE#49:chararray,DETAIL#50:chararray,DETAIL_CHUNKS#51:chararray,TOPIC_TITLE#52:chararray,TOPIC_TITLE_CHUNKS#53:chararray,TOPIC_DESCRIPTION#54:chararray,TOPIC_DESCRIPTION_CHUNKS#55:chararray,TOPIC_ATTRIBUTES#56:chararray)})
>        |   |
>        |   (Name: Constant Type: chararray Uid: 62)
>        |
>        |---longDesc: (Name: LOLoad Schema:
> DISCUSSION_ID#41:long,COMMENT_COUNT#42:long,UNIQUE_COMMENTER_COUNT#43:long,ACTIVE_COMMENT_COUNT#44:long,LAST_ACTIVITY_AT#45:long,SUBJECT#46:chararray,SUBJECT_CHUNKS#47:chararray,LOCALE#48:chararray,STATE#49:chararray,DETAIL#50:chararray,DETAIL_CHUNKS#51:chararray,TOPIC_TITLE#52:chararray,TOPIC_TITLE_CHUNKS#53:chararray,TOPIC_DESCRIPTION#54:chararray,TOPIC_DESCRIPTION_CHUNKS#55:chararray,TOPIC_ATTRIBUTES#56:chararray)RequiredFields:null
>
> #-----------------------------------------------
> # Physical Plan:
> #-----------------------------------------------
> cnt: Store(fakefile:org.apache.pig.builtin.PigStorage) - scope-9
> |
> |---cnt: New For Each(false)[bag] - scope-8
>    |   |
>    |   POUserFunc(org.apache.pig.builtin.COUNT)[long] - scope-6
>    |   |
>    |   |---Project[bag][1] - scope-5
>    |
>    |---grpall: Package[tuple]{chararray} - scope-2
>        |
>        |---grpall: Global Rearrange[tuple] - scope-1
>            |
>            |---grpall: Local Rearrange[tuple]{chararray}(false) - scope-3
>                |   |
>                |   Constant(all) - scope-4