Pig, mail # user - What is the best way to do counting in pig?


Re: What is the best way to do counting in pig?
Jonathan Coveney 2012-07-10, 04:26
I have something in the mix that should reduce bag memory :) Question: how
much memory are your reducers getting? In my experience, you'll get OOMs
on spilling if you have allocated less than a gig to the JVM.
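
One way to answer the memory question is to log the heap figures from inside the task JVM. The following is only a minimal sketch using standard java.lang.Runtime calls; the class name HeapReport and its methods are hypothetical and not part of Pig. Note that freeMemory() on its own only reports the free part of the heap allocated so far, so the sketch also reports the maximum the heap may grow to and the amount in use.

```java
// Hypothetical helper for illustration only; not part of Pig or Hadoop.
final class HeapReport {
    private static final long MB = 1024L * 1024L;

    /** Upper bound the heap may grow to (the value -Xmx controls), in MB. */
    static long maxMb() {
        return Runtime.getRuntime().maxMemory() / MB;
    }

    /** Heap currently in use: allocated heap minus its free part, in MB. */
    static long usedMb() {
        Runtime rt = Runtime.getRuntime();
        return (rt.totalMemory() - rt.freeMemory()) / MB;
    }

    /** One-line summary suitable for a log statement. */
    static String summary() {
        return "heap used=" + usedMb() + "MB, max=" + maxMb() + "MB";
    }

    public static void main(String[] args) {
        System.out.println(summary());
    }
}
```

Calling HeapReport.summary() from a UDF's exec method (for example inside a LOG.warn call) would show whether the reducer really has less than a gig available.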

2012/7/9 Haitao Yao <[EMAIL PROTECTED]>

> I have encountered a similar problem: I got an OOM while running the
> reducer.
> I think the reason is that the data bag generated after group all is too
> big to fit into the reducer's memory.
>
> I have written a new COUNT implementation that explicitly invokes
> System.gc() and spill() after the COUNT function finishes its job, but it
> still gets an OOM.
>
> Here's the code of the new COUNT implementation:
>         @Override
>         public Long exec(Tuple input) throws IOException {
>                 DataBag bag = (DataBag)input.get(0);
>                 Long result = super.exec(input);
>                 LOG.warn(" before spill data bag memory : " + Runtime.getRuntime().freeMemory());
>                 bag.spill();
>                 System.gc();
>                 LOG.warn(" after spill data bag memory : " + Runtime.getRuntime().freeMemory());
>                 LOG.warn("big bag size: " + bag.size() + ", hashcode: " + bag.hashCode());
>                 return result;
>         }
>
>
> I think we have to redesign the data bag implementation so that it
> consumes less memory.
>
>
>
> Haitao Yao
> [EMAIL PROTECTED]
> weibo: @haitao_yao
> Skype:  haitao.yao.final
>
> On 2012-7-10, at 6:54 AM, Sheng Guo wrote:
>
> > the pig script:
> >
> > longDesc = load '/user/xx/filtered_chunk' USING AvroStorage();
> >
> > grpall = group longDesc all;
> > cnt = foreach grpall generate COUNT(longDesc) as allNumber;
> > explain cnt;
> >
> >
> > the dump relation result:
> >
> > #-----------------------------------------------
> > # New Logical Plan:
> > #-----------------------------------------------
> > cnt: (Name: LOStore Schema: allNumber#65:long)
> > |
> > |---cnt: (Name: LOForEach Schema: allNumber#65:long)
> >    |   |
> >    |   (Name: LOGenerate[false] Schema: allNumber#65:long)ColumnPrune:InputUids=[63]ColumnPrune:OutputUids=[65]
> >    |   |   |
> >    |   |   (Name: UserFunc(org.apache.pig.builtin.COUNT) Type: long Uid: 65)
> >    |   |   |
> >    |   |   |---longDesc:(Name: Project Type: bag Uid: 63 Input: 0 Column: (*))
> >    |   |
> >    |   |---longDesc: (Name: LOInnerLoad[1] Schema: DISCUSSION_ID#41:long,COMMENT_COUNT#42:long,UNIQUE_COMMENTER_COUNT#43:long,ACTIVE_COMMENT_COUNT#44:long,LAST_ACTIVITY_AT#45:long,SUBJECT#46:chararray,SUBJECT_CHUNKS#47:chararray,LOCALE#48:chararray,STATE#49:chararray,DETAIL#50:chararray,DETAIL_CHUNKS#51:chararray,TOPIC_TITLE#52:chararray,TOPIC_TITLE_CHUNKS#53:chararray,TOPIC_DESCRIPTION#54:chararray,TOPIC_DESCRIPTION_CHUNKS#55:chararray,TOPIC_ATTRIBUTES#56:chararray)
> >    |
> >    |---grpall: (Name: LOCogroup Schema: group#62:chararray,longDesc#63:bag{#64:tuple(DISCUSSION_ID#41:long,COMMENT_COUNT#42:long,UNIQUE_COMMENTER_COUNT#43:long,ACTIVE_COMMENT_COUNT#44:long,LAST_ACTIVITY_AT#45:long,SUBJECT#46:chararray,SUBJECT_CHUNKS#47:chararray,LOCALE#48:chararray,STATE#49:chararray,DETAIL#50:chararray,DETAIL_CHUNKS#51:chararray,TOPIC_TITLE#52:chararray,TOPIC_TITLE_CHUNKS#53:chararray,TOPIC_DESCRIPTION#54:chararray,TOPIC_DESCRIPTION_CHUNKS#55:chararray,TOPIC_ATTRIBUTES#56:chararray)})
> >        |   |
> >        |   (Name: Constant Type: chararray Uid: 62)
> >        |
> >        |---longDesc: (Name: LOLoad Schema: DISCUSSION_ID#41:long,COMMENT_COUNT#42:long,UNIQUE_COMMENTER_COUNT#43:long,ACTIVE_COMMENT_COUNT#44:long,LAST_ACTIVITY_AT#45:long,SUBJECT#46:chararray,SUBJECT_CHUNKS#47:chararray,LOCALE#48:chararray,STATE#49:chararray,DETAIL#50:chararray,DETAIL_CHUNKS#51:chararray,TOPIC_TITLE#52:chararray,TOPIC_TITLE_CHUNKS#53:chararray,TOPIC_DESCRIPTION#54:chararray,TOPIC_DESCRIPTION_CHUNKS#55:chararray,TOPIC_ATTRIBUTES#56:chararray)RequiredFields:null
> >
> > #-----------------------------------------------
> > # Physical Plan:
> > #-----