Re: What is the best way to do counting in pig?
The Pig script:

longDesc = load '/user/xx/filtered_chunk' USING AvroStorage();

grpall = group longDesc all;
cnt = foreach grpall generate COUNT(longDesc) as allNumber;
explain cnt;
The output of explain cnt:

#-----------------------------------------------
# New Logical Plan:
#-----------------------------------------------
cnt: (Name: LOStore Schema: allNumber#65:long)
|
|---cnt: (Name: LOForEach Schema: allNumber#65:long)
    |   |
    |   (Name: LOGenerate[false] Schema: allNumber#65:long)ColumnPrune:InputUids=[63]ColumnPrune:OutputUids=[65]
    |   |   |
    |   |   (Name: UserFunc(org.apache.pig.builtin.COUNT) Type: long Uid: 65)
    |   |   |
    |   |   |---longDesc:(Name: Project Type: bag Uid: 63 Input: 0 Column: (*))
    |   |
    |   |---longDesc: (Name: LOInnerLoad[1] Schema: DISCUSSION_ID#41:long,COMMENT_COUNT#42:long,UNIQUE_COMMENTER_COUNT#43:long,ACTIVE_COMMENT_COUNT#44:long,LAST_ACTIVITY_AT#45:long,SUBJECT#46:chararray,SUBJECT_CHUNKS#47:chararray,LOCALE#48:chararray,STATE#49:chararray,DETAIL#50:chararray,DETAIL_CHUNKS#51:chararray,TOPIC_TITLE#52:chararray,TOPIC_TITLE_CHUNKS#53:chararray,TOPIC_DESCRIPTION#54:chararray,TOPIC_DESCRIPTION_CHUNKS#55:chararray,TOPIC_ATTRIBUTES#56:chararray)
    |
    |---grpall: (Name: LOCogroup Schema: group#62:chararray,longDesc#63:bag{#64:tuple(DISCUSSION_ID#41:long,COMMENT_COUNT#42:long,UNIQUE_COMMENTER_COUNT#43:long,ACTIVE_COMMENT_COUNT#44:long,LAST_ACTIVITY_AT#45:long,SUBJECT#46:chararray,SUBJECT_CHUNKS#47:chararray,LOCALE#48:chararray,STATE#49:chararray,DETAIL#50:chararray,DETAIL_CHUNKS#51:chararray,TOPIC_TITLE#52:chararray,TOPIC_TITLE_CHUNKS#53:chararray,TOPIC_DESCRIPTION#54:chararray,TOPIC_DESCRIPTION_CHUNKS#55:chararray,TOPIC_ATTRIBUTES#56:chararray)})
        |   |
        |   (Name: Constant Type: chararray Uid: 62)
        |
        |---longDesc: (Name: LOLoad Schema: DISCUSSION_ID#41:long,COMMENT_COUNT#42:long,UNIQUE_COMMENTER_COUNT#43:long,ACTIVE_COMMENT_COUNT#44:long,LAST_ACTIVITY_AT#45:long,SUBJECT#46:chararray,SUBJECT_CHUNKS#47:chararray,LOCALE#48:chararray,STATE#49:chararray,DETAIL#50:chararray,DETAIL_CHUNKS#51:chararray,TOPIC_TITLE#52:chararray,TOPIC_TITLE_CHUNKS#53:chararray,TOPIC_DESCRIPTION#54:chararray,TOPIC_DESCRIPTION_CHUNKS#55:chararray,TOPIC_ATTRIBUTES#56:chararray)RequiredFields:null

#-----------------------------------------------
# Physical Plan:
#-----------------------------------------------
cnt: Store(fakefile:org.apache.pig.builtin.PigStorage) - scope-9
|
|---cnt: New For Each(false)[bag] - scope-8
    |   |
    |   POUserFunc(org.apache.pig.builtin.COUNT)[long] - scope-6
    |   |
    |   |---Project[bag][1] - scope-5
    |
    |---grpall: Package[tuple]{chararray} - scope-2
        |
        |---grpall: Global Rearrange[tuple] - scope-1
            |
            |---grpall: Local Rearrange[tuple]{chararray}(false) - scope-3
                |   |
                |   Constant(all) - scope-4
                |
                |---longDesc: Load(/user/sguo/h2o/group_filtered_chunk:LiAvroStorage) - scope-0

2012-07-09 15:47:02,441 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler - File concatenation threshold: 100 optimistic? false
2012-07-09 15:47:02,448 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.CombinerOptimizer - Choosing to move algebraic foreach to combiner
2012-07-09 15:47:02,581 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size before optimization: 1
2012-07-09 15:47:02,581 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size after optimization: 1
#--------------------------------------------------
# Map Reduce Plan
#--------------------------------------------------
MapReduce node scope-10
Map Plan
grpall: Local Rearrange[tuple]{chararray}(false) - scope-22
|   |
|   Project[chararray][0] - scope-23
|
|---cnt: New For Each(false,false)[bag] - scope-11
    |   |
    |   Project[chararray][0] - scope-12
    |   |
    |   POUserFunc(org.apache.pig.builtin.COUNT$Initial)[tuple] - scope-13
    |   |
    |   |---Project[bag][1] - scope-14
    |
    |---Pre Combiner Local Rearrange[tuple]{Unknown} - scope-24
        |
        |---longDesc: Load(/user/sguo/h2o/group_filtered_chunk:LiAvroStorage) - scope-0--------
Combine Plan
grpall: Local Rearrange[tuple]{chararray}(false) - scope-26
|   |
|   Project[chararray][0] - scope-27
|
|---cnt: New For Each(false,false)[bag] - scope-15
    |   |
    |   Project[chararray][0] - scope-16
    |   |
    |   POUserFunc(org.apache.pig.builtin.COUNT$Intermediate)[tuple] - scope-17
    |   |
    |   |---Project[bag][1] - scope-18
    |
    |---POCombinerPackage[tuple]{chararray} - scope-20--------
Reduce Plan
cnt: Store(fakefile:org.apache.pig.builtin.PigStorage) - scope-9
|
|---cnt: New For Each(false)[bag] - scope-8
    |   |
    |   POUserFunc(org.apache.pig.builtin.COUNT$Final)[long] - scope-6
    |   |
    |   |---Project[bag][1] - scope-19
    |
    |---POCombinerPackage[tuple]{chararray} - scope-28--------
Global sort: false
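
For anyone reading the Map Reduce plan above: the CombinerOptimizer line means COUNT is split into COUNT$Initial (map), COUNT$Intermediate (combiner) and COUNT$Final (reduce), so only partial counts reach the single reducer rather than every tuple. Below is a rough Python model of that flow. It is a sketch and not Pig's implementation: the function names and sample rows are made up for illustration, and skipping tuples whose first field is null is my understanding of how COUNT behaves.

```python
from typing import Iterable, List, Optional, Tuple

# A row is a tuple of fields; any field may be null (None).
Row = Tuple[Optional[object], ...]


def count_initial(row: Row) -> int:
    """Map side (models COUNT$Initial): one partial count per input tuple.

    Assumption: a tuple whose first field is null contributes 0.
    """
    return 1 if row and row[0] is not None else 0


def count_intermediate(partials: Iterable[int]) -> int:
    """Combiner (models COUNT$Intermediate): sum one map task's partial counts."""
    return sum(partials)


def count_final(partials: Iterable[int]) -> int:
    """Reduce side (models COUNT$Final): sum the combined counts into the total."""
    return sum(partials)


def simulate_group_all_count(splits: List[List[Row]]) -> int:
    """Model `group ... all` followed by COUNT: every split feeds one reducer."""
    combined = []
    for split in splits:
        initial = [count_initial(row) for row in split]
        # Each map task ships a single number after its combiner runs.
        combined.append(count_intermediate(initial))
    return count_final(combined)


# Hypothetical input: three map splits, one of them empty, one with a null key.
splits = [
    [(1, "a"), (2, "b"), (3, "c")],
    [(4, "d"), (None, "e")],
    [],
]
total = simulate_group_all_count(splits)
print(total)
```

The point of the model is that the reducer sums one number per map task, so the single reduce task implied by "group all" is not where the time goes for a plain count.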

On Tue, Jul 3, 2012 at 9:56 AM, Jonathan Coveney <[EMAIL PROTECTED]> wrote: