Pig >> mail # user >> What is the best way to do counting in pig?


Re: What is the best way to do counting in pig?
The Pig script:

longDesc = load '/user/xx/filtered_chunk' USING AvroStorage();

grpall = group longDesc all;
cnt = foreach grpall generate COUNT(longDesc) as allNumber;
explain cnt;
The output of explain cnt:

#-----------------------------------------------
# New Logical Plan:
#-----------------------------------------------
cnt: (Name: LOStore Schema: allNumber#65:long)
|
|---cnt: (Name: LOForEach Schema: allNumber#65:long)
    |   |
    |   (Name: LOGenerate[false] Schema: allNumber#65:long)ColumnPrune:InputUids=[63]ColumnPrune:OutputUids=[65]
    |   |   |
    |   |   (Name: UserFunc(org.apache.pig.builtin.COUNT) Type: long Uid: 65)
    |   |   |
    |   |   |---longDesc:(Name: Project Type: bag Uid: 63 Input: 0 Column: (*))
    |   |
    |   |---longDesc: (Name: LOInnerLoad[1] Schema: DISCUSSION_ID#41:long,COMMENT_COUNT#42:long,UNIQUE_COMMENTER_COUNT#43:long,ACTIVE_COMMENT_COUNT#44:long,LAST_ACTIVITY_AT#45:long,SUBJECT#46:chararray,SUBJECT_CHUNKS#47:chararray,LOCALE#48:chararray,STATE#49:chararray,DETAIL#50:chararray,DETAIL_CHUNKS#51:chararray,TOPIC_TITLE#52:chararray,TOPIC_TITLE_CHUNKS#53:chararray,TOPIC_DESCRIPTION#54:chararray,TOPIC_DESCRIPTION_CHUNKS#55:chararray,TOPIC_ATTRIBUTES#56:chararray)
    |
    |---grpall: (Name: LOCogroup Schema: group#62:chararray,longDesc#63:bag{#64:tuple(DISCUSSION_ID#41:long,COMMENT_COUNT#42:long,UNIQUE_COMMENTER_COUNT#43:long,ACTIVE_COMMENT_COUNT#44:long,LAST_ACTIVITY_AT#45:long,SUBJECT#46:chararray,SUBJECT_CHUNKS#47:chararray,LOCALE#48:chararray,STATE#49:chararray,DETAIL#50:chararray,DETAIL_CHUNKS#51:chararray,TOPIC_TITLE#52:chararray,TOPIC_TITLE_CHUNKS#53:chararray,TOPIC_DESCRIPTION#54:chararray,TOPIC_DESCRIPTION_CHUNKS#55:chararray,TOPIC_ATTRIBUTES#56:chararray)})
        |   |
        |   (Name: Constant Type: chararray Uid: 62)
        |
        |---longDesc: (Name: LOLoad Schema: DISCUSSION_ID#41:long,COMMENT_COUNT#42:long,UNIQUE_COMMENTER_COUNT#43:long,ACTIVE_COMMENT_COUNT#44:long,LAST_ACTIVITY_AT#45:long,SUBJECT#46:chararray,SUBJECT_CHUNKS#47:chararray,LOCALE#48:chararray,STATE#49:chararray,DETAIL#50:chararray,DETAIL_CHUNKS#51:chararray,TOPIC_TITLE#52:chararray,TOPIC_TITLE_CHUNKS#53:chararray,TOPIC_DESCRIPTION#54:chararray,TOPIC_DESCRIPTION_CHUNKS#55:chararray,TOPIC_ATTRIBUTES#56:chararray)RequiredFields:null

#-----------------------------------------------
# Physical Plan:
#-----------------------------------------------
cnt: Store(fakefile:org.apache.pig.builtin.PigStorage) - scope-9
|
|---cnt: New For Each(false)[bag] - scope-8
    |   |
    |   POUserFunc(org.apache.pig.builtin.COUNT)[long] - scope-6
    |   |
    |   |---Project[bag][1] - scope-5
    |
    |---grpall: Package[tuple]{chararray} - scope-2
        |
        |---grpall: Global Rearrange[tuple] - scope-1
            |
            |---grpall: Local Rearrange[tuple]{chararray}(false) - scope-3
                |   |
                |   Constant(all) - scope-4
                |
                |---longDesc: Load(/user/sguo/h2o/group_filtered_chunk:LiAvroStorage) - scope-0

2012-07-09 15:47:02,441 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler - File concatenation threshold: 100 optimistic? false
2012-07-09 15:47:02,448 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.CombinerOptimizer - Choosing to move algebraic foreach to combiner
2012-07-09 15:47:02,581 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size before optimization: 1
2012-07-09 15:47:02,581 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size after optimization: 1
#--------------------------------------------------
# Map Reduce Plan
#--------------------------------------------------
MapReduce node scope-10
Map Plan
grpall: Local Rearrange[tuple]{chararray}(false) - scope-22
|   |
|   Project[chararray][0] - scope-23
|
|---cnt: New For Each(false,false)[bag] - scope-11
    |   |
    |   Project[chararray][0] - scope-12
    |   |
    |   POUserFunc(org.apache.pig.builtin.COUNT$Initial)[tuple] - scope-13
    |   |
    |   |---Project[bag][1] - scope-14
    |
    |---Pre Combiner Local Rearrange[tuple]{Unknown} - scope-24
        |
        |---longDesc: Load(/user/sguo/h2o/group_filtered_chunk:LiAvroStorage) - scope-0--------
Combine Plan
grpall: Local Rearrange[tuple]{chararray}(false) - scope-26
|   |
|   Project[chararray][0] - scope-27
|
|---cnt: New For Each(false,false)[bag] - scope-15
    |   |
    |   Project[chararray][0] - scope-16
    |   |
    |   POUserFunc(org.apache.pig.builtin.COUNT$Intermediate)[tuple] - scope-17
    |   |
    |   |---Project[bag][1] - scope-18
    |
    |---POCombinerPackage[tuple]{chararray} - scope-20--------
Reduce Plan
cnt: Store(fakefile:org.apache.pig.builtin.PigStorage) - scope-9
|
|---cnt: New For Each(false)[bag] - scope-8
    |   |
    |   POUserFunc(org.apache.pig.builtin.COUNT$Final)[long] - scope-6
    |   |
    |   |---Project[bag][1] - scope-19
    |
    |---POCombinerPackage[tuple]{chararray} - scope-28--------
Global sort: false
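
For anyone reading the plan above: the CombinerOptimizer line and the COUNT$Initial / COUNT$Intermediate / COUNT$Final steps suggest the count is computed in three phases (map, combine, reduce) rather than by shipping every record to a single reducer. Below is a rough sketch of that idea in plain Python. It is not Pig's actual Java implementation; the function names and the sample splits are made up for illustration, and details such as null handling are left out.

```python
# Hypothetical simulation of a three-phase (algebraic) count.
# Names mirror the COUNT$Initial / $Intermediate / $Final steps in the plan,
# but this is only an illustration, not Pig's real code.

def count_initial(record):
    """Map side: each input record contributes a partial count of 1."""
    return 1

def count_intermediate(partials):
    """Combiner: sum the partial counts produced by one map task."""
    return sum(partials)

def count_final(partials):
    """Reducer: sum the combined partial counts into the global count."""
    return sum(partials)

def simulate_group_all_count(splits):
    """Count records across input splits, combining per split first."""
    combined = []
    for split in splits:
        partials = [count_initial(record) for record in split]
        combined.append(count_intermediate(partials))
    return count_final(combined)

# Made-up input: three splits holding 3, 1 and 0 records.
splits = [["a", "b", "c"], ["d"], []]
total = simulate_group_all_count(splits)
print(total)
```

The point of the sketch is that the reduce step only sees one small number per split instead of every record, which would explain why group all followed by COUNT does not have to be a bottleneck on a single reducer.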

On Tue, Jul 3, 2012 at 9:56 AM, Jonathan Coveney <[EMAIL PROTECTED]> wrote:
