Pig >> mail # user >> little help on reading debug output


Re: little help on reading debug output
Hi,

This is a representation of Pig's physical plan of execution. You can read
more about it at:
http://hadoop.apache.org/pig/docs/r0.7.0/piglatin_ref2.html#EXPLAIN
http://wiki.apache.org/pig/PigExecutionModel

The 7** numbers are IDs that uniquely identify operators (Logical/Physical/MR)
in Pig. They come from NodeIdGenerator.getNextId().
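To make the numbering concrete, here is a minimal sketch of how such IDs could be handed out. This is not Pig's actual NodeIdGenerator; the class name OperatorIdSketch and the key() helper are hypothetical, and I am assuming that in a key like 1-770 the part before the dash is a scope and the part after it is a per-scope counter.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch, not Pig's actual NodeIdGenerator: one counter per scope,
// with keys printed as "<scope>-<id>" (assumed format of the debug output).
class OperatorIdSketch {
    private final Map<String, Long> nextIdPerScope = new HashMap<>();

    // Returns the next unused id for the given scope, starting at 0.
    public synchronized long getNextId(String scope) {
        long id = nextIdPerScope.getOrDefault(scope, 0L);
        nextIdPerScope.put(scope, id + 1);
        return id;
    }

    // Formats a key the way it shows up in the plan dump, e.g. "1-770".
    public static String key(String scope, long id) {
        return scope + "-" + id;
    }

    public static void main(String[] args) {
        OperatorIdSketch ids = new OperatorIdSketch();
        // Every operator created while compiling the script takes the next id,
        // so neighbouring operators in a plan end up with neighbouring numbers.
        String first = key("1", ids.getNextId("1"));
        String second = key("1", ids.getNextId("1"));
        System.out.println(first + " " + second);
    }
}
```

The point is only that the numbers reflect the order in which operators were created, not anything about your script's line numbers.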

Because multiple lines of a Pig script can be compiled into a single
MapReduce job, it is hard to associate this part of the plan with a line
number in the script. "Explain" can tell you more, though.

A lot of functionality in Pig is implemented internally with user-defined
functions (UDFs). Here is a snippet from the code that explains where and
why we use the IsEmpty UDF:
<snip>
public static void addEmptyBagOuterJoin(PhysicalPlan fePlan, Schema inputSchema) throws PlanException {
        // we currently have POProject[bag] as the only operator in the plan
        // If the bag is an empty bag, we should replace
        // it with a bag with one tuple with null fields so that when we flatten
        // we do not drop records (flatten will drop records if the bag is left
        // as an empty bag) and actually project nulls for the fields in
        // the empty bag

        // So we need to get to the following state:
        // POProject[Bag]
        //         \
        //    POUserFunc["IsEmpty()"] Const[Bag](bag with null fields)
        //                        \      |    POProject[Bag]
        //                         \     |    /
        //                          POBinCond
</snip>
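This explains the use of the IsEmpty() UDF: Pig inserts it into the plan
itself when it handles outer joins, even though your script never calls it.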
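
If it helps, the effect of that POBinCond can be sketched in plain Java. This is only an illustration under my own assumptions, not Pig code: a tuple is modelled as a List<Object>, a bag as a list of tuples, and the names EmptyBagSketch, replaceEmptyBag and flatten are made up for the example.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-ins: a "tuple" is a List<Object>, a "bag" is a list of tuples.
class EmptyBagSketch {
    // Mirrors the bincond in the plan: IsEmpty(bag) ? {(null, ...)} : bag
    public static List<List<Object>> replaceEmptyBag(List<List<Object>> bag, int fieldCount) {
        if (bag.isEmpty()) {
            List<Object> nullTuple =
                new ArrayList<>(Collections.<Object>nCopies(fieldCount, null));
            return Collections.singletonList(nullTuple);
        }
        return bag;
    }

    // Flatten: emit one output row per tuple in the bag, prefixed with the key.
    public static List<List<Object>> flatten(Object key, List<List<Object>> bag) {
        List<List<Object>> rows = new ArrayList<>();
        for (List<Object> tuple : bag) {
            List<Object> row = new ArrayList<>();
            row.add(key);
            row.addAll(tuple);
            rows.add(row);
        }
        return rows;
    }

    public static void main(String[] args) {
        List<List<Object>> empty = new ArrayList<>();
        // Flattening the empty bag directly drops the record for this key.
        System.out.println(flatten("k", empty).size());            // prints 0
        // With the substitution, the key survives with null fields.
        System.out.println(flatten("k", replaceEmptyBag(empty, 2))); // prints [[k, null, null]]
    }
}
```

Without the substitution, a key with no match on the other side would vanish after the flatten, which is not what an outer join should do.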

Hope it helps.

Thanks,
Aniket

On Wed, June 16, 2010 2:52 pm, Corbin Hoenes wrote:
> Is there any documentation on how to read this output? When I 'set debug
> on' I get the following in my reducer syslog:
>
> DEBUG: org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Reduce - New For Each(true,true)[tuple] - 1-770
> |   |
> |   POBinCond[bag] - 1-768
> |   |
> |   |---Project[bag][1] - 1-764
> |   |
> |   |---POUserFunc(org.apache.pig.builtin.IsEmpty)[boolean] - 1-766
> |   |   |
> |   |   |---Project[bag][1] - 1-765
> |   |
> |   |---Constant({()}) - 1-767
> |   |
> |   Project[bag][2] - 1-769
> DEBUG: org.apache.pig.data.InternalCachedBag - Memory can hold 45450 records, put the rest in spill file.
> DEBUG: org.apache.pig.data.InternalCachedBag - Memory can hold 45192 records, put the rest in spill file.
> DEBUG: org.apache.pig.data.InternalCachedBag - Memory can hold 44852 records, put the rest in spill file
>
>
> Specifically, what do the 1-7** numbers mean? Is it possible to get line
> numbers from the pig script? :) It also seems strange that POUserFunc is
> telling me we are running the IsEmpty UDF, but that UDF isn't called
> anywhere in this script. Is it possible Pig is using it under the covers?
>
>
>