Home | About | Sematext search-lucene.com search-hadoop.com
NEW: Monitor These Apps!
elasticsearch, apache solr, apache hbase, hadoop, redis, casssandra, amazon cloudwatch, mysql, memcached, apache kafka, apache zookeeper, apache storm, ubuntu, centOS, red hat, debian, puppet labs, java, senseiDB
 Search Hadoop and all its subprojects:

Switch to Threaded View
Pig >> mail # dev >> POStatus.STATUS_NULL Semantics


Copy link to this message
-
Re: POStatus.STATUS_NULL Semantics
I'm definitely +1 on not having so many EOPs go through the pipeline
in the reducer. I've seen good improvements for maps in Tez, and I
suspect a join in classic MR could have similar gains.

For STATUS_NULL: I don't agree that the semantics are clear right now.
In the PigGenericMapReduce$Reducer::processOnePackageOutput(context)
call that you mentioned, package output does have it's return status
checked and if it's STATUS_NULL, it is discarded. At one point I
suspected maybe STATUS_NULL really did just mean null and all the
relational operators were just ignoring null values, but many of the
expression operators have checks like (res.returnStatus =POStatus.STATUS_OK && res.result != null).

So it's certainly confusing to me and it doesn't seem that there's a
concrete definition of STATUS_NULL right now. I don't think there's
much utility in having a STATUS_NULL that indicates a null value when
I can just check res.result. That's why I'm in favor of #2.

I have a patch that has passed all of 'test-commit' and is now going
through 'test'. I'll post this is in a JIRA + RB and we can have more
discussion there.

Thanks,
Mark

On Tue, Nov 12, 2013 at 3:19 PM, Daniel Dai <[EMAIL PROTECTED]> wrote:
> Also in continuation to our talk last week, it seems we are doing something
> inefficient in MR. I believe that should be a separate issue other than
> STATUS_NULL, it is more than a semantic cleanup (actually semantic is clear
> in existing code, just need some cleanup).
>
> Currently what Pig does is:
> 1. pull one record from POPackage/POJoinPackage
> 2. attach the record to pipeline
> 3. pull the leaf of pipeline until EOP
> 4. return to #1
>
> Seems to have a EOP to go through the entire pipeline for each
> POPackage/POJoinPackage output is unnecessary. It can be optimized. It
> would be interesting to explore the opportunity of the optimization here.
>
> Thanks,
> Daniel
>
>
> On Tue, Nov 12, 2013 at 2:38 PM, Daniel Dai <[EMAIL PROTECTED]> wrote:
>
>> We shall definitely clear the ambiguity. +1 for #1.
>>
>> For #2, I did some digging into POPackage. What I see is in POPackage we
>> produce STATUS_NULL, but not POJoinPackage. And the return value for
>> POPackage is not used:
>> if (pack instanceof POJoinPackage) {
>>     while (true)
>>     {
>>         ....
>>         if (processOnePackageOutput(context))
>>             break;
>>     }
>> }
>> else {
>>      ....
>>      processOnePackageOutput(context);
>> }
>>
>> So it seems STATUS_NULL is not effectively in use in
>> POPackage/POJoinPackage. We shall always use STATUS_EOP to indicate the end
>> of pull (for both POPackage and pipeline). We shall not use STATUS_NULL
>> anywhere once we make #1 clear. Am I right?
>>
>> Thanks,
>> Daniel
>>
>>
>> On Tue, Nov 12, 2013 at 8:35 AM, Cheolsoo Park <[EMAIL PROTECTED]>wrote:
>>
>>> >> I'll post a patch that makes the second option official and changes all
>>> the examples of the first to return null with STATUS_OK (this is what
>>> happens with the constant null currently).
>>>
>>> I am +1 for this.
>>>
>>>
>>> On Fri, Nov 8, 2013 at 10:16 PM, Mark Wagner <[EMAIL PROTECTED]
>>> >wrote:
>>>
>>> > Hi Devs,
>>> >
>>> > While discussing the Pig on Tez project today, we were unable to reach
>>> > a conclusion on what the semantics of what POStatus.STATUS_NULL are
>>> > and would like to ask those who remember the history of Pig to chime
>>> > in. The two interpretations are:
>>> >
>>> > 1. POStatus.STATUS_NULL indicates that the pulled output IS null. This
>>> > is used in some places like EqualToExpr.
>>> >
>>> > 2. POStatus.STATUS_NULL indicates that the pull did not produce any
>>> > output. This is backed up by it's usage in POPackage for flattening an
>>> > empty bag, and PigGenericMapBase where pulls on the operator pipeline
>>> > that result in STATUS_NULL are discarded.
>>> >
>>> > If nobody has any concrete documentation or explanation for the
>>> > discrepancy, I'll post a patch that makes the second option official
NEW: Monitor These Apps!
elasticsearch, apache solr, apache hbase, hadoop, redis, casssandra, amazon cloudwatch, mysql, memcached, apache kafka, apache zookeeper, apache storm, ubuntu, centOS, red hat, debian, puppet labs, java, senseiDB