Home | About | Sematext search-lucene.com search-hadoop.com
 Search Hadoop and all its subprojects:

Switch to Threaded View
Pig, mail # dev - POStatus.STATUS_NULL Semantics

Copy link to this message
Re: POStatus.STATUS_NULL Semantics
Mark Wagner 2013-11-13, 00:32
I'm definitely +1 on not having so many EOPs go through the pipeline
in the reducer. I've seen good improvements for maps in Tez, and I
suspect a join in classic MR could have similar gains.

For STATUS_NULL: I don't agree that the semantics are clear right now.
In the PigGenericMapReduce$Reducer::processOnePackageOutput(context)
call that you mentioned, package output does have it's return status
checked and if it's STATUS_NULL, it is discarded. At one point I
suspected maybe STATUS_NULL really did just mean null and all the
relational operators were just ignoring null values, but many of the
expression operators have checks like (res.returnStatus =POStatus.STATUS_OK && res.result != null).

So it's certainly confusing to me and it doesn't seem that there's a
concrete definition of STATUS_NULL right now. I don't think there's
much utility in having a STATUS_NULL that indicates a null value when
I can just check res.result. That's why I'm in favor of #2.

I have a patch that has passed all of 'test-commit' and is now going
through 'test'. I'll post this is in a JIRA + RB and we can have more
discussion there.


On Tue, Nov 12, 2013 at 3:19 PM, Daniel Dai <[EMAIL PROTECTED]> wrote:
> Also in continuation to our talk last week, it seems we are doing something
> inefficient in MR. I believe that should be a separate issue other than
> STATUS_NULL, it is more than a semantic cleanup (actually semantic is clear
> in existing code, just need some cleanup).
> Currently what Pig does is:
> 1. pull one record from POPackage/POJoinPackage
> 2. attach the record to pipeline
> 3. pull the leaf of pipeline until EOP
> 4. return to #1
> Seems to have a EOP to go through the entire pipeline for each
> POPackage/POJoinPackage output is unnecessary. It can be optimized. It
> would be interesting to explore the opportunity of the optimization here.
> Thanks,
> Daniel
> On Tue, Nov 12, 2013 at 2:38 PM, Daniel Dai <[EMAIL PROTECTED]> wrote:
>> We shall definitely clear the ambiguity. +1 for #1.
>> For #2, I did some digging into POPackage. What I see is in POPackage we
>> produce STATUS_NULL, but not POJoinPackage. And the return value for
>> POPackage is not used:
>> if (pack instanceof POJoinPackage) {
>>     while (true)
>>     {
>>         ....
>>         if (processOnePackageOutput(context))
>>             break;
>>     }
>> }
>> else {
>>      ....
>>      processOnePackageOutput(context);
>> }
>> So it seems STATUS_NULL is not effectively in use in
>> POPackage/POJoinPackage. We shall always use STATUS_EOP to indicate the end
>> of pull (for both POPackage and pipeline). We shall not use STATUS_NULL
>> anywhere once we make #1 clear. Am I right?
>> Thanks,
>> Daniel
>> On Tue, Nov 12, 2013 at 8:35 AM, Cheolsoo Park <[EMAIL PROTECTED]>wrote:
>>> >> I'll post a patch that makes the second option official and changes all
>>> the examples of the first to return null with STATUS_OK (this is what
>>> happens with the constant null currently).
>>> I am +1 for this.
>>> On Fri, Nov 8, 2013 at 10:16 PM, Mark Wagner <[EMAIL PROTECTED]
>>> >wrote:
>>> > Hi Devs,
>>> >
>>> > While discussing the Pig on Tez project today, we were unable to reach
>>> > a conclusion on what the semantics of what POStatus.STATUS_NULL are
>>> > and would like to ask those who remember the history of Pig to chime
>>> > in. The two interpretations are:
>>> >
>>> > 1. POStatus.STATUS_NULL indicates that the pulled output IS null. This
>>> > is used in some places like EqualToExpr.
>>> >
>>> > 2. POStatus.STATUS_NULL indicates that the pull did not produce any
>>> > output. This is backed up by it's usage in POPackage for flattening an
>>> > empty bag, and PigGenericMapBase where pulls on the operator pipeline
>>> > that result in STATUS_NULL are discarded.
>>> >
>>> > If nobody has any concrete documentation or explanation for the
>>> > discrepancy, I'll post a patch that makes the second option official