The main case that motivated my 'no result' interpretation was line
299 in POPackage where flattening empty bags results in STATUS_NULL
and that many of the expression operators do not check STATUS_NULL,
but check result != null instead.
I think we can break this down into two parts, each of which are
probably more easily agreed upon:
1. Is it necessary to have a POStatus just to signify null? Why can't
we just set result = null with STATUS_OK? (This should only arise in
expression plans) (I'd lean towards using null and STATUS_OK)
2. Does it make sense to have a POStatus which signifies an empty
result which should be ignored? Or should we just use the
representation of null for this. This should only arise in
top-level/relational operators, I believe. I don't feel strongly about
this, as long as we're consistent.
On Tue, Nov 12, 2013 at 5:30 PM, Daniel Dai <[EMAIL PROTECTED]> wrote:
> Hi, Mark,
> I totally agree we shall optimize the EOP. The optimization I can think of
> now is:
> 1. In tez, we don't bound to one key per call mode, we can return EOP only
> when we exhaust all keys (you've already did that)
> 2. In MR, POJoinPackage currently return a EOP per flattened output, we can
> optimize to use a EOP per key
> I still think STATUS_NULL just means a null value. I didn't find evidence
> it is used to mean something else. Ignoring STATUS_NULL in the
> runPipeline() should be right. That's because null tuple in a bag is valid,
> but when we flatten to the top level null is not valid. I did trace some
> join queries in MR code, I didn't find a case where we use STATUS_NULL to
> signal a situation other than a null value. Did you have a scenario where
> we use STATUS_NULL to signal something else?
> On Tue, Nov 12, 2013 at 4:32 PM, Mark Wagner <[EMAIL PROTECTED]>wrote:
>> I'm definitely +1 on not having so many EOPs go through the pipeline
>> in the reducer. I've seen good improvements for maps in Tez, and I
>> suspect a join in classic MR could have similar gains.
>> For STATUS_NULL: I don't agree that the semantics are clear right now.
>> In the PigGenericMapReduce$Reducer::processOnePackageOutput(context)
>> call that you mentioned, package output does have it's return status
>> checked and if it's STATUS_NULL, it is discarded. At one point I
>> suspected maybe STATUS_NULL really did just mean null and all the
>> relational operators were just ignoring null values, but many of the
>> expression operators have checks like (res.returnStatus =>> POStatus.STATUS_OK && res.result != null).
>> So it's certainly confusing to me and it doesn't seem that there's a
>> concrete definition of STATUS_NULL right now. I don't think there's
>> much utility in having a STATUS_NULL that indicates a null value when
>> I can just check res.result. That's why I'm in favor of #2.
>> I have a patch that has passed all of 'test-commit' and is now going
>> through 'test'. I'll post this is in a JIRA + RB and we can have more
>> discussion there.
>> On Tue, Nov 12, 2013 at 3:19 PM, Daniel Dai <[EMAIL PROTECTED]> wrote:
>> > Also in continuation to our talk last week, it seems we are doing
>> > inefficient in MR. I believe that should be a separate issue other than
>> > STATUS_NULL, it is more than a semantic cleanup (actually semantic is
>> > in existing code, just need some cleanup).
>> > Currently what Pig does is:
>> > 1. pull one record from POPackage/POJoinPackage
>> > 2. attach the record to pipeline
>> > 3. pull the leaf of pipeline until EOP
>> > 4. return to #1
>> > Seems to have a EOP to go through the entire pipeline for each
>> > POPackage/POJoinPackage output is unnecessary. It can be optimized. It
>> > would be interesting to explore the opportunity of the optimization here.
>> > Thanks,
>> > Daniel
>> > On Tue, Nov 12, 2013 at 2:38 PM, Daniel Dai <[EMAIL PROTECTED]>
>> >> We shall definitely clear the ambiguity. +1 for #1.