Home | About | Sematext search-lucene.com search-hadoop.com
NEW: Monitor These Apps!
elasticsearch, apache solr, apache hbase, hadoop, redis, casssandra, amazon cloudwatch, mysql, memcached, apache kafka, apache zookeeper, apache storm, ubuntu, centOS, red hat, debian, puppet labs, java, senseiDB
 Search Hadoop and all its subprojects:

Switch to Threaded View
Pig >> mail # dev >> POStatus.STATUS_NULL Semantics


Copy link to this message
-
Re: POStatus.STATUS_NULL Semantics
The main case that motivated my 'no result' interpretation was line
299 in POPackage where flattening empty bags results in STATUS_NULL
and that many of the expression operators do not check STATUS_NULL,
but check result != null instead.

I think we can break this down into two parts, each of which are
probably more easily agreed upon:

1. Is it necessary to have a POStatus just to signify null? Why can't
we just set result = null with STATUS_OK? (This should only arise in
expression plans) (I'd lean towards using null and STATUS_OK)
2. Does it make sense to have a POStatus which signifies an empty
result which should be ignored? Or should we just use the
representation of null for this. This should only arise in
top-level/relational operators, I believe. I don't feel strongly about
this, as long as we're consistent.

-Mark

On Tue, Nov 12, 2013 at 5:30 PM, Daniel Dai <[EMAIL PROTECTED]> wrote:
> Hi, Mark,
> I totally agree we shall optimize the EOP. The optimization I can think of
> now is:
> 1. In tez, we don't bound to one key per call mode, we can return EOP only
> when we exhaust all keys (you've already did that)
> 2. In MR, POJoinPackage currently return a EOP per flattened output, we can
> optimize to use a EOP per key
>
> I still think STATUS_NULL just means a null value. I didn't find evidence
> it is used to mean something else. Ignoring STATUS_NULL in the
> runPipeline() should be right. That's because null tuple in a bag is valid,
> but when we flatten to the top level null is not valid. I did trace some
> join queries in MR code, I didn't find a case where we use STATUS_NULL to
> signal a situation other than a null value. Did you have a scenario where
> we use STATUS_NULL to signal something else?
>
> Thanks,
> Daniel
>
>
> On Tue, Nov 12, 2013 at 4:32 PM, Mark Wagner <[EMAIL PROTECTED]>wrote:
>
>> I'm definitely +1 on not having so many EOPs go through the pipeline
>> in the reducer. I've seen good improvements for maps in Tez, and I
>> suspect a join in classic MR could have similar gains.
>>
>> For STATUS_NULL: I don't agree that the semantics are clear right now.
>> In the PigGenericMapReduce$Reducer::processOnePackageOutput(context)
>> call that you mentioned, package output does have it's return status
>> checked and if it's STATUS_NULL, it is discarded. At one point I
>> suspected maybe STATUS_NULL really did just mean null and all the
>> relational operators were just ignoring null values, but many of the
>> expression operators have checks like (res.returnStatus =>> POStatus.STATUS_OK && res.result != null).
>>
>> So it's certainly confusing to me and it doesn't seem that there's a
>> concrete definition of STATUS_NULL right now. I don't think there's
>> much utility in having a STATUS_NULL that indicates a null value when
>> I can just check res.result. That's why I'm in favor of #2.
>>
>> I have a patch that has passed all of 'test-commit' and is now going
>> through 'test'. I'll post this is in a JIRA + RB and we can have more
>> discussion there.
>>
>> Thanks,
>> Mark
>>
>> On Tue, Nov 12, 2013 at 3:19 PM, Daniel Dai <[EMAIL PROTECTED]> wrote:
>> > Also in continuation to our talk last week, it seems we are doing
>> something
>> > inefficient in MR. I believe that should be a separate issue other than
>> > STATUS_NULL, it is more than a semantic cleanup (actually semantic is
>> clear
>> > in existing code, just need some cleanup).
>> >
>> > Currently what Pig does is:
>> > 1. pull one record from POPackage/POJoinPackage
>> > 2. attach the record to pipeline
>> > 3. pull the leaf of pipeline until EOP
>> > 4. return to #1
>> >
>> > Seems to have a EOP to go through the entire pipeline for each
>> > POPackage/POJoinPackage output is unnecessary. It can be optimized. It
>> > would be interesting to explore the opportunity of the optimization here.
>> >
>> > Thanks,
>> > Daniel
>> >
>> >
>> > On Tue, Nov 12, 2013 at 2:38 PM, Daniel Dai <[EMAIL PROTECTED]>
>> wrote:
>> >
>> >> We shall definitely clear the ambiguity. +1 for #1.
NEW: Monitor These Apps!
elasticsearch, apache solr, apache hbase, hadoop, redis, casssandra, amazon cloudwatch, mysql, memcached, apache kafka, apache zookeeper, apache storm, ubuntu, centOS, red hat, debian, puppet labs, java, senseiDB