Re: CUBE/ROLLUP/GROUPING SETS syntax

On May 30, 2012, at 10:43 AM, Jonathan Coveney wrote:

> I was going to say the same thing Alan said w.r.t. operators: operators in
> the grammar can correspond to whatever logical and physical operators you
> want.
>
> As far as the principle of least astonishment compared to SQL... Pig is
> already pretty astonishing. I don't know why we would bend over backwards
> to make the syntax so similar in this case when even getting to the point
> of doing a CUBE means understanding an object model that is pretty
> different from SQL.
>
> On that note,
>
> rel = CUBE rel BY GROUPING SETS(cols);
>
> seems really confusing. I'd much rather overload the group operator than
> the cube operator. If I see "cube," I expect a cube. If you start doing
> rollups etc., that's not a cube, it's a group. Or it's just a rollup. Pig
> Latin is simple enough that I don't think having a rollup, group_set, etc.
> operator will be so confusing, because they're already going to be typing
> that stuff in the context of
>
> group rel by rollup(cols); and so on. I don't see how it's worth adding
> more, confusing syntax for the sake of creating parallels with a language
> we now share very little with.

Fair points.

>
> But I won't beat it any further... if people prefer a different syntax,
> that's fine. Just excited to have the features in Pig!

+1, I can live with any of the 3 syntax choices (near SQL, original, and Jon's).

Alan.

> Jon
>
> 2012/5/30 Alan Gates <[EMAIL PROTECTED]>
>
>> Some thoughts on this:
>>
>> 1) +1 to what Dmitriy said on HAVING
>>
>> 2) We need to be clear about separating operators in the grammar versus
>> logical plan versus physical plan.  The choices you make in the grammar are
>> totally independent of the other two.  That is, you could choose the syntax:
>>
>> rel = GROUP rel BY CUBE (a, b, c)
>>
>> and still have a separate POCube operator.  When the parser sees GROUP BY
>> CUBE it will generate an LOCube operator for the logical plan rather than
>> an LOGroup operator.  You can still have a separate POCube physical
>> operator.  Separate optimizations can be applied to LOGroup vs. LOCube and
>> POGroup vs. POCube.
>>
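A toy illustration of point 2, not Pig's actual parser: the surface syntax can stay under GROUP while the parser still emits a distinct logical operator. The LOGroup and LOCube names come from the message above; the regex, the dataclass fields, and the to_logical_op function are hypothetical and exist only for this sketch.

```python
import re
from dataclasses import dataclass


# Stand-ins for the logical operators named in the thread (fields are assumed).
@dataclass
class LOGroup:
    rel: str
    cols: tuple


@dataclass
class LOCube:
    rel: str
    cols: tuple


# Hypothetical, heavily simplified grammar: alias = GROUP rel BY [CUBE] (cols)
_STMT = re.compile(
    r"^\s*\w+\s*=\s*GROUP\s+(\w+)\s+BY\s+(CUBE\s*)?\(([^)]*)\)\s*;?\s*$",
    re.IGNORECASE,
)


def to_logical_op(stmt):
    """Pick the logical operator from the syntax; the keyword stays GROUP."""
    match = _STMT.match(stmt)
    if match is None:
        raise ValueError("unsupported statement: " + stmt)
    rel, cube_kw, col_text = match.groups()
    cols = tuple(c.strip() for c in col_text.split(",") if c.strip())
    # GROUP ... BY CUBE (...) yields LOCube; a plain GROUP ... BY (...) yields LOGroup.
    return LOCube(rel, cols) if cube_kw else LOGroup(rel, cols)
```

The same dispatch could map onto separate physical operators later; the grammar choice does not constrain that.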
>> 3) On syntax I can see arguments for keeping as close to SQL as possible
>> and for the syntax proposed by Prasanth.  The argument for sticking close
>> to SQL is it conforms to the law of least astonishment.  It wouldn't be
>> exactly SQL, as it would end up looking like:
>>
>> rel = GROUP rel BY CUBE (cols)
>> rel = GROUP rel BY ROLLUP (cols)
>> rel = GROUP rel BY GROUPING SETS(cols);
>>
>> The argument I see for sticking with Prasanth's approach is that GROUP is
>> really short for COGROUP in Pig Latin, and I don't think we're proposing
>> doing COGROUP rel BY CUBE, nor can I see a case where you'd want to do such
>> a thing.  This makes CUBE really a separate operation.  But if we go this
>> route I agree with Prasanth we should do CUBE rel BY ROLLUP and CUBE rel BY
>> GROUPING SETS.  Let's not proliferate operators.
>>
>> Alan.
>>
>> On May 29, 2012, at 3:55 PM, Prasanth J wrote:
>>
>>> Thanks Jonathan for looking into it and for your suggestions.
>>>
>>> The reason why I came up with a clause rather than a separate operator was
>>> to avoid adding additional operators to the grammar.
>>> So adding ROLLUP, GROUPING_SET will need separate logical operators,
>>> adding to the complexity. I am planning to keep everything under the cube
>>> operator, so only LOCube and POCube operators will be added additionally.
>>> And as you and Dmitriy have mentioned, the purpose of the HAVING clause is
>>> the same as FILTER, so we do not need a separate HAVING clause.
>>>
>>> I will give a quick recap of cube-related operations and multiple syntax
>>> options for achieving the same. I am also adding partial cubing and rollup
>>> in this discussion.
>>>
>>> 1) CUBE
>>>
>>> Current syntax:
>>> alias = CUBE rel BY (a, b);
>>>
>>> The following group-bys will be computed:
>>> (a, b)
>>> (a)
>>> (b)
>>> ()
>>>
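The list of group-bys above is every subset of the cubed columns. A minimal Python sketch of that enumeration follows; the function name cube_groupings is hypothetical, and this only illustrates which groupings result, not how Pig computes them.

```python
from itertools import combinations


def cube_groupings(cols):
    """Return every subset of cols, largest first, as a CUBE would group by."""
    cols = tuple(cols)
    return [
        combo
        for size in range(len(cols), -1, -1)  # from the full key down to the empty key
        for combo in combinations(cols, size)
    ]


# For CUBE rel BY (a, b) this gives (a, b), (a), (b), and the empty grouping ().
groupings = cube_groupings(("a", "b"))
```

For n columns this produces 2^n groupings, which is why the choice of dimensions matters for cost.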
>>> 2) Partial CUBE