Pig >> mail # dev >> CUBE/ROLLUP/GROUPING SETS syntax


Re: CUBE/ROLLUP/GROUPING SETS syntax
Some thoughts on this:

1) +1 to what Dmitriy said on HAVING

2) We need to be clear about separating operators in the grammar from operators in the logical plan and in the physical plan. The choices you make in the grammar are totally independent of the other two. That is, you could choose the syntax:

rel = GROUP rel BY CUBE (a, b, c)

and still have a separate POCube operator. When the parser sees GROUP BY CUBE it will generate an LOCube operator for the logical plan rather than an LOGroup operator, and LOCube can then be translated to its own POCube physical operator. Separate optimizations can be applied to LOGroup vs. LOCube and POGroup vs. POCube.
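
To make the separation concrete, here is a minimal Python sketch of the idea. It is not Pig's parser code (which is Java); the class names simply mirror the operators named above, and `to_logical`, `to_physical`, and the `modifier` argument are hypothetical names used only for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class LOGroup:
    """Stand-in for the logical GROUP operator."""
    rel: str
    cols: Tuple[str, ...]


@dataclass(frozen=True)
class LOCube:
    """Stand-in for the logical CUBE operator."""
    rel: str
    cols: Tuple[str, ...]


def to_logical(rel: str, cols: Tuple[str, ...], modifier: Optional[str] = None):
    """Pick the logical operator from what the parser saw after BY.

    `modifier` is an assumed parser output: "CUBE" for GROUP rel BY CUBE (...),
    None for a plain GROUP rel BY (...).
    """
    if modifier == "CUBE":
        return LOCube(rel, cols)
    return LOGroup(rel, cols)


# Assumed one-to-one translation from logical to physical operator names.
PHYSICAL = {LOGroup: "POGroup", LOCube: "POCube"}


def to_physical(logical_op) -> str:
    """Return the name of the physical operator for a logical operator."""
    return PHYSICAL[type(logical_op)]


if __name__ == "__main__":
    op = to_logical("rel", ("a", "b", "c"), modifier="CUBE")
    print(type(op).__name__, "->", to_physical(op))
```

The surface syntax only decides which branch of `to_logical` is taken; everything after that point is the same whichever spelling the grammar ends up using.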

3) On syntax I can see arguments both for keeping as close to SQL as possible and for the syntax proposed by Prasanth. The argument for sticking close to SQL is that it conforms to the principle of least astonishment. It wouldn't be exactly SQL, as it would end up looking like:

rel = GROUP rel BY CUBE (cols);
rel = GROUP rel BY ROLLUP (cols);
rel = GROUP rel BY GROUPING SETS (cols);

The argument I see for sticking with Prasanth's approach is that GROUP is really short for COGROUP in Pig Latin, and I don't think we're proposing doing COGROUP rel BY CUBE, nor can I see a case where you'd want to do such a thing.  This makes CUBE really a separate operation.  But if we go this route I agree with Prasanth we should do CUBE rel BY ROLLUP and CUBE rel BY GROUPING SETS.  Let's not proliferate operators.
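
Whichever spelling wins, the group-bys to be computed are the same. As a rough check of the semantics in Prasanth's recap quoted below, here is a small Python sketch; the function names `cube_sets` and `rollup_sets` and the `fixed` parameter (the columns that appear in every grouping for the partial forms) are made up for illustration and are not part of any proposal.

```python
from itertools import combinations


def cube_sets(cols, fixed=()):
    """All subsets of `cols`, largest first, each prefixed with `fixed`."""
    cols, fixed = tuple(cols), tuple(fixed)
    result = []
    for size in range(len(cols), -1, -1):
        for combo in combinations(cols, size):
            result.append(fixed + combo)
    return result


def rollup_sets(cols, fixed=()):
    """All prefixes of `cols`, longest first, each prefixed with `fixed`."""
    cols, fixed = tuple(cols), tuple(fixed)
    return [fixed + cols[:n] for n in range(len(cols), -1, -1)]


if __name__ == "__main__":
    print(cube_sets(("a", "b")))                   # CUBE over (a, b)
    print(cube_sets(("b", "c"), fixed=("a",)))     # partial CUBE: a, (b, c)
    print(rollup_sets(("a", "b")))                 # ROLLUP over (a, b)
    print(rollup_sets(("b", "c"), fixed=("a",)))   # partial ROLLUP: a, (b, c)
```

GROUPING SETS needs no helper, since the list of groupings is given explicitly.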

Alan.

On May 29, 2012, at 3:55 PM, Prasanth J wrote:

> Thanks Jonathan for looking into it and for your suggestions.
>
> The reason I came up with a clause rather than a separate operator was to avoid adding more operators to the grammar.
> Adding ROLLUP and GROUPING_SET as operators would require separate logical operators, adding to the complexity. I am planning to keep everything under the cube operator, so only the LOCube and POCube operators will be added. And as you and Dmitriy have mentioned, the purpose of the HAVING clause is the same as FILTER, so we do not need a separate HAVING clause.
>
> I will give a quick recap of the cube-related operations and the syntax options for each. I am also adding partial cubing and partial rollup to this discussion.
>
> 1) CUBE
>
> Current syntax:
> alias = CUBE rel BY (a, b);
>
> The following group-bys will be computed:
> (a, b)
> (a)
> (b)
> ()
>
> 2) Partial CUBE
>
> Proposed syntax:
> alias = CUBE rel BY a, (b, c);
>
> The following group-bys will be computed:
> (a, b, c)
> (a, b)
> (a, c)
> (a)
>
> 3) ROLLUP
>
> Proposed syntax 1:
> alias = CUBE rel BY ROLLUP(a, b);
>
> Proposed syntax 2:
> alias = CUBE rel BY (a::b);
>
> Proposed syntax 3:
> alias = ROLLUP rel BY (a, b);
>
> The following group-bys will be computed:
> (a, b)
> (a)
> ()
>
> 4) Partial ROLLUP
>
> Proposed syntax 1:
> alias = CUBE rel BY a, ROLLUP(b, c);
>
> Proposed syntax 2:
> alias = CUBE rel BY (a, b::c);
>
> Proposed syntax 3:
> alias = ROLLUP rel BY a, (b, c);
>
> The following group-bys will be computed:
> (a, b, c)
> (a, b)
> (a)
>
> 5) GROUPING SETS
>
> Proposed syntax 1:
> alias = CUBE rel BY GROUPING SETS((a), (b, c), (c));
>
> Proposed syntax 2:
> alias = CUBE rel BY {(a), (b, c), (c)};
>
> Proposed syntax 3:
> alias = GROUPING_SET rel BY ((a), (b, c), (c));
>
> The following group-bys will be computed:
> (a)
> (b, c)
> (c)
>
> Please vote for syntax 1, 2 or 3 so that we can come to a consensus before I start hacking the grammar file.
>
> Thanks
> -- Prasanth
>
> On May 29, 2012, at 4:05 PM, Jonathan Coveney wrote:
>
>> Hey Prasanth, happy hacking.
>>
>> My opinion:
>>
>> CUBE:
>>
>> alias = CUBE rel BY (a,b,c);
>>
>>
>> I like that syntax. It's unambiguous what is going on.
>>
>>
>> ROLLUP:
>>
>>
>> alias = CUBE rel BY ROLLUP(a,b,c);
>>
>>
>> I never liked that syntax in SQL. I suggest we just do what we did with CUBE, i.e.
>>
>>
>> alias = ROLLUP rel BY (a,b,c);
>>
>>
>> GROUPING SETS:
>>
>>
>> alias = CUBE rel BY GROUPING SETS((a,b),(b),());
>>
>>
>> I don't like this. The cube vs. grouping sets is confusing to me. maybe