Pig >> mail # dev >> CUBE/ROLLUP/GROUPING SETS syntax


Re: CUBE/ROLLUP/GROUPING SETS syntax
Thanks Alan and Jon for expressing your views.

I agree with Jon's point: if the syntax contains CUBE, then the user expects it to perform a CUBE operation. So Jon's syntax seems more meaningful and concise:

rel = CUBE rel BY (dims);
rel = ROLLUP rel BY (dims);
rel = GROUPING_SET rel BY (dims);
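
To make the difference between the three operators concrete, here is a minimal Python sketch of the grouping sets each one would enumerate, assuming they follow the usual SQL meaning (CUBE = every subset of the dims, ROLLUP = every prefix, GROUPING_SET = exactly the sets listed). The function names are made up for illustration and are not part of Pig.

```python
from itertools import combinations


def cube_sets(dims):
    """CUBE: all 2^n subsets of dims, from the full set down to the grand total ()."""
    return [c for r in range(len(dims), -1, -1) for c in combinations(dims, r)]


def rollup_sets(dims):
    """ROLLUP: the n+1 prefixes of dims, from the full list down to the grand total ()."""
    return [tuple(dims[:i]) for i in range(len(dims), -1, -1)]


def grouping_sets(sets):
    """GROUPING_SET: exactly the sets the user listed, nothing derived."""
    return [tuple(s) for s in sets]


if __name__ == "__main__":
    print(cube_sets(["a", "b"]))         # 4 grouping sets
    print(rollup_sets(["a", "b", "c"]))  # 4 grouping sets
    print(grouping_sets([["a"], ["a", "b"]]))
```

Under these assumptions, only CUBE produces every combination, which is why a statement that starts with CUBE but computes a rollup can surprise the reader.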

I have two reasons for not preferring the SQL syntax:
1) I do not want to break into the existing GROUP operator implementation :)
2) The syntax gets longer for partial hierarchical cubing/rollups.
For example:

rel = GROUP rel BY dim0, ROLLUP(dim1, dim2, dim3), ROLLUP(dim4,dim5,dim6), ROLLUP(dim7,dim8,dim9);

whereas the same thing can be expressed as

rel = ROLLUP rel BY dim0, (dim1,dim2,dim3),(dim4,dim5,dim6),(dim7,dim8,dim9);
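
For reference, a small Python sketch of what such a partial rollup expands to, assuming the SQL behaviour for concatenated groupings: the plain dimension appears in every grouping set, and the ROLLUP groups are combined as a cross product of their prefixes. Whether the Pig operator would follow exactly this rule is an assumption here, and the helper names are hypothetical.

```python
from itertools import product


def rollup_prefixes(dims):
    """Prefixes of one ROLLUP group, from the full list down to the empty tuple."""
    return [tuple(dims[:i]) for i in range(len(dims), -1, -1)]


def partial_rollup(plain_dims, rollup_groups):
    """Grouping sets for: plain_dims, ROLLUP(g1), ROLLUP(g2), ...

    Assumes SQL-style semantics: every plain dim is always grouped on, and the
    rollup groups contribute the cross product of their prefixes.
    """
    combos = product(*(rollup_prefixes(g) for g in rollup_groups))
    return [
        tuple(plain_dims) + tuple(d for part in combo for d in part)
        for combo in combos
    ]


if __name__ == "__main__":
    sets = partial_rollup(
        ["dim0"],
        [["dim1", "dim2", "dim3"], ["dim4", "dim5", "dim6"], ["dim7", "dim8", "dim9"]],
    )
    print(len(sets))  # each 3-dim rollup has 4 prefixes, so 4 * 4 * 4 sets
    print(sets[-1])   # the coarsest set keeps only the plain dimension
```

Both spellings describe the same set of groupings under this assumption; the shorter form only drops the repeated keyword.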

Thanks, Alan, for pointing out that the operators can be managed independently in the parser and in the logical/physical plans. So for all three operators (CUBE, ROLLUP, GROUPING_SET) I can just generate LOCube and use flags to differentiate between the operations.

But yes, we are proliferating operators in this case.
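
Roughly what I have in mind, as a Python sketch rather than actual Pig code: three keywords in the grammar, one logical operator, and a flag that records which operation was requested. The class and function names below (LogicalCube, CubeMode, build_logical_op) are hypothetical stand-ins, not the real LOCube API.

```python
from dataclasses import dataclass
from enum import Enum


class CubeMode(Enum):
    """Flag that distinguishes the three operations on one logical operator."""
    CUBE = "CUBE"
    ROLLUP = "ROLLUP"
    GROUPING_SET = "GROUPING_SET"


@dataclass(frozen=True)
class LogicalCube:
    """Hypothetical stand-in for a single LOCube-style logical operator."""
    input_alias: str
    dims: tuple
    mode: CubeMode


def build_logical_op(keyword, input_alias, dims):
    """Map a grammar keyword to the shared logical operator.

    Raises KeyError for a keyword that is not one of the three operations.
    """
    return LogicalCube(input_alias, tuple(dims), CubeMode[keyword.upper()])


if __name__ == "__main__":
    for kw in ("CUBE", "ROLLUP", "GROUPING_SET"):
        print(build_logical_op(kw, "rel", ["a", "b", "c"]))
```

The grammar still exposes three separate operators, but the plan only has to know about one operator type plus the flag.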

Thanks
-- Prasanth

On May 30, 2012, at 4:42 PM, Alan Gates wrote:

>
> On May 30, 2012, at 10:43 AM, Jonathan Coveney wrote:
>
>> I was going to say the same thing Alan said w.r.t. operators: operators in
>> the grammar can correspond to whatever logical and physical operators you
>> want.
>>
>> As far as the principle of least astonishment compared to SQL... Pig is
>> already pretty astonishing. I don't know why we would bend over backwards
>> to make the syntax so similar in this case when even getting to the point
>> of doing a CUBE means understanding an object model that is pretty
>> different from SQL.
>>
>> On that note,
>>
>> rel = CUBE rel BY GROUPING SETS(cols);
>>
>> seems really confusing. I'd much rather overload the group operator than
>> the cube operator. If I see "cube," I expect a cube. If you start doing
>> rollups etc, that's not a cube, it's a group. Or it's just a rollup. Pig
>> latin is simple enough that I don't think having a rollup, group_set, etc
>> operator will be so confusing, because they're already going to be typing
>> that stuff in the context of
>>
>> group rel by rollup(cols); and so on. I don't see how it's worth adding
>> more, confusing syntax for the sake of creating parallels with a language
>> we now share very little with.
>
> Fair points.
>
>>
>> But I won't beat it any further... if people prefer a different syntax,
>> that's fine. Just excited to have the features in Pig!
> +1, I can live with any of the 3 syntax choices (near SQL, original, and Jon's).
>
> Alan.
>
>> Jon
>>
>> 2012/5/30 Alan Gates <[EMAIL PROTECTED]>
>>
>>> Some thoughts on this:
>>>
>>> 1) +1 to what Dmitriy said on HAVING
>>>
>>> 2) We need to be clear about separating operators in the grammar versus
>>> logical plan versus physical plan.  The choices you make in the grammar are
>>> totally independent of the other two.  That is, you could choose the syntax:
>>>
>>> rel = GROUP rel BY CUBE (a, b, c)
>>>
>>> and still have a separate POCube operator.  When the parser sees GROUP BY
>>> CUBE it will generate an LOCube operator for the logical plan rather than
>>> an LOGroup operator.  You can still have a separate POCube physical
>>> operator.  Separate optimizations can be applied to LOGroup vs. LOCube and
>>> POGroup vs. POCube.
>>>
>>> 3) On syntax I can see arguments for keeping as close to SQL as possible
>>> and for the syntax proposed by Prasanth.  The argument for sticking close
>>> to SQL is it conforms to the law of least astonishment.  It wouldn't be
>>> exactly SQL, as it would end up looking like:
>>>
>>> rel = GROUP rel BY CUBE (cols)
>>> rel = GROUP rel BY ROLLUP (cols)
>>> rel = GROUP rel BY GROUPING SETS(cols);
>>>
>>> The argument I see for sticking with Prasanth's approach is that GROUP is
>>> really short for COGROUP in Pig Latin, and I don't think we're proposing
>>> doing COGROUP rel BY CUBE, nor can I see a case where you'd want to do such
>>> a thing.  This makes CUBE really a separate operation.  But if we go this