Pig, mail # dev - CUBE/ROLLUP/GROUPING SETS syntax


Re: CUBE/ROLLUP/GROUPING SETS syntax
Prasanth J 2012-06-21, 20:28
Hello all

I initially implemented ROLLUP as a separate operation with the following syntax:

a = ROLLUP inp BY (x,y);

which does the same thing as CUBE (inserting a foreach + group-by in the logical plan) except that it uses the RollupDimensions UDF. The issue with this approach is that we cannot mix CUBE and ROLLUP operations in the same statement, which is a typical use case. SQL/Oracle supports using CUBE and ROLLUP together, like:

GROUP BY CUBE(a,b,c), ROLLUP(c,d), CUBE(e,f);

So I modified the Pig grammar to support similar usage. Now we can use a syntax similar to SQL:

out = CUBE rel BY CUBE(a,b,c), ROLLUP(c,d), CUBE(e,f);

In this approach, the logical plan should introduce a cartesian product between the bags generated by CUBE(a,b,c), ROLLUP(c,d) and CUBE(e,f) to generate the final output. But I read in the documentation (http://pig.apache.org/docs/r0.10.0/basic.html#cross) that CROSS is an expensive operator, and the documentation advises using it sparingly.
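To make the combination concrete, here is a rough Python sketch (not Pig code, and not the actual LOCube implementation) of how the grouping sets multiply when CUBE and ROLLUP clauses are combined. The helper names cube_sets, rollup_sets, combine and expand_key are made up for illustration. It assumes SQL-style semantics, where each clause contributes a list of grouping sets and the clauses are combined by cross product, without removing duplicates.

```python
from itertools import combinations, product

def cube_sets(dims):
    """All 2^n subsets of dims, largest first, keeping dimension order."""
    return [c for r in range(len(dims), -1, -1) for c in combinations(dims, r)]

def rollup_sets(dims):
    """The n+1 prefixes of dims, longest first."""
    return [tuple(dims[:i]) for i in range(len(dims), -1, -1)]

def combine(*specs):
    """Cross product of grouping-set lists: pick one set from each
    clause and concatenate them. Duplicates are not removed."""
    return [sum(choice, ()) for choice in product(*specs)]

def expand_key(row, dims, grouping_sets):
    """For one input row, emit one key per grouping set, with None in
    place of every dimension that is aggregated away."""
    return [tuple(row[d] if d in gs else None for d in dims)
            for gs in grouping_sets]

# CUBE(a,b,c), ROLLUP(c,d), CUBE(e,f) -> 8 * 3 * 4 grouping sets
sets = combine(cube_sets(["a", "b", "c"]),
               rollup_sets(["c", "d"]),
               cube_sets(["e", "f"]))
print(len(sets))

# ROLLUP over (x, y) for a single row
print(expand_key({"x": 1, "y": 2}, ["x", "y"], rollup_sets(["x", "y"])))
```

Note that the cross product in this sketch is taken over lists of dimension names, which stay small (96 sets here), rather than over the data itself. Whether the logical plan can exploit that is a separate question.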

Is there a less expensive way to achieve the cartesian product? Also, does anyone have thoughts about this new syntax?

Thanks
-- Prasanth

On May 30, 2012, at 8:10 PM, Jonathan Coveney wrote:

> As far as the underlying implementation, if they all use the same
> optimizations that you use in cube, then it can be LOCube. If they have
> their own optimizations etc (or could), it may be worth them having their
> own Logical operators (which might just be LOCube with flags for the time
> being) that allows us more flexibility. But I suppose that's between you,
> eclipse, and your GSOC mentor.
>
> 2012/5/30 Prasanth J <[EMAIL PROTECTED]>
>
>> Thanks Alan and Jon for expressing your views.
>>
>> I agree with Jon's point: if the syntax contains CUBE, then the user expects
>> it to perform a CUBE operation. So Jon's syntax seems more meaningful and concise:
>>
>> rel = CUBE rel BY (dims);
>> rel = ROLLUP rel BY (dims);
>> rel = GROUPING_SET rel BY (dims);
>>
>> Two reasons why I prefer not to use the SQL syntax:
>> 1) I do not want to break into the existing Group operator implementation :)
>> 2) The syntax gets longer in the case of partial hierarchical cubing/rollups.
>> For example:
>>
>> rel = GROUP rel BY dim0, ROLLUP(dim1, dim2, dim3), ROLLUP(dim4,dim5,dim6),
>> ROLLUP(dim7,dim8,dim9);
>>
>> whereas the same thing can be expressed as
>>
>> rel = ROLLUP rel BY dim0,
>> (dim1,dim2,dim3),(dim4,dim5,dim6),(dim7,dim8,dim9);
>>
>> Thanks Alan for pointing out the way for independently managing the
>> operators in parser and logical/physical plan. So for all these operators
>> (CUBE, ROLLUP, GROUPING_SET) I can just generate LOCube and use flags to
>> differentiate between these three operations.
>>
>> But, yes we are proliferating operators in this case.
>>
>> Thanks
>> -- Prasanth
>>
>> On May 30, 2012, at 4:42 PM, Alan Gates wrote:
>>
>>>
>>> On May 30, 2012, at 10:43 AM, Jonathan Coveney wrote:
>>>
>>>> I was going to say the same thing Alan said w.r.t. operators: operators in
>>>> the grammar can correspond to whatever logical and physical operators you
>>>> want.
>>>>
>>>> As far as the principle of least astonishment compared to SQL... Pig is
>>>> already pretty astonishing. I don't know why we would bend over backwards
>>>> to make the syntax so similar in this case when even getting to the point
>>>> of doing a CUBE means understanding an object model that is pretty
>>>> different from SQL.
>>>>
>>>> On that note,
>>>>
>>>> rel = CUBE rel BY GROUPING SETS(cols);
>>>>
>>>> seems really confusing. I'd much rather overload the group operator than
>>>> the cube operator. If I see "cube," I expect a cube. If you start doing
>>>> rollups etc, that's not a cube, it's a group. Or it's just a rollup. Pig
>>>> Latin is simple enough that I don't think having a rollup, group_set, etc
>>>> operator will be so confusing, because they're already going to be typing
>>>> that stuff in the context of
>>>>
>>>> group rel by rollup(cols); and so on. I don't see how it's worth adding