Pig, mail # dev - CUBE/ROLLUP/GROUPING SETS syntax


Re: CUBE/ROLLUP/GROUPING SETS syntax
Dmitriy Ryaboy 2012-06-22, 20:14
One happens on the mapper.
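To make the flatten-versus-CROSS distinction concrete, here is a rough Python sketch of the per-record expansion discussed in the quoted thread below. It is only an illustration under stated assumptions: the helper names (cube_dimensions, rollup_dimensions, flatten_bags, expand_record) and the dict-shaped record are hypothetical and are not Pig APIs, and None stands in for a null dimension. The point it tries to show is that every output row is derived from a single input record, so the product can be computed without bringing other records together.

```python
from itertools import product


def cube_dimensions(values):
    """Return all 2^n combinations of values, with omitted dimensions set to None."""
    return [
        tuple(v if keep else None for v, keep in zip(values, mask))
        for mask in product((True, False), repeat=len(values))
    ]


def rollup_dimensions(values):
    """Return the hierarchical prefixes of values, padding the tail with None."""
    n = len(values)
    return [tuple(values[:k]) + (None,) * (n - k) for k in range(n, -1, -1)]


def flatten_bags(*bags):
    """Cross product of bags that all belong to ONE record (what flatten does)."""
    return [sum(combo, ()) for combo in product(*bags)]


def expand_record(record):
    """Sketch of CUBE(a,b,c), ROLLUP(c,d), CUBE(e,f) applied to a single record."""
    bags = (
        cube_dimensions((record["a"], record["b"], record["c"])),  # 8 tuples
        rollup_dimensions((record["c"], record["d"])),             # 3 tuples
        cube_dimensions((record["e"], record["f"])),               # 4 tuples
    )
    return flatten_bags(*bags)


if __name__ == "__main__":
    # Hypothetical record with fields a..f
    rows = expand_record({"a": 1, "b": 2, "c": 3, "d": 4, "e": 5, "f": 6})
    print(len(rows))
```

Each record expands to 8 * 3 * 4 rows no matter how many other records exist, which is why this kind of expansion is record-local. CROSS, by contrast, pairs every tuple of one relation with every tuple of another, so it has to move data between tasks.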

On Thu, Jun 21, 2012 at 2:52 PM, Prasanth J <[EMAIL PROTECTED]> wrote:
> Thanks Alan.
> Your suggestion looks correct.
>
> I think with this I can achieve what I wanted in the same syntax
> out = CUBE rel BY CUBE(a,b,c), ROLLUP(c,d), CUBE(e,f);
>
> Just curious to know.
> How is this different from CROSS, and why is CROSS expensive compared to flatten?
>
> Thanks
> -- Prasanth
>
> On Jun 21, 2012, at 5:11 PM, Alan Gates wrote:
>
>> I think I'm missing something here.  The result of the "out =" line is three bags, correct?  If that's the case, the cross product you want is achieved by doing:
>>
>> result = foreach out generate flatten($0), flatten($1), flatten($2)
>>
>> This is not the same as CROSS, which would be expensive.
>>
>> Alan.
>>
>> On Jun 21, 2012, at 1:28 PM, Prasanth J wrote:
>>
>>> Hello all
>>>
>>> I initially implemented ROLLUP as a separate operation with the following syntax
>>>
>>> a = ROLLUP inp BY (x,y);
>>>
>>> which does the same thing as CUBE (inserting a foreach + group-by into the logical plan) except that it uses the RollupDimensions UDF. The issue with this approach is that we cannot mix CUBE and ROLLUP operations in the same statement, which is a typical use case. SQL/Oracle supports using CUBE and ROLLUP together, like
>>>
>>> GROUP BY CUBE(a,b,c), ROLLUP(c,d), CUBE(e,f);
>>>
>>> so I modified the Pig grammar to support similar usage. Now we can use a syntax similar to SQL:
>>>
>>> out = CUBE rel BY CUBE(a,b,c), ROLLUP(c,d), CUBE(e,f);
>>>
>>> In this approach, the logical plan should introduce a Cartesian product between the bags generated by CUBE(a,b,c), ROLLUP(c,d) and CUBE(e,f) to generate the final output. But I read in the documentation (http://pig.apache.org/docs/r0.10.0/basic.html#cross) that CROSS is an expensive operator, and it advises using it sparingly.
>>>
>>> Is there a less expensive way to achieve the Cartesian product? Also, does anyone have thoughts about this new syntax?
>>>
>>> Thanks
>>> -- Prasanth
>>>
>>> On May 30, 2012, at 8:10 PM, Jonathan Coveney wrote:
>>>
>>>> As far as the underlying implementation, if they all use the same
>>>> optimizations that you use in cube, then it can be LOCube. If they have
>>>> their own optimizations etc (or could), it may be worth them having their
>>>> own Logical operators (which might just be LOCube with flags for the time
>>>> being) that allows us more flexibility. But I suppose that's between you,
>>>> eclipse, and your GSOC mentor.
>>>>
>>>> 2012/5/30 Prasanth J <[EMAIL PROTECTED]>
>>>>
>>>>> Thanks Alan and Jon for expressing your views.
>>>>>
>>>>> I agree with Jon's point: if the syntax contains CUBE, then the user expects it
>>>>> to perform a CUBE operation. So Jon's syntax seems more meaningful and concise:
>>>>>
>>>>> rel = CUBE rel BY (dims);
>>>>> rel = ROLLUP rel BY (dims);
>>>>> rel = GROUPING_SET rel BY (dims);
>>>>>
>>>>> The 2 reasons why I prefer not to use SQL syntax are
>>>>> 1) I do not want to break into existing Group operator implementation :)
>>>>> 2) The syntax gets longer in case of partial hierarchical cubing/rollups
>>>>> For ex:
>>>>>
>>>>> rel = GROUP rel BY dim0, ROLLUP(dim1, dim2, dim3), ROLLUP(dim4,dim5,dim6),
>>>>> ROLLUP(dim7,dim8,dim9);
>>>>>
>>>>> whereas the same thing can be expressed as
>>>>>
>>>>> rel = ROLLUP rel BY dim0,
>>>>> (dim1,dim2,dim3),(dim4,dim5,dim6),(dim7,dim8,dim9);
>>>>>
>>>>> Thanks Alan for pointing out the way for independently managing the
>>>>> operators in parser and logical/physical plan. So for all these operators
>>>>> (CUBE, ROLLUP, GROUPING_SET) I can just generate LOCube and use flags to
>>>>> differentiate between these three operations.
>>>>>
>>>>> But, yes we are proliferating operators in this case.
>>>>>
>>>>> Thanks
>>>>> -- Prasanth
>>>>>
>>>>> On May 30, 2012, at 4:42 PM, Alan Gates wrote:
>>>>>
>>>>>>
>>>>>> On May 30, 2012, at 10:43 AM, Jonathan Coveney wrote:
>>>>>>
>>>>>>> I was going to say the same thing Alan said w.r.t. operators: operators