Pig >> mail # dev >> CUBE/ROLLUP/GROUPING SETS syntax


Re: CUBE/ROLLUP/GROUPING SETS syntax
I think I'm missing something here.  The result of the "out =" line is three bags, correct?  If that's the case, the cross product you want is achieved by doing:

result = foreach out generate flatten($0), flatten($1), flatten($2)

This is not the same as CROSS, which would be expensive.
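
To illustrate the difference, here is a minimal Python sketch (not Pig code) of what flattening several bags in one foreach does for a single input record: the bags are combined within that record only, so no relation-level CROSS is involved. The helpers cube_dims, rollup_dims and flatten_bags are hypothetical names made up for this example, None stands in for Pig's null, and the sample values are invented.

```python
from itertools import product


def cube_dims(values):
    """All 2^n combinations of the dimensions; None marks a rolled-up dimension."""
    masks = product([True, False], repeat=len(values))
    return [tuple(v if keep else None for v, keep in zip(values, mask))
            for mask in masks]


def rollup_dims(values):
    """The n+1 hierarchical prefixes of the dimensions, padded with None."""
    n = len(values)
    return [tuple(values[:k]) + (None,) * (n - k) for k in range(n, -1, -1)]


def flatten_bags(*bags):
    """Per-record cross product of the bags, like several flatten() calls in one foreach."""
    return [sum(combo, ()) for combo in product(*bags)]


# One hypothetical input record with a=1, b=2, c=3, d=4, e=5, f=6.
rows = flatten_bags(
    cube_dims((1, 2, 3)),   # CUBE(a,b,c)   -> 8 tuples
    rollup_dims((3, 4)),    # ROLLUP(c,d)   -> 3 tuples
    cube_dims((5, 6)),      # CUBE(e,f)     -> 4 tuples
)
print(len(rows))  # 96
```

Each input record expands to 8 * 3 * 4 = 96 output tuples in this sketch, and the work stays local to the record.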

Alan.

On Jun 21, 2012, at 1:28 PM, Prasanth J wrote:

> Hello all
>
> I initially implemented ROLLUP as a separate operation with the following syntax
>
> a = ROLLUP inp BY (x,y);
>
> which does the same thing as CUBE (inserting foreach + group-by in the logical plan) except that it uses the RollupDimensions UDF. The issue with this approach is that we cannot mix CUBE and ROLLUP operations in the same statement, which is a typical use case. SQL/Oracle supports using CUBE and ROLLUP together, like
>
> GROUP BY CUBE(a,b,c), ROLLUP(c,d), CUBE(e,f);
>
> so I modified the Pig grammar to support similar usage. Now we can use a syntax similar to SQL:
>
> out = CUBE rel BY CUBE(a,b,c), ROLLUP(c,d), CUBE(e,f);
>
> In this approach, the logical plan should introduce a cartesian product between the bags generated by CUBE(a,b,c), ROLLUP(c,d) and CUBE(e,f) to generate the final output. But I read in the documentation (http://pig.apache.org/docs/r0.10.0/basic.html#cross) that CROSS is an expensive operator and that it should be used sparingly.
>
> Is there a less expensive way to achieve the cartesian product? Also, does anyone have thoughts about this new syntax?
>
> Thanks
> -- Prasanth
>
> On May 30, 2012, at 8:10 PM, Jonathan Coveney wrote:
>
>> As far as the underlying implementation, if they all use the same
>> optimizations that you use in cube, then it can be LOCube. If they have
>> their own optimizations etc (or could), it may be worth them having their
>> own Logical operators (which might just be LOCube with flags for the time
>> being) that allows us more flexibility. But I suppose that's between you,
>> eclipse, and your GSOC mentor.
>>
>> 2012/5/30 Prasanth J <[EMAIL PROTECTED]>
>>
>>> Thanks Alan and Jon for expressing your views.
>>>
>>> I agree with Jon's point: if the syntax contains CUBE then the user expects it
>>> to perform a CUBE operation. So Jon's syntax seems more meaningful and concise:
>>>
>>> rel = CUBE rel BY (dims);
>>> rel = ROLLUP rel BY (dims);
>>> rel = GROUPING_SET rel BY (dims);
>>>
>>> The two reasons why I prefer not to use the SQL syntax are:
>>> 1) I do not want to break into existing Group operator implementation :)
>>> 2) The syntax gets longer in case of partial hierarchical cubing/rollups
>>> For ex:
>>>
>>> rel = GROUP rel BY dim0, ROLLUP(dim1, dim2, dim3), ROLLUP(dim4,dim5,dim6),
>>> ROLLUP(dim7,dim8,dim9);
>>>
>>> whereas the same thing can be expressed as
>>>
>>> rel = ROLLUP rel BY dim0,
>>> (dim1,dim2,dim3),(dim4,dim5,dim6),(dim7,dim8,dim9);
>>>
>>> Thanks Alan for pointing out the way for independently managing the
>>> operators in parser and logical/physical plan. So for all these operators
>>> (CUBE, ROLLUP, GROUPING_SET) I can just generate LOCube and use flags to
>>> differentiate between these three operations.
>>>
>>> But, yes we are proliferating operators in this case.
>>>
>>> Thanks
>>> -- Prasanth
>>>
>>> On May 30, 2012, at 4:42 PM, Alan Gates wrote:
>>>
>>>>
>>>> On May 30, 2012, at 10:43 AM, Jonathan Coveney wrote:
>>>>
>>>>> I was going to say the same thing Alan said w.r.t. operators: operators in
>>>>> the grammar can correspond to whatever logical and physical operators you
>>>>> want.
>>>>>
>>>>> As far as the principle of least astonishment compared to SQL... Pig is
>>>>> already pretty astonishing. I don't know why we would bend over backwards
>>>>> to make the syntax so similar in this case when even getting to the point
>>>>> of doing a CUBE means understanding an object model that is pretty
>>>>> different from SQL.
>>>>>
>>>>> On that note,
>>>>>
>>>>> rel = CUBE rel BY GROUPING SETS(cols);
>>>>>
>>>>> seems really confusing. I'd much rather overload the group operating