CUBE/ROLLUP/GROUPING SETS syntax
Prasanth J 2012-05-28, 05:36
Hello everyone
I am looking for feedback from the community about the syntax for the CUBE/ROLLUP/GROUPING SETS operations in Pig. I am moving the discussion from JIRA to the dev list so that everyone can share their opinion on the operator syntax. Please have a look at the syntax proposal at the link below and let me know your opinion.

https://issues.apache.org/jira/browse/PIG-2167?focusedCommentId=13277644&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13277644

Thanks
-- Prasanth
-
Re: CUBE/ROLLUP/GROUPING SETS syntax
Jonathan Coveney 2012-05-29, 20:05
Hey Prasanth, happy hacking.

My opinion:

CUBE:

    alias = CUBE rel BY (a,b,c);

I like that syntax. It's unambiguous what is going on.

ROLLUP:

    alias = CUBE rel BY ROLLUP(a,b,c);

I never liked that syntax in SQL. I suggest we just do what we did with CUBE, i.e.

    alias = ROLLUP rel BY (a,b,c);

GROUPING SETS:

    alias = CUBE rel BY GROUPING SETS((a,b),(b),());

I don't like this. The cube vs. grouping sets distinction is confusing to me. Maybe, following the same pattern, you could do something like:

    alias = GROUPING_SET rel BY ((a,b),(b),());

As far as HAVING goes, is there an optimization that can be done with a HAVING clause that can't be done based on the logical plan that comes afterwards? That seems odd to me. Since you have to materialize the result anyway, can't the HAVING clause just be a FILTER that comes after the cube? I don't know why we need a special syntax.

My opinion. Forgive janky formatting, gmail + paste = pain.
Jon
-
Re: CUBE/ROLLUP/GROUPING SETS syntax
Prasanth J 2012-05-29, 22:55
Thanks Jonathan for looking into it and for your suggestions.
The reason I came up with a clause rather than a separate operator was to avoid adding additional operators to the grammar. Adding ROLLUP and GROUPING_SET would need separate logical operators, adding to the complexity. I am planning to keep everything under the cube operator, so only the LOCube and POCube operators will be added. And as you and Dmitriy have mentioned, the purpose of the HAVING clause is the same as FILTER, so we do not need a separate HAVING clause.

I will give a quick recap of the cube-related operations and the multiple syntax options for achieving the same. I am also adding partial cubing and rollup to this discussion.

1) CUBE

Current syntax:
    alias = CUBE rel BY (a, b);

Following group-bys will be computed:
    (a, b)
    (a)
    (b)
    ()

2) Partial CUBE

Proposed syntax:
    alias = CUBE rel BY a, (b, c);

Following group-bys will be computed:
    (a, b, c)
    (a, b)
    (a, c)
    (a)

3) ROLLUP

Proposed syntax 1:
    alias = CUBE rel BY ROLLUP(a, b);

Proposed syntax 2:
    alias = CUBE rel BY (a::b);

Proposed syntax 3:
    alias = ROLLUP rel BY (a, b);

Following group-bys will be computed:
    (a, b)
    (a)
    ()

4) Partial ROLLUP

Proposed syntax 1:
    alias = CUBE rel BY a, ROLLUP(b, c);

Proposed syntax 2:
    alias = CUBE rel BY (a, b::c);

Proposed syntax 3:
    alias = ROLLUP rel BY a, (b, c);

Following group-bys will be computed:
    (a, b, c)
    (a, b)
    (a)

5) GROUPING SETS

Proposed syntax 1:
    alias = CUBE rel BY GROUPING SETS((a), (b, c), (c))

Proposed syntax 2:
    alias = CUBE rel BY {(a), (b, c), (c)}

Proposed syntax 3:
    alias = GROUPING_SET rel BY ((a), (b, c), (c))

Following group-bys will be computed:
    (a)
    (b, c)
    (c)

Please vote for syntax 1, 2 or 3 so that we can come to a consensus before I start hacking the grammar file.

Thanks
-- Prasanth
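The group-by lists in the recap above follow two simple rules: CUBE takes every subset of the dimensions, ROLLUP takes every prefix, and the "partial" forms keep one or more dimensions fixed in every grouping. A minimal Python sketch of those rules follows. The function names (`cube_sets`, `rollup_sets`, `partial`) are made up for illustration and are not part of Pig.

```python
from itertools import combinations


def cube_sets(dims):
    """Every subset of dims, largest first: the groupings a CUBE computes."""
    dims = tuple(dims)
    return [combo
            for size in range(len(dims), -1, -1)
            for combo in combinations(dims, size)]


def rollup_sets(dims):
    """Every prefix of dims, longest first: the groupings a ROLLUP computes."""
    dims = tuple(dims)
    return [dims[:size] for size in range(len(dims), -1, -1)]


def partial(fixed, sets):
    """Prepend dimensions that must appear in every grouping (partial cube/rollup)."""
    fixed = tuple(fixed)
    return [fixed + s for s in sets]


print(cube_sets(["a", "b"]))                     # case 1) CUBE
print(partial(["a"], cube_sets(["b", "c"])))     # case 2) partial CUBE
print(rollup_sets(["a", "b"]))                   # case 3) ROLLUP
print(partial(["a"], rollup_sets(["b", "c"])))   # case 4) partial ROLLUP
```

GROUPING SETS needs no helper here: the user lists the groupings explicitly, so the list is its own expansion.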
-
Re: CUBE/ROLLUP/GROUPING SETS syntax
Alan Gates 2012-05-30, 16:35
Some thoughts on this:
1) +1 to what Dmitriy said on HAVING.

2) We need to be clear about separating operators in the grammar versus logical plan versus physical plan. The choices you make in the grammar are totally independent of the other two. That is, you could choose the syntax:

    rel = GROUP rel BY CUBE (a, b, c)

and still have a separate POCube operator. When the parser sees GROUP BY CUBE it will generate an LOCube operator for the logical plan rather than an LOGroup operator. You can still have a separate POCube physical operator. Separate optimizations can be applied to LOGroup vs. LOCube and POGroup vs. POCube.

3) On syntax I can see arguments for keeping as close to SQL as possible and for the syntax proposed by Prasanth. The argument for sticking close to SQL is that it conforms to the law of least astonishment. It wouldn't be exactly SQL, as it would end up looking like:

    rel = GROUP rel BY CUBE (cols)
    rel = GROUP rel BY ROLLUP (cols)
    rel = GROUP rel BY GROUPING SETS(cols);

The argument I see for sticking with Prasanth's approach is that GROUP is really short for COGROUP in Pig Latin, and I don't think we're proposing doing COGROUP rel BY CUBE, nor can I see a case where you'd want to do such a thing. This makes CUBE really a separate operation. But if we go this route I agree with Prasanth we should do CUBE rel BY ROLLUP and CUBE rel BY GROUPING SETS. Let's not proliferate operators.

Alan.
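Alan's point 2 is that the surface syntax and the plan operators are separate decisions. As a rough illustration only, the Python sketch below maps three candidate spellings onto one logical node. `LogicalCube`, `CubeMode`, and the three `from_*` functions are hypothetical stand-ins. They are not Pig's actual LOCube class or parser code.

```python
from dataclasses import dataclass
from enum import Enum


class CubeMode(Enum):
    CUBE = "cube"
    ROLLUP = "rollup"
    GROUPING_SETS = "grouping_sets"


@dataclass(frozen=True)
class LogicalCube:
    """Hypothetical stand-in for an LOCube-style logical plan node."""
    relation: str
    mode: CubeMode
    dims: tuple


# Keyword spellings used by the different proposals in this thread.
_MODES = {
    "CUBE": CubeMode.CUBE,
    "ROLLUP": CubeMode.ROLLUP,
    "GROUPING SETS": CubeMode.GROUPING_SETS,
    "GROUPING_SET": CubeMode.GROUPING_SETS,
}


def from_sql_like(relation, keyword, dims):
    """Models: rel = GROUP relation BY <keyword> (dims)"""
    return LogicalCube(relation, _MODES[keyword], tuple(dims))


def from_cube_clause(relation, keyword, dims):
    """Models: rel = CUBE relation BY <keyword>(dims)"""
    return LogicalCube(relation, _MODES[keyword], tuple(dims))


def from_operator(keyword, relation, dims):
    """Models: rel = <keyword> relation BY (dims)"""
    return LogicalCube(relation, _MODES[keyword], tuple(dims))


# All three spellings of a rollup produce the same logical node.
nodes = [
    from_sql_like("rel", "ROLLUP", ["a", "b"]),
    from_cube_clause("rel", "ROLLUP", ["a", "b"]),
    from_operator("ROLLUP", "rel", ["a", "b"]),
]
print(nodes[0] == nodes[1] == nodes[2])
```

Under this reading, the choice between the three syntaxes affects only the grammar rules, while optimizations attach to the node type and its mode.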
-
Re: CUBE/ROLLUP/GROUPING SETS syntax
Jonathan Coveney 2012-05-30, 17:43
I was going to say the same thing Alan said w.r.t. operators: operators in the grammar can correspond to whatever logical and physical operators you want.

As far as the principle of least astonishment compared to SQL... Pig is already pretty astonishing. I don't know why we would bend over backwards to make the syntax so similar in this case when even getting to the point of doing a CUBE means understanding an object model that is pretty different from SQL.

On that note,

    rel = CUBE rel BY GROUPING SETS(cols);

seems really confusing. I'd much rather overload the group operator than the cube operator. If I see "cube," I expect a cube. If you start doing rollups etc., that's not a cube, it's a group. Or it's just a rollup. Pig Latin is simple enough that I don't think having a rollup, group_set, etc. operator will be so confusing, because they're already going to be typing that stuff in the context of

    group rel by rollup(cols);

and so on. I don't see how it's worth adding more, confusing syntax for the sake of creating parallels with a language we now share very little with.

But I won't beat it any further... if people prefer a different syntax, that's fine. Just excited to have the features in Pig!
Jon
-
Re: CUBE/ROLLUP/GROUPING SETS syntax
Alan Gates 2012-05-30, 20:42
On May 30, 2012, at 10:43 AM, Jonathan Coveney wrote:

> I was going to say the same thing Alan said w.r.t. operators: operators in
> the grammar can correspond to whatever logical and physical operators you
> want.
>
> As far as the principle of least astonishment compared to SQL... Pig is
> already pretty astonishing. I don't know why we would bend over backwards
> to make the syntax so similar in this case when even getting to the point
> of doing a CUBE means understanding an object model that is pretty
> different from SQL.
>
> On that note,
>
> rel = CUBE rel BY GROUPING SETS(cols);
>
> seems really confusing. I'd much rather overload the group operator than
> the cube operator. If I see "cube," I expect a cube. If you start doing
> rollups etc., that's not a cube, it's a group. Or it's just a rollup. Pig
> Latin is simple enough that I don't think having a rollup, group_set, etc.
> operator will be so confusing, because they're already going to be typing
> that stuff in the context of
>
> group rel by rollup(cols); and so on. I don't see how it's worth adding
> more, confusing syntax for the sake of creating parallels with a language
> we now share very little with.

Fair points.

> But I won't beat it any further... if people prefer a different syntax,
> that's fine. Just excited to have the features in Pig!

+1, I can live with any of the 3 syntax choices (near SQL, original, and Jon's).

Alan.
-
Re: CUBE/ROLLUP/GROUPING SETS syntax
Prasanth J 2012-05-31, 00:02
Thanks Alan and Jon for expressing your views.
I agree with Jon's point: if the syntax contains CUBE, then the user expects it to perform a CUBE operation. So Jon's syntax seems more meaningful and concise:

    rel = CUBE rel BY (dims);
    rel = ROLLUP rel BY (dims);
    rel = GROUPING_SET rel BY (dims);

There are 2 reasons why I do not prefer the SQL syntax:
1) I do not want to break into the existing Group operator implementation :)
2) The syntax gets longer in the case of partial hierarchical cubing/rollups. For example:

    rel = GROUP rel BY dim0, ROLLUP(dim1, dim2, dim3), ROLLUP(dim4,dim5,dim6), ROLLUP(dim7,dim8,dim9);

whereas the same thing can be expressed as

    rel = ROLLUP rel BY dim0, (dim1,dim2,dim3),(dim4,dim5,dim6),(dim7,dim8,dim9);

Thanks Alan for pointing out the way to manage the operators independently in the parser and the logical/physical plan. So for all these operators (CUBE, ROLLUP, GROUPING_SET) I can just generate LOCube and use flags to differentiate between the three operations.

But yes, we are proliferating operators in this case.

Thanks
-- Prasanth
-
Re: CUBE/ROLLUP/GROUPING SETS syntax
Jonathan Coveney 2012-05-31, 00:10
As far as the underlying implementation, if they all use the same optimizations that you use in cube, then it can be LOCube. If they have their own optimizations etc. (or could), it may be worth them having their own logical operators (which might just be LOCube with flags for the time being), which allows us more flexibility. But I suppose that's between you, Eclipse, and your GSoC mentor.
-
Re: CUBE/ROLLUP/GROUPING SETS syntax
Prasanth J 2012-06-21, 20:28
Hello all
I initially implemented ROLLUP as a separate operation with the following syntax:

    a = ROLLUP inp BY (x,y);

which does the same thing as CUBE (inserting foreach + group-by in the logical plan) except that it uses the RollupDimensions UDF. But the issue with this approach is that we cannot mix CUBE and ROLLUP operations together in the same statement, which is a typical case. SQL/Oracle supports using CUBE and ROLLUP together like

    GROUP BY CUBE(a,b,c), ROLLUP(c,d), CUBE(e,f);

so I modified the Pig grammar to support similar usage. So now we can use a syntax similar to SQL:

    out = CUBE rel BY CUBE(a,b,c), ROLLUP(c,d), CUBE(e,f);

In this approach, the logical plan should introduce a cartesian product between the bags generated by CUBE(a,b,c), ROLLUP(c,d) and CUBE(e,f) to generate the final output. But I read in the documentation (http://pig.apache.org/docs/r0.10.0/basic.html#cross) that the CROSS operator is expensive and that it advises using it sparingly.

Is there any other way to achieve the cartesian product in a less expensive way? Also, does anyone have thoughts about this new syntax?

Thanks
-- Prasanth
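To make the size of that cartesian product concrete, here is a small Python sketch that expands CUBE(a,b,c), ROLLUP(c,d), CUBE(e,f) into the combined list of groupings. The helper names are illustrative only. How a dimension that appears in two clauses (c here) is treated is an assumption of this sketch: it is listed once per grouping, and repeated groupings are kept.

```python
from itertools import combinations, product


def cube_sets(dims):
    """Every subset of dims, largest first."""
    dims = tuple(dims)
    return [combo
            for size in range(len(dims), -1, -1)
            for combo in combinations(dims, size)]


def rollup_sets(dims):
    """Every prefix of dims, longest first."""
    dims = tuple(dims)
    return [dims[:size] for size in range(len(dims), -1, -1)]


def combine(*clauses):
    """Cross the groupings of each clause and merge each combination.

    Assumption: a dimension named in several clauses is kept once,
    in first-seen order; duplicate groupings are not removed.
    """
    combined = []
    for parts in product(*clauses):
        merged = tuple(dict.fromkeys(d for part in parts for d in part))
        combined.append(merged)
    return combined


groupings = combine(cube_sets("abc"), rollup_sets("cd"), cube_sets("ef"))
print(len(groupings))   # 8 * 3 * 4 = 96
print(groupings[0])     # ('a', 'b', 'c', 'd', 'e', 'f')
print(groupings[-1])    # ()
```

The count grows multiplicatively with each clause, which is why the cost of producing the product matters.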
-
Re: CUBE/ROLLUP/GROUPING SETS syntax
Alan Gates 2012-06-21, 21:11
I think I'm missing something here. The result of the "out =" line is three bags, correct? If that's the case, the cross product you want is achieved by doing:

result = foreach out generate flatten($0), flatten($1), flatten($2)

This is not the same as CROSS, which would be expensive.

Alan.

On Jun 21, 2012, at 1:28 PM, Prasanth J wrote:

> Hello all
>
> I initially implemented ROLLUP as a separate operation with the following syntax
>
> a = ROLLUP inp BY (x,y);
>
> which does the same thing as CUBE (inserting foreach + group-by in logical plan) except that it uses RollupDimensions UDF. But the issue with this approach is that we cannot mix CUBE and ROLLUP operations together in the same syntax which is a typical case. SQL/Oracle supports using CUBE and ROLLUP together like
>
> GROUP BY CUBE(a,b,c), ROLLUP(c,d), CUBE(e,f);
>
> so I modified the pig grammar to support the similar usage. So now we can use a syntax similar to SQL
>
> out = CUBE rel BY CUBE(a,b,c), ROLLUP(c,d), CUBE(e,f);
>
> In this approach, the logical plan should introduce cartesian product between bags generated by CUBE(a,b,c), ROLLUP(c,d) and CUBE(e,f) for generating the final output. But I read from the documentation (http://pig.apache.org/docs/r0.10.0/basic.html#cross) that CROSS operator is an expensive operator and advises to use it sparingly.
>
> Is there any other way to achieve the cartesian product in a less expensive way? Also, does anyone have thoughts about this new syntax?
>
> Thanks
> -- Prasanth
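For readers following Alan's suggestion, the effect of flattening several bags in one generate can be modeled outside Pig. The Python sketch below is only an illustration, not Pig's implementation: the helper name flatten_bags and the sample row are made up, and it assumes, as Alan does, that each record of out holds three bags. It shows that every output tuple takes one element from each bag, so the cross product is formed per record.

```python
from itertools import product


def flatten_bags(row):
    """Model flattening every bag in one record.

    `row` is a sequence of bags; each bag is a list of tuples.
    The result has one output tuple per combination of one tuple
    from each bag, with the chosen tuples concatenated.
    """
    return [sum(combo, ()) for combo in product(*row)]


# Hypothetical record holding three bags (values are placeholders).
row = (
    [("a1",), ("a2",)],
    [("c1",)],
    [("e1",), ("e2",)],
)

result = flatten_bags(row)
for t in result:
    print(t)
```

With bags of sizes 2, 1 and 2, the record expands to 2 x 1 x 2 = 4 tuples. An empty bag yields no output for that record in this model.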
Re: CUBE/ROLLUP/GROUPING SETS syntax
Prasanth J 2012-06-21, 21:52

Thanks Alan.
Your suggestion looks correct.

I think with this I can achieve what I wanted in the same syntax

out = CUBE rel BY CUBE(a,b,c), ROLLUP(c,d), CUBE(e,f);

Just curious to know.
How is this different from CROSS? And why is CROSS expensive when compared to flatten?

Thanks
-- Prasanth

On Jun 21, 2012, at 5:11 PM, Alan Gates wrote:

> I think I'm missing something here. The result of the "out =" line is three bags, correct? If that's the case, the cross product you want is achieved by doing:
>
> result = foreach out generate flatten($0), flatten($1), flatten($2)
>
> This is not the same as CROSS, which would be expensive.
>
> Alan.
Re: CUBE/ROLLUP/GROUPING SETS syntax
Dmitriy Ryaboy 2012-06-22, 20:14

One happens on the mapper.

On Thu, Jun 21, 2012 at 2:52 PM, Prasanth J <[EMAIL PROTECTED]> wrote:

> Thanks Alan.
> Your suggestion looks correct.
>
> I think with this I can achieve what I wanted in the same syntax
> out = CUBE rel BY CUBE(a,b,c), ROLLUP(c,d), CUBE(e,f);
>
> Just curious to know.
> How is this different from CROSS? And why is CROSS expensive when compared to flatten?
>
> Thanks
> -- Prasanth
Re: CUBE/ROLLUP/GROUPING SETS syntax
Jonathan Coveney 2012-06-21, 20:41

Just to make sure I understand this correctly, is

out = CUBE rel BY CUBE(a,b,c), ROLLUP(c,d), CUBE(e,f);

equivalent to:

out1 = CUBE rel BY (a,b,c);
out2 = ROLLUP rel BY (c,d);
out3 = CUBE rel BY (e,f);

out = CROSS out1, out2, out3;

?

2012/6/21 Prasanth J <[EMAIL PROTECTED]>

> Hello all
>
> I initially implemented ROLLUP as a separate operation with the following syntax
>
> a = ROLLUP inp BY (x,y);
>
> which does the same thing as CUBE (inserting foreach + group-by in logical plan) except that it uses RollupDimensions UDF. But the issue with this approach is that we cannot mix CUBE and ROLLUP operations together in the same syntax which is a typical case. SQL/Oracle supports using CUBE and ROLLUP together like
>
> GROUP BY CUBE(a,b,c), ROLLUP(c,d), CUBE(e,f);
>
> so I modified the pig grammar to support the similar usage. So now we can use a syntax similar to SQL
>
> out = CUBE rel BY CUBE(a,b,c), ROLLUP(c,d), CUBE(e,f);
>
> In this approach, the logical plan should introduce cartesian product between bags generated by CUBE(a,b,c), ROLLUP(c,d) and CUBE(e,f) for generating the final output. But I read from the documentation (http://pig.apache.org/docs/r0.10.0/basic.html#cross) that CROSS operator is an expensive operator and advises to use it sparingly.
>
> Is there any other way to achieve the cartesian product in a less expensive way? Also, does anyone have thoughts about this new syntax?
>
> Thanks
> -- Prasanth
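As a rough illustration of what combining CUBE and ROLLUP in one BY clause implies, the Python sketch below enumerates grouping sets the way SQL defines composite grouping: one grouping set is taken from each clause and the picks are concatenated. The function names are hypothetical, and this models the semantics only, not the Pig logical plan discussed in the thread. It does not collapse the column c that appears in both CUBE(a,b,c) and ROLLUP(c,d).

```python
from itertools import combinations, product


def cube_sets(dims):
    """All 2^n subsets of dims, largest first, keeping column order."""
    return [c for r in range(len(dims), -1, -1) for c in combinations(dims, r)]


def rollup_sets(dims):
    """The n+1 prefixes of dims, from the full list down to the empty set."""
    return [tuple(dims[:i]) for i in range(len(dims), -1, -1)]


def combine(*parts):
    """Pick one grouping set from each part and concatenate the picks."""
    return [sum(choice, ()) for choice in product(*parts)]


# Grouping sets for: CUBE(a,b,c), ROLLUP(c,d), CUBE(e,f)
sets = combine(
    cube_sets(["a", "b", "c"]),
    rollup_sets(["c", "d"]),
    cube_sets(["e", "f"]),
)
print(len(sets))
```

For the example in this thread that gives 8 x 3 x 4 = 96 combinations, which is why the cost of forming the product matters.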
Re: CUBE/ROLLUP/GROUPING SETS syntax
Prasanth J 2012-06-21, 20:43

Yeah, you are right.

Thanks
-- Prasanth

On Jun 21, 2012, at 4:41 PM, Jonathan Coveney wrote:

> Just to make sure I understand this correctly, is
>
> out = CUBE rel BY CUBE(a,b,c), ROLLUP(c,d), CUBE(e,f);
>
> equivalent to:
>
> out1 = CUBE rel BY (a,b,c);
> out2 = ROLLUP rel BY (c,d);
> out3 = CUBE rel BY (e,f);
>
> out = CROSS out1, out2, out3;
>
> ?
Re: CUBE/ROLLUP/GROUPING SETS syntax
Jonathan Coveney 2012-06-21, 20:50

IMHO, I don't know that it's worth overloading the operator to support that. If that's what it is, then they can just do what I said. I would rather the focus be on getting ROLLUP, CUBE, etc. implemented and efficient. Curious what others think.

2012/6/21 Prasanth J <[EMAIL PROTECTED]>

> Yeah you are right.
>
> Thanks
> -- Prasanth