|
|
-
Is this desirable: relation.projection as sugar for foreach relation generate projection
Jonathan Coveney 2012-02-22, 00:59
While thinking and talking through https://issues.apache.org/jira/browse/PIG-2536, something came up: should this idea, that relation.projection works in distincts, work in any case where a projection is present? In the patch I linked to, it allows you to do b = distinct a.$0; It accomplishes this by mapping that to: b = distinct (foreach a generate $0); It seems that if this is useful, then this should be available wherever relations are used? ie b = group a.(x,y) by x; or anything. The case of group is somewhat problematic, however, because if you describe that, you'll get... b: {group: int,1-2: {(x: int,y: int)}} Which, per Alan's comment, has to do with no real naming convention for nested relations.... I guess the question is whether this is, in general, useful? More broadly... - Is it worth thinking about how to make this go deeper? Currently you can do b = distinct a.x, but not b = distinct a.x.$0 (if it were appropriate). There are issues with this (and in fact there is an outstanding but w.r.t. b = foreach (group a by $0) generate $1.$0.$0.$0.$0; <== this works!). - Is the strategy of the syntactic sugar ok? I think in this case it should be (the relation name issue notwithstanding), but could see arguments either way. Find a super small patch with no tests attached... I wanted to get some thoughts before making yet another JIRA?
-
Re: Is this desirable: relation.projection as sugar for foreach relation generate projection
Russell Jurney 2012-02-22, 03:48
b = group a.(x,y) by x; /* Less verbose is good to me. */ b = foreach (group a by x) { generate group, a.(x,y) } /* sad panda */ On Tuesday, February 21, 2012, Jonathan Coveney <[EMAIL PROTECTED]> wrote: > While thinking and talking through https://issues.apache.org/jira/browse/PIG-2536, something came up: > > should this idea, that relation.projection works in distincts, work in any case where a projection is present? > > In the patch I linked to, it allows you to do > > b = distinct a.$0; > > It accomplishes this by mapping that to: > > b = distinct (foreach a generate $0); > > It seems that if this is useful, then this should be available wherever relations are used? > > ie > > b = group a.(x,y) by x; > > or anything. The case of group is somewhat problematic, however, because if you describe that, you'll get... > > b: {group: int,1-2: {(x: int,y: int)}} > > Which, per Alan's comment, has to do with no real naming convention for nested relations.... > > I guess the question is whether this is, in general, useful? > > More broadly... > - Is it worth thinking about how to make this go deeper? Currently you can do b = distinct a.x, but not b = distinct a.x.$0 (if it were appropriate). There are issues with this (and in fact there is an outstanding but w.r.t. b = foreach (group a by $0) generate $1.$0.$0.$0.$0; <== this works!). > - Is the strategy of the syntactic sugar ok? I think in this case it should be (the relation name issue notwithstanding), but could see arguments either way. > > Find a super small patch with no tests attached... I wanted to get some thoughts before making yet another JIRA? > -- Russell Jurney twitter.com/rjurney [EMAIL PROTECTED] datasyndrome.com
-
Re: Is this desirable: relation.projection as sugar for foreach relation generate projection
Alan Gates 2012-02-23, 22:10
I've been ruminating on this for a while, and I don't have an answer yet. So I'll just give the feedback I've thought of.
This further blurs the already line between a relation (which can go on the left side of an assignment) and a bag (which cannot). I don't know if that's a good thing or not. We have already significantly blurred that with casting of relations to scalars and with the way bags convert to relations and bag to bags with nested foreach.
More comments inlined. On Feb 21, 2012, at 4:59 PM, Jonathan Coveney wrote:
> > > ie > > b = group a.(x,y) by x; > > or anything. The case of group is somewhat problematic, however, because if you describe that, you'll get... > > b: {group: int,1-2: {(x: int,y: int)}}
The problem shouldn't just show up in describe, but in trying to use the resulting bag, since pig names the bag based on the relation that was grouped. E.g.
b = group a.(x,y) by x; c = foreach b generate SUM(???.y);
I have no name to give the bag in the SUM operation. Users can get around this by using positional parameters.
> > > More broadly... > - Is it worth thinking about how to make this go deeper? Currently you can do b = distinct a.x, but not b = distinct a.x.$0 (if it were appropriate). There are issues with this (and in fact there is an outstanding but w.r.t. b = foreach (group a by $0) generate $1.$0.$0.$0.$0; <== this works!).
The problem with this is that distinct a.x and distinct a.x.$0 are really different things.
b = distinct a.x
would be short hand for
b1 = foreach a generate x; b = distinct b1;
whereas
b = distinct a.x.$0
would be short hand for
b = foreach a { b1 = x.$0; b2 = distinct b1; generate b2; }
Maybe that's ok, I'm not sure.
> - Is the strategy of the syntactic sugar ok? I think in this case it should be (the relation name issue notwithstanding), but could see arguments either way. > > Find a super small patch with no tests attached... I wanted to get some thoughts before making yet another JIRA?
The Apache lists don't allow attachments. You'll have to send it in the mail itself or post it somewhere and link to it in your mail.
Alan.
-
Re: Is this desirable: relation.projection as sugar for foreach relation generate projection
Jonathan Coveney 2012-02-23, 22:23
Adam, thanks for the comments. Below is the cat of the patch (it's short enough to just paste in line):
Your comments are welcome, and I'd be curious what others think as well. The blurring of the line between bags and relations is what I'm worried about, but at the same time, one of the things people confuse the most is that distinction. Index: test/org/apache/pig/test/TestEvalPipeline.java ==================================================================--- test/org/apache/pig/test/TestEvalPipeline.java (revision 1244760) +++ test/org/apache/pig/test/TestEvalPipeline.java (working copy) @@ -383,7 +383,7 @@ pigServer.registerQuery("A = LOAD '" + Util.generateURI(tmpFile.toString(), pigContext) + "';"); if (eliminateDuplicates){ - pigServer.registerQuery("B = DISTINCT (FOREACH A GENERATE $0) PARALLEL 10;"); + pigServer.registerQuery("B = DISTINCT A.$0 PARALLEL 10;"); }else{ if(!useUDF) { pigServer.registerQuery("B = ORDER A BY $0 PARALLEL 10;"); Index: test/org/apache/pig/test/TestEvalPipelineLocal.java ==================================================================--- test/org/apache/pig/test/TestEvalPipelineLocal.java (revision 1244760) +++ test/org/apache/pig/test/TestEvalPipelineLocal.java (working copy) @@ -400,7 +400,7 @@ + Util.generateURI(tmpFile.toString(), pigServer .getPigContext()) + "';"); if (eliminateDuplicates){ - pigServer.registerQuery("B = DISTINCT (FOREACH A GENERATE $0) PARALLEL 10;"); + pigServer.registerQuery("B = DISTINCT A.$0 PARALLEL 10;"); }else{ if(!useUDF) { pigServer.registerQuery("B = ORDER A BY $0 PARALLEL 10;"); Index: src/org/apache/pig/parser/AstPrinter.g ==================================================================Index: src/org/apache/pig/parser/QueryParser.g ==================================================================--- src/org/apache/pig/parser/QueryParser.g (revision 1244760) +++ src/org/apache/pig/parser/QueryParser.g (working copy) @@ -506,7 +506,10 @@ | LEFT_PAREN! col_ref ( ASC | DESC )? RIGHT_PAREN! ;
-distinct_clause : DISTINCT^ rel partition_clause? +distinct_clause : DISTINCT rel PERIOD ( col_alias_or_index | ( LEFT_PAREN col_alias_or_index ( COMMA col_alias_or_index )* RIGHT_PAREN ) ) partition_clause? + -> ^( DISTINCT ^( FOREACH rel ^( FOREACH_PLAN_SIMPLE ^( GENERATE col_alias_or_index+ ) ) ) partition_clause? ) + | DISTINCT rel partition_clause? + -> ^( DISTINCT rel partition_clause? ) ;
partition_clause : PARTITION^ BY! func_name
-
Re: Is this desirable: relation.projection as sugar for foreach relation generate projection
Daniel Dai 2012-02-24, 22:52
One of my concern is that it could complicate GUI mapping for the Pig script in the future. I feel it might be more clear one statement only do one thing.
Daniel
On Thu, Feb 23, 2012 at 2:23 PM, Jonathan Coveney <[EMAIL PROTECTED]> wrote: > Adam, thanks for the comments. Below is the cat of the patch (it's short > enough to just paste in line): > > Your comments are welcome, and I'd be curious what others think as well. > The blurring of the line between bags and relations is what I'm worried > about, but at the same time, one of the things people confuse the most is > that distinction. > > > Index: test/org/apache/pig/test/TestEvalPipeline.java > ==================================================================> --- test/org/apache/pig/test/TestEvalPipeline.java (revision 1244760) > +++ test/org/apache/pig/test/TestEvalPipeline.java (working copy) > @@ -383,7 +383,7 @@ > pigServer.registerQuery("A = LOAD '" > + Util.generateURI(tmpFile.toString(), pigContext) + "';"); > if (eliminateDuplicates){ > - pigServer.registerQuery("B = DISTINCT (FOREACH A GENERATE $0) > PARALLEL 10;"); > + pigServer.registerQuery("B = DISTINCT A.$0 PARALLEL 10;"); > }else{ > if(!useUDF) { > pigServer.registerQuery("B = ORDER A BY $0 PARALLEL 10;"); > Index: test/org/apache/pig/test/TestEvalPipelineLocal.java > ==================================================================> --- test/org/apache/pig/test/TestEvalPipelineLocal.java (revision > 1244760) > +++ test/org/apache/pig/test/TestEvalPipelineLocal.java (working copy) > @@ -400,7 +400,7 @@ > + Util.generateURI(tmpFile.toString(), pigServer > .getPigContext()) + "';"); > if (eliminateDuplicates){ > - pigServer.registerQuery("B = DISTINCT (FOREACH A GENERATE $0) > PARALLEL 10;"); > + pigServer.registerQuery("B = DISTINCT A.$0 PARALLEL 10;"); > }else{ > if(!useUDF) { > pigServer.registerQuery("B = ORDER A BY $0 PARALLEL 10;"); > Index: src/org/apache/pig/parser/AstPrinter.g > ==================================================================> Index: src/org/apache/pig/parser/QueryParser.g > ==================================================================> --- src/org/apache/pig/parser/QueryParser.g (revision 1244760) > +++ src/org/apache/pig/parser/QueryParser.g (working copy) > @@ -506,7 +506,10 @@ > | LEFT_PAREN! col_ref ( ASC | DESC )? RIGHT_PAREN! > ; > > -distinct_clause : DISTINCT^ rel partition_clause? > +distinct_clause : DISTINCT rel PERIOD ( col_alias_or_index | ( LEFT_PAREN > col_alias_or_index ( COMMA col_alias_or_index )* RIGHT_PAREN ) ) > partition_clause? > + -> ^( DISTINCT ^( FOREACH rel ^( FOREACH_PLAN_SIMPLE ^( > GENERATE col_alias_or_index+ ) ) ) partition_clause? ) > + | DISTINCT rel partition_clause? > + -> ^( DISTINCT rel partition_clause? ) > ; > > partition_clause : PARTITION^ BY! func_name
-
Re: Is this desirable: relation.projection as sugar for foreach relation generate projection
Dmitriy Ryaboy 2012-03-02, 21:48
But that's already not the case. The syntax "a = distinct (foreach b generate $1, $2);" is completely legal.
D
On Fri, Feb 24, 2012 at 2:52 PM, Daniel Dai <[EMAIL PROTECTED]> wrote: > One of my concern is that it could complicate GUI mapping for the Pig > script in the future. I feel it might be more clear one statement only > do one thing. > > Daniel > > On Thu, Feb 23, 2012 at 2:23 PM, Jonathan Coveney <[EMAIL PROTECTED]> wrote: >> Adam, thanks for the comments. Below is the cat of the patch (it's short >> enough to just paste in line): >> >> Your comments are welcome, and I'd be curious what others think as well. >> The blurring of the line between bags and relations is what I'm worried >> about, but at the same time, one of the things people confuse the most is >> that distinction. >> >> >> Index: test/org/apache/pig/test/TestEvalPipeline.java >> ==================================================================>> --- test/org/apache/pig/test/TestEvalPipeline.java (revision 1244760) >> +++ test/org/apache/pig/test/TestEvalPipeline.java (working copy) >> @@ -383,7 +383,7 @@ >> pigServer.registerQuery("A = LOAD '" >> + Util.generateURI(tmpFile.toString(), pigContext) + "';"); >> if (eliminateDuplicates){ >> - pigServer.registerQuery("B = DISTINCT (FOREACH A GENERATE $0) >> PARALLEL 10;"); >> + pigServer.registerQuery("B = DISTINCT A.$0 PARALLEL 10;"); >> }else{ >> if(!useUDF) { >> pigServer.registerQuery("B = ORDER A BY $0 PARALLEL 10;"); >> Index: test/org/apache/pig/test/TestEvalPipelineLocal.java >> ==================================================================>> --- test/org/apache/pig/test/TestEvalPipelineLocal.java (revision >> 1244760) >> +++ test/org/apache/pig/test/TestEvalPipelineLocal.java (working copy) >> @@ -400,7 +400,7 @@ >> + Util.generateURI(tmpFile.toString(), pigServer >> .getPigContext()) + "';"); >> if (eliminateDuplicates){ >> - pigServer.registerQuery("B = DISTINCT (FOREACH A GENERATE $0) >> PARALLEL 10;"); >> + pigServer.registerQuery("B = DISTINCT A.$0 PARALLEL 10;"); >> }else{ >> if(!useUDF) { >> pigServer.registerQuery("B = ORDER A BY $0 PARALLEL 10;"); >> Index: src/org/apache/pig/parser/AstPrinter.g >> ==================================================================>> Index: src/org/apache/pig/parser/QueryParser.g >> ==================================================================>> --- src/org/apache/pig/parser/QueryParser.g (revision 1244760) >> +++ src/org/apache/pig/parser/QueryParser.g (working copy) >> @@ -506,7 +506,10 @@ >> | LEFT_PAREN! col_ref ( ASC | DESC )? RIGHT_PAREN! >> ; >> >> -distinct_clause : DISTINCT^ rel partition_clause? >> +distinct_clause : DISTINCT rel PERIOD ( col_alias_or_index | ( LEFT_PAREN >> col_alias_or_index ( COMMA col_alias_or_index )* RIGHT_PAREN ) ) >> partition_clause? >> + -> ^( DISTINCT ^( FOREACH rel ^( FOREACH_PLAN_SIMPLE ^( >> GENERATE col_alias_or_index+ ) ) ) partition_clause? ) >> + | DISTINCT rel partition_clause? >> + -> ^( DISTINCT rel partition_clause? ) >> ; >> >> partition_clause : PARTITION^ BY! func_name
-
Re: Is this desirable: relation.projection as sugar for foreach relation generate projection
Daniel Dai 2012-03-02, 23:02
I should say one operator only do one thing instead.
Daniel
On Fri, Mar 2, 2012 at 1:48 PM, Dmitriy Ryaboy <[EMAIL PROTECTED]> wrote: > But that's already not the case. The syntax "a = distinct (foreach b > generate $1, $2);" is completely legal. > > D > > On Fri, Feb 24, 2012 at 2:52 PM, Daniel Dai <[EMAIL PROTECTED]> wrote: >> One of my concern is that it could complicate GUI mapping for the Pig >> script in the future. I feel it might be more clear one statement only >> do one thing. >> >> Daniel >> >> On Thu, Feb 23, 2012 at 2:23 PM, Jonathan Coveney <[EMAIL PROTECTED]> wrote: >>> Adam, thanks for the comments. Below is the cat of the patch (it's short >>> enough to just paste in line): >>> >>> Your comments are welcome, and I'd be curious what others think as well. >>> The blurring of the line between bags and relations is what I'm worried >>> about, but at the same time, one of the things people confuse the most is >>> that distinction. >>> >>> >>> Index: test/org/apache/pig/test/TestEvalPipeline.java >>> ==================================================================>>> --- test/org/apache/pig/test/TestEvalPipeline.java (revision 1244760) >>> +++ test/org/apache/pig/test/TestEvalPipeline.java (working copy) >>> @@ -383,7 +383,7 @@ >>> pigServer.registerQuery("A = LOAD '" >>> + Util.generateURI(tmpFile.toString(), pigContext) + "';"); >>> if (eliminateDuplicates){ >>> - pigServer.registerQuery("B = DISTINCT (FOREACH A GENERATE $0) >>> PARALLEL 10;"); >>> + pigServer.registerQuery("B = DISTINCT A.$0 PARALLEL 10;"); >>> }else{ >>> if(!useUDF) { >>> pigServer.registerQuery("B = ORDER A BY $0 PARALLEL 10;"); >>> Index: test/org/apache/pig/test/TestEvalPipelineLocal.java >>> ==================================================================>>> --- test/org/apache/pig/test/TestEvalPipelineLocal.java (revision >>> 1244760) >>> +++ test/org/apache/pig/test/TestEvalPipelineLocal.java (working copy) >>> @@ -400,7 +400,7 @@ >>> + Util.generateURI(tmpFile.toString(), pigServer >>> .getPigContext()) + "';"); >>> if (eliminateDuplicates){ >>> - pigServer.registerQuery("B = DISTINCT (FOREACH A GENERATE $0) >>> PARALLEL 10;"); >>> + pigServer.registerQuery("B = DISTINCT A.$0 PARALLEL 10;"); >>> }else{ >>> if(!useUDF) { >>> pigServer.registerQuery("B = ORDER A BY $0 PARALLEL 10;"); >>> Index: src/org/apache/pig/parser/AstPrinter.g >>> ==================================================================>>> Index: src/org/apache/pig/parser/QueryParser.g >>> ==================================================================>>> --- src/org/apache/pig/parser/QueryParser.g (revision 1244760) >>> +++ src/org/apache/pig/parser/QueryParser.g (working copy) >>> @@ -506,7 +506,10 @@ >>> | LEFT_PAREN! col_ref ( ASC | DESC )? RIGHT_PAREN! >>> ; >>> >>> -distinct_clause : DISTINCT^ rel partition_clause? >>> +distinct_clause : DISTINCT rel PERIOD ( col_alias_or_index | ( LEFT_PAREN >>> col_alias_or_index ( COMMA col_alias_or_index )* RIGHT_PAREN ) ) >>> partition_clause? >>> + -> ^( DISTINCT ^( FOREACH rel ^( FOREACH_PLAN_SIMPLE ^( >>> GENERATE col_alias_or_index+ ) ) ) partition_clause? ) >>> + | DISTINCT rel partition_clause? >>> + -> ^( DISTINCT rel partition_clause? ) >>> ; >>> >>> partition_clause : PARTITION^ BY! func_name
|
|