-
Group by Fetching top 100 from each group
Benjamin Juhn 2012-06-29, 23:19
Hi there,
I'm trying to write a group by statement, only returning the top 100 records from each group. Does pig support this?
Thanks, Ben
-
Re: Group by Fetching top 100 from each group
Sal Uryasev 2012-06-29, 23:27
Hey Ben,
You can do a nested ORDER => LIMIT inside a FOREACH:
http://pig.apache.org/docs/r0.10.0/basic.html#foreach
Newer versions of Pig also have a TOP function that will replace the ORDER => LIMIT.
-Sal
-
RE: Group by Fetching top 100 from each group
Austin Stickney 2012-06-29, 23:55
You would want to do a FOREACH after the GROUP BY where you limit the contents of each group. Usually you would also want to order the bag before you limit it, so that you are taking the top 100 of something, rather than just a random selection of 100. Here's an example that creates a list of the top 100 salesmen from each state.
people = LOAD 'people.tsv' USING PigStorage()
    AS ( fname:chararray, lname:chararray, state:chararray, sales:double );
group_by_state = GROUP people BY state;
top_sales_by_state = FOREACH group_by_state {
    -- sort each state's bag by sales, highest first, then keep 100
    order_by_sales = ORDER people BY sales DESC;
    top_sales = LIMIT order_by_sales 100;
    -- the group key is lowercase 'group'; the flattened state field is
    -- renamed so it does not clash with the key's alias
    GENERATE group AS state:chararray,
        FLATTEN(top_sales) AS ( fname:chararray, lname:chararray, person_state:chararray, sales:double );
};
Austin
-
Re: Group by Fetching top 100 from each group
Kris Coward 2012-06-30, 00:02
LIMIT and ORDER BY are both allowed nested ops for a FOREACH statement. These should be able to do what you want, e.g.
B = GROUP A BY key;
C = FOREACH B {
    X = ORDER A BY orderingParam;
    Y = LIMIT X 100;
    GENERATE group, Y;
};
-Kris
--
Kris Coward http://unripe.melon.org/
GPG Fingerprint: 2BF3 957D 310A FEEC 4733 830E 21A4 05C7 1FEB 12B3
-
Re: Group by Fetching top 100 from each group
Jonathan Coveney 2012-06-30, 01:39
Ideally, you should use the TOP function. It will be more efficient, as it is algebraic.
-
Re: Group by Fetching top 100 from each group
Corbin Hoenes 2012-06-30, 02:44
http://pig.apache.org/docs/r0.10.0/func.html#topx
-
Re: Group by Fetching top 100 from each group
Kris Coward 2012-06-30, 04:47
Yes, that is indeed better.
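For anyone finding this thread later, the TOP approach recommended above might look roughly like the sketch below. It reuses the hypothetical people.tsv layout from the salesmen example earlier in the thread (file name and field names are assumptions, not anything from Ben's data) and follows the TOP(topN, column, relation) form described in the linked r0.10.0 docs, where the column index is 0-based.

```pig
-- Sketch only: assumes the same hypothetical people.tsv layout as the
-- ORDER/LIMIT example earlier in the thread.
people = LOAD 'people.tsv' USING PigStorage()
    AS (fname:chararray, lname:chararray, state:chararray, sales:double);

group_by_state = GROUP people BY state;

-- TOP(n, column_index, bag) keeps the n tuples with the largest value in
-- the given column. Columns are 0-based, so index 3 is 'sales'.
top_sales_by_state = FOREACH group_by_state
    GENERATE FLATTEN(TOP(100, 3, people));

DUMP top_sales_by_state;
```

Each output tuple still carries its own state field, so the group key is not repeated. It is safest not to rely on the order of tuples inside the bag TOP returns; if sorted output matters, add an ORDER BY on the flattened relation afterwards.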