-
SUM of project-range of fields?
Nathan Neff 2013-03-10, 14:45
Hello
I'm trying to find a SUM of a range of fields, and am having difficulty.
I have the following data structure (from the MovieLens public dataset), where there is a "fixed" field of "Name" followed by a denormalized "genres" list (for example, the first column is "action", the second is "comedy", etc.).

Name|Genres
Toy Story|0|0|0|1|1|1|0|0|0|0|0|0|0|0|0|0|0|0|0
GoldenEye|0|1|1|0|0|0|0|0|0|0|0|0|0|0|0|0|1|0|0
This seems like an ideal use for the project-range feature of Pig, where it would be trivial to find movies that belonged to two or more genres.
I'm trying to use this code:

movies = load 'movies' USING PigStorage('|');
movie_and_genres = FOREACH movies GENERATE $0, TOBAG($2 ..) AS genres;
DUMP movie_and_genres;

This works, and gives me:

(Toy Story,{(0),(1),(1),(1),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0)})
(GoldenEye,{(1),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(1),(0),(0)})
However, if I try to run a SUM on the genres bag, I receive the error:
"Could not infer the matching function for org.apache.pig.builtin.SUM as multiple or none of them fit. Please use an explicit cast."
I've tried to flatten and cast the genres bag like this:
movies = load 'movies' USING PigStorage('|');
movie_and_genres = FOREACH movies GENERATE $0, FLATTEN(TOBAG($2 ..));
movie_and_int_genres = FOREACH movie_and_genres GENERATE $0, (int) $1;

However, then I receive the error:

Cannot cast bytearray to int
Any ideas what to try next? Or, would I be better off trying to use a STREAM or custom loader to do something like this?
Thanks, --Nate
-
Re: SUM of project-range of fields?
Russell Jurney 2013-03-11, 04:15
Try:

movie_and_genres = FOREACH movies GENERATE $0, (b:bag{t:tuple(i:int)})TOBAG($2 ..) AS genres:b:bag{t:tuple(i:int)};
foo = foreach movie_and_genres generate SUM(genres) as genre_total;

Russell Jurney http://datasyndrome.com
-
Re: SUM of project-range of fields?
Nathan Neff 2013-03-19, 01:15
On Sun, Mar 10, 2013 at 11:15 PM, Russell Jurney <[EMAIL PROTECTED]> wrote:
> Try:
>
> movie_and_genres = FOREACH movies GENERATE $0,
> (b:bag{t:tuple(i:int)})TOBAG($2 ..) AS genres:b:bag{t:tuple(i:int)};
> foo = foreach movie_and_genres generate SUM(genres) as genre_total;

Hi Russell,

Thanks for your help, but I couldn't get this variation to work. I get:

Syntax error, unexpected symbol at or near 'b'

I'm using Pig 0.10.
-
Re: SUM of project-range of fields?
Nathan Neff 2013-03-19, 01:50
It seems like I'm getting closer:
With this data:
Toy Story|0
GoldenEye|0|1|0|1
And this script:
movies = load 'movies' USING PigStorage('|');
movie_and_genres = FOREACH movies GENERATE $0, TOTUPLE($1 ..);
DUMP movie_and_genres;
describe movie_and_genres;
I get this output:
(Toy Story,(0))
(GoldenEye,(0,1,0,1))
movie_and_genres: {bytearray,()}
-
Re: SUM of project-range of fields?
Abhinav Neelam 2013-03-19, 17:43
Russell's code works with a little modification. (The cast to int doesn't work though.)
movie_and_genres = FOREACH movies GENERATE $0 as movie_name, (bag{tuple()})TOBAG($2 ..) AS genres: bag{genre_bit: tuple()};
foo = foreach movie_and_genres generate movie_name, (int)SUM(genres) as genre_total;

Having said that, it appears from your problem description that there is a fixed number of genres and that every movie record contains either a 0 or a 1 for each genre. Ergo, every record has the same number of columns. (Is that right? I see your second example doesn't follow this, though.) If so, you could specify the detailed schema in your load statement and simplify matters.
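A minimal sketch of what a detailed schema in the load statement could look like, assuming every record has exactly three genre columns. The column names are hypothetical (only "action" and "comedy" are named in this thread), and records with a different number of columns would not fit this schema:

```pig
-- Sketch only: assumes a fixed layout of one name column and three genre columns.
-- "drama" is a hypothetical third genre used for illustration.
movies = LOAD 'movies' USING PigStorage('|')
         AS (movie_name:chararray, action:int, comedy:int, drama:int);

-- With typed columns, plain addition gives the genre count;
-- no bag construction or cast is needed.
movies_sum_genres = FOREACH movies GENERATE movie_name, action + comedy + drama AS genre_total;

DUMP movies_sum_genres;
```

This trades the flexibility of the project-range expression for explicit types, so it only applies when the number of genre columns is known in advance.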
Secondly, it appears that order matters in your genre bitmap (you say the first column corresponds to action movies and so on). Bags are unordered, so it makes sense to make a tuple out of your genre bitmap first because the TOBAG operation will throw away all column order information. You need to FLATTEN your tuple before TOBAG-ging and SUM-ming it though.
HTH,
Abhinav
-
Re: SUM of project-range of fields?
Nathan Neff 2013-03-19, 20:16
It works!!!!
This confirms that Pig is better than Java MapReduce :-)
Thanks everyone for their help.
Input:

Toy Story|0|0|0|0|1|1|0|0|0
GoldenEye|0|1|1|0|0|0|0|0|0|0|0|0|0|0|0|0|1|0|0
SomeNewMovie|0|0|0|0|1|1|0|0|0|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1

Script:

movies = load 'movies' USING PigStorage('|');
movie_and_genres = FOREACH movies GENERATE $0 as movie_name, (bag{tuple()})TOBAG($1 ..) AS genres: bag{genre_bit: tuple()};
movies_sum_genres = foreach movie_and_genres generate movie_name, (int)SUM(genres) as genre_total;
DUMP movies_sum_genres;

Output:

(Toy Story,2)
(GoldenEye,3)
(SomeNewMovie,41)
On Tue, Mar 19, 2013 at 12:43 PM, Abhinav Neelam <[EMAIL PROTECTED]> wrote:
> Russell's code works with a little modification. (The cast to int doesn't
> work though.)
>
> movie_and_genres = FOREACH movies GENERATE $0 as movie_name,
> (bag{tuple()})TOBAG($2 ..) AS genres: bag{genre_bit: tuple()};
> foo = foreach movie_and_genres generate movie_name, (int)SUM(genres) as
> genre_total;
>
> Having said that, it appears from your problem description that there're a
> fixed number of genres and every movie record would contain either a 0 or 1
> corresponding to that genre. Ergo, every record has the same number of
> columns. (Is that right? I see your second example doesn't follow this
> though.) Then you could specify the detailed schema in your load statement
> and simplify matters.
The second example is the main reason for using the range, in that new genres could be added arbitrarily at the end of each record. Thus movie#1 could have been added when there were 10 known genres in the format, but movie#1000 has 11 known genres in the record format.
> Secondly, it appears that order matters in your genre bitmap (you say the
> first column corresponds to action movies and so on).

That's correct -- for all movies, the first value indicates whether or not the movie belongs to the 'action' genre.

> Bags are unordered,
> so it makes sense to make a tuple out of your genre bitmap first because
> the TOBAG operation will throw away all column order information.
> You need to FLATTEN your tuple before TOBAG-ging and SUM-ming it though.

Hmm, in my use case (find the movies that belong to 2 or more genres) this wouldn't matter, but that's a very interesting (and tricky) point to note. Thank you very much!
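For the use case mentioned here (movies that belong to 2 or more genres), one possible next step is a filter on the summed column. A minimal sketch, assuming the movies_sum_genres relation and its genre_total field from the script earlier in this message; multi_genre_movies is a hypothetical alias:

```pig
-- Sketch only: builds on movies_sum_genres from the script above.
-- Keep the movies whose genre flags sum to 2 or more.
multi_genre_movies = FILTER movies_sum_genres BY genre_total >= 2;

DUMP multi_genre_movies;
```

Because only the total is used, the loss of column order inside the bag does not affect this result.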