Pig, mail # user - SUM of project-range of fields?

Re: SUM of project-range of fields?
Abhinav Neelam 2013-03-19, 17:43
Russell's code works with a little modification. (The cast to int doesn't
work though.)

movie_and_genres = FOREACH movies GENERATE $0 as movie_name,
(bag{tuple()})TOBAG($2 ..) AS genres: bag{genre_bit: tuple()};
foo = foreach movie_and_genres generate movie_name, (int)SUM(genres) as
Having said that, it appears from your problem description that there is a
fixed number of genres and that every movie record contains either a 0 or a 1
for each genre, so every record has the same number of columns. (Is that
right? Your second example doesn't follow this, though.) If so, you could
specify the detailed schema in your load statement and simplify matters.
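
As a rough sketch of what a schema in the load statement could look like,
assuming exactly four genre columns: the names action, adventure, comedy and
drama are placeholders I made up (only "action comes first" is from your
description), and the relation name genre_totals is hypothetical too.

```pig
-- Sketch only: the four genre column names are hypothetical placeholders.
movies = LOAD 'movies' USING PigStorage('|')
         AS (movie_name:chararray, action:int, adventure:int,
             comedy:int, drama:int);

-- With typed columns the per-movie total is plain arithmetic,
-- so no bag and no cast are involved.
genre_totals = FOREACH movies GENERATE movie_name,
               action + adventure + comedy + drama AS genre_count;

DUMP genre_totals;
```

One caveat: arithmetic on a null yields null in Pig, so a short row such as
"Toy Story|0", whose missing columns should load as nulls, would most likely
get a null genre_count rather than 0. This only fits if every record really
has all the columns.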

Secondly, it appears that order matters in your genre bitmap (you say the
first column corresponds to action movies and so on). Bags are unordered, and
the TOBAG operation throws away all column order information, so it makes
sense to make a tuple out of your genre bitmap first. You need to FLATTEN
that tuple before TOBAG-ging and SUM-ming it, though.
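
For illustration, here is one way to keep the ordered bitmap and still get a
sum. Note this is not the FLATTEN route described above: it assumes a typed
load with four genre columns (the column names are made up) and builds the
tuple and the bag side by side from the same columns.

```pig
-- Sketch only: the genre column names and the four-column layout are assumptions.
movies = LOAD 'movies' USING PigStorage('|')
         AS (movie_name:chararray, action:int, adventure:int,
             comedy:int, drama:int);

-- The tuple preserves column order; the bag exists only to feed SUM.
movie_and_genres = FOREACH movies GENERATE movie_name,
                   TOTUPLE(action, adventure, comedy, drama) AS genre_bits,
                   TOBAG(action, adventure, comedy, drama) AS genre_bag;

genre_totals = FOREACH movie_and_genres GENERATE movie_name, genre_bits,
               SUM(genre_bag) AS genre_count;

DUMP genre_totals;
```

Since SUM ignores nulls, this variant should also cope with rows that have
fewer genre columns than the schema declares.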

On 19 March 2013 07:20, Nathan Neff <[EMAIL PROTECTED]> wrote:

> It seems like I'm getting closer:
> With this data:
> Toy Story|0
> GoldenEye|0|1|0|1
> And this script:
> movies = load 'movies' USING PigStorage('|');
> movie_and_genres = FOREACH movies GENERATE $0, TOTUPLE($1 ..);
> DUMP movie_and_genres;
> describe movie_and_genres;
> I get this output:
> (Toy Story,(0))
> (GoldenEye,(0,1,0,1))
> movie_and_genres: {bytearray,()}