Home | About | Sematext search-lucene.com search-hadoop.com
 Search Hadoop and all its subprojects:

Switch to Threaded View
Pig >> mail # user >> calculate percentage


Copy link to this message
-
Re: calculate percentage
Adding FLATTEN to your "grouped-by-multiple-cols" relation
(iq_per_region_per_gender) will make it much easier to join and visualize.
 Once your join keys are flat string literals ("gender"), then it's just a
straightforward JOIN/FOREACH.

Here's a fragment that seems to do what you need:

A = LOAD 'test.txt' USING PigStorage(',') AS
(name:chararray,region:chararray,gender:chararray,iq:int);
per_gender = FOREACH (GROUP A by gender) GENERATE group, SUM(A.iq) as
gentot:long;
per_gender_region = FOREACH (GROUP A by (gender, region)) GENERATE
FLATTEN(group), SUM(A.iq) AS genregtot:long;
joined = JOIN per_gender by $0, per_gender_region by $0;
final = FOREACH joined GENERATE per_gender::group, region,
(float)genregtot/(float)gentot;

Which yields the following:

(Male,Here,0.89285713)
(Male,There,0.10714286)
(Female,Here,0.13793103)
(Female,There,0.86206895)

Norbert

On Wed, Oct 12, 2011 at 5:38 AM, Marco Cadetg <[EMAIL PROTECTED]> wrote:

> Yes but I'm still not able to compute the percentage. I've joined the bags
> as below.
>
> A = LOAD '/data/marco/foo.csv' USING PigStorage(',') AS (name:cha
> rarray, region:chararray, gender:chararray, iq:int);
> iq_per_region_per_gender = GROUP A BY (region, gender);
> total_iq_per_gender = GROUP A BY (gender);
>
> describe iq_per_region_per_gender
> iq_per_region_per_gender: {group: (region: chararray,gender: chararray),A:
> {(name: chararray,region: chararray,gender: chararray,iq: int)}}
>
> describe total_iq_per_gender;
> total_iq_per_gender: {group: chararray,A: {(name: chararray,region:
> chararray,gender: chararray,iq: int)}}
>
> total = JOIN iq_per_region_per_gender BY group.gender, total_iq_per_gender
> BY $0;
>
> describe total
> total: {iq_per_region_per_gender::group: (region: chararray,gender:
> chararray),iq_per_region_per_gender::A: {(name: chararray,region:
> chararray,gender: chararray,iq: int)},total_iq_per_gender::group:
> chararray,total_iq_per_gender::A: {(name: chararray,region:
> chararray,gender: chararray,iq: int)}}
>
> -- Now I would like to use the 'joined' data.
> -- providing me sth like this:
> -- Male, Here, 0.2
> -- Female, Here, 0,8
> -- Male, There, 1
> -- Female, There, 0
> -- But I'm not sure how my FOREACH GENERATE needs to look like.
>
>
> On Wed, Oct 12, 2011 at 10:34 AM, Dmitriy Ryaboy <[EMAIL PROTECTED]>
> wrote:
>
> > Sure, just join your total counts with your partials on gender.
> >
> > D
> >
> > On Tue, Oct 11, 2011 at 11:58 PM, Marco Cadetg <[EMAIL PROTECTED]> wrote:
> >
> > > D'oh I just see that unfortunately my example was a bit over
> simplified.
> > > The
> > > total needs to be grouped by another field like below.
> > >
> > > A = LOAD 'student' USING PigStorage() AS (name:chararray,
> > region:chararray,
> > > gender:charrarray, iq:int);
> > > DUMP A;
> > > (Eva, There, Female,500)
> > > (John, There, Male, 10)
> > > (Alf, There, Male, 10)
> > > (ET, There, Male, 10)
> > > (Mary, Here, Female, 80)
> > > (Bill, Here, Male, 100)
> > > (Joe, Here, Male, 150)
> > >
> > > total_iq_per_region = GROUP A BY (region, gender);
> > >
> > > total_iq_per_region_per_gender = FOREACH total_iq_per_region
> > > {
> > >  GENERATE FLATTEN(group),
> > >  SUM(A.iq) AS iq_per_region_per_gender;
> > > }
> > >
> > > total_iq_per_gender = GROUP A BY (gender);
> > >
> > > total_iq_per_gender = FOREACH A
> > > {
> > >  GENERATE FLATTEN(group),
> > >  SUM(A.iq) AS iq_per_gender;
> > > }
> > >
> > > Now I guess I could use JOIN to combine both bags(?) by gender but
> > somehow
> > > I
> > > don't get it.
> > >
> > > Thanks
> > > -Marco
> > >
> > > On Tue, Oct 11, 2011 at 6:02 PM, Marco Cadetg <[EMAIL PROTECTED]>
> wrote:
> > >
> > > > Thanks a lot, Shawn! Looks like I need to learn some basics ;)
> > > > -Marco
> > > >
> > > > On Tue, Oct 11, 2011 at 5:39 PM, Xiaomeng Wan <[EMAIL PROTECTED]>
> > > wrote:
> > > >
> > > >> total_iq = foreach (group A by all) generate SUM(A.iq) as total;
> > > >>
> > > >> total_iq_per_region = FOREACH total_iq_per_region
> > > >> {
> > > >>  GENERATE FLATTEN(group),