Home | About | Sematext search-lucene.com search-hadoop.com
NEW: Monitor These Apps!
elasticsearch, apache solr, apache hbase, hadoop, redis, casssandra, amazon cloudwatch, mysql, memcached, apache kafka, apache zookeeper, apache storm, ubuntu, centOS, red hat, debian, puppet labs, java, senseiDB
 Search Hadoop and all its subprojects:

Switch to Plain View
Pig >> mail # user >> calculate percentage


+
Marco Cadetg 2011-10-11, 15:20
+
Xiaomeng Wan 2011-10-11, 15:39
+
Marco Cadetg 2011-10-11, 16:02
+
Marco Cadetg 2011-10-12, 06:58
+
Dmitriy Ryaboy 2011-10-12, 08:34
+
Marco Cadetg 2011-10-12, 09:38
Copy link to this message
-
Re: calculate percentage
Adding FLATTEN to your "grouped-by-multiple-cols" relation
(iq_per_region_per_gender) will make it much easier to join and visualize.
 Once your join keys are flat string literals ("gender"), then it's just a
straightforward JOIN/FOREACH.

Here's a fragment that seems to do what you need:

A = LOAD 'test.txt' USING PigStorage(',') AS
(name:chararray,region:chararray,gender:chararray,iq:int);
per_gender = FOREACH (GROUP A by gender) GENERATE group, SUM(A.iq) as
gentot:long;
per_gender_region = FOREACH (GROUP A by (gender, region)) GENERATE
FLATTEN(group), SUM(A.iq) AS genregtot:long;
joined = JOIN per_gender by $0, per_gender_region by $0;
final = FOREACH joined GENERATE per_gender::group, region,
(float)genregtot/(float)gentot;

Which yields the following:

(Male,Here,0.89285713)
(Male,There,0.10714286)
(Female,Here,0.13793103)
(Female,There,0.86206895)

Norbert

On Wed, Oct 12, 2011 at 5:38 AM, Marco Cadetg <[EMAIL PROTECTED]> wrote:

> Yes but I'm still not able to compute the percentage. I've joined the bags
> as below.
>
> A = LOAD '/data/marco/foo.csv' USING PigStorage(',') AS (name:cha
> rarray, region:chararray, gender:chararray, iq:int);
> iq_per_region_per_gender = GROUP A BY (region, gender);
> total_iq_per_gender = GROUP A BY (gender);
>
> describe iq_per_region_per_gender
> iq_per_region_per_gender: {group: (region: chararray,gender: chararray),A:
> {(name: chararray,region: chararray,gender: chararray,iq: int)}}
>
> describe total_iq_per_gender;
> total_iq_per_gender: {group: chararray,A: {(name: chararray,region:
> chararray,gender: chararray,iq: int)}}
>
> total = JOIN iq_per_region_per_gender BY group.gender, total_iq_per_gender
> BY $0;
>
> describe total
> total: {iq_per_region_per_gender::group: (region: chararray,gender:
> chararray),iq_per_region_per_gender::A: {(name: chararray,region:
> chararray,gender: chararray,iq: int)},total_iq_per_gender::group:
> chararray,total_iq_per_gender::A: {(name: chararray,region:
> chararray,gender: chararray,iq: int)}}
>
> -- Now I would like to use the 'joined' data.
> -- providing me sth like this:
> -- Male, Here, 0.2
> -- Female, Here, 0,8
> -- Male, There, 1
> -- Female, There, 0
> -- But I'm not sure how my FOREACH GENERATE needs to look like.
>
>
> On Wed, Oct 12, 2011 at 10:34 AM, Dmitriy Ryaboy <[EMAIL PROTECTED]>
> wrote:
>
> > Sure, just join your total counts with your partials on gender.
> >
> > D
> >
> > On Tue, Oct 11, 2011 at 11:58 PM, Marco Cadetg <[EMAIL PROTECTED]> wrote:
> >
> > > D'oh I just see that unfortunately my example was a bit over
> simplified.
> > > The
> > > total needs to be grouped by another field like below.
> > >
> > > A = LOAD 'student' USING PigStorage() AS (name:chararray,
> > region:chararray,
> > > gender:charrarray, iq:int);
> > > DUMP A;
> > > (Eva, There, Female,500)
> > > (John, There, Male, 10)
> > > (Alf, There, Male, 10)
> > > (ET, There, Male, 10)
> > > (Mary, Here, Female, 80)
> > > (Bill, Here, Male, 100)
> > > (Joe, Here, Male, 150)
> > >
> > > total_iq_per_region = GROUP A BY (region, gender);
> > >
> > > total_iq_per_region_per_gender = FOREACH total_iq_per_region
> > > {
> > >  GENERATE FLATTEN(group),
> > >  SUM(A.iq) AS iq_per_region_per_gender;
> > > }
> > >
> > > total_iq_per_gender = GROUP A BY (gender);
> > >
> > > total_iq_per_gender = FOREACH A
> > > {
> > >  GENERATE FLATTEN(group),
> > >  SUM(A.iq) AS iq_per_gender;
> > > }
> > >
> > > Now I guess I could use JOIN to combine both bags(?) by gender but
> > somehow
> > > I
> > > don't get it.
> > >
> > > Thanks
> > > -Marco
> > >
> > > On Tue, Oct 11, 2011 at 6:02 PM, Marco Cadetg <[EMAIL PROTECTED]>
> wrote:
> > >
> > > > Thanks a lot, Shawn! Looks like I need to learn some basics ;)
> > > > -Marco
> > > >
> > > > On Tue, Oct 11, 2011 at 5:39 PM, Xiaomeng Wan <[EMAIL PROTECTED]>
> > > wrote:
> > > >
> > > >> total_iq = foreach (group A by all) generate SUM(A.iq) as total;
> > > >>
> > > >> total_iq_per_region = FOREACH total_iq_per_region
> > > >> {
> > > >>  GENERATE FLATTEN(group),
+
Marco Cadetg 2011-10-12, 16:03
NEW: Monitor These Apps!
elasticsearch, apache solr, apache hbase, hadoop, redis, casssandra, amazon cloudwatch, mysql, memcached, apache kafka, apache zookeeper, apache storm, ubuntu, centOS, red hat, debian, puppet labs, java, senseiDB