Pig, mail # user - Join question


Re: Join question
jamal sasha 2013-04-01, 22:44
Hi,
  Yeah, there was a bug in my "stats" data.
I was wondering how I can calculate an average in Pig.
Something like:
http://stackoverflow.com/questions/12593527/finding-mean-using-pig-or-hadoop

But in the top response, it seems the user wanted the average across
all of the data, as in

count = COUNT(inpt)

where inpt is the complete input, whereas what I want is for the
denominator to be the count for each id.

So my data looks like:

id, value
1,1.0
1,3.0
1,5.0
2,1.0

So, the averages I am expecting are:

1,3.0
2,1.0

since (1 + 3 + 5) / 3 = 3,
whereas in the example COUNT(inpt) would give me 4?

How do I achieve this?
Thanks
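
A minimal sketch of a per-id average, assuming the sample rows above sit in a comma-separated file without the header line. The path 'input' and the relation names are placeholders, not taken from the thread. Grouping by id first means AVG runs over each id's bag, so the denominator is the per-id row count rather than COUNT over the whole input:

```pig
-- 'input' is a placeholder path; the schema matches the sample rows above
data = LOAD 'input' USING PigStorage(',') AS (id:int, value:double);

-- one bag of rows per id
by_id = GROUP data BY id;

-- AVG is applied to each id's bag, so it divides by that id's row count
avgs = FOREACH by_id GENERATE group AS id, AVG(data.value) AS mean;

DUMP avgs;
```

For the four sample rows this should give 3.0 for id 1 and 1.0 for id 2.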
On Mon, Apr 1, 2013 at 2:24 PM, Mehmet Tepedelenlioglu <[EMAIL PROTECTED]>
wrote:
>
> Are your ids unique?
>
> On 4/1/13 2:06 PM, "jamal sasha" <[EMAIL PROTECTED]> wrote:
>
> >Hi,
> >  I have a simple join question.
> >base = load 'input1'   USING PigStorage( ',' ) as (id1, field1, field2);
> >stats = load 'input2' USING PigStorage(',') as (id1, mean, median);
> >joined = JOIN base BY  id1, stats BY id1;
> >final = FOREACH joined GENERATE base::id1, base::field1,base::field2,
> >stats::mean,stats::median;
> >STORE final INTO   'output'   USING PigStorage( ',' );
> >
> >But something doesn't feel right.
> >The inputs are on the order of MBs, whereas the outputs are around 100 GB.
> >
> >I tried it on a sample file
> >where base is 35 MB
> >and stats is 10 MB,
> >and the output explodes to GBs.
> >What am I missing?
>
>
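
On the quoted join: one way a JOIN's output can be far larger than its inputs is repeated join keys, since every matching pair of rows yields an output row. That is what the question about unique ids is probing. A sketch for listing ids that occur more than once in stats, reusing the load statement quoted above; the other relation names are illustrative only:

```pig
-- same load as in the quoted script
stats = LOAD 'input2' USING PigStorage(',') AS (id1, mean, median);

-- count rows per join key
stats_by_id = GROUP stats BY id1;
id_counts = FOREACH stats_by_id GENERATE group AS id1, COUNT(stats) AS n;

-- keep only keys that appear more than once
dups = FILTER id_counts BY n > 1;

DUMP dups;
```

If dups is non-empty, each base row with that id is emitted once per duplicate stats row. The same check can be run against base.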