Home | About | Sematext search-lucene.com search-hadoop.com
 Search Hadoop and all its subprojects:

Switch to Plain View
Pig, mail # user - Pivot table with zeros instead of empty


+
Eugene Morozov 2013-03-06, 15:20
+
Prashant Kommireddi 2013-03-07, 08:04
Copy link to this message
-
Re: Pivot table with zeros instead of empty
Eugene Morozov 2013-03-08, 08:55
Prashant,

thank you for the advice. That's a good one!

On Thu, Mar 7, 2013 at 12:04 PM, Prashant Kommireddi <[EMAIL PROTECTED]>wrote:

> You can group on ip which will result in ip -> bag mappings. Let's say you
> loaded (ip, hour, count) dataset as follows
>
> grunt> a = load 'data' using PigStorage(',') as (ip:chararray, hour:int,
> count:int);
> grunt> b = group a by
> ip;
> grunt> describe b;
> b: {group: chararray,a: {(ip: chararray,hour: int,count: int)}}
>
> The group here is ip and the resulting bag contains (ip, hour, count)
> tuples.
> grunt> dump b;
>
> (128.187.97.22,{(128.187.97.22,0,180),(128.187.97.22,1,84),(128.187.97.22,2,25),(128.187.97.22,22,31),(128.187.97.22,23,2)})
>
> You can write a very simple UDF that returns a Tuple based on the contents
> of this bag that would have what you need. The UDF would traverse through
> the tuples in the bag and create a map of hour->count. Your result tuple
> should be sorted by hour and only return the counts in an order (m0...m23).
>
> -Prashant
>
> On Wed, Mar 6, 2013 at 7:20 AM, Eugene Morozov <[EMAIL PROTECTED]
> >wrote:
>
> > Hello!
> >
> > I have a script that gives me following result:
> >
> > time_grouped = GROUP joined BY (ip, hour);
> > counts = FOREACH time_grouped GENERATE group.ip as ip, group.hour as
> hour,
> > COUNT(joined) as count;
> >
> > (128.187.97.22, 0, 180)
> > (128.187.97.22, 1, 84)
> > (128.187.97.22, 2, 25)
> > (128.187.97.22, 22, 31)
> > (128.187.97.22, 23, 2)
> >
> > That is IP address, hour of day, counter.
> > I'd like to get following:
> >
> > (128.187.97.22, (m1,m2, m3, ..., m23))
> > m1-m23 corresponds to the counter. And if there is nothing for particular
> > hour, then I'd like to have 0 instead empty value.
> >
> > The trick here is that if I do not have anything for particular hour,
> then
> > I won't have count for it.
> > Is there a way to achieve the goal?
> >
> > Thanks in advance
> > --
> > Evgeny Morozov
> > Developer Grid Dynamics
> > Skype: morozov.evgeny
> > www.griddynamics.com
> > [EMAIL PROTECTED]
> >
>

--
Evgeny Morozov
Developer Grid Dynamics
Skype: morozov.evgeny
www.griddynamics.com
[EMAIL PROTECTED]
+
Eugene Morozov 2013-03-08, 19:21