Home | About | Sematext search-lucene.com search-hadoop.com
 Search Hadoop and all its subprojects:

Switch to Threaded View
Pig, mail # user - Pivot table with zeros instead of empty

Copy link to this message
Re: Pivot table with zeros instead of empty
Prashant Kommireddi 2013-03-07, 08:04
You can group on ip which will result in ip -> bag mappings. Let's say you
loaded (ip, hour, count) dataset as follows

grunt> a = load 'data' using PigStorage(',') as (ip:chararray, hour:int,
grunt> b = group a by
grunt> describe b;
b: {group: chararray,a: {(ip: chararray,hour: int,count: int)}}

The group here is ip and the resulting bag contains (ip, hour, count)
grunt> dump b;

You can write a very simple UDF that returns a Tuple based on the contents
of this bag that would have what you need. The UDF would traverse through
the tuples in the bag and create a map of hour->count. Your result tuple
should be sorted by hour and only return the counts in an order (m0...m23).


On Wed, Mar 6, 2013 at 7:20 AM, Eugene Morozov <[EMAIL PROTECTED]>wrote:

> Hello!
> I have a script that gives me following result:
> time_grouped = GROUP joined BY (ip, hour);
> counts = FOREACH time_grouped GENERATE group.ip as ip, group.hour as hour,
> COUNT(joined) as count;
> (, 0, 180)
> (, 1, 84)
> (, 2, 25)
> (, 22, 31)
> (, 23, 2)
> That is IP address, hour of day, counter.
> I'd like to get following:
> (, (m1,m2, m3, ..., m23))
> m1-m23 corresponds to the counter. And if there is nothing for particular
> hour, then I'd like to have 0 instead empty value.
> The trick here is that if I do not have anything for particular hour, then
> I won't have count for it.
> Is there a way to achieve the goal?
> Thanks in advance
> --
> Evgeny Morozov
> Developer Grid Dynamics
> Skype: morozov.evgeny
> www.griddynamics.com