Home | About | Sematext search-lucene.com search-hadoop.com
NEW: Monitor These Apps!
elasticsearch, apache solr, apache hbase, hadoop, redis, casssandra, amazon cloudwatch, mysql, memcached, apache kafka, apache zookeeper, apache storm, ubuntu, centOS, red hat, debian, puppet labs, java, senseiDB
 Search Hadoop and all its subprojects:

Switch to Threaded View
Pig >> mail # user >> Pivot table with zeros instead of empty


Copy link to this message
-
Re: Pivot table with zeros instead of empty
You can group on ip which will result in ip -> bag mappings. Let's say you
loaded (ip, hour, count) dataset as follows

grunt> a = load 'data' using PigStorage(',') as (ip:chararray, hour:int,
count:int);
grunt> b = group a by
ip;
grunt> describe b;
b: {group: chararray,a: {(ip: chararray,hour: int,count: int)}}

The group here is ip and the resulting bag contains (ip, hour, count)
tuples.
grunt> dump b;
(128.187.97.22,{(128.187.97.22,0,180),(128.187.97.22,1,84),(128.187.97.22,2,25),(128.187.97.22,22,31),(128.187.97.22,23,2)})

You can write a very simple UDF that returns a Tuple based on the contents
of this bag that would have what you need. The UDF would traverse through
the tuples in the bag and create a map of hour->count. Your result tuple
should be sorted by hour and only return the counts in an order (m0...m23).

-Prashant

On Wed, Mar 6, 2013 at 7:20 AM, Eugene Morozov <[EMAIL PROTECTED]>wrote:

> Hello!
>
> I have a script that gives me following result:
>
> time_grouped = GROUP joined BY (ip, hour);
> counts = FOREACH time_grouped GENERATE group.ip as ip, group.hour as hour,
> COUNT(joined) as count;
>
> (128.187.97.22, 0, 180)
> (128.187.97.22, 1, 84)
> (128.187.97.22, 2, 25)
> (128.187.97.22, 22, 31)
> (128.187.97.22, 23, 2)
>
> That is IP address, hour of day, counter.
> I'd like to get following:
>
> (128.187.97.22, (m1,m2, m3, ..., m23))
> m1-m23 corresponds to the counter. And if there is nothing for particular
> hour, then I'd like to have 0 instead empty value.
>
> The trick here is that if I do not have anything for particular hour, then
> I won't have count for it.
> Is there a way to achieve the goal?
>
> Thanks in advance
> --
> Evgeny Morozov
> Developer Grid Dynamics
> Skype: morozov.evgeny
> www.griddynamics.com
> [EMAIL PROTECTED]
>
NEW: Monitor These Apps!
elasticsearch, apache solr, apache hbase, hadoop, redis, casssandra, amazon cloudwatch, mysql, memcached, apache kafka, apache zookeeper, apache storm, ubuntu, centOS, red hat, debian, puppet labs, java, senseiDB