Home | About | Sematext search-lucene.com search-hadoop.com
NEW: Monitor These Apps!
elasticsearch, apache solr, apache hbase, hadoop, redis, casssandra, amazon cloudwatch, mysql, memcached, apache kafka, apache zookeeper, apache storm, ubuntu, centOS, red hat, debian, puppet labs, java, senseiDB
 Search Hadoop and all its subprojects:

Switch to Plain View
Pig >> mail # user >> Pivot table with zeros instead of empty


+
Eugene Morozov 2013-03-06, 15:20
+
Prashant Kommireddi 2013-03-07, 08:04
Copy link to this message
-
Re: Pivot table with zeros instead of empty
Prashant,

thank you for the advice. That's a good one!

On Thu, Mar 7, 2013 at 12:04 PM, Prashant Kommireddi <[EMAIL PROTECTED]>wrote:

> You can group on ip which will result in ip -> bag mappings. Let's say you
> loaded (ip, hour, count) dataset as follows
>
> grunt> a = load 'data' using PigStorage(',') as (ip:chararray, hour:int,
> count:int);
> grunt> b = group a by
> ip;
> grunt> describe b;
> b: {group: chararray,a: {(ip: chararray,hour: int,count: int)}}
>
> The group here is ip and the resulting bag contains (ip, hour, count)
> tuples.
> grunt> dump b;
>
> (128.187.97.22,{(128.187.97.22,0,180),(128.187.97.22,1,84),(128.187.97.22,2,25),(128.187.97.22,22,31),(128.187.97.22,23,2)})
>
> You can write a very simple UDF that returns a Tuple based on the contents
> of this bag that would have what you need. The UDF would traverse through
> the tuples in the bag and create a map of hour->count. Your result tuple
> should be sorted by hour and only return the counts in an order (m0...m23).
>
> -Prashant
>
> On Wed, Mar 6, 2013 at 7:20 AM, Eugene Morozov <[EMAIL PROTECTED]
> >wrote:
>
> > Hello!
> >
> > I have a script that gives me following result:
> >
> > time_grouped = GROUP joined BY (ip, hour);
> > counts = FOREACH time_grouped GENERATE group.ip as ip, group.hour as
> hour,
> > COUNT(joined) as count;
> >
> > (128.187.97.22, 0, 180)
> > (128.187.97.22, 1, 84)
> > (128.187.97.22, 2, 25)
> > (128.187.97.22, 22, 31)
> > (128.187.97.22, 23, 2)
> >
> > That is IP address, hour of day, counter.
> > I'd like to get following:
> >
> > (128.187.97.22, (m1,m2, m3, ..., m23))
> > m1-m23 corresponds to the counter. And if there is nothing for particular
> > hour, then I'd like to have 0 instead empty value.
> >
> > The trick here is that if I do not have anything for particular hour,
> then
> > I won't have count for it.
> > Is there a way to achieve the goal?
> >
> > Thanks in advance
> > --
> > Evgeny Morozov
> > Developer Grid Dynamics
> > Skype: morozov.evgeny
> > www.griddynamics.com
> > [EMAIL PROTECTED]
> >
>

--
Evgeny Morozov
Developer Grid Dynamics
Skype: morozov.evgeny
www.griddynamics.com
[EMAIL PROTECTED]
+
Eugene Morozov 2013-03-08, 19:21
NEW: Monitor These Apps!
elasticsearch, apache solr, apache hbase, hadoop, redis, casssandra, amazon cloudwatch, mysql, memcached, apache kafka, apache zookeeper, apache storm, ubuntu, centOS, red hat, debian, puppet labs, java, senseiDB