Home | About | Sematext search-lucene.com search-hadoop.com
 Search Hadoop and all its subprojects:

Switch to Threaded View
Pig >> mail # user >> Pivot table with zeros instead of empty


Copy link to this message
-
Re: Pivot table with zeros instead of empty
I would like to share what I've been trying to do to achieve my goal with
only PIG (no custom UDF - just to please my curiosity =)).
Please tell me if that's wrong direction.

So, I've got two files
exp-tuple1:
0 1
1 1
2 2
3 3
4 4
6 6
7 7
8 8
9 9

exp-tuple2
0 0
1 0
2 0
3 0
4 0
5 0     <--- please, note, there is not 5 in exp-tuple1
6 0
7 0
8 0
9 0

The goal is to get something like (1, 1, 2, 3, 4, 0, 6, 7, 8, 9) from
exp-tuple1, but with zero for 5 instead of empty value.

a = load 'exp-tuple1' as (hour:int, count:int);
b = load 'exp-tuple2' as (hour:int, count:int);

joined = join b by hour left, a by hour;
c = foreach joined generate b::hour, b::count + (a::count is null ? 0 :
a::count) as count;
grouped = group c all;    <-- instead of all I have real value
d = foreach grouped generate group, $1.count;

dump d; -- (all,{(1),(1),(2),(3),(4),(0),(6),(7),(8),(9)})
I don't care about all these braces, it's good to have such a result.

Instead of "all" there would real value. So, it seems it's possible with
having hack of tuple with zeros for all 24 hours.
But, the question is would it be performant enough? Or it's still better to
write my own UDF?
On Fri, Mar 8, 2013 at 12:55 PM, Eugene Morozov
<[EMAIL PROTECTED]>wrote:

> Prashant,
>
> thank you for the advice. That's a good one!
>
> On Thu, Mar 7, 2013 at 12:04 PM, Prashant Kommireddi <[EMAIL PROTECTED]>wrote:
>
>> You can group on ip which will result in ip -> bag mappings. Let's say you
>> loaded (ip, hour, count) dataset as follows
>>
>> grunt> a = load 'data' using PigStorage(',') as (ip:chararray, hour:int,
>> count:int);
>> grunt> b = group a by
>> ip;
>> grunt> describe b;
>> b: {group: chararray,a: {(ip: chararray,hour: int,count: int)}}
>>
>> The group here is ip and the resulting bag contains (ip, hour, count)
>> tuples.
>> grunt> dump b;
>>
>> (128.187.97.22,{(128.187.97.22,0,180),(128.187.97.22,1,84),(128.187.97.22,2,25),(128.187.97.22,22,31),(128.187.97.22,23,2)})
>>
>> You can write a very simple UDF that returns a Tuple based on the contents
>> of this bag that would have what you need. The UDF would traverse through
>> the tuples in the bag and create a map of hour->count. Your result tuple
>> should be sorted by hour and only return the counts in an order
>> (m0...m23).
>>
>> -Prashant
>>
>> On Wed, Mar 6, 2013 at 7:20 AM, Eugene Morozov <[EMAIL PROTECTED]
>> >wrote:
>>
>> > Hello!
>> >
>> > I have a script that gives me following result:
>> >
>> > time_grouped = GROUP joined BY (ip, hour);
>> > counts = FOREACH time_grouped GENERATE group.ip as ip, group.hour as
>> hour,
>> > COUNT(joined) as count;
>> >
>> > (128.187.97.22, 0, 180)
>> > (128.187.97.22, 1, 84)
>> > (128.187.97.22, 2, 25)
>> > (128.187.97.22, 22, 31)
>> > (128.187.97.22, 23, 2)
>> >
>> > That is IP address, hour of day, counter.
>> > I'd like to get following:
>> >
>> > (128.187.97.22, (m1,m2, m3, ..., m23))
>> > m1-m23 corresponds to the counter. And if there is nothing for
>> particular
>> > hour, then I'd like to have 0 instead empty value.
>> >
>> > The trick here is that if I do not have anything for particular hour,
>> then
>> > I won't have count for it.
>> > Is there a way to achieve the goal?
>> >
>> > Thanks in advance
>> > --
>> > Evgeny Morozov
>> > Developer Grid Dynamics
>> > Skype: morozov.evgeny
>> > www.griddynamics.com
>> > [EMAIL PROTECTED]
>> >
>>
>
>
>
> --
> Evgeny Morozov
> Developer Grid Dynamics
> Skype: morozov.evgeny
> www.griddynamics.com
> [EMAIL PROTECTED]
>

--
Evgeny Morozov
Developer Grid Dynamics
Skype: morozov.evgeny
www.griddynamics.com
[EMAIL PROTECTED]