Pig, mail # user - Spreading data in Pig

Re: Spreading data in Pig
John Meek 2013-03-31, 20:44
Thanks Jacob. That looks like it will work. I've got to figure out a way to port that R function to Jython to make a UDF consistent with the rest of my script. Thanks.



-----Original Message-----
From: Jacob Perkins <[EMAIL PROTECTED]>
Sent: Sun, Mar 31, 2013 2:13 pm
Subject: Re: Spreading data in Pig
Hi John,

The only way I can think of to do this is using the RANK operator
(available from Pig version 0.11 onward) along with a custom udf as follows:

* RANK the users relation to result in something like:

(User1, 1)
(User2, 2)
(User3, 3)
(User4, 4)
(User5, 5)
(User6, 6)
(User7, 7)
(User8, 8)
(User9, 9)

* Use a udf that functions much like the rstats "seq" function
(http://stat.ethz.ch/R-manual/R-devel/library/base/html/seq.html) that
generates a bag containing integers from 0 up to (but not including) the
capacity of a given host, to result in something like:

(Hostb, {(0),(1)})
(Hostc, {(0),(1),(2),(3)})
(Hostd, {(0),(1),(2)})

which can then be flattened in a projection to result in:

(Hostb, 0)
(Hostb, 1)
(Hostc, 0)
(Hostc, 1)
(Hostc, 2)
(Hostc, 3)
(Hostd, 0)
(Hostd, 1)
(Hostd, 2)

(Basically reversing any aggregation that was done to produce the
capacity count in the first place...)
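
As a rough illustration of the seq-and-flatten step, here is a minimal
Python sketch written so it could also be registered as a Jython UDF. The
function names seq and flatten_hosts, the single-argument signature, and
the schema string are assumptions made for this example; they are not the
Seq udf used elsewhere in this thread. The outputSchema decorator is
normally supplied by Pig when the script is registered, so a no-op
fallback is defined to let the sketch run as plain Python.

```python
# Pig injects `outputSchema` when this file is registered as a Jython UDF.
# Outside Pig the name is missing, so define a no-op stand-in (assumption).
try:
    outputSchema
except NameError:
    def outputSchema(schema):
        def wrap(func):
            return func
        return wrap


@outputSchema("seq:bag{t:tuple(num:int)}")
def seq(capacity):
    """Return a bag (list of 1-tuples) holding 0 .. capacity-1."""
    if capacity is None or capacity <= 0:
        return []
    return [(i,) for i in range(capacity)]


def flatten_hosts(hosts):
    """Mimic `generate host_id, flatten(seq(capacity))` in plain Python.

    hosts is a list of (host_id, capacity) pairs; the result has one
    (host_id, num) row per unit of capacity.
    """
    return [(host_id, t[0]) for host_id, capacity in hosts for t in seq(capacity)]
```

Only seq would be called from Pig; flatten_hosts is there to show what the
flatten in the projection does to each bag.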

* Rank the exploded set of hosts to result in:

(Hostb, 1)
(Hostb, 2)
(Hostc, 3)
(Hostc, 4)
(Hostc, 5)
(Hostc, 6)
(Hostd, 7)
(Hostd, 8)
(Hostd, 9)

* You can then join the ranked hosts and the ranked users by rank and
project out fields you don't need to result in:

(Hostb, User1)
(Hostb, User2)
(Hostc, User3)
(Hostc, User4)
(Hostc, User5)
(Hostc, User6)
(Hostd, User7)
(Hostd, User8)
(Hostd, User9)
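
The steps above can be checked outside Pig with a small plain-Python
sketch. It is only a local model of the idea: the function name spread and
the in-memory lists are assumptions for illustration, and rank is modelled
as a sequential 1-based row number.

```python
def spread(users, hosts):
    """Assign users to hosts by capacity using the rank-and-join idea.

    users: list of user ids; hosts: list of (host_id, capacity) pairs.
    Returns a list of (host_id, user_id) pairs.
    """
    # Step 1: explode each host into one row per unit of capacity.
    slots = [host_id for host_id, capacity in hosts for _ in range(capacity)]

    # Step 2: rank both relations with sequential 1-based row numbers.
    ranked_users = {rank: user for rank, user in enumerate(users, start=1)}
    ranked_slots = {rank: host for rank, host in enumerate(slots, start=1)}

    # Step 3: inner join on rank, keeping only (host, user).
    shared = sorted(set(ranked_users) & set(ranked_slots))
    return [(ranked_slots[r], ranked_users[r]) for r in shared]


if __name__ == "__main__":
    users = ["User%d" % i for i in range(1, 10)]
    hosts = [("Hostb", 2), ("Hostc", 4), ("Hostd", 3)]
    for host, user in spread(users, hosts):
        print(host, user)
```

Because the join is an inner join on rank, any users beyond the total
capacity (or any spare capacity beyond the user count) would simply drop
out of the result.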

Here's some example pig code that I used that works with pig 0.11 (I
already have a Seq udf):


users = load 'users' as (user_id:chararray);
hosts = load 'hosts' as (host_id:chararray, capacity:int);

-- Expand each host into one row per unit of capacity
hosts_exploded = foreach hosts {
                   sequence = Seq(0, capacity, capacity);
                   generate
                     host_id           as host_id,
                     flatten(sequence) as num;
                 };

ranked_users = rank users;
ranked_hosts = rank hosts_exploded;

spread = foreach (join ranked_users by $0, ranked_hosts by $0) generate
host_id, user_id;

dump spread;

Hope that helps!


On Sun, 2013-03-31 at 12:06 -0400, John Meek wrote:
> hey all,
> Can anyone let me know how I can accomplish below problem in Pig?
> I have 2 data sources:
> TABLE A with a list of User IDs:
> User1
> User2
> User3
> User4
> User5
> User6
> User7
> User8
> User9
> TABLE B with (Host name, Capacity):
> Hostb 2
> Hostc 4
> Hostd 3
> I basically need to spread the data in Table A across Table B based on how
> much capacity Table B has.
> So end result should be a file:
> User1 Hostb
> User2 Hostb
> User3 Hostc
> User4 Hostc
> User5 Hostc
> User6 Hostc
> User7 Hostd
> User8 Hostd
> User9 Hostd
> The order does not matter as long as each Host gets the capacity it can take.
> Also the SUM(TableB.Capacity) will always equal COUNT(TableA.UserID), so there
> won't be any extra or missing values to plug in.
> thanks,
> JM