Pig, mail # user - Spreading data in Pig


Re: Spreading data in Pig
Jacob Perkins 2013-03-31, 18:13
Hi John,

The only way I can think of to do this is to use the RANK operator
(introduced in Pig 0.11) along with a custom UDF, as follows:

* RANK the users relation to result in something like:

(User1, 1)
(User2, 2)
(User3, 3)
(User4, 4)
(User5, 5)
(User6, 6)
(User7, 7)
(User8, 8)
(User9, 9)

* Use a UDF that works much like R's "seq" function
(http://stat.ethz.ch/R-manual/R-devel/library/base/html/seq.html) to
generate a bag of integers from 0 up to, but not including, the capacity
of a given host:

(Hostb, {(0),(1)})
(Hostc, {(0),(1),(2),(3)})
(Hostd, {(0),(1),(2)})

which can then be flattened in a projection to result in:

(Hostb, 0)
(Hostb, 1)
(Hostc, 0)
(Hostc, 1)
(Hostc, 2)
(Hostc, 3)
(Hostd, 0)
(Hostd, 1)
(Hostd, 2)

(Basically reversing any aggregation that was done to produce the
capacity count in the first place...)
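
To make the explode step concrete, here is a minimal plain-Python sketch of what a Seq-like helper followed by a flatten does. The names seq_bag and explode_hosts are hypothetical and exist only for illustration; this is not the Pig UDF itself.

```python
def seq_bag(capacity):
    """Return a 'bag' (list of 1-tuples) holding 0 .. capacity-1."""
    return [(i,) for i in range(capacity)]


def explode_hosts(hosts):
    """Emit one (host_id, num) row per unit of capacity, like FLATTEN on the bag."""
    rows = []
    for host_id, capacity in hosts:
        for (num,) in seq_bag(capacity):
            rows.append((host_id, num))
    return rows


# Same sample data as in the example above.
hosts = [("Hostb", 2), ("Hostc", 4), ("Hostd", 3)]
for row in explode_hosts(hosts):
    print(row)
```

Each host ends up with exactly as many rows as it has capacity, which is what lets the later rank-and-join step line up one user per row.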

* Rank the exploded set of hosts to result in:

(Hostb, 1)
(Hostb, 2)
(Hostc, 3)
(Hostc, 4)
(Hostc, 5)
(Hostc, 6)
(Hostd, 7)
(Hostd, 8)
(Hostd, 9)

* You can then join the ranked hosts and the ranked users on rank and
drop the fields you don't need, resulting in:

(Hostb, User1)
(Hostb, User2)
(Hostc, User3)
(Hostc, User4)
(Hostc, User5)
(Hostc, User6)
(Hostd, User7)
(Hostd, User8)
(Hostd, User9)
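
As a sanity check on the logic, the whole rank-and-join idea can be sketched in plain Python. This is only a simulation under the assumption that RANK without a BY clause simply prepends a 1-based row number; the function names rank and spread are hypothetical.

```python
def rank(rows):
    """Prepend a 1-based row number to each row (assumed RANK-without-BY behavior)."""
    return [(i,) + tuple(row) for i, row in enumerate(rows, start=1)]


def spread(users, hosts):
    """Assign users to hosts so each host receives as many users as its capacity."""
    # One slot row per unit of capacity, in host order.
    slots = [(host_id,) for host_id, capacity in hosts for _ in range(capacity)]
    ranked_users = rank([(user_id,) for user_id in users])
    ranked_slots = rank(slots)
    # Inner join on the rank column, done here with a dict lookup.
    host_by_rank = {r: host_id for r, host_id in ranked_slots}
    return [(host_by_rank[r], user_id)
            for r, user_id in ranked_users
            if r in host_by_rank]


users = ["User%d" % i for i in range(1, 10)]
hosts = [("Hostb", 2), ("Hostc", 4), ("Hostd", 3)]
for pair in spread(users, hosts):
    print(pair)
```

Because the join is an inner join on rank, any users beyond the total capacity would simply be dropped; that case does not arise when the total capacity equals the number of users.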

Here's some example Pig code that works with Pig 0.11 (it assumes a
custom Seq UDF, which I already have):

************

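-- Seq is a custom UDF (not a Pig builtin); register/define it before this point.
users = load 'users' as (user_id:chararray);
hosts = load 'hosts' as (host_id:chararray, capacity:int);

-- Emit one row per unit of capacity for each host.
hosts_exploded = foreach hosts {
                   sequence = Seq(0, capacity, capacity);
                   generate
                     host_id           as host_id,
                     flatten(sequence) as num;
                 };

-- RANK with no BY clause prepends a sequential row number as field $0.
ranked_users = rank users;
ranked_hosts = rank hosts_exploded;

-- Join on the rank column and keep only the host and the user.
spread = foreach (join ranked_users by $0, ranked_hosts by $0)
         generate host_id, user_id;

dump spread;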

************
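
The script relies on a Seq UDF that is not shown. As a rough stand-in, here is a minimal sketch of a Jython UDF that returns a bag of integers 0 .. n-1. The function name seq_upto, its single-argument signature, and the schema string are assumptions for illustration and do not match the three-argument Seq call in the script; the fallback decorator is there only so the file can also be run outside Pig.

```python
# Pig's Jython engine supplies the outputSchema decorator; define a no-op
# fallback so this file can also be imported and tested in plain Python.
try:
    outputSchema
except NameError:
    def outputSchema(schema):
        def wrap(func):
            return func
        return wrap


@outputSchema("seq:bag{t:tuple(num:int)}")
def seq_upto(n):
    """Hypothetical UDF: return a bag (list of 1-tuples) holding 0 .. n-1."""
    if n is None or n <= 0:
        return []
    return [(i,) for i in range(n)]
```

Assuming the file is saved as seq_udf.py (a hypothetical name), it could be registered with "register 'seq_udf.py' using jython as myfuncs;" and called as myfuncs.seq_upto(capacity) in place of the Seq call.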
Hope that helps!

--jacob
@thedatachef

On Sun, 2013-03-31 at 12:06 -0400, John Meek wrote:
> hey all,
>
> Can anyone let me know how I can solve the problem below in Pig?
>
> I have 2 data sources:
>
> TABLE A with a list of User IDs:
>
> User1
> User2
> User3
> User4
> User5
> User6
> User7
> User8
> User9
>
> TABLE B with (Host name, Capacity):
>
> Hostb 2
> Hostc 4
> Hostd 3
>
>
> I basically need to spread the users in Table A across the hosts in Table B according to how much capacity each host has.
>
> So end result should be a file:
>
> User1 Hostb
> User2 Hostb
> User3 Hostc
> User4 Hostc
> User5 Hostc
> User6 Hostc
> User7 Hostd
> User8 Hostd
> User9 Hostd
>
> The order does not matter as long as each host gets exactly the capacity it can take. Also, SUM(TableB.Capacity) will always equal COUNT(TableA.UserID), so there won't be any extra or missing values to plug in.
>
>
> thanks,
> JM
>
>