Pig, mail # user - Joining inner and outer bags

Re: Joining inner and outer bags
Thejas M Nair 2011-01-07, 18:44

On 1/7/11 9:20 AM, "Kris Coward" <[EMAIL PROTECTED]> wrote:

> Hi,
> I've got an outer bag/relation consisting of a bunch of user information,
> one of the pieces of which is an inner bag of possible events for that
> user, and the value of those events, should they occur. Outside the bag,
> there are also a few data concerning whether specific events have
> already occurred.
> In another relation, I have the assortment of events grouped with the
> probability that any of them will occur.
> I'd like to generate expected values for each user, but know that I
> can't JOIN within a FOREACH block (or do a nested FOREACH). For a UDF,
> I vaguely recall some sort of constraint on nesting inner bags that
> would interfere with my ability to bundle the possible events bag with
> the actual events data into a single object that could be passed to a
> UDF that extends EvalFunc.

I can't think of any limitations that would prevent you from writing such a
UDF. You can pass the bag of events to the UDF, and have the UDF append the
probability information to tuples in the bag and return the new bag. I am
assuming that the event probability relation is small enough to be stored in
memory.

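
To make the UDF idea concrete, here is a minimal sketch of the per-bag logic in plain Python rather than as an actual Pig UDF. It assumes each tuple in the events bag is (event, value) and that the probability relation has already been loaded into an in-memory lookup. The names EVENT_PROBS, append_probabilities and expected_value, and the sample numbers, are hypothetical and not part of Pig.

```python
# Sketch of the UDF logic only. A real Pig UDF would extend EvalFunc (Java)
# or be registered as a script UDF. All names below are hypothetical.

# Assumed in-memory copy of the small probability relation:
# event name -> probability that the event occurs.
EVENT_PROBS = {"signup": 0.5, "purchase": 0.25}


def append_probabilities(events, probs=EVENT_PROBS):
    """Return a new bag where each (event, value) tuple gains a probability."""
    # Events missing from the lookup get None, similar to a null in Pig.
    return [(event, value, probs.get(event)) for event, value in events]


def expected_value(events, probs=EVENT_PROBS):
    """Sum value * probability over events that have a known probability."""
    return sum(
        value * p
        for _event, value, p in append_probabilities(events, probs)
        if p is not None
    )


# One user's events bag, as a list of (event, value) tuples.
bag = [("signup", 10.0), ("purchase", 200.0)]
print(append_probabilities(bag))
print(expected_value(bag))
```

Because everything happens per input tuple, nothing here needs a shuffle, which is why this route can stay map-only.
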
> Am I misremembering something? Is there some other sort of clever
> trickery that I might be able to use to generate expected values if I'm
> not? (and if I am, is there something less hackish than a GROUP on a
> unique tuple element that I could use to load the desired values into a
> bag or tuple (or just plain pass the entire tuple to a UDF)?

Is this the alternative solution you are trying to avoid? Do a FOREACH with
FLATTEN on the events bag of the first relation, do a join (using 'replicated'
if the second relation is small enough), and then do a group-by on user (id).
This will not involve writing a UDF, but it will have an additional reduce
phase for the group-by. If you use a UDF that appends the information, it
will be a map-only job.
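
For comparison, the flatten / join / group-by route can be traced on toy data. This is a plain Python simulation of the three steps, not Pig Latin, and the relation layouts, names and numbers are all assumptions made for illustration.

```python
from collections import defaultdict

# Toy data. All names and values are hypothetical.
# users: (user_id, events bag), where the bag holds (event, value) tuples.
users = [
    ("alice", [("signup", 10.0), ("purchase", 200.0)]),
    ("bob", [("purchase", 40.0)]),
]
# The small relation: event -> probability. A replicated join keeps the
# small side in memory, which this dict stands in for.
event_probs = {"signup": 0.5, "purchase": 0.25}

# Step 1: flatten the events bag, giving one row per (user, event, value).
flat = [(user, event, value) for user, events in users for event, value in events]

# Step 2: inner join on the event name to attach the probability.
joined = [
    (user, event, value, event_probs[event])
    for user, event, value in flat
    if event in event_probs
]

# Step 3: group by user and sum value * probability within each group.
expected = defaultdict(float)
for user, _event, value, prob in joined:
    expected[user] += value * prob

print(dict(expected))
```

Step 3 is the part that costs the extra reduce phase in Pig, since rows for the same user have to be brought back together after the flatten.
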