Pig >> mail # user >> Joining inner and outer bags
Re: Joining inner and outer bags
On Fri, Jan 07, 2011 at 10:44:03AM -0800, Thejas M Nair wrote:
> On 1/7/11 9:20 AM, "Kris Coward" <[EMAIL PROTECTED]> wrote:
> > I've got an outer bag/relation consisting of a bunch of user information,
> > one of the pieces of which is an inner bag of possible events for that
> > user, and the value of those events, should they occur. Outside the bag,
> > there are also a few data concerning whether specific events have
> > already occurred.
> >
> > In another relation, I have the assortment of events grouped with the
> > probability that any of them will occur.
> >
> > I'd like to generate expected values for each user, but know that I
> > can't JOIN within a FOREACH block (or do a nested FOREACH). For a UDF,
> > I vaguely recall some sort of constraint on nesting inner bags that
> > would interfere with my ability to bundle the possible events bag with
> > the actual events data into a single object that could be passed to a
> > UDF that extends EvalFunc.
> I can't think of any limitations that would prevent you from writing such a
> UDF.
> You can pass the bag of events to the UDF, and have the UDF append the
> probability information to the tuples in the bag and return the new bag. I am
> assuming that the event probability relation is small enough to be stored in
> memory.
> > Am I misremembering something? Is there some other sort of clever
> > trickery that I might be able to use to generate expected values if I'm
> > not? (And if I am, is there something less hackish than a GROUP on a
> > unique tuple element that I could use to load the desired values into a
> > bag or tuple, or just plain pass the entire tuple to a UDF?)
> Is this the alternative solution you are trying to avoid? Do a FOREACH with
> FLATTEN on the events bag of the first relation, do a JOIN (using 'replicated'
> if the second relation is small enough), and then do a GROUP BY on user (id).
> This will not involve writing a UDF, but it will have an additional reduce
> phase for the GROUP BY. If you use a UDF that appends the information, it
> will be a map-only job.

I'm not trying to avoid that solution at all. FLATTENing the events bag
and then reGROUPing it after the JOIN seems like it's probably the solution
I was looking for. (The bag had been ORDERed before, and some information
was present in the ordering, but I can separate that information out so
that it survives FLATTEN.)
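
For anyone following along, here is a rough sketch of that flatten, join, and regroup flow. It is written in plain Python rather than Pig Latin, purely to illustrate the shape of the data at each step. The sample data, the field names, and the expected_values function are all made up for illustration and do not come from the thread. The position field stands in for whatever information the ORDER had been carrying, stored explicitly so it is not lost when the bag is flattened.

```python
from collections import defaultdict

# Hypothetical sample data: (user_id, bag of (event, value) tuples).
users = [
    ("alice", [("signup", 10.0), ("purchase", 50.0)]),
    ("bob", [("purchase", 40.0)]),
]

# Hypothetical small relation of event probabilities, held in memory
# the way a replicated join would hold the smaller input.
event_probability = {"signup": 0.5, "purchase": 0.1}


def expected_values(users, event_probability):
    """Return {user_id: expected value} via flatten, join, then group."""
    # Step 1 (FLATTEN): one row per (user, event). The bag position is
    # kept as an explicit field so ordering information survives.
    flattened = [
        (user_id, position, event, value)
        for user_id, events in users
        for position, (event, value) in enumerate(events)
    ]

    # Step 2 (JOIN): attach the probability to each row. Like an inner
    # join, rows whose event has no probability are dropped.
    joined = [
        (user_id, position, event, value, event_probability[event])
        for user_id, position, event, value in flattened
        if event in event_probability
    ]

    # Step 3 (GROUP BY user): sum value * probability per user.
    totals = defaultdict(float)
    for user_id, _position, _event, value, probability in joined:
        totals[user_id] += value * probability
    return dict(totals)


if __name__ == "__main__":
    for user_id, total in sorted(expected_values(users, event_probability).items()):
        print(user_id, total)
```

One thing the sketch makes visible: because step 2 behaves like an inner join, a user whose events all lack a probability drops out of the result entirely, so an outer join would be needed if such users must still appear.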


Kris Coward http://unripe.melon.org/
GPG Fingerprint: 2BF3 957D 310A FEEC 4733  830E 21A4 05C7 1FEB 12B3