Pig, mail # dev - Re: Cross Product of Two Tuples?


Re: Cross Product of Two Tuples?
Gianmarco De Francisci Mo... 2012-04-05, 09:27
I would say the additional nesting level is a bug.
But we should check if we break stuff with this change.

Cheers,
--
Gianmarco

On Thu, Apr 5, 2012 at 01:36, Jonathan Coveney <[EMAIL PROTECTED]> wrote:

> Pig folks: it seems to defy expectations if TOBAG is run on a single
> TUPLE and you don't get a bag. I can patch it, but does that seem like a
> fair change?
>
> 2012/4/4 Eli Finkelshteyn <[EMAIL PROTECTED]>
>
> > Nah, doesn't work because it doubles up the tuple, so that:
> >
> > TOBAG(('hello', 'howdy', 'hi'))
> > returns
> > {(('hello', 'howdy', 'hi'))}
> >
> > And so,
> >
> > FLATTEN(TOBAG(t1)), FLATTEN(TOBAG(t2))
> > gets me
> >
> > ('hello', 'howdy', 'hi'), ('hola', 'bonjour')
> >
> > which is just what I started with.
> >
> > Anyway, to solve this problem, I wrote a quick Python UDF that makes a
> > bag from a tuple without doubling up the tuple, and then ran FLATTEN
> > on that, which looks like:
> >
> > bagged = FOREACH split_set GENERATE FLATTEN(py_udfs.tupleToBag(t1)),
> > FLATTEN(py_udfs.tupleToBag(t2));
> >
> > Where the Python udf I'm using is:
> >
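> > @outputSchema("b:bag{}")
> > def tupleToBag(tup):
> >     # Wrap each field in its own tuple so FLATTEN yields one row per field.
> >     b = [tupify(i) for i in tupify(tup)]
> >     return b
> >
> > def tupify(tup):
> >     # Pass tuples through unchanged; wrap any other value in a 1-tuple.
> >     if isinstance(tup, tuple):
> >         return tup
> >     return (tup,)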
> >
> > I'll add that into Python PiggyBank as soon as I get a chance to finish
> > that stuff up.
> >
> > Eli
> >
> >
> >
> > On 4/4/12 2:43 PM, Jonathan Coveney wrote:
> >
> >> FLATTEN(TOBAG(t1)), FLATTEN(TOBAG(t2)) should give you the cross
> >>
> >> 2012/4/4 Eli Finkelshteyn <[EMAIL PROTECTED]>
> >>
> >>> That's for a relation only. Unless I'm missing something, it does not
> >>> work for tuples. What I'm doing would require a FOREACH, I'm thinking.
> >>>
> >>> Eli
> >>>
> >>>
> >>> On 4/4/12 2:24 PM, Prashant Kommireddi wrote:
> >>>
> >>>> http://pig.apache.org/docs/r0.9.1/basic.html#cross
> >>>>
> >>>> -Prashant
> >>>>
> >>>> On Wed, Apr 4, 2012 at 11:18 AM, Eli Finkelshteyn <[EMAIL PROTECTED]> wrote:
> >>>>
> >>>>> Hi Folks,
> >>>>>
> >>>>> I'm currently trying to do something I figured would be trivial, but
> >>>>> actually wound up being a bit of work for me, so I'm wondering if I'm
> >>>>> missing something. All I want to do is get a cross product of two
> >>>>> tuples. So for example, given an input of:
> >>>>>
> >>>>> ('hello', 'howdy', 'hi'), ('hola', 'bonjour')
> >>>>>
> >>>>> I'd get:
> >>>>>
> >>>>> ('hello', 'hola')
> >>>>> ('hello', 'bonjour')
> >>>>> ('howdy', 'hola')
> >>>>> ('howdy', 'bonjour')
> >>>>> ('hi', 'hola')
> >>>>> ('hi', 'bonjour')
> >>>>>
> >>>>> At first, I figured I could FLATTEN(TOBAG(tuple1, tuple2)), but that's
> >>>>> no good because the tuples are first themselves put into new tuples. So
> >>>>> what I'm left with now is writing a dirty and slow Python UDF for this.
> >>>>> Is there really no better way to do this? I'd think it would be a pretty
> >>>>> standard task.
> >>>>>
> >>>>> Eli
> >>>>>
> >>>>>
> >>>>>
> >
>
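
For anyone who wants to try the approach from this thread outside Pig, here is a minimal plain-Python sketch. It is only an illustration under stated assumptions, not Pig's implementation: a list stands in for a bag, itertools.product stands in for what two FLATTEN calls in a single GENERATE do, and the names tuple_to_bag and cross_fields are hypothetical. The @outputSchema decorator from the UDF above is left out so the code runs without Pig.

```python
from itertools import product


def tupify(value):
    """Return value unchanged if it is already a tuple, else wrap it in a 1-tuple."""
    if isinstance(value, tuple):
        return value
    return (value,)


def tuple_to_bag(tup):
    """Turn each field of tup into its own tuple; a list stands in for a Pig bag."""
    return [tupify(field) for field in tupify(tup)]


def cross_fields(t1, t2):
    """Pair every row of the first bag with every row of the second (assumed FLATTEN behavior)."""
    return [left + right for left, right in product(tuple_to_bag(t1), tuple_to_bag(t2))]


rows = cross_fields(('hello', 'howdy', 'hi'), ('hola', 'bonjour'))
for row in rows:
    print(row)
```

With the example input from the original question, this should print the six pairs listed there, starting with ('hello', 'hola').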