Pig, mail # user - Cross Product of Two Tuples?


Eli Finkelshteyn 2012-04-04, 18:18
Herbert Mühlburger 2012-04-04, 18:24
Prashant Kommireddi 2012-04-04, 18:24
Eli Finkelshteyn 2012-04-04, 18:40
Jonathan Coveney 2012-04-04, 18:43
Re: Cross Product of Two Tuples?
Eli Finkelshteyn 2012-04-04, 21:37
Nah, that doesn't work, because TOBAG doubles up the tuple, so that:

TOBAG(('hello', 'howdy', 'hi'))
returns
{(('hello', 'howdy', 'hi'))}

And so,

FLATTEN(TOBAG(t1)), FLATTEN(TOBAG(t2))
gets me
('hello', 'howdy', 'hi'), ('hola', 'bonjour')

which is just what I started with.
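
To make the doubling-up concrete, here is a rough plain-Python model of the behavior described above. It is only a sketch, not Pig's implementation: `to_bag` and `flatten_pair` are hypothetical stand-ins, a bag is modeled as a list of tuples, and `flatten_pair` assumes that flattening two bags in one GENERATE yields one row per pair of tuples.

```python
from itertools import product

def to_bag(*args):
    # Hypothetical model of TOBAG: each argument becomes one tuple in the bag,
    # so a tuple argument ends up nested inside a new one-field tuple.
    return [(arg,) for arg in args]

def flatten_pair(bag1, bag2):
    # Hypothetical model of GENERATE FLATTEN(bag1), FLATTEN(bag2):
    # one output row per pair of tuples, with their fields concatenated.
    return [left + right for left, right in product(bag1, bag2)]

t1 = ('hello', 'howdy', 'hi')
t2 = ('hola', 'bonjour')

bag1 = to_bag(t1)
rows = flatten_pair(to_bag(t1), to_bag(t2))

print(bag1)  # a bag holding a single tuple whose only field is t1
print(rows)  # a single row whose two fields are t1 and t2
```

Each bag holds exactly one tuple, so the pairing produces one row containing the two original tuples rather than six rows.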

Anyway, to solve this problem, I wrote a quick Python UDF that makes a
bag from a tuple without doubling up the tuple, and then ran FLATTEN on
that, which looks like:

bagged = FOREACH split_set GENERATE FLATTEN(py_udfs.tupleToBag(t1)),
FLATTEN(py_udfs.tupleToBag(t2));

where the Python UDF I'm using is:

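@outputSchema("b:bag{}")
def tupleToBag(tup):
    # Wrap each field of the tuple in its own one-field tuple, so the
    # bag holds one tuple per field instead of the whole tuple at once.
    return [tupify(i) for i in tupify(tup)]

def tupify(tup):
    # Return tuples unchanged; wrap any other value in a one-field tuple.
    if isinstance(tup, tuple):
        return tup
    return (tup,)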

I'll add that into Python PiggyBank as soon as I get a chance to finish
that stuff up.
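
For anyone who wants to try the UDF logic outside Pig, the following is a minimal sketch in plain Python. The no-op `outputSchema` decorator and the `flatten_two_bags` helper are assumptions added for local testing only: the first stands in for the decorator Pig provides, and the second models GENERATE FLATTEN(...), FLATTEN(...) as one row per pair of tuples.

```python
from itertools import product

def outputSchema(schema):
    # Stand-in for Pig's decorator (assumption for local testing only):
    # it ignores the schema string and returns the function unchanged.
    def decorator(func):
        return func
    return decorator

def tupify(value):
    # Return tuples unchanged; wrap any other value in a one-field tuple.
    if isinstance(value, tuple):
        return value
    return (value,)

@outputSchema("b:bag{}")
def tupleToBag(tup):
    # One tuple per field of the input, modeled here as a list of tuples.
    return [tupify(i) for i in tupify(tup)]

def flatten_two_bags(bag1, bag2):
    # Hypothetical model of flattening two bags in one GENERATE:
    # every tuple of bag1 is paired with every tuple of bag2.
    return [left + right for left, right in product(bag1, bag2)]

t1 = ('hello', 'howdy', 'hi')
t2 = ('hola', 'bonjour')

rows = flatten_two_bags(tupleToBag(t1), tupleToBag(t2))
for row in rows:
    print(row)
# ('hello', 'hola')
# ('hello', 'bonjour')
# ('howdy', 'hola')
# ('howdy', 'bonjour')
# ('hi', 'hola')
# ('hi', 'bonjour')
```

Under these assumptions the six rows match the cross product asked for in the original question.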

Eli
On 4/4/12 2:43 PM, Jonathan Coveney wrote:
> FLATTEN(TOBAG(t1)), FLATTEN(TOBAG(t2)) should give you the cross
>
> 2012/4/4 Eli Finkelshteyn<[EMAIL PROTECTED]>
>
>> That's for a relation only. Unless I'm missing something, it does not work
>> for tuples. What I'm doing would require a FOREACH, I'm thinking.
>>
>> Eli
>>
>>
>> On 4/4/12 2:24 PM, Prashant Kommireddi wrote:
>>
>>> http://pig.apache.org/docs/r0.9.1/basic.html#cross
>>>
>>> -Prashant
>>>
>>> On Wed, Apr 4, 2012 at 11:18 AM, Eli Finkelshteyn <[EMAIL PROTECTED]> wrote:
>>>> Hi Folks,
>>>> I'm currently trying to do something I figured would be trivial, but
>>>> actually wound up being a bit of work for me, so I'm wondering if I'm
>>>> missing something. All I want to do is get a cross product of two tuples.
>>>> So for example, given an input of:
>>>>
>>>> ('hello', 'howdy', 'hi'), ('hola', 'bonjour')
>>>>
>>>> I'd get:
>>>>
>>>> ('hello', 'hola')
>>>> ('hello', 'bonjour')
>>>> ('howdy', 'hola')
>>>> ('howdy', 'bonjour')
>>>> ('hi', 'hola')
>>>> ('hi', 'bonjour')
>>>>
>>>> At first, I figured I could FLATTEN(TOBAG(tuple1, tuple2)), but that's no
>>>> good because the tuples are first themselves put into new tuples. So, what
>>>> I'm left with now is writing a dirty and slow Python UDF for this. Is there
>>>> really no better way to do this? I'd think it would be a pretty standard
>>>> task.
>>>>
>>>> Eli
>>>>
>>>>
Scott Carey 2012-04-05, 17:04
Jonathan Coveney 2012-04-05, 18:25
Scott Carey 2012-04-05, 20:35
Jonathan Coveney 2012-04-05, 23:41
Scott Carey 2012-04-06, 01:23
Eli Finkelshteyn 2012-04-07, 23:27
Jonathan Coveney 2012-04-06, 06:45