Pig >> mail # user >> Cross Product of Two Tuples?


Re: Cross Product of Two Tuples?
Nah, doesn't work because it doubles up the tuple, so that:

TOBAG(('hello', 'howdy', 'hi'))
returns
{(('hello', 'howdy', 'hi'))}

And so,

FLATTEN(TOBAG(t1)), FLATTEN(TOBAG(t2))
gets me
('hello', 'howdy', 'hi'), ('hola', 'bonjour')

which is just what I started with.

Anyway, to solve this problem, I wrote a quick Python UDF that makes a
bag from a tuple without doubling up the tuple, and then ran FLATTEN on
the result, which looks like:

bagged = FOREACH split_set GENERATE FLATTEN(py_udfs.tupleToBag(t1)),
FLATTEN(py_udfs.tupleToBag(t2));

Where the Python udf I'm using is:

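@outputSchema("b:bag{}")
def tupleToBag(tup):
    # One bag row per element of the input tuple; the returned list of
    # tuples is what Pig receives as the bag.
    return [tupify(i) for i in tupify(tup)]

def tupify(tup):
    # Leave tuples as they are; wrap any other value in a one-field tuple.
    if isinstance(tup, tuple):
        return tup
    return (tup,)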
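
For anyone who wants to sanity-check the idea outside Pig, here is a minimal plain-Python sketch. It assumes that flattening two bags in the same GENERATE pairs every row of one bag with every row of the other, which is what the FOREACH above relies on. The names `tuple_to_bag` and `cross_of_tuples` are hypothetical helpers for illustration only, not part of Pig or PiggyBank, and `itertools.product` stands in for the two FLATTEN calls.

```python
from itertools import product


def tupify(value):
    """Return value unchanged if it is a tuple, else wrap it in a 1-tuple."""
    if isinstance(value, tuple):
        return value
    return (value,)


def tuple_to_bag(tup):
    """Model of the UDF: one single-field row per element of the tuple."""
    return [tupify(item) for item in tupify(tup)]


def cross_of_tuples(t1, t2):
    """Hypothetical helper: pair every row of one bag with every row of the other."""
    bag1 = tuple_to_bag(t1)
    bag2 = tuple_to_bag(t2)
    # Concatenating the two row tuples mimics the fields produced by
    # FLATTEN(bag1), FLATTEN(bag2) in a single GENERATE.
    return [left + right for left, right in product(bag1, bag2)]


rows = cross_of_tuples(('hello', 'howdy', 'hi'), ('hola', 'bonjour'))
for row in rows:
    print(row)
```

With the example input from the original question, this yields the six pairs listed there, starting with ('hello', 'hola') and ending with ('hi', 'bonjour').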

I'll add that into Python PiggyBank as soon as I get a chance to finish
that stuff up.

Eli
On 4/4/12 2:43 PM, Jonathan Coveney wrote:
> FLATTEN(TOBAG(t1)), FLATTEN(TOBAG(t2)) should give you the cross
>
> 2012/4/4 Eli Finkelshteyn<[EMAIL PROTECTED]>
>
>> That's for a relation only. Unless I'm missing something, it does not work
>> for tuples. What I'm doing would require a FOREACH, I'm thinking.
>>
>> Eli
>>
>>
>> On 4/4/12 2:24 PM, Prashant Kommireddi wrote:
>>
>>> http://pig.apache.org/docs/r0.9.1/basic.html#cross
>>>
>>> -Prashant
>>>
>>> On Wed, Apr 4, 2012 at 11:18 AM, Eli Finkelshteyn <[EMAIL PROTECTED]> wrote:
>>>
>>>> Hi Folks,
>>>> I'm currently trying to do something I figured would be trivial, but
>>>> actually wound up being a bit of work for me, so I'm wondering if I'm
>>>> missing something. All I want to do is get a cross product of two tuples.
>>>> So for example, given an input of:
>>>>
>>>> ('hello', 'howdy', 'hi'), ('hola', 'bonjour')
>>>>
>>>> I'd get:
>>>>
>>>> ('hello', 'hola')
>>>> ('hello', 'bonjour')
>>>> ('howdy', 'hola')
>>>> ('howdy', 'bonjour')
>>>> ('hi', 'hola')
>>>> ('hi', 'bonjour')
>>>>
>>>> At first, I figured I could FLATTEN(TOBAG(tuple1, tuple2)), but that's no
>>>> good because the tuples are first themselves put into new tuples. So, what
>>>> I'm left with now is writing a dirty and slow Python UDF for this. Is
>>>> there really no better way to do this? I'd think it would be a pretty
>>>> standard task.
>>>>
>>>> Eli
>>>>
>>>>