Pig >> mail # user >> Synthetic keys

So, the example I gave before:

x = join a by 1, b by 1 using 'replicated';

does a replicated cross. It creates the synthetic keys implicitly, which is
great because the tuple it returns does not have the synthetic keys in it.
An explicit replicated cross would be good though, since the implementation
is probably pretty simple.
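
To illustrate why a join on a constant key behaves like a map-side cross, here is a rough Python model of the idea. It is only a sketch under assumptions: the function name replicated_join, the sample rows, and the tuple layout are made up for illustration, and this is not Pig's actual implementation. The small relation is hashed in memory, the large one is streamed past it, and because every row has the same key, every pairing matches. The key never appears in the output rows.

```python
from collections import defaultdict


def replicated_join(big, small, big_key, small_key):
    """Toy model of a replicated (map-side) join.

    The small relation is held in an in-memory hash table; the big
    relation is streamed one row at a time and never materialized.
    """
    table = defaultdict(list)
    for row in small:
        table[small_key(row)].append(row)
    for row in big:
        # .get avoids creating empty entries for keys with no match
        for match in table.get(big_key(row), []):
            # Output is the concatenation of the two rows; the join key
            # itself is never added to the result.
            yield row + match


# Hypothetical sample data standing in for relations a and b.
a = [("x", 1), ("y", 2), ("z", 3)]   # the "large" relation
b = [("p",), ("q",)]                 # the small, replicated relation

# Joining both sides on the constant 1 pairs every row of a with every row of b.
constant_key = lambda row: 1
crossed = list(replicated_join(a, b, constant_key, constant_key))
```

With 3 rows in a and 2 rows in b, crossed holds all 6 combinations, each carrying only the original fields.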

On 5/28/13 10:30 AM, "Pradeep Gollakota" <[EMAIL PROTECTED]> wrote:

>I ran into a similar problem where I had a relation (A) which was massive
>and another relation (B) which had exactly 1 record. I needed to do a
>product of these two relations, and the default implementation was very
>slow. I worked around it by generating a synthetic key myself and then doing
>a replicated join to cross the two relations. It looked something like the
>following:
>
>data1 = load 'data1'; -- billions of records
>data2 = load 'data2'; -- 1 record
>A = foreach data1 generate *, 1 as fake_key;
>B = foreach data2 generate *, 1 as fake_key;
>-- the large relation goes first; relations listed after it are loaded into memory
>C = join A by fake_key, B by fake_key using 'replicated';
>
>I looked around to see if Pig supported this out of the box, but I didn't
>find anything.
>Perhaps a replicated cross operator would be helpful for these types of
>cases.
>From the O'Reilly book, this is what is said about the cross operator:
>"Pig does implement cross in a parallel fashion. It does this by generating a
>synthetic join key, replicating rows, and then doing the cross as a join."
>Since the cross product operator is already being performed as join under
>the hood, I wonder how difficult it would be to support different join
>strategies for cross.
>On Fri, May 24, 2013 at 12:21 PM, Mehmet Tepedelenlioglu <
>> Thanks, but is there a map-side cross? The usual cross seems to have a
>> bug. I sent an example of how to replicate this bug.
>> On 5/24/13 9:15 AM, "Jonathan Coveney" <[EMAIL PROTECTED]> wrote:
>> >You can do this, but pig has a CROSS keyword that you can use.
>> >
>> >
>> >2013/5/23 Mehmet Tepedelenlioglu <[EMAIL PROTECTED]>
>> >
>> >> Hi,
>> >>
>> >> I am using this:
>> >>
>> >> x = join a by 1, b by 1  using 'replicated';
>> >>
>> >> with the hope that it generates some synthetic key '1' on both a and b
>> >> and joins them on that key, thereby, in this case, doing a clean
>> >> map-side cross of a and b with no schema changes (exactly the way a
>> >> cross would work). It seems to be working, but since I just tried it and
>> >> it worked, I am not sure if there is anything in there I should be aware
>> >> of. Does anyone know?
>> >>
>> >> Thanks,
>> >>
>> >> Mehmet
>> >>
>> >>
>> >>