Pig, mail # user - Synthetic keys


Re: Synthetic keys
Mehmet Tepedelenlioglu 2013-05-28, 18:18
I'm using Pig 0.10.0-cdh4.1.2.

On 5/28/13 11:07 AM, "Pradeep Gollakota" <[EMAIL PROTECTED]> wrote:

>Oh I see... I don't remember if I tried to do it your way or not. I'm using
>the CDH3 version (0.8.1) of Pig. I'm not sure whether explicit literals in
>joins are supported in that version. I'll give it a shot, since it
>will simplify my script.
>What version of Pig are you using?
>
>
>On Tue, May 28, 2013 at 2:04 PM, Mehmet Tepedelenlioglu <
>[EMAIL PROTECTED]> wrote:
>
>> So, the example I gave before:
>>   x = join a by 1, b by 1 using 'replicated';
>> does a replicated cross, and it creates the synthetic keys implicitly,
>> which is great because the tuple it returns does not have the
>> synthetic keys in it. An explicit replicated cross would be good though,
>> since the implementation is probably pretty simple.
>>
>>
>> On 5/28/13 10:30 AM, "Pradeep Gollakota" <[EMAIL PROTECTED]> wrote:
>>
>> >I ran into a similar problem where I had a relation (A) which was massive
>> >and another relation (B) which had exactly 1 record. I needed to do a
>> >cross product of these two relations, and the default implementation was
>> >very slow. I worked around it by generating a synthetic key myself and then
>> >used a replicated join to cross the two relations. It looked something like
>> >the following:
>> >
>> >data1 = load 'data1'; -- billions of records
>> >data2 = load 'data2'; -- 1 record
>> >A = foreach data1 generate *, 1 as fake_key;
>> >B = foreach data2 generate *, 1 as fake_key;
>> >-- large relation first; the relation listed after it is loaded into memory
>> >C = join A by fake_key, B by fake_key using 'replicated';
>> >
>> >I looked around to see if Pig supported this out of the box, but I didn't
>> >find anything.
>> >
>> >Perhaps a replicated cross operator would be helpful for these types of
>> >problems.
>> >From the O'Reilly book, this is what is said about the cross operator:
>> >"Pig does implement cross in a parallel fashion. It does this by
>> >generating a synthetic join key, replicating rows, and then doing the
>> >cross as a join."
>> >Since the cross product is already performed as a join under the hood,
>> >I wonder how difficult it would be to support different join
>> >strategies for cross.
>> >
>> >
>> >On Fri, May 24, 2013 at 12:21 PM, Mehmet Tepedelenlioglu <
>> >[EMAIL PROTECTED]> wrote:
>> >
>> >> Thanks, but is there a map-side cross? The usual cross seems to have a
>> >> bug. I sent an example of how to replicate this bug.
>> >>
>> >> On 5/24/13 9:15 AM, "Jonathan Coveney" <[EMAIL PROTECTED]> wrote:
>> >>
>> >> >You can do this, but Pig has a CROSS keyword that you can use.
>> >> >
>> >> >
>> >> >2013/5/23 Mehmet Tepedelenlioglu <[EMAIL PROTECTED]>
>> >> >
>> >> >> Hi,
>> >> >>
>> >> >> I am using this:
>> >> >>
>> >> >> x = join a by 1, b by 1  using 'replicated';
>> >> >>
>> >> >> with the hope that it generates a synthetic key '1' on both a and b
>> >> >> and joins them on that key, thereby doing a clean map-side cross of
>> >> >> a and b with no schema changes (exactly the way a cross would work).
>> >> >> It seems to be working, but since I only just tried it, I am not
>> >> >> sure whether there is anything in there I should be aware of. Does
>> >> >> anyone know?
>> >> >>
>> >> >> Thanks,
>> >> >>
>> >> >> Mehmet
>> >> >>
>> >> >>
>> >> >>
>> >>
>> >>
>> >>
>>
>>
>>
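
For reference, the two approaches discussed in this thread can be sketched side by side. This is only an illustration, not a tested recipe: the input paths, field names, and schemas below are hypothetical, and the literal-key form is only reported in this thread to work on 0.10.0-cdh4.1.2 (whether 0.8.1 accepts it is left open above). The sketch assumes the usual replicated-join convention that the large relation is listed first and the relations after it are held in memory.

```pig
-- Hypothetical inputs and schemas, for illustration only.
big   = LOAD 'data1' AS (id:long, payload:chararray);  -- large relation
small = LOAD 'data2' AS (threshold:double);            -- small enough to fit in memory

-- Variant 1: constant join key, as in the original post.
-- Large relation first; 'small' is the one replicated to each map task.
crossed = JOIN big BY 1, small BY 1 USING 'replicated';

-- Variant 2: explicit synthetic key, for versions that reject a literal key.
big_k   = FOREACH big   GENERATE *, 1 AS fake_key;
small_k = FOREACH small GENERATE *, 1 AS fake_key;
joined  = JOIN big_k BY fake_key, small_k BY fake_key USING 'replicated';

-- Project the synthetic key columns away again.
crossed2 = FOREACH joined GENERATE big_k::id, big_k::payload, small_k::threshold;

-- Compare the resulting schemas.
DESCRIBE crossed;
DESCRIBE crossed2;
```

Variant 1 is shorter and, as noted in the thread, returns tuples without any synthetic key column. Variant 2 needs the extra projection but relies only on constructs already shown in the workaround quoted above.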