Pig, mail # user - Synthetic keys


Re: Synthetic keys
Pradeep Gollakota 2013-05-28, 17:30
I ran into a similar problem where I had a relation (A) which was massive
and another relation (B) which had exactly 1 record. I needed to do a cross
product of these two relations, and the default implementation was very
slow. I worked around it by adding a synthetic key myself and then using
a replicated join to cross the two relations. It looked something like the
following:

data1 = load 'data1'; -- billions of records
data2 = load 'data2'; -- 1 record
A = foreach data1 generate *, 1 as fake_key;
B = foreach data2 generate *, 1 as fake_key;
-- with 'replicated', the small relation goes last; it is the one held in memory
C = join A by fake_key, B by fake_key using 'replicated';

I looked around to see if Pig supported this out of the box, but I didn't
find anything.
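
As a rough illustration of why the constant-key trick behaves like a cross,
here is a small Python sketch of the idea. It is not Pig's implementation;
the function name `replicated_join` and the sample rows are made up for the
example. A replicated join keeps the small relation in memory, indexed by
join key, and streams the large relation past it on the map side. When every
row carries the same constant key, every streamed row matches every
in-memory row, which is exactly a cross product.

```python
from collections import defaultdict

def replicated_join(big_rows, small_rows, key_of_big, key_of_small):
    """Map-side hash join sketch: small side in memory, big side streamed."""
    in_memory = defaultdict(list)
    for s in small_rows:
        in_memory[key_of_small(s)].append(s)
    for b in big_rows:  # streamed one row at a time, never fully materialized
        for s in in_memory.get(key_of_big(b), []):
            yield b + s

FAKE_KEY = 1  # the same synthetic key on both sides (hypothetical constant)
a_rows = [("a1",), ("a2",), ("a3",)]  # stands in for the huge relation
b_rows = [("b1",)]                    # stands in for the one-record relation

crossed = list(
    replicated_join(a_rows, b_rows, lambda row: FAKE_KEY, lambda row: FAKE_KEY)
)
print(crossed)
```

The memory cost depends only on the small side, which is why this approach
suits a one-record relation crossed with a very large one.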

Perhaps a replicated cross operator would be helpful for these types of
problems.

From the O'Reilly book, this is what is said about the cross operator: "Pig
does implement cross in a parallel fashion. It does this by generating a
synthetic join key, replicating rows, and then doing the cross as a join."
Since the cross product operator is already being performed as a join under
the hood, I wonder how difficult it would be to support different join
strategies for cross.
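
The scheme the book describes can be sketched in Python as follows. This is
a simplified illustration of one way "synthetic key plus replicated rows"
can work, not Pig's actual code; the grid size, the function name
`parallel_cross`, and the round-robin key assignment are all assumptions
made for the example. Each left row is pinned to one grid row and copied to
every grid column; each right row is pinned to one grid column and copied to
every grid row. Any left/right pair then shares exactly one cell key, so a
join on the cell key emits each pair once, and the cells can be processed
independently.

```python
from collections import defaultdict
from itertools import product

def parallel_cross(left, right, grid=2):
    """Cross product computed as a join on synthetic (row, col) cell keys."""
    cells = defaultdict(lambda: ([], []))
    for i, left_row in enumerate(left):
        grid_row = i % grid               # synthetic key part owned by this row
        for grid_col in range(grid):      # replicate across every column
            cells[(grid_row, grid_col)][0].append(left_row)
    for j, right_row in enumerate(right):
        grid_col = j % grid               # synthetic key part owned by this row
        for grid_row in range(grid):      # replicate across every row
            cells[(grid_row, grid_col)][1].append(right_row)
    out = []
    for key in sorted(cells):             # each cell could go to its own reducer
        lefts, rights = cells[key]
        out.extend(l + r for l, r in product(lefts, rights))
    return out

left = [("a1",), ("a2",), ("a3",)]
right = [("b1",), ("b2",)]
result = parallel_cross(left, right)
print(result)
```

The replication is the price of parallelism here: every input row is copied
`grid` times, which is the overhead a map-side (replicated) strategy would
avoid when one input is tiny.
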

On Fri, May 24, 2013 at 12:21 PM, Mehmet Tepedelenlioglu <[EMAIL PROTECTED]> wrote:

> Thanks, but is there a map-side cross? The usual cross seems to have a
> bug. I sent an example of how to replicate this bug.
>
> On 5/24/13 9:15 AM, "Jonathan Coveney" <[EMAIL PROTECTED]> wrote:
>
> >You can do this, but pig has a CROSS keyword that you can use.
> >
> >
> >2013/5/23 Mehmet Tepedelenlioglu <[EMAIL PROTECTED]>
> >
> >> Hi,
> >>
> >> I am using this:
> >>
> >> x = join a by 1, b by 1 using 'replicated';
> >>
> >> with the hope that it generates some synthetic key '1' on both a and b and
> >> joins it on that key, thereby, in this case, doing a clean map side cross
> >> of a and b with no schema changes (exactly the way a cross would work). It
> >> seems to be working, but since I just tried it and it worked, I am not sure
> >> if there is anything in there I should be aware of. Does anyone know?
> >>
> >> Thanks,
> >>
> >> Mehmet