I ran into a similar problem where I had a relation (A) which was massive
and another relation (B) which had exactly 1 record. I needed to do a cross
product of these two relations, and the default implementation was very
slow. I worked around it by generating a synthetic key myself and then using
a replicated join to cross the two relations. It looked something like this:
data1 = load 'data1'; -- billions of records
data2 = load 'data2'; -- 1 record
A = foreach data1 generate *, 1 as fake_key;
B = foreach data2 generate *, 1 as fake_key;
-- replicated join: the large relation goes first; the last one is held in memory
C = join A by fake_key, B by fake_key using 'replicated';
I looked around to see if Pig supported this out of the box, but I didn't
find anything. Perhaps a replicated cross operator would be helpful for
these types of cases.
From the O'Reilly book, this is what is said about the cross operator: "Pig
does implement cross in a parallel fashion. It does this by generating a
synthetic join key, replicating rows, and then doing the cross as a join."
Since the cross product operator is already being performed as a join under
the hood, I wonder how difficult it would be to support different join
strategies for cross.
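
To make the idea concrete, here is a rough sketch in plain Python (not Pig, and
not how Pig implements anything internally) of what I would expect a replicated
cross to do: hold the small relation in memory and stream the large one past
it, so no shuffle is needed. The function name replicated_cross and the sample
rows are made up for illustration.

```python
from typing import Iterable, Iterator


def replicated_cross(big: Iterable[tuple], small: Iterable[tuple]) -> Iterator[tuple]:
    """Pair every row of `big` with every row of `small`.

    `small` is materialized once in memory (the "replicated" side),
    while `big` is consumed as a stream, one row at a time.
    """
    small_rows = list(small)  # assumption: the small side fits in memory
    for big_row in big:
        for small_row in small_rows:
            # Concatenate the fields; no synthetic key appears in the output.
            yield big_row + small_row


# Hypothetical sample data: a "large" relation and a one-record relation.
big = [(1, "a"), (2, "b"), (3, "c")]
small = [("x",)]
result = list(replicated_cross(big, small))
```

With a one-record small relation, the output has exactly one row per row of
the large relation, which is the case I ran into.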
On Fri, May 24, 2013 at 12:21 PM, Mehmet Tepedelenlioglu <
[EMAIL PROTECTED]> wrote:
> Thanks, but is there a map-side cross? The usual cross seems to have a
> bug. I sent an example of how to replicate this bug.
> On 5/24/13 9:15 AM, "Jonathan Coveney" <[EMAIL PROTECTED]> wrote:
> >You can do this, but pig has a CROSS keyword that you can use.
> >2013/5/23 Mehmet Tepedelenlioglu <[EMAIL PROTECTED]>
> >> Hi,
> >> I am using this:
> >> x = join a by 1, b by 1 using 'replicated';
> >> with the hope that it generates some synthetic key '1' on both a and b and
> >> joins it on that key, thereby, in this case, doing a clean map-side
> >> cross of a and b with no schema changes (exactly the way a cross would
> >> work). It seems to be working, but since I just tried it and it worked,
> >> I am not sure if there is anything in there I should be aware of. Does
> >> anyone know?
> >> Thanks,
> >> Mehmet