Pig >> mail # user >> Synthetic keys


Re: Synthetic keys
I ran into a similar problem where I had a relation (A) which was massive
and another relation (B) which had exactly 1 record. I needed to do a cross
product of these two relations, and the default implementation was very
slow. I worked around it by generating a synthetic key myself and then used
a replicated join to cross the two relations. It looked something like the
following:

data1 = load 'data1'; -- billions of records
data2 = load 'data2'; -- 1 record
A = foreach data1 generate *, 1 as fake_key;
B = foreach data2 generate *, 1 as fake_key;
-- replicated join: the large relation goes first; the relation listed last is held in memory
C = join A by fake_key, B by fake_key using 'replicated';

I looked around to see if Pig supported this out of the box, but I didn't
find anything.

Perhaps a replicated cross operator would be helpful for these types of
problems.

From the O'Reilly book, this is what is said about the cross operator: "Pig
does implement cross in a parallel fashion. It does this by generating a
synthetic join key, replicating rows, and then doing the cross as a join."
Since the cross product operator is already being performed as a join under
the hood, I wonder how difficult it would be to support different join
strategies for cross.
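
To make the comparison with cross concrete, here is a rough sketch of the synthetic-key workaround that also projects the key away, so the result carries only the columns a cross would produce. The file names come from the example earlier in this message. The schemas, the field names (id, value, threshold) and the aliases (big, small, big_k, small_k, joined, crossed) are assumptions made up for illustration, not anything from the original data.

```pig
-- Sketch only: schemas and aliases below are hypothetical.
big   = load 'data1' as (id:long, value:chararray);  -- large relation (assumed schema)
small = load 'data2' as (threshold:double);          -- single-record relation (assumed schema)

-- Attach the same constant key to every row on both sides.
big_k   = foreach big generate *, 1 as fake_key;
small_k = foreach small generate *, 1 as fake_key;

-- Replicated join: the large relation is listed first, and the
-- relation listed last is the one that must fit in memory.
joined = join big_k by fake_key, small_k by fake_key using 'replicated';

-- Drop both copies of the synthetic key to get the columns of a cross.
crossed = foreach joined generate
    big_k::id           as id,
    big_k::value        as value,
    small_k::threshold  as threshold;
```

Since every row on both sides has the same key, each row of the large relation pairs with every row of the small one, which is what a cross produces. Whether this beats the built-in cross on a given cluster is something to measure rather than assume.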

On Fri, May 24, 2013 at 12:21 PM, Mehmet Tepedelenlioglu <[EMAIL PROTECTED]> wrote:

> Thanks, but is there a map-side cross? The usual cross seems to have a
> bug. I sent an example of how to replicate this bug.
>
> On 5/24/13 9:15 AM, "Jonathan Coveney" <[EMAIL PROTECTED]> wrote:
>
> >You can do this, but pig has a CROSS keyword that you can use.
> >
> >
> >2013/5/23 Mehmet Tepedelenlioglu <[EMAIL PROTECTED]>
> >
> >> Hi,
> >>
> >> I am using this:
> >>
> >> x = join a by 1, b by 1 using 'replicated';
> >>
> >> with the hope that it generates some synthetic key '1' on both a and b and
> >> joins it on that key, thereby, in this case, doing a clean map side cross
> >> of a and b with no schema changes (exactly the way a cross would work). It
> >> seems to be working, but since I just tried it and it worked, I am not sure
> >> if there is anything in there I should be aware of. Does anyone know?
> >>
> >> Thanks,
> >>
> >> Mehmet
> >>
> >>
> >>
>
>
>