Pig >> mail # user >> Merge Join Conditions


Re: Merge Join Conditions
Hi Alex,

Yeah, it's possible to do that. I wrote a patch for it on
https://issues.apache.org/jira/browse/PIG-959, but I didn't get a chance
to do much testing on it. By now it probably won't apply cleanly. But if
you want it, I can bring it back in sync with trunk so that you can apply
it and use it.

The output of a merge join will indeed be sorted on the join key. But
your specific example may not benefit from that fact. If you have a
multi-way join, it will probably be better to do a regular hash join,
where you can join n relations in one MR job, instead of requiring n
MR jobs.
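
As a rough sketch of that alternative, a three-way hash join on your
example could look something like the following. The third input
'inputFile3' and the alias indata3 are assumptions for illustration,
standing in for whatever your third relation is:

```pig
-- Hypothetical third input; name and schema are assumed, not from the original script.
indata1 = LOAD 'inputFile1' AS (a, b, c);
indata2 = LOAD 'inputFile2' AS (a, b, c);
indata3 = LOAD 'inputFile3' AS (a, b, c);

-- Default (hash) join: all three relations are joined on a in a single statement,
-- with no prior ORDER needed.
result = JOIN indata1 BY a, indata2 BY a, indata3 BY a;
```

Note that the output of this join is not guaranteed to be sorted on a,
so add an ORDER afterwards if you need that.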

Ashutosh
On Fri, Jun 4, 2010 at 05:48, Alexander Schätzle
<[EMAIL PROTECTED]> wrote:
> Hi all,
>
> The conditions for the Merge Join say that only FILTER and FOREACH are allowed between the LOAD and the Merge Join.
> I wonder why it is not possible to sort the loaded input on the join key with the ORDER statement before applying the Merge Join?
> Afterwards the input would be sorted on the join key, so a Merge Join would be possible.
>
> A script could look like this, for example:
>
> indata1 = LOAD 'inputFile1' AS (a, b, c);
> indata2 = LOAD 'inputFile2' AS (a, b, c);
> sorted_indata1 = ORDER indata1 BY a ASC;
> sorted_indata2 = ORDER indata2 BY a ASC;
> result = JOIN sorted_indata1 BY a, sorted_indata2 BY a USING "merge";
>
>
> Second question: Is the output of a Merge Join also sorted on the join key? This would make the Merge Join much more useful, because it would be possible to chain multiple Merge Joins like this:
>
> result1 = JOIN sorted_indata1 BY a, sorted_indata2 BY a USING "merge";
> result2 = JOIN result1 BY a, sorted_indata3 BY a USING "merge";
> ...
>
> Thx in advance,
> Alex
>
>
>