Re: More 'Merge join/Cogroup only supports Filter, Foreach, filter and Load as its predecessor' silliness :-(
Try the following:

data = LOAD 'test2.csv' USING PigStorage(',') AS (source:int, target:int);

by_source = ORDER data BY source;
by_target = FOREACH (ORDER data BY target) GENERATE target, source;

STORE by_source INTO 'tmp/by_source' USING PigStorage();
STORE by_target INTO 'tmp/by_target' USING PigStorage();

-- Add this magical keyword here.
exec;

by_source = LOAD 'tmp/by_source' USING PigStorage() AS (source:int, target:int);
-- tmp/by_target was stored with its columns as (target, source), so declare the
-- schema in that order; that way the merge-join key below is the sorted column.
by_target = LOAD 'tmp/by_target' USING PigStorage() AS (target:int, source:int);

joined = JOIN by_source BY source, by_target BY target USING 'merge';

STORE joined INTO 'tmp/joined';

Since Pig compiles the whole script at once, it traces the lineage of the data
across the store/load boundary and so finds every predecessor of the merge
join; because only a few operators are currently allowed as predecessors, the
query fails to compile. Adding 'exec' forces the compiler to stop, compile, and
execute the script up to that point, and to resume only after that run has
finished; in the second part it then sees only LOADs as the predecessors of the
merge join.
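
If you would rather not rely on 'exec', you can get the same break in the
lineage by splitting the script into two files and running them one after the
other. The sketch below only illustrates that split (the file names
sort_stage.pig and join_stage.pig are made up); it does nothing that exec does
not already do for you:

-- sort_stage.pig (illustrative name): produce the two sorted copies.
data = LOAD 'test2.csv' USING PigStorage(',') AS (source:int, target:int);
by_source = ORDER data BY source;
by_target = FOREACH (ORDER data BY target) GENERATE target, source;
STORE by_source INTO 'tmp/by_source' USING PigStorage();
STORE by_target INTO 'tmp/by_target' USING PigStorage();

-- join_stage.pig (illustrative name): run after the first script finishes.
-- Only LOADs now precede the merge join, so it compiles.
by_source = LOAD 'tmp/by_source' USING PigStorage() AS (source:int, target:int);
by_target = LOAD 'tmp/by_target' USING PigStorage() AS (target:int, source:int);
joined = JOIN by_source BY source, by_target BY target USING 'merge';
STORE joined INTO 'tmp/joined';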

Hope it helps,
Ashutosh

On Sat, Aug 20, 2011 at 14:03, Kevin Burton <[EMAIL PROTECTED]> wrote:

> OK….. I still can't get this to work.
>
> I've read the documentation and I still get the same error on 0.9.0 …
>
> Here's my code. I think the error is implying that I need to have a LOAD as
> the predecessor and meet the following conditions:
>
>
> Inner merge join (between two tables) will only work under these
> conditions:
> >
> >    - Between the load of the sorted input and the merge join statement
> >      there can only be filter statements and foreach statement where the
> >      foreach statement should meet the following conditions:
> >        - There should be no UDFs in the foreach statement.
> >        - The foreach statement should not change the position of the join
> >          keys.
> >        - There should be no transformation on the join keys which will
> >          change the sort order.
> >    - Data must be sorted on join keys in ascending (ASC) order on both
> >      sides.
> >    - Right-side loader must implement either the {OrderedLoadFunc}
> >      interface or {IndexableLoadFunc} interface.
> >    - Type information must be provided for the join key in the schema.
> >
> > The Zebra and PigStorage loaders satisfy all of these conditions.
>
>
> …… which I believe I AM….. but it's still not working.
>
> Here's the data:
>
>
> 1,1
> 1,2
> 1,3
> 1,4
> 1,1000000000
> 0,1
> 0,2
> 0,3
> 0,4
> 0,1000000000
>
>
> … and the script.
>
> data = LOAD 'test2.csv' USING PigStorage(',') AS (source:int, target:int);
>
> by_source = ORDER data BY source;
> by_target = FOREACH (ORDER data BY target) GENERATE target, source;
>
> STORE by_source INTO 'tmp/by_source' USING PigStorage();
> STORE by_target INTO 'tmp/by_target' USING PigStorage();
>
> by_source = LOAD 'tmp/by_source' USING PigStorage() AS (source:int, target:int);
> by_target = LOAD 'tmp/by_target' USING PigStorage() AS (source:int, target:int);
>
> joined = JOIN by_source BY source, by_target BY target USING 'merge';
>
> STORE joined           INTO 'tmp/joined' ;
>
>
> --
>
> Founder/CEO Spinn3r.com
>
> Location: *San Francisco, CA*
> Skype: *burtonator*
>
> Skype-in: *(415) 871-0687*
>