Join Multiple Relations by Different Fields
Hi,

Say I have three files, `data1`, `data2` and `assoc`:

$ cat data1
key1,foo
key2,bar
$ cat data2
key3,braz
key4,froz
$ cat assoc
key1,key3
key2,key4

I load these files in the Grunt shell:

$ pig -b -p debug=WARN -x local
Warning: $HADOOP_HOME is deprecated.

Apache Pig version 0.10.0 (r1328203) compiled Apr 19 2012, 22:54:12
Logging error messages to: /home/vince/tmp/pig_1355407390166.log
Connecting to hadoop file system at: file:///
grunt> data1 = load 'data1' as (key: chararray, val: chararray);  
grunt> data2 = load 'data2' as (key: chararray, val: chararray);  
grunt> assoc = load 'assoc' as (key1: chararray, key2: chararray);

What I want is a relation that looks like:

(foo, braz)
(bar, froz)

That is, the values should be linked through the key pairs in assoc:

data1_val, data1_key <-> assoc_key1, assoc_key2 <-> data2_key, data2_val

My first idea was to join data1 with assoc, and then join the
resulting relation with data2. However, running

A = join data1 by key, assoc by key1;
dump A;

doesn't yield any results. Is this a bug, or am I doing something
conceptually wrong?

Regards,
Thomas Bach.
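
One possible explanation, offered as a sketch rather than a confirmed diagnosis: the files shown above are comma-separated, while a `load` without a `using` clause reads its input with PigStorage, whose default field delimiter is a tab. Under that assumption each whole line (e.g. `key1,foo`) would end up in the first field, so no key would match in the join. The script below assumes the delimiter is the only problem and chains the two joins described above; the alias names B and C are introduced here for illustration.

```pig
-- Assumption: the input files are comma-separated, as the cat output suggests.
data1 = load 'data1' using PigStorage(',') as (key: chararray, val: chararray);
data2 = load 'data2' using PigStorage(',') as (key: chararray, val: chararray);
assoc = load 'assoc' using PigStorage(',') as (key1: chararray, key2: chararray);

-- First join: attach the association row to each data1 row.
A = join data1 by key, assoc by key1;

-- Second join: follow assoc's key2 to the matching data2 row.
B = join A by assoc::key2, data2 by key;

-- Keep only the two value columns (B and C are hypothetical alias names).
C = foreach B generate A::data1::val, data2::val;
dump C;
```

If the assumption holds, C should contain one tuple per line of assoc, pairing each data1 value with its associated data2 value. Running `dump data1;` right after the original load is a quick way to check whether the fields were split as intended.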