Pig >> mail # user >> Join Multiple Relations by Different Fields

Join Multiple Relations by Different Fields
Hi,

Say I have three files, `data1`, `data2` and `assoc`:

$ cat data1
key1,foo
key2,bar
$ cat data2
key3,braz
key4,froz
$ cat assoc
key1,key3
key2,key4

I load these files via

$ pig -b -p debug=WARN -x local
Warning: $HADOOP_HOME is deprecated.

Apache Pig version 0.10.0 (r1328203) compiled Apr 19 2012, 22:54:12
Logging error messages to: /home/vince/tmp/pig_1355407390166.log
Connecting to hadoop file system at: file:///
grunt> data1 = load 'data1' as (key: chararray, val: chararray);  
grunt> data2 = load 'data2' as (key: chararray, val: chararray);  
grunt> assoc = load 'assoc' as (key1: chararray, key2: chararray);

What I want is a relation that looks like:

(foo, braz)
(bar, froz)

That is

data1_val, data1_key <-> assoc_key1, assoc_key2 <-> data2_key, data2_val

My first idea was to join data1 with assoc and then join the resulting
relation with data2. However, even the first step,

A = join data1 by key, assoc by key1;
dump A;

doesn't yield any results. Is this a bug, or am I doing something
conceptually wrong?
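
The two-step join described above could be sketched roughly as follows. This is only a sketch under one stated assumption: the `load` statements above use the default loader, which splits fields on tabs, whereas the sample files are comma-separated, so the sketch adds `using PigStorage(',')`. The aliases `B`, `C`, `data1_val` and `data2_val` are names introduced here for illustration.

```pig
-- Assumption: the sample files are comma-separated, so split on ','
-- instead of relying on the default tab delimiter.
data1 = load 'data1' using PigStorage(',') as (key: chararray, val: chararray);
data2 = load 'data2' using PigStorage(',') as (key: chararray, val: chararray);
assoc = load 'assoc' using PigStorage(',') as (key1: chararray, key2: chararray);

-- Step 1: attach each data1 row to its association row.
A = join data1 by key, assoc by key1;

-- Step 2: follow the association's second key into data2.
B = join A by assoc::key2, data2 by key;

-- Keep only the two values, selected by position:
-- $1 is the val column from data1, $5 is the val column from data2.
C = foreach B generate $1 as data1_val, $5 as data2_val;
dump C;
```

Selecting the output columns by position avoids depending on how the nested `::` prefixes are resolved after the second join.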

Regards,
Thomas Bach.