Pig >> mail # user >> Complex joins


Complex joins
We've got a data type that is modeled after a typical object-oriented
data-model format (simple fields, and collections of other objects). We're
trying to accomplish the following join:

Here's our example input:
-------------------------------------
data1 = {  ( 'a1', { ('a2-thing1'), ('a2-thing2') } )  }
data2 = {  ( 'a2-thing1', 'x-value1' ), ( 'a2-thing1', 'x-value2' )  }

Here's what we want to get:
--------------------------------------
( 'a1', { ( 'a2-thing1', { ('x-value1'), ('x-value2') } ) } )

Notice that we are trying to join each element of the a2 collection in the
1st data set against the first field of the 2nd data set.
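
To make the intended semantics concrete, here is a minimal Python sketch
(not Pig) of the result we are after. The function name nest_join is
hypothetical, and the inner-join behavior (dropping 'a2-thing2', which has
no match in data2, and dropping any record left with no matches at all) is
an assumption read off the expected output above, not something Pig
guarantees.

```python
from collections import defaultdict


def nest_join(data1, data2):
    """Nest data2's values under the matching a2 items of each data1 record.

    data1: list of (a1, [a2 items]) records
    data2: list of (a2 item, x value) pairs
    """
    # Index data2 by its first field so each a2 item can be looked up directly.
    lookup = defaultdict(list)
    for key, value in data2:
        lookup[key].append(value)

    result = []
    for a1, a2_items in data1:
        # Assumed inner-join semantics: keep only a2 items that have a match.
        nested = [(a2t, lookup[a2t]) for a2t in a2_items if a2t in lookup]
        # Assumed: a record with no matching a2 items is dropped entirely.
        if nested:
            result.append((a1, nested))
    return result


data1 = [("a1", ["a2-thing1", "a2-thing2"])]
data2 = [("a2-thing1", "x-value1"), ("a2-thing1", "x-value2")]
print(nest_join(data1, data2))
```

The printed structure mirrors the nested tuple/bag layout we want, with
Python lists standing in for bags.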

We tried this:
--------------------
A = load 'data1' as ( a:tuple(a1:chararray, a2:bag{(a2t:chararray)}) );
B = load 'data2' as ( a2t:chararray, x:chararray );
X = join A by a2.a2t, B by a2t;

We get this error:
---------------------------
ERROR 1128: Cannot find field a2t in
a1:chararray,a2:bag{:tuple(a2t:chararray)}

Try as we might, we cannot find the right way to do this complex join.
Questions:
  1) Should we be simplifying our data format into a more SQL table-like
structure and doing more joins to reduce the complexity?
  2) How can we accomplish joining data2's data into the data1 "objects"?

--
Ho Duc Ha