Complex joins
Our data is modeled like a typical object-oriented data model: simple
fields plus collections of other objects. We're trying to accomplish the
following join:

Here's our example input:
-------------------------------------
data1 = {  ( 'a1', { ('a2-thing1'), ('a2-thing2') } )  }
data2 = {  ( 'a2-thing1', 'x-value1' ), ( 'a2-thing1', 'x-value2' )  }

Here's what we want to get:
--------------------------------------
( 'a1', { ( 'a2-thing1', { ('x-value1'), ('x-value2') } ) } )

Note that we want to join each element of the a2 collection in the 1st
data set against the first field of the 2nd data set.

We tried this:
--------------------
A = load 'data1' as ( a:tuple(a1:chararray, a2:bag{(a2t:chararray)}) );
B = load 'data2' as ( a2t:chararray, x:chararray );
X = join A by a2.a2t, B by a2t;

We get this error:
---------------------------
ERROR 1128: Cannot find field a2t in
a1:chararray,a2:bag{:tuple(a2t:chararray)}

Try as we might, we cannot find the right way to do this complex join.
Questions:
  1) Should we be simplifying our data format into a more SQL table-like
structure and doing more joins to reduce the complexity?
  2) How can we accomplish joining data2's data into the data1 "objects"?
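
One direction that might work for question 2, offered only as an untested
sketch: flatten the a2 bag so that each a2t value becomes a top-level field,
do an ordinary join, then group twice to rebuild the nesting. The relation
names (A_flat, J, P, G, N, H, R) are made up for illustration, and the load
schema assumes a1 and a2 are top-level fields, as the schema printed in the
error message suggests.

```pig
-- Assumed schema: a1 and a2 at the top level (no wrapping tuple).
A = LOAD 'data1' AS (a1:chararray, a2:bag{t:tuple(a2t:chararray)});
B = LOAD 'data2' AS (a2t:chararray, x:chararray);

-- One row per (a1, a2t) pair, so a2t can be used as a join key.
A_flat = FOREACH A GENERATE a1, FLATTEN(a2) AS a2t;

-- Inner join: a2t values with no match in data2 are dropped.
J = JOIN A_flat BY a2t, B BY a2t;

-- Rename fields to drop the relation prefixes left by the join.
P = FOREACH J GENERATE A_flat::a1 AS a1, A_flat::a2t AS a2t, B::x AS x;

-- First regroup: collect the x values for each (a1, a2t).
G = GROUP P BY (a1, a2t);
N = FOREACH G GENERATE FLATTEN(group) AS (a1, a2t), P.x AS xs;

-- Second regroup: collect the (a2t, xs) tuples for each a1.
H = GROUP N BY a1;
R = FOREACH H GENERATE group AS a1, N.(a2t, xs);
```

Because the join is an inner join, 'a2-thing2' (which has no rows in data2)
would not appear in the result, which matches the output we are after. If
unmatched elements should be kept, a LEFT OUTER join would be the place to
start, at the cost of handling null x values.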

--
Ho Duc Ha