Home | About | Sematext search-lucene.com search-hadoop.com
NEW: Monitor These Apps!
elasticsearch, apache solr, apache hbase, hadoop, redis, casssandra, amazon cloudwatch, mysql, memcached, apache kafka, apache zookeeper, apache storm, ubuntu, centOS, red hat, debian, puppet labs, java, senseiDB
 Search Hadoop and all its subprojects:

Switch to Plain View
Pig >> mail # user >> LEFT OUTER JOIN?


+
Dan DeCapria, CivicScienc... 2013-04-19, 18:24
Have a solution which I personally don't like, but it seems to work for now:

T = LOAD 'T.dat' AS (a:chararray, b:chararray, c:chararray, d:chararray,
x:chararray, y:chararray);
U = LOAD 'U.dat' AS (a:chararray, b:chararray, c:chararray, d:chararray,
x:chararray, y:chararray);
A = JOIN T BY (a,b,c,d) FULL OUTER, U BY (a,b,c,d);
B = FOREACH A GENERATE
(T::a IS NOT NULL AND T::b IS NOT NULL AND T::c IS NOT NULL AND T::d IS NOT
NULL ? T::a : U::a) AS a:chararray,
(T::a IS NOT NULL AND T::b IS NOT NULL AND T::c IS NOT NULL AND T::d IS NOT
NULL ? T::b : U::b) AS b:chararray,
(T::a IS NOT NULL AND T::b IS NOT NULL AND T::c IS NOT NULL AND T::d IS NOT
NULL ? T::c : U::c) AS c:chararray,
(T::a IS NOT NULL AND T::b IS NOT NULL AND T::c IS NOT NULL AND T::d IS NOT
NULL ? T::d : U::d) AS d:chararray,
(T::a IS NOT NULL AND T::b IS NOT NULL AND T::c IS NOT NULL AND T::d IS NOT
NULL ? T::x : U::x) AS x:chararray,
(T::a IS NOT NULL AND T::b IS NOT NULL AND T::c IS NOT NULL AND T::d IS NOT
NULL ? T::y : U::y) AS y:chararray;
DUMP B;
----
(a1,b1,c1,d1,g1,h1)
(a2,b2,c2,d2,e2,f2)
(a3,b3,c3,d3,g3,h3)
(a4,b4,c4,d4,e4,f4)
(a5,b5,c5,d5,e5,f5)
(a6,b6,c6,d6,g6,h6)

Bumping for an optimization here.

Thanks,  -Dan

On Fri, Apr 19, 2013 at 2:24 PM, Dan DeCapria, CivicScience <
[EMAIL PROTECTED]> wrote:

> Consider two aliases (T) and (U), loaded from data with schema defined
> below. I was considering a left outer join to 'merge' the two records,
> overriding those in U with the join fields in T, but the result isn't what
> I desired.  Possibly a filter then union is required? I'm just not sure;
> looking for a quick way to get to this result (R):
>
> R: ?[JOIN T BY (a,b,c,d) LEFT OUTER, U BY (a,b,c,d)]: schema : a,b,c,d,x,y
> a1 b1 c1 d1 *g1 h1*
> a2 b2 c2 d2 e2 f2
> a3 b3 c3 d3 *g3 h3*
> a4 b4 c4 d4 e4 f4
> a5 b5 c5 d5 e5 f5
> a6 b6 c6 d6 g6 h6
>
> T : schema : a,b,c,d,x,y
> a1 b1 c1 d1 g1 h1
> a3 b3 c3 d3 g3 h3
> a6 b6 c6 d6 g6 h6
>
> U: schema : a,b,c,d,x,y
> a1 b1 c1 d1 e1 f1
> a2 b2 c2 d2 e2 f2
> a3 b3 c3 d3 e3 f3
> a4 b4 c4 d4 e4 f4
> a5 b5 c5 d5 e5 f5
>
> Many thanks,  -Dan
>
NEW: Monitor These Apps!
elasticsearch, apache solr, apache hbase, hadoop, redis, casssandra, amazon cloudwatch, mysql, memcached, apache kafka, apache zookeeper, apache storm, ubuntu, centOS, red hat, debian, puppet labs, java, senseiDB