Pig >> mail # user >> Join Question


Re: Join Question
I think there's probably some convoluted way to do this. First thing you'll
have to do is flatten your data.

data1 = A, B
_____
X, X1
X, X2
Y, Y1
Y, Y2
Y, Y3
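
To make the flattened shape concrete, here is a minimal sketch in plain Python rather than Pig. It only models the data flow; the sample rows are taken from the original question, and the function name flatten_rows is made up for illustration. In Pig itself, FLATTEN applied to a bag inside a FOREACH ... GENERATE produces this one-row-per-member layout.

```python
def flatten_rows(rows):
    """Turn (key, (m1, m2, ...)) rows into one (key, member) row per member."""
    return [(key, member) for key, members in rows for member in members]


# Sample rows from the question: a key plus a variable-length group of members.
raw = [("X", ("X1", "X2")), ("Y", ("Y1", "Y2", "Y3"))]

data1 = flatten_rows(raw)
for row in data1:
    print(row)
```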

Then do a join by "B" onto your second dataset. This should produce the
following

joined = data1::A, data1::B, data2::A, data2::B, data2::C, data2::D (I'm
assuming the second data set has exactly 4 columns).
_______________
X, X1, X1, 4, 5, 6
X, X2, X2, 3, 7, 3
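
The join step can be modelled the same way. This is a rough Python sketch of an inner join, not Pig code; inner_join is a hypothetical helper, and the rows are the sample values from this thread. Since the question only gives second-dataset rows for X1 and X2, the Y rows find no match and drop out, as they would in an inner join.

```python
def inner_join(left, right):
    """Inner-join left rows (A, B) with right rows on left.B == right's first column."""
    by_key = {}
    for row in right:
        by_key.setdefault(row[0], []).append(row)
    # Concatenate each left row with every matching right row.
    return [l + r for l in left for r in by_key.get(l[1], [])]


data1 = [("X", "X1"), ("X", "X2"), ("Y", "Y1"), ("Y", "Y2"), ("Y", "Y3")]
data2 = [("X1", 4, 5, 6), ("X2", 3, 7, 3)]

joined = inner_join(data1, data2)
for row in joined:
    print(row)
```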

Now do a group by data1::A to get
{X, {(X, X1, X1, 4, 5, 6), (X, X2, X2, 3, 7, 3), ...}}
{Y, {(Y, Y1, Y1, ...), (Y, Y2, Y2, ...), ...}}

This is as far as I got. I'm not sure if there's a built-in UDF to
transform that into what you're looking for. I thought maybe BagToTuple,
but it returns a single tuple containing all elements of all tuples in the
bag. If the above data format supports your use cases, you're done. If not,
you can write a UDF to transform it into the required format.
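
For what it's worth, the logic such a UDF would need is small. The sketch below is plain Python, not an actual Pig UDF, and group_and_trim is a made-up name. It assumes the joined rows have the six-column layout shown earlier, groups them by the first column, and keeps only the second dataset's columns for each member.

```python
def group_and_trim(joined_rows):
    """Group rows by column 0 and keep columns 2 onward as one tuple per member."""
    grouped = {}
    for row in joined_rows:
        # row[0] is the group key; row[2:] holds the second dataset's columns.
        grouped.setdefault(row[0], []).append(row[2:])
    return [(key,) + tuple(members) for key, members in grouped.items()]


joined = [("X", "X1", "X1", 4, 5, 6), ("X", "X2", "X2", 3, 7, 3)]

result = group_and_trim(joined)
for row in result:
    print(row)
```

Each output row is the key followed by one tuple per member, which is the shape asked for in the original question.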

On Wed, Sep 4, 2013 at 4:39 PM, F. Jerrell Schivers
<[EMAIL PROTECTED]> wrote:

> Howdy folks,
>
> Let's say I have a set of data that looks like this:
>
> X, (X1, X2)
> Y, (Y1, Y2, Y3)
>
> So there could be an unknown number of members of each tuple per row.
>
> I also have a second set of data that looks like this:
>
> X1, 4, 5, 6
> X2, 3, 7, 3
>
> I'd like to join these such that I get:
>
> X, (X1, 4, 5, 6), (X2, 3, 7, 3)
> Y, (Y1, etc), (Y2, etc), (Y3, etc)
>
> Is this possible with Pig?
>
> Thanks,
> Jerrell
>