

Howdy folks,
Let's say I have a set of data that looks like this:

X, (X1, X2)
Y, (Y1, Y2, Y3)

So there could be an unknown number of members of each tuple per row.
I also have a second set of data that looks like this:

X1, 4, 5, 6
X2, 3, 7, 3

I'd like to join these such that I get:

X, (X1, 4, 5, 6), (X2, 3, 7, 3)
Y, (Y1, etc), (Y2, etc), (Y3, etc)

Is this possible with Pig?

Thanks,
Jerrell

I think there's probably some convoluted way to do this. The first thing you'll have to do is flatten your data, so that each member gets its own row:

data1 = A, B
_____
X, X1
X, X2
Y, Y1
Y, Y2
Y, Y3

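To make the flattening step concrete, here is a minimal sketch in plain Python (not Pig Latin). The `rows` list and the `flatten` helper are hypothetical stand-ins for the first data set and are not part of Pig. One caveat worth keeping in mind: in Pig itself, FLATTEN turns a bag into multiple rows but expands a tuple into extra columns, so the members would likely need to be held in a bag for this step to produce one row per member.

```python
# Hypothetical in-memory stand-in for the first data set: (key, members) rows.
rows = [
    ("X", ("X1", "X2")),
    ("Y", ("Y1", "Y2", "Y3")),
]


def flatten(rows):
    """Emit one (A, B) pair per member, however many members a row has."""
    return [(key, member) for key, members in rows for member in members]


data1 = flatten(rows)
for a, b in data1:
    print(a, b)
```

Each input row contributes as many output rows as it has members, which is what makes the later join on column B possible.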
Then do a join by "B" onto your second dataset. This should produce the following:

data2 = data1::A, data1::B, data2::A, data2::B, data2::C, data2::D (I'm assuming the second data set has exactly 4 columns).
_______________
X, X1, X1, 4, 5, 6
X, X2, X2, 3, 7, 3

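As a rough illustration of what the join produces, here is a plain-Python sketch of an inner join on column B. The names `second` and `inner_join` are made up for this example, and the sample only contains the X1 and X2 rows given in the question, so the Y rows find no match and drop out.

```python
# Flattened first data set, as (A, B) pairs.
data1 = [("X", "X1"), ("X", "X2"), ("Y", "Y1"), ("Y", "Y2"), ("Y", "Y3")]

# Hypothetical stand-in for the second data set; assumed to have 4 columns.
second = [("X1", 4, 5, 6), ("X2", 3, 7, 3)]


def inner_join(left, right):
    """Inner join of left column B against the first column of right."""
    index = {}
    for row in right:
        index.setdefault(row[0], []).append(row)
    # Concatenate the fields of each matching pair, keeping both key columns.
    return [l + r for l in left for r in index.get(l[1], [])]


joined = inner_join(data1, second)
for row in joined:
    print(row)
```

Note that the join key shows up twice in every output row (once from each side), which is why the rows above have six fields.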
Now do a group by data1::A to get

{X, {(X, X1, X1, 4, 5, 6), (X, X2, X2, 3, 7, 3), ...}}
{Y, {(Y, Y1, Y1, ...), (Y, Y2, Y2, ...), ...}}

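The grouping step can be sketched the same way in plain Python. Here `group_by_first` is a hypothetical helper, and the dictionary it returns only approximates Pig's (group, bag) output.

```python
# Joined rows: (data1::A, data1::B, then the four columns of the second set).
joined = [
    ("X", "X1", "X1", 4, 5, 6),
    ("X", "X2", "X2", 3, 7, 3),
]


def group_by_first(rows):
    """Collect whole rows into one list (the 'bag') per value of the first field."""
    groups = {}
    for row in rows:
        groups.setdefault(row[0], []).append(row)
    return groups


grouped = group_by_first(joined)
for key, bag in grouped.items():
    print(key, bag)
```

As in the Pig output, every tuple in a bag still carries the grouping key and the duplicated join key.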
This is as far as I got; I'm not sure if there's a built-in UDF to transform that into what you're looking for. I thought maybe BagToTuple, but it returns a single tuple containing all elements of all tuples in the bag. If the above data format supports your use cases, you're done. If not, you can write a UDF to transform it into the required format.

On Wed, Sep 4, 2013 at 4:39 PM, F. Jerrell Schivers <[EMAIL PROTECTED]> wrote:
> Howdy folks,
>
> Let's say I have a set of data that looks like this:
>
> X, (X1, X2)
> Y, (Y1, Y2, Y3)
>
> So there could be an unknown number of members of each tuple per row.
>
> I also have a second set of data that looks like this:
>
> X1, 4, 5, 6
> X2, 3, 7, 3
>
> I'd like to join these such that I get:
>
> X, (X1, 4, 5, 6), (X2, 3, 7, 3)
> Y, (Y1, etc), (Y2, etc), (Y3, etc)
>
> Is this possible with Pig?
>
> Thanks,
> Jerrell

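For the last step of the reply above (reshaping each group into the requested form), a UDF would only need to drop the two duplicated leading fields from every tuple. The sketch below shows that logic in plain Python; `reshape` is a hypothetical name, and wiring it up as an actual Pig UDF (schema declaration, registration) is left out because those details depend on the setup.

```python
# One grouped record, as produced by the group step: key plus its bag of tuples.
grouped = {
    "X": [("X", "X1", "X1", 4, 5, 6), ("X", "X2", "X2", 3, 7, 3)],
}


def reshape(key, bag):
    """Drop the leading data1::A and data1::B fields from each tuple in the bag."""
    return (key,) + tuple(row[2:] for row in bag)


for key, bag in grouped.items():
    print(reshape(key, bag))
# prints: ('X', ('X1', 4, 5, 6), ('X2', 3, 7, 3))
```

Slicing from index 2 assumes the column order shown after the join; if the join lists the data sets in the other order, the offset changes.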
Hi Pradeep,
This is exactly what I'm looking for. I was going to process this data inside a UDF anyway, so it's easy for me to pick out what I need. Many thanks.
Jerrell
On Wed, 4 Sep 2013, Pradeep Gollakota wrote:
> I think there's probably some convoluted way to do this. First thing you'll
> have to do is flatten your data.
>
> data1 = A, B
> _____
> X, X1
> X, X2
> Y, Y1
> Y, Y2
> Y, Y3
>
> Then do a join by "B" onto you second dataset. This should produce the
> following
>
> data2 = data1::A, data1::B, data2::A, data2::B, data2::C (I'm assuming data
> set has exactly 4 columns).
> _______________
> X, X1, X1, 4, 5, 6
> X, X2, X2, 3, 7, 3
>
> Now do a group by data1::A to get
> {X, {(X, X1, X1, 4, 5, 6), (X, X2, X2, 3, 7, 3), ...}}
> {Y, {(Y, Y1, Y1, ...), (Y, Y2, Y2, ...), ...}}
>
> This is as far as I got, I'm not sure if there's a builtin UDF to
> transform that into what you're looking for. I thought maybe BagToTuple,
> but it will return a single tuple with all elements of all tuples in the
> bag. If the above data format supports your use cases, you're done. If not,
> you can write a UDF to transform it into the required format.

