Re: Complex joins
Oh ok... good to know that PigStorage can handle complex data types.

Just to confirm, the result in X should be:
( x, A: {(x, a1, b1)}, B: {(x, a2, b2)} )
Group will give you all the tuples for the group key from relation A in one
bag and all the tuples from relation B in a different bag.
To project away the group field from the tuples inside the bags, you can just
do a FOREACH on X and project away the fields you don't need. I haven't yet
tested this, but something like the following should work.

Y = foreach X {
    Ap = foreach A generate a1, b1;   -- drop the duplicate x inside bag A
    Bp = foreach B generate a2, b2;   -- drop the duplicate x inside bag B
    generate group as x, Ap, Bp;      -- the grouping key is named 'group'
}
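
If your version of Pig doesn't support a nested FOREACH inside a FOREACH block
(I believe that's a relatively recent addition), the plain bag-projection
syntax should give the same result:

Y = foreach X generate group as x, A.(a1, b1), B.(a2, b2);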

One other minor recommendation I have is to use the COGROUP operator instead
of GROUP here. They do the same thing, but I believe it's best practice to use
GROUP when grouping a single relation and COGROUP when grouping multiple
relations.
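For example (untested, assuming your relations are A(x, a1, b1) and
B(x, a2, b2) as in your mail below):

X = COGROUP A BY x, B BY x;
-- X: (group, A: {(x, a1, b1), ...}, B: {(x, a2, b2), ...})
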
On Thu, May 23, 2013 at 2:24 AM, David Parks <[EMAIL PROTECTED]> wrote:

> Hi, I'm working alongside Ha on this.
>
> You were right and wrong about the PigStorage format. It *is* a tab-delimited
> format, that was our mistake, but those tab-delimited fields *can* contain
> tuples and bags (using the parenthesis and bracket notation). Anyway, your
> comments helped us figure out the problem, so I thank you for the time you
> took to offer the suggestion!
>
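> To make it concrete (using the data4 example quoted below): something like a
> plain tab-separated line with no outer parentheses,
>
> 3	{(5),(6)}
>
> (that's a literal tab between the 3 and the bag) should load with the same
> ( a:int, b:bag{(c:int)} ) schema and dump as (3,{(5),(6)}).
>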
> Now we've got the example data loading correctly, and we can create a simple
> example of the flatten, join, and re-group method you suggested. We added a
> small improvement so that we aren't forced to re-group on many fields at the
> end.
>
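> Roughly (simplified; A_raw, B_raw and items are made-up names standing in for
> our real schema):
>
> A = FOREACH A_raw GENERATE x, FLATTEN(items) AS (a1, b1);  -- flatten the bags
> B = FOREACH B_raw GENERATE x, FLATTEN(items) AS (a2, b2);
> X = GROUP A BY x, B BY x;  -- re-group both relations on the single key x
>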
> I do have one further question... when we GROUP everything back together at
> the end, I notice that the group key also shows up inside the tuples in the
> bag. Example:
>
> A = (x, a1, b1)
> B = (x, a2, b2)
> X = GROUP A by x, B by x
>
> We get: ( x, {(x,a1,b1), (x, a2, b2)} )
>
> Which is essentially our desired result, but we don't need the duplicate x in
> the inner tuples. Is there an efficient way to produce just this instead?
>
> ( x, {(a1,b1), (a2,b2)} )
>
>
>
> -----Original Message-----
> From: Pradeep Gollakota [mailto:[EMAIL PROTECTED]]
> Sent: Thursday, May 23, 2013 10:05 AM
> To: [EMAIL PROTECTED]
> Subject: Re: Complex joins
>
> As far as I know, PigStorage cannot handle complex data types such as bags
> (it's just a delimiter-separated file). You might have to restructure your
> data, use a different storage function, or write a custom storage function.
> Since your data model follows an object-oriented design, you might be able to
> leverage Avro to preserve that model.
>
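> For example, something like the piggybank AvroStorage might work (untested;
> the exact jar/class depends on your Pig and piggybank versions):
>
> REGISTER piggybank.jar;
> X = LOAD 'data.avro' USING org.apache.pig.piggybank.storage.avro.AvroStorage();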
>
> On Wed, May 22, 2013 at 10:51 PM, Ho Duc Ha <[EMAIL PROTECTED]> wrote:
>
> > We changed the load statement to:
> >
> > X = load 'data3' using PigStorage() as ( a:chararray,
> > b:bag{(c:chararray)} );
> >
> > But we get the same results with your statement:
> >
> > Y = FOREACH X GENERATE b;
> > dump Y;
> >
> > output (of above command)
> > -----------------------------------------
> > ()
> >
> > What we really want to create is a set of the tuples in the bag b
> > ('5'),('6')
> >
> > Another example which seems to fail to load properly is this (using
> > ints instead of strings):
> >
> > file: data4
> > -------------
> > ( 3, {(5),(6)} )
> >
> > X1 = load 'data4' using PigStorage() as ( a:int, b:bag{(c:int)} );
> > dump X1;
> >
> > result:
> > ---------
> > (,)
> >
> > We also tried formatting the data like this, with an extra tuple wrapped
> > around it like we often see in dump output, but no luck:
> > ((3, {(5),(6)} ))
> >
> >
> >
> >
> > > On Wed, May 22, 2013 at 11:32 PM, Sergey Goder <[EMAIL PROTECTED]> wrote:
> >
> > > Looks like you're probably not reading the data in correctly.
> > > Perhaps you need to specify the USING PigStorage() syntax and
> > > specify the correct delimiter as an argument.
> > >
> > > Also, if you want Y to just be the bag, then you can just write it as:
> > >
> > > Y = FOREACH X GENERATE b;
> > >
> > >
> > > On Wed, May 22, 2013 at 8:51 AM, Ho Duc Ha <[EMAIL PROTECTED]> wrote:
> > >
> > > > Actually I think you're right, the process in map/reduce isn't so