Pig >> mail # user >> Complex joins


Ho Duc Ha 2013-05-22, 12:49
Pradeep Gollakota 2013-05-22, 13:25
Ho Duc Ha 2013-05-22, 15:51
Pradeep Gollakota 2013-05-22, 16:41
Sergey Goder 2013-05-22, 16:32
Ho Duc Ha 2013-05-23, 02:51
Pradeep Gollakota 2013-05-23, 03:04
David Parks 2013-05-23, 06:24
Re: Complex joins
Oh ok... good to know that PigStorage can handle complex data types.
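
For the archives, a small untested sketch of what that looks like. The file
name 'data5' and its contents are made up for illustration; the point is that
the top-level fields are separated by a tab, with no outer parentheses, and
only the nested bag uses the brace and parenthesis notation.

```pig
-- Hypothetical file 'data5', one record per line, fields separated by a real tab:
--   3<TAB>{(5),(6)}
-- The bag schema names the inner tuple explicitly (t is an arbitrary alias).
X1 = LOAD 'data5' USING PigStorage('\t') AS (a:int, b:bag{t:tuple(c:int)});
DUMP X1;
```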

Just to confirm, the result in X should be (x, A: {(x, a1, b1)}, B: {(x,
a2, b2)})
Group will give you all the tuples for the group key from relation A in one
bag and all the tuples from relation B in a different bag.
To drop the duplicate group field from the inner tuples, you can just do a
FOREACH on X and project only the fields you need. I haven't tested this
yet, but something like the following should work.

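Y = foreach X {
    -- keep only the non-key fields of each inner bag
    Ap = foreach A generate a1, b1;
    Bp = foreach B generate a2, b2;
    -- the key of a grouped relation is named "group", not x
    generate group, Ap, Bp;
}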

One other minor recommendation is to use the COGROUP operator instead of
GROUP here. The two are functionally identical, but I believe it's best
practice to use GROUP for grouping a single relation and COGROUP for
grouping multiple relations.
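
Putting the pieces together, here is a rough, untested end-to-end sketch. The
file names 'a.tsv' and 'b.tsv' and the chararray types are placeholders I made
up for illustration; only the field names come from your example. It uses the
bag projection shorthand A.(a1, b1) rather than a nested block.

```pig
-- Hypothetical inputs: tab-delimited files with three columns each.
A = LOAD 'a.tsv' USING PigStorage('\t') AS (x:chararray, a1:chararray, b1:chararray);
B = LOAD 'b.tsv' USING PigStorage('\t') AS (x:chararray, a2:chararray, b2:chararray);

-- COGROUP yields one row per key: (group, A: {...}, B: {...}).
X = COGROUP A BY x, B BY x;

-- Project each bag down to its non-key fields; the key is kept once as "group".
Y = FOREACH X GENERATE group AS x, A.(a1, b1) AS Ap, B.(a2, b2) AS Bp;

DUMP Y;
```

Note that this keeps the tuples from A and B in two separate bags per key, which
is what grouping two relations produces.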
On Thu, May 23, 2013 at 2:24 AM, David Parks <[EMAIL PROTECTED]> wrote:

> Hi, I'm working alongside Ha on this.
>
> You were right and wrong about the PigStorage format. It *is* a tab
> delimited format, that was our mistake, but the tab-delimited fields *can*
> contain tuples and bags (using the parenthesis and brace notation). Anyway,
> your comments helped us figure out the problem, so I thank you for the time
> you took to offer the suggestion!
>
> Now we've got the example data loading correctly and we can create a simple
> example of the flatten, join, and re-group method you suggested. We added a
> small improvement to not force us to re-group on many fields in the end.
>
> I do have one question further... when we GROUP everything back together in
> the end I notice that the group field also gets included in the tuples.
> Example:
>
> A = (x, a1, b1)
> B = (x, a2, b2)
> X = GROUP A BY x, B BY x
>
> We get: ( x, {(x,a1,b1), (x, a2, b2)} )
>
> Which is essentially our desired result, but we don't need the duplicate x
> in the inner tuples. Is there an efficient way to render just this?
>
> ( x, {(a1,b1), (a2,b2)} )
>
>
>
> -----Original Message-----
> From: Pradeep Gollakota [mailto:[EMAIL PROTECTED]]
> Sent: Thursday, May 23, 2013 10:05 AM
> To: [EMAIL PROTECTED]
> Subject: Re: Complex joins
>
> As far as I know, PigStorage cannot handle complex data types such as Bags
> (it's just a delimiter-separated file). You might have to restructure your
> data, use a different storage function, or write a custom storage
> function.
> Since your data model is modeled after OO, you might be able to leverage
> Avro to maintain it.
>
>
> On Wed, May 22, 2013 at 10:51 PM, Ho Duc Ha <[EMAIL PROTECTED]> wrote:
>
> > We changed the load statement to:
> >
> > X = load 'data3' using PigStorage() as ( a:chararray,
> > b:bag{(c:chararray)} );
> >
> > But we get the same results with your statement:
> >
> > Y = FOREACH X GENERATE b;
> > dump Y;
> >
> > output (of above command)
> > -----------------------------------------
> > ()
> >
> > What we really want to create is a set of the tuples in the bag b
> > ('5'),('6')
> >
> > Another example which seems to fail to load properly is this (using
> > ints instead of strings):
> >
> > file: data4
> > -------------
> > ( 3, {(5),(6)} )
> >
> > X1 = load 'data4' using PigStorage() as ( a:int, b:bag{(c:int)} );
> > dump X1;
> >
> > result:
> > ---------
> > (,)
> >
> > We also tried formatting the data like this, with the extra tuple
> > around it like I see in the output often, no luck:
> > ((3, {(5),(6)} ))
> >
> >
> >
> >
> > On Wed, May 22, 2013 at 11:32 PM, Sergey Goder <[EMAIL PROTECTED]>
> > wrote:
> >
> > > Looks like you're probably not reading the data in correctly.
> > > Perhaps you need to specify the USING PigStorage() syntax and
> > > specify the correct delimiter as an argument.
> > >
> > > Also, if you want Y to just be the bag, then you can write it
> > > as:
> > >
> > > Y = FOREACH X GENERATE b;
> > >
> > >
> > > On Wed, May 22, 2013 at 8:51 AM, Ho Duc Ha <[EMAIL PROTECTED]> wrote:
> > >
> > > > Actually I think you're right, the process in map/reduce isn't so