Re: Complex joins
It looks like the data is not being read in correctly. You probably need
to add a USING PigStorage() clause to the LOAD statement and pass the
correct delimiter as its argument.
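
As a minimal sketch of what that might look like, assuming data3 is rewritten with one record per line and a tab between the two fields (for example 3<TAB>{(5),(6)}, with no outer parentheses or quotes), the load could be written as:

```pig
-- Assumption: 'data3' holds lines like  3<TAB>{(5),(6)}
-- Tab is PigStorage's default delimiter; it is passed explicitly here for clarity.
X = LOAD 'data3' USING PigStorage('\t') AS (a:chararray, b:bag{(c:chararray)});
```

With the fields split this way, a should hold only the first field and b the bag, instead of the whole line landing in a.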

Also, if you want Y to be just the bag, you can write it as:

Y = FOREACH X GENERATE b;
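
For the original join question, the flatten, join, then group approach suggested earlier in the thread can be sketched roughly as below. This is untested and assumes tab-delimited input files (for example a1<TAB>{(a2-thing1),(a2-thing2)} in data1); the aliases A_flat, J, J2, G1, P, G2 and R are only illustrative names.

```pig
-- Assumption: both files are tab-delimited, one record per line.
A = LOAD 'data1' AS (a1:chararray, a2:bag{(a2t:chararray)});
B = LOAD 'data2' AS (a2t:chararray, x:chararray);

-- Un-nest the bag: one row per (a1, a2t) pair.
A_flat = FOREACH A GENERATE a1, FLATTEN(a2) AS a2t;

-- Ordinary equi-join on the flattened key.
J  = JOIN A_flat BY a2t, B BY a2t;
J2 = FOREACH J GENERATE A_flat::a1 AS a1, A_flat::a2t AS a2t, B::x AS x;

-- Collect the x values for each (a1, a2t) pair into a bag.
G1 = GROUP J2 BY (a1, a2t);
P  = FOREACH G1 GENERATE FLATTEN(group) AS (a1, a2t), J2.x AS xs;

-- Regroup by a1 to rebuild the outer bag of (a2t, xs) tuples.
G2 = GROUP P BY a1;
R  = FOREACH G2 GENERATE group AS a1, P.(a2t, xs) AS a2;
```

Because JOIN is an inner join by default, a2 values with no match in data2 (such as a2-thing2 in the example) drop out, which matches the desired output shown in the original question.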

On Wed, May 22, 2013 at 8:51 AM, Ho Duc Ha <[EMAIL PROTECTED]> wrote:

> Actually I think you're right, the process in map/reduce isn't so
> different.
>
> However, after trying to do this, we can't understand the output we see
> below. We expected to see only '3' in alias Z, and '5' and '6' in alias Y,
> neither result was as expected.
>
> X = load 'data3' as ( a:chararray, b:bag{(c:chararray)} );
> Y = foreach X { W = foreach b generate *; generate W; };
> Z = foreach X generate a;
>
> data3
> ( '3', {( '5' ),('6')} )
>
> dump X
> (( '3', {( '5' ),('6')} ),)
>
> dump Y
> ({})
>
> dump Z
> (( '3', {( '5' ),('6')} ))
>
> On Wed, May 22, 2013 at 8:25 PM, Pradeep Gollakota <[EMAIL PROTECTED]> wrote:
>
> > Hi All,
> >
> > I'm a beginner Pig user and this is my first post to the Pig mailing
> > list.
> >
> > Anyway, to answer your question, the first thing that comes to my mind is
> > that Pig may not be able to do a complex join like that.
> >
> > However, you can first flatten the bag in A, then do your join, and then
> > do a group by to get the result in the format you are looking for. This
> > may not be an ideal solution, but it should work.
> >
> > Pradeep
> >
> >
> > On Wed, May 22, 2013 at 8:49 AM, Ho Duc Ha <[EMAIL PROTECTED]> wrote:
> >
> > > We've got a data type that is modeled after a typical object-oriented
> > > data-model format (simple fields, and collections of other objects).
> > > We're trying to accomplish the following join:
> > >
> > > Here's our example input:
> > > -------------------------------------
> > > data1 = {  ( 'a1', { ('a2-thing1'), ('a2-thing2') } )  }
> > > data2 = {  ( 'a2-thing1', 'x-value1' ), ( 'a2-thing1', 'x-value2' )  }
> > >
> > > Here's what we want to get:
> > > --------------------------------------
> > > ( 'a1', { ( 'a2-thing1', { ('x-value1'), ('x-value2') } ) } )
> > >
> > > Notice that we are trying to join the collection of a2 fields of the
> > > 1st data set, on the first field in the 2nd data set.
> > >
> > > We tried this:
> > > --------------------
> > > A = load 'data1' as ( a:tuple(a1:chararray, a2:bag{(a2t:chararray)}) );
> > > B = load 'data2' as ( a2t:chararray, x:chararray );
> > > X = join A by a2.a2t, B by a2t;
> > >
> > > We get this error:
> > > ---------------------------
> > > ERROR 1128: Cannot find field a2t in
> > > a1:chararray,a2:bag{:tuple(a2t:chararray)}
> > >
> > > Try as we might, we cannot find the right way to do this complex join.
> > > Questions:
> > >   1) Should we be simplifying our data format into a more SQL
> > > table-like structure and doing more joins to reduce the complexity?
> > >   2) How can we accomplish joining data2's data into the data1
> > > "objects"?
> > >
> > > --
> > > Ho Duc Ha
> > >
> >
>
>
>
> --
> Ho Duc Ha
>