Re: conditional and multiple generate inside foreach?
I see three independent questions:

  1. How can we pass the entire row tuple to a UDF, as in 'B = FOREACH A GENERATE
myudf(A)', without knowing the schema? I don't know if that is possible. It does
feel like it should be possible.

  2. How can I return an augmented tuple? Your UDF can make a copy of the
input tuple, append whatever fields you like, and return it. Maybe your
question is not this simple.

  3. How can I make a UDF produce multiple rows for one input row, as in your
example?
       - Your UDF needs to return a bag of row tuples. For (b,1) it would
return {(b,1,yesterday), (b,1,week), ... }
       - Your Pig script would then flatten the output of the UDF:
         B = foreach A generate FLATTEN( myUDF(name, days_ago) );
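
A minimal sketch of point 3 as a Jython UDF, in the style of the 'udfs.py' file
Scott mentioned. The function name daysAgoPeriods and the schema field names are
hypothetical. It assumes Pig supplies the outputSchema decorator when the file is
registered with "using jython"; the fallback definition only exists so the file
also runs outside Pig. The bucketing follows the expected output in the original
question, where (a,0) maps to 'today' only.

```python
# Hypothetical contents of udfs.py; names here are illustrative, not from the thread.
try:
    outputSchema  # assumed to be injected by Pig's Jython script engine
except NameError:
    # Fallback so the file can be imported and tested outside Pig.
    def outputSchema(schema):
        def wrap(func):
            func.outputSchema = schema
            return func
        return wrap


@outputSchema("periods:bag{t:tuple(name:chararray,days_ago:int,period:chararray)}")
def daysAgoPeriods(name, days_ago):
    """Return a list of tuples (a bag in Pig), one per matching period."""
    # Null or negative input: return an empty bag.
    if days_ago is None or days_ago < 0:
        return []
    # 'today' stands alone, matching the (a,0,today) row in the question.
    if days_ago == 0:
        return [(name, days_ago, 'today')]
    rows = []
    if days_ago == 1:
        rows.append((name, days_ago, 'yesterday'))
    if days_ago <= 7:
        rows.append((name, days_ago, 'week'))
    if days_ago <= 30:
        rows.append((name, days_ago, 'month'))
    if days_ago <= 90:
        rows.append((name, days_ago, 'quarter'))
    return rows
```

With the file registered as in Scott's example (register 'udfs.py' using jython
as udf;), the script line would become something like
B = FOREACH A GENERATE FLATTEN(udf.daysAgoPeriods(name, days_ago));
Rows for which the bag comes back empty (here, days_ago above 90) should drop
out of B after the FLATTEN.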

Raghu.

On Fri, Jul 22, 2011 at 6:10 PM, Dexin Wang <[EMAIL PROTECTED]> wrote:

> Thanks. I'm not familiar with Python, but I write a bunch of UDFs in Java.
>
> One question though: how do I pass the entire tuple to the UDF? I mean, I
> can't do something like this:
>
>    B = FOREACH A GENERATE myudf(A)
>
> Essentially, what I want is, given a tuple, to enrich it by adding one more
> field whose value depends on the values of some existing fields in the tuple.
>
> (a,1) -> (a,1,yesterday)
>
> how would I do that?
>
> I imagine I can do
>   B = GROUP A BY random;
>   C = FOREACH B GENERATE myudf(A);
>
> But I really don't like adding another GROUP BY here.
>
> On Fri, Jul 22, 2011 at 5:23 PM, Scott Foster <[EMAIL PROTECTED]> wrote:
>
> > Hi Dexin,
> > This is the sort of thing I've started using Python UDFs for. See:
> > http://wiki.apache.org/pig/UDFsUsingScriptingLanguages for examples of
> > how to write the python code.
> >
> > If your udf was implemented in Python you could then do this...
> >
> > register 'udfs.py' using jython as udf;
> > ...
> > B = FOREACH A generate name, udf.daysAgoString(days_ago);
> >
> > scott.
> >
> > On Fri, Jul 22, 2011 at 4:42 PM, Dexin Wang <[EMAIL PROTECTED]> wrote:
> > > Is it possible to do a conditional and more than one generate inside a foreach?
> > >
> > > For example, I have tuples like this (name, days_ago):
> > >
> > > (a,0)
> > > (b,1)
> > > (c,9)
> > > (d,40)
> > >
> > > b shows up 1 day ago, so it belongs to all of the following: yesterday,
> > > last week, last month, and last quarter. So I'd like to turn the above into:
> > >
> > > (a,0,today)
> > > (b,1,yesterday)
> > > (b,1,week)
> > > (b,1,month)
> > > (b,1,quarter)
> > > (c,9,month)
> > > (c,9,quarter)
> > > (d,40,quarter)
> > >
> > > I imagine/dream I could do something like this
> > >
> > > B = FOREACH A
> > >  {
> > >        if (days_ago <= 90) generate name,days_ago,'quarter';
> > >        if (days_ago <= 30) generate name,days_ago,'month';
> > >        if (days_ago <= 7)   generate name,days_ago,'week';
> > >        if (days_ago == 1)   generate name,days_ago,'yesterday';
> > >        if (days_ago == 0)   generate name,days_ago,'today';
> > >  }
> > >
> > > Of course that's not valid syntax. I could write my own UDF, but it would
> > > be nice if there were some way to get what I want without a UDF.
> > >
> > > Thanks!
> > > Dexin
> > >
> >
>