Re: UDF that takes bag as input and returns another bag
By extending an abstract class, you can reuse the generic validation of
Pig's input Tuple and get a consistent hook for your DataBag parsing
logic.  Consider the following abstract class ParseBagAsBag, which your own
class (say, MyDatabagParserToDataBag) can extend by overriding
parser_logic() and adding its results to the inherited output bag, this.bag:

import java.io.IOException;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.BagFactory;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;

public abstract class ParseBagAsBag extends EvalFunc<DataBag> {

    public TupleFactory tuple_factory = TupleFactory.getInstance();
    public BagFactory bag_factory = BagFactory.getInstance();
    public DataBag bag;

    /**
     * Wrapper that deconstructs the input Tuple to extract its DataBag
     * component.
     * @param input Tuple containing a DataBag.
     * @return DataBag filled by parser_logic(), or null iff that bag is empty.
     * @throws IOException
     */
    @Override
    public DataBag exec(Tuple input) throws IOException {
        // Start every call with a fresh, empty output bag from the factory.
        this.bag = this.bag_factory.newDefaultBag();
        // Precondition: the input tuple exists and is non-empty.
        if (input != null && input.size() > 0) {
            // The DataBag arrives wrapped in a one-element Tuple.
            Object oBag = input.get(0);
            // Precondition: the first field really is a Pig DataBag.
            if (oBag instanceof DataBag) {
                parser_logic((DataBag) oBag);
            }
        }
        // Return the bag only if parser_logic() added to it; otherwise null.
        return (this.bag.size() > 0) ? this.bag : null;
    }

    /** Hook for subclasses: read databag and add output tuples to this.bag. */
    public abstract void parser_logic(DataBag databag) throws IOException;
}
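
If you want to try the control flow without Pig on the classpath, here is a
minimal sketch of the same pattern in plain Java. It is only an illustration
under stated assumptions: List<Object> stands in for a Pig Tuple,
List<List<Object>> stands in for a DataBag, and the names BagToBagSketch,
DropNullRows and BagToBagDemo are hypothetical, not part of Pig. The base
class validates the input and calls the hook; the subclass only decides which
rows go into the output bag.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Stand-in for the abstract UDF base class (assumption: no Pig types here).
// List<Object> plays the role of a Tuple, List<List<Object>> that of a DataBag.
abstract class BagToBagSketch {

    // Output bag; re-created on every call so rows never leak between calls.
    protected List<List<Object>> bag;

    @SuppressWarnings("unchecked")
    public final List<List<Object>> exec(List<Object> input) {
        bag = new ArrayList<>();
        // Preconditions: input exists, is non-empty, and field 0 is a "bag".
        if (input != null && !input.isEmpty() && input.get(0) instanceof List) {
            parserLogic((List<List<Object>>) input.get(0));
        }
        // Mirror the original contract: null when nothing was added.
        return bag.isEmpty() ? null : bag;
    }

    // Hook for subclasses: read inputBag and add output rows to bag.
    protected abstract void parserLogic(List<List<Object>> inputBag);
}

// Hypothetical subclass: keeps only rows whose first field is non-null.
class DropNullRows extends BagToBagSketch {
    @Override
    protected void parserLogic(List<List<Object>> inputBag) {
        for (List<Object> row : inputBag) {
            if (row != null && !row.isEmpty() && row.get(0) != null) {
                bag.add(row);
            }
        }
    }
}

class BagToBagDemo {
    public static void main(String[] args) {
        List<List<Object>> rows = new ArrayList<>();
        rows.add(Arrays.<Object>asList("a", 1));
        rows.add(Arrays.<Object>asList(null, 2));
        List<Object> input = new ArrayList<>();
        input.add(rows); // the "bag" wrapped in a one-element "tuple"
        System.out.println(new DropNullRows().exec(input)); // prints [[a, 1]]
    }
}
```

In a real Pig UDF the subclass would extend ParseBagAsBag instead, iterate
over the DataBag's Tuples, and add the ones it wants to this.bag.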

Hope this helps.

-Dan

On Mon, Mar 18, 2013 at 11:01 AM, Jonathan Coveney <[EMAIL PROTECTED]> wrote:

> Ah, I suppose I was just proving it could be done.
>
> To make a new one, you'd do:
>
> public class MyUdf extends EvalFunc<DataBag> {
>   private static final BagFactory mBagFactory = BagFactory.getInstance();
>   public DataBag exec(Tuple input) throws IOException {
>     DataBag output = mBagFactory.newDefaultBag();
>     for (Tuple t : (DataBag)input.get(0)) {
>       output.add(t);
>     }
>     return output;
>   }
> }
>
>
>
>
> 2013/3/18 Kris Coward <[EMAIL PROTECTED]>
>
> >
> > But he asked for a function that returns *another* bag ;)
> >
> > Snark aside, when returning bags or tuples, it's also worthwhile to at
> > least consider defining the output schema, which for your example
> > code would probably mean
> >
> > public Schema outputSchema(Schema input){
> >   Schema output = new Schema();
> >   output.add(input.getField(0));
> >   return output;
> > }
> >
> > (possibly with some omitted exception handling)
> >
> > -Kris
> >
> > On Mon, Mar 18, 2013 at 11:19:17AM +0100, Jonathan Coveney wrote:
> > > Absolutely.
> > >
> > > public class MyUdf extends EvalFunc<DataBag> {
> > >   public DataBag exec(Tuple input) throws IOException {
> > >     return (DataBag)input.get(0);
> > >   }
> > > }
> > >
> > >
> > > A dummy example, but there you go. DataBag is a valid Pig type like any
> > > other, so you just return it like you would normally.
> > >
> > >
> > > 2013/3/18 pranjal rajput <[EMAIL PROTECTED]>
> > >
> > > > Hi,
> > > > Can we define a UDF in Pig that takes a bag as an input and returns
> > > > another bag as output?
> > > > How can this be done?
> > > > Thanks,
> > > > --
> > > > regards
> > > > Pranjal
> > > >
> >
> > --
> > Kris Coward                                     http://unripe.melon.org/
> > GPG Fingerprint: 2BF3 957D 310A FEEC 4733  830E 21A4 05C7 1FEB 12B3
> >
>