Pig >> mail # user >> UDF that takes bag as input and returns another bag


Re: UDF that takes bag as input and returns another bag
By extending an abstract class, you can reuse the generic validation of
Pig's input Tuple and get a consistent hook for your DataBag parsing
logic.  Consider the following abstract class ParseBagAsBag, which your own
MyDatabagParserToDataBag can extend by overriding parser_logic() and
writing its results to the inherited output bag (super.bag):

import java.io.IOException;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.BagFactory;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;

public abstract class ParseBagAsBag extends EvalFunc<DataBag> {

    public TupleFactory tuple_factory = TupleFactory.getInstance();
    public BagFactory bag_factory = BagFactory.getInstance();
    public Tuple tuple;    // scratch tuple for subclasses, recreated on every call
    public DataBag bag;    // output bag, recreated on every call

    /**
     * Wrapper that deconstructs the input Tuple to extract its DataBag component.
     * @param input Tuple containing a DataBag.
     * @return DataBag filled by the parser logic, or null if that bag is empty.
     * @throws IOException
     */
    @Override
    public DataBag exec(Tuple input) throws IOException {
        // Start every call with a fresh scratch tuple and an empty output bag.
        this.tuple = this.tuple_factory.newTuple();
        this.bag = this.bag_factory.newDefaultBag();
        // Precondition: the input tuple exists and is non-empty.
        if (input != null && input.size() > 0) {
            // The DataBag arrives wrapped in a one-element Tuple.
            Object oBag = input.get(0);
            // Precondition: the field really is a DataBag (false for null).
            if (oBag instanceof DataBag) {
                parser_logic((DataBag) oBag);
            }
        }
        // Return the bag only if parser_logic() added to it; otherwise null.
        return (this.bag.size() > 0) ? this.bag : null;
    }

    public abstract void parser_logic(DataBag databag) throws IOException;
}
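
To show the shape of a subclass, here is a rough sketch of the same
validate-then-delegate pattern. It is not Pig code: plain java.util lists
stand in for Tuple and DataBag so that it compiles without Pig on the
classpath, and the names BagToBagSketch, parserLogic and DropNullFirstField
are made up for illustration. In a real UDF you would extend ParseBagAsBag
and override parser_logic() instead.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the abstract base class: it validates the input
// once, then hands the extracted bag to a subclass hook.
abstract class BagToBagSketch {
    // Output "bag" that the hook writes to; recreated on every call.
    protected List<List<Object>> bag;

    // Plays the role of exec(): field 0 of the input "tuple" should be a bag.
    public final List<List<Object>> exec(List<Object> input) {
        bag = new ArrayList<>();
        if (input != null && !input.isEmpty() && input.get(0) instanceof List) {
            @SuppressWarnings("unchecked")
            List<List<Object>> inputBag = (List<List<Object>>) input.get(0);
            parserLogic(inputBag);
        }
        // Same convention as above: null when nothing was added.
        return bag.isEmpty() ? null : bag;
    }

    // Plays the role of parser_logic().
    protected abstract void parserLogic(List<List<Object>> inputBag);
}

// Example subclass: keeps only the rows whose first field is non-null.
class DropNullFirstField extends BagToBagSketch {
    @Override
    protected void parserLogic(List<List<Object>> inputBag) {
        for (List<Object> row : inputBag) {
            if (row != null && !row.isEmpty() && row.get(0) != null) {
                bag.add(row);
            }
        }
    }
}
```

Given a bag holding the rows ("a", 1) and (null, 2), DropNullFirstField
returns a bag with only the first row; given a null or empty input it
returns null. The subclass never repeats the input checks, which is the
point of putting them in the base class.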

Hope this helps.

-Dan

On Mon, Mar 18, 2013 at 11:01 AM, Jonathan Coveney <[EMAIL PROTECTED]> wrote:

> Ah, I suppose I was just proving it could be done.
>
> To make a new one, you'd do:
>
> public class MyUdf extends EvalFunc<DataBag> {
>   private static final BagFactory mBagFactory = BagFactory.getInstance();
>   public DataBag exec(Tuple input) throws IOException {
>     DataBag output = mBagFactory.newDefaultBag();
>     for (Tuple t : (DataBag)input.get(0)) {
>       output.add(t);
>     }
>     return output;
>   }
> }
>
>
>
>
> 2013/3/18 Kris Coward <[EMAIL PROTECTED]>
>
> >
> > But he asked for a function that returns *another* bag ;)
> >
> > Snark aside, when returning bags or tuples, it's also worthwhile to at
> > least consider defining the output schema, which for your example
> > code would probably mean
> >
> > public Schema outputSchema(Schema input){
> >   Schema output = new Schema();
> >   output.add(input.getField(0));
> >   return output;
> > }
> >
> > (possibly with some omitted exception handling)
> >
> > -Kris
> >
> > On Mon, Mar 18, 2013 at 11:19:17AM +0100, Jonathan Coveney wrote:
> > > Absolutely.
> > >
> > > public class MyUdf extends EvalFunc<DataBag> {
> > >   public DataBag exec(Tuple input) throws IOException {
> > >     return (DataBag)input.get(0);
> > >   }
> > > }
> > >
> > >
> > > A dummy example, but there you go. DataBag is a valid Pig type like any
> > > other, so you just return it like you would normally.
> > >
> > >
> > > 2013/3/18 pranjal rajput <[EMAIL PROTECTED]>
> > >
> > > > Hi,
> > > > > Can we define a UDF in pig that takes a bag as an input and returns
> > > > > another bag as output?
> > > > How can this be done?
> > > > Thanks,
> > > > --
> > > > regards
> > > > Pranjal
> > > >
> >
> > --
> > Kris Coward                                     http://unripe.melon.org/
> > GPG Fingerprint: 2BF3 957D 310A FEEC 4733  830E 21A4 05C7 1FEB 12B3
> >
>