Re: Generating multiple tuples from single tuple
IMHO, if you want this to be more generic, I would have it just take the
full line and then parse it out. Why? Because what happens when you have
an indeterminate number of columns? That's my own personal opinion, though.
As far as implementation goes, I would return a DataBag (because what you want
are many rows, and Bags = rows).

You want these two things to make the Tuples and the output bag:

private static final TupleFactory mTupleFactory = TupleFactory.getInstance();
private static final BagFactory mBagFactory = BagFactory.getInstance();

Their use is described in the Pig API, but essentially you'll have
something like this (this is off the cuff and needs some love, but it's the
general idea)...

DataBag output = mBagFactory.newDefaultBag();
// split the full line on '|' (escaped, since split() takes a regex)
String[] vals = ((String) input.get(0)).split("\\|");
List<Object> protoTuple = new ArrayList<Object>(3);
protoTuple.add(vals[0]); // the first field is the ID
protoTuple.add(null);
protoTuple.add(null);
for (int i = 1; i < vals.length; i++) {
    String[] colAndValue = vals[i].split(":");
    protoTuple.set(1, colAndValue[0]); // the column name
    protoTuple.set(2, colAndValue[1]); // the value
    // newTuple(List) copies the List over, which is what we want
    output.add(mTupleFactory.newTuple(protoTuple));
}
return output;

The output tuples will always have the ID, then the column name and value. You
want to flatten the output of this UDF.
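
Putting that together, a rough sketch of the whole thing might look like the
following. Untested and off the cuff; the class name SplitIntoRows and the
assumption that the loader hands you the whole line as a single chararray
field are mine, so adjust to taste.

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.BagFactory;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;

// hypothetical UDF: takes the whole "ID|col1:val1|col2:val2" line as one
// chararray and returns a bag of (id, columnName, value) tuples
public class SplitIntoRows extends EvalFunc<DataBag> {

    private static final TupleFactory mTupleFactory = TupleFactory.getInstance();
    private static final BagFactory mBagFactory = BagFactory.getInstance();

    @Override
    public DataBag exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0)
            return null;
        DataBag output = mBagFactory.newDefaultBag();
        String[] vals = ((String) input.get(0)).split("\\|");
        List<Object> protoTuple = new ArrayList<Object>(3);
        protoTuple.add(vals[0].trim()); // the ID
        protoTuple.add(null);
        protoTuple.add(null);
        for (int i = 1; i < vals.length; i++) {
            String[] colAndValue = vals[i].split(":");
            protoTuple.set(1, colAndValue[0].trim()); // the column name
            protoTuple.set(2, colAndValue[1].trim()); // the value
            output.add(mTupleFactory.newTuple(protoTuple));
        }
        return output;
    }
}

Then in the script you'd load each line whole (assuming the data has no tabs,
the default PigStorage delimiter leaves the raw line in one field) and flatten
the bag, something along these lines:

-- after REGISTERing the jar the UDF lives in:
A = load '$data' as (line:chararray);
B = foreach A generate FLATTEN(SplitIntoRows(line)) as (id, col, val);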

2012/7/2 naresh <[EMAIL PROTECTED]>

> Thanks for the suggestions.
>
> @Jonathan Coveney:
>
> input tuple :  (id1,column1,column2)
> output : two tuples, (id1,column1) and (id2,column2). So is it List<Tuple>,
> or should I return a Bag?
>
> public class SPLITTUPPLE extends EvalFunc <List<Tuple>>
> {
>     public List<Tuple> exec(Tuple input) throws IOException {
>         if (input == null || input.size() == 0)
>             return null;
>         try{
>             // not sure whether I can create tuples on my own.
>             // Looks like I should use TupleFactory.
>             // return list of tuples.
>         }catch(Exception e){
>             throw WrappedIOException.wrap("Caught exception processing input row ", e);
>         }
>     }
> }
>
> Can you point me to some example?
>
> Thanks for your time,
> Naresh.
>
> On Mon, Jul 2, 2012 at 9:34 AM, Jonathan Coveney <[EMAIL PROTECTED]>
> wrote:
>
> > You can probably hack together something that will do exactly this
> > without writing a UDF, but I think a UDF will be most useful here...
> > especially if you want to add more columns, etc etc.
> >
> > 2012/7/1 Subir S <[EMAIL PROTECTED]>
> >
> > > Would FLATTEN help?
> > >
> > > B = GROUP A by ID;
> > >
> > > C = FOREACH B GENERATE group, FLATTEN ($1);
> > >
> > > Might work, I guess. Not tested.
> > >
> > > On Mon, Jul 2, 2012 at 8:04 AM, naresh <[EMAIL PROTECTED]> wrote:
> > >
> > > > Hi,
> > > >
> > > >         I am new to Pig scripting. I would like to generate multiple
> > > > tuples from a single tuple. What I mean is:
> > > >
> > > > I have file with following data in it.
> > > >
> > > > >> cat data
> > > >
> > > > ID | ColumnName1:Value1 | ColumnName2:Value2
> > > >
> > > > so I load it by the following command
> > > >
> > > > grunt >> A = load '$data' using PigStorage('|');
> > > >
> > > > grunt >> dump A;
> > > >
> > > > (ID,ColumnName1:Value1,ColumnName2:Value2)
> > > >
> > > > Now I want to split this tuple into two tuples.
> > > >
> > > > (ID, ColumnName1, Value1)
> > > > (ID, ColumnName2, Value2)
> > > >
> > > > Can I use a UDF along with foreach and generate? Something like the
> > > > following?
> > > >
> > > > grunt >> foreach A generate SOMEUDF(A)
> > > >
> > > > Thanks for your time,
> > > > Naresh.
> > > >
> > >
> >
>