Pig >> mail # user >> Generating multiple tuples from single tuple


naresh 2012-07-02, 02:34
Subir S 2012-07-02, 05:28
Jonathan Coveney 2012-07-02, 16:34
naresh 2012-07-02, 18:04
Jonathan Coveney 2012-07-02, 20:19
naresh 2012-07-02, 21:34

Re: Generating multiple tuples from single tuple
If the number of columns is known in advance, you can do this in
not-too-hacky (hopefully) Pig:

grunt> dump in;
****file:/homes/abhinavn/pigtest/tuplify.data
****file:/tmp/temp-2067511203/tmp-1924449354
(10,C1:V1,C2:V2)
grunt> bagged = foreach in generate $0,
TOBAG(STRSPLIT((chararray)$1,':',2), STRSPLIT((chararray)$2,':',2));
grunt> dump bagged;
****file:/homes/abhinavn/pigtest/tuplify.data
****file:/tmp/temp-2067511203/tmp588738197
(10,{(C1,V1),(C2,V2)})
grunt> flat = foreach bagged generate $0, FLATTEN($1);
grunt> dump flat;
****file:/homes/abhinavn/pigtest/tuplify.data
****file:/tmp/temp-2067511203/tmp-1881239619
(10,C1,V1)
(10,C2,V2)
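
If the number of columns is not fixed, the same reshaping needs a loop, which is where the UDF discussed in the quoted messages below comes in. Here is a rough sketch of that loop in plain Java, with no Pig classes involved. `TupleExpander` and `expand` are made-up names, and the comma-delimited input is just the sample row above, so treat it as an illustration rather than a finished UDF:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class TupleExpander {
    // Hypothetical helper: turns "id,col1:val1,col2:val2,..." into one
    // (id, col, val) row per column, for any number of columns.
    static List<List<String>> expand(String line) {
        String[] fields = line.split(",");
        String id = fields[0];
        List<List<String>> rows = new ArrayList<>();
        for (int i = 1; i < fields.length; i++) {
            // Limit 2 keeps any further ':' inside the value,
            // like STRSPLIT(..., ':', 2) in the session above.
            String[] colAndValue = fields[i].split(":", 2);
            if (colAndValue.length < 2) {
                continue; // skip fields that have no ':' separator
            }
            rows.add(Arrays.asList(id, colAndValue[0], colAndValue[1]));
        }
        return rows;
    }

    public static void main(String[] args) {
        for (List<String> row : expand("10,C1:V1,C2:V2")) {
            System.out.println("(" + String.join(",", row) + ")");
        }
    }
}
```

In an actual Pig UDF each row would be a Tuple built by TupleFactory and added to a DataBag from BagFactory, as in the quoted message, and the script would FLATTEN the returned bag.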
On 3 July 2012 03:04, naresh <[EMAIL PROTECTED]> wrote:

> @Jonathan Coveney:
>
> Thanks a lot for detailed explanation. I got the point now.
>
> Thanks for your time,
> Naresh.
>
> On Mon, Jul 2, 2012 at 1:19 PM, Jonathan Coveney <[EMAIL PROTECTED]>
> wrote:
>
> > IMHO, if you want this to be more generic, I would have it just take the
> > full line, and then parse it out. Why? Because what happens when you have
> > an indeterminate number of columns? That's my own personal opinion though.
> > As far as implementation, I would return a DataBag (because what you want
> > are many rows, and Bags = rows).
> >
> > You want these two things to make the Tuples and output bag:
> >
> > private static final TupleFactory mTupleFactory =
> >     TupleFactory.getInstance();
> > private static final BagFactory mBagFactory = BagFactory.getInstance();
> >
> > Their use is described in the Pig API, but essentially, you'll have
> > something like this (this is off the cuff and needs some love, but is the
> > general idea)...
> >
> > DataBag output = mBagFactory.newDefaultBag();
> > // split() takes a regex, so the pipe has to be escaped
> > String[] vals = ((String)input.get(0)).split("\\|");
> > List<Object> protoTuple = new ArrayList<Object>(3);
> > protoTuple.add(vals[0]); //the first will be the ID
> > protoTuple.add(null);
> > protoTuple.add(null);
> > for (int i = 1; i < vals.length; i++) {
> >     String[] colAndValue = vals[i].split(":");
> >     protoTuple.set(1, colAndValue[0]); //the column name
> >     protoTuple.set(2, colAndValue[1]); //the value
> >     //the default of newTuple(List) is to copy the List over,
> >     //which is what we want
> >     output.add(mTupleFactory.newTuple(protoTuple));
> > }
> > return output;
> >
> > The output will always have ID, then col and val. You want to flatten the
> > output of this UDF.
> >
> > 2012/7/2 naresh <[EMAIL PROTECTED]>
> >
> > > Thanks for the suggestions.
> > >
> > > @Jonathan Coveney:
> > >
> > > input tuple :  (id1,column1,column2)
> > > output : two tuples (id1,column1) and (id1,column2), so it is
> > > List<Tuple>, or should I return a Bag?
> > >
> > > public class SPLITTUPPLE extends EvalFunc <List<Tuple>>
> > > {
> > >     public List<Tuple> exec(Tuple input) throws IOException {
> > >         if (input == null || input.size() == 0)
> > >             return null;
> > >         try{
> > >             // not sure whether I can create tuples on my own.
> > >             // Looks like I should use TupleFactory.
> > >             // return list of tuples.
> > >         }catch(Exception e){
> > >             throw WrappedIOException.wrap(
> > >                 "Caught exception processing input row ", e);
> > >         }
> > >     }
> > > }
> > >
> > > Can you point me to some example?
> > >
> > > Thanks for your time,
> > > Naresh.
> > >
> > > On Mon, Jul 2, 2012 at 9:34 AM, Jonathan Coveney <[EMAIL PROTECTED]>
> > > wrote:
> > >
> > > > You can probably hack together something that will do exactly this
> > > > without writing a UDF, but I think a UDF will be most useful here,
> > > > especially if you want to add more columns, etc.
> > > >
> > > > 2012/7/1 Subir S <[EMAIL PROTECTED]>
> > > >
> > > > > Would FLATTEN help?
> > > > >
> > > > > B = GROUP A by ID;
> > > > >
> > > > > C = FOREACH B GENERATE group, FLATTEN ($1);
> > > > >
> > > > > > Might work, I guess. Not tested.
> > > > >
> > > > > On Mon, Jul 2, 2012 at 8:04 AM, naresh <[EMAIL PROTECTED]>
naresh 2012-07-05, 18:28