Pig >> mail # user >> Generating multiple tuples from single tuple


+
naresh 2012-07-02, 02:34
+
Subir S 2012-07-02, 05:28
+
Jonathan Coveney 2012-07-02, 16:34
+
naresh 2012-07-02, 18:04
+
Jonathan Coveney 2012-07-02, 20:19
+
naresh 2012-07-02, 21:34
Re: Generating multiple tuples from single tuple
If you have a known, fixed number of columns, you can do this in
not-too-hacky (hopefully) Pig:

grunt> dump in;
****file:/homes/abhinavn/pigtest/tuplify.data
****file:/tmp/temp-2067511203/tmp-1924449354
(10,C1:V1,C2:V2)
grunt> bagged = foreach in generate $0,
TOBAG(STRSPLIT((chararray)$1,':',2), STRSPLIT((chararray)$2,':',2));
grunt> dump bagged;
****file:/homes/abhinavn/pigtest/tuplify.data
****file:/tmp/temp-2067511203/tmp588738197
(10,{(C1,V1),(C2,V2)})
grunt> flat = foreach bagged generate $0, FLATTEN($1);
grunt> dump flat;
****file:/homes/abhinavn/pigtest/tuplify.data
****file:/tmp/temp-2067511203/tmp-1881239619
(10,C1,V1)
(10,C2,V2)
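
For an unknown number of columns, the same split-and-flatten idea can be sketched in plain Java before wiring it into a UDF. This is only a rough sketch, not the UDF itself: the class name Tuplify and the method expand are made up for illustration, it uses no Pig classes, and it assumes the comma and colon delimiters seen in the sample row above.

```java
import java.util.ArrayList;
import java.util.List;

public class Tuplify {
    // Expands one record such as "10,C1:V1,C2:V2" into rows of {id, column, value}.
    // The "," and ":" delimiters are assumptions taken from the sample data.
    static List<String[]> expand(String line) {
        String[] fields = line.split(",");
        String id = fields[0];
        List<String[]> rows = new ArrayList<>();
        for (int i = 1; i < fields.length; i++) {
            // A limit of 2 mirrors STRSPLIT(..., ':', 2): split on the first colon only.
            String[] colAndValue = fields[i].split(":", 2);
            if (colAndValue.length < 2) {
                continue; // skip malformed fields that have no colon
            }
            rows.add(new String[] {id, colAndValue[0], colAndValue[1]});
        }
        return rows;
    }

    public static void main(String[] args) {
        for (String[] row : expand("10,C1:V1,C2:V2")) {
            System.out.println("(" + String.join(",", row) + ")");
        }
    }
}
```

Run on the sample row, this prints (10,C1,V1) and (10,C2,V2), the same rows as dump flat. In a real UDF each String[] would instead become a Tuple added to a DataBag.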
On 3 July 2012 03:04, naresh <[EMAIL PROTECTED]> wrote:

> @Jonathan Coveney:
>
> Thanks a lot for the detailed explanation. I got the point now.
>
> Thanks for your time,
> Naresh.
>
> On Mon, Jul 2, 2012 at 1:19 PM, Jonathan Coveney <[EMAIL PROTECTED]>
> wrote:
>
> > IMHO, if you want this to be more generic, I would have it just take the
> > full line, and then parse it out. Why? Because what happens when you have
> > an indeterminate number of columns? That's my own personal opinion though.
> > As far as implementation, I would return a DataBag (because what you want
> > are many rows, and Bags = rows).
> >
> > you want these two things to make the Tuples and output bag:
> >
> > private static final TupleFactory mTupleFactory = TupleFactory.getInstance();
> > private static final BagFactory mBagFactory = BagFactory.getInstance();
> >
> > Their use is described in the Pig api, but essentially, you'll have
> > something like this (this is off the cuff and needs some love, but is the
> > general idea)...
> >
> > DataBag output = mBagFactory.newDefaultBag();
> > String[] vals = ((String)input.get(0)).split("\\|"); //split takes a regex, so the pipe must be escaped
> > List<Object> protoTuple = new ArrayList<Object>(3);
> > protoTuple.add(vals[0]); //the first will be the ID
> > protoTuple.add(null);
> > protoTuple.add(null);
> > for (int i = 1; i < vals.length; i++) {
> >     String[] colAndValue = vals[i].split(":");
> >     protoTuple.set(1, colAndValue[0]); //the column name
> >     protoTuple.set(2, colAndValue[1]); //the value
> >     //newTuple(List) copies the List by default, which is what we want
> >     output.add(mTupleFactory.newTuple(protoTuple));
> > }
> > return output;
> >
> > the output will always have ID, then col and val. You want to flatten the
> > output of this UDF.
> >
> > 2012/7/2 naresh <[EMAIL PROTECTED]>
> >
> > > Thanks for the suggestions.
> > >
> > > @Jonathan Coveney:
> > >
> > > input tuple :  (id1,column1,column2)
> > > output : two tuples (id1,column1) and (id1,column2), so is it
> > > List<Tuple>, or should I return a Bag?
> > >
> > > public class SPLITTUPPLE extends EvalFunc <List<Tuple>>
> > > {
> > >     public List<Tuple> exec(Tuple input) throws IOException {
> > >         if (input == null || input.size() == 0)
> > >             return null;
> > >         try{
> > >             // Not sure whether I can create tuples on my own.
> > >             // Looks like I should use TupleFactory.
> > >             // return list of tuples.
> > >         }catch(Exception e){
> > >             throw WrappedIOException.wrap("Caught exception processing input row ", e);
> > >         }
> > >     }
> > > }
> > >
> > > Can you point me to some example?
> > >
> > > Thanks for your time,
> > > Naresh.
> > >
> > > On Mon, Jul 2, 2012 at 9:34 AM, Jonathan Coveney <[EMAIL PROTECTED]>
> > > wrote:
> > >
> > > > You can probably hack together something that will do exactly this
> > > > without writing a UDF, but I think a UDF will be most useful here,
> > > > especially if you want to add more columns, etc.
> > > >
> > > > 2012/7/1 Subir S <[EMAIL PROTECTED]>
> > > >
> > > > > Would FLATTEN help?
> > > > >
> > > > > B = GROUP A by ID;
> > > > >
> > > > > C = FOREACH B GENERATE group, FLATTEN ($1);
> > > > >
> > > > > Might work, I guess. Not tested.
> > > > >
> > > > > On Mon, Jul 2, 2012 at 8:04 AM, naresh <[EMAIL PROTECTED]>
+
naresh 2012-07-05, 18:28