-
Generating multiple tuples from single tuple
naresh 2012-07-02, 02:34
Hi,
I am new to Pig scripting. I would like to generate multiple tuples from a single tuple. What I mean is:
I have a file with the following data in it.
>> cat data
ID | ColumnName1:Value1 | ColumnName2:Value2
So I load it with the following commands:
grunt >> A = load '$data' using PigStorage('|');
grunt >> dump A;
(ID,ColumnName1:Value1,ColumnName2:Value2)
Now I want to split this tuple into two tuples.
(ID, ColumnName1, Value1)
(ID, ColumnName2, Value2)
Can I use a UDF along with foreach and generate? Something like the following?
grunt >> foreach A generate SOMEUDF(A)
Thanks for your time, Naresh.
-
Re: Generating multiple tuples from single tuple
Subir S 2012-07-02, 05:28
Would FLATTEN help?
B = GROUP A by ID;
C = FOREACH B GENERATE group, FLATTEN ($1);
Might work, I guess. Not tested.
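For readers who want to reason about this suggestion before running it, here is a rough plain-Java model of GROUP followed by FLATTEN. It is only a sketch: lists stand in for Pig tuples and bags, and the class and method names (GroupFlattenSketch, groupByFirst, flattenGroups) are made up for illustration and are not Pig API. Under that model, the grouped rows come back with the group key in front and the "Name:Value" fields still unsplit, so some per-field splitting step would still be needed.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model, not Pig code: lists stand in for tuples and bags.
class GroupFlattenSketch {

    // Models "B = GROUP A by ID": key -> bag of the original rows.
    static Map<String, List<List<String>>> groupByFirst(List<List<String>> rows) {
        Map<String, List<List<String>>> groups = new LinkedHashMap<>();
        for (List<String> row : rows) {
            groups.computeIfAbsent(row.get(0), k -> new ArrayList<>()).add(row);
        }
        return groups;
    }

    // Models "FOREACH B GENERATE group, FLATTEN($1)":
    // one output row per bag element, with the group key in front.
    static List<List<String>> flattenGroups(Map<String, List<List<String>>> groups) {
        List<List<String>> out = new ArrayList<>();
        for (Map.Entry<String, List<List<String>>> e : groups.entrySet()) {
            for (List<String> row : e.getValue()) {
                List<String> flat = new ArrayList<>();
                flat.add(e.getKey());
                flat.addAll(row); // fields are copied as-is, nothing is split on ':'
                out.add(flat);
            }
        }
        return out;
    }
}
```

Feeding it the single row (ID, ColumnName1:Value1, ColumnName2:Value2) gives back one row, not two.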
-
Re: Generating multiple tuples from single tuple
Jonathan Coveney 2012-07-02, 16:34
You can probably hack together something that will do exactly this without writing a UDF, but I think a UDF will be most useful here...especially if you want to add more columns, etc etc.
-
Re: Generating multiple tuples from single tuple
naresh 2012-07-02, 18:04
Thanks for the suggestions.
@Jonathan Coveney:
input tuple: (id1,column1,column2)
output: two tuples, (id1,column1) and (id1,column2)
So is it a List<Tuple>, or should I return a Bag?
public class SPLITTUPPLE extends EvalFunc<List<Tuple>> {
    public List<Tuple> exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0)
            return null;
        try {
            // Not sure whether I can create tuples on my own. Looks like I should use TupleFactory.
            // return list of tuples.
        } catch (Exception e) {
            throw WrappedIOException.wrap("Caught exception processing input row ", e);
        }
    }
}
Can you point me to some example?
Thanks for your time, Naresh.
-
Re: Generating multiple tuples from single tuple
Jonathan Coveney 2012-07-02, 20:19
IMHO, if you want this to be more generic, I would have it just take the full line, and then parse it out. Why? Because what happens when you have an indeterminate number of columns? That's my own personal opinion though. As far as implementation, I would return a DataBag (because what you want are many rows, and Bags = rows).
You want these two things to make the Tuples and the output bag:
private static final TupleFactory mTupleFactory = TupleFactory.getInstance();
private static final BagFactory mBagFactory = BagFactory.getInstance();
Their use is described in the Pig API docs, but essentially, you'll have something like this (this is off the cuff and needs some love, but is the general idea)...
DataBag output = mBagFactory.newDefaultBag();
// split() takes a regex, so the pipe has to be escaped
String[] vals = ((String) input.get(0)).split("\\|");
List<Object> protoTuple = new ArrayList<Object>(3);
protoTuple.add(vals[0].trim()); // the first will be the ID
protoTuple.add(null);
protoTuple.add(null);
for (int i = 1; i < vals.length; i++) {
    // trim() drops the spaces around each '|' in the sample data
    String[] colAndValue = vals[i].trim().split(":");
    protoTuple.set(1, colAndValue[0]); // the column name
    protoTuple.set(2, colAndValue[1]); // the value
    // the default of newTuple(List) is to copy the List over, which is what we want
    output.add(mTupleFactory.newTuple(protoTuple));
}
return output;
The output will always have the ID, then the column name and the value. You want to FLATTEN the output of this UDF.
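The parsing step can be tried out without a Pig installation. The following is a minimal sketch in plain Java, assuming the "ID | Name:Value | Name:Value" layout from the original question; lists stand in for Pig's Tuple and DataBag, and the names SplitLineSketch and splitLine are made up for illustration. Skipping fields that have no ':' is an assumption of this sketch, not something the thread specifies.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the UDF body: lists replace Pig tuples and the bag.
class SplitLineSketch {

    // Turns one input line into rows of (id, column name, value).
    static List<List<String>> splitLine(String line) {
        List<List<String>> rows = new ArrayList<>();
        if (line == null || line.trim().isEmpty()) {
            return rows;
        }
        // String.split takes a regex, and a bare "|" is the alternation
        // operator, so the pipe has to be escaped.
        String[] fields = line.split("\\|");
        if (fields.length == 0) {
            return rows; // the line held only separators
        }
        String id = fields[0].trim();
        for (int i = 1; i < fields.length; i++) {
            // A limit of 2 keeps any further ':' inside the value.
            String[] colAndValue = fields[i].trim().split(":", 2);
            if (colAndValue.length < 2) {
                continue; // assumption: skip fields without a ':'
            }
            List<String> row = new ArrayList<>(3);
            row.add(id);
            row.add(colAndValue[0]);
            row.add(colAndValue[1]);
            rows.add(row);
        }
        return rows;
    }
}
```

Because the loop runs over however many fields the line has, the same code handles two columns or twenty. In a real UDF each row would become a Tuple added to the output DataBag.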
-
Re: Generating multiple tuples from single tuple
naresh 2012-07-02, 21:34
@Jonathan Coveney:
Thanks a lot for the detailed explanation. I got the point now.
Thanks for your time, Naresh.
-
Re: Generating multiple tuples from single tuple
Abhinav Neelam 2012-07-04, 07:00
If you don't have an unknown number of columns, you can do this not-too-hacky (hopefully) pig:
grunt> dump in;
****file:/homes/abhinavn/pigtest/tuplify.data
****file:/tmp/temp-2067511203/tmp-1924449354
(10,C1:V1,C2:V2)
grunt> bagged = foreach in generate $0, TOBAG(STRSPLIT((chararray)$1,':',2), STRSPLIT((chararray)$2,':',2));
grunt> dump bagged;
****file:/homes/abhinavn/pigtest/tuplify.data
****file:/tmp/temp-2067511203/tmp588738197
(10,{(C1,V1),(C2,V2)})
grunt> flat = foreach bagged generate $0, FLATTEN($1);
grunt> dump flat;
****file:/homes/abhinavn/pigtest/tuplify.data
****file:/tmp/temp-2067511203/tmp-1881239619
(10,C1,V1)
(10,C2,V2)
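The three steps in that session (split each field in two, collect the pairs in a bag, flatten the bag against the ID) can be modelled in plain Java. This is a sketch only: lists stand in for Pig tuples and bags, the names FixedColumnsSketch, splitPair, flatten and tuplify are made up, and it assumes STRSPLIT's limit argument behaves like the limit of Java's String.split.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical model of the grunt session, not Pig code.
class FixedColumnsSketch {

    // Models STRSPLIT(field, ':', 2): at most two pieces, so a value
    // that itself contains ':' stays in one piece.
    static List<String> splitPair(String field) {
        return Arrays.asList(field.split(":", 2));
    }

    // Models "generate $0, FLATTEN($1)": one output row per bag element,
    // with the id repeated on each row.
    static List<List<String>> flatten(String id, List<List<String>> bag) {
        List<List<String>> rows = new ArrayList<>();
        for (List<String> tuple : bag) {
            List<String> row = new ArrayList<>();
            row.add(id);
            row.addAll(tuple);
            rows.add(row);
        }
        return rows;
    }

    // Models the whole pipeline for exactly two name:value fields.
    static List<List<String>> tuplify(String id, String first, String second) {
        // TOBAG: gather the two split pairs into one bag
        List<List<String>> bag = Arrays.asList(splitPair(first), splitPair(second));
        return flatten(id, bag);
    }
}
```

The method signature makes the trade-off visible: the number of fields is fixed in the code, just as the number of STRSPLIT calls is fixed in the Pig statement.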
-
Re: Generating multiple tuples from single tuple
naresh 2012-07-05, 18:28
@Abhinav: Thanks for the suggestion.