-
UDF with nested bag in tuples
Zehua Liu 2009-04-03, 09:34
Hi,
I am trying to create a UDF that returns a tuple with schema (id: int, words: { (word) } ). This is similar to the TOKENIZE built-in UDF, which returns { (word) }, but with an additional id to indicate where the tokenized words come from. For example, when tokenizing documents that have a doc id, I want to pair the tokenized words with the doc id.
I adapted the code from TOKENIZE.java to get the following (the complete java file is attached):
    public Tuple exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0) {
            return null;
        }
        try {
            Integer id = (Integer)input.get(0);
            String text = (String)input.get(1);

            DataBag sentenceBag = _bagFactory.newDefaultBag();
            StringTokenizer tok = new StringTokenizer(text, " \",()*", false);
            while (tok.hasMoreTokens()) {
                String token = tok.nextToken();
                sentenceBag.add(_tupleFactory.newTuple(token));
            }
            Tuple output = _tupleFactory.newTuple();
            output.append(id);
            output.append(sentenceBag);

            return output;
        } catch(Exception e) {
            throw WrappedIOException.wrap("Caught exception processing input row ", e);
        }
    }
    public Schema outputSchema(Schema input) {
        try {
            Schema.FieldSchema tokenFs = new Schema.FieldSchema("token", DataType.CHARARRAY);
            Schema tupleSchema = new Schema(tokenFs);

            Schema.FieldSchema tupleFs;
            tupleFs = new Schema.FieldSchema("tuple_of_tokens", tupleSchema, DataType.TUPLE);

            Schema bagSchema = new Schema(tupleFs);
            bagSchema.setTwoLevelAccessRequired(true);
            Schema.FieldSchema bagFs = new Schema.FieldSchema(
                    "bag_of_tokenTuples", bagSchema, DataType.BAG);

            Schema schema = new Schema();
            schema.add(new Schema.FieldSchema("id", DataType.INTEGER));
            schema.add(bagFs);

            return schema;
        } catch (Exception e) {
            return null;
        }
    }
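For readers without a Pig build at hand, the tokenizing logic of exec() above can be tried in isolation. The following is a minimal plain-Java sketch, not Pig code: the class name TokenizeWithIdSketch and the use of Map.Entry and List as stand-ins for Pig's Tuple and DataBag are assumptions made for illustration. Only the StringTokenizer call and its delimiter string come from the post.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;

// Hypothetical stand-alone sketch: Map.Entry stands in for the (id, bag) tuple
// and List<String> stands in for the bag of single-field tuples.
class TokenizeWithIdSketch {

    // Same delimiter set as in the exec() method from the post.
    private static final String DELIMITERS = " \",()*";

    static Map.Entry<Integer, List<String>> tokenize(Integer id, String text) {
        if (text == null) {
            // Guard added for this sketch; the original only checks the input tuple.
            return null;
        }
        List<String> words = new ArrayList<>();
        StringTokenizer tok = new StringTokenizer(text, DELIMITERS, false);
        while (tok.hasMoreTokens()) {
            words.add(tok.nextToken());
        }
        // Pair the document id with all of its tokens: (id, {words}).
        return new AbstractMap.SimpleEntry<>(id, words);
    }

    public static void main(String[] args) {
        System.out.println(tokenize(7, "hello, world (pig)"));
    }
}
```

The sketch keeps the id and the word list side by side in one value, which is the (id, {(word)}) shape the UDF is meant to return.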
The input is a file with two columns: id, text
I ran the following pig programs in grunt:

    REGISTER ./testpig.jar
    DEFINE TESTBAG testpig.TESTBAG();
    docs = LOAD '/home/testpig/docs.tsv' USING PigStorage('\t') AS (id: int, text: chararray);
    testbag = FOREACH docs GENERATE TESTBAG(id, text);
    dump testbag
    words = FOREACH testbag GENERATE $0.id,$0.bag_of_tokenTuples;
    dump words
There are two issues with this:
1. "dump words" failed with the message "ERROR 0: org.apache.pig.data.DefaultTuple cannot be cast to org.apache.pig.data.DataBag". How can I get it to work?
2. The schema of testbag is "testbag: {(id: int,bag_of_tokenTuples: {tuple_of_tokens: (token: chararray)})}", while I was expecting "testbag: {id: int,bag_of_tokenTuples: {tuple_of_tokens: (token: chararray)}}", which is what you would get if it came from a group by. This forces me to use $0 in the words statement.
I am using pig from the latest svn trunk, rev 760771.
Any help is appreciated.
Thanks,
Zehua
-
Re: UDF with nested bag in tuples
zhang jianfeng 2009-04-03, 10:42
You should explicitly declare the schema in the statement:

    FOREACH docs GENERATE TESTBAG(id, text) AS t:tuple(id: int, b:bag { t1:tuple(w:chararray)} )

Otherwise Pig cannot infer the schema for you. I've encountered this problem before.
-
Re: UDF with nested bag in tuples
Zehua Liu 2009-04-03, 22:53
Thanks for the quick response.
I tried what you suggested, but it still doesn't work. I removed the outputSchema() method and used a trailing "AS" clause instead. "describe testbag" shows the same schema. "dump testbag" still works, and the dumped data looks the same as with outputSchema() (and no "AS" clause). But at the "dump words" step, it still reported the same error.
Based on my limited knowledge of the Pig source code, I only managed to trace it to the point where I think it's related to how the first level of tuple is handled by the POProject operator. By first level, I mean:

    {
      (   -- first level of tuple
        id: int, words: bag { t1 (w:chararray) }
      )   -- first level of tuple
    }

If I were to use a group by on a schema of { id: int, w: chararray }, the resulting schema would be like:

    {   -- no additional level of tuple
      group: int, docs: bag { id: int, w:chararray }
    }
words = FOREACH testbag GENERATE $0.id,$0.words;
The physical plan for handling "$0.words" is

    Project[bag][1]
    |
    |---Project[tuple][0]
The processing was first passed to getNext(DataBag) when handling Project[bag][1], which is correct. But when going down to handle Project[tuple][0], it is still handed to getNext(DataBag) without checking that the type now is "tuple" not bag. Below is the code fragment that I am referring to, in POProject.processInputBag():
    if(!isInputAttached()) {
        return inputs.get(0).getNext(dummyBag);
    }else{
        res.result = (DataBag)input.get(columns.get(0));
        res.returnStatus = POStatus.STATUS_OK;
        return res;
    }
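An unchecked cast like the one in the fragment above is the kind of code that produces a ClassCastException such as the one reported. As a minimal sketch in plain Java, with FakeTuple and FakeBag as invented stand-ins for Pig's Tuple and DataBag (they are not Pig classes, and this does not reproduce POProject's actual logic), casting a column that holds a tuple to a bag fails at run time:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; none of these types exist in Pig.
class CastFailureSketch {

    static class FakeTuple {
        final List<Object> fields;
        FakeTuple(Object... fields) { this.fields = Arrays.asList(fields); }
        Object get(int i) { return fields.get(i); }
    }

    static class FakeBag {
        final List<FakeTuple> tuples = new ArrayList<>();
    }

    // The UDF result: (id, bag).
    static FakeTuple udfResult() {
        return new FakeTuple(1, new FakeBag());
    }

    // The relation row: the UDF result wrapped in one more tuple, ((id, bag)).
    static FakeTuple wrappedRow() {
        return new FakeTuple(udfResult());
    }

    // Casts the chosen column to a bag without checking its type first.
    static FakeBag projectAsBag(FakeTuple input, int column) {
        return (FakeBag) input.get(column);
    }

    static boolean castFails(FakeTuple input, int column) {
        try {
            projectAsBag(input, column);
            return false;
        } catch (ClassCastException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        // Column 0 of the wrapped row is a tuple, so the bag cast fails.
        System.out.println(castFails(wrappedRow(), 0));
        // Column 1 of the inner tuple really is a bag, so the cast succeeds.
        System.out.println(castFails(udfResult(), 1));
    }
}
```

The cast only succeeds once the projection is applied to the inner tuple, which matches the observation that the extra level of tuple is where things go wrong.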
The other thing that's confusing to me is that POProject.getNext(DataBag) seems to be projecting columns of tuples within a bag, rather than projecting a column of bag type from a tuple. This makes me feel that the physical plan for "$0.words" should really be

    Project[tuple][1]
    |
    |---Project[tuple][0]
i.e., the type of words might not matter much, since it is to be copied over as it is.
Sorry if I am going the wrong way and misleading you guys.

Zehua
-
Re: UDF with nested bag in tuples
Mridul Muralidharan 2009-04-03, 23:47
Hi,
If the script you posted is not a template for something else (that is, you just want this script to work, and it is not an illustration of some other problem), you can try two things:

Potential solution 1:
Directly use TOKENIZE itself.

    docs = LOAD '/home/testpig/docs.tsv' USING PigStorage('\t') AS (id: int, text: chararray);
    testbag = FOREACH docs GENERATE id, FLATTEN(TOKENIZE(text)) as bag_of_tokenTuples;
    dump testbag
    words = FOREACH testbag GENERATE id, bag_of_tokenTuples;
    dump words
Potential solution 2:
Using your UDF: Pig wraps the output of the UDF within a tuple, so you might want to use FLATTEN to remove this level of wrapping.

    REGISTER ./testpig.jar
    DEFINE TESTBAG testpig.TESTBAG();
    docs = LOAD '/home/testpig/docs.tsv' USING PigStorage('\t') AS (id: int, text: chararray);
    testbag = FOREACH docs GENERATE FLATTEN(TESTBAG(id, text)) AS (id:int, bag_of_tokenTuples:{t:(word:chararray)});
    dump testbag
    words = FOREACH testbag GENERATE id, bag_of_tokenTuples;
    dump words

I am just typing these as I go, so there might be some errors above :-)
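The unwrapping that FLATTEN performs on a tuple field can be pictured with a small plain-Java model. This is only a sketch: rows and tuples are modelled as List<Object>, and the class and method names (FlattenSketch, flattenTupleField) are invented for the illustration and are not part of Pig.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical model: a tuple is a List<Object>; a bag is a List of words here.
class FlattenSketch {

    // The UDF result: (id, {words}).
    static List<Object> udfResult() {
        return Arrays.<Object>asList(1, Arrays.asList("hello", "world"));
    }

    // The row as the relation holds it before flattening: ((id, {words})).
    static List<Object> wrappedRow() {
        return Arrays.<Object>asList(udfResult());
    }

    // Replaces the nested tuple at position index with its own fields,
    // leaving every other field of the row untouched.
    static List<Object> flattenTupleField(List<Object> row, int index) {
        List<Object> out = new ArrayList<>();
        for (int i = 0; i < row.size(); i++) {
            Object field = row.get(i);
            if (i == index && field instanceof List) {
                out.addAll((List<?>) field);
            } else {
                out.add(field);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(flattenTupleField(wrappedRow(), 0));
    }
}
```

After the call, id and the bag sit at the top level of the row, so they can be referenced by name rather than through $0.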
Regards, Mridul
-
Re: UDF with nested bag in tuples
Zehua Liu 2009-04-04, 00:11
Yeah, I am trying to illustrate what I think is a problem. My actual use case is not really about tokenize, just that TOKENIZE seems to be the only example that I could find that has a complicated outputSchema() function.
My current workaround is to do something similar to TOKENIZE: extend EvalFunc<DataBag> instead of EvalFunc<Tuple> and output a bag instead of one tuple at a time, i.e. ({ (id, word) }) instead of (id, {word}).
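The difference between the two output shapes can be sketched in plain Java. This is an illustration under assumptions, not the actual workaround code: PairBagSketch and tokenizeToPairs are invented names, a List of Map.Entry stands in for a bag of (id, word) tuples, and the delimiter string is borrowed from the exec() method earlier in the thread.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;

// Hypothetical sketch of the { (id, word) } shape; no Pig classes involved.
class PairBagSketch {

    static List<Map.Entry<Integer, String>> tokenizeToPairs(Integer id, String text) {
        List<Map.Entry<Integer, String>> bag = new ArrayList<>();
        StringTokenizer tok = new StringTokenizer(text, " \",()*", false);
        while (tok.hasMoreTokens()) {
            // Repeat the id next to every word instead of storing it once.
            bag.add(new AbstractMap.SimpleEntry<>(id, tok.nextToken()));
        }
        return bag;
    }

    public static void main(String[] args) {
        System.out.println(tokenizeToPairs(7, "hello, world"));
    }
}
```

The id is duplicated per word, but the result is a single bag, so no tuple-with-nested-bag schema is needed.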
But your solution 2 works! Thanks.

Zehua
-
RE: UDF with nested bag in tuples
Santhosh Srinivasan 2009-04-04, 00:27
Zehua,

I modified your outputSchema method (added two lines and removed one line). This should work. Let me know if it does not.

Thanks,
Santhosh

    public Schema outputSchema(Schema input) {
        try {
            Schema.FieldSchema tokenFs = new Schema.FieldSchema("token", DataType.CHARARRAY);
            Schema tupleSchema = new Schema(tokenFs);

            Schema.FieldSchema tupleFs;
            tupleFs = new Schema.FieldSchema("tuple_of_tokens", tupleSchema, DataType.TUPLE);

            Schema bagSchema = new Schema(tupleFs);
            bagSchema.setTwoLevelAccessRequired(true);
            Schema.FieldSchema bagFs = new Schema.FieldSchema(
                    "bag_of_tokenTuples", bagSchema, DataType.BAG);

            Schema schema = new Schema();
            schema.add(new Schema.FieldSchema("id", DataType.INTEGER));
            schema.add(bagFs);

            //Added the following two lines and removed return schema
            //(named outerTupleFs here because tupleFs is already declared above)
            Schema.FieldSchema outerTupleFs = new Schema.FieldSchema("testbag", schema, DataType.TUPLE);
            return new Schema(outerTupleFs);
        } catch (Exception e) {
            return null;
        }
    }
-
Re: UDF with nested bag in tuples
Zehua Liu 2009-04-04, 00:42
This does not work either. The schema of testbag looks the same using "describe testbag", and it still failed with the same cast error when I tried to dump words.