Pig, mail # user - UDF with nested bag in tuples


RE: UDF with nested bag in tuples
Santhosh Srinivasan 2009-04-04, 00:27
Zehua,
 
I modified your outputSchema method (added two lines and removed one
line). This should work. Let me know if it does not.
 
Thanks,
Santhosh
 
public Schema outputSchema(Schema input) {
    try {
        Schema.FieldSchema tokenFs = new Schema.FieldSchema("token",
                DataType.CHARARRAY);
        Schema tupleSchema = new Schema(tokenFs);

        Schema.FieldSchema tupleFs = new Schema.FieldSchema("tuple_of_tokens",
                tupleSchema, DataType.TUPLE);

        Schema bagSchema = new Schema(tupleFs);
        bagSchema.setTwoLevelAccessRequired(true);
        Schema.FieldSchema bagFs = new Schema.FieldSchema(
                "bag_of_tokenTuples", bagSchema, DataType.BAG);

        Schema schema = new Schema();
        schema.add(new Schema.FieldSchema("id", DataType.INTEGER));
        schema.add(bagFs);

        // Added the following two lines and removed "return schema".
        // The variable is named outerTupleFs because tupleFs is already
        // declared above and cannot be declared a second time.
        Schema.FieldSchema outerTupleFs = new Schema.FieldSchema("testbag",
                schema, DataType.TUPLE);
        return new Schema(outerTupleFs);
    } catch (Exception e) {
        return null;
    }
}
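
To make the effect of the two added lines easier to see, here is a toy model in plain Java. It does not use Pig at all: the Field class, its type labels and the rendering format are hypothetical stand-ins invented for this illustration, and the printed strings are not claimed to match what Pig's describe prints. It only shows that the original method returned a schema with two top-level fields, while the modified method returns a schema with a single tuple field that contains those two fields.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.stream.Collectors;

public class SchemaNestingSketch {
    // Toy field: NOT Pig's Schema.FieldSchema. It only carries a name,
    // a type label and, for tuples and bags, the nested fields.
    public static final class Field {
        final String name;
        final String type;
        final List<Field> children;

        public Field(String name, String type, Field... children) {
            this.name = name;
            this.type = type;
            this.children = Arrays.asList(children);
        }

        String render() {
            String inner = SchemaNestingSketch.render(children);
            if (type.equals("tuple")) return name + ": (" + inner + ")";
            if (type.equals("bag")) return name + ": {" + inner + "}";
            return name + ": " + type;
        }
    }

    // Renders a list of top-level fields, comma separated.
    public static String render(List<Field> fields) {
        return fields.stream().map(Field::render).collect(Collectors.joining(","));
    }

    // Shape returned by the original outputSchema: two top-level fields.
    public static List<Field> flatSchema() {
        Field token = new Field("token", "chararray");
        Field tuple = new Field("tuple_of_tokens", "tuple", token);
        Field bag = new Field("bag_of_tokenTuples", "bag", tuple);
        return Arrays.asList(new Field("id", "int"), bag);
    }

    // Shape returned by the modified outputSchema: one tuple field
    // named "testbag" that wraps the two fields above.
    public static List<Field> wrappedSchema() {
        Field outer = new Field("testbag", "tuple",
                flatSchema().toArray(new Field[0]));
        return Collections.singletonList(outer);
    }

    public static void main(String[] args) {
        System.out.println(render(flatSchema()));
        System.out.println(render(wrappedSchema()));
    }
}
```

In this toy rendering, the first line lists id and bag_of_tokenTuples side by side, and the second line shows them enclosed in one testbag tuple, which matches the fact that exec returns a single Tuple.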
________________________________

From: Zehua Liu [mailto:[EMAIL PROTECTED]]
Sent: Friday, April 03, 2009 2:34 AM
To: [EMAIL PROTECTED]
Subject: UDF with nested bag in tuples
Hi,

I am trying to create a UDF that returns a tuple with the schema (id: int,
words: { (word) } ). This is similar to the TOKENIZE built-in UDF, which
returns { (word) }, but with an additional id to indicate where the
tokenized words come from. For example, when tokenizing documents that
have a doc id, I want to pair the tokenized words with the doc id.

I adapted the code from TOKENIZE.java to get the following (the complete
Java file is attached):

    public Tuple exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0) {
            return null;
        }
        try {
            Integer id = (Integer) input.get(0);
            String text = (String) input.get(1);

            DataBag sentenceBag = _bagFactory.newDefaultBag();
            StringTokenizer tok = new StringTokenizer(text, " \",()*", false);
            while (tok.hasMoreTokens()) {
                String token = tok.nextToken();
                sentenceBag.add(_tupleFactory.newTuple(token));
            }
            Tuple output = _tupleFactory.newTuple();
            output.append(id);
            output.append(sentenceBag);

            return output;
        } catch (Exception e) {
            throw WrappedIOException.wrap(
                    "Caught exception processing input row ", e);
        }
    }
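
The tokenization step of the exec method above can be tried outside Pig. The following is a minimal sketch in plain Java, assuming a List<String> as a stand-in for the bag of one-field tuples; the class name TokenizeSketch, the sample id and the sample text are hypothetical. It uses the same delimiter string as exec (space, double quote, comma, parentheses and asterisk).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

public class TokenizeSketch {
    // Same delimiter set as the exec method: space, double quote,
    // comma, parentheses and asterisk.
    public static final String DELIMS = " \",()*";

    // Plain-Java stand-in for filling the bag: one list entry per token.
    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        // "false" means the delimiters themselves are not returned as tokens.
        StringTokenizer tok = new StringTokenizer(text, DELIMS, false);
        while (tok.hasMoreTokens()) {
            tokens.add(tok.nextToken());
        }
        return tokens;
    }

    public static void main(String[] args) {
        int id = 7;                           // hypothetical document id
        String text = "hello, (big) *world*"; // hypothetical document text
        // Pair the id with its tokens, mirroring (id, {(word)}).
        System.out.println("(" + id + "," + tokenize(text) + ")");
        // prints (7,[hello, big, world])
    }
}
```

Runs of adjacent delimiters produce no empty tokens, so a text made up only of delimiters yields an empty list, which corresponds to an empty bag in the UDF.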

    public Schema outputSchema(Schema input) {
        try {
            Schema.FieldSchema tokenFs = new Schema.FieldSchema("token",
                    DataType.CHARARRAY);
            Schema tupleSchema = new Schema(tokenFs);

            Schema.FieldSchema tupleFs;
            tupleFs = new Schema.FieldSchema("tuple_of_tokens",
                    tupleSchema, DataType.TUPLE);

            Schema bagSchema = new Schema(tupleFs);
            bagSchema.setTwoLevelAccessRequired(true);
            Schema.FieldSchema bagFs = new Schema.FieldSchema(
                    "bag_of_tokenTuples", bagSchema, DataType.BAG);

            Schema schema = new Schema();
            schema.add(new Schema.FieldSchema("id", DataType.INTEGER));
            schema.add(bagFs);

            return schema;
        } catch (Exception e) {
            return null;
        }
    }

The input is a file with two columns: id, text.
I ran the following Pig statements in grunt:
REGISTER ./testpig.jar
DEFINE TESTBAG testpig.TESTBAG();
docs = LOAD '/home/testpig/docs.tsv' USING PigStorage('\t') AS (id: int, text: chararray);
testbag = FOREACH docs GENERATE TESTBAG(id, text);
dump testbag
words = FOREACH testbag GENERATE $0.id,$0.bag_of_tokenTuples;
dump words

There are two issues with this:
1. "dump words" failed with the message "ERROR 0:
org.apache.pig.data.DefaultTuple cannot be cast to
org.apache.pig.data.DataBag". How can I get this to work?
2. The schema of testbag is "testbag: {(id: int,bag_of_tokenTuples:
{tuple_of_tokens: (token: chararray)})}", while I was expecting
"testbag: {id: int,bag_of_tokenTuples: {tuple_of_tokens: (token:
chararray)}}", which is what you would get if it came from a group by.
This forces me to use $0 in the words statement.

I am using pig from the latest svn trunk, rev 760771.

Any help is appreciated.

Thanks,

Zehua