Pig >> mail # user >> UDF with nested bag in tuples


RE: UDF with nested bag in tuples
Zehua,
 
I modified your outputSchema method (added two lines and removed one
line). This should work. Let me know if it does not.
 
Thanks,
Santhosh
 
public Schema outputSchema(Schema input) {
        try {
            Schema.FieldSchema tokenFs = new Schema.FieldSchema("token",
                    DataType.CHARARRAY);
            Schema tupleSchema = new Schema(tokenFs);

            Schema.FieldSchema tupleFs = new Schema.FieldSchema(
                    "tuple_of_tokens", tupleSchema, DataType.TUPLE);

            Schema bagSchema = new Schema(tupleFs);
            bagSchema.setTwoLevelAccessRequired(true);
            Schema.FieldSchema bagFs = new Schema.FieldSchema(
                    "bag_of_tokenTuples", bagSchema, DataType.BAG);

            Schema schema = new Schema();
            schema.add(new Schema.FieldSchema("id", DataType.INTEGER));
            schema.add(bagFs);

            // Added the following lines and removed "return schema":
            // wrap the whole result in a tuple-typed field so Pig treats
            // the UDF's output as a single named tuple.
            Schema.FieldSchema wrapperFs = new Schema.FieldSchema("testbag",
                    schema, DataType.TUPLE);
            return new Schema(wrapperFs);

        } catch (Exception e) {
            return null;
        }
    }
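
Untested follow-up sketch: if the wrapping above works as intended, the generated field becomes a single tuple-typed column, so FLATTEN (a Pig built-in) should unnest it and expose id and bag_of_tokenTuples as top-level fields. The exact aliases DESCRIBE reports are an assumption on my part:

```
testbag = FOREACH docs GENERATE TESTBAG(id, text);
words = FOREACH testbag GENERATE FLATTEN($0);
DESCRIBE words;
```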
________________________________

From: Zehua Liu [mailto:[EMAIL PROTECTED]]
Sent: Friday, April 03, 2009 2:34 AM
To: [EMAIL PROTECTED]
Subject: UDF with nested bag in tuples
Hi,

I am trying to create a UDF that returns a tuple with schema (id: int,
words: { (word) }). This is a bit similar to the built-in TOKENIZE
UDF, which returns { (word) }, but with an additional id to indicate
where the tokenized words came from. Imagine tokenizing documents that
have a doc id: I want to pair the tokenized words with the doc id.

I adapted the code from TOKENIZE.java to get the following (the complete
Java file is attached):

    public Tuple exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0) {
            return null;
        }
        try {
            Integer id = (Integer) input.get(0);
            String text = (String) input.get(1);

            DataBag sentenceBag = _bagFactory.newDefaultBag();
            StringTokenizer tok = new StringTokenizer(text, " \",()*", false);
            while (tok.hasMoreTokens()) {
                String token = tok.nextToken();
                sentenceBag.add(_tupleFactory.newTuple(token));
            }
            Tuple output = _tupleFactory.newTuple();
            output.append(id);
            output.append(sentenceBag);

            return output;
        } catch (Exception e) {
            throw WrappedIOException.wrap("Caught exception processing input row ", e);
        }
    }
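
The tokenization step above can be exercised on its own, outside Pig. This is a standalone sketch (the class and method names are mine, not part of the attached UDF) that uses the same StringTokenizer delimiter set " \",()*" and collects tokens into a plain List in place of the DataBag:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

// Standalone sketch of the exec() tokenization step; the List stands in
// for the DataBag of single-token tuples.
public class TokenizeDemo {
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<String>();
        // Same delimiters as the UDF: space, double quote, comma, parens, '*'.
        StringTokenizer tok = new StringTokenizer(text, " \",()*", false);
        while (tok.hasMoreTokens()) {
            tokens.add(tok.nextToken());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Delimiter characters are consumed and never returned as tokens.
        System.out.println(tokenize("hello, \"nested (bag)\" world"));
        // -> [hello, nested, bag, world]
    }
}
```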

    public Schema outputSchema(Schema input) {
        try {
            Schema.FieldSchema tokenFs = new Schema.FieldSchema("token",
                    DataType.CHARARRAY);
            Schema tupleSchema = new Schema(tokenFs);

            Schema.FieldSchema tupleFs = new Schema.FieldSchema(
                    "tuple_of_tokens", tupleSchema, DataType.TUPLE);

            Schema bagSchema = new Schema(tupleFs);
            bagSchema.setTwoLevelAccessRequired(true);
            Schema.FieldSchema bagFs = new Schema.FieldSchema(
                    "bag_of_tokenTuples", bagSchema, DataType.BAG);

            Schema schema = new Schema();
            schema.add(new Schema.FieldSchema("id", DataType.INTEGER));
            schema.add(bagFs);

            return schema;
        } catch (Exception e) {
            return null;
        }
    }

The input is a file with two columns: id and text.
I ran the following Pig statements in grunt:
REGISTER ./testpig.jar;
DEFINE TESTBAG testpig.TESTBAG();
docs = LOAD '/home/testpig/docs.tsv' USING PigStorage('\t') AS (id: int, text: chararray);
testbag = FOREACH docs GENERATE TESTBAG(id, text);
dump testbag;
words = FOREACH testbag GENERATE $0.id, $0.bag_of_tokenTuples;
dump words;

There are two issues with this:
1. "dump words" failed with the message "ERROR 0:
org.apache.pig.data.DefaultTuple cannot be cast to
org.apache.pig.data.DataBag". How do I get it to work?
2. The schema of testbag is "testbag: {(id: int,bag_of_tokenTuples:
{tuple_of_tokens: (token: chararray)})}", while I was expecting
"testbag: {id: int,bag_of_tokenTuples: {tuple_of_tokens: (token:
chararray)}}", which is what you would get from a GROUP BY. This
forces me to use $0 in the words statement.

I am using Pig from the latest svn trunk, rev 760771.

Any help is appreciated.

Thanks,

Zehua