Pig, mail # user - String Representation of DataBag and its Schema


Re: String Representation of DataBag and its Schema
Dan DeCapria, CivicScienc... 2013-03-19, 13:37
Thank you for your reply.

The problem is that I cannot find a way to go from the String
representation of a complex data type to a nested Object of Pig data types.
I looked over the Pig 0.10.1 docs, but could not find a way to go from a
String and a Schema to a Pig data type Object.

For context, I am generating these Strings for my own JUnit testing of
other UDFs.  Currently, for complex types, I have to generate each nesting
from Tuple and DataBag factories, append data, and nest them manually.  For
larger unit tests, this process becomes unwieldy (hundreds of lines per
method, non-dynamic), and it would be much simpler to go directly from a
String and a Schema to a DataBag Object for UDF testing (few lines of code,
easily modifiable).

-Dan

On Mon, Mar 18, 2013 at 6:31 PM, Jonathan Coveney <[EMAIL PROTECTED]> wrote:

> Why not just use PigStorage? This is essentially what it does. It saves a
> bag as text, and then loads it again.
>
> I suppose the question becomes: why do you need to do this?
>
>
> 2013/3/18 Dan DeCapria, CivicScience <[EMAIL PROTECTED]>
>
> > In Java, I am trying to convert a DataBag from its String representation
> > with its schema String to a valid DataBag Object:
> >
> > String databag_string = "{(apples,1024)}";
> > String schema_string = "b1:bag{t1:tuple(a:chararray,b:long)}";
> >
> > I've tried implementing something along the lines of this, but I believe
> > it's in the wrong direction, and then I get stuck:
> >
> >         String[] aliases = {"b1", "t1", "a", "b"};
> >         byte[] types = {DataType.BAG, DataType.TUPLE, DataType.CHARARRAY,
> > DataType.LONG};
> >         List<Schema.FieldSchema> fsList = new
> > ArrayList<Schema.FieldSchema>();
> >         for (int i = 0; i < aliases.length; i++) {
> >             fsList.add(new Schema.FieldSchema(aliases[i], types[i])) ;
> >         }
> >         Schema origSchema = new Schema(fsList);
> >         ResourceSchema rsSchema = new ResourceSchema(origSchema);
> >         Schema genSchema = Schema.getPigSchema(rsSchema);
> >         ResourceSchema.ResourceFieldSchema[] rfschema = rsSchema.getFields();
> >         ... lost here, maybe Utf8StorageConverter c = new
> > Utf8StorageConverter(); ???
> >
> >
> > An ideal process would be along the lines of:
> >
> > DataBag d = BagFactory.getInstance().newDefaultBag();
> > d.something(databag_string, schema_string);    // ??? no idea what this
> > process could be
> > d.toString().equals(databag_string) == true.
> >
> > Thanks, -Dan
> >
>

--
Dan DeCapria
CivicScience, Inc.
Senior Informatics / DM / ML / BI Specialist
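
Editor's note: the "String plus schema in, bag out" round trip asked for in this thread can be illustrated with a minimal, self-contained sketch. It does not use Pig at all: plain java.util Lists stand in for DataBag and Tuple, the class name BagTextParser and its methods are hypothetical, and the schema is reduced to a flat list of field type names (no nested bags or tuples, no escaping of delimiters). Whether Pig's own Utf8StorageConverter, mentioned in the quoted message, covers the real conversion is not verified here.

```java
import java.util.ArrayList;
import java.util.List;

/** Stand-in sketch: plain Lists play the role of Pig's DataBag and Tuple. */
class BagTextParser {
    private final String text;
    private int pos = 0;

    private BagTextParser(String text) { this.text = text; }

    /** Parses e.g. "{(apples,1024)}"; fieldTypes names the type of each tuple field. */
    static List<List<Object>> parseBag(String text, String... fieldTypes) {
        BagTextParser p = new BagTextParser(text.trim());
        List<List<Object>> bag = p.bag(fieldTypes);
        if (p.pos != p.text.length()) {
            throw new IllegalArgumentException("trailing input at " + p.pos);
        }
        return bag;
    }

    /** Renders the nested lists back into the "{(a,b),(c,d)}" text form. */
    static String format(List<List<Object>> bag) {
        StringBuilder sb = new StringBuilder("{");
        for (int i = 0; i < bag.size(); i++) {
            if (i > 0) sb.append(',');
            sb.append('(');
            List<Object> tuple = bag.get(i);
            for (int j = 0; j < tuple.size(); j++) {
                if (j > 0) sb.append(',');
                sb.append(tuple.get(j));
            }
            sb.append(')');
        }
        return sb.append('}').toString();
    }

    // bag := '{' [ tuple { ',' tuple } ] '}'
    private List<List<Object>> bag(String[] fieldTypes) {
        expect('{');
        List<List<Object>> bag = new ArrayList<>();
        if (accept('}')) return bag;               // empty bag
        do { bag.add(tuple(fieldTypes)); } while (accept(','));
        expect('}');
        return bag;
    }

    // tuple := '(' scalar { ',' scalar } ')', one scalar per declared field
    private List<Object> tuple(String[] fieldTypes) {
        expect('(');
        List<Object> tuple = new ArrayList<>();
        for (int i = 0; i < fieldTypes.length; i++) {
            if (i > 0) expect(',');
            tuple.add(convert(scalar(), fieldTypes[i]));
        }
        expect(')');
        return tuple;
    }

    // Reads characters up to the next delimiter.
    private String scalar() {
        int start = pos;
        while (pos < text.length() && ",(){}".indexOf(text.charAt(pos)) < 0) pos++;
        return text.substring(start, pos);
    }

    // Maps a type name to a Java value; only a few scalar types are handled.
    private static Object convert(String raw, String type) {
        switch (type) {
            case "chararray": return raw;
            case "int":       return Integer.valueOf(raw);
            case "long":      return Long.valueOf(raw);
            case "double":    return Double.valueOf(raw);
            default: throw new IllegalArgumentException("unsupported type: " + type);
        }
    }

    private boolean accept(char c) {
        if (pos < text.length() && text.charAt(pos) == c) { pos++; return true; }
        return false;
    }

    private void expect(char c) {
        if (!accept(c)) throw new IllegalArgumentException("expected '" + c + "' at " + pos);
    }
}
```

In a JUnit test this would be called as BagTextParser.parseBag("{(apples,1024)}", "chararray", "long"), and BagTextParser.format(...) on the result gives the original String back, which mirrors the d.toString().equals(databag_string) check from the first message. Extending the sketch to nested types would mean passing a recursive schema instead of a flat list of type names.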