|
Dan DeCapria, CivicScienc...
2013-03-18, 20:18
Jonathan Coveney
2013-03-18, 22:31
Dan DeCapria, CivicScienc...
2013-03-19, 13:37
Dan DeCapria, CivicScienc...
2013-03-19, 15:16
Jonathan Coveney
2013-03-19, 15:27
Dan DeCapria, CivicScienc...
2013-03-19, 15:37
Jonathan Coveney
2013-03-19, 15:43
Dan DeCapria, CivicScienc...
2013-03-19, 15:43
Dan DeCapria, CivicScienc...
2013-03-19, 15:52
Jonathan Coveney
2013-03-19, 16:08
Dan DeCapria, CivicScienc...
2013-03-19, 16:40
Jonathan Coveney
2013-03-19, 16:53
Jonathan Coveney
2013-03-19, 16:54
Dan DeCapria, CivicScienc...
2013-03-19, 17:20
William Oberman
2013-03-21, 15:51
|
-
String Representation of DataBag and its Schema
Dan DeCapria, CivicScienc... 2013-03-18, 20:18

In Java, I am trying to convert a DataBag from its String representation, together with its schema String, into a valid DataBag Object:

    String databag_string = "{(apples,1024)}";
    String schema_string = "b1:bag{t1:tuple(a:chararray,b:long)}";

I've tried implementing something along the lines of this, but I believe it's in the wrong direction, and then I get stuck:

    String[] aliases = {"b1", "t1", "a", "b"};
    byte[] types = {DataType.BAG, DataType.TUPLE, DataType.CHARARRAY, DataType.LONG};
    List<Schema.FieldSchema> fsList = new ArrayList<Schema.FieldSchema>();
    for (int i = 0; i < aliases.length; i++) {
        fsList.add(new Schema.FieldSchema(aliases[i], types[i]));
    }
    Schema origSchema = new Schema(fsList);
    ResourceSchema rsSchema = new ResourceSchema(origSchema);
    Schema genSchema = Schema.getPigSchema(rsSchema);
    ResourceSchema.ResourceFieldSchema[] rfschema = rsSchema.getFields();
    // ... lost here, maybe Utf8StorageConverter c = new Utf8StorageConverter(); ???

An ideal process would be along the lines of:

    DataBag d = BagFactory.getInstance().newDefaultBag();
    d.something(databag_string, schema_string); // ??? no idea what this process could be
    d.toString().equals(databag_string) == true

Thanks,
-Dan
-
Re: String Representation of DataBag and its Schema
Jonathan Coveney 2013-03-18, 22:31

Why not just use PigStorage? This is essentially what it does. It saves a bag as text, and then loads it again.

I suppose the question becomes: why do you need to do this?
-
Re: String Representation of DataBag and its Schema
Dan DeCapria, CivicScienc... 2013-03-19, 13:37

Thank you for your reply.

The problem is that I cannot find a way to go from a String representation of a complex data type to a nested Object of Pig DataTypes. I looked over the Pig 0.10.1 docs, but cannot find a way to go from a String and a Schema to a Pig DataType Object.

For context, I am generating these Strings for my own JUnit testing of other UDFs. Currently, for complex types, I have to generate each nesting from Tuple and DataBag factories, append data, and nest them manually. For larger unit tests, this process becomes unwieldy (hundreds of lines per method, non-dynamic), and it would be much simpler to go directly from a String and a Schema to a DataBag Object for UDF testing (few lines of code, easily modifiable).

-Dan

--
Dan DeCapria
CivicScience, Inc.
Senior Informatics / DM / ML / BI Specialist
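To make the requested String-to-nested-Object step concrete, here is a minimal sketch of a recursive-descent parser for the bag/tuple text shown in this thread. It is an illustration only: `NestedTextParser` and `Group` are hypothetical names, the result uses plain `java.util` lists as stand-ins for Pig's DataBag and Tuple, no schema is applied (every leaf stays a String), and whitespace, quoting and escaping are not handled.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a bag ('{') or a tuple ('('); not a Pig class. */
final class Group {
    final char open;
    final List<Object> items = new ArrayList<>();

    Group(char open) { this.open = open; }

    /** Renders the group back into the same text form it was parsed from. */
    @Override
    public String toString() {
        StringBuilder sb = new StringBuilder().append(open);
        for (int i = 0; i < items.size(); i++) {
            if (i > 0) sb.append(',');
            sb.append(items.get(i));
        }
        return sb.append(open == '{' ? '}' : ')').toString();
    }
}

/** Hypothetical parser for text such as "{(apples,(banana,1024),2048)}". */
final class NestedTextParser {
    private final String src;
    private int pos = 0;

    private NestedTextParser(String src) { this.src = src; }

    /** Parses a bag, a tuple, or a bare scalar; rejects trailing input. */
    static Object parse(String text) {
        NestedTextParser p = new NestedTextParser(text);
        Object value = p.parseValue();
        if (p.pos != text.length()) {
            throw new IllegalArgumentException("Trailing input at index " + p.pos);
        }
        return value;
    }

    private Object parseValue() {
        if (pos < src.length() && src.charAt(pos) == '{') return parseGroup('{', '}');
        if (pos < src.length() && src.charAt(pos) == '(') return parseGroup('(', ')');
        int start = pos;
        // A scalar runs until the next delimiter character.
        while (pos < src.length() && ",(){}".indexOf(src.charAt(pos)) < 0) pos++;
        return src.substring(start, pos);
    }

    private Group parseGroup(char open, char close) {
        Group group = new Group(open);
        pos++; // consume the opening delimiter
        if (pos < src.length() && src.charAt(pos) == close) {
            pos++; // empty group
            return group;
        }
        while (true) {
            group.items.add(parseValue());
            if (pos >= src.length()) {
                throw new IllegalArgumentException("Unterminated group, expected '" + close + "'");
            }
            char c = src.charAt(pos++);
            if (c == close) return group;
            if (c != ',') {
                throw new IllegalArgumentException("Unexpected '" + c + "' at index " + (pos - 1));
            }
        }
    }
}
```

Calling `NestedTextParser.parse(text).toString()` on well-formed input gives back the input text, which is the round-trip property asked for in the first message. Mapping the parsed groups onto real Pig objects would still require the Tuple and DataBag factories plus a schema-driven conversion of the leaves.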
-
Re: String Representation of DataBag and its Schema
Dan DeCapria, CivicScienc... 2013-03-19, 15:16

Expanding upon this, the following use case's Schema Object can be resolved from inputs:

    String string_databag = "{(a,(b,d),f)}";
    String string_schema = "b1:bag{t1:tuple(a:chararray,t2:tuple(b:chararray,d:long),f:long)}";
    Schema schema = Utils.getSchemaFromString(string_schema);

Next step is to resolve a DataBag Object from String string_databag and the Schema Object.

-Dan
-
Re: String Representation of DataBag and its Schema
Jonathan Coveney 2013-03-19, 15:27

How was string_databag generated?
-
Re: String Representation of DataBag and its Schema
Dan DeCapria, CivicScienc... 2013-03-19, 15:37

String string_databag in this example was typed out by me, as the input String for a JUnit test method. I am considering generating many of these for case-specific unit testing of my UDFs.

-Dan
-
Re: String Representation of DataBag and its Schema
Jonathan Coveney 2013-03-19, 15:43

How are you planning on generating these cases? By hand? Or automated?
-
Re: String Representation of DataBag and its Schema
Dan DeCapria, CivicScienc... 2013-03-19, 15:43

Such that this input String matches the Schema:

    String string_databag = "{(apples,(banana,1024),2048)}";
    String string_schema = "b1:bag{t1:tuple(a:chararray,t2:tuple(b:chararray,d:long),f:long)}";
    Schema schema = Utils.getSchemaFromString(string_schema);
    LogicalSchema logical_schema = Utils.parseSchema(string_schema);
    ResourceSchema rschema = new ResourceSchema(schema);

-Dan
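Why the earlier `"{(a,(b,d),f)}"` string does not fit this schema while the corrected one does can be shown with a rough leaf-by-leaf check. This is only a sketch under strong assumptions: `LeafTypeCheck` is a hypothetical helper, not part of Pig; it recognises only the `chararray` and `long` types used in this thread; and it flattens the nesting, so it only makes sense for a bag holding a single tuple with no empty fields.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical flat type check; ignores nesting and supports two types only. */
final class LeafTypeCheck {
    private static final Pattern LEAF_TYPE = Pattern.compile(":(chararray|long)\\b");

    /** Collects the scalar types of a schema string in left-to-right order. */
    static List<String> leafTypes(String schema) {
        List<String> types = new ArrayList<>();
        Matcher m = LEAF_TYPE.matcher(schema);
        while (m.find()) types.add(m.group(1));
        return types;
    }

    /** Collects the scalar values of a data string in left-to-right order. */
    static List<String> leafValues(String data) {
        List<String> values = new ArrayList<>();
        for (String token : data.split("[{}(),]+")) {
            if (!token.isEmpty()) values.add(token);
        }
        return values;
    }

    /** True when the counts agree and every 'long' leaf parses as a long. */
    static boolean matches(String data, String schema) {
        List<String> types = leafTypes(schema);
        List<String> values = leafValues(data);
        if (types.size() != values.size()) return false;
        for (int i = 0; i < types.size(); i++) {
            if (types.get(i).equals("long")) {
                try {
                    Long.parseLong(values.get(i));
                } catch (NumberFormatException e) {
                    return false;
                }
            }
        }
        return true;
    }
}
```

With the schema from this message, the corrected string passes because `1024` and `2048` parse as longs, whereas the earlier string fails on `d` and `f`.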
-
Re: String Representation of DataBag and its Schema
Dan DeCapria, CivicScienc... 2013-03-19, 15:52

By hand; creating a new JUnit method to test a specific use case against a functional requirement in the UDF.

The UDFs I am testing are part of a larger ETL testing initiative I have been undertaking. To ensure that the various states of legacy data are correctly extracted and transformed into a Pig context, I am creating specific JUnit tests for each UDF, containing specific use cases as testing methods.

The motivation for using String inputs for the Data Objects and Schema Objects is to improve on the conventional approach: creating Java Objects and adding and appending nested Objects to create the desired complex-type DataBag Object to pass to the UDF as use case input. The simpler process I'm looking for should improve scalability and rapid prototyping within the testing scripts. It will also make the process more approachable for another programmer to write additional unit tests.

-Dan
-
Re: String Representation of DataBag and its Schema
Jonathan Coveney 2013-03-19, 16:08

I definitely understand the benefits; I just wanted to understand your workflow so I could weigh in with what I would do.

In your case, if you're going to be making these by hand, then I would mimic what PigStorage outputs, and then just load it in using PigStorage.
-
Re: String Representation of DataBag and its Schema
Dan DeCapria, CivicScienc... 2013-03-19, 16:40
This would work, but the goal would be to *not* invoke local interactive
pig to execute a LOAD USING PigStorage() and pass the data into the UDF. I
was hoping to keep this completely in the Java and JUnit testing universe.

Looking over the PigStorage() doc
<https://pig.apache.org/docs/r0.10.0/api/org/apache/pig/builtin/PigStorage.html>,
would you know how to construct this process from a baseline PigStorage
Object, such as:

PigStorage pigstorage = new PigStorage();

Any ideas?

-Dan

Dan DeCapria
CivicScience, Inc.
Senior Informatics / DM / ML / BI Specialist
-
Re: String Representation of DataBag and its Schema
Jonathan Coveney 2013-03-19, 16:53
doing "new PigStorage()" is possible, but tricky. Maybe some of the other
contributors have an easier way of doing this, but in the short term, I'd
work on getting that to work. It's mainly just making sure you initialize
it properly.
-
Re: String Representation of DataBag and its Schema
Jonathan Coveney 2013-03-19, 16:54
Ack, hit enter. I'd look at the LoadFunc interface, the PigStorage class,
and if you can't make it work after playing with it a little, let me know.
-
Re: String Representation of DataBag and its Schema
Dan DeCapria, CivicScienc... 2013-03-19, 17:20
I'll give it an honest try, and any additional input from the community is
greatly appreciated!

I've been on this idea for a few days now. I even implemented my own UDF
parser by converting the input to a char[] array and pushing/popping on a
Stack of Node Objects to generate the nested inner complex DataTypes as a
Node tree. This worked well from a Node-linking standpoint, with a DFS
traversal on the Node tree to rebuild the DataBag Object. But it has its
caveats: I have to create a UDF to generate the input for another input,
and it assumes that field values never contain the delimiter characters
"{(})#,", which isn't the case (e.g., a serialized JSON chararray for a
field). So I was hoping for a more OTS solution using existing classes and
methods, given the String and its Schema a priori.

Thank you for your help, and I'll keep this post updated on my progress
towards a solution.

-Dan

Dan DeCapria
CivicScience, Inc.
Senior Informatics / DM / ML / BI Specialist
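For illustration, the hand-rolled approach described above (walk the characters, push a node on each opening bracket, pop it on the matching closing bracket) might look roughly like the sketch below. This is not Dan's code: `BagTextParser` and `Node` are hypothetical names, the result is a plain Java tree rather than Pig DataBag/Tuple objects, and every leaf stays a String because no schema is applied.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Hypothetical helper (not a Pig class): parses bag/tuple text into a node tree. */
final class BagTextParser {

    /** A bag ('{') or a tuple ('('); children are nested Nodes or String leaves. */
    static final class Node {
        final char open;
        final List<Object> children = new ArrayList<>();

        Node(char open) { this.open = open; }

        /** Re-serializes the subtree in the same textual form it was parsed from. */
        @Override
        public String toString() {
            StringBuilder sb = new StringBuilder().append(open);
            for (int i = 0; i < children.size(); i++) {
                if (i > 0) sb.append(',');
                sb.append(children.get(i));
            }
            return sb.append(open == '{' ? '}' : ')').toString();
        }
    }

    /** Kind of the previously consumed character; decides when a field ends. */
    private enum Last { START, OPEN, CLOSE, COMMA, CHAR }

    static Node parse(String text) {
        Deque<Node> stack = new ArrayDeque<>();
        StringBuilder field = new StringBuilder();
        Node root = null;
        Last last = Last.START;

        for (char c : text.toCharArray()) {
            if (c == '{' || c == '(') {
                // A container may start the input or follow '(' / '{' / ','.
                if (root != null || last == Last.CHAR || last == Last.CLOSE) {
                    throw new IllegalArgumentException("unexpected '" + c + "'");
                }
                stack.push(new Node(c));
                last = Last.OPEN;
            } else if (c == '}' || c == ')') {
                if (stack.isEmpty() || (stack.peek().open == '{') != (c == '}')) {
                    throw new IllegalArgumentException("unbalanced '" + c + "'");
                }
                // A scalar field is pending only after plain text or a comma.
                if (last == Last.CHAR || last == Last.COMMA) {
                    stack.peek().children.add(field.toString());
                    field.setLength(0);
                }
                Node done = stack.pop();
                if (stack.isEmpty()) {
                    root = done;
                } else {
                    stack.peek().children.add(done);
                }
                last = Last.CLOSE;
            } else if (c == ',') {
                if (stack.isEmpty()) {
                    throw new IllegalArgumentException("',' outside any bag or tuple");
                }
                // After a closing bracket the nested node was already attached.
                if (last != Last.CLOSE) {
                    stack.peek().children.add(field.toString());
                    field.setLength(0);
                }
                last = Last.COMMA;
            } else {
                if (stack.isEmpty() || last == Last.CLOSE) {
                    throw new IllegalArgumentException("unexpected text at '" + c + "'");
                }
                field.append(c);
                last = Last.CHAR;
            }
        }
        if (root == null || !stack.isEmpty()) {
            throw new IllegalArgumentException("unterminated bag or tuple");
        }
        return root;
    }
}
```

Parsing "{(a,(b,d),f)}" and calling toString() on the result gives back the same text. The sketch also shows the caveat mentioned above: a field holding serialized JSON such as {"x":1} is read as a nested bag instead of a chararray, because the parser cannot tell data braces from structural ones without the schema.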
-
Re: String Representation of DataBag and its Schema
William Oberman 2013-03-21, 15:51
We managed to piece this together. It's not fully generic (we assume a
single field), but it gets the job done for unit testing.

--------------
package com.civicscience.util;

import org.apache.pig.ResourceSchema;
import org.apache.pig.builtin.Utf8StorageConverter;
import org.apache.pig.impl.util.CastUtils;
import org.apache.pig.impl.util.Utils;
import org.apache.pig.newplan.logical.relational.LogicalSchema;

import java.io.IOException;

public class CSPigUtils {
    // Converts `data` (text in the form PigStorage would write) into the Pig
    // object described by `schema`. Only the first field of the schema is used.
    public static Object getPigRepresentation(String schema, String data)
            throws IOException {
        Utf8StorageConverter caster = new Utf8StorageConverter();
        LogicalSchema ls = Utils.parseSchema(schema);
        ResourceSchema rs = new ResourceSchema(ls);
        ResourceSchema.ResourceFieldSchema[] fields = rs.getFields();
        // Encode explicitly as UTF-8 rather than relying on the platform default.
        return CastUtils.convertToType(caster, data.getBytes("UTF-8"),
                fields[0], fields[0].getType());
    }
}
---------------