Pig >> mail # user >> Pigify Data Input to UDF for Unit Testing


Pigify Data Input to UDF for Unit Testing
First-time poster here! Really excited to get some feedback and contribute to
Pig!

I am trying to simplify how UDF inputs are built so that JUnit testing can
scale. Until now, to create a valid Pig input for a UDF under test, I have
had to build every layer of nesting by hand from org.apache.pig.data.*
constructs, once per use case. I am looking for a quicker way to do this
that scales to additional unit tests. A use case is defined below:

Assume the input schema is defined a priori, and that outputSchema is
properly defined in the UDF under test. By illustrating the prior Pig
process, I get InputData in the form of InputSchema, which is what the UDF
under test expects. Conceptually, the unit testing approach is as follows:

InputSchema
bag_a:bag{tuple_b:tuple(tuple_c1:tuple(tuple_d1:tuple(field_a:chararray,field_b:chararray)),field_e:chararray)}

OutputSchema
bag_a:bag{tuple_b:tuple(tuple_c1:tuple(tuple_d1:tuple(field_a:chararray,field_b:chararray),tuple_d2:tuple(field_c:chararray,field_d:chararray)),field_e:chararray)}

Prior (non-scalable) methodology:
Create bag_a DataBag.
Create tuple_b Tuple.
Create tuple_c1 Tuple.
Create tuple_d1 Tuple.
Append data field_a to tuple_d1. Append data field_b to tuple_d1.
Append tuple_d1 to tuple_c1.
Append tuple_c1 to tuple_b. Append data field_e to tuple_b.
Append tuple_b to bag_a.
Unit test UDF(bag_a). //

Is there a way to 'pigify' the InputSchema data String, as it appears in the
illustrate output of the prior Pig process, so that it can be fed into
UDF(InputData) without performing the prior methodology explicitly? An
ideal solution would have the form:

Awesome methodology:
String_of_data_in_inputFormat:
 bag_a:bag{tuple_b:tuple(tuple_c1:tuple(tuple_d1:tuple(field_a:chararray,field_b:chararray)),field_e:chararray)}
DataBag bag_a = pigify(String_of_data_in_inputFormat);
unit test UDF(bag_a). //
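
To make the request concrete, here is a minimal sketch of what such a pigify helper might do. It is plain Java and does not use the Pig API: the class name Pigify, the Bag marker type, and the use of java.util lists in place of DataBag and Tuple are all hypothetical stand-ins. It also assumes the input is the data literal as illustrate would print it, for example {(((a,b)),e)}, with every field treated as a chararray.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for pigify(): parses Pig-style data text into nested lists. */
class Pigify {
    /** Marker type so a bag can be told apart from a tuple (both are lists here). */
    static final class Bag extends ArrayList<Object> {}

    private final String text;
    private int pos = 0;

    private Pigify(String text) { this.text = text; }

    /** Parses one value: a bag "{...}", a tuple "(...)", or a plain field. */
    static Object parse(String text) {
        Pigify p = new Pigify(text.trim());
        Object value = p.value();
        if (p.pos != p.text.length()) {
            throw new IllegalArgumentException("Trailing input at " + p.pos);
        }
        return value;
    }

    private Object value() {
        char c = peek();
        if (c == '{') return items('}', new Bag());
        if (c == '(') return items(')', new ArrayList<>());
        // Plain field: read up to the next delimiter and keep it as a String.
        int start = pos;
        while (pos < text.length() && ",(){}".indexOf(text.charAt(pos)) < 0) pos++;
        return text.substring(start, pos).trim();
    }

    /** Reads comma-separated values until the closing delimiter. */
    private List<Object> items(char close, List<Object> out) {
        pos++; // consume the opening '{' or '('
        if (peek() == close) { pos++; return out; }
        while (true) {
            out.add(value());
            char c = peek();
            pos++;
            if (c == close) return out;
            if (c != ',') {
                throw new IllegalArgumentException(
                    "Expected ',' or '" + close + "' at " + (pos - 1));
            }
        }
    }

    /** Skips spaces and returns the next character without consuming it. */
    private char peek() {
        while (pos < text.length() && text.charAt(pos) == ' ') pos++;
        if (pos >= text.length()) {
            throw new IllegalArgumentException("Unexpected end of input");
        }
        return text.charAt(pos);
    }

    public static void main(String[] args) {
        // bag_a containing tuple_b = (tuple_c1 = (tuple_d1 = (a, b)), e)
        System.out.println(parse("{(((a,b)),e)}")); // prints [[[[a, b]], e]]
    }
}
```

A real version would need to build org.apache.pig.data objects and honor the field types in the schema; this sketch only shows the nesting being recovered from a single string.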

Thanks in advance,

-Dan DeCapria