Pig, mail # user - Pigify Data Input to UDF for Unit Testing


Pigify Data Input to UDF for Unit Testing
Dan DeCapria, CivicScienc... 2013-03-11, 19:35
First poster here! Really excited to get some feedback and contribute to
Pig!

I am trying to simplify the UDF input process so that JUnit testing can
scale. Previously, to create valid Pig input for my UDFs in JUnit tests, I
have had to build each layer of nesting in the Pig input by hand from
org.apache.pig.data.* constructs, for each use case under test.  I am
looking for a quick way to simplify this process so that it scales to
additional unit tests.  A use case is defined below:

Assume the input schema is defined a priori.  Assume also that the
outputSchema is properly defined in the UDF under test.  By running
ILLUSTRATE on the prior Pig process, I obtain the InputData in the form of
InputSchema for the UDF I am testing.  Conceptually, the unit testing
approach is as follows:

InputSchema
bag_a:bag{tuple_b:tuple(tuple_c1:tuple(tuple_d1:tuple(field_a:chararray,field_b:chararray)),field_e:chararray)}

OutputSchema
bag_a:bag{tuple_b:tuple(tuple_c1:tuple(tuple_d1:tuple(field_a:chararray,field_b:chararray),tuple_d2:tuple(field_c:chararray,field_d:chararray)),field_e:chararray)}

Prior (non-scalable) methodology:
Create bag_a DataBag.
Create tuple_b Tuple.
Create tuple_c1 Tuple.
Create tuple_d1 Tuple.
append data field_a to tuple_d1.  append data field_b to tuple_d1.
append tuple_d1 to tuple_c1.
append tuple_c1 to tuple_b. append data field_e to tuple_b.
append tuple_b to bag_a.
unit test UDF(bag_a). //

Is there a way to 'pigify' the InputSchema data String, as it appears in the
ILLUSTRATE output of the prior Pig process, so that it can be fed into
UDF(InputData) without my having to perform the prior methodology
explicitly?  Ideally, a solution would take the form:

Awesome methodology:
String_of_data_in_inputFormat:
 bag_a:bag{tuple_b:tuple(tuple_c1:tuple(tuple_d1:tuple(field_a:chararray,field_b:chararray)),field_e:chararray)}
DataBag bag_a = pigify(String_of_data_in_inputFormat);
unit test UDF(bag_a). //
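
To make the idea concrete, here is a minimal sketch of what such a pigify helper might look like. It is not an existing Pig API: the names Pigify, Bag and pigify are hypothetical. It assumes the data String uses the bag/tuple text notation that Pig prints, for example {(((a,b)),e)} for the InputSchema above, and that fields are plain chararrays with no embedded commas, parentheses or braces. So that it runs without Pig on the classpath, it builds nested java.util.List values for tuples and a small stand-in class for bags.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: parses Pig-style text such as {(((a,b)),e)} into nested lists. */
final class Pigify {

    /** Stand-in for a Pig bag; a plain List<Object> stands in for a tuple. */
    static final class Bag {
        final List<Object> tuples = new ArrayList<>();
    }

    private final String text;
    private int pos = 0;

    private Pigify(String text) {
        this.text = text;
    }

    /** Parses one value: a bag, a tuple, or a plain field (returned as a String). */
    static Object pigify(String text) {
        Pigify parser = new Pigify(text.trim());
        Object value = parser.parseValue();
        parser.skipSpaces();
        if (parser.pos != parser.text.length()) {
            throw new IllegalArgumentException("Unexpected trailing input at " + parser.pos);
        }
        return value;
    }

    private Object parseValue() {
        skipSpaces();
        if (pos < text.length() && text.charAt(pos) == '{') {
            Bag bag = new Bag();
            parseItems('}', bag.tuples);
            return bag;
        }
        if (pos < text.length() && text.charAt(pos) == '(') {
            List<Object> tuple = new ArrayList<>();
            parseItems(')', tuple);
            return tuple;
        }
        // Plain field: read up to the next delimiter.
        int start = pos;
        while (pos < text.length() && ",(){}".indexOf(text.charAt(pos)) < 0) {
            pos++;
        }
        return text.substring(start, pos).trim();
    }

    /** Reads comma-separated values up to the closing delimiter into out. */
    private void parseItems(char close, List<Object> out) {
        pos++; // skip the opening '{' or '('
        skipSpaces();
        if (pos < text.length() && text.charAt(pos) == close) {
            pos++; // empty bag or tuple
            return;
        }
        while (true) {
            out.add(parseValue());
            skipSpaces();
            if (pos >= text.length()) {
                throw new IllegalArgumentException("Unexpected end of input");
            }
            char c = text.charAt(pos++);
            if (c == close) {
                return;
            }
            if (c != ',') {
                throw new IllegalArgumentException("Expected ',' or '" + close + "' at " + (pos - 1));
            }
        }
    }

    private void skipSpaces() {
        while (pos < text.length() && Character.isWhitespace(text.charAt(pos))) {
            pos++;
        }
    }
}
```

Calling Pigify.pigify("{(((a,b)),e)}") yields a Bag holding one tuple whose nesting mirrors tuple_b above. Turning that result into real Pig objects would be a further recursive walk, creating tuples with TupleFactory.getInstance().newTuple(list) and bags with BagFactory.getInstance().newDefaultBag(); that step is left out here and is untested.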

Thanks in advance,

-Dan DeCapria