Flattening bags and tuples without a known schema
This email discusses a use case of flattening a bag or tuple when the
schema of the bag or tuple is not known, i.e., null.

When UDFs return bags or tuples (complex types), the schema of the
complex type can be declared via the outputSchema method of the UDF. By
default, the outputSchema method in EvalFunc (the abstract base class)
returns null. When users then flatten the output of such a UDF, the
schema of the flattened column cannot be determined. An example follows.


--myudf returns a bag whose schema is null, i.e., not declared
B = foreach A generate flatten(myudf), $1 as x;
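
The default described before the Pig statement can be modelled with a small, self-contained Java sketch. The classes below (SketchUdf, UndeclaredUdf, DeclaredUdf, OutputSchemaSketch) are hypothetical stand-ins invented for illustration; they are not Pig's EvalFunc or Schema classes and only mirror the idea that a UDF reports no schema unless its author overrides the method. The field names are borrowed from the AS clause shown later in this message.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for a UDF base class; NOT Pig's real EvalFunc.
abstract class SketchUdf {
    // Mirrors the default described in this message: no schema is declared.
    List<String> outputSchema() {
        return null;
    }
}

// A UDF author who does nothing inherits the null schema.
class UndeclaredUdf extends SketchUdf {
}

// A UDF author who knows the shape of the result declares it.
class DeclaredUdf extends SketchUdf {
    @Override
    List<String> outputSchema() {
        return Arrays.asList("name: chararray", "age: int", "gpa: float");
    }
}

class OutputSchemaSketch {
    // Reports what a front end could learn about the flattened columns.
    static String describe(SketchUdf udf) {
        List<String> schema = udf.outputSchema();
        return schema == null ? "unknown" : String.join(", ", schema);
    }

    public static void main(String[] args) {
        System.out.println(describe(new UndeclaredUdf()));
        System.out.println(describe(new DeclaredUdf()));
    }
}
```

In this model the undeclared UDF gives the front end nothing to work with, which is the situation the rest of this message tries to resolve.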

In the foreach statement above, since the schema of the bag returned by
myudf is not known, we have two possible options:

1. Erring on the side of safety, set the schema of the flattened column
to be a bytearray. While this is a safe assumption, UDF authors who know
the exact shape of the return value will try to access its elements
directly. For example, if myudf returned a bag of tuples containing 3
elements each, the following would be a plausible use:

C = foreach B generate $2 as mycolumn;

At this point, the safe assumption that the flattened column is a
single column of type bytearray yields {bytearray, x: bytearray} as the
schema for B. B therefore has only two columns, $0 and $1, so statement
C will generate a parse exception for out-of-bounds access to $2.

Since UDF authors have complete knowledge of the values their UDF
returns, they should override the outputSchema method in the UDF to
ensure correct schemas. Alternatively, the schema can be specified in
the "AS" clause of the generate statement, i.e.,

B = foreach A generate flatten(myudf) as (name: chararray, age: int,
gpa: float), $1 as x;

2. Set the schema of the foreach to be unknown, i.e., null. The bag
returned by the UDF could contain an arbitrary number of columns, making
it impossible to set the correct column number for the other expression,
x, in the generate clause. In all likelihood, this will break existing
Pig scripts such as:

B = foreach A generate flatten(myudf), $1 as x;
C = foreach B generate $1 + x;
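
The two options can be compared with a self-contained Java sketch. This is a hypothetical model written for this discussion, not Pig's schema code: a schema is a plain list of strings, an unknown schema is null, and the names FlattenSchemaSketch, optionOne, optionTwo, positionIsValid and resolveAlias are invented for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Objects;

// Hypothetical model of the two options; NOT Pig's actual schema logic.
class FlattenSchemaSketch {

    // Option 1: an undeclared flattened result becomes one bytearray
    // column; a declared result expands to one column per declared field.
    // The trailing column stands for "$1 as x" in the generate clause.
    static List<String> optionOne(List<String> declared) {
        List<String> schema = new ArrayList<>();
        if (declared == null) {
            schema.add("bytearray");
        } else {
            schema.addAll(declared);
        }
        schema.add("x: bytearray");
        return schema;
    }

    // Option 2: an undeclared flattened result makes the whole foreach
    // schema unknown, modelled here as null.
    static List<String> optionTwo(List<String> declared) {
        return declared == null ? null : optionOne(declared);
    }

    // Bounds check for a positional reference such as $2.
    // Requires a known (non-null) schema.
    static boolean positionIsValid(List<String> schema, int position) {
        Objects.requireNonNull(schema, "schema must be known");
        return position >= 0 && position < schema.size();
    }

    // Resolves an alias such as x to a column position. Returns -1 when
    // the alias is absent or the schema is unknown.
    static int resolveAlias(List<String> schema, String alias) {
        if (schema == null) {
            return -1;
        }
        for (int i = 0; i < schema.size(); i++) {
            if (schema.get(i).startsWith(alias + ":")) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        List<String> declared =
            Arrays.asList("name: chararray", "age: int", "gpa: float");

        List<String> undeclared = optionOne(null);
        System.out.println("option 1, undeclared: " + undeclared);
        System.out.println("  $2 valid? " + positionIsValid(undeclared, 2));

        List<String> withSchema = optionOne(declared);
        System.out.println("option 1, declared: " + withSchema);
        System.out.println("  $2 valid? " + positionIsValid(withSchema, 2));

        System.out.println("option 2, undeclared: " + optionTwo(null));
        System.out.println("  position of x: " + resolveAlias(optionTwo(null), "x"));
    }
}
```

Under these assumptions, option 1 rejects $2 unless the schema is declared, while option 2 leaves the alias x without a column position, which is how a script that refers to x would stop working.
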

Currently, I have an implementation for option 1. Any
thoughts/suggestions/comments are welcome.