Pig >> mail # user >> Partition keys in LoadMetadata is broken in 0.10?

Partition keys in LoadMetadata is broken in 0.10?

I am trying to implement a partition-aware loader.  As prescribed, my
loader implements LoadMetadata; however, getPartitionKeys is never
invoked.
The script has this form:

X = LOAD 'input' USING MyLoader();
X = FILTER X BY partition_col == 'some_string';

and the schema returned by MyLoader.getSchema includes the column
'partition_col' which is of type 'chararray'.
After debugging Pig, I have found what appears to be a bug in the new
code (version 0.10 snapshot and also 0.9.1).  MyLoader.getPartitionKeys
is never invoked because a 'foreach' is wrongly inserted after the
'load' and before the 'filter'.  In earlier versions (cf. Pig 0.8.1),
TypeCastInserterTransformer.check returned 'false' if the schemas
matched or all fields were of type 'bytearray', so no 'foreach' was
inserted in that case.
Effectively, the above script gets transformed into:

X = LOAD 'input' USING MyLoader();
X = FOREACH X GENERATE ...;  -- the inserted 'foreach' (exact form elided)
X = FILTER X BY partition_col == 'some_string';

Subsequently, PartitionFilterPushDownTransformer.check observes that
the immediate successor of the 'load' is _not_ a 'filter', so
getPartitionKeys is never invoked.

Any suggestions?



P.S. While in the above case the 'foreach' can be avoided, in general
typecasting may need to be performed if the user-provided schema does
not match the one returned by the loader.
I think the general case needs to be handled correctly, perhaps by
ignoring all synthetic operators after the 'load'.  (This is just a
wild guess.)
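
To make the guess above concrete, here is a toy model of the two checks.  It is only a sketch: PlanCheckSketch, Op, Kind and both method names are hypothetical and are not Pig's logical-plan classes or its optimizer API.  It assumes a linear plan in which operators added by the optimizer carry a 'synthetic' flag.

```java
import java.util.Arrays;
import java.util.List;

// Toy model only: these types are hypothetical and are NOT Pig's plan classes.
public class PlanCheckSketch {

    enum Kind { LOAD, FOREACH, FILTER }

    // One operator in a linear plan; 'synthetic' marks operators
    // that the optimizer added rather than the user's script.
    static final class Op {
        final Kind kind;
        final boolean synthetic;

        Op(Kind kind, boolean synthetic) {
            this.kind = kind;
            this.synthetic = synthetic;
        }
    }

    // Strict check: the operator right after the load must be a filter.
    static boolean filterDirectlyFollowsLoad(List<Op> plan) {
        return plan.size() > 1
                && plan.get(0).kind == Kind.LOAD
                && plan.get(1).kind == Kind.FILTER;
    }

    // Lenient check: skip synthetic operators after the load,
    // then require the next operator to be a filter.
    static boolean filterFollowsLoadIgnoringSynthetic(List<Op> plan) {
        if (plan.isEmpty() || plan.get(0).kind != Kind.LOAD) {
            return false;
        }
        int i = 1;
        while (i < plan.size() && plan.get(i).synthetic) {
            i++;
        }
        return i < plan.size() && plan.get(i).kind == Kind.FILTER;
    }

    public static void main(String[] args) {
        // load -> synthetic foreach -> filter, as in the transformed script
        List<Op> plan = Arrays.asList(
                new Op(Kind.LOAD, false),
                new Op(Kind.FOREACH, true),
                new Op(Kind.FILTER, false));
        System.out.println("strict:  " + filterDirectlyFollowsLoad(plan));
        System.out.println("lenient: " + filterFollowsLoadIgnoringSynthetic(plan));
    }
}
```

In this model the strict check rejects the transformed plan while the lenient one accepts it.  Whether skipping is actually safe in Pig would depend on what the inserted 'foreach' does to the partition column (for instance, a real cast), which this sketch does not model.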