Pig >> mail # user >> Re: Partition keys in LoadMetadata is broken in 0.10?


Daniel Dai 2012-01-01, 02:42
Stan Rosenberg 2012-01-01, 03:34
Stan Rosenberg 2012-01-01, 03:37
Daniel Dai 2012-01-01, 08:36
Dmitriy Ryaboy 2012-01-02, 01:34
Daniel Dai 2012-01-02, 02:09
Dmitriy Ryaboy 2012-01-02, 02:16

Partition keys in LoadMetadata is broken in 0.10?
Hi,

I am trying to implement a partition-aware loader.  As prescribed, my
loader implements LoadMetadata; however, getPartitionKeys is never
invoked.
The script is of this form:

X = LOAD 'input' USING MyLoader();
X = FILTER X BY partition_col == 'some_string';

and the schema returned by MyLoader.getSchema includes the column
'partition_col' which is of type 'chararray'.
After debugging Pig, I have found what appears to be a bug in the new
code (version 0.10 snapshot and also 0.9.1).  MyLoader.getPartitionKeys
is never invoked because a 'foreach' is wrongly inserted after the
'load' and before the 'filter'.  The code in
TypeCastInserterTransformer.check used to return 'false' if the
schemas matched or all fields were of type 'bytearray'; cf. Pig
version 0.8.1.
Effectively, the above script gets transformed into:

X = LOAD 'input' USING MyLoader();
X = FOREACH X GENERATE ...;
X = FILTER X BY partition_col == 'some_string';

Subsequently, PartitionFilterPushDownTransformer.check observes that
the immediate successor of 'load' is _not_ 'filter', whence
getPartitionKeys is never invoked.
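
To make the failure mode concrete, here is a minimal, self-contained
sketch of the check as described above. It is a toy model, not Pig's
actual code: the class PartitionPushDownSketch, the enum Op, and the
method wouldAskForPartitionKeys are hypothetical names, and the plan is
simplified to a linear list of operators.

```java
import java.util.Arrays;
import java.util.List;

// Toy model of a linear logical plan. These are NOT Pig's classes;
// all names here are made up for illustration.
public class PartitionPushDownSketch {
    enum Op { LOAD, FOREACH, FILTER }

    // Models the behaviour described in the post: partition keys are
    // only requested when the operator immediately after LOAD is a FILTER.
    static boolean wouldAskForPartitionKeys(List<Op> plan) {
        return plan.size() >= 2
                && plan.get(0) == Op.LOAD
                && plan.get(1) == Op.FILTER;
    }

    public static void main(String[] args) {
        // The script as the user wrote it.
        List<Op> asWritten = Arrays.asList(Op.LOAD, Op.FILTER);
        // The script after a 'foreach' has been inserted for type casts.
        List<Op> afterCastInsertion =
                Arrays.asList(Op.LOAD, Op.FOREACH, Op.FILTER);

        System.out.println(wouldAskForPartitionKeys(asWritten));          // true
        System.out.println(wouldAskForPartitionKeys(afterCastInsertion)); // false
    }
}
```

In this model the inserted 'foreach' is enough to turn the result from
true to false, which matches the observed symptom.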

Any suggestions?

Thanks,

stan

P.S. While in the above case the 'foreach' can be avoided, in general
typecasting may need to be performed if the user-provided schema does
not match the one returned by the loader.
I think the general case needs to be handled correctly, perhaps by
ignoring all synthetic operators after the 'load'.  (This is just a
wild guess.)
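
A rough sketch of that guess, again as a toy model rather than Pig
code. The 'synthetic' flag, the class SkipSyntheticSketch, and its
methods are assumptions made up for illustration; the sketch also
ignores whether the filter could actually be moved past the cast.

```java
import java.util.Arrays;
import java.util.List;

// Toy model only; none of these names exist in Pig.
public class SkipSyntheticSketch {
    enum Kind { LOAD, FOREACH, FILTER }

    static final class Op {
        final Kind kind;
        // Hypothetical flag: true if the optimizer inserted this
        // operator, false if it came from the user's script.
        final boolean synthetic;

        Op(Kind kind, boolean synthetic) {
            this.kind = kind;
            this.synthetic = synthetic;
        }
    }

    // Returns the first operator after LOAD that the user wrote,
    // skipping optimizer-inserted ones; null if there is none.
    static Op firstUserOpAfterLoad(List<Op> plan) {
        if (plan.isEmpty() || plan.get(0).kind != Kind.LOAD) {
            return null;
        }
        for (int i = 1; i < plan.size(); i++) {
            if (!plan.get(i).synthetic) {
                return plan.get(i);
            }
        }
        return null;
    }

    // The push-down check, relaxed to look past synthetic operators.
    static boolean wouldAskForPartitionKeys(List<Op> plan) {
        Op next = firstUserOpAfterLoad(plan);
        return next != null && next.kind == Kind.FILTER;
    }

    public static void main(String[] args) {
        List<Op> plan = Arrays.asList(
                new Op(Kind.LOAD, false),
                new Op(Kind.FOREACH, true),   // inserted for type casts
                new Op(Kind.FILTER, false));
        System.out.println(wouldAskForPartitionKeys(plan)); // true
    }
}
```

A 'foreach' written by the user (synthetic == false) still blocks the
check in this model, so only optimizer-inserted operators are ignored.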