Pig, mail # user - Fwd: Partition keys in LoadMetadata is broken in 0.10?


Fwd: Partition keys in LoadMetadata is broken in 0.10?
Stan Rosenberg 2011-12-30, 17:54
Howdy All,

I am resurrecting my previous message, sent to the list on Dec. 7.  Let
me first summarize.  In a nutshell, as far as I can tell, partition-aware
loading is broken in Pig, and the culprit is PIG-1188, wherein the final
decision was to introduce project & cast, i.e., a foreach, after load.
There are two problems with that approach.
First, as indicated in my original message, 'getPartitionKeys' is never
invoked because, instead of the expected instruction sequence
'load; filter', PIG-1188 changed it to 'load; foreach; filter'.  Second,
if a loader already happens to project & cast in order to make the data
conform to the schema, then the foreach synthesized by Pig is a waste
of time.

Essentially, we had to undo the patch in 'PIG-1188' in order to get
partition filters to work; this enabled us to implement a HiveLoader
very much like HCatLoader, which incidentally is also broken for the
very same reason.  This is obviously a hack, and a real solution is
needed.  If the decision made in PIG-1188 cannot be reconsidered, then I
suggest that we revisit the logic that is used to pass partition
filters to partition-aware loaders.

Many thanks!

stan

---------- Forwarded message ----------
From: Stan Rosenberg <[EMAIL PROTECTED]>
Date: Wed, Dec 7, 2011 at 12:24 PM
Subject: Partition keys in LoadMetadata is broken in 0.10?
To: [EMAIL PROTECTED]

Hi,

I am trying to implement a loader which is partition-aware.  As
prescribed, my loader implements LoadMetadata; however,
getPartitionKeys is never invoked.
The script is of this form:

X = LOAD 'input' USING MyLoader();
X = FILTER X BY partition_col == 'some_string';

and the schema returned by MyLoader.getSchema includes the column
'partition_col' which is of type 'chararray'.
After debugging Pig, I have found what appears to be a bug in the new
code (version 0.10 snapshot and also 0.9.1).  MyLoader.getPartitionKeys
is never invoked because of the wrongfully inserted 'foreach' after the
'load' and before the 'filter'.  The code in
TypeCastInserterTransformer.check used to return 'false' if the
schemas matched or all fields were of type 'bytearray'; cf. Pig
version 0.8.1.
Effectively, the above script gets transformed into:

X = LOAD 'input' USING MyLoader();
X = FOREACH X GENERATE ...;
X = FILTER X BY partition_col == 'some_string';

Subsequently, PartitionFilterPushDownTransformer.check observes that
the immediate successor of 'load' is _not_ 'filter', whence
getPartitionKeys is never invoked.
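
To make the failure mode concrete, here is a minimal, self-contained Java
sketch of the kind of check described above.  It is a toy model, not Pig's
actual code: the class PushDownCheckSketch, the enum Op and the method
filterDirectlyFollowsLoad are hypothetical names invented for illustration,
and the plan is reduced to a flat list of operator kinds.

```java
import java.util.Arrays;
import java.util.List;

// Toy model only; none of these names exist in Pig.
final class PushDownCheckSketch {
    // Hypothetical operator kinds standing in for logical plan nodes.
    enum Op { LOAD, FOREACH, FILTER }

    // Assumed behaviour: the rule matches only when the operator
    // immediately after the first LOAD is a FILTER.
    static boolean filterDirectlyFollowsLoad(List<Op> plan) {
        for (int i = 0; i + 1 < plan.size(); i++) {
            if (plan.get(i) == Op.LOAD) {
                return plan.get(i + 1) == Op.FILTER;
            }
        }
        return false; // no LOAD, or LOAD is the last operator
    }

    public static void main(String[] args) {
        // The script as written by the user.
        List<Op> asWritten = Arrays.asList(Op.LOAD, Op.FILTER);
        // The script after a cast-inserting foreach has been added.
        List<Op> afterCast = Arrays.asList(Op.LOAD, Op.FOREACH, Op.FILTER);

        System.out.println(filterDirectlyFollowsLoad(asWritten)); // true
        System.out.println(filterDirectlyFollowsLoad(afterCast)); // false
    }
}
```

In this model the second plan never reaches the point where the loader
would be asked for its partition keys, which matches what I observe.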

Any suggestions?

Thanks,

stan

P.S. While in the above case the 'foreach' can be avoided, in general
typecasting may need to be performed if the user-provided schema does
not match the one returned by the loader.
I think the general case needs to be handled correctly, perhaps by
ignoring all synthetic operators after the 'load'.  (This is just a
wild guess.)
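
A rough sketch of that guess, again as a self-contained toy model in Java
rather than a patch against Pig: the class SkipSyntheticSketch, the Op type
with its 'synthetic' flag, and the method name are all hypothetical.  The
sketch assumes the optimizer could tell operators it synthesized apart from
those written by the user, which I have not verified.

```java
import java.util.Arrays;
import java.util.List;

// Toy model only; none of these names exist in Pig.
final class SkipSyntheticSketch {
    // Hypothetical plan node: a kind plus a flag saying whether the
    // optimizer (rather than the user) inserted it.
    static final class Op {
        final String kind;
        final boolean synthetic;
        Op(String kind, boolean synthetic) {
            this.kind = kind;
            this.synthetic = synthetic;
        }
    }

    // Finds the first load, skips any synthetic operators after it,
    // and reports whether the next operator is a filter.
    static boolean filterFollowsLoadIgnoringSynthetic(List<Op> plan) {
        int i = 0;
        while (i < plan.size() && !plan.get(i).kind.equals("load")) {
            i++;
        }
        if (i == plan.size()) {
            return false; // no load in the plan
        }
        i++; // step past the load
        while (i < plan.size() && plan.get(i).synthetic) {
            i++; // ignore optimizer-inserted operators
        }
        return i < plan.size() && plan.get(i).kind.equals("filter");
    }

    public static void main(String[] args) {
        List<Op> withSyntheticForeach = Arrays.asList(
            new Op("load", false), new Op("foreach", true), new Op("filter", false));
        List<Op> withUserForeach = Arrays.asList(
            new Op("load", false), new Op("foreach", false), new Op("filter", false));

        System.out.println(filterFollowsLoadIgnoringSynthetic(withSyntheticForeach)); // true
        System.out.println(filterFollowsLoadIgnoringSynthetic(withUserForeach));      // false
    }
}
```

A user-written foreach still blocks the match, since it may rename or
compute columns.  A real fix would presumably also have to map the filter's
column references back through the synthesized cast before handing the
partition filter to the loader; the sketch ignores that.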