Pig, mail # user - Partition keys in LoadMetadata is broken in 0.10?

Re: Partition keys in LoadMetadata is broken in 0.10?
Stan Rosenberg 2012-01-01, 03:34
Hi Daniel,

Thanks for pointing out PIG-2346.  However, what happens if the user
decides to rename some of the fields using the 'as' clause?  We then
have the same problem, i.e., a 'foreach' is generated.  As a heuristic,
perhaps synthesized operators should be marked as such.  This way, pig
could skip synthesized operators when trying to match the sequence
'load; filter'.  Another alternative is to create a new keyword, say
'where', to be used for specifying partitions.  E.g.,

A = load 'daily_activity' from HiveLoader where date_partition > 20110101 and date_partition <= 20110201;
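
To make the first suggestion (marking synthesized operators) more concrete, here is a rough sketch in Java.  It is only a toy model of a linear plan, not pig's actual optimizer code: the class PlanSketch, the Op type, and the 'synthesized' flag are all hypothetical names invented for illustration.  It contrasts a strict check that the immediate successor of 'load' is a 'filter' with a check that skips compiler-generated operators.

```java
import java.util.List;

// Hypothetical, simplified model of a linear plan; these are NOT pig's classes.
final class PlanSketch {
    enum Kind { LOAD, FOREACH, FILTER }

    // One operator in the plan. 'synthesized' is an assumed flag marking
    // operators inserted by the compiler rather than written by the user.
    static final class Op {
        final Kind kind;
        final boolean synthesized;

        Op(Kind kind, boolean synthesized) {
            this.kind = kind;
            this.synthesized = synthesized;
        }
    }

    // Strict rule: the operator right after the load must be a filter.
    static boolean strictMatch(List<Op> plan) {
        return plan.size() >= 2
                && plan.get(0).kind == Kind.LOAD
                && plan.get(1).kind == Kind.FILTER;
    }

    // Heuristic: ignore synthesized operators when looking for the
    // first operator that follows the load.
    static boolean matchSkippingSynthesized(List<Op> plan) {
        if (plan.isEmpty() || plan.get(0).kind != Kind.LOAD) {
            return false;
        }
        for (Op op : plan.subList(1, plan.size())) {
            if (op.synthesized) {
                continue; // compiler-generated, so it does not block the match
            }
            return op.kind == Kind.FILTER;
        }
        return false; // nothing but synthesized operators after the load
    }

    public static void main(String[] args) {
        List<Op> plan = List.of(
                new Op(Kind.LOAD, false),
                new Op(Kind.FOREACH, true),  // the inserted project & cast
                new Op(Kind.FILTER, false));
        System.out.println(strictMatch(plan));              // prints false
        System.out.println(matchSkippingSynthesized(plan)); // prints true
    }
}
```

A foreach written by the user (synthesized == false) would still block the match in this sketch.  A real implementation would presumably also have to map any renamed or cast columns in the filter back to the loader's partition columns, which the sketch does not attempt.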


On Sat, Dec 31, 2011 at 9:42 PM, Daniel Dai <[EMAIL PROTECTED]> wrote:
> Hi, Stan,
> A foreach is inserted only if you have an "as" clause in the "load"
> statement. This is to ensure that the loaded data conforms to the "as"
> clause. At some point there was a bug in the implementation; it should be
> fixed by PIG-2346, and the fix will be included in all subsequent releases.
> Thanks,
> Daniel
> On Fri, Dec 30, 2011 at 9:54 AM, Stan Rosenberg <
>> Howdy All,
>> I am resurrecting my previous message sent to the list on Dec. 7.  Let
>> me first summarize.  In a nutshell, as far as I can tell,
>> partition-aware loading is broken in pig, and the culprit is PIG-1188,
>> wherein the final decision was to introduce project & cast, i.e.,
>> foreach, after load.  There are two problems with that approach.
>> First, as indicated in my original message, 'getPartitionKeys' is
>> never invoked because instead of the expected instruction sequence
>> 'load; filter', PIG-1188 changed it to 'load; foreach; filter'.
>> Second, if a loader already happens to project & cast in order to make
>> the data adhere to the schema, then the foreach synthesized by pig is
>> a waste of time.
>> Essentially, we had to undo the patch in 'PIG-1188' in order to get
>> partition filters to work; this enabled us to implement a HiveLoader
>> very much like HCatLoader, which incidentally is also broken for the
>> very same reason.  This is obviously a hack and a real solution is
>> needed.
>> If the decision made in PIG-1188 cannot be re-considered, then I
>> suggest that we revisit the logic which is used to pass partition
>> filters to partition-aware loaders.
>> Many thanks!
>> stan
>> ---------- Forwarded message ----------
>> From: Stan Rosenberg <[EMAIL PROTECTED]>
>> Date: Wed, Dec 7, 2011 at 12:24 PM
>> Subject: Partition keys in LoadMetadata is broken in 0.10?
>> Hi,
>> I am trying to implement a loader which is partition-aware.  As
>> prescribed, my loader implements LoadMetadata; however,
>> getPartitionKeys is never invoked.
>> The script is of this form:
>> X = LOAD 'input' USING MyLoader();
>> X = FILTER X BY partition_col == 'some_string';
>> and the schema returned by MyLoader.getSchema includes the column
>> 'partition_col' which is of type 'chararray'.
>> After debugging pig, I have found what appears to be a bug in the new
>> code (version 0.10 snapshot and also in 0.9.1).
>> MyLoader.getPartitionKeys is never invoked because of the wrongfully
>> inserted 'foreach' after the 'load' and before the 'filter'.  The code in
>> TypeCastInserterTransformer.check used to return 'false' if the
>> schemas matched or all fields were of type 'bytearray'; cf. pig
>> version 0.8.1.
>> Effectively, the above script gets transformed into:
>> X = LOAD 'input' USING MyLoader();
>> X = FOREACH X GENERATE ...;  -- synthesized project & cast
>> X = FILTER X BY partition_col == 'some_string';
>> Subsequently, PartitionFilterPushDownTransformer.check observes that
>> the immediate successor of 'load' is _not_ 'filter', whence
>> getPartitionKeys is never invoked.
>> Any suggestions?
>> Thanks,
>> stan
>> P.S. While in the above case the 'foreach' can be avoided, in general
>> typecasting may need to be performed if the user-provided schema does
>> not match the one returned by the loader.