

Stan Rosenberg 2012-01-01, 03:34
Re: Partition keys in LoadMetadata is broken in 0.10?
Just to be clear, the concrete syntax had a typo; it should have been:

A = load 'daily_activity' USING HiveLoader WHERE date_partition > 20110101 and date_partition <= 20110201;

On Sat, Dec 31, 2011 at 10:34 PM, Stan Rosenberg
<[EMAIL PROTECTED]> wrote:
>
> A = load 'daily_activity' from HiveLoader where date_partition >> 20110101 and date_partition <= 20110201;
>
> stan
>
> On Sat, Dec 31, 2011 at 9:42 PM, Daniel Dai <[EMAIL PROTECTED]> wrote:
>> Hi, Stan,
>> Foreach is inserted only if you have "as" in the "load" statement. This is to
>> ensure that the loaded data conforms to the "as" clause. At one point there was
>> a bug in the implementation; this should be fixed in PIG-2346 and will be
>> included in all subsequent releases.
>>
>> Thanks,
>> Daniel
>>
>> On Fri, Dec 30, 2011 at 9:54 AM, Stan Rosenberg <
>> [EMAIL PROTECTED]> wrote:
>>
>>> Howdy All,
>>>
>>> I am resurrecting my previous message, sent to the list on Dec. 7.  Let
>>> me first summarize.  In a nutshell, as far as I can tell,
>>> partition-aware loading is broken in Pig, and the culprit is PIG-1188,
>>> wherein the final decision was to introduce project & cast, i.e., a
>>> foreach, after load.  There are two problems with that approach.
>>> First, as indicated in my original message, 'getPartitionKeys' is
>>> never invoked because PIG-1188 changed the expected instruction
>>> sequence 'load; filter' to 'load; foreach; filter'.  Second, if a
>>> loader already happens to project & cast in order to make the data
>>> conform to the schema, then the foreach synthesized by Pig is a waste
>>> of time.
>>>
>>> Essentially, we had to undo the patch in 'PIG-1188' in order to get
>>> partition filters to work; this enabled us to implement a HiveLoader
>>> very much like HCatLoader, which incidentally is also broken for the
>>> very same reason.  This is obviously a hack and a real solution is
>>> needed.  If the decision made in PIG-1188 cannot be reconsidered, then
>>> I suggest that we revisit the logic which is used to pass partition
>>> filters to partition-aware loaders.
>>>
>>> Many thanks!
>>>
>>> stan
>>>
>>>
>>>
>>> ---------- Forwarded message ----------
>>> From: Stan Rosenberg <[EMAIL PROTECTED]>
>>> Date: Wed, Dec 7, 2011 at 12:24 PM
>>> Subject: Partition keys in LoadMetadata is broken in 0.10?
>>> To: [EMAIL PROTECTED]
>>>
>>>
>>> Hi,
>>>
>>> I am trying to implement a loader which is partition-aware.  As
>>> prescribed, my loader implements LoadMetadata, however,
>>> getPartitionKeys is never invoked.
>>> The script is of this form:
>>>
>>> X = LOAD 'input' USING MyLoader();
>>> X = FILTER X BY partition_col == 'some_string';
>>>
>>> and the schema returned by MyLoader.getSchema includes the column
>>> 'partition_col' which is of type 'chararray'.
>>>
>>>
>>> After debugging Pig, I have found what appears to be a bug in the new
>>> code (version 0.10 snapshot and also 0.9.1).  MyLoader.getPartitionKeys
>>> is never invoked because of the wrongly inserted 'foreach' after the
>>> 'load' and before the 'filter'.  The code in
>>> TypeCastInserterTransformer.check used to return 'false' if the
>>> schemas matched or all fields were of type 'bytearray'; cf. Pig
>>> version 0.8.1.
>>> Effectively, the above script gets transformed into:
>>>
>>> X = LOAD 'input' USING MyLoader();
>>> X = FOREACH X GENERATE ...;
>>> X = FILTER X BY partition_col == 'some_string';
>>>
>>> Subsequently, PartitionFilterPushDownTransformer.check observes that
>>> the immediate successor of 'load' is _not_ 'filter', whence
>>> getPartitionKeys is never invoked.
>>>
>>> Any suggestions?
>>>
>>> Thanks,
>>>
>>> stan
>>>
>>> P.S. While in the above case the 'foreach' can be avoided, in general
>>> typecasting may need to be performed if the user-provided schema does
>>> not match the one returned by the loader.
>>> I think the general case needs to be handled correctly, perhaps by
>>> ignoring all synthetic operators after the 'load'.  (This is just a
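
For readers following the thread, the behavior described above can be illustrated with a small, self-contained toy model. This is only a sketch and is not Pig's code: the class PartitionPushDownSketch, the Op type, and the methods strictCheck and skipSyntheticCheck are hypothetical names invented for illustration. It assumes, as the thread reports, that the push-down rule fires only when the immediate successor of 'load' is 'filter', and it models the P.S. suggestion of ignoring synthetic operators after the 'load'.

```java
import java.util.Arrays;
import java.util.List;

// Toy model of a linear logical plan. These are NOT Pig's real classes.
final class PartitionPushDownSketch {

    // Operator kinds that appear in the thread's example script.
    enum Kind { LOAD, FOREACH, FILTER }

    // 'synthetic' marks an operator inserted by the optimizer,
    // as opposed to one written by the user in the script.
    static final class Op {
        final Kind kind;
        final boolean synthetic;

        Op(Kind kind, boolean synthetic) {
            this.kind = kind;
            this.synthetic = synthetic;
        }
    }

    // Strict rule as described in the thread: the immediate
    // successor of LOAD must be FILTER, otherwise no push-down.
    static boolean strictCheck(List<Op> plan) {
        return plan.size() >= 2
                && plan.get(0).kind == Kind.LOAD
                && plan.get(1).kind == Kind.FILTER;
    }

    // Hypothetical variant of the P.S. suggestion: skip synthetic
    // operators after LOAD before looking for the FILTER.
    static boolean skipSyntheticCheck(List<Op> plan) {
        if (plan.isEmpty() || plan.get(0).kind != Kind.LOAD) {
            return false;
        }
        int i = 1;
        while (i < plan.size() && plan.get(i).synthetic) {
            i++;
        }
        return i < plan.size() && plan.get(i).kind == Kind.FILTER;
    }

    // The plan as written by the user: load; filter.
    static List<Op> userPlan() {
        return Arrays.asList(
                new Op(Kind.LOAD, false),
                new Op(Kind.FILTER, false));
    }

    // The plan after a foreach is synthesized: load; foreach; filter.
    static List<Op> rewrittenPlan() {
        return Arrays.asList(
                new Op(Kind.LOAD, false),
                new Op(Kind.FOREACH, true),
                new Op(Kind.FILTER, false));
    }

    public static void main(String[] args) {
        System.out.println(strictCheck(userPlan()));             // true
        System.out.println(strictCheck(rewrittenPlan()));        // false
        System.out.println(skipSyntheticCheck(rewrittenPlan())); // true
    }
}
```

Note that skipping a synthesized foreach is not automatically safe in a real optimizer: if the foreach casts the partition column, the filter would be evaluated against a different type than the one the loader sees, so an actual fix would need to account for that.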