Pig >> mail # user >> Re: Partition keys in LoadMetadata is broken in 0.10?


Stan Rosenberg 2012-01-01, 03:34
Re: Partition keys in LoadMetadata is broken in 0.10?
Just to be clear, the concrete syntax had a typo; should have been:

A = load 'daily_activity' USING HiveLoader WHERE date_partition > 20110101 and date_partition <= 20110201;

On Sat, Dec 31, 2011 at 10:34 PM, Stan Rosenberg
<[EMAIL PROTECTED]> wrote:
>
> A = load 'daily_activity' from HiveLoader where date_partition > 20110101 and date_partition <= 20110201;
>
> stan
>
> On Sat, Dec 31, 2011 at 9:42 PM, Daniel Dai <[EMAIL PROTECTED]> wrote:
>> Hi, Stan,
>> A foreach is inserted only if the "load" statement has an "as" clause.
>> This ensures that the loaded data conforms to the "as" clause. At some
>> point there was a bug in the implementation; this should be fixed in
>> PIG-2346, and the fix will be included in all subsequent releases.
>>
>> Thanks,
>> Daniel
>>
>> On Fri, Dec 30, 2011 at 9:54 AM, Stan Rosenberg <
>> [EMAIL PROTECTED]> wrote:
>>
>>> Howdy All,
>>>
>>> I am resurrecting my previous message, sent to the list on Dec. 7.  Let
>>> me first summarize.  In a nutshell, as far as I can tell,
>>> partition-aware loading is broken in pig, and the culprit is PIG-1188,
>>> wherein the final decision was to introduce project & cast, i.e., a
>>> foreach, after load.  There are two problems with that approach.
>>> First, as indicated in my original message, 'getPartitionKeys' is
>>> never invoked because PIG-1188 changed the expected instruction
>>> sequence 'load; filter' to 'load; foreach; filter'.  Second, if a
>>> loader already projects & casts in order to make the data conform to
>>> the schema, then the foreach synthesized by pig is a waste of time.
>>>
>>> Essentially, we had to undo the patch in PIG-1188 in order to get
>>> partition filters to work; this enabled us to implement a HiveLoader
>>> very much like HCatLoader, which incidentally is also broken for the
>>> very same reason.  This is obviously a hack, and a real solution is
>>> needed.  If the decision made in PIG-1188 cannot be reconsidered, then
>>> I suggest that we revisit the logic used to pass partition filters to
>>> partition-aware loaders.
>>>
>>> Many thanks!
>>>
>>> stan
>>>
>>>
>>>
>>> ---------- Forwarded message ----------
>>> From: Stan Rosenberg <[EMAIL PROTECTED]>
>>> Date: Wed, Dec 7, 2011 at 12:24 PM
>>> Subject: Partition keys in LoadMetadata is broken in 0.10?
>>> To: [EMAIL PROTECTED]
>>>
>>>
>>> Hi,
>>>
>>> I am trying to implement a loader which is partition-aware.  As
>>> prescribed, my loader implements LoadMetadata; however,
>>> getPartitionKeys is never invoked.
>>> The script is of this form:
>>>
>>> X = LOAD 'input' USING MyLoader();
>>> X = FILTER X BY partition_col == 'some_string';
>>>
>>> and the schema returned by MyLoader.getSchema includes the column
>>> 'partition_col' which is of type 'chararray'.
>>>
>>>
>>> After debugging pig, I have found what appears to be a bug in the new
>>> code (version 0.10 snapshot and also 0.9.1).
>>> MyLoader.getPartitionKeys is never invoked because a 'foreach' is
>>> wrongly inserted after the 'load' and before the 'filter'.  The code
>>> in TypeCastInserterTransformer.check used to return 'false' if the
>>> schemas matched or all fields were of type 'bytearray'; cf. pig
>>> version 0.8.1.
>>> Effectively, the above script gets transformed into:
>>>
>>> X = LOAD 'input' USING MyLoader();
>>> X = FOREACH X GENERATE ...;
>>> X = FILTER X BY partition_col == 'some_string';
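
The decision described above, whether a foreach has to be synthesized after the load, can be sketched in a few lines of Java. This is only a model of the 0.8.1 behaviour as the message describes it, not Pig's actual TypeCastInserterTransformer code: the class and method names are made up, types are plain strings, and it assumes that "all fields were of type 'bytearray'" refers to the schema declared in the script.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical model only; none of these names exist in Pig.
final class CastCheckSketch {

    // Returns true when a project & cast foreach would be needed.
    // Following the description of 0.8.1: nothing is inserted when the two
    // schemas match, or when every declared field is a bytearray (assumption).
    static boolean needsForeach(List<String> loaderTypes, List<String> declaredTypes) {
        boolean schemasMatch = loaderTypes.equals(declaredTypes);
        boolean allBytearray = declaredTypes.stream().allMatch("bytearray"::equals);
        return !(schemasMatch || allBytearray);
    }

    public static void main(String[] args) {
        // The script in the message has no "as" clause, so the declared
        // schema is taken to be the loader's own schema here.
        List<String> loader = Arrays.asList("chararray");
        System.out.println(needsForeach(loader, loader)); // prints "false"
    }
}
```

Under this model the script above would keep the plain 'load; filter' sequence, which is the behaviour the message says was lost in 0.9.1 and the 0.10 snapshot.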
>>>
>>> Subsequently, PartitionFilterPushDownTransformer.check observes that
>>> the immediate successor of 'load' is _not_ 'filter', whence
>>> getPartitionKeys is never invoked.
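
To make the failure concrete, here is a minimal sketch of the check just described, using a toy plan model. Everything in it is hypothetical (the `Op` class, the `synthetic` flag, both method names); it is not how PartitionFilterPushDownTransformer is written. The second method illustrates one possible relaxation, in the spirit of the postscript's idea of ignoring synthetic operators after the load.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical model of a linear logical plan; none of these names come from Pig.
final class PushDownSketch {

    enum Kind { LOAD, FOREACH, FILTER }

    // An operator plus a flag saying whether the optimizer synthesized it.
    static final class Op {
        final Kind kind;
        final boolean synthetic;
        Op(Kind kind, boolean synthetic) { this.kind = kind; this.synthetic = synthetic; }
    }

    // Strict rule as described in the message: the operator immediately
    // after the load must be a filter, otherwise the rule does not fire.
    static boolean strictCheck(List<Op> plan) {
        return plan.size() >= 2
                && plan.get(0).kind == Kind.LOAD
                && plan.get(1).kind == Kind.FILTER;
    }

    // Relaxed rule (assumption): skip synthesized operators that follow
    // the load, then require the next operator to be a filter.
    static boolean relaxedCheck(List<Op> plan) {
        if (plan.isEmpty() || plan.get(0).kind != Kind.LOAD) return false;
        int i = 1;
        while (i < plan.size() && plan.get(i).synthetic) i++;
        return i < plan.size() && plan.get(i).kind == Kind.FILTER;
    }

    public static void main(String[] args) {
        List<Op> transformed = Arrays.asList(
                new Op(Kind.LOAD, false),
                new Op(Kind.FOREACH, true),   // inserted by the optimizer
                new Op(Kind.FILTER, false));
        System.out.println(strictCheck(transformed));  // prints "false"
        System.out.println(relaxedCheck(transformed)); // prints "true"
    }
}
```

Whether a filter can safely be evaluated ahead of a cast is a separate question that this sketch does not address; it only shows why the extra foreach hides the filter from a check that looks at the immediate successor of the load.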
>>>
>>> Any suggestions?
>>>
>>> Thanks,
>>>
>>> stan
>>>
>>> P.S. While in the above case the 'foreach' can be avoided, in general
>>> typecasting may need to be performed if the user-provided schema does
>>> not match the one returned by the loader.
>>> I think the general case needs to be handled correctly, perhaps by
>>> ignoring all synthetic operators after the 'load'.  (This is just a