Pig >> mail # user >> Partition keys in LoadMetadata is broken in 0.10?


Re: Partition keys in LoadMetadata is broken in 0.10?
Hi Daniel,

Thanks for pointing out PIG-2346.  However, what happens if the user
decides to rename some of the fields using the 'as' clause?  We have
the same problem, i.e., a 'foreach' is generated.  As a heuristic,
perhaps synthesized operators should be marked as such.  This way, pig
can skip synthesized operators when trying to match the sequence
'load; filter'.  Another alternative is to create a new keyword, say
'where', to be used for specifying partitions.  E.g.,

A = load 'daily_activity' from HiveLoader where date_partition > 20110101 and date_partition <= 20110201;
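
To make the first suggestion (marking synthesized operators) concrete, here is a toy model in Python.  It is only a sketch and not Pig's actual optimizer code: the class Op, its 'synthesized' flag, and both check functions are hypothetical names made up for illustration, and the plan is simplified to a linear list of operators.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Op:
    """One operator in a toy, linear logical plan (hypothetical, not a Pig class)."""
    kind: str                   # e.g. "load", "foreach", "filter"
    synthesized: bool = False   # True if the planner inserted it, not the user


def strict_filter_after_load(plan: List[Op]) -> Optional[Op]:
    """Return the filter only if it is the immediate successor of the load."""
    if len(plan) >= 2 and plan[0].kind == "load" and plan[1].kind == "filter":
        return plan[1]
    return None


def relaxed_filter_after_load(plan: List[Op]) -> Optional[Op]:
    """Same check, but step over operators marked as synthesized."""
    if not plan or plan[0].kind != "load":
        return None
    for op in plan[1:]:
        if op.synthesized:
            continue            # planner-inserted, e.g. the project-and-cast foreach
        return op if op.kind == "filter" else None
    return None


# 'load; foreach; filter', where the foreach was inserted by the planner.
plan = [Op("load"), Op("foreach", synthesized=True), Op("filter")]
# Same shape, but the foreach was written by the user and must not be skipped.
user_plan = [Op("load"), Op("foreach"), Op("filter")]
```

With the strict check, 'plan' yields no filter, which mirrors getPartitionKeys never being invoked; the relaxed check finds the filter in 'plan' but still rejects 'user_plan'.  A real rule would presumably also have to map any fields renamed by the synthesized foreach back to the loader's column names before handing the filter to the loader.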

stan

On Sat, Dec 31, 2011 at 9:42 PM, Daniel Dai <[EMAIL PROTECTED]> wrote:
> Hi, Stan,
> Foreach is inserted only if you have "as" in the "load" statement. This is to
> ensure that the loaded data conforms to the "as" clause. At some point there
> was a bug in the implementation; this should be fixed in PIG-2346, and the fix
> will be included in all subsequent releases.
>
> Thanks,
> Daniel
>
> On Fri, Dec 30, 2011 at 9:54 AM, Stan Rosenberg <[EMAIL PROTECTED]> wrote:
>
>> Howdy All,
>>
>> I am resurrecting my previous message sent to the list on Dec. 7.  Let
>> me first summarize.  In a nutshell, as far as I can tell,
>> partition-aware loading is broken in pig, and the culprit is PIG-1188,
>> wherein the final decision was to introduce project & cast, i.e.,
>> foreach, after load.  There are two problems with that approach.
>> First, as indicated in my original message, 'getPartitionKeys' is
>> never invoked because PIG-1188 changed the expected instruction
>> sequence 'load; filter' to 'load; foreach; filter'.  Second, if a
>> loader already happens to project & cast in order to make the data
>> conform to the schema, then the foreach synthesized by pig is a waste
>> of time.
>>
>> Essentially, we had to undo the patch in 'PIG-1188' in order to get
>> partition filters to work; this enabled us to implement a HiveLoader
>> very much like HCatLoader, which incidentally is also broken for the
>> very same reason.  This is obviously a hack and a real solution is
>> needed.  If the decision made in PIG-1188 cannot be reconsidered,
>> then I suggest that we revisit the logic which is used to pass
>> partition filters to partition-aware loaders.
>>
>> Many thanks!
>>
>> stan
>>
>>
>>
>> ---------- Forwarded message ----------
>> From: Stan Rosenberg <[EMAIL PROTECTED]>
>> Date: Wed, Dec 7, 2011 at 12:24 PM
>> Subject: Partition keys in LoadMetadata is broken in 0.10?
>> To: [EMAIL PROTECTED]
>>
>>
>> Hi,
>>
>> I am trying to implement a loader which is partition-aware.  As
>> prescribed, my loader implements LoadMetadata; however,
>> getPartitionKeys is never invoked.
>> The script is of this form:
>>
>> X = LOAD 'input' USING MyLoader();
>> X = FILTER X BY partition_col == 'some_string';
>>
>> and the schema returned by MyLoader.getSchema includes the column
>> 'partition_col' which is of type 'chararray'.
>>
>>
>> After debugging pig, I have found what appears to be a bug in the new
>> code (version 0.10 snapshot and also in 0.9.1).
>> MyLoader.getPartitionKeys is never invoked because of the wrongly
>> inserted 'foreach' after the 'load' and before the 'filter'.  The
>> code in TypeCastInserterTransformer.check used to return 'false' if
>> the schemas matched or all fields were of type 'bytearray'; cf. pig
>> version 0.8.1.
>> Effectively, the above script gets transformed into:
>>
>> X = LOAD 'input' USING MyLoader();
>> X = FOREACH X GENERATE ...;
>> X = FILTER X BY partition_col == 'some_string';
>>
>> Subsequently, PartitionFilterPushDownTransformer.check observes that
>> the immediate successor of 'load' is _not_ 'filter', whence
>> getPartitionKeys is never invoked.
>>
>> Any suggestions?
>>
>> Thanks,
>>
>> stan
>>
>> P.S. While in the above case the 'foreach' can be avoided, in general
>> typecasting may need to be performed if the user-provided schema does
>> not match the one returned by the loader.