Jeff Yuan 2013-03-14, 20:31
Rohini Palaniswamy 2013-03-14, 20:51
Jeff Yuan 2013-03-14, 21:00
Rohini Palaniswamy 2013-03-14, 21:30
Re: Loader partitioning on field
Jeff Yuan 2013-03-14, 22:03
Rohini, I see your point.
One follow-up question: it's possible for the result of a UDF to be
constant and not dependent on the contents of each record, right? Is Pig
able to make such a determination in this case and push such UDF
results down to the load?
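
To make the constraint discussed in this thread concrete, here is a minimal
sketch of a front-end check that only treats "partition key == constant" as
pushable and leaves a UDF call on the right-hand side alone. Every type in it
(Expr, Column, Const, UdfCall, Equals) and both helper methods are hypothetical
stand-ins for illustration; they are not Pig's own classes, and the
"key=value" directory layout is an assumption as well.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class PartitionPruningSketch {

    // Hypothetical stand-ins for a filter expression tree (NOT Pig's classes).
    interface Expr {}

    static final class Column implements Expr {
        final String name;
        Column(String name) { this.name = name; }
    }

    static final class Const implements Expr {
        final Object value;
        Const(Object value) { this.value = value; }
    }

    // A UDF call whose result is unknown until the job actually runs.
    static final class UdfCall implements Expr {
        final String name;
        UdfCall(String name) { this.name = name; }
    }

    static final class Equals implements Expr {
        final Expr left;
        final Expr right;
        Equals(Expr left, Expr right) { this.left = left; this.right = right; }
    }

    // Pushable only when a declared partition key is compared to a constant.
    // A UdfCall on the right-hand side is rejected, since nothing has
    // evaluated it yet at planning time.
    static boolean isPushable(Expr e, List<String> partitionKeys) {
        if (!(e instanceof Equals)) {
            return false;
        }
        Equals eq = (Equals) e;
        return eq.left instanceof Column
                && partitionKeys.contains(((Column) eq.left).name)
                && eq.right instanceof Const;
    }

    // Keeps only the partition directories matching a pushable filter.
    // Assumes directories are named "<key>=<value>" (an assumption of this sketch).
    static List<String> prune(List<String> partitionDirs, Equals filter) {
        String wanted = ((Column) filter.left).name + "=" + ((Const) filter.right).value;
        List<String> kept = new ArrayList<>();
        for (String dir : partitionDirs) {
            if (dir.equals(wanted)) {
                kept.add(dir);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> keys = Arrays.asList("timestamp");
        List<String> dirs = Arrays.asList(
                "timestamp=20130313", "timestamp=20130314", "timestamp=20130315");

        Equals constantFilter = new Equals(new Column("timestamp"), new Const(20130314));
        Equals udfFilter = new Equals(new Column("timestamp"), new UdfCall("CUSTOM_UDF"));

        System.out.println(isPushable(constantFilter, keys));
        System.out.println(isPushable(udfFilter, keys));
        System.out.println(prune(dirs, constantFilter));
    }
}
```

In this sketch the UDF-based filter is simply not pushed, so it would have to
be applied as an ordinary filter over all records after loading.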
On Thu, Mar 14, 2013 at 2:30 PM, Rohini Palaniswamy
<[EMAIL PROTECTED]> wrote:
> The filter pushdown to the LoadFunc happens on the front end, before job
> launch, and the UDF has not been evaluated at that point. So you need to
> have constants in your filter condition.
> Logical plan is internal to pig and will never be exposed. Refer
> On Thu, Mar 14, 2013 at 2:00 PM, Jeff Yuan <[EMAIL PROTECTED]> wrote:
>> Thanks! Regarding 1), the case where there is a UDF in the filter step on
>> a partition field: is the UDF not evaluated first, with the result then
>> passed to the load function?
>> A separate question: In a LoadFunc, is there a way to get a reference
>> to the logical query plan?
>> Thanks again.
>> On Thu, Mar 14, 2013 at 1:51 PM, Rohini Palaniswamy
>> <[EMAIL PROTECTED]> wrote:
>> > Jeff,
>> > 1) It should not. If it does push, then it is a bug in pig.
>> > 2) I think it should be fine.
>> > 3) Look at PColFilterExtractor and PartitionFilterOptimizer
>> > Regards,
>> > Rohini
>> > On Thu, Mar 14, 2013 at 1:31 PM, Jeff Yuan <[EMAIL PROTECTED]> wrote:
>> >> I am writing a loader for a storage format, which partitions by a
>> >> particular field in the record. So I would like to implement something
>> >> which can push down filters on the partitioned field so that the
>> >> record reader does not need to read files that are outside the
>> >> filtered range. In the interface "LoadMetadata", the
>> >> "getPartitionKeys" and "setPartitionFilter" functions seem to support
>> >> what I need (where Pig should pass the filtering expression on the
>> >> declared partition keys to "setPartitionFilter"), but I have a couple
>> >> of questions. I'm going to reference the following example, where
>> >> timestamp is the partition key.
>> >> a = load 'stored_data' using CustomLoader();
>> >> b = filter a by timestamp == CUSTOM_UDF(date, month);
>> >> 1. Would partitioning work in this case where the partition key filter
>> >> includes a UDF?
>> >> 2. Does the filter statement need to be directly after the load
>> >> statement? What I mean is, if I declare a variable c between a and b
>> >> which does some other operation on a, will Pig pass the filter
>> >> expression of b when loading a?
>> >> 3. Can you point out roughly where this "setPartitionFilter" function
>> >> is called in Pig code during the load process? I couldn't seem to find
>> >> it through a search of the Pig source.
>> >> Thanks a lot!
Jonathan Coveney 2013-03-14, 22:15
Jeff Yuan 2013-03-14, 22:56
Jonathan Coveney 2013-03-15, 10:17