Pig >> mail # user >> Loader partitioning on field


Jeff Yuan 2013-03-14, 20:31
Rohini Palaniswamy 2013-03-14, 20:51
Jeff Yuan 2013-03-14, 21:00
Rohini Palaniswamy 2013-03-14, 21:30

Re: Loader partitioning on field
Rohini, I see your point.

One follow-up question: it's possible for the result of a UDF to be
constant and not dependent on the fields of each record, right? Is Pig
able to make such a determination in this case and push such UDF
results down to the load?

Thanks,
Jeff

On Thu, Mar 14, 2013 at 2:30 PM, Rohini Palaniswamy
<[EMAIL PROTECTED]> wrote:
> The filter push down to LoadFunc happens on the front end, before the job
> is launched, and the UDF has not been evaluated at that point. So you need
> to have constants in your filter condition.
>
> The logical plan is internal to Pig and will never be exposed. Refer to
> https://issues.apache.org/jira/browse/PIG-3199
>
> Regards,
> Rohini
>
>
> On Thu, Mar 14, 2013 at 2:00 PM, Jeff Yuan <[EMAIL PROTECTED]> wrote:
>
>> Thanks! Regarding 1), where there is a UDF in the filter step on a
>> partition field: is the UDF not evaluated first, with the result then
>> passed to the load function?
>>
>> A separate question: In a LoadFunc, is there a way to get a reference
>> to the logical query plan?
>>
>> Thanks again.
>>
>> On Thu, Mar 14, 2013 at 1:51 PM, Rohini Palaniswamy
>> <[EMAIL PROTECTED]> wrote:
>> > Jeff,
>> >
>> > 1) It should not. If it does push, then it is a bug in Pig.
>> >
>> > 2) I think it should be fine.
>> >
>> > 3) Look at PColFilterExtractor and PartitionFilterOptimizer
>> >
>> > Regards,
>> >
>> > Rohini
>> >
>> >
>> > On Thu, Mar 14, 2013 at 1:31 PM, Jeff Yuan <[EMAIL PROTECTED]> wrote:
>> >
>> >> I am writing a loader for a storage format, which partitions by a
>> >> particular field in the record. So I would like to implement something
>> >> which can push down filters on the partitioned field so that the
>> >> record reader does not need to read files that are outside the
>> >> filtered range. In the interface "LoadMetadata", the
>> >> "getPartitionKeys" and "setPartitionFilter" functions seem to support
>> >> what I need (where Pig should pass the filtering expression on the
>> >> declared partition keys to "setPartitionFilter"), but I have a couple
>> >> of questions. I'm going to reference the following example, where
>> >> timestamp is the partition key.
>> >>
>> >> a = load 'stored_data' using CustomLoader();
>> >> b = filter a by timestamp == CUSTOM_UDF(date, month);
>> >>
>> >> 1. Would partitioning work in this case where the partition key filter
>> >> includes a UDF?
>> >>
>> >> 2. Does the filter statement need to be directly after the load
>> >> statement? What I mean is, if I declare a variable c between a and b
>> >> which does some other operation on a, will Pig pass the filter
>> >> expression of b when loading a?
>> >>
>> >> 3. Can you point out roughly where this "setPartitionFilter" function
>> >> is called in Pig code during the load process? I couldn't seem to find
>> >> it through a search of the Pig source.
>> >>
>> >> Thanks a lot!
>> >>
>>
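
To make the point about constants concrete, here is a minimal Pig Latin sketch based on the example in the original question. It assumes the loader declares timestamp through "getPartitionKeys"; the value 20130314 and the parameter name ts are made up for illustration, and whether a given condition is actually pushed down still depends on the Pig version and the loader.

```pig
a = load 'stored_data' using CustomLoader();

-- Constant on the right-hand side: the value is known on the front end,
-- so this condition is a candidate for push down via setPartitionFilter.
-- (20130314 is a hypothetical partition value.)
b = filter a by timestamp == 20130314;

-- Alternative: compute the value outside Pig and pass it in, e.g.
--   pig -param ts=20130314 script.pig
-- Parameter substitution runs before the script is parsed, so this
-- filter also ends up comparing the partition key against a constant.
c = filter a by timestamp == $ts;
```

Computing the value outside the script keeps the filter condition free of UDF calls, which is what the reply above asks for.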
Jonathan Coveney 2013-03-14, 22:15
Jeff Yuan 2013-03-14, 22:56
Jonathan Coveney 2013-03-15, 10:17