Re: Loader partitioning on field
Jonathan Coveney 2013-03-15, 10:17
If it is being passed in anyway, you could make it a $PARAM that is set by
the launch script, and then it would be a constant in the script.

2013/3/14 Jeff Yuan <[EMAIL PROTECTED]>

> Well, I have a UDF called "SCHEDULED_TIME()" that returns the time
> when a pig query is scheduled to run in the system. This time is
> passed in by the system to pig when the job is launched. Since I
> partition files by time field, a user could filter based on the result
> of this UDF.
>
> On Thu, Mar 14, 2013 at 3:15 PM, Jonathan Coveney <[EMAIL PROTECTED]>
> wrote:
> > No, it is not. But if it knew that, how would that filter be meaningful?
> > What do you have in mind?
> >
> > 2013/3/14 Jeff Yuan <[EMAIL PROTECTED]>
> >
> >> Rohini, I see your point.
> >>
> >> One followup question: it's possible for the result of a UDF to be
> >> constant and not dependent on the tuples of each record, right? Is Pig
> >> able to make such a determination in this case and push the pushdown
> >> such UDF results to load?
> >>
> >> Thanks,
> >> Jeff
> >>
> >> On Thu, Mar 14, 2013 at 2:30 PM, Rohini Palaniswamy
> >> <[EMAIL PROTECTED]> wrote:
> >> > The filter push down to LoadFunc happens on the front end before the
> >> > job launch and the UDF is still not evaluated then. So you need to
> >> > have constants in your filter condition.
> >> >
> >> > Logical plan is internal to pig and will never be exposed. Refer
> >> > https://issues.apache.org/jira/browse/PIG-3199
> >> >
> >> > Regards,
> >> > Rohini
> >> >
> >> > On Thu, Mar 14, 2013 at 2:00 PM, Jeff Yuan <[EMAIL PROTECTED]>
> >> > wrote:
> >> >
> >> >> Thanks! Regarding 1), where there is a UDF in the filter step on a
> >> >> partition field. The UDF is not first evaluated before and then the
> >> >> result passed to the load function?
> >> >>
> >> >> A separate question: In a LoadFunc, is there a way to get a reference
> >> >> to the logical query plan?
> >> >>
> >> >> Thanks again.
> >> >>
> >> >> On Thu, Mar 14, 2013 at 1:51 PM, Rohini Palaniswamy
> >> >> <[EMAIL PROTECTED]> wrote:
> >> >> > Jeff,
> >> >> >
> >> >> > 1) It should not. If it does push, then it is a bug in pig.
> >> >> >
> >> >> > 2) I think it should be fine.
> >> >> >
> >> >> > 3) Look at PColFilterExtractor and PartitionFilterOptimizer
> >> >> >
> >> >> > Regards,
> >> >> >
> >> >> > Rohini
> >> >> >
> >> >> > On Thu, Mar 14, 2013 at 1:31 PM, Jeff Yuan <[EMAIL PROTECTED]>
> >> >> > wrote:
> >> >> >
> >> >> >> I am writing a loader for a storage format, which partitions by a
> >> >> >> particular field in the record. So I would like to implement
> >> >> >> something which can push down filters on the partitioned field so
> >> >> >> that the record reader does not need to read files that are
> >> >> >> outside the filtered range. In the interface "LoadMetadata", the
> >> >> >> "getPartitionKeys" and "setPartitionFilter" functions seem to
> >> >> >> support what I need (where Pig should pass the filtering
> >> >> >> expression on the declared partition keys to
> >> >> >> "setPartitionFilter", but I have a couple of questions. I'm going
> >> >> >> to reference the following example, where timestamp is the
> >> >> >> partition key.
> >> >> >>
> >> >> >> a = load 'stored_data' using CustomLoader();
> >> >> >> b = filter a by timestamp = CUSTOM_UDF(date, month);
> >> >> >>
> >> >> >> 1. Would partitioning work in this case where the partition key
> >> >> >> filter includes a UDF?
> >> >> >>
> >> >> >> 2. Does the partition statement need to be directly after the load
> >> >> >> statement? What I mean is, if I declare a variable c between a
> >> >> >> and b which does some other operation on a, will Pig pass the
> >> >> >> filter expression of b when loading a?
> >> >> >>
> >> >> >> 3. Can you point out roughly where this "setPartitionFilter"
> >> >> >> function is called in Pig code during the load process? I couldn't
> >> >> >> seem to find it through a search of the Pig source.
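
For readers following the thread, the $PARAM suggestion in the reply at the top might look roughly like the sketch below. It reuses the 'stored_data' and CustomLoader() names from the quoted example. The parameter name SCHEDULED_TIME, its value, and the script file name query.pig are hypothetical, and the sketch assumes timestamp is a numeric field that the loader declares as a partition key via getPartitionKeys.

```pig
-- Hypothetical launch command; the scheduling system would supply the value:
--   pig -param SCHEDULED_TIME=1363282200 query.pig

a = load 'stored_data' using CustomLoader();

-- Parameter substitution replaces $SCHEDULED_TIME with a literal before the
-- script is parsed, so the filter compares the partition key to a constant
-- rather than to the result of a UDF.
b = filter a by timestamp == $SCHEDULED_TIME;
```

Because the value is already a constant when the front end builds the plan, this form fits the constraint Rohini describes for filter push down. Note that the comparison is written with ==, which is the equality operator in Pig Latin.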