Pig, mail # user - Loader partitioning on field


Jeff Yuan 2013-03-14, 20:31
Rohini Palaniswamy 2013-03-14, 20:51
Jeff Yuan 2013-03-14, 21:00
Rohini Palaniswamy 2013-03-14, 21:30
Jeff Yuan 2013-03-14, 22:03
Jonathan Coveney 2013-03-14, 22:15
Jeff Yuan 2013-03-14, 22:56

Re: Loader partitioning on field
Jonathan Coveney 2013-03-15, 10:17
If it is being passed in anyway, you could make it a $PARAM that is set by
the launch script, and then it would be a constant in the script.
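
A minimal sketch of that approach, reusing the loader and field names from the example quoted below. The parameter name SCHEDULED_TIME, the script file name, and the sample value are assumptions for illustration only:

```pig
-- Assumed launch command (name, value, and file are hypothetical):
--   pig -param SCHEDULED_TIME=1363305600 query.pig
-- Pig substitutes $SCHEDULED_TIME into the script text before parsing, so the
-- filter compares the partition key against a constant, not a UDF result.
a = load 'stored_data' using CustomLoader();
b = filter a by timestamp == $SCHEDULED_TIME;
```

With the value substituted up front, the filter condition contains only a constant, which is what the front-end partition filter push-down described in the quoted replies requires.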
2013/3/14 Jeff Yuan <[EMAIL PROTECTED]>

> Well, I have a UDF called "SCHEDULED_TIME()" that returns the time
> when a pig query is scheduled to run in the system. This time is
> passed in by the system to pig when the job is launched. Since I
> partition files by time field, a user could filter based on the result
> of this UDF.
>
>
>
> On Thu, Mar 14, 2013 at 3:15 PM, Jonathan Coveney <[EMAIL PROTECTED]>
> wrote:
> > No, it is not. But if it knew that, how would that filter be meaningful?
> > What do you have in mind?
> >
> >
> > 2013/3/14 Jeff Yuan <[EMAIL PROTECTED]>
> >
> >> Rohini, I see your point.
> >>
> >> One followup question: it's possible for the result of a UDF to be
> >> constant and not dependent on the tuples of each record, right? Is Pig
> >> able to make such a determination in this case and push such UDF
> >> results down to load?
> >>
> >> Thanks,
> >> Jeff
> >>
> >> On Thu, Mar 14, 2013 at 2:30 PM, Rohini Palaniswamy
> >> <[EMAIL PROTECTED]> wrote:
> >> > The filter push down to LoadFunc happens on the front end before the job
> >> > launch, and the UDF has not yet been evaluated at that point. So you need
> >> > to have constants in your filter condition.
> >> >
> >> > The logical plan is internal to Pig and will never be exposed. Refer to
> >> > https://issues.apache.org/jira/browse/PIG-3199
> >> >
> >> > Regards,
> >> > Rohini
> >> >
> >> >
> >> > On Thu, Mar 14, 2013 at 2:00 PM, Jeff Yuan <[EMAIL PROTECTED]> wrote:
> >> >
> >> >> Thanks! Regarding 1), where there is a UDF in the filter step on a
> >> >> partition field: is the UDF not evaluated first, with the result then
> >> >> passed to the load function?
> >> >>
> >> >> A separate question: In a LoadFunc, is there a way to get a reference
> >> >> to the logical query plan?
> >> >>
> >> >> Thanks again.
> >> >>
> >> >> On Thu, Mar 14, 2013 at 1:51 PM, Rohini Palaniswamy
> >> >> <[EMAIL PROTECTED]> wrote:
> >> >> > Jeff,
> >> >> >
> >> >> > 1) It should not. If it does push, then it is a bug in pig.
> >> >> >
> >> >> > 2) I think it should be fine.
> >> >> >
> >> >> > 3) Look at PColFilterExtractor and PartitionFilterOptimizer
> >> >> >
> >> >> > Regards,
> >> >> >
> >> >> > Rohini
> >> >> >
> >> >> >
> >> >> > On Thu, Mar 14, 2013 at 1:31 PM, Jeff Yuan <[EMAIL PROTECTED]> wrote:
> >> >> >
> >> >> >> I am writing a loader for a storage format that partitions by a
> >> >> >> particular field in the record. So I would like to implement something
> >> >> >> which can push down filters on the partitioned field so that the
> >> >> >> record reader does not need to read files that are outside the
> >> >> >> filtered range. In the interface "LoadMetadata", the
> >> >> >> "getPartitionKeys" and "setPartitionFilter" functions seem to support
> >> >> >> what I need (where Pig should pass the filtering expression on the
> >> >> >> declared partition keys to "setPartitionFilter"), but I have a couple
> >> >> >> of questions. I'm going to reference the following example, where
> >> >> >> timestamp is the partition key.
> >> >> >>
> >> >> >> a = load 'stored_data' using CustomLoader();
> >> >> >> b = filter a by timestamp == CUSTOM_UDF(date, month);
> >> >> >>
> >> >> >> 1. Would partitioning work in this case where the partition key filter
> >> >> >> includes a UDF?
> >> >> >>
> >> >> >> 2. Does the filter statement on the partition key need to be directly
> >> >> >> after the load statement? What I mean is, if I declare a variable c
> >> >> >> between a and b which does some other operation on a, will Pig pass
> >> >> >> the filter expression of b when loading a?
> >> >> >>
> >> >> >> 3. Can you point out roughly where this "setPartitionFilter" function
> >> >> >> is called in Pig code during the load process? I couldn't seem to find
> >> >> >> it through a search of the Pig source.