Pig, mail # user - Loader partitioning on field

Loader partitioning on field
Jeff Yuan 2013-03-14, 20:31
I am writing a loader for a storage format, which partitions by a
particular field in the record. So I would like to implement something
which can push down filters on the partitioned field so that the
record reader does not need to read files that are outside the
filtered range. In the "LoadMetadata" interface, the
"getPartitionKeys" and "setPartitionFilter" functions seem to support
what I need (Pig should pass the filtering expression on the
declared partition keys to "setPartitionFilter"), but I have a couple
of questions. I'm going to reference the following example, where
timestamp is the partition key.

a = load 'stored_data' using CustomLoader();
b = filter a by timestamp == CUSTOM_UDF(date, month);

1. Would the partition filter still be pushed down in this case, where
the filter on the partition key includes a UDF?

2. Does the filter statement need to be directly after the load
statement? What I mean is, if I declare a variable c between a and b
which does some other operation on a, will Pig pass the filter
expression of b when loading a?

3. Can you point out roughly where this "setPartitionFilter" function
is called in Pig code during the load process? I couldn't seem to find
it through a search of the Pig source.
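
To make the goal concrete, the pruning the loader should do once it
knows the filtered range can be sketched roughly as below. This is
plain Java and uses no Pig API. The class name "PartitionPruner", the
method "selectPartitions", and the assumption that each partition
directory is named by its numeric timestamp are all hypothetical and
only for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Pig: picks which partition
// directories need to be read for a given timestamp range.
public class PartitionPruner {

    // Assumes the last path component of each directory is the
    // partition's numeric timestamp (an assumption for this sketch).
    // Returns the directories whose timestamp lies in [minTs, maxTs].
    public static List<String> selectPartitions(List<String> partitionDirs,
                                                long minTs, long maxTs) {
        List<String> selected = new ArrayList<>();
        for (String dir : partitionDirs) {
            String name = dir.substring(dir.lastIndexOf('/') + 1);
            long ts;
            try {
                ts = Long.parseLong(name);
            } catch (NumberFormatException e) {
                // Not a partition directory under this naming scheme; skip it.
                continue;
            }
            if (ts >= minTs && ts <= maxTs) {
                selected.add(dir);
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        List<String> dirs = Arrays.asList(
                "stored_data/20130312",
                "stored_data/20130313",
                "stored_data/20130314");
        // Only directories inside the filtered range would be handed
        // to the record reader.
        System.out.println(selectPartitions(dirs, 20130313L, 20130314L));
    }
}
```

The open part is getting the range itself: in the sketch it is passed
in directly, whereas in the loader it would have to come from whatever
expression Pig hands to "setPartitionFilter".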

Thanks a lot!