Loader partitioning on field
I am writing a loader for a storage format that partitions data by a
particular field in the record. I would like to push down filters on
the partition field so that the record reader does not need to read
files that are outside the filtered range. In the "LoadMetadata"
interface, the "getPartitionKeys" and "setPartitionFilter" functions
seem to support what I need (Pig should pass the filter expression on
the declared partition keys to "setPartitionFilter"), but I have a few
questions. I'm going to reference the following example, where
timestamp is the partition key.

a = load 'stored_data' using CustomLoader();
b = filter a by timestamp == CUSTOM_UDF(date, month);

1. Would partitioning work in this case where the partition key filter
includes a UDF?

2. Does the filter statement need to come directly after the load
statement? What I mean is, if I define an alias c between a and b
that does some other operation on a, will Pig still pass the filter
expression of b to the loader when loading a?

3. Can you point out roughly where this "setPartitionFilter" function
is called in Pig code during the load process? I couldn't seem to find
it through a search of the Pig source.
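
To make the goal concrete, here is a minimal sketch of the pruning I have in mind once the loader has received a filter. It does not use any Pig classes: RangeFilter, partitionValue, prune and the "timestamp=<value>" directory layout are all hypothetical names and assumptions of mine, and it assumes the filter on the partition key has already been reduced to a simple inclusive range.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class PartitionPruning {

    /** Hypothetical stand-in for a pushed-down filter: an inclusive range on the partition key. */
    static final class RangeFilter {
        final long min;
        final long max;

        RangeFilter(long min, long max) {
            this.min = min;
            this.max = max;
        }

        boolean accepts(long value) {
            return value >= min && value <= max;
        }
    }

    /** Extracts the partition value from a path such as "stored_data/timestamp=20130314/part-0" (assumed layout). */
    static long partitionValue(String path) {
        String marker = "timestamp=";
        int start = path.indexOf(marker);
        if (start < 0) {
            throw new IllegalArgumentException("no partition key in path: " + path);
        }
        start += marker.length();
        int end = path.indexOf('/', start);
        String raw = (end < 0) ? path.substring(start) : path.substring(start, end);
        return Long.parseLong(raw);
    }

    /** Keeps only the paths whose partition value passes the filter; a null filter keeps everything. */
    static List<String> prune(List<String> paths, RangeFilter filter) {
        List<String> kept = new ArrayList<>();
        for (String path : paths) {
            if (filter == null || filter.accepts(partitionValue(path))) {
                kept.add(path);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> paths = Arrays.asList(
                "stored_data/timestamp=20130301/part-0",
                "stored_data/timestamp=20130314/part-0",
                "stored_data/timestamp=20130401/part-0");
        // Only the files in the March 10 to March 31 range should be handed to the record reader.
        System.out.println(prune(paths, new RangeFilter(20130310L, 20130331L)));
    }
}
```

In the real loader, the range would have to be derived from whatever expression Pig hands to "setPartitionFilter", which is the part I am unsure about.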

Thanks a lot!
Rohini Palaniswamy 2013-03-14, 20:51
Jeff Yuan 2013-03-14, 21:00
Rohini Palaniswamy 2013-03-14, 21:30
Jeff Yuan 2013-03-14, 22:03
Jonathan Coveney 2013-03-14, 22:15
Jeff Yuan 2013-03-14, 22:56
Jonathan Coveney 2013-03-15, 10:17