Home | About | Sematext search-lucene.com search-hadoop.com
 Search Hadoop and all its subprojects:

Switch to Threaded View
Pig, mail # dev - Review Request 16717: PIG-3562 Implement combiner optimizations for DISTINCT in Tez


Copy link to this message
-
Review Request 16717: PIG-3562 Implement combiner optimizations for DISTINCT in Tez
Alex Bain 2014-01-08, 01:47

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/16717/
-----------------------------------------------------------

Review request for pig, Cheolsoo Park, Daniel Dai, Mark Wagner, and Rohini Palaniswamy.
Bugs: PIG-3562
    https://issues.apache.org/jira/browse/PIG-3562
Repository: pig-git
Description
-------

Implement DISTINCT combiner optimizations in Tez

1. Use a combiner with normal uses of DISTINCT. In MR Pig, there are some global variables and a special DistinctCombiner class that throws away the duplicate tuples. We could hack this into Pig-on-Tez, but instead I just reused the reduce plan as the combiner plan, which does the same thing (through a POPackage->POProject->POForEach with the setDistinct property set to true).

I'm a little bit concerned that this combiner plan could somehow be slower than the special DistinctCombiner class, but I don't see how.

There is also a special CombinerPackager packager that I did NOT use for this. I think that packager is really intended for use with the algebraic UDF combiner optimizations only.

2. I carefully verified that DISTINCT nested inside a FOREACH code block is optimized by the CombinerOptimizer into an algebraic UDF version of DISTINCT. I added TestTezCompiler and e2e tests for this. Cheolsoo already made all the combiner changes for this to work correctly - I didn't make any code changes here.
Diffs
-----

  src/org/apache/pig/backend/hadoop/executionengine/tez/TezCompiler.java a7de3a7
  test/e2e/pig/tests/tez.conf 71cdcbc
  test/org/apache/pig/test/data/GoldenFiles/TEZC13.gld PRE-CREATION
  test/org/apache/pig/test/data/GoldenFiles/TEZC5.gld 35d9313
  test/org/apache/pig/tez/TestTezCompiler.java 2252531

Diff: https://reviews.apache.org/r/16717/diff/
Testing
-------

Updated golden file for existing TestTezCompiler DISTINCT test to include combiner plan
Added TestTezCompiler test and golden file for DISTINCT algebraic udf combiner
Added e2e test that runs DISTINCT with algebraic udf combiner
I am getting some test-e2e-tez failures in ORDER BY tests, but I am also getting these in a clean Tez branch. My new e2e test passes.
Thanks,

Alex Bain