Home | About | Sematext search-lucene.com search-hadoop.com
 Search Hadoop and all its subprojects:

Switch to Plain View
Pig, mail # dev - Review Request 16717: PIG-3562 Implement combiner optimizations for DISTINCT in Tez


+
Alex Bain 2014-01-08, 01:47
Copy link to this message
-
Re: Review Request 16717: PIG-3562 Implement combiner optimizations for DISTINCT in Tez
Cheolsoo Park 2014-01-08, 22:53

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/16717/#review31403
-----------------------------------------------------------

Ship it!
I will commit it after running tests.

- Cheolsoo Park
On Jan. 8, 2014, 1:47 a.m., Alex Bain wrote:
>
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/16717/
> -----------------------------------------------------------
>
> (Updated Jan. 8, 2014, 1:47 a.m.)
>
>
> Review request for pig, Cheolsoo Park, Daniel Dai, Mark Wagner, and Rohini Palaniswamy.
>
>
> Bugs: PIG-3562
>     https://issues.apache.org/jira/browse/PIG-3562
>
>
> Repository: pig-git
>
>
> Description
> -------
>
> Implement DISTINCT combiner optimizations in Tez
>
> 1. Use a combiner with normal uses of DISTINCT. In MR Pig, there are some global variables and a special DistinctCombiner class that throws away the duplicate tuples. We could hack this into Pig-on-Tez, but instead I just reused the reduce plan as the combiner plan, which does the same thing (through a POPackage->POProject->POForEach with the setDistinct property set to true).
>
> I'm a little bit concerned that this combiner plan could somehow be slower than the special DistinctCombiner class, but I don't see how.
>
> There is also a special CombinerPackager packager that I did NOT use for this. I think that packager is really intended for use with the algebraic UDF combiner optimizations only.
>
> 2. I carefully verified that DISTINCT nested inside a FOREACH code block is optimized by the CombinerOptimizer into an algebraic UDF version of DISTINCT. I added TestTezCompiler and e2e tests for this. Cheolsoo already made all the combiner changes for this to work correctly - I didn't make any code changes here.
>
>
> Diffs
> -----
>
>   src/org/apache/pig/backend/hadoop/executionengine/tez/TezCompiler.java a7de3a7
>   test/e2e/pig/tests/tez.conf 71cdcbc
>   test/org/apache/pig/test/data/GoldenFiles/TEZC13.gld PRE-CREATION
>   test/org/apache/pig/test/data/GoldenFiles/TEZC5.gld 35d9313
>   test/org/apache/pig/tez/TestTezCompiler.java 2252531
>
> Diff: https://reviews.apache.org/r/16717/diff/
>
>
> Testing
> -------
>
> Updated golden file for existing TestTezCompiler DISTINCT test to include combiner plan
> Added TestTezCompiler test and golden file for DISTINCT algebraic udf combiner
> Added e2e test that runs DISTINCT with algebraic udf combiner
> I am getting some test-e2e-tez failures in ORDER BY tests, but I am also getting these in a clean Tez branch. My new e2e test passes.
>
>
> Thanks,
>
> Alex Bain
>
>