Home | About | Sematext search-lucene.com search-hadoop.com
 Search Hadoop and all its subprojects:

Switch to Plain View
Pig, mail # dev - Review Request 15261: PIG-3555 Initial implementation of Tez combiner optimization


+
Cheolsoo Park 2013-11-06, 11:04
+
Cheolsoo Park 2013-11-06, 17:00
+
Cheolsoo Park 2013-11-06, 23:55
+
Alex Bain 2013-11-08, 00:55
+
Cheolsoo Park 2013-11-08, 07:37
Copy link to this message
-
Re: Review Request 15261: PIG-3555 Initial implementation of Tez combiner optimization
Cheolsoo Park 2013-11-08, 07:57

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/15261/
-----------------------------------------------------------

(Updated Nov. 8, 2013, 7:57 a.m.)
Review request for pig, Alex Bain, Daniel Dai, Mark Wagner, and Rohini Palaniswamy.
Changes
-------

Updated patch includes the following improvements-
# CombinerOptimizer stores TezOperPlan (parent plan of TezOperators), so I can retrieve predecessor vertices via getPredecessors(). To detect the POLocalRearrange -> POPackage pattern, 1) I find POPackage in the current vertex, and 2) I look up predecessors to find POLocalRearrange. If I find them in both 1 and 2, I add combiners.
# Combine plan is set to both input and output descriptors (as per Rohini's suggestion). Since they're identical, I do not print them twice in the explain output.
# TestCombiner unit tests pass with TezMiniCluster. The unit tests remain almost unchanged except that 1) PigServer dynamically takes a exec type via a system property, and 2) PigServer.shutdown() is called after each test case to delete temporary files.

Only remaining issues that I haven't addressed are what's related to multiple input/outputs. I am wondering whether we can resolve them in a separate jira given that Mark and Rohini are actively working on them. What do others think?
Bugs: PIG-3555
    https://issues.apache.org/jira/browse/PIG-3555
Repository: pig-git
Description
-------

Initial implementation of Tez combiner optimizer. The patch includes the following changes-
* Factored out CombinerOptimizer code into a utility class called CombinerOptimizerUtil. So both MR and Tez CombinerOptimizer use this utility class instead of duplicating code.
* Introduced a new class called TezEdgeDescriptor that holds combine plans as well as various edge properties.
* Added TezEdgeDescriptors to TezOperator. Note that I added multiple descriptors for inbound edges but a single descriptor for all the outbound edges. This is because TezDagBuilder always creates an edge by connecting predecessors to the current vertex. Please let me know if you think we should allow multiple descriptors for outbound edges too.
* Refactored some code in TezDagBuilder while touching it.
Diffs (updated)
-----

  src/org/apache/pig/PigServer.java c0826ea
  src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/CombinerOptimizer.java 18a382b
  src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/MRExecType.java 07d737d
  src/org/apache/pig/backend/hadoop/executionengine/tez/CombinerOptimizer.java e69de29
  src/org/apache/pig/backend/hadoop/executionengine/tez/TezCompiler.java 0b1f3c9
  src/org/apache/pig/backend/hadoop/executionengine/tez/TezDagBuilder.java 45e47b0
  src/org/apache/pig/backend/hadoop/executionengine/tez/TezEdgeDescriptor.java e69de29
  src/org/apache/pig/backend/hadoop/executionengine/tez/TezLauncher.java 3f14644
  src/org/apache/pig/backend/hadoop/executionengine/tez/TezPrinter.java 5a42ded
  src/org/apache/pig/backend/hadoop/executionengine/util/CombinerOptimizerUtil.java e69de29
  test/org/apache/pig/test/TestCombiner.java 6252b51
  test/org/apache/pig/test/data/GoldenFiles/TEZC1.gld 925f07e
  test/org/apache/pig/test/data/GoldenFiles/TEZC2.gld a3974fe
  test/org/apache/pig/test/data/GoldenFiles/TEZC3.gld a8c942b
  test/org/apache/pig/test/data/GoldenFiles/TEZC4.gld fb7c903
  test/org/apache/pig/test/data/GoldenFiles/TEZC5.gld e6cd25e
  test/tez-tests c22a448

Diff: https://reviews.apache.org/r/15261/diff/
Testing
-------

ant test-tez passes.
ant test-e2e-tez passes.

I didn't add new test cases, but an e2e test case (Checkin_3) includes an algebraic udf (count) following group-by. I also manually tested it on a live cluster.
Thanks,

Cheolsoo Park

+
Rohini Palaniswamy 2013-11-08, 23:12
+
Cheolsoo Park 2013-11-09, 00:33
+
Cheolsoo Park 2013-11-10, 04:31
+
Mark Wagner 2013-11-11, 21:16
+
Cheolsoo Park 2013-11-11, 21:41
+
Mark Wagner 2013-11-06, 22:54
+
Cheolsoo Park 2013-11-08, 07:37
+
Rohini Palaniswamy 2013-11-06, 19:46
+
Rohini Palaniswamy 2013-11-06, 22:15
+
Cheolsoo Park 2013-11-06, 20:55