Review Request 19724: PIG-3814: Implement RANK in Tez

This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/19724/

Review request for pig, Cheolsoo Park and Daniel Dai.
Bugs: PIG-3814
    https://issues.apache.org/jira/browse/PIG-3814
Repository: pig
Description

The Rank implementation in Tez differs from the MR implementation.
  * The MR implementation has one map-only job (POCounter), which sets the current taskId at position 0 of the tuple and the local map task counter at position 1. It also emits job counters for the number of records in each map task. JobControlCompiler collects those counters, calculates the offsets, and launches the next map-only job (PORank) with the offset information in the jobconf.
  * The Tez implementation has three vertices. Vertex 1 outputs tuples from POCounter to Vertex 3. It also outputs the counters to Vertex 2, which calculates the offsets and broadcasts them to Vertex 3.
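
The offset calculation mentioned in both bullets can be sketched roughly as follows. This is only an illustration and not the code in the patch: the class and method names are hypothetical, and it assumes that the local counter starts at 1 within each task and that tasks are ordered by their integer task id.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch; not the actual POCounterStatsTez / PORankTez code.
class RankOffsetSketch {

    // Given the number of records counted by each task (keyed by task id),
    // return each task's starting offset: the total number of records in
    // all tasks with a smaller task id.
    static Map<Integer, Long> computeOffsets(Map<Integer, Long> recordsPerTask) {
        Map<Integer, Long> offsets = new TreeMap<>();
        long running = 0L;
        // Copying into a TreeMap gives ascending task-id iteration order.
        for (Map.Entry<Integer, Long> e : new TreeMap<>(recordsPerTask).entrySet()) {
            offsets.put(e.getKey(), running);
            running += e.getValue();
        }
        return offsets;
    }

    // Global rank of a record = offset of its task + its local counter
    // (assumed here to start at 1 within the task).
    static long globalRank(Map<Integer, Long> offsets, int taskId, long localCounter) {
        return offsets.get(taskId) + localCounter;
    }

    public static void main(String[] args) {
        Map<Integer, Long> counts = new TreeMap<>();
        counts.put(0, 3L);
        counts.put(1, 5L);
        counts.put(2, 2L);
        Map<Integer, Long> offsets = computeOffsets(counts);
        System.out.println(offsets);
        System.out.println(globalRank(offsets, 2, 1L));
    }
}
```

In the MR flow this computation would happen on the client between the two jobs, while in the Tez flow it would sit in Vertex 2, whose result is broadcast to every task of Vertex 3.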

Other changes made:
   - Changed taskid to be Integer instead of String to reduce memory overhead.

Possible optimizations:
   - POCounter sets the current taskId at position 0 of the tuple and the counter at position 1. PORank creates a new tuple of size-1 to remove the task id and copies over the rest, which is a lot of overhead. We could instead set the task id as the last element of the tuple and remove it from the ArrayList, avoiding the array copy. Will create a separate jira for that.
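
To make the trade-off in that optimization concrete, here is a minimal sketch that uses a plain `List<Object>` as a stand-in for the tuple's fields. It does not use Pig's Tuple classes, and both method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; a List<Object> stands in for the tuple's fields.
class TaskIdPlacementSketch {

    // Current layout: [taskId, counter, field...]. Dropping the task id
    // allocates a new list of size-1 and copies every remaining field.
    static List<Object> dropLeadingTaskId(List<Object> fields) {
        List<Object> out = new ArrayList<>(fields.size() - 1);
        for (int i = 1; i < fields.size(); i++) {
            out.add(fields.get(i));
        }
        return out;
    }

    // Proposed layout: [counter, field..., taskId]. Removing the last
    // element of an ArrayList shifts no other elements, so the list is
    // trimmed in place and the task id is returned to the caller.
    static Object popTrailingTaskId(List<Object> fields) {
        return fields.remove(fields.size() - 1);
    }
}
```

The first method does work proportional to the number of fields for every record, while the second is a constant-time removal on an `ArrayList`.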
Diffs

  http://svn.apache.org/repos/asf/pig/branches/tez/ivy/libraries.properties 1582317
  http://svn.apache.org/repos/asf/pig/branches/tez/src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/PigMapReduceCounter.java 1582317
  http://svn.apache.org/repos/asf/pig/branches/tez/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POCounter.java 1582317
  http://svn.apache.org/repos/asf/pig/branches/tez/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/PORank.java 1582317
  http://svn.apache.org/repos/asf/pig/branches/tez/src/org/apache/pig/backend/hadoop/executionengine/tez/POValueOutputTez.java 1582317
  http://svn.apache.org/repos/asf/pig/branches/tez/src/org/apache/pig/backend/hadoop/executionengine/tez/PigProcessor.java 1582317
  http://svn.apache.org/repos/asf/pig/branches/tez/src/org/apache/pig/backend/hadoop/executionengine/tez/TezCompiler.java 1582317
  http://svn.apache.org/repos/asf/pig/branches/tez/src/org/apache/pig/backend/hadoop/executionengine/tez/TezDagBuilder.java 1582317
  http://svn.apache.org/repos/asf/pig/branches/tez/src/org/apache/pig/backend/hadoop/executionengine/tez/TezEdgeDescriptor.java 1582317
  http://svn.apache.org/repos/asf/pig/branches/tez/src/org/apache/pig/backend/hadoop/executionengine/tez/TezTaskConfigurable.java PRE-CREATION
  http://svn.apache.org/repos/asf/pig/branches/tez/src/org/apache/pig/backend/hadoop/executionengine/tez/operators/POCounterStatsTez.java PRE-CREATION
  http://svn.apache.org/repos/asf/pig/branches/tez/src/org/apache/pig/backend/hadoop/executionengine/tez/operators/POCounterTez.java PRE-CREATION
  http://svn.apache.org/repos/asf/pig/branches/tez/src/org/apache/pig/backend/hadoop/executionengine/tez/operators/PORankTez.java PRE-CREATION
  http://svn.apache.org/repos/asf/pig/branches/tez/test/e2e/pig/drivers/TestDriverPig.pm 1582317
  http://svn.apache.org/repos/asf/pig/branches/tez/test/e2e/pig/tests/nightly.conf 1582317
  http://svn.apache.org/repos/asf/pig/branches/tez/test/org/apache/pig/test/data/GoldenFiles/TEZC20.gld PRE-CREATION
  http://svn.apache.org/repos/asf/pig/branches/tez/test/org/apache/pig/test/data/GoldenFiles/TEZC21.gld PRE-CREATION
  http://svn.apache.org/repos/asf/pig/branches/tez/test/org/apache/pig/tez/TestTezCompiler.java 1582317

Diff: https://reviews.apache.org/r/19724/diff/
Testing

Enabled the Rank e2e tests for Tez. All pass except Rank 9 and Rank 11. Rank 9 hits a Tez map output data corruption issue that I have yet to investigate. Rank 11 is an issue with SPLIT, and I know the reason: the output keys need to be updated in MultiQueryOptimizerTez after Tez operators have been merged. That is already done for POFRJoinTez, but I am trying to think of a generic way to do this (new interfaces to get input keys and output keys) so that we don't have to add every operator to MultiQueryOptimizerTez. Will do that in a separate jira.
Thanks,

Rohini Palaniswamy