Re: Review Request 16860: PIG-3644: Implement skewed join in Tez

(Updated Jan. 17, 2014, 12:49 a.m.)
Review request for pig, Alex Bain, Daniel Dai, Mark Wagner, and Rohini Palaniswamy.

Per Rohini's request, I am uploading the final patch that I committed to the tez branch.
Bugs: PIG-3644
Repository: pig-git

Skewed join in Tez is implemented in 5 vertices:
Vertex 1) Load and sample the skewed table => broadcast the sampling input to vertex 2 and shuffle the entire input to vertex 3.
Vertex 2) Sampling aggregation vertex => build the distribution map and broadcast it to vertices 3 and 4.
Vertex 3) POLocalRearrangeTez for the skewed table => partition the skewed table using SkewedPartitioner and shuffle it to vertex 5.
Vertex 4) POPartitionRearrangeTez for the streaming table => shuffle the streaming table to vertex 5.
Vertex 5) Join the inputs from vertices 3 and 4.
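
The vertex layout above can be sketched as a small edge list. This is only an illustration of the data flow described in this request: the class SkewedJoinDagSketch and its Edge and EdgeType types are hypothetical names made up for the example, and none of it uses the Tez DAG API or code from the patch.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical model of the five-vertex layout; not the Tez DAG API. */
public class SkewedJoinDagSketch {
    enum EdgeType { BROADCAST, SHUFFLE }

    /** One directed edge between two vertices, numbered as in the description. */
    static final class Edge {
        final int from;
        final int to;
        final EdgeType type;
        final String payload;

        Edge(int from, int to, EdgeType type, String payload) {
            this.from = from;
            this.to = to;
            this.type = type;
            this.payload = payload;
        }
    }

    /** Edges exactly as listed in the vertex descriptions. */
    static List<Edge> edges() {
        List<Edge> edges = new ArrayList<>();
        edges.add(new Edge(1, 2, EdgeType.BROADCAST, "sampling input from skewed table"));
        edges.add(new Edge(1, 3, EdgeType.SHUFFLE, "entire skewed table"));
        edges.add(new Edge(2, 3, EdgeType.BROADCAST, "distribution map"));
        edges.add(new Edge(2, 4, EdgeType.BROADCAST, "distribution map"));
        edges.add(new Edge(3, 5, EdgeType.SHUFFLE, "partitioned skewed table"));
        edges.add(new Edge(4, 5, EdgeType.SHUFFLE, "streaming table"));
        return edges;
    }

    public static void main(String[] args) {
        // Print the data flow, one edge per line.
        for (Edge e : edges()) {
            System.out.printf("vertex %d -> vertex %d [%s]: %s%n",
                    e.from, e.to, e.type, e.payload);
        }
    }
}
```

Listing the edges this way makes it easy to see that vertex 5 has exactly two inputs, both shuffled, while the distribution map is the only data that reaches vertices 3 and 4 by broadcast.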

New classes for Tez:
- POPoissonSample) Sampling operator for skewed join.
- POPartitionRearrangeTez) Sub-class of POPartitionRearrange for Tez.
- SkewedPartitionerTez) Sub-class of SkewedPartitioner for Tez.
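
As background for the partitioner and rearrange classes, here is a minimal sketch of the general skewed-join partitioning idea. It rests on assumptions that this request does not state: that the distribution map assigns each skewed key a contiguous, inclusive range of reducers, that skewed-table tuples rotate through that range, and that streaming-table tuples are copied to every reducer in it. SkewedPartitionSketch, Range, and all method names are hypothetical and are not taken from the patch.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical illustration of skewed-join partitioning; not code from the patch. */
public class SkewedPartitionSketch {
    /** Inclusive range of reducer indices reserved for one skewed key (assumed shape). */
    static final class Range {
        final int min;
        final int max;

        Range(int min, int max) {
            this.min = min;
            this.max = max;
        }
    }

    private final Map<String, Range> distribution;
    private final int numReducers;
    // Next reducer to use for each skewed key on the skewed-table side.
    private final Map<String, Integer> next = new HashMap<>();

    SkewedPartitionSketch(Map<String, Range> distribution, int numReducers) {
        this.distribution = distribution;
        this.numReducers = numReducers;
    }

    /** Skewed-table side: each tuple goes to one reducer, rotating through the key's range. */
    int partitionSkewed(String key) {
        Range r = distribution.get(key);
        if (r == null) {
            return hashPartition(key);
        }
        int current = next.getOrDefault(key, r.min);
        next.put(key, current == r.max ? r.min : current + 1);
        return current;
    }

    /** Streaming-table side: a tuple is copied to every reducer that may hold its key. */
    List<Integer> partitionStreaming(String key) {
        List<Integer> targets = new ArrayList<>();
        Range r = distribution.get(key);
        if (r == null) {
            targets.add(hashPartition(key));
            return targets;
        }
        for (int i = r.min; i <= r.max; i++) {
            targets.add(i);
        }
        return targets;
    }

    /** Keys absent from the distribution map fall back to plain hash partitioning. */
    private int hashPartition(String key) {
        return Math.floorMod(key.hashCode(), numReducers);
    }

    public static void main(String[] args) {
        Map<String, Range> distribution = new HashMap<>();
        distribution.put("hot", new Range(2, 4));
        SkewedPartitionSketch p = new SkewedPartitionSketch(distribution, 8);
        for (int i = 0; i < 4; i++) {
            System.out.println("skewed tuple with key hot -> reducer " + p.partitionSkewed("hot"));
        }
        System.out.println("streaming tuple with key hot -> reducers " + p.partitionStreaming("hot"));
    }
}
```

Under these assumptions, every reducer that receives part of a skewed key also receives all streaming tuples for that key, so the join in the final vertex stays correct while the heavy key is spread over several reducers.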

Note that there are a couple of places I can refactor. For example:
- POPoissonSample and PoissonSampleLoader
- POPartitionRearrangeTez and POLocalRearrangeTez

I will do this in follow-up JIRAs.
Diffs (updated)

  src/org/apache/pig/PigConfiguration.java ccf3635
  src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/partitioners/SkewedPartitioner.java 4790abe
  src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POPoissonSample.java PRE-CREATION
  src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POReservoirSample.java bcb339c
  src/org/apache/pig/backend/hadoop/executionengine/tez/POLocalRearrangeTez.java 585509d
  src/org/apache/pig/backend/hadoop/executionengine/tez/POPartitionRearrangeTez.java PRE-CREATION
  src/org/apache/pig/backend/hadoop/executionengine/tez/POShuffleTezLoad.java e9d8e64
  src/org/apache/pig/backend/hadoop/executionengine/tez/PigProcessor.java e22c319
  src/org/apache/pig/backend/hadoop/executionengine/tez/SkewedPartitionerTez.java PRE-CREATION
  src/org/apache/pig/backend/hadoop/executionengine/tez/TezCompiler.java 632eae5
  src/org/apache/pig/backend/hadoop/executionengine/tez/TezDagBuilder.java 53b255e
  src/org/apache/pig/backend/hadoop/executionengine/tez/TezOperator.java 93e522f
  src/org/apache/pig/backend/hadoop/executionengine/tez/WeightedRangePartitionerTez.java 7bcc79e
  src/org/apache/pig/impl/builtin/PartitionSkewedKeys.java 7ce0e82
  src/org/apache/pig/impl/builtin/PoissonSampleLoader.java 5ce5b9e
  test/e2e/pig/tests/tez.conf ac254e5

Diff: https://reviews.apache.org/r/16860/diff/

- Added e2e test cases for inner and outer skewed joins.
- Unit tests pass.
- e2e tests pass.

Cheolsoo Park