Pig, mail # dev - Review Request 16313: PIG-3604 Implement replicated join in Tez


Re: Review Request 16313: PIG-3604 Implement replicated join in Tez
Cheolsoo Park 2013-12-17, 17:41


> On Dec. 17, 2013, 3:52 p.m., Rohini Palaniswamy wrote:
> > The approach is good when the replicated join is not the first vertex of the DAG (i.e., in MR terms, the replicated join is part of a reduce). If it is the first vertex of the DAG, we need to compare and confirm that this approach does not regress against MR's map-only replicated join using distributed cache. Created PIG-3631 for follow-up.

Thank you Rohini for the review. I totally agree that we should measure performance.

But even in MR, you have two jobs: the first one loads a small table and stages it on distributed cache, and the second one does the join. In Tez, I am replacing the first job with a vertex that broadcasts the small table. So the performance difference will be between copying a file to distributed cache and broadcasting it to the downstream vertex. My assumption is that broadcasting is fast since it doesn't have the sort phase. Of course, I might be wrong.
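
For readers less familiar with replicated joins, the sketch below illustrates the part that is the same in both engines: the small table is loaded into an in-memory hash table, and the large table is streamed past it. It is a minimal standalone illustration, not Pig's POFRJoin or POFRJoinTez code; the class name, method names, and the two-field String[] row layout are all assumptions made for the example. Only the source of the small table (files on distributed cache in MR, a broadcast edge in Tez) differs between the two approaches discussed above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of a replicated (hash) join; not Pig code.
public class ReplicatedJoinSketch {

    // Build phase: hold the entire small relation in memory, keyed on the
    // join key (assumed here to be field 0, with the value in field 1).
    static Map<String, List<String>> buildTable(List<String[]> smallRelation) {
        Map<String, List<String>> table = new HashMap<>();
        for (String[] row : smallRelation) {
            table.computeIfAbsent(row[0], k -> new ArrayList<>()).add(row[1]);
        }
        return table;
    }

    // Probe phase: stream the large relation and emit one output row per
    // match (inner join). No sort or shuffle of the large relation is needed.
    static List<String> join(Map<String, List<String>> table,
                             List<String[]> largeRelation) {
        List<String> out = new ArrayList<>();
        for (String[] row : largeRelation) {
            List<String> matches = table.get(row[0]);
            if (matches == null) {
                continue; // key absent from the small relation
            }
            for (String smallValue : matches) {
                out.add(row[0] + "," + row[1] + "," + smallValue);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String[]> small = Arrays.asList(
                new String[] {"1", "a"},
                new String[] {"2", "b"});
        List<String[]> large = Arrays.asList(
                new String[] {"1", "x"},
                new String[] {"3", "y"},
                new String[] {"1", "z"});
        for (String joined : join(buildTable(small), large)) {
            System.out.println(joined);
        }
    }
}
```

Because the probe side never needs sorting, the cost being compared is only how the small table reaches the join tasks.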

I will address your comments below shortly. Thank you!
- Cheolsoo
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/16313/#review30533
-----------------------------------------------------------
On Dec. 17, 2013, 3:51 a.m., Cheolsoo Park wrote:
>
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/16313/
> -----------------------------------------------------------
>
> (Updated Dec. 17, 2013, 3:51 a.m.)
>
>
> Review request for pig, Alex Bain, Daniel Dai, Mark Wagner, and Rohini Palaniswamy.
>
>
> Bugs: PIG-3604
>     https://issues.apache.org/jira/browse/PIG-3604
>
>
> Repository: pig-git
>
>
> Description
> -------
>
> Implemented replicated join in Tez as follows:
> - POFRJoinTez extends POFRJoin. The difference between the two is that the replication hash table is constructed from broadcast edges in Tez instead of files on distributed cache in MR.
> - TezCompiler adds a vertex per replicated table and connects it to the POFRJoin vertex via a broadcast edge.
>
> Note that in POLocalRearrangeTez, I package tuples in the same way for broadcast and scatter/gather edges, so I removed outputType (DataMovementType).
>
>
> Diffs
> -----
>
>   src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POFRJoin.java d7c54d8
>   src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POMergeJoin.java e900751
>   src/org/apache/pig/backend/hadoop/executionengine/tez/POFRJoinTez.java e69de29
>   src/org/apache/pig/backend/hadoop/executionengine/tez/POLocalRearrangeTez.java cda5d89
>   src/org/apache/pig/backend/hadoop/executionengine/tez/PigProcessor.java 7a1736a
>   src/org/apache/pig/backend/hadoop/executionengine/tez/TezCompiler.java 2584501
>   test/e2e/pig/tests/tez.conf b280698
>   test/org/apache/pig/test/data/GoldenFiles/TEZC10.gld e69de29
>   test/org/apache/pig/tez/TestTezCompiler.java 79dc94e
>
> Diff: https://reviews.apache.org/r/16313/diff/
>
>
> Testing
> -------
>
> Added a unit test case to TestTezCompiler.
> Added an e2e test case to Join.
>
> ant test-tez passes.
> e2e test passes.
>
>
> Thanks,
>
> Cheolsoo Park
>
>