
Running TPC-H on Pig
Hello everyone,

Since users often care most about performance, we need more benchmarks to
identify Pig's performance bottlenecks. For a class project we developed a
complete set of Pig scripts for TPC-H. Although Pig was not designed for
this RDBMS benchmark, it does support most relational operators, such as
join and aggregation, and the benchmark can guide their optimization. The
scripts can also demonstrate how to write efficient Pig scripts by making
full use of Pig Latin's features.

Here is what we did:
1) Wrote Pig scripts for TPC-H and verified their correctness on 1GB of
data.
2) Demonstrated the flexibility of Pig Latin by using the COGROUP operator
to implement joins.
3) Showed how to optimize joins by slightly reordering them or by using
replicated joins. We think Pig should apply more heuristic optimizations
to joins, such as evaluating the smaller join first, using a replicated
join for small tables, and putting the larger table on the right side of
the hash join.
4) Identified the poor performance of aggregation. Pig doesn't yet support
hash-based aggregation, so aggregation is extremely slow. Good to know
that Pig is just about to support it :)
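
To make points 2) and 3) concrete, here is a rough Pig Latin sketch, not
taken from our actual scripts. The paths are hypothetical, and only the
leading columns of each TPC-H table are declared, which assumes the
standard '|'-delimited .tbl column order.

```pig
-- Hypothetical input paths; only the first few columns of each table are named.
customer = LOAD 'tpch/customer.tbl' USING PigStorage('|')
           AS (c_custkey:long, c_name:chararray, c_address:chararray, c_nationkey:long);
orders   = LOAD 'tpch/orders.tbl' USING PigStorage('|')
           AS (o_orderkey:long, o_custkey:long, o_orderstatus:chararray, o_totalprice:double);
nation   = LOAD 'tpch/nation.tbl' USING PigStorage('|')
           AS (n_nationkey:long, n_name:chararray);

-- 2) Inner join written with COGROUP: group both inputs by key, keep keys
--    present on both sides, then flatten the two bags into joined rows.
grouped     = COGROUP customer BY c_custkey, orders BY o_custkey;
matched     = FILTER grouped BY NOT IsEmpty(customer) AND NOT IsEmpty(orders);
cust_orders = FOREACH matched GENERATE FLATTEN(customer), FLATTEN(orders);

-- 3a) Replicated join: the relations after the first one (here the small
--     nation table) are held in memory, so the large table goes first.
cust_nation = JOIN customer BY c_nationkey, nation BY n_nationkey USING 'replicated';

-- 3b) Default join: the last relation is streamed rather than buffered,
--     so the larger table (orders) is placed on the right.
cust_orders2 = JOIN customer BY c_custkey, orders BY o_custkey;
```

The COGROUP form is more verbose than JOIN, but it exposes the grouped
bags, so the same pattern can be adapted to outer or semi joins by
changing the filter.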

Because TPC-H is well known, good benchmark results could help change the
impression that Pig is slow. In fact, we compared Pig with Hive and found
that Pig is not necessarily slower than Hive. Could we create a JIRA for
this project?

Jie Li
PhD Candidate of Computer Science
Duke University