Re: Running TPC-H on Pig
Did you mean the two update functions of TPC-H? I think we can leave them
out, as Hive did, since Hadoop is generally not used for updates.


On Tue, Nov 29, 2011 at 2:42 PM, Santhosh Srinivasan <[EMAIL PROTECTED]> wrote:

> Please do. The association with TPC-H might be tricky, as it mandates
> concurrent data modification. Nevertheless, the benchmark will be very
> useful, as you point out.
> -----Original Message-----
> From: Jie Li [mailto:[EMAIL PROTECTED]]
> Sent: Tuesday, November 29, 2011 11:38 AM
> Subject: Running TPC-H on Pig
> Hello everyone,
> Since people are usually most concerned about performance, we need more
> benchmarks to identify the bottlenecks in Pig's performance. For a class
> project we developed a full set of Pig scripts for TPC-H. Though Pig was not
> designed for this RDBMS benchmark, it does support most of the relational
> operators, such as join and aggregation, which can be optimized based on this
> benchmark. Beyond that, we can also demonstrate how to write efficient Pig
> scripts by making full use of Pig Latin's features.
> Here is what we did:
> 1) Wrote correct Pig scripts for TPC-H and verified them on 1 GB of data.
> 2) Demonstrated the flexibility of Pig Latin by using the COGROUP operator to
> implement join.
> 3) Showed how to optimize joins by slightly reordering them or by using
> replicated join. We think Pig should be able to apply more heuristic
> optimizations to joins, such as evaluating the smaller join first, using
> replicated join for small tables, and putting the larger table on the right
> side of the hash join.
> 4) Identified the poor performance of aggregation. Pig doesn't yet support
> hash-based aggregation, so aggregation is extremely slow. Good to know
> that Pig is just about to support it :)
> As TPC-H is well known, a good benchmark result can help change people's
> impression that Pig is slow. In fact, we compared Pig and Hive and found that
> Pig is not necessarily slower than Hive. I wonder if we can create a JIRA
> for this project.
> Thanks,
> Jie Li
> PhD Candidate of Computer Science
> Duke University
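
For readers unfamiliar with the join strategies named in points 2) and 3) of the quoted message, here is a minimal sketch of the ideas in plain Python. It is not Pig Latin and not taken from the scripts discussed in this thread; the function names, the tuple layout, and the sample rows are all hypothetical. It only illustrates that an inner join can be expressed as a cogroup followed by a per-key cross product, and that a replicated join keeps the small table in memory while streaming the large one.

```python
from collections import defaultdict


def cogroup(left, right, left_key, right_key):
    """Group two relations by key: key -> (bag of left rows, bag of right rows)."""
    groups = defaultdict(lambda: ([], []))
    for row in left:
        groups[left_key(row)][0].append(row)
    for row in right:
        groups[right_key(row)][1].append(row)
    return groups


def join_via_cogroup(left, right, left_key, right_key):
    """Inner join built on cogroup: cross the two bags of every key.

    A key whose left or right bag is empty contributes no output rows.
    """
    out = []
    for lbag, rbag in cogroup(left, right, left_key, right_key).values():
        for l_row in lbag:
            for r_row in rbag:
                out.append(l_row + r_row)
    return out


def replicated_join(large, small, large_key, small_key):
    """Inner join that hashes the small relation in memory and streams the large one."""
    table = defaultdict(list)
    for row in small:
        table[small_key(row)].append(row)
    return [row + match
            for row in large
            for match in table.get(large_key(row), [])]


if __name__ == "__main__":
    # Hypothetical rows: orders are (order_key, cust_key), customers are (cust_key, name).
    orders = [(100, 1), (101, 1), (102, 3), (103, 4)]
    customers = [(1, "Alice"), (2, "Bob"), (3, "Carol")]

    a = join_via_cogroup(orders, customers, lambda o: o[1], lambda c: c[0])
    b = replicated_join(orders, customers, lambda o: o[1], lambda c: c[0])
    print(sorted(a) == sorted(b))
```

Both functions return the same rows. The difference is in where the work happens: the cogroup version has to bring all rows with the same key together before producing output, while the replicated version only needs the small table to fit in memory.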