Re: Running TPC-H on Pig
Yeah, we already have some results, but they are not very good, so we are
currently rewriting some of the scripts, especially the joins. Once we get a
good result, we will publish it.

Jie

On Tue, Nov 29, 2011 at 2:41 PM, Jonathan Coveney <[EMAIL PROTECTED]> wrote:

> I'm a little confused. Do you already have the benchmarks? I'd love to see
> them if you do. Do you want to make a JIRA in order to put this info on the
> site? Either way, I agree that statistics can help focus effort and could
> also be a good tool for evangelism (especially if Pig is in fact as fast as
> Hive in some cases).
>
> 2011/11/29 Jie Li <[EMAIL PROTECTED]>
>
> > Hello everyone,
> >
> > As people are usually most concerned about performance, we need more
> > benchmarks to identify the bottlenecks in Pig's performance. For a class
> > project we developed a full set of Pig scripts for TPC-H. Though Pig was
> > not designed for this RDBMS benchmark, it does support most of the
> > relational operators, like join and aggregation, which can be optimized
> > based on this benchmark. Besides that, we can also demonstrate how to
> > write efficient Pig scripts by making full use of Pig Latin's features.
> >
> > Here is what we did:
> > 1) Wrote correct Pig scripts for TPC-H and verified them on 1GB of data.
> > 2) Demonstrated the flexibility of Pig Latin by using the COGROUP
> > operator to implement join.
> > 3) Showed how to optimize a join by slightly reordering it or by using a
> > replicated join. We think Pig should be able to apply more heuristic
> > optimizations to joins, such as evaluating the smaller join first, using
> > a replicated join for small tables, and putting the larger table on the
> > right side of the hash join.
> > 4) Identified the poor performance of aggregation. Pig doesn't yet
> > support hash-based aggregation, so aggregation is extremely slow. Good to
> > know that Pig is just about to support it :)
> >
> > As TPC-H is well known, a good benchmark result can help change people's
> > impression that Pig is slow. In fact, we compared Pig and Hive and found
> > that Pig is not necessarily slower than Hive. I wonder if we can create a
> > JIRA for this project.
> >
> > Thanks,
> > Jie Li
> > PhD Candidate of Computer Science
> > Duke University
> >
>
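
For readers unfamiliar with the techniques named in points 2 to 4 of the
quoted message, here is a minimal single-process sketch in Python (not Pig
Latin, and not the scripts discussed in this thread). It assumes rows are
plain tuples and that one input fits in memory for the replicated case; all
function names and the tuple layout are hypothetical, chosen only for
illustration. In Pig itself, the replicated strategy is requested with
JOIN ... USING 'replicated'.

```python
from collections import defaultdict


def cogroup(left, right, key_left, key_right):
    """Group both inputs by key, similar in spirit to Pig's COGROUP.

    Returns a mapping: key -> (bag of left rows, bag of right rows).
    """
    groups = defaultdict(lambda: ([], []))
    for row in left:
        groups[row[key_left]][0].append(row)
    for row in right:
        groups[row[key_right]][1].append(row)
    return groups


def join_via_cogroup(left, right, key_left, key_right):
    """Inner join built on cogroup: cross the two bags of each key.

    A key with an empty bag on either side contributes no output rows.
    """
    out = []
    for lbag, rbag in cogroup(left, right, key_left, key_right).values():
        for l in lbag:
            for r in rbag:
                out.append(l + r)
    return out


def replicated_join(big, small, key_big, key_small):
    """Replicated-join sketch: hash the small input, stream the big one.

    Assumption: `small` fits in memory, so no shuffle of `big` is needed.
    """
    table = defaultdict(list)
    for row in small:
        table[row[key_small]].append(row)
    return [b + s for b in big for s in table.get(b[key_big], [])]


def hash_aggregate(rows, key, value):
    """Hash-based SUM per key: one pass, no sort of the input required."""
    sums = defaultdict(int)
    for row in rows:
        sums[row[key]] += row[value]
    return dict(sums)
```

Both join functions return the same set of rows for an inner equi-join; the
difference is that the replicated variant only needs the small input in
memory, which is why it suits small dimension tables.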