Pig >> mail # dev >> Running TPC-H on Pig


Re: Running TPC-H on Pig
Yeah, we already have some results, but they are not so good, so we are
currently rewriting some of the scripts, especially the joins. Once we get
a good result we will publish it.

Jie

On Tue, Nov 29, 2011 at 2:41 PM, Jonathan Coveney <[EMAIL PROTECTED]> wrote:

> I'm a little confused. Do you already have the benchmarks? I'd love to see
> them if you do. Do you want to make a JIRA in order to put this info on the
> site? I'm a little confused, but I agree that statistics can help focus
> effort and could also be a good tool for evangelism (especially if Pig is
> in fact as fast as Hive in some cases).
>
> 2011/11/29 Jie Li <[EMAIL PROTECTED]>
>
> > Hello everyone,
> >
> > As people are usually most concerned about performance, we need more
> > benchmarks to identify the bottlenecks in Pig's performance. For a class
> > project we developed a whole set of Pig scripts for TPC-H. Though Pig was
> > not designed for this RDBMS benchmark, it does support most of the
> > relational operators, like join and aggregation, which can be optimized
> > based on this benchmark. Besides that, we can also demonstrate how to
> > write efficient Pig scripts by making full use of Pig Latin's features.
> >
> > Here is what we did:
> > 1) Wrote correct Pig scripts for TPC-H and verified them on 1GB of data.
> > 2) Demonstrated the flexibility of Pig Latin by using the COGROUP
> > operator to implement joins.
> > 3) Showed how to optimize joins by slightly reordering them or by using a
> > replicated join. We think Pig should be able to apply more heuristic
> > optimizations to joins, such as evaluating the smaller join first, using
> > a replicated join for small tables, and putting the larger table on the
> > right side of the hash join.
> > 4) Identified the poor performance of aggregation. Pig doesn't yet
> > support hash-based aggregation, so aggregation is extremely slow. Good to
> > know that Pig is just about to support it :)
> >
> > As TPC-H is well known, a good benchmark result can help change people's
> > impression that Pig is slow. Actually, we compared Pig and Hive and found
> > that Pig is not necessarily slower than Hive. I wonder if we can create a
> > JIRA for this project.
> >
> > Thanks,
> > Jie Li
> > PhD Candidate of Computer Science
> > Duke University
> >
>
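
For readers unfamiliar with the techniques named in points 2) to 4) above, here is a rough sketch of the ideas in plain Python. It is not Pig Latin and not how Pig implements anything: the function names (cogroup, cogroup_join, replicated_join, hash_aggregate) and the toy rows are made up for illustration, and only the column names are borrowed from the TPC-H schema. It shows a join built from a COGROUP-style grouping, a replicated join that hashes the small input and streams the large one, and a hash-based SUM per group.

```python
from collections import defaultdict

def cogroup(left, right, left_key, right_key):
    """Group both inputs by key, in the spirit of COGROUP: key -> (left_bag, right_bag)."""
    groups = defaultdict(lambda: ([], []))
    for row in left:
        groups[row[left_key]][0].append(row)
    for row in right:
        groups[row[right_key]][1].append(row)
    return groups

def cogroup_join(left, right, left_key, right_key):
    """Inner join via cogroup: cross the two bags of each key.
    A key with an empty bag on either side contributes no rows."""
    out = []
    for lbag, rbag in cogroup(left, right, left_key, right_key).values():
        for l in lbag:
            for r in rbag:
                out.append({**l, **r})
    return out

def replicated_join(big, small, big_key, small_key):
    """Build a hash table on the small input once, then stream the big input past it."""
    table = defaultdict(list)
    for row in small:
        table[row[small_key]].append(row)
    return [{**b, **s} for b in big for s in table.get(b[big_key], [])]

def hash_aggregate(rows, key, value):
    """Hash-based SUM per group: one running total per key, no sorting required."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[key]] += row[value]
    return dict(totals)

# Toy rows (hypothetical data, TPC-H-style column names).
customers = [
    {"c_custkey": 1, "c_name": "a"},
    {"c_custkey": 2, "c_name": "b"},
    {"c_custkey": 3, "c_name": "c"},
]
orders = [
    {"o_orderkey": 10, "o_custkey": 1, "o_totalprice": 5.0},
    {"o_orderkey": 11, "o_custkey": 1, "o_totalprice": 7.5},
    {"o_orderkey": 12, "o_custkey": 2, "o_totalprice": 1.0},
    {"o_orderkey": 13, "o_custkey": 9, "o_totalprice": 2.0},
]

joined_cogroup = cogroup_join(customers, orders, "c_custkey", "o_custkey")
joined_replicated = replicated_join(orders, customers, "o_custkey", "c_custkey")
price_per_customer = hash_aggregate(orders, "o_custkey", "o_totalprice")
```

Both joins return the same matched rows here; the difference is that the replicated variant only ever holds the small input in memory, which is why it suits small tables.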