Pig, mail # dev - Running TPC-H on Pig

Re: Running TPC-H on Pig
Jie Li 2011-12-02, 21:34
TPC-E is a transaction-processing benchmark, so why would it be better for
evaluating Hadoop-related systems?

We are benchmarking whole queries. So far we have found that some simple
heuristics work very well. No doubt statistics would help produce an even
better query plan.
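
As a rough illustration of the kind of heuristics listed in the original post (evaluate smaller joins first, use a replicated join for small tables, put the larger table on the right side of a hash join), here is a minimal Python sketch. It is not taken from the project's Pig scripts: the `Table` class, the function names, and the 100 MB cutoff are all hypothetical.

```python
from dataclasses import dataclass

# Hypothetical cutoff for choosing a replicated (map-side) join.
# This is an assumption for illustration, not a Pig default.
REPLICATED_LIMIT_BYTES = 100 * 1024 * 1024


@dataclass(frozen=True)
class Table:
    name: str
    size_bytes: int


def plan_join(a, b, limit=REPLICATED_LIMIT_BYTES):
    """Pick a join strategy and operand order for two tables."""
    small, large = sorted((a, b), key=lambda t: t.size_bytes)
    if small.size_bytes <= limit:
        # Replicated join: the small table goes on the right so it can
        # be held in memory while the large table is streamed.
        return {"left": large.name, "right": small.name, "strategy": "replicated"}
    # Hash join: put the larger table on the right side.
    return {"left": small.name, "right": large.name, "strategy": "hash"}


def order_tables(tables):
    """Order tables by ascending size so smaller joins are evaluated first."""
    return [t.name for t in sorted(tables, key=lambda t: t.size_bytes)]
```

A statistics-based optimizer would replace the raw size comparison with estimated cardinalities after filtering; the sketch only uses input sizes.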

Jie

On Wed, Nov 30, 2011 at 12:18 AM, Renato Marroquín Mogrovejo <
[EMAIL PROTECTED]> wrote:

> Hey,
>
> Why didn't you use TPC-E? And what exactly are you benchmarking, i.e.
> specific components of both systems or the whole queries? Hive is already
> able to use some basic statistics but Pig isn't, and at least until HCat
> is ready it won't be able to take full advantage of them.
>
> Renato M.
> On Nov 29, 2011 8:18 PM, "Jonathan Coveney" <[EMAIL PROTECTED]> wrote:
>
> > If you want some feedback on how to make the scripts faster, feel free
> > to post them.
> >
> > 2011/11/29 Jie Li <[EMAIL PROTECTED]>
> >
> > > Did you mean the two update functions of TPC-H? I think we can leave
> > > them out, as Hive did, since Hadoop is usually not used for updates.
> > >
> > > Jie
> > >
> > > On Tue, Nov 29, 2011 at 2:42 PM, Santhosh Srinivasan <[EMAIL PROTECTED]> wrote:
> > >
> > > > Please do. The association with TPC-H might be tricky, as it mandates
> > > > concurrent data modification. Nevertheless, the benchmark will be
> > > > very useful, as you point out.
> > > >
> > > > -----Original Message-----
> > > > From: Jie Li [mailto:[EMAIL PROTECTED]]
> > > > Sent: Tuesday, November 29, 2011 11:38 AM
> > > > To: [EMAIL PROTECTED]
> > > > Subject: Running TPC-H on Pig
> > > >
> > > > Hello everyone,
> > > >
> > > > As people are usually most concerned about performance, we need more
> > > > benchmarks to identify the bottlenecks in Pig's performance. For a
> > > > class project we developed a whole set of Pig scripts for TPC-H.
> > > > Though Pig was not designed for this RDBMS benchmark, it does support
> > > > most of the relational operators, like join and aggregation, which
> > > > can be optimized based on this benchmark. Besides that, we can also
> > > > demonstrate how to write efficient Pig scripts by making full use of
> > > > Pig Latin's features.
> > > >
> > > > Here is what we did:
> > > > 1) wrote correct Pig scripts for TPC-H, verifying them on 1GB of data.
> > > > 2) demonstrated the flexibility of Pig Latin by using the COGROUP
> > > > operator to implement join.
> > > > 3) showed how to optimize the join by slightly reordering it or using
> > > > replicated join. We think Pig should be able to apply more heuristic
> > > > optimizations to the join, such as evaluating the smaller join first,
> > > > using replicated join for small tables, and putting the larger table
> > > > on the right side of the hash join.
> > > > 4) identified the poor performance of aggregation. Pig doesn't yet
> > > > support hash-based aggregation, so aggregation is extremely slow.
> > > > Good to know that Pig is just about to support it :)
> > > >
> > > > As TPC-H is well known, a good benchmark result can help change
> > > > people's impression that Pig is slow. In fact, we compared Pig and
> > > > Hive and found that Pig is not necessarily slower than Hive. I wonder
> > > > if we can create a JIRA for this project.
> > > >
> > > > Thanks,
> > > > Jie Li
> > > > PhD Candidate of Computer Science
> > > > Duke University
> > > >
> > > >
> > >
> >
>