Shrinivas Joshi 2011-02-18, 21:32
Which workloads are used for serious benchmarking of Hadoop clusters? Do you care about any of the following workloads: TeraSort, GridMix v1, v2, or v3, MalStone, CloudBurst, MRBench, NNBench, or sample apps shipped with the Hadoop distro such as PiEstimator, dbcount, etc.?
Thanks, -Shrinivas
Ted Dunning 2011-02-18, 22:00
MalStone looks like a very narrow benchmark.
TeraSort is also a very narrow and somewhat idiosyncratic benchmark, but it has the advantage that lots of people use it.
You should add PigMix to your list. There are Java versions of the problems in PigMix that make a pretty good set of benchmarks independent of Pig itself.
Jim Falgout 2011-02-18, 22:27
We use MalStone and TeraSort. For Hive, you can use TPC-H, at least the data and the queries, if not the query generator. There is a Jira issue in Hive that discusses the TPC-H "benchmark" if you're interested. Sorry, I don't remember the issue number offhand.
Shrinivas Joshi 2011-02-18, 22:35
Thanks Jim. The MRBench described in this paper http://dcslab.snu.ac.kr/~khjeon/papers/2008/icpads_mrbench.pdf looks like a map/reduce port of the TPC-H workload. BTW, the MRBench in that paper and the one in mapred/src/test/mapred/org/apache/hadoop/mapred/MRBench.java look different to me. Is that a fair statement?
-Shrinivas
Ted Dunning 2011-02-18, 22:35
I just read the MalStone report. They report times for a Java version that is many times (5x) slower than the streaming implementation. That single fact indicates that the Java code is so appallingly bad that this is a very bad benchmark.
Konstantin Boudnik 2011-02-18, 22:50
Slow Java code? That's funny ;) Running with HotSpot on, by any chance?
Shrinivas Joshi 2011-02-21, 20:39
I wonder what companies like Amazon, Cloudera, RackSpace, Facebook, Yahoo etc. look at for the purpose of benchmarking. I guess GridMix v3 might be of more interest to Yahoo.
I would appreciate it if someone could comment more on this.
Thanks, -Shrinivas
Konstantin Boudnik 2011-02-21, 22:50
Adding Roman Shaposhnik, who's "tasked" with benchmarking @Cloudera, to the list.