I am just getting started with Hadoop. Are there any benchmarks or 'performance heuristics' for Hadoop? Is it possible to say something like 'You can process X lines of gzipped log file on a medium AWS server in Y minutes'? I would like to get an idea of what kind of workflow is possible.
Doesn't just cover Hadoop, but maybe the methodology will give you an idea of what you're looking for.
There are too many variables to pin down a "general" average. Every job will run differently on every cluster: the machines can be heterogeneous builds with heterogeneous machine-level configs, the cluster configs may or may not override the machine configs, and the job submitter can specify runtime variables on top of that...
Things like the type of data being processed affect the amount of disk I/O, network traffic required, etc., which are in turn affected by their components...
Throwing more nodes at a problem will usually make it faster, but how much faster depends...
The best way to read your cluster is to establish a benchmark operation that models your expected use case (or one of them), then adjust things on the cluster one at a time and see what tips the run time, spill, network traffic, etc. one way or the other.
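As a rough illustration of that benchmark-then-adjust loop, here is a minimal Python sketch of a timing harness. It assumes the benchmark job can be launched as a single command line; the function names `time_command` and `summarize` are made up for this example and are not part of Hadoop. It only captures wall-clock time, so spill and network figures would still have to come from the job counters.

```python
import statistics
import subprocess
import time


def time_command(cmd, runs=3):
    """Run cmd (a list of arguments) `runs` times and return wall-clock seconds per run."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        # check=True raises CalledProcessError if the job exits with a non-zero status
        subprocess.run(cmd, check=True)
        timings.append(time.perf_counter() - start)
    return timings


def summarize(label, timings):
    """Format median, min and max so runs under different configs can be compared."""
    return "{}: median {:.2f}s, min {:.2f}s, max {:.2f}s".format(
        label, statistics.median(timings), min(timings), max(timings))
```

A typical use would be to call `time_command` with the same job command before and after changing a single setting, then compare the two `summarize` lines. Repeating each run a few times helps separate a real effect from normal run-to-run noise.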
Eric Sammer's *Hadoop Operations* will break down nicely how real-life cluster configs affect performance. There are also a lot of case studies in Tom White's *Hadoop: The Definitive Guide*.
*Devin Suiter*
Jr. Data Solutions Software Engineer
100 Sandusky Street | 2nd Floor | Pittsburgh, PA 15212
Google Voice: 412-256-8556 | www.rdx.com

On Tue, Feb 25, 2014 at 3:09 PM, Brian Stempin <[EMAIL PROTECTED]> wrote:
Thanks a lot, guys! From Dieter's original reply I got TeraSort, and I am currently running different scenarios with it. It seems to be The Benchmark right now. It's relatively simple, and yet it tests most of the functionality.
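For anyone following along, a TeraSort scenario is normally three jobs from the Hadoop examples jar: `teragen` to create the input, `terasort` to sort it, and `teravalidate` to check the output. The Python sketch below only builds those command lines. The jar path and HDFS directory are placeholders I made up, since both vary by distribution and version, and the helper names are hypothetical.

```python
def terasort_commands(examples_jar, rows, base_dir):
    """Return the teragen, terasort and teravalidate command lines, in run order."""
    gen_dir = base_dir + "/teragen"
    sort_dir = base_dir + "/terasort"
    report_dir = base_dir + "/teravalidate"
    return [
        # teragen writes `rows` records of 100 bytes each
        ["hadoop", "jar", examples_jar, "teragen", str(rows), gen_dir],
        ["hadoop", "jar", examples_jar, "terasort", gen_dir, sort_dir],
        ["hadoop", "jar", examples_jar, "teravalidate", sort_dir, report_dir],
    ]


def dataset_bytes(rows):
    """Size of the generated input, given that each teragen row is 100 bytes."""
    return rows * 100


if __name__ == "__main__":
    # Placeholder jar path and HDFS directory: adjust for your installation.
    jar = "/path/to/hadoop-mapreduce-examples.jar"
    for cmd in terasort_commands(jar, 10_000_000, "/benchmarks/run1"):
        print(" ".join(cmd))
    print("input size in bytes:", dataset_bytes(10_000_000))
```

Varying the row count is the simplest way to get different scenarios, and because each row is 100 bytes it is easy to size the input relative to the memory and disk of light test hardware.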
Devin: You mention a couple of books I already have in my reading stack. Do any of you know of an authoritative source on actual optimization (maybe even 'profiling'?) of a Hadoop cluster? I am testing on relatively (very) light HW, and my background is Java servers, so of course I started by fiddling with the memory settings. Not much luck there. :-D /th
On Tue, 2014-02-25 at 15:43 -0500, Devin Suiter RDX wrote: