Re: Estimating Time required to compute M/R job
Ted Dunning 2011-04-16, 22:03
Sounds like this paper might help you:
Predicting Multiple Performance Metrics for Queries: Better Decisions
Enabled by Machine Learning, by Archana Ganapathi, Harumi Kuno,
Umeshwar Dayal, Janet Wiener, Armando Fox, Michael Jordan, and David
Patterson.
On Sat, Apr 16, 2011 at 1:19 PM, Stephen Boesch <[EMAIL PROTECTED]> wrote:
> some additional thoughts about the 'variables' involved in
> characterizing the M/R application itself:
> - the configuration of the cluster for numbers of mappers vs. reducers,
> compared to the characteristics (amount of work/processing) required in each
> of the map/shuffle/reduce stages
> - whether the application uses multiple chained M/R stages. Multi-stage
> M/R jobs are more difficult to tune properly in terms of keeping all workers
> busy, and that may be challenging to model (a driver sketch touching both
> points follows below).
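> Purely as an illustration of those two knobs, a minimal two-stage driver
> sketch using the org.apache.hadoop.mapreduce API might look like the
> following. The identity Mapper/Reducer classes stand in for real per-stage
> logic, and the paths and reducer counts are made-up placeholders:
>
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.fs.Path;
> import org.apache.hadoop.mapreduce.Job;
> import org.apache.hadoop.mapreduce.Mapper;
> import org.apache.hadoop.mapreduce.Reducer;
> import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
> import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
>
> public class TwoStageDriver {
>   public static void main(String[] args) throws Exception {
>     Configuration conf = new Configuration();
>
>     // Stage 1: the reducer count is the knob balancing map vs. reduce work.
>     Job stage1 = new Job(conf, "stage-1");
>     stage1.setJarByClass(TwoStageDriver.class);
>     stage1.setMapperClass(Mapper.class);    // identity map; substitute real logic
>     stage1.setReducerClass(Reducer.class);  // identity reduce
>     stage1.setNumReduceTasks(8);
>     FileInputFormat.addInputPath(stage1, new Path(args[0]));
>     FileOutputFormat.setOutputPath(stage1, new Path("/tmp/stage1-out"));
>     if (!stage1.waitForCompletion(true)) System.exit(1);
>
>     // Stage 2 consumes stage 1's output; between the two jobs the cluster
>     // drains and refills, which is where multi-stage tuning gets hard.
>     Job stage2 = new Job(conf, "stage-2");
>     stage2.setJarByClass(TwoStageDriver.class);
>     stage2.setMapperClass(Mapper.class);
>     stage2.setReducerClass(Reducer.class);
>     stage2.setNumReduceTasks(4);
>     FileInputFormat.addInputPath(stage2, new Path("/tmp/stage1-out"));
>     FileOutputFormat.setOutputPath(stage2, new Path(args[1]));
>     System.exit(stage2.waitForCompletion(true) ? 0 : 1);
>   }
> }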
> 2011/4/16 Stephen Boesch <[EMAIL PROTECTED]>
> > You could consider two scenarios / sets of requirements for your estimator:
> > 1. Allow it to 'learn' from certain input data and then project running
> > times of similar (or moderately dissimilar) workloads. So the first step
> > could be to define a couple of relatively small "control" M/R jobs on a
> > small-ish dataset and throw them at the unknown (cluster-under-test) HDFS/M/R
> > cluster. Try to design the "control" M/R job in a way that it will be
> > able to completely load down all of the available DataNodes in the
> > cluster-under-test for at least a brief period of time. Then you will
> > have obtained a decent signal on the capabilities of the cluster under test,
> > which may allow a relatively high degree of predictive accuracy for even much
> > larger jobs.
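> > As a rough sketch of the timing side of such a probe (what the control
> > job computes and where you persist the samples is up to you; this just
> > wraps job submission in a wall-clock measurement):
> >
> > import org.apache.hadoop.mapreduce.Job;
> >
> > public class ClusterProbe {
> >   // Runs an already-configured "control" job and returns its wall-clock
> >   // runtime in milliseconds. Collect one sample per (input size, cluster)
> >   // pair and persist them for the estimator.
> >   public static long timeControlJob(Job controlJob) throws Exception {
> >     long start = System.currentTimeMillis();
> >     if (!controlJob.waitForCompletion(true)) {
> >       throw new IllegalStateException("control job failed: "
> >           + controlJob.getJobName());
> >     }
> >     return System.currentTimeMillis() - start;
> >   }
> > }
> >
> > Running it at a few input sizes large enough to saturate the DataNodes
> > gives you the curve points used in approach (2) below.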
> > 2. If instead it were your goal to drive the predictions off of a
> > purely mathematical model - in your terms the "application" and "base file
> > system" - and without any empirical data at prediction time, then here is an
> > alternative approach:
> > - Follow step (1) above against a variety of "applications" and
> > "base file systems" - especially the configurations for which you wish your
> > model to provide high-quality predictions.
> > - Save the results as structured data.
> > - Derive formulas characterizing the performance curves in terms of
> > the variables you defined (application / base file system).
> > Now you have a trained model. When it is applied to a new set of
> > applications / base file systems, it can use the curves you have already
> > determined to provide the result without any runtime requirements.
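> > For the curve-derivation step, even a plain least-squares line (runtime
> > vs. input size, fitted per application / base file system combination)
> > can serve as a first cut. A hypothetical sketch, no libraries:
> >
> > // Fits runtime = a + b * inputGb over the samples saved in the previous
> > // step, then extrapolates to unseen job sizes. One instance per
> > // (application, base file system) combination.
> > public class RuntimeCurve {
> >   private final double a, b; // intercept, slope
> >
> >   public RuntimeCurve(double[] inputGb, double[] seconds) {
> >     int n = inputGb.length;
> >     double sx = 0, sy = 0, sxx = 0, sxy = 0;
> >     for (int i = 0; i < n; i++) {
> >       sx += inputGb[i];
> >       sy += seconds[i];
> >       sxx += inputGb[i] * inputGb[i];
> >       sxy += inputGb[i] * seconds[i];
> >     }
> >     b = (n * sxy - sx * sy) / (n * sxx - sx * sx);
> >     a = (sy - b * sx) / n;
> >   }
> >
> >   // Predicted runtime in seconds for a job of the given input size.
> >   public double predictSeconds(double inputGb) {
> >     return a + b * inputGb;
> >   }
> > }
> >
> > A straight line will of course miss shuffle-bound non-linearities; richer
> > features (mapper/reducer counts, number of chained stages) fit the same mold.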
> > Obviously the value of this second approach is limited by the degree of
> > similarity of the training data to the applications you attempt to model.
> > If all of your training data is from a 50-node cluster of machines with
> > IDE drives, don't expect good results when asked to model a 1000-node cluster
> > using SANs / RAID arrays / SCSI drives.
> > 2011/4/16 Sonal Goyal <[EMAIL PROTECTED]>
> >> What is your MR job doing? What is the amount of data it is processing?
> >> What kind of a cluster do you have? Would you be able to share some
> >> details about what you are trying to do?
> >> If you are looking for metrics, you could look at the TeraSort run.
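> >> For example (the examples jar name and path vary by Hadoop version;
> >> TeraGen rows are 100 bytes each, so 10,000,000 rows is roughly 1 GB):
> >>
> >> hadoop jar hadoop-examples.jar teragen 10000000 /tera/in
> >> hadoop jar hadoop-examples.jar terasort /tera/in /tera/out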
> >> Thanks and Regards,
> >> Sonal
> >> Hadoop ETL and Data Integration <https://github.com/sonalgoyal/hiho>
> >> Nube Technologies <http://www.nubetech.co>
> >> <http://in.linkedin.com/in/sonalgoyal>
> >> On Sat, Apr 16, 2011 at 3:31 PM, real great..
> >> <[EMAIL PROTECTED]> wrote:
> >> > Hi,
> >> > As a part of my final-year BE project, I want to estimate the time
> >> > required by an M/R job, given an application and a base file system.
> >> > Can you folks please help me by posting some thoughts on this issue or
> >> > posting some links here.