Re: Estimating Time required to compute M/R job
Sounds like this paper might help you:

Predicting Multiple Performance Metrics for Queries: Better Decisions
Enabled by Machine Learning, by Archana Ganapathi, Harumi Kuno,
Umeshwar Dayal, Janet Wiener, Armando Fox, Michael Jordan, and David
Patterson

http://radlab.cs.berkeley.edu/publication/187

On Sat, Apr 16, 2011 at 1:19 PM, Stephen Boesch <[EMAIL PROTECTED]> wrote:
>
> Some additional thoughts about the 'variables' involved in
> characterizing the M/R application itself:
>
>
>   - the configuration of the cluster for the number of mappers vs. reducers,
>   compared to the characteristics (amount of work/processing) required in
>   each of the map/shuffle/reduce stages (a back-of-envelope sketch of how
>   these combine follows below)
>
>
>   - is the application using multiple chained M/R stages?  Multi-stage
>   M/R jobs are more difficult to tune properly in terms of keeping all
>   workers busy, which may be challenging to model.
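>
> As a rough illustration of how that first variable feeds an estimate, here
> is a back-of-envelope sketch in Python (my own, not anything built into
> Hadoop; all the stage times and slot counts are assumptions you would
> measure on your cluster):
>
>     import math
>
>     def estimate_job_seconds(num_map_tasks, map_slots, avg_map_secs,
>                              num_reduce_tasks, reduce_slots, avg_reduce_secs,
>                              shuffle_secs):
>         # Tasks run in "waves": with 400 map tasks and 100 map slots,
>         # roughly 4 waves of maps run back to back.
>         map_waves = math.ceil(num_map_tasks / map_slots)
>         reduce_waves = math.ceil(num_reduce_tasks / reduce_slots)
>         return (map_waves * avg_map_secs
>                 + shuffle_secs
>                 + reduce_waves * avg_reduce_secs)
>
>     # 400 maps over 100 slots at ~30s each, 50 reducers over 50 slots at
>     # ~60s each, ~45s of shuffle in between => ~225s
>     print(estimate_job_seconds(400, 100, 30, 50, 50, 60, 45))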
>
> 2011/4/16 Stephen Boesch <[EMAIL PROTECTED]>
>
> > You could consider two scenarios / sets of requirements for your estimator:
> >
> >
> >    1. Allow it to 'learn' from certain input data and then project running
> >    times of similar (or moderately dissimilar) workloads.  The first step
> >    could be to define a couple of relatively small "control" M/R jobs on a
> >    small-ish dataset and throw them at the unknown (cluster-under-test)
> >    HDFS / M/R cluster.  Try to design the "control" M/R jobs in a way that
> >    they can completely load down all of the available DataNodes in the
> >    cluster-under-test for at least a brief period of time.  Then you will
> >    have obtained a decent signal on the capabilities of the cluster under
> >    test, which may allow a relatively high degree of predictive accuracy
> >    for even much larger jobs (a sketch of this follows after the list).
> >    2. If instead your goal is to drive the predictions off of a purely
> >    mathematical model - in your terms, the "application" and "base file
> >    system" - without any empirical data, then here is an alternative
> >    approach:
> >       - Follow step (1) above against a variety of "applications" and
> >       "base file systems" - especially in configurations for which you wish
> >       your model to provide high-quality predictions.
> >       - Save the results as structured data.
> >       - Derive formulas characterizing the performance curves via those
> >       variables that you defined (application / base file system).
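> >
> > To make scenario (1) concrete, here is a minimal sketch of the
> > extrapolation step, assuming a few control runs have already been timed
> > at different input sizes (all numbers below are made up):
> >
> >     import numpy as np
> >
> >     # (input size in GB, measured runtime in seconds) from control runs
> >     control_runs = [(1, 42.0), (2, 71.0), (4, 133.0)]
> >     sizes = np.array([s for s, _ in control_runs])
> >     times = np.array([t for _, t in control_runs])
> >
> >     # Fit runtime ~ a * size + b: the intercept b absorbs fixed job
> >     # startup overhead, the slope a is the per-GB processing cost.
> >     a, b = np.polyfit(sizes, times, 1)
> >
> >     # Project to a much larger job on the same cluster.
> >     print(a * 100 + b)  # predicted seconds for a 100 GB input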
> >
> > Now you have a trained model.  When it is applied to a new set of
> > applications / base file systems, it can use the curves you have already
> > determined to provide the result without requiring any runs at prediction
> > time.
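> >
> > A minimal sketch of what "deriving the curves" could look like, assuming
> > each saved result is a feature vector plus a measured runtime (the
> > features here - node count, input GB, a 0/1 flag for fast disks - and all
> > numbers are illustrative):
> >
> >     import numpy as np
> >
> >     # rows: [num_nodes, input_gb, fast_disks] -> measured seconds
> >     X = np.array([[10, 5, 0], [10, 20, 0], [50, 20, 1], [50, 100, 1]])
> >     y = np.array([300.0, 900.0, 220.0, 800.0])
> >
> >     # Least-squares fit of a linear model with an intercept column.
> >     A = np.hstack([X, np.ones((X.shape[0], 1))])
> >     coef, *_ = np.linalg.lstsq(A, y, rcond=None)
> >
> >     def predict(num_nodes, input_gb, fast_disks):
> >         return np.array([num_nodes, input_gb, fast_disks, 1.0]) @ coef
> >
> >     print(predict(100, 50, 1))  # estimate for an unseen configuration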
> >
> > Obviously the value of this second approach is limited by the degree of
> > similarity between the training data and the applications you attempt to
> > model.  If all of your training data comes from a 50-node cluster of
> > machines with IDE drives, don't expect good results when asked to model a
> > 1000-node cluster using SANs / RAIDs / SCSI drives.
> >
> >
> > 2011/4/16 Sonal Goyal <[EMAIL PROTECTED]>
> >
> >> What is your MR job doing?  What is the amount of data it is processing?
> >> What kind of a cluster do you have?  Would you be able to share some
> >> details about what you are trying to do?
> >>
> >> If you are looking for metrics, you could look at the TeraSort benchmark runs.
> >>
> >> Thanks and Regards,
> >> Sonal
> >> Hadoop ETL and Data Integration <https://github.com/sonalgoyal/hiho>
> >> Nube Technologies <http://www.nubetech.co>
> >> <http://in.linkedin.com/in/sonalgoyal>
> >>
> >> On Sat, Apr 16, 2011 at 3:31 PM, real great..
> >> <[EMAIL PROTECTED]> wrote:
> >>
> >> > Hi,
> >> > As a part of my final-year BE project, I want to estimate the time
> >> > required by an M/R job, given an application and a base file system.
> >> > Can you folks please help me by posting some thoughts on this issue or
> >> > some links here.