-
Estimating Time required to compute M/R job
real great.. 2011-04-16, 10:01
Hi,
As part of my final-year BE project, I want to estimate the time required by an M/R job, given an application and a base file system. Could you folks please help me by posting some thoughts on this issue or some links here?

--
Regards,
R.V.
-
Re: Estimating Time required to compute M/R job
Sonal Goyal 2011-04-16, 13:39
What is your MR job doing? What is the amount of data it is processing? What kind of a cluster do you have? Would you be able to share some details about what you are trying to do?

If you are looking for metrics, you could look at the Terasort run ..

Thanks and Regards,
Sonal
Hadoop ETL and Data Integration <https://github.com/sonalgoyal/hiho>
Nube Technologies <http://www.nubetech.co>
<http://in.linkedin.com/in/sonalgoyal>
-
Re: Estimating Time required to compute M/R job
Stephen Boesch 2011-04-16, 20:08
You could consider two scenarios / sets of requirements for your estimator:

1. Allow it to 'learn' from certain input data and then project running times of similar (or moderately dissimilar) workloads. So the first step could be to define a couple of relatively small "control" M/R jobs on a small-ish dataset and throw them at the unknown (cluster-under-test) HDFS / M/R cluster. Try to design the "control" M/R job in a way that it will be able to completely load down all of the available DataNodes in the cluster-under-test for at least a brief period of time. Then you will have obtained a decent signal on the capabilities of the cluster under test, which may allow a relatively high degree of predictive accuracy for even much larger jobs.

2. If instead it were your goal to drive the predictions off of a purely mathematical model - in your terms the "application" and "base file system" - and without any empirical data - then here is an alternative approach.
   - Follow step (1) above against a variety of "applications" and "base file systems" - especially in configurations for which you wish your model to provide high quality predictions.
   - Save the results in structured data.
   - Derive formulas for characterizing the curves of performance via those variables that you defined (application / base file system).

Now you have a trained model. When it is applied to a new set of applications / base file systems it can use the curves you have already determined to provide the result without any runtime requirements.

Obviously the value of this second approach is limited by the degree of similarity of the training data to the applications you attempt to model. If all of your training data is on a 50 node cluster against machines with IDE drives, don't expect good results when asked to model a 1000 node cluster using SANs / RAIDs / SCSIs.
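
The "save the results in structured data, then derive formulas" steps above could be sketched roughly as follows. This is only an illustration under stated assumptions: the CSV columns, the sample numbers, and the choice of a power-law curve (runtime = a * size^b, fitted in log-log space) are all hypothetical and not taken from the thread or from any measured cluster.

```python
import csv
import io
import math
from collections import defaultdict

# Hypothetical benchmark records: one row per "control" job run.
# Column names and values are illustrative only, not measured data.
RESULTS_CSV = """application,file_system,input_gb,runtime_s
wordcount,hdfs,1,40
wordcount,hdfs,2,80
wordcount,hdfs,4,160
sort,hdfs,1,30
sort,hdfs,2,66
sort,hdfs,4,145
"""


def fit_power_law(points):
    """Least-squares fit of runtime = a * size**b, done in log-log space."""
    xs = [math.log(size) for size, _ in points]
    ys = [math.log(runtime) for _, runtime in points]
    n = len(points)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = math.exp(mean_y - b * mean_x)
    return a, b


def train(csv_text):
    """Group runs by (application, file_system) and fit one curve per group."""
    groups = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        key = (row["application"], row["file_system"])
        groups[key].append((float(row["input_gb"]), float(row["runtime_s"])))
    return {key: fit_power_law(points) for key, points in groups.items()}


def predict(model, application, file_system, input_gb):
    """Evaluate the stored curve; no job has to run at prediction time."""
    a, b = model[(application, file_system)]
    return a * input_gb ** b


model = train(RESULTS_CSV)
print(predict(model, "wordcount", "hdfs", 8.0))
```

As the caveat above says, a curve fitted this way is only as good as the similarity between the training runs and the cluster or application being predicted.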
-
Re: Estimating Time required to compute M/R job
Stephen Boesch 2011-04-16, 20:19
Some additional thoughts about the 'variables' involved in characterizing the M/R application itself:

- The configuration of the cluster for numbers of mappers vs reducers, compared to the characteristics (amount of work/processing) required in each of the map/shuffle/reduce stages.
- Is the application using multiple chained M/R stages? Multi-stage M/R's are more difficult to tune properly in terms of keeping all workers busy. That may be challenging to model.
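
One very rough way to turn these variables into numbers is a "wave" model: if a stage has more tasks than the cluster has slots, the tasks run in successive waves. The sketch below assumes exactly that, plus assumptions not made in the thread: every task in a stage takes the same average time, shuffle is a fixed cost, and chained stages run strictly one after another. All names and figures are hypothetical.

```python
import math


def stage_time(num_tasks, slots, avg_task_s):
    """Tasks run in waves: at most `slots` tasks execute at the same time."""
    waves = math.ceil(num_tasks / slots)
    return waves * avg_task_s


def job_time(stages):
    """Chained M/R stages are assumed to run back to back, so times add up."""
    total = 0.0
    for stage in stages:
        total += stage_time(stage["maps"], stage["map_slots"], stage["map_s"])
        total += stage["shuffle_s"]
        total += stage_time(stage["reduces"], stage["reduce_slots"], stage["reduce_s"])
    return total


# Hypothetical two-stage pipeline on a cluster with 40 map and 20 reduce slots.
pipeline = [
    {"maps": 100, "map_slots": 40, "map_s": 30.0,
     "shuffle_s": 20.0, "reduces": 10, "reduce_slots": 20, "reduce_s": 60.0},
    {"maps": 10, "map_slots": 40, "map_s": 30.0,
     "shuffle_s": 5.0, "reduces": 1, "reduce_slots": 20, "reduce_s": 45.0},
]
print(job_time(pipeline))
```

The second stage shows why chained jobs are hard to keep busy: it uses 10 of 40 map slots and 1 of 20 reduce slots, yet still costs a full wave each.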
-
Re: Estimating Time required to compute M/R job
Ted Dunning 2011-04-16, 22:03
Sounds like this paper might help you:

Predicting Multiple Performance Metrics for Queries: Better Decisions Enabled by Machine Learning, by Archana Ganapathi, Harumi Kuno, Umeshwar Dayal, Janet Wiener, Armando Fox, Michael Jordan, & David Patterson

http://radlab.cs.berkeley.edu/publication/187
-
Re: Estimating Time required to compute M/R job
real great.. 2011-04-17, 14:00
Thanks a lot guys.. will go through it all.

--
Regards,
R.V.
-
Re: Estimating Time required to compute M/R job
Matthew Foley 2011-04-17, 19:07
Since general M/R jobs vary over a huge (Turing problem equivalent!) range of behaviors, a more tractable problem might be to characterize the descriptive parameters needed to answer the question: "If the following problem P runs in T0 amount of time on a certain benchmark platform B0, how long T1 will it take to run on a differently configured real-world platform B1?"

Or are you only dealing with one particular M/R job? If so, the above is a good way to look at it: first identify the controlling parameters, then analyze how they co-vary with execution time. Now you've reduced it to a question that can be answered by a series of "make hypothesis" / "do experiment" steps :-) Pick a parameter you think is a likely candidate, and make a series of measurements of execution time for different values of the parameter. Repeat until you've fully characterized the problem space.

Good luck,
--Matt
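
The "make hypothesis" / "do experiment" loop could look something like the sketch below: vary one candidate parameter at a time, record execution time, and check how strongly the two co-vary. The `fake_run_job` function is a hypothetical stand-in for actually launching and timing the job, and the parameter names are invented for illustration; Pearson correlation is just one possible measure of co-variation.

```python
import math


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 if either series does not vary."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def fake_run_job(params):
    """Hypothetical stand-in for running the job and returning seconds taken.

    In this toy model only input size matters; block size has no effect.
    """
    return 20.0 + 3.0 * params["input_gb"]


def sweep(run_job, baseline, name, values):
    """Vary one parameter, hold the rest at baseline, collect run times."""
    times = []
    for value in values:
        params = dict(baseline, **{name: value})
        times.append(run_job(params))
    return times


baseline = {"input_gb": 10, "block_mb": 64}
candidates = {"input_gb": [5, 10, 20, 40], "block_mb": [32, 64, 128, 256]}

for name, values in candidates.items():
    times = sweep(fake_run_job, baseline, name, values)
    print(name, round(pearson(values, times), 3))
```

Parameters whose correlation stays near zero across the sweep can be dropped from the hypothesis list; the rest are candidates for the "descriptive parameters" of the job.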
-
Re: Estimating Time required to compute M/R job
Lance Norskog 2011-04-17, 23:57
ROC Convex Hull is an analysis technique for optimizing parameters for given outputs. For example, if a classification technique has tuning knobs, ROCCH will find the settings that give a desired failure rate.

--
Lance Norskog
[EMAIL PROTECTED]
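
For readers unfamiliar with ROCCH, here is a minimal sketch of the idea, assuming each knob setting has already been evaluated to a (false positive rate, true positive rate) point. The points and setting labels are made up for illustration. The hull is computed as the upper convex hull of the points plus the trivial classifiers (0, 0) and (1, 1); settings below the hull are never the best choice.

```python
def cross(o, a, b):
    """Z-component of (a - o) x (b - o); positive means a left turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def roc_convex_hull(points):
    """Upper convex hull of ROC points, running from (0, 0) to (1, 1)."""
    pts = sorted(set(points) | {(0.0, 0.0), (1.0, 1.0)})
    hull = []
    for p in pts:
        # Drop the last hull point while it lies on or below the new edge.
        while len(hull) >= 2 and cross(hull[-2], hull[-1], p) >= 0:
            hull.pop()
        hull.append(p)
    return hull


def best_setting(settings, max_fpr):
    """Pick the hull setting with the highest TPR whose FPR is within budget."""
    hull = roc_convex_hull(list(settings))
    allowed = [p for p in hull if p in settings and p[0] <= max_fpr]
    if not allowed:
        return None
    return settings[max(allowed, key=lambda p: p[1])]


# Hypothetical (FPR, TPR) results for four knob settings.
settings = {
    (0.1, 0.5): "threshold=0.8",
    (0.3, 0.6): "threshold=0.6",
    (0.4, 0.9): "threshold=0.4",
    (0.5, 0.5): "threshold=0.2",
}
print(roc_convex_hull(list(settings)))
print(best_setting(settings, max_fpr=0.2))
```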
-
Re: Estimating Time required to compute M/R job
Ted Dunning 2011-04-18, 00:07
Turing completeness isn't the central question here, really. The truth is, map-reduce programs are under considerable pressure to be written in a scalable fashion, which limits them to fairly simple behaviors that result in pretty linear dependence of run-time on input size for a given program.

The cool thing about the paper that I linked to the other day is that there are enough cues about the expected runtime of the program available to make good predictions *without* looking at the details. No doubt the estimation facility could make good use of something as simple as the hash of the jar in question, but even without that it is possible to produce good estimates.

I suppose that this means that all of us Hadoop programmers are really just kind of boring folk. On average, anyway.
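
If run-time really is close to linear in input size for a given program, a first-cut estimator needs little more than a straight-line fit over a few past runs. The sketch below assumes a model of the form runtime = overhead + rate * size; the sample sizes and timings are hypothetical and chosen so the fit is exact.

```python
def fit_line(sizes, runtimes):
    """Ordinary least squares for runtime = overhead + rate * size."""
    n = len(sizes)
    mean_x = sum(sizes) / n
    mean_y = sum(runtimes) / n
    sxx = sum((x - mean_x) ** 2 for x in sizes)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(sizes, runtimes))
    rate = sxy / sxx
    overhead = mean_y - rate * mean_x
    return overhead, rate


# Hypothetical past runs of one program: input size in GB, run-time in seconds.
sizes_gb = [10.0, 20.0, 40.0]
runtimes_s = [130.0, 230.0, 430.0]

overhead, rate = fit_line(sizes_gb, runtimes_s)
estimate = overhead + rate * 100.0  # extrapolate to a 100 GB input
print(overhead, rate, estimate)
```

Extrapolating far beyond the measured sizes assumes the linear regime keeps holding, which is worth re-checking with one larger run before trusting the estimate.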
-
Re: Estimating Time required to compute M/R job
James Seigel Tynt 2011-04-18, 00:25
Yup. I'm boring
-
Re: Estimating Time required to compute M/R job
real great.. 2011-04-18, 01:39
@Matthew: initially I wanted to concentrate on a generic class of applications.. wouldn't mind sticking to one now.. can I know something more about the descriptive parameters?

@all: any results from anybody who has done something similar?

--
Regards,
R.V.
-
Re: Estimating Time required to compute M/R job
Matthew Foley 2011-04-18, 06:28
R.V.,
I was only suggesting one way to tackle the problem; I don't have a list of appropriate parameters.
I think Ted has much more experience in this area, and he is encouraging you to stay with the generic approach. You should study that paper he recommended; the approach looks really powerful.
--Matt
-
Re: Estimating Time required to compute M/R job
real great.. 2011-04-18, 08:49
sure, will do.. :)

--
Regards,
R.V.