
+
Edward Capriolo 20100301, 17:27
+
Darren Govoni 20100301, 21:13
+
Edward Capriolo 20100301, 23:25
+
Darren Govoni 20100301, 23:37
+
brien colwell 20100301, 23:42
+
Edward Capriolo 20100302, 00:48
+
Jeff Hammerbacher 20100303, 02:09

Re: BigO Notation for Hadoop
I think there is still a long way toward predicting Hadoop job
runtime. The ICDE10 paper listed a lot of limitations of their methods and is a short paper (4 pgs). In terms of the Berkeley research, based on what I learned from a presentation done by Archana (the first author of ICDE09 paper), they just scratched the surface of the problem, and that the methodology seems to require quite exhaustive experimentation on the configuration space to train their models  overall, I am not convinced that the approach would work as well as predicting sql query performance (the results they presented in the ICDE09 paper). Hong On Mar 2, 2010, at 6:09 PM, Jeff Hammerbacher wrote: > Predicting the run time of a MapReduce/Pig/Hive job has been > addressed by > folks at the University of Washington (e.g. > http://www.cs.washington.edu/homes/kmorton/ICDE10.pdf) and Berkeley > (e.g > using the techniques from > http://www.cs.berkeley.edu/~archanag/publications/ICDE09.pdf). > > On Mon, Mar 1, 2010 at 4:48 PM, Edward Capriolo > <[EMAIL PROTECTED]>wrote: > >> I am looking at this many different ways. >> >> For example: shuffle sort might run faster if we have 12 disks not >> 8 per >> node. >> >> >> So shuffle sort involves data size/ disk speed network speed/ and >> processor speed/ number of nodes. >> >> >> Can we find formula to take these (and more factors ) into account? >> Once we find it we should be able to plug in 12 or 8 and get a result >> close to the shuffle sort time. >> >> >> I think it would be rather cool to have a long drawn out formula.that >> even made reference to some constants, like time to copy data to >> distributed cache, >> >> >> >> I am looking at source data size, map complety, map output size, >> shuffle sort time, reduce complexity, number of nodes and try to >> arrive at a formula that will say how long a job will take. >> >> From there we can factor in something like all nodes have 10 g >> ethernet and watch the entire thing fall apart :) >> >> >> >> >> On 3/1/10, brien colwell <[EMAIL PROTECTED]> wrote: >>> Map reduce should be a constant factor improvement for the algorithm >>> complexity. I think you're asking for the overhead as a function of >>> input/cluster size? If your algorithm has some complexity O(f(n)), >>> and >>> you spread it over M nodes (constant), with some merge complexity >>> less >>> than f(n), the total time will still be O(f(n)). >>> >>> I run a small job, measure the time, and then extrapolate based on >>> the >> bigO. >>> >>> >>> >>> >>> >>> >>> On 3/1/2010 6:25 PM, Edward Capriolo wrote: >>>> On Mon, Mar 1, 2010 at 4:13 PM, Darren Govoni<[EMAIL PROTECTED]> >> wrote: >>>> >>>>> Theoretically. O(n) >>>>> >>>>> All other variables being equal across all nodes >>>>> should...mmmmm.....reduce to n. >>>>> >>>>> That part that really can't be measured is the cost of Hadoop's >>>>> bookkeeping chores as the data set grows since some things in >>>>> Hadoop >>>>> involve synchronous/serial behavior. >>>>> >>>>> On Mon, 20100301 at 12:27 0500, Edward Capriolo wrote: >>>>> >>>>> >>>>>> A previous post to coreuser mentioned some formula to >>>>>> determine job >>>>>> time. I was wondering if anyone out there is trying to tackle >>>>>> designing a formula that can calculate the job run time of a >>>>>> map/reduce program. Obviously there are many variables here >>>>>> including >>>>>> but not limited to Disk Speed ,Network Speed, Processor Speed, >>>>>> input >>>>>> data, many constants , dataskew, map complexity, reduce >>>>>> complexity, # >>>>>> of nodes...... >>>>>> >>>>>> As an intellectual challenge has anyone starting trying to >>>>>> write a >>>>>> formula that can take into account all these factors and try to >>>>>> actually predict a job time in minutes/hours? >>>>>> >>>>> >>>>> >>>>> >>>> Understood, BIG0 notation is really not what I am looking for. >>>> >>>> Given all variables are the same, a hadoop job on a finite set of >> 
