Hadoop user mailing list - Big-O Notation for Hadoop


Re: Big-O Notation for Hadoop
Jeff Hammerbacher 2010-03-03, 02:09
Predicting the run time of a MapReduce/Pig/Hive job has been addressed by
folks at the University of Washington (e.g.
http://www.cs.washington.edu/homes/kmorton/ICDE10.pdf) and Berkeley (e.g.,
using the techniques from
http://www.cs.berkeley.edu/~archanag/publications/ICDE09.pdf).

On Mon, Mar 1, 2010 at 4:48 PM, Edward Capriolo <[EMAIL PROTECTED]> wrote:

> I am looking at this in many different ways.
>
> For example: shuffle sort might run faster if we have 12 disks per node
> instead of 8.
>
>
> So shuffle sort involves data size, disk speed, network speed,
> processor speed, and number of nodes.
>
>
> Can we find a formula that takes these (and more) factors into account?
> Once we find it, we should be able to plug in 12 or 8 and get a result
> close to the actual shuffle sort time.
>
>
> I think it would be rather cool to have a long, drawn-out formula that
> even made reference to some constants, like the time to copy data to
> the distributed cache.
>
>
>
> I am looking at source data size, map complexity, map output size,
> shuffle sort time, reduce complexity, and number of nodes, and trying to
> arrive at a formula that will say how long a job will take.
>
> From there we can factor in something like all nodes having 10 GbE
> and watch the entire thing fall apart :)
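
A rough sketch of the kind of formula being asked for above, in Python; every
throughput number, the fixed overhead constant, and the estimate_shuffle_time
helper are illustrative assumptions, not measurements of a real cluster:

    # Back-of-the-envelope model of shuffle/sort time from cluster hardware.
    # All constants are assumed example values, not benchmarks.

    def estimate_shuffle_time(map_output_bytes,
                              nodes=10,
                              disks_per_node=8,
                              disk_mb_per_s=80.0,   # assumed sustained throughput per disk
                              net_mb_per_s=100.0,   # assumed usable bandwidth per node (~1 GbE)
                              overhead_s=30.0):     # assumed fixed per-job bookkeeping cost
        """Estimate shuffle/sort wall-clock time in seconds."""
        mb = map_output_bytes / 1e6
        mb_per_node = mb / nodes

        # Intermediate data is written locally and read back: bounded by the
        # aggregate disk throughput of each node.
        disk_time = 2 * mb_per_node / (disks_per_node * disk_mb_per_s)

        # Roughly (nodes - 1) / nodes of the data has to cross the network.
        net_time = mb_per_node * (nodes - 1) / nodes / net_mb_per_s

        # Disk and network overlap imperfectly; take the larger as the bottleneck.
        return overhead_s + max(disk_time, net_time)

    # Example: 1 TB of map output on 20 nodes, comparing 8 vs. 12 disks per node.
    for disks in (8, 12):
        t = estimate_shuffle_time(1e12, nodes=20, disks_per_node=disks)
        print(f"{disks} disks/node: ~{t / 60:.1f} minutes")

With these made-up numbers the 1 GbE network is the bottleneck, so 8 vs. 12
disks changes nothing; raise net_mb_per_s toward 10 GbE and the disk term
takes over, which is exactly the 12-disk-versus-8-disk comparison above.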
>
>
>
>
> On 3/1/10, brien colwell <[EMAIL PROTECTED]> wrote:
> > MapReduce should be a constant-factor improvement for the algorithm
> > complexity. I think you're asking for the overhead as a function of
> > input/cluster size? If your algorithm has some complexity O(f(n)) and
> > you spread it over M nodes (a constant), with some merge complexity less
> > than f(n), the total time will still be O(f(n)).
> >
> > I run a small job, measure the time, and then extrapolate based on the
> > big-O.
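
A minimal sketch of that small-job extrapolation, assuming the job is roughly
O(n) in input size; the sample points and the least-squares fit below are
invented for illustration:

    # Extrapolate job run time from a few small sample runs, separating fixed
    # overhead from per-unit work. Sample points are made up for illustration.

    samples = [           # (input size in GB, measured run time in seconds)
        (1,  95),
        (2, 160),
        (4, 290),
    ]

    # Fit time = overhead + rate * size with an ordinary least-squares line.
    n = len(samples)
    sx = sum(s for s, _ in samples)
    st = sum(t for _, t in samples)
    sxx = sum(s * s for s, _ in samples)
    sxt = sum(s * t for s, t in samples)
    rate = (n * sxt - sx * st) / (n * sxx - sx * sx)
    overhead = (st - rate * sx) / n

    def predict_seconds(size_gb):
        return overhead + rate * size_gb

    print(f"overhead ~{overhead:.0f}s, ~{rate:.0f}s per GB")
    print(f"predicted time for 500 GB: ~{predict_seconds(500) / 3600:.1f} hours")

For an algorithm that is not linear, the same fit can be done against f(n)
instead of n before extrapolating.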
> >
> >
> >
> >
> >
> >
> > On 3/1/2010 6:25 PM, Edward Capriolo wrote:
> >> On Mon, Mar 1, 2010 at 4:13 PM, Darren Govoni <[EMAIL PROTECTED]> wrote:
> >>
> >>> Theoretically, O(n).
> >>>
> >>> All other variables being equal across all nodes
> >>> should...mmmmm.....reduce to n.
> >>>
> >>> The part that really can't be measured is the cost of Hadoop's
> >>> bookkeeping chores as the data set grows, since some things in Hadoop
> >>> involve synchronous/serial behavior.
> >>>
> >>> On Mon, 2010-03-01 at 12:27 -0500, Edward Capriolo wrote:
> >>>
> >>>
> >>>> A previous post to core-user mentioned some formula to determine job
> >>>> time. I was wondering if anyone out there is trying to tackle
> >>>> designing a formula that can calculate the job run time of a
> >>>> map/reduce program. Obviously there are many variables here, including
> >>>> but not limited to disk speed, network speed, processor speed, input
> >>>> data, many constants, data skew, map complexity, reduce complexity,
> >>>> number of nodes...
> >>>>
> >>>> As an intellectual challenge, has anyone started trying to write a
> >>>> formula that can take all these factors into account and
> >>>> actually predict a job time in minutes/hours?
> >>>>
> >>>
> >>>
> >>>
> >> Understood, Big-O notation is really not what I am looking for.
> >>
> >> Given all variables are the same, a Hadoop job on a finite set of data
> >> should run for a finite time. There are parts of the process that run
> >> serially and parts that run in parallel, but there must be a way to
> >> express how long a job actually takes (although admittedly it is very
> >> involved to figure out).
> >>
> >
> >
>
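
A hedged sketch of that "serial parts plus parallel parts" decomposition, in
the same illustrative spirit; the timings and the job_time helper are assumed
values, not measurements:

    # Amdahl-style decomposition: total job time is the serial portion (job
    # setup, scheduler bookkeeping, any single-reducer steps) plus the parallel
    # portion divided across the nodes. All numbers are assumptions.

    def job_time(serial_s, parallel_s, nodes):
        """Estimated wall-clock time for work split into serial and parallel parts."""
        return serial_s + parallel_s / nodes

    serial_s = 120.0          # assumed setup and bookkeeping cost
    parallel_s = 8 * 3600.0   # assumed total map + shuffle + reduce work

    for nodes in (4, 8, 16, 32):
        print(f"{nodes:>3} nodes: ~{job_time(serial_s, parallel_s, nodes) / 60:.0f} minutes")

The serial term is why doubling the cluster never quite halves the run time,
which matches the earlier point about Hadoop's bookkeeping chores.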