-Re: Shuffle phase: fine-grained control of data flow
Luke Lu 2012-11-09, 17:54
For skewed map output, another optimization would be "local direct shuffle"
to further minimize network usage AND disk spillage (on the reducer side),
ie. schedule reduce tasks on the same host with the largest map output.
On Thu, Nov 8, 2012 at 8:59 AM, Robert Evans <[EMAIL PROTECTED]> wrote:
> Ok so you are specifically looking at reducing overall network bandwidth
> of skewed map outputs, not all map outputs. That would very much mean
> that #1 and #3 are off base. But as you point out it would only really be
> performance win if the data fits into memory. It seems like an interesting
> idea. If the goal is to reduce bandwidth and not improve individual job
> performance then it seems more plausible. Do you have a benchmark (grid
> mix run etc) that really taxes the network that you could use to measure
> the impact such a change would have? Something like this really needs
> some hard numbers for a proper evaluation.
> --Bobby Evans
> On 11/7/12 11:32 PM, "Jiwei Li" <[EMAIL PROTECTED]> wrote:
> >Hi Bobby,
> >Thank you a lot for your suggestions. My whole idea is to minimize the
> >aggregate network bandwidth during Shuffle Phase, that is, to limit the
> >hops to minimum when transmitting data from map node to reduce node.
> >Usually, Partitioner creates skews that the JobTracker allocates different
> >amounts of map outputs to participating reduce nodes. Making reduce nodes
> >near map outputs with largest concerned partitions can reduce the
> >network bandwidth.
> >For #1, there is no need to schedule map tasks to be close to one another,
> >since it will only congest links among the cluster. For #2, the location
> >and size of each partition in each map output can be sent to JobTracker
> >along with the processing of InputSplit. Collecting enough such
> >(not necessarily waiting map tasks to finish), the JobTracker starts to
> >schedule reduce tasks to fetch map output data. #3 is the same as #1.
> >Now the tricky part is that if all map outputs are spilled to disks,
> >network bandwidth may not be a bottleneck, because the time consumed in
> >disk seeks outnumbers that in data transmission. If map outputs fit in
> >memory, then network must be taken seriously. Also note that for evenly
> >distributed map outputs, current scheduling policy works just fine.
> >On Wed, Nov 7, 2012 at 11:45 PM, Robert Evans <[EMAIL PROTECTED]>
> >> Jiwei,
> >> I think you could use that knowledge to launch reducers closer to the
> >> output, but I am not sure that it would make much difference. It may
> >> slow things down. It is a question of several things
> >> 1) Can we get enough map tasks close to one another that it will make a
> >> difference?
> >> 2) Does the reduced shuffle time offset the overhead of waiting for the
> >> map location data before launching and fetching data early?
> >> 3) and do the time savings also offset the overhead of getting the map
> >> tasks to be close to one another?
> >> For #2 you might be able to deal with this by using speculative
> >> and launching some reduce tasks later if you see a clustering of map
> >> output. For #1 it will require changes to how we schedule tasks which
> >> depending on how well it is implemented will impact #3 as well.
> >> Additionally for #1 any job that approaches the same order of size as
> >> cluster will almost require the map tasks to be evenly distributed
> >> the cluster. If you can come up with a patch I would love to see some
> >> performance numbers.
> >> Personally I think spending time reducing the size of the data sent to
> >> reducers is a much bigger win. Can you use a combiner? Do you really
> >> all of the data or can you sample the data to get a statistically
> >> significant picture of what is in the data? Have you enabled
> >> between the maps and the reducers?