Re: Shuffling over the network for local map data.
Suresh Kumar 2013-01-22, 23:03
As Luke mentioned, the change I made is very useful for small clusters with
lots of cores. I'm working on a very ideal case, i.e. a 1-machine cluster with
48 cores, so I really do not know how it would do in more general use cases.
In one of my use cases, the shuffle copy used to take between 40 mins. It
now takes 10-30 seconds. In another use case, with a map that is almost an
identity function, the unpatched shuffle copy ran for 12+ hours before it
failed because it ran out of disk space. With the patched code the shuffle
copy took about 30-60 seconds.
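
For anyone following the thread, the idea of linking to map output that is
already on the local disk instead of copying it can be sketched roughly as
below. This is only an illustration, not the patch itself: the class name
LocalShuffleLink, the method linkOrCopy and the file names are made up for
the example, and it uses java.nio.file (Java 7+), which may not match what
the 1.0.3 code base can rely on.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical sketch: expose a local map output to a reducer via a hard link,
// falling back to a plain copy when linking is not possible.
class LocalShuffleLink {

    // Returns true if a hard link was created, false if the data was copied.
    static boolean linkOrCopy(Path mapOutput, Path reduceInput) throws IOException {
        // Remove any stale target so createLink does not fail on an existing file.
        Files.deleteIfExists(reduceInput);
        try {
            // A hard link adds a second name for the same data: no bytes are
            // copied and no extra disk space is used.
            Files.createLink(reduceInput, mapOutput);
            return true;
        } catch (UnsupportedOperationException | IOException e) {
            // Linking can fail, e.g. across file systems; copy as a fallback.
            Files.copy(mapOutput, reduceInput, StandardCopyOption.REPLACE_EXISTING);
            return false;
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("shuffle-sketch");
        Path mapOutput = dir.resolve("map_0.out");      // made-up file name
        Files.write(mapOutput, "key\tvalue\n".getBytes("UTF-8"));
        Path reduceInput = dir.resolve("reduce_0.in");  // made-up file name

        boolean linked = linkOrCopy(mapOutput, reduceInput);
        System.out.println(linked ? "linked" : "copied");
    }
}
```

A hard link only works when both paths are on the same file system, which is
why the sketch keeps the copy as a fallback.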
On Tue, Jan 22, 2013 at 11:42 AM, Albert Chu <[EMAIL PROTECTED]> wrote:
> I've experimented with similar changes in the hadoop trunk, although my
> desire was to improve performance for networked file systems. I had not
> considered the idea that it could be used for files stored locally on
> What type of performance tests did you run and what kind of improvements
> did you find (or not find)?
> On Tue, 2013-01-22 at 11:02 -0800, Suresh Kumar wrote:
> > I have a patch that tries to use file links instead of making a copy
> > of the data that is already available locally. I tested it on a
> > single-machine cluster configuration running 48 mappers and reducers.
> > I unfortunately do not have access to a cluster, even a small one. Can
> > someone review and test-run my patch?
> > I created the patch using Eclipse against 1.0.3. My knowledge of Java
> > is limited and the code is not well written in some classes. So please
> > let me know if I need to make changes to the code, along with a short
> > explanation of the change. I will happily do so.
> > Thanks,
> > Suresh.
> Albert Chu
> [EMAIL PROTECTED]
> Computer Scientist
> High Performance Systems Division
> Lawrence Livermore National Laboratory