MapReduce, mail # dev - hadoop1.2.1 speedup model


Re: hadoop1.2.1 speedup model
Robert Evans 2013-09-09, 14:57
How many times did you run the experiment at each setting?  What is the
standard deviation for each of these settings?  It could be that you are
simply running into the error bounds of Hadoop.  Hadoop is far from
consistent in its performance.  For our benchmarking we typically
run the test 5 times, throw out the top and bottom results as possible
outliers, and then average the other runs.  Even with that we have to be
very careful to weed out bad nodes, or the numbers are useless for
comparison.  The other thing to look at is where all of the time was spent
for each of these settings.  The map portion should be very close to
linear with the number of tasks, assuming that there is no disk or network
contention.  The shuffle is far from linear, as the number of fetches is a
function of the number of maps and the number of reducers.  The reduce
phase itself should be close to linear, assuming that there isn't much skew
in your data.
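
As a rough sketch of the procedure above (not our actual benchmarking
scripts): drop the fastest and slowest of the runs, average the rest, and
report the spread.  The run times below are made-up numbers, and
shuffle_fetches assumes the simplest case where every reducer fetches one
partition from every map, so the count is maps times reducers.

```python
import statistics


def trimmed_mean(runtimes):
    """Drop the single fastest and slowest run, then average the rest."""
    if len(runtimes) < 3:
        raise ValueError("need at least 3 runs to drop the fastest and slowest")
    kept = sorted(runtimes)[1:-1]
    return statistics.mean(kept)


def shuffle_fetches(num_maps, num_reducers):
    """Assumed model: each reducer fetches one partition from every map."""
    return num_maps * num_reducers


if __name__ == "__main__":
    # Hypothetical wall-clock times (seconds) for five runs at one setting.
    runs = [412.0, 398.0, 405.0, 630.0, 401.0]
    print("trimmed mean:", trimmed_mean(runs))
    print("sample stdev over all runs:", statistics.stdev(runs))
    # 64 maps and 64 reducers, as in the sort job described in this thread.
    print("shuffle fetches:", shuffle_fetches(64, 64))
```

Repeating this for each cluster size gives one trimmed mean and one standard
deviation per setting, which shows whether the differences between settings
are larger than the run-to-run noise.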

--Bobby

On 9/7/13 3:33 AM, "牛兆捷" <[EMAIL PROTECTED]> wrote:

>But I still want to find the most efficient assignment and scale both data
>and nodes as you said. For example, in my result, 2 is the best, and 8 is
>better than 4.
>
>Why is it sub-linear from 2 to 4 but super-linear from 4 to 8? I find it
>hard to model this result. Can you give me some hints about this kind of
>trend?
>
>
>2013/9/7 Vinod Kumar Vavilapalli <[EMAIL PROTECTED]>
>
>>
>> Clearly your input size isn't changing. And depending on how the input is
>> distributed on the nodes, there could be DataNode/disk contention.
>>
>> The better way to model this is by scaling the input data also linearly.
>> More nodes should process more data in the same amount of time.
>>
>> Thanks,
>> +Vinod
>>
>> On Sep 6, 2013, at 8:27 AM, 牛兆捷 wrote:
>>
>> > Hi all:
>> >
>> > I varied the number of computational nodes in the cluster and got the
>> > speedup result in the attachment.
>> >
>> > In my mind, there are three types of speedup model: linear, sub-linear
>> > and super-linear. However, the curve of my result seems a little
>> > strange. I have attached it.
>> > <speedup.png>
>> >
>> > This is the sort in example.jar; it uses only the default
>> > map-reduce mechanism of Hadoop.
>> >
>> > I use hadoop-1.2.1, with 8 map slots and 8 reduce slots per node (12
>> > cpu, 20g mem),
>> > io.sort.mb = 512, block size = 512mb, heap size = 1024mb,
>> > reduce.slowstart = 0.05; the others are default.
>> >
>> > Input data: 20g, which I divided into 64 files
>> >
>> > Sort example: 64 map tasks, 64 reduce tasks
>> >
>> > Computational nodes: varying from 2 to 9
>> >
>> > Why does the speedup behave like this? How can I model it properly?
>> >
>> > Thanks〜
>> >
>> > --
>> > Sincerely,
>> > Zhaojie
>> >
>>
>>
>>
>
>
>
>--
>*Sincerely,*
>*Zhaojie*