Re: Why In-memory Mapoutput is necessary in ReduceCopier
MR2 no longer uses Jetty to serve map output transfers; it has moved
over to Netty (the fetches themselves are still HTTP GETs). If you're
focusing on development or improvement work on Apache Hadoop, I'd
suggest reading/relying on the trunk code rather than the branch-1 code.
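
For context, a minimal sketch of the Netty 3 server pattern that trunk's
ShuffleHandler follows; the pipeline is simplified and the inline handler
is a stand-in, so treat it as illustrative rather than the actual trunk
code:

  import java.net.InetSocketAddress;
  import java.util.concurrent.Executors;
  import org.jboss.netty.bootstrap.ServerBootstrap;
  import org.jboss.netty.channel.*;
  import org.jboss.netty.channel.socket.nio.NioServerSocketChannelFactory;
  import org.jboss.netty.handler.codec.http.HttpRequestDecoder;
  import org.jboss.netty.handler.codec.http.HttpResponseEncoder;

  public class ShuffleServerSketch {
    public static void main(String[] args) {
      ServerBootstrap b = new ServerBootstrap(new NioServerSocketChannelFactory(
          Executors.newCachedThreadPool(), Executors.newCachedThreadPool()));
      b.setPipelineFactory(new ChannelPipelineFactory() {
        public ChannelPipeline getPipeline() {
          return Channels.pipeline(
              new HttpRequestDecoder(),   // fetches are still HTTP GETs...
              new HttpResponseEncoder(),  // ...just served by Netty, not Jetty
              new SimpleChannelUpstreamHandler() {
                @Override
                public void messageReceived(ChannelHandlerContext ctx,
                                            MessageEvent e) {
                  // The real handler parses the map/reduce query parameters
                  // and streams the requested map output file regions back.
                }
              });
        }
      });
      b.bind(new InetSocketAddress(13562)); // 13562: MR2's default shuffle port
    }
  }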

On Tue, Mar 12, 2013 at 7:12 AM, Ling Kun <[EMAIL PROTECTED]> wrote:
> Dear Ravi and all,
>
>    Thanks very much for your kindly reply.
>    I am currently wondering whether it is possible to eliminate the
> HTTP GET method in some other way, and I have not yet come up with a
> better idea.
>
>    Thanks again.
>
> yours,
> Ling Kun
>
>
> On Tue, Mar 12, 2013 at 12:58 AM, Ravi Prakash <[EMAIL PROTECTED]> wrote:
>
>> Hi Ling,
>>
>> Yes! It is because of performance concerns. We want to keep and merge map
>> outputs in memory as much as we can. The amount of memory reserved for this
>> purpose is configurable. Obviously, storing fetched map outputs on disk,
>> reading them back from disk to merge them, and then writing the result back
>> out to disk is a lot more expensive than doing it all in memory.
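>>
>> For reference, the relevant branch-1 knobs look roughly like this
>> (property names are from Hadoop 1.x mapred-default.xml; the values shown
>> are the defaults, so treat this as illustrative):
>>
>>   <!-- mapred-site.xml -->
>>   <property>
>>     <name>mapred.job.shuffle.input.buffer.percent</name>
>>     <value>0.70</value>  <!-- fraction of reducer heap used to hold
>>                               fetched map outputs in memory -->
>>   </property>
>>   <property>
>>     <name>mapred.job.shuffle.merge.percent</name>
>>     <value>0.66</value>  <!-- buffer usage at which the in-memory merge
>>                               is started -->
>>   </property>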
>>
>> Please let us know if you find a case where there was an opportunity to
>> keep the map output in memory but we instead shuffled it to disk.
>>
>> Thanks
>> Ravi
>>
>>
>>
>>
>> ________________________________
>>  From: Ling Kun <[EMAIL PROTECTED]>
>> To: [EMAIL PROTECTED]
>> Sent: Monday, March 11, 2013 5:27 AM
>> Subject: Why In-memory Mapoutput is necessary in ReduceCopier
>>
>> Dear all,
>>
>>      I am focusing on the map output copier implementation. This part of
>> the code fetches map outputs and merges them into a file that can be fed
>> to the reduce function. I have the following questions.
>>
>> 1. All the on-disk map output data will be merged together by the
>> LocalFSMerger, and the in-memory map outputs will be merged by the
>> InMemFSMergeThread. For the InMemFSMergeThread, there is also a writer
>> object which writes the result to outputPath (ReduceTask.java line 2843).
>> It seems that after merging, both the in-memory and the on-disk map output
>> data end up in the local file system. Why not just use local files for all
>> map output data?
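>>
>> (The spill at the end of the in-memory merge is, as far as I can tell,
>> roughly the following, simplified from branch-1 ReduceTask.java; names
>> are abbreviated and error handling is dropped:
>>
>>   // InMemFSMergeThread: merge the in-memory segments once the RamManager
>>   // signals, then write the merged result out as a single local file.
>>   RawKeyValueIterator rIter = Merger.merge(conf, localFs,
>>       keyClass, valClass, inMemorySegments, inMemorySegments.size(),
>>       tmpDir, comparator, reporter, spilledRecordsCounter, null);
>>   Writer writer = new Writer(conf, localFs, outputPath,
>>       keyClass, valClass, codec, null);
>>   Merger.writeFile(rIter, writer, reporter, conf);  // in-mem -> disk
>>   writer.close();
>> )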
>>
>> 2. After fetching a fragment of a map output file over HTTP, some of the
>> map output data is selected to be kept in memory, while the rest is
>> written directly to the reducer's local disk. Which map outputs are kept
>> in memory is determined in MapOutputCopier.getMapOutput(), which calls
>> ramManager.canFitInMemory(). Why not store all the data to disk?
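>>
>> (From my reading, canFitInMemory() in branch-1 boils down to a size
>> check of roughly this shape; paraphrased, not the literal code:
>>
>>   // maxSize ~= reducer heap * mapred.job.shuffle.input.buffer.percent.
>>   // A single map output may claim at most 25% of that buffer.
>>   boolean canFitInMemory(long requestedSize) {
>>     return requestedSize < Integer.MAX_VALUE
>>         && requestedSize < (long) (maxSize * 0.25f);
>>   }
>> )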
>>
>> 3. According to the comment, Hadoop will keep a map output in memory if:
>> (a) the size of the (decompressed) file is less than 25% of the total
>> in-mem fs, and (b) there is space available in the in-mem fs. Why? Is it
>> because of performance?
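>>
>> (To make the 25% rule concrete: with a 1 GB reducer heap and the default
>> 0.70 input buffer fraction, the in-mem fs is about 716 MB, so any single
>> decompressed map output larger than about 179 MB would be shuffled
>> straight to disk. These numbers are illustrative defaults, not taken
>> from a running cluster.)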
>>
>>
>>
>> Thanks
>>
>> yours,
>> Ling Kun
>>
>> --
>> http://www.lingcc.com
>>
>
>
>
> --
> http://www.lingcc.com

--
Harsh J