One approach we are considering is implementing a simple fuse-based file
system that only keeps files in memory. Then, while running mapreduce in
'pseudo-distributed' mode, configure MapReduce to use this in-memory
file system as the write location for the intermediate key value pairs.
Perhaps this technique would be supported already by features available
to regular users? Can anyone point me in the right direction?
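For what it is worth, a tmpfs mount may already give this behavior without writing a FUSE filesystem: on most Linux machines /dev/shm is an in-memory filesystem that regular users can write to. In pseudo-distributed mode one could try pointing the intermediate-output directory there (sketch only; /dev/shm/mapred-local is an assumed path, and mapred.local.dir is the 1.x-era property for where intermediate key-value pairs are spilled):

```xml
<!-- mapred-site.xml: spill intermediate key-value pairs to a tmpfs mount -->
<property>
  <name>mapred.local.dir</name>
  <value>/dev/shm/mapred-local</value>
</property>
```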
On 11/26/12 12:46 AM, Mahesh Balija wrote:
> Hi Randy/Alex,
> Your problem seems interesting, and I understand that you want
> to provide a way in Hadoop to handle small jobs as well.
> Please see my inline answers.
> On Mon, Nov 26, 2012 at 7:08 AM, rshepherd <[EMAIL PROTECTED]> wrote:
>> Hi everybody,
>> I am a student at NYU and am evaluating an idea for final project for a
>> distributed systems class. The idea is roughly as follows; the overhead
>> for running map-reduce on a 'small' job is high. (A small job would be
>> defined as something fitting in memory on a single machine.) Can
>> hadoop's map-reduce be modified to be efficient for jobs such as this?
>> It seems that one way to begin to achieve this goal would be to
>> modify the way the intermediate key-value pairs are handled, the
>> "handoff" from the map to the reduce. Rather than writing them to HDFS,
>> either pass them directly to a reducer or keep them in memory in a data
>> structure. Using a single, shared hashmap would alleviate the need to
>> sort the mapper output. Instead perhaps distribute the slots to a
>> reducer or reducers on multiple threads. My hope is that, as this is a
>> simplification of distributed map-reduce, it will be relatively
>> straightforward to alter the code to an in-memory approach for smaller jobs
>> that would perform very well for this special case.
> Actually, the framework is responsible for invoking the mapper and reducer
> and for maintaining the intermediate records on the local file system.
> I am not sure how much code you would need to rewrite to handle this case
> (perhaps the Context which writes the data, the partitioning, and invoking
> the reducer function for your hashmap entries, etc.).
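To make the shared-hashmap idea concrete, here is a minimal single-process sketch in plain Java (no Hadoop classes; the map/reduce shapes are simplified assumptions, not Hadoop's API). The "shuffle" is a shared ConcurrentHashMap keyed by word, so the mapper output never needs to be sorted or written to disk:

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Word count done entirely in memory: mappers emit (word, 1) pairs straight
// into a shared concurrent map, and merge() plays the reducer's role by
// summing values incrementally -- no sort phase, no intermediate files.
public class InMemoryMapReduce {
    public static Map<String, Integer> run(List<String> lines) {
        ConcurrentHashMap<String, Integer> shuffle = new ConcurrentHashMap<>();
        lines.parallelStream()                               // "mappers" on multiple threads
             .flatMap(line -> Arrays.stream(line.split("\\s+")))
             .forEach(word -> shuffle.merge(word, 1, Integer::sum)); // "reduce"
        return shuffle;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = run(Arrays.asList("a b a", "b c"));
        System.out.println(counts); // a -> 2, b -> 2, c -> 1
    }
}
```

The thread-safe merge() call is what replaces both the partitioner and the sort: keys are resolved by hashing rather than by ordering, which is the simplification described above.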
> NOTE: since your hashmap is small enough to fit in memory, serializing it
> to the corresponding reducer will be an overhead if the reducer is not on
> the same node (it is better to avoid serializing to a different node).
>> I was hoping that someone on the list could help me with the following
>> 1) Does this sound like a good idea that might be achievable in a few
> Though this idea is interesting, it might need a lot of effort, as you have
> to understand the framework thoroughly. It may also need a lot of code
> changes. In addition, the behavior should be configurable, e.g. via a
> property set on the Job instance.
>> 2) Does my intuition about how to achieve the goal seem reasonable?
> I am not really sure, as you would need to dig into various components.
>> 3) If so, any advice on now to navigate the code base? (Any pointers on
>> packages/classes of interest would be highly appreciated)
> Context, Partitioner, Mapper, Reducer, Job/JobConf, the backend framework
> classes which invoke them, and perhaps more that I cannot think of right now.
>> 4) Any other feedback?
> Your idea seems to be exactly the opposite of how Hadoop operates.
> Evaluate some options, like running a job in local runner mode, and see how
> that differs from your idea/approach.
> Also, making this efficient while handling the different cases will be a
> big concern (e.g. serializing the map even though it is not needed).
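Concretely, the local runner mode mentioned above is selected through configuration alone: in 1.x-era Hadoop, setting the job tracker to "local" runs the whole job in a single JVM, with mappers and reducers in-process and intermediates on the local filesystem. That is worth benchmarking as a baseline against the in-memory approach (config sketch, to be placed in mapred-site.xml or set on the job's Configuration):

```xml
<!-- mapred-site.xml: run jobs in-process with LocalJobRunner -->
<property>
  <name>mapred.job.tracker</name>
  <value>local</value>
</property>
```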
>> Thanks in advance to anyone willing and able to help!