-
map-reduce-related school project help (rshepherd, 2012-11-26, 01:38)
Hi everybody,
I am a student at NYU and am evaluating an idea for a final project for a distributed systems class. The idea is roughly as follows: the overhead of running map-reduce on a 'small' job is high. (A small job is defined here as one that fits in memory on a single machine.) Can Hadoop's map-reduce be modified to be efficient for jobs such as this?

It seems that one way to begin to achieve this goal would be to modify the way the intermediate key-value pairs are handled, the "handoff" from the map to the reduce. Rather than writing them to HDFS, either pass them directly to a reducer or keep them in memory in a data structure. Using a single, shared hashmap would alleviate the need to sort the mapper output. Instead, perhaps distribute the slots to a reducer or reducers on multiple threads. My hope is that, as this is a simplification of distributed map-reduce, it will be relatively straightforward to alter the code to an in-memory approach for smaller jobs that would perform very well for this special case.

I was hoping that someone on the list could help me with the following questions:

1) Does this sound like a good idea that might be achievable in a few weeks?
2) Does my intuition about how to achieve the goal seem reasonable?
3) If so, any advice on how to navigate the code base? (Any pointers on packages/classes of interest would be highly appreciated.)
4) Any other feedback?

Thanks in advance to anyone willing and able to help!

Randy
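
To make the proposal concrete, here is a minimal sketch of the in-memory handoff described above. It is not Hadoop code: `run_in_memory`, `word_count_mapper`, and `word_count_reducer` are hypothetical names chosen for illustration, and the example assumes the whole input and all intermediate pairs fit in one process's memory. Mapper output is grouped in a single hashmap, so no sort is performed before the reduce step.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Iterator, List, Tuple


def run_in_memory(
    records: Iterable[str],
    mapper: Callable[[str], Iterator[Tuple[str, int]]],
    reducer: Callable[[str, List[int]], int],
) -> Dict[str, int]:
    """Run map, group, and reduce in one process with no intermediate files."""
    # The shared hashmap: key -> list of values emitted for that key.
    groups: Dict[str, List[int]] = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    # Keys reach the reducer in insertion order, not sorted order.
    return {key: reducer(key, values) for key, values in groups.items()}


def word_count_mapper(line: str) -> Iterator[Tuple[str, int]]:
    """Emit (word, 1) for every whitespace-separated word in the line."""
    for word in line.split():
        yield word.lower(), 1


def word_count_reducer(key: str, values: List[int]) -> int:
    """Sum the counts collected for one word."""
    return sum(values)


if __name__ == "__main__":
    lines = ["the quick brown fox", "the lazy dog", "The end"]
    print(run_in_memory(lines, word_count_mapper, word_count_reducer))
```

One trade-off to keep in mind: grouping with a hashmap means reducers no longer see keys in sorted order, so any job that relies on sorted reducer input would behave differently under this scheme.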
-
Re: map-reduce-related school project help (sampanriver@..., 2012-11-26, 02:54)
Hi Randy,
The intermediate key-value pairs are not written to HDFS. They are written to the local file system. Besides, if the job is "small", why use MapReduce at all? You can just run it on a local machine.

Jiang Shan
-
Re: map-reduce-related school project help (rshepherd, 2012-11-26, 02:55)
Hi Jiang, thanks for your response.
I think the idea would be to be able to use the map-reduce programming paradigm on small, local jobs. In other words, provide a way to take existing jobs that are running in a distributed fashion and run them against the machine-local version. Part of the purpose is educational: it is intended to illustrate the way that map-reduce is implemented and the trade-offs that are present. I hope this clarifies things.
-
Re: map-reduce-related school project help (Alex Halter, 2012-11-26, 04:13)
Hi, I am working with Randy on this. I know one reason for this project is that actual usage data from big companies that use map-reduce suggests that a high percentage of jobs are small.
-
Re: map-reduce-related school project help (Mahesh Balija, 2012-11-26, 05:46)
Hi Randy/Alex,
Your problem seems interesting, and I understand that you want to provide a way for Hadoop to handle small jobs as well. Please see my inline answers.

On Mon, Nov 26, 2012 at 7:08 AM, rshepherd <[EMAIL PROTECTED]> wrote:

> It seems that one way to begin to achieve this goal would be to
> modify the way the intermediate key-value pairs are handled, the
> "handoff" from the map to the reduce. Rather than writing them to HDFS,
> either pass them directly to a reducer or keep them in memory in a data
> structure. Using a single, shared hashmap would alleviate the need to
> sort the mapper output. Instead, perhaps distribute the slots to a
> reducer or reducers on multiple threads.

Actually, the framework is responsible for invoking the mapper and reducer functions and for maintaining the intermediate records in a local file system. I am not sure how much code you would need to rewrite to handle this case (maybe the Context, which writes the data, plus partitioning, invoking the reducer function for your hashmap entries, etc.).

NOTE: Since your hashmap is small enough to fit in memory, serializing it to the corresponding reducer will be an overhead if the reducer is not on the same node. (It is better to avoid serializing to a different node.)

> 1) Does this sound like a good idea that might be achievable in a few weeks?

Though this idea is interesting, it might need a lot of effort, as you have to understand the framework thoroughly. It may also need a lot of code changes. In addition, it should be configurable, or it should be a property set on the Job instance.

> 2) Does my intuition about how to achieve the goal seem reasonable?

I am not really sure, as you need to dig into various components.

> 3) If so, any advice on how to navigate the code base? (Any pointers on
> packages/classes of interest would be highly appreciated)

Context, Partitioner, Mapper, Reducer, Job/JobConf, the backend framework classes that invoke them, and maybe more that I cannot think of right now.

> 4) Any other feedback?

Your idea seems to be exactly the opposite of how Hadoop operates. Evaluate some options, such as running a job in local runner mode, and how they differ from your idea/approach. Also, making this efficient across the different cases will be the biggest concern (for example, serializing the map even though it is not needed).
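
The partitioning and reducer-invocation steps mentioned in this reply can be sketched roughly as follows. This is an illustrative toy, not Hadoop's Partitioner or Context: `partition_for`, `map_phase`, `reduce_partition`, `run`, and `tokenize` are hypothetical names, and the sketch assumes all data stays in one process so nothing is serialized between map and reduce. Each key is hashed to exactly one reducer, and each reducer runs on its own thread.

```python
import zlib
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable, Iterator, List, Tuple


def partition_for(key: str, num_reducers: int) -> int:
    """Map a key to a reducer index with a hash that is stable across runs."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def map_phase(
    records: Iterable[str],
    mapper: Callable[[str], Iterator[Tuple[str, int]]],
    num_reducers: int,
) -> List[Dict[str, List[int]]]:
    """Run the mapper and group its output into one hashmap per reducer."""
    partitions: List[Dict[str, List[int]]] = [
        defaultdict(list) for _ in range(num_reducers)
    ]
    for record in records:
        for key, value in mapper(record):
            partitions[partition_for(key, num_reducers)][key].append(value)
    return partitions


def reduce_partition(
    partition: Dict[str, List[int]],
    reducer: Callable[[str, List[int]], int],
) -> Dict[str, int]:
    """Reduce one partition, visiting keys in sorted order within it."""
    return {key: reducer(key, partition[key]) for key in sorted(partition)}


def run(records, mapper, reducer, num_reducers: int = 2) -> List[Dict[str, int]]:
    """Return one result dict per reducer, computed on separate threads."""
    partitions = map_phase(records, mapper, num_reducers)
    with ThreadPoolExecutor(max_workers=num_reducers) as pool:
        return list(pool.map(lambda p: reduce_partition(p, reducer), partitions))


def tokenize(line: str) -> Iterator[Tuple[str, int]]:
    """Toy mapper: emit (word, 1) for each word in the line."""
    for word in line.split():
        yield word, 1


if __name__ == "__main__":
    for index, part in enumerate(run(["a b a", "b c"], tokenize, lambda k, v: sum(v))):
        print(index, part)
```

Because the partitions are handed to the reducer threads by reference, the serialization cost raised in the note above does not arise here. In standard CPython the threads will not speed up pure-Python reducers on multiple cores, so the threading in this sketch shows the structure rather than a performance gain.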
-
Re: map-reduce-related school project help (Sampan River, 2012-11-26, 07:51)
Hi Randy,
You can search for the paper "Break the MapReduce Stage Barrier" first. In the paper, the author proposed a pipeline map-reduce model that does not need the shuffle phase. I think it is very similar to your idea.

Jiang Shan
-
Re: map-reduce-related school project help (rshepherd, 2012-11-27, 17:35)
Hi Mahesh, thanks so much for your reply.
Fortunately, what we would need to complete would simply be a proof of concept, as opposed to a full-fledged feature of Hadoop. In other words, we only need to demonstrate that improvements are possible for this special case. Next we will be evaluating the code base to determine if such an improvement is possible given our time frame.

Thanks again for your help,
Randy
-
Re: map-reduce-related school project help (rshepherd, 2012-11-28, 17:24)
Hi folks,
One approach we are considering is implementing a simple FUSE-based file system that keeps files only in memory. Then, while running MapReduce in 'pseudo-distributed' mode, we would configure MapReduce to use this in-memory file system as the write location for the intermediate key-value pairs. Perhaps this technique is already supported by features available to regular users? Can anyone point me in the right direction?

Thanks again,
Randy
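
As a rough illustration of the "write intermediate pairs to an in-memory file system" idea, the sketch below spills sorted key-value pairs to a directory that lives on tmpfs when one is available. The assumptions are mine, not the thread's: it uses the Linux tmpfs mount commonly found at `/dev/shm` instead of a custom FUSE file system, and `pick_spill_dir`, `spill`, and `load` are hypothetical helpers, not Hadoop APIs. In Hadoop 1.x the directory for intermediate data is, to my knowledge, controlled by the `mapred.local.dir` property, so pointing that property at a tmpfs mount may be one way to try the same experiment without writing any code.

```python
import os
import pickle
import shutil
import tempfile
from typing import List, Optional, Tuple


def pick_spill_dir(preferred: str = "/dev/shm") -> str:
    """Create a spill directory on tmpfs if possible, else in the default temp dir."""
    base: Optional[str] = None
    if os.path.isdir(preferred) and os.access(preferred, os.W_OK):
        base = preferred  # memory-backed on most Linux systems
    return tempfile.mkdtemp(prefix="mr-spill-", dir=base)


def spill(pairs: List[Tuple[str, int]], spill_dir: str, name: str) -> str:
    """Write the pairs, sorted by key, to one intermediate file and return its path."""
    path = os.path.join(spill_dir, name)
    with open(path, "wb") as handle:
        pickle.dump(sorted(pairs), handle)
    return path


def load(path: str) -> List[Tuple[str, int]]:
    """Read back the pairs written by spill()."""
    with open(path, "rb") as handle:
        return pickle.load(handle)


if __name__ == "__main__":
    directory = pick_spill_dir()
    try:
        output = spill([("b", 1), ("a", 2)], directory, "map-0.out")
        print(output, load(output))
    finally:
        shutil.rmtree(directory)  # remove the intermediate files when done
```

Note that this only changes where the bytes land. The pairs are still serialized, sorted, and deserialized, so it would measure the disk I/O share of the overhead rather than remove the handoff itself.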