|
Grandl Robert
2012-07-08, 01:37
Harsh J
2012-07-08, 05:34
Harsh J
2012-07-09, 01:16
Grandl Robert
2012-07-09, 01:27
Pavan Kulkarni
2012-07-09, 02:56
Harsh J
2012-07-09, 03:38
Pavan Kulkarni
2012-07-09, 04:11
Arun C Murthy
2012-07-09, 13:24
Manoj Babu
2012-07-09, 17:52
Harsh J
2012-07-09, 17:57
Manoj Babu
2012-07-09, 18:07
Karthik Kambatla
2012-07-09, 18:12
Grandl Robert
2012-07-09, 19:55
Arun C Murthy
2012-07-09, 20:33
Grandl Robert
2012-07-10, 03:15
Karthik Kambatla
2012-07-10, 03:33
Subir S
2012-07-10, 15:29
Subir S
2012-07-14, 05:49
Harsh J
2012-07-14, 06:08
Subir S
2012-07-14, 12:00
Harsh J
2012-07-14, 13:55
Subir S
2012-07-16, 20:31
|
-
Basic question on how reducer works
Grandl Robert 2012-07-08, 01:37
Hi,

I have some questions related to basic functionality in Hadoop.

1. When a Mapper processes the intermediate output data, how does it know how many partitions to create (how many reducers there will be) and how much data goes into each partition for each reducer?

2. When the JobTracker assigns a task to a reducer, it will also specify the locations of the intermediate output data that the reducer should retrieve, right? But how will a reducer know which portion it has to retrieve from each remote location holding intermediate output?

Could somebody help me with these questions, and point me to the Java code that does this? I am running Hadoop 1.0.3.

Thanks,
Robert
-
Re: Basic question on how reducer works
Harsh J 2012-07-08, 05:34
Hi Robert,

Inline. (The answer is specific to Hadoop 1.x since you asked about that alone; certain things may vary in Hadoop 2.x.)

On Sun, Jul 8, 2012 at 7:07 AM, Grandl Robert <[EMAIL PROTECTED]> wrote:
> Hi,
>
> I have some questions related to basic functionality in Hadoop.
>
> 1. When a Mapper processes the intermediate output data, how does it know how many
> partitions to create (how many reducers there will be) and how much data goes into each
> partition for each reducer?

The number of reducers is not dynamic: it is user-specified and set in the job configuration. Hence the Partitioner knows the value it needs to use for its numPartitions (== numReduces for the job).

For this in the 1.x code, look at MapTask.java, in the constructors of the internal classes OldOutputCollector (Stable API) and NewOutputCollector (New API).

The amount of data estimated to go into a partition, for limit/scheduling checks, is currently a naive computation, done by summing the estimated output sizes of each map. See ResourceEstimator#getEstimatedReduceInputSize for the overall estimation across maps, and see Task#calculateOutputSize for the per-map estimation code.

> 2. When the JobTracker assigns a task to a reducer, it will also specify the
> locations of the intermediate output data that the reducer should retrieve, right?
> But how will a reducer know which portion it has to retrieve from each remote
> location holding intermediate output?

The JT does not send the location information when a reduce is scheduled. When the reducers begin their shuffle phase, they query the TaskTracker for the map completion events, via the TaskTracker#getMapCompletionEvents protocol call. The TaskTracker in turn calls the JobTracker#getTaskCompletionEvents protocol call to get this info underneath. The returned structure carries the host on which the map completed successfully, which the reduce's copier relies on to fetch the data from the right host's TT.

The reduce merely asks each TT for the data assigned to it for the specific completed maps. Note that a reduce task ID is also its partition ID, so it merely has to ask for the data for its own task ID #, and the TT serves, over HTTP, the right parts of the intermediate data to it.

Feel free to ping back if you need some more clarification! :)

--
Harsh J
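
To make the numPartitions point concrete, here is a minimal sketch of how a map-side partitioner can turn a key into a partition number. It follows the hash-and-modulo idea used by Hadoop's default HashPartitioner, but it is not the Hadoop class itself: PartitionSketch, partitionFor and the value 4 for the number of reduces are hypothetical names and numbers chosen for illustration.

```java
// Sketch only: hash-and-modulo partitioning in the spirit of Hadoop's HashPartitioner.
// PartitionSketch and partitionFor are hypothetical names, not Hadoop classes.
final class PartitionSketch {

    // Maps a key to a partition in [0, numPartitions). Masking with
    // Integer.MAX_VALUE clears the sign bit so the result is never negative.
    static int partitionFor(Object key, int numPartitions) {
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    public static void main(String[] args) {
        int numReduces = 4; // hypothetical value; in a real job it comes from the job configuration
        for (String key : new String[] {"apple", "banana", "cherry"}) {
            System.out.println(key + " -> partition " + partitionFor(key, numReduces));
        }
    }
}
```

Because the number of partitions is fixed before any map starts, every map computes the same partition for the same key, which is what lets reduce N later ask every map for "partition N" and get all records for its keys.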
-
Re: Basic question on how reducer works
Harsh J 2012-07-09, 01:16
The changes should appear in your Task's userlogs (not the TaskTracker logs). Have you deployed your changed code properly (i.e. did you generate a new tarball, or perhaps use the MRMiniCluster to do this)?

On Mon, Jul 9, 2012 at 4:57 AM, Grandl Robert <[EMAIL PROTECTED]> wrote:
> Hi Harsh,
>
> Your comments were extremely helpful.
>
> Still, I am wondering why, if I add LOG.info entries to MapTask.java or
> ReduceTask.java in most of the functions (including Old/NewOutputCollector),
> the logs are not shown. This makes it hard for me to track which
> functions are called and which are not, even more so in ReduceTask.java.
>
> Do you have any ideas?
>
> Thanks a lot for your answer,
> Robert

--
Harsh J
-
Re: Basic question on how reducer works
Grandl Robert 2012-07-09, 01:27
I see. I was looking into tasktracker log :).

Thanks a lot,
Robert
-
Re: Basic question on how reducer works
Pavan Kulkarni 2012-07-09, 02:56
I too had similar problems.
I guess we should also set the debug mode for that specific class in the log4j.properties file, shouldn't we?

And I didn't quite get what you mean by the task's userlogs. Where are these logs located? In the logs directory I only see logs for all the daemons. Thanks.

--
Pavan Kulkarni
-
Re: Basic question on how reducer works
Harsh J 2012-07-09, 03:38
Pavan,

This is covered in the MR tutorial doc: http://hadoop.apache.org/common/docs/stable/mapred_tutorial.html#Task+Logs

--
Harsh J
-
Re: Basic question on how reducer works
Pavan Kulkarni 2012-07-09, 04:11
Oh. Thanks a lot, Harsh.

--
Pavan Kulkarni
-
Re: Basic question on how reducer works
Arun C Murthy 2012-07-09, 13:24
Robert,

On Jul 7, 2012, at 6:37 PM, Grandl Robert wrote:
> Hi,
>
> I have some questions related to basic functionality in Hadoop.
>
> 1. When a Mapper process the intermediate output data, how it knows how many partitions to do(how many reducers will be) and how much data to go in each partition for each reducer ?
>
> 2. A JobTracker when assigns a task to a reducer, it will also specify the locations of intermediate output data where it should retrieve it right ? But how a reducer will know from each remote location with intermediate output what portion it has to retrieve only ?

To add to Harsh's comment: essentially, the TT *knows* where the output of a given map-id/reduce-id pair is present via an output-file/index-file combination.

Arun

--
Arun C. Murthy
Hortonworks Inc.
http://hortonworks.com/
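
The output-file/index-file combination can be pictured with a small sketch: one file holds all partitions of a map's output back to back, and an index records where each partition starts and how long it is. This is a simplified illustration under my own assumptions, not the on-disk format Hadoop uses; MapOutputIndexSketch, the two-column int index and the "k=v;" record encoding are all hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Sketch only: a toy "output file + index" layout. The names and the record
// encoding are hypothetical; the real Hadoop spill/index format differs.
final class MapOutputIndexSketch {

    // Builds the single output file by concatenating the partitions in order.
    static byte[] concat(String[] partitions) {
        StringBuilder all = new StringBuilder();
        for (String p : partitions) {
            all.append(p);
        }
        return all.toString().getBytes(StandardCharsets.UTF_8);
    }

    // index[r][0] is the start offset of partition r, index[r][1] its length in bytes.
    static int[][] buildIndex(String[] partitions) {
        int[][] index = new int[partitions.length][2];
        int offset = 0;
        for (int r = 0; r < partitions.length; r++) {
            int length = partitions[r].getBytes(StandardCharsets.UTF_8).length;
            index[r][0] = offset;
            index[r][1] = length;
            offset += length;
        }
        return index;
    }

    // Returns only the byte range that belongs to the requesting reduce.
    static String serve(byte[] outputFile, int[][] index, int reduceId) {
        int start = index[reduceId][0];
        int length = index[reduceId][1];
        byte[] slice = Arrays.copyOfRange(outputFile, start, start + length);
        return new String(slice, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String[] partitions = {"a=1;c=2;", "", "b=7;"}; // partition 1 happens to be empty
        byte[] file = concat(partitions);
        int[][] index = buildIndex(partitions);
        System.out.println("reduce 2 gets: " + serve(file, index, 2));
    }
}
```

The point of the index is that the serving side never has to scan the data: given a reduce ID it looks up one offset and one length and streams that range.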
-
Re: Basic question on how reducer works
Manoj Babu 2012-07-09, 17:52
Hi,

It would be helpful if you could give more details on the doubts below.

1. How does the partitioner know which reducer needs to be called?
2. When we use more than one reducer, the output gets separated. In what scenario do we actually need multiple reducers?

Cheers!
Manoj.
-
Re: Basic question on how reducer works
Harsh J 2012-07-09, 17:57
Manoj,

Think of it this way, and you shouldn't be confused: a reducer == a partition.

For (1) - Partitioners do not 'call' a reduce; they just write the data with the proper partition ID. The reducer whose ID is the same as the partition ID picks it up for itself later. We have already explained this earlier.

For (2) - In what scenario would you _not_ want multiple reducers, each handling its own partition, when it is possible to scale that way?

--
Harsh J
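
A rough sketch of "the reducer picks up its own partition": the reduce builds an HTTP request to the TaskTracker that hosted a completed map and names its own ID in the request. The path and parameter names below (mapOutput, job, map, reduce) are modeled on my reading of the 1.x ReduceTask code and should be treated as an assumption to verify against your source tree; the host, port and IDs are made-up examples, and FetchUrlSketch is a hypothetical class.

```java
// Sketch only: how a reduce might address its own partition of one map's output.
// The URL layout is an assumption modeled on Hadoop 1.x; verify it in ReduceTask.java.
final class FetchUrlSketch {

    // ttHttpAddress comes from the map completion event; reduceId is the
    // reduce task's own partition number.
    static String mapOutputUrl(String ttHttpAddress, String jobId, String mapAttemptId, int reduceId) {
        return ttHttpAddress + "/mapOutput?job=" + jobId
                + "&map=" + mapAttemptId
                + "&reduce=" + reduceId;
    }

    public static void main(String[] args) {
        // All values below are hypothetical examples.
        System.out.println(mapOutputUrl(
                "http://tt-host.example:50060",
                "job_201207080000_0001",
                "attempt_201207080000_0001_m_000003_0",
                2));
    }
}
```

Nothing on the map side needs to know which reduce will ask; the partition number in the request is the only thing that ties a reducer to its slice of the data.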
-
Re: Basic question on how reducer works
Manoj Babu 2012-07-09, 18:07
Hi Harsh,

Thanks for clarifying. I had thought earlier that the Partitioner picks the reducer.

My cluster setup provides options for multiple reducers, so I want to know when and in which scenario we have to go for multiple reducers.

Cheers!
Manoj.
-
Re: Basic question on how reducer works
Karthik Kambatla 2012-07-09, 18:12
Hi Manoj,

As Harsh said, we would almost always need multiple reducers. As each reduce is potentially executed on a different core (same machine or a different one), in most cases, we would want at least as many reduces as the number of cores for maximum parallelism/performance.

Karthik
-
Re: Basic question on how reducer works
Grandl Robert 2012-07-09, 19:55
Thanks a lot, guys, for the answers.

I am still not able to find the exact code for the following things:

1. How a reducer reads only its own partition from a map output. I looked into ReduceTask#getMapOutput, which does the actual read in ReduceTask#shuffleInMemory, but I don't see where it specifies which partition to read (the reduce ID).

2. I still don't understand very well in which part of the code (MapTask.java) the intermediate data is written to which partition. So MapOutputBuffer is the one that actually writes the data to the buffer and spills once the buffer is full. Could you please elaborate a bit on how the data is written to which partition?

Thanks,
Robert
-
Re: Basic question on how reducer works
Arun C Murthy 2012-07-09, 20:33
On Jul 9, 2012, at 12:55 PM, Grandl Robert wrote:

> Thanks a lot guys for answers.
>
> Still I am not able to find exactly the code for the following things:
>
> 1. reducer to read from a Map output only its partition. I looked into ReduceTask#getMapOutput which do the actual read in ReduceTask#shuffleInMemory, but I don't see where it specify which partition to read(reduceID).

Look at TaskTracker.MapOutputServlet.

> 2. still don't understand very well in which part of the code(MapTask.java) the intermediate data is written do which partition. So MapOutputBuffer is the one who actually writes the data to buffer and spill after buffer is full. Could you please elaborate a bit on how the data is written to which partition ?

Essentially you can think of the partition-id as the 'primary key' and the actual 'key' in the map-output of <key, value> as the 'secondary key'.

hth,
Arun

--
Arun C. Murthy
Hortonworks Inc.
http://hortonworks.com/
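To make the two hints above a bit more concrete, here is a toy model in plain Java, not Hadoop code. It sorts buffered records by partition ID first and by key second, writes each partition as one contiguous segment, and keeps a small index of where each segment starts and how long it is, so that a request for a given reduce ID can be answered by reading just that byte range. The class and method names (`MapOutputSketch`, `Rec`, `segmentFor`) and the tab-separated text layout are assumptions made for illustration; the real map output is a binary format and the real index record layout should be checked in the source.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

/** Toy model of one map's sorted output plus its index (hypothetical, not Hadoop code). */
class MapOutputSketch {

    /** One buffered map output record, tagged with its partition (reduce) ID. */
    static final class Rec {
        final int partition;
        final String key;
        final String value;

        Rec(int partition, String key, String value) {
            this.partition = partition;
            this.key = key;
            this.value = value;
        }
    }

    final byte[] data;   // stand-in for the single map output file
    final int[] start;   // index: start offset of each partition's segment
    final int[] length;  // index: byte length of each partition's segment

    MapOutputSketch(List<Rec> records, int numPartitions) {
        for (Rec r : records) {
            if (r.partition < 0 || r.partition >= numPartitions) {
                throw new IllegalArgumentException("bad partition " + r.partition);
            }
        }
        // Partition ID is the primary sort key, the record key the secondary one.
        List<Rec> sorted = new ArrayList<>(records);
        sorted.sort(Comparator.comparingInt((Rec r) -> r.partition)
                              .thenComparing((Rec r) -> r.key));

        ByteArrayOutputStream out = new ByteArrayOutputStream();
        start = new int[numPartitions];
        length = new int[numPartitions];
        int i = 0;
        for (int p = 0; p < numPartitions; p++) {
            start[p] = out.size();
            while (i < sorted.size() && sorted.get(i).partition == p) {
                Rec r = sorted.get(i++);
                byte[] line = (r.key + "\t" + r.value + "\n").getBytes(StandardCharsets.UTF_8);
                out.write(line, 0, line.length);
            }
            length[p] = out.size() - start[p];
        }
        data = out.toByteArray();
    }

    /** What a server would hand to reducer `reduceId`: only that partition's bytes. */
    String segmentFor(int reduceId) {
        return new String(data, start[reduceId], length[reduceId], StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        MapOutputSketch m = new MapOutputSketch(Arrays.asList(
                new Rec(1, "b", "x"), new Rec(0, "c", "y"),
                new Rec(1, "a", "z"), new Rec(0, "a", "w")), 3);
        System.out.print(m.segmentFor(1)); // records of partition 1 only, sorted by key
    }
}
```

In this model the reducer never scans the whole file: it names its reduce ID, and the index turns that into an offset and a length.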
-
Re: Basic question on how reducer works
Grandl Robert 2012-07-10, 03:15
Thanks Arun.
Just for my own clarification: the map will create partitions according to the number of reducers, such that each reducer gets almost the same number of keys in its partition. However, each key can have a different number of values, so the "weight" of each partition will depend on that. Also, when a new <key, value> is added to a partition, will a hash on the partition ID be computed to find the corresponding partition?

Robert
-
Re: Basic question on how reducer works
Karthik Kambatla 2012-07-10, 03:33
The partitioner is configurable. The default partitioner, from what I remember, computes the partition as the key's hashcode modulo the number of reducers/partitions. For random input it is balanced, but some cases can have a very skewed key distribution. Also, as you have pointed out, the number of values per key can vary. Together, both of them determine the "weight" of each partition, as you call it.

Karthik
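A minimal sketch of the hash-modulo scheme described above, in plain Java with no Hadoop dependency. The arithmetic (mask off the sign bit, then take the remainder) is what the stock HashPartitioner does as far as I know, but treat that as something to verify against the source; the class and method names here are made up for illustration. Note that the hash is taken over the record's key, and the result is the partition ID.

```java
import java.util.Arrays;

/** Hypothetical stand-in for a hash-modulo partitioner (not Hadoop code). */
class HashPartitionSketch {

    /** Non-negative hash of the key, modulo the number of partitions (== reducers). */
    static int partitionFor(Object key, int numPartitions) {
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    /** Counts how many records land in each partition, to show skew. */
    static int[] recordsPerPartition(String[] keys, int numPartitions) {
        int[] counts = new int[numPartitions];
        for (String k : keys) {
            counts[partitionFor(k, numPartitions)]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        // Three distinct keys, but "a" carries four of the six records.
        String[] skewed = {"a", "a", "a", "a", "b", "c"};
        System.out.println(Arrays.toString(recordsPerPartition(skewed, 2))); // prints [1, 5]
    }
}
```

With two partitions, the keys split two to one, yet one partition ends up with five of the six records. That is the "weight" effect: even a fair split of keys says nothing about how many values each key brings along.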
-
Re: Basic question on how reducer works
Subir S 2012-07-10, 15:29
Is there any property to convey the maximum amount of data each reducer/partition may take for processing? Something like Pig's bytes_per_reducer, so that the number of reducers can be controlled based on the size of the intermediate map output data?
-
Re: Basic question on how reducer works
Subir S 2012-07-14, 05:49
Probably a wrong question in a wrong thread and wrong mailing list :)
-
Re: Basic question on how reducer works
Harsh J 2012-07-14, 06:08
If you wish to impose a limit on the maximum reducer input allowed in a job, you may set "mapreduce.reduce.input.limit" on your job, as the total bytes allowed per reducer.

But this is more of a hard limit, which I suspect your question wasn't about. Your question is indeed better off on the Pig user lists.

--
Harsh J
-
Re: Basic question on how reducer works
Subir S 2012-07-14, 12:00
Harsh, thanks, I think this is what I was looking for. I have 3 related questions.

1.) Will this work in 0.20.2-cdh3u3?
2.) What is the hard limit that you mean?
3.) Can this be applied for streaming?

Thanks,
Subir
-
Re: Basic question on how reducer works
Harsh J 2012-07-14, 13:55
Subir,
On Sat, Jul 14, 2012 at 5:30 PM, Subir S <[EMAIL PROTECTED]> wrote:
> Harsh, Thanks I think this is what I was looking for. I have 3 related
> questions.
>
> 1.) Will this work in 0.20.2-cdh3u3

Yes, will work. (Btw, best to ask CDH-specific questions on the [EMAIL PROTECTED] lists)

> 2.) What is the hard limit that you mean?

If a reducer gets more data than this value, due to the map's outputs growing large (for any partition), the job will begin to fail.

> 3.)Can this be applied for streaming?

Yes, streaming is still MR and this property is for MR (applied during scheduling, so not streaming/java specific).

--
Harsh J
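For readers wondering what a "hard limit applied during scheduling" amounts to, here is a rough sketch in plain Java. It is not the JobTracker's code: the names are hypothetical, the even split of the summed map output estimates across reducers is an assumption, and so is the reading of a non-positive limit as "no limit". It only illustrates the shape of the check, which compares an estimate (not the actual bytes a reducer reads) against the configured value.

```java
/** Hypothetical model of a reduce-input limit check (not Hadoop code). */
class ReduceInputLimitSketch {

    /**
     * Naive per-reducer estimate: sum the estimated output size of every map
     * and split it evenly across reducers (the even split is an assumption).
     */
    static long estimatedReduceInput(long[] estimatedMapOutputBytes, int numReduces) {
        long total = 0;
        for (long bytes : estimatedMapOutputBytes) {
            total += bytes;
        }
        return total / numReduces;
    }

    /** True if the job would be failed; a limit <= 0 is treated as "no limit" (assumption). */
    static boolean exceedsLimit(long estimatedBytes, long limitBytes) {
        return limitBytes > 0 && estimatedBytes > limitBytes;
    }

    public static void main(String[] args) {
        long[] mapEstimates = {400L, 600L, 1000L};
        long perReducer = estimatedReduceInput(mapEstimates, 4);
        System.out.println(perReducer);                     // prints 500
        System.out.println(exceedsLimit(perReducer, 450L)); // prints true
        System.out.println(exceedsLimit(perReducer, -1L));  // prints false
    }
}
```

Since the decision is a pass/fail check on an estimate, it does not resize the job or change the number of reducers, which is why it differs from a bytes-per-reducer style setting.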
-
Re: Basic question on how reducer works
Subir S 2012-07-16, 20:31
Just for the reference of others who might see this thread: the JIRA corresponding to the parameter on the reduce input limit is MAPREDUCE-2324.