By default you get at least one task per file; if any file is bigger than a
block, then that file is broken up into N tasks where each is one block
long. Not sure what you mean by "properly calculate" -- as long as you have
more tasks than you have cores, then you'll definitely have work for every
core to do; having more tasks with high granularity will also let nodes that
get "small" tasks to complete many of them while other cores are stuck with
the "heavier" tasks.
If you call setNumMapTasks() with a higher number of tasks than the
InputFormat creates (via the algorithm above), then it should create
additional tasks by dividing files up into smaller chunks (which may be
As for where you should run your computation.. I don't know that the "map"
and "reduce" phases are really "optimized" for computation in any particular
way. It's just a data motion thing. (At the end of the day, it's your code
doing the processing on either side of the fence, which should dominate the
execution time.) If you use an identity mapper with a pseudo-random key to
spray the data into a bunch of reduce partitions, then you'll get a bunch of
reducers each working on a hopefully-evenly-sized slice of the data. So the
map tasks will quickly read from the original source data and forward the
workload along to the reducers which do the actual heavy lifting. The cost
of this approach is that you have to pay for the time taken to transfer the
data from the mapper nodes to the reducer nodes and sort by key when it gets
there. If you're only working with 600 MB of data, this is probably
negligible. The advantages of doing your computation in the reducers is
1) You can directly control the number of reducer tasks and set this equal
to the number of cores in your cluster.
2) You can tune your partitioning algorithm such that all reducers get
roughly equal workload assignments, if there appears to be some sort of skew
in the dataset.
The tradeoff is that you have to ship all the data to the reducers before
computation starts, which sacrifices data locality and involves an
"intermediate" data set of the same size as the input data set. If this is
in the range of hundreds of GB or north, then this can be very
time-consuming -- so it doesn't scale terribly well. Of course, by the time
you've got several hundred GB of data to work with, your current workload
imbalance issues should be moot anyway.
On Fri, Nov 27, 2009 at 4:33 PM, CubicDesign <[EMAIL PROTECTED]> wrote:
> Aaron Kimball wrote:
>> (Note: this is a tasktracker setting, not a job setting. you'll need to
>> set this on every
>> node, then restart the mapreduce cluster to take effect.)
> Ok. And here is my mistake. I set this to 16 only on the main node not also
> on data nodes. Thanks a lot!!!!!!
> Of course, you need to have enough RAM to make sure that all these tasks
>> run concurrently without swapping.
> No problem!
> If your individual records require around a minute each to process as you
>> claimed earlier, you're
>> nowhere near in danger of hitting that particular performance bottleneck.
> I was thinking that is I am under the recommended value of 64MB, Hadoop
> cannot properly calculate the number of tasks.