On 3 September 2012 15:19, Abhay Ratnaparkhi <[EMAIL PROTECTED]> wrote:
> How can one get to know the nodes on which reduce tasks will run?
> One of my jobs is running and it's completing all the map tasks.
> My map tasks write lots of intermediate data. The intermediate directory
> is getting full on all the nodes.
> If a reduce task is placed on any node in the cluster, it'll try to copy
> the data to the same disk and it'll eventually fail due to disk space related
You could always set up dedicated partitions for intermediate data, though
you get better bandwidth by striping the data across all disks, and better
flexibility by sharing the same partition with DFS storage.
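
For illustration, a rough sketch of where that is configured, assuming a
Hadoop 1.x cluster, where intermediate map output goes to the directories
listed in mapred.local.dir (set in mapred-site.xml). The mount points below
are made up; substitute your own. Listing one directory per physical disk is
what spreads the I/O across disks:

```xml
<!-- mapred-site.xml on each tasktracker node (sketch, assuming Hadoop 1.x) -->
<property>
  <name>mapred.local.dir</name>
  <!-- Hypothetical mount points: one directory per physical disk.
       Comma-separated, no spaces. -->
  <value>/data/1/mapred/local,/data/2/mapred/local,/data/3/mapred/local</value>
</property>
```

To use a dedicated partition instead, you would point this at a directory on
that partition only, at the cost of the bandwidth mentioned above.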
There's also a property that limits how much of each disk DFS storage may
use: raise dfs.datanode.du.reserved (the number of bytes per volume kept
free for non-DFS use) and the datanodes will leave more free space around.
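
As a sketch, assuming you want each datanode to keep about 50 GB free on
every volume (the figure is only an example, pick one that fits your
intermediate data), the hdfs-site.xml entry would look something like this.
The value is in bytes and applies per volume:

```xml
<!-- hdfs-site.xml on each datanode (sketch; the 50 GB figure is an example) -->
<property>
  <name>dfs.datanode.du.reserved</name>
  <!-- 50 * 1024^3 bytes = 53687091200, reserved on every volume -->
  <value>53687091200</value>
</property>
```

The datanodes will most likely need a restart to pick up the new value, and
it only stops them writing new blocks once the limit is reached; it does not
move existing blocks off a disk that is already full.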