Tonci Buljan 2010-07-09, 09:32
Hello everyone,
I have a cluster of 8 datanodes and a namenode. When I start the teragen program everything works OK and the data is generated. But when I start the terasort program, it seems that only 2 datanodes do the work, and everything is very slow. I've tried with only 10 records and the cluster finished the sort in a few seconds, but with a bigger number it gets stuck at around 15%.
Do you have any idea why this is so?
I'm using Hadoop 0.20.2 and Ubuntu 8.10. Thank you.
Owen O'Malley 2010-07-09, 13:33
I would guess that you didn't set the number of reducers for the job, and it defaulted to 2.
-- Owen
Tonci Buljan 2010-07-10, 11:29
Thank you for your response, Owen. It is true, I hadn't done that; I figured it out a few hours after posting here.
I'm having trouble understanding these variables:
mapred.tasktracker.reduce.tasks.maximum <- Is this configured on every datanode separately? What number shall I put here?
mapred.tasktracker.map.tasks.maximum <- same question as mapred.tasktracker.reduce.tasks.maximum
mapred.reduce.tasks <- Is this configured ONLY on Namenode and what value should it have for my 8 node cluster?
mapred.map.tasks <- same question as mapred.reduce.tasks
I've tried playing with these variables but I'm getting the error: "Too many fetch-failures..."
Please, does anyone have any idea how to set this up the right way?
Thank you.
On 9 July 2010 15:33, Owen O'Malley <[EMAIL PROTECTED]> wrote:
> I would guess that you didn't set the number of reducers for the job,
> and it defaulted to 2.
>
> -- Owen
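mapred.tasktracker.reduce.tasks.maximum and mapred.tasktracker.map.tasks.maximum are configured in mapred-site.xml. Each TaskTracker reads them from its own local copy of that file when it starts, so they are per-node settings, although most clusters use the same values everywhere.
Hadoop does not sync configuration from the name node to the data nodes on startup. You need to copy mapred-site.xml to every node yourself and restart the TaskTrackers for a change to take effect.
"Too many fetch-failures..." has come up in previous discussions, but I don't see a definitive cause in them.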
Owen O'Malley 2010-07-11, 17:39
On Jul 10, 2010, at 4:29 AM, Tonci Buljan wrote:
> mapred.tasktracker.reduce.tasks.maximum <- Is this configured on every
> datanode separately? What number shall I put here?
>
> mapred.tasktracker.map.tasks.maximum <- same question as
> mapred.tasktracker.reduce.tasks.maximum
Generally, RAM is the scarce resource. Decide how you want to divide each worker's RAM between tasks. With 6 GB of RAM, I'd probably make 4 map slots of 0.75 GB each and 2 reduce slots of 1.5 GB each.
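The slot arithmetic above can be checked with a few lines of Python. This is only a sketch: the function name and its parameters are made up for illustration and are not part of Hadoop. It adds up the RAM claimed by the task slots on one worker.

```python
def slot_memory_gb(map_slots, map_gb, reduce_slots, reduce_gb):
    """Total RAM (in GB) claimed by all task slots on a single worker."""
    return map_slots * map_gb + reduce_slots * reduce_gb


# Figures from the post: 4 map slots of 0.75 GB, 2 reduce slots of 1.5 GB.
total = slot_memory_gb(map_slots=4, map_gb=0.75, reduce_slots=2, reduce_gb=1.5)
print(total)  # 6.0
```

Note that this plan accounts for the full 6 GB, so on a real worker you may want to scale the slots down a little to leave room for the operating system and the DataNode and TaskTracker daemons.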
> mapred.reduce.tasks <- Is this configured ONLY on Namenode and what value
> should it have for my 8 node cluster?
You should set it to your reduce task capacity of 2 * 8 = 16.
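As a rough illustration of that calculation, the sketch below multiplies slots per node by node count and prints the result in the `<property>` layout that mapred-site.xml uses. The helper names are hypothetical; only the property name and the numbers come from this thread.

```python
from xml.sax.saxutils import escape


def cluster_capacity(nodes, slots_per_node):
    """Number of tasks the whole cluster can run at once."""
    return nodes * slots_per_node


def property_xml(name, value):
    """Render one Hadoop-style <property> element as text."""
    return (
        "<property>\n"
        f"  <name>{escape(name)}</name>\n"
        f"  <value>{escape(str(value))}</value>\n"
        "</property>"
    )


# 8 workers with 2 reduce slots each, as in the post.
reduce_tasks = cluster_capacity(nodes=8, slots_per_node=2)
print(property_xml("mapred.reduce.tasks", reduce_tasks))
```

The same value can also be supplied per job instead of in the site file, which keeps the cluster default untouched for other jobs.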
> mapred.map.tasks <- same question as mapred.reduce.tasks
It matters less, but go ahead and set it to the map capacity of 4 * 8 = 32. More important is to set the JVM heap and buffer sizes for the tasks. You also want to set your HDFS block size to somewhere between 0.5 GB and 2 GB. That will make your map inputs the right size.
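To see why the block size matters for map inputs, here is a small sketch that estimates the number of map tasks for a given input. It assumes one map task per HDFS block, which is only an approximation of how file-based input formats compute splits. The 100 GB input size is a made-up example, and 64 MB is the default dfs.block.size in Hadoop 0.20.

```python
MB = 1024 ** 2
GB = 1024 ** 3


def estimated_map_tasks(input_bytes, block_bytes):
    """Approximate map task count, assuming one task per HDFS block."""
    return (input_bytes + block_bytes - 1) // block_bytes  # integer ceiling


input_bytes = 100 * GB  # hypothetical data set size

for label, block_bytes in [("64 MB", 64 * MB), ("0.5 GB", GB // 2),
                           ("1 GB", GB), ("2 GB", 2 * GB)]:
    print(label, estimated_map_tasks(input_bytes, block_bytes))
```

Under these assumptions the default block size yields 1600 small map tasks, while blocks of 0.5 GB to 2 GB yield between 200 and 50 larger ones.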
-- Owen