balancing and replication in HDFS (HDFS user mailing list)


Thread:
  Jeffrey Buell   2011-02-25, 22:52
  Todd Lipcon     2011-02-25, 23:12
  Jeffrey Buell   2011-02-26, 00:45
  Todd Lipcon     2011-02-26, 04:13

RE: balancing and replication in HDFS
Todd,

Much better.  I also had to adjust the number of map tasks for the gen step.  Storage was spread across 3 nodes instead of 4, but a bit more playing with these parameters should do the trick.
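
For reference, the gen step now looks something like this (aiming for one map per node; the exact count still needs tuning):

  hadoop jar /usr/lib/hadoop/hadoop-*-examples.jar teragen \
      -Dmapred.map.tasks=4 50000000 tera_in5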

Thanks for your help.

Jeff

From: Todd Lipcon [mailto:[EMAIL PROTECTED]]
Sent: Friday, February 25, 2011 8:09 PM
To: [EMAIL PROTECTED]
Cc: Jeffrey Buell
Subject: Re: balancing and replication in HDFS

When you run terasort, pass -Dmapred.reduce.tasks=4 and see how that goes for you. See this old thread for info:
http://mail-archives.apache.org/mod_mbox/hadoop-common-user/200906.mbox/%[EMAIL PROTECTED]%3E
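
Concretely, something like this should do it (the generic -D option has to come before the positional arguments, and the output directory must not already exist):

  hadoop jar /usr/lib/hadoop/hadoop-*-examples.jar terasort \
      -Dmapred.reduce.tasks=4 tera_in5 tera_out5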

-Todd
On Fri, Feb 25, 2011 at 4:45 PM, Jeffrey Buell <[EMAIL PROTECTED]> wrote:
Hi Todd,

Thanks for the quick response.

I added <final>true</final> to dfs.replication, but I still get just one output copy.  Can hadoop apps override the replication level even with this parameter?
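
(One way to see the replication directly, rather than inferring it from disk usage:

  hadoop fsck /user/hadoop/tera_out5 -files -blocks | head

which prints a repl= field for each block, and it's 1 for all the output blocks.  I also wonder whether the job sets replication explicitly through the FileSystem API when creating its output files, which a final config entry presumably couldn't override.)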

I tried increasing mapred.tasktracker.reduce.tasks.maximum from 1 to 4, but that didn't make any difference:  output is still all on one node.  I thought that parameter controls the number of reduce tasks per node, so 1 should be sufficient; is that logic correct?  The other reduce parameters are at their defaults.  Which parameters should I be trying?

There's no way the sort can run efficiently if the gen step is unbalanced.  At 5 GB, there are plenty of blocks to make the distribution even.  How does HDFS decide how to spread the storage across the nodes?  Note that the sizes of the 4 chunks (1.7,...,3.2 GB) seem to be repeatable, but they land on different nodes each time teragen is run.
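
I'm aware the rebalancer can even things out after the fact:

  hadoop balancer -threshold 10

(the threshold being the allowed percent deviation from mean utilization), but that doesn't explain the initial placement.  My guess is that the first replica of each block lands on the node running the writer, i.e. wherever the teragen map tasks happen to be scheduled, with further replicas placed on randomly chosen nodes, but I'd appreciate confirmation.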

Jeff

> -----Original Message-----
> From: Todd Lipcon [mailto:[EMAIL PROTECTED]]
> Sent: Friday, February 25, 2011 3:13 PM
> To: [EMAIL PROTECTED]
> Subject: Re: balancing and replication in HDFS
>
> Hi Jeff,
> The output of terasort has replication level 1 by default. This is so
> it goes faster with the default settings and makes for more impressive
> benchmark results :)
> The reason you see it all on one machine is probably that you're
> running with one reducer. Try configuring your terasort to use more
> reduce tasks and you should see the load (and space usage) even out.
> -Todd
>
> On Fri, Feb 25, 2011 at 2:52 PM, Jeffrey Buell <[EMAIL PROTECTED]>
> wrote:
> >
> > I'm a newbie to hadoop and HDFS.  I'm seeing odd behavior in HDFS
> that I hope somebody can clear up for me.  I'm running hadoop version
> 0.20.1+169.127 from the cloudera distro on 4 identical nodes, each with
> 4 cpus and 100GB disk space.  Replication is set to 2.
> >
> > I run:
> >
> > hadoop jar /usr/lib/hadoop/hadoop-*-examples.jar teragen 50000000
> tera_in5
> >
> > This produces the expected 10GB of data on disk (5GB * 2 copies).
>  But the data is spread very unevenly across the nodes, ranging from
> 1.7 to 3.2 GB on each node.  Then I sort the data:
> >
> > hadoop jar /usr/lib/hadoop/hadoop-*-examples.jar terasort tera_in5
> tera_out5
> >
> > It finishes successfully, and HDFS recognizes the right amount of
> data:
> >
> > $ hadoop fs -du /user/hadoop/
> > Found 2 items
> > 5000023410  hdfs://namd-1/user/hadoop/tera_in5
> > 5000170993  hdfs://namd-1/user/hadoop/tera_out5
> >
> > However all the new data is on one node (apparently randomly chosen),
> and the total disk usage is only 15GB, which means that the output data
> is not replicated.  For nearly all the elapsed time of the sort, the
> other 3 nodes are idle.  Some of the output data is in
> dfs/data/current, but a lot is in one of 64 new subdirs
> (dfs/data/current/subdir0 through subdir63).
> >
> > Why is all this happening?  Am I missing some tunables that make HDFS
> do the right balance and replication?
> >
> > Thanks,
> >
> > Jeff
>
>
>
> --
> Todd Lipcon
> Software Engineer, Cloudera

--
Todd Lipcon
Software Engineer, Cloudera