HDFS >> mail # user >> HDFS drive, partition best practice


John Buchanan 2011-02-07, 20:25
Jonathan Disher 2011-02-07, 22:06
Scott Golby 2011-02-07, 22:40
John Buchanan 2011-02-08, 15:20
Re: HDFS drive, partition best practice

On Feb 8, 2011, at 7:20 AM, John Buchanan wrote:
> What we were thinking for our first deployment was 10 HP DL385's each with
> 8 2TB SATA drives.  First pair in Raid1 for the system drive, the
> remaining each containing a distinct partition and mount point, then
> specified in hdfs-site.xml in comma-delimited fashion.  Seems to make more
> sense to use Raid at least for the system drives so the loss of 1 drive
> won't take down the entire node.  Granted data integrity wouldn't be
> affected but how much time do you want to spend rebuilding an entire node
> due to the loss of one drive?  Considered using a smaller pair for the
> system drives but if they're all the same then we only need to stock one
> type of spare drive.
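(For reference, the comma-delimited layout John describes would look roughly like this in hdfs-site.xml; the mount-point paths are illustrative, and the property was named dfs.data.dir in the Hadoop releases current at the time, later renamed dfs.datanode.data.dir:)

```xml
<property>
  <name>dfs.data.dir</name>
  <!-- one distinct partition/mount point per data disk, comma-delimited -->
  <value>/data/1/dfs/dn,/data/2/dfs/dn,/data/3/dfs/dn,/data/4/dfs/dn,/data/5/dfs/dn,/data/6/dfs/dn</value>
</property>
```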
Don't bother RAID'ing the system drive.  Seriously.  You're giving up performance for something that rarely happens.  If you have decent configuration management, rebuilding a node is not a big deal and doesn't take that long anyway.  

Besides, losing one of the JBOD disks will likely bring the node down anyway.
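(Later Hadoop releases added a knob for exactly this: dfs.datanode.failed.volumes.tolerated lets a DataNode keep running after a set number of data-volume failures; the default of 0 matches the one-disk-kills-the-node behavior described here. A sketch of the setting:)

```xml
<property>
  <name>dfs.datanode.failed.volumes.tolerated</name>
  <!-- keep serving with up to one failed data volume instead of shutting down -->
  <value>1</value>
</property>
```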

> Another question I have is whether using 1TB drives would be advisable
> over 2TB for the purpose of reducing rebuild time.  

You're overthinking the rebuild time.  Again, configuration management makes this a non-issue.

> Or perhaps I'm still
> thinking of this as I would a Raid volume.  If we needed to rebalance
> across the cluster would the time needed be more dependent on the amount
> of data involved and the connectivity between nodes?

Yes.

When a node goes down, the data and tasks are automatically moved.  So a node can be down for as long as it needs to be down.  The grid will still be functional.  So don't panic if a compute node goes down. :)
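(A rough back-of-envelope for the "amount of data and connectivity" point: this sketch treats re-replication as a single network-bound pipe. Real clusters re-replicate in parallel from many source nodes and throttle via the balancer bandwidth setting, so actual times will differ.)

```python
# Back-of-envelope: hours to re-replicate a dead node's block data,
# assuming the transfer is network-bound through one aggregate pipe.
def rereplication_hours(data_tb, gbit_per_sec):
    data_bits = data_tb * 8 * 1000**4      # TB -> bits (decimal units)
    seconds = data_bits / (gbit_per_sec * 1e9)
    return seconds / 3600

# A node holding 12 TB of blocks, 1 Gbit/s of aggregate cluster bandwidth:
print(round(rereplication_hours(12, 1), 1))  # about 26.7 hours
```

The real takeaway matches the answer above: the clock is driven by how much data lived on the node and how fast the surviving nodes can move it, not by any per-drive rebuild.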
Adam Phelps 2011-02-08, 19:33
Allen Wittenauer 2011-02-08, 20:09
Patrick Angeles 2011-02-08, 20:17
Patrick Angeles 2011-02-08, 20:22
Allen Wittenauer 2011-02-08, 20:43
Mag Gam 2011-02-22, 12:34
Patrick Angeles 2011-02-08, 19:53
Bharath Mundlapudi 2011-02-08, 19:10