John Buchanan 2011-02-15, 14:51
On Feb 15, 2011, at 6:51 AM, John Buchanan
<[EMAIL PROTECTED]> wrote:
> I wonder if you could discuss more what you meant by (or what you use)
> configuration management? Doing some initial research I'm finding quite a
> few options for centralized configuration management, both open source and
> commercial. Would love to hear what others are using.
My site uses Puppet relatively successfully. There are other options
too which work equally well. The main part is keeping configs in
version control, doing configuration peer reviews just like "real code",
and having something that automatically pushes changes out.
> On 2/8/11 11:25 AM, "Allen Wittenauer" <[EMAIL PROTECTED]> wrote:
>> On Feb 8, 2011, at 7:20 AM, John Buchanan wrote:
>>> What we were thinking for our first deployment was 10 HP DL385's, each
>>> with 8 2TB SATA drives. First pair in RAID1 for the system drive, the
>>> remaining each containing a distinct partition and mount point, then
>>> specified in hdfs-site.xml in comma-delimited fashion. Seems to make
>>> sense to use RAID at least for the system drives so the loss of 1 drive
>>> won't take down the entire node. Granted data integrity wouldn't be
>>> affected but how much time do you want to spend rebuilding an entire
>>> node due to the loss of one drive? Considered using a smaller pair for the
>>> system drives but if they're all the same then we only need to stock one
>>> type of spare drive.
>> Don't bother RAID'ing the system drive. Seriously. You're giving up
>> performance for something that rarely happens. If you have decent
>> configuration management, rebuilding a node is not a big deal and doesn't
>> take that long anyway.
>> Besides, losing one of the JBOD disks will likely bring the node down.
>>> Another question I have is whether using 1TB drives would be advisable
>>> over 2TB for the purpose of reducing rebuild time.
>> You're overthinking the rebuild time. Again, configuration
>> management makes this a non-issue.
>>> Or perhaps I'm still
>>> thinking of this as I would a RAID volume. If we needed to rebalance
>>> across the cluster would the time needed be more dependent on the amount
>>> of data involved and the connectivity between nodes?
>> When a node goes down, its blocks are automatically re-replicated from
>> the remaining copies and its tasks are rescheduled on other nodes. So a
>> node can be down for as long as it needs to be down. The grid will
>> still be functional. So don't panic if a compute node goes down. :)
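
For reference, the comma-delimited data directory list mentioned earlier in
the thread could be generated along these lines. This is only a sketch under
stated assumptions: the /data/N mount points and the hdfs/data subdirectory
are made-up names, and the property is assumed to be dfs.data.dir, the name
used by Hadoop releases of that era (later releases call it
dfs.datanode.data.dir). The six entries assume 8 drives per node with the
first two set aside for the system.

```python
def data_dir_value(mount_points, subdir="hdfs/data"):
    """Build a comma-delimited value with one data directory per disk.

    `subdir` is a hypothetical directory name, not something Hadoop requires.
    """
    return ",".join(f"{mp.rstrip('/')}/{subdir}" for mp in mount_points)


def property_xml(name, value):
    """Render a single <property> element for hdfs-site.xml."""
    return (
        "<property>\n"
        f"  <name>{name}</name>\n"
        f"  <value>{value}</value>\n"
        "</property>"
    )


# Hypothetical mount points: one per data disk (8 drives minus 2 system drives).
mounts = [f"/data/{i}" for i in range(1, 7)]
value = data_dir_value(mounts)

# "dfs.data.dir" is assumed here; check the property name for your release.
print(property_xml("dfs.data.dir", value))
```

Generating the value from a list of mount points, rather than typing it by
hand, fits the configuration management approach discussed above: the list
lives in version control and the same template is pushed to every node.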