|
|
-
1Gb vs 10Gb eth (WAS: Re: Hadoop Java Versions)
Evert Lammerts 2011-07-01, 06:41
>> Keeping the amount of disks per node low and the amount of nodes high >> should keep the impact of dead nodes in control. > > It keeps the impact of dead nodes in control but I don't think thats > long-term cost efficient. As prices of 10GbE go down, the "keep the node > small" arguement seems less fitting. And on another note, most servers > manufactured in the last 10 years have dual 1GbE network interfaces. If one > were to go by these calcs: > > >150 nodes with four 2TB disks each, with HDFS 60% full, it takes around ~32 > minutes to recover > > It seems like that assumes a single 1GbE interface, why not leverage the > second?
I don't know how others setup up their clusters. We have the tradition that every node in a cluster has at least three interfaces - one for interconnects, one for a management network (only reachable from within our own network and the primary interface for our admins, accessible only through a single management node) and one for ILOM, DRAC or whatever lights out manager is available. This doesn't leave us room for bonding interfaces on off the shelf nodes. Plus - you'd need twice as many ports in your switch.
In the case of Hadoop we're considering adding a fourth NIC for external connectivity. We don't want people interacting with HDFS from outside while jobs are using the interconnects.
Of course the choice for 1 or 10Gb ethernet is a function of price. As 10Gb ethernet prices are approaching that of 1Gb ethernet it gets more attractive. The recovery times scale linearly with ethernet speed, so 1Gb ethernet compared to 2Gb bonded ethernet or 10Gb ethernet makes quite a difference. I'm just saying that since we have other variables to tweak - amount of disks and number of nodes - we can limit the impact of minimizing recovery times.
Another thing to consider is that as 10Gb ethernet gets cheaper, it gets more attractive to stop using HDFS (or at least, data locality) and start using an external storage cluster. Compute node failure then has no impact, disk failure is hardly noticed by compute nodes. But this is really still very far from as cheap as many small nodes with relatively little disks - I really like the bang for buck you get with Hadoop :-)
Evert On Thu, Jun 30, 2011 at 2:31 PM, Evert Lammerts <[EMAIL PROTECTED]>wrote:
> > You can get 12-24 TB in a server today, which means the loss of a server > > generates a lot of traffic -which argues for 10 Gbe. > > > > But > > -big increase in switch cost, especially if you (CoI warning) go with > > Cisco > > -there have been problems with things like BIOS PXE and lights out > > management on 10 Gbe -probably due to the NICs being things the BIOS > > wasn't expecting and off the mainboard. This should improve. > > -I don't know how well linux works with ether that fast (field reports > > useful) > > -the big threat is still ToR switch failure, as that will trigger a > > re-replication of every block in the rack. > > Keeping the amount of disks per node low and the amount of nodes high > should keep the impact of dead nodes in control. A ToR switch failing is > different - missing 30 nodes (~120TB) at once cannot be fixed by adding more > nodes; that actually increases ToR switch failure. Although such failure is > quite rare to begin with, I guess. The back-of-the-envelope-calculation I > made suggests that ~150 (1U) nodes should be fine with 1Gb ethernet. (e.g., > when 6 nodes fail in a cluster with 150 nodes with four 2TB disks each, with > HDFS 60% full, it takes around ~32 minutes to recover. 2 nodes failing > should take around 640 seconds. Also see the attached spreadsheet.) This > doesn't take ToR switch failure in account though. On the other hand - 150 > nodes is only ~5 racks - in such a scenario you might rather want to shut > the system down completely rather than letting it replicate 20% of all data. > > Cheers, > Evert
-
Re: 1Gb vs 10Gb eth (WAS: Re: Hadoop Java Versions)
Ryan Rawson 2011-07-01, 07:16
What's the justification for a management interface? Doesn't that increase complexity? Also you still twice the ports? On Jun 30, 2011 11:54 PM, "Evert Lammerts" <[EMAIL PROTECTED]> wrote: >>> Keeping the amount of disks per node low and the amount of nodes high >>> should keep the impact of dead nodes in control. >> >> It keeps the impact of dead nodes in control but I don't think thats >> long-term cost efficient. As prices of 10GbE go down, the "keep the node >> small" arguement seems less fitting. And on another note, most servers >> manufactured in the last 10 years have dual 1GbE network interfaces. If one >> were to go by these calcs: >> >> >150 nodes with four 2TB disks each, with HDFS 60% full, it takes around ~32 >> minutes to recover >> >> It seems like that assumes a single 1GbE interface, why not leverage the >> second? > > I don't know how others setup up their clusters. We have the tradition that every node in a cluster has at least three interfaces - one for interconnects, one for a management network (only reachable from within our own network and the primary interface for our admins, accessible only through a single management node) and one for ILOM, DRAC or whatever lights out manager is available. This doesn't leave us room for bonding interfaces on off the shelf nodes. Plus - you'd need twice as many ports in your switch. > > In the case of Hadoop we're considering adding a fourth NIC for external connectivity. We don't want people interacting with HDFS from outside while jobs are using the interconnects. > > Of course the choice for 1 or 10Gb ethernet is a function of price. As 10Gb ethernet prices are approaching that of 1Gb ethernet it gets more attractive. The recovery times scale linearly with ethernet speed, so 1Gb ethernet compared to 2Gb bonded ethernet or 10Gb ethernet makes quite a difference. I'm just saying that since we have other variables to tweak - amount of disks and number of nodes - we can limit the impact of minimizing recovery times. > > Another thing to consider is that as 10Gb ethernet gets cheaper, it gets more attractive to stop using HDFS (or at least, data locality) and start using an external storage cluster. Compute node failure then has no impact, disk failure is hardly noticed by compute nodes. But this is really still very far from as cheap as many small nodes with relatively little disks - I really like the bang for buck you get with Hadoop :-) > > Evert > > > On Thu, Jun 30, 2011 at 2:31 PM, Evert Lammerts <[EMAIL PROTECTED] >wrote: > >> > You can get 12-24 TB in a server today, which means the loss of a server >> > generates a lot of traffic -which argues for 10 Gbe. >> > >> > But >> > -big increase in switch cost, especially if you (CoI warning) go with >> > Cisco >> > -there have been problems with things like BIOS PXE and lights out >> > management on 10 Gbe -probably due to the NICs being things the BIOS >> > wasn't expecting and off the mainboard. This should improve. >> > -I don't know how well linux works with ether that fast (field reports >> > useful) >> > -the big threat is still ToR switch failure, as that will trigger a >> > re-replication of every block in the rack. >> >> Keeping the amount of disks per node low and the amount of nodes high >> should keep the impact of dead nodes in control. A ToR switch failing is >> different - missing 30 nodes (~120TB) at once cannot be fixed by adding more >> nodes; that actually increases ToR switch failure. Although such failure is >> quite rare to begin with, I guess. The back-of-the-envelope-calculation I >> made suggests that ~150 (1U) nodes should be fine with 1Gb ethernet. (e.g., >> when 6 nodes fail in a cluster with 150 nodes with four 2TB disks each, with >> HDFS 60% full, it takes around ~32 minutes to recover. 2 nodes failing >> should take around 640 seconds. Also see the attached spreadsheet.) This >> doesn't take ToR switch failure in account though. On the other hand - 150 >> nodes is only ~5 racks - in such a scenario you might rather want to shut data.
-
RE: 1Gb vs 10Gb eth (WAS: Re: Hadoop Java Versions)
Evert Lammerts 2011-07-01, 08:17
> What's the justification for a management interface? Doesn't that increase > complexity? Also you still twice the ports?
That's true. It's a tradition that I haven't questioned before. The reasoning, whether right or wrong, is that user jobs (our users are external) can get in the way of admins. If there's a lot of network traffic it takes a lot longer to re-install a node. It also makes security easier - we have a seperate SSH deamon listening on the management net accepting root logins - something the default deamon does not do. Of course the same can be done by running a seperate deamon on the public interface, on a port that is only accessible from our internal network.
Evert
On Jun 30, 2011 11:54 PM, "Evert Lammerts" <[EMAIL PROTECTED]> wrote: >>> Keeping the amount of disks per node low and the amount of nodes high >>> should keep the impact of dead nodes in control. >> >> It keeps the impact of dead nodes in control but I don't think thats >> long-term cost efficient. As prices of 10GbE go down, the "keep the node >> small" arguement seems less fitting. And on another note, most servers >> manufactured in the last 10 years have dual 1GbE network interfaces. If one >> were to go by these calcs: >> >> >150 nodes with four 2TB disks each, with HDFS 60% full, it takes around ~32 >> minutes to recover >> >> It seems like that assumes a single 1GbE interface, why not leverage the >> second? > > I don't know how others setup up their clusters. We have the tradition that every node in a cluster has at least three interfaces - one for interconnects, one for a management network (only reachable from within our own network and the primary interface for our admins, accessible only through a single management node) and one for ILOM, DRAC or whatever lights out manager is available. This doesn't leave us room for bonding interfaces on off the shelf nodes. Plus - you'd need twice as many ports in your switch. > > In the case of Hadoop we're considering adding a fourth NIC for external connectivity. We don't want people interacting with HDFS from outside while jobs are using the interconnects. > > Of course the choice for 1 or 10Gb ethernet is a function of price. As 10Gb ethernet prices are approaching that of 1Gb ethernet it gets more attractive. The recovery times scale linearly with ethernet speed, so 1Gb ethernet compared to 2Gb bonded ethernet or 10Gb ethernet makes quite a difference. I'm just saying that since we have other variables to tweak - amount of disks and number of nodes - we can limit the impact of minimizing recovery times. > > Another thing to consider is that as 10Gb ethernet gets cheaper, it gets more attractive to stop using HDFS (or at least, data locality) and start using an external storage cluster. Compute node failure then has no impact, disk failure is hardly noticed by compute nodes. But this is really still very far from as cheap as many small nodes with relatively little disks - I really like the bang for buck you get with Hadoop :-) > > Evert > > > On Thu, Jun 30, 2011 at 2:31 PM, Evert Lammerts <[EMAIL PROTECTED] >wrote: > >> > You can get 12-24 TB in a server today, which means the loss of a server >> > generates a lot of traffic -which argues for 10 Gbe. >> > >> > But >> > -big increase in switch cost, especially if you (CoI warning) go with >> > Cisco >> > -there have been problems with things like BIOS PXE and lights out >> > management on 10 Gbe -probably due to the NICs being things the BIOS >> > wasn't expecting and off the mainboard. This should improve. >> > -I don't know how well linux works with ether that fast (field reports >> > useful) >> > -the big threat is still ToR switch failure, as that will trigger a >> > re-replication of every block in the rack. >> >> Keeping the amount of disks per node low and the amount of nodes high >> should keep the impact of dead nodes in control. A ToR switch failing is >> different - missing 30 nodes (~120TB) at once cannot be fixed by adding more is (e.g., with 150 data.
-
Re: 1Gb vs 10Gb eth (WAS: Re: Hadoop Java Versions)
Steve Loughran 2011-07-01, 15:43
On 01/07/2011 07:41, Evert Lammerts wrote: >>> Keeping the amount of disks per node low and the amount of nodes high >>> should keep the impact of dead nodes in control. >> >> It keeps the impact of dead nodes in control but I don't think thats >> long-term cost efficient. As prices of 10GbE go down, the "keep the node >> small" arguement seems less fitting. And on another note, most servers >> manufactured in the last 10 years have dual 1GbE network interfaces. If one >> were to go by these calcs: >> >>> 150 nodes with four 2TB disks each, with HDFS 60% full, it takes around ~32 >> minutes to recover >> >> It seems like that assumes a single 1GbE interface, why not leverage the >> second? > > I don't know how others setup up their clusters. We have the tradition that every node in a cluster has at least three interfaces - one for interconnects, one for a management network (only reachable from within our own network and the primary interface for our admins, accessible only through a single management node) and one for ILOM, DRAC or whatever lights out manager is available. This doesn't leave us room for bonding interfaces on off the shelf nodes. Plus - you'd need twice as many ports in your switch.
Yes, I didn't get into ILO. That can be 100Mbps. > > In the case of Hadoop we're considering adding a fourth NIC for external connectivity. We don't want people interacting with HDFS from outside while jobs are using the interconnects. > > Of course the choice for 1 or 10Gb ethernet is a function of price. As 10Gb ethernet prices are approaching that of 1Gb ethernet it gets more attractive. The recovery times scale linearly with ethernet speed, so 1Gb ethernet compared to 2Gb bonded ethernet or 10Gb ethernet makes quite a difference. I'm just saying that since we have other variables to tweak - amount of disks and number of nodes - we can limit the impact of minimizing recovery times.
2Gb bonded with 2x ToR removes a single ToR switch as an SPOF for the rack but increases install/debugging costs. More wires, and your network configuration topology has got worse.
> > Another thing to consider is that as 10Gb ethernet gets cheaper, it gets more attractive to stop using HDFS (or at least, data locality) and start using an external storage cluster. Compute node failure then has no impact, disk failure is hardly noticed by compute nodes. But this is really still very far from as cheap as many small nodes with relatively little disks - I really like the bang for buck you get with Hadoop :-)
Yeah, but you are a computing facility whose backbone gets data off the LHC tier 1 sites abroad faster than the seek time of the disks in your neighbouring building...
-
Re: 1Gb vs 10Gb eth (WAS: Re: Hadoop Java Versions)
Steve Loughran 2011-07-01, 15:47
On 01/07/2011 08:16, Ryan Rawson wrote: > What's the justification for a management interface? Doesn't that increase > complexity? Also you still twice the ports?
ILO reduces ops complexity. you can push things like BIOS updates out, boot machines into known states, instead of the slowly diverging world that RPM or debian updates get you into (the final state depends on the order, and different machines end up applying them in a different order), and for diagnosing problems when even the root disk doesn't want to come out and play.
In a big cluster you need to worry about things like not powering on a quadrant of the site simultaneously as boot-time can be a peak power surge; you may want to bring up slices of every rack gradually, upgrade the BIOS and OS, and gradually ramp things up. This is where HPC admin tools differ from classic "lock down an end user windows PC" tooling.
-
Re: 1Gb vs 10Gb eth (WAS: Re: Hadoop Java Versions)
Ted Dunning 2011-07-01, 17:14
You are twice the ports, but if you know that the management port is not used for serious data, then you can put a SOHO grade switch on those ports at negligible cost.
There is a serious conflict of goals here if you have software that can make serious use more than one NIC. On the one hand, it is nice to use the hardware you have. On the other, it is really nice to have guaranteed bandwidth for management.
With traditional switch level link aggregation this is usually not a problem since each flow is committed to one NIC or the other, resulting in poor balancing. The silver lining of the poor balancing is that there is always plenty of bandwidth for administration function.
Though it may be slightly controversial to mention, the way that MapRdoes application level NIC bonding could cause some backlash from ops staff because it can actually saturate as many NIC's as you make available. Generally, management doesn't require much bandwidth and doing things like reloading the BIOS is usually done when a machine is out of service for maintenance, but the potential for surprise is there.
I have definitely seen a number of conditions where service access is a complete god-send. With geographically dispersed data centers, it is an absolute requirement because you just can't staff every data center with enough hands-on admins within a few minutes travel time of the data center. ILO (or DRAC as Dell calls it) gives you 5 minute response time to fixing totally horked machines.
On Fri, Jul 1, 2011 at 8:47 AM, Steve Loughran <[EMAIL PROTECTED]> wrote:
> On 01/07/2011 08:16, Ryan Rawson wrote: > >> What's the justification for a management interface? Doesn't that increase >> complexity? Also you still twice the ports? >> > > ILO reduces ops complexity. you can push things like BIOS updates out, boot > machines into known states, instead of the slowly diverging world that RPM > or debian updates get you into (the final state depends on the order, and > different machines end up applying them in a different order), and for > diagnosing problems when even the root disk doesn't want to come out and > play. > > In a big cluster you need to worry about things like not powering on a > quadrant of the site simultaneously as boot-time can be a peak power surge; > you may want to bring up slices of every rack gradually, upgrade the BIOS > and OS, and gradually ramp things up. This is where HPC admin tools differ > from classic "lock down an end user windows PC" tooling. > >
|
|