|
Allen Wittenauer
2011-06-15, 00:16
Scott Carey
2011-06-22, 20:27
Allen Wittenauer
2011-06-22, 20:49
Scott Carey
2011-06-23, 02:42
Steve Loughran
2011-06-23, 12:49
Scott Carey
2011-06-26, 19:23
Steve Loughran
2011-06-27, 11:38
Ryan Rawson
2011-06-28, 00:10
Ted Dunning
2011-06-28, 00:12
Segel, Mike
2011-06-28, 01:54
Ryan Rawson
2011-06-28, 02:33
Segel, Mike
2011-06-28, 03:49
Steve Loughran
2011-06-28, 09:59
Michel Segel
2011-06-28, 12:27
Arun C Murthy
2011-06-28, 17:25
Evert Lammerts
2011-06-30, 21:31
Evert Lammerts
2011-06-30, 21:37
Aaron Eng
2011-06-30, 23:18
Ted Dunning
2011-07-01, 00:16
Todd Lipcon
2011-07-01, 00:24
Ted Dunning
2011-07-01, 01:12
M. C. Srivas
2011-07-01, 04:08
Ian Holsman
2011-07-01, 04:47
M. C. Srivas
2011-07-01, 05:06
Ted Dunning
2011-07-01, 05:09
Scott Carey
2011-07-01, 17:22
Abhishek Mehta
2011-07-01, 18:53
Ted Dunning
2011-07-01, 21:38
Ian Holsman
2011-07-03, 04:16
Eric Baldeschwieler
2011-07-13, 14:59
Ted Dunning
2011-07-13, 15:48
Steve Loughran
2011-07-01, 16:12
Ted Dunning
2011-07-01, 18:09
Scott Carey
2011-06-28, 02:58
Evert Lammerts
2011-06-30, 21:40
|
-
Hadoop Java VersionsAllen Wittenauer 2011-06-15, 00:16
While we're looking at the wiki, could folks update http://wiki.apache.org/hadoop/HadoopJavaVersions with whatever versions of Hadoop they are using successfully? Thanks. P.S., yes, I'm thinking about upgrading ours. :p +
Allen Wittenauer 2011-06-15, 00:16
-
Re: Hadoop Java VersionsScott Carey 2011-06-22, 20:27
"Problems have been reported with Hadoop, the 64-bit JVM and Compressed
Object References (the -XX:+UseCompressedOops option), so use of that option is discouraged." I think the above is dated. It also lacks critical information. What JVM and OS version was the problem seen? CompressedOops had several issues prior to Jre 6u20, and a few minor ones were fixed in u21. FWIW, I now exclusively use 64 bit w/ CompressedOops for all Hadoop and non-Hadoop apps and have seen no issues. It is the default in 6u24 and 6u25 on a 64 bit JVM. On 6/14/11 5:16 PM, "Allen Wittenauer" <[EMAIL PROTECTED]> wrote: > >While we're looking at the wiki, could folks update >http://wiki.apache.org/hadoop/HadoopJavaVersions with whatever versions >of Hadoop they are using successfully? > >Thanks. > >P.S., yes, I'm thinking about upgrading ours. :p +
Scott Carey 2011-06-22, 20:27
-
Re: Hadoop Java VersionsAllen Wittenauer 2011-06-22, 20:49
On Jun 22, 2011, at 1:27 PM, Scott Carey wrote: > "Problems have been reported with Hadoop, the 64-bit JVM and Compressed > Object References (the -XX:+UseCompressedOops option), so use of that > option is discouraged." > > I think the above is dated. It also lacks critical information. What JVM > and OS version was the problem seen? That is definitely dated. Those were problems from 18 and 19. > > CompressedOops had several issues prior to Jre 6u20, and a few minor ones > were fixed in u21. FWIW, I now exclusively use 64 bit w/ CompressedOops > for all Hadoop and non-Hadoop apps and have seen no issues. It is the > default in 6u24 and 6u25 on a 64 bit JVM. We use compressedoops on 21 every-so-often as well. Feel free to edit that whole section. :) Thanks! +
Allen Wittenauer 2011-06-22, 20:49
-
Re: Hadoop Java VersionsScott Carey 2011-06-23, 02:42
On 6/22/11 1:49 PM, "Allen Wittenauer" <[EMAIL PROTECTED]> wrote: > >On Jun 22, 2011, at 1:27 PM, Scott Carey wrote: > >> "Problems have been reported with Hadoop, the 64-bit JVM and Compressed >> Object References (the -XX:+UseCompressedOops option), so use of that >> option is discouraged." >> >> I think the above is dated. It also lacks critical information. What >>JVM >> and OS version was the problem seen? > >That is definitely dated. Those were problems from 18 and 19. > >> >> CompressedOops had several issues prior to Jre 6u20, and a few minor >>ones >> were fixed in u21. FWIW, I now exclusively use 64 bit w/ CompressedOops >> for all Hadoop and non-Hadoop apps and have seen no issues. It is the >> default in 6u24 and 6u25 on a 64 bit JVM. > > We use compressedoops on 21 every-so-often as well. > > Feel free to edit that whole section. :) > > Thanks! Updated the CompressedOops section. I am using 6u25 in production and have used 6u24 as well. (64 bit version, CentOS 5.5) We usually roll out JVM's to a subset of nodes, then after a little burn-in to the whole thing. First on development / qa clusters (which are just as busy as production, but smaller), then production. > +
Scott Carey 2011-06-23, 02:42
-
Re: Hadoop Java VersionsSteve Loughran 2011-06-23, 12:49
On 22/06/2011 21:27, Scott Carey wrote:
> "Problems have been reported with Hadoop, the 64-bit JVM and Compressed > Object References (the -XX:+UseCompressedOops option), so use of that > option is discouraged." > > I think the above is dated. It also lacks critical information. What JVM > and OS version was the problem seen? A colleague saw it, intermittent JVM crashes. Unless he's updated I can't say the problem has gone away. > > CompressedOops had several issues prior to Jre 6u20, and a few minor ones > were fixed in u21. FWIW, I now exclusively use 64 bit w/ CompressedOops > for all Hadoop and non-Hadoop apps and have seen no issues. It is the > default in 6u24 and 6u25 on a 64 bit JVM. what's your HW setup? #cores/server, #servers, underlying OS? +
Steve Loughran 2011-06-23, 12:49
-
Re: Hadoop Java VersionsScott Carey 2011-06-26, 19:23
On 6/23/11 5:49 AM, "Steve Loughran" <[EMAIL PROTECTED]> wrote: >On 22/06/2011 21:27, Scott Carey wrote: >> "Problems have been reported with Hadoop, the 64-bit JVM and Compressed >> Object References (the -XX:+UseCompressedOops option), so use of that >> option is discouraged." >> >> I think the above is dated. It also lacks critical information. What >>JVM >> and OS version was the problem seen? > >A colleague saw it, intermittent JVM crashes. Unless he's updated I >can't say the problem has gone away. > >> >> CompressedOops had several issues prior to Jre 6u20, and a few minor >>ones >> were fixed in u21. FWIW, I now exclusively use 64 bit w/ CompressedOops >> for all Hadoop and non-Hadoop apps and have seen no issues. It is the >> default in 6u24 and 6u25 on a 64 bit JVM. > > >what's your HW setup? #cores/server, #servers, underlying OS? CentOS 5.6. 4 cores / 8 threads a server (Nehalem generation Intel processor). Also run a smaller cluster with 2x quad core Core 2 generation Xeons. Off topic: The single proc Nehalem is faster than the dual core 2's for most use cases -- and much lower power. Looking forward to single proc 4 or 6 core Sandy Bridge based systems for the next expansion -- testing 4 core vs 4 core has these 30% faster than the Nehalem generation systems in CPU bound tasks and lower power. Intel prices single socket Xeons so much lower than the Dual socket ones that the best value for us is to get more single socket servers rather than fewer dual socket ones (with similar processor to hard drive ratio). We are power constrained per rack either way. +
Scott Carey 2011-06-26, 19:23
-
Re: Hadoop Java VersionsSteve Loughran 2011-06-27, 11:38
On 26/06/11 20:23, Scott Carey wrote:
> > > On 6/23/11 5:49 AM, "Steve Loughran"<[EMAIL PROTECTED]> wrote: > >> what's your HW setup? #cores/server, #servers, underlying OS? > > CentOS 5.6. > 4 cores / 8 threads a server (Nehalem generation Intel processor). that should be enough to find problems. I've just moved up to a 6-core 12 thread desktop and that found problems on some non-Hadoop code, which shows that the more threads you have, and the faster the machines are, the more your race conditions show up. With Hadoop the fact that you can have 10-1000 servers means that in a large cluster the probability of that race condition showing up scales well. > Also run a smaller cluster with 2x quad core Core 2 generation Xeons. > > Off topic: > The single proc Nehalem is faster than the dual core 2's for most use > cases -- and much lower power. Looking forward to single proc 4 or 6 core > Sandy Bridge based systems for the next expansion -- testing 4 core vs 4 > core has these 30% faster than the Nehalem generation systems in CPU bound > tasks and lower power. Intel prices single socket Xeons so much lower > than the Dual socket ones that the best value for us is to get more single > socket servers rather than fewer dual socket ones (with similar processor > to hard drive ratio). Yes, in a large cluster the price of filling the second socket can compare to a lot of storage, and TB of storage is more tangible. I guess it depends on your application. Regarding Sandy Bridge, I've no experience of those, but I worry that 10 Gbps is still bleeding edge, and shouldn't be needed for code with good locality anyway; it is probably more cost effective to stay at 1Gbps/server, though the issue there is the #of HDD/s server generates lots of replication traffic when a single server fails... +
Steve Loughran 2011-06-27, 11:38
-
Re: Hadoop Java VersionsRyan Rawson 2011-06-28, 00:10
On the subject of gige vs 10-gige, I think that we will very shortly
be seeing interest in 10gig, since gige is only 120MB/sec - 1 hard drive of streaming data. Nodes with 4+ disks are throttled by the network. On a small cluster (20 nodes), the replication traffic can choke a cluster to death. The only way to fix quickly it is to bring that node back up. Perhaps the HortonWorks guys can work on that. -ryan On Mon, Jun 27, 2011 at 4:38 AM, Steve Loughran <[EMAIL PROTECTED]> wrote: > On 26/06/11 20:23, Scott Carey wrote: >> >> >> On 6/23/11 5:49 AM, "Steve Loughran"<[EMAIL PROTECTED]> wrote: >> > >>> what's your HW setup? #cores/server, #servers, underlying OS? >> >> CentOS 5.6. >> 4 cores / 8 threads a server (Nehalem generation Intel processor). > > > that should be enough to find problems. I've just moved up to a 6-core 12 > thread desktop and that found problems on some non-Hadoop code, which shows > that the more threads you have, and the faster the machines are, the more > your race conditions show up. With Hadoop the fact that you can have 10-1000 > servers means that in a large cluster the probability of that race condition > showing up scales well. > >> Also run a smaller cluster with 2x quad core Core 2 generation Xeons. >> >> Off topic: >> The single proc Nehalem is faster than the dual core 2's for most use >> cases -- and much lower power. Looking forward to single proc 4 or 6 core >> Sandy Bridge based systems for the next expansion -- testing 4 core vs 4 >> core has these 30% faster than the Nehalem generation systems in CPU bound >> tasks and lower power. Intel prices single socket Xeons so much lower >> than the Dual socket ones that the best value for us is to get more single >> socket servers rather than fewer dual socket ones (with similar processor >> to hard drive ratio). > > Yes, in a large cluster the price of filling the second socket can compare > to a lot of storage, and TB of storage is more tangible. I guess it depends > on your application. > > Regarding Sandy Bridge, I've no experience of those, but I worry that 10 > Gbps is still bleeding edge, and shouldn't be needed for code with good > locality anyway; it is probably more cost effective to stay at 1Gbps/server, > though the issue there is the #of HDD/s server generates lots of replication > traffic when a single server fails... > +
Ryan Rawson 2011-06-28, 00:10
-
Re: Hadoop Java VersionsTed Dunning 2011-06-28, 00:12
Come to Srivas talk at the Summit.
On Mon, Jun 27, 2011 at 5:10 PM, Ryan Rawson <[EMAIL PROTECTED]> wrote: > On the subject of gige vs 10-gige, I think that we will very shortly > be seeing interest in 10gig, since gige is only 120MB/sec - 1 hard > drive of streaming data. Nodes with 4+ disks are throttled by the > network. On a small cluster (20 nodes), the replication traffic can > choke a cluster to death. The only way to fix quickly it is to bring > that node back up. Perhaps the HortonWorks guys can work on that. > > -ryan > > On Mon, Jun 27, 2011 at 4:38 AM, Steve Loughran <[EMAIL PROTECTED]> wrote: > > On 26/06/11 20:23, Scott Carey wrote: > >> > >> > >> On 6/23/11 5:49 AM, "Steve Loughran"<[EMAIL PROTECTED]> wrote: > >> > > > >>> what's your HW setup? #cores/server, #servers, underlying OS? > >> > >> CentOS 5.6. > >> 4 cores / 8 threads a server (Nehalem generation Intel processor). > > > > > > that should be enough to find problems. I've just moved up to a 6-core 12 > > thread desktop and that found problems on some non-Hadoop code, which > shows > > that the more threads you have, and the faster the machines are, the more > > your race conditions show up. With Hadoop the fact that you can have > 10-1000 > > servers means that in a large cluster the probability of that race > condition > > showing up scales well. > > > >> Also run a smaller cluster with 2x quad core Core 2 generation Xeons. > >> > >> Off topic: > >> The single proc Nehalem is faster than the dual core 2's for most use > >> cases -- and much lower power. Looking forward to single proc 4 or 6 > core > >> Sandy Bridge based systems for the next expansion -- testing 4 core vs 4 > >> core has these 30% faster than the Nehalem generation systems in CPU > bound > >> tasks and lower power. Intel prices single socket Xeons so much lower > >> than the Dual socket ones that the best value for us is to get more > single > >> socket servers rather than fewer dual socket ones (with similar > processor > >> to hard drive ratio). > > > > Yes, in a large cluster the price of filling the second socket can > compare > > to a lot of storage, and TB of storage is more tangible. I guess it > depends > > on your application. > > > > Regarding Sandy Bridge, I've no experience of those, but I worry that 10 > > Gbps is still bleeding edge, and shouldn't be needed for code with good > > locality anyway; it is probably more cost effective to stay at > 1Gbps/server, > > though the issue there is the #of HDD/s server generates lots of > replication > > traffic when a single server fails... > > > +
Ted Dunning 2011-06-28, 00:12
-
Re: Hadoop Java VersionsSegel, Mike 2011-06-28, 01:54
That doesn't seem right.
In one of our test clusters (19 data nodes) we found that under heavy loads we were disk I/O bound and not network bound. Of course YMMV depending on your ToR switch. If we had more than 4 disks per node, we would probably see the network being the bottleneck. What did you set your bandwidth settings in the hdfs-site.xml? ( going from memory not sure of the exact setting...) But the good news... Newer hardware will start to have 10GBe on the motherboard. Sent from a remote device. Please excuse any typos... Mike Segel On Jun 27, 2011, at 7:11 PM, "Ryan Rawson" <[EMAIL PROTECTED]> wrote: > On the subject of gige vs 10-gige, I think that we will very shortly > be seeing interest in 10gig, since gige is only 120MB/sec - 1 hard > drive of streaming data. Nodes with 4+ disks are throttled by the > network. On a small cluster (20 nodes), the replication traffic can > choke a cluster to death. The only way to fix quickly it is to bring > that node back up. Perhaps the HortonWorks guys can work on that. > > -ryan > > On Mon, Jun 27, 2011 at 4:38 AM, Steve Loughran <[EMAIL PROTECTED]> wrote: >> On 26/06/11 20:23, Scott Carey wrote: >>> >>> >>> On 6/23/11 5:49 AM, "Steve Loughran"<[EMAIL PROTECTED]> wrote: >>> >> >>>> what's your HW setup? #cores/server, #servers, underlying OS? >>> >>> CentOS 5.6. >>> 4 cores / 8 threads a server (Nehalem generation Intel processor). >> >> >> that should be enough to find problems. I've just moved up to a 6-core 12 >> thread desktop and that found problems on some non-Hadoop code, which shows >> that the more threads you have, and the faster the machines are, the more >> your race conditions show up. With Hadoop the fact that you can have 10-1000 >> servers means that in a large cluster the probability of that race condition >> showing up scales well. >> >>> Also run a smaller cluster with 2x quad core Core 2 generation Xeons. >>> >>> Off topic: >>> The single proc Nehalem is faster than the dual core 2's for most use >>> cases -- and much lower power. Looking forward to single proc 4 or 6 core >>> Sandy Bridge based systems for the next expansion -- testing 4 core vs 4 >>> core has these 30% faster than the Nehalem generation systems in CPU bound >>> tasks and lower power. Intel prices single socket Xeons so much lower >>> than the Dual socket ones that the best value for us is to get more single >>> socket servers rather than fewer dual socket ones (with similar processor >>> to hard drive ratio). >> >> Yes, in a large cluster the price of filling the second socket can compare >> to a lot of storage, and TB of storage is more tangible. I guess it depends >> on your application. >> >> Regarding Sandy Bridge, I've no experience of those, but I worry that 10 >> Gbps is still bleeding edge, and shouldn't be needed for code with good >> locality anyway; it is probably more cost effective to stay at 1Gbps/server, >> though the issue there is the #of HDD/s server generates lots of replication >> traffic when a single server fails... >> The information contained in this communication may be CONFIDENTIAL and is intended only for the use of the recipient(s) named above. If you are not the intended recipient, you are hereby notified that any dissemination, distribution, or copying of this communication, or any of its contents, is strictly prohibited. If you have received this communication in error, please notify the sender and delete/destroy the original message and any copy of it from your computer or paper files. +
Segel, Mike 2011-06-28, 01:54
-
Re: Hadoop Java VersionsRyan Rawson 2011-06-28, 02:33
There are no bandwidth limitations in 0.20.x. None that I saw at
least. It was basically bandwidth-management-by-pwm. You could adjust the frequency of how many files-per-node were copied. In my case, the load was HBase real time serving, so it was servicing more smaller random reads, not a map-reduce. Everyone has their own use case :-) -ryan On Mon, Jun 27, 2011 at 6:54 PM, Segel, Mike <[EMAIL PROTECTED]> wrote: > That doesn't seem right. > In one of our test clusters (19 data nodes) we found that under heavy loads we were disk I/O bound and not network bound. Of course YMMV depending on your ToR switch. If we had more than 4 disks per node, we would probably see the network being the bottleneck. What did you set your bandwidth settings in the hdfs-site.xml? ( going from memory not sure of the exact setting...) > > But the good news... Newer hardware will start to have 10GBe on the motherboard. > > Sent from a remote device. Please excuse any typos... > > Mike Segel > > On Jun 27, 2011, at 7:11 PM, "Ryan Rawson" <[EMAIL PROTECTED]> wrote: > >> On the subject of gige vs 10-gige, I think that we will very shortly >> be seeing interest in 10gig, since gige is only 120MB/sec - 1 hard >> drive of streaming data. Nodes with 4+ disks are throttled by the >> network. On a small cluster (20 nodes), the replication traffic can >> choke a cluster to death. The only way to fix quickly it is to bring >> that node back up. Perhaps the HortonWorks guys can work on that. >> >> -ryan >> >> On Mon, Jun 27, 2011 at 4:38 AM, Steve Loughran <[EMAIL PROTECTED]> wrote: >>> On 26/06/11 20:23, Scott Carey wrote: >>>> >>>> >>>> On 6/23/11 5:49 AM, "Steve Loughran"<[EMAIL PROTECTED]> wrote: >>>> >>> >>>>> what's your HW setup? #cores/server, #servers, underlying OS? >>>> >>>> CentOS 5.6. >>>> 4 cores / 8 threads a server (Nehalem generation Intel processor). >>> >>> >>> that should be enough to find problems. I've just moved up to a 6-core 12 >>> thread desktop and that found problems on some non-Hadoop code, which shows >>> that the more threads you have, and the faster the machines are, the more >>> your race conditions show up. With Hadoop the fact that you can have 10-1000 >>> servers means that in a large cluster the probability of that race condition >>> showing up scales well. >>> >>>> Also run a smaller cluster with 2x quad core Core 2 generation Xeons. >>>> >>>> Off topic: >>>> The single proc Nehalem is faster than the dual core 2's for most use >>>> cases -- and much lower power. Looking forward to single proc 4 or 6 core >>>> Sandy Bridge based systems for the next expansion -- testing 4 core vs 4 >>>> core has these 30% faster than the Nehalem generation systems in CPU bound >>>> tasks and lower power. Intel prices single socket Xeons so much lower >>>> than the Dual socket ones that the best value for us is to get more single >>>> socket servers rather than fewer dual socket ones (with similar processor >>>> to hard drive ratio). >>> >>> Yes, in a large cluster the price of filling the second socket can compare >>> to a lot of storage, and TB of storage is more tangible. I guess it depends >>> on your application. >>> >>> Regarding Sandy Bridge, I've no experience of those, but I worry that 10 >>> Gbps is still bleeding edge, and shouldn't be needed for code with good >>> locality anyway; it is probably more cost effective to stay at 1Gbps/server, >>> though the issue there is the #of HDD/s server generates lots of replication >>> traffic when a single server fails... >>> > > > The information contained in this communication may be CONFIDENTIAL and is intended only for the use of the recipient(s) named above. If you are not the intended recipient, you are hereby notified that any dissemination, distribution, or copying of this communication, or any of its contents, is strictly prohibited. If you have received this communication in error, please notify the sender and delete/destroy the original message and any copy of it from your computer or paper files. +
Ryan Rawson 2011-06-28, 02:33
-
Re: Hadoop Java VersionsSegel, Mike 2011-06-28, 03:49
Hmmm. I could have sworn there was a background balancing bandwidth limiter.
Haven't tested random reads... The last test we did ended up hitting the cache, but we didn't push it hard enough to hit network bandwidth limitations... Not to say they don't exist. Like I said in the other post, if we had more disks... we would hit it. We'll have to do more random testing. Sent from a remote device. Please excuse any typos... Mike Segel On Jun 27, 2011, at 9:34 PM, "Ryan Rawson" <[EMAIL PROTECTED]> wrote: > There are no bandwidth limitations in 0.20.x. None that I saw at > least. It was basically bandwidth-management-by-pwm. You could > adjust the frequency of how many files-per-node were copied. > > In my case, the load was HBase real time serving, so it was servicing > more smaller random reads, not a map-reduce. Everyone has their own > use case :-) > > -ryan > > On Mon, Jun 27, 2011 at 6:54 PM, Segel, Mike <[EMAIL PROTECTED]> wrote: >> That doesn't seem right. >> In one of our test clusters (19 data nodes) we found that under heavy loads we were disk I/O bound and not network bound. Of course YMMV depending on your ToR switch. If we had more than 4 disks per node, we would probably see the network being the bottleneck. What did you set your bandwidth settings in the hdfs-site.xml? ( going from memory not sure of the exact setting...) >> >> But the good news... Newer hardware will start to have 10GBe on the motherboard. >> >> Sent from a remote device. Please excuse any typos... >> >> Mike Segel >> >> On Jun 27, 2011, at 7:11 PM, "Ryan Rawson" <[EMAIL PROTECTED]> wrote: >> >>> On the subject of gige vs 10-gige, I think that we will very shortly >>> be seeing interest in 10gig, since gige is only 120MB/sec - 1 hard >>> drive of streaming data. Nodes with 4+ disks are throttled by the >>> network. On a small cluster (20 nodes), the replication traffic can >>> choke a cluster to death. The only way to fix quickly it is to bring >>> that node back up. Perhaps the HortonWorks guys can work on that. >>> >>> -ryan >>> >>> On Mon, Jun 27, 2011 at 4:38 AM, Steve Loughran <[EMAIL PROTECTED]> wrote: >>>> On 26/06/11 20:23, Scott Carey wrote: >>>>> >>>>> >>>>> On 6/23/11 5:49 AM, "Steve Loughran"<[EMAIL PROTECTED]> wrote: >>>>> >>>> >>>>>> what's your HW setup? #cores/server, #servers, underlying OS? >>>>> >>>>> CentOS 5.6. >>>>> 4 cores / 8 threads a server (Nehalem generation Intel processor). >>>> >>>> >>>> that should be enough to find problems. I've just moved up to a 6-core 12 >>>> thread desktop and that found problems on some non-Hadoop code, which shows >>>> that the more threads you have, and the faster the machines are, the more >>>> your race conditions show up. With Hadoop the fact that you can have 10-1000 >>>> servers means that in a large cluster the probability of that race condition >>>> showing up scales well. >>>> >>>>> Also run a smaller cluster with 2x quad core Core 2 generation Xeons. >>>>> >>>>> Off topic: >>>>> The single proc Nehalem is faster than the dual core 2's for most use >>>>> cases -- and much lower power. Looking forward to single proc 4 or 6 core >>>>> Sandy Bridge based systems for the next expansion -- testing 4 core vs 4 >>>>> core has these 30% faster than the Nehalem generation systems in CPU bound >>>>> tasks and lower power. Intel prices single socket Xeons so much lower >>>>> than the Dual socket ones that the best value for us is to get more single >>>>> socket servers rather than fewer dual socket ones (with similar processor >>>>> to hard drive ratio). >>>> >>>> Yes, in a large cluster the price of filling the second socket can compare >>>> to a lot of storage, and TB of storage is more tangible. I guess it depends >>>> on your application. >>>> >>>> Regarding Sandy Bridge, I've no experience of those, but I worry that 10 >>>> Gbps is still bleeding edge, and shouldn't be needed for code with good >>>> locality anyway; it is probably more cost effective to stay at 1Gbps/server, The information contained in this communication may be CONFIDENTIAL and is intended only for the use of the recipient(s) named above. If you are not the intended recipient, you are hereby notified that any dissemination, distribution, or copying of this communication, or any of its contents, is strictly prohibited. If you have received this communication in error, please notify the sender and delete/destroy the original message and any copy of it from your computer or paper files. +
Segel, Mike 2011-06-28, 03:49
-
Re: Hadoop Java VersionsSteve Loughran 2011-06-28, 09:59
On 28/06/11 04:49, Segel, Mike wrote:
> Hmmm. I could have sworn there was a background balancing bandwidth limiter. There is, for the rebalancer, node outages are taken more seriously, though there have been problems in past 0.20.x where there was a risk of a cascade failure on a big switch/rack failure. The risk has been reduced, though we all await field reports to confirm this :) You can get 12-24 TB in a server today, which means the loss of a server generates a lot of traffic -which argues for 10 Gbe. But -big increase in switch cost, especially if you (CoI warning) go with Cisco -there have been problems with things like BIOS PXE and lights out management on 10 Gbe -probably due to the NICs being things the BIOS wasn't expecting and off the mainboard. This should improve. -I don't know how well linux works with ether that fast (field reports useful) -the big threat is still ToR switch failure, as that will trigger a re-replication of every block in the rack. 2x1 Gbe lets you have redundant switches, albeit at the price of more wiring, more things to go wrong with the wiring, etc. The other thing to consider is how well the "enterprise" switches work in this world -with a Hadoop cluster you can really test those claims how well the switches handle every port lighting up at full rate. Indeed, I recommend that as part of your acceptance tests for the switch. +
Steve Loughran 2011-06-28, 09:59
-
Re: Hadoop Java VersionsMichel Segel 2011-06-28, 12:27
You're preaching to the choir. :-)
With Sandybridge, you're going to start seeing 10 GBe on the motherboard. We built our clusters using 1U boxes where you're stuck w 4 3.5" drives. With larger chassis, You can fit an additional controller card and more drives. More drives reduces the bottleneck and means your performance will be throttled by your network and the amount of memory. I priced out a couple of vendors and when you build out your boxes, the magic number per data node is $10,000.00 USD. (budget this amount per data node.) Moore's Law doesn't drop the price, but it gets you more bang for your buck. Note that this magic number is pre-discount and YMMV. [This also gets in to what is meant by commodity hardware.] I agree that 10GBe is a necessity and I have been looking at it for the past 2 years, only to be shot down by my client's IT group. I agree that Cisco's ToR switches are expensive, however there are Arista and Blade networks switches that claim to be Cisco friendly and aren't too pricey. Somewhere around 10K a box. (Again YMMV). If you want to upgrade existing boxes, you will probably want to look at Solarflare cards. jMHO... Sent from a remote device. Please excuse any typos... Mike Segel On Jun 28, 2011, at 4:59 AM, Steve Loughran <[EMAIL PROTECTED]> wrote: > On 28/06/11 04:49, Segel, Mike wrote: >> Hmmm. I could have sworn there was a background balancing bandwidth limiter. > > There is, for the rebalancer, node outages are taken more seriously, though there have been problems in past 0.20.x where there was a risk of a cascade failure on a big switch/rack failure. The risk has been reduced, though we all await field reports to confirm this :) > > You can get 12-24 TB in a server today, which means the loss of a server generates a lot of traffic -which argues for 10 Gbe. > > But > -big increase in switch cost, especially if you (CoI warning) go with Cisco > -there have been problems with things like BIOS PXE and lights out management on 10 Gbe -probably due to the NICs being things the BIOS wasn't expecting and off the mainboard. This should improve. > -I don't know how well linux works with ether that fast (field reports useful) > -the big threat is still ToR switch failure, as that will trigger a re-replication of every block in the rack. > > 2x1 Gbe lets you have redundant switches, albeit at the price of more wiring, more things to go wrong with the wiring, etc. > > The other thing to consider is how well the "enterprise" switches work in this world -with a Hadoop cluster you can really test those claims how well the switches handle every port lighting up at full rate. Indeed, I recommend that as part of your acceptance tests for the switch. > > > +
Michel Segel 2011-06-28, 12:27
-
Re: Hadoop Java VersionsArun C Murthy 2011-06-28, 17:25
We at Yahoo are about to deploy code to ensure a disk failure on a datanode is just that - a disk failure. Not a node failure. This really helps avoid replication storms.
It's in the 0.20.204 branch for the curious. Arun Sent from my iPhone On Jun 28, 2011, at 3:01 AM, "Steve Loughran" <[EMAIL PROTECTED]> wrote: > On 28/06/11 04:49, Segel, Mike wrote: >> Hmmm. I could have sworn there was a background balancing bandwidth limiter. > > There is, for the rebalancer, node outages are taken more seriously, > though there have been problems in past 0.20.x where there was a risk of > a cascade failure on a big switch/rack failure. The risk has been > reduced, though we all await field reports to confirm this :) > > You can get 12-24 TB in a server today, which means the loss of a server > generates a lot of traffic -which argues for 10 Gbe. > > But > -big increase in switch cost, especially if you (CoI warning) go with > Cisco > -there have been problems with things like BIOS PXE and lights out > management on 10 Gbe -probably due to the NICs being things the BIOS > wasn't expecting and off the mainboard. This should improve. > -I don't know how well linux works with ether that fast (field reports > useful) > -the big threat is still ToR switch failure, as that will trigger a > re-replication of every block in the rack. > > 2x1 Gbe lets you have redundant switches, albeit at the price of more > wiring, more things to go wrong with the wiring, etc. > > The other thing to consider is how well the "enterprise" switches work > in this world -with a Hadoop cluster you can really test those claims > how well the switches handle every port lighting up at full rate. > Indeed, I recommend that as part of your acceptance tests for the switch. > > +
Arun C Murthy 2011-06-28, 17:25
-
RE: Hadoop Java VersionsEvert Lammerts 2011-06-30, 21:31
> You can get 12-24 TB in a server today, which means the loss of a server
> generates a lot of traffic -which argues for 10 Gbe. > > But > -big increase in switch cost, especially if you (CoI warning) go with > Cisco > -there have been problems with things like BIOS PXE and lights out > management on 10 Gbe -probably due to the NICs being things the BIOS > wasn't expecting and off the mainboard. This should improve. > -I don't know how well linux works with ether that fast (field reports > useful) > -the big threat is still ToR switch failure, as that will trigger a > re-replication of every block in the rack. Keeping the amount of disks per node low and the amount of nodes high should keep the impact of dead nodes in control. A ToR switch failing is different - missing 30 nodes (~120TB) at once cannot be fixed by adding more nodes; that actually increases ToR switch failure. Although such failure is quite rare to begin with, I guess. The back-of-the-envelope-calculation I made suggests that ~150 (1U) nodes should be fine with 1Gb ethernet. (e.g., when 6 nodes fail in a cluster with 150 nodes with four 2TB disks each, with HDFS 60% full, it takes around ~32 minutes to recover. 2 nodes failing should take around 640 seconds. Also see the attached spreadsheet.) This doesn't take ToR switch failure in account though. On the other hand - 150 nodes is only ~5 racks - in such a scenario you might rather want to shut the system down completely rather than letting it replicate 20% of all data. Cheers, Evert +
Evert Lammerts 2011-06-30, 21:31
-
RE: Hadoop Java VersionsEvert Lammerts 2011-06-30, 21:37
Forgot to attach the spreadsheet - here it is. All three tabs contain some data. The speeds are per second.
Evert ________________________________________ From: Evert Lammerts [[EMAIL PROTECTED]] Sent: Thursday, June 30, 2011 11:31 PM To: [EMAIL PROTECTED] Subject: RE: Hadoop Java Versions > You can get 12-24 TB in a server today, which means the loss of a server > generates a lot of traffic -which argues for 10 Gbe. > > But > -big increase in switch cost, especially if you (CoI warning) go with > Cisco > -there have been problems with things like BIOS PXE and lights out > management on 10 Gbe -probably due to the NICs being things the BIOS > wasn't expecting and off the mainboard. This should improve. > -I don't know how well linux works with ether that fast (field reports > useful) > -the big threat is still ToR switch failure, as that will trigger a > re-replication of every block in the rack. Keeping the amount of disks per node low and the amount of nodes high should keep the impact of dead nodes in control. A ToR switch failing is different - missing 30 nodes (~120TB) at once cannot be fixed by adding more nodes; that actually increases ToR switch failure. Although such failure is quite rare to begin with, I guess. The back-of-the-envelope-calculation I made suggests that ~150 (1U) nodes should be fine with 1Gb ethernet. (e.g., when 6 nodes fail in a cluster with 150 nodes with four 2TB disks each, with HDFS 60% full, it takes around ~32 minutes to recover. 2 nodes failing should take around 640 seconds. Also see the attached spreadsheet.) This doesn't take ToR switch failure in account though. On the other hand - 150 nodes is only ~5 racks - in such a scenario you might rather want to shut the system down completely rather than letting it replicate 20% of all data. Cheers, Evert +
Evert Lammerts 2011-06-30, 21:37
-
Re: Hadoop Java VersionsAaron Eng 2011-06-30, 23:18
>Keeping the amount of disks per node low and the amount of nodes high
should keep the impact of dead nodes in control. It keeps the impact of dead nodes in control but I don't think thats long-term cost efficient. As prices of 10GbE go down, the "keep the node small" arguement seems less fitting. And on another note, most servers manufactured in the last 10 years have dual 1GbE network interfaces. If one were to go by these calcs: >150 nodes with four 2TB disks each, with HDFS 60% full, it takes around ~32 minutes to recover It seems like that assumes a single 1GbE interface, why not leverage the second? On Thu, Jun 30, 2011 at 2:31 PM, Evert Lammerts <[EMAIL PROTECTED]>wrote: > > You can get 12-24 TB in a server today, which means the loss of a server > > generates a lot of traffic -which argues for 10 Gbe. > > > > But > > -big increase in switch cost, especially if you (CoI warning) go with > > Cisco > > -there have been problems with things like BIOS PXE and lights out > > management on 10 Gbe -probably due to the NICs being things the BIOS > > wasn't expecting and off the mainboard. This should improve. > > -I don't know how well linux works with ether that fast (field reports > > useful) > > -the big threat is still ToR switch failure, as that will trigger a > > re-replication of every block in the rack. > > Keeping the amount of disks per node low and the amount of nodes high > should keep the impact of dead nodes in control. A ToR switch failing is > different - missing 30 nodes (~120TB) at once cannot be fixed by adding more > nodes; that actually increases ToR switch failure. Although such failure is > quite rare to begin with, I guess. The back-of-the-envelope-calculation I > made suggests that ~150 (1U) nodes should be fine with 1Gb ethernet. (e.g., > when 6 nodes fail in a cluster with 150 nodes with four 2TB disks each, with > HDFS 60% full, it takes around ~32 minutes to recover. 2 nodes failing > should take around 640 seconds. Also see the attached spreadsheet.) This > doesn't take ToR switch failure in account though. On the other hand - 150 > nodes is only ~5 racks - in such a scenario you might rather want to shut > the system down completely rather than letting it replicate 20% of all data. > > Cheers, > Evert +
Aaron Eng 2011-06-30, 23:18
-
Re: Hadoop Java VersionsTed Dunning 2011-07-01, 00:16
You have to consider the long-term reliability as well.
Losing an entire set of 10 or 12 disks at once makes the overall reliability of a large cluster very suspect. This is because it becomes entirely too likely that two additional drives will fail before the data on the off-line node can be replicated. For 100 nodes, that can decrease the average time to data loss down to less than a year. This can only be mitigated in stock hadoop by keeping the number of drives relatively low. MapR avoids this by not failing nodes for trivial problems. On Thu, Jun 30, 2011 at 4:18 PM, Aaron Eng <[EMAIL PROTECTED]> wrote: > >Keeping the amount of disks per node low and the amount of nodes high > should keep the impact of dead nodes in control. > > It keeps the impact of dead nodes in control but I don't think thats > long-term cost efficient. As prices of 10GbE go down, the "keep the node > small" arguement seems less fitting. And on another note, most servers > manufactured in the last 10 years have dual 1GbE network interfaces. If > one > were to go by these calcs: > > >150 nodes with four 2TB disks each, with HDFS 60% full, it takes around > ~32 > minutes to recover > > It seems like that assumes a single 1GbE interface, why not leverage the > second? > > On Thu, Jun 30, 2011 at 2:31 PM, Evert Lammerts <[EMAIL PROTECTED] > >wrote: > > > > You can get 12-24 TB in a server today, which means the loss of a > server > > > generates a lot of traffic -which argues for 10 Gbe. > > > > > > But > > > -big increase in switch cost, especially if you (CoI warning) go with > > > Cisco > > > -there have been problems with things like BIOS PXE and lights out > > > management on 10 Gbe -probably due to the NICs being things the BIOS > > > wasn't expecting and off the mainboard. This should improve. > > > -I don't know how well linux works with ether that fast (field > reports > > > useful) > > > -the big threat is still ToR switch failure, as that will trigger a > > > re-replication of every block in the rack. > > > > Keeping the amount of disks per node low and the amount of nodes high > > should keep the impact of dead nodes in control. A ToR switch failing is > > different - missing 30 nodes (~120TB) at once cannot be fixed by adding > more > > nodes; that actually increases ToR switch failure. Although such failure > is > > quite rare to begin with, I guess. The back-of-the-envelope-calculation I > > made suggests that ~150 (1U) nodes should be fine with 1Gb ethernet. > (e.g., > > when 6 nodes fail in a cluster with 150 nodes with four 2TB disks each, > with > > HDFS 60% full, it takes around ~32 minutes to recover. 2 nodes failing > > should take around 640 seconds. Also see the attached spreadsheet.) This > > doesn't take ToR switch failure in account though. On the other hand - > 150 > > nodes is only ~5 racks - in such a scenario you might rather want to shut > > the system down completely rather than letting it replicate 20% of all > data. > > > > Cheers, > > Evert > +
Ted Dunning 2011-07-01, 00:16
-
Re: Hadoop Java VersionsTodd Lipcon 2011-07-01, 00:24
On Thu, Jun 30, 2011 at 5:16 PM, Ted Dunning <[EMAIL PROTECTED]> wrote:
> You have to consider the long-term reliability as well. > > Losing an entire set of 10 or 12 disks at once makes the overall > reliability > of a large cluster very suspect. This is because it becomes entirely too > likely that two additional drives will fail before the data on the off-line > node can be replicated. For 100 nodes, that can decrease the average time > to data loss down to less than a year. This can only be mitigated in stock > hadoop by keeping the number of drives relatively low. MapR avoids this by > not failing nodes for trivial problems. > I'd advise you to look at "stock hadoop" again. This used to be true, but was fixed a long while back by HDFS-457 and several followup JIRAs. If MapR does something fancier, I'm sure we'd be interested to hear about it so we can compare the approaches. -Todd > > On Thu, Jun 30, 2011 at 4:18 PM, Aaron Eng <[EMAIL PROTECTED]> wrote: > > > >Keeping the amount of disks per node low and the amount of nodes high > > should keep the impact of dead nodes in control. > > > > It keeps the impact of dead nodes in control but I don't think thats > > long-term cost efficient. As prices of 10GbE go down, the "keep the node > > small" arguement seems less fitting. And on another note, most servers > > manufactured in the last 10 years have dual 1GbE network interfaces. If > > one > > were to go by these calcs: > > > > >150 nodes with four 2TB disks each, with HDFS 60% full, it takes around > > ~32 > > minutes to recover > > > > It seems like that assumes a single 1GbE interface, why not leverage the > > second? > > > > On Thu, Jun 30, 2011 at 2:31 PM, Evert Lammerts <[EMAIL PROTECTED] > > >wrote: > > > > > > You can get 12-24 TB in a server today, which means the loss of a > > server > > > > generates a lot of traffic -which argues for 10 Gbe. > > > > > > > > But > > > > -big increase in switch cost, especially if you (CoI warning) go > with > > > > Cisco > > > > -there have been problems with things like BIOS PXE and lights out > > > > management on 10 Gbe -probably due to the NICs being things the BIOS > > > > wasn't expecting and off the mainboard. This should improve. > > > > -I don't know how well linux works with ether that fast (field > > reports > > > > useful) > > > > -the big threat is still ToR switch failure, as that will trigger a > > > > re-replication of every block in the rack. > > > > > > Keeping the amount of disks per node low and the amount of nodes high > > > should keep the impact of dead nodes in control. A ToR switch failing > is > > > different - missing 30 nodes (~120TB) at once cannot be fixed by adding > > more > > > nodes; that actually increases ToR switch failure. Although such > failure > > is > > > quite rare to begin with, I guess. The back-of-the-envelope-calculation > I > > > made suggests that ~150 (1U) nodes should be fine with 1Gb ethernet. > > (e.g., > > > when 6 nodes fail in a cluster with 150 nodes with four 2TB disks each, > > with > > > HDFS 60% full, it takes around ~32 minutes to recover. 2 nodes failing > > > should take around 640 seconds. Also see the attached spreadsheet.) > This > > > doesn't take ToR switch failure in account though. On the other hand - > > 150 > > > nodes is only ~5 racks - in such a scenario you might rather want to > shut > > > the system down completely rather than letting it replicate 20% of all > > data. > > > > > > Cheers, > > > Evert > > > -- Todd Lipcon Software Engineer, Cloudera +
Todd Lipcon 2011-07-01, 00:24
-
Re: Hadoop Java VersionsTed Dunning 2011-07-01, 01:12
Good point Todd.
I was speaking from the experience of people I know who are using 0.20.x On Thu, Jun 30, 2011 at 5:24 PM, Todd Lipcon <[EMAIL PROTECTED]> wrote: > On Thu, Jun 30, 2011 at 5:16 PM, Ted Dunning <[EMAIL PROTECTED]> > wrote: > > > You have to consider the long-term reliability as well. > > > > Losing an entire set of 10 or 12 disks at once makes the overall > > reliability > > of a large cluster very suspect. This is because it becomes entirely too > > likely that two additional drives will fail before the data on the > off-line > > node can be replicated. For 100 nodes, that can decrease the average > time > > to data loss down to less than a year. This can only be mitigated in > stock > > hadoop by keeping the number of drives relatively low. MapR avoids this > by > > not failing nodes for trivial problems. > > > > I'd advise you to look at "stock hadoop" again. This used to be true, but > was fixed a long while back by HDFS-457 and several followup JIRAs. > > If MapR does something fancier, I'm sure we'd be interested to hear about > it > so we can compare the approaches. > +
Ted Dunning 2011-07-01, 01:12
-
Re: Hadoop Java VersionsM. C. Srivas 2011-07-01, 04:08
On Thu, Jun 30, 2011 at 5:24 PM, Todd Lipcon <[EMAIL PROTECTED]> wrote:
> > I'd advise you to look at "stock hadoop" again. This used to be true, but > was fixed a long while back by HDFS-457 and several followup JIRAs. > > If MapR does something fancier, I'm sure we'd be interested to hear about > it > so we can compare the approaches. > > -Todd > > MapR tracks disk responsiveness. In other words, a moving histogram of IO-completion times is maintained internally, and if a disk starts getting really slow, it is pre-emptively taken offline so it does not create long tails for running jobs (and the data on the disk is re-replicated using whatever re-replication policy is in place). One of the benefits of managing the disks directly instead of through ext3 / xfs / or other ... All these stats can be fed into Ganglia (or pushed out centrally via a text file that can be pulled out using NFS) if historical info about disk behavior (and failures) needs to be preserved. - Srivas. +
M. C. Srivas 2011-07-01, 04:08
-
Re: Hadoop Java VersionsIan Holsman 2011-07-01, 04:47
On Jul 1, 2011, at 2:08 PM, M. C. Srivas wrote: > On Thu, Jun 30, 2011 at 5:24 PM, Todd Lipcon <[EMAIL PROTECTED]> wrote: > >> >> I'd advise you to look at "stock hadoop" again. This used to be true, but >> was fixed a long while back by HDFS-457 and several followup JIRAs. >> >> If MapR does something fancier, I'm sure we'd be interested to hear about >> it >> so we can compare the approaches. >> >> -Todd >> >> > MapR tracks disk responsiveness. In other words, a moving histogram of > IO-completion times is maintained internally, and if a disk starts getting > really slow, it is pre-emptively taken offline so it does not create long > tails for running jobs (and the data on the disk is re-replicated using > whatever re-replication policy is in place). One of the benefits of > managing the disks directly instead of through ext3 / xfs / or other ... > > All these stats can be fed into Ganglia (or pushed out centrally via a text > file that can be pulled out using NFS) if historical info about disk > behavior (and failures) needs to be preserved. > > - Srivas. While I am intrigued about how MapR performs internally, I don't think this is the forum for it. please keep MapR (and other vendor specific discussions) on their respective support forums. Thanks! Ian. +
Ian Holsman 2011-07-01, 04:47
-
Re: Hadoop Java VersionsM. C. Srivas 2011-07-01, 05:06
No worries.
I read Todd's post as asking for elaboration ... sometimes knowing what another similar system does helps in improving your own. On Thu, Jun 30, 2011 at 9:47 PM, Ian Holsman <[EMAIL PROTECTED]> wrote: > > On Jul 1, 2011, at 2:08 PM, M. C. Srivas wrote: > > > On Thu, Jun 30, 2011 at 5:24 PM, Todd Lipcon <[EMAIL PROTECTED]> wrote: > > > >> > >> I'd advise you to look at "stock hadoop" again. This used to be true, > but > >> was fixed a long while back by HDFS-457 and several followup JIRAs. > >> > >> If MapR does something fancier, I'm sure we'd be interested to hear > about > >> it > >> so we can compare the approaches. > >> > >> -Todd > >> > >> > > MapR tracks disk responsiveness. In other words, a moving histogram of > > IO-completion times is maintained internally, and if a disk starts > getting > > really slow, it is pre-emptively taken offline so it does not create long > > tails for running jobs (and the data on the disk is re-replicated using > > whatever re-replication policy is in place). One of the benefits of > > managing the disks directly instead of through ext3 / xfs / or other ... > > > > All these stats can be fed into Ganglia (or pushed out centrally via a > text > > file that can be pulled out using NFS) if historical info about disk > > behavior (and failures) needs to be preserved. > > > > - Srivas. > > While I am intrigued about how MapR performs internally, I don't think this > is the forum for it. > please keep MapR (and other vendor specific discussions) on their > respective support forums. > > Thanks! > > Ian. > > +
M. C. Srivas 2011-07-01, 05:06
-
Re: Hadoop Java VersionsTed Dunning 2011-07-01, 05:09
Ian,
Good point. Srivas was responding to Todd's question, but there might be better fora as you suggest. We have a good one for specific questions about MapR at http://answers.mapr.com That doesn't, however, really provide a useful forum for questions like Todd's which really spans both domains. Where would you suggest for questions that span the two subjects? On Thu, Jun 30, 2011 at 9:47 PM, Ian Holsman <[EMAIL PROTECTED]> wrote: > >> If MapR does something fancier, I'm sure we'd be interested to hear > about > >> it > >> so we can compare the approaches. > >> > >> -Todd > >> > >> > > MapR tracks disk responsiveness. In other words, a moving histogram of > > IO-completion times is maintained internally, and if a disk starts > getting > > really slow, it is pre-emptively taken offline so it does not create long > > tails for running jobs (and the data on the disk is re-replicated using > > whatever re-replication policy is in place). One of the benefits of > > managing the disks directly instead of through ext3 / xfs / or other ... > > > > All these stats can be fed into Ganglia (or pushed out centrally via a > text > > file that can be pulled out using NFS) if historical info about disk > > behavior (and failures) needs to be preserved. > > > > - Srivas. > > While I am intrigued about how MapR performs internally, I don't think this > is the forum for it. > please keep MapR (and other vendor specific discussions) on their > respective support forums. > +
Ted Dunning 2011-07-01, 05:09
-
Re: Hadoop Java VersionsScott Carey 2011-07-01, 17:22
Although this thread is wandering a bit, I disagree strongly that it is
inappropriate to discuss other vendor specific features (or competing compute platform features) on general@. The topic has become the factors that influence hardware purchase choices, and one of those is how the system deals with disk failure. Compare/contrast with other platforms is healthy for the Hadoop project! On 6/30/11 9:47 PM, "Ian Holsman" <[EMAIL PROTECTED]> wrote: > >On Jul 1, 2011, at 2:08 PM, M. C. Srivas wrote: > >> On Thu, Jun 30, 2011 at 5:24 PM, Todd Lipcon <[EMAIL PROTECTED]> wrote: >> >>> >>> I'd advise you to look at "stock hadoop" again. This used to be true, >>>but >>> was fixed a long while back by HDFS-457 and several followup JIRAs. >>> >>> If MapR does something fancier, I'm sure we'd be interested to hear >>>about >>> it >>> so we can compare the approaches. >>> >>> -Todd >>> >>> >> MapR tracks disk responsiveness. In other words, a moving histogram of >> IO-completion times is maintained internally, and if a disk starts >>getting >> really slow, it is pre-emptively taken offline so it does not create >>long >> tails for running jobs (and the data on the disk is re-replicated using >> whatever re-replication policy is in place). One of the benefits of >> managing the disks directly instead of through ext3 / xfs / or other ... >> >> All these stats can be fed into Ganglia (or pushed out centrally via a >>text >> file that can be pulled out using NFS) if historical info about disk >> behavior (and failures) needs to be preserved. >> >> - Srivas. > >While I am intrigued about how MapR performs internally, I don't think >this is the forum for it. >please keep MapR (and other vendor specific discussions) on their >respective support forums. > >Thanks! > >Ian. > +
Scott Carey 2011-07-01, 17:22
-
Re: Hadoop Java VersionsAbhishek Mehta 2011-07-01, 18:53
i definitely agree with scott. as a user of the hadoop open source stack for building our banking focused big data analytics applications, i speak on behalf of our clients and the emerging hadoop eco-system that open and honest conversations on this thread/group, irrespective of whether one represents a company or apache, should be encouraged.
as an instance, with the fact that cloudera, mapR and soon hortonworks are all going to be offering competing hadoop distros for enterprises, it is important for all of us (and prospective users) to understand what they are doing to address critical gaps on the platform, and how the hadoop ecosystem benefits from it. From our perspective, it doesn't matter if one is better than the other (which is not the point i saw ted or mc making), but that companies, startups, apache and everybody else: 1. is thinking of the right issues 2. willing to solve them (and ideally contributing the solutions back) and 3. informing the exploding hadoop userbase of what not to do I see it benefitting all of us, especially as Hadoop rapidly jumps the transom and becomes the platform of choice for data management in industries like banking, retail and healthcare...just as it has for social media and the web... isn't that what we are launching our business plans around anyway... And in that sense we all owe ASF and the hadoop community (and not any one company) an equal amount of gratitude, humility and respect. On Jul 1, 2011, at 1:22 PM, Scott Carey wrote: > Although this thread is wandering a bit, I disagree strongly that it is > inappropriate to discuss other vendor specific features (or competing > compute platform features) on general@. The topic has become the factors > that influence hardware purchase choices, and one of those is how the > system deals with disk failure. Compare/contrast with other platforms is > healthy for the Hadoop project! +1 > > On 6/30/11 9:47 PM, "Ian Holsman" <[EMAIL PROTECTED]> wrote: > >> >> On Jul 1, 2011, at 2:08 PM, M. C. Srivas wrote: >> >>> On Thu, Jun 30, 2011 at 5:24 PM, Todd Lipcon <[EMAIL PROTECTED]> wrote: >>> >>>> >>>> I'd advise you to look at "stock hadoop" again. This used to be true, >>>> but >>>> was fixed a long while back by HDFS-457 and several followup JIRAs. >>>> >>>> If MapR does something fancier, I'm sure we'd be interested to hear >>>> about >>>> it >>>> so we can compare the approaches. >>>> >>>> -Todd >>>> >>>> >>> MapR tracks disk responsiveness. In other words, a moving histogram of >>> IO-completion times is maintained internally, and if a disk starts >>> getting >>> really slow, it is pre-emptively taken offline so it does not create >>> long >>> tails for running jobs (and the data on the disk is re-replicated using >>> whatever re-replication policy is in place). One of the benefits of >>> managing the disks directly instead of through ext3 / xfs / or other ... >>> >>> All these stats can be fed into Ganglia (or pushed out centrally via a >>> text >>> file that can be pulled out using NFS) if historical info about disk >>> behavior (and failures) needs to be preserved. >>> >>> - Srivas. >> >> While I am intrigued about how MapR performs internally, I don't think >> this is the forum for it. >> please keep MapR (and other vendor specific discussions) on their >> respective support forums. >> >> Thanks! >> >> Ian. >> > +
Abhishek Mehta 2011-07-01, 18:53
-
Re: Hadoop Java VersionsTed Dunning 2011-07-01, 21:38
On Fri, Jul 1, 2011 at 11:53 AM, Abhishek Mehta <[EMAIL PROTECTED]>wrote:
> open and honest conversations on this thread/group, irrespective of > whether one represents a company or apache, should be encouraged. > Paradoxically, I partially agree with Ian on this. On the one hand, it is important to hear about alternatives in the same eco-system (and alternative approaches from other communities). That is a bit different from Ian's view. Where I agree with him is that discussions should not be in the form of simple breast beating. Discussions should be driven by data and experience. All data and experience are of equal value if they represent quality thought. The only place that company names should figure in this is as a bookmark so you can tell which product/release has the characteristic under consideration. > ... And in that sense we all owe ASF and the hadoop community (and not any > one company) an equal amount of gratitude, humility and respect. > This doesn't get said nearly enough. +
Ted Dunning 2011-07-01, 21:38
-
Re: Hadoop Java VersionsIan Holsman 2011-07-03, 04:16
On Jul 2, 2011, at 7:38 AM, Ted Dunning wrote: > On Fri, Jul 1, 2011 at 11:53 AM, Abhishek Mehta <[EMAIL PROTECTED]>wrote: > >> open and honest conversations on this thread/group, irrespective of >> whether one represents a company or apache, should be encouraged. >> > > Paradoxically, I partially agree with Ian on this. On the one hand, it is > important to hear about alternatives in the same eco-system (and alternative > approaches from other communities). That is a bit different from Ian's > view. While I don't mind that technical alternatives get discussed, I do get PO'd when the conversation goes into why product X is better than Y, or when someone makes claims that are incorrect because some 'customer' told them and stuff like that. If we can keep it at architecture/approaches instead of why a certain product is better than go right ahead. > > Where I agree with him is that discussions should not be in the form of > simple breast beating. Discussions should be driven by data and experience. > All data and experience are of equal value if they represent quality > thought. The only place that company names should figure in this is as a > bookmark so you can tell which product/release has the characteristic under > consideration. > > >> ... And in that sense we all owe ASF and the hadoop community (and not any >> one company) an equal amount of gratitude, humility and respect. >> > > This doesn't get said nearly enough. +
Ian Holsman 2011-07-03, 04:16
-
Re: Hadoop Java VersionsEric Baldeschwieler 2011-07-13, 14:59
We could create an apache hadoop list of product selection discussions. I believe this list is intended to be focused on project governance and similar discussions. Maybe we should simply create a governance list and leave this one to be the free for all?
On Jul 2, 2011, at 9:16 PM, Ian Holsman wrote: > > On Jul 2, 2011, at 7:38 AM, Ted Dunning wrote: > >> On Fri, Jul 1, 2011 at 11:53 AM, Abhishek Mehta <[EMAIL PROTECTED]>wrote: >> >>> open and honest conversations on this thread/group, irrespective of >>> whether one represents a company or apache, should be encouraged. >>> >> >> Paradoxically, I partially agree with Ian on this. On the one hand, it is >> important to hear about alternatives in the same eco-system (and alternative >> approaches from other communities). That is a bit different from Ian's >> view. > > While I don't mind that technical alternatives get discussed, I do get PO'd when > the conversation goes into why product X is better than Y, or when someone makes claims that are incorrect because some 'customer' told them and stuff like that. > > If we can keep it at architecture/approaches instead of why a certain product is better than go right ahead. > >> >> Where I agree with him is that discussions should not be in the form of >> simple breast beating. Discussions should be driven by data and experience. >> All data and experience are of equal value if they represent quality >> thought. The only place that company names should figure in this is as a >> bookmark so you can tell which product/release has the characteristic under >> consideration. >> >> >>> ... And in that sense we all owe ASF and the hadoop community (and not any >>> one company) an equal amount of gratitude, humility and respect. >>> >> >> This doesn't get said nearly enough. > +
Eric Baldeschwieler 2011-07-13, 14:59
-
Re: Hadoop Java VersionsTed Dunning 2011-07-13, 15:48
On Wed, Jul 13, 2011 at 7:59 AM, Eric Baldeschwieler <[EMAIL PROTECTED]
> wrote: > We could create an apache hadoop list of product selection discussions. I > believe this list is intended to be focused on project governance and > similar discussions. Maybe we should simply create a governance list and > leave this one to be the free for all? > If you need to separate the discussions, then taking the discussion with smaller/more experienced set of participants to the new list is going to be easier than trying to sweep the tide. A list named "general" sounds a lot like "general discussions" to newcomers (and frankly, to old-timers like me). For that matter, is there a strong reason to segregate the discussions? +
Ted Dunning 2011-07-13, 15:48
-
Re: Hadoop Java VersionsSteve Loughran 2011-07-01, 16:12
On 01/07/2011 01:16, Ted Dunning wrote:
> You have to consider the long-term reliability as well. > > Losing an entire set of 10 or 12 disks at once makes the overall reliability > of a large cluster very suspect. This is because it becomes entirely too > likely that two additional drives will fail before the data on the off-line > node can be replicated. For 100 nodes, that can decrease the average time > to data loss down to less than a year. There's also Rodrigo's work on alternate block placement that doesn't scatter blocks quite so randomly across a cluster, so a loss of a node or rack doesn't have adverse effects on so many files https://issues.apache.org/jira/browse/HDFS-1094 Given that most HDDs failures happen on cluster reboot, it is possible for 10-12 disks not to come up at the same time, if the cluster has been up for a while, but like Todd says -worry. At least a bit. I've heard hints of one FS that actually includes HDD batch data in block placement, to try and scatter data across batches, and be biased towards using new HDDs for temp storage during burn-in. Some research work on doing that to HDFS could be something to keep some postgraduate busy for a while, "Disk batch-aware block placement". > This can only be mitigated in stock > hadoop by keeping the number of drives relatively low. now I'm confused. Do you mean #of HDDs/server, or HDDs/filesystem? Because it seems to me that "stock" HDFS's use in production makes it one of the filesystems in the planet with the most number of non-RAIDed HDDs out there -things like Lustre and IBM GPFS go for RAID, as does HP IBRIX (the last two of which have some form of Hadoop support too, if you ask nicely). HDD/server numbers matter in that in a small cluster, it's better to have fewer machines to get more servers to spread the data over; you don't really want your 100 TB in three 1U servers. As your cluster grows -and you care more about storage capacity than raw compute- then the appeal of 24+ TB/server starts to look good, and that's when you care about the improvements to datanodes handling loss of worker disk better. Even without that, rebooting the DN may fix things, but the impact on ongoing work is the big issue -you don't just lose a replicated block, you lose data. Cascade failures leading to cluster outages are a separate issue and normally triggered by switch failure/config than anything else. It doesn't matter how reliable the hardware is if it gets the wrong configuration +
Steve Loughran 2011-07-01, 16:12
-
Re: Hadoop Java VersionsTed Dunning 2011-07-01, 18:09
On Fri, Jul 1, 2011 at 9:12 AM, Steve Loughran <[EMAIL PROTECTED]> wrote:
> On 01/07/2011 01:16, Ted Dunning wrote: > >> You have to consider the long-term reliability as well. >> >> Losing an entire set of 10 or 12 disks at once makes the overall >> reliability >> of a large cluster very suspect. This is because it becomes entirely too >> likely that two additional drives will fail before the data on the >> off-line >> node can be replicated. For 100 nodes, that can decrease the average time >> to data loss down to less than a year. >> > > There's also Rodrigo's work on alternate block placement that doesn't > scatter blocks quite so randomly across a cluster, so a loss of a node or > rack doesn't have adverse effects on so many files > > https://issues.apache.org/**jira/browse/HDFS-1094<https://issues.apache.org/jira/browse/HDFS-1094> I did calculations based on this as well. The heuristic level of the computation is pretty simple, but to go any deeper, you have a pretty hair computation. My own approach was to use Monte Carlo Markov Chain to sample from the failure mode distribution. The codes that I wrote for this used pluggable placement, replication and failure models. I may have lacked sufficient cleverness at the time, but it was very difficult to come up with structured placement policies that actually improved the failure probabilities. Most such strategies massively decreased average probabilities. My suspicion by analogy with large structured error correction codes is that there are structured placement policies that perform well, but that in reasonably large clusters (number of disks > 50, say), that random placement will be within epsilon of the best possible strategy with very high probability. > Given that most HDDs failures happen on cluster reboot, it is possible for > 10-12 disks not to come up at the same time, if the cluster has been up for > a while, but like Todd says -worry. At least a bit. > Indeed. Thank goodness also that disk manufacturers tend to be pessimistic in quoting MTBF. These possibilities of correlated failure seriously complicate these computations, of course. > I've heard hints of one FS that actually includes HDD batch data in block > placement, to try and scatter data across batches, and be biased towards > using new HDDs for temp storage during burn-in. Some research work on doing > that to HDFS could be something to keep some postgraduate busy for a while, > "Disk batch-aware block placement". > Sadly, I can't comment on my knowledge of this except to say that there are non-obvious solutions to this that are embedded in at least one commercial map-reduce related product. I can't say which without getting chastised. > This can only be mitigated in stock >> hadoop by keeping the number of drives relatively low. >> > > now I'm confused. Do you mean #of HDDs/server, or HDDs/filesystem? Per system. > ..with the most number of non-RAIDed HDDs out there -things like Lustre and > IBM GPFS go for RAID, as does HP IBRIX (the last two of which have some form > of Hadoop support too, if you ask nicely). HDD/server numbers matter in that > in a small cluster, it's better to have fewer machines to get more servers > to spread the data over; you don't really want your 100 TB in three 1U > servers. As your cluster grows -and you care more about storage capacity > than raw compute- then the appeal of 24+ TB/server starts to look good, and > that's when you care about the improvements to datanodes handling loss of > worker disk better. Even without that, rebooting the DN may fix things, but > the impact on ongoing work is the big issue -you don't just lose a > replicated block, you lose data. > Generally, I agree with what you say. The effect of RAID is to squeeze the error distributions around so that partial failures have lower probability. This is complex in the aggregate. > > Cascade failures leading to cluster outages are a separate issue and > normally triggered by switch failure/config than anything else. It doesn't Indeed. +
Ted Dunning 2011-07-01, 18:09
-
Re: Hadoop Java VersionsScott Carey 2011-06-28, 02:58
For cost reasons, we just bonded two 1G network ports together. A single
stream won't go past 1Gbps, but concurrent ones do -- this is with the Linux built-in bonding. The network is only stressed during 'sort-like' jobs or big replication events. We also removed some disk bottlenecks by tuning the file systems aggressively -- using a separate partition for the M/R temp and the location that jars may unpack into helps tremendously. Ext4 can be configured to delay flushing to disk for this temp space, which for small jobs decreases the I/O tremendously as many files are deleted before they get pushed to disk. On 6/27/11 5:10 PM, "Ryan Rawson" <[EMAIL PROTECTED]> wrote: >On the subject of gige vs 10-gige, I think that we will very shortly >be seeing interest in 10gig, since gige is only 120MB/sec - 1 hard >drive of streaming data. Nodes with 4+ disks are throttled by the >network. On a small cluster (20 nodes), the replication traffic can >choke a cluster to death. The only way to fix quickly it is to bring >that node back up. Perhaps the HortonWorks guys can work on that. > >-ryan > >On Mon, Jun 27, 2011 at 4:38 AM, Steve Loughran <[EMAIL PROTECTED]> wrote: >> On 26/06/11 20:23, Scott Carey wrote: >>> >>> >>> On 6/23/11 5:49 AM, "Steve Loughran"<[EMAIL PROTECTED]> wrote: >>> >> >>>> what's your HW setup? #cores/server, #servers, underlying OS? >>> >>> CentOS 5.6. >>> 4 cores / 8 threads a server (Nehalem generation Intel processor). >> >> >> that should be enough to find problems. I've just moved up to a 6-core >>12 >> thread desktop and that found problems on some non-Hadoop code, which >>shows >> that the more threads you have, and the faster the machines are, the >>more >> your race conditions show up. With Hadoop the fact that you can have >>10-1000 >> servers means that in a large cluster the probability of that race >>condition >> showing up scales well. >> >>> Also run a smaller cluster with 2x quad core Core 2 generation Xeons. >>> >>> Off topic: >>> The single proc Nehalem is faster than the dual core 2's for most use >>> cases -- and much lower power. Looking forward to single proc 4 or 6 >>>core >>> Sandy Bridge based systems for the next expansion -- testing 4 core vs >>>4 >>> core has these 30% faster than the Nehalem generation systems in CPU >>>bound >>> tasks and lower power. Intel prices single socket Xeons so much lower >>> than the Dual socket ones that the best value for us is to get more >>>single >>> socket servers rather than fewer dual socket ones (with similar >>>processor >>> to hard drive ratio). >> >> Yes, in a large cluster the price of filling the second socket can >>compare >> to a lot of storage, and TB of storage is more tangible. I guess it >>depends >> on your application. >> >> Regarding Sandy Bridge, I've no experience of those, but I worry that 10 >> Gbps is still bleeding edge, and shouldn't be needed for code with good >> locality anyway; it is probably more cost effective to stay at >>1Gbps/server, >> though the issue there is the #of HDD/s server generates lots of >>replication >> traffic when a single server fails... >> +
Scott Carey 2011-06-28, 02:58
-
RE: Hadoop Java VersionsEvert Lammerts 2011-06-30, 21:40
That's not a question I'm qualified to answer. I do know we're now buying an Arista for a different cluster, but there's probably loads others out there.
*forwarded to general@...* ________________________________________ From: Abhishek Mehta [[EMAIL PROTECTED]] Sent: Thursday, June 30, 2011 11:38 PM To: Evert Lammerts Subject: Fwd: Hadoop Java Versions what are the other switch options (other than cisco that is?)? cheers Abhishek Mehta (e) [EMAIL PROTECTED]<mailto:[EMAIL PROTECTED]> (v) 980.355.9855 Begin forwarded message: From: Evert Lammerts <[EMAIL PROTECTED]<mailto:[EMAIL PROTECTED]>> Date: June 30, 2011 5:31:26 PM EDT To: "[EMAIL PROTECTED]<mailto:[EMAIL PROTECTED]>" <[EMAIL PROTECTED]<mailto:[EMAIL PROTECTED]>> Subject: RE: Hadoop Java Versions Reply-To: [EMAIL PROTECTED]<mailto:[EMAIL PROTECTED]> You can get 12-24 TB in a server today, which means the loss of a server generates a lot of traffic -which argues for 10 Gbe. But -big increase in switch cost, especially if you (CoI warning) go with Cisco -there have been problems with things like BIOS PXE and lights out management on 10 Gbe -probably due to the NICs being things the BIOS wasn't expecting and off the mainboard. This should improve. -I don't know how well linux works with ether that fast (field reports useful) -the big threat is still ToR switch failure, as that will trigger a re-replication of every block in the rack. Keeping the amount of disks per node low and the amount of nodes high should keep the impact of dead nodes in control. A ToR switch failing is different - missing 30 nodes (~120TB) at once cannot be fixed by adding more nodes; that actually increases ToR switch failure. Although such failure is quite rare to begin with, I guess. The back-of-the-envelope-calculation I made suggests that ~150 (1U) nodes should be fine with 1Gb ethernet. (e.g., when 6 nodes fail in a cluster with 150 nodes with four 2TB disks each, with HDFS 60% full, it takes around ~32 minutes to recover. 2 nodes failing should take around 640 seconds. Also see the attached spreadsheet.) This doesn't take ToR switch failure in account though. On the other hand - 150 nodes is only ~5 racks - in such a scenario you might rather want to shut the system down completely rather than letting it replicate 20% of all data. Cheers, Evert +
Evert Lammerts 2011-06-30, 21:40
|