-
managing 5-10 servers
S Ahmed 2010-11-23, 21:14
Hi,
How much of a guru do you have to be to keep, say, 5-10 servers humming?

I'm a 1-man shop, and I dream of developing a web application, and scaling will be a core part of the application.

Is it feasible for a 1-man operation to manage a 5-10 server HBase cluster? Is it something that requires hand-holding and constant monitoring, or does it tend to be hands-off?
-
Re: managing 5-10 servers
Jean-Daniel Cryans 2010-11-23, 21:22
Constant hand-holding, no; constant monitoring, yes. Do set up Ganglia and preferably Nagios. Then it depends on what you're planning to do with your cluster. Here we have 2x 20 machines in production: the one that serves live traffic is pretty much doing its own thing by itself (although I keep a Ganglia tab open on a second monitor), and the other one is used strictly for MapReduce, on which our internal users have developed a habit of running very destructive jobs. But to be fair, it's probably the users that need support the most ;)

J-D
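To make the monitoring advice concrete, here is a minimal sketch of the kind of check a Nagios-style monitor can run on each node. It relies only on the standard plugin convention of exit code 0 for OK, 1 for WARNING and 2 for CRITICAL. The function name `classify_load` and the per-CPU thresholds are assumptions made for illustration, not values recommended anywhere in this thread.

```python
import os
import sys

# Nagios plugin convention: exit code 0 = OK, 1 = WARNING, 2 = CRITICAL.
OK, WARNING, CRITICAL = 0, 1, 2


def classify_load(load_1min, cpu_count, warn_per_cpu=1.0, crit_per_cpu=2.0):
    """Return (exit_code, status_line) for a 1-minute load average.

    The per-CPU thresholds are illustrative guesses, not tuned values.
    """
    per_cpu = load_1min / cpu_count
    if per_cpu >= crit_per_cpu:
        return CRITICAL, f"CRITICAL - load per CPU {per_cpu:.2f}"
    if per_cpu >= warn_per_cpu:
        return WARNING, f"WARNING - load per CPU {per_cpu:.2f}"
    return OK, f"OK - load per CPU {per_cpu:.2f}"


def main():
    """Entry point a monitoring system would call (Unix only: os.getloadavg)."""
    code, line = classify_load(os.getloadavg()[0], os.cpu_count() or 1)
    print(line)      # plugins report a single status line on stdout
    sys.exit(code)   # the exit code is what the monitor acts on
```

Calling `main()` from a small wrapper script is enough for a monitor to pick up the status line and exit code; the same shape works for disk space, swap usage or any other per-node number.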
-
Re: managing 5-10 servers
S Ahmed 2010-11-23, 21:31
Are there any writeups on what things to look for?
What are some of the things that usually go wrong? Or is that an unfair question :)
-
Re: managing 5-10 servers
Jean-Daniel Cryans 2010-11-23, 21:44
I wish I could do a dump of my memory into an ops guide to HBase, but currently I don't think there's such a writeup.

What can go wrong... again, it depends on your type of usage. With an MR-heavy cluster, it's usually very easy to drive the IO wait through the roof, and then you'll end up with GC pauses >60 secs caused by CPU starvation. Here's a recent example we got when a big Mahout job was running:

    2010-11-19T18:25:31.173-0800: [GC [ParNew: 114456K->13056K(118016K), 103.8190010 secs] 4624541K->4535473K(7154944K), 104.7165690 secs] [Times: user=4.45 sys=2.02, real=104.72 secs]

The trained eye will quickly see that something very bad happened on that cluster. Indeed, during the post-mortem we saw that somehow that machine started swapping, which is the Worst Thing Ever (tm) that can happen to a machine that runs Java processes. Make sure that your memory usage always stays under your total memory, even when all the mappers and reducers are using their heap to the fullest. And then double-check that (which it seems we didn't do).

On a cluster that serves web traffic, and thus must not be MRed against, you get the "usual" stuff like bad disks and operator errors.

J-D
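The memory advice above comes down to simple arithmetic, and the GC line can be checked mechanically. In the quoted log line, `real=104.72 secs` is far larger than user plus sys (about 6.5 secs), which is consistent with the JVM waiting on the machine rather than doing collection work. The sketch below is a rough illustration only: the function names, the 1 GB OS headroom and the 60-second threshold are assumptions, and heap sizes understate a JVM's real footprint, so treat the result as optimistic.

```python
import re


def worst_case_memory_mb(daemon_heaps_mb, map_slots, reduce_slots, child_heap_mb):
    """Sum every JVM heap that could be full at the same time on one node."""
    task_heaps = (map_slots + reduce_slots) * child_heap_mb
    return sum(daemon_heaps_mb) + task_heaps


def fits_in_ram(total_ram_mb, worst_case_mb, os_headroom_mb=1024):
    """True if the worst case still leaves headroom for the OS (assumed 1 GB)."""
    return worst_case_mb + os_headroom_mb <= total_ram_mb


# Matches the wall-clock field of a GC log line, e.g. "real=104.72 secs".
REAL_TIME = re.compile(r"real=(\d+(?:\.\d+)?) secs")


def long_pauses(log_lines, threshold_secs=60.0):
    """Return the wall-clock pause of every GC line at or above the threshold."""
    pauses = []
    for line in log_lines:
        match = REAL_TIME.search(line)
        if match and float(match.group(1)) >= threshold_secs:
            pauses.append(float(match.group(1)))
    return pauses
```

With hypothetical numbers, three daemons at 1000, 1000 and 8000 MB plus 8 map and 4 reduce slots at 1000 MB each give `worst_case_memory_mb([1000, 1000, 8000], 8, 4, 1000) == 22000`, which does not fit on a 16 GB machine and is exactly the situation that leads to swapping.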
-
Re: managing 5-10 servers
S Ahmed 2010-11-24, 14:22
So you have 20 nodes for the StumbleUpon link redirection service?

Are there any blog posts that go over the setup and what sort of read/write traffic it gets? Is there a memcached layer that sits in front?
-
Re: managing 5-10 servers
Jean-Daniel Cryans 2010-11-24, 18:07
Not just su.pr, but also stumbleupon.com, which has the "social" layer.

We do have memcached in front of HBase.

Regarding blog posts about our setup, just search for "stumbleupon hbase" and you'll find tons. The most recent presentation that's available online is my talk at Hadoop World.

Vid: http://www.cloudera.com/videos/hw10_video_how_stumbleupon_built_and_advertising_platform_using_hbase_and_hadoop
Slides: http://www.cloudera.com/resource/hw10_stumbleupon_advertising_platform_using_hbase

J-D
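For readers unfamiliar with putting a cache in front of a data store, the general read pattern can be sketched as follows. This is a generic cache-aside illustration, not a description of the actual StumbleUpon setup: plain dictionaries stand in for the memcached client and the HBase table, and the class name `CacheAsideReader` is hypothetical.

```python
class CacheAsideReader:
    """Read path that checks a cache before falling back to the backing store."""

    def __init__(self, cache, store):
        self.cache = cache    # stand-in for a memcached client
        self.store = store    # stand-in for an HBase table
        self.store_reads = 0  # counts how often the slow path was taken

    def get(self, key):
        value = self.cache.get(key)
        if value is not None:
            return value              # cache hit: the store is never touched
        self.store_reads += 1
        value = self.store.get(key)   # cache miss: read from the store
        if value is not None:
            self.cache[key] = value   # populate the cache for later reads
        return value
```

Repeated reads of the same key reach the store only once, which is the point of the extra layer; invalidation on writes is deliberately left out of this sketch.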
-
Re: managing 5-10 servers
Wojciech Langiewicz 2010-11-24, 18:26
I'm not sure what kind of managing you mean, but for doing admin work on Hadoop/HBase machines I use Cluster SSH (cssh), which allows you to log on to multiple machines at once and execute commands. For clusters of up to 20 machines I think it's quite OK.

--
Wojciech Langiewicz
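Cluster SSH is an interactive tool, so it cannot be shown as a script. A non-interactive alternative with a similar effect, running one command on every node, can be sketched with the standard `ssh` client. This is a different technique from cssh, offered only as an illustration: the function names are hypothetical, and it assumes key-based login is already configured on each host.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def build_ssh_command(host, remote_command, connect_timeout=10):
    """Build a non-interactive ssh invocation for one host."""
    return [
        "ssh",
        "-o", "BatchMode=yes",  # fail instead of prompting for a password
        "-o", f"ConnectTimeout={connect_timeout}",
        host,
        remote_command,
    ]


def run_on_hosts(hosts, remote_command, max_workers=10):
    """Run the same command on every host in parallel.

    Returns {host: (exit_code, combined_output)}.
    """
    def run_one(host):
        result = subprocess.run(
            build_ssh_command(host, remote_command),
            capture_output=True,
            text=True,
        )
        return host, (result.returncode, result.stdout + result.stderr)

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(run_one, hosts))
```

A call such as `run_on_hosts(["node1", "node2"], "uptime")` (host names made up) returns one exit code and output per machine, which makes failures on individual nodes easy to spot.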
-
Re: managing 5-10 servers
Lars George 2010-11-24, 18:35
I have set up and maintained clusters of between 6 and 40 machines while being a full-time developer, so all as part of the development process. I used simple scripts like the ones I documented here (http://www.larsgeorge.com/2009/02/hadoop-scripts-part-1.html). Cluster SSH, as mentioned, is also often used, and if you want to do it right, then use Puppet. But as J-D says, setting up monitoring is the very first step, or else you are flying blind.