HBase, mail # user - RegionServer dying every two or three days


Re: RegionServer dying every two or three days
Matt Corgan 2012-01-20, 20:07
I run c1.xlarge servers and have found them very stable.  I see 100 Mbit/s
sustained bi-directional network throughput (200 Mbit/s total), sometimes up
to 150 Mbit/s in each direction.
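For anyone who wants to reproduce that kind of measurement, here is a minimal sketch using the classic iperf tool (the tool choice and the <server-ip> placeholder are my assumptions, not something from this thread):

    # on one c1.xlarge, start an iperf server
    iperf -s
    # on a second instance, run a 60-second throughput test against it
    # (<server-ip> is a placeholder for the first instance's address)
    iperf -c <server-ip> -t 60 -i 10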

Here's a pretty thorough examination of the underlying hardware:

http://huanliu.wordpress.com/2010/06/14/amazons-physical-hardware-and-ec2-compute-unit/
*High-CPU instances*

The high-CPU instances (c1.medium, c1.xlarge) run on systems with
dual-socket Intel Xeon E5410 2.33GHz processors. It is dual-socket because
we see APIC IDs 0 to 7, and the E5410 only has 4 cores. A c1.xlarge instance
takes up almost the whole physical machine. However, we frequently observe
steal cycles on a c1.xlarge instance ranging from 0% to 25%, with an average
of about 10%. The amount of steal cycles is not enough to host another,
smaller VM, i.e., a c1.medium. Maybe those steal cycles are used to run
Amazon’s software firewall (security group). On PassMark CPU Mark, a
c1.xlarge machine achieves 7,962.6, actually higher than an average
dual-socket E5410 system is able to achieve (the average is 6,903).
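The steal cycles described above can be watched from inside the guest with stock Linux tools; a minimal sketch (the commands are my suggestion, not from the post):

    # vmstat's last CPU column, "st", is the percentage of time stolen by the hypervisor
    vmstat 5
    # top's CPU summary line reports the same figure as "%st"
    top -b -n 1 | head -5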

On Fri, Jan 20, 2012 at 8:03 AM, Leonardo Gamas
<[EMAIL PROTECTED]> wrote:

> Thanks Neil for sharing your experience with AWS! Could you tell us what
> instance type you are using?
> We are using m1.xlarge, which has 4 virtual cores, but I normally see
> recommendations for machines with 8 cores, like c1.xlarge, m2.4xlarge, etc.
> In principle these 8-core machines don't suffer as much from I/O problems,
> since they don't share the physical server. Is there any information from
> Amazon or another source that confirms that, or is it based on empirical
> analysis?
>
> 2012/1/19 Neil Yalowitz <[EMAIL PROTECTED]>
>
> > We have experienced many problems with our cluster on EC2.  The blunt
> > solution was to increase the ZooKeeper timeout to 5 minutes or even more.
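For reference, a minimal sketch of how that timeout is usually raised in hbase-site.xml (the 300000 ms value is simply the 5 minutes mentioned above; the ZooKeeper ensemble's own maxSessionTimeout may also need to be raised before it will grant a session that long):

    <!-- hbase-site.xml: request a 5-minute ZooKeeper session timeout (milliseconds) -->
    <property>
      <name>zookeeper.session.timeout</name>
      <value>300000</value>
    </property>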
> >
> > Even with a long timeout, however, it's not uncommon for us to see an EC2
> > instance become unresponsive to pings and SSH several times during a
> > week.  It's been a very bad environment for clusters.
> >
> >
> > Neil
> >
> > On Thu, Jan 19, 2012 at 11:49 AM, Leonardo Gamas
> > <[EMAIL PROTECTED]> wrote:
> >
> > > Hi Guys,
> > >
> > > I have tested the parameters provided by Sandy, and they solved the GC
> > > problems with -XX:+UseParallelOldGC (see the sketch below); thanks for
> > > the help, Sandy.
> > > I'm still experiencing some difficulties: the RegionServer continues to
> > > shut down, but it seems related to I/O. It starts to time out many
> > > connections, new connections to/from the machine time out too, and
> > > finally the RegionServer dies because of a YouAreDeadException. I will
> > > collect more data, but I think it's an issue inherent to the
> > > Amazon/virtualized environment.
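For reference, a minimal sketch of where that flag normally goes; the full set of parameters Sandy suggested is not reproduced in this thread, so only the flag named above is shown:

    # hbase-env.sh: enable the parallel old-generation collector for the HBase JVMs
    export HBASE_OPTS="$HBASE_OPTS -XX:+UseParallelOldGC"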
> > >
> > > Thanks for the great help provided so far.
> > >
> > > 2012/1/5 Leonardo Gamas <[EMAIL PROTECTED]>
> > >
> > > > I don't think so; if Amazon stopped the machine it would cause a stop
> > > > of minutes, not seconds, and the DataNode, TaskTracker and ZooKeeper
> > > > continue to work normally.
> > > > But it could be related to the shared nature of the Amazon environment:
> > > > maybe some spike in I/O caused by another virtualized server on the
> > > > same physical machine.
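A minimal sketch of how that kind of neighbor-induced contention can be checked from inside the instance (the tool choice is mine, not from the thread):

    # iostat (sysstat package): -x adds per-device await and %util, sampled every 5 seconds
    iostat -x 5
    # rising CPU steal (the "st" column) at the same moment also points at contention on the host
    vmstat 5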
> > > >
> > > > But the instance type I'm using is:
> > > >
> > > > *Extra Large Instance*
> > > >
> > > > 15 GB memory
> > > > 8 EC2 Compute Units (4 virtual cores with 2 EC2 Compute Units each)
> > > > 1,690 GB instance storage
> > > > 64-bit platform
> > > > I/O Performance: High
> > > > API name: m1.xlarge
> > > > I was not expecting to suffer from these problems, or at least not
> > > > this much.
> > > >
> > > >
> > > > 2012/1/5 Sandy Pratt <[EMAIL PROTECTED]>
> > > >
> > > >> You think it's an Amazon problem maybe?  Like they paused or migrated
> > > >> your virtual machine, and it just happened to be during a GC, leaving
> > > >> us to think the GC ran long when it didn't?  I don't have a lot of
> > > >> experience with Amazon, so I don't know if that sort of thing is
> > > >> common.
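One way to tell those two cases apart is to turn on GC logging and compare the logged pause times against the gaps seen in the RegionServer log; a minimal sketch for the JDK 6/7 JVMs of that era (the log path is illustrative):

    # hbase-env.sh: write timestamped GC details to a file
    export HBASE_OPTS="$HBASE_OPTS -verbose:gc -XX:+PrintGCDetails \
      -XX:+PrintGCDateStamps -Xloggc:/var/log/hbase/gc-regionserver.log"
    # a short pause in the GC log combined with a lost ZooKeeper session suggests
    # the stall happened outside the JVM (e.g. at the hypervisor)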