Re: RegionServer dying every two or three days
I run c1.xlarge servers and have found them very stable.  I see 100 Mbit/s
sustained network throughput in each direction (200 Mbit/s total), sometimes
up to 150 Mbit/s each way.
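
If you want to check what your own instances sustain, a rough way is to sample
the interface byte counters over an interval; here is a minimal sketch (the
interface name eth0 and the 10-second window are assumptions for illustration,
not anything specific to these instances):

    #!/usr/bin/env python
    # Rough throughput sampler: read /proc/net/dev twice and report the
    # average receive/transmit rate over the interval.
    import time

    IFACE = "eth0"      # assumed interface name
    INTERVAL = 10       # seconds; arbitrary sample window

    def read_bytes(iface):
        with open("/proc/net/dev") as f:
            for line in f:
                if line.strip().startswith(iface + ":"):
                    fields = line.split(":", 1)[1].split()
                    return int(fields[0]), int(fields[8])  # rx_bytes, tx_bytes
        raise RuntimeError("interface %s not found" % iface)

    rx1, tx1 = read_bytes(IFACE)
    time.sleep(INTERVAL)
    rx2, tx2 = read_bytes(IFACE)

    to_mbit = lambda b: b * 8 / 1e6 / INTERVAL
    print("rx: %.1f Mbit/s  tx: %.1f Mbit/s" % (to_mbit(rx2 - rx1), to_mbit(tx2 - tx1)))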

Here's a pretty thorough examination of the underlying hardware:

http://huanliu.wordpress.com/2010/06/14/amazons-physical-hardware-and-ec2-compute-unit/
*High-CPU instances*

The high-CPU instances (c1.medium, c1.xlarge) run on systems with
dual-socket Intel Xeon E5410 2.33GHz processors. It is dual-socket because
we see APIC IDs 0 to 7, and the E5410 only has 4 cores. A c1.xlarge instance
almost takes up the whole physical machine. However, we frequently observe
steal cycles on a c1.xlarge instance ranging from 0% to 25%, with an average
of about 10%. The amount of steal cycles is not enough to host another
smaller VM, i.e., a c1.medium. Maybe those steal cycles are used to run
Amazon’s software firewall (security group). On PassMark CPU Mark, a
c1.xlarge machine achieves 7,962.6, actually higher than an average
dual-socket E5410 system is able to achieve (average is 6,903).
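
If you want to see how much steal time your own instances are getting, the
aggregate counters in /proc/stat give the same number that top reports in its
%st column (or mpstat in %steal); here is a minimal sketch (the 60-second
window is an arbitrary choice):

    #!/usr/bin/env python
    # Rough CPU-steal sampler: read the aggregate "cpu" line from /proc/stat
    # twice and report the share of the interval stolen by the hypervisor.
    import time

    INTERVAL = 60  # seconds; arbitrary sample window

    def cpu_counters():
        with open("/proc/stat") as f:
            # first line: "cpu user nice system idle iowait irq softirq steal ..."
            values = [int(v) for v in f.readline().split()[1:]]
        steal = values[7] if len(values) > 7 else 0
        return steal, sum(values)

    steal1, total1 = cpu_counters()
    time.sleep(INTERVAL)
    steal2, total2 = cpu_counters()

    print("steal: %.1f%%" % (100.0 * (steal2 - steal1) / (total2 - total1)))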

On Fri, Jan 20, 2012 at 8:03 AM, Leonardo Gamas
<[EMAIL PROTECTED]> wrote:

> Thanks Neil for sharing your experience with AWS! Could you tell us what
> instance type you are using?
> We are using m1.xlarge, which has 4 virtual cores, but I normally see
> recommendations for machines with 8 cores, like c1.xlarge, m2.4xlarge, etc.
> In principle these 8-core machines shouldn't suffer as much from I/O
> problems, since they don't share the physical server. Is there any
> information from Amazon or another source that confirms that, or is it
> based on empirical analysis?
>
> 2012/1/19 Neil Yalowitz <[EMAIL PROTECTED]>
>
> > We have experienced many problems with our cluster on EC2.  The blunt
> > solution was to increase the ZooKeeper timeout to 5 minutes or even more.
> >
> > Even with a long timeout, however, it's not uncommon for us to see an EC2
> > instance become unresponsive to pings and SSH several times a week.  It's
> > been a very bad environment for clusters.
> >
> >
> > Neil
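
For reference, the timeout Neil mentions is the ZooKeeper session timeout,
which HBase reads from hbase-site.xml as zookeeper.session.timeout. A minimal
sketch of a 5-minute setting (the values are illustrative; the ZooKeeper
server's maxSessionTimeout must also allow the value, otherwise the requested
timeout is negotiated down, and the second property below only matters when
HBase manages its own ZooKeeper quorum; with an external quorum, raise
maxSessionTimeout in zoo.cfg instead):

    <!-- hbase-site.xml -->
    <property>
      <name>zookeeper.session.timeout</name>
      <!-- 5 minutes, in milliseconds -->
      <value>300000</value>
    </property>
    <property>
      <!-- only used when HBase starts ZooKeeper itself -->
      <name>hbase.zookeeper.property.maxSessionTimeout</name>
      <value>300000</value>
    </property>

The obvious trade-off is that a RegionServer that really is dead takes that
much longer to be declared dead, so its regions stay unavailable longer before
the master reassigns them.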
> >
> > On Thu, Jan 19, 2012 at 11:49 AM, Leonardo Gamas
> > <[EMAIL PROTECTED]> wrote:
> >
> > > Hi Guys,
> > >
> > > I have tested the parameters provided by Sandy, and -XX:+UseParallelOldGC
> > > solved the GC problems; thanks for the help, Sandy.
> > > I'm still experiencing some difficulties: the RegionServer continues to
> > > shut down, but it seems related to I/O. It starts to time out many
> > > connections, new connections to/from the machine time out too, and
> > > finally the RegionServer dies because of a YouAreDeadException. I will
> > > collect more data, but I think it's an issue inherent to the
> > > Amazon/virtualized environment.
> > >
> > > Thanks for the great help provided so far.
> > >
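
For anyone trying the same parameters, the collector flag Leonardo mentions
normally goes into the JVM options in conf/hbase-env.sh. A rough sketch,
assuming it is applied through HBASE_OPTS (the GC-logging flags and the log
path are illustrative additions, not values from this thread):

    # conf/hbase-env.sh
    # Enable the parallel old-generation collector, plus GC logging so long
    # pauses can be correlated with ZooKeeper session expirations later.
    export HBASE_OPTS="$HBASE_OPTS -XX:+UseParallelOldGC \
      -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps \
      -Xloggc:/var/log/hbase/gc-hbase.log"

HBASE_REGIONSERVER_OPTS works the same way if you only want the flag on the
RegionServers rather than on every HBase daemon.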
> > > 2012/1/5 Leonardo Gamas <[EMAIL PROTECTED]>
> > >
> > > > I don't think so; if Amazon stopped the machine it would cause a pause
> > > > of minutes, not seconds, and the DataNode, TaskTracker and ZooKeeper
> > > > continue to work normally.
> > > > But it could be related to the shared nature of the Amazon environment,
> > > > maybe some spike in I/O caused by another virtualized server on the
> > > > same physical machine.
> > > >
> > > > The instance type I'm using is:
> > > >
> > > > *Extra Large Instance*
> > > >
> > > > 15 GB memory
> > > > 8 EC2 Compute Units (4 virtual cores with 2 EC2 Compute Units each)
> > > > 1,690 GB instance storage
> > > > 64-bit platform
> > > > I/O Performance: High
> > > > API name: m1.xlarge
> > > > I was not expecting to suffer from these problems, or at least not
> > > > this much.
> > > >
> > > >
> > > > 2012/1/5 Sandy Pratt <[EMAIL PROTECTED]>
> > > >
> > > >> You think it's an Amazon problem maybe?  Like they paused or migrated
> > > >> your virtual machine, and it just happened to be during GC, leaving us
> > > >> to think the GC ran long when it didn't?  I don't have a lot of
> > > >> experience with Amazon so I don't know if that sort of thing is common.