Re: Timeouts and ping handling
ZK does pretty much entirely sequential I/O.

One thing it does that might be very, very bad for an SSD is that it
pre-allocates disk extents in the log by writing a bunch of zeros. This is
done to avoid directory updates as the log is written, but it doubles the
write load on the SSD.
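
For illustration, here is a minimal sketch of that zero-padding
pre-allocation. The file name and chunk size are made up for the example;
ZooKeeper's real implementation lives in its transaction log code, and the
chunk size is tunable via the zookeeper.preAllocSize system property.

    import java.io.RandomAccessFile;
    import java.nio.ByteBuffer;
    import java.nio.channels.FileChannel;

    // Minimal sketch of zero-padding pre-allocation: extend the log file
    // ahead of the write position so later appends stay within already
    // allocated space and syncs do not have to update file-size metadata
    // each time. File name and chunk size are illustrative, not
    // ZooKeeper's own code.
    public class PreallocSketch {
        static final int CHUNK = 64 * 1024 * 1024;   // pre-allocate 64 MB at a time

        public static void main(String[] args) throws Exception {
            try (RandomAccessFile raf = new RandomAccessFile("txn.log", "rw");
                 FileChannel ch = raf.getChannel()) {
                long end = ch.size();
                ByteBuffer zeros = ByteBuffer.allocate(CHUNK);
                ch.write(zeros, end);   // one big write of zeros past the end
                ch.force(true);         // flush data and metadata once, up front
                // Subsequent log appends overwrite the zeroed region, so each
                // later sync only flushes the appended data -- but the device
                // still sees roughly twice the bytes written overall.
            }
        }
    }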

On Thu, Jan 19, 2012 at 5:31 PM, Manosiz Bhattacharyya <[EMAIL PROTECTED]> wrote:

> I do not think there is a problem with the queue size. I suspect the
> problem is more the latency when the Fusion I/O goes into garbage
> collection. We are enabling stats on Zookeeper and the Fusion I/O to be
> more precise. Does Zookeeper typically do only sequential I/O, or does it
> do some random I/O too? We could then move the logs to a separate disk.
>
> Thanks,
> Manosiz.
>
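
For reference, "moving the logs to a disk" maps to ZooKeeper's dataDir and
dataLogDir settings in zoo.cfg, which let the transaction log live on a
different device than the snapshots. The paths below are only placeholders:

    # zoo.cfg sketch: keep snapshots on the existing device, put the
    # transaction log on a dedicated disk so fsync latency is isolated
    # from SSD garbage collection. Paths are illustrative.
    dataDir=/var/lib/zookeeper/data
    dataLogDir=/mnt/disk1/zookeeper/txnlog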
> On Wed, Jan 18, 2012 at 10:18 PM, Ted Dunning <[EMAIL PROTECTED]>
> wrote:
>
> > If you aren't pushing much data through ZK, there is almost no way that
> > the request queue can fill up without the log or snapshot disks being
> > slow. See what happens if you put the log into a real disk or (heaven
> > help us) onto a tmpfs partition.
> >
> > On Thu, Jan 19, 2012 at 2:18 AM, Manosiz Bhattacharyya <[EMAIL PROTECTED]> wrote:
> >
> > > I will do as you mention.
> > >
> > > We are using the async APIs throughout. Also, we do not write much
> > > data into Zookeeper. We just use it for leadership elections and
> > > health monitoring, which is why we typically see the timeouts on idle
> > > zookeeper connections.
> > >
> > > The reason we want the sessions to stay alive is the leadership
> > > election algorithm we use from the zookeeper recipe. If the leader
> > > node's connection is broken, the ephemeral node that guaranteed its
> > > leadership is lost, and reconnecting will create a new node which does
> > > not guarantee leadership. We then have to elect a new leader, which
> > > requires significant work. The bigger the timeout, the longer the
> > > cluster stays without a master for a particular service, because the
> > > old master cannot keep working once it knows its session is gone and,
> > > with it, its ephemeral node. As we are trying to build a highly
> > > available service (not internet scale, but at the scale of a storage
> > > system with ms latencies typically), we thought about reducing the
> > > timeout while keeping the session open. Also note that the node that
> > > is typically the master does not write very often into zookeeper.
> > >
> > > Thanks,
> > > Manosiz.
> > >
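
As a rough illustration of the failure mode described above, the sketch
below ties leadership to an ephemeral znode with the plain ZooKeeper Java
client and reacts to session expiry. The connect string, path, and timeout
are placeholders, and a real deployment would use the full leader-election
recipe rather than this simplified create-and-wait version.

    import java.util.concurrent.CountDownLatch;
    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.KeeperException;
    import org.apache.zookeeper.WatchedEvent;
    import org.apache.zookeeper.Watcher;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    // Simplified view of the problem described above: leadership is tied
    // to an ephemeral znode, so a session expiry silently costs the node
    // its leadership and forces a re-election. Connect string, path and
    // timeout are illustrative.
    public class LeaderSketch implements Watcher {
        private static final String LEADER_PATH = "/leader-demo";
        private final CountDownLatch expired = new CountDownLatch(1);
        private ZooKeeper zk;

        public void run() throws Exception {
            zk = new ZooKeeper("localhost:2181", 15000, this);  // 15s session timeout
            try {
                // Ephemeral node exists only while this session is alive.
                zk.create(LEADER_PATH, new byte[0],
                          ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
                System.out.println("I am the leader");
            } catch (KeeperException.NodeExistsException e) {
                System.out.println("Someone else is the leader");
            }
            // If the session expires (e.g. a slow fsync stalls pings past
            // the timeout), the ephemeral node is gone and leadership must
            // be re-established with a brand new session.
            expired.await();
            System.out.println("Session expired: leadership lost, re-electing");
        }

        @Override
        public void process(WatchedEvent event) {
            if (event.getState() == Event.KeeperState.Expired) {
                expired.countDown();
            }
        }

        public static void main(String[] args) throws Exception {
            new LeaderSketch().run();
        }
    }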
> > > On Wed, Jan 18, 2012 at 5:49 PM, Patrick Hunt <[EMAIL PROTECTED]> wrote:
> > >
> > > > On Wed, Jan 18, 2012 at 4:47 PM, Manosiz Bhattacharyya
> > > > <[EMAIL PROTECTED]> wrote:
> > > > > Thanks Patrick for your answer,
> > > >
> > > > No problem.
> > > >
> > > > > Actually, we are in a virtualized environment and have a FIO disk
> > > > > for the transactional logs. It does have some latency at times
> > > > > during FIO garbage collection. We know this could be the potential
> > > > > issue, but we were trying to work around that.
> > > >
> > > > Ah, I see. I saw something very similar to this recently with SSDs
> > > > used for the datadir. The fdatasync latency was sometimes > 10
> > > > seconds. I suspect it happened as a result of disk GC activity.
> > > >
> > > > I was able to identify the problem by running something like this:
> > > >
> > > > sudo strace -r -T -f -p 8066 -e trace=fsync,fdatasync -o trace.txt
> > > >
> > > > and then graphing the results (log scale). You should try running
> > > > this against your servers to confirm that it is indeed the problem.
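
If strace is awkward to run (for example in a restricted virtualized
environment), a rough application-level check is to time synced appends on
the candidate log directory and look for multi-second outliers. The file
name, record size, and threshold below are illustrative.

    import java.io.File;
    import java.io.RandomAccessFile;
    import java.nio.ByteBuffer;
    import java.nio.channels.FileChannel;

    // Rough stand-in for the strace approach above: time a series of small
    // synced appends on the directory you intend to use for the ZooKeeper
    // transaction log and report slow syncs. Names and sizes are
    // illustrative.
    public class FsyncLatencyCheck {
        public static void main(String[] args) throws Exception {
            File dir = new File(args.length > 0 ? args[0] : ".");
            File probe = new File(dir, "fsync-probe.bin");
            ByteBuffer block = ByteBuffer.allocate(512);   // small, txn-sized record

            try (RandomAccessFile raf = new RandomAccessFile(probe, "rw");
                 FileChannel ch = raf.getChannel()) {
                for (int i = 0; i < 1000; i++) {
                    block.rewind();
                    ch.write(block);
                    long start = System.nanoTime();
                    ch.force(false);                       // data-only sync, like fdatasync
                    long micros = (System.nanoTime() - start) / 1000;
                    if (micros > 100_000) {                // flag anything over 100 ms
                        System.out.println("slow sync #" + i + ": " + micros + " us");
                    }
                }
            } finally {
                probe.delete();
            }
        }
    }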
> > > >
> > > > > We were trying to qualify the requests into two types - either HBs
> > > > > or normal requests. Isn't it better to reject normal requests once
> > > > > the queue fills up to, say, a certain threshold, but keep the
> > > > > session alive? That way the flow control can be achieved with the
> > > > > user's session retrying the