Re: ephemeral node not deleted after client session closed
Hi Pat,

We are already working on that; it happened because we deployed ZK along
with other applications. We will have separate boxes for ZK very soon.

I already looked at the stats and could not correlate them with the spikes;
unfortunately we didn't have GC logs. There were a few deployment rules that
were clearly violated, and we will fix them.

The good news is that we found an issue :-). Thanks again for your help.

thanks,
Kishore G
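
For reference, the "Latency min/avg/max" figures quoted below come from
ZooKeeper's four-letter-word commands: "stat" reports them, and "srst"
(mentioned further down) resets the counters so a fresh spike can be
lined up with a GC pause or a slow WAL disk. Below is a minimal sketch of
polling them from Java; the host, port, and class name are placeholder
assumptions for a server reachable at 127.0.0.1:2181. The same commands can
also be sent with netcat or telnet against each ensemble member.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.io.OutputStream;
    import java.net.Socket;
    import java.nio.charset.StandardCharsets;

    public class ZkStat {
        // Send a four-letter word ("stat", "srst", "ruok", ...) and return the reply.
        static String fourLetterWord(String host, int port, String cmd) throws Exception {
            try (Socket sock = new Socket(host, port)) {
                OutputStream out = sock.getOutputStream();
                out.write(cmd.getBytes(StandardCharsets.US_ASCII));
                out.flush();
                BufferedReader in = new BufferedReader(
                        new InputStreamReader(sock.getInputStream(), StandardCharsets.US_ASCII));
                StringBuilder reply = new StringBuilder();
                String line;
                while ((line = in.readLine()) != null) {
                    reply.append(line).append('\n');
                }
                return reply.toString();
            }
        }

        public static void main(String[] args) throws Exception {
            // "stat" prints, among other things, the "Latency min/avg/max:" line
            // quoted in this thread; run it against each ensemble member.
            for (String line : fourLetterWord("127.0.0.1", 2181, "stat").split("\n")) {
                if (line.startsWith("Latency")) {
                    System.out.println(line);
                }
            }
            // "srst" resets the server's statistics, so the next max-latency spike
            // can be correlated with GC logs or iostat await times.
            fourLetterWord("127.0.0.1", 2181, "srst");
        }
    }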

On Fri, Nov 11, 2011 at 9:47 AM, Patrick Hunt <[EMAIL PROTECTED]> wrote:

> We always triage new issues as they come in (same as when 1208 originally
> came in). However, our ability to determine the cause is often bounded
> by the information provided by the user, which in this recent update
> was excellent and pointed out exactly where the flaw was. Kudos.
>
> We'll button up this issue (1208) for 3.3/3.4/trunk, after which I'll
> work on cutting a new 3.3.4 that includes this and some other fixes.
> It would be good if you could test this patch in the meantime.
>
> Also, this is particularly worrisome for me; you reported the
> following for your servers:
>
> Latency min/avg/max: 0/53/44712
> Latency min/avg/max: 0/11/11727
> Latency min/avg/max: 0/12/11994
> Latency min/avg/max: 0/9/11707
>
> That's really, really terrible performance, and you won't be happy with
> ZK as a result. You need to follow up with your ops team to determine
> why the performance you are seeing is so terrible, both in terms of
> average and max latency.
>
> Spikes in max latency are usually due to GC, swap, or bad disk
> performance for the WAL.
> Bad average latency might indicate poor network performance, or again
> bad disk performance.
>
> Try turning on CMS/parallel GC. Also try using iostat and look at the
> await times you're seeing for the WAL disk (correlate that with spikes
> in max latency; those counters can be reset using a four-letter word).
>
> Regards,
>
> Patrick
>
> On Fri, Nov 11, 2011 at 7:21 AM, Neha Narkhede <[EMAIL PROTECTED]> wrote:
> > Pat,
> >
> > That is excellent turnaround! I will take a look at running the test
> > as well as your patch. It will be a good opportunity for me to start
> > understanding the ZooKeeper codebase.
> >
> > Thanks again,
> > Neha
> >
> > On Thursday, November 10, 2011, Patrick Hunt <[EMAIL PROTECTED]> wrote:
> >> Ok, patch posted that fixes this (1208). Committers please take a look.
> >>
> >> Neha you might want to give a patched version a try. Awesome job
> >> helping to document and track down this issue. Thanks!
> >>
> >> Patrick
> >>
> >> On Thu, Nov 10, 2011 at 4:43 PM, Patrick Hunt <[EMAIL PROTECTED]> wrote:
> >>> See my update to 1208 for a test that demonstrates this.
> >>>
> >>> On Thu, Nov 10, 2011 at 3:31 PM, Neha Narkhede <[EMAIL PROTECTED]> wrote:
> >>>> Thanks Patrick for looking into this issue!
> >>>>
> >>>>>> The logs would indicate if an election happens. Look for
> >>>>>> "LOOKING" or "LEADING" or "FOLLOWING".
> >>>>
> >>>> The logs don't have any such entries. So I'm guessing there was no
> >>>> election happening.
> >>>>
> >>>> Do you have thoughts, though, on how easy it would be to reproduce
> >>>> this bug, to verify the bug fix?
> >>>>
> >>>> Thanks,
> >>>> Neha
> >>>>
> >>>>
> >>>> On Thu, Nov 10, 2011 at 2:08 PM, Patrick Hunt <[EMAIL PROTECTED]> wrote:
> >>>>
> >>>>> On Thu, Nov 10, 2011 at 1:52 PM, Neha Narkhede <[EMAIL PROTECTED]> wrote:
> >>>>> > Thanks for the quick responses, guys! Please find my replies
> >>>>> > inline -
> >>>>> >
> >>>>> >>> 1) Why is the session closed, the client closed it or the cluster
> >>>>> >>> expired it?
> >>>>> > Cluster expired it.
> >>>>> >
> >>>>>
> >>>>> Yes, I realized afterwards that the cxid is 0 in your logs - that
> >>>>> indicates it was expired and not closed explicitly by the client.
> >>>>>
> >>>>>
> >>>>> >>> 3) the znode exists on all 4 servers, is that right?
> >>>>> > Yes
> >>>>> >
> >>>>>
> >>>>> This backs up my theory that the PrepRequestProcessor is accepting a
> >>>>> create from the client after the session has been expired.
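
For context, the contract under discussion is that an ephemeral znode is
removed as soon as the session that created it is closed or expired; the
theory above is that a create slipping through the PrepRequestProcessor
after expiry can leave such a node behind (the issue tracked as 1208).
Below is a minimal sketch of the expected behavior using the standard
org.apache.zookeeper Java client; the connect string, znode path, and
timeouts are placeholder assumptions for a local test server.

    import java.util.concurrent.CountDownLatch;

    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.WatchedEvent;
    import org.apache.zookeeper.Watcher;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    public class EphemeralCheck {
        public static void main(String[] args) throws Exception {
            CountDownLatch connected = new CountDownLatch(1);

            // First session: create an ephemeral znode once the connection is up.
            ZooKeeper owner = new ZooKeeper("127.0.0.1:2181", 30000, new Watcher() {
                public void process(WatchedEvent event) {
                    if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
                        connected.countDown();
                    }
                }
            });
            connected.await();
            owner.create("/ephemeral-check", new byte[0],
                    ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);

            // Second session: observe the node from outside the owning session.
            ZooKeeper observer = new ZooKeeper("127.0.0.1:2181", 30000, event -> { });
            System.out.println("before close: " + observer.exists("/ephemeral-check", false));

            // Closing (or expiring) the owning session must delete the ephemeral node.
            owner.close();
            Thread.sleep(1000); // give the ensemble a moment to apply the delete
            System.out.println("after close:  " + observer.exists("/ephemeral-check", false));
            observer.close();
        }
    }

If the race described above is hit, a node created in the window around
session expiry could outlive its owner, and the second exists() check would
keep returning a Stat for it.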