Zookeeper, mail # user - curator leader reconnect

Re: curator leader reconnect
Jordan Zimmerman 2012-02-07, 22:11
I'm having second thoughts about this...

What's happening, as you've seen, is that an exception is happening when
releasing the lock. This means that the client code cannot be certain that
the lock node has been removed. It may be that we are forced to make the
LeaderSelector instance unusable at this point. There is no reasonable
alternative that I can think of.

I'm still thinking about this, so stay tuned...
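
To make the failure mode discussed in this thread concrete, here is a minimal, self-contained Java sketch. It does not use Curator or ZooKeeper: `FakeCluster`, `SketchMutex`, the `clearOnFailure` flag and the path `/leader/lock` are hypothetical names invented for illustration, and this is not Curator's actual `InterProcessMutex` code. It only mimics the sequence described in the quoted messages: the session expires, the release call throws, the local lock state is left set, and the next acquire is treated as re-entrant so no node is created.

```java
import java.util.HashSet;
import java.util.Set;

public class ReentrantReleaseSketch {
    /** Stand-in for the ZooKeeper ensemble: only tracks which lock nodes exist. */
    static class FakeCluster {
        final Set<String> nodes = new HashSet<>();
        boolean connected = true;

        void create(String path) {
            if (!connected) throw new IllegalStateException("connection lost");
            nodes.add(path);
        }

        void delete(String path) {
            if (!connected) throw new IllegalStateException("connection lost");
            nodes.remove(path);
        }

        /** Session expiry: the server side drops the ephemeral nodes by itself. */
        void expireSession() {
            connected = false;
            nodes.clear();
        }

        void reconnect() {
            connected = true;
        }
    }

    /** Hypothetical mutex that keeps local state, loosely modelled on the thread. */
    static class SketchMutex {
        private final FakeCluster cluster;
        private final String path;
        private final boolean clearOnFailure;
        private String lockData; // non-null: this client believes it holds the lock

        SketchMutex(FakeCluster cluster, String path, boolean clearOnFailure) {
            this.cluster = cluster;
            this.path = path;
            this.clearOnFailure = clearOnFailure;
        }

        void acquire() {
            if (lockData != null) {
                return; // treated as re-entrant: nothing is created in the cluster
            }
            cluster.create(path);
            lockData = path;
        }

        void release() {
            if (clearOnFailure) {
                try {
                    cluster.delete(path);
                } finally {
                    lockData = null; // local state is dropped even if delete throws
                }
            } else {
                cluster.delete(path); // throws while disconnected...
                lockData = null;      // ...so this line is skipped
            }
        }
    }

    /** Replays the scenario and reports whether a lock node exists afterwards. */
    static boolean nodeExistsAfterReconnect(boolean clearOnFailure) {
        FakeCluster cluster = new FakeCluster();
        SketchMutex mutex = new SketchMutex(cluster, "/leader/lock", clearOnFailure);
        mutex.acquire();
        cluster.expireSession();
        try {
            mutex.release();
        } catch (IllegalStateException expected) {
            // the release failed because the connection was gone
        }
        cluster.reconnect();
        mutex.acquire(); // the client now believes it holds the lock again
        return cluster.nodes.contains("/leader/lock");
    }

    public static void main(String[] args) {
        System.out.println("state kept on failure:    node exists = " + nodeExistsAfterReconnect(false));
        System.out.println("state cleared on failure: node exists = " + nodeExistsAfterReconnect(true));
    }
}
```

With the local state kept, the second acquire returns without creating a node, which matches the "leader without an ephemeral node" symptom reported in this thread. Clearing the state in a `finally` block is only one possible reaction and is not necessarily the fix that went into Curator; after a failed release the client still cannot know whether the node was removed, which is why making the instance unusable is also on the table.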

On 2/7/12 1:26 PM, "Jordan Zimmerman" <[EMAIL PROTECTED]> wrote:

>I really appreciate your help Hartmut. You have, indeed, found a bug. My
>test case didn't precisely replicate your situation. I updated the test so
>that it did (the lock node getting deleted after session expiration) and
>the same problem expressed. You also found the location of the bug making
>my job very easy ;)
>Thanks again - I'll push a fix and get a new build out soon.
>P.S. I've pasted this thread on Github for others' benefit:
>On 2/7/12 12:39 PM, "Hartmut Lang" <[EMAIL PROTECTED]> wrote:
>>Jordan, thanks for looking into this.
>>I cloned the code and had a look. For me your test case covers, that you
>>get the leadership again, after the RECONNECT happens. This is also the
>>case in my code.
>>But how does it check, that there is a related lock/ephemeral node in the
>>ZK-Cluster? Which is not the case for me.
>>I made some debugging:
>>If the connection is lost in InterProcessMutex.release() the releaseLocks
>>call will throw an exception, right?
>>So the lockData is not(!) set to null (line#130).
>>When InterProcessMutex.acquire() is then called after the RECONNECT, it
>>is considered as "re_entering".
>>So the lock is just granted, without redoing the lock in the ZK-cluster.
>>This seems not ok for me.
>>But i'm the newbie here.
>>Would be great if you can have a look.
>>On 7 February 2012 09:05, Jordan Zimmerman wrote:
>>> I just pushed a test that simulates the situation you describe and it
>>> works correctly. Can you please have a look at it and see what's
>>> different about your case?
>>> TestLeaderSelectorCluster.java
>>>    testLostRestart()
>>> ________________________________________
>>> From: Hartmut Lang [[EMAIL PROTECTED]]
>>> Sent: Monday, February 06, 2012 9:55 PM
>>> Subject: Re: curator leader reconnect
>>> Well i use the CLI-client to connect to the ZK-Cluster. And i see no
>>> entry.
>>> My setup:
>>> I have a cluster of three ZK-nodes.
>>> I have a client starting LeaderSelector, which is connected to one
>>> cluster-node.
>>> I see the ephemeral node.
>>> I stop the cluster-node the client is connected to. The client finally
>>> sees a LOST event. The ephemeral node is gone (using CLI).
>>> I start the cluster-node again. Client sees the RECONNECT and calls
>>> start(). And then takeLeadership() is called.
>>> But no ephemeral node in the cluster.
>>> /Hartmut
>>> On 6 February 2012 18:46, Jordan Zimmerman wrote:
>>> > How are you verifying that there is no ephemeral node?
>>> >
>>> > -Jordan
>>> >
>>> > On 2/6/12 9:28 AM, "Hartmut Lang" <[EMAIL PROTECTED]>
>>> >
>>> > >Hi Jordan,
>>> > >
>>> > >thanks for your infos.
>>> > >What i see in my LeaderSelector example is this:
>>> > >when i just call the start() method after RECONNECT, the takeLeadership()
>>> > >method is called again.
>>> > >But no ephemeral node does exist in the ZK-Cluster for my client. So this
>>> > >seems not to be right.
>>> > >What could i do wrong?
>>> > >
>>> > >/Hartmut
>>> > >On 6 February 2012 07:55, Jordan Zimmerman wrote:
>>> > >
>>> > >> No - don't call close. I'm afraid that it's a bit confusing. It was an
>>> > >> afterthought. Maybe I should add a restart() method or something.
>>> > >>
>>> > >> -JZ
>>> > >>
>>> > >> On 2/5/12 10:48 PM, "Hartmut Lang" <[EMAIL PROTECTED]>