Zookeeper >> mail # user >> Problem recovering from a bad reconfig (3.5)


zk questions 2013-11-09, 18:59
Alexander Shraer 2013-11-10, 02:32
Re: Problem recovering from a bad reconfig (3.5)
btw, please update to the latest trunk - there was one
reconfig-related patch committed a day ago

On Sat, Nov 9, 2013 at 6:32 PM, Alexander Shraer <[EMAIL PROTECTED]> wrote:
> Hi,
>
> I tried running your script, but it needed too many changes to
> work, so for lack of time I just ran your scenario manually in my
> setup, and steps 1-4 worked with no errors. This of course doesn't
> mean much, but fwiw, here are some general thoughts...
>
> If step 2 completed for sure (at least on the leader), that is, you
> can see the new config on one of the servers, the error you're seeing
> shouldn't be happening. So this may already be a bug, or some issue
> with the setup. The error should only happen if there is an
> outstanding reconfig on the leader, which was proposed but not yet
> committed.
>
> Even if step 2 hasn't really completed when step 3 starts and this
> error happens, it should be transient - if you just retry it should
> usually succeed (especially since you have only one entity
> orchestrating reconfigs). If it is stuck in a state where it
> continuously issues this error, and both servers 1 and 2 are up, then
> there's probably a bug. (There is actually a related JIRA
> https://issues.apache.org/jira/browse/ZOOKEEPER-1699
> but I really doubt that this is what you're seeing).
>
> In step 3, since server 3 successfully connects to the leader (the
> error message you mention comes from the leader, thrown in line 522 of
> PrepRequestProcessor.java), it's not important that its initial config
> includes only 2 and 3 in your scenario.
>
> I think that the risk of starting a new server with a partial
> view of the system (and not all servers in current config + the
> joining server) is that there's a chance that the servers it tries to
> contact are all down, in which case you'll need to start it again with
> a different server list. I guess this is what you're doing in step 5,
> but I didn't understand why you're doing this here - in your scenario
> 3 found the leader and encountered a transient error, no need to
> restart it, just try again.
>
> other things:
> - please keep in mind that patch 1691 may still need some work
> - don't include a version in the dynamic config file. the system
> writes out versions automatically, the users should never
> specify them.
>
> Alex
>
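
Editor's sketch of the last point above (leaving the version out of the dynamic config file): a minimal Python helper that drops any version line before the file is handed to a server. It assumes the version appears as a line starting with `version=`; the function name, hosts, and ports are hypothetical and not taken from the thread.

```python
def strip_version(dynamic_config_text):
    """Return the dynamic config text without any 'version=' line.

    Assumption: a user-specified version shows up as a 'version=...' line.
    The system is expected to write versions out on its own.
    """
    kept = [
        line
        for line in dynamic_config_text.splitlines()
        if not line.strip().startswith("version=")
    ]
    return "\n".join(kept) + "\n"


# Illustrative input only; hosts and ports are made up.
example = (
    "server.1=localhost:2888:3888:participant;2181\n"
    "version=100000000\n"
    "server.2=localhost:2889:3889:participant;2182\n"
)
cleaned = strip_version(example)
```

Only the server lines survive in `cleaned`, so the file a new server starts with never carries a hand-written version.
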
> On Sat, Nov 9, 2013 at 10:59 AM, zk questions <[EMAIL PROTECTED]> wrote:
>> Hi,
>>
>> I've been testing out the dynamic reconfig feature of 3.5 along with using
>> this patch (https://issues.apache.org/jira/browse/ZOOKEEPER-1691) and I'm
>> having an issue where my zk cluster won't allow me to perform further
>> reconfigs.
>> So here's what I'm doing:
>> 1) Start nodes 1 and 2
>> 2) Invoke reconfig on 1 to add 2; this succeeds
>> 3) Start node 3 with the initial configuration with the dynamic config set
>> to just 2 and 3, where 2 isn't a leader (manually verified)
>> 4) Invoke reconfig on 2 to add 3; this fails, with an error indicating that
>> another reconfig is in progress
>> 5) Then I restart 3 with the configuration containing just 1 and 3
>> 6) Then I try again to add 3 to the cluster by invoking reconfig on 1 to add
>> 3; and again I see an error indicating that another reconfig is in progress
>>
>> FWIW: I'm testing this scenario to simulate the situation where I'm
>> automating the reconfig process and the dynamic configuration for 3 ends up
>> containing a node that isn't the leader.
>>
>> I was wondering what I should do in this situation to recover from the
>> failure at step 3 so that we can fix the dynamic config and then attempt a
>> proper reconfig (steps 4 - 6)?
>>
>> I've also attached a tar containing a script to automatically reproduce the
>> steps and problem I'm seeing above.
>>
>> Thanks.
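
Editor's sketch of the retry advice given in the reply: since the "another reconfig is in progress" error is described as transient, an automated orchestrator could simply retry the same reconfig a few times instead of restarting server 3. This is a rough Python outline, not ZooKeeper's API: `ReconfigInProgress`, `retry_reconfig`, and the `issue_reconfig` callable are hypothetical names standing in for whatever client call and error your tooling uses.

```python
import time


class ReconfigInProgress(Exception):
    """Hypothetical stand-in for the 'another reconfig is in progress' error."""


def retry_reconfig(issue_reconfig, attempts=5, delay=1.0, sleep=time.sleep):
    """Call issue_reconfig() until it succeeds or the attempts run out.

    issue_reconfig is a caller-supplied callable (hypothetical) that sends one
    reconfig request and raises ReconfigInProgress on the transient error.
    """
    for attempt in range(1, attempts + 1):
        try:
            return issue_reconfig()
        except ReconfigInProgress:
            if attempt == attempts:
                # Still failing after every retry: per the reply, a persistent
                # error while the servers are up suggests a bug, so surface it.
                raise
            sleep(delay)
```

If the error keeps coming back after all attempts, the exception propagates to the caller, which matches the reply's point that a persistent failure is worth reporting rather than working around.
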
zk questions 2013-11-09, 20:11