Zookeeper >> mail # user >> missing data after restarting+expanding a ZK 3.4.0 cluster


Re: missing data after restarting+expanding a ZK 3.4.0 cluster
Thanks Camille, done:

https://issues.apache.org/jira/browse/ZOOKEEPER-1319

On 12/05/2011 06:22 PM, Camille Fournier wrote:
> Even if it's "unsupported" you've basically always found real bugs with ZK
> this way, so we might as well make a JIRA tracker and try to figure out
> this one.
> I dunno if I'll have time to look before the weekend, so anyone else that
> is interested should feel free to dig in.
>
> C
>
> On Mon, Dec 5, 2011 at 8:47 PM, Jeremy Stribling <[EMAIL PROTECTED]> wrote:
>
>    
>> I've been trying to update to ZK 3.4.0 and have had some issues where some
>> data become inaccessible after adding a node to a cluster.  My use case is
>> a bit strange (as explained before on this list) in that I try to grow the
>> cluster dynamically by having an external program automatically restart
>> Zookeeper servers in a controlled way whenever the list of participating ZK
>> servers needs to change.  I haven't made a JIRA for this yet, since I'm
>> guessing the official position is that ZK doesn't support this scenario
>> yet, but this used to work just fine in 3.3.3 (and before), so this
>> represents a regression.
>>
>> The scenario I see is this:
>>
>> 1) Start up a 1-server ZK cluster (the server has ZK ID 0).
>> 2) A client connects to the server, and makes a bunch of znodes, in
>> particular a znode called "/membership".
>> 3) Shut down the cluster.
>> 4) Bring up a 2-server ZK cluster, including the original server 0 with
>> its existing data, and a new server with ZK ID 1.
>> 5) Node 0 has the highest zxid and is elected leader.
>> 6) A client connecting to server 1 tries to "get /membership" and gets
>> back a -101 error code (no such znode).
>> 7) The same client then tries to "create /membership" and gets back a -110
>> error code (znode already exists).
>> 8) Clients connecting to server 0 can successfully "get /membership".
>>
>> I've attached a tarball with debug logs for both servers, annotating where
>> steps #1 and #4 happen.  You can see that the election involves a proposal
>> for zxid 110 from server 0, but immediately following the election server 1
>> has these lines:
>>
>> 2011-12-05 17:18:48,308 9299 [QuorumPeer[myid=1]/127.0.0.1:2901]
>> WARN org.apache.zookeeper.server.quorum.Learner  - Got zxid 0x100000001
>> expected 0x1
>> 2011-12-05 17:18:48,313 9304 [SyncThread:1] INFO
>> org.apache.zookeeper.server.persistence.FileTxnLog  - Creating new log
>> file: log.100000001
>>
>> Perhaps that's not relevant, but it struck me as odd.  At the end of
>> server 1's log you can see a repeated cycle of getData->create->getData as
>> the client tries to make sense of the inconsistent responses.
>>
>> The other piece of information is that if I try to use the on-disk
>> directories for either of the servers to start a new one-node ZK cluster,
>> all the data are accessible.
>>
>> Anyone have ideas?  I haven't tried writing a program outside of my
>> application to reproduce this, but I can do it very easily with some of my
>> app's tests if anyone needs more information.  I'm happy to turn this into a
>> JIRA if desired.  Thanks,
>>
>> Jeremy
>>
>>
>>      
>    
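A note for readers of the log excerpt quoted above: a ZooKeeper zxid is a 64-bit number whose high 32 bits hold the leader epoch and whose low 32 bits hold a per-epoch counter. The sketch below is only an illustration of that layout, not ZooKeeper code; the helper name split_zxid is made up for this example. It decodes the two values from the Learner WARN line.

```python
from typing import Tuple


def split_zxid(zxid: int) -> Tuple[int, int]:
    """Split a 64-bit zxid into (epoch, counter).

    Hypothetical helper for illustration; not part of any ZooKeeper API.
    """
    epoch = (zxid >> 32) & 0xFFFFFFFF   # high 32 bits: leader epoch
    counter = zxid & 0xFFFFFFFF         # low 32 bits: counter within the epoch
    return epoch, counter


# Values taken from the WARN line: "Got zxid 0x100000001 expected 0x1"
got = split_zxid(0x100000001)
expected = split_zxid(0x1)

print("got      epoch=%d counter=%d" % got)
print("expected epoch=%d counter=%d" % expected)
```

Read this way, the two values in the warning share the same counter and differ only in the epoch part. That observation alone does not establish the cause of the missing znodes; it just makes the log line easier to interpret.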