Raúl Gutiérrez Segalés
I'm hosting an intern this summer. One project I've been thinking
about is to decouple zab from zookeeper. There are many use cases
where you need quorum-based replication, but the hierarchical data
model doesn't work well. A smallish (~1GB?) replicated key-value store
with millions of entries is one such example. The goal of the project
is to decouple the consensus algorithm (zab) from the data model
(zookeeper) more cleanly so that the users can define their own data
models and use zab to replicate the data.
I have 2 questions:
1. Are there any caveats that I should be aware of? For example,
transactions need to be idempotent to allow fuzzy snapshotting.
2. Is this useful? Personally I've seen many use cases where this
would be very useful, but I'd like to hear what you guys think.
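To illustrate the idempotency caveat in question 1: a fuzzy snapshot is taken while transactions are still being applied, so on recovery the log may contain transactions that the snapshot already reflects. Replaying those must leave the state unchanged. Below is a minimal Java sketch of the difference. Txn, SetTxn, and IncrementTxn are hypothetical names invented for this illustration, not ZooKeeper classes.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical transaction type for illustration only.
interface Txn {
    void apply(Map<String, Long> state);
}

// Not idempotent: applying it twice gives a different result than applying it once.
final class IncrementTxn implements Txn {
    private final String key;

    IncrementTxn(String key) {
        this.key = key;
    }

    @Override
    public void apply(Map<String, Long> state) {
        state.merge(key, 1L, Long::sum);
    }
}

// Idempotent: the proposer resolves the final value up front,
// so replaying the transaction is harmless.
final class SetTxn implements Txn {
    private final String key;
    private final long value;

    SetTxn(String key, long value) {
        this.key = key;
        this.value = value;
    }

    @Override
    public void apply(Map<String, Long> state) {
        state.put(key, value);
    }
}

final class FuzzySnapshotDemo {
    public static void main(String[] args) {
        // Assume the fuzzy snapshot already reflects a logged update of "counter" to 1.
        Map<String, Long> fromSnapshot = new HashMap<>();
        fromSnapshot.put("counter", 1L);

        // Replaying the logged SetTxn keeps the state at 1.
        new SetTxn("counter", 1L).apply(fromSnapshot);
        System.out.println("after replaying SetTxn: " + fromSnapshot.get("counter"));

        // Replaying an IncrementTxn would double-count and leave 2.
        new IncrementTxn("counter").apply(fromSnapshot);
        System.out.println("after replaying IncrementTxn: " + fromSnapshot.get("counter"));
    }
}
```

The design point is that the log should record the outcome of an operation (the new value) and not the operation itself (add one), so replay order and repetition after a fuzzy snapshot cannot corrupt the state.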
I can see two reasons for decoupling Zab:
1- You'd like to be able to plug in new algorithms or at least make a clear separation of the replication protocol and the logic of the service.
2- You'd like to have an implementation of Zab that you could use for other things, like a kv store.
I think you're focusing more on 2. You can definitely use Zab for other things, and I'm all for it. It would probably be better to just implement the protocol from scratch rather than extract it from ZooKeeper. In fact, it might be worth having a look at ZK-30 (old one, huh?).
In the case of reimplementing it, it might be worth doing it outside ZooKeeper, as a separate project. It could be an incubated project.
Hope it helps!
On 31 May 2014, at 22:29, Michi Mutsuzaki <[EMAIL PROTECTED]> wrote:
On 31 May 2014 14:29, Michi Mutsuzaki <[EMAIL PROTECTED]> wrote:
I think this is super useful. As Flavio said, I think there are two
approaches: having ZAB as a library first or
carving out the ZAB bits and having a generic interface to plug in other
From the ZooKeeper project's PoV, I think that the latter would be awesome,
because we can clean up a lot of code as it happens.
From an intern project's PoV, it sounds like working on an independent ZAB
implementation (libzab?) from scratch is easier to target (and will have
no impedance; getting huge changes merged into ZooKeeper takes time...).
Thank you Flavio and Raul.
Thank you for pointing me to ZOOKEEPER-30. Yes, I was focused more on
2, but it's definitely a good idea to have a generic interface for
atomic broadcast so that you can plug in different algorithms. It
seems like the project can be broken into 3 pieces:
1. Define an interface for atomic broadcast. I'm not sure how things
like session tracker and dynamic reconfig fit into this.
2. Add a ZAB implementation of the interface.
3. Create a simple reference implementation of a service (maybe a
simple key-value store or a benchmark tool).
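As a rough illustration of how the three pieces could fit together, here is a minimal Java sketch. Every name in it (AtomicBroadcast, StateMachine, LoopbackBroadcast, KeyValueStore) is hypothetical and not taken from ZooKeeper or any existing library. LoopbackBroadcast is only a single-process stand-in for piece 2: it delivers transactions in order but does no replication, so it is not Zab.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Piece 1 (hypothetical): the callback a service implements to receive
// transactions in the agreed total order.
interface StateMachine {
    void deliver(long txnId, byte[] txn);
}

// Piece 1 (hypothetical): what a service calls to propose a transaction.
interface AtomicBroadcast {
    void broadcast(byte[] txn);
}

// Stand-in for piece 2: delivers locally in order, with no quorum and no
// persistence. A Zab implementation would sit behind the same interface.
final class LoopbackBroadcast implements AtomicBroadcast {
    private final StateMachine stateMachine;
    private long nextTxnId = 1;

    LoopbackBroadcast(StateMachine stateMachine) {
        this.stateMachine = stateMachine;
    }

    @Override
    public synchronized void broadcast(byte[] txn) {
        stateMachine.deliver(nextTxnId++, txn);
    }
}

// Piece 3: a toy key-value store. Transactions are assumed to be UTF-8
// strings of the form "key=value".
final class KeyValueStore implements StateMachine {
    private final Map<String, String> data = new HashMap<>();
    private long lastApplied = 0;

    @Override
    public void deliver(long txnId, byte[] txn) {
        String[] kv = new String(txn, StandardCharsets.UTF_8).split("=", 2);
        if (kv.length != 2) {
            throw new IllegalArgumentException("expected key=value");
        }
        data.put(kv[0], kv[1]);
        lastApplied = txnId;
    }

    String get(String key) {
        return data.get(key);
    }

    long lastApplied() {
        return lastApplied;
    }
}

final class BroadcastDemo {
    public static void main(String[] args) {
        KeyValueStore store = new KeyValueStore();
        AtomicBroadcast broadcast = new LoopbackBroadcast(store);
        broadcast.broadcast("a=1".getBytes(StandardCharsets.UTF_8));
        broadcast.broadcast("a=2".getBytes(StandardCharsets.UTF_8));
        System.out.println(store.get("a") + " at txn " + store.lastApplied());
    }
}
```

Session tracking and reconfiguration are deliberately left out of this sketch, since where they belong is exactly the open question in item 1.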
I agree with both of you that it's better to do this as a separate
project. Also, it might be better to do this as an incubator project
from the beginning. I think it makes it easier for people from
different organizations to collaborate. I'm willing to champion the
I'll open a JIRA once the intern is committed to the project.
On Sat, May 31, 2014 at 02:29:34PM -0700, Michi Mutsuzaki wrote:
So you want a replicated log which gives you the guarantees of zab. How
would this be very different from Bookkeeper?
The use case this project is going after is to durably replicate
in-memory state. I think this project can differentiate itself from
1. BookKeeper is pretty heavyweight, as you need to deploy ZooKeeper
and bookies. I think there are use cases where you don't need the
horizontal scalability BookKeeper provides, and you prefer to have a
light-weight library for replicating state. ZooKeeper is one such
2. Please correct me if I'm wrong, but BookKeeper is not designed for
maintaining multiple in-memory replicas. A ledger can't be opened for
reading if it's already open for writing, and you need to recover by
restoring from a snapshot and replaying log entries if the writer goes
3. ZOOKEEPER-30, which I wasn't initially aware of, is another
motivation. I think there is value in having a common interface for
consensus algorithms so that services can plug in different
implementations. This makes it easier to benchmark and test
correctness of various implementations.
On Sun, Jun 1, 2014 at 3:05 AM, Ivan Kelly <[EMAIL PROTECTED]> wrote:
I'm not sure it is worth turning this discussion into a bk vs. zk/zab debate. I think the space they target is different, although they both deal with replication. It does sound worth having a separate zab implementation, but it isn't clear that it is worth separating zab in the zookeeper code base.
There seem to be some misconceptions here, so here are some clarifications:
- Zab itself doesn't deal with snapshots, it essentially replicates a log. The use of snapshots is an optimization to speed up recovery, and sure, it fits well into the framework of the protocol.
- BookKeeper indeed relies on zk because it requires a component for configuration and metadata of ledgers. By relying on a separate configuration component, the pool of bookies can grow and shrink arbitrarily, and such changes do not affect write performance like with zk. The configuration component, however, needs the properties of a protocol like zab, so we still need something like zab.
- Calling BK heavyweight is a bit of a stretch. Bookies + zk makes only two components! These are not production numbers, but I don't see a deployment with fewer than 10 machines (5 for ZK + 5 bookies) being very interesting. If that's a significant fraction of your overall server footprint, then sure, it is heavy for you.
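The first clarification above (Zab replicates a log, and snapshots only speed up recovery) can be sketched in a few lines of Java. This is an illustration under simplifying assumptions, not ZooKeeper's recovery code: LogEntry and LogRecovery are hypothetical names, every entry is an idempotent "set key" operation, and the snapshot is taken at an exact transaction id instead of being fuzzy.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical log entry: a transaction id plus a "set key to value" operation.
final class LogEntry {
    final long zxid;
    final String key;
    final String value;

    LogEntry(long zxid, String key, String value) {
        this.zxid = zxid;
        this.key = key;
        this.value = value;
    }
}

final class LogRecovery {
    // Applies every entry with zxid > afterZxid on top of a copy of base.
    static Map<String, String> replay(Map<String, String> base,
                                      long afterZxid,
                                      List<LogEntry> log) {
        Map<String, String> state = new TreeMap<>(base);
        for (LogEntry entry : log) {
            if (entry.zxid > afterZxid) {
                state.put(entry.key, entry.value);
            }
        }
        return state;
    }

    public static void main(String[] args) {
        List<LogEntry> log = new ArrayList<>();
        log.add(new LogEntry(1, "a", "1"));
        log.add(new LogEntry(2, "b", "2"));
        log.add(new LogEntry(3, "a", "3"));

        // Recovery from the log alone: replay everything from the start.
        Map<String, String> fromLog = replay(new TreeMap<>(), 0, log);

        // Recovery from a snapshot taken at zxid 2, plus the log suffix.
        Map<String, String> snapshot = replay(new TreeMap<>(), 0, log.subList(0, 2));
        Map<String, String> fromSnapshot = replay(snapshot, 2, log);

        System.out.println("same state: " + fromLog.equals(fromSnapshot));
    }
}
```

Both paths reach the same state; the snapshot only shortens the portion of the log that has to be replayed.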
On 01 Jun 2014, at 19:22, Michi Mutsuzaki <[EMAIL PROTECTED]> wrote:
Thank you for the clarifications Flavio. I guess 'heavyweight' is a
relative term. A typical use case I deal with is replicating a small
amount of data (<1GB) among 3 ~ 5 servers, and having access to zab
would be very useful.
I didn't mean to suggest separating zab in the zookeeper code base. I
referred to ZOOKEEPER-30 to highlight the usefulness of having a
common interface for replication protocols.
On Sun, Jun 1, 2014 at 2:52 PM, Flavio Junqueira <[EMAIL PROTECTED]> wrote:
an interesting read if you haven't seen it. fig 1 is similar to Michi's
I think that reconfig should be the responsibility of the atomic broadcast
/ replicated log implementation (if supported by the specific
implementation). Client management and sessions seem like application
I'd also suggest checking out existing open source paxos libraries as an
On Sun, Jun 1, 2014 at 6:11 PM, Michi Mutsuzaki <[EMAIL PROTECTED]>
Thank you for the pointer Alex.
I agree that the reconfiguration is a responsibility of the atomic
broadcast. I feel that session management might need to rely on the
atomic broadcast exposing additional primitives. For example, right
now ZooKeeper forwards session information to the leader by
piggybacking it in the quorum ping packets.
Let me know if you know good open source libraries for reference. So
far I've looked at ZooKeeper and goraft.
On Sun, Jun 1, 2014 at 6:36 PM, Alexander Shraer <[EMAIL PROTECTED]> wrote:
Decoupling ZAB is a good idea and, like you all mentioned, it could be used for other things, like a key-value store.
I've come across one such case in HDFS, where they have solved the problem their own way. As far as I know, the approach taken in this design is based on the well-known ZAB and Paxos.
So I hope there is a space for such libraries in the real world.
I was thinking from the point of view that if you want to provide ZAB
as a library, then the library will have to provide an RPC mechanism
for talking to other members of the quorum, and a means to persist
updates to disk before responding, and _then_ provide a ZAB
implementation somewhere in between. This doesn't seem much lighter
I think it's a worthwhile thing to pursue, but I disagree that a
separate project is a better way of doing it. If this is an intern
project, expecting them to reimplement ZAB might be a bit of a large
ask (depending on the internship length and the intern
themselves). An investigation into splitting the user interface layer
of zookeeper and ZAB seems itself to be a nice chunk to work on, and
it has the advantage that even if the changes don't get merged into
trunk, there will be a clearer picture as to why they can't be
You can read from a ledger while it is being written to, but right now
it's polling. Twitter are working on some changes to make it more
notification-like to reduce latency between the primary writing and
the secondary reading.
I have a few reasons for suggesting a separate project:
- I don't see a reason for tying the releases of an independent
implementation of Zab to ZooKeeper
- The set of developers (and committers) interested in an independent
implementation of Zab might be different compared to ZooKeeper; it could
really be a separate community
- It really feels like parallel efforts along the lines of Curator and
BookKeeper, so I see it following similar steps
Regarding the effort of an intern, I guess it depends how far you want the
initial stretch to go. An initial implementation to contribute to Apache
followed by community activity might get it going.
Rakesh, thank you for the links!
I agree with Flavio about keeping this a separate project. Having said
that, at this point I'm not 100% sure whether the intern will implement
ZAB completely from scratch, or start from a fork of the ZooKeeper
code base. At this point I'm somewhat leaning towards using the
ZooKeeper code base as a starting point. As Ivan pointed out, it's
pretty ambitious to implement ZAB correctly in a short amount of time,
and it would be good to have something demonstrable at the end of the
On Mon, Jun 2, 2014 at 9:19 AM, FPJ <[EMAIL PROTECTED]> wrote:
It would be great to do a clean implementation of Zab. We have added a lot of crap for backward compatibility, and the reconfig stuff, although a great feature properly implemented, didn't improve the state of the code. Also, an implementation of the Zab protocol, perhaps putting snapshots aside for v0.1, shouldn't take more than just a few weeks.
If you do it openly, say on github, then I volunteer to help.
On 03 Jun 2014, at 19:16, Michi Mutsuzaki <[EMAIL PROTECTED]> wrote:
On 3 June 2014 12:44, Flavio Junqueira <[EMAIL PROTECTED]>
A clean-room implementation of ZAB could indeed be awesome for multiple
purposes. Reasoning around the current implementation is sometimes
challenging for those of us missing the historical context.
Would be more than happy to help with reviews and such as well.
Thanks Flavio and Raul. I feel much more confident with your support.
Also, it would be a good learning experience for the intern and me.
Let's do this from scratch. I'll set up a github repo.
On Tue, Jun 3, 2014 at 12:51 PM, Raúl Gutiérrez Segalés
<[EMAIL PROTECTED]> wrote:
The intern hasn't started yet, but here is the github repo in case
anybody is interested.
On Tue, Jun 3, 2014 at 3:20 PM, Michi Mutsuzaki <[EMAIL PROTECTED]> wrote:
Thanks for the github repo address. I was just about to write you to send
I will follow up with this as it is an interesting project. I read the
entire conversation and agree with some points.
On Wed, Jun 4, 2014 at 9:46 AM, Michi Mutsuzaki <[EMAIL PROTECTED]>
Yisheng has been working on this project for about 5 weeks of his
12-week internship. Here is the current status:
- First of all, let me thank Flavio and Hongchao for their help. I
don't think the project would be where it is right now without their
- We have a more or less functional implementation of zab in java. You
can check out the code here: https://github.com/ZK-1931/javazab
- There is a simple reference server. It's an HTTP-based key-value
store that uses javazab for replicating state:
- The implementation is missing 2 major features: dynamic
reconfiguration and snapshotting. Yisheng is about to start working on
It's fairly easy to run the reference server. It would be great if you
can play around with it and give us feedback.
On Tue, Jun 3, 2014 at 11:46 PM, Michi Mutsuzaki <[EMAIL PROTECTED]> wrote: