Paper Review: ZooKeeper: Wait-free Coordination for Internet-scale Systems

Paper Review:


ZooKeeper is a wait-free coordination used to construct lock, consensus and group membership for the clients. It uses hierarchical nodes with the same tree structure as UNIX file system to store the data and sequential number for recording. It guarantees FIFO execution for every single clients and it’s heavily optimized for faster reads.

Strong points:

  1. It’s just nice to have both synchronous and asynchronous versions of the APIs. Developers can choose whatever they like to fulfill their requirement on timing or consistency. Also we have a fast read and a slow read (sync + read), which makes thing flexible for developers and the applications.
  2. Locks are powerful primitives for distributed system, which, on the other hand, is hard to relax and fits into more applications. ZooKeeper has no lock but it’s easy to develop lock services upon ZooKeeper. Other different coordination models are also available in ZooKeeper.
  3. During the leader election, ZooKeeper clients don’t have to poll the locks and try to win the leader. Instead, it has a mechanism to keep clients in order and the previous leader failure only wakes the next client without interrupting others.

Weak points:

  1. Since ZooKeeper is an open source project aimed to serve variety of applications, it’s better to implement more options for the developers to choose from. For example, the watch mechanism is basically the same as Chubby with one-time trigger. If they could give some extra options, like when a client creates a znode, it could register for multiple triggers or notifications along with the updated data, that would possibly fit into more situations.
  2. Even with the guarantees on the ordering, the watch is an asynchronous according to ( This means the read could be stale not only because of the inconsistent local read from a non-leader server after a write, but also because of the late watch notification. There’s no good solution to this issue because any stronger consistency model could slow down the reads dramatically. However, since we already have sequence number for each znode, why don’t we bound the number with a timestamp or some version number so that the clients would at least have a better understanding of freshness of the data
  3. Looks like the simple locks without herd effect is done with a strict ordering of sequential znodes. A client will become a leader by being notified about the previous leader with the previous sequential number dies. Since the sequential number is increasing over the requests, does this mean the leader ordering is determined by the time they registered to the ZooKeeper? Sounds like this mechanism is not quite desirable in many cases. For example, if the developer wants the client with better hardware to become the leader, then it will be hard to implement because the znode sequential numbers don’t have anything to do with the hardware. (wait…. can I store the hardware scores into the data in each znode and maintain a ordered queue in the group znode? This way the developers can configure the leader election as they want.)
  4. There’s no mentioning on the issue of load balancing. Although all the ZooKeeper servers keep the exact same replicas, clients have access to any of the servers and can fire queries as they want. This could potentially lead to unbalanced request handling among all the servers and possibly some malicious attack as well. Again I couldn’t think of a good way to tackle this issue as a centralized load-balancing could not only slow down the client accesses but also add complexity to the system. Maybe we can implement the ZooKeeper servers in a way that if they detect a lot of TCP losses/timeouts during a time, it redirects some of its distanced clients to other servers


Paper Notes:

  1. ZooKeeper uses non-block primitives because with the blocking ones, like locks, slow and faulty clients can limit the overall performance of the system. The failure detection would increase the complexity. It’s a lock-free version of Chubby;
  2. The requests are handled with pipelined-fashion to support thousands of operation execution; the FIFO ordering of client operations enables asynchronous operations (multiple consecutive operations without reply);
  3. Zab for leader-based atomic broadcast protocol;
  4. Reads throughput is optimized by 2) servers handle reads locally 2) no strict ordering for reads and 3) client-side caching with a watch mechanism;
  5. Tree of directories and data (hierarchical name spaces) because it’s easier and familiar for users in a standard UNIX file system;
  6. Regular and ephemeral znodes
  7. Increasing sequential flag append to znode
  8. Watches to allow clients to receive timely notifications of changes without polling
    1. the clients send reads with watch flag set and the server will return the data with one-time promise to notify the clients about any change;
    2. watch only indicate the change without sending the actual data;
  9. Clients and ZooKeeper interact by using sessions with timeout. It ends when
    1. inactive clients over time
    2. close session by clients
    3. faulty clients detected by ZooKeeper
  10. Synchronous and asynchronous APIs
    1. no handles but use direct access to the path instead
    2. update with version number to enable version checking
  11. Two features:
    1. linearizability (multiple outstanding operations but ordered with guarantee)
    2. FIFO execution of queries
  12. For a leader re-electing scenario:
    1. lock service will prevent other processes using the old configurations but doesn’t offer a good solution when the new leader partially changed the configurations of all and then failed
    2. in ZooKeeper, the new leader is going to delete the ready znode, make configuration changes and then set the ready znode back on. The configuration change is send asynchronously. New configuration will not be used;
    3. ordering is the key preventing any client read anything before it’s ready;
    4. sync causes a server to apply all pending write requests before processing the read without having any extra write
  13. Write with atomic broadcast and read from any server for better availability
  14. Write-ahead log in disk
  15. Quorum-based algorithm with pipeline optimization
  16. Periodically snapshot for faster recovery. It’s inconsistent because of it’s a non-blocking depth first scan of the tree of each znode data



Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s