Paper Review:
Amazon Dynamo is a fully distributed key-value data store. Partitioning and load balancing are achieved with consistent hashing plus a carefully chosen load-assignment strategy, and read/write operations are carried out with quorum-like voting.
Strong points:
- A fully distributed system generally scales better and can also be more available: there is no master whose failure would hurt availability, which is the main focus of Dynamo.
- The key feature of an “always available write” is really attractive. The quorum-like technique used in Dynamo offers very high availability and leaves version conflicts to the application to resolve (at least your request always returns somehow).
- Dynamo exposes flexible, user-configurable knobs such as the N, R, W values and support for different storage engines.
Weak points:
- Complex distributed coordination/control. First a hash function assigns load to all the nodes; then the concept of virtual nodes is introduced to balance the load distribution; then the system has to worry about nodes joining and leaving, accidentally or not; and finally it turns out that different assignment strategies still yield different balancing behavior. All of this is just for load balancing. I guess it’s inevitable if you want a fully distributed system.
- A lot of computation and communication is required: hashing for node location, Merkle tree calculation (and recalculation whenever the assignment changes), failure detection, gossiping for nodes joining/leaving, and quorum-based voting. The system can become quite complex and needs real compute power and network bandwidth to support it.
- Rigid data storage model compared to Bigtable. Only key-value pairs are allowed in Dynamo, which can be a letdown in some cases. This is not a huge deal since different DB engines are supported, but working around the data model can still make things complex.
- Many replicas are needed to support the high availability. Although N is configurable, the number of replicas can grow further because of node failures. Until the stale copies are removed (which is also configurable, I think), the extra replicas put additional demands on both storage space and bandwidth (more communication and data transfer).
Paper Outline:
-
Introduction:
- Reliability and scalability are dependent on how application state is managed;
- Treats failure handling as the normal case;
- Dynamo has very high reliability requirements and needs tight control over the tradeoffs between availability, consistency, cost-effectiveness and performance;
- Simple interface to handle services that need primary-key access only;
- Uses some well known techniques to achieve scalability and availability:
- data is partitioned and replicated using consistent hashing;
- consistency is facilitated by object versioning;
- consistency among replicas during updates is maintained by a quorum-like technique and a decentralized replica synchronization protocol;
- gossip-based failure detection and membership protocol;
- completely decentralized with minimal need for manual administration;
- storage nodes can be added/removed without requiring any manual partitioning or redistribution;
-
Background:
-
System assumptions and requirements:
- query model:
- simple read/write operations;
- data and state are identified by unique keys;
- no operations span multiple data items;
- no need for relational schema;
- ACID properties:
- data stores that provide ACID guarantees tend to have poor availability;
- targets applications that operate with weaker consistency;
- provides no isolation guarantees;
- Efficiency:
- needs to function on a commodity hardware infrastructure;
- services must be able to configure Dynamo such that they consistently achieve their latency and throughput requirements;
- a non-hostile (safe) operation environment is assumed, so there are no security-related requirements;
-
Service level agreement:
- SLA is a formally negotiated contract where a client and a service agree on several system-related characteristics;
- SLAs are expressed and measured at the 99.9th percentile of the distribution, which provides a better overall experience;
- a common approach in the industry for forming a performance-oriented SLA is to use average, median and expected variance (a small numeric sketch follows);
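As an illustration of why the 99.9th percentile is a stricter target than an average, here is a minimal sketch with made-up latency numbers and helper names (not from the paper):

```python
# Sketch: average vs. 99.9th percentile latency over hypothetical samples.
def percentile(samples, p):
    """Return the p-th percentile (0-100) of a list of latency samples."""
    ordered = sorted(samples)
    idx = min(len(ordered) - 1, int(len(ordered) * p / 100.0))
    return ordered[idx]

# 1000 requests: most are fast, a small tail is slow.
latencies_ms = [10] * 990 + [500] * 10

average = sum(latencies_ms) / len(latencies_ms)   # ~14.9 ms, looks healthy
p999 = percentile(latencies_ms, 99.9)             # 500 ms, exposes the slow tail

print(f"average={average:.1f} ms, p99.9={p999} ms")
```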
-
Design considerations:
- replication algorithms that provide strong consistency force a trade-off with availability;
- strong consistency and high availability cannot be achieved together;
- availability can be increased by using optimistic replication techniques;
- replicas are propagated in the background;
- might lead to conflicting changes and need conflict resolution;
- Dynamo is an always writable data store;
- many other data stores resolve conflicts during writes and may reject a write operation if they cannot reach all (or most) of the replicas;
- applications are more suitable for conflict resolution:
- they know what kind of data can be chosen;
- the data store has limited information, so it falls back to a “last write wins” policy if the application does not handle the conflict;
- other key principles:
- incremental scalability;
- symmetry;
- every node in Dynamo should have the same responsibility;
- decentralization;
- heterogeneity;
-
-
Related work:
-
Peer to peer system:
- first generation P2P system like Freenet and Gnutella:
- mainly used as file sharing systems;
- searching requires flooding the network;
- next generation P2P structured networks like Pastry and Chord:
- global consistency protocol to support routing and searching;
- many storage systems built on these routing overlays:
- with conflict resolution to handle the concurrent updates;
-
Distributed file systems and databases:
- typically support hierarchical namespaces;
- guarantee eventual consistency;
- traditional replicated relational database systems focus on consistency to provide a conventional programming model, at the cost of scalability and availability;
-
Discussion:
- Dynamo is different:
- target applications that require “always writable” availability;
- built for a single administrative domain where all nodes are trusted;
- no hierarchical namespaces and complex schema;
- built for latency sensitive applications;
-
-
System architecture:
-
Summary of the techniques used in Dynamo and their advantages:
- partitioning: consistent hashing (incremental scalability);
- high availability for writes: vector clocks with reconciliation during reads (version size decoupled from update rate);
- handling temporary failures: sloppy quorum and hinted handoff (availability and durability when some replicas are down);
- recovering from permanent failures: anti-entropy using Merkle trees (synchronizes divergent replicas in the background);
- membership and failure detection: gossip-based protocol (preserves symmetry, avoids a centralized registry);
-
System interface:
- get(key): returns a single object, or a list of objects with conflicting versions, along with a context;
- put(key, context, object): writes the object with its associated key to disk; the context holds metadata about the object, such as its version (a minimal interface sketch follows);
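A rough, hypothetical illustration of this two-operation interface (the class and field names are invented for the sketch):

```python
from dataclasses import dataclass, field

@dataclass
class VersionedObject:
    value: bytes
    context: dict = field(default_factory=dict)   # opaque metadata, e.g. the version/vector clock

class DynamoLikeStore:
    """Sketch of the get/put interface described above; names are illustrative."""
    def __init__(self):
        self._data: dict[str, list[VersionedObject]] = {}

    def get(self, key: str) -> list[VersionedObject]:
        # May return several causally unrelated versions, each with its context.
        return self._data.get(key, [])

    def put(self, key: str, context: dict, value: bytes) -> None:
        # The caller passes back the context from an earlier get so the store
        # knows which version(s) this write supersedes.
        self._data[key] = [VersionedObject(value, context)]
```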
-
Partitioning algorithm:
- the partitioning scheme relies on consistent hashing;
- to distribute load across multiple storage hosts;
- the output range of a hash function is treated as a circle and each node in the system is assigned a random position on the ring;
- each data item identified by a key is assigned to a node by hashing the data item’s key to yield its position on the ring and then walking the ring clockwise to find the first node with a position larger than the item’s position;
- thus, each node becomes responsible for the region in the ring between it and its predecessor node on the ring;
- the principal advantage of consistent hashing is that departure or arrival of a node only affects its immediate neighbors;
- some challenges and solutions in the algorithm:
- random position assignment leads to non-uniform distribution;
- the basic algorithm is oblivious to the heterogeneity of nodes;
- Dynamo uses a variant of consistent hashing that maps multiple positions in the ring to each node;
- each node is responsible for multiple positions (virtual nodes) in the ring; the advantages of virtual nodes are:
- if a node fails, the load is dispersed across the ring;
- a new node accepts a roughly equivalent amount of load from old nodes;
- the number of virtual nodes assigned to a machine is decided based on its capacity (a small code sketch follows this list);
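A minimal sketch of consistent hashing with virtual nodes, assuming MD5 as the ring hash and invented class/parameter names:

```python
import bisect
import hashlib

def ring_hash(value: str) -> int:
    # MD5 output interpreted as a position on the ring.
    return int(hashlib.md5(value.encode()).hexdigest(), 16)

class ConsistentHashRing:
    """Sketch: each physical node owns several virtual positions (tokens) on the ring."""
    def __init__(self, vnodes_per_node: int = 8):
        self.vnodes_per_node = vnodes_per_node
        self._positions: list[int] = []    # sorted virtual-node positions
        self._owner: dict[int, str] = {}   # position -> physical node name

    def add_node(self, node: str, weight: int = 1) -> None:
        # A more capable machine can be given more virtual nodes (heterogeneity).
        for i in range(self.vnodes_per_node * weight):
            pos = ring_hash(f"{node}#{i}")
            bisect.insort(self._positions, pos)
            self._owner[pos] = node

    def node_for(self, key: str) -> str:
        # Walk clockwise to the first virtual node past the key's position.
        pos = ring_hash(key)
        idx = bisect.bisect_right(self._positions, pos) % len(self._positions)
        return self._owner[self._positions[idx]]
```

With this layout, removing a node only moves the keys that mapped to its virtual positions onto the next positions clockwise, which is the neighbor-only effect noted above.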
-
Replication:
- data is replicated on N hosts where N is a parameter;
- each key is assigned to a coordinator node:
- coordinator is responsible for the data replicas within a range;
- it replicates these keys at the N-1 clockwise successor nodes;
- the list of nodes that is responsible for storing a particular key is called a preference list;
- the list contains more than N nodes to account for failure;
- the list skips virtual node positions in the ring to ensure it only contains distinct physical nodes (sketched below);
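Continuing the ring sketch above, a preference list can be built by walking clockwise and skipping virtual nodes whose physical host is already in the list (the method name is invented):

```python
import bisect

class ReplicatedRing(ConsistentHashRing):          # extends the earlier ring sketch
    def preference_list(self, key: str, n: int) -> list[str]:
        """Return the first n distinct physical nodes clockwise from the key's position."""
        start = bisect.bisect_right(self._positions, ring_hash(key))
        nodes: list[str] = []
        for step in range(len(self._positions)):
            pos = self._positions[(start + step) % len(self._positions)]
            owner = self._owner[pos]
            if owner not in nodes:                 # skip virtual nodes of hosts already chosen
                nodes.append(owner)
            if len(nodes) == n:
                break
        return nodes
```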
-
Data versioning:
- updates are propagated to all replicas asynchronously;
- in case of failures, write requests may be handled by nodes that are not among the top N nodes in the list, which causes the size of the vector clock to grow;
- clock truncation scheme:
- each (node, counter) pair has a timestamp;
- when the number of pairs reaches a threshold, the oldest pair is removed from the clock;
- this scheme can be ineffective because descendant relationships can no longer be derived accurately (see the sketch below);
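A minimal sketch of a vector clock with the timestamp-based truncation described above (the threshold value and names are illustrative):

```python
import time

TRUNCATION_THRESHOLD = 10   # illustrative; Dynamo uses some configured threshold

class VectorClock:
    """Sketch: (node -> counter) pairs, each tagged with a last-updated timestamp."""
    def __init__(self):
        self.entries: dict[str, tuple[int, float]] = {}   # node -> (counter, timestamp)

    def increment(self, node: str) -> None:
        counter, _ = self.entries.get(node, (0, 0.0))
        self.entries[node] = (counter + 1, time.time())
        # Truncation: drop the oldest pair once the clock grows past the threshold.
        if len(self.entries) > TRUNCATION_THRESHOLD:
            oldest = min(self.entries, key=lambda n: self.entries[n][1])
            del self.entries[oldest]               # may lose causality information

    def descends_from(self, other: "VectorClock") -> bool:
        """True if self is a descendant of other (every counter in other is covered)."""
        return all(
            self.entries.get(node, (0, 0.0))[0] >= counter
            for node, (counter, _) in other.entries.items()
        )
```

Because truncation deletes entries, two clocks that were actually related may no longer compare as ancestor/descendant, which is the inaccuracy noted in the last bullet.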
-
Execution of get and put operations:
- operations are invoked using Amazon’s request processing framework;
- two strategies for a client to select a node:
- route its request through a generic load balancer;
- use a partition-aware client library that routes the request to the appropriate coordinator nodes with lower latency;
- a node handling a read/write operation is known as the coordinator;
- the coordinator is usually the first among the top N nodes in the preference list;
- Read/write operations involve the first N healthy nodes;
- Dynamo uses a consistency protocol similar to quorum systems:
- R is the minimum number of nodes that must participate in a successful read operation;
- W is the minimum number of nodes that must participate in a successful write operation;
- R+W > N yields a quorum-like system;
- during a write, the coordinator writes the data locally, sends it to the N highest-ranked reachable nodes, and considers the operation successful if at least W-1 of them respond;
- during a read, the coordinator requests all existing versions of the data from the N highest-ranked reachable nodes and waits for R responses;
- it returns all the versions it deems to be causally unrelated (a small coordination sketch follows);
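A minimal sketch of the quorum-style coordination, under the assumption that each replica object exposes hypothetical read()/write() calls and that the list is already ordered by rank and reachability:

```python
def coordinate_write(replicas, key, value, w):
    """Sketch: forward the write to the ranked replicas; succeed once w acks arrive."""
    acks = 0
    for node in replicas:                 # the N highest-ranked reachable nodes
        try:
            node.write(key, value)        # hypothetical per-node API
            acks += 1
        except IOError:
            continue                      # a failed node just costs one ack
        if acks >= w:
            return True                   # write quorum reached
    return False

def coordinate_read(replicas, key, r):
    """Sketch: gather r responses and hand back all versions collected."""
    versions = []
    for node in replicas:
        try:
            versions.append(node.read(key))   # hypothetical per-node API
        except IOError:
            continue
        if len(versions) >= r:
            break
    # In Dynamo, versions subsumed by others are dropped using vector clocks and
    # only causally unrelated ones are returned; here everything goes to the caller.
    return versions if len(versions) >= r else None
```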
-
Handling failures: hinted handoff:
- sloppy quorum is used to increase the availability:
- all read/write operations are performed on the first N healthy nodes, which may not always be the first N nodes in the ring (a hinted-handoff sketch follows);
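A small sketch of hinted handoff with invented node-API names: a replica destined for a failed owner is stored on the next healthy node together with a hint naming the intended owner, and is handed back once that owner recovers:

```python
def write_with_hints(preference_list, key, value, n):
    """Sketch: replicas go to the first n healthy nodes; a stand-in for a failed
    owner keeps the replica in a separate 'hinted' store tagged with that owner."""
    intended = list(preference_list[:n])          # the nodes that should hold the key
    healthy = [node for node in preference_list if node.is_healthy()][:n]
    for node in healthy:
        if node in intended:
            node.store(key, value)                # normal replica
            intended.remove(node)
        else:
            owner = intended.pop(0)               # a failed owner this node covers for
            node.store_hinted(key, value, hint=owner.name)

def handoff_round(node, lookup):
    """Sketch: periodically try to deliver hinted replicas to their intended owners."""
    for key, value, owner_name in node.hinted_entries():
        owner = lookup(owner_name)
        if owner.is_healthy():
            owner.store(key, value)
            node.delete_hinted(key)
```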
-
Handling permanent failures: replica synchronization:
- Dynamo uses Merkle trees to detect inconsistencies between replicas and to minimize the data transferred for synchronization (a small sketch follows);
- the disadvantage is that many key ranges change when a node joins or leaves, so the trees need to be recalculated;
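A minimal sketch of using a hash (Merkle) tree to find which keys in a range differ between two replicas, assuming SHA-1 and made-up names; identical subtree hashes are skipped, so only the differing leaves need their keys exchanged:

```python
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha1(data).digest()

class MerkleTree:
    """Sketch: a binary hash tree built over the leaf hashes of one key range."""
    def __init__(self, leaf_hashes: list[bytes]):
        self.levels = [leaf_hashes]                       # level 0 = leaves
        while len(self.levels[-1]) > 1:
            prev = self.levels[-1]
            self.levels.append([
                h(prev[i] + (prev[i + 1] if i + 1 < len(prev) else b""))
                for i in range(0, len(prev), 2)
            ])

def out_of_sync_leaves(a, b, level=None, idx=0):
    """Sketch: descend from the roots and return the leaf indices whose hashes differ
    (both trees are assumed to cover the same key range with the same shape)."""
    if level is None:
        level = len(a.levels) - 1
    if a.levels[level][idx] == b.levels[level][idx]:
        return []                                          # identical subtree: nothing to transfer
    if level == 0:
        return [idx]                                       # a leaf whose key/value differs
    diffs = []
    for child in (2 * idx, 2 * idx + 1):
        if child < len(a.levels[level - 1]):
            diffs += out_of_sync_leaves(a, b, level - 1, child)
    return diffs
```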
-
Membership and failure detection:
- ring membership:
- a node outage rarely signifies a permanent departure and therefore should not result in rebalancing or repair;
- an administrator explicitly adds/removes a node, and a permanent history of membership changes is kept by the node;
- a gossip-based protocol guarantees an eventually consistent view of membership;
- external discovery:
- seeds are nodes that are discovered via an external mechanism and are known to all nodes;
- failure detection:
- decentralized failure detection uses a simple gossip-style protocol that enables each node in the system to learn about the arrival/departure of other nodes (a gossip sketch follows);
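A small sketch of gossip-style reconciliation of the membership history, with invented message/field names; each node periodically exchanges its view with one random peer until all views converge:

```python
import random

class MembershipView:
    """Sketch: each node remembers, per known node, the latest (version, status) it has seen."""
    def __init__(self, name: str):
        self.name = name
        self.changes: dict[str, tuple[int, str]] = {}   # node -> (version, "joined"/"left")

    def record(self, node: str, status: str) -> None:
        version = self.changes.get(node, (0, ""))[0] + 1
        self.changes[node] = (version, status)

    def merge(self, other: "MembershipView") -> None:
        # Keep the higher-versioned entry for every node either side has heard about.
        for node, (version, status) in other.changes.items():
            if version > self.changes.get(node, (0, ""))[0]:
                self.changes[node] = (version, status)

def gossip_round(views: list[MembershipView]) -> None:
    """Sketch of one round: every node reconciles its view with one random peer."""
    for view in views:
        peer = random.choice([v for v in views if v is not view])
        peer.merge(view)
        view.merge(peer)
```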
-
Adding/removing storage nodes:
- a new node is assigned a number of tokens that are randomly scattered on the ring;
- confirmation upon adding/removing nodes to:
- distribute the load uniformly;
- meet the latency requirements;
- ensure fast bootstrapping;
-
-
Implementation:
- Each storage node has three main software components:
- request coordination;
- membership and failure detection;
- local persistence engine;
- allows for different storage engines to be plugged in;
- The request coordination component is built on top of an event-driven messaging substrate where the message processing pipeline is split into multiple states;
- to keep the coordination load evenly distributed:
- any node among the top N can coordinate the writes;
- the write coordinator is usually the node that replied fastest to the previous read;
- this increases the chances of getting “read-your-writes” consistency;
-
Experiences & lessons learned:
- Ensuring uniform load distribution:
- strategy 1: T random tokens per node and partition by token value;
- strategy 2: T random tokens per node and equal-sized partitions;
- strategy 3: Q/S tokens per node and Q equal-sized partitions (S = number of nodes; a small sketch follows);
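A minimal sketch of the third strategy, with invented names and a simple round-robin placement (not the paper's exact placement logic): the hash space is cut into Q equal partitions and the S nodes take roughly Q/S partitions each:

```python
import hashlib

Q = 128                       # illustrative number of equal-sized partitions
RING_SIZE = 2 ** 128          # MD5 output space

def partition_of(key: str) -> int:
    """Map a key to one of the Q fixed, equal-sized partitions of the hash space."""
    position = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return position * Q // RING_SIZE

def assign_partitions(nodes: list[str]) -> dict[int, str]:
    """Sketch: deal the Q partitions out round-robin so each of the S nodes
    gets roughly Q/S of them; partition boundaries never move when nodes change."""
    return {p: nodes[p % len(nodes)] for p in range(Q)}
```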