Paper Review: Dynamo: Amazon’s Highly Available Key-value Store

Paper Review:

Amazon Dynamo is a fully distributed key-value data store. Partitioning and load balancing are achieved with consistent hashing combined with a carefully chosen load-assignment strategy, and read/write operations are coordinated with quorum-like voting.

Strong points:

  1. A fully distributed system generally scales better and can be more available: there is no master whose failure would hurt availability, and availability is the main focus in Dynamo.
  2. The key feature, “always available writes”, is really attractive. The quorum-like technique used in Dynamo offers excellent availability and leaves version conflicts to the application to resolve (well, at least it returns your request somehow).
  3. Dynamo exposes flexible, user-configurable knobs, such as the N, R, and W values and support for different storage engines.

Weak points:

  1. Complex distributed coordination/control. First there is a hash function to assign load to all the nodes; then the concept of virtual nodes is introduced to balance the load distribution; then the system has to worry about nodes joining and leaving, accidentally or not; and then it turns out there are still different assignment strategies with different balancing performance. And all of this is just for load balancing. I guess it’s inevitable if you want a fully distributed system.
  2. A lot of computation and communication is required: hashing for node location, Merkle tree calculation (and recalculation whenever the assignment changes), failure detection, gossiping for nodes joining/leaving, and quorum-based voting. The system can get really complex and needs real compute power and network bandwidth to support it.
  3. A rigid data model compared to Bigtable. Only key-value pairs are allowed in Dynamo, which could be a letdown in some cases. This is not a big deal since different DB engines are supported, but working around the model can still make things complex.
  4. Many replicas are needed to support the high availability. Although N is configurable, the number of replicas can grow further because of node failures. Until the stale copies are removed (the timing of which, I think, is also configurable), the extra replicas put more demand on both storage space and bandwidth (more communication and data transfer).


Paper Outline:

  1. Introduction:

    • Reliability and scalability are dependent on application state management;
    • Treats failure handling as the normal case;
    • Dynamo has very high reliability requirements and needs tight control over the tradeoffs between availability, consistency, cost-effectiveness and performance;
    • Simple interface to handle services that need primary-key access only;
    • Uses some well known techniques to achieve scalability and availability:
      • data partitioning and replication using consistent hashing;
      • consistency is facilitated by object versioning;
      • consistency among replicas during updates is maintained by a quorum-like technique and a decentralized replica synchronization protocol;
      • gossip-based failure detection and membership protocol;
      • completely decentralized with minimal need for manual administration;
      • storage nodes can be added/removed without any extra work;
  2. Background:

    • System assumptions and requirements:

      • query model:
        • simple read/write operations;
        • data and state are identified by unique keys;
        • no operations span multiple data items;
        • no need for relational schema;
      • ACID properties:
        • data stores that provide ACID guarantees tend to have poor availability;
        • targets applications that operate with weaker consistency;
        • provides no isolation guarantees;
      • Efficiency:
        • needs to function on a commodity hardware infrastructure;
        • services must be able to configure Dynamo such that they consistently achieve their latency and throughput requirements;
      • a non-hostile environment is assumed, with no security requirements such as authentication and authorization;
    • Service level agreement:

      • SLA is a formally negotiated contract where a client and a service agree on several system-related characteristics;
      • SLAs here are expressed and measured at the 99.9th percentile of the distribution, which provides a better overall experience;
        • a common approach in the industry for forming a performance-oriented SLA is to use the average, median and expected variance, but those statistics hide the tail latency that the 99.9th-percentile target explicitly bounds (a small measurement sketch follows below);
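
To make the 99.9th-percentile target concrete, here is a minimal sketch of measuring such an SLA; the synthetic latency samples and the nearest-rank percentile are my own illustration, not the paper’s methodology:

```python
import random

def percentile(samples, p):
    """Nearest-rank p-th percentile (p in 0..100) of a list of samples."""
    ordered = sorted(samples)
    rank = max(1, round(p / 100 * len(ordered)))
    return ordered[rank - 1]

# Hypothetical latency samples with a long tail (milliseconds).
random.seed(42)
latencies = [random.expovariate(1 / 20) for _ in range(100_000)]

print(f"average: {sum(latencies) / len(latencies):7.1f} ms")  # hides the tail
print(f"p99.9:   {percentile(latencies, 99.9):7.1f} ms")      # what the SLA bounds
```
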
    • Design considerations:

      • replication algorithms that provide strong consistency force a tradeoff against availability;
        • strong consistency and high availability cannot be achieved simultaneously;
      • availability can be increased by using optimistic replication techniques;
        • changes are propagated to replicas in the background;
        • might lead to conflicting changes and need conflict resolution;
      • Dynamo is an always writable data store;
        • some other data stores resolve conflicts during writes and may reject a write if they cannot reach all (or a majority of) the replicas;
      • applications are more suitable for conflict resolution:
        • they know the data schema and can decide on the resolution method that is best for the client’s experience;
        • the data store has limited information, so it falls back to a “last write wins” policy if the application doesn’t take care of the conflict;
      • other key principles:
        • incremental scalability;
        • symmetry;
          • every node in Dynamo should have the same responsibility;
        • decentralization;
        • heterogeneity;
  3. Related work:

    • Peer-to-peer systems:

      • first generation P2P system like Freenet and Gnutella:
        • mainly used as file sharing systems;
        • searching requires flooding queries through the network;
      • next generation P2P structured networks like Pastry and Chord:
        • global consistency protocol to support routing and searching;
      • many storage systems built on these routing overlays:
        • with conflict resolution to handle the concurrent updates;
    • Distributed file systems and databases:

      • typically support hierarchical namespaces;
      • guarantee eventual consistency;
      • traditional replicated relational database systems focus on consistency and provide a conventional programming model, at the cost of scalability and availability;
    • Discussion:

      • Dynamo is different:
        • target applications that require “always writable” availability;
        • built for a single administrative domain where all nodes are trusted;
        • no hierarchical namespaces and complex schema;
        • built for latency sensitive applications;
  4. System architecture:

    • Summary of the techniques used in Dynamo and their advantages (Table 1 in the paper);

    • System interface:

      • get(key): returns a single object, or a list of objects with conflicting versions, along with a context;
      • put(key, context, object): writes the object under the associated key to disk; the context holds metadata about the object, such as its version;
    • Partitioning algorithm:

      • the partitioning scheme relies on consistent hashing;
        • to distribute load across multiple storage hosts;
        • the output range of a hash function is treated as a circle and each node in the system is assigned a random position on the ring;
        • each data item identified by a key is assigned to a node by hashing the key to yield its position on the ring and then walking the ring clockwise to find the first node with a position larger than the item’s position;
        • thus, each node becomes responsible for the region in the ring between it and its predecessor node on the ring;
      • the principal advantage of consistent hashing is that the departure or arrival of a node only affects its immediate neighbors;
      • some challenges and solutions in the algorithm:
        • random position assignment leads to non-uniform data and load distribution;
        • the basic algorithm is oblivious to the heterogeneity of node capacities;
        • Dynamo therefore uses a variant of consistent hashing that maps each node to multiple positions in the ring;
        • each node is responsible for multiple positions (virtual nodes) in the ring; the advantages of virtual nodes are:
          • if a node fails, the load is dispersed across the ring;
          • a new node accepts a roughly equivalent amount of load from old nodes;
          • the number of virtual nodes a machine is responsible for can be decided based on its capacity (a minimal ring sketch follows this list);
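
As a rough illustration, here is a minimal sketch of a consistent-hash ring with virtual nodes, assuming MD5 as the position hash (as in the paper); the Ring class, token naming scheme, and node labels are my own, not the paper’s:

```python
import bisect
import hashlib

def ring_hash(key: str) -> int:
    """Map a key to a position on the ring (MD5 gives a 128-bit space)."""
    return int.from_bytes(hashlib.md5(key.encode()).digest(), "big")

class Ring:
    def __init__(self, nodes, tokens_per_node=8):
        # Each physical node owns several virtual positions ("tokens");
        # a higher-capacity machine could simply be given more tokens.
        self.positions = sorted(
            (ring_hash(f"{node}#{i}"), node)
            for node in nodes
            for i in range(tokens_per_node)
        )
        self._keys = [pos for pos, _ in self.positions]

    def owner(self, key: str) -> str:
        """Walk clockwise from the key's position to the first virtual node."""
        idx = bisect.bisect_right(self._keys, ring_hash(key))
        return self.positions[idx % len(self.positions)][1]

ring = Ring(["node-A", "node-B", "node-C"])
print(ring.owner("shopping-cart:12345"))
```
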
    • Replication:

      • data is replicated on N hosts where N is a parameter;
      • each key is assigned to a coordinator node:
        • coordinator is responsible for the data replicas within a range;
        • it replicates these keys at the N-1 clockwise successor nodes;
      • the list of nodes that is responsible for storing a particular key is called a preference list;
        • the list contains more than N nodes to account for failure;
        • virtual node positions in the ring are skipped so that the list only contains distinct physical nodes (sketched below);
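
A minimal sketch of building such a preference list, under the assumption that the ring is a sorted list of (position, physical node) entries; the tiny example ring is hypothetical:

```python
import bisect

def preference_list(ring, key_pos, n):
    """Walk clockwise from key_pos, skipping virtual nodes whose physical
    owner is already in the list, until n distinct physical nodes are found."""
    keys = [pos for pos, _ in ring]
    idx = bisect.bisect_right(keys, key_pos)
    nodes, seen = [], set()
    for step in range(len(ring)):
        node = ring[(idx + step) % len(ring)][1]
        if node not in seen:              # skip repeated physical hosts
            seen.add(node)
            nodes.append(node)
        if len(nodes) == n:
            break
    return nodes

# Hypothetical ring with two virtual nodes per physical host:
ring = sorted([(5, "A"), (30, "B"), (47, "A"), (60, "C"), (82, "B"), (95, "C")])
print(preference_list(ring, key_pos=33, n=3))  # ['A', 'C', 'B']
```
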
    • Data versioning:

      • updates are propagated to all replicas asynchronously;
      • in case of failures, write requests may be handled by nodes that are not among the top N in the list, which causes the size of the vector clock to grow;
      • clock truncation scheme:
        • each (node, counter) pair has a timestamp;
        • when the number of pairs reaches a threshold, the oldest pair is removed from the clock;
        • this scheme can be ineffective because descendant relationships can no longer be derived accurately once pairs are dropped (a sketch follows this list);
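
Here is a minimal sketch of the truncation scheme; the dict-based clock representation and use of wall-clock time are my assumptions, and the threshold of 10 is the example value the paper mentions:

```python
import time

TRUNCATION_THRESHOLD = 10  # example threshold mentioned in the paper

def advance(clock, node):
    """clock maps node -> (counter, last_update_time). Bump this node's
    counter and stamp the entry with the local wall-clock time."""
    counter, _ = clock.get(node, (0, 0.0))
    clock[node] = (counter + 1, time.time())
    if len(clock) > TRUNCATION_THRESHOLD:
        # Drop the pair updated longest ago; after this, descendant
        # relationships involving the dropped pair cannot be derived
        # accurately, which is the scheme's known weakness.
        oldest = min(clock, key=lambda n: clock[n][1])
        del clock[oldest]
    return clock

def descends(a, b):
    """True if clock a is a causal descendant of clock b."""
    return all(a.get(n, (0, 0.0))[0] >= c for n, (c, _) in b.items())
```
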
    • Execution of get and put operations:

      • operations are invoked using Amazon’s request-processing framework;
      • two strategies for a client to select a node:
        • route its request through a generic load balancer;
        • use a partition-aware client library that routes the request to the appropriate coordinator nodes with lower latency;
      • a node handling a read/write operation is known as the coordinator;
        • coordinator is usually the first among the top N nodes;
      • Read/write operations involve the first N healthy nodes;
      • Dynamo uses a consistency protocol similar to quorum systems:
        • R is the minimum number of nodes that must participate in a successful read operation;
        • W is the same as R except it’s for write operation;
        • R+W > N yields a quorum-like system;
      • during a write, the coordinator writes the data locally, sends it to the N highest-ranked reachable nodes, and considers the operation successful if at least W-1 of them respond (the coordinator’s own local write supplies the W-th acknowledgement);
      • during a read, the coordinator requests all existing versions of the data from the N highest-ranked reachable nodes and waits for R responses (both paths are sketched below);
        • it returns all the versions it deems to be causally unrelated;
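
A minimal sketch of the quorum rule, with replicas simulated as in-memory stores; the Replica class and coordinator functions are my own names, not the paper’s API:

```python
N, R, W = 3, 2, 2   # R + W > N gives the quorum-like overlap

class Replica:
    def __init__(self):
        self.store = {}
    def put(self, key, versioned_value):
        self.store[key] = versioned_value
        return True                        # ack
    def get(self, key):
        return self.store.get(key)

def coordinate_put(replicas, key, versioned_value):
    """Send the write to the N highest-ranked reachable replicas and
    succeed once at least W acks arrive (the local write counts as one)."""
    acks = sum(1 for r in replicas[:N] if r.put(key, versioned_value))
    return acks >= W

def coordinate_get(replicas, key):
    """Ask the N highest-ranked replicas and wait for R responses; a real
    coordinator would then return only causally unrelated versions."""
    return [r.get(key) for r in replicas[:N]][:R]

replicas = [Replica() for _ in range(N)]
assert coordinate_put(replicas, "cart:1", ("v1", {"node-A": 1}))
print(coordinate_get(replicas, "cart:1"))
```
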
    • Handling failures: hinted handoff:

      • a “sloppy quorum” is used to increase availability (sketched after this list):
        • all read/write operations are performed on the first N healthy nodes, which may not always be the first N nodes in the ring;
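
Here is a minimal sketch of hinted handoff under the sloppy quorum, with my own assumptions: alive is a hypothetical liveness predicate and stores maps each node name to its local list of replicas:

```python
def write_with_handoff(pref_list, n, key, value, alive, stores):
    """Healthy members of the top N store the replica normally; each
    unreachable owner is covered by the next healthy node further down
    the preference list, with a hint naming the intended owner so the
    replica can be handed back when that node recovers."""
    top_n = pref_list[:n]
    for name in top_n:
        if alive(name):
            stores[name].append((key, value, None))
    substitutes = (name for name in pref_list[n:] if alive(name))
    for intended in (name for name in top_n if not alive(name)):
        sub = next(substitutes, None)
        if sub is None:
            break
        stores[sub].append((key, value, intended))  # hinted replica

stores = {name: [] for name in ["A", "B", "C", "D"]}
write_with_handoff(["A", "B", "C", "D"], 3, "k", "v", lambda x: x != "B", stores)
print(stores["D"])  # [('k', 'v', 'B')]: D holds the replica meant for B
```
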
    • Handling permanent failures: replica synchronization:

      • Dynamo uses Merkle trees to detect inconsistencies between replicas and to minimize the amount of data transferred for synchronization;
        • the disadvantage is that many key ranges change when a node joins or leaves, and the affected trees need to be recalculated (a sketch follows below);
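
A minimal sketch of the idea, assuming SHA-256 leaf/branch hashing and a simplified comparison that checks roots and then leaf hashes; the real protocol descends the tree level by level:

```python
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def build(leaf_values):
    """Build a Merkle tree bottom-up; returns the levels, root level last."""
    level = [h(v) for v in leaf_values]
    levels = [level]
    while len(level) > 1:
        if len(level) % 2:                 # duplicate the last hash if odd
            level = level + [level[-1]]
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        levels.append(level)
    return levels

def out_of_sync(tree_a, tree_b):
    """If the roots match, nothing needs to be transferred; otherwise
    return the indices of the leaves (key ranges) that differ."""
    if tree_a[-1] == tree_b[-1]:
        return []
    return [i for i, (x, y) in enumerate(zip(tree_a[0], tree_b[0])) if x != y]

t1 = build([b"v1", b"v2", b"v3", b"v4"])
t2 = build([b"v1", b"vX", b"v3", b"v4"])
print(out_of_sync(t1, t2))   # [1]: only that key range needs to be synced
```
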
    • Membership and failure detection:

      • ring membership:
        • a node outage rarely signifies a permanent departure and therefore should not result in rebalancing or repair;
        • an administrator explicitly adds/removes a node, and a persistent history of the membership changes is kept;
        • a gossip-based protocol propagates membership changes and guarantees an eventually consistent view of membership;
      • external discovery:
        • seeds are nodes that are discovered via an external mechanism and are known to all nodes;
      • failure detection:
        • failure detection is decentralized: a simple gossip-style protocol enables each node in the system to learn about the arrival/departure of other nodes (a toy gossip round is sketched below);
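
A toy round of gossip-based view reconciliation; the view representation (node -> version counter) and the random peer choice are my assumptions, not the paper’s protocol details:

```python
import random

def gossip_round(views, node):
    """node picks a random peer; both sides merge their membership views,
    keeping the higher version for each entry, so that repeated rounds
    converge on an eventually consistent view."""
    peer = random.choice([n for n in views if n != node])
    merged = dict(views[node])
    for member, version in views[peer].items():
        if version > merged.get(member, -1):
            merged[member] = version
    views[node] = dict(merged)
    views[peer] = dict(merged)

views = {
    "A": {"A": 3, "B": 1},
    "B": {"B": 2, "C": 5},
    "C": {"C": 5},
}
gossip_round(views, "A")
print(views["A"])  # merged view of A and its chosen peer
```
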
    • Adding/removing storage nodes:

      • a new node is assigned a number of tokens that are randomly scattered on the ring;
      • key ranges are transferred to the new node with a confirmation round between source and destination, so no range is transferred twice; operationally this approach:
        • distributes the load uniformly;
        • helps meet the latency requirements;
        • ensures fast bootstrapping;
  5. Implementation:

    • Each storage node has three main software components:
      • request coordination;
      • membership and failure detection;
      • local persistence engine;
        • allows for different storage engines to be plugged in;
    • The request coordination component is built on top of an event-driven messaging substrate where the message-processing pipeline is split into multiple stages;
    • Request load would be skewed if the first node in the preference list always coordinated:
      • any node in the top N can coordinate the writes;
      • the write coordinator is chosen as the node that replied fastest to the previous read;
      • this also increases the chances of getting “read-your-writes” consistency;
  6. Experiences & lessons learned

    • Ensuring uniform load distribution; three partitioning strategies are compared:
      • T random tokens per node and partition by token value;
      • T random tokens per node and equal sized partitions;
      • Q/S tokens per node, equal-sized partitions (the paper finds this strategy best for load-balancing efficiency; sketched below);
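
A minimal sketch of the third strategy, dividing the ring into Q equal-sized partitions and giving each of S nodes Q/S of them; the concrete Q, node list, and round-robin placement (the real system scatters tokens randomly) are my simplifications:

```python
RING_SIZE = 2**32
Q = 8                            # equal-sized partitions on the ring
nodes = ["A", "B", "C", "D"]     # S = 4 nodes, so each gets Q/S = 2 partitions

partition_size = RING_SIZE // Q
# Partition i covers positions [i * partition_size, (i + 1) * partition_size).
assignment = {i: nodes[i % len(nodes)] for i in range(Q)}

def owner(key_position: int) -> str:
    """With fixed partition boundaries, lookup is simple integer division,
    and partitions can move between nodes without recomputing boundaries."""
    return assignment[key_position // partition_size]

print(owner(123_456_789))  # the node owning that ring position
```
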




