Paper Review: Spark: Cluster Computing withWorking Sets

Paper Review:


Spark is a new computation model which is different from the mainstream MapReduce. It uses RDD, derived from DSM, so that the intermediate results could be stored in-memory to reduce the disk r/w overhead. With the lineage property of RDD, Spark doesn’t have to make disk checkpoints for the purpose of recovery since the lost data could be reconstructed with the existing RDD. Also, Spark is running on Mesos so it naturally extends some of the awesome features in Mesos like the optimal task assignment with delayed scheduling.

Strong points:

  1. Flexible in terms of languages that could be used in the driver, the cache option that enables users to trade off between the cost of storing and the speed of accessing data, and running on Mesos so that it could share hardware resources along with other frameworks.
  2. Delay scheduling and the almost optimal tasks assigning features from Mesos. Developers don’t have to worry about the underlying computation task distribution because it’s handled nicely by Mesos.
  3. RDD greatly reduces the read/write operations of the disk comparing to Hadoop. Also there are many other optimizations to avoid disk r/w like when a node is doing broadcasting, it checks whether the data is in it’s memory first. It’s safe to say that disk r/w operations is the bottleneck for Hadoop.
  4. From my understanding, Spark is a more general execution model than MapReduce. The operations written in Scala are user-defined and easy to copied around. It can run different tasks including MapReduce.

Weak points:

Most of the weak points are due to the lack of details of the paper.

  1. Choosing Scala over Java, or some other languages in implementing Spark is interesting. It’s discussed by the first author himself on Quora: Why is Apache Spark implemented in Scala. I’ll say that Spark is still in a performance-critical layer and applications will be built on it. Using Scala might lead to worse performance comparing to Java implementation and so it was showed in the results section in the paper. Spark was running slower than Hadoop without iterations.
  2. Lineage and recovery is not quite covered in the paper in implementation or performance analysis. I’m not sure about how fast could the recovery be and what percentage of lost data can I recover from the remaining RDD (i.e. how much data lost is accepted). And also the paper doesn’t cover some other solutions like distributing and replicating the in-memory data, which seems more straightforward to me (replicated memory could be recovered without computation);
  3. Another question for this insufficient performance analysis is that, what if the data can’t fit into each machine’s memory in Spark. In this way the machines will have to use disk and here comes the inevitable r/w operations to the disk. Here is the link to the performance comparison between Hadoop and disk-only Spark: Databricks demolishes big data benchmark to prove Spark is fast on disk, too. I haven’t read it through but I guess Spark is faster because it doesn’t have to copy the intermediate results around, in this case, from mapper to reducer due to the optimal task assignment.


Paper Notes:

  1. MapReduce are not good with:
    • iterative jobs. Machine learning algorithms applies a function on the same set of data multiple times. MapReduce reloads the  data from the disk repeatedly and hence not suitable for iterative jobs;
    • Interactive analytics. MapReduce queries are delayed because of the reload.
  2. Resilient distributed dataset (RDD):
    • Achieve fault tolerance through a notion of lineage: could be rebuild from other part of the dataset;
    • The first system that allows an efficient, general-purpose programming language to be used interactively to process large datasets on a cluster;
    • Ways to generate RDD:
      • From HDFS file;
      • Parallelize a Scala collection into slices;
      • Flat-map from the existing RDD;
      • By changing the persistence of an existing RDD.
  3. Parallel operations:
    • Reduce, collect and foreach;

Paper Review: ZooKeeper: Wait-free Coordination for Internet-scale Systems

Paper Review:


ZooKeeper is a wait-free coordination used to construct lock, consensus and group membership for the clients. It uses hierarchical nodes with the same tree structure as UNIX file system to store the data and sequential number for recording. It guarantees FIFO execution for every single clients and it’s heavily optimized for faster reads.

Strong points:

  1. It’s just nice to have both synchronous and asynchronous versions of the APIs. Developers can choose whatever they like to fulfill their requirement on timing or consistency. Also we have a fast read and a slow read (sync + read), which makes thing flexible for developers and the applications.
  2. Locks are powerful primitives for distributed system, which, on the other hand, is hard to relax and fits into more applications. ZooKeeper has no lock but it’s easy to develop lock services upon ZooKeeper. Other different coordination models are also available in ZooKeeper.
  3. During the leader election, ZooKeeper clients don’t have to poll the locks and try to win the leader. Instead, it has a mechanism to keep clients in order and the previous leader failure only wakes the next client without interrupting others.

Weak points:

  1. Since ZooKeeper is an open source project aimed to serve variety of applications, it’s better to implement more options for the developers to choose from. For example, the watch mechanism is basically the same as Chubby with one-time trigger. If they could give some extra options, like when a client creates a znode, it could register for multiple triggers or notifications along with the updated data, that would possibly fit into more situations.
  2. Even with the guarantees on the ordering, the watch is an asynchronous according to ( This means the read could be stale not only because of the inconsistent local read from a non-leader server after a write, but also because of the late watch notification. There’s no good solution to this issue because any stronger consistency model could slow down the reads dramatically. However, since we already have sequence number for each znode, why don’t we bound the number with a timestamp or some version number so that the clients would at least have a better understanding of freshness of the data
  3. Looks like the simple locks without herd effect is done with a strict ordering of sequential znodes. A client will become a leader by being notified about the previous leader with the previous sequential number dies. Since the sequential number is increasing over the requests, does this mean the leader ordering is determined by the time they registered to the ZooKeeper? Sounds like this mechanism is not quite desirable in many cases. For example, if the developer wants the client with better hardware to become the leader, then it will be hard to implement because the znode sequential numbers don’t have anything to do with the hardware. (wait…. can I store the hardware scores into the data in each znode and maintain a ordered queue in the group znode? This way the developers can configure the leader election as they want.)
  4. There’s no mentioning on the issue of load balancing. Although all the ZooKeeper servers keep the exact same replicas, clients have access to any of the servers and can fire queries as they want. This could potentially lead to unbalanced request handling among all the servers and possibly some malicious attack as well. Again I couldn’t think of a good way to tackle this issue as a centralized load-balancing could not only slow down the client accesses but also add complexity to the system. Maybe we can implement the ZooKeeper servers in a way that if they detect a lot of TCP losses/timeouts during a time, it redirects some of its distanced clients to other servers


Paper Notes:

  1. ZooKeeper uses non-block primitives because with the blocking ones, like locks, slow and faulty clients can limit the overall performance of the system. The failure detection would increase the complexity. It’s a lock-free version of Chubby;
  2. The requests are handled with pipelined-fashion to support thousands of operation execution; the FIFO ordering of client operations enables asynchronous operations (multiple consecutive operations without reply);
  3. Zab for leader-based atomic broadcast protocol;
  4. Reads throughput is optimized by 2) servers handle reads locally 2) no strict ordering for reads and 3) client-side caching with a watch mechanism;
  5. Tree of directories and data (hierarchical name spaces) because it’s easier and familiar for users in a standard UNIX file system;
  6. Regular and ephemeral znodes
  7. Increasing sequential flag append to znode
  8. Watches to allow clients to receive timely notifications of changes without polling
    1. the clients send reads with watch flag set and the server will return the data with one-time promise to notify the clients about any change;
    2. watch only indicate the change without sending the actual data;
  9. Clients and ZooKeeper interact by using sessions with timeout. It ends when
    1. inactive clients over time
    2. close session by clients
    3. faulty clients detected by ZooKeeper
  10. Synchronous and asynchronous APIs
    1. no handles but use direct access to the path instead
    2. update with version number to enable version checking
  11. Two features:
    1. linearizability (multiple outstanding operations but ordered with guarantee)
    2. FIFO execution of queries
  12. For a leader re-electing scenario:
    1. lock service will prevent other processes using the old configurations but doesn’t offer a good solution when the new leader partially changed the configurations of all and then failed
    2. in ZooKeeper, the new leader is going to delete the ready znode, make configuration changes and then set the ready znode back on. The configuration change is send asynchronously. New configuration will not be used;
    3. ordering is the key preventing any client read anything before it’s ready;
    4. sync causes a server to apply all pending write requests before processing the read without having any extra write
  13. Write with atomic broadcast and read from any server for better availability
  14. Write-ahead log in disk
  15. Quorum-based algorithm with pipeline optimization
  16. Periodically snapshot for faster recovery. It’s inconsistent because of it’s a non-blocking depth first scan of the tree of each znode data


Paper Review: The Chubby lock services for loosely-coupled distributed systems

Paper Review:


Chubby is a lock service provided to a large number of clients for access control and coordination. It’s made of typically 5 machines and one of them is master. The write operations require ACK from a quorum of machines and reads only need to access the current master.

Strong points:

  1. Chubby exports a file system with APIs similar to UNIX filesystem. Tree of files are stored in directories which makes the access structured and easier for all users. There are some differences though, like the file moving, directory access control and last-access times;
  2. The file handles with sequence number prevent checking at each read/write access and easy to see which master assigned the handles without referring to other replicas. This mechanism is good for small file with frequent accesses because checks are done during the open of handles instead of actual access. Each handle could be reused unless the master assigning the handle fails.
  3. Since the read operations greatly outnumber writes, data is cached in the client side to reduce the traffic of polling Chubby masters for every single read. The master will notify all the clients that possibly cached the data when the data is modified and invalidate the caches. I can imagine it saves a huge number of messages during the run. However, a timed-cached (TTL cache) combined with this mechanism might be better because as the system runs, clients with possible will increase over time. For every read from the clients, the master is going to return the data with a validation time limit, which implies that cache should not be used after this period of time. On the other hand, the master will still invalidate all the clients whose cache is still in the time limit upon modification. This way we can avoid the case when the master has to send too many invalidation messages after several consecutive writes and lead to bursty network usage. This is different from the KeepAlives jeopardy because the TTL is for every cached data while jeopardy is for every client

Weak points:

  1. This is trivial but still not thoroughly explained. The paper mentioned that the clients need so multicast request to all the Chubby replicas stored in DNS in order to locate the master, which means that all the Chubby replicas are managed by another service that keep track of the states and placement of Chubby replicas. The read operation only goes to Chubby master, but if the master is freshly selected, then either 1) the read service will be blocked for a while until the new master is up-to-date by checking with all other replicas or 2) the read returns a insufficiently updated value stored in the new master;
  2. The sequence numbers are assigned to any operations that interact with locks. And if any lock is lost in a abnormal circumstances like failure, the other clients have to wait for a period (1 minute). However, why don’t master actively sends messages to the lock-holding clients to check for liveness instead of waiting. At least having the option of proactively checking the clients could benefit systems with higher requirement on availability. (This could definitely use the TrueTime API to make sure of the ordering of events from different clients and save a lot of waiting/messages);
  3. When I was first reading the paper, I felt that the communication between Chubby clients and cells overlaps with TCP a lot in terms of their functionalities: both maintains a session with sequence number; both requires message exchange to keep alive otherwise the session expires. So I assumed that TCP is a nature for Chubby with all the strict ordering, congestion control and guarantees on the delivery. But the paper claims the TCP back-off policy lead to a lot of KeepAlives timeouts because the transport layer protocol is unaware of the timing of upper layer processes. However, the main reason we are triggering the back-off policy is the bursty message pattern, which is not discussed in the paper. If we can take advantage of the transport layer session to reduce the number of messages used for KeepAlives, then then back-off won’t be a problem in most scenarios. An alternative is to build a new reliable congestion-control protocol upon UDP with optimizations and priority for time-critical messages.
  4. Chubby master is definitely a bottleneck of the overall performance because all the operations, read or write, have to go through the master. On the other hand, Chubby slaves’ computation power and bandwidth is under-utilized. I guess it could be viewed as a down-side of the high consistency of Chubby, which is maintained by a single master decision making mechanism.


Paper Notes:

  1. Reliability, availability and scalability are considered the most important while the throughput and storage capacity are considered secondary.
  2. Lock services, although cannot offer a standard framework for programmers like a library embodying Paxos, maintains high availability as the system scales up; it’s suitable for clients to store and fetch large quantities of data as it uses consistent client caching rather than TTL time-based caching; quorum based consensus reduce the number of servers needed for client to make progress
  3. Only lock-interaction operations will be assigned with a sequence to ensure

Paper Review: Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center

Paper Review:


Mesos is a thin layer application between different frameworks and resources, dynamically assigning tasks to a cluster of machines. Because of the different tasks and constraints of frameworks, the task assignment should be made with a lot of considerations. The overall system seems quite simple and elegant, which makes the results even better.

Strong points:

  1. The system takes a lot of things into consideration during the assignment: the constraints of frameworks, the data location, the time of the tasks, the priority and the fairness. But yet all these considerations could be expressed in simple values and dynamically changed during the run;
  2. Mesos leaves a lot of room for the frameworks, which we can see from it’s implementation with Hadoop. It takes advantages of the fine-grained task assigning of Hadoop and reduce the overhead when Mesos and Hadoop work together. This is something that static partitioning tool could have a hard time to achieve;
  3. Mesos master is implemented with minimum communication between the upper and lower layers. Also the state of master is replicated in other standby masters and managed by ZooKeeper. The data in the memory of Mesos master could be constructed from the frameworks and slaves, which makes the system more resilient.

Weak point:

  1. While the tasks are isolated with Linux Container and separated cores and memory chips, there are still shared system resources like the network bandwidth, which is not taken into consideration during the run as well;
  2. The actual task assignment is still performed by a centralized machine and the bandwidth of other standby masters are wasted. The Mesos master might face short-time network exhaustion during run caused by similar tasks finished all at once. This might explain why the Hadoop tasks have spiky share of cluster performance;
  3. There isn’t much in the paper talking about the resource offer preference is calculated from the data location and other factors (it simply mentioned how to select the suitable frameworks with lottery scheduling with given preference si). I guess the preference could be useful if we can take advantage of the data location by assigning the tasks to nodes which already have the necessary data to save the transfer time. But I guess this is something that’s hard to keep track of and it requires the the frameworks to cooperate, which makes Mesos less adaptive after all.


Paper Notes:

  1. Sharing a cluster is hard. There are two common solutions: statically partition the cluster and run one framework per partition or allocate a set of VMs to each framework. However, there is no way to perform fine-grained sharing across the frameworks. Mesos is a thin layer that lies between frameworks and cluster resources.
  2. A centralized scheduler cannot achieve the complexity and scalability. It cannot adapt to new frameworks. Also it overlaps with the frameworks’ scheduler.
  3. Mesos will offer frameworks with resources based on an organizational policy such as fair sharing and the frameworks decide which ones to accept and the tasks to run on them.
  4. Mesos has minimal interfaces and pushes the control to the frameworks to 1) leave more room for various frameworks and 2) keeps Mesos simple, scalable and low in system requirements
  5. Frameworks resources could be revoked due to a buggy job or a greedy framework; however, frameworks can be assigned with resources with guarantee so the tasks won’t be killed as long as the framework is below the guaranteed allocation.
  6. Linux containers are used for task isolation
  7. ZooKeeper for leader election from all the standby masters. Mesos master’s memory could be reconstructed from the frameworks and resource slaves
  8. Mandatory and preferred resources; lottery scheduling; resource reservation for short tasks

Paper Review: Mesa: Geo-Replicated, Near Real-Time, Scalable Data Warehousing

Paper Review:

Mesa is another Google’s data store layered upon Colossus and Bigtable, primarily designed for ads campaign data storage. The versioned kv store with aggregation is scalable and replicated globally. The metadata is consistently updated with Paxos and the data is batched and transferred every few minutes.

Strong points:

  1. In the chapter “experiences and lessons learned”, layered design was mentioned as a key design feature of Google products. Mesa is nicely layered and decoupled in both horizontal and vertical directions. It was build up on Colossus and Bigtable so we can see there isn’t too much about the read/write topics, which are covered in Bigtable. Inside the architecture, it has workers/servers, controllers and global services. The data maintenance/update and query is also decoupled. While there might be some overhead, but it enables clear designing, easy problem identification and detailed performance analysis.
  2. The resume key with streaming transmission is interesting. This way the failed/disconnected query server won’t waste clients’ time since the read operations could be continued on another server instead of firing the query once more.
  3. Parallelizing the worker operation using MapReduce  and linked schema change are good ideas. MapReduce could save days of computation time and the schema change, while not applicable in every scenario and add some computation on the query path, saves 50% of disk space than the traditional simple schema change.  This could be a life saver since Mesa requires a lot of storage space in the first place.

Weak points:

  1. The  controllers assign different tasks to different types of workers. The failure of workers will eventually be captured by the timer maintained for each task in the controllers. However, the timers could add a lot of work to the controllers if there are many tasks running simultaneously. On the other hand, work failures could result in long latency for the corresponding queries since before the timer expires and the tasks get re-assigned. This kind of latency could be resolved if the controllers/workers exchange heartbeat messages so that the worker failures could be detected earlier before the timer runs out.
  2. Query servers are assigned with their own responsible range of data to take advantage of prefetching and caching. However, this assignment could be a little inflexible if there are a lot of queries on the similar data. This way we will only have a small amount of query servers fetching data from Colossus for a long time while the rest of servers are doing nothing. This will be a performance bottleneck and waste of resource if we assume that the read operation in Colossus is lock-free and scales well. Also since we are on the chapter of querying, the global locator service is really vague in the paper. I assume that it is a stateless process running the controllers with the data replicated in Bigtable as well.
  3. The replication mechanism does not make too much sense to me. The Paxos will only make sure that metadata is replicated consistently by majority of Mesa instance and the data replication will be left behind without any strong guarantee. So for each Mesa instance (datacenter), the metadata could be out-dated if it failed during the Paxos; and even if it works fine during Paxos, there’s not much guarantee on the consistency of the actual data.
  4. There are two methods of data validation, the online one by re-aggregating the rows and check for computation errors and the off-line one, which is a light-weight process spanning the recent committed data. There are two problems with the corruption recovery: 1) the online checking will be perform in every update and query, which could result in unnecessary checking and increased latency and load; 2) the data is replicated asynchronously. If the freshly updated copy gets corrupted, there’s no other replicas that could help with the recovery




Data is partitioned and replicated horizontally to achieve scalability and availability. (does this imply NoSQL?)

Multi-version key-value data store with Paxos for consistency

Leverage Bigtable and the Paxos tech underlying Spanner for metadata storage and maintenance.

Asynchronously replication for application data; synchronous replication for metadata.

Dealing with corruption for software and hardware

A query  to Mesa consists of a version number and a predicate P on the key space. And the response contains the corresponding P and a version number between 0 and n.

Strict ordering of updates can ensure atomicity and fraud detection of negative facts.

Mesa pre-aggregates certain versioned data (between v1 and v2 inclusively) and call it delta. Base compaction is the process of merging some of the version in to [0, B] and versions before base are no longer accessible. This kind of idea is commonly used (like append-only GFS) but it could be problematic if old-version data is still useful. Older data are stored in a versioned and expensive way in terms of the process of aggregation which means that Mesa is necessary for cleaning things up but how many versions are we going to keep is gonna be hard to determine for various kinds of data. Since Mesa is primary used for ads campaign data collection, which has uniform and specific requirements in terms of data storage and aggregation, this issue doesn’t seem to matter too much.

Each table has one or more indexes and each table index has its own copy of data that is sorted accordingly for fast search. And there is an index file containing short keys for row blocks for faster localizing the row block.

Metadata is stored in Bigtable and in the memory of controller. Each datacenter has one controller (I assume) and it does not directly interact with the tables. There are different tasks for data maintenance like updates, compaction, schema change and checksum. Note that the last two require coordination of different Mesa instances (datacenters). A set of workers of different types polling the controller for task of their own types. Each task will be assigned with a timer so failed workers won’t affect the system. The controller is also sharded and stateless with all the metadata consistently stored in Bigtable so Mesa is resilient to controller failures. A garbage collector runs separately and continuously, which reads from the metadata in Bigtable and delete the unwanted files in Colossus.

Mesa handles queries with different requirement with different labels and priorities. A set of query servers could access any table in principle but Mesa will direct  queries with the same range of data to the a subset of the query servers to take advantage of the pre-fetching and caching. Global locator service is used to coordinate the query servers.

The operations are batched in Mesa instances once every few minutes. A stateless global committer will assign each updates batch a version number and use Paxos to reinforce the data consistency. Controllers in every Mesa instance will be responsible for the updates batches and acknowledging the committer. There is no locking. Note that in this case metadata is replicated synchronously with Paxos while the data is incorporated asynchronously by various Mesa instances.

New Mesa instances will bootstrap with p2p load mechanism.


Paper Review: Spanner Google’s Globally-Distributed Database

Paper Review:


Spanner is Google’s new global data store with semi-relational data model and standard query language. It uses Paxos and 2PC for operations and uses bounded real time for external consistent transactions.

Strong points:

  1. Spanner switches from NoSQL to NewSQL (?), which is easy to work with (with semi-relational data model and query language) and excellent scalability; however, the data is also version-ed (with TrueTime timestamps) so clients can decide if the read is up-to-date.
  2. TrueTime is just impressive. It enables external consistency and a bunch of cool features like consistent snapshot read across the data centers and dynamic schema changes. It’s like having a wall clock for all the replicas with bounded uncertainty. Not to mention that the uncertainty is controlled sophisticated using GPS and atomic clocks as underlying hardware and algorithm for lair detection;
  3. Data are stored in tablets, which are also classified into different “buckets”. Applications can control the locality of data by carefully assigning keys to the data. This feature could potentially lower the latency (by choosing closer datacenters for storage);
  4. Dynamic controlled replication configuration might be helpful when the application is trying to change the data location or replication factors during the run.

Weak points:

  1. The write operations are still using Paxos for consensus and two phase commit during the transaction. It enforces strong consistency for sure but a) the master could be troublesome. Master failover might result in long waiting time and b) communication overhead is inevitable which increase the latency of every transaction;
  2. TrueTime is sophisticated designed with redundant hardware support and algorithms to verify its correctness. However, the write transactions (to a single Paxos group) performed during a period of time is bounded by epsilon and so is the system’s overall accuracy. Epsilon is caused mainly with hardware errors and hard to be eliminated, which means that Spanner is unlikely to have better writing performance or timestamp accuracy;
  3. Since the system’s ordering is based on clock time and the clock time is uncertain, there are many occasions that we have to wait till the system is definitely sure that the previous event is already done even when the waiting is simply for the purpose of making TT.after true. For example, the commit timestamp, even with all the replicas get back to leader, it still has to wait till it’s certain about the timing;
  4. If a TrueTime API is used with a faulty timestamp, say it fires a read operation in the future, will it block other transactions, or get halted, or return with an error?



Paper Outline:

  1. Introduction:

    • Globally scale database that shards data across many sets of Paxos state machines in the highest level of abstraction;
    • Main focus is managing cross-datacenter replicated data but also designing/implementing important database features;
      • Bigtable (NoSQL) can be difficult for applications with complex schema or strong consistency in the presence of wide-area replication;
      • Megastore (semi-relational data model) supports synchronous replication but has poor write throughput;
    • Spanner is evolved from Bigtable where  data is stored in schematized semi-relational tables and version-ed; it provides a SQL-based query language;
      • replications configuration can be dynamically controlled;
      • externally consistent read/write operations;
      • these features are enabled by the globally-assigned timestamps, which is supported by the TrueTime API and its implementation;
  2. Implementation:

    • Overall structure:

      • a Spanner deployment is called a universe;
      • Spanner is organized with a set of zones which are unit of administrative deployment and resemble data centers;
      • each zone has:
        • one zonemaster;
        • hundreds of spanservers (roughly analog to Bigtable servers);
        • location proxies are used by clients to locate data;
      • universe master and placement driver are singletons:
        • universe master is primary a console that displays stats info;
        • placement driver handles auto movement of data across zones;
    • Spanserver software stack:

      • spanserver structure:
        • each spanserver is responsible for 100 to 1,00 instances of tablets, which is similar to Bigtable’s tablet abstraction;
        • unlike Bigtable, Spanner assigns timestamps to data, which makes it more of a multi-version database than a key-value store;
        • tablet states are stored in B-tree-like files and a write-ahead log;
        • all storage happens on Colossus;
      • coordination and consistency:
        • a single Paxos state machine for each spanserver;
        • a state machine stores its metadata and log in corresponding tablet;
        • long-lived leaders and time-based leader leases for Paxos;
        • every Paxos writes twice: in the tablet log and in the Paxos log;
        • writes must initiate Paxos protocol at the leader but reads access state directly from the underlying tablet as long as it’s up-to-date;
        • each Paxos leader implements a lock table for concurrency control:
          • lock table contains the state of two-phase locking;
          • only operations require synchronization acquire locks;
        • each Paxos leader implements a transaction manager to support distributed transactions:
          • used to implement a participant leader;
          • transaction involves only one Paxos group will bypass the transaction manager;
          • for transactions that involves multiple Paxos groups:
            • one of the participant group is chosen as the coordinator;
            • others are referred as coordinator slaves;
    • Directories and placement:

      • a bucket of contiguous keys that share a common prefix is a directory which allows applications to control the locality of data by choosing keys;
      • all data in a directory share the same replication configuration and could only be moved directory by directory (while the client operations are still ongoing);
      • not necessarily a single lexicographically contiguous partition of the row space but  instead a container that may encapsulate multiple partitions of the row space so that directories could be put together;
      • Movedir task:
        • the background task moving directories between Paxos groups;
        • also used to add/remove replicas to Paxos groups;
        • a part-by-part background process between two Paxos groups;
      • directory is also the smallest unit whose placement can be specified;
        • administrators control the number and types of replicas, and the geographic placement of those replicas;
        • an application controls how data is replicated, by tagging each database and/or individual directories with a combination of those options;
      • shard a directory into multiple fragments if it grows too large;
        • fragments could be served by different Paxos groups;
        • movedir in this case will actually move fragments not directories;
    • Data model:

      • data features for applications:
        • a data model based on schematized semi-relational tables;
          • used by  Megastore; simpler to manage unlike Bigtable;
          • synchronous replication across datacenters unlike Bigtable which only supports eventual consistency;
        • a query language;
          • because of popularity of Dremel as an interactive data analysis;
        • general purpose transactions;
          • complaint on the lack of cross-row transactions in Bigtable;
          • two-phase commit over Paxos mitigates the availability problems (but expensive to support);
        • application data model:
          • layered on the directory-bucketed key-value mapping;
          • an application can create one or more database in a universe;
          • a database can contain unlimited schematized tables;
          • uses a SQL-like query language with extra features;
      • Spanner data model:
        • not purely relational because every table is required to have an ordered set of one or more primary-key columns;
        • each table defines a mapping from the primary-key columns to non-primary-key columns;
        • it lets applications control data locality through key choices;
  3. TrueTime:

    • TureTime

      • explicitly represents time as a TTinterval with bounded time uncertainty, which is different from standard time interface;
    • GPS and atomic clocks failure modes:

      • GPS reference-source vulnerabilities:
        • antenna and receiver failures;
        • local radio interference;
        • correlated failures;
        • GPS system outages;
      • atomic failures;
        • time drift due to frequency error;
    • master/slave implementation:

      • a set of time master machines per datacenter;
        • majority of them have GPS and geographically separated;
          • reduce the effect to failures;
          • uncertainty close to zero;
        • the rest have atomic clocks and are called Armageddon masters;
          • slowly increasing time uncertainty;
        • regularly compared against each other and local clock;
      • timeslave daemon per machine:
        • polls a variety types of masters;
        • applies a variant of Marzullo’s algorithm to detect and reject lairs;
        • worst-case local clock drift is a saw-tooth function;
          • master clock uncertainty, communication delay, local drift;
  4. Concurrency control:

    • Supported operations:

      • read-write transaction;
        • pessimistic and requires leader replication;
      • read-only transactions;
        • not a read-write transaction without write; non-blocking;
      • snapshot reads;
    • Timestamp management:

      • Paxos leader leases:
        • long-live leader is selected with a quorum-based vote;
        • lease could be extended on a successful write or near expiration;
      • assigning timestamps to RW transactions:
        • Spanner assigns timestamp that Paxos assigns to the Paxos write;
        • external consistency: if the start of a transaction T_2 is later than T_1, then the timestamp of T_2 must be greater than T_1;
        • start: the coordinator leader for a write T_i assigns a commit timestamp s_i no less than the value of;
        • commit wait: the coordinator leader ensures that clients cannot see any data committed by Ti until TT.after(s_i) is true;
      • serving reads at a timestamp:
        • every replica tracks a value called safe time t_safe, which is the maximum timestamp at which a replica is up-to-date;
      • assigning timestamps to RO transactions:
        • two-phase transaction: timestamp s_read assigning and execute snapshot reads at s_read;
    • Details:

      • read-write transactions:
        • issues read to the leader replica of the appropriate group;
        • wound-wait read recent data with timestamps;
        • keep-alive messages to the leader to maintain locks;
        • a Paxos algorithm with TrueTime to enforce consistency;
        • buffered writes until two phase commit with the same timestamps on all participants;
      • read-only transactions:
        • scope required for read-only transactions, which summarizes the keys read in the transaction;
        • contact the leader when the scope’s values are served by one Paxos group;
          • S_read ensures that last write is returned;
        • multiple Paxos group read:
          • a round of communication with all the leaders;
          • just read S_read = to see the sufficiently up-to-date values;
      • schema-change transactions:
        • generally non-blocking variant of a standard transaction;
        • reserve a timestamp in the future and block the transactions with bigger timestamps behind the schema changes;
      • refinements:
        • lock with metadata to prevent false conflicts;
        • use leader-lease interval for snapshot read after a Paxos write;
  5. Evaluation:

    • Microbenchmarks:

    • Availability:

    • TrueTime:

    • F1

  6. Related work:

  7. Future work:

  8. Conclusion:


Paper Review: Adapting Microsoft SQL Server for Cloud Computing

Paper Review:


MSSQL is the first distributed commercial SQL storage. It uses primary and secondary replicas across the datacenters, coordinated by a global partition manager. Operations are done using quorum and  Paxos consensus algorithm is used in replication and recovery.

Strong points:

  1. I guess one of the best thing about Microsoft SQL server is that it’s a SQL-based cloud storage solution, which means standard and fast development for most small companies with common data models. It supports aggregation, full-text queries, and referential constraints, views and stored procedures and most of those are not supported by custom record stores.
  2. From the content I think global partition manager is not a single machine but more of a highly-available service made with multiple nodes across the datacenters to ensure the availability.
  3. The decoupled design of layers enables upgrade without interfering with user operations. All the cluster activities, including the two-phase upgrades, are done in the layer of infrastructure and deployment services and the user won’t be able to use the new features unless the process is finished.

Weak points:

  1. The replica placement is good for avoiding heavy traffic where each server host a mix of primary and secondary partitions. Note that only primary partitions serves all the query, update and other operations (however, nearly up-to-date secondaries might be used as read-only copies). Could the consistency be an issue because the asynchronous update? Since the read-only replicas are nearly update but there’s no guarantee, the responsibility of validation of the data is given to the users. And what if the client wants to write something and the primary is far far away? The primary replica might be a good way to coordinate operations but it surely affects availability and consistency.
  2. The update to replicas will be propagated from primary replica to secondary ones, which means that if the server storing primary replica fails during the beginning process of propagation might result in loss of transferred data before one nearly up-to-date secondary replica becomes primary.
  3. Since it’s a SQL server, the scalability could be worse than NoSQL storage like MongoDB and Bigtable since the data is stored in hierarchical fashion. Also I guess MSSQL doesn’t offer dynamic schema as well.

Paper Review: Megastore Providing Scalable, Highly Available Storage for Interactive Services


Megastore is  Google’s solution for multi-datacenter storage, which layers on top pf Bigtable. It has a semi-relational data model using optimized Paxos for transactions within a entity group and asynchronous message queuing for transactions across entity groups. It has relatively low writing throughput with strong consistency (if using 2PC) and availability (works as long as majority of the replicas, including witness replicas, are still alive).

Strong points:

  1. Most criticisms Bigtable received are focusing on its NoSQL data model, which slows down the development and increases the cost of maintenance. Megastore maps its semi-relational data model into Bigtable to encounter some disadvantages we had in Bigtable.
  2. I found the variety of different read operations quite interesting: current read operation is a entity-scope read for committed writes; snapshot read operation is also entity-scope but can read uncommitted states; inconsistent read ignores the log and simply return the latest value which can save a lot of time. This kind of implementation could really make the applications on Megastore flexible in order to meet their different requirement on latency and consistency.
  3. Witness replica is a brilliant idea to enhance system durability without adding too much bandwidth consumption and storage space. They solve the cases where only a few typical replicas and any failover/network partition might end up blocking the Paxos. On the other hand, read-only replicas are also great for distributing data in multiple datacenters without slowing down the write Paxos.

Weak points:

  1. I understand that Paxos is one of the key features in Megastore to implement ACID transactions without heavyweight master or data loss on failure. But could this strategy end up with lower throughput and too many communication messages between replicas for the coordination? It seems to me that Paxos would introduce significantly more network-round-trip delays to a single transaction than any other replication strategies, even with the optimization they made on Paxos. This is also confirmed in the Spanner paper that write throughput is bad in Megastore.
  2. Entity groups could be considered as a logically hierarchical way to organize the data. Transactions within a entity group will be using Paxos while the cross-entity transactions are usually via asynchronous messaging. Paxos offers strong consistency but also leads to communication overhead in each operation while the asynchronous messaging is fast but lacks consistency guarantees. It works when our data model aligns with this kind of hierarchy, like Email and Blog where each user could be visualized as a natural entity and the cross-entity transactions are rare and not really require strong consistency but what if our data model is flat? Or our data model requires things to be store together for faster retrieval but on the other hand requires each entity to have high-level of consistency? Then the whole system could either be expensive due to the 2PC overhead for consistency or not consistent enough. I assume that Megastore has some assumptions on the data models prior to the design. (Also since the entities has very different sizes, could they become problematic during load balancing?);
  3. Megastore is built on Bigtable, which is using GFS, SSTable and Chubby as basic storage management and coordination. While the philosophy of decoupling and layer design is great for debugging, system analysis and faster development, the overhead caused by coordination is gonna be terrible。
  4. Queuing for transactions between multiple groups is asynchronous, which means that the consistency is not guaranteed in a limited period of time. On the other hand, buffer overflow or node failure could directly result in message loss and inconsistency is inevitable. The alternative of queuing is 2PC and it’s way more complex and the communication overhead is gonna hurt the performance. It’s always nice to have an alternative though.


Paper Notes:

  1. Introduction:
    • Since all the requirements like scalability, rapid development for users, low latency, consistency and highly availability are in fact conflicting with each other, Megastore picks a mid-ground between RDBMS and NoSQL:
      • datastore is partitioned and replicated with full ACID semantics but limited consistency guarantees;
      • traditional database features are supported if they can scale with tolerable latency limits and compatible with partitioning scheme;
    • Paxos is optimized in Megastore for low latency operations and used for variety of things including primary user data replication;
  2. Toward availability and scale:
    • Common strategies:
      • asynchronous master/slave (data loss on failures);
        • ACK at master and transmission at slaves in parallel;
        • risks downtime or data loss during failover to a slave;
        • requires a consensus protocol to mediate mastership;
      • synchronous master/slave (heavyweight master);
        • master waits on slaves before ACK;
        • master/slave failures need external detection;
      • optimistic replication (no ACID transaction):
        • any member of a homogeneous replica group can accept mutations;
        • asynchronously propagated through the group;
        • global mutation ordering is not known at commit time so transactions are impossible;
    • Enter Paxos:
      • any node can initiate reads and writes. Each log append blocks on acknowledgments from a majority of replicas, and the rest catch up as they are able;
      • multiple logs increase throughput (reducing possibility of distanced nodes using one log) and availability (operations won’t block when majority fails to ACK), each governing its own partition of the data set;
    • Entity groups:
      • data is partitioned into entity groups;
        • single-phase ACID transactions within an entity group via Paxos;
        • cross-entity transactions could be via expensive 2PC or asynchronous message queuing (looser consistency);
    • Physical layout:
      • each datacenter is a Bigtable instance;
      • minimize the latency by letting applications control data placement;
  3. A tour of Megastore:
    • API design:
      • normalized relational schemas are not used in Megastore  because:
        • high-volume interactive workloads benefit more from predicable performance than from an expressive query language;
        • reads dominates writes so it pays to move work from read time to write time;
        • storing and querying hierarchical data is straightforward in simple key-value data stores like Bigtable;
      • (I have no background on databases so I’ll leave this section later);
    • Data model:
      • entity group root tables and child tables;
      • each entity is mapped into a single Bigtable row;
      • local index is treated as separate indexes for each entity group;
      • global index spans entity groups, used to find entities without knowing in advance the entity groups that contain them;
      • storing clause for faster access at read time;
      • mapping to Bigtable:
        • store root entity as a single row, which allows atomically update;
    • Transactions and concurrency control:
      • store multiple values in same row/column with different timestamps;
      • reads and writes don’t block each other;
      • Megastore provides current, snapshot and inconsistent reads;
      • write operations life cycle: read, application logic, commit, apply, clean up. Note that only one write wins in one Paxos;
    • Other features:
      • periodical full snapshot;
      • optional data encryption;
  4. Replication:
    • Current reads guarantee:
      • a read always observes the last-acknowledged write;
      • after a write has been observed, all future reads observe that write;
    • Brief summary of Paxos:
      • majority of replicas must be active to proceed;
      • use Paxos to replicate a transaction log and positions in the log;
    • Master-based approaches:
      • writes are reduced to single round of communication;
      • writes can be batched together to improve the throughput;
      • master failover can result in user-visible outages;
    • Megastore’s approach:
      • fast reads:
        • local read is allowed with the help of coordinator;
        • coordinator at each datacenter tracks a set of entity groups;
      • fast writes:
        • master-based system using leaders;
        • closest replica as leader;
      • replica types:
        • witness replica vote in Paxos and store the write-ahead log without applying the log or storing the data;
        • read-only replicas only stores data without voting; distribute the data without adding any write latency;
    • Data structure and algorithms:
      • out-of-order proposals are acceptable;
      • catchup when a replica was found out of date during read;
      • read: query local, find position, catchup, validate, query data;
      • write: accept leader, prepare, accept, invalidate, apply;
    • Coordinator availability:
      • failure detection:
        • use out-of-band protocol to detect coordinator failures;
        • coordinators obtain Chubby locks;
        • coordinator failure will be handled(reconstructed) quickly without affecting read/write;


Paper Review: Cassandra A Decentralized Structured Storage System

Paper Review:

Cassandra is another fully distributed data store pretty much like Dynamo in Amazon. It also works with consistent hashing but different in the load balancing part. It has relatively flexible data types and replication methods than Dynamo and is one of the best non-relational data store in the world.

Strong points:

(pretty much every advantage/disadvantage of Dynamo could be listed here so I’m just going to focus on the minor differences they have)

  1. Column indices for faster lookup. This is a great optimization for wide-column data store by preventing the column-by-column search during data retrieval.
  2. Load balancing scheme is more tractable and deterministic because we are not assigning virtual nodes to machines any more. We are actually analyzing the load and performance of each machine and moving data from node to node to balance the distribution. This means that 1. the load distribution is more observation-based which means that the balancing result should be better than virtual node assignment and 2. this balancing method is dynamic and could react to run-time imbalance in load assignment.
  3. While Cassandra is a fully distributed system, the use of Zookeeper really make thing simple and elegant. For example, the replication location is chosen by the leader elected by Zookeeper rather than using consensus algorithms. This way the replication requires less coordination among all nodes and could substantially reduce the message exchange amount and overall latency.

Weak points:

  1. “Cassandra morphs all writes to disk into sequential writes thus maximizing disk write throughput”. I guess it means that concurrent write operations are treated sequentially with their timestamps and the ordering is agreed by all replicas by coordinated consensus or leader decision. Either way this method of serialization could end up with a different ordering from the reality.
  2. Cassandra resembles NoSQL database which is proven to be flexible and scalable (just like Dynamo and Bigtable) than relational databases. However, NoSQL databases requires more attention to the structure of storage and the data could be hard to import/export without standards and proper support of query language.
  3. I’m just being nitpicking here but the load balancing could be potentially problematic. It requires periodical background scanning of loads, which consumes time, bandwidth and computation power, and a recovered light-loaded node could be assigned too much data in a short span of time and might result in node crush and/or network jam.


Paper Outline:

  1. Introduction

  2. Related work

  3. Data model:

    • A table is a distributed multi dimensional map indexed by a key;
    • The value is highly structured:
      • row key is a string with no size limit;
      • every operation under a single row key is atomic per replica;
      • columns are grouped together to form column families;
        • super column families are column families within column families;
      • the sort order within a column family is application-specified;
        • either by time/name;
  4. API:

    •  insert(table, key, rowMutation);
    •  get(table, key, columnName);
    •  delete(table, key, columnName);
  5. System architecture:

    • Partitioning:

      • uses order preserving consistent hashing to replicate data across clusters;
      • the data is assigned onto a ring structure of the hash output;
      • walk clockwise and find the first data node as coordinator;
      • main principle of this design:
        • leaving/joining of a data node only affects the one other node;
      • challenges:
        • random assignment causes non-uniform data and load distribution;
        • oblivious to the heterogeneity in the performance of nodes;
      • solutions:
        • virtual nodes like the ones in Dynamo;
        • analyze load info on the ring and move lightly loaded nodes on the ring to alleviate heavily loaded nodes. This makes the design and implementation very traceable and helps to make very deterministic choices about load balancing;
    • Replication:

      • N as replication factor is configured per-instance;
      • coordinator of the data is responsible for replication:
        • rack unaware:
          • pick the next N-1 data nodes on the ring as replicas;
        • rack aware/datacenter aware:
          • leader election by Zookeeper;
          • makes a concerted effort to maintain the replication;
          • metadata backed up at each node and Zookeeper;
      • each node is aware of every other node in the system:
        • durability guarantees by relaxing the quorum requirements;
    • Membership:

      • cluster membership is based on Scttlebutt Gossip;
      • failure detection:
        • Accrual Failure Detection: not up/down statues but a suspicion level of each monitored nodes;
        • every node in the system maintains a sliding window of inter-arrival times of gossip messages from other nodes;
        • the distribution of these inter-arrival times is determines and suspicion level is calculated from it;
    • Bootstrapping:

      • when a new node is joining:
        • randomly pick a token on the ring;
        • position backed up on local disk and Zookeeper;
        • position info is gossiped around the cluster;
        • read the configuration file (a list of a few contact points);
      • manual startup and joining in to restore a failed node;
    • Scaling the cluster:

      • new node should alleviate a heavily loaded node;
      • data streams using kernel-kernel copy technique;
        • maybe parallelizable;
    • Local persistence:

      • local file system for data persistence;
      • typical write operation involves:
        • a write into a commit log for durability and recover-ability;
        • an update into an in-memory data structure after;
      • dump in-memory data into local disk when it reaches threshold;
      • write is sequential and indexed for easy lookup;
      • periodically merging files;
      • read operations:
        • read memory before disk;
        • bloom filter for key checking before read;
        • maintain column indices to jump to the right chunk;
    • Implementation details:

      • Cassandra process on a single machine consists of:
        • partitioning module;
        • cluster membership and failure detection module;
        • storage engine module;
      • modules rely on an event driven substrate where the message processing and task pipelines are split into multiple stages along the line of SEDA;
      • system control messages are over UDP while the application related messages are over TCP;
      • states for read/write request:
        • identify the nodes that own the data for the key;
        • route the request to the nodes and wait on response;
        • return to the client if timeout;
        • figure out the latest response based on timestamp;
        • schedule a repair of the data if necessary;
      • purging commit log;
        • fast sync mode: write to buffered log with potential risk of data loss;
        • morphs all writes to disk into sequential writes and lockless to maximize the write throughput;
  6. Practical experiences




Paper Review: Dynamo Amazon’s Highly Available Key-value Storage

Paper Review:

Amazon Dynamo is a fully distributed key-value pair data store. The balancing the partitioning is achieved with consistent hashing with sophisticated load assigning strategy and read/write operations are done with quorum-like voting.

Strong points:

  1. A fully distributed system is always better in scalability and maybe availability. A master failure would definitely have negative effect on availability, which is the main focus in Dynamo.
  2. Key feature “always available write” is really attractive. The quorum-like technique used in Dynamo offers the best availability so far and leaves the version conflicts to users to solve (well at least it returns your request somehow).
  3. Flexible and user-configurable features are shown in Dynamo like N, R, W values and different storage engine support.

Weak points:

  1. Complex distributed coordination/control. First they have a hash function to assign load to all the nodes; then they came up with the concept of virtual nodes which balance the load distribution; and then they worries about the joining and leaving of nodes accidentally or not; and then they realize there are still different assigning strategies which result in different performance in balancing. And all these are just for balancing. I guess it’s inevitable if you want a fully distributed system.
  2. Much computation and communication to do like hashing for node location, Merkle tree calculation and recalculation if the assignment changes, failure detection, gossiping for nodes joining/leaving and quorum-based voting. The system could be really complex and requires some computation power and network bandwidth to support.
  3. Rigid data storage model comparing to Bigtable. Only key-value pairs are allowed in Dynamo which could be a let down in some cases. However this is not much of a big deal since it supports different DB engines. I’m sure there could be some alternatives on the data model but still going around the way could make things complex.
  4. Too many replicas to support the high availability. Although the number of N is configurable but the number of replicas could increase because of node failure. Before the older version is removed (which is also configurable I think), extra replicas will put more demands on the storage space and bandwidth (more communication and data transfer) both.


Paper Outline:

  1. Introduction:

    • Reliability and scalability is dependent on application state management;
    • Treats failure handling as the normal case;
    • Dynamo has very high reliability requirements and need tight control over the tradeoffs between availability, consistency, effectiveness and performance;
    • Simple interface to handle services that need primary-key access only;
    • Uses some well known techniques to achieve scalability and availability:
      • data partition and replication using consistent hashing;
      • consistency is facilitated by object versioning;
      • consistency among replicas during updates is maintained by a quorum-like technique and a decentralized replica synchronization protocol;
      • gossip-based failure detection and membership protocol;
      • completely decentralized with minimal need for manual administration;
      • storage nodes can be added/removed without any extra work;
  2. Background:

    • System assumptions and requirements:

      • query model:
        • simple read/write operations;
        • data and state is identified by unique keys;
        • no operations span multiple data items;
        • no need for relational schema;
      • ACID properties:
        • data stores provide ACID guarantees tend to have poor availability;
        • targets applications that operate with weaker consistency;
        • provides no isolation guarantees;
      • Efficiency:
        • needs to function on a commodity hardware infrastructure;
        • services must be able to configure Dynamo such that they consistently achieve their latency and throughput;
      • safe environment;
    • Service level agreement:

      • SLA is a formally negotiated contract where a client and a service agree on several system-related characteristics;
      • SLAs are expressed and measured at the 99.9th percentile of the distribution, which could provides better overall experience;
        • a common approach in the industry for forming a performance oriented SLA is by using average, median and expected variance;
    • Design considerations:

      • replication algorithms force to trade off the availability;
        • strong consistency and high availability cannot be achieved together;
      • availability can be increased by using optimistic replication techniques;
        • replicas are propagated in the background;
        • might lead to conflicting changes and need conflict resolution;
      • Dynamo is an always writable data store;
        • some other data store resolve the conflicts during writes and might reject the write operation if it cannot reach all replicas;
      • applications are more suitable for conflict resolution:
        • they know what kind of data can be chosen;
        • data store has limited information but will solve the conflict by “last write wins” policy if application doesn’t take care of it;
      • other key principles:
        • incremented scalability;
        • symmetry;
          • every node in Dynamo should have the same responsibility;
        • decentralization;
        • heterogeneity;
  3. Related work:

    • Peer to peer system:

      • first generation P2P system like Freenet and Gnutella:
        • mainly used as file sharing systems;
        • searching requires flood through the network;
      • next generation P2P structured networks like Pastry and Chord:
        • global consistency protocol to support routing and searching;
      • many storage systems built on these routing overlays:
        • with conflict resolution to handle the concurrent updates;
    • Distributed file systems and databases:

      • typically support hierarchical namespaces;
      • guarantee eventual consistency;
      • traditional replicated relational database systems focus on consistency to provide a conventional programming model by limiting the scalability and availability;
    • Discussion:

      • Dynamo is different:
        • target applications that require “always writable” availability;
        • built for with a single administrator and trusted nodes;
        • no hierarchical namespaces and complex schema;
        • built for latency sensitive applications;
  4. System architecture:

    • Summary of the techniques used in Dynamo and their advantages: 20160203084012

    • System interface:

      • get(key): returns a single object or a list of them with conflicting versions along with a context;
      • put(key, context, object): writes the object with associated key to the disk and context resembles metadata about the object such as version;
    • Partitioning algorithm:

      • partitioning scheme replies on consistent hashing;
        • to distribute load across multiple storage hosts;
        • the output range of a hash function is treated as a circle and each node in the system is assigned a random position on the ring;
        • each data item identified by a key is assigned to a node by hashing the data item’s key to yield its position on the ring and then walking the ring clockwise to find the first node with a position larger than the node’s position;
        • thus, each node becomes responsible for the region in the ring between it and its predecessor node on the ring;
      • the principle advantage of consistent hashing is that departure or arrival of a node only affects its immediate neighbors;
      • some challenges and solutions in the algorithm:
        • random position assignment leads to non-uniform distribution;
        • the basic algorithm is oblivious to the heterogeneous nodes;
        • Dynamo uses a variant of consistent hashing and mapping multiple positions in the ring to each node;
        • each node can be responsible for multiple positions (virtual nodes) in the ring; advantages if virtual nodes are:
          • if a node fails, the load is dispersed across the ring;
          • a new node accepts a roughly equivalent amount of load from old nodes;
          • the number of virtual nodes responsible for a single machine is decided on its capacity;
    • Replication:

      • data is replicated on N hosts where N is a parameter;
      • each key is assigned to a coordinator node:
        • coordinator is responsible for the data replicas within a range;
        • it replicates these keys at the N-1 clockwise successor nodes;
      • the list of nodes that is responsible for storing a particular key is called a preference list;
        • the list contains more than N nodes to account for failure;
        • skip virtual node positions in the ring to ensure the list only contains distinct physical nodes;
    • Data versioning:

      • updates are propagated to all replicas asynchronously;
      • in case of failures, write requests may be handled by nodes not in the top N nodes in the list and causes the size of vector clock to grow;
      • clock truncation scheme:
        • each (node, counter) pair has a timestamp;
        • when the number of pairs reaches a threshold, the oldest pair is removed from the clock;
        • this scheme could be ineffective because the descendant relationships cannot be derived accurately;
    • Execution of get and put operations:

      • operations are invoked using Amazon request processing framework;
      • two strategy for a client to select a node:
        • route its request through a generic load balancer;
        • use a partition-aware client library that routes the request to the appropriate coordinator nodes with lower latency;
      • a node handling a read/write operation is known as the coordinator;
        • coordinator is usually the first among the top N nodes;
      • Read/write operations involve the first N healthy nodes;
      • Dynamo uses a consistency protocol similar to quorum systems:
        • R is the minimum number of nodes that must participate in a successful read operation;
        • W is the same as R except it’s for write operation;
        • R+W > N yields a quorum-like system;
      • During the writing, the coordinator writes the data locally and send to N highest-ranked reachable nodes and consider this operation successful if more than W-1 nodes respond;
      • during the reading, the coordinator requests all existing versions of data for N highest-ranked reachable nodes and waits for R responses;
        • return all the versions it deems to be causally unrelated;
    • Handling failures: hinted handoff:

      • sloppy quorum is used to increase the availability:
        • all read/write operations are performed on the first N healthy nodes, which may not always be the first N nodes in the ring;
    • Handling permanent failures: replica synchronization:

      • Dynamo uses Merkle trees to detect inconsistency and minimize the data transferred for synchronization
        • disadvantage is that many key ranges change when a node joins or leaves and the trees need to be recalculated;
    • Membership and failure detection:

      • ring membership:
        • a node outage rarely signifies a permanent departure and therefore should not result in rebalancing or repair;
        • an administrator adds/removes a node and a permanent history is kept by the node;
        • a gossip-based protocol guarantees eventual consistent view of membership;
      • external discovery:
        • seeds are nodes that are discovered via an external mechanism and are known to all nodes;
      • failure detection:
        • decentralized failure detection protocols use a simple gossip-style protocol that enables each node in the system to learn about the arrival/departure of other nodes;
    • Adding/remove storage nodes:

      • new nodes will be assigned with a number of tokens that are randomly scattered on the ring;
      • confirmation upon adding/removing nodes to:
        • distribute the load uniformly;
        • meet the latency requirements;
        • ensure fast bootstrapping;
  5. Implementation:

    • Each storage node has three main software components:
      • request coordination;
      • membership and failure detection;
      • local persistence engine;
        • allows for different storage engines to be plugged in;
    • The request coordination component is built on top of an event-driven messaging substrate where the message processing pipeline is split into multiple states;
    • Request load is not uniformly distributed:
      • any node in top N could coordinate the writes;
      • always picked the fastest one in the previous read;
      • increasing the chances of getting “read-your-writes” consistency;
  6. Experiences & lessons learned

    • Ensuring uniform load distribution:
      • T random tokens per node and partition by token value;
      • T random tokens per node and equal sized partitions;
      • Q/S tokens per node, equal-sized partitions;