Paper Review: Mesa: Geo-Replicated, Near Real-Time, Scalable Data Warehousing

Paper Review:

Mesa is another Google’s data store layered upon Colossus and Bigtable, primarily designed for ads campaign data storage. The versioned kv store with aggregation is scalable and replicated globally. The metadata is consistently updated with Paxos and the data is batched and transferred every few minutes.

Strong points:

  1. In the chapter “experiences and lessons learned”, layered design was mentioned as a key design feature of Google products. Mesa is nicely layered and decoupled in both horizontal and vertical directions. It was build up on Colossus and Bigtable so we can see there isn’t too much about the read/write topics, which are covered in Bigtable. Inside the architecture, it has workers/servers, controllers and global services. The data maintenance/update and query is also decoupled. While there might be some overhead, but it enables clear designing, easy problem identification and detailed performance analysis.
  2. The resume key with streaming transmission is interesting. This way the failed/disconnected query server won’t waste clients’ time since the read operations could be continued on another server instead of firing the query once more.
  3. Parallelizing the worker operation using MapReduce  and linked schema change are good ideas. MapReduce could save days of computation time and the schema change, while not applicable in every scenario and add some computation on the query path, saves 50% of disk space than the traditional simple schema change.  This could be a life saver since Mesa requires a lot of storage space in the first place.

Weak points:

  1. The  controllers assign different tasks to different types of workers. The failure of workers will eventually be captured by the timer maintained for each task in the controllers. However, the timers could add a lot of work to the controllers if there are many tasks running simultaneously. On the other hand, work failures could result in long latency for the corresponding queries since before the timer expires and the tasks get re-assigned. This kind of latency could be resolved if the controllers/workers exchange heartbeat messages so that the worker failures could be detected earlier before the timer runs out.
  2. Query servers are assigned with their own responsible range of data to take advantage of prefetching and caching. However, this assignment could be a little inflexible if there are a lot of queries on the similar data. This way we will only have a small amount of query servers fetching data from Colossus for a long time while the rest of servers are doing nothing. This will be a performance bottleneck and waste of resource if we assume that the read operation in Colossus is lock-free and scales well. Also since we are on the chapter of querying, the global locator service is really vague in the paper. I assume that it is a stateless process running the controllers with the data replicated in Bigtable as well.
  3. The replication mechanism does not make too much sense to me. The Paxos will only make sure that metadata is replicated consistently by majority of Mesa instance and the data replication will be left behind without any strong guarantee. So for each Mesa instance (datacenter), the metadata could be out-dated if it failed during the Paxos; and even if it works fine during Paxos, there’s not much guarantee on the consistency of the actual data.
  4. There are two methods of data validation, the online one by re-aggregating the rows and check for computation errors and the off-line one, which is a light-weight process spanning the recent committed data. There are two problems with the corruption recovery: 1) the online checking will be perform in every update and query, which could result in unnecessary checking and increased latency and load; 2) the data is replicated asynchronously. If the freshly updated copy gets corrupted, there’s no other replicas that could help with the recovery




Data is partitioned and replicated horizontally to achieve scalability and availability. (does this imply NoSQL?)

Multi-version key-value data store with Paxos for consistency

Leverage Bigtable and the Paxos tech underlying Spanner for metadata storage and maintenance.

Asynchronously replication for application data; synchronous replication for metadata.

Dealing with corruption for software and hardware

A query  to Mesa consists of a version number and a predicate P on the key space. And the response contains the corresponding P and a version number between 0 and n.

Strict ordering of updates can ensure atomicity and fraud detection of negative facts.

Mesa pre-aggregates certain versioned data (between v1 and v2 inclusively) and call it delta. Base compaction is the process of merging some of the version in to [0, B] and versions before base are no longer accessible. This kind of idea is commonly used (like append-only GFS) but it could be problematic if old-version data is still useful. Older data are stored in a versioned and expensive way in terms of the process of aggregation which means that Mesa is necessary for cleaning things up but how many versions are we going to keep is gonna be hard to determine for various kinds of data. Since Mesa is primary used for ads campaign data collection, which has uniform and specific requirements in terms of data storage and aggregation, this issue doesn’t seem to matter too much.

Each table has one or more indexes and each table index has its own copy of data that is sorted accordingly for fast search. And there is an index file containing short keys for row blocks for faster localizing the row block.

Metadata is stored in Bigtable and in the memory of controller. Each datacenter has one controller (I assume) and it does not directly interact with the tables. There are different tasks for data maintenance like updates, compaction, schema change and checksum. Note that the last two require coordination of different Mesa instances (datacenters). A set of workers of different types polling the controller for task of their own types. Each task will be assigned with a timer so failed workers won’t affect the system. The controller is also sharded and stateless with all the metadata consistently stored in Bigtable so Mesa is resilient to controller failures. A garbage collector runs separately and continuously, which reads from the metadata in Bigtable and delete the unwanted files in Colossus.

Mesa handles queries with different requirement with different labels and priorities. A set of query servers could access any table in principle but Mesa will direct  queries with the same range of data to the a subset of the query servers to take advantage of the pre-fetching and caching. Global locator service is used to coordinate the query servers.

The operations are batched in Mesa instances once every few minutes. A stateless global committer will assign each updates batch a version number and use Paxos to reinforce the data consistency. Controllers in every Mesa instance will be responsible for the updates batches and acknowledging the committer. There is no locking. Note that in this case metadata is replicated synchronously with Paxos while the data is incorporated asynchronously by various Mesa instances.

New Mesa instances will bootstrap with p2p load mechanism.



Paper Review: Spanner Google’s Globally-Distributed Database

Paper Review:


Spanner is Google’s new global data store with semi-relational data model and standard query language. It uses Paxos and 2PC for operations and uses bounded real time for external consistent transactions.

Strong points:

  1. Spanner switches from NoSQL to NewSQL (?), which is easy to work with (with semi-relational data model and query language) and excellent scalability; however, the data is also version-ed (with TrueTime timestamps) so clients can decide if the read is up-to-date.
  2. TrueTime is just impressive. It enables external consistency and a bunch of cool features like consistent snapshot read across the data centers and dynamic schema changes. It’s like having a wall clock for all the replicas with bounded uncertainty. Not to mention that the uncertainty is controlled sophisticated using GPS and atomic clocks as underlying hardware and algorithm for lair detection;
  3. Data are stored in tablets, which are also classified into different “buckets”. Applications can control the locality of data by carefully assigning keys to the data. This feature could potentially lower the latency (by choosing closer datacenters for storage);
  4. Dynamic controlled replication configuration might be helpful when the application is trying to change the data location or replication factors during the run.

Weak points:

  1. The write operations are still using Paxos for consensus and two phase commit during the transaction. It enforces strong consistency for sure but a) the master could be troublesome. Master failover might result in long waiting time and b) communication overhead is inevitable which increase the latency of every transaction;
  2. TrueTime is sophisticated designed with redundant hardware support and algorithms to verify its correctness. However, the write transactions (to a single Paxos group) performed during a period of time is bounded by epsilon and so is the system’s overall accuracy. Epsilon is caused mainly with hardware errors and hard to be eliminated, which means that Spanner is unlikely to have better writing performance or timestamp accuracy;
  3. Since the system’s ordering is based on clock time and the clock time is uncertain, there are many occasions that we have to wait till the system is definitely sure that the previous event is already done even when the waiting is simply for the purpose of making TT.after true. For example, the commit timestamp, even with all the replicas get back to leader, it still has to wait till it’s certain about the timing;
  4. If a TrueTime API is used with a faulty timestamp, say it fires a read operation in the future, will it block other transactions, or get halted, or return with an error?



Paper Outline:

  1. Introduction:

    • Globally scale database that shards data across many sets of Paxos state machines in the highest level of abstraction;
    • Main focus is managing cross-datacenter replicated data but also designing/implementing important database features;
      • Bigtable (NoSQL) can be difficult for applications with complex schema or strong consistency in the presence of wide-area replication;
      • Megastore (semi-relational data model) supports synchronous replication but has poor write throughput;
    • Spanner is evolved from Bigtable where  data is stored in schematized semi-relational tables and version-ed; it provides a SQL-based query language;
      • replications configuration can be dynamically controlled;
      • externally consistent read/write operations;
      • these features are enabled by the globally-assigned timestamps, which is supported by the TrueTime API and its implementation;
  2. Implementation:

    • Overall structure:

      • a Spanner deployment is called a universe;
      • Spanner is organized with a set of zones which are unit of administrative deployment and resemble data centers;
      • each zone has:
        • one zonemaster;
        • hundreds of spanservers (roughly analog to Bigtable servers);
        • location proxies are used by clients to locate data;
      • universe master and placement driver are singletons:
        • universe master is primary a console that displays stats info;
        • placement driver handles auto movement of data across zones;
    • Spanserver software stack:

      • spanserver structure:
        • each spanserver is responsible for 100 to 1,00 instances of tablets, which is similar to Bigtable’s tablet abstraction;
        • unlike Bigtable, Spanner assigns timestamps to data, which makes it more of a multi-version database than a key-value store;
        • tablet states are stored in B-tree-like files and a write-ahead log;
        • all storage happens on Colossus;
      • coordination and consistency:
        • a single Paxos state machine for each spanserver;
        • a state machine stores its metadata and log in corresponding tablet;
        • long-lived leaders and time-based leader leases for Paxos;
        • every Paxos writes twice: in the tablet log and in the Paxos log;
        • writes must initiate Paxos protocol at the leader but reads access state directly from the underlying tablet as long as it’s up-to-date;
        • each Paxos leader implements a lock table for concurrency control:
          • lock table contains the state of two-phase locking;
          • only operations require synchronization acquire locks;
        • each Paxos leader implements a transaction manager to support distributed transactions:
          • used to implement a participant leader;
          • transaction involves only one Paxos group will bypass the transaction manager;
          • for transactions that involves multiple Paxos groups:
            • one of the participant group is chosen as the coordinator;
            • others are referred as coordinator slaves;
    • Directories and placement:

      • a bucket of contiguous keys that share a common prefix is a directory which allows applications to control the locality of data by choosing keys;
      • all data in a directory share the same replication configuration and could only be moved directory by directory (while the client operations are still ongoing);
      • not necessarily a single lexicographically contiguous partition of the row space but  instead a container that may encapsulate multiple partitions of the row space so that directories could be put together;
      • Movedir task:
        • the background task moving directories between Paxos groups;
        • also used to add/remove replicas to Paxos groups;
        • a part-by-part background process between two Paxos groups;
      • directory is also the smallest unit whose placement can be specified;
        • administrators control the number and types of replicas, and the geographic placement of those replicas;
        • an application controls how data is replicated, by tagging each database and/or individual directories with a combination of those options;
      • shard a directory into multiple fragments if it grows too large;
        • fragments could be served by different Paxos groups;
        • movedir in this case will actually move fragments not directories;
    • Data model:

      • data features for applications:
        • a data model based on schematized semi-relational tables;
          • used by  Megastore; simpler to manage unlike Bigtable;
          • synchronous replication across datacenters unlike Bigtable which only supports eventual consistency;
        • a query language;
          • because of popularity of Dremel as an interactive data analysis;
        • general purpose transactions;
          • complaint on the lack of cross-row transactions in Bigtable;
          • two-phase commit over Paxos mitigates the availability problems (but expensive to support);
        • application data model:
          • layered on the directory-bucketed key-value mapping;
          • an application can create one or more database in a universe;
          • a database can contain unlimited schematized tables;
          • uses a SQL-like query language with extra features;
      • Spanner data model:
        • not purely relational because every table is required to have an ordered set of one or more primary-key columns;
        • each table defines a mapping from the primary-key columns to non-primary-key columns;
        • it lets applications control data locality through key choices;
  3. TrueTime:

    • TureTime

      • explicitly represents time as a TTinterval with bounded time uncertainty, which is different from standard time interface;
    • GPS and atomic clocks failure modes:

      • GPS reference-source vulnerabilities:
        • antenna and receiver failures;
        • local radio interference;
        • correlated failures;
        • GPS system outages;
      • atomic failures;
        • time drift due to frequency error;
    • master/slave implementation:

      • a set of time master machines per datacenter;
        • majority of them have GPS and geographically separated;
          • reduce the effect to failures;
          • uncertainty close to zero;
        • the rest have atomic clocks and are called Armageddon masters;
          • slowly increasing time uncertainty;
        • regularly compared against each other and local clock;
      • timeslave daemon per machine:
        • polls a variety types of masters;
        • applies a variant of Marzullo’s algorithm to detect and reject lairs;
        • worst-case local clock drift is a saw-tooth function;
          • master clock uncertainty, communication delay, local drift;
  4. Concurrency control:

    • Supported operations:

      • read-write transaction;
        • pessimistic and requires leader replication;
      • read-only transactions;
        • not a read-write transaction without write; non-blocking;
      • snapshot reads;
    • Timestamp management:

      • Paxos leader leases:
        • long-live leader is selected with a quorum-based vote;
        • lease could be extended on a successful write or near expiration;
      • assigning timestamps to RW transactions:
        • Spanner assigns timestamp that Paxos assigns to the Paxos write;
        • external consistency: if the start of a transaction T_2 is later than T_1, then the timestamp of T_2 must be greater than T_1;
        • start: the coordinator leader for a write T_i assigns a commit timestamp s_i no less than the value of;
        • commit wait: the coordinator leader ensures that clients cannot see any data committed by Ti until TT.after(s_i) is true;
      • serving reads at a timestamp:
        • every replica tracks a value called safe time t_safe, which is the maximum timestamp at which a replica is up-to-date;
      • assigning timestamps to RO transactions:
        • two-phase transaction: timestamp s_read assigning and execute snapshot reads at s_read;
    • Details:

      • read-write transactions:
        • issues read to the leader replica of the appropriate group;
        • wound-wait read recent data with timestamps;
        • keep-alive messages to the leader to maintain locks;
        • a Paxos algorithm with TrueTime to enforce consistency;
        • buffered writes until two phase commit with the same timestamps on all participants;
      • read-only transactions:
        • scope required for read-only transactions, which summarizes the keys read in the transaction;
        • contact the leader when the scope’s values are served by one Paxos group;
          • S_read ensures that last write is returned;
        • multiple Paxos group read:
          • a round of communication with all the leaders;
          • just read S_read = to see the sufficiently up-to-date values;
      • schema-change transactions:
        • generally non-blocking variant of a standard transaction;
        • reserve a timestamp in the future and block the transactions with bigger timestamps behind the schema changes;
      • refinements:
        • lock with metadata to prevent false conflicts;
        • use leader-lease interval for snapshot read after a Paxos write;
  5. Evaluation:

    • Microbenchmarks:

    • Availability:

    • TrueTime:

    • F1

  6. Related work:

  7. Future work:

  8. Conclusion:


Paper Review: Adapting Microsoft SQL Server for Cloud Computing

Paper Review:


MSSQL is the first distributed commercial SQL storage. It uses primary and secondary replicas across the datacenters, coordinated by a global partition manager. Operations are done using quorum and  Paxos consensus algorithm is used in replication and recovery.

Strong points:

  1. I guess one of the best thing about Microsoft SQL server is that it’s a SQL-based cloud storage solution, which means standard and fast development for most small companies with common data models. It supports aggregation, full-text queries, and referential constraints, views and stored procedures and most of those are not supported by custom record stores.
  2. From the content I think global partition manager is not a single machine but more of a highly-available service made with multiple nodes across the datacenters to ensure the availability.
  3. The decoupled design of layers enables upgrade without interfering with user operations. All the cluster activities, including the two-phase upgrades, are done in the layer of infrastructure and deployment services and the user won’t be able to use the new features unless the process is finished.

Weak points:

  1. The replica placement is good for avoiding heavy traffic where each server host a mix of primary and secondary partitions. Note that only primary partitions serves all the query, update and other operations (however, nearly up-to-date secondaries might be used as read-only copies). Could the consistency be an issue because the asynchronous update? Since the read-only replicas are nearly update but there’s no guarantee, the responsibility of validation of the data is given to the users. And what if the client wants to write something and the primary is far far away? The primary replica might be a good way to coordinate operations but it surely affects availability and consistency.
  2. The update to replicas will be propagated from primary replica to secondary ones, which means that if the server storing primary replica fails during the beginning process of propagation might result in loss of transferred data before one nearly up-to-date secondary replica becomes primary.
  3. Since it’s a SQL server, the scalability could be worse than NoSQL storage like MongoDB and Bigtable since the data is stored in hierarchical fashion. Also I guess MSSQL doesn’t offer dynamic schema as well.

Paper Review: Megastore Providing Scalable, Highly Available Storage for Interactive Services


Megastore is  Google’s solution for multi-datacenter storage, which layers on top pf Bigtable. It has a semi-relational data model using optimized Paxos for transactions within a entity group and asynchronous message queuing for transactions across entity groups. It has relatively low writing throughput with strong consistency (if using 2PC) and availability (works as long as majority of the replicas, including witness replicas, are still alive).

Strong points:

  1. Most criticisms Bigtable received are focusing on its NoSQL data model, which slows down the development and increases the cost of maintenance. Megastore maps its semi-relational data model into Bigtable to encounter some disadvantages we had in Bigtable.
  2. I found the variety of different read operations quite interesting: current read operation is a entity-scope read for committed writes; snapshot read operation is also entity-scope but can read uncommitted states; inconsistent read ignores the log and simply return the latest value which can save a lot of time. This kind of implementation could really make the applications on Megastore flexible in order to meet their different requirement on latency and consistency.
  3. Witness replica is a brilliant idea to enhance system durability without adding too much bandwidth consumption and storage space. They solve the cases where only a few typical replicas and any failover/network partition might end up blocking the Paxos. On the other hand, read-only replicas are also great for distributing data in multiple datacenters without slowing down the write Paxos.

Weak points:

  1. I understand that Paxos is one of the key features in Megastore to implement ACID transactions without heavyweight master or data loss on failure. But could this strategy end up with lower throughput and too many communication messages between replicas for the coordination? It seems to me that Paxos would introduce significantly more network-round-trip delays to a single transaction than any other replication strategies, even with the optimization they made on Paxos. This is also confirmed in the Spanner paper that write throughput is bad in Megastore.
  2. Entity groups could be considered as a logically hierarchical way to organize the data. Transactions within a entity group will be using Paxos while the cross-entity transactions are usually via asynchronous messaging. Paxos offers strong consistency but also leads to communication overhead in each operation while the asynchronous messaging is fast but lacks consistency guarantees. It works when our data model aligns with this kind of hierarchy, like Email and Blog where each user could be visualized as a natural entity and the cross-entity transactions are rare and not really require strong consistency but what if our data model is flat? Or our data model requires things to be store together for faster retrieval but on the other hand requires each entity to have high-level of consistency? Then the whole system could either be expensive due to the 2PC overhead for consistency or not consistent enough. I assume that Megastore has some assumptions on the data models prior to the design. (Also since the entities has very different sizes, could they become problematic during load balancing?);
  3. Megastore is built on Bigtable, which is using GFS, SSTable and Chubby as basic storage management and coordination. While the philosophy of decoupling and layer design is great for debugging, system analysis and faster development, the overhead caused by coordination is gonna be terrible。
  4. Queuing for transactions between multiple groups is asynchronous, which means that the consistency is not guaranteed in a limited period of time. On the other hand, buffer overflow or node failure could directly result in message loss and inconsistency is inevitable. The alternative of queuing is 2PC and it’s way more complex and the communication overhead is gonna hurt the performance. It’s always nice to have an alternative though.


Paper Notes:

  1. Introduction:
    • Since all the requirements like scalability, rapid development for users, low latency, consistency and highly availability are in fact conflicting with each other, Megastore picks a mid-ground between RDBMS and NoSQL:
      • datastore is partitioned and replicated with full ACID semantics but limited consistency guarantees;
      • traditional database features are supported if they can scale with tolerable latency limits and compatible with partitioning scheme;
    • Paxos is optimized in Megastore for low latency operations and used for variety of things including primary user data replication;
  2. Toward availability and scale:
    • Common strategies:
      • asynchronous master/slave (data loss on failures);
        • ACK at master and transmission at slaves in parallel;
        • risks downtime or data loss during failover to a slave;
        • requires a consensus protocol to mediate mastership;
      • synchronous master/slave (heavyweight master);
        • master waits on slaves before ACK;
        • master/slave failures need external detection;
      • optimistic replication (no ACID transaction):
        • any member of a homogeneous replica group can accept mutations;
        • asynchronously propagated through the group;
        • global mutation ordering is not known at commit time so transactions are impossible;
    • Enter Paxos:
      • any node can initiate reads and writes. Each log append blocks on acknowledgments from a majority of replicas, and the rest catch up as they are able;
      • multiple logs increase throughput (reducing possibility of distanced nodes using one log) and availability (operations won’t block when majority fails to ACK), each governing its own partition of the data set;
    • Entity groups:
      • data is partitioned into entity groups;
        • single-phase ACID transactions within an entity group via Paxos;
        • cross-entity transactions could be via expensive 2PC or asynchronous message queuing (looser consistency);
    • Physical layout:
      • each datacenter is a Bigtable instance;
      • minimize the latency by letting applications control data placement;
  3. A tour of Megastore:
    • API design:
      • normalized relational schemas are not used in Megastore  because:
        • high-volume interactive workloads benefit more from predicable performance than from an expressive query language;
        • reads dominates writes so it pays to move work from read time to write time;
        • storing and querying hierarchical data is straightforward in simple key-value data stores like Bigtable;
      • (I have no background on databases so I’ll leave this section later);
    • Data model:
      • entity group root tables and child tables;
      • each entity is mapped into a single Bigtable row;
      • local index is treated as separate indexes for each entity group;
      • global index spans entity groups, used to find entities without knowing in advance the entity groups that contain them;
      • storing clause for faster access at read time;
      • mapping to Bigtable:
        • store root entity as a single row, which allows atomically update;
    • Transactions and concurrency control:
      • store multiple values in same row/column with different timestamps;
      • reads and writes don’t block each other;
      • Megastore provides current, snapshot and inconsistent reads;
      • write operations life cycle: read, application logic, commit, apply, clean up. Note that only one write wins in one Paxos;
    • Other features:
      • periodical full snapshot;
      • optional data encryption;
  4. Replication:
    • Current reads guarantee:
      • a read always observes the last-acknowledged write;
      • after a write has been observed, all future reads observe that write;
    • Brief summary of Paxos:
      • majority of replicas must be active to proceed;
      • use Paxos to replicate a transaction log and positions in the log;
    • Master-based approaches:
      • writes are reduced to single round of communication;
      • writes can be batched together to improve the throughput;
      • master failover can result in user-visible outages;
    • Megastore’s approach:
      • fast reads:
        • local read is allowed with the help of coordinator;
        • coordinator at each datacenter tracks a set of entity groups;
      • fast writes:
        • master-based system using leaders;
        • closest replica as leader;
      • replica types:
        • witness replica vote in Paxos and store the write-ahead log without applying the log or storing the data;
        • read-only replicas only stores data without voting; distribute the data without adding any write latency;
    • Data structure and algorithms:
      • out-of-order proposals are acceptable;
      • catchup when a replica was found out of date during read;
      • read: query local, find position, catchup, validate, query data;
      • write: accept leader, prepare, accept, invalidate, apply;
    • Coordinator availability:
      • failure detection:
        • use out-of-band protocol to detect coordinator failures;
        • coordinators obtain Chubby locks;
        • coordinator failure will be handled(reconstructed) quickly without affecting read/write;


Paper Review: Dynamo Amazon’s Highly Available Key-value Storage

Paper Review:

Amazon Dynamo is a fully distributed key-value pair data store. The balancing the partitioning is achieved with consistent hashing with sophisticated load assigning strategy and read/write operations are done with quorum-like voting.

Strong points:

  1. A fully distributed system is always better in scalability and maybe availability. A master failure would definitely have negative effect on availability, which is the main focus in Dynamo.
  2. Key feature “always available write” is really attractive. The quorum-like technique used in Dynamo offers the best availability so far and leaves the version conflicts to users to solve (well at least it returns your request somehow).
  3. Flexible and user-configurable features are shown in Dynamo like N, R, W values and different storage engine support.

Weak points:

  1. Complex distributed coordination/control. First they have a hash function to assign load to all the nodes; then they came up with the concept of virtual nodes which balance the load distribution; and then they worries about the joining and leaving of nodes accidentally or not; and then they realize there are still different assigning strategies which result in different performance in balancing. And all these are just for balancing. I guess it’s inevitable if you want a fully distributed system.
  2. Much computation and communication to do like hashing for node location, Merkle tree calculation and recalculation if the assignment changes, failure detection, gossiping for nodes joining/leaving and quorum-based voting. The system could be really complex and requires some computation power and network bandwidth to support.
  3. Rigid data storage model comparing to Bigtable. Only key-value pairs are allowed in Dynamo which could be a let down in some cases. However this is not much of a big deal since it supports different DB engines. I’m sure there could be some alternatives on the data model but still going around the way could make things complex.
  4. Too many replicas to support the high availability. Although the number of N is configurable but the number of replicas could increase because of node failure. Before the older version is removed (which is also configurable I think), extra replicas will put more demands on the storage space and bandwidth (more communication and data transfer) both.


Paper Outline:

  1. Introduction:

    • Reliability and scalability is dependent on application state management;
    • Treats failure handling as the normal case;
    • Dynamo has very high reliability requirements and need tight control over the tradeoffs between availability, consistency, effectiveness and performance;
    • Simple interface to handle services that need primary-key access only;
    • Uses some well known techniques to achieve scalability and availability:
      • data partition and replication using consistent hashing;
      • consistency is facilitated by object versioning;
      • consistency among replicas during updates is maintained by a quorum-like technique and a decentralized replica synchronization protocol;
      • gossip-based failure detection and membership protocol;
      • completely decentralized with minimal need for manual administration;
      • storage nodes can be added/removed without any extra work;
  2. Background:

    • System assumptions and requirements:

      • query model:
        • simple read/write operations;
        • data and state is identified by unique keys;
        • no operations span multiple data items;
        • no need for relational schema;
      • ACID properties:
        • data stores provide ACID guarantees tend to have poor availability;
        • targets applications that operate with weaker consistency;
        • provides no isolation guarantees;
      • Efficiency:
        • needs to function on a commodity hardware infrastructure;
        • services must be able to configure Dynamo such that they consistently achieve their latency and throughput;
      • safe environment;
    • Service level agreement:

      • SLA is a formally negotiated contract where a client and a service agree on several system-related characteristics;
      • SLAs are expressed and measured at the 99.9th percentile of the distribution, which could provides better overall experience;
        • a common approach in the industry for forming a performance oriented SLA is by using average, median and expected variance;
    • Design considerations:

      • replication algorithms force to trade off the availability;
        • strong consistency and high availability cannot be achieved together;
      • availability can be increased by using optimistic replication techniques;
        • replicas are propagated in the background;
        • might lead to conflicting changes and need conflict resolution;
      • Dynamo is an always writable data store;
        • some other data store resolve the conflicts during writes and might reject the write operation if it cannot reach all replicas;
      • applications are more suitable for conflict resolution:
        • they know what kind of data can be chosen;
        • data store has limited information but will solve the conflict by “last write wins” policy if application doesn’t take care of it;
      • other key principles:
        • incremented scalability;
        • symmetry;
          • every node in Dynamo should have the same responsibility;
        • decentralization;
        • heterogeneity;
  3. Related work:

    • Peer to peer system:

      • first generation P2P system like Freenet and Gnutella:
        • mainly used as file sharing systems;
        • searching requires flood through the network;
      • next generation P2P structured networks like Pastry and Chord:
        • global consistency protocol to support routing and searching;
      • many storage systems built on these routing overlays:
        • with conflict resolution to handle the concurrent updates;
    • Distributed file systems and databases:

      • typically support hierarchical namespaces;
      • guarantee eventual consistency;
      • traditional replicated relational database systems focus on consistency to provide a conventional programming model by limiting the scalability and availability;
    • Discussion:

      • Dynamo is different:
        • target applications that require “always writable” availability;
        • built for with a single administrator and trusted nodes;
        • no hierarchical namespaces and complex schema;
        • built for latency sensitive applications;
  4. System architecture:

    • Summary of the techniques used in Dynamo and their advantages: 20160203084012

    • System interface:

      • get(key): returns a single object or a list of them with conflicting versions along with a context;
      • put(key, context, object): writes the object with associated key to the disk and context resembles metadata about the object such as version;
    • Partitioning algorithm:

      • partitioning scheme replies on consistent hashing;
        • to distribute load across multiple storage hosts;
        • the output range of a hash function is treated as a circle and each node in the system is assigned a random position on the ring;
        • each data item identified by a key is assigned to a node by hashing the data item’s key to yield its position on the ring and then walking the ring clockwise to find the first node with a position larger than the node’s position;
        • thus, each node becomes responsible for the region in the ring between it and its predecessor node on the ring;
      • the principle advantage of consistent hashing is that departure or arrival of a node only affects its immediate neighbors;
      • some challenges and solutions in the algorithm:
        • random position assignment leads to non-uniform distribution;
        • the basic algorithm is oblivious to the heterogeneous nodes;
        • Dynamo uses a variant of consistent hashing and mapping multiple positions in the ring to each node;
        • each node can be responsible for multiple positions (virtual nodes) in the ring; advantages if virtual nodes are:
          • if a node fails, the load is dispersed across the ring;
          • a new node accepts a roughly equivalent amount of load from old nodes;
          • the number of virtual nodes responsible for a single machine is decided on its capacity;
    • Replication:

      • data is replicated on N hosts where N is a parameter;
      • each key is assigned to a coordinator node:
        • coordinator is responsible for the data replicas within a range;
        • it replicates these keys at the N-1 clockwise successor nodes;
      • the list of nodes that is responsible for storing a particular key is called a preference list;
        • the list contains more than N nodes to account for failure;
        • skip virtual node positions in the ring to ensure the list only contains distinct physical nodes;
    • Data versioning:

      • updates are propagated to all replicas asynchronously;
      • in case of failures, write requests may be handled by nodes not in the top N nodes in the list and causes the size of vector clock to grow;
      • clock truncation scheme:
        • each (node, counter) pair has a timestamp;
        • when the number of pairs reaches a threshold, the oldest pair is removed from the clock;
        • this scheme could be ineffective because the descendant relationships cannot be derived accurately;
    • Execution of get and put operations:

      • operations are invoked using Amazon request processing framework;
      • two strategy for a client to select a node:
        • route its request through a generic load balancer;
        • use a partition-aware client library that routes the request to the appropriate coordinator nodes with lower latency;
      • a node handling a read/write operation is known as the coordinator;
        • coordinator is usually the first among the top N nodes;
      • Read/write operations involve the first N healthy nodes;
      • Dynamo uses a consistency protocol similar to quorum systems:
        • R is the minimum number of nodes that must participate in a successful read operation;
        • W is the same as R except it’s for write operation;
        • R+W > N yields a quorum-like system;
      • During the writing, the coordinator writes the data locally and send to N highest-ranked reachable nodes and consider this operation successful if more than W-1 nodes respond;
      • during the reading, the coordinator requests all existing versions of data for N highest-ranked reachable nodes and waits for R responses;
        • return all the versions it deems to be causally unrelated;
    • Handling failures: hinted handoff:

      • sloppy quorum is used to increase the availability:
        • all read/write operations are performed on the first N healthy nodes, which may not always be the first N nodes in the ring;
    • Handling permanent failures: replica synchronization:

      • Dynamo uses Merkle trees to detect inconsistency and minimize the data transferred for synchronization
        • disadvantage is that many key ranges change when a node joins or leaves and the trees need to be recalculated;
    • Membership and failure detection:

      • ring membership:
        • a node outage rarely signifies a permanent departure and therefore should not result in rebalancing or repair;
        • an administrator adds/removes a node and a permanent history is kept by the node;
        • a gossip-based protocol guarantees eventual consistent view of membership;
      • external discovery:
        • seeds are nodes that are discovered via an external mechanism and are known to all nodes;
      • failure detection:
        • decentralized failure detection protocols use a simple gossip-style protocol that enables each node in the system to learn about the arrival/departure of other nodes;
    • Adding/remove storage nodes:

      • new nodes will be assigned with a number of tokens that are randomly scattered on the ring;
      • confirmation upon adding/removing nodes to:
        • distribute the load uniformly;
        • meet the latency requirements;
        • ensure fast bootstrapping;
  5. Implementation:

    • Each storage node has three main software components:
      • request coordination;
      • membership and failure detection;
      • local persistence engine;
        • allows for different storage engines to be plugged in;
    • The request coordination component is built on top of an event-driven messaging substrate where the message processing pipeline is split into multiple states;
    • Request load is not uniformly distributed:
      • any node in top N could coordinate the writes;
      • always picked the fastest one in the previous read;
      • increasing the chances of getting “read-your-writes” consistency;
  6. Experiences & lessons learned

    • Ensuring uniform load distribution:
      • T random tokens per node and partition by token value;
      • T random tokens per node and equal sized partitions;
      • Q/S tokens per node, equal-sized partitions;




Paper Review: Bigtable A Distributed Storage System for Structured Data

Paper review:

This paper is about a data storage system build upon google’s own file system GFS and Paxos-based coordinator Chubby. It offers flexible storage types with great scalabilty and availability. Some of the optimizations like prefetching and multi-level caching are really impressive and useful.

strong points:

  1. just like GFS, clients are communicating directly with tablet servers for read/write operations. This could help the system to overcome the bottleneck effect brought by the centralized coordination of the master.
  2. There are two different layers of centralization in this system and they are all handled pretty well in order to survive failures. The master of tablet servers is supported by Chubby lock service which ensures that there is one master at any time to eliminate the possibility of inconsistency. The other master is located in GFS and is backup-ed periodically.
  3. The client-side prefetching more than one tablet when it reads from METADATA is brilliant. It reminds me of the implementation of computer memory. By pre-read multiple tablets, the clients are much less likely to refer to the METADATA again. Since the referring is expensive because of the network round-trip time and the limitation of the centralized metadata, we better think of ways to reduce it. Another way I was thinking about is to have a metadata cache between clients and servers. It serves just like a local DNS server. It fetches tablet row keys for clients and caches the “hot” or adjacent rows for later usage. That way clients will be referring to the cache server first before they make it to the metadata. This will save us a lot of time and resource if multiple clients are likely to use or reuse the same chunk of table. A downside of my solution is definitely the network bottleneck for the cache server since the overall bandwidth is under-utilized.  Also if clients are using completely different tables then this will only result-in one more round-trip time.

weak points:

  1. while the usage of different systems and applications (GFS, SSTable, Chubby) decouples different layers and aspects of Bigtable (GFS is the low level file storage solution, SSTable is the actually data structure, Chubby is responsible for metadata, cluster management and the stats monitoring), the interactions of systems could lead to overhead and complexity for maintenance. I wonder if it’s gonna be more efficient in performance if we build Bigtable like some extra feature of Google File System instead of combining a bunch of underlying systems together. For example, one of the most obvious performance bottleneck would be the number of tablet servers. Each tablet server interact with GFS independently as a client and as the number of tablet servers grow, the performance of GFS drops due to coordination overhead. What if the system is not build in different layers and each tablet server has their own storage and distributed replicas?
  2. memtable is a great design which offers a buffer contains recent mutations for data. However, the paper doesn’t specify the where the memtable is, so I’m going to assume it’s located in each tablet server because the master will have a lot to handle if it maintains the memtable for all. So there comes the problem, if a tablet server crushes with a full memtable, then all those mutations will be lost since memtables are stored in memory without any forms of backup. This could result in situations like users find data unchanged even if the mutation operation already finished.
  3.  compaction is used for faster transfer of a source tablet server to the target. The compaction is completed within two states to make sure that during the first phase compacting, the source server is still available for serving this tablet. However, the source tablet server could be really heavy-loaded with a lot of operations going on and the computation might be slower but the target is very likely to be light in load. So why don’t we just leave the compaction to the target server since the computation amount won’t be any different.
  4. Bigtable is supported by Google File System and GFS has its own mechanism to replicate data to handle occasional (or is it) node failure. However, in Bigtable it’s specified that one tablet is only stored in one tablet server (in 5.2 tablet assignment). So I’m not sure where they store the replications. All the tablets must be replicated since the master can reassign them upon node failure. If there are replications, how do they handle the consistency issue? Does the assigned tablet server act like a master/leader?  (So GFS is acting like a disk underlying Bigtable and all servers (master and slaves) have access to. The block assignment is more of a way to “let this tablet server handle request from index A to B” rather than store some data locally in that server. At least I think it is.)

Paper Outline:

  1. Introduction:

    • Goals achieved by Bigtable:

      • wide applicability;
      • scalability;
      • high performance;
      • high availability;
    • Bigtable resembles a database:

      • but provides a different interface;
      • does not support a full relational data model;
      • simple data model which supports dynamic control;
      • treats data like uninterpreted strings;
  2. Data model:

    • Bigtable is a sparse, distributed, persistent multidimensional sorted map;

    • Rows:

      • up to 64 KB in size;
      • each r/w operation on a row is atomic;
        • regardless of the number of columns it accesses;
      • row range:
        • dynamically allocated;
        • called tablet, the unit of distribution and load balancing;
        • read in shorter range will require communication with less machines;
          • URLs are stored reversely to group the related content together;
    • Column families

      • column keys are grouped into sets called column family;
      • column family is the basic unit of access control;
        • family is required before any data is stored under any column key;
      • data stored in the same column family is usually the same type;
      • column key:
        • unlike column family names, it doesn’t have to be printable;
    • Timestamps:

      • timestamps identify the same data of different versions in the cell;
      • 64-bit integer:
        • represent real-time in microseconds by default;
        • or any other client-explicit version representation;
        • must be unique and decreasing order so that the recent version can be read first;
      • Two settings for garbage collection:
        • only the last n versions;
        • only new-enough versions;
  3. API:

    • Provides functions for creating and deleting tables and column families;

    • Provides functions for changing cluster, table and column family metadata;

      • such as row access right;
    • Other features that allow the user to manipulate the data in more complex way:

      • supports single-row transactions;
      • allows cells to be used as integer counters;
      • supports the execution of client-supplied scripts in the address spaces of the servers;
        • in a special language developed by Google called Sawzall;
    • Can be used with MapReduce;

  4. Building blocks:

    • Uses the distributed Google File System to store logs and data files;

    • Depends on a cluster management system for scheduling jobs, managing failures  and monitoring machine stats;

      • because it runs a shared pools of machines;
    • SSTable file format is used internally to store Bigtable data:

      • provides a persistent, ordered immutable map for from keys to values;
      • SSTable contains a sequence of blocks:
        • blocks are located with indexes loaded into memory when opened;
        • binary search in-memory index and then read the data from disk;
        • optionally, SSTable can be completed mapped into memory;
    • Relies on a highly-available and persistent distributed lock service Chubby;

      • a Chubby service consists of five active replicas;
        • one of the replicas is the master that serves requests;
        • the system is active when the majority of the machines are running;
        • uses Paxos algorithm to keep its replicas consistent;
      • read/write operations are atomic;
      • the Chubby client library provides caching of Chubby files;
      • each client maintains a Chubby service session;
      • Bigtable uses Chubby for variety of tasks:
        • ensure there is one master at most at any time;
        • store the bootstrap location Bigtable data;
        • discover table services and finalize tablet server deaths;
        • store Bigtable schema information;
        • store access control lists;
  5. Implementation:

    • Three major components:

      • a library that is linked into every client;
      • one lightly loaded master server;
        • responsible for assigning tablets to tablet servers;
        • responsible for detecting the addition/expiration of tablet servers;
        • balancing load of tablet servers;
        • garbage collection of files in Google File System;
        • handles schema changes;
      • many tablet servers;
        • could be dynamically added or removed;
        • handles a set of tablets;
        • handles read and write requests;
        • splits tablets that has grown too large;
          • maintaining a size of 100-200 MB;
        • clients communicate directly with tablet servers;
          • most clients never communicate with the master;
    • Tablet location:

      • three-level hierarchy to store the tablet location information:
        • first level: a Chubby file contains location of root tablet:
          • root tablet contains the location of all tablets;
          • root tablet is the 1st METADATA tablet;
          • never split to make sure of the three-level structure;
        • second level: rest of the METADATA tablets:
          • stores the location of a tablet under a row key;
          • stores secondary information for debugging and analysis;
      • localizing:
        • client library caches tablet locations;
        • moves to the hierarchy if doesn’t know the correct location;
          • requires three network round-trips if the cache is empty;
          • at most six network round-trips if the cache is stale;
          • client library prefetch more than one line to reduce access cost;
    • Tablet assignment:

      • each tablet is assigned to one tablet server at a time;
      • the master keeps track of the set of love tablet servers and the current assignment of tablets to tablet servers;
      • Bigtable uses Chubby to keep track of tablet servers:
        • acquires an exclusive lock when a tablet server starts up;
        • the master monitors the lock directory and discovers tablet server;
        • a tablet server will lose its lock due to network partition;
        • reacquire a lock as long as the file exists;
      • master:
        • responsible for detecting the status of tablet servers:
          • periodically asking for lock status;
        • acquire a exclusive lock if the tablet server expires or is unavailable:
          • delete the file;
          • reassign the tablet;
        • kills itself if its Chubby session expires:
          • this doesn’t change the assignment of tablets to tablet servers;
        • master start-up:
          • grabs master lock in Chubby;
          • scans the servers directory in Chubby to find live servers;
          • contact with all live tablet servers for the tablet assignments;
          • scans(or add) METADATA table to learn the the set of tables;
        • table changes:
          • master initiate add/delete and merge operations to tables;
          • tablet servers initiate split operations:
            • commit the split by recording information for the new tablet in the METADATA table;
            • notifies master;
    • Tablet serving:

      • persistent state of tablet stored in GFS;
      • updates are stored in a commit log:
        • recent ones are kept in memory buffer called a memtable;
        • older ones are stored in a sequence of SSTable;
      • write operation:
        • check for validity and authorization;
        • mutation is written to commit log;
          • group commit improves the throughput of small mutations;
        • after all the writes has been committed, its contents are inserted into memtable;
      • read operation:
        • check for validity and authorization;
        • merge SSTables and memtable;
          • lexicographically sorted so it’s efficient;
    • Compaction:

      • when the memtable reaches its maximum size, it freezes and a new memtable is created. And the old one will be converted to SSTable in GFS.
        • shrinks the memory usage of the tablet server;
        • reduces the amount of data for recovery;
      • small SSTables converted from memtable will be merged periodically;
        • called major compaction;
        • product of major compaction contains no deletion info or data;
  6. Refinements

    • Locality groups:

      • clients group multiple column families together into a locality group;
      • generates a separate SSTable;
    • Compression:

      • clients can decide whether a SSTable for locality group is compressed;
      • as well as the compression format;
    • Caching for read performance:

      • two levels of cache in tablet servers;
        • scan cache is higher-level, which caches the KV pairs returned by SSTable interface to the tablet server;
        • block cache is lower-level, which caches the SSTable blocks read from GFS;
    • Bloom filters:

      • allows us to ask whether an SSTable might contain any data for a specified row/column pair;
    • Commit-log implementation:

      • append mutations to a single commit log per tablet server;
      • boost performance during normal operation;
      • but slows down recovery;
        • sort the log to avoid duplicate reads;
        • two logs to avoid blocking;
    • Speeding up tablet recovery:

      • minor compaction;
    • Exploiting immutability:

      • differentiate mutable and immutable data;
  7. Performance evaluation;