Paper Review:
Summary:
Spanner is Google’s globally distributed data store with a semi-relational data model and a SQL-like query language. It uses Paxos for replication and two-phase commit for distributed transactions, and it relies on bounded real-time uncertainty (TrueTime) to provide externally consistent transactions.
Strong points:
- Spanner moves from NoSQL toward NewSQL: it keeps excellent scalability while being easier to work with thanks to its semi-relational data model and query language. Data is also versioned (with TrueTime timestamps), so clients can decide whether a read is sufficiently up-to-date.
- TrueTime is just impressive. It enables external consistency and a bunch of cool features like consistent snapshot reads across datacenters and atomic schema changes. It is like having a wall clock for all replicas with bounded uncertainty, and the uncertainty is kept small by a sophisticated design that combines GPS receivers and atomic clocks as the underlying hardware with an algorithm for liar detection;
- Data is stored in tablets and grouped into “buckets” (directories). Applications can control data locality by assigning keys carefully, which can lower latency (e.g., by choosing closer datacenters for storage);
- Dynamically controlled replication configuration is helpful when an application needs to change data placement or replication factors at runtime.
Weak points:
- Write operations still go through Paxos for consensus and two-phase commit for distributed transactions. This certainly enforces strong consistency, but a) the leader can be troublesome: leader failover may result in long waits, and b) the communication overhead is unavoidable and increases the latency of every transaction;
- TrueTime is carefully engineered, with redundant hardware and algorithms to verify its correctness. However, the number of write transactions a single Paxos group can commit in a given period is bounded by the uncertainty epsilon, and so is the system’s overall timestamp accuracy. Epsilon stems mainly from hardware error and is hard to eliminate, which means Spanner is unlikely to improve write throughput or timestamp accuracy beyond what epsilon allows;
- Since ordering is based on clock time and clock time is uncertain, there are many occasions where the system must wait until it is certain that a previous event has already happened, even when the wait exists only to make TT.after true. For example, when assigning a commit timestamp, even after all replicas have replied to the leader, the leader still has to wait until it is certain the timestamp has passed;
- If the TrueTime API is invoked with a faulty timestamp, say a read is issued at a timestamp in the future, will it block other transactions, stall, or return an error?
Paper Outline:
-
Introduction:
- at the highest level of abstraction, a globally scalable database that shards data across many sets of Paxos state machines;
- Main focus is managing cross-datacenter replicated data but also designing/implementing important database features;
- Bigtable (NoSQL) can be difficult to use for applications that have complex, evolving schemas or that want strong consistency in the presence of wide-area replication;
- Megastore (semi-relational data model) supports synchronous replication but has poor write throughput;
- Spanner evolved from Bigtable: data is stored in schematized semi-relational tables and is versioned; it provides a SQL-based query language;
- replication configuration can be dynamically controlled;
- externally consistent read/write operations;
- these features are enabled by globally meaningful commit timestamps, which are supported by the TrueTime API and its implementation;
-
Implementation:
-
Overall structure:
- a Spanner deployment is called a universe;
- Spanner is organized as a set of zones, which are the unit of administrative deployment and roughly correspond to datacenters;
- each zone has:
- one zonemaster;
- hundreds of spanservers (roughly analog to Bigtable servers);
- location proxies are used by clients to locate data;
- universe master and placement driver are singletons:
- the universe master is primarily a console that displays status information about all the zones;
- the placement driver handles automated movement of data across zones (a minimal structural sketch follows);
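A minimal sketch of this deployment hierarchy, just to make the roles concrete; the class and field names are illustrative assumptions, not Spanner’s actual API:

```python
# Illustrative model of a Spanner deployment (universe -> zones -> spanservers).
from dataclasses import dataclass, field
from typing import List

@dataclass
class Spanserver:
    name: str                       # each serves on the order of 100-1,000 tablets

@dataclass
class Zone:                         # unit of administrative deployment (~a datacenter)
    zonemaster: str                 # assigns data to spanservers
    location_proxy: str             # used by clients to locate their data
    spanservers: List[Spanserver] = field(default_factory=list)

@dataclass
class Universe:                     # one Spanner deployment
    universe_master: str            # console that displays status of all zones
    placement_driver: str           # automates movement of data across zones
    zones: List[Zone] = field(default_factory=list)
```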
-
Spanserver software stack:
- spanserver structure:
- each spanserver is responsible for between 100 and 1,000 instances of tablets, which are similar to Bigtable’s tablet abstraction;
- unlike Bigtable, Spanner assigns timestamps to data, which makes it more of a multi-version database than a key-value store;
- tablet states are stored in B-tree-like files and a write-ahead log;
- all storage happens on Colossus;
- coordination and consistency:
- a single Paxos state machine is implemented on top of each tablet;
- each state machine stores its metadata and log in its corresponding tablet;
- long-lived leaders and time-based leader leases for Paxos;
- every Paxos write is written twice: once in the tablet’s log and once in the Paxos log;
- writes must initiate the Paxos protocol at the leader, but reads access state directly from the underlying tablet at any replica that is sufficiently up-to-date;
- each Paxos leader implements a lock table for concurrency control:
- the lock table contains the state for two-phase locking (sketched after this list);
- only operations that require synchronization acquire locks;
- each Paxos leader implements a transaction manager to support distributed transactions:
- used to implement a participant leader;
- a transaction that involves only one Paxos group bypasses the transaction manager;
- for transactions that involve multiple Paxos groups:
- one of the participant groups is chosen as the coordinator;
- that group’s leader is the coordinator leader and its slaves are the coordinator slaves;
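A minimal sketch of the per-leader lock table mentioned above, assuming simple per-key shared/exclusive locks; Spanner’s real lock table maps ranges of keys to lock states and combines with wound-wait deadlock avoidance, so this is only illustrative:

```python
# Simplified two-phase-locking lock table kept by a Paxos leader (illustrative).
import threading
from collections import defaultdict

class LockTable:
    def __init__(self):
        self._mutex = threading.Lock()
        self._readers = defaultdict(set)   # key -> txn ids holding shared locks
        self._writer = {}                  # key -> txn id holding the exclusive lock

    def try_read_lock(self, txn_id, key):
        with self._mutex:
            owner = self._writer.get(key)
            if owner is not None and owner != txn_id:
                return False               # conflicts with an exclusive lock
            self._readers[key].add(txn_id)
            return True

    def try_write_lock(self, txn_id, key):
        with self._mutex:
            other_readers = self._readers[key] - {txn_id}
            owner = self._writer.get(key)
            if other_readers or (owner is not None and owner != txn_id):
                return False               # conflicts with other readers or a writer
            self._writer[key] = txn_id
            return True

    def release_all(self, txn_id):
        # Two-phase locking: a transaction releases everything at commit/abort.
        with self._mutex:
            for readers in self._readers.values():
                readers.discard(txn_id)
            for key in [k for k, v in self._writer.items() if v == txn_id]:
                del self._writer[key]
```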
-
Directories and placement:
- a directory is a bucket of contiguous keys that share a common prefix; it allows applications to control data locality by choosing keys (see the sketch after this list);
- all data in a directory shares the same replication configuration, and data is moved between Paxos groups directory by directory (while client operations are still ongoing);
- a tablet is not necessarily a single lexicographically contiguous partition of the row space; instead it is a container that may encapsulate multiple partitions of the row space, so that frequently co-accessed directories can be placed together;
- Movedir task:
- the background task moving directories between Paxos groups;
- also used to add/remove replicas to Paxos groups;
- moves data between two Paxos groups part by part in the background rather than as one big blocking transaction;
- directory is also the smallest unit whose placement can be specified;
- administrators control the number and types of replicas, and the geographic placement of those replicas;
- an application controls how data is replicated, by tagging each database and/or individual directories with a combination of those options;
- shard a directory into multiple fragments if it grows too large;
- fragments could be served by different Paxos groups;
- movedir in this case actually moves fragments, not whole directories;
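A small sketch of the directory idea described above: keys sharing a prefix form a directory, each directory maps to one Paxos group, and movedir re-points that mapping. The prefix rule, names, and the one-line movedir are illustrative assumptions (the real movedir copies data part by part in the background):

```python
# Illustrative directory-based placement: data moves directory by directory.
class Placement:
    def __init__(self):
        self.group_of = {}                       # directory -> Paxos group id

    @staticmethod
    def directory_of(key: str) -> str:
        return key.split("/", 1)[0]              # assume the prefix ends at the first '/'

    def assign(self, directory: str, group: str):
        self.group_of[directory] = group

    def movedir(self, directory: str, dst_group: str):
        # The real movedir streams data in the background and commits the final
        # hand-off transactionally; here we only flip the ownership mapping.
        self.group_of[directory] = dst_group

    def group_for_key(self, key: str) -> str:
        return self.group_of[self.directory_of(key)]

p = Placement()
p.assign("user:alice", "paxos-group-1")
p.movedir("user:alice", "paxos-group-2")         # all of alice's rows move together
assert p.group_for_key("user:alice/albums/1") == "paxos-group-2"
```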
-
Data model:
- data features for applications:
- a data model based on schematized semi-relational tables;
- also used by Megastore; simpler to manage than Bigtable;
- synchronous replication across datacenters, unlike Bigtable, which only supports eventually-consistent replication;
- a query language;
- motivated by the popularity of Dremel as an interactive data-analysis tool;
- general purpose transactions;
- motivated by complaints about the lack of cross-row transactions in Bigtable;
- running two-phase commit over Paxos mitigates the availability problems (though it is expensive to support);
- application data model:
- layered on the directory-bucketed key-value mapping;
- an application can create one or more databases in a universe;
- a database can contain an unlimited number of schematized tables;
- uses a SQL-like query language with extra features;
- Spanner data model:
- not purely relational because every table is required to have an ordered set of one or more primary-key columns;
- each table defines a mapping from the primary-key columns to non-primary-key columns;
- it lets applications control data locality through key choices;
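A tiny sketch of what this key structure buys: each table is a mapping from its ordered primary key to the non-key columns, and giving a child table a key prefixed by its parent’s key keeps related rows in one contiguous range (and hence one directory). Table and column names are made up for illustration:

```python
# Parent table: primary key (user_id,) -> non-key columns.
users = {
    ("alice",): {"email": "alice@example.com"},
}
# Child table: primary key (user_id, album_id); the shared user_id prefix keeps
# a user's albums lexicographically adjacent to the user row itself.
albums = {
    ("alice", 1): {"name": "vacation"},
    ("alice", 2): {"name": "pets"},
}
# Reading a user together with all of their albums touches one contiguous key range.
alice_albums = {k: v for k, v in albums.items() if k[0] == "alice"}
```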
-
-
TrueTime:
-
TrueTime API:
- explicitly represents time as a TTinterval with bounded uncertainty, unlike standard time interfaces, which give clients no notion of uncertainty;
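A minimal sketch of that interface, assuming a placeholder local clock and a fixed uncertainty bound; the real implementation derives the bound from the time masters described next:

```python
# Illustrative TrueTime-style interface: now() returns an interval that is
# assumed to contain absolute time.
import time
from dataclasses import dataclass

@dataclass
class TTInterval:
    earliest: float
    latest: float

class TrueTime:
    def __init__(self, epsilon_s: float = 0.004):   # assumed ~4 ms uncertainty
        self.epsilon = epsilon_s

    def now(self) -> TTInterval:
        t = time.time()                              # stand-in for the synced clock
        return TTInterval(t - self.epsilon, t + self.epsilon)

    def after(self, t: float) -> bool:
        return t < self.now().earliest               # t has definitely passed

    def before(self, t: float) -> bool:
        return t > self.now().latest                 # t has definitely not arrived
```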
-
GPS and atomic clock failure modes:
- GPS reference-source vulnerabilities:
- antenna and receiver failures;
- local radio interference;
- correlated failures;
- GPS system outages;
- atomic clock failures:
- time drift due to frequency error;
-
Master/slave implementation:
- a set of time master machines per datacenter;
- the majority have GPS receivers and are geographically separated;
- to reduce the effect of correlated failures (e.g., antenna faults or radio interference);
- their advertised uncertainty is typically close to zero;
- the rest have atomic clocks and are called Armageddon masters;
- they advertise a slowly increasing time uncertainty derived from worst-case clock drift;
- all masters regularly compare their time references against each other and cross-check against their own local clocks;
- timeslave daemon per machine:
- polls a variety of masters;
- applies a variant of Marzullo’s algorithm to detect and reject liars (see the sketch after this list);
- between synchronizations, worst-case local clock drift makes the advertised uncertainty a saw-tooth function of time;
- the uncertainty accounts for master clock uncertainty, communication delay, and local drift;
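The liar-rejection step is essentially interval intersection; here is a small sketch of the idea behind Marzullo’s algorithm (simplified, and only illustrative of how an outlier master gets ignored):

```python
# Find the time range contained in the largest number of reported intervals.
def marzullo(intervals):
    events = []
    for lo, hi in intervals:
        events.append((lo, -1))    # interval start
        events.append((hi, +1))    # interval end
    events.sort()
    best, count, best_range = 0, 0, None
    for i, (edge, kind) in enumerate(events):
        count -= kind              # start => one more agreeing source, end => one fewer
        if count > best and i + 1 < len(events):
            best, best_range = count, (edge, events[i + 1][0])
    return best, best_range

# Two consistent masters and one "liar" far away from the rest:
count, rng = marzullo([(9.8, 10.2), (9.9, 10.3), (42.0, 42.1)])
assert count == 2 and rng == (9.9, 10.2)   # the outlier is effectively rejected
```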
-
-
Concurrency control:
-
Supported operations:
- read-write transaction;
- pessimistic concurrency control; requires locking at the leader replica;
- read-only transactions;
- not simply a read-write transaction without writes; executes lock-free at a system-chosen timestamp, so incoming writes are not blocked;
- snapshot reads;
-
Timestamp management:
- Paxos leader leases:
- a long-lived leader is selected with a quorum-based vote;
- the lease can be extended on a successful write or requested again when it nears expiration;
- assigning timestamps to RW transactions:
- Spanner assigns the transaction the timestamp that Paxos assigns to the Paxos commit write;
- external consistency: if the start of a transaction T_2 occurs after the commit of a transaction T_1, then the commit timestamp of T_2 must be greater than that of T_1;
- start: the coordinator leader for a write T_i assigns a commit timestamp s_i no less than the value of TT.now().latest, computed after it receives the commit request;
- commit wait: the coordinator leader ensures that clients cannot see any data committed by T_i until TT.after(s_i) is true (see the sketch after this list);
- serving reads at a timestamp:
- every replica tracks a value called safe time t_safe, which is the maximum timestamp at which the replica is up-to-date; it can serve a read at timestamp t as long as t <= t_safe;
- assigning timestamps to RO transactions:
- executed in two phases: assign a timestamp s_read, then execute the transaction’s reads as snapshot reads at s_read;
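A minimal sketch of the start and commit-wait rules, reusing the TrueTime sketch from earlier; the prepare-timestamp handling and the polling loop are simplifications, not the actual implementation:

```python
# Illustrative coordinator-side commit using the two TrueTime rules.
import time

def commit(tt, prepare_timestamps):
    # Start rule: s_i must be >= TT.now().latest (and, under two-phase commit,
    # >= every participant's prepare timestamp).
    s_i = max([tt.now().latest] + list(prepare_timestamps))

    # ... the Paxos write of the commit record would happen here ...

    # Commit wait: do not let clients see the commit until s_i is certainly in
    # the past, so any later-starting transaction gets a larger timestamp.
    while not tt.after(s_i):
        time.sleep(0.001)
    return s_i        # now safe to reply to clients and release locks
```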
-
Details:
- read-write transactions:
- the client issues reads to the leader replica of the appropriate group, which acquires read locks and reads the most recent data;
- wound-wait is used to avoid deadlocks;
- the client sends keep-alive messages to participant leaders to keep its locks from timing out;
- commit uses Paxos together with TrueTime timestamps to enforce external consistency;
- writes are buffered at the client until commit; two-phase commit then applies the same commit timestamp on all participants;
- read-only transactions:
- read-only transactions require a scope expression that summarizes the keys that will be read by the whole transaction;
- if the scope’s values are served by a single Paxos group, the client issues the transaction to that group’s leader;
- the leader picks s_read (e.g., LastTS(), the timestamp of the last committed write) so that the last write is visible;
- multiple Paxos group read:
- a round of communication with all the leaders;
- simply read at s_read = TT.now().latest, which may wait for safe time to advance but is guaranteed to see sufficiently up-to-date values (see the sketch after this list);
- schema-change transactions:
- a generally non-blocking variant of a standard transaction;
- registers a timestamp in the future; transactions with timestamps after the schema change must block behind it;
- refinements:
- augment the lock table with metadata about prepared transactions to prevent false conflicts from holding back safe time;
- exploit disjoint leader-lease intervals to advance safe time past the last Paxos write so snapshot reads can proceed;
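A small sketch of how a read-only transaction is served under these rules: pick s_read = TT.now().latest, then let each replica answer once its safe time has caught up. The replica’s internal state and the busy-wait are illustrative assumptions:

```python
# Illustrative snapshot serving at a replica plus a client-side RO transaction.
import time

class Replica:
    def __init__(self):
        self.t_safe = 0.0        # max timestamp at which this replica is up-to-date
        self.versions = []       # (commit_timestamp, key, value) tuples

    def snapshot_read(self, key, t):
        while self.t_safe < t:   # wait until the replica is up-to-date at t
            time.sleep(0.001)
        # Return the latest version of `key` with commit timestamp <= t.
        visible = [(ts, v) for ts, k, v in self.versions if k == key and ts <= t]
        return max(visible)[1] if visible else None

def read_only_txn(tt, replica, keys):
    s_read = tt.now().latest     # a single-group read could instead use LastTS()
    return {k: replica.snapshot_read(k, s_read) for k in keys}
```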
-
-
Evaluation:
-
Microbenchmarks:
-
Availability:
-
TrueTime:
-
F1:
-
-
Related work:
-
Future work:
-
Conclusion: