Paper Review: Megastore: Providing Scalable, Highly Available Storage for Interactive Services

Summary:

Megastore is Google’s solution for multi-datacenter storage, layered on top of Bigtable. It has a semi-relational data model and uses optimized Paxos for transactions within an entity group and asynchronous message queuing for transactions across entity groups. It offers relatively low write throughput in exchange for strong consistency (across entity groups only if using 2PC) and high availability (it keeps working as long as a majority of the voting replicas, including witness replicas, are still alive).

Strong points:

Most criticisms of Bigtable focus on its NoSQL data model, which slows down development and increases the cost of maintenance. Megastore maps its semi-relational data model onto Bigtable to counter some of those disadvantages.
I found the variety of read operations quite interesting: a current read is scoped to a single entity group and first makes sure that all previously committed writes have been applied; a snapshot read is also scoped to a single entity group but reads at the timestamp of the last fully applied transaction, so it can miss writes that are committed but not yet applied; an inconsistent read ignores the state of the log and simply returns the latest values, which can save a lot of time. This lets applications on Megastore trade off latency against consistency according to their own requirements.
The witness replica is a brilliant idea for improving durability and availability without adding much bandwidth consumption or storage space. Witnesses help when there are only a few full replicas, where a single failure or network partition might otherwise block Paxos. On the other hand, read-only replicas are great for distributing data across multiple datacenters without slowing down Paxos writes.
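
To make the witness idea concrete, here is a minimal Python sketch of the majority rule as I understand it. The names `Replica`, `kind` and `can_commit` are my own for illustration and are not Megastore APIs; the sketch only counts votes and does not model the log itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Replica:
    name: str
    kind: str          # assumed labels: "full", "witness" or "read_only"
    alive: bool = True

    @property
    def votes(self) -> bool:
        # Full and witness replicas vote in Paxos; read-only replicas do not.
        return self.kind in ("full", "witness")


def can_commit(replicas) -> bool:
    """A write can proceed only if a strict majority of voters is alive."""
    voters = [r for r in replicas if r.votes]
    alive = sum(1 for r in voters if r.alive)
    return alive > len(voters) // 2


# Two full replicas plus one witness survive the loss of one full replica.
with_witness = [
    Replica("dc-a", "full"),
    Replica("dc-b", "full", alive=False),
    Replica("dc-w", "witness"),
]
print(can_commit(with_witness))
```

With only the two full replicas (or with a read-only replica in place of the witness), the same failure leaves one live voter out of two, so `can_commit` returns `False`.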

Weak points:

I understand that Paxos is one of the key features Megastore uses to implement ACID transactions without a heavyweight master or data loss on failure. But could this strategy end up with lower throughput and too many coordination messages between replicas? It seems to me that Paxos would add significantly more network round trips to a single transaction than other replication strategies, even with the optimizations they made. The Spanner paper also notes Megastore's relatively poor write throughput.
Entity groups can be seen as a logically hierarchical way to organize the data. Transactions within an entity group use Paxos, while transactions across entity groups usually go through asynchronous messaging. Paxos offers strong consistency but adds communication overhead to each operation, while asynchronous messaging is fast but lacks consistency guarantees. This works when the data model aligns with the hierarchy, as in email and blogs, where each user is a natural entity group and cross-group transactions are rare and do not really require strong consistency. But what if our data model is flat? Or what if it requires things to be stored together for faster retrieval but also requires a high level of consistency for each entity? Then the whole system could be either expensive, due to the 2PC overhead, or not consistent enough. I assume that Megastore made some assumptions about the data model prior to the design. (Also, since entity groups have very different sizes, could they become problematic during load balancing?)
Megastore is built on Bigtable, which in turn relies on GFS, SSTable and Chubby for basic storage management and coordination. While the philosophy of decoupling and layered design is great for debugging, system analysis and faster development, the coordination overhead across the layers is likely to be significant.
Queuing for transactions between multiple groups is asynchronous, which means that consistency is not guaranteed within any bounded period of time. On top of that, a buffer overflow or node failure could directly result in message loss, at which point inconsistency is inevitable. The alternative to queuing is 2PC, which is far more complex and whose communication overhead hurts performance. It’s always nice to have an alternative though.

Paper Notes:

Introduction:
- Since requirements like scalability, rapid development for users, low latency, consistency and high availability in fact conflict with each other, Megastore picks a middle ground between RDBMS and NoSQL:
  - the datastore is partitioned and each partition is replicated, with full ACID semantics within a partition but only limited consistency guarantees across partitions;
  - traditional database features are supported if they can scale within tolerable latency limits and are compatible with the partitioning scheme;
- Paxos is optimized in Megastore for low-latency operations and used for a variety of things, including primary user data replication;
Toward availability and scale:
- Common strategies:
  - asynchronous master/slave (data loss on failures);
    - the master ACKs a write while propagating it to slaves in parallel;
    - risks downtime or data loss during failover to a slave;
    - requires a consensus protocol to mediate mastership;
  - synchronous master/slave (heavyweight master);
    - master waits on slaves before ACK;
    - master/slave failures need external detection;
  - optimistic replication (no ACID transaction):
    - any member of a homogeneous replica group can accept mutations;
    - asynchronously propagated through the group;
    - global mutation ordering is not known at commit time so transactions are impossible;
- Enter Paxos:
  - any node can initiate reads and writes. Each log append blocks on acknowledgments from a majority of replicas, and the rest catch up as they are able;
  - multiple replicated logs, each governing its own partition of the data set, improve throughput (distant nodes are less likely to contend on a single log) and availability (a log that cannot get a majority of ACKs blocks only its own partition);
- Entity groups:
  - data is partitioned into entity groups;
    - single-phase ACID transactions within an entity group via Paxos;
    - transactions across entity groups go through either expensive 2PC or asynchronous message queuing (looser consistency);
- Physical layout:
  - each datacenter is a Bigtable instance;
  - minimize the latency by letting applications control data placement;
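
The "Enter Paxos" bullets above can be sketched roughly as follows. This is a deliberately simplified illustration in Python, not real Paxos: there are no proposal numbers or conflict resolution, and `ReplicaLog`, `append` and `catch_up` are hypothetical names of my own.

```python
class ReplicaLog:
    """One replica's copy of a single entity group's log (hypothetical)."""

    def __init__(self, reachable=True):
        self.reachable = reachable
        self.entries = {}  # log position -> value

    def accept(self, position, value):
        # An unreachable replica simply never acknowledges.
        if not self.reachable:
            return False
        self.entries[position] = value
        return True


def append(replicas, position, value):
    """Succeed only if a strict majority of replicas acknowledges."""
    acks = sum(1 for r in replicas if r.accept(position, value))
    return acks > len(replicas) // 2


def catch_up(lagging, source):
    """Copy entries the lagging replica missed while it was unreachable."""
    for position, value in source.entries.items():
        lagging.entries.setdefault(position, value)


a, b, c = ReplicaLog(), ReplicaLog(), ReplicaLog(reachable=False)
print(append([a, b, c], 0, "create blog post"))  # two of three acknowledge

c.reachable = True
catch_up(c, a)  # the slow replica catches up as it is able
print(c.entries)
```

Because each entity group would have its own set of `ReplicaLog` objects, a group whose replicas cannot reach a majority does not block appends to any other group.
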
A tour of Megastore:
- API design:
  - normalized relational schemas are not used in Megastore because:
    - high-volume interactive workloads benefit more from predictable performance than from an expressive query language;
    - reads dominate writes, so it pays to move work from read time to write time;
    - storing and querying hierarchical data is straightforward in simple key-value data stores like Bigtable;
  - (I have no background on databases so I’ll leave this section for later);
- Data model:
  - entity group root tables and child tables;
  - each entity is mapped into a single Bigtable row;
  - a local index is treated as a separate index for each entity group;
  - a global index spans entity groups and is used to find entities without knowing in advance the entity groups that contain them;
  - the STORING clause copies extra properties into index entries for faster access at read time;
  - mapping to Bigtable:
    - the root entity is stored as a single row, which allows atomic updates;
- Transactions and concurrency control:
  - Bigtable can store multiple values in the same row/column with different timestamps;
  - reads and writes don’t block each other;
  - Megastore provides current, snapshot and inconsistent reads;
  - write transaction life cycle: read, application logic, commit, apply, clean up. Note that when several writers compete for the same log position, only one wins the Paxos round; the others abort and retry;
- Other features:
  - periodic full snapshots;
  - optional data encryption;
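
The three read types and the "only one write wins" rule from the transaction bullets above can be illustrated with a small single-replica Python sketch. Everything here is an assumption made for illustration: `EntityGroup` and its methods are names I invented, log positions double as timestamps, and Paxos replication is left out entirely.

```python
class EntityGroup:
    def __init__(self):
        self.log = []       # committed mutations, one dict per log position
        self.applied = 0    # number of log entries fully applied to storage
        self.versions = {}  # key -> [(timestamp, value), ...], oldest first

    def commit(self, read_position, mutation):
        """Append at the position the transaction read; first writer wins."""
        if read_position != len(self.log):
            return False  # another writer took this position: abort, retry
        self.log.append(dict(mutation))
        return True

    def apply_pending(self):
        """Apply committed-but-unapplied log entries as new versions."""
        while self.applied < len(self.log):
            timestamp = self.applied + 1
            for key, value in self.log[self.applied].items():
                self.versions.setdefault(key, []).append((timestamp, value))
            self.applied = timestamp

    def _read_at(self, key, timestamp):
        visible = [v for ts, v in self.versions.get(key, []) if ts <= timestamp]
        return visible[-1] if visible else None

    def current_read(self, key):
        # Make sure every committed write is applied before reading.
        self.apply_pending()
        return self._read_at(key, self.applied)

    def snapshot_read(self, key):
        # Read at the last fully applied transaction; may miss newer commits.
        return self._read_at(key, self.applied)

    def inconsistent_read(self, key):
        # Ignore the log state and return whatever version is newest.
        history = self.versions.get(key, [])
        return history[-1][1] if history else None


group = EntityGroup()
group.commit(0, {"title": "v1"})
group.apply_pending()

position = len(group.log)                          # two writers read here
first = group.commit(position, {"title": "v2"})    # wins the log position
second = group.commit(position, {"title": "v2b"})  # loses and must retry

stale = group.snapshot_read("title")  # "v2" is committed but not yet applied
fresh = group.current_read("title")   # applies the log first
```

In this sketch the snapshot read returns the older value until something applies the pending log entry, which is the latency-for-freshness trade the paper offers applications.
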
Replication:
- Current reads guarantee:
  - a read always observes the last-acknowledged write;
  - after a write has been observed, all future reads observe that write;
- Brief summary of Paxos:
  - majority of replicas must be active to proceed;
  - use Paxos to replicate a transaction log and positions in the log;
- Master-based approaches:
  - writes are reduced to a single round of communication;
  - writes can be batched together to improve the throughput;
  - master failover can result in user-visible outages;
- Megastore’s approach:
  - fast reads:
    - local reads are allowed with the help of a coordinator;
    - the coordinator at each datacenter tracks the set of entity groups for which its replica has observed all Paxos writes;
  - fast writes:
    - leaders are used instead of dedicated masters, borrowing the single-round-trip optimization from master-based systems;
    - the closest replica is used as leader;
  - replica types:
    - witness replicas vote in Paxos and store the write-ahead log, without applying the log or storing the data;
    - read-only replicas store data without voting, which distributes the data without adding any write latency;
- Data structure and algorithms:
  - out-of-order proposals are acceptable;
  - catch-up runs when a replica is found to be out of date during a read;
  - read: query local, find position, catchup, validate, query data;
  - write: accept leader, prepare, accept, invalidate, apply;
- Coordinator availability:
  - failure detection:
    - an out-of-band protocol is used to detect coordinator failures;
    - coordinators obtain Chubby locks;
    - a coordinator failure is handled (its state reconstructed) quickly, with only a brief impact on reads and writes;
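
The interplay between coordinators, local reads and the invalidate step of a write can be sketched as below. This is a rough Python illustration under several assumptions of mine: `Coordinator`, `Datacenter`, `write` and `read` are invented names, a replica that misses a write is assumed to still have a reachable coordinator, and catch-up copies from the most advanced peer instead of running a majority read.

```python
class Coordinator:
    """Tracks entity groups whose local replica has seen every write."""

    def __init__(self):
        self.up_to_date = set()

    def is_up_to_date(self, group):
        return group in self.up_to_date

    def validate(self, group):
        self.up_to_date.add(group)

    def invalidate(self, group):
        self.up_to_date.discard(group)


class Datacenter:
    def __init__(self, name):
        self.name = name
        self.coordinator = Coordinator()
        self.log = {}          # entity group -> list of applied values
        self.reachable = True


def write(datacenters, group, value):
    """Apply at accepting replicas; invalidate coordinators of the rest."""
    accepted = [dc for dc in datacenters if dc.reachable]
    if len(accepted) <= len(datacenters) // 2:
        return False  # no majority, so the write does not commit
    for dc in datacenters:
        if dc.reachable:
            dc.log.setdefault(group, []).append(value)
        else:
            dc.coordinator.invalidate(group)
    return True


def read(local, peers, group):
    """Read locally when the coordinator allows it, otherwise catch up."""
    if not local.coordinator.is_up_to_date(group):
        source = max(peers, key=lambda dc: len(dc.log.get(group, [])))
        local.log[group] = list(source.log.get(group, []))
        local.coordinator.validate(group)
    entries = local.log.get(group, [])
    return entries[-1] if entries else None


a, b, c = Datacenter("a"), Datacenter("b"), Datacenter("c")
for dc in (a, b, c):
    dc.coordinator.validate("blog-42")

c.reachable = False
write([a, b, c], "blog-42", "post v1")  # c misses it and is invalidated

c.reachable = True
value_at_c = read(c, [a, b], "blog-42")  # catches up, then serves locally
```

The point of the invalidate step is that a replica which missed a write can never serve a stale local read: its coordinator forces the catch-up path first.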

Distribute and Conquer

Xiaotian Duan's Tech Blog
