Paper Review:
Mesa is another of Google's data stores, layered on top of Colossus and Bigtable and designed primarily for storing ads campaign data. It is a versioned key-value store with aggregation that scales horizontally and is replicated globally. Metadata is updated consistently with Paxos, while the data itself is batched and transferred every few minutes.
Strong points:
- In the section “Experiences and Lessons Learned”, layered design is called out as a key design principle of Google products. Mesa is nicely layered and decoupled in both the horizontal and vertical directions. It is built on Colossus and Bigtable, which is why the paper says little about low-level read/write concerns; those are handled by the underlying layers. Within the architecture there are workers/servers, controllers, and global services, and data maintenance/updates are decoupled from queries. This layering may add some overhead, but it enables a clean design, easier problem isolation, and more detailed performance analysis.
- The resume key used with streaming transmission is interesting. A failed or disconnected query server does not waste the client's time, because the read can be continued on another query server from the resume key instead of re-issuing the whole query (a small sketch follows this list).
- Parallelizing worker operations with MapReduce and the linked schema change are both good ideas. MapReduce can save days of computation time, and the linked schema change, while not applicable in every scenario and at the cost of some extra computation on the query path, uses about 50% less disk space than the naive schema change (also sketched after this list). That matters a great deal given how much storage Mesa requires in the first place.
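A minimal sketch of how the resume-key idea mentioned above might look from the client's side (the server object, its read_block method, and the parameters are my own invention, not Mesa's actual API): the client keeps the last resume key it received, so a different query server can continue the stream instead of restarting the query.

```python
# Hypothetical sketch of a resume-key streaming read (not Mesa's real API).

def run_query(servers, predicate, version, block_size=1000):
    """Stream results; on server failure, continue on another server from resume_key."""
    resume_key = None            # opaque marker of how far the stream has progressed
    results, i = [], 0
    while True:
        server = servers[i % len(servers)]
        try:
            block, resume_key, done = server.read_block(
                predicate, version, resume_key, block_size)
            results.extend(block)
            if done:
                return results
        except ConnectionError:
            i += 1               # fail over; the stream resumes, it is not restarted
```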
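And a rough illustration of what I take the linked schema change to mean (column names, defaults, and translate_row are purely illustrative): the data stays on disk in the old schema, and rows are translated to the new schema on the query path instead of being rewritten.

```python
# Illustrative only: serve a new schema by translating old-schema rows at query
# time, so the stored data never has to be rewritten.

OLD_COLUMNS = ["date", "campaign_id", "clicks"]
NEW_COLUMNS = ["date", "campaign_id", "clicks", "conversions"]   # column added
NEW_DEFAULTS = {"conversions": 0}

def translate_row(old_row):
    """Map an old-schema row to the new schema, filling defaults for new columns."""
    row = dict(zip(OLD_COLUMNS, old_row))
    return [row.get(col, NEW_DEFAULTS.get(col)) for col in NEW_COLUMNS]

print(translate_row(("2014-07-01", 42, 17)))   # ['2014-07-01', 42, 17, 0]
```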
Weak points:
- The controllers assign different types of tasks to different types of workers, and worker failures are eventually caught by the timer the controller maintains for each task. However, these timers could add significant work to the controllers when many tasks run simultaneously. Worker failures can also add long latency to the affected work, since nothing happens until the timer expires and the task is reassigned. This latency could be reduced if controllers and workers exchanged heartbeat messages, so that worker failures are detected well before the timer runs out (a toy sketch follows this list).
- Each query server is assigned its own range of data to take advantage of prefetching and caching. However, this assignment can be inflexible when many queries hit similar data: a small number of query servers end up fetching from Colossus for a long time while the rest sit idle. Assuming reads from Colossus are lock-free and scale well, this is both a performance bottleneck and a waste of resources. Also, on the topic of querying, the global locator service is described only vaguely in the paper; I assume it is a stateless process running alongside the controllers, with its data replicated in Bigtable as well.
- The replication mechanism does not make much sense to me. Paxos only ensures that metadata is replicated consistently across a majority of Mesa instances, while data replication lags behind without any strong guarantee. So for each Mesa instance (datacenter), the metadata could be out of date if the instance fails during the Paxos round; and even if Paxos succeeds, there is little guarantee about the consistency of the actual data at that instance.
- There are two methods of data validation: an online one that re-aggregates rows and checks for computation errors, and an offline one, a lightweight process that spans recently committed data. I see two problems with corruption recovery: 1) the online checks are performed on every update and query, which can mean unnecessary checking and increased latency and load; 2) because data is replicated asynchronously, if the freshly updated copy gets corrupted there may be no other replica yet that can help with recovery.
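To make the heartbeat suggestion from the first weak point concrete, here is a toy failure detector (entirely my own suggestion, not something Mesa does): the controller records the last heartbeat per worker and reassigns that worker's tasks after a few missed heartbeats, rather than waiting for each task's timer to expire.

```python
import time

# Toy heartbeat-based failure detector (my suggestion, not Mesa's mechanism).
HEARTBEAT_INTERVAL = 5.0       # seconds between expected heartbeats
MISSED_BEFORE_DEAD = 3         # reassign tasks after this many missed heartbeats

last_heartbeat = {}            # worker_id -> timestamp of the last heartbeat seen

def on_heartbeat(worker_id):
    last_heartbeat[worker_id] = time.time()

def dead_workers(now=None):
    """Workers whose tasks should be reassigned right away."""
    now = now if now is not None else time.time()
    limit = HEARTBEAT_INTERVAL * MISSED_BEFORE_DEAD
    return [w for w, t in last_heartbeat.items() if now - t > limit]
```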
Notes:
Data is partitioned and replicated horizontally to achieve scalability and availability. (does this imply NoSQL?)
Multi-version key-value data store with Paxos for consistency
Leverages Bigtable and the Paxos technology underlying Spanner for metadata storage and maintenance.
Asynchronous replication for application data; synchronous replication for metadata.
Handles corruption caused by both software and hardware.
A query to Mesa consists of a version number n and a predicate P on the key space. The response contains the rows matching P, with their values aggregated over all updates from version 0 up to n.
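A toy model of that query semantics under my reading (the in-memory layout and the SUM aggregation are my own simplification): the caller passes a key predicate and a version n, and gets back the matching rows aggregated over all updates with version <= n.

```python
from collections import defaultdict

# Toy model of the query semantics (layout and names are mine, not Mesa's).
# Each update is (version, key, value); values aggregate by SUM here.
updates = [
    (1, ("campaign_A", "2014-07-01"), 10),
    (2, ("campaign_A", "2014-07-01"), 5),
    (3, ("campaign_B", "2014-07-01"), 7),
]

def query(predicate, n):
    """Return rows matching the predicate, aggregated up to version n inclusive."""
    acc = defaultdict(int)
    for version, key, value in updates:
        if version <= n and predicate(key):
            acc[key] += value
    return dict(acc)

print(query(lambda key: key[0] == "campaign_A", n=2))
# {('campaign_A', '2014-07-01'): 15}
```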
Strict ordering of updates ensures atomicity and allows negative facts (e.g., retracting fraudulent clicks) to be applied correctly.
Mesa pre-aggregates data across a range of versions (between v1 and v2 inclusive) and calls the result a delta. Base compaction merges versions into a base delta [0, B], after which versions before the base are no longer individually accessible. This kind of idea is common (similar in spirit to append-only GFS), but it could be problematic if old-version data is still useful. Keeping older data in versioned form is expensive because of the aggregation work involved, so compaction is necessary to clean things up, yet how many versions to keep is hard to determine for different kinds of data. Since Mesa is primarily used for ads campaign data, which has uniform and specific requirements for storage and aggregation, this issue doesn't seem to matter much here.
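A rough sketch of deltas and base compaction as I understand them (a delta here is just a version range plus a dict of pre-aggregated rows, with SUM as the aggregation function): merging deltas pre-aggregates their rows, and base compaction folds everything up to some version B into one base delta, after which the individual versions below B can no longer be queried.

```python
from collections import defaultdict

# Illustrative delta model: a delta is ((v_lo, v_hi), {key: aggregated_value}).

def merge_deltas(deltas):
    """Merge deltas covering adjacent version ranges into one pre-aggregated delta."""
    lo = min(rng[0] for rng, _ in deltas)
    hi = max(rng[1] for rng, _ in deltas)
    rows = defaultdict(int)
    for _, data in deltas:
        for key, value in data.items():
            rows[key] += value           # SUM is the aggregation function here
    return ((lo, hi), dict(rows))

singletons = [((1, 1), {"A": 10}), ((2, 2), {"A": 5, "B": 7}), ((3, 3), {"B": 1})]
base = merge_deltas(singletons)          # base compaction of versions [1, 3]
print(base)                              # ((1, 3), {'A': 15, 'B': 8})
```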
Each table has one or more indexes, and each table index has its own copy of the data, sorted in that index's order for fast search. There is also an index file containing short keys for the row blocks, which makes locating the right row block faster.
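A small sketch of how that short-key index file might be used (the format and names are guesses on my part): binary-search the short keys to find the candidate row block, then scan only that block.

```python
import bisect

# Toy index file: short_keys[i] is a short key for the start of row_blocks[i].
short_keys = ["a", "f", "m", "t"]
row_blocks = [["apple", "cat"], ["fish", "goat"], ["mouse", "pig"], ["tree", "zebra"]]

def lookup(key):
    """Find the one row block that may contain `key`, then scan just that block."""
    i = bisect.bisect_right(short_keys, key) - 1
    if i < 0:
        return None
    return key if key in row_blocks[i] else None

print(lookup("mouse"))   # 'mouse'
print(lookup("lion"))    # None (falls in the 'f' block, which doesn't contain it)
```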
Metadata is stored in Bigtable and in the controller's memory. Each datacenter has one controller (I assume), and it does not interact with the tables directly. Data maintenance is broken into different task types such as updates, compaction, schema changes, and checksums; note that the last two require coordination among different Mesa instances (datacenters). Workers of different types poll the controller for tasks of their own type, and each assigned task carries a timer so that a failed worker won't stall the system. The controller is also sharded and stateless, with all metadata consistently stored in Bigtable, so Mesa is resilient to controller failures. A garbage collector runs separately and continuously; it reads the metadata in Bigtable and deletes unwanted files in Colossus.
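A toy version of the controller/worker pattern described above (all names and the in-memory queues are mine): workers of a given type poll for tasks of that type, each assignment carries a deadline, and expired tasks go back on the queue so a failed worker cannot stall the system.

```python
import time
from collections import deque

# Toy task queue with per-task deadlines (my own illustration of the pattern).
TASK_TIMEOUT = 60.0                       # seconds before a task is reassigned

pending = {"compaction": deque(["merge deltas 1-10"]),
           "update": deque(["apply batch 7"])}
in_flight = {}                            # task -> (task_type, worker_id, deadline)

def poll(worker_id, task_type):
    """Called by a worker of `task_type`; hands out one task with a deadline."""
    reclaim_expired()
    if pending[task_type]:
        task = pending[task_type].popleft()
        in_flight[task] = (task_type, worker_id, time.time() + TASK_TIMEOUT)
        return task
    return None

def reclaim_expired():
    """Re-queue tasks whose deadline has passed; their worker is presumed dead."""
    now = time.time()
    for task, (task_type, _, deadline) in list(in_flight.items()):
        if now > deadline:
            del in_flight[task]
            pending[task_type].append(task)
```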
Mesa handles queries with different requirements using different labels and priorities. Any query server can access any table in principle, but Mesa directs queries over the same range of data to a subset of the query servers to take advantage of prefetching and caching. A global locator service is used to direct clients to the right query servers.
Updates are batched and applied to Mesa instances once every few minutes. A stateless global committer assigns each update batch a version number and uses Paxos to keep the commit metadata consistent. The controller in each Mesa instance is responsible for incorporating the update batches and acknowledging the committer; there is no locking. Note that the metadata is replicated synchronously via Paxos, while the data itself is incorporated asynchronously by the various Mesa instances.
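A very rough sketch of that commit flow as I read it (names are mine, and Paxos is abstracted away as an already-replicated metadata log): the committer assigns the next version number to a batch and records that assignment, and each Mesa instance later pulls the log and incorporates the batch data at its own pace.

```python
# Toy model of the commit flow (names mine; Paxos abstracted as a replicated log).

metadata_log = []                  # list of (version, batch_id); Paxos-replicated in reality

def commit(batch_id):
    """Global committer: assign the next version number to a batch and record it."""
    version = len(metadata_log) + 1
    metadata_log.append((version, batch_id))
    return version

class MesaInstance:
    """Each datacenter incorporates committed batches asynchronously."""
    def __init__(self):
        self.applied_up_to = 0     # highest version whose data is locally incorporated

    def catch_up(self, fetch_batch):
        for version, batch_id in metadata_log:
            if version > self.applied_up_to:
                fetch_batch(batch_id)          # pull the batch data, e.g. from Colossus
                self.applied_up_to = version

commit("batch-0007")
dc = MesaInstance()
dc.catch_up(lambda b: print("incorporating", b))
```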
New Mesa instances bootstrap using a peer-to-peer load mechanism.