Paper Review: Bigtable A Distributed Storage System for Structured Data

Paper review:

This paper is about a data storage system build upon google’s own file system GFS and Paxos-based coordinator Chubby. It offers flexible storage types with great scalabilty and availability. Some of the optimizations like prefetching and multi-level caching are really impressive and useful.

strong points:

  1. just like GFS, clients are communicating directly with tablet servers for read/write operations. This could help the system to overcome the bottleneck effect brought by the centralized coordination of the master.
  2. There are two different layers of centralization in this system and they are all handled pretty well in order to survive failures. The master of tablet servers is supported by Chubby lock service which ensures that there is one master at any time to eliminate the possibility of inconsistency. The other master is located in GFS and is backup-ed periodically.
  3. The client-side prefetching more than one tablet when it reads from METADATA is brilliant. It reminds me of the implementation of computer memory. By pre-read multiple tablets, the clients are much less likely to refer to the METADATA again. Since the referring is expensive because of the network round-trip time and the limitation of the centralized metadata, we better think of ways to reduce it. Another way I was thinking about is to have a metadata cache between clients and servers. It serves just like a local DNS server. It fetches tablet row keys for clients and caches the “hot” or adjacent rows for later usage. That way clients will be referring to the cache server first before they make it to the metadata. This will save us a lot of time and resource if multiple clients are likely to use or reuse the same chunk of table. A downside of my solution is definitely the network bottleneck for the cache server since the overall bandwidth is under-utilized.  Also if clients are using completely different tables then this will only result-in one more round-trip time.

weak points:

  1. while the usage of different systems and applications (GFS, SSTable, Chubby) decouples different layers and aspects of Bigtable (GFS is the low level file storage solution, SSTable is the actually data structure, Chubby is responsible for metadata, cluster management and the stats monitoring), the interactions of systems could lead to overhead and complexity for maintenance. I wonder if it’s gonna be more efficient in performance if we build Bigtable like some extra feature of Google File System instead of combining a bunch of underlying systems together. For example, one of the most obvious performance bottleneck would be the number of tablet servers. Each tablet server interact with GFS independently as a client and as the number of tablet servers grow, the performance of GFS drops due to coordination overhead. What if the system is not build in different layers and each tablet server has their own storage and distributed replicas?
  2. memtable is a great design which offers a buffer contains recent mutations for data. However, the paper doesn’t specify the where the memtable is, so I’m going to assume it’s located in each tablet server because the master will have a lot to handle if it maintains the memtable for all. So there comes the problem, if a tablet server crushes with a full memtable, then all those mutations will be lost since memtables are stored in memory without any forms of backup. This could result in situations like users find data unchanged even if the mutation operation already finished.
  3.  compaction is used for faster transfer of a source tablet server to the target. The compaction is completed within two states to make sure that during the first phase compacting, the source server is still available for serving this tablet. However, the source tablet server could be really heavy-loaded with a lot of operations going on and the computation might be slower but the target is very likely to be light in load. So why don’t we just leave the compaction to the target server since the computation amount won’t be any different.
  4. Bigtable is supported by Google File System and GFS has its own mechanism to replicate data to handle occasional (or is it) node failure. However, in Bigtable it’s specified that one tablet is only stored in one tablet server (in 5.2 tablet assignment). So I’m not sure where they store the replications. All the tablets must be replicated since the master can reassign them upon node failure. If there are replications, how do they handle the consistency issue? Does the assigned tablet server act like a master/leader?  (So GFS is acting like a disk underlying Bigtable and all servers (master and slaves) have access to. The block assignment is more of a way to “let this tablet server handle request from index A to B” rather than store some data locally in that server. At least I think it is.)

Paper Outline:

  1. Introduction:

    • Goals achieved by Bigtable:

      • wide applicability;
      • scalability;
      • high performance;
      • high availability;
    • Bigtable resembles a database:

      • but provides a different interface;
      • does not support a full relational data model;
      • simple data model which supports dynamic control;
      • treats data like uninterpreted strings;
  2. Data model:

    • Bigtable is a sparse, distributed, persistent multidimensional sorted map;

    • Rows:

      • up to 64 KB in size;
      • each r/w operation on a row is atomic;
        • regardless of the number of columns it accesses;
      • row range:
        • dynamically allocated;
        • called tablet, the unit of distribution and load balancing;
        • read in shorter range will require communication with less machines;
          • URLs are stored reversely to group the related content together;
    • Column families

      • column keys are grouped into sets called column family;
      • column family is the basic unit of access control;
        • family is required before any data is stored under any column key;
      • data stored in the same column family is usually the same type;
      • column key:
        • unlike column family names, it doesn’t have to be printable;
    • Timestamps:

      • timestamps identify the same data of different versions in the cell;
      • 64-bit integer:
        • represent real-time in microseconds by default;
        • or any other client-explicit version representation;
        • must be unique and decreasing order so that the recent version can be read first;
      • Two settings for garbage collection:
        • only the last n versions;
        • only new-enough versions;
  3. API:

    • Provides functions for creating and deleting tables and column families;

    • Provides functions for changing cluster, table and column family metadata;

      • such as row access right;
    • Other features that allow the user to manipulate the data in more complex way:

      • supports single-row transactions;
      • allows cells to be used as integer counters;
      • supports the execution of client-supplied scripts in the address spaces of the servers;
        • in a special language developed by Google called Sawzall;
    • Can be used with MapReduce;

  4. Building blocks:

    • Uses the distributed Google File System to store logs and data files;

    • Depends on a cluster management system for scheduling jobs, managing failures  and monitoring machine stats;

      • because it runs a shared pools of machines;
    • SSTable file format is used internally to store Bigtable data:

      • provides a persistent, ordered immutable map for from keys to values;
      • SSTable contains a sequence of blocks:
        • blocks are located with indexes loaded into memory when opened;
        • binary search in-memory index and then read the data from disk;
        • optionally, SSTable can be completed mapped into memory;
    • Relies on a highly-available and persistent distributed lock service Chubby;

      • a Chubby service consists of five active replicas;
        • one of the replicas is the master that serves requests;
        • the system is active when the majority of the machines are running;
        • uses Paxos algorithm to keep its replicas consistent;
      • read/write operations are atomic;
      • the Chubby client library provides caching of Chubby files;
      • each client maintains a Chubby service session;
      • Bigtable uses Chubby for variety of tasks:
        • ensure there is one master at most at any time;
        • store the bootstrap location Bigtable data;
        • discover table services and finalize tablet server deaths;
        • store Bigtable schema information;
        • store access control lists;
  5. Implementation:

    • Three major components:

      • a library that is linked into every client;
      • one lightly loaded master server;
        • responsible for assigning tablets to tablet servers;
        • responsible for detecting the addition/expiration of tablet servers;
        • balancing load of tablet servers;
        • garbage collection of files in Google File System;
        • handles schema changes;
      • many tablet servers;
        • could be dynamically added or removed;
        • handles a set of tablets;
        • handles read and write requests;
        • splits tablets that has grown too large;
          • maintaining a size of 100-200 MB;
        • clients communicate directly with tablet servers;
          • most clients never communicate with the master;
    • Tablet location:

      • three-level hierarchy to store the tablet location information:
        • first level: a Chubby file contains location of root tablet:
          • root tablet contains the location of all tablets;
          • root tablet is the 1st METADATA tablet;
          • never split to make sure of the three-level structure;
        • second level: rest of the METADATA tablets:
          • stores the location of a tablet under a row key;
          • stores secondary information for debugging and analysis;
      • localizing:
        • client library caches tablet locations;
        • moves to the hierarchy if doesn’t know the correct location;
          • requires three network round-trips if the cache is empty;
          • at most six network round-trips if the cache is stale;
          • client library prefetch more than one line to reduce access cost;
    • Tablet assignment:

      • each tablet is assigned to one tablet server at a time;
      • the master keeps track of the set of love tablet servers and the current assignment of tablets to tablet servers;
      • Bigtable uses Chubby to keep track of tablet servers:
        • acquires an exclusive lock when a tablet server starts up;
        • the master monitors the lock directory and discovers tablet server;
        • a tablet server will lose its lock due to network partition;
        • reacquire a lock as long as the file exists;
      • master:
        • responsible for detecting the status of tablet servers:
          • periodically asking for lock status;
        • acquire a exclusive lock if the tablet server expires or is unavailable:
          • delete the file;
          • reassign the tablet;
        • kills itself if its Chubby session expires:
          • this doesn’t change the assignment of tablets to tablet servers;
        • master start-up:
          • grabs master lock in Chubby;
          • scans the servers directory in Chubby to find live servers;
          • contact with all live tablet servers for the tablet assignments;
          • scans(or add) METADATA table to learn the the set of tables;
        • table changes:
          • master initiate add/delete and merge operations to tables;
          • tablet servers initiate split operations:
            • commit the split by recording information for the new tablet in the METADATA table;
            • notifies master;
    • Tablet serving:

      • persistent state of tablet stored in GFS;
      • updates are stored in a commit log:
        • recent ones are kept in memory buffer called a memtable;
        • older ones are stored in a sequence of SSTable;
      • write operation:
        • check for validity and authorization;
        • mutation is written to commit log;
          • group commit improves the throughput of small mutations;
        • after all the writes has been committed, its contents are inserted into memtable;
      • read operation:
        • check for validity and authorization;
        • merge SSTables and memtable;
          • lexicographically sorted so it’s efficient;
    • Compaction:

      • when the memtable reaches its maximum size, it freezes and a new memtable is created. And the old one will be converted to SSTable in GFS.
        • shrinks the memory usage of the tablet server;
        • reduces the amount of data for recovery;
      • small SSTables converted from memtable will be merged periodically;
        • called major compaction;
        • product of major compaction contains no deletion info or data;
  6. Refinements

    • Locality groups:

      • clients group multiple column families together into a locality group;
      • generates a separate SSTable;
    • Compression:

      • clients can decide whether a SSTable for locality group is compressed;
      • as well as the compression format;
    • Caching for read performance:

      • two levels of cache in tablet servers;
        • scan cache is higher-level, which caches the KV pairs returned by SSTable interface to the tablet server;
        • block cache is lower-level, which caches the SSTable blocks read from GFS;
    • Bloom filters:

      • allows us to ask whether an SSTable might contain any data for a specified row/column pair;
    • Commit-log implementation:

      • append mutations to a single commit log per tablet server;
      • boost performance during normal operation;
      • but slows down recovery;
        • sort the log to avoid duplicate reads;
        • two logs to avoid blocking;
    • Speeding up tablet recovery:

      • minor compaction;
    • Exploiting immutability:

      • differentiate mutable and immutable data;
  7. Performance evaluation;



Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s