Paper Review: The Google File System

Paper Review:

The Google File System supports common filesystem APIs along with append operation. It ensures the data redundancy and integrity by maintaining several replicas in chunkservers, which are coordinated by a single master machine. The master machine backups all the operations, logs and checkpoints regularly locally and remotely.

Strong points:

  1. Separating data flow with control during replication is brilliant because the control flow has a centralized structure (star structure from master to primary to secondary replicas in this case) and the data flow is much bigger in the size and should be optimized by using chain/pipeline structure in order to make full use of the network bandwidth;
  2. The concept of master replication and shadow master is great. It has really high-level of security and redundancy. Local backup for master data is simply not enough for critical metadata and it handles centralized master failure nicely. Moreover, they minimize the metadata to make the replication more efficient;
  3. The simplicity (at least on the glimpse of the overall structure)  of the design is just stunning. There’s no complex algorithms/solutions in this filesystem but yet each possible problem is handled elegantly. For example, I can imagine how much trouble a fully distributed filesystem like xFS would have without centralized server for coordination.

Weak points:

  1. There are a lot of assumptions about the data they are going to handle on this particular file system. In order to optimize the system, they mainly focus on large files and append operations, which means the small files and random writes are not well-supported in GFS.
  2. In the construction of chain-structured data flow during replication, “finding the closest IP” is not a sophisticated way to construct the chain of servers. Close IP address doesn’t always mean close in network and surely doesn’t imply faster data transfer. I guess I will choose multi-casting supported by routers to transfer data with in the same data center and use a better network distance estimation (like historical latency and transfer speed) for data transfer between different data centers.
  3. Bottleneck effect of the master
    • clients have to communicate with master before data access;
    • in-memory data structure is of limited size;
    • master failure do effect the system performance even with shadow masters;
  4. Since we already have so many machines for the master replication, why don’t we make master distributed a little? For example, clients with read operation request can ask the master replicas for the primary instead of asking the master to utilize the bandwidth.
  5. The client will repeat the mutation operation if any replica fails, which means that all the data will be transferred again, which consumes time and bandwidth and causes duplicate record. Can primary replica be responsible if any secondary failures? Say we have 3 replicas: P for primary and S1, S2 for secondary. In normal situation, P send control message to S1 and S2 and data flows from P to S1 then to S2. If S2 fails, P will be responsible to send S1 a message to retry the data transfer from S1 to S2 for several attempts. This could save the bandwidth and avoid duplication as well. (without the duplication, all the replicas would be identical and checksum is gonna be easier with higher level of data integrity)
  6. Stale replicas will be detected during master’s regular scan and removed in the garbage collection phase. However, since most files are append-only and we have the version number, stale replicas could be reused easily to save the resource for re-replication.
  7. Concurrent write operations are not defined (maybe solve this with an extra layer between clients and the master? So that concurrent operations could be rearranged in that extra layer. So it could be like there is only one client with serial operations)
  8. What if master and primary are both stale? In that case the client will accept the stale replication. I think a Paxos-like voting phase before master replies to the client would solve this case. Actually no, master and primary will never be both stale. The master will update all the metadata upon the startup. If the master doesn’t crush, it will always have the up-to-date version number.
  9. Checksum in each chunkserver is definitely necessary but isn’t it overhead if we are doing the checksumming upon every request? Say if there is storage hotspot with only one chunk of data, the chunkserver will keep on checking for integrity over and over again. Since we don’t have to worry too much about the stale data (because of version number), we can check the integrity when the data is read for the first time then continue without checksumming because this part is already verified. As long as it has the latest version number, it must be correct. Actually no. Hotspot is a rare situation and data corruption might still happen after it was stored into the disk (I guess). Since the checksumming has little effect I guess we could just leave it there.


Paper Outline:

  1. Intro to the problem:

    • Component failures are norm rather than exceptions;
    • Files are huge by traditional standards;
    • Files are mutated by appending rather than overwriting;
    • Co-designing the applications and FS APIs benefits the overall system;
  2. Design overview:

    • Interface:

      • create, delete, open, close, read, write;
      • snapshot, record append;
    • Architecture:

      • master and multiple chunkservers structure;
      • master keeps all the file metadata;
      • masters periodically communicate with chunkservers with HeartBeat messages;
    • Single master:

      • sophisticated chunk placement and replication decision using global knowledge;
      • might become bottleneck of the system; should minimize the involvement of the master during r/w;
        • no more communication with master once chunk is located
        • multiple chunks in the same request;
    • Chunk size:

      • 64 MB in size;
      • pros:
        • reduce the client-master interaction;
        • reduce the network overhead by keeping persistent TCP connection;
        • reduce the size of metadata stored on the master;
      • cons:
        • wasting space (maybe?);
        • small files with less chunks may become hot spot (could be resolved with higher replication factor or allow clients to read data from other clients);
    • Metadata:

      • three major types:
        • the file and chunk namespace ( in memory. local disk and remote);
        • the mapping from files to chunks ( in memory. local disk and remote);
        • the location of each chunk replicas (in memory);
          • master does not store it persistently but ask chunkservers upon startups and other changes.
      • in-memory data structure
        • periodical fast scanning for the entire state (for garbage collection, re-replication upon chunkserver failures and chunk migration);
        • the system capacity is  limited by the master’s memory;
          • not a serious issue because memory consumption is small and extra memories are cheap.
      • chunk locations:
        • obtained and updated with monitoring all the chunkservers’ states with regular HeartBeat messages;
        • chunkservers have final words about the chunk locations so there’s no point trying to maintain a constant view on the master;
      • operation log:
        • historical record of critical metadata changes;
        • defines the order of concurrent operations with logical time;
        • local and remote replications:
          • flushes the log replications before answering to clients;
          • batches several flush requests to reduce the throughput consumption;
        • master recovers by replaying the logs:
          • checkpoints its state when logs grow beyond certain size;
          • load the latest checkpoint and then replay the subsequent logs to reduce the startup time;
    • Consistent model:

      • Guarantees by GFS:
        • consistent: all clients will always see the same data regardless of which replicas they read from;
        • defined: consistent and all clients will always see what the mutation writes in its entirety;
        • serial successful write operation is defined (apply operations to replicas in the same order and detect/collect the outdated ones);
        • concurrent successful write operation is consistent but undefined;
        • record append is defined;
        • client might access stale data because chunk location is cached on the client side.
          • this is limited due to the cache size;
          • most of the files are append-only so stale replica usually returns a premature end of chunk;
        • data corruption is detected with regular handshakes and check-summing  between master and all chunkservers;
  3. System interactions:

    • Leases and mutation order:

      • master grants a chunk lease to one replica as primary;
      • primary defines the mutation order to all;
      • lease mechanism:
        • initial timeout of 60 secs;
        • could be extended indefinitely with HeartBeat messages;
        • could be revoked by master;
      • lease process:
        • client asks the master which chunkserver holds the lease;
        • master returns the primary and secondary replicas;
        • client pushes the data to all replicas in any order;
        • client sends write request to primary if all replicas ACked;
        • primary forwards to write request to all secondary replicas in the exact same serial order;
        • the secondary replicas reply to primary upon completion (nodes failure might result in inconsistency here. It is handled with few more attempts before falling back);
        • the primary replies to client (with errors or not);
      • a “large” write will be divided into small ones and could be interleaved with or overwritten by concurrent operations.
    • Data flow:

      • the control flow is from client to primary and then to secondaries; but the data is pushed linearly along a chain of chunkservers in a pipelined fashion (to fully utilize the bandwidth of each machine);
    • Atomic record appends:

      • only data is specified (not the offset) and GFS guarantees to append the data at least once;
      • primary replica is responsible to check if the appending operation might result in over-sized chunk;
        • If so, it pads the current chunk (and tell the secondary replicas to do the same) and then replies to the client.
      • the client retries the operation if any replica fails and this might cause duplicate record;
    • Snapshot:

      • master revokes leases to ensure that any subsequent writes will require an interaction the master;
      • master saves operations to local disk and then load this log record to its in-memory state;
      • the next request from clients will first ask the master about primary;
      • data is copied locally to create new chunks for the following operations (in order to save the backup data for last snapshot);
  4. Master operation:

    • Namespace management and locking:

      • locks are used to enable simultaneous master operations;
      • files are stored with full path-names and r/w operations will require all the locks along the directory;
      • it allows concurrent mutations in the same directory;
    • Replica placement

      • maximize the data reliability and availability;
      • maximize network bandwidth utilization;
    • Creation, re-replication, rebalancing:

      • factors to consider upon chunk creation:
        • on chunkservers with lower disk utilization;
        • limit the number of recent creation on each chunkserver;
        • spread replicas across racks;
      • re-replication priority factors:
        • how far it is from the re-replication goal;
        • chunks of live files over the recent deleted files;
        • chunks that are blocking client progress;
      • rebalancing happens periodically and gradually;
    • Garbage collection:

      • mechanism:
        • rename and hide the file;
        • remove the hidden file, orphan chunks and the metadata during master’s regular scan;
      • discussion:
        • garbage is easily identified with chunk handle mappings;
        • advantage over eager deletion:
          • simple and reliable in distributed system with constant failure;
          • merge storage reclamation into master regular activities to minimize the cost;
          • safety against accidental, irrevocably deletion;
    • Stale replica detection:

      • chunk version number is maintain in the master;
      • increase the chunk version number when new lease is granted;
      • stale replicas are removed during the garbage collection;
      • master will send the version number to the client along with the primary when the client is asking for a chunk;
  5. Fault tolerance and diagnosis

    • High availability:

      • fast recovery from normal/abnormal termination;
      • chunk replications;
      • master replication:
        • operation logs and checkpoints are replicated on multiple machines;
        • operations is committed only after it flushes to all replicas;
        • master could be replaced by the replicas during failure;
          • replicas serve as read-only shadow masters with the same canonical name as the original master;
    • Data integrity:

      • each chunkserver must independently verify the integrity:
        • comparing between chunkservers is impractical;
        • divergent replicas are legal (not guaranteed to be identical);
      • chunks are broken into 64 KB pieces with 32-bit long checksums;
        • checksums are part of metadata;
        • checksumming has little effect on read operations;
        • update the checksum for the last partial checksum block after append operations;
    • Diagnostic tools:

      • RPC logs with minimal impact on performance;



[1] Ghemawat, Sanjay, Howard Gobioff, and Shun-Tak Leung. “The Google File System.” Proceedings of the Nineteenth ACM Symposium on Operating Systems Principles – SOSP ’03 (2003). Web.




Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s