Paper Review: Fast Crash Recovery in RAMCloud


RAMCloud is a distributed storage solution based on RAM storage instead of disks. It reduces the time of recovery by scatter the backup data into multiple disks and collect them in parallel fashion.

Strong points:

  1. Before I finished chapter 3 I thought about the bottleneck of the receiving/recovering/master machine since the network bandwidth is limited and assembling data takes time. But they walk around the problem by having multiple machines help the recovery process so we have multiple backup disks sending and multiple machines receiving and assembling at the same time. This is a clever design and the recovering process can even start (by finishing the disk loading phase) as soon as the crash happens.
  2. Each master picking the replica on their own is another good scalable design in this system. When a master selects backup location, it would choose the best candidate from a random list without interacting with the coordinator. The randomness appears again with the heartbeat messages. Nodes don’t ping the coordinator like other common distributed systems. Instead, nodes ping each other randomly and report failures. This would avoid massive amount of ping and response at the coordinator, which is not scalable at all.
  3. The overall system has a very straightforward design. For most part, if I’m designing the same system, I would definitely use the same strategy for most part. The backup is expensive in RAM? Then it shall have fewer backups. Not enough backups to make data highly available? Then make the recovery faster. The recovery suffer from network bottleneck? Scatter the data and recover in multiple servers at the same time. Need some machine to instruct the recovery? Add a coordinator. The coordinator might fail? Use zookeeper. The design flow is just nature and simple, which results in a really thin layer of RAM-based faster recovery solution.

Weak points:

  1. The system has very restrictive data model with size and layout limitation. For a large raw file or something structured data, the developers have to find a way to walk around it, by dividing the large data (and re-assembles it every time when you want to fetch it) and make a small sub-field in the data section to use it for more complex data layout. And there’s no atomic update spanning multiple tablets which is inconvenient in some cases.
  2. Client have no control over the tablet configuration like the locality, which might be important in some cases. The developer might want to scatter the data into multiple locations to enable parallel reading or manage the storage directly by putting the data in specific locations. RAMCloud only guarantees that small tablets and adjacent keys in large tables might be stored together and this is a rather weak locality control.
  3. During the recovery setup phase, the coordinator would collect the backup information  from all the backups in the cluster and then start the recovery. I’m not sure why don’t the backups inform the coordinator when they are backup data. Is it about the limited storage space in the coordinator/zookeeper or something else. It seems to me that this kind of late query would add another layer of latency to the whole backup process. If the coordinator has the backup information already, the it could save a lot of time by proceeding to the next phase.

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s