Paper Review: Gorilla: A Fast, Scalable, In-Memory Time Series Database


Gorilla is an in-memory time series database that serves as an intermediate layer between server data/queries and long-term storage, in this case HBase. It compresses data efficiently and maintains low latency in write-dominant scenarios.

Strong points:

  1. One thing I like about the system is that it prioritizes recent data over older data in terms of availability and fault tolerance. Recent data is more valuable and queried more often than old data, and focusing on the recent portion of the data improves performance during normal operation.
  2. They spend a lot of time describing their compression method. While that section is fairly dry, the scheme does seem effective and cheap to apply, which means Gorilla needs much less storage space for the same amount of data, and the compression time won’t be unreasonably long.
  3. The design of Gorilla as an intermediate layer between real-time data/queries and long-term storage is inspiring. While long-term storage like HBase has good fault tolerance and stronger guarantees for read/write operations, latency becomes an issue as the system scales. Introducing a layer is like adding memory between the processor and the hard disk: it offers different guarantees and very low latency, and serves as a buffer for long-term storage. Following this layered design, we could even add more layers in between to improve the system.
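The compression praised in point 2 is built on XOR-ing consecutive floating-point values, which the paper observes usually differ in only a few bits. The sketch below is a simplified illustration of that idea, not the paper's actual bit-level encoder; the function names and the size estimate are my own, and the real scheme also delta-of-delta encodes timestamps and emits variable-length control bits that this sketch ignores.

```python
import struct

def float_to_bits(x: float) -> int:
    # Reinterpret a float64 as a 64-bit unsigned integer.
    return struct.unpack(">Q", struct.pack(">d", x))[0]

def xor_deltas(values):
    # Gorilla compresses each value by XOR-ing it with the previous one;
    # consecutive samples in a time series tend to be close, so the XOR
    # has long runs of leading/trailing zero bits that need not be stored.
    prev = float_to_bits(values[0])
    deltas = []
    for v in values[1:]:
        cur = float_to_bits(v)
        deltas.append(prev ^ cur)
        prev = cur
    return deltas

def compressed_size_bits(delta: int) -> int:
    # A rough proxy for the encoded size (ignoring control-bit headers):
    # an identical value costs a single bit; otherwise store only the
    # meaningful bits between the leading and trailing zeros.
    if delta == 0:
        return 1
    bits = f"{delta:064b}"
    leading = len(bits) - len(bits.lstrip("0"))
    trailing = len(bits) - len(bits.rstrip("0"))
    return 64 - leading - trailing
```

For a flat series like `[12.0, 12.0, 12.0]`, every XOR delta is zero, so each repeated sample collapses to a single bit, which is where the dramatic space savings come from.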

Weak points:

  1. As the paper suggests multiple times, “users of monitoring systems do not place much emphasis on individual data points”. Gorilla does not meet ACID requirements and only guarantees that a high percentage of writes succeed at all times. This is a rather weak guarantee compared to most cloud databases out there, but I guess fast caching is the primary concern in this system.
  2. Gorilla seems to be hard-wired to handle the most recent 26 hours of data. While the number 26 is extracted from analysis of previous usage, it might not be a good time limit for some data or in future situations. I was thinking about binding each piece of data to its own time limit: during that window the data is stored in Gorilla with faster query speed, and it is dumped to long-term storage when the time expires. This way we can treat different types of data differently and make the system highly configurable.
  3. The data compression is very neat, but each compressed data point depends heavily on the previous ones. So if one bit goes wrong during computation or storage, the rest of the compressed stream might be affected greatly. I wonder whether they have some sort of integrity guarantees, like checksums, to ensure that everything is correct.
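The per-series time limit proposed in point 2 could be sketched as a small cache that tracks an expiry per series and flushes expired ones to long-term storage. Everything here is my own illustration of the proposal, not anything from the paper; the class and callback names are hypothetical.

```python
import heapq
import time

class TTLSeriesCache:
    # Sketch of the reviewer's proposal: each series carries its own
    # retention window instead of a fixed 26 hours; expired series are
    # flushed to long-term storage, represented here by a callback.
    def __init__(self, flush):
        self.flush = flush      # called with (name, points) on expiry
        self.series = {}        # name -> (expires_at, points)
        self.heap = []          # (expires_at, name) min-heap

    def put(self, name, point, ttl_seconds, now=None):
        now = time.time() if now is None else now
        expires = now + ttl_seconds
        _, points = self.series.get(name, (None, []))
        points.append(point)
        self.series[name] = (expires, points)
        heapq.heappush(self.heap, (expires, name))

    def evict_expired(self, now=None):
        now = time.time() if now is None else now
        while self.heap and self.heap[0][0] <= now:
            expires, name = heapq.heappop(self.heap)
            entry = self.series.get(name)
            if entry and entry[0] == expires:  # skip stale heap entries
                self.flush(name, entry[1])
                del self.series[name]
```

Writes to a series push its expiry forward, so hot data stays in memory while cold series drain to the long-term store, which is the configurability the review is asking for.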
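One cheap answer to the integrity concern in point 3 is a per-block checksum, so a flipped bit is caught before decompression cascades the error through the stream. This is a minimal sketch of that idea using CRC32; the paper does not describe such a mechanism, and the function names are hypothetical.

```python
import zlib

def seal_block(compressed: bytes) -> bytes:
    # Append a CRC32 of the compressed block. CRC32 is guaranteed to
    # detect any single-bit flip, the failure mode the review worries about.
    return compressed + zlib.crc32(compressed).to_bytes(4, "big")

def open_block(sealed: bytes) -> bytes:
    # Verify the trailing checksum before handing the payload to the
    # decompressor; fail loudly instead of decoding garbage.
    payload, stored = sealed[:-4], int.from_bytes(sealed[-4:], "big")
    if zlib.crc32(payload) != stored:
        raise ValueError("block checksum mismatch: data corrupted")
    return payload
```

Because Gorilla already groups data into closed blocks, checksumming at block granularity would bound the damage of a corrupted bit to one block rather than the whole series.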
