Paper Review: Druid: A Real-time Analytical Data Store

Paper Review:

Summary:

Druid is a real-time analytical datastore with column-oriented layout and different types of nodes to support different functionality. It depends on deep storage, MySQL and Zookeeper for persistent storage, query and coordination purpose separately and could handle queries faster than MySQL and Hadoop.

Strong points:

  1. Historical node tier offers flexibility and better performance. Nodes within the same tier shares the identical configuration and different tiers are divided according to their data “importance”. A tier with more cores and memories is suitable for frequently accessed data. Although the tier seems to be manually configured, it’s still a good solution to take full advantage of the limited hardware resources.
  2. Zookeeper and MySQL outages do not impact current data availability on historical nodes. With the cached data, historical nodes are still able to handle its current queries. Also the caching saves time by eliminating unnecessary accesses to the deep storage. However, the coordinator nodes might become unable to perform any operation.
  3. Druid uses column-oriented storage, which is similar to Dremel. Column storage is more efficient in the data analytics since only the useful data is actually loaded and scanned. In row-based storage, all the data in the rows have to be loaded, which results in significant longer running time.

Weak points:

  1. A Druid cluster consists of different types of nodes, which are only responsible for specific kinds of tasks. While this might be a simple design, the load balancing between different roles might be an issue and makes the system inflexible under certain scenarios. If a system has a load of events during day time and all the queries are in the night, Druid still have to separate the roles of real-time nodes and historical nodes, which means that only part of the system is working at any time of the data while the rest of the machines (either real-time nodes or historical nodes) are not contributing and wasted.
  2. I think Druid has the same downside as Hadoop MapReduce, in the sense that all the “intermediate results” are saved remotely in some disk. To access the intermediate result, MapReduce has to load them from disk and transfer using internet bandwidth and so is Druid. In Druid, the batch data are saved in deep storage, which is a distributed file system and all the processing requires loading the data remotely. This brings another layer of delay and dependency. I wonder if the “store” and “processor” could be on the same machine so that we can save the loading and transferring.
  3. The coordinator nodes could be redundant since we already have Zookeeper. Zookeeper is definitely capable of telling what data the historical nodes should load and replicate, and It’s not hard to implement Zookeeper with simple load balancing function. On the other hand, the historical nodes might suffer short outage during the leader election of coordinator nodes. So for me it doesn’t really make sense to add coordinator nodes into the system.
  4. The overall design doesn’t fit the the philosophy of Druid class in role playing games at all. Yes, Druid (in this article) is capable of handling complex data analysis, but the design is messy and inelegant since it solves the problem using different types of nodes and dependency. Druid class, in the game, could transfer into different animals to tackle different problems. It’s flexible and independent. So I guess the “real Druid design” of the system should be one kind of nodes that is programed to handle all the problems in different manner.

 

 

Advertisement

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s