Paper Review: Kafka: a Distributed Messaging System for Log Processing


Kafka is a scalable message system lies between the message producers and consumers. It uses many features to enhance the scalability and simplifies the design while losing some guarantees that other message systems offer.

Strong points:

  1. Messages are addressed by their logical offset in the log to avoid the overhead of maintaining and accessing the index structures. This could save a lot of time by utilizing the low-level storage addresses instead of counting down the indexes every time.
  2. Kafka uses pull instead of push mechanism to fully exploit the user’s ability of processing. Also in order to achieve efficient transfer, the consumer will retrieve multiple messages up to a certain size even if it only requests for one. This is very similar to the RAM accessing mechanism. The retrieved data is cached in underlying file system instead of explicitly cached in Kafka to avoid double caching.
  3. Simple design in load balancing, coordination, backup etc. A lot of the features are intuitive and achieved without centralized coordination and complex algorithm with some sacrifices on the system guarantees. I have some doubts on the load balancing part though since it looks more or less like lock mechanism to me (“at any given time, all messages from one partition are consumed only by a single
    consumer within each consumer group”) so it might hurt the performance a little.

Weak points:

  1. Overall Kafka has weak guarantees as a distributed messaging system. There’s no ordering guarantees when the messages are coming from different partitions. And stateless broker means that there’s no absolute guarantee on the message delivery. False deletion could happen (but I guess 7 days is pretty long so it mostly won’t happen in real life). At-least-once is another weaker guarantee comparing to exactly-once. I understand that there are trade-offs between complexity, performance and guarantees but I was picturing a system that could let the user decide on the guarantees and everything, not “losing a few page-view events occasionally is certainly not the end of the world”.
  2. The pull-based consumption model is not a good idea in many cases. Pulling means that the consumers have to constantly checking for new messages in the brokers, which results in unnecessary bandwidth consumption and pulling loops for the consumers. Notification or push mechanism is more efficient in sparse message environment.
  3. There’s no message processing in any form, which means that Kafka is merely a thin layer between message producers and consumers. The processing of the messages, even as simple as it could be, requires other tools to accomplish. I don’t think it takes that hard to add an extra layer just to trim the messages a little bit and make it suitable for processing while they are waiting for the pulls.

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s