Paper Review: The Dataflow Model: A Practical Approach to Balancing Correctness, Latency, and Cost in Massive-Scale, Unbounded, Out-of-Order Data Processing

Summary:

Google Dataflow is a new programming model that integrates streaming, micro-batch, batch and other processing models and presents them through a single API built around windows and triggers. Developers define a processing pipeline simply by stating the window (where in event time the data is grouped), the trigger (when in processing time results are emitted) and the behavior.

Strong points:

  1. The graph of event time & processing time is brilliant. The Dataflow Model doesn’t just focus on one thing: stream, batch or something simple. It covers all the possible (maybe?) execution models there could be. The graphs help developers understand and analyze the execution from a different angle and offer a geometrical view of the processing flow.
  2. The paper notes multiple times that previous work, especially batch execution engines, assumes the input will be complete at some point. This can be a false assumption, since the data flow is unpredictable, and relying on the notion of completeness is dangerous.
  3. The API in the paper seems really easy to use, so I looked it up for more details. It turns out that writing Dataflow programs is quite simple: it’s just window, trigger and behavior. The program has fewer lines of code and states the processing pipeline more clearly than an equivalent Spark program. Deploying the pipeline is no hassle either.
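
To make the window / trigger / behavior split from point 3 concrete, here is a minimal sketch in plain Python. It is not the Dataflow SDK and not code from the paper: the names window_start and run_pipeline, the 60-second window size, and the idea of passing the current watermark alongside each event are all assumptions made for illustration. The window is a fixed event-time bucket, the trigger fires once the watermark passes the end of a window, and the behavior is a per-key sum.

```python
from collections import defaultdict

WINDOW_SIZE = 60  # seconds; hypothetical fixed window length


def window_start(event_time, size=WINDOW_SIZE):
    """Window: map an event time to the start of its fixed window."""
    return event_time - (event_time % size)


def run_pipeline(events, size=WINDOW_SIZE):
    """Process (key, value, event_time, watermark) tuples in arrival order.

    The watermark is supplied with each event here only to keep the
    sketch self-contained; a real system would track it separately.
    Returns the emitted results as (key, window_start, sum) tuples.
    """
    sums = defaultdict(int)  # (key, window_start) -> running sum
    fired = set()            # windows that have already emitted
    emitted = []
    for key, value, event_time, watermark in events:
        w = window_start(event_time, size)
        # Behavior: sum values per key and window.
        # Data for a window that already fired is simply dropped.
        if (key, w) not in fired:
            sums[(key, w)] += value
        # Trigger: fire every window whose end the watermark has passed.
        for k, ws in sorted(sums):
            if (k, ws) not in fired and ws + size <= watermark:
                emitted.append((k, ws, sums[(k, ws)]))
                fired.add((k, ws))
    return emitted


# The third event arrives after the second but has an earlier event time;
# it is still counted in window 0 because that window has not fired yet.
events = [
    ("a", 5, 10, 0),
    ("a", 7, 70, 20),
    ("a", 3, 30, 50),
    ("a", 1, 65, 61),   # watermark 61 closes window [0, 60)
    ("a", 9, 15, 130),  # too late for window 0; watermark closes [60, 120)
]
print(run_pipeline(events))  # [('a', 0, 8), ('a', 60, 8)]
```

Dropping late data is only one possible choice in this sketch; the paper itself describes several ways for later results to refine earlier ones.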

Weak points:

  1. The paper is confusing and loaded with unexplained terms until the examples appear on page 8. It would have been much better if the authors had presented the examples first and then elaborated on the details.
  2. Although it might be trivial for a developer to understand the implementation of Dataflow, there isn’t much to read in the implementation section. There are some interesting experiences, but they’re not really helpful if you are trying to get inside the system. It would have been nice if they had put the execution model from Dataflow Service Optimization and Execution in the paper.
  3. No performance test at all. At least I could learn something about the execution from the Google Dataflow website, but nothing about the performance of Dataflow is easy to find. I guess performance is already covered to some extent in the FlumeJava and MillWheel papers, but still.