Summary:
Apache Tez is an open-source framework that runs on top of Hadoop YARN. With resource management taken care of by YARN, Tez offers the user a clean interface: an application is expressed as a data-flow DAG, and Tez executes it. Various applications have been rewritten on Tez, and the results suggest it scales better than similar frameworks.
Strong points:
- The motivation for Tez is great: it introduces another layer between resource management and computation, which gives developers a nice abstraction of the data flow. With YARN handling resources, users only have to focus on the program logic itself instead of miscellaneous low-level details.
- DAGs have been used in cloud and parallel computing for a long time, and it’s intuitive for cloud users to think in terms of DAGs. Tez exposes a DAG API directly to users so that they can simply inject code into vertices and edges.
- Late binding and online decision-making at runtime is a great feature, since it lets Tez adapt execution to the current state of the cluster and the data. Tez can also modify the DAG at runtime based on what the Vertex Manager observes, which helps it make better decisions during execution.
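To make the "inject code into vertices and edges" idea concrete, here is a minimal sketch in plain Python. It is not the Tez API: the `Dag` class, its `add_vertex`/`add_edge`/`run` methods, and the `tokenizer`/`summation` vertex names are all hypothetical, chosen only to illustrate how a word count could look when written as a two-vertex DAG. It runs in a single process and ignores everything YARN and Tez actually handle (containers, shuffling, fault tolerance).

```python
from collections import Counter, defaultdict, deque


class Dag:
    """Toy DAG (hypothetical, not the Tez API): vertices hold user code,
    edges say which vertex's output feeds which other vertex."""

    def __init__(self):
        self.processors = {}               # vertex name -> callable(list_of_inputs)
        self.upstream = defaultdict(list)  # vertex name -> names of its producers

    def add_vertex(self, name, processor):
        self.processors[name] = processor
        return self

    def add_edge(self, src, dst):
        self.upstream[dst].append(src)
        return self

    def run(self):
        """Run every vertex once, in topological order (Kahn's algorithm)."""
        pending = {v: len(self.upstream[v]) for v in self.processors}
        downstream = defaultdict(list)
        for dst, srcs in self.upstream.items():
            for src in srcs:
                downstream[src].append(dst)

        ready = deque(v for v, n in pending.items() if n == 0)
        outputs = {}
        while ready:
            v = ready.popleft()
            # A vertex receives the outputs of all of its producers.
            inputs = [outputs[u] for u in self.upstream[v]]
            outputs[v] = self.processors[v](inputs)
            for d in downstream[v]:
                pending[d] -= 1
                if pending[d] == 0:
                    ready.append(d)

        if len(outputs) != len(self.processors):
            raise ValueError("graph has a cycle")
        return outputs


def tokenizer(_inputs):
    # Root vertex: in a real job this would read from distributed storage.
    lines = ["to be or not to be", "to do is to be"]
    return [word for line in lines for word in line.split()]


def summation(inputs):
    # Single consumer vertex: count the words produced upstream.
    (words,) = inputs
    return Counter(words)


dag = Dag()
dag.add_vertex("tokenizer", tokenizer).add_vertex("summation", summation)
dag.add_edge("tokenizer", "summation")
result = dag.run()["summation"]
print(result["to"], result["be"])
```

The point of the sketch is only the shape of the programming model: the user supplies per-vertex code and the connections between vertices, and the framework decides when and where each vertex runs.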
Weak points:
- One concern that occurred to me while I was still reading the introduction is the data transfer mechanism between vertices and how it is optimized. It’s really hard to make a good runtime decision about where intermediate data should go: keep it in memory for later use, transfer it to other processes to share the load, or write it to disk so that long-running tasks can recover. This is crucial given that MapReduce, Tez and Spark all use the classic map-reduce parallelism model, so data transfer is the key to better performance. Without automatic runtime decision-making on this issue, Tez could potentially lose a chunk of performance, plus the versatility it claims to have.
- Tez has some limitations that might stop developers from trying it. An obvious one is the JVM restriction: all Tez applications run on the JVM, which limits language support. Aside from the language restriction, Tez offers the DAG and runtime APIs so that developers can put chunks of code into the vertices and build a workflow, but the Tez library doesn’t seem to offer much for the actual processing part. Sure, defining a DAG in Tez is easy and straightforward, but writing a full program, say a word count, takes a lot of lines. Tez is just a layer above resource management, so writing an application directly against the low-level Tez API could be harder than writing it in Hive or Pig.
- The functionality of Tez overlaps greatly with Dryad and Spark. Tez and Spark are both essentially MapReduce with some enhancements, and Dryad is an even more general tool for parallel execution. All three take a DAG and hide the low-level parallelization from developers. Since Spark has matured over the years (with its rich API and large user base) and Dryad is backed by Microsoft, Tez has only its scalability to set it apart and may take a long time to mature as a cloud library.