Spark is a new computation model which is different from the mainstream MapReduce. It uses RDD, derived from DSM, so that the intermediate results could be stored in-memory to reduce the disk r/w overhead. With the lineage property of RDD, Spark doesn’t have to make disk checkpoints for the purpose of recovery since the lost data could be reconstructed with the existing RDD. Also, Spark is running on Mesos so it naturally extends some of the awesome features in Mesos like the optimal task assignment with delayed scheduling.
- Flexible in terms of languages that could be used in the driver, the cache option that enables users to trade off between the cost of storing and the speed of accessing data, and running on Mesos so that it could share hardware resources along with other frameworks.
- Delay scheduling and the almost optimal tasks assigning features from Mesos. Developers don’t have to worry about the underlying computation task distribution because it’s handled nicely by Mesos.
- RDD greatly reduces the read/write operations of the disk comparing to Hadoop. Also there are many other optimizations to avoid disk r/w like when a node is doing broadcasting, it checks whether the data is in it’s memory first. It’s safe to say that disk r/w operations is the bottleneck for Hadoop.
- From my understanding, Spark is a more general execution model than MapReduce. The operations written in Scala are user-defined and easy to copied around. It can run different tasks including MapReduce.
Most of the weak points are due to the lack of details of the paper.
- Choosing Scala over Java, or some other languages in implementing Spark is interesting. It’s discussed by the first author himself on Quora: Why is Apache Spark implemented in Scala. I’ll say that Spark is still in a performance-critical layer and applications will be built on it. Using Scala might lead to worse performance comparing to Java implementation and so it was showed in the results section in the paper. Spark was running slower than Hadoop without iterations.
- Lineage and recovery is not quite covered in the paper in implementation or performance analysis. I’m not sure about how fast could the recovery be and what percentage of lost data can I recover from the remaining RDD (i.e. how much data lost is accepted). And also the paper doesn’t cover some other solutions like distributing and replicating the in-memory data, which seems more straightforward to me (replicated memory could be recovered without computation);
- Another question for this insufficient performance analysis is that, what if the data can’t fit into each machine’s memory in Spark. In this way the machines will have to use disk and here comes the inevitable r/w operations to the disk. Here is the link to the performance comparison between Hadoop and disk-only Spark: Databricks demolishes big data benchmark to prove Spark is fast on disk, too. I haven’t read it through but I guess Spark is faster because it doesn’t have to copy the intermediate results around, in this case, from mapper to reducer due to the optimal task assignment.
- MapReduce are not good with:
- iterative jobs. Machine learning algorithms applies a function on the same set of data multiple times. MapReduce reloads the data from the disk repeatedly and hence not suitable for iterative jobs;
- Interactive analytics. MapReduce queries are delayed because of the reload.
- Resilient distributed dataset (RDD):
- Achieve fault tolerance through a notion of lineage: could be rebuild from other part of the dataset;
- The first system that allows an efficient, general-purpose programming language to be used interactively to process large datasets on a cluster;
- Ways to generate RDD:
- From HDFS file;
- Parallelize a Scala collection into slices;
- Flat-map from the existing RDD;
- By changing the persistence of an existing RDD.
- Parallel operations:
- Reduce, collect and foreach;