Summary:
GraphX is a set of graph-processing APIs built on top of the Spark batch processing engine. It tries to unify general data-parallel processing with graph processing and to make tasks such as graph construction easy. It takes advantage of what Spark already provides (fault tolerance and in-memory caching at no extra cost) and partitions the graph across machines for scalability.
Strong points:
- It has a strong motivation to unify different processing paradigms, and I believe the authors are on the right track. Using multiple pieces of software to solve one graph problem (construction, processing and probably storage) is not attractive to most developers. Unifying these steps to make life easier is a great idea to begin with.
- It runs on Spark, which is definitely a time saver. Spark is an excellent cloud processing engine with in-memory caching and RDDs, which provide fault tolerance for free. However, I wonder whether the authors are willing to try different processing engines. Right now, GraphX is more like a set of Spark APIs than an independent layer for graph processing.
- From the paper, GraphX has the potential to implement complex or user-defined algorithms. It is a more general processing model that uses tables to represent the graph and does not have any of the restrictions that other graph data representations might have. The data can be viewed as a table or as a graph at any time during the run, which makes switching between the two views easy.
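To make the last point concrete, here is a minimal plain-Python sketch of the idea (it does not use Spark or the GraphX API, and every name in it is hypothetical). It assumes the graph is stored as two collections, one of vertices and one of edges, and shows the same data answering a table-style query (a group-by over the edge table) and a graph-style query (edges joined with the properties of their endpoints).

```python
# Plain-Python sketch; the data and function names are illustrative assumptions.

# Vertex collection: one row per vertex, (vertex_id, property).
vertices = [(1, "alice"), (2, "bob"), (3, "carol")]

# Edge collection: one row per directed edge, (src_id, dst_id, property).
edges = [(1, 2, "follows"), (2, 3, "follows"), (3, 1, "blocks")]


def triplets(vertices, edges):
    """Graph view: join each edge with the properties of both endpoints."""
    props = dict(vertices)
    return [(src, props[src], attr, dst, props[dst]) for src, dst, attr in edges]


def out_degrees(vertices, edges):
    """Table view: a group-by over the edge table (zero for isolated vertices)."""
    counts = {vid: 0 for vid, _ in vertices}
    for src, _dst, _attr in edges:
        counts[src] += 1
    return counts


graph_view = triplets(vertices, edges)
table_view = out_degrees(vertices, edges)
```

Neither query converts or copies the data into another format first, which is roughly the convenience the review credits to the unified model.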
Weak points:
- It's fairly easy to guess from the introduction of this paper that they are going to build a table-like graph, process it like any other table, and achieve unification that way. Two RDDs represent the graph: an edge collection and a vertex collection. While this might be the easiest approach of all, it is an unnatural way to present a graph. It treats vertices and edges separately, and visualization could be hard (since there are two tables to go through). I wonder why they do not use a wide-column table in which each vertex is associated with other vertices, with the values as directed edges (making edges second-class citizens). Maybe that would make it harder to unify with other kinds of processing, in which case simple tables are preferred.
- The performance is not as good as that of GraphLab and Giraph in most test cases, which means that GraphX could easily be outperformed by a well-tailored, specialized graph processing system. I also notice that GPS, another popular graph processing system, is missing from this paper. Another issue is that the PageRank evaluation results are completely different from those in An Experimental Comparison of Pregel-like Graph Processing Systems, where Giraph beats GraphLab in all PageRank test cases.
- The reason I bring up GPS in the previous weak point is that GPS has a nice, dynamic way of balancing the load (graph partitioning). Its authors put a lot of effort into reducing bandwidth usage between machines by repartitioning during the run. This is crucial in parallel graph processing because, unlike ordinary table processing, a balanced graph partition does not imply a balanced amount of computation. In fact, there's no way to predict and balance the load among machines before execution, which means that dynamic partitioning is probably the only way to go. The GraphX authors did make some improvement here with the GraphX graph partitioning strategy, but it is still static partitioning.
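The claim that a balanced partition does not imply balanced computation can be illustrated with a toy example. The sketch below is my own construction, not taken from the GraphX or GPS papers, and all names in it are hypothetical. It assumes a directed ring of 8 vertices split statically across two machines so that each owns exactly 4 vertices and 4 out-edges, then runs a level-synchronous BFS and counts how many edges each machine scans in every superstep.

```python
def bfs_work_per_superstep(adj, owner, source, num_machines):
    """Level-synchronous BFS; returns edges scanned per machine for each superstep."""
    visited = {source}
    frontier = [source]
    work_log = []
    while frontier:
        work = [0] * num_machines
        next_frontier = []
        for v in frontier:
            for w in adj[v]:
                work[owner[v]] += 1  # the machine that owns v scans this edge
                if w not in visited:
                    visited.add(w)
                    next_frontier.append(w)
        work_log.append(tuple(work))
        frontier = next_frontier
    return work_log


# Hypothetical 8-vertex directed ring: 0 -> 1 -> ... -> 7 -> 0.
adj = {v: [(v + 1) % 8] for v in range(8)}

# Static partition: machine 0 owns vertices 0-3, machine 1 owns vertices 4-7.
owner = {v: 0 if v < 4 else 1 for v in range(8)}

# Static balance: out-edges stored on each machine.
edges_per_machine = [sum(len(adj[v]) for v in adj if owner[v] == m) for m in range(2)]

# Dynamic load: edges scanned by each machine in each superstep.
work_log = bfs_work_per_superstep(adj, owner, source=0, num_machines=2)
```

Here the stored data is perfectly balanced, yet in every superstep one machine does all the work while the other sits idle, because the active frontier moves around the ring. Which vertices are active depends on the algorithm and the input, which is why a partition chosen before execution cannot account for it.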