Summary:
Iridium is a system built for low-latency analytics queries over data spread across geographically distributed datacenters. In this setting, latency is caused primarily by bandwidth limitations during intermediate data transfers. Iridium counters the problem by relocating datasets to preferred datacenters before queries arrive. While this approach generally costs more WAN usage, it reduced query latency by 3x to 19x in their tests.
Strong points:
- I generally enjoy dynamic systems that adapt to the problem at hand; specialized systems are inflexible and hard to apply widely. Iridium is somewhat like GPS from Stanford, in which machines exchange vertices at runtime to reduce network usage. Iridium aims to reduce query time by exploiting the “lag” between the data generation time and the query time: during this period, the data can be transferred to preferred nodes (ones with better network links, in this case) to save time for future queries.
- Iridium includes several optimizations, two of which seem particularly interesting. The first is how they prioritize among multiple datasets: those with more expected accesses are handled earlier. The second is how they estimate the next query. Although both optimizations are extremely simple and largely intuitive, I believe they are on the right track toward making Iridium versatile in all situations, which is why I list them as a strong point. A more sophisticated algorithm for estimating the next query and the priorities would be a natural next step.
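The access-based prioritization described above can be sketched in a few lines. This is a hypothetical illustration, not Iridium's exact heuristic: the scoring function (recent access count as a proxy for future queries) and all dataset names are assumptions for the example.

```python
# Hypothetical sketch: order datasets by expected future accesses so the
# most-queried data is considered for relocation first. Using the recent
# access count as the score is an assumption, not Iridium's exact formula.

def prioritize(datasets):
    """Return datasets sorted by descending estimated future accesses."""
    return sorted(datasets, key=lambda d: d["recent_accesses"], reverse=True)

datasets = [
    {"name": "logs-eu",   "recent_accesses": 3},
    {"name": "clicks-us", "recent_accesses": 11},
    {"name": "ads-asia",  "recent_accesses": 7},
]

order = prioritize(datasets)
# "clicks-us" is handled first, "logs-eu" last
```

A real implementation would replace the static count with a predictor of upcoming queries, which is exactly the more sophisticated estimation suggested above.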
- The evaluation chapter is solid. They use standard benchmarks and run Iridium against an “in-place” baseline (intermediate data is processed where the input data resides) and a “centralized” baseline (intermediate data is gathered at one site before processing). These two approaches are commonly used in systems like Spark. The tests cover essentially all the necessary cases and demonstrate Iridium's advantages in both latency reduction and WAN usage.
Weak points:
- Network capacity might not be static. This is realistic, since Iridium is not the only process running in these datacenters. Even considering Iridium alone, multiple datasets and queries could interfere with each other: for example, the placement of dataset A might consume a large chunk of the bandwidth and choke a query on dataset B. This could lead to poor performance, even worse than the baselines, if data is placed at locations that will host multiple queries in the future. Such interference is hard to anticipate and counter; even if Iridium estimated all future queries correctly, other processes that Iridium is unaware of could still ruin the game.
- Computation power is ignored in this paper. This is probably a bad assumption, since the queries still require actual computation. CPUs and storage should be taken into account to make the system truly adaptive in all situations. The paper claims that computation power is abundant, and I wish the authors had offered some real-world analysis to convince readers of this claim.
- Iridium uses a greedy approach to pick preferred datacenters, iteratively moving data to better nodes in a loop. First, the network may not be well modeled by independent uplink/downlink capacities: data transferred overseas encounters many additional problems, and the achievable rate is not simply min{U_site1, D_site2}. Second, I wonder whether a mathematically optimal solution for query time exists, assuming the network bandwidth is stable and computable (just as an optimal solution exists for the WAN-budget objective). I don't think computing the true optimum is infeasible even if the problem is NP-hard: with only a handful of datacenters, we can brute-force it. The authors mention in the discussion chapter that the greedy heuristic is not always best, so I expect they will eventually adopt an optimal or near-optimal data-placement approach.
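To make the brute-force suggestion concrete, here is a minimal sketch. Everything in it is assumed for illustration: the site names, the bandwidth and size numbers, and the simplified cost model in which every site's reduce tasks read all remote partitions and the query time is the slowest link's transfer time. It is not Iridium's placement algorithm, only a demonstration that exhaustive search is cheap at this scale.

```python
# Hypothetical brute-force placement: with a handful of sites, enumerate
# every assignment of partitions to sites and keep the one minimizing the
# bottleneck transfer time. Bandwidths (MB/s), sizes (MB), and the cost
# model are illustrative assumptions, not values from the paper.
from itertools import product

UP   = {"A": 10.0, "B": 1.0, "C": 5.0}   # uplink bandwidth per site
DOWN = {"A": 10.0, "B": 2.0, "C": 5.0}   # downlink bandwidth per site
PARTS = {"p0": 100.0, "p1": 50.0}        # intermediate partition sizes
SITES = list(UP)

def shuffle_time(placement):
    """Bottleneck time, assuming each site reads all remote partitions."""
    worst = 0.0
    for dst in SITES:  # each destination downloads whatever it lacks
        remote = sum(sz for p, sz in PARTS.items() if placement[p] != dst)
        worst = max(worst, remote / DOWN[dst])
    for src in SITES:  # each source uploads whatever it hosts
        out = sum(sz for p, sz in PARTS.items() if placement[p] == src)
        worst = max(worst, out / UP[src])
    return worst

# Exhaustive search: |SITES| ** |PARTS| candidates, trivial at this scale.
best = min(
    (dict(zip(PARTS, combo)) for combo in product(SITES, repeat=len(PARTS))),
    key=shuffle_time,
)
```

Even with, say, 8 sites and 10 partitions the search space is about 10^9 evaluations of a cheap function, and smarter pruning or ILP solvers would shrink it further, so the "only a handful of datacenters" argument seems plausible.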