Paper Review: Pig Latin: A Not-So-Foreign Language for Data Processing

March 25, 2016xduan7Leave a comment

Summary:

Pig is a scripting language built upon MapReduce, trying to save data scientist from the control flow MR programs that they are struggling with. It’s a combination of SQL and procedural language, which compiles highly abstract data flow programs into MapReduce executions. So developers don’t have to worry much about parallelism, fault-tolerance and pipelining etc.

Strong points:

Pig is a combination of SQL querying language and low level MapReduce (high level renders and low level manipulations), which grants Pig some benefits of both. Programmers can easily write “sequential (not declarative)” programs without worrying about database layout and MapReduce parallelism.
Pig is flexible in many ways like various data models and user-define function with Java and Python (in later versions) support etc. You can specify the control flow of programs at each step, just like any other procedural languages while focusing on data flow at the same time.
Also Pig is different from relation database or OLAP. It can deal with large unstructured datasets or nested data structure. On the other hand, it enjoys parallelism just like parallel database, thanks to the internal MapReduce compilation.
Each step is pipelined and optimized for execution. Also it’s based on MapReduce so all the fault-tolerance, scalability and other MR features are given. As a developer you don’t have to worry much about the lower level of the language.

Weak points:

First of all Pig programs are still compiled into MapReduce, which is inefficient and inflexible without doubt. Data must be materialized and replicated on the distributed storage between successive MapReduce steps and this makes thing much slower even with the pipeline. Also the bandwidth might become an issue when pipelined Pig programs are running: at the end of each step, all the intermediate data will be transferred via network almost simultaneously.
Execution order is not actually sequential as the programs wrote in Pig. This is also mentioned in the “future work” part of the paper, that some of the execution order will be modified to meet better performance. While this is a good feature in most scenarios, it could be dangerous and hard to debug and fix in some cases.
The flexibility could be a bit of challenge for developers. For example, only small part of parallelized primitives are included in Pig and users are responsible for the rest of the implementation as well as efficiency. I can imagine it would be a pain to develop a efficient UDF without support from the community.

Paper Review: Druid: A Real-time Analytical Data Store

March 23, 2016xduan7Leave a comment

Paper Review:

Summary:

Druid is a real-time analytical datastore with column-oriented layout and different types of nodes to support different functionality. It depends on deep storage, MySQL and Zookeeper for persistent storage, query and coordination purpose separately and could handle queries faster than MySQL and Hadoop.

Strong points:

Historical node tier offers flexibility and better performance. Nodes within the same tier shares the identical configuration and different tiers are divided according to their data “importance”. A tier with more cores and memories is suitable for frequently accessed data. Although the tier seems to be manually configured, it’s still a good solution to take full advantage of the limited hardware resources.
Zookeeper and MySQL outages do not impact current data availability on historical nodes. With the cached data, historical nodes are still able to handle its current queries. Also the caching saves time by eliminating unnecessary accesses to the deep storage. However, the coordinator nodes might become unable to perform any operation.
Druid uses column-oriented storage, which is similar to Dremel. Column storage is more efficient in the data analytics since only the useful data is actually loaded and scanned. In row-based storage, all the data in the rows have to be loaded, which results in significant longer running time.

Weak points:

A Druid cluster consists of different types of nodes, which are only responsible for specific kinds of tasks. While this might be a simple design, the load balancing between different roles might be an issue and makes the system inflexible under certain scenarios. If a system has a load of events during day time and all the queries are in the night, Druid still have to separate the roles of real-time nodes and historical nodes, which means that only part of the system is working at any time of the data while the rest of the machines (either real-time nodes or historical nodes) are not contributing and wasted.
I think Druid has the same downside as Hadoop MapReduce, in the sense that all the “intermediate results” are saved remotely in some disk. To access the intermediate result, MapReduce has to load them from disk and transfer using internet bandwidth and so is Druid. In Druid, the batch data are saved in deep storage, which is a distributed file system and all the processing requires loading the data remotely. This brings another layer of delay and dependency. I wonder if the “store” and “processor” could be on the same machine so that we can save the loading and transferring.
The coordinator nodes could be redundant since we already have Zookeeper. Zookeeper is definitely capable of telling what data the historical nodes should load and replicate, and It’s not hard to implement Zookeeper with simple load balancing function. On the other hand, the historical nodes might suffer short outage during the leader election of coordinator nodes. So for me it doesn’t really make sense to add coordinator nodes into the system.
The overall design doesn’t fit the the philosophy of Druid class in role playing games at all. Yes, Druid (in this article) is capable of handling complex data analysis, but the design is messy and inelegant since it solves the problem using different types of nodes and dependency. Druid class, in the game, could transfer into different animals to tackle different problems. It’s flexible and independent. So I guess the “real Druid design” of the system should be one kind of nodes that is programed to handle all the problems in different manner.

Paper Review: Dryad: Distributed Data-Parallel Programs from Sequential Building Blocks

March 11, 2016xduan7Leave a comment

Paper Review:

Summary:

Dryad is Microsoft version of distributed/parallel computing model. It aims to free the developers from the rigid settings and process of MapReduce and design your own concurrent computation model. It doesn’t care about the data flow connection method or the computation process as well (as long as it’s not cycle) so that developers can pretty much configure Dryad the way they want.

Strong points:

Dryad strikes me as a generic version of MapReduce at first. It has all the nodes and looks like map/reduce model but the process is much more configurable. As long as the process forms a acyclic graph, Dryad can go with it. Unlike MapReduce, which is only good for certain kind of massive parallel processing model, Dryad seems able to fit into anything.
One of the biggest constraint of MapReduce is that the developer doesn’t have the freedom to choose how intermediate data is transferred during the process. And Spark beats MapReduce partial due to it transfer data via memory. In Dryad, developers can choose different ways to make the transfer, like TCP, memory or disk. Of course the memory transfer would be the fast way but it’s always nice to have some other options around in case the memory is not enough or some other reason.
The computation is modeled with acyclic graphs. Dryad offers ways to monitors the vertices as well as edges (state manager for each vertex and connection manager, which is not in the paper). It could make dynamic changes to the computation graph according to the monitored results to handle special cases like slow machines.

Weak points:

While Dryad aims to “make it easier for developers to write efficient parallel and distributed applications”, it doesn’t hide all the execution details from developers. Instead it’s doing the exact opposite by exposing more internal structures and leave the decision to developers. The computation, connection, input/output and the monitoring looks intimidating. And the insufficient language support (from what I’ve seen so far it uses C++ and query languages only) makes things even harder.
The execution stops as the job manager fails is a definitely a disadvantage in the system. It could be fixed with 1) distributed coordination with Paxos and some other consensus (slow but effective) 2) shadow master (faster recovery). Either way it’s not hard to implement, which makes me wonder why is this still an issue in Dryad.
There are two reasons why the graph should be acyclic: 1) scheduling is easy because there are no deadlocks and the process is sequential and 2) without cycles, recovering is straightforward. However, there might be some cases where the developers might need to run one method on a piece of data for multiple times in order to meet the requirement. And this is not allowed in current Dryad system.

Paper Review: Dremel: Interactive Analysis of Web-Scale Datasets

March 11, 2016xduan7Leave a comment

Paper Review:

Summary:

Dremel is an interactive ad hoc query system designed to handle real-time web-scale queries with low latency. It uses tree-structured servers and columnar storage with replication to achieve great performance over MapReduce.

Strong points:

Columnar storage is better for faster column retrieval and feature extraction. It exposes data in a more feature-oriented way and make programs easy to process data column by column with better data locality.
Tree structure of server is great. There are more than two layers of servers, which is pretty much all we’ve seen so far (one for metadata and one for data). This structure reminds me of the biggest query system on earth: DNS. With a intermediate layer, the system benefits from one more layers of cache and way better scalabilty. Also it has the potential to yet expand to a bigger scale with another intermediate layer.
The performance is great comparing to similar systems like Hive, which is not real time. It delivers all the requirements as a real-time interactive ad hoc querying system. However, it seems to me that Apache Drill could achieve pretty much the same thing with flexibility on data types.

Weak points:

Relatively poor performance when few columns are read or dealing with unstructured data. In that case we cannot take advantage of the columnar storage. But I guess they are pretty sure about the query types that Dremel is going to handle so it’s fine. Dremel is design to deal with the Ad Hoc queries of structured data ready to be analysed.
I don’t think MapReduce and Dremel make a valid comparison. Of course users can still use MapReduce do perform Dremel’s query and analysis job but that’s not what MapReduce is designed for, which is distributed batch processing. Those two are more complimentary rather than comparable to me, and that’s exactly what the authors suggested in the observation in the paper: “MR and query processing can be used in a complementary fashion; one layer’s output can feed another’s input”.
There’s not you can do to modify the data (update or creation) expect append, which limits what users could perform with Dremel. I guess implement update method is not in their development priority since the data analysis rarely used modification but still it’s nice to have a flexible way to change the data.

Paper Review: Discretized Streams: Fault-Tolerant Streaming Computation at Scale

March 8, 2016xduan7Leave a comment

Paper Review:

Summary:

Spark Streaming is a batch process system based on Spark, trying to simulate streaming by divide real time events into micro-batches with time intervals. This causes inevitable high latency but brings great throughput at the same time. It handles fault and stragglers with checkpoints and parallelized re-computation.

Strong points:

The best part of Spark Stream, which is also indicated in the title of the paper, is the fault tolerance and recovery strategy. The system doesn’t take the path of full replication, which requires 2 times of the storage space and coordination. It maintains checkpoints for data sets but doesn’t serially replay the process from the checkpoints during recovery. Spark Stream recovery is a parallel process that distributes the work of the failed node to all others, and brings the data back in seconds;
Consistency is maintained naturally with discretized streams. Unlike Storm, the data is processed “exactly once” because the micro-batch is synchronized with time interval barriers;
Great throughput comparing to any streaming system. Although it’s not fair to comparing micro-batch processing to stream, but when it comes to real world application, micro-batch processing could fit into many scenarios that doesn’t demand micro-seconds latency;

Weak points:

Latency is inevitable because of the time interval that Spark Streaming batches the events. The latency won’t be much lower if we have fewer events in the batch, or more machines in the cluster. This is really different from Storm, in which the latency is directly associated with computation power and incoming events;
Streaming and batch processing are fundamentally different, which makes the comparison between Storm and Spark Streaming invalid in many ways. Some guys from Storm project were really mad about the claim “Spark Streaming is X times faster than Storm”. I would love to see some comparison between Storm Trident and Spark Streaming because micro-batch to micro-batch makes much more sense.
They mentioned way too much on how they could handle stragglers while other system can’t in the introduction and overall structure, but it turns out to be a simple time threshold calculated from the median running time and a node is marked as slow if it takes longer than that. So I’m guess if a node is marked as slow, it won’t be part of the workers in the future since it might bring latency. And it’s job will be distributed by other workers. But what if the slow is caused by network communication loss, OS noise or underlying HDFS? I’m thinking about giving those slow nodes another chances periodically. But instead of assigning them with real tasks, we give them duplicated ones with special “no pass” marks. So the slow nodes can run the some processes together with normal nodes without returning duplicated results back. There won’t be any bad influence anyway. More sophisticatedly, they will be given the chances in the next [1st, 2nd, 4th, 8th, 16th … ] batches after marked slow.

Paper Review: Storm @ Twitter

March 4, 2016xduan7Leave a comment

Paper Review:

Summary:

Storm is a real-time data process system aimed to handle large-scale streaming data. It offers relatively strong fault-tolerant feature as well as two processing semantic guarantees. It’s now part of the Apache open source project that works well in the Hadoop ecosystem (runs on Mesos and uses ZooKeeper for coordination).

Strong points:

Parallelism is enforced in may layers in the system. There are many worker nodes with one or more worker processes on each of them. For every work process, it runs JVM, carrying one or more executors. And for every executor there could be one or more tasks. So in sum, every machine is divided into multiple services ready for streaming jobs. Although Storm doesn’t optimize the parallelism to the level of machines, this kind of model could still be useful for handling large number of small, streaming process jobs.
This paper doesn’t talk about fault tolerant a lot but it’s easy to see the reason behind. Nimbus and the Supervisor daemons are all stateless, which makes the backup easy to handle. And the snapshot of their memory state is kept locally or in the ZooKeeper. However, the Nimbus failure would block the whole system from topology submissions.
As a real-time streaming process system, Storm has really good average latency. The latency is directly associated with the number of machines during the experiment, which means that the workload is distributed among all the workers nicely. The throughput measurements indicate good scalability as well.

Weak points:

Currently, the programmer has to specify the number of instances for each spout and bolt. I’m not sure if this has already been fixed but if not, the developers could go through a lot of trail and error before they find out the right number for their data stream. I guess Pregel has the same thing with the vertices partitioning. The users have to pre-configure those things before the run and the configuration will be inflexible and inefficient considering what they are doing requires high-level parallelism.
The processing semantics guarantee is weak comparing to Trident/Spark. It’s either at-least-once or at-most-once. Storm doesn’t offer exactly-once guarantee (and i guess that’s one of the reason it has better performance?). Although they have some tricks like the XOR to save memory space and make things look better, lack of stronger guarantee could a disadvantage for some applications.
I guess Storm works well with simple tasks that could fit into bolt but for sophisticated streaming process, it will require a large number of bolt to divide the task into different simple executions and run in parallel, and the overhead for communication and coordination will grow as well. They didn’t compare different streaming process task types in the evaluation so there is no way to find out in the paper, besides the manually configuration for bolts and spouts makes the comparison harder.

Paper Notes:

There are two kinds of vertices. Spouts pull data from the queue like Kafka and bolts process the incoming tuples and pass them down. There could be cycles in the process.
A single master for coordination
Each worker node runs one or more worker processes. Each work process runs a JVM, which runs a or more executors. Each executor is made of one or more tasks. Talking about parallelism.
Summingbird is used for Storm topology generation.
Nimbus and the Supervisors are fail-fast and stateless and all their states are kept in Zookeeper/local disk

Paper Review: Pregel: A System for Large-Scale Graph Processing

March 1, 2016xduan7Leave a comment

Paper Review:

Summary:

Graph problems are naturally hard to be parallelized and distributed because of all the dependency of the vertices and edges. Pregel offers a intuitive way of handling graph processing by using N machines to handle M vertices separately (M >> N). There is a single master in this system, used for coordination, aggregation and analysis.

I tried a similar approach last semester by using threads to simulate nodes in a clustering problem. I believe it’s a good model for demonstration but expensive in terms of communication and coordination overhead.

Strong points:

The way they divide the work is straightforward and intuitive. All the vertices are acting independently with the message passing interface and coordination. Computation is also divided into multiple steps bounded with barriers, which is a model widely used for parallel programming;
Combiners could be useful to reduce the messages that needs to be buffered and processed, hence decrease the running time for each superstep. It works good with shortest path computing;
It’s flexible in many ways. There are different input methods; also the user-defined handlers could be used when the target vertex doesn’t exist or for topology mutation. These features make Pregel suitable for more graph-related problems;

Weak points:

Load balancing and vertices partitioning in topology are hard to solve. Pregel doesn’t care about the load balancing and simply encapsulate steps between barriers regardless of the execution time for each machine. It’s very likely that some vertices have more computation to do because they have more edges or perform as cluster head in the graph. In that case, the threads running for these vertices are going to take longer in each step. The default hashing vertices partitioning might group nodes with heavy computation jobs together. So some machines are going to finish much faster than the others and waste their time by waiting for the barriers. In Pregel, there is no load-balancing solution before or during the run.
Since the whole system is considered synchronous in terms of each step, what’s the actual use of the outdated checkpoints? Re-computation for recovery doesn’t make too much sense because the logged messages from healthy partitions don’t necessarily contain enough information to fully cover the lost partition. For example, if there is an island of nodes in the graph, which sends and receives no messages to/from other part of the graph before the crash, then how could we re-compute the states of the island nodes from the messages of other vertices after reloading the checkpoints. And as we are on the topic of fault tolerance, I’m not sure how does Pregel backup/recover the master.
The paper doesn’t specify the guarantees for messages. It seems to me that slave machines interacting, reporting to the master and getting ACKed by the master could take awfully long if loss and congestion happens. And if one machine gets delayed, the whole barrier is pushed late as well. I’m wondering about how long would a single machine take to finish the same job in the evaluation and I’m almost certain that it takes shorter than we expected. The system doesn’t scale well partially because of the communication overhead.

Paper Notes:

During every superstep, a user-defined function is executed for each vertex;
Pregel programs are inherently lock-free and no race condition;
Vertices are identifiable with a string and they are assigned with a value to represent the state; edges have source and destination with a user defined value as well;
A vertex can vote to halt when it has done all of its work. However, it could still be revoked by receiving messages. The computation is over when all the vertices are inactive;
Aggregators can be used to global reduction and coordination;
Topology mutations processed in order (removal then addition) and conflicts are resolved with user defined handles.
A single machine is chosen as the master, which is responsible for the coordination of the group;
Periodic ping message is used for failure detection; fault tolerance is achieved through checkpointing

Paper Review: Spark: Cluster Computing withWorking Sets

February 29, 2016February 29, 2016xduan7Leave a comment

Paper Review:

Summary:

Spark is a new computation model which is different from the mainstream MapReduce. It uses RDD, derived from DSM, so that the intermediate results could be stored in-memory to reduce the disk r/w overhead. With the lineage property of RDD, Spark doesn’t have to make disk checkpoints for the purpose of recovery since the lost data could be reconstructed with the existing RDD. Also, Spark is running on Mesos so it naturally extends some of the awesome features in Mesos like the optimal task assignment with delayed scheduling.

Strong points:

Flexible in terms of languages that could be used in the driver, the cache option that enables users to trade off between the cost of storing and the speed of accessing data, and running on Mesos so that it could share hardware resources along with other frameworks.
Delay scheduling and the almost optimal tasks assigning features from Mesos. Developers don’t have to worry about the underlying computation task distribution because it’s handled nicely by Mesos.
RDD greatly reduces the read/write operations of the disk comparing to Hadoop. Also there are many other optimizations to avoid disk r/w like when a node is doing broadcasting, it checks whether the data is in it’s memory first. It’s safe to say that disk r/w operations is the bottleneck for Hadoop.
From my understanding, Spark is a more general execution model than MapReduce. The operations written in Scala are user-defined and easy to copied around. It can run different tasks including MapReduce.

Weak points:

Most of the weak points are due to the lack of details of the paper.

Choosing Scala over Java, or some other languages in implementing Spark is interesting. It’s discussed by the first author himself on Quora: Why is Apache Spark implemented in Scala. I’ll say that Spark is still in a performance-critical layer and applications will be built on it. Using Scala might lead to worse performance comparing to Java implementation and so it was showed in the results section in the paper. Spark was running slower than Hadoop without iterations.
Lineage and recovery is not quite covered in the paper in implementation or performance analysis. I’m not sure about how fast could the recovery be and what percentage of lost data can I recover from the remaining RDD (i.e. how much data lost is accepted). And also the paper doesn’t cover some other solutions like distributing and replicating the in-memory data, which seems more straightforward to me (replicated memory could be recovered without computation);
Another question for this insufficient performance analysis is that, what if the data can’t fit into each machine’s memory in Spark. In this way the machines will have to use disk and here comes the inevitable r/w operations to the disk. Here is the link to the performance comparison between Hadoop and disk-only Spark: Databricks demolishes big data benchmark to prove Spark is fast on disk, too. I haven’t read it through but I guess Spark is faster because it doesn’t have to copy the intermediate results around, in this case, from mapper to reducer due to the optimal task assignment.

Paper Notes:

MapReduce are not good with:
- iterative jobs. Machine learning algorithms applies a function on the same set of data multiple times. MapReduce reloads the data from the disk repeatedly and hence not suitable for iterative jobs;
- Interactive analytics. MapReduce queries are delayed because of the reload.
Resilient distributed dataset (RDD):
- Achieve fault tolerance through a notion of lineage: could be rebuild from other part of the dataset;
- The first system that allows an efficient, general-purpose programming language to be used interactively to process large datasets on a cluster;
- Ways to generate RDD:
  - From HDFS file;
  - Parallelize a Scala collection into slices;
  - Flat-map from the existing RDD;
  - By changing the persistence of an existing RDD.
Parallel operations:
- Reduce, collect and foreach;

Paper Review: ZooKeeper: Wait-free Coordination for Internet-scale Systems

February 26, 2016xduan7Leave a comment

Paper Review:

Summary:

ZooKeeper is a wait-free coordination used to construct lock, consensus and group membership for the clients. It uses hierarchical nodes with the same tree structure as UNIX file system to store the data and sequential number for recording. It guarantees FIFO execution for every single clients and it’s heavily optimized for faster reads.

Strong points:

It’s just nice to have both synchronous and asynchronous versions of the APIs. Developers can choose whatever they like to fulfill their requirement on timing or consistency. Also we have a fast read and a slow read (sync + read), which makes thing flexible for developers and the applications.
Locks are powerful primitives for distributed system, which, on the other hand, is hard to relax and fits into more applications. ZooKeeper has no lock but it’s easy to develop lock services upon ZooKeeper. Other different coordination models are also available in ZooKeeper.
During the leader election, ZooKeeper clients don’t have to poll the locks and try to win the leader. Instead, it has a mechanism to keep clients in order and the previous leader failure only wakes the next client without interrupting others.

Weak points:

Since ZooKeeper is an open source project aimed to serve variety of applications, it’s better to implement more options for the developers to choose from. For example, the watch mechanism is basically the same as Chubby with one-time trigger. If they could give some extra options, like when a client creates a znode, it could register for multiple triggers or notifications along with the updated data, that would possibly fit into more situations.
Even with the guarantees on the ordering, the watch is an asynchronous according to (https://zookeeper.apache.org/doc/r3.4.5/zookeeperProgrammers.html#ch_zkWatches). This means the read could be stale not only because of the inconsistent local read from a non-leader server after a write, but also because of the late watch notification. There’s no good solution to this issue because any stronger consistency model could slow down the reads dramatically. However, since we already have sequence number for each znode, why don’t we bound the number with a timestamp or some version number so that the clients would at least have a better understanding of freshness of the data
Looks like the simple locks without herd effect is done with a strict ordering of sequential znodes. A client will become a leader by being notified about the previous leader with the previous sequential number dies. Since the sequential number is increasing over the requests, does this mean the leader ordering is determined by the time they registered to the ZooKeeper? Sounds like this mechanism is not quite desirable in many cases. For example, if the developer wants the client with better hardware to become the leader, then it will be hard to implement because the znode sequential numbers don’t have anything to do with the hardware. (wait…. can I store the hardware scores into the data in each znode and maintain a ordered queue in the group znode? This way the developers can configure the leader election as they want.)
There’s no mentioning on the issue of load balancing. Although all the ZooKeeper servers keep the exact same replicas, clients have access to any of the servers and can fire queries as they want. This could potentially lead to unbalanced request handling among all the servers and possibly some malicious attack as well. Again I couldn’t think of a good way to tackle this issue as a centralized load-balancing could not only slow down the client accesses but also add complexity to the system. Maybe we can implement the ZooKeeper servers in a way that if they detect a lot of TCP losses/timeouts during a time, it redirects some of its distanced clients to other servers

Paper Notes:

ZooKeeper uses non-block primitives because with the blocking ones, like locks, slow and faulty clients can limit the overall performance of the system. The failure detection would increase the complexity. It’s a lock-free version of Chubby;
The requests are handled with pipelined-fashion to support thousands of operation execution; the FIFO ordering of client operations enables asynchronous operations (multiple consecutive operations without reply);
Zab for leader-based atomic broadcast protocol;
Reads throughput is optimized by 2) servers handle reads locally 2) no strict ordering for reads and 3) client-side caching with a watch mechanism;
Tree of directories and data (hierarchical name spaces) because it’s easier and familiar for users in a standard UNIX file system;
Regular and ephemeral znodes
Increasing sequential flag append to znode
Watches to allow clients to receive timely notifications of changes without polling
1. the clients send reads with watch flag set and the server will return the data with one-time promise to notify the clients about any change;
2. watch only indicate the change without sending the actual data;
Clients and ZooKeeper interact by using sessions with timeout. It ends when
1. inactive clients over time
2. close session by clients
3. faulty clients detected by ZooKeeper
Synchronous and asynchronous APIs
1. no handles but use direct access to the path instead
2. update with version number to enable version checking
Two features:
1. linearizability (multiple outstanding operations but ordered with guarantee)
2. FIFO execution of queries
For a leader re-electing scenario:
1. lock service will prevent other processes using the old configurations but doesn’t offer a good solution when the new leader partially changed the configurations of all and then failed
2. in ZooKeeper, the new leader is going to delete the ready znode, make configuration changes and then set the ready znode back on. The configuration change is send asynchronously. New configuration will not be used;
3. ordering is the key preventing any client read anything before it’s ready;
4. sync causes a server to apply all pending write requests before processing the read without having any extra write
Write with atomic broadcast and read from any server for better availability
Write-ahead log in disk
Quorum-based algorithm with pipeline optimization
Periodically snapshot for faster recovery. It’s inconsistent because of it’s a non-blocking depth first scan of the tree of each znode data

Paper Review: The Chubby lock services for loosely-coupled distributed systems

February 25, 2016February 25, 2016xduan7Leave a comment

Paper Review:

Summary:

Chubby is a lock service provided to a large number of clients for access control and coordination. It’s made of typically 5 machines and one of them is master. The write operations require ACK from a quorum of machines and reads only need to access the current master.

Strong points:

Chubby exports a file system with APIs similar to UNIX filesystem. Tree of files are stored in directories which makes the access structured and easier for all users. There are some differences though, like the file moving, directory access control and last-access times;
The file handles with sequence number prevent checking at each read/write access and easy to see which master assigned the handles without referring to other replicas. This mechanism is good for small file with frequent accesses because checks are done during the open of handles instead of actual access. Each handle could be reused unless the master assigning the handle fails.
Since the read operations greatly outnumber writes, data is cached in the client side to reduce the traffic of polling Chubby masters for every single read. The master will notify all the clients that possibly cached the data when the data is modified and invalidate the caches. I can imagine it saves a huge number of messages during the run. However, a timed-cached (TTL cache) combined with this mechanism might be better because as the system runs, clients with possible will increase over time. For every read from the clients, the master is going to return the data with a validation time limit, which implies that cache should not be used after this period of time. On the other hand, the master will still invalidate all the clients whose cache is still in the time limit upon modification. This way we can avoid the case when the master has to send too many invalidation messages after several consecutive writes and lead to bursty network usage. This is different from the KeepAlives jeopardy because the TTL is for every cached data while jeopardy is for every client

Weak points:

This is trivial but still not thoroughly explained. The paper mentioned that the clients need so multicast request to all the Chubby replicas stored in DNS in order to locate the master, which means that all the Chubby replicas are managed by another service that keep track of the states and placement of Chubby replicas. The read operation only goes to Chubby master, but if the master is freshly selected, then either 1) the read service will be blocked for a while until the new master is up-to-date by checking with all other replicas or 2) the read returns a insufficiently updated value stored in the new master;
The sequence numbers are assigned to any operations that interact with locks. And if any lock is lost in a abnormal circumstances like failure, the other clients have to wait for a period (1 minute). However, why don’t master actively sends messages to the lock-holding clients to check for liveness instead of waiting. At least having the option of proactively checking the clients could benefit systems with higher requirement on availability. (This could definitely use the TrueTime API to make sure of the ordering of events from different clients and save a lot of waiting/messages);
When I was first reading the paper, I felt that the communication between Chubby clients and cells overlaps with TCP a lot in terms of their functionalities: both maintains a session with sequence number; both requires message exchange to keep alive otherwise the session expires. So I assumed that TCP is a nature for Chubby with all the strict ordering, congestion control and guarantees on the delivery. But the paper claims the TCP back-off policy lead to a lot of KeepAlives timeouts because the transport layer protocol is unaware of the timing of upper layer processes. However, the main reason we are triggering the back-off policy is the bursty message pattern, which is not discussed in the paper. If we can take advantage of the transport layer session to reduce the number of messages used for KeepAlives, then then back-off won’t be a problem in most scenarios. An alternative is to build a new reliable congestion-control protocol upon UDP with optimizations and priority for time-critical messages.
Chubby master is definitely a bottleneck of the overall performance because all the operations, read or write, have to go through the master. On the other hand, Chubby slaves’ computation power and bandwidth is under-utilized. I guess it could be viewed as a down-side of the high consistency of Chubby, which is maintained by a single master decision making mechanism.

Paper Notes:

Reliability, availability and scalability are considered the most important while the throughput and storage capacity are considered secondary.
Lock services, although cannot offer a standard framework for programmers like a library embodying Paxos, maintains high availability as the system scales up; it’s suitable for clients to store and fetch large quantities of data as it uses consistent client caching rather than TTL time-based caching; quorum based consensus reduce the number of servers needed for client to make progress
Only lock-interaction operations will be assigned with a sequence to ensure