Paper Review:
Cassandra is another fully distributed data store, much like Amazon's Dynamo. It also works with consistent hashing, but differs in how it handles load balancing. It has more flexible data types and replication methods than Dynamo and is one of the best non-relational data stores in the world.
Strong points:
(pretty much every advantage/disadvantage of Dynamo could be listed here so I’m just going to focus on the minor differences they have)
- Column indices for faster lookup. This is a great optimization for wide-column data stores because it avoids column-by-column scanning during data retrieval.
- The load balancing scheme is more tractable and deterministic because virtual nodes are no longer assigned to machines. Instead, the load and performance of each machine are analyzed and data is moved from node to node to even out the distribution. This means that (1) the balancing is observation-based, so the result should be better than virtual-node assignment, and (2) the method is dynamic and can react to run-time load imbalance (a rough sketch of this idea follows the strong points).
- While Cassandra is a fully distributed system, the use of Zookeeper really makes things simple and elegant. For example, replica placement is decided by a leader elected through Zookeeper rather than by running a consensus algorithm among all nodes. This way replication requires less coordination, which can substantially reduce the number of messages exchanged and the overall latency.
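A minimal sketch of my reading of that observation-based balancing, assuming a ring of (token, node) pairs; the tokens, node names and load numbers are all invented:

# Rough sketch of observation-based balancing: given measured load per
# node, the lightest node leaves its ring position and rejoins in the
# middle of the heaviest node's token range, taking over half of it.
# Tokens and load numbers are invented; ring wrap-around is ignored.
def rebalance(ring, load):
    """ring: sorted list of (token, node); load: node -> measured load."""
    ring = sorted(ring)
    heavy = max(load, key=load.get)
    light = min(load, key=load.get)
    idx = next(i for i, (_, n) in enumerate(ring) if n == heavy)
    range_start = ring[idx - 1][0]                   # heavy node owns (range_start, its token]
    new_token = (range_start + ring[idx][0]) // 2
    ring = [(t, n) for t, n in ring if n != light]   # light node gives up its old spot
    ring.append((new_token, light))                  # ...and rejoins mid-range
    return sorted(ring)

ring = [(10, "A"), (40, "B"), (200, "C")]            # C owns the large (40, 200] range
print(rebalance(ring, {"A": 5, "B": 6, "C": 20}))    # A moves to token 120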
Weak points:
- “Cassandra morphs all writes to disk into sequential writes thus maximizing disk write throughput”. I take this to mean that concurrent write operations are serialized by their timestamps, with the ordering agreed on by all replicas through coordinated consensus or a leader's decision. Either way, this kind of serialization could produce an ordering different from the order in which the writes actually happened.
- Cassandra is a NoSQL database, a model that has proven more flexible and scalable than relational databases (just like Dynamo and Bigtable). However, NoSQL databases require more attention to the storage structure, and the data can be hard to import/export without standards and proper query-language support.
- I’m just nitpicking here, but the load balancing could be potentially problematic. It requires periodic background scanning of load, which consumes time, bandwidth and computation power, and a recovering lightly loaded node could be assigned too much data in a short span of time, which might result in a node crash and/or a network jam.
Paper Outline:
- Introduction
- Related work
- Data model:
- A table is a distributed multi-dimensional map indexed by a key;
- The value is highly structured:
- row key is a string with no size limit;
- every operation under a single row key is atomic per replica;
- columns are grouped together to form column families;
- super column families are column families within column families;
- the sort order within a column family is application-specified;
- either by time/name;
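A minimal sketch of how I picture this data model; the class names (Column, ColumnFamily, Row) are made up for illustration and are not Cassandra's actual internals:

import time

# Toy picture of the data model above: a table maps a row key to column
# families, each holding (name, value, timestamp) columns; the sort
# order within a family is supplied by the application (by name here).
class Column:
    def __init__(self, name, value, timestamp=None):
        self.name = name
        self.value = value
        self.timestamp = timestamp if timestamp is not None else time.time()

class ColumnFamily:
    def __init__(self, sort_key=lambda c: c.name):
        self.columns = {}          # column name -> Column
        self.sort_key = sort_key   # application-specified sort order

    def insert(self, column):
        self.columns[column.name] = column

    def sorted_columns(self):
        return sorted(self.columns.values(), key=self.sort_key)

class Row:
    def __init__(self, key):
        self.key = key             # row key: a string with no size limit
        self.families = {}         # column family name -> ColumnFamily

table = {}                         # row key -> Row: the distributed multi-dimensional map
row = table.setdefault("user42", Row("user42"))
row.families.setdefault("profile", ColumnFamily()).insert(Column("email", "a@b.c"))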
- API:
- insert(table, key, rowMutation);
- get(table, key, columnName);
- delete(table, key, columnName);
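A hedged sketch of what calls against this three-method API might look like; the CassandraClient class and its in-memory dict are invented for illustration, while the column addressing follows the column_family:column convention from the paper:

# Hypothetical in-memory client illustrating the three calls above.
class CassandraClient:
    def __init__(self):
        self.store = {}   # (table, key) -> {column name: value}

    def insert(self, table, key, row_mutation):
        self.store.setdefault((table, key), {}).update(row_mutation)

    def get(self, table, key, column_name):
        return self.store.get((table, key), {}).get(column_name)

    def delete(self, table, key, column_name):
        self.store.get((table, key), {}).pop(column_name, None)

client = CassandraClient()
client.insert("inbox", "user42", {"messages:msg1": "hello"})
print(client.get("inbox", "user42", "messages:msg1"))   # -> hello
client.delete("inbox", "user42", "messages:msg1")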
- System architecture:
- Partitioning:
- uses order-preserving consistent hashing to partition data across the cluster;
- each data item is assigned a position on a ring formed by the hash output range;
- walk clockwise from that position and take the first node encountered as the key's coordinator;
- main principle of this design:
- leaving/joining of a data node only affects its immediate neighbors;
- challenges:
- random assignment causes non-uniform data and load distribution;
- oblivious to the heterogeneity in the performance of nodes;
- solutions:
- virtual nodes like the ones in Dynamo;
- analyze load information on the ring and move lightly loaded nodes to alleviate heavily loaded ones (Cassandra's choice); this makes the design and implementation very tractable and helps in making deterministic choices about load balancing;
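A toy sketch of the ring walk described above; md5 is just a stand-in hash (not the order-preserving function Cassandra actually uses) and the node names are invented:

import bisect
import hashlib

# Toy ring: nodes and keys hash to positions on the same ring; the
# coordinator of a key is the first node found walking clockwise from
# the key's position.
RING_SIZE = 2 ** 32

def position(token):
    return int(hashlib.md5(token.encode()).hexdigest(), 16) % RING_SIZE

class Ring:
    def __init__(self, nodes):
        self.nodes = sorted((position(n), n) for n in nodes)

    def coordinator(self, key):
        positions = [p for p, _ in self.nodes]
        idx = bisect.bisect_right(positions, position(key)) % len(self.nodes)
        return self.nodes[idx][1]

ring = Ring(["nodeA", "nodeB", "nodeC"])
print(ring.coordinator("user42"))   # first node clockwise from the key's position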
- Replication:
- the replication factor N is configured per instance;
- the coordinator of a key is responsible for replicating it:
- rack unaware:
- pick the next N-1 data nodes on the ring as replicas;
- rack aware/datacenter aware:
- leader election by Zookeeper;
- the leader makes a concerted effort to maintain the invariant that no node is responsible for more than N-1 ranges;
- metadata backed up at each node and Zookeeper;
- each node is aware of every other node in the system;
- durability guarantees by relaxing the quorum requirements;
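A small sketch of the rack-unaware policy only (the coordinator plus the next N-1 distinct nodes clockwise); the integer tokens and node names below are invented:

import bisect

# Rack-unaware placement: the coordinator plus the next N-1 distinct
# nodes walking clockwise. Tokens are plain integers standing in for
# hashed positions.
def replicas(ring, key_position, n):
    """ring: sorted list of (token, node); returns n replica nodes for the key."""
    tokens = [t for t, _ in ring]
    i = bisect.bisect_right(tokens, key_position) % len(ring)
    chosen = []
    while len(chosen) < min(n, len(ring)):
        node = ring[i][1]
        if node not in chosen:
            chosen.append(node)
        i = (i + 1) % len(ring)
    return chosen

ring = [(10, "nodeA"), (40, "nodeB"), (70, "nodeC"), (95, "nodeD")]
print(replicas(ring, key_position=50, n=3))   # ['nodeC', 'nodeD', 'nodeA']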
- Membership:
- cluster membership is based on Scuttlebutt gossip;
- failure detection:
- Accrual Failure Detection: not a binary up/down status but a suspicion level for each monitored node;
- every node in the system maintains a sliding window of inter-arrival times of gossip messages from other nodes;
- the distribution of these inter-arrival times is determined and the suspicion level is calculated from it;
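A rough sketch of that idea, assuming an exponential model for inter-arrival times (the paper mentions an exponential approximation works well); the window size and heartbeat times below are made up:

import math
from collections import deque

# Accrual failure detector sketch: keep a sliding window of gossip
# inter-arrival times per peer and turn the time since the last message
# into a suspicion level phi.
class AccrualDetector:
    def __init__(self, window=1000):
        self.intervals = deque(maxlen=window)
        self.last_heartbeat = None

    def heartbeat(self, now):
        if self.last_heartbeat is not None:
            self.intervals.append(now - self.last_heartbeat)
        self.last_heartbeat = now

    def phi(self, now):
        if not self.intervals:
            return 0.0
        mean = sum(self.intervals) / len(self.intervals)
        # -log10 of the probability of still seeing no message after
        # `elapsed` under an exponential inter-arrival model
        elapsed = now - self.last_heartbeat
        return elapsed / (mean * math.log(10))

d = AccrualDetector()
for t in (0, 1, 2, 3, 4):          # one gossip message per second
    d.heartbeat(t)
print(round(d.phi(now=4.5), 2))    # ~0.22: low suspicion
print(round(d.phi(now=12.0), 2))   # ~3.47: suspicion grows the longer the silence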
- Bootstrapping:
- when a new node is joining:
- randomly pick a token on the ring;
- position backed up on local disk and Zookeeper;
- position info is gossiped around the cluster;
- read the configuration file (a list of a few contact points);
- manual startup and joining in to restore a failed node;
- Scaling the cluster:
- new node should alleviate a heavily loaded node;
- data is streamed over using a kernel-to-kernel copy technique;
- this transfer could potentially be parallelized;
- Local persistence:
- local file system for data persistence;
- typical write operation involves:
- a write into a commit log for durability and recoverability;
- an update into an in-memory data structure after;
- dump in-memory data into local disk when it reaches threshold;
- write is sequential and indexed for easy lookup;
- periodically merging files;
- read operations:
- read memory before disk;
- bloom filter for key checking before read;
- maintain column indices to jump to the right chunk;
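A compressed sketch of this write/read path, assuming a tiny flush threshold and a plain set of keys standing in for the per-file bloom filter; file handling is simplified away:

# Write path: append to a commit log, update an in-memory table, dump it
# to an append-only file once it crosses a threshold. Read path: check
# memory first, then the per-file key summary before touching file data.
class LocalStore:
    def __init__(self, threshold=3):
        self.commit_log = []        # sequential log for durability/recoverability
        self.memtable = {}          # in-memory data structure
        self.files = []             # dumped files: (key summary, data)
        self.threshold = threshold

    def write(self, key, value):
        self.commit_log.append((key, value))        # commit log first
        self.memtable[key] = value                  # then the in-memory structure
        if len(self.memtable) >= self.threshold:    # dump when over threshold
            self.files.append((set(self.memtable), dict(self.memtable)))
            self.memtable = {}

    def read(self, key):
        if key in self.memtable:                    # memory before disk
            return self.memtable[key]
        for summary, data in reversed(self.files):  # newest file first
            if key in summary:                      # bloom-filter-style key check
                return data[key]
        return None

store = LocalStore()
for i in range(5):
    store.write("k%d" % i, i)
print(store.read("k1"), store.read("k4"))   # 1 4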
- Implementation details:
- Cassandra process on a single machine consists of:
- partitioning module;
- cluster membership and failure detection module;
- storage engine module;
- modules rely on an event-driven substrate where the message processing and task pipelines are split into multiple stages along the lines of SEDA;
- system control messages are over UDP while the application related messages are over TCP;
- states a read/write request goes through:
- identify the nodes that own the data for the key;
- route the request to the nodes and wait on response;
- fail the request and return to the client on timeout;
- figure out the latest response based on timestamp;
- schedule a repair of the data if necessary;
- purging commit log;
- fast sync mode: write to buffered log with potential risk of data loss;
- morphs all writes to disk into sequential writes and keeps the write path practically lockless to maximize write throughput;
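A sketch of the read-side steps above (pick the freshest reply by timestamp and repair stale replicas); replicas are modeled as plain dicts of key -> (value, timestamp), routing and timeouts are omitted, and all values are invented:

def read_with_repair(replicas, key):
    # Query the replicas that own the key, take the response with the
    # latest timestamp, and push it back to any replica holding stale data.
    responses = [(node, node[key]) for node in replicas if key in node]
    if not responses:
        return None
    _, (value, ts) = max(responses, key=lambda r: r[1][1])   # latest timestamp wins
    for node, (_, t) in responses:
        if t < ts:
            node[key] = (value, ts)                          # repair the stale replica
    return value

r1 = {"user42": ("old@example.com", 1)}
r2 = {"user42": ("new@example.com", 5)}
r3 = {"user42": ("old@example.com", 1)}
print(read_with_repair([r1, r2, r3], "user42"))   # new@example.com
print(r1["user42"])                               # ('new@example.com', 5): repaired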
- Practical experiences