March 10, 2009

Cassandra of Facebook, or A Tale of a Distributed Database, part 2

This is the third in my series of blog posts about Facebook's Cassandra distributed database. Part 0 discussed the background of the problem Cassandra is meant to address, while part 1 gave an overview of the distributed systems techniques employed by Cassandra. This post is a scattering of my observations from reading the code.

The most obvious feature of Cassandra's code is the pervasive presence of checksums in communication and data handling modules. Checksums are used both to detect data and message corruption, and as a means of finding out if replicated nodes are out of sync.

Cassandra is a production system with a large deployment and the management features present reflect this. There is code for measuring node statistics and sending them off to a collection server. Nodes register themselves with both Zookeeper and JMX cluster-management services, although exactly what both of those are used for at Facebook is unclear as their cluster-management tools have not been released as free software.

The mechanism for selecting nodes for replication can be customized to use metadata about machines to choose nodes on redundant networks/power grids/backplanes/etc to increase availability. The strategy provided in the code indicates that Facebook structures their Cassandra deployment using IPv4 addresses so that all the machines with the same third octet of an address are in the same rack, and all machines with the same second octet are in the same datacenter.

The gossip protocol, besides being used for failure detection, is also used for load-balancing the cluster. Workload is measured as the number of requests per number of keys in a given time window that are handled by a node responsible for a particular interval of the key ring, and is disseminated by each node to the others. To alleviate high load, a node may move itself around on the key ring to more evenly distribute the keys between itself and a lightly loaded neighbor. The approach is an implementation of Abdallah and Le's Scalable Range Query Processing for Large-Scale Distributed Database Applications.

The gossip protocol itself is rather interesting, featuring three stages (gossip message, ack, ack-ack).

An interesting artifact is the com.facebook.infrastructure.continuations package, which has some stubs based on the Javaflow bytecode continuation transformer. You would guess this would be used for the server code since it's based on non-blocking IO, but the server code is actually hand-coded staged event-driven. The stubs in the package aren't used anywhere else, which means that some other Facebook project must be using those. I wonder what for?

The code itself can be found at, but seems to be in the process of being moved to a different repository. In the meantime, there is an active git repository at that features community-submitted bugfixes and improvements.

To end on a light note, it is refreshing to find that the authors of Cassandra give a nod to the LOLspeak code commenting guidelines:

* 3. Set one of teh messages to get teh data and teh rest to get teh digest
// 4. SendRR ( to all the nodes above )
// 5. Wait for a response from atleast X nodes where X <= N and teh data node
* 6. If the digest matches return teh data.

No comments: