Consensus algorithms

I’m going to start talking about data replication, and the first and probably most important topic is that of consensus algorithms. I don’t want to regurgitate what others have written, so at first this is going to be a reading list. But, at some point, it will contain original bits, and then all of this will turn into a bibliography.

What went before

Here are a handful of the papers I think are important.

Time, Clocks and the Ordering of Events in a Distributed System. 1978 - brilliant or obvious, or both, this provided a more formal description of partial orders (which is all we get in our universe where relativity applies).

Consensus algorithms

This sub-field was launched with a paper from Lamport titled The Byzantine Generals Problem.

The Byzantine Generals Problem. 1982 - making progress in the presence of conflicting information.

Wikipedia has a decent article

Consensus

Note that state machine replication and consensus algorithms go together, and we’ll be diving into that as well, since that’s the actual fun/hard work.

State machine replication on Wikipedia.

2PC - two-phase commit

Conceptually, two-phase commit is straightforward; the commit-request phase (or call it the voting phase), and the commit phase, where the coordinator decides whether to commit or not, based on the information gathered in the voting phase.

This is a widely-used technique, although it has problems in the presence of failures. E.g. it assumes there is storage that can be trusted at each node, that no node crashes, and that nodes can communicate with each other. And it’s a blocking protocol. Other than that, it’s great :)

Two-phase commit protocol from Wikipedia.
Consensus Protocols: Two-Phase Commit. 2008.

3PC - three-phase commit

Fixes much of fragility of 2PC, although it’s still vulnerable to partitions. E3PC tries to address the problem of partitions.

Three-phase commit protocol from Wikipedia.
A Formal Model of Crash Recovery in a Distributed System. 1983, Skeen and Stonebraker, introduced the three-phase commit protocol.
Consensus Protocols: Three-phase Commit. 2008.
Crash Recovery in a Distributed Data Storage System. 1979, Butler Lampson and Howard Sturgis.
Increasing the Resilience of Distributed and Replicated Data Systems. 1998, Idit Keidar and Danny Dolev, enhanced three-phase commit.

Paxos

This was developed by Leslie Lamport in the late 1980s, circulated informally in the community in the early 1990s, and finally formally published in Transactions on Computer Systems in 1998. It is an improvement over 2PC and 3PC (two-phase commit and three-phase commit).

Paxos has the reputation of being complex, and developers are often warned against implementing it themselves. This seems a little draconian or elitist to me.

These are in publication order, but are starred with a reading order for those wanting to understand Paxos.

Paxos - Wikipedia. 2015 - recent version.

Papers

The Part-Time Parliament. 1998 - the first Leslie Lamport paper on the Paxos algorithm.
Paxos Made Simple. 2001 - another paper from Lamport explaining Paxos, presumably in a fashion easier for some people to understand.
Cheap Paxos. 2004 - a version of Paxos that is cheaper (less work) when processors aren’t constantly failing.
Fast Paxos. 2005 - paper by Lamport proposing a faster version of Paxos that learns in two message delays instead of three (but error-intolerant).
Generalized Consensus and Paxos. 2005 - allow parallel execution of non-interfering commands.
Stoppable Paxos. 2008.
Vertical Paxos and Primary-Backup Replication. 2009 - an exploration of the idea that valid master/slave replica systems are a variant of Paxos.
Leaderless Byzantine Paxos. 2011.

Blog articles

Consensus Protocols: Paxos. 2009 - A fairly clear article from the Paper Trail blog on Paxos.
Understanding Paxos Part 1, Part 2. 2013 - pair of articles on Paxos.
Neat Algorithms - Paxos. 2014 - Javascript implementation from Harry Brundage.
Paxos Made Moderately Complex. 2015 - paper from Robbert van Renesse and Deniz Altinbuken talking about full reconfiguratble multidecree Paxos.

Uses of Paxos

Google: the Chubby distributed lock service (BigTable uses Chubby).
Yahoo and ZooKeeper.
Heroku and Doozerd.
Amazon Web Services.
Cassandra and Nutanix.
libpaxos-cpp

Raft

Raft was developed as an easier-to-understand consensus algorithm, easier that Paxos, that is.

The Raft Consensus Algorithm. Home page on Github.
Raft on Wikipedia.
In Search of an Understandable Consensus Algorithm. 2014, Osterhout paper introducing Raft.
The Secret Lives of Data
Raft refloated - do we have consensus. 2015.
Consensus Protocol. Consul uses Raft for its consensus protocol.
In Search of an Understandable Consensus Algorithm video of presentation at Usenix 2014.
ARC: Analysis of Raft Consensus. 2014, Heidi Howard.
logcabin. LogCabin is a distributed storage system built on Raft; written in C++.

Facebook is using Raft in HydraBase, to replace HBase. CoreOS is using Raft in etcd.

https://github.com/coreos/etcd

Ark

Introduced for TokuMX and MongoDB.

Names in consensus algorithms

Major names

Leslie Lamport. Also wrote TLA+, and LaTeX.
Jim Gray. Two-tier transaction commit semantics, described ACID semantics.
Butler Lampson. One of the key names for Xerox PARC and the Alto.
Barbara Liskov.

Implementors

Harry Brundage
Henry Robinson LinkedIn, Paper Trail, henryr on Github - Cloudera, ZooKeeper.