Sunday, November 29, 2015

Data consistency in distributed systems: From ACID to BASE

Brewer's CAP conjecture, later proven by Gilbert and Lynch, established that it is impossible for a distributed system to provide consistency, availability, and partition tolerance all at once. This result has driven the design of distributed systems that offer weaker consistency guarantees in exchange for high availability in the face of node and communication failures - BASE (Basically Available, Soft state, Eventually consistent).

In traditional databases, the two-phase commit protocol guarantees atomic updates: a change is committed on all of the N nodes or on none of them. Waiting for all N nodes to respond increases the latency of the operation and also hurts availability, since if any one of the N nodes fails to respond the update cannot proceed.

In eventually consistent systems like Amazon's Dynamo, updates are considered complete once they are written to a subset of the nodes and are propagated to the remaining nodes asynchronously. A subsequent read against a node may therefore return stale data if the update has not yet reached that node. As a result, these systems can hold different "versions" of the same data and must also be able to reconcile those versions. Dynamo uses vector clocks to order the versions when possible - so the latest wins - and otherwise relies on the client to reconcile the conflicting versions.

It is possible to avoid conflicts entirely by using consensus algorithms like Paxos or Raft, which ensure that all nodes agree on a single sequence of updates. The general consensus problem is about reaching agreement across a set of distributed processes that can fail due to faults in the infrastructure (network or node failure) or for malicious reasons (Byzantine failures). Most practical implementations of these algorithms adopt a leader/master-based approach where all writes are made to the master and then propagated to the replicas, as in MongoDB. When the master fails, a leader election process is initiated that elects a new master while demoting the failed one.

MongoDB supports various consistency vs. latency tradeoffs by allowing you to specify different write concerns: unacknowledged, journaled (written to the journal on the master), or replica-acknowledged (written to one or more replicas in addition to the master). Yet Jepsen (a tool that tests how distributed systems behave under network partitions) has shown that even with the write concern set to acknowledge writes on a majority of the replicas in a cluster, you can still end up with missing writes!
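To see why waiting on every node hurts, here is a toy two-phase commit coordinator. It is a minimal sketch, not any real implementation: `Participant` and its methods are invented names, and the "network" is just in-process calls.

```python
# Toy two-phase commit coordinator (illustrative only; Participant and its
# methods are invented names, and the "network" is in-process calls).
class Participant:
    def __init__(self, name, healthy=True):
        self.name, self.healthy = name, healthy
        self.staged, self.value = None, None

    def prepare(self, value):
        # Phase 1: stage the update and vote yes/no.
        if not self.healthy:
            return False        # one "no" vote (or timeout) dooms the update
        self.staged = value
        return True

    def commit(self):
        # Phase 2: make the staged update durable.
        self.value, self.staged = self.staged, None

    def abort(self):
        self.staged = None

def two_phase_commit(participants, value):
    # The coordinator blocks until *all* N participants vote yes, which is
    # where the latency and availability cost of 2PC comes from.
    if all(p.prepare(value) for p in participants):
        for p in participants:
            p.commit()
        return True
    for p in participants:
        p.abort()
    return False

nodes = [Participant("n1"), Participant("n2"), Participant("n3", healthy=False)]
print(two_phase_commit(nodes, 42))   # False: one unavailable node blocks all
```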
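The "write to a subset, read possibly stale" behavior can be sketched as well. The `quorum_write`/`quorum_read` functions below are hypothetical illustrations of an N/W/R quorum scheme over in-memory replicas: choosing W + R > N forces every read set to overlap the latest write set, while smaller quorums trade that guarantee for lower latency.

```python
import random

N, W, R = 3, 2, 2                      # replicas, write quorum, read quorum
replicas = [dict() for _ in range(N)]  # each replica: key -> (version, value)

def quorum_write(key, version, value):
    # Report success once W replicas acknowledge; the remaining replicas
    # lag behind and converge "eventually".
    for acks, node in enumerate(random.sample(replicas, N), start=1):
        node[key] = (version, value)
        if acks == W:
            return True
    return False

def quorum_read(key):
    # Consult R replicas and return the highest version seen. With
    # W + R > N the read set must overlap the last write set, so the
    # newest version is found; with W + R <= N reads can be stale.
    answers = [node[key] for node in random.sample(replicas, R) if key in node]
    return max(answers) if answers else None

quorum_write("cart", 1, "book")
print(quorum_read("cart"))             # (1, 'book') since 2 + 2 > 3
```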
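Version reconciliation hinges on comparing vector clocks. Here is a minimal comparison, assuming each clock is a dict mapping a coordinating node's id to an update counter - a sketch of the idea from the Dynamo paper, not its actual code:

```python
def compare(a, b):
    # Return how two vector clocks relate: one descends from the other
    # (ordered, so the latest version wins) or they are concurrent
    # (a conflict that the client must reconcile).
    nodes = set(a) | set(b)
    a_ge = all(a.get(n, 0) >= b.get(n, 0) for n in nodes)
    b_ge = all(b.get(n, 0) >= a.get(n, 0) for n in nodes)
    if a_ge and b_ge:
        return "equal"
    if a_ge:
        return "a descends b"       # a saw everything b saw: a wins
    if b_ge:
        return "b descends a"       # b wins
    return "concurrent"             # divergent histories: client reconciles

print(compare({"A": 2}, {"A": 1}))          # 'a descends b' -> latest wins
print(compare({"A": 2}, {"A": 1, "B": 1}))  # 'concurrent' -> client merges
```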
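As a concrete example of those tradeoffs, here is how write concerns can be selected with the PyMongo driver (the connection string and collection name are placeholders): roughly, stronger concerns raise latency while shrinking - but, per Jepsen, not eliminating - the window in which an acknowledged write can be lost.

```python
from pymongo import MongoClient
from pymongo.write_concern import WriteConcern

# Placeholder connection string for a replica set named "rs0".
client = MongoClient("mongodb://localhost:27017/?replicaSet=rs0")
db = client.test

# Same collection, three different durability/latency tradeoffs.
fire_and_forget = db.events.with_options(
    write_concern=WriteConcern(w=0))            # unacknowledged
journaled = db.events.with_options(
    write_concern=WriteConcern(w=1, j=True))    # journaled on the master
majority = db.events.with_options(
    write_concern=WriteConcern(w="majority"))   # replica acknowledged

majority.insert_one({"event": "order-created"})
```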