December 23, 2019
In this document, Distributed Systems
refers to Stateful software systems where there exists at least 1 operation that involves network call. In practice, distributed systems are normally concurrent.
Consistency models attempt to answer the following question:
Does our system preserve orders of operations? Under what constraints. Eg. If we write A
, then B
into a registry, are we guaranteed to observe the data in the same sequence?
This is important because in a stateful system, order of operations affect the state. In general, Operation A, Operation B
is not equal to Operation B, Operation A
. It is useful for participants of the system to be able observe operations in correct order.
There are many different consistency models, eg
and many more, aphyr documented a lot of them
Consistency model is likely one of the most complicated part of a distributed system, not only because the vast number of different models, but also because it tends to span across multiple operations, when talking about consistency model, we are talking about how multiple operations interact with a state, and what kind of guarantee are we giving to each of those operations.
Here I am referring to the user interaction
with our system, if user have to wait for a result, then the interaction is synchronous, if user does not need to wait for result, then it is asynchronous.
Normally, synchronous interaction implies stronger requirement in terms of latency, as we normally want to minimize wait time of our user.
Note: Different operations in a system can have different interaction mode
Atomicity guarantee an operation either completed or never happen, there should be no observable
intermediate state.
In the context of distributed system, it is not uncommon to have partial failure, eg. if an operation requires 2 stateful network call to complete it’s job, what should we do if 1 call succeeded and the other failed?
This problem is fairly difficult to solve, and solution likely depends on your problem domain, possible solution includes 2PC, Saga Pattern.
Be aware on dependency of system time, in theory distributed system should rely on system time for correctness, but it is ok to depend on system time for other purpose like live-ness or debugging. We just need to keep in mind that clock drift is real.
Availability refers to the ability of the system to continue it’s operation even when part of the system is down.
In distributed system, it is inevitable to have timeout, because network partition can happen anytime. Timeout brings a lot of complexity to our system because when an operation timed out on the caller, the operation might actually finished, or failed on the callee, or worse it might finished later, which the caller has no visibility at all.
It is typically for client to retry, thus it is crucial for api to be idempotent.
In general, the stronger the consistency model you need, the worse throughput, latency and availability you will get.
If you need extra mechanism to enforce atomicity, it is likely to worsen throughput, latency and availability too.
This is a list of questions to ask when building distributed system: