Is it a weakness to only commit on majority consensus? I'm thinking of a very un...

aaronblohowiak · on Aug 16, 2022

This is a consistent and partition-tolerant system; what you are describing is an available and partition-tolerant system, but not one that can provide consistent results. (That you cannot have all three properties is called the CAP theorem. Some people say they have all three, but they just put a tight bound on unavailability and claim it doesn't exist.) There are a variety of ways to achieve availability and partition tolerance, with conflict resolution implemented as a rule either by the database or by the application.
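To make the trade-off above concrete, here is a minimal sketch of the majority rule, assuming a fixed cluster size and a hypothetical helper named `can_commit` (not taken from any Raft library). It only shows why the minority side of a partition has to refuse writes.

```python
def can_commit(reachable_nodes: int, cluster_size: int) -> bool:
    """Return True if a write may commit under a strict-majority rule.

    Hypothetical helper for illustration only: a write commits when more
    than half of the full cluster (not just the reachable part) acknowledges it.
    """
    return reachable_nodes > cluster_size // 2


# A 5-node cluster split 3/2 by a network partition.
majority_side = can_commit(3, 5)  # this side keeps accepting writes
minority_side = can_commit(2, 5)  # this side must reject writes

# An even 2/2 split of a 4-node cluster leaves neither side with a majority.
even_split = can_commit(2, 4)
```

Because two disjoint groups can never both hold a strict majority, at most one side of a partition keeps committing. That is the consistency guarantee, and its cost is unavailability on the other side.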

kortex · on Aug 16, 2022

I'm not sure Raft is the best distributed consensus algorithm for the situation of a global, unstable, frequently-partitioning network. I think it is in its niche when leaders are running on fairly stable networks (>1-2 nines), and the main source of node failures is task cycles / rolling deploys.

I've played around with HashiCorp Consul on "edge boxes" - long-haul, wirelessly-connected embedded computers with unreliable power supplies. Allowing edge boxes to be Consul leaders results in all kinds of mayhem: split-brain situations, corrupted state, stale DNS resolutions (Consul handles DNS as well), cats and dogs living together, mass hysteria. A much better topology is to have 3 server nodes on a LAN as the "head cluster" and let all the edge boxes be clients of the head.

I haven't used it, but Consul has a multi-datacenter mode, which I believe is designed to better handle such a situation by having a dedicated Raft cluster per datacenter.

https://learn.hashicorp.com/tutorials/consul/federation-goss...

zambal · on Aug 16, 2022

You either need consistency or you don't. Raft is for systems that need this guarantee. If you don't need it, something like CRDTs can be used.
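For a feel of what a CRDT looks like, here is a minimal sketch of a grow-only counter (G-Counter), one of the simplest CRDTs. The class and method names are my own for illustration, not from any particular library.

```python
class GCounter:
    """Grow-only counter CRDT (illustrative sketch, names are hypothetical).

    Each node increments only its own slot. Merging takes the per-node
    maximum, so merges are commutative, associative, and idempotent.
    """

    def __init__(self, node_id: str) -> None:
        self.node_id = node_id
        self.counts: dict[str, int] = {}

    def increment(self, amount: int = 1) -> None:
        # A node only ever writes to its own entry.
        self.counts[self.node_id] = self.counts.get(self.node_id, 0) + amount

    def value(self) -> int:
        # The counter's value is the sum over all nodes' entries.
        return sum(self.counts.values())

    def merge(self, other: "GCounter") -> None:
        # Take the element-wise maximum of the two state maps.
        for node, count in other.counts.items():
            self.counts[node] = max(self.counts.get(node, 0), count)


# Two replicas accept writes independently, e.g. during a partition.
a = GCounter("a")
b = GCounter("b")
a.increment(2)
b.increment(3)

# Once they can talk again, they exchange state in any order and converge.
a.merge(b)
b.merge(a)
```

Both replicas stay writable the whole time and end up with the same value without any leader or quorum, but at no point during the partition does either one know the global count.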

baby · on Aug 16, 2022

Yes, IIRC they are called leaderless protocols, where you have more than one writer at a time. When there's a conflict, you either let the user resolve it (slow) or pick a default resolution strategy. For example, LWW (last write wins) simply accepts the last write.

It's been a while since I read "Designing Data-Intensive Applications", but there's a chapter on that.
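A rough sketch of LWW as described above, assuming each write carries a timestamp and a node ID. The `Write` type, `lww_resolve` function, and the node-ID tie-break are my own illustrative choices, not any specific database's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Write:
    """A single write to one key (hypothetical type for illustration)."""
    value: str
    timestamp: float  # assumed to come from the writer's clock
    node_id: str      # used only to break timestamp ties deterministically


def lww_resolve(a: Write, b: Write) -> Write:
    """Pick the write with the later timestamp; ties go to the higher node_id."""
    return max(a, b, key=lambda w: (w.timestamp, w.node_id))


# Two nodes accept conflicting writes to the same key during a partition.
from_node_a = Write(value="blue", timestamp=100.0, node_id="a")
from_node_b = Write(value="green", timestamp=101.5, node_id="b")

winner = lww_resolve(from_node_a, from_node_b)
```

Here `winner.value` is "green" and the "blue" write is silently dropped. That data loss, plus the dependence on reasonably synchronized clocks, is the price of never blocking a writer.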