I'm curious about "We chose to only store cluster metadata in Raft due to performance limitations imposed by distributed consensus"
I would naively assume that since you need to replicate your database to the slave servers, you need to send the WAL or changes somehow anyway. Could you elaborate on what was the problem there? I'd be surprised if it is because of a fundamental limitation of Raft (but could easily be a limitation of a Raft library!)
Suppose you configure RethinkDB to store three copies of the data. When you do a write, RethinkDB will replicate the data to three nodes; that operation is roughly a matter of sending three messages to the appropriate nodes, getting the acks, and sending the ack back to the client.
If we used Raft to replicate the document, in many cases it would be a lot more chatty. If you look at the Raft paper and track all the messages that would have to go back and forth, it would dramatically increase the latency for each write. This is inherent to Raft (and any distributed consensus protocol). Unfortunately distributed consensus isn't free; you have to pay pretty heavy latency costs which would be unacceptable in production, so we couldn't just uniformly apply that to every write.
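To make the simple path concrete, here's a very rough sketch of the non-Raft write path described above (illustrative only; none of these names are RethinkDB internals):

```python
import asyncio

class Replica:
    """Stand-in for a remote replica; a real one would be a network peer."""
    async def apply_write(self, doc):
        await asyncio.sleep(0)  # placeholder for the network + disk round trip
        return "ack"

async def handle_write(replicas, doc):
    # Fan the document out to every replica in parallel, wait for the acks,
    # then ack the client: one message out and one ack back per replica.
    await asyncio.gather(*(replica.apply_write(doc) for replica in replicas))
    return "ok"  # acknowledgement sent back to the client

asyncio.run(handle_write([Replica(), Replica(), Replica()], {"id": 1}))
```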
OK, so it's a latency concern, which makes perfect sense.
> If we used Raft to replicate the document, in many cases it would be a lot more chatty.
I'm out of my depth here. Leader needs to send AppendEntries and slave needs to apply to persistent storage and ACK. Leader needs to wait for majority of ACKS before responding to the client. That's the same as your three-node replication scenario, so what am I missing here? Does RethinkDB relax consistency guarantees in some cases to achieve better latency?
It's not your job to educate me on Raft, I appreciate your being patient with me, but feel free to opt out of this conversation anytime you please.
> Leader needs to send AppendEntries and slave needs to apply to persistent storage and ACK. Leader needs to wait for majority of ACKS before responding to the client. That's the same as your three-node replication scenario, so what am I missing here?
A couple of things -- the payload in Raft tends to be much higher (though it could probably be fixed with sufficient engineering effort), and in some scenarios this process would have to happen multiple times during netsplits (which may or may not be ok).
RethinkDB doesn't relax consistency guarantees; we implement them in a different way. Check out http://rethinkdb.com/docs/consistency/ for more details.
I'm a bit busy today, but this is a really interesting question. I'll see if we can do a technical blog post on this and go into all the details in depth.
> Does RethinkDB relax consistency guarantees in some cases to achieve better latency?
You responded:
> RethinkDB doesn't relax consistency guarantees; we implement them in a different way.
But the page you linked says:
> `single` returns values that are in memory (but not necessarily written to disk) on the primary replica. This is the default.
> `majority` will only return values that are safely committed on disk on a majority of replicas. This requires sending a message to every replica on each read, so it is the slowest but most consistent.
which seems to imply that the default settings sacrifice consistency for better latency. Can you (or someone else, if you're busy) clarify?
Sorry, this is a bit nuanced and my comment was unclear.
We implement a variety of modes for reading and writing that allow the user to select different trade-offs for consistency and performance. By default: writes are always safe; reads are safe when the cluster is healthy, but can sometimes have anomalies in failure scenarios. You can also do `majority` reads which are completely safe even during failure scenarios, but are slower.
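Concretely, the two read modes look roughly like this in the Python driver (I'm going from the consistency docs linked above, so double-check the exact option names there):

```python
import rethinkdb as r

conn = r.connect("localhost", 28015)

# Default read mode ('single'): answered from the primary's memory -- fast,
# but can show anomalies during failure scenarios.
r.table("users").get("alice").run(conn)

# 'majority' read mode: only returns values safely committed on a majority
# of replicas -- slower, but safe even during failover.
r.table("users", read_mode="majority").get("alice").run(conn)
```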
However, note that the default read mode isn't an "anything can happen" implementation. The guarantees are precisely defined, and 2.1 passes all tests in a large variety of failure scenarios with respect to these guarantees.
My comment about not relaxing consistency guarantees was meant in a slightly different context. The OP was talking about write transactions (implementing individual writes with Raft vs. implementing them a different way), and I pointed out that we don't relax consistency guarantees for writes despite not using Raft.
I realize my comment was confusing -- sorry about that; I didn't mean to mislead.
tim@rethinkdb here. coffeemug's earlier comment is not quite right. There were two reasons why we went for this hybrid approach where Raft handles metadata but not documents: because of how Raft interacts with the storage system, and because of sharding.
In the Raft protocol, the leader sends AppendEntries; the follower writes the log entries to persistent storage, but doesn't apply them to the state machine yet; the leader sends another AppendEntries with a higher commit index; and then the follower applies the changes to the state machine. In RethinkDB's case, the "state machine" is the B-tree we store on disk.

One of the guarantees we provide is that if the server acknowledges a write to the client, then all future reads should see that write; and we perform reads by querying the B-tree. So we can't acknowledge writes until they've been actually written to the B-tree, and we can't start writing to the B-tree until after the write has been committed via Raft. So that's where the latency would come from if we were using Raft to manage individual documents.

We considered a couple of ways to work around this. One would be to make reads check both the B-tree and the Raft log; but that makes the read logic much more complicated. Another would be to start writing to the B-tree as soon as we put the write in the Raft log. The problem is that Raft expects to be able to roll back parts of its log at any time, and our storage engine's MVCC capabilities aren't good enough for that. The hybrid approach allows us to have good latency without major rewrites of the existing storage engine.
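To make that ordering constraint concrete, here's a toy model (this isn't RethinkDB code; every name in it is invented). The follower logs an entry when it arrives, but only applies it to the B-tree once the commit index advances on a later AppendEntries; and since reads are answered from the B-tree, the write couldn't be acknowledged before that point:

```python
class Follower:
    def __init__(self):
        self.log = []          # persistent Raft log (durable before acking)
        self.btree = {}        # stand-in for the on-disk B-tree / state machine
        self.applied = 0       # how many log entries have reached the B-tree

    def append_entries(self, entries, leader_commit):
        self.log.extend(entries)                      # first round trip: log only
        while self.applied < min(leader_commit, len(self.log)):
            key, value = self.log[self.applied]       # later round trip: commit
            self.btree[key] = value                   # index advanced, now apply
            self.applied += 1
        return len(self.log)                          # ack with the match index

f = Follower()
f.append_entries([("doc1", {"x": 1})], leader_commit=0)  # logged, not yet readable
f.append_entries([], leader_commit=1)                    # committed, now in the B-tree
```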
The other reason is that RethinkDB allows tables to be split into shards, and the number of shards can be changed while the database is running and accepting queries. We considered having one Raft instance per shard, but we would have needed to modify the Raft algorithm to allow splitting and merging Raft instances while they are running and accepting queries. (This is approximately what CockroachDB is doing [1].) But the Raft algorithm is really tricky to get right even when you're not trying to modify it, and we wanted to stick as closely as possible to the official algorithm to minimize bugs. The approach we ended up going with allows us to have the performance and convenience of live resharding without having to modify the Raft algorithm.
In response to your question about consistency guarantees: RethinkDB gives users several options for trading off consistency and latency. The default is to acknowledge writes only once they're safely on disk on a majority of replicas, but to perform reads only on the primary replica (leader). This doesn't give perfect consistency; if the leader fails over, reads that hit the database around the time of the failover might see outdated data, or they might read writes that were rolled back as part of the failover. Unfortunately, the only way to get stronger consistency guarantees is to wait for a majority of replicas to acknowledge the read, which makes performance much worse. We offer a safe-but-slow mode for reads, but it's not the default because the performance is so bad. We also offer fast-but-unsafe modes for writes, for users that want better latency and are OK with losing the last few writes in the event of a failover. See the documentation [2] for more information.
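For instance, the fast-but-unsafe write path might look like this in the Python driver (a sketch based on the docs in [2]; worth double-checking the exact option names there):

```python
import rethinkdb as r

conn = r.connect("localhost", 28015)

# Default (hard durability): the write isn't acknowledged until it's safely
# on disk, which is the slow-but-safe behaviour described above.
r.table("users").insert({"id": "alice"}).run(conn)

# Soft durability: acknowledge before the write is flushed to disk. Faster,
# but a failover or crash can lose the last few writes.
r.table("users").insert({"id": "bob"}, durability="soft").run(conn)
```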
Thank you for the detailed and fascinating response! I read and re-read it to make sure I didn't miss any details. I understand now why you chose to handle only metadata with Raft. I really appreciate the time you took to explain not only your reasoning, but your reasoning through the alternatives as well. Databases and distributed systems are both fascinating, and very challenging, problem domains. Combining them is even more challenging.