RethinkDB 2.1 beta: automatic failover with Raft (rethinkdb.com)
135 points by coffeemug on July 16, 2015 | 58 comments


Hey guys, Slava @ RethinkDB here. I'll be around all day to answer questions.

We're super-excited about this beta. We've been working on Raft, automatic failover, and all the relevant enhancements for over a year. It's a massive engineering project that's getting very close to production quality, and it feels so good to finally get it into people's (proverbial) hands!


Congratulations! Have you been testing this with Jepsen internally, out of curiosity? :)


Yes, of course! We've also been working with Kyle to understand the Jepsen codebase better. RethinkDB does really well on the Jepsen tests in our internal testing. Of course it's up to Kyle to have the final word :)


Hi Slava. Excuse me if this is a beginner's question, but on your FAQ[0] page the author argues that it makes more sense for the primary not to fail over because this avoids split-brain problems during netsplits.

If you've switched to this new Raft algorithm, does this address the split-brain problem? What will now happen if a netsplit occurs and two primaries are elected, then rejoined?

[0] http://rethinkdb.com/docs/architecture/#cap-theorem


Chris, I think we didn't communicate this well. When we were first building RethinkDB we looked at other similar products and noticed that they don't perform well in split-brain scenarios, and allow all kinds of errors and issues to creep in. For examples of this, just read Aphyr's posts that review various distributed systems.

So we thought, first, let's build a system based on manual failover, make that really solid, learn as much as possible about real-world use cases, and then build automatic failover on top of that.

This process took a couple of years, and we're finally at a point where we can ship automatic failover that we're confident will behave correctly in real-world (and theoretical) scenarios (modulo bugs that may not have been uncovered in testing).

To answer your specific question: if a netsplit occurs, two primaries could never be elected. There will be a primary on one side of the split or the other, so rejoining will not be a problem.

In practice, however, this is more subtle -- what if there is a netsplit, one side maintains its primary, and the other side elects a new one? What if the user writes to both primaries on either side? What happens after the cluster rejoins? In this specific case, we solve this by requiring the majority of the replicas to acknowledge writes by default before the write acknowledgement is sent to the client.

These questions can get really, really subtle -- (for example, what happens if there are multiple cascading netsplits in your cluster?) We wanted to take the time to understand these problems much better before we build automated failover, so we went with a manual failover system first.
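To make that default concrete, here is a rough sketch of the per-table acknowledgement setting through the Python driver (the database and table names are just examples):

    import rethinkdb as r

    conn = r.connect(host='localhost', port=28015)

    # By default a write is only acknowledged to the client once a majority of
    # the shard's replicas have it. The setting lives in the table config:
    r.db('test').table('events').config() \
        .update({'write_acks': 'majority'}).run(conn)

    # Relaxing it to 'single' acknowledges as soon as the primary has the
    # write, trading safety for latency.
    # r.db('test').table('events').config() \
    #     .update({'write_acks': 'single'}).run(conn)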


> In practice, however, this is more subtle -- what if there is a netsplit, one side maintains its primary, and the other side elects a new one? What if the user writes to both primaries on either side? What happens after the cluster rejoins? In this specific case, we solve this by requiring the majority of the replicas to acknowledge writes by default before the write acknowledgement is sent to the client.

That's the particular case that I was curious about; that solution makes a lot of sense. I suppose that would also mean that if an outage takes out more than half of your replicas then you will lose all write ability (unless I'm mistaken), but that's probably better than the potential mess that could ensue.

Do these write requests to nodes that have no route to a primary simply "hang", or are they rejected by the daemon?


> That's the particular case that I was curious about; that solution makes a lot of sense. I suppose that would also mean that if an outage takes out more than half of your replicas then you will lose all write ability

That's correct. We've built "emergency repair" provisions into the product to handle cases like this, should they happen, but that requires manual intervention. (In general, if you lose more than half of your servers, you want to intervene manually anyway.)
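For the curious, the emergency repair path is exposed through reconfigure; roughly like this with the Python driver (the database and table names here are just examples):

    import rethinkdb as r

    conn = r.connect(host='localhost', port=28015)

    # If a majority of a table's replicas is permanently gone, an operator can
    # force the table back to availability. 'unsafe_rollback' keeps whatever
    # data the surviving replicas have, which may roll back writes that had
    # only reached the lost servers.
    r.db('test').table('events').reconfigure(
        emergency_repair='unsafe_rollback').run(conn)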

> Do these write requests to nodes that have no route to a primary simply "hang", or are they rejected by the daemon?

They time out via normal TCP mechanisms and get rejected.


Sounds like you guys are taking a pretty reasonable approach to all this, thanks for letting me pick your brain!


Are you aiming for production quality over WAN or LAN?

Obviously latency is an issue, but I'm not sure, given the way RethinkDB replicates, what the performance penalty would be for WAN in terms of write latency.


We are aiming for production quality WAN deployments. Many of our customers run RethinkDB in the cloud, and the latency between cloud regions can be quite punishing.

We'll document this setup better, but the replication options allow you to safely specify the set of nodes that need sync and async replication (so you can have sync replication for closely connected nodes in a datacenter, and async replication across datacenters).
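A sketch of what I'd expect this to look like with the Python driver, using server tags and non-voting replicas (the tag and table names are just examples, and we'll have proper docs for the exact options):

    import rethinkdb as r

    conn = r.connect(host='localhost', port=28015)

    # Servers started with e.g. `rethinkdb --server-tag us_east` or
    # `--server-tag eu_west` can be grouped so that the voting (sync) replicas
    # live in one region and the other region only holds non-voting (async)
    # copies that never block writes.
    r.db('test').table('events').reconfigure(
        shards=2,
        replicas={'us_east': 3, 'eu_west': 1},
        primary_replica_tag='us_east',
        nonvoting_replica_tags=['eu_west']
    ).run(conn)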


Thanks for the response. :)


Have you dealt with mitigating asymmetric partitions in Raft? http://www.cl.cam.ac.uk/techreports/UCAM-CL-TR-857.pdf shows that a single defective node has the potential to render a Raft cluster entirely useless, specifically in the case of asymmetric partitions:

> The mobile devices will often be disconnected from the majority of other devices, thus running election after election without being able to gain enough votes. When these devices rejoin the network, they will force a new election as their term will be much higher than the leader. This leads to frequently disconnected devices becoming the leader when reattached to the network, thus causing regular unavailability of the system when they leave again.

> [...]

> It is also interesting to consider the impact of a single node being behind an asymmetric partition [...] This node will repeatedly re-run elections as it will not be able to receive incoming messages, each RequestVotes will have a higher term than the last, forcing the leader to step down.

Specifically, they had to add client request caches, expose serial transaction IDs in every record and log entry, and add specific timeouts and measures for the livelock bugs this introduced, which include replacing an entry with a noop value to be replicated by some read-only operation done at a later time. On top of this, these modifications may end up reducing the serializability constructs to bring them closer to what master-follower replication may give in a DB -- i.e. it doesn't guarantee one true serializable sequence of commits anymore, but only that the state replication will be consistent across all nodes.


We haven't added special logic for handling such cases. I think having servers with particularly unstable connectivity in a RethinkDB cluster is really uncommon though. If you want to run a RethinkDB server on your mobile device that's a different scenario of course...


Well, the problem is obvious with mobile devices, but any asymmetric netsplit will result in a single node taking down your cluster for as long as the netsplit lasts.


Ah, I have to check how we behave in that case. Thanks for the pointers.


If I understand that scenario correctly, it assumes that there is a node that can successfully send messages to the other nodes in the cluster, but cannot receive any responses.

Since RethinkDB uses TCP connections, this shouldn't usually happen (since the TCP acknowledgements wouldn't get through either). The exception might be a layer 5 router / firewall somewhere in the network that allows the TCP connection to work, but only passes the data stream through in one direction. RethinkDB is partially protected against this case, because we use bidirectional heartbeats on top of the same TCP connection that is used for Raft traffic. The heartbeat usually ensures that the other host is still alive and reachable. In this case, the node that cannot receive any responses from the remaining cluster would get a heartbeat timeout after a couple of seconds, and disconnect from the remaining servers. This in turn should limit further damage and allow the Raft cluster to proceed.

Please let me know if I'm missing something else.

Edit: There might still be some problems with non-transitive connectivity, where one node can talk to only parts of the cluster. We have built in some protection against this, but don't always handle that case well.


This is so timely, I am really excited to get this new version out on our nodes to kick the tires!


Do you have any plans for windows support for your server?


Yes, coming soon. See https://github.com/rethinkdb/rethinkdb/issues/1100 to track development progress.


Awesomeness, confirmed.


I'm continually impressed with the rate and quality of rethinkdb's advancement. There's some really fantastic software engineering going on over there.


Quite impressive, and very honest, too. They provide a "When is RethinkDB not a good choice?" section, they provide a "stability report", that report was a proper torture test, and they responded promptly to the issues raised by the Jepsen test. BTW I have no use for their db at the moment, but if I did one day, I would pick RethinkDB in a heartbeat over all competitors.


The way the RethinkDB crew co-operates over GitHub is fantastic. Every workflow is through GitHub and you can transparently see their operation. I learnt a lot from that and I apply it to my team. We are building a "killer" communication tool using RethinkDB + HTML5. We will be announcing it soon.


Can you give an example of how their cooperation over GitHub is special?


I am with @pest. RethinkDB's GitHub issues have become my daily read, along with GitHub's Atom repo.

https://github.com/rethinkdb/rethinkdb

Just look at the issues: how they handle them, how professional their responses are, and how quickly they resolve issue after issue with full co-operation between the issue owner, users, and contributors. They guide contributors as well. And it is really addictive, be warned.


I really can't give you a concrete answer but I wanted to confirm what my GP said:

I must admit to being obsessed with their GitHub Issues for a while. It was almost as addictive as HN or reddit to check and read new issues and comments -- I've probably read every major discussion and minor interaction they've had over the last year or more.


Don't get me wrong, it's good to see people building alternative database systems, and it's great to see the community as a whole embracing the architectural issues of high availability through a combination of Paxos and Raft.

However, I'm not sure that we should be supporting every damn service having its own rolled-in implementation of its choice of consensus algorithm.

Why not limit service scope to service provisioning, and normalize to a reasonable service-independent solution for high availability consensus and orchestration? This leads to far more solid and well tested operations practices that respect service interactions, and you don't wind up with "we swept it under the rug" problems like the RethinkDB 2.1 beta announcement's "with only a few seconds of unavailability". Come on guys, call a spade a spade.


We explored this possibility in quite a bit of depth, but this turns out to be much harder to do in practice than it seems.

There are currently two types of projects that offer this kind of service. The first is built in C/C++ and packaged as a library. We tried to use these libraries, but a project of RethinkDB's scope has a lot of abstractions for networking, threading, memory allocation, etc. We quickly found that we couldn't effectively integrate existing libraries into our coroutine system and networking stack.

The second type is projects that are typically built in higher level languages and run standalone. We tried that too, but it turns out to be a user experience nightmare. Users would need to know how to configure these different services, and would have to deal with deployment. The configuration aspect is quite difficult -- it's hard to configure these services in the right way without making a mistake (to make them work for the needs of a database system). We also looked into abstracting that away by building a porcelain UI, but there are a lot of challenges there that are very difficult to overcome.

Finally, RethinkDB's needs are quite specialized and there isn't a service that does what we need out of the box. We looked at implementing Raft ourselves and realized it isn't actually hard -- a much bigger challenge is properly architecting the rest of the system. To give you an idea of timing, it took us two weeks to get a Raft implementation, and another few weeks to polish it. It took another year to get everything else working (plus two years of expertise in the field wrt real-world issues that users encounter).

> only a few seconds of unavailability

This isn't something you can escape in a system that's based on authoritative replicas. All other systems that use this architecture face this problem; an external service wouldn't solve it.

TL;DR: it would be really nice if we could use an existing service, but unfortunately this is a much trickier problem than it first appears.


Yes, it's difficult. However, sometimes the best and most responsible thing to do - especially when you find something difficult - is to keep it out of scope!

I was referring to the "higher level languages and run standalone" class of solution, and I agree with you about their drawbacks (hard to use). However, I would argue that a hard-to-use but proven and correct solution for a general class of cases that someone else maintains is ten times better than an easier-to-use but unproven and new solution for a specific case that you have to maintain. Your suggestion requires everyone to learn another syntax for every damn daemon they wish to operate, along with the ongoing cognitive drain and maintenance overhead (upgrades, etc.).

Actually upgrades are a great case in point. How does your daemon handle updates while in clustered mode? I suppose it doesn't. This is another reason why a proven, general solution is great to have ... you actually get an operations process that handles the extremely common but swept under the rug edge cases that nobody wants to talk about, like upgrades, disk failures, oops I unplugged the cable/DDoS/network fabric issues, horizontal scaleout, appropriate rules for (non-)co-habitation with other services, for all services! Yes, TMTOWTDI, but I seriously doubt some random daemon knows better how to handle failure than a proven, daemon-agnostic operations process designed at a significantly more abstract, resource-oriented level.

Have you had a look at the OCF format resource agents? If you define one, you can wash your hands of the whole issue: https://github.com/ClusterLabs/resource-agents/

I don't understand why "authoritative replicas" demand a few seconds of downtime. I find that most daemons, assuming they fsync() appropriately, come up again in well under a second using a dual-master (live, standby) setup with DRBD (essentially network RAID1, works with any filesystem), with zero client reconfiguration required by using a shared (floating) IP to handle failover. Check it out.


A lot of people have trouble with the ClusterLabs stack -- and get into a lot of trouble with it. I've never really used it much myself, but integrating failover into the server process, and making sure the defaults are all correct for a very particular use case, seems like a fairly reasonable thing to do.


Having been developing with RethinkDB for a year, I can say it's amazing how fast this database has evolved: the API is very flexible, the developer team cares a lot about user feedback, and it's a thriving community! Auto failover will be a life saver, but I'm much more anxious for reliable changefeeds :)


Reliable feeds (and much more) are coming in 2.2. I think we found an elegant solution[1] that will make different types of users happy, so it'll be exciting to ship it!

[1] https://github.com/rethinkdb/rethinkdb/issues/3471


That's very nice to hear. Can't wait for reliable changefeeds.


Nice, this will open a whole world of use cases


(Daniel at RethinkDB here) Thanks for the nice words. We're going to add changefeeds that can survive short client disconnects in 2.2, and add further reliability features for changefeeds in the following versions.
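For context, a plain changefeed today is tied to its connection; roughly, with the Python driver (the database and table names are just examples):

    import rethinkdb as r

    conn = r.connect(host='localhost', port=28015)

    # A changefeed streams every change to the table. Today, if the client
    # disconnects, the feed is gone and has to be re-opened from scratch --
    # that's the gap the 2.2 work is meant to close.
    feed = r.db('test').table('events').changes().run(conn)
    for change in feed:
        print(change['old_val'], '->', change['new_val'])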


Every time I say to myself "How can Rethink get more awesome?" I am shown how. What a great product.


well said.


I have played around with this database and I am looking forward to using it on my next project! This is great news, but I would love it if they wrote an Erlang/Elixir driver. There is one that the community wrote, but I would love to see someone on it full time.


@coffeemug, can you describe a typical HA deployment? Are 3 nodes required to start? If you want to shard across 3 nodes, but also sustain a 1 node failure, how many nodes are required? Finally, how about sharding across 3 nodes, but sustaining a two node failure?


3 nodes are indeed required to have automatic failover working.

We recommend using 3 (or more) nodes and replicating all tables with a replication factor of 3. That way each node is going to maintain a full copy of the data. In case of a 1 node failure (with or without sharding, as long as the replication factor is set to 3), one of the two remaining servers is going to take over for the primary roles that might have been hosted on the missing server.

If you want to sustain a two node failure without any data loss, you will need 5 servers and also set the replication factor to 5. 5 is the lowest number that guarantees that if two nodes fail, you will still be left with a majority of replicas (i.e. 3 out of 5). A majority like that is required to guarantee data durability and to enable auto failover. If you are ok with losing a few seconds worth of changes and do not require automatic failover, even a 3 node setup can be enough to sustain a two node failure. In that case you will have to perform a manual "emergency_repair" to recover availability (see http://docs.rethinkdb.com/2.1/api/javascript/reconfigure/ for details), but most of the data should still be there.

In addition you can shard the table into 3 shards for additional performance gains. This is for the most part unrelated to availability and data safety.
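A rough sketch of that recommended setup with the Python driver (the database and table names are just examples):

    import rethinkdb as r

    conn = r.connect(host='localhost', port=28015)

    # Three-node HA baseline: every node holds a full copy of the table, so a
    # single node can fail and one of the remaining replicas takes over as
    # primary automatically.
    r.db('test').table_create('events', shards=1, replicas=3).run(conn)

    # Sharding is for throughput; availability is governed by the replication
    # factor, not the shard count.
    r.db('test').table('events').reconfigure(shards=3, replicas=3).run(conn)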


With data sharded across three nodes, why can't a replication factor of two handle one node's failure? Why does the replication factor need to be three?


That's a great question.

In our current architecture, we perform failover by selecting a new primary among the existing replicas for a given shard. We do not however add new servers to the replica list during a failover condition. Therefore, if you have a table with two replicas per shard and have a one node failure, you would only have a single replica left.

Currently we do not automatically failover in that case. We always require a majority of replicas to be present for any given shard before failing over the primary of that shard.

I believe we could relax this constraint in this specific case, and allow failing over despite only a single replica being left, as long as we still have a majority for the table overall. (I'm going to double-check this with one of our core developers...) Even if we did perform the failover, that would not restore write availability of the table though, since there wouldn't be a majority to acknowledge a write (unless you set write_acks to "single"). It could still be useful to restore availability for up to date read queries. We might add support for auto failover in this scenario in the future.


So, on a minimum three-node setup, three replicas + three shards per table are doable? Are there any caveats?


Yes, doable and desirable. There shouldn't be any caveats to this deployment.


Great questions!

The simplest HA deployment needs three replicas on three nodes. If one of the nodes fails, one of the other two will automatically take over its responsibilities, either until the failed node comes back up, or the user reconfigures the table to not put data on the failed node anymore. So to sustain a one node failure, you need three nodes and three replicas.

To automatically sustain two node failures, you'd need five nodes.

In general, to sustain X failures without manual intervention, you'll need X*2+1 nodes (with the minimum of three nodes to get started).


Are there any performance issues we should know about after the Raft implementation?


The Raft protocol only touches the metadata (table information, location of tables on physical nodes, etc). The query engine works the way it always did, so Raft shouldn't affect query performance in a measurable way.

There might be differences in performance for administrative operations (e.g. resharding), but they should all be positive changes!

EDIT: Daniel beat me to it.


General query performance should be the same, if not better in this release (the improvements are not due to Raft, but due to other changes). We use Raft only to determine a consistent configuration. The queries themselves are executed in the same efficient way as before.

Note that in the beta, reconfiguring tables over servers on rotational disks can be slow https://github.com/rethinkdb/rethinkdb/issues/4279 . If you store large documents, you might also see some increases in memory usage during backfills https://github.com/rethinkdb/rethinkdb/issues/4474 . We're going to address both of these issues before the final release.


> We use Raft only to determine a consistent configuration. The queries themselves are executed in the same efficient way as before.

So, you've made the noob mistake in using Raft, where only configuration goes in the log and actual changes to data aren't replicated in the same manner.


We have carefully designed the way data is versioned and replicated in RethinkDB 2.1 to result in a correct and consistent system in combination with the Raft-supported configuration management. For example we make sure that both components use consistent quorum definitions and the same membership information at any point.

This allows us to provide different degrees of consistency guarantees for the data, depending on the user's need. Our default is already pretty conservative, but allows uncommitted data to be visible by reads. In the strongest consistency mode, RethinkDB 2.1 provides full linearizability (see http://docs.rethinkdb.com/2.1/docs/consistency/ for details). We have confirmed this both theoretically as well as by testing the overall system using the Jepsen test framework.
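A rough sketch of how the strongest mode looks from the Python driver (the database, table name, and document id are placeholders):

    import rethinkdb as r

    conn = r.connect(host='localhost', port=28015)

    # Default reads ('single' mode) may observe writes that are not yet
    # committed to a majority. Asking for 'majority' makes the primary confirm
    # its authority with a majority of replicas before answering, which --
    # together with write_acks='majority' and hard durability -- gives the
    # linearizable mode described in the consistency docs.
    doc = r.db('test').table('events', read_mode='majority') \
        .get('some-id').run(conn)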


What do you gain? Let's say that you're right and you've actually managed to provide the guarantees that you say you have. By adding things to Raft, you haven't made it simpler, and, having implemented Raft myself recently, I can attest that it's not the simplest of things to start with.

If you're looking for different degrees of consistency, I'm fairly sure you can do that without protocol level changes to Raft.


Awesome to see!

I tried giving this a shot, starting up 3 RethinkDB nodes with a proxy node with my existing application, set up with 3 shards and 2 replicas. Unfortunately, I wasn't able to get my application working. The UI had the tables going back to outdated reads (everything was able to initialize once, though). I'll give it another shot next week and report back. It could also be a problem with my setup.


Could you submit a bug report at https://github.com/rethinkdb/rethinkdb/issues/new ? It would help immensely!


I haven't checked the PRs yet, but does this beta also include the performance improvement committed a few weeks ago? https://github.com/rethinkdb/rethinkdb/issues/4441

That one improved RethinkDB's overall write performance by 2x, right?


Yes, that patch for soft-durability writes is in there. Looking forward to hearing whether it improves things for you.
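(Soft-durability writes, for anyone unfamiliar, are acknowledged before being flushed to disk; roughly, with the Python driver and example names:)

    import rethinkdb as r

    conn = r.connect(host='localhost', port=28015)

    # The write is acknowledged once it's in memory on the replicas, without
    # waiting for the disk flush -- the path the linked patch speeds up.
    r.db('test').table('events').insert(
        {'value': 42}, durability='soft').run(conn)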


Thanks, I am going to test after building.


What's the most mature client driver for RethinkDB, in your opinion?


Ruby, Python, and JavaScript are all equally mature (and developed in-house full time). Of the community drivers, Go is probably the most mature (but other community drivers, like the .NET one, are really nice).



