My reading of the article's introduction is that Redis is adding this feature and are (among other things I'm sure) paying jepsen to test it. So this is them having tests.
> If you follow basic software engineering principles, you'll find distributed systems easier to approach.
When I implemented Paxos I had tests and when they failed they spit out an exact trace of what happened in what order and on what node. Sometimes it was still excruciating to figure out what happened. Here's[1] a comment which you can think of as a bug tombstone. It took me half a day to figure out after I had a trace to analyze the issue.
Sure, but now imagine you have no confidence that any part of your paxos implementation works at all, nevermind the paxos part. That's my impression of issue #13 from the article: not only did the software not pass the test, it's clear that nobody ever even tried to use it, at all!
Full-scale blackbox testing of a database system is similar to dogfooding. You only use it when you have high confidence that you have exhausted the possibilities of unit and integration tests. It's clear this project did not start with exhaustive unit tests.
It reminds me a bit of FoundationDB, which is also a terrible program nobody should entrust with data they ever want to see again. The first time I tried to use it it ran out of memory and crashed in about ten seconds. I found the problem, which was that their huge-page-aware allocator, which has no tests, had never actually been used by anybody on a machine with huge pages. It was a core library of a released database which had never been executed by anyone. This Redis thing is the same: nobody had ever said "RAFT SET foo bar", if they had done they would have seen the problem right away.
I'm hesitant to draw too strong a conclusion here, and I can't speak for the Redis Labs team, but I do suspect that this is somewhere where... having an outside tester, like Jepsen (or a suitably adversarial QA team) can help detect missing-stairs sorts of problems. Coming from the perspective of a prospective operator (and having some experience with testing distributed systems), I immediately said "of course I want proxy mode by default", when this wasn't how the Redis-Raft designers necessarily intended things to be used--they intended smart clients to make it so that users wouldn't actually need proxy mode, so they hadn't focused on testing it that way.
Fair enough. I think I misinterpreted the "easier to approach" part of your original answer. Sorry if my answer came across as defensive. My wounds are still fresh. ;)
> If you follow basic software engineering principles, you'll find distributed systems easier to approach.
When I implemented Paxos I had tests and when they failed they spit out an exact trace of what happened in what order and on what node. Sometimes it was still excruciating to figure out what happened. Here's[1] a comment which you can think of as a bug tombstone. It took me half a day to figure out after I had a trace to analyze the issue.
[1]: https://github.com/benschulz/paxakos/blob/ee051ff67b5da6f287...