My reading of the article's introduction is that Redis is adding this feature an...

jeffbee · on June 23, 2020

Sure, but now imagine you have no confidence that any part of your Paxos implementation works at all, never mind the Paxos part. That's my impression of issue #13 from the article: not only did the software not pass the test, it's clear that nobody ever even tried to use it, at all!

Full-scale blackbox testing of a database system is similar to dogfooding. You only use it when you have high confidence that you have exhausted the possibilities of unit and integration tests. It's clear this project did not start with exhaustive unit tests.

It reminds me a bit of FoundationDB, which is also a terrible program nobody should entrust with data they ever want to see again. The first time I tried to use it, it ran out of memory and crashed in about ten seconds. I found the problem, which was that their huge-page-aware allocator, which has no tests, had never actually been used by anybody on a machine with huge pages. It was a core library of a released database which had never been executed by anyone. This Redis thing is the same: nobody had ever said "RAFT SET foo bar"; if they had, they would have seen the problem right away.

aphyr · on June 23, 2020

> It's clear this project did not start with exhaustive unit tests.

I can't speak to "exhaustive", but Redis-Raft did have an extant unit and integration test suite prior to our collaboration. Here's what they looked like: https://github.com/RedisLabs/redisraft/tree/ff9fb28c74db880c...

I'm hesitant to draw too strong a conclusion here, and I can't speak for the Redis Labs team, but I do suspect that this is somewhere where... having an outside tester, like Jepsen (or a suitably adversarial QA team) can help detect missing-stairs sorts of problems. Coming from the perspective of a prospective operator (and having some experience with testing distributed systems), I immediately said "of course I want proxy mode by default", when this wasn't how the Redis-Raft designers necessarily intended things to be used--they intended smart clients to make it so that users wouldn't actually need proxy mode, so they hadn't focused on testing it that way.

AtlasBarfed · on June 24, 2020

How would unit tests truly test anything for the meat of a distributed protocol?

To me, that would be the ur-example of "proving it is correct, but that doesn't mean there aren't bugs in it."
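
For what it's worth, here is a rough sketch of the kind of thing a unit test can pin down in this area: a single node's local decision rule, written as a pure function. It uses the vote-granting rule from the Raft paper as the example. The names (`VoterState`, `should_grant_vote`) are made up for illustration and are not taken from Redis-Raft or any other implementation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class VoterState:
    """Hypothetical snapshot of the fields a node consults when asked for a vote."""
    current_term: int
    voted_for: Optional[str]
    last_log_term: int
    last_log_index: int


def should_grant_vote(state: VoterState, candidate_id: str, candidate_term: int,
                      cand_last_log_term: int, cand_last_log_index: int) -> bool:
    """Local vote decision, following the rule described in the Raft paper."""
    # Reject candidates from a stale term.
    if candidate_term < state.current_term:
        return False
    # Within the same term, a node may vote for at most one candidate.
    # A newer term implicitly clears the earlier vote.
    if candidate_term == state.current_term and state.voted_for not in (None, candidate_id):
        return False
    # The candidate's log must be at least as up to date as ours:
    # compare the last entry's term first, then its index.
    return (cand_last_log_term, cand_last_log_index) >= (
        state.last_log_term, state.last_log_index)


# Unit-test style checks of the local rule.
fresh = VoterState(current_term=3, voted_for=None, last_log_term=2, last_log_index=5)
assert not should_grant_vote(fresh, "b", 2, 2, 5)   # stale term
assert should_grant_vote(fresh, "b", 3, 2, 5)       # equally up-to-date log
assert not should_grant_vote(fresh, "b", 3, 2, 4)   # candidate log is shorter
assert not should_grant_vote(fresh, "b", 3, 1, 9)   # candidate's last term is older

voted = VoterState(current_term=3, voted_for="c", last_log_term=2, last_log_index=5)
assert not should_grant_vote(voted, "b", 3, 2, 5)   # already voted in this term
assert should_grant_vote(voted, "b", 4, 2, 5)       # newer term, vote is free again
```

Checks like these only cover one node's decision in isolation. They say nothing about message reordering, partitions, or crashes between steps, which is the part that fault-injection testing is aimed at.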

benschulz · on June 23, 2020

Fair enough. I think I misinterpreted the "easier to approach" part of your original answer. Sorry if my answer came across as defensive. My wounds are still fresh. ;)