Slack's house literally did burn down for 24 hours because of DNSSEC, back in 2021.
When you frame the risk as "marginal benefit against one specific threat" vs. "removes us from the internet for 24 hours", the big players pass and move on. This is the sort of event the phrase "sev 1" gets applied to.
Some fun companies have a regulatory requirement to provide service at a minimum SLA; otherwise their license to operate is withdrawn. Those guys run the other way screaming when they hear things like "DNSSEC" (ask me how I know).
What percentage of the Fortune 500 is served over DNSSEC?
Oh. I thought it burned down because of their engineers not having fully acquainted themselves with the tool before applying it. It's misguided to hold DNSSEC culpable for Slack's engineers' goof-up. Like advising people against ever going near scissors because they might run with one in their hands.
That is an extremely uncharitable take on an outage that involved two of the most poorly defined and difficult-to-correctly-implement DNS features (wildcards and NSEC records), which the three major DNS operators mentioned in the Slack post-mortem (AWS, Cloudflare, and Google) all implemented differently.
IIRC, Slack backed out of their DNSSEC deployment because of a bug with wildcards and NSEC records (in Route53, not Slack), but the problem Slack subsequently experienced was not caused by that bug, but was instead caused by the boneheaded way in which Slack tried to back out of DNSSEC. I.e. Slack’s problem was entirely their own doing, and completely avoidable if they had had any idea of what they were doing.
Having read the post mortem, I disagree. Slack engineers did something dumb, under pressure during an outage. Even if they hadn't, they still would have been in a degraded state until they could properly remove their DNSSEC records and/or get the Route53 bug they hit fixed. In other words, they still would have had a 24+ hour outage, albeit with a smaller blast radius.
The design of DNSSEC is simply not fit for purpose for zone operators. It is far too easy to screw your zone up for far too marginal a benefit, to say nothing of the huge increase in CPU resource required to authenticate DNSSEC record chains.
The story for implementers is just as bad - the specifications are baroque, filled with lousy crypto and poorly thought-out options.
To give just one example, consider the NSEC3 iterations field. NSEC3 itself was designed mostly[0] to prevent zone enumeration via negative responses (which is trivial to perform against NSEC). The iteration count was meant to give zone operators a way to increase the cost to an attacker of generating a dictionary of NSEC3 names[1]. Of course, a high iteration count also raises the cost to a non-malicious resolver of validating a negative response for an NSEC3-enabled zone.
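For the curious, the iterated hash is easy to sketch; each extra iteration is one more full SHA-1 round over the previous digest plus the salt, so validator cost grows linearly with the count. A toy illustration of the RFC 5155 algorithm (not production code):

```python
import base64
import hashlib

def nsec3_hash(owner_name: str, salt_hex: str, iterations: int) -> str:
    """Toy RFC 5155 NSEC3 hash: SHA-1, re-applied `iterations` extra
    times with the salt appended each round, output in base32hex."""
    # Canonical (lowercase) DNS wire format of the owner name.
    wire = b""
    for label in owner_name.rstrip(".").lower().split("."):
        wire += bytes([len(label)]) + label.encode("ascii")
    wire += b"\x00"  # root label

    salt = bytes.fromhex(salt_hex)
    digest = hashlib.sha1(wire + salt).digest()
    for _ in range(iterations):  # each iteration is one more full hash round
        digest = hashlib.sha1(digest + salt).digest()

    # NSEC3 owner names use the base32hex alphabet (RFC 4648), which the
    # stdlib's b32encode doesn't, so translate.
    std, hexa = "ABCDEFGHIJKLMNOPQRSTUVWXYZ234567", "0123456789ABCDEFGHIJKLMNOPQRSTUV"
    return base64.b32encode(digest).decode().translate(str.maketrans(std, hexa)).lower()

# RFC 5155 Appendix A example zone: salt aabbccdd, 12 iterations.
print(nsec3_hash("example", "aabbccdd", 12))  # 0p9mhaveqvm6t7vbl5lop2u3t2rp3tom
```

That last line reproduces the test vector from RFC 5155 Appendix A, so you can sanity-check the sketch against the spec.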
In good old DNSSEC fashion, the iterations field is a single number that is subject to a... wide variety of potential limits:
* 16 bits by the wire protocol
* 150 for 1,024 bit keys (RFC 5155 10.3[2])
* 500 for 2,048 bit keys (RFC 5155 10.3[2])
* 2,500 for 4,096 bit keys (RFC 5155 10.3[2])
* 0 (RFC 9276 3.2)
Why 0? It was noted -- after publishing the NSEC3 spec -- that high iterations just don't provide that much benefit, and come with a high cost to throughput. Appendix B of RFC 9276 shows a roughly 50% performance degradation with an iteration count of 100. So, RFC 9276 3.2 says:
Validating resolvers MAY also return a SERVFAIL response when processing NSEC3 records with iterations larger than 0.
Of course, their guidance to implementers is to set the limits a bit higher, returning insecure responses at 100 iterations and SERVFAIL at 500. That said, if you want to be maximally interoperable as a zone operator, you should pretend the iteration count field doesn't exist: it is standards-compliant for a validating resolver to refuse an NSEC3 response with more than a single hash round.
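In code, the resulting validator policy looks something like this (a sketch of the RFC 9276 section 3.2 guidance; the 100/500 thresholds are the RFC's suggested values, and real resolvers are free to pick lower ones, including zero):

```python
def nsec3_response_policy(iterations: int) -> str:
    """Sketch of RFC 9276 sec. 3.2 guidance for a validating resolver
    handling an NSEC3 record with the given iteration count."""
    if iterations > 500:
        return "servfail"   # treat the zone as broken
    if iterations > 100:
        return "insecure"   # accept the response, but as if unsigned
    # Per the MAY quoted above, a compliant resolver could instead
    # SERVFAIL anything above 0 -- hence "pretend the field doesn't exist".
    return "secure"

print(nsec3_response_policy(0))     # secure
print(nsec3_response_policy(150))   # insecure
print(nsec3_response_policy(2500))  # servfail
```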
As I said, this is one example, but I'm not cherry picking here. The whole of the DNSSEC spec corpus is filled with incomprehensible verbiage and opportunities for conflicting interpretations, far beyond what you see in most protocol specs.
0 - also to reduce the size of signed top-level zones
1 - all NSEC and NSEC3 records, while responsive to queries about names that don't exist, are built from names that do exist (in the clear for NSEC, hashed for NSEC3).
2 - According to the letter of the standard, the limits applied to the iterations field should be 149, 499, and 2,499. Implementations are inconsistent about this.
IIUC, if Slack had done the correct thing, only wildcard DNS records (if any) would have been affected. They would certainly not have had a complete DNS blackout. I would classify that as significant.
> The story for implementers is just as bad - the specifications are baroque, filled with lousy crypto and poorly thought-out options.
I don’t care. So is almost every other standard, but until something better comes along, DNSSEC is what we have. Arguing that a working and implemented solution should not be used since it is worse than a non-existing theoretical perfect solution is both:
1. True
2. Completely and utterly useless, except as a way to waste everyone’s time and drain their energy.
> IIUC, if Slack had done the correct thing, only wildcard DNS records (if any) would have been affected
There's the problem - you DON'T understand. Straight from the post-mortem that you clearly have not read:
One microsecond later, app.slack.com fails to resolve with a ‘ERR_NAME_NOT_RESOLVED’ error:
[screenshot of error]
This indicated there was likely a problem with the ‘*.slack.com’ wildcard record since we didn’t have a wildcard record in any of the other domains where we had rolled out DNSSEC on. Yes, it was an oversight that we did not test a domain with a wildcard record before attempting slack.com — learn from our mistakes!
> I don’t care. So is almost every other standard,
Cool story. I do care. I'd like to see greater protection of the DNS infrastructure. DNSSEC adoption is hovering around 4%. TLS for HTTP is around 90%. At least part of that discrepancy is due to how broken DNSSEC is.
They could have done a quick fix by adding an explicit app.slack.com record. But instead they removed the DNSSEC signing from the whole domain, thereby invalidating all records, not just the wildcard ones.
> I do care.
I will care once something else comes around with any promise of being implemented and rolled out. Until then, I see no need to discourage the adoption of DNSSEC, or disparage its design, except when designing its newer version or replacement.
> I'd like to see greater protection of the DNS infrastructure. DNSSEC adoption is hovering around 4%.
I work at a registrar and DNS hosting provider for more than 10,000 domains. More than 70% of them have DNSSEC.
> They could have done a quick fix by adding an explicit app.slack.com record. But instead they removed the DNSSEC signing from the whole domain, thereby invalidating all records, not just the wildcard ones.
1) That would do nothing to fix resolvers that had already cached NSEC responses lacking type maps.
2) That presumes the wildcard record was superfluous and could have been replaced with a simple A record for a single or small number of records. Would love to see a citation supporting that.
3) That presumes the Slack team could have quickly identified that the problem with app.slack.com (and whatever other hosts resolved from that wildcard) was caused by the record being configured as a wildcard, and would have been solved by eliminating the wildcard record. If you read the postmortem, it is clear they zeroed in on the wildcard record as suspect, but had to work with AWS to figure out the exact cause. I doubt that was an instantaneous process.
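On point 1, the mechanics are worth spelling out: the negative answer a resolver caches is the NSEC proof itself, so publishing a new explicit record fixes nothing for clients already holding the denial until its TTL runs out. A toy model (names and TTLs invented for illustration):

```python
class ToyNegativeCache:
    """Minimal sketch of negative caching (RFC 2308-style): a cached
    denial keeps being served until its TTL expires, regardless of
    what the zone publishes in the meantime."""

    def __init__(self):
        self._denials = {}  # name -> absolute expiry time

    def cache_denial(self, name, ttl, now):
        self._denials[name] = now + ttl

    def is_denied(self, name, now):
        expiry = self._denials.get(name)
        return expiry is not None and now < expiry

cache = ToyNegativeCache()
cache.cache_denial("app.slack.com", ttl=300, now=0)

# The zone adds an explicit app.slack.com record at t=10; resolvers
# holding the cached denial keep returning the negative answer...
print(cache.is_denied("app.slack.com", now=10))   # True
# ...until the negative TTL has run out.
print(cache.is_denied("app.slack.com", now=301))  # False
```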
Any way you slice it, there was no quick way to fully recover from this bug once they hit it, and my argument is that the design of DNSSEC makes these issues a) likely to happen and b) difficult to model ahead of time, while providing fairly marginal security benefit.
At this point, I really don't care if you agree or disagree.
> I will care once something else comes around with any promise of being implemented and rolled out.
Yeah. DNSSEC is going to be widely deployed any day now. The year after the year of Linux on the desktop.
> I work at a registrar and DNS hosting provider for more than 10,000 domains. More than 70% of them have DNSSEC.
Cool. There are, what, 750 million domains registered worldwide? We are at nowhere near 10% adoption worldwide, let alone 70%. Of the top 100 domains -- the operators you would assume would be the most concerned about DNS response poisoning -- *six* have turned DNSSEC on.
> 1) That would do nothing to fix resolvers that had already cached NSEC responses lacking type maps.
The TTL for NSEC records is presumably way lower than the TTL for the DS records.
> 2) That presumes the wildcard record was superfluous and could have been replaced with a simple A record for a single or small number of records. Would love to see a citation supporting that.
It’s theoretically possible that it would not have worked for all cases, but that is, in my experience, very unlikely.
> Any way you slice it, there was no quick way to fully recover from this bug once they hit it
The bug seems to me to have been reasonably easy to mitigate; their real problem was that Slack did not know what they were doing. The bug itself was minor, but Slack tried to fix it by ceasing to serve DNSSEC-signed DNS data while long-TTL DS records were still unexpired out in the world. This is the worst possible thing you can do.
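The failure mode here reduces to a small truth table: a DS record in the parent is a promise that the child zone is signed, so a resolver that still holds that DS treats unsigned answers as bogus and SERVFAILs. A simplified sketch (real validation walks the whole chain of trust, of course):

```python
def validation_outcome(resolver_holds_ds: bool, zone_serves_rrsigs: bool) -> str:
    """Simplified DNSSEC validation result for a single zone."""
    if resolver_holds_ds and not zone_serves_rrsigs:
        return "bogus"     # broken promise -> SERVFAIL (the Slack scenario)
    if resolver_holds_ds:
        return "secure"
    return "insecure"      # no DS: unsigned answers are acceptable

# Pulling signatures while long-TTL DS records are still cached:
print(validation_outcome(resolver_holds_ds=True, zone_serves_rrsigs=False))   # bogus
# The safe back-out order is the reverse: drop the DS, wait out its TTL,
# and only then stop signing:
print(validation_outcome(resolver_holds_ds=False, zone_serves_rrsigs=False))  # insecure
```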
> Of the top 100 domains -- the operators you would assume would be the most concerned about DNS response poisoning -- *six* have turned DNSSEC on.
1. That number used to be zero, as tptacek liked to point out.
2. The huge operators often have fundamentally different security priorities than regular companies and users.
3. People said the same about IPv6 and SSL, which were also very slow to adopt. But they are all climbing.
> The TTL for NSEC records are presumably way lower than the TTL for the DS records.
Possibly. It was still an outage they had to wait out the TTL for, due to the design of DNSSEC.
> It’s theoretically possible that it would not have worked for all cases, but that is, in my experience, very unlikely.
This is completely unsubstantiated speculation on your part.
> The bug seems to me to have been reasonably easy to mitigate, but their problem was that Slack did not know what they were doing.
It is indeed reasonably easy to Monday-morning quarterback someone else's outage and blame operators for the sharp edges around poorly designed protocols.
> 1. That number used to be zero, as tptacek liked to point out.
Cool. So, at this rate, in another 100 years or so we should be at 50% adoption.
> 2. The huge operators often have fundamentally different security priorities than regular companies and users.
Priorities like uptime?
> 3. People said the same about IPv6 and SSL, which were also very slow to adopt. But they are all climbing
1) People started rolling out IPv6 once v4 addresses got scarce; there is no such compelling event to drive DNSSEC adoption. 2) SSL is easy to roll out and provides compelling security benefits. It is also exceedingly unlikely in practice to blow up in your face and cause run-out-the-clock outages -- unlike DNSSEC.
> This is completely unsubstantiated speculation on your part.
Do you have any support for your assumption that the wildcard record was vital and practically impossible to replace with regular records?
> It is indeed reasonably easy to Monday-morning quarterback someone else's outage and blame operators for the sharp edges around poorly designed protocols.
When Slack, a large company presenting themselves as proficient in tech, makes a tech mistake so bad that they lock themselves out of the internet for an entire day -- a mistake even I know not to make -- then I get to criticize them.
You just gave a link with a graph that shows a recent sharp drop in DNSSEC adoption as if it was a mic drop. The page I showed you barely even has text; it doesn't need any, the implication is obvious.
It used to show "a recent sharp drop", back when you originally gave the link. It quite soon started to climb again, and the climb has continued, as is now clearly visible. This was pointed out to you, but you acted, and are still acting, as if nothing has happened since that time you first looked at it.
Internally at Slack, the general consensus was that DNSSEC was a giant waste of time and money from a security perspective. We did it for compliance, to sell into the federal government and federal contractors.