Slack's house literally did burn down for 24 hours because of DNSSEC, back in 2021.
When you frame the risk as "marginal benefit against one specific threat" vs. "removes us from the internet for 24 hours", the big players pass and move on. This is the sort of event the phrase "sev 1" gets applied to.
Some fun companies have a regulatory requirement to provide service at a minimum SLA; otherwise their license to operate is withdrawn. Those guys run the other way screaming when they hear things like "DNSSEC" (ask me how I know).
What percentage of the Fortune 500 is served over DNSSEC?
Oh. I thought it burned down because of their engineers not having fully acquainted themselves with the tool before applying it. It's misguided to hold DNSSEC culpable for Slack's engineers' goof-up. Like advising people against ever going near scissors because they might run with one in their hands.
That is an extremely uncharitable take on an outage that involved two of the most poorly defined and difficult-to-correctly-implement DNS features (wildcards and NSEC records), which the three major DNS operators mentioned in the Slack post-mortem (AWS, Cloudflare, and Google) all implemented differently.
IIRC, Slack backed out of their DNSSEC deployment because of a bug with wildcards and NSEC records (in Route53, not Slack), but the problem Slack subsequently experienced was not caused by that bug, but was instead caused by the boneheaded way in which Slack tried to back out of DNSSEC. I.e. Slack’s problem was entirely their own doing, and completely avoidable if they had had any idea of what they were doing.
Having read the post mortem, I disagree. Slack engineers did something dumb, under pressure during an outage. Even if they hadn't, they still would have been in a degraded state until they could properly remove their DNSSEC records and/or get the Route53 bug they hit fixed. In other words, they still would have had a 24+ hour outage, albeit with a smaller blast radius.
The design of DNSSEC is simply not fit for purpose for zone operators. It is far too easy to screw your zone up for far too marginal a benefit, to say nothing of the huge increase in CPU resource required to authenticate DNSSEC record chains.
The story for implementers is just as bad - the specifications are baroque, filled with lousy crypto and poorly thought-out options.
To give just one example, consider the NSEC3 iterations field. NSEC3 itself was designed mostly[0] to prevent zone enumeration via negative responses (which is trivial to perform against NSEC). The iteration count was meant to give zone operators a way to increase the cost to an attacker of generating a dictionary of NSEC3 names[1]. Of course, a high iteration count also raises the cost to a non-malicious resolver of validating a negative response for an NSEC3-enabled zone.
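For the curious, the iterated hash is easy to sketch; each extra iteration is one more full SHA-1 round over the previous digest plus the salt, so validator cost grows linearly with the count. A toy illustration of the RFC 5155 algorithm (not production code):

```python
import base64
import hashlib

def nsec3_hash(owner_name: str, salt_hex: str, iterations: int) -> str:
    """Toy RFC 5155 NSEC3 hash: SHA-1, re-applied `iterations` extra
    times with the salt appended each round, output in base32hex."""
    # Canonical (lowercase) DNS wire format of the owner name.
    wire = b""
    for label in owner_name.rstrip(".").lower().split("."):
        wire += bytes([len(label)]) + label.encode("ascii")
    wire += b"\x00"  # root label

    salt = bytes.fromhex(salt_hex)
    digest = hashlib.sha1(wire + salt).digest()
    for _ in range(iterations):  # each iteration is one more full hash round
        digest = hashlib.sha1(digest + salt).digest()

    # NSEC3 owner names use the base32hex alphabet (RFC 4648), which the
    # stdlib's b32encode doesn't, so translate.
    std, hexa = "ABCDEFGHIJKLMNOPQRSTUVWXYZ234567", "0123456789ABCDEFGHIJKLMNOPQRSTUV"
    return base64.b32encode(digest).decode().translate(str.maketrans(std, hexa)).lower()

# RFC 5155 Appendix A example zone: salt aabbccdd, 12 iterations.
print(nsec3_hash("example", "aabbccdd", 12))  # 0p9mhaveqvm6t7vbl5lop2u3t2rp3tom
```

That last line reproduces the test vector from RFC 5155 Appendix A, so you can sanity-check the sketch against the spec.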
In good old DNSSEC fashion, the iterations field is a single number that is subject to a... wide variety of potential limits:
* 16 bits by the wire protocol
* 150 for 1,024 bit keys (RFC 5155 10.3[2])
* 500 for 2,048 bit keys (RFC 5155 10.3[2])
* 2,500 for 4,096 bit keys (RFC 5155 10.3[2])
* 0 (RFC 9276 3.2)
Why 0? It was noted -- after publishing the NSEC3 spec -- that high iterations just don't provide that much benefit, and come with a high cost to throughput. Appendix B of RFC 9276 shows a roughly 50% performance degradation with an iteration count of 100. So, RFC 9276 3.2 says:
Validating resolvers MAY also return a SERVFAIL response when processing NSEC3 records with iterations larger than 0.
Of course, their guidance to implementers is to set the limits a bit higher, returning insecure responses at 100 iterations and SERVFAIL at 500. That said, if you want to be maximally interoperable as a zone operator, you should pretend the iteration count field doesn't exist: it is standards-compliant for a validating resolver to refuse an NSEC3 response with more than a single hash round.
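In code, the resulting validator policy looks something like this (a sketch of the RFC 9276 section 3.2 guidance; the 100/500 thresholds are the RFC's suggested values, and real resolvers are free to pick lower ones, including zero):

```python
def nsec3_response_policy(iterations: int) -> str:
    """Sketch of RFC 9276 sec. 3.2 guidance for a validating resolver
    handling an NSEC3 record with the given iteration count."""
    if iterations > 500:
        return "servfail"   # treat the zone as broken
    if iterations > 100:
        return "insecure"   # accept the response, but as if unsigned
    # Per the MAY quoted above, a compliant resolver could instead
    # SERVFAIL anything above 0 -- hence "pretend the field doesn't exist".
    return "secure"

print(nsec3_response_policy(0))     # secure
print(nsec3_response_policy(150))   # insecure
print(nsec3_response_policy(2500))  # servfail
```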
As I said, this is one example, but I'm not cherry picking here. The whole of the DNSSEC spec corpus is filled with incomprehensible verbiage and opportunities for conflicting interpretations, far beyond what you see in most protocol specs.
0 - also to reduce the size of signed top-level zones
1 - all NSEC and NSEC3 records, while responsive to queries about names that don't exist, are built from names that do exist (in the clear for NSEC, hashed for NSEC3).
2 - According to the letter of the standard, the limits applied to the iterations field should be 149, 499, and 2,499. Implementations are inconsistent about this.
IIUC, if Slack had done the correct thing, only wildcard DNS records (if any) would have been affected. They would certainly not have had a complete DNS blackout. I would classify that as significant.
> The story for implementers is just as bad - the specifications are baroque, filled with lousy crypto and poorly thought-out options.
I don’t care. So is almost every other standard, but until something better comes along, DNSSEC is what we have. Arguing that a working and implemented solution should not be used since it is worse than a non-existing theoretical perfect solution is both:
1. True
2. Completely and utterly useless, except as a way to waste everyone’s time and drain their energy.
> IIUC, if Slack had done the correct thing, only wildcard DNS records (if any) would have been affected
There's the problem - you DON'T understand. Straight from the post-mortem that you clearly have not read:
One microsecond later, app.slack.com fails to resolve with a ‘ERR_NAME_NOT_RESOLVED’ error:
[screenshot of error]
This indicated there was likely a problem with the ‘*.slack.com’ wildcard record since we didn’t have a wildcard record in any of the other domains where we had rolled out DNSSEC on. Yes, it was an oversight that we did not test a domain with a wildcard record before attempting slack.com — learn from our mistakes!
> I don’t care. So is almost every other standard,
Cool story. I do care. I'd like to see greater protection of the DNS infrastructure. DNSSEC adoption is hovering around 4%. TLS for HTTP is around 90%. At least part of that discrepancy is due to how broken DNSSEC is.
They could have done a quick fix by adding an explicit app.slack.com record. But instead they removed the DNSSEC signing from the whole domain, thereby invalidating all records, not just the wildcard ones.
> I do care.
I will care once something else comes around with any promise of being implemented and rolled out. Until then, I see no need to discourage the adoption of DNSSEC, or disparage its design, except when designing its newer version or replacement.
> I'd like to see greater protection of the DNS infrastructure. DNSSEC adoption is hovering around 4%.
I work at a registrar and DNS hosting provider for more than 10,000 domains. More than 70% of them have DNSSEC.
> They could have done a quick fix by adding an explicit app.slack.com record. But instead they removed the DNSSEC signing from the whole domain, thereby invalidating all records, not just the wildcard ones.
1) That would do nothing to fix resolvers that had already cached NSEC responses lacking type maps.
2) That presumes the wildcard record was superfluous and could have been replaced with a simple A record for a single or small number of records. Would love to see a citation supporting that.
3) That presumes the Slack team could have quickly identified that the problem with app.slack.com (and whatever other hosts resolved from that wildcard) was caused by the record being configured as a wildcard, and would have been solved by eliminating the wildcard record. If you read the postmortem, it is clear they zeroed in on the wildcard record as suspect, but had to work with AWS to figure out the exact cause. I doubt that was an instantaneous process.
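On point 1, the mechanics are worth spelling out: the negative answer a resolver caches is the NSEC proof itself, so publishing a new explicit record fixes nothing for clients already holding the denial until its TTL runs out. A toy model (names and TTLs invented for illustration):

```python
class ToyNegativeCache:
    """Minimal sketch of negative caching (RFC 2308-style): a cached
    denial keeps being served until its TTL expires, regardless of
    what the zone publishes in the meantime."""

    def __init__(self):
        self._denials = {}  # name -> absolute expiry time

    def cache_denial(self, name, ttl, now):
        self._denials[name] = now + ttl

    def is_denied(self, name, now):
        expiry = self._denials.get(name)
        return expiry is not None and now < expiry

cache = ToyNegativeCache()
cache.cache_denial("app.slack.com", ttl=300, now=0)

# The zone adds an explicit app.slack.com record at t=10; resolvers
# holding the cached denial keep returning the negative answer...
print(cache.is_denied("app.slack.com", now=10))   # True
# ...until the negative TTL has run out.
print(cache.is_denied("app.slack.com", now=301))  # False
```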
Any way you slice it, there was no quick way to fully recover from this bug once they hit it, and my argument is that the design of DNSSEC makes these issues a) likely to happen and b) difficult to model ahead of time, while providing fairly marginal security benefit.
At this point, I really don't care if you agree or disagree.
> I will care once something else comes around with any promise of being implemented and rolled out.
Yeah. DNSSEC is going to be widely deployed any day now. The year after the year of Linux on the desktop.
> I work at a registrar and DNS hosting provider for more than 10,000 domains. More than 70% of them have DNSSEC.
Cool. There are, what, 750 million domains registered worldwide? We are at nowhere near 10% adoption worldwide, let alone 70%. Of the top 100 domains -- the operators you would assume would be the most concerned about DNS response poisoning -- *six* have turned DNSSEC on.
> 1) That would do nothing to fix resolvers that had already cached NSEC responses lacking type maps.
The TTL for NSEC records is presumably way lower than the TTL for the DS records.
> 2) That presumes the wildcard record was superfluous and could have been replaced with a simple A record for a single or small number of records. Would love to see a citation supporting that.
It’s theoretically possible that it would not have worked for all cases, but that is, in my experience, very unlikely.
> Any way you slice it, there was no quick way to fully recover from this bug once they hit it
The bug seems to me to have been reasonably easy to mitigate; their real problem was that Slack did not know what they were doing. The bug itself was minor, but Slack tried to fix it by ceasing to serve DNSSEC-signed DNS data while long-TTL DS records were still unexpired out in the world. This is the worst possible thing you can do.
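The failure mode here reduces to a small truth table: a DS record in the parent is a promise that the child zone is signed, so a resolver that still holds that DS treats unsigned answers as bogus and SERVFAILs. A simplified sketch (real validation walks the whole chain of trust, of course):

```python
def validation_outcome(resolver_holds_ds: bool, zone_serves_rrsigs: bool) -> str:
    """Simplified DNSSEC validation result for a single zone."""
    if resolver_holds_ds and not zone_serves_rrsigs:
        return "bogus"     # broken promise -> SERVFAIL (the Slack scenario)
    if resolver_holds_ds:
        return "secure"
    return "insecure"      # no DS: unsigned answers are acceptable

# Pulling signatures while long-TTL DS records are still cached:
print(validation_outcome(resolver_holds_ds=True, zone_serves_rrsigs=False))   # bogus
# The safe back-out order is the reverse: drop the DS, wait out its TTL,
# and only then stop signing:
print(validation_outcome(resolver_holds_ds=False, zone_serves_rrsigs=False))  # insecure
```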
> Of the top 100 domains -- the operators you would assume would be the most concerned about DNS response poisoning -- *six* have turned DNSSEC on.
1. That number used to be zero, as tptacek liked to point out.
2. The huge operators often have fundamentally different security priorities than regular companies and users.
3. People said the same about IPv6 and SSL, which were also very slow to adopt. But they are all climbing.
> The TTL for NSEC records are presumably way lower than the TTL for the DS records.
Possibly. It was still an outage they had to wait out the TTL for, due to the design of DNSSEC.
> It’s theoretically possible that it would not have worked for all cases, but that is, in my experience, very unlikely.
This is completely unsubstantiated speculation on your part.
> The bug seems to me to have been reasonably easy to mitigate, but their problem was that Slack did not know what they were doing.
It is indeed reasonably easy to Monday-morning quarterback someone else's outage and blame operators for the sharp edges around poorly designed protocols.
> 1. That number used to be zero, as tptacek liked to point out.
Cool. So, at this rate, in another 100 years or so we should be at 50% adoption.
> 2. The huge operators often have fundamentally different security priorities than regular companies and users.
Priorities like uptime?
> 3. People said the same about IPv6 and SSL, which were also very slow to adopt. But they are all climbing
1) People started rolling out IPv6 once v4 addresses got scarce; there is no such compelling event to drive DNSSEC adoption. 2) SSL is easy to roll out and provides compelling security benefits. It is also exceedingly unlikely in practice to blow up in your face and cause run-out-the-clock outages -- unlike DNSSEC.
> This is completely unsubstantiated speculation on your part.
Do you have any support for your assumption that the wildcard record was vital and practically impossible to replace with regular records?
> It is indeed reasonably easy to Monday-morning quarterback someone else's outage and blame operators for the sharp edges around poorly designed protocols.
When Slack, a large company presenting themselves as proficient in tech, makes a tech mistake so bad that they lock themselves out of the internet for an entire day -- a mistake even I know not to make -- then I get to criticize them.
You just gave a link with a graph that shows a recent sharp drop in DNSSEC adoption as if it was a mic drop. The page I showed you barely even has text; it doesn't need any, the implication is obvious.
It used to show "a recent sharp drop", back when you originally gave the link. It quite soon started to climb again, and the climb has continued, as is now clearly visible. This was pointed out to you, but you acted, and are still acting, as if nothing has happened since that time you first looked at it.
Internally at Slack, the general consensus was that DNSSEC was a giant waste of time and money from a security perspective. We did it for compliance, to sell into the federal government and federal contractors.