> if I run a workload that's 50% AES-NI and 50% something else, then it takes exactly twice as long to fault as if the workload was 100% AES-NI.

If you mean it takes twice as long on average, that's consistent with marginal hardware: the fault is more likely to occur during AES instructions, so the more of them you execute, the higher the probability of hitting it.

If you mean it always takes exactly twice as long, that sounds more like a software issue, where there is some counter and the behavior changes when it rolls over. In theory this could be in microcode, firmware or a library rather than your code, but then it's likely that someone else would have noticed by now.
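To make the distinction concrete, here is a rough sketch of the two models. Everything in it is an assumption for illustration: the per-op fault probability, the rollover threshold, and the function names are made up, and the counter model assumes AES ops are evenly interleaved with the other work. It is not a model of your actual hardware.

```python
import math
import random
import statistics


def ops_until_marginal_fault(aes_fraction, p_fault_per_aes, rng):
    """Marginal-hardware model: each AES op faults independently with a small probability."""
    ops = 0
    while True:
        ops += 1
        is_aes = rng.random() < aes_fraction  # this op happens to be an AES instruction
        if is_aes and rng.random() < p_fault_per_aes:
            return ops


def ops_until_counter_rollover(aes_fraction, threshold):
    """Counter model: the fault fires on the threshold-th AES op (AES ops evenly interleaved)."""
    return math.ceil(threshold / aes_fraction)


def summarize(aes_fraction, trials=5000, p_fault_per_aes=0.01, threshold=1000, seed=0):
    """Run both hypothetical models for one workload mix and report time-to-fault in ops."""
    rng = random.Random(seed)
    samples = [
        ops_until_marginal_fault(aes_fraction, p_fault_per_aes, rng)
        for _ in range(trials)
    ]
    return {
        "marginal_mean": statistics.mean(samples),
        "marginal_stdev": statistics.stdev(samples),
        "counter_ops": ops_until_counter_rollover(aes_fraction, threshold),
    }


if __name__ == "__main__":
    for fraction in (1.0, 0.5):
        print(fraction, summarize(fraction))
```

In the marginal model the mean roughly doubles when the AES share is halved, but the spread between individual runs is about as large as the mean itself. In the counter model every run takes the same number of ops. So if your individual runs land on "exactly twice" every time, with almost no variance, that points toward the counter explanation.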

> And it isn't provoked any more quickly, by having just provoked it and then running the same workload again — i.e. there's no temporal locality to it. Which would make both "environmental conditions" and "CPU is overheating / overvolting" much less likely as contributing factors.

That only rules out the case where the workload itself is inducing higher power consumption or higher temperatures. It doesn't rule out the voltage or temperature being constantly, if only marginally, out of spec, so that there is at all times some probability of random errors, to which certain types of instructions are more susceptible.

> Imagine the TLB havoc going on, as those forked-off heavy-workload query workers also all fight to memory-map the same huge set of backing table heap files.

So now there are two possibilities. One, your workload is really unusual and is triggering a rare hardware bug nobody else hits. But then nobody else is worried about or even knows about it, so it wouldn't be affecting the market. Two, it's on the heavy side but not so much that other people don't hit it too, and then the issue would be public.

It's not very plausible that the problem could be known to all cloud providers but not the general public.

> Yes, but that doesn't explain why they weren't able to ramp up production at any point in the last four years. Even now, there are still likely some smaller hosts that would like to buy EPYC 7xxxs at more-affordable prices, if AMD would make them.

They don't always lower the prices of the old models very much. They just keep making them for customers who want to keep buying the same part for the sake of uniformity. But anyone who didn't buy a lot of them before isn't going to buy a lot of them now instead of just buying the new ones.

The prices come down on the used market through supply and demand, as people buy new ones and sell the old ones, and on new stock that retailers already own and want off the shelves to make room for the new models. But that doesn't mean you could find new stock of the old model in volume at the lower price, which is why the people who want that contract for it ahead of time.

> But why would AMD agree to that, without anything the clouds could hold over their head to force them into it?

Because they're just as happy to make more of the newer models as the older ones, if the production capacity is available. Also, they could just be charging more for them. "Oh, you want 10,000 units for $2500 each instead of the contract which says you have to buy 10,000 units for $2000 each? Okay then."

> It would mean shutting down many of the 7xxx production lines early, translating to the CapEx for those production lines not getting paid off!

AMD doesn't have production lines, they use TSMC. TSMC, in turn, would just sell the capacity to someone else.

> And if the clouds are replacing capacity, then where are all those used CPUs going?

The original problem was that they couldn't make enough of them. Also, the cloud providers generally replace their oldest servers first, and Zen2/Zen3 isn't all that old. They'll be installing Zen4 and taking out Zen1 or five- to ten-year-old Xeons, which are all over the place on eBay.