On the technical side, this is a case for avoiding languages with exceptions and/or languages that allow silent error dropping. That unhandled exception is the rare event that is actually common.
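To make the "silent error dropping" point concrete, here is a minimal sketch (all names hypothetical, using a toy `FlakyDB` stand-in) contrasting a catch-all that swallows the rare failure with code that lets it propagate:

```python
class FlakyDB:
    """Toy stand-in for a datastore whose writes occasionally fail."""
    def __init__(self, fail=False):
        self.fail = fail
        self.rows = []

    def write(self, record):
        if self.fail:
            raise IOError("disk full")  # the rare event
        self.rows.append(record)

def save_swallowed(db, record):
    # Anti-pattern: the catch-all makes the rare failure invisible.
    try:
        db.write(record)
    except Exception:
        pass  # data is silently lost; caller never finds out
    return True

def save_explicit(db, record):
    # No catch-all: a failed write propagates, forcing the caller to decide.
    db.write(record)
    return True
```

With `save_swallowed`, the one-in-a-million write failure disappears without a trace; with `save_explicit`, the same failure is loud, which is exactly what you want when the rare event is actually common in aggregate.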
I will say though that I am curious what they mean by fatal server error. I once worked at a well known company with about 400 production servers that were all running near capacity. I cannot remember a single serious hardware problem that truly killed a server (we did fail over to a backup often, but that was usually for software or upstream infra reasons rather than a hardware failure). I understand the scale is lower than in the article, but a server failure every day with a fleet of 2,000 servers feels like a lot to me.
At any rate, assuming they were just spitballing on a number, the point stands that you need to design and plan for failure even if it is rare. You really don't want the one time the server fails to be the time the CEO is demoing your product to your highest value customer.
I routinely see hardware failures in a fleet about an order of magnitude larger than the article's. It's often enough that we have to plan for and recognize it, but not often enough that we have fully automated handling for every edge case.