I don't think you understand the sheer scale you need before you experience a failure more often than once a month. In my anecdotal experience you'd need at least 1k servers for that to happen... and if your company is big enough to spend $2MM in capex on servers alone, it can handle $100 for remote hands and 30 minutes of engineer time.
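To put rough numbers on that, here's a back-of-the-envelope sketch. It only uses the figures above (1k servers, one failure a month, $2MM capex). The function names are made up for illustration, and it assumes failures are independent and spread evenly over the year, which real fleets won't match exactly.

```python
# Back-of-the-envelope fleet math. All function names are hypothetical;
# the model assumes independent failures spread evenly over the year.

def implied_annual_failure_rate(servers: int, failures_per_month: float) -> float:
    """Per-server annual failure rate implied by a fleet-wide monthly failure count."""
    return failures_per_month * 12 / servers


def expected_failures_per_month(servers: int, annual_failure_rate: float) -> float:
    """Expected fleet-wide failures per month for a given per-server annual rate."""
    return servers * annual_failure_rate / 12


def capex_per_server(total_capex: float, servers: int) -> float:
    """Average purchase cost per server."""
    return total_capex / servers


if __name__ == "__main__":
    # One failure a month across 1k servers
    print(f"implied annual failure rate: {implied_annual_failure_rate(1000, 1):.1%}")
    # Same per-server rate on a 100-server fleet
    rate = implied_annual_failure_rate(1000, 1)
    print(f"failures/month at 100 servers: {expected_failures_per_month(100, rate):.2f}")
    # $2MM spread over 1k servers
    print(f"capex per server: ${capex_per_server(2_000_000, 1000):,.0f}")
```

At the same per-server rate, a fleet a tenth of that size would see a failure roughly once every ten months, which is the point: monthly failures are a big-fleet problem.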

Not to mention that at that scale you have plenty of redundancy and, if your ops team knows what they're doing, automagic failover / HA. Any single failure can easily "wait till Monday"; no need for 24/7 anything.