Jeff Martins at New Stack had a beautiful take-down of the SaaS provider practice of not sharing internal status, and how that affects downstream reliability programs. Jeff is absolutely right: each SaaS provider (or IaaS provider) you put in your critical path lowers the absolute maximum availability your system can provide. It doesn't help that SaaS providers use manual processes to update their status pages. We would all provide better services to customers if we shared status with each other in a timely way.
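To put rough numbers on that availability ceiling, here is a minimal sketch. It assumes every dependency sits in series in the critical path and that failures are independent, which real outages often aren't. The function name and the availability figures are made up for illustration, not taken from any provider's SLA.

```python
from math import prod

def max_availability(dependency_availabilities):
    """Upper bound on our availability when every dependency is in the
    critical path and fails independently (a simplifying assumption)."""
    return prod(dependency_availabilities)

# Hypothetical figures: three providers at 99.99%, 99.9%, and 99.95%.
deps = [0.9999, 0.999, 0.9995]
ceiling = max_availability(deps)

# The ceiling is always at or below the weakest single dependency.
print(f"best case: {ceiling:.4%}, weakest dependency: {min(deps):.4%}")
```

Every provider added to the list can only pull the product down, which is the point Jeff is making.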
Spot on. I've felt this frustration too. I've been in the after-action review following an AWS incident when an engineering manager asked the question:
Can we set up alerting for when AWS updates their status page?
The idea was to improve our speed of response to AWS issues. We had to politely let that manager down by saying we'd know there was a problem before AWS updated their page, which is entirely true. Status pages shouldn't be used for real-time alerting. This lesson was hammered home further after we gained access to AWS Technical Account Managers and started getting the status page updates delivered 10 or so minutes early, directly from the TAMs themselves. Status pages are corporate communications, not technical ones.
That's where the problem is: status pages are corporate communication. In a system where admitting fault opens you up to expensive legal action, corporate communication programs will optimize for not admitting fault. For status pages, that means an outage notice only goes up after it is unavoidably obvious that an outage is already happening. Status pages are strictly reactive. A true real-time alerting system has to tolerate the occasional false positive; on a corporate communication platform where every notice is an admission of fault, false positives must be avoided at all costs.
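For contrast, here is a minimal sketch of the kind of signal that told us about provider trouble before any status page did: watching the failure rate of our own calls to the provider. The class name, window size, and threshold are all hypothetical, and a real setup would live in your monitoring stack rather than in application code. Note that it will happily fire on a blip, which is exactly the false-positive tolerance a status page can't afford.

```python
from collections import deque

class DependencyErrorRateAlert:
    """Sliding-window failure-rate check on calls to one upstream provider.

    Hypothetical illustration: window_size and threshold are arbitrary and
    would need tuning against real traffic.
    """

    def __init__(self, window_size=50, threshold=0.2):
        self.results = deque(maxlen=window_size)  # True = call succeeded
        self.threshold = threshold

    def record(self, success):
        """Record one call outcome; return True if the alert should fire."""
        self.results.append(success)
        if len(self.results) < self.results.maxlen:
            return False  # not enough data yet to judge
        failures = self.results.count(False)
        return failures / len(self.results) >= self.threshold


# Usage: ten healthy calls, then the provider starts failing.
alert = DependencyErrorRateAlert(window_size=10, threshold=0.25)
for outcome in [True] * 10 + [False] * 3:
    if alert.record(outcome):
        print("provider looks unhealthy from where we sit")
```

Nothing here depends on the provider admitting anything, which is why it beats their status page on speed.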
How do we fix this? How do we, a SaaS provider, enable our downstream reliability programs to get the signal they need to react to our state?
There aren't any good ways, only hard or dodgy ones.
The first and best way is to do away with US-style adversarial capitalism, which would reduce the risk of saying "we're currently fucking up." The money behind the tech industry won't let this happen, though.
The next best way is to provide an alert stream to customers, so long as they are willing to sign some form of "If we give this to you, then you agree to not sue us for what it tells you" amendment to the usage contract. Even that is risky, because some rights can't be waived that way.
What's left is what AWS did for us: have our Technical Account Managers or Customer Success Managers notify customers by hand that there is a problem. This takes the fault admission out of the public space and limits it to customers who are giving us enough money to have a TAM/CSM. This is the opposite of real time, but at least you get notices faster? It's still not equivalent to instrumenting your own infrastructure for alerts, and is mostly useful for writing the after-action report.
Yeah, the SaaSification of systems is introducing certain communitarian drives, but the VC-backed capitalistic system we're in prevents us from really doing this well.