June 2024 Archives

In a Slack I'm on, someone asked a series of questions that boil down to:

Our company has a Reliability team, but another team is ignoring SLA/SLO obligations. What can SRE do to fix this?

I got most of the way through a multi-paragraph answer before noticing my answer was, "This isn't SRE's job, it's management's job." I figured a blog post might help explain this stance better.

The genius behind the Site Reliability Engineer concept at Google is that they figured out how to make service uptime and reliability matter to business management. The mathematical framework behind SRE is all about quantifying risk and quantifying impact, and that allows quantifying lost revenue, possibly even lost sales opportunity. All this quantifying falls squarely into the management mindset of "you can't manage what you can't measure," crossed with the subtext of "if I can't measure it, it's an outside dependency I can ignore." SRE is all about making uptime and reliability a business problem worth spending management cycles on.
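To make the "quantifying" part concrete, here is a minimal sketch of the usual error-budget arithmetic. Every number in it (the 99.9% SLO, the 30-day window, the revenue per minute) is a made-up assumption for illustration, and the function names are mine, not anything from Google's tooling.

```python
# Hypothetical numbers throughout; none of these describe a real service.
MINUTES_PER_30_DAYS = 30 * 24 * 60  # 43,200 minutes in a 30-day window


def error_budget_minutes(slo: float, window_minutes: int = MINUTES_PER_30_DAYS) -> float:
    """Minutes of allowed unavailability in the window for an SLO such as 0.999."""
    return (1.0 - slo) * window_minutes


def budget_remaining(slo: float, downtime_minutes: float,
                     window_minutes: int = MINUTES_PER_30_DAYS) -> float:
    """Budget left after the downtime already spent; negative means the SLO was missed."""
    return error_budget_minutes(slo, window_minutes) - downtime_minutes


def revenue_at_risk(downtime_minutes: float, revenue_per_minute: float) -> float:
    """Crude lost-revenue estimate: downtime multiplied by average revenue per minute."""
    return downtime_minutes * revenue_per_minute


if __name__ == "__main__":
    slo = 0.999              # assumed SLO
    downtime = 50.0          # assumed minutes of downtime this window
    per_minute = 200.0       # assumed revenue per minute
    print(f"budget:    {error_budget_minutes(slo):.1f} min")
    print(f"remaining: {budget_remaining(slo, downtime):.1f} min")
    print(f"at risk:   ${revenue_at_risk(downtime, per_minute):,.0f}")
```

Once downtime is expressed as a number of dollars against a budget, it becomes something a manager can trade off against other priorities, which is the whole point.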

In the questioner's case we already have some signal that their management has integrated SRE concepts into management practice:

  • They have a Reliability team, which only happens if someone in management believes reliability is important enough to devote dedicated headcount and a manager to.
  • They have Service Level Agreement and Service Level Objective concepts in place.
  • Those SLA/SLO obligations apply to more teams than the Reliability team itself, indicating there is at least some management push to distribute reliability thinking outside of the dedicated Reliability team.

The core problem the questioner is running into is that this non-compliant team is getting away with ignoring SLA/SLO stuff, and the answer to "what can SRE do to fix this" is to be found in why and how that team is getting away with it. Management is all about making trade-off decisions against competing priorities, so clearly something else is becoming a higher priority than compliance with SLA/SLO practices. What are these more important priorities, and are they in alignment with upper management's priorities?

As soon as you start asking questions along the lines of "what can a mere individual contributor do to make another manager pay attention to their own manager," you have identified a pathological power imbalance. The one tool you have is "complain to the higher level manager to make them aware of the non-compliance," and hope that higher level manager will do the needful things. If that higher level manager does not do the needful things, the individual contributor is kind of out of luck.

Under their own authority, that is. In the case of the questioner, there is a Reliability team with a manager. This means there is someone in the management chain who officially cares about this stuff, and can raise concerns higher up the org-chart. Non-compliance with policy is supposed to be a management problem, and should have management solutions. The fact the policy in question was put in place due to SRE thinking is relevant, but not the driving concern here.

The above works for organizations that are hierarchical, which implies deeper management chains. Count the number of managers between the VP of Engineering and the average engineer. If that number is between 1.0 and 2.5, you probably have a short enough org-chart to talk to the team in question directly for education (bridging the org-chart, to use Dr. Westrum's term). If the number is greater than 2.5, you're better served going through the org-chart to solve this particular problem.
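The manager-count rule of thumb above can be sketched in a few lines. The org chart, the names in it, and the helper functions are all hypothetical; only the 2.5 threshold comes from the paragraph above, and it is a heuristic, not a law.

```python
from statistics import mean

# Hypothetical org chart: each person maps to their manager.
# The VP of Engineering sits at the top and maps to None.
REPORTS_TO = {
    "vp_eng": None,
    "director_a": "vp_eng",
    "manager_a": "director_a",
    "manager_b": "vp_eng",
    "eng_1": "manager_a",
    "eng_2": "manager_a",
    "eng_3": "manager_b",
}


def managers_between(person, reports_to, top="vp_eng"):
    """Count the managers strictly between `person` and `top`."""
    count = 0
    boss = reports_to[person]
    while boss is not None and boss != top:
        count += 1
        boss = reports_to[boss]
    return count


def average_depth(engineers, reports_to, top="vp_eng"):
    """Average number of managers between each engineer and `top`."""
    return mean(managers_between(e, reports_to, top) for e in engineers)


def suggested_route(avg_depth, threshold=2.5):
    """Apply the rule of thumb: short org-charts can bridge directly."""
    if avg_depth <= threshold:
        return "talk to the team directly"
    return "go through the org chart"


if __name__ == "__main__":
    depth = average_depth(["eng_1", "eng_2", "eng_3"], REPORTS_TO)
    print(f"average managers to VP: {depth:.2f} -> {suggested_route(depth)}")
```

In this made-up chart two engineers have two managers above them and one has one, so the average lands under the threshold and direct conversation is the suggested route.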

But if you're in a short org-chart company, and that other team is still refusing to comply with SLA/SLO policies, you're kind of stuck complaining to the VP of Engineering and hoping that individual forces alignment through some method. If the VPofE doesn't, that is a clear signal that Reliability is not as important to management as you thought, and you should go back to the fundamentals of making the case for prioritizing SRE practices generally.

...will never happen more than once at a company.

I say this knowing that chunks of Germany's civil infrastructure managed to standardize on SuSE desktops, and some may still be using SuSE. Some might view this as proof it can be done; I say that Linux desktops not spreading beyond this example is proof of why it didn't happen elsewhere. The biggest reason we have the German example is that the decision was top down. Government decision making is different from corporate decision making, which is why we're not going to see the same thing happen, a Linux desktop (actually laptop) mandate from on high, more than a few times, especially in the tech industry.

It all comes down to management and why Linux laptop users are using Linux in the first place.

You see, corporate laptops (hereafter referred to as "endpoints" to match management lingo) have certain constraints placed upon them when small companies become big companies:

  • You need some form of anti-virus and anti-malware scanning, by policy
  • You need something like either a VPN or other Zero Trust ability to do "device attestation", proving the device (endpoint) is authentic and not a hacker using stolen credentials from a person
  • You need to comply with the vulnerability management process, which means some ability to scan software versions on an endpoint and report up to a dashboard.
  • The previous three points strongly imply an ability to push software to endpoints

Windows has been able to do all four points since the 1990s. Apple came somewhat later, but this is what JAMF is for.

Then there is Linux. It is technically possible to do all of the above. Some tools, like osquery, were built for Linux first because the intended use was on servers. However, there is a big problem with Linux users. Get 10 Linux users in a room, and you're quite likely to get 10 different combinations of display server (Xorg or Wayland), window manager or desktop environment (GNOME, KDE, i3, others), and OS package manager. You need to either support heterogeneity or commit to building the Enterprise Linux that has one from each category and forbids the others. Enterprise Linux is what the German example did.

Which is when the Linux users revolt, because banning their tiling window manager in favor of Xorg/GNOME ruins their flow -- and similar complaints. The Windows and Apple users forced onto Linux will grumble about their flow changing and why all their favorite apps can't be used, but at least it'll be uniform. If you support all three platforms, you'll get the same 5% Linux users, but they'll be the self-selected cranky ones who can't use the Linux they actually want. Most of that 5% will "settle" for another Linux before using Windows or Apple, but it's not the same.

And 5% Linux users puts the platform's user base below the concentration needed to support it well. Companies like Alphabet are big enough that 5% still makes a supportable population. For smaller companies like Atlassian, perhaps not. Which puts Enterprise Linux in that twilight state between outright banned and just barely supported so long as you can tolerate all the jank.