July 2021 Archives

24/7 availability and oncall

There is another meme going around OpsTwitter these past few days, a familiar refrain in discussions about on-call and quality of life. The essence is:

If you need 24/7 availability, you also need follow-the-sun support. That way any crisis lands in someone's regular daytime work day.

I agree, this is the standard you need to judge your solution against. However, this solution has some assumptions baked into it. Here are a few:

  • You have three teams, each 8 timezones from their neighbors (or two teams, 12 timezones apart).
  • No single set of employment laws spans 24 timezones, so these teams will each be under different labor and national-holiday laws.
  • Each timezone needs an on-call rotation.
  • The minimum viable on-call rotation per timezone is 3 people, but 6 is far more friendly to the people supporting the site.
  • For staffing reasons, your global on-call team needs 9 to 18 people on it (or 6 to 12 for a 12-timezone spread).
  • Due to the timezone spread, each team will have minimal coordination with each other. What coordination there is will involve one team being on a video-call at o-dark-thirty.
  • You need enough work to keep 9 to 18 people busy in addition to their fire-watch duties.
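The staffing arithmetic in that list is simple enough to sketch in a few lines of Python. This is a back-of-envelope illustration; the function name is mine, and the rotation sizes (3 minimum, 6 friendly) are the ones assumed above.

```python
# Back-of-envelope follow-the-sun staffing.
# Per-timezone rotation sizes: 3 is minimum viable, 6 is friendlier.
def global_oncall_headcount(teams, minimum=3, friendly=6):
    """Return (minimum, friendly) total on-call headcount for `teams` rotations."""
    return (teams * minimum, teams * friendly)

print(global_oncall_headcount(3))  # 8/8/8 split  -> (9, 18)
print(global_oncall_headcount(2))  # 12/12 split  -> (6, 12)
```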

You know who can pull that off? Really big companies.

You know who can't pull that off? Companies employing in a single labor market, such as the US.

I mean, Guam is a US holding (UTC+10). Theoretically, if you had a team in Guam and a team in New York City (UTC-4 in summer), you would have a 14 hour offset between them. You could sort of make this work while staying inside the US tax and legal domains, but you're reliant on the technical base of people in Guam, which has a population a bit smaller than Des Moines, Iowa. Colonialism means people will think about hiring in Ireland or India before Guam. To do this you need to go international.
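The offset arithmetic is easy to double-check with Python's fixed-offset timezones (fixed offsets are used here rather than zone names so no tz database is needed; Guam observes no DST, and New York in summer is EDT, UTC-4):

```python
from datetime import datetime, timezone, timedelta

GUAM = timezone(timedelta(hours=10))     # ChST, UTC+10, no DST
NYC_EDT = timezone(timedelta(hours=-4))  # New York in summer (EDT)

# 9am on a summer workday in New York...
ny_morning = datetime(2021, 7, 15, 9, 0, tzinfo=NYC_EDT)
# ...is already 11pm that evening in Guam.
in_guam = ny_morning.astimezone(GUAM)
gap_hours = (GUAM.utcoffset(None) - NYC_EDT.utcoffset(None)).total_seconds() / 3600
print(in_guam.hour, gap_hours)  # 23 14.0
```

A 14-hour offset is close to the ideal 12-hour spacing for a two-team rotation, which is why the pairing is tempting on paper.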

Most smaller companies won't go international; there's far too much paperwork involved at a time when you're supposed to be lean and fast.

I have worked with follow-the-sun exactly once in my career. We had Ops teams in the US East Coast, Poland, and China. It wasn't a true 8/8/8 split, but it was enough of a split that "after hours maintenance" always happened in someone's daytime. It was pretty dang nice. Then we had a layoff round and the Poland office went away. And we fired our Chinese Ops folk to save money, which meant we were waking the US staff up at o-dark-thirty to do maintenance.

I'm conflicted on this advice. On the surface, I totally get the sentiment: keep the annoying shit in everyone's daytime and don't force people to work overnights.

As an industry, we have history with split-shifting and incident response. The night operator used to be a common feature of any company with a computer, the person (or team of people) responsible for loading/unloading tapes, swapping paper for the printer, collating and packaging print-jobs, making sure the batch-jobs ran, calling the SYSOP when things smelled off, and a bunch of other now-forgotten tasks. Most organizations have gotten rid of the night operator for a lot of reasons. The two biggest being:

  1. We've mostly automated the job out of existence. Tapes (if tapes are still in use) are handled by robots. Print-jobs now show up as a PDF in your email. Batch-schedulers are really fancy now, so getting those batch-jobs run is highly automated. Monitoring systems keep track of way more things than we could track in the night operator era.
  2. No one wants to work overnights. Like, no one. At least not enough to easily find a replacement when the one person who does like it decides to leave/retire.

(The second point hit WWU while I was there)

As an industry, we no longer have a tradition of shift-work. The firm expectation is that we'll have a daytime job and go home in the evenings. If you offer me an overnight job at +30% pay, I'll take it for a while, but I'm still job-hunting for a real daytime job. That's not sustainable, which is why on-call is how we're solving the one night-operator task we couldn't automate out of existence: incident response.

Everyone needs some way to do incident response, even if they're 15 people with a big idea and a website -- far too small to be doing follow-the-sun rotations. Are they supposed to make it clear that availability is only guaranteed during certain hours? That idea has some legs, but the site will be compared unfavorably with the site next door that offers 24/7 availability (at the cost of little sleep for their few engineers).

Dislodging the idea that Ops-type work is always done with a pager attached and unknown extra hours will take a shit-ton of work. Sea changes like that don't happen naturally. We cross-faded from night operators to on-call rotations due to the changing nature of the role: there wasn't enough work on the 11pm to 7am shift to keep someone fully occupied, so we tacked those duties onto the 7am-3pm crew (who now work a nicer 9am to 5pm schedule).

The only way to break the need for on-call for Ops-type roles is to stop making availability promises when you're not staffed to support it with people responding as part of their normal working hours. If your support-desk isn't answering the phone, site availability shouldn't be promised.

It's that or unionizing the entire sector.

It's performance-review season at work, which means more feedback surveys, and a few questions on an all-company survey like:

On a scale from 1 to 5, how strongly do you agree with these statements:

  • Management makes speedy decisions.
  • Management makes high quality decisions.

I want to talk about one of the typically unnoticed factors that lowers these scores in larger organizations. To get there, I need to go back to the old DevOps standby, the Westrum Typology. Figure 1 shows a slide from a presentation I've given several times that shows some of the attributes of the three types.

Chart of the three types of the Westrum typology. Pathological is power-driven. Bureaucratic is rules-driven. Generative is performance-driven.
Figure 1: The Westrum Typology from Dr. Westrum's 2004 paper, "A Typology of Organizational Cultures". The pathological, bureaucratic, and generative culture styles have been a key part of talks on office culture in technical organizations for nearly a decade. Note the bolded row regarding the differences in bridging (talking to people across the org-chart rather than up/down the chart).

As you can see from the slide text, this talk is about workplace toxicity. Not at all coincidentally, when faith in the quality of management decisions is low, perceptions of workplace toxicity are higher. But that's not the relationship I'm talking about today. No, today is about improving that "decision quality" score. This slide has a row bolded regarding 'bridging'.

Westrum's definition of bridging is reaching across the org-chart. Figure 2 shows how bridging works in a bureaucratic organization.

In bureaucratic organizations, the official feedback method is up the org-chart then back down.
Figure 2: In bureaucratic cultures, casual bridging (dotted line) is tolerated. However, the official feedback method (dashed line) is to go up the org-chart and then back down again.

For a concrete example, Team Left is working with an API written by Team Right. However, this API has a few bugs that Team Left would like to get fixed. How does this work in a bureaucratic organization?

A member in Team Left is in the Latinx Employee Resource Group, and shares meetings there with someone from Team Right. Using their personal relationship, the Team Left member asks the Team Right member to get the bugs fixed as a personal favor.

If that isn't enough to get the bug fixed, Team Left's manager decides if this bug is worth all the trouble and decides (or not) to ask their manager. Manager 2 makes the same decision before asking Manager 3, who makes the same decision before ordering the other manager 2 to fix the issue. And so on. Bugs have to be a certain level of painful to bother with, or they just get worked-around.

This is what 'bridging is tolerated' means: there isn't an official channel for this, but if something happens out-of-band, it won't be penalized. This brings us to figure 3, and the generative organization.

In a generative organization, you can just ask the other team for help
Figure 3: In a generative organization bridging is explicitly allowed. Members of Team Left are allowed (dashed lines) to directly talk to members of Team Right. The same goes for their managers. Levels 2 and 3 of the org-chart don't need to be involved at all.

Let's look at our API bugs example again:

A member of Team Left has been working around these bugs and is tired of it. They look up the members of Team Right and DM one to ask how to get a bug fixed. The Team Right person helps them through filing a bug-ticket and setting a severity. The manager of Team Left follows up with the manager of Team Right to get the bugfix prioritized correctly.

For contrast, let's look at how our API bugs example works in the pathological organization where bridging is discouraged.

A member in Team Left is tired of the bugs they keep running into in the API from Team Right. They talk to their manager to see if they can get the Team Right to fix the bugs. Team Left's manager makes a decision about how much political capital is required to ask for this change from the manager of Team Right. If the cost is too high, nothing is done and Team Left continues to build work-arounds.

This is bridging. Generative organizations cut a lot of overhead out of team communications. Dr. Nicole Forsgren has spent the last decade studying software-producing organizations and how effective they are based on their workplace cultures, and has shown year after year that generative organizations release software faster and with fewer defects. Not only that, but software companies have been increasingly adopting generative attributes specifically to attain these performance gains.

Because bridging shows up on the Westrum charts a lot, this is what people think generativity is about: lateral feedback and rewarded risk-taking. But there is more to this, and it's the intersection of DevOps practices and the less well cited parts of Westrum's work. Famously, DevOps is about busting the silos between software development and the people who are charged with stability. Silo-busting is all about improving communication (also known as feedback), and there are a variety of ways to do this.

Look again at figure 2, the bureaucratic organization. The "official channels" are up, over, then down. Inefficient, especially in the face of emerging issues. This is also the way that pathological and bureaucratic cultures require feedback to run. This slowness is why incident management processes are built to bust official channels: speed of resolution is more important than making sure every manager has approved the changes. Even so, if you're in an organization that already has a lot of bridging and rewards risk-taking — a generative culture by the chart — your scores on "management makes high quality decisions" will start going down the larger your organization gets.

This has to do with a simple truth about the "management quality" question: it is asked of everyone, which means the worker-units at the bottom of the org-chart far outnumber the members of the management layers.

People believe management makes high quality decisions when they agree with the decision.

In a small org with just three layers of management, everyone kinda already knows the big decisions being made. But when the organization gets bigger? Well, figure 4 looks at a deeper management structure.

Eight management layers. Local context flows up, global context flows down.
Figure 4: A management chain typical of a larger organization. Individual contributors (ICs) report to a manager at least one level higher than they are, so an M3 could manage IC1 and IC2, but an IC5 must have an M6 or higher manager. The bottom managers have maximum local context and summarized global context. The top manager has maximum global context and summarized local context.
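The reporting rule in that caption is small enough to encode directly. This is just an illustrative sketch of the figure's level numbering; the function name is mine.

```python
# The figure's rule: an IC reports to a manager at least one level higher,
# so an M(n) can manage an IC(m) only when n >= m + 1.
def can_manage(manager_level: int, ic_level: int) -> bool:
    return manager_level >= ic_level + 1

print(can_manage(3, 2))  # M3 managing IC2: True
print(can_manage(5, 5))  # an IC5 needs an M6 or higher: False
```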

With a management chain this deep, the top manager (the M8) has maximal global context about everything and makes their decisions using that global context. Global context is the summarized local context of every manager below them on the chart. In a highly generative environment, the downward context flow is robust. Remember how the 'decision quality' question depends on the observer agreeing with the decision? That happens most often if the observer has the global context needed to appropriately assess the decision.

When you don't have that return context, you get figure 5.

When the global context doesn't flow down the management chain, lower managers can't use the global context in decisions.
Figure 5: When global context doesn't flow down the management chain, lower managers (and the ICs reporting to them) can't use the global context to assess higher-level decisions. Information flows uphill, but not down, so it feels like being kept in the dark.

With downward context flows like this, where the summarized global context doesn't filter down the chain, the context needed to judge decisions for correctness is lacking, so people are more likely to judge a decision as bad. If you want to improve your "decision quality" metrics, you need to improve how feedback flows down the management chain. To demonstrate how this works, let's look at a progression of announcements of a decision. First off, a simple decision statement:

We're closing our Beijing development center effective March 1st

If Beijing were actually Beijing, Indiana (not a real place), this statement could be given on March 1st while everyone in Beijing, Indiana is having a talk with Human Resources about severance packages. Under the standard US playbook for Reductions In Force, this sort of sudden-death decision has to be a surprise to everyone involved. What we have is:

  • A statement of decision.
  • No context or justification.

You see this sort of communication in command-and-control environments, where the consent of the managed doesn't actually matter so long as they do their jobs reasonably. For the Beijing Indiana case, this is entirely expected due to how US companies mostly work. Layoffs are sudden death.

However, what if it really was Beijing China? Chinese rules give way more power to workers than US rules do, and closing a workcenter like this is something that requires months of notice. March 1 could be six months out, and the Beijing office will be slowly wound down. In this case, where everyone will have months to second guess the decision, a bare statement of decision is not enough. If you want to minimize the damage of a mass reduction in force, you need to communicate some of the context.

Consider this revision. This is the same decision announcement notice as before, but with a lot more of the global context included:

Today, Staff has made the hard decision to sunset development on EBULLIENT YAK, RELIABLE EMU, and GRUBBING SLOTH. Our market performance never met parameters, and feedback from customers was that our products were solving the wrong problem. We would like to solve the right problem, but research shows we're just not where we need to be for that. Thanks to all the engineers and product managers who spent the last year on this wild bet; you worked hard.

Given our last two quarters of performance, Wall Street needs to see us improve profitability. Rather than further our investment into these three products, we've decided to wind them down. Because these projects were the only projects developed out of our Beijing space, we will not be renewing the lease on our office and will close it. These measures should result in savings that will improve our perception on Wall Street, though it will be two quarters before those changes show in our bottom line.

For employees affected by this, we will be offering remote work opportunities and placement services.

This is far longer, but it also provides a lot of justification and context for this decision:

  • Several projects have consistently failed to perform.
  • The company had a massive miss in their minimum viable product projections for these products.
  • The company stocks are getting punished due to profitability concerns.
  • These projects were the only ones in the Beijing office.
  • Workers will be transitioned nicely.

However, this is a great example of communications coming from the M8 level in the manager chain from figures 4 and 5: the very top. This sort of communication is actually pretty good in most companies! This is why that whole company all-hands meeting is so valuable. Figure 6 shows how various kinds of all-hands are useful for communicating global context.

M8: Company all-hands. M6: Division all-hands. M4: Department all-hands. M: Team status meeting
Figure 6: How all-hands meetings are used to communicate global context down the org-chart. Context for decisions made at the M8, M6, and M4 level are shared downward at these meetings.

There is a caveat about all-hands meetings: they generally cover things that would be of general interest to everyone below that level. So a decision made at the M6 level that only affects your line may not get mentioned, because other lines don't care. Decisions like these don't get communicated down through all-hands meetings, so their context will have to move through a different channel. You could add more all-hands meetings, but synchronous sessions like those get tiresome and take everyone away from work.

You need a different method of downward feedback for most decisions. Unfortunately, the role of a manager in a highly generative environment is to be an information router (top to bottom, bottom to top, and side to side), summarizing as appropriate and keeping track of which people are interested in which decisions. Feedback is totally a game of telephone: the M4 needs to keep track of what each of their M3s is interested in, so when they hear something on that list from the M5 they can communicate it downward, and it only gets worse the higher up you get.

Now that I've spent too many words explaining the problem, here are some concrete steps you can take to improve your "decision quality" metrics in your already pretty generative organization. These are all ways to improve the quality of context flowing down the org-chart:

  • If you are a manager whose team's work is dominated by planning processes mostly set by higher-level managers (most agile teams in large organizations), and you're not already in the planning process, poke the manager who is for status. Then communicate a summary of the balancing-act to your reports. They will thank you for the context.
  • If you are a manager of managers (everything from M4 up in the above figures) explicitly ask your reports about areas they would like more feedback from you. Prompt them with quarterly and annual planning processes if they shrug and can't think of anything.
  • If you are a high level manager of managers (M5-7 above) encourage your reports to get better about downward communication of context.
  • You can fake an all-hands by doing a "newsletter" email once a month or whatever, and put your downward context there. Put in your planning goals, top-of-mind items, and weighty decisions you're looking to address. For extra credit, directly solicit feedback. The point is to build organizational proprioception around decisions in the pipeline.
  • Communicate the process, not just the final decision. People feel better about decisions when they have a perception of the progress behind them.