September 2021 Archives

...and habits that follow you: impact

Part 4 of a 4 part series.

The fifth and last item on the Google re:Work blogpost on attributes of an effective team is impact: team-members believe their work matters. At this point we have most of the list.

  • We have psychological safety.
  • We can depend on our coworkers to deliver on their promises.
  • We have enough structure and clarity in our roles, goals, and plans to believe them and work towards them.
  • We consider our work personally meaningful.

But for some reason you still think the work doesn't matter. Why? There can be a few causes of this.

  • Maybe you're on a team that's doing great work, but your employer simply needs more of you, needs to invest in your area more, or needs to listen to you more often -- and isn't. Many QA and Security teams suffer this blight.
  • Maybe your organization is simply working towards the wrong problem. You have clear roles, goals, and plans, and your team is working on the right stuff (once in a while) but you aren't allowed to do that often enough.
  • Maybe you're the one person on the team that fully understands what you're working on, and no one else is even trying to get to your level. This affects senior techs (both in tenure and industry experience).
  • Maybe your company had a large layoff (reduction in force) and what is left of your team is merely going through the motions right now, while job-hunting on the side.

Most of these points are symptoms of a breakdown in the feedback cycle of decisions. I wrote about how context for decisions flows up and down the org-chart this past July.

  • If your team is doing great work but is being under-invested for some reason (in time, resources, or attention), this is a sign of a breakdown in feedback down the org-chart. Assuming you're communicating your needs upwards, the context for why that advice is not being acted upon is not getting communicated down. Or what is being communicated down isn't being believed. Troubleshoot where the context is getting lost.
  • If your organization is working on the wrong problem, this is a breakdown in feedback up the org-chart. It could be the message is getting lost along the way, or your part of the organization is suffering a credibility deficit so feedback isn't believed.
  • The senior pathology of being the only one who knows how it works is in part due to that person not creating the conditions where other people need to up-skill. Throughput thinking (whoever is fastest does the work) does not create cross-knowledge because there is no incentive to learn something that someone else already knows. Growth thinking (the person who knows how helps someone who isn't quite there) is far better at that.
  • Layoffs (reductions in force) are intentionally terrible; everyone is going to come out the other side traumatized. There are changes you can make to the RIF process to reduce the damage.

Whatever the cause, the two biggest compensating behaviors for dealing with this trauma are quite similar to what we saw with meaning: detachment and cynicism. The difference between meaning and impact is that with impact you're sharing this experience with your whole team. This means you have a support structure in place: all of your coworkers going through the same thing, thinking the same bad thoughts about management, and reinforcing each other's assumptions about how it all works.

How I got broken

In my post about personal meaning I talked a lot about how traditional pre-DevOps Operations teams became people you didn't like to work with. Too cranky, snarky, and quick to anger. I talked about how the service-center approach to operations made for an environment that fostered toxicity, which definitely applied to me from 2003 to 2011.

Today I want to talk about how going through that with my entire team was subtly different than the isolation I talked about in the meaning post. The pathology we experienced was the second bullet above: we weren't allowed to work on the right problem often enough. This meant that from time to time the decisions coming down the org-chart were the problem we had to manage.

This happened often enough that it destroyed our faith in management's ability to make good decisions, and slowly trained us into thinking that only we knew how things actually worked around here -- and no one listened to us. This made us build strong cynical walls because it was better to expect the worst and be surprised when things worked out, than expect the best and be surprised when things went terribly. The team itself was great to be on; these people got it and were there for me.

However, when the official feedback processes don't work, all that's left are the casual ones. So we became assholes. If someone came to us with a bad idea, we let them know it was a bad idea, and they should be ashamed for bringing it to us. The most tragic part? This actually worked; sometimes we could scrub some of the bad off before we had to just do it. It also got us a reputation for being cranky, snarky, quick to anger, and generally hard to work with. We wanted people to think twice about coming to us with a thing.

That said, because of our traumas and how we reacted to them, we were even less likely to be invited into the official process: who wants assholes in the room?

When I left there to go to a 20 person startup as their only Ops person, I brought this mindset with me. I literally had a seat at the decision table, but I was still treating the process as one I had to be unhealthily reactive to in order to participate. Their culture was far better than the one I had left, I just wasn't primed to see it that way.

Note: I need to point out that when someone with this trauma in their head starts on your team, they consider themselves a team of one. This trauma damages your ability to trust people just because you're supposed to. The above was all about this trauma happening to a team of people, but getting a new job destroys their sense of team. That's why a lot of the same advice I gave in the meaning article applies here.

If you are a manager or team-lead, you will start to see pathologies here once your new person starts trying to build that sense of team. The cynic's handshake will happen where you can't see it. For reports who aren't traumatized this way, the handshake is off-putting. For those that are, well, now you have two problems on your hands.

In my case, what would have helped me was my manager sitting me down and explaining to me the cardinal rule of the agile rituals: flow is everything. The improv rules totally apply. Use "yes, and" or "no, but" to keep the flow going. Using a flat "no" stops everything cold and people begin to hate you. That would have redirected me into more healthy communication patterns, and slowly detoxified me.

For other things, employ radical empathy. In your 1:1s complete the cynic's handshake and get them talking about org structures at their previous jobs. Even if they don't tell you it was terrible, you can still smell the terrible. It took until 2015 before I saw what I lived through from 2003 to 2011 as terrible. This gives you the information you need for highlighting how different this workplace is from their previous ones.

This is one area where the manager/lead is only weakly effective, though. If you suspect you have a cynic on your hands, you can work with your other reports to get them to help nudge the newcomer onto better ways of communicating. This is long-term work, people like me adopt cynical shields for safety reasons and it takes time to get them to come down (or a hard knock to the head in the form of getting fired).

...and habits that follow you: meaning

Part 3 of a 4 part series.

The fourth item on the Google re:Work blog about the five attributes of an effective team is meaning: work is personally important to team members. By now we have psychological safety, we can depend on our coworkers to deliver on their promises, and have enough structure and clarity to know the plans, roles, and goals of our team. But for some reason the work isn't personally meaningful. There are many causes of this; some overlap, some come in big swarms.

  • Maybe you're the junior person on the team and are stuck with all the BUGFIX tickets, and never new feature work. You're learning lots about how the codebase interoperates and some of the strange decisions made along the way, but this isn't what you want to do. This isn't making, this is... janitor work.
  • Maybe the team is a pathological den of vipers, and you're just here for the (fat) paycheck. Your real passions are elsewhere.
  • Maybe the work you want to do has been consistently given to other people, so you've given up trying.
  • Maybe you've been moved from team to team so often you haven't had a chance to build any sense of ownership and pride of work.
  • Maybe your big passion is open source software, and the only way to make that happen is a closed source job with a paycheck.
  • Maybe the release patterns are so onerous that all attempts to make them better have foundered on the shoals of past practice and you've shrugged and given up.
  • Maybe you're the QA on the team and have been ignored so long you've habituated to not being listened to.

Whatever the reason, and there can be many, the job is... nice... it just isn't all that meaningful. And when that happens, the fifth item on the re:Work list (we believe the work fundamentally matters) simply can't happen. If you're experiencing this lack of importance in your work the big compensating behavior is:

You detach from work, because it only sort of matters.

Detachment. We saw some of this with the compensating behaviors for the lack of structure and clarity, but lacking this also causes detachment. You might participate in quarterly planning; but it's more half-hearted than earnest, throwing up ideas just in case you can make change this time, but privately convinced it'll never happen. When this is lacking, you see people focusing most on the current sprint's work rather than the long term goals, because the short term goals are easier to identify and use for motivation.

Also, you will see more focus on the people on the team than the work of the team. We have dependability now, maintaining that matters more than the work itself. While the work may not motivate (it doesn't matter), keeping your relationships healthy does. Again and again I hear about companies and organizations that are structured to magnify your connection to each other as a buffer against the psychological impact of shitty work.

So if you have been marinating in this sort of toxicity and get a new job, what kind of behaviors are you prone to in your new job?

  • You are primed to think that the work you do doesn't matter, and will need heavy convincing to think otherwise.
  • You are less likely to take quarterly planning seriously.
  • You are primed to not bother with tiny bugs, not worth the effort to smash.
  • You build cynical shields around getting attached to work.

That last is the most toxic behavior, and is also why pre-DevOps Operations teams were as nasty to work with as we were. I was totally damaged this way, and it has everything to do with why I lost my job in November 2013. Let my pain help you.

Old-school Ops teams

The Operations teams of yore, the Dragons in the Datacenter, Mordak Preventer of Information Services, the Bastard Operator from Hell, all had cynicism as a core organizational and team building concept. We did it that way because we were structurally disempowered, which made our work hard to connect with success. Yeah, we kept everything running -- but we weren't helped by how business processes or development processes worked.

All too often, Operations was seen as a service center: there to keep everything running, and be the professional installers and maintainers of all our systems. If you approach the maintainability and availability problem-space from a "service center" mindset, you delegate all the responsibility for availability to the team with the least power to actually affect it. The last decade plus of Site Reliability Engineering discourse has proven that availability is an end to end problem, not only a problem at the production endpoints. When you have a team who is responsible for something they only have a little control over, you get cynicism.

This is entirely human. Any time you have a group of humans whose job it is to clean up after other people's messes you get cynicism. I personally have seen the same Ops-team dynamic in:

  • Animal Control officers (you did what to the dog?)
  • Emergency Room professionals (you stuck that where?)
  • Janitorial staff (where is the barf this time?)
  • School Bus drivers (where is the barf this time?)
  • Apartment move-out cleaners (did they not understand that stovetops need cleaning?)

Get a group of these people together and you get exchanges of dark horror stories, all as a way to bond with each other. Knowing others have seen the horrors you have makes those horrors easier to withstand. Go to conferences with a lot of former sysadmins and sit in the bar, you'll hear the war-stories of cleaning up after other people's messes.

This becomes a problem when the cynic starts someplace that actually doesn't have those attributes. They cynic all over your team, probably where you aren't around, and start either poisoning the team or alienating them.

If you're a manager or team-lead and see your new person showing some of these signs, there are a few ways you can help get them reconnected to work:

  • If they're not paying enough attention to low level bugs (and the rest of the team is), in your 1:1 ask them how low stuff was handled at previous jobs. This will give you context for where they're coming from. You probably will learn more about the minimum threshold of suck before anything was worth fixing. This gives you what you need to point out how different this workplace is.
  • If they're used to environments where the new kid on the team gets ignored until they've proven themselves, make sure they're included and listened to in planning. Eventually they will figure it out.
  • If their motivation is open source, work to get them permission to commit fixes to upstream projects.
  • Foster any sense of ownership they seem to be developing and defend it against all comers.
  • If you are a team-lead and not the manager (the manager will never see this first) and you get the Cynic's Handshake of snarking about management, return the handshake (give them an anecdote of Management Gone Wrong from a previous job) to earn their trust, and start pointing out how this job actually isn't structurally disempowered and how much better it is than your previous place.

The signals here are subtle, and the causes are the kind you won't learn about in the first few weeks of them being there. Get them to tell stories about their past jobs and listen for patterns. So much of what you need to decompensate these behaviors comes in those casual conversations.

Your job here is to learn why they're not connecting with the work and try to build/rebuild that sense of importance. Some people never had it in the first place, they're the easier ones. The people who had it then lost it as a way to maintain psychological safety are going to be your hardest cases. Any time you try to get someone to overcome a safety reflex, you will be fighting.

Part 2 of a 4 part series.

The third item on the Google re:Work blog about the five attributes of an effective team is Structure & Clarity, team-members know where the team is going, who is responsible for what, and how those plans will be measured. The plans, roles, and goals. At this point you have dependability, you can trust your coworkers to do what they promise. What happens if the plans, roles, or goals don't matter for some reason?

  • Plans don't matter: Short-term thinking. If plans don't matter, you won't worry about them. Just focus on the next sprint or release. Ignore quarterly planning completely, because it won't matter anyway. Work on what hurts the most, because it feels better that way.
  • Roles don't matter: If you don't know who is supposed to do what, you just go with whoever feels right; the org-chart isn't going to tell you. This means you develop a table of "Person A deals with Feature B" mappings in your head. This sort of thing happens anyway for cross-team work, but if you're doing it inside your own team, maybe you have a roles problem.
  • Goals don't matter: More short-term thinking. If the quarterly goals don't matter because your team is always veering from crisis to crisis, why bother worrying about them? Just focus on what hurts right now.

Above all, you are not focusing your work on the goals of the organization, you're focusing your work to make sure your teammates consider you dependable. You will absolutely tank an organizational goal (they don't matter anyway) to help a coworker out of a bad situation. If you're not helping a teammate, you're working on something to take away some of the pain of operating/maintaining/building this thing.

Stuff that matters to you, not stuff that matters to your boss and whoever they report to.

And if you don't have structure and clarity:

  • Meaning of work doesn't matter, because you're not tracking it. What meaning you find, is in keeping your teammates happy.
  • Impact of work doesn't matter because... to be frank, this metric is a manager-metric. You can feel your work has plenty of impact, just not organizational impact. But really, without those plans, roles, and goals you can't identify the impact of your work well enough to build a sense that your work has (organizational) impact.

If you're a manager or team-lead, there are a few behaviors that signal a new person has come from a place with bad structure and clarity:

  • Indifference to quarterly goals (because they haven't mattered before)
  • No participation in setting future goals (why bother, they don't matter)
  • Extensive participation in reducing problems the team is facing, but little participation around pursuing the goals of the organization.

That last one is tricky, because it's the one that makes everyone else think this person is pretty great. You know better, because you know what isn't getting worked on.

Getting this person to decompensate takes time, it always does. They're willing to work on the team-based goals, which you can definitely harness in the short term. Longer term, make sure to point out when your team meets the plans that were set a quarter ago, and keep doing that until they realize that those plans and goals actually matter here. What you're doing is pointing out how different their current environment is from their previous ones.

You can also embrace the power of your 1:1 meetings. Ask how planning was handled at previous jobs. Get the gossip on how it was (mis)handled. You'll get a lot of details you can use to nudge them into noticing we don't work like that, and achieving planned goals makes the whole team better.

Six years ago Google Rework posted a list of five dynamics that teams need to be successful. This list is familiar if you've been working in and around office cultures for tech companies.

  1. Psychological Safety. Team members feel safe to take risks.
  2. Dependability. Team members get done what they promise (see below).
  3. Structure & Clarity. People have clear plans, goals, and roles.
  4. Meaning. Work is personally important to team members.
  5. Impact. Team members think their work matters.

Well, the first item on this list is familiar. In my experience the other four points get lost in the overall discussion, even if internal functions focusing on 'office culture' do pay attention to the whole list. Overlooking the rest of these is a problem for one big reason:

People maintain psychological safety above anything else, even if it means adopting toxic coping strategies. They can put a "5" down on the "I feel safe to share my opinions" question in the bi-annual employee sentiment survey, while also sarcastically belittling anyone who disagrees with them. They can put a 5 down on that question because they know, in the core of their being, that their opinion only matters to their teammates and their team is completely ignored otherwise. You can get deep workplace toxicity while also having psychological safety.

If your teams are missing any one of these points your teams are adopting toxic behaviors to compensate.

This list was intentionally structured: if you miss on one point you can't have the points below it on the list.

This blog series will cover the overlooked four points in detail.

First up is dependability: Team members get done what they promise.

If you can't trust your coworkers to get done what they say they will:

  • You learn to only accept work you can do entirely yourself.
  • When someone offers to help you, you always have a plan for how to deal with them not following through on their promise.
  • If you find out that you actually need help, you will reach out for help as the absolute last step. You will do everything you can to avoid that, because asking for help almost never works out.

Question: In your working life, have you felt this? What taught you that others couldn't be trusted? How did that follow you to later jobs?

When I was in middle-school (roughly ages 10 through 13) American education was fond of peer education as a teaching technique. On paper it looks fine, by teaching a thing you learn it better, and by having a peer teach the thing you get more cultural context than the capital-T Teacher doing it. In my case, it was called "foursomes" because my class was split into groups of four. To provide an incentive, the team's grade would be based on the whole team's work. The mix of students in these foursomes was distinctive: 1 high achiever, 1 to 2 middle achievers, and 1 to 2 slackers (the exact mix depended on the class). We absolutely were not allowed to self-select the teams.

This team dynamic failed dependability for the high achiever. By tying the overall grade to the efforts of the whole team, the high achiever could only maintain high achieving by spurring everyone else to their high standard. Or, as happened way more often, by doing most of the work while seething with resentment. When talking to fellow high achievers once I got to college, our universal experience was that we couldn't trust anyone to help us in our academic work, and others would happily freeload off our efforts. Toxicity poisoned us all, and this carried over into my working career.

There is another area where a person's sense of dependability is compromised, and that is from having worked in a pathological environment. Pathological environments are driven by power, your ability to get anything done is 100% related to how much power you wield. People with power over you can make you do things. Maintaining your power is key to getting anything done. Because everyone is playing power games, you can't depend on your coworkers to follow through on promises -- they'll only do it if they get something from it, won't hurt their own power, or betraying you later helps them more.

The thing is, if you don't have dependability, you can't have the other points. If you can't trust your coworkers to do what they promise:

  • The roles, goals, and plans your team has (structure & clarity) don't matter because you can't trust any of the promises that roles, goals, and plans are built upon.
  • If you have a sense of personal meaning in your work, it isn't because of who you are working with. Or for.
  • The sense that your team has impact in the greater organization is heavily compromised because you can't trust your coworkers to do what they need to do.

If you are a manager or team-leader, you can help people retrain from past experiences where dependability was a problem. The first step is recognizing when a new person could be applying inappropriate past coping strategies to your team.

The number one signal is that they never, ever ask for help. If they get stuck, they spiral in a vortex of self-hate, pretend nothing is wrong, go deep into expectation-management to buy themselves enough time to train up on the thing, or reach out to their non-work friend-group for help because they don't trust your team to be helpful. Or all of these.

The trick here is figuring out where this is coming from. Someone who spent a lot of time in volunteer-run organizations where people didn't prioritize the work is coming from a different place than someone coming from pathological environments where trust was a weapon. You the manager (or team leader) have power over them, and if you find they are managing your expectations of them or are hiding a lot of their troubles, you're probably dealing with someone carrying a lot of pathological-environment damage.

You can help decompensate by gently pointing out how this environment is different than their old ones. This will take time. They adopted this stance to keep themselves safe, and people don't give up safety techniques lightly. As with all of the advice I give in this series, your enemy is confirmation bias; all it takes is one event to 'prove' that this job is just like the others and they lose all the ground they gained.

  • Point out instances where people asked for help, got it, and faced no negative consequences.
  • In 1:1 meetings, ask how their previous jobs handled risk-taking (asking for help is a risk). This will give you the context you need to illustrate how this team is different. It'll also give you a chance to build rapport by sharing your own experiences.
  • Reward them taking the risk of trusting others, positive reinforcement.
  • If you see instances of help-shaming on your team, visibly smack it down. You need to demonstrate to the new person that this behavior is not allowed.

There is another pathology that also comes from terrible environments: having to 'prove' yourself before you let yourself trust others. Many pre-DevOps Operations teams worked this way, the newbie had to prove they had what it takes before getting fully embraced. The key concept here is credibility. If someone has credibility dynamics in their head, they know that the new person (them) gets griefed and harassed until they can prove themselves.

As a team lead or manager you can help short circuit this by explicitly granting them credibility:

  • Call on them in meetings.
  • Ask their opinion in private, and in a meeting say that their idea was a good one.
  • Give them easy-win projects.

The future of telemetry

One of the nice things about living right now is that we know what the future of software telemetry looks like. Charity Majors is dead right about what the datastructure of future telemetry will be:

Arbitrarily cardinal and dimensional events.

Arbitrarily Cardinal means each field in a telemetry database can have an arbitrary number of values, similar to how Elasticsearch and most RDBMSs are built to handle.

Arbitrarily Dimensional means there can be an arbitrary number of fields in the telemetry database, similar to how column-based datastores like MariaDB, HBase, and Hypertable are built.

This combination of attributes allows software engineers to not have to worry about how many attributes, metrics, and values they're stuffing into each event across the entire software ecosystem. Here is a classic cardinality explosion that is all too common in modern systems; assume each event has the following attributes:

  • account_id: describing the user account ID.
  • subscription_id: the payment subscription ID for the account/team/org.
  • plan_id: the subscription plan ID, which is often used to gate application features.
  • team_id: the team the account_id belongs to.
  • org_id: the parent organization ID the team or user belongs to.
  • code_commit: the git-hash of the currently running code.
  • function_id: the class-path of the function that generated the event.
  • app_id: the ID of the application generating the event.
  • process_id: the individual execution of code that generated the event.
  • cluster_id or host_id: the Kubernetes cluster or VM that the process was running on.
  • span_id: the workflow the event happened in, used to link multiple events together.

This is a complex set of what I call correlation identifiers in my book. This set of 11 fields will give you a high degree of context for where the event happened and the business-logic context around why it happened. That said, in even a medium size SaaS app the union of unique values in this set is going to be in the trillions or higher. You need a database designed for deep cardinality in fields, which we have these days; Jaeger is designed for this style of telemetry right now.
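To get a feel for the scale, here is a rough back-of-the-envelope sketch in Python. Every count below is a made-up assumption for a hypothetical medium-size SaaS app, not a measurement. It reads the claim as "how many distinct combinations of these 11 values could an event carry," and the product it computes is only a naive upper bound, since many of these fields are correlated (a subscription usually maps to a single plan, for example).

```python
from math import prod

# Hypothetical distinct-value counts per correlation identifier.
# All numbers are illustrative assumptions, not real measurements.
distinct_values = {
    "account_id": 200_000,
    "subscription_id": 50_000,
    "plan_id": 5,
    "team_id": 60_000,
    "org_id": 40_000,
    "code_commit": 2_000,
    "function_id": 5_000,
    "app_id": 30,
    "process_id": 1_000_000,
    "host_id": 500,
    "span_id": 100_000_000,
}


def combination_upper_bound(counts):
    """Naive upper bound on distinct value combinations: the product of
    each field's cardinality. Correlated fields make the real number
    smaller, but it shows how fast the space grows."""
    return prod(counts.values())


def total_distinct_values(counts):
    """Total number of distinct values the datastore must index across
    all fields: the sum of each field's cardinality."""
    return sum(counts.values())


if __name__ == "__main__":
    print(f"fields:              {len(distinct_values)}")
    print(f"distinct values:     {total_distinct_values(distinct_values):,}")
    print(f"combinations (max):  {combination_upper_bound(distinct_values):.2e}")
```

Even if you cut these guesses down by a few orders of magnitude, the combination count stays far past a trillion, which is why a datastore that pre-aggregates by every label combination falls over on this kind of data.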

However, this is only part of what telemetry is used for. These 11 fields are all global context, but sometimes you need localized context such as file-size, or want to capture localized metrics like final number of pages. This local context and these localized metrics are where the arbitrarily dimensional aspect of telemetry comes into play.

To provide an example, let's look at the local context I might encounter at work. I work for an Electronic Signature provider, where you can upload files in any number of formats, mark them up for signers to fill out, have them signed, and get a PDF at the end. In addition to the previous global context, here is one example of local context we would care about for an event that tracks how we converted an uploaded WordPerfect file into the formats we use on the signing pages:

  • file_type: the source file type.
  • file_size: how big that source file was.
  • source_pages: how many pages the source file was.
  • converted_pages: how many pages the final convert ended up being (suggesting this can differ from source_pages, how interesting)
  • converted_size: how big the converted pages ended up being.
  • converted_time: how long it took to do the conversions, as measured by this process.

This set seems fine, but let's take a look at the localized context for a different team: the one writing the Upload API endpoints.

  • upload_type: the type of the file uploaded.
  • upload_bytes: how big that uploaded file was.
  • upload_quota: how much quota was left on the account after the upload was done.
  • persist_time: how long it took to get the uploaded file out of local-state and into a persistent storage system.

We see similar concepts here! Chapter 14 in the book gives you techniques for reducing this level of field-sprawl, but if you're using a datastore that allows arbitrary dimensionality you don't need to bother. All the overhead of reconciling different teams' use of telemetry to reduce complexity in the database goes away if the database is tolerant of that complexity.
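As a minimal sketch of what "tolerant of that complexity" can look like, the Python below pivots events from both teams into a sparse, column-oriented layout where each field becomes its own column and a missing field is simply absent for that row. The field names come from the two lists above; the sample values, the trimmed-down global context, and the to_columns helper are all assumptions for illustration, not how any particular datastore works internally.

```python
# Shared global context (trimmed to three fields to keep the sketch short).
# All values here are hypothetical.
GLOBAL_CONTEXT = {
    "account_id": "acct-123",
    "app_id": "converter",
    "span_id": "span-abc",
}

# Event from the team doing file conversion.
conversion_event = {
    **GLOBAL_CONTEXT,
    "file_type": "wpd",
    "file_size": 48_213,
    "source_pages": 4,
    "converted_pages": 5,
    "converted_size": 102_400,
    "converted_time": 1.8,
}

# Event from the team writing the Upload API endpoints.
upload_event = {
    **GLOBAL_CONTEXT,
    "app_id": "upload-api",
    "upload_type": "wpd",
    "upload_bytes": 48_213,
    "upload_quota": 9_951_787,
    "persist_time": 0.12,
}


def to_columns(events):
    """Pivot event dicts into sparse columns: field -> {row index: value}.

    Nobody declares a schema up front. A new field creates a new column
    the first time it is seen, and rows that lack it store nothing."""
    columns = {}
    for row, event in enumerate(events):
        for field, value in event.items():
            columns.setdefault(field, {})[row] = value
    return columns


columns = to_columns([conversion_event, upload_event])

if __name__ == "__main__":
    for field, values in sorted(columns.items()):
        print(f"{field:16} rows={sorted(values)}")
```

Notice that file_size and upload_bytes end up as two separate columns even though they describe the same thing. In a store that handles arbitrary dimensionality that duplication costs little, so the two teams never have to sit in a meeting to agree on a name.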

Many of my Part 3 chapters are spent providing ways to handle database complexity. If you're using a database that can handle it, those techniques aren't useful anymore. Really, the future of telemetry is in these highly complex databases.

The problem with databases that support arbitrary cardinality and dimensionality is that you need to build them from scratch right now. You can start with a column-store database and adapt it to support arbitrary field cardinality, but that's homebrewing. Once you have the database problem solved, you need to solve the analysis and presentation problems leveraging your completely fluid data-model.

This is a hard problem, and it's hard enough you can build and finance a company to solve it. This is exactly what Honeycomb did and is doing, and why the future of telemetry will see much less insourcing. Jaeger is the closest we have to an open source system that does all this, but it has to rely on a database to make it work; currently that's either Elasticsearch or Cassandra. The industry needs to see an open source database that can handle both cardinality and dimensionality, and we just don't have it yet.

The typical flow of telemetry usage in a growing startup these days is roughly:

  1. Small stage: SaaS for everything that isn't business-code, including telemetry.
  2. Medium stage: Change SaaS providers for cost-savings based on changed computing patterns.
  3. Large stage: Start considering insourcing telemetry systems to save SaaS provider costs. Viable because at this stage you have enough engineering talent that this doesn't seem like a completely terrible idea.
  4. Enterprise stage: If insourcing hasn't happened yet for at least one system, it'll happen here.
  5. Industry dominance stage: Open source (or become a major contributor to) the telemetry systems being used.

The constraints of the perfect telemetry database mean that SaaS use -- through stand-alone telemetry companies like Honeycomb or the offering from your cloud provider -- will persist much deeper into the growth cycle. There is a reason that the engineering talent behind the Cloud Native Computing Foundation largely comes from the biggest tech companies on the planet: it is in their interests to build good-enough internal systems that are competitive with the SaaS providers. Those internal systems won't be quite as featured as the SaaS providers; but when you're doing internal telemetry for cost savings, having an 80% solution feels pretty great for what would otherwise be a $2M/month contract.

For the rest of us? SaaS. We'll just have to get used to an outside party holding all of our engineering data.