September 2019 Archives

When terms shift

For a few years I've been giving talks on monitoring, observability, and how to get both. In those talks I have a Venn Diagram like this one from last year:

Venn Diagram showing monitoring nested inside observability nested inside telemetry

Note the outer circle, telemetry. I was using this term to capture the concept of all the debug and log-line outputs from an in-house developed piece of software. Some developer put every log-line in there for a reason. It made sense to call this telemetry. Using this concept I point out a few things:

  • Telemetry is your biggest pot of data. It is the full, unredacted sequence of events.
  • So big that you probably only keep it searchable for a short period of time.
  • But long enough you can troubleshoot with it and see how error rates change across releases.
  • Observability is a restricted and sampled set of telemetry, kept for longer because of its higher correlation.
  • Higher correlation and longer time-frames make for a successful observability system.
  • Monitoring is a subset of them all, since it drives metrics, dashboards, and alarms. A highly specialized use.

A neat model, I thought. I've been using it internally for a while now, and thought it was getting traction.

Then the two major distributed-tracing frameworks decided to merge and brand themselves as OpenTelemetry. I understand that using OpenTracing or Distributed Tracing was non-viable; one of the two frameworks was already called OpenTracing, and distributed tracing is a technique, not a trademark-able project name.

There is a reason I didn't call it Centralized Logging: telemetry encompasses more than just centralized logging. It includes things that aren't centralizable because they exist in SaaS platforms that don't have log-export. Yes, I'm miffed at having to come up with a new term for this. Not sure what it will be yet.

Technical maturity

Having worked in the mandatory growth or death part of the tech industry[1] for a few years now, I've had some chances to see how organizations evolve as they grow. Not just how the organization evolves, but the technical environment too. I've talked about this before, but what's most important for a mandatory-growth company changes based on its market-stage.

Early stage startup (from before product-release to the months just after the product is released)

  • Finding market-fit is everything.
  • The biggest threat to the infrastructure is running out of money.
  • Get out the tech-debt charge-card because we need to get something out there right now or we'll need new jobs.
  • Feature delivery is way more important than disaster resilience.

Middle stage startup (has market-fit, a cadre of loyal customers in the small/medium business space)

  • Extending market penetration is the goal.
  • Feature drive is slackening a bit, as focus shifts to attracting those bigger customers.
  • Some tech-debt is paid off, but it's still accumulating in places.
  • Work on improving uptime/reliability starts coming into focus.

Up-market stage (attempting to break into the large business market)

  • Features that large businesses need are the goal.
  • Compliance pressures show up big time due to all the vendor-security-assessments slowing down the sales process.
  • First big chance of a major push to reduce early-stage tech-debt. Get those SPOFs out, institute real change-management, a vulnerability-assessment program, an actual disaster-recovery plan, all the goodies.

These are very broad strokes, but there is a concept called technical maturity being shown here. Low-maturity organizations throw code at the wall and, if it sticks in an attractive way, leave it in place. High-maturity organizations have perfected the science of assessing new code for attractiveness and have built code-deployers that can repeatedly hit the wall and maintain aesthetic beauty, all without having to train up professional code-throwers[2].

Maturity applies to Ops processes just as much, though. Having been working on some of this internally, I've come to feel it's kind of like building a tech-tree for a game like Starcraft.

Logging/Metrics/Observability
Level 1: Centralized logging! And you can search them!
Level 2: You've got metrics now!
Level 3: High-context events!
Level 4: Distributed-tracing!

Disaster Management
Level 1: You've got an on-call system and monitoring!
Level 2: You've got a Disaster Recovery plan!
Level 3: You've got SLAs, and not-Ops is now on-call too!
Level 4: Multi-datacenter failover!

Patching
Level 1: You have a routine patching process!
Level 2: Patching activities not related to databases no longer require downtime!
Level 3: You can patch, update, and upgrade your databases without requiring downtime!
Level 4: You can remove the planned outage carve-out in the SLA's uptime promise!

These can definitely be argued, but this looks like it could be a useful tool for companies that have graduated beyond the features-or-death stage. It can let internal technical maturity take an equal place at the table with Product. Whether or not that will actually work depends entirely on the organization and where the push is coming from.


[1]: As opposed to the tech enables our business, it isn't THE business part of the industry. Which is quite a bit larger, actually.
[2]: This analogy may be a bit over-extended.

Proxysql query routing

The proxysql project doesn't have much documentation around its query engine, even though it is quite powerful. What it does have is a table-schema for the query-rules table, and figuring out how to turn that into something useful is left as an exercise for the reader. It doesn't help that there are two ways to define the rules, depending on how you plan to use proxysql.

For the on-box usecase, where proxysql is used as a local proxy for a bunch of DB-consuming processes, its config is likely part of whatever you're using for configuration-management. Be that Docker, Puppet, Chef, or something else. Fire once, forget. For this usecase, a config-file is most convenient.

mysql_query_rules =
(
  {
    rule_id = 1
    active = 1
    username = "read_only_user"
    destination_hostgroup = 2
  },
  {
    rule_id = 2
    active = 1
    schemaname = "cheese_factory"
    destination_hostgroup = 1
  }
)

Two rules. One says that if the read-only user is the one logging in, send the query to hostgroup 2 (the read-only replica). The other says that if the "cheese_factory" database is being accessed, use hostgroup 1. Seems easy. For the on-box usecase, changing rules is as easy as rolling a new box/container.

However, the other way to define these is through a SQL interface they built. This usecase is more for people operating a cluster of proxysql nodes who need to change rules and configuration on the fly with no downtime. It's this method that all of their examples are written in.

Which leaves those of us using the config-file to scratch our heads.

INSERT INTO mysql_query_rules (rule_id, active, username, destination_hostgroup)
VALUES (1, 1, 'read_only_user', 2);
LOAD MYSQL QUERY RULES TO RUNTIME;
SAVE MYSQL QUERY RULES TO DISK;

INSERT INTO mysql_query_rules (rule_id, active, schemaname, destination_hostgroup)
VALUES (2, 1, 'cheese_factory', 1);
LOAD MYSQL QUERY RULES TO RUNTIME;
SAVE MYSQL QUERY RULES TO DISK;

These two ways of describing a rule do the same thing. If you're writing a config-management thingy for an on-box proxysql, the first is probably the only way you care about. If you're building a centralized one, the second one is the only one you care about.

For those of you looking to make the translation, or looking for the config-file schema, each of those column names in the table-schema can be used as a key in the mysql_query_rules array. A combined example follows the list below.

  • Different lines within a single rule are ANDed together.
  • Rules are processed in the rule_id order.
  • The first match wins, so put your special cases in with low rule_id numbers, and your catch-alls with high numbers.
    • The flagIN, flagOUT, and apply columns allow you to get fancy, but that's beyond me right now.
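
To make the ordering and the ANDing concrete, here's a minimal config-file sketch. The username and the hostgroup numbers are invented for illustration; the point is that both columns in the first rule have to match, and that the catch-all only gets checked when the special case misses.

mysql_query_rules =
(
  {
    # Special case: one hypothetical user hitting one schema.
    # username AND schemaname must BOTH match for this rule to fire.
    rule_id = 1
    active = 1
    username = "report_user"
    schemaname = "cheese_factory"
    destination_hostgroup = 2
  },
  {
    # Catch-all for everything else touching that schema.
    # Higher rule_id, so it is only evaluated if rule 1 didn't match.
    rule_id = 10
    active = 1
    schemaname = "cheese_factory"
    destination_hostgroup = 1
  }
)

Leaving gaps in the rule_id sequence (1, 10, 20, and so on) gives you room to slot new special cases in later without renumbering everything.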

Subjectivity in the tech-screen

When you get to a certain level of the org-chart in a startup-style tech company, you spend an embarrassing amount of your time doing interviews. Senior individual-contributor people end up doing technical screens because -- in theory -- we're good at judging who is actually good at something and who is faking it. Sometimes we get to do the culture-fit interviews as well, since we tend to be the people with a lot of tenure.

All of which means I'm in more of these than I used to be.

In theory, the technical screen is awesome because:

  • It's a test of skill, not glad-handing (though glad-handing is an important skill for senior and super-senior ICs anyway).
  • It gives the reviewer a sense of a prospect's technical abilities by looking at their work-output.
  • It gives a sense of the prospect's adherence to language styles.
  • It's objective.

And it very well may be, but from talking with others who deal with these I've noticed something else creep in. It comes when a candidate is on the edge of close-enough. Not a clear passing grade, but close.

  • Men: A little pairing, and he'll get up to speed fast. Hire.
  • Women: Not quite where we want her to be, we don't have time to train someone up. Pass.

It's the old adage again:

Men are hired because of their potential, women are hired because of their proven ability.

Subjectivity comes in the edge-cases, as always. No matter how brutally pass/fail you make the tech-screen, if literally everything else is saying hire someone, you will be tempted to look at that almost-pass on the tech-screen and subjectively judge whether it is good enough. That's human nature; we exist to add common-sense to ruthless numbers and automation.

So if you find yourself making that "is this candidate good enough, or too much work?" calculation, ask yourself what biases may be at work on you. Be deliberate.

When death-squads stalk the cube-halls

Recessions mean layoffs.

The US Standard Model for layoffs, used by risk-managers everywhere, is to minimize the time between a layoff target receiving the news that they are to be terminated and when the termination actually happens. The theory here is that such news turns people into insider threats and that means a swift removal is called for. Therefore you see:

  • No-notice HR meetings show up on your calendar.
  • HR arrives at your desk, with security.
  • You get to work one morning and the doors won't work for you.

This is the death-squad approach to layoffs. People just disappear from the floor without warning, never to be seen again. Happily talking at lunch, and their Slack user suddenly disappears at 1400. It makes people think they might be next, you know?

Wondering if you're next for the death-squad treatment, especially in a recession where the next job will take a long time to arrange, kinda destroys any sense of psychological safety. You know, that thing that is foundational to any healthy office culture? The thing where the lack of it leaves scars on the survivors for years?

Nothing turns a generative office-culture into a pathological one faster than death-squad style layoffs.

A better model

For the most part, US labor law treats employees as disposable worker-widgets to be thrown away at will and replaced by someone new who is eager to have a job at all. By comparison, many other countries have better employee-protection laws than we do. One of the most common of these concerns how layoffs are handled: workers are guaranteed notice before a layoff is made effective.

It's a simple thing, but for a workplace culture it is profound. Getting two weeks' notice that someone will be an ex-employee, or that a group of folks will be ex-employees, allows people to process. It allows them to come to grips with it. It allows a going-away party where formal good-byes may be made. A generative office-culture is far more likely to survive that.

You don't need a union for this! Any company can do it! All it takes is the will to do the hard thing and break US Standard Practice.

The time is coming. If your company hasn't seen a mass layoff in its history, a recession is when it'll happen. Start thinking about it now.