Recently in sysadmin Category

This is a controversial take, but the phrase "it's industry standard" is over-used in internal technical design discussions.

Yes, there are some actual, full-up standards. Things like RFCs and ISO standards are actual standards. There are open standards that are widely adopted, like OpenTelemetry and the Cloud Native Computing Foundation suite, but these are not yet industry standards. The phrase "industry standard" implies consensus: agreement, a uniform way of working in a specific area.

Have you seen the tech industry? Really seen it? It is utterly vast. The same industry includes such categories as:

  • Large software as a service providers like Salesforce
  • Medium software as a service providers like Dr. Chrono
  • Small software as a service providers like every Bay Area startup founded in the last five years
  • Large embedded systems design like the entire automotive industry
  • Highly regulated industries like Health Care and Finance, where how you operate is strongly influenced by the government and similar non-tech organizations
  • The IT departments at all of the above, which are much smaller than they used to be due to the SaaS revolution, but still exist
  • Scientific computing for things like space probes, satellite-based systems, and remote sensing platforms floating on the oceans
  • Internal services work at companies that don't sell technology, places like UPS, Maersk, Target, and Orange County, California.

The only thing the above have any kind of consensus on is "IP-based networking is better than the alternatives," and even that is a bit fragile. You'd think there would be consensus on statements like "HTTP is a standard transport," but you'd be wrong. Saying that "Kubernetes patterns are industry standard" is a statement of desire, not a statement of fact.

Thing is, the sysadmin community used this mechanic for self-policing for literal decades. Any time someone came to the community with a problem, it had to pass a "best practices" smell test before we'd consider answering the question as asked; otherwise, we'd interrogate the bad decisions that led to this being a problem in the first place. This mechanic is 100% why ServerFault has a "reasonable business practices" close reason:

Questions should demonstrate reasonable information technology management practices. Questions that relate to unsupported hardware or software platforms or unmaintained environments may not be suitable for Server Fault.

Who sets the "best practices" for the sysadmin community? It's a group consensus of the long-time members, which differs slightly between communities. There are no RFCs. There are no ISO standards. The closest we get is ITIL, the IT Infrastructure Library, which we all love to criticize anyway.

Best practices, which is "industry standard" by an older name, have always been an "I know it when I see it" thing. A tool used by industry elders to shame juniors into changing habits. Don't talk to me until you level up to the base norms of our industry, pleb; and never mind that those norms are not canonicalized anywhere outside of my head.

This is why the phrase "it's industry standard" should not be used in internal technical design conversations.

This phrase is shame-based policing of concepts. If something is actually a standard, people should be able to look it up and see the history of why we do it this way.

Maybe the "industry" part of that statement is actually relevant; if that's the case, say so.

  • All of the base technology our market segment runs on is made by three companies, so we do what they require.
  • Our industry is startups founded from 2010 to 2015 by ex-Googlers, so our standard is what Google did then.
  • Our industry computerized in the 1960s and has consumers in high tech and high poverty areas, so we need to keep decades of backwards compatibility.
  • Our industry is VC-funded SaaS startups founded after 2018 in the United States, who haven't exited yet. So we need to stay on top of the latest innovations to ensure our funding rounds are successful.
  • Our industry is dominated by on-prem Java shops, so we have to be Java as well in order to sell into this market.

These are useful, important constraints and context for people to know. The vague phrase "industry standard" does not communicate context or constraints beyond, "your solution is bad, and you should feel bad for suggesting it." Shame is not how we maintain generative cultures.

It's time to drop "it's industry standard" from regular use.

Centralization of email

I've been managing email systems for darn near all of my career. I first started working on email in late 1997, and have kept at it ever since with only periodic interruptions. I'm no longer maintaining email for my business users, but I am making sure the email we send to our customers actually gets there. I'm not the person in charge of it, but I am recognized as the person who has worked on this stuff the longest.

Way back in the day, this blog used to be managed through Blogger, back when they allowed FTPing to remote sites. When Google launched Gmail, they invited their bloggers to join and talk about it. I was one of them. Nearly 18 years later, I'm still there, and Google has fundamentally rewritten what email means for the Internet. You can see some of that fight from my deep archives:

So, that was 13ish years ago. Nowadays I'm in the outbound email business and all that implies. The other day I took a look at the logging for our mail sends to see what mail-servers we were talking to and ran some statistics. They're a wee bit eye-opening. Here are the mail receivers that got over 1% of our mail:

  • 55% Google (includes Google Apps and Gmail)
  • 20% Microsoft
  • 4.5% Yahoo
  • 2.5% Proofpoint Hosted
  • 2% Mimecast
  • 1.5% Barracuda Networks
  • 1% Sophos
  • 13.5% Literally everyone else

Which means that two providers, Google and Microsoft, control about 75% of the mailboxes we sent to that day. The rest of the over-1% receivers are various email-protection providers, likely fronting self-hosted email systems.
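Getting those numbers was just a tally over the send logs. As a rough sketch of the sort of thing I mean (the suffix table and all the names here are invented for illustration, not our actual tooling, and a real provider map needs far more entries):

```python
from collections import Counter

# Hypothetical suffix-to-provider table; illustrative only.
SUFFIXES = {
    "google.com": "Google",
    "googlemail.com": "Google",
    "outlook.com": "Microsoft",
    "yahoodns.net": "Yahoo",
}

def mx_to_provider(mx):
    """Map a receiving MX hostname to the provider likely behind it."""
    for suffix, provider in SUFFIXES.items():
        if mx.endswith(suffix):
            return provider
    return "Everyone else"

def receiver_share(deliveries):
    """deliveries: one receiving-MX hostname per delivered message.
    Returns the percentage of mail each provider received."""
    counts = Counter(mx_to_provider(mx) for mx in deliveries)
    total = sum(counts.values())
    return {prov: round(100 * n / total, 1) for prov, n in counts.items()}
```

Feed it the MX names out of your delivery logs and you get the same kind of percentage breakdown as the list above.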

This has profound effects on how email works as a whole. What Google says goes, and if Microsoft agrees, everyone else has to deal with it or be left behind. Yahoo is the only other mail-provider to break the 1% line. If Google's spam algorithms suddenly mark you as suspicious, it can take weeks to dig out of that hole. Old standard techniques like DNS reverse blacklists are still used in part by the 25% of mailers that aren't Google or Microsoft, but getting blacklisted on those is something we can go a few days without noticing. As I wrote in 2007:

First and foremost, SPAM. The native anti-spam inside GroupWise is a simple blacklist last time I looked, which is effectively worthless in the modern era of SPAM.

Yeah, blacklists were definitely not the first line of defense even 15 years ago. They're absolutely not in the modern era. They're useful inputs to the spam/ham decision, but you get far more leverage out of building an IP reputation database of bad actors sending you stuff. And that benefits greatly from scale. Google and Microsoft will see the whole internet at some point, probably more than once a day. Hard to compete with that.
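For anyone who hasn't poked at one: a DNS reverse blacklist check is just a DNS lookup with the IP's octets reversed under the list's zone. A minimal sketch (the zone is Spamhaus's well-known one; the function names are mine):

```python
import socket

def dnsbl_query_name(ip, zone="zen.spamhaus.org"):
    """Build the lookup name a DNSBL expects: the IP's octets reversed,
    then the list's zone. 203.0.113.7 becomes 7.113.0.203.<zone>."""
    return ".".join(reversed(ip.split("."))) + "." + zone

def is_listed(ip, zone="zen.spamhaus.org"):
    """A listed IP resolves (conventionally to a 127.0.0.x answer);
    NXDOMAIN means it is not on the list."""
    try:
        socket.gethostbyname(dnsbl_query_name(ip, zone))
        return True
    except socket.gaierror:
        return False
```

That's the whole trick; the hard part was always keeping the list itself accurate, which is exactly where providers with Google-scale traffic visibility win.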

Finally, Google is killing off the open protocols that used to be standard for accessing email: POP and IMAP. They're just too prone to attack these days, and they're password-based, which we know is a weak defense. Hard to two-factor-authenticate those without forcing the user into a browser anyway.

Lost history: service pack 1

I learned recently that one of the absolute maxims of Systems Administration has somehow gotten muddled in this SaaSy world we now live in. I refer to, of course:

Never deploy the 0 release, always wait for the first service-pack.

There are people working in Operations-like roles, and release-engineering roles, who haven't lived this. You don't see it as much in a micro-waterfall environment where you're doing full-ritual releases every 2/3/4/5/6 weeks, or if you're only shipping software to your own servers. But people forget that there are still companies shipping actual software for installation on customer hardware.

In the interests of history, I give you the social factors for why the dot-zero release, or the P0 release, or the hotfix 0 release is as buggy as it is. Especially if it's a major version increment.

In this hypothetical company, major releases happen every two years. Point releases happen quarterly, and hotfix-releases happen once a month (or more often as needs require). This company sells to other companies running Red Hat Enterprise Linux or SUSE Linux Enterprise, not the latest LTS from Ubuntu, and certainly not a 'rolling release'. These target-companies expect a year's warning before dealing with a major upgrade with a changed UI and possibly changed behavior.

12 months from release of version +1

Engineering has been working on this for the last year already, in-between hotfixing the old code-branch. Feature-set is mostly finalized by Product. Sales starts running some features past customers (under NDA of course) to see how they like it.

6 months from release of version +1

Feature-set is locked in by Product, Marketing starts building campaigns.

Engineering continues to work, certain features are on track and testing well. Others, not so much.

Sales or Customer Success starts having Roadmap meetings with customers, especially customers that have either shown interest in the new features, or flatly demanded the features as a condition of not churning.

3 months from release of version +1

Engineering takes a hard look at progress, and is concerned. Most of it is there, but a few of the features demanded by potential-churn customers aren't on track because they were more complex than expected.

Marketing is already submitting advertising campaigns to print-media, and is deep in development of web advertising.

Sales/Customer Success is giving monthly updates to those churn-risks to keep them reassured that we have their needs in mind, and it will be fixed real soon. Promise.

Engineering gets an extra sprint to finish the release, so the release is pushed N weeks. This is the last reprieve.

2 months from release of version +1

Engineering is pretty sure that at least two of those must-have features won't be ready by the scheduled release date, but Marketing is already queued up, and those revenue-risk customers are quite keen to get their fixes. After high-level discussion with product, polish work on the other features in the release are put on the back-burner while more people are put into dealing with the must-have features.

1 month from release of version +1

Disaster strikes: major engineering casualty in the shipping version drew four whole days of Engineering time away from developing the new version. An out-of-schedule hotfix is released. Engineering asks for a delay, or to bump one of the must-have features to a point-release instead. Marketing, who has already built an entire campaign around that one feature, says they're committed to that feature already and we don't lie to customers. Engineering toys with the possibility of going into crunch-time.

The Managed Hosting division -- we install and maintain this so you don't have to -- who has been running the new version for certain 'beta' customers for a few months now, comes to Engineering to say the installer is an automation disaster and we need to fix it before we go live.

2 weeks from release of version +1

Work on a revised installer for the version starts.

Engineering is now very sure that at least two features are not production-ready, and says so. There is a lot of arguing, but Executive says ship it anyway. Engineering focuses work on getting the feature working specifically for the use-case of the highest value customer calling for it. Meanwhile all the other features are missing a fair amount of polish. They work, but have some sharp edges. Documentation-fixes are developed.

Release day

Boom. Ring the gong, send the triumphal email, eat the cake, this thing is out.

Engineering knows they shipped crap, gets working on those last features, aiming to make them fully functional for the .1 release in a quarter. Meanwhile, polish on the rest of the features gets scheduled in for the hotfixes.

2 weeks after release

Customer Success works with those high value customers, and hears back that the features are there, true, but they're not exactly... functional. The churn risk has been pushed a quarter. Gets on Engineering to prioritize work on those features. Which Engineering was already doing, because professional pride is a real thing.

First hotfix release, 1 month after release-day

Engineering ships a lot of polish, including a more functional installer. Support breathes a sigh of relief, because they now have a fix for a lot of annoyance-tickets.

Second hotfix release, 2 months after release-day

Most of the features are now working pretty well, but the gnarly ones still have rough edges. Shipping full fixes for those is postponed to the .1 release due to scope of change.

Hotfix 3 release for the .0 version, .1 release, 3 months after release-day

Engineering ships those hard features in a way that actually works. Support cheers. Customer Success brings this to their cranky customers as proof that we have their interests at heart.

Social-factors the whole way. If Engineering was the sole driver of the release day, all the expectation-setting done by marketing would be blown out of the water and the company would get a reputation for being unreliable. If Marketing was the sole driver of the release day the quality of the release would be absolute shit. Most companies strike a balance between these two extremes, but it does mean the .1 release is the one with the finishing touches.

In a SaaS world, this can apply to major blockbuster features. Perhaps supporting more than one team on a billing-code is really hard, so you take a year to refactor many things. Being a frequently asked-for feature, this gets pitched to high value customers (who need multi-team billing) as a coming feature to keep them from churning to another SaaS provider. Engineering gets close, but not quite. Maybe that one customer agrees to be the beta-test, so they get the feature-flag turned on. Or maybe feature-flags aren't a thing, so this feature gets pushed before all the knobs are installed in order to stop churn, and Support ends up untangling a lot of billing problems that first billing-cycle.

Keep these factors in mind for anything that is expected to take a long time to develop.

Technical Renewal Cycles

If you've been in the tech industry for longer than 7 years, chances are good you've lived through some paradigm change in how things work. I've been here since 1996, so I've seen a few. And I'm looking down the barrel of another one. So I figured I would go over the renewal cycles I've personally lived through.

Ordered by dates impacting my career.

The Rise and Fall of Novell NetWare (Predates me, until about 2007)

Office automation was Novell NetWare for years. It was the first server in most offices, until they were replaced by either Windows NT, Windows 2000, or Windows 2003 machines. It was honestly the first distributed platform I ever worked with, since Novell Directory Services was a distributed database. It doesn't qualify as 'distributed' under today's definition, but in 1996 when NetWare 4.0 came out? It was cutting edge.

It was a NetWare migration that got me into Systems Administration in the first place. I did all the sysadminny stuff like backups, user migrations, server patching and updates, and server consolidations. Servers were well-pampered pets with cutesy names.

For me, NetWare died when my dayjob decided to get rid of it. I'm not at all surprised at the choice in retrospect. The upgrade at the time was from NetWare 6.5 (Novell kernel, 32-bit) to Open Enterprise Server (Linux kernel, 64-bit), and the effort to upgrade to that versus migrating everyone to Windows was equal enough we went to Windows instead.

Fun fact for newcomers: This very blog was hosted on NetWare for the first few years of its existence! It ran Apache! And Tomcat! Have a flash-back to that time.

This was the first operating system to die underneath me.

The Rise and Fall of File-Servers (Predates me through 2011)

File-serving is what NetWare did. The Windows migrations done were all to replace that functionality, and the NetWare admins involved were pissed because it was a step back in functionality. In my opinion, Windows didn't catch up to the NetWare feature-set until Server 2008.

But by 2008, the dedicated file-server was on the decline. At first it was NetApp and other dedicated NAS systems.

Then it was things like SharePoint, that moved storage off of the direct-mapped file-share to application-specific application shares.

The university I was working for was in one of the last market-segments to still operate large, clustered file-servers. When I left the university, it was in part because my dayjob had me spending the majority of my time on a dead technology.

I'm not familiar with any new deployments of true file-servers. The orgs running them all started running them during the file-server era and haven't gotten rid of them yet.

The Rise and Fall of Actual Datacenters (Predates me to 2015)

The cloud mostly won. You need to be a certain size of company to make the physical plant of a datacenter make sense financially. Rack-n-stack sysadmins like me have moved on to cloudier things.

The Rise and Fall of Microsoft Windows (1997 through 2013)

I picked up Windows concurrent with NetWare, and grew my fluency with both at about the same rate. That made me marketable, which was nice. In the beginning (1997 through 2003) servers were still pampered pets with cutesy names. Once I moved to the university, the cutesy names mostly went away since our fleet was big enough we couldn't keep track of what Oberon did, but could figure out what AD-DC-3 did (2003-2011).

Then I moved to a startup that was using Windows as part of a SaaS product. That was an interesting time, but I learned lots doing it (2011-2013). By that time, servers were mere VMs with auto-generated names.

It stopped being relevant to my career in 2013, but I do have a soft spot for Windows.

The Rise and Fall of VMWare (2008 through 2015)

This was a big sea-change in how IT systems were managed. Once Intel's virtualization support was there, ESX changed everything in IT. At the university I was working at, the rack with the ESX servers in it ate a sizable portion of the power-budget for the entire computer-room, at a fraction of the space that power-budget took in 2003. It was amazing.

Used it again at my first startup to manage an internal cloud of machines.

The big-corp after that was all ESX again with whole datacenters and a Managed Services platform on it. As I was leaving they were in the process of a lift-and-shift to AWS so they could close down their datacenter colos.

My current job does nothing with it because we have no physical datacenters.

The Rise of Linux (2011 to current)

I've been a Linux user since 1995, but wasn't allowed to administer it professionally until I moved to my first startup. This was one of several reasons why I left the University, by the way. I'm not going to spend much text on it, because Linux is central to so many things right now.

The Rise of Amazon AWS (2011 to current)

Using someone else's computers to do things! This was like VMware, but you didn't have to manage datacenters, or hardware, or power budgets, or rack-space, or cooling density, or cable-lengths, or server warranties, or out-of-band networks, or...

You paid someone else to handle all that, so you could focus on what you needed to focus on, which was provisioning operating systems and applications. And maybe not even that, if you end up using the managed-services like Relational Database Services, or go 'serverless' with Lambdas, ECS, and other techniques.

The Rise and Fall of VM-Oriented Configuration Management Systems (2011 through 2020)

Puppet was the first big-bang application in this space (there were others before it, like cfengine), with follow-ups in Chef, Salt, and Ansible. But Puppet was where I started, and where I still am. I spend a big chunk of my day making changes to our puppet-configs in order to bring change to what's running on our stuff. Infrastructure as code!

But Docker.

Puppet/Chef and company were all about building a VM image, or maintaining a pre-defined config on a running box. They do a great job of that, and auditors understand them. This paradigm doesn't work in Docker, which uses overlays to achieve things similar to what Puppet did with modules. Puppet sure can produce a VM-sized Docker image. But that image is not stylish.

The future has Puppet relegated to managing the state-maintaining systems that don't Docker well, and can't be run by your Cloud provider for some reason (be it compliance restrictions or technical).

From NetWare on a cutely named server in a closet, to a completely unnamed Docker instance in the Cloud, my career has been through a few paradigm changes. Change is constant. Just when I feel like I really know something well, a new thing comes along and I start anew.

Technical maturity

Having worked in the mandatory growth or death part of the tech industry1 for a few years now, I've had some chances to see how organizations evolve as they grow. Not just evolve as an organization, but as a technical environment. I've talked about this before, but what's most important for a mandatory-growth company changes based on its market-stage.

Early stage startup (before product-release to the months after the product is released)

  • Finding market-fit is everything.
  • The biggest threat to the infrastructure is running out of money.
  • Get out the tech-debt charge-card because we need to get something out there right now or we'll need new jobs.
  • Feature delivery way more important than disaster resilience.

Middle stage startup (has market-fit, a cadre of loyal customers in the small/medium business space)

  • Extending market penetration is the goal.
  • Feature drive is slackening a bit, with focus shifting to attracting those bigger customers.
  • Some tech-debt is paid off, but still accumulating in areas.
  • Work on improving uptime/reliability starts coming into focus.

Up-market stage (attempting to break into the large business market)

  • Features that large businesses need are the goal.
  • Compliance pressures show up big time due to all the vendor-security-assessments slowing down the sales process.
  • First big chance of a major push to reduce early-stage tech-debt. Get those SPOFs out; institute real change-management, a vulnerability assessment program, an actual disaster-recovery plan, all the goodies.

These are very broad strokes, but there is a concept called technical maturity being shown here. Low-maturity organizations throw code at the wall, and if it sticks in an attractive way, leave it in place. High-maturity organizations have perfected the science of assessing new code for attractiveness and built code-deployers that can repeatedly hit the wall and maintain aesthetic beauty, all without having to train up professional code-throwers2.

Maturity applies to Ops processes just as much, though. Having been working on some of this internally, it's come to feel kind of like building a tech-tree for a game like Starcraft.

Observability
Level 1: Centralized logging! And you can search them!
Level 2: You've got metrics now!
Level 3: High-context events!
Level 4: Distributed-tracing!

Disaster Management
Level 1: You've got an on-call system and monitoring!
Level 2: You've got a Disaster Recovery plan!
Level 3: You've got SLAs, and not-Ops is now on-call too!
Level 4: Multi-datacenter failover!

Patching
Level 1: You have a routine patching process!
Level 2: Patching activities not related to databases no longer require downtime!
Level 3: You can patch, update, and upgrade your databases without requiring downtime!
Level 4: You can remove the planned outage carve-out in the SLA's uptime promise!

These can definitely be argued with, but this looks like it might be a useful tool for companies that have graduated beyond the "features or death" stage. It can let internal technical maturity take an equal place at the table with Product. Whether or not that will actually work depends entirely on the organization and where the push is coming from.

1: As opposed to the tech enables our business, it isn't THE business part of the industry. Which is quite a bit larger, actually.
2: This analogy may be a bit over-extended.

Proxysql query routing

The proxysql project doesn't have much documentation around their query engine, even though it is quite powerful. What they do have is a table-schema for the query-rules table, and figuring out how to turn that into something useful is left as an exercise for the reader. It doesn't help that there are two ways to define the rules depending on how you plan to use proxysql.

For the on-box usecase, where proxysql is used as a local proxy for a bunch of DB-consuming processes, the config of it is likely a part of whatever you're using for configuration-management. Be that Docker, Puppet, Chef or something else. Fire once, forget. For this usecase, a config-file is most convenient.

mysql_query_rules =
(
  {
    rule_id = 1
    active = 1
    username = "read_only_user"
    destination_hostgroup = 2
  },
  {
    rule_id = 2
    active = 1
    schemaname = "cheese_factory"
    destination_hostgroup = 1
  }
)

Two rules. One says that if the read-only user is the one logging in, send the connection to hostgroup 2 (the read-only replica). The other says that if the "cheese_factory" database is being accessed, use hostgroup 1. Seems easy. For the on-box usecase, changing rules is as easy as rolling a new box/container.

However, the other way to define these is through a SQL interface they built. This usecase is more for people operating a cluster of proxysql nodes who need to change rules and configuration on the fly with no downtime. It's this method that all of their examples are written in.

Which leaves those of us using the config-file to scratch our heads.

INSERT into mysql_query_rules (rule_id, active, username, destination_hostgroup)
VALUES (1,1,"read_only_user",2);

INSERT into mysql_query_rules (rule_id, active, schemaname, destination_hostgroup)
VALUES (2,1,"cheese_factory",1);

These two ways of describing a rule do the same thing. If you're writing a config-management thingy for an on-box proxysql, the first is probably the only way you care about. If you're building a centralized one, the second one is the only one you care about.
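One wrinkle with the SQL interface worth knowing: rules inserted into the admin database don't take effect until they're loaded into the runtime, and they won't survive a restart unless you also save them to disk:

LOAD MYSQL QUERY RULES TO RUNTIME;
SAVE MYSQL QUERY RULES TO DISK;

Forgetting the LOAD step is a classic "why isn't my rule matching" head-scratcher.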

For those of you looking to make the translation, or looking for the config-file schema, each of those column names in the table-schema can be a value in the mysql_query_rules array.

  • Different lines are ANDed together.
  • Rules are processed in the rule_id order.
  • The first match wins, so put your special cases in with low rule_id numbers, and your catch-alls with high numbers.
    • The flagIN, flagOUT, and apply columns allow you to get fancy, but that's beyond me right now.
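The matching behavior those bullets describe can be sketched in a few lines of Python. This is my mental model of the engine, not proxysql's actual code, and it ignores flagIN/flagOUT entirely:

```python
def route(rules, session):
    """Pick a destination hostgroup the way the bullets above describe:
    fields within one rule AND together, rules run in rule_id order,
    and the first matching rule wins."""
    for rule in sorted(rules, key=lambda r: r["rule_id"]):
        if not rule.get("active"):
            continue
        # Every column other than the bookkeeping ones is a match criterion.
        criteria = {k: v for k, v in rule.items()
                    if k not in ("rule_id", "active", "destination_hostgroup")}
        if all(session.get(k) == v for k, v in criteria.items()):
            return rule["destination_hostgroup"]
    return None  # no rule matched; proxysql falls back to the default hostgroup

# The two rules from the config example above:
rules = [
    {"rule_id": 1, "active": 1, "username": "read_only_user",
     "destination_hostgroup": 2},
    {"rule_id": 2, "active": 1, "schemaname": "cheese_factory",
     "destination_hostgroup": 1},
]
```

Note how a "read_only_user" session hits rule 1 and never reaches rule 2, even if it's also using the cheese_factory schema; that's the first-match-wins behavior in action.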

What my CompSci degree got me

The "what use is a CompSci degree" meme has been going around again, so I thought I'd interrogate what mine got me.

First, a few notes on my career journey:

  1. Elected not to go to grad-school. Didn't have the math for a master's or doctorate.
  2. Got a job in helpdesk, intending to get into Operations.
  3. Got promoted into sysadmin work.
  4. Did some major scripting as part of Y2K remediation, first big coding project after school.
  5. Got a new job, at WWU.
  6. Microsoft released PowerShell.
  7. Performed a few more acts of scripting. Knew I so totally wasn't a software engineer.
  8. Managed to change career tracks into Linux. Started learning Ruby as a survival mechanism.
  9. Today: I write code every day. Still don't consider myself a 'software engineer'.

Elapsed time: 20ish years.

As it happens, even though my career has been ops-focused I still got a lot out of that degree. Here are the big points.

Sysadmins and risk-management

This crossed my timeline today:

This is a risk-management statement that contains all of a sysadmin's cynical-bastard outlook on IT infrastructure.

Disappointed because all of their suggestions for making the system more resilient to failure are shot down by management. Or only some of them are, which is effectively the same as all of them, since it means some disasters remain uncovered. Commence drinking heavily to compensate.

Frantically busy because they're trying to mitigate all the failure-modes their own damned self using not enough resources, all the while dealing with continual change as the mission of the infrastructure shifts over time.

A good #sysadmin always expects the worst.

Yes, we do. Because all too often, we're the only risk-management professionals a system has. We understand the risks to the system better than anyone else. A sysadmin who plans for failure is one who isn't first on the block when a beheading is called for by the outage-enraged user-base.

However, there are a few failure-modes in this setup that many, many sysadmins fall foul of.

Perfection is the standard.

And no system is perfect.

Humans are shit at gut-level risk-assessment, part 1: If you've had friends eaten by a lion, you see lions everywhere.

This abstract threat has been made all too real, and now lions. Lions everywhere. For sysadmins it's things like multi-disk RAID failures, UPS batteries blowing up, and restoration failures because an application changed its behavior and the existing backup solution no longer was adequate to restore state.

Sysadmins become sensitized to failure. Those once-in-ten-years failures, like datacenter transfer-switch failures or Amazon region-outages, seem immediate and real. I knew a sysadmin who was paralyzed in fear over a multi-disk RAID failure in their infrastructure. They used big disks, which weren't 100% 'enterprise' grade. Recoveries from a single-disk failure were long as a result. Too long. A disk going bad during the recovery was a near certainty in their point of view, never mind that the disks in question were less than 3 years old, and the RAID system they were using had bad-block detection as a background process. That window of outage was too damned long.

Humans are shit at gut-level risk-assessment, part 2: Leeroy Jenkins sometimes gets the jackpot, so maybe you'll get that lucky...

This is why people think they can win mega-millions lotteries and beat casinos at roulette. Because sometimes, you have to take a risk for a big payoff.

To sysadmins who have had friends eaten by lions, this way of thinking is completely alien. This is the developer who suggests swapping out the quite functional MySQL databases for Postgres. Or the peer sysadmin who really wants central IT to move away from forklift SAN-based disk-arrays for a bunch of commodity hardware, FreeBSD, and ZFS.

Mm hm. No.

Leeroy Jenkins management and lion-eaten sysadmins make for really unhappy sysadmins.

When it isn't a dev or a peer sysadmin asking, but a manager...

Sysadmin team: It may be a better solution. But do you know how many lions are lurking in the transition process??

Management team: It's a better platform. Do it anyway.

Cue heavy drinking as everyone prepares to lose a friend to lions.

This is why I suggest rewording that statement:

A good #sysadmin always expects the worst.
A great #sysadmin doesn't let that rule their whole outlook.

A great sysadmin has awareness of business risk, not just IT risks. A sysadmin who has been scarred by lions and sees large felines lurking everywhere will be completely miserable in an early or mid-stage startup. In an early stage startup, the big risk on everyone's mind is running out of money and losing their jobs; so that once-in-three-years disaster we feel so acutely is not the big problem it seems. Yeah, it can happen, and it could shutter the company if it does; but the money spent remediating that problem would be better spent expanding marketshare enough that we can assume we'll still be in business 2 years from now. A failure-obsessed sysadmin will not have job satisfaction in such a workplace.

One who has awareness of business risk will wait until the funding runway is long enough that pitching redundancy improvements will actually defend the business. This is a hard skill to learn, especially for people who've been pigeon-holed as worker-units their entire career. I find that asking myself one question helps:

How likely is it that this company will still be here in 2 years? 5? 7? 10?

If the answer to that is anything less than 'definitely', then there are failures that you can accept into your infrastructure.

The origins of on-call work

On September 6th, Susan Fowler posted an article titled, "Who's on-call?", talking about evolving on-call duties between development teams and SRE teams. She has this quote at the top:

I'm not sure when in the history of software engineering separate operations organizations were built and run to take on the so-called "operational" duties associated with running software applications and systems, but they've been around for quite some time now (by my research, at least the past twenty years - and that's a long time in the software world).

My first job was with a city government, and many of the people I was working with started at that city when they decided to computerize in 1978. Most of them have retired or died off by now. In 1996, when I started there, the original dot-com boom was very much on the upswing, and that city was still doing things the way they'd been done for years.

I got into the market in time to see the tail end of that era. One of the things I learned there was the origins of many of the patterns we see today. To understand the origins of on-call in IT systems, you have to go back to the era of serial networking, when 'minicomputer' and 'microcomputer' were distinct terms, both coined by marketing to differentiate those machines from the 'mainframe'.

IT systems of the era employed people to do things we wouldn't even consider today, or would work our damnedest to automate out of existence. There were people who had, as their main job, duties such as:

  • Entering data into the computer from paper forms.
    • Really. All you did all day was punch in codes. Computer terminals were not on every desk, so specialists were hired to do it.
    • The worst part is: there are people still doing this today.
  • Kick off backups.
  • Change backup tapes when the computer told them to.
  • Load data-tapes when the computer told them to.
    • Tape stored more than spinning rust, so it was used as a primary storage medium. Disk was for temp-space.
    • I spent a summer being a Tape Librarian. My job was roboticized away.
  • Kick off the overnight print-runs.
  • Collate printer output into reports, for delivery to the mailroom.
  • Execute the overnight batch processes.
    • Your crontab was named 'Stephen,' and you saw him once a quarter at the office parties. Usually very tired-looking.
  • Monitor system usage indicators by hand, and log them in a paper logbook.
  • Keep an Operations Log of events that happened overnight, for review by the Systems Programmers in the morning.
  • Follow runbooks given to them by Systems Programming for performing updates overnight.
  • Be familiar with emergency procedures, and follow them when required.

Many of these things were only done by people working third shift. Which meant computer-rooms had a human on-staff 24/7. Sometimes many of them.

There was a side-effect to all of this, though. What if the overnight Operator had an emergency they couldn't handle? They had to call a Systems Programmer to advise a fix, or to come in and fix it themselves. In the 80's, when telephone modems came into their own, the Systems Programmer might even be able to dial in and fix it from home.

On-Call was born.

There was another side-effect to all of this: it happened before the great CompSci shift in the colleges, so most Operators were women. And many Systems Programmers were too. This was why my first job was mostly women in IT management and senior technical roles. This was awesome.

A Systems Programmer, as they were called at the time, was less of a Software Engineering role as we would define it today. They were more DevOps, if not outright SysAdmin. They had coding chops, because much of systems management at the time required that. Their goal was more about wiring together purchased software packages to work coherently, or modifying purchased software to work appropriately.

Time passed, and more and more of the overnight Operator's job was automated away. Eventually, the cost of an overnight Operator exceeded the need for one. Or you simply couldn't hire one to replace the Operator who just quit. However, the systems were still running 24/7, and you needed someone ready to respond to disasters. On-call got more intense, since you no longer had an experienced hand in the room at all times.

The Systems Programmers earned new job-titles. Software Engineering started to become a distinct skill-path and career, and was firewalled off into a department called Development. In those days, Development and Systems people spoke often; this is what old hands grumble about when they say DevOps isn't actually anything new. Systems was on-call, and sometimes Development was too, if there was a big thing rolling out.

Time passed again. Management culture changed, realizing that development people needed to be treated and managed differently than systems people. Development became known as Software Engineering, and became its own career-track. The new kids getting into the game never knew the close coordination with Systems that the old hands had, and assumed this separation was the way it had always been. Systems became known as Operations, to the chagrin of the old Systems hands, who resented being called an 'Operator', a title that was typically very junior. Operations remained on-call, and kept informal lists of developers who could be relied on to answer the phone at o-dark-thirty in case things went deeply wrong.

More time, and the separation between Operations and Software Engineering became deeply entrenched. Some bright sparks realized that there were an awful lot of synergies to be had with close coordination between Ops and SE. And thus, DevOps was (re)born in the modern context.

Operations was still on-call, but now it was open for debate about how much of Software Engineering needed to be put on the Wake At 3AM In Case Of Emergency list.

And that is how on-call evolved from the minicomputer era, to the modern era of cloud computing.

You're welcome.

InfluxDB queries, a guide

I've been playing with InfluxDB lately. One of the problems I'm facing is getting what I need out of it. Which means exploring the query language. The documentation needs some polishing in spots, so I may submit a PR to it once I get something worked up. But until then, enjoy some googlebait about how the SELECT syntax works, and what you can do with it.

Rule 1: Never, ever put a WHERE condition that involves 'value'. Value is not indexed. Doing so will cause table-scans, and for a database that can legitimately contain over a billion rows, that's bad. Don't do it.
Rule 2: No joins.
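
As an illustration of Rule 1, this is the sort of query to avoid. It's a hypothetical example against the site_hits measurement used below; filtering on the un-indexed value field forces a scan of every row in the measurement:

SELECT value FROM site_hits WHERE value > 500

Keep your WHERE clauses on time and on tags, which are indexed, and do any thresholding on values after the data comes back.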

With that out of the way, have some progressively more complex queries to explain how the heck this all works!

Return a list of values.

Dump everything in a measurement, going back as far as you have data. You almost never want to do this.

SELECT value FROM site_hits

The one exception to this rule is if you're pulling out something like an event stream, where events are encoded as tag values.

SELECT event_text, value FROM eventstream

Return a list of values from a measurement, with given tags.

One of the features of InfluxDB is that you can tag values in a measurement. These function like extra fields in a database row, but you still can't join on them. The syntax for this should not be surprising.

SELECT value FROM site_hits WHERE webapp = 'api' AND environment = 'prod'

Return a list of values from a measurement, with given tags that match a regex.

Yes, you can use regexes in your WHERE clauses.

SELECT value FROM site_hits WHERE webapp =~ /^api_[a-z]*/ AND environment = 'prod'

That's cool and all, but the real power of InfluxDB comes with the aggregation functions and grouping. This is what allows you to learn what the max value was for a given measurement over the past 30 minutes, and other useful things. These yield time-series that can be turned into nice charts.

Return a list of values, grouped by application

This is the first example of GROUP BY, and probably isn't one you'll ever need to use. This will emit multiple time-series.

SELECT value FROM site_hits WHERE webapp =~ /^api_[a-z]*/ AND environment = 'prod' GROUP BY webapp

Return a list of values, grouped by time into 10 minute buckets

When using time for a GROUP BY value, you must provide an aggregation function! This will add together all of the hits in the 10 minute bucket into a single value, returning a time-stream of 10 minute buckets of hits.

SELECT sum(value) FROM site_hits WHERE webapp =~ /^api_[a-z]*/ AND environment = 'prod' GROUP BY time(10m)

Return a list of values, grouped by both web-server and time into 10 minute buckets

This does the same thing as the previous, but will yield multiple time-series. Some graphing packages will helpfully chart multiple lines based on this single query. Handy, especially if servername changes on a daily basis as new nodes are added and removed.

SELECT sum(value) FROM site_hits WHERE webapp =~ /^api_[a-z]*/ AND environment = 'prod' GROUP BY time(10m), servername

Return a list of values, grouped by time into 10 minute buckets, for data received in the last 24 hours.

This adds a time-based condition to the WHERE clause. To keep the line shorter, we're not going to group on servername.

SELECT sum(value) FROM site_hits WHERE webapp =~ /^api_[a-z]*/ AND environment = 'prod' AND time > now() - 24h GROUP BY time(10m)

There is one more trick InfluxDB can do, and this isn't documented very well. InfluxDB can partition data in a database into retention policies. There is a default retention policy on each database, and if you don't specify a retention-policy to query from, you are querying the default. All of the above examples are querying the default retention-policy.

By using continuous queries you can populate other retention policies with data from the default policy. Perhaps your default policy keeps data for 6 weeks at 10 second granularity, but you want to keep another policy for 1 minute granularity for six months, and another policy for 10 minute granularity for two years. These queries allow you to do that.
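
Setting this up is a two-step process: create the extra retention policy, then create a continuous query to down-sample into it. The statements below are a sketch only, assuming a hypothetical database named 'mydb' and the '6month' policy used in the example below; check them against the documentation for your server's version before running them.

CREATE RETENTION POLICY "6month" ON "mydb" DURATION 26w REPLICATION 1

CREATE CONTINUOUS QUERY cq_site_hits_1m ON "mydb" BEGIN SELECT sum(value) INTO "6month".site_hits FROM site_hits GROUP BY time(1m), * END

The GROUP BY time(1m), * is what preserves the tags on the original measurement, so tag-based WHERE clauses still work against the down-sampled copy.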

Querying data from a non-default retention policy is done like this:

Return 14 weeks of hits to API-type webapps, in 1 hour buckets

SELECT sum(value) FROM "6month".site_hits WHERE webapp =~ /api_[a-z]*/ AND environment = 'prod' AND time > now() - 14w GROUP BY time(1h)

The same could be done for "18month", if that policy was on the server.