On September 6th, Susan Fowler posted an article titled "Who's on-call?", about evolving on-call duties between development teams and SRE teams. She has this quote at the top:
I'm not sure when in the history of software engineering separate operations organizations were built and run to take on the so-called "operational" duties associated with running software applications and systems, but they've been around for quite some time now (by my research, at least the past twenty years - and that's a long time in the software world).
My first job was with a city government, and many of the people I was working with had started at that city when it decided to computerize in 1978. Most of them have retired or died off by now. In 1996, when I started there, the original dot-com boom was very much on the upswing, and that city was still doing things the way they'd been done for years.
I got into the market in time to see the tail end of that era. Among the things I learned there were the origins of many of the patterns we see today. To understand the origins of on-call in IT systems, you have to go back to the era of serial networking, when the term 'minicomputer' was distinct from 'microcomputer', and both were marketing terms to differentiate those machines from the 'mainframe'.
IT systems of the era employed people to do things we wouldn't even consider today, or would work our damnedest to automate out of existence. There were people who had, as their main job, duties such as:
- Entering data into the computer from paper forms.
  - Really. All you did all day was punch in codes. Computer terminals were not on every desk, so specialists were hired to do it.
  - The worst part is: there are people still doing this today.
- Kick off backups.
- Change backup tapes when the computer told them to.
- Load data-tapes when the computer told them to.
  - Tape stored more than spinning rust, so it was used as a primary storage medium. Disk was for temp-space.
  - I spent a summer being a Tape Librarian. My job was roboticized away.
- Kick off the overnight print-runs.
- Collate printer output into reports, for delivery to the mailroom.
- Execute the overnight batch processes.
  - Your crontab was named 'Stephen,' and you saw him once a quarter at the office parties. Usually very tired-looking.
- Monitor system usage indicators by hand, and log them in a paper logbook.
- Keep an Operations Log of events that happened overnight, for review by the Systems Programmers in the morning.
- Follow runbooks given to them by Systems Programming for performing updates overnight.
- Be familiar with emergency procedures, and follow them when required.
Many of these things were only done by people working third shift. Which meant computer-rooms had a human on-staff 24/7. Sometimes many of them.
There was a side-effect to all of this, though. What if the overnight Operator had an emergency they couldn't handle? They had to call a Systems Programmer to advise a fix, or to come in and fix it. In the '80s, when telephone modems came into their own, the Systems Programmer might even be able to dial in and fix it from home.
On-Call was born.
There was another side-effect to all of this: it happened before the great CompSci shift in the colleges, so most Operators were women. And many Systems Programmers were too. This was why, at my first job, IT management and the senior technical roles were mostly women. This was awesome.
A Systems Programmer, as they were called at the time, was less of a Software Engineering role than we would define it today. They were more DevOps, if not outright SysAdmin. They had coding chops, because much of systems management at the time required that. Their goal was more about wiring together purchased software packages to work coherently, or modifying purchased software to work appropriately.
Time passed, and more and more of the overnight Operator's job was automated away. Eventually, keeping an overnight Operator on staff exceeded requirements. Or you simply couldn't hire one to replace the Operator who had just quit. However, the systems were still running 24/7, and you needed someone ready to respond to disasters. On-call got more intense, since you no longer had an experienced hand in the room at all times.
The Systems Programmers earned new job-titles. Software Engineering started to be a distinct skill-path and career, so it was firewalled off in a department called Development. In those days, Development and Systems people spoke often, which is why you'll hear old hands grumble that DevOps isn't actually anything new. Systems was on-call, and sometimes Development was too if there was a big thing rolling out.
Time passed again. Management culture changed, realizing that development people needed to be treated and managed differently than systems people. Development became known as Software Engineering, and became its own career-track. The new kids getting into the game never knew the close coordination with Systems that the old hands had, and assumed this separation was the way it had always been. Systems became known as Operations, to some chagrin of the old Systems hands, who resented being called an 'Operator', a title that was typically very junior. Operations remained on-call, and kept informal lists of developers who could be relied on to answer the phone at o-dark-thirty in case things went deeply wrong.
More time, and the separation between Operations and Software Engineering became deeply entrenched. Some bright sparks realized that there were an awful lot of synergies to be had with close coordination between Ops and SE. And thus, DevOps was (re)born in the modern context.
Operations was still on-call, but now it was open for debate about how much of Software Engineering needed to be put on the Wake At 3AM In Case Of Emergency list.
And that is how on-call evolved from the minicomputer era to the modern era of cloud computing.