Synthetic theory of monitoring: The alarms you want

You are in the weeds.

You're getting a thousand or more alarms over the course of a weekend. You can't look at them all; that would require not sleeping. And you actually do sleep. How do you cope with it?

Lots of email rules, probably. Send the select few you actually care about to the email-to-SMS gateway for your phone. File the others into the special folders that make your phone go ♫♬♫bingle♬♫♬. And mark-all-read the folder with 1821 messages in it when you get to the office on Monday (2692 on Tuesday after a holiday Monday).

Your monitoring system is OK. It does catch stuff, but the trick is noticing the alarm in all the noise.

You want to get to the point where alarms come rarely, and get acted upon when they show up. 1821 messages a weekend is not that system. Over a 60-hour weekend, 1821 messages is one message every two minutes. Or, if it's like most monitoring systems, it's a few messages an hour with a couple of bursts of hundreds over the course of a few polling cycles as something big flaps and everything behind it goes 'down'. That alarm load is only sustainable with a fully staffed round-the-clock NOC.

Very few of us have those.

Paring down the load requires asking a few questions:

What does this alarm tell me?

An alarm should tell you something meaningful. But there are different levels of meaning.

  1. Application XYZ is unreachable from the internet and is fully out.
  2. Service R in application XYZ is down, creating a partial outage.
  3. Service M in application XYZ is experiencing beyond-acceptable delays, which is creating a bad user-experience for part of the app.
  4. Load balancer Q is returning HTTP 500 errors more than 2% of the time.
  5. Node MQ19 running Application XYZ behind Load balancer Q is down.
  6. Node MK42 has high CPU for more than 10 minutes.
  7. Switch port 19 (Node NG09) on switch LJ2100-MZB is offline.

Items 1-3 have a clear meaning and impact.

Item 4 has an implicit meaning: you have to already know that a 2% error rate translates into users seeing failures, and how bad that actually is.

Item 5's meaning depends on the nature of Application XYZ. If XYZ can tolerate individual node failures without a flap, this is an FYI alarm. If the application can't, this alarm should be "Application XYZ is experiencing a partial outage due to a node outage."

Items 6 and 7 are very much "Human! Something is wrong! Come see if we care!"
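
To make item 5 concrete: the "what does a dead node mean for the application" translation can live in the monitoring layer itself, so the alarm that goes out is already impact-aware. A minimal sketch in Python, with the application inventory, the tolerance flag, and the node lists all invented for illustration:

```python
# Sketch: turn a raw "node down" event (item 5) into an impact-aware alarm.
# The application inventory and the tolerates_node_loss flag are hypothetical.

APPLICATIONS = {
    "XYZ": {"tolerates_node_loss": True,  "nodes": {"MQ19", "MQ20", "MQ21"}},
    "ABC": {"tolerates_node_loss": False, "nodes": {"AB01"}},
}

def alarm_for_node_down(node: str) -> str:
    """Return the alarm text that should actually be sent when a node dies."""
    for app_name, app in APPLICATIONS.items():
        if node in app["nodes"]:
            if app["tolerates_node_loss"]:
                # FYI-grade: capacity is reduced, users are unaffected.
                return f"FYI: node {node} is down; {app_name} is running degraded."
            # Impact-grade: this really is an application outage.
            return (f"Application {app_name} is experiencing a partial outage "
                    f"due to a node outage ({node}).")
    return f"Node {node} is down (no application mapping found)."

if __name__ == "__main__":
    print(alarm_for_node_down("MQ19"))
```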

Does this alarm tell me something general or specific?

General: High CPU on node MQ42.

Specific: Service R in application XYZ is excessively slow, creating a partial outage.

General alarms rely on human knowledge of the system to determine whether this is normal or abnormal, and to infer what broader problems the alarm could be indicating. Specific alarms tell you what is wrong and what it means. The above two examples can mean the exact same thing; the difference is that the specific alarm is clear about what the impact is. And you should probably use that one anyway.

Avoid general alarms unless there really is no other way to get that information. And there being no other way is a sign of a bad monitoring system.
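
If it helps to see the two side by side, here's a rough sketch. The thresholds and both measurement functions are stand-ins for whatever your pollers actually collect:

```python
# Sketch: a general alarm versus a specific one. The thresholds and the
# measurement functions are placeholders, not real pollers.
import random

def measure_cpu_percent(node: str) -> float:
    return random.uniform(0, 100)        # stand-in for a real CPU poller

def measure_latency_seconds(service: str) -> float:
    return random.uniform(0.1, 10.0)     # stand-in for a real latency probe

# General: the reader has to know what high CPU on MQ42 implies.
if measure_cpu_percent("MQ42") > 90:
    print("High CPU on node MQ42.")

# Specific: the alarm text carries the impact with it.
if measure_latency_seconds("R") > 5.0:
    print("Service R in application XYZ is excessively slow, creating a partial outage.")
```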

If I turn off this alarm, will I miss it?

You may be very surprised at how clean your alert folder is after asking this question in all honesty. We turn on alarms when we think we need to know something. And... it turns out we really don't need to know a lot of things.

What alarms do I want to have, but don't have yet?

Good question. About that...


So far this has been a bottom-up analysis of your alerting system: look at what you have, toss what doesn't work. This is the easiest work to do, and it can be done by the minions at the sharp end of the on-call rotation. The question of what we need to monitor and alarm on, though, is best answered top-down.

Which is the hard part, since 'top' is not in Ops but on the business side. Many sysadmins are crap at talking to business people, so this is the time to find allies who can.

What does the top care about?

It's not disk-full alerts; that's for damned sure. The 'top' probably has a Service Level Agreement somewhere. It may be written, or it may be a de facto one everyone just understands. "Downtime costs us money and should be avoided" is not a good SLA. "99.95% uptime in a given 3-month period, exclusive of maintenance windows; but total uptime shall not drop below 99%" is a much better one.

If everyone 'just knows', it's a good idea to talk to a bunch of the higher-ups and see what they think an acceptable outage is. If they say, "100%!" then you can say to them, "What I'm hearing is that you're OK with me spending a million to avoid a single second of downtime," and then figure out what they really want.
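
It also helps to walk into that conversation with the arithmetic already done. A back-of-the-envelope conversion of the 'better' SLA above into minutes per quarter, assuming a 90-day quarter:

```python
# Sketch: what the 99.95% / 99% SLA above actually allows per quarter.
# Assumes a 90-day quarter; adjust for the real calendar.
QUARTER_MINUTES = 90 * 24 * 60                       # 129,600 minutes

unplanned_budget = QUARTER_MINUTES * (1 - 0.9995)    # exclusive of maintenance windows
total_budget     = QUARTER_MINUTES * (1 - 0.99)      # the hard 99% floor

print(f"Unplanned downtime budget: {unplanned_budget:.0f} minutes (~{unplanned_budget / 60:.1f} hours)")
print(f"Total downtime budget:     {total_budget:.0f} minutes (~{total_budget / 60:.1f} hours)")
# Roughly 65 minutes of unplanned downtime, and about 21.6 hours overall
# once maintenance windows are counted.
```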

Define "uptime", please.

Is it simply, 'reachable/unreachable'?

Is there a functional test, such as, "the main page must render"?

Is there a suite of functional tests that must pass before the site counts as up, such as the front page being reachable and the member page rendering in under 5 seconds?
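
If you go the functional-test route, the check itself doesn't have to be elaborate. A minimal sketch, with the URL and the marker string both hypothetical:

```python
# Sketch: a functional "is it up" check rather than a bare reachability ping.
# The URL and marker string are hypothetical.
import time
import urllib.request

URL = "https://www.example.com/"
MARKER = "<title>"              # something that only appears on a rendered page
TIMEOUT_SECONDS = 5.0

def main_page_is_up() -> bool:
    start = time.monotonic()
    try:
        with urllib.request.urlopen(URL, timeout=TIMEOUT_SECONDS) as resp:
            ok_status = resp.status == 200
            body = resp.read().decode("utf-8", errors="replace")
    except OSError:
        return False                      # unreachable counts as down
    elapsed = time.monotonic() - start
    return ok_status and MARKER in body and elapsed < TIMEOUT_SECONDS

if __name__ == "__main__":
    print("up" if main_page_is_up() else "down")
```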

This step is taking the written SLA and turning it into Service Level Objectives. SLOs are concrete things that define whether or not you are passing your SLA. This is turning the wibbly-wobbly language of the SLA into something you can engineer against.

Some example Service Level Objectives supporting the 'better' SLA above:

  • The main page renders in under 5 seconds for customers in North America 99% of the time.
  • The main page is available at least 99% of the time in a given quarter.
  • Planned maintenance windows shall not introduce more than 0.5% of downtime in a given quarter.
  • Member profile pages render in under 7 seconds for customers in North America.

Determining whether or not these SLOs are met is probably a report, not an alarm. That said, they do give clues as to where alarms need to be set. The next step down the path is to define your Service Level Indicators, which are the things that determine whether or not the SLO is being met. Some of these are easy; some of these... less so.

  • The main page renders in under 5 seconds for customers in North America 99% of the time.
    • Main-page loading time in North America.
  • The main page is available at least 99% of the time in a given quarter.
    • Main-page availability in North America.
    • Main-page availability globally.
  • Maintenance windows shall not introduce more than 0.5% of downtime in a given quarter.
    • Maintenance window periods.
    • Downtime periods.
  • Member profile pages render in under 7 seconds for customers in North America.
    • Member profile-page loading time of a test customer from North America.
    • Actual profile-page loading times as recorded by customers.

None of these are CPU/disk/RAM/swap yet. But we're getting closer.
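
For the easy end of that list, turning an SLI into the number the SLO report needs is only a few lines. A sketch with made-up samples; in real life they come from the pollers and Pingdom:

```python
# Sketch: one SLI (main-page load time in North America) rolled up against
# the "under 5 seconds, 99% of the time" SLO. The samples are made up.
load_times_seconds = [1.2, 0.9, 2.4, 7.1, 1.1, 0.8, 3.3, 1.0]

within_target = sum(1 for t in load_times_seconds if t < 5.0)
attainment = within_target / len(load_times_seconds)

print(f"Main page rendered in under 5 seconds {attainment:.1%} of the time")
print("SLO met" if attainment >= 0.99 else "SLO missed")
```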

The next step is determining which specific monitoring points support each SLI. This is where we get technical, and the answers can change over time as you learn how parts are interconnected.

  • The main page renders in under 5 seconds for customers in North America 99% of the time.
    • Main-page loading time in North America.
      • Pingdom monitors from North American hosts.
      • Page load-time from the poller in Oregon.
      • Page load-time from the poller in Texas.
      • Page load-time from the poller in New York.
  • The main page is available at least 99% of the time in a given quarter.
    • Main-page availability in North America.
      • Pingdom monitors from North American hosts.
      • Main-page availability from the pollers in Oregon, Texas and New York.
    • Main-page availability globally.
      • Pingdom monitors globally.
      • Main-page availability from the poller in Singapore.
      • Main-page load-balancer availability as polled internally.
      • Main-page cluster status.
        • Cluster quorum status.
      • Postgres Database status for account database.
        • Postgres server disk-space status
      • DR site replication status.
        • DR site storage replication status.
  • Maintenance windows shall not introduce more than 0.5% of downtime in a given quarter.
    • Maintenance window periods.
      • Database of maintenance windows.
    • Downtime periods.
      • Outages reported by Pingdom
      • Outages reported by pollers.
  • Member profile pages render in under 7 seconds for customers in North America.
    • Member profile-page loading time of a test customer from North America.
      • Profile-page load times from the pollers in Oregon, Texas and New York.
      • Postgres database status for profile database.
        • Disk space status for Postgres servers.
      • Redis status for mood-status.
        • Redis server RAM status.
      • CDN status for profile images.
    • Actual profile-page loading times as recorded by customers.
      • Database analytics on profile-page load-times as logged by the app.

Now we're talking! There are a couple of actual disk-space alarms in there, and even a RAM one! A great list of monitorables here. Now, how do we define alarms?

Tune in tomorrow.