Synthetic theory of monitoring: Alarms when you want them

In the last article we created a list of monitorables and things that look like the kind of alarms we want to see.

Now what?

First off, go back to the list of alarms you already have. Go through it and see which of your existing alarms directly support the list you just created. It may be depressing how few of them do, but rejoice! Fewer alarms mean fewer emails!

What does 'directly support' mean?

Let's look at one monitorable and see what kind of alarms might directly or indirectly support it.

Main-page cluster status.

There are a number of alarms that could already be defined for this one.

  • Main-page availability as polled directly on the load-balancer.
  • Pingability of each cluster member.
  • Main-page reachability on each cluster member.
  • CPU/Disk/RAM/Swap on each cluster member.
  • Switch-port status for the load-balancer and each cluster-member.
  • Webserver process existence on each cluster member.
  • Webserver process CPU/RAM usage on each cluster member.

And more, I'm sure. That's a lot of data, and we don't need to define alarms for all of it. The question to ask is, "How do I determine the status of the cluster?"

The answer could be, "All healthy nodes behind the load-balancer return the main-page, with at least three nodes behind the load-balancer for fault tolerance." This leads to a few alarms we'd like to see:

  • Cluster has dropped below minimum quorum.
  • Node ${X} is behind the load-balancer but serving up errors.
  • The load-balancer is not serving any pages.

We can certainly track all of those other things, but we don't need alarms on them; they'll come in handy once the below-quorum alarm is responded to. This list is what I'd call directly supporting. The rest are indirect indicators, and we don't need PagerDuty to tell us about them; we'll find them ourselves once we start troubleshooting the actual problem.
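
To make that concrete, here's a minimal sketch of what those three alarms could look like as a check script. The node list, URLs, and send_alarm() hook are placeholders I've made up; a real setup would pull membership from the load-balancer and hand alarms to whatever notification engine you actually use.

```python
# Sketch of the three "directly supporting" alarms for main-page cluster status.
# Assumes a hypothetical send_alarm() notifier and a static list of members.
import requests

LB_URL = "https://www.example.com/"           # main page via the load-balancer
NODES = ["web01", "web02", "web03", "web04"]  # cluster members behind the LB
MIN_QUORUM = 3                                # fault-tolerance floor

def send_alarm(name, detail):
    """Placeholder: hand off to PagerDuty/email/whatever you really use."""
    print(f"ALARM {name}: {detail}")

def check_cluster():
    healthy = 0
    for node in NODES:
        try:
            r = requests.get(f"https://{node}/", timeout=5)
            if r.status_code == 200:
                healthy += 1
            else:
                send_alarm("node-serving-errors", f"{node} returned {r.status_code}")
        except requests.RequestException as exc:
            send_alarm("node-serving-errors", f"{node} unreachable: {exc}")

    if healthy < MIN_QUORUM:
        send_alarm("below-quorum", f"only {healthy} healthy nodes, need {MIN_QUORUM}")

    try:
        if requests.get(LB_URL, timeout=5).status_code != 200:
            send_alarm("lb-not-serving", "load-balancer is not returning the main page")
    except requests.RequestException as exc:
        send_alarm("lb-not-serving", f"load-balancer unreachable: {exc}")
```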


Now that we have a list of existing alarms we want to keep and a list of alarms we'd like to have, the next step is determining when we want to be alarmed.

The fact of the matter is that different teams want to know different things at different times. Let's look at that big list of monitorables again.

  • The main page renders in under 5 seconds for customers in North America 99% of the time.
    • Main-page loading time in North America.
      • Pingdom monitors from North American hosts.
      • Page load-time from the poller in Oregon.
      • Page load-time from the poller in Texas.
      • Page load-time from the poller in New York.
  • The main page is available at least 99% of the time in a given quarter.
    • Main-page availability in North America.
      • Pingdom monitors from North American hosts.
      • Main-page availability from the pollers in Oregon, Texas and New York.
    • Main-page availability globally.
      • Pingdom monitors globally.
      • Main-page availability from the poller in Singapore.
      • Main-page load-balancer availability as polled internally.
      • Main-page cluster status.
        • Cluster quorum status.
      • Postgres database status for account database.
        • Postgres server disk-space status.
      • DR site replication status.
        • DR site storage replication status.
  • Maintenance windows shall not introduce more than 0.5% of downtime in a given quarter.
    • Maintenance window periods.
      • Database of maintenance windows.
    • Downtime periods.
      • Outages reported by Pingdom.
      • Outages reported by pollers.
  • Member profile pages render in under 7 seconds for customers in North America.
    • Member profile-page loading time of a test customer from North America.
      • Profile-page load times from the pollers in Oregon, Texas and New York.
      • Postgres database status for profile database.
        • Disk space status for Postgres servers.
      • Redis status for mood-status.
        • Redis server RAM status.
      • CDN status for profile images.
    • Actual profile-page loading times as recorded by customers.
      • Database analytics on profile-page load-times as logged by the app.

Some of these SLIs are time-series based: they only become alarming if the trend is bad, like "99% uptime in the last 3 months." You can certainly alarm on DOWN, but the big alarm is, "we're not going to hit our SLA this quarter unless things change." THAT is something we deeply care about. Each possible alarm has three characteristics:

  • How immediate of a response is required.
  • Who cares about it.
  • How big an impact there is if the alarm or its response is delayed by minutes/hours/days.

CLUSTER DOWN is a rather immediate thing and should be alarmed very quickly, and with rapid response by the ops team. PagerDuty that sucker and make sure there is an escalation policy on it.

MIGHT NOT HIT OUR SLA is something that can wait a few hours or days, and technical management can schedule meetings with people to ensure the right analysis is being done to figure out why. High priority email is probably sufficient, and escalation is probably not needed.
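
As a sketch of what that alarm could compute, here's a straight-line projection of quarterly availability. The 91-day quarter, the numbers, and the downtime input are assumptions; real logic would pull downtime from the outage records above.

```python
# Minimal sketch of the "might not hit our SLA" alarm: project this quarter's
# availability from downtime accumulated so far, assuming the burn continues.
def sla_projection(downtime_minutes_so_far, days_elapsed,
                   days_in_quarter=91, sla_target=0.99):
    quarter_minutes = days_in_quarter * 24 * 60
    error_budget = quarter_minutes * (1 - sla_target)   # ~1310 min for 99%/quarter
    projected_downtime = downtime_minutes_so_far * (days_in_quarter / days_elapsed)
    return {
        "budget_minutes": error_budget,
        "spent_minutes": downtime_minutes_so_far,
        "projected_minutes": projected_downtime,
        "on_track": projected_downtime <= error_budget,
    }

# e.g. 600 minutes down after 30 days projects to ~1820 minutes for the quarter,
# blowing the ~1310-minute budget -> send the high-priority email.
status = sla_projection(downtime_minutes_so_far=600, days_elapsed=30)
if not status["on_track"]:
    print("Projected to miss the 99% quarterly SLA:", status)
```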

  • High immediacy (minutes count), rapid response required: PagerDuty or some other notification engine that can do multi-modal messaging, paired with an on-call schedule and escalations.
  • Medium immediacy (hours count), response required: PagerDuty and the like, but without an escalation policy; or hold alarm delivery until business hours.
  • Low immediacy (days or longer count): High-priority email.
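
In code-as-config form, that routing policy might look something like the sketch below. The channel names are stand-ins for whatever your notification engine actually calls them.

```python
# Routing an alarm by immediacy tier; purely illustrative policy names.
ROUTES = {
    "high":   {"channel": "pagerduty", "escalation": True,  "hold_for_business_hours": False},
    "medium": {"channel": "pagerduty", "escalation": False, "hold_for_business_hours": True},
    "low":    {"channel": "email-high-priority", "escalation": False, "hold_for_business_hours": True},
}

def route(alarm_name, immediacy):
    policy = ROUTES[immediacy]
    print(f"{alarm_name}: deliver via {policy['channel']}"
          f"{' with escalation' if policy['escalation'] else ''}")

route("cluster-below-quorum", "high")   # minutes count
route("sla-burn-projection", "low")     # days count
```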


One person, or a distribution-list?

If you have an on-call schedule, alarms should go to the one person on duty. Period. FYI emails may be sent if desired, but don't rely on those for critical notification.

If you don't have a schedule, perhaps because it's during business hours or you don't have dedicated Alarm Responders on staff, open a ticket. If you can't open a ticket directly, send an email to a distribution list.

Do not send highly critical respond-now alarms to a large group of people in the hope that at least one of them will respond. In an everyone-is-a-fire-fighter setup you'll get either multiple people stepping on each other's toes while figuring out what's going wrong, or no one responding because everyone thinks someone else is going to handle it.

Repeating alarms, or not.

I've found there can be strong opinions about how often alarm emails/texts need to be repeated.

One camp holds that alarms should be repeated every N (often 5) minutes until the problem is dealt with. The theory here is that a phone going buzz every five minutes gets noticed a lot more reliably than a phone going buzz once. In my opinion this approach comes from environments where email is the only way to deliver alarms, and email is a crappy way to deliver highly critical respond-now alarms. But you deal with what you have.

The risk with repeating alarms is noise. If you have 9 things going wrong at once, each emailing about it every 5 minutes, you end up with a huge stack of email. And when something breathtakingly critical drops dead in the middle of that noise, its alarm is going to get missed.
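
If you do go the repeating route, an acknowledgment check plus a cap on the repeats keeps one forgotten alarm from burying everything else. This is a sketch with placeholder hooks, not any particular tool's behavior.

```python
# "Repeat every N minutes until handled" with a cap; the hooks below are
# stand-ins for a real alerting/acknowledgment system.
import time

ACKED = set()
def is_acknowledged(alarm_id): return alarm_id in ACKED
def send_alarm(alarm_id, note): print(f"ALARM {alarm_id}: {note}")
def escalate(alarm_id): print(f"ESCALATE {alarm_id}: still unacknowledged")

def nag_until_acknowledged(alarm_id, interval_minutes=5, max_repeats=6):
    for attempt in range(max_repeats):
        if is_acknowledged(alarm_id):
            return                              # someone has taken ownership
        send_alarm(alarm_id, f"repeat {attempt + 1} of {max_repeats}")
        time.sleep(interval_minutes * 60)
    # After the cap, stop adding to the email pile and escalate instead.
    escalate(alarm_id)
```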

A multi-modal alert delivery system is a key part of any true on-call rotation. The best ones allow watch-standers to configure the alert-intrusiveness escalation progression and tailor it to how best to wake them up from a dead sleep. Mine involves six steps, the last of which has three phones ringing at the same time.
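
Expressed as plain data, a progression like that might look as follows. The steps and timings here are made up for illustration; they're not my actual settings or any vendor's schema, just the kind of thing a decent tool lets each watch-stander configure for themselves.

```python
# A hypothetical six-step personal notification progression, ending with
# three phones ringing at once.
MY_PROGRESSION = [
    {"after_minutes": 0,  "actions": ["push notification"]},
    {"after_minutes": 2,  "actions": ["SMS"]},
    {"after_minutes": 5,  "actions": ["email"]},
    {"after_minutes": 8,  "actions": ["phone call: mobile"]},
    {"after_minutes": 12, "actions": ["phone call: mobile", "SMS"]},
    {"after_minutes": 15, "actions": ["phone call: mobile",
                                      "phone call: home",
                                      "phone call: partner's mobile"]},
]
```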