A synthetic theory of monitoring

LISA 2013 was very good to me. I saw a lot of sessions about theories of monitoring, and I've spent most of 2014 revising the monitoring system at work to be less sucky and more awesome. It's mostly worked, and it's definitely going on my resume.

Credit goes primarily to two sources:

  • A Working Theory of Monitoring, by Caskey L Dickson of Google, from LISA 2013.
  • SRE University: Non-abstract large systems design for sysadmins, by John Looney and company, of Google, also at LISA 2013.

It was these sessions that inspired me to refine a slide in A Working Theory of Monitoring and put my own spin on it:

I'll be explaining this further, but this is what the components of a monitoring system look like. Well, monitoring ecosystem, since there is rarely a single system that does everything. There are plenty of companies that will sell you a product that does everything, but even they get supplemented by home-brew reporting engines and custom-built automation scripts.

  • Polling engine: The engine that gathers monitorables. There can be many of these depending on the ecosystem in question! Pollers that hit things every five minutes, or on-system agents that push updates centrally; both count.
  • Aggregation engine: This is what gathers all the monitorables and drops them into databases. It also performs summarization, such as MRTG-style time-series decay. As with the polling engine, there may be many of these in a large ecosystem.
  • Alerting engine: Or, the "bother the humans" engine. It tries to get the attention of humans in an urgent way. If the system has auto-response capabilities, they'll be triggered here.
  • Reporting engine: Or, the "advise the humans" engine. It could be on demand, or it could emit scheduled reports that humans need to look at. The on-demand system is likely different from the scheduled system, and there may be several different report engines depending on who needs to know what, when, and by what methods.
  • API: For interfacing with the monitoring system in a programmatic way. Because you need to do this.
  • User Interface: For interfacing with the monitoring system in a human-friendly way. Very important.
  • Policy engine: For telling everything else how to do its job. From monitoring intervals to time-series decay settings to report mailing intervals, policy touches all layers.
  • Humans: For setting policy, responding to alerts, and reading reports.
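To make the layers concrete, here is a minimal sketch of how data flows between the first three of them. The function names, targets, and thresholds are all made up for illustration; a real ecosystem splits these across many processes and hosts.

```python
def poll(targets):
    """Polling engine: gather raw monitorables from each target."""
    # A real poller would SNMP-walk, run an agent, or hit an endpoint.
    return [{"target": t, "metric": "load", "value": 0.42} for t in targets]

def aggregate(samples, db):
    """Aggregation engine: drop samples into the (here, in-memory) database."""
    for s in samples:
        db.setdefault(s["target"], []).append(s["value"])

def alert(db, threshold, notify):
    """Alerting engine: bother the humans when policy says to."""
    for target, values in db.items():
        if values and values[-1] > threshold:
            notify(f"{target}: load {values[-1]} over {threshold}")

# One polling cycle wired together:
db = {}
aggregate(poll(["web01", "web02"]), db)
alert(db, threshold=1.0, notify=print)
```

The policy engine would own the `threshold` and the polling interval; the API and UI would read from `db` rather than from the poller directly.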

I find this a great way to explain monitoring systems to people. The monitoring system in your environment handles each of these layers differently. Out of the box, most quality monitoring systems make an attempt to touch each of these, with varying results. That said, there are some axioms I like to follow regarding what a good system looks like.

A system must be easy to use by everyone.

If it does everything you want, but using it requires exactly the right browser and a kajillion clicks to get to it, you're not actually going to use it.

If granting access to it requires arcane incantations on the command-line no one remembers, no one is going to use it.

UI and UX are extremely important for a monitoring system. This is a hard problem because different users want to see different things from it, and you need to handle all of those use-cases. From technicians wanting to know when a cluster node crashed, to mid-level managers looking for SLA reports, it needs to be usable by all layers. An unused monitoring system is a bad monitoring system.

UX is hard. This is one of the biggest selling points for non-free monitoring systems.

A system must be extensible enough to support service level checks.

CPU/Disk/RAM/Swap is all well and good, but SLAs are usually based on other things. Like application availability, or API transaction rates. If you can't jigger your system to gather that kind of thing, you have a bad system. Fortunately, most polling engines are extensible these days. You may need to build the monitoring end-points for the extended poller, but you can do it.
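A service-level check like that can be as small as a Nagios-style plugin. This sketch classifies an API transaction rate against SLA thresholds; the thresholds are hypothetical, and where `tps` comes from (a log counter, an app endpoint) is up to your environment. The 0/1/2 exit codes are the standard plugin convention.

```python
OK, WARNING, CRITICAL = 0, 1, 2  # standard monitoring-plugin exit codes

def check_transaction_rate(tps, warn_below=100, crit_below=25):
    """Classify an API transaction rate against (hypothetical) SLA thresholds."""
    if tps < crit_below:
        return CRITICAL, f"CRITICAL - api rate {tps} tps"
    if tps < warn_below:
        return WARNING, f"WARNING - api rate {tps} tps"
    return OK, f"OK - api rate {tps} tps"

status, message = check_transaction_rate(tps=140)
print(message)  # OK - api rate 140 tps
```

Wrap it in a script that measures `tps` and calls `sys.exit(status)`, and most extensible pollers can run it as-is.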

A system must have an API.

An API greatly eases integration with other systems, like helpdesk ticketing systems or big CEO-office dashboard walls. It also permits feedback loops to be built into monitored systems; such as an alert-pauser for events on a server that are known and expected, but would otherwise trigger alarms.

Most systems have an API of some kind these days, but not all do, and some of them are hard to leverage. A system without an API is one that can't be used programmatically, and that is a bad monitoring system.
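That alert-pauser feedback loop is a good example of what an API buys you. Here's a sketch against a made-up REST API; the endpoint path and payload fields are invented for illustration, so substitute whatever your monitoring system actually exposes.

```python
import json
import urllib.request

def pause_alerts(host, minutes, reason,
                 base_url="http://monitor.example.com"):
    """Build a request for a maintenance window so expected events
    (a planned deploy, a reboot) don't page anyone. The endpoint and
    payload shape are hypothetical."""
    payload = {"host": host, "duration_min": minutes, "reason": reason}
    return urllib.request.Request(
        f"{base_url}/api/v1/alert-pauses",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# A deploy script would build and send this before restarting services:
req = pause_alerts("web01", 30, "planned deploy")
# urllib.request.urlopen(req)  # actually send it, in real use
```

The point isn't this particular call; it's that your deploy tooling can talk to your monitoring system at all.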

A system must be able to keep up.

For extremely large systems, keeping track of it all requires handling a lot of data. The aggregation engines need to keep up with the flow from the polling engines. The reporting engines need to be able to produce reports on time. The clarion cry of "monitor everything!" is a great way to get into systemic overload. If it can't keep up, it's not a good system.
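One way aggregation engines keep up is the MRTG/RRD-style time-series decay mentioned earlier: older raw samples get averaged into coarser buckets so the database stays a fixed size. A minimal sketch of that consolidation step, with made-up numbers:

```python
def consolidate(samples, bucket_size):
    """Collapse each bucket of raw samples into a single average,
    MRTG/RRD-style, trading resolution for bounded storage."""
    buckets = [samples[i:i + bucket_size]
               for i in range(0, len(samples), bucket_size)]
    return [sum(b) / len(b) for b in buckets]

# Twelve 5-minute samples decay into two half-hour averages:
raw = [1, 1, 1, 1, 1, 1, 3, 3, 3, 3, 3, 3]
print(consolidate(raw, bucket_size=6))  # [1.0, 3.0]
```

Run periodically over aging data, this is what lets a system keep years of history without drowning in five-minute samples.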

A system must emit alarms that are specifically actionable.
FYI and just-in-case alarms need to be trivially easy to segregate and redirect.

These two go hand in hand for a reason. One of the easiest ways to get your alarm-receiving people into monitoring overload is to hit them with alarms every 5-15 minutes. Fix that. This kind of alarming trains people into thinking that something is always wrong, which leads to ignoring alarms. Even the actually-important ones. An ignored monitoring system is a bad monitoring system.
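Segregating those alarms can be as simple as a routing table in the policy layer. The class names and destinations below are hypothetical; the point is that only actionable alarms ever reach a pager.

```python
# Hypothetical policy: map an alarm's declared actionability to a destination.
ROUTES = {
    "actionable": "pager",        # wake a human up
    "fyi": "ticket-queue",        # look at it during business hours
    "just-in-case": "logfile",    # keep for forensics, never page
}

def route(alarm):
    """Pick a destination for an alarm; unclassified alarms get a
    ticket, not a page, so mystery noise can't train people to ignore
    the pager."""
    return ROUTES.get(alarm.get("class"), "ticket-queue")

print(route({"class": "actionable", "msg": "db01 down"}))        # pager
print(route({"class": "fyi", "msg": "cert expires in 30 days"})) # ticket-queue
```

The defaulting choice matters: an unclassified alarm that pages someone at 3am is exactly the kind of noise that gets the whole system ignored.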

This is a hard problem to fix once you're in the weeds, and I'll be covering how to do this in the next post.