Synthetic theory of monitoring: Monitoring all the things

There is more to a monitoring system than alarms and reports. Behind all of those cries for action is a lot of data. A lot of data. So much data that, as you grow, the monitoring data your systems generate becomes a scaling problem of its own.

Monitor everything!
-- Boss

A great idea in principle, but it falls apart in one key way...

"Define 'everything', please"

'Everything' means different things to different people. It just isn't feasible to track every monitorable on everything with one-second granularity. Just about everyone will want to back away from that level of monitor-all-the-things. But what is the right fit?

It depends on what you want to do with it. Data being tracked supports four broad categories of monitoring.

  1. Performance
  2. Operational
  3. Capacity
  4. SLA

Performance Monitoring

This kind of monitoring tends to have short intervals between polls. It could be five minutes, but may be as little as one second. Polling that often creates a deluge of data, so it may only be done when diagnosing exposed problems or doing in-depth research on the system. It's not run all the time, unless you really do care about per-second changes in the state of something.

This kind of monitoring is defined by a few attributes:

  • High granularity. You poll a lot.
  • Low urgency. You're doing this because you're looking into something, not because it's down.
  • Occasional need. You don't run it all the time, and not on a schedule.

Everything: One-second granularity for CPU, IOPS, and page faults for a given cluster.
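
To get a feel for why this version of 'everything' is a deluge, here is a rough back-of-the-envelope sketch. The cluster size (20 hosts) and the helper name samples_per_day are made up for illustration; only the three metrics and the one-second and five-minute intervals come from the text above.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def samples_per_day(hosts: int, metrics_per_host: int, interval_seconds: int) -> int:
    """Data points collected per day at a fixed polling interval."""
    polls_per_day = SECONDS_PER_DAY // interval_seconds
    return hosts * metrics_per_host * polls_per_day


# Hypothetical cluster: 20 hosts, 3 metrics each (CPU, IOPS, page faults).
one_second = samples_per_day(hosts=20, metrics_per_host=3, interval_seconds=1)
five_minute = samples_per_day(hosts=20, metrics_per_host=3, interval_seconds=300)

print(one_second)   # 5184000
print(five_minute)  # 17280
```

Same cluster, same metrics, and the one-second version produces 300 times the data of the five-minute version. That's for three metrics on one small cluster.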


Operational Monitoring

The kind we're all familiar with. This is the kind of monitoring that tends to emit alarms for on-call rotations.

  • Medium granularity. Every 5 minutes, that kind of thing.
  • High urgency. Fast responses are needed.
  • Constant need. You run it all the time.

Everything: Every disk-event, everywhere.


Capacity Monitoring

Some of the alarms you have defined already may be capacity alarms, but capacity monitoring just tracks how much you are using of what you have. Some of this stuff doesn't change very fast.

  • Low granularity. It may only get checked once a day.
  • Low urgency. Responding in a couple of days may be fast enough, and slower may be fine too.
  • Periodic need. Reviewed once in a while, probably on a schedule.

Everything: Anything that has a "Max" size value greater than the "Current" value.


SLA Monitoring

I've already gone on at length about SLAs, but this is the monitoring that directly supports the SLA pass/fail metrics. I break it apart from the other types because of how it's accessed.

  • Low granularity. Some metrics may be medium, but in general SLA trackers are over-time style.
  • Medium urgency. If a failing grade is determined, a response needs to happen. How fast depends on what's not going to be met.
  • Continual and Periodic need. Some things will be monitored continually, others will only be checked on long schedules; possibly once a week, if not once a month.

Everything: Everything it takes to build those reports.
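
If it helps to see the four categories side by side, they could be written down as a small lookup table. This is only a sketch: the class name MonitoringProfile, the field names, and the helper are invented for illustration, and the one-hour SLA interval is an assumption, since the text only says SLA trackers are low granularity. The other intervals are the examples given in each section.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MonitoringProfile:
    name: str
    granularity: str
    urgency: str
    need: str
    example_interval_seconds: int  # one plausible polling interval, not a rule


PROFILES = {
    "performance": MonitoringProfile("Performance", "high", "low", "occasional", 1),
    "operational": MonitoringProfile("Operational", "medium", "high", "constant", 300),
    "capacity": MonitoringProfile("Capacity", "low", "low", "periodic", 86400),
    # Assumed interval: the text gives no specific number for SLA polling.
    "sla": MonitoringProfile("SLA", "low", "medium", "continual and periodic", 3600),
}


def categories_with_urgency(level: str) -> list[str]:
    """Names of the categories whose urgency matches the given level."""
    return [p.name for p in PROFILES.values() if p.urgency == level]


print(categories_with_urgency("high"))  # ['Operational']
print(categories_with_urgency("low"))   # ['Performance', 'Capacity']
```

Only one of the four categories is high urgency, which is a decent reminder that most monitoring data is not there to wake anyone up.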


Be aware that 'everything' is context-sensitive when you're talking with people, and don't freak out when a grand high executive says "everything" at you. They're probably thinking about the SLA Monitoring version of everything, which is entirely manageable.

Don't panic, and keep improving your monitoring system.