The general pointlessness of high-CPU alarms

When it comes to things to send alarming emails about, CPU, RAM, Swap, and Disk are the four everyone thinks of. If something seems slow, check one or all of those four to see if it really is slow. This sets up a causal chain...

It was slow, and CPU was high. Therefore, when CPU is high it is slow. QED.

We will now alarm on high CPU.

It may be true in that one case, but high CPU is not always a sign of bad. In fact, high CPU is a perfectly normal occurrence in some systems.

  1. Render farms are supposed to run that high all the time.
  2. Build servers are supposed to be running that high a lot of the time.
  3. Databases chewing on long-running queries.
  4. Big-data analytics that can run for hours.
  5. QE systems grinding on builds.
  6. Test-environment systems being ground on by QE.

Of course, not all CPU checks are created equal. Percent-CPU is one thing, Load Average is another. If Percent-CPU is 100% and your load-average matches the number of cores in the system, you're probably fine. If Percent-CPU is 100% and your load-average is 6x the number of cores in the system, you're probably not fine. If your monitoring system only grabs Percent-CPU, you won't be able to tell what kind of 100% event it is.

As a generic, apply-it-to-everything alarm, High-CPU is a really poor thing to pick. It's easy to monitor, which is why it gets selected for alarming. But, don't do that.

Cases where a High-CPU alarm won't actually tell you that something is going wrong:

  • Everything in the previous list.
  • If your app is single-threaded, the actual high-CPU event for that app on a multi-core system is going to be WELL below 100%. It may even be as low as 12.5%.
  • If it's a single web-server in a load-balanced pool of them, it won't be a BOTHER HUMANS RIGHT NOW event.
  • During routine patching. It should be snoozed on a maintenance window anyway, but sometimes it doesn't happen.
  • Initializing a big application. Some things normally chew lots of CPU when spinning up for the first time.

CPU/Load Average is something you probably should monitor, since there is value in retroactive analysis and aggregate analysis. Analyzing CPU trends can tell you it's time to buy more hardware, or turn up the max-instances value in your auto-scaling group. These are all the kinds of thing you look at in retrospective, they're not things that you want waking you up at 2:38am.

Only turn on CPU alarms if you know that is an error condition worthy of waking up a human. Turning it on for everything just in case is a great way to train yourself out of ignoring high-CPU alarms, which means you'll miss the ones you actually care about. Human factors, they're part of everything.