Synthetic theory of monitoring: After the rollout is completed

The project is done, and you have a monitoring system you like!

Now, how do you keep liking it?

Like all good things, it takes maintenance. There are a few processes you should have in place to provide the right feedback loops to keep liking your shiny new monitoring environment.

  • Ask questions about monitoring in your incident retrospective process.
  • Periodically review active alarms to be sure you still really want them.

Implementing these will provide both upward pressure to expand monitoring into the areas it needs to go, and downward pressure to get rid of needless noise.

Retrospective Process

There are some questions about monitoring you need to ask in every retrospective.

  • Did the monitoring system catch the problem?
  • Did we respond to the monitoring system or something else?
  • What changes to monitorables do we need to make to respond to this better?
  • What changes to alarms do we need to make to respond to this better?

Put them on your wiki, scribble them on the incident commander's whiteboard, whatever. Ask them. So much of sysadmining is making sure it doesn't hurt that way again, and these are the questions that will help you do that.
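If you would rather keep the answers somewhere more durable than a whiteboard, here is a minimal sketch of recording them per incident. Everything in it is an assumption for illustration: the `MonitoringRetro` class, its field names, and the incident ID are hypothetical, not part of any particular tool.

```python
from dataclasses import dataclass, field


@dataclass
class MonitoringRetro:
    """Answers to the four monitoring questions for one incident (hypothetical model)."""
    incident_id: str
    caught_by_monitoring: bool        # Did the monitoring system catch the problem?
    responded_to_monitoring: bool     # Did we respond to monitoring, or something else?
    monitorable_changes: list = field(default_factory=list)  # changes to monitorables
    alarm_changes: list = field(default_factory=list)        # changes to alarms

    def follow_ups(self):
        """Turn the answers into a flat list of follow-up items."""
        items = []
        if not self.caught_by_monitoring:
            items.append("gap: monitoring did not catch the problem")
        elif not self.responded_to_monitoring:
            items.append("gap: monitoring caught it, but we responded to something else")
        items += [f"monitorable: {change}" for change in self.monitorable_changes]
        items += [f"alarm: {change}" for change in self.alarm_changes]
        return items


# Hypothetical incident: monitoring saw it, but a customer call triggered the response.
retro = MonitoringRetro(
    incident_id="INC-0001",
    caught_by_monitoring=True,
    responded_to_monitoring=False,
    monitorable_changes=["track queue depth"],
    alarm_changes=["page on queue depth, not just CPU"],
)
for item in retro.follow_ups():
    print(item)
```

The point of the structure is only that all four questions get an answer every time. A wiki template does the same job.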

Periodic Review

You did this review during the deploy project. Do it again a year later. It should take a lot less work, and if the retrospective process is working there should be some changes to deal with. Who knows, maybe after a year's worth of operational experience you'll find a bunch of monitors you don't actually care about. Good! Less to keep track of.
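One way to get a head start on the review is to let a year of alarm history suggest which alarms to question first. The sketch below assumes you can export, per alarm, how often it fired and how often someone acted on it. The `AlarmStats` fields, the alarm names, and the 50% threshold are all hypothetical, so adjust them to whatever your system can actually report.

```python
from dataclasses import dataclass


@dataclass
class AlarmStats:
    """One alarm's history over the review period (hypothetical export format)."""
    name: str
    times_fired: int     # how often the alarm fired
    times_actioned: int  # how often someone actually did something about it


def review_candidates(alarms, min_action_ratio=0.5):
    """Return (name, reason) pairs for alarms worth a second look."""
    candidates = []
    for alarm in alarms:
        if alarm.times_fired == 0:
            # Never firing is not proof the alarm is useless, only a prompt to ask.
            candidates.append((alarm.name, "never fired"))
        elif alarm.times_actioned / alarm.times_fired < min_action_ratio:
            candidates.append((alarm.name, "mostly ignored"))
    return candidates


# Made-up numbers for illustration only.
alarms = [
    AlarmStats("disk_full", times_fired=12, times_actioned=11),
    AlarmStats("cpu_over_80", times_fired=240, times_actioned=3),
    AlarmStats("legacy_ftp_down", times_fired=0, times_actioned=0),
]
for name, reason in review_candidates(alarms):
    print(f"{name}: {reason}")
```

A list like this only decides what to talk about first. Whether you still really want an alarm is a question for the people who get woken up by it.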

And that's it! I hope this has been helpful.