Monitoring IS a pain, but we do it anyway

Mathew Duggan wrote a blog post on June 9th titled "Monitoring is a pain," where he goes into delicious detail about where observability, monitoring, and telemetry go wrong inside organizations. I have a vested interest in this topic, and I still agree with him. Mathew captures a sentiment I didn't highlight enough in my book: a good observability platform for engineering tends to get enmeshed ever deeper in the whole company's data engineering motion, even though that observability platform isn't resourced well enough to serve that broader goal all that well. This is a subtle point, but absolutely critical for diagnosing criticism of observability platforms.

If you don't design your observability platform from the beginning for eventual integration into the overall data motion, you get a lot of hard-to-reduce misalignment of function, not to mention poorly managed availability assumptions. Did your logging platform hiccup for 20 minutes, thus robbing the business-metrics people of 20 minutes of account sign-up/shutdown/churn metrics? All data is noisy, but data engineering folk really like it when the noise is predictable and can be modeled out. Did the metrics system have an unexpected reboot that blocked the daily code push to production because Delivery Engineering couldn't check their canary-deploy metrics? Guess your metrics system is now a production-critical system instead of the debugging tool you thought it was.

Data engineering folk like their data to be SQL-shaped for a lot of reasons, but few observability and telemetry systems have a SQL interface. Mathew proposed a useful way to provide one:

When you have a log that must be stored for compliance or legal reasons, don't stick it into the same system you use to store every 200 - OK line. Write it to a database (ideally) or an object store outside of the logging pipeline. I've used DynamoDB for this and had it work pretty well by sticking it in an SQS pipeline -> Lambda -> Dynamo. Then your internal application can query this and you don't need to worry about log expiration with DynamoDB TTL.

DynamoDB is queryable in an SQL-like way (through PartiQL), so this method could work for business-metrics needs like the backing numbers for churn rate and monthly active users (MAU). Or for tracking things like password resets, email changes, and gift-card usage for your fraud/abuse department. Or for capturing all admin-portal activity for your annual external audits.
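To make that concrete, here's a minimal sketch of the Lambda half of that SQS -> Lambda -> DynamoDB hop. This is my own illustration, not Mathew's code: the table name, key names, and event fields are all hypothetical, and the 90-day retention is an arbitrary example.

```python
# Hypothetical SQS-triggered Lambda that writes compliance log events to DynamoDB.
# Table name, key names, and event fields are illustrative assumptions.
import json
import os
import time

import boto3

TABLE_NAME = os.environ.get("AUDIT_TABLE", "compliance-events")
RETENTION_SECONDS = 90 * 24 * 3600  # let DynamoDB TTL expire items after ~90 days

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table(TABLE_NAME)


def handler(event, context):
    """Consume log records from the SQS trigger and store them with a TTL."""
    for record in event.get("Records", []):
        log_event = json.loads(record["body"])
        table.put_item(
            Item={
                "account_id": log_event["account_id"],   # hypothetical partition key
                "event_time": log_event["timestamp"],    # hypothetical sort key
                "event_type": log_event.get("event_type", "unknown"),
                "detail": record["body"],
                # DynamoDB TTL attribute: epoch seconds after which the item expires,
                # so you don't have to manage log expiration yourself.
                "expires_at": int(time.time()) + RETENTION_SECONDS,
            }
        )
```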

Mathew also unintentionally called me out.

Chances are this isn't someone's full-time job in your org, they just happened to pick up logging. It's not supposed to be a full-time gig so I totally get it. They installed a few Helm charts, put it behind an OAuth proxy and basically hoped for the best. Instead they get a constant flood of complaints from consumers of the logging system. "Logs are missing, the search doesn't work, my parser doesn't return what I expect".

That's how I got into this. "Go upgrade the Elasticsearch cluster for our logging system" is what started me on the road that led to the Software Telemetry book. It worked for me because I was at a smaller company. Also, our data people stuck their straw directly into the main database rather than our logging system, which absolutely was a dodged bullet.

Mathew also went into some detail about scaling up a metrics system, a topic I spent a chapter and part of an appendix on. Mathew gives you a solid, concrete example of the problem from the point of view of microservices/Kubernetes/CNCF architectures. This stuff is frikkin hard, and people want to add things like account tags and commit-hash tags so they can do tight correlation work inside the metrics system, which creates the sort of cardinality problem that many metrics systems aren't designed to support.
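To put rough numbers on that cardinality problem, here's a small illustration of my own (not from Mathew's post), using the prometheus_client library. Every unique combination of label values becomes its own time series; the metric and label names are made up for the example.

```python
# Illustration of label cardinality with a Prometheus-style counter.
from prometheus_client import Counter

REQUESTS = Counter(
    "http_requests_total",
    "HTTP requests served",
    ["status", "account_id", "commit_hash"],  # the last two are the dangerous ones
)

# Each distinct (status, account_id, commit_hash) tuple is a separate series:
# 5 statuses x 50,000 accounts x 200 deployed commits = 50 million potential series,
# far more than most self-hosted metrics systems are designed to hold.
REQUESTS.labels(status="200", account_id="acct-8675309", commit_hash="9f3c2ab").inc()
```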

All in all, Mathew lays out several common deficiencies in how monitoring/observability/telemetry is approached in many companies, especially growing companies:

  • All too often, the systems that are in place exist because someone wanted them and built them. Which means they were built casually, and probably aren't fit for purpose once other folks realize how useful they are, usage scope creeps, and they become mission critical.
  • These systems aren't resourced (in time, money, and people) at a level commensurate with their importance to the organization, which suggests there are some serious misalignments somewhere in the org.
  • "Just pay a SaaS provider" works to a point, but eventually the bill becomes a major point of contention forcing compromise.
  • Getting too good at centralized logging means committing to a resource-intensive telemetry style, a habit that's hard to break as you get bigger.
  • No one gets good at tracing if they're using Jaeger. Those that do are the exception. Just SaaS it, sample the heck out of it, and prepare for annual complaining when the bill comes due.

The last point about Jaeger is a key one, though. Organizations green-fielding a new product often go with tracing as their only telemetry style, since it gives such high quality data. At the same time, tracing is the single most expensive telemetry style. Unlike centralized logging, which has high quality self-hosted systems in the form of ELK and Loki, tracing these days only has Jaeger. There is a whole SaaS industry selling products based on how much better they are than Jaeger.
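For reference, "sample the heck out of it" can be as simple as setting a head-sampling ratio in the SDK. Here's a sketch using the OpenTelemetry Python SDK; the 5% ratio and console exporter are arbitrary example choices, not recommendations.

```python
# Head-based sampling sketch with the OpenTelemetry Python SDK.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

# Keep ~5% of new traces; child spans follow their parent's sampling decision.
sampler = ParentBased(root=TraceIdRatioBased(0.05))
provider = TracerProvider(sampler=sampler)
# Swap ConsoleSpanExporter for your SaaS vendor's OTLP exporter in practice.
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("example")
with tracer.start_as_current_span("checkout"):
    pass  # spans in unsampled traces are dropped before export
```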

There is a lot in the article I didn't cover, so go read it.