June 2023 Archives

Mathew Duggan wrote a blog post on June 9th titled "Monitoring is a pain," where he goes into delicious detail about where observability, monitoring, and telemetry go wrong inside organizations. I have a vested interest in this topic, and I still agree with him. Mathew captures a sentiment I didn't highlight enough in my book: a good observability platform for engineering tends to get enmeshed ever deeper into the whole company's data engineering motion, even though that observability platform isn't resourced well enough to serve that broader goal. This is a subtle point, but absolutely critical for diagnosing criticism of observability platforms.

By not designing your observability platform from the beginning for eventual integration into the overall data motion, you get a lot of hard-to-reduce misalignment of function, not to mention poorly managed availability assumptions. Did your logging platform hiccup for 20 minutes, thus robbing the business-metrics people of 20 minutes of account sign-up/shutdown/churn metrics? All data is noisy, but data engineering folk really like it when the noise is predictable and can thus be modeled out. Did the metrics system have an unexpected reboot, which blocked the daily code push to production because Delivery Engineering couldn't check their canary deploy metrics? Guess your metrics system is now a production-critical system instead of the debugging tool you thought it was.
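
To make that canary example concrete, here is a minimal sketch of the kind of deploy gate that quietly turns a metrics system into a production-critical dependency. The Prometheus address, query, and error-rate threshold are illustrative assumptions, not anything from Mathew's post:

```python
# Sketch: a deploy gate that checks canary error rates before promoting.
# The Prometheus address, query, and threshold are illustrative assumptions.
import json
import sys
import urllib.parse
import urllib.request

PROMETHEUS = "http://prometheus.internal:9090"   # assumed internal address
QUERY = 'sum(rate(http_requests_total{deploy="canary",status=~"5.."}[5m]))'
MAX_ERROR_RATE = 0.5   # errors per second before we refuse to promote

def canary_error_rate() -> float:
    """Ask Prometheus for the canary's current 5xx rate."""
    url = f"{PROMETHEUS}/api/v1/query?" + urllib.parse.urlencode({"query": QUERY})
    with urllib.request.urlopen(url, timeout=10) as response:
        result = json.load(response)["data"]["result"]
    return float(result[0]["value"][1]) if result else 0.0

try:
    rate = canary_error_rate()
except Exception:
    # The metrics system is unreachable, so the gate cannot decide.
    # Failing closed is the safe choice, and it is also the moment your
    # "debugging tool" blocks the daily production push.
    sys.exit("canary check unavailable: refusing to promote")

sys.exit(1 if rate > MAX_ERROR_RATE else 0)
```

The interesting part is the except branch: once a deploy pipeline fails closed on a metrics outage, the metrics system carries production-grade availability expectations whether anyone planned for that or not.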

Data engineering folk like their data to be SQL-shaped for a lot of reasons, but few observability and telemetry systems have an SQL interface. Mathew proposed a useful way to provide that:

When you have a log that must be stored for compliance or legal reasons, don't stick it into the same system you use to store every 200 - OK line. Write it to a database (ideally) or an object store outside of the logging pipeline. I've used DynamoDB for this and had it work pretty well by sticking it in an SQS pipeline -> Lambda -> Dynamo. Then your internal application can query this and you don't need to worry about log expiration with DynamoDB TTL.

DynamoDB can be queried in a SQL-like way (through PartiQL), so this method could work for business metrics like the backing numbers for churn rate and monthly active users (MAU). Or for tracking things like password resets, email changes, and gift-card usage for your fraud/abuse department. Or for capturing all admin-portal activity for your annual external audits.
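
Here is a minimal sketch of the SQS -> Lambda -> DynamoDB pattern Mathew describes, assuming a table named compliance-logs with an account_id partition key, an event_time sort key, and a TTL attribute called expires_at (all of those names, and the seven-year retention, are hypothetical):

```python
# Sketch of a Lambda handler for the SQS -> Lambda -> DynamoDB pattern.
# Table name, key schema, and retention period are illustrative assumptions.
import json
import time

import boto3

TABLE_NAME = "compliance-logs"         # hypothetical table
RETENTION_SECONDS = 7 * 365 * 86400    # example: keep roughly seven years

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table(TABLE_NAME)

def handler(event, context):
    """Write each compliance-relevant log record to DynamoDB with a TTL."""
    now = int(time.time())
    for record in event["Records"]:            # SQS delivers records in batches
        log_line = json.loads(record["body"])
        table.put_item(Item={
            "account_id": log_line["account_id"],       # partition key (assumed)
            "event_time": str(log_line["timestamp"]),   # sort key (assumed)
            "event_type": log_line["event_type"],
            "detail": record["body"],                   # raw log line, stored as-is
            # DynamoDB's TTL feature deletes the item after this epoch time,
            # which is the "don't worry about log expiration" part.
            "expires_at": now + RETENTION_SECONDS,
        })
    return {"written": len(event["Records"])}
```

The internal application (fraud tooling, audit exports, MAU reports) then queries this table by account or time range instead of digging through the general logging cluster.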

Mathew also unintentionally called me out.

Chances are this isn't someone's full-time job in your org, they just happened to pick up logging. It's not supposed to be a full-time gig so I totally get it. They installed a few Helm charts, put it behind an OAuth proxy and basically hoped for the best. Instead they get a constant flood of complaints from consumers of the logging system. "Logs are missing, the search doesn't work, my parser doesn't return what I expect".

That's how I got into this. "Go upgrade the Elasticsearch cluster for our logging system" is what started me on the road that led to the Software Telemetry book. It worked for me because I was at a smaller company. Also, our data people stuck their straw directly into the main database rather than our logging system, which was absolutely a dodged bullet.

Mathew also went into some detail around scaling up a metrics system. I spent a chapter and part of an appendix on this very topic. Mathew gives you a solid, concrete example of the problem from the point of view of microservices/Kubernetes/CNCF architectures. This stuff is frikkin hard, and people want to add things like account tags and commit-hash tags so they can do tight correlation work inside the metrics system, which is exactly the sort of cardinality problem many metrics systems aren't designed to support.
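
To make the cardinality point concrete, here is a minimal sketch using the Python prometheus_client library. The label names and the account/deploy counts are illustrative, but the arithmetic is the real problem: every unique combination of label values becomes its own time series.

```python
# Sketch of how per-account and per-commit labels multiply series counts.
# Label names and the example counts are illustrative assumptions.
from prometheus_client import Counter

# A plain request counter: one series per (method, status) pair.
requests_total = Counter(
    "app_requests_total",
    "HTTP requests handled",
    ["method", "status"],
)

# The "tight correlation" version people ask for.
requests_by_account = Counter(
    "app_requests_by_account_total",
    "HTTP requests handled, tagged for correlation work",
    ["method", "status", "account_id", "commit_sha"],
)

# Back-of-the-envelope series counts:
methods, statuses = 5, 10
accounts, deploys = 50_000, 30   # active accounts, commits shipped per month

plain_series = methods * statuses                        # 50 series
tagged_series = methods * statuses * accounts * deploys  # 75,000,000 series
print(plain_series, tagged_series)
```

Fifty series is nothing; seventy-five million is the kind of number that flattens a self-hosted metrics system or sends a SaaS bill into orbit, which is why that correlation work usually wants to live in traces or logs instead.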

All in all, Mathew lays out several common deficiencies in how monitoring/observability/telemetry is approached in many companies, especially growing companies:

  • All too often, the systems that are in place exist because someone wanted them and built them. That means they were built casually, and are probably not fit for purpose once other folks realize how useful they are, usage scope creeps, and they become mission-critical.
  • These systems aren't resourced (in time, money, and people) commensurate with their importance to the organization, suggesting there are some serious misalignments in the org somewhere.
  • "Just pay a SaaS provider" works to a point, but eventually the bill becomes a major point of contention, forcing compromise.
  • Getting too good at centralized logging means committing to a resource-intensive telemetry style, a habit that's hard to break as you get bigger.
  • No one gets good at tracing if they're using Jaeger. Those that do are the exception. Just SaaS it, sample the heck out of it, and prepare for annual complaining when the bill comes due.

The last point about Jaeger is a key one, though. Organizations green-fielding a new product often go with tracing as their only telemetry style, since it gives so much high-quality data. At the same time, tracing is the single most expensive telemetry style. Unlike centralized logging, which has high-quality self-hosted systems in the form of ELK and Loki, tracing these days only has Jaeger. There is a whole SaaS industry that sells its products based on how much better they are than Jaeger.
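
For the "sample the heck out of it" advice, a minimal sketch of head-based sampling with the OpenTelemetry Python SDK looks something like this; the 1% ratio and the service name are illustrative assumptions, and the tracing SaaS vendors typically layer smarter tail-based sampling on top of this:

```python
# Sketch of head-based trace sampling with the OpenTelemetry Python SDK.
# The 1% ratio and the service name are illustrative assumptions.
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

# Keep roughly 1 in 100 traces; child spans follow their parent's decision
# so the traces that are kept stay complete.
sampler = ParentBased(root=TraceIdRatioBased(0.01))

provider = TracerProvider(
    sampler=sampler,
    resource=Resource.create({"service.name": "checkout"}),  # assumed name
)
# ConsoleSpanExporter keeps the sketch self-contained; swap in an OTLP
# exporter pointed at your vendor or collector in real use.
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer(__name__)
with tracer.start_as_current_span("charge-card"):
    pass  # real work goes here; only ~1% of these traces get exported
```

Head sampling like this is what keeps the bill sane, and losing the rare-but-interesting traces is exactly the gap those SaaS vendors sell against.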

There is a lot in the article I didn't cover; go read it.

The trajectory of "AI" features

I've been working for Bay Area-style SaaS providers for a bit over a decade now, which means I've seen some product cycles come and go. Something struck me today that I want to share, and it relates to watching product decisions get made and adjusted to market pressures.

All the AI-branded features being pushed out by everyone that isn't OpenAI, Google, or Microsoft depend on one of the models from those three companies. That, or they're rebranding existing machine-learning features as "AI" to catch the marketing moment. Even so, all the features getting pushed out come from the same base capabilities of ChatGPT-style large language models:

  • Give it a prompt, it generates text.

Okay, that's the only capability. But this one capability is driving things like:

  • Rewriting your thing to be grammatical for you!
  • Rewriting your thing to be more concise!
  • Suggesting paragraphs for your monthly newsletter!
  • Answering general knowledge questions!

That's about it. We've had about half a year to react to ChatGPT and start generating products on that foundation, and the above is what we're seeing. A lot of these are one to three engineering teams' worth of effort over the course of a few months, with a month or three of internal and external testing. These are the basic features of LLM-based machine learning, also known right now as AI.
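
Every item on that list reduces to the same call with a different prompt template. A minimal sketch, assuming a hypothetical generate() wrapper around whichever vendor model is in use:

```python
# Sketch: every "AI feature" above is a prompt template over one call.
# generate() is a hypothetical wrapper around a vendor text-generation API.
def generate(prompt: str) -> str:
    raise NotImplementedError("call your vendor's completion endpoint here")

def fix_grammar(text: str) -> str:
    return generate(f"Rewrite the following to be grammatical:\n\n{text}")

def make_concise(text: str) -> str:
    return generate(f"Rewrite the following to be more concise:\n\n{text}")

def suggest_newsletter_paragraph(topic: str) -> str:
    return generate(f"Draft a short newsletter paragraph about: {topic}")

def answer_question(question: str) -> str:
    return generate(f"Answer the following question:\n\n{question}")
```

The product differentiation is in the prompt templates, the UI placement, and the guardrails, not in the underlying capability.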

We don't yet know what features we'll have in two years. "Suggest text from a prompt" went from hot new toy to absolute commodity table-stakes in 8 months; I've never seen a feature commoditize that fast. Maybe we've fully explored what these models are capable of, but we sure haven't explored the regulatory impacts yet.

The regulatory impacts are the largest risk to these tools. We've already seen news stories of lawyers using these tools to write briefs with fictional citations, and efforts to create deepfake stories "by" prolific authors. The European Union is already considering regulation to require "cite your sources," which GPT isn't great at. Nor are GPT-based models good at honoring the Right to Be Forgotten enshrined in EU law by the GDPR.

The true innovation we'll see in coming years will be in creating models that can comply with source-citing and support targeted exclusions. That's going to take years, and training those models is going to keep GPU makers in fat profit margins.

Jeff Martins at New Stack had a beautiful take-down of the SaaS provider practice of not sharing internal status, and how that affects downstream reliability programs. Jeff is absolutely right: each SaaS provider (or IaaS provider) you put in your critical path decreases the absolute maximum availability your system can provide. This also isn't helped by SaaS providers using manual processes to update status pages. We would all provide better services to customers if we shared status with each other in a timely way.
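
The arithmetic behind that claim is worth making explicit: the availabilities of serial dependencies multiply, so every provider in the critical path lowers your ceiling no matter how good your own code is. A quick sketch, with illustrative figures rather than anyone's actual SLA:

```python
# Sketch: the availability ceiling of a system with serial dependencies.
# The figures are illustrative, not any particular provider's SLA.
from math import prod

your_service = 0.999                    # what you could achieve on your own
dependencies = [0.999, 0.999, 0.9995]   # e.g. IaaS, auth SaaS, payments SaaS

ceiling = prod([your_service, *dependencies])
print(f"availability ceiling: {ceiling:.4%}")            # roughly 99.65%

# About 30 hours of possible downtime a year, before you have
# shipped a single bug of your own.
print(f"max downtime per year: {(1 - ceiling) * 365 * 24:.1f} hours")
```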

Spot on. I've felt this frustration too. I've been in the after-action review following an AWS incident when an engineering manager asked the question:

Can we set up alerting for when AWS updates their status page?

The intent was to improve our speed of response to AWS issues. We had to politely let that manager down by saying we'd know there was a problem before AWS updated their page, which is entirely true. Status pages shouldn't be used for real-time alerting. This lesson was further hammered home after we gained access to AWS Technical Account Managers and started getting the status page updates delivered 10 or so minutes early, directly from the TAMs themselves. Status pages are corporate communications, not technical ones.

That's where the problem is: status pages are corporate communication. In a system where admitting fault opens you up to expensive legal action, corporate communication programs will optimize for not admitting fault. For status pages, that means they only get an outage notice after it is unavoidably obvious that an outage is already happening. Status pages are strictly reactive. A true real-time alerting system needs to tolerate the occasional false positive; a corporate communication platform designed to admit fault must avoid false positives at all costs.

How do we fix this? How do we, a SaaS provider, enable our downstream reliability programs to get the signal they need to react to our state?

There aren't any good ways, only hard or dodgy ones.

The first and best way is to do away with US-style adversarial capitalism, which would reduce the risk of saying "we're currently fucking up." The money behind the tech industry won't let this happen, though.

The next best way is to provide an alert stream to customers, so long as they are willing to sign some form of "If we give this to you, then you agree not to sue us for what it tells you" amendment to the usage contract. Even that is risky, because some rights can't be waived that way.
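
What would such an alert stream look like? A minimal sketch, assuming a signed-webhook design; the endpoint, shared secret, and payload fields are all hypothetical illustrations, not anyone's actual API:

```python
# Sketch: pushing incident events to a customer-registered webhook.
# The endpoint, shared secret, and payload fields are hypothetical.
import hashlib
import hmac
import json
import urllib.request

def notify_customer(webhook_url: str, shared_secret: bytes, event: dict) -> None:
    """POST a signed incident event so the customer's alerting can react to it."""
    body = json.dumps(event).encode()
    signature = hmac.new(shared_secret, body, hashlib.sha256).hexdigest()
    request = urllib.request.Request(
        webhook_url,
        data=body,
        headers={
            "Content-Type": "application/json",
            "X-Signature-SHA256": signature,  # lets the customer verify origin
        },
    )
    urllib.request.urlopen(request, timeout=5)

# Example event: deliberately vague on cause, specific on impact, which is
# about as much detail as the legal department is likely to allow.
notify_customer(
    "https://customer.example.com/hooks/vendor-status",  # hypothetical endpoint
    b"per-customer-shared-secret",
    {
        "severity": "degraded",
        "component": "api",
        "started_at": "2023-06-28T14:05:00Z",
        "message": "Elevated error rates; investigating.",
    },
)
```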

What's left is what AWS did for us: have our Technical Account Managers or Customer Success Managers hand-notify customers that there is a problem. This takes the fault admission out of the public space and limits it to customers who are giving us enough money to have a TAM/CSM. It's the opposite of real time, but at least you get notices faster than the public status page? It's still not equivalent to instrumenting your own infrastructure for alerts, and is mostly useful for writing the after-action report.

Yeah, the SaaSification of systems is introducing certain communitarian drives, but the VC-backed capitalist system we're in prevents us from really doing this well.