Recently in telemetry Category

Mathew Duggan wrote a blog post on June 9th titled "Monitoring is a pain," where he goes into delicious detail about where observability, monitoring, and telemetry go wrong inside organizations. I have a vested interest in this, and still agree. Mathew captures a sentiment I didn't highlight enough in my book: a good observability platform for engineering tends to get enmeshed ever deeper into the whole company's data engineering motion, even though that observability platform isn't resourced well enough to really serve that broader goal. This is a subtle point, but absolutely critical for diagnosing criticism of observability platforms.

By not designing your observability platform from the beginning for eventual integration into the overall data motion, you get a lot of hard-to-reduce misalignment of function, not to mention poorly managed availability assumptions. Did your logging platform hiccup for 20 minutes, thus robbing the business metrics people of 20 minutes of account sign-up/shutdown/churn metrics? All data is noisy, but data engineering folk really like it if the noise is predictable and thus can be modeled out. Did the metrics system have an unexpected reboot, which prevented the daily code-push to production because Delivery Engineering couldn't check their canary deploy metrics? Guess your metrics system is now a production-critical system instead of the debugging tool you thought it was.

Data engineering folk like their data to be SQL-shaped for a lot of reasons, but few observability and telemetry systems have an SQL interface. Mathew proposed a useful way to provide that:

When you have a log that must be stored for compliance or legal reasons, don't stick it into the same system you use to store every 200 - OK line. Write it to a database (ideally) or an object store outside of the logging pipeline. I've used DynamoDB for this and had it work pretty well by sticking it in an SQS pipeline -> Lambda -> Dynamo. Then your internal application can query this and you don't need to worry about log expiration with DynamoDB TTL.

Dynamo is SQL-like enough that this method could work for business metrics, like the backing numbers for churn rate and monthly active user (MAU) counts. Or for tracking things like password resets, email changes, and gift-card usage for your fraud/abuse department. Or for capturing all admin-portal activity for your annual external audits.
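For illustration, here is a minimal sketch of what the SQS -> Lambda -> Dynamo pipeline Mathew describes could look like, in Python with boto3. The table name, item field names, and the retention window are all assumptions for the example, and the ttl attribute has to be enabled as the table's TTL field separately.

```python
# A minimal sketch of the SQS -> Lambda -> DynamoDB pattern.
# Table name, field names, and the 7-year retention are assumptions for
# illustration; enable TTL on the "ttl" attribute in the table settings.
import json
import time
import uuid

import boto3

RETENTION_SECONDS = 7 * 365 * 24 * 3600  # assumed compliance retention window

table = boto3.resource("dynamodb").Table("compliance_logs")  # hypothetical table


def handler(event, context):
    """Lambda entry point: each SQS record is one compliance log line."""
    for record in event["Records"]:
        log_line = json.loads(record["body"])
        table.put_item(
            Item={
                "log_id": str(uuid.uuid4()),               # partition key (assumed)
                "account_id": log_line.get("account_id"),
                "action": log_line.get("action"),          # e.g. password_reset
                "payload": record["body"],                 # raw line for the auditors
                "ttl": int(time.time()) + RETENTION_SECONDS,  # DynamoDB expires it for you
            }
        )
```

Because DynamoDB's TTL deletes items on its own, the compliance store ages records out on its own schedule instead of inheriting whatever retention your logging pipeline happens to have.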

Mathew also unintentionally called me out.

Chances are this isn't someones full-time job in your org, they just happened to pick up logging. It's not supposed to be a full-time gig so I totally get it. They installed a few Helm charts, put it behind an OAuth proxy and basically hoped for the best. Instead they get a constant flood of complaints from consumers of the logging system. "Logs are missing, the search doesn't work, my parser doesn't return what I expect".

That's how I got into this. "Go upgrade the Elasticsearch cluster for our logging system" is what started me on the road that led to the Software Telemetry book. It worked for me because I was at a smaller company. Also, our data people stuck their straw directly into the main database rather than our logging system, which is absolutely a dodged bullet.

Mathew also went into some detail around scaling up a metrics system. I spent a chapter and part of an appendix on this very topic. Mathew gives you a solid, concrete example of the problem from the point of view of microservices/Kubernetes/CNCF architectures. This stuff is frikkin hard, and people want to add things like account tags and commit-hash tags so they can do tight correlation work inside the metrics system; that's the sort of cardinality problem many metrics systems aren't designed to support.
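To make the cardinality problem concrete, here is a small sketch using the prometheus_client library (my choice for illustration; the problem isn't specific to Prometheus). The label names mirror the tags people ask for, and the arithmetic in the comments is the part that hurts.

```python
# A sketch of why per-account and per-commit labels hurt a metrics system.
# Every unique label combination becomes its own time series that the
# backend must index and retain.
from prometheus_client import Counter

# Hypothetical counter; the label names are the tags people ask for.
signing_requests = Counter(
    "signing_requests_total",
    "Signature requests handled",
    ["account_id", "code_commit", "host_id"],
)

# Each distinct (account_id, code_commit, host_id) tuple is a new series.
signing_requests.labels(
    account_id="acct_48151623", code_commit="9f3c2e1", host_id="ip-10-0-4-12"
).inc()

# Back-of-envelope: 100k active accounts x 50 deployed commits x 200 hosts
# is a billion potential series for this one counter, far past what most
# self-hosted metrics systems are sized for.
```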

All in all, Mathew lays out several common deficiencies in how monitoring/observability/telemetry is approached in many companies, especially growing companies:

  • All too often, the systems that are in place exist because someone wanted them there and built them. That means they were built casually, and are probably not fit for purpose by the time other folks realize how useful they are, usage scope creeps, and they become mission-critical.
  • These systems aren't resourced (in time, money, and people) at a level commensurate with their importance to the organization, suggesting there are some serious misalignments in the org somewhere.
  • "Just pay a SaaS provider" works to a point, but eventually the bill becomes a major point of contention, forcing compromise.
  • Getting too good at centralized logging means committing to a resource-intensive telemetry style, a habit that's hard to break as you get bigger.
  • No one gets good at tracing if they're using Jaeger. Those that do are the exception. Just SaaS it, sample the heck out of it, and prepare for annual complaining when the bill comes due.

The last point about Jaeger is a key one, though. Organizations green-fielding a new product often go with tracing as their only telemetry style, since it gives so much high-quality data. At the same time, tracing is the single most expensive telemetry style. Unlike centralized logging, which has high-quality self-hosted systems in the form of ELK and Loki, tracing these days only has Jaeger. There is a whole SaaS industry that sells its products based on how much better they are than Jaeger.

There is a lot in the article I didn't cover; go read it.

Jeff Martins at New Stack had a beautiful takedown of the SaaS-provider practice of not sharing internal status, and how that affects downstream reliability programs. Jeff is absolutely right: each SaaS provider (or IaaS provider) you put in your critical path decreases the absolute maximum availability your system can provide. This also isn't helped by SaaS providers using manual processes to update status pages. We would all provide better services to customers if we shared status with each other in a timely way.

Spot on. I've felt this frustration too. I've been in the after-action review following an AWS incident when an engineering manager asked the question:

Can we set up alerting for when AWS updates their status page?

The goal was to improve our speed of response to AWS issues. We had to politely let that manager down by saying we'd know there was a problem before AWS updated their page, which is entirely true. Status pages shouldn't be used for real-time alerting. This lesson was further hammered home after we gained access to AWS Technical Account Managers and started getting the status page updates delivered 10 or so minutes early, directly from the TAMs themselves. Status pages are corporate communications, not technical ones.

That's where the problem is: status pages are corporate communication. In a system where admitting fault opens you up to expensive legal action, corporate communication programs will optimize for not admitting fault. For status pages, that means they only get an outage notice after it is unavoidably obvious that an outage is already happening. Status pages are strictly reactive. A true real-time alerting system needs to tolerate the occasional false positive; a corporate communication platform designed to admit fault must avoid false positives at all costs.

How do we fix this? How do we, a SaaS provider, enable our downstream reliability programs to get the signal they need to react to our state?

There aren't any good ways, only hard or dodgy ones.

The first and best way is to do away with US-style adversarial capitalism, which would reduce the risk of saying "we're currently fucking up." The money behind the tech industry won't let this happen, though.

The next best way is to provide an alert stream to customers, so long as they are willing to sign some form of "if we give this to you, you agree not to sue us for what it tells you" amendment to the usage contract. Even that is risky, because some rights can't be waived that way.

What's left is what AWS did for us: have our Technical Account Managers or Customer Success Managers hand-notify customers that there is a problem. This takes the fault admission out of the public space and limits it to customers who are giving us enough money to have a TAM/CSM. It's the opposite of real time, but at least you get notices faster? It's still not equivalent to instrumenting your own infrastructure for alerts, and is mostly useful for writing the after-action report.

Yeah, the SaaSification of systems is introducing certain communitarian drives, but the VC-backed capitalist system we're in prevents us from really doing this well.

The future of telemetry

One of the nice things about living right now is that we know what the future of software telemetry looks like. Charity Majors is dead right about what the data structure of future telemetry will be:

Arbitrarily cardinal and dimensional events.

Arbitrarily Cardinal means each field in a telemetry database can have an arbitrary number of values, similar to what Elasticsearch and most RDBMSs are built to handle.

Arbitrarily Dimensional means there can be an arbitrary number of fields in the telemetry database, similar to how wide-column datastores like HBase and Hypertable (or MariaDB's ColumnStore engine) are built.
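As a minimal illustration of the two terms (the field names here are just examples, not a schema anyone ships), consider two events landing in the same store:

```python
# "Arbitrarily cardinal": fields like account_id take an effectively
# unbounded number of distinct values.
event_a = {
    "account_id": "acct_48151623",  # millions of possible values
    "code_commit": "9f3c2e1",       # one value per deploy, forever growing
    "duration_ms": 412,
}

# "Arbitrarily dimensional": this event carries fields event_a never
# declared, and the datastore is expected to absorb them without a
# schema migration.
event_b = {
    "account_id": "acct_000042",
    "code_commit": "9f3c2e1",
    "file_type": "wpd",
    "converted_pages": 37,
}
```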

This combination of attributes allows software engineers to not have to worry about how many attributes, metrics, and values they're stuffing into each event across the entire software ecosystem. Here is a classic cardinality explosion that is all too common in modern systems; assume each event has the following attributes:

  • account_id: describing the user account ID.
  • subscription_id: the payment subscription ID for the account/team/org.
  • plan_id: the subscription plan ID, which is often used to gate application features.
  • team_id: the team the account_id belongs to.
  • org_id: the parent organization ID the team or user belongs to.
  • code_commit: the git-hash of the currently running code.
  • function_id: the class-path of the function that generated the event.
  • app_id: the ID of the application generating the event.
  • process_id: the individual execution of code that generated the event.
  • cluster_id or host_id: the Kubernetes cluster or VM that the process was running on.
  • span_id: the workflow the event happened in, used to link multiple events together.

This is a complex set of what I call correlation identifiers in my book. This set of 11 fields will give you a high degree of context for where the event happened and the business-logic context around why it happened. That said, in even a medium-sized SaaS app the number of unique values across this set is going to be in the trillions or higher. You need a database designed for deep cardinality in fields, which we have these days; Jaeger is designed for this style of telemetry right now.

However, this is only part of what telemetry is used for. These 11 fields are all global context, but sometimes you need localized context such as file size, or want to capture localized metrics like the final number of pages. This local context and these localized metrics are where the arbitrarily dimensional aspect of telemetry comes into play.

To provide an example, let's look at the local context I might encounter at work. I work for an electronic-signature provider, where you can upload files in any number of formats, mark them up for signers to fill out, have them signed, and get a PDF at the end. In addition to the previous global context, here is one example of local context we would care about for an event that tracks how we converted an uploaded WordPerfect file into the formats we use on the signing pages:

  • file_type: the source file type.
  • file_size: how big that source file was.
  • source_pages: how many pages the source file was.
  • converted_pages: how many pages the final conversion ended up being (suggesting this can differ from source_pages, how interesting).
  • converted_size: how big the converted pages ended up being.
  • converted_time: how long it took to do the conversions, as measured by this process.

This set seems fine, but let's take a look at the localized context for a different team: the one writing the Upload API endpoints.

  • upload_type: the type of the file uploaded.
  • upload_bytes: how big that uploaded file was.
  • upload_quota: how much quota was left on the account after the upload was done.
  • persist_time: how long it took to get the uploaded file out of local-state and into a persistent storage system.

We see similar concepts here! Chapter 14 in the book gives you techniques for reducing this level of field-sprawl, but if you're using a datastore that allows arbitrary dimensionality, you don't need to bother. All the overhead of reconciling different teams' use of telemetry to reduce complexity in the database goes away if the database is tolerant of that complexity.
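Here is a sketch of what that looks like in practice. The emit() helper is hypothetical, standing in for whatever client ships events to an arbitrarily dimensional backend; the field names come from the two lists above.

```python
# Two teams write differently shaped wide events into the same store,
# sharing the global correlation identifiers but never reconciling their
# local field names with each other.

GLOBAL_CONTEXT = {
    "account_id": "acct_48151623",
    "org_id": "org_77",
    "code_commit": "9f3c2e1",
    "span_id": "c2a9d4e0",
}


def emit(event: dict) -> None:
    """Hypothetical telemetry client; assume it ships the event as-is."""
    print(event)


# Document-conversion team: their local context rides along with the globals.
emit({
    **GLOBAL_CONTEXT,
    "file_type": "wpd",
    "source_pages": 14,
    "converted_pages": 15,
    "converted_time_ms": 2310,
})

# Upload-API team: a different shape, same store, no schema negotiation.
emit({
    **GLOBAL_CONTEXT,
    "upload_type": "wpd",
    "upload_bytes": 1_048_576,
    "persist_time_ms": 87,
})
```

The datastore's tolerance for new fields is what makes the reconciliation work optional.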

Much of Part 3 of the book is spent providing ways to handle database complexity. If you're using a database that can handle it, those techniques aren't useful anymore. Really, the future of telemetry is in these highly complex databases.


The problem with databases that support arbitrary cardinality and dimensionality is that you need to build them from scratch right now. You can start with a column-store database and adapt it to support arbitrary field cardinality, but that's homebrewing. Once you have the database problem solved, you still need to solve the analysis and presentation problems on top of your completely fluid data model.

This is a hard problem, and it's hard enough that you can build and finance a company to solve it. This is exactly what honeycomb.io did and is doing, and why the future of telemetry will see much less insourcing. Jaeger is the closest we have to an open source system that does all this, but it has to rely on an external database to make it work; currently that's either Elasticsearch or Cassandra. The industry needs an open source database that can handle both cardinality and dimensionality, and we just don't have it yet.

The typical flow of telemetry usage in a growing startup these days is roughly:

  1. Small stage: SaaS for everything that isn't business-code, including telemetry.
  2. Medium stage: Change SaaS providers for cost-savings based on changed computing patterns.
  3. Large stage: Start considering insourcing telemetry systems to save SaaS provider costs. Viable because at this stage you have enough engineering talent that this doesn't seem like a completely terrible idea.
  4. Enterprise stage: If insourcing hasn't happened yet for at least one system, it'll happen here.
  5. Industry dominance stage: Open source (or become a major contributor to) the telemetry systems being used.

The constraints of the perfect telemetry database mean that SaaS use, whether through stand-alone telemetry companies like Honeycomb or the offering from your cloud provider, will persist much deeper into the growth cycle. There is a reason the engineering talent behind the Cloud Native Computing Foundation largely comes from the biggest tech companies on the planet: it is in their interest to provide solutions good enough to build internal systems that are competitive with the SaaS providers. Those internal systems won't be quite as full-featured as the SaaS offerings; but when you're doing internal telemetry for cost savings, having an 80% solution feels pretty great compared to what would otherwise be a $2M/month contract.

For the rest of us? SaaS. We'll just have to get used to an outside party holding all of our engineering data.