Recently in telemetry Category

I wrote a weird little book. I'm still getting royalties, so thank you all for buying, but this book does not easily fit into 2025 concepts of "observability engineering," so I want to talk about my goals and how it still fits.

At the base, I ended up writing a book for Platform teams looking to deliver internally deployed observability systems. That's not quite what I had in mind when I started writing in 2020, but that's where it lives five years later. My actual goal was to write a book usable by people in the SaaS industry, but also by people in businesses where internally developed software is mostly used by internal users. The non-SaaS population often gets ignored in book targeting, and I wanted something that would let these people feel seen in a way that reading yet another Observability for Cloud Systems book would not. In 2025, this book is a Platform book.

In 2019 and early 2020, when I was working with Manning on the title and terms, the word "observability" came up. It seems hard to remember in 2025, but "observability" was still a vague term that didn't yet have industry consensus behind it. OpenTelemetry was a thing at the time, but the "metrics" leg of OTel was still in beta, and "logs" was merely roadmapped. In 2025 there are debates around whether the fourth pillar is profiles, performance traces, or errors, which could be stack-dumps or a category of logs. If we had decided to use "observability" instead of "telemetry" the book might have sold better, but the term "telemetry" works better for me because observability is a practice built on top of telemetry signals. I wasn't writing a book about practice, I was writing about herding signals.

Herding signals, not interpreting them.

In 2025, most of the herding is supposed to be done through OpenTelemetry. Or if it isn't OTel, the signals are being herded through other systems like Apache Spark. This is the industry consensus: instrument your code, add attributes in the emitters and collectors, change your vendors as you need to, build dashboards in your vendor's platform. A rewrite of Software Telemetry would reference OTel far more often than I did, but I would still make sure to mention non-OTel styles, because OTel isn't actually supported (or in some cases isn't a good fit) in certain environments, such as network telemetry.

Whatever the API format of the signals getting herded, platform engineers need to know the fundamentals of how telemetry systems operate, and that's what I wrote about. But I also wrote about storing those signals, which is something that OpenTelemetry deliberately leaves out as a detail for the implementer. As I extensively wrote about, storing signals and creating a reporting interface is a hard enough part of telemetry that you can build a business around it. In fact, the Observability Tools market in 2025 is valued at around $2.75 billion, and every vendor in it would love for you to use OTel to send them data to store and present.

In the language of my book, OpenTelemetry is an early shipping stage technology. Early, because it has no role in storage. OTel arguably has a role in the emitting stage through explicit markup in the code itself. OpenTelemetry's impact on the presentation stage is mostly in tagging and attribute schemas and how they get represented in storage. Observability needs to consider every stage, but also the SRE Guide problems of figuring out what to instrument, against which markup standards, and following which procedures to ensure reliability. Observability sits on top of telemetry.

One of the consistent comments I got during the pre-publication reviews was: "I want to know what to track."

My answer was simple: that's not the book I'm writing.

This book is for you, the growth engineer tasked with taking a Kafka topic (or group of topics) of logging data, sent there by OTel, and transforming it in the big Databricks instance with all the other business data.

This book is for you, the network engineer tasked with extracting network metrics out of a proprietary system, so you can chart network things in the main engineering dashboarding platform.

This book is for you, the security engineer tasked with extracting security event data out of a cloud provider to put into the SIEM system.

This book is for you, the project manager who has just been given a digital transformation project to revitalize how all the internally developed apps will produce telemetry, and how engineers will observe the system.

"Columnar databases store data in columns, not rows," says the definition. I made a passing reference to the technology in Software Telemetry, but didn't spend any time on what they are and how they can help telemetry and observability. Over the last six months I worked on converting a centralized logging flow based on Elasticsearch, Kibana, and Logstash, to one based on Logstash and Apache Spark. For this article, I'll be using examples from both methods to illustrate what columnar databases let you do in telemetry systems.

How Elasticsearch is columnar (or not)

First of all, Elasticsearch isn't exactly columnar, but it can fake it to a point. You use Elasticsearch when you need full indexing and tokenization of every field in order to accelerate query-time performance. Born as it was in the early part of the 2010s, Elasticsearch accepts ingestion-side complexity in order to optimize read-side performance. There is a reason that if you have a search field in your app, there is a good chance that Elasticsearch or OpenSearch is involved in the business logic. While Elasticsearch is "schema-less," schema still matters, and there are clear limits to how many fields you can add to an Elasticsearch index.

Each Elasticsearch index or datastream has defined fields. Fields can be defined at index/datastream creation, or configured to auto-create on first use. Both are quite handy in telemetry contexts. Each document in an index or datastream has a reference for every defined field, even if the contents of that field are null. If you have 30K fields and one document has only 19 of them defined, the rest will still exist on the document but be nulled, which in turn makes that 19-field document rather larger than the same document in an index/datastream with only 300 defined fields.

Larger average document size slows down search across the board, due to the number and size of field-indexes the system has to keep track of. This also balloons index/datastream size, which has operational impacts when it comes to routine operations like patching and maintenance. As I mentioned in Software Telemetry, Elasticsearch's cardinality problem manifests in the number of fields, not in the number of unique values in each field.

If you are willing to complicate your ingestion pipeline by carefully crafting the shape of your telemetry, and by ingesting it into multiple index/datastreams so that similarly shaped telemetry is bucketed together, you can mitigate some of the above problems. Create an alias to use as your search endpoint, and populate the alias with the index/datastreams of your various shards. Elasticsearch is smart enough to know where to search, which lets you bucket your field-count cardinality problems in ways that will perform faster and save space. However, this is clearly adding complexity that you have to manage yourself.
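
To make the alias trick concrete, here is a minimal sketch using the Python Elasticsearch client (8.x-style keyword arguments); the index names and field mappings are hypothetical, and a real deployment would likely use index templates rather than hand-made indexes.

    # Hypothetical sketch: one index per telemetry "shape", one alias to search them all.
    from elasticsearch import Elasticsearch

    es = Elasticsearch("http://localhost:9200")

    # Each shard-index only defines the fields its telemetry shape actually uses.
    es.indices.create(index="logs-web", mappings={"properties": {
        "timestamp": {"type": "date"},
        "status_code": {"type": "integer"},
        "url_path": {"type": "keyword"},
    }})
    es.indices.create(index="logs-batch", mappings={"properties": {
        "timestamp": {"type": "date"},
        "job_name": {"type": "keyword"},
        "duration_ms": {"type": "long"},
    }})

    # The alias is the single search endpoint; Elasticsearch fans the query out
    # to only the indexes behind it.
    es.indices.put_alias(index=["logs-web", "logs-batch"], name="logs-search")
    es.search(index="logs-search", query={"term": {"status_code": 500}})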

How Apache Spark is columnar

Spark is pretty clearly columnar, which is why it's the de facto platform of choice for Business Intelligence operations. You know, telemetry for business ops rather than engineering ops. A table defined in Spark (and in most of its backing stores, like Parquet files or Hive tables) can have arbitrary columns defined in it. Data for each column is stored in separate files, which means a query that builds a histogram of log-entries per hour, something like "COUNT timestamp GROUP BY hour(timestamp)", is extremely efficient because the system only needs to look at a single file out of thousands.
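
Here is what that per-hour histogram looks like as a minimal PySpark sketch; the Parquet path and the timestamp column name are assumptions for illustration.

    # Hypothetical sketch: build a per-hour histogram of log entries.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import hour, col

    spark = SparkSession.builder.appName("log-histogram").getOrCreate()
    logs = spark.read.parquet("s3://example-bucket/centralized-logs/")

    # Only the files backing the timestamp column are read; every other column stays on disk.
    (logs
        .groupBy(hour(col("timestamp")).alias("hour"))
        .count()
        .orderBy("hour")
        .show())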

Columnar databases have to do quite a bit of read-time and ingestion-time optimization to truly perform fast, which demonstrates some of the tradeoffs of the style. Where Elasticsearch trades ingestion-time complexity to speed up read-time performance, columnar databases tilt the needle more towards accepting read-time complexity in order to optimize overall resource usage. In short, columnar databases have better scaling profiles than something like Elasticsearch, but they don't query as fast as a result of the changed priorities. This is a far easier trade-off to make in 2024 than it was in 2014!

Columnar databases also don't tokenize the way Elasticsearch does. Have a free-text field that you want to do sub-string searches on? Elasticsearch is built from the bolts out to make that search as fast as possible. Columnar databases, on the other hand, do all of the string walking and searching at query-time instead of pulling the values out of some b-trees.

Where Elasticsearch suffers performance issues when field-count rises, Spark only encounters this problem if the query is designed to encounter it through use of "select *" or similar constructs. The files hit by the query will only be the ones for columns referenced in the query! Have a table with 30K columns in it? So long as you query right, it should perform quite well; the 19-defined-fields-in-a-row problem stops being a problem so long as you're only referencing one of those 19 fields/columns.
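
To show both of those last points in one place, a hedged PySpark sketch: the query names two columns out of a potentially very wide table, and the sub-string match is evaluated by scanning values at query time. The table layout and column names are made up.

    # Hypothetical sketch: column pruning plus a query-time sub-string filter.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col

    spark = SparkSession.builder.getOrCreate()
    wide = spark.read.parquet("s3://example-bucket/wide-telemetry/")  # imagine 30K columns

    (wide
        .select("timestamp", "message")               # only two columns' files get read
        .filter(col("message").contains("timeout"))   # string walking happens at query time
        .count())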

Why columnar is neat

A good centralized logging system can stand in for both metrics and traces, in large part because the backing databases for centralized logging are often columnar or columnar-like. There is nothing stopping you from creating metric_name and metric_value fields in your logging system, and building a bunch of metrics-type queries using those fields.
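
As a hedged sketch of that metrics emulation, assuming hypothetical metric_name and metric_value fields on the log rows:

    # Hypothetical sketch: metrics-style aggregation over ordinary log rows.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, hour, avg, percentile_approx

    spark = SparkSession.builder.getOrCreate()
    logs = spark.read.parquet("s3://example-bucket/centralized-logs/")

    (logs
        .filter(col("metric_name") == "request_latency_ms")
        .groupBy(hour(col("timestamp")).alias("hour"))
        .agg(avg("metric_value").alias("avg_latency"),
             percentile_approx("metric_value", 0.95).alias("p95_latency"))
        .orderBy("hour")
        .show())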

As for emulating tracing, this isn't done through OpenTelemetry; it's done old-school, through hacking. Chapter 5 in Software Telemetry covered how the Presentation Stage uses correlation identifiers:

"A correlation identifier is a string or number that uniquely identifies that specific execution or workflow."

Correlation identifiers allow you to build the charts that tracing systems like Jaeger, Tempo, and Honeycomb are known for. There is nothing stopping you from creating an array-of-strings field named "span_id" where you dump the span-stack for each log-line. Want to see all the logs for a given span? Here you are. Given a sophisticated enough visualization engine, you can even emulate the waterfall diagrams of dedicated tracing platforms.
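
A minimal sketch of that span lookup, assuming the array-of-strings span_id field described above and a made-up span ID:

    # Hypothetical sketch: every log line whose span-stack contains a given span.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import array_contains, col

    spark = SparkSession.builder.getOrCreate()
    logs = spark.read.parquet("s3://example-bucket/centralized-logs/")

    (logs
        .filter(array_contains(col("span_id"), "4bf92f3577b34da6"))
        .orderBy("timestamp")
        .show(truncate=False))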

The reason we haven't used columnar databases for metrics systems has to do with cost. If you're willing to accept cardinality limits, a purpose-built metrics system can store a far greater number of metrics for the same amount of money than a columnar database can. However, the biggest companies are already using columnar datastores for engineering metrics, and nearly all companies are using columnar for business metrics.

But if you're willing to spend the extra resources to use a columnar-like datastore for metrics, you can start answering questions like "how many 5xx response-codes did accounts with the Iridium subscription encounter on October 19th?" Traditional metrics systems would consider subscription-type to be too highly cardinal, where columnar databases shrug and move on.
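
That question translates almost directly into SQL against a columnar store. A sketch in Spark SQL, with the table name, column names, and year invented for illustration:

    # Hypothetical sketch: a high-cardinality question a traditional metrics system would reject.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    spark.read.parquet("s3://example-bucket/centralized-logs/").createOrReplaceTempView("events")

    spark.sql("""
        SELECT count(*) AS error_count
        FROM events
        WHERE status_code BETWEEN 500 AND 599
          AND plan_name = 'Iridium'
          AND to_date(timestamp) = DATE '2024-10-19'
    """).show()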

What this means for the future of telemetry and observability

Telemetry over the last 60 years of computing has gone from digging through the SYSLOG printout from one of your two servers, to digging through /var/log/syslog, to the creation of dedicated metrics systems, to the creation of tracing techniques. Every decade's evolution of telemetry has been constrained by the compute and storage performance envelope available to the average system operator.

  • The 1980s saw the proliferation of multi-server architectures as the old mainframe style went out of fashion, so centralized logging had to involve the network. NFS shares for Syslog.
  • The 1990s saw the first big scale systems recognizable as such by people in 2024, and the beginnings of analytics on engineering data. People started sending their web-logs direct to relational databases, getting out of the "tail and grep" realm and into something that kinda looks like metrics if you squint. Distributed processing got its start here, though hardly recognizable today.
  • The 2000s saw the first bespoke metrics systems and protocols, such as statsd and graphite. This era also saw the SaaS revolution begin, with Splunk being a big name in centralized logging, and NewRelic gaining traction for web-based metrics. Distributed processing got more involved, and at the end of the decade the big companies like Google and Microsoft lived and breathed these systems. Storage was still spinning disk, with some limited SSD usage in niche markets.
  • The 2010s saw the first tracing systems, and the SaaS revolution ate a good chunk of the telemetry/observability space. The word observability entered wide usage. Distributed processing ended the decade as the default stance for everything, including storage. Storage bifurcated into bulk (spinning disk) and performance (SSD) tiers, greatly reducing cost.

We're part way through the 2020s, and it's already clear to me that columnar databases are probably where telemetry systems are going to end up by the end of the decade. Business intelligence is already using them, so most of our companies already have them in their infrastructure. Barriers to adoption are going to be finding ways to handle the different retention and granularity requirements of what we now call the three pillars of observability:

  • Metrics need visibility going back years, and are aggregated not sampled. Observability systems doing metrics will need to allow multi-year retention somehow.
  • Tracing retention is 100% based on cost and sample-rate, which should improve over the decade.
  • Centralized logging is like tracing in that retention is 100% based on cost. True columnar stores scale more economically than Elasticsearch-style databases, which increases retention. How sample rate affects retention is less clear, and would have to involve some measure of aggregation to remain viable over time.

Having columnar databases at the core allows a convergence of the pillars of observability. How far we get in convergence over the next five years remains to be seen, and I look forward to finding out.

I've now spent over a decade teaching how alarms are supposed to work (specific, actionable, with the appropriate urgency) and even wrote a book on how to manage metrics systems. One topic I was repeatedly asked to cover in the book, but declined because the topic is big enough for its own book, is how to do metrics right. The desire for an expert to lay down how to do metrics right comes from a number of directions:

  • No one ever looked at ours in a systematic way and our alerts are terrible [This is asking about alerts, not metrics; but they were still indirectly asking about metrics]
  • We keep having incidents and our metrics aren't helping, how do we make them help?
  • Our teams have so many alarms that important ones are getting missed [Again, asking about alerts]
  • We've half-assed it, and now we're getting a growth spurt. How do we know what we should be looking for?

People really do conflate alarms/alerts with metrics, so any discussion about "how do we do metrics better" is often a "how do we do alarms better" question in disguise. As for the other two points, where people have been using vibes to pick metrics and that's no longer scaling, we actually do have a whole lot of advice: there is a whole menu of "golden signals" to pick from, depending on how your application is shaped.

That's only sort of why I'm writing this.

In the mathematical construct of Site Reliability Engineering, where everything is statistics and numerical analysis, metrics are easy. Track the things that affect availability, regularly triage your metrics to ensure continued relevance, and put human processes into place to make sure you're not burning out your worker-units. But the antiseptic concept of SRE only exists in a few places; the rest of us have to pollute the purity of math with human emotions. Let me explain.

Consider your Incident Management process. There are certain questions that commonly arise when people are doing post-incident reviews:

  • Could we have caught this before release? If so, what sort of pre-release checks should we add to catch this earlier?
  • Did we learn about this from metrics or customers? If customers, what metrics do we need to add to catch this earlier? If metrics, what processes or alarms should we tune to catch this earlier?
  • Could we have caught this before the feature flag rolled out to the Emerald users? Do we need to tune the alarm thresholds to catch issues like this in groups with less feature-usage before the high value customers on Emerald plans?

And so on. Note that each question asks about refining or adding metrics. Emotionally, metrics represent anxieties. Metrics are added to catch issues before they hurt us again. Metrics are retained because they're tracking something that used to hurt us and might hurt again. This makes removing metrics hard; the people involved remember why certain metrics are present, intuitively know they need tracking, and so emotion says to keep them.

Metrics are scar tissue, and removing scar tissue is hard, bloody work. How do you reduce the number of metrics, while also not compromising your availability goals? You need the hard math of SRE to work down those emotions, but all it takes is one Engineering Manager to say "this prevented a SEV, keep it" to blow that effort up. This also means you'll have much better luck with a metric reformation effort if teams are already feeling the pinch of alert fatigue or your SaaS metric provider bills are getting big enough that the top of the company is looking at metric usage to reduce costs.

Sometimes, metrics feed into Business Intelligence. That's less about scar tissue and more about optimizing your company's revenue operations. Such metrics are less likely to lead to rapid-response on-call rotations, but they can still lead to months-long investigations into revenue declines. That's a different but related problem.

I could write a book about making your metrics suck less, but that book by necessity has to cover a lot of human-factors issues and has to account for the role of Incident Management in metrics sprawl. Metrics are scar tissue, keep that in mind.

Mathew Duggan wrote a blog post on June 9th titled "Monitoring is a pain", where he goes into delicious detail about where observability, monitoring, and telemetry go wrong inside organizations. I have a vested interest in this, and still agree. Mathew captures a sentiment I didn't highlight enough in my book: that a good observability platform for engineering tends to get enmeshed ever deeper into the whole company's data engineering motion, even though that engineering observability platform isn't resourced well enough to really serve that broader goal. This is a subtle point, but absolutely critical for diagnosing criticism of observability platforms.

By not designing your observability platform from the beginning for eventual integration into the overall data motion, you get a lot of hard-to-reduce misalignment of function, not to mention poorly managed availability assumptions. Did your logging platform hiccup for 20 minutes, thus robbing the business metrics people of 20 minutes of account sign-up/shutdown/churn metrics? All data is noisy, but data engineering folk really like it if the noise is predictable and thus can be modeled out. Did the metrics system have an unexpected reboot, which prevented the daily code-push to production because Delivery Engineering couldn't check their canary deploy metrics? Guess your metrics system is now a production-critical system instead of the debugging tool you thought it was.

Data engineering folk like their data to be SQL shaped for a lot of reasons, but few observability and telemetry systems have an SQL interface. Mathew proposed a useful way to provide that:

When you have a log that must be stored for compliance or legal reasons, don't stick it into the same system you use to store every 200 - OK line. Write it to a database (ideally) or an object store outside of the logging pipeline. I've used DynamoDB for this and had it work pretty well by sticking it in an SQS pipeline -> Lambda -> Dynamo. Then your internal application can query this and you don't need to worry about log expiration with DynamoDB TTL.

DynamoDB can be queried in an SQL-like way, so this method could work for doing business-metrics things like the backing numbers for computing churn rate and monthly active users (MAU). Or for tracking things like password resets, email changes, and gift-card usage for your fraud/abuse department. Or all admin-portal activity for your annual external audits.
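
For the pipeline Mathew describes, the Lambda half might look something like this rough sketch; the table name, key attributes, and seven-year retention are all assumptions.

    # Hypothetical sketch of the Lambda in an SQS -> Lambda -> DynamoDB compliance-log pipeline.
    import json
    import time
    import boto3

    table = boto3.resource("dynamodb").Table("compliance-logs")  # assumed table name
    SEVEN_YEARS = 7 * 365 * 24 * 3600

    def handler(event, context):
        # SQS delivers a batch of records; each body is one compliance log event.
        for record in event["Records"]:
            log = json.loads(record["body"])
            table.put_item(Item={
                "log_id": log["log_id"],        # assumed partition key
                "timestamp": log["timestamp"],
                "payload": record["body"],
                # DynamoDB TTL attribute: epoch seconds after which the item may expire.
                "expires_at": int(time.time()) + SEVEN_YEARS,
            })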

Mathew also unintentionally called me out.

Chances are this isn't someones full-time job in your org, they just happened to pick up logging. It's not supposed to be a full-time gig so I totally get it. They installed a few Helm charts, put it behind an OAuth proxy and basically hoped for the best. Instead they get a constant flood of complaints from consumers of the logging system. "Logs are missing, the search doesn't work, my parser doesn't return what I expect".

That's how I got into this. "Go upgrade the Elasticsearch cluster for our logging system" is what started me on the road that led to the Software Telemetry book. It worked for me because I was at a smaller company. Also, our data people stuck their straw directly into the main database rather than our logging system, which is absolutely a dodged bullet.

Mathew also went into some details around scaling up a metrics system. I spent a chapter and part of an appendix on this very topic. Mathew gives you a solid concrete example of the problem from the point of view of microservices/kubernetes/CNCF architectures. This stuff is frikkin hard, and people want to add things like account tags and commit-hash tags so they can do tight correlation work inside the metrics system; the sort of cardinality problem that many metrics systems aren't designed to support.

All in all, Mathew lays out several common deficiencies in how monitoring/observability/telemetry is approached in many companies, especially growing companies:

  • All too often, the systems that are in place are there because someone wanted them there and built them. Which means they were built casually, and probably aren't fit for purpose once other folks realize how useful they are, usage scope creeps, and they become mission critical.
  • These systems aren't resourced (in time, money, and people) commensurate with their importance to the organization, suggesting there are some serious misalignments in the org somewhere.
  • "Just pay a SaaS provider" works to a point, but eventually the bill becomes a major point of contention forcing compromise.
  • Getting too good at centralized logging means committing to a resource intensive telemetry style, a habit that's hard to break as you get bigger.
  • No one gets good at tracing if they're using Jaeger. Those that do are the exception. Just SaaS it, sample the heck out of it, and prepare for annual complaining when the bill comes due.

The last point about Jaeger is a key one, though. Organizations green-fielding a new product often go with tracing only as their telemetry style, since it gives so much high quality data. At the same time, tracing is the single most expensive telemetry style. Unlike centralized logging, which has high quality self-hosted systems in the form of ELK and Loki, tracing these days only has Jaeger. There is a whole SaaS industry that sells its products based on how much better they are than Jaeger.

There is a lot in the article I didn't cover, go read it.

Jeff Martins at New Stack had a beautiful take-down of the SaaS provider practice of not sharing internal status, and how that affects downstream reliability programs. Jeff is absolutely right: each SaaS provider (or IaaS provider) you put in your critical path decreases the absolute maximum availability your system can provide. This also isn't helped by SaaS providers using manual processes to update status pages. We would all provide better services to customers if we shared status with each other in a timely way.

Spot on. I've felt this frustration too. I've been in the after-action review following an AWS incident when an engineering manager asked the question:

Can we set up alerting for when AWS updates their status page?

The idea was to improve our speed of response to AWS issues. We had to politely let that manager down by saying we'd know there was a problem before AWS updated their page, which is entirely true. Status pages shouldn't be used for real-time alerting. This lesson was further hammered home after we gained access to AWS Technical Account Managers and started getting the status page updates delivered 10 or so minutes early, directly from the TAMs themselves. Status Pages are corporate communications, not technical ones.

That's where the problem is: status pages are corporate communication. In a system where admitting fault opens you up to expensive legal actions, corporate communication programs will optimize for not admitting fault. For Status Pages, it means they only get an outage notice after it is unavoidably obvious that an outage is already happening. Status Pages are strictly reactive. For a true real-time alerting system, you need to tolerate the occasional false positive; for a corporate communication platform where every posted notice is an admission of fault, false positives must be avoided at all costs.

How do we fix this? How do we, a SaaS provider, enable our downstream reliability programs to get the signal they need to react to our state?

There aren't any good ways, only hard or dodgy ones.

The first and best way is to do away with US style adversarial capitalism, which reduces the risks of saying "we're currently fucking up." The money behind the tech industry won't let this happen, though.

The next best way is to provide an alert stream to customers, so long as they are willing to sign some form of "If we give this to you, then you agree to not sue us for what it tells you" amendment to the usage contract. Even that is risky, because some rights can't be waived that way.

What's left is what AWS did for us: have our Technical Account Managers or Customer Success Managers hand-notify customers that there is a problem. This takes the fault admission out of the public space, and limits it to customers who are giving us enough money to have a TAM/CSM. This is the opposite of real time, but at least you get notices faster? It's still not equivalent to instrumenting your own infrastructure for alerts, and is mostly useful for writing the after-action report.

Yeah, the SaaSification of systems is introducing certain communitarian drives; but the VC-backed capitalistic system we're in prevents us from really doing this well.

The future of telemetry

One of the nice things about living right now is that we know what the future of software telemetry looks like. Charity Majors is dead right about what the datastructure of future telemetry will be:

Arbitrarily cardinal and dimensional events.

Arbitrarily Cardinal means each field in a telemetry database can have an arbitrary number of unique values, similar to what Elasticsearch and most RDBMSes are built to handle.

Arbitrarily Dimensional means there can be an arbitrary number of fields in the telemetry database, similar to how column-based datastores like MariaDB, HBase, and Hypertable are built.

This combination of attributes means software engineers don't have to worry about how many attributes, metrics, and values they're stuffing into each event across the entire software ecosystem. Here is a classic cardinality explosion that is all too common in modern systems; assume each event has the following attributes:

  • account_id: describing the user account ID.
  • subscription_id: the payment subscription ID for the account/team/org.
  • plan_id: the subscription plan ID, which is often used to gate application features.
  • team_id: the team the account_id belongs to.
  • org_id: the parent organization ID the team or user belongs to.
  • code_commit: the git-hash of the currently running code.
  • function_id: the class-path of the function that generated the event.
  • app_id: the ID of the application generating the event.
  • process_id: the individual execution of code that generated the event.
  • cluster_id or host_id: the kubernetes or VM that the process was running on.
  • span_id: the workflow the event happened in, used to link multiple events together.

This is a complex set of what I call correlation identifiers in my book. This set of 11 fields will give you a high degree of context for where the event happened and the business-logic context around why it happened. That said, in even a medium-sized SaaS app the number of unique combinations of values in this set is going to be in the trillions or higher. You need a database designed for deep cardinality in fields, which we have these days; Jaeger is designed for this style of telemetry right now.

However, this is only part of what telemetry is used for. These 11 fields are all global context, but sometimes you need localized context such as file-size, or want to capture localized metrics like the final number of pages. This local context and these localized metrics are where the arbitrarily dimensional aspect of telemetry comes into play.

To provide an example, let's look at the local context I might encounter at work. I work for an Electronic Signature provider, where you can upload files in any number of formats, mark them up for signers to fill out, have them signed, and get a PDF at the end. In addition to the previous global context, here is one example of local context we would care about for an event that tracks how we converted an uploaded Word Perfect file into the formats we use on the signing pages:

  • file_type: the source file type.
  • file_size: how big that source file was.
  • source_pages: how many pages the source file was.
  • converted_pages: how many pages the final convert ended up being (suggesting this can differ from source_pages, how interesting)
  • converted_size: how big the converted pages ended up being.
  • converted_time: how long it took to do the conversions, as measured by this process.

This set seems fine, but let's take a look at the localized context for a different team: the one writing the Upload API endpoints.

  • upload_type: the type of the file uploaded.
  • upload_bytes: how big that uploaded file was
  • upload_quota: how much quota was left on the account after the upload was done.
  • persist_time: how long it took to get the uploaded file out of local-state and into a persistent storage system.

We see similar concepts here! Chapter 14 in the book gives you techniques for reducing this level of field-sprawl, but if you're using a datastore that allows arbitrary dimensionality you don't need to bother. All the overhead of reconciling different teams' use of telemetry to reduce complexity in the database goes away if the database is tolerant of that complexity.
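
To make that concrete, here is a hedged PySpark sketch of two teams' events landing in one arbitrarily dimensional store; every field name and value is invented. Fields a team didn't emit simply come back null, so nobody has to reconcile schemas up front.

    # Hypothetical sketch: differently shaped events coexisting in one wide event store.
    import json
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    raw_events = [
        # document-conversion team's event
        {"account_id": "a-123", "span_id": "4bf92f35", "app_id": "convert",
         "file_type": "wpd", "source_pages": 12, "converted_pages": 13},
        # upload-API team's event, same workflow, different local context
        {"account_id": "a-123", "span_id": "4bf92f35", "app_id": "upload-api",
         "upload_type": "wpd", "upload_bytes": 48213, "persist_time": 0.42},
    ]

    # Reading JSON unions every field ever seen into one schema.
    events = spark.read.json(spark.sparkContext.parallelize([json.dumps(e) for e in raw_events]))

    events.printSchema()
    events.filter(events.span_id == "4bf92f35").show()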

Much of Part 3 of the book is spent providing ways to handle database complexity. If you're using a database that can handle it, those techniques aren't useful anymore. Really, the future of telemetry is in these highly complex databases.


The problem with databases that support arbitrary cardinality and dimensionality is that you need to build them from scratch right now. You can start with a column-store database and adapt it to support arbitrary field cardinality, but that's homebrewing. Once you have the database problem solved, you need to solve the analysis and presentation problems leveraging your completely fluid data-model.

This is a hard problem, and it's hard enough that you can build and finance a company to solve it. This is exactly what honeycomb.io did and is doing, and why the future of telemetry will see much less insourcing. Jaeger is the closest we have to an open source system that does all this, but it has to rely on a database to make it work; currently that's either Elasticsearch or Cassandra. The industry needs to see an open source database that can handle both cardinality and dimensionality, and we just don't have it yet.

The typical flow of telemetry usage in a growing startup these days is roughly:

  1. Small stage: SaaS for everything that isn't business-code, including telemetry.
  2. Medium stage: Change SaaS providers for cost-savings based on changed computing patterns.
  3. Large stage: Start considering insourcing telemetry systems to save SaaS provider costs. Viable because at this stage you have enough engineering talent that this doesn't seem like a completely terrible idea.
  4. Enterprise stage: If insourcing hasn't happened yet for at least one system, it'll happen here.
  5. Industry dominance stage: Open source (or become a major contributor to) the telemetry systems being used.

The constraints of the perfect telemetry database mean that SaaS use -- through stand-alone telemetry companies like Honeycomb or the offering from your cloud provider -- will persist much deeper into the growth cycle. There is a reason that the engineering talent behind the Cloud Native Computing Foundation largely comes from the biggest tech companies on the planet: it is in their interest to build internal systems that are competitive with the SaaS providers. Those internal systems won't be quite as featured as the SaaS offerings; but when you're doing internal telemetry for cost savings, having an 80% solution feels pretty great compared to what would otherwise be a $2M/month contract.

For the rest of us? SaaS. We'll just have to get used to an outside party holding all of our engineering data.