The future of telemetry

One of the nice things about living right now is that we know what the future of software telemetry looks like. Charity Majors is dead right about what the datastructure of future telemetry will be:

Arbitrarily cardinal and dimensional events.

Arbitrarily Cardinal means each field in a telemetry database can have an arbitrary number of values, similar to how Elasticsearch and most RDBMSs are built to handle.

Arbitrarily Dimensional means there can be an arbitrary number of fields in the telemetry database, similar to how column-based datastores like MariaDB, HBase, and Hypertable are built.

This combination of attributes allows software engineers to not have to worry about how many attributes, metrics, and values they're stuffing into each event cross the entire software ecosystem. Here is a classic cardinality explosion that is all too common in modern systems; assume each event has the following attributes:

  • account_id: describing the user account ID.
  • subscription_id: the payment subscription ID for the account/team/org.
  • plan_id: the subscription plan ID, which is often used to gate application features.
  • team_id: the team the account_id belongs to.
  • org_id: the parent organization ID the team or user belongs to.
  • code_commit: the git-hash of the currently running code.
  • function_id: the class-path of the function that generated the event.
  • app_id: the ID of the application generating the event.
  • process_id: the individual execution of code that generated the event.
  • cluster_id or host_id: the kubernetes or VM that the process was running on.
  • span_id: the workflow the event happened in, used to link multiple events together.

This is a complex set of what I call correlation identifiers in my book. This set of 11 fields will give you a high degree of context for where the event happened and the business-logic context around why it happened. That said, in even a medium size SaaS app the union of unique values in this set is going to be in the trillions or higher. You need a database designed for deep cardinality in fields, which we have these days; Jaeger is designed for this style of telemetry right now.

However, this is only part of what telemetry is used for. These 11 fields are all global context, but sometimes you need localized context such as file-size, or want to capture localized metrics like final number of pages. These local context and localized metrics are where the arbitrarily dimensional aspect of telemetry comes in to play.

To provide an example, let's look at what the local context I might encounter at work. I work for an Electronic Signature provider, where you can upload files in any number of formats, mark them up for signers to fill out, have them signed, and get a PDF at the end. In addition to the previous global context, here is one example of local context we would care about for an event that tracks how we converted an uploaded Word Perfect file into the formats we use on the signing pages:

  • file_type: the source file type.
  • file_size: how big that source file was.
  • source_pages: how many pages the source file was.
  • converted_pages: how many pages the final convert ended up being (suggesting this can differ from source_pages, how interesting)
  • converted_size: how big the converted pages ended up being.
  • converted_time: how long it took to do the conversions, as measured by this process.

This set seems fine, but lets take a look at the localized context for a different team; the one writing the Upload API endpoints.

  • upload_type: the type of the file uploaded.
  • upload_bytes: how big that uploaded file was
  • upload_quota: how much quota was left on the account after the upload was done.
  • persist_time: how long it took to get the uploaded file out of local-state and into a persistent storage system.

We see similar concepts here! Chapter 14 in the book gives you techniques for reducing this level of field-sprawl, but if you're using a datastore that allows arbitrary dimensionality you don't need to bother. All the overhead of reconciling different teams use of telemetry to reduce complexity in the database goes away if the database is tolerant of that complexity.

Much of my Part 3 chapters are spent in providing ways to handle database complexity. If you're using a database that can handle it, those techniques aren't useful anymore. Really, the future of telemetry is in these highly complex databases.

The problem with databases that support arbitrary cardinality and dimensionality is that you need to build them from scratch right now. You can start with a column-store database and adapt it to support arbitrary field cardinality, but that's homebrewing. Once you have the database problem solved, you need to solve the analysis and presentation problems leveraging your completely fluid data-model.

This is a hard problem, and it's hard enough you can build and finance a company to solve it. This is exactly what did and is doing, and why the future of telemetry will see much less insourcing. Jaeger is the closest we have to an open source system that does all this, but it has to rely on a database to make it work; currently that's either Elasticsearch or Cassandra. The industry needs to see an open source database that can handle both cardinality and dimensionality, we and just don't have it yet.

The typical flow of telemetry usage in a growing startup these days is roughly:

  1. Small stage: SaaS for everything that isn't business-code, including telemetry.
  2. Medium stage: Change SaaS providers for cost-savings based on changed computing patterns.
  3. Large stage: Start considering insourcing telemetry systems to save SaaS provider costs. Viable because at this stage you have enough engineering talent that this doesn't seem like a completely terrible idea.
  4. Enterprise stage: If insourcing hasn't happened yet for at least one system, it'll happen here.
  5. Industry dominance stage: Open source (or become a major contributor to) the telemetry systems being used.

The constraints of the perfect telemetry database mean that SaaS use -- through stand-alone telemetry companies like Honeycomb or the offering from your cloud provider -- will persist much deeper into the growth cycle. There is a reason that the engineering talent behind the Cloud Native Computing Foundation largely comes from the biggest tech companies on the planet, it is in their interests to provide good enough solutions to provide internal systems that are competitive with the SaaS providers. Those internal systems wont be quite as featured as the SaaS providers; but when you're doing internal telemetry for cost savings, having an 80% solution feels pretty great for what would otherwise be a $2M/month contract.

For the rest of us? SaaS. We'll just have to get used to an outside party holding all of our engineering data.