February 2019 Archives

Charity Majors had a Twitter thread last night that touched on something I've kind of known for a while.

This goes into Observability, but touches on something I've been arguing about for some time: the intersection of observability, monitoring, and centralized logging. Log-processing engines like Logstash are in effect Extract/Transform/Load pipelines specialized in turning arbitrary inputs into JSON blobs (or some other field-enriched data format). This observation about the core data structure of it all has been known for a while; structured logging is the phrase we gave it.
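To make that concrete, here is a minimal sketch of what structured logging looks like: instead of interpolating values into a freeform string, every value becomes a named field in a JSON blob. The field names here are illustrative, not from any particular schema.

```ruby
require 'json'
require 'time'

# A hypothetical "resize finished" event as structured logging. Every
# value a string log would have interpolated is a named field instead,
# so downstream pipelines never have to parse prose.
event = {
  timestamp: Time.now.utc.iso8601,
  level: "info",
  message: "resized image",
  img_format: "png",
  img_size_bytes: 1_048_576,
  runtime_s: 2.31
}
puts JSON.generate(event)
```

A Logstash-style pipeline can pass a blob like this straight through; the Extract and Transform steps are already done at the point of emission.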

In my talk at DevOps Midwest last year in St. Louis, one of my slides was this one.


In many infrastructures, it is the centralized logging system that provides the raw data for observability.

  • Centralized logging provides telemetry.
  • Telemetry is needed by engineering to figure out why a specific thing went wrong at a specific point of time (traces).
    • This is kept for the shortest period of time of anything, because it is so huge.
  • Observability is derived from telemetry, providing general information about how the system behaves.
    • This needs to have very long time ranges in it in order to be useful, so it is a summary of the larger dataset.
  • Monitoring is derived from telemetry, providing time-series datasets.
  • Reporting and alerting are derived from monitoring.
    • Retention of reports is determined by SLA, contract law, and regulation.
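The "monitoring is derived from telemetry" step above can be sketched in a few lines: roll raw events (huge, short retention) up into a per-minute time series (small, cheap to keep for a long time). The event shape here is an illustrative assumption.

```ruby
# Derive a time-series (monitoring) from raw telemetry events by
# bucketing on the minute and summarizing each bucket. Timestamps are
# seconds since an arbitrary epoch; integer division picks the bucket.
events = [
  { ts: 0,  runtime: 1.2 },
  { ts: 30, runtime: 0.8 },
  { ts: 70, runtime: 2.5 },
]

series = events.group_by { |e| e[:ts] / 60 }.map do |minute, evs|
  { minute: minute,
    count: evs.size,
    avg_runtime: (evs.sum { |e| e[:runtime] } / evs.size).round(2) }
end
p series
```

The raw `events` array is what you age out quickly; the tiny `series` is what you keep for months.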

I'm including pure polling frameworks like Nagios or SolarWinds in telemetry here, but Charity's point can be seen in that chart.

To better show what I'm talking about, take the following bit of code.

syslog.log("Entering ResizeImage. Have #{imgFormat} of #{imgSize}")
[more code]
syslog.log("Left ResizeImage. Did #{imgSize} in #{runtime} seconds.")

This is what Charity was talking about when she said logs are a sloppy version of it. You can get metrics out of this, but you have to regex the strings to pull out the numbers, which means understanding the grammar. You can get observability out of this, since the time difference between the two events tells you a lot about ResizeImage, the syslog metadata will give you some idea as to the atomicity of what happened, and the imgSize can be used to break ties. This is the kind of observability nearly all developers put into their code because outputting strings is built into everything.
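Here is what that regexing looks like in practice, against the second log line above. The pattern encodes the message's grammar; reword the log string and the extraction silently breaks, which is exactly the sloppiness being described.

```ruby
# Pull metrics back out of the sloppy string log. The pattern must
# mirror the exact wording of the log message; this is the hidden
# grammar dependency that structured logging avoids.
line = "Left ResizeImage. Did 1048576 in 2.31 seconds."

if (m = line.match(/Did (\d+) in ([\d.]+) seconds/))
  img_size = m[1].to_i   # bytes, per the message's convention
  runtime  = m[2].to_f   # seconds
  puts "img_size=#{img_size} runtime=#{runtime}"
end
```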

The un-sloppy version of this is something like the OpenTracing framework. Using that framework, those log-injections, which still have use, would be matched with another function-call to open/close 'spans', with any context attached to them that the software engineers think might possibly be useful someday. This is a specialized application of centralized logging, but one with the objective of making distributed systems traceable. This feed of events would be sampled and uploaded to systems like Honeycomb.io for dissection and display.

Democratizing Observability

That's pretty abstract so far, but how do you actually get there?

This is where we run into some problems in the industry, since getting to this ideal of managing data with huge cardinalities doesn't currently have any obvious OSS projects.

  • Small companies can get away with tools like ElasticSearch or MongoDB, because they're not big enough to hit the scaling problems with those.
  • Small companies can use SaaS products like Honeycomb because their data volumes are low enough to be affordable.
  • Large companies can use their ETL engineers to refine their pipelines to send statistically valid samples to SaaS products to keep them affordable.
  • Very large companies can build their own high-cardinality systems.

Note the lack of mid-sized companies in that list. Too much data to afford a SaaS product, too high cardinality to use ElasticSearch, but not enough in-house resources to build their own. Another Charity tweet:

That assload of data comes from the operational reality of scaling up your one-datastore small-company system into a many-datastore mid-sized company system. Many datastores because each is specialized for the use-case given to it. ElasticSearch for your telemetry. OpenTSDB for your metrics. A fist-full of Python scripts and RedShift for your observability. There simply isn't much out there right now that is both battle-proven and able to deal with very high cardinalities.
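The sampling trick the large companies use to keep SaaS ingest affordable can be as simple as deterministic head sampling. This is a sketch under stated assumptions: the event shape, the 1-in-10 rate, and the never-drop-errors rule are all illustrative.

```ruby
# Deterministic head sampling: keep one event in RATE by id, but never
# drop errors. Sampling on id (rather than rand) makes the decision
# reproducible across pipeline stages.
RATE = 10

def keep?(event)
  return true if event[:error]   # errors are always worth shipping
  (event[:id] % RATE).zero?      # deterministic 1-in-RATE sample
end

events = (1..100).map { |i| { id: i, error: i == 7 } }
kept = events.select { |e| keep?(e) }
puts kept.size
```

A statistically valid sample like this shrinks the assload of data by an order of magnitude before it ever hits the SaaS bill.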

So, don't feel like a loser because you've got three lakes of data. Remember, you still need telemetry even when you have an Observability system. Reducing your lakes of data from three (telemetry, metrics, observability) to two (telemetry, observability) will save you money.

Hard days at work

Last Monday ended up ranking in my top-three list of stressful days at work. It took me a while to figure out whether it ranked ahead of or behind getting fired.

The list:

  1. Having a coworker unexpectedly die, August 6, 2003.
  2. Having my company acquired, January 28, 2019.
  3. Getting fired, November 11, 2013.

You may be wondering how an acquisition could rank above getting fired. All things considered, this was the nice kind of getting bought out; the executives fell in love with each other, realized there were synergies here, and decided to make it official. People at work were generally excited about it when it broke. I had no idea why.

Getting fired was a single event, but at least there was a kind of runbook of what to do afterward.

  • File for unemployment.
  • Start job-hunting.
  • Keep job-hunting.
  • Keep the unemployment office happy.
  • Grin and bear it for the weeks between the offer and the start-date when unemployment isn't coming in.

It had an emotional toll, but at least the what next question was answered. The emotional side was more of a dissociated stun.

Getting acquired was something I had no baseline expectations for, except bad. It took me most of the day to figure out if the happy-happy I was seeing was corporate kool-aid or if people were honestly happy about it. My reaction was to become a giant stressball of anxiety. I had no baseline. I had no way to predict what was going to happen next. And I had a remote job, which even now takes longer to find than an on-prem job. Even dealing with the sudden death had a bit of a runbook to it.

They were honestly happy about it. It turns out that had to do with what was going to happen to our stock-options. A big check for a lot of us, including me. As I mentioned on a private social network on Tuesday:

This looks like a lottery-win. My cynicism doesn't know how to deal with this.

It took me a day and a half before I stopped being a gigantic stressball about the whole thing. That night I slept for absolute crap. By the end of the day I was feeling better since most of my pressing questions...

  • Are you serious about the options thing? (Yes. Holy shit.)
  • What about benefits? (infodump, far more comprehensive than we get now)
  • What about trans coverage? (Somewhat TBD, but it looks better than I have now by a fair piece)
  • What job will I have? (4% raise, plus a 20% bonus opportunity, and a drop in title; I lost 'Staff')

...got answers, and that meant my stomach wasn't threatening to ulcerate. All in all, a nice improvement. But for those two days, oh boy was I not fun to be around.