Recently in monitoring Category

The following log-line in my Elasticsearch logs confused me. The format is a bit different from what you will find in yours; I added some line-breaks to improve readability.

[2021-06-01T19:59:04,832][DEBUG][o.e.a.a.i.s.TransportIndicesStatsAction] 
[ip-192-0-2-125.prod.internal]
failed to execute [indices:monitor/stats] on node [L3JiFxy5TTuBiGXH_R_dLA]
org.elasticsearch.transport.RemoteTransportException:
[ip-192-0-2-125.prod.internal][192.0.2.125:9300] [indices:monitor/stats[n]]
Caused by: org.elasticsearch.common.breaker.CircuitBreakingException: [parent]
Data too large, data for [<transport_request>]
would be [9317804754/8.6gb], which is larger than the limit of [9295416524/8.6gb],
usages [
request=0/0b,
fielddata=6592649559/6.1gb,
in_flight_requests=3803/3.7kb,
accounting=2725151392/2.5gb
]

There was just about no search-engine-reachable content about this error when I ran into it. Decoding this one took some sleuth-work, but the key break came when I found the circuit breaker documentation for Elasticsearch. As the documentation says, the circuit breakers are there to backstop operations that would otherwise run an Elasticsearch process out of memory. As the log-line suggests, there are four types of circuit breakers in addition to a 'parent' one. All four are defined as a percentage of HEAP:

  • Request: The maximum memory a request is allowed to consume. This is different than the size of the request itself, because it includes memory used to compute aggregations.
  • Fielddata: The maximum memory threshold for loading a field's data into memory. So, if you have a "hosts" field with 1.2 million unique values in it, you take a memory hit for each unique value. Or, if you have 5000 fields on each request, each field needs to be loaded into memory. Either problem can trigger this.
  • In Flight: The maximum memory of all in-process requests. If a node is too busy doing work, this can fire.
  • Accounting: The maximum memory usable by items that persist after a request is completed, such as Lucene segment memory.

In the log-line I posted above we see three things:

  • Field-data is by far the largest component at 6.1GB
  • The total usages add up to 9317804754 bytes (about 8.68GB), which is just over the limit of 9295416524 bytes (about 8.66GB); both display as "8.6gb" in the log
  • We hit the parent breaker.

The parent circuit-breaker is a bit confusing, but out of the box (as of ES 7.x) it is 70% of HEAP. So, if 8.6GB is 70%, then HEAP is about 12.3GB. This told me which nodes were having the problem.
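If you'd rather not back-calculate HEAP from log-lines, the node stats API exposes each breaker's limit, current estimate, and trip count directly. A minimal sketch, assuming an unauthenticated cluster reachable on localhost:9200 (adjust host and auth for your environment):

import requests

# Ask every node for its circuit-breaker stats: limits, current estimates,
# and how many times each breaker has tripped.
resp = requests.get("http://localhost:9200/_nodes/stats/breaker")
resp.raise_for_status()

for node_id, node in resp.json()["nodes"].items():
    parent = node["breakers"]["parent"]
    print(node["name"],
          "limit:", parent["limit_size"],
          "estimated:", parent["estimated_size"],
          "tripped:", parent["tripped"])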


The fix for this isn't nice. I needed to do two things:

  1. Increase the parent circuit-breaker to 80% to get things moving again (the indices.breaker.total.limit cluster setting; a sketch of the call follows this list), and clean up all the damage caused by hitting this breaker. More on that in a bit.
  2. Look deeply into my Elasticsearch schema to identify field-sprawl and fix it. As this was our Logging cluster, we had a few Java apps that log deeply nested JSON datastructures, causing thousands of fields to be created, most of them empty.
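For reference, raising the parent breaker is a single cluster-settings call. A hedged sketch against the same hypothetical localhost cluster; a transient setting reverts on a full cluster restart, which suits a stopgap like this:

import requests

# Raise the parent circuit-breaker to 80% of HEAP. This buys headroom;
# it does not fix the field-sprawl that got us here.
resp = requests.put(
    "http://localhost:9200/_cluster/settings",
    json={"transient": {"indices.breaker.total.limit": "80%"}},
)
resp.raise_for_status()
print(resp.json())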

There are a few reasons Elasticsearch sets a limit on the maximum fields per index (index.mapping.total_fields.limit), and we ran into one of them: field-sprawl caused by JSON-deserializing the logging from (in this case) Java applications. Raising the circuit-breaker only goes so far; the Compressed Ordinary Object Pointer feature of Java puts a functional HEAP ceiling at around 30GB. Throwing more resources at the problem has a ceiling, so you will have to fix the underlying issue sometime.
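Spotting field-sprawl is mostly a matter of counting leaf fields per index and comparing against index.mapping.total_fields.limit (which defaults to 1000). A rough sketch of that count, again assuming the hypothetical localhost cluster; it walks the mapping tree and counts multi-fields as extra fields:

import requests

def count_fields(properties):
    # Recursively count leaf fields in a mapping's 'properties' tree.
    total = 0
    for field in properties.values():
        if "properties" in field:                       # object field: recurse
            total += count_fields(field["properties"])
        else:                                           # leaf field
            total += 1 + len(field.get("fields", {}))   # include multi-fields
    return total

mappings = requests.get("http://localhost:9200/_all/_mapping").json()
for index, body in sorted(mappings.items()):
    props = body["mappings"].get("properties", {})
    print(index, count_fields(props))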

In our case, running nodes with 30GB of HEAP is more expensive than we want to pay so fixing the problem now is what we're doing. Once we get the schema issue fixed, we'll lower the parent breaker back to 70%.


The symptom we saw that told us we had a problem was a report from users that they couldn't search more than a day in the past (we rotate logging indexes once a day), in spite of rather more days of indexes being available. Going to Index Management in Kibana and looking at indexes, we saw that only a few indexes had index stats available; the rest had no details about document count or overall index size.

Using the Tasks API we got a list of all tasks in process, and found a large number of "indices:monitor/stats" jobs were failing. This task is responsible for updating the index statistics Kibana uses in the Index Management screens. Without those statistics Kibana doesn't know if those indexes are usable in queries.
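The call we used looked something like the following sketch; filtering on the action name keeps the output manageable (same hypothetical localhost cluster as above):

import requests

# List in-flight tasks, narrowed to the index-stats action that feeds
# Kibana's Index Management screens.
resp = requests.get(
    "http://localhost:9200/_tasks",
    params={"actions": "indices:monitor/stats*", "detailed": "true"},
)
resp.raise_for_status()
for node in resp.json()["nodes"].values():
    for task_id, task in node["tasks"].items():
        print(task_id, task["action"], task.get("running_time_in_nanos"))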


Cleaning up after this was complicated by a node-failover that happened while the cluster was in this state. Elasticsearch dutifully made any Replica shards into Primary shards, but mostly couldn't create new Replica shards because those operations hit the circuit-breaker. Did you know that Elasticsearch has an internal retry-max when attempting to create new shards? I do now.

Even after getting the parent breaker reset to a higher value, those shards did not recreate: their retry-max had been hit. The only way to get those shards created was to close the affected indexes (using the indexname/_close API) and re-open them. That reset the retry counter, and the shards recreated.
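For the record, the close-and-reopen dance is just two API calls per affected index. A minimal sketch with hypothetical index names:

import requests

# Closing and re-opening an index resets the shard-allocation retry counter,
# after which the missing replicas could finally recreate.
stuck_indexes = ["logs-2021.05.30", "logs-2021.05.31"]   # hypothetical names

for index in stuck_indexes:
    requests.post(f"http://localhost:9200/{index}/_close").raise_for_status()
    requests.post(f"http://localhost:9200/{index}/_open").raise_for_status()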

When terms shift

For a few years I've been giving talks on monitoring, observability, and how to get both. In those talks I have a Venn Diagram like this one from last year:

Venn Diagram showing monitoring nested inside observability nested inside telemetry

Note the outer circle, telemetry. I was using this term to capture the concept of all the debug and log-line outputs from an in-house developed piece of software. Some developer put every log-line in there for a reason. It made sense to call this telemetry. Using this concept I point out a few things:

  • Telemetry is your biggest pot of data. It is the full, unredacted sequence of events.
  • So big that you probably only keep it searchable for a short period of time.
  • But long enough that you can troubleshoot with it and see how error rates change across releases.
  • Observability is a restricted and sampled set of telemetry, kept for longer because of its higher correlation.
  • Higher correlation and longer time-frames make for a successful observability system.
  • Monitoring is a subset of them all, since those are driving metrics, dashboards, and alarms. A highly specialized use.

A neat model, I thought. I've been using it internally for a while now, and thought it was getting traction.

Then the two major distributed tracing frameworks decided to merge and brand themselves as Open Telemetry. I understand why Open Tracing or Distributed Tracing were non-viable names; one of the two frameworks was already called Open Tracing, and Distributed Tracing is a technique, not a trademark-able project name.

There is a reason I didn't call it Centralized Logging: telemetry encompasses more than just centralized logging. It includes things that aren't centralizable because they exist in SaaS platforms that don't have log-export. Yes, I'm miffed at having to come up with a new term for this. Not sure what it will be yet.

Charity Majors had a twitter thread last night that touched on something I've kind of known for a while.

This goes into Observability, but touches on something I've been arguing about for some time. Specifically, the intersection of observability, monitoring, and centralized logging. Log-processing engines like Logstash are in effect Extract/Transform/Load pipelines specialized in turning arbitrary inputs into JSON blobs (or some other field-enriched data format). This observation about the core datastructure of it all has been known for a while; the phrase structured logging is what we called it.

In my talk at DevOps Midwest last year in St. Louis, one of my slides was this one.

[Slide: ObservabilityMonitoring.png, showing how centralized logging feeds telemetry, observability, monitoring, and reporting]

In many infrastructures, it is the centralized logging system that provides the raw data for observability.

  • Centralized logging provides telemetry.
  • Telemetry is needed by engineering to figure out why a specific thing went wrong at a specific point of time (traces).
    • This is kept for the shortest period of time of anything because it is so huge.
  • Observability is derived from telemetry, providing general information about how the system behaves.
    • This needs to have very long time ranges in it in order to be useful, so it is a summary of the larger dataset.
  • Monitoring is derived from telemetry, providing time-series datasets.
  • Reporting and alerting are derived from monitoring.
    • Retention of reports is determined by SLA, contract-law, and regulation.

I'm including pure polling frameworks like Nagios or SolarWinds in telemetry here, but Charity's point can be seen in that chart.

To better show what I'm talking about, take the following bit of code.

syslog.log("Entering ResizeImage. Have #{imgFormat} of #{imgSize}")
[more code]
syslog.log("Left ResizeImage. Did #{imgSize} in #{runtime} seconds.")

This is what Charity was talking about when she said logs are a sloppy version of it. You can get metrics out of this, but you have to regex the strings to pull out the numbers, which means understanding the grammar. You can get observability out of this, since the time difference between the two events tells you a lot about ResizeImage, the syslog metadata will give you some idea as to the atomicity of what happened, and the imgSize can be used to break ties. This is the kind of observability nearly all developers put into their code because outputting strings is built into everything.

The un-sloppy version of this is something like the Open Tracing framework. Using that framework, those log-injections, which still have use, would be matched with another function-call to open/close 'spans', and have any context attached to them that the software engineers think might possibly be useful someday. This is a specialized application of centralized logging, but one with the objective of making distributed systems traceable. This feed of events would be sampled and uploaded to systems like Honeycomb.io for dissection and display.
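To make the contrast with the sloppy version concrete, here is roughly what the ResizeImage example looks like as a span rather than a pair of log-lines. A hedged sketch against the OpenTracing Python API; the tracer setup, tag names, and function shape are my assumptions, not anything from the snippet above:

import time
import opentracing  # assumes a concrete tracer (Jaeger, etc.) is registered globally

def resize_image(img_format, img_size):
    tracer = opentracing.global_tracer()
    # One span per unit of work; the context rides along with the span
    # instead of being regexed back out of log strings later.
    with tracer.start_active_span("ResizeImage") as scope:
        scope.span.set_tag("img.format", img_format)
        scope.span.set_tag("img.size", img_size)
        start = time.time()
        # ... the actual resize work goes here ...
        scope.span.log_kv({"event": "resized", "runtime": time.time() - start})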

Democratizing Observability

That's pretty abstract so far, but how do you actually get there?

This is where we run into some problems in the industry, since this ideal of managing data with huge cardinalities doesn't currently have any obvious OSS projects behind it.

  • Small companies can get away with tools like ElasticSearch or MongoDB, because they're not big enough to hit the scaling problems with those.
  • Small companies can use SaaS products like Honeycomb because their data volumes are low enough to be affordable.
  • Large companies can use their ETL engineers to refine their pipelines to send statistically valid samples to SaaS products to keep them affordable.
  • Very large companies can build their own high-cardinality systems.

Note the lack of mid-sized companies in that list. Too much data to afford a SaaS product, too high cardinality to use ElasticSearch, but not enough in-house resources to build their own. Another Charity tweet:

That assload of data comes from the operational reality of scaling up your one-datastore small-company system into a many-datastore mid-sized company system. Many datastores because each is specialized for the use-case given to it. ElasticSearch for your telemetry. OpenTSDB for your metrics. A fist-full of Python scripts and RedShift for your observability. There simply isn't much out there right now that is both battle-proven and able to deal with very high cardinalities.

So, don't feel like a loser because you've got three lakes of data. Remember, you still need telemetry even when you have an Observability system. Reducing your lakes of data from three (telemetry, metrics, observability) to two (telemetry, observability) will save you money.

Immutable infrastructure

This concept has confused me for years, but I'm beginning to get caught-up on enough of the nuance to hold opinions on it that I'm willing to defend.

This is why I have a blog.

What is immutable infrastructure?

This is a movement of systems design that holds to a few principles:

  • You shouldn't have SSH/Ansible enabled on your machines.
  • Once a box/image/instance is deployed, you don't touch it again until it dies.
  • Yes, even for troubleshooting purposes.

Pretty simple on the face of it. Don't like how an instance is behaving? Build a new one and replace the bad instance. QED.

Refining the definition

The yes, even for troubleshooting purposes concept encodes another concept rolling through the industry right now: observability.

You can't do true immutable infrastructure until after you've already gotten robust observability tools in place. Otherwise, the SSHing and monkey-patching will still be happening.

So, Immutable Infrastructure and Observability. That makes a bit more sense to this old-timer.

Example systems

There are two design-patterns that structurally force you into taking observability principles into account, due to how they're built:

  • Kubernetes/Docker-anything
  • Serverless

Both of these make traditional log-file management somewhat more complex, so if engineering wants their Kibana interface into system telemetry, they're going to have to come up with ways to get that telemetry off of the logic and into the central-place using something other than log-files. Telemetry is the first step towards observability, and one most companies do instinctively.

Additionally, the (theoretically) rapid iterability of containers/functions means much less temptation to monkey-patch. Slow iteration means more incentive to SSH or monkey-patch, because that's faster than waiting for an AMI or template-image to bake.

The concept so many seem to miss

This is pretty simple.

Immutable infrastructure only applies to the pieces of your infrastructure that hold no state.

And its corollary:

If you want immutable infrastructure, you have to design your logic layers to not assume local state for any reason.

Which is to say, immutable infrastructure needs to be a DevOps thing, not just an Ops thing. Dev needs to care about it as well. If that means in-progress file-transforms get saved to a memcache/gluster/redis cluster instead of the local filesystem, so be it.

This also means that you will have both immutable and bastion infrastructures in the same overall system. Immutable for your logic, bastion for your databases and data-stores. Serverless for your NodeJS code, maintenance-windows and patching-cycles for your Postgres clusters. Applying immutable patterns to components that take literal hours to recover/re-replicate introduces risk in ways that treating them for what they are, mutable, would not.

Yeahbut, public cloud! I don't run any instances!

So, you've gone full Serverless, all of your state is sitting in something like AWS RDS, ElastiCache, and DynamoDB, and you're using Workspaces for your 'inside' operations. No SSHing, to be sure. This is about as automated as you can get. Even so, there are still some state operations you are subject to:

  • RDS DB failovers still yield several to many seconds of "The database told me to bugger off" errors.
  • RDS DB version upgrades still require a carefully choreographed dance to ensure your site continues to function, even if it's glitchy for short periods.
  • ElastiCache failovers still cause extensive latency as your underlying SDKs catch up to the new read/write replica location.

You're still not purely immutable, but you're as close as you can get in this modern age. Be proud.

InfluxDB queries, a guide

I've been playing with InfluxDB lately. One of the problems I'm facing is getting what I need out of it. Which means exploring the query language. The documentation needs some polishing in spots, so I may submit a PR to it once I get something worked up. But until then, enjoy some googlebait about how the SELECT syntax works, and what you can do with it.

Rule 1: Never, ever put a WHERE condition that involves 'value'. Value is not indexed. Doing so will cause table-scans, and for a database that can legitimately contain over a billion rows, that's bad. Don't do it.
Rule 2: No joins.

With that out of the way, have some progressively more complex queries to explain how the heck this all works!


Return a list of values.

Dump everything in a measurement, going back as far as you have data. You almost never want to do this.

SELECT value FROM site_hits

The one exception to this rule is if you're pulling out something like an event stream, where events are encoded as tag values.

SELECT event_text, value FROM eventstream

Return a list of values from a measurement, with given tags.

One of the features of InfluxDB is that you can tag values in a measurement. These function like extra fields in a database row, but you still can't join on them. The syntax for this should not be surprising.

SELECT value FROM site_hits WHERE webapp = 'api' AND environment = 'prod'

Return a list of values from a measurement, with given tags that match a regex.

Yes, you can use regexes in your WHERE clauses.

SELECT value FROM site_hits WHERE webapp =~ /^api_[a-z]*/ AND environment = 'prod'


That's cool and all, but the real power of InfluxDB comes with the aggregation functions and grouping. This is what allows you to learn what the max value was for a given measurement over the past 30 minutes, and other useful things. These yield time-series that can be turned into nice charts.

Return a list of values, grouped by application

This is the first example of GROUP BY, and probably isn't one you'll ever need to use. This will emit multiple time-series.

SELECT value FROM site_hits WHERE webapp =~ /^api_[a-z]*/ AND environment = 'prod' GROUP BY webapp

Return a list of values, grouped by time into 10 minute buckets

When using time for a GROUP BY value, you must provide an aggregation function! This will add together all of the hits in the 10 minute bucket into a single value, returning a time-stream of 10 minute buckets of hits.

SELECT sum(value) FROM site_hits WHERE webapp =~ /^api_[a-z]*/ AND environment = 'prod' GROUP BY time(10m)

Return a list of values, grouped by both web-server and time into 10 minute buckets

This does the same thing as the previous, but will yield multiple time-series. Some graphing packages will helpfully chart multiple lines based on this single query. Handy, especially if servername changes on a daily basis as new nodes are added and removed.

SELECT sum(value) FROM site_hits WHERE webapp =~ /^api_[a-z]*/ AND environment = 'prod' GROUP BY time(10m), servername

Return a list of values, grouped by time into 10 minute buckets, for data received in the last 24 hours.

This adds a time-based condition to the WHERE clause. To keep the line shorter, we're not going to group on servername.

SELECT sum(value) FROM site_hits WHERE webapp =~ /^api_[a-z]*/ AND environment = 'prod' AND time > now() - 24h GROUP BY time(10m)


There is one more trick InfluxDB can do, and this isn't documented very well. InfluxDB can partition data in a database into retention policies. There is a default retention policy on each database, and if you don't specify a retention-policy to query from, you are querying the default. All of the above examples are querying the default retention-policy.

By using continuous queries you can populate other retention policies with data from the default policy. Perhaps your default policy keeps data for 6 weeks at 10 second granularity, but you want to keep another policy for 1 minute granularity for six months, and another policy for 10 minute granularity for two years. These queries allow you to do that.
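The statements to set that up aren't complicated, just thinly documented. A hedged sketch through the 1.x HTTP API, with made-up database and policy names ("telemetry", "6month") and the site_hits measurement from the earlier examples:

import requests

# Management statements go through POST /query on InfluxDB 1.x.
def influx(statement):
    resp = requests.post("http://localhost:8086/query", params={"q": statement})
    resp.raise_for_status()
    return resp.json()

# A six-month retention policy alongside the default one.
influx('CREATE RETENTION POLICY "6month" ON "telemetry" DURATION 26w REPLICATION 1')

# Continuously downsample the default policy into it, in 1 minute buckets,
# preserving all tags (GROUP BY *).
influx('CREATE CONTINUOUS QUERY "cq_site_hits_1m" ON "telemetry" BEGIN '
       'SELECT sum(value) AS value INTO "6month"."site_hits" '
       'FROM "site_hits" GROUP BY time(1m), * END')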

Querying data from a non-default retention policy is done like this:

Return 14 weeks of hits to API-type webapps, in 1 hour buckets

SELECT sum(value) FROM "6month".site_hits WHERE webapp =~ /api_[a-z]*/ AND environment = 'prod' AND time > now() - 14w GROUP BY time(1h)

The same could be done for "18month", if that policy was on the server.

Groking audit

I've been working with Logstash lately, and one of the tasks I was given was attempting to improve parsing of audit.log entries. Turning things like this:

type=SYSCALL msg=audit(1445878971.457:6169): arch=c000003e syscall=59 success=yes exit=0 a0=c2c3a8 a1=c64bc8 a2=c34408 a3=7fff44e370f0 items=2 ppid=16974 pid=18771 auid=1004 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=pts0 ses=5 comm="compiled_evil" exe="/home/justsomeuser/bin/compiled_evil" key="hinkystuff"

Into nice, indexed entries where we can make Kibana graphs of all commands caught with the hinkystuff audit ruleset.

The problem with audit.log entries is that they're not very regexible. Oh, they can be. But optional sometimes-there-sometimes-not fields suck a lot. Take, for example, the SYSCALL line above. Items a0 through a3 are the syscall's arguments, and how many of them appear can vary. Expressing that in regex/grok is trying.
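The grok patterns themselves live in the repo below; the general shape of the problem, though, is splitting key=value pairs where some values are quoted and some keys only show up sometimes. A throwaway first-pass sketch of that (not the repo's actual patterns):

import re

AUDIT_KV = re.compile(r'(\w+)=("[^"]*"|\S+)')

def parse_audit_line(line):
    # First-pass parse of an audit.log record into a dict of fields.
    fields = dict(AUDIT_KV.findall(line))
    # Quoted values (comm=, exe=, key=) keep their quotes; strip them.
    return {k: v.strip('"') for k, v in fields.items()}

record = parse_audit_line(
    'type=SYSCALL msg=audit(1445878971.457:6169): syscall=59 success=yes '
    'comm="compiled_evil" exe="/home/justsomeuser/bin/compiled_evil" key="hinkystuff"'
)
print(record["exe"], record["key"])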

So I made a thing:

Logstash-auditlog: Grok patterns and examples for parsing Audit settings with Logstash.

May it be useful.

The project is done, and you have a monitoring system you like!

Now, how do you keep liking it?

Like all good things, it takes maintenance. There are a few processes you should have in place to provide the right feedback loops to keep liking your shiny new monitoring environment.

  • Questions about monitoring should be in your incident retrospective process.
  • A periodic review of active alarms to be sure you still really want them.

Implementing these will provide both upward pressure to expand it into areas it needs to go, and downward pressure to get rid of needless noise.

There is more to a monitoring system than alarms and reports. Behind all of those cries for action are a lot of data. A lot of data. So much data, that you face scaling problems when you grow because all of your systems generate so much monitoring data.

Monitor everything!
-- Boss

A great idea in principle, but falls apart in one key way...

"Define 'everything', please"

'Everything' means different things for different people. It just isn't feasible to track every monitorable on everything with one-second granularity. Just about everyone will want to back away from that level of monitor-all-the-things. But what is the right fit?

It depends on what you want to do with it. Data being tracked supports four broad categories of monitoring.

  1. Performance
  2. Operational
  3. Capacity
  4. SLA

Performance Monitoring

This kind of monitoring tends to have short intervals between polls. It could be five minutes, but may be as little as every second. That kind of monitoring will create a deluge of data, and may only be done when diagnosing exposed problems or doing in-depth research on the system. It's not run all the time, unless you really do care about per-second changes in the state of something.

This kind of monitoring is defined by a few attributes:

  • High granularity. You poll a lot.
  • Low urgency. You're doing this because you're looking into something, not because it's down.
  • Occasional need. You don't run it all the time, and not on a schedule.

Everything: 1 second granularity for CPU, IOPS, and pagefaults for a given cluster.


Operational Monitoring

The kind we're all familiar with. This is the kind of monitoring that tends to emit alarms for on-call rotations.

  • Medium granularity. Every 5 minutes, that kind of thing.
  • High urgency. Fast responses are needed.
  • Constant need. You run it all the time.

Everything: Every disk-event, everywhere.


Capacity Monitoring

Some of the alarms you have defined already may be capacity alarms, but capacity monitoring just tracks how much you are using of what you have. Some of this stuff doesn't change very fast.

  • Low granularity. It may only get checked once a day.
  • Low urgency. Responding in a couple of days may be fast enough. If not slower.
  • Periodic need. Reviewed once in a while, probably on a schedule.

Everything: Anything that has a "Max" size value greater than the "Current" value.


SLA Monitoring

I've already gone on at length about SLAs, but this is the monitoring that directly supports the SLA pass/fail metrics. I break it apart from the other types because of how it's accessed.

  • Low granularity. Some metrics may be medium, but in general SLA trackers are over-time style.
  • Medium urgency. If a failing grade is determined, response needs to happen. How fast depends on what's not going to get met.
  • Continual and Periodic need. Some things will be monitored continually, others will only be checked on long schedules; possibly once a week, if not once a month.

Everything: Everything it takes to build those reports.


Be aware that 'everything' is context-sensitive when you're talking with people and don't freak out when a grand high executive says, "everything," at you. They're probably thinking about the SLA Monitoring version of everything, which is entirely manageable.

Don't panic, and keep improving your monitoring system.

In the last article we created a list of monitorables and things that look like the kind of alarms we want to see.

Now what?

First off, go back to the list of alarms you already have. Go through those and see which of those existing alarms directly support the list you just created. It may be depressing how few of them do, but rejoice! Fewer alarms mean fewer emails!

What does 'directly support' mean?

Let's look at one monitorable and see what kind of alarms might directly or indirectly support it.

Main-page cluster status.

There are a number of alarms that could already be defined for this one.

  • Main-page availability as polled directly on the load-balancer.
  • Pingability of each cluster member.
  • Main-page reachability on each cluster member.
  • CPU/Disk/Ram/Swap on each cluster member.
  • Switch-port status for the load-balancer and each cluster-member.
  • Webserver process existence on each cluster member.
  • Webserver process CPU/RAM usage on each cluster member.

And more, I'm sure. That's a lot of data, and we don't need to define alarms for all of it. The question to ask is, "How do I determine the status of the cluster?"

The answer could be, "All healthy nodes behind the load-balancer return the main-page, with at least three nodes behind the load-balancer for fault tolerance." This leads to a few alarms we'd like to see:

  • Cluster has dropped below minimum quorum.
  • Node ${X} is behind the load-balancer but serving up errors.
  • The load-balancer is not serving any pages.

We can certainly track all of those other things, but we don't need alarms on them. Those will come in handy when the below-quorum alarm is responded to. This list is what I'd call directly supporting. The rest are indirect indicators; we don't need PagerDuty to tell us about them, we'll find them ourselves once we start troubleshooting the actual problem.


Now that we have a list of existing alarms we want to keep and a list of alarms we'd like to have, the next step is determining when we want to be alarmed.

You are in the weeds.

You're getting a thousand or more alarms over the course of a weekend. You can't look at them all, that would require not sleeping. And you actually do sleep. How do you cope with it?

Lots of email rules, probably. Send the select few you actually care about to the email-to-sms gateway for your phone. File others into the special folders that make your phone go ♫♬♫bingle♬♫♬. And mark-all-read the folder with 1821 messages in it when you get to the office on Monday (2692 on Tuesday after a holiday Monday).

Your monitoring system is OK. It does catch stuff, but the trick is noticing the alarm in all the noise.

You want to get to the point where alarms come rarely, and get acted upon when they show up. 1821 messages a weekend is not that system. Over a 60 hour weekend, 1821 messages is one message every two minutes. Or, if it's like most monitoring systems, it's a few messages an hour with a couple of bursts of hundreds over the course of a few polling-cycles as something big flaps and everything behind it goes 'down'. That alarming load is only sustainable with a fully staffed round-the-clock NOC.

Very few of us have those.

Paring down the load requires asking a few questions: