Recently in monitoring Category

InfluxDB queries, a guide


I've been playing with InfluxDB lately. One of the problems I'm facing is getting what I need out of it. Which means exploring the query language. The documentation needs some polishing in spots, so I may submit a PR to it once I get something worked up. But until then, enjoy some googlebait about how the SELECT syntax works, and what you can do with it.

Rule 1: Never, ever put a WHERE condition that involves 'value'. Value is not indexed. Doing so will cause table-scans, and for a database that can legitimately contain over a billion rows, that's bad. Don't do it.
Rule 2: No joins.

With that out of the way, have some progressively more complex queries to explain how the heck this all works!

Return a list of values.

Dump everything in a measurement, going back as far as you have data. You almost never want to do this.

SELECT value FROM site_hits

The one exception to this rule is when you're pulling out something like an event stream, where events are encoded as tags and values.

SELECT event_text, value FROM eventstream

Return a list of values from a measurement, with given tags.

One of the features of InfluxDB is that you can tag values in a measurement. These function like extra fields in a database row, but you still can't join on them. The syntax for this should not be surprising.

SELECT value FROM site_hits WHERE webapp = 'api' AND environment = 'prod'

Return a list of values from a measurement, with given tags that match a regex.

Yes, you can use regexes in your WHERE clauses.

SELECT value FROM site_hits WHERE webapp =~ /^api_[a-z]*/ AND environment = 'prod'
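One thing worth noticing about that pattern: it is anchored at the start but not at the end, so it behaves as a prefix match. Here is a minimal Python sketch of which tag values it accepts. It assumes Python's re module treats this simple pattern the same way InfluxDB's regex engine does, and the tag values in candidates are made up for illustration.

```python
import re

# Pattern copied from the query above. Assumption: for a pattern this
# simple, Python's `re` behaves like the engine InfluxDB uses.
API_WEBAPP = re.compile(r"^api_[a-z]*")


def matches_webapp(tag_value: str) -> bool:
    """Return True when tag_value would satisfy webapp =~ /^api_[a-z]*/."""
    # search() rather than fullmatch(): the pattern has no trailing `$`,
    # so it only constrains how the value starts.
    return API_WEBAPP.search(tag_value) is not None


# Hypothetical tag values, for illustration only.
candidates = ["api_users", "api_", "api_v2", "web_api_users", "API_users"]
matched = [name for name in candidates if matches_webapp(name)]
print(matched)  # ['api_users', 'api_', 'api_v2']
```

If you only want lowercase-letter suffixes, add a trailing $ to the pattern.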

That's cool and all, but the real power of InfluxDB comes with the aggregation functions and grouping. This is what allows you to learn what the max value was for a given measurement over the past 30 minutes, and other useful things. These yield time-series that can be turned into nice charts.

Return a list of values, grouped by application

This is the first example of GROUP BY, and probably isn't one you'll ever need to use. This will emit multiple time-series.

SELECT value FROM site_hits WHERE webapp =~ /^api_[a-z]*/ AND environment = 'prod' GROUP BY webapp

Return a list of values, grouped by time into 10 minute buckets

When using time for a GROUP BY value, you must provide an aggregation function! This will add together all of the hits in the 10 minute bucket into a single value, returning a time-stream of 10 minute buckets of hits.

SELECT sum(value) FROM site_hits WHERE webapp =~ /^api_[a-z]*/ AND environment = 'prod' GROUP BY time(10m)
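If the bucketing is hard to picture, here is a rough Python sketch of what sum(value) with GROUP BY time(10m) does to a list of points. It is an illustration of the idea, not how InfluxDB implements it: the function name sum_by_bucket and the sample points are invented, timestamps are plain epoch seconds, and empty buckets are simply skipped, which a real server may report differently.

```python
from collections import defaultdict

BUCKET_SECONDS = 10 * 60  # the 10m in time(10m)


def sum_by_bucket(points, bucket_seconds=BUCKET_SECONDS):
    """Sum (epoch_seconds, value) pairs into fixed-width time buckets."""
    totals = defaultdict(int)
    for timestamp, value in points:
        # Round the timestamp down to the start of its bucket.
        bucket_start = timestamp - (timestamp % bucket_seconds)
        totals[bucket_start] += value
    return sorted(totals.items())


# Hypothetical hits: three land in the first bucket, one each in the next two.
points = [(0, 3), (59, 2), (599, 1), (600, 4), (1250, 5)]
buckets = sum_by_bucket(points)
print(buckets)  # [(0, 6), (600, 4), (1200, 5)]
```

Each output pair is one point in the resulting time-series: the bucket's start time and the total for that bucket.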

Return a list of values, grouped by both web-server and time into 10 minute buckets

This does the same thing as the previous, but will yield multiple time-series. Some graphing packages will helpfully chart multiple lines based on this single query. Handy, especially if servername changes on a daily basis as new nodes are added and removed.

SELECT sum(value) FROM site_hits WHERE webapp =~ /^api_[a-z]*/ AND environment = 'prod' GROUP BY time(10m), servername

Return a list of values, grouped by time into 10 minute buckets, for data received in the last 24 hours.

This adds a time-based condition to the WHERE clause. To keep the line shorter, we're not going to group on servername.

SELECT sum(value) FROM site_hits WHERE webapp =~ /^api_[a-z]*/ AND environment = 'prod' AND time > now() - 24h GROUP BY time(10m)

There is one more trick InfluxDB can do, and this isn't documented very well. InfluxDB can partition data in a database into retention policies. There is a default retention policy on each database, and if you don't specify a retention-policy to query from, you are querying the default. All of the above examples are querying the default retention-policy.

By using continuous queries you can populate other retention policies with data from the default policy. Perhaps your default policy keeps data for 6 weeks at 10 second granularity, but you want to keep another policy for 1 minute granularity for six months, and another policy for 10 minute granularity for two years. These queries allow you to do that.

Querying data from a non-default retention policy is done like this:

Return 14 weeks of hits to API-type webapps, in 1 hour buckets

SELECT sum(value) FROM "6month".site_hits WHERE webapp =~ /api_[a-z]*/ AND environment = 'prod' AND time > now() - 14w GROUP BY time(1h)

The same could be done for "18month", if that policy was on the server.

Groking audit


I've been working with Logstash lately, and one of the tasks I was given was attempting to improve parsing of audit.log entries. Turning things like this:

type=SYSCALL msg=audit(1445878971.457:6169): arch=c000003e syscall=59 success=yes exit=0 a0=c2c3a8 a1=c64bc8 a2=c34408 a3=7fff44e370f0 items=2 ppid=16974 pid=18771 auid=1004 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=pts0 ses=5 comm="compiled_evil" exe="/home/justsomeuser/bin/compiled_evil" key="hinkystuff"

Into nice and indexed entries where we can make Kibana graphs of all commands caught with the hinkystuff audit ruleset.

The problem with audit.log entries is that they're not very regexible. Oh, they can be. But optional sometimes-there-sometimes-not fields suck a lot. Take for example, the SYSCALL above. Items a0 through a3 are arguments 1-4 of the syscall, and there may be 1 to 4 of them. Expressing that in regex/grok is trying.

So I made a thing:

Logstash-auditlog: Grok patterns and examples for parsing Audit settings with Logstash.

May it be useful.
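For comparison, outside of grok the optional-field problem can be sidestepped by collecting whatever key=value pairs happen to be present instead of matching fields by position. This is a minimal Python sketch of that idea, not what the Logstash-auditlog patterns do. The function name parse_audit_line is invented, and the sample line is a shortened copy of the SYSCALL entry above with a2 and a3 left out on purpose.

```python
import re

# One key=value pair. The value is either "quoted" or a run of non-space
# characters; the msg=audit(...) value has no spaces, so it fits the latter.
PAIR = re.compile(r'(\w+)=("[^"]*"|\S+)')


def parse_audit_line(line: str) -> dict:
    """Return every key=value pair found in one audit.log line."""
    return {key: value.strip('"') for key, value in PAIR.findall(line)}


# Shortened sample based on the entry above; a2 and a3 are deliberately absent.
line = (
    'type=SYSCALL msg=audit(1445878971.457:6169): arch=c000003e syscall=59 '
    'success=yes exit=0 a0=c2c3a8 a1=c64bc8 items=2 ppid=16974 pid=18771 '
    'uid=0 tty=pts0 comm="compiled_evil" '
    'exe="/home/justsomeuser/bin/compiled_evil" key="hinkystuff"'
)
event = parse_audit_line(line)
print(event["key"], event.get("a2"))  # hinkystuff None
```

Missing fields just don't show up in the dictionary, so nothing has to be declared optional.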

The project is done, and you have a monitoring system you like!

Now, how do you keep liking it?

Like all good things, it takes maintenance. There are a few processes you should have in place to provide the right feedback loops to keep liking your shiny new monitoring environment.

  • Questions about monitoring should be in your incident retrospective process.
  • A periodic review of active alarms to be sure you still really want them.

Implementing these will provide both upward pressure to expand it into areas it needs to go, and downward pressure to get rid of needless noise.

There is more to a monitoring system than alarms and reports. Behind all of those cries for action are a lot of data. A lot of data. So much data, that you face scaling problems when you grow because all of your systems generate so much monitoring data.

Monitor everything!
-- Boss

A great idea in principle, but falls apart in one key way...

"Define 'everything', please"

'Everything' means different things for different people. It just isn't feasible to track every monitorable on everything with one-second granularity. Just about everyone will want to back away from that level of monitor-all-the-things. But what is the right fit?

It depends on what you want to do with it. Data being tracked supports four broad categories of monitoring.

  1. Performance
  2. Operational
  3. Capacity
  4. SLA

Performance Monitoring

This kind of monitoring tends to have low intervals between polls. It could be five minutes, but may be as little as every second. That kind of monitoring will create a deluge of data, and may only be done when diagnosing exposed problems or doing in-depth research on the system. It's not run all the time, unless you really do care about per-second changes in state of something.

This kind of monitoring is defined by a few attributes:

  • High granularity. You poll a lot.
  • Low urgency. You're doing this because you're looking into something, not because it's down.
  • Occasional need. You don't run it all the time, and not on a schedule.

Everything: 1 second granularity for CPU, IOPS, and pagefaults for a given cluster.

Operational Monitoring

The kind we're all familiar with. This is the kind of monitoring that tends to emit alarms for on-call rotations.

  • Medium granularity. Every 5 minutes, that kind of thing.
  • High urgency. Fast responses are needed.
  • Constant need. You run it all the time.

Everything: Every disk-event, everywhere.

Capacity Monitoring

Some of the alarms you have defined already may be capacity alarms, but capacity monitoring just tracks how much you are using of what you have. Some of this stuff doesn't change very fast.

  • Low granularity. It may only get checked once a day.
  • Low urgency. Responding in a couple of days may be fast enough. If not slower.
  • Periodic need. Reviewed once in a while, probably on a schedule.

Everything: Anything that has a "Max" size value greater than the "Current" value.

SLA Monitoring

I've already gone on at length about SLAs, but this is the monitoring that directly supports the SLA pass/fail metrics. I break it apart from the other types because of how it's accessed.

  • Low granularity. Some metrics may be medium, but in general SLA trackers are over-time style.
  • Medium urgency. If a failing grade is determined, response needs to happen. How fast, depends on what's not going to get met.
  • Continual and Periodic need. Some things will be monitored continually, others will only be checked on long schedules; possibly once a week, if not once a month.

Everything: Everything it takes to build those reports.

Be aware that 'everything' is context-sensitive when you're talking with people and don't freak out when a grand high executive says, "everything," at you. They're probably thinking about the SLA Monitoring version of everything, which is entirely manageable.

Don't panic, and keep improving your monitoring system.

In the last article we created a list of monitorables and things that look like the kind of alarms we want to see.

Now what?

First off, go back to the list of alarms you already have. Go through those and see which of those existing alarms directly support the list you just created. It may be depressing how few of them do, but rejoice! Fewer alarms mean fewer emails!

What does 'directly support' mean?

Let's look at one monitorable and see what kind of alarms might directly or indirectly support it.

Main-page cluster status.

There are a number of alarms that could already be defined for this one.

  • Main-page availability as polled directly on the load-balancer.
  • Pingability of each cluster member.
  • Main-page reachability on each cluster member.
  • CPU/Disk/Ram/Swap on each cluster member.
  • Switch-port status for the load-balancer and each cluster-member.
  • Webserver process existence on each cluster member.
  • Webserver process CPU/RAM usage on each cluster member.

And more, I'm sure. That's a lot of data, and we don't need to define alarms for all of it. The question to ask is, "How do I determine the status of the cluster?"

The answer could be, "All healthy nodes behind the load-balancer return the main-page, with at least three nodes behind the load-balancer for fault tolerance." This leads to a few alarms we'd like to see:

  • Cluster has dropped below minimum quorum.
  • Node ${X} is behind the load-balancer but serving up errors.
  • The load-balancer is not serving any pages.

We can certainly track all of those other things, but we don't need alarms on them. Those will come in handy when the below-quorum alarm is responded to. This list is what I'd call directly supporting. The rest are indirect indicators and we don't need PagerDuty to tell us about them; we'll find them ourselves once we start troubleshooting the actual problem.
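To make the three directly supporting alarms concrete, here is a small Python sketch that derives them from per-node state. Everything in it is hypothetical: the Node fields, the cluster_alarms function, and the choice to count only healthy in-pool nodes toward quorum are assumptions for illustration, not any monitoring product's API.

```python
from dataclasses import dataclass

MIN_QUORUM = 3  # "at least three nodes behind the load-balancer"


@dataclass
class Node:
    name: str
    in_pool: bool     # currently behind the load-balancer
    serving_ok: bool  # returns the main page without errors


def cluster_alarms(nodes, balancer_serving, min_quorum=MIN_QUORUM):
    """Return the alarm messages that should fire for this cluster state."""
    alarms = []
    if not balancer_serving:
        alarms.append("The load-balancer is not serving any pages.")
    # Assumption: only healthy nodes in the pool count toward quorum.
    healthy = [n for n in nodes if n.in_pool and n.serving_ok]
    if len(healthy) < min_quorum:
        alarms.append("Cluster has dropped below minimum quorum.")
    for n in nodes:
        if n.in_pool and not n.serving_ok:
            alarms.append(
                f"Node {n.name} is behind the load-balancer but serving up errors."
            )
    return alarms


# Hypothetical cluster: two good nodes, one erroring in the pool, one pulled.
nodes = [
    Node("web1", True, True),
    Node("web2", True, True),
    Node("web3", True, False),
    Node("web4", False, True),
]
alarms = cluster_alarms(nodes, balancer_serving=True)
print(alarms)
```

Ping, CPU, switch-port, and process checks never appear here. They stay in the data store for troubleshooting once one of these alarms fires.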

Now that we have a list of existing alarms we want to keep and a list of alarms we'd like to have, the next step is determining when we want to be alarmed.

You are in the weeds.

You're getting a thousand or more alarms over the course of a weekend. You can't look at them all, that would require not sleeping. And you actually do sleep. How do you cope with it?

Lots of email rules, probably. Send the select few you actually care about to the email-to-sms gateway for your phone. File others into the special folders that make your phone go bingle. And mark-all-read the folder with 1821 messages in it when you get to the office on Monday (2692 on Tuesday after a holiday Monday).

Your monitoring system is OK. It does catch stuff, but the trick is noticing the alarm in all the noise.

You want to get to the point where alarms come rarely, and get acted upon when they show up. 1821 messages a weekend is not that system. Over a 60 hour weekend, 1821 messages is one message every two minutes. Or if it's like most monitoring systems, it's a few messages an hour with a couple of bursts of hundreds over the course of a few polling-cycles as something big flaps and everything behind it goes 'down'. That alarming load is only sustainable with a fully staffed round-the-clock NOC.

Very few of us have those.

Paring down the load requires asking a few questions:

LISA 2013 was very good to me. I saw a lot of sessions about monitoring, theories of, and I've spent most of 2014 trying to revise the monitoring system at work to be less sucky and more awesome. It's mostly worked, and is an awesome-thing that's definitely going on my resume.

Credit goes primarily to two sources:

  • A Working Theory of Monitoring, by Caskey L Dickson of Google, from LISA 2013.
  • SRE University: Non-abstract large systems design for sysadmins, by John Looney and company, of Google, also at LISA 2013.

It was these sessions that inspired me to refine a slide in A Working Theory of Monitoring and put my own spin on it:

I'll be explaining this further, but this is what the components of a monitoring system look like. Well, monitoring ecosystem since there is rarely a single system that does everything. There are plenty of companies that will sell you a product that does everything, but even they get supplemented by home-brew reporting engines and custom-built automation scripts.

There is more to an on-call rotation than a shared calendar with names on it and an agreement to call whoever is on the calendar if something goes wrong.

People are people, and you have to take that into consideration when setting up a rotation. And that means compromise, setting expectations, and consequences for not meeting them. Here are a few policies every rotation should have somewhere. Preferably easy to get to.

The rotation should be published well in advance, and easy to find.

This seems like an obvious thing, but it needs to be said. People need to know in advance when they're going to be obligated to pay attention to work in their usual time off. This allows them to schedule their lives around the on-call schedule, and you, as the on-call manager, will have to deal with fewer shift-swaps as a result. You're looking to avoid...

Um, I forgot I was on-call next week, and I'm going to be in Peru to hike the Andes. *sheepish look.*

This is less likely to happen if the shift schedule is advertised well in advance. For bonus points, add the shift schedules to their work calendars.

(US) Monday Holiday Law is a thing. Don't do shift swaps on Monday.

If you're doing weekly shifts, it's a good idea to not do your shift swap on Monday. Due to the US Monday Holiday Law there are five weeks in a year (10% of the total!) where your shift change will happen on an official holiday. Two of those are days that almost everyone gets off: Labor Day and Memorial Day.

Whether or not you need to avoid shift swaps on a non-work day depends a lot on how hand-offs work for your organization.

Set shift-handoff expectations.

When one watch-stander is relieved by the next, there needs to be a handoff. For some organizations it could be as simple as making sure the other person is there and responsive before stepping down. For others, it can be complicated as they have more state to transfer. State such as:

  • Ongoing issues being tracked.
  • Hardware replacements due during the next shift period.
  • Maintenance tasks not completed by the outgoing watch-stander.
  • Escalation engineers that won't be available.

And so on. If your organization has state to transfer, be sure you have a policy in place to ensure it is transferred.

Acknowledge time must be defined for alarms.

The maximum time a watch-stander is allowed to wait before ACKing an alarm must be defined by policy, and failure to meet that must be noticed. If the ACK time expires, the alarm should escalate to the next tier of on-call.

This is a very critical policy to define, as it allows watch-standers to predict how much life they can have while on-call. If response-time is 10 minutes, it means that if the watch-stander is driving they have to pull over to ack the alarm. 10 minute ACK likely means they can't do anything long, like go to movies or their kid's band recital.

This also goes for people who are on the escalation schedule. It may be different times, but it should still be defined.
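As a sketch of how an ACK policy with escalation might be written down, here is a bit of Python that works out which tier should be holding an un-ACKed alarm at a given moment. The tier names, the 15 and 30 minute windows, and the current_tier function are all invented for illustration; only the 10 minute primary window comes from the example above.

```python
from dataclasses import dataclass


@dataclass
class Tier:
    name: str
    ack_minutes: int  # how long this tier has to ACK before escalation


def current_tier(tiers, minutes_since_alarm):
    """Return the tier holding an un-ACKed alarm, or None if all windows expired."""
    deadline = 0
    for tier in tiers:
        # Each tier's window starts when the previous tier's window ends.
        deadline += tier.ack_minutes
        if minutes_since_alarm < deadline:
            return tier
    return None


# Hypothetical policy; every tier gets its own defined ACK time.
policy = [
    Tier("primary on-call", 10),
    Tier("secondary on-call", 15),
    Tier("manager", 30),
]

for minutes in (0, 9, 10, 25, 55):
    tier = current_tier(policy, minutes)
    print(minutes, tier.name if tier else "nobody left to page")
```

Writing the policy down this way makes it obvious when a tier has no defined ACK time: the entry can't be created without one.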

Response time must be defined for alarms.

The time to first response, actually working on the reported problem, must be defined by policy, and failure to meet that must be noticed. This may vary based on the type of alarm received, but it should still be defined for each alarm.

This is another very critical policy to define, and is even more impactful on watch-stander ability to do other things than be at work. The watch-stander pretty much has to stay within N minutes of a reliable Internet connection for their entire watch. If their commute to and from work is longer than N minutes, they can't be on-watch during their commute. And other things.

  • If 10 minutes or less, it is nearly impossible to do anything out of the house. Such as picking up prescriptions, going to the kid's band recital, soccer games, and theatre rehearsals.
  • If 20 minutes or less, quick errands out of the house may be doable, but that's about it.

As with ACK time, this should also be defined for the people on the escalation schedule.

Consequences for escalations must be defined.

People are people, and sometimes we can't get to the phone for some reason. Escalations will happen for both good reasons (ER visits) and bad (slept through it). The severity of a missed alarm varies based on the organization and what got missed, so this will be a highly localized policy. If an alarm is missed, or a pattern of missed alarms develops, there should be defined consequences.

This is the kind of thing that can be brought up in quarterly/annual reviews.

This gets its own section because it's very important:

The right shift length

How long your shifts should be is a function of several variables:

  • ACK time.
  • Response time.
  • Frequency of alarms.
  • Average time-to-resolution (TTR) for the alarms.

First and foremost: If your alarm frequency and TTR are high enough, say in the 30% of attention or larger range, you don't have an on-call rotation; you have a distributed NOC, and being on watch is a full time job. Don't expect them to do anything else, and pay them like they're at work.

People need to sleep, and watch-standers are more effective if they're not sleep-deprived. What's more, being sleep-deprived is incredibly fatiguing, so schedules that promote such fatigue are more likely to have people quit. Here is a table I made showing the combinations of alarm frequency and mean TTR, showing on average how many minutes a watch-stander will have between periods of on-demand attention:

Minutes of slack between alarms, by alarm frequency (rows) and Time To Resolution, TTR (columns)

Alarm Freq   < 5 min   5 to 10 min   10 to 20 min   up to 30 min   up to 60 min
15 min            10             5              0              0              0
30 min            25            20             10              0              0
60 min            55            50             40             30              0
90 min            85            80             70             60             30
2 hour           115           110            100             90             60
4 hour           235           230            220            210            180
6 hour           355           350            340            330            300
8 hour           475           470            460            450            420

Legend (cell shading):
No sleep possible (dark orange)
Quick naps possible (light orange)
Restful sleep possible
Uninterrupted sleep possible

Given that the average sleep cycle is 45 minutes, any combination that has less than that will be a shift in which the watch-stander will not be able to sleep restfully. This has been colored dark orange. If you allow people time to get to sleep, say 20 minutes, and add that to the 45 minute sleep-cycle, that locks out anything at or under 70 minutes (colored light orange). For people who don't go to sleep easily (such as me) even the 80 and 85 minute slack times would be too little. The next tier, where it's probable you could get one or two sleep-cycles in before having to wake up again, is restful sleep. Finally, we have the tier where you may get uninterrupted (by work) sleep; the most restful sort.
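The slack numbers are just the alarm interval minus the worst-case TTR for the column, floored at zero. Here is a short Python sketch that recomputes them and applies the two cut-offs described above. The function names are invented, and because the text gives no exact boundary between restful and uninterrupted sleep, the sketch lumps everything above 70 minutes together.

```python
ALARM_INTERVALS = [15, 30, 60, 90, 120, 240, 360, 480]  # minutes between alarms
TTR_UPPER_BOUNDS = [5, 10, 20, 30, 60]                  # worst case per TTR column

SLEEP_CYCLE = 45   # minutes; below this, no sleep is possible
NAP_CEILING = 70   # minutes; "at or under 70" is still only nap territory


def slack_minutes(alarm_interval, ttr):
    """Minutes left between alarms once the alarm itself has been handled."""
    return max(0, alarm_interval - ttr)


def sleep_tier(slack):
    """Classify a slack value using the two thresholds given in the text."""
    if slack < SLEEP_CYCLE:
        return "no sleep"
    if slack <= NAP_CEILING:
        return "quick naps"
    # Assumption: the restful/uninterrupted boundary isn't stated, so both
    # of the remaining tiers are reported together.
    return "restful or better"


rows = [[slack_minutes(i, t) for t in TTR_UPPER_BOUNDS] for i in ALARM_INTERVALS]
for interval, row in zip(ALARM_INTERVALS, rows):
    print(f"{interval:>4} min:", [(s, sleep_tier(s)) for s in row])
```

Swap in your own alarm intervals and TTR figures to see where your rotation lands.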

Keep in mind the only-sort-of-random nature of alarms. Sometimes they arrive in clusters. Sometimes they are random, with quite a bit of variation which makes 'average' something you only sometimes see. If your average is 3 hours between alarms (180 minutes), alarms may commonly show up 20 minutes after a previous one, and 5 hours after. There will be nights where no sleep can be found, even though your average is long enough to theoretically support it.

If you're in the orange areas, shifts shouldn't be longer than a day. And probably should be half-days.

If you're near the orange areas, you probably should not have a week of that kind of thing so shift lengths should be less than 7 days.

Shift lengths longer than these guidelines risk burnout and create rage-demons. We have too many rage-demons as it is, so please have a heart.

I believe this policy set provides the groundwork for a well defined on-call rotation. The on-call engineers know what is expected of them, and know the or-else if they don't live up to it.

And if there isn't a stipend...


Sysadmin-types, we kind of have to have a phone. It's what the monitoring system makes vibrate when our attention is needed, and we also tend to be "always on-call", even if it's tier 4 emergency last resort on-call. But sometimes we're the kind of on-call where we have to pay attention any time an alert comes in, regardless of hour, and that's when things get real.

So what if you're in that kind of job, or applying for one, and it turns out that your employer doesn't provide a cell phone and doesn't provide reimbursement? Some Bring Your Own Device policies are written this way. Or maybe your employer moves to a BYOD policy and the company-paid telecoms are going away.

Can they do that?

Yes they can, but.

As with all labor laws, the rules vary based on where you are in the world. However, in August 2014 (a month and a half ago!) Schwan's Home Service, Inc lost an appeal in California Appellate court. This is important because California contains Silicon Valley and what happens there tends to percolate out to the rest of the tech industry. This ruling held that employees who do company business on personal phones are entitled to reimbursement.

The ruling didn't provide a legal framework for how much reimbursement is required, just that some is.

This thing is so new that the ripples haven't been felt everywhere yet. No-reimbursement policies are not legal, that much is clear, but beyond that, not much is. For non-California based companies such as those in tech hot-spots like Seattle, New York, or the DC area this is merely a warning that the legal basis for such no-reimbursement policies is not firm. As the California-based companies revise policies in light of this ruling, accepted-practice in the tech field will shift without legal action elsewhere.

My legal google-fu is too weak to tell if this thing can be appealed to the state Supreme Court, though it looks like it might have already toured through there.

Until then...

I strongly recommend against using your personal phone for both work and private. Having two phones, even phones you pay for, provides an affirmative separation between your work identity subject to corporate policies and liability, and your private identity. This is more expensive than just getting an unlimited voice/text plan with lots of data and dual-homing, but you face fewer risks to yourself that way. No-reimbursement BYOD policies are unfair to tech-workers the way that employers that require a uniform to be worn who don't provide a uniform allowance are unfair; for some of us, that phone is essential to our ability to do our jobs and should be expensed to the employer. Laws and precedent always take a while to catch up to business reality, and BYOD is getting caught up.

When it comes to things to send alarming emails about, CPU, RAM, Swap, and Disk are the four everyone thinks of. If something seems slow, check one or all of those four to see if it really is slow. This sets up a causal chain...

It was slow, and CPU was high. Therefore, when CPU is high it is slow. QED.

We will now alarm on high CPU.

It may be true in that one case, but high CPU is not always a sign of bad. In fact, high CPU is a perfectly normal occurrence in some systems.

  1. Render farms are supposed to run that high all the time.
  2. Build servers are supposed to be running that high a lot of the time.
  3. Databases chewing on long-running queries.
  4. Big-data analytics that can run for hours.
  5. QE systems grinding on builds.
  6. Test-environment systems being ground on by QE.

Of course, not all CPU checks are created equal. Percent-CPU is one thing, Load Average is another. If Percent-CPU is 100% and your load-average matches the number of cores in the system, you're probably fine. If Percent-CPU is 100% and your load-average is 6x the number of cores in the system, you're probably not fine. If your monitoring system only grabs Percent-CPU, you won't be able to tell what kind of 100% event it is.
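Here is a minimal Python sketch of telling those two kinds of 100% apart by looking at load average per core. The describe_cpu function and its overload_ratio cut-off of 2.0 are assumptions picked for illustration; the text only says that 1x the core count is probably fine and 6x probably isn't. The local_load_per_core helper uses os.getloadavg(), which is only available on Unix-like systems.

```python
import os


def describe_cpu(percent_cpu, load_average, cores, overload_ratio=2.0):
    """Classify a CPU reading using both Percent-CPU and load average.

    overload_ratio is a hypothetical threshold, not a recommended value.
    """
    ratio = load_average / cores
    if percent_cpu < 100:
        return "not saturated"
    if ratio >= overload_ratio:
        return "saturated with work queuing up"
    return "saturated but keeping up"


def local_load_per_core():
    """One-minute load average divided by core count (Unix-only)."""
    one_minute, _, _ = os.getloadavg()
    return one_minute / (os.cpu_count() or 1)


# The two cases from the text, on a hypothetical 8-core box.
print(describe_cpu(100, 8, 8))   # saturated but keeping up
print(describe_cpu(100, 48, 8))  # saturated with work queuing up
```

With only Percent-CPU in hand, both calls would look identical, which is the point.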

As a generic, apply-it-to-everything alarm, High-CPU is a really poor thing to pick. It's easy to monitor, which is why it gets selected for alarming. But, don't do that.

Cases where a High-CPU alarm won't actually tell you that something is going wrong:

  • Everything in the previous list.
  • If your app is single-threaded, the actual high-CPU event for that app on a multi-core system is going to be WELL below 100%. It may even be as low as 12.5%.
  • If it's a single web-server in a load-balanced pool of them, it won't be a BOTHER HUMANS RIGHT NOW event.
  • During routine patching. It should be snoozed on a maintenance window anyway, but sometimes it doesn't happen.
  • Initializing a big application. Some things normally chew lots of CPU when spinning up for the first time.
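The 12.5% figure in the single-threaded case is simple division. This sketch assumes your monitoring reports Percent-CPU averaged across all cores, and the function name is made up.

```python
def single_thread_ceiling(cores):
    """Highest whole-system Percent-CPU one pegged thread can produce."""
    return 100.0 / cores


for cores in (2, 4, 8, 16):
    print(cores, "cores:", single_thread_ceiling(cores), "%")
```

On an 8-core system that works out to 12.5%, so a blanket alarm set anywhere above that never fires for a pegged single-threaded app.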

CPU/Load Average is something you probably should monitor, since there is value in retroactive analysis and aggregate analysis. Analyzing CPU trends can tell you it's time to buy more hardware, or turn up the max-instances value in your auto-scaling group. These are all the kinds of thing you look at in retrospect; they're not things that you want waking you up at 2:38am.

Only turn on CPU alarms if you know that is an error condition worthy of waking up a human. Turning it on for everything just in case is a great way to train yourself into ignoring high-CPU alarms, which means you'll miss the ones you actually care about. Human factors, they're part of everything.
