Or, a post I never thought I'd make, seeing as I'm a sysadmin.

But it seems I'm the senior git expert in my team, so I'm making it. So odd.


There's a series of questions you should ask among your team before moving a repo over to git. Git is a hell of a toolbox, and like all toolboxes there are nearly infinite ways of using it. There is no one true way, only ways that are better for you than others. These questions are here to help you figure out how you want to use it, so you'll be happier down the road.

Q: How do you use the commit-log?

History is awesome. Looking back five years in the code repository to figure out WTF a past developer was thinking when they wrote that bit of spaghetti code is quite useful if the commit includes something like, "found weird-ass edge case in glib, this is the workaround until they get a fix." That's actionable. Maybe it's even tied to a bug number in the bug-tracking system, or a support ticket.

Do you ever look through the history? What are you looking for? Knowing this allows you to learn what you want out of your source-control.

Q: What is the worth of a commit?

A commit in Git is not the same thing as a commit in SVN, Fog, or ClearCase. In some systems a commit, or check-in, is a pretty big thing; it takes reviews or approvals before it can be made.

This question is there to get you thinking about what a commit is. Commits in git are cheap, and that changes things. Knowing that you will be facing more of them than you did in the past will help guide you through the later questions.

Q: Is every commit sacred, or do you value larger, well-documented commits more?

Practically everyone I know has made a commit with the message 'asdf'. If you're grinding on a stupid thing, it may take you 19 commits to come up with the two lines of code that actually work. In five years, when you come back to look at that line of code, the final commit-message on those lines might be:

a1bd0809 maybe this will work

Not exactly informative.

bdc8671a Reformat method calls to handle new version of nokogiri

That is informative.

Most projects value more informative commits over lots of little, iterative ones. But your team may be different, and it may change its mind once it has some experience.

Q: Should new features be all in one commit, or in a few modular commits?

Some features are quite large. So large that squashing them into a single commit leads to a diff of hundreds of lines. Such a large feature means that the history on those files will be slathered with the same initial-feature-commit, with no context for why it is that way.

Is that good enough? Maybe it is; maybe you're more interested in the hotfix commits that fix bugs and explain non-intuitive behavior and workarounds. Maybe it isn't, and you need each sub-feature in its own commit. Or maybe you want every non-fixup commit.

This is where your approach to the history really informs your decision. If you know how you deal with the past, you will be better able to put process in place to be happier with your past self.


Once you've thought about these questions and your answers to them, you'll be better able to consider the deeper problem of branching strategy. Git is notoriously lacking in undo features, at least in shared repos, so getting this out of the way early is good.

Public Cloud (AWS, Azure, etc) is a very different thing than on-prem infrastructure. The low-orbit view of the sector is that this is entirely intentional: create a new way of doing things to enable businesses to focus on what they're good at. A lot of top executives get that message and embrace it... until it comes time to integrate this new way with the way things have always been done. Then we get some problems.

The view from 1000ft is much different than the one from 250 miles up.

From my point of view, there are two big ways that integrating public cloud will cause culture problems.

  • Black-box infrastructure.
  • Completely different cost-model.

I've already spoken on the second point so I won't spend much time on it here. In brief: AWS costing makes you pay for what you use every month with no way to defer it for a quarter or two, which is completely not the on-prem cost model.

Black-box infrastructure

You don't know how it works.

You don't know for sure that it's being run by competent professionals who have good working habits.

You don't know for sure if they have sufficient controls in place to keep your data absolutely out of the hands of anyone but you, nosy employees included. SOC reports help, but still.

You may not get console access to your instances.

You're not big enough to warrant the white-glove treatment of a service contract that addresses your specific needs, or one that accepts any kind of penalties for non-delivery of service.

They'll turn your account off if you defer payment for a couple of months.

The SLA they offer on the service is all you're going to get. If you need more than that... well, you'll have to figure out how to re-engineer your own software to deal with that kind of failure.

Your monitoring system doesn't know how to handle the public cloud monitoring end-points.


These are all business items that you've taken for granted in running your own datacenter, or contracting for datacenter services with another company. Service levels aren't really negotiable, which throws some enterprises. You can't simply mandate higher redundancies in certain must-always-be-up single-system services; you have to re-engineer them to be multi-system or live with the risk. As any cloud integrator will tell you if asked, public cloud requires some changes to how you think about infrastructure, and that includes how you ensure it behaves the way you need it to.
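To make "re-engineer to be multi-system" a bit more concrete, here is a minimal sketch of the application-side approach: instead of demanding a higher SLA from the provider, the client retries against more than one endpoint and degrades gracefully. The endpoint names are made up for illustration; this is a sketch, not anybody's production failover code.

    import time
    import urllib.request

    # Hypothetical endpoints in two regions; names are illustrative, not real services.
    ENDPOINTS = [
        "https://api.us-east.example.com",
        "https://api.us-west.example.com",
    ]

    def fetch_with_failover(path, passes=3, backoff=2.0):
        """Try each endpoint in turn, backing off between full passes.

        The application absorbs the failure rather than relying on a
        negotiated SLA to make failures impossible.
        """
        for attempt in range(passes):
            for base in ENDPOINTS:
                try:
                    with urllib.request.urlopen(base + path, timeout=5) as resp:
                        return resp.read()
                except OSError:
                    continue  # this endpoint is down or unreachable, try the next
            time.sleep(backoff * (attempt + 1))
        raise RuntimeError("all endpoints failed; degrade gracefully here")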

Having worked for a managed-services provider and a SaaS site, I've heard plenty about the ways companies try to lever contracts, as well as about lazy payment of bills. If you're big enough (AWS is) you can afford to lose customers by being strict about on-time payment for services. Companies that habitually defer payment on bills for a month or two in order to game quarterly results will describe such services as 'unfriendly to my business'. Companies that expect to get into protracted SLA negotiations will find not nearly enough wiggle room, and will find the lack of penalties for SLA failures contrary to internal best practices. These are abuses that can be leveled at startup and mid-size businesses quite effectively, but not so much at the big public cloud providers.

It really does require a new way of thinking about infrastructure, at all levels. From finance, to SLAs, to application engineering, and to staffing. That's a big hill to climb.

Over the weekend I ran into an article on the Safari Books blog about writing job postings. The big recommendation they put forth was a simple one: the 'requirements' section on every job-posting is where you lose most of the qualified candidates for the job, who read it and elect not to apply. Wouldn't it be nice if you rephrased it to be inclusive? I tweeted about it.

Responses came in two versions:

  1. Favorites.
  2. People telling me I'm senior enough to know better about how the requirements game works.

Yes, I'm pretty senior now. However, in the 15+ years I've been a sysadmin I've job-hunted as one only three times.

Time 1: One app, one offer. It wasn't really a job-hunt so much as a pounce-kill.
Time 2: Was in the middle of a huge recession, I had a stale skill-set, and even in the technology I had skills in I didn't have experience with the hot newness. A startup took a chance on me. That search took better than two years.
Time 3: I was terminated and needed another job RIGHT NOW. It was also a hot market, and I had relevant skills. The firms that gave me offers (I had three) all were applied to in the first week of my search. It only took as long as it did to start working due to the Thanksgiving and Christmas holidays getting in the way of everything. That search took six weeks, of which only three were active search/apply/interview focused weeks.

One of the replies is very good at capturing the spirit of the requirements game as it exists now:

True, that's what the startup that hired me did. They needed a blend of Windows and Linux, and apparently I talked a good enough game they hired me even though I didn't exactly have much being-paid-for-it Linux experience at the time (this was one of the big reasons I left WWU, by the way; they wouldn't let me get out of the Windows box). That job posting? It had a 'Qualifications' section, and didn't mention operating system! This was my in!

I haven't done enough hiring or searching to know how flexible 'requirements' are. If I hit every point, I make sure to mention that in my cover-letter. If I hit all but one, I'm worried but if the one doesn't sound mission-critical I'll still drop an app on it. If I'm missing two... I'll apply only if I really want to work there, or I have some other clue that the 'requirements' are actually a wish-list (big hint: 20 items in the requirements list).

Here's the thing though. If you're suffering impostor syndrome, and I sure as hell was during the Time 2 search, it's extremely easy to talk yourself out of thinking you're good enough for a position.

Do not bother applying unless you have...

  • 6 years of Enterprise Linux administration.
  • 5 years of Python scripting development.
  • 5 years of Postgres administration.
  • 3 years of Chef.

That's what a 'Requirements' section looks like. It takes a sense of entitlement to see that list and add, "...or can talk us into taking you anyway" to the bolded text.

I have one of those bullet-points. However, I do have Ruby, MySQL, and Puppet. Is that enough to take a chance on the position, or are they dead set on not having to train a new-hire up on those things? Can't tell, and I'm not going to bother going to all the effort of crafting a resume and cover-letter just to be told 'Go away'.

Or maybe I tell my impostor syndrome to go hide in a corner somewhere and trust in my interview skills to win me a chance.

By changing the 'Requirements' away from a checklist of skills and towards the qualities you need in a new-hire, you remove a big barrier to application. You'll get apps from people on non-traditional career-paths, like the person with a Masters in History who spent the last six years doing statistical-analysis automation in AWS on a grant, and kept getting consults from other academic-computation people for automating things.

I keep hearing there is a real talent crunch for senior people these days. Doesn't it make sense to encourage applications, rather than discourage them?

I've seen this dynamic happen a couple of times now. It goes kind of like this.

October: We're going all in on AWS! It's the future. Embrace it.
November: IT is working very hard on moving us there, thank you for your patience.
December: We're in! Enjoy the future.
January: This AWS bill is intolerable. Turn off everything we don't need.
February: Stop migrating things to AWS, we'll keep these specific systems on-prem for now.
March: Move these systems out of AWS.
April: Nothing gets moved to AWS unless it produces more revenue than it costs to run.

What's prompting this is a shock that is entirely predictable, but manages to penetrate the reality distortion field of upper management because the shock is to the pocketbook. They notice that kind of thing. To illustrate what I'm talking about, here is a made-up graph showing technology spend over a course of several years.

[Graph: BudgetType-AWS.png, a made-up chart of technology spend over several years]

The AWS line actually results in more money over time, as AWS does a good job of capturing costs that the traditional method generally ignores or assumes are lost in general overhead. But the screaming doesn't happen at the end of four years when they run the numbers; it happens in month four, when the ongoing operational spend after build-out is done is w-a-y over what it used to be.

The spikes for traditional on-prem work are for forklifts of machinery showing up. Money is spent, new things show up, and they impact the monthly spend only infrequently. In this case, the base-charge increased only twice over the time-span. Some of those spikes are for things like maintenance-contract renewals, which don't impact base-spend one whit.

The AWS line is much less spiky, as new capabilities are absorbed into the base budget on an ongoing basis. You're no longer dropping $125K in a single go; you're dribbling it out over the course of a year or more. AWS price-drops mean that monthly spend actually goes down a few times.

Pay only for what you use!

Amazon is great at pointing that out, and highlighting the convenience of it. But what they don't mention is that by doing so, you will learn the hard way what it is you really use. The AWS Calculator is an awesome tool, but if you don't know how your current environment behaves, using it to predict what you'll end up spending is like throwing darts at a wall. You end up obsessing over small line-item charges you've never had to worry about before (how many IOPS do we do? Crap! I don't know! How many thousands will that cost us?), and missing the big items that nail you (Whoa! They meter bandwidth between AZs? Maybe we shouldn't be running our Hadoop cluster in multi-AZ mode).
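As a back-of-the-envelope illustration of why those two line-items surprise people, here's a tiny sketch. The per-unit rates are placeholders, not current AWS pricing; the point is that the inputs (sustained IOPS, inter-AZ gigabytes) are exactly the numbers most shops don't know about themselves.

    # Placeholder rates for illustration only -- pull real ones from the AWS Calculator.
    IOPS_RATE_PER_MILLION = 0.10   # assumed $ per million I/O requests
    INTER_AZ_RATE_PER_GB = 0.01    # assumed $ per GB transferred between AZs

    iops_sustained = 2_000          # "how many IOPS do we do? Crap! I don't know!"
    inter_az_gb_per_day = 500       # chatty multi-AZ Hadoop replication traffic

    io_cost = iops_sustained * 86_400 * 30 / 1_000_000 * IOPS_RATE_PER_MILLION
    transfer_cost = inter_az_gb_per_day * 30 * INTER_AZ_RATE_PER_GB

    print(f"I/O requests:      ${io_cost:,.2f}/month")       # ~$518 at these made-up rates
    print(f"Inter-AZ transfer: ${transfer_cost:,.2f}/month")  # ~$150 at these made-up rates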

There is a reason that third party AWS integrators are a thriving market.

Also, this 'what you use' is not subject to Oops exceptions without a lot of wrangling with Account Management. Have something that downloaded the entire EPEL repo twice a day for a month, and you only learned about it when your bandwidth charge was 9x what it should be? Too bad; pay up or they'll turn the account off.

Unlike the forklift model, you pay for it every month without fail. If you have a bad quarter, you can't just not pay the bill for a few months and true-up later. You're spending it, or they're turning your account off. This takes away some of the cost-shifting flexibility the old style had.

Unlike the forklift model, AWS prices its stuff assuming a three-year turnover rate. Many companies have a 5-to-7-year lifetime for IT assets: three to four years in production, with an afterlife of two to five years in various pre-prod, mirror, staging, and development roles. The cost of those assets therefore amortizes over 5-9 years, not 3.
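Here's a quick worked example of what that difference in amortization does to the "equivalent monthly cost" of the same forklift purchase, using the $125K figure from above; the lifetimes are the only variable.

    # The same $125K forklift purchase, expressed as an equivalent monthly cost
    # over the lifetime the hardware actually gets used for.
    forklift_price = 125_000

    for lifetime_years in (3, 5, 7, 9):
        monthly = forklift_price / (lifetime_years * 12)
        print(f"{lifetime_years}-year lifetime: ${monthly:,.0f}/month equivalent")

    # 3 years: ~$3,472/month  (the turnover rate AWS pricing assumes)
    # 7 years: ~$1,488/month  (the lifetime many shops actually get)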

Predictable spending, at last.

Hah.

Yes, it is predictable over time given accurate understanding of what is under management. But when your initial predictions end up being wildly off, it seems like it isn't predictable. It seems like you're being held over the coals.

And when you get a new system into AWS and the cost forecast is wildly off, it doesn't seem predictable.

And when your system gets the rocket-launch you've been craving and you're scaling like mad; but the scale-costs don't match your cost forecast, it doesn't seem predictable.

It's only predictable if you fully understand the cost items and how your systems interact with it.

Reserved instances will save you money

Yes! They will! Quite a lot of it, in fact. They let a company go back to the forklift method of cost-accounting, at least for part of the bill. I need 100 m3.large instances on a three-year up-front model? OK! Monthly charges drop drastically, and the monthly spend chart begins to look like the old model again.

Except.

Reserved instances cost a lot of money up front. That's the point; that's the trade-off for getting a cheaper annual spend. But many companies get into AWS because they see it as cheaper than on-prem, which means they're sensitive to one-month cost-spikes, which in turn means buying reserved instances doesn't happen and they stay on the high-cost on-demand model.
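The shape of that trade-off, with deliberately made-up prices (these are not real AWS rates), looks something like this:

    # Made-up prices to show the shape of the trade-off, not real AWS rates.
    instances = 100
    on_demand_hourly = 0.10                 # assumed $/hour per instance
    reserved_upfront_per_instance = 1_500   # assumed 3-year all-up-front price

    on_demand_total = instances * on_demand_hourly * 24 * 365 * 3
    reserved_total = instances * reserved_upfront_per_instance

    print(f"On-demand over 3 years:   ${on_demand_total:,.0f}, dribbled out monthly")
    print(f"Reserved, 3-year up-front: ${reserved_total:,.0f}, in one eye-watering invoice")

The total is lower, but it arrives as exactly the kind of one-month spike a "cloud is cheaper" budget wasn't built to absorb.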

AWS is Elastic!

Elastic in that you can scale up and down at will, without week- or month-long billing, delivery, and integration cycles.

Elastic in that you have choice in your cost accounting methods. On-demand, and various kinds of reserved instances.

It is not elastic in when the bill is due.

It is not elastic with individual asset pricing, no matter how special you are as a company.


All of these things trip up upper, non-technical management. I've seen it happen three times now, and I'm sure I'll see it again at some point.

Maybe this will help you in illuminating the issues with your own management.

The No Asshole Rule is a very well-known criterion for building a healthy workplace. If you've worked at a newer company (say, incorporated within the last 8 years) you probably saw this or something much like it in the company guidelines or orientation. What you may not have seen were the principles behind it.

Someone is an asshole if the answer to these two questions is 'yes':

    1. After encountering the person, do people feel oppressed, humiliated or otherwise worse about themselves?
    2. Does the person target people who are less powerful than them?

If so, they are an asshole and should be gotten rid of.

And if we all did this, the tech industry would be a much happier place. It's a nice principle.

The problem is that the definition of 'asshole' is an entirely local one, determined by the office culture. This has several failure-modes.

Anyone who doesn't agree with the team-leader/boss is an asshole.

Quite a perversion of the intent of the rule, since this definitely harms those less powerful than the oppressor, but it happens all the same. Dealing with it requires a higher order of power to come down on the person, which doesn't happen if the person is the owner/CEO. Employees take it into their own hands by going to work somewhere else, for someone who probably isn't an asshole.

All dissent to power must be politely phrased and meekly accepting of rejection.

In my professional opinion, sir, I believe the cost-model presented here is overly optimistic in several ways.

I have every confidence in it.

Yes sir.

The thinking here is that reasoned people talk to each other in reasoned ways, and raised voices or interruptions are a key sign of an unreasonable person. Your opinion was tried, and adjudged lacking; lose gracefully.

In my professional opinion, sir, I believe the cost-model presented here is overly optimistic in several ways.

I have every confidence in it.

The premise you've built this on doesn't take into account the ongoing costs for N. Which throws off the whole model.

Mr. Anderson, I don't appreciate your tone.

It's a form of tone policing.

All dissent between team-mates must be polite, reasoned, and accountable to everyone's feelings.

Sounds great. Until you get a tone cop in the mix who uses the asshole-rule as a club to oppress anyone who disagrees with them.

I'm pretty sure this new routing method introduces at least 10% more latency to that API path. If not more.

I worked on that for a week. I'm feeling very intimidated right now.

😝. Sorry, I'm just saying that there are some edge cases we need to explore before we merge it.

😟

Okay, let's merge it into Integration and run the performance suite on it.

This is a classic bit of office-politics judo. Because no one wants to be seen as an asshole, if you can make other people feel like one they're more likely to cave on contentious topics.

Which sucks big-time for people who actually have problems with something. Are they a political creature, or are they owning their pain?


The No Asshole Rule is a great principle in the abstract, but it took the original author 224 pages to communicate the whole thing the way he intended it. That's 223.5 pages longer than most people have the attention-span for, especially in workplace orientations, so most people are never going to be clued into the nuances. It is inevitable that a 'no asshole rule' enshrined in a corporate code-of-conduct is going to be defined organically through the culture of the workplace, and not by any statement of principles.

That statement of principles may exist, but the operational definition will be defined by the daily actions of everyone in that workplace. It is going to take people at the top using the disciplinary hammer to course-correct people into following the listed principles, or it isn't going to work. And that hammer itself may include any and all of the biases I laid out above.

Like any code of conduct, the or else needs to be defined, and follow-through demonstrated, in order for people to give it appropriate attention. Vague statements of principles like no assholes allowed or be excellent to each other are not enforceable codes, as there is no way to objectively define them without acres of legalese.

I'm not a developer but...


...I'm sure spending a lot of time in code lately.

Really. Over the last five months, I'd say that 80% of my normal working hours have been spent grinding on puppet code, or training others so they can maybe do some puppet stuff themselves. I've even got some continuous-integration work in, building a trio of sanity-tests for our puppet infrastructure (a rough sketch of the harness follows the list):

  • 'puppet parser validate' returns OK for all .pp files.
    • Still on the 'current' parser, we haven't gotten as far as future/puppet4 yet.
  • puppet-lint returns no errors for the modules we've cleared.
    • This required extensive style-fixes before I put it in.
  • Catalogs compile for a certain set of machines we have.
    • I'm most proud of this, as this check actually finds dependency problems unlike puppet-parser.
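The harness is basically a small script the CI job runs; here is a sketch of it. The paths, the cleared-module list, and the node names are placeholders rather than our real layout, and the catalog-compile invocation is the Puppet 3.x-era one, so adjust for your own setup.

    #!/usr/bin/env python
    """CI sanity checks for a puppet control repo -- a sketch, not our exact script."""
    import glob
    import subprocess
    import sys

    failures = 0

    # 1. 'puppet parser validate' must return OK for every .pp file.
    for manifest in glob.glob("modules/**/*.pp", recursive=True):
        if subprocess.call(["puppet", "parser", "validate", manifest]) != 0:
            failures += 1

    # 2. puppet-lint must pass for the modules we've already style-cleaned.
    for module in ("ntp", "ssh", "base"):   # hypothetical cleared modules
        if subprocess.call(["puppet-lint", "--fail-on-warnings", "modules/" + module]) != 0:
            failures += 1

    # 3. Catalogs must compile for a representative set of machines.
    for node in ("web01.example.com", "db01.example.com"):   # hypothetical nodes
        if subprocess.call(["puppet", "master", "--compile", node]) != 0:
            failures += 1

    sys.exit(1 if failures else 0)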

Completely unsurprisingly, the CI stuff has actually caught bugs before they got pushed to production! Whoa! One of these days I'll be able to grab some of the others and demo this stuff, but we're off-boarding a senior admin right now and the brain-dumping means that won't happen for a few weeks.

We're inching closer to getting things rigged so that a passing build in the 'master' branch triggers an automatic deployment. That will take a bit of thought, as some deploys (such as class-name changes) require coordinated modifications in other systems.

Because I get to define what's 'possible', and anything is possible given enough time, management backing, and an unlimited budget.

If I don't have management backing, I will decide on my own how to fit this new ASAP in amongst my other ASAP work and the work that has actual deadlines attached to it.

If this ASAP has a time/money tradeoff, I need management backing to tell me which way to go, and what other work to let slide in order to get the time needed.


In the end, there are only a few priority levels that people actually use.

  1. Realtime. I will stand here until I get what I need.
  2. ASAP.
  3. On this defined date or condition.
  4. Whenever you can get to it.

Realtime is a form of ASAP, but it's the kind of ASAP where the requester is highly invested in it and will keep statusing and may throw resources at it in order to get the thingy as soon as actually possible. Think major production outages.

ASAP is really 'as soon as you can get to it, unless I think that's not fast enough.' For sysadmin teams where the load-average is below the number of processors this can work pretty well. For loaded sysadmin teams, the results will not be to the liking of the open-ended deadline requestors.

On this defined date or condition is awesome, as it gives us expectations of delivery and allows us to do queue optimization.

Whenever you can get to it is like nicing a process. It'll be a while, but it'll be gotten to. Eventually.

"ASAP, but no later than [date]" is a much better way of putting it. It gives a hint to the queue optimizer as to where to slot the work amongst everything else.

Thank you.

Paternity leave and on-call


It all started with this tweet.

Which you need to read (Medium.com). Some pull-quotes of interest:

My manager probably didn't realize that "How was your vacation" was the worst thing to ask me after I came back from paternity leave.

Patriarchy would have us believe that parenting is primarily the concern of the mother. Therefore paternity leave is a few extra days off for dad to chillax with his family and help mom out.

Beyond a recovery time from pregnancy, much of parental leave is learning to be a parent and adjusting to your new family and bonding with the baby. I can and did bond with the baby, but not as much as my female coworkers bonded with their babies.

I should also state, that I don't just want equality, I want a long time to bond with my child. Three months or more sounds nice. Not only can I learn to soothe him when he's upset, put him to sleep without worrying about being paged, but I can be around when he does the amazing things babies do in their first year: learning to sit, crawl, eat, stand and even walk.

At my current employer, I was shocked to learn that new dads get two weeks off.

Two.

At my previous startup, paternal leave was under the jurisdiction of the 'unlimited vacation' policy. Well...

Vacations are important. My friends would joke that the one way to actually be able to take vacations was to keep having children. Here the conflation was in jest, and also a caricature of the reality of vacations at startups.

We had a bit of a baby-boom while I was there. Dads who showed up less than two weeks in were glared at and told to go home. After that, most of them worked part-time for a few weeks and slowly worked back up to full time.

This article caused me to tweet...

The idea here is that IT managers who work for a company like mine, with a really small amount of parental leave, do have a bit of power to give Dad more time with the new kid: take him off the call rota for a while. A better corporate policy is the ideal, but this is the kind of local fix that just might help. Dad doesn't have to live by the pager while getting to know the new kid.

Interesting idea, but not a great one.

Which is a critique of the disaster-resilience of 3-person teams. I was on one, and we had to coordinate Summer Vacation Season to ensure we had two-person coverage for most of it; if one-person coverage was unavoidable, we kept it to a couple of days at most. None of us had kids while I was there (the other two had teenagers, and I wasn't about to start), so we didn't get to live through a paternity-leave-sized hole in coverage.

Which is the kind of team I'm on right now, and why I thought of the idea. We have enough people that a person sized hole, even a Sr. Engineer sized hole, can be filled for several to many weeks in the rotation.

That's the ideal route, though, and it touches on a very human point: if you're at a company where you're expected to always check mail or can expect pages off-hours, it doesn't matter whether you're in the official call-rotation. That's a company-culture problem independent of the on-call rotation.

My idea can work, but it takes the right culture to pull off. Extended leave would be much better, and is the kind of thing we should be advocating for.

You should still read the article.

The project is done, and you have a monitoring system you like!

How, how do you keep liking it?

Like all good things, it takes maintenance. There are a few processes you should have in place to provide the right feedback loops to keep liking your shiny new monitoring environment.

  • Questions about monitoring should be in your incident retrospective process.
  • A periodic review of active alarms to be sure you still really want them.

Implementing these will provide both upward pressure to expand it into areas it needs to go, and downward pressure to get rid of needless noise.

There is more to a monitoring system than alarms and reports. Behind all of those cries for action are a lot of data. A lot of data. So much data, that you face scaling problems when you grow because all of your systems generate so much monitoring data.

Monitor everything!
-- Boss

A great idea in principle, but falls apart in one key way...

"Define 'everything', please"

'Everything' means different things to different people. It just isn't feasible to track every monitorable on everything with one-second granularity. Just about everyone will want to back away from that level of monitor-all-the-things. But what is the right fit?

It depends on what you want to do with it. Data being tracked supports four broad categories of monitoring.

  1. Performance
  2. Operational
  3. Capacity
  4. SLA

Performance Monitoring

This kind of monitoring tends to have short intervals between polls. It could be five minutes, but it may be as little as every second. That kind of monitoring creates a deluge of data, and may only be done when diagnosing exposed problems or doing in-depth research on the system. It's not run all the time, unless you really do care about per-second changes in the state of something.

This kind of monitoring is defined by a few attributes:

  • High granularity. You poll a lot.
  • Low urgency. You're doing this because you're looking into something, not because it's down.
  • Occasional need. You don't run it all the time, and not on a schedule.

Everything: 1 second granularity for CPU, IOPS, and pagefaults for a given cluster.
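To put a number on "deluge", here is the arithmetic for that example; the host count is an assumption, picked only to show the scale.

    # How fast 1-second granularity turns into a data problem.
    hosts = 100                 # assumed cluster size
    metrics_per_host = 3        # CPU, IOPS, pagefaults, per the example above
    seconds_per_day = 86_400

    datapoints_per_day = hosts * metrics_per_host * seconds_per_day
    print(f"{datapoints_per_day:,} datapoints/day")   # 25,920,000 -- for one cluster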


Operational Monitoring

The kind we're all familiar with. This is the kind of monitoring that tends to emit alarms for on-call rotations.

  • Medium granularity. Every 5 minutes, that kind of thing.
  • High urgency. Fast responses are needed.
  • Constant need. You run it all the time.

Everything: Every disk-event, everywhere.


Capacity Monitoring

Some of the alarms you have defined already may be capacity alarms, but capacity monitoring just tracks how much you are using of what you have. Some of this stuff doesn't change very fast.

  • Low granularity. It may only get checked once a day.
  • Low urgency. Responding in a couple of days may be fast enough. If not slower.
  • Periodic need. Reviewed once in a while, probably on a schedule.

Everything: Anything that has a "Max" size value greater than the "Current" value.


SLA Monitoring

I've already gone on at length about SLAs, but this is the monitoring that directly supports the SLA pass/fail metrics. I break it apart from the other types because of how it's accessed.

  • Low granularity. Some metrics may be medium, but in general SLA trackers are over-time style.
  • Medium urgency. If a failing grade is determined, response needs to happen. How fast, depends on what's not going to get met.
  • Continual and Periodic need. Some things will be monitored continually, others will only be checked on long schedules; possibly once a week, if not once a month.

Everything: Everything it takes to build those reports.
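One way to keep these four straight is to encode them as polling policy, so the word "everything" always comes with an interval attached. The intervals and retention windows below are illustrative defaults, not a standard; a minimal sketch:

    # Illustrative defaults, not a standard -- tune to your own environment.
    MONITORING_POLICY = {
        "performance": {"interval_s": 1,      "retention_days": 7,     "pages_oncall": False},
        "operational": {"interval_s": 300,    "retention_days": 90,    "pages_oncall": True},
        "capacity":    {"interval_s": 86_400, "retention_days": 730,   "pages_oncall": False},
        "sla":         {"interval_s": 3_600,  "retention_days": 1_095, "pages_oncall": False},
    }

    def polling_interval(category):
        """Look up the polling interval for a category, defaulting to operational."""
        return MONITORING_POLICY.get(category, MONITORING_POLICY["operational"])["interval_s"]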


Be aware that 'everything' is context-sensitive when you're talking with people and don't freak out when a grand high executive says, "everything," at you. They're probably thinking about the SLA Monitoring version of everything, which is entirely manageable.

Don't panic, and keep improving your monitoring system.
