June 2021 Archives

Many under-represented minorities (URMs) in tech quickly learn that who your manager and team-members are matters more than overall company culture. Majority people often learn this too, but non-conformance with stereotype means this lesson is often learned faster and earlier. The friction that comes with non-conformance is like any workplace friction -- it's noticed, and maybe commented on. A team with a good culture will adjust to accommodate (within reason); a team with a bad culture will use techniques to erase the friction and increase conformance.

Some people will never be able to achieve full conformance due to visible differences. Others can get really close, so long as they edit how they present to the world. For those who can't visibly conform due to skin color, visible disability, or other reasons, sometimes conforming extra hard in other ways can make up for it. You've maybe heard these pieces of advice, if not given to you directly, then second-hand:

"You have to dress better than them so they take you seriously."

"You have to learn the King's English to go anywhere."

"You can't show too much skin or all they'll think of is fucking you, and not what you actually do."

"No one can tell you're in a wheelchair on Zoom."

"Learn the Christian holidays, you'll need to know the basics to get by."

"Always turn on video so they can see you're smiling."

"Never raise your voice. They can, you can't. Soon as you do, they won't respect you."

All of this paring away of yourself can be greatly reduced if your team culture, and the culture your manager builds, is good. You will spend less time working on conformance, and more on what you want to be doing. These are the people you work with most of the hours you're at work, so this is the culture that matters the most to you. This shit matters so much.

Which brings me to one of the big pathologies in tech hiring at big companies. We all know the 20-plus hours of "phone" screens, live coding challenges, take-home tests, and six-hour marathon interview panels have some problems, but there is a thing that often happens after the candidate has already invested half a work-week trying to get a job with a specific manager/team:

We like what we see, but we think you'll be a better fit with Manager UnknownBozo as a Level 2 than Manager DudeYouKnow as a Level 3.
$138K/year, 5% bonus oppo, and $50K in stock, with a 1 year cliff and 4 year vest. You have three days to reply or we'll rescind the offer.

The old bait-and-switch. You put in all that fucking work, including several interviews with DudeYouKnow and his team, only to be told that you're going to be given to someone else you've never even heard of. Thank you ever so much for wasting my fucking time (respect for ExampleCorp drops).

Or as hiring managers like to think of it, making sure a qualified candidate doesn't get away.

No, really. That's what they think.

I really like this candidate, but they're not quite there for the team they requested. But we have open requisitions elsewhere, let's try to see if they'll accept one? We put in all that work to qualify them.

As I've gotten more senior, seen my share of team-based trauma, trauma-recovery, helped others get over their trauma, and watched URMs realize that the team I'm on is actually not a hellpit that forces compliance, I'm really feeling this right now. I'm damned picky over who will be my manager, because I've seen what bad ones do to teams. A good corporate culture is a nice clue that their teams will also be mostly-nice, but the specific manager still matters more in my calculus.

For URMs this calculus happens all the time. For companies looking to improve their minority hiring, getting a reputation for bait-and-switching offers will hurt your goals. The old axiom, "The candidate is interviewing you as much as you are interviewing them," is so true; only for URMs and senior ICs, we're also interviewing the team, not just the company.

I'm looking to work for DudeYouKnow, who happens to work at ExampleCorp. Not ExampleCorp and DudeYouKnow if I can get him. So if I can't work for DudeYouKnow? I'm going to hate you so much and reject the offer, or insist on another round of interviews so I can re-interview the new team.

The following log-line in my Elasticsearch logs confused me. The format is a bit different than what you'll find in yours; I added some line-breaks to improve readability.

failed to execute [indices:monitor/stats] on node [L3JiFxy5TTuBiGXH_R_dLA]
[ip-192-0-2-125.prod.internal][] [indices:monitor/stats[n]]
Caused by: org.elasticsearch.common.breaker.CircuitBreakingException: [parent]
Data too large, data for [<transport_request>]
would be [9317804754/8.6gb], which is larger than the limit of [9295416524/8.6gb],
usages [

There was just about no search-engine reachable content when I ran into this problem. Decoding this one took some sleuth-work, but the key break came when I found the circuit breaker documentation for Elasticsearch. As the documentation says, the circuit breakers are there to backstop operations that would otherwise run an Elasticsearch process out of memory. As the log-line suggests, there are four types of circuit breakers in addition to a 'parent' one. All four are defined as a percentage of HEAP:

  • Request: The maximum memory a request is allowed to consume. This is different than the size of the request itself, because it includes memory used to compute aggregations.
  • Fielddata: The maximum memory threshold for loading a field's data into memory. So, if you have a "hosts" field with 1.2 million unique values in it, you take a memory hit for each unique. Or, if you have 5000 fields on each request, each field needs to be loaded into memory. Either problem can trigger this.
  • In Flight: The maximum memory of all in-process requests. If a node is too busy doing work, this can fire.
  • Accounting: The maximum memory usable by items that persist after a request is completed, such as Lucene segment memory.
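You can see how close each of these breakers is to firing on a live node through the nodes-stats API. Here is a minimal sketch of reading that data -- the field names match the real `GET _nodes/stats/breaker` response, but the numbers, node key, and hostname below are invented for illustration:

```python
import json

# Hypothetical, trimmed response from `GET _nodes/stats/breaker`.
# Real responses have one entry per node and a few more fields per breaker.
sample = json.loads("""
{
  "nodes": {
    "node-a": {
      "name": "ip-192-0-2-125.prod.internal",
      "breakers": {
        "fielddata":          {"limit_size_in_bytes": 5308416000,  "estimated_size_in_bytes": 5190000000, "tripped": 3},
        "request":            {"limit_size_in_bytes": 7962624000,  "estimated_size_in_bytes": 120000000,  "tripped": 0},
        "in_flight_requests": {"limit_size_in_bytes": 13271040000, "estimated_size_in_bytes": 45000000,   "tripped": 0},
        "accounting":         {"limit_size_in_bytes": 13271040000, "estimated_size_in_bytes": 910000000,  "tripped": 0},
        "parent":             {"limit_size_in_bytes": 9295416524,  "estimated_size_in_bytes": 9317804754, "tripped": 12}
      }
    }
  }
}
""")

# Flag any breaker that has tripped, or is running above 90% of its limit.
hot = []
for node in sample["nodes"].values():
    for name, breaker in node["breakers"].items():
        used = breaker["estimated_size_in_bytes"] / breaker["limit_size_in_bytes"]
        if breaker["tripped"] > 0 or used > 0.90:
            hot.append((node["name"], name, used))

for host, name, used in hot:
    print(f"{host}: {name} breaker at {used:.0%}")
```

With the invented numbers above, this flags the fielddata and parent breakers -- which matches the shape of the problem in the log-line.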

In the log-line I posted above we see three things:

  • Field-data is by far the largest component at 6.1GB
  • The total usages add up to 9317804754 bytes (about 8.68GB), which is larger than the limit of 9295416524 bytes -- both of which the log truncates to 8.6gb
  • We hit the parent breaker.

The parent circuit-breaker is a bit confusing, but out of the box (as of ES 7.x) it is 70% of HEAP. So, if 8.6GB is 70%, then HEAP is about 12.3GB. This told me which nodes were having the problem.
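That back-calculation is just dividing the logged limit by the breaker's fraction of HEAP:

```python
# Back out HEAP size from the parent breaker limit in the log-line.
# Assumes the parent breaker is at its 70%-of-HEAP setting.
parent_limit_gb = 8.6      # the limit the log reported
parent_fraction = 0.70     # indices.breaker.total.limit

heap_gb = parent_limit_gb / parent_fraction
print(f"HEAP is roughly {heap_gb:.1f}GB")   # roughly 12.3GB
```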

The fix for this isn't nice. I needed to do two things:

  1. Increase the parent circuit-breaker to 80% to get things moving again (the indices.breaker.total.limit cluster setting). And clean up all the damage caused by hitting this breaker. More on that in a bit.
  2. Look deeply into my Elasticsearch schema to identify field-sprawl and fix it. As this was our logging cluster, we had a few Java apps logging deeply nested JSON datastructures, causing thousands of fields to be created, most of them empty.
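Step 1 is a one-line cluster settings change. This sketch just builds the JSON body you would `PUT` to `_cluster/settings` -- point your favorite HTTP client at your own cluster:

```python
import json

# Body for `PUT _cluster/settings` to raise the parent breaker to 80%.
# Transient settings revert on a full-cluster restart, which fits a
# temporary escape hatch like this; use "persistent" to make it stick.
body = {
    "transient": {
        "indices.breaker.total.limit": "80%"
    }
}
print(json.dumps(body, indent=2))
```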

There are a few reasons Elasticsearch sets a limit on the maximum fields per index (index.mapping.total_fields.limit), and we ran into one such reason: field-sprawl caused by JSON-deserializing the logging from (in this case) Java applications. Raising the circuit-breaker only goes so far: the Compressed Ordinary Object Pointers feature of Java puts a functional HEAP ceiling at around 30GB. Throwing more resources at it has a ceiling, so you will have to fix the problem sometime.
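To find where the sprawl lives, you can count leaf fields in an index mapping (the `properties` tree returned by `GET <index>/_mapping`). A rough sketch -- the sample mapping is invented, and note that Elasticsearch's own field limit also counts object fields and multi-fields, so this undercounts slightly:

```python
def count_fields(properties: dict) -> int:
    """Count leaf fields in a mapping's 'properties' tree."""
    total = 0
    for field in properties.values():
        if "properties" in field:      # nested object: recurse into it
            total += count_fields(field["properties"])
        else:                          # leaf field
            total += 1
    return total

# Invented sample. A real logging index with field-sprawl shows thousands
# of leaves here, mostly under the deeply nested Java-app keys.
sample_mapping = {
    "message": {"type": "text"},
    "host":    {"properties": {"name": {"type": "keyword"},
                               "ip":   {"type": "ip"}}},
    "java":    {"properties": {"stack_trace": {"type": "text"}}},
}

print(count_fields(sample_mapping), "leaf fields")   # 4 leaf fields
```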

In our case, running nodes with 30GB of HEAP is more expensive than we want to pay so fixing the problem now is what we're doing. Once we get the schema issue fixed, we'll lower the parent breaker back to 70%.

The symptom that told us we had a problem was a report from users that they couldn't search more than a day in the past (we rotate logging indexes once a day) in spite of rather more days of indexes being available. Going to Index Management in Kibana and looking at indexes, we saw that only a few indexes had index stats available; the rest had no details about document count or overall index size.

Using the Tasks API we got a list of all tasks in process, and found a large number of "indices:monitor/stats" jobs were failing. This task is responsible for updating the index statistics Kibana uses in the Index Management screens. Without those statistics Kibana doesn't know if those indexes are usable in queries.
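Here is a sketch of that triage, assuming a response shaped like what `GET _tasks?actions=indices:monitor/stats*` returns -- the node key, task IDs, and timings below are invented:

```python
import json

# Hypothetical, trimmed `GET _tasks` response. The structure mirrors the
# real API (nodes -> tasks -> action); the specific values are made up.
sample_tasks = json.loads("""
{
  "nodes": {
    "node-a": {
      "tasks": {
        "node-a:12345": {"action": "indices:monitor/stats",    "running_time_in_nanos": 87000000000},
        "node-a:12399": {"action": "indices:monitor/stats[n]", "running_time_in_nanos": 86000000000},
        "node-a:12402": {"action": "indices:data/read/search", "running_time_in_nanos": 250000000}
      }
    }
  }
}
""")

# Pull out just the index-stats tasks, which were the ones piling up.
stats_tasks = [
    task
    for node in sample_tasks["nodes"].values()
    for task in node["tasks"].values()
    if task["action"].startswith("indices:monitor/stats")
]
print(f"{len(stats_tasks)} stats tasks in flight")
```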

Cleaning up after this was complicated by a node failover that happened while the cluster was in this state. Elasticsearch dutifully promoted Replica shards to Primary shards, but mostly couldn't create new Replica shards because those operations hit the circuit-breaker. Did you know that Elasticsearch has an internal retry-max when attempting to create new shards? I do now.

Even after getting the parent breaker reset to a higher value, those shards did not recreate: their retry-max had been hit. The only way to get those shards created was to close the affected indexes (using the indexname/_close API) and re-open them. That reset the retry counter, and the shards recreated.
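The close/re-open dance is just two calls per affected index. This sketch only builds the calls; the base URL and index name are placeholders, and you'd send them with whatever HTTP client you like:

```python
# Build the close/re-open calls that reset the shard-allocation retries.
base = "http://localhost:9200"     # assumption: wherever your cluster is
index = "logstash-2021.06.14"      # example index name

calls = [
    ("POST", f"{base}/{index}/_close"),
    ("POST", f"{base}/{index}/_open"),
]
for method, url in calls:
    print(method, url)
```

For what it's worth, Elasticsearch also documents `POST /_cluster/reroute?retry_failed=true` as a way to retry shard allocations that hit the retry ceiling, which avoids taking the index offline.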