War rooms vs. deep investigations

256 points by ingve 8 months ago

yuliyp 8 months ago

A war room / call is for coordination. If you need the person draining the bad region to know that "oh that other region is bad, so we can't drain too fast" or "the metrics look better now".

For truly understanding an incident? Hell no. That requires heads down *focus*. I've made a habit of taking a few hours later in the day after incidents to calmly go through and try to write a correct timeline. It's amazing how broken peoples' perceptions of what was actually happening at the time are (including my own!). Being able to go through it calmly, alone provided tons of insights.

It's this type of analysis that leads to proper understanding and the right follow-ups rather than knee-jerk reactions.

thwarted 8 months ago

It took me seven weeks (not full time, but from the initial incident to the final publishing) to do the research and write up for a recent event. This included in-person interviews, data correlation, reading code, and revision control spelunking across multiple repositories to understand the series of events and decisions that led to the event, some of them months earlier. Some people were advocating "get it out because we have to move on", which I pushed back on. Once published, the feedback was positive and some folks acknowledged that knee-jerk follow-up reactions would have made things worse. But to get to the point where the post-incident review is valuable someone has to put in the actual work and time to make it so. It should be a learning experience, not a checking a box; otherwise, we're just spinning our wheels without making any progress.
master_crab 8 months ago

This. People keep commenting about it being performative. That’s orthogonal to its purpose. Even the original blogpost points out the limitation of singular focused effort without acknowledging it. It took the author weeks to figure out the actual issue.
If FB had been down that long, they’d be out of business.
DylanDmitri 8 months ago

I've been in some good coordinating calls for widespread incidents. Many unique individuals (15+) talked in a ten minute period, sharing context on what their teams were seeing, what re-meditations had worked for them, etc..

belval 8 months ago

> Could I run my terminals in there? Yes. Did I? Yes, for a while. Was I effective? Not really. I missed my desk, my normal chair, my big Thunderbolt monitor, my full-size (and yet entirely boring) keyboard, and a relatively odor-free environment.

Not Meta but at Amazon I always felt like war rooms are a place for some leader to scream at you and not much else. The reality is that debugging some retry storm, resource exhaustion or whatever won't happen in a room with 18 people talking over one another.

Give me a meeting link, I'll join and provide info as I find it, but this type of sweaty hackathon-style all-hands-on-deck was never productive for me.

Hasu 8 months ago

> I always felt like war rooms are a place for some leader to scream at you and not much else. The reality is that debugging some retry storm, resource exhaustion or whatever won't happen in a room with 18 people talking over one another.
I once walked out of a war room (at a much smaller company that I wouldn't be at for much longer) that had devolved into finger-pointing and blame games. Half an hour later, my boss came out to find out what I was doing and I pointed at my screen and said, "This. This is what's wrong. Ship my fix and we're done here." The entire war room came to my desk to see and discuss the fix, which we shipped, which solved the issue.
At my next job, I had to hold back laughter when the VP of Engineering, who was pushing mob programming, said, "Think about it. When we have an incident, when something is really important, what do we do? We all get in a room together. No one leaves the war room to go solve the incident on their own."
- bloomingkales 8 months ago
  
  "This. This is what's wrong. Ship my fix and we're done here."
  Lol. This is not hyperbole. Just about everyone has several stories like this, and they are quite hilarious in their utter absurdity. It's like these people get possessed by the spirit of Gordon Gekko in that exact moment and must absolutely play out the role to the tee. Then they become unpossessed and go Skiing on weekends.
  - aledalgrande 8 months ago
    
    In my experience it's only (and exactly) leaders without tech nor people skills that do this. Have experienced both good and bad. A world of difference.
nine_zeros 8 months ago

>Not Meta but at Amazon I always felt like war rooms are a place for some leader to scream at you and not much else.
It is for the some "leader". The vast large tech industry is filled with phony leaders who don't understand how the job is done and what makes the doers tick.
But they occupy the place of "leadership". They must be seen as doing something. So they are doing the something that they can - scream at people in a locked room.
If they could actually solve technical problems or talk to their bosses like a real engineering leader, they would. But they literally are incapable of doing so.
So war rooms and BS performative art it is.
- teeray 8 months ago
  
  > It is for the some "leader".
  Exactly. It’s so the leader can ask “do we have an update?” every 10 minutes when nothing has changed.
- bloomingkales 8 months ago
  
  The role of a leader is an age old role. When someone is thrown into leadership, I do believe a lot of adrenaline kicks in. You begin acting as if you are a leader similar to how a parent has parent senses and will run into a street to save any kid from a car (poor example, any human should, but hopefully you get my point). I think what you get in a war room is that primal "phenomena" of "oh shit I'm the leader now". You have to weather the primal emotions, and get a cool head back on to fulfill the leadership role.
  If it's your first time, then yeah, you will probably handle it like a dick (or twat). You gotta take on an ancient role with humility.
- varelse 8 months ago
  
  [dead]
alabastervlog 8 months ago

Leaders are really into playing pretend and it seems to just get worse the higher they are on the ladder.
Like, literally, an effective way to sell to them is to make them feel like they’re in a movie doing Super Important Things. LOL. Executive Disneyland.
- SpicyLemonZest 8 months ago
  
  Much of a leader's job is to visibly perform leadership. It seems silly until the first time you need something big from a team whose managers do it poorly, and you realize that they're incapable of making commitments or setting priorities.
  The expectation that leaders will play pretend about a "war" and call everyone into a "war room" is just a part of what it means for an organization to commit that consistent high uptime is a top priority.
  - fn-mote 8 months ago
    
    > commit that consistent high uptime is a top priority
    I’m hoping you find SOME leaders who can find a more effective way to convey this priority!
    Theoretically in our “data driven” leadership you could think of some metrics that would demonstrate the war room is / is not really working. At least after the fact, to save you from the next one.
    Who feels like a powerful leader in those circumstances, though?
    On the other hand, it feels good to me to be able to tell people I’m not involved at that level of detail, and they should just talk to X on the team.
    
    SpicyLemonZest 8 months ago
    
    You can indeed think of such metrics. The most common one is average time to resolution (often abbreviated TTR or MTTR), and war rooms are common precisely because they produce massive benefits in this metric. I know of at least one organization that began outright mandating war rooms after running the numbers and discovering how many incidents would have benefitted from having one.
    If you can invent a strategy that produces even better MTTR than war rooms, I'm sure a lot of managers would love to hear it. I've never seen one and I'm skeptical whether it's possible.
    
    JadeNB 8 months ago
    
    > I know of at least one organization that began outright mandating war rooms after running the numbers and discovering how many incidents would have benefitted from having one.
    How can you run numbers to find out what would have happened?
    
    SpicyLemonZest 7 months ago
    
    A large organization can usually identify groups of "similar enough" incidents to identify the correlation. (Can the numbers prove causation? Not to the standards of a scientific journal, but you can't exactly induce 100 incidents to do a controlled study.)
kenz0r 8 months ago

My tenure at Amazon was during the meeting link era, rather than putting everyone in a room - I agree, way better. Get your secondary oncall to give updates to the room, while you dig in and figure out what is actually happening in relative peace

trollied 8 months ago

I used to be a Rachel-a-like in a past life. Really tight SLAs (mobile network infra etc, so people have to be able to make emergency calls, for example).

So many times I got bridged into a conference call whilst fixing things & doing RCAs against tight SLAs, as non-technical people didn't have any sort of idea that it was wasting my time. "I am fixing it, I will send updates as per contractual agreements" puts phone down.

On several occasions I got 2 emails after the fact - one praising me for resolving quickly, another asking me to please be nicer to executives. The calls stopped after 5 or 6 times.

Things have moved on these days, and it's much easier to coordinate such events on Slack etc. Thankfully!

Nifty3929 8 months ago

Please be empathetic to people who do not understand what is going on, but who have tremendous responsibility to the business. The business problem is always a superset of the technology problem.
Yes, of course you aren't able to fix it while you are on the phone with them. A conference call will not fix the code. They know that too. But they also need meaningful information and updates in order to do their jobs, which often requires them to provide updates to others like important customers, shareholders, the CEO, or even the government. They may also need this in order to plan out other activities.
Providing useful information and frequent updates (not "contractual" updates) to them with this in mind would go a long way toward solving the whole business problem that is created by the technical problem. It might also get them off your back sooner, and with more respect for you.
There are two critical pieces of information that would help an executive very much: Do we know what the problem is? and Do we know what the solution is? A simple yes/no on both of those would be a great start.
- ameliaquining 8 months ago
  
  Communicating that information to executives needs to be the responsibility of someone who isn't currently heads-down debugging. Google's SRE Book suggests creating a "communications lead" role.
  - HolyLampshade 8 months ago
    
    I used to run a fluid ops group managing a complicated and (relatively) unstable system.
    I always approached this as a difference between incident management vs problem management, the later being the “what actually happened” phase, with lots of bureaucracy and post-mortems.
    I always taught people in my group to manage out during incidents if they understood what was happening. In the vast majority of failure modes you don’t need the most technical people working the keyboard performing, for example, a failover. Most of those processes are well documented and well understood. Very technical/operationally minded people tend to want to solve the problem as quickly as possible, but I always found them far more valuable discussing the issue with stakeholders, and playing a blocking move for the more junior guys/gals on the keyboards. This also helps the juniors get the experience necessary to eventually be able to help develop future staff.
  - cratermoon 8 months ago
    
    At one employer our site outage recovery runbook specifically stated that one person was to be designated to communicate status outside the tiger team and be a buffer between panicky people across the company and the technical folks fixing the problem.
- lifeisstillgood 8 months ago
  
  Then you need to pay for and assign extra highly technical people whose only job is to watch and follow and report upwards. This is One advantage of a war room in that there are usually enough technical people that one finds themselves just watching (“I’m a bit behind the curve so Inwill watch, don’t want to jump in there Bob is already working on it …”) and becomes this role.
- aqueueaqueue 8 months ago
  
  Surely more than one person is working on the fix? If you have a pair one can pop off and give updates to a third technical person (maybe their manager or an inicident manager) who can liaise.

CapricornNoble 8 months ago

I'm not familiar with the "War Room" in the context of computer network operations specifically, but I have deep experience running military operations centers and I'm reading this through that lens.

>People figured out that yes, they had run the machines out of memory, specifically with the push - the distribution of new bytecode to the web servers. Other people started taking steps to beat back some of the bloat that had been creeping in that summer, so the memory situation wouldn't be so bad. I suspect some others also dialed back the number of threads (simultaneous requests) on the smaller web servers to keep them from running quite as "hot".

Cross-functional information exchange. Who is coordinating or directing all these disparate actions? Who is fusing the knowledge gained from these actions? Who is disseminating a clearer picture of "what really happened"? Who is using that updated picture to frame new taskings for all the people doing these independent investigations? The answer to all those questions should be "the staff in the War Room", and the leadership in the War Room in particular. My take-away is that the author is arguing that their ability to pursue single-function actions within their domain of expertise was optimized in their work environment, and was degraded in the War Room. They aren't wrong.

>I guess a "war room" might work out if you have a bunch of stuff that has to happen to deal with a possible "crisis" and then it's just a matter of coordinating it. You don't have people doing "heads-down hack" stuff nearly as much in a case like that.

Exactly. Coordinating a bunch of stuff for crisis management = put those people in the War Room. Focused heads-down tasks = put those people where they can ...focus. Now that said....one thing I've come to HATE about working in a military headquarters is open offices for everyone who isn't the G-shop lead and his/her deputy. Everyone else is shoved into a cubicle farm, probably with ESPN blaring in the background on top of a half-dozen conversations and people constantly dropping by your desk to BS about cover sheets for TPS Reports. So even if you're NOT in the War Room, you can't focus.

icegreentea2 8 months ago

I feel like some things that consistently gets in the way of the clean separation between (crudely speaking) deciders and doers, and keeping the doers out of the war room (so they can work effectively) are:
Poor, or fear of poor communication. The "do-ers" become compelled to be in the war room to try to mitigate communication failures.
Unclear decision making processes and ownership. People with high technical expertise (who would be top tier do-ers who maybe should be kept out the war room) are kept around because their immediate feedback in the war room can significantly shift the decision making process and decisions made.
I should be more specific - I believe there's often a desire (and makes instinctive sense) to fall back to decision by consensus. Once everyone understands that this is how these things work, then obviously you want to pack the smartest, most competent people in the room, either because you're playing political games, and you want more "votes", or because truly you believe that you need the best people in the room to guide consensus.
These are structural and cultural (non-cynical) issues that drive both doers and decision makers to -want- to keep smart, competent doers in the room, even though separation -should- lead to better outcomes.
- dasil003 8 months ago
  
  There can’t and shouldn’t be a clean separation between deciders and doers. The whole point of incident response is you don’t know exactly what needs to be done, so you do need a mix of people to help figure it out. And the people proposing actions should be the most expert people (typically doers). It happens in a war room because we are prioritizing speed of communication and coordination. If individuals aren’t given space to focus, or no one is taking the role of incident commander, those are indeed problems but the solution is not to classify people as doers or deciders and separate them.

bossyTeacher 8 months ago

> I can't imagine doing that kind of multi-window parallel investigation stuff on a teeny little laptop screen with people right next to me on either side

This is it. Managers (I mean non technical folk) don't understand this. They don't understand that putting people physically together won't help you solve the issue faster. This is the same mentality that believes that typing code faster or generating more code is a good thing. The kind that believes that all employees need to always be physically together for "good stuff" to happen.

Sadly, they will never learn. Those managers and c-suite people will never read Rachel's post or investigate if their rto policies are necessarily good for the business. These folks are just reading numbers on a spreadsheet without fully understanding what those numbers actually mean in their business.

Sadly, I don't see that ever changing because that mentality provides a comforting worldview where office gives you sense of control and having all your cows in the farm under your watchful eye (or that of your trusty shepherds) feels so intuitive that any alternatives are simply too uncomfortable to even think about.

bloomingkales 8 months ago

There was a head of a department that once forced everyone to uninstall iTunes because he believed it was reducing productivity. Feels like a never-ending battle with these types.
psunavy03 8 months ago

My working theory is that business people and sales folks especially live in a world where you literally can't be productive by yourself. You have to be schmoozing, gladhandling, sitting in meetings etc. to close deals and grow business. So the idea that software people don't have to do that but can still be "productive" breaks their brains. Same with RTO; they come from a world where you truly do need to be with other people to do your job, so they think that everyone is the same way.

hinkley 8 months ago

Two of my former coworkers noticed a horrible truth about war rooms in a meeting we were having.

It would have been perfectly reasonable for our team to be responsible for at least a third of all production outages, including the nasty ones that took a war room instead of some scrambling to roll back deployments and frantically resetting some feature toggles, but we had both some of the youngest code and most operationally sophisticated code in the division (probably not a coincidence) and we just... didn't have that many outages over 15 minutes.

So we weren't in those war rooms, because it wasn't our stuff and we didn't know enough about the rest to really provide more than moral support. So we weren't in the room being all heroic in front of our boss's boss. We weren't in the shit with everyone else.

We weren't part of the fucked up Stockholm Syndrome that was driving a substantial part of the emotional bonding that was going on in that company.

We set up everybody's caches and maintained them. We figured out the docker deployment strategy. We figured out a whole bunch of things and barely got credit for any of it because when you do things right the first time, nobody appreciates how hard it was.

I've worked for bosses who understood how to sell that to upper management. This place had garbage management from the day my first boss quit until I left.

willvarfar 8 months ago

The "war room" or "tiger team" or whatever its called is often a way to parachute in a handful of engineers that top management trusts to sort out the mess made by the masses. Often crusty old-timer engineers are kept around just to be called on in these scenarios.

Nifty3929 8 months ago

Yes, and this also gives the lie to ageism. I hear this from older-than-me people fairly often, that the reason that they can't get the job they want, or a promotion, or whatever, is 'ageism.'
Meanwhile, I routinely see people older than me (I'm not young) being hired, promoted and generally shown great respect - because their years of experience has given them wisdom. They also remember how things developed over time and have more experience with details farther down the abstraction stack because those abstractions weren't around when they cut their teeth.
I aspire to be one of those grey beards in the not so distant future. And I doubt my age will ever hold back my career, aside from change my personal choices (for retirement, fewer hours, etc).
- pbronez 8 months ago
  
  Yes, but that expertise may not be easily transferable. Two decades of experience with a firm is much more valuable to that specific firm that anywhere else. If you leave that place, you only have general lessons to apply elsewhere.
  - YZF 8 months ago
    
    Once you've seen enough similar systems you have a pretty good idea. You'll ramp up much faster than someone who has not. It will take time.
    
    whizzter 8 months ago
    
    Except recruiters doesn't think there a point to hire anyone with less than 5 years of experience with Z-DeNode (that was created 2 years ago).

hackpelican 8 months ago

In the places I’ve worked, a war room was always the place where we cut the bleeding and revert the system to a working state. Never was the RCA the intended outcome of a war room, though we’d often reach the RCA in the silence of the meeting bridge while something deployed/rolled back.

Root cause analysis is definitely not a group activity, it’s best done in a place where one can have complete focus.

However, cutting the bleeding requires plenty of communication, weighing different options, having a higher-up sign off on a tradeoff, getting our ops team to coordinate towards some common goal, monitoring the recovery… etc.

afro88 8 months ago

IIRC, Facebook don't (or didn't) do rollbacks. They always fix forward. I guess hours long incidents like this are the other edge of that double edged sword.
- claytonjy 8 months ago
  
  Language can be tricky here. If I revert to an older commit, literally rewriting history to remove newer, bad commits, I think we’d all consider that a rollback. But if I instead add a new commit which undoes the bad commits, is that a rollback or a roll forward?
  - arccy 8 months ago
    
    commit state / pipelines roll forward, code / content rolls back
sunshowers 8 months ago

So interestingly, I think root cause analysis can be a group effort, but I think it has to be done on a remote call where everyone is in front of a big monitor or two, and people can take breaks and such. I've been part of teams that have done root cause analysis over a call (sometimes many calls), and it's been quite effective.

pards 8 months ago

> "nothing at FB is someone else's problem"

I love this credo. Sadly, large enterprises seem to operate on the exact opposite model of "everything at XYZ corp is someone else's problem".

There's literally no way to fix our own problems. Everything beyond the application code is managed by an external team with their own priorities, and the best we can hope for is readonly access to their repos.

adolph 8 months ago

It’s interesting to think about how broad a net must be cast to understand the state of a system.

That was another rathole, and the answer was also a thing to behold: I couldn't see it in the checked-in source code because it had been fixed. Some other engineer on a completely unrelated project had tripped over it, figured it out, and sent a fix to the team which owned that program. They had committed it, so the source code looked fine.

esafak 8 months ago

And the fact that everyone benefits when people aren't just doing their own, narrowly-defined jobs.
tantalor 8 months ago

This is an obvious, first thing to check when you are looking directly at source code.
Oh, the code changed 1 week ago? Let's see the diff. Oooooooh!
- yuliyp 8 months ago
  
  It's an obvious thing to check if you suspect it. If you're just reading code with no reason to suspect it has changed recently you might not bother to look.
  - tantalor 8 months ago
    
    This should have been the clue:
    > The call to fork() did check for a -1 and handled it as an error and bailed out. So how was it somehow surviving all the way down to where kill() was called?
    If the source code doesn't match the observed behavior, you should suspect the code changed.
    
    yuliyp 8 months ago
    
    Eventually she got there. There were a lot of other possibilities: maybe there's another place where kill is called; maybe the `kill` hypothesis is wrong; maybe something weirder is happening. Getting back to "oh maybe that's not what the code looked like back then" would likely take some further time.
  - jjk166 8 months ago
    
    I would think the default assumption for a large codebase is that the code changes regularly. And especially if you are looking for a bug that started after a recent push, checking the revision history seems like the very first place you'd look.
    
    yuliyp 8 months ago
    
    In this case the code went from bad -> good. The bug had existed for a long time before it finally got triggered in a catastrophic way. Even a large codebase has parts that are relatively stable. `fbagent` in this situation was a service responsible for gathering metrics, which is something that doesn't need that many changes (relative to, say product code) to it once it works.
    
    jjk166 8 months ago
    
    > In this case the code went from bad -> good.
    That doesn't matter, the code changed. It's not like you can know what effects changes had before checking whether changes occurred.
    > The bug had existed for a long time before it finally got triggered in a catastrophic way.
    The issue isn't missing that one bug, it's not realizing that the code you are looking at isn't the version of the code on the affected machine.
    > Even a large codebase has parts that are relatively stable. `fbagent` in this situation was a service responsible for gathering metrics, which is something that doesn't need that many changes (relative to, say product code) to it once it works.
    Clearly changes were nevertheless happening. I don't care if you're dealing with a COBOL program that hasn't been updated since the 80s - if you have a problem now, I just don't see any reason why you would ever not check the revision history.
drewg123 8 months ago

This, so much this. It has happened to me one too many times in the past, so I always make sure I'm looking at the same code (release branch, etc) where a bug has been observed.

coldcode 8 months ago

I've watched many war room in various employers.

At one (20 years ago), they met for six months to determine why our field offices' network connection to the home office was so pathetic and unusable. It was led by the head of networking. After all those meetings, it was decided that all 1000 independent field offices should upgrade their internet to T1 connections. It didn't help. Another six months goes by, and I hear from my connections in networking that the real problem was the head of networking had installed a half-duplex low-speed ethernet card: all 1000 office's data had been going through a pinhole. It was replaced, and suddenly everything was fine again, other than the hole in the office's pockets for an unnecessary upgrade.

No one ever mentioned it publically.

hedayet 8 months ago

Ex-Google SRE here with experience in multiple revenue-critical war rooms. At Google, war rooms were particularly useful because saying, "X is in a war room" (at least as late as 2017) gave X the credibility to say no to everything else. Having technically competent leaders made the experience enjoyable—because they weren’t just there to demand updates but actively contributed by writing queries, and nudging the team in the right direction by asking series of right questions.

My worst experience with crisis management was with one particular team at another big tech company, where the leaders were ignorant about the technology—completely clueless about the service and its architecture. In cases like this, the issue becomes a binary 0/1 problem: the service is either broken (0) or running smoothly (1). When a leader lacks the technical knowledge to grasp the intermediate steps, their only contribution is yelling for updates—and that’s exactly what they did.

Bottom line: War rooms can be a space for deep work with good leadership (a combination of technical soundness and co-ordination skills under pressure). But they can quickly turn into hell when leadership lacks one of these two essential qualities—and resorts to yelling to cover their asses.

cratermoon 8 months ago

"This fbagent process ran as root, ran a bunch of subprocesses, called fork(), didn't handle a -1 return code, and then later went to kill that "wayward child". Sending a signal (SIGKILL in this case) to "pid -1" on Linux sends it to everything but init and yourself. If you're root (yep) and not running in some kind of PID namespace (yep to that too), that's pretty much the whole world."

Key phrase "didn't handle a -1 return code".

Yuan, Ding, Yu Luo, Xin Zhuang, Guilherme Renna Rodrigues, Xu Zhao, Yongle Zhang, Pranay U Jain, and Michael Stumm. “Simple Testing Can Prevent Most Critical Failures.” Proceedings of the 11th Symposium on Operating Systems Design and Implementation (OSDI), 2014, 17. https://www.eecg.utoronto.ca/~yuan/papers/failure_analysis_o...

edflsafoiewq 8 months ago

> This fbagent process ran as root, ran a bunch of subprocesses, called fork(), didn't handle a -1 return code, and then later went to kill that "wayward child".

In-band error codes strike again.

pedrocr 8 months ago

This is a case of both in-band error codes and overloaded meanings of inputs colliding. Modern languages make both things much better but even in C the kill(2) interface seems much too clever. It seems it could have easily been a couple of different functions.

sixhobbits 8 months ago

I haven't been in war rooms in big companies and had no idea that they used the term for stuff like this (fixing downed IT infrastructure etc).

My experience and previous understanding of the term was you set up a war room when something big and potentially company-destroying happens and a lot of different people from different departments/divisions need to coordinate very closely as new information comes in and the situation changes more.

Or if there is some distinction between 'trusted' and 'untrusted' people internally, you want the trusted people in the war room and the untrusted ones out.

Wild to me to hear that people call it a war room in cases when the people in the room are expected to be hands-on doing things.

steveBK123 8 months ago

The purpose of the war room is not to solve the problem but to perform the act of problem solving visibly for certain audiences.

gherkinnn 8 months ago

War room. In the trenches. War stories. Pasty programmers and plump PMs using such terminology is a bit silly.

Fixing a printer that sometimes does something unexpected is not even a sailor's yarn, let alone a war story.

XorNot 8 months ago

Thank you. The overuse of militaristic lingo is incredibly annoying.
I've personally eliminated it entirely from my vocabulary: whatever you ask me, there was no war room: there was a conference call or a meeting.
alabastervlog 8 months ago

“Telemetry” for “keylogging our fart app’s users” because everyone wishes they were doing something cool and/or meaningful.

whiatp 8 months ago

This made me think of a series of "war room" meetings I had been part of early in my career. Strangely enough, also a defect revealed when the platform was low on memory. This was also the issue where I learned the value of documenting experiments and results once an investigation has taken a non-trivial amount of time. Not just to show management what you are doing, but to keep track of all the things you have already tried rather than spinning in circles.

The war room meetings were full of managers and QA engineers reporting on how many times they reproduced the bug. Their repro was related to triggering a super slow memory leak in the main user UI. I had the utmost respect for the senior QA engineer who actually listened to us when we said we could repro the issue way faster, and didn't need the twice daily reports on manual repro attempts. He took the meetings from his desk, 20 feet away, visible through the glass wall of the room we were all crammed into. I unfortunately didn't have the seniority to do the same.

Since I can't resist telling a good bug story:

The symptom we were seeing is that when the system was low on memory, a process (usually the main user UI, but not always) would get either a SIGILL at a memory location containing a valid CPU instruction, or a floating point divide by zero exception at a code location that didn't contain a floating point instruction. I built a memory pressure tool that would frequently read how much memory was free and would mmap (and dirty) or munmap pages as necessary to hold the system just short of the oom kill threshold. I could repro what the slow memory leak was doing to the system in seconds, rather than wait an hour for the memory leak to do it.

I wanted to learn more about what was going on between code being loaded into memory and then being executed, which lead me to look into the page fault path. I added some tracing that would dump out info about recent page faults after a sigill was sent out. It turns out all of the code that was having these mysterious errors was always _very_ recently loaded into memory. I realized when Linux is low on memory, one of the ways it can get some memory back is to throw out unmodified memory mapped file pages, like the executable pages of libraries and binaries. In the extreme case, the system makes almost no forward progress and spends almost all of its time loading code, briefly executing it, and then throwing it out for another process's code.

I realized there was a useful looking code path in the page fault logic we would never seem to hit. This code path would check if the page was marked as having been modified (and if I recall correctly, also if it was mapped as executable.) If it passed the check, this code would instruct the CPU to flush the data cache in the address range back to the shared L2 cache, and then clear the instruction cache for the range. (The arm processor we were using didn't have any synchronization between the L1 instruction and L1 data cache, so writing out executable content requires extra synchronization, both for the kernel loading code off disk, as well as JIT compilers.) With a little more digging around. I found the kernel's implementation of scatter gather copy would set that bit. However, our SOC vendor, in their infinite wisdom, made a copy of that function that was exactly the same, except that it didn't set the bit in the page table. Of course they used it in their SDIO driver.

Aeolun 8 months ago

I’ve spent hours in a war room with people asking me questions and making concentration impossible, left it, and figured out the problem on 30 minutes.

Maybe for things that involve a lot of manual actions, but my experience has been that they hinder instead of help work.

dakiol 8 months ago

As a software engineer I generally can help little when a non-trivial incident occurs whether it is via war rooms or deep investigations. I do have some kind of access to some logs, traces and metrics (datadog, for instance), but at the end only the SREs or platform engineers are the one who determine the root cause of any incident because they have 100% observability.

spc476 8 months ago

Unless "platform engineers" includes "deveopers", they're not the only ones who can diagnose an issue. Once at my previous job (providing a name based on a phone number to the Oligarchic Cell Phone companies), our service just stopped. I wasn't there during the outage, only heard about it after the fact. The servers were fine (they were not out of memory, nor did they have an outrageous load). The network was fine. The program just wasn't serving up data. There had been no recent updates to the code [1] so it shouldn't have been the software. It took the main developer, who knew the code, to know that a "this should not happen" situation, did---that is, the name query we used for a health check had been removed, and thus our software took that to mean the name service was out of commission, thus shutting down.
Now, it could be argued that was the wrong thing for our software to do, but it was what it was, and no amount of SREs or "platform engineers" would have solved the issue (in my opinion).
[1] The Oligarchic Cell Phone companies do not move fast, and they had veto power over any updates to production.

unit149 8 months ago

[dead]

imcritic 8 months ago

That's a shame that Rachel turned racist too...

brazzy 8 months ago

Elaborate?
- imcritic 8 months ago
  
  She configured her server to block requests from certain nations. One could suppose that's just due to some misconfiguration or something, but in fact it became a quite popular trend nowadays.
  - kjs3 8 months ago
    
    We routinely block nations where we have no customers from some assets, because why would they be there other than doorknob rattling hoping for some security issue. "No customers" is now racism? If we block people from the UK because of their burdensome Online Safety Act, we're racist? If we blocked Russian or Chinese IP addresses because their government doesn't want their citizens data stored on our infrastructure, it's because we're racists?
    Apparently the definition of 'racism' has grown wildly in certain quarters.
    One could suppose that's just due to some misconfiguration
    Sure...but so much easier to assume they're a racist I guess.
    
    imcritic 8 months ago
    
    I don't know you, so I don't care about your services, block whomever you want.
    Rachel has a weblog, she writes articles, I used to come to her blog to read them, I am not a customer of any of her services.
    Now that she blocks users by their origin - I see she has no problems with being discriminatory. That's a shame, I thought of her a bit higher than that.
    
    kjs3 8 months ago
    
    I read her blog as well. Please point exactly to where she says "I block this user/IP address/netblock/AS #/etc. because of the ethnicity of the user".
    
    imcritic 8 months ago
    
    That would expose her to the others, so obviously she doesn't manifest her blocks to everyone, she just configures her servers to drop packets received from specific origins.
    It used to work directly and now it doesn't.
    It works via foreign proxy/VPN, but I'll just stop visiting her site from now until she stops with her racism.
    
    kjs3 8 months ago
    
    So you have absolutely no proof this is racism, and calling it racism is just your petulant response to something you don't like. Got it.
    
    imcritic 8 months ago
    
    How is that not racism if it's discrimination by origin of requests?
    I bet money she doesn't block requests by such a wide filter if the origin is USA.
  - neuroticnews25 7 months ago
    
    What makes you think it's country wide? Have you tried contacting her? See https://news.ycombinator.com/item?id=42599359
  - tristor 8 months ago
    
    > She configured her server to block requests from certain nations.
    That's not racist. It's pragmatic, reasonable, and a good idea. I haven't looked at the figures recently, but some years back over 90% of all automated exploits were coming from just 3 countries. If you don't do business in those 3 countries, blocking them wholesale massively decreases the risk of succumbing to an automated exploit and reduces the sheer noise in your operational workflows so you can focus on things that matter, not script kiddies with fat pipes.
    It can also reduce your hosting costs significantly. I do this on every service I personally operate, and I've never once though about the race of the person I'm blocking. In fact, I don't even believe that the IP space being blocked necessarily represents attacks coming from inside that IP space, it could just be exploited systems being used as jump boxes by attackers elsewhere. From my perspective, it doesn't matter, I just don't see any reason to allow attackers to access my sites and services from regions I don't have any reason to serve.
    Maybe come up with a better analysis than jumping to "she's racist"? It just makes you sound dumb.
    
    imcritic 8 months ago
    
    That's a very stupid generalization on your side. And again: you talk about some customers as if I came to her site to buy some service.
    No.
    I just used to read her blog.
    Now she can go to hell with her racist discrimination.
    Easy for you to speak like you do until you personally get in a situation where you get discriminated for something like your origin.
    That's even worse than being discriminated for skin color.
    And again, that's very stupid of you to think that people from "good" country won't exploit your site or that all people from "bad" country only come to your site to exploit it.
    You are as disgusting as she is.

Simon_O_Rourke 8 months ago

Why do all these posts descend into the "I'm so awesome" archetype, describe the damned problem and how it was resolved and for goodness sake stop trying to stroke that ego while you're doing it.