
>> I personally took it to heart, it's a good system for forcing a cache miss in the brain - make sure you're on "database production" or "database localhost" etc.

Yeah, ouch. More ouch if it's the other way around: you delete the test database and it's not the test database.

(long story)



> you delete the test database and it's not the test database.

> (long story)

I think you can skip the long story, as most of us can tell a story similar in theme if not specifics (and sometimes, probably some similar specifics too). ;)

With great power comes great responsibility (to not completely screw stuff up because you were on autopilot for a second...)


Indeed. God damn muscle memory.


I worked at a company where someone deleted the production database by accident and the snapshot mechanism hadn't been working AND the alerting for the snapshot mechanism was also broken. Fortunately someone had taken a snapshot manually some weeks prior and they were able to restore from that and lose relatively little data (it was a startup, so one database was a big deal, but weeks worth of data was not such a big deal).


I worked at a company where someone deleted the production RDS and all the snapshots.

He typed the confirmation and requested deletion of the snapshots too.

He had two browsers open, one for development (of CloudFormation, etc.)... but someone had asked him to change something in prod.

Both browsers looked identical. Only the account in the top right corner was different.

Both CloudFormation stacks were identical (instance names, etc.).

He had spent all morning launching and deleting the dev environment.

Teammates were joking loudly around his desk right before it happened.

Sadly, he got fired (the company was proud of its cost-savvy choices and had no backups beyond a few days of snapshots, probably a CTO choice).


Firing the person who happened to be at the wheel when a mistake like this occurs never seems like the right choice to me, especially if their performance to date has otherwise been good.

Everybody has off days, or just instances where circumstances misalign in just the wrong way. To pretend otherwise is silly; instead, it's the leader's/team's responsibility to ensure that those sort of off days don't lead to massive losses via redundancy & the sort of measures we're talking about here & in the OP. Firing somebody in these circumstances just acts to severely reduce morale, since we all secretly know in our hearts that it very easily could have been us.

Firing in this case just seems retributive. It's not going to bring the lost data back, and you've just eliminated the very person who could have told you most about the chain of events leading to the incident in question to help you guard against it in the future. These incidents usually sound simple at the surface level ("I clicked the button in the wrong window") but often hint at deeper, perhaps even organizational, issues. A lack of team focus on reliability/quality, a lack of communication or trust about decisions made (or not made) by higher ups, or so on.

And they are probably the single least likely person to cause a similar incident again -- that person will now likely be double and triple checking their commands for eternity.


Agree. There is never a single cause to this kind of error. It takes a village. Someone didn't name things properly, someone else didn't store backups properly, someone else gave everyone root access to production, etc. It was inevitable the database would be deleted - doesn't matter who actually did it.

If your CTO scattered those landmines all over, then stepping on one isn't really an error. It just sucks.


Sometimes. And sometimes they make the same mistake over and over.

We had an admin in charge of our storage. He had worked with our old vendor's SAN for years, then we got a new SAN. Trained him/certified him etc. He "accidentally" shut down the entire SAN. That brought down the entire company for over 9 hours.

Fast forward two years later, he screwed up again and caused a storage outage affecting about 1100 VMs. Luckily not much data loss, but a painful outage.

Then a month ago, he offlines part of the SAN.

Some people never learn, and recognizing this early is usually better than letting someone continue to risk things.


3 mistakes in... >2 years? I feel like it's really hard to tell if the problem is really the person at that point. Have you had others perform the same job for a similar duration to see if they avoid the same mistakes?


If you made a list of every mistake each person makes in 2-3 years, and omitted all other detail, pretty much everybody would look like a terrible person. Context, frequency, etc. all matter.

If particular systems or people are seeing a high frequency of mistakes, maybe the system design is at fault, not just the person. Obviously it's hard to do in practice, but the ideal is to design systems that are mistake proof.


This is just the mistakes made in the SAN/Storage part of his responsibilities. As we used to say in World of Warcraft, "Can't heal stupid."


> "He had worked with our old vendor's SAN for years, then we got a new SAN."

Great way to invalidate years of experience. Presumably from your telling of the story, he didn't cause problems with the old vendor's SAN?

> "He "accidentally" shut down the entire SAN."

So, was it an accident, or was it an "accident"? You can't call it a mistake while also hinting it was deliberate and malicious.


He was trained and certified on the new SAN, and surely some of his prior experience on the legacy SAN would translate. Just as moving from AIX to RHEL/CentOS wouldn't invalidate all your skills and experience.

It was a real accident when he shut down the SAN the first time. I don't know why I put it in scare quotes.


> These incidents usually sound simple at the surface level ("I clicked the button in the wrong window") but often hint at deeper, perhaps even organizational, issues.

These words reminded me of the story of the nearly identical "flaps" and "landing gear" controls on a plane, where crashed airplanes were also blamed on pilots first, before a trivial engineering/UI fix was implemented: https://www.endsight.net/blog/what-the-wwii-b17-bomber-can-t...


Nickolas Means has an absolutely wonderful set of talks on themes like this. Particularly relevant here I think, is his talk: "Who Destroyed Three Mile Island?" - which goes through the events that occurred at the nuclear power plant, the systemic problems, and how to find the "second stories" of why failures occurred.

https://www.youtube.com/watch?v=1xQeXOz0Ncs


There's a really good book describing this phenomenon called Behind Human Error. It speaks of "first stories" and "second stories" and how in analysis of incidents, it is all too common to stop at the first story and chalk it up to human error, when the system itself allowed it to take place.


"Both cloudformation stacks were identical (instance names, etc)."

This is why it's a good practice to include the environment name in the resource names when it makes sense. Even better, don't append the env name, but use it as a prefix, like ProdCustomerDb instead of CustomerDbProd. I also like to change the theme to dark mode in the production environments as most management UIs support this. One other neat trick is to color code PS1 in your Linux instances, like red for prod, green for dev.
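For the PS1 trick, here's a rough sketch of what I mean, assuming each host exports an environment variable like APP_ENV (the variable name is just for illustration; use whatever your provisioning already sets):

    # Rough sketch for ~/.bashrc; assumes each host exports APP_ENV (prod/dev/...).
    case "$APP_ENV" in
      prod) env_color='\[\e[41;97m\]' ;;  # white text on red background
      dev)  env_color='\[\e[42;30m\]' ;;  # black text on green background
      *)    env_color='\[\e[44;97m\]' ;;  # blue for anything else
    esac
    reset='\[\e[0m\]'
    PS1="${env_color}[${APP_ENV:-unknown}]${reset} \u@\h:\w\$ "

The point is that the color hits you before you read anything, which is exactly what you want when you're about to paste a command on autopilot.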


I have my background colors configured for each environment so when I'm shelled into a server, I know exactly what I'm working with.


I'm too lazy to do this manually for each server, but I change the hostname color in my prompt based on its hash.

https://gitlab.com/brlewis/brlewis-config/-/blob/master/bash...
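The rough idea, as a simplified sketch (not the exact contents of that file):

    # Simplified sketch: derive a stable color from the hostname so each
    # machine gets a consistent, distinct prompt color.
    host_hash=$(hostname | cksum | cut -d' ' -f1)
    host_color="\[\e[$((31 + host_hash % 6))m\]"  # one of the six basic ANSI colors (31-36)
    reset='\[\e[0m\]'
    PS1="\u@${host_color}\h${reset}:\w\$ "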


> One other neat trick is to color code PS1 in your Linux instances, like red for prod, green for dev.

This is definitely a nice one to add. Though I did work with someone once who believed that all servers should be 100% vanilla and reverted my environment colors.

In container-only shops with no ssh, this is less of an issue, and instead you rely on having different permissions and automations for different environments.


That's very similar to what happened to me - except I didn't delete any backups, thank the Great Old Ones. And I didn't get fired.

Basically, I had a habit of starting a new SQL Server Management Studio instance in its own window for each database I was working on. At some point this struck me as wasteful, for some reason, so I closed all my windows and opened all the databases in one window. Then sometime after that I went to delete the test database as a routine maintenance task, but of course I was used to clicking the database at the top of the left pane in SSMS, which was the test database when it was the only database in a window... but now happened to be the production database. Then five minutes later I got a call from the client company that used our system, asking me if there was any maintenance going on because everyone's client had just crashed.

The horror when I realised.

It was educational, though. I don't think I'll make that particular mistake ever again. And my bosses were ace to be fair, probably because I worked my ass off to correct the mess that ensued.


When I worked in production environments, I used to set up little Firefox userscripts that would add a banner or anything visual to the production site. It's entirely client side and easy to customize.



