So, if you’re a sysadmin with a few years under your belt, chances are you’ve already had an “Oh fuuuuuck…” moment or two. Something has gone horribly wrong – you may have even been right in the middle of it all when it did!
Well, just recently, I was involved in one of those. I won’t go into too much detail because of NDAs and privacy and all that jazz, but I can share a few real-life “lessons learnt” from the situation.
To set the scene, let’s just say shit got real, really fast. A change was planned, a change was executed, something went really wrong (and honestly, nobody predicted it would go as wrong as it did), and so things fell over in a bad way. A catastrophic way. We’re not talking a large scale outage lasting a few minutes – this thing stretched on for hours.
So, with that in mind, I have a few lessons learnt in the trenches. Many of these have already been discussed within my organisation, which is a great thing after something like this, but I honestly feel they may benefit others.
Have a DR plan
More than anything else, your disaster recovery and business continuity plans need to be ready to go, because this is the time you’ll need them. You need to know where to get your information – especially in the case of a total data centre failure, you need to have it somewhere else.
Back up everything. Store it on encrypted USB drives. Print it out. Make sure you’ve practised your rebuild run-throughs. Know your contacts. Because when it comes down to it, you don’t want to be making things up as you go – you want a plan to follow.
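To make the “back it up, store it somewhere else, and verify it” idea concrete, here’s a minimal sketch. The paths are stand-ins, not prescriptions: in real use the source would be your /etc and service configs, and the destination would be an encrypted USB mount or an offsite store.

```shell
#!/bin/sh
set -eu

SRC=$(mktemp -d)          # stand-in for /etc and friends
DEST=$(mktemp -d)         # stand-in for /mnt/encrypted-usb

# A pretend config file so the sketch is self-contained.
echo "server { listen 80; }" > "$SRC/nginx.conf"

# Snapshot everything into a dated archive on the "other" location.
STAMP=$(date +%Y%m%d)
tar -C "$SRC" -czf "$DEST/dr-backup-$STAMP.tar.gz" .

# Verify the archive is actually readable BEFORE you need it in anger.
tar -tzf "$DEST/dr-backup-$STAMP.tar.gz"
```

The verification step at the end is the part people skip – an unreadable backup is the same as no backup, and a disaster is the worst possible time to find that out.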
Information on your services – know what’s what
If you’re like me, in a large enterprise environment, you’re not responsible for everything. You’re responsible for your little corner of the network – with its infrastructure and services. And you need to know those services – know what they connect to, know what they depend on, know what depends on them! This is vital – and is why a CMDB (provided you can access your CMDB) is so important.
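Even without access to a full CMDB, the core of “know what depends on what” is just a dependency map you can query both ways. Here’s a toy sketch – the service names are invented, and a real CMDB would obviously hold far more than this:

```python
# Toy stand-in for a CMDB: each service maps to the services it depends on.
from graphlib import TopologicalSorter

deps = {
    "webapp":   {"database", "auth"},
    "auth":     {"database"},
    "database": {"storage"},
    "storage":  set(),
}

def dependents(service):
    """Who breaks (and who to warn) when this service goes down?"""
    return sorted(s for s, d in deps.items() if service in d)

# After a total outage, bring services back dependencies-first.
recovery_order = list(TopologicalSorter(deps).static_order())

print(dependents("database"))
print(recovery_order)
```

The two queries mirror the two questions you’ll actually be asked during an incident: “what else is affected?” and “what order do we restore things in?”.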
Have your testing plans ready to go
So you know what services you have. You’ve checked your DR plan and know what you need to fix and what’s most important.
But what do you need to test to ensure that *all* your services are available, especially after a major outage?
Have a testing plan – this should be a list of all your major services, and the primary checks that are required in order to guarantee that they’re back up and operational. Sure, it might be as simple as checking a website, or seeing that a status light is green, but you need to have this. You want this list so you can check off each item, hand it to management, and say that all your services are now good to go because you *know* they’re good to go.
This earns brownie points with management – promise!
Fatigue management is a big thing
This is probably one of the hardest lessons learnt for me, mainly because I don’t know when to call it a day. I don’t want to be the one who dumps a pile of crap on someone else just because I’m feeling a little tired.
For me, being awake for 38 hours straight isn’t “feeling a little tired”. At this point, you’re definitely into sleep deprivation territory and you won’t be thinking straight, nor will you be able to do your job effectively. I’d say if you’ve been working for longer than 12-14 hours, it’s time to call it a day. You’d be surprised – your workplace may even have existing guidelines on the maximum amount of time you’re allowed to work.
So don’t be a hero. Hand over to someone else, get some shut eye and pick it up when you wake up, refreshed and alert. Yes, it sucks that someone else has to get dragged in. Yes, it sucks that you’ll be missing out on what’s going on and have to catch up once you wake up.
But you can’t function on little-to-no sleep – you are of no use to anyone, especially to yourself.
So fatigue management: during major incidents, make sure staff are getting sleep – and make sure they’re eating, too!
Be friends with your coworkers
This one is huge. Trust me, you want to be friends with coworkers, because they will seriously save your arse. And I’m not just talking about within your own team. I’m talking other teams, especially if you’re in a large enterprise like I am. You want to know people, be on good terms and know that, if you have to call them at 3am, they’re not going to be *too* pissy with you.
In the incident I was involved in, it was just sheer dumb luck that most of the group of people on-call/involved were friends outside of work. Because when shit hit the fan, we had no way of communicating internally. We started relying on outside methods – we fell back on trusty social networks.
Twitter to the rescue! No, I’m not kidding. We formed a private DM group on Twitter and coordinated through there. It saved our butts (and our sanity). The most hilarious part of all of this – the social media policy we’re supposed to adhere to says that we should refrain from “friending” coworkers on social networks. If we’d actually followed it, the organisation would have been in a *much* worse position.
Communication is vital
Part of the issue we had was that as the “lowest level” of tech (which I find hilarious as we’re all at least 3rd level support…), we were pretty much in the dark. We didn’t actually know what was going on. No one was talking to us and telling us what we needed to know. We could talk amongst ourselves, but we didn’t have that over-arching communication to tell us what was broken, how it was broken, what was being done to fix it – we were just on the periphery, waiting for our stuff to be available so we could check that everything was working.
This was probably the biggest issue we had, and it was discovered that communication was happening – but it was simply going up, and not coming back down. Our management were *really* well informed…they just weren’t passing that information on to us. This was a massive “lesson learnt” that I personally hope is going to be addressed in future incidents – because the communication and information has to flow both ways – up *and* down.
This is probably the most important lesson learnt of them all.
Don’t play the blame game
Look, shit happens. Things break. Sometimes plans don’t go to plan. Hardware screws up, people make mistakes, software bugs out. It’s all part of the job.
But there’s no point pointing fingers. That just makes the job of a) fixing the problem, b) making sure the problem doesn’t happen again, and c) learning from the experience *that* much harder.
Take emotion out of any reviews you have – you want to be purely logical and factual. This is what happened. This is what we did. This is what we should’ve done. This is what wasn’t done. Don’t point fingers, don’t lay blame at someone’s feet. That just makes for hurt feelings, poor relationships and then no one learns because everyone comes away feeling worse.
When these things happen everyone gets their shit together, everyone pitches in and the job gets done. As it was, the outage that spurred this blog post was barely noticed – yes, it did have some impact, but it could have been far, far worse had it gone on for longer than it did. The reason it didn’t? People got in and got shit done. In the end, that’s what matters – shit got fixed, everything is working.
And you’ve got things to improve on for the next time it happens…because there will be a next time.