If you go take a gander at this, you’ll see that some relatively well known (and some relatively not well known….. http://www.fakeplasticrock.com ???….I dare you to read that quickly without doing a double-take, BTW) sites went down over the weekend due to a hard-drive failure on a dedicated managed server that was hosting various virtual machines.
Now, it would be easy to gloat and flame about this, and really, people should know better. The same goes for some popular blogger sites that lose all their images when a server fails. However, I won’t, for a number of reasons:
a) I’ve done Ops, and I think I was damn good at it, thank you very much. It was still hard. With so much to manage, there is a lot to forget or overlook. Sure, you should have not only a backup plan, but also a recovery plan. Everyone knows that. If you tested your recovery plan last week, you are light years ahead of most people. Have you tested it this week? How do you know nothing has changed? Do you have the resources to test your recovery plan every day, in terms of people, hardware, and time? No, of course you don’t.
b) Have you ever saved anything important to CD? Yeah, me too. You do know that CDs degrade over time, right? So those important pictures you saved to CD in 2000 might be unreadable now. You’ve checked this, right? Sure you have.
c) The last time I had to deal with hardware failure personally, I did not have complete backups of everything. I was lucky enough that the failing machine would stay up for five minutes after a reboot. So, I had five-minute increments over (something like) two days to copy off anything important. That was fun.
d) The computer gods admire hubris, but they also punish it. Vigorously. When I ran an Ops department, I used to test the production SAN by yanking a hard drive out of a slot, just for the hell of it, just to make sure it worked. The computer gods admired my testing, and punished me by making me accidentally run the batch script that turned off credit card processing on the entire web farm (except for one server) a few weeks later. If I rag Haack too hard, my apartment will catch fire and burn to the ground along with all my hardware. Or something.
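Points a) and b) above boil down to the same discipline: a backup you haven’t verified lately is a backup you hope exists. Here’s a minimal sketch of one way to automate the verification half, in Python. Everything here (the manifest file, the function names) is a hypothetical illustration, not any real tool’s layout:

```python
import hashlib
import json
from pathlib import Path

def checksum(path: Path) -> str:
    """SHA-256 of a file, read in chunks so big files don't eat RAM."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def record_manifest(backup_dir: Path, manifest: Path) -> None:
    """At backup time, write a checksum for every file in the backup."""
    sums = {str(p.relative_to(backup_dir)): checksum(p)
            for p in backup_dir.rglob("*") if p.is_file()}
    manifest.write_text(json.dumps(sums, indent=2))

def verify_manifest(backup_dir: Path, manifest: Path) -> list[str]:
    """Later (on a schedule, not once), return files that are missing
    or no longer match their recorded checksum."""
    sums = json.loads(manifest.read_text())
    bad = []
    for rel, expected in sums.items():
        p = backup_dir / rel
        if not p.is_file() or checksum(p) != expected:
            bad.append(rel)
    return bad
```

Run `verify_manifest` from cron against the CD rip, the SAN snapshot, whatever. The point isn’t this particular script; it’s that the check happens without a human remembering to do it.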
In any event, I did think it important to reiterate that “Operations”, my global catch-all term for all that non-programming stuff, is just as important as, if not more important than, all the nifty whiz-bang programming stuff. An environment with a recovery plan and no separation of concerns trumps the opposite (and if you disagree, you are wrong, sorry…well, okay, usually. There are exceptions).
Oddly enough, I ran into ‘Ops’ considerations last Friday. I was asked to merge in some code from one environment to another. On a Friday. The weekend before a very important series of events in the new environment. But it was tested, right? Sure, it was tested in a different environment, and the only differences are environmental variables, so no problem, right? Right. Wrong. After merging in all the code, and getting ready to leave for the weekend, it occurred to me that it wouldn’t hurt to run a small piece of the code, just to verify. No problem.
FAIL. So, I spent 15 minutes or so fixing that and reran it. Okay, the code is launching, and it looks like it might take a while to run. For the heck of it, let’s just run one other piece of code, just to verify. No problem.
FAIL. Finally, after these figurative kicks to the nether regions, I thought about what I was actually attempting to do. Merging in code on a Friday, in an environment that had passed important tests the previous week, the weekend before this environment was going to go through an important series of events. Code that I could make run without error. Hell, that’s easy enough. I can generally make failing code run without an error, even if I don’t know what the code does. Read the error message, use experience combined with brain power, rinse and repeat, till the code completes….Slight problem. What if it actually matters what the code is doing, this Friday before the weekend before that important series of events?
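For what it’s worth, that “run one small piece, just to verify” ritual is cheap to automate, so it happens after every merge instead of only when a queasy feeling strikes on a Friday. A hedged sketch, where the list of checks is a made-up stand-in for whatever small pieces of your merged code run quickly:

```python
import subprocess
import sys

# Hypothetical smoke checks: each is a quick command that exercises a
# slice of the merged code. These two are stand-ins, not real checks.
SMOKE_CHECKS = [
    [sys.executable, "-c", "import json"],          # "does it even import?"
    [sys.executable, "-c", "print('env looks ok')"] # environment-specific probe
]

def run_smoke_checks(checks) -> bool:
    """Run each check; stop at the first failure so you find out
    before the weekend, not during it."""
    for cmd in checks:
        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.returncode != 0:
            print(f"FAIL: {' '.join(cmd)}\n{result.stderr}", file=sys.stderr)
            return False
    return True
```

Wire it into whatever does the merge, and make a red result block the “getting ready to leave” part.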
I made the executive decision, and rolled back the code merge. Let them yell at me on Monday (which they didn’t….in retrospect, they agreed it was dumb to try).
Having said all of that, make sure your really important kool kidz stuff can survive a hard drive failure. Seriously.