Power went out at the District Office (DO) today at around 5:30pm. The outage affected a large area of our community. With only 25 minutes of backup battery time on our servers, we had just a few minutes to shut everything down. At least we had that time. No, we don’t have automated shutdown scripts in place. We have a kludgy UPS setup with several separate units powering each rack. As for automating the shutdown, we have two Dell Chassis, 36 Blade Servers, VMware 4.1, Windows 2008, 2003 and Linux, NetApp, the Core Network, Cisco VoIP/Unity and Nimble to contend with. Getting it all automated would require a serious Professional Services engagement. And really, who has money for that these days?
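Even without a full automation project, writing the shutdown order down as data is a cheap first step. The sketch below is only an illustration: the component names are loosely based on the list above, but the dependency map (`DEPENDS_ON`) is a guess for the sake of the example, not our actual topology, and nothing here talks to real hardware. It uses Python's standard `graphlib` module (Python 3.9+) to work out a boot order and then reverses it to get a shutdown order.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map: each key needs the listed systems running first.
# These groupings are illustrative assumptions, not the real DO layout.
DEPENDS_ON = {
    "core_network": [],
    "storage": ["core_network"],               # e.g. NetApp, Nimble
    "esx_hosts": ["core_network", "storage"],  # blades in the Dell chassis
    "vcenter_db": ["esx_hosts"],
    "vcenter": ["vcenter_db"],
    "guest_vms": ["esx_hosts"],
    "voip": ["core_network"],
}


def boot_order(depends_on):
    """Return systems ordered so that every dependency starts first."""
    return list(TopologicalSorter(depends_on).static_order())


def shutdown_order(depends_on):
    """Reverse of boot order: dependents stop before what they rely on."""
    return list(reversed(boot_order(depends_on)))


if __name__ == "__main__":
    for step, name in enumerate(shutdown_order(DEPENDS_ON), start=1):
        print(f"{step}. stop {name}")
```

A printed list like this could live taped to the rack as a runbook long before any of it gets scripted, and the same map gives the power-up order for free.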
While pretty much everything hosted at the DO went down, email stayed up, since it is hosted “in the cloud”, and 3G/4G access worked as a backup for those of us with smartphones. I’m seriously considering getting a dedicated 4G backup Internet connection and isolating some key network components/ports (like those for the IT offices) on a beefy UPS, for just such events.
After several hours of manually bringing systems back online, it looks like the act of gracefully shutting down the 4.1 vCenter Server and DB Server made a huge difference. There were not nearly as many orphaned VMs as after a hard down. And the new 5.1 VMware vCenter instance that I recently built fired up right away when the hosts came back online. I’m really loving that little VMware appliance.
We still have things to work on, like fixing the 6509 so the sups boot up automatically, and making sure the redundant routers are actually configured for redundancy (they aren’t, but we’ll be removing them soon anyway) and that their interfaces are not administratively down. I also want to add a DHCP, DNS and AD server outside the Dell Chassis/VMware stack so we can get network services back up and running a bit faster. And it looks like we need to assign static IPs to the Meraki Access Points if we want them to come up before the DHCP server.
It would also be nice if our generator wasn’t parked five blocks down the street in the M&O yard. It took an hour to get it driven over to the DO and hooked up, by which time we had already been down for about 30 minutes. But once the generator was up and running, we were able to get systems back online even before power was restored to the neighborhood.
All in all, it could have been a worse day. I mean, it’s not like the server room overheated or anything.