I Don't Think VDI Redundancy Means What You Think It Means

In my previous VDI posts, I outlined the virtual desktop infrastructure I inherited and discussed the stability issues we were facing. In this, part three, I will talk about redundancy, or the lack thereof, and how I was able to use some over-engineering to my advantage, if only for a little while.

Summer was fast coming to a close, a few hundred more virtual desktops were about to come online, and system performance was horrible. We had already decided not to go down the road of making the significant investments required to rebuild a VDI system capable of supporting the swelling number of desktops and providing acceptable performance for staff and teachers. Desktops, after all, were a dying paradigm. But if we didn't do something, the whole system would implode under the load of the additional desktops. Enter the promise of total system redundancy.

The initial system design called for a redundant backup site with identical SAN and server hardware to fail over to in the event of an outage at the district office (DO). Using VMware's Site Recovery Manager (SRM), the VMs at the DO would automatically fail over to the secondary site and come back up with only minor interruption. Email would continue to flow, users would continue to work on their virtual desktops, and all would be right with the world. In fact, everyone was under the impression that this was in place when I arrived. Only it wasn't. SRM was not working. SAN replication was not working. Digging further, I found that even if they had been working, the failed-over VMs would have had nowhere to go: no site other than the secondary site itself would have been able to reach them. There were no redundant links from the other school sites to the secondary site, and there was no backup Internet connection at the secondary site either. It would have been a failover to nowhere.

A further breakdown revealed that SRM had been configured to fail over servers only (no View virtual desktops) and had at some point actually worked. However, it broke when the DO site was upgraded to vSphere 4.1 and the secondary site was not. To add insult to injury, in the course of evaluating storage upgrade options, we discovered that the DO SAN, the one running all of the district's servers, email, and virtual desktops, had been purchased with only a single controller. Somehow the project had moved forward with redundant firewalls, web filters, and routers, but not a redundant SAN controller. So much for redundancy. As for the site connections, the redundancy plan clearly called for a network design that never materialized, leaving the district with a secondary site full of equipment, still under warranty, sucking up power and AC while sitting essentially idle with nothing to do except run two backup Active Directory servers.

So in an unorthodox (and probably unsupported) move, I decided to harness the idle power of the secondary site to run the student virtual desktops. Because of the way the View connection brokers were set up with the DevonIT Echo server, I could not separate the student pools onto their own vCenter server in the time allotted, as much as I wanted to. Instead, I attached the hosts from the secondary site to the vCenter at the DO, moved the student master images over to the secondary SAN, and reconfigured the pools for the new cluster, effectively running all of the student virtual desktops on the secondary site hardware. This reuse of the redundant hardware saved us for a while. We found out later that the combined load across the two SANs in this configuration was running at 18,000-20,000 IOPS, which on its own would have easily brought the single-controller DO SAN to its knees.
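For the curious, those moves boil down to two vSphere operations: adding a host to a cluster and storage-relocating a VM. Here's a minimal sketch of the scripted equivalent using the pyVmomi Python SDK (which postdates the vSphere 4.1 era, so treat this as illustration only); the vCenter address, credentials, and object names are all hypothetical placeholders, not the district's actual configuration:

```python
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

def find_by_name(content, vimtype, name):
    """Return the first managed object of the given type with a matching name."""
    view = content.viewManager.CreateContainerView(
        content.rootFolder, [vimtype], True)
    try:
        return next((obj for obj in view.view if obj.name == name), None)
    finally:
        view.Destroy()

# Lab-style connection; the hostname and credentials are made up for illustration.
ctx = ssl._create_unverified_context()
si = SmartConnect(host="vcenter.do.example", user="administrator",
                  pwd="secret", sslContext=ctx)
content = si.RetrieveContent()

# Step 1: attach a secondary-site ESXi host to a cluster on the DO vCenter.
cluster = find_by_name(content, vim.ClusterComputeResource, "Student-Cluster")
spec = vim.host.ConnectSpec(hostName="esx1.secondary.example",
                            userName="root", password="secret", force=True)
cluster.AddHost_Task(spec, asConnected=True)

# Step 2: storage-relocate a student master image onto the secondary SAN.
master = find_by_name(content, vim.VirtualMachine, "student-master-image")
dest = find_by_name(content, vim.Datastore, "secondary-san-lun1")
master.RelocateVM_Task(vim.vm.RelocateSpec(datastore=dest))

Disconnect(si)
```

Once the hosts and master images were in place, repointing the pools at the new cluster was just a matter of editing each pool's host and datastore settings in View.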

However, even over the 600Mbps connection from the DO to the secondary site, we started to see problems. Pool errors during provisioning pointed to communication problems, so we disabled the "stop provisioning on error" setting. Disconnects on student desktops increased, whether because of PCoIP's sensitivity to network conditions or because of the old hardware being used as clients. Under the growing student load, vCenter started becoming unresponsive. Staff and teacher pools continued to have performance problems even with the students running on separate hardware. Despite all this, the system did not totally crash and burn. It limped into the new school year with no money invested and another 200 desktops online, or so we thought.

In part four: how 200 new desktops turned into 300, and what happened when schools actually started using them.