The Long March To VDI Stability

This is the second post in a multi-part series about my experience with VDI over the past 10 months. In the first post I laid out the VDI situation I inherited. To recap: after VMworld in August, I realized that I had a VDI system that was under-spec’d, poorly implemented and configured, a major version out of date (4.1u1), badly overutilized (suffering from VMware sprawl, more on that later) and providing users with a very poor computing experience. So I set out to develop a plan to fix it, as any good IT person would do.

Initially I was looking for ways to stabilize the system and improve the end user experience. Never mind that the desktop paradigm for teachers and students is horribly outdated in the age of anywhere, anytime learning. Never mind that tying teachers to a desktop fixed in space makes building collaborative Professional Learning Communities around student assessment data basically impossible. And never mind that virtual desktops that can’t run Skype, Google Hangouts or webcams, that can’t play videos or connect over the Internet to other classrooms, authors or NASA, and that run out of hard drive space with every Adobe Flash or Java update, do not empower teachers or students with 21st Century learning abilities and are not the kind of computer environments we should be building for teachers today.

I was looking for cost-effective ways to get the system back to what it was designed to do, which was provide a platform for teachers to take attendance, enter grades, check email and marginally support student computing. It turns out cost-effective and VDI don’t really play well together. Just stabilizing the system, with a health check and a migration to 5.0, was a six-figure prospect. Adding the hardware to increase RAM and HDD capacity in guest VMs: more six figures. Fixing the storage problems with something better suited to the peak demands of 1200 virtual desktops: more six figures. The management software needed to really see what was going on with the complex moving parts? Only five figures, but with a high recurring cost. Replacement endpoint devices for teachers: six figures yet again. The numbers kept adding up, and no matter how I tried to slice and dice them, the conclusion was the same: getting the system stable and viable over the next three years was going to be expensive. Certainly much more expensive than the low-cost system it was initially pitched as.

There was another factor I was considering when looking at price. Support had been pre-paid for five years, with the VMware renewal just one budget year away and the SAN and server renewals due the year after. On top of that, the server hardware and existing SAN would soon be five years old. Five years for the critical infrastructure that 99% of the desktops in the district were running on. Now, I have run servers out to seven and even eight years, but never critical systems. Five years has always been my end of life for critical production servers, and in this case the equipment had been through two major high-heat events when the air conditioning failed in the server room. In one instance, the thermometers were pegged at 120 and the SAN did not shut itself down. Not an environment that lends itself to extending the life of computer hardware.

Factor in a significant investment to make the VDI system right, a critical lack of sysadmin capacity and skill (I’ve learned more about VDI in the past 10 months than I care to know; it’s basically my second job) and the prospect of significant support renewal costs on the horizon, and the only cost-effective solution was obvious: scale back the number of users to a point where the existing hardware could deliver decent performance, and phase out the VDI system over time. We would make the best use of the investment that had already been made but not throw more money into an outdated paradigm that we weren’t equipped to support and couldn’t afford to maintain over the long term. But this would take time. Time we did not have.

Storage was the major issue. With several schools bringing new thin clients online over the summer (purchases already in the pipeline when I arrived), we were looking at a total system collapse if we didn’t do something. The solution presented itself in an unexpected place. In part three, we talk redundancy!