VDI Hell

I sat down to write a post about storage solutions and my recent decision to purchase a Nimble Storage array. However, I wanted to properly explain why I was looking for new storage in the first place, and I didn’t want a simple post to turn into an 8,000-word Greek tragedy. So let me set the stage with the (multi-part) backstory, and I’ll write the storage post a bit later.

Almost 10 months ago today I stepped into a new district and inherited a VDI infrastructure that, on paper, most IT people dream about: lots of Dell chassis and blades, big-iron SANs, VMware View, redundancy. The whole nine yards. At least on paper. You see, the district decided back in 2009 to be an early adopter of VDI. Rather than pilot a few dozen users and scale up, they went all in and cut over every user in the district virtually overnight. And they did it with a vendor that had never implemented VDI at that scale. To be fair, I don’t think very many had back then. Suffice it to say, there were issues. Upon my arrival I found a system suffering from major performance problems with many different causes.

The traditional SAN storage, which pretty much everyone now acknowledges is critical to VDI desktop performance, was an obvious bottleneck. The system was also suffering from a major case of VM sprawl. More and more client machines had been added without consideration for server-side capacity; after all, adding VMs was so easy in the new View environment. For the school sites, adding clients was also a cheap proposition: sub-$400 Wyse thin clients, or free in the case of donated desktops running thin-client software (we are using DevonIT VDI Blaster).

As if all that were not enough, the initial hardware was planned around the absolute bare-minimum requirements for memory and hard drive space. The guest VMs were set up with 512MB of RAM and an 8GB virtual disk each for Windows XP SP3. As you can imagine, this caused operating system performance issues inside the VMs on top of the external storage performance issues with the SAN. In a perfect world we would simply allocate more resources to each VM, but a lack of forward planning meant that the original hardware purchase was just enough to meet the initial guest requirements. We did not have enough host memory or SAN space to give the Windows XP clients additional resources, never mind attempting an upgrade to Windows 7.
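
If you ever need to put numbers to that kind of under-provisioning, the sketch below shows one way to do it with pyVmomi. This is purely illustrative: the vCenter hostname, the audit account, and the 1GB memory floor are placeholders I made up, not anything from our environment, and this is not a script we actually ran.

    # Minimal sketch: flag under-provisioned guest VMs via pyVmomi.
    # The hostname, credentials, and memory threshold are placeholders.
    import atexit
    import ssl

    from pyVim.connect import SmartConnect, Disconnect
    from pyVmomi import vim

    MEMORY_FLOOR_MB = 1024  # flag anything provisioned under 1GB

    ssl_ctx = ssl._create_unverified_context()  # lab convenience only
    si = SmartConnect(host="vcenter.example.local",
                      user="audit@vsphere.local",
                      pwd="********",
                      sslContext=ssl_ctx)
    atexit.register(Disconnect, si)

    content = si.RetrieveContent()
    vm_view = content.viewManager.CreateContainerView(
        content.rootFolder, [vim.VirtualMachine], True)

    for vm in vm_view.view:
        cfg = vm.summary.config
        if cfg and cfg.memorySizeMB and cfg.memorySizeMB < MEMORY_FLOOR_MB:
            committed_gb = (vm.summary.storage.committed or 0) / (1024 ** 3)
            print(f"{cfg.name}: {cfg.memorySizeMB}MB RAM, "
                  f"{committed_gb:.1f}GB committed on disk")

Run against a pool of 512MB/8GB XP guests, a report like this makes the shortfall obvious long before the help desk tickets do.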

A cost-saving decision that haunts us to this day (and there were others) was to reuse old computer hardware. Those machines are now anywhere from 8 to 12 years old. Not only do hardware failures among these systems present themselves as intermittent connection and stability problems, but many of them can only run the RDP protocol. Audio and video playback, something that has become critical to classroom instruction over the last few years, is painful if not downright impossible under RDP. This severely limits the options teachers and students have for accessing 21st Century learning resources.

There are also many moving parts in the VDI system. There is a SQL Server database server hosting the databases for vCenter and View, and I have had to dust off my rusty database administration skills to fix downtime-causing issues with both the SQL Server instance and the individual databases. There are also two connection servers that provide the View connection brokering, and there have been several issues with both of them. The most interesting was corruption in the local ADAM database, which caused all kinds of odd behavior with our View desktop pools. The single vCenter server manages both the ESX server hosts and the View hosts, and it often appears to “pause” under the heavy task load of serving up 1,400+ available VMs with over 800 active connections. After a particularly bad power outage, when all the systems went down hard, two View hosts appeared perfectly fine but, once active in their respective clusters, caused all kinds of havoc with desktop provisioning. Active Directory and networking also play a pivotal role, and on more than one occasion both have thrown a wrench into the works in one way or another.
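
For a quick pulse on how that load looks from the vCenter side, something like the following would do: a hypothetical pyVmomi snippet (reusing the "si" session from the earlier sketch, or a fresh SmartConnect) that tallies registered desktops by power state, which is handy for spotting when provisioning has gone sideways after an event like that power outage.

    # Sketch: tally registered desktops by power state.
    # Assumes the pyVmomi session "si" from the earlier placeholder snippet.
    from collections import Counter

    from pyVmomi import vim

    content = si.RetrieveContent()
    vm_view = content.viewManager.CreateContainerView(
        content.rootFolder, [vim.VirtualMachine], True)

    states = Counter(str(vm.summary.runtime.powerState) for vm in vm_view.view)
    total = sum(states.values())
    print(f"{total} VMs registered: "
          f"{states.get('poweredOn', 0)} on, "
          f"{states.get('poweredOff', 0)} off, "
          f"{states.get('suspended', 0)} suspended")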

By now you may be thinking, “What’s the big deal? The IT department spends lots of time keeping VDI running. Isn’t that what the IT department should be doing?” No. We should not. Not with a staff of three, including me. After attending the VMworld conference this summer I was struck by how often I heard “storage team,” “server team,” and “database team” in conversations about supporting VDI, and these were people talking about 200-500 desktop deployments. There I was with 1,200 (at the time) VMs thinking, “Teams? I’ve got me, a network engineer, and a desktop tech. There are no teams!” With the limited staff available to me and the many moving parts, complex enterprise moving parts I might add, keeping the system running was an exercise in extreme firefighting. We had no time to be proactive, and when the system hiccuped, every user in the district was affected. It was an untenable situation to find oneself in, but that is where I was after coming back from VMworld: hit with the realization that I had a hugely complex system that had not been set up well and was failing on many levels.

What was I to do? The answer, perhaps, to come later in Part 2.