VDI Sprawl

This is part 4 in a series of posts on my not-so-great experience inheriting a VDI infrastructure at a school district. We left off with the student virtual desktops running on repurposed non-redundant, redundant hardware at the secondary site and teachers and staff running on the DO hardware. Through the first few months of school, teacher and staff performance remained an issue, mainly because the VMs themselves weren’t originally allocated enough system resources to run Windows XP SP3 well. On top of that, the C: and D: (user data) drives were continually running out of space with every Adobe or Java update. The SAN was rapidly approaching 65% utilization, which for NFS marks the point where performance starts to degrade. Increasing memory and hard drive space to the level needed to improve Windows XP performance would require doubling the hardware resources, a significant investment in continuing to run VDI.

On the student desktop side, desktops were running OK off the secondary hardware while the 200 new Wyse clients slowly started to come online. However, we started seeing network connection errors, disconnects and View provisioning errors, probably related to running the View hosts across a 600Mbps link instead of local to the vCenter server. Student machines were also affected by the limited resources assigned to the OS and, again, increasing them to the appropriate settings would require doubling the hardware. For our older machines running the VDI blaster, we continued to see bad hard drives causing tough-to-diagnose connectivity issues.

We also started to experience vCenter server issues with dropped connections, loss of connectivity and the vCenter service crashing. Through many troubleshooting calls with VMware support and many KB-suggested steps, we would resolve the issues for a time, but they would inevitably return and bring more problems with them each time. And then we went on Thanksgiving break and something amazing happened. The system worked fine. With 30-100 users, the system didn’t experience any of the major performance issues we were seeing under normal use. This led me to the basic conclusion that, fundamentally, VM sprawl was killing the system.

While at the VMworld conference over the summer, I heard a VMware rep say during a presentation that Hertz, the rental car company, built a 4,000 seat VDI test infrastructure using traditional storage, just like us, and that the system hit the wall at 800 active users. Extrapolating that out to our system, which was supposedly built to support 1500 Windows XP SP3 systems with 512MB RAM and 8GB HDD, they probably hit the wall at around 300 users, but because the district cut over all desktops in one shot, there was no wall to stop them. They had started off with an under-engineered system, overloaded it on day one and then, because of the ease of adding new clients and the perception that thin clients were inexpensive for schools to purchase, had continued to add clients and suffer performance problems without regard to the inevitable consequences.
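For anyone who wants to check the back-of-the-napkin math, here is a rough sketch of that extrapolation. It assumes the wall scales linearly with the designed seat count, which is my simplification and not anything VMware stated, and the function name is made up for illustration.

```python
def estimated_wall(design_seats, ref_seats=4000, ref_wall=800):
    """Scale a reference deployment's saturation point to another design size.

    Assumes (hypothetically) that the wall grows linearly with design size.
    Defaults are the Hertz figures quoted in the presentation.
    """
    return design_seats * ref_wall / ref_seats


# Our system was supposedly designed for 1500 desktops.
wall = estimated_wall(1500)
print(f"Estimated wall: {wall:.0f} active users")

# Compare against the 850 active connections we were actually running.
active = 850
print(f"Active load is {active / wall:.1f}x the estimated wall")
```

By that rough estimate, the 850 active connections we were carrying put us at nearly three times the point where a comparable system fell over.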

Now, running 1200 VMs with 850 active connections, the system was continuously failing. Running the split scenario didn’t address the core storage issues because the drive allocations inside the VMs were not sufficient, and the connectivity issues required constant daily intervention to keep the student desktops running. Going into Winter break, something had to change. In Part 5, enter the Big Band-aid and the Great Migration plan.