Updates from February, 2013

  • Andrew T Schwab 12:12 pm on February 24, 2013

    VDI Sprawl 

    This is part 4 in a series of posts on my not-so-great experience inheriting a VDI infrastructure at a school district. We left off with the student virtual desktops running on the repurposed redundant hardware at the secondary site (no longer redundant, now that it was in production) and teachers and staff running on the DO hardware. Through the first few months of school, teacher and staff performance remained an issue, mainly because the VMs themselves were never allocated enough system resources to run Windows XP SP3 well. On top of that, the C: and D: (user data) drives were continually running out of space with every Adobe or Java update. The SAN was rapidly approaching 65% utilization, the point where NFS datastore performance begins to degrade. Increasing memory and hard drive space to the level needed to improve Windows XP performance would have required doubling the hardware resources, which represented a significant investment in continuing to run VDI.

    On the student desktop side, desktops were running OK off the secondary hardware while the 200 new Wyse clients slowly came online. However, we started seeing network connection errors, disconnects and View provisioning errors, probably related to running the View hosts across a 600Mbps link instead of local to the VCenter server. Student machines were also affected by the limited resources assigned to the OS, and again, increasing them to appropriate settings would have required doubling the hardware. For our older machines running the VDI Blaster, we continued to see bad hard drives causing tough-to-diagnose connectivity issues.

    We also started to experience VCenter server issues: dropped connections, loss of connectivity and the VCenter service crashing. Through many troubleshooting calls with VMWare support and many KB-suggested steps, we would resolve the issues for a time, but they would inevitably return and bring more problems with them each time. And then we went on Thanksgiving break and something amazing happened. The system worked fine. With 30-100 users, the system didn't experience any of the major performance issues we were seeing under normal use. This led me to the basic conclusion that VM sprawl was fundamentally killing the system.

    While at the VMWorld conference over the summer, I heard a VMWare rep say during a presentation that Hertz, the rental car company, had built a 4,000 seat VDI test infrastructure using traditional storage, just like us, and that the system hit the wall at 800 active users. Extrapolating that to our system, which was supposedly built to support 1500 Windows XP SP3 systems with 512MB RAM and 8GB HDD, we probably hit the wall at 300, but because the district cut over all desktops in one shot, there was no wall to stop them. They had started off with an under-engineered system and overloaded it on day one. Then, because adding new clients was so easy, and because thin clients were perceived as cheap for schools to purchase, they continued to add clients and suffer performance problems without regard to the inevitable consequences.
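For what it's worth, that extrapolation is simple proportional scaling. A sketch of the arithmetic (the Hertz figures are as I heard them at the conference; the linear-scaling assumption is mine and is rough at best):

```python
# Back-of-envelope capacity extrapolation, assuming the failure point
# scales linearly with design size (a crude assumption, but enough for
# a sanity check).
def estimated_wall(design_seats, reference_design=4000, reference_wall=800):
    """Scale a reference system's observed failure point to another design size."""
    return design_seats * reference_wall // reference_design

# Hertz: 4,000-seat design, wall at 800 active users (20% of design).
# Our system: 1,500-seat design -> an estimated wall near 300.
print(estimated_wall(1500))  # 300
```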

    Now, running 1200 VMs with 850 active connections, the system was continuously failing. The split scenario didn't address the core storage issues, because the drive allocations inside the VMs were insufficient, and the connectivity issues required constant daily intervention to keep the student desktops running. Going into Winter break, something had to change. In Part 5: enter the Big Band-aid and the Great Migration plan.

     
  • Andrew T Schwab 9:00 am on February 11, 2013

    I Don't Think VDI Redundancy Means What You Think It Means 

    In my previous VDI posts, I outlined the virtual desktop infrastructure I inherited and discussed the stability issues being faced. In this, part three, I will talk about redundancy, or the lack thereof, and how I was able to use some over-engineering to my advantage, if only for a little while.

    Summer was fast coming to a close, a few hundred more virtual desktops were about to come online and system performance was horrible. We had already decided not to go down the road of making the significant investments required to rebuild a VDI system capable of supporting the swelling number of desktops while providing acceptable performance for staff and teachers. Desktops, after all, were a dying paradigm, but if we didn't do something, the whole system would implode under the load of the additional desktops. Enter the promise of total system redundancy.

    The initial system design called for a redundant backup site with identical SAN and server hardware to fail over to in the event of an outage at the DO. Using VMWare's Site Recovery Manager (SRM), the VMs at the DO would automatically fail over to the secondary site and come up with only minor interruption. Email would continue to flow, users would continue to work on their virtual desktops, all would be right with the world. In fact, everyone was under the impression that this was in place when I arrived. Only it wasn't. SRM was not working. SAN replication was not working. Digging further, I found that even had they been working, when the VMs failed over to the secondary site they would have had nowhere to go, because every site except the secondary one would have been unable to reach them. There were no redundant links from the other school sites to the secondary site. There was also no backup Internet connection at the secondary site. It would have been a failover to nowhere.

    A further breakdown revealed that SRM had been configured to fail over servers only (no View virtual desktops) and was at some point actually working. However, it broke when the DO site was upgraded to 4.1 and the secondary site was not. To add insult to injury, in the course of evaluating storage upgrade options, it was discovered that the DO SAN, the one running all the district's servers, email and virtual desktops, had been purchased with only a single controller. Somehow the project moved forward with redundant firewalls, web filters and routers, but not a redundant SAN controller. So much for redundancy. As for the site connections, the redundancy plan evidently called for a network design that never materialized, leaving the district with a secondary site full of equipment, under warranty, sucking up power and AC, that sat basically idle with nothing to do except run two backup Active Directory servers.

    So in an unorthodox (and probably unsupported) move, I decided to harness the idle power of the secondary site to run the student virtual desktops. Because of the way the View connection brokers were set up with the DevonIT Echo server, I could not separate the student pools onto their own VCenter server in the time allotted, as much as I wanted to. Instead, I attached the hosts from the secondary site to the VCenter at the DO, moved the student master images over to the secondary SAN and reconfigured the pools for the new cluster, effectively running all of the student virtual desktops on the secondary site hardware. This re-use of the redundant hardware saved us for a while. We found out later that the combined IOPS between the two SANs in this configuration was running at 18,000-20,000, which would have easily brought the DO SAN to its knees.
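As a rough sanity check on that IOPS number, here is a quick per-desktop calculation. The VM count is approximate, and the 5-10 IOPS steady-state budget commonly quoted for light Windows XP desktops is a general rule of thumb, not a measurement from our system:

```python
# Rough per-desktop IOPS check. The 18,000-20,000 combined figure came
# from the two SANs; the ~1,200 VM count is approximate.
def iops_per_desktop(total_iops, vm_count):
    return total_iops / vm_count

low = iops_per_desktop(18000, 1200)   # 15.0
high = iops_per_desktop(20000, 1200)  # ~16.7
print(low, round(high, 1))
```

Fifteen-plus IOPS per desktop is well above what a single traditional SAN sized for "light office" desktops typically absorbs, which is why splitting the load across two SANs was the only thing keeping the system upright.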

    However, even over the 600Mbps connection from the DO to the secondary site, we started to see problems. Pool errors during provisioning pointed to communication problems, so we disabled the "stop provisioning on error" setting. Incidents of student computers disconnecting, either because of the increased sensitivity of PCoIP or because of the old hardware being used as clients, increased. Under the increased student load, VCenter started becoming unresponsive. Staff and teacher pools continued to have performance problems even with the students running on separate hardware. Despite this, the system did not totally crash and burn. It limped into the new school year with no money invested and another 200 desktops online. Or so we thought.

    In part 4, how 200 new desktops turned into 300 and what happened when schools actually started using them.

     
  • Andrew T Schwab 9:00 am on February 7, 2013

    The Long March To VDI Stability 

    This is the second post in a multi-part series about my experience with VDI over the past 10 months. In the first post I laid out the VDI situation I inherited. To recap: post-VMWorld in August, I realized I had a VDI system that was under-spec'd, not well implemented or configured, a major version out of date (4.1u1), badly over-utilized (suffering from VM sprawl, more on that later) and providing users with a very poor computing experience. So I set out to develop a plan to fix it, as any good IT person would do.

    Initially I was looking for ways to stabilize the system and improve the end user experience. Never mind that the desktop paradigm for teachers and students is horribly outdated in the age of anywhere, anytime learning. Never mind that tying teachers to a desktop fixed in space makes building collaborative Professional Learning Communities around student assessment data basically impossible. And never mind that virtual desktops unable to run Skype or Google Hangouts or webcams, that can't play videos or connect to other classrooms over the Internet, or to authors or NASA, that continuously run out of hard drive space with every Adobe Flash or Java update, do not empower teachers or students with 21st Century learning abilities and are not the kind of computing environments we should be building for teachers today.

    I was looking for cost-effective ways to get the system back to what it was designed to do: provide a platform for teachers to take attendance, enter grades and check email, and marginally support student computing. It turns out cost-effective and VDI don't really play well together. Just to stabilize the system, with a health check and a migration to 5.0, was a six-figure prospect. Adding the hardware to increase RAM and HDD capacity in guest VMs: more six figures. Fixing the storage problems with something better suited to the peak demands of 1200 virtual desktops: more six figures. The management software needed to really see what was going on with the complex moving parts? Only five figures, but with a high recurring cost. Replacement end point devices for teachers: six figures yet again. The numbers kept adding up, and no matter how I sliced and diced them, the conclusion was that getting the system stable and viable over the next three years was going to be expensive. Certainly much more than the low-cost system it was initially pitched as.

    There was another factor I was considering when looking at price. Support had been pre-paid for five years, with the VMWare renewal just one budget year away and the SAN and server renewals due the year after. On top of that, the server hardware and existing SAN would soon be five years old. Five years, for critical infrastructure that 99% of all the desktops in the district were running on. Now, I have run servers out to seven and even eight years, but never critical systems. Five years has always been my end of life for critical production servers, and in this case the equipment had experienced two major high-heat events when the air conditioning failed in the server room. In one instance, the thermometers were pegged at 120 degrees and the SAN did not shut itself down. Not the environment that lends itself to extending the life of computer hardware.

    Factor in the significant investment needed to make the VDI system right, a critical lack of sysadmin capacity and skill level (I've learned more about VDI in the past 10 months than I care to know; it's basically my second job) and the prospect of significant support renewal costs on the horizon, and the only cost-effective solution was obvious: scale back the number of users to a point where the existing hardware could support decent performance, and phase out the VDI system over time. We would make the best use of the investment that had been made, but not throw more money into an outdated paradigm that we weren't equipped to support and couldn't afford to maintain over the long term. But this would take time. Time we did not have.

    Storage was the major issue. With several schools bringing new thin clients online over the summer (purchases already in the pipeline when I arrived), we were looking at a total system collapse if we didn't do something. The solution presented itself in an unexpected place. In part three, we talk redundancy!

     
  • Andrew T Schwab 12:44 pm on February 2, 2013

    VDI Hell 

    I sat down to write a post about storage solutions and my recent decision to purchase a Nimble Storage array. However, I wanted to properly address why I was looking for new storage to begin with, and I didn't want a simple post to turn into an 8,000-word Greek tragedy. So let me set the stage with the (multi-part) backstory, and I'll write the storage post a bit later.

    Almost 10 months ago today, I stepped into a new district and inherited a VDI infrastructure that, on paper, most IT people dream about. Lots of Dell chassis and blades, big iron SANs, VMWare View, redundancy. The whole nine yards. At least on paper. You see, the district decided back in 2009 to be an early adopter of VDI. Rather than pilot a few dozen users and scale up, they went all in and cut over every user in the district virtually overnight. And they did it with a vendor that had never implemented VDI on such a scale. To be fair, I don't think very many had back then. Suffice it to say, there were issues. Upon my arrival I found a system suffering from major performance problems with many different causes.

    The traditional SAN storage, which pretty much everyone now acknowledges is the critical factor in running VDI desktop environments, was an obvious bottleneck. The system was also suffering from a major case of VM sprawl. More and more client machines had been added without consideration for server-side capacity. After all, adding VMs was so easy in the new View environment. Additionally, for the school sites, adding clients became a cheap proposition: sub-$400 Wyse thin clients, or free in the case of donated desktops running thin client software (we are using DevonIT VDI Blaster).

    As if all that were not enough, the initial hardware was sized to the absolute bare-minimum requirements for memory and hard drive space. The guest VMs were set up to run Windows XP SP3 with 512MB of RAM and an 8GB HDD. As you can imagine, this caused operating system performance issues inside the VMs in addition to the external storage performance issues with the SAN. In a perfect world, we would simply allocate more resources to each VM; however, a lack of forward planning meant that the original hardware purchase was just enough to meet the initial guest VM requirements. We did not have enough host memory or SAN space to give the Windows XP clients additional system resources, never mind trying to upgrade to Windows 7.
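The aggregate math here is simple but unforgiving. A sketch of why "just give the VMs more resources" doubles the hardware bill (the 1,500-seat design figure is from the original build; the doubled 1GB/16GB allocation is my illustrative target, and this ignores memory overcommit and thin provisioning, which only soften the numbers somewhat):

```python
# Minimal sizing sketch: aggregate host RAM and SAN space implied by a
# per-VM allocation, before any overcommit or thin provisioning.
def aggregate_need(vm_count, ram_mb_per_vm, disk_gb_per_vm):
    """Return (total RAM in GB, total disk in TB) for a full allocation."""
    return vm_count * ram_mb_per_vm / 1024, vm_count * disk_gb_per_vm / 1024

# Original bare-minimum spec: 512 MB RAM / 8 GB HDD per XP guest.
print(aggregate_need(1500, 512, 8))    # (750.0, 11.71875)  -> 750 GB RAM, ~11.7 TB
# A more workable (hypothetical) 1 GB / 16 GB spec doubles both totals.
print(aggregate_need(1500, 1024, 16))  # (1500.0, 23.4375)
```

Since the hosts and SAN had been purchased to just barely cover the first line, the second line meant buying the whole environment again.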

    A cost-saving decision that haunts us to this day (and there were others) was to re-use old computer hardware as clients. Those machines are now anywhere from 8 to 12 years old. Not only do hardware failures among these systems present themselves as intermittent connection and stability problems, but many of them can only run the RDP protocol. Audio/video playback, something that has become critical to classroom instruction over the last few years, is painful if not downright impossible under RDP. This severely limits the options teachers and students have for accessing 21st Century learning resources.

    There are also many moving parts in the VDI system. There is a SQL Server database server hosting the databases for VCenter and View; I have had to dust off my rusty database server skills to fix major downtime-causing issues with both the SQL Server instance and the individual databases. There are also two connection servers which provide the View connection brokering, and there have been several issues with both. The most interesting was corruption in the local ADAM database, which caused all kinds of odd behavior in our View desktop pools. The single VCenter server manages both ESX server hosts and View hosts, and it often appears to "pause" under the heavy task load of serving up 1400+ available VMs with over 800 active connections. After a particularly bad power outage, when all the systems went down hard, two View hosts appeared perfectly fine but, when active in their respective clusters, caused all kinds of havoc with desktop provisioning. Active Directory and networking also play a pivotal role in the system, and on more than one occasion both have thrown a wrench into the works in one way or another.

    By now you may be thinking, "What's the big deal? IT departments spend lots of time keeping VDI running. Isn't that what the IT department should be doing?" No. We should not. Not with a staff of three, including me. After attending the VMWorld conference this summer, I was struck by how often I heard "storage team", "server team" and "database team" in conversations about supporting VDI. These were people talking about 200-500 desktop deployments. There I was with 1200 (at the time) VMs thinking, "Teams? I've got me, a network engineer and a desktop tech. There are no teams!" With the limited staff available to me and the many moving parts (complex enterprise moving parts, I might add), keeping the system running was an exercise in extreme firefighting. We had no time to be proactive, and when the system hiccuped, every user in the district was affected. It was an untenable situation to find oneself in, but that is where I was after coming back from VMWorld: hit with the realization that I had a hugely complex system that had not been set up well and was failing on many levels.

    What was I to do? The answer, perhaps, to come later in Part 2.

     