Over the past few months I've been trying to figure out exactly why one of my customers is having such inexplicably low performance on their vSMP VMs. I've seen this now on both Linux and Windows VMs, and the common denominator is that they are all four-way vSMP and they run as terminal servers (X11 and RDP).
In its latest incarnation, the problem was observed on a Windows 2003 terminal server that bogged down far earlier than it ever did as a physical server, with Task Manager reporting CPU system time in the 70-90% range when busy. In short, whenever the system had a spike in CPU usage, almost all of it was system time, with close to zero reported as user time.
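To get a better picture of the user/system split over time than Task Manager gives, I'd log it inside the guest. A minimal sketch, assuming Python and the psutil package are available on the server (psutil.cpu_times_percent() returns the user/system/idle percentages for each sampling interval):

    import time
    import psutil

    INTERVAL_S = 5  # sampling interval in seconds; adjust to taste

    while True:
        # Blocks for INTERVAL_S seconds and returns the CPU split for that window
        cpu = psutil.cpu_times_percent(interval=INTERVAL_S)
        print(f"{time.strftime('%H:%M:%S')}  user={cpu.user:5.1f}%  "
              f"system={cpu.system:5.1f}%  idle={cpu.idle:5.1f}%")

Left running during a busy period, this makes the "all system, no user" spikes easy to correlate with what esxtop shows on the host.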
The first thing we did was try to eliminate co-scheduling problems by dedicating a single ESX host to this VM. The host had four dual-core AMD Opteron 8222 processors, and with the VM pinned to CPUs 4-7 it should never contend with other VMs, and should consequently be able to run on all four cores immediately whenever it becomes ready to run.
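For reference, the pinning was done with a scheduling affinity on the VM (Edit Settings > Resources > Advanced CPU in the VI client). If memory serves, the resulting .vmx entry looks something like the following; treat the exact key name as an assumption on my part rather than gospel:

    # Restrict this VM's vCPUs to physical cores 4-7 on the host
    sched.cpu.affinity = "4,5,6,7"
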
However, when watching esxtop on the host we observed CPU ready values in the three-digit range, quite frequently going as high as 500 (whatever that means; the utility says percent, but I am not convinced...).
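My current understanding, and this is an assumption on my part rather than something I've verified against VMware's documentation, is that the %RDY esxtop shows for a VM is summed across all the worlds in the VM's group (the vCPUs plus helper worlds such as the vmx and mks worlds), which would be how a 4-vCPU VM can legitimately show more than 100. A back-of-the-envelope conversion into actual ready time, assuming the default 5-second esxtop sample interval:

    SAMPLE_INTERVAL_S = 5   # esxtop default refresh interval
    NUM_VCPUS = 4           # vCPUs in the VM under investigation

    def ready_time_ms(rdy_percent):
        """Total ready time in ms accumulated by the group during one interval."""
        return rdy_percent / 100.0 * SAMPLE_INTERVAL_S * 1000.0

    for rdy in (100, 300, 500):
        avg_per_vcpu = ready_time_ms(rdy) / NUM_VCPUS
        print(f"%RDY={rdy}: {ready_time_ms(rdy):.0f} ms of ready time in a "
              f"{SAMPLE_INTERVAL_S}s interval (~{avg_per_vcpu:.0f} ms per vCPU)")

If that reading is anywhere near correct, 500 means the VM's worlds spent far more time waiting to run than actually running, which matches what the users are experiencing.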
This really has me stumped, since CPU ready is supposedly time spent waiting for the scheduler to open up a slot for the VM to run, and with only 50% of the host's cores in use that should never happen.
One possible explanation, which I hope someone can comment on, is that context switch overhead is being "hidden" as system time inside the VM and as CPU Ready in the hypervisor. There is quite a lot of context switching going on, and obviously that can hurt performance quite badly.
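To put a number on "quite a lot of context switching", a simple sampler inside the guest will do. Again a sketch assuming Python and psutil (psutil.cpu_stats().ctx_switches is a cumulative counter, so the rate is the delta over the interval):

    import time
    import psutil

    INTERVAL_S = 5  # sampling interval in seconds

    prev = psutil.cpu_stats().ctx_switches
    while True:
        time.sleep(INTERVAL_S)
        cur = psutil.cpu_stats().ctx_switches
        rate = (cur - prev) / INTERVAL_S
        print(f"{time.strftime('%H:%M:%S')}  ~{rate:,.0f} context switches/sec")
        prev = cur
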
This symptom is actually identical to what we've seen earlier on several Linux-based 4-vSMP terminal servers. Back then I assumed we were creating an impossible problem for the scheduler to solve, since that environment mixes quite a few of these 4-way systems with many single-vCPU VMs on 24 cores in total. However, based on my latest observations the problem cannot be explained this way, and I'm left wondering whether there is something fundamentally wrong in the combination of the platform and the ESX version we are running.
The environment is as follows:
HP BL465c blades with 64GB of memory and four dual-core Opteron 8222s. Memory interleaving is turned off in the BIOS, and ESX treats the nodes as NUMA nodes.
ESX version 3.0.3, VirtualCenter 2.5 Update 2.
Storage: EVA3000.
VM being troubleshot: Windows 2003 Enterprise Edition, 32-bit with PAE, 4 vSMP, 12GB of memory. User load is around 30, but only a few users were actually working during the measured interval. The context switch rate is quite high. No user process shows up in the top-10 list, but lsass and a few other system processes spike quite a bit periodically.
PS: The reason ESX 3.0.x is installed is that the EVA3000 disappeared from the HCL for 3.5, and we had to assume there was a reason for that.