It’s interesting that up until a year ago many people were showing off their AMD tattoos, claiming that Opteron was king of the hill and Intel was going nuts. Nowadays all these people seem to wear very nice shirts that hide those tattoos, as there is now a consensus that Intel, backed by its immense R&D capabilities and even more immense marketing funds, has returned to being king of the hill, leaving the AMD Opteron (and even its own Itanium processor) in the dust. This is mainly due to two big achievements:
- the new Intel microarchitecture (branded “Intel Core”), which has delivered a huge core-for-core boost compared to the old NetBurst microarchitecture;
- the quad-core CPUs in the 2-socket space, which AMD has not yet managed to get working and which are a very interesting technology (given also the software vendors’ per-socket licensing schemes).
So Intel is back and AMD is lost in the blue again? That may be, but there are things going on that aren’t getting much attention (in my opinion) and that might give AMD a bit of a boost in the virtualization arena again (as was the case when they launched the Opteron and everybody was going mad for it).
Back to the point: there has been a lot of talk lately about the concept of “hardware assists”, which is basically a means for processor vendors (namely Intel and AMD in the x86 space) to create hardware platforms that are more virtualization-friendly than in the past. I have already touched very briefly on this concept in another post where I discussed, from a high-level perspective, the architectural choices of various hypervisors (namely VMware ESX, Xen and MS Viridian). You can read it here. Intel VT and AMD-V are really first-generation “hardware assists”, and they pretty much focus on the CPU subsystem. Future hardware-assist implementations will cover other server subsystems such as memory and I/O.
This is what I’d like to talk about: memory hardware assists. With the new quad-core CPU code-named Barcelona due to be available in a few weeks, with volume shipments in a few months, AMD is going to provide support for what they refer to as “Nested Page Tables” (NPT for short), which is essentially hardware support for memory virtualization.
A year ago at VMworld 2007, Sr. Director of R&D Jack Lo delivered an illuminating session on the matter: VMware and Hardware Assist Technology (Intel-VT and AMD-V). This session provided a very interesting insight into the mechanisms VMware uses today for memory virtualization (i.e. shadow page tables), which are basically a software “fake” that lets guest OSes pretend to have full control of the memory address space provided to them, while in reality it is the hypervisor that maintains full control of it. In fact, if you think about it, in a standard x86 world only one OS can run on the system, and it is that OS that keeps control of the hardware resources. In a virtual environment this stack is “screwed up”, since the OS doesn’t run on real hardware (and there are many OSes running on the system), so the hypervisor needs to create this software remapping of physical resources into the guest space. Mr. Lo also touched on future hardware-assist technologies that should provide a performance boost in this area, and AMD NPT was in fact mentioned. The good thing is that at some point “future” becomes “present”, and here we are.
The whole idea is that the processor itself can now keep track of these two levels of memory space (i.e. the one the hypervisor sees and the one each guest OS sees) without any software remapping being done inside the hypervisor, since it is the CPU that maintains these multiple mappings in registers built into the silicon. What VMware has been suggesting lately is that while their “software binary translation” has better performance than the silicon counterparts (Intel VT and AMD-V) for CPU operations, these Nested Page Tables will give a performance boost compared to their own “software shadow page tables” for memory operations. Without getting into the specifics, you can rest assured that VMware is going to pick up NPT support in future releases of the hypervisor in a timely manner. And no, if you were wondering, ESX 3.0.2 (the current version as of today) won’t support NPT.
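To make the two approaches concrete, here is a deliberately toy sketch of the two translation schemes. Real MMUs walk multi-level page-table trees in hardware, not Python dictionaries, and all the addresses below are made up for illustration; the point is only to show where the work lands: in software (shadow paging) versus in the silicon (nested paging).

```python
# Guest-virtual -> guest-physical mapping (maintained by the guest OS)
guest_pt = {0x1000: 0x5000, 0x2000: 0x6000}
# Guest-physical -> host-physical mapping (maintained by the hypervisor)
host_pt = {0x5000: 0x9000, 0x6000: 0xA000}

# Shadow paging: the hypervisor pre-combines both levels into a single
# "shadow" table in software, and must trap guest page-table updates to
# keep this table in sync -- that trapping is the overhead.
shadow_pt = {gva: host_pt[gpa] for gva, gpa in guest_pt.items()}

def shadow_walk(gva):
    """One lookup at run time, but the table itself is software-maintained."""
    return shadow_pt[gva]

def nested_walk(gva):
    """Nested paging (NPT): the MMU itself walks both tables, so the
    hypervisor no longer needs to intercept guest page-table changes."""
    gpa = guest_pt[gva]   # first level: the guest's own page tables
    return host_pt[gpa]   # second level: the hypervisor's nested tables

# Both schemes must of course resolve to the same host-physical address.
assert shadow_walk(0x1000) == nested_walk(0x1000) == 0x9000
```

Note that the nested walk does strictly more work per lookup (two levels instead of one), which is why NPT pays off mainly on workloads that change mappings frequently, not on ones that just touch a lot of memory.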
So when is this supposed to show big improvements? As always with performance-related things, it really depends on what you are doing. For the vast majority of CPU-intensive and/or I/O-intensive workloads, NPT won’t make much of a difference. There are, however, some workloads that might gain huge performance benefits. Typically these are applications with specific memory patterns. This does not necessarily mean virtual machines with big memory footprints, but specifically virtual machines with a very high number of “context switches”. A context switch occurs whenever a thread has to yield control to another thread; at a high level, when this happens the OS needs to save the volatile state of the outgoing thread and load the previously saved volatile state of the next thread to be executed. On a standard physical system this is a procedure that the OS handles with the support of the processor, while in a virtual environment the guest OS tries to do the same but, instead of getting hardware support for the switch, the hypervisor traps the request and reworks it to fit the real system resources (what actually happens is more complex, but you get the point). This generates overhead, especially if you consider that you normally get hundreds if not thousands of context switches per second on a Windows system. NPT is all about getting rid of this software remapping, allowing a much more streamlined path from the guest to the physical resource without the hypervisor acting as the “man in the middle”.
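If you want to see what your own context-switch rate looks like, the snippet below is a minimal, Linux-specific sketch that samples the cumulative counter the kernel exposes in /proc/stat (on Windows the equivalent figure is the “\System\Context Switches/sec” counter in Performance Monitor, which is where numbers like the ones discussed here usually come from).

```python
import time

def context_switches():
    # Linux-specific: the "ctxt" line in /proc/stat is the cumulative
    # number of context switches across all CPUs since boot.
    with open("/proc/stat") as f:
        for line in f:
            if line.startswith("ctxt "):
                return int(line.split()[1])
    raise RuntimeError("ctxt line not found in /proc/stat")

# Sample the counter twice, one second apart, to get a per-second rate.
before = context_switches()
time.sleep(1)
rate = context_switches() - before
print(f"~{rate} context switches in the last second")
```

Run it on an idle box and then under load: the difference between a few hundred and tens of thousands per second is exactly the difference between the workloads NPT ignores and the ones it helps.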
I came across a situation a while back with one of our biggest customers, who was reporting “performance” issues in a particular virtualized workload. This was an in-house COM+ application. During the analysis it turned out that the system under stress at peak hours was generating between 20,000 and 30,000 context switches per second, which is obviously well above the average number of context switches you would find on a Windows box. Interestingly, the problem brought to my attention was not that response times were unacceptable, nor that the application didn’t scale. The problem was that the virtual machine(s) in question were performing fine (in terms of response time) but CPU usage was absolutely abnormal: where a 2-CPU physical system running the same workload showed an average 5-10% CPU utilization with peaks in the range of 20-30%, the same workload in a 2-vCPU VM showed an average 30-40% CPU utilization with peaks in the range of 70-80%. And this was not an overcommitted ESX host, obviously: it was an 8-way system and this was the only VM running on it at test time. My current speculation is that this workload poses an extreme overhead on the hypervisor layer due to the very high number of context switches, and that this in turn causes very high CPU utilization to handle the remapping. This is a circumstance where NPT would/could/should be a life saver (based on my speculations, of course).
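A quick back-of-envelope model shows why this speculation is at least plausible. Every number below except the observed switch rate is an assumption I made up for illustration; in particular the per-switch trap cost is a round figure, not a measured one.

```python
# Observed: the customer's workload generated 20,000-30,000 context
# switches per second; take the midpoint.
switches_per_sec = 25_000

# Assumed (illustrative only): each guest context switch that the
# hypervisor has to trap and rework costs on the order of 10 microseconds
# of hypervisor-side CPU time.
trap_cost_us = 10

# Fraction of one CPU burned purely on the hypervisor's software remapping.
cpu_overhead = switches_per_sec * trap_cost_us / 1_000_000
print(f"~{cpu_overhead:.0%} of one CPU spent handling traps")  # ~25%
```

With these (assumed) numbers the trapping alone would eat roughly a quarter of a CPU, which is in the same ballpark as the gap observed between the physical box (5-10% average) and the VM (30-40% average). It proves nothing, but it shows the arithmetic is not absurd.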
Another situation where NPT might give a boost (and one that might interest the general VMware customer) is Terminal Services / Citrix environments. It is not, in my opinion, by chance that TS and Citrix environments are known not to scale very well on VMware. Although some customers have had moderate success running these workloads on ESX, most users report very high CPU utilization and very limited scalability (10-15 users can drive the vCPUs to 100% utilization). If you think about it, this makes sense, and it could very well be a situation similar to the one described above. While the standard server pattern is to run a single application/process (i.e. a single back-end application that may require 1, 2, 3 or 4 CPUs depending on the workload it needs to support), a Terminal Server / Citrix environment is a bit different in nature: there is no single “big process” to support, but rather many small processes (and an even greater number of threads) associated with the users connecting to it. So a standard server workload can be defined as a “one big process” pattern, while the TS / Citrix server can be defined as a “many small processes/threads” pattern. No surprise that VMware has always had issues running this latter pattern… it is very niche in nature and causes a huge number of context switches as well. In the final analysis, Terminal Services / Citrix workloads, based again on my pure speculation, could get a really huge boost from the combination of AMD Barcelona, which supports NPT, and a hypervisor that takes advantage of it.
As always in these cases, hardware features are worth nothing if the software doesn’t exploit them, so double-check the whole chain before buying a new CPU only to realize later that your hypervisor of choice won’t support it immediately.
For the sake of the discussion, it must be said that Intel is known to be working on a similar technology called “Extended Page Tables”, or EPT for short. This time, however, they are going to be beaten by AMD (at least in time-to-market), since we won’t see EPT that soon. It is worth noting, though, that Intel is also working in the short term with industry partners to fine-tune the current software algorithms and try to bridge the gap between today’s situation and what NPT might bring to the table (until they get to EPT).
Come on, AMD friends… perhaps this is the right time to pull out your t-shirts again and show the world your green tattoos!! Joking aside, I strongly believe this will be another big milestone towards removing the obstacles and fears surrounding virtualization and its associated performance overhead. Well done, AMD!
P.S. Just two seconds before publishing I noticed that AMD is now referring to NPT as “Rapid Virtualization Indexing”. Well, I guess even the marketing guys need/want their piece of the pie ($$$). I must agree that “Nested Page Tables” didn’t mean much to the average buyer.