Virtualization hardware sizing (quick and dirty approach)

Hardware virtualization is a hot topic these days and we all know that. Many customers are looking into it for the first time, and one of the problems they are facing right now is how to size their new virtual infrastructure. Lately I have received lots of requests from people asking me to help them project the hardware investments (in terms of physical servers) they need to make to jump onto the virtualization bandwagon. In this post I'd like to provide you with a very quick and dirty method to do that.

Consider that there are many alternatives to get to a "decent and professional" technical result: you can either hire a consultant to run a performance analysis of your current physical infrastructure and have him/her come up with the hardware required to support your workload, or you can do that on your own with the professional tools available in the market (consultants can also leverage these tools and still provide additional value). These are the best alternatives if you want a "professional" output that could help you better present your internal hardware purchase request; please keep this in mind throughout the document. These approaches, however, have a few drawbacks:

  • They are time consuming. No matter what, it takes time to gather the data and analyze it to come up with a proper sizing (professional tools can help a lot here).
  • They are expensive. If you want to use professional tools and/or consultants to do this, it will cost you some dollars/euros to come up with that magic number.

There is no free lunch. The more professional you want to go... the more expensive it gets.

Having said this, I'd like to offer you a "non-professional way out", so to speak. But before we get into the details of my methodology (ITIL practitioners might want to kill me for calling it a methodology) I want to set the stage for it. My sizing suggestions are going to be somewhat weak if you don't take into account the following concepts:

  • Taking a "snapshot" of a physical environments is certainly time consuming but also difficult to achieve. By definition x86 server farms are very dynamic in nature so it could happen, especially for big deployments, that as soon as you are done with accounting the last server in your study, the whole picture has changed (sometimes dramatically) with new servers being introduced and old servers being removed. It's like taking a picture of 20 kids playing in a green field... you can be 100% sure that two snapshots are never going to be the same.
  • Sometimes the performance effects that virtualization introduces are not predictable. There have been circumstances where a given application that was consuming very little resources on a physical system started to drag lots of CPU cycles once virtualized, for no apparent reason. It is not easy to develop an algorithm that takes physical resource utilization as an input and translates it into virtual resource utilization, simply because the effects of this thin layer are still very unpredictable in some cases.
  • Another thing to keep in mind is that over-provisioning hardware resources might turn out to be cheaper than buying consultants and/or tools to calculate the exact size of the resources you need in a more professional way. The idea here is that if you take an educated (and conservative) guess on the number and configuration of systems required to support your server farm (and you add a certain % of resources for contingency, just in case), the total amount of money spent on the infrastructure is likely to be less than what you would pay for a proper sizing consultancy + a more precisely sized infrastructure. Am I suggesting consultants are a waste of money? Not at all. They would provide you with a very professional study report that you can bring to your management for the budget approval of new servers and new infrastructure software.
  • I have myself provided quick and dirty suggestions to customers and IBM business partners, taking educated guesses on the sizing of a target Virtual Infrastructure in situations where a proper professional study report was not strictly required. And in those situations, back to the over-provisioning discussed above, I have never come across a customer that was unhappy because his/her systems were only running at 45% instead of being sized precisely to run at 65/70% resource utilization (especially in those circumstances where companies had to expand and could put those spare resources to good use in a short period of time).
  • How much would you want to stress your physical systems anyway? Is 65/70% CPU utilization a reasonable target before "moving to the next box"? Or is 50% a better number? Or can you push it close to 90/95%? Also, what happens to the response time of applications running in virtual machines when the resource utilization of the host goes up? This CPU-centric measure of course assumes that you have enough/balanced memory in your system (you will never get to 65% CPU utilization on a 4-socket system with just 4GB of memory, for example). These are road-blocks you will hit even in a professional study: even if you take into account all the details of the actual resource utilization of your physical systems, there is no magic rule at this point (at least that I am aware of) that can give you a definite answer on the target resource utilization for a given virtualized host before you start to see degraded performance.
  • Most of the time you should concentrate on CPU and memory configurations when sizing hosts that support virtual machines. This is not because the other subsystems are unimportant, obviously, but sizing a storage subsystem is a completely different matter and certainly out of the scope of this post. By the way, since a virtual infrastructure allows you to decouple the physical computational resources (i.e. the servers) from the virtual machines' hard drives (i.e. the centralized storage), you can size the two components separately. Sizing the storage will need to take into account things like overall space required, proper RAID levels and size of the logical units (but as I said this discussion is not within the scope of this post). Typically two standard HBA's in each of the servers comprising the virtual infrastructure have enough bandwidth to support the traffic to and from the SAN (no matter how the centralized storage has been sized). Similarly, the NIC's shouldn't be a problem given that most of the time the number of NIC's in a given system is a function of the complexity of the networks you need to connect your hosts to rather than a performance sizing exercise. That's the reason why the methodology below focuses on CPU and memory sizing: once you have determined the # of CPU's and the total amount of RAM, you can stick with two SAN HBA's per server and as many network connections as you need (as I said, the choice of the network subsystem configuration is more related to the virtual infrastructure architectural design than to a pure sizing discussion).

At this point you might consider this a weak argument and certainly not a very precise study on how to properly size your servers to support your VMware project. And that is exactly what it is: an educated guess based on "not actually measured data". So what is this all about? It is very simple and straightforward. Again, not very professional, but easy, quick and most of the time close to what a tool or a diligent study would tell you in a professional report.

So for those people that do not have the time and money to get on the "professional bandwagon", my suggestion is this: instead of starting from an actual measurement of your physical servers and then analyzing your data to "engineer" a proper sizing, why not go the other way around? Why not leverage industry "rules of thumb", reverse-engineering what other customers have already experienced world-wide, and apply them to your own scenario? We are talking about two very simple and straightforward data-points here:

  • Rule of thumb #1: Every brand new Intel/AMD core (or pCPU from now on) can support on average between 3 and 5 virtual CPU's (or vCPU's from now on).
  • Rule of thumb #2: For every brand new Intel/AMD core configured you should have between 2 and 4 GB of RAM to obtain a "balanced system".

That's pretty much it. Most VMware customers running VI3 in production world-wide will probably tell you that, no matter how they got to their setup (i.e. through a pilot, an educated guess, a consultancy, a tool analysis), they fall into the rules of thumb above: they are all supporting 3 to 5 vCPU's per physical core and their servers have between 2 and 4 GB of physical memory installed per physical core. Consider that the 3 to 5 is an average: I know customers in extreme situations that have 2 vCPU's per core and others that have 8 vCPU's per core. 3 to 5 is pretty conservative and, unless you are in such an extreme situation, the worst that could happen is that you over-provision your new server farm (see above).
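
If it helps to see the two rules written down as something you can play with, here is a minimal Python sketch of them. The function name, the constants and the quad-core default are my own illustrative assumptions, not the output of any official sizing tool:

```python
import math

# Rule of thumb #1: one brand new physical core supports roughly 3 to 5 vCPU's.
# Rule of thumb #2: plan for roughly 2 to 4 GB of RAM per physical core.
VCPU_PER_CORE_CONSERVATIVE = 3   # low end of the 3-5 range = more headroom
VCPU_PER_CORE_AGGRESSIVE = 5
GB_PER_CORE_CONSERVATIVE = 4     # high end of the 2-4 GB range = more headroom
GB_PER_CORE_AGGRESSIVE = 2

def size_hosts(total_vcpus, vcpu_per_core=VCPU_PER_CORE_CONSERVATIVE,
               gb_per_core=GB_PER_CORE_CONSERVATIVE, cores_per_package=4):
    """Turn a total vCPU count into cores, CPU packages and total GB of memory."""
    cores = math.ceil(total_vcpus / vcpu_per_core)
    packages = math.ceil(cores / cores_per_package)
    memory_gb = cores * gb_per_core
    return cores, packages, memory_gb
```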

So why not get straight into an example, which I am sure will clarify the whole thing. Say you have, for example, a physical server farm comprised of 60 physical servers. You are going to virtualize 55 of these 60 (this can be for multiple reasons). Of these 55 servers, 45 are going to be 1 vCPU virtual machines (either because they were already 1 pCPU servers or because they were SMP servers whose resource utilization is so low that you can afford to migrate them into a 1 vCPU sand-box). The remaining 10 of the 55 are going to be 2 vCPU virtual machines.

So let's do some math. The total number of vCPU's that you are going to activate is 45 x 1 vCPU + 10 x 2 vCPU = 45 vCPU + 20 vCPU = 65 vCPU's.

Let's now see how many cores you need to support this workload applying the first rule of thumb: 65 vCPU / 3 = 22 cores (rounded up). Out of the generic "3 to 5 vCPU per core" range I deliberately took the conservative 3 vCPU per core. If I wanted to be more aggressive I would have used 5 vCPU per core, so the math would have been 65 vCPU / 5 = 13 cores. This might well work, but we want to be conservative in such an "educated guess", so I would stick with the 22 cores.

Now, given that quad-core CPU's are widely available, we can calculate the number of actual "CPU packages" needed to support this virtual infrastructure: 22 cores / 4 cores = 6 CPU packages (rounded up). In terms of memory we would need 22 cores * 4GB = 88GB. Again we have taken a very conservative approach, using 4GB per core vs 2GB per core (see rule of thumb #2 above).

So in the end you can assume that in order to support your new 55 virtual servers above you need 6 CPU packages with a total of (roughly) 88GB of memory. Mapping this exercise to actual physical systems is not that difficult: most likely you would use 3 x dual-socket rack-based or blade systems, each with 2 x quad-core CPU's and around 32GB of memory.
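
Purely as an illustration, here is the same arithmetic run through the sketch introduced earlier (it reuses the size_hosts function and the math import from that snippet, so the same disclaimers apply):

```python
# 45 single-vCPU VM's and 10 dual-vCPU VM's, as in the example above.
total_vcpus = 45 * 1 + 10 * 2                    # 65 vCPU's

cores, packages, memory_gb = size_hosts(total_vcpus)
print(cores, packages, memory_gb)                # 22 cores, 6 packages, 88 GB

# Map the 6 packages onto dual-socket servers with 2 x quad-core CPU's each.
servers = math.ceil(packages / 2)                # 3 servers
print(servers, servers * 32)                     # 3 servers, 96 GB (32GB each)
```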

A word of caution is due on the memory configuration. This methodology requires a bit of "business sense" and the numbers should not be treated as a given; for example, due to high memory costs it might even be possible (and cheaper) to buy 4 x dual-socket rack / blade systems with 24GB or even 16GB each. 16GB in each of the 4 systems is 64GB of memory, which is anyway well above what the slightly more aggressive formula of 2GB of memory per core would have suggested (22 cores * 2GB = 44GB).
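
As a quick sanity check of that cheaper alternative against rule of thumb #2, reusing the constants from the first sketch (the 4 x 16GB figures are just the hypothetical configuration mentioned above):

```python
# Floor from rule of thumb #2 at 2GB per core, for the 22 cores we sized.
memory_floor_gb = 22 * GB_PER_CORE_AGGRESSIVE    # 44 GB

# The cheaper alternative: 4 dual-socket rack / blade systems with 16GB each.
alt_total_gb = 4 * 16                            # 64 GB

print(alt_total_gb >= memory_floor_gb)           # True: still above the 44GB floor
```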

Another point of attention is the CPU SKU (or model) to be used. There is typically a wide range of processor models within a given family (for example within the Intel 53xx family or the Intel 73xx family), and there is usually also a big price gap between the low(est)-end and the high(est)-end models of the same family. Given the nature of this methodology it would be difficult to suggest the right model for a specific scenario, but a good business/technical practice is to use the "n-1" or "n-2" models, with "n" being the high-end SKU. Usually the high-end model is optimal for "raw performance" while the n-1 and n-2 SKU's are optimal for "price/performance".

Again, it needs to be stressed that this is a high-level educated guess on how much computational power you would need to run your projected virtual servers. As I said at the beginning, this methodology does NOT overlap with a more professional approach. Having said this, however, the same business sense should be applied when interpreting suggestions gathered from professional IT tools or consultants that may or may not have a deep understanding of the x86 hardware market dynamics (in terms of pricing).

Two more things before we close this topic.

You might want to consider having an additional "building block" (i.e. server) in case one of the systems sized according to the methodology fails. So in the example above, assuming we stick with the 3 x dual-socket systems with 32GB of memory each, you might want a 4th server so that at any point in time you will always have at least 3 of them running even if one of them goes off-line. This is not mandatory, but you need to consider that you would be running with fewer resources should that happen.
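
To see why the 4th server is a comfortable (if optional) buffer, here is a rough back-of-the-envelope check under the same assumptions as before (dual-socket quad-core hosts with 32GB each):

```python
# With 4 hosts of 8 cores / 32GB each, losing one still leaves more than
# the 22 cores and 88GB the original sizing called for.
hosts, cores_per_host, gb_per_host = 4, 8, 32
surviving_cores = (hosts - 1) * cores_per_host   # 24 cores
surviving_gb = (hosts - 1) * gb_per_host         # 96 GB

print(65 / surviving_cores)                      # ~2.7 vCPU's per core, still at
                                                 # the conservative end of 3 to 5
```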

The other thing worth mentioning is that, as the number of physical servers to be virtualized grows, the number of CPU packages and the total amount of memory grow with it. It wouldn't be uncommon to require hundreds of CPU packages and terabytes of memory for a large-scale project. If that is the case, then another decision point is whether you want to scale out your new virtual infrastructure (i.e. with 2-socket rack / blade systems) or scale up (i.e. with 4-socket and above high-end systems). This is a pretty old document that I wrote on the subject that might help you with that decision. As I said it's a bit old, but most of the considerations still apply.

And with this I think I am done. I just want to wrap up by stressing again that:

  • this is far from being a professional approach (I know it very well)
  • yet it's a quick and dirty methodology that will get you "there" anyway.

Long live the sizing tools. Long live the sizing consultants.

Massimo.