A data center provisioning horror story

Yesterday I noticed a tweet from Frank Denneman:

I guess he was asking this in the context of the VMWonAWS cloud offering and how, with said service, you could provision vSphere capacity without having to “acquire server hardware”.

This reminded me of an anecdote I often use in talks to describe some of the data center provisioning and optimization horror stories. It won’t answer Frank’s question specifically, but it offers a broader view of how awful (and off the rails) things could quickly get inside a data center.

It was around 2005/2006 and I was working at IBM as a hardware pre-sales on the xSeries server line. I was involved in a traditional Server Consolidation project at a major customer. The pattern, in those days, was very common:

  • Pitch vSphere
  • POC vSphere
  • Assess the existing environment
  • Produce a commercial offer that would result in an infrastructure optimization through the consolidation of the largest possible number of the physical servers currently deployed
  • Roll out the project
  • Go Party

We executed flawlessly up until stage #4, at which point the CIO decided to provide “value” to the discussion. He stopped the PO process because he thought that the cost of the VMware software licenses was too high (failing to realize, I should add, that the “value for the money” he was going to extract out of those licenses was much higher than the “value for the money” he was getting out of the hardware).

They decided to split the purchase of the hardware from the purchase of the VMware licenses. They executed on the former while they started a fierce negotiation for the latter with VMware directly (I think Diane Greene still remembers those phone calls).

Meanwhile (circa two weeks later), the hardware was shipped to the customer.

And the fun began.

The various teams and LOBs had projects in flight for which they needed additional capacity. The original plan was that those projects would be served on the new virtualized infrastructure, which was to become the default (some projects could still have been deployed on bare metal, but that would have been the exception).

The core IT team had the physical servers available but didn’t have the VMware licenses that were supposed to go with the hardware. They tried to push back on those requests as much as they could, but they got to the point where they couldn’t hold them off anymore.

Given that IT had run out of the small servers they had used in the past to serve small requirements (which were supposed to be fulfilled by VMs from then on), they started to provision the newly acquired, super powerful 4-socket (2 cores each) / 64GB of memory bare metal servers to host small scale-out web sites and DNS servers!

While they would traditionally have used a small server for this (at a 5% utilization rate), they now had to use monster hardware (at a 0.5% utilization rate).

If you think this is bad, you haven’t seen anything yet. More horror stories to come.

Time went by, negotiations came to an end, and an agreement with VMware was reached. As soon as the licenses were made available, a new assessment had to be done (given that the data center landscape had drifted in the meantime).

At that time, there were strict rules and best practices regarding what you could (or should) have virtualized. One of those best practices was that you could (or should) not virtualize servers with a high number of CPUs (discussing the reasons for this is beyond the scope of this short post).

Given that those recently deployed small web sites and DNS servers showed up in the assessment as “8-CPU servers”, they were immediately deemed servers that couldn’t be virtualized for technical reasons.

We were left with a bunch of workloads that were supposed to go onto 2-vCPU VMs in the first place but that had instead ended up on 8-CPU monster hardware (due to gigantically broken decisions). And we couldn’t do anything about it.

This was 2005 and a lot of these specific things have changed since. However, I wonder how many of these horror stories still exist nowadays in different forms and shapes.

Massimo.