vCloud Director 1.5 Multisite Cloud Considerations

In the last few months, among other things, I have been working on the document mentioned in the title. Being able to deploy vCloud Director 1.5 across different sites is something our customers and service provider partners have been asking us about a great deal.

Some of these customers and partners have decided to deploy independent vCloud Director instances in different “sites”; others wanted more clarity on how far they could stretch a single vCloud Director instance across multiple “sites”. Of course, both approaches present advantages and disadvantages.

We have never been very clear about the supportability boundaries other than “a single vCD instance can only be implemented in a single site”. What is a single site anyway? Is it a rack? Is it a building? Is it a campus? Is it a city? Is it a region? What is it? In this paper we have tried to clarify those boundaries. We have also provided some supportability guidelines.
In the document we have described the various components that comprise a vCloud environment and classified them into macro areas such as provider workloads, user workload clusters, and user workloads.

In a nutshell, throughout the document, we have tried to clarify and classify different MAN and WAN scenarios based on network connectivity characteristics (namely latency). We have determined, in our vCD parlance, what would constitute a single site deployment (over a MAN) and what would constitute a multisite deployment (over a WAN). We have determined 20 ms of latency to be “our” threshold between what we can and cannot support with this specific vCloud Director 1.5 release.

The document gets into a lot more detail and scenarios, but the two major takeaways are:

  • It is not possible to stretch the provider workloads, that is, the software modules that comprise your VMware vCloud (e.g. vCD cells, the vCD database, the NFS share, etc.).
  • It is possible to have Provider vDCs located up to 20 ms (RTT) away from the provider workloads (a quick way to sanity-check this is sketched right after this list).
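
If you want to sanity-check that 20 ms requirement yourself, a minimal sketch along the following lines approximates the RTT by timing TCP connection setups to the remote site. This is just an illustration, not something from the paper; the hostname and port are hypothetical placeholders.

    #!/usr/bin/env python
    # Rough sketch: approximate RTT by timing TCP connection setup
    # (roughly one SYN/SYN-ACK round trip) to a remote endpoint,
    # e.g. the vCenter serving a remote Provider vDC.
    # REMOTE_HOST is a made-up placeholder, not a value from the doc.
    import socket
    import time

    REMOTE_HOST = "remote-vcenter.example.com"  # hypothetical endpoint
    REMOTE_PORT = 443                           # vCenter HTTPS port
    SAMPLES = 10
    THRESHOLD_MS = 20.0                         # the 20 ms RTT threshold

    def sample_rtt_ms(host, port, timeout=5.0):
        """Time one TCP connect, in milliseconds (~one RTT)."""
        start = time.time()
        sock = socket.create_connection((host, port), timeout=timeout)
        elapsed_ms = (time.time() - start) * 1000.0
        sock.close()
        return elapsed_ms

    rtts = [sample_rtt_ms(REMOTE_HOST, REMOTE_PORT) for _ in range(SAMPLES)]
    avg = sum(rtts) / len(rtts)
    print("avg RTT: %.1f ms (max %.1f ms)" % (avg, max(rtts)))
    print("OK" if avg <= THRESHOLD_MS else "EXCEEDS 20 ms threshold")

Note that connect time slightly overstates the raw RTT (it includes connection handling on the far end), so treat the result as a conservative upper bound.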

This picture summarizes one of the supported scenarios:

In the doc we call out and describe more precisely other supported scenarios (such as stretched clusters) and the various caveats associated with them. The following are the scenarios we take into account:

It is important to understand that, when we talk about a distributed vCloud environment, we are not necessarily referring to DR of the end-user workloads. This is really about how a Service Provider can allow an end user to spin up workloads in a distributed environment. It doesn’t necessarily mean that the SP is responsible for failing over those workloads to the other data centers. If you want to know more about how to build a resilient vCloud architecture, you should read this link.

Towards the end of the document we have summarized the supportability statements associated with distributing compute resources in a vCloud setup. In the current version of the doc the summary looks like this:

If you are evaluating a multisite vCloud Director 1.5 deployment, you may want to give this document a read. Note that it isn’t published externally on vmware.com, but it is available through your VMware representative.

I’d be interested to hear any questions, comments, or feedback you may have.

Massimo.

15 comments to vCloud Director 1.5 Multisite Cloud Considerations

  • Interesting information on the support of stretched clusters with vCD. It seems that more and more of the VMware product suite is aiming to solve physical locality issues. I especially like the flow chart. :)

  • Dmitri Kalintsev

    Hi Massimo,

    What would be helpful in addition to the “20 ms” figure is a couple of clarifications:

    – 20ms RTT or OWD?
    – How much bandwidth needs to be permanently available to vCD for reliable support of a remote pvDC?

    Cheers,

    — Dmitri

    • Massimo

      Hi Dmitri.

      It is 20ms RTT.

      The bandwidth question… I knew you were going to ask that. :)
      Talking to engineering, it was clear that the “chokepoints” were primarily latency related. Also, latency sensitivity plays a big role in the communication among the “provider workloads” modules. Provider vDCs (especially when coupled with, and remoted along with, their own vCenter and vSM instances) are pretty good at coping with higher-latency connections and network glitches.

      To be conservative, we determined that splitting the provider workloads was not supported, and that we would allow remoting the PvDCs within that 20ms RTT latency budget.

      We didn’t call out bandwidth because we determined BW would be a function of 1) the amount of traffic you generate and 2) the user experience you want to deliver. In other words we don’t see BW as being a chokepoint that could generate timeouts / error messages (whereas latency could).

      We obviously need to apply some common sense here. If you are setting up a remote PvDC with 20ms RTT and 1Kb/sec of bandwidth, this may not fly very well (and you could possibly see problems). All in all, we didn’t feel there was a need to call out a minimum BW requirement, for the reasons above.

      Thoughts?

      Thanks. Massimo.

      • Dmitri Kalintsev

        Hi Massimo,

        Thanks for the answer. My thoughts on the bandwidth requirements ran along the lines of “how does vCD use this bandwidth?”, i.e., what end-user or machine-to-machine operations consume it? This would give one a method of estimating, with confidence, how much to provision, given knowledge (or an estimate) of the projected user/m2m transaction load.

        So if it is possible to provide some information about the conversations that vCD is having with its remote pvDC(s), what causes them (user actions or scheduled/transactional m2m actions), and how big a typical conversation is (bytes/time), that would be very helpful.

        The reason I’m asking is that in an SP environment it is often desirable to control the quality of experience, and to achieve that an SP may, for example, choose to dedicate some bandwidth; so obviously the question becomes “how much to dedicate?”. There is a fairly direct relationship between dedicated bandwidth and latency: if a bandwidth user starts to consume more than it has at its disposal, the network may start to delay or drop the excess traffic, which is probably not very desirable.
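
        To illustrate that relationship, here is a toy token-bucket sketch; the dedicated rate, buffer size, and offered loads are all made-up numbers for illustration:

            # Toy sketch: traffic above a dedicated rate is queued (adding
            # delay) and, once the buffer fills, dropped. All numbers assumed.
            rate = 10e6 / 8        # dedicated 10 Mbit/s, in bytes per second
            queue_cap = 50000.0    # bytes of buffering before drops start
            queue = 0.0
            for second, offered in enumerate([1.0e6, 1.2e6, 1.5e6, 2.0e6]):
                queue += offered - rate              # leftover bytes this second
                dropped = max(0.0, queue - queue_cap)
                queue = min(max(queue, 0.0), queue_cap)
                delay_ms = queue / rate * 1000.0     # queueing delay added
                print("t=%ds offered=%.1f MB/s delay=%.0f ms dropped=%.0f B"
                      % (second, offered / 1e6, delay_ms, dropped))

        Once the offered load exceeds the dedicated rate, added delay climbs until the buffer is full, after which the excess is simply dropped.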

        Just a thought – if these conversations are marked (surely they are, right?) with a DSCP value or values, knowing these would also be quite cool.

        And hopefully that would be useful not only to me, though some people may say that I’m splitting hairs here. 😉

        — Dmitri

        • Massimo

          Dmitri, I’ve just sent you the doc offline for you to have a look at. There is a chapter (#2) devoted to describing “what actions” consume that bandwidth (and deal with latency). Note that these actions aren’t so much infrastructure handshakes of sorts, but rather have to do with how the end user consumes the infrastructure. A typical example: we (the cloud admins) don’t know upfront whether there will be 2 users every day deploying 3 VMs (w/ a 2GB disk) from a centralized virtual data center into a remote virtual data center, or 100 users deploying 500 VMs (w/ a 50GB disk).
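
          To make that spread concrete, here is a purely illustrative back-of-the-envelope sketch; the 8-hour transfer window and the interpretation of the per-scenario VM counts are assumptions for illustration, not figures from the doc:

              # Hypothetical estimate: daily transfer volume for the two example
              # loads above, and the average bandwidth needed to move each
              # within an assumed 8-hour window. All numbers illustrative.
              GB = 1024 ** 3
              scenarios = {
                  "light": 2 * 3 * 2 * GB,   # 2 users x 3 VMs x 2 GB disks
                  "heavy": 500 * 50 * GB,    # 500 VMs x 50 GB disks
              }
              window_s = 8 * 3600            # assumed deployment window
              for name, volume in scenarios.items():
                  mbps = volume * 8.0 / window_s / 1e6
                  print("%s: %d GB/day -> ~%.1f Mbit/s average"
                        % (name, volume // GB, mbps))

          The light scenario needs only a few Mbit/s on average, while the heavy one needs several Gbit/s, which is exactly why a single minimum BW number didn’t feel meaningful.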

          Have a look at the doc and let me know what you think. Your comments are always spot on.

          Thanks. Massimo.

  • khai

    dear massimo,

    is it possible for me to have the offline docs too?

  • Chris

    Would it be possible to get a copy of the doc as well? I’ve asked my account rep, but he isn’t exactly sure what document to send.

  • Ho!! docs too please, can you?

    • Massimo

      Can you work with your VMware representative to get it? (if you don’t have an obvious VMware contact send me an email with your BUSINESS address – see my about page).

  • Loren Gordon

    Great info, thanks! Are there any changes to the vCD multi-site latency considerations for v5.1?

  • Ben

    Has the latency consideration changed for vCloud 5.5?
