TCP-clouds, UDP-clouds, “design for fail” and AWS

An entire Amazon AWS Region was recently down for four days. Everyone has got to blog something about it and this is my attempt. Just as a warning: this post may be highly controversial.

There has been a litany of tweets pontificating about how applications on AWS should be deployed in a certain way to achieve the maximum level of availability and how applications need to be “re-architected” to properly fit into the new cloud paradigm. Basically the idea is that your application should be conceived, designed, architected, developed and deployed with failure in mind. Many call it “design for fail”. That is to say: software architects and developers should never assume that any given piece of the infrastructure is reliable.

I beg to differ. I don’t like this idea even though some of you will be thinking I am a bit archaic.

George Reese wrote a great blog post titled The AWS Outage: The Cloud’s Shining Moment outlining the differences between the “design for fail” model and the “traditional” model. The traditional model, among other things, has high-availability and DR characteristics built right into the infrastructure and these features are typically application-agnostic (a couple of years ago I wrote a big document on the various alternatives for HA and DR of virtual infrastructures if you are interested). George nailed down the story very well and the story is that there are a couple of different philosophies at play here. I don’t call these two models “design for fail” and “traditional” though. I call them TCP-clouds and UDP-clouds. Let’s look at a summary of the characteristics of these two protocols.

In the context of cloud resiliency this is what that means:

AWS uses a UDP-cloud model because it doesn’t guarantee reliability at the infrastructure level. AWS essentially offers an efficient distributed computing platform that doesn’t have any built-in high availability services. The notion of Availability Zones and Regions is often misunderstood since the names may imply there is high availability built into the EC2 service. That’s not the case: AWS suggests deploying in multiple Availability Zones simply to avoid concurrent failures. It’s mere statistics. In other words, if you deploy your application in a given Availability Zone, there is nothing that will “fail it over” to another Availability Zone as part of the AWS service (RDS is a vertical example that does that for MySQL, but I am talking about an application-agnostic service that does that for every application regardless of its nature).
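To put a number on the “mere statistics” argument, here is a back-of-the-envelope sketch. The per-zone outage probability is a made-up figure, not an AWS one, and the independence of zone failures is an assumption (a Region-wide event is exactly the case where it does not hold).

```python
# Back-of-the-envelope: chance that every zone you deployed in is down at once.
# ASSUMPTIONS (not AWS figures): each zone is down with probability p_zone,
# and zone failures are independent of each other.

def all_zones_down(p_zone: float, zones: int) -> float:
    """Probability that all `zones` zones are unavailable simultaneously."""
    return p_zone ** zones

p_zone = 0.01  # hypothetical: a zone is unavailable 1% of the time
for zones in (1, 2, 3):
    chance = all_zones_down(p_zone, zones)
    print(f"{zones} zone(s): {chance:.6%} chance of total outage")
```

Note what the arithmetic does not give you: nothing here moves a running application from one zone to another. It only says that copies you deployed yourself are unlikely to fail together, as long as the zones really fail independently.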

Since I am not able at the moment to write a structured argument around this complex matter, let me write down mixed and random thoughts, opinions and questions to try to make you think. I am giving you some food for thought. As for answers, call me when you find them please.

Isn’t this “design for fail” theory a step back?

What we have seen in the last decade was a trend where we were able to remove the complexity of non-functional requirements from within the traditional OS and push it down into the “virtual infrastructure” (arguably the backbone of any IaaS cloud). This is the point I was trying to get across during this VMworld 2007 breakout session 4 years ago. And what we are saying now is that we should put that logic back into the application (not even the Guest OS)? I thought the trend I have just described was quite successful and one of the many reasons for the success of virtualization deployments. Are we now questioning it? My idea is fairly simple, although I am open to being challenged: developers focus on functional requirements, IT focuses on non-functional requirements (which include resiliency and reliability among other aspects). If interested, you can download the full deck here. Note I did that presentation before joining VMware so, if you think I am biased, well, I am biased just because I bought into that school of thought long before I was on VMware’s payroll.

Excuse me? What did you say? NoSQL… to whom?

In his post George suggested exploring NoSQL solutions. Not a bad idea. However, quite apart from the risk of losing transactions that he mentioned, I’d say 95% of the customers I have been working with so far would look at me strangely and ask: “what do you e x a c t l y mean by NoSQL? Is it a bad word?”. Let’s be honest folks: this is not mainstream. If we want to create a cloud for an elite of people I am fine with that. However I am convinced one of the key values of an IaaS infrastructure is, among others, providing a cloud-like experience (pay-as-you-go, elasticity, etc.) to traditional workloads. I am not philosophically against the idea of re-architecting applications; however I am also convinced that, for every person thinking about writing a brand new Ruby application for a UDP-cloud leveraging NoSQL (pardon me?)… there are at least 1,000 poor sysadmins trying to figure out how to live with their traditional applications.

Can you afford a personal Chaos Monkey?

Some AWS customers have developed tools to test the resiliency of their applications. Do you remember the good old HA and DR plans? IT people would walk into the server room to power off servers, and eventually the entire datacenter, to simulate a failure and see if their HA and DR policies were working properly. If everything was good, applications could survive the failure (more or less) transparently. This is what a Chaos Monkey tool does, but from a different perspective: these are software programs designed to break things randomly (on purpose) in order to see if the application itself is robust enough to survive those artificially created infrastructure issues in the cloud.

In a TCP-cloud it would be the cloud provider running traditional tests to make sure the infrastructure could self-recover. In a UDP-cloud it is the developer running these Chaos Monkey tests to make sure the application could self-recover, since it’s been “designed for fail”. Now, my take is that if you are Netflix or the likes of NASA and JPMorgan (these two are just examples of big organizations, I am not even sure if they are on Amazon) then you may have enough motivation and business reasons to re-architect your application for a UDP-cloud and create your own Chaos Monkey to test your “design for fail” deployment. Certainly at Netflix they know what they are doing, and in fact they seem not to have been impacted by this AWS outage.

But if you are these guys, do you think you have the bandwidth, knowledge and time to re-architect the application and test it for failure? That AWS forum discussion showed up during the four-day debacle and it deserves a proper copy and paste just in case it gets lost:

< Sorry, I could not get through in any other way. We are a monitoring company and are monitoring hundreds of cardiac patients at home. We were unable to see their ECG signals since 21st of April.

> Man mission critical systems should never be ran in the cloud. Just because AWS is HIPPA certified doesn’t mean it won’t go down for 48+ hours in a row.

< Well, it is supposed to be reliable…Anyway, I am begging anyone from Amazon team to contact us directly.

This is shocking, isn’t it? Try to argue with them about NoSQL and “design for fail”. They probably barely understand the notion of Availability Zones and Regions. Don’t get me wrong. It’s not these people’s fault. They are not in the business of re-architecting an application to be written with reliability in mind, they are in the business of helping their patients. Sure, you can argue with them that it was their fault if they failed. But the net of this story is that they are not going to re-architect anything nor write a Chaos Monkey. When they realize what happened, they will look for a TCP-cloud.
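For readers who have never seen one, the Chaos Monkey idea described above can be reduced to a toy sketch. Everything here is hypothetical (the instance names, the fleet sizes and the health rule are mine); it is not Netflix’s tool and it calls no real cloud API.

```python
import random

# Toy "chaos monkey": kill random instances, then check whether the service
# survives. All names and rules below are assumptions made for illustration.

def unleash_monkey(instances: dict, kills: int, rng: random.Random) -> list:
    """Mark `kills` randomly chosen running instances as dead; return their names."""
    running = [name for name, alive in instances.items() if alive]
    victims = rng.sample(running, min(kills, len(running)))
    for name in victims:
        instances[name] = False
    return victims

def service_is_up(instances: dict, minimum_alive: int = 1) -> bool:
    """Assumed health rule: the service is up while enough instances are alive."""
    return sum(instances.values()) >= minimum_alive

rng = random.Random(42)  # fixed seed so a failing run can be replayed
fleet = {f"web-{i}": True for i in range(4)}  # application "designed for fail"
single = {"web-0": True}                      # traditional single-instance application

for label, deployment in (("redundant", fleet), ("single", single)):
    victims = unleash_monkey(deployment, kills=1, rng=rng)
    status = "up" if service_is_up(deployment) else "DOWN"
    print(f"{label}: killed {victims}, service is {status}")
```

The redundant fleet loses one instance and keeps serving; the single-instance deployment goes down. Writing the monkey is the easy part. Building the application that survives it is where the bandwidth, knowledge and time go.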

Design for fail: philosophy or necessity?

I hope you’ve got at least to this point because this is my biggest struggle at the moment. The more I read suggestions to design applications for fail, the less I can tell whether these suggestions are tactical or strategic. In other words, are you suggesting to design for fail simply because that’s the way Amazon AWS works today (but you’d rather use an Amazon TCP-cloud if that was available)? Or are you suggesting that, in any case, you should design an application for fail because you are happy to deal with a UDP-cloud and that’s how every cloud should behave? Are we saying that it’s strategically and philosophically better to have developers deal with application high availability and disaster tolerance because that’s what makes sense to do? Or are we saying we need to do this because that’s the only option we have on Amazon AWS (today) and there is no other choice? I know it may sound like a rhetorical question but it’s actually not. Perhaps we need both models?

You don’t like the noise coming from the other apartments? Buy the entire building!

This isn’t related to the outage and the resiliency of the cloud but it relates to the overall TCP-cloud Vs UDP-cloud discussion. Similar to “design for fail”, there is a “deploy for performance” thread going on. In a multi-tenant environment (a must-have to achieve economy of scale and elasticity) there is obviously contention for resources. In an ideal world I’d like to be able to buy the virtual capacity I need and have a certain level of guarantee that that capacity (or at least a contracted part of it) is always available to me. There are of course circumstances where I can trade off performance and availability of capacity for a lower cost, but there are other situations where I cannot. A TCP-cloud should (ideally) be able to deliver that guarantee. A UDP-cloud works in best-effort mode and typically relies on statistics to fight contention. This is the statistical assumption: not all users running on a shared infrastructure will be pushing like hell at the same time (one would hope, fingers crossed).
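That statistical assumption can be written down. The sketch below is mine, with made-up numbers: it assumes each tenant is busy independently with the same probability, and that a host can serve a fixed number of busy tenants before they start hurting each other. Real tenants are neither independent nor identical.

```python
from math import comb

def contention_probability(tenants: int, p_busy: float, capacity: int) -> float:
    """P(more than `capacity` tenants are busy at once).

    Assumes the number of busy tenants follows Binomial(tenants, p_busy),
    i.e. every tenant is busy independently with probability p_busy.
    """
    return sum(
        comb(tenants, k) * p_busy**k * (1 - p_busy) ** (tenants - k)
        for k in range(capacity + 1, tenants + 1)
    )

# Hypothetical host: 10 tenants, each busy 20% of the time.
for capacity in (2, 4, 6):
    risk = contention_probability(tenants=10, p_busy=0.2, capacity=capacity)
    print(f"room for {capacity} busy tenants -> contention risk {risk:.4f}")
```

The provider picks how much room to leave, and the tenant gets the resulting odds rather than a guarantee. That is what best effort means here.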

So what do you have to do if you are running on a UDP-cloud? You keep the other people out of your garden.

I think Adrian is a genius but I don’t agree with his point of view:

“…you cannot control who you are sharing with and some of the time you will be impacted by the other tenants, increasing variance within each EC2 instance. You can minimize the variance by running on the biggest instance type, e.g. m1.xlarge, or m2.4xlarge. In this case there isn’t room for another big tenant, so you get as much as possible of the disk space and network bandwidth to yourself.”

“…busy client can slow down other clients that share the same EBS service resources. EBS volumes are between 1GB and 1TB in size. If you allocate a 1TB volume, you reduce the amount of multi-tenant sharing that is going on for the resources you use, and you get more consistent performance. Netflix uses this technique, our high traffic EBS volumes are mostly 1TB, although we don’t need that much space.”

“If you ever see public benchmarks of AWS that only use m1.small, they are useless, it shows that the people running the benchmark either didn’t know what they were doing or are deliberately trying to make some other system look better.”

The last sentence is like saying that, if you buy a new apartment and then complain about the big noise coming from other apartments, it’s your fault: you should have bought the entire building and enjoyed the silence! Hell Adrian, I say no! There must be a better way.

I think there must be rules in place to keep the noise at an acceptable level, and if someone is screaming all the time, someone else should “enforce” silence without you having to buy an entire building to cook and sleep in peace. That’s how it works in real life, that’s how it should work in the cloud. In my opinion at least.

In cloud terms I’d be ok if what I was buying always delivered a contracted baseline as a guarantee and could then burst (I said burst Beaker, not cloudburst) to higher throughput when there isn’t contention. What I would NOT be ok with is no baseline at all, so that performance is never predictable. BTW note that Amazon took a step in the right direction a few weeks ago by announcing the availability of what they call dedicated instances. This is an attempt to solve the noisy neighbors problem. However in doing so they traded off multi-tenancy (hence the higher cost of such a service).
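A minimal sketch of the “contracted baseline plus burst” behavior I have in mind follows. It is not how any provider I know of schedules resources: the function, the tenant names and the units are hypothetical, and it assumes the provider never sells more baseline than the host can deliver.

```python
def allocate(capacity: float, baselines: dict, demands: dict) -> dict:
    """Grant each tenant its demand up to the contracted baseline, then share
    whatever capacity is left among tenants that still want more, in
    proportion to their unmet demand.

    Assumption: sum(baselines.values()) <= capacity (baseline is not oversold).
    """
    grant = {t: min(demands[t], baselines[t]) for t in baselines}
    spare = capacity - sum(grant.values())
    unmet = {t: demands[t] - grant[t] for t in baselines}
    total_unmet = sum(unmet.values())
    if spare > 0 and total_unmet > 0:
        share = min(1.0, spare / total_unmet)  # never grant more than was asked
        for t in grant:
            grant[t] += unmet[t] * share
    return grant

baselines = {"A": 30, "B": 30, "C": 30}  # contracted units on a 100-unit host

# Quiet neighbors: A can burst well beyond its baseline.
print(allocate(100, baselines, {"A": 80, "B": 10, "C": 30}))

# Everyone screaming at once: nobody drops below the contracted 30.
print(allocate(100, baselines, {"A": 100, "B": 100, "C": 100}))
```

The second call is the noisy-building case: the burst shrinks for everybody, but the baseline holds without anyone having to rent the whole host.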

For the record I have to say that I don’t think there is a single public cloud at the moment delivering such fine-grained QoS across all subsystems on rented resources. This is a generic discussion about TCP-clouds and UDP-clouds, and if you interpreted it as a vCloud Vs AWS shootout you are mistaken. In fact I think George gave vCloud too much credit in his blog by associating it with the “traditional” datacenter model. There is a gap between what we can deliver, in terms of non-functional requirements, with a raw vSphere deployment and what we can deliver with a vCloud Director 1.x implementation. I am not hiding this by any means; in fact you can read here (the post but more importantly the comments) what I had to say about this. Having said this, I believe VMware has a vision to fill that gap and create a true TCP-cloud. Last but not least I don’t see why a VMware service provider partner shouldn’t be able to implement a vCloud-powered UDP-cloud if need be.

PaaS and Design for fail?

If I struggle with IaaS clouds (and I do), go figure with PaaS clouds. To me PaaS is all about moving the abstraction to a higher level. IaaS is all about hiding infrastructure details. PaaS is all about hiding infrastructure and middleware details. In a PaaS you can upload your WAR file and that’s it. It’s the PaaS cloud provider that is going to deal with the complexity of setting up, managing and maintaining the middleware stack that can interpret that WAR file (for example). Fundamentally the developer should focus (even more than with IaaS) on the functional requirements of the application and let the cloud provider deal with the non-functional requirements. Last time I checked, HA and DR were still part of the non-functional requirements domain. Note that, ironically, it may be easier for a PaaS cloud provider to build out-of-the-box resiliency given the nature of the interfaces they are exposing. Amazon is halfway there already with their RDS “MySQL as a service”: they already offer automatic failover across Availability Zones and they would just need to extend this failover support across Regions (this would have helped with the recent failure, by the way). So, if my theory is sound, that means that if you are architecting your application for PaaS you shouldn’t design for fail. Upload your WARs, create a db instance on the fly and you are done. The cloud provider will figure out how to fail over to the next server, to the next datacenter room or to another geography should a problem occur at any of those levels.

So why isn’t Amazon offering resiliency and reliability as part of their cloud services in the end?

After all they offer other non-functional capabilities such as automatic scaling of applications through tools such as Autoscaling. So why would Amazon offer auto-scale services and not offer an automatic, agnostic, infrastructure-level recovery service across Availability Zones (or even better across Regions)? Guess what. It is at least two orders of magnitude easier to instantiate a new web server and add an IP to a load balancer than to implement a (reasonably performant) traditional backend database that can geographically fail over without losing transactions in case of a disaster. Dealing with stateless objects is a piece of cake. Try to deal with stateful objects if you can.
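Just to show how little the stateless case asks for, here is a toy sketch of “instantiate a new web server and add an IP to a load balancer”. The class, the threshold and the IP addresses are hypothetical; this is not the Autoscaling API.

```python
class LoadBalancer:
    """Minimal round-robin pool standing in for a real load balancer."""

    def __init__(self, backends):
        self.backends = list(backends)
        self._next = 0

    def add(self, ip: str) -> None:
        self.backends.append(ip)

    def route(self) -> str:
        ip = self.backends[self._next % len(self.backends)]
        self._next += 1
        return ip

def scale_out(lb: LoadBalancer, requests_per_second: float,
              max_rps_per_backend: float, launch) -> None:
    """Add backends until the per-backend load is under the threshold.

    `launch` is any callable that starts a new, identical web server and
    returns its IP. No state is copied, because the servers hold none.
    """
    while requests_per_second / len(lb.backends) > max_rps_per_backend:
        lb.add(launch())

# Hypothetical launcher handing out private IPs 10.0.0.2, 10.0.0.3, ...
new_ips = (f"10.0.0.{n}" for n in range(2, 255))
lb = LoadBalancer(["10.0.0.1"])
scale_out(lb, requests_per_second=250, max_rps_per_backend=100,
          launch=lambda: next(new_ips))
print(lb.backends)
```

There is no equivalent ten-line loop for a database: the new copy would need the data, and the transactions in flight, before it could take over. That is the hard part.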

I am sure Amazon doesn’t think that dealing with autoscaling is something the cloud should do for developers whereas dealing with reliability and DR is something a developer should do on his/her own. What do you think? My speculation is that they are simply not there yet. As simple as that. But don’t be fooled. Amazon is full of smart people and I think they are looking into this as we speak. While we are suggesting (to an elite of programmers) to design for fail, they are thinking about how to auto-recover their infrastructure from a failure (for the masses). I bet we will see more failure recovery services across AZs and Regions, in one form or another, from AWS. I believe they want to implement a TCP-cloud in the long run since the UDP-cloud is not going to serve the majority of the users out there. Mark my words. I’ll have to link to this blog post once this happens and I’ll have to say “I told you so” (I hate this). And that is only going to be a good thing because developers will start again to focus on functionality and the cloud will continue to focus on making sure that functionality is (highly) available.

As I said, just food for thought. If you find definitive answers, please let me know.

Last but not least this is a good time to remind you of the disclaimer of my blog (courtesy of a big copy and paste from Sam Johnston’s blog): “The views expressed on these pages are mine alone and not (necessarily) those of any current, future or former client or employer. As I reserve the right to review my position based on future evidence, they may not even reflect my own views by the time you read them. Protip: If in doubt, ask.”


34 comments to TCP-clouds, UDP-clouds, “design for fail” and AWS

  • Hey,

    Nice Post. I love the UDP/TCP Cloud Analogy.

    The main issue that I see with TCP Clouds is that you can get away with running your current app on them and utilise the features that vendors such as VMware offer (HA/DR), but if you want 100% uptime you are always going to have to ‘design for failure’ at the application Layer.

    If I hired a Chaos Monkey to go and pull out some VMware ESXi Blades, Customers wouldn’t be very happy… Although their VMs would restart, there would be an outage (considering I am not running FT, as my machines are not compatible). However if I designed my application for failure and ran it across different Availability Zones (Blade Chassis/Clusters), they would be much happier. Yes, they would still know there was an outage; however their application would have still been running.

    Just my 2 cents 🙂


    • Massimo

      Steve, thanks for the comment. I agree with what you are saying. My argument is that designing an application for fail (especially transparently with not even a brief outage) is a titanic effort and most may be happy leveraging platform generic services.

  • Resiliency CHECK
    Reliability CHECK
    Recoverability … but we can’t back it up, so we can’t recover it… what now?

    • Massimo


      > Recoverability … but we can’t back it up, so we can’t recover it… what now?

      Did you read the post? That’s one of the reasons why I said vCloud is not yet a full TCP-cloud.


  • Doug B


    Good stuff here. My thought is that this should be handled at both layers, and implementation (or not) depends on the availability needs of an application. To that end, I’ll play somewhat of a Devil’s Advocate.

    If you’re building scaling/elasticity into an app, you should probably be communicating with the platform and it should be (made) fairly straightforward to leverage the DR/HA services provided by that platform. I would agree with you that autoscaling provided by a platform is fairly trivial for stateless workloads and significantly more complex for anything else.

    If you have a simple application that does not require (auto)scaling — think of a traditional app, wrapped in a VM — I would think the (transparent) HA capabilities of the platform (a la VMware HA) would apply. With that level of HA (no application awareness), there would be a brief outage as recovery occurs, and that might be acceptable for most users/applications. I propose that, for those with higher availability needs, a modified, ‘cloud-aware’ application that is designed with the cloud platform in mind may suffer brief performance degradation while it re-protects itself across availability zones or regions, but it would not go offline. I think of this like VMware FT, but at the application layer — VMware FT is an interesting feature, but the complexity involved with what it actually does behind the scenes has got to be significant. (Those developers have my respect for sure)

    Hacking the infrastructure to prevent *any* additional work by the application developers is, in my opinion, adding unnecessary complexity to the entire stack. Is it really too much to ask that developers leverage the services provided by the platform rather than blindly compiling code that meets functional requirements while assuming it will run anywhere? Let’s apply the solution to the places where it makes the most sense.

    My 2 cents,


    • Massimo

      Hi Doug. Thanks for the comment. I believe what you are suggesting is pretty much in line with what VMware has been trying to pitch so far, which doesn’t necessarily mean it’s the right thing to do or that you are following the dogma. It just makes sense to me. When I was writing the post I was thinking that what I was putting down was sort of neglecting the concept of Devops. That’s not actually the case since, as you point out, developers may still be able to interface with the infrastructure by subscribing to services provided by it. Obviously this cannot be done “in the application code” (IMO) but it needs an additional level of abstraction/wrapping that is a sort of bridge between how the application behaves (or has been engineered) and the services the infrastructure publishes. It’s a long way to go but I believe that OVF can be that bridge where developers can describe what they need the infra to deliver. If the infra doesn’t understand this metadata it will just ignore it.


  • DeckerEgo

    I can definitely agree that having to force software design constraints around “design-to-fail” is a step backward. Hardware failures happen, without a doubt, but the right infrastructure should shield the applications from outright hardware failure.

    Application architecture should be centered on scalability (i.e. stateless and asynchronous architectures), infrastructure architecture should be centered on fail-safety (split-brain networks, drive failures, PSU fires).

  • Massimo,
    I strongly agree with your point of view and I would like to add my two cents.

    “Design for fail”/“UDP clouds” have two big issues:

    1) the development cost of a “UDP cloud” safe application is very high, you need a more skilled development team with reliability in their DNA, a longer development process (==time) and more resources for tests/quality purposes.

    2) If you develop a “designed for fail” application/service on a “UDP cloud” you need to know the underlying technology and APIs very well, at the risk of writing very closed and non-portable software. Services like AWS are sold with the promise of big savings on the infrastructure but if I need to rethink all my applications from scratch to adapt them to AWS the only result is to move my money from one pocket to another!

    The sum of these two points reminds me of the mainframe years… are we sure we want to go back to the past?
    Do we need to start talking about public cloud lock-in?


  • Massimo,

    I really enjoyed reading your article; it introduced an interesting philosophy for understanding the implications of developing and deploying applications with failure in mind.

    I think you have coined two new terms with “TCP-cloud” and “UDP-cloud” that we will hear a lot in the future…


  • Massimo,

    I like TCP/UDP comparison but, as a longstanding fan of UDP, I feel the need to challenge you.

    DNS is a fully redundant and reliable service. It works mainly on UDP and falls back by design to TCP if there is a response truncation.
    Guess what? DNS is Internet’s backbone and worked pretty well for a number of years.
    Why shouldn’t UDP Clouds do the same?

    As you point out: it’s about the application level.
    If somebody builds an application on the Cloud the same way they would for a classic continuity-oriented infrastructure, then this person is obviously missing the point. Unfortunately, as I see it, 95% of the applications on the Cloud might be missing the point.

    Performance variance, ephemeral instances, shared resources, design constraints: Cloud Computing (in its purest form as Amazon sells) is a peculiar environment that needs rethinking of application architecture paradigms and shifts accountability on the user.

    All we have to expect from IaaS services is scalability and an increasing level of openness and interoperability.

    Eventually applications will become more distributed and more self-healing. It’s a trend, it’s happening now and will gradually absorb any additional cost (e.g. distributed, portable, stateless apps), such a model might become mainstream. Enrico is right just for now, but not for long.

    BTW: I did not really understand the part regarding noSQL. What did G. Reese want to say? For me it was a bit off-course; no noSQL I know guarantees either consistency or availability per se, they always need special attention or an IDA file-system.

    Regarding noSQL awareness: I work with some customers that use noSQL tools and every one of them knows very well the implications of the CAP theorem. It’s not a big crowd but it’s encouraging.


    Gabriele B

    • Massimo

      Gabriele, thanks for the comment. Sure thing, you have the right to disagree/challenge.

      I don’t know if we can make a good parallel with DNS (to applications). DNS is a bit of a weird beast. In one way it is a stateful object for which someone found a way to create a nice distributed architecture. However it also imposes limitations and issues due to its heavily caching-dependent algorithms. But this is not the point. There are hundreds of UDP applications that run just fine… my question is: what about the other thousands of TCP applications?

      To my “philosophy Vs necessity” question you are basically answering it’s a philosophy and applications in the cloud should take into account the volatility of the resources there. We obviously disagree on that point but that’s fine. No one has the magic crystal ball to see what will happen in the future.

      I am always doubtful when dealing with re-architecting / re-writing applications. Perhaps it’s because I have been through the Xeon Vs Itanium discussions many years ago. I am not saying that the re-architecting of the applications to fit a UDP cloud model is going to end-up like re-writing applications for Itanium (not at all), however this is not a matter that can be over-simplified with a “it is happening”. My opinion.

      As far as the NoSQL field experience goes, I guess that it really depends on the point of view. This reminds me of a conversation I had with a partner during my tenure at IBM. During an event I said something along the lines of “we don’t see RedHat/Xen a lot as far as virtualization is concerned”. He approached me saying I was wrong and that this is what he was doing all day long. When I asked what his job was it was something like “RedHat virtualization practice leader” for that partner. I made a bold statement that NoSQL-like technologies are not widespread in the field based on what I have seen. If you are very involved in that space you may have a different perception.

      Thanks for commenting.


  • Massimo.
    Thanks for answering.
    Just to point out that I too think noSQL is NOT a mainstream technology. What I wanted to say is that those who need it know very well how to use it (as is usual with early adopters) and that we find new practical uses day after day (complementary to common DBs). I am not specifically dedicated to noSQL, it’s the industry which is starting to use it.

    On “UDP”, my rationale is that such distributed architectures will become progressively easier to tackle and will find their place in the IT panorama. We have seen this with SOA: these are cases that fit greenfield projects, no “oude koeien” (a colorful Flemish way to call legacy stuff).
    I have always been convinced that, in perspective, a public Cloud is an enabler for distributed designs because of elasticity at relatively low investment (where relative means: you still need the skills; some consider them “sunk costs”, I don’t).

    Whatever. I won’t bother you any more with my bla-blas. 🙂

    On the specific case of AWS, after reading some sharp comments on the post of G. Reese, I took the time to peruse Amazon’s SLA and their FAQs, and they failed. It’s confusing how they functionally presented Availability Zones: in good faith someone might have developed a system thinking that AWS would guarantee continuity across AZs. Expect them to change something in the coming days.

  • Lance Berc

    It’s a matter of cost and complexity versus perceived need. Leaning on facilities for availability in underlying infrastructure greatly simplifies application life-cycles, lowering development, test, and maintenance costs and increasing business agility. Distributed systems that can pass ChaosMonkey-style testing are very complex and the testing is very expensive, and they’re generally one-off applications – DNS cited earlier is a good example, as is AD. Yet we see them fail, too – usually when a botched complex configuration doesn’t come to light until something distantly related fails. In addition, those sorts of systems also tend to require skilled priests to feed and care for the deployment.

    (If you think it’s easy, Nominum is hiring more people to work on the next-generation BIND system. I’d be happy to forward some resumes.)

    When faced with the costs associated with such systems it’s not surprising that those paying initially say there is no need for such complexity. It’s only through the losses associated with failure that perceptions change. The costs to ensure real Business Continuity are currently so high that most companies require a CEO- or even Board of Directors-level mandate before embarking on an initiative.

    What’s needed is a level of software that’s above current infrastructure and below today’s applications that provides scaled distributed data access and persistence while easing development, test, and maintenance burdens. Relational database answers like GoldenGate are prohibitively expensive for many systems; NoSQL by itself isn’t an answer, nor are sharding key/value stores. So I think systems built on technologies like Gemfire have a very bright future if the Gemfire layer can be made general enough to support a wide variety of use cases – it has the right primitives for scale, performance, distribution, and persistence with coherency rules one can actually understand. This is a facility above IaaS that should help make PaaS a legitimate layer to develop on.

    Failure happens. People that rely on multiple in-memory copies for persistence are just delaying their day of reckoning, and in the end there is no real replacement for streaming transaction logs to tape. The grey-hairs embed this knowledge in the mantra, “Amateurs talk about backup; professionals talk about recovery.”

    But maybe it doesn’t matter – in these days where people become billionaires before making a profit, isn’t occasional data loss somewhat over valued?


  • […] So clearly we should be designing our apps to fail? That’s easy to say but not so easy to square with the basic idea that we can have cheap and flexible apps for short periods. A much more radical approach as to exactly what technology we are using and exactly what that means in terms of expectations and options is, I believe, called for. At the root of this is the difference between TCP-based cloud services and UDP-based cloud services, a little understood topic, which in this case can be summarised as AWS uses UDP as a basis for its clouds and most IT departments have an expectation that the service level they will receive is that of a TCP cloud.  Some people think that this is a controversial argument, but at its root is a very simple set of differences starting with TCP using connection oriented, and UDP being connectionless. This ying yang occurs at every level of the two approaches, and hopefully I have now interested you enough to go to the lively and interesting blog of Massimo on IT 2.0, and next generation IT infrastructures in which he discusses this topic. […]

  • Jacques Talbot

    I tentatively propose that the appropriate motto is “design for some failures”.
    In the sense that the infra should take care of some failures and the application of some other failures.
    Let’s take Azure as a model for a change from the Amazon obsession of the last few days.
    As clarified by David Chappell, there is a fundamental programming model assumption in the Azure PaaS:
    “An application that follows the Windows Azure programming model must be built using roles, and it must run two or more instances of each of those roles. It must also behave correctly when any of those role instances fails.”
    So, for the Web and Worker roles, you MUST have a cluster of VMs, and its minimum size is 2. Moreover, the PaaS, under the covers, has the right to kill one of the instances (to patch the OS for example).
    This forces the programmer to think about state in a different way, and more or less requires a reliable cache service.
    On the other hand, the data tier is (supposedly) reliable, and the application is not supposed to assume that data come and go too often (RTO and RPO permitting).
    So perhaps the Microsoft Azure philosophy is: design around web and business tier failures, trust the data tier.
    My 2 cents …

    • Massimo

      Thanks Jacques for chiming in. The more I think about this the more I am convinced that all this discussion should be around stateful services (e.g. databases). Stateless services are a no-brainer and we have been using the 2+ instances of the front-end for how long… since the inception of the web? If we stick to a pure 2 or 3 tier web architecture let’s just agree we don’t need a TCP cloud for the front-end. Easy. The problem, as usual, is with the backend and with the data. I know very little about Azure and I have never actually played with it but my feeling is that the way they handle SQL Azure is not very different from how Amazon treats RDS. As far as I understand they are both clustered instances of a database (MS SQL Server and MySQL) whose (db) interfaces are exposed to the consumer. I believe Amazon have a clear description of the RDS implementation. I am not sure if MS has one but my speculation is that it’s an MSCS-clustered SQL database. In a way this is a TCP-cloud (a PaaS cloud or better a DBaaS cloud).

      So in a way, if your applications adhere to these patterns (and these technologies) both Amazon and Microsoft do provide a TCP-cloud type of service (where it’s needed). My rant was about those applications that do not adhere to this pattern (web app) nor these technologies (MySQL, MSSQL). Not only do they not have a solution for “legacy” applications, but even for web applications the devil might be in the details. Forget about the bashing around stability (it’s not the point), this article touches on a few reasons why, if you are used to developing/deploying a web application in house, you may find substantial differences in a PaaS cloud environment:
      I can imagine a great number of customers I have been working with finding it very difficult to lose, all of a sudden, full control over the OS/Middleware layer that is backing their web applications. The devil is always in the details.
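
A back-of-the-envelope sketch of why the “two or more instances” rule quoted above works for stateless tiers. It assumes every instance has the same availability and that instances fail independently of each other, which is an optimistic simplification (a region-wide outage breaks it). The function name and the 99% figure are illustrative only and do not come from Azure or AWS documentation.

```python
def cluster_availability(instance_availability: float, instances: int) -> float:
    """Probability that at least one instance is up.

    Assumes identical instances that fail independently (a simplification).
    """
    if not 0.0 <= instance_availability <= 1.0:
        raise ValueError("instance_availability must be between 0 and 1")
    if instances < 1:
        raise ValueError("instances must be at least 1")
    # The tier is down only when every single instance is down at once.
    all_down = (1.0 - instance_availability) ** instances
    return 1.0 - all_down


if __name__ == "__main__":
    # Hypothetical 99% availability per instance.
    for n in (1, 2, 3):
        print(f"{n} instance(s): {cluster_availability(0.99, n):.6f}")
```

The same arithmetic says nothing about the data tier: redundant copies of a stateless front-end are interchangeable, while a database also has to keep its copies consistent, which is where the TCP-cloud versus UDP-cloud question really bites.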


  • Ciao Massimo,

    As many others have said, I love the TCP/UDP cloud analogy.

    But my concern is that we will still have to design for some degree of ‘fail’ even if the underlying architecture is already providing appropriate resiliency.

    I give you a fairly recent example:

    In that case, they were providing redundancy at the infrastructure level, but, the ‘fail’ scenario wasn’t a graceful one, or one that you design your infrastructure for (well, sometimes you can’t even design around similar scenarios). The problem is that if you don’t design (or re-engineer, or re-host) your application for failure, it will still be tied to the highest level of protection that your infrastructure can give you, and sometimes, this is simply not enough.

    Just my 0.02€ 🙂


    • Massimo

      mh… I tend to disagree with this view. I am in favor of discussing whether more resiliency should be built into the application vs. the infrastructure. However I am not in favor of creating resiliency at both layers (at least not to the level that creates overlapping efforts). If I have to build resiliency into the application to overcome problems that a resilient infrastructure (TCP-cloud) may experience then I’d just use a NON-resilient infrastructure (UDP-cloud) in the first place. I have seen a Parallel Sysplex going down entirely, crashing all applications running on it. Yet this doesn’t mean mainframe programmers build resiliency into the application to overcome events like this. Problems happen. Sure, if you build resiliency into the app and deploy it onto a resilient infrastructure you drastically diminish the chance of an outage… but on the other hand you increase the costs exponentially.


  • […] Massimo Re Ferrè’s fascinating post, TCP-clouds, UDP-clouds, “design for fail” and AWS […]

  • This thread was a very interesting start of my Saturday morning – thanks! Even though I don’t fully grasp all of the low end discussions I would like to add some thoughts to the discussion:

    1) I think it is crucial to expose a cost comparison between a 100% available solution and a solution with 99.x% availability. In my opinion most non-functional requirements in tenders/RFPs demand the 100% availability approach. This is of course something you would expect from an application, but we all know that this is complex to achieve – and comes with a very hefty price tag.

    I urge all IT suppliers to ask your customer (Note! make sure this is someone from the business side that is responsible for picking up the bill – not an IT person) if they are willing to pay a 2-5 times higher price for the solution to support the 100% availability requirement or if they can settle for a lower availability guarantee with a significantly lower cost. Let the client evaluate the cost/benefit based on realistic risk assumptions! OK, so AWS broke down once but honestly how often does this occur? What impact do these failures have on your business compared to the costs of trying to avoid them? (Is there really any 100% guarantee IRL?)

    2) In the best of all worlds the programmers would care about the non-functional requirements… I would say that 90%+ of the programmers do not have a clue! They do not understand infrastructure, virtualization, high availability, fail-over, recovery, response-times, network latency issues etc. They are focused on functional requirements.

    3) I believe there is a need for the cloud providers (public and private) to offer new innovative HA solutions using virtualization techniques, SAN replication, load balancing networks etc. to replace the traditional way of building HA solutions in the middleware layer using clustering. This will remove a lot of complexity in the PaaS layer. Does anyone provide this today?
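
The cost/benefit question in point 1) can be put into rough numbers. The sketch below turns an availability percentage into expected downtime per year and compares the expected outage cost with the premium for a more available design. All names and figures (hourly cost of downtime, prices) are hypothetical placeholders that the business side would have to supply; nothing here reflects real AWS pricing or SLAs.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years


def downtime_hours_per_year(availability_percent: float) -> float:
    """Expected hours of downtime per year for a given availability."""
    if not 0.0 <= availability_percent <= 100.0:
        raise ValueError("availability_percent must be between 0 and 100")
    return HOURS_PER_YEAR * (1.0 - availability_percent / 100.0)


def expected_outage_cost(availability_percent: float, cost_per_hour_down: float) -> float:
    """Expected yearly cost of downtime, given an hourly cost of being down."""
    return downtime_hours_per_year(availability_percent) * cost_per_hour_down


def premium_is_justified(
    current_availability: float,
    target_availability: float,
    cost_per_hour_down: float,
    yearly_premium: float,
) -> bool:
    """True when the avoided outage cost exceeds the extra yearly price."""
    avoided = expected_outage_cost(current_availability, cost_per_hour_down) - \
        expected_outage_cost(target_availability, cost_per_hour_down)
    return avoided > yearly_premium


if __name__ == "__main__":
    # Hypothetical inputs: 99.9% today, 99.99% target, 1,000 per hour of downtime.
    print(downtime_hours_per_year(99.9))
    print(premium_is_justified(99.9, 99.99, 1_000.0, 50_000.0))
```

With these made-up inputs the avoided downtime is a handful of hours per year, so a large premium only pays off when an hour of downtime is very expensive. That is exactly the conversation to have with whoever picks up the bill.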


  • Massimiliano

    Hello Massimo,
    I think you touched a very interesting point, that will spark (and already has!) a rich debate.
    The answer I would give would be: “it depends on the type of application”… safest position 🙂

    Moving to the cloud computing model means moving to a new business model; as such, the applications might need re-engineering simply for business reasons. So in such a case why not add an extra effort and make them “TCP-like”, so that they gracefully survive a UDP-cloud failure (see here an example)?

    Thinking at a high level I would say that multi-tiered applications might hide intricacies that could lead to faults in case of a simple port to the cloud. On the other hand, could we say that apps adhering to the SOA model are better candidates for the cloud and as such easier to re-engineer than non-SOA apps? Probably yes.

    I might now be under the ‘sceptical’ influence of my home reading, but if my core business had to move to the cloud I would seriously consider making my apps robust and not relying too much on the cloud service, simply to avoid falling into the ‘Platonic fallacy’ of immature standards and because a Black Swan might just be lurking in there.

    Again thanks for your interesting blog.

    So per VMW sales – buy vCloud and you have better availability?
    Well, this is just lies, isn’t it? What if the vShield Edge device fails (or there is any related host failure)?
    It has no backup until ~5 minutes later; until then all VMs on all other hosts that rely on vShield Edge fail for 5 minutes.
    Those applications might not come up if TCP sessions are not re-initiated…
    How do these facts not tell you quite the opposite about vCloud (which relies heavily on vShield Edge), that it is NOT highly available at all?

    • Massimo

      This sounds like the junk arguments our “friend” Koren Lev would use. Is that you or a close friend that stole your junk Koren? 5 minutes to restart an Edge appliance? Excuse me, on which planet?

  • Will

    I’m dumb so can you explain this sentence? I do not know what it means.

    I hear ‘LOTS’ of bloggers use this sentence over and over-

    “To me PaaS is all about moving the level of abstraction at a higher level”

  • Jon

    Hi Massimo

    Just wondering if I could use your TCP and UDP comparison image in an assignment I am doing (i.e., non-commercial)?


  • […] While OpenStack is primed to transform enterprise IT, enterprises still have a lot of questions today. One popular topic is how OpenStack compares with VMware. Perhaps the best analogy that I have heard to explain OpenStack to vSphere-skilled staff is that of cattle and pets. vSphere servers are likened to pets that are given names, uniquely cared for and nursed back to health when sick. OpenStack servers are likened to cattle, which get random identification numbers, are cared for as a group and are replaced when ill. Figure 1 below shows an excellent slide from a Gavin McCance presentation. For network technologists, another good analogy I have seen compares OpenStack to UDP and VMware to TCP. […]

  • […] the difference? Well, the discussion was sparked by a blog by Massimo on IT 2.0 following an outage in April on an entire Amazon AWS region. In his blog Massimo argues that AWS […]

  • […] this topic and I am not going to repeat myself. If interested, you can read some of those thoughts here, here and […]
