I have just heard of a massive outage that a localized IaaS Cloud Service Provider is experiencing: they have been down (at the time I am drafting this short blog post) 4 days and counting. By the time I get to publish this, they may have been down even longer or… God knows.
Apparently the issue was a firmware bug in the technology stack they use: an upgrade brought the whole site to its knees.
I am not going to mention who they are (nor which vendor's product caused all this), because this could literally be any CSP and any vendor on the planet. Naming names wouldn't add anything to my rant below.
There is a lot to learn from this (and from other similar experiences we have seen throughout the last few years). I thought this was a good opportunity to share some thoughts.
No worries, no one will remember
Yes, there will be customers (more on them later) who, in what I think is an irrational reaction, will leave the service and jump to an alternative. The fact is that sh*t happens, and it will happen everywhere.
Do you remember, 4 years ago, when an entire region of the leader in public clouds was down for (almost?) a week? They seemed to be toast, and private cloud pundits were all over it. The public cloud was “dead”. So they said.
Fast forward 4 years: not only do they keep announcing record revenues and exceptional growth, but no one even remembers what happened to them back then.
Sorry, but you have no clue what cloud really is
I am super respectful of the comments I am reading from devastated users whose businesses are being heavily impacted by this outage. Having said that, I can't help thinking that they (or some of them) have no idea what the cloud really is. When you hear someone say “I moved to the cloud because I didn't want to experience downtime”, it is fairly clear to me that they either have been heavily misinformed or have misunderstood what the benefits of an (IaaS) public cloud are. If there is one reason NOT to move to an (IaaS) cloud, it is to get higher up-time than you could get in your own data center.
Let alone 100% up-time.
This would deserve a blog post of its own but, long story short, there are a couple of public cloud DNA types out there. One type essentially puts the focus entirely on the control plane and provides little to no guarantees on the data plane. These cloud providers will tell you that by design your instances (i.e. the data plane) may have issues at any point in time but you will always be able to spin up new instances (through the control plane) should you need to. These are what I refer to as UDP clouds. This is the cattle world.
Another type essentially puts the focus on the data plane while providing a robust enough control plane. These cloud providers will tell you that your individual instances will be guaranteed to be up and running (with an SLA). These are what I refer to as TCP clouds. This is the pets world.
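To make the cattle-world recovery model concrete, here is a minimal, purely illustrative sketch of the loop a UDP-cloud tenant is expected to run themselves. Everything here is hypothetical (the `ControlPlane` class, `spin_up`, `reconcile` and `is_healthy` are stand-ins, not any real CSP's API): the point is only that instances are disposable and replacement happens through the control plane.

```python
import itertools

class ControlPlane:
    """Toy stand-in for a UDP cloud's control plane. The one thing the
    provider guarantees is that you can always spin up replacement
    instances; individual instances (the data plane) carry no SLA."""

    def __init__(self):
        self._ids = itertools.count(1)

    def spin_up(self):
        # In a real cloud this would be an API call to launch an instance.
        return f"instance-{next(self._ids)}"

def reconcile(control_plane, fleet, is_healthy, desired=3):
    """Cattle-style recovery: drop unhealthy instances and replace them
    through the control plane until the fleet is back at desired size."""
    survivors = [i for i in fleet if is_healthy(i)]
    while len(survivors) < desired:
        survivors.append(control_plane.spin_up())
    return survivors

cp = ControlPlane()
fleet = reconcile(cp, [], is_healthy=lambda i: True)   # bootstrap a fleet of 3
dead = fleet[0]                                        # simulate one instance dying
fleet = reconcile(cp, fleet, is_healthy=lambda i: i != dead)
print(fleet)  # the dead instance is gone, a fresh one took its place
```

In the pets world, by contrast, `dead = fleet[0]` is the scenario the provider's SLA is supposed to prevent in the first place, which is exactly why an outage there hurts so much more.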
The outage that triggered this post is occurring at a CSP that has taken a TCP approach. So why did they fail? The reason is fairly simple, really: there is no magic in the (public) cloud. Technology-wise, what a TCP cloud provider does is leverage Enterprise-class technologies (also available as products you could deploy on-prem) to build and deliver what you may consider an Enterprise-class public cloud IaaS service.
In other words, when it comes to robustness, a TCP cloud isn't better than your on-prem setup (assuming like-for-like designs). A UDP cloud, for its part, is designed on purpose to be intrinsically less reliable than your on-prem setup.
To put it (yet) another way: a properly designed public cloud is not intrinsically more reliable than a properly designed Enterprise data center (assuming like-for-like IT budgets).
That is because sh*t happens, which leads me to the next section.
Sh*t happens
Yes. It happens. This was not the first time, and it will not be the last. Rest assured.
Sh*t can happen at the control plane level and/or at the data plane level. Hidden software bugs, hardware bugs and bad operational practices are all waiting to surface as a catastrophic cloud failure.
I remember, many years ago, when I was working for a big bank, seeing first-hand a catastrophic infrastructure failure (one that lasted many days) due to, no kidding, an air conditioning failure. Long story short: the air conditioning system broke (very badly and unexpectedly) and all the servers started to shut themselves down, because the facility was approaching temperatures in the order of 55+ degrees Celsius (those servers were instrumented to auto shut down to avoid component damage). It took literally days (and a lot of effort) to get back to normal operations.
This is just an example (and an extreme one, I will admit), but who has never seen a catastrophic infrastructure failure during, say, a storage upgrade? Those big, monolithic enterprise storage servers (regularly found in traditional data centers) do not do much to create fault domains, so when a problem happens (and it happens!), it's usually a problem with a large blast radius and a tremendous impact. And no, I don't think distributed storage architectures solve this problem. IMO, what you gain in having more shielded fault domains, you lose in data partitioning issues (which you don't usually have with monolithic storage). Oh well, you have to choose your poison, I guess.
All of this happens inside your very own data center as well as inside the public cloud's data centers. The only difference is that when each of you INDIVIDUALLY experiences these problems in your own data center, you don't make the headlines. When a CSP experiences these problems and brings all of you down together at the same time, it does indeed make the headlines.
If I were a CSP I would not sleep at night, which nicely leads into my next point.
How can an IaaS cloud be a commodity?
This is something that got me thinking. We often refer to IaaS as a commodity service, and we always applaud when we see 50% price drops. That is, until we realize that when sh*t happens… it happens in a way that makes you wish you had paid more to have a better chance of staying up.
This applies, though, to those users who NEED to rely on data plane resiliency because their apps have no option other than relying on it. The reality for these users is that they won't get a 100% uptime guarantee (because, repeat after me, sh*t happens), but they should look for CSPs with properly designed data plane resiliency, because they are betting on it and it can mitigate (though not entirely resolve) those latent downtime issues.
And if I were a CSP and this were truly a commodity market, I would ask myself why I am in a non-lucrative business where I have little to gain and pretty much everything to lose. Perhaps the answer is that this is not a commodity business, and there is more than a “little” to gain, despite what you may think?
In conclusion, I think what you need to take home from this story is that there is no such thing as 100% uptime. Hope for the best but be prepared for the worst. Have a contingency plan. You need to remember that, on the other side of the fence, there are professionals who know what they are doing (most of the time better than you, because that is their core business). Yet they are not magicians with a magic wand. You may (and will!) go down.
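A contingency plan can start as simply as never hard-coding a single endpoint. Here is a minimal sketch of that idea (the endpoint URLs and the `probe` callable are hypothetical placeholders, not a recommendation for any specific CSP or health-check protocol):

```python
def pick_endpoint(endpoints, probe):
    """Walk an ordered list of endpoints (primary first, then fallbacks,
    e.g. another region or another provider) and return the first one
    that answers a health probe. Raise if everything is down -- at which
    point the humans in your contingency plan take over."""
    for url in endpoints:
        try:
            if probe(url):
                return url
        except Exception:
            continue  # treat probe errors the same as an unhealthy answer
    raise RuntimeError("all endpoints down -- escalate to the manual runbook")

# Toy probe: pretend the primary is mid-outage and the fallback is fine.
status = {"https://primary.example": False, "https://fallback.example": True}
chosen = pick_endpoint(
    ["https://primary.example", "https://fallback.example"],
    probe=lambda url: status[url],
)
print(chosen)  # the fallback wins while the primary is down
```

The ordering of the list is the plan: it forces you to decide, before the outage, where your traffic goes when (not if) the primary disappears.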
Last but not least, I am not saying this to discount the value of a TCP cloud. The way I look at it, you can choose to fly on an aircraft that has a single engine (because “well, if the plane goes down we will ask another one to take off, what's the problem?”) or you can choose to fly on an aircraft that has redundant engines.
Not sure if you noticed, but aircraft with redundant engines do go down too. Get over it.
I have got to go now, they are boarding my flight (yes, I am nervous when I fly, even on fully redundant aircraft, but getting to London by boat wasn't really an option).