The 93.000 Firewall Rules Problem and Why Cloud is Not Just Orchestration

A few days ago I was in a very interesting meeting with a big Service Provider in Europe and I heard a lot of interesting comments. I’d like to quote the best one I heard, which was “Oh a portal? Oh not another one… we have many of them already!”, but that would open up a different can of worms so I am not going to talk about it now. What I am going to talk about relates to another comment someone made in the middle of the meeting: “…there is a firewall with 93.000 rules configured“.

I can’t claim to be a security expert by any stretch, however 93.000 rules sound like a lot to me. This was confirmed by someone with a lot of background in this area saying that “… they are a lot but the record is a Cisco device (somewhere on this earth) with 750.000 rules“. Suddenly someone else jumped into the discussion asking “…and what happens when you fat finger rule #457.986?“. I thought this was a joke (however I am not sure).

Before we go any further, let’s try to dump, in a picture, the layout of this scenario (at a very high level):

Basically the idea, pretty common these days, is that you have a multi-tenant virtual infrastructure with a number of VMs running on top of it. These VMs belong to different customers and, by means of standard layer 2 segregation (VLANs if you will), you keep them separate. The big (BIG) firewall at the bottom of the picture is the one holding the 93.000 rules that govern how these workloads talk to each other. By the way, this doesn’t appear obvious in the picture but each customer could (and will!) have more than one single VLAN because that’s how it works in this world (see below). So 93.000 firewall rules are just the tip of the iceberg… there are other problems these Service Providers are dealing with, such as the sprawl of VLANs – along with all sorts of issues associated with that.

So why is this a problem for an IaaS cloud? I think there are at least a couple of dimensions to this problem.

Manageability, serviceability and scalability

The first dimension relates to “how on earth can you deal with such a beast?”. How do you manage this firewall and, even more importantly, how do you troubleshoot it? That’s why I am not sure that the person who referred to the “fat finger” problem was really joking. Again, my background is not security so bear with me and please advise where I am missing something. However, whenever I mention situations like these to people who do have a security background their typical reaction is:

  1. they laugh first….
  2. …and scratch their head then.

So there must be something wrong somewhere, I think.

For the sake of clarity, I am not bashing the firewall administrator that configured 93.000 rules in that box. I think the problem is how networks (and the related security) have worked until now and the associated “best practices” we built over the last 10 years. One could write a book on this but, in a nutshell, the way it works is that, to secure “services”, you need to create layer 2 domains (aka VLANs) that you connect by means of a firewall. Depending on what you need, you may have to create subnet-based rules and/or IP-based rules. Take this approach and apply it to a Service Provider with thousands of customers, each with a certain number of “services” deployed, and before you realize what’s going on you get to thousands of firewall rules in the blink of an eye.
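To see how quickly this compounds, here is a back-of-the-envelope sketch (the numbers are purely illustrative assumptions on my part, not figures from the SP in question):

```python
# Rough model: with subnet- and IP-based rules, the rule count grows
# roughly with customers x services x peers each service must reach.
# All numbers below are made up for illustration.

def rule_count(customers, services_per_customer, peers_per_service):
    """One allow rule per (service, peer) pair, for every customer."""
    return customers * services_per_customer * peers_per_service

# A provider with 2.000 tenants, ~5 services each and ~9 rules per
# service already lands in the 93.000-rule neighborhood:
print(rule_count(2000, 5, 9))  # -> 90000
```

Nothing sophisticated, but it shows why “thousands of customers” and “subnet-based and/or IP-based rules” multiply into five- or six-digit rule bases almost by construction.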

End-user self-service

The other dimension of the problem we are discussing strictly pertains to self-service, a concept of paramount importance in all cloud related discussions. This is a pattern I have seen over and over again at every single Service Provider I have met so far: the use of a central monolithic firewall to serve multiple different tenants doesn’t allow the SP to (easily) create a self-service experience for the user. Why? Simply because the more complex and critical the object (whose functionality you want to expose to the end-user) becomes, the more complex and critical the tool that mediates access to it needs to be. You could solve this problem by using a dedicated physical firewall for each of the customers the SP is hosting. That would reduce the complexity and the criticality to a level at which the effort for the SP would be as low as telling the customer “Here is how to access the device as root“. Between the lines you could read “Screw it up and only your own organization will be screwed up, I don’t care“. It sounds great but this obviously isn’t very scalable nor manageable. Do you deploy a new physical firewall every time you get a new customer? Not the promise of cloud, I’d say, if cloud is really about agility, scalability, pay-per-use and the list of attributes goes on. These attributes have, in fact, very little to do with the option of deploying a new physical device on-the-fly when needed.

So what did all these SPs do when they stood up their so called… “clouds“? They created a portal (probably one of those many we were talking about at the beginning) where they exposed some self-service capabilities for basic and simple stuff (such as VM provisioning) and they implemented a ticket system for more advanced stuff (such as creating network security rules for the workloads they were provisioning). Not very different from how you’d do it with a traditional hosting solution, you may think. Well, that’s one of the reasons many people refer to this practice as “lipstick on a pig” (i.e. take a hosting solution, put a cloud label on it and sell it as if it was a cloud).

The role of the orchestrator

I always say that orchestration is not cloud but cloud needs orchestration. Will orchestration alone help solve the problems we are discussing here? I don’t personally think so. I see orchestrators more like tools that are supposed to solve operational issues (especially at the level of scale a cloud infrastructure requires), not like tools that can fix broken architectures. If you take a stone and clean it, it doesn’t become a gold nugget automagically. It becomes a cleaned stone. The same goes for cloud. If you take a “junk architecture” and you orchestrate it, does it become a “great architecture“? No, it becomes an “orchestrated junk architecture“. Better than having to deal with it manually… but still “junk“.

Don’t get me wrong, I do think that orchestration is key and you can’t have a cloud without (at least a certain degree of) orchestration. However don’t think that a properly architected cloud is just your “legacy” stuff with an additional kilo of orchestration workflows and a nice new portal (“Oh a portal? Oh not another one… we have many of them already!“).

Is there a way out?

Yes, there is (I think). I believe there is a shared feeling in the industry, at this point, that an architecture like the one shown in the picture below is the way forward. So what is that vFW (aka virtual Firewall) below? At VMware we call it vShield Edge. Other vendors may call it differently. Other vendors don’t have anything like this today (so expect some level of bashing from their sales reps in the field) but they may end up having it down the road (expect some level of embarrassment from the same sales reps that bashed this approach in the past). We started shipping vShield Edge less than a year ago but we have seen a huge number of people experimenting with an approach like this for years. Just recently I met another SP that said that 2 years ago they started looking into something like this using virtual appliances from Vyatta. Just recently I wrote about a small business partner getting into the “cloud” from a provider perspective and using the same model/architecture without anyone telling them this was “the right” model: they figured it out themselves based on the challenges they were dealing with! And if this isn’t enough to convince you that there is a trend here, look at what Amazon started to pitch a couple of weeks ago.

So what’s so neat about this model? The idea is pretty simple: instead of using a monolithic physical firewall outside of the virtual infrastructure domain, you can deploy different virtualization-aware firewalls that essentially back the same VLAN(s) but do so in a more flexible and agile way. Other than simplifying the configuration of a single object (the “93.000 rules” problem) you also gain easy self-service through administration delegation. As we said at the beginning, it is difficult to get controlled access to a shared device. However, if you create a virtual device that is only supposed to “rule” access to given VLANs dedicated to a customer… you can easily delegate full access for that virtual device to that specific customer. This is at the core of the vCloud Director self-service capabilities. In many cases you’d still want to have the traditional physical device for data center level protection against external attacks and for advanced firewall features that these virtual firewalls may be missing today. However, the complexity of its configuration would be drastically reduced because the workload security rules would be managed directly on the virtual firewall devices.
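A minimal sketch of the delegation idea – hypothetical Python classes I made up, not any actual vShield Edge API – where each tenant owns its own edge device, so the rule set and the blast radius of a fat-fingered change are scoped to that tenant:

```python
# Hypothetical model: one small firewall object per tenant instead of
# one monolithic rule base shared by everybody.

class TenantEdgeFirewall:
    def __init__(self, tenant, vlans):
        self.tenant = tenant       # owner who gets "root" on this device
        self.vlans = set(vlans)    # only this tenant's layer 2 segments
        self.rules = []

    def add_rule(self, rule, actor):
        # Delegation check: only the owning tenant manages its own rules.
        if actor != self.tenant:
            raise PermissionError(
                f"{actor} cannot manage {self.tenant}'s firewall")
        self.rules.append(rule)

fw = TenantEdgeFirewall("acme", vlans=[101, 102])
fw.add_rule("allow tcp/443 from any to 10.1.0.0/24", actor="acme")  # fine
# fw.add_rule("...", actor="globex")  # would raise PermissionError
```

“Screw it up and only your own organization will be screwed up” falls out of the model for free: a mistake in one tenant’s object simply cannot touch another tenant’s rules.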

Can we do even better?

We could do something better, yes! What we have been talking about so far is, basically, all about keeping the very same number of VLANs and firewall rules… and spreading these rules across virtual firewalls. This solves a lot of problems when it comes to self-service, for example (delegation of the entire device), and scalability (just deploy another virtual appliance when there is a new customer), but it doesn’t by itself solve the problem of VLAN sprawl and the 93.000 firewall rules (although they are now segmented into different, dedicated security domains for each customer). VMware has other technologies that may help address these other problems.

The first one is called vCloud Director Network Isolation (vCDNI) in vCloud parlance or vShield PortGroup Isolation (PGI) in vShield parlance. It is, basically, a technology that allows you to virtualize a VLAN. This allows different customers to be assigned dedicated vDS PortGroups that represent separate layer 2 domains… yet share the same VLAN ID. We use a technique called MAC-in-MAC to implement this. Kamau just posted a very interesting blog on how this works. You can read more here if you are interested. This technology is already available and fully integrated in vCloud Director so you can use it today if you want to.
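To give an intuition of the MAC-in-MAC idea, here is a toy sketch (this is not the actual vCDNI wire format, just the general encapsulation concept): the original Ethernet frame is wrapped in an outer MAC header carrying a per-tenant network ID, so many isolated layer 2 domains can share one physical VLAN.

```python
# Toy encapsulation: outer dst MAC + outer src MAC + 2-byte network ID
# + original frame. Field sizes and layout are illustrative only.

def encapsulate(inner_frame: bytes, outer_src: bytes, outer_dst: bytes,
                network_id: int) -> bytes:
    return outer_dst + outer_src + network_id.to_bytes(2, "big") + inner_frame

def decapsulate(frame: bytes):
    outer_dst, outer_src = frame[:6], frame[6:12]
    network_id = int.from_bytes(frame[12:14], "big")
    # A receiving host would only deliver the inner frame to PortGroups
    # whose network ID matches -- that's the isolation boundary.
    return network_id, frame[14:]

inner = b"\x00" * 14 + b"payload"               # fake inner Ethernet frame
wrapped = encapsulate(inner, b"\xaa" * 6, b"\xbb" * 6, network_id=42)
nid, unwrapped = decapsulate(wrapped)
assert nid == 42 and unwrapped == inner
```

The key point is that the tenant-separating identifier lives in the encapsulation header, not in the (scarce, 12-bit) VLAN ID space of the physical fabric.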

There is another elegant method to solve the VLAN sprawl problem and, more specifically, the proliferation of rules you have to create in the firewall(s). This can be achieved with another vShield technology called vShield App. Think of vShield App as a vDS port-based firewall where you can say “this vNic can talk to this other vNic over this particular port“. The vNics in question are connected to the same vDS PortGroup (i.e. in essence one single layer 2 domain). So imagine having a single network segment where you can create rules that mimic the deployment of a DMZ, an Application security zone, a Database security zone, etc. Instead of using three VLANs (in this example) you could use one and have this segmentation happen at the vDS layer via vShield App rules. The cool thing about App, in my opinion at least, is that it supports both the typical 5-tuple firewall rules and rules based on traditional vSphere constructs such as datacenters, clusters, resource pools and the like. So you can say that all VMs in this “container” can only communicate with VMs in this other “container” over a specific port. This way you can change IPs, add/remove VMs from the containers and the security policies will still apply, simplifying and reducing the “93.000 rules problem“. For the sake of clarity, this vShield technology (App) isn’t integrated (today) with vCloud Director but I hope you see a trend here.
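The container-based idea can be sketched in a few lines of Python (a hypothetical model of my own, not the vShield App API): rules reference containers, membership is looked up at evaluation time, so re-IPing a VM or moving it in or out of a tier never requires editing a rule.

```python
# Rules reference logical containers, not IPs. Names below are made up.
containers = {
    "web-tier": {"vm-01", "vm-02"},
    "db-tier": {"vm-10"},
}

# (source container, destination container, allowed TCP port)
rules = [("web-tier", "db-tier", 3306)]

def allowed(src_vm, dst_vm, port):
    """Default-deny: permit only flows matched by a container rule."""
    for src_c, dst_c, p in rules:
        if src_vm in containers[src_c] and dst_vm in containers[dst_c] \
                and port == p:
            return True
    return False

assert allowed("vm-01", "vm-10", 3306)      # web VM may reach the DB
assert not allowed("vm-01", "vm-10", 22)    # anything else is denied

# Add a VM to the tier and the policy follows it -- no new rule needed:
containers["web-tier"].add("vm-03")
assert allowed("vm-03", "vm-10", 3306)
```

One rule per (zone, zone, port) relationship instead of one per (IP, IP, port) tuple is exactly where the reduction of the “93.000 rules problem” comes from.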

Now imagine combining vCDNI with vShield App. You could – potentially – use one single VLAN to support multiple tenants, and within each “virtual VLAN” you could create rules that represent multiple security zones, effectively mimicking DMZs, back-ends, etc.

Conclusions

While I focused a lot on the products I am working with at the moment, the message I wanted to pass along with this post is that the current network security model seems to be broken, in a big way. Especially if you think about it in the scope of cloud-like deployments where agility and self-service are big mantras. There are alternative architectures that are proving to be better in this context and there is a range of products that can implement that new architecture. I mentioned vShield and vCloud Director but you can use other products if you want… as long as you fix that junk! The other point I was trying to make in this post is that orchestration itself cannot fix a bad architecture, and these two topics (architecture and orchestration) should really be considered two separate workstreams when you design your cloud infrastructure. Once again, orchestration is not the means by which you can fix a bad architecture layout.

Now I talk as if I knew what I was saying. Funny.

Massimo.

17 comments to The 93.000 Firewall Rules Problem and Why Cloud is Not Just Orchestration

  • Awesome post. So accurate and easy to understand, yet even marketing people can read it.

    Will vshield evolve to look like inter-VM .NET ? I gave a talk predicting such a future at SDforum CloudSIG and would love to get your take on it. I’m looking at how to secure such a world.

    • Massimo

      Thanks Dave for the kind comments. You’ll have to apologize but… I am not sure what the “inter-VM .NET” is. Can you talk a little bit more about that? Tks again!

      • Stu

        I think he means is will vShield become a fully fledged framework, like .NET – the kind of thing that is perhaps baked into the core vSphere API, allowing vendors to write management products that sit on top of vShield rather than vShield having both roles as it does currently. Or something.

        • Massimo

          Hi Stu.

          Well if that’s the case then the only thing I can say is that… I don’t know. :-)

          I can tell you what I (personally) would like to see… which is a state of the art where we do provide the plumbing platform that allows third parties to inject stuff on it… and one of those third parties may happen to be a VMware BU providing a “vertical solution” built on that plumbing. I believe the vChassis vision that was presented at VMworld last year was along the lines of this concept of a “pluggable” architecture.

          Thanks. Massimo.


  • Massimo,
Large firewall rule bases are almost impossible to clean up and are indeed a problem for many organisations. I have seen rule bases a lot smaller than 93k rules that were hard to manage.

There has, however, been technology around for quite a few years that will split a firewall cluster into multiple smaller firewalls that each customer can administer independently of the others. This makes rule base maintenance much easier and one rule base will not block another. This is referred to as virtual firewalls (even though it has nothing to do with x86 virtualization as we normally know it) or security contexts. It does not, however, solve the VLAN sprawl problem and does not by itself offer inter-VM security settings like you can with the vShield/vCloud package.

    http://www.checkpoint.com/services/education/training/courses/samples/VSX_C02_VSX_Arch_Deployment.pdf
    http://www.cisco.com/en/US/docs/security/fwsm/fwsm22/configuration/guide/context.html

    Lars

    • Massimo

Lars, you are smart. I was about to touch on this point too but I thought it would have added too much meat to a single post. I am glad you brought it up as it gives me a chance to comment on that too. Honestly, I am struggling with that too. I have always wondered why organizations haven’t used these “partitioning” functionalities to (partially) solve the concentration-of-rules + self-service issues. I have from time to time asked exactly that question but, quite surprisingly, I have never had a precise answer. I can tell you what my feeling is re why they haven’t:

1- organizations (especially service providers hosting different customers) aren’t confident with the level of separation these features provide. Hence they don’t want to give “root” access to these contexts directly to end-users. There may be too many “shared components” for their liking to be able to say “you are root, go ahead and screw it up”.

2- these devices have a limited number of contexts that you can create. Hence diminishing, but not solving, the hardware sprawl issue when you have thousands of tenants you need to support.

      #2 is more objective and someone may jump in and correct me if I am wrong. I haven’t found a “max contexts” number in the paper you attached.

      #1 is more subjective and it’s open for discussion (+ it may vary depending on the organization you talk to). I always say that “enough secure” is a variable concept that is a function of the level of paranoia of the person you are talking to.

Note there may also be other reasons why they haven’t been using these “contexts”. The two above were what I gathered from talking to them. Perhaps I didn’t even ask the right questions.

      Thanks for jumping in.

      Massimo.

  • Massimo,
Our company has been installing a number of these virtual firewall systems each year for the past few years. It is quite easy if you have several existing customers who are moving to the same datacenter, since you can import their existing firewall configs into separate virtual firewalls. The problem is when you already have many customers behind a single firewall and a huge rule base. Cleaning up the mess can be a very time consuming and risky job if you have 93000 rules already. So I guess those guys with 93000 rules have painted themselves into a corner and they don’t know how to turn around and get into a better managed system.

    Most firewall rule bases are small in the beginning, but may grow hugely, especially if there’s nobody with full control of the impact of all change requests regarding the rule base. A firewall rule base is a “living creature” that needs cleanup on a regular basis in order to be up to date with an ever changing dynamic environment. There are analysis tools that can help you with this, but they can only bring you to a certain level unless you have a good overview of the current state of the environment you’re trying to protect.

    A single firewall cluster can have a maximum of 250 virtual firewalls. (http://www.checkpoint.com/products/vpn-1-power-vsx/index.html#specs
    http://www.cisco.com/en/US/products/ps6120/prod_models_comparison.html)

    Even if you don’t want to give each customer the power to control their own firewall it reduces the complexity hugely to use virtual firewalls.

    Lars

    • Massimo

      Thanks Lars.

250 isn’t in fact a huge number (especially if you consider many “small/mid-size” tenants subscribing to your cloud). The “cost per context” may be prohibitive. I think.

As far as the “change” is concerned… I have never been a fan of “in-place upgrades” nor of things like P2V to move from physical to virtual. I believe the x86 platform is very dynamic in nature and I have seen a lot of customers moving (almost entirely) from physical to virtual just by deploying new workloads on virtual and decommissioning old workloads on physical. After all, we are not dealing with monolithic systems and applications where you may have a 1.000.000-line static RPG application on an AS/400 that had to survive 20 years of “server upgrades”. So the way I imagine you can solve the 93.000 rules problem is to slowly migrate those workloads (ideally by decommissioning + redeploying) from the legacy infrastructure to the new infrastructure. Plus, this is not a migration you would do overnight but more like a life-cycle in which workloads gradually and naturally move from one architecture to the other. Sure, much easier said than done. :-)

      When I hear things like “how do we upgrade this vSphere infrastructure to become a vCloud infrastructure?” I always tremble…

      Thanks!

      • Matt

        It may be cost prohibitive to provide a virtual firewall for every client (especially the smaller customers). But perhaps it needs to be integrated into the service plan itself? i.e. a clause: You may only request X firewall rules, at which point you must pay Y (fairly low) cost for an individual virtualized firewall. The infrastructure cost of the virtual firewall is clearly less than that of a new host, but it still may be a cost effective option to the customer. The objective here is to move the cost analysis to the customer’s perspective, rather than the service provider. The virtual firewall should be an extra feature.

        • Massimo

          Matt, that may be a way to tackle it yes. Why not. I am coming from a VMware vCloud Director background and there the virtual firewall (vShield Edge) is already included in the package so whether you deploy it or not it won’t change the bottom line (for the provider). I can understand though that if you have to license a virtual firewall for the customer you need to factor in its cost into the customer service.

  • […] Re Ferre – The 93.000 Firewall Rules Problem and Why Cloud is Not Just Orchestration – A few days ago I was in a very interesting meeting with a big Service Provider in Europe […]

Hey Massimo, nice post. But a lot of the things you reference have been in the cloud networking space for years. Look at how Amazon EC2 deals with this and how they solved it 4 years ago. Look at EC2 Security Groups or the Nimbula Security Lists.

    • Massimo

      Hi Reza. I beg to disagree.

In my head Amazon Security Groups really compare (philosophically) to vShield App. That is to say, a single layer 2 network shared by multiple workloads where security is imposed at a higher layer (L3+). AWS and VMware use different technologies to accomplish this (and I’d say they achieve/bump into different overall results/limitations). vShield App uses a filter driver in the vSwitch to enforce this while AWS (I think) uses iptables on the EC2 hosts. This has a number of limitations, such as not being able to do multicast, etc.
We haven’t been talking too much about using vShield App in conjunction with vCloud Director because App (today) doesn’t support IP overlapping, so you can’t have two Orgs using the same IP schema (i.e. 192.168.x.x for example). But if you design your cloud accordingly (to bypass this limitation) you can do it today, effectively creating different security enclaves on a single layer 2. This is what the Los Alamos National Lab did:
      https://www.ibm.com/developerworks/mydeveloperworks/blogs/c2028fdc-41fe-4493-8257-33a59069fa04/entry/building_a_cloud_at_the_los_alamos_national_lab7?lang=en
      Note there are better descriptions of what they did (including the “best of VMworld 2010” breakout session) but I found funny linking an IBM article describing their setup (HP blades, NetApp, vCD, vShield – perhaps IBM did the services part? Who knows!).

What I was describing in this article is a radically different approach. I was describing how you would use a vShield Edge to be the “edge” (no pun intended) of a vCD Organization, which essentially means dedicating a layer 2 to each tenant and managing security at the “edge” (again) vs. managing security on the layer 2 itself. I am not saying this is a better way to do things. However, it is perhaps more representative of what customers are doing today 1) in their datacenters 2) using physical devices. That means it is easier to 1) migrate to a “virtual setup” and 2) move workloads into a public cloud.
      Ironically Amazon announced a similar architecture as part of their VPC offering roughly 2 months ago.

      Massimo.

Hi Massimo, sorry for the misunderstanding, I was not implying that EC2 did the same thing. But they solved the problem of large scale deployments a long time ago. Their approach to networking allows the level of scale they can provide to their customers. My point was more about the problem having been solved by them earlier than about them doing the same thing :)

    • Massimo

      Hi Reza. I guess my point is that they didn’t “solve the problem” of large scale deployments. They implemented something to start addressing that problem and they started covering some use cases. The fact that they came out with a totally different design for security says that their users were demanding something (radically) different. Not by chance Chris Hoff came out with a post after Amazon’s recent announcement titled “AWS’s New Networking Capabilities: sucking less”. It’s here: http://www.rationalsurvivability.com/blog/?p=2942 .

      Don’t get me wrong I am not saying that we got it right and they copied us. I think we all have a long way to go before we solve the security problem in the cloud (which will most likely require a mix of solutions, not only one). So I wouldn’t say Amazon “solved” the problem 4 years ago. But you can always say I am biased… :)

      Massimo.


  • koren

Guys, a VM running firewall software placed as a gateway between the VMs and the outside world is not a technology breakthrough for sure, quite a typical option. Ask yourself how many FWs are implemented like this (one per every network). We all need to do the simple estimate of packets-per-second/BW/sessions-per-second and small-packet processing power to realize that, for example, 500 vFW contexts (one per tenant) on an ASIC-based platform like a Juniper NetScreen is a highly cost effective and better performing solution compared to n x ESX hosts running a software-based FW. This is of course not comparing the features available in any one of these 500 contexts to vShield Edge. The “future” of FW deployment should be something else, not an in-line deployment like this (btw, it is called “services chaining” or “services sandwich” and has been tried many times in the past).
