Virtual Distributed Switch (vDS): Clearing it up (to myself).

One of the design decisions I’m currently faced with is the network configuration for a new virtualization platform. This has led me to do some further reading on vDS and its implications for the design.

We’re in a similar position to a lot of other orgs: We’ve had a couple of iterations of virtualized resources and are now reaching a maturity point where we’re looking to improve the platform, raise its importance and market it as services (PaaS, IaaS) to our internal clients. The next logical step is to improve our management processes, SLAs etc. and start moving to a Hybrid-Cloud type model, but we’re not quite there yet.

We’ve gone through the process of comparing rackmount, blade and vBlock architectures and have pretty much come to the conclusion that blade makes the most sense to us for a number of reasons:

  • Cable once: A chassis can be provisioned and blades added periodically without involving the network team, which is currently a slow and error-prone process. Also means no need for physical cabling when scaling to another host for the Storage Team.
  • Standard platform: Makes it easy for us to specify what hardware platforms we offer as a default. The old Ford mantra, used to simplify management: “you can have any colour you like… so long as it’s black”.
  • Density: We’ve got plenty of power (and to a lesser extent, cooling), a new row and a shiny new secondary Data Center… but we also have a lot of large infrastructure projects, and getting more rackspace will add delays. Increasing density therefore makes a lot of sense, so we don’t run out of space halfway through a project.
  • Utilize Existing Network and Storage: Unlike vBlock, we can use our existing investment and also take advantage of significant expansion in both network and storage spaces that are current multi-million dollar projects.
  • Consolidation of Network / Storage Ports: Access ports on 10gbit Cisco Nexus and 8gbit Brocade fabrics are relatively expensive and often underutilized; consolidating the uplinks out of the chassis makes logical and (after running the numbers) financial sense.

What Networking?
So, assuming for a moment we go with blades, what sort of network and storage connectivity should we utilize?

Well, in my organization, the network team have no knowledge, interest or input into the storage network and we currently have a reasonably solid fibre-channel network managed by the storage team. The lack of interaction between these two groups, coupled with a questionable data network (that’s the subject of a massive re-design project) basically rules out a converged data/storage network in my eyes, at least for the short-medium term.

On the network side, we’re able to provision 10gbit ports on Cisco Nexus kit, so the obvious choice is to utilize this for ESX networking.

So we’ve got 10gbit and FC as our preference for connectivity, that’s easy enough to provision in a blade environment. Fabric A for LOM/ILO/DRAC, Fabric B for 10gbit and Fabric C for FC, all being redundant of course.

So, how does vDS fit in?
Well, since we’re already licensed to Enterprise Plus level (one of the benefits of working in the education sector… awesome pricing from VMware), it makes sense to lower our management overhead by using a vDS for VM traffic.

However, since we’ll likely only be using a pair of 10gbit ports as our only networking ports on the host, it means we will also be using the vDS for our management traffic.

This makes me a little nervous, as conventional wisdom was to separate VM and management traffic onto separate vSwitches. I’ve even heard some people recommend sticking with a standard vSwitch for management traffic if you use vDS. This means adding another pair of uplinks and partially managing switch config at the host level, and it just generally doesn’t feel like a clean solution to me.

So I did some digging. What I came up with is this:

  1. A vDS is essentially a standard vSwitch on each host, where the config is templated and updated from a central point (vCenter). GeekSilver
  2. If vCenter is unavailable, you will lose management of the vDS (obviously), but traffic will still flow (as you would hope in an enterprise feature). VCPGeeks
  3. vDS config is stored / distributed from a special folder on an automatically selected shared LUN RTFM
  4. If you have vCenter down AND you need to make a vDS config change to bring it back up (Like, change vCenter’s portgroup), then you may have trouble
  5. There is conflicting information about HA functionality if vCenter was one of the VMs that went down in a host crash. Basically, HA should start vCenter on an available host… but will it have a network when it comes back up? YellowBricks (in the comments). This is definitely something I will test in the lab…

Now, item 4 is particularly interesting, and it makes sense: a VM portgroup change would be a vDS change, and that management layer is gone if vCenter is down. There is some speculation that changing the portgroup binding to Ephemeral may make the portgroups visible when connecting to an ESX host directly, but I wonder whether you would be able to make such a change in that situation, and what the implications are.
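The reasoning in items 2 and 4 can be sketched as a toy model. This is not the VMware API: the names ToyVDS, PortGroup and Binding are made up for illustration, and the rule that an ephemeral portgroup stays usable while vCenter is down is the speculation mentioned above, treated here as an assumption rather than something verified.

```python
from dataclasses import dataclass, field
from enum import Enum


class Binding(Enum):
    STATIC = "static"        # ports are allocated through vCenter
    EPHEMERAL = "ephemeral"  # ports are created on demand at the host


@dataclass
class PortGroup:
    name: str
    binding: Binding


@dataclass
class ToyVDS:
    """Hypothetical model: vCenter is the control plane, hosts keep a cached copy."""
    portgroups: dict = field(default_factory=dict)
    vcenter_up: bool = True

    def add_portgroup(self, pg: PortGroup) -> None:
        # Creating or editing a portgroup is a vDS config change,
        # so it needs the management layer (item 4).
        if not self.vcenter_up:
            raise RuntimeError("vCenter is down: cannot change vDS config")
        self.portgroups[pg.name] = pg

    def traffic_flows(self) -> bool:
        # Item 2: hosts keep forwarding from their cached config,
        # whether or not vCenter is reachable.
        return True

    def can_reassign_vm(self, target: str) -> bool:
        """Can a VM's NIC be moved onto portgroup `target` right now?"""
        pg = self.portgroups[target]
        if self.vcenter_up:
            return True
        # ASSUMPTION (the speculation above): with vCenter down, only an
        # ephemeral portgroup can be used directly from the host.
        return pg.binding is Binding.EPHEMERAL


vds = ToyVDS()
vds.add_portgroup(PortGroup("vm-traffic", Binding.STATIC))
vds.add_portgroup(PortGroup("mgmt-recovery", Binding.EPHEMERAL))

vds.vcenter_up = False                       # vCenter VM has gone down
print(vds.traffic_flows())                   # True
print(vds.can_reassign_vm("vm-traffic"))     # False
print(vds.can_reassign_vm("mgmt-recovery"))  # True
```

The point of the model is that the ephemeral portgroup has to exist before vCenter goes down, since creating it is itself a vDS change. Whether a real ESX host behaves this way is exactly what needs testing in the lab.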

Further investigation required…. but my knowledge of vDS has improved a lot today. :)

EDIT: Found some more info.

Duncan over @ Yellow-bricks wrote up a post on the dvSwitch some time ago. It sums up my position quite well.

Rich @ VM / ETC also weighed in on the issue: dvSwitch Design considerations

So long as HA works and the VMs are powered up on a new host with networking intact (even if vCenter is one of the failed VMs), then personally I think the issue is far less dire, and that would push me towards the pure vDS route instead of sticking with vSS’s.

The opinions expressed on this site are my own and not necessarily those of my employer.

All code, documentation etc is my own work and is licensed under Creative Commons and you are free to use it, at your own risk.

I assume no liability for code posted here, use it at your own risk and always sanity-check it in your environment.