Routed vMotion: Why?

Recently, a customer asked me, “What are the limitations around vMotion across an L3 Clos?” That question prompted me to re-raise the issue in a discussion on Twitter. This post documents my thought process on why vMotion at the routing layer is a requirement in the modern data center.

Background

Recently, I’ve been involved in a lot of next-generation data center architectures. One theme is pretty universal: reducing the size of the L2 domain, drawing a very clear L2/L3 boundary, and making more use of routing inside the data center.

This is a fundamental shift from the traditional L2-centric Core-Agg-Access topology that’s been prevalent in most enterprises up until now. The L3 Leaf-Spine or Clos fabric is very common in the hyperscale data centers that a lot of enterprises are seeking to emulate and, in some cases, compete with. I won’t go into the numerous reasons for the shift here, but the summary is: it used to be fairly reasonable to treat the routing (L2/L3) boundary as the logical boundary of a data center, or perhaps a cluster. That assumption no longer holds in an L3 Clos.

For example, it’s now pretty common to keep the L2/L3 boundary at the top-of-rack switch (ToR).

Basic L3 Clos network topology

For the purposes of this post, I’m going to keep it simple and assume a single ToR switch. In most enterprise deployments there would likely be a pair of ToRs (possibly spread across two racks), with some form of L2 host redundancy protocol running. However, all of these protocols effectively present as a single switch from the host’s perspective, so I’m going to ignore them for simplicity.
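To make the topology in the diagram a bit more concrete, here’s a toy model of the fabric: every leaf (ToR) connects to every spine. The switch names and counts are invented for illustration only; the point is that any rack-to-rack path is always leaf → spine → leaf, so the path length is uniform no matter which two racks are involved.

```python
# Toy model of a leaf-spine (Clos) fabric: every leaf connects to every spine.
# Switch names and counts are invented purely for illustration.
from itertools import product

spines = ["spine1", "spine2"]
leaves = ["leaf1", "leaf2", "leaf3", "leaf4"]   # one ToR per rack in this sketch

links = set(product(leaves, spines))            # full leaf <-> spine mesh

def inter_rack_paths(src_leaf, dst_leaf):
    """All equal-cost paths between two ToRs: exactly one per spine."""
    return [(src_leaf, spine, dst_leaf)
            for spine in spines
            if (src_leaf, spine) in links and (dst_leaf, spine) in links]

print(inter_rack_paths("leaf1", "leaf3"))
# [('leaf1', 'spine1', 'leaf3'), ('leaf1', 'spine2', 'leaf3')]
# Every inter-rack path has the same length, and every spine can carry traffic (ECMP).
```

Keep that “uniform path length, all links forwarding” picture in mind; it matters later when we talk about latency.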

L2 Adjacency Requirements

The L3 Clos topology presents an issue for VM environments: cross-rack L2 adjacency. If you want to move a VM from one rack to another, conventional wisdom says the same VLANs need to be present in both racks. Period. Or do they?

Routed vMotion: A little more detail

But first, a pretty block diagram of what we’re discussing. I’ve jumped ahead a little and added an overlay stack on the VM networking side for illustration purposes. This will make more sense in a minute; bear with me.

ESXi block diagram

The main takeaway here is that there is logical separation between a VM’s front-side network and the network stack(s) that ESXi itself uses for management, vMotion, storage, etc. This means we can treat the two separately. The L2 adjacency requirement exists in both places; let’s address them individually.

1) VM Network(s)

The data-center/enterprise world is largely L2-centric. ESX is no different. VMs are connected to port groups, and port groups map to an L2 segment (VLAN). So if you want to vMotion, the front-end network must be present on the destination host. Case closed.

Hold on a moment. There are multiple ways to solve that problem:

  • Overlay networking: L2 encapsulated over L3. NSX, Midokura, Nuage. (There’s a rough sketch of the idea just after this list.)
  • Dynamic routing @ the VM: In the case of a front-end load balancer, for example, it may be advertising its IPs dynamically, in which case L2 adjacency may not be such a concern.
  • Border NAT: À la Amazon EC2. Maybe it’s OK for a VM’s IP to change via DHCP when it moves hosts; the inbound NAT is aware and reacts to the change.
  • Other solutions from people smarter than me.
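To make the overlay option less abstract: VXLAN (the encapsulation NSX uses, defined in RFC 7348) boils down to taking the VM’s original Ethernet frame, prepending an 8-byte header carrying a 24-bit segment ID (VNI), and shipping the result as plain UDP/IP between hypervisors, which routes across an L3 Clos like any other traffic. Here’s a minimal sketch of just that encapsulation step; it illustrates the header layout only, not how any particular product implements it.

```python
# Minimal VXLAN-style encapsulation sketch (header layout per RFC 7348).
# Illustration of the concept only -- not any vendor's implementation.
import struct

VXLAN_UDP_PORT = 4789  # IANA-assigned destination UDP port for VXLAN

def vxlan_encap(inner_frame: bytes, vni: int) -> bytes:
    """Prepend the 8-byte VXLAN header (flags + 24-bit VNI) to an L2 frame.

    The result becomes the UDP payload carried between the two hypervisors'
    tunnel endpoints; the outer IP/UDP headers are added by the normal
    (routed) IP stack, which is exactly why the underlay can be pure L3.
    """
    flags = 0x08                                    # 'I' bit: VNI field is valid
    header = struct.pack("!B3xI", flags, vni << 8)  # VNI sits in the upper 24 bits
    return header + inner_frame

# e.g. vxlan_encap(original_ethernet_frame, vni=5001) -> bytes to send over UDP/4789
```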

The meta-point here, though, is that there are solutions, some of them pretty far out of left field.

2) VMkernel Network Used by ESXi for vMotion

The first thing to consider here is that, by default, vMotion across subnet boundaries (that is, routed vMotion) will work. Fundamentally, vMotion uses TCP/IP, so it can and will route.

The issue comes from various references/warnings in VMware documentation, industry blogs and so forth.

“vMotion and IP-based storage traffic should not be routed, as this may cause latency issues.” – KB2007467: Multiple-NIC vMotion on vSphere 5 (2013)

“Minimize the amount of hops needed to reduce latency, is and always will be, a best practice. Will vMotion work when your vmkernels are in two different subnets, yes it will. Is it supported? No it is not as it has not explicitly gone through VMware’s QA process.” – YellowBricks (2010)

“vMotion across two different subnets will, in fact, work, but it’s not yet supported by VMware.” – Scott Lowe, vMotion Layer 2 Adjacency Requirement (2010)

A lot of this guidance is based on the assumption that routing implicitly adds latency and therefore should be avoided. In a modern data center, that may not be the case. Let’s explore that.

Routing Should Be Avoided! Or Should It?

There is one dirty little secret people may be unaware of: most modern switch ASICs perform L2 and L3 lookups at the same speed (or as close as makes no difference; think +/- 50 nanoseconds).
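If you want to sanity-check that claim in your own environment, you don’t need anything fancy: time a round trip to a host behind the same ToR and to a host one routed hop away, and compare. Here’s a rough probe along those lines; the addresses and port are placeholders, so point it at hosts that actually exist in your fabric. It measures end-to-end TCP handshake time, which is dominated by host stacks and NICs rather than switch ASICs, but that’s rather the point: the routed path shouldn’t come out measurably slower.

```python
# Rough RTT comparison: TCP handshake time to an L2-adjacent host vs. a host
# one routed hop away. Addresses and port are placeholders for illustration.
import socket
import statistics
import time

def connect_rtt_us(host, port=22, samples=20):
    """Median TCP handshake time in microseconds."""
    times = []
    for _ in range(samples):
        start = time.perf_counter()
        with socket.create_connection((host, port), timeout=2):
            pass
        times.append((time.perf_counter() - start) * 1_000_000)
    return statistics.median(times)

same_rack  = connect_rtt_us("10.10.1.12")   # hypothetical host behind the same ToR
other_rack = connect_rtt_us("10.10.2.12")   # hypothetical host one routed hop away
print(f"same rack : {same_rack:8.1f} us")
print(f"other rack: {other_rack:8.1f} us")
```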

Consider the example above with an L3 Clos network. What would it look like if we were forced to move to a pure L2 model (as has been suggested as a best practice for vMotion)?

L2 network with spanning tree

Notice that STP will shut down most of the links. Only one spine switch will actually be forwarding vMotion traffic (since it is carried over a single VLAN). How is this a step forward?
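It isn’t just about redundancy, either; it’s capacity. With made-up but plausible numbers (say four spines and 40 GbE leaf uplinks, purely for illustration), the difference in usable uplink bandwidth per rack is stark:

```python
# Back-of-the-envelope usable uplink capacity per leaf. The spine count and
# link speed are invented for illustration -- plug in your own numbers.
spines = 4            # number of spine switches each leaf uplinks to
uplink_gbps = 40      # speed of each leaf-to-spine uplink

ecmp_capacity = spines * uplink_gbps   # L3 Clos: every uplink forwards (ECMP)
stp_capacity  = 1 * uplink_gbps        # one VLAN + STP: one uplink forwarding, rest blocked

print(f"L3 ECMP : {ecmp_capacity} Gb/s usable uplink per leaf")
print(f"L2 + STP: {stp_capacity} Gb/s usable uplink per leaf")
```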

Now, there are various solutions to the problem of STP shutting down redundant links: MLAG, VLT, FabricPath (TRILL-based), Virtual Chassis, QFabric… pick your poison. But choose carefully, because they are all vendor-specific, proprietary, “lock-in” protocols.

I’m going to pick MLAG for the purposes of this example. Generally, MLAG-like solutions come with a lot of caveats: they work in pairs only, and there are inter-switch links (ISLs) between the pairs that must remain up and be sized appropriately. And did we mention they’re proprietary?

L2 MLAG example

All of this just to put vMotion interfaces in the same VLAN/subnet, when they can be routed anyway, with minimal/no latency overhead. Why?

And wait, isn’t one of the points of modern, software-defined overlay networks to decouple from brittle, proprietary, L2-centric network architectures?

So, Is It Supported?

So far, like most things in this industry, the answer seems to be: it depends.

I have to say, I’m a little disappointed that in the four years since the blogs above noted that it works but needs some QA attention, it still hasn’t been tested and publicly supported.

However, I understand there are a lot of other VMware technologies built on top of vMotion (like DPM and DRS), so running all of those through their paces may open a can of worms. Or maybe not enough people have raised a feature request for it.

But there is an alternative: the Request for Product Qualification (RPQ) process, which, conveniently, is mentioned as the way to get support for this exact feature — in the VMware NSX Design Guide (page 14)!

“From the support point of view, having the VMkernel interfaces in the same subnet is recommended. However, while designing the network for network virtualization using L3 in the access layer, users can select different subnets in different racks for vSphere vMotion VMkernel interface.

For ongoing support, it is recommended that users go through the RPQ process so VMware will validate the design.” – VMware NSX Design Guide (page 14)

The caveat with RPQ is that the process runs on a customer-by-customer basis and you need to submit a design that makes sense. A routed (L3) Clos topology with a sensible subnet scheme, and with VM L2 adjacency handled some other way (or not required at all), should fit those requirements.
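As a hand-wavy illustration of what such a subnet scheme might look like (the address block, rack count and gateway convention are all invented for the sketch, not a recommendation): carve a single block into one vMotion VMkernel subnet per rack, with the ToR holding the gateway address in each.

```python
# Illustrative per-rack vMotion VMkernel addressing plan. The block, rack
# count and gateway convention are assumptions made up for this sketch.
import ipaddress

vmotion_block = ipaddress.ip_network("192.168.64.0/18")
racks = 8

for rack, subnet in zip(range(1, racks + 1), vmotion_block.subnets(new_prefix=24)):
    hosts = list(subnet.hosts())
    gateway = hosts[0]                 # ToR routed interface / SVI for this rack
    vmk_pool = f"{hosts[1]} - {hosts[-1]}"
    print(f"rack {rack}: vMotion subnet {subnet}, gateway {gateway}, vmk IPs {vmk_pool}")
```

Each host’s vMotion VMkernel interface then just needs reachability to the other racks’ vMotion subnets via its local ToR (a static route, or whatever routing mechanism your ESXi version provides), and inter-rack vMotion traffic takes the same leaf-spine-leaf path as everything else.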

So there you have it, folks. Routed vMotion. Yes, it works. No, it won’t impact performance (in the right context). Yes, it is supported (through RPQ), and it is even recommended for NSX.

Summary

In traditional Core-Agg-Access, L2-centric topologies, applications could treat the L2 boundary as equivalent to the cluster boundary.

Generally speaking, if someone was seeking to cross an L3 boundary, it usually meant they were trying to cross a DC or cluster boundary, with all the latency, throughput, and other implications that carries. None of that is great for applications such as vMotion, so support statements and best practices were crafted around that assumption.

In the L3 Clos network topology presented here, those assumptions are false. Crossing an L3 boundary does not imply added latency or reduced bandwidth. The support statement and best practice are trying to limit latency and bandwidth constraints, but they may stop customers from architecting the “underlay” network in a way that would actually lower latency, increase bandwidth, and be less brittle and more standards-based.

This is an example where blindly following outdated best practices around one application could have much larger negative impacts on the overall architecture. The world has moved on, and VMware themselves have recognized this in their NSX design guide.

Thanks to @scott_lowe, @Josh_Odgers, @joecarvalho_jr and @grantorchard for their input on Twitter; this has been a lot of fun. :)

-Doug aka @cnidus

ESXi, vCenter, vCloud Director, Zerto, AD….. all nested inside vCloud.

So this is basically another piece of craziness born out of necessity.

I needed to do some testing with the latest release of the Zerto virtual replication suite. I didn’t want to do it in Prod (obviously!), and our existing physical lab environment is a bit too secure to be useful and has no way to demo what I build in it to clients… So, what’s a cloud architect to do? Run it in the cloud, of course! (OK, that’s a bit wank, but sarcasm doesn’t translate well in text…) Read the rest of this entry »

ObjectStore + vDAS = Win?

So this post came about as a result of me fishing for some information from a fellow engineer/architect @ another cloud provider, Kyle Bader (@mmgaggle). Basically, I’d seen a video about DreamObjects’ Ceph implementation, picked up on a mention of using Coraid, and was intrigued.

Kyle and I exchanged a few tweets and he questioned why I would use Coraid behind an ObjectStore platform…. so I thought I’d put my thoughts together and get some feedback. Read the rest of this entry »

Building a Multi-Tenant Veeam Replication Target

Intro

Ok, so it’s been a while, but I’ve been very busy building a few new products :) Honestly, I don’t know how the other bloggers manage to find time to blog if they’re actually doing work as well, but I digress…

The task laid upon me was pretty simple in its definition:

“Provide one or more ways for customers to replicate their on-premise VMs to a cloud provider in a scalable and secure manner”

… simple, right? Not so much…. Read the rest of this entry »

New Home Lab: CniLab 1.0 (Part 1)

Well, since I’m moving on from my current workplace, which has a fantastic lab environment, I thought it was probably about time to build myself a testlab at home. I’ve done a fair bit of research into what others have done, as well as looking at a variety of SMB sites. Ultimately, though, I want to create something that meets my own needs, not anyone else’s.

Why CniLab?
After seeing Simon Gallagher’s vTardis, I thought “damn, that’s a cool name for a VM lab”, not to mention a very nice setup in general. So, being a complete geek, I set about thinking up a name for mine. Read the rest of this entry »

Time to move on…

Well, after three years, I’ve decided it’s time to throw in the towel @ UWA. This environment has provided me with a fantastic learning platform and helped to accelerate my growth as an IT professional. I had the opportunity to work with some really talented people and hope that we keep in touch. I learned a lot, but it’s time to move on.

I have been offered, and accepted, a new position as Senior Virtualization Engineer @ ZettaServe. I will be working on developing the next-generation hosting platform for their ZettaGrid project. I’m really excited to get started, if a little daunted by the challenge.

Moving from the ‘cruisy’ education sector to the commercial consulting world is sure to be a change, but I’m looking forward to it. :)

Virtual Distributed Switch (vDS): Clearing it up (to myself).

One of the design decisions I’m currently faced with is the network configuration for a new virtualization platform. This has led me to do some further reading on vDS and its implications for design.

Background
We’re in a similar position to a lot of other orgs: we’ve had a couple of iterations of virtualized resources and are now reaching a maturity point where we’re looking to improve the platform, raise its profile, and market it as services (PaaS, IaaS) to our internal clients. The next logical step is to improve our management processes, SLAs, etc. and start moving to a hybrid-cloud model, but we’re not quite there yet. Read the rest of this entry »

Problem: VMs disconnecting vNICs after vMotion.

Well, today I had an interesting conundrum. I was doing some routine patching of an ESX cluster when suddenly alerts were going off about VMs being disconnected.

It turns out we hit the default port limit of the vSwitch on the destination ESX host, which is 64 (or 56 usable).

A quick check of the logs and vswitch config on the service console confirmed the suspicion.

After the incident, I did a quick Google search and it would appear one of my fellow countrymen, Cristoph Fromage, encountered the same limit last year. Link

To get services back online quickly, I simply migrated a few machines off the over-allocated host, then re-enabled the interfaces on the affected VMs. A better monitoring system would’ve been helpful here; or, if I were faster with PowerCLI, perhaps I could have found the disabled interfaces through that (a rough sketch of that idea follows below)… I ended up going through all the VMs in that cluster to be sure I’d got them all.
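For what it’s worth, here’s roughly what that “find the disconnected vNICs” check could look like, sketched with pyVmomi (the Python SDK for the vSphere API) rather than PowerCLI. The vCenter hostname and credentials are placeholders, and this is a starting-point sketch, not what I ran at the time.

```python
# Quick pyVmomi sketch: list every VM whose virtual NIC is not connected.
# vCenter host and credentials are placeholders; adapt before using.
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

ctx = ssl._create_unverified_context()          # lab only: skip cert validation
si = SmartConnect(host="vcenter.example.local", user="administrator@vsphere.local",
                  pwd="changeme", sslContext=ctx)
try:
    content = si.RetrieveContent()
    view = content.viewManager.CreateContainerView(content.rootFolder,
                                                   [vim.VirtualMachine], True)
    for vm in view.view:
        if vm.config is None:                   # skip inaccessible/orphaned VMs
            continue
        for dev in vm.config.hardware.device:
            if isinstance(dev, vim.vm.device.VirtualEthernetCard) \
                    and dev.connectable and not dev.connectable.connected:
                print(f"{vm.name}: {dev.deviceInfo.label} is disconnected")
    view.DestroyView()
finally:
    Disconnect(si)
```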

The ultimate fix is to carefully juggle the VMs around so you don’t hit the limit again, then increase the port limit on each vSwitch in the affected cluster…

Between this, a massive spanning tree issue taking down half the campus and an abandoned snapshot…. I think I’ve had enough disaster for one day.

vCenter as a vApp?

So I’ve been doing a fair bit of thinking lately on what I want my new virtual infrastructure to look like….

I’ve got multiple datacenters, with multiple clusters in each (differing hardware requires that), plus a dedicated VM testlab, and I was thinking… well, probably best to have a vCenter in each.

My line of thinking was basically:

  • vCenter in each DC (in linked mode?)
  • Separate DBs
  • Maybe template it?
  • Well, the DB should be a VM too
  • Need to sort out the startup order…
  • Hmm, what about a vApp?

Now, it seems like a reasonable leap to me, but (correct me if I’m wrong) all the vApp detail is stored in the VCDB. If vCenter is unavailable, will the startup order of the vCenter vApp work as expected in an HA event?

Time to test it in the testlab I think….

Custom shares on a Resource Pool, scripted (Modified)

Well, I’ve been taking a bit of a break from XML-based PowerShell code, ’coz working with it was doing my head in. I was going through some older blog posts on YellowBricks and stumbled across a few related to resource pools, specifically how shares work.

Now, I was always under the impression that shares were already weighted to account for the number and size of the VMs in them… however, I was clearly mistaken. I did actually raise this question during my vSphere training and was assured that was the case… so it’s definitely something to be aware of. See The Resource Pool Priority Pie Paradox. Read the rest of this entry »
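The arithmetic that makes this bite is worth spelling out. With invented numbers (share values roughly matching the High/Normal presets, and made-up VM counts), the “High” pool’s VMs can each end up with fewer effective shares than the “Normal” pool’s VMs:

```python
# Worked example of the "priority pie" effect: the per-VM slice depends on how
# many VMs share the pool, not just on the pool's share value. Numbers invented.
pools = {
    "Production (High shares)": {"shares": 8000, "vms": 40},
    "Test (Normal shares)":     {"shares": 4000, "vms": 5},
}

for name, pool in pools.items():
    per_vm = pool["shares"] / pool["vms"]
    print(f"{name}: {pool['shares']} shares / {pool['vms']} VMs "
          f"= {per_vm:.0f} shares per VM under contention")

# Production (High shares): 8000 shares / 40 VMs = 200 shares per VM under contention
# Test (Normal shares):     4000 shares /  5 VMs = 800 shares per VM under contention
```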
