Building a Multi-Tenant Veeam Replication Target

Intro

Ok, so it’s been a while, but I’ve been very busy building a few new products :) Honestly I don’t know how the other bloggers manage to get time to blog if they’re actually doing work as well, but I digress….

The task laid upon me was pretty simple in its definition:

“Provide one or more ways for customers to replicate their on-premise VMs to a cloud provider in a scalable and secure manner”

… simple, right? Not so much….

The problem is, pretty much none of the software vendors understand real multi-tenancy (let alone implement it in any sort of serious way). Every product we tested fell way short, and after talking with SE’s at various vendors, it was pretty clear we were on our own. Undeterred, it was time to get my hands dirty.

This post pretty much goes through my thought process when designing this particular solution, with some hints on implementing it yourself. Hopefully you find it interesting / useful.

We ended up settling on supporting Veeam as the first product to target. In the SME space, Veeam is already wildly popular; so it made good commercial sense to support it if we could. We’ve also had experience automating Veeam and just generally know the product pretty well…

The problems to overcome with Veeam

Veeam (like most other software…) pretty much assumes all infrastructure is owned/controlled by the same organization. It’s designed with that in mind, and it’s very difficult to ‘tack on’ multi-tenancy later.

So with Veeam, the software can be installed in one of two places:

  1. At the Source (customer site): means a “Push” type architecture, snapshots are pushed up to a remote ESXi host.
  2. At the Destination (cloud-provider): means a “Pull” type architecture, snapshots are requested and pulled from the client site.

Problem is: Veeam needs a high privilege level on the source VMs and datastore, as well as on the target ESXi host, datastore and resource pools. Things like ‘register VM’, ‘delete VM’ and ‘revert to snapshot’ @ the cluster or host level… Obviously, no cloud-provider in their right mind is going to give that sort of access to a third party (a customer) and just trust them not to delete someone else’s VM on the target. Conversely, no customer would ever agree to giving a cloud-provider access to their internal VM infrastructure to ‘pull’ the backups, nor should they have to…..

The latter approach also means the cloud-provider needs to securely store credentials to the customer site, which could have close to root-level access….. Across a large number of clients, that represents a massive risk and frankly, it’s probably just a disaster waiting to happen. Rather than try to build a secure place to store that sort of info… why not just avoid storing it at all?

In the end we decided on going with a “Push-Type” architecture for a number of reasons:

  • Pulling customer data represented an unacceptable risk in credential storage.
  • No clear demarcation point in the infrastructure.
  • Realistically; Troubleshooting backup/snapshot etc issues in the customer environment would become our problem w/ a “pull-type” architecture.

So “Push” it is, then…

So it’s decided, we want to give the customer access to our SAN and prod ESXi server…. yeh, not so much!

That presents another problem;

Q: How do we give customers access to an ESXi API and some storage to replicate to? But not our production ESXi nodes and preferably not directly to our SAN either.

A: Nested ESXi !?

In this particular use-case, nested ESXi really makes a lot of sense:

  1. They can be rapidly provisioned on demand.
  2. Access / permissions can be limited to each tenant/customer individually.
  3. They can be easily ‘dual-homed’ onto the individual customer’s existing VLAN within our environment.
  4. We know how to manage ESXi servers.
  5. They’re obviously well integrated with VMware Orchestrator.
  6. We know how to use the vSphere APIs / have a library of workflows already….

Point 2 is the most interesting from my perspective: by deploying a vmkernel NIC on our side and one on the customer network, we can leverage the existing network infrastructure to bridge the customer’s network to ours (MPLS via our sister-ISP-company, or IPSEC VPNs back to the customer). Most importantly, it can be achieved in a secure manner.

Ok, it’s about time for a diagram, aye?

Figure 1a: Networking overview

Figure 1a (Above) shows how the infrastructure fits together and hopefully raises a few questions:

Q: What’s with the Static MACs?

A: Good question, Watson! There’s a few reasons, actually.

A1: We’re deploying from templates, w/ vmk0 using DHCP. Setting a static MAC on the primary vmnic allows us to preallocate the IP for the management interface and know what IP to connect to in the deployment workflows etc.

A2: Static MACs allow us to use arp-locking @ the switch level to prevent customers stealing other customers’ IPs (even if they gain root-privilege and SSH access to the nested host… RTSM is obviously disabled by policy, by the way).

A3: If the MAC of the adapter matches the VMK that’s bound to it, it’s not necessary to enable Promiscuous-mode on the portgroup(s). Further info about the ‘normal’ nested ESXi deployment is available here.
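To make A1 a bit more concrete, here is a minimal sketch of preallocating a static MAC and a matching DHCP reservation for a new target. It is not our actual workflow code. The assumptions: each target gets a numeric index (hypothetical), the management IP is simply the network address plus that index (hypothetical), and the DHCP server takes ISC dhcpd-style host entries (the post doesn't depend on any particular DHCP server). The one hard fact used is that VMware reserves 00:50:56:00:00:00 to 00:50:56:3F:FF:FF for manually assigned MACs.

```python
import ipaddress

# VMware's range for manually assigned MACs: 00:50:56:00:00:00 - 00:50:56:3F:FF:FF
VMWARE_STATIC_OUI = (0x00, 0x50, 0x56)
MAX_STATIC_SUFFIX = 0x3FFFFF


def static_mac(index: int) -> str:
    """Map a (hypothetical) per-target index to a MAC in VMware's manual range."""
    if not 0 <= index <= MAX_STATIC_SUFFIX:
        raise ValueError("index outside VMware's static MAC range")
    octets = VMWARE_STATIC_OUI + (
        (index >> 16) & 0xFF,
        (index >> 8) & 0xFF,
        index & 0xFF,
    )
    return ":".join(f"{o:02x}" for o in octets)


def dhcp_reservation(hostname: str, index: int, mgmt_net: str) -> str:
    """Render an ISC dhcpd-style host entry pinning the management IP to the MAC."""
    network = ipaddress.ip_network(mgmt_net)
    ip = network.network_address + index  # assumed scheme: index == host offset
    if ip not in network or ip in (network.network_address, network.broadcast_address):
        raise ValueError("index does not map to a usable host address")
    return (
        f"host {hostname} {{\n"
        f"  hardware ethernet {static_mac(index)};\n"
        f"  fixed-address {ip};\n"
        f"}}"
    )


# Hostname and subnet are made up for illustration.
print(dhcp_reservation("rt-tenant-0042", 42, "10.10.0.0/22"))
```

Because the MAC and IP are both derived before the template is cloned, the deployment workflow knows which address to connect to as soon as the nested host boots.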

Q: vmk0 is marked as “following”, what’s that mean?

A: vmk0’s MAC will match the first physical adapter (in this case, that’s actually a virtual adapter). This is achieved using the command “esxcfg-advcfg -s 1 /Net/FollowHardwareMac”, described in KB1031111. It’s worth noting that this is the default behaviour when installing ESXi. This setting allows us to deploy from a template w/ a static MAC set on the primary ‘physical’ adapter and have vmk0 pick up the correct DHCP address.
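If you automate this, it's probably worth a sanity check after first boot that vmk0 really did follow the adapter. A rough sketch of that comparison, assuming you've already collected the two MACs somehow (for example from the output of esxcfg-vmknic -l and esxcfg-nics -l; the collection step is left out here, and the function names are made up):

```python
def normalise_mac(mac: str) -> str:
    """Lower-case a MAC and use ':' separators so different spellings compare equal."""
    digits = mac.lower().replace("-", "").replace(":", "").replace(".", "")
    if len(digits) != 12 or any(c not in "0123456789abcdef" for c in digits):
        raise ValueError(f"not a MAC address: {mac!r}")
    return ":".join(digits[i:i + 2] for i in range(0, 12, 2))


def vmk_follows_nic(vmk_mac: str, nic_mac: str) -> bool:
    """True when the vmkernel interface carries the same MAC as the adapter under it."""
    return normalise_mac(vmk_mac) == normalise_mac(nic_mac)


# Same address written two ways: the check should pass.
print(vmk_follows_nic("00:50:56:00:00:2A", "00-50-56-00-00-2a"))
```

If the check fails, vmk0 most likely kept the MAC baked into the template, so it won't get the preallocated DHCP address.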

Q: Why NFS?

A: For this use-case, we want everything to be ‘easily’ automated during deployment. NFS provisioned through a Nexenta appliance allows us to achieve this.

Q: So what happens when I want to power-on my VM on the nested host, surely the performance would be terrible?

A: Yes, it would…. luckily, we’ve thought of that…. read the next section :)

Q: Routing would be a problem wouldn’t it? Where’s the default gateway?

A: The default gateway is set on the customer network, and we use a couple of fairly tight static routes to route back to the infrastructure we need to reach (vCenter, AD, DNS etc).
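As a rough illustration of what “fairly tight” could look like in a deployment workflow, the sketch below turns a small table of infrastructure targets into esxcfg-route commands and refuses anything broader than a chosen prefix length. All addresses, the /24 cut-off and the function name are made up for the example; the only real thing used is the “esxcfg-route -a network netmask gateway” form of the command.

```python
import ipaddress


def route_commands(targets: dict, mgmt_gateway: str, min_prefix: int = 24) -> list:
    """Build one 'esxcfg-route -a' command per target, rejecting overly broad routes."""
    gateway = ipaddress.ip_address(mgmt_gateway)
    commands = []
    for name, cidr in targets.items():
        net = ipaddress.ip_network(cidr)  # strict parsing: rejects stray host bits
        if net.prefixlen < min_prefix:
            raise ValueError(f"route for {name} ({net}) is broader than /{min_prefix}")
        commands.append(f"esxcfg-route -a {net.network_address} {net.netmask} {gateway}")
    return commands


# Hypothetical provider-side targets and management-side next hop.
infrastructure = {
    "vcenter": "10.0.10.5/32",
    "dns": "10.0.10.53/32",
    "ad": "10.0.20.0/28",
}
for command in route_commands(infrastructure, "10.10.0.1"):
    print(command)
```

Keeping the routes this narrow means the nested host can only reach the handful of provider systems it needs, while everything else follows the customer-side default gateway.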

So what happens when the customer needs to start the VMs…..

Well, that’s the million-dollar question isn’t it! Basically, that’s really the clever part (and where most of the effort in automating went)  :)

In short, we start the VMs on real ESX servers and give the customer access to them the same way we do with our Virtual Private Server (VPS) offering….

We use a web front end integrated into our portal, plus VMware Orchestrator workflows, to do it. The high-level goes something like this:

  1. Customer logs into ZG portal.
  2. List of VM’s registered on their Replication Target, list of available networks presented to customer dynamically.
  3. Customer selects VM’s to Failover (or test Failover), the destination VLAN (selected from a list of their services).
  4. Customer pushes ‘the big red button’
  5. Selected VM’s are unregistered from the Replication Target (noting down the storage path of the VMX etc).
  6. Customer’s NFS DataStore is mounted to a production host with sufficient resources.
  7. VM’s are registered to Prod vCenter.
  8. VM’s are reconfigured to the destination VLAN, permissions added for the customer’s production VPSUSERxxxx account.
  9. VM’s are powered on.
  10. Storage vMotion to Production LUNs (iSCSI) initiated.
  11. Unmount NFS DataStore from prod host when svMotion completes.
  12. FAILOVER Complete.

Obviously, there’s a fair bit of work to all of that workflow…. and that’s where the line in the sand between free knowledge and Intellectual Property is for me, I’m afraid. Suffice to say, you’ll need a developer that’s familiar with vSphere APIs, VMware Orchestrator and whatever front end you’re using…. which are a rare breed indeed! Luckily, we have a few, which makes my life as a solutions architect pretty awesome.

Figure 2a: High-Level overview of the system components.
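Without giving away any of the real workflows, the ordering of steps 5 to 11 in the list above can be sketched as a plain sequence of calls. Everything here is hypothetical: the step names, the run(step, *args) callback and the sample VM, VLAN, datastore and host names are invented for illustration. The “backend” just records what it was asked to do, so this is a dry run rather than anything that talks to vSphere or Orchestrator.

```python
def failover(run, vms, dest_vlan, account, nfs_datastore, prod_host):
    """Drive steps 5-11 in order; run(step, *args) performs one blocking action."""
    # Step 5: unregister from the Replication Target, remembering each VMX path.
    vmx_paths = {vm: run("unregister_from_target", vm) for vm in vms}
    # Step 6: mount the customer's NFS datastore on a production host.
    run("mount_nfs", nfs_datastore, prod_host)
    for vm, vmx in vmx_paths.items():
        run("register_in_prod_vcenter", vm, vmx, prod_host)  # step 7
        run("set_network", vm, dest_vlan)                    # step 8
        run("grant_permissions", vm, account)                # step 8
        run("power_on", vm)                                  # step 9
    # Step 10: move the running VMs onto production LUNs.
    for vm in vms:
        run("storage_vmotion", vm)
    # Step 11: only safe once every svMotion has finished.
    run("unmount_nfs", nfs_datastore, prod_host)


log = []


def record(step, *args):
    """Dry-run backend: log the call instead of touching any infrastructure."""
    log.append((step, *args))
    if step == "unregister_from_target":
        return f"{args[0]}/{args[0]}.vmx"  # made-up VMX path
    return None


failover(record, ["web01", "db01"], "vlan-1234", "VPSUSERxxxx",
         "nfs-tenant-0042", "prod-esx-07")
for entry in log:
    print(entry)
```

The one ordering constraint that really matters is the last one: the NFS datastore can only be unmounted once every svMotion has completed, which is why the sketch treats each call as blocking.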

Wrapping up

So hopefully, that gives you a little insight into what’s possible with nested ESXi hosts with a bit of lateral thinking. I found the project to be really rewarding (certainly challenging!) and it’s projects like these that make me love my job.

I’d like to thank Nicki Pereira (ZettaGrid General Manager) for allowing me to post this and having the vision to allow the project in the first place.

Shameless plug: The Replication Target is one of ZettaGrid’s showcase products, part of the “ZettaGrid Replication Service” suite. The suite is launching into Public-Beta @ CeBIT next week (22-05-2012). I’ll be presenting an overview on Thursday in the Cloud Theatre and will be floating around talking to anyone who’s interested. Come say hello (world)!

10 Responses to “Building a Multi-Tenant Veeam Replication Target”

  • Doug:

    karlochacon, If I were to do it again (and I did last year)… I’d use Zerto instead.

    It had built-in multi-tenancy and proper client/provider separation.

    Feel free to ping me for more info :)

    @cnidus on twitter or cnidus101 on skype.

  • karlochacon:

    hi

    I just read your article. I know this was in 2012, but what about now?

    We have a datacenter and we provide some services, like hosting virtual machines for some customers (these VMs are created by us; customers don’t create anything). Our vCenter and ESXi hosts are in our private management network, where customers don’t have access.

    Now customers are starting to ask for Replication as a Service using either Veeam or VMware Replication…. but we have the same problem you did: we cannot expose our vCenter/ESXi IPs to Veeam or VMware Replication…. So my question is, is there a new way to overcome this? Or if we don’t follow a process similar to yours, are we going to lose these customers?
    I already spoke to the networking team and we cannot share our management IPs, so is there any workaround for this in 2014, or do we have to follow a similar approach to the one you took in 2012?

    thanks

  • […] ESXi Node is deployed from a standard template I built for a previous project. Pretty standard nested ESXi […]

  • charles:

    I would like to know whether your service provides the end customer with a user portal to start and edit the backup jobs. (Veeam provides a web interface, Veeam Backup Enterprise Manager; does your design use it?) I think Veeam is totally not a multi-tenant solution.

    • Yes, in our case we operate in a pure IaaS type model, so our customers need self-service capabilities of a web portal.

      As you alluded to, the Veeam Enterprise console is not built in a multi-tenant nature at all. Or at least it wasn’t when we looked.

      In our case, we built an integrated module into our billing/management portal that did the following:
      * interrogated the Replication Target,
      * returned a list of registered VM’s (with tick boxes next to them)
      * Showed a “big red button” to Fail-over the VM’s to our stack.

      Simple and it worked.

  • Joe:

    We offer a service that is similar. We only offer it locally in Minnesota.

    Cnidus- $150 for how big of VM?

    Solbrekk.com

    • Joe, Cool! How popular is your service? What product do you use to make it work, is it all automated? or service desk staff handle it manually?

      Storage is priced @ 49c / GB for standard… Which is the major component of the cost. You can adjust the sliders here (https://www.zettagrid.com/buy-now) and work out the unit pricing if you want to look into it more.

      I’d be keen to hear more about what you’ve done, skype/msn/etc?

      -Cni (Doug)

  • Travis Phipps:

    This is GREAT information. I’ve been asking Veeam for quite awhile if they had any partners offering a service like this. I could never get a straight answer, but your explanations make perfect sense why I can’t find a public offering for this. I’m very excited to see what your offering looks like as well as the pricing you’re able to offer.

    Thanks for the great writeup.

    • Cheers Travis. Send me an email douglas youd zettagrid com and I’ll set you up with a trial if you like.

      Not exactly sure on pricing yet, but starting around the $150 a month-ish mark I believe.

      Thanks,
      Doug


The opinions expressed on this site are my own and not necessarily those of my employer.

All code, documentation etc is my own work and is licensed under Creative Commons and you are free to use it, at your own risk.

I assume no liability for code posted here, use it at your own risk and always sanity-check it in your environment.