Instant VM Recovery considerations for modern data center – Part 2

Source: Veeam

Previously, we’ve gone over design, performance and backup target considerations. Let’s continue our discussion with thoughts on network and restore targets.

Network considerations

The next thing to look at is the network. Ideally, you’ll have 10 Gbps or better at your backup targets (I normally leverage dual 10 Gbps). On the backup source, you can get away with less, but then you really need to leverage some form of bandwidth aggregation such as NIC teaming with LACP or SMB Multichannel. You could use 4×1 Gbps LOM ports for this.

As backup source to backup target traffic tends to be a “many-to-one” situation, this can work, but avoid it at the backup target itself. Likewise, parallel restores from a backup target to multiple restore targets (quite often the backup source) are a “one-to-many” game.

You can have multiple backup jobs for multiple Hyper-V hosts / LUNs write to the same backup target. Assume you have 10 hosts, each with 4×1 Gbps of aggregated bandwidth, running one or more backup jobs; if you can fill those, you can easily saturate a single 10 Gbps pipe on the backup target.
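As a quick back-of-the-envelope check (a minimal sketch; the host count, link speed and utilization figures are illustrative assumptions, not measurements), you can see how fast a many-to-one fan-in saturates the target’s pipe:

```python
# Rough bandwidth sanity check for a many-to-one backup fan-in.
# All figures are illustrative assumptions; substitute your own numbers.

hosts = 10                 # Hyper-V hosts running backup jobs
links_per_host = 4         # 4 x 1 Gbps LOM ports, aggregated
link_speed_gbps = 1.0
utilization = 0.5          # assume jobs fill only half of each host's aggregate

source_gbps = hosts * links_per_host * link_speed_gbps * utilization
target_pipe_gbps = 10.0    # single 10 Gbps NIC on the backup target

print(f"Aggregate source throughput: {source_gbps:.0f} Gbps")
print(f"Backup target pipe:          {target_pipe_gbps:.0f} Gbps")
if source_gbps > target_pipe_gbps:
    print("The target link is the bottleneck -> consider dual 10 Gbps or faster.")
```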

If you are leveraging SMB shares as backup targets, SMB Multichannel does its work just fine, potentially helped by SMB Direct if you have RDMA-capable NICs configured end to end. Today, many Hyper-V cluster designs already leverage 10 Gbps RDMA-capable NICs, so you can use them. The Veeam data mover doesn’t leverage SMB, however. Keep that in mind when designing your solution.

Also note that with Windows NIC teaming, switch-independent mode allows sending from all members but only allows receiving on one member. If you want optimal bandwidth in both directions for a single process, you are better off leveraging LACP, but then you need multiple restores to the same host to benefit. Bandwidth aggregation comes with conditions and is not the same as a bigger pipe. Keep this in mind during design as you look at your use cases.

Depending on your network environment, you can leverage Windows-native NIC teaming in LACP or switch-independent mode and/or SMB Multichannel. The latter is useful when you use an SMB file share and want to leverage SMB Direct in your environment. All these possible configurations deserve one or more articles by themselves.

 

What’s important here is that you want a lot of bandwidth and low latency to provide Instant VM Recovery with the best possible performance for mounting the virtual disks, accessing data and copying data over. Remember, you might very well be doing multiple recoveries in parallel. During that process, backup jobs might still be running. As all networking today is full duplex, you don’t need to worry about incoming traffic hindering outgoing traffic. When bandwidth is plentiful, it’s compute and storage that determine the speed. When you have designed your backup solution’s network well, you will most often find that you can leverage it for your (Instant) VM Recoveries without needing anything different or more.

Restore target considerations

Normally, our storage arrays for virtualization are the best storage you’ll find in the data center, so you could think you have nothing to worry about. We would all love to have all-NVMe arrays where you can go crazy with huge random IO and never notice even a slight drop in performance due to a lack of IOPS or high latency, but the reality is that this is probably not what you have. So, let’s look at some of the options you should optimize.

Restore to Hyper-V production hosts directly to the production LUNs

Even when you have high-performance storage with read/write caching or a tier 1 storage layer, you have to be careful not to fill up that tier 1 layer, or you’ll fall back to the lowest common denominator in the array. This can have a profound impact on the workloads running on that storage array, and it can easily happen when you push large amounts of data to it as fast as you can. Normally you see this happen during storage migrations, and you protect against it by avoiding tier 1 for such operations. This is something to consider during massive VM restores as well. Maybe you’ll want to restore to separate LUNs with a different storage profile, for example. The restored virtual machines can then be storage live migrated to the production CSV at a controlled pace and made highly available again in the cluster. Depending on your storage array’s capabilities regarding IOPS and latency, this can work fine.
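A minimal sketch of that check, assuming hypothetical tier 1 capacity, free space and restore-set figures (substitute your array’s real numbers):

```python
# Quick check: will a mass restore overflow the array's tier 1 / cache layer?
# All sizes are hypothetical; plug in your array's real capacity and free space.

tier1_capacity_tb = 20.0   # usable tier 1 / flash tier capacity
tier1_free_tb = 6.0        # free tier 1 space before the restore starts
restore_set_tb = 9.0       # total size of the VMs you plan to restore

if restore_set_tb > tier1_free_tb:
    overflow_tb = restore_set_tb - tier1_free_tb
    print(f"Restore would overflow tier 1 by ~{overflow_tb:.1f} TB; "
          "writes will spill to the capacity tier and impact production workloads.")
    print("Consider restoring to a separate LUN with a different storage profile "
          "and storage live migrating back at a controlled pace.")
else:
    print("Restore fits within the current tier 1 headroom.")
```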

Restore to Hyper-V production hosts with local SSD/NVMe disks

Another approach that is cost-effective and very efficient is to have a Hyper-V node that has some local SSD or NVMe storage. The size depends on how many virtual machines you want to restore in a given time frame and how large those VMs are. But the quality doesn’t need to be premium as I hope you’re not doing restores all of the time to your Hyper-V hosts. This means it isn’t that expensive to set up. You could have one SSD in every cluster node, in just one or in a couple. The more you have, the smaller (cheaper) the SSD/NVMe can be and the more the workload is spread across different hosts. The instantly recovered virtual machines can be moved to the normal production CSVs leveraging storage live migration at a more relaxed pace and be made highly available again.
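Here is a rough sizing sketch for that local scratch space; the VM counts, average size and headroom factor are assumptions for illustration only:

```python
# Sizing sketch: how much local SSD/NVMe scratch space per cluster node?
# VM counts and sizes are assumptions for illustration only.

vms_to_restore = 20        # mission-critical VMs you want recoverable at once
avg_vm_size_gb = 150       # average provisioned size of those VMs
nodes_with_scratch = 4     # cluster nodes that get a local SSD/NVMe
headroom = 1.25            # 25% extra for growth and concurrent write activity

total_gb = vms_to_restore * avg_vm_size_gb * headroom
per_node_gb = total_gb / nodes_with_scratch

print(f"Total scratch space needed: ~{total_gb:.0f} GB")
print(f"Per node (spread over {nodes_with_scratch} nodes): ~{per_node_gb:.0f} GB")
# The more nodes share the load, the smaller (and cheaper) each SSD/NVMe can be.
```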

A possible design. You can mix and match the options discussed above to achieve a solution depending on your needs and environment.

Restore to dedicated Hyper-V restore hosts with local SSD/NVMe disks

Instead of having some fast, local storage in the Hyper-V hosts themselves like above, you can also opt to use one or more separate (dedicated) restore hosts for recovery. This avoids any resource impact on the production Hyper-V hosts in the cluster. In this case, you really might want to consider some NVMe disks to ingest all the restores elegantly. Testing will show how well such a restore host can scale up (CPU, network, storage).

 

If you need more, you can scale out. In this case, you’ll leverage Shared Nothing Live Migration after the restore to get the virtual machines back to the production nodes. This means setting up additional security configurations to allow for it. On the network side of things, you can leverage your SMB Multichannel and SMB Direct capable CSV / Live Migration / S2D Hyper-V networks for this. Please note that storage live migration is not the fastest process; it will take a while to get the restored virtual machines moved back to the CSVs. The good news is that they are up and running during the process.
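As a rough illustration of how long that drain back to the CSVs can take (the throughput, data volume and concurrency figures are assumptions, not measured values):

```python
# Rough estimate: how long until restored VMs are migrated back to the production CSVs?
# Throughput and data volume are illustrative; measure your own live migration rates.

restored_data_tb = 5.0       # total data sitting on the restore host(s)
migration_gbps = 4.0         # sustained shared-nothing / storage live migration rate (Gbit/s)
concurrent_migrations = 2    # how many moves you allow in parallel

tb_per_hour = migration_gbps / 8 * 3600 / 1000 * concurrent_migrations
hours = restored_data_tb / tb_per_hour

print(f"Effective drain rate: ~{tb_per_hour:.2f} TB/hour")
print(f"Estimated time to move everything back: ~{hours:.1f} hours")
# The VMs stay up and running during this window; it only affects when they are
# highly available again on the cluster's shared storage.
```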

Conclusion

Depending on where the bottleneck is in your environment (source, fabric, target), you’ll have to decide which options to choose based on your needs and economics. You can do this perhaps only for a subset of VMs that are important to the business, or for customers that are willing to pay for such a service, and/or as a way to differentiate yourself from the competition.

No matter what design you end up with, you can achieve your primary goal: very fast virtual machine restores that get the customer or your services back up and running as soon as possible. When that is achieved, you can storage or Shared Nothing Live Migrate the virtual machines back to the redundant, highly available storage of the cluster at a more relaxed pace to make sure the workloads are highly available again. The last thing to do then is to make sure those virtual machines are still, or again, protected by Veeam, just in case the day arrives when we need to restore again.

While the Veeam Backup & Replication resource scheduling will optimize the use of resources as best it can, you can help by providing adequate resources to make sure the process goes smoothly and quickly.

This article has given you some ideas on how to deliver super-fast restores to your customers or business units. To what extent you implement such a solution will depend on the economics of your wants and needs; that’s an exercise I leave to the readers for their environment. Remember that a small setup like this can deliver great results for a mission-critical subset of virtual machines without breaking the bank, while a larger scale-out design can deliver top-notch SLAs for larger environments. Whether you need it or not, if you think you do, I hope you found this useful when thinking about how to tackle the challenge.

Afterword

If you happen to run any of these storage systems, you can also leverage an advanced integration with Veeam for highly efficient restores of guest OS files, application items and entire VMs from storage snapshots with the help of Veeam Explorer for Storage Snapshots.




Instant VM Recovery considerations for modern data center

Source: Veeam

Introduction

If you haven’t heard of Instant VM Recovery, you need to go and read up on it here. Veeam describes it as follows:

“With Instant VM Recovery, you can immediately restore a VM into your production environment by running it directly from the compressed and deduplicated backup file. Instant VM Recovery helps improve recovery time objectives, minimize disruption and downtime of production VMs.”
 

 
A lot of discussion and thought goes into designing backup solutions for adequate capacity and performance at an affordable cost. Normally the focus is on the backups, but we also need to think about the restores. The faster the restore, the less downtime and economic loss. Instant VM Recovery is there to achieve the fastest possible restore time. This works great, but at scale you need to worry about the performance of the virtual machines you made available so quickly, and about how the operations involved impact your environment. We’ll discuss some key design considerations to make Instant VM Recovery shine.

Next to Instant VM Recovery, many of the design points here will also benefit and optimize “normal” backups and restores. But when time and scalability are the most important factors during restores, Instant VM Recovery is a great feature. The benefit of speed when getting a service up and running is clear. When you do this for only one or a couple of VMs, knowing this option exists might be all you should care about. But when you have different external and/or internal customers with hundreds or even thousands of VMs, things change a bit. Consider the case where you have a subset of virtual machines so important that your recovery time objective becomes mission-critical. You can have all the high availability and redundancy you want, but no mission-critical service should exist without a plan to restore it as fast as possible when things go south.

What if you would like or need to restore multiple virtual machines, dozens or more, simultaneously?  How do you ensure that the performance of those virtual machines that you got available so fast is adequate and that you can handle the required number of concurrent restores within a certain time frame? On top of that, can you do this without causing too big of a negative impact on the workloads that are still running or that are being restored at the same time?

Optimizing versus overdesigning

I have designed a couple of smaller solutions leveraging Instant VM Recovery for a few mission-critical services. The number of VMs involved ranged from 6 to 30. I also helped come up with a larger-scale design for a broader capability to do so. That scenario was driven by the desire to reduce the time needed to recover from a wholesale disaster such as storage corruption (it does happen) or even a ransomware attack. Even when the backups themselves are not affected (different storage than the VMs) or not encrypted, so they don’t need to be recovered from an off-site/air-gapped system, restoring might just take too long. That could make paying the ransom the more economically feasible option, if that even works (yes, ransomware operations can also have SLA issues). The design aim was to deliver fast, parallel VM restores in combination with a known, established restore priority for all VMs in order to get up and running as fast as possible. All this at a lesser cost and in less time than paying for the decryption key after one major ransomware attack and decrypting the backups and/or workloads. It is that simple, but perhaps not that easily done. The biggest concern next to speed was to protect the disk-based backups from the ransomware. Hardening the repositories and protecting access (multi-factor authentication) is key here. I myself always like to have multiple options to recover data fast, like application-consistent SAN snapshots that are replicated across arrays, or air-gapped copies (e.g. tape or virtual tape libraries). Some organizations don’t have that capability, and for them it’s even more critical to make sure what they have is rock solid.

Optimizing is always about checks and balances, otherwise it becomes geeks indulging in overdesign. To be clear, I’m not stating or claiming you need to be able to restore all your VMs super-fast and without too much performance impact via Instant VM Recovery.  However, if you get your 20, 50, 100, … most critical VMs for mission-critical services back online this way, you’ll get your business moving again while you wait for the remainder of the services to come back online. What I have built has sometimes been called over the top, but I have seen too many cases where backups and restores are just a low priority and any solution will do as long as there is one. Normally, that goes well until restore time comes around.

Please note that you always have to look at your backup design and the placement of your Veeam components in multi-site environments when it comes to optimizing for backups and restores. In that respect, Instant VM Recovery is not magic.

Finally, I do not cover the dark moments you will face and need to overcome during a ransomware event, like your clusters not really playing well with encrypted resources. You need to stop the attack, or you’ll just be adding new files to encrypt into the environment. Those days are long, dark and far from easy.

Prerequisites to performance

The goal here is to restore multiple virtual machines as fast as possible and to have them run without significant performance loss or impact on other workloads. This requires:

  • Fast reading from the backup targets
  • A fast network fabric for data movement
  • A fast restore target (can be the backup source) to ingest all IO involved

That’s what we focus on here. In essence, this is quite simple: you need ample resources (compute, network, storage). Simple is nice, but is it easy to do? Sizing is difficult, but the options and technologies for optimization are not that different for normal versus Instant VM Recovery.

The faster your backup storage target is, the better the performance of the instantly recovered virtual machines can be, as data is read from there both to restore the VM and to run it. Your network needs to be able to handle the traffic elegantly; 10 Gbps (or better) is the way to go. Finally, the storage where your virtual machines are recovered to needs to be performant as well. For one, all the new IO is written there, so you want that storage to be able to handle it while data is being restored simultaneously from the backup target.
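Put differently, the end-to-end restore rate is capped by the slowest of the three. A minimal sketch, using assumed throughput figures rather than measured ones:

```python
# The end-to-end restore rate is bounded by the slowest link in the chain.
# These three figures are assumptions; benchmark your own environment.

backup_target_read_mbps = 1500   # MB/s the repository can sustain while jobs run
network_mbps = 1250              # MB/s usable on a 10 Gbps path (~1.25 GB/s)
restore_target_write_mbps = 900  # MB/s the restore storage can ingest

bottleneck = min(backup_target_read_mbps, network_mbps, restore_target_write_mbps)
print(f"Effective restore throughput is capped at ~{bottleneck} MB/s")
print("Scaling up the other two components will not help until this one improves.")
```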

When you’ve taken care of compute, network and storage in terms of performance of individual components (scale up), scale out comes into play. This is where you add multiple backup targets and restore targets for Instant VM Recovery to leverage, to be able to restore more virtual machines simultaneously. Let’s look at this in a bit more detail.

Backup target considerations for Instant VM Recovery

On the backup side, we try to have a solution where the most recent backups land on fast storage, offering great backup throughput. This gets expensive, so we need to offload older backups to a more cost-effective solution. Depending on the storage array, older backups can be tiered down to less expensive storage or copied to a lower-tier backup repository. There are options here with entry-level SANs or Storage Spaces Direct (S2D). Not all solutions provide shared storage, nor do they need to; that depends on the availability requirements for your backup targets. The goal here is to provide a cost-effective and efficient way to store your most recent backups on performant storage. That could be the first four backups of the day or the daily backups of the past two days, etc. Again, this depends on your needs. This can most certainly involve an SSD or even NVMe layer.
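A simple sizing sketch for such a fast landing zone; the protected data size, reduction ratio, change rate and retention on fast storage are all assumptions you should replace with your own numbers:

```python
# Sizing sketch for the "fast landing zone" of a backup repository.
# Backup sizes, change rate and retention on fast storage are assumptions.

protected_data_tb = 30.0   # size of the protected VM estate
full_ratio = 0.6           # compression/dedupe ratio applied to the full backup
daily_change_rate = 0.05   # 5% daily change -> incremental size
days_on_fast_tier = 2      # keep the two most recent restore points on fast storage

full_tb = protected_data_tb * full_ratio
incrementals_tb = protected_data_tb * daily_change_rate * days_on_fast_tier
fast_tier_tb = full_tb + incrementals_tb

print(f"Fast tier needed: ~{fast_tier_tb:.1f} TB "
      f"({full_tb:.1f} TB full + {incrementals_tb:.1f} TB of incrementals)")
# Older restore points can be tiered or copied down to cheaper capacity storage.
```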

The key point here is that the backups you’ll use in an Instant VM Recovery scenario are most probably the most recent ones, from the latest restore points. These reside on fast storage and as such give the best possible performance during the Instant VM Recovery process, especially when multiple instant recovery jobs are running and other backup jobs are still active. Data is being read for virtual machine IO as the VM is “instantly” available (disk mount), but data is also being read to recover the VM (data restore), all while other backup jobs might be writing to that target.

Let’s look at some examples. Depending on the scale and budget, you have different options. We’ll look at three of them. Whatever works for you will do, and there are variants on these as well as other options out there.

Example 1

Buy a decent entry-level SMB/SME SAN (that doesn’t have to break the bank any more) with configurable tiering. Have a lower-capacity tier 1 storage layer for the backups to land on and set a storage progression policy that moves older data down to a higher-capacity tier 2 storage layer. You can build both highly available and non-highly available backup repositories with this. As long as the IOPS and latency can follow, you can add repositories to the SAN. If not, you can add more SANs and scale out. As a rule, try to avoid having the same storage array type for your workloads as for your backups: firmware bugs that can potentially lead to data corruption do exist, and you want to minimize your risk.

Example 2

Deploy Storage Spaces Direct to benefit from high availability and multiple target servers with ReFS Multi-Resilient Volumes (MRV), providing protection and mirror-accelerated parity. You can size and tweak it so it holds “hot” (recently written) data in an SSD mirror for a while before moving “cold” data (data that wasn’t accessed within the tiering threshold) to the less expensive capacity tier. This has scale-up and scale-out capabilities.
 

Example 3

Build a tiered backup solution, perhaps only for backups of those VMs that require the fastest possible backups and restores. This could involve a couple of 2 TB SSD/NVMe drives with short-retention backup jobs, and have those backups copied to cheaper, long-term archival backup targets. Those can be on the same repository host(s) or on different ones. You can leverage Veeam Backup Copy jobs to create a tiered backup repository within the same backup repository or between different repositories.
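A small fit check for this example; the VM data size, reduction ratio, change rate and retention figures are hypothetical:

```python
# Fit check: do short-retention backups of the critical VMs fit on a couple of
# 2 TB SSD/NVMe drives? All figures are hypothetical; use your own numbers.

ssd_count = 2
ssd_capacity_tb = 2.0
usable_tb = ssd_count * ssd_capacity_tb * 0.9   # keep ~10% free

critical_vm_data_tb = 1.5   # source data of the VMs in the short-retention job
reduction = 0.55            # compression/deduplication ratio
restore_points = 3          # short retention on the fast repository
change_rate = 0.05          # daily change rate driving incremental size

needed_tb = (critical_vm_data_tb * reduction
             + critical_vm_data_tb * change_rate * (restore_points - 1))

print(f"Usable fast repository: {usable_tb:.1f} TB, needed: ~{needed_tb:.2f} TB")
print("Fits" if needed_tb <= usable_tb else "Does not fit - add drives or shorten retention")
# Older restore points are handed off to a cheaper repository via backup copy jobs.
```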
 


A tiered backup repository example within the same repository for the fast, most recent backups (less storage capacity) and the older backups (more storage capacity).

A tiered backup repository example with different repositories for the fast, most recent backups (less storage capacity) and the older backups (more storage capacity).

 
Normally, these solutions are not highly available, but you can add some protection against storage failure in the usual ways.

Note: You will be hammering that “tier 1” in any solution, so make sure you use write-intensive models. If you have an AFA with 60 SSDs for virtualization workloads, you can get away with MLC as the IO is distributed over all the disks, but in the case of the backup target here, you are hammering a small set of disks continuously. So, design accordingly.
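A quick endurance sanity check along those lines; the daily ingest and drive specifications below are assumptions, so check your vendor’s DWPD rating for the actual model:

```python
# Endurance sanity check for the small "tier 1" landing zone that gets hammered daily.
# Daily ingest and drive specs are assumptions; check the vendor's DWPD rating.

daily_ingest_tb = 6.0      # backup data written to the landing zone per day
drives = 2
drive_capacity_tb = 2.0
rated_dwpd = 1.0           # drive writes per day the model is rated for over its warranty

writes_per_drive_tb = daily_ingest_tb / drives
actual_dwpd = writes_per_drive_tb / drive_capacity_tb

print(f"Each drive absorbs ~{writes_per_drive_tb:.1f} TB/day "
      f"(~{actual_dwpd:.2f} DWPD vs. {rated_dwpd} DWPD rated)")
if actual_dwpd > rated_dwpd:
    print("Pick write-intensive (higher DWPD) models or spread the load over more drives.")
```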

In the next article, we’ll continue with the discussion with network and restore target considerations, so stay tuned!

