VM Lifecycle Management

Why This Matters

The previous chapters covered hypervisor internals, the VMware baseline, and KubeVirt's architecture. This chapter moves from "how does the engine work" to "how do you operate virtual machines day-to-day at scale." Every topic here -- live migration, snapshots, templates, resource governance, placement rules, hot-add, GPU assignment -- maps to an operational workflow that the current VMware team executes routinely for 5,000+ VMs. If any replacement platform cannot replicate, improve on, or explicitly retire these workflows, the migration will stall.

Live migration is the single most scrutinized capability in any VMware replacement evaluation. vMotion is the foundational operation behind zero-downtime maintenance, DRS load balancing, and hardware lifecycle management. The team performs hundreds of vMotion operations per week -- many automatically via DRS, others manually for planned maintenance windows. Any replacement must demonstrate that live migration works reliably at scale, completes within acceptable time windows, and does not impose restrictions that make routine node maintenance painful.

Snapshots and clones are the foundation of backup integration, rapid provisioning, and development workflows. Templates and cloud-init drive the provisioning pipeline. Resource pools and quotas are how the organization enforces multi-tenancy and cost governance across business units. Affinity rules keep database clusters separated across failure domains and keep latency-sensitive VM pairs co-located. Hot-add lets the team respond to capacity spikes without scheduling downtime. GPU passthrough enables a growing set of AI/ML inference and VDI workloads.

Each section compares the VMware baseline with the KubeVirt/OVE approach, the Hyper-V/Azure Local approach, and notes implications for Swisscom ESC. The comparison tables at the end provide a summary, but the real evaluation depth is in the sections themselves.


Concepts

1. Live Migration (vMotion Equivalent)

Why Live Migration Is Non-Negotiable

Live migration is not a convenience feature -- it is a structural dependency of modern infrastructure operations.

At 5,000+ VMs, the organization performs planned maintenance on hypervisor hosts weekly. Without live migration that works reliably and completes within minutes, the operational overhead of maintenance windows alone would make a platform non-viable.

Migration Algorithms: Pre-Copy vs. Post-Copy

All modern live migration implementations use one of two fundamental algorithms (or a hybrid of both):

  Pre-Copy Live Migration Algorithm

  Source Host                              Target Host
  +-----------------+                      +-----------------+
  | VM running      |                      |                 |
  | Memory: 64 GB   |                      |                 |
  +-----------------+                      +-----------------+

  Phase 1: Pre-copy iteration 0 (bulk transfer)
  ====================================================
  | Copy ALL memory |  ---- 64 GB ------>  | Receives pages  |
  | pages to target |                      | into staging    |
  | VM continues    |                      | memory          |
  | running,        |                      |                 |
  | dirtying pages  |                      |                 |

  Phase 2: Pre-copy iteration 1 (dirty page resend)
  ====================================================
  | Track dirty     |  ---- 2 GB ------> | Overwrites dirty|
  | pages since     |                      | pages           |
  | iteration 0     |                      |                 |
  | Send dirty pages|                      |                 |

  Phase 3: Pre-copy iteration 2 (converging)
  ====================================================
  | Fewer dirty     |  ---- 200 MB ----> | Overwrites dirty|
  | pages this round|                      | pages           |

  Phase N: Final iteration (switchover)
  ====================================================
  | PAUSE VM        |  ---- 5 MB -------> | Apply last      |
  | Send remaining  |                      | dirty pages     |
  | dirty pages     |                      | RESUME VM       |
  | Redirect clients|                      | VM now runs here|
  +-----------------+                      +-----------------+
        |                                        |
        v                                        v
   VM deleted from                          VM running on
   source after                             target
   confirmation

Pre-copy transfers memory iteratively while the VM continues running on the source. Each iteration sends only the pages that were dirtied since the previous iteration. The dirty set shrinks with each pass (ideally). When the dirty set is small enough that it can be transferred in less than the maximum tolerable downtime (typically <500 ms), the VM is paused, the final dirty pages are sent, and execution resumes on the target.

The convergence problem: If the VM's dirty page rate exceeds the migration bandwidth, the dirty set never shrinks. A VM running an in-memory database that touches 50 GB/s of memory cannot converge over a 10 Gbps migration link (which transfers ~1.2 GB/s). Strategies for convergence include auto-converge (progressively throttling the source vCPUs to reduce the dirty rate), dedicating more bandwidth to the migration network, and falling back to post-copy.

Post-copy takes the opposite approach: the VM is paused on the source, minimal state (CPU registers, device state) is transferred to the target, and the VM resumes on the target immediately -- before most memory has been copied. Pages are faulted in on demand: when the VM on the target accesses a page that has not yet been transferred, it triggers a network page fault, and the hypervisor fetches the page from the source.

  Post-Copy Live Migration

  Source Host                              Target Host
  +-----------------+                      +-----------------+
  | VM running      |                      |                 |
  | Memory: 64 GB   |                      |                 |
  +-----------------+                      +-----------------+

  Phase 1: Pause and transfer state
  ====================================================
  | PAUSE VM        |  -- CPU state, -->   | RESUME VM       |
  | Send CPU regs,  |  -- device state --> | immediately     |
  | device state    |  -- (small, ~MB) --> | with NO memory  |
  |                 |                      | pages local     |

  Phase 2: Background push + demand paging
  ====================================================
  | Push remaining  |  ---- bulk -------> | VM runs, faults |
  | pages in        |                      | on missing pages|
  | background      |  <-- page request -- | Requests pages  |
  |                 |  -- page data ----> | on demand        |

  Phase 3: Complete
  ====================================================
  | All pages sent  |                      | All pages local |
  | Source freed    |                      | Migration done  |
  +-----------------+                      +-----------------+

Post-copy advantages: Guaranteed convergence. The VM runs on the target from the start, so the total migration time is bounded regardless of dirty page rate. Total downtime for the switchover is typically <100 ms.

Post-copy disadvantages: The VM on the target runs with degraded performance until all pages are faulted in -- every page miss costs a network round-trip. If the source host crashes mid-migration, the VM is lost (pages still on the source are unrecoverable). That makes post-copy most dangerous in exactly the scenario -- an unhealthy source host -- that often motivates the migration in the first place.

QEMU/KVM supports both pre-copy and post-copy, and a hybrid mode where pre-copy runs first, and if it fails to converge, the system switches to post-copy.

VMware vMotion uses pre-copy exclusively (as of vSphere 8.0). VMware has historically not implemented post-copy because the risk of VM loss during migration is considered unacceptable for enterprise workloads.

Hyper-V Live Migration uses pre-copy with SMB-based storage migration for non-shared-storage scenarios.

KubeVirt Live Migration: How It Works

KubeVirt's live migration is orchestrated through Kubernetes primitives. The migration creates a new pod on the target node, migrates the QEMU process's state between the source and target pods, and then deletes the source pod. This is fundamentally different from vMotion, where both source and target are managed by agents on ESXi hosts -- in KubeVirt, the migration is a Kubernetes-native pod lifecycle event.

  KubeVirt Live Migration Flow

  Step 1: Migration initiated
  ===============================================================
  User creates VirtualMachineInstanceMigration (VMIM) CR
  (or node drain triggers automatic migration)

  $ kubectl create -f migration.yaml
    OR
  $ virtctl migrate my-vm
    OR
  $ kubectl drain node-1 --delete-emptydir-data

       |
       v
  virt-controller detects VMIM for VMI "my-vm"
       |
       v
  Step 2: Target pod creation
  ===============================================================
  virt-controller creates a NEW virt-launcher pod on the
  target node with the same spec as the source pod:
    - Same CPU/memory requests and limits
    - Same PVC mounts (shared storage required for RWX,
      or storage migration for RWO)
    - Same network configuration
    - Migration-specific annotations

  +--------------------+              +--------------------+
  | Source Node         |              | Target Node        |
  | +----------------+ |              | +----------------+ |
  | | virt-launcher  | |              | | virt-launcher  | |
  | | Pod (running)  | |              | | Pod (pending)  | |
  | |                | |              | |                | |
  | | QEMU/KVM      | |              | | QEMU/KVM      | |
  | | (VM active)   | |              | | (waiting for   | |
  | |                | |              | |  incoming      | |
  | |                | |              | |  migration)    | |
  | +----------------+ |              | +----------------+ |
  +--------------------+              +--------------------+

  Step 3: Migration handshake
  ===============================================================
  virt-handler on target node:
    - Prepares QEMU on target to receive migration
    - Opens a TCP port for incoming migration data
    - Signals readiness to virt-controller

  virt-handler on source node:
    - Receives migration target address
    - Instructs libvirtd to initiate migration to target

  Step 4: Memory transfer (pre-copy iterations)
  ===============================================================
  +--------------------+              +--------------------+
  | Source Node         |              | Target Node        |
  | +----------------+ |              | +----------------+ |
  | | QEMU           | |  migration   | | QEMU           | |
  | | (VM running,   | | ==========>  | | (receiving     | |
  | |  sending pages)| | TCP stream   | |  memory pages) | |
  | |                | | (port 49152) | |                | |
  | +----------------+ |              | +----------------+ |
  +--------------------+              +--------------------+

  - Pre-copy iterations: bulk copy, dirty page resends
  - Auto-converge throttles vCPUs if dirty rate too high
  - Progress reported to VMIM status (percentage, bandwidth)

  Step 5: Switchover
  ===============================================================
  - Source QEMU pauses VM
  - Final dirty pages + CPU state + device state sent
  - Target QEMU resumes VM
  - Typical pause: 50-200 ms for well-converging VMs

  Step 6: Cleanup
  ===============================================================
  - virt-controller updates VMI to point to target node
  - Source virt-launcher pod is terminated
  - VMIM status set to Succeeded
  - If migration fails: source VM continues running,
    target pod is cleaned up, VMIM status set to Failed

  +--------------------+              +--------------------+
  | Source Node         |              | Target Node        |
  |                    |              | +----------------+ |
  | (pod deleted)      |              | | virt-launcher  | |
  |                    |              | | Pod (running)  | |
  |                    |              | |                | |
  |                    |              | | QEMU/KVM      | |
  |                    |              | | (VM active)   | |
  |                    |              | +----------------+ |
  +--------------------+              +--------------------+
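
Migration progress is observable throughout this flow on the VMIM object itself. A minimal sketch of the day-two commands (vmim is the registered short name; the migration name is illustrative):

# Watch all migrations cluster-wide and their phases
kubectl get virtualmachineinstancemigrations -A -w

# Inspect phase, target node, and progress details for one migration
kubectl describe vmim my-vm-migration -n production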

Migration Policies in KubeVirt

KubeVirt provides MigrationPolicy as a cluster-scoped CRD that governs migration behavior. This is the equivalent of vMotion network settings and resource controls in vSphere.

apiVersion: migrations.kubevirt.io/v1alpha1
kind: MigrationPolicy
metadata:
  name: production-migration-policy
spec:
  # Bandwidth limit per migration (prevents saturating the network)
  bandwidthPerMigration: 1Gi

  # Throttle source vCPUs when the dirty page rate prevents convergence
  # (concurrency caps such as parallelOutboundMigrationsPerNode live in
  #  the KubeVirt CR, not in MigrationPolicy)
  allowAutoConverge: true

  # Timeout for the migration to complete before it is cancelled
  completionTimeoutPerGiB: 150

  # Allow post-copy migration as a fallback if pre-copy does not converge
  allowPostCopy: false

  # Selector: which VMIs this policy applies to
  # Uses label selectors on both VMI and namespace
  selectors:
    virtualMachineInstanceSelector:
      matchLabels:
        migration-policy: production
    namespaceSelector:
      matchLabels:
        environment: production

Key MigrationPolicy parameters:

| Parameter | Purpose | Default |
|---|---|---|
| bandwidthPerMigration | Cap migration throughput per VM to avoid saturating the network. | Unlimited |
| allowAutoConverge | Throttle source vCPUs if the dirty page rate prevents convergence. | false |
| completionTimeoutPerGiB | Seconds allowed per GiB of VM memory before the migration is cancelled. A 64 GB VM with a value of 150 gets 64 * 150 = 9600 seconds (~2.7 hours). | 150 |
| allowPostCopy | Allow fallback to post-copy if pre-copy fails to converge. Risky -- if the source host fails during post-copy, the VM is lost. | false |

When no MigrationPolicy matches a VMI, KubeVirt falls back to the cluster-wide defaults configured in the KubeVirt CR under spec.configuration.migrations.

Network Requirements for Live Migration

Live migration transfers tens of gigabytes of data between hosts. The migration network must be sized appropriately:

| VM Memory | 10 Gbps (~1.2 GB/s) | 25 Gbps (~3.1 GB/s) | 100 Gbps (~12.5 GB/s) |
|---|---|---|---|
| 8 GB | ~7 seconds | ~3 seconds | <1 second |
| 32 GB | ~27 seconds | ~10 seconds | ~3 seconds |
| 64 GB | ~54 seconds | ~21 seconds | ~5 seconds |
| 256 GB | ~213 seconds | ~82 seconds | ~20 seconds |
| 1 TB | ~853 seconds | ~330 seconds | ~82 seconds |

These are theoretical minimums for a single pre-copy pass with zero dirty pages. Real-world times are 2-5x higher due to dirty page resends, convergence delays, and protocol overhead.

Best practice: Dedicate a VLAN and physical NICs (or SR-IOV virtual functions) for migration traffic. In KubeVirt, this means defining a dedicated NetworkAttachmentDefinition for migration and configuring it in the KubeVirt CR:

apiVersion: kubevirt.io/v1
kind: KubeVirt
metadata:
  name: kubevirt
  namespace: openshift-cnv
spec:
  configuration:
    migrations:
      network: migration-network  # Name of the NAD for migration traffic
      parallelMigrationsPerCluster: 20
      parallelOutboundMigrationsPerNode: 5
      bandwidthPerMigration: 1Gi
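
The referenced NetworkAttachmentDefinition might look like the following sketch. The CNI plugin choice (macvlan), the uplink interface name (ens2f1), and the IP range are placeholder assumptions; KubeVirt expects the NAD to live in its own install namespace:

apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
  name: migration-network
  namespace: openshift-cnv
spec:
  config: |
    {
      "cniVersion": "0.3.1",
      "type": "macvlan",
      "master": "ens2f1",
      "mode": "bridge",
      "ipam": {
        "type": "whereabouts",
        "range": "10.200.5.0/24"
      }
    }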

Hyper-V Live Migration

Hyper-V provides two migration mechanisms:

Live Migration transfers a running VM between Hyper-V hosts with minimal downtime. It uses the same pre-copy algorithm as KVM/QEMU -- iterative memory copy with dirty page tracking. Hyper-V's implementation transfers memory pages via a TCP connection between the source and target hosts. For VMs using shared storage (Cluster Shared Volumes, SMB shares, or SAN LUNs), only memory and CPU state must be transferred. For VMs on local storage, Hyper-V can simultaneously migrate the storage (Storage Live Migration) by mirroring writes to both source and target.

Quick Migration is an older mechanism that is simpler but not zero-downtime: the VM is saved to disk (all memory written to a save-state file), the VM definition is moved to the target host, and the VM is restored from the save-state file. Downtime equals the time to write and then read the memory state from storage. For a 64 GB VM on fast shared storage, this is typically 30-90 seconds. Quick Migration survives as a fallback when Live Migration fails or is not configured.

Azure Local specifics: Azure Local clusters use Cluster Shared Volumes (CSV) backed by Storage Spaces Direct (S2D). Live Migration within an Azure Local cluster operates over a dedicated migration network. Azure Local also supports planned VM mobility across Azure Local instances in different locations, but this requires Azure Arc orchestration and is not transparent live migration -- it involves VM shutdown and restart.

Comparison to VMware vMotion

| Aspect | VMware vMotion | KubeVirt Live Migration | Hyper-V Live Migration |
|---|---|---|---|
| Algorithm | Pre-copy only | Pre-copy (default), post-copy (optional), auto-converge | Pre-copy (Live), save/restore (Quick) |
| Shared storage required? | Yes (or use Storage vMotion for combined migration) | Yes for RWX volumes; RWO volumes require storage migration via CDI | No -- Storage Live Migration supported |
| Concurrent migrations per host | Default 4 outbound, 4 inbound (configurable to 8) | Configurable per cluster (default 2 outbound/node) | Default 2 simultaneous migrations |
| Migration network | Dedicated vmknic on a vMotion-tagged port group | Dedicated NetworkAttachmentDefinition (Multus) | Dedicated SMB or TCP network |
| Encryption in transit | Supported (AES-256 since vSphere 6.5) | TLS-encrypted by default between virt-handler instances | IPsec or SMB encryption |
| Maximum VM size | No hard limit (tested with multi-TB VMs) | No hard limit (but convergence at >256 GB requires tuning) | No hard limit |
| Cross-cluster migration | Requires Enhanced vMotion Compatibility (EVC) for CPU compatibility | Same requirement: CPU model must be compatible (use cpu-model: host-model) | Processor compatibility mode |
| Downtime (typical) | <1 second for well-converging VMs | 50-500 ms (comparable to vMotion) | <1 second for Live, 30-90 s for Quick |
| Automated triggering | DRS (every 5 minutes) | No built-in DRS equivalent; Descheduler is available but less mature | No built-in DRS equivalent |

What is the same: The core algorithm (pre-copy with iterative dirty page convergence) is identical across all three. The physics are the same -- migration time is governed by VM memory size, dirty page rate, and available bandwidth.

What is different: KubeVirt wraps migration in Kubernetes pod lifecycle semantics. A KubeVirt live migration is a pod-to-pod event: a new pod is created on the target, state is transferred, the old pod is deleted. This means Kubernetes scheduling rules (affinity, taints, tolerations, resource requests) all apply to migration target selection. In vMotion, target host selection is done by DRS or manually by an admin using vCenter's host picker.

What is worse: KubeVirt does not have a DRS equivalent. There is no built-in controller that continuously monitors cluster resource utilization and automatically migrates VMs to balance load. The Kubernetes Descheduler project can evict pods (including virt-launcher pods) from overloaded nodes, which triggers re-scheduling and migration, but it is less sophisticated than DRS. It does not compute migration cost vs. benefit, it does not avoid thrashing, and it does not coordinate multiple migrations to achieve a target cluster balance. This is a significant gap for an organization accustomed to fully automated DRS.

What is better: KubeVirt's node drain (kubectl drain) is arguably more predictable than vMotion-based maintenance mode. When you drain a node, Kubernetes systematically evicts all pods (including VMs) according to PodDisruptionBudgets, which lets you guarantee that at most N VMs from a given set are simultaneously migrating. vSphere maintenance mode moves VMs in parallel but does not have a PDB equivalent -- it relies on DRS rules and administrator judgment.

Migration at Scale: Node Drain and Rolling Updates

For an organization running 50+ VMs per node across 100+ nodes, the primary live migration scenario is node drain for maintenance. The workflow:

  Node Drain for Maintenance (KubeVirt)

  Cluster: 100 worker nodes, ~5000 VMs (50 VMs/node)
  Goal: Patch node-42 (kernel update, firmware update)

  Step 1: Cordon node-42 (prevent new VM scheduling)
  $ kubectl cordon node-42
  -> Node marked as unschedulable
  -> Existing VMs continue running

  Step 2: Drain node-42 (migrate VMs away)
  $ kubectl drain node-42 \
      --delete-emptydir-data \
      --ignore-daemonsets \
      --pod-selector=kubevirt.io/domain \
      --timeout=3600s

  -> Kubernetes evicts virt-launcher pods one at a time
     (or N at a time, governed by PodDisruptionBudgets)
  -> Each eviction triggers a live migration:
     1. virt-controller creates new virt-launcher pod on another node
     2. QEMU pre-copy migration transfers memory
     3. Switchover occurs
     4. Old pod is deleted
  -> 50 VMs x ~60 seconds average = ~50 minutes total drain time
     (with 5 concurrent migrations: ~10 minutes)

  Step 3: Perform maintenance
  -> SSH into node-42, apply patches, reboot
  -> 5-10 minutes

  Step 4: Uncordon node-42
  $ kubectl uncordon node-42
  -> Node is schedulable again
  -> No automatic VM rebalancing (no DRS)
  -> VMs stay where they were migrated to
  -> Manual rebalancing or Descheduler needed

PodDisruptionBudgets (PDBs) are the mechanism for controlling migration parallelism. For a set of VMs that should never all be down at once (e.g., a 3-node database cluster), a PDB ensures that at most one VM from the set is migrating at any time:

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: database-cluster-pdb
  namespace: database-tier
spec:
  minAvailable: 2    # At least 2 of 3 must be running at all times
  selector:
    matchLabels:
      app: oracle-rac
      tier: production

2. VM Snapshots & Clones

Snapshot Mechanisms Compared

A VM snapshot captures the state of a VM's disks (and optionally memory and device state) at a point in time, enabling rollback. The mechanism differs fundamentally across platforms.

VMware Snapshots (VMDK): VMware snapshots use a redo-log / delta-disk approach. When a snapshot is taken, the base VMDK becomes read-only, and a new delta file (a "child" VMDK with a -delta.vmdk suffix) captures all subsequent writes. Each additional snapshot adds another layer in the chain.

  VMware Snapshot Chain

  +------------------+    +------------------+    +------------------+
  | Base VMDK        |    | Snapshot 1       |    | Snapshot 2       |
  | (flat disk,      |<---| Delta VMDK       |<---| Delta VMDK       |
  |  read-only after |    | (writes since    |    | (writes since    |
  |  first snapshot) |    |  base, now also  |    |  snap 1)         |
  |                  |    |  read-only)      |    |  <-- ACTIVE       |
  | 100 GB           |    | 5 GB             |    | 2 GB             |
  +------------------+    +------------------+    +------------------+

  Read path for VM:
  - Read block X:
    1. Check Snapshot 2 delta -> block not here
    2. Check Snapshot 1 delta -> block not here
    3. Read from Base VMDK -> found

  Write path for VM:
  - Write block Y:
    1. Write to Snapshot 2 delta (active disk)

  Revert to Snapshot 1:
  - Delete Snapshot 2 delta
  - Snapshot 1 delta becomes active

  Delete Snapshot 1 (consolidation):
  - Merge Snapshot 1 delta into Base VMDK
  - I/O intensive operation, can stun VM briefly

Performance degradation: Each snapshot layer adds a lookup to the read path. With 5+ snapshots, read I/O latency increases measurably because the storage subsystem must check each delta file before falling through to the base. Snapshot consolidation (merging deltas back into the base) is an I/O-intensive operation that can cause VM stuns (brief freezes of 1-5 seconds) during the final commit phase.

VMware best practice: do not leave snapshots running for more than 24-72 hours. Snapshots are not backups -- they are temporary state for specific operations (patch testing, upgrade rollback).

KubeVirt Snapshots (VolumeSnapshot via CSI): KubeVirt delegates VM snapshots to the Kubernetes VolumeSnapshot API, which in turn calls the CSI (Container Storage Interface) driver's snapshot capability. The snapshot is taken at the storage layer, not at the hypervisor layer.

apiVersion: snapshot.kubevirt.io/v1beta1
kind: VirtualMachineSnapshot
metadata:
  name: my-vm-snap-before-patch
  namespace: production
spec:
  source:
    apiGroup: kubevirt.io
    kind: VirtualMachine
    name: my-database-vm

When this resource is created:

  1. KubeVirt's snapshot controller freezes the VM's filesystems (via the QEMU Guest Agent, if installed) to achieve application consistency.
  2. For each PVC attached to the VM, the controller creates a VolumeSnapshot object.
  3. The CSI driver creates a point-in-time copy of the underlying storage volume (mechanism depends on the storage backend -- thin-provisioned clone, copy-on-write snapshot, or hardware array snapshot).
  4. A VirtualMachineSnapshotContent object is created to track the relationship between the VM snapshot and the underlying volume snapshots.
  5. Filesystem thaw is issued to the guest.

  KubeVirt Snapshot Architecture

  +-------------------+
  | VirtualMachine-   |     Creates one VolumeSnapshot
  | Snapshot           |     per PVC attached to the VM
  | (KubeVirt CR)     |
  +-------------------+
         |
         v
  +-------------------+     +-------------------+
  | VolumeSnapshot    |     | VolumeSnapshot    |
  | (for PVC-boot)    |     | (for PVC-data)    |
  | (Kubernetes CSI)  |     | (Kubernetes CSI)  |
  +-------------------+     +-------------------+
         |                         |
         v                         v
  +-------------------+     +-------------------+
  | CSI Driver        |     | CSI Driver        |
  | creates storage-  |     | creates storage-  |
  | level snapshot    |     | level snapshot    |
  | (e.g., Ceph RBD   |     | (e.g., Ceph RBD   |
  |  snapshot, LVM    |     |  snapshot, LVM    |
  |  thin snapshot)   |     |  thin snapshot)   |
  +-------------------+     +-------------------+
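
Rolling back is the mirror operation, driven by a VirtualMachineRestore object that references the snapshot. A minimal sketch, reusing the snapshot above (the target VM must be stopped for the restore to proceed):

apiVersion: snapshot.kubevirt.io/v1beta1
kind: VirtualMachineRestore
metadata:
  name: restore-before-patch
  namespace: production
spec:
  target:
    apiGroup: kubevirt.io
    kind: VirtualMachine
    name: my-database-vm
  virtualMachineSnapshotName: my-vm-snap-before-patch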

Hyper-V Checkpoints: Hyper-V uses the term "checkpoint" (previously "snapshot"). Two types exist: standard checkpoints, which capture the VM's state including memory and are intended for dev/test, and production checkpoints (the default since Windows Server 2016), which use VSS inside Windows guests or filesystem freezing inside Linux guests to take an application-consistent, data-only snapshot.

Application-Consistent vs. Crash-Consistent Snapshots

This distinction is critical for database workloads:

| Type | Mechanism | Data Integrity | Use Case |
|---|---|---|---|
| Crash-consistent | Point-in-time copy of disk blocks with no guest coordination. Equivalent to pulling the power cord. | Filesystem journal replay needed on restore. Database may need recovery. Possible loss of uncommitted transactions. | Quick snapshots when a guest agent is not available. |
| Application-consistent | Guest agent freezes I/O (Linux: fsfreeze, Windows: VSS). Pending writes are flushed to disk, the database quiesces, then the snapshot is taken. | Filesystem is clean. Database is in a consistent state. No recovery needed on restore. | Production databases, Exchange, SQL Server. |

KubeVirt achieves application consistency via the QEMU Guest Agent (qemu-ga). If the guest agent is installed and running, KubeVirt's snapshot controller issues guest-fsfreeze-freeze before the snapshot and guest-fsfreeze-thaw after. Without the guest agent, the snapshot is crash-consistent only.
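
Whether a snapshot will be application-consistent is discoverable before taking it: the VMI reports an AgentConnected condition when qemu-ga is reachable. A quick check (VM name as in the snapshot example above):

# "True" means the guest agent is up and fsfreeze will be attempted
kubectl get vmi my-database-vm -n production \
  -o jsonpath='{.status.conditions[?(@.type=="AgentConnected")].status}'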

Snapshot Chains and Performance Degradation

The performance impact of snapshot chains varies by platform:

VMware: Each snapshot layer is a separate delta VMDK file. Reads must traverse the chain from newest to oldest to find the correct block. With >3 snapshots, IOPS can degrade by 10-30%. VMware warns against running more than 32 snapshots (hard limit), but performance degradation becomes noticeable at 3-5 snapshots.

KubeVirt (Ceph RBD): Ceph RBD snapshots use copy-on-write at the RADOS object level. Read performance is not degraded by snapshots because snapshot metadata is tracked in the OSD's object store, not in a chain of files. However, the space overhead of maintaining many snapshots is real -- every block modified since the snapshot was taken consumes additional space. The CSI driver's snapshot implementation varies by storage backend; the performance characteristics are ultimately a property of the storage, not of KubeVirt.

Hyper-V (VHDX): Differencing disks (AVHDX files) behave similarly to VMware delta VMDKs. Each checkpoint adds a layer. Consolidation ("merge") is I/O-intensive.

Cloning: Full Clone vs. Linked Clone

Full clone: A complete, independent copy of a VM and all its disks. No dependency on the source VM. Time to create = time to copy all disk data. A 200 GB VM takes 200 GB of storage and several minutes to clone.

Linked clone (VMware) / Differencing disk clone (Hyper-V): A new VM shares the base disk with the parent VM (read-only) and has its own delta disk for writes. Creating a linked clone is nearly instantaneous because no data is copied. However, the clone depends on the parent's base disk -- if the parent is deleted, the clone is broken.

KubeVirt cloning uses CSI clone operations. The VirtualMachine CR can specify a DataVolumeTemplate with a source PVC, and CDI (Containerized Data Importer) performs the clone via the CSI driver's clone capability. If the CSI driver supports efficient cloning (e.g., Ceph RBD rbd clone using copy-on-write), the clone is fast and space-efficient. If not, CDI falls back to a snapshot-based "smart clone" where the driver supports snapshots, or to a "host-assisted clone" that copies the data block-by-block through a transfer pod.

apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
  name: cloned-vm
spec:
  running: true
  dataVolumeTemplates:
    - metadata:
        name: cloned-vm-rootdisk
      spec:
        source:
          pvc:
            namespace: golden-images
            name: rhel9-golden-image
        storage:
          resources:
            requests:
              storage: 50Gi
          storageClassName: ocs-storagecluster-ceph-rbd
  template:
    # ... VM spec ...

3. VM Templates & Rapid Provisioning

The Provisioning Pipeline

Rapid provisioning -- creating a ready-to-use VM in under 5 minutes from request to running workload -- requires three things:

  1. A pre-built disk image (the template) that does not need to be installed from an ISO.
  2. Fast disk copy or clone to create a new VM's disk from the template.
  3. First-boot customization (hostname, IP, SSH keys, users) that runs automatically without human intervention.

  VM Provisioning Flow

  +-----------+     +-----------+     +-----------+     +-----------+
  | Golden    |     | Clone/    |     | First-boot|     | VM Ready  |
  | Image     |---->| Copy      |---->| Config    |---->| for Use   |
  | (template)|     | (new disk)|     | (cloud-   |     |           |
  |           |     |           |     |  init)    |     |           |
  +-----------+     +-----------+     +-----------+     +-----------+
       |                 |                 |                 |
       v                 v                 v                 v
  Maintained by     Seconds (COW)     30-90 seconds      Total:
  platform team     Minutes (full)    (SSH key inject,   1-5 minutes
  Updated monthly                     network config,    with COW clone
                                      package install)   5-15 minutes
                                                         with full copy

KubeVirt Provisioning Mechanisms

containerDisk: A VM disk image baked into an OCI container image and stored in a container registry. When a VM starts, the container image is pulled and the disk is extracted. The disk is ephemeral -- changes are lost when the VM is stopped. Useful for stateless VMs or as boot sources for templates.

# containerDisk: disk image baked into a container image
spec:
  template:
    spec:
      volumes:
        - name: rootdisk
          containerDisk:
            image: registry.internal.bank.ch/vm-images/rhel9-base:9.4
            imagePullPolicy: IfNotPresent

DataVolume templates: The primary mechanism for persistent VM provisioning. A DataVolumeTemplate in the VirtualMachine spec instructs CDI to create a PVC and populate it from a source (container registry, HTTP URL, existing PVC, or upload).

# DataVolume template: creates a persistent disk from a source
spec:
  dataVolumeTemplates:
    - metadata:
        name: webserver-01-rootdisk
      spec:
        source:
          pvc:
            namespace: golden-images
            name: rhel9-golden-image-20260401
        storage:
          resources:
            requests:
              storage: 50Gi
          storageClassName: ocs-storagecluster-ceph-rbd

Golden image PVCs with boot sources: OpenShift Virtualization (OVE) includes automatic boot source management. For common operating systems (RHEL, CentOS, Fedora, Windows Server), OVE can automatically download and maintain golden image PVCs that are kept up to date on a configurable schedule. These PVCs serve as the source for DataVolume clones.

  Golden Image Pipeline (OVE)

  +------------------+     +-----------------+     +------------------+
  | Upstream Image   |     | CDI Import      |     | Golden Image PVC |
  | Source           |---->| CronJob         |---->| (kept current)   |
  | (Red Hat CDN,    |     | (runs weekly)   |     |                  |
  |  internal        |     |                 |     | Namespace:        |
  |  registry)       |     |                 |     | openshift-        |
  +------------------+     +-----------------+     | virtualization   |
                                                   +------------------+
                                                          |
                                           Clone on VM creation
                                                          |
                                                          v
                                                   +------------------+
                                                   | VM PVC           |
                                                   | (independent     |
                                                   |  copy)           |
                                                   +------------------+
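
Under the hood, this pipeline is driven by a CDI DataImportCron. A sketch of one -- the registry URL, schedule, and namespace are illustrative assumptions:

apiVersion: cdi.kubevirt.io/v1beta1
kind: DataImportCron
metadata:
  name: rhel9-image-cron
  namespace: openshift-virtualization-os-images
spec:
  schedule: "0 2 * * 0"           # Re-import weekly, Sunday 02:00
  managedDataSource: rhel9        # DataSource that new VMs clone from
  importsToKeep: 2                # Retain the two most recent imports
  template:
    spec:
      source:
        registry:
          url: "docker://registry.redhat.io/rhel9/rhel-guest-image:latest"
      storage:
        resources:
          requests:
            storage: 10Gi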

Provisioning time determinants:

| Factor | Impact | Optimization |
|---|---|---|
| Disk image size | Full copy time is proportional to image size. | Keep golden images minimal (5-10 GB). |
| Clone mechanism | COW clone (Ceph rbd clone, LVM thin clone) is near-instant. Full copy takes minutes. | Use a storage backend that supports CSI clone. |
| Image pull (containerDisk) | First pull downloads from the registry. Subsequent pulls use the node cache. | Pre-pull images on all nodes via a DaemonSet. |
| Cloud-init execution | 30-90 seconds for typical first-boot configuration. | Minimize cloud-init modules. Use baked images where possible. |
| Guest OS boot time | UEFI boot + GRUB + kernel + systemd = 10-60 seconds depending on OS. | Use RHEL minimal or tuned images. |

Hyper-V Templates

Hyper-V uses a traditional template approach:

  1. Build a reference VM with the desired OS, patches, and configuration.
  2. Sysprep the VM (generalize it -- remove machine-specific identifiers like SID, hostname, hardware drivers).
  3. Export the VM as a template or simply copy the VHDX file.
  4. Deploy by creating a new VM using the template VHDX as a base (full copy or differencing disk).

Azure Local integrates with Azure Marketplace images and custom VM images stored in Azure Arc resource bridge. Provisioning via the Azure portal or Azure CLI creates VMs from these images. The provisioning pipeline is comparable to Azure cloud VM creation, with the added latency of local disk operations.


4. Cloud-init / Ignition

Cloud-init

Cloud-init is the industry-standard tool for Linux VM first-boot customization. Originally developed for Amazon EC2, it is now supported by every major cloud provider and hypervisor. Cloud-init runs during the first boot of a VM (and optionally on subsequent boots) to configure the system based on metadata and user-data provided by the platform.

Cloud-init datasources: Cloud-init supports multiple datasources -- the mechanism by which it receives configuration data:

| Datasource | How it works | Used by |
|---|---|---|
| NoCloud | Reads from a virtual CD-ROM (ISO) or a local seed directory containing meta-data and user-data files. | KubeVirt (cloudInitNoCloud), libvirt, manual setups |
| ConfigDrive | Reads from a configuration drive (virtual disk with a specific label). | OpenStack, KubeVirt (cloudInitConfigDrive) |
| Azure | Queries the Azure IMDS (Instance Metadata Service) over HTTP at 169.254.169.254. | Azure, Azure Local |
| EC2 | Queries the EC2 metadata service over HTTP. | AWS |
| GCE | Queries the GCE metadata service. | Google Cloud |
| VMware (open-vm-tools) | Reads from VMware's GuestInfo variables or OVF properties. | VMware vSphere |

Cloud-init modules: Cloud-init processes configuration in phases (init, config, final). The modules most commonly used in enterprise VM provisioning -- all of which appear in the example below -- are users_groups (accounts and SSH keys), write_files, packages, runcmd, and ntp.

Cloud-init in KubeVirt:

KubeVirt provides two cloud-init volume types:

# Option 1: cloudInitNoCloud (most common)
# Generates an ISO image attached as a virtual CD-ROM
spec:
  template:
    spec:
      volumes:
        - name: cloudinit
          cloudInitNoCloud:
            userData: |
              #cloud-config
              hostname: webserver-01
              fqdn: webserver-01.prod.bank.ch

              users:
                - name: ansible
                  ssh_authorized_keys:
                    - ssh-ed25519 AAAA... ansible-automation@bank.ch
                  sudo: ALL=(ALL) NOPASSWD:ALL
                  groups: wheel
                  shell: /bin/bash

              write_files:
                - path: /etc/pki/ca-trust/source/anchors/bank-root-ca.pem
                  content: |
                    -----BEGIN CERTIFICATE-----
                    MIIFxTCCA62gAwIBAgIUOeji...
                    -----END CERTIFICATE-----

              runcmd:
                - update-ca-trust
                - systemctl enable --now qemu-guest-agent
                - hostnamectl set-hostname webserver-01.prod.bank.ch

              packages:
                - qemu-guest-agent
                - chrony

              ntp:
                servers:
                  - ntp1.bank.ch
                  - ntp2.bank.ch

            networkData: |
              version: 2
              ethernets:
                eth0:
                  addresses:
                    - 10.100.50.21/24
                  gateway4: 10.100.50.1
                  nameservers:
                    addresses:
                      - 10.100.1.10
                      - 10.100.1.11
                    search:
                      - prod.bank.ch
                      - bank.ch

# Option 2: cloudInitConfigDrive
# Uses a config-drive volume (OpenStack-compatible)
spec:
  template:
    spec:
      volumes:
        - name: cloudinit
          cloudInitConfigDrive:
            userData: |
              #cloud-config
              hostname: webserver-02
              # ... same content as above ...

The choice between NoCloud and ConfigDrive depends on the guest OS image's cloud-init configuration. Most enterprise Linux images support both. NoCloud is simpler and more commonly used with KubeVirt.

Referencing Secrets: For sensitive data (passwords, private keys), the userData and networkData fields can reference Kubernetes Secrets instead of inline content:

volumes:
  - name: cloudinit
    cloudInitNoCloud:
      userDataSecretRef:
        name: my-vm-cloud-init-secret
      networkDataSecretRef:
        name: my-vm-network-config
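
The referenced Secrets can be created directly from local files. A sketch, assuming the cloud-config and network config live in local YAML files (KubeVirt reads them from the userdata and networkdata keys):

kubectl create secret generic my-vm-cloud-init-secret \
  -n production --from-file=userdata=cloud-init-webserver.yaml

kubectl create secret generic my-vm-network-config \
  -n production --from-file=networkdata=network-config.yaml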

Ignition

Ignition is a first-boot provisioning tool developed by CoreOS (now part of Red Hat). Unlike cloud-init, which is a multi-phase configuration management tool, Ignition runs exactly once during the initramfs phase -- before the root filesystem is mounted -- and writes all configuration directly to disk. If Ignition fails, the machine does not boot.

When to use which:

| Tool | Used for | Characteristics |
|---|---|---|
| Cloud-init | Guest VMs (RHEL, Ubuntu, Windows) | Multi-phase, idempotent, forgiving. Runs during normal boot. Can be re-triggered. |
| Ignition | Fedora CoreOS, RHCOS (OVE node OS) | Single-shot, early-boot (initramfs). Writes files, systemd units, users. Machine fails to boot if Ignition fails. |
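
For comparison, Ignition configs are JSON, normally generated from a human-readable Butane file. A minimal sketch (Fedora CoreOS variant; the SSH key and hostname are placeholders):

# Rendered with: butane --pretty --strict node.bu > node.ign
variant: fcos
version: 1.5.0
passwd:
  users:
    - name: core
      ssh_authorized_keys:
        - ssh-ed25519 AAAA... admin@bank.ch
storage:
  files:
    - path: /etc/hostname
      mode: 0644
      contents:
        inline: edge-node-01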

In the context of OVE: Ignition provisions the RHCOS nodes that make up the platform itself, while cloud-init customizes the guest VMs that run on top of it.

These are separate concerns. The evaluation team does not need to write Ignition configs for guest VMs unless those guests are running Fedora CoreOS or RHCOS, which is rare for enterprise application VMs.


5. Resource Pools / Quotas

VMware Resource Pools vs. Kubernetes Quotas

In VMware, resource pools are hierarchical containers within a cluster that partition CPU and memory using three controls: shares (relative priority during contention), reservations (guaranteed minimum), and limits (hard ceiling). Resource pools can be nested, and their allocations are relative to their parent pool.

In Kubernetes (and therefore KubeVirt), the equivalent concept is a combination of Namespaces, ResourceQuotas, and LimitRanges. The mapping is not 1:1, but the intent is the same: prevent one team/project/tenant from consuming all cluster resources.

  VMware Resource Pool vs. Kubernetes Namespace Hierarchy

  VMware:                              Kubernetes (KubeVirt):
  ========                             =====================

  Cluster                              Cluster
  |                                    |
  +-- RP: Production                   +-- Namespace: prod-tier1
  |   | Reservation: 180 GHz CPU       |   ResourceQuota:
  |   | Reservation: 1.5 TB RAM        |     requests.cpu: 180
  |   |                                |     requests.memory: 1.5Ti
  |   +-- RP: Tier-1                   |     limits.cpu: 200
  |   |   Reservation: 120 GHz        |     limits.memory: 1.8Ti
  |   |   +-- DB VMs                   |     count/virtualmachines: 50
  |   |                                |
  |   +-- RP: Tier-2                   +-- Namespace: prod-tier2
  |       Reservation: 60 GHz         |   ResourceQuota:
  |       +-- Web VMs                  |     requests.cpu: 60
  |                                    |     requests.memory: 500Gi
  +-- RP: Development                  |
  |   Limit: 64 GHz                    +-- Namespace: development
  |   Limit: 512 GB                    |   ResourceQuota:
  |   No reservation                   |     limits.cpu: 64
  |   +-- Dev VMs                      |     limits.memory: 512Gi
  |                                    |     requests.cpu: 0  (no guarantee)
  +-- RP: Infrastructure               |
      Reservation: 12 GHz             +-- Namespace: infrastructure
      +-- Infra VMs                        ResourceQuota:
                                             requests.cpu: 12
                                             requests.memory: 48Gi

ResourceQuota and LimitRange

# ResourceQuota: caps total resource consumption in a namespace
apiVersion: v1
kind: ResourceQuota
metadata:
  name: production-tier1-quota
  namespace: prod-tier1
spec:
  hard:
    # Total CPU and memory that all VMs in this namespace can request
    requests.cpu: "180"
    requests.memory: 1.5Ti

    # Total CPU and memory limits (hard ceiling) for all VMs
    limits.cpu: "200"
    limits.memory: 1.8Ti

    # Cap the number of VirtualMachine objects
    count/virtualmachines.kubevirt.io: "50"

    # Cap the number of PVCs (controls storage consumption)
    persistentvolumeclaims: "150"

    # Cap total storage requested
    requests.storage: 50Ti

# LimitRange: sets defaults and constraints for individual VMs
apiVersion: v1
kind: LimitRange
metadata:
  name: vm-limits
  namespace: prod-tier1
spec:
  limits:
    - type: Container     # Applies to the virt-launcher container
      default:            # Default limits if not specified on the VM
        cpu: "4"
        memory: 8Gi
      defaultRequest:     # Default requests if not specified
        cpu: "2"
        memory: 4Gi
      min:                # Minimum allowed -- prevents undersized VMs
        cpu: "1"
        memory: 2Gi
      max:                # Maximum allowed -- prevents oversized VMs
        cpu: "64"
        memory: 256Gi
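
Day-two visibility -- the equivalent of vCenter's resource pool utilization view -- comes from inspecting the quota's used-vs-hard accounting:

# Shows used vs. hard for every resource tracked by the quota
kubectl describe resourcequota production-tier1-quota -n prod-tier1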

Key differences from VMware resource pools:

| Concept | VMware Resource Pool | Kubernetes ResourceQuota |
|---|---|---|
| Hierarchy | Resource pools can be nested multiple levels deep. Child pools inherit from the parent. | Namespaces are flat -- no nesting. The Hierarchical Namespace Controller (HNC) exists but is not widely adopted. |
| Shares (relative priority) | Shares determine relative resource allocation during contention. | No direct equivalent. Kubernetes PriorityClasses influence scheduling and eviction order, but not proportional sharing of CPU/memory. |
| Reservation (guaranteed) | Reservation guarantees a minimum resource allocation backed by admission control. | requests serve a similar purpose -- they are guaranteed allocations that the scheduler accounts for. A pod is only placed on a node if the node has enough unrequested resources. |
| Limit (hard ceiling) | Limits cap consumption. The VM is throttled if it exceeds the limit. | limits are enforced by cgroups. A container exceeding its memory limit is OOM-killed; one exceeding its CPU limit is throttled. |
| Overcommitment | Allowed by design. Shares arbitrate during contention. Ballooning and swap handle memory pressure. | Allowed when requests < limits. If all pods on a node simultaneously use their limits, the node is overcommitted. Pods exceeding their memory requests (but within limits) may be evicted under memory pressure. |

Hyper-V Resource Controls

Hyper-V provides per-VM resource controls (not hierarchical pools): a virtual machine reserve (guaranteed CPU percentage), a virtual machine limit (hard CPU ceiling), a relative weight (shares-style arbitration under contention), and Dynamic Memory minimum/maximum bounds for RAM.

Azure Local does not expose hierarchical resource pools. Resource governance is managed through Azure RBAC and Azure policies at the subscription and resource group level, not at the hypervisor level. This is a significant architectural difference from vSphere resource pools.


6. Affinity / Anti-Affinity Rules

Why Placement Control Matters

At 5,000+ VMs, placement decisions have cascading effects: database replicas must land in different failure domains to survive host and rack outages, latency-sensitive VM pairs benefit from co-location on the same host, and license-bound workloads (e.g., Oracle) must stay on designated hosts to keep licensing costs contained.

Kubernetes Affinity Constructs

Kubernetes provides three affinity mechanisms, all of which apply to KubeVirt VMs through the virt-launcher pod spec:

nodeAffinity: Controls which nodes a VM can be scheduled on, based on node labels.

apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
  name: oracle-db-01
  namespace: database-tier
spec:
  template:
    spec:
      affinity:
        nodeAffinity:
          # Hard requirement: MUST run on nodes labeled for Oracle
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: license/oracle
                    operator: In
                    values: ["true"]
                  - key: hardware/gpu
                    operator: DoesNotExist
          # Soft preference: prefer nodes in rack-A
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 80
              preference:
                matchExpressions:
                  - key: topology.kubernetes.io/zone
                    operator: In
                    values: ["rack-a"]

podAntiAffinity: Ensures that VMs with specific labels do not land on the same host (or same zone/rack).

apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
  name: oracle-rac-node-1
  namespace: database-tier
spec:
  template:
    metadata:
      labels:
        app: oracle-rac
        cluster-name: rac-prod-01
    spec:
      affinity:
        podAntiAffinity:
          # Hard requirement: no two oracle-rac pods on the same host
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchExpressions:
                  - key: cluster-name
                    operator: In
                    values: ["rac-prod-01"]
              topologyKey: kubernetes.io/hostname
          # Soft preference: spread across racks too
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                labelSelector:
                  matchExpressions:
                    - key: cluster-name
                      operator: In
                      values: ["rac-prod-01"]
                topologyKey: topology.kubernetes.io/zone

podAffinity: Ensures that related VMs are scheduled on the same host or in the same topology domain.

# Keep the app server and its cache VM on the same host
spec:
  template:
    metadata:
      labels:
        app: trading-platform
        component: app-server
    spec:
      affinity:
        podAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchExpressions:
                  - key: app
                    operator: In
                    values: ["trading-platform"]
                  - key: component
                    operator: In
                    values: ["cache"]
              topologyKey: kubernetes.io/hostname

Topology Spread Constraints

Topology spread constraints are a more flexible mechanism than anti-affinity for distributing VMs evenly across failure domains:

spec:
  template:
    spec:
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app: web-frontend
        - maxSkew: 2
          topologyKey: kubernetes.io/hostname
          whenUnsatisfiable: ScheduleAnyway
          labelSelector:
            matchLabels:
              app: web-frontend

This ensures that web-frontend VMs are spread evenly across zones (maxSkew: 1 means the difference in VM count between any two zones is at most 1) and roughly evenly across hosts (maxSkew: 2, soft constraint).

KubeVirt-Specific Placement Features

dedicatedCpuPlacement: Pins vCPUs to dedicated physical CPU cores, preventing contention with other workloads. Essential for latency-sensitive VMs (trading systems, real-time data processing):

spec:
  template:
    spec:
      domain:
        cpu:
          cores: 8
          dedicatedCpuPlacement: true

When dedicatedCpuPlacement is set, the virt-launcher pod is placed in the Kubernetes Guaranteed QoS class (requests == limits), and the kubelet's CPU manager allocates exclusive CPU cores to the pod. No other pod can use those cores.
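
The effect is verifiable on the launcher pod, which must land in the Guaranteed QoS class. A quick check, assuming a VM named latency-vm in namespace trading (placeholders):

kubectl get pod -n trading -l vm.kubevirt.io/name=latency-vm \
  -o jsonpath='{.items[0].status.qosClass}'
# Expected output: Guaranteed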

evictionStrategy: Controls what happens to the VM when the node is drained:

spec:
  template:
    spec:
      evictionStrategy: LiveMigrate  # Options: LiveMigrate, LiveMigrateIfPossible, External, None

VMware DRS Rules vs. Kubernetes Affinity

| VMware DRS Rule | Kubernetes Equivalent | Notes |
|---|---|---|
| VM-VM affinity ("should run together") | podAffinity with preferredDuringSchedulingIgnoredDuringExecution | Soft constraint -- best effort |
| VM-VM affinity ("must run together") | podAffinity with requiredDuringSchedulingIgnoredDuringExecution | Hard constraint -- scheduler will not violate |
| VM-VM anti-affinity ("should run separately") | podAntiAffinity with preferredDuringScheduling... | Soft constraint |
| VM-VM anti-affinity ("must run separately") | podAntiAffinity with requiredDuringScheduling... | Hard constraint |
| VM-Host affinity ("should/must run on host group") | nodeAffinity with requiredDuringScheduling... or preferredDuringScheduling... | Based on node labels, not host groups. Labels are more flexible but must be manually maintained. |
| VM-Host anti-affinity ("should/must not run on host group") | nodeAffinity with operator: NotIn or DoesNotExist | Same mechanism, inverted selector |

Key difference: DRS affinity rules are enforced continuously -- DRS re-evaluates rules every 5 minutes and migrates VMs to maintain compliance. Kubernetes affinity rules are enforced at scheduling time only (requiredDuringSchedulingIgnoredDuringExecution). The "IgnoredDuringExecution" means that if a node's labels change after a VM is scheduled, the VM is NOT automatically migrated to a compliant node. Kubernetes has a RequiredDuringExecution concept in alpha/proposal stage, but it is not production-ready. This is a meaningful gap for organizations that rely on DRS to continuously enforce placement rules.

Hyper-V Placement

Hyper-V Failover Clustering provides three placement controls: preferred owners (a soft preference for specific hosts), possible owners (a hard restriction to specific hosts), and AntiAffinityClassNames (keep VMs sharing a class name on different hosts where possible).

Azure Local inherits these Hyper-V clustering capabilities. They are less flexible than Kubernetes affinity constructs (no topology spread, no weighted preferences) but are well-understood by Windows Server administrators.


7. CPU & RAM Hot-Add

What Hot-Add Is and Why It Matters

Hot-add is the ability to increase a VM's CPU count or memory while it is running, without a reboot. It addresses a common operational scenario: a production VM hits a resource ceiling during a peak workload, and the team needs to increase capacity immediately without scheduling a maintenance window.

Hot-add matters most for stateful, vertically scaled workloads -- production databases and licensed applications where every restart means a negotiated maintenance window. It matters less for horizontally scalable workloads, where adding another VM instance is usually simpler than resizing a running one.

KubeVirt CPU and Memory Hot-Plug

KubeVirt supports CPU and memory hot-plug in recent releases (both capabilities are feature-gated in some versions). The feature allows modifying the VM's CPU and memory while it is running.

CPU hot-plug: KubeVirt supports adding vCPUs to a running VM. The VM spec defines a range of sockets, and the running VM can have additional sockets activated without a restart. The guest OS must support CPU hot-plug (Linux: supported since kernel 2.6, Windows Server: supported since 2012).

To enable CPU hot-plug, the VM must be configured with maxSockets greater than the initial socket count:

apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
  name: scalable-vm
spec:
  template:
    spec:
      domain:
        cpu:
          sockets: 2         # Initial: 2 sockets
          cores: 2           # 2 cores per socket
          threads: 1         # 1 thread per core
          maxSockets: 8      # Maximum: can scale up to 8 sockets

After the VM is running, increase the socket count on the VM object; with the LiveUpdate rollout strategy enabled in the KubeVirt CR, virt-controller propagates the change to the running VMI without a restart:

# Scale CPU from 2 to 4 sockets (2 cores each = 8 vCPUs)
kubectl patch vm scalable-vm --type merge \
  -p '{"spec":{"template":{"spec":{"domain":{"cpu":{"sockets":4}}}}}}'

Memory hot-plug: KubeVirt supports memory hot-plug by defining maxGuest memory in the VM spec:

spec:
  template:
    spec:
      domain:
        memory:
          guest: 8Gi         # Initial: 8 GB
          maxGuest: 32Gi     # Maximum: can scale up to 32 GB
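
As with CPU, the running allocation is grown by patching the VM's guest memory upward, staying within maxGuest -- a sketch mirroring the CPU example above:

# Grow guest memory from 8Gi to 16Gi without a restart
kubectl patch vm scalable-vm --type merge \
  -p '{"spec":{"template":{"spec":{"domain":{"memory":{"guest":"16Gi"}}}}}}'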

Limitations and considerations: the guest OS must support hot-plug, maxSockets and maxGuest must be declared before the VM starts (they size the hot-plug headroom QEMU reserves), and scaling down is not symmetric -- memory hot-unplug in particular is not reliably supported, so over-growing a VM can require a restart to shrink it.

Hyper-V Hot-Add

Hyper-V supports hot-add for both CPU and memory, with some distinctions:

Dynamic Memory: Hyper-V's Dynamic Memory is a more automated approach than manual hot-add. The VM is configured with a minimum, startup, and maximum memory value. The Hyper-V host dynamically adjusts the VM's memory allocation within this range based on demand, using a balloon driver inside the guest.

| Parameter | Description |
|---|---|
| Startup Memory | Memory allocated when the VM boots |
| Minimum Memory | Lowest memory the VM can be reduced to under host pressure |
| Maximum Memory | Highest memory the VM can grow to under guest demand |
| Memory Buffer | Percentage of committed memory to keep as reserve (default: 20%) |

Dynamic Memory is transparent to the guest OS -- no manual intervention required. This is more operationally convenient than KubeVirt's explicit hot-plug.

CPU hot-add (Hyper-V): Supported for Generation 2 VMs. The VM must be configured with "Enable processor compatibility" unchecked and the guest OS must support hot-add (Windows Server 2016+, Linux with hot-plug support).

VMware Hot-Add

VMware has supported CPU and memory hot-add since vSphere 4.1. It must be enabled per-VM in the VM settings before the first boot (or while the VM is powered off). Once enabled, vCPUs and memory can be increased (never decreased) while the VM runs, provided the guest OS supports the change.

One VMware-specific quirk: enabling CPU hot-add disables vNUMA for the VM (the VM presents a flat memory topology to the guest). This can negatively impact performance for NUMA-sensitive workloads. vSphere 8 relaxed this restriction for some configurations, but it remains a consideration.


8. GPU Passthrough / vGPU

Why GPU Virtualization Matters

GPU access in virtual machines is required for an expanding set of enterprise workloads -- chiefly the AI/ML inference pipelines and GPU-accelerated VDI desktops noted at the start of this chapter.

Two fundamental approaches exist: passthrough (one GPU to one VM) and sharing (one GPU to many VMs).

PCIe Passthrough via VFIO (KubeVirt)

VFIO (Virtual Function I/O) is a Linux kernel framework that allows userspace programs (including QEMU) to directly access PCI devices with full DMA isolation via the IOMMU (VT-d on Intel, AMD-Vi on AMD). When a GPU is passed through to a VM via VFIO, the VM has exclusive, near-native-performance access to the entire GPU.

  GPU Passthrough via VFIO in KubeVirt

  +================================================================+
  |  Worker Node                                                    |
  |                                                                 |
  |  +----------------------------------------------------------+  |
  |  |  virt-launcher Pod (VM with GPU)                         |  |
  |  |                                                          |  |
  |  |  +----------------------------------------------------+  |  |
  |  |  |  QEMU/KVM                                          |  |  |
  |  |  |                                                    |  |  |
  |  |  |  Guest OS sees: NVIDIA A100 (native PCI device)   |  |  |
  |  |  |  Guest installs native NVIDIA driver               |  |  |
  |  |  +----------------------------------------------------+  |  |
  |  |       |                                                   |  |
  |  |       | VFIO device assignment                            |  |
  |  |       | (QEMU -device vfio-pci,host=0000:3b:00.0)        |  |
  |  +----------------------------------------------------------+  |
  |       |                                                         |
  |       | IOMMU (VT-d / AMD-Vi)                                   |
  |       | DMA remapping ensures VM can only access its own GPU    |
  |       |                                                         |
  |  +----------------------------------------------------------+  |
  |  |  Physical GPU: NVIDIA A100 80GB                          |  |
  |  |  PCI address: 0000:3b:00.0                               |  |
  |  |  IOMMU group: 42                                         |  |
  |  |  Bound to vfio-pci driver (not nvidia driver)            |  |
  |  +----------------------------------------------------------+  |
  +================================================================+
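Before KubeVirt can hand the device to QEMU, the host must detach the GPU from its native driver and bind it to vfio-pci. A manual sketch (the PCI address and vendor:device ID match the diagram and are illustrative; in practice the NVIDIA GPU Operator automates this step):

# Unbind the GPU from its current driver (if any)
echo 0000:3b:00.0 > /sys/bus/pci/devices/0000:3b:00.0/driver/unbind
# Tell vfio-pci to claim devices with this vendor:device ID (10de = NVIDIA)
echo 10de 20b5 > /sys/bus/pci/drivers/vfio-pci/new_id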

In KubeVirt, GPU passthrough is configured using the gpus or hostDevices field in the VM spec. The GPU must first be configured as a Kubernetes device plugin resource.

apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
  name: ml-inference-vm
  namespace: ai-workloads
spec:
  template:
    spec:
      domain:
        devices:
          gpus:
            - name: gpu1
              deviceName: nvidia.com/A100
        resources:
          requests:
            memory: 64Gi
          limits:
            memory: 64Gi
      # Node must have an available A100 GPU
      # Kubernetes device plugin framework handles allocation
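The deviceName above must match a resourceName the cluster actually advertises. With KubeVirt's built-in host-device support, that mapping is declared in the KubeVirt custom resource; a sketch (the vendor:device ID is illustrative):

apiVersion: kubevirt.io/v1
kind: KubeVirt
metadata:
  name: kubevirt
  namespace: kubevirt
spec:
  configuration:
    permittedHostDevices:
      pciHostDevices:
        - pciVendorSelector: "10DE:20B5"   # illustrative NVIDIA A100 vendor:device ID
          resourceName: "nvidia.com/A100"  # must match deviceName in the VM spec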

Limitation: GPU passthrough is exclusive -- one GPU per VM. The GPU cannot be shared. Live migration of VMs with passthrough GPUs is NOT supported (the GPU's state cannot be serialized and transferred). The VM must be shut down and restarted on a different GPU-equipped host. This is the same limitation as VMware's DirectPath I/O and Hyper-V's DDA.

NVIDIA GPU Operator for Kubernetes

The NVIDIA GPU Operator automates the deployment and management of GPU drivers and device plugins on Kubernetes nodes. It deploys:

  1. The NVIDIA driver (containerized, as a DaemonSet)

  2. The NVIDIA container toolkit

  3. The Kubernetes device plugin that advertises GPUs as schedulable resources

  4. GPU Feature Discovery (automatic node labeling)

  5. The DCGM exporter for GPU telemetry

  6. The MIG Manager on MIG-capable GPUs

For KubeVirt, the GPU Operator works with both passthrough (VFIO) and vGPU modes.

vGPU: Time-Sliced vs. MIG (Multi-Instance GPU)

When a single physical GPU must be shared across multiple VMs, two technologies are available:

Time-sliced vGPU (NVIDIA GRID / vGPU):

  Time-Sliced vGPU (NVIDIA GRID)

  +================================================================+
  |  Physical GPU: NVIDIA A100 80GB                                 |
  |                                                                 |
  |  NVIDIA vGPU Manager (host driver)                              |
  |  Schedules time slices across virtual GPUs                      |
  |                                                                 |
  |  +--------------+  +--------------+  +--------------+           |
  |  | vGPU 1       |  | vGPU 2       |  | vGPU 3       |          |
  |  | Profile:     |  | Profile:     |  | Profile:     |          |
  |  | A100-4C      |  | A100-4C      |  | A100-4C      |          |
  |  | (4 GB VRAM)  |  | (4 GB VRAM)  |  | (4 GB VRAM)  |          |
  |  |              |  |              |  |              |           |
  |  | Assigned to  |  | Assigned to  |  | Assigned to  |          |
  |  | VM-1         |  | VM-2         |  | VM-3         |          |
  |  +--------------+  +--------------+  +--------------+           |
  |                                                                 |
  |  Time-slicing: Each vGPU gets exclusive GPU access in           |
  |  round-robin time windows. Low latency but no guaranteed        |
  |  throughput -- a VM's GPU time depends on contention.           |
  |                                                                 |
  |  Profiles: A100-1C (1GB), A100-2C (2GB), ..., A100-80C (80GB) |
  |  Cannot mix compute and graphics profiles on the same GPU.      |
  +================================================================+

Time-sliced vGPU divides GPU time across VMs. Each VM sees a virtual GPU with a dedicated portion of the GPU's framebuffer (VRAM) but shares the compute cores via time-slicing. The NVIDIA vGPU Manager runs on the host and mediates access. Each VM installs a standard NVIDIA vGPU guest driver.

Licensing: NVIDIA vGPU requires a separate license from NVIDIA. License tiers:

  1. NVIDIA Virtual Applications (vApps) -- published application streaming

  2. NVIDIA Virtual PC (vPC) -- standard office VDI desktops

  3. NVIDIA RTX Virtual Workstation (vWS) -- professional 3D graphics

  4. NVIDIA Virtual Compute Server (vCS) -- compute/AI workloads, since folded into NVIDIA AI Enterprise

MIG (Multi-Instance GPU):

MIG is available on NVIDIA A100, A30, H100, and later GPUs. Unlike time-slicing, MIG physically partitions the GPU into up to 7 isolated instances, each with dedicated compute cores, memory bandwidth, and L2 cache. There is no time-sharing -- each MIG instance is truly isolated.

  MIG (Multi-Instance GPU) on NVIDIA A100

  +================================================================+
  |  Physical GPU: NVIDIA A100 80GB                                 |
  |  108 SMs (Streaming Multiprocessors), 80 GB HBM2e               |
  |                                                                 |
  |  MIG Partitioning (example configuration):                      |
  |                                                                 |
  |  +---------------------+  +---------------------+              |
  |  | MIG Instance 1      |  | MIG Instance 2      |              |
  |  | Profile: 3g.40gb    |  | Profile: 3g.40gb    |              |
  |  | 42 SMs              |  | 42 SMs              |              |
  |  | 40 GB HBM2e         |  | 40 GB HBM2e         |              |
  |  | Dedicated L2 cache  |  | Dedicated L2 cache  |              |
  |  | Dedicated mem BW    |  | Dedicated mem BW    |              |
  |  |                     |  |                     |              |
  |  | Isolated -- cannot  |  | Isolated -- cannot  |              |
  |  | see or affect       |  | see or affect       |              |
  |  | other instances     |  | other instances     |              |
  |  +---------------------+  +---------------------+              |
  |                                                                 |
  |  Other profiles: 1g.10gb, 2g.20gb, 4g.40gb, 7g.80gb           |
  |  Profiles can be mixed (e.g., 3g + 2g + 1g + 1g on one GPU)    |
  |  Reconfiguring MIG requires GPU reset (all VMs must be stopped)|
  +================================================================+
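On the host, the partitioning shown above is driven by nvidia-smi (or automated by the GPU Operator's MIG Manager). A sketch; profile IDs vary by GPU model, so verify them locally:

# List the MIG profiles this GPU supports, with their numeric IDs
nvidia-smi mig -lgip
# Enable MIG mode on GPU 0 (the GPU must be idle; may require a reset)
nvidia-smi -i 0 -mig 1
# Create two GPU instances plus their compute instances (-C); ID 9 is
# commonly the 3g profile, but confirm against the -lgip output
nvidia-smi mig -cgi 9,9 -C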

MIG vs. time-sliced vGPU:

Aspect | Time-sliced vGPU | MIG
Isolation | Shared compute, dedicated VRAM | Fully isolated compute + memory + cache
Performance predictability | Variable (depends on contention) | Consistent (guaranteed resources)
Maximum partitions | Up to 32 vGPUs per GPU (profile-dependent) | Up to 7 MIG instances
GPU models | All NVIDIA datacenter GPUs | A100, A30, H100, H200, B100, B200 only
Live migration | Supported (vGPU state can be serialized) | Not supported with passthrough
Licensing | Requires NVIDIA vGPU license | No additional license (included with GPU)
Reconfiguration | Dynamic (add/remove vGPUs without GPU reset) | Requires GPU reset to reconfigure partitions

Hyper-V: DDA and GPU-P

Discrete Device Assignment (DDA): Hyper-V's equivalent of VFIO passthrough. A physical PCIe device (GPU, NVMe, FPGA) is assigned exclusively to a VM. The VM gets near-native performance. Live migration is not supported.

GPU-P (GPU Partitioning): Introduced in Windows Server 2025 and Azure Local, GPU-P allows a single GPU to be partitioned and shared across multiple VMs. Similar in concept to SR-IOV for NICs, GPU-P creates virtual GPU partitions that are hardware-isolated. Currently supported for a limited set of GPUs and primarily used for Azure Virtual Desktop scenarios on Azure Local.

Azure Local leverages GPU-P for VDI workloads and DDA for AI/ML workloads. GPU-P is less mature than NVIDIA's vGPU ecosystem but does not require separate NVIDIA licensing.
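Operationally, GPU-P is managed through dedicated Hyper-V PowerShell cmdlets. A minimal sketch (VM name illustrative; partition sizing options omitted):

# List GPUs on the host that support partitioning
Get-VMHostPartitionableGpu
# Attach a GPU partition to a VM
Add-VMGpuPartitionAdapter -VMName "vdi-vm-01"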

VMware GPU Virtualization

VMware supports three GPU modes:

  1. vSGA (shared graphics acceleration) -- hypervisor-mediated 3D for basic graphics; no general-purpose compute.

  2. DirectPath I/O / vDGA (passthrough) -- one GPU dedicated to one VM; near-native performance, no vMotion.

  3. NVIDIA vGPU (GRID) -- mediated sharing of a GPU across multiple VMs; vMotion supported.

Live migration with vGPU: VMware was the first to support live migration of VMs with vGPU attachments (vMotion with vGPU, supported since vSphere 6.7 Update 1). KubeVirt also supports live migration of VMs with mediated devices (including vGPU), but this capability is less tested at scale. Hyper-V does not support live migration of VMs with DDA or GPU-P.


How the Candidates Handle This

Capability | VMware (Current) | OVE (KubeVirt) | Azure Local (Hyper-V) | Swisscom ESC
Live Migration | vMotion. Pre-copy. Mature, battle-tested at scale. DRS automates migrations. Encrypted. | KubeVirt live migration. Pre-copy + optional post-copy. Node drain integration. No built-in DRS. | Hyper-V Live Migration. Pre-copy. Quick Migration fallback. No DRS equivalent. | VMware vMotion (current). Migration capability depends on underlying platform evolution.
Concurrent Migrations | 4 per host (configurable to 8) | 2 per node default (configurable) | 2 simultaneous default | Per VMware settings
Migration Network | Dedicated vmknic, VLAN-isolated | Dedicated NAD via Multus | Dedicated SMB/TCP network | Per VMware settings
VM Snapshots | VMDK delta/redo logs. Mature but chains degrade performance. | VolumeSnapshot via CSI. Quality depends on storage backend. Guest agent for app consistency. | Hyper-V checkpoints (standard + production). VSS for app consistency. | VMware snapshots
VM Cloning | Full clone + linked clone. Instant clone (since vSphere 6.7). | CSI clone. COW if storage supports it (Ceph, LVM thin). CDI handles import/clone. | Full copy or differencing disk. | VMware clone
VM Templates | Content Library, OVF templates, vApp. Mature. | containerDisk, DataVolume templates, golden image PVCs with auto-update. | Azure Marketplace images, custom VHDX templates. | VMware templates
First-boot Config | VMware Guest Customization (limited), cloud-init via open-vm-tools. | Cloud-init (NoCloud, ConfigDrive). Native, well-integrated. | Sysprep (Windows), cloud-init (Linux), Azure VM Agent. | VMware Guest Customization
Resource Pools | Hierarchical resource pools with shares, reservations, limits. | Namespaces + ResourceQuotas + LimitRanges. Flat (no nesting). No shares. | No hierarchical pools. Azure RBAC + policies at subscription level. | VMware resource pools
Affinity Rules | DRS affinity/anti-affinity (VM-VM, VM-Host). Continuously enforced. | nodeAffinity, podAffinity, podAntiAffinity, topology spread. Enforced at scheduling only. | Preferred/possible owners, anti-affinity class. Basic. | VMware DRS rules
CPU Hot-Add | Supported (per-VM setting). Disables vNUMA. | CPU hot-plug supported. maxSockets defines ceiling. | Supported (Generation 2 VMs). | Per VMware settings
Memory Hot-Add | Supported (per-VM setting). | Memory hot-plug supported. maxGuest defines ceiling. | Dynamic Memory (automatic). Hot-add supported. | Per VMware settings
GPU Passthrough | DirectPath I/O (VFIO-like). No live migration. | VFIO passthrough. No live migration. | DDA (Discrete Device Assignment). No live migration. | Not typically applicable (managed service).
vGPU | NVIDIA GRID vGPU. Live migration supported. | NVIDIA vGPU via mediated devices. Live migration supported (with caveats). | GPU-P (native partitioning). No live migration. | Not typically applicable.

Key Takeaways

  1. Live migration works on all platforms, but operational maturity differs. VMware vMotion is the most battle-tested at enterprise scale. KubeVirt live migration is functionally equivalent (same pre-copy algorithm, comparable downtime) but wraps migration in Kubernetes pod lifecycle semantics that are unfamiliar to VMware-trained teams. The absence of a built-in DRS equivalent in KubeVirt is a genuine gap -- the team must plan for manual rebalancing or adopt the Descheduler.

  2. Node drain is KubeVirt's killer feature for maintenance workflows. While vMotion + maintenance mode is well understood, KubeVirt's integration with PodDisruptionBudgets provides more granular control over migration ordering and parallelism. For an organization with 5,000+ VMs, PDBs can enforce that no more than N VMs from a critical application are simultaneously migrating -- a guarantee that vSphere maintenance mode does not natively provide.

  3. Snapshot quality depends entirely on the storage backend. KubeVirt delegates snapshots to CSI drivers, so the quality, speed, and consistency guarantees of snapshots are a property of the storage platform (Ceph, NetApp, Pure, etc.), not of KubeVirt itself. The team must evaluate the specific CSI driver's snapshot capabilities, not just KubeVirt's API surface.

  4. Resource governance models are fundamentally different. VMware resource pools are hierarchical with proportional sharing (shares). Kubernetes namespaces + ResourceQuotas are flat with hard caps. There is no equivalent of shares (proportional CPU/memory allocation during contention). Organizations that rely on shares for soft multi-tenancy will need to redesign their resource governance model.

  5. Affinity rules are scheduling-time-only in Kubernetes. VMware DRS continuously enforces affinity and anti-affinity rules via periodic re-evaluation and vMotion. Kubernetes only enforces affinity at pod scheduling time. If node labels change or cluster topology shifts, existing VMs are not automatically relocated. This must be supplemented with operational procedures or custom controllers.

  6. GPU workloads constrain migration. On all platforms, VMs with passthrough GPUs cannot be live migrated. Only NVIDIA vGPU (time-sliced) supports live migration, and only on VMware and (with caveats) KubeVirt. MIG instances passed through via VFIO also cannot be live migrated. Plan GPU node maintenance accordingly -- these VMs will require scheduled downtime.

  7. Hot-add is supported but less mature on KubeVirt. VMware hot-add is a well-tested feature used daily by many organizations. KubeVirt CPU and memory hot-plug are functional but newer, and the interaction with Kubernetes in-place pod resize adds a layer of complexity. Hot-remove remains unsupported on all platforms. The better long-term strategy is right-sizing VMs from the start.

  8. Template provisioning is faster on KubeVirt if the storage backend supports COW cloning. With a copy-on-write capable storage backend (Ceph RBD, LVM thin), creating a new VM from a golden image is nearly instantaneous. With full-copy backends, provisioning is slower than VMware's linked clone or instant clone. Storage backend selection directly impacts provisioning SLAs.


Discussion Guide

Use these questions when engaging with vendors, Red Hat/Microsoft/Swisscom field teams, or internal subject matter experts.

Live Migration

  1. Demonstrate a live migration of a VM with 128 GB of memory running an active write workload (e.g., fio random write at 500 MB/s). What is the total migration time and the switchover downtime? Does the migration converge, or does auto-converge need to activate? Why this matters: Memory-intensive, write-heavy VMs are the hardest to migrate. This test reveals whether the platform can handle your most demanding VMs without resorting to post-copy or manual intervention.

  2. Drain a node running 30+ VMs simultaneously while respecting PodDisruptionBudgets. How long does the full drain take? What happens if one VM's migration fails -- does the drain block or proceed? Why this matters: Node drain is the most common maintenance operation. The team needs to know the total time to evacuate a fully loaded node and understand failure handling. (A minimal drain invocation is sketched after these questions.)

  3. What is the roadmap for DRS-equivalent automatic load balancing? Is the Descheduler recommended for production use, and what are its limitations? Why this matters: Without automated rebalancing, the team must manually monitor cluster resource distribution and trigger migrations. This does not scale to 5,000+ VMs.
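For question 2, the drain itself is a one-liner, assuming the VMs carry evictionStrategy: LiveMigrate and PodDisruptionBudgets are in place; what the evaluation should measure is everything that happens after it:

# Evacuate a node; KubeVirt live-migrates VMs whose evictionStrategy is LiveMigrate
kubectl drain worker-07 --ignore-daemonsets --delete-emptydir-data --timeout=90m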

Snapshots and Cloning

  1. Take an application-consistent snapshot of a running database VM (PostgreSQL or Oracle) with a 500 GB data disk. How long does the snapshot take? Does it cause any I/O pause or performance degradation during the snapshot? Why this matters: Snapshot performance directly affects backup windows and RPO. A snapshot that causes a 5-second I/O stun on a production database is unacceptable. (A snapshot manifest is sketched after these questions.)

  2. Clone 20 VMs simultaneously from a single golden image. How long does each clone take? Does clone performance degrade with concurrency? What storage backend optimizations are used (COW, thin provisioning)? Why this matters: Rapid provisioning at scale requires concurrent cloning without performance collapse. The answer reveals whether the storage backend can handle burst provisioning.
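For question 1, KubeVirt exposes snapshots through a dedicated CRD that coordinates CSI VolumeSnapshots and, via the guest agent, filesystem freeze. A sketch (names are illustrative; the API version depends on the KubeVirt release):

apiVersion: snapshot.kubevirt.io/v1beta1
kind: VirtualMachineSnapshot
metadata:
  name: db-vm-snap
  namespace: databases
spec:
  source:
    apiGroup: kubevirt.io
    kind: VirtualMachine
    name: db-vm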

Resource Governance

  1. Demonstrate namespace-based multi-tenancy with ResourceQuotas. If two namespaces are competing for resources on the same nodes, how does the platform handle contention? Is there an equivalent to VMware shares for proportional allocation? Why this matters: Proportional sharing is a core capability for financial institutions with multiple business units sharing infrastructure. If there is no equivalent, the team needs to understand the alternative approach.

  2. Show how to enforce that a VM cannot exceed 16 vCPUs and 64 GB RAM, and that the total allocation for the "development" team cannot exceed 200 vCPUs and 1 TB RAM. What error does a user see if they try to exceed the quota? Why this matters: Quota enforcement must be self-service-friendly. Users need clear error messages, not cryptic API errors.
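A sketch of the two enforcement layers for question 2 (namespace and object names are illustrative; note that for KubeVirt the limits apply to the virt-launcher pods that back the VMs):

# Per-namespace cap for the development team
apiVersion: v1
kind: ResourceQuota
metadata:
  name: dev-team-quota
  namespace: development
spec:
  hard:
    requests.cpu: "200"      # total vCPUs across the namespace
    requests.memory: 1Ti     # total memory across the namespace
---
# Per-VM (per-container) ceiling
apiVersion: v1
kind: LimitRange
metadata:
  name: dev-vm-max
  namespace: development
spec:
  limits:
    - type: Container
      max:
        cpu: "16"
        memory: 64Gi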

Affinity and Placement

  1. Configure a hard anti-affinity rule ensuring that 3 VMs of a database cluster never share the same physical host. Then, drain one of the three hosts. Does the drain succeed? Does the anti-affinity rule prevent placement, and if so, what is the error behavior? Why this matters: Anti-affinity + node drain can create scheduling deadlocks (if there are only 3 nodes and all 3 must have exactly one DB VM). The team needs to understand how the platform resolves these conflicts.
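A sketch of the hard anti-affinity rule for this scenario, expressed on the VM template (labels are illustrative):

spec:
  template:
    metadata:
      labels:
        app: db-cluster
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchLabels:
                  app: db-cluster
              topologyKey: kubernetes.io/hostname   # never two on one host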

GPU

  1. Demonstrate vGPU (time-sliced) with 4 VMs sharing a single GPU. Show GPU utilization metrics for each VM under concurrent load. Then live migrate one of the 4 VMs to another host with the same GPU model. What is the migration downtime? Why this matters: vGPU live migration is critical for maintaining the same maintenance workflow for GPU-accelerated VMs. If live migration is not supported, GPU nodes become maintenance liabilities.

  2. Show MIG partitioning: create 3 MIG instances on an A100, assign each to a different VM, and demonstrate that performance isolation holds under load (one VM running full compute should not affect the others). Then reconfigure the MIG partitions -- does this require stopping all VMs? Why this matters: MIG provides stronger isolation than time-sliced vGPU, but reconfiguration is disruptive. The team needs to understand the operational tradeoff.