VM Lifecycle Management
Why This Matters
The previous chapters covered hypervisor internals, the VMware baseline, and KubeVirt's architecture. This chapter moves from "how does the engine work" to "how do you operate virtual machines day-to-day at scale." Every topic here -- live migration, snapshots, templates, resource governance, placement rules, hot-add, GPU assignment -- maps to an operational workflow that the current VMware team executes routinely for 5,000+ VMs. If any replacement platform cannot replicate, improve on, or explicitly retire these workflows, the migration will stall.
Live migration is the single most scrutinized capability in any VMware replacement evaluation. vMotion is the foundational operation behind zero-downtime maintenance, DRS load balancing, and hardware lifecycle management. The team performs hundreds of vMotion operations per week -- many automatically via DRS, others manually for planned maintenance windows. Any replacement must demonstrate that live migration works reliably at scale, completes within acceptable time windows, and does not impose restrictions that make routine node maintenance painful.
Snapshots and clones are the foundation of backup integration, rapid provisioning, and development workflows. Templates and cloud-init drive the provisioning pipeline. Resource pools and quotas are how the organization enforces multi-tenancy and cost governance across business units. Affinity rules keep database clusters separated across failure domains and keep latency-sensitive VM pairs co-located. Hot-add lets the team respond to capacity spikes without scheduling downtime. GPU passthrough enables a growing set of AI/ML inference and VDI workloads.
Each section compares the VMware baseline with the KubeVirt/OVE and Hyper-V/Azure Local approaches, and notes implications for Swisscom ESC. The comparison tables at the end provide a summary, but the real evaluation depth is in the sections themselves.
Concepts
1. Live Migration (vMotion Equivalent)
Why Live Migration Is Non-Negotiable
Live migration is not a convenience feature -- it is a structural dependency of modern infrastructure operations. Without it:
- Patching requires downtime. Every hypervisor host OS update requires shutting down or cold-migrating all VMs, which means coordinated maintenance windows.
- Hardware maintenance requires downtime. Firmware updates, DIMM replacements, NIC upgrades -- all require VM evacuation.
- Load balancing is impossible. No automated rebalancing, no DRS equivalent.
- Cluster upgrades become rolling outages. Upgrading a 20-node cluster means 20 rounds of VM downtime.
At 5,000+ VMs, the organization performs planned maintenance on hypervisor hosts weekly. Without live migration that works reliably and completes within minutes, the operational overhead of maintenance windows alone would make a platform non-viable.
Migration Algorithms: Pre-Copy vs. Post-Copy
All modern live migration implementations use one of two fundamental algorithms (or a hybrid of both):
Pre-Copy Live Migration Algorithm
Source Host Target Host
+-----------------+ +-----------------+
| VM running | | |
| Memory: 64 GB | | |
+-----------------+ +-----------------+
Phase 1: Pre-copy iteration 0 (bulk transfer)
====================================================
| Copy ALL memory | ---- 64 GB ------> | Receives pages |
| pages to target | | into staging |
| VM continues | | memory |
| running, dirtying| | |
| pages | | |
Phase 2: Pre-copy iteration 1 (dirty page resend)
====================================================
| Track dirty | ---- 2 GB ------> | Overwrites dirty|
| pages since | | pages |
| iteration 0 | | |
| Send dirty pages| | |
Phase 3: Pre-copy iteration 2 (converging)
====================================================
| Fewer dirty | ---- 200 MB ----> | Overwrites dirty|
| pages this round| | pages |
Phase N: Final iteration (switchover)
====================================================
| PAUSE VM | ---- 5 MB -------> | Apply last |
| Send remaining | | dirty pages |
| dirty pages | | RESUME VM |
| Redirect clients| | VM now runs here|
+-----------------+ +-----------------+
| |
v v
VM deleted from VM running on
source after target
confirmation
Pre-copy transfers memory iteratively while the VM continues running on the source. Each iteration sends only the pages that were dirtied since the previous iteration. The dirty set shrinks with each pass (ideally). When the dirty set is small enough to be transferred within the maximum tolerable downtime (typically <500 ms), the VM is paused, the final dirty pages are sent, and execution resumes on the target.
The convergence problem: If the VM's dirty page rate exceeds the migration bandwidth, the dirty set never shrinks. A VM running an in-memory database that touches 50 GB/s of memory cannot converge over a 10 Gbps migration link (which transfers ~1.2 GB/s). Strategies for convergence:
- Auto-converge (QEMU): Progressively throttle vCPU execution on the source to reduce the dirty page rate. Trades VM performance during migration for convergence. KubeVirt supports this via the allowAutoConverge setting in MigrationPolicy.
- XBZRLE compression (QEMU): Compress dirty pages using XOR-Based Zero Run-Length Encoding before transfer. Effective for pages with small deltas between iterations. Reduces bandwidth but adds CPU overhead.
- Bandwidth increase: Use a dedicated 25 Gbps or 100 Gbps migration network.
- Post-copy fallback: Switch to post-copy after pre-copy fails to converge within a deadline.
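The first and last of these strategies map directly to MigrationPolicy fields covered later in this chapter. A minimal sketch that enables both for a set of write-heavy VMs (the policy name and the workload label are illustrative assumptions):

```yaml
# Sketch: enable auto-converge and the post-copy fallback for VMs that
# struggle to converge. The "workload: in-memory-db" label is hypothetical.
apiVersion: migrations.kubevirt.io/v1alpha1
kind: MigrationPolicy
metadata:
  name: heavy-writer-migrations
spec:
  allowAutoConverge: true       # throttle source vCPUs to force convergence
  allowPostCopy: true           # accept host-failure risk for bounded migration time
  completionTimeoutPerGiB: 300
  selectors:
    virtualMachineInstanceSelector:
      matchLabels:
        workload: in-memory-db
```

With allowPostCopy enabled, the switch to post-copy happens only after pre-copy misses its completion deadline, so well-behaved VMs still migrate entirely via pre-copy.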
Post-copy takes the opposite approach: the VM is paused on the source, minimal state (CPU registers, device state) is transferred to the target, and the VM resumes on the target immediately -- before most memory has been copied. Pages are faulted in on demand: when the VM on the target accesses a page that has not yet been transferred, it triggers a network page fault, and the hypervisor fetches the page from the source.
Post-Copy Live Migration
Source Host Target Host
+-----------------+ +-----------------+
| VM running | | |
| Memory: 64 GB | | |
+-----------------+ +-----------------+
Phase 1: Pause and transfer state
====================================================
| PAUSE VM | -- CPU state, --> | RESUME VM |
| Send CPU regs, | -- device state --> | immediately |
| device state | -- (small, ~MB) --> | with NO memory |
| | | pages local |
Phase 2: Background push + demand paging
====================================================
| Push remaining | ---- bulk -------> | VM runs, faults |
| pages in | | on missing pages|
| background | <-- page request -- | Requests pages |
| | -- page data ----> | on demand |
Phase 3: Complete
====================================================
| All pages sent | | All pages local |
| Source freed | | Migration done |
+-----------------+ +-----------------+
Post-copy advantages: Guaranteed convergence. The VM runs on the target from the start, so the total migration time is bounded regardless of dirty page rate. Total downtime for the switchover is typically <100 ms.
Post-copy disadvantages: The VM on the target runs with degraded performance until all pages are faulted in -- every page miss incurs a network round-trip. If the source host crashes during migration, the VM is lost (pages still on the source are unrecoverable). This makes post-copy dangerous for host failures that are the reason you are migrating in the first place.
QEMU/KVM supports both pre-copy and post-copy, and a hybrid mode where pre-copy runs first, and if it fails to converge, the system switches to post-copy.
VMware vMotion uses pre-copy exclusively (as of vSphere 8.0). VMware has historically not implemented post-copy because the risk of VM loss during migration is considered unacceptable for enterprise workloads.
Hyper-V Live Migration uses pre-copy with SMB-based storage migration for non-shared-storage scenarios.
KubeVirt Live Migration: How It Works
KubeVirt's live migration is orchestrated through Kubernetes primitives. The migration creates a new pod on the target node, migrates the QEMU process's state between the source and target pods, and then deletes the source pod. This is fundamentally different from vMotion, where both source and target are managed by agents on ESXi hosts -- in KubeVirt, the migration is a Kubernetes-native pod lifecycle event.
KubeVirt Live Migration Flow
Step 1: Migration initiated
===============================================================
User creates VirtualMachineInstanceMigration (VMIM) CR
(or node drain triggers automatic migration)
$ kubectl create -f migration.yaml
OR
$ virtctl migrate my-vm
OR
$ kubectl drain node-1 --delete-emptydir-data
|
v
virt-controller detects VMIM for VMI "my-vm"
|
v
Step 2: Target pod creation
===============================================================
virt-controller creates a NEW virt-launcher pod on the
target node with the same spec as the source pod:
- Same CPU/memory requests and limits
- Same PVC mounts (shared storage required for RWX,
or storage migration for RWO)
- Same network configuration
- Migration-specific annotations
+--------------------+ +--------------------+
| Source Node | | Target Node |
| +----------------+ | | +----------------+ |
| | virt-launcher | | | | virt-launcher | |
| | Pod (running) | | | | Pod (pending) | |
| | | | | | | |
| | QEMU/KVM | | | | QEMU/KVM | |
| | (VM active) | | | | (waiting for | |
| | | | | | incoming | |
| | | | | | migration) | |
| +----------------+ | | +----------------+ |
+--------------------+ +--------------------+
Step 3: Migration handshake
===============================================================
virt-handler on target node:
- Prepares QEMU on target to receive migration
- Opens a TCP port for incoming migration data
- Signals readiness to virt-controller
virt-handler on source node:
- Receives migration target address
- Instructs libvirtd to initiate migration to target
Step 4: Memory transfer (pre-copy iterations)
===============================================================
+--------------------+ +--------------------+
| Source Node | | Target Node |
| +----------------+ | | +----------------+ |
| | QEMU | | migration | | QEMU | |
| | (VM running, | | ==========> | | (receiving | |
| | sending pages)| | TCP stream | | memory pages) | |
| | | | (port 49152) | | | |
| +----------------+ | | +----------------+ |
+--------------------+ +--------------------+
- Pre-copy iterations: bulk copy, dirty page resends
- Auto-converge throttles vCPUs if dirty rate too high
- Progress reported to VMIM status (percentage, bandwidth)
Step 5: Switchover
===============================================================
- Source QEMU pauses VM
- Final dirty pages + CPU state + device state sent
- Target QEMU resumes VM
- Typical pause: 50-200 ms for well-converging VMs
Step 6: Cleanup
===============================================================
- virt-controller updates VMI to point to target node
- Source virt-launcher pod is terminated
- VMIM status set to Succeeded
- If migration fails: source VM continues running,
target pod is cleaned up, VMIM status set to Failed
+--------------------+ +--------------------+
| Source Node | | Target Node |
| | | +----------------+ |
| (pod deleted) | | | virt-launcher | |
| | | | Pod (running) | |
| | | | | |
| | | | QEMU/KVM | |
| | | | (VM active) | |
| | | +----------------+ |
+--------------------+ +--------------------+
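The migration.yaml referenced in Step 1 can be as small as the following sketch; the CR name and namespace are illustrative, and spec.vmiName must match the running VMI:

```yaml
# Sketch of a VirtualMachineInstanceMigration (VMIM) CR.
# Creating it triggers the flow shown above for the VMI "my-vm".
apiVersion: kubevirt.io/v1
kind: VirtualMachineInstanceMigration
metadata:
  name: my-vm-migration-001
  namespace: production
spec:
  vmiName: my-vm
```

virt-controller reports progress in this object's status, which is what virtctl migrate polls under the hood.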
Migration Policies in KubeVirt
KubeVirt provides MigrationPolicy as a cluster-scoped CRD that governs migration behavior. This is the equivalent of vMotion network settings and resource controls in vSphere.
apiVersion: migrations.kubevirt.io/v1alpha1
kind: MigrationPolicy
metadata:
  name: production-migration-policy
spec:
  # Bandwidth limit per migration (prevents saturating the network)
  bandwidthPerMigration: 1Gi
  # Throttle source vCPUs if the dirty page rate prevents convergence.
  # (Per-node concurrency limits such as parallelOutboundMigrationsPerNode
  # are cluster-wide settings in the KubeVirt CR, not per-policy.)
  allowAutoConverge: true
  # Timeout for the migration to complete before it is cancelled
  completionTimeoutPerGiB: 150
  # Allow post-copy migration as a fallback if pre-copy does not converge
  allowPostCopy: false
  # Selector: which VMIs this policy applies to
  # Uses label selectors on both VMI and namespace
  selectors:
    virtualMachineInstanceSelector:
      matchLabels:
        migration-policy: production
    namespaceSelector:
      matchLabels:
        environment: production
Key MigrationPolicy parameters:
| Parameter | Purpose | Default |
|---|---|---|
| bandwidthPerMigration | Cap migration throughput per VM to avoid saturating the network. | Unlimited |
| allowAutoConverge | Throttle source vCPUs if dirty page rate prevents convergence. | false |
| completionTimeoutPerGiB | Seconds allowed per GiB of VM memory before migration is cancelled. A 64 GB VM with a value of 150 gets 64 * 150 = 9600 seconds (~2.7 hours). | 150 |
| allowPostCopy | Allow fallback to post-copy if pre-copy fails to converge. Risky -- if source host fails during post-copy, VM is lost. | false |
When no MigrationPolicy matches a VMI, KubeVirt falls back to the cluster-wide defaults configured in the KubeVirt CR under spec.configuration.migrations.
Network Requirements for Live Migration
Live migration transfers tens of gigabytes of data between hosts. The migration network must be sized appropriately:
| VM Memory | 10 Gbps (~1.2 GB/s) | 25 Gbps (~3.1 GB/s) | 100 Gbps (~12.5 GB/s) |
|---|---|---|---|
| 8 GB | ~7 seconds | ~3 seconds | <1 second |
| 32 GB | ~27 seconds | ~10 seconds | ~3 seconds |
| 64 GB | ~54 seconds | ~21 seconds | ~5 seconds |
| 256 GB | ~213 seconds | ~82 seconds | ~20 seconds |
| 1 TB | ~853 seconds | ~330 seconds | ~82 seconds |
These are theoretical minimums for a single pre-copy pass with zero dirty pages. Real-world times are 2-5x higher due to dirty page resends, convergence delays, and protocol overhead.
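The 2-5x factor can be made concrete with a simplified pre-copy model (an idealization that ignores protocol overhead and assumes a constant dirty rate). Let M be VM memory, B the migration bandwidth, and r the sustained dirty-page rate with r < B. Iteration 0 transfers all of M; each later iteration resends the pages dirtied during the previous one, giving a geometric series:

$$
t_0 = \frac{M}{B}, \qquad t_i = \frac{r\,t_{i-1}}{B}, \qquad
T = \sum_{i=0}^{\infty} t_i = \frac{M}{B} \cdot \frac{1}{1 - r/B} = \frac{M}{B - r}
$$

For the 64 GB row on 10 Gbps (B ≈ 1.2 GB/s), a dirty rate of r = 0.6 GB/s doubles the transfer time to roughly 107 seconds; as r approaches B, T diverges, which is exactly the non-convergence case that auto-converge, XBZRLE, and post-copy address.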
Best practice: Dedicate a VLAN and physical NICs (or SR-IOV virtual functions) for migration traffic. In KubeVirt, this means defining a dedicated NetworkAttachmentDefinition for migration and configuring it in the KubeVirt CR:
apiVersion: kubevirt.io/v1
kind: KubeVirt
metadata:
  name: kubevirt
  namespace: openshift-cnv
spec:
  configuration:
    migrations:
      network: migration-network  # Name of the NAD for migration traffic
      parallelMigrationsPerCluster: 20
      parallelOutboundMigrationsPerNode: 5
      bandwidthPerMigration: 1Gi
Hyper-V Live Migration
Hyper-V provides two migration mechanisms:
Live Migration transfers a running VM between Hyper-V hosts with minimal downtime. It uses the same pre-copy algorithm as KVM/QEMU -- iterative memory copy with dirty page tracking. Hyper-V's implementation transfers memory pages via a TCP connection between the source and target hosts. For VMs using shared storage (Cluster Shared Volumes, SMB shares, or SAN LUNs), only memory and CPU state must be transferred. For VMs on local storage, Hyper-V can simultaneously migrate the storage (Storage Live Migration) by mirroring writes to both source and target.
Quick Migration is a faster but not zero-downtime approach: the VM is saved to disk (all memory written to a save-state file), the VM definition is moved to the target host, and the VM is restored from the save-state file. Downtime equals the time to write + read memory state from storage. For a 64 GB VM on fast shared storage, this is typically 30-90 seconds. Quick Migration exists as a fallback when Live Migration fails or is not configured.
Azure Local specifics: Azure Local clusters use Cluster Shared Volumes (CSV) backed by Storage Spaces Direct (S2D). Live Migration within an Azure Local cluster operates over a dedicated migration network. Azure Local also supports planned VM mobility across Azure Local instances in different locations, but this requires Azure Arc orchestration and is not transparent live migration -- it involves VM shutdown and restart.
Comparison to VMware vMotion
| Aspect | VMware vMotion | KubeVirt Live Migration | Hyper-V Live Migration |
|---|---|---|---|
| Algorithm | Pre-copy only | Pre-copy (default), post-copy (optional), auto-converge | Pre-copy (Live), save/restore (Quick) |
| Shared storage required? | Yes (or use Storage vMotion for combined migration) | Yes for RWX volumes. RWO volumes require storage migration via CDI. | No -- Storage Live Migration supported |
| Concurrent migrations per host | Default 4 outbound, 4 inbound (configurable to 8) | Configurable per cluster (default 2 outbound/node) | Default 2 simultaneous migrations |
| Migration network | Dedicated vmknic on a vMotion-tagged port group | Dedicated NetworkAttachmentDefinition (Multus) | Dedicated SMB or TCP network |
| Encryption in transit | Supported (AES-256 since vSphere 6.5) | TLS-encrypted by default between virt-handler instances | IPsec or SMB encryption |
| Maximum VM size | No hard limit (tested with multi-TB VMs) | No hard limit (but convergence at >256 GB requires tuning) | No hard limit |
| Cross-cluster migration | Requires Enhanced vMotion Compatibility (EVC) for CPU compatibility | Same requirement: CPU model must be compatible (use cpu-model: host-model) | Processor compatibility mode |
| Downtime (typical) | <1 second for well-converging VMs | 50-500 ms (comparable to vMotion) | <1 second for Live, 30-90s for Quick |
| Automated triggering | DRS (every 5 minutes) | No built-in DRS equivalent. Descheduler is available but less mature. | No built-in DRS equivalent |
What is the same: The core algorithm (pre-copy with iterative dirty page convergence) is identical across all three. The physics are the same -- migration time is governed by VM memory size, dirty page rate, and available bandwidth.
What is different: KubeVirt wraps migration in Kubernetes pod lifecycle semantics. A KubeVirt live migration is a pod-to-pod event: a new pod is created on the target, state is transferred, the old pod is deleted. This means Kubernetes scheduling rules (affinity, taints, tolerations, resource requests) all apply to migration target selection. In vMotion, target host selection is done by DRS or manually by an admin using vCenter's host picker.
What is worse: KubeVirt does not have a DRS equivalent. There is no built-in controller that continuously monitors cluster resource utilization and automatically migrates VMs to balance load. The Kubernetes Descheduler project can evict pods (including virt-launcher pods) from overloaded nodes, which triggers re-scheduling and migration, but it is less sophisticated than DRS. It does not compute migration cost vs. benefit, it does not avoid thrashing, and it does not coordinate multiple migrations to achieve a target cluster balance. This is a significant gap for an organization accustomed to fully automated DRS.
What is better: KubeVirt's node drain (kubectl drain) is arguably more predictable than vMotion-based maintenance mode. When you drain a node, Kubernetes systematically evicts all pods (including VMs) according to PodDisruptionBudgets, which lets you guarantee that at most N VMs from a given set are simultaneously migrating. vSphere maintenance mode moves VMs in parallel but does not have a PDB equivalent -- it relies on DRS rules and administrator judgment.
Migration at Scale: Node Drain and Rolling Updates
For an organization running 50+ VMs per node across 100+ nodes, the primary live migration scenario is node drain for maintenance. The workflow:
Node Drain for Maintenance (KubeVirt)
Cluster: 100 worker nodes, ~5000 VMs (50 VMs/node)
Goal: Patch node-42 (kernel update, firmware update)
Step 1: Cordon node-42 (prevent new VM scheduling)
$ kubectl cordon node-42
-> Node marked as unschedulable
-> Existing VMs continue running
Step 2: Drain node-42 (migrate VMs away)
$ kubectl drain node-42 \
--delete-emptydir-data \
--ignore-daemonsets \
--pod-selector=kubevirt.io/domain \
--timeout=3600s
-> Kubernetes evicts virt-launcher pods one at a time
(or N at a time, governed by PodDisruptionBudgets)
-> Each eviction triggers a live migration:
1. virt-controller creates new virt-launcher pod on another node
2. QEMU pre-copy migration transfers memory
3. Switchover occurs
4. Old pod is deleted
-> 50 VMs x ~60 seconds average = ~50 minutes total drain time
(with 5 concurrent migrations: ~10 minutes)
Step 3: Perform maintenance
-> SSH into node-42, apply patches, reboot
-> 5-10 minutes
Step 4: Uncordon node-42
$ kubectl uncordon node-42
-> Node is schedulable again
-> No automatic VM rebalancing (no DRS)
-> VMs stay where they were migrated to
-> Manual rebalancing or Descheduler needed
PodDisruptionBudgets (PDBs) are the mechanism for controlling migration parallelism. For a set of VMs that should never all be down at once (e.g., a 3-node database cluster), a PDB ensures that at most one VM from the set is migrating at any time:
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: database-cluster-pdb
  namespace: database-tier
spec:
  minAvailable: 2  # At least 2 of 3 must be running at all times
  selector:
    matchLabels:
      app: oracle-rac
      tier: production
2. VM Snapshots & Clones
Snapshot Mechanisms Compared
A VM snapshot captures the state of a VM's disks (and optionally memory and device state) at a point in time, enabling rollback. The mechanism differs fundamentally across platforms.
VMware Snapshots (VMDK):
VMware snapshots use a redo-log / delta-disk approach. When a snapshot is taken, the base VMDK becomes read-only, and a new delta file (a "child" VMDK with a -delta.vmdk suffix) captures all subsequent writes. Each additional snapshot adds another layer in the chain.
VMware Snapshot Chain
+------------------+ +------------------+ +------------------+
| Base VMDK | | Snapshot 1 | | Snapshot 2 |
| (flat disk, |<---| Delta VMDK |<---| Delta VMDK |
| read-only after | | (writes since | | (writes since |
| first snapshot) | | base, now also | | snap 1) |
| | | read-only) | | <-- ACTIVE |
| 100 GB | | 5 GB | | 2 GB |
+------------------+ +------------------+ +------------------+
Read path for VM:
- Read block X:
1. Check Snapshot 2 delta -> block not here
2. Check Snapshot 1 delta -> block not here
3. Read from Base VMDK -> found
Write path for VM:
- Write block Y:
1. Write to Snapshot 2 delta (active disk)
Revert to Snapshot 1:
- Delete Snapshot 2 delta
- Snapshot 1 delta becomes active
Delete Snapshot 1 (consolidation):
- Merge Snapshot 1 delta into Base VMDK
- I/O intensive operation, can stun VM briefly
Performance degradation: Each snapshot layer adds a lookup to the read path. With 5+ snapshots, read I/O latency increases measurably because the storage subsystem must check each delta file before falling through to the base. Snapshot consolidation (merging deltas back into the base) is an I/O-intensive operation that can cause VM stuns (brief freezes of 1-5 seconds) during the final commit phase.
VMware best practice: do not leave snapshots running for more than 24-72 hours. Snapshots are not backups -- they are temporary state for specific operations (patch testing, upgrade rollback).
KubeVirt Snapshots (VolumeSnapshot via CSI): KubeVirt delegates VM snapshots to the Kubernetes VolumeSnapshot API, which in turn calls the CSI (Container Storage Interface) driver's snapshot capability. The snapshot is taken at the storage layer, not at the hypervisor layer.
apiVersion: snapshot.kubevirt.io/v1beta1
kind: VirtualMachineSnapshot
metadata:
  name: my-vm-snap-before-patch
  namespace: production
spec:
  source:
    apiGroup: kubevirt.io
    kind: VirtualMachine
    name: my-database-vm
When this resource is created:
- KubeVirt's snapshot controller freezes the VM's filesystems (via the QEMU Guest Agent, if installed) to achieve application consistency.
- For each PVC attached to the VM, the controller creates a VolumeSnapshot object.
- The CSI driver creates a point-in-time copy of the underlying storage volume (mechanism depends on the storage backend -- thin-provisioned clone, copy-on-write snapshot, or hardware array snapshot).
- A VirtualMachineSnapshotContent object is created to track the relationship between the VM snapshot and the underlying volume snapshots.
- Filesystem thaw is issued to the guest.
KubeVirt Snapshot Architecture
+-------------------+
| VirtualMachine- | Creates one VolumeSnapshot
| Snapshot | per PVC attached to the VM
| (KubeVirt CR) |
+-------------------+
|
v
+-------------------+ +-------------------+
| VolumeSnapshot | | VolumeSnapshot |
| (for PVC-boot) | | (for PVC-data) |
| (Kubernetes CSI) | | (Kubernetes CSI) |
+-------------------+ +-------------------+
| |
v v
+-------------------+ +-------------------+
| CSI Driver | | CSI Driver |
| creates storage- | | creates storage- |
| level snapshot | | level snapshot |
| (e.g., Ceph RBD | | (e.g., Ceph RBD |
| snapshot, LVM | | snapshot, LVM |
| thin snapshot) | | thin snapshot) |
+-------------------+ +-------------------+
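Rolling back uses the companion VirtualMachineRestore CR. A sketch that reverts the database VM to the snapshot created earlier (the target VM is typically stopped while the restore runs; names match the snapshot example above):

```yaml
# Sketch: restore the VM from the VirtualMachineSnapshot taken before patching
apiVersion: snapshot.kubevirt.io/v1beta1
kind: VirtualMachineRestore
metadata:
  name: my-vm-restore
  namespace: production
spec:
  target:
    apiGroup: kubevirt.io
    kind: VirtualMachine
    name: my-database-vm
  virtualMachineSnapshotName: my-vm-snap-before-patch
```

The restore controller swaps the VM's PVCs for new volumes provisioned from the VolumeSnapshots, leaving the snapshot itself intact for repeated rollbacks.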
Hyper-V Checkpoints: Hyper-V uses the term "checkpoint" (previously "snapshot"). Two types exist:
- Standard checkpoints: Capture the VM's memory state, CPU state, and a differencing disk. The VM can be reverted to this exact running state. Equivalent to a VMware memory snapshot.
- Production checkpoints (default since Windows Server 2016): Use Volume Shadow Copy Service (VSS) inside the guest to create an application-consistent snapshot of the disks only, without memory state. The VM reverts to a clean boot with consistent disks. This is the safer default for production workloads.
Application-Consistent vs. Crash-Consistent Snapshots
This distinction is critical for database workloads:
| Type | Mechanism | Data Integrity | Use Case |
|---|---|---|---|
| Crash-consistent | Point-in-time copy of disk blocks with no guest coordination. Equivalent to pulling the power cord. | Filesystem journal replay needed on restore. Database may need recovery. Possible data loss of uncommitted transactions. | Quick snapshots when guest agent is not available. |
| Application-consistent | Guest agent freezes I/O (Linux: fsfreeze/VSS, Windows: VSS). Pending writes are flushed to disk. Database quiesces. Then snapshot is taken. | Filesystem is clean. Database is in a consistent state. No recovery needed on restore. | Production databases, Exchange, SQL Server. |
KubeVirt achieves application consistency via the QEMU Guest Agent (qemu-ga). If the guest agent is installed and running, KubeVirt's snapshot controller issues guest-fsfreeze-freeze before the snapshot and guest-fsfreeze-thaw after. Without the guest agent, the snapshot is crash-consistent only.
Snapshot Chains and Performance Degradation
The performance impact of snapshot chains varies by platform:
VMware: Each snapshot layer is a separate delta VMDK file. Reads must traverse the chain from newest to oldest to find the correct block. With >3 snapshots, IOPS can degrade by 10-30%. VMware warns against running more than 32 snapshots (hard limit), but performance degradation becomes noticeable at 3-5 snapshots.
KubeVirt (Ceph RBD): Ceph RBD snapshots use copy-on-write at the RADOS object level. Read performance is not degraded by snapshots because snapshot metadata is tracked in the OSD's object store, not in a chain of files. However, the space overhead of maintaining many snapshots is real -- every block modified since the snapshot was taken consumes additional space. The CSI driver's snapshot implementation varies by storage backend; the performance characteristics are ultimately a property of the storage, not of KubeVirt.
Hyper-V (VHDX): Differencing disks (AVHDX files) behave similarly to VMware delta VMDKs. Each checkpoint adds a layer. Consolidation ("merge") is I/O-intensive.
Cloning: Full Clone vs. Linked Clone
Full clone: A complete, independent copy of a VM and all its disks. No dependency on the source VM. Time to create = time to copy all disk data. A 200 GB VM takes 200 GB of storage and several minutes to clone.
Linked clone (VMware) / Differencing disk clone (Hyper-V): A new VM shares the base disk with the parent VM (read-only) and has its own delta disk for writes. Creating a linked clone is nearly instantaneous because no data is copied. However, the clone depends on the parent's base disk -- if the parent is deleted, the clone is broken.
KubeVirt cloning uses CSI clone operations. The VirtualMachine CR can specify a DataVolumeTemplate with a source PVC, and CDI (Containerized Data Importer) performs the clone via the CSI driver's clone capability. If the CSI driver supports efficient cloning (e.g., Ceph RBD rbd clone using copy-on-write), the clone is fast and space-efficient; CDI calls this a "smart clone." If the driver does not support native cloning (or the clone crosses a namespace boundary that rules it out), CDI falls back to a "host-assisted clone": creating a new PVC and copying the data block-by-block.
apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
  name: cloned-vm
spec:
  running: true
  dataVolumeTemplates:
    - metadata:
        name: cloned-vm-rootdisk
      spec:
        source:
          pvc:
            namespace: golden-images
            name: rhel9-golden-image
        storage:
          resources:
            requests:
              storage: 50Gi
          storageClassName: ocs-storagecluster-ceph-rbd
  template:
    # ... VM spec ...
3. VM Templates & Rapid Provisioning
The Provisioning Pipeline
Rapid provisioning -- creating a ready-to-use VM in under 5 minutes from request to running workload -- requires three things:
- A pre-built disk image (the template) that does not need to be installed from an ISO.
- Fast disk copy or clone to create a new VM's disk from the template.
- First-boot customization (hostname, IP, SSH keys, users) that runs automatically without human intervention.
VM Provisioning Flow
+-----------+ +-----------+ +-----------+ +-----------+
| Golden | | Clone/ | | First-boot| | VM Ready |
| Image |---->| Copy |---->| Config |---->| for Use |
| (template)| | (new disk)| | (cloud- | | |
| | | | | init) | | |
+-----------+ +-----------+ +-----------+ +-----------+
| | | |
v v v v
Maintained by Seconds (COW) 30-90 seconds Total:
platform team Minutes (full) (SSH key inject, 1-5 minutes
Updated monthly network config, with COW clone
package install) 5-15 minutes
with full copy
KubeVirt Provisioning Mechanisms
containerDisk: A VM disk image baked into an OCI container image and stored in a container registry. When a VM starts, the container image is pulled and the disk is extracted. The disk is ephemeral -- changes are lost when the VM is stopped. Useful for stateless VMs or as boot sources for templates.
# containerDisk: disk image baked into a container image
spec:
  template:
    spec:
      volumes:
        - name: rootdisk
          containerDisk:
            image: registry.internal.bank.ch/vm-images/rhel9-base:9.4
            imagePullPolicy: IfNotPresent
DataVolume templates: The primary mechanism for persistent VM provisioning. A DataVolumeTemplate in the VirtualMachine spec instructs CDI to create a PVC and populate it from a source (container registry, HTTP URL, existing PVC, or upload).
# DataVolume template: creates a persistent disk from a source
spec:
  dataVolumeTemplates:
    - metadata:
        name: webserver-01-rootdisk
      spec:
        source:
          pvc:
            namespace: golden-images
            name: rhel9-golden-image-20260401
        storage:
          resources:
            requests:
              storage: 50Gi
          storageClassName: ocs-storagecluster-ceph-rbd
Golden image PVCs with boot sources: OpenShift Virtualization (OVE) includes automatic boot source management. For common operating systems (RHEL, CentOS, Fedora, Windows Server), OVE can automatically download and maintain golden image PVCs that are kept up to date on a configurable schedule. These PVCs serve as the source for DataVolume clones.
Golden Image Pipeline (OVE)
+------------------+ +-----------------+ +------------------+
| Upstream Image | | CDI Import | | Golden Image PVC |
| Source |---->| CronJob |---->| (kept current) |
| (Red Hat CDN, | | (runs weekly) | | |
| internal | | | | Namespace: |
| registry) | | | | openshift- |
+------------------+ +-----------------+ | virtualization |
+------------------+
|
Clone on VM creation
|
v
+------------------+
| VM PVC |
| (independent |
| copy) |
+------------------+
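The weekly import CronJob in the diagram corresponds to CDI's DataImportCron resource; a minimal sketch, with the registry URL, schedule, and namespace as illustrative assumptions:

```yaml
apiVersion: cdi.kubevirt.io/v1beta1
kind: DataImportCron
metadata:
  name: rhel9-image-cron
  namespace: openshift-virtualization-os-images   # illustrative namespace
spec:
  schedule: "0 2 * * 1"       # weekly, Monday 02:00
  managedDataSource: rhel9    # DataSource that VM clones reference
  importsToKeep: 2            # retain the two most recent imports
  template:
    spec:
      source:
        registry:
          url: "docker://registry.internal.bank.ch/vm-images/rhel9-golden"
      storage:
        resources:
          requests:
            storage: 30Gi
```

CDI re-imports on the schedule and garbage-collects outdated golden image PVCs, so new VMs always clone from a current source.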
Provisioning time determinants:
| Factor | Impact | Optimization |
|---|---|---|
| Disk image size | Full copy time is proportional to image size | Keep golden images minimal (5-10 GB) |
| Clone mechanism | COW clone (Ceph rbd clone, LVM thin clone) is near-instant. Full copy takes minutes. | Use a storage backend that supports CSI clone. |
| Image pull (containerDisk) | First pull downloads from registry. Subsequent pulls use node cache. | Pre-pull images on all nodes via DaemonSet. |
| Cloud-init execution | 30-90 seconds for typical first-boot configuration. | Minimize cloud-init modules. Use baked images where possible. |
| Guest OS boot time | UEFI boot + GRUB + kernel + systemd = 10-60 seconds depending on OS. | Use RHEL minimal or tuned images. |
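The pre-pull optimization from the table can be sketched as a DaemonSet that forces every node to pull the containerDisk image; the image and namespace names are assumptions carried over from the earlier example, and the no-op entrypoint is an assumption too (containerDisk images built FROM scratch may need a purpose-built pre-puller):

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: prepull-rhel9-base
  namespace: vm-images              # illustrative namespace
spec:
  selector:
    matchLabels:
      app: prepull-rhel9-base
  template:
    metadata:
      labels:
        app: prepull-rhel9-base
    spec:
      initContainers:
        - name: pull
          image: registry.internal.bank.ch/vm-images/rhel9-base:9.4
          command: ["true"]         # exit immediately; the image pull is the point
      containers:
        - name: pause
          image: registry.k8s.io/pause:3.9   # minimal long-running container
```

After the first rollout, new VMs using this containerDisk start from the node-local image cache instead of pulling from the registry.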
Hyper-V Templates
Hyper-V uses a traditional template approach:
- Build a reference VM with the desired OS, patches, and configuration.
- Sysprep the VM (generalize it -- remove machine-specific identifiers like SID, hostname, hardware drivers).
- Export the VM as a template or simply copy the VHDX file.
- Deploy by creating a new VM using the template VHDX as a base (full copy or differencing disk).
Azure Local integrates with Azure Marketplace images and custom VM images stored in Azure Arc resource bridge. Provisioning via the Azure portal or Azure CLI creates VMs from these images. The provisioning pipeline is comparable to Azure cloud VM creation, with the added latency of local disk operations.
4. Cloud-init / Ignition
Cloud-init
Cloud-init is the industry-standard tool for Linux VM first-boot customization. Originally developed for Amazon EC2, it is now supported by every major cloud provider and hypervisor. Cloud-init runs during the first boot of a VM (and optionally on subsequent boots) to configure the system based on metadata and user-data provided by the platform.
Cloud-init datasources: Cloud-init supports multiple datasources -- the mechanism by which it receives configuration data:
| Datasource | How it works | Used by |
|---|---|---|
| NoCloud | Reads from a virtual CD-ROM (ISO) or a local seed directory containing meta-data and user-data files. | KubeVirt (cloudInitNoCloud), libvirt, manual setups |
| ConfigDrive | Reads from a configuration drive (virtual disk with a specific label). | OpenStack, KubeVirt (cloudInitConfigDrive) |
| Azure | Queries Azure IMDS (Instance Metadata Service) over HTTP at 169.254.169.254. | Azure, Azure Local |
| EC2 | Queries EC2 metadata service over HTTP. | AWS |
| GCE | Queries GCE metadata service. | Google Cloud |
| VMware (open-vm-tools) | Reads from VMware's GuestInfo variables or OVF properties. | VMware vSphere |
Cloud-init modules: Cloud-init processes configuration in phases (init, config, final). Common modules used in enterprise VM provisioning:
- `set_hostname` / `update_hostname` -- Set the VM's hostname
- `ssh` -- Install SSH keys for authorized users
- `users-groups` -- Create user accounts and groups
- `write_files` -- Write arbitrary files to the filesystem
- `runcmd` -- Execute shell commands
- `package_update_upgrade_install` -- Install packages
- `ntp` -- Configure time synchronization
- `resolv_conf` -- Configure DNS
- `ca_certs` -- Install custom CA certificates (critical for enterprise PKI)
- `disk_setup` / `mounts` -- Partition and mount additional disks
Cloud-init in KubeVirt:
KubeVirt provides two cloud-init volume types:
# Option 1: cloudInitNoCloud (most common)
# Generates an ISO image attached as a virtual CD-ROM
spec:
template:
spec:
volumes:
- name: cloudinit
cloudInitNoCloud:
userData: |
#cloud-config
hostname: webserver-01
fqdn: webserver-01.prod.bank.ch
users:
- name: ansible
ssh_authorized_keys:
- ssh-ed25519 AAAA... ansible-automation@bank.ch
sudo: ALL=(ALL) NOPASSWD:ALL
groups: wheel
shell: /bin/bash
write_files:
- path: /etc/pki/ca-trust/source/anchors/bank-root-ca.pem
content: |
-----BEGIN CERTIFICATE-----
MIIFxTCCA62gAwIBAgIUOeji...
-----END CERTIFICATE-----
runcmd:
- update-ca-trust
- systemctl enable --now qemu-guest-agent
- hostnamectl set-hostname webserver-01.prod.bank.ch
packages:
- qemu-guest-agent
- chrony
ntp:
servers:
- ntp1.bank.ch
- ntp2.bank.ch
networkData: |
version: 2
ethernets:
eth0:
addresses:
- 10.100.50.21/24
gateway4: 10.100.50.1
nameservers:
addresses:
- 10.100.1.10
- 10.100.1.11
search:
- prod.bank.ch
- bank.ch
# Option 2: cloudInitConfigDrive
# Uses a config-drive volume (OpenStack-compatible)
spec:
template:
spec:
volumes:
- name: cloudinit
cloudInitConfigDrive:
userData: |
#cloud-config
hostname: webserver-02
# ... same content as above ...
The choice between NoCloud and ConfigDrive depends on the guest OS image's cloud-init configuration. Most enterprise Linux images support both. NoCloud is simpler and more commonly used with KubeVirt.
Referencing Secrets: For sensitive data (passwords, private keys), the userData and networkData fields can reference Kubernetes Secrets instead of inline content:
volumes:
- name: cloudinit
cloudInitNoCloud:
userDataSecretRef:
name: my-vm-cloud-init-secret
networkDataSecretRef:
name: my-vm-network-config
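A minimal sketch of the referenced Secret, assuming KubeVirt's documented convention that the user data lives under a `userdata` key (names and content are illustrative):

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: my-vm-cloud-init-secret
  namespace: prod-tier1        # illustrative namespace
type: Opaque
stringData:
  userdata: |
    #cloud-config
    users:
      - name: ansible
        ssh_authorized_keys:
          - ssh-ed25519 AAAA... ansible-automation@bank.ch
```

This keeps credentials out of the VirtualMachine manifest and lets standard Kubernetes RBAC and secret-management tooling govern access.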
Ignition
Ignition is a first-boot provisioning tool developed by CoreOS (now part of Red Hat). Unlike cloud-init, which is a multi-phase configuration management tool, Ignition runs exactly once during the initramfs phase -- before the root filesystem is mounted -- and writes all configuration directly to disk. If Ignition fails, the machine does not boot.
When to use which:
| Tool | Used for | Characteristics |
|---|---|---|
| Cloud-init | Guest VMs (RHEL, Ubuntu, Windows) | Multi-phase, idempotent, forgiving. Runs during normal boot. Can be re-triggered. |
| Ignition | Fedora CoreOS, RHCOS (OVE node OS) | Single-shot, early-boot (initramfs). Writes files, systemd units, users. Machine fails to boot if Ignition fails. |
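For comparison, a minimal Butane sketch (Butane is the human-friendly YAML that transpiles into the JSON Ignition actually consumes; all values are illustrative):

```yaml
variant: fcos
version: 1.5.0
storage:
  files:
    - path: /etc/hostname
      mode: 0644
      contents:
        inline: node-01
passwd:
  users:
    - name: core
      ssh_authorized_keys:
        - ssh-ed25519 AAAA... admin@bank.ch
```

Because Ignition runs once in the initramfs, everything here is written to disk before the first real boot; there is no later reconfiguration pass as with cloud-init.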
In the context of OVE:
- Ignition configures the OpenShift cluster nodes themselves (the infrastructure that runs KubeVirt).
- Cloud-init configures the guest VMs running on top of KubeVirt.
These are separate concerns. The evaluation team does not need to write Ignition configs for guest VMs unless those guests are running Fedora CoreOS or RHCOS, which is rare for enterprise application VMs.
5. Resource Pools / Quotas
VMware Resource Pools vs. Kubernetes Quotas
In VMware, resource pools are hierarchical containers within a cluster that partition CPU and memory using three controls: shares (relative priority during contention), reservations (guaranteed minimum), and limits (hard ceiling). Resource pools can be nested, and their allocations are relative to their parent pool.
In Kubernetes (and therefore KubeVirt), the equivalent concept is a combination of Namespaces, ResourceQuotas, and LimitRanges. The mapping is not 1:1, but the intent is the same: prevent one team/project/tenant from consuming all cluster resources.
VMware Resource Pool vs. Kubernetes Namespace Hierarchy
VMware: Kubernetes (KubeVirt):
======== =====================
Cluster Cluster
| |
+-- RP: Production +-- Namespace: prod-tier1
| | Reservation: 180 GHz CPU | ResourceQuota:
| | Reservation: 1.5 TB RAM | requests.cpu: 180
| | | requests.memory: 1.5Ti
| +-- RP: Tier-1 | limits.cpu: 200
| | Reservation: 120 GHz | limits.memory: 1.8Ti
| | +-- DB VMs | count/virtualmachines: 50
| | |
| +-- RP: Tier-2 +-- Namespace: prod-tier2
| Reservation: 60 GHz | ResourceQuota:
| +-- Web VMs | requests.cpu: 60
| | requests.memory: 500Gi
+-- RP: Development |
| Limit: 64 GHz +-- Namespace: development
| Limit: 512 GB | ResourceQuota:
| No reservation | limits.cpu: 64
| +-- Dev VMs | limits.memory: 512Gi
| | requests.cpu: 0 (no guarantee)
+-- RP: Infrastructure |
Reservation: 12 GHz +-- Namespace: infrastructure
+-- Infra VMs ResourceQuota:
requests.cpu: 12
requests.memory: 48Gi
ResourceQuota and LimitRange
# ResourceQuota: caps total resource consumption in a namespace
apiVersion: v1
kind: ResourceQuota
metadata:
name: production-tier1-quota
namespace: prod-tier1
spec:
hard:
# Total CPU and memory that all VMs in this namespace can request
requests.cpu: "180"
requests.memory: 1.5Ti
# Total CPU and memory limits (hard ceiling) for all VMs
limits.cpu: "200"
limits.memory: 1.8Ti
# Cap the number of VirtualMachine objects
count/virtualmachines.kubevirt.io: "50"
# Cap the number of PVCs (controls storage consumption)
persistentvolumeclaims: "150"
# Cap total storage requested
requests.storage: 50Ti
# LimitRange: sets defaults and constraints for individual VMs
apiVersion: v1
kind: LimitRange
metadata:
name: vm-limits
namespace: prod-tier1
spec:
limits:
- type: Container # Applies to the virt-launcher container
default: # Default limits if not specified on the VM
cpu: "4"
memory: 8Gi
defaultRequest: # Default requests if not specified
cpu: "2"
memory: 4Gi
min: # Minimum allowed -- prevents undersized VMs
cpu: "1"
memory: 2Gi
max: # Maximum allowed -- prevents oversized VMs
cpu: "64"
memory: 256Gi
Key differences from VMware resource pools:
| Concept | VMware Resource Pool | Kubernetes ResourceQuota |
|---|---|---|
| Hierarchy | Resource pools can be nested multiple levels deep. Child pools inherit from parent. | Namespaces are flat -- no nesting. Hierarchical Namespace Controller (HNC) exists but is not widely adopted. |
| Shares (relative priority) | Shares determine relative resource allocation during contention. | No direct equivalent. Kubernetes PriorityClasses influence scheduling and eviction order, but not proportional sharing of CPU/memory. |
| Reservation (guaranteed) | Reservation guarantees a minimum resource allocation backed by admission control. | requests in Kubernetes serve a similar purpose -- they are guaranteed allocations that the scheduler accounts for. A pod is only placed on a node if the node has enough unrequested resources. |
| Limit (hard ceiling) | Limits cap consumption. VM is throttled if it exceeds the limit. | limits in Kubernetes are enforced by cgroups. If a container exceeds its memory limit, it is OOM-killed. If it exceeds its CPU limit, it is throttled. |
| Overcommitment | VMware allows overcommitment by design. Shares arbitrate during contention. Ballooning and swap handle memory pressure. | Kubernetes allows overcommitment when requests < limits. If all pods on a node simultaneously use their limits, the node is overcommitted. Pods exceeding their memory requests (but within limits) may be evicted under memory pressure. |
Hyper-V Resource Controls
Hyper-V provides per-VM resource controls (not hierarchical pools):
- Virtual processor weight: 1-10000 (relative priority, similar to VMware shares)
- Virtual processor reserve: Percentage of a physical processor's capacity guaranteed to the VM
- Virtual processor limit: Percentage cap on CPU consumption
- Memory reserve: Minimum RAM guaranteed
- Dynamic Memory: Allows memory to flex between a minimum and maximum
Azure Local does not expose hierarchical resource pools. Resource governance is managed through Azure RBAC and Azure policies at the subscription and resource group level, not at the hypervisor level. This is a significant architectural difference from vSphere resource pools.
6. Affinity / Anti-Affinity Rules
Why Placement Control Matters
At 5,000+ VMs, placement decisions have cascading effects:
- Anti-affinity for HA: Both nodes of an Oracle RAC cluster must run on different physical hosts. If a host fails, the surviving node keeps the database available.
- Affinity for performance: An application server VM and its database VM should be on the same host (or same rack) to minimize network latency.
- License compliance: Oracle database VMs must only run on hosts with Oracle-licensed processors. If a DRS migration moves an Oracle VM to an unlicensed host, the organization faces an audit finding.
- Hardware requirements: GPU workloads must run on GPU-equipped nodes. DPDK workloads need SR-IOV NICs.
Kubernetes Affinity Constructs
Kubernetes provides three affinity mechanisms, all of which apply to KubeVirt VMs through the virt-launcher pod spec:
nodeAffinity: Controls which nodes a VM can be scheduled on, based on node labels.
apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
name: oracle-db-01
namespace: database-tier
spec:
template:
spec:
affinity:
nodeAffinity:
# Hard requirement: MUST run on nodes labeled for Oracle
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: license/oracle
operator: In
values: ["true"]
- key: hardware/gpu
operator: DoesNotExist
# Soft preference: prefer nodes in rack-A
preferredDuringSchedulingIgnoredDuringExecution:
- weight: 80
preference:
matchExpressions:
- key: topology.kubernetes.io/zone
operator: In
values: ["rack-a"]
podAntiAffinity: Ensures that VMs with specific labels do not land on the same host (or same zone/rack).
apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
name: oracle-rac-node-1
namespace: database-tier
spec:
template:
metadata:
labels:
app: oracle-rac
cluster-name: rac-prod-01
spec:
affinity:
podAntiAffinity:
# Hard requirement: no two oracle-rac pods on the same host
requiredDuringSchedulingIgnoredDuringExecution:
- labelSelector:
matchExpressions:
- key: cluster-name
operator: In
values: ["rac-prod-01"]
topologyKey: kubernetes.io/hostname
# Soft preference: spread across racks too
preferredDuringSchedulingIgnoredDuringExecution:
- weight: 100
podAffinityTerm:
labelSelector:
matchExpressions:
- key: cluster-name
operator: In
values: ["rac-prod-01"]
topologyKey: topology.kubernetes.io/zone
podAffinity: Ensures that related VMs are scheduled on the same host or in the same topology domain.
# Keep the app server and its cache VM on the same host
spec:
template:
metadata:
labels:
app: trading-platform
component: app-server
spec:
affinity:
podAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
- labelSelector:
matchExpressions:
- key: app
operator: In
values: ["trading-platform"]
- key: component
operator: In
values: ["cache"]
topologyKey: kubernetes.io/hostname
Topology Spread Constraints
Topology spread constraints are a more flexible mechanism than anti-affinity for distributing VMs evenly across failure domains:
spec:
template:
spec:
topologySpreadConstraints:
- maxSkew: 1
topologyKey: topology.kubernetes.io/zone
whenUnsatisfiable: DoNotSchedule
labelSelector:
matchLabels:
app: web-frontend
- maxSkew: 2
topologyKey: kubernetes.io/hostname
whenUnsatisfiable: ScheduleAnyway
labelSelector:
matchLabels:
app: web-frontend
This ensures that web-frontend VMs are spread evenly across zones (maxSkew: 1 means the difference in VM count between any two zones is at most 1) and roughly evenly across hosts (maxSkew: 2, soft constraint).
KubeVirt-Specific Placement Features
dedicatedCpuPlacement: Pins vCPUs to dedicated physical CPU cores, preventing contention with other workloads. Essential for latency-sensitive VMs (trading systems, real-time data processing):
spec:
template:
spec:
domain:
cpu:
cores: 8
dedicatedCpuPlacement: true
When dedicatedCpuPlacement is set, the virt-launcher pod is placed in the Kubernetes Guaranteed QoS class (requests == limits), and the kubelet's CPU manager allocates exclusive CPU cores to the pod. No other pod can use those cores.
evictionStrategy: Controls what happens to the VM when the node is drained:
spec:
template:
spec:
evictionStrategy: LiveMigrate # Options: LiveMigrate, LiveMigrateIfPossible, External, None
- `LiveMigrate`: The VM must be live-migrated before the pod can be evicted. If migration fails, the drain is blocked.
- `LiveMigrateIfPossible`: Attempt live migration, but if it fails, allow the VM to be shut down.
- `External`: Delegate eviction handling to an external controller.
- `None`: The VM is shut down (not migrated) when evicted.
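Rather than setting this per VM, a cluster-wide default can be declared; a sketch against the KubeVirt CR (on OVE this is typically configured through the HyperConverged CR instead):

```yaml
apiVersion: kubevirt.io/v1
kind: KubeVirt
metadata:
  name: kubevirt
  namespace: kubevirt
spec:
  configuration:
    # Default for any VM that does not set its own evictionStrategy
    evictionStrategy: LiveMigrate
```

With this default, node drains attempt live migration for every VM unless an individual VM spec overrides it.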
VMware DRS Rules vs. Kubernetes Affinity
| VMware DRS Rule | Kubernetes Equivalent | Notes |
|---|---|---|
| VM-VM affinity ("should run together") | podAffinity with preferredDuringSchedulingIgnoredDuringExecution |
Soft constraint -- best effort |
| VM-VM affinity ("must run together") | podAffinity with requiredDuringSchedulingIgnoredDuringExecution |
Hard constraint -- scheduler will not violate |
| VM-VM anti-affinity ("should run separately") | podAntiAffinity with preferredDuringScheduling... |
Soft constraint |
| VM-VM anti-affinity ("must run separately") | podAntiAffinity with requiredDuringScheduling... |
Hard constraint |
| VM-Host affinity ("should/must run on host group") | nodeAffinity with requiredDuringScheduling... or preferredDuringScheduling... |
Based on node labels, not host groups. Labels are more flexible but must be manually maintained. |
| VM-Host anti-affinity ("should/must not run on host group") | nodeAffinity with operator: NotIn or DoesNotExist |
Same mechanism, inverted selector |
Key difference: DRS affinity rules are enforced continuously -- DRS re-evaluates rules every 5 minutes and migrates VMs to maintain compliance. Kubernetes affinity rules are enforced at scheduling time only (requiredDuringSchedulingIgnoredDuringExecution). The "IgnoredDuringExecution" means that if a node's labels change after a VM is scheduled, the VM is NOT automatically migrated to a compliant node. Kubernetes has a RequiredDuringExecution concept in alpha/proposal stage, but it is not production-ready. This is a meaningful gap for organizations that rely on DRS to continuously enforce placement rules.
Hyper-V Placement
Hyper-V Failover Clustering provides:
- Preferred owners: A VM can be configured with a list of preferred cluster nodes. If the current host fails, the VM restarts on a preferred node first.
- Possible owners: A hard constraint on which nodes can host the VM. The VM will never run on a node not in the possible owners list.
- Anti-affinity class: Cluster-Aware Updating (CAU) and failover logic will not place VMs with the same anti-affinity class name on the same node.
Azure Local inherits these Hyper-V clustering capabilities. They are less flexible than Kubernetes affinity constructs (no topology spread, no weighted preferences) but are well-understood by Windows Server administrators.
7. CPU & RAM Hot-Add
What Hot-Add Is and Why It Matters
Hot-add is the ability to increase a VM's CPU count or memory while it is running, without a reboot. It addresses a common operational scenario: a production VM hits a resource ceiling during a peak workload, and the team needs to increase capacity immediately without scheduling a maintenance window.
When hot-add matters:
- Emergency capacity increase during production incidents
- Scaling up for batch processing windows (month-end, quarter-end)
- Avoiding downtime for VM reconfiguration
When hot-add is less important:
- If the organization practices right-sizing (VMs are sized correctly from the start)
- If the workload can scale horizontally (add more VMs instead of making one bigger)
- If planned downtime windows are acceptable for reconfiguration
KubeVirt CPU and Memory Hot-Plug
KubeVirt supports CPU and memory hot-plug in recent releases (both arrived during the 1.x series and, depending on the version, may sit behind feature gates). The feature allows increasing the VM's CPU and memory while it is running.
CPU hot-plug: KubeVirt supports adding vCPUs to a running VM. The VM spec defines a range of sockets, and the running VM can have additional sockets activated without a restart. The guest OS must support CPU hot-plug (Linux: supported since kernel 2.6, Windows Server: supported since 2012).
To enable CPU hot-plug, the VM must be configured with maxSockets greater than the initial socket count:
apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
name: scalable-vm
spec:
template:
spec:
domain:
cpu:
sockets: 2 # Initial: 2 sockets
cores: 2 # 2 cores per socket
threads: 1 # 1 thread per core
maxSockets: 8 # Maximum: can scale up to 8 sockets
After the VM is running, update the live VMI to increase sockets:
# Scale CPU from 2 to 4 sockets (2 cores each = 8 vCPUs)
kubectl patch vmi scalable-vm --type merge \
-p '{"spec":{"domain":{"cpu":{"sockets":4}}}}'
Memory hot-plug: KubeVirt supports memory hot-plug by defining maxGuest memory in the VM spec:
spec:
template:
spec:
domain:
memory:
guest: 8Gi # Initial: 8 GB
maxGuest: 32Gi # Maximum: can scale up to 32 GB
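Mirroring the CPU example above, guest memory can then be raised on the running VM; a sketch, assuming the cluster's KubeVirt version enables live VM rollout of such changes (feature gates vary by release):

```
# Scale guest memory from 8 GiB to 16 GiB without a reboot
kubectl patch vm scalable-vm --type merge \
  -p '{"spec":{"template":{"spec":{"domain":{"memory":{"guest":"16Gi"}}}}}}'
```

The guest OS then sees the additional memory appear via hot-plug, subject to the guest-support caveats below.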
Limitations and considerations:
- Guest OS support required. The guest kernel must support CPU/memory hot-plug. Most modern Linux distributions do. Windows Server supports hot-add for both CPU and memory in Datacenter edition; Standard edition has limitations.
- Hot-remove is not supported. You can add CPUs and memory to a running VM, but you generally cannot remove them without a reboot. This is a hardware/firmware limitation (ACPI hot-remove is unreliable on most guest OSes).
- NUMA implications. Adding memory at runtime may create non-optimal NUMA topology. The new memory may be assigned to a virtual NUMA node that does not align with the physical NUMA topology of the host.
- Resource accounting. When a VM is hot-plugged, the virt-launcher pod's resource limits must also be updated. KubeVirt handles this through in-place pod resource resize (a Kubernetes feature that allows changing resource requests/limits without restarting the pod).
Hyper-V Hot-Add
Hyper-V supports hot-add for both CPU and memory, with some distinctions:
Dynamic Memory: Hyper-V's Dynamic Memory is a more automated approach than manual hot-add. The VM is configured with a minimum, startup, and maximum memory value. The Hyper-V host dynamically adjusts the VM's memory allocation within this range based on demand, using a balloon driver inside the guest.
| Parameter | Description |
|---|---|
| Startup Memory | Memory allocated when the VM boots |
| Minimum Memory | Lowest memory the VM can be reduced to under host pressure |
| Maximum Memory | Highest memory the VM can grow to under guest demand |
| Memory Buffer | Percentage of committed memory to keep as reserve (default: 20%) |
Dynamic Memory is transparent to the guest OS -- no manual intervention required. This is more operationally convenient than KubeVirt's explicit hot-plug.
CPU hot-add (Hyper-V): Supported for Generation 2 VMs. The VM must be configured with "Enable processor compatibility" unchecked and the guest OS must support hot-add (Windows Server 2016+, Linux with hot-plug support).
VMware Hot-Add
VMware has supported CPU and memory hot-add since vSphere 4.1. It must be enabled per-VM in the VM settings before the first boot (or while the VM is powered off). Once enabled:
- CPU hot-add: Add vCPUs while the VM is running. Guest OS must support it.
- Memory hot-add: Add memory while the VM is running. Guest OS must support it.
- Hot-remove: Not supported (same limitation as all platforms).
One VMware-specific quirk: enabling CPU hot-add disables vNUMA for the VM (the VM presents a flat memory topology to the guest). This can negatively impact performance for NUMA-sensitive workloads. vSphere 8 relaxed this restriction for some configurations, but it remains a consideration.
8. GPU Passthrough / vGPU
Why GPU Virtualization Matters
GPU access in virtual machines is required for an expanding set of enterprise workloads:
- AI/ML inference: Running trained models for prediction, classification, NLP. Inference workloads are increasingly deployed alongside traditional enterprise applications.
- VDI (Virtual Desktop Infrastructure): Graphics-intensive desktops (CAD, video editing, GIS) require GPU acceleration for acceptable user experience.
- Video encoding/transcoding: Real-time media processing, surveillance video analytics.
- Scientific computing: Financial risk modeling, Monte Carlo simulations, molecular dynamics.
Two fundamental approaches exist: passthrough (one GPU to one VM) and sharing (one GPU to many VMs).
PCIe Passthrough via VFIO (KubeVirt)
VFIO (Virtual Function I/O) is a Linux kernel framework that allows userspace programs (including QEMU) to directly access PCI devices with full DMA isolation via the IOMMU (VT-d on Intel, AMD-Vi on AMD). When a GPU is passed through to a VM via VFIO, the VM has exclusive, near-native-performance access to the entire GPU.
GPU Passthrough via VFIO in KubeVirt
+================================================================+
| Worker Node |
| |
| +----------------------------------------------------------+ |
| | virt-launcher Pod (VM with GPU) | |
| | | |
| | +----------------------------------------------------+ | |
| | | QEMU/KVM | | |
| | | | | |
| | | Guest OS sees: NVIDIA A100 (native PCI device) | | |
| | | Guest installs native NVIDIA driver | | |
| | +----------------------------------------------------+ | |
| | | | |
| | | VFIO device assignment | |
| | | (QEMU -device vfio-pci,host=0000:3b:00.0) | |
| +----------------------------------------------------------+ |
| | |
| | IOMMU (VT-d / AMD-Vi) |
| | DMA remapping ensures VM can only access its own GPU |
| | |
| +----------------------------------------------------------+ |
| | Physical GPU: NVIDIA A100 80GB | |
| | PCI address: 0000:3b:00.0 | |
| | IOMMU group: 42 | |
| | Bound to vfio-pci driver (not nvidia driver) | |
| +----------------------------------------------------------+ |
+================================================================+
In KubeVirt, GPU passthrough is configured using the gpus or hostDevices field in the VM spec. The GPU must first be configured as a Kubernetes device plugin resource.
apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
name: ml-inference-vm
namespace: ai-workloads
spec:
template:
spec:
domain:
devices:
gpus:
- name: gpu1
deviceName: nvidia.com/A100
resources:
requests:
memory: 64Gi
limits:
memory: 64Gi
# Node must have an available A100 GPU
# Kubernetes device plugin framework handles allocation
Limitation: GPU passthrough is exclusive -- one GPU per VM. The GPU cannot be shared. Live migration of VMs with passthrough GPUs is NOT supported (the GPU's state cannot be serialized and transferred). The VM must be shut down and restarted on a different GPU-equipped host. This is the same limitation as VMware's DirectPath I/O and Hyper-V's DDA.
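Before `nvidia.com/A100` can be requested by name as above, the device must be allowed in the cluster-wide KubeVirt configuration; a sketch (the PCI vendor:device selector is illustrative):

```yaml
apiVersion: kubevirt.io/v1
kind: KubeVirt
metadata:
  name: kubevirt
  namespace: kubevirt
spec:
  configuration:
    permittedHostDevices:
      pciHostDevices:
        # vendor:device ID as reported by lspci -nn (value illustrative)
        - pciVendorSelector: "10DE:20B5"
          resourceName: nvidia.com/A100
```

The device plugin then advertises matching VFIO-bound GPUs as allocatable `nvidia.com/A100` resources on each node.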
NVIDIA GPU Operator for Kubernetes
The NVIDIA GPU Operator automates the deployment and management of GPU drivers and device plugins on Kubernetes nodes. It deploys:
- NVIDIA driver containers: Install the NVIDIA kernel driver on each GPU node without requiring the driver to be baked into the host OS image.
- NVIDIA device plugin: Registers GPUs as schedulable Kubernetes resources (`nvidia.com/gpu`).
- NVIDIA Container Toolkit: Allows GPU access from containers and VMs.
- GPU Feature Discovery (GFD): Labels nodes with GPU model, memory, driver version, and CUDA capability for scheduling decisions.
- DCGM Exporter: Exports GPU metrics (utilization, temperature, memory usage, ECC errors) to Prometheus.
For KubeVirt, the GPU Operator works with both passthrough (VFIO) and vGPU modes.
vGPU: Time-Sliced vs. MIG (Multi-Instance GPU)
When a single physical GPU must be shared across multiple VMs, two technologies are available:
Time-sliced vGPU (NVIDIA GRID / vGPU):
Time-Sliced vGPU (NVIDIA GRID)
+================================================================+
| Physical GPU: NVIDIA A100 80GB |
| |
| NVIDIA vGPU Manager (host driver) |
| Schedules time slices across virtual GPUs |
| |
| +--------------+ +--------------+ +--------------+ |
| | vGPU 1 | | vGPU 2 | | vGPU 3 | |
| | Profile: | | Profile: | | Profile: | |
| | A100-4C | | A100-4C | | A100-4C | |
| | (4 GB VRAM) | | (4 GB VRAM) | | (4 GB VRAM) | |
| | | | | | | |
| | Assigned to | | Assigned to | | Assigned to | |
| | VM-1 | | VM-2 | | VM-3 | |
| +--------------+ +--------------+ +--------------+ |
| |
| Time-slicing: Each vGPU gets exclusive GPU access in |
| round-robin time windows. Low latency but no guaranteed |
| throughput -- a VM's GPU time depends on contention. |
| |
| Profiles: A100-1C (1GB), A100-2C (2GB), ..., A100-80C (80GB) |
| Cannot mix compute and graphics profiles on the same GPU. |
+================================================================+
Time-sliced vGPU divides GPU time across VMs. Each VM sees a virtual GPU with a dedicated portion of the GPU's framebuffer (VRAM) but shares the compute cores via time-slicing. The NVIDIA vGPU Manager runs on the host and mediates access. Each VM installs a standard NVIDIA vGPU guest driver.
Licensing: NVIDIA vGPU requires a separate license from NVIDIA. License tiers:
- vPC (formerly GRID Virtual PC): Basic VDI desktops
- vApps (formerly GRID Virtual Applications): Application streaming
- vCS (Virtual Compute Server): AI/ML compute workloads
- vWS (Virtual Workstation, formerly Quadro vDWS): Professional graphics
MIG (Multi-Instance GPU):
MIG is available on NVIDIA A100, A30, H100, and later GPUs. Unlike time-slicing, MIG physically partitions the GPU into up to 7 isolated instances, each with dedicated compute cores, memory bandwidth, and L2 cache. There is no time-sharing -- each MIG instance is truly isolated.
MIG (Multi-Instance GPU) on NVIDIA A100
+================================================================+
| Physical GPU: NVIDIA A100 80GB |
| 108 SMs (Streaming Multiprocessors), 80 GB HBM2e |
| |
| MIG Partitioning (example configuration): |
| |
| +---------------------+ +---------------------+ |
| | MIG Instance 1 | | MIG Instance 2 | |
| | Profile: 3g.40gb | | Profile: 3g.40gb | |
| | 42 SMs | | 42 SMs | |
| | 40 GB HBM2e | | 40 GB HBM2e | |
| | Dedicated L2 cache | | Dedicated L2 cache | |
| | Dedicated mem BW | | Dedicated mem BW | |
| | | | | |
| | Isolated -- cannot | | Isolated -- cannot | |
| | see or affect | | see or affect | |
| | other instances | | other instances | |
| +---------------------+ +---------------------+ |
| |
| Other profiles: 1g.10gb, 2g.20gb, 4g.40gb, 7g.80gb |
| Cannot mix profile sizes (e.g., cannot have 1g + 3g on same) |
| Reconfiguring MIG requires GPU reset (all VMs must be stopped)|
+================================================================+
MIG vs. time-sliced vGPU:
| Aspect | Time-sliced vGPU | MIG |
|---|---|---|
| Isolation | Shared compute, dedicated VRAM | Fully isolated compute + memory + cache |
| Performance predictability | Variable (depends on contention) | Consistent (guaranteed resources) |
| Maximum partitions | Up to 32 vGPUs per GPU (profile-dependent) | Up to 7 MIG instances |
| GPU models | All NVIDIA datacenter GPUs | A100, A30, H100, H200, B100, B200 only |
| Live migration | Supported (vGPU state can be serialized) | Not supported (MIG instances are assigned to VMs like passthrough devices) |
| Licensing | Requires NVIDIA vGPU license | No additional license (included with GPU) |
| Reconfiguration | Dynamic (add/remove vGPUs without GPU reset) | Requires GPU reset to reconfigure partitions |
Hyper-V: DDA and GPU-P
Discrete Device Assignment (DDA): Hyper-V's equivalent of VFIO passthrough. A physical PCIe device (GPU, NVMe, FPGA) is assigned exclusively to a VM. The VM gets near-native performance. Live migration is not supported.
GPU-P (GPU Partitioning): Introduced in Windows Server 2025 and Azure Local, GPU-P allows a single GPU to be partitioned and shared across multiple VMs. Similar in concept to SR-IOV for NICs, GPU-P creates virtual GPU partitions that are hardware-isolated. Currently supported for a limited set of GPUs and primarily used for Azure Virtual Desktop scenarios on Azure Local.
Azure Local leverages GPU-P for VDI workloads and DDA for AI/ML workloads. GPU-P is less mature than NVIDIA's vGPU ecosystem but does not require separate NVIDIA licensing.
VMware GPU Virtualization
VMware supports three GPU modes:
- DirectPath I/O (passthrough): Equivalent to VFIO. Exclusive GPU to one VM. No live migration. No vSphere HA for the VM.
- vSGA (Virtual Shared Graphics Acceleration): Uses VMware's SVGA 3D driver to share GPU resources. Software-mediated, lower performance. Deprecated in favor of vGPU.
- NVIDIA GRID vGPU: The same NVIDIA vGPU technology available on KubeVirt and Hyper-V; VMware was the first hypervisor to support it. Requires the NVIDIA vGPU Manager installed as a VIB on the ESXi host.
Live migration with vGPU: VMware was also the first to live migrate VMs with vGPU attachments (vMotion with vGPU, supported since vSphere 6.7 Update 1); the vGPU state is serialized and transferred along with the VM. KubeVirt also supports live migration of VMs with mediated devices (including vGPU), but this capability is less proven at scale. Hyper-V does not support live migration of VMs with DDA or GPU-P.
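For reference, attaching an NVIDIA vGPU to a KubeVirt VM is declarative via the `gpus` device list. A minimal sketch -- the VM name and the mediated device resource name (`nvidia.com/GRID_T4-2Q`) are illustrative and depend on the GPU model and the vGPU profile exposed on the node:

```yaml
apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
  name: vgpu-vm                      # illustrative name
spec:
  runStrategy: Always
  template:
    spec:
      domain:
        devices:
          gpus:
            - name: gpu1
              # Mediated device resource name; the exact profile string
              # depends on the GPU model and the configured vGPU profile.
              deviceName: nvidia.com/GRID_T4-2Q
        memory:
          guest: 8Gi
```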
How the Candidates Handle This
| Capability | VMware (Current) | OVE (KubeVirt) | Azure Local (Hyper-V) | Swisscom ESC |
|---|---|---|---|---|
| Live Migration | vMotion. Pre-copy. Mature, battle-tested at scale. DRS automates migrations. Encrypted. | KubeVirt live migration. Pre-copy + optional post-copy. Node drain integration. No built-in DRS. | Hyper-V Live Migration. Pre-copy. Quick Migration fallback. No DRS equivalent. | VMware vMotion (current). Migration capability depends on underlying platform evolution. |
| Concurrent Migrations | 4 per host (8 with a 10 GbE vMotion network) | 2 per node default (configurable) | 2 simultaneous default | Per VMware settings |
| Migration Network | Dedicated vmknic, VLAN-isolated | Dedicated NAD via Multus | Dedicated SMB/TCP network | Per VMware settings |
| VM Snapshots | VMDK delta/redo logs. Mature but chains degrade performance. | VolumeSnapshot via CSI. Quality depends on storage backend. Guest agent for app consistency. | Hyper-V checkpoints (standard + production). VSS for app consistency. | VMware snapshots |
| VM Cloning | Full clone + linked clone. Instant clone (since vSphere 6.7). | CSI clone. COW if storage supports it (Ceph, LVM thin). CDI handles import/clone. | Full copy or differencing disk. | VMware clone |
| VM Templates | Content Library, OVF templates, vApp. Mature. | containerDisk, DataVolume templates, golden image PVCs with auto-update. | Azure Marketplace images, custom VHDX templates. | VMware templates |
| First-boot Config | VMware Guest Customization (limited), cloud-init via open-vm-tools. | Cloud-init (NoCloud, ConfigDrive). Native, well-integrated. | Sysprep (Windows), cloud-init (Linux), Azure VM Agent. | VMware Guest Customization |
| Resource Pools | Hierarchical resource pools with shares, reservations, limits. | Namespaces + ResourceQuotas + LimitRanges. Flat (no nesting). No shares. | No hierarchical pools. Azure RBAC + policies at subscription level. | VMware resource pools |
| Affinity Rules | DRS affinity/anti-affinity (VM-VM, VM-Host). Continuously enforced. | nodeAffinity, podAffinity, podAntiAffinity, topology spread. Enforced at scheduling only. | Preferred/possible owners, anti-affinity class. Basic. | VMware DRS rules |
| CPU Hot-Add | Supported (per-VM setting). Disables vNUMA. | CPU hot-plug supported. maxSockets defines ceiling. | Supported (Generation 2 VMs). | Per VMware settings |
| Memory Hot-Add | Supported (per-VM setting). | Memory hot-plug supported. maxGuest defines ceiling. | Dynamic Memory (automatic). Hot-add supported. | Per VMware settings |
| GPU Passthrough | DirectPath I/O (VFIO-like). No live migration. | VFIO passthrough. No live migration. | DDA (Discrete Device Assignment). No live migration. | Not typically applicable (managed service). |
| vGPU | NVIDIA GRID vGPU. Live migration supported. | NVIDIA vGPU via mediated devices. Live migration supported (with caveats). | GPU-P (native partitioning). No live migration. | Not typically applicable. |
Key Takeaways
- Live migration works on all platforms, but operational maturity differs. VMware vMotion is the most battle-tested at enterprise scale. KubeVirt live migration is functionally equivalent (same pre-copy algorithm, comparable downtime) but wraps migration in Kubernetes pod lifecycle semantics that are unfamiliar to VMware-trained teams. The absence of a built-in DRS equivalent in KubeVirt is a genuine gap -- the team must plan for manual rebalancing or adopt the Descheduler.
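For context, cluster-wide migration behavior in KubeVirt is tuned through the KubeVirt custom resource rather than per-host settings. A sketch with illustrative values (field availability varies by KubeVirt release):

```yaml
apiVersion: kubevirt.io/v1
kind: KubeVirt
metadata:
  name: kubevirt
  namespace: kubevirt
spec:
  configuration:
    migrations:
      parallelMigrationsPerCluster: 5       # illustrative values throughout
      parallelOutboundMigrationsPerNode: 2
      bandwidthPerMigration: 640Mi          # throttle per migration stream
      completionTimeoutPerGiB: 800          # abort threshold for slow migrations
      allowAutoConverge: true               # slow vCPUs if dirty rate outpaces copy
      allowPostCopy: false                  # keep post-copy off unless tested
```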
- Node drain is KubeVirt's killer feature for maintenance workflows. While vMotion + maintenance mode is well understood, KubeVirt's integration with PodDisruptionBudgets provides more granular control over migration ordering and parallelism. For an organization with 5,000+ VMs, PDBs can enforce that no more than N VMs from a critical application are simultaneously migrating -- a guarantee that vSphere maintenance mode does not natively provide.
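As an illustration, a custom PDB selecting the virt-launcher pods of a critical application might look like this. The name and label are hypothetical; note that KubeVirt also manages its own per-VM PDBs for VMs with `evictionStrategy: LiveMigrate`, so a custom application-level PDB is additive:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: core-banking-pdb          # hypothetical name
spec:
  # Allow at most one VM of this application to be evicted
  # (and therefore migrating) at a time during node drains.
  maxUnavailable: 1
  selector:
    matchLabels:
      app: core-banking           # hypothetical label applied to the VMs' pods
```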
- Snapshot quality depends entirely on the storage backend. KubeVirt delegates snapshots to CSI drivers, so the quality, speed, and consistency guarantees of snapshots are a property of the storage platform (Ceph, NetApp, Pure, etc.), not of KubeVirt itself. The team must evaluate the specific CSI driver's snapshot capabilities, not just KubeVirt's API surface.
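A sketch of a KubeVirt snapshot request, which delegates the disk snapshots to the CSI driver and, when the guest agent is present, freezes the guest filesystem first. Names are illustrative and the API version varies by KubeVirt release:

```yaml
apiVersion: snapshot.kubevirt.io/v1beta1
kind: VirtualMachineSnapshot
metadata:
  name: db-vm-snapshot            # illustrative
spec:
  source:
    apiGroup: kubevirt.io
    kind: VirtualMachine
    name: db-vm                   # illustrative VM name
  # Abort the snapshot (and thaw the guest) if it exceeds this deadline.
  failureDeadline: 5m
```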
- Resource governance models are fundamentally different. VMware resource pools are hierarchical with proportional sharing (shares). Kubernetes namespaces + ResourceQuotas are flat with hard caps. There is no equivalent of shares (proportional CPU/memory allocation during contention). Organizations that rely on shares for soft multi-tenancy will need to redesign their resource governance model.
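For illustration, the Kubernetes hard-cap model looks like this (namespace and numbers are hypothetical; VM consumption is counted through the virt-launcher pods that back each VM):

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: dev-quota
  namespace: development          # hypothetical tenant namespace
spec:
  hard:
    requests.cpu: "200"           # hard cap -- not a proportional share
    requests.memory: 1Ti
    limits.cpu: "200"
    limits.memory: 1Ti
---
apiVersion: v1
kind: LimitRange
metadata:
  name: dev-limits
  namespace: development
spec:
  limits:
    - type: Container
      max:
        cpu: "16"                 # per-VM ceiling, enforced on the launcher pod
        memory: 64Gi
```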
- Affinity rules are scheduling-time-only in Kubernetes. VMware DRS continuously enforces affinity and anti-affinity rules via periodic re-evaluation and vMotion. Kubernetes only enforces affinity at pod scheduling time. If node labels change or cluster topology shifts, existing VMs are not automatically relocated. This must be supplemented with operational procedures or custom controllers.
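A sketch of hard anti-affinity for one member of a hypothetical three-node database cluster; the label `app: db-cluster` and VM name are illustrative. Note the `IgnoredDuringExecution` suffix -- the rule is checked only when the launcher pod is scheduled:

```yaml
apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
  name: db-node-1                 # one member of the hypothetical cluster
spec:
  template:
    metadata:
      labels:
        app: db-cluster           # hypothetical label shared by all members
    spec:
      affinity:
        podAntiAffinity:
          # Hard rule: never co-schedule two members on the same host.
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchLabels:
                  app: db-cluster
              topologyKey: kubernetes.io/hostname
      domain:
        devices: {}
        memory:
          guest: 8Gi
```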
- GPU workloads constrain migration. On all platforms, VMs with passthrough GPUs cannot be live migrated. Only NVIDIA vGPU (time-sliced) supports live migration, and only on VMware and (with caveats) KubeVirt. MIG instances passed through via VFIO also cannot be live migrated. Plan GPU node maintenance accordingly -- these VMs will require scheduled downtime.
- Hot-add is supported but less mature on KubeVirt. VMware hot-add is a well-tested feature used daily by many organizations. KubeVirt CPU and memory hot-plug are functional but newer, and the interaction with Kubernetes in-place pod resize adds a layer of complexity. Hot-remove remains unsupported on all platforms. The better long-term strategy is right-sizing VMs from the start.
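A sketch of the hot-plug ceilings in a KubeVirt VM spec (values are illustrative; hot-plug additionally requires the relevant feature gates or live-update rollout strategy, depending on the KubeVirt version):

```yaml
apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
  name: app-vm                    # illustrative
spec:
  template:
    spec:
      domain:
        cpu:
          sockets: 2              # current value, can be raised while running
          maxSockets: 8           # hot-plug ceiling, fixed at boot
          cores: 2
        memory:
          guest: 16Gi             # current value, can be raised while running
          maxGuest: 64Gi          # hot-plug ceiling, fixed at boot
        devices: {}
```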
- Template provisioning is faster on KubeVirt if the storage backend supports COW cloning. With a copy-on-write capable storage backend (Ceph RBD, LVM thin), creating a new VM from a golden image is nearly instantaneous. With full-copy backends, provisioning is slower than VMware's linked clone or instant clone. Storage backend selection directly impacts provisioning SLAs.
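For illustration, cloning a golden image via CDI (names and namespaces are hypothetical). CDI selects a smart/CSI clone automatically when the storage class supports it, and otherwise falls back to a host-assisted full copy:

```yaml
apiVersion: cdi.kubevirt.io/v1beta1
kind: DataVolume
metadata:
  name: web-vm-rootdisk           # hypothetical new root disk
spec:
  source:
    pvc:
      namespace: golden-images    # hypothetical namespace holding templates
      name: rhel9-golden          # hypothetical golden-image PVC
  storage:
    resources:
      requests:
        storage: 50Gi
```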
Discussion Guide
Use these questions when engaging with vendors, Red Hat/Microsoft/Swisscom field teams, or internal subject matter experts.
Live Migration
- Demonstrate a live migration of a VM with 128 GB of memory running an active write workload (e.g., fio random write at 500 MB/s). What is the total migration time and the switchover downtime? Does the migration converge, or does auto-converge need to activate? Why this matters: Memory-intensive, write-heavy VMs are the hardest to migrate. This test reveals whether the platform can handle your most demanding VMs without resorting to post-copy or manual intervention.
- Drain a node running 30+ VMs simultaneously while respecting PodDisruptionBudgets. How long does the full drain take? What happens if one VM's migration fails -- does the drain block or proceed? Why this matters: Node drain is the most common maintenance operation. The team needs to know the total time to evacuate a fully loaded node and understand failure handling.
- What is the roadmap for DRS-equivalent automatic load balancing? Is the Descheduler recommended for production use, and what are its limitations? Why this matters: Without automated rebalancing, the team must manually monitor cluster resource distribution and trigger migrations. This does not scale to 5,000+ VMs.
Snapshots and Cloning
- Take an application-consistent snapshot of a running database VM (PostgreSQL or Oracle) with a 500 GB data disk. How long does the snapshot take? Does it cause any I/O pause or performance degradation during the snapshot? Why this matters: Snapshot performance directly affects backup windows and RPO. A snapshot that causes a 5-second I/O stun on a production database is unacceptable.
- Clone 20 VMs simultaneously from a single golden image. How long does each clone take? Does clone performance degrade with concurrency? What storage backend optimizations are used (COW, thin provisioning)? Why this matters: Rapid provisioning at scale requires concurrent cloning without performance collapse. The answer reveals whether the storage backend can handle burst provisioning.
Resource Governance
- Demonstrate namespace-based multi-tenancy with ResourceQuotas. If two namespaces are competing for resources on the same nodes, how does the platform handle contention? Is there an equivalent to VMware shares for proportional allocation? Why this matters: Proportional sharing is a core capability for financial institutions with multiple business units sharing infrastructure. If there is no equivalent, the team needs to understand the alternative approach.
- Show how to enforce that a VM cannot exceed 16 vCPUs and 64 GB RAM, and that the total allocation for the "development" team cannot exceed 200 vCPUs and 1 TB RAM. What error does a user see if they try to exceed the quota? Why this matters: Quota enforcement must be self-service-friendly. Users need clear error messages, not cryptic API errors.
Affinity and Placement
- Configure a hard anti-affinity rule ensuring that 3 VMs of a database cluster never share the same physical host. Then, drain one of the three hosts. Does the drain succeed? Does the anti-affinity rule prevent placement, and if so, what is the error behavior? Why this matters: Anti-affinity + node drain can create scheduling deadlocks (if there are only 3 nodes and all 3 must have exactly one DB VM). The team needs to understand how the platform resolves these conflicts.
GPU
- Demonstrate vGPU (time-sliced) with 4 VMs sharing a single GPU. Show GPU utilization metrics for each VM under concurrent load. Then live migrate one of the 4 VMs to another host with the same GPU model. What is the migration downtime? Why this matters: vGPU live migration is critical for maintaining the same maintenance workflow for GPU-accelerated VMs. If live migration is not supported, GPU nodes become maintenance liabilities.
- Show MIG partitioning: create 3 MIG instances on an A100, assign each to a different VM, and demonstrate that performance isolation holds under load (one VM running full compute should not affect the others). Then reconfigure the MIG partitions -- does this require stopping all VMs? Why this matters: MIG provides stronger isolation than time-sliced vGPU, but reconfiguration is disruptive. The team needs to understand the operational tradeoff.