Kubernetes-Native Virtualization
Why This Matters
The previous three chapters covered the foundational concepts of virtualization, the VMware baseline you are migrating away from, and the hypervisor engines (KVM, QEMU, libvirt, Hyper-V) that power the candidate platforms. This chapter takes the next step up the stack: how Kubernetes -- a system designed for containers -- becomes a platform that runs traditional virtual machines as first-class citizens.
This is the most important chapter in the virtualization series for one reason: KubeVirt is the core of OpenShift Virtualization Engine (OVE). Every VM that runs on OVE is a KubeVirt-managed VM. Every live migration, every storage attachment, every network connection, every console session -- all of these flow through KubeVirt's custom resources, controllers, and per-VM pod architecture. Understanding KubeVirt at the depth presented here is not optional for evaluating OVE. It is the evaluation.
For Azure Local, this chapter provides a contrast: Azure Local does not use Kubernetes for VM management (it uses Hyper-V with the Azure Arc management layer). Understanding KubeVirt clarifies what is architecturally different about the OVE approach -- where it is more flexible, where it is more complex, and where it introduces operational patterns that a VMware-trained team has never encountered.
For Swisscom ESC, this chapter is contextual. ESC currently runs on VMware; if Swisscom ever transitions to a Kubernetes-native platform, the concepts here would apply.
The chapter also covers the container runtime layer (CRI-O, containerd) that sits between Kubernetes and the actual virt-launcher processes, and Kata Containers, which represent a different approach to virtualization in Kubernetes -- running containers inside VMs rather than running VMs inside pods.
At 5,000+ VM scale, every architectural decision in the KubeVirt stack has compounding effects. A misconfigured CDI import pipeline slows down migration by weeks. A misunderstood networking mode (masquerade vs. bridge vs. SR-IOV) can mean the difference between 1 Gbps and 25 Gbps per VM. A virt-handler DaemonSet that is not sized correctly causes node-level failures during mass live migration. This chapter equips the evaluation team to operate at that level.
Concepts
1. KubeVirt
What KubeVirt Is
KubeVirt is a Kubernetes extension (operator) that enables virtual machines to run as first-class workloads alongside containers on the same cluster. It was started in 2016 by Red Hat engineers, accepted into the Cloud Native Computing Foundation (CNCF) as a Sandbox project in 2019, and reached CNCF Incubating status in 2022. It is the upstream open-source project that Red Hat packages and supports as OpenShift Virtualization, which in turn is the foundation of OpenShift Virtualization Engine (OVE).
The core idea: instead of building a separate management plane for VMs (like vCenter or System Center VMM), KubeVirt extends the existing Kubernetes API with Custom Resource Definitions (CRDs) that represent virtual machines. The Kubernetes API server, scheduler, RBAC, monitoring, and networking all apply to VMs just as they do to containers. The VM itself runs inside a Kubernetes pod, where a virt-launcher process manages a libvirtd instance and a QEMU process.
This is not a trivial wrapper. KubeVirt solves a genuinely hard engineering problem: Kubernetes was designed around the assumption that workloads are ephemeral, stateless, and horizontally scalable. VMs are the opposite -- they are long-lived, deeply stateful, and vertically scaled. KubeVirt bridges this gap by introducing VM-specific lifecycle semantics (start, stop, pause, migrate, restart) on top of Kubernetes' pod-centric model.
Architecture
KubeVirt consists of four primary components deployed on a Kubernetes cluster:
KubeVirt Architecture Overview
+=====================================================================+
| Kubernetes Control Plane |
| +---------------------------------------------------------------+ |
| | kube-apiserver (+ KubeVirt CRDs registered) | |
| | - VirtualMachine (vm) | |
| | - VirtualMachineInstance (vmi) | |
| | - VirtualMachineInstanceReplicaSet (vmirs) | |
| | - VirtualMachineInstanceMigration (vmim) | |
| | - VirtualMachineClusterPreference | |
| | - VirtualMachineClusterInstancetype | |
| +---------------------------------------------------------------+ |
| |
| +-----------------------------+ +-----------------------------+ |
| | virt-api (Deployment) | | virt-controller | |
| | | | (Deployment, HA pair) | |
| | - Validating webhook | | | |
| | - Mutating webhook | | - Watches VM/VMI CRs | |
| | - Subresource API | | - Creates virt-launcher | |
| | (console, VNC, | | pods for each VMI | |
| | start, stop, migrate) | | - Manages VM lifecycle | |
| | - virtctl proxy target | | state machine | |
| | - Certificate management | | - Coordinates migrations | |
| +-----------------------------+ +-----------------------------+ |
+=====================================================================+
+=====================================================================+
| Worker Node 1 Worker Node 2 |
| +-------------------------------+ +----------------------------+ |
| | virt-handler (DaemonSet) | | virt-handler (DaemonSet) | |
| | - Registers node capabilities| | - Registers node caps | |
| | - Manages device plugins | | - Manages device plugins | |
| | (KVM, vhost-net, GPU, SRIOV)| | (KVM, vhost-net, etc.) | |
| | - Syncs VMI state with API | | - Syncs VMI state with API | |
| | - Coordinates migration target| | - Coordinates migration | |
| | - Configures node networking | | - Configures node net | |
| +-------------------------------+ +----------------------------+ |
| |
| +---------------------------+ +---------------------------+ |
| | virt-launcher Pod (VM-A) | | virt-launcher Pod (VM-B) | |
| | +----------------------+ | | +----------------------+ | |
| | | virt-launcher process| | | | virt-launcher process| | |
| | +----------------------+ | | +----------------------+ | |
| | | libvirtd | | | | libvirtd | | |
| | +----------------------+ | | +----------------------+ | |
| | | QEMU/KVM process | | | | QEMU/KVM process | | |
| | | (the actual VM) | | | | (the actual VM) | | |
| | +----------------------+ | | +----------------------+ | |
| +---------------------------+ +---------------------------+ |
+=====================================================================+
virt-api is a Deployment (typically 2 replicas for HA) that serves as the entry point for all KubeVirt-specific API operations. It registers itself as a Kubernetes admission webhook (both validating and mutating) so that every VirtualMachine or VirtualMachineInstance create/update request passes through KubeVirt's validation logic before being persisted in etcd. The virt-api also provides subresource endpoints -- these are the REST endpoints that virtctl uses for operations that do not map to standard Kubernetes CRUD: opening a VNC console, streaming a serial console, triggering a start/stop/restart, initiating a live migration. Architecturally, virt-api is comparable to vCenter's SOAP/REST API facade, except it extends the Kubernetes API server rather than replacing it.
virt-controller is a Deployment (typically 2 replicas with leader election) that implements the core reconciliation loop. It watches VirtualMachine and VirtualMachineInstance custom resources in etcd and ensures reality matches the declared state. When a user creates a VirtualMachineInstance, the virt-controller creates a pod (the virt-launcher pod) with the correct resource requests, volume mounts, and annotations. When a VirtualMachine is set to running: false, the virt-controller deletes the associated VMI and its pod. The virt-controller also manages the state machine for VMs (Pending -> Scheduling -> Scheduled -> Running -> Succeeded/Failed) and coordinates live migrations by creating target pods on destination nodes.
In vSphere terms, virt-controller is the equivalent of the vpxd process inside vCenter -- the central brain that translates desired state into actions on hosts.
virt-handler is a DaemonSet that runs on every worker node capable of hosting VMs. It is the node-level agent. Its responsibilities include:
- Device plugin registration: It registers Kubernetes device plugins for /dev/kvm, /dev/vhost-net, /dev/net/tun, and any SR-IOV Virtual Functions. This is how the Kubernetes scheduler knows that a node has KVM capability and can run VMs.
- VMI state synchronization: It watches VMIs assigned to its node, ensures the virt-launcher pod and QEMU process are in the expected state, and reports status back to the API server (IP addresses, guest agent info, migration progress).
- Network and device setup: It configures bridge interfaces, tap devices, and any node-level networking required by the VM before the virt-launcher pod starts.
- Migration coordination: On the target node of a live migration, virt-handler prepares the destination virt-launcher pod and signals readiness to the source.
- Node labeling: It labels nodes with hardware capabilities (CPU features, presence of KVM, IOMMU groups) so the scheduler can make placement decisions.
In vSphere terms, virt-handler is roughly equivalent to the hostd + vpxa agent combination on each ESXi host -- the local authority that manages VMs on behalf of the central controller.
virt-launcher is a per-VM pod that runs exactly one VM. It is not deployed as a DaemonSet or Deployment -- it is created by virt-controller as a regular pod for each VirtualMachineInstance. Inside the virt-launcher pod, three processes cooperate:
virt-launcher Pod Internal Structure
+====================================================================+
| virt-launcher Pod (one per VM) |
| Kubernetes namespace: vm-namespace |
| Pod name: virt-launcher-my-database-vm-xk7q9 |
| |
| Cgroup: /kubepods/pod<uid>/ |
| CPU/Memory limits enforced by kubelet cgroups |
| |
| +--------------------------------------------------------------+ |
| | Container: compute | |
| | | |
| | PID 1: virt-launcher | |
| | - Translates VMI spec into libvirt domain XML | |
| | - Calls libvirt API to define and start domain | |
| | - Monitors QEMU process health | |
| | - Reports VM state changes to virt-handler (via socket) | |
| | - Handles graceful shutdown (ACPI power button) | |
| | - Exits when QEMU exits (pod terminates) | |
| | | |
| | PID 2: libvirtd | |
| | - Per-pod libvirtd instance (not system-wide) | |
| | - Receives domain XML from virt-launcher | |
| | - Constructs QEMU command line | |
| | - Manages QEMU process lifecycle | |
| | - Handles domain events (migration, shutdown, crash) | |
| | | |
| | PID 3: qemu-kvm (QEMU/KVM) | |
| | - The actual virtual machine | |
| | - vCPU threads (one per vCPU) | |
| | - I/O threads for disk/network | |
| | - Device emulation (virtio, IDE, e1000, etc.) | |
| | - VNC server (for console access) | |
| | - Uses /dev/kvm for hardware virtualization | |
| | - Uses /dev/vhost-net for accelerated networking | |
| +--------------------------------------------------------------+ |
| |
| Mounted Volumes: |
| - /var/run/kubevirt-private (virt-handler communication) |
| - /var/run/libvirt (libvirt socket) |
| - PVC mounts for VM disks |
| - ConfigMap/Secret mounts for cloud-init, sysprep |
| - Device mounts: /dev/kvm, /dev/vhost-net, /dev/net/tun |
| |
| Network Interfaces: |
| - eth0 (pod network, default CNI) |
| - net1, net2, ... (additional NICs via Multus) |
+====================================================================+
The critical design decision: each VM gets its own libvirtd instance. In a traditional KVM deployment, a single system-wide libvirtd manages all VMs on a host. KubeVirt deliberately isolates libvirtd per pod for two reasons: (1) a crash in one VM's libvirtd cannot affect other VMs, and (2) the Kubernetes pod sandbox (cgroups, namespaces) cleanly contains all processes related to a single VM.
This means a worker node running 50 VMs has 50 libvirtd processes and 50 QEMU processes, each in its own pod with its own cgroup. The memory overhead of 50 libvirtd instances (each ~30-50 MB RSS) is a real cost -- roughly 1.5-2.5 GB -- that does not exist in a traditional KVM or VMware setup. At 5,000 VMs across 100 nodes, this is ~50 VMs per node, ~2 GB overhead per node, ~200 GB total cluster overhead for libvirtd alone. This is manageable but must be accounted for in capacity planning.
Custom Resource Definitions
KubeVirt extends the Kubernetes API with several CRDs. The four most important ones:
VirtualMachine (VM): The top-level user-facing resource. It represents a virtual machine with a desired running state. It contains the VM's specification (CPU, memory, disks, network, firmware) and a running or runStrategy field that controls whether the VM should be powered on. The VirtualMachine controller creates and manages a VirtualMachineInstance when the VM should be running. Think of it as the equivalent of a VM in vCenter's inventory -- it persists even when the VM is powered off.
VirtualMachineInstance (VMI): Represents a running instance of a VM. When a VirtualMachine is set to running: true, the virt-controller creates a VMI. The VMI is the actual runtime object -- it maps 1:1 to a virt-launcher pod and a QEMU process. When the VM is shut down, the VMI is deleted. In vSphere terms, the VMI is the runtime state -- like the vmx process on an ESXi host. The VM persists; the VMI is ephemeral.
VirtualMachineInstanceReplicaSet (VMIRS): Manages a set of identical VMIs, analogous to a Kubernetes ReplicaSet for pods. It maintains a desired number of running VM instances. Useful for stateless VM workloads (load balancers, web servers that must run as VMs for legacy reasons). Not commonly used in enterprise environments where each VM is unique.
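As a sketch, a VMIRS that keeps three identical ephemeral VMs running might look like this (the names, labels, and container image are illustrative, not taken from a real environment):

```yaml
apiVersion: kubevirt.io/v1
kind: VirtualMachineInstanceReplicaSet
metadata:
  name: web-frontend
  namespace: web-tier
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web-frontend
  template:
    metadata:
      labels:
        app: web-frontend
    spec:
      domain:
        memory:
          guest: 2Gi
        devices:
          disks:
          - name: rootdisk
            disk:
              bus: virtio
      volumes:
      - name: rootdisk
        containerDisk:
          # Hypothetical image; any OCI containerdisk works here
          image: registry.example.com/images/web-frontend:latest
```

Because the disks are ephemeral containerDisks, any replica can be deleted and recreated -- which is exactly why the pattern fits only stateless VM workloads.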
VirtualMachineInstanceMigration (VMIM): A declarative object that triggers a live migration of a VMI from one node to another. Creating a VMIM object is equivalent to right-clicking a VM in vCenter and selecting "Migrate." The virt-controller and virt-handler cooperate to execute the migration. The VMIM tracks progress and status. When the migration completes (or fails), the VMIM status is updated.
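The migration trigger is itself just a small manifest. A minimal sketch, reusing the example VM name from the YAML Examples section of this chapter:

```yaml
apiVersion: kubevirt.io/v1
kind: VirtualMachineInstanceMigration
metadata:
  name: migrate-oracle-db-prod-01
  namespace: database-tier
spec:
  vmiName: oracle-db-prod-01   # the running VMI to move
```

Applying this object starts the migration; progress and completion are reported in its status, so the whole operation is observable with standard kubectl tooling.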
Additional CRDs that are operationally important:
- VirtualMachineInstancetype / VirtualMachineClusterInstancetype: Pre-defined VM sizes (analogous to VM templates in vCenter or EC2 instance types). Defines CPU count, memory, GPU resources, and I/O thread policies. Instance types are namespaced; cluster instance types are cluster-scoped.
- VirtualMachinePreference / VirtualMachineClusterPreference: Defines preferred settings for firmware (UEFI vs. BIOS), machine type (q35 vs. i440fx), clock source, input devices, and feature flags. Preferences are applied at VM creation and can be overridden per-VM.
- DataVolume: A CDI (Containerized Data Importer) resource that represents a VM disk being imported, cloned, or uploaded. It wraps a PVC and manages the data population lifecycle.
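For illustration, a cluster-scoped instance type and the VM fragment that consumes it might look like the following sketch (the instancetype.kubevirt.io/v1beta1 API group and the size name co-xlarge are assumptions based on recent KubeVirt releases):

```yaml
apiVersion: instancetype.kubevirt.io/v1beta1
kind: VirtualMachineClusterInstancetype
metadata:
  name: co-xlarge
spec:
  cpu:
    guest: 8        # vCPUs presented to the guest
  memory:
    guest: 32Gi     # guest-visible memory
---
# Fragment of a VirtualMachine that consumes it:
# spec:
#   instancetype:
#     kind: VirtualMachineClusterInstancetype
#     name: co-xlarge
```

Standardizing on a small catalog of instance types is the KubeVirt analogue of enforcing T-shirt sizes in vCenter templates, and it pays off at 5,000-VM scale.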
VM Lifecycle Through Kubernetes
Understanding how a VM goes from a YAML file to a running QEMU process is essential for debugging and operations. The flow:
VM Lifecycle: From YAML to Running QEMU Process
Step 1: User applies VM manifest
=====================================
$ kubectl apply -f my-vm.yaml
|
v
kube-apiserver
|
+--> virt-api (admission webhook)
| - Validates VM spec (valid CPU model? valid disk bus?)
| - Mutates defaults (add default network, set machine type)
| - Rejects invalid specs (negative memory, unknown disk type)
|
+--> etcd (VM resource persisted)
Step 2: virt-controller reconciles
=====================================
virt-controller (watch loop)
|
+--> Sees new VM with running: true (or runStrategy: Always)
|
+--> Creates VirtualMachineInstance (VMI) CR
| - Copies spec from VM to VMI
| - Sets VMI status: Pending
|
+--> etcd (VMI resource persisted)
|
+--> Creates virt-launcher Pod
- Sets resource requests/limits (CPU, memory, hugepages)
- Adds device requests (/dev/kvm, /dev/vhost-net)
- Mounts PVCs for disks
- Mounts ConfigMaps/Secrets for cloud-init
- Sets node affinity/anti-affinity from VM spec
- Sets tolerations for any taints
- Adds KubeVirt-specific annotations
Step 3: Kubernetes schedules the pod
=====================================
kube-scheduler
|
+--> Evaluates pod against nodes:
| - Does node have /dev/kvm? (device plugin)
| - Does node have enough CPU/memory?
| - Does node have the requested hugepages?
| - Does node satisfy affinity rules?
| - Does node satisfy topology constraints?
| - Does node have the requested SR-IOV VFs?
|
+--> Binds pod to selected node
|
+--> VMI status: Scheduling -> Scheduled
Step 4: kubelet starts the pod
=====================================
kubelet (on target node)
|
+--> Calls CRI-O (or containerd) to create pod sandbox
| - Creates cgroup hierarchy
| - Creates network namespace
| - CNI plugin configures pod networking
| - Multus attaches additional interfaces
|
+--> Pulls virt-launcher container image (if not cached)
|
+--> Starts virt-launcher container
Step 5: virt-launcher boots the VM
=====================================
virt-launcher process (PID 1 in container)
|
+--> Reads VMI spec from annotation/downward API
|
+--> Translates VMI spec to libvirt domain XML
| - CPU: model, topology, features, NUMA
| - Memory: size, hugepages, NUMA cells
| - Disks: virtio-blk/virtio-scsi backed by PVCs
| - NICs: virtio-net with tap/bridge/SR-IOV backend
| - Firmware: UEFI (OVMF) or SeaBIOS
| - Devices: vTPM, watchdog, RNG, serial console
|
+--> Calls libvirt API: virDomainDefineXML()
+--> Calls libvirt API: virDomainCreate()
|
v
libvirtd (per-pod instance)
|
+--> Parses domain XML
+--> Constructs QEMU command line (~200+ arguments)
+--> Forks QEMU process
|
v
qemu-kvm process
|
+--> Opens /dev/kvm (ioctl: KVM_CREATE_VM)
+--> Creates vCPUs (ioctl: KVM_CREATE_VCPU)
+--> Maps memory (ioctl: KVM_SET_USER_MEMORY_REGION)
+--> Loads firmware (OVMF/SeaBIOS)
+--> Starts vCPU threads (ioctl: KVM_RUN in loop)
|
+--> Guest OS boots
+--> VMI status: Running
Step 6: virt-handler syncs state
=====================================
virt-handler (on same node)
|
+--> Detects running VMI on its node
+--> Reads guest info via QEMU Guest Agent (if installed)
+--> Reports IP addresses, OS info, filesystem info to VMI status
+--> Updates VMI conditions (AgentConnected, LiveMigratable, etc.)
YAML Examples
A complete VirtualMachine definition for a production database VM:
apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
name: oracle-db-prod-01
namespace: database-tier
labels:
app: oracle-database
tier: production
criticality: tier-1
spec:
running: true
template:
metadata:
labels:
app: oracle-database
kubevirt.io/vm: oracle-db-prod-01
spec:
# Node placement
nodeSelector:
node-role.kubernetes.io/worker-vm: ""
affinity:
podAntiAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
- labelSelector:
matchLabels:
app: oracle-database
topologyKey: kubernetes.io/hostname
# CPU and memory
domain:
cpu:
cores: 16
sockets: 1
threads: 1
model: host-passthrough
dedicatedCpuPlacement: true
numa:
guestMappingPassthrough: {}
features:
- name: x2apic
policy: require
memory:
guest: 64Gi
hugepages:
pageSize: 1Gi
machine:
type: q35
firmware:
bootloader:
efi:
secureBoot: true
features:
acpi: {}
apic: {}
smm: {}
clock:
utc: {}
timer:
hpet:
present: false
pit:
tickPolicy: delay
rtc:
tickPolicy: catchup
hyperv: {}
devices:
disks:
- name: rootdisk
disk:
bus: virtio
- name: datadisk
disk:
bus: virtio
dedicatedIOThread: true
- name: cloudinitdisk
disk:
bus: virtio
interfaces:
- name: default
masquerade: {}
- name: storage-net
sriov: {}
networkInterfaceMultiqueue: true
rng: {}
tpm: {}
# Networks
networks:
- name: default
pod: {}
- name: storage-net
multus:
networkName: sriov-storage-vlan100
# Volumes
volumes:
- name: rootdisk
dataVolume:
name: oracle-db-prod-01-root
- name: datadisk
persistentVolumeClaim:
claimName: oracle-db-prod-01-data
- name: cloudinitdisk
cloudInitNoCloud:
networkData: |
version: 2
ethernets:
eth0:
dhcp4: true
userData: |
#cloud-config
hostname: oracle-db-prod-01
ssh_authorized_keys:
- ssh-rsa AAAAB3... admin@company.com
# Eviction strategy for live migration during node drain
evictionStrategy: LiveMigrate
A DataVolume for importing a VMware VMDK:
apiVersion: cdi.kubevirt.io/v1beta1
kind: DataVolume
metadata:
name: oracle-db-prod-01-root
namespace: database-tier
spec:
source:
http:
url: "https://image-server.internal/vmware-exports/oracle-db-root.vmdk"
certConfigMap: image-server-ca
pvc:
accessModes:
- ReadWriteOnce
resources:
requests:
storage: 200Gi
storageClassName: ceph-rbd-ssd
volumeMode: Block
Pod Wrapping: How VMs Live Inside Kubernetes
The design decision to run each VM inside a Kubernetes pod is the defining architectural choice of KubeVirt. It brings enormous benefits and specific costs.
Benefits:
- Scheduling: VMs benefit from the Kubernetes scheduler. Resource requests (CPU, memory, hugepages, devices) are declared in the pod spec, and the scheduler places the VM on a node that can satisfy them. No need for a separate VM scheduler (like DRS).
- Resource enforcement: Kubelet enforces CPU and memory limits via Linux cgroups v2. The QEMU process and its vCPU threads are bound by the same cgroup limits as any container. Memory overcommit, if allowed, uses the same Kubernetes mechanisms (requests vs. limits).
- Monitoring: Standard Kubernetes monitoring (Prometheus, metrics-server) captures pod-level CPU, memory, network, and disk metrics for VMs without any VM-specific instrumentation.
- RBAC: Kubernetes RBAC controls who can create, delete, start, stop, and migrate VMs. No separate permission system is needed.
- Network policies: Kubernetes NetworkPolicy applies to VM pods, controlling which VMs can communicate with which other workloads (VMs or containers).
- Multi-tenancy: Kubernetes namespaces provide logical isolation. Team A's VMs in namespace team-a are invisible to Team B in namespace team-b. Quotas and limit ranges apply per namespace.
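Because quotas are plain Kubernetes objects, per-team VM density can be capped with a standard ResourceQuota. A sketch (all values illustrative; the object-count key uses Kubernetes' generic count/<resource>.<group> quota syntax):

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-a-vm-quota
  namespace: team-a
spec:
  hard:
    requests.cpu: "200"                          # total vCPU requests across all VM pods
    requests.memory: 800Gi                       # total memory requests
    count/virtualmachines.kubevirt.io: "50"      # max VirtualMachine objects in the namespace
```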
Costs:
- Pod lifecycle mismatch: Kubernetes assumes pods are ephemeral. VMs are not. KubeVirt must carefully prevent Kubernetes from killing a VM pod during node pressure, upgrades, or scale-downs. The evictionStrategy: LiveMigrate field is critical -- it tells KubeVirt to live-migrate the VM rather than let Kubernetes terminate the pod during a node drain.
- Resource accounting granularity: A VM pod's resource requests include the QEMU process, libvirtd, and virt-launcher overhead. The actual guest-visible memory is smaller than the pod's memory request due to QEMU overhead (~100-300 MB per VM for device emulation buffers, page tables, and control structures). Capacity planning must account for this "VM tax."
- Startup time: Creating a pod involves network namespace setup, cgroup creation, volume mounts, and container image pulls. This adds 2-10 seconds to VM boot time compared to raw libvirt/QEMU, which can start a VM in <1 second. For a fleet of 5,000 VMs, this matters during disaster recovery or mass restart scenarios.
- Debugging complexity: When a VM fails to start, the cause could be in any layer: Kubernetes scheduling (pod Pending), CRI-O (container creation failed), virt-launcher (libvirt domain definition failed), libvirtd (QEMU command-line generation failed), or QEMU (hardware emulation error). Debugging requires fluency in all layers.
Networking
KubeVirt networking maps VM network interfaces to Kubernetes pod networking, which itself maps to the cluster's CNI (Container Network Interface) implementation. This is one of the areas with the greatest divergence from VMware.
In vSphere, a VM NIC connects to a vSwitch (standard or distributed), which connects to a physical NIC. The mapping is direct: VM NIC -> portgroup -> uplink -> physical NIC.
In KubeVirt, the path is: VM NIC -> virtio-net/e1000/rtl8139 -> tap device inside pod -> pod network interface -> CNI plugin -> physical NIC. And for secondary networks (Multus), each additional NIC is a separate CNI attachment.
KubeVirt supports multiple networking modes for the default pod network interface:
| Mode | How It Works | Use Case | Performance |
|---|---|---|---|
| masquerade | VM traffic is NATed through the pod IP. The VM gets a private address (10.0.2.0/24 by default) and the pod IP is the NAT gateway. Uses iptables/nftables rules. | Default mode. Simple. Works with any CNI. VM is reachable via pod IP + service/ingress. | Moderate. NAT overhead. ~10-15% throughput reduction vs. bridge. |
| bridge | VM NIC is bridged directly to the pod's network interface. The VM gets the pod's IP address via DHCP (KubeVirt runs a DHCP server on the bridge). | When the VM needs the pod IP directly (no NAT). Requires CNI support for bridge takeover. | Good. No NAT overhead. But: pod IP is "taken" by VM, so sidecar containers cannot use it. |
| SR-IOV | VM NIC is attached directly to an SR-IOV Virtual Function (VF) passed through via VFIO. Bypasses all software networking. | High-throughput, low-latency workloads. Requires SR-IOV-capable NICs and the SR-IOV device plugin. | Excellent. Near-native. <3% overhead. Up to 100 Gbps line rate. |
| passt | User-mode networking stack that translates between the VM's network stack and the pod's network namespace. No root privileges required. | Rootless deployments, environments where bridge mode is not supported. | Good. Better than masquerade, no NAT. |
For secondary networks, KubeVirt uses Multus CNI, a "meta-plugin" that allows a pod to have multiple network interfaces. Each additional interface is defined by a NetworkAttachmentDefinition (NAD) resource. This is how VMs get connections to VLANs, storage networks, or dedicated management networks -- analogous to adding a second NIC to a VM in vSphere and connecting it to a different portgroup.
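A NetworkAttachmentDefinition for a VLAN-tagged bridge might look like the following sketch (the bridge name and VLAN ID are illustrative; the bridge CNI plugin is one of several options, and OVN-Kubernetes secondary networks use a different config format):

```yaml
apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
  name: sriov-storage-vlan100     # referenced by the VM's networks[].multus.networkName
  namespace: database-tier
spec:
  config: |
    {
      "cniVersion": "0.3.1",
      "type": "bridge",
      "bridge": "br-storage",
      "vlan": 100,
      "ipam": {}
    }
```

The empty ipam block means the CNI assigns no IP -- the guest OS configures its own address on that NIC, which is the common pattern for storage and backup VLANs carried over from VMware portgroups.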
KubeVirt Networking: VM NIC to Physical NIC
Inside the VM (Guest OS)
+--------------------------------------------------+
| eth0: 10.0.2.2/24 (masquerade, default network) |
| eth1: 192.168.100.5/24 (SR-IOV, storage net) |
+--------------------------------------------------+
| virtio-net | VFIO passthrough
v v
+--------------------------------------------------+
| virt-launcher Pod |
| |
| tap0 --- linux bridge --- eth0 (pod interface) |
| (masquerade mode: |
| iptables DNAT/SNAT |
| on the bridge) |
| |
| [no software path for SR-IOV -- direct HW] |
+--------------------------------------------------+
| eth0: pod IP | SR-IOV VF
v (via CNI: OVN-K, Calico) v (via VFIO)
+--------------------------------------------------+
| Worker Node |
| |
| br-int / ovs-bridge / host bridge |
| (OVN-Kubernetes or other CNI) |
| |
| Physical NIC: eno1 (pod traffic) |
| Physical NIC: eno2 (SR-IOV PF, VFs allocated) |
+--------------------------------------------------+
| |
v v
Physical Network Switch / Fabric
The key vSphere equivalences:
| vSphere Concept | KubeVirt Equivalent |
|---|---|
| vSwitch / vDS portgroup | CNI plugin + NetworkAttachmentDefinition |
| VM NIC (VMXNET3) | VM NIC (virtio-net) |
| Trunk port / VLAN tagging | Multus + VLAN CNI plugin or OVN-Kubernetes secondary network |
| DirectPath I/O (passthrough NIC) | SR-IOV mode with VFIO |
| NSX-T micro-segmentation | Kubernetes NetworkPolicy + OVN-Kubernetes ACLs |
| VM-to-VM on same host | Pod-to-pod (via CNI bridge or OVS) |
| VM traffic shaping (vDS) | CNI bandwidth plugin or OVN-Kubernetes QoS |
Storage
KubeVirt VM disks are backed by Kubernetes Persistent Volume Claims (PVCs). This is one of the most fundamental differences from VMware, where disks are VMDK files on a VMFS or NFS datastore managed by vCenter.
In KubeVirt, each VM disk is either:
- A PVC (filesystem mode or block mode) mapped into the virt-launcher pod and presented to QEMU as a block device or file.
- A containerDisk -- an OCI container image that contains a VM disk (qcow2 or raw) as a layer. The image is pulled by the container runtime and mounted read-only. Useful for ephemeral VMs (live CDs, install ISOs).
- A DataVolume -- a CDI-managed PVC that is populated from an external source (HTTP URL, S3, container registry, existing PVC) before the VM starts.
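As an illustration of the containerDisk case, the volumes section of a VM can reference an OCI image directly (the image shown is a community-published Fedora containerdisk; treat it as an example, not a recommendation):

```yaml
# Fragment of a VirtualMachine/VMI spec:
volumes:
- name: rootdisk
  containerDisk:
    image: quay.io/containerdisks/fedora:latest   # disk image shipped as an OCI layer
```

Writes to a containerDisk land in an ephemeral overlay and are lost when the pod terminates, which is why this option is confined to throwaway and install-media use cases.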
The storage path:
KubeVirt Storage: VM Disk to Physical Storage
Guest OS
+-------------------------------------+
| /dev/vda (virtio-blk) |
| or |
| /dev/sda (virtio-scsi) |
+-------------------------------------+
|
QEMU block layer
|
+--> Raw block device +--> qcow2 file on filesystem
| (block-mode PVC) | (filesystem-mode PVC)
v v
+-------------------------------------+
| PVC mounted into virt-launcher pod |
| - Block mode: /dev/xvda |
| - Filesystem mode: /var/run/ |
| kubevirt-private/vmi-disks/ |
| disk-name/disk.img |
+-------------------------------------+
|
Kubernetes PV / CSI driver
|
+--> Ceph RBD (block)
+--> OpenShift Data Foundation (block/file)
+--> NFS (file)
+--> Local PV (block/file)
+--> NetApp ONTAP (iSCSI block, NFS file)
+--> Pure Storage (iSCSI, FC)
|
v
Physical Storage Array / Cluster
Block mode vs. filesystem mode is a critical choice for production VMs:
| Aspect | Block Mode PVC | Filesystem Mode PVC |
|---|---|---|
| PVC volumeMode | Block | Filesystem (default) |
| QEMU access | Raw block device passed directly to QEMU | QEMU opens a qcow2/raw file on the PVC's mounted filesystem |
| Performance | Better. No filesystem overhead. QEMU I/O goes directly to the block device. | Worse. Double filesystem: the PVC's filesystem (ext4/XFS) + the guest's filesystem. |
| Snapshot support | Via CSI volume snapshots | Via CSI volume snapshots or qcow2 internal snapshots |
| Live migration | Requires RWX (ReadWriteMany) block PVCs or storage-assisted migration | Same requirement |
| Overhead | Minimal | Filesystem metadata, double journaling |
| Recommendation for Tier-1 | Preferred | Avoid for I/O-intensive workloads |
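A block-mode PVC for a Tier-1 data disk, following the recommendation above, might be declared as in this sketch (RWX access is requested so live migration remains possible, which assumes the CSI driver supports ReadWriteMany block volumes):

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: oracle-db-prod-01-data
  namespace: database-tier
spec:
  accessModes:
  - ReadWriteMany        # required for live migration of the attached VM
  volumeMode: Block      # raw block device handed to QEMU, no filesystem layer
  resources:
    requests:
      storage: 500Gi
  storageClassName: ceph-rbd-ssd
```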
For the evaluation at 5,000+ VMs, the storage integration is a major area of scrutiny. The existing VMware estate uses VMFS datastores with vSAN or SAN-backed LUNs. Migrating to KubeVirt means migrating to PVC-backed storage. This requires a CSI driver for whatever storage backend the organization chooses (Ceph/ODF, NetApp, Pure, etc.) and the CDI (Containerized Data Importer) for converting existing VMDKs.
CDI (Containerized Data Importer)
CDI is a Kubernetes operator that manages the lifecycle of VM disk data. It is responsible for populating PVCs with VM disk content before a VM starts. CDI is the answer to the question: "How do I get my VMware VMDK files into KubeVirt?"
CDI supports the following import sources:
| Source | How It Works |
|---|---|
| HTTP/HTTPS URL | Downloads a disk image (raw, qcow2, vmdk, vdi, vhd, vhdx) from a web server. Automatically detects and converts format to raw. |
| S3 bucket | Downloads from S3-compatible storage (AWS S3, MinIO, Ceph RGW). |
| Container registry | Pulls an OCI image that contains a disk image as a layer. Extracts and writes to PVC. |
| Existing PVC (clone) | Clones an existing PVC to a new PVC. Uses CSI clone if available, otherwise smart-clone (snapshot + restore) or host-assisted copy. |
| Upload | Accepts a disk image upload via virtctl image-upload. Streams data directly to a CDI upload pod. |
| VDDK (VMware Virtual Disk Development Kit) | Connects to a VMware vCenter/ESXi and downloads a VM's disk using the VMware VDDK API. This is the primary mechanism used by the Migration Toolkit for Virtualization (MTV). |
| Image I/O (oVirt) | Imports from Red Hat Virtualization (oVirt) -- less relevant for this evaluation. |
| Snapshot | Creates a PVC from a VolumeSnapshot. |
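For the clone source, a DataVolume sketch that stamps a new root disk from a golden-image PVC (all names illustrative):

```yaml
apiVersion: cdi.kubevirt.io/v1beta1
kind: DataVolume
metadata:
  name: web-vm-root
  namespace: web-tier
spec:
  source:
    pvc:
      name: rhel9-golden-image    # hypothetical template disk
      namespace: vm-templates
  pvc:
    accessModes:
    - ReadWriteOnce
    resources:
      requests:
        storage: 50Gi
    storageClassName: ceph-rbd-ssd
```

If the CSI driver supports volume cloning, CDI completes this without moving data through an importer pod -- the fastest path for fanning out many VMs from one template.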
The CDI import flow for a VMware VMDK:
CDI Import Flow: VMware VMDK to KubeVirt PVC
Step 1: User creates DataVolume
================================================
$ kubectl apply -f datavolume-import.yaml
|
v
CDI Controller (watches DataVolume CRs)
|
+--> Creates a PVC with the requested size and storageClass
|
+--> Creates an Importer Pod
|
v
+------------------------------------------+
| Importer Pod |
| |
| 1. Downloads VMDK from HTTP URL |
| (or connects to vCenter via VDDK) |
| |
| 2. Detects source format: |
| - VMDK sparse? VMDK flat? qcow2? |
| - Compressed (gz, xz)? |
| |
| 3. Converts to raw format: |
| VMDK --> qemu-img convert --> raw |
| |
| 4. Writes raw data to PVC: |
| - Block mode: dd to /dev/xvda |
| - Filesystem mode: write to |
| /data/disk.img |
| |
| 5. Reports progress via DataVolume |
| status (0% -> 100%) |
+------------------------------------------+
|
v
PVC is populated and bound
|
v
DataVolume status: Succeeded
|
v
VM can now reference this DataVolume
in its volumes section and boot from it
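The DataVolume created in Step 1 of the flow above might look like the following sketch. The URL, name, size, and storage class are illustrative placeholders, not values from a real environment:

```yaml
apiVersion: cdi.kubevirt.io/v1beta1
kind: DataVolume
metadata:
  name: imported-vm-disk        # illustrative name
spec:
  source:
    http:
      # Hypothetical export location -- substitute the real VMDK URL
      url: "http://images.example.internal/exports/app-server-01.vmdk"
  storage:
    accessModes:
      - ReadWriteMany            # RWX keeps the disk live-migratable later
    resources:
      requests:
        storage: 220Gi           # sized above the virtual disk for headroom
    storageClassName: ocs-storagecluster-ceph-rbd   # assumed ODF/Ceph class
```

Applying this manifest is what kicks off the CDI controller, importer pod, and format conversion shown in the diagram.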
VMDK conversion specifics: CDI uses qemu-img under the hood to convert VMDK files to raw format. It handles all VMDK variants: monolithic sparse, monolithic flat, split sparse, split flat, stream-optimized, and ESXi-style descriptor+extent files. The conversion is CPU-intensive (especially for compressed VMDKs) and disk-space-intensive (VMDKs are often thin-provisioned; the raw output is fully allocated). CDI supports scratch space (a temporary PVC used during conversion) to avoid running out of space on the target PVC.
For the migration of 5,000+ VMs, CDI throughput is a bottleneck concern. Each import runs as a single pod. Parallel imports are supported (create multiple DataVolumes), but the bottleneck shifts to:
- Network bandwidth from the VMware environment to the Kubernetes cluster (or the HTTP server hosting VMDK exports).
- Storage backend write throughput on the target cluster.
- CPU on importer pods for format conversion.
A realistic migration pipeline imports 10-50 VMs in parallel, throttled by network bandwidth and storage IOPS. At 200 GB average disk per VM, 5,000 VMs is 1 PB of data. At 10 Gbps sustained throughput, that is ~9 days of continuous transfer for the raw data alone, not counting conversion overhead, validation, and test boots. CDI bandwidth planning is a critical migration workstream.
Live Migration in KubeVirt
Live migration in KubeVirt moves a running VM from one worker node to another with near-zero downtime. The mechanism is built on libvirt/QEMU's pre-copy migration (covered in Chapter 3), but wrapped in Kubernetes pod semantics. This introduces differences from vMotion that the evaluation team must understand.
How it works:
- An operator (or an automated process like node drain) creates a VirtualMachineInstanceMigration (VMIM) resource.
- The virt-controller sees the VMIM, validates that the VMI is live-migratable (checks conditions: is the storage shared? Are all devices migratable? Is there a target node with sufficient resources?).
- The virt-controller creates a target virt-launcher pod on the destination node. This pod is identical to the source pod (same resource requests, same volume mounts) but does not start a VM -- it starts a QEMU process in "incoming migration" mode, waiting to receive the VM state.
- The virt-handler on the source node signals libvirtd to begin pre-copy migration. Libvirt connects the source QEMU to the target QEMU over a TCP connection (typically over the pod network or a dedicated migration network).
- QEMU performs iterative memory copy: first pass sends all memory pages, subsequent passes send only dirty pages. When the dirty rate is low enough, QEMU pauses the VM on the source, sends the final dirty pages and CPU state, and resumes the VM on the target. This pause window is the migration downtime.
- Once the VM is running on the target, the source virt-launcher pod is terminated. The VMI resource is updated to reflect the new node.
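The sequence above is triggered by a single declarative object. A minimal VMIM sketch (the VMI name is illustrative; the target node is chosen by the scheduler, not specified here):

```yaml
apiVersion: kubevirt.io/v1
kind: VirtualMachineInstanceMigration
metadata:
  name: migrate-app-server-01
spec:
  vmiName: app-server-01   # the running VMI to move off its current node
```

Deleting this resource while the migration is in flight cancels it, leaving the VM on the source node.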
Differences from vMotion:
| Aspect | vMotion | KubeVirt Live Migration |
|---|---|---|
| Unit of migration | VM (vmx process) | Pod (entire virt-launcher pod with VM inside) |
| Storage requirement | Shared datastore (VMFS, vSAN, NFS) or storage vMotion for local disks | Shared PVCs (RWX access mode) or storage-class-specific migration support |
| Network | Dedicated vMotion VMkernel port, encrypted, 10/25 Gbps typical | Pod network or dedicated Multus network, TLS-encrypted, bandwidth depends on CNI |
| Trigger | Manual, DRS-automated, host maintenance mode | Manual (VMIM), node drain (kubectl drain), descheduler policy |
| Convergence | vMotion has mature convergence heuristics, memory pre-copy with stun threshold | QEMU pre-copy with configurable bandwidth limit, convergence timeout, auto-converge (throttle vCPUs) |
| Post-copy | Supported in recent vSphere versions | Supported in KubeVirt (experimental), falls back to pre-copy on failure |
| Downtime target | Typically <1 second for most workloads | Typically 10-500 ms, depends on dirty page rate and migration bandwidth |
| Multi-VM migration | DRS migrates multiple VMs in parallel with dependency awareness | Parallel VMIM resources, but no built-in dependency awareness (must be orchestrated externally) |
| Cancel | Yes, VM stays on source | Yes, delete the VMIM resource |
| Network identity | VM keeps its MAC and IP | VM keeps its MAC and IP (the pod IP changes, but the VM's internal IP is stable if using bridge mode + Multus) |
A critical caveat: the pod IP changes after migration. In masquerade mode, the VM gets a new pod IP on the target node. The VM's internal IP (the NATed address) remains the same from the guest's perspective, but any Kubernetes services or external load balancers pointing to the old pod IP must update. Kubernetes Services (ClusterIP, NodePort, LoadBalancer) handle this automatically if the VM is behind a Service (selector matches the new pod). Direct pod-IP access breaks.
For bridge mode with Multus secondary networks, the VM's MAC address is preserved, and if the secondary network spans both nodes (same VLAN), the VM retains its IP address transparently -- this is the closest equivalent to vMotion's behavior.
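A secondary network of this kind is defined as a NetworkAttachmentDefinition. The following is a sketch only -- the bridge name and VLAN are site-specific assumptions, and the `cnv-bridge` CNI type is the OpenShift Virtualization variant (plain `bridge` upstream):

```yaml
apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
  name: vm-vlan-100
spec:
  config: |
    {
      "cniVersion": "0.3.1",
      "name": "vm-vlan-100",
      "type": "cnv-bridge",
      "bridge": "br-vmdata",
      "vlan": 100
    }
```

A VM interface attached to this network in bridge mode keeps its MAC and IP across a live migration, provided the underlying VLAN spans both source and target nodes.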
Live migration prerequisites:
- Storage must be shared (RWX PVCs). Local storage does not support live migration without storage migration.
- All VM devices must be migratable (SR-IOV passthrough devices are NOT migratable -- the VF is tied to a physical NIC on a specific host).
- Sufficient resources on the target node (CPU, memory, hugepages, devices).
- The `evictionStrategy: LiveMigrate` field must be set on the VM for migration during node drain.
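In the VM manifest, the eviction strategy sits at the template level. A fragment (the VM name is illustrative):

```yaml
apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
  name: app-server-01
spec:
  runStrategy: Always
  template:
    spec:
      evictionStrategy: LiveMigrate   # node drain creates a VMIM instead of stopping the VM
      domain:
        resources:
          requests:
            memory: 4Gi
```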
Console and VNC Access
KubeVirt provides console access through virtctl, a CLI tool that extends kubectl:
- `virtctl console <vm-name>`: Opens a serial console to the VM (equivalent to connecting to the COM1 serial port). Streams the serial console via a websocket through virt-api to the virt-launcher pod's QEMU process. Works for headless Linux VMs with serial console enabled (`console=ttyS0` in the kernel command line).
- `virtctl vnc <vm-name>`: Opens a VNC session to the VM's graphical console. Launches a local VNC viewer connected via a websocket through virt-api. Works for graphical VMs (Windows, Linux with GUI). Requires a VNC viewer on the client machine.
- `virtctl ssh <vm-name>`: Proxies an SSH connection to the VM through the Kubernetes API server. No need for the VM to be directly network-reachable. Requires the QEMU guest agent and proper network configuration.
Additionally, the OpenShift Web Console (in OVE) provides a browser-based VNC client and serial console, comparable to vSphere Client's remote console but accessible through a web browser without any plugins.
The architectural path for console access:
Console Access Path
User's workstation
|
+--> virtctl vnc my-vm
|
v
kubectl proxy / API server (HTTPS/WSS)
|
+--> virt-api (subresource handler)
| Routes to correct virt-launcher pod
|
v
virt-handler (on target node)
|
+--> virt-launcher pod
| |
| v
| QEMU VNC server (port 5900 inside pod)
| or QEMU chardev (serial console)
|
v
Websocket stream back to virtctl
|
v
Local VNC viewer / terminal
Comparison to vSphere: What Maps, What Doesn't
This mapping table is designed for the evaluation team to build a mental model of KubeVirt using their existing VMware knowledge:
| vSphere / ESXi Concept | KubeVirt / OVE Equivalent | Notes |
|---|---|---|
| vCenter Server | kube-apiserver + virt-controller + virt-api | No single "vCenter" -- the functions are distributed across Kubernetes components |
| ESXi Host | Kubernetes worker node (RHCOS) + virt-handler | The node runs RHCOS (Red Hat CoreOS), not ESXi |
| VMX process | virt-launcher pod (containing QEMU process) | Each VM = one pod = one QEMU process |
| hostd + vpxa | virt-handler DaemonSet | Node-local agent reporting to the central controller |
| VM (in vCenter inventory) | VirtualMachine CR | Persistent object, survives power-off |
| Running VM instance | VirtualMachineInstance CR + virt-launcher Pod | Ephemeral, exists only while VM is running |
| VM Template | VirtualMachineClusterInstancetype + VirtualMachineClusterPreference + golden DataVolume | No single "template" object; composed from instance type + preference + source disk |
| Resource Pool | Kubernetes Namespace + ResourceQuota + LimitRange | Namespaces replace resource pools for multi-tenancy |
| DRS (Distributed Resource Scheduler) | kube-scheduler + descheduler (optional) | Kubernetes scheduler handles placement; descheduler handles rebalancing (less mature than DRS) |
| vMotion | VirtualMachineInstanceMigration (VMIM) | Declarative migration resource |
| vDS (Distributed Switch) | CNI plugin (OVN-Kubernetes in OVE) | OVN-Kubernetes is the default CNI; replaces vDS functionality |
| vDS Portgroup | NetworkAttachmentDefinition (NAD) via Multus | Each additional network is a Multus attachment |
| VMFS/vSAN Datastore | StorageClass + PVCs (backed by Ceph, ODF, etc.) | No monolithic "datastore"; each disk is an independent PVC |
| VMDK | PVC (block or filesystem mode) | The PVC is the disk |
| Content Library | Container registry + DataVolume sources | VM images stored as container images or HTTP-accessible files |
| vSphere HA | Kubernetes pod rescheduling + VM run strategy | If a node fails, pods are rescheduled to surviving nodes; runStrategy: Always ensures VMs restart |
| Alarms & Events | Kubernetes Events + Prometheus alerts | No built-in alarm system; alerting via Prometheus + Alertmanager |
| RBAC (vSphere permissions) | Kubernetes RBAC (Roles, RoleBindings, ClusterRoles) | More granular than vSphere; per-namespace, per-resource-type |
| vSphere Tags | Kubernetes Labels + Annotations | Labels are the primary metadata mechanism |
| Guest OS Customization (Sysprep) | cloud-init / Sysprep (via ConfigMap/Secret volumes) | cloud-init for Linux, Sysprep for Windows, injected as volumes |
| Snapshot | VolumeSnapshot (CSI) | Via the CSI driver, not KubeVirt itself; maturity varies by storage backend |
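As one concrete example from the table, guest customization data is delivered to a KubeVirt VM as a volume rather than through a vCenter customization spec. A sketch of the cloud-init form (the user data is illustrative only):

```yaml
# Fragment of spec.template.spec -- this volume pairs with a matching
# disk entry under domain.devices.disks
volumes:
  - name: cloudinitdisk
    cloudInitNoCloud:
      userData: |
        #cloud-config
        hostname: demo-fedora
        password: changeme          # illustrative only; use a Secret in practice
        chpasswd: { expire: false }
```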
What is better in KubeVirt:
- Declarative, GitOps-compatible infrastructure: VMs are YAML files. They can be version-controlled, reviewed, templated with Helm/Kustomize, and deployed via CI/CD pipelines. VMware VM definitions are not natively version-controllable.
- Unified platform: Containers and VMs on the same cluster, sharing the same network, storage, monitoring, and RBAC. No need for separate infrastructure silos.
- RBAC granularity: Kubernetes RBAC is more granular and flexible than vSphere permissions. Per-namespace, per-verb, per-resource-type, with custom roles.
- Multi-tenancy model: Namespaces + ResourceQuotas provide a cleaner multi-tenancy model than vSphere resource pools.
- Extensibility: The CRD/operator model means new functionality can be added without modifying the core platform.
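To make the "VMs are YAML files" point concrete: a complete, bootable VM can be expressed in a few dozen lines and checked into Git. A minimal sketch using a public demo container disk (names and sizing are illustrative):

```yaml
apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
  name: demo-fedora
  labels:
    app: demo
spec:
  runStrategy: Always              # restart the VM if its node fails
  template:
    spec:
      domain:
        cpu:
          cores: 2
        resources:
          requests:
            memory: 2Gi
        devices:
          disks:
            - name: rootdisk
              disk:
                bus: virtio
          interfaces:
            - name: default
              masquerade: {}       # NAT through the pod IP
      networks:
        - name: default
          pod: {}
      volumes:
        - name: rootdisk
          containerDisk:
            image: quay.io/containerdisks/fedora:latest
```

This manifest can be reviewed in a pull request, templated with Helm or Kustomize, and deployed by a GitOps controller like any other Kubernetes resource.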
What is worse in KubeVirt:
- Operational complexity: The debugging surface area is much larger. A VM failure can originate in Kubernetes, CRI-O, CNI, CSI, virt-controller, virt-handler, virt-launcher, libvirtd, or QEMU. VMware's vertically integrated stack is simpler to troubleshoot.
- DRS equivalent maturity: Kubernetes' scheduler is placement-only (initial scheduling). The descheduler (rebalancing running pods) is less mature than DRS. There is no equivalent of DRS load balancing that continuously optimizes VM placement based on real-time utilization.
- Snapshot maturity: VM snapshots in KubeVirt depend on the CSI driver's snapshot capability. There is no integrated memory+disk snapshot comparable to VMware snapshots.
- Tooling ecosystem: vSphere has 20+ years of third-party tooling (backup: Veeam, Commvault; monitoring: vROps; automation: vRA). KubeVirt's ecosystem is younger and thinner, though growing rapidly.
- Live migration constraints: SR-IOV passthrough devices block live migration. VMware's DirectPath I/O has the same limitation, but vSphere admins are accustomed to this; KubeVirt admins may not realize it until a node drain fails.
2. OCI / Container Runtimes (CRI-O, containerd)
CRI (Container Runtime Interface)
The Container Runtime Interface (CRI) is a plugin API that Kubernetes defines for container runtimes. It is the boundary between the kubelet (Kubernetes' node agent) and whatever software actually creates and manages containers (and, by extension, virt-launcher pods for VMs). CRI was introduced in Kubernetes 1.5 (2016) to decouple the kubelet from Docker.
Before CRI, the kubelet had Docker-specific code built in. CRI defines a gRPC API with two services:
- RuntimeService: Container lifecycle (create, start, stop, remove), pod sandbox management, exec, attach, port-forward.
- ImageService: Image pull, list, remove, image status.
Any software that implements this gRPC API can serve as a Kubernetes container runtime. The two major implementations are CRI-O and containerd.
CRI-O
CRI-O is a lightweight, OCI-compliant container runtime built specifically for Kubernetes. It was created by Red Hat, Intel, SUSE, and others. "CRI-O" literally means "CRI + OCI" -- it implements the CRI interface and uses OCI-compliant tools for the actual container operations.
Key characteristics:
- Minimal scope: CRI-O does one thing: serve as a CRI implementation for Kubernetes. It does not try to be a general-purpose container engine like Docker. No `docker build`, no `docker compose`, no image building.
- OCI-compliant: Uses `runc` (or another OCI runtime like `crun` or Kata) to create containers according to the OCI Runtime Specification.
- Used by OpenShift/OVE: All OpenShift clusters (and therefore all OVE deployments) use CRI-O as the container runtime. This is not configurable -- CRI-O is the only supported runtime on OpenShift.
- conmon: CRI-O uses a per-container monitor process called `conmon` (container monitor). Each container gets its own `conmon` process that holds the container's terminal, forwards logs, and detects container exit. This is relevant for KubeVirt because the virt-launcher container's `conmon` process is the first process in the container's PID namespace.
containerd
containerd is a container runtime originally extracted from Docker. Docker donated it to the CNCF, and it is now the default runtime for vanilla Kubernetes, AKS, EKS, GKE, and most cloud Kubernetes services. containerd has implemented the CRI plugin natively since version 1.1.
Key differences from CRI-O:
| Aspect | CRI-O | containerd |
|---|---|---|
| Origin | Built for Kubernetes from scratch | Extracted from Docker |
| Scope | CRI only | CRI + general-purpose container management |
| Image building | Not supported (out of scope) | Not supported natively (but plugins exist) |
| OCI runtime default | crun (on RHEL/OpenShift) | runc |
| Container monitor | conmon (separate process per container) | Internal shim (containerd-shim-runc-v2) |
| Used by | OpenShift, OVE, SUSE | Vanilla Kubernetes, AKS, EKS, GKE |
| Release cycle | Aligned with OpenShift | Independent |
How CRI-O Handles a KubeVirt Pod
When virt-controller creates a virt-launcher pod for a VM, the following chain of events occurs on the target worker node:
CRI-O Execution Chain: kubelet to QEMU
kubelet
|
+--> gRPC: RunPodSandbox()
| |
| v
| CRI-O
| |
| +--> Creates pod-level cgroup (/kubepods/pod<uid>/)
| |
| +--> Creates network namespace (via CNI)
| | +--> Calls primary CNI plugin (OVN-Kubernetes)
| | | Creates veth pair, connects to OVS bridge
| | +--> Calls Multus (if additional networks defined)
| | Calls secondary CNI plugins
| | Creates additional interfaces in namespace
| |
| +--> Creates IPC namespace, UTS namespace
| |
| +--> Returns PodSandboxId
|
+--> gRPC: CreateContainer(PodSandboxId, "compute")
| |
| v
| CRI-O
| |
| +--> Pulls virt-launcher image (if not in local cache)
| | Image: registry.redhat.io/container-native-
| | virtualization/virt-launcher-rhel9:v4.x
| |
| +--> Creates OCI runtime bundle:
| | - config.json (OCI runtime spec)
| | - rootfs/ (from container image layers)
| |
| +--> Spawns conmon process:
| | conmon --cid <container-id> \
| | --runtime /usr/bin/crun \
| | --log-path /var/log/... \
| | ...
| |
| +--> conmon calls crun (OCI runtime):
| crun create <container-id>
| |
| +--> Creates container cgroup
| | (child of pod cgroup)
| +--> Sets up mount namespace
| | (rootfs, volumes, devices)
| +--> Mounts /dev/kvm, /dev/vhost-net
| | into container
| +--> Mounts PVC volumes at expected
| | paths in container
| +--> Configures seccomp profile
| +--> Configures SELinux label
| +--> Joins pod's network namespace
| +--> Starts container init process:
| --> virt-launcher binary (PID 1)
|
+--> gRPC: StartContainer(ContainerId)
|
v
CRI-O --> conmon --> crun start <container-id>
|
v
virt-launcher process begins:
|
+--> Starts libvirtd (as child process)
+--> Defines VM domain in libvirt
+--> Starts QEMU/KVM via libvirt
+--> VM boots inside the container's cgroup
The critical insight: the QEMU process inherits the container's cgroup. This means Kubernetes' CPU and memory limits apply directly to the QEMU process and its vCPU threads. If a VM is configured with resources.requests.memory: 64Gi and resources.limits.memory: 66Gi (64 Gi for the guest + 2 Gi for QEMU overhead), the Linux OOM killer will kill the QEMU process if it exceeds 66 Gi -- exactly as it would kill any container exceeding its memory limit. This is a feature (prevents one VM from consuming unbounded resources) and a risk (an undersized memory limit kills the VM).
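Expressed in the VM spec, the 64 Gi example corresponds to a fragment like the following (the ~2 Gi overhead figure is this chapter's working example, not a universal rule -- actual overhead depends on device configuration):

```yaml
# Fragment of a VirtualMachine spec.template.spec.domain
domain:
  memory:
    guest: 64Gi            # memory visible to the guest OS
  resources:
    requests:
      memory: 64Gi         # what the scheduler reserves on the node
    limits:
      memory: 66Gi         # guest + ~2Gi QEMU/virt-launcher overhead;
                           # exceeding this triggers the host OOM killer
```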
OCI Image Spec and Runtime Spec
The Open Container Initiative (OCI) defines two specifications that are foundational to container runtimes:
OCI Image Specification: Defines how container images are structured -- the manifest, config, and layers. KubeVirt uses this specification for containerDisk volumes: a VM disk image (qcow2 or raw) is packaged as an OCI image layer, pushed to a container registry, and pulled by the container runtime at pod start. This is how ephemeral VM boot disks (like ISOs or live CDs) are distributed.
OCI Runtime Specification: Defines how a container is configured and executed -- the config.json file that CRI-O/containerd passes to runc/crun. This specification defines namespaces, cgroups, mounts, devices, and security settings. For a KubeVirt virt-launcher container, the OCI runtime spec includes device access rules for /dev/kvm and /dev/vhost-net, volume mounts for PVCs, and the appropriate SELinux context.
Why CRI-O vs containerd Matters for OVE
For the OVE evaluation specifically, CRI-O is not a choice -- it is a requirement. OpenShift mandates CRI-O. The implications:
- Troubleshooting: When a virt-launcher pod fails to start, the logs are in CRI-O's log format, and the container is managed by `conmon` and `crun`. The operations team must be familiar with `crictl` (the CRI CLI) rather than `docker` or `nerdctl` (containerd's CLI) for debugging.
- Image pull behavior: CRI-O's image pull behavior differs from containerd in edge cases (authentication, registry mirrors, image signing). OVE's support matrix is tested exclusively with CRI-O.
- Security: CRI-O on OpenShift runs with SELinux enforcing and a locked-down seccomp profile by default. virt-launcher pods require specific SELinux labels (`container_t` with KVM device access) that are configured by the KubeVirt operator. Modifying these settings outside of the operator can break VMs.
- crun vs runc: OpenShift's CRI-O uses `crun` (a C-based OCI runtime) instead of `runc` (Go-based). `crun` has lower overhead and faster container start times, which slightly benefits VM startup latency (the pod sandbox creation phase). The QEMU process itself is unaffected -- `crun` only manages the container lifecycle, not the VM.
For Azure Local, which runs Hyper-V VMs directly (not inside Kubernetes pods), none of the CRI-O/containerd discussion is relevant. Azure Local's VMs are managed by the Hyper-V hypervisor and the Azure Arc management plane, not by a container runtime.
3. Kata Containers / MicroVMs
The Problem: Container Isolation is Weaker than VM Isolation
Standard Linux containers (run by runc/crun) share the host kernel. Process isolation is enforced by kernel namespaces, cgroups, seccomp, and SELinux/AppArmor -- but all containers on a host execute syscalls against the same kernel. A kernel vulnerability (a privilege escalation bug in a syscall handler, a namespace escape, a cgroup bypass) can allow a container to break out and access the host or other containers.
This is fundamentally different from VM isolation, where each VM has its own kernel running inside a hardware-enforced boundary (VT-x/AMD-V, EPT/NPT). A guest kernel vulnerability does not compromise the host. The attack surface is the hypervisor's VM exit handler -- a much smaller and more auditable surface than the Linux syscall table (400+ syscalls).
For a Tier-1 financial enterprise running regulated workloads, this distinction matters. If two different business units (or two different customers in a shared infrastructure) run containers on the same host, the shared-kernel risk may be unacceptable. This is the problem Kata Containers solves.
Kata Containers Architecture
Kata Containers is an open-source project (originally a merger of Intel Clear Containers and Hyper.sh's runV) that runs each container (or pod) inside a lightweight virtual machine. Instead of using runc to create a container with namespace isolation, Kata uses a VMM (Virtual Machine Monitor) to create a lightweight VM for each pod.
Standard Container vs Kata Container
Standard Container (runc/crun) Kata Container
================================ ================================
+---------------------------+ +---------------------------+
| Container Process | | Container Process |
| (shares host kernel) | | (runs in guest kernel) |
+---------------------------+ +---------------------------+
| namespaces + cgroups | | Guest Linux Kernel (5.x) |
| (kernel-level isolation) | | (minimal, stripped-down) |
+---------------------------+ +---------------------------+
| Host Linux Kernel | | VMM (QEMU / Cloud HV / |
| | | Firecracker) |
| | +---------------------------+
| | | Host Linux Kernel + KVM |
+---------------------------+ +---------------------------+
| Hardware | | Hardware (VT-x/EPT) |
+---------------------------+ +---------------------------+
Isolation boundary: Isolation boundary:
Kernel namespaces (software) Hardware virtualization (VT-x)
Kata Containers architecture in detail:
Kata Containers Architecture (per Pod)
+================================================================+
| Kata VM (lightweight, boots in <1 second) |
| |
| +----------------------------------------------------------+ |
| | Container workload(s) | |
| | - Application process(es) | |
| | - OCI bundle rootfs mounted from host (via virtio-fs | |
| | or virtio-9p) | |
| +----------------------------------------------------------+ |
| | kata-agent | |
| | - gRPC server inside the VM | |
| | - Receives container lifecycle commands from kata- | |
| | runtime on the host | |
| | - Creates namespaces/cgroups inside the VM | |
| | - Manages container processes | |
| +----------------------------------------------------------+ |
| | Guest Linux Kernel | |
| | - Minimal kernel (~4-8 MB) | |
| | - Only drivers needed: virtio-blk, virtio-net, | |
| | virtio-fs, virtio-vsock | |
| | - No unnecessary modules, no GUI, no sound | |
| +----------------------------------------------------------+ |
+================================================================+
| virtio-vsock (host <-> guest communication)
| virtio-fs (filesystem sharing)
| virtio-net (networking)
v
+================================================================+
| Host |
| |
| +----------------------------------------------------------+ |
| | kata-runtime (OCI runtime, replaces runc) | |
| | - Called by CRI-O/containerd instead of runc | |
| | - Starts VMM, connects to kata-agent | |
| | - Translates OCI lifecycle calls to gRPC calls | |
| +----------------------------------------------------------+ |
| | VMM (Virtual Machine Monitor) | |
| | Options: | |
| | - QEMU (full-featured, highest compatibility) | |
| | - Cloud Hypervisor (Rust, modern, good perf) | |
| | - Firecracker (AWS, minimal, fastest boot) | |
| | - Dragonball (Alibaba, sandbox-optimized) | |
| +----------------------------------------------------------+ |
| | Linux Kernel + KVM | |
| +----------------------------------------------------------+ |
+================================================================+
The key idea: Kata Containers is a drop-in OCI runtime. From Kubernetes' perspective, a Kata pod looks exactly like a regular pod. The kubelet sends the same CRI calls to CRI-O/containerd. CRI-O/containerd calls kata-runtime instead of runc. The pod gets the same network namespace, the same cgroup accounting, the same volume mounts. The only difference is that the container process runs inside a VM rather than directly on the host kernel.
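In practice, the runtime selection is made per pod through a RuntimeClass. A sketch (the handler name follows the common `kata` convention; actual handler names vary by installation, and the image is illustrative):

```yaml
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: kata
handler: kata              # CRI-O/containerd maps this handler to the Kata runtime
---
apiVersion: v1
kind: Pod
metadata:
  name: isolated-build
spec:
  runtimeClassName: kata   # this pod runs inside a lightweight VM
  containers:
    - name: build
      image: registry.access.redhat.com/ubi9/ubi
      command: ["sleep", "3600"]
```

Pods without `runtimeClassName` continue to use the default runtime (`runc`/`crun`), so standard and Kata pods coexist on the same node.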
Firecracker
Firecracker is Amazon's open-source Virtual Machine Monitor (VMM), built specifically for serverless and container workloads. It powers AWS Lambda and AWS Fargate. Key characteristics:
- Minimal device model: Only 4 emulated devices: virtio-net, virtio-block, serial console, and a minimal keyboard controller (for reboot detection). No USB, no GPU, no sound, no PCI passthrough. This minimal surface reduces attack vectors.
- Fast boot: <125 ms from VMM start to guest kernel init. This is achieved by skipping BIOS/UEFI, using a direct kernel boot with a pre-loaded initrd, and having a minimal device enumeration phase.
- Low memory overhead: ~5 MB per microVM for the VMM itself (compared to ~30-130 MB for a full QEMU instance).
- Rate limiting: Built-in I/O rate limiting (bandwidth and IOPS) per virtio device. This is useful for multi-tenant density where noisy-neighbor I/O is a concern.
- Jailer: A companion process that sets up the microVM's sandbox (seccomp, cgroups, chroot) before Firecracker starts. Provides defense-in-depth even if the VMM itself is compromised.
- Limitations: No live migration. No device passthrough. No GPU support. No UEFI. These limitations are deliberate -- Firecracker is designed for ephemeral, short-lived workloads, not for running persistent VMs.
Firecracker is a Kata Containers VMM option, meaning you can configure Kata to use Firecracker instead of QEMU. This gives the fastest possible boot times but sacrifices features.
Cloud Hypervisor
Cloud Hypervisor is a Rust-based VMM that sits between QEMU (full-featured) and Firecracker (minimal). It was started by Intel and is now a Linux Foundation project. Key characteristics:
- Modern codebase: Written in Rust, with memory safety guarantees. Smaller attack surface than QEMU's 2M+ lines of C.
- Rich enough for cloud workloads: Supports virtio-fs, vhost-user, VFIO (PCI passthrough), vDPA, hotplug (CPU, memory, disk, network), NUMA, and live migration. More capable than Firecracker, less attack surface than QEMU.
- Boot time: ~200-500 ms (slower than Firecracker, much faster than QEMU).
- Memory overhead: ~10-20 MB per VM.
- Kata integration: Cloud Hypervisor is the recommended VMM for Kata Containers in many production deployments, as it offers the best balance of features and security.
KubeVirt vs Kata: Different Use Cases
This is a common source of confusion. Both KubeVirt and Kata Containers involve running VMs on Kubernetes, but they solve completely different problems:
| Aspect | KubeVirt | Kata Containers |
|---|---|---|
| Goal | Run traditional VMs (with their own OS, kernel, applications) on Kubernetes | Run containers with VM-level isolation |
| Workload | A full operating system (Windows Server, RHEL, Ubuntu) with applications installed inside | A container image (just the application + dependencies, no OS kernel) |
| Guest kernel | The VM's own kernel (whatever the guest OS uses) | A shared, minimal guest kernel provided by Kata |
| User experience | "I manage a VM" -- SSH in, install packages, configure services | "I manage a container" -- docker build, kubectl apply, same container workflow |
| Use case | Legacy apps that cannot be containerized, Windows workloads, stateful databases, appliances | Multi-tenant container platforms needing strong isolation, CI/CD build pods, untrusted code execution |
| Boot time | 10-60 seconds (full OS boot, BIOS/UEFI POST, kernel init, systemd) | <1-5 seconds (microVM boots minimal kernel, mounts container rootfs, starts application) |
| Overhead per workload | High (full OS: 512 MB - several GB for guest OS alone) | Low (minimal kernel: 32-128 MB for guest overhead) |
| Managed by | KubeVirt operator (virt-controller, virt-handler, virt-launcher) | CRI-O/containerd + Kata runtime (transparent to Kubernetes) |
In the context of this evaluation:
- KubeVirt is the mechanism by which OVE runs the 5,000+ existing VMs. Every VMware VM that is migrated to OVE becomes a KubeVirt VM.
- Kata Containers is potentially relevant for future workloads -- if the organization runs multi-tenant container platforms and needs stronger isolation than standard Linux containers provide, Kata can provide VM-level isolation transparently. OpenShift supports Kata via the OpenShift Sandboxed Containers operator.
They are complementary, not competing, technologies. An OVE cluster could simultaneously run:
- KubeVirt VMs (legacy Windows and Linux workloads migrated from VMware)
- Standard containers (cloud-native microservices)
- Kata Containers (security-sensitive container workloads needing VM isolation)
All three share the same Kubernetes control plane, networking, storage, and monitoring.
When Kata Matters for the Evaluation
Kata Containers should be considered in the evaluation in the following scenarios:
- Multi-tenant container workloads: If the platform will host containers from multiple business units or external parties with different trust levels, Kata provides hardware-enforced isolation between tenants. Without Kata, container-to-container isolation relies on kernel namespaces, which have a larger attack surface.
- Regulatory requirements for workload isolation: FINMA or internal security policies may require that certain workloads run with hardware-level isolation. Kata satisfies this requirement without requiring dedicated physical hosts.
- CI/CD build pipelines: Running untrusted build jobs (e.g., building third-party code) in standard containers is a security risk. Kata containers confine build jobs in VMs, preventing a malicious build from affecting the host or other pods.
- Comparing isolation models across candidates:
  - OVE: Standard containers (namespace isolation) + Kata Containers (VM isolation) + KubeVirt VMs (full VM isolation). Three tiers available.
  - Azure Local: Hyper-V VMs (full VM isolation) for VMs. Containers run in AKS-HCI (standard namespace isolation). No Kata equivalent.
  - Swisscom ESC: VMware VMs (full VM isolation). Container services depend on Swisscom's offering.
How the Candidates Handle This
Comparison Table
| Aspect | VMware (Current) | OVE (KubeVirt) | Azure Local (Hyper-V) | Swisscom ESC |
|---|---|---|---|---|
| VM Management Model | vCenter (proprietary, centralized) | Kubernetes API + KubeVirt CRDs (declarative, open) | Azure Arc + Hyper-V (hybrid cloud management) | VMware vCloud Director (managed service) |
| VM Definition Format | .vmx files + vCenter DB | YAML manifests (VirtualMachine CR) | PowerShell / Azure CLI / ARM templates | VMware OVF + provider portal |
| Infrastructure as Code | Terraform vSphere provider; PowerCLI; govc | Native (kubectl, Helm, Kustomize, GitOps) | Terraform azurerm provider; Azure Bicep; PowerShell | Limited; API available but not IaC-native |
| VM-to-Host Mapping | VMX process on ESXi host | virt-launcher Pod on worker node | Child partition on Hyper-V host | VMX process on ESXi (provider-managed) |
| Hypervisor Engine | ESXi VMkernel | KVM + QEMU + libvirt (wrapped by KubeVirt) | Hyper-V (microkernel + root partition) | ESXi VMkernel (provider-managed) |
| Container Runtime | N/A (VMs only; vSphere with Tanzu adds containerd) | CRI-O (mandatory on OpenShift) | containerd (for AKS-HCI containers) | N/A for VM workloads |
| Unified VM+Container Platform | Partial (vSphere with Tanzu, but VMs and containers are separate) | Yes (VMs and containers are both pods, same scheduler/network/storage) | Separate (Hyper-V VMs + AKS-HCI for containers) | No (VM-only service) |
| VM Disk Format | VMDK on VMFS/vSAN/NFS | PVC (raw or qcow2 on Ceph RBD, ODF, NFS, etc.) | VHDX on Cluster Shared Volumes (ReFS/NTFS) | VMDK (provider-managed) |
| Disk Import/Migration Tool | N/A (source platform) | CDI (Containerized Data Importer) + MTV | Azure Migrate | Provider-managed migration |
| Live Migration | vMotion (mature, automated via DRS) | KubeVirt VMIM (QEMU pre-copy, pod-based) | Hyper-V Live Migration (mature) | vMotion (provider-managed) |
| Console Access | vSphere Client (VMRC, web console) | virtctl console/vnc, OpenShift web console | Hyper-V Manager, WAC, Azure portal | Provider portal (web console) |
| Multi-tenancy | Resource pools, folders, permissions | Kubernetes namespaces, RBAC, quotas | Azure RBAC, subscriptions | Provider-managed tenant isolation |
| Automated Placement/Balancing | DRS (automated, continuous) | kube-scheduler (initial only) + descheduler (optional, less mature) | SCVMM Dynamic Optimization (if using SCVMM) | Provider-managed (DRS under the hood) |
| VM Security Isolation | VMkernel hypervisor boundary | KVM hypervisor boundary (each VM in own pod cgroup) | Hyper-V hypervisor boundary | VMkernel hypervisor boundary |
| Container Isolation Enhancement | N/A | Kata Containers / OpenShift Sandboxed Containers | N/A for containers at VM level | N/A |
| Ecosystem Maturity | 20+ years, deep third-party integration | 5-7 years production use, growing rapidly | 10+ years Hyper-V, Azure Arc is newer | Depends on VMware maturity (provider-managed) |
Detailed Candidate Analysis
OVE (KubeVirt/KVM)
OVE's greatest strength and greatest risk both stem from the same source: it is Kubernetes-native. The strength is that VM management becomes a natural extension of the Kubernetes platform the organization may already use (or plan to use) for containers. VMs are YAML, lifecycle is declarative, networking and storage are unified, RBAC is standard Kubernetes. For a team that has invested in Kubernetes skills and tooling, OVE is a natural fit.
The risk is that the team may not have invested in Kubernetes skills. Operating KubeVirt at 5,000+ VM scale requires deep Kubernetes expertise: understanding pod scheduling, CSI driver behavior, CNI networking, RBAC, resource quotas, node affinity, taints and tolerations, pod disruption budgets, and the interactions between all of these. A VMware admin who knows vCenter deeply but has never used kubectl faces a steep learning curve.
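To make "VMs are YAML" concrete, a minimal VirtualMachine manifest might look like the following sketch (the VM name, sizing, and the pre-existing PVC name are illustrative placeholders):

```yaml
apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
  name: rhel9-app01
spec:
  running: true                  # desired state: virt-controller creates a VMI
  template:
    spec:
      domain:
        cpu:
          cores: 4
        memory:
          guest: 8Gi
        devices:
          disks:
            - name: rootdisk
              disk:
                bus: virtio      # paravirtualized virtio disk
          interfaces:
            - name: default
              masquerade: {}     # NAT behind the pod IP
      networks:
        - name: default
          pod: {}                # the cluster's default pod network
      volumes:
        - name: rootdisk
          persistentVolumeClaim:
            claimName: rhel9-app01-root   # placeholder PVC holding the boot disk
```

Applying this manifest with kubectl is the entire provisioning workflow; RBAC, quotas, and GitOps come from Kubernetes itself rather than from a separate VM management tool.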
CDI and migration readiness: OVE's CDI is the primary tool for migrating VMware VMDKs. For the 5,000+ VM estate, the Migration Toolkit for Virtualization (MTV) automates the end-to-end flow: discover VMs in vCenter, map networks and storage, convert VMDKs, create VirtualMachine CRs, and validate boot. MTV uses CDI and VDDK under the hood. The evaluation should include a PoC migration of representative VM types (Windows Server with SQL, RHEL with Oracle, Ubuntu with custom apps) to validate conversion fidelity, boot success rate, and performance parity.
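Under the hood, each disk that MTV imports is a CDI DataVolume. A hedged sketch of a VDDK-based import directly from vCenter (the URL, UUID, datastore path, thumbprint, and secret name are all placeholders):

```yaml
apiVersion: cdi.kubevirt.io/v1beta1
kind: DataVolume
metadata:
  name: app01-disk0
spec:
  source:
    vddk:
      url: "https://vcenter.example.com"            # placeholder vCenter endpoint
      uuid: "<source-vm-uuid>"                      # vSphere VM UUID
      backingFile: "[datastore1] app01/app01.vmdk"  # placeholder datastore path
      thumbprint: "<vcenter-cert-thumbprint>"
      secretRef: vcenter-credentials                # Secret with vCenter credentials
  storage:
    resources:
      requests:
        storage: 200Gi                              # must cover the virtual disk size
```

CDI spawns one importer pod per DataVolume, so aggregate migration throughput is governed by how many importer pods run concurrently and by the storage backend's write bandwidth, not by anything in the manifest itself.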
Live migration maturity: KubeVirt's live migration is functional but less mature than vMotion. Specifically:
- There is no equivalent of DRS automatic load balancing. Migrations must be triggered manually or by external automation (descheduler policies, custom operators, or node drains during maintenance).
- Migration bandwidth is configured globally or per migration, but there is no automatic bandwidth negotiation comparable to vMotion's adaptive algorithm.
- SR-IOV passthrough devices block migration, requiring a fallback strategy (e.g., use virtio-net for VMs that must remain migratable and reserve SR-IOV for VMs that can tolerate brief downtime during maintenance).
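Because there is no DRS, each migration is an explicit API object, and bandwidth and parallelism are cluster-level settings. A sketch (object names and limits are illustrative; on OVE these settings are typically surfaced through the HyperConverged CR rather than edited on the KubeVirt CR directly):

```yaml
# One-shot object: creating it triggers a live migration of the named VMI.
apiVersion: kubevirt.io/v1
kind: VirtualMachineInstanceMigration
metadata:
  name: migrate-db01
spec:
  vmiName: db01             # the running VMI to move; target node is chosen by the scheduler
---
# Cluster-wide migration tuning on the KubeVirt CR (illustrative values).
apiVersion: kubevirt.io/v1
kind: KubeVirt
metadata:
  name: kubevirt
  namespace: kubevirt
spec:
  configuration:
    migrations:
      parallelMigrationsPerCluster: 5          # cap on simultaneous migrations
      parallelOutboundMigrationsPerNode: 2     # cap per source node
      bandwidthPerMigration: 2Gi               # throttle per migration stream
```

During a node drain or a mass maintenance window, these limits are what prevent migration traffic from saturating the cluster network, so they belong in the PoC test matrix.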
Kata Containers as a differentiator: OVE can offer three tiers of isolation (container, Kata container, KubeVirt VM) on a single platform. No other candidate offers this. For organizations that need both legacy VMs and secure multi-tenant containers, this is a material advantage.
Azure Local (Hyper-V)
Azure Local does not use KubeVirt or any Kubernetes-native VM management. VMs are managed through the traditional Hyper-V stack with Azure Arc as the management plane. This means:
- VMs are defined through Azure Resource Manager (ARM) templates, Azure Bicep, or Azure CLI -- not through Kubernetes CRDs.
- The VM runtime is a Hyper-V child partition, not a pod. There is no container runtime involved in VM execution.
- Live migration is native Hyper-V live migration, which is mature and well-tested in enterprise environments.
- Console access is through Windows Admin Center (WAC) or the Azure portal.
For a team that is Windows-oriented and already invested in Azure, this is arguably simpler: there is no Kubernetes learning curve for VM management. The trade-off is that containers (via AKS on Azure Local) and VMs are managed through different tools and APIs -- they are not unified on a single platform in the same way KubeVirt unifies them.
Azure Local also does not have an equivalent of Kata Containers. Container isolation in AKS on Azure Local relies on standard Linux namespace isolation (within the AKS Linux nodes) or Hyper-V isolation (for Windows containers). There is no drop-in "VM-level isolation for Linux containers" story comparable to OpenShift Sandboxed Containers.
Swisscom ESC
ESC abstracts all of this behind a managed service. The customer does not interact with KubeVirt, CRI-O, Kata, or any of the technologies in this chapter. VMs are provisioned through Swisscom's portal or API, and the underlying technology is VMware vSphere managed by Swisscom.
The relevance of this chapter for the ESC evaluation is primarily about future risk: if Swisscom transitions away from VMware (due to Broadcom licensing), would the replacement involve KubeVirt? If so, the customer's VMs would be running on the same technology described here, but without customer visibility or control. The evaluation should probe Swisscom's technology roadmap and transition commitments.
Key Takeaways
- KubeVirt is the core of OVE, and understanding it is non-negotiable for evaluating OVE. Every VM on OVE is a KubeVirt-managed VirtualMachine, running as a QEMU process inside a virt-launcher pod on a Kubernetes worker node. The evaluation team must be fluent in the VM lifecycle (from YAML to running QEMU), the CRD model (VM vs. VMI vs. VMIM), and the pod-wrapping implications (resource accounting, cgroup enforcement, pod-level networking).
- The pod-per-VM model is both KubeVirt's greatest strength and its operational cost. It brings Kubernetes scheduling, RBAC, monitoring, and network policy to VMs for free. It costs an additional ~100-300 MB of overhead per VM (libvirtd + virt-launcher + QEMU control structures), adds 2-10 seconds to VM startup, and expands the debugging surface from one layer (VMware) to six (Kubernetes + CRI-O + CNI + CSI + KubeVirt + QEMU).
- Networking mode selection is a high-impact decision. Masquerade is the easiest to operate and is compatible with live migration, but it adds NAT overhead and leaves the VM reachable only through the pod's (changing) IP. Bridge mode gives the VM a direct presence on the network, but is not supported for live migration on the default pod network and is typically paired with secondary (Multus) networks. SR-IOV is essential for high-throughput workloads but blocks live migration. The evaluation PoC should test all three modes with representative workloads.
- CDI throughput determines migration timeline. At 5,000 VMs with an average of 200 GB per VM, the data migration alone is ~1 PB. CDI import parallelism, network bandwidth between the VMware and OVE clusters, and target storage write throughput are the bottleneck factors. The PoC should benchmark actual import rates with the chosen storage backend.
- Live migration in KubeVirt works but lacks DRS-equivalent automation. Migrations must be triggered explicitly (VMIM resource, node drain, or descheduler policy). There is no continuous, utilization-aware rebalancing out of the box. Operationally, this means the team needs to build or adopt automation for migration orchestration during maintenance windows.
- CRI-O is the mandatory container runtime on OVE. The operations team must learn CRI-O's tooling (crictl), log format, and debugging patterns; containerd experience does not fully transfer. CRI-O troubleshooting is part of the VM troubleshooting path because every VM is a container managed by CRI-O.
- Kata Containers is a differentiator for OVE, not a core requirement. For the VM migration use case, Kata is irrelevant -- VMs already have hardware-level isolation via KVM. Kata becomes relevant if the organization plans to run multi-tenant container workloads with regulatory requirements for strong isolation. This is a future-state consideration, not a day-1 migration concern.
- The vSphere-to-KubeVirt conceptual mapping is imperfect. Some mappings are clean (vCenter -> kube-apiserver + virt-controller; ESXi host -> worker node + virt-handler). Others are fundamentally different (kube-scheduler is not a DRS equivalent; VMFS datastores -> PVCs is a paradigm shift from a shared filesystem to per-disk volumes). The evaluation team should not expect a 1:1 feature match but instead assess whether the different model achieves the same operational outcomes.
- Azure Local and Swisscom ESC do not use any of the technologies in this chapter for VM management. Azure Local uses native Hyper-V with Azure Arc; ESC uses VMware vSphere managed by Swisscom. This chapter is primarily relevant for evaluating OVE, with secondary relevance for understanding the architectural differences between the candidates.
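One practical corollary of the networking takeaway: the modes are not mutually exclusive per VM. A single VMI can carry a masquerade interface for management alongside an SR-IOV interface for data. A sketch (the referenced NetworkAttachmentDefinition `sriov-data` must exist separately, and all names are placeholders):

```yaml
apiVersion: kubevirt.io/v1
kind: VirtualMachineInstance
metadata:
  name: market-data01
spec:
  domain:
    memory:
      guest: 4Gi
    devices:
      interfaces:
        - name: default
          masquerade: {}        # management traffic, NAT via the pod network
        - name: fast
          sriov: {}             # VF passthrough for data; blocks live migration
  networks:
    - name: default
      pod: {}
    - name: fast
      multus:
        networkName: sriov-data # placeholder NetworkAttachmentDefinition
```

Note that the presence of the SR-IOV interface makes the whole VM non-migratable, so this pattern should be reserved for the workload categories that genuinely need it.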
Discussion Guide
Use these questions when engaging with Red Hat solution engineers, Kubernetes platform architects, or the organization's internal infrastructure team. The questions probe real-world operational readiness at enterprise scale.
Questions for OVE (Red Hat / KubeVirt)
- virt-launcher overhead at scale: "At 50 VMs per worker node (our expected density), the aggregate libvirtd overhead comes to ~1.5-2.5 GB per node. Across 100+ worker nodes, that is 150-250 GB of cluster memory consumed by libvirtd instances. How do you account for this in capacity planning templates? Is there a roadmap to reduce per-VM overhead (e.g., shared libvirtd, a libvirt-less architecture)?"
- Pod startup latency during disaster recovery: "If an entire rack of 10 worker nodes fails simultaneously (500 VMs), how long does it take for all 500 virt-launcher pods to be scheduled and started on surviving nodes? What are the bottleneck factors -- scheduler throughput, CRI-O image pull, CNI setup, or storage attach? Have you tested this scenario at our scale?"
- CDI import throughput benchmarks: "For our migration of 5,000 VMs (~1 PB total disk), what is the maximum sustained import rate CDI can achieve with parallel DataVolumes? Specifically: how many concurrent importer pods can run without saturating the storage backend or the API server? What is the recommended migration architecture -- dedicated import cluster, direct VDDK from vCenter, or HTTP staging server?"
- Live migration and SR-IOV mutual exclusion: "Approximately 20% of our VMs require SR-IOV for high-throughput network workloads. These VMs cannot live-migrate. What is the recommended maintenance strategy for nodes hosting SR-IOV VMs? Tolerate brief downtime during host patching? Dual-NIC with virtio for migration and SR-IOV for production? Is there a roadmap for migratable SR-IOV (e.g., switchdev mode + OVN hardware offload)?"
- Descheduler as DRS replacement: "Our VMware environment uses DRS in fully automated mode. The KubeVirt descheduler is the closest equivalent. How mature is the descheduler for VM workloads specifically? Does it understand VM-specific constraints (NUMA alignment, dedicated CPUs, hugepages) when making rebalancing decisions? Can it be configured to avoid migrating tier-1 VMs unless resource imbalance exceeds a threshold?"
- Networking mode recommendation for our workload mix: "We have three categories of VMs: (a) general-purpose Linux with <1 Gbps traffic, (b) database servers with 10-25 Gbps storage replication traffic, (c) market data receivers with ultra-low-latency requirements. What networking mode do you recommend for each category? Can a single VM have both a masquerade interface (for management) and an SR-IOV interface (for data) simultaneously?"
- Debugging a VM boot failure: "Walk us through the debugging process when a VM fails to start. Which logs do you check first? How do you distinguish between a Kubernetes scheduling failure (pod Pending), a CRI-O container creation failure (pod ContainerCreating), a virt-launcher failure (libvirt domain definition error), and a QEMU failure (hardware emulation error)? What tooling exists to correlate these layers?"
Questions for Azure Local
- Architectural comparison to KubeVirt: "Azure Local runs VMs directly on Hyper-V without Kubernetes pod wrapping. What are the performance advantages of this simpler architecture (no CRI-O, no pod overhead, no per-VM libvirtd)? Conversely, what does Azure Local lose by not having a Kubernetes-native VM model (no GitOps for VMs, no Kubernetes RBAC for VMs, no unified container+VM platform)?"
Questions for Swisscom ESC
- Technology transition transparency: "If Swisscom migrates the ESC platform from VMware to a Kubernetes-native stack (such as KubeVirt or a similar technology), what is the customer impact? Will existing VMs be transparently migrated, or will customers need to re-export and re-import VMs? What is the contractual notification period for such a technology change?"
Questions for All Candidates
- Isolation model comparison: "Compare the VM isolation boundary across your platform: What is the hypervisor attack surface (number of VM exit handlers, device emulation code size)? Has the hypervisor undergone independent security audit or formal verification? What is the historical CVE rate for hypervisor escape vulnerabilities? For OVE specifically: does the additional pod/container layer (CRI-O, cgroups, namespaces) add defense-in-depth, or does it add attack surface?"