Modern datacenters and beyond

Kubernetes-Native Virtualization

Why This Matters

The previous three chapters covered the foundational concepts of virtualization, the VMware baseline you are migrating away from, and the hypervisor engines (KVM, QEMU, libvirt, Hyper-V) that power the candidate platforms. This chapter takes the next step up the stack: how Kubernetes -- a system designed for containers -- becomes a platform that runs traditional virtual machines as first-class citizens.

This is the most important chapter in the virtualization series for one reason: KubeVirt is the core of OpenShift Virtualization Engine (OVE). Every VM that runs on OVE is a KubeVirt-managed VM. Every live migration, every storage attachment, every network connection, every console session -- all of these flow through KubeVirt's custom resources, controllers, and per-VM pod architecture. Understanding KubeVirt at the depth presented here is not optional for evaluating OVE. It is the evaluation.

For Azure Local, this chapter provides a contrast: Azure Local does not use Kubernetes for VM management (it uses Hyper-V with the Azure Arc management layer). Understanding KubeVirt clarifies what is architecturally different about the OVE approach -- where it is more flexible, where it is more complex, and where it introduces operational patterns that a VMware-trained team has never encountered.

For Swisscom ESC, this chapter is contextual. ESC currently runs on VMware; if Swisscom ever transitions to a Kubernetes-native platform, the concepts here would apply.

The chapter also covers the container runtime layer (CRI-O, containerd) that sits between Kubernetes and the actual virt-launcher processes, and Kata Containers, which represent a different approach to virtualization in Kubernetes -- running containers inside VMs rather than running VMs inside pods.

At 5,000+ VM scale, every architectural decision in the KubeVirt stack has compounding effects. A misconfigured CDI import pipeline can delay a migration wave by weeks. A misunderstood networking mode (masquerade vs. bridge vs. SR-IOV) can mean the difference between 1 Gbps and 25 Gbps per VM. An incorrectly sized virt-handler DaemonSet causes node-level failures during mass live migration. This chapter equips the evaluation team to operate at that level.


Concepts

1. KubeVirt

What KubeVirt Is

KubeVirt is a Kubernetes extension (operator) that enables virtual machines to run as first-class workloads alongside containers on the same cluster. It was started in 2017 by Red Hat engineers, donated to the Cloud Native Computing Foundation (CNCF) as a Sandbox project in 2019, and promoted to CNCF Incubating status in 2022. It is the upstream open-source project that Red Hat packages and supports as OpenShift Virtualization, which in turn is the foundation of OpenShift Virtualization Engine (OVE).

The core idea: instead of building a separate management plane for VMs (like vCenter or System Center VMM), KubeVirt extends the existing Kubernetes API with Custom Resource Definitions (CRDs) that represent virtual machines. The Kubernetes API server, scheduler, RBAC, monitoring, and networking all apply to VMs just as they do to containers. The VM itself runs inside a Kubernetes pod, where a virt-launcher process manages a libvirtd instance and a QEMU process.
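To make this concrete, here is a minimal sketch of such a custom resource. The name, image, and sizes are illustrative, not from the production example later in this chapter; the point is that the VM is just another Kubernetes object, created with kubectl apply and governed by ordinary RBAC:

```yaml
# Minimal VirtualMachine custom resource (illustrative values).
apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
  name: demo-vm            # hypothetical name
  namespace: default
spec:
  running: true            # desired power state; virt-controller reconciles it
  template:
    spec:
      domain:
        memory:
          guest: 2Gi
        devices:
          disks:
            - name: rootdisk
              disk:
                bus: virtio
          interfaces:
            - name: default
              masquerade: {}
      networks:
        - name: default
          pod: {}
      volumes:
        - name: rootdisk
          containerDisk:
            image: quay.io/containerdisks/fedora:latest  # example image
```

Applying this manifest is enough to get a booted guest: the controllers described below translate it into a pod, a libvirt domain, and ultimately a QEMU process.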

This is not a trivial wrapper. KubeVirt solves a genuinely hard engineering problem: Kubernetes was designed around the assumption that workloads are ephemeral, stateless, and horizontally scalable. VMs are the opposite -- they are long-lived, deeply stateful, and vertically scaled. KubeVirt bridges this gap by introducing VM-specific lifecycle semantics (start, stop, pause, migrate, restart) on top of Kubernetes' pod-centric model.

Architecture

KubeVirt consists of four primary components deployed on a Kubernetes cluster:

  KubeVirt Architecture Overview

  +=====================================================================+
  |  Kubernetes Control Plane                                           |
  |  +---------------------------------------------------------------+  |
  |  |  kube-apiserver (+ KubeVirt CRDs registered)                  |  |
  |  |  - VirtualMachine (vm)                                        |  |
  |  |  - VirtualMachineInstance (vmi)                               |  |
  |  |  - VirtualMachineInstanceReplicaSet (vmirs)                   |  |
  |  |  - VirtualMachineInstanceMigration (vmim)                     |  |
  |  |  - VirtualMachineClusterPreference                            |  |
  |  |  - VirtualMachineClusterInstancetype                          |  |
  |  +---------------------------------------------------------------+  |
  |                                                                     |
  |  +-----------------------------+  +-----------------------------+   |
  |  |  virt-api (Deployment)      |  |  virt-controller            |   |
  |  |                             |  |  (Deployment, HA pair)      |   |
  |  |  - Validating webhook       |  |                             |   |
  |  |  - Mutating webhook         |  |  - Watches VM/VMI CRs       |   |
  |  |  - Subresource API          |  |  - Creates virt-launcher    |   |
  |  |    (console, VNC,           |  |    pods for each VMI        |   |
  |  |     start, stop, migrate)   |  |  - Manages VM lifecycle     |   |
  |  |  - virtctl proxy target     |  |    state machine            |   |
  |  |  - Certificate management   |  |  - Coordinates migrations   |   |
  |  +-----------------------------+  +-----------------------------+   |
  +=====================================================================+

  +=======================================================================+
  |  Worker Node 1                      Worker Node 2                     |
  |  +-------------------------------+  +-------------------------------+ |
  |  | virt-handler (DaemonSet)      |  | virt-handler (DaemonSet)      | |
  |  | - Registers node capabilities |  | - Registers node caps         | |
  |  | - Manages device plugins      |  | - Manages device plugins      | |
  |  |   (KVM, vhost-net, GPU, SRIOV)|  |   (KVM, vhost-net, etc.)      | |
  |  | - Syncs VMI state with API    |  | - Syncs VMI state with API    | |
  |  | - Coordinates migration target|  | - Coordinates migration       | |
  |  | - Configures node networking  |  | - Configures node net         | |
  |  +-------------------------------+  +-------------------------------+ |
  |                                                                       |
  |  +---------------------------+  +---------------------------+         |
  |  | virt-launcher Pod (VM-A)  |  | virt-launcher Pod (VM-B)  |         |
  |  | +----------------------+  |  | +----------------------+  |         |
  |  | | virt-launcher process|  |  | | virt-launcher process|  |         |
  |  | +----------------------+  |  | +----------------------+  |         |
  |  | | libvirtd             |  |  | | libvirtd             |  |         |
  |  | +----------------------+  |  | +----------------------+  |         |
  |  | | QEMU/KVM process     |  |  | | QEMU/KVM process     |  |         |
  |  | | (the actual VM)      |  |  | | (the actual VM)      |  |         |
  |  | +----------------------+  |  | +----------------------+  |         |
  |  +---------------------------+  +---------------------------+         |
  +=======================================================================+

virt-api is a Deployment (typically 2 replicas for HA) that serves as the entry point for all KubeVirt-specific API operations. It registers itself as a Kubernetes admission webhook (both validating and mutating) so that every VirtualMachine or VirtualMachineInstance create/update request passes through KubeVirt's validation logic before being persisted in etcd. The virt-api also provides subresource endpoints -- these are the REST endpoints that virtctl uses for operations that do not map to standard Kubernetes CRUD: opening a VNC console, streaming a serial console, triggering a start/stop/restart, initiating a live migration. Architecturally, virt-api is comparable to vCenter's SOAP/REST API facade, except it extends the Kubernetes API server rather than replacing it.

virt-controller is a Deployment (typically 2 replicas with leader election) that implements the core reconciliation loop. It watches VirtualMachine and VirtualMachineInstance custom resources through the Kubernetes API server and ensures reality matches the declared state. When a user creates a VirtualMachineInstance, the virt-controller creates a pod (the virt-launcher pod) with the correct resource requests, volume mounts, and annotations. When a VirtualMachine is set to running: false, the virt-controller deletes the associated VMI and its pod. The virt-controller also manages the state machine for VMIs (Pending -> Scheduling -> Scheduled -> Running -> Succeeded/Failed) and coordinates live migrations by creating target pods on destination nodes.

In vSphere terms, virt-controller is the equivalent of the vpxd process inside vCenter -- the central brain that translates desired state into actions on hosts.

virt-handler is a DaemonSet that runs on every worker node capable of hosting VMs. It is the node-level agent. Its responsibilities include:

- Registering node capabilities (CPU model, virtualization features) so the scheduler can place VMs on suitable nodes
- Managing the device plugins that expose /dev/kvm, /dev/vhost-net, GPUs, and SR-IOV VFs to virt-launcher pods
- Syncing VMI state between the cluster API and the virt-launcher processes on its node
- Coordinating the node-side work of live migrations (preparing the target, signaling the source)
- Configuring node-level networking for VM interfaces

In vSphere terms, virt-handler is roughly equivalent to the hostd + vpxa agent combination on each ESXi host -- the local authority that manages VMs on behalf of the central controller.

virt-launcher is a per-VM pod that runs exactly one VM. It is not deployed as a DaemonSet or Deployment -- it is created by virt-controller as a regular pod for each VirtualMachineInstance. Inside the virt-launcher pod, three processes cooperate:

  virt-launcher Pod Internal Structure

  +====================================================================+
  |  virt-launcher Pod (one per VM)                                    |
  |  Kubernetes namespace: vm-namespace                                |
  |  Pod name: virt-launcher-my-database-vm-xk7q9                      |
  |                                                                    |
  |  Cgroup: /kubepods/pod<uid>/                                       |
  |  CPU/Memory limits enforced by kubelet cgroups                     |
  |                                                                    |
  |  +--------------------------------------------------------------+  |
  |  |  Container: compute                                          |  |
  |  |                                                              |  |
  |  |  PID 1: virt-launcher                                        |  |
  |  |    - Translates VMI spec into libvirt domain XML             |  |
  |  |    - Calls libvirt API to define and start domain            |  |
  |  |    - Monitors QEMU process health                            |  |
  |  |    - Reports VM state changes to virt-handler (via socket)   |  |
  |  |    - Handles graceful shutdown (ACPI power button)           |  |
  |  |    - Exits when QEMU exits (pod terminates)                  |  |
  |  |                                                              |  |
  |  |  PID 2: libvirtd                                             |  |
  |  |    - Per-pod libvirtd instance (not system-wide)             |  |
  |  |    - Receives domain XML from virt-launcher                  |  |
  |  |    - Constructs QEMU command line                            |  |
  |  |    - Manages QEMU process lifecycle                          |  |
  |  |    - Handles domain events (migration, shutdown, crash)      |  |
  |  |                                                              |  |
  |  |  PID 3: qemu-kvm (QEMU/KVM)                                  |  |
  |  |    - The actual virtual machine                              |  |
  |  |    - vCPU threads (one per vCPU)                             |  |
  |  |    - I/O threads for disk/network                            |  |
  |  |    - Device emulation (virtio, IDE, e1000, etc.)             |  |
  |  |    - VNC server (for console access)                         |  |
  |  |    - Uses /dev/kvm for hardware virtualization               |  |
  |  |    - Uses /dev/vhost-net for accelerated networking          |  |
  |  +--------------------------------------------------------------+  |
  |                                                                    |
  |  Mounted Volumes:                                                  |
  |  - /var/run/kubevirt-private   (virt-handler communication)        |
  |  - /var/run/libvirt            (libvirt socket)                    |
  |  - PVC mounts for VM disks                                         |
  |  - ConfigMap/Secret mounts for cloud-init, sysprep                 |
  |  - Device mounts: /dev/kvm, /dev/vhost-net, /dev/net/tun           |
  |                                                                    |
  |  Network Interfaces:                                               |
  |  - eth0 (pod network, default CNI)                                 |
  |  - net1, net2, ... (additional NICs via Multus)                    |
  +====================================================================+

The critical design decision: each VM gets its own libvirtd instance. In a traditional KVM deployment, a single system-wide libvirtd manages all VMs on a host. KubeVirt deliberately isolates libvirtd per pod for two reasons: (1) a crash in one VM's libvirtd cannot affect other VMs, and (2) the Kubernetes pod sandbox (cgroups, namespaces) cleanly contains all processes related to a single VM.

This means a worker node running 50 VMs has 50 libvirtd processes and 50 QEMU processes, each in its own pod with its own cgroup. The memory overhead of 50 libvirtd instances (each ~30-50 MB RSS) is a real cost -- roughly 1.5-2.5 GB -- that does not exist in a traditional KVM or VMware setup. At 5,000 VMs across 100 nodes, this is ~50 VMs per node, ~2 GB overhead per node, ~200 GB total cluster overhead for libvirtd alone. This is manageable but must be accounted for in capacity planning.

Custom Resource Definitions

KubeVirt extends the Kubernetes API with several CRDs. The four most important ones:

VirtualMachine (VM): The top-level user-facing resource. It represents a virtual machine with a desired running state. It contains the VM's specification (CPU, memory, disks, network, firmware) and a running or runStrategy field that controls whether the VM should be powered on. The VirtualMachine controller creates and manages a VirtualMachineInstance when the VM should be running. Think of it as the equivalent of a VM in vCenter's inventory -- it persists even when the VM is powered off.
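Alongside the boolean running field, runStrategy expresses richer power-state intent. A sketch, using the documented values (Always, RerunOnFailure, Manual, Halted); the VM name is illustrative:

```yaml
# runStrategy is mutually exclusive with "running": a VM uses one or the other.
apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
  name: demo-vm                  # illustrative
spec:
  runStrategy: RerunOnFailure    # restart the VMI if the guest crashes,
                                 # but respect a clean guest-initiated shutdown
  template:
    spec:
      domain:
        devices: {}
```

RerunOnFailure is a useful middle ground for migrated workloads: the platform restarts crashed guests, but an administrator shutting the VM down from inside the guest does not fight the controller.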

VirtualMachineInstance (VMI): Represents a running instance of a VM. When a VirtualMachine is set to running: true, the virt-controller creates a VMI. The VMI is the actual runtime object -- it maps 1:1 to a virt-launcher pod and a QEMU process. When the VM is shut down, the VMI is deleted. In vSphere terms, the VMI is the runtime state -- like the vmx process on an ESXi host. The VM persists; the VMI is ephemeral.

VirtualMachineInstanceReplicaSet (VMIRS): Manages a set of identical VMIs, analogous to a Kubernetes ReplicaSet for pods. It maintains a desired number of running VM instances. Useful for stateless VM workloads (load balancers, web servers that must run as VMs for legacy reasons). Not commonly used in enterprise environments where each VM is unique.

VirtualMachineInstanceMigration (VMIM): A declarative object that triggers a live migration of a VMI from one node to another. Creating a VMIM object is equivalent to right-clicking a VM in vCenter and selecting "Migrate." The virt-controller and virt-handler cooperate to execute the migration. The VMIM tracks progress and status. When the migration completes (or fails), the VMIM status is updated.
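Declaratively, the migration request itself is tiny. A sketch referencing the example VM used later in this chapter:

```yaml
# Triggers a live migration of the named VMI to another node --
# the declarative equivalent of "Migrate" in vCenter.
apiVersion: kubevirt.io/v1
kind: VirtualMachineInstanceMigration
metadata:
  name: migrate-oracle-db-prod-01   # illustrative name
  namespace: database-tier
spec:
  vmiName: oracle-db-prod-01
```

The object's status tracks the migration phase; the destination node is chosen by the Kubernetes scheduler, subject to the VM's affinity and toleration rules.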

Additional CRDs that are operationally important:

- VirtualMachineClusterInstancetype / VirtualMachineClusterPreference: cluster-wide, reusable definitions of VM sizing (CPU, memory) and guest preferences (machine type, device defaults). They let a platform team standardize VM shapes instead of hand-editing every spec -- roughly analogous to hardware templates in vSphere.
- DataVolume (cdi.kubevirt.io, managed by the CDI operator): declarative import, clone, or upload of VM disk data into PVCs, covered in the CDI section below.

VM Lifecycle Through Kubernetes

Understanding how a VM goes from a YAML file to a running QEMU process is essential for debugging and operations. The flow:

  VM Lifecycle: From YAML to Running QEMU Process

  Step 1: User applies VM manifest
  =====================================
  $ kubectl apply -f my-vm.yaml
       |
       v
  kube-apiserver
       |
       +--> virt-api (admission webhook)
       |      - Validates VM spec (valid CPU model? valid disk bus?)
       |      - Mutates defaults (add default network, set machine type)
       |      - Rejects invalid specs (negative memory, unknown disk type)
       |
       +--> etcd (VM resource persisted)

  Step 2: virt-controller reconciles
  =====================================
  virt-controller (watch loop)
       |
       +--> Sees new VM with running: true (or runStrategy: Always)
       |
       +--> Creates VirtualMachineInstance (VMI) CR
       |      - Copies spec from VM to VMI
       |      - Sets VMI status: Pending
       |
       +--> etcd (VMI resource persisted)
       |
       +--> Creates virt-launcher Pod
              - Sets resource requests/limits (CPU, memory, hugepages)
              - Adds device requests (/dev/kvm, /dev/vhost-net)
              - Mounts PVCs for disks
              - Mounts ConfigMaps/Secrets for cloud-init
              - Sets node affinity/anti-affinity from VM spec
              - Sets tolerations for any taints
              - Adds KubeVirt-specific annotations

  Step 3: Kubernetes schedules the pod
  =====================================
  kube-scheduler
       |
       +--> Evaluates pod against nodes:
       |      - Does node have /dev/kvm? (device plugin)
       |      - Does node have enough CPU/memory?
       |      - Does node have the requested hugepages?
       |      - Does node satisfy affinity rules?
       |      - Does node satisfy topology constraints?
       |      - Does node have the requested SR-IOV VFs?
       |
       +--> Binds pod to selected node
       |
       +--> VMI status: Scheduling -> Scheduled

  Step 4: kubelet starts the pod
  =====================================
  kubelet (on target node)
       |
       +--> Calls CRI-O (or containerd) to create pod sandbox
       |      - Creates cgroup hierarchy
       |      - Creates network namespace
       |      - CNI plugin configures pod networking
       |      - Multus attaches additional interfaces
       |
       +--> Pulls virt-launcher container image (if not cached)
       |
       +--> Starts virt-launcher container

  Step 5: virt-launcher boots the VM
  =====================================
  virt-launcher process (PID 1 in container)
       |
       +--> Reads VMI spec from annotation/downward API
       |
       +--> Translates VMI spec to libvirt domain XML
       |      - CPU: model, topology, features, NUMA
       |      - Memory: size, hugepages, NUMA cells
       |      - Disks: virtio-blk/virtio-scsi backed by PVCs
       |      - NICs: virtio-net with tap/bridge/SR-IOV backend
       |      - Firmware: UEFI (OVMF) or SeaBIOS
       |      - Devices: vTPM, watchdog, RNG, serial console
       |
       +--> Calls libvirt API: virDomainDefineXML()
       +--> Calls libvirt API: virDomainCreate()
       |
       v
  libvirtd (per-pod instance)
       |
       +--> Parses domain XML
       +--> Constructs QEMU command line (~200+ arguments)
       +--> Forks QEMU process
       |
       v
  qemu-kvm process
       |
       +--> Opens /dev/kvm (ioctl: KVM_CREATE_VM)
       +--> Creates vCPUs (ioctl: KVM_CREATE_VCPU)
       +--> Maps memory (ioctl: KVM_SET_USER_MEMORY_REGION)
       +--> Loads firmware (OVMF/SeaBIOS)
       +--> Starts vCPU threads (ioctl: KVM_RUN in loop)
       |
       +--> Guest OS boots
       +--> VMI status: Running

  Step 6: virt-handler syncs state
  =====================================
  virt-handler (on same node)
       |
       +--> Detects running VMI on its node
       +--> Reads guest info via QEMU Guest Agent (if installed)
       +--> Reports IP addresses, OS info, filesystem info to VMI status
       +--> Updates VMI conditions (AgentConnected, LiveMigratable, etc.)

YAML Examples

A complete VirtualMachine definition for a production database VM:

apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
  name: oracle-db-prod-01
  namespace: database-tier
  labels:
    app: oracle-database
    tier: production
    criticality: tier-1
spec:
  running: true
  template:
    metadata:
      labels:
        app: oracle-database
        kubevirt.io/vm: oracle-db-prod-01
    spec:
      # Node placement
      nodeSelector:
        node-role.kubernetes.io/worker-vm: ""
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchLabels:
                  app: oracle-database
              topologyKey: kubernetes.io/hostname

      # CPU and memory
      domain:
        cpu:
          cores: 16
          sockets: 1
          threads: 1
          model: host-passthrough
          dedicatedCpuPlacement: true
          numa:
            guestMappingPassthrough: {}
          features:
            - name: x2apic
              policy: require
        memory:
          guest: 64Gi
          hugepages:
            pageSize: 1Gi
        machine:
          type: q35
        firmware:
          bootloader:
            efi:
              secureBoot: true
        features:
          acpi: {}
          apic: {}
          smm: {}
        clock:
          utc: {}
          timer:
            hpet:
              present: false
            pit:
              tickPolicy: delay
            rtc:
              tickPolicy: catchup
            hyperv: {}
        devices:
          disks:
            - name: rootdisk
              disk:
                bus: virtio
            - name: datadisk
              disk:
                bus: virtio
              dedicatedIOThread: true
            - name: cloudinitdisk
              disk:
                bus: virtio
          interfaces:
            - name: default
              masquerade: {}
            - name: storage-net
              sriov: {}
          networkInterfaceMultiqueue: true
          rng: {}
          tpm: {}

      # Networks
      networks:
        - name: default
          pod: {}
        - name: storage-net
          multus:
            networkName: sriov-storage-vlan100

      # Volumes
      volumes:
        - name: rootdisk
          dataVolume:
            name: oracle-db-prod-01-root
        - name: datadisk
          persistentVolumeClaim:
            claimName: oracle-db-prod-01-data
        - name: cloudinitdisk
          cloudInitNoCloud:
            networkData: |
              version: 2
              ethernets:
                eth0:
                  dhcp4: true
            userData: |
              #cloud-config
              hostname: oracle-db-prod-01
              ssh_authorized_keys:
                - ssh-rsa AAAAB3... admin@company.com

      # Eviction strategy for live migration during node drain
      evictionStrategy: LiveMigrate

A DataVolume for importing a VMware VMDK:

apiVersion: cdi.kubevirt.io/v1beta1
kind: DataVolume
metadata:
  name: oracle-db-prod-01-root
  namespace: database-tier
spec:
  source:
    http:
      url: "https://image-server.internal/vmware-exports/oracle-db-root.vmdk"
      certConfigMap: image-server-ca
  pvc:
    accessModes:
      - ReadWriteOnce
    resources:
      requests:
        storage: 200Gi
    storageClassName: ceph-rbd-ssd
    volumeMode: Block

Pod Wrapping: How VMs Live Inside Kubernetes

The design decision to run each VM inside a Kubernetes pod is the defining architectural choice of KubeVirt. It brings enormous benefits and specific costs.

Benefits:

- One control plane: the Kubernetes API server, scheduler, RBAC, quotas, and monitoring apply to VMs exactly as they do to containers -- no separate vCenter-style management plane to operate.
- Clean containment: the pod sandbox (cgroups, namespaces) wraps every process belonging to a VM, so resource limits, accounting, and cleanup are enforced by kubelet.
- Ecosystem reuse: CNI networking, CSI storage, NetworkPolicy, affinity/anti-affinity, and node drain semantics work for VMs without VM-specific reimplementation.

Costs:

- Per-VM overhead: each VM carries its own virt-launcher and libvirtd processes (roughly 30-50 MB RSS for libvirtd alone), a cost that does not exist with a single system-wide libvirtd or with ESXi.
- Impedance mismatch: Kubernetes assumes ephemeral, replaceable workloads; VM semantics (stop/start without deletion, live migration, long-lived state) require KubeVirt-specific machinery layered on top of the pod model.
- New failure modes: a VM outage can now originate in the CNI, the CSI driver, kubelet, or an admission webhook -- layers a VMware-trained team has never had to debug.

Networking

KubeVirt networking maps VM network interfaces to Kubernetes pod networking, which itself maps to the cluster's CNI (Container Network Interface) implementation. This is one of the areas with the greatest divergence from VMware.

In vSphere, a VM NIC connects to a vSwitch (standard or distributed), which connects to a physical NIC. The mapping is direct: VM NIC -> portgroup -> uplink -> physical NIC.

In KubeVirt, the path is: VM NIC -> virtio-net/e1000/rtl8139 -> tap device inside pod -> pod network interface -> CNI plugin -> physical NIC. And for secondary networks (Multus), each additional NIC is a separate CNI attachment.

KubeVirt supports multiple networking modes for the default pod network interface:

masquerade
  How it works: VM traffic is NATed through the pod IP. The VM gets a
  private address (10.0.2.0/24 by default) and the pod IP is the NAT
  gateway. Uses iptables/nftables rules.
  Use case: Default mode. Simple. Works with any CNI. VM is reachable via
  pod IP + service/ingress.
  Performance: Moderate. NAT overhead; roughly 10-15% throughput reduction
  vs. bridge.

bridge
  How it works: VM NIC is bridged directly to the pod's network interface.
  The VM gets the pod's IP address via DHCP (KubeVirt runs a DHCP server
  on the bridge).
  Use case: When the VM needs the pod IP directly (no NAT). Requires CNI
  support for bridge takeover.
  Performance: Good. No NAT overhead. But the pod IP is "taken" by the VM,
  so sidecar containers cannot use it.

SR-IOV
  How it works: VM NIC is attached directly to an SR-IOV Virtual Function
  (VF) passed through via VFIO. Bypasses all software networking.
  Use case: High-throughput, low-latency workloads. Requires SR-IOV-capable
  NICs and the SR-IOV device plugin.
  Performance: Excellent. Near-native, <3% overhead; up to 100 Gbps line
  rate.

passt
  How it works: User-mode networking stack that translates between the VM's
  network stack and the pod's network namespace. No root privileges
  required.
  Use case: Rootless deployments, environments where bridge mode is not
  supported.
  Performance: Good. Better than masquerade; no NAT.
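Switching the default interface between these modes is a small spec change. A sketch of the relevant VMI fragment, moving from masquerade to bridge mode (subject to the CNI-support caveat above; names are illustrative):

```yaml
# Interface/network fragment of a VM template spec -- bridge mode
# instead of masquerade. The VM takes over the pod IP, so sidecar
# containers can no longer use it.
spec:
  domain:
    devices:
      interfaces:
        - name: default
          bridge: {}        # was: masquerade: {}
  networks:
    - name: default
      pod: {}
```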

For secondary networks, KubeVirt uses Multus CNI, a "meta-plugin" that allows a pod to have multiple network interfaces. Each additional interface is defined by a NetworkAttachmentDefinition (NAD) resource. This is how VMs get connections to VLANs, storage networks, or dedicated management networks -- analogous to adding a second NIC to a VM in vSphere and connecting it to a different portgroup.
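A NetworkAttachmentDefinition is a thin wrapper around a CNI configuration. A sketch for a VLAN-tagged storage network; the names, bridge device, and VLAN ID are illustrative:

```yaml
# Secondary network definition consumed by Multus. A VM references it
# via multus.networkName in its networks list.
apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
  name: storage-vlan100          # illustrative
  namespace: database-tier
spec:
  config: |
    {
      "cniVersion": "0.3.1",
      "type": "bridge",
      "bridge": "br-storage",
      "vlan": 100,
      "ipam": {}
    }
```

With empty ipam, address assignment is left to the guest (static or DHCP on the VLAN), which matches how most migrated enterprise VMs already manage their addresses.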

  KubeVirt Networking: VM NIC to Physical NIC

  Inside the VM (Guest OS)
  +--------------------------------------------------+
  |  eth0: 10.0.2.2/24 (masquerade, default network) |
  |  eth1: 192.168.100.5/24 (SR-IOV, storage net)    |
  +--------------------------------------------------+
       |  virtio-net             |  VFIO passthrough
       v                         v
  +--------------------------------------------------+
  |  virt-launcher Pod                               |
  |                                                  |
  |  tap0 --- linux bridge --- eth0 (pod interface)  |
  |       (masquerade mode:                          |
  |        iptables DNAT/SNAT                        |
  |        on the bridge)                            |
  |                                                  |
  |  [no software path for SR-IOV -- direct HW]     |
  +--------------------------------------------------+
       |  eth0: pod IP              |  SR-IOV VF
       v  (via CNI: OVN-K, Calico)  v  (via VFIO)
  +--------------------------------------------------+
  |  Worker Node                                     |
  |                                                  |
  |  br-int / ovs-bridge / host bridge               |
  |  (OVN-Kubernetes or other CNI)                   |
  |                                                  |
  |  Physical NIC: eno1 (pod traffic)                |
  |  Physical NIC: eno2 (SR-IOV PF, VFs allocated)   |
  +--------------------------------------------------+
       |                            |
       v                            v
  Physical Network Switch / Fabric

The key vSphere equivalences:

  vSphere Concept                     KubeVirt Equivalent
  ---------------------------------------------------------------------
  vSwitch / vDS portgroup             CNI plugin + NetworkAttachmentDefinition
  VM NIC (VMXNET3)                    VM NIC (virtio-net)
  Trunk port / VLAN tagging           Multus + VLAN CNI plugin or
                                      OVN-Kubernetes secondary network
  DirectPath I/O (passthrough NIC)    SR-IOV mode with VFIO
  NSX-T micro-segmentation            Kubernetes NetworkPolicy +
                                      OVN-Kubernetes ACLs
  VM-to-VM on same host               Pod-to-pod (via CNI bridge or OVS)
  VM traffic shaping (vDS)            CNI bandwidth plugin or
                                      OVN-Kubernetes QoS
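The micro-segmentation equivalence deserves a concrete illustration. Because a VM is a pod, a standard NetworkPolicy selecting the VM template's labels applies to it. A sketch for the Oracle example used in this chapter -- the app-server label and listener port are illustrative:

```yaml
# Restrict ingress to database VM pods: only pods labeled app=app-server
# in the same namespace may connect on the listener port. The policy
# matches the virt-launcher pod, which carries the VM template's labels.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: db-vm-ingress
  namespace: database-tier
spec:
  podSelector:
    matchLabels:
      app: oracle-database
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: app-server    # illustrative consumer label
      ports:
        - protocol: TCP
          port: 1521             # illustrative listener port
```

Note the limitation: NetworkPolicy governs the default pod network only. SR-IOV interfaces bypass the CNI entirely, so traffic on them must be segmented in the physical fabric, as with DirectPath I/O in vSphere.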

Storage

KubeVirt VM disks are backed by Kubernetes Persistent Volume Claims (PVCs). This is one of the most fundamental differences from VMware, where disks are VMDK files on a VMFS or NFS datastore managed by vCenter.

In KubeVirt, each VM disk is either:

- A persistent disk backed by a PVC -- provisioned directly or through a CDI DataVolume -- which survives VM restarts and is the normal choice for migrated enterprise VMs.
- An ephemeral disk (for example, a containerDisk pulled from a container registry), whose writes are discarded when the VMI stops -- suitable for immutable appliance images, not stateful workloads.
- A generated config disk built from a ConfigMap or Secret (cloudInitNoCloud, sysprep) used to inject first-boot configuration.

The storage path:

  KubeVirt Storage: VM Disk to Physical Storage

  Guest OS
  +-------------------------------------+
  |  /dev/vda (virtio-blk)              |
  |  or                                 |
  |  /dev/sda (virtio-scsi)             |
  +-------------------------------------+
       |
  QEMU block layer
       |
       +--> Raw block device     +--> qcow2 file on filesystem
       |    (block-mode PVC)     |    (filesystem-mode PVC)
       v                         v
  +-------------------------------------+
  |  PVC mounted into virt-launcher pod |
  |  - Block mode: /dev/xvda            |
  |  - Filesystem mode: /var/run/       |
  |    kubevirt-private/vmi-disks/      |
  |    disk-name/disk.img               |
  +-------------------------------------+
       |
  Kubernetes PV / CSI driver
       |
       +--> Ceph RBD (block)
       +--> OpenShift Data Foundation (block/file)
       +--> NFS (file)
       +--> Local PV (block/file)
       +--> NetApp ONTAP (iSCSI block, NFS file)
       +--> Pure Storage (iSCSI, FC)
       |
       v
  Physical Storage Array / Cluster

Block mode vs. filesystem mode is a critical choice for production VMs:

PVC volumeMode
  Block mode: Block
  Filesystem mode: Filesystem (the default)

QEMU access
  Block mode: raw block device passed directly to QEMU.
  Filesystem mode: QEMU opens a qcow2/raw file on the PVC's mounted
  filesystem.

Performance
  Block mode: better -- no filesystem overhead; QEMU I/O goes directly to
  the block device.
  Filesystem mode: worse -- double filesystem: the PVC's filesystem
  (ext4/XFS) plus the guest's filesystem.

Snapshot support
  Block mode: via CSI volume snapshots.
  Filesystem mode: via CSI volume snapshots or qcow2 internal snapshots.

Live migration
  Both modes: requires RWX (ReadWriteMany) PVCs or storage-assisted
  migration.

Overhead
  Block mode: minimal.
  Filesystem mode: filesystem metadata and double journaling.

Recommendation for Tier-1
  Block mode: preferred.
  Filesystem mode: avoid for I/O-intensive workloads.
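In manifest terms, the whole distinction comes down to one field on the PVC. A sketch for a Tier-1 data disk; the name, size, and storage class are illustrative:

```yaml
# Block-mode PVC for a Tier-1 VM disk: QEMU receives the raw block
# device, avoiding the double-filesystem path described above.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: tier1-vm-data            # illustrative
  namespace: database-tier
spec:
  accessModes:
    - ReadWriteMany              # RWX is what live migration expects
  volumeMode: Block              # omit (defaults to Filesystem) for file-backed disks
  resources:
    requests:
      storage: 500Gi
  storageClassName: ceph-rbd-ssd # illustrative; any CSI class with RWX block support
```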

For the evaluation at 5,000+ VMs, the storage integration is a major area of scrutiny. The existing VMware estate uses VMFS datastores with vSAN or SAN-backed LUNs. Migrating to KubeVirt means migrating to PVC-backed storage. This requires a CSI driver for whatever storage backend the organization chooses (Ceph/ODF, NetApp, Pure, etc.) and the CDI (Containerized Data Importer) for converting existing VMDKs.

CDI (Containerized Data Importer)

CDI is a Kubernetes operator that manages the lifecycle of VM disk data. It is responsible for populating PVCs with VM disk content before a VM starts. CDI is the answer to the question: "How do I get my VMware VMDK files into KubeVirt?"

CDI supports the following import sources:

HTTP/HTTPS URL
  Downloads a disk image (raw, qcow2, vmdk, vdi, vhd, vhdx) from a web
  server. Automatically detects the source format and converts it to raw.

S3 bucket
  Downloads from S3-compatible storage (AWS S3, MinIO, Ceph RGW).

Container registry
  Pulls an OCI image that contains a disk image as a layer. Extracts the
  image and writes it to the PVC.

Existing PVC (clone)
  Clones an existing PVC to a new PVC. Uses CSI clone if available,
  otherwise smart-clone (snapshot + restore) or a host-assisted copy.

Upload
  Accepts a disk image upload via virtctl image-upload. Streams data
  directly to a CDI upload pod.

VDDK (VMware Virtual Disk Development Kit)
  Connects to a VMware vCenter/ESXi host and downloads a VM's disks using
  the VMware VDDK API. This is the primary mechanism used by the Migration
  Toolkit for Virtualization (MTV).

Image I/O (oVirt)
  Imports from Red Hat Virtualization (oVirt) -- less relevant for this
  evaluation.

Snapshot
  Creates a PVC from a VolumeSnapshot.

The CDI import flow for a VMware VMDK:

  CDI Import Flow: VMware VMDK to KubeVirt PVC

  Step 1: User creates DataVolume
  ================================================
  $ kubectl apply -f datavolume-import.yaml
       |
       v
  CDI Controller (watches DataVolume CRs)
       |
       +--> Creates a PVC with the requested size and storageClass
       |
       +--> Creates an Importer Pod
              |
              v
  +------------------------------------------+
  |  Importer Pod                            |
  |                                          |
  |  1. Downloads VMDK from HTTP URL         |
  |     (or connects to vCenter via VDDK)    |
  |                                          |
  |  2. Detects source format:               |
  |     - VMDK sparse? VMDK flat? qcow2?     |
  |     - Compressed (gz, xz)?               |
  |                                          |
  |  3. Converts to raw format:              |
  |     VMDK --> qemu-img convert --> raw     |
  |                                          |
  |  4. Writes raw data to PVC:              |
  |     - Block mode: dd to /dev/xvda        |
  |     - Filesystem mode: write to          |
  |       /data/disk.img                     |
  |                                          |
  |  5. Reports progress via DataVolume      |
  |     status (0% -> 100%)                  |
  +------------------------------------------+
       |
       v
  PVC is populated and bound
       |
       v
  DataVolume status: Succeeded
       |
       v
  VM can now reference this DataVolume
  in its volumes section and boot from it

VMDK conversion specifics: CDI uses qemu-img under the hood to convert VMDK files to raw format. It handles all VMDK variants: monolithic sparse, monolithic flat, split sparse, split flat, stream-optimized, and ESXi-style descriptor+extent files. The conversion is CPU-intensive (especially for compressed VMDKs) and disk-space-intensive (VMDKs are often thin-provisioned; the raw output is fully allocated). CDI supports scratch space (a temporary PVC used during conversion) to avoid running out of space on the target PVC.
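As a sketch, a DataVolume that imports a VMDK over HTTP might look like the following (the name, URL, and storage class are illustrative placeholders, not values from this environment):

```yaml
apiVersion: cdi.kubevirt.io/v1beta1
kind: DataVolume
metadata:
  name: app01-rootdisk
spec:
  source:
    http:
      # Hypothetical export location for the VMDK
      url: "https://images.example.internal/app01-disk1.vmdk"
  storage:
    accessModes:
      - ReadWriteMany          # RWX block enables live migration later
    volumeMode: Block
    resources:
      requests:
        storage: 200Gi         # must cover the fully allocated raw size
    storageClassName: ocs-storagecluster-ceph-rbd  # assumption: ODF/Ceph RBD class
```

CDI detects the VMDK format, converts it to raw with qemu-img, and writes the result to the PVC; for VDDK-based imports the source stanza points at vCenter instead of an HTTP URL.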

For the migration of 5,000+ VMs, CDI throughput is a bottleneck concern. Each import runs as a single pod. Parallel imports are supported (create multiple DataVolumes), but the bottleneck shifts to:

  • Source-side egress: vCenter/ESXi network bandwidth and VDDK read throughput
  • Target storage: sustained write throughput and IOPS of the CSI backend
  • Conversion CPU: qemu-img convert on the importer pods, especially for compressed or stream-optimized VMDKs

A realistic migration pipeline imports 10-50 VMs in parallel, throttled by network bandwidth and storage IOPS. At 200 GB average disk per VM, 5,000 VMs is 1 PB of data. At 10 Gbps sustained throughput, that is ~9 days of continuous transfer for the raw data alone, not counting conversion overhead, validation, and test boots. CDI bandwidth planning is a critical migration workstream.

Live Migration in KubeVirt

Live migration in KubeVirt moves a running VM from one worker node to another with near-zero downtime. The mechanism is built on libvirt/QEMU's pre-copy migration (covered in Chapter 3), but wrapped in Kubernetes pod semantics. This introduces differences from vMotion that the evaluation team must understand.

How it works:

  1. An operator (or an automated process like node drain) creates a VirtualMachineInstanceMigration (VMIM) resource.
  2. The virt-controller sees the VMIM, validates that the VMI is live-migratable (checks conditions: is the storage shared? Are all devices migratable? Is there a target node with sufficient resources?).
  3. The virt-controller creates a target virt-launcher pod on the destination node. This pod is identical to the source pod (same resource requests, same volume mounts) but does not start a VM -- it starts a QEMU process in "incoming migration" mode, waiting to receive the VM state.
  4. The virt-handler on the source node signals libvirtd to begin pre-copy migration. Libvirt connects the source QEMU to the target QEMU over a TCP connection (typically over the pod network or a dedicated migration network).
  5. QEMU performs iterative memory copy: first pass sends all memory pages, subsequent passes send only dirty pages. When the dirty rate is low enough, QEMU pauses the VM on the source, sends the final dirty pages and CPU state, and resumes the VM on the target. This pause window is the migration downtime.
  6. Once the VM is running on the target, the source virt-launcher pod is terminated. The VMI resource is updated to reflect the new node.
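The sequence above is triggered by a single declarative resource; a minimal example (the VM name is illustrative):

```yaml
apiVersion: kubevirt.io/v1
kind: VirtualMachineInstanceMigration
metadata:
  name: migrate-db01
spec:
  vmiName: db01   # the running VMI to move; the scheduler picks the target node
```

Deleting this resource while the migration is in flight cancels it, leaving the VM running on the source node.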

Differences from vMotion:

Aspect | vMotion | KubeVirt Live Migration
Unit of migration | VM (vmx process) | Pod (entire virt-launcher pod with VM inside)
Storage requirement | Shared datastore (VMFS, vSAN, NFS) or Storage vMotion for local disks | Shared PVCs (RWX access mode) or storage-class-specific migration support
Network | Dedicated vMotion VMkernel port, encrypted, 10/25 Gbps typical | Pod network or dedicated Multus network, TLS-encrypted, bandwidth depends on CNI
Trigger | Manual, DRS-automated, host maintenance mode | Manual (VMIM), node drain (kubectl drain), descheduler policy
Convergence | Mature convergence heuristics, memory pre-copy with stun threshold | QEMU pre-copy with configurable bandwidth limit, convergence timeout, auto-converge (throttles vCPUs)
Post-copy | Supported in recent vSphere versions | Supported in KubeVirt (experimental), falls back to pre-copy on failure
Downtime target | Typically <1 second for most workloads | Typically 10-500 ms, depends on dirty page rate and migration bandwidth
Multi-VM migration | DRS migrates multiple VMs in parallel with dependency awareness | Parallel VMIM resources, but no built-in dependency awareness (must be orchestrated externally)
Cancel | Yes, VM stays on source | Yes, delete the VMIM resource
Network identity | VM keeps its MAC and IP | VM keeps its MAC and IP (the pod IP changes, but the VM's internal IP is stable if using bridge mode + Multus)

A critical caveat: the pod IP changes after migration. In masquerade mode, the VM gets a new pod IP on the target node. The VM's internal IP (the NATed address) remains the same from the guest's perspective, but any Kubernetes service or external load balancer pointing at the old pod IP must be updated. Kubernetes Services (ClusterIP, NodePort, LoadBalancer) handle this automatically if the VM is behind a Service (the selector matches the new pod). Direct pod-IP access breaks.
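One way to preserve a stable entry point across migrations is to front the VM with a Service whose selector matches a label declared in the VM's pod template metadata (the label, name, and port here are illustrative; virtctl expose generates a similar Service):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: my-vm-ssh
spec:
  selector:
    app: my-vm        # assumption: label set in the VM's spec.template.metadata.labels
  ports:
    - port: 22        # Service port
      targetPort: 22  # port inside the guest (reachable via masquerade NAT)
```

After a migration, the Service endpoint tracks the new virt-launcher pod automatically, so clients never see the pod IP change.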

For bridge mode with Multus secondary networks, the VM's MAC address is preserved, and if the secondary network spans both nodes (same VLAN), the VM retains its IP address transparently -- this is the closest equivalent to vMotion's behavior.

Live migration prerequisites:

  • Shared storage: all of the VM's PVCs must be mounted RWX (ReadWriteMany), or the storage class must support storage-assisted migration.
  • Migratable devices: the VMI must not use devices that cannot be migrated (virt-controller validates this before the migration starts).
  • Target capacity: a schedulable node with enough CPU and memory for an identical virt-launcher pod.
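Cluster-wide migration behavior is tunable through KubeVirt's migration configuration. A sketch follows; the values are illustrative starting points, and in OVE these settings are typically applied via the HyperConverged resource rather than the KubeVirt CR directly:

```yaml
apiVersion: kubevirt.io/v1
kind: KubeVirt
metadata:
  name: kubevirt
  namespace: kubevirt
spec:
  configuration:
    migrations:
      parallelMigrations: 5                  # cluster-wide cap on concurrent migrations
      parallelOutboundMigrationsPerNode: 2   # limits source-node load during drains
      bandwidthPerMigration: 640Mi           # throttle per migration stream
      completionTimeoutPerGiB: 800           # abort if convergence stalls
      progressTimeout: 150                   # seconds without progress before failure
```

These knobs matter most during mass node drains, where unbounded parallel migrations can saturate the migration network.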

Console and VNC Access

KubeVirt provides console access through virtctl, a CLI tool that extends kubectl:

  $ virtctl console my-vm    # serial console, exit with Ctrl+]
  $ virtctl vnc my-vm        # graphical console via a local VNC viewer

Additionally, the OpenShift Web Console (in OVE) provides a browser-based VNC client and serial console, comparable to vSphere Client's remote console but accessible through a web browser without any plugins.

The architectural path for console access:

  Console Access Path

  User's workstation
       |
       +--> virtctl vnc my-vm
       |
       v
  kubectl proxy / API server (HTTPS/WSS)
       |
       +--> virt-api (subresource handler)
       |      Routes to correct virt-launcher pod
       |
       v
  virt-handler (on target node)
       |
       +--> virt-launcher pod
       |      |
       |      v
       |    QEMU VNC server (port 5900 inside pod)
       |    or QEMU chardev (serial console)
       |
       v
  Websocket stream back to virtctl
       |
       v
  Local VNC viewer / terminal

Comparison to vSphere: What Maps, What Doesn't

This mapping table is designed for the evaluation team to build a mental model of KubeVirt using their existing VMware knowledge:

vSphere / ESXi Concept | KubeVirt / OVE Equivalent | Notes
vCenter Server | kube-apiserver + virt-controller + virt-api | No single "vCenter" -- the functions are distributed across Kubernetes components
ESXi Host | Kubernetes worker node (RHCOS) + virt-handler | The node runs RHCOS (Red Hat CoreOS), not ESXi
VMX process | virt-launcher pod (containing QEMU process) | Each VM = one pod = one QEMU process
hostd + vpxa | virt-handler DaemonSet | Node-local agent reporting to the central controller
VM (in vCenter inventory) | VirtualMachine CR | Persistent object, survives power-off
Running VM instance | VirtualMachineInstance CR + virt-launcher Pod | Ephemeral, exists only while VM is running
VM Template | VirtualMachineClusterInstancetype + VirtualMachineClusterPreference + golden DataVolume | No single "template" object; composed from instance type + preference + source disk
Resource Pool | Kubernetes Namespace + ResourceQuota + LimitRange | Namespaces replace resource pools for multi-tenancy
DRS (Distributed Resource Scheduler) | kube-scheduler + descheduler (optional) | Kubernetes scheduler handles placement; descheduler handles rebalancing (less mature than DRS)
vMotion | VirtualMachineInstanceMigration (VMIM) | Declarative migration resource
vDS (Distributed Switch) | CNI plugin (OVN-Kubernetes in OVE) | OVN-Kubernetes is the default CNI; replaces vDS functionality
vDS Portgroup | NetworkAttachmentDefinition (NAD) via Multus | Each additional network is a Multus attachment
VMFS/vSAN Datastore | StorageClass + PVCs (backed by Ceph, ODF, etc.) | No monolithic "datastore"; each disk is an independent PVC
VMDK | PVC (block or filesystem mode) | The PVC is the disk
Content Library | Container registry + DataVolume sources | VM images stored as container images or HTTP-accessible files
vSphere HA | Kubernetes pod rescheduling + VM run strategy | If a node fails, pods are rescheduled to surviving nodes; runStrategy: Always ensures VMs restart
Alarms & Events | Kubernetes Events + Prometheus alerts | No built-in alarm system; alerting via Prometheus + Alertmanager
RBAC (vSphere permissions) | Kubernetes RBAC (Roles, RoleBindings, ClusterRoles) | More granular than vSphere; per-namespace, per-resource-type
vSphere Tags | Kubernetes Labels + Annotations | Labels are the primary metadata mechanism
Guest OS Customization (Sysprep) | cloud-init / Sysprep (via ConfigMap/Secret volumes) | cloud-init for Linux, Sysprep for Windows, injected as volumes
Snapshot | VolumeSnapshot (CSI) | Via the CSI driver, not KubeVirt itself; maturity varies by storage backend
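A sketch of how the template-replacement pieces compose into a single VM definition (the instance type, preference, and disk names are illustrative):

```yaml
apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
  name: web01
spec:
  runStrategy: Always          # restart the VM if its node fails
  instancetype:
    name: u1.medium            # assumption: cluster-wide instance type (CPU/memory sizing)
  preference:
    name: rhel.9               # assumption: cluster-wide preference (devices, firmware hints)
  template:
    spec:
      domain:
        devices:
          disks:
            - name: rootdisk
              disk: {}         # bus chosen by the preference/defaults
      volumes:
        - name: rootdisk
          dataVolume:
            name: web01-rootdisk   # hypothetical clone of a golden DataVolume
```

Note that CPU and memory are deliberately absent from the template: the instance type supplies them, which is what makes the instancetype + preference + golden disk trio function like a vSphere template.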

What is better in KubeVirt:

  • Declarative, API-driven management: VMs are YAML, versionable in Git, and native to GitOps pipelines.
  • A unified platform: VMs and containers share the same scheduler, network, storage, RBAC, and monitoring.
  • RBAC granularity: per-namespace, per-resource-type permissions, finer-grained than vSphere's model.

What is worse in KubeVirt:

  • No DRS equivalent: the kube-scheduler places pods only at creation time, and the descheduler's rebalancing is less mature than DRS.
  • No built-in alarm system: alerting must be assembled from Kubernetes Events, Prometheus, and Alertmanager.
  • Live migration is less mature than vMotion: no dependency-aware parallel migration, and post-copy is experimental.


2. OCI / Container Runtimes (CRI-O, containerd)

CRI (Container Runtime Interface)

The Container Runtime Interface (CRI) is a plugin API that Kubernetes defines for container runtimes. It is the boundary between the kubelet (Kubernetes' node agent) and whatever software actually creates and manages containers (and, by extension, virt-launcher pods for VMs). CRI was introduced in Kubernetes 1.5 (2016) to decouple the kubelet from Docker.

Before CRI, the kubelet had Docker-specific code built in. CRI defines a gRPC API with two services:

  • RuntimeService: pod sandbox and container lifecycle (create, start, stop, remove), plus the exec, attach, and port-forward streams.
  • ImageService: image operations (pull, list, inspect, remove).

Any software that implements this gRPC API can serve as a Kubernetes container runtime. The two major implementations are CRI-O and containerd.

CRI-O

CRI-O is a lightweight, OCI-compliant container runtime built specifically for Kubernetes. It was created by Red Hat, Intel, SUSE, and others. "CRI-O" literally means "CRI + OCI" -- it implements the CRI interface and uses OCI-compliant tools for the actual container operations.

Key characteristics:

  • Deliberately minimal scope: CRI-O implements the CRI and nothing more -- no image building, no standalone container workflow.
  • Each container is supervised by a dedicated conmon process, which handles logging and exit-code collection.
  • The OCI runtime is pluggable; on RHEL/OpenShift the default is crun.
  • Releases are aligned with Kubernetes/OpenShift versions.

containerd

containerd is a container runtime originally extracted from Docker. Docker donated it to the CNCF, and it is now the default runtime for vanilla Kubernetes, AKS, EKS, GKE, and most cloud Kubernetes services. containerd has implemented CRI natively, as a built-in plugin, since version 1.1.

Key differences from CRI-O:

Aspect | CRI-O | containerd
Origin | Built for Kubernetes from scratch | Extracted from Docker
Scope | CRI only | CRI + general-purpose container management
Image building | Not supported (out of scope) | Not supported natively (but plugins exist)
OCI runtime default | crun (on RHEL/OpenShift) | runc
Container monitor | conmon (separate process per container) | Internal shim (containerd-shim-runc-v2)
Used by | OpenShift, OVE, SUSE | Vanilla Kubernetes, AKS, EKS, GKE
Release cycle | Aligned with OpenShift | Independent

How CRI-O Handles a KubeVirt Pod

When virt-controller creates a virt-launcher pod for a VM, the following chain of events occurs on the target worker node:

  CRI-O Execution Chain: kubelet to QEMU

  kubelet
    |
    +--> gRPC: RunPodSandbox()
    |      |
    |      v
    |    CRI-O
    |      |
    |      +--> Creates pod-level cgroup (/kubepods/pod<uid>/)
    |      |
    |      +--> Creates network namespace (via CNI)
    |      |      +--> Calls primary CNI plugin (OVN-Kubernetes)
    |      |      |      Creates veth pair, connects to OVS bridge
    |      |      +--> Calls Multus (if additional networks defined)
    |      |             Calls secondary CNI plugins
    |      |             Creates additional interfaces in namespace
    |      |
    |      +--> Creates IPC namespace, UTS namespace
    |      |
    |      +--> Returns PodSandboxId
    |
    +--> gRPC: CreateContainer(PodSandboxId, "compute")
    |      |
    |      v
    |    CRI-O
    |      |
    |      +--> Pulls virt-launcher image (if not in local cache)
    |      |      Image: registry.redhat.io/container-native-
    |      |      virtualization/virt-launcher-rhel9:v4.x
    |      |
    |      +--> Creates OCI runtime bundle:
    |      |      - config.json (OCI runtime spec)
    |      |      - rootfs/ (from container image layers)
    |      |
    |      +--> Spawns conmon process:
    |      |      conmon --cid <container-id> \
    |      |             --runtime /usr/bin/crun \
    |      |             --log-path /var/log/... \
    |      |             ...
    |      |
    |      +--> conmon calls crun (OCI runtime):
    |             crun create <container-id>
    |               |
    |               +--> Creates container cgroup
    |               |    (child of pod cgroup)
    |               +--> Sets up mount namespace
    |               |    (rootfs, volumes, devices)
    |               +--> Mounts /dev/kvm, /dev/vhost-net
    |               |    into container
    |               +--> Mounts PVC volumes at expected
    |               |    paths in container
    |               +--> Configures seccomp profile
    |               +--> Configures SELinux label
    |               +--> Joins pod's network namespace
    |               +--> Starts container init process:
    |                    --> virt-launcher binary (PID 1)
    |
    +--> gRPC: StartContainer(ContainerId)
           |
           v
         CRI-O --> conmon --> crun start <container-id>
           |
           v
         virt-launcher process begins:
           |
           +--> Starts libvirtd (as child process)
           +--> Defines VM domain in libvirt
           +--> Starts QEMU/KVM via libvirt
           +--> VM boots inside the container's cgroup

The critical insight: the QEMU process inherits the container's cgroup. This means Kubernetes' CPU and memory limits apply directly to the QEMU process and its vCPU threads. If a VM is configured with resources.requests.memory: 64Gi and resources.limits.memory: 66Gi (64 Gi for the guest + 2 Gi for QEMU overhead), the Linux OOM killer will kill the QEMU process if it exceeds 66 Gi -- exactly as it would kill any container exceeding its memory limit. This is a feature (prevents one VM from consuming unbounded resources) and a risk (an undersized memory limit kills the VM).
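As an illustration of that risk, a fragment of a VirtualMachine spec might pair guest memory with an explicit pod memory limit. KubeVirt normally computes the QEMU overhead automatically, so the explicit limit below is a sketch with illustrative values, not a recommended pattern:

```yaml
# Fragment of a VirtualMachine spec: guest memory vs. pod memory limit
spec:
  template:
    spec:
      domain:
        memory:
          guest: 64Gi          # what the guest OS sees as physical RAM
        resources:
          limits:
            memory: 66Gi       # guest RAM + headroom for QEMU/virt-launcher overhead
```

If QEMU's real overhead ever exceeds the 2 Gi of headroom, the OOM killer terminates the whole VM, which is why undersizing this delta is one of the classic KubeVirt operational failure modes.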

OCI Image Spec and Runtime Spec

The Open Container Initiative (OCI) defines two specifications that are foundational to container runtimes:

OCI Image Specification: Defines how container images are structured -- the manifest, config, and layers. KubeVirt uses this specification for containerDisk volumes: a VM disk image (qcow2 or raw) is packaged as an OCI image layer, pushed to a container registry, and pulled by the container runtime at pod start. This is how ephemeral VM boot disks (like ISOs or live CDs) are distributed.
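A fragment of a VMI spec using a containerDisk boot volume might look like this (the image shown is a community-published example, not part of this environment):

```yaml
# Fragment of a VirtualMachineInstance spec: ephemeral containerDisk boot volume
spec:
  domain:
    devices:
      disks:
        - name: bootdisk
          disk: {}                 # default bus
  volumes:
    - name: bootdisk
      containerDisk:
        image: quay.io/containerdisks/fedora:latest  # disk image packaged as an OCI layer
```

Because the disk lives in the container image, writes are discarded when the pod terminates -- containerDisks suit golden images and live CDs, not persistent workloads.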

OCI Runtime Specification: Defines how a container is configured and executed -- the config.json file that CRI-O/containerd passes to runc/crun. This specification defines namespaces, cgroups, mounts, devices, and security settings. For a KubeVirt virt-launcher container, the OCI runtime spec includes device access rules for /dev/kvm and /dev/vhost-net, volume mounts for PVCs, and the appropriate SELinux context.

Why CRI-O vs containerd Matters for OVE

For the OVE evaluation specifically, CRI-O is not a choice -- it is a requirement. OpenShift mandates CRI-O. The implications:

  1. Troubleshooting: When a virt-launcher pod fails to start, the logs are in CRI-O's log format, and the container is managed by conmon and crun. The operations team must be familiar with crictl (the CRI CLI) rather than docker or nerdctl (containerd's CLI) for debugging.

  2. Image pull behavior: CRI-O's image pull behavior differs from containerd in edge cases (authentication, registry mirrors, image signing). OVE's support matrix is tested exclusively with CRI-O.

  3. Security: CRI-O on OpenShift runs with SELinux enforcing and a locked-down seccomp profile by default. virt-launcher pods require specific SELinux labels (container_t with KVM device access) that are configured by the KubeVirt operator. Modifying these settings outside of the operator can break VMs.

  4. crun vs runc: OpenShift's CRI-O uses crun (a C-based OCI runtime) instead of runc (Go-based). crun has lower overhead and faster container start times, which slightly benefits VM startup latency (the pod sandbox creation phase). The QEMU process itself is unaffected -- crun only manages the container lifecycle, not the VM.

For Azure Local, which runs Hyper-V VMs directly (not inside Kubernetes pods), none of the CRI-O/containerd discussion is relevant. Azure Local's VMs are managed by the Hyper-V hypervisor and the Azure Arc management plane, not by a container runtime.


3. Kata Containers / MicroVMs

The Problem: Container Isolation is Weaker than VM Isolation

Standard Linux containers (run by runc/crun) share the host kernel. Process isolation is enforced by kernel namespaces, cgroups, seccomp, and SELinux/AppArmor -- but all containers on a host execute syscalls against the same kernel. A kernel vulnerability (a privilege escalation bug in a syscall handler, a namespace escape, a cgroup bypass) can allow a container to break out and access the host or other containers.

This is fundamentally different from VM isolation, where each VM has its own kernel running inside a hardware-enforced boundary (VT-x/AMD-V, EPT/NPT). A guest kernel vulnerability does not compromise the host. The attack surface is the hypervisor's VM exit handler -- a much smaller and more auditable surface than the Linux syscall table (400+ syscalls).

For a Tier-1 financial enterprise running regulated workloads, this distinction matters. If two different business units (or two different customers in a shared infrastructure) run containers on the same host, the shared-kernel risk may be unacceptable. This is the problem Kata Containers solves.

Kata Containers Architecture

Kata Containers is an open-source project (originally a merger of Intel Clear Containers and Hyper.sh's runV) that runs each container (or pod) inside a lightweight virtual machine. Instead of using runc to create a container with namespace isolation, Kata uses a VMM (Virtual Machine Monitor) to create a lightweight VM for each pod.

  Standard Container vs Kata Container

  Standard Container (runc/crun)          Kata Container
  ================================        ================================

  +---------------------------+           +---------------------------+
  | Container Process         |           | Container Process         |
  | (shares host kernel)      |           | (runs in guest kernel)    |
  +---------------------------+           +---------------------------+
  | namespaces + cgroups      |           | Guest Linux Kernel (5.x)  |
  | (kernel-level isolation)  |           | (minimal, stripped-down)  |
  +---------------------------+           +---------------------------+
  | Host Linux Kernel         |           | VMM (QEMU / Cloud HV /   |
  |                           |           |      Firecracker)         |
  |                           |           +---------------------------+
  |                           |           | Host Linux Kernel + KVM   |
  +---------------------------+           +---------------------------+
  | Hardware                  |           | Hardware (VT-x/EPT)       |
  +---------------------------+           +---------------------------+

  Isolation boundary:                     Isolation boundary:
  Kernel namespaces (software)            Hardware virtualization (VT-x)

Kata Containers architecture in detail:

  Kata Containers Architecture (per Pod)

  +================================================================+
  |  Kata VM (lightweight, boots in <1 second)                     |
  |                                                                |
  |  +----------------------------------------------------------+  |
  |  |  Container workload(s)                                   |  |
  |  |  - Application process(es)                               |  |
  |  |  - OCI bundle rootfs mounted from host (via virtio-fs    |  |
  |  |    or virtio-9p)                                         |  |
  |  +----------------------------------------------------------+  |
  |  |  kata-agent                                              |  |
  |  |  - gRPC server inside the VM                             |  |
  |  |  - Receives container lifecycle commands from kata-       |  |
  |  |    runtime on the host                                   |  |
  |  |  - Creates namespaces/cgroups inside the VM              |  |
  |  |  - Manages container processes                           |  |
  |  +----------------------------------------------------------+  |
  |  |  Guest Linux Kernel                                      |  |
  |  |  - Minimal kernel (~4-8 MB)                              |  |
  |  |  - Only drivers needed: virtio-blk, virtio-net,          |  |
  |  |    virtio-fs, virtio-vsock                               |  |
  |  |  - No unnecessary modules, no GUI, no sound              |  |
  |  +----------------------------------------------------------+  |
  +================================================================+
       | virtio-vsock (host <-> guest communication)
       | virtio-fs (filesystem sharing)
       | virtio-net (networking)
       v
  +================================================================+
  |  Host                                                          |
  |                                                                |
  |  +----------------------------------------------------------+  |
  |  |  kata-runtime (OCI runtime, replaces runc)               |  |
  |  |  - Called by CRI-O/containerd instead of runc            |  |
  |  |  - Starts VMM, connects to kata-agent                    |  |
  |  |  - Translates OCI lifecycle calls to gRPC calls          |  |
  |  +----------------------------------------------------------+  |
  |  |  VMM (Virtual Machine Monitor)                           |  |
  |  |  Options:                                                |  |
  |  |  - QEMU (full-featured, highest compatibility)           |  |
  |  |  - Cloud Hypervisor (Rust, modern, good perf)            |  |
  |  |  - Firecracker (AWS, minimal, fastest boot)              |  |
  |  |  - Dragonball (Alibaba, sandbox-optimized)               |  |
  |  +----------------------------------------------------------+  |
  |  |  Linux Kernel + KVM                                      |  |
  |  +----------------------------------------------------------+  |
  +================================================================+

The key idea: Kata Containers is a drop-in OCI runtime. From Kubernetes' perspective, a Kata pod looks exactly like a regular pod. The kubelet sends the same CRI calls to CRI-O/containerd. CRI-O/containerd calls kata-runtime instead of runc. The pod gets the same network namespace, the same cgroup accounting, the same volume mounts. The only difference is that the container process runs inside a VM rather than directly on the host kernel.
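A minimal sketch of that drop-in behavior, assuming the cluster's CRI-O/containerd is configured with a runtime handler named kata (the handler name, pod name, and image are illustrative):

```yaml
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: kata
handler: kata              # must match the handler configured in CRI-O/containerd
---
apiVersion: v1
kind: Pod
metadata:
  name: untrusted-build
spec:
  runtimeClassName: kata   # the only Kata-specific line; everything else is a normal pod
  containers:
    - name: build
      image: registry.example.internal/ci/builder:latest  # hypothetical build image
```

Removing the runtimeClassName line makes this an ordinary runc/crun pod, which is exactly the "drop-in" property: isolation level becomes a per-pod scheduling decision rather than a platform fork.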

Firecracker

Firecracker is Amazon's open-source Virtual Machine Monitor (VMM), built specifically for serverless and container workloads. It powers AWS Lambda and AWS Fargate. Key characteristics:

  • Written in Rust, with a deliberately tiny device model: virtio-net, virtio-block, vsock, and a serial console -- no PCI passthrough, no GPU, no BIOS compatibility layers.
  • MicroVMs boot in on the order of hundreds of milliseconds with only a few MiB of VMM overhead per VM.
  • A companion jailer process wraps each VMM in seccomp, cgroup, and chroot confinement for defense in depth.

Firecracker is a Kata Containers VMM option, meaning you can configure Kata to use Firecracker instead of QEMU. This gives the fastest possible boot times but sacrifices features.

Cloud Hypervisor

Cloud Hypervisor is a Rust-based VMM that sits between QEMU (full-featured) and Firecracker (minimal). It was started by Intel and is now a Linux Foundation project. Key characteristics:

  • Written in Rust and focused on modern cloud guests: virtio devices only, with no legacy device emulation.
  • Supports features Firecracker omits, including CPU/memory hotplug, PCI device passthrough (VFIO), and live migration.
  • Runs on x86_64 and aarch64 on top of KVM.

KubeVirt vs Kata: Different Use Cases

This is a common source of confusion. Both KubeVirt and Kata Containers involve running VMs on Kubernetes, but they solve completely different problems:

Aspect | KubeVirt | Kata Containers
Goal | Run traditional VMs (with their own OS, kernel, applications) on Kubernetes | Run containers with VM-level isolation
Workload | A full operating system (Windows Server, RHEL, Ubuntu) with applications installed inside | A container image (just the application + dependencies, no OS kernel)
Guest kernel | The VM's own kernel (whatever the guest OS uses) | A shared, minimal guest kernel provided by Kata
User experience | "I manage a VM" -- SSH in, install packages, configure services | "I manage a container" -- docker build, kubectl apply, same container workflow
Use case | Legacy apps that cannot be containerized, Windows workloads, stateful databases, appliances | Multi-tenant container platforms needing strong isolation, CI/CD build pods, untrusted code execution
Boot time | 10-60 seconds (full OS boot, BIOS/UEFI POST, kernel init, systemd) | <1-5 seconds (microVM boots minimal kernel, mounts container rootfs, starts application)
Overhead per workload | High (full OS: 512 MB - several GB for guest OS alone) | Low (minimal kernel: 32-128 MB for guest overhead)
Managed by | KubeVirt operator (virt-controller, virt-handler, virt-launcher) | CRI-O/containerd + Kata runtime (transparent to Kubernetes)

In the context of this evaluation:

  • KubeVirt is the migration target for the 5,000+ VMware VMs -- it answers how traditional VMs run on Kubernetes.
  • Kata Containers addresses a different question: how future container workloads on the same platform can get VM-grade isolation.

They are complementary, not competing, technologies. An OVE cluster could simultaneously run:

  • Standard containers (runc/crun, namespace isolation) for trusted in-house workloads
  • Kata containers (VM-isolated containers) for multi-tenant or untrusted workloads
  • KubeVirt VMs (full guest OS) for the migrated VMware estate

All three share the same Kubernetes control plane, networking, storage, and monitoring.

When Kata Matters for the Evaluation

Kata Containers should be considered in the evaluation in the following scenarios:

  1. Multi-tenant container workloads: If the platform will host containers from multiple business units or external parties with different trust levels, Kata provides hardware-enforced isolation between tenants. Without Kata, container-to-container isolation relies on kernel namespaces, which have a larger attack surface.

  2. Regulatory requirements for workload isolation: FINMA or internal security policies may require that certain workloads run with hardware-level isolation. Kata satisfies this requirement without requiring dedicated physical hosts.

  3. CI/CD build pipelines: Running untrusted build jobs (e.g., building third-party code) in standard containers is a security risk. Kata containers confine build jobs in VMs, preventing a malicious build from affecting the host or other pods.

  4. Comparing isolation models across candidates:

    • OVE: Standard containers (namespace isolation) + Kata Containers (VM isolation) + KubeVirt VMs (full VM isolation). Three tiers available.
    • Azure Local: Hyper-V VMs (full VM isolation) for VMs. Containers run in AKS-HCI (standard namespace isolation). No Kata equivalent.
    • Swisscom ESC: VMware VMs (full VM isolation). Container services depend on Swisscom's offering.

How the Candidates Handle This

Comparison Table

Aspect | VMware (Current) | OVE (KubeVirt) | Azure Local (Hyper-V) | Swisscom ESC
VM Management Model | vCenter (proprietary, centralized) | Kubernetes API + KubeVirt CRDs (declarative, open) | Azure Arc + Hyper-V (hybrid cloud management) | VMware vCloud Director (managed service)
VM Definition Format | .vmx files + vCenter DB | YAML manifests (VirtualMachine CR) | PowerShell / Azure CLI / ARM templates | VMware OVF + provider portal
Infrastructure as Code | Terraform vSphere provider; PowerCLI; govc | Native (kubectl, Helm, Kustomize, GitOps) | Terraform azurerm provider; Azure Bicep; PowerShell | Limited; API available but not IaC-native
VM-to-Host Mapping | VMX process on ESXi host | virt-launcher Pod on worker node | Child partition on Hyper-V host | VMX process on ESXi (provider-managed)
Hypervisor Engine | ESXi VMkernel | KVM + QEMU + libvirt (wrapped by KubeVirt) | Hyper-V (microkernel + root partition) | ESXi VMkernel (provider-managed)
Container Runtime | N/A (VMs only; vSphere with Tanzu adds containerd) | CRI-O (mandatory on OpenShift) | containerd (for AKS-HCI containers) | N/A for VM workloads
Unified VM+Container Platform | Partial (vSphere with Tanzu, but VMs and containers are separate) | Yes (VMs and containers are both pods, same scheduler/network/storage) | Separate (Hyper-V VMs + AKS-HCI for containers) | No (VM-only service)
VM Disk Format | VMDK on VMFS/vSAN/NFS | PVC (raw or qcow2 on Ceph RBD, ODF, NFS, etc.) | VHDX on Cluster Shared Volumes (ReFS/NTFS) | VMDK (provider-managed)
Disk Import/Migration Tool | N/A (source platform) | CDI (Containerized Data Importer) + MTV | Azure Migrate | Provider-managed migration
Live Migration | vMotion (mature, automated via DRS) | KubeVirt VMIM (QEMU pre-copy, pod-based) | Hyper-V Live Migration (mature) | vMotion (provider-managed)
Console Access | vSphere Client (VMRC, web console) | virtctl console/vnc, OpenShift web console | Hyper-V Manager, WAC, Azure portal | Provider portal (web console)
Multi-tenancy | Resource pools, folders, permissions | Kubernetes namespaces, RBAC, quotas | Azure RBAC, subscriptions | Provider-managed tenant isolation
Automated Placement/Balancing | DRS (automated, continuous) | kube-scheduler (initial only) + descheduler (optional, less mature) | SCVMM Dynamic Optimization (if using SCVMM) | Provider-managed (DRS under the hood)
VM Security Isolation | VMkernel hypervisor boundary | KVM hypervisor boundary (each VM in own pod cgroup) | Hyper-V hypervisor boundary | VMkernel hypervisor boundary
Container Isolation Enhancement | N/A | Kata Containers / OpenShift Sandboxed Containers | N/A for containers at VM level | N/A
Ecosystem Maturity | 20+ years, deep third-party integration | 5-7 years production use, growing rapidly | 10+ years Hyper-V, Azure Arc is newer | Depends on VMware maturity (provider-managed)

Detailed Candidate Analysis

OVE (KubeVirt/KVM)

OVE's greatest strength and greatest risk both stem from the same source: it is Kubernetes-native. The strength is that VM management becomes a natural extension of the Kubernetes platform the organization may already use (or plan to use) for containers. VMs are YAML, lifecycle is declarative, networking and storage are unified, RBAC is standard Kubernetes. For a team that has invested in Kubernetes skills and tooling, OVE is a natural fit.

The risk is that the team may not have invested in Kubernetes skills. Operating KubeVirt at 5,000+ VM scale requires deep Kubernetes expertise: understanding pod scheduling, CSI driver behavior, CNI networking, RBAC, resource quotas, node affinity, taints and tolerations, pod disruption budgets, and the interactions between all of these. A VMware admin who knows vCenter deeply but has never used kubectl faces a steep learning curve.

CDI and migration readiness: OVE's CDI is the primary tool for migrating VMware VMDKs. For the 5,000+ VM estate, the Migration Toolkit for Virtualization (MTV) automates the end-to-end flow: discover VMs in vCenter, map networks and storage, convert VMDKs, create VirtualMachine CRs, and validate boot. MTV uses CDI and VDDK under the hood. The evaluation should include a PoC migration of representative VM types (Windows Server with SQL, RHEL with Oracle, Ubuntu with custom apps) to validate conversion fidelity, boot success rate, and performance parity.

Live migration maturity: KubeVirt's live migration is functional but less mature than vMotion. Specifically:

  - It requires shared storage with ReadWriteMany (RWX) access mode on the VM's disks; VMs on RWO-only storage cannot live-migrate.
  - VMs with SR-IOV interfaces or other host-passthrough devices cannot live-migrate at all.
  - Parallelism is capped cluster-wide and per node (by default, 5 concurrent migrations per cluster and 2 outbound per node), so a full host evacuation is slower than a DRS-orchestrated vMotion wave.
  - There is no equivalent of cross-cluster or long-distance vMotion.

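One dimension of that maturity gap is migration throttling: KubeVirt serializes migrations behind cluster-wide and per-node caps, tunable in the upstream KubeVirt custom resource. A hedged sketch of the relevant configuration section (values are illustrative; on OVE the same settings are surfaced through the distribution's own CR):

```yaml
# Illustrative migration tuning in the upstream KubeVirt CR.
apiVersion: kubevirt.io/v1
kind: KubeVirt
metadata:
  name: kubevirt
  namespace: kubevirt                      # namespace varies by distribution
spec:
  configuration:
    migrations:
      parallelMigrationsPerCluster: 5      # cluster-wide cap on concurrent migrations
      parallelOutboundMigrationsPerNode: 2 # per-node cap during drain/evacuation
      bandwidthPerMigration: 600Mi         # throttle per migration stream
      completionTimeoutPerGiB: 800         # seconds per GiB before a migration is aborted
```

For a mass-maintenance window across 100+ nodes, these defaults are the first thing to revisit: left untouched, they dictate how long a rolling host evacuation takes regardless of available network bandwidth.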
Kata Containers as a differentiator: OVE can offer three tiers of isolation (container, Kata container, KubeVirt VM) on a single platform. No other candidate offers this. For organizations that need both legacy VMs and secure multi-tenant containers, this is a material advantage.
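The middle tier is selected per workload through a standard Kubernetes RuntimeClass. A hedged sketch, assuming OpenShift Sandboxed Containers has installed its default RuntimeClass named kata (the pod name and image are hypothetical):

```yaml
# Illustrative pod requesting Kata (VM-backed) isolation via RuntimeClass.
apiVersion: v1
kind: Pod
metadata:
  name: sandboxed-app                       # hypothetical workload
spec:
  runtimeClassName: kata                    # run this pod's containers inside a lightweight VM
  containers:
    - name: app
      image: registry.example.com/app:1.0   # hypothetical image
```

Omitting runtimeClassName yields a standard container; a KubeVirt VirtualMachine CR yields the third tier. The isolation choice is per-workload, not per-cluster, which is what makes the three-tier model operationally practical.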

Azure Local (Hyper-V)

Azure Local does not use KubeVirt or any Kubernetes-native VM management. VMs are managed through the traditional Hyper-V stack with Azure Arc as the management plane. This means there are no per-VM pods, no custom resources, and no kubectl anywhere in the VM lifecycle: VMs are ordinary Hyper-V VMs, provisioned through the Azure portal, Azure CLI, or ARM templates via the Arc resource bridge, with familiar Windows tooling (Failover Clustering, Windows Admin Center) underneath.

For a team that is Windows-oriented and already invested in Azure, this is arguably simpler: there is no Kubernetes learning curve for VM management. The trade-off is that containers (via AKS on Azure Local) and VMs are managed through different tools and APIs -- they are not unified on a single platform in the same way KubeVirt unifies them.

Azure Local also does not have an equivalent of Kata Containers. Container isolation in AKS on Azure Local relies on standard Linux namespace isolation (within the AKS Linux nodes) or Hyper-V isolation (for Windows containers). There is no drop-in "VM-level isolation for Linux containers" story comparable to OpenShift Sandboxed Containers.

Swisscom ESC

ESC abstracts all of this behind a managed service. The customer does not interact with KubeVirt, CRI-O, Kata, or any of the technologies in this chapter. VMs are provisioned through Swisscom's portal or API, and the underlying technology is VMware vSphere managed by Swisscom.

The relevance of this chapter for the ESC evaluation is primarily about future risk: if Swisscom transitions away from VMware (due to Broadcom licensing), would the replacement involve KubeVirt? If so, the customer's VMs would be running on the same technology described here, but without customer visibility or control. The evaluation should probe Swisscom's technology roadmap and transition commitments.


Key Takeaways

  - OVE's Kubernetes-native design is its greatest strength and its greatest risk: VMs become declarative YAML with unified networking, storage, and RBAC, but operating at 5,000+ VM scale demands deep Kubernetes expertise a VMware-trained team may not yet have.
  - The Migration Toolkit for Virtualization (MTV), built on CDI and VDDK, is the path off VMware; validate it with a PoC covering representative Windows and Linux workloads before committing.
  - KubeVirt live migration works but trails vMotion in maturity; plan explicitly for the VMs (such as SR-IOV users) that cannot migrate at all.
  - Kata Containers give OVE three isolation tiers (container, Kata container, KubeVirt VM) on a single platform; no other candidate offers this.
  - Azure Local manages VMs through Hyper-V and Azure Arc rather than Kubernetes: a shallower learning curve for Windows-oriented teams, but containers and VMs stay on separate toolchains.
  - Swisscom ESC hides the entire stack behind a managed service; the key evaluation question is what happens if Swisscom ever moves off VMware.

Discussion Guide

Use these questions when engaging with Red Hat solution engineers, Kubernetes platform architects, or the organization's internal infrastructure team. The questions probe real-world operational readiness at enterprise scale.

Questions for OVE (Red Hat / KubeVirt)

  1. virt-launcher overhead at scale: "At 50 VMs per worker node (our expected density), the per-VM libvirtd instances add up to roughly 1.5-2.5 GB of overhead per node. With 100+ worker nodes, that is 150-250 GB of cluster memory consumed by libvirtd instances. How do you account for this in capacity planning templates? Is there a roadmap to reduce per-VM overhead (e.g., shared libvirtd, libvirt-less architecture)?"

  2. Pod startup latency during disaster recovery: "If an entire rack of 10 worker nodes fails simultaneously (500 VMs), how long does it take for all 500 virt-launcher pods to be scheduled and started on surviving nodes? What are the bottleneck factors -- scheduler throughput, CRI-O image pull, CNI setup, or storage attach? Have you tested this scenario at our scale?"

  3. CDI import throughput benchmarks: "For our migration of 5,000 VMs (~1 PB total disk), what is the maximum sustained import rate CDI can achieve with parallel DataVolumes? Specifically: how many concurrent importer pods can run without saturating the storage backend or the API server? What is the recommended migration architecture -- dedicated import cluster, direct VDDK from vCenter, or HTTP staging server?"

  4. Live migration and SR-IOV mutual exclusion: "Approximately 20% of our VMs require SR-IOV for high-throughput network workloads. These VMs cannot live-migrate. What is the recommended maintenance strategy for nodes hosting SR-IOV VMs? Tolerate brief downtime during host patching? Dual-NIC with virtio for migration and SR-IOV for production? Is there a roadmap for migratable SR-IOV (e.g., switchdev mode + OVN hardware offload)?"

  5. Descheduler as DRS replacement: "Our VMware environment uses DRS in fully automated mode. The KubeVirt descheduler is the closest equivalent. How mature is the descheduler for VM workloads specifically? Does it understand VM-specific constraints (NUMA alignment, dedicated CPUs, hugepages) when making rebalancing decisions? Can it be configured to avoid migrating tier-1 VMs unless resource imbalance exceeds a threshold?"

  6. Networking mode recommendation for our workload mix: "We have three categories of VMs: (a) general-purpose Linux with <1 Gbps traffic, (b) database servers with 10-25 Gbps storage replication traffic, (c) market data receivers with ultra-low-latency requirements. What networking mode do you recommend for each category? Can a single VM have both a masquerade interface (for management) and an SR-IOV interface (for data) simultaneously?"

  7. Debugging a VM boot failure: "Walk us through the debugging process when a VM fails to start. Which logs do you check first? How do you distinguish between a Kubernetes scheduling failure (pod Pending), a CRI-O container creation failure (pod ContainerCreating), a virt-launcher failure (libvirt domain definition error), and a QEMU failure (hardware emulation error)? What tooling exists to correlate these layers?"
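On the final point of question 6: KubeVirt does support multiple interfaces with different bindings on one VM, so the question is really about the vendor's recommended pattern. A hedged sketch of the relevant spec fragment (interface and network names are hypothetical; the SR-IOV network assumes a Multus NetworkAttachmentDefinition exists):

```yaml
# Fragment of a VirtualMachineInstance template spec (illustrative only).
domain:
  devices:
    interfaces:
      - name: mgmt
        masquerade: {}                # management traffic over the pod network
      - name: data
        sriov: {}                     # data plane on a passed-through virtual function
networks:
  - name: mgmt
    pod: {}
  - name: data
    multus:
      networkName: sriov-data-25g    # hypothetical NetworkAttachmentDefinition
```

This dual-interface pattern is also one candidate answer to question 4: the virtio management path stays migratable even though the SR-IOV data path does not.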

Questions for Azure Local

  1. Architectural comparison to KubeVirt: "Azure Local runs VMs directly on Hyper-V without Kubernetes pod wrapping. What are the performance advantages of this simpler architecture (no CRI-O, no pod overhead, no per-VM libvirtd)? Conversely, what does Azure Local lose by not having a Kubernetes-native VM model (no GitOps for VMs, no Kubernetes RBAC for VMs, no unified container+VM platform)?"

Questions for Swisscom ESC

  1. Technology transition transparency: "If Swisscom migrates the ESC platform from VMware to a Kubernetes-native stack (such as KubeVirt or a similar technology), what is the customer impact? Will existing VMs be transparently migrated, or will customers need to re-export and re-import VMs? What is the contractual notification period for such a technology change?"

Questions for All Candidates

  1. Isolation model comparison: "Compare the VM isolation boundary across your platform: What is the hypervisor attack surface (number of VM exit handlers, device emulation code size)? Has the hypervisor undergone independent security audit or formal verification? What is the historical CVE rate for hypervisor escape vulnerabilities? For OVE specifically: does the additional pod/container layer (CRI-O, cgroups, namespaces) add defense-in-depth, or does it add attack surface?"