Modern datacenters and beyond

Kubernetes-Native Virtualization

Why This Matters

The previous three chapters covered the foundational concepts of virtualization, the VMware baseline you are migrating away from, and the hypervisor engines (KVM, QEMU, libvirt, Hyper-V) that power the candidate platforms. This chapter takes the next step up the stack: how Kubernetes -- a system designed for containers -- becomes a platform that runs traditional virtual machines as first-class citizens.

This is the most important chapter in the virtualization series for one reason: KubeVirt is the core of OpenShift Virtualization Engine (OVE). Every VM that runs on OVE is a KubeVirt-managed VM. Every live migration, every storage attachment, every network connection, every console session -- all of these flow through KubeVirt's custom resources, controllers, and per-VM pod architecture. Understanding KubeVirt at the depth presented here is not optional for evaluating OVE. It is the evaluation.

For Azure Local, this chapter provides a contrast: Azure Local does not use Kubernetes for VM management (it uses Hyper-V with the Azure Arc management layer). Understanding KubeVirt clarifies what is architecturally different about the OVE approach -- where it is more flexible, where it is more complex, and where it introduces operational patterns that a VMware-trained team has never encountered.

For Swisscom ESC, this chapter is contextual. ESC currently runs on VMware; if Swisscom ever transitions to a Kubernetes-native platform, the concepts here would apply.

The chapter also covers the container runtime layer (CRI-O, containerd) that sits between Kubernetes and the actual virt-launcher processes, and Kata Containers, which represent a different approach to virtualization in Kubernetes -- running containers inside VMs rather than running VMs inside pods.

At 5,000+ VM scale, every architectural decision in the KubeVirt stack has compounding effects. A misconfigured CDI import pipeline can delay a migration wave by weeks. A misunderstood networking mode (masquerade vs. bridge vs. SR-IOV) can mean the difference between 1 Gbps and 25 Gbps per VM. An incorrectly sized virt-handler DaemonSet causes node-level failures during mass live migration. This chapter equips the evaluation team to operate at that level.


Concepts

1. KubeVirt

What KubeVirt Is

KubeVirt is a Kubernetes extension (operator) that enables virtual machines to run as first-class workloads alongside containers on the same cluster. It was started in 2017 by Red Hat engineers, donated to the Cloud Native Computing Foundation (CNCF) as a Sandbox project in 2019, and promoted to CNCF Incubating status in 2022. It is the upstream open-source project that Red Hat packages and supports as OpenShift Virtualization, which in turn is the foundation of OpenShift Virtualization Engine (OVE).

The core idea: instead of building a separate management plane for VMs (like vCenter or System Center VMM), KubeVirt extends the existing Kubernetes API with Custom Resource Definitions (CRDs) that represent virtual machines. The Kubernetes API server, scheduler, RBAC, monitoring, and networking all apply to VMs just as they do to containers. The VM itself runs inside a Kubernetes pod, where a virt-launcher process manages a libvirtd instance and a QEMU process.
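To make this concrete, here is a minimal sketch of such a custom resource. The name, image, and sizes are illustrative, not from the production example later in this chapter; the point is that the VM is just another Kubernetes object, created with kubectl apply and governed by ordinary RBAC:

```yaml
# Minimal VirtualMachine custom resource (illustrative values).
apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
  name: demo-vm            # hypothetical name
  namespace: default
spec:
  running: true            # desired power state; virt-controller reconciles it
  template:
    spec:
      domain:
        memory:
          guest: 2Gi
        devices:
          disks:
            - name: rootdisk
              disk:
                bus: virtio
          interfaces:
            - name: default
              masquerade: {}
      networks:
        - name: default
          pod: {}
      volumes:
        - name: rootdisk
          containerDisk:
            image: quay.io/containerdisks/fedora:latest  # example image
```

Applying this manifest is enough to get a booted guest: the controllers described below translate it into a pod, a libvirt domain, and ultimately a QEMU process.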

This is not a trivial wrapper. KubeVirt solves a genuinely hard engineering problem: Kubernetes was designed around the assumption that workloads are ephemeral, stateless, and horizontally scalable. VMs are the opposite -- they are long-lived, deeply stateful, and vertically scaled. KubeVirt bridges this gap by introducing VM-specific lifecycle semantics (start, stop, pause, migrate, restart) on top of Kubernetes' pod-centric model.

Architecture

KubeVirt consists of four primary components deployed on a Kubernetes cluster:

  KubeVirt Architecture Overview

  +=====================================================================+
  |  Kubernetes Control Plane                                           |
  |  +---------------------------------------------------------------+  |
  |  |  kube-apiserver (+ KubeVirt CRDs registered)                  |  |
  |  |  - VirtualMachine (vm)                                        |  |
  |  |  - VirtualMachineInstance (vmi)                               |  |
  |  |  - VirtualMachineInstanceReplicaSet (vmirs)                   |  |
  |  |  - VirtualMachineInstanceMigration (vmim)                     |  |
  |  |  - VirtualMachineClusterPreference                            |  |
  |  |  - VirtualMachineClusterInstancetype                          |  |
  |  +---------------------------------------------------------------+  |
  |                                                                     |
  |  +-----------------------------+  +-----------------------------+   |
  |  |  virt-api (Deployment)      |  |  virt-controller            |   |
  |  |                             |  |  (Deployment, HA pair)      |   |
  |  |  - Validating webhook       |  |                             |   |
  |  |  - Mutating webhook         |  |  - Watches VM/VMI CRs       |   |
  |  |  - Subresource API          |  |  - Creates virt-launcher    |   |
  |  |    (console, VNC,           |  |    pods for each VMI        |   |
  |  |     start, stop, migrate)   |  |  - Manages VM lifecycle     |   |
  |  |  - virtctl proxy target     |  |    state machine            |   |
  |  |  - Certificate management   |  |  - Coordinates migrations   |   |
  |  +-----------------------------+  +-----------------------------+   |
  +=====================================================================+

  +=======================================================================+
  |  Worker Node 1                      Worker Node 2                     |
  |  +-------------------------------+  +-------------------------------+ |
  |  | virt-handler (DaemonSet)      |  | virt-handler (DaemonSet)      | |
  |  | - Registers node capabilities |  | - Registers node caps         | |
  |  | - Manages device plugins      |  | - Manages device plugins      | |
  |  |   (KVM, vhost-net, GPU, SRIOV)|  |   (KVM, vhost-net, etc.)      | |
  |  | - Syncs VMI state with API    |  | - Syncs VMI state with API    | |
  |  | - Coordinates migration target|  | - Coordinates migration       | |
  |  | - Configures node networking  |  | - Configures node net         | |
  |  +-------------------------------+  +-------------------------------+ |
  |                                                                       |
  |  +---------------------------+  +---------------------------+         |
  |  | virt-launcher Pod (VM-A)  |  | virt-launcher Pod (VM-B)  |         |
  |  | +----------------------+  |  | +----------------------+  |         |
  |  | | virt-launcher process|  |  | | virt-launcher process|  |         |
  |  | +----------------------+  |  | +----------------------+  |         |
  |  | | libvirtd             |  |  | | libvirtd             |  |         |
  |  | +----------------------+  |  | +----------------------+  |         |
  |  | | QEMU/KVM process     |  |  | | QEMU/KVM process     |  |         |
  |  | | (the actual VM)      |  |  | | (the actual VM)      |  |         |
  |  | +----------------------+  |  | +----------------------+  |         |
  |  +---------------------------+  +---------------------------+         |
  +=======================================================================+

virt-api is a Deployment (typically 2 replicas for HA) that serves as the entry point for all KubeVirt-specific API operations. It registers itself as a Kubernetes admission webhook (both validating and mutating) so that every VirtualMachine or VirtualMachineInstance create/update request passes through KubeVirt's validation logic before being persisted in etcd. The virt-api also provides subresource endpoints -- these are the REST endpoints that virtctl uses for operations that do not map to standard Kubernetes CRUD: opening a VNC console, streaming a serial console, triggering a start/stop/restart, initiating a live migration. Architecturally, virt-api is comparable to vCenter's SOAP/REST API facade, except it extends the Kubernetes API server rather than replacing it.

virt-controller is a Deployment (typically 2 replicas with leader election) that implements the core reconciliation loop. It watches VirtualMachine and VirtualMachineInstance custom resources through the Kubernetes API server and ensures reality matches the declared state. When a user creates a VirtualMachineInstance, the virt-controller creates a pod (the virt-launcher pod) with the correct resource requests, volume mounts, and annotations. When a VirtualMachine is set to running: false, the virt-controller deletes the associated VMI and its pod. The virt-controller also manages the state machine for VMIs (Pending -> Scheduling -> Scheduled -> Running -> Succeeded/Failed) and coordinates live migrations by creating target pods on destination nodes.

In vSphere terms, virt-controller is the equivalent of the vpxd process inside vCenter -- the central brain that translates desired state into actions on hosts.

virt-handler is a DaemonSet that runs on every worker node capable of hosting VMs. It is the node-level agent. Its responsibilities include:

- Registering node capabilities (CPU model, virtualization features) so the scheduler can place VMs on suitable nodes
- Managing the device plugins that expose /dev/kvm, /dev/vhost-net, GPUs, and SR-IOV VFs to virt-launcher pods
- Syncing VMI state between the cluster API and the virt-launcher processes on its node
- Coordinating the node-side work of live migrations (preparing the target, signaling the source)
- Configuring node-level networking for VM interfaces

In vSphere terms, virt-handler is roughly equivalent to the hostd + vpxa agent combination on each ESXi host -- the local authority that manages VMs on behalf of the central controller.

virt-launcher is a per-VM pod that runs exactly one VM. It is not deployed as a DaemonSet or Deployment -- it is created by virt-controller as a regular pod for each VirtualMachineInstance. Inside the virt-launcher pod, three processes cooperate:

  virt-launcher Pod Internal Structure

  +====================================================================+
  |  virt-launcher Pod (one per VM)                                    |
  |  Kubernetes namespace: vm-namespace                                |
  |  Pod name: virt-launcher-my-database-vm-xk7q9                      |
  |                                                                    |
  |  Cgroup: /kubepods/pod<uid>/                                       |
  |  CPU/Memory limits enforced by kubelet cgroups                     |
  |                                                                    |
  |  +--------------------------------------------------------------+  |
  |  |  Container: compute                                          |  |
  |  |                                                              |  |
  |  |  PID 1: virt-launcher                                        |  |
  |  |    - Translates VMI spec into libvirt domain XML             |  |
  |  |    - Calls libvirt API to define and start domain            |  |
  |  |    - Monitors QEMU process health                            |  |
  |  |    - Reports VM state changes to virt-handler (via socket)   |  |
  |  |    - Handles graceful shutdown (ACPI power button)           |  |
  |  |    - Exits when QEMU exits (pod terminates)                  |  |
  |  |                                                              |  |
  |  |  PID 2: libvirtd                                             |  |
  |  |    - Per-pod libvirtd instance (not system-wide)             |  |
  |  |    - Receives domain XML from virt-launcher                  |  |
  |  |    - Constructs QEMU command line                            |  |
  |  |    - Manages QEMU process lifecycle                          |  |
  |  |    - Handles domain events (migration, shutdown, crash)      |  |
  |  |                                                              |  |
  |  |  PID 3: qemu-kvm (QEMU/KVM)                                  |  |
  |  |    - The actual virtual machine                              |  |
  |  |    - vCPU threads (one per vCPU)                             |  |
  |  |    - I/O threads for disk/network                            |  |
  |  |    - Device emulation (virtio, IDE, e1000, etc.)             |  |
  |  |    - VNC server (for console access)                         |  |
  |  |    - Uses /dev/kvm for hardware virtualization               |  |
  |  |    - Uses /dev/vhost-net for accelerated networking          |  |
  |  +--------------------------------------------------------------+  |
  |                                                                    |
  |  Mounted Volumes:                                                  |
  |  - /var/run/kubevirt-private   (virt-handler communication)        |
  |  - /var/run/libvirt            (libvirt socket)                    |
  |  - PVC mounts for VM disks                                         |
  |  - ConfigMap/Secret mounts for cloud-init, sysprep                 |
  |  - Device mounts: /dev/kvm, /dev/vhost-net, /dev/net/tun           |
  |                                                                    |
  |  Network Interfaces:                                               |
  |  - eth0 (pod network, default CNI)                                 |
  |  - net1, net2, ... (additional NICs via Multus)                    |
  +====================================================================+

The critical design decision: each VM gets its own libvirtd instance. In a traditional KVM deployment, a single system-wide libvirtd manages all VMs on a host. KubeVirt deliberately isolates libvirtd per pod for two reasons: (1) a crash in one VM's libvirtd cannot affect other VMs, and (2) the Kubernetes pod sandbox (cgroups, namespaces) cleanly contains all processes related to a single VM.

This means a worker node running 50 VMs has 50 libvirtd processes and 50 QEMU processes, each in its own pod with its own cgroup. The memory overhead of 50 libvirtd instances (each ~30-50 MB RSS) is a real cost -- roughly 1.5-2.5 GB -- that does not exist in a traditional KVM or VMware setup. At 5,000 VMs across 100 nodes, this is ~50 VMs per node, ~2 GB overhead per node, ~200 GB total cluster overhead for libvirtd alone. This is manageable but must be accounted for in capacity planning.

Custom Resource Definitions

KubeVirt extends the Kubernetes API with several CRDs. The four most important ones:

VirtualMachine (VM): The top-level user-facing resource. It represents a virtual machine with a desired running state. It contains the VM's specification (CPU, memory, disks, network, firmware) and a running or runStrategy field that controls whether the VM should be powered on. The VirtualMachine controller creates and manages a VirtualMachineInstance when the VM should be running. Think of it as the equivalent of a VM in vCenter's inventory -- it persists even when the VM is powered off.
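Alongside the boolean running field, runStrategy expresses richer power-state intent. A sketch, using the documented values (Always, RerunOnFailure, Manual, Halted); the VM name is illustrative:

```yaml
# runStrategy is mutually exclusive with "running": a VM uses one or the other.
apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
  name: demo-vm                  # illustrative
spec:
  runStrategy: RerunOnFailure    # restart the VMI if the guest crashes,
                                 # but respect a clean guest-initiated shutdown
  template:
    spec:
      domain:
        devices: {}
```

RerunOnFailure is a useful middle ground for migrated workloads: the platform restarts crashed guests, but an administrator shutting the VM down from inside the guest does not fight the controller.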

VirtualMachineInstance (VMI): Represents a running instance of a VM. When a VirtualMachine is set to running: true, the virt-controller creates a VMI. The VMI is the actual runtime object -- it maps 1:1 to a virt-launcher pod and a QEMU process. When the VM is shut down, the VMI is deleted. In vSphere terms, the VMI is the runtime state -- like the vmx process on an ESXi host. The VM persists; the VMI is ephemeral.

VirtualMachineInstanceReplicaSet (VMIRS): Manages a set of identical VMIs, analogous to a Kubernetes ReplicaSet for pods. It maintains a desired number of running VM instances. Useful for stateless VM workloads (load balancers, web servers that must run as VMs for legacy reasons). Not commonly used in enterprise environments where each VM is unique.

VirtualMachineInstanceMigration (VMIM): A declarative object that triggers a live migration of a VMI from one node to another. Creating a VMIM object is equivalent to right-clicking a VM in vCenter and selecting "Migrate." The virt-controller and virt-handler cooperate to execute the migration. The VMIM tracks progress and status. When the migration completes (or fails), the VMIM status is updated.
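Declaratively, the migration request itself is tiny. A sketch referencing the example VM used later in this chapter:

```yaml
# Triggers a live migration of the named VMI to another node --
# the declarative equivalent of "Migrate" in vCenter.
apiVersion: kubevirt.io/v1
kind: VirtualMachineInstanceMigration
metadata:
  name: migrate-oracle-db-prod-01   # illustrative name
  namespace: database-tier
spec:
  vmiName: oracle-db-prod-01
```

The object's status tracks the migration phase; the destination node is chosen by the Kubernetes scheduler, subject to the VM's affinity and toleration rules.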

Additional CRDs that are operationally important:

- VirtualMachineClusterInstancetype / VirtualMachineClusterPreference: cluster-wide, reusable definitions of VM sizing (CPU, memory) and guest preferences (machine type, device defaults). They let a platform team standardize VM shapes instead of hand-editing every spec -- roughly analogous to hardware templates in vSphere.
- DataVolume (cdi.kubevirt.io, managed by the CDI operator): declarative import, clone, or upload of VM disk data into PVCs, covered in the CDI section below.

VM Lifecycle Through Kubernetes

Understanding how a VM goes from a YAML file to a running QEMU process is essential for debugging and operations. The flow:

  VM Lifecycle: From YAML to Running QEMU Process

  Step 1: User applies VM manifest
  =====================================
  $ kubectl apply -f my-vm.yaml
       |
       v
  kube-apiserver
       |
       +--> virt-api (admission webhook)
       |      - Validates VM spec (valid CPU model? valid disk bus?)
       |      - Mutates defaults (add default network, set machine type)
       |      - Rejects invalid specs (negative memory, unknown disk type)
       |
       +--> etcd (VM resource persisted)

  Step 2: virt-controller reconciles
  =====================================
  virt-controller (watch loop)
       |
       +--> Sees new VM with running: true (or runStrategy: Always)
       |
       +--> Creates VirtualMachineInstance (VMI) CR
       |      - Copies spec from VM to VMI
       |      - Sets VMI status: Pending
       |
       +--> etcd (VMI resource persisted)
       |
       +--> Creates virt-launcher Pod
              - Sets resource requests/limits (CPU, memory, hugepages)
              - Adds device requests (/dev/kvm, /dev/vhost-net)
              - Mounts PVCs for disks
              - Mounts ConfigMaps/Secrets for cloud-init
              - Sets node affinity/anti-affinity from VM spec
              - Sets tolerations for any taints
              - Adds KubeVirt-specific annotations

  Step 3: Kubernetes schedules the pod
  =====================================
  kube-scheduler
       |
       +--> Evaluates pod against nodes:
       |      - Does node have /dev/kvm? (device plugin)
       |      - Does node have enough CPU/memory?
       |      - Does node have the requested hugepages?
       |      - Does node satisfy affinity rules?
       |      - Does node satisfy topology constraints?
       |      - Does node have the requested SR-IOV VFs?
       |
       +--> Binds pod to selected node
       |
       +--> VMI status: Scheduling -> Scheduled

  Step 4: kubelet starts the pod
  =====================================
  kubelet (on target node)
       |
       +--> Calls CRI-O (or containerd) to create pod sandbox
       |      - Creates cgroup hierarchy
       |      - Creates network namespace
       |      - CNI plugin configures pod networking
       |      - Multus attaches additional interfaces
       |
       +--> Pulls virt-launcher container image (if not cached)
       |
       +--> Starts virt-launcher container

  Step 5: virt-launcher boots the VM
  =====================================
  virt-launcher process (PID 1 in container)
       |
       +--> Reads VMI spec from annotation/downward API
       |
       +--> Translates VMI spec to libvirt domain XML
       |      - CPU: model, topology, features, NUMA
       |      - Memory: size, hugepages, NUMA cells
       |      - Disks: virtio-blk/virtio-scsi backed by PVCs
       |      - NICs: virtio-net with tap/bridge/SR-IOV backend
       |      - Firmware: UEFI (OVMF) or SeaBIOS
       |      - Devices: vTPM, watchdog, RNG, serial console
       |
       +--> Calls libvirt API: virDomainDefineXML()
       +--> Calls libvirt API: virDomainCreate()
       |
       v
  libvirtd (per-pod instance)
       |
       +--> Parses domain XML
       +--> Constructs QEMU command line (~200+ arguments)
       +--> Forks QEMU process
       |
       v
  qemu-kvm process
       |
       +--> Opens /dev/kvm (ioctl: KVM_CREATE_VM)
       +--> Creates vCPUs (ioctl: KVM_CREATE_VCPU)
       +--> Maps memory (ioctl: KVM_SET_USER_MEMORY_REGION)
       +--> Loads firmware (OVMF/SeaBIOS)
       +--> Starts vCPU threads (ioctl: KVM_RUN in loop)
       |
       +--> Guest OS boots
       +--> VMI status: Running

  Step 6: virt-handler syncs state
  =====================================
  virt-handler (on same node)
       |
       +--> Detects running VMI on its node
       +--> Reads guest info via QEMU Guest Agent (if installed)
       +--> Reports IP addresses, OS info, filesystem info to VMI status
       +--> Updates VMI conditions (AgentConnected, LiveMigratable, etc.)

YAML Examples

A complete VirtualMachine definition for a production database VM:

apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
  name: oracle-db-prod-01
  namespace: database-tier
  labels:
    app: oracle-database
    tier: production
    criticality: tier-1
spec:
  running: true
  template:
    metadata:
      labels:
        app: oracle-database
        kubevirt.io/vm: oracle-db-prod-01
    spec:
      # Node placement
      nodeSelector:
        node-role.kubernetes.io/worker-vm: ""
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchLabels:
                  app: oracle-database
              topologyKey: kubernetes.io/hostname

      # CPU and memory
      domain:
        cpu:
          cores: 16
          sockets: 1
          threads: 1
          model: host-passthrough
          dedicatedCpuPlacement: true
          numa:
            guestMappingPassthrough: {}
          features:
            - name: x2apic
              policy: require
        memory:
          guest: 64Gi
          hugepages:
            pageSize: 1Gi
        machine:
          type: q35
        firmware:
          bootloader:
            efi:
              secureBoot: true
        features:
          acpi: {}
          apic: {}
          smm: {}
        clock:
          utc: {}
          timer:
            hpet:
              present: false
            pit:
              tickPolicy: delay
            rtc:
              tickPolicy: catchup
            hyperv: {}
        devices:
          disks:
            - name: rootdisk
              disk:
                bus: virtio
            - name: datadisk
              disk:
                bus: virtio
              dedicatedIOThread: true
            - name: cloudinitdisk
              disk:
                bus: virtio
          interfaces:
            - name: default
              masquerade: {}
            - name: storage-net
              sriov: {}
          networkInterfaceMultiqueue: true
          rng: {}
          tpm: {}

      # Networks
      networks:
        - name: default
          pod: {}
        - name: storage-net
          multus:
            networkName: sriov-storage-vlan100

      # Volumes
      volumes:
        - name: rootdisk
          dataVolume:
            name: oracle-db-prod-01-root
        - name: datadisk
          persistentVolumeClaim:
            claimName: oracle-db-prod-01-data
        - name: cloudinitdisk
          cloudInitNoCloud:
            networkData: |
              version: 2
              ethernets:
                eth0:
                  dhcp4: true
            userData: |
              #cloud-config
              hostname: oracle-db-prod-01
              ssh_authorized_keys:
                - ssh-rsa AAAAB3... admin@company.com

      # Eviction strategy for live migration during node drain
      evictionStrategy: LiveMigrate

A DataVolume for importing a VMware VMDK:

apiVersion: cdi.kubevirt.io/v1beta1
kind: DataVolume
metadata:
  name: oracle-db-prod-01-root
  namespace: database-tier
spec:
  source:
    http:
      url: "https://image-server.internal/vmware-exports/oracle-db-root.vmdk"
      certConfigMap: image-server-ca
  pvc:
    accessModes:
      - ReadWriteOnce
    resources:
      requests:
        storage: 200Gi
    storageClassName: ceph-rbd-ssd
    volumeMode: Block

Pod Wrapping: How VMs Live Inside Kubernetes

The design decision to run each VM inside a Kubernetes pod is the defining architectural choice of KubeVirt. It brings enormous benefits and specific costs.

Benefits:

- One control plane: the Kubernetes API server, scheduler, RBAC, quotas, and monitoring apply to VMs exactly as they do to containers -- no separate vCenter-style management plane to operate.
- Clean containment: the pod sandbox (cgroups, namespaces) wraps every process belonging to a VM, so resource limits, accounting, and cleanup are enforced by kubelet.
- Ecosystem reuse: CNI networking, CSI storage, NetworkPolicy, affinity/anti-affinity, and node drain semantics work for VMs without VM-specific reimplementation.

Costs:

- Per-VM overhead: each VM carries its own virt-launcher and libvirtd processes (roughly 30-50 MB RSS for libvirtd alone), a cost that does not exist with a single system-wide libvirtd or with ESXi.
- Impedance mismatch: Kubernetes assumes ephemeral, replaceable workloads; VM semantics (stop/start without deletion, live migration, long-lived state) require KubeVirt-specific machinery layered on top of the pod model.
- New failure modes: a VM outage can now originate in the CNI, the CSI driver, kubelet, or an admission webhook -- layers a VMware-trained team has never had to debug.

Networking

KubeVirt networking maps VM network interfaces to Kubernetes pod networking, which itself maps to the cluster's CNI (Container Network Interface) implementation. This is one of the areas with the greatest divergence from VMware.

In vSphere, a VM NIC connects to a vSwitch (standard or distributed), which connects to a physical NIC. The mapping is direct: VM NIC -> portgroup -> uplink -> physical NIC.

In KubeVirt, the path is: VM NIC -> virtio-net/e1000/rtl8139 -> tap device inside pod -> pod network interface -> CNI plugin -> physical NIC. And for secondary networks (Multus), each additional NIC is a separate CNI attachment.

KubeVirt supports multiple networking modes for the default pod network interface:

masquerade
  How it works: VM traffic is NATed through the pod IP. The VM gets a
  private address (10.0.2.0/24 by default) and the pod IP is the NAT
  gateway. Uses iptables/nftables rules.
  Use case: Default mode. Simple. Works with any CNI. VM is reachable via
  pod IP + service/ingress.
  Performance: Moderate. NAT overhead; roughly 10-15% throughput reduction
  vs. bridge.

bridge
  How it works: VM NIC is bridged directly to the pod's network interface.
  The VM gets the pod's IP address via DHCP (KubeVirt runs a DHCP server
  on the bridge).
  Use case: When the VM needs the pod IP directly (no NAT). Requires CNI
  support for bridge takeover.
  Performance: Good. No NAT overhead. But the pod IP is "taken" by the VM,
  so sidecar containers cannot use it.

SR-IOV
  How it works: VM NIC is attached directly to an SR-IOV Virtual Function
  (VF) passed through via VFIO. Bypasses all software networking.
  Use case: High-throughput, low-latency workloads. Requires SR-IOV-capable
  NICs and the SR-IOV device plugin.
  Performance: Excellent. Near-native, <3% overhead; up to 100 Gbps line
  rate.

passt
  How it works: User-mode networking stack that translates between the VM's
  network stack and the pod's network namespace. No root privileges
  required.
  Use case: Rootless deployments, environments where bridge mode is not
  supported.
  Performance: Good. Better than masquerade; no NAT.
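Switching the default interface between these modes is a small spec change. A sketch of the relevant VMI fragment, moving from masquerade to bridge mode (subject to the CNI-support caveat above; names are illustrative):

```yaml
# Interface/network fragment of a VM template spec -- bridge mode
# instead of masquerade. The VM takes over the pod IP, so sidecar
# containers can no longer use it.
spec:
  domain:
    devices:
      interfaces:
        - name: default
          bridge: {}        # was: masquerade: {}
  networks:
    - name: default
      pod: {}
```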

For secondary networks, KubeVirt uses Multus CNI, a "meta-plugin" that allows a pod to have multiple network interfaces. Each additional interface is defined by a NetworkAttachmentDefinition (NAD) resource. This is how VMs get connections to VLANs, storage networks, or dedicated management networks -- analogous to adding a second NIC to a VM in vSphere and connecting it to a different portgroup.
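A NetworkAttachmentDefinition is a thin wrapper around a CNI configuration. A sketch for a VLAN-tagged storage network; the names, bridge device, and VLAN ID are illustrative:

```yaml
# Secondary network definition consumed by Multus. A VM references it
# via multus.networkName in its networks list.
apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
  name: storage-vlan100          # illustrative
  namespace: database-tier
spec:
  config: |
    {
      "cniVersion": "0.3.1",
      "type": "bridge",
      "bridge": "br-storage",
      "vlan": 100,
      "ipam": {}
    }
```

With empty ipam, address assignment is left to the guest (static or DHCP on the VLAN), which matches how most migrated enterprise VMs already manage their addresses.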

  KubeVirt Networking: VM NIC to Physical NIC

  Inside the VM (Guest OS)
  +--------------------------------------------------+
  |  eth0: 10.0.2.2/24 (masquerade, default network) |
  |  eth1: 192.168.100.5/24 (SR-IOV, storage net)    |
  +--------------------------------------------------+
       |  virtio-net             |  VFIO passthrough
       v                         v
  +--------------------------------------------------+
  |  virt-launcher Pod                               |
  |                                                  |
  |  tap0 --- linux bridge --- eth0 (pod interface)  |
  |       (masquerade mode:                          |
  |        iptables DNAT/SNAT                        |
  |        on the bridge)                            |
  |                                                  |
  |  [no software path for SR-IOV -- direct HW]     |
  +--------------------------------------------------+
       |  eth0: pod IP              |  SR-IOV VF
       v  (via CNI: OVN-K, Calico)  v  (via VFIO)
  +--------------------------------------------------+
  |  Worker Node                                     |
  |                                                  |
  |  br-int / ovs-bridge / host bridge               |
  |  (OVN-Kubernetes or other CNI)                   |
  |                                                  |
  |  Physical NIC: eno1 (pod traffic)                |
  |  Physical NIC: eno2 (SR-IOV PF, VFs allocated)   |
  +--------------------------------------------------+
       |                            |
       v                            v
  Physical Network Switch / Fabric

The key vSphere equivalences:

  vSphere Concept                     KubeVirt Equivalent
  ---------------------------------------------------------------------
  vSwitch / vDS portgroup             CNI plugin + NetworkAttachmentDefinition
  VM NIC (VMXNET3)                    VM NIC (virtio-net)
  Trunk port / VLAN tagging           Multus + VLAN CNI plugin or
                                      OVN-Kubernetes secondary network
  DirectPath I/O (passthrough NIC)    SR-IOV mode with VFIO
  NSX-T micro-segmentation            Kubernetes NetworkPolicy +
                                      OVN-Kubernetes ACLs
  VM-to-VM on same host               Pod-to-pod (via CNI bridge or OVS)
  VM traffic shaping (vDS)            CNI bandwidth plugin or
                                      OVN-Kubernetes QoS
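The micro-segmentation equivalence deserves a concrete illustration. Because a VM is a pod, a standard NetworkPolicy selecting the VM template's labels applies to it. A sketch for the Oracle example used in this chapter -- the app-server label and listener port are illustrative:

```yaml
# Restrict ingress to database VM pods: only pods labeled app=app-server
# in the same namespace may connect on the listener port. The policy
# matches the virt-launcher pod, which carries the VM template's labels.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: db-vm-ingress
  namespace: database-tier
spec:
  podSelector:
    matchLabels:
      app: oracle-database
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: app-server    # illustrative consumer label
      ports:
        - protocol: TCP
          port: 1521             # illustrative listener port
```

Note the limitation: NetworkPolicy governs the default pod network only. SR-IOV interfaces bypass the CNI entirely, so traffic on them must be segmented in the physical fabric, as with DirectPath I/O in vSphere.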

Storage

KubeVirt VM disks are backed by Kubernetes Persistent Volume Claims (PVCs). This is one of the most fundamental differences from VMware, where disks are VMDK files on a VMFS or NFS datastore managed by vCenter.

In KubeVirt, each VM disk is either:

- A persistent disk backed by a PVC -- provisioned directly or through a CDI DataVolume -- which survives VM restarts and is the normal choice for migrated enterprise VMs.
- An ephemeral disk (for example, a containerDisk pulled from a container registry), whose writes are discarded when the VMI stops -- suitable for immutable appliance images, not stateful workloads.
- A generated config disk built from a ConfigMap or Secret (cloudInitNoCloud, sysprep) used to inject first-boot configuration.

The storage path:

  KubeVirt Storage: VM Disk to Physical Storage

  Guest OS
  +-------------------------------------+
  |  /dev/vda (virtio-blk)              |
  |  or                                 |
  |  /dev/sda (virtio-scsi)             |
  +-------------------------------------+
       |
  QEMU block layer
       |
       +--> Raw block device     +--> qcow2 file on filesystem
       |    (block-mode PVC)     |    (filesystem-mode PVC)
       v                         v
  +-------------------------------------+
  |  PVC mounted into virt-launcher pod |
  |  - Block mode: /dev/xvda            |
  |  - Filesystem mode: /var/run/       |
  |    kubevirt-private/vmi-disks/      |
  |    disk-name/disk.img               |
  +-------------------------------------+
       |
  Kubernetes PV / CSI driver
       |
       +--> Ceph RBD (block)
       +--> OpenShift Data Foundation (block/file)
       +--> NFS (file)
       +--> Local PV (block/file)
       +--> NetApp ONTAP (iSCSI block, NFS file)
       +--> Pure Storage (iSCSI, FC)
       |
       v
  Physical Storage Array / Cluster

Block mode vs. filesystem mode is a critical choice for production VMs:

PVC volumeMode
  Block mode: Block
  Filesystem mode: Filesystem (the default)

QEMU access
  Block mode: raw block device passed directly to QEMU.
  Filesystem mode: QEMU opens a qcow2/raw file on the PVC's mounted
  filesystem.

Performance
  Block mode: better -- no filesystem overhead; QEMU I/O goes directly to
  the block device.
  Filesystem mode: worse -- double filesystem: the PVC's filesystem
  (ext4/XFS) plus the guest's filesystem.

Snapshot support
  Block mode: via CSI volume snapshots.
  Filesystem mode: via CSI volume snapshots or qcow2 internal snapshots.

Live migration
  Both modes: requires RWX (ReadWriteMany) PVCs or storage-assisted
  migration.

Overhead
  Block mode: minimal.
  Filesystem mode: filesystem metadata and double journaling.

Recommendation for Tier-1
  Block mode: preferred.
  Filesystem mode: avoid for I/O-intensive workloads.
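In manifest terms, the whole distinction comes down to one field on the PVC. A sketch for a Tier-1 data disk; the name, size, and storage class are illustrative:

```yaml
# Block-mode PVC for a Tier-1 VM disk: QEMU receives the raw block
# device, avoiding the double-filesystem path described above.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: tier1-vm-data            # illustrative
  namespace: database-tier
spec:
  accessModes:
    - ReadWriteMany              # RWX is what live migration expects
  volumeMode: Block              # omit (defaults to Filesystem) for file-backed disks
  resources:
    requests:
      storage: 500Gi
  storageClassName: ceph-rbd-ssd # illustrative; any CSI class with RWX block support
```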

For the evaluation at 5,000+ VMs, the storage integration is a major area of scrutiny. The existing VMware estate uses VMFS datastores with vSAN or SAN-backed LUNs. Migrating to KubeVirt means migrating to PVC-backed storage. This requires a CSI driver for whatever storage backend the organization chooses (Ceph/ODF, NetApp, Pure, etc.) and the CDI (Containerized Data Importer) for converting existing VMDKs.

CDI (Containerized Data Importer)

CDI is a Kubernetes operator that manages the lifecycle of VM disk data. It is responsible for populating PVCs with VM disk content before a VM starts. CDI is the answer to the question: "How do I get my VMware VMDK files into KubeVirt?"

CDI supports the following import sources:

HTTP/HTTPS URL
  Downloads a disk image (raw, qcow2, vmdk, vdi, vhd, vhdx) from a web
  server. Automatically detects the source format and converts it to raw.

S3 bucket
  Downloads from S3-compatible storage (AWS S3, MinIO, Ceph RGW).

Container registry
  Pulls an OCI image that contains a disk image as a layer. Extracts the
  image and writes it to the PVC.

Existing PVC (clone)
  Clones an existing PVC to a new PVC. Uses CSI clone if available,
  otherwise smart-clone (snapshot + restore) or a host-assisted copy.

Upload
  Accepts a disk image upload via virtctl image-upload. Streams data
  directly to a CDI upload pod.

VDDK (VMware Virtual Disk Development Kit)
  Connects to a VMware vCenter/ESXi host and downloads a VM's disks using
  the VMware VDDK API. This is the primary mechanism used by the Migration
  Toolkit for Virtualization (MTV).

Image I/O (oVirt)
  Imports from Red Hat Virtualization (oVirt) -- less relevant for this
  evaluation.

Snapshot
  Creates a PVC from a VolumeSnapshot.

The CDI import flow for a VMware VMDK:

  CDI Import Flow: VMware VMDK to KubeVirt PVC

  Step 1: User creates DataVolume
  ================================================
  $ kubectl apply -f datavolume-import.yaml
       |
       v
  CDI Controller (watches DataVolume CRs)
       |
       +--> Creates a PVC with the requested size and storageClass
       |
       +--> Creates an Importer Pod
              |
              v
  +------------------------------------------+
  |  Importer Pod                            |
  |                                          |
  |  1. Downloads VMDK from HTTP URL         |
  |     (or connects to vCenter via VDDK)    |
  |                                          |
  |  2. Detects source format:               |
  |     - VMDK sparse? VMDK flat? qcow2?     |
  |     - Compressed (gz, xz)?               |
  |                                          |
  |  3. Converts to raw format:              |
  |     VMDK --> qemu-img convert --> raw     |
  |                                          |
  |  4. Writes raw data to PVC:              |
  |     - Block mode: dd to /dev/xvda        |
  |     - Filesystem mode: write to          |
  |       /data/disk.img                     |
  |                                          |
  |  5. Reports progress via DataVolume      |
  |     status (0% -> 100%)                  |
  +------------------------------------------+
       |
       v
  PVC is populated and bound
       |
       v
  DataVolume status: Succeeded
       |
       v
  VM can now reference this DataVolume
  in its volumes section and boot from it

VMDK conversion specifics: CDI uses qemu-img under the hood to convert VMDK files to raw format. It handles all VMDK variants: monolithic sparse, monolithic flat, split sparse, split flat, stream-optimized, and ESXi-style descriptor+extent files. The conversion is CPU-intensive (especially for compressed VMDKs) and disk-space-intensive (VMDKs are often thin-provisioned; the raw output is fully allocated). CDI supports scratch space (a temporary PVC used during conversion) to avoid running out of space on the target PVC.
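As a sketch, a DataVolume that imports a VMDK over HTTP might look like the following (the name, URL, and storage class are illustrative placeholders, not values from this environment):

```yaml
apiVersion: cdi.kubevirt.io/v1beta1
kind: DataVolume
metadata:
  name: app01-rootdisk
spec:
  source:
    http:
      # Hypothetical export location for the VMDK
      url: "https://images.example.internal/app01-disk1.vmdk"
  storage:
    accessModes:
      - ReadWriteMany          # RWX block enables live migration later
    volumeMode: Block
    resources:
      requests:
        storage: 200Gi         # must cover the fully allocated raw size
    storageClassName: ocs-storagecluster-ceph-rbd  # assumption: ODF/Ceph RBD class
```

CDI detects the VMDK format, converts it to raw with qemu-img, and writes the result to the PVC; for VDDK-based imports the source stanza points at vCenter instead of an HTTP URL.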

For the migration of 5,000+ VMs, CDI throughput is a bottleneck concern. Each import runs as a single pod. Parallel imports are supported (create multiple DataVolumes), but the bottleneck shifts to:

  • Source-side egress: vCenter/ESXi network bandwidth and VDDK read throughput
  • Target storage: sustained write throughput and IOPS of the CSI backend
  • Conversion CPU: qemu-img convert on the importer pods, especially for compressed or stream-optimized VMDKs

A realistic migration pipeline imports 10-50 VMs in parallel, throttled by network bandwidth and storage IOPS. At 200 GB average disk per VM, 5,000 VMs is 1 PB of data. At 10 Gbps sustained throughput, that is ~9 days of continuous transfer for the raw data alone, not counting conversion overhead, validation, and test boots. CDI bandwidth planning is a critical migration workstream.

Live Migration in KubeVirt

Live migration in KubeVirt moves a running VM from one worker node to another with near-zero downtime. The mechanism is built on libvirt/QEMU's pre-copy migration (covered in Chapter 3), but wrapped in Kubernetes pod semantics. This introduces differences from vMotion that the evaluation team must understand.

How it works:

  1. An operator (or an automated process like node drain) creates a VirtualMachineInstanceMigration (VMIM) resource.
  2. The virt-controller sees the VMIM, validates that the VMI is live-migratable (checks conditions: is the storage shared? Are all devices migratable? Is there a target node with sufficient resources?).
  3. The virt-controller creates a target virt-launcher pod on the destination node. This pod is identical to the source pod (same resource requests, same volume mounts) but does not start a VM -- it starts a QEMU process in "incoming migration" mode, waiting to receive the VM state.
  4. The virt-handler on the source node signals libvirtd to begin pre-copy migration. Libvirt connects the source QEMU to the target QEMU over a TCP connection (typically over the pod network or a dedicated migration network).
  5. QEMU performs iterative memory copy: first pass sends all memory pages, subsequent passes send only dirty pages. When the dirty rate is low enough, QEMU pauses the VM on the source, sends the final dirty pages and CPU state, and resumes the VM on the target. This pause window is the migration downtime.
  6. Once the VM is running on the target, the source virt-launcher pod is terminated. The VMI resource is updated to reflect the new node.
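The sequence above is triggered by a single declarative resource; a minimal example (the VM name is illustrative):

```yaml
apiVersion: kubevirt.io/v1
kind: VirtualMachineInstanceMigration
metadata:
  name: migrate-db01
spec:
  vmiName: db01   # the running VMI to move; the scheduler picks the target node
```

Deleting this resource while the migration is in flight cancels it, leaving the VM running on the source node.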

Differences from vMotion:

Aspect | vMotion | KubeVirt Live Migration
Unit of migration | VM (vmx process) | Pod (entire virt-launcher pod with VM inside)
Storage requirement | Shared datastore (VMFS, vSAN, NFS) or Storage vMotion for local disks | Shared PVCs (RWX access mode) or storage-class-specific migration support
Network | Dedicated vMotion VMkernel port, encrypted, 10/25 Gbps typical | Pod network or dedicated Multus network, TLS-encrypted, bandwidth depends on CNI
Trigger | Manual, DRS-automated, host maintenance mode | Manual (VMIM), node drain (kubectl drain), descheduler policy
Convergence | Mature convergence heuristics, memory pre-copy with stun threshold | QEMU pre-copy with configurable bandwidth limit, convergence timeout, auto-converge (throttles vCPUs)
Post-copy | Supported in recent vSphere versions | Supported in KubeVirt (experimental), falls back to pre-copy on failure
Downtime target | Typically <1 second for most workloads | Typically 10-500 ms, depends on dirty page rate and migration bandwidth
Multi-VM migration | DRS migrates multiple VMs in parallel with dependency awareness | Parallel VMIM resources, but no built-in dependency awareness (must be orchestrated externally)
Cancel | Yes, VM stays on source | Yes, delete the VMIM resource
Network identity | VM keeps its MAC and IP | VM keeps its MAC and IP (the pod IP changes, but the VM's internal IP is stable if using bridge mode + Multus)

A critical caveat: the pod IP changes after migration. In masquerade mode, the VM gets a new pod IP on the target node. The VM's internal IP (the NATed address) remains the same from the guest's perspective, but any Kubernetes service or external load balancer pointing at the old pod IP must be updated. Kubernetes Services (ClusterIP, NodePort, LoadBalancer) handle this automatically if the VM is behind a Service (the selector matches the new pod). Direct pod-IP access breaks.
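One way to preserve a stable entry point across migrations is to front the VM with a Service whose selector matches a label declared in the VM's pod template metadata (the label, name, and port here are illustrative; virtctl expose generates a similar Service):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: my-vm-ssh
spec:
  selector:
    app: my-vm        # assumption: label set in the VM's spec.template.metadata.labels
  ports:
    - port: 22        # Service port
      targetPort: 22  # port inside the guest (reachable via masquerade NAT)
```

After a migration, the Service endpoint tracks the new virt-launcher pod automatically, so clients never see the pod IP change.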

For bridge mode with Multus secondary networks, the VM's MAC address is preserved, and if the secondary network spans both nodes (same VLAN), the VM retains its IP address transparently -- this is the closest equivalent to vMotion's behavior.

Live migration prerequisites:

  • Shared storage: all of the VM's PVCs must be mounted RWX (ReadWriteMany), or the storage class must support storage-assisted migration.
  • Migratable devices: the VMI must not use devices that cannot be migrated (virt-controller validates this before the migration starts).
  • Target capacity: a schedulable node with enough CPU and memory for an identical virt-launcher pod.
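Cluster-wide migration behavior is tunable through KubeVirt's migration configuration. A sketch follows; the values are illustrative starting points, and in OVE these settings are typically applied via the HyperConverged resource rather than the KubeVirt CR directly:

```yaml
apiVersion: kubevirt.io/v1
kind: KubeVirt
metadata:
  name: kubevirt
  namespace: kubevirt
spec:
  configuration:
    migrations:
      parallelMigrations: 5                  # cluster-wide cap on concurrent migrations
      parallelOutboundMigrationsPerNode: 2   # limits source-node load during drains
      bandwidthPerMigration: 640Mi           # throttle per migration stream
      completionTimeoutPerGiB: 800           # abort if convergence stalls
      progressTimeout: 150                   # seconds without progress before failure
```

These knobs matter most during mass node drains, where unbounded parallel migrations can saturate the migration network.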

Console and VNC Access

KubeVirt provides console access through virtctl, a CLI tool that extends kubectl:

  $ virtctl console my-vm    # serial console, exit with Ctrl+]
  $ virtctl vnc my-vm        # graphical console via a local VNC viewer

Additionally, the OpenShift Web Console (in OVE) provides a browser-based VNC client and serial console, comparable to vSphere Client's remote console but accessible through a web browser without any plugins.

The architectural path for console access:

  Console Access Path

  User's workstation
       |
       +--> virtctl vnc my-vm
       |
       v
  kubectl proxy / API server (HTTPS/WSS)
       |
       +--> virt-api (subresource handler)
       |      Routes to correct virt-launcher pod
       |
       v
  virt-handler (on target node)
       |
       +--> virt-launcher pod
       |      |
       |      v
       |    QEMU VNC server (port 5900 inside pod)
       |    or QEMU chardev (serial console)
       |
       v
  Websocket stream back to virtctl
       |
       v
  Local VNC viewer / terminal

Comparison to vSphere: What Maps, What Doesn't

This mapping table is designed for the evaluation team to build a mental model of KubeVirt using their existing VMware knowledge:

vSphere / ESXi Concept | KubeVirt / OVE Equivalent | Notes
vCenter Server | kube-apiserver + virt-controller + virt-api | No single "vCenter" -- the functions are distributed across Kubernetes components
ESXi Host | Kubernetes worker node (RHCOS) + virt-handler | The node runs RHCOS (Red Hat CoreOS), not ESXi
VMX process | virt-launcher pod (containing QEMU process) | Each VM = one pod = one QEMU process
hostd + vpxa | virt-handler DaemonSet | Node-local agent reporting to the central controller
VM (in vCenter inventory) | VirtualMachine CR | Persistent object, survives power-off
Running VM instance | VirtualMachineInstance CR + virt-launcher Pod | Ephemeral, exists only while VM is running
VM Template | VirtualMachineClusterInstancetype + VirtualMachineClusterPreference + golden DataVolume | No single "template" object; composed from instance type + preference + source disk
Resource Pool | Kubernetes Namespace + ResourceQuota + LimitRange | Namespaces replace resource pools for multi-tenancy
DRS (Distributed Resource Scheduler) | kube-scheduler + descheduler (optional) | Kubernetes scheduler handles placement; descheduler handles rebalancing (less mature than DRS)
vMotion | VirtualMachineInstanceMigration (VMIM) | Declarative migration resource
vDS (Distributed Switch) | CNI plugin (OVN-Kubernetes in OVE) | OVN-Kubernetes is the default CNI; replaces vDS functionality
vDS Portgroup | NetworkAttachmentDefinition (NAD) via Multus | Each additional network is a Multus attachment
VMFS/vSAN Datastore | StorageClass + PVCs (backed by Ceph, ODF, etc.) | No monolithic "datastore"; each disk is an independent PVC
VMDK | PVC (block or filesystem mode) | The PVC is the disk
Content Library | Container registry + DataVolume sources | VM images stored as container images or HTTP-accessible files
vSphere HA | Kubernetes pod rescheduling + VM run strategy | If a node fails, pods are rescheduled to surviving nodes; runStrategy: Always ensures VMs restart
Alarms & Events | Kubernetes Events + Prometheus alerts | No built-in alarm system; alerting via Prometheus + Alertmanager
RBAC (vSphere permissions) | Kubernetes RBAC (Roles, RoleBindings, ClusterRoles) | More granular than vSphere; per-namespace, per-resource-type
vSphere Tags | Kubernetes Labels + Annotations | Labels are the primary metadata mechanism
Guest OS Customization (Sysprep) | cloud-init / Sysprep (via ConfigMap/Secret volumes) | cloud-init for Linux, Sysprep for Windows, injected as volumes
Snapshot | VolumeSnapshot (CSI) | Via the CSI driver, not KubeVirt itself; maturity varies by storage backend
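A sketch of how the template-replacement pieces compose into a single VM definition (the instance type, preference, and disk names are illustrative):

```yaml
apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
  name: web01
spec:
  runStrategy: Always          # restart the VM if its node fails
  instancetype:
    name: u1.medium            # assumption: cluster-wide instance type (CPU/memory sizing)
  preference:
    name: rhel.9               # assumption: cluster-wide preference (devices, firmware hints)
  template:
    spec:
      domain:
        devices:
          disks:
            - name: rootdisk
              disk: {}         # bus chosen by the preference/defaults
      volumes:
        - name: rootdisk
          dataVolume:
            name: web01-rootdisk   # hypothetical clone of a golden DataVolume
```

Note that CPU and memory are deliberately absent from the template: the instance type supplies them, which is what makes the instancetype + preference + golden disk trio function like a vSphere template.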

What is better in KubeVirt:

  • Declarative, API-driven management: VMs are YAML, versionable in Git, and native to GitOps pipelines.
  • A unified platform: VMs and containers share the same scheduler, network, storage, RBAC, and monitoring.
  • RBAC granularity: per-namespace, per-resource-type permissions, finer-grained than vSphere's model.

What is worse in KubeVirt:

  • No DRS equivalent: the kube-scheduler places pods only at creation time, and the descheduler's rebalancing is less mature than DRS.
  • No built-in alarm system: alerting must be assembled from Kubernetes Events, Prometheus, and Alertmanager.
  • Live migration is less mature than vMotion: no dependency-aware parallel migration, and post-copy is experimental.


2. OCI / Container Runtimes (CRI-O, containerd)

CRI (Container Runtime Interface)

The Container Runtime Interface (CRI) is a plugin API that Kubernetes defines for container runtimes. It is the boundary between the kubelet (Kubernetes' node agent) and whatever software actually creates and manages containers (and, by extension, virt-launcher pods for VMs). CRI was introduced in Kubernetes 1.5 (2016) to decouple the kubelet from Docker.

Before CRI, the kubelet had Docker-specific code built in. CRI defines a gRPC API with two services:

  • RuntimeService: pod sandbox and container lifecycle (create, start, stop, remove), plus the exec, attach, and port-forward streams.
  • ImageService: image operations (pull, list, inspect, remove).

Any software that implements this gRPC API can serve as a Kubernetes container runtime. The two major implementations are CRI-O and containerd.

CRI-O

CRI-O is a lightweight, OCI-compliant container runtime built specifically for Kubernetes. It was created by Red Hat, Intel, SUSE, and others. "CRI-O" literally means "CRI + OCI" -- it implements the CRI interface and uses OCI-compliant tools for the actual container operations.

Key characteristics:

  • Deliberately minimal scope: CRI-O implements the CRI and nothing more -- no image building, no standalone container workflow.
  • Each container is supervised by a dedicated conmon process, which handles logging and exit-code collection.
  • The OCI runtime is pluggable; on RHEL/OpenShift the default is crun.
  • Releases are aligned with Kubernetes/OpenShift versions.

containerd

containerd is a container runtime originally extracted from Docker. Docker donated it to the CNCF, and it is now the default runtime for vanilla Kubernetes, AKS, EKS, GKE, and most cloud Kubernetes services. containerd has implemented CRI natively, as a built-in plugin, since version 1.1.

Key differences from CRI-O:

Aspect | CRI-O | containerd
Origin | Built for Kubernetes from scratch | Extracted from Docker
Scope | CRI only | CRI + general-purpose container management
Image building | Not supported (out of scope) | Not supported natively (but plugins exist)
OCI runtime default | crun (on RHEL/OpenShift) | runc
Container monitor | conmon (separate process per container) | Internal shim (containerd-shim-runc-v2)
Used by | OpenShift, OVE, SUSE | Vanilla Kubernetes, AKS, EKS, GKE
Release cycle | Aligned with OpenShift | Independent

How CRI-O Handles a KubeVirt Pod

When virt-controller creates a virt-launcher pod for a VM, the following chain of events occurs on the target worker node:

  CRI-O Execution Chain: kubelet to QEMU

  kubelet
    |
    +--> gRPC: RunPodSandbox()
    |      |
    |      v
    |    CRI-O
    |      |
    |      +--> Creates pod-level cgroup (/kubepods/pod<uid>/)
    |      |
    |      +--> Creates network namespace (via CNI)
    |      |      +--> Calls primary CNI plugin (OVN-Kubernetes)
    |      |      |      Creates veth pair, connects to OVS bridge
    |      |      +--> Calls Multus (if additional networks defined)
    |      |             Calls secondary CNI plugins
    |      |             Creates additional interfaces in namespace
    |      |
    |      +--> Creates IPC namespace, UTS namespace
    |      |
    |      +--> Returns PodSandboxId
    |
    +--> gRPC: CreateContainer(PodSandboxId, "compute")
    |      |
    |      v
    |    CRI-O
    |      |
    |      +--> Pulls virt-launcher image (if not in local cache)
    |      |      Image: registry.redhat.io/container-native-
    |      |      virtualization/virt-launcher-rhel9:v4.x
    |      |
    |      +--> Creates OCI runtime bundle:
    |      |      - config.json (OCI runtime spec)
    |      |      - rootfs/ (from container image layers)
    |      |
    |      +--> Spawns conmon process:
    |      |      conmon --cid <container-id> \
    |      |             --runtime /usr/bin/crun \
    |      |             --log-path /var/log/... \
    |      |             ...
    |      |
    |      +--> conmon calls crun (OCI runtime):
    |             crun create <container-id>
    |               |
    |               +--> Creates container cgroup
    |               |    (child of pod cgroup)
    |               +--> Sets up mount namespace
    |               |    (rootfs, volumes, devices)
    |               +--> Mounts /dev/kvm, /dev/vhost-net
    |               |    into container
    |               +--> Mounts PVC volumes at expected
    |               |    paths in container
    |               +--> Configures seccomp profile
    |               +--> Configures SELinux label
    |               +--> Joins pod's network namespace
    |               +--> Starts container init process:
    |                    --> virt-launcher binary (PID 1)
    |
    +--> gRPC: StartContainer(ContainerId)
           |
           v
         CRI-O --> conmon --> crun start <container-id>
           |
           v
         virt-launcher process begins:
           |
           +--> Starts libvirtd (as child process)
           +--> Defines VM domain in libvirt
           +--> Starts QEMU/KVM via libvirt
           +--> VM boots inside the container's cgroup

The critical insight: the QEMU process inherits the container's cgroup. This means Kubernetes' CPU and memory limits apply directly to the QEMU process and its vCPU threads. If a VM is configured with resources.requests.memory: 64Gi and resources.limits.memory: 66Gi (64 Gi for the guest + 2 Gi for QEMU overhead), the Linux OOM killer will kill the QEMU process if it exceeds 66 Gi -- exactly as it would kill any container exceeding its memory limit. This is a feature (prevents one VM from consuming unbounded resources) and a risk (an undersized memory limit kills the VM).
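As an illustration of that risk, a fragment of a VirtualMachine spec might pair guest memory with an explicit pod memory limit. KubeVirt normally computes the QEMU overhead automatically, so the explicit limit below is a sketch with illustrative values, not a recommended pattern:

```yaml
# Fragment of a VirtualMachine spec: guest memory vs. pod memory limit
spec:
  template:
    spec:
      domain:
        memory:
          guest: 64Gi          # what the guest OS sees as physical RAM
        resources:
          limits:
            memory: 66Gi       # guest RAM + headroom for QEMU/virt-launcher overhead
```

If QEMU's real overhead ever exceeds the 2 Gi of headroom, the OOM killer terminates the whole VM, which is why undersizing this delta is one of the classic KubeVirt operational failure modes.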

OCI Image Spec and Runtime Spec

The Open Container Initiative (OCI) defines two specifications that are foundational to container runtimes:

OCI Image Specification: Defines how container images are structured -- the manifest, config, and layers. KubeVirt uses this specification for containerDisk volumes: a VM disk image (qcow2 or raw) is packaged as an OCI image layer, pushed to a container registry, and pulled by the container runtime at pod start. This is how ephemeral VM boot disks (like ISOs or live CDs) are distributed.
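A fragment of a VMI spec using a containerDisk boot volume might look like this (the image shown is a community-published example, not part of this environment):

```yaml
# Fragment of a VirtualMachineInstance spec: ephemeral containerDisk boot volume
spec:
  domain:
    devices:
      disks:
        - name: bootdisk
          disk: {}                 # default bus
  volumes:
    - name: bootdisk
      containerDisk:
        image: quay.io/containerdisks/fedora:latest  # disk image packaged as an OCI layer
```

Because the disk lives in the container image, writes are discarded when the pod terminates -- containerDisks suit golden images and live CDs, not persistent workloads.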

OCI Runtime Specification: Defines how a container is configured and executed -- the config.json file that CRI-O/containerd passes to runc/crun. This specification defines namespaces, cgroups, mounts, devices, and security settings. For a KubeVirt virt-launcher container, the OCI runtime spec includes device access rules for /dev/kvm and /dev/vhost-net, volume mounts for PVCs, and the appropriate SELinux context.

Why CRI-O vs containerd Matters for OVE

For the OVE evaluation specifically, CRI-O is not a choice -- it is a requirement. OpenShift mandates CRI-O. The implications:

  1. Troubleshooting: When a virt-launcher pod fails to start, the logs are in CRI-O's log format, and the container is managed by conmon and crun. The operations team must be familiar with crictl (the CRI CLI) rather than docker or nerdctl (containerd's CLI) for debugging.

  2. Image pull behavior: CRI-O's image pull behavior differs from containerd in edge cases (authentication, registry mirrors, image signing). OVE's support matrix is tested exclusively with CRI-O.

  3. Security: CRI-O on OpenShift runs with SELinux enforcing and a locked-down seccomp profile by default. virt-launcher pods require specific SELinux labels (container_t with KVM device access) that are configured by the KubeVirt operator. Modifying these settings outside of the operator can break VMs.

  4. crun vs runc: OpenShift's CRI-O uses crun (a C-based OCI runtime) instead of runc (Go-based). crun has lower overhead and faster container start times, which slightly benefits VM startup latency (the pod sandbox creation phase). The QEMU process itself is unaffected -- crun only manages the container lifecycle, not the VM.

For Azure Local, which runs Hyper-V VMs directly (not inside Kubernetes pods), none of the CRI-O/containerd discussion is relevant. Azure Local's VMs are managed by the Hyper-V hypervisor and the Azure Arc management plane, not by a container runtime.


3. Kata Containers / MicroVMs

The Problem: Container Isolation is Weaker than VM Isolation

Standard Linux containers (run by runc/crun) share the host kernel. Process isolation is enforced by kernel namespaces, cgroups, seccomp, and SELinux/AppArmor -- but all containers on a host execute syscalls against the same kernel. A kernel vulnerability (a privilege escalation bug in a syscall handler, a namespace escape, a cgroup bypass) can allow a container to break out and access the host or other containers.

This is fundamentally different from VM isolation, where each VM has its own kernel running inside a hardware-enforced boundary (VT-x/AMD-V, EPT/NPT). A guest kernel vulnerability does not compromise the host. The attack surface is the hypervisor's VM exit handler -- a much smaller and more auditable surface than the Linux syscall table (400+ syscalls).

For a Tier-1 financial enterprise running regulated workloads, this distinction matters. If two different business units (or two different customers in a shared infrastructure) run containers on the same host, the shared-kernel risk may be unacceptable. This is the problem Kata Containers solves.

Kata Containers Architecture

Kata Containers is an open-source project (originally a merger of Intel Clear Containers and Hyper.sh's runV) that runs each container (or pod) inside a lightweight virtual machine. Instead of using runc to create a container with namespace isolation, Kata uses a VMM (Virtual Machine Monitor) to create a lightweight VM for each pod.

  Standard Container vs Kata Container

  Standard Container (runc/crun)          Kata Container
  ================================        ================================

  +---------------------------+           +---------------------------+
  | Container Process         |           | Container Process         |
  | (shares host kernel)      |           | (runs in guest kernel)    |
  +---------------------------+           +---------------------------+
  | namespaces + cgroups      |           | Guest Linux Kernel (5.x)  |
  | (kernel-level isolation)  |           | (minimal, stripped-down)  |
  +---------------------------+           +---------------------------+
  | Host Linux Kernel         |           | VMM (QEMU / Cloud HV /   |
  |                           |           |      Firecracker)         |
  |                           |           +---------------------------+
  |                           |           | Host Linux Kernel + KVM   |
  +---------------------------+           +---------------------------+
  | Hardware                  |           | Hardware (VT-x/EPT)       |
  +---------------------------+           +---------------------------+

  Isolation boundary:                     Isolation boundary:
  Kernel namespaces (software)            Hardware virtualization (VT-x)

Kata Containers architecture in detail:

  Kata Containers Architecture (per Pod)

  +================================================================+
  |  Kata VM (lightweight, boots in <1 second)                     |
  |                                                                |
  |  +----------------------------------------------------------+  |
  |  |  Container workload(s)                                   |  |
  |  |  - Application process(es)                               |  |
  |  |  - OCI bundle rootfs mounted from host (via virtio-fs    |  |
  |  |    or virtio-9p)                                         |  |
  |  +----------------------------------------------------------+  |
  |  |  kata-agent                                              |  |
  |  |  - gRPC server inside the VM                             |  |
  |  |  - Receives container lifecycle commands from kata-       |  |
  |  |    runtime on the host                                   |  |
  |  |  - Creates namespaces/cgroups inside the VM              |  |
  |  |  - Manages container processes                           |  |
  |  +----------------------------------------------------------+  |
  |  |  Guest Linux Kernel                                      |  |
  |  |  - Minimal kernel (~4-8 MB)                              |  |
  |  |  - Only drivers needed: virtio-blk, virtio-net,          |  |
  |  |    virtio-fs, virtio-vsock                               |  |
  |  |  - No unnecessary modules, no GUI, no sound              |  |
  |  +----------------------------------------------------------+  |
  +================================================================+
       | virtio-vsock (host <-> guest communication)
       | virtio-fs (filesystem sharing)
       | virtio-net (networking)
       v
  +================================================================+
  |  Host                                                          |
  |                                                                |
  |  +----------------------------------------------------------+  |
  |  |  kata-runtime (OCI runtime, replaces runc)               |  |
  |  |  - Called by CRI-O/containerd instead of runc            |  |
  |  |  - Starts VMM, connects to kata-agent                    |  |
  |  |  - Translates OCI lifecycle calls to gRPC calls          |  |
  |  +----------------------------------------------------------+  |
  |  |  VMM (Virtual Machine Monitor)                           |  |
  |  |  Options:                                                |  |
  |  |  - QEMU (full-featured, highest compatibility)           |  |
  |  |  - Cloud Hypervisor (Rust, modern, good perf)            |  |
  |  |  - Firecracker (AWS, minimal, fastest boot)              |  |
  |  |  - Dragonball (Alibaba, sandbox-optimized)               |  |
  |  +----------------------------------------------------------+  |
  |  |  Linux Kernel + KVM                                      |  |
  |  +----------------------------------------------------------+  |
  +================================================================+

The key idea: Kata Containers is a drop-in OCI runtime. From Kubernetes' perspective, a Kata pod looks exactly like a regular pod. The kubelet sends the same CRI calls to CRI-O/containerd. CRI-O/containerd calls kata-runtime instead of runc. The pod gets the same network namespace, the same cgroup accounting, the same volume mounts. The only difference is that the container process runs inside a VM rather than directly on the host kernel.
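A minimal sketch of that drop-in behavior, assuming the cluster's CRI-O/containerd is configured with a runtime handler named kata (the handler name, pod name, and image are illustrative):

```yaml
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: kata
handler: kata              # must match the handler configured in CRI-O/containerd
---
apiVersion: v1
kind: Pod
metadata:
  name: untrusted-build
spec:
  runtimeClassName: kata   # the only Kata-specific line; everything else is a normal pod
  containers:
    - name: build
      image: registry.example.internal/ci/builder:latest  # hypothetical build image
```

Removing the runtimeClassName line makes this an ordinary runc/crun pod, which is exactly the "drop-in" property: isolation level becomes a per-pod scheduling decision rather than a platform fork.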

Firecracker

Firecracker is Amazon's open-source Virtual Machine Monitor (VMM), built specifically for serverless and container workloads. It powers AWS Lambda and AWS Fargate. Key characteristics:

  • Written in Rust, with a deliberately tiny device model: virtio-net, virtio-block, vsock, and a serial console -- no PCI passthrough, no GPU, no BIOS compatibility layers.
  • MicroVMs boot in on the order of hundreds of milliseconds with only a few MiB of VMM overhead per VM.
  • A companion jailer process wraps each VMM in seccomp, cgroup, and chroot confinement for defense in depth.

Firecracker is a Kata Containers VMM option, meaning you can configure Kata to use Firecracker instead of QEMU. This gives the fastest possible boot times but sacrifices features.

Cloud Hypervisor

Cloud Hypervisor is a Rust-based VMM that sits between QEMU (full-featured) and Firecracker (minimal). It was started by Intel and is now a Linux Foundation project. Key characteristics:

  • Written in Rust and focused on modern cloud guests: virtio devices only, with no legacy device emulation.
  • Supports features Firecracker omits, including CPU/memory hotplug, PCI device passthrough (VFIO), and live migration.
  • Runs on x86_64 and aarch64 on top of KVM.

KubeVirt vs Kata: Different Use Cases

This is a common source of confusion. Both KubeVirt and Kata Containers involve running VMs on Kubernetes, but they solve completely different problems:

Aspect | KubeVirt | Kata Containers
Goal | Run traditional VMs (with their own OS, kernel, applications) on Kubernetes | Run containers with VM-level isolation
Workload | A full operating system (Windows Server, RHEL, Ubuntu) with applications installed inside | A container image (just the application + dependencies, no OS kernel)
Guest kernel | The VM's own kernel (whatever the guest OS uses) | A shared, minimal guest kernel provided by Kata
User experience | "I manage a VM" -- SSH in, install packages, configure services | "I manage a container" -- docker build, kubectl apply, same container workflow
Use case | Legacy apps that cannot be containerized, Windows workloads, stateful databases, appliances | Multi-tenant container platforms needing strong isolation, CI/CD build pods, untrusted code execution
Boot time | 10-60 seconds (full OS boot, BIOS/UEFI POST, kernel init, systemd) | <1-5 seconds (microVM boots minimal kernel, mounts container rootfs, starts application)
Overhead per workload | High (full OS: 512 MB - several GB for guest OS alone) | Low (minimal kernel: 32-128 MB for guest overhead)
Managed by | KubeVirt operator (virt-controller, virt-handler, virt-launcher) | CRI-O/containerd + Kata runtime (transparent to Kubernetes)

In the context of this evaluation:

  • KubeVirt is the migration target for the 5,000+ VMware VMs -- it answers how traditional VMs run on Kubernetes.
  • Kata Containers addresses a different question: how future container workloads on the same platform can get VM-grade isolation.

They are complementary, not competing, technologies. An OVE cluster could simultaneously run:

  • Standard containers (runc/crun, namespace isolation) for trusted in-house workloads
  • Kata containers (VM-isolated containers) for multi-tenant or untrusted workloads
  • KubeVirt VMs (full guest OS) for the migrated VMware estate

All three share the same Kubernetes control plane, networking, storage, and monitoring.

When Kata Matters for the Evaluation

Kata Containers should be considered in the evaluation in the following scenarios:

  1. Multi-tenant container workloads: If the platform will host containers from multiple business units or external parties with different trust levels, Kata provides hardware-enforced isolation between tenants. Without Kata, container-to-container isolation relies on kernel namespaces, which have a larger attack surface.

  2. Regulatory requirements for workload isolation: FINMA or internal security policies may require that certain workloads run with hardware-level isolation. Kata satisfies this requirement without requiring dedicated physical hosts.

  3. CI/CD build pipelines: Running untrusted build jobs (e.g., building third-party code) in standard containers is a security risk. Kata containers confine build jobs in VMs, preventing a malicious build from affecting the host or other pods.

  4. Comparing isolation models across candidates:

    • OVE: Standard containers (namespace isolation) + Kata Containers (VM isolation) + KubeVirt VMs (full VM isolation). Three tiers available.
    • Azure Local: Hyper-V VMs (full VM isolation) for VMs. Containers run in AKS-HCI (standard namespace isolation). No Kata equivalent.
    • Swisscom ESC: VMware VMs (full VM isolation). Container services depend on Swisscom's offering.

How the Candidates Handle This

Comparison Table

Aspect | VMware (Current) | OVE (KubeVirt) | Azure Local (Hyper-V) | Swisscom ESC
VM Management Model | vCenter (proprietary, centralized) | Kubernetes API + KubeVirt CRDs (declarative, open) | Azure Arc + Hyper-V (hybrid cloud management) | VMware vCloud Director (managed service)
VM Definition Format | .vmx files + vCenter DB | YAML manifests (VirtualMachine CR) | PowerShell / Azure CLI / ARM templates | VMware OVF + provider portal
Infrastructure as Code | Terraform vSphere provider; PowerCLI; govc | Native (kubectl, Helm, Kustomize, GitOps) | Terraform azurerm provider; Azure Bicep; PowerShell | Limited; API available but not IaC-native
VM-to-Host Mapping | VMX process on ESXi host | virt-launcher Pod on worker node | Child partition on Hyper-V host | VMX process on ESXi (provider-managed)
Hypervisor Engine | ESXi VMkernel | KVM + QEMU + libvirt (wrapped by KubeVirt) | Hyper-V (microkernel + root partition) | ESXi VMkernel (provider-managed)
Container Runtime | N/A (VMs only; vSphere with Tanzu adds containerd) | CRI-O (mandatory on OpenShift) | containerd (for AKS-HCI containers) | N/A for VM workloads
Unified VM+Container Platform | Partial (vSphere with Tanzu, but VMs and containers are separate) | Yes (VMs and containers are both pods, same scheduler/network/storage) | Separate (Hyper-V VMs + AKS-HCI for containers) | No (VM-only service)
VM Disk Format | VMDK on VMFS/vSAN/NFS | PVC (raw or qcow2 on Ceph RBD, ODF, NFS, etc.) | VHDX on Cluster Shared Volumes (ReFS/NTFS) | VMDK (provider-managed)
Disk Import/Migration Tool | N/A (source platform) | CDI (Containerized Data Importer) + MTV | Azure Migrate | Provider-managed migration
Live Migration | vMotion (mature, automated via DRS) | KubeVirt VMIM (QEMU pre-copy, pod-based) | Hyper-V Live Migration (mature) | vMotion (provider-managed)
Console Access | vSphere Client (VMRC, web console) | virtctl console/vnc, OpenShift web console | Hyper-V Manager, WAC, Azure portal | Provider portal (web console)
Multi-tenancy | Resource pools, folders, permissions | Kubernetes namespaces, RBAC, quotas | Azure RBAC, subscriptions | Provider-managed tenant isolation
Automated Placement/Balancing | DRS (automated, continuous) | kube-scheduler (initial only) + descheduler (optional, less mature) | SCVMM Dynamic Optimization (if using SCVMM) | Provider-managed (DRS under the hood)
VM Security Isolation | VMkernel hypervisor boundary | KVM hypervisor boundary (each VM in own pod cgroup) | Hyper-V hypervisor boundary | VMkernel hypervisor boundary
Container Isolation Enhancement | N/A | Kata Containers / OpenShift Sandboxed Containers | N/A for containers at VM level | N/A
Ecosystem Maturity | 20+ years, deep third-party integration | 5-7 years production use, growing rapidly | 10+ years Hyper-V, Azure Arc is newer | Depends on VMware maturity (provider-managed)

Detailed Candidate Analysis

OVE (KubeVirt/KVM)

OVE's greatest strength and greatest risk both stem from the same source: it is Kubernetes-native. The strength is that VM management becomes a natural extension of the Kubernetes platform the organization may already use (or plan to use) for containers. VMs are YAML, lifecycle is declarative, networking and storage are unified, RBAC is standard Kubernetes. For a team that has invested in Kubernetes skills and tooling, OVE is a natural fit.

The risk is that the team may not have invested in Kubernetes skills. Operating KubeVirt at 5,000+ VM scale requires deep Kubernetes expertise: understanding pod scheduling, CSI driver behavior, CNI networking, RBAC, resource quotas, node affinity, taints and tolerations, pod disruption budgets, and the interactions between all of these. A VMware admin who knows vCenter deeply but has never used kubectl faces a steep learning curve.

CDI and migration readiness: OVE's CDI is the primary tool for migrating VMware VMDKs. For the 5,000+ VM estate, the Migration Toolkit for Virtualization (MTV) automates the end-to-end flow: discover VMs in vCenter, map networks and storage, convert VMDKs, create VirtualMachine CRs, and validate boot. MTV uses CDI and VDDK under the hood. The evaluation should include a PoC migration of representative VM types (Windows Server with SQL, RHEL with Oracle, Ubuntu with custom apps) to validate conversion fidelity, boot success rate, and performance parity.

Live migration maturity: KubeVirt's live migration is functional but less mature than vMotion. Specifically:

  - It requires shared storage with ReadWriteMany (RWX) access mode on the VM's disks; VMs on RWO-only storage cannot live-migrate.
  - VMs with SR-IOV interfaces or other host-passthrough devices cannot live-migrate at all.
  - Parallelism is capped cluster-wide and per node (by default, 5 concurrent migrations per cluster and 2 outbound per node), so a full host evacuation is slower than a DRS-orchestrated vMotion wave.
  - There is no equivalent of cross-cluster or long-distance vMotion.

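One dimension of that maturity gap is migration throttling: KubeVirt serializes migrations behind cluster-wide and per-node caps, tunable in the upstream KubeVirt custom resource. A hedged sketch of the relevant configuration section (values are illustrative; on OVE the same settings are surfaced through the distribution's own CR):

```yaml
# Illustrative migration tuning in the upstream KubeVirt CR.
apiVersion: kubevirt.io/v1
kind: KubeVirt
metadata:
  name: kubevirt
  namespace: kubevirt                      # namespace varies by distribution
spec:
  configuration:
    migrations:
      parallelMigrationsPerCluster: 5      # cluster-wide cap on concurrent migrations
      parallelOutboundMigrationsPerNode: 2 # per-node cap during drain/evacuation
      bandwidthPerMigration: 600Mi         # throttle per migration stream
      completionTimeoutPerGiB: 800         # seconds per GiB before a migration is aborted
```

For a mass-maintenance window across 100+ nodes, these defaults are the first thing to revisit: left untouched, they dictate how long a rolling host evacuation takes regardless of available network bandwidth.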
Kata Containers as a differentiator: OVE can offer three tiers of isolation (container, Kata container, KubeVirt VM) on a single platform. No other candidate offers this. For organizations that need both legacy VMs and secure multi-tenant containers, this is a material advantage.
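The middle tier is selected per workload through a standard Kubernetes RuntimeClass. A hedged sketch, assuming OpenShift Sandboxed Containers has installed its default RuntimeClass named kata (the pod name and image are hypothetical):

```yaml
# Illustrative pod requesting Kata (VM-backed) isolation via RuntimeClass.
apiVersion: v1
kind: Pod
metadata:
  name: sandboxed-app                       # hypothetical workload
spec:
  runtimeClassName: kata                    # run this pod's containers inside a lightweight VM
  containers:
    - name: app
      image: registry.example.com/app:1.0   # hypothetical image
```

Omitting runtimeClassName yields a standard container; a KubeVirt VirtualMachine CR yields the third tier. The isolation choice is per-workload, not per-cluster, which is what makes the three-tier model operationally practical.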

Azure Local (Hyper-V)

Azure Local does not use KubeVirt or any Kubernetes-native VM management. VMs are managed through the traditional Hyper-V stack with Azure Arc as the management plane. This means there are no per-VM pods, no custom resources, and no kubectl anywhere in the VM lifecycle: VMs are ordinary Hyper-V VMs, provisioned through the Azure portal, Azure CLI, or ARM templates via the Arc resource bridge, with familiar Windows tooling (Failover Clustering, Windows Admin Center) underneath.

For a team that is Windows-oriented and already invested in Azure, this is arguably simpler: there is no Kubernetes learning curve for VM management. The trade-off is that containers (via AKS on Azure Local) and VMs are managed through different tools and APIs -- they are not unified on a single platform in the same way KubeVirt unifies them.

Azure Local also does not have an equivalent of Kata Containers. Container isolation in AKS on Azure Local relies on standard Linux namespace isolation (within the AKS Linux nodes) or Hyper-V isolation (for Windows containers). There is no drop-in "VM-level isolation for Linux containers" story comparable to OpenShift Sandboxed Containers.

Swisscom ESC

ESC abstracts all of this behind a managed service. The customer does not interact with KubeVirt, CRI-O, Kata, or any of the technologies in this chapter. VMs are provisioned through Swisscom's portal or API, and the underlying technology is VMware vSphere managed by Swisscom.

The relevance of this chapter for the ESC evaluation is primarily about future risk: if Swisscom transitions away from VMware (due to Broadcom licensing), would the replacement involve KubeVirt? If so, the customer's VMs would be running on the same technology described here, but without customer visibility or control. The evaluation should probe Swisscom's technology roadmap and transition commitments.


Key Takeaways

  - OVE's Kubernetes-native design is its greatest strength and its greatest risk: VMs become declarative YAML with unified networking, storage, and RBAC, but operating at 5,000+ VM scale demands deep Kubernetes expertise a VMware-trained team may not yet have.
  - The Migration Toolkit for Virtualization (MTV), built on CDI and VDDK, is the path off VMware; validate it with a PoC covering representative Windows and Linux workloads before committing.
  - KubeVirt live migration works but trails vMotion in maturity; plan explicitly for the VMs (such as SR-IOV users) that cannot migrate at all.
  - Kata Containers give OVE three isolation tiers (container, Kata container, KubeVirt VM) on a single platform; no other candidate offers this.
  - Azure Local manages VMs through Hyper-V and Azure Arc rather than Kubernetes: a shallower learning curve for Windows-oriented teams, but containers and VMs stay on separate toolchains.
  - Swisscom ESC hides the entire stack behind a managed service; the key evaluation question is what happens if Swisscom ever moves off VMware.

Discussion Guide

Use these questions when engaging with Red Hat solution engineers, Kubernetes platform architects, or the organization's internal infrastructure team. The questions probe real-world operational readiness at enterprise scale.

Questions for OVE (Red Hat / KubeVirt)

  1. virt-launcher overhead at scale: "At 50 VMs per worker node (our expected density), the per-VM libvirtd instances add up to roughly 1.5-2.5 GB of overhead per node. With 100+ worker nodes, that is 150-250 GB of cluster memory consumed by libvirtd instances. How do you account for this in capacity planning templates? Is there a roadmap to reduce per-VM overhead (e.g., shared libvirtd, libvirt-less architecture)?"

  2. Pod startup latency during disaster recovery: "If an entire rack of 10 worker nodes fails simultaneously (500 VMs), how long does it take for all 500 virt-launcher pods to be scheduled and started on surviving nodes? What are the bottleneck factors -- scheduler throughput, CRI-O image pull, CNI setup, or storage attach? Have you tested this scenario at our scale?"

  3. CDI import throughput benchmarks: "For our migration of 5,000 VMs (~1 PB total disk), what is the maximum sustained import rate CDI can achieve with parallel DataVolumes? Specifically: how many concurrent importer pods can run without saturating the storage backend or the API server? What is the recommended migration architecture -- dedicated import cluster, direct VDDK from vCenter, or HTTP staging server?"

  4. Live migration and SR-IOV mutual exclusion: "Approximately 20% of our VMs require SR-IOV for high-throughput network workloads. These VMs cannot live-migrate. What is the recommended maintenance strategy for nodes hosting SR-IOV VMs? Tolerate brief downtime during host patching? Dual-NIC with virtio for migration and SR-IOV for production? Is there a roadmap for migratable SR-IOV (e.g., switchdev mode + OVN hardware offload)?"

  5. Descheduler as DRS replacement: "Our VMware environment uses DRS in fully automated mode. The KubeVirt descheduler is the closest equivalent. How mature is the descheduler for VM workloads specifically? Does it understand VM-specific constraints (NUMA alignment, dedicated CPUs, hugepages) when making rebalancing decisions? Can it be configured to avoid migrating tier-1 VMs unless resource imbalance exceeds a threshold?"

  6. Networking mode recommendation for our workload mix: "We have three categories of VMs: (a) general-purpose Linux with <1 Gbps traffic, (b) database servers with 10-25 Gbps storage replication traffic, (c) market data receivers with ultra-low-latency requirements. What networking mode do you recommend for each category? Can a single VM have both a masquerade interface (for management) and an SR-IOV interface (for data) simultaneously?"

  7. Debugging a VM boot failure: "Walk us through the debugging process when a VM fails to start. Which logs do you check first? How do you distinguish between a Kubernetes scheduling failure (pod Pending), a CRI-O container creation failure (pod ContainerCreating), a virt-launcher failure (libvirt domain definition error), and a QEMU failure (hardware emulation error)? What tooling exists to correlate these layers?"
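On the final point of question 6: KubeVirt does support multiple interfaces with different bindings on one VM, so the question is really about the vendor's recommended pattern. A hedged sketch of the relevant spec fragment (interface and network names are hypothetical; the SR-IOV network assumes a Multus NetworkAttachmentDefinition exists):

```yaml
# Fragment of a VirtualMachineInstance template spec (illustrative only).
domain:
  devices:
    interfaces:
      - name: mgmt
        masquerade: {}                # management traffic over the pod network
      - name: data
        sriov: {}                     # data plane on a passed-through virtual function
networks:
  - name: mgmt
    pod: {}
  - name: data
    multus:
      networkName: sriov-data-25g    # hypothetical NetworkAttachmentDefinition
```

This dual-interface pattern is also one candidate answer to question 4: the virtio management path stays migratable even though the SR-IOV data path does not.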

Questions for Azure Local

  1. Architectural comparison to KubeVirt: "Azure Local runs VMs directly on Hyper-V without Kubernetes pod wrapping. What are the performance advantages of this simpler architecture (no CRI-O, no pod overhead, no per-VM libvirtd)? Conversely, what does Azure Local lose by not having a Kubernetes-native VM model (no GitOps for VMs, no Kubernetes RBAC for VMs, no unified container+VM platform)?"

Questions for Swisscom ESC

  1. Technology transition transparency: "If Swisscom migrates the ESC platform from VMware to a Kubernetes-native stack (such as KubeVirt or a similar technology), what is the customer impact? Will existing VMs be transparently migrated, or will customers need to re-export and re-import VMs? What is the contractual notification period for such a technology change?"

Questions for All Candidates

  1. Isolation model comparison: "Compare the VM isolation boundary across your platform: What is the hypervisor attack surface (number of VM exit handlers, device emulation code size)? Has the hypervisor undergone independent security audit or formal verification? What is the historical CVE rate for hypervisor escape vulnerabilities? For OVE specifically: does the additional pod/container layer (CRI-O, cgroups, namespaces) add defense-in-depth, or does it add attack surface?"