Core Hypervisor Technologies
Why This Matters
The previous two chapters established the foundational virtualization concepts and the VMware baseline you are migrating away from. This chapter goes one level deeper into the actual hypervisor engines that power two of the three candidate platforms: KVM and QEMU (the foundation of OpenShift Virtualization Engine) and Hyper-V (the engine behind Azure Local). Libvirt, the management abstraction between human operators and KVM/QEMU, is covered in equal depth because it is the layer KubeVirt wraps and controls inside every virt-launcher pod.
Swisscom ESC currently runs on VMware ESXi, which was covered in Chapter 2. This chapter does not re-examine ESXi internals, but the comparison table at the end maps ESXi concepts to KVM and Hyper-V equivalents so the evaluation team can reason across all candidates.
Understanding these technologies at the level presented here is not academic. When a live migration fails at 2 AM, the error message will reference a QEMU block job or a libvirt domain XML attribute. When a vendor proposes a "virtio-net with multiqueue" configuration for a high-throughput trading feed VM, the team must know what that means at the vring level to assess whether it will actually deliver the claimed bandwidth. When Azure Local's Hyper-V presents a synthetic NIC vs. an emulated NIC, the performance difference is 10x, and understanding why requires knowing how VMBus works vs. how emulated device traps work.
This chapter equips the evaluation team to operate at that level.
Concepts
1. KVM (Kernel-based Virtual Machine)
What It Is and Why It Exists
KVM is a Linux kernel module that turns the Linux kernel into a Type-1 hypervisor. It was merged into the mainline Linux kernel in version 2.6.20 (February 2007) and has been the default virtualization technology in RHEL, Ubuntu, SUSE, and Debian ever since. KVM is the hypervisor engine behind Red Hat OpenShift Virtualization Engine (OVE), Google Cloud Compute Engine, Amazon EC2 (whose Nitro hypervisor is based on core KVM technology), Oracle Cloud Infrastructure, and IBM Cloud.
KVM itself is deliberately minimal. It handles CPU virtualization and memory virtualization -- the two things that require kernel privilege and hardware VT-x/AMD-V support. Everything else -- device emulation, disk I/O, network I/O, the user-facing management interface -- is delegated to user-space components (QEMU, libvirt, or custom hypervisor processes). This separation of concerns is a fundamental architectural difference from ESXi, where the VMkernel handles everything.
Architecture
KVM consists of three kernel modules:
| Module | Purpose |
|---|---|
| `kvm.ko` | Core KVM module. Provides the `/dev/kvm` character device, the ioctl API, and architecture-independent VM management logic. |
| `kvm-intel.ko` | Intel VT-x specific implementation. Manages VMCS structures, VMX transitions, EPT configuration. |
| `kvm-amd.ko` | AMD-V (SVM) specific implementation. Manages VMCB structures, nested page tables. |
Only one of kvm-intel.ko or kvm-amd.ko is loaded, depending on the CPU vendor. The architecture:
KVM Architecture
User Space (Ring 3)
+------------------------------------------------------------------+
| |
| +------------------+ +------------------+ +----------------+ |
| | QEMU Process | | QEMU Process | | QEMU Process | |
| | (VM 1) | | (VM 2) | | (VM 3) | |
| | | | | | | |
| | vCPU Thread 0 --+--+--> ioctl(KVM_RUN)| | | |
| | vCPU Thread 1 --+--+--> ioctl(KVM_RUN)| | | |
| | I/O Thread -----+--+--> epoll loop | | | |
| | Main Thread ----+--+--> event loop | | | |
| +------------------+ +------------------+ +----------------+ |
| |
| /dev/kvm (character device, file descriptor per VM) |
+------------------------------------------------------------------+
Kernel Space (Ring 0, VMX Root Mode)
+------------------------------------------------------------------+
| Linux Kernel |
| +------------------------------------------------------------+ |
| | kvm.ko | |
| | - VM creation, vCPU creation | |
| | - Memory slot management (GPA -> HVA mapping) | |
| | - ioctl dispatch | |
| +------------------------------------------------------------+ |
| +------------------------------------------------------------+ |
| | kvm-intel.ko / kvm-amd.ko | |
| | - VMCS / VMCB management | |
| | - VMX entry (VMLAUNCH/VMRESUME) / SVM VMRUN | |
| | - VM exit handling | |
| | - EPT / NPT page table management | |
| | - In-kernel APIC emulation | |
| | - Posted interrupt processing | |
| +------------------------------------------------------------+ |
| +------------------------------------------------------------+ |
| | Standard Linux kernel subsystems | |
| | - Scheduler (CFS) -- schedules vCPU threads like any task | |
| | - Memory manager -- provides backing pages via mmap/madvise|
| | - Cgroups v2 -- CPU/mem/IO limits for QEMU processes | |
| | - VFIO -- device passthrough framework | |
| +------------------------------------------------------------+ |
+------------------------------------------------------------------+
Hardware
+------------------------------------------------------------------+
| CPU: VT-x/AMD-V, EPT/NPT, VPID/ASID, Posted Interrupts |
| RAM: Physical DRAM, NUMA topology |
| I/O: PCIe devices, NICs, HBAs, GPUs |
+------------------------------------------------------------------+
KVM ioctl API
All interaction with KVM happens through ioctl() system calls on file descriptors. There are three levels of file descriptor, each scoping a different set of operations:
System-level (/dev/kvm):
| ioctl | Purpose |
|---|---|
| `KVM_GET_API_VERSION` | Returns the KVM API version (currently 12). Used by QEMU to verify compatibility. |
| `KVM_CREATE_VM` | Creates a new VM. Returns a VM file descriptor (vmfd). |
| `KVM_CHECK_EXTENSION` | Queries whether a specific KVM extension (capability) is available. Extensions include KVM_CAP_IRQCHIP, KVM_CAP_HLT, KVM_CAP_NR_VCPUS, etc. |
| `KVM_GET_VCPU_MMAP_SIZE` | Returns the size of the shared memory region used for QEMU-KVM communication per vCPU. |
VM-level (vmfd):
| ioctl | Purpose |
|---|---|
| `KVM_CREATE_VCPU` | Creates a virtual CPU for this VM. Returns a vCPU file descriptor (vcpufd). |
| `KVM_SET_USER_MEMORY_REGION` | Maps a region of QEMU's user-space address space to guest physical addresses. This is how guest RAM is backed by host memory. |
| `KVM_CREATE_IRQCHIP` | Creates an in-kernel interrupt controller (APIC, IOAPIC, PIC). |
| `KVM_SET_TSS_ADDR` | Sets the address of the Task State Segment (required for Intel VT-x). |
| `KVM_CREATE_DEVICE` | Creates an in-kernel device (e.g., VFIO, ARM GIC). |
vCPU-level (vcpufd):
| ioctl | Purpose |
|---|---|
| `KVM_RUN` | The most important call. Enters the guest (VMX non-root mode) and blocks until a VM exit occurs that requires user-space handling. |
| `KVM_GET_REGS` / `KVM_SET_REGS` | Read/write general-purpose registers (RAX, RBX, RCX, RDX, RSI, RDI, RSP, RBP, R8-R15, RIP, RFLAGS). |
| `KVM_GET_SREGS` / `KVM_SET_SREGS` | Read/write special registers (CR0, CR2, CR3, CR4, EFER, segment registers, IDT, GDT). |
| `KVM_GET_MSRS` / `KVM_SET_MSRS` | Read/write Model-Specific Registers. |
| `KVM_INTERRUPT` | Injects an interrupt into the guest vCPU. |
vCPU Execution Loop
The core of KVM's operation is the vCPU execution loop. Each vCPU runs as a Linux thread inside the QEMU process. The loop looks like this in simplified form:
KVM vCPU Execution Loop
QEMU vCPU Thread KVM Kernel Module Guest VM
================ ================= ========
| | |
|-- ioctl(vcpufd, KVM_RUN) ---------> | |
| | |
| [thread blocks in kernel space] | |
| |-- VMRESUME ------------> |
| | |
| | Guest executes |
| | at near-native |
| | speed on real CPU |
| | |
| | <--- VM EXIT ----------- |
| | |
| | KVM inspects |
| | exit_reason: |
| | |
| | Case: I/O instruction |
| | -> Can KVM handle |
| | in-kernel? |
| | |
| | YES (e.g., APIC access,|
| | MSR read, clock read) |
| | -> Handle in kernel |
| | -> VMRESUME -------> |
| | (never returns |
| | to user space) |
| | |
| | NO (e.g., I/O port |
| | access, MMIO write, |
| | halt instruction) |
| <-- ioctl returns ------------------| |
| | |
| QEMU reads kvm_run->exit_reason | |
| from shared mmap region: | |
| | |
| KVM_EXIT_IO: | |
| emulate I/O port access | |
| (e.g., serial port, IDE ctrl) | |
| | |
| KVM_EXIT_MMIO: | |
| emulate memory-mapped device | |
| (e.g., virtio, GPU framebuffer) | |
| | |
| KVM_EXIT_HLT: | |
| guest executed HLT instruction | |
| vCPU is idle, wait for IRQ | |
| | |
| KVM_EXIT_SHUTDOWN: | |
| guest triple-faulted or | |
| requested shutdown | |
| | |
|-- ioctl(vcpufd, KVM_RUN) ---------> | (loop repeats) |
| | |
The critical performance insight: most VM exits are handled entirely in kernel space and never return to QEMU. The in-kernel APIC, clock source (kvmclock), and MSR handling mean that only I/O operations and uncommon events cause the expensive kernel-to-user-space context switch. A well-tuned VM under compute load may execute billions of instructions between user-space exits.
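The simplified loop above maps directly onto a small amount of C. The sketch below is illustrative only, not production code: memory-slot registration, register setup, and all error handling are omitted. It shows the ioctl sequence a user-space VMM such as QEMU drives -- open /dev/kvm, create a VM and a vCPU, mmap the shared kvm_run structure, and dispatch on exit_reason after each KVM_RUN.

```c
/* Minimal sketch of the KVM ioctl flow described above (error handling and
 * guest memory/register setup omitted). Compiles against <linux/kvm.h>. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <linux/kvm.h>

int main(void)
{
    int kvm    = open("/dev/kvm", O_RDWR);            /* system-level fd */
    int vmfd   = ioctl(kvm, KVM_CREATE_VM, 0UL);      /* VM-level fd */
    int vcpufd = ioctl(vmfd, KVM_CREATE_VCPU, 0UL);   /* vCPU-level fd */

    /* Shared kvm_run structure: KVM writes exit information here. */
    size_t run_size = ioctl(kvm, KVM_GET_VCPU_MMAP_SIZE, NULL);
    struct kvm_run *run = mmap(NULL, run_size, PROT_READ | PROT_WRITE,
                               MAP_SHARED, vcpufd, 0);

    for (;;) {
        ioctl(vcpufd, KVM_RUN, NULL);   /* blocks until a user-space exit */

        switch (run->exit_reason) {
        case KVM_EXIT_IO:       /* emulate a port I/O access */
            break;
        case KVM_EXIT_MMIO:     /* emulate a memory-mapped device access */
            break;
        case KVM_EXIT_HLT:      /* vCPU is idle; wait for an interrupt */
            break;
        case KVM_EXIT_SHUTDOWN: /* triple fault or guest-requested shutdown */
            return 0;
        default:
            fprintf(stderr, "unhandled exit_reason %u\n", run->exit_reason);
            return 1;
        }
    }
}
```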
Memory Management
KVM uses a concept called memslots to map guest physical memory to host virtual memory. Each memslot defines a contiguous range of guest physical addresses (GPAs) and maps it to a contiguous range of host virtual addresses (HVAs) in the QEMU process.
KVM Memory Architecture
Guest Physical Address Space QEMU Virtual Address Space Host Physical RAM
(what the guest OS sees) (QEMU's mmap'd regions) (actual DRAM)
+-----------------------------+ +-----------------------------+
| 0x0000_0000 - 0x0009_FFFF |----->| 0x7f00_0000_0000 + offset |---> [DRAM pages]
| (640 KB conventional) | | (memslot 0) |
+-----------------------------+ +-----------------------------+
| 0x000A_0000 - 0x000F_FFFF | | (not mapped -- VGA hole, |
| (VGA, ROM, hole) | | device MMIO) |
+-----------------------------+ +-----------------------------+
| 0x0010_0000 - 0x7FFF_FFFF |----->| 0x7f00_0010_0000 + offset |---> [DRAM pages]
| (below 4G RAM) | | (memslot 1) |
+-----------------------------+ +-----------------------------+
| 0x1_0000_0000 - ... |----->| 0x7f01_0000_0000 + offset |---> [DRAM pages]
| (above 4G RAM) | | (memslot 2) |
+-----------------------------+ +-----------------------------+
EPT (Extended Page Tables), managed by KVM:
Maps GPA -> HPA directly in hardware.
The CPU walks EPT on every guest memory access (cached in TLB).
QEMU allocates memslots via:
struct kvm_userspace_memory_region region = {
    .slot            = 1,
    .guest_phys_addr = 0x100000,
    .memory_size     = 0x7FF00000,
    .userspace_addr  = (uint64_t)guest_ram   /* mmap'd pointer (placeholder name) */
};
ioctl(vmfd, KVM_SET_USER_MEMORY_REGION, &region);
EPT/NPT integration: When QEMU registers a memslot, KVM builds EPT (Intel) or NPT (AMD) page tables that map GPAs to the host physical addresses (HPAs) backing QEMU's mmap region. The CPU hardware walks these EPT tables on every guest memory access, translating GPA to HPA without any hypervisor intervention. EPT violations (accessing an unmapped or protected GPA) cause VM exits.
Dirty page tracking: For live migration, KVM tracks which guest pages have been modified. The mechanism:
- QEMU calls `ioctl(vmfd, KVM_GET_DIRTY_LOG, {slot, bitmap})`.
- KVM returns a bitmap where each bit represents a page in the memslot. Bit set = page was written since last query.
- KVM implements dirty tracking by write-protecting EPT entries. When the guest writes to a tracked page, the CPU triggers an EPT violation (VM exit). KVM marks the page as dirty in its bitmap, removes the write protection, and resumes the guest. This is a one-time cost per page per tracking cycle.
- QEMU iterates the dirty bitmap and re-sends modified pages to the migration destination. The cycle repeats until the dirty rate is low enough for a final cutover.
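As a concrete illustration of the dirty-log query, here is a minimal sketch. It assumes the memslot was registered with the KVM_MEM_LOG_DIRTY_PAGES flag; vmfd, slot, and slot_npages are placeholders for state the VMM already tracks, and error handling is omitted.

```c
/* Sketch: query the dirty bitmap for one memslot (one bit per guest page). */
#include <stdlib.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

void send_dirty_pages(int vmfd, unsigned int slot, size_t slot_npages)
{
    size_t bitmap_bytes = (slot_npages + 7) / 8;
    unsigned char *bitmap = calloc(1, bitmap_bytes);

    struct kvm_dirty_log log = {
        .slot = slot,
        .dirty_bitmap = bitmap,
    };
    ioctl(vmfd, KVM_GET_DIRTY_LOG, &log);   /* fetches and clears KVM's bitmap */

    for (size_t page = 0; page < slot_npages; page++) {
        if (bitmap[page / 8] & (1u << (page % 8))) {
            /* Page was written since the last query: re-send it to the
             * migration destination (transfer code not shown). */
        }
    }
    free(bitmap);
}
```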
Interrupt Handling
Interrupt delivery is one of the most performance-critical paths in a hypervisor. KVM provides three models:
In-kernel APIC (default, recommended): KVM emulates the Local APIC, I/O APIC, and PIC entirely in kernel space. When a device (emulated by QEMU or passed through via VFIO) signals an interrupt, the notification reaches KVM through an irqfd -- a file descriptor that, when written to, injects an interrupt into the guest without any context switch to QEMU. The in-kernel APIC processes the interrupt, selects the target vCPU based on APIC destination and priority, and delivers it on the next VM entry. This path avoids any user-space involvement.
Split IRQchip: The APIC runs in kernel space, but some interrupt routing decisions are handled by QEMU. Used for complex interrupt topologies or when QEMU needs to intercept specific interrupts.
Posted Interrupts (hardware-assisted): On modern Intel CPUs, posted interrupts allow the CPU to deliver an interrupt directly to a guest vCPU without causing a VM exit. The hypervisor writes the interrupt vector to a Posted Interrupt Descriptor (PID) in memory and sends a notification IPI. If the target vCPU is currently running in VMX non-root mode, the CPU reads the PID and delivers the interrupt inline, never exiting to the hypervisor. This is critical for high-throughput I/O with device passthrough (VFIO):
Interrupt Delivery Paths
Path 1: Emulated device (QEMU) -> irqfd -> KVM in-kernel APIC -> VM entry
Latency: ~2-5 us
Path 2: Passthrough device (VFIO) -> Posted Interrupt -> guest (no VM exit)
Latency: ~0.5-1 us
Path 3: Emulated device -> QEMU user-space -> ioctl(KVM_INTERRUPT) -> VM entry
Latency: ~10-20 us (avoid this path for performance)
For comparison:
Bare-metal interrupt latency: ~0.3-0.5 us
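To make Path 1 concrete, the sketch below shows how a user-space device backend wires an eventfd to a guest interrupt line with the KVM_IRQFD ioctl. The vmfd and GSI number are placeholders and error handling is omitted; once registered, writing to the eventfd injects the interrupt through the in-kernel APIC with no round trip through QEMU's main loop.

```c
/* Sketch: wire an eventfd to a guest interrupt line (irqfd). */
#include <stdint.h>
#include <unistd.h>
#include <sys/eventfd.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

int setup_irqfd(int vmfd, uint32_t gsi)
{
    int efd = eventfd(0, 0);

    struct kvm_irqfd irqfd = {
        .fd  = efd,   /* eventfd KVM will listen on */
        .gsi = gsi,   /* guest interrupt line to inject (placeholder value) */
    };
    ioctl(vmfd, KVM_IRQFD, &irqfd);

    /* Later, from the device backend: signal the interrupt. */
    uint64_t one = 1;
    write(efd, &one, sizeof(one));

    return efd;
}
```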
Device Assignment (VFIO)
VFIO (Virtual Function I/O) is the Linux kernel framework for safely assigning physical PCIe devices directly to VMs. The device is removed from the host's driver and given exclusively to the guest, which accesses it at near-native speed without any hypervisor emulation in the data path.
VFIO requires IOMMU (VT-d on Intel, AMD-Vi on AMD) to provide DMA remapping -- ensuring the device can only DMA to memory regions belonging to its assigned VM, not to other VMs or the host kernel.
Common VFIO use cases in the evaluation:
- SR-IOV NIC passthrough: A 100GbE NIC is split into Virtual Functions (VFs), each assigned to a different VM via VFIO. Each VM gets near-native network throughput without virtio overhead.
- GPU passthrough: A physical GPU (NVIDIA A100, etc.) is assigned to a VM for AI/ML workloads.
- NVMe passthrough: A physical NVMe drive is assigned to a VM for direct storage access.
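The user-space side of VFIO follows the open sequence documented in the kernel's VFIO guide: container, group, then device. A minimal sketch, with the IOMMU group number and PCI address as placeholders and error handling omitted:

```c
/* Sketch of the documented VFIO userspace flow. The group number (26) and
 * the PCI address are placeholders for the device being assigned. */
#include <fcntl.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

int open_vfio_device(void)
{
    /* Container: holds the IOMMU context (DMA mappings). */
    int container = open("/dev/vfio/vfio", O_RDWR);

    /* Group: the smallest unit of IOMMU isolation the device belongs to. */
    int group = open("/dev/vfio/26", O_RDWR);
    ioctl(group, VFIO_GROUP_SET_CONTAINER, &container);

    /* Enable the Type 1 IOMMU model (x86 VT-d / AMD-Vi). */
    ioctl(container, VFIO_SET_IOMMU, VFIO_TYPE1_IOMMU);

    /* Get a file descriptor for the device itself. */
    int device = ioctl(group, VFIO_GROUP_GET_DEVICE_FD, "0000:06:0d.0");

    /* From here, the VMM maps the device's BARs and programs DMA mappings
     * (VFIO_IOMMU_MAP_DMA) so the guest can drive the device directly. */
    return device;
}
```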
Performance Characteristics
| Workload Type | KVM Overhead vs Bare Metal | Primary Overhead Source |
|---|---|---|
| CPU-bound (integer, FP) | 1-3% | VM entry/exit for timer ticks, context switches |
| Memory-bound (large working set) | 3-8% | EPT page walks on TLB misses (mitigated by huge pages) |
| Network I/O (virtio-net) | 5-15% | Vring processing, ioeventfd/irqfd transitions |
| Network I/O (SR-IOV passthrough) | 1-3% | IOMMU DMA remapping |
| Block I/O (virtio-blk) | 5-10% | Vring processing, host filesystem/block layer |
| Block I/O (NVMe passthrough) | 1-3% | IOMMU DMA remapping |
How OVE Uses KVM
In OpenShift Virtualization Engine, KVM is not directly managed by administrators. The stack is:
OVE Hypervisor Stack (per worker node)
+------------------------------------------------------------------+
| Kubernetes Control Plane |
| (API Server, etcd, controllers) |
+------------------------------------------------------------------+
| VirtualMachineInstance (VMI) CR
v
+------------------------------------------------------------------+
| virt-controller |
| Watches VMI CRs, creates pods for each VM |
+------------------------------------------------------------------+
| Creates Pod with virt-launcher container
v
+------------------------------------------------------------------+
| Worker Node (RHCOS) |
| +------------------------------------------------------------+ |
| | virt-launcher Pod | |
| | +------------------------------------------------------+ | |
| | | virt-launcher process | | |
| | | Manages VM lifecycle within the pod | | |
| | +------------------------------------------------------+ | |
| | | libvirtd (per-pod instance) | | |
| | | Translates VMI spec -> domain XML -> QEMU cmdline | | |
| | +------------------------------------------------------+ | |
| | | QEMU process | | |
| | | vCPU threads, I/O threads, device emulation | | |
| | +------------------------------------------------------+ | |
| +------------------------------------------------------------+ |
| |
| +------------------------------------------------------------+ |
| | Linux Kernel (RHCOS) + kvm.ko + kvm-intel.ko/kvm-amd.ko | |
| +------------------------------------------------------------+ |
| +------------------------------------------------------------+ |
| | virt-handler (DaemonSet) | |
| | Node-level agent: manages node capabilities, device | |
| | plugins, network setup, migration coordination | |
| +------------------------------------------------------------+ |
+------------------------------------------------------------------+
The key insight: each VM runs inside its own Kubernetes pod. The pod contains a libvirtd instance and a QEMU process. Kubernetes' scheduler, cgroup enforcement, and resource accounting apply to the QEMU process just as they would to any container. This means CPU limits, memory limits, huge page requests, NUMA topology alignment, and network policies all use standard Kubernetes mechanisms -- the KVM/QEMU layer is transparent.
2. QEMU (Quick Emulator)
What It Is and Why It Exists
QEMU is the user-space component that complements KVM. While KVM handles CPU execution and memory management in kernel space, QEMU provides everything else: device emulation, disk image management, network backend, migration, and the machine model that defines what hardware the guest "sees." QEMU was originally written by Fabrice Bellard (2003) as a full-system emulator -- it could emulate an entire x86 PC in software, including the CPU. When paired with KVM, QEMU delegates CPU execution to the hardware and focuses on its other roles.
Without QEMU, a KVM VM would have a CPU and RAM but no disks, no network cards, no display, no USB controllers -- no way to boot an operating system. QEMU is to KVM what the VMX process (with its device emulation layer) is to ESXi's VMkernel.
Architecture
A QEMU process for a single VM consists of multiple threads with distinct roles:
QEMU Process Architecture (single VM)
+------------------------------------------------------------------+
| QEMU Process (PID 54321) |
| |
| Main Thread (event loop / "main loop") |
| +------------------------------------------------------------+ |
| | - GLib main loop (g_main_loop) | |
| | - Processes QMP commands (JSON management protocol) | |
| | - Handles timer callbacks | |
| | - Drives block layer background jobs (mirror, commit) | |
| | - Manages device hotplug/unplug | |
| +------------------------------------------------------------+ |
| |
| vCPU Threads (one per virtual CPU) |
| +---------------------+ +---------------------+ |
| | vCPU 0 Thread | | vCPU 1 Thread | ... |
| | - Calls KVM_RUN | | - Calls KVM_RUN | |
| | - Handles exits | | - Handles exits | |
| | - May emulate | | - May emulate | |
| | device MMIO | | device MMIO | |
| +---------------------+ +---------------------+ |
| |
| I/O Threads (for parallel block I/O) |
| +---------------------+ +---------------------+ |
| | IOThread 0 | | IOThread 1 | ... |
| | - Own event loop | | - Own event loop | |
| | - Handles AIO | | - Handles AIO | |
| | completions for | | completions for | |
| | assigned disks | | assigned disks | |
| +---------------------+ +---------------------+ |
| |
| Worker Threads (thread pool for sync operations) |
| +------------------------------------------------------------+ |
| | Used for: file creation, qcow2 metadata operations, | |
| | LUKS encryption, compression | |
| +------------------------------------------------------------+ |
+------------------------------------------------------------------+
The main loop is the heart of QEMU. It is a single-threaded event loop based on GLib's GMainLoop. All management operations (device hotplug, migration initiation, snapshot creation) are serialized through this loop. The main loop also runs a "Big QEMU Lock" (BQL, formerly known as the "Global Mutex") that serializes access to device state. Only one thread at a time may hold the BQL. This is a historical design decision that limits parallelism for device emulation but simplifies correctness.
I/O Threads were introduced to break the BQL bottleneck for block I/O. Each IOThread has its own event loop and can process block I/O completions independently of the main loop. Assigning different disks to different IOThreads enables parallel disk I/O without BQL contention. In KubeVirt/OVE, IOThreads are configurable via the spec.domain.devices.disks[].io and ioThreadsPolicy fields.
Device Emulation
QEMU emulates a complete PC, including legacy devices that guests expect to find at boot time and modern paravirtualized devices for performance:
Legacy / Emulated Devices (for compatibility):
| Device | What QEMU Emulates | Guest Requirement |
|---|---|---|
| i440FX + PIIX3 (pc machine type) | Legacy PCI chipset, ISA bridge, IDE controller, USB UHCI | Any OS boots, but limited to PCI (no PCIe) |
| Q35 + ICH9 (q35 machine type) | Modern PCIe chipset, AHCI SATA, USB xHCI, IOMMU | Recommended for production; required for PCIe passthrough |
| e1000 / e1000e / rtl8139 | Intel Gigabit / Realtek NIC emulation | Built-in drivers in most OSes |
| IDE / AHCI | Legacy disk controller emulation | Built-in drivers in most OSes |
| VGA / cirrus / bochs-display | Display adapter emulation | Built-in drivers |
| i8254 (PIT), i8259 (PIC), MC146818 (RTC) | Legacy timer, interrupt controller, real-time clock | Required for BIOS boot sequence |
Paravirtualized Devices (virtio -- for performance):
| Device | Purpose | Guest Requirement |
|---|---|---|
| `virtio-blk` | Block device (one device = one disk) | virtio drivers (built into Linux 2.6.25+, Red Hat virtio drivers for Windows) |
| `virtio-scsi` | SCSI controller (supports many LUNs) | virtio-scsi drivers |
| `virtio-net` | Network interface | virtio-net drivers |
| `virtio-gpu` | GPU with 2D/3D acceleration | virtio-gpu drivers |
| `virtio-balloon` | Memory ballooning | virtio-balloon drivers |
| `virtio-rng` | Hardware random number generator passthrough | virtio-rng drivers |
| `virtio-serial` | Serial/console channel | virtio-serial drivers |
| `virtio-fs` | Shared filesystem (host-guest file sharing via FUSE) | virtiofs drivers |
| `vhost-net` | Kernel-accelerated virtio-net (data path in kernel, not QEMU) | Same as virtio-net |
| `vhost-user` | DPDK-accelerated virtio (data path in a separate user-space process) | Same as virtio-net |
The performance difference between emulated and paravirtualized devices is dramatic:
| Metric | e1000 (emulated) | virtio-net | vhost-net | SR-IOV passthrough |
|---|---|---|---|---|
| Throughput (1500 MTU) | ~2 Gbps | ~10 Gbps | ~20 Gbps | ~25 Gbps (line rate) |
| CPU cost per Gbps | Very high | Moderate | Low | Minimal |
| Latency (round-trip) | ~200 us | ~50 us | ~30 us | ~10 us |
| Guest driver requirement | None (built-in) | virtio drivers | virtio drivers | VF driver |
Virtio Architecture in Detail
Virtio is the standardized paravirtualized I/O framework for KVM/QEMU guests. It was designed by Rusty Russell (IBM/Red Hat) and is formally specified by the OASIS virtio TC. Understanding virtio at the vring level is essential because virtio performance is the single biggest factor in KVM's I/O competitiveness with ESXi.
Why virtio is fast: Traditional device emulation works by trapping guest I/O instructions. Every time the guest writes to an I/O port or an MMIO address belonging to an emulated device, the CPU triggers a VM exit, control transfers to QEMU, QEMU emulates the device behavior, and then the guest resumes. For a single network packet, this might involve dozens of trap-and-emulate cycles.
Virtio replaces this with a shared memory ring buffer architecture. The guest and QEMU share a set of memory regions (the virtqueues). The guest writes descriptors into the ring, kicks the hypervisor once, and QEMU processes a batch of descriptors in one pass. This amortizes the cost of VM exits across many I/O operations.
Virtio Virtqueue Architecture (vring)
+------------------------------------------------------------------+
| Guest Driver (e.g., virtio-net in Linux kernel) |
| |
| 1. Allocates data buffers in guest memory |
| 2. Writes buffer addresses into Descriptor Table |
| 3. Adds descriptor index to Available Ring |
| 4. Writes to Queue Notify register (triggers ioeventfd) |
+------------------------------------------------------------------+
| | |
v v v
+------------------------------------------------------------------+
| Shared Memory (visible to both guest and QEMU via memslot) |
| |
| Descriptor Table (array of descriptors, typically 256 entries) |
| +------+------+----------+--------+-------+ |
| | idx | addr | length | flags | next | |
| +------+------+----------+--------+-------+ |
| | 0 | GPA | 1500 | 0x0 | -- | (single buffer) |
| | 1 | GPA | 20 | NEXT | 2 | (chained: header) |
| | 2 | GPA | 1500 | WRITE | -- | (chained: data) |
| | 3 | GPA | 1500 | 0x0 | -- | |
| | ... | ... | ... | ... | ... | |
| +------+------+----------+--------+-------+ |
| |
| Available Ring (guest writes, QEMU reads) |
| +-------+-------+------+------+------+------+ |
| | flags | idx | e0 | e1 | e2 | ... | |
| +-------+-------+------+------+------+------+ |
| | 0x0 | 3 | 0 | 1 | 3 | ... | |
| +-------+-------+------+------+------+------+ |
| idx = next entry guest will write to |
| e0..eN = descriptor indices of buffers available for QEMU |
| |
| Used Ring (QEMU writes, guest reads) |
| +-------+-------+----------+----------+ |
| | flags | idx | (id,len) | (id,len) | ... |
| +-------+-------+----------+----------+ |
| | 0x0 | 2 | (0,1500) | (1,1520) | ... |
| +-------+-------+----------+----------+ |
| idx = next entry QEMU will write to |
| id = descriptor index, len = bytes written by QEMU |
+------------------------------------------------------------------+
^ ^ ^
| | |
+------------------------------------------------------------------+
| QEMU (or vhost-net in kernel, or vhost-user in DPDK process) |
| |
| 1. Gets notification (ioeventfd triggers epoll) |
| 2. Reads Available Ring to find new descriptors |
| 3. Processes buffers (transmit packet / fill receive buffer) |
| 4. Writes results to Used Ring |
| 5. Injects interrupt via irqfd (signals guest) |
+------------------------------------------------------------------+
Key data structures:
- Descriptor Table: A fixed-size array (default 256 entries, configurable up to 32768). Each descriptor contains a guest-physical address pointing to a data buffer, the buffer length, flags (NEXT for chaining, WRITE if the device should write to it, INDIRECT for indirect descriptor tables), and a next index for chaining.
- Available Ring: A producer-consumer ring where the guest posts descriptors it has prepared for the device. The guest increments the `idx` field after adding entries. QEMU reads from `last_avail_idx` to `idx` to find new work.
- Used Ring: A producer-consumer ring where the device (QEMU) posts completed descriptors. QEMU increments `idx` after processing. The guest reads from `last_used_idx` to `idx` to find completed operations.
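For reference, the three structures above have the following in-memory layout. This is a simplified rendering of the split-virtqueue layout from the virtio specification; the real kernel headers use endian-annotated types rather than plain integers.

```c
/* The three vring structures from the diagram above, as laid out in shared
 * guest memory (simplified from the virtio split-virtqueue specification). */
#include <stdint.h>

struct vring_desc {            /* Descriptor Table entry */
    uint64_t addr;             /* guest-physical address of the data buffer */
    uint32_t len;              /* buffer length in bytes */
    uint16_t flags;            /* NEXT (chained), WRITE (device writes), INDIRECT */
    uint16_t next;             /* index of the next descriptor in a chain */
};

struct vring_avail {           /* Available Ring: guest -> device */
    uint16_t flags;
    uint16_t idx;              /* next slot the guest will fill */
    uint16_t ring[];           /* descriptor indices posted by the guest */
};

struct vring_used_elem {
    uint32_t id;               /* head descriptor index that was consumed */
    uint32_t len;              /* bytes the device wrote into the buffer chain */
};

struct vring_used {            /* Used Ring: device -> guest */
    uint16_t flags;
    uint16_t idx;              /* next slot the device will fill */
    struct vring_used_elem ring[];
};
```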
Notification mechanisms:
- ioeventfd (guest -> host): When the guest writes to the virtqueue's notification register (an MMIO address), KVM intercepts the write and signals an eventfd file descriptor. QEMU (or vhost-net) has this eventfd registered in its epoll loop and wakes up to process the queue. The critical optimization: KVM handles the ioeventfd entirely in kernel space -- the VM exit for the notification write is handled by KVM, which signals the eventfd, and immediately resumes the guest. QEMU processes the queue asynchronously.
- irqfd (host -> guest): When QEMU (or vhost-net) finishes processing a batch of descriptors, it writes to an irqfd file descriptor. KVM translates this into a guest interrupt injection. With posted interrupts, the interrupt can be delivered without a VM exit if the target vCPU is running.
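The host-side registration behind the ioeventfd path is a single KVM_IOEVENTFD ioctl, sketched below. The guest-physical notify address is a placeholder, error handling is omitted, and the matching irqfd registration was shown earlier in the interrupt-handling section.

```c
/* Sketch: register an ioeventfd so a guest write to the virtqueue's notify
 * address is handled entirely in the kernel -- KVM signals the eventfd and
 * resumes the guest immediately. */
#include <stdint.h>
#include <sys/eventfd.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

int setup_queue_notify(int vmfd, uint64_t notify_gpa)
{
    int efd = eventfd(0, 0);

    struct kvm_ioeventfd ioev = {
        .addr = notify_gpa,   /* guest-physical address of the notify register */
        .len  = 2,            /* virtio queue-notify writes are typically 16-bit */
        .fd   = efd,
    };
    ioctl(vmfd, KVM_IOEVENTFD, &ioev);

    /* QEMU (or vhost) adds efd to its epoll loop; a readable event means
     * "the guest kicked this virtqueue -- go process the Available Ring". */
    return efd;
}
```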
Guest-side requirements (drivers):
Virtio devices are not automatically recognized by every operating system. The guest must have virtio drivers:
| Guest OS | Virtio Driver Status |
|---|---|
| Linux (kernel 2.6.25+) | Built-in. All virtio drivers are mainline kernel modules. No additional installation needed. |
| Linux (kernel < 2.6.25) | Not available. Must use emulated devices (e1000, IDE). |
| Windows Server 2016+ | Available via Red Hat "virtio-win" driver package. Must be installed manually or injected during image preparation. Not built into Windows. |
| Windows Server 2012 R2 | Available via virtio-win, but EOL guest -- evaluate migration necessity. |
| FreeBSD | Built-in since FreeBSD 10. |
For the VMware-to-OVE migration, every Windows VM must have virtio drivers installed before or during migration. The Migration Toolkit for Virtualization (MTV) handles driver injection as part of the conversion process, but this must be validated per OS version in the PoC.
Machine Types (pc vs q35)
QEMU supports multiple "machine types" that define the virtual hardware topology presented to the guest:
| Machine Type | Chipset | Bus | Boot | IOMMU | Recommended |
|---|---|---|---|---|---|
| `pc` (i440FX) | i440FX + PIIX3 | PCI (parallel) | BIOS or UEFI | No | Legacy VMs only |
| `q35` (ICH9) | Q35 + ICH9 | PCIe (serial) | UEFI (OVMF) | Yes (Intel VT-d emulated) | All new VMs, required for passthrough |
Why q35 matters:
- PCIe topology: q35 presents a proper PCIe root complex with PCIe root ports, enabling multi-function devices, PCIe capability negotiation, and AER (Advanced Error Reporting). PCIe passthrough (VFIO) requires PCIe topology; it does not work correctly with the legacy PCI bus of i440FX.
- IOMMU emulation: q35 supports an emulated Intel VT-d IOMMU, which is required for vIOMMU (virtual IOMMU) -- needed by guests that themselves want to do DMA isolation (e.g., a guest running DPDK or a nested VM).
- Modern peripherals: q35 provides AHCI SATA (instead of IDE), USB 3.0 xHCI (instead of UHCI), and native hotplug on PCIe ports.
- Secure Boot: OVMF (the UEFI firmware for QEMU) works best with q35. The i440FX machine type supports OVMF but with limitations.
KubeVirt defaults to q35 for all new VMs. Migrating legacy VMs from VMware may require preserving the pc machine type if the guest OS depends on i440FX-specific device layout (rare, but possible with very old Linux or Windows installations).
Block Layer
QEMU's block layer is a sophisticated subsystem that manages disk images, supports multiple formats, and provides live block operations:
qcow2 (QEMU Copy-On-Write version 2):
qcow2 is the native disk image format for KVM/QEMU. Its internal structure:
qcow2 Image Structure
+------------------------------------------------------------------+
| qcow2 Header (72+ bytes) |
| - magic: 0x514649FB ('QFI\xfb') |
| - version: 3 |
| - backing_file_offset: pointer to backing file path (if any) |
| - cluster_bits: 16 (default 64KB clusters) |
| - size: virtual disk size |
| - crypt_method: 0 (none) or 2 (LUKS) |
| - l1_size: number of L1 table entries |
| - l1_table_offset: pointer to L1 table |
| - refcount_table_offset: pointer to refcount table |
+------------------------------------------------------------------+
| L1 Table (top-level, in-memory) |
| +------+------+------+------+ |
| | ptr | ptr | ptr | NULL | ... (one entry per L2 table) |
| +------+------+------+------+ |
| | | | |
| v v v |
| L2 Tables (second-level, loaded on demand) |
| +------+------+------+------+------+------+ |
| | ptr | ptr | NULL | ptr | ptr | NULL | ... |
| +------+------+------+------+------+------+ |
| Each L2 entry points to a data cluster (64KB by default) |
| NULL entries = unallocated (reads return zeros) |
| | | | | |
| v v v v |
| Data Clusters |
| +----------+ +----------+ +----------+ +----------+ |
| | 64KB | | 64KB | | 64KB | | 64KB | ... |
| | data | | data | | data | | data | |
| +----------+ +----------+ +----------+ +----------+ |
+------------------------------------------------------------------+
| Refcount Table + Refcount Blocks |
| Tracks how many references point to each cluster. |
| refcount=0: free, refcount=1: used, refcount>1: snapshot/COW |
+------------------------------------------------------------------+
Key qcow2 features relevant to the evaluation:
- Thin provisioning: Clusters are allocated on first write. A 500 GB virtual disk may consume only 50 GB on the host initially.
- Backing chains: A qcow2 image can have a "backing file" -- a read-only base image. Reads for unallocated clusters fall through to the backing file. This enables efficient snapshots and template-based provisioning.
- Internal snapshots: The L1 table can be duplicated, creating a point-in-time snapshot without copying data. Modified clusters get new allocations (COW); unmodified clusters share refcounts with the snapshot.
- Compression: Individual clusters can be compressed (zlib or zstd). Compressed clusters are read-only.
- Encryption: qcow2v3 supports LUKS-based encryption of data clusters.
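To make the header layout above concrete, here is a small sketch that reads and prints the fixed qcow2 header fields (all multi-byte fields are stored big-endian on disk). It is illustrative only and not a replacement for qemu-img info; error handling is omitted.

```c
/* Sketch: read the fixed qcow2 header fields shown in the diagram above. */
#include <endian.h>
#include <stdint.h>
#include <stdio.h>

struct qcow2_header {
    uint32_t magic;                   /* 0x514649FB ('QFI\xfb') */
    uint32_t version;                 /* 2 or 3 */
    uint64_t backing_file_offset;     /* 0 if there is no backing file */
    uint32_t backing_file_size;
    uint32_t cluster_bits;            /* cluster size = 1 << cluster_bits */
    uint64_t size;                    /* virtual disk size in bytes */
    uint32_t crypt_method;            /* 0 = none, 2 = LUKS */
    uint32_t l1_size;
    uint64_t l1_table_offset;
    uint64_t refcount_table_offset;
    uint32_t refcount_table_clusters;
    uint32_t nb_snapshots;
    uint64_t snapshots_offset;
} __attribute__((packed));

int print_qcow2_info(const char *path)
{
    struct qcow2_header h;
    FILE *f = fopen(path, "rb");
    fread(&h, sizeof(h), 1, f);
    fclose(f);

    if (be32toh(h.magic) != 0x514649FB)
        return -1;                    /* not a qcow2 image */

    printf("version:      %u\n", be32toh(h.version));
    printf("virtual size: %llu bytes\n", (unsigned long long)be64toh(h.size));
    printf("cluster size: %u KiB\n", (1u << be32toh(h.cluster_bits)) / 1024);
    return 0;
}
```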
Block jobs (live block operations):
| Job Type | What It Does | Use Case |
|---|---|---|
| `block-mirror` | Copies all data from source to target in real-time, then switches | Live storage migration (move VM disk to new storage without downtime) |
| `block-commit` | Merges an overlay image down into its backing file | Removing a snapshot layer after validating the snapshot is no longer needed |
| `block-stream` | Copies data from a backing file into the overlay, making the overlay self-contained | Breaking a backing chain after template-based provisioning |
| `block-backup` | Creates a point-in-time copy (full or incremental via dirty bitmaps) | Backup |
These block jobs run as background operations coordinated by QEMU's main loop. They use rate-limiting to avoid saturating storage I/O. In OVE, these operations are triggered by KubeVirt's CDI (Containerized Data Importer) and storage controllers.
QMP (QEMU Machine Protocol)
QMP is QEMU's management interface -- a JSON-based protocol over a Unix domain socket or TCP connection. Every QEMU management operation (pause, resume, migrate, hotplug, query status) is a QMP command. Libvirt is the primary QMP client; it translates high-level operations into QMP commands.
Example QMP exchange:
--> {"execute": "query-status"}
<-- {"return": {"status": "running", "singlestep": false, "running": true}}
--> {"execute": "device_add", "arguments": {"driver": "virtio-net-pci", "id": "net1", "netdev": "tap1"}}
<-- {"return": {}}
--> {"execute": "migrate", "arguments": {"uri": "tcp:destination-host:4567"}}
<-- {"return": {}}
<-- {"event": "MIGRATION", "data": {"status": "setup"}, "timestamp": {...}}
<-- {"event": "MIGRATION", "data": {"status": "active"}, "timestamp": {...}}
<-- {"event": "MIGRATION", "data": {"status": "completed"}, "timestamp": {...}}
In the OVE stack, QMP is not used directly by administrators. The virt-launcher process communicates with QEMU via QMP, and the virt-launcher exposes a gRPC API to the virt-handler DaemonSet, which in turn exposes Kubernetes-native APIs. Administrators interact with kubectl and VirtualMachineInstance custom resources, never with QMP directly.
QEMU + KVM Integration
When QEMU starts with -accel kvm (or the equivalent machine configuration), it operates as a "KVM accelerated" process:
- QEMU opens `/dev/kvm` and creates a VM via `KVM_CREATE_VM`.
- For each vCPU, QEMU creates a Linux thread and a KVM vCPU via `KVM_CREATE_VCPU`.
- QEMU allocates guest RAM via `mmap()` and registers it with KVM via `KVM_SET_USER_MEMORY_REGION`.
- QEMU sets up device emulation (virtio devices, chipset, etc.) and registers ioeventfd/irqfd pairs with KVM.
- Each vCPU thread enters a loop calling `ioctl(vcpufd, KVM_RUN)`.
- VM exits that KVM cannot handle in-kernel (I/O to emulated devices, MMIO to QEMU-emulated regions) return to the vCPU thread, which dispatches to the appropriate QEMU device model.
- QEMU's main loop handles management operations (QMP), background block jobs, and timer-driven events.
The result: QEMU is simultaneously a device emulator, a VM manager, and a process whose threads spend most of their time inside the kernel executing guest code via KVM. This dual nature is important for understanding resource accounting -- a QEMU process showing 400% CPU usage in top means 4 vCPU threads are running guest code at full speed.
3. Libvirt
What It Is and Why It Exists
Libvirt is the management abstraction layer between human operators (or higher-level tools) and the raw KVM/QEMU interface. Without libvirt, managing a QEMU VM requires constructing a command line with dozens of arguments:
# Raw QEMU command line for a single VM (simplified):
qemu-system-x86_64 \
-name guest=myvm,debug-threads=on \
-machine q35,accel=kvm,usb=off \
-cpu host,+invtsc \
-m 16384 \
-smp 8,sockets=1,cores=8,threads=1 \
-object memory-backend-file,id=ram-node0,size=16G,mem-path=/dev/hugepages,host-nodes=0,policy=bind \
-numa node,nodeid=0,cpus=0-7,memdev=ram-node0 \
-drive file=/var/lib/libvirt/images/myvm.qcow2,format=qcow2,if=none,id=drive0 \
-device virtio-blk-pci,drive=drive0,id=disk0,bootindex=1 \
-netdev tap,id=net0,ifname=tap-myvm,script=no,downscript=no,vhost=on \
-device virtio-net-pci,netdev=net0,mac=52:54:00:aa:bb:cc \
-chardev socket,id=qmp,path=/var/run/qmp-myvm.sock,server=on,wait=off \
-mon chardev=qmp,mode=control \
-device virtio-balloon-pci,id=balloon0 \
-device virtio-rng-pci,rng=rng0 \
-object rng-random,id=rng0,filename=/dev/urandom \
...
This is unmaintainable for 5,000 VMs. Libvirt replaces it with a structured XML representation and a stable API.
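That stable API is a C library (with bindings for Python, Go, and other languages) that virsh and KubeVirt's virt-launcher both sit on top of. A minimal sketch of what "stable API" means in practice, assuming a domain named myvm already exists under qemu:///system; link with -lvirt:

```c
/* Sketch of the stable libvirt C API. "myvm" is a placeholder domain name;
 * error handling is minimal. */
#include <stdio.h>
#include <stdlib.h>
#include <libvirt/libvirt.h>

int main(void)
{
    virConnectPtr conn = virConnectOpen("qemu:///system");
    if (!conn)
        return 1;

    virDomainPtr dom = virDomainLookupByName(conn, "myvm");
    if (dom) {
        virDomainInfo info;
        virDomainGetInfo(dom, &info);        /* state, memory, vCPUs, CPU time */
        printf("state=%d memory=%lu KiB vcpus=%hu\n",
               info.state, info.memory, info.nrVirtCpu);

        char *xml = virDomainGetXMLDesc(dom, 0);   /* the domain XML format below */
        printf("%s\n", xml);
        free(xml);
        virDomainFree(dom);
    }
    virConnectClose(conn);
    return 0;
}
```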
Architecture
Libvirt follows a client-daemon-driver architecture:
Libvirt Architecture
+------------------------------------------------------------------+
| Clients |
| +----------+ +----------+ +----------+ +-----------------+ |
| | virsh | | virt- | | Cockpit | | KubeVirt | |
| | (CLI) | | manager | | (Web UI) | | virt-launcher | |
| +----------+ +----------+ +----------+ +-----------------+ |
| | | | | |
| v v v v |
| libvirt client library (libvirt.so) |
| Connection URIs: |
| qemu:///system (local, root, system VMs) |
| qemu:///session (local, user, user VMs) |
| qemu+ssh://host/system (remote via SSH tunnel) |
+------------------------------------------------------------------+
|
| (Unix socket or TCP/TLS)
v
+------------------------------------------------------------------+
| Daemon Layer |
| |
| Traditional: monolithic libvirtd |
| Modern (RHEL 9+, Fedora 35+): modular daemons |
| |
| +------------------+ +------------------+ +----------------+ |
| | virtqemud | | virtnetworkd | | virtstoraged | |
| | (QEMU/KVM VMs) | | (virtual | | (storage pool | |
| | | | networks) | | management) | |
| +------------------+ +------------------+ +----------------+ |
| +------------------+ +------------------+ +----------------+ |
| | virtnodedevd | | virtsecretd | | virtinterfaced | |
| | (node devices, | | (secret/key | | (host NIC | |
| | VFIO, MDEV) | | management) | | configuration)| |
| +------------------+ +------------------+ +----------------+ |
+------------------------------------------------------------------+
|
| (driver plugins)
v
+------------------------------------------------------------------+
| Hypervisor Drivers |
| +------------------+ +------------------+ +----------------+ |
| | QEMU driver | | LXC driver | | Test driver | |
| | (manages QEMU | | (manages Linux | | (mock driver | |
| | processes, QMP | | containers) | | for testing) | |
| | communication) | | | | | |
| +------------------+ +------------------+ +----------------+ |
+------------------------------------------------------------------+
|
| (QEMU driver uses QMP to talk to QEMU processes)
v
+------------------------------------------------------------------+
| QEMU Processes (one per VM) |
| +------------------+ +------------------+ +----------------+ |
| | qemu-system-x86 | | qemu-system-x86 | | qemu-system-x86| |
| | (VM 1) | | (VM 2) | | (VM 3) | |
| +------------------+ +------------------+ +----------------+ |
+------------------------------------------------------------------+
In RHEL 9 and newer (which includes RHCOS, the OS underlying OVE worker nodes), the monolithic libvirtd is replaced by modular daemons. Each daemon handles a specific subsystem and can be restarted independently. This improves resilience -- a crash in the storage daemon does not affect running VMs managed by the QEMU daemon.
Domain XML: The VM Definition Format
Libvirt defines VMs using an XML format called "domain XML." This is the equivalent of a VMware .vmx configuration file, but more structured and comprehensive. Understanding domain XML is essential because KubeVirt's VirtualMachineInstance spec maps directly to domain XML fields.
A representative domain XML for a production VM:
<domain type='kvm'>
<name>prod-db-01</name>
<uuid>a1b2c3d4-e5f6-7890-abcd-ef1234567890</uuid>
<memory unit='GiB'>64</memory>
<currentMemory unit='GiB'>64</currentMemory>
<vcpu placement='static'>16</vcpu>
<cpu mode='host-passthrough' check='none'>
<topology sockets='1' dies='1' cores='8' threads='2'/>
<numa>
<cell id='0' cpus='0-7' memory='32' unit='GiB'/>
<cell id='1' cpus='8-15' memory='32' unit='GiB'/>
</numa>
</cpu>
<os>
<type arch='x86_64' machine='q35'>hvm</type>
<loader readonly='yes' secure='yes' type='pflash'>
/usr/share/OVMF/OVMF_CODE.secboot.fd
</loader>
<nvram>
/var/lib/libvirt/qemu/nvram/prod-db-01_VARS.fd
</nvram>
<boot dev='hd'/>
</os>
<features>
<acpi/>
<apic/>
<smm state='on'/> <!-- Required for Secure Boot -->
</features>
<clock offset='utc'>
<timer name='kvmclock' present='yes'/>
<timer name='hpet' present='no'/>
</clock>
<devices>
<emulator>/usr/libexec/qemu-kvm</emulator>
<!-- System disk -->
<disk type='file' device='disk'>
<driver name='qemu' type='qcow2' cache='none' io='native' discard='unmap'/>
<source file='/var/lib/libvirt/images/prod-db-01-root.qcow2'/>
<target dev='vda' bus='virtio'/>
<address type='pci' domain='0x0000' bus='0x04' slot='0x00' function='0x0'/>
</disk>
<!-- Data disk -->
<disk type='block' device='disk'>
<driver name='qemu' type='raw' cache='none' io='native'/>
<source dev='/dev/vg_data/lv_prod_db_01'/>
<target dev='vdb' bus='virtio'/>
</disk>
<!-- Network -->
<interface type='bridge'>
<mac address='52:54:00:aa:bb:01'/>
<source bridge='br-prod'/>
<model type='virtio'/>
<driver name='vhost' queues='8'/>
</interface>
<!-- Balloon -->
<memballoon model='virtio'>
<address type='pci' domain='0x0000' bus='0x06' slot='0x00' function='0x0'/>
</memballoon>
<!-- vTPM -->
<tpm model='tpm-crb'>
<backend type='emulator' version='2.0'/>
</tpm>
<!-- RNG -->
<rng model='virtio'>
<backend model='random'>/dev/urandom</backend>
</rng>
</devices>
<memoryBacking>
<hugepages>
<page size='2048' unit='KiB' nodeset='0-1'/>
</hugepages>
<nosharepages/> <!-- Disables KSM for this VM -->
</memoryBacking>
<cputune>
<vcpupin vcpu='0' cpuset='4'/>
<vcpupin vcpu='1' cpuset='5'/>
<vcpupin vcpu='2' cpuset='6'/>
<!-- ... -->
<emulatorpin cpuset='36-37'/>
<iothreadpin iothread='1' cpuset='38'/>
</cputune>
</domain>
Key domain XML sections and their VMware equivalents:
| Domain XML Element | VMware Equivalent | Notes |
|---|---|---|
| `<cpu mode='host-passthrough'>` | cpuid.coresPerSocket, CPU masking | Exposes host CPU features directly to guest |
| `<cpu><topology>` | numvcpus, cpuid.coresPerSocket | Defines virtual CPU topology |
| `<cpu><numa>` | vNUMA (auto-configured in vSphere) | Explicit NUMA topology for guest |
| `<os><loader>` (OVMF) | VM Hardware Version + UEFI firmware | UEFI boot with Secure Boot |
| `<disk bus='virtio'>` | scsi0:0 (PVSCSI adapter) | Paravirtualized disk |
| `<interface><model type='virtio'>` | vmxnet3 | Paravirtualized NIC |
| `<memoryBacking><hugepages>` | Large Pages (auto in vSphere) | Host huge page backing |
| `<cputune><vcpupin>` | Scheduling Affinity | CPU pinning |
| `<tpm model='tpm-crb'>` | vTPM 2.0 | Virtual TPM |
Storage Pools and Volumes Management
Libvirt abstracts storage into pools (a storage location) and volumes (individual disk images within a pool):
| Pool Type | Backend | OVE Equivalent |
|---|---|---|
| `dir` | Local filesystem directory | Not directly used; PVs backed by local storage |
| `logical` | LVM volume group | LVM-based PVs via TopoLVM or the LVM Storage (LVMS) operator |
| `rbd` | Ceph RBD pool | Ceph CSI driver providing PVs |
| `iscsi` | iSCSI target | iSCSI CSI driver |
| `netfs` | NFS export | NFS CSI driver |
| `disk` | Raw disk partition | Local disk PVs |
In standalone KVM deployments, libvirt manages storage pools directly. In OVE, storage is managed by Kubernetes CSI drivers and the Containerized Data Importer (CDI). Libvirt inside the virt-launcher pod sees only the storage volumes that Kubernetes has mounted into the pod's filesystem.
Network Management
Libvirt provides several network attachment models:
- Virtual network (NAT): Libvirt creates a Linux bridge with iptables NAT rules. VMs get internal addresses and access the external network through NAT. Used for development, not production.
- Bridged networking: VMs connect to a pre-configured Linux bridge (e.g., `br-prod`) that bridges to a physical NIC. VMs get addresses on the physical network. The standard production model.
- macvtap: VMs attach directly to a physical NIC using the macvtap kernel module. Avoids the overhead of a full Linux bridge. Does not support host-to-VM communication on the same NIC.
- SR-IOV passthrough: VMs get a PCIe Virtual Function passed through via VFIO. Maximum performance, no software switching overhead.
In OVE, network management is handled by Multus CNI and the NMState operator, not by libvirt directly. Libvirt inside the virt-launcher pod receives its network configuration from KubeVirt, which translates the Kubernetes network attachment definitions into libvirt interface XML.
virsh CLI: Essential Commands
virsh is the primary command-line tool for libvirt. While OVE administrators use kubectl and virtctl rather than virsh directly, understanding virsh commands maps directly to understanding the underlying operations that KubeVirt performs:
| virsh Command | What It Does | KubeVirt/kubectl Equivalent |
|---|---|---|
| `virsh list --all` | List all VMs (running and stopped) | `kubectl get vmi -A` |
| `virsh start <vm>` | Start a stopped VM | `virtctl start <vm>` |
| `virsh shutdown <vm>` | Send ACPI shutdown to guest | `virtctl stop <vm>` |
| `virsh destroy <vm>` | Force-kill a VM (like power off) | `kubectl delete vmi <vm>` |
| `virsh suspend <vm>` | Pause vCPU execution (freeze) | `virtctl pause <vm>` |
| `virsh resume <vm>` | Resume paused vCPU execution | `virtctl unpause <vm>` |
| `virsh migrate <vm> <dest>` | Live-migrate VM to another host | `virtctl migrate <vm>` (KubeVirt selects destination) |
| `virsh dumpxml <vm>` | Show full domain XML | `kubectl get vmi <vm> -o yaml` (KubeVirt spec, not raw XML) |
| `virsh domstats <vm>` | Show VM statistics (CPU, memory, block, net) | Prometheus metrics from virt-handler |
| `virsh snapshot-create <vm>` | Create VM snapshot | VolumeSnapshot CR (Kubernetes CSI snapshot) |
| `virsh console <vm>` | Connect to serial console | `virtctl console <vm>` |
| `virsh vncdisplay <vm>` | Get VNC display address | `virtctl vnc <vm>` (proxied through API server) |
Libvirt in KubeVirt: How KubeVirt Wraps Libvirt
This is the critical architectural point for the OVE evaluation. KubeVirt does not replace libvirt -- it wraps it. Each VM in OVE runs inside a virt-launcher pod that contains:
- virt-launcher process: A Go binary that translates the KubeVirt VirtualMachineInstance spec into libvirt domain XML and manages the VM lifecycle.
- libvirtd (per-pod): A full libvirt daemon running inside the pod (not shared across VMs). It receives domain XML from virt-launcher and manages the QEMU process.
- QEMU process: Launched and monitored by libvirtd.
KubeVirt Libvirt Wrapping Architecture
kubectl / API Server
|
| VirtualMachineInstance CR (Kubernetes-native YAML)
v
virt-controller (cluster-level)
|
| Creates Pod spec
v
Kubernetes Scheduler (places pod on node)
|
v
virt-launcher Pod (on worker node)
+------------------------------------------------------------------+
| |
| virt-launcher (Go process) |
| +------------------------------------------------------------+ |
| | 1. Reads VMI spec from Kubernetes API | |
| | 2. Translates spec fields to domain XML: | |
| | spec.domain.cpu -> <cpu> element | |
| | spec.domain.memory -> <memory> element | |
| | spec.domain.devices -> <devices> element | |
| | spec.volumes -> <disk> elements | |
| | spec.networks -> <interface> elements | |
| | 3. Passes domain XML to libvirtd via libvirt API | |
| | 4. Monitors VM lifecycle, reports status back to K8s API | |
| +------------------------------------------------------------+ |
| | |
| | libvirt API (Unix socket, in-pod) |
| v |
| libvirtd (per-pod instance) |
| +------------------------------------------------------------+ |
| | - Receives domain XML | |
| | - Constructs QEMU command line | |
| | - Launches QEMU process | |
| | - Communicates with QEMU via QMP | |
| | - Reports domain state changes | |
| +------------------------------------------------------------+ |
| | |
| | QMP (Unix socket, in-pod) |
| v |
| QEMU process |
| +------------------------------------------------------------+ |
| | - vCPU threads (call KVM_RUN) | |
| | - Device emulation | |
| | - Block I/O | |
| +------------------------------------------------------------+ |
| |
+------------------------------------------------------------------+
|
| ioctl(KVM_RUN) via /dev/kvm
v
Linux Kernel (RHCOS) + kvm.ko
Why a per-pod libvirtd? In traditional KVM deployments, a single libvirtd manages all VMs on a host. In KubeVirt, each VM gets its own libvirtd for isolation. If libvirtd crashes, only one VM is affected. The per-pod model also enables Kubernetes' security boundaries (namespaces, network policies, RBAC) to apply at the individual VM level.
Operational implication: When debugging an OVE VM issue, the troubleshooting path is: kubectl describe vmi <vm> (Kubernetes layer) -> kubectl exec -it <virt-launcher-pod> -- virsh dominfo <vm> (libvirt layer) -> kubectl exec -it <virt-launcher-pod> -- virsh qemu-monitor-command <vm> '{"execute":"query-status"}' (QMP/QEMU layer). Understanding all three layers is necessary for production support.
Comparison to PowerCLI / vCenter API
For teams transitioning from VMware, this mapping clarifies the operational equivalence:
| VMware (PowerCLI / vCenter) | Libvirt (virsh) | OVE (kubectl/virtctl) |
|---|---|---|
| `Get-VM` | `virsh list --all` | `kubectl get vm -A` |
| `New-VM` | `virsh define <xml>` | `kubectl apply -f vm.yaml` |
| `Start-VM` | `virsh start` | `virtctl start` |
| `Stop-VM -Kill` | `virsh destroy` | `kubectl delete vmi` |
| `Get-VMHost` | `virsh nodeinfo` | `kubectl get nodes` |
| `Move-VM` (vMotion) | `virsh migrate --live` | `virtctl migrate` |
| `New-Snapshot` | `virsh snapshot-create` | VolumeSnapshot CR |
| `Get-Stat -Stat cpu.usage` | `virsh domstats --cpu-total` | Prometheus + Grafana |
| `Set-VM -NumCPU 8` | Edit domain XML, `virsh define` | Edit VM spec, `kubectl apply` |
| VMRC (VM Remote Console) | `virsh vncdisplay` + VNC client | `virtctl vnc` or `virtctl console` |
The fundamental difference: VMware uses an imperative API model (call a function, get a result), while OVE uses a declarative model (define desired state in YAML, the system converges). Libvirt's virsh is imperative, but KubeVirt wraps it in Kubernetes' declarative reconciliation loop.
4. Hyper-V
What It Is and Why It Exists
Hyper-V is Microsoft's Type-1 bare-metal hypervisor, the virtualization engine behind Azure (public cloud), Azure Local (formerly Azure Stack HCI), and Windows Server's Hyper-V role. It is the only hypervisor under evaluation that is not open-source. Understanding Hyper-V at architectural depth is necessary for evaluating Azure Local as a candidate platform.
Hyper-V has been available since Windows Server 2008. The current version relevant to Azure Local is the hypervisor integrated into Azure Stack HCI OS (a specialized Windows Server Core variant purpose-built for HCI scenarios).
Architecture: Root Partition / Child Partition Model
Hyper-V's architecture differs fundamentally from both ESXi and KVM:
Hyper-V Architecture
+------------------------------------------------------------------+
| Child Partition 1 Child Partition 2 Child Partition 3 |
| (VM 1) (VM 2) (VM 3) |
| +----------------+ +----------------+ +----------------+ |
| | Guest OS | | Guest OS | | Guest OS | |
| | (Windows/Linux)| | (Windows/Linux)| | (Windows/Linux)| |
| +----------------+ +----------------+ +----------------+ |
| | VSC (client) | | VSC (client) | | VSC (client) | |
| | (virtio-like | | (Hyper-V | | (Integration | |
| | synthetic | | Integration | | Services) | |
| | devices) | | Services) | | | |
| +-------+--------+ +-------+--------+ +-------+--------+ |
| | | | |
| | VMBus (shared memory + ring buffers) |
| | | | |
| +-------+--------------------+--------------------+-------+ |
| | Root Partition | |
| | +--------------------------------------------------+ | |
| | | Windows Server (Core or Full) | | |
| | | - Management OS (runs Hyper-V management stack) | | |
| | | - VSP (server): handles I/O on behalf of VMs | | |
| | | - WMI/PowerShell management interface | | |
| | | - vmms.exe (VM Management Service) | | |
| | | - vmwp.exe (VM Worker Process, one per VM) | | |
| | | - Device drivers (storage, network) | | |
| | +--------------------------------------------------+ | |
| +----------------------------------------------------------+ |
+------------------------------------------------------------------+
| Hyper-V Hypervisor Layer |
| +------------------------------------------------------------+ |
| | hvix64.exe (Intel) / hvax64.exe (AMD) | |
| | - CPU scheduling (all partitions) | |
| | - Memory management (SLAT / EPT) | |
| | - Intercept handling | |
| | - Partition isolation | |
| | - Virtual interrupt delivery | |
| | - Hypercall interface | |
| +------------------------------------------------------------+ |
+------------------------------------------------------------------+
| Hardware: CPU (VT-x/AMD-V), RAM, NICs, HBAs |
+------------------------------------------------------------------+
Key architectural distinctions from KVM and ESXi:
Root Partition: Hyper-V does not eliminate the host OS. Instead, when Hyper-V starts, the hypervisor layer (hvix64.exe / hvax64.exe) takes ownership of the CPU and installs itself at VMX root mode. The Windows instance that was running becomes the "root partition" -- it continues to run, but now it runs on top of the hypervisor, just like any other partition. The root partition is special: it has direct hardware access (to physical devices and drivers), it runs the management stack, and it hosts the VSP (Virtualization Service Provider) components that handle I/O for child partitions.
This is different from KVM, where the Linux kernel is the hypervisor (KVM runs in the same execution context as the Linux kernel). And it is different from ESXi, where there is no general-purpose OS at all -- the VMkernel is the only kernel.
Child Partitions: These are the VMs. Each child partition is isolated by the hypervisor -- it cannot access hardware directly (unless a device is assigned via Discrete Device Assignment / DDA, Hyper-V's equivalent of VFIO passthrough).
VMBus, Synthetic Devices, and Emulated Devices
Hyper-V offers two types of virtual devices:
Emulated devices work exactly like QEMU's emulated devices: the guest accesses a virtual device (e.g., an emulated Intel 21140 NIC or an emulated IDE controller) that traps I/O instructions to the hypervisor, which emulates the device behavior in software. This is slow but compatible with any guest OS.
Synthetic devices are Hyper-V's equivalent of virtio. They use VMBus, a high-performance shared-memory communication channel between the root partition and child partitions. The architecture uses a VSP/VSC model:
- VSP (Virtualization Service Provider): Runs in the root partition. Handles the actual I/O by talking to real device drivers.
- VSC (Virtualization Service Client): Runs in the child partition (the guest). Communicates with the VSP via VMBus ring buffers (analogous to virtio's vrings).
- VMBus: The transport layer. Uses shared memory pages and event signals (intercept-based notifications, similar to ioeventfd/irqfd in KVM) to move data between VSP and VSC without the data crossing the hypervisor layer.
VMBus vs Virtio -- Parallel Architecture
Hyper-V (VMBus) KVM/QEMU (Virtio)
=============== =================
Child Partition: Guest VM:
+------------------+ +------------------+
| Guest OS | | Guest OS |
| VSC Driver | | virtio driver |
| (storvsc.sys / | | (virtio-blk / |
| netvsc.sys / | | virtio-net / |
| hv_storvsc / | | built-in) |
| hv_netvsc) | | |
+--------+---------+ +--------+---------+
| |
VMBus Ring Buffer Virtio Vring
(shared memory) (shared memory)
| |
+--------+---------+ +--------+---------+
| Root Partition | | QEMU / vhost |
| VSP Driver | | Device backend |
| (StorVSP.sys / | | (virtio-blk / |
| NetVSP.sys) | | vhost-net) |
+------------------+ +------------------+
| |
Real hardware Real hardware
driver in root driver in host
partition Linux kernel
The performance of synthetic devices is comparable to virtio when the guest has the correct drivers (Hyper-V Integration Services for Windows, hv_* kernel modules for Linux). Without Integration Services, guests fall back to emulated devices, which are 5-10x slower.
| Device Type | Hyper-V Emulated | Hyper-V Synthetic (VMBus) | KVM/QEMU Emulated | KVM/QEMU Virtio |
|---|---|---|---|---|
| Network | Intel 21140 (~100Mbps) | Synthetic NIC (~25 Gbps) | e1000 (~2 Gbps) | virtio-net (~20 Gbps) |
| Storage | IDE controller | Synthetic SCSI (StorVSC) | IDE (~200 MB/s) | virtio-blk / virtio-scsi |
| Display | S3 Trio 64 VGA | Synthetic Video (RemoteFX) | std VGA / virtio-gpu | virtio-gpu |
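Whether a guest is actually on the synthetic path is easy to verify from inside a Linux guest. A minimal sketch, assuming a Linux VM on Hyper-V / Azure Local with the in-kernel hv_* drivers available (the interface name eth0 and the package that provides lsvmbus vary by distribution):

```bash
# Synthetic (VMBus) drivers loaded? If absent, the guest is running on emulated devices.
lsmod | grep -E 'hv_vmbus|hv_netvsc|hv_storvsc'

# Enumerate VMBus channels and the devices bound to them
# (lsvmbus ships in the distribution's Hyper-V guest tools package).
lsvmbus -vv

# The NIC should report the synthetic netvsc driver, not an emulated Intel/DEC device.
ethtool -i eth0 | grep '^driver'      # expected: driver: hv_netvsc
```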
Memory Management
Hyper-V provides several memory management capabilities:
SLAT (Second Level Address Translation): Hyper-V's use of Intel EPT / AMD NPT. Mandatory since Windows Server 2016. Provides the same two-level GPA-to-HPA translation described in the KVM section, with the same performance benefits.
Dynamic Memory: Hyper-V's equivalent of memory ballooning, but more sophisticated. It continuously adjusts a VM's physical memory allocation based on the guest's actual demand:
- Startup RAM: The amount of memory assigned at boot.
- Minimum RAM: The floor -- Dynamic Memory will never reclaim below this.
- Maximum RAM: The ceiling.
- Memory Buffer: A percentage of committed memory the guest should have as free buffer (default 20%).
The Dynamic Memory mechanism uses the hv_balloon driver (Linux) or the built-in Integration Service (Windows). Unlike KVM's balloon driver, which is typically triggered manually or by external policy, Hyper-V Dynamic Memory operates continuously and automatically.
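From inside a Linux guest, Dynamic Memory activity is visible through the hv_balloon driver. A minimal sketch, assuming a Linux VM with Dynamic Memory enabled (exact kernel log wording varies by kernel version):

```bash
# Confirm the Hyper-V balloon / memory hot-add driver is loaded
lsmod | grep hv_balloon

# Balloon inflate/deflate and hot-add events appear in the kernel log
dmesg | grep -i balloon

# Hot-added memory raises MemTotal; ballooned-out pages show up as reduced available memory
grep -E 'MemTotal|MemAvailable' /proc/meminfo
```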
Smart Paging: When a VM restarts and its startup RAM exceeds the amount of physical memory currently available (because Dynamic Memory reclaimed memory while the VM was running), Hyper-V uses a temporary page file on the host to bridge the gap until the VM can adjust its working set. Smart Paging is used only during VM startup, never during normal operation.
Generation 1 vs Generation 2 VMs
Hyper-V supports two VM "generations" that define the virtual hardware presented to the guest:
| Feature | Generation 1 | Generation 2 |
|---|---|---|
| Firmware | BIOS (AMI BIOS emulation) | UEFI |
| Boot from | IDE, legacy NIC (PXE) | SCSI, network (PXE over synthetic NIC) |
| Disk controller | IDE (boot) + SCSI | SCSI only (no IDE) |
| Secure Boot | Not supported | Supported (Windows and Linux) |
| vTPM | Not supported | Supported |
| Max disk size (boot) | 2 TB (IDE/MBR limitation) | 64 TB (SCSI/GPT) |
| Guest OS support | All (Windows, Linux, FreeBSD) | Windows Server 2012 R2+, modern Linux |
| Live resize (disk) | Not supported | Supported |
Generation 2 is required for Azure Local and for any modern deployment. Generation 1 exists for backward compatibility with legacy guests. The KVM equivalent of this distinction is the pc (i440FX) vs q35 machine type -- Generation 2 corresponds to q35 with UEFI.
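On the KVM side, the corresponding properties are visible in the libvirt domain definition. A minimal sketch, assuming a libvirt-managed VM named vm1 (in OVE the same XML can be dumped from inside the VM's virt-launcher pod):

```bash
# Machine type: pc-i440fx-* roughly corresponds to Generation 1, pc-q35-* to Generation 2
virsh dumpxml vm1 | grep 'machine='

# Firmware: an OVMF <loader> element indicates UEFI, the prerequisite for Secure Boot and vTPM
virsh dumpxml vm1 | grep -E '<loader|<nvram'
```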
Hyper-V in Azure Local
Azure Local (formerly Azure Stack HCI) uses Hyper-V as its hypervisor but layers significant additional capabilities:
Azure Local Stack
+------------------------------------------------------------------+
| Azure Arc Control Plane (cloud-based) |
| +------------------------------------------------------------+ |
| | Azure Portal / Azure CLI / ARM API | |
| | - VM provisioning, monitoring, policy, update management | |
| | - Arc-connected: each node registers as Arc-enabled server | |
| | - Azure RBAC, Azure Policy, Azure Monitor integration | |
| +------------------------------------------------------------+ |
+------------------------------------------------------------------+
| Azure Arc agent (HTTPS outbound to Azure)
v
+------------------------------------------------------------------+
| Azure Local Cluster (on-premises, 2-16 nodes) |
| +------------------------------------------------------------+ |
| | Azure Stack HCI OS (specialized Windows Server Core) | |
| | - No GUI, no desktop, minimal attack surface | |
| | - Windows Failover Clustering (WSFC) for HA | |
| | - Storage Spaces Direct (S2D) for HCI storage | |
| | - Software-Defined Networking (SDN) stack | |
| +------------------------------------------------------------+ |
| +------------------------------------------------------------+ |
| | Hyper-V | |
| | - VM execution, live migration, replication | |
| | - Discrete Device Assignment (DDA) for GPU/NVMe passthrough| |
| | - Shielded VMs with HGS attestation | |
| +------------------------------------------------------------+ |
+------------------------------------------------------------------+
| Hardware (validated Azure Local catalog) |
+------------------------------------------------------------------+
Key Azure Local-specific capabilities:
- Azure Arc integration: VMs can be managed from the Azure Portal, assigned Azure RBAC roles, and monitored with Azure Monitor -- a hybrid cloud management model.
- Storage Spaces Direct (S2D): Replaces the need for external SANs by pooling local NVMe/SSD/HDD across cluster nodes into a software-defined storage layer. Comparable to vSAN.
- Windows Failover Clustering (WSFC): Provides HA for VMs. When a node fails, VMs are restarted on surviving nodes. Comparable to vSphere HA.
- Live Migration: Hyper-V's equivalent of vMotion. Supports TCP (optionally compressed) or SMB 3.0 as the migration transport -- SMB Direct (RDMA) where the network supports it -- compared to a dedicated TCP channel in VMware and libvirt-managed TCP in KVM.
Windows Admin Center and PowerShell Management
Azure Local is managed through two primary interfaces:
Windows Admin Center (WAC): A web-based management tool (similar in role to vCenter's web client or Cockpit for Linux). WAC provides VM lifecycle management, cluster health monitoring, storage management, and network configuration through a browser UI. Unlike vCenter, WAC is free and does not require a separate server (it can run on one of the cluster nodes, though a dedicated management server is recommended for production).
PowerShell: The primary automation interface. Hyper-V and failover clustering expose comprehensive PowerShell cmdlets:
| PowerShell Cmdlet | What It Does | OVE Equivalent |
|---|---|---|
| `Get-VM` | List VMs | `kubectl get vm` |
| `New-VM -Name "vm1" -Generation 2` | Create VM | `kubectl apply -f vm.yaml` |
| `Start-VM -Name "vm1"` | Start VM | `virtctl start vm1` |
| `Stop-VM -Name "vm1" -TurnOff` | Force power off | `kubectl delete vmi vm1` |
| `Move-VM -Name "vm1" -DestinationHost "node2"` | Live migrate | `virtctl migrate vm1` |
| `Set-VMProcessor -VMName "vm1" -Count 8` | Set vCPU count | Edit VM spec, `kubectl apply` |
| `Add-VMHardDiskDrive -VMName "vm1" -Path "..."` | Add disk | Edit VM spec, add volume |
| `Get-VMNetworkAdapter` | List NICs | `kubectl get vmi -o yaml` |
| `Enable-VMTPM -VMName "vm1"` | Enable vTPM | Set `spec.domain.tpm` |
| `Set-VMFirmware -VMName "vm1" -EnableSecureBoot On` | Enable Secure Boot | Set `spec.domain.firmware.bootloader.efi.secureBoot` |
Comparison to KVM / ESXi
| Aspect | KVM | Hyper-V | ESXi |
|---|---|---|---|
| Hypervisor location | Kernel module in Linux | Separate binary below root partition | Custom VMkernel |
| Host OS | Full Linux (can run any Linux software) | Root partition (restricted Windows Server) | No general-purpose OS |
| Open source | Yes (GPL v2) | No (proprietary) | No (proprietary) |
| CPU scheduling | Linux CFS scheduler + cgroups | Custom hypervisor scheduler | Custom VMkernel scheduler |
| Paravirt I/O | Virtio (OASIS standard) | VMBus / synthetic devices (MS proprietary) | PVSCSI + vmxnet3 (VMware proprietary) |
| Paravirt guest drivers | Built into Linux kernel; installable for Windows | Built into Windows; hv_* modules in Linux | VMware Tools (installable) |
| Device passthrough | VFIO framework | Discrete Device Assignment (DDA) | DirectPath I/O |
| Nested virtualization | Supported (KVM in KVM) | Supported (Hyper-V in Hyper-V) | Supported (ESXi in ESXi) |
| Max vCPUs per VM | 710 (RHEL 9) / 1024+ (upstream) | 240 (Windows Server 2022) | 768 (vSphere 8) |
| Max RAM per VM | 16 TB (RHEL 9) | 12 TB (Windows Server 2022) | 24 TB (vSphere 8) |
| Live migration transport | TCP (libvirt-managed) | SMB 3.0 / TCP | TCP (proprietary) |
| Management API | Libvirt API (XML/RPC) + QMP (JSON) | WMI / CIM / PowerShell | SOAP / REST (vSphere API) |
| Clustering / HA | External (Kubernetes in OVE, Pacemaker otherwise) | Windows Failover Clustering (built-in) | vSphere HA (vCenter-managed) |
| Ecosystem | Open-source tooling, Linux-native | Microsoft/Windows ecosystem | VMware ecosystem (third-party) |
How the Candidates Handle This
Comparison Table
| Aspect | VMware (ESXi) | OVE (KVM/QEMU/Libvirt) | Azure Local (Hyper-V) | Swisscom ESC |
|---|---|---|---|---|
| Hypervisor engine | VMkernel (proprietary microkernel) | KVM kernel module + QEMU user-space | Hyper-V (hvix64.exe / hvax64.exe) | VMware ESXi (Swisscom-managed) |
| Paravirt framework | PVSCSI + vmxnet3 (proprietary) | Virtio (OASIS open standard) | VMBus / Synthetic devices (proprietary) | PVSCSI + vmxnet3 |
| Guest driver requirement | VMware Tools required for performance | Virtio drivers required (built into Linux, installable for Windows) | Integration Services required (built into Windows, hv_* modules in Linux) | VMware Tools |
| Machine model | VM Hardware Version (v13-v21) | QEMU machine type (q35 recommended) | Generation 1 / Generation 2 | VM Hardware Version |
| UEFI / Secure Boot | OVMF-based, supported since vHW 13 | OVMF-based, native q35 support | Native UEFI, mandatory for Gen 2 | OVMF-based (provider-managed) |
| Management layer | vCenter (SOAP/REST API), PowerCLI | KubeVirt -> libvirt -> QMP -> QEMU | WAC, PowerShell, Azure Portal (Arc) | vCloud Director API (provider portal) |
| Disk format | VMDK (flat, thin, thick) | qcow2 (thin, snapshots, backing chains) or raw | VHDX (dynamic, fixed, differencing) | VMDK |
| Block operations | Storage vMotion, VAIO framework | QEMU block jobs (mirror, commit, stream) | Storage Live Migration (Hyper-V) | Provider-managed |
| Migration protocol | vMotion (encrypted, proprietary TCP) | Libvirt-managed TCP (QEMU migration stream) | SMB 3.0 (encrypted) | Provider-managed |
| In-kernel I/O acceleration | VMkernel handles all I/O in-kernel | vhost-net (kernel), vhost-user (DPDK) for network; io_uring for block | VMBus I/O through root partition kernel | VMkernel |
| Device passthrough | DirectPath I/O (limited vMotion) | VFIO (no live migration for passthrough devices) | DDA (no live migration for DDA devices) | Not available to customers |
| Debuggability | esxtop, vm-support bundles, vRealize | Layered: kubectl/virsh/QMP, Prometheus, node logs | perfmon, WAC diagnostics, Event Viewer | Opaque -- ticketing required |
Prose Analysis
OVE -- Depth of Stack: The KVM/QEMU/libvirt stack in OVE is the most transparent of all candidates. Every layer is open-source, every decision is traceable (domain XML, QEMU command line, KVM ioctl parameters), and every metric is exposable (libvirt domstats, QEMU block statistics, KVM exit counters via perf kvm stat). This transparency is both a strength and a burden. Strength: when something goes wrong, the debugging surface is unlimited. Burden: the operations team must understand three layers of technology (Kubernetes + libvirt + QEMU/KVM) instead of one (vCenter).
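That traceability can be walked layer by layer with standard tooling. A minimal sketch, assuming a VM named vm1 in namespace vms; the virt-launcher pod name shown is illustrative only:

```bash
# Kubernetes layer: the declarative VM object and the running VMI
kubectl get vm,vmi -n vms

# Libvirt layer: the domain XML KubeVirt generated, inside the virt-launcher pod
kubectl exec -n vms virt-launcher-vm1-abcde -- virsh dumpxml 1

# QEMU layer: block and vCPU statistics exposed through libvirt
kubectl exec -n vms virt-launcher-vm1-abcde -- virsh domstats

# KVM layer: VM exit reasons and counts, captured on the worker node (requires root and perf)
perf kvm stat record -p "$(pgrep -f qemu-kvm | head -n1)" sleep 10
perf kvm stat report
```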
The virtio driver requirement for Windows guests is a migration friction point. Every Windows VM moving from VMware to OVE must have its storage and network drivers replaced from PVSCSI/vmxnet3 to virtio-blk/virtio-net. MTV automates this, but driver injection failures are the #1 cause of failed VM conversions in Red Hat's published customer case studies. The PoC must validate driver injection for every Windows version in the fleet.
Azure Local -- Integrated but Opaque: Hyper-V's root partition model means that all I/O traverses the root partition's Windows kernel. This adds a layer compared to KVM (where virtio data paths can bypass QEMU entirely via vhost). However, VMBus is highly optimized, and Microsoft has invested decades in the Windows storage and networking stacks. For Windows guest workloads, Azure Local with Generation 2 VMs and synthetic devices will deliver performance competitive with KVM + virtio, often with less tuning effort.
The Azure Arc integration provides a management experience that neither OVE nor VMware can match for organizations already invested in Azure identity and governance. VMs appear in the Azure Portal alongside cloud resources, inheriting Azure RBAC, Policy, and Monitor. This is compelling for hybrid cloud strategies but creates a dependency on Azure control plane availability (outbound HTTPS to Azure must always be reachable for management, though VMs continue running if Arc connectivity is lost).
Swisscom ESC -- Abstracted Away: From a hypervisor technology perspective, ESC is the least interesting candidate because it is VMware under the hood. The evaluation questions are not "how does the hypervisor work?" but rather "what happens when Swisscom changes the hypervisor?" If Broadcom's licensing drives Swisscom to migrate ESC to a different engine (CloudStack, OpenStack, KVM-native), the customer faces a re-migration -- hypervisor internals matter again. The evaluation must include contractual protections for this scenario.
Key Takeaways
- KVM's design is fundamentally different from ESXi's. ESXi is a monolithic, proprietary kernel that handles everything. KVM is a minimal kernel module that delegates device emulation to QEMU, management to libvirt, and scheduling/isolation to the Linux kernel. This modular design gives OVE flexibility (swap out any component) but adds operational complexity (more layers to understand).
- Virtio is the performance equalizer. KVM's I/O performance with emulated devices (e1000, IDE) is poor. With virtio, it matches or exceeds ESXi's PVSCSI/vmxnet3 in throughput and latency benchmarks. However, virtio requires guest drivers -- this is a non-issue for Linux guests (built-in since 2008) but a migration task for every Windows VM. The PoC must validate virtio driver installation and performance for every Windows OS version in the fleet.
- QEMU's block layer is a differentiator. qcow2's backing chain model, live block jobs (mirror, commit, stream), and dirty bitmap-based incremental backup are more capable than VMDK's equivalents. These features directly enable efficient VM templating, live storage migration, and backup without third-party tools (a short sketch of these primitives follows this list). However, raw format on LVM or Ceph RBD is recommended over qcow2 for production workloads requiring maximum I/O performance -- qcow2's metadata overhead is measurable under heavy random write loads.
- Libvirt is not an optional layer -- it is load-bearing. In OVE, every VM lifecycle operation flows through libvirt. KubeVirt translates Kubernetes CRs to domain XML, libvirt translates domain XML to QEMU command lines, and QEMU talks to KVM. Understanding this chain is essential for production troubleshooting. The per-pod libvirtd model in KubeVirt improves isolation but means the operations team cannot simply SSH to a host and run `virsh list` to see all VMs -- each VM's libvirtd is inside its pod.
- Hyper-V's root partition model is architecturally distinct from KVM. The root partition (running Windows) handles all I/O for child partitions. This adds a layer but also means the Windows driver ecosystem (storage, networking, management) is directly available. For organizations with strong Windows operations teams, Hyper-V is operationally familiar. For organizations with strong Linux/Kubernetes teams, KVM/QEMU is the natural fit.
- Machine type / VM generation selection is a one-time, irreversible decision. VMs created with the `pc` (i440FX) machine type in QEMU or Generation 1 in Hyper-V cannot be converted in-place to `q35` or Generation 2. The migration from VMware is the natural point to ensure all VMs are created on the modern machine type (q35 / Gen 2), which unlocks PCIe topology, Secure Boot, vTPM, and larger disk support.
- Performance debugging requires hypervisor-level instrumentation. Understanding the vCPU execution loop (VM exits, exit reasons, interrupt injection latency) is necessary to diagnose performance issues that appear as "the VM is slow" at the application layer. OVE exposes this via `perf kvm stat`, KVM tracepoints, and QEMU block/net statistics. Azure Local exposes this via Hyper-V performance counters in perfmon. Both are more granular than what most VMware administrators use (`esxtop` covers the surface but the VMkernel internals are opaque). ESC provides no guest-level hypervisor metrics to customers.
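As referenced in the block-layer takeaway above, a minimal sketch of those qcow2 primitives (paths, VM and disk names are illustrative; in OVE these operations are normally driven by CDI and the CSI layer rather than run by hand):

```bash
# Thin clone from a template: the overlay stores only blocks that diverge from the backing file
qemu-img create -f qcow2 -b /templates/rhel9-golden.qcow2 -F qcow2 /vms/app01.qcow2

# Inspect the backing chain and actual allocation
qemu-img info --backing-chain /vms/app01.qcow2

# Live block job: merge the active overlay back into its backing file while the VM keeps running
virsh blockcommit app01 vda --active --verbose --pivot
```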
Discussion Guide
Use these questions when engaging with vendor architects, Red Hat solution engineers, Microsoft technical specialists, or Swisscom account teams. The questions probe implementation depth beyond datasheet claims.
For OVE (Red Hat / Partner)
- Virtio driver maturity for Windows: "For our Windows Server 2019 and 2022 fleet (~2,000 VMs), what is the current virtio-win driver version, its WHQL certification status, and the known issues list? When a virtio driver bug is found in production, what is the patch delivery timeline -- does it follow RHEL errata cadence or a separate channel? Can we run a VMware PVSCSI/vmxnet3 driver alongside virtio during a migration transition period, or must the swap be atomic?"
- QEMU machine type migration: "We expect some legacy VMs to arrive from VMware on the `pc` (i440FX) machine type. What is the supported path to move them to `q35`? Is there an in-place conversion, or does it require a re-import? What guest OS changes are needed (driver re-detection, device path changes)?"
- vCPU execution efficiency: "For our latency-sensitive pricing engine VMs (8 vCPUs, dedicated CPU placement, 1 GB huge pages), what is the expected VM exit rate under steady-state load? Can you demonstrate `perf kvm stat` output from a comparable workload? What exit reasons dominate, and what tuning knobs reduce them (e.g., halt polling, APICv, x2apic passthrough)?"
- Block layer performance -- qcow2 vs raw: "What is the measured IOPS and latency difference between qcow2 and raw format for a 4K random write workload on Ceph RBD storage? Under what circumstances does Red Hat recommend raw over qcow2 in OVE? If we use raw, how do we get snapshot functionality (is it delegated to the CSI driver / Ceph snapshots)?"
- Per-pod libvirtd operational impact: "With 200+ VMs on a single worker node, that means 200+ libvirtd instances. What is the memory and CPU overhead per libvirtd instance? How does this compare to a traditional single-libvirtd model? If we need to debug a libvirt issue, what is the operational procedure -- do we `kubectl exec` into every pod, or is there a centralized log aggregation path?"
For Azure Local (Microsoft / Partner)
- VMBus performance vs virtio: "For a Linux guest running a database workload (high IOPS, low latency), how does VMBus synthetic SCSI performance compare to KVM virtio-scsi on equivalent hardware? Do you have published benchmark data? What is the recommended storage configuration for maximum IOPS -- S2D with mirror, or passthrough NVMe via DDA?"
- Generation 2 migration from VMware: "We are migrating VMs from VMware (VMDK, BIOS-based, PVSCSI) to Azure Local (VHDX, UEFI/Gen2, synthetic SCSI). What is the conversion path? Does Azure Migrate handle the firmware transition (BIOS to UEFI) automatically, or must we rebuild the boot partition? For Linux VMs, does the conversion inject `hv_*` kernel modules automatically?"
- Root partition overhead and failure modes: "Since all child partition I/O traverses the root partition's Windows kernel, what happens if the root partition experiences a kernel bug, driver crash, or BSOD? Do child VMs crash immediately, or is there a grace period? Is the root partition's Windows kernel a different build from standard Windows Server, with a different patch cadence?"
- Hyper-V performance counters for capacity planning: "What performance counters are available for measuring hypervisor-level overhead? Specifically: intercept rate, synthetic interrupt delivery latency, VMBus throughput per VM, and SLAT miss rate. Can these be streamed to Azure Monitor for long-term trending, or only accessed locally via perfmon?"
For Swisscom ESC
- Hypervisor transition plan: "If Broadcom licensing changes make VMware ESXi economically unviable for the ESC platform, what is Swisscom's stated migration path? Will customers be migrated transparently (same VMs, same APIs, different hypervisor underneath), or will it require customer-side re-migration effort? Is there a contractual SLA covering hypervisor change notification periods and migration support?"