Modern datacenters and beyond

Virtualization Foundational Concepts

Why This Matters

Every IaaS platform under evaluation -- OpenShift Virtualization Engine (OVE), Azure Local, and Swisscom Enterprise Service Cloud (ESC) -- ultimately sits on top of a hypervisor executing virtual machines on physical hardware. Before evaluating any candidate, the team must have a shared, precise understanding of the foundational mechanisms that make virtualization possible and performant. These are not academic topics. Each concept directly influences VM density, latency-sensitive workload placement, security posture, and the ability to run 5,000+ VMs at financial-grade service levels.

A misconfigured NUMA topology can silently halve the throughput of a database VM. A missing vTPM capability can block a Windows Server 2022 Secured-Core deployment. A hypervisor that cannot pin CPUs will fail regulatory requirements for deterministic latency in trading systems. This document equips the evaluation team to ask the right questions, interpret vendor answers critically, and identify gaps before they surface in production.

The concepts below form the bottom layer of the technology stack. Everything discussed in later chapters -- KVM internals, KubeVirt architecture, live migration, storage I/O paths -- builds on these foundations.


Concepts

1. Type-1 vs Type-2 Hypervisors

What It Is and Why It Exists

A hypervisor is the software layer that creates and manages virtual machines by multiplexing physical hardware resources (CPU, memory, I/O) across multiple isolated guest operating systems. The distinction between Type-1 and Type-2 is about where the hypervisor sits in the software stack relative to the hardware and any host operating system.

Type-1 (bare-metal) hypervisors run directly on the physical hardware with no intervening general-purpose OS. The hypervisor is the first software that executes after firmware/UEFI. Examples: VMware ESXi, Microsoft Hyper-V (the hypervisor itself runs on bare metal, with a management OS in the root partition), KVM (when the Linux kernel is configured as a hypervisor host).

Type-2 (hosted) hypervisors run as an application on top of a conventional operating system. The host OS manages hardware; the hypervisor uses OS-level APIs to create VMs. Examples: Oracle VirtualBox, VMware Workstation, Parallels Desktop.

  Type-1 (Bare-Metal)                     Type-2 (Hosted)

  +---------------------------+            +---------------------------+
  |  VM 1  |  VM 2  |  VM 3   |            |  VM 1  |  VM 2  |  VM 3   |
  +--------+--------+---------+            +--------+--------+---------+
  |     Hypervisor (Type-1)   |            |    Hypervisor Application |
  +---------------------------+            +---------------------------+
  |   Hardware (CPU/RAM/NIC)  |            |     Host Operating System |
  +---------------------------+            +---------------------------+
                                           |   Hardware (CPU/RAM/NIC)  |
                                           +---------------------------+

How It Works -- Technical Internals

The critical distinction is about privilege levels and trap handling.

On x86-64, the CPU provides four protection rings (Ring 0 through Ring 3). Traditionally, the OS kernel runs in Ring 0 (most privileged), and user-space applications run in Ring 3. When Intel VT-x / AMD-V extensions are enabled, the CPU adds a new privilege layer below Ring 0 called VMX root mode (Intel) or host mode (AMD). The hypervisor runs in VMX root mode. Guest OS kernels believe they are running in Ring 0, but they are actually in VMX non-root mode -- their privileged instructions are intercepted ("trapped") by the CPU and forwarded to the hypervisor.

Type-1 mechanics: The hypervisor executes in VMX root mode directly. When a guest OS executes a sensitive instruction (e.g., writing to a control register like CR3, or executing CPUID), the CPU triggers a VM exit -- control transfers from VMX non-root mode back to the hypervisor in VMX root mode. The hypervisor handles the event, modifies state as needed, and executes a VM entry (VMLAUNCH / VMRESUME) to return control to the guest. The data structure that holds guest CPU state during these transitions is the Virtual Machine Control Structure (VMCS) on Intel or the Virtual Machine Control Block (VMCB) on AMD.

Type-2 mechanics: The hypervisor runs as a user-space process under the host OS. It must request CPU virtualization resources through the host OS kernel. On Linux, this means opening /dev/kvm and issuing ioctl() calls. The host OS kernel module (e.g., kvm.ko) sets up VMX root mode on behalf of the hypervisor process. Every VM exit first lands in the kernel module, which decides whether to handle it in-kernel or bounce it up to the user-space hypervisor process -- adding an extra context switch.
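
A minimal sketch of that handshake from the hosted side, using the stable KVM ioctl API (error handling trimmed; opening /dev/kvm typically requires membership in the kvm group):

  /* Ask kvm.ko for virtualization services, the way QEMU does. */
  #include <fcntl.h>
  #include <linux/kvm.h>
  #include <stdio.h>
  #include <sys/ioctl.h>

  int main(void) {
      int kvm = open("/dev/kvm", O_RDWR);            /* talk to kvm.ko         */
      if (kvm < 0) { perror("open /dev/kvm"); return 1; }

      int api = ioctl(kvm, KVM_GET_API_VERSION, 0);  /* stable at 12           */
      if (api != KVM_API_VERSION) { fprintf(stderr, "API %d\n", api); return 1; }

      int vm = ioctl(kvm, KVM_CREATE_VM, 0);         /* kernel enters VMX root */
      int vcpu = ioctl(vm, KVM_CREATE_VCPU, 0);      /* one VMCS per vCPU      */
      if (vm < 0 || vcpu < 0) { perror("create"); return 1; }

      /* Each KVM_RUN would enter the guest (VMRESUME); every return to
       * user space is a VM exit the kernel chose not to handle itself. */
      printf("vm fd=%d, vcpu fd=%d\n", vm, vcpu);
      return 0;
  }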

The KVM Nuance

KVM blurs the Type-1 / Type-2 boundary. The Linux kernel loads kvm.ko and kvm-intel.ko (or kvm-amd.ko), turning the kernel itself into a hypervisor. The kernel runs in VMX root mode. QEMU runs in user space and handles device emulation, but all CPU and memory virtualization happens in-kernel. When the host Linux system is dedicated entirely to running VMs (as in an OVE worker node), it functions as a de facto Type-1 hypervisor, much like a dedicated ESXi host, despite technically being a Linux kernel.

  KVM Architecture (de facto Type-1)

  +---------------------------+
  |  VM 1  |  VM 2  |  VM 3   |     Guest VMs (VMX non-root)
  +--------+--------+---------+
  |  QEMU  |  QEMU  |  QEMU   |     User-space device emulation (Ring 3)
  +--------+--------+---------+
  |  Linux Kernel + kvm.ko    |     VMX root mode (Ring 0 / Root)
  +---------------------------+
  |   Hardware (CPU/RAM/NIC)  |
  +---------------------------+

Relationship to Other Topics

Type-1 is a prerequisite for everything else in this document. CPU virtualization extensions (VT-x / AMD-V) are what make hardware-assisted Type-1 hypervisors possible. NUMA awareness, CPU pinning, and memory management all operate within the Type-1 model. Nested virtualization (running a hypervisor inside a VM) essentially creates a second layer of VMX root/non-root transitions.

Why It Matters at 5,000+ VM Scale

At enterprise scale, only Type-1 hypervisors are viable. The extra context-switch overhead of a true Type-2 architecture would compound across thousands of VMs into measurable latency and CPU waste. All three candidates use Type-1 or de facto Type-1 hypervisors:

  • OVE: KVM inside the Linux kernel -- de facto Type-1.
  • Azure Local: Hyper-V -- Type-1 with a root partition.
  • Swisscom ESC: VMware ESXi (provider-managed) -- Type-1.

2. CPU Virtualization Extensions (VT-x / AMD-V)

What It Is and Why It Exists

Before 2005, x86 virtualization required binary translation -- the hypervisor would scan guest OS kernel code at runtime, find privileged instructions, and rewrite them into safe equivalents. This worked (VMware built a business on it), but it was slow and enormously complex. Intel VT-x (codenamed "Vanderpool," introduced 2005) and AMD-V (codenamed "Pacifica," introduced 2006) added hardware support directly into the CPU, eliminating the need for binary translation.

These extensions are not optional luxuries. They are mandatory for every candidate platform. A server without VT-x/AMD-V enabled in BIOS cannot run any of the hypervisors under evaluation.
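
A quick user-space probe for these extensions via CPUID -- note this reports only CPU support; whether the feature is actually enabled in BIOS must be verified separately:

  /* Probe CPUID for hardware virtualization support.
   * Intel VT-x: CPUID leaf 1, ECX bit 5 (VMX).
   * AMD-V:      CPUID leaf 0x80000001, ECX bit 2 (SVM). */
  #include <cpuid.h>
  #include <stdio.h>

  int main(void) {
      unsigned int eax, ebx, ecx, edx;
      int vmx = 0, svm = 0;

      if (__get_cpuid(1, &eax, &ebx, &ecx, &edx))
          vmx = (ecx >> 5) & 1;
      if (__get_cpuid(0x80000001, &eax, &ebx, &ecx, &edx))
          svm = (ecx >> 2) & 1;

      printf("VT-x: %s  AMD-V: %s\n", vmx ? "yes" : "no", svm ? "yes" : "no");
      return 0;
  }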

How It Works -- Technical Internals

Intel VT-x

VT-x introduces two new CPU operating modes:

  • VMX root mode: where the hypervisor runs with full control over the hardware.
  • VMX non-root mode: where guests run; sensitive operations trap back to root mode via VM exits.

The lifecycle of a VT-x enabled system:

  1. The hypervisor executes VMXON to enter VMX root mode. This requires setting bit 13 of CR4 (the VMXE bit) and providing a 4KB-aligned memory region for the VMXON region.
  2. The hypervisor allocates and initializes a VMCS (Virtual Machine Control Structure) for each vCPU. The VMCS is a hardware-defined data structure containing:
    • Guest-state area: Values of guest registers (RSP, RIP, RFLAGS, CR0, CR3, CR4, segment selectors, GDTR, IDTR, etc.) that are loaded on VM entry and saved on VM exit.
    • Host-state area: Values of host registers loaded on VM exit.
    • VM-execution control fields: Bit fields that specify which guest actions cause VM exits (e.g., whether HLT, RDTSC, MOV to CR3 trap).
    • VM-exit information fields: Populated by the CPU on VM exit to tell the hypervisor why the exit occurred (exit reason, exit qualification, instruction length).
  3. VMLAUNCH starts initial execution of the guest. VMRESUME continues it after a VM exit.
  4. When the guest executes a trapped instruction or an event occurs (external interrupt, page fault with EPT violation), the CPU performs a VM exit: saves guest state to the VMCS, loads host state, and transfers control to the hypervisor's VM-exit handler.

  VM Entry / VM Exit Cycle (Intel VT-x)

  Hypervisor (VMX root)         Guest VM (VMX non-root)
  =====================         =======================
        |                              |
        |--- VMLAUNCH/VMRESUME ------->|
        |                              |  Guest executes code
        |                              |  at near-native speed
        |                              |
        |<------ VM EXIT --------------|  (privileged op, EPT
        |   CPU saves guest state      |   violation, interrupt,
        |   to VMCS, loads host state  |   I/O trap, etc.)
        |                              |
        |  Hypervisor inspects         |
        |  exit_reason in VMCS         |
        |  and handles the event       |
        |                              |
        |--- VMRESUME ---------------->|
        |                              |
        v                              v

Key VMCS Fields (Partial List)

| Field                                | Purpose                                                                   |
|--------------------------------------|---------------------------------------------------------------------------|
| GUEST_RIP, GUEST_RSP                 | Guest instruction pointer and stack pointer                               |
| GUEST_CR3                            | Guest page table base -- used with EPT for two-level address translation  |
| PIN_BASED_VM_EXEC_CONTROL            | Controls external interrupt exiting, NMI exiting, virtual NMIs            |
| PRIMARY_PROC_BASED_VM_EXEC_CONTROL   | Controls HLT exiting, INVLPG exiting, RDTSC exiting, CR3 load/store exiting |
| SECONDARY_PROC_BASED_VM_EXEC_CONTROL | Controls EPT enable, VPID enable, unrestricted guest, RDRAND exiting      |
| VM_EXIT_REASON                       | 32-bit field populated by CPU on exit: bits 0-15 = basic exit reason (0=exception, 1=external interrupt, 10=CPUID, 28=CR access, 48=EPT violation, etc.) |
| EXIT_QUALIFICATION                   | Additional info about the exit (e.g., which CR was accessed, which I/O port) |

AMD-V (SVM)

AMD's equivalent uses different terminology but achieves the same result:

  • VMRUN enters the guest (the counterpart of VMLAUNCH/VMRESUME); #VMEXIT returns control to the host.
  • The VMCB (Virtual Machine Control Block) plays the role of Intel's VMCS.
  • Host mode and guest mode correspond to VMX root and non-root modes.

One practical difference: AMD-V stores the VMCB in regular memory accessible via standard MOV instructions. Intel's VMCS requires special VMREAD/VMWRITE instructions to access fields, which is slightly more cumbersome for hypervisor developers but allows the CPU more flexibility in its internal representation.

Extended Page Tables (EPT) / Nested Page Tables (NPT)

A critical sub-feature: without EPT/NPT, the hypervisor must intercept every guest page table modification (every write to CR3, every INVLPG) to maintain shadow page tables. This is extremely expensive at scale.

EPT/NPT adds a second level of page table translation in hardware:

  Two-Level Address Translation (EPT / NPT)

  Guest Virtual Address (GVA)
          |
          | Guest page tables (controlled by guest OS)
          v
  Guest Physical Address (GPA)
          |
          | EPT / NPT (controlled by hypervisor)
          v
  Host Physical Address (HPA)  -->  Actual RAM

  The CPU hardware walks BOTH levels automatically.
  No hypervisor intervention needed for normal memory access.

This eliminates the vast majority of memory-related VM exits. Without EPT/NPT, a workload with heavy memory allocation patterns (databases, JVMs) would generate millions of VM exits per second. With EPT/NPT, those same operations proceed at near-native speed.

VPID (Virtual Processor Identifiers)

When the CPU switches between VMs (VM exit followed by VM entry to a different VMCS), it traditionally must flush the entire TLB (Translation Lookaside Buffer) because TLB entries from VM-A are invalid for VM-B. VPID tags each TLB entry with the ID of the VM that created it, allowing the CPU to keep TLB entries from multiple VMs simultaneously. This dramatically reduces VM-entry and VM-exit costs.

Relationship to Other Topics

VT-x/AMD-V is the foundation for all other topics. NUMA awareness operates on top of the memory model that EPT/NPT enables. CPU pinning assigns vCPUs to physical cores that are participating in VMX transitions. Huge Pages interact with EPT to reduce two-level page walk overhead. Nested virtualization uses the hardware's ability to handle multiple levels of VMX root/non-root (VMCS shadowing on modern CPUs; software emulation of the nested VMCS on older ones).

Why It Matters at 5,000+ VM Scale

The efficiency of VT-x/AMD-V determines the "virtualization tax" -- the percentage of CPU cycles consumed by hypervisor overhead rather than useful guest work. Modern implementations with EPT + VPID + posted interrupts reduce this overhead to 2-5% for compute-bound workloads. Over 5,000 VMs on hundreds of physical cores, even a 1% improvement in virtualization efficiency reclaims the equivalent of multiple physical servers.

For procurement: every server purchased must have VT-x or AMD-V and EPT/NPT enabled in BIOS. Some server vendors ship with these features disabled by default. This must be part of the hardware commissioning checklist.


3. NUMA Awareness (Non-Uniform Memory Access)

What It Is and Why It Exists

Modern multi-socket servers do not have a single, uniform pool of memory. Instead, each CPU socket has its own directly-attached memory banks. A CPU accessing its own local memory is fast (local access latency: ~80-100ns). A CPU accessing memory attached to another socket must traverse an inter-socket link (Intel UPI, AMD Infinity Fabric) -- this remote access is 1.5x to 2x slower (~120-190ns). The memory is "non-uniform" in access time depending on which socket is asking.

  Dual-Socket NUMA Topology

  +--------------------------------------------+
  |  NUMA Node 0                               |
  |  +----------------+   +----------------+   |
  |  |  CPU Socket 0  |   |   Local DRAM   |   |
  |  |  Cores 0-31    |---|   256 GB       |   |
  |  +----------------+   +----------------+   |
  +--------------------------------------------+
          |  UPI / Infinity Fabric (cross-socket link)
          |  ~1.5-2x latency penalty
  +--------------------------------------------+
  |  NUMA Node 1                               |
  |  +----------------+   +----------------+   |
  |  |  CPU Socket 1  |   |   Local DRAM   |   |
  |  |  Cores 32-63   |---|   256 GB       |   |
  |  +----------------+   +----------------+   |
  +--------------------------------------------+

  Total: 512 GB RAM, but NOT uniformly accessible.
  Core 5 accessing Node 0 DRAM: ~90ns
  Core 5 accessing Node 1 DRAM: ~160ns

Some modern CPUs (AMD EPYC, Intel Xeon with Sub-NUMA Clustering / SNC) expose multiple NUMA nodes per socket. An AMD EPYC 9654 (96 cores) can expose 4 NUMA nodes on a single socket, each with its own memory controller. A dual-socket server with SNC enabled might present 4 or 8 NUMA nodes to the OS.

  AMD EPYC -- 4 NUMA Nodes per Socket (NPS4 mode)

  Socket 0                        Socket 1
  +-------+-------+               +-------+-------+
  | Node0 | Node1 |               | Node4 | Node5 |
  | 24c   | 24c   |               | 24c   | 24c   |
  | 64GB  | 64GB  |               | 64GB  | 64GB  |
  +-------+-------+               +-------+-------+
  | Node2 | Node3 |               | Node6 | Node7 |
  | 24c   | 24c   |               | 24c   | 24c   |
  | 64GB  | 64GB  |               | 64GB  | 64GB  |
  +-------+-------+               +-------+-------+

  8 NUMA nodes total, 512 GB total
  Cross-CCD access: ~1.2x penalty
  Cross-socket access: ~1.8x penalty

How It Works -- Technical Internals

The OS discovers NUMA topology through the ACPI SRAT (System Resource Affinity Table) and SLIT (System Locality Information Table) provided by firmware:

  • SRAT maps each CPU and memory range to a proximity domain (a NUMA node).
  • SLIT is a matrix of relative access distances between nodes (10 = local; 20 = roughly twice the local latency).

On Linux, this information is exposed via:

  • /sys/devices/system/node/node*/ -- per-node CPU lists, memory statistics, and distance files
  • numactl --hardware and lscpu -- command-line summaries
  • libnuma -- programmatic access, as in the sketch below
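
A small sketch using libnuma (compile with -lnuma) to read back the topology the kernel parsed from SRAT/SLIT -- the same data a NUMA-aware hypervisor scheduler consumes:

  /* Query NUMA node count, SLIT distances, and CPU-to-node mapping. */
  #include <numa.h>
  #include <stdio.h>

  int main(void) {
      if (numa_available() < 0) { puts("no NUMA support"); return 1; }

      int nodes = numa_max_node() + 1;
      printf("%d NUMA node(s)\n", nodes);

      /* SLIT distances are relative: 10 = local, 20 = ~2x local latency. */
      for (int a = 0; a < nodes; a++)
          for (int b = 0; b < nodes; b++)
              printf("node %d -> node %d: distance %d\n",
                     a, b, numa_distance(a, b));

      printf("CPU 0 is on node %d\n", numa_node_of_cpu(0));
      return 0;
  }
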

NUMA-aware hypervisor scheduling means that when a VM is assigned to physical cores, the hypervisor ensures that the VM's memory is allocated from the DRAM banks local to those cores. The three key scheduling decisions are:

  1. vCPU placement: Which physical cores run this VM's virtual CPUs?
  2. Memory allocation: Which NUMA node's memory banks supply this VM's RAM?
  3. Alignment: Do (1) and (2) match? If a VM's vCPUs run on Socket 0 but its memory is on Socket 1, every memory access crosses the inter-socket link.

What happens without NUMA awareness:

A NUMA-unaware scheduler might place a VM's 8 vCPUs across both sockets (4 on Node 0, 4 on Node 1) and scatter its memory across all nodes. This results in:

  • Roughly half of all memory accesses crossing the inter-socket link
  • Bimodal, unpredictable memory latency (local on some accesses, remote on others)
  • Added contention on the UPI / Infinity Fabric link shared with every other misplaced VM

What happens with proper NUMA awareness:

The hypervisor confines the VM to a single NUMA node: all vCPUs are pinned to cores on Node 0, and all memory is allocated from Node 0's DRAM banks. Result:

  • Every memory access is local (~90ns rather than ~160ns)
  • Deterministic latency and full use of the node's local memory bandwidth
  • No cross-socket traffic competing with other VMs

NUMA and Large VMs

A VM requiring more vCPUs or RAM than a single NUMA node can provide must span nodes. In this case, the hypervisor should present a virtual NUMA topology to the guest OS, so the guest's own NUMA-aware scheduler (e.g., SQL Server's memory node allocation, JVM's NUMA-aware garbage collector) can make optimal decisions:

  Large VM Spanning Two NUMA Nodes

  Physical Host:
  +-------------------+-------------------+
  | NUMA Node 0       | NUMA Node 1       |
  | Cores 0-31        | Cores 32-63       |
  | 256 GB            | 256 GB            |
  +-------------------+-------------------+

  VM with 32 vCPUs, 384 GB RAM:
  +--------------------------------------+
  | vNUMA Node 0       | vNUMA Node 1    |
  | vCPUs 0-15         | vCPUs 16-31     |
  | 192 GB             | 192 GB          |
  | (mapped to pNode0) | (mapped to      |
  |                    |  pNode1)        |
  +--------------------------------------+

  Guest OS sees 2 NUMA nodes and schedules accordingly.

Relationship to Other Topics

NUMA awareness is tightly coupled with CPU pinning (pinning ensures vCPUs stay on the correct NUMA node), Huge Pages (huge page allocation must be NUMA-local), and memory ballooning (reclaiming memory from one NUMA node can force a VM to allocate from a remote node, destroying locality). NUMA also affects live migration: when a VM moves to a new host, its NUMA topology may change, requiring memory re-allocation on the destination.

Why It Matters at 5,000+ VM Scale

In a fleet of 5,000+ VMs running on dual-socket (or quad-socket) servers, NUMA misalignment is the single most common cause of unexplained performance degradation. The symptoms are subtle: a database VM runs 30% slower than expected, but CPU and memory utilization look normal. The root cause is that 40% of its memory accesses are crossing the inter-socket link.

At scale, the impact compounds. If 20% of VMs are NUMA-misaligned and each suffers a 25% throughput penalty, the organization is effectively wasting the capacity of hundreds of cores. For latency-sensitive financial workloads (pricing engines, risk calculations, market data feed processing), NUMA misalignment can cause regulatory SLA breaches.


4. CPU Pinning

What It Is and Why It Exists

By default, a hypervisor's scheduler is free to move a VM's vCPUs across any available physical core -- just as a regular OS scheduler moves threads across cores. This works well for general workloads, but introduces three problems for performance-sensitive VMs:

  1. Cache pollution: When a vCPU migrates from Core 5 to Core 12, it leaves behind a warm L1/L2 cache on Core 5 and starts cold on Core 12. The working set must be reloaded, causing a burst of cache misses.
  2. NUMA violation: If the scheduler moves a vCPU from a core on Node 0 to a core on Node 1, but the VM's memory remains on Node 0, all subsequent memory accesses are remote.
  3. Scheduling jitter: In a contended environment, vCPUs may be preempted and delayed by the host scheduler, causing latency spikes visible to the guest OS.

CPU pinning (also called CPU affinity) binds specific vCPUs to specific physical cores, eliminating migration and guaranteeing that the vCPU always executes on the same core.

How It Works -- Technical Internals

On Linux/KVM

CPU affinity is set via the sched_setaffinity() system call or the cpuset cgroup controller. Since each QEMU vCPU is a Linux thread (a pthread), pinning a vCPU is the same as pinning a thread:

  CPU Pinning Internals (KVM/QEMU)

  QEMU Process (PID 12345)
  +-------------------------------------------------+
  | Main Thread     -> manages VM lifecycle         |
  | vCPU Thread 0   -> pinned to pCPU 4  (cgroup)   |
  | vCPU Thread 1   -> pinned to pCPU 5  (cgroup)   |
  | vCPU Thread 2   -> pinned to pCPU 6  (cgroup)   |
  | vCPU Thread 3   -> pinned to pCPU 7  (cgroup)   |
  | I/O Thread 0    -> pinned to pCPU 8  (cgroup)   |
  | I/O Thread 1    -> pinned to pCPU 9  (cgroup)   |
  +-------------------------------------------------+

  pCPU 4-7: NUMA Node 0, local DRAM channels 0-3
  VM memory: allocated on NUMA Node 0 via mbind() / set_mempolicy()
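
A minimal sketch of both halves of NUMA-aligned pinning -- thread affinity plus local memory binding -- using sched_setaffinity() and libnuma (compile with -lnuma; core and node numbers are illustrative):

  /* Pin the calling thread (e.g., a vCPU thread) to one core and bind
   * its future allocations to that core's NUMA node. */
  #define _GNU_SOURCE
  #include <numa.h>
  #include <sched.h>
  #include <stdio.h>

  int main(void) {
      if (numa_available() < 0) { puts("no NUMA support"); return 1; }

      cpu_set_t set;
      CPU_ZERO(&set);
      CPU_SET(4, &set);                        /* pin to pCPU 4 ...     */
      if (sched_setaffinity(0, sizeof(set), &set) != 0) {
          perror("sched_setaffinity");
          return 1;
      }

      struct bitmask *node = numa_allocate_nodemask();
      numa_bitmask_setbit(node, 0);            /* ... on NUMA node 0    */
      numa_set_membind(node);                  /* allocate locally only */
      numa_free_nodemask(node);

      puts("pinned to core 4; memory bound to node 0");
      return 0;
  }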

The cgroup v2 hierarchy used by Kubernetes (and therefore OVE) controls this through:

  • cpuset.cpus: the physical cores a container's threads (and therefore its vCPU threads) may run on
  • cpuset.mems: the NUMA nodes from which its memory may be allocated

When KubeVirt's dedicatedCpuPlacement is enabled, the Kubernetes CPU Manager switches to the static policy. It carves out a set of physical cores from the node's allocatable pool and assigns them exclusively to the vCPU threads. No other pod's threads can run on those cores.

Hyper-V CPU Pinning

Hyper-V does not expose direct per-core pinning the way KVM does. Instead, it relies on its virtual NUMA topology presentation and NUMA-aware scheduler to keep vCPUs on appropriate cores; NUMA behavior and CPU resource controls are tunable via SCVMM or the Set-VMProcessor PowerShell cmdlet (e.g., -MaximumCountPerNumaNode, -Reserve). For more deterministic placement, administrators combine dedicated NUMA node assignment with host CPU reservations.

VMware CPU Affinity

vSphere provides VM > Edit Settings > Advanced > Scheduling Affinity to bind vCPUs to specific physical cores. However, VMware recommends against using CPU affinity in most cases because it conflicts with DRS (Distributed Resource Scheduler) -- DRS cannot move an affinity-pinned VM.

Dedicated vs Shared CPU Models

| Model              | Mechanism                                                              | Use Case                                                                |
|--------------------|------------------------------------------------------------------------|-------------------------------------------------------------------------|
| Shared (default)   | vCPUs float across all available physical cores; time-sliced           | General workloads, web servers, batch jobs                              |
| Dedicated (pinned) | 1:1 mapping between vCPU and physical core; exclusive                  | Databases, real-time processing, latency-sensitive financial workloads  |
| Isolated           | Dedicated + host kernel tasks excluded from pinned cores via isolcpus  | Ultra-low-latency, NFV, trading systems                                 |

Relationship to Other Topics

CPU pinning is the enforcement mechanism for NUMA awareness -- without pinning, NUMA-aware memory allocation is futile because the scheduler will move vCPUs away from their local memory. CPU pinning also interacts with Huge Pages (a pinned VM's memory can be backed by huge pages for maximum TLB efficiency) and with live migration (a pinned VM must find equivalent free cores on the destination host, which constrains placement).

Why It Matters at 5,000+ VM Scale

In a financial enterprise, two categories of VMs typically require CPU pinning:

  1. Latency-sensitive workloads: Core banking transactions, FIX protocol engines, market data processors, pricing engines. These require deterministic, jitter-free CPU access.
  2. Licensed-per-core software: Oracle Database, SAP HANA, and similar products where licensing costs are tied to the number of physical cores the software can access. CPU pinning can cap that core count, but whether a vendor accepts such soft partitioning for licensing purposes (Oracle, notably, generally does not) must be verified contractually.

At 5,000+ VMs, pinning must be managed systematically. Over-pinning (pinning too many VMs) fragments the physical core pool and reduces scheduling flexibility, lowering overall cluster utilization. Under-pinning (not pinning where needed) causes performance complaints. The evaluation should assess each candidate's ability to mix pinned and unpinned VMs on the same hosts without operational complexity.


5. Memory Management -- Ballooning, KSM, Huge Pages

Overview

Memory is typically the most constrained resource in a virtualized environment. A server with 512 GB of RAM running 80 VMs each requesting 8 GB would need 640 GB -- more than physically available. Hypervisors use several techniques to close this gap, each with distinct trade-offs.

5a. Memory Ballooning

What It Is

Ballooning is a cooperative memory reclamation technique. The hypervisor installs a small driver inside the guest OS (the "balloon driver"). When the host is under memory pressure, the hypervisor instructs the balloon driver to allocate a large block of memory inside the guest. This forces the guest OS to page out or reclaim its own less-used pages. The hypervisor then takes the physical pages backing the balloon allocation and reassigns them to other VMs.

  Memory Ballooning Flow

  Time T0 (no pressure):
  +------------------------------+
  |  Guest OS                    |
  |  App Memory: 6 GB            |
  |  Free/Cache: 2 GB            |
  |  Balloon:    0 GB            |
  +------------------------------+
  Host backing: 8 GB physical pages

  Time T1 (host under pressure, balloon inflates):
  +------------------------------+
  |  Guest OS                    |
  |  App Memory: 6 GB            |
  |  Free/Cache: 0.5 GB          |  <-- Guest OS reclaims cache
  |  Balloon:    1.5 GB (locked) |  <-- Balloon inflated
  +------------------------------+
  Host backing: 6.5 GB physical pages
  Host reclaimed: 1.5 GB --> given to other VMs

  Time T2 (pressure relieved, balloon deflates):
  +------------------------------+
  |  Guest OS                    |
  |  App Memory: 6 GB            |
  |  Free/Cache: 2 GB            |
  |  Balloon:    0 GB            |
  +------------------------------+
  Host backing: 8 GB physical pages

How It Works -- Technical Internals

On KVM, the guest loads the virtio_balloon driver. When the host requests reclamation, the driver allocates guest pages, pins them so the guest OS cannot reuse them, and reports their addresses to the host over a virtqueue. The host then drops the backing physical pages and can hand them to other VMs; deflation reverses the process. VMware's vmmemctl and Hyper-V Dynamic Memory follow the same cooperative pattern over different transports.
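
A sketch of the host-side half of the mechanism: dropping the backing pages for a "ballooned" range with madvise(MADV_DONTNEED), which is effectively what QEMU does for guest RAM:

  /* Release the backing pages for a range of anonymous memory; the
   * range reads back as zeroes if it is ever touched again. */
  #include <stdio.h>
  #include <string.h>
  #include <sys/mman.h>

  int main(void) {
      size_t len = 16 * 1024 * 1024;           /* stand-in for guest RAM */
      char *ram = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
      if (ram == MAP_FAILED) { perror("mmap"); return 1; }

      memset(ram, 1, len);                     /* "guest" uses the RAM   */

      /* Balloon the middle 8 MiB: physical pages go back to the host. */
      if (madvise(ram + 4 * 1024 * 1024, 8 * 1024 * 1024, MADV_DONTNEED))
          perror("madvise");
      else
          puts("released 8 MiB of backing memory to the kernel");
      return 0;
  }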

Risks of Ballooning

  • Ballooning requires guest cooperation: a hung, paused, or compromised guest cannot (or will not) inflate its balloon.
  • Over-inflation forces the guest to swap or invoke its OOM killer, degrading or destabilizing applications.
  • Reclamation takes seconds, not microseconds, so ballooning cannot absorb sudden host memory pressure.
  • Ballooned memory is largely invisible to in-guest monitoring, complicating capacity planning.

5b. KSM (Kernel Same-page Merging)

What It Is

KSM is a Linux kernel feature (available since kernel 2.6.32) that scans memory for pages with identical content across different VMs (or processes), merges them into a single copy marked copy-on-write (COW), and frees the duplicates. The classic example: 50 VMs running the same version of Windows all have identical copies of ntoskrnl.exe in memory. KSM merges these into one shared copy, saving potentially gigabytes of RAM.

  KSM Deduplication

  Before KSM:
  VM1 Memory: [Page A] [Page B] [Page C] [Page D]
  VM2 Memory: [Page A] [Page B] [Page E] [Page F]
  VM3 Memory: [Page A] [Page B] [Page G] [Page H]
  Total host pages used: 12

  After KSM (Pages A and B are identical across VMs):
  VM1 Memory: [Shared A*] [Shared B*] [Page C] [Page D]
  VM2 Memory: [Shared A*] [Shared B*] [Page E] [Page F]
  VM3 Memory: [Shared A*] [Shared B*] [Page G] [Page H]
  Total host pages used: 8  (saved 4 pages)

  * = Copy-on-Write: if any VM modifies the page,
      a private copy is created automatically.

How It Works -- Technical Internals

KSM runs as a kernel thread (ksmd) that periodically scans registered memory regions. The process:

  1. Applications (or QEMU) register memory regions with KSM via madvise(MADV_MERGEABLE).
  2. ksmd scans pages and computes checksums (not full hashes -- it uses a two-tree structure for efficiency: an "unstable tree" for candidate pages and a "stable tree" for already-merged pages).
  3. When two pages with identical content are found, KSM replaces both PTEs (page table entries) with references to a single physical page, sets the page as read-only, and frees the duplicate.
  4. On write, the CPU triggers a COW page fault; the kernel allocates a new page, copies the content, and updates the PTE -- transparent to the application.
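
A minimal sketch of step 1 -- how a process such as QEMU opts a memory region into KSM scanning (the merging itself happens later, in ksmd):

  /* Register an anonymous region with KSM; requires CONFIG_KSM=y. */
  #include <stdio.h>
  #include <string.h>
  #include <sys/mman.h>

  int main(void) {
      size_t len = 64 * 1024 * 1024;           /* stand-in for guest RAM */
      char *ram = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
      if (ram == MAP_FAILED) { perror("mmap"); return 1; }

      memset(ram, 0x41, len);                  /* identical pages merge  */

      if (madvise(ram, len, MADV_MERGEABLE))
          perror("madvise(MADV_MERGEABLE)");
      else
          puts("registered; watch /sys/kernel/mm/ksm/pages_sharing");
      return 0;
  }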

KSM parameters in /sys/kernel/mm/ksm/:

  • run: 0 = stop scanning, 1 = scan, 2 = stop and un-merge everything
  • pages_to_scan: pages examined per ksmd wake-up
  • sleep_millisecs: pause between scan batches
  • merge_across_nodes: whether pages on different NUMA nodes may be merged (set to 0 to preserve NUMA locality)
  • pages_shared / pages_sharing: read-only counters of merged pages and the number of users pointing at them

Security Concerns

KSM is vulnerable to side-channel attacks (e.g., memory deduplication timing attacks). An attacker in VM-A can detect whether VM-B has a specific page in memory by observing whether a write to that page content triggers a fast (already merged) or slow (not merged) COW fault. For this reason, KSM is typically disabled in security-sensitive financial environments, and all three candidate platforms support disabling it.

5c. Huge Pages

What It Is

Standard x86-64 memory pages are 4 KB. The CPU uses a TLB (Translation Lookaside Buffer) to cache virtual-to-physical address translations. A typical TLB has 1,024-2,048 entries. With 4 KB pages, 2,048 TLB entries cover only 8 MB of memory. A VM with 64 GB of RAM actively accessing a large working set will experience constant TLB misses, each requiring a full page table walk through 4 levels of page tables (PML4 -> PDPT -> PD -> PT), costing 10-30 ns per miss.

Huge Pages use larger page sizes:

| Page Size       | TLB Entries | Coverage (2048 entries) | Use Case                      |
|-----------------|-------------|-------------------------|-------------------------------|
| 4 KB (standard) | 2,048       | 8 MB                    | Default, general purpose      |
| 2 MB (huge)     | 2,048       | 4 GB                    | Databases, JVMs, medium VMs   |
| 1 GB (gigantic) | 2,048       | 2 TB                    | Very large VMs, in-memory DBs |

With 2 MB pages, the same 2,048 TLB entries cover 4 GB -- a 512x improvement in address space coverage, dramatically reducing TLB miss rates.

  Page Table Walk Comparison

  4 KB Standard Pages (4-level walk on TLB miss):

  Virtual Addr -----> PML4 -----> PDPT -----> PD -----> PT -----> Physical Page (4 KB)
                      L4           L3          L2         L1
                      ~5ns         ~5ns        ~5ns       ~5ns     Total: ~20ns per miss

  2 MB Huge Pages (3-level walk on TLB miss, PT level eliminated):

  Virtual Addr -----> PML4 -----> PDPT -----> PD -----> Physical Page (2 MB)
                      L4           L3          L2
                      ~5ns         ~5ns        ~5ns                Total: ~15ns per miss
                                                        + 512x fewer misses overall

  1 GB Gigantic Pages (2-level walk on TLB miss):

  Virtual Addr -----> PML4 -----> PDPT -----> Physical Page (1 GB)
                      L4           L3
                      ~5ns         ~5ns                            Total: ~10ns per miss
                                              + 262,144x fewer misses overall

How It Works -- Technical Internals

On Linux, huge pages come in two flavors:

Static Huge Pages (hugetlbfs): pages are reserved explicitly, either at boot (hugepages=N on the kernel command line) or at runtime (/proc/sys/vm/nr_hugepages, or per NUMA node via sysfs). Reserved pages are guaranteed, never swapped, and consumed through hugetlbfs or mmap(MAP_HUGETLB). This is the mode used to back production VMs.

Transparent Huge Pages (THP): the kernel's khugepaged thread opportunistically collapses runs of 4 KB pages into 2 MB pages with no application changes. Convenient but best-effort: background compaction can introduce latency spikes, which is why THP is generally not recommended for latency-sensitive VM hosts (see the comparison table later in this document).
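
A minimal sketch of the hugetlbfs path: explicitly backing a mapping with 2 MB pages via mmap(MAP_HUGETLB), assuming pages have been pre-reserved (e.g., echo 512 > /proc/sys/vm/nr_hugepages):

  /* Map 32 MiB backed by 2 MB huge pages; fails with ENOMEM if the
   * pre-reserved pool is empty. */
  #include <stdio.h>
  #include <string.h>
  #include <sys/mman.h>

  #ifndef MAP_HUGE_SHIFT
  #define MAP_HUGE_SHIFT 26                    /* from <linux/mman.h>    */
  #endif
  #ifndef MAP_HUGE_2MB
  #define MAP_HUGE_2MB (21 << MAP_HUGE_SHIFT)  /* log2(2 MiB) = 21       */
  #endif

  int main(void) {
      size_t len = 16 * (2UL << 20);           /* 16 x 2 MB pages        */
      void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB | MAP_HUGE_2MB,
                     -1, 0);
      if (p == MAP_FAILED) { perror("mmap(MAP_HUGETLB)"); return 1; }

      memset(p, 0, len);                       /* faults in 2 MB pages   */
      puts("mapped 32 MiB with 2 MB huge pages");
      munmap(p, len);
      return 0;
  }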

Huge Pages + EPT: When a VM uses huge pages on the host, the EPT entries can also use 2 MB or 1 GB page mappings. This means the EPT page walk is also shortened, providing a double benefit: fewer guest TLB misses and fewer EPT walk levels per miss.

Interaction with Other Memory Techniques

| Combination              | Compatible?      | Notes                                                                       |
|--------------------------|------------------|-----------------------------------------------------------------------------|
| Huge Pages + Ballooning  | No               | Balloon driver cannot reclaim huge pages; they are not pageable             |
| Huge Pages + KSM         | No               | KSM operates at 4 KB granularity; cannot merge 2 MB pages                   |
| Huge Pages + NUMA        | Yes, critical    | Huge pages must be allocated on the correct NUMA node; use numactl --membind or equivalent |
| Huge Pages + CPU Pinning | Yes, recommended | The full stack: pinned CPUs + NUMA-local huge pages = maximum determinism   |

Why It Matters at 5,000+ VM Scale

Memory management strategy directly determines VM density (how many VMs fit on a host) and performance predictability.

The evaluation team should determine which VMs in the 5,000+ fleet need which memory strategy, and verify that each candidate platform supports the required combination.


6. Secure Boot / vTPM

What It Is and Why It Exists

Secure Boot is a UEFI firmware feature that verifies the digital signature of every piece of code loaded during the boot process -- firmware drivers, option ROMs, the bootloader, and the OS kernel. If any component fails signature verification, boot is halted. This prevents bootkits, rootkits, and unauthorized OS modifications.

TPM (Trusted Platform Module) is a dedicated security chip (or firmware module) that provides:

  • Secure generation and storage of cryptographic keys that never leave the module
  • Platform Configuration Registers (PCRs) that record boot-chain measurements
  • Attestation: signed quotes of PCR values that prove platform state to a remote verifier
  • Sealing: encrypting data so it can only be unsealed when the PCRs match a known-good state
  • A hardware random number generator

vTPM (virtual TPM) emulates a TPM 2.0 device for a virtual machine, providing the same security capabilities to guest operating systems that a physical TPM provides to bare-metal systems.

  Secure Boot + vTPM in a Virtualized Stack

  +--------------------------------------------------+
  |  Guest VM                                        |
  |  +--------------------------------------------+  |
  |  | Guest OS Kernel (signed)                   |  |
  |  +--------------------------------------------+  |
  |  | Guest UEFI Bootloader (signed, verified    |  |
  |  |   by virtual UEFI firmware)                |  |
  |  +--------------------------------------------+  |
  |  | vTPM 2.0 Device                            |  |
  |  | - PCR measurements of boot chain           |  |
  |  | - BitLocker/LUKS key storage               |  |
  |  | - Attestation support                      |  |
  |  +--------------------------------------------+  |
  +--------------------------------------------------+
  |  Hypervisor                                      |
  |  - Provides virtual UEFI firmware per VM         |
  |  - Emulates TPM 2.0 via software backend         |
  |  - Stores vTPM state (encrypted on disk)          |
  +--------------------------------------------------+
  |  Physical Hardware                               |
  |  - Hardware TPM 2.0 (for host attestation)       |
  |  - UEFI Secure Boot (for hypervisor integrity)   |
  +--------------------------------------------------+

How It Works -- Technical Internals

Secure Boot Chain (Virtualized)
  1. Host boots with Secure Boot: The server's UEFI firmware verifies the hypervisor/host OS bootloader against certificates in the firmware's Secure Boot key database (db). The chain: PK (Platform Key) -> KEK (Key Exchange Key) -> db (authorized signatures) -> bootloader -> kernel.
  2. VM boots with virtual Secure Boot: The hypervisor provides each VM with a virtual UEFI firmware (e.g., OVMF -- Open Virtual Machine Firmware, a TianoCore UEFI implementation for QEMU). This virtual UEFI has its own Secure Boot key database. The VM's bootloader and kernel must be signed by keys in this database.
  3. Standard certificate chain: Microsoft provides UEFI CA certificates that sign Windows bootloaders. Shim (a small bootloader signed by Microsoft) signs GRUB, which verifies the Linux kernel. The result: both Windows and Linux guests can boot under Secure Boot without custom key enrollment.

vTPM Implementations

  • KVM/QEMU: swtpm, a user-space TPM 2.0 emulator, attached to the VM as a device; its state lives on disk (in OVE, in a PersistentVolume).
  • VMware: vTPM 2.0 available since vSphere 6.7; requires a Key Management Server and VM encryption to protect vTPM state.
  • Hyper-V: native vTPM; keys are protected by the Host Guardian Service (HGS) and the host's physical TPM.
Platform Configuration Registers (PCRs)

The TPM contains a set of PCRs (typically 24, numbered PCR0-PCR23) that store hash measurements of the boot chain. Each measurement is "extended" into a PCR using: PCR_new = SHA-256(PCR_old || measurement). This creates a chain of trust:

| PCR     | What It Measures                                      |
|---------|-------------------------------------------------------|
| PCR0    | BIOS/UEFI firmware code                               |
| PCR1    | BIOS/UEFI firmware configuration                      |
| PCR2    | Option ROMs                                           |
| PCR4    | MBR / bootloader code                                 |
| PCR7    | Secure Boot policy (whether Secure Boot was enforced) |
| PCR8-15 | OS-defined (kernel, initramfs, etc.)                  |

A vTPM replicates this inside the VM: PCR0 measures the virtual UEFI firmware, PCR4 measures the guest bootloader, etc. Remote attestation services can verify that a VM booted a known-good configuration by reading PCR values.
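
A small sketch of the extend operation itself, using OpenSSL's one-shot SHA256() (link with -lcrypto); it shows why a PCR value depends on both the content and the order of measurements:

  /* PCR_new = SHA-256(PCR_old || measurement). Measurements in this
   * sketch are assumed to be under 256 bytes. */
  #include <openssl/sha.h>
  #include <stdio.h>
  #include <string.h>

  static void pcr_extend(unsigned char pcr[SHA256_DIGEST_LENGTH],
                         const void *measurement, size_t len) {
      unsigned char buf[SHA256_DIGEST_LENGTH + 256];
      memcpy(buf, pcr, SHA256_DIGEST_LENGTH);               /* PCR_old        */
      memcpy(buf + SHA256_DIGEST_LENGTH, measurement, len); /* || measurement */
      SHA256(buf, SHA256_DIGEST_LENGTH + len, pcr);         /* -> PCR_new     */
  }

  int main(void) {
      unsigned char pcr[SHA256_DIGEST_LENGTH] = {0};        /* PCRs start at 0 */

      /* Extend in boot order; reordering changes the final value. */
      pcr_extend(pcr, "firmware-hash", 13);
      pcr_extend(pcr, "bootloader-hash", 15);

      for (int i = 0; i < SHA256_DIGEST_LENGTH; i++) printf("%02x", pcr[i]);
      printf("\n");
      return 0;
  }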

Why vTPM Is Required for Modern Workloads

  • Windows 11 and Windows Server 2022 Secured-Core configurations require TPM 2.0; without a vTPM, those guests cannot be deployed in a supported configuration.
  • BitLocker (Windows) and LUKS (Linux) can seal disk-encryption keys to the vTPM, binding them to a verified boot chain.
  • Remote attestation of guest boot integrity depends on PCR measurements that only a (v)TPM can supply.

Relationship to Other Topics

Secure Boot validates the boot chain; vTPM provides runtime security services. Both depend on the hypervisor layer being a Type-1 implementation that can provide hardware-level isolation. Nested virtualization complicates Secure Boot because the inner hypervisor needs its own Secure Boot chain. CPU pinning and NUMA are orthogonal to Secure Boot/vTPM.

Why It Matters at 5,000+ VM Scale

For a Tier-1 financial enterprise:

  • Regulators and internal security baselines increasingly require measured boot and disk encryption for every workload; a platform without fleet-wide Secure Boot and vTPM support cannot host the Windows estate.
  • vTPM state is part of the VM: it must survive live migration, backup/restore, and disaster recovery, or encrypted guests become unbootable.
  • Key ownership matters: in a managed service such as ESC, the provider controls the vTPM key hierarchy, which must be assessed against data-sovereignty and audit requirements.


7. Nested Virtualization

What It Is and Why It Exists

Nested virtualization means running a hypervisor inside a virtual machine. The outer (L0) hypervisor runs on physical hardware. It hosts a VM (L1) that itself runs a hypervisor, which hosts its own VMs (L2).

  Nested Virtualization Stack

  +-----------------------------------------------+
  |  L2 VMs (nested guests)                       |
  |  +----------+  +----------+  +----------+    |
  |  | L2 VM A  |  | L2 VM B  |  | L2 VM C  |    |
  |  +----------+  +----------+  +----------+    |
  +-----------------------------------------------+
  |  L1 Hypervisor (runs inside L1 VM)            |
  |  e.g., KVM, Hyper-V, ESXi                     |
  +-----------------------------------------------+
  |  L1 VM (hosted by L0 hypervisor)              |
  +-----------------------------------------------+
  |  L0 Hypervisor (runs on bare metal)           |
  |  e.g., KVM, Hyper-V, ESXi                     |
  +-----------------------------------------------+
  |  Physical Hardware                            |
  +-----------------------------------------------+

Common use cases:

  • Development and test: running a miniature OVE, Azure Local, or vSphere environment inside VMs before touching production hardware
  • CI/CD pipelines that build and test VM-based infrastructure
  • Migration staging: validating VMware-to-target migrations against a nested source environment
  • Training labs and certification environments

How It Works -- Technical Internals

The Problem

VT-x/AMD-V was originally designed for a single level of virtualization. The L0 hypervisor uses VMX root/non-root modes. But the L1 hypervisor also wants to use VMX root/non-root modes for its L2 guests. How does the CPU handle two levels of hypervisor?

Intel VMCS Shadowing (Hardware-Assisted Nesting)

Modern Intel CPUs (Haswell and later) support VMCS shadowing. The L0 hypervisor creates a "shadow VMCS" for each L2 VM. When the L1 hypervisor executes VMREAD or VMWRITE to configure its L2 VM, the CPU intercepts these operations and redirects them to the shadow VMCS. The L0 hypervisor merges the shadow VMCS with its own control settings to produce the actual VMCS used by the hardware.

On a VM exit from L2:

  1. The CPU exits to L0 (not L1).
  2. L0 inspects the exit reason. If it is something L1 requested to trap (e.g., L1 set a particular bit in the L1 VMCS execution controls), L0 injects the exit into L1.
  3. If the exit is only relevant to L0 (e.g., an EPT violation for L0's memory mapping), L0 handles it directly and resumes L2.
Address Translation with Nested EPT

The memory translation becomes three levels deep:

  Three-Level Address Translation (Nested Virtualization)

  L2 Guest Virtual Address (L2-GVA)
          |
          | L2 guest page tables (controlled by L2 OS)
          v
  L2 Guest Physical Address (L2-GPA)
          |
          | L1 EPT (controlled by L1 hypervisor)
          v
  L1 Guest Physical Address (L1-GPA)
          |
          | L0 EPT (controlled by L0 hypervisor)
          v
  Host Physical Address (HPA) --> Actual RAM

  On a TLB miss, the CPU must walk THREE sets of page tables.
  This is extremely expensive: up to 24 memory accesses per translation
  in the worst case (4 levels * 3 translation layers * 2 accesses per level).

To mitigate this, L0 hypervisors compute a merged (shadow) EPT: a composite page table that maps L2-GPA directly to HPA, collapsing two EPT walks into one. This is maintained in software by the L0 hypervisor and cached, but it must be invalidated whenever L1 changes its EPT mappings.

AMD Nested Virtualization

AMD added nested virtualization support earlier than Intel (SVM was designed with nesting in mind from the start). AMD uses a nested VMCB and nested NPT with similar semantics. A practical AMD advantage: because the VMCB lives in ordinary memory, the L0 hypervisor can inspect and rewrite L1's VMCB with plain loads and stores, with no VMREAD/VMWRITE interception machinery required.

Performance Impact

Nested virtualization carries a significant performance penalty:

| Metric                | Bare Metal | L1 VM (standard) | L2 VM (nested)             |
|-----------------------|------------|------------------|----------------------------|
| CPU overhead          | 0%         | 2-5%             | 10-30%                     |
| Memory access latency | Baseline   | +5-10% (EPT)     | +20-50% (double EPT)       |
| I/O latency           | Baseline   | +10-20% (virtio) | +30-60% (double emulation) |
| Context switch cost   | Baseline   | ~1.5x            | ~3-5x                      |

These penalties make nested virtualization unsuitable for production workloads. It is a development and testing tool.
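
On a KVM host, whether an L1 hypervisor is even permitted is controlled by a module parameter; a trivial check, assuming an Intel host (AMD exposes the same knob under kvm_amd):

  /* Is nested virtualization enabled on this KVM host? */
  #include <stdio.h>

  int main(void) {
      FILE *f = fopen("/sys/module/kvm_intel/parameters/nested", "r");
      if (!f) { perror("kvm_intel not loaded?"); return 1; }
      int c = fgetc(f);                 /* 'Y' or '1' = nested allowed */
      fclose(f);
      printf("nested: %c\n", c);
      return (c == 'Y' || c == '1') ? 0 : 2;
  }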

Relationship to Other Topics

Nested virtualization builds directly on CPU virtualization extensions -- it requires VT-x/AMD-V with VMCS shadowing or equivalent support. NUMA awareness is typically not preserved through two levels of nesting. CPU pinning at the L0 level improves L2 performance but does not eliminate the double-EPT overhead. Secure Boot can be chained through nested levels but adds complexity.

Why It Matters at 5,000+ VM Scale

Nested virtualization is not a production concern for the 5,000+ VM fleet. Its relevance to the evaluation is in three areas:

  1. Development and testing: Can the operations team run a local OVE/Azure Local cluster inside VMs for testing infrastructure-as-code, runbooks, and migration scripts before touching production hardware?
  2. CI/CD pipelines: Can CI/CD jobs that test VM-based infrastructure run inside VMs on the platform?
  3. Migration staging: Can a miniature VMware environment be run inside the target platform for migration validation?

If the answer to any of these is "yes, but only on dedicated development hardware," that is acceptable. If the answer is "not supported at all," it constrains the operations team's workflow.


How the Candidates Handle This

Comparison Table

| Aspect | VMware vSphere (Current) | OVE (KubeVirt/KVM) | Azure Local (Hyper-V) | Swisscom ESC |
|---|---|---|---|---|
| Hypervisor Type | Type-1 (ESXi microkernel) | De facto Type-1 (KVM in Linux kernel) | Type-1 (Hyper-V with root partition) | Type-1 (ESXi, managed by Swisscom) |
| CPU Virt Extensions | VT-x/AMD-V + EPT/NPT, mandatory | VT-x/AMD-V + EPT/NPT, mandatory | VT-x/AMD-V + EPT (SLAT), mandatory | Same as VMware (VT-x + EPT) |
| NUMA Awareness | vSphere NUMA scheduler auto-assigns; vNUMA exposed to large VMs; wide/narrow NUMA modes | KubeVirt exposes NUMA topology via spec.domain.cpu.numa; Kubernetes Topology Manager aligns CPU/memory/devices | Hyper-V NUMA-aware scheduler; virtual NUMA auto-presented to VMs >1 socket; configurable via SCVMM/PowerShell | Provider-managed; customer has no visibility or control over NUMA placement |
| CPU Pinning | Scheduling Affinity (per-VM); conflicts with DRS | dedicatedCpuPlacement: true in VM spec; Kubernetes CPU Manager static policy; isolcpus for host kernel exclusion | Hyper-V processor affinity; NUMA node assignment via PowerShell; less granular than KVM | Not customer-configurable; provider decides placement |
| Memory Ballooning | vmmemctl balloon driver, mature and well-tuned | virtio_balloon driver; configurable per VM; less mature tuning than VMware | Hyper-V Dynamic Memory; deeply integrated; auto-adjusting | Provider-managed; customer sees allocated RAM, not overcommit details |
| KSM / Page Sharing | Transparent Page Sharing (TPS); disabled by default since 2014 (side-channel risk); intra-VM TPS available | KSM available via ksmd; disabled by default in OVE; can be enabled per node | Not applicable (Hyper-V does not implement host-level page sharing) | Provider-managed; likely disabled in regulated tenants |
| Huge Pages | Large Pages enabled by default (2 MB); no native 1 GB support for VMs | Static huge pages (2 MB and 1 GB) via hugepages-2Mi / hugepages-1Gi resource requests in VM spec; THP not recommended | Large Pages supported; configurable per VM via Hyper-V settings; 2 MB and 1 GB | Provider-managed; customer cannot request specific page sizes |
| Secure Boot | vSphere 6.7+ with UEFI firmware; requires VM hardware version 13+ | OVMF-based UEFI with Secure Boot; supported since OpenShift 4.14; requires spec.domain.firmware.bootloader.efi.secureBoot: true | Native Secure Boot with UEFI; mandatory for Shielded VMs; deeply integrated with HGS | Provider-managed; available per VM on request |
| vTPM | vTPM 2.0 since vSphere 6.7; requires KMS and VM encryption | vTPM 2.0 via swtpm; supported since OpenShift 4.14; state persisted in PV | vTPM 2.0 native; required for Shielded VMs; keys protected by HGS/host TPM | Available; managed by Swisscom, customer has no key management control |
| Nested Virtualization | Supported since ESXi 5.1; production-supported for specific use cases | Supported on KVM hosts; not recommended for production; enable via vmx CPU feature passthrough | Natively supported; well-tested; commonly used for dev/test with Hyper-V-in-Hyper-V | Not available to customers (provider does not expose nested virt in ESC tenants) |

Detailed Candidate Analysis

OVE (KubeVirt / KVM)

NUMA and CPU Pinning: OVE provides the most granular control of all candidates. Through Kubernetes' Topology Manager and CPU Manager, combined with KubeVirt's VM-level NUMA specification, operators can achieve precise NUMA-aligned CPU pinning. The spec.domain.cpu.dedicatedCpuPlacement: true flag triggers the static CPU Manager policy, which reserves entire cores for the VM and aligns memory allocation to the corresponding NUMA node. For ultra-low-latency workloads, the host can be configured with isolcpus to keep the Linux kernel's own threads off pinned cores entirely.

The trade-off: this level of control requires the operations team to understand Kubernetes resource requests, Topology Manager policies (single-numa-node, restricted, best-effort), and CPU Manager behavior. Misconfiguration results in pods stuck in Pending state or silently degraded NUMA alignment.

Huge Pages: OVE supports both 2 MB and 1 GB huge pages as first-class Kubernetes resources. VMs request huge pages in their pod spec (resources.requests.hugepages-2Mi: "16Gi"), and the Kubernetes scheduler ensures the VM lands on a node with sufficient pre-allocated huge pages. This is more explicit than VMware's default large-page behavior, which means more configuration work but also more predictable behavior.

vTPM: KubeVirt uses swtpm (software TPM emulator) to provide vTPM 2.0 to VMs. The vTPM state is persisted in a Kubernetes PersistentVolume. This is important for live migration: when a VM migrates, its vTPM state must move with it. OVE handles this through the PV, but the operator must ensure the storage backend supports ReadWriteMany or that the PV is properly detached/reattached during migration.

Azure Local (Hyper-V)

NUMA Awareness: Hyper-V has a mature NUMA-aware scheduler that automatically presents virtual NUMA topology to large VMs. It is less configurable than KVM in terms of fine-grained core assignment, but "just works" for most workloads without operator intervention. For environments where the operations team is Windows-focused and prefers automation over control, this is a strength.

Secure Boot / vTPM: Azure Local has the strongest security posture of all candidates in this area. Secure Boot is mandatory (not optional) for the host OS. vTPM is mandatory for Shielded VMs. The Host Guardian Service (HGS) provides hardware-attested key management for vTPM. Shielded VMs provide encryption of virtual disks (VHDX) and vTPM keys, with attestation ensuring VMs can only run on hosts that pass health checks. This is the only candidate that offers true hardware-attested VM shielding out of the box.

Nested Virtualization: Hyper-V has well-tested nested virtualization support, commonly used for running Hyper-V inside Hyper-V for development/testing. This is a practical advantage for the operations team during the evaluation and migration phase.

Limitation -- Cluster Size and NUMA: Azure Local's 16-node cluster limit means that large VMs spanning multiple NUMA nodes may have fewer placement options during failover. If a 4-socket VM needs to fail over and only 2 of 16 nodes have sufficient free NUMA-aligned resources, the failure domain is tight.

Swisscom ESC

Managed Service Implications: ESC abstracts all foundational concepts behind a service contract. The customer does not configure NUMA, CPU pinning, huge pages, or vTPM directly. This is simultaneously the greatest strength and greatest weakness:

  • Strength: the operational burden of NUMA alignment, pinning, huge-page management, and key handling shifts entirely to Swisscom, reducing the specialist skills the internal team must maintain.
  • Weakness: the customer has no visibility into, or control over, placement and isolation. Performance problems rooted in NUMA misalignment or overcommit cannot be diagnosed in-house, only escalated.

For the evaluation team, the key question is: does Swisscom's SLA cover performance characteristics that imply proper NUMA alignment and resource isolation? A contractual IOPS guarantee is only meaningful if the underlying platform honors NUMA locality.

Technology Transition Risk: Since ESC is built on VMware, all current capabilities (vTPM, Secure Boot, NUMA, etc.) derive from vSphere. If Swisscom transitions to a different hypervisor stack in response to Broadcom licensing changes, these capabilities must be re-evaluated.


Key Takeaways

  • All candidates run Type-1 (or de facto Type-1) hypervisors on VT-x/AMD-V with EPT/NPT; the differentiator is controllability, not basic viability.
  • Hardware virtualization extensions and EPT/NPT must be verified as enabled on every server; put this on the hardware commissioning checklist.
  • NUMA misalignment is the most common silent performance killer at 5,000+ VM scale, and CPU pinning is the enforcement mechanism that makes NUMA-aware placement stick.
  • Memory techniques trade density against determinism: ballooning and KSM raise density, huge pages raise predictability, and the two groups are largely mutually exclusive.
  • Secure Boot and vTPM are hard requirements for the Windows Server 2022 estate and for regulatory posture; key ownership differs sharply across candidates.
  • Nested virtualization matters for dev/test, CI/CD, and migration staging, not for production.


Discussion Guide

Use these questions when engaging with vendor architects, Red Hat solution engineers, Microsoft technical specialists, or Swisscom account teams. The questions are designed to go beyond datasheet capabilities and probe real-world implementation depth.

Questions for All Candidates

  1. NUMA Alignment Verification: "How can we verify, at any point in time, that a specific production VM is NUMA-aligned? What tooling or metrics expose NUMA locality vs. remote memory access ratios? If a VM is found to be NUMA-misaligned after a live migration, what is the remediation process -- does it require a reboot, a re-migration, or is it automatic?"

  2. Huge Pages and Memory Overcommit Interaction: "We plan to run tier-1 database VMs with 1 GB huge pages and tier-3 development VMs with memory overcommit on the same physical hosts. Describe exactly how your scheduler handles a node where 256 GB of 512 GB is reserved as static huge pages. What happens to the remaining 256 GB -- is it available for overcommit, and how does the scheduler prevent huge-page VMs from landing on nodes that cannot satisfy the request?"

  3. vTPM Key Lifecycle: "Walk us through the complete lifecycle of a vTPM key: creation, storage at rest, access during VM boot, behavior during live migration, backup/restore, and what happens if the underlying key management system is unavailable. Specifically: where are vTPM keys stored, who has access to them, and how are they rotated?"

  4. CPU Pinning at Scale: "If we pin 40% of our VMs (the latency-sensitive tier), what is the impact on scheduling flexibility for the remaining 60%? Have you seen customers at our scale (5,000+ VMs, 100+ hosts) successfully operate a mixed pinned/unpinned environment? What monitoring tells us when pinning fragmentation is reducing cluster utilization?"

  5. Secure Boot Supply Chain: "Describe the Secure Boot certificate chain from your hypervisor through to a Windows Server 2022 guest and a RHEL 9 guest. Who controls the Platform Key? Can we enroll custom Secure Boot keys for internally-signed bootloaders or kernel modules? What happens during a hypervisor upgrade -- does the Secure Boot chain need to be re-established?"

Questions Specific to OVE

  1. Kubernetes Topology Manager Policy: "Which Topology Manager policy do you recommend for our mixed workload -- single-numa-node or restricted? What happens when a VM requests dedicatedCpuPlacement and hugepages-1Gi but no single NUMA node on any available worker has both sufficient free CPUs and free 1 GB huge pages? Does the pod go Pending forever, or is there a fallback? How do we monitor for this at fleet scale?"

  2. KSM and Side-Channel Risk: "Is KSM enabled or disabled by default on OVE worker nodes? If enabled, can it be disabled per-node or per-VM? Have Red Hat's security assessments addressed KSM-based timing side-channel attacks in multi-tenant environments? In a scenario where two different business units' VMs share a host, what is the recommended configuration?"

Questions Specific to Azure Local

  1. Shielded VMs and Operational Overhead: "Shielded VMs with HGS attestation provide the strongest vTPM protection, but what is the operational overhead? How many HGS servers are required for our scale? What happens if HGS is unavailable -- can Shielded VMs still boot? What is the recovery procedure if the HGS database is corrupted or lost? Have you seen Tier-1 financial institutions deploy Shielded VMs at scale, or is this feature primarily used in smaller, high-security enclaves?"

  2. NUMA Granularity in a 16-Node Cluster: "With a maximum of 16 nodes per cluster and our requirement for 300+ large VMs (32+ vCPUs each) with NUMA-aligned placement, demonstrate that a 16-node cluster can handle the failover of 2 nodes simultaneously while maintaining NUMA alignment for all critical VMs. What is the NUMA-aware admission control mechanism, and does it prevent VM starts that would violate NUMA locality?"

Questions Specific to Swisscom ESC

  1. NUMA and Performance Guarantees Under the Hood: "Our current VMware environment uses per-VM NUMA affinity rules for our 50 largest database VMs. In ESC, how does Swisscom ensure equivalent NUMA alignment? Is it contractually guaranteed in the SLA, or is it best-effort? If we experience performance degradation and suspect NUMA misalignment, what diagnostic data can Swisscom provide? Can we request explicit NUMA pinning as part of our service profile?"