Advanced Data Paths
Why This Matters
The previous chapters covered overlay networking (OVN/GENEVE), physical connectivity (LACP, MLAG, ECMP), and fabric design. Those technologies handle the vast majority of the 5,000+ VM workloads -- general-purpose applications where 50 microseconds of overlay encapsulation overhead is irrelevant and the convenience of OVN's distributed routing, NetworkPolicy enforcement, and transparent live migration outweighs any raw performance consideration.
This chapter covers the cases where that is not good enough.
Some workloads -- database replication with sub-100-microsecond latency requirements, telco NFV virtual network functions processing millions of packets per second, financial feed handlers that cannot tolerate interrupt-driven kernel packet processing -- need a fundamentally different data path. These workloads require hardware-assisted I/O (SR-IOV), safe direct device assignment to VMs (IOMMU/VFIO), or complete kernel bypass for packet processing (DPDK). These three technologies form a stack: SR-IOV creates virtual PCIe devices, IOMMU makes it safe to assign those devices directly to VMs, and DPDK enables userspace packet processing at line rate inside those VMs.
For a Tier-1 financial enterprise, this is not theoretical. The organization likely has workloads today on VMware that use DirectPath I/O (VMware's SR-IOV passthrough) or that depend on VMware's PVRDMA for low-latency inter-VM communication. The migration to OVE, Azure Local, or Swisscom ESC must preserve these performance characteristics -- or the workloads cannot migrate.
The complexity cost is significant. SR-IOV sacrifices overlay network features (no OVN NetworkPolicy, no transparent live migration). DPDK consumes dedicated CPU cores and hugepage memory that are unavailable to other workloads. IOMMU misconfiguration can either break device assignment entirely or create security holes in VM-to-VM isolation. These are technologies that should be deployed only when virtio networking through the overlay is demonstrably insufficient -- but when they are needed, there is no substitute.
This chapter covers three tightly coupled technologies:
- SR-IOV -- hardware-partitioned NIC that presents virtual PCIe functions directly to VMs, bypassing the software virtual switch
- IOMMU -- hardware memory protection that makes direct device assignment safe by isolating DMA and interrupts per VM
- DPDK -- userspace networking framework that bypasses the kernel entirely for wire-speed packet processing
Concepts
1. SR-IOV (Single Root I/O Virtualization)
The Problem: Software Virtual Switches Add Latency and Consume CPU
In a standard virtualization data path, every packet a VM sends traverses the following chain:
Standard Virtio Networking Path (per packet):
VM (Guest) Physical NIC
+-----------+ +-----------+
| Application| | |
| | | | Wire |
| Socket | | |
| | | +-----------+
| TCP/IP | ^
| | | |
| virtio-net| <-- paravirtual NIC driver |
+-----+------+ |
| |
| 1. Guest writes to virtio ring |
| (shared memory between guest & host) |
v |
+-----------+ |
| vhost-net | <-- host kernel module that |
| (kernel) | processes virtio rings |
+-----+-----+ |
| |
| 2. Packet enters kernel networking stack |
v |
+-----------+ |
| OVS | <-- Open vSwitch (kernel datapath) |
| br-int | flow lookup, encap/decap, |
| | ACL enforcement |
+-----+-----+ |
| |
| 3. OVS forwards to physical port |
| (may GENEVE-encapsulate first) |
v |
+-----------+ |
| Host NIC | <-- physical NIC driver (mlx5_core) |
| Driver | DMA to NIC hardware |
+-----+-----+ |
| |
+----------------------------------------------+
Each step in this chain adds latency:
- Virtio ring processing: ~2-5 us (guest-to-host notification, vring descriptor walk)
- OVS flow lookup: ~5-15 us (megaflow cache hit) to ~50-100 us (slow-path upcall on cache miss)
- GENEVE encap/decap: ~3-10 us (header construction, UDP checksum)
- Host kernel networking: ~5-10 us (netfilter hooks, socket buffer allocation, context switches)
- Total one-way: ~15-40 us typical, ~100+ us worst case
For general-purpose workloads, 15-40 microseconds is invisible. For a database replication link where every committed transaction adds a network round-trip, or a trading system where every microsecond of latency has measurable financial impact, this overhead is unacceptable.
Additionally, the software path consumes CPU. At 10 Gbps with 64-byte packets (the worst case for per-packet overhead), the host can burn 2-4 CPU cores just processing network I/O for a single busy VM. Those cores are unavailable for running other VMs.
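Before reaching for any of the technologies in this chapter, it is worth confirming that the virtio path -- and not the application, storage, or database -- is the actual bottleneck. A minimal measurement sketch, assuming two test VMs that can reach each other (the address 10.10.100.12 and port 11111 are illustrative) and that `qperf` and `sockperf` are installed in the guests:

```bash
# On the "server" VM: start the qperf listener (no arguments needed)
qperf

# On the "client" VM: measure TCP/UDP latency and TCP bandwidth over the virtio path
qperf 10.10.100.12 tcp_lat udp_lat tcp_bw

# Alternative with sockperf, which reports latency percentiles (useful for jitter):
# server VM:  sockperf server --tcp -p 11111
sockperf ping-pong --tcp -i 10.10.100.12 -p 11111 -t 30 --full-rtt
```

If the measured round-trip latency is already dominated by the application or storage, SR-IOV and DPDK will not help.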
SR-IOV Architecture: Physical Function and Virtual Functions
SR-IOV is a PCI-SIG specification (first published in 2007, current revision 1.1) that allows a single physical PCIe device to present itself as multiple independent virtual devices. The physical device is called the Physical Function (PF), and the virtual devices are called Virtual Functions (VFs).
SR-IOV NIC Architecture:
PCIe Bus
+================================================================+
| |
| Physical NIC (e.g., Mellanox ConnectX-6 Dx) |
| +------------------------------------------------------------+ |
| | | |
| | Physical Function (PF) BDF: 0000:65:00.0 | |
| | +------------------------+ | |
| | | Full PCIe config space | | |
| | | SR-IOV Capability | <-- PCIe Extended Capability | |
| | | VF BAR addresses | structure at offset 0x160 | |
| | | VF Enable bit | | |
| | | NumVFs register | | |
| | | First VF Offset | | |
| | | VF Stride | | |
| | +------------------------+ | |
| | | | | | | | |
| | v v v v v | |
| | +--------+ +--------+ +--------+ +--------+ +--------+ | |
| | | VF 0 | | VF 1 | | VF 2 | | VF 3 | | VF N | | |
| | |.0 .1| |.0 .2| |.0 .3| |.0 .4| |.0 .N+1| | |
| | | | | | | | | | | | | |
| | | Minimal| | Minimal| | Minimal| | Minimal| | Minimal| | |
| | | Config | | Config | | Config | | Config | | Config | | |
| | | Space | | Space | | Space | | Space | | Space | | |
| | | | | | | | | | | | | |
| | | Own TX | | Own TX | | Own TX | | Own TX | | Own TX | | |
| | | Own RX | | Own RX | | Own RX | | Own RX | | Own RX | | |
| | | Own QP | | Own QP | | Own QP | | Own QP | | Own QP | | |
| | +--------+ +--------+ +--------+ +--------+ +--------+ | |
| | | |
| | +------------------------------------------------------+ | |
| | | NIC Embedded Switch (eSwitch) | | |
| | | - L2 forwarding between VFs and PF | | |
| | | - VLAN tagging/stripping per VF | | |
| | | - MAC anti-spoofing per VF | | |
| | | - Rate limiting per VF | | |
| | | - Hairpin forwarding (VF-to-VF on same NIC) | | |
| | +------------------------------------------------------+ | |
| | | |
| | +------------------------------------------------------+ | |
| | | Physical Port(s) to Network | | |
| | +------------------------------------------------------+ | |
| +------------------------------------------------------------+ |
+================================================================+
Key architectural points:
- The PF is a full PCIe function with complete configuration space, all device capabilities, and the ability to manage VFs. The PF driver (e.g., `mlx5_core` for Mellanox, `ice` for Intel E810) controls VF creation, VLAN assignment, MAC address enforcement, and rate limiting. The PF is typically used by the host (hypervisor) for management traffic or as the uplink for the software virtual switch.
- Each VF is a lightweight PCIe function with its own Bus:Device.Function (BDF) address, its own BAR (Base Address Register) mappings, its own TX/RX queues, and its own interrupt vectors (MSI-X). A VF appears to its driver as a real NIC -- it can send and receive packets independently without any involvement from the PF driver or the host CPU (after initial setup).
- The eSwitch is embedded in the NIC hardware. It performs L2 forwarding between VFs, between VFs and the PF, and between VFs and the physical port. The eSwitch enforces VLAN tagging, MAC anti-spoofing, and rate limits. In legacy mode, the eSwitch is a simple L2 bridge. In switchdev mode (used by OVS hardware offload), the eSwitch is programmable via TC flower rules and can offload OVS flow entries to hardware.
PCIe Configuration: ARI and How VFs Appear
Standard PCIe uses a 3-level addressing scheme: Bus (8 bits), Device (5 bits), Function (3 bits). The 3-bit Function field limits a single device to 8 functions -- far too few for SR-IOV, which needs 64, 128, or even 256 VFs per PF.
ARI (Alternative Routing-ID Interpretation) is a PCIe capability that repurposes the Device field. With ARI, the 5-bit Device and 3-bit Function fields are merged into a single 8-bit Function field (the Device field is ignored for routing), allowing up to 256 functions per device. ARI requires support from both the PCIe endpoint (the NIC) and every PCIe switch or root complex in the path.
PCIe BDF Addressing Without and With ARI:
Standard PCIe (no ARI):
Bus:Device.Function = 8:5:3 bits = max 8 functions per device
Example: 0000:65:00.0 (PF), 0000:65:00.1 (VF0), ... 0000:65:00.7 (VF6)
--> Only 7 VFs possible (function 0 = PF, functions 1-7 = VFs)
With ARI enabled:
Bus:Device.Function = 8:0:8 bits = up to 256 functions per bus
VFs get BDFs using FirstVFOffset and VFStride from SR-IOV capability:
PF BDF: 0000:65:00.0
VF0 BDF: 0000:65:00.0 + FirstVFOffset
VF1 BDF: VF0 BDF + VFStride
...
Example with FirstVFOffset=1, VFStride=1:
PF: 0000:65:00.0
VF0: 0000:65:00.1
VF1: 0000:65:00.2
...
VF6: 0000:65:00.7
VF7: 0000:65:01.0 (function number 8 -- lspci displays it as the next device number)
...
VF63: 0000:65:08.0
To verify ARI and SR-IOV capability on a system:
# Check if NIC supports SR-IOV
lspci -vvv -s 65:00.0 | grep -A 20 "Single Root I/O Virtualization"
# Capabilities: [160] Single Root I/O Virtualization (SR-IOV)
# IOVCap: Migration-, 10BitTagReq-, Interrupt Message Number: 000
# IOVCtl: Enable+ Migration- Interrupt- MSE+ ARIHierarchy+
# IOVSta: Migration-
# Initial VFs: 68, Total VFs: 68, Number of VFs: 8, ...
# VF offset: 2, stride: 1, ...
# Check if ARI is enabled on the upstream port
lspci -vvv -s 64:00.0 | grep ARI
# DevCtl2: ... ARIFwd+
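The same information is exposed through sysfs, which is easier to script. A small sketch, assuming the PF is ens1f0 as in the examples above:

```bash
# Maximum and currently enabled VF counts on the PF
cat /sys/class/net/ens1f0/device/sriov_totalvfs
cat /sys/class/net/ens1f0/device/sriov_numvfs

# Each enabled VF's BDF is exposed as a virtfnN symlink under the PF device
for vf in /sys/class/net/ens1f0/device/virtfn*; do
    echo "$(basename "$vf") -> $(basename "$(readlink "$vf")")"
done
# virtfn0 -> 0000:65:00.2
# virtfn1 -> 0000:65:00.3
# ...
```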
VF Capabilities: What a VF Can Do vs What Only the PF Can Do
This distinction is critical for understanding the security model and operational constraints:
| Capability | PF | VF |
|---|---|---|
| Send/receive packets | Yes | Yes |
| Own TX/RX queues | Yes (many) | Yes (fewer, typically 4-16 queue pairs) |
| Own MSI-X interrupt vectors | Yes (many) | Yes (fewer, typically 4-8) |
| Change own MAC address | Yes | Depends on PF driver config (usually restricted) |
| Set promiscuous mode | Yes | No (security: VF cannot sniff other VFs' traffic) |
| Configure VLAN trunk | Yes | No (PF admin assigns VLAN to VF) |
| Modify eSwitch rules | Yes | No |
| Create/destroy VFs | Yes | No |
| Update NIC firmware | Yes | No |
| Set rate limits on VFs | Yes | No |
| Enable/disable MAC anti-spoofing | Yes | No (PF admin controls this) |
| Access physical link parameters | Yes | No (VF cannot see SFP module, link speed negotiation) |
| RSS (Receive Side Scaling) | Yes | Yes (within VF's own queues) |
| Checksum offload (TX/RX) | Yes | Yes |
| TSO/LRO | Yes | Yes |
The security model is clear: a VF is a restricted view of the NIC. A compromised VM with a VF cannot affect other VFs, cannot sniff other VMs' traffic, and cannot modify the eSwitch configuration. The PF driver on the host enforces these restrictions in hardware.
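The "PF only" rows of the table translate directly into `ip link` commands run on the host against the PF. A sketch (PF name and values are illustrative):

```bash
# PF-side controls that the VF owner cannot change from inside the VM
ip link set ens1f0 vf 0 max_tx_rate 2000   # cap VF 0 at roughly 2 Gbit/s
ip link set ens1f0 vf 0 trust off          # deny promiscuous / multicast-promiscuous mode
ip link set ens1f0 vf 0 spoofchk on        # drop frames with a forged source MAC
ip link set ens1f0 vf 0 state auto         # VF link state follows the physical link
```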
Linux SR-IOV Setup: Enabling VFs
The manual process on a bare Linux system (for understanding; in production, this is automated by the SR-IOV Network Operator):
# Step 1: Enable IOMMU (kernel parameter, requires reboot)
# For Intel: intel_iommu=on iommu=pt
# For AMD: amd_iommu=on iommu=pt
# "iommu=pt" = passthrough mode for host devices (performance)
# Step 2: Enable SR-IOV VFs on the PF
echo 8 > /sys/class/net/ens1f0/device/sriov_numvfs
# Creates 8 VFs -- they appear as new PCIe devices immediately
# Step 3: Verify VFs exist
lspci | grep "Virtual Function"
# 65:00.2 Ethernet controller: Mellanox ConnectX-6 Dx (VF)
# 65:00.3 Ethernet controller: Mellanox ConnectX-6 Dx (VF)
# ... (8 VFs)
# Step 4: Configure VF VLAN and MAC via PF
ip link set ens1f0 vf 0 mac 52:54:00:aa:bb:01 vlan 100 spoofchk on
ip link set ens1f0 vf 1 mac 52:54:00:aa:bb:02 vlan 200 spoofchk on
# Step 5: Verify VF configuration
ip link show ens1f0
# ... vf 0 MAC 52:54:00:aa:bb:01, vlan 100, spoof checking on, ...
# ... vf 1 MAC 52:54:00:aa:bb:02, vlan 200, spoof checking on, ...
# Step 6: Bind VF to VFIO driver (for VM assignment)
echo 0000:65:00.2 > /sys/bus/pci/devices/0000:65:00.2/driver/unbind
echo vfio-pci > /sys/bus/pci/devices/0000:65:00.2/driver_override
echo 0000:65:00.2 > /sys/bus/pci/drivers/vfio-pci/bind
# Step 7: Assign VF to VM (QEMU command line)
qemu-system-x86_64 ... \
-device vfio-pci,host=0000:65:00.2,id=net0 \
...
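Note that a VF count written to `sriov_numvfs` does not survive a reboot. On a bare Linux host, one commonly documented approach is a udev rule that re-creates the VFs when the PF appears -- a sketch, assuming the PF is ens1f0 (in OVE this persistence is handled by the SR-IOV Network Operator instead):

```bash
cat <<'EOF' > /etc/udev/rules.d/80-sriov-numvfs.rules
# Re-create 8 VFs on ens1f0 whenever the interface is (re)registered
ACTION=="add", SUBSYSTEM=="net", KERNEL=="ens1f0", ATTR{device/sriov_numvfs}="8"
EOF
udevadm control --reload-rules
```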
SR-IOV in KubeVirt/OVE: The Operator Stack
In OpenShift/OVE, SR-IOV is managed declaratively through the SR-IOV Network Operator. The operator automates all the manual steps above and integrates with Kubernetes scheduling so that a VM is only placed on a node that has a free VF of the correct type.
SR-IOV Operator Stack in OVE:
Cluster Admin SR-IOV Network Operator
| |
| 1. Creates |
| SriovNetworkNodePolicy CR |
+------------------------------------>
| |
| | 2. Operator reads policy,
| | finds matching NICs on
| | matching nodes
| |
| | 3. Operator configures:
| | - sriov_numvfs on PF
| | - VF driver binding (vfio-pci or netdevice)
| | - Device plugin (advertises VFs to kubelet)
| |
| 4. Creates |
| SriovNetwork CR |
+------------------------------------>
| |
| | 5. Operator creates
| | NetworkAttachmentDefinition
| | (Multus NAD with sriov CNI plugin)
| |
| 6. Creates |
| VirtualMachine CR |
| (references NAD) |
+------------------------------------>
| |
| Kubernetes Scheduler
| |
| | 7. Scheduler sees VM requests
| | resource "openshift.io/sriovnic"
| | Finds node with free VF
| | Schedules VM pod on that node
| |
| kubelet + sriov-device-plugin
| |
| | 8. Device plugin allocates VF,
| | binds to vfio-pci,
| | passes VF PCI address to
| | virt-launcher pod
| |
| virt-launcher (QEMU)
| |
| | 9. QEMU attaches VF to VM
| | via vfio-pci device
| |
| VM Guest
| |
| | 10. Guest sees VF as a real NIC,
| | loads mlx5_core/iavf driver,
| | gets direct hardware access
The CRD definitions:
# SriovNetworkNodePolicy: tells the operator WHICH NICs to configure
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
name: policy-mlx-sriov
namespace: openshift-sriov-network-operator
spec:
nodeSelector:
feature.node.kubernetes.io/network-sriov.capable: "true"
resourceName: sriovnic # name advertised to kubelet
numVfs: 8 # create 8 VFs per matching NIC
nicSelector:
vendor: "15b3" # Mellanox
deviceID: "101d" # ConnectX-6 Dx
pfNames: ["ens1f0"] # specific PF interface
deviceType: vfio-pci # bind VFs to vfio-pci for VM passthrough
# deviceType: netdevice # alternative: keep as kernel netdev (for pod SR-IOV)
isRdma: false
priority: 99
---
# SriovNetwork: creates the NetworkAttachmentDefinition
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetwork
metadata:
name: sriov-vlan100
namespace: openshift-sriov-network-operator
spec:
resourceName: sriovnic
networkNamespace: finance # namespace where VMs will use this
vlan: 100 # VLAN for this network
spoofChk: "on"
trust: "off"
ipam: |
{
"type": "whereabouts",
"range": "10.10.100.0/24"
}
---
# VirtualMachine: references the SR-IOV network
apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
name: db-replica-01
namespace: finance
spec:
template:
spec:
domain:
devices:
interfaces:
- name: default
masquerade: {} # primary OVN network (management)
- name: sriov-fast
sriov: {} # SR-IOV passthrough interface
resources:
requests:
openshift.io/sriovnic: "1" # request one VF
networks:
- name: default
pod: {}
- name: sriov-fast
multus:
networkName: sriov-vlan100 # reference the SriovNetwork
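After applying these CRs, the operator's progress can be checked from the CLI. A sketch (the node name is a placeholder):

```bash
# Per-node VF configuration status reported by the operator
oc get sriovnetworknodestates -n openshift-sriov-network-operator
oc get sriovnetworknodestate <node-name> -n openshift-sriov-network-operator -o yaml | grep -A3 syncStatus

# The VFs show up as an extended resource that the scheduler can count
oc describe node <node-name> | grep -A1 "openshift.io/sriovnic"

# The generated NetworkAttachmentDefinition lands in the target namespace
oc get network-attachment-definitions -n finance
```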
SR-IOV in Hyper-V / Azure Local
Azure Local handles SR-IOV differently from OVE because Hyper-V's networking model is fundamentally different from Kubernetes/Multus.
SET (Switch Embedded Teaming): Azure Local uses SET instead of traditional NIC teaming. SET is a NIC teaming technology integrated into the Hyper-V Virtual Switch. Unlike OVE where SR-IOV VFs bypass the virtual switch entirely, Hyper-V's SET can work alongside SR-IOV -- the VMQ (Virtual Machine Queue) and SR-IOV features are coordinated by the Hyper-V switch extension.
SR-IOV with VMQ in Hyper-V:
- VMQ (Virtual Machine Queue) is a lighter-weight hardware acceleration where the NIC sorts incoming packets into per-VM hardware queues but the packets still pass through the Hyper-V switch. VMQ reduces CPU overhead but does not eliminate the software data path.
- SR-IOV in Hyper-V fully bypasses the Hyper-V switch for data-plane traffic. The VF is exposed directly to the guest VM. However, Hyper-V maintains a "software fallback" path: if SR-IOV fails or during live migration, traffic automatically falls back to the synthetic (VMBus) path through the Hyper-V switch. This fallback is transparent to the guest.
- The fallback capability is a significant advantage over KubeVirt's SR-IOV, where live migration of VMs with SR-IOV interfaces requires explicitly hot-unplugging the VF, migrating over the virtio interface, and hot-plugging a new VF on the destination -- a process that causes a brief network interruption.
Configuration in Azure Local (PowerShell):
# Enable SR-IOV on a VM network adapter
Set-VMNetworkAdapter -VMName "DB-Replica-01" `
-Name "FastNIC" `
-IovWeight 100 # >0 enables SR-IOV; weight for resource allocation
# Verify SR-IOV status
Get-VMNetworkAdapter -VMName "DB-Replica-01" |
Select-Object Name, IovUsage, IovQueuePairsRequested, IovQueuePairsAllocated
Performance: Latency and Throughput Comparison
These numbers are representative of production measurements on modern hardware (2024-2025 benchmarks, ConnectX-6 Dx, Intel E810, Xeon Sapphire Rapids):
| Metric | Virtio (OVS kernel) | Virtio (OVS-DPDK) | SR-IOV (VF passthrough) | DPDK in guest (over SR-IOV VF) |
|---|---|---|---|---|
| One-way latency (64B) | 25-50 us | 10-20 us | 5-10 us | 2-5 us |
| Round-trip latency (64B) | 50-100 us | 20-40 us | 10-20 us | 5-10 us |
| Throughput (64B packets) | 1-3 Mpps | 8-14 Mpps | 15-25 Mpps | 20-40 Mpps |
| Throughput (1500B packets) | 10-15 Gbps | 25-40 Gbps | 40-100 Gbps | 40-100 Gbps |
| CPU cores consumed at 10 Gbps (64B) | 2-4 cores | 1-2 dedicated cores | 0.1-0.3 cores | 1-2 dedicated cores |
| Jitter (p99 - p50, 64B) | 20-100 us | 5-15 us | 1-3 us | 0.5-2 us |
Key observations:
- SR-IOV delivers the best latency per CPU ratio -- near-hardware latency with minimal CPU consumption because the NIC hardware does the work.
- DPDK in the guest achieves the absolute lowest latency but requires dedicated CPU cores running poll loops (100% utilization even when idle).
- Virtio with OVS-DPDK is a middle ground: lower latency than kernel OVS but with dedicated CPU cost.
- Virtio with kernel OVS is the default and is sufficient for the vast majority of workloads. The 50-100 us round-trip latency is invisible to web applications, batch jobs, and most database workloads.
Trade-offs: What You Give Up With SR-IOV
SR-IOV is not free. Enabling it sacrifices several features that the overlay network provides:
- No OVN NetworkPolicy enforcement. The VF bypasses OVS/OVN entirely. Packets go directly from the VM to the NIC hardware without touching any OpenFlow pipeline. Kubernetes NetworkPolicy does not apply. Security must be enforced by the NIC eSwitch (limited L2/VLAN rules), physical firewalls, or guest-OS firewalls.
- No transparent overlay networking. SR-IOV VFs are connected to the physical network, not the GENEVE overlay. The VM is on a physical VLAN, visible to physical switches. This means no distributed routing by OVN, no ARP suppression, no micro-segmentation.
- Live migration complexity. A VF is a PCIe device bound to specific hardware on a specific host. It cannot be migrated. KubeVirt handles this by:
  - Hot-unplugging the VF from the VM before migration
  - Migrating the VM using the virtio management interface
  - Hot-plugging a new VF on the destination node after migration
  - This causes a brief network interruption (seconds) on the SR-IOV interface
  - The VM must have a secondary virtio interface for migration traffic
- Hardware dependency. SR-IOV requires specific NIC hardware, specific firmware versions, specific PF drivers, and a BIOS that enables SR-IOV and IOMMU. This limits hardware choices and complicates upgrades.
- Limited VF count. A typical NIC supports 64-128 VFs per PF. On a dense node running 50+ VMs, VFs may become a scarce resource. Each VF also consumes NIC memory for queues and flow tables.
- No OVS hardware offload features. OVS hardware offload (TC flower on the eSwitch in switchdev mode) is a different technology from VF passthrough. Switchdev mode allows OVS to program the eSwitch flow table, providing some overlay features in hardware -- but it is more complex to set up, and not all NIC features are offloadable.
When to Use SR-IOV
SR-IOV is appropriate when:
- Latency-sensitive workloads: Database replication (synchronous replicas where commit latency = network RTT), in-memory data grids (Redis cluster, Hazelcast), financial message buses
- Telco NFV: Virtual network functions that process packets at line rate (vRouters, vFirewalls, session border controllers)
- High-throughput bulk transfer: Backup systems, storage replication, large data movement where CPU overhead of software switching is the bottleneck
- CPU-constrained hosts: When the host CPUs are fully subscribed and cannot spare cycles for software packet processing
SR-IOV is not appropriate when:
- The workload runs fine on virtio (most workloads)
- NetworkPolicy enforcement is required on the interface
- Transparent live migration without network interruption is critical
- The environment needs frequent, flexible network reconfiguration (SR-IOV VLAN changes require PF admin commands, not just a NetworkPolicy update)
2. IOMMU (Input-Output Memory Management Unit)
The Problem: DMA Allows Devices to Access Any System Memory
Direct Memory Access (DMA) is the mechanism by which PCIe devices read and write system memory without CPU involvement. The NIC uses DMA to write incoming packets into a ring buffer in host memory and to read outgoing packets from host memory. DMA is essential for performance -- without it, the CPU would need to copy every byte between the device and memory.
The problem is that without IOMMU, a PCIe device can DMA to any physical memory address. This is a security catastrophe in a virtualization environment:
The DMA Problem Without IOMMU:
Physical Memory Map
+---------------------------+
| 0x0000_0000_0000 |
| Host Kernel Memory | <-- VF could DMA here: kernel compromise
| (page tables, creds) |
+---------------------------+
| 0x0000_1000_0000 |
| VM-A Memory | <-- VF assigned to VM-A: legitimate access
| (guest OS, applications) |
+---------------------------+
| 0x0000_2000_0000 |
| VM-B Memory | <-- VF could DMA here: cross-VM data leak
| (guest OS, applications) |
+---------------------------+
| 0x0000_3000_0000 |
| VM-C Memory | <-- VF could DMA here: cross-VM data leak
| (guest OS, applications) |
+---------------------------+
Without IOMMU, a VF programmed by VM-A can read/write
any physical address -- including VM-B's memory, VM-C's
memory, and the host kernel's memory.
This is not theoretical. A compromised guest OS with
direct access to a VF can craft DMA descriptors that
target arbitrary physical addresses.
For a financial enterprise running regulated workloads with strict VM-to-VM isolation requirements, this is unacceptable. IOMMU solves this problem.
IOMMU Architecture: Intel VT-d and AMD-Vi
The IOMMU sits between PCIe devices and system memory, intercepting and translating every DMA transaction. It enforces a per-device mapping from I/O Virtual Addresses (IOVAs) to physical addresses. A device can only access physical memory pages that the IOMMU has explicitly mapped for it.
IOMMU DMA Remapping Flow:
PCIe Device (VF) IOMMU Physical Memory
+------------+ +---------------+ +------------------+
| | | | | |
| DMA Write: | | 1. Identify | | 0x1000_0000 |
| addr=0x8000| ------> | device by | | VM-A's memory |
| data=0xABCD| | PCIe BDF | | (mapped) |
| | | (source ID)| | |
+------------+ | | +------------------+
| 2. Look up | | 0x2000_0000 |
| IOMMU | | VM-B's memory |
| domain | | (NOT mapped |
| for this | | for this VF) |
| BDF | | |
| | +------------------+
| 3. Walk page |
| table for |
| domain: |
| |
| IOVA | Physical
| 0x0000 --> | 0x1000_0000 (VM-A page 0)
| 0x1000 --> | 0x1000_1000 (VM-A page 1)
| 0x2000 --> | 0x1000_2000 (VM-A page 2)
| ... | ...
| 0x8000 --> | 0x1000_8000 (VM-A page 8)
| |
| 4. Translate: |
| 0x8000 --> |---> 0x1000_8000 (ALLOWED)
| 0x8000 --> |--X-> 0x2000_0000 (BLOCKED: not in
| | this domain's page table)
| |
| 5. If BLOCKED:|
| Generate |
| DMA fault |
| (logged, |
| reported |
| to host) |
+---------------+
The IOMMU page table structure is similar to the CPU's MMU page tables (multi-level, 4KB page granularity) but is indexed by device identity (BDF) rather than process ID. Intel VT-d uses DMAR (DMA Remapping) tables described in the ACPI firmware, and AMD-Vi uses IVRS (I/O Virtualization Reporting Structure) tables.
Two hardware implementations exist:
- Intel VT-d (Virtualization Technology for Directed I/O): Available on all server-class Xeon processors since Nehalem. Provides DMA remapping, interrupt remapping, and ATS (Address Translation Services) for device-side TLB caching.
- AMD-Vi (AMD I/O Virtualization Technology): Available on all EPYC processors. Functionally equivalent to VT-d with DMA remapping, interrupt remapping, and support for AMD-specific features like Guest Virtual APIC (GVA).
IOMMU Groups: What They Are and Why They Matter
An IOMMU group is the smallest unit of device isolation that the IOMMU can enforce. All devices in an IOMMU group share the same DMA address space -- they can see each other's DMA transactions and memory mappings. This means:
- All devices in an IOMMU group must be assigned to the same VM (or all kept on the host). You cannot assign one device from a group to VM-A and another to VM-B -- they could access each other's memory through peer-to-peer DMA.
- For SR-IOV, each VF must be in its own IOMMU group. If two VFs share an IOMMU group, they cannot be assigned to different VMs safely.
IOMMU group membership is determined by the PCIe topology and the IOMMU hardware:
# List IOMMU groups and their devices
for d in /sys/kernel/iommu_groups/*/devices/*; do
n=$(echo $d | awk -F/ '{print $5}')
echo "IOMMU Group $n: $(lspci -nns $(basename $d))"
done
# Example output (good -- each VF in its own group):
# IOMMU Group 72: 65:00.0 Mellanox ConnectX-6 Dx [PF]
# IOMMU Group 73: 65:00.2 Mellanox ConnectX-6 Dx [VF]
# IOMMU Group 74: 65:00.3 Mellanox ConnectX-6 Dx [VF]
# IOMMU Group 75: 65:00.4 Mellanox ConnectX-6 Dx [VF]
# ...
# Example output (bad -- multiple VFs in one group):
# IOMMU Group 72: 65:00.0 Mellanox ConnectX-6 Dx [PF]
# 65:00.2 Mellanox ConnectX-6 Dx [VF]
# 65:00.3 Mellanox ConnectX-6 Dx [VF]
# --> Cannot assign VF 0 and VF 1 to different VMs!
If multiple VFs land in the same IOMMU group, the cause is usually a missing or broken ACS (Access Control Services) configuration on the PCIe bridge between the VFs.
Interrupt Remapping: Preventing Interrupt Injection Attacks
DMA remapping is only half the security picture. Without interrupt remapping, a device can generate arbitrary MSI/MSI-X interrupts targeting any CPU vector -- including vectors used by other VMs or the hypervisor. A compromised device could:
- Inject interrupts into another VM's vCPU, causing spurious interrupts or crashes
- Target hypervisor interrupt handlers, potentially escalating privileges
IOMMU interrupt remapping (part of both VT-d and AMD-Vi) maintains an Interrupt Remapping Table (IRT) that maps device-generated interrupt requests to validated target vCPUs. A device can only generate interrupts that the hypervisor has explicitly configured in the IRT for that device.
Interrupt Remapping:
Device generates MSI-X interrupt:
Target: APIC ID 5, Vector 42
Without Interrupt Remapping:
--> Interrupt delivered directly to CPU 5, Vector 42
--> If CPU 5 is running VM-B, VM-B receives unexpected interrupt
--> Potential DoS or privilege escalation
With Interrupt Remapping (VT-d/AMD-Vi):
--> IOMMU intercepts interrupt request
--> Looks up device BDF in Interrupt Remapping Table
--> IRT entry says: this device may only target APIC ID 3, Vectors 32-39
--> If target matches IRT: deliver interrupt
--> If target does NOT match: block and log fault
Both VT-d and AMD-Vi interrupt remapping are required for production SR-IOV deployments. The Linux kernel will refuse to allow VFIO device assignment if interrupt remapping is not available (unless the administrator explicitly overrides this safety check, which should never be done in production).
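Whether interrupt remapping is actually active can be verified from the kernel log and the VFIO module parameters. A sketch (the exact dmesg wording varies by platform and kernel version):

```bash
# Intel example -- look for the DMAR-IR messages
dmesg | grep -i -e "DMAR-IR" -e "remapping"
# DMAR-IR: Queued invalidation will be enabled to support x2apic and Intr-remapping.
# DMAR-IR: Enabled IRQ remapping in x2apic mode

# The unsafe override must stay at its default of N in production
cat /sys/module/vfio_iommu_type1/parameters/allow_unsafe_interrupts
# N
```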
IOMMU in Linux: Enabling and Configuring
IOMMU is enabled at boot time via kernel command line parameters and the system firmware (BIOS/UEFI):
# BIOS/UEFI settings (varies by vendor):
# - Intel VT-d: Enable
# - AMD-Vi / IOMMU: Enable
# - ACS: Enable (if available as a separate setting)
# - SR-IOV: Enable
# Kernel command line parameters (/etc/default/grub or via MachineConfig in OVE):
# Intel systems:
GRUB_CMDLINE_LINUX="intel_iommu=on iommu=pt"
# AMD systems:
GRUB_CMDLINE_LINUX="amd_iommu=on iommu=pt"
# iommu=pt (passthrough) means:
# - Devices NOT assigned to VMs use identity mapping (IOVA = physical address)
# - This avoids IOMMU overhead for host devices (storage controllers, management NICs)
# - Devices assigned to VMs still get full DMA remapping
# - This is the recommended setting for virtualization hosts
# Verify IOMMU is active after boot:
dmesg | grep -i iommu
# [ 0.000000] DMAR: IOMMU enabled
# [ 0.123456] DMAR: Intel(R) Virtualization Technology for Directed I/O
# [ 1.234567] iommu: Default domain type: Passthrough
# Check DMAR table (Intel VT-d):
dmesg | grep DMAR
# DMAR: Host address width 46
# DMAR: DRHD base: 0x000000fed90000 flags: 0x0
# DMAR: DRHD base: 0x000000fed91000 flags: 0x1
DMAR Table: The DMAR (DMA Remapping Reporting) table is an ACPI table provided by the system firmware. It describes the IOMMU hardware units (DRHDs -- DMA Remapping Hardware Unit Definitions), which PCIe segments and devices they cover, and reserved memory regions. The Linux kernel reads this table at boot to discover and initialize the IOMMU hardware.
IOMMU Domains: An IOMMU domain is a set of page table mappings. Each VM gets its own IOMMU domain. When a VF is assigned to a VM, the VF's IOMMU group is attached to the VM's IOMMU domain. The domain's page table maps the VM's guest physical addresses (IOVAs) to host physical addresses where the VM's memory is actually located.
VFIO (Virtual Function I/O): Safe Device Assignment to VMs
VFIO is the Linux kernel framework that provides safe, controlled device assignment to userspace processes (including QEMU/KVM virtual machines). VFIO is the modern replacement for the legacy pci-stub and uio drivers, and it is the only device assignment mechanism recommended for production use.
VFIO Architecture:
Userspace (QEMU) Kernel (VFIO)
+------------------+ +-------------------------+
| | | |
| QEMU process | | VFIO Core |
| (virt-launcher | | (/dev/vfio/vfio) |
| pod in OVE) | | |
| | | VFIO Container |
| +------------+ | ioctl() | +-------------------+ |
| | vfio-pci |--+------------>| | IOMMU domain | |
| | device | | | | (DMA page tables) | |
| +------------+ | | +-------------------+ |
| | | | | |
| | mmap() | | VFIO Group |
| v | | (/dev/vfio/73) |
| +------------+ | | +-------------------+ |
| | Device BAR | | | | IOMMU Group 73 | |
| | registers | | ioctl() | | Members: | |
| | (MMIO) |--+------------>| | 0000:65:00.2 | |
| +------------+ | | +-------------------+ |
| | | | | |
| | eventfd | | VFIO Device |
| v | | +-------------------+ |
| +------------+ | ioctl() | | PCIe config space | |
| | Interrupt |--+------------>| | BAR mappings | |
| | handling | | | | Interrupt routing | |
| | (MSI-X) | | | | Reset control | |
| +------------+ | | +-------------------+ |
| | | | |
+------------------+ | v |
| +-------------------+ |
| | IOMMU Hardware | |
| | (VT-d / AMD-Vi) | |
| | DMA remapping | |
| | Interrupt remap | |
| +-------------------+ |
+-------------------------+
The VFIO model has three layers:
- VFIO Container (`/dev/vfio/vfio`): Represents an IOMMU domain. A QEMU process opens the container and establishes the DMA mapping space. All devices within the container share the same IOMMU page tables. Typically, each VM has one container (one IOMMU domain).
- VFIO Group (`/dev/vfio/<group_number>`): Represents an IOMMU group. Opening a group file descriptor gives access to all devices in that IOMMU group. VFIO enforces that all devices in a group are either all assigned to the same userspace process or all kept on the host -- partial assignment is blocked because it would violate IOMMU isolation.
- VFIO Device: Obtained by opening a specific device within a group (via `ioctl(VFIO_GROUP_GET_DEVICE_FD, "0000:65:00.2")`). The device file descriptor provides access to:
  - Region access: `mmap()` of device BARs (for MMIO register access from userspace)
  - DMA mapping: `ioctl(VFIO_IOMMU_MAP_DMA)` to create IOVA-to-physical mappings
  - Interrupt handling: `ioctl(VFIO_DEVICE_SET_IRQS)` to route device MSI-X interrupts to eventfd file descriptors that QEMU polls
  - Device reset: `ioctl(VFIO_DEVICE_RESET)` to perform a PCIe FLR (Function Level Reset) when the VM is destroyed
The userspace workflow for assigning a VF to a VM via VFIO:
VFIO Assignment Sequence:
1. Unbind VF from kernel driver:
echo 0000:65:00.2 > /sys/bus/pci/devices/0000:65:00.2/driver/unbind
2. Bind VF to vfio-pci driver:
echo vfio-pci > /sys/bus/pci/devices/0000:65:00.2/driver_override
echo 0000:65:00.2 > /sys/bus/pci/drivers/vfio-pci/bind
--> Kernel creates /dev/vfio/73 (for IOMMU group 73)
3. QEMU opens VFIO container:
container_fd = open("/dev/vfio/vfio", O_RDWR)
ioctl(container_fd, VFIO_SET_IOMMU, VFIO_TYPE1v2_IOMMU)
4. QEMU opens VFIO group:
group_fd = open("/dev/vfio/73", O_RDWR)
ioctl(group_fd, VFIO_GROUP_SET_CONTAINER, &container_fd)
5. QEMU gets device FD:
device_fd = ioctl(group_fd, VFIO_GROUP_GET_DEVICE_FD, "0000:65:00.2")
6. QEMU maps VM memory for DMA:
struct vfio_iommu_type1_dma_map dma_map = {
.vaddr = guest_memory_ptr, // VM's memory in QEMU's address space
.iova = 0x0, // IOVA base for this VM
.size = vm_memory_size, // e.g., 16 GB
.flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE
};
ioctl(container_fd, VFIO_IOMMU_MAP_DMA, &dma_map)
--> IOMMU now allows device 0000:65:00.2 to DMA only to VM-A's memory
7. QEMU mmaps device BARs for register access:
bar0 = mmap(NULL, bar0_size, PROT_READ|PROT_WRITE, MAP_SHARED,
device_fd, VFIO_PCI_BAR0_REGION_INDEX * PAGE_SIZE)
8. QEMU configures MSI-X interrupts:
ioctl(device_fd, VFIO_DEVICE_SET_IRQS, &irq_set)
--> Device interrupts are delivered to QEMU via eventfd
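Before launching QEMU it is worth sanity-checking that the group node exists and that nothing else in the group is still claimed by a kernel driver. A quick sketch, reusing the example VF and group number from above:

```bash
# Which IOMMU group does the VF belong to?
readlink /sys/bus/pci/devices/0000:65:00.2/iommu_group
# ../../../kernel/iommu_groups/73

# The group character device appears once vfio-pci is bound
ls -l /dev/vfio/
# crw-rw-rw- 1 root root ... vfio
# crw------- 1 root root ... 73

# Every device in the group must be bound to vfio-pci (or be unbound)
ls /sys/kernel/iommu_groups/73/devices/
```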
ACS (Access Control Services): Ensuring Peer-to-Peer DMA Isolation
ACS is a PCIe capability on switch ports and root ports that controls whether peer-to-peer DMA transactions between devices behind the same switch are allowed. Without ACS, two VFs on the same NIC could potentially perform peer-to-peer DMA through the PCIe switch, bypassing the IOMMU entirely.
ACS provides several control bits:
- ACS Source Validation (SV): Ensures the source of a transaction is a valid requester
- ACS Translation Blocking (TB): Blocks translated (ATS) requests from being peer-to-peer routed
- ACS P2P Request Redirect (RR): Forces peer-to-peer requests to be routed upstream to the root complex (where the IOMMU sits) instead of being directly forwarded
- ACS P2P Completion Redirect (CR): Same for completion transactions
- ACS Upstream Forwarding (UF): Controls forwarding of transactions to upstream ports
- ACS P2P Egress Control (EC): Per-port egress filtering
For SR-IOV in virtualization, the critical bits are RR and CR: they ensure that DMA between VFs is always routed through the IOMMU, preventing any VF from accessing another VF's memory without IOMMU translation.
# Verify ACS is enabled on PCIe bridge ports
lspci -vvv -s 64:00.0 | grep "Access Control Services"
# Capabilities: [148] Access Control Services
# ACSCap: SrcValid+ TransBlk+ ReqRedir+ CmpltRedir+ UpFwd- EgressCtrl- DirectTrans-
# ACSCtl: SrcValid+ TransBlk+ ReqRedir+ CmpltRedir+ UpFwd- EgressCtrl- DirectTrans-
If ACS is not supported by the PCIe topology (common on older server platforms), the kernel may group multiple VFs into a single IOMMU group, preventing independent assignment to different VMs. This is a hardware limitation that cannot be fixed in software (though the pcie_acs_override=downstream kernel parameter exists as a dangerous workaround that should never be used in production -- it lies to the kernel about ACS support, undermining isolation).
Security Implications for Financial Workloads
For a Tier-1 financial enterprise, the IOMMU/VFIO/ACS stack provides the following guarantees when correctly configured:
- DMA isolation: VM-A's VF can only access VM-A's memory. DMA to any other physical address is blocked by the IOMMU and generates a fault logged by the host kernel.
- Interrupt isolation: VM-A's VF can only generate interrupts targeting VM-A's vCPUs. Interrupt injection to other VMs or the hypervisor is blocked by interrupt remapping.
- PCIe peer-to-peer isolation: With ACS enabled, VFs on the same NIC cannot perform peer-to-peer DMA through the PCIe switch. All DMA goes through the IOMMU.
- Device reset on VM termination: When a VM is destroyed, VFIO performs a PCIe FLR (Function Level Reset) on the VF, clearing all device state. The next VM to receive the VF gets a clean device with no residual state from the previous VM.
These guarantees are equivalent to or stronger than what VMware provides with DirectPath I/O, where the same VT-d/AMD-Vi hardware is used for device passthrough. The isolation is hardware-enforced, not software-enforced -- a compromised hypervisor cannot weaken the DMA isolation (assuming the IOMMU hardware itself is not compromised).
Compliance note: For workloads subject to FINMA regulations or PCI-DSS requirements, the IOMMU provides a hardware-enforced isolation boundary. However, regulatory auditors may require documentation of the IOMMU configuration (enabled in BIOS, kernel parameters set, ACS verified, VFIO in use) as evidence that the isolation is in effect. This documentation should be part of the platform's security baseline.
3. DPDK (Data Plane Development Kit)
The Problem: Kernel Networking Is Slow for High-Throughput Workloads
Even with SR-IOV, if the guest VM uses the standard Linux kernel networking stack to process packets, performance is limited by the kernel's packet processing model:
Why Kernel Networking Is Slow (Per-Packet Cost):
Packet arrives at NIC
|
v
1. Hardware interrupt (IRQ)
--> Context switch from running process to interrupt handler
--> Cost: ~1-2 us (interrupt entry, register save)
|
v
2. NAPI poll / softirq processing
--> NIC driver reads packet from ring buffer
--> Allocates sk_buff (socket buffer) structure
--> Cost: ~0.5-1 us (memory allocation, cache miss)
|
v
3. Kernel network stack
--> netfilter/iptables hooks (conntrack, NAT checks)
--> Routing table lookup
--> Socket buffer queuing
--> Cost: ~2-5 us (multiple function calls, cache misses)
|
v
4. Copy to userspace
--> recvmsg() / read() system call
--> Data copied from kernel buffer to userspace buffer
--> Cost: ~0.5-2 us (syscall overhead + memcpy)
|
v
5. Application processes packet
--> Cost: depends on application
Total per-packet kernel overhead: ~5-10 us
At 14.88 Mpps (10 Gbps line rate, 64-byte packets):
10 us/packet * 14.88M packets = ~149 CPU-seconds per second
--> Physically impossible on a single core
--> Requires multi-queue RSS across 8-16 cores
--> Even then, ~50-70% of CPU time spent in kernel overhead
The fundamental bottleneck is the interrupt-driven model: each packet (or batch of packets) triggers an interrupt, a context switch, kernel function calls, memory allocation, and at least one data copy. For bulk throughput this is acceptable (batching amortizes the cost), but for latency-critical packet processing, the per-packet overhead dominates.
DPDK Architecture
DPDK eliminates the kernel from the data path entirely. It is a set of userspace libraries and drivers that allow an application to directly access NIC hardware from userspace, processing packets in a tight poll loop without any interrupts, context switches, or kernel involvement.
DPDK Architecture:
+================================================================+
| DPDK Application (Userspace) |
| |
| +----------------------------------------------------------+ |
| | Application Logic | |
| | (packet processing, forwarding, filtering) | |
| +----------------------------------------------------------+ |
| | | | | |
| v v v v |
| +----------+ +----------+ +----------+ +----------+ |
| | rte_mbuf | | rte_ring | | rte_hash | |rte_timer | |
| | (packet | | (lock- | | (hash | |(timing) | |
| | buffers)| | free | | tables) | | | |
| | | | queues) | | | | | |
| +----------+ +----------+ +----------+ +----------+ |
| | |
| v |
| +----------------------------------------------------------+ |
| | rte_ethdev (Ethernet Device Abstraction) | |
| | - rte_eth_rx_burst(): poll RX queue, return packet batch | |
| | - rte_eth_tx_burst(): submit TX batch to NIC | |
| +----------------------------------------------------------+ |
| | |
| v |
| +----------------------------------------------------------+ |
| | PMD (Poll Mode Driver) | |
| | - mlx5 PMD (Mellanox ConnectX) | |
| | - i40e PMD (Intel X710/XL710) | |
| | - ice PMD (Intel E810) | |
| | - virtio PMD (for DPDK inside VMs) | |
| | | |
| | The PMD runs in a tight poll loop: | |
| | while (1) { | |
| | nb_rx = rte_eth_rx_burst(port, queue, mbufs, 32); | |
| | if (nb_rx > 0) | |
| | process_packets(mbufs, nb_rx); | |
| | rte_eth_tx_burst(port, queue, tx_mbufs, nb_tx); | |
| | } | |
| | --> 100% CPU utilization, zero interrupts | |
| +----------------------------------------------------------+ |
| | |
| v |
| +----------------------------------------------------------+ |
| | EAL (Environment Abstraction Layer) | |
| | - Hugepage memory management (2MB / 1GB pages) | |
| | - CPU core affinity (pin PMD thread to specific core) | |
| | - PCI device enumeration and driver binding | |
| | - NUMA awareness (allocate memory on same node as NIC) | |
| +----------------------------------------------------------+ |
| | |
+======|==========================================================+
|
v
+----------------------------------------------------------+
| NIC Hardware |
| - Direct register access via mmap'd BARs (UIO or VFIO) |
| - DMA ring buffers in hugepage memory |
| - No kernel driver involvement after initialization |
+----------------------------------------------------------+
Key DPDK components:
- EAL (Environment Abstraction Layer): Initializes the DPDK environment -- discovers CPUs, maps hugepage memory, enumerates PCI devices, binds PMDs. EAL is the first thing initialized and provides the runtime environment for everything else.
- PMD (Poll Mode Driver): The userspace NIC driver. Instead of registering an interrupt handler in the kernel, the PMD polls the NIC's RX descriptor ring in a busy loop. When a packet arrives, the NIC writes it to a DMA ring buffer in hugepage memory and updates the descriptor. The PMD reads the descriptor, processes the packet, and moves on -- no interrupt, no context switch, no kernel involvement. The cost is permanent 100% CPU utilization on the dedicated core(s).
- rte_mbuf: DPDK's packet buffer structure, allocated from hugepage memory pools. Unlike the kernel's sk_buff (which is allocated from slab caches and can fragment), mbufs are pre-allocated in contiguous hugepage memory, ensuring zero allocation overhead and predictable cache behavior.
- rte_ring: Lock-free ring buffers used for inter-core communication. When multiple cores process packets (e.g., one core receives, another processes, a third transmits), rte_ring provides the queue between them without locks or cache coherence overhead.
- rte_ethdev: The abstract Ethernet device API. Applications call `rte_eth_rx_burst()` and `rte_eth_tx_burst()` regardless of the underlying NIC hardware. The PMD implements the actual hardware access.
How DPDK Bypasses the Kernel
DPDK uses two mechanisms to access NIC hardware from userspace:
- UIO (Userspace I/O): The `igb_uio` kernel module maps the NIC's PCI BARs into userspace and provides a simple interrupt notification mechanism. UIO is simple but does not provide IOMMU protection -- DMA is not isolated. UIO is not recommended for production.
- VFIO: The same VFIO framework described in the IOMMU section. DPDK's VFIO support provides full IOMMU-protected DMA mapping. VFIO is the recommended backend for production DPDK deployments. The DPDK EAL opens `/dev/vfio/vfio` and `/dev/vfio/<group>`, maps the device's BARs via `mmap()`, and sets up DMA mappings via `VFIO_IOMMU_MAP_DMA`. (The binding step is shown in the sketch below.)
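Binding a device for DPDK is usually done with the `dpdk-devbind.py` helper shipped with DPDK. A sketch, using a hypothetical Intel E810 VF address; Mellanox mlx5 devices use a bifurcated model and stay bound to their kernel driver, so this step does not apply to them:

```bash
# Show current driver bindings for network devices
dpdk-devbind.py --status net

# Bind the VF to vfio-pci so a DPDK application (or OVS-DPDK) can claim it
dpdk-devbind.py --bind=vfio-pci 0000:17:01.2

dpdk-devbind.py --status net | grep 0000:17:01.2
# 0000:17:01.2 'Ethernet Adaptive Virtual Function' drv=vfio-pci unused=iavf
```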
DPDK vs. Kernel Networking Path:
Kernel Path (per packet): DPDK Path (per packet):
+----------+ +----------+
| NIC HW | | NIC HW |
+----+-----+ +----+-----+
| |
| HW interrupt (IRQ) | No interrupt
v | (PMD polls descriptor ring)
+----------+ |
| IRQ | |
| handler | |
+----+-----+ |
| |
| softirq / NAPI poll |
v |
+----------+ |
| NIC | |
| driver | |
| (kernel) | |
+----+-----+ |
| |
| sk_buff alloc, netif_receive_skb() |
v |
+----------+ |
| netfilter| |
| /iptables| |
+----+-----+ |
| |
| routing, socket lookup |
v v
+----------+ +----------+
| socket | | PMD |
| buffer | | (user- |
+----+-----+ | space) |
| +----+-----+
| syscall (read/recvmsg) |
| copy_to_user() | rte_mbuf (hugepage, zero-copy)
v v
+----------+ +----------+
| App | | App |
| (user- | | (user- |
| space) | | space) |
+----------+ +----------+
Kernel: ~5-10 us per packet DPDK: ~0.1-0.5 us per packet
Kernel: interrupt-driven, bursty DPDK: poll-driven, deterministic
Kernel: shared CPU with other tasks DPDK: dedicated CPU core(s)
Key DPDK Libraries
| Library | Purpose | Equivalent in Kernel |
|---|---|---|
| `rte_ethdev` | Ethernet device abstraction (RX/TX burst API) | `struct net_device`, `napi_gro_receive()` |
| `rte_mbuf` | Packet buffer management (pool-based, hugepage-backed) | `sk_buff` (slab-allocated) |
| `rte_ring` | Lock-free FIFO ring (inter-core communication) | `skb_queue_head` (spinlock-protected) |
| `rte_hash` | Hash table (for flow tables, MAC lookup) | Kernel hash tables (`rhashtable`) |
| `rte_lpm` | Longest Prefix Match (IP routing) | `fib_table` / `ip_fib_lookup()` |
| `rte_acl` | Access Control List (packet classification) | netfilter / iptables |
| `rte_flow` | NIC hardware flow offload (rte_flow API) | `tc flower` / `ethtool ntuple` |
| `rte_mempool` | Fixed-size object pool (zero-alloc at runtime) | `kmem_cache` / SLAB |
DPDK + OVS: OVS-DPDK Userspace Datapath
Open vSwitch has two datapath implementations:
- Kernel datapath (default): OVS installs flow entries into the kernel datapath module (`openvswitch.ko`). Packets are processed in the kernel using the megaflow cache.
- DPDK datapath (OVS-DPDK): OVS runs entirely in userspace using DPDK PMDs for NIC access and DPDK-based flow processing. The kernel is not involved in any packet processing.
OVS-DPDK Architecture:
+-----------------------------------------------------------------+
| OVS-DPDK Process (ovs-vswitchd with DPDK enabled) |
| |
| +---------------------+ +---------------------+ |
| | PMD Thread | | PMD Thread | |
| | (CPU core 2) | | (CPU core 3) | |
| | | | | |
| | Poll NIC RX queue | | Poll vhost-user | |
| | Flow table lookup | | socket (VM ports) | |
| | Execute actions | | Flow table lookup | |
| | TX to output port | | Execute actions | |
| +---------------------+ +---------------------+ |
| | | |
| v v |
| +---------------------------------------------------+ |
| | OVS Datapath (Userspace / DPDK) | |
| | | |
| | EMC (Exact Match Cache) | |
| | --> per-PMD-thread, hash of 5-tuple | |
| | --> fastest: ~100 ns per lookup | |
| | | |
| | dpcls (Datapath Classifier / Megaflow Cache) | |
| | --> TSS (Tuple Space Search) based | |
| | --> wildcarded flow entries | |
| | --> ~200-500 ns per lookup | |
| | | |
| | upcall to ovs-vswitchd (slow path) | |
| | --> full OpenFlow pipeline evaluation | |
| | --> installs new megaflow entry | |
| | --> ~10-50 us (rare, first packet of new flow) | |
| +---------------------------------------------------+ |
| | | |
| v v |
| +------------------+ +------------------+ |
| | DPDK PMD | | vhost-user | |
| | (physical NIC) | | (VM interface) | |
| | mlx5/ice/i40e | | | |
| +------------------+ +------------------+ |
+---------|-----------------------------|---------------------------+
| |
v v
+------------------+ +------------------+
| Physical NIC | | VM (QEMU) |
| (ConnectX-6 Dx) | | virtio-net with |
| | | vhost-user backend|
+------------------+ +------------------+
The key difference from kernel OVS: in OVS-DPDK, PMD threads (one per dedicated CPU core) continuously poll both physical NIC queues and vhost-user sockets (VM ports). There are no interrupts. The flow table lookup, action execution, and packet forwarding all happen in userspace. The kernel is not involved at all after initialization.
vhost-user is the mechanism by which OVS-DPDK communicates with QEMU/KVM VMs. Instead of using the kernel's vhost-net module, OVS-DPDK and QEMU share a Unix domain socket that provides direct access to shared memory regions containing the virtio ring descriptors and packet buffers. This eliminates the kernel from the VM-to-switch path as well.
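At the OVS level, the userspace datapath and its ports are wired together with a handful of `ovs-vsctl` commands. A minimal sketch (core mask, socket memory, interface names, and the PCI address are illustrative):

```bash
# One-time DPDK initialization of ovs-vswitchd
ovs-vsctl set Open_vSwitch . other_config:dpdk-init=true
ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=0xC            # PMD threads on cores 2-3
ovs-vsctl set Open_vSwitch . other_config:dpdk-socket-mem="2048,2048" # MB per NUMA node

# Userspace (netdev) bridge, a physical DPDK port, and a vhost-user port for a VM
ovs-vsctl add-br br-dpdk -- set bridge br-dpdk datapath_type=netdev
ovs-vsctl add-port br-dpdk dpdk-p0 -- set Interface dpdk-p0 type=dpdk \
    options:dpdk-devargs=0000:65:00.0
ovs-vsctl add-port br-dpdk vhost-vm1 -- set Interface vhost-vm1 \
    type=dpdkvhostuserclient options:vhost-server-path=/var/run/openvswitch/vhost-vm1
```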
Performance: OVS-Kernel vs OVS-DPDK vs SR-IOV
| Metric | OVS Kernel Datapath | OVS-DPDK | SR-IOV (no DPDK in guest) | SR-IOV + DPDK in guest |
|---|---|---|---|---|
| Latency (64B, VM-to-VM same host) | 30-60 us | 10-20 us | 5-15 us | 3-8 us |
| Latency (64B, VM-to-VM cross host) | 50-100 us | 20-40 us | 10-25 us | 5-15 us |
| Throughput (64B, single queue) | 1-2 Mpps | 5-8 Mpps | 10-15 Mpps | 15-25 Mpps |
| Throughput (64B, multi-queue) | 3-6 Mpps | 15-25 Mpps | 25-40 Mpps | 30-50 Mpps |
| Throughput (1500B, multi-queue) | 15-20 Gbps | 40-80 Gbps | 80-100 Gbps | 80-100 Gbps |
| CPU cost | Variable (interrupts) | 1-4 dedicated cores | ~0 (NIC hardware) | 1-2 dedicated cores per VM |
| Jitter (p99-p50) | 20-100 us | 3-10 us | 1-3 us | 0.5-1 us |
| Live migration | Transparent | Transparent | Complex (VF hot-unplug) | Complex + app restart |
| NetworkPolicy | Yes (OVN ACLs) | Yes (OVN ACLs) | No (eSwitch only) | No |
| Overlay support | Yes (GENEVE) | Yes (GENEVE) | No (physical VLAN) | No (physical VLAN) |
Key observation: for most workloads, OVS kernel datapath is sufficient. The 50-100 us latency is invisible to web applications, APIs, and batch jobs. OVS-DPDK is justified when you need the overlay features (NetworkPolicy, GENEVE) but also need lower latency or higher packet rates. SR-IOV is justified when you need the absolute lowest latency and can sacrifice overlay features.
DPDK in OVE: When and How OVS-DPDK Is Configured
OVS-DPDK in OpenShift/OVE is configured via the Cluster Network Operator (CNO) at cluster install time or via day-2 configuration. It is not the default -- the default is OVS kernel datapath.
To enable OVS-DPDK, the cluster admin must:
- Reserve hugepages on worker nodes (via a `MachineConfig` or `PerformanceProfile` CR from the Node Tuning Operator):
apiVersion: performance.openshift.io/v2
kind: PerformanceProfile
metadata:
name: dpdk-profile
spec:
cpu:
isolated: "2-19,22-39" # CPUs for DPDK PMD threads and VMs
reserved: "0-1,20-21" # CPUs for system/kubelet
hugepages:
defaultHugepagesSize: 1G
pages:
- size: 1G
count: 16 # 16 GB of 1G hugepages for DPDK
node: 0 # NUMA node 0
- size: 1G
count: 16
node: 1 # NUMA node 1
nodeSelector:
node-role.kubernetes.io/worker-dpdk: ""
numa:
topologyPolicy: single-numa-node
- Configure OVS-DPDK via the `OVSInterface` CR or the Network Operator configuration:
apiVersion: operator.openshift.io/v1
kind: Network
metadata:
name: cluster
spec:
defaultNetwork:
ovnKubernetesConfig:
egressIPConfig: {}
type: OVNKubernetes
additionalNetworks: []
# OVS-DPDK configuration is set via MachineConfig that modifies
# the OVS startup to include DPDK initialization:
# ovs-vsctl set Open_vSwitch . other_config:dpdk-init=true
# ovs-vsctl set Open_vSwitch . other_config:dpdk-socket-mem="4096,4096"
# ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=0x3C # cores 2-5
- Create vhost-user interfaces for VMs that need DPDK connectivity (via a Multus NAD with `type: userspace`).
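Once the profile and OVS settings are applied, a quick node-level check confirms that hugepages were reserved and that the PMD threads are running. A sketch (run on the node, for example via `oc debug node/<node-name>`; values are illustrative):

```bash
# Hugepage reservation visible to the kernel
grep Huge /proc/meminfo
# HugePages_Total:      32
# HugePages_Free:       28
# Hugepagesize:    1048576 kB

# DPDK-related OVS settings actually in effect
ovs-vsctl get Open_vSwitch . other_config
# {dpdk-init="true", dpdk-socket-mem="4096,4096", pmd-cpu-mask="0x3C"}

# Per-PMD-thread packet and cycle counters
ovs-appctl dpif-netdev/pmd-stats-show | head -20
```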
DPDK in Guest VMs: vhost-user and virtio PMD
When DPDK is used inside a guest VM (for applications like virtual routers or packet processors), the connectivity path is:
DPDK in Guest VM Architecture:
Guest VM
+--------------------------------------------------+
| DPDK Application |
| +----------------------------------------------+ |
| | rte_eth_rx_burst() / rte_eth_tx_burst() | |
| +----------------------------------------------+ |
| | |
| v |
| +----------------------------------------------+ |
| | virtio PMD (DPDK userspace virtio driver) | |
| | - Polls virtio ring descriptors | |
| | - Zero-copy via shared hugepage memory | |
| | - No guest kernel involvement | |
| +----------------------------------------------+ |
| | |
+---------|------------------------------------------+
| (shared hugepage memory + virtio rings)
| via vhost-user Unix socket
v
+--------------------------------------------------+
| OVS-DPDK (Host) |
| +----------------------------------------------+ |
| | vhost-user port | |
| | - Polls virtio ring from guest side | |
| | - Zero-copy between guest mbuf and OVS mbuf | |
| +----------------------------------------------+ |
| | |
| v |
| +----------------------------------------------+ |
| | OVS flow pipeline (userspace datapath) | |
| +----------------------------------------------+ |
| | |
| v |
| +----------------------------------------------+ |
| | DPDK PMD (physical NIC) | |
| +----------------------------------------------+ |
+--------------------------------------------------+
|
v
+--------------------------------------------------+
| Physical NIC |
+--------------------------------------------------+
The entire path -- from the DPDK application in the guest, through the virtio ring, through OVS-DPDK, to the physical NIC -- is pure userspace with no kernel involvement on either the guest or the host side. This achieves the absolute lowest latency possible in a virtualized environment (2-5 us one-way for 64-byte packets).
For SR-IOV + DPDK in guest (without OVS-DPDK on the host), the path is even shorter:
SR-IOV + DPDK in Guest:
Guest VM
+--------------------------------------------------+
| DPDK Application |
| +----------------------------------------------+ |
| | rte_eth_rx_burst() / rte_eth_tx_burst() | |
| +----------------------------------------------+ |
| | |
| v |
| +----------------------------------------------+ |
| | mlx5 PMD / ice PMD (native NIC PMD) | |
| | - Direct hardware access to VF registers | |
| | - DMA ring buffers in guest hugepage memory | |
| | - No virtio, no host involvement | |
| +----------------------------------------------+ |
+---------|------------------------------------------+
| (VFIO passthrough, IOMMU-protected DMA)
v
+--------------------------------------------------+
| VF on Physical NIC |
| - Direct NIC hardware, no software switch |
| - eSwitch handles L2 forwarding |
+--------------------------------------------------+
This is the absolute fastest path: DPDK PMD in the guest talks directly to the VF hardware via VFIO/IOMMU. No virtio layer, no OVS, no host kernel. Latency is limited only by the PCIe bus and NIC hardware (~2-3 us one-way).
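A quick way to exercise this path inside the guest is DPDK's bundled `testpmd` application. A sketch, assuming the VF appears in the guest at the hypothetical PCI address 0000:06:00.0 and that the guest has hugepages configured and (for non-Mellanox NICs) the VF bound to vfio-pci:

```bash
# Reserve 2 MB hugepages inside the guest (1 GB pages require a boot parameter)
echo 1024 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages

# Bind the VF to vfio-pci (skip for mlx5 -- the bifurcated driver stays on mlx5_core)
dpdk-devbind.py --bind=vfio-pci 0000:06:00.0

# Forward packets between RX and TX with MAC swap; print stats every second
dpdk-testpmd -l 1-2 -n 4 -a 0000:06:00.0 -- --forward-mode=macswap --stats-period 1
```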
Trade-offs: The Cost of DPDK
-
Dedicated CPU cores. DPDK PMD threads run in a busy-loop at 100% CPU utilization. A core dedicated to DPDK cannot run anything else. On the host, OVS-DPDK typically needs 1-4 cores depending on throughput requirements. In the guest, each DPDK application needs at least 1 core for the PMD, plus additional cores for application logic. These cores are permanently consumed whether or not there is traffic.
-
Hugepage memory reservation. DPDK requires hugepage memory (2 MB or 1 GB pages) for packet buffers and internal data structures. This memory must be reserved at boot time and is unavailable to the kernel's page cache or other applications. For OVS-DPDK, 2-8 GB per NUMA node is typical. For DPDK in guest VMs, the guest must also have hugepages configured.
- Operational complexity. DPDK adds significant operational complexity:
  - CPU pinning must be carefully planned (NUMA-aware, avoiding hyper-threading siblings for latency-critical cores)
  - Hugepage allocation must account for NUMA topology
  - NIC driver binding (VFIO vs. UIO) must be correct
  - OVS-DPDK configuration (socket memory, PMD core mask, RX/TX queue sizes) requires tuning (see the sketch after this list)
  - Monitoring is different (no standard kernel network counters; use ovs-appctl dpif-netdev/pmd-stats-show)
- Application changes required. DPDK is not a drop-in replacement for kernel networking. Applications must be written (or rewritten) to use the DPDK API (rte_eth_rx_burst(), rte_mbuf, etc.). Standard socket-based applications cannot use DPDK without a compatibility layer (like fd.io VPP or a DPDK-based TCP/IP stack such as TLDK or F-Stack).
- No kernel networking features. DPDK bypasses the kernel, which means no iptables, no tc, no tcpdump (on DPDK-bound interfaces), no kernel routing table, no /proc/net/* statistics. Debugging and monitoring require DPDK-specific tools.
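The sketch below illustrates the kind of host-side OVS-DPDK settings the tuning bullet above refers to, assuming a two-socket node; the core mask, socket memory, and PCI address are placeholders, not recommendations.

```bash
# Illustrative OVS-DPDK host configuration (values are placeholders and
# must be derived from the node's NUMA topology and traffic profile).

# Enable the DPDK datapath inside OVS.
ovs-vsctl set Open_vSwitch . other_config:dpdk-init=true

# Reserve hugepage-backed memory per NUMA node for DPDK buffer pools (MB).
ovs-vsctl set Open_vSwitch . other_config:dpdk-socket-mem="2048,2048"

# Pin PMD polling threads to dedicated cores (hex CPU mask; here cores 2-3).
ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=0xC

# Attach the physical NIC as a DPDK port on a userspace-datapath bridge.
ovs-vsctl add-br br-phy -- set bridge br-phy datapath_type=netdev
ovs-vsctl add-port br-phy dpdk-p0 -- \
  set Interface dpdk-p0 type=dpdk options:dpdk-devargs=0000:65:00.0
```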
When to Use DPDK
DPDK is appropriate when:
- NFV workloads: Virtual routers (VyOS, FRR in DPDK mode), virtual firewalls (VPP-based), session border controllers, DPI engines, load balancers -- any VM that is fundamentally a packet processing application
- Network appliance VMs: VMs that function as network appliances and need to process millions of packets per second (the VM itself is the "switch" or "router")
- High-frequency trading adjacent workloads: Feed handlers, order routers, market data distributors where microsecond-level latency improvements have direct business value
- OVS-DPDK on the host (without DPDK in guest): When the organization wants overlay features (GENEVE, NetworkPolicy) but needs better latency/throughput than kernel OVS provides. This is the most common DPDK use case in production.
DPDK is not appropriate when:
- The workload is a standard application (web server, database, batch job) that uses socket-based networking
- The workload does not process enough packets per second to justify dedicated CPU cores (below ~1 Mpps, kernel networking is efficient enough)
- Operational simplicity is a priority (DPDK adds significant configuration and monitoring complexity)
- The VM needs to be easily live-migrated (DPDK state in the guest is not trivially migratable)
Rule of thumb for this organization: Unless the workload is a packet processing application (NFV, network appliance) or the team has measured and documented that kernel networking is the bottleneck (not the application, not the storage, not the database), DPDK adds complexity without benefit. Start with virtio + kernel OVS. If latency is measured to be the bottleneck, try SR-IOV first (simpler, lower CPU cost). Only reach for DPDK if SR-IOV alone is insufficient or if overlay features (NetworkPolicy) must be preserved alongside low latency.
How the Candidates Handle This
Comparison Table
| Aspect | VMware (Current) | OVE | Azure Local | Swisscom ESC |
|---|---|---|---|---|
| SR-IOV support | DirectPath I/O (SR-IOV passthrough via ESXi); VMware-validated NIC list | SR-IOV Network Operator (declarative CRDs); upstream SR-IOV CNI; automated VF lifecycle | SR-IOV via Hyper-V vSwitch; SET integration; automatic fallback to synthetic path | Depends on Swisscom offering; likely limited or unavailable (managed service, customer has no host-level access) |
| SR-IOV management model | vCenter UI: edit VM settings, add PCI passthrough device; limited automation | Kubernetes-native: SriovNetworkNodePolicy + SriovNetwork CRDs; GitOps-friendly; device plugin for scheduling | PowerShell / WAC: Set-VMNetworkAdapter -IovWeight; Network ATC for intent-based config | N/A (customer does not manage host-level NIC configuration) |
| SR-IOV live migration | Not supported with DirectPath I/O (VM must be powered off to migrate) | Supported with network interruption: VF hot-unplug, migrate over virtio, VF hot-plug on destination | Supported with transparent fallback: VF detach -> synthetic NIC -> VF attach on destination (brief performance dip, no connectivity loss) | N/A |
| IOMMU requirement | VT-d required for DirectPath I/O; ESXi enables automatically | VT-d/AMD-Vi required; enabled via BIOS + kernel params (intel_iommu=on); PerformanceProfile automates config | VT-d required for SR-IOV; Windows Hyper-V enables automatically when DDA or SR-IOV configured | N/A (infrastructure managed by Swisscom) |
| VFIO | ESXi uses its own passthrough framework (vmklinux passthrough), not VFIO | VFIO-pci for VF assignment to VMs; standard Linux VFIO stack | Not applicable (Hyper-V uses DDA -- Discrete Device Assignment, Microsoft's equivalent of VFIO) | N/A |
| DPDK on host (OVS-DPDK) | Not available; ESXi uses its own kernel-based vSwitch / N-VDS with hardware offload | Supported: OVS-DPDK configurable via MachineConfig and PerformanceProfile; requires hugepages, dedicated CPU cores, NUMA tuning | Not available; Hyper-V vSwitch / VFP is kernel-based; no DPDK integration | N/A |
| DPDK in guest VMs | Supported: VM with SR-IOV VF can run DPDK with native PMD; or virtio PMD with vhost-user (limited VMware support) | Fully supported: SR-IOV VF + DPDK PMD in guest; or vhost-user + virtio PMD with OVS-DPDK on host | Supported: VM with SR-IOV VF can run DPDK with native PMD; vhost-user not available (no OVS-DPDK on host) | N/A (no host-level control for DPDK configuration) |
| Hardware offload (eSwitch/switchdev) | NSX supports NIC offload for GENEVE flows (validated NIC list, limited flow types) | OVS TC flower offload in switchdev mode (ConnectX-5+, E810); offloads OVS flows to the NIC eSwitch hardware while preserving OVN features | AccelNet in Azure cloud (FPGA SmartNIC); on-premises Azure Local uses standard NIC SR-IOV offload only | N/A |
| Performance tuning automation | vCenter affinity rules, NUMA topology visible in UI, manual tuning | Node Tuning Operator + PerformanceProfile CRD: automated CPU isolation, hugepage reservation, NUMA pinning, IRQ affinity | Network ATC for intent-based NIC config; manual PowerShell for advanced tuning | N/A |
| Debugging tools | esxcli, vsish, packet capture via dvfilter | dpdk-testpmd, ovs-appctl dpif-netdev/pmd-stats-show, ovs-ofctl, ethtool -S, standard Linux perf tools | Get-VMNetworkAdapter, Performance Monitor, ETW tracing | Ticket to Swisscom |
Key Differences in Prose
SR-IOV operational model: The most significant difference is how SR-IOV is managed day-to-day. VMware treats DirectPath I/O as an advanced feature with manual, per-VM configuration through vCenter. OVE treats it as a declarative infrastructure concern -- the admin defines policies (which NICs, how many VFs, which VLANs) and the SR-IOV Network Operator automates everything from VF creation to Kubernetes scheduling integration. Azure Local falls in between -- PowerShell-based configuration with some automation through Network ATC. For an organization managing SR-IOV across dozens of nodes and hundreds of VFs, OVE's operator-driven model is significantly more scalable and auditable than manual vCenter clicks or PowerShell scripts.
Live migration with SR-IOV: Azure Local has a meaningful advantage here. Hyper-V's automatic fallback from SR-IOV VF to synthetic (VMBus) NIC during live migration is transparent to the guest -- the VM experiences a brief performance dip but no connectivity loss. KubeVirt's approach in OVE (hot-unplug VF, migrate over virtio, hot-plug new VF) causes a brief network interruption on the SR-IOV interface and requires the VM to have a secondary virtio management interface. For workloads where SR-IOV is used for performance but availability during migration is also critical, this is a genuine Azure Local advantage. VMware does not support live migration with DirectPath I/O at all -- the VM must be powered off.
DPDK support depth: OVE has the most complete DPDK story. It supports both OVS-DPDK on the host (for improved overlay performance) and DPDK in guest VMs (via SR-IOV VF passthrough or vhost-user with OVS-DPDK). The Node Tuning Operator and PerformanceProfile CRD automate the complex system tuning (hugepages, CPU isolation, NUMA pinning) that DPDK requires. Neither VMware nor Azure Local offers OVS-DPDK on the host -- their virtual switches are kernel-based. Both support DPDK in guest VMs via SR-IOV passthrough, but without host-side OVS-DPDK, the "overlay + low latency" combination is not available. For telco NFV workloads or network appliance VMs, OVE's DPDK support is a differentiator.
Swisscom ESC limitations: Swisscom ESC, as a managed service, does not expose any of these advanced data path features to the customer. The customer cannot configure SR-IOV, cannot enable IOMMU passthrough, cannot run DPDK. If the organization has workloads that require these capabilities, ESC is not a viable platform for those specific VMs. The remaining workloads (the vast majority) that run fine on virtio/standard networking can run on ESC without issue.
Hardware offload as an alternative to DPDK: A middle ground between kernel OVS and full DPDK is hardware offload via eSwitch/switchdev mode (OVE) or AccelNet (Azure, cloud only). In switchdev mode on OVE, OVS flow entries are offloaded to the NIC's eSwitch hardware, achieving near-SR-IOV throughput while preserving OVN overlay features (GENEVE encap/decap, ACLs in hardware). This requires ConnectX-5 or newer NICs and is less mature than kernel OVS, but avoids the CPU cost of OVS-DPDK. For this organization, hardware offload may be the best compromise: overlay features preserved, no dedicated CPU cores consumed, but requires specific NIC hardware.
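As a rough sketch of what enabling this looks like on a Linux host (interface names and PCI addresses are placeholders; on OVE the SR-IOV Network Operator and OVS configuration apply the equivalent steps declaratively):

```bash
# Illustrative switchdev / hardware-offload enablement on a ConnectX-class
# NIC. Interface names and PCI addresses are placeholders.

# Create the VFs on the PF before switching eSwitch modes.
echo 4 > /sys/class/net/ens1f0/device/sriov_numvfs

# Flip the NIC eSwitch from legacy SR-IOV mode to switchdev mode, which
# exposes VF representor ports that OVS can attach to.
devlink dev eswitch set pci/0000:65:00.0 mode switchdev

# Enable TC flower hardware offload on the uplink and tell OVS to use it.
ethtool -K ens1f0 hw-tc-offload on
ovs-vsctl set Open_vSwitch . other_config:hw-offload=true
systemctl restart openvswitch
```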
Key Takeaways
- Start with virtio and kernel OVS for everything. The default networking path (virtio NIC + OVS kernel datapath + OVN/GENEVE overlay) handles 90-95% of workloads with acceptable performance. The 50-100 us round-trip latency is invisible to web applications, APIs, databases with asynchronous replication, batch jobs, and most business applications. Do not add SR-IOV or DPDK complexity unless a specific workload has a measured, documented performance requirement that virtio cannot meet.
- SR-IOV is the first step when virtio is not enough. For workloads that need lower latency or higher packet rates, SR-IOV provides near-hardware performance with minimal CPU overhead. The cost is loss of overlay features (no NetworkPolicy on the SR-IOV interface, no GENEVE, physical VLAN required) and live migration complexity. SR-IOV is simpler to operate than DPDK because it does not consume dedicated CPU cores for polling.
- IOMMU is non-negotiable for any device passthrough. Every OVE and Azure Local node that uses SR-IOV must have IOMMU enabled (Intel VT-d or AMD-Vi) with ACS on the PCIe bridges. Without IOMMU, VF assignment to VMs is unsafe -- a compromised VM could access any host memory via DMA. This is a BIOS setting that must be verified during hardware procurement and PoC lab setup. It is not optional.
- DPDK is a specialist tool for specialist workloads. DPDK in guest VMs is appropriate for NFV (virtual routers, firewalls, load balancers) and packet processing applications. DPDK on the host (OVS-DPDK) is appropriate when the organization needs both overlay features and lower-than-kernel-OVS latency. In both cases, DPDK consumes dedicated CPU cores and hugepage memory, adds operational complexity, and requires DPDK-aware monitoring and debugging tools. For this organization, DPDK is likely relevant for a small number of VMs (single digits to low tens), not for the general population.
- Hardware offload (switchdev/eSwitch) is a middle ground worth evaluating. For OVE with ConnectX-6 Dx or Intel E810 NICs, OVS hardware offload in switchdev mode can provide near-SR-IOV throughput while preserving OVN overlay features and NetworkPolicy enforcement. This avoids the "all or nothing" trade-off of SR-IOV (performance vs. features) and the CPU cost of OVS-DPDK. Evaluate this during the PoC with the actual NIC hardware.
- Azure Local has a genuine advantage in SR-IOV live migration. Hyper-V's transparent fallback from VF to synthetic NIC during live migration is more graceful than KubeVirt's hot-unplug/hot-plug approach. If the organization has SR-IOV workloads that also require high availability via live migration, this is a meaningful differentiator for Azure Local. VMware does not support live migration with DirectPath I/O at all.
- Swisscom ESC cannot serve advanced data path requirements. Any VM that needs SR-IOV, DPDK, or custom IOMMU configuration cannot run on ESC. This is a scope constraint, not a flaw -- ESC serves standard workloads well. The advanced data path VMs must run on OVE or Azure Local (or remain on VMware if neither alternative meets the requirement).
- Verify IOMMU groups during PoC hardware evaluation. Before committing to a server platform, check that each SR-IOV VF gets its own IOMMU group. If the server's PCIe topology lacks ACS on the root ports or bridges, multiple VFs may share an IOMMU group, preventing independent assignment to different VMs. This is a hardware limitation that cannot be fixed in software. Test with ls /sys/kernel/iommu_groups/*/devices/* on the actual PoC hardware.
- Budget CPU cores and hugepages explicitly for DPDK. If the organization plans to use OVS-DPDK or DPDK in guest VMs, the capacity planning must account for the dedicated resources. OVS-DPDK typically needs 2-4 cores per node (out of the PMD core pool). Each DPDK-enabled guest VM needs at least 1-2 additional cores and 1-4 GB of hugepage memory beyond its normal allocation. These resources are consumed at 100% whether the workload is active or idle. On a 64-core node, dedicating 4 cores to OVS-DPDK reduces the available VM capacity by ~6%.
Discussion Guide
The following questions target advanced data path capabilities during vendor workshops, SME deep-dives, and PoC validation sessions. They are designed to test whether the vendor or SME has actual production experience with SR-IOV, IOMMU, and DPDK -- not just slide-deck familiarity.
1. SR-IOV Network Operator Lifecycle
"Walk us through the day-2 lifecycle of SR-IOV in OVE. We have a SriovNetworkNodePolicy creating 8 VFs on ConnectX-6 Dx NICs across 20 nodes. Now we need to change from 8 VFs to 16 VFs per NIC. What happens to running VMs with allocated VFs when we update the policy? Is there a rolling update? How long is the disruption per node? What is the rollback procedure if the new VF count causes issues?"
Purpose: Tests operational maturity with the SR-IOV operator. The correct answer: updating numVfs in a SriovNetworkNodePolicy triggers a node drain and reboot (because changing VF count requires resetting the PF). The operator processes nodes one at a time (configurable via maxUnavailable). Running VMs with VFs on the node being updated will be evicted (live-migrated if possible, otherwise stopped). The disruption is ~5-10 minutes per node (drain + reboot + VF creation). Rollback: revert the policy to the previous numVfs value, which triggers another rolling reboot. This is operationally expensive -- VF count changes should be planned during maintenance windows.
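For orientation, a minimal SriovNetworkNodePolicy of the kind discussed here might look like the sketch below; the policy name, NIC selector, and node label are placeholders, and editing numVfs on such a policy is what triggers the per-node drain-and-reboot cycle described above.

```bash
# Minimal, illustrative SriovNetworkNodePolicy; names and selectors are
# placeholders. Changing numVfs on an existing policy triggers the
# drain/reboot rollout described above.
cat <<'EOF' | oc apply -f -
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: cx6dx-dataplane
  namespace: openshift-sriov-network-operator
spec:
  resourceName: cx6dx_vfs
  numVfs: 16
  nicSelector:
    pfNames: ["ens1f0"]
  nodeSelector:
    feature.node.kubernetes.io/network-sriov.capable: "true"
  deviceType: vfio-pci
  isRdma: false
EOF

# Watch the operator roll the change through the nodes one at a time.
oc get sriovnetworknodestates -n openshift-sriov-network-operator -w
```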
2. IOMMU Group Verification
"Show us how to verify that our PoC server hardware has correct IOMMU group isolation for SR-IOV. What specific checks should we perform? What do we do if we find multiple VFs sharing an IOMMU group? Is the
pcie_acs_overridekernel parameter acceptable for production?"
Purpose: Tests understanding of IOMMU isolation requirements. The correct answer: enumerate IOMMU groups with ls /sys/kernel/iommu_groups/*/devices/* and verify each VF has its own group. If VFs share groups, the cause is typically missing ACS on PCIe bridge ports -- check with lspci -vvv | grep -A 5 "Access Control Services". The pcie_acs_override=downstream parameter should never be used in production -- it disables IOMMU isolation checks and allows VFs assigned to different VMs to access each other's memory via peer-to-peer DMA. If the hardware lacks ACS, the correct solution is different hardware or accepting that only one VF per IOMMU group can be independently assigned.
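A read-only verification pass along these lines can be scripted for the PoC nodes; the sketch below assumes nothing beyond standard sysfs and lspci.

```bash
# Quick IOMMU-group sanity check for PoC hardware (read-only).

# 1. List every IOMMU group and the devices it contains. Each SR-IOV VF
#    should appear alone in its group.
for g in /sys/kernel/iommu_groups/*; do
  echo "Group ${g##*/}:"
  ls "$g/devices"
done

# 2. Flag groups that contain more than one device (candidates for missing ACS).
for g in /sys/kernel/iommu_groups/*; do
  count=$(ls "$g/devices" | wc -l)
  [ "$count" -gt 1 ] && echo "WARNING: group ${g##*/} has $count devices"
done

# 3. Confirm ACS is advertised on the PCIe root ports and bridges.
lspci -vvv | grep -A 5 "Access Control Services"
```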
3. SR-IOV vs Virtio Performance Baseline
"For our PoC, we need to establish a performance baseline comparing virtio (OVN overlay) vs SR-IOV (direct VLAN) for a database replication workload. What test methodology do you recommend? What metrics should we capture? What tools should we use? What latency improvement do you expect for our specific workload profile (1-10 KB messages, 10,000-50,000 messages/second)?"
Purpose: Tests ability to design and execute a meaningful performance test. The answer should include: tools (qperf, sockperf, netperf for network-level; application-level metrics from the database replication log); metrics (p50/p99/p999 latency, throughput, CPU utilization on both host and guest, interrupt rate, OVS flow cache hit rate); methodology (same-host and cross-host tests, loaded vs unloaded, varying message sizes); expected improvement (~3-5x latency reduction from virtio+OVN to SR-IOV for small messages). The vendor should emphasize that the application-level improvement may be much smaller than the raw network improvement if the application bottleneck is not networking.
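One possible shape for the network-level portion of that baseline is sketched below; IPs, ports, and message sizes are placeholders, and the same commands would be run once over the virtio (overlay) path and once over the SR-IOV (VLAN) path, comparing percentiles.

```bash
# Illustrative latency/throughput baseline between two test VMs. IPs, ports
# and message sizes are placeholders; repeat per data path and per message size.

# On the server VM:
sockperf server --tcp -p 11111

# On the client VM: ping-pong latency test, 1 KB messages, 30 seconds;
# sockperf reports p50/p99/p99.9 round-trip percentiles.
sockperf ping-pong -i 10.1.2.10 -p 11111 --tcp -m 1024 -t 30

# Throughput check with iperf3 for context (not the primary metric here).
iperf3 -c 10.1.2.10 -t 30 -P 4
```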
4. DPDK Feasibility Assessment
"We have 12 VMs running virtual network functions (firewalls, load balancers) that currently use VMware's DirectPath I/O with DPDK inside the guest. What is the migration path to OVE? Do we need OVS-DPDK on the host, or is SR-IOV sufficient? How do we handle the hugepage and CPU pinning requirements for these VMs in Kubernetes? What happens during a node failure -- can these VMs be rescheduled to another node with the correct DPDK/SR-IOV configuration?"
Purpose: Tests NFV migration planning. The correct answer: SR-IOV + DPDK in guest (without host OVS-DPDK) is the closest equivalent to VMware DirectPath I/O + DPDK. Host OVS-DPDK is only needed if the VMs also need overlay connectivity alongside DPDK. Hugepages for guest DPDK are requested via the VM spec (spec.domain.memory.hugepages.pageSize). CPU pinning is configured via the PerformanceProfile and VM spec (dedicatedCpuPlacement: true). For rescheduling: the VM can be rescheduled to any node that has free VFs (via the SR-IOV device plugin), available hugepages, and the correct PerformanceProfile. The DPDK application state inside the VM is not preserved across rescheduling -- the application must handle reconnection.
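A trimmed VirtualMachine fragment showing the fields this answer refers to might look like the sketch below; the VM name, sizes, and the referenced SR-IOV network are placeholders, and disks/volumes are omitted for brevity.

```bash
# Trimmed, illustrative KubeVirt VirtualMachine fragment (disks and volumes
# omitted); names, sizes and the SR-IOV network reference are placeholders.
cat <<'EOF' | oc apply -f -
apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
  name: nfv-fw-01
spec:
  running: true
  template:
    spec:
      domain:
        cpu:
          cores: 4
          dedicatedCpuPlacement: true   # pin vCPUs to dedicated host cores
        memory:
          hugepages:
            pageSize: 1Gi               # back guest RAM with 1 GB hugepages
        devices:
          interfaces:
            - name: mgmt
              masquerade: {}            # virtio management interface
            - name: dataplane
              sriov: {}                 # SR-IOV VF for the DPDK dataplane
        resources:
          requests:
            memory: 8Gi
      networks:
        - name: mgmt
          pod: {}
        - name: dataplane
          multus:
            networkName: sriov-dataplane-net
EOF
```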
5. OVS-DPDK vs Kernel OVS Decision Framework
"We are considering enabling OVS-DPDK on our OVE worker nodes. What is the decision framework? What workload characteristics justify the added complexity? How many CPU cores should we reserve for OVS-DPDK PMD threads on a 64-core node with 40 VMs? What is the impact on non-DPDK VMs sharing the same node? How do we monitor OVS-DPDK performance vs kernel OVS?"
Purpose: Tests architectural judgment. The answer should include: OVS-DPDK is justified when overlay latency (GENEVE encap/decap) is the measured bottleneck, not application logic; for 40 VMs at moderate throughput, 2-4 PMD cores is typical; non-DPDK VMs are unaffected except by the reduced CPU pool; monitoring via ovs-appctl dpif-netdev/pmd-stats-show (packets processed, cycles per packet, idle cycles) and ovs-appctl dpif-netdev/pmd-rxq-show (queue assignment). The vendor should caution that OVS-DPDK is a cluster-wide decision (all OVS on DPDK-enabled nodes switches to userspace datapath) and cannot be selectively enabled per VM.
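The monitoring side of that answer condenses into a short health check, sketched below; the thresholds implied in the comments are illustrative.

```bash
# Quick OVS-DPDK PMD health check using standard ovs-appctl queries.

# Per-PMD cycle breakdown: a high share of processing cycles relative to
# idle cycles, or rising cycles-per-packet, indicates PMD saturation.
ovs-appctl dpif-netdev/pmd-stats-show

# Rx-queue-to-PMD assignment: uneven distribution means one PMD core is
# carrying most of the load and queues may need rebalancing.
ovs-appctl dpif-netdev/pmd-rxq-show

# Reset counters before a measurement window.
ovs-appctl dpif-netdev/pmd-stats-clear
```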
6. Azure Local SR-IOV Fallback Behavior
"Azure Local claims transparent SR-IOV fallback during live migration. Walk us through exactly what happens at the packet level during a live migration of a VM with SR-IOV. How long is the fallback period? What throughput does the synthetic NIC provide during fallback? Is there any packet loss? How does this compare to KubeVirt's approach in OVE?"
Purpose: Tests Azure Local SR-IOV depth. The answer: during live migration, Hyper-V detaches the VF from the guest (the guest's VF driver sees a surprise removal), and traffic immediately falls back to the synthetic (VMBus/NetVSC) path through the Hyper-V vSwitch. The fallback happens in milliseconds (typically 50-200 ms of interruption). The synthetic NIC provides full throughput (10-25 Gbps depending on CPU) but with higher latency (kernel-based path). After migration completes, a new VF is attached on the destination host. Total SR-IOV interruption: typically 5-30 seconds. Packet loss during the fallback transition: zero to a few packets (the VMBus path was always connected, just not the primary path).
7. eSwitch Switchdev Mode and Hardware Offload
"OVE supports OVS hardware offload via eSwitch switchdev mode on ConnectX NICs. How does this compare to SR-IOV passthrough? What OVS flow types can be offloaded? What are the limitations? Can we use hardware offload and SR-IOV VF passthrough on the same NIC simultaneously? What NIC firmware version and OFED version are required?"
Purpose: Tests awareness of the hardware offload middle ground. The answer: switchdev mode turns the NIC's eSwitch into a programmable hardware flow table controlled by OVS via TC flower rules. Unlike SR-IOV passthrough (which bypasses OVS entirely), switchdev offload keeps traffic in the OVS pipeline but executes flow matching and actions in NIC hardware. Offloadable flows include L2/L3 forwarding, VLAN push/pop, GENEVE encap/decap, CT (connection tracking) for stateful flows. Limitations: complex flows with multiple recirculations may not offload; flow count is limited by NIC TCAM (typically 64K-256K flows). SR-IOV passthrough and switchdev offload can coexist on the same NIC (some VFs passed through, PF in switchdev mode for OVS). Requires ConnectX-5 or newer, firmware 16.31+, MLNX_OFED 5.4+ or in-tree mlx5 driver.
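During the PoC, the offload can be validated with checks along these lines (interface and representor names are placeholders):

```bash
# Verify that OVS flows are actually landing in the NIC eSwitch rather than
# the software datapath. Interface names are placeholders.

# Confirm the driver and firmware in use on the uplink.
ethtool -i ens1f0

# Dump only the flows OVS has offloaded to hardware...
ovs-appctl dpctl/dump-flows type=offloaded

# ...and compare against the flows still handled in software.
ovs-appctl dpctl/dump-flows type=ovs

# Per-representor offload hit counters are also visible via tc.
tc -s filter show dev ens1f0_0 ingress
```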
8. Hugepage Planning and NUMA Topology
"Our worker nodes have 2 NUMA nodes, 32 cores each, 256 GB RAM each. We plan to run 4 DPDK-enabled VMs (each needing 2 GB hugepages) and OVS-DPDK (needing 4 GB hugepages per NUMA node). How do we size the hugepage reservation? How do we ensure DPDK VMs are scheduled on the same NUMA node as their SR-IOV VFs? What happens if a DPDK VM is accidentally scheduled cross-NUMA?"
Purpose: Tests NUMA-aware capacity planning. The answer: total hugepage requirement = 4 GB (OVS-DPDK NUMA 0) + 4 GB (OVS-DPDK NUMA 1) + 4 VMs * 2 GB = 16 GB total. The PerformanceProfile reserves hugepages per NUMA node. NUMA affinity between VM and VF is enforced by Kubernetes topology manager (topologyPolicy: single-numa-node in PerformanceProfile) and the device plugin (which reports VF NUMA node). Cross-NUMA DPDK traffic incurs ~100-200 ns additional latency per memory access due to remote NUMA access (QPI/UPI interconnect). For latency-sensitive workloads, cross-NUMA scheduling should be prevented by the topology policy.
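A PerformanceProfile matching this sizing exercise might look roughly like the sketch below; CPU ranges, hugepage counts, and the node selector are placeholders to adapt to the actual topology (8 x 1 GB pages per NUMA node gives the 16 GB total derived above).

```bash
# Illustrative PerformanceProfile for the sizing discussed above; CPU ranges,
# hugepage counts and the node selector are placeholders.
cat <<'EOF' | oc apply -f -
apiVersion: performance.openshift.io/v2
kind: PerformanceProfile
metadata:
  name: dpdk-workers
spec:
  cpu:
    reserved: "0-3,32-35"      # housekeeping plus OVS-DPDK PMD cores
    isolated: "4-31,36-63"     # cores available for pinned DPDK VMs
  hugepages:
    defaultHugepagesSize: 1G
    pages:
      - size: 1G
        count: 8
        node: 0                # OVS-DPDK + guest buffers on NUMA 0
      - size: 1G
        count: 8
        node: 1                # same reservation on NUMA 1
  numa:
    topologyPolicy: single-numa-node
  nodeSelector:
    node-role.kubernetes.io/worker-dpdk: ""
EOF
```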
9. Security Review: IOMMU and VF Isolation
"Our security team needs to validate that SR-IOV with VFIO provides equivalent isolation to VMware's vSphere security model. What specific isolation guarantees does the IOMMU/VFIO stack provide? What attack vectors exist (DMA attacks, interrupt injection, PCIe peer-to-peer)? How are these mitigated? What audit evidence can we provide to regulators that the isolation is in effect?"
Purpose: Tests security depth. The answer should cover: DMA isolation via IOMMU page tables (per-domain mapping, fault on unauthorized access); interrupt isolation via interrupt remapping (IRT blocks cross-VM interrupt injection); PCIe peer-to-peer isolation via ACS (forces all DMA through IOMMU); device reset via FLR on VM termination (clears residual state). Audit evidence: BIOS configuration screenshots showing VT-d enabled; kernel boot log showing IOMMU active (dmesg | grep DMAR); IOMMU group enumeration showing per-VF isolation; VFIO driver binding confirmation. The isolation is hardware-enforced by the same Intel VT-d / AMD-Vi silicon that VMware's ESXi uses for DirectPath I/O -- the trust boundary is identical.
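Most of that audit evidence can be gathered with read-only commands, for example:

```bash
# Read-only collection of IOMMU/VFIO isolation evidence for the audit trail.

# IOMMU active in the kernel (Intel DMAR / AMD-Vi messages).
dmesg | grep -i -e DMAR -e IOMMU

# Interrupt remapping enabled.
dmesg | grep -i "remapping"

# Per-VF IOMMU group isolation (each VF should sit in its own group).
ls /sys/kernel/iommu_groups/*/devices/

# VF bound to vfio-pci rather than a host network driver (address is a placeholder).
lspci -nnk -s 0000:65:00.2 | grep "Kernel driver in use"
```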
10. Operational Runbook: Troubleshooting SR-IOV and DPDK Issues
"A VM with an SR-IOV VF reports no network connectivity after a node reboot. Walk us through the troubleshooting sequence. What are the common failure modes? Where do we look first? Similarly, if OVS-DPDK PMD threads show high 'cycles per packet' (indicating degraded performance), what are the likely causes and how do we diagnose them?"
Purpose: Tests real-world troubleshooting capability. SR-IOV troubleshooting sequence: (1) verify VFs exist on the PF (cat /sys/class/net/ens1f0/device/sriov_numvfs); (2) check VF binding (ls -la /sys/bus/pci/devices/0000:65:00.2/driver -- should show vfio-pci); (3) check IOMMU group permissions (ls -la /dev/vfio/73); (4) verify SR-IOV operator status (oc get sriovnetworknodestates); (5) check device plugin pods (oc get pods -n openshift-sriov-network-operator); (6) common failures: PF driver crashed after reboot, VF count reset to 0, IOMMU not enabled in BIOS after firmware update. OVS-DPDK performance diagnosis: (1) ovs-appctl dpif-netdev/pmd-stats-show to check per-PMD cycle distribution; (2) high cycles/packet causes: flow cache thrashing (too many unique flows), cross-NUMA memory access, CPU frequency throttling (check cpupower frequency-info), noisy neighbor on shared LLC (last-level cache).
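The SR-IOV portion of that sequence condenses into a short first-response check, sketched below with placeholder PCI addresses and interface names.

```bash
# First-response checks for "SR-IOV VF has no connectivity after node reboot".
# PCI addresses, interface names and namespaces are placeholders.

cat /sys/class/net/ens1f0/device/sriov_numvfs          # were the VFs recreated after reboot?
ls -la /sys/bus/pci/devices/0000:65:00.2/driver        # is the VF bound to vfio-pci?
dmesg | grep -i -e DMAR -e IOMMU | tail -n 5            # is IOMMU still enabled post-firmware-update?
oc get sriovnetworknodestates -n openshift-sriov-network-operator
oc get pods -n openshift-sriov-network-operator         # operator, config daemon and device plugin healthy?
```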