Virtualization & Overlays
Why This Matters
The previous chapters covered the physical network -- VLANs, bonding, LACP, ECMP, spine-leaf fabrics. That physical network provides bandwidth and redundancy, but it does not solve the fundamental problem of running 5,000+ VMs on a shared fabric: logical isolation at scale without physical reconfiguration. Every time a new application needs its own network segment, a VM migrates between hosts, or a security policy needs per-workload enforcement, someone must either reconfigure physical switch ports and VLAN trunks -- or use an overlay.
An overlay network is a virtual network built on top of ("overlaid on") the physical underlay. Overlay packets are encapsulated inside underlay packets, so the physical switches see only normal IP/UDP traffic between hosts. The overlay endpoints (the hypervisor or virtual switch on each host) handle encapsulation, decapsulation, and logical forwarding decisions. This decouples the logical network topology from the physical topology entirely.
This is not a new idea. VMware NSX has been doing this for years using GENEVE (and VXLAN before that). What changes in the OVE migration is the specific stack: instead of NSX Manager + N-VDS, the organization will operate OVN + OVS with GENEVE tunnels, managed by the OVN-Kubernetes CNI, with Multus providing multi-network attachment for VMs that need direct access to physical VLANs.
This chapter covers five tightly related technologies:
- CNI -- how Kubernetes connects workloads to networks (the plugin interface)
- Multus -- how a single VM gets multiple network interfaces (the meta-plugin)
- OVS / OVN -- the virtual switch and its distributed control plane (the data and control planes)
- VXLAN -- the original overlay encapsulation protocol (the predecessor)
- GENEVE -- the modern overlay encapsulation protocol (the current standard)
These five form a dependency chain: CNI calls the OVN-Kubernetes plugin, which configures OVS, which encapsulates packets in GENEVE tunnels. Multus extends CNI to support secondary networks using bridge, macvlan, or SR-IOV plugins alongside OVN. VXLAN is covered because Azure Local's Microsoft SDN uses it, the physical fabric may carry VXLAN-based EVPN, and understanding VXLAN's limitations explains why GENEVE exists.
The OVS/OVN section is the most critical in this chapter. OVS is the data plane on every OVE node. OVN is the distributed control plane that translates high-level network intent (logical switches, logical routers, ACLs) into per-node OVS flow tables. When a VM loses connectivity at 2:00 AM, the troubleshooting path goes through OVS flow tables and OVN port bindings. The team must understand these internals at the flow-rule level, not just at the conceptual level.
Concepts
1. CNI (Container Network Interface)
What It Is and Why It Exists
CNI is a specification from the Cloud Native Computing Foundation (CNCF) that defines how container runtimes (CRI-O, containerd) configure networking for Linux containers and, by extension, for KubeVirt virtual machines running inside pods. CNI is not a product or a daemon -- it is a contract between the container runtime and a set of executable binaries (plugins) that are invoked to set up and tear down network interfaces.
In VMware terms, CNI is the equivalent of the interface between vCenter and the vSphere Distributed Switch -- the mechanism by which the orchestration layer tells the network layer "connect this workload." The critical difference is that CNI is a standardized, pluggable interface with dozens of implementations, while VMware's mechanism is proprietary and locked to the N-VDS or vDS.
The CNI Specification
The CNI specification defines four operations, each implemented as a call to an executable binary:
| Operation | Purpose | When Called |
|---|---|---|
| ADD | Attach a container to a network, assign IP, configure routes | Pod/VM creation |
| DEL | Detach a container from a network, release IP | Pod/VM deletion |
| CHECK | Verify that a container's networking is still correct | Periodic health check |
| VERSION | Report supported CNI specification versions | Discovery |
CNI Execution Model
The container runtime does not import a library or call an API. It executes a binary on disk, passes configuration via stdin (JSON), and reads the result from stdout (JSON). Environment variables provide the container's network namespace path and the desired operation.
CNI Plugin Execution Flow:
Container Runtime (CRI-O) CNI Plugin Binary
(e.g., creating a Pod) (e.g., /opt/cni/bin/ovn-k8s-cni-overlay)
| |
| 1. Finds plugin binary on disk |
| (path from CNI config JSON) |
| |
| 2. Sets environment variables: |
| CNI_COMMAND=ADD |
| CNI_CONTAINERID=abc123 |
| CNI_NETNS=/var/run/netns/abc123 |
| CNI_IFNAME=eth0 |
| CNI_PATH=/opt/cni/bin |
| |
| 3. Pipes JSON config to stdin: |
| { |
| "cniVersion": "1.0.0", |
| "name": "ovn-kubernetes", |
| "type": "ovn-k8s-cni-overlay", |
| "ipam": { ... }, |
| "dns": { ... } |
| } |
| |
+-----exec()----->+ |
| |
| 4. Plugin actions: |
| - Creates veth pair
| - Moves one end into container netns
| - Assigns IP address (via IPAM)
| - Configures routes
| - Connects to OVS bridge (br-int)
| - Returns result JSON to stdout
| |
+<----stdout------+ |
| |
| 5. Result JSON: |
| { |
| "cniVersion": "1.0.0", |
| "interfaces": [ |
| { "name": "eth0", |
| "mac": "0a:58:0a:80:00:05", |
| "sandbox": "/var/run/..." } |
| ], |
| "ips": [ |
| { "address": "10.128.0.5/23", |
| "gateway": "10.128.0.1" } |
| ] |
| } |
| |
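The exec model can be exercised by hand, which is a useful way to see exactly what the runtime does. The following is a minimal sketch, assuming the reference containernetworking plugins are installed under /opt/cni/bin; the namespace, bridge, and subnet names are illustrative only:
# Create a throwaway network namespace to stand in for a container
$ ip netns add cni-demo
# Invoke the reference "bridge" plugin the same way CRI-O would:
# operation and paths via environment variables, config JSON on stdin
$ CNI_COMMAND=ADD \
  CNI_CONTAINERID=demo0001 \
  CNI_NETNS=/var/run/netns/cni-demo \
  CNI_IFNAME=eth0 \
  CNI_PATH=/opt/cni/bin \
  /opt/cni/bin/bridge <<'EOF'
{
  "cniVersion": "1.0.0",
  "name": "cni-demo-net",
  "type": "bridge",
  "bridge": "br-cni-demo",
  "isGateway": true,
  "ipam": {
    "type": "host-local",
    "subnet": "10.99.0.0/24"
  }
}
EOF
# The plugin prints the result JSON (interfaces, ips, routes) to stdout.
# Running it again with CNI_COMMAND=DEL and the same inputs tears the
# attachment down.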
CNI Chaining
Multiple CNI plugins can be invoked in sequence for a single network attachment. This is called "chaining" and is configured as an ordered list of plugins in the CNI configuration file. Each plugin in the chain receives the result of the previous plugin and can augment or modify it.
CNI Plugin Chain Example:
Network Config (/etc/cni/net.d/10-ovn-kubernetes.conf):
{
"cniVersion": "1.0.0",
"name": "ovn-kubernetes",
"plugins": [
{
"type": "ovn-k8s-cni-overlay", <-- 1st: creates interface, connects to OVS
"ipam": { "type": "ovn-k8s-cni-overlay" }
},
{
"type": "portmap", <-- 2nd: sets up port mapping (DNAT)
"capabilities": { "portMappings": true }
},
{
"type": "bandwidth", <-- 3rd: applies traffic shaping
"ingressRate": 1000000,
"egressRate": 1000000
}
]
}
Execution order for ADD:
ovn-k8s-cni-overlay (ADD) --> portmap (ADD) --> bandwidth (ADD)
Execution order for DEL (reverse):
bandwidth (DEL) --> portmap (DEL) --> ovn-k8s-cni-overlay (DEL)
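On an OVE node the effective chain can be read directly from the on-disk configuration. The path follows the convention shown above; the exact file name may differ between versions:
# The runtime uses the lexically first configuration file in this directory
$ ls /etc/cni/net.d/
10-ovn-kubernetes.conf
# Pretty-print the active configuration, including any chained plugins
$ python3 -m json.tool /etc/cni/net.d/10-ovn-kubernetes.conf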
Major CNI Plugins and Their Architectures
The CNI plugin determines the entire networking model of the cluster. The choice of CNI is one of the most consequential decisions in cluster design.
OVN-Kubernetes (OVE's default CNI):
- Uses OVN as the control plane and OVS as the data plane
- Provides overlay networking via GENEVE tunnels between nodes
- Implements Kubernetes NetworkPolicy via OVN ACLs
- Provides distributed routing, DHCP, DNS, and NAT within the overlay
- Native integration with KubeVirt for VM networking
- This is the CNI covered in depth in the OVS/OVN section below
Calico (alternative, eBPF dataplane option):
- Pure Layer-3 model -- every pod gets a routed /32 (no bridge, no overlay by default)
- Can use an eBPF data plane instead of iptables for high-performance policy enforcement
- BGP-based route distribution between nodes (using BIRD or a custom BGP daemon)
- Overlay mode (VXLAN or IPIP) available when BGP is not feasible
- Popular in on-premises Kubernetes deployments, not the OVE default
Cilium (eBPF-native):
- Entirely eBPF-based data plane -- no iptables, no OVS, no kernel modules beyond eBPF
- Provides NetworkPolicy, transparent encryption (WireGuard or IPsec), and L7 policy
- Uses VXLAN or GENEVE for overlay, or native routing for direct mode
- Advanced observability via Hubble (flow logs, service maps)
- Gaining traction in cloud-native environments but not the OVE default
IPAM Plugins
IP Address Management (IPAM) in CNI is handled by dedicated IPAM plugins that the main CNI plugin delegates to:
| IPAM Plugin | How It Works | Use Case |
|---|---|---|
| host-local | Allocates from a static CIDR range per node, stores state in a local file | Default for bridge/macvlan CNIs on secondary networks |
| dhcp | Obtains address via DHCP from an external DHCP server | VMs needing addresses from existing enterprise DHCP infrastructure |
| whereabouts | Cluster-wide IPAM using a Kubernetes CRD as the backend, prevents duplicate IPs across nodes | Multi-network environments where multiple nodes share a subnet |
| ovn-k8s-cni-overlay | OVN-internal IPAM, allocates from the cluster pod CIDR via the OVN Northbound DB | Primary cluster network in OVE |
CNI vs. VMware Port Groups: Conceptual Mapping
| VMware Concept | Kubernetes/OVE CNI Equivalent |
|---|---|
| vSphere Distributed Switch (vDS) | OVS br-int bridge (managed by OVN-Kubernetes CNI) |
| Port Group (e.g., "Production-VLAN-100") | NetworkAttachmentDefinition CRD (via Multus) |
| Port Group VLAN tag | VLAN ID in the NetworkAttachmentDefinition spec |
| VM vNIC connected to a port group | Pod annotation k8s.v1.cni.cncf.io/networks: "vlan100" |
| Trunk port (carrying multiple VLANs) | Multus with multiple NetworkAttachmentDefinition references |
| NSX logical switch (overlay segment) | OVN logical switch (automatic, managed by OVN-Kubernetes) |
2. Multus (Multi-Network CNI)
Why Multi-Network Matters for VMs in Kubernetes
In vanilla Kubernetes, every Pod gets exactly one network interface (eth0) connected to the cluster's primary CNI network. This is sufficient for most containerized applications. It is not sufficient for virtual machines migrated from VMware.
A typical VMware VM has multiple vNICs:
- Management network -- for SSH, monitoring agents, patch management
- Application/data network -- for the actual workload traffic, often on a specific VLAN
- Storage network -- for iSCSI, NFS, or Ceph traffic (if not on the overlay)
- Backup network -- for backup agent traffic to a dedicated backup VLAN
KubeVirt VMs running on OVE need the same multi-network capability. The primary cluster network (OVN-Kubernetes overlay) provides one interface. Multus provides the mechanism to attach additional interfaces to secondary networks -- typically physical VLANs accessible via bridge or macvlan CNI plugins.
Multus Architecture
Multus is a "meta-plugin" -- it is itself a CNI plugin, but instead of configuring networking directly, it delegates to other CNI plugins. When CRI-O calls the Multus binary for a Pod's ADD operation, Multus reads the Pod's annotations to determine which networks to attach, then calls the appropriate CNI plugin for each network.
Multus Meta-Plugin Architecture:
CRI-O (Container Runtime)
|
| exec: /opt/cni/bin/multus (CNI_COMMAND=ADD)
|
v
+--------------------------------------------------------------+
| Multus CNI Binary |
| |
| 1. Read Pod annotations from Kubernetes API: |
| k8s.v1.cni.cncf.io/networks: "vlan100, vlan200" |
| |
| 2. Look up NetworkAttachmentDefinition CRDs: |
| "vlan100" --> bridge CNI config (VLAN 100) |
| "vlan200" --> macvlan CNI config (VLAN 200) |
| |
| 3. Always call the cluster default CNI first: |
| exec: ovn-k8s-cni-overlay (ADD) |
| --> creates eth0 (cluster network, OVN overlay) |
| |
| 4. Call each secondary CNI in order: |
| exec: bridge (ADD, config from "vlan100" NAD) |
| --> creates net1 (bridge to physical VLAN 100) |
| |
| exec: macvlan (ADD, config from "vlan200" NAD) |
| --> creates net2 (macvlan on physical VLAN 200) |
| |
| 5. Return combined result (all interfaces) to CRI-O |
+--------------------------------------------------------------+
| | |
v v v
eth0 net1 net2
(OVN (bridge, (macvlan,
overlay) VLAN 100) VLAN 200)
| | |
VM sees three network interfaces inside the guest OS
NetworkAttachmentDefinition CRD
The NetworkAttachmentDefinition (NAD) is a Kubernetes Custom Resource that defines a secondary network. It contains the CNI configuration JSON for that network. Each NAD is namespace-scoped, providing multi-tenant isolation (a NAD in namespace "team-a" is not visible to namespace "team-b").
# Example: NetworkAttachmentDefinition for VLAN 100 via linux-bridge
apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
name: vlan100
namespace: production
spec:
config: |
{
"cniVersion": "1.0.0",
"type": "bridge",
"bridge": "br-vlan100",
"vlan": 100,
"ipam": {
"type": "whereabouts",
"range": "10.20.100.0/24",
"exclude": ["10.20.100.1/32", "10.20.100.254/32"]
}
}
# Example: NetworkAttachmentDefinition for SR-IOV high-performance network
apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
name: sriov-data
namespace: trading
spec:
config: |
{
"cniVersion": "1.0.0",
"type": "sriov",
"resourceName": "intel.com/mlx_sriov_rdma",
"vlan": 500,
"ipam": {
"type": "whereabouts",
"range": "10.50.0.0/24"
}
}
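Before moving on to VMs: a plain Pod consumes these NADs through the Multus annotation referenced in the mapping table earlier. A minimal sketch, assuming the vlan100 NAD above exists in the same namespace (the Pod name and image are illustrative):
$ kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: multi-net-demo
  namespace: production
  annotations:
    # Comma-separated list of NADs; Multus adds one extra interface
    # (net1, net2, ...) per entry, after the default eth0
    k8s.v1.cni.cncf.io/networks: vlan100
spec:
  containers:
  - name: demo
    image: registry.example.com/tools:latest
    command: ["sleep", "infinity"]
EOF

# Once running, the attached interfaces and their IPs are reported back
# in the network-status annotation
$ kubectl -n production get pod multi-net-demo \
    -o jsonpath='{.metadata.annotations.k8s\.v1\.cni\.cncf\.io/network-status}'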
How KubeVirt Uses Multus
KubeVirt extends the Multus model with VM-specific interface types. In a VirtualMachine spec, the interfaces and networks sections define how the VM's virtual NICs connect to cluster and secondary networks.
apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
name: database-vm
namespace: production
spec:
template:
spec:
domain:
devices:
interfaces:
- name: cluster-net
masquerade: {} # NAT to cluster network (OVN overlay)
- name: data-vlan
bridge: {} # Bridge to physical VLAN (passthrough)
- name: storage-vlan
bridge: {} # Bridge to storage VLAN
# ...
networks:
- name: cluster-net
pod: {} # Default cluster network (OVN-Kubernetes)
- name: data-vlan
multus:
networkName: production/vlan100 # NAD reference (namespace/name)
- name: storage-vlan
multus:
networkName: production/vlan200 # NAD reference
Inside the guest OS, this VM sees three NICs:
- eth0 (or enp1s0) -- connected to the OVN overlay cluster network
- eth1 (or enp2s0) -- bridged to physical VLAN 100
- eth2 (or enp3s0) -- bridged to physical VLAN 200
Secondary Network Plugins
Multus delegates to these CNI plugins for secondary network attachment:
| Plugin | Mechanism | Use Case | Performance | VLAN Support |
|---|---|---|---|---|
| bridge | Linux bridge on host, veth pair to Pod/VM | General-purpose VLAN attachment | Good (kernel-level switching) | Yes (802.1Q tag) |
| macvlan | Sub-interface on host NIC, VM gets its own MAC on the wire | Direct L2 access without a bridge | Better (no bridge overhead) | Yes (parent VLAN sub-interface) |
| ipvlan | Like macvlan but shares the host's MAC, differs at L3 | When switch has MAC limit or port security | Good | Yes |
| SR-IOV | Hardware VF (Virtual Function) passed directly to VM | Ultra-low-latency, line-rate performance | Best (hardware datapath, bypasses kernel) | Yes (hardware VLAN filter) |
| OVN overlay | OVN-managed secondary logical networks | Overlay-based secondary networks without physical VLAN dependency | Good (GENEVE tunnel) | N/A (logical) |
Practical Example: VM with OVN Cluster Network + VLAN-Backed Data Network
Complete Packet Path -- VM with Two Networks:
Guest OS (database-vm)
+------------------+--------------------+
| eth0: 10.128.2.5 | eth1: 10.20.100.50|
| (OVN overlay) | (VLAN 100 bridge) |
+--------+---------+---------+----------+
| |
virt-launcher Pod (netns)
| |
+--------v---------+ +------v-----------+
| veth pair: | | veth pair: |
| pod-side <-> host | | pod-side <-> host |
+--------+---------+ +------+-----------+
| |
v v
+--------+---------+ +------+-----------+
| OVS br-int | | Linux bridge |
| (OVN-managed) | | br-vlan100 |
| | | (Multus bridge |
| GENEVE tunnel | | CNI managed) |
| to other nodes | | |
+--------+---------+ +------+-----------+
| |
v v
+--------+---------+ +------+-----------+
| bond0 (LACP) | | bond0.100 |
| (underlay for | | (VLAN 100 |
| GENEVE tunnels) | | sub-interface |
| | | on same bond) |
+--------+---------+ +------+-----------+
| |
+-------+-----------+
|
v
Physical NIC(s)
to ToR switch
Traffic on eth0 (cluster network) is encapsulated in GENEVE by OVS and sent through the underlay. Traffic on eth1 (VLAN 100) goes out as standard 802.1Q-tagged frames on the physical wire -- no encapsulation, no overlay. The ToR switch sees tagged VLAN 100 frames from the bond's VLAN sub-interface and switches them like any other VLAN traffic.
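Both paths can be verified from the node with packet captures on the underlay interfaces. A minimal sketch; the interface names and addresses follow the diagram above and must be adjusted to the actual node:
# Overlay traffic: eth0's packets leave the node as GENEVE (UDP 6081)
# between node TEP addresses -- the inner VM addresses are not visible
# to the physical switches
$ tcpdump -ni bond0 'udp port 6081' -c 5

# Secondary network traffic: eth1's packets leave as ordinary 802.1Q
# frames on the VLAN sub-interface -- the VM's MAC and IP appear
# directly on the wire
$ tcpdump -ni bond0.100 -e -c 5 'host 10.20.100.50'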
3. OVS / OVN (Open vSwitch / Open Virtual Network)
This is the most critical section in this chapter. OVS is the data plane running on every OVE worker node. OVN is the distributed SDN control plane that programs OVS. Together, they replace VMware's N-VDS + NSX Manager + NSX Central Control Plane.
OVS Architecture
Open vSwitch (OVS) is a production-quality, multilayer virtual switch. It supports OpenFlow for programmable flow-based forwarding, OVSDB for management, and kernel or DPDK datapaths for packet processing.
OVS Component Architecture (per node):
+---------------------------------------------------------------+
| User Space |
| |
| +-------------------------+ +----------------------------+ |
| | ovs-vswitchd | | ovsdb-server | |
| | (main OVS daemon) | | (configuration database) | |
| | | | | |
| | - OpenFlow controller | | - Stores bridge, port, | |
| | connection handler | | interface config | |
| | - Flow table manager | | - Listens on | |
| | - MAC learning logic | | unix:/var/run/openvswitch| |
| | - GENEVE/VXLAN tunnel | | /db.sock | |
| | encap/decap | | - Listens on | |
| | - Upcall handler for | | ptcp:6640 (remote OVSDB)| |
| | datapath misses | | | |
| +------------+------------+ +----------------------------+ |
| | |
| | netlink |
| | |
+---------------+-----------------------------------------------+
| Kernel Space |
| |
| +----------------------------------------------------------+ |
| | openvswitch.ko (kernel datapath module) | |
| | | |
| | - Exact-match flow cache (megaflow cache) | |
| | - Fast-path packet forwarding (no user-space upcall) | |
| | - Handles >95% of packets in steady state | |
| | - On cache miss: upcall to ovs-vswitchd | |
| +----------------------------------------------------------+ |
| |
+---------------------------------------------------------------+
Alternative: DPDK Userspace Datapath
+---------------------------------------------------------------+
| ovs-vswitchd + DPDK libraries |
| - Polls NICs directly (no kernel involvement) |
| - Dedicated CPU cores for packet processing |
| - Higher throughput, lower latency for NFV workloads |
| - Requires hugepages, CPU pinning, DPDK-compatible NIC |
+---------------------------------------------------------------+
The three components have distinct roles:
- ovsdb-server: A lightweight database daemon that stores OVS configuration (bridges, ports, interfaces, tunnel endpoints). Configuration tools like ovs-vsctl communicate with ovsdb-server via the OVSDB management protocol (JSON-RPC over a Unix socket or TCP). OVN's ovn-controller also communicates with ovsdb-server to program flows.
- ovs-vswitchd: The main switching daemon. It reads configuration from ovsdb-server, manages OpenFlow tables, handles datapath cache misses (upcalls), and programs the kernel datapath or DPDK datapath with exact-match flow entries. It is the "brain" of OVS on each node.
- Kernel datapath (openvswitch.ko): The fast path. Once ovs-vswitchd has computed the forwarding decision for a flow, it installs an exact-match entry in the kernel datapath's megaflow cache. Subsequent packets of the same flow are forwarded entirely in the kernel without any user-space involvement. This is what makes OVS perform at near-wire-speed for established flows.
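This division of labor is visible directly on a node. A few read-only commands (the bridge name follows the OVE layout described below) show what ovsdb-server has stored and what ovs-vswitchd is running:
# Configuration as stored in ovsdb-server: bridges, ports, tunnel options
$ ovs-vsctl show

# The same data queried per table (OVSDB is a real database)
$ ovs-vsctl list bridge br-int

# ovs-vswitchd runtime state: datapaths and their ports
$ ovs-appctl dpif/show

# Version in use (useful when checking offload or datapath behavior)
$ ovs-vsctl get Open_vSwitch . ovs_version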
OVS Flow Processing Pipeline
Understanding how a packet traverses OVS is essential for troubleshooting. The flow processing has two paths: the fast path (kernel datapath cache hit) and the slow path (cache miss, upcall to ovs-vswitchd).
OVS Packet Processing Pipeline:
Packet arrives on a port (e.g., from a VM's veth, or from a GENEVE tunnel)
|
v
+----+-----------------------------------------------+
| Kernel Datapath (openvswitch.ko) |
| |
| Step 1: Extract flow key from packet headers |
| (src/dst MAC, src/dst IP, src/dst port, |
| in_port, VLAN tag, tunnel ID, etc.) |
| |
| Step 2: Look up flow key in megaflow cache |
| |
| +---> CACHE HIT (fast path, >95% of packets) |
| | Execute cached actions: |
| | - modify headers |
| | - push/pop VLAN tag |
| | - encapsulate in GENEVE |
| | - output to port X |
| | Update counters |
| | DONE (no user-space involvement) |
| | |
| +---> CACHE MISS (slow path, <5% of packets) |
| Send packet + flow key to ovs-vswitchd |
| via netlink upcall |
+----+------------------------------------------------+
|
v (upcall -- packet goes to user space)
+----+------------------------------------------------+
| ovs-vswitchd (user space) |
| |
| Step 3: Look up flow key in OpenFlow tables |
| |
| OpenFlow Table Pipeline: |
| +--------+ +--------+ +--------+ |
| |Table 0 |--->|Table 10|--->|Table 20|---> ... |
| |(ingress | |(ACL | |(L2 | |
| | classif)| | check) | | lookup)| |
| +--------+ +--------+ +--------+ |
| |
| Each table contains flow entries: |
| priority=100, match(in_port=5,ip,nw_dst=10.0.0.1)|
| actions=set_field:00:11:22:33:44:55->eth_dst, |
| output:10 |
| |
| Step 4: Compute the complete action set |
| |
| Step 5: Install exact-match entry in kernel cache |
| (so next packet of same flow takes fast path) |
| |
| Step 6: Execute actions on the current packet |
| (forward, encap, output, drop) |
+-----------------------------------------------------+
Key performance insight: The first packet of every new flow takes the slow path (upcall to user space, OpenFlow table lookup, kernel cache install). Subsequent packets take the fast path (kernel-only, microsecond-level forwarding). In a 5,000+ VM environment, the steady-state hit rate should be >95%. If the hit rate drops significantly (observable via ovs-appctl dpif/show), it indicates excessive flow churn -- possibly caused by port scanning, DDoS traffic, or misconfigured ACLs that prevent flow caching.
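A quick way to check cache behavior on a live node; the exact output format varies between OVS versions, so treat the field names as indicative:
# Per-datapath hit/miss counters for the kernel flow cache
$ ovs-appctl dpif/show

# Flows currently installed in the kernel datapath, with lookup statistics
$ ovs-appctl dpctl/show -s

# Upcall handler statistics (how busy the slow path is)
$ ovs-appctl upcall/show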
OpenFlow Tables and Flow Matching
OVS implements the OpenFlow protocol to receive flow programming instructions. In OVE, the OpenFlow controller is ovn-controller (the local OVN agent on each node), not a separate SDN controller.
Each OpenFlow flow entry has three parts:
- Match fields: Which packets this rule applies to (in_port, eth_src, eth_dst, ip_src, ip_dst, tcp_dst, tunnel_id, etc.)
- Priority: Higher priority entries are matched first (0-65535)
- Actions: What to do with matched packets (output, drop, mod_dl_dst, set_tunnel, resubmit to another table)
Example: Dumping OpenFlow flows on an OVE node
$ ovs-ofctl dump-flows br-int --no-stats | head -20
table=0, priority=100,in_port="patch-br-int-to"
actions=load:0x1->NXM_NX_REG13[],resubmit(,8)
table=8, priority=50,reg13=0x1,metadata=0x3,dl_dst=0a:58:0a:80:00:05
actions=load:0x2->NXM_NX_REG15[],resubmit(,32)
table=32, priority=100,reg15=0x2,metadata=0x3
actions=load:0x3->NXM_NX_REG11[],load:0x1->NXM_NX_REG12[],
resubmit(,33)
table=33, priority=100,reg11=0x3,reg12=0x1
actions=output:"vm-database-0"
Reading this:
Table 0: Packet arrives from patch port (tunnel decap done).
Set register 13 = 1, go to table 8.
Table 8: If reg13=1, metadata=3 (logical switch ID), and
destination MAC matches VM "database", set reg15=2
(output port ID), go to table 32.
Table 32: Load logical metadata, go to table 33.
Table 33: Output to the VM's OVS port "vm-database-0".
OVN-Kubernetes programs dozens of OpenFlow tables on br-int. The table numbering follows OVN's logical pipeline stages:
| Table Range | Pipeline Stage | Purpose |
|---|---|---|
| 0-7 | Ingress pre-processing | Classify incoming packets, handle ARP, DHCP |
| 8-15 | Ingress ACLs | Apply NetworkPolicy / firewall rules (ingress direction) |
| 16-23 | Ingress post-ACL | After ACL pass, prepare for forwarding |
| 24-31 | Logical switching | MAC learning, unknown destination handling |
| 32-39 | Logical routing | L3 forwarding between logical subnets |
| 40-47 | Egress pre-processing | Prepare for output |
| 48-55 | Egress ACLs | Apply NetworkPolicy / firewall rules (egress direction) |
| 56-63 | Egress post-ACL | Final output to port or tunnel |
OVN Architecture
OVN (Open Virtual Network) is the distributed SDN control plane built on top of OVS. It translates high-level network intent (logical switches, logical routers, ACLs, NAT rules) into per-node OVS OpenFlow rules. OVN replaces NSX Manager + NSX Central Control Plane in the OVE stack.
OVN Full Architecture:
+===================================================================+
| API / Orchestration Layer |
| |
| Kubernetes API Server |
| + OVN-Kubernetes Controller Manager (watches Pod, Service, |
| NetworkPolicy, Namespace objects and translates them into |
| OVN Northbound DB entries) |
+==============================+====================================+
|
Northbound API (OVSDB protocol)
|
+==============================v====================================+
| OVN Northbound Database (NB DB) |
| |
| Stores the LOGICAL network topology: |
| |
| +------------------+ +------------------+ +-----------------+ |
| | Logical Switches | | Logical Routers | | ACLs / NAT | |
| | (one per | | (one per cluster | | (firewall rules| |
| | namespace or | | + one per | | from Network | |
| | subnet) | | namespace for | | Policies) | |
| | | | distributed | | | |
| | Logical Switch | | gateway) | | ACL: | |
| | Ports: | | | | direction= | |
| | - VM ports | | Logical Router | | from-lport | |
| | - router ports | | Ports: | | match="ip4. | |
| | - localnet ports| | - switch-facing | | dst==10.0/16"| |
| +------------------+ | - gateway ports | | action=allow | |
| +------------------+ +-----------------+ |
+==============================+====================================+
|
ovn-northd
(translates logical topology
to physical flow rules)
|
+==============================v====================================+
| OVN Southbound Database (SB DB) |
| |
| Stores the PHYSICAL bindings and compiled flow rules: |
| |
| +------------------+ +------------------+ +-----------------+ |
| | Chassis Table | | Port Bindings | | Datapath Flows | |
| | (one entry per | | (maps logical | | (compiled | |
| | OVE node) | | port to | | logical | |
| | | | chassis + tunnel | | pipeline into | |
| | chassis-id: | | endpoint) | | match/action | |
| | worker-node-01 | | | | pairs) | |
| | encap: geneve | | port: "vm-db" | | | |
| | ip: 192.168.1.10 | | chassis: | | | |
| +------------------+ | worker-node-01 | +-----------------+ |
| | tunnel_key: 5 | |
| +------------------+ |
+==============================+====================================+
|
ovn-controller (runs on EVERY node)
reads SB DB, programs local OVS
|
+==============================v====================================+
| OVS on Node (Data Plane) |
| |
| br-int (integration bridge) |
| +-------------------------------------------------------------+ |
| | | |
| | Ports: | |
| | vm-web-01 (veth to Web VM) | |
| | vm-db-01 (veth to Database VM) | |
| | ovn-k8s-mp0 (management port) | |
| | patch-br-int-to-br-ex (patch to external bridge) | |
| | geneve_sys (GENEVE tunnel port to all other nodes) | |
| | | |
| | OpenFlow Tables: programmed by ovn-controller | |
| | Table 0: Classify and direct to logical pipeline | |
| | Table 8: Ingress ACLs | |
| | Table 32: L2/L3 forwarding | |
| | Table 48: Egress ACLs | |
| | Table 64: Output (local port or GENEVE tunnel) | |
| +-------------------------------------------------------------+ |
| |
| br-ex (external bridge, optional) |
| +-------------------------------------------------------------+ |
| | Connects to physical network for north-south traffic | |
| | Ports: bond0 (physical uplink), patch to br-int | |
| +-------------------------------------------------------------+ |
+===================================================================+
OVN Logical Constructs
OVN models the network as a set of logical objects that exist independently of the physical topology:
Logical Switches: Layer-2 broadcast domains. In OVE, OVN-Kubernetes creates one logical switch per node subnet. Pods and VMs on the same logical switch can communicate at Layer 2. Pods on different logical switches communicate via the logical router.
Logical Routers: Layer-3 routing between logical switches. OVN-Kubernetes creates a cluster-wide distributed logical router that connects all node subnets. Routing decisions happen locally on each node (distributed routing) -- packets between two local VMs on different subnets are routed within the same OVS instance without crossing the network.
Logical Switch Ports: Represent endpoints (VM interfaces, router interfaces, external gateways). Each port has a MAC address, one or more IP addresses, and is bound to a specific chassis (node) in the Southbound DB.
ACLs: Firewall rules attached to logical switches or routers. OVN-Kubernetes translates Kubernetes NetworkPolicy objects into OVN ACLs. ACLs are enforced as OpenFlow rules in the ingress and egress ACL tables on br-int.
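In OVE these objects are created by the OVN-Kubernetes controller, not by hand, but building a throwaway topology with ovn-nbctl is a good way to see the model. A sketch with illustrative names and addresses -- not something to run against a production Northbound DB:
# A logical switch with one port, a MAC/IP binding, and an ACL
$ ovn-nbctl ls-add demo-switch
$ ovn-nbctl lsp-add demo-switch demo-port
$ ovn-nbctl lsp-set-addresses demo-port "0a:58:0a:80:63:05 10.128.99.5"
$ ovn-nbctl acl-add demo-switch to-lport 1001 \
    'outport == "demo-port" && ip4 && tcp.dst == 5432' allow-related

# Inspect the result, then clean up
$ ovn-nbctl show demo-switch
$ ovn-nbctl ls-del demo-switch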
OVN Data Path: Logical-to-Physical Translation
This is the core of OVN's value -- the translation from logical network intent to physical OVS flows. Understanding this translation is essential for troubleshooting.
OVN Logical-to-Physical Translation:
Logical View (what the admin sees):
+-----------+ +-----------+
| Logical | +----------------+ | Logical |
| Switch |----| Logical |----| Switch |
| "node-1" | | Router | | "node-2" |
| | | "cluster-rtr" | | |
| VM-A | +----------------+ | VM-B |
| 10.128.0.5| | 10.128.2.8|
+-----------+ +-----------+
Physical View (what the wire sees):
+-------------------+ +-------------------+
| Worker Node 1 | GENEVE tunnel | Worker Node 2 |
| | (UDP 6081) | |
| OVS br-int |<===============>| OVS br-int |
| port: vm-a | Outer IP: | port: vm-b |
| OpenFlow tables | 192.168.1.10 | OpenFlow tables |
| programmed by | --> | programmed by |
| ovn-controller | 192.168.1.20 | ovn-controller |
+-------------------+ +-------------------+
Packet path: VM-A (10.128.0.5) --> VM-B (10.128.2.8):
1. VM-A sends packet:
src_ip=10.128.0.5, dst_ip=10.128.2.8
src_mac=0a:58:0a:80:00:05, dst_mac=0a:58:0a:80:00:01 (gateway MAC)
2. OVS br-int on Node 1:
a. Table 0: Classify -- packet from local VM port
b. Table 8: Ingress ACL check -- is this allowed by NetworkPolicy?
Match: allow (default allow or explicit policy)
c. Table 32: Routing -- dst 10.128.2.0/23 is on logical switch "node-2"
Rewrite dst_mac to VM-B's MAC: 0a:58:0a:80:02:08
Rewrite src_mac to router's MAC on node-2 subnet
Decrement TTL
d. Table 48: Egress ACL check on destination logical switch
e. Table 64: Output decision -- VM-B is on chassis "worker-node-2"
Encapsulate in GENEVE:
outer src_ip: 192.168.1.10 (Node 1 TEP)
outer dst_ip: 192.168.1.20 (Node 2 TEP)
outer UDP dst_port: 6081
GENEVE VNI: <logical datapath ID>
Output to geneve_sys tunnel port
3. Physical network carries the GENEVE-encapsulated packet:
Node 1 --> (underlay: bond0, spine-leaf, ECMP) --> Node 2
4. OVS br-int on Node 2:
a. geneve_sys port receives packet
b. Decapsulate GENEVE: extract VNI, inner headers
c. Table 0: Classify -- packet from tunnel port, VNI maps
to logical switch "node-2"
d. Table 48: Egress ACL check for VM-B
e. Table 64: Output to local port "vm-b"
5. VM-B receives the packet with:
src_ip=10.128.0.5, dst_ip=10.128.2.8
(inner headers unchanged except MAC rewrite and TTL decrement)
OVN Distributed Routing and Gateway Nodes
OVN implements two types of routing:
Distributed Routing (DR): Every node runs its own copy of the logical router. Packets between VMs on different subnets (but within the cluster) are routed locally -- they never leave the node if source and destination are co-located, and they traverse a single GENEVE hop if they are on different nodes. This eliminates the central routing bottleneck that plagues some SDN architectures.
Gateway Routing (GR): For north-south traffic (cluster to external, external to cluster), OVN designates specific nodes as gateway nodes. These nodes have br-ex connected to the physical network and handle SNAT, DNAT, and external routing. In OVE, master nodes or designated infra nodes typically serve as gateway nodes.
Distributed vs. Gateway Routing:
East-West (VM to VM, same cluster):
+--------+ GENEVE +--------+
| Node 1 |========================| Node 2 |
| VM-A | Distributed routing | VM-B |
| | (local on each node) | |
+--------+ +--------+
No central bottleneck. Router exists on every node.
North-South (VM to external):
+--------+ GENEVE +--------+
| Node 1 |========================| Gateway|
| VM-A | Traffic to external | Node |
| | goes via gateway | |
+--------+ | br-ex |
| | |
+---+----+
|
Physical
Network
(ToR switch,
spine-leaf)
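The gateway-related objects can be inspected with the same northbound tooling. The router names below follow common OVN-Kubernetes conventions (GR_<node> for gateway routers, ovn_cluster_router for the distributed router) and are illustrative:
# List logical routers: typically one distributed cluster router plus
# one gateway router per gateway node
$ ovn-nbctl lr-list

# SNAT/DNAT rules applied at a gateway router for north-south traffic
$ ovn-nbctl lr-nat-list GR_worker-node-01

# Routes on the distributed cluster router
$ ovn-nbctl lr-route-list ovn_cluster_router

# Which chassis currently hosts each gateway (chassis-redirect) port
$ ovn-sbctl find Port_Binding type=chassisredirect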
Debugging: Practical Workflows
Listing OVN logical topology:
# List all logical switches
$ ovn-nbctl ls-list
b1a2c3d4-... (join)
e5f6a7b8-... (node-subnet-10.128.0.0/23)
c9d0e1f2-... (node-subnet-10.128.2.0/23)
# List ports on a logical switch
$ ovn-nbctl lsp-list node-subnet-10.128.0.0/23
port-vm-web-01 (addresses: ["0a:58:0a:80:00:05 10.128.0.5"])
port-vm-db-01 (addresses: ["0a:58:0a:80:00:06 10.128.0.6"])
port-rtr-node1 (type: router, addresses: ["router"])
# List ACLs on a logical switch
$ ovn-nbctl acl-list node-subnet-10.128.0.0/23
from-lport 1001 (ip4.src == 10.128.0.0/14) allow-related
to-lport 1001 (ip4.dst == 10.128.0.5 && tcp.dst == 5432) allow-related
to-lport 0 (1) drop
Checking physical port bindings:
# Show which VMs are on which nodes
$ ovn-sbctl show
Chassis "worker-node-01"
hostname: worker-node-01.example.com
Encap geneve
ip: "192.168.1.10"
options: {csum="true"}
Port_Binding "vm-web-01"
Port_Binding "vm-db-01"
Chassis "worker-node-02"
hostname: worker-node-02.example.com
Encap geneve
ip: "192.168.1.20"
options: {csum="true"}
Port_Binding "vm-app-01"
Port_Binding "vm-cache-01"
Tracing a packet through the logical pipeline (ovn-trace):
ovn-trace is the single most valuable debugging tool for OVN. It simulates a packet traversing the logical pipeline and shows every match, every action, and every table transition -- without actually sending a packet.
$ ovn-trace --detailed node-subnet-10.128.0.0/23 \
'inport == "port-vm-web-01" && \
eth.src == 0a:58:0a:80:00:05 && \
eth.dst == 0a:58:0a:80:00:01 && \
ip4.src == 10.128.0.5 && \
ip4.dst == 10.128.2.8 && \
ip.ttl == 64 && \
tcp.dst == 5432'
# Output (abbreviated):
#
# ingress(dp="node-subnet-10.128.0.0/23", inport="port-vm-web-01")
# -----------------------------------------------------------------
# 0. ls_in_port_sec_l2: inport == "port-vm-web-01"
# eth.src == 0a:58:0a:80:00:05 --> MATCH, next;
#
# 3. ls_in_pre_acl: ip --> MATCH, next;
#
# 5. ls_in_acl: ip4.src == 10.128.0.0/14
# --> MATCH, action: allow-related; next;
#
# 13. ls_in_l2_lkup: eth.dst == 0a:58:0a:80:00:01 (router port)
# --> MATCH, action: output to "port-rtr-node1"
#
# ingress(dp="cluster-router", inport="rtr-port-node1")
# -----------------------------------------------------------------
# 0. lr_in_admission: eth.dst == router-mac --> MATCH, next;
#
# 7. lr_in_ip_routing: ip4.dst == 10.128.2.0/23
# --> MATCH, nexthop 10.128.2.1 via "rtr-port-node2"
# --> action: eth.src = router-mac-node2,
# eth.dst = resolve(10.128.2.8),
# ip.ttl--, output "rtr-port-node2"
#
# egress(dp="node-subnet-10.128.2.0/23", outport="port-vm-app-01")
# -----------------------------------------------------------------
# 1. ls_out_pre_acl: ip --> MATCH, next;
#
# 3. ls_out_acl: ip4.dst == 10.128.2.8 && tcp.dst == 5432
# --> MATCH, action: allow-related; next;
#
# 9. ls_out_port_sec_l2: outport == "port-vm-app-01"
# eth.dst == 0a:58:0a:80:02:08 --> MATCH, output;
#
# Output to "port-vm-app-01" on chassis "worker-node-02"
# via GENEVE tunnel to 192.168.1.20, VNI 5
Dumping OVS flows for a specific VM:
# Find the OVS port number for a VM
$ ovs-vsctl get Interface vm-db-01 ofport
5
# Dump flows matching this port
$ ovs-ofctl dump-flows br-int | grep "in_port=5"
# Watch real-time flow hits (useful for live debugging)
$ watch -n 1 'ovs-ofctl dump-flows br-int --no-stats | grep "in_port=5"'
# Check datapath flows (kernel cache entries) for performance analysis
$ ovs-appctl dpctl/dump-flows | head -20
# Check the hit rate of the megaflow cache
$ ovs-appctl coverage/show | grep -i "flow"
4. VXLAN (Virtual Extensible Local Area Network)
Why VLANs Aren't Enough
VLANs (IEEE 802.1Q) are limited in three fundamental ways that make them inadequate for large-scale virtualization:
1. 4,094 segment limit: The 12-bit VLAN ID field allows only 4,094 usable VLANs. In a multi-tenant data center or a large enterprise with hundreds of application tiers, this limit is easily reached.
2. Spanning Tree domain: VLANs are Layer-2 constructs that require Spanning Tree Protocol (STP) to prevent loops. STP limits the usable topology, blocks redundant links, and has slow convergence times (seconds to tens of seconds). A spine-leaf fabric running pure Layer 3 (BGP + ECMP) eliminates STP -- but VLANs cannot span a Layer-3 boundary without being encapsulated.
3. Layer-2 boundary: A VLAN is confined to a single Layer-2 domain. If a VM on VLAN 100 needs to move to a host in a different Layer-2 domain (different ToR switch pair without a Layer-2 stretch), the VLAN must be extended there -- requiring trunk reconfiguration, STP reconvergence, and increased broadcast domain size. This directly conflicts with the goal of workload mobility.
VXLAN solves all three problems by encapsulating Layer-2 frames inside UDP/IP packets, allowing them to traverse any Layer-3 network.
VXLAN Encapsulation Format
VXLAN (RFC 7348) wraps a complete inner Ethernet frame inside an outer IP/UDP packet. The encapsulation adds 50 bytes of overhead.
VXLAN Encapsulated Packet (byte-by-byte):
+================================================================+
| OUTER ETHERNET HEADER (14 bytes) |
| Dst MAC: 6 bytes (next-hop MAC, e.g., ToR switch) |
| Src MAC: 6 bytes (host NIC MAC) |
| EtherType: 2 bytes (0x0800 for IPv4) |
+================================================================+
| OUTER IP HEADER (20 bytes) |
| Version: 4 bits (4) |
| IHL: 4 bits (5, no options) |
| DSCP/ECN: 8 bits (copy from inner or set by policy) |
| Total Len: 16 bits (outer IP payload length) |
| ID: 16 bits |
| Flags: 3 bits (DF bit typically set) |
| Frag Off: 13 bits (0, no fragmentation) |
| TTL: 8 bits (64 typical) |
| Protocol: 8 bits (17 = UDP) |
| Checksum: 16 bits |
| Src IP: 32 bits (source VTEP IP) |
| Dst IP: 32 bits (destination VTEP IP) |
+================================================================+
| OUTER UDP HEADER (8 bytes) |
| Src Port: 16 bits (entropy -- hash of inner 5-tuple) |
| Dst Port: 16 bits (4789 -- IANA assigned VXLAN port) |
| Length: 16 bits (UDP payload length) |
| Checksum: 16 bits (0x0000 = disabled, or computed) |
+================================================================+
| VXLAN HEADER (8 bytes) |
| |
| 0 1 2 3 |
| 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
| +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| |R|R|R|R|I|R|R|R| Reserved (24 bits) |
| +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| | VXLAN Network Identifier (VNI) (24 bits) | Reserved |
| +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| |
| I flag = 1: VNI is valid |
| VNI: 24 bits = 16,777,216 possible segments |
| (vs. 4,094 VLANs) |
| R: Reserved bits, must be 0 |
+================================================================+
| INNER ETHERNET FRAME (original VM frame) |
| Dst MAC: 6 bytes (destination VM MAC) |
| Src MAC: 6 bytes (source VM MAC) |
| EtherType: 2 bytes (0x0800 for IPv4, 0x86DD for IPv6) |
| IP Header: 20-40 bytes |
| TCP/UDP: variable |
| Payload: variable |
| (no FCS -- stripped by source NIC, recalculated by dest NIC) |
+================================================================+
Total overhead: 14 (outer Eth) + 20 (outer IP) + 8 (outer UDP)
+ 8 (VXLAN header) = 50 bytes
If inner frame is 1500 bytes (standard MTU):
Outer frame = 1500 + 50 = 1550 bytes
--> Physical MTU must be >= 1550 (or use jumbo frames)
VTEP (VXLAN Tunnel Endpoint)
The VTEP is where encapsulation and decapsulation happen. In a software-defined environment, the VTEP is the virtual switch on each hypervisor host (OVS for OVE, VFP for Azure Local, N-VDS for VMware). In hardware-based VXLAN (e.g., data center switches running EVPN/VXLAN), the VTEP is the switch itself.
Each VTEP has an IP address on the underlay network (the "TEP IP" or "tunnel IP"). All VXLAN traffic between two hosts is sent as UDP packets between their VTEP IPs.
VTEP Topology:
Host A (VTEP: 192.168.1.10) Host B (VTEP: 192.168.1.20)
+----------------------------+ +----------------------------+
| VM-1 VM-2 | | VM-3 VM-4 |
| VNI 1000 VNI 2000 | | VNI 1000 VNI 2000 |
+----+-----------+-----------+ +----+-----------+-----------+
| | | |
+----v-----------v-----------+ +----v-----------v-----------+
| OVS (VTEP) | | OVS (VTEP) |
| Encap: wrap in VXLAN | | Decap: unwrap VXLAN |
| Outer src: 192.168.1.10 | UDP | Outer dst: 192.168.1.20 |
| Outer dst: 192.168.1.20 |<====>| Outer src: 192.168.1.10 |
+----------------------------+ 4789 +----------------------------+
| |
Physical underlay (spine-leaf fabric, ECMP, LACP bonds)
VM-1 and VM-3 are on VNI 1000 (same logical segment)
VM-2 and VM-4 are on VNI 2000 (different logical segment)
Both VNI 1000 and VNI 2000 share the same physical underlay
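The VTEP concept is easiest to see with a plain Linux VXLAN device, independent of any SDN controller. A minimal sketch using the addresses from the diagram above -- this is for illustration only, not how OVN builds its tunnels:
# On Host A: create a VTEP for VNI 1000, sourced from the underlay
# address 192.168.1.10, using the standard VXLAN UDP port
$ ip link add vxlan1000 type vxlan id 1000 dstport 4789 \
    local 192.168.1.10 dev bond0
$ ip link set vxlan1000 up

# Static flooding entry pointing BUM traffic at Host B's VTEP --
# ingress replication in its simplest form
$ bridge fdb append 00:00:00:00:00:00 dev vxlan1000 dst 192.168.1.20

# Attach the VTEP to a local bridge so workloads on that bridge are
# stretched onto VNI 1000
$ ip link add br-vni1000 type bridge
$ ip link set vxlan1000 master br-vni1000
$ ip link set br-vni1000 up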
BUM Traffic Handling
BUM (Broadcast, Unknown unicast, Multicast) traffic is the Achilles heel of overlay networks. In a physical VLAN, BUM traffic is flooded to all ports in the VLAN by the switch. In a VXLAN overlay, BUM traffic must be replicated to all VTEPs that have members on the same VNI.
Two approaches exist:
1. Multicast-based replication: The VNI is mapped to a multicast group (e.g., VNI 1000 -> 239.1.1.1). When a VTEP needs to flood a BUM frame, it sends it to the multicast group. The underlay network's multicast routing (PIM, IGMP) delivers it to all VTEPs subscribed to that group.
- Pro: Efficient for dense VNI deployments (many hosts per VNI)
- Con: Requires multicast routing on the physical underlay, which many network teams disable or do not support. Adds operational complexity.
2. Ingress replication (head-end replication): The source VTEP maintains a list of all remote VTEPs with members on the same VNI. For BUM traffic, it sends a unicast copy to each remote VTEP individually.
- Pro: No multicast required on the underlay. Simpler network design.
- Con: O(N) bandwidth consumption at the source VTEP for each BUM frame (one copy per remote VTEP). At scale (hundreds of VTEPs), this can be significant.
OVN (used by OVE) avoids most BUM traffic entirely by using control-plane-based MAC/IP learning. The OVN Southbound DB knows which MAC and IP addresses are on which chassis. OVN programs OVS with this information directly, so ARP requests are answered by the local OVS instance (ARP proxy) without flooding. This is a significant advantage over flood-and-learn VXLAN implementations.
MAC Learning: Flood-and-Learn vs. Control-Plane-Based
| Approach | How It Works | BUM Impact | Used By |
|---|---|---|---|
| Flood-and-learn | Source VTEP floods unknown-destination frames to all VTEPs. Remote VTEP learns the source MAC from the flooded frame. | High BUM traffic in large deployments | Basic VXLAN (RFC 7348), some HW VTEP implementations |
| EVPN (BGP EVPN) | Control plane distributes MAC/IP bindings via MP-BGP. VTEPs learn remote MACs before any data traffic flows. | Minimal BUM traffic (ARP suppression) | Data center switch VXLAN (Cisco, Arista, Juniper), Azure Local |
| OVN controller | OVN Southbound DB distributes port bindings (MAC, IP, chassis). ovn-controller programs OVS with known destinations. ARP proxy eliminates ARP flooding. | Near-zero BUM traffic | OVE (OVN-Kubernetes) |
VXLAN + EVPN
EVPN (Ethernet VPN, RFC 7432) uses MP-BGP as a control plane for VXLAN, replacing flood-and-learn with explicit MAC/IP route advertisements. This is the standard for VXLAN deployments on physical data center switches.
MP-BGP with EVPN route types:
| Route Type | Name | Purpose |
|---|---|---|
| Type 1 | Ethernet Auto-discovery | Multi-homing, fast convergence |
| Type 2 | MAC/IP Advertisement | Distributes MAC and (optionally) IP bindings |
| Type 3 | Inclusive Multicast Ethernet Tag | Advertises VTEP membership for BUM replication |
| Type 4 | Ethernet Segment | Designated Forwarder election for multi-homing |
| Type 5 | IP Prefix | Advertises IP prefix routes for inter-subnet routing |
EVPN/VXLAN is relevant to this evaluation because the physical underlay fabric may use EVPN/VXLAN for Layer-2 extension between leaf switches (e.g., stretching a VLAN across multiple racks). The platform's overlay (OVN GENEVE or Microsoft SDN VXLAN) runs on top of the underlay. This creates a potentially confusing "overlay on overlay" scenario if not carefully designed.
MTU Overhead
VXLAN adds exactly 50 bytes of overhead. This has direct implications:
| Physical Underlay MTU | Inner MTU (available to VM) | Jumbo Required? |
|---|---|---|
| 1500 | 1450 | Yes (VMs expect 1500) |
| 1550 | 1500 | Minimum for standard VM MTU |
| 9000 | 8950 | Recommended for production |
| 9216 | 9166 | Maximum common switch MTU |
Best practice: Set the physical underlay to MTU 9000 or 9216. This provides ample headroom for the 50-byte VXLAN overhead and allows the inner VM MTU to remain at the standard 1500 bytes (or even use jumbo frames inside the overlay).
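A larger MTU only helps if every hop actually forwards it. A quick end-to-end check between two node TEP addresses; the sizes below assume a 9000-byte interface MTU:
# Set the underlay MTU on the bond (VLAN sub-interfaces inherit it)
$ ip link set dev bond0 mtu 9000

# Verify the path: 8972 bytes of ICMP payload + 8 (ICMP) + 20 (IP) = 9000,
# with the Don't Fragment bit set so any undersized hop rejects the packet
$ ping -M do -s 8972 -c 3 192.168.1.20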
Performance Considerations
- UDP checksum: The outer UDP checksum can be computed or disabled (set to 0x0000). Disabling saves CPU cycles for software VTEPs. However, if the underlay has any risk of bit corruption (rare in modern data centers), enabling the checksum provides an additional integrity check. OVS leaves the outer UDP checksum disabled by default; it is enabled per tunnel with the csum=true option (visible in the ovn-sbctl output earlier).
- Checksum offload: Modern NICs can offload outer UDP checksum computation to hardware, eliminating the CPU cost. Verify NIC offload capabilities with ethtool -k <interface> | grep tx-udp_tnl.
- GRO/GSO (Generic Receive/Segmentation Offload): The kernel can aggregate multiple small VXLAN-encapsulated packets into larger buffers before handing them to OVS, reducing per-packet processing overhead. This is critical at 25+ Gbps line rates.
- RSS (Receive Side Scaling): The NIC distributes incoming packets across multiple CPU cores based on the outer UDP source port. Since the outer source port carries entropy (derived from the inner flow's 5-tuple), different inner flows are distributed to different CPU cores. This is essential for scaling overlay processing across all available cores.
5. GENEVE (Generic Network Virtualization Encapsulation)
Why GENEVE Was Created
VXLAN (2014, RFC 7348) solved the VLAN scale problem but introduced a new limitation: the VXLAN header is fixed and contains only the VNI. There is no space for additional metadata -- security tags, routing hints, telemetry markers, QoS policy identifiers, or service chain instructions. As SDN evolved, vendors needed to embed metadata in the tunnel header, and VXLAN could not accommodate this without breaking the protocol.
Several competing encapsulations were proposed (VXLAN-GPE, GRE, NVGRE, STT). GENEVE (RFC 8926, published 2020, designed much earlier) was created as a unified, extensible encapsulation that could replace all of them. Its key design principle: a variable-length header with a TLV (Type-Length-Value) option space that different systems can populate with arbitrary metadata without requiring protocol revisions.
GENEVE Header Format
GENEVE Encapsulated Packet:
+================================================================+
| OUTER ETHERNET HEADER (14 bytes) |
| (same as VXLAN) |
+================================================================+
| OUTER IP HEADER (20 bytes) |
| (same as VXLAN) |
+================================================================+
| OUTER UDP HEADER (8 bytes) |
| Src Port: 16 bits (entropy -- hash of inner 5-tuple) |
| Dst Port: 16 bits (6081 -- IANA assigned GENEVE port) |
| Length: 16 bits |
| Checksum: 16 bits |
+================================================================+
| GENEVE HEADER (8 bytes base + variable-length options) |
| |
| 0 1 2 3 |
| 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
| +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| | Ver | Opt Len |O|C| Rsvd | Protocol Type |
| | (2) | (6 bits) | | | (6b) | (16 bits) |
| +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| | Virtual Network Identifier (VNI) (24 bits) | Reserved |
| +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| | |
| | Variable-Length Options (TLV format) |
| | (0 to 252 bytes, in 4-byte increments) |
| | |
| +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| |
| Field details: |
| Ver (2 bits): Version, must be 0 |
| Opt Len (6 bits): Length of options in 4-byte units |
| (0 = no options, 63 = 252 bytes max) |
| O (1 bit): OAM packet flag (control, not data) |
| C (1 bit): Critical options present flag |
| (if set, receiver MUST understand all |
| options or drop the packet) |
| Protocol Type: EtherType of inner payload |
| 0x6558 = Transparent Ethernet Bridging |
| (standard for L2 overlay) |
| VNI (24 bits): Same as VXLAN -- 16M segments |
+================================================================+
| INNER ETHERNET FRAME |
| (same as VXLAN) |
+================================================================+
GENEVE TLV Option Format:
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Option Class (16 bits) | Type (8 bits) | R|R|R| Len |
| | | | (5b) |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Option Data (variable) |
| (0 to 124 bytes) |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
Option Class: identifies the organization/system that defined
the option (like an OUI). Examples:
0x0102: OVN (Open Virtual Network)
0x0001: VMware NSX
Custom classes for security, telemetry, etc.
Type: option type within the class (e.g., "logical port tag",
"security group ID", "flow tracing marker")
Len: length of option data in 4-byte units (0-31)
Option Data: the actual metadata payload
TLV Option Space
The TLV option space is what makes GENEVE fundamentally more capable than VXLAN. Different SDN systems use it for different purposes:
OVN TLV usage (Option Class 0x0102):
| Type | Purpose | Data |
|---|---|---|
| 0x80 | Logical ingress port | 32-bit port key (identifies the source logical switch port) |
| 0x81 | Logical egress port | 32-bit port key (identifies the destination logical switch port) |
| 0x82 | Logical datapath | 32-bit datapath ID (identifies the logical switch or router) |
These three TLV options allow the receiving OVS node to immediately identify which logical pipeline stage to enter without doing a full MAC/IP lookup. This is a performance optimization: the sending node has already done the logical forwarding decision and encoded the result in the GENEVE options. The receiving node just needs to look up the tunnel key and execute the output action.
VMware NSX TLV usage (Option Class 0x0001):
| Type | Purpose | Data |
|---|---|---|
| Various | Security tag | Identifies the NSX security group/tag for micro-segmentation |
| Various | Service insertion metadata | Identifies the service chain stage for north-south traffic |
| Various | Distributed firewall policy ID | References the DFW rule set to apply |
The critical difference from VXLAN: In VXLAN, the only metadata in the header is the VNI. Everything else (which logical port, which security group, which policy) must be inferred from the inner packet headers, requiring additional table lookups on the receiving end. GENEVE encodes this metadata explicitly, reducing lookup overhead and enabling richer SDN features.
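Both the tunnel configuration and the on-the-wire options can be inspected on an OVE node; recent tcpdump versions decode GENEVE including its TLV options. The tunnel interface name below is illustrative (OVN-Kubernetes derives it from the remote chassis):
# How OVS is told to build the tunnel: key, remote endpoint, and checksum
# policy are taken per-flow ("flow") from the OpenFlow actions
$ ovs-vsctl --columns=name,type,options list interface ovn-worker2-0

# Capture and decode GENEVE on the underlay; -vv prints the VNI and
# the TLV options carried in each packet
$ tcpdump -ni bond0 -vv 'udp port 6081' -c 2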
GENEVE vs. VXLAN: Feature Comparison
| Feature | VXLAN | GENEVE |
|---|---|---|
| RFC | 7348 (2014) | 8926 (2020) |
| UDP port | 4789 | 6081 |
| VNI bits | 24 (16M segments) | 24 (16M segments) |
| Header size | Fixed 8 bytes | 8 bytes base + 0-252 bytes options |
| Extensibility | None (reserved bits only) | Full TLV option space (252 bytes max) |
| Metadata support | VNI only | VNI + arbitrary TLV options |
| OAM flag | No | Yes (O bit) |
| Critical flag | No | Yes (C bit, forces drop if option unknown) |
| Protocol type field | No (always Ethernet) | Yes (supports non-Ethernet payloads) |
| Hardware offload | Mature (most modern NICs) | Growing (most NICs since ~2020) |
| Used by | Azure SDN (Microsoft), some legacy deployments | OVN (OVE), VMware NSX, most modern SDN stacks |
NSX Uses GENEVE, OVN Uses GENEVE -- Different TLV Usage
Both VMware NSX and OVN use GENEVE as their tunnel encapsulation, but they use different TLV option classes and types. This means that NSX GENEVE tunnels and OVN GENEVE tunnels are not interoperable at the option level -- a packet encapsulated by NSX cannot be correctly processed by OVN, and vice versa.
This matters for the migration: during a potential phased migration where some hosts run VMware/NSX and others run OVE/OVN, there is no "transparent" tunnel interop between the two platforms. VMs on NSX and VMs on OVN communicate via the physical network (routed at Layer 3), not via shared GENEVE tunnels.
NSX vs. OVN GENEVE Tunnel Incompatibility:
NSX Host OVN Host
+-------------------+ +-------------------+
| NSX N-VDS | | OVS br-int |
| GENEVE encap: | | GENEVE encap: |
| Class: 0x0001 | Cannot share | Class: 0x0102 |
| (NSX options) | tunnel -- TLV | (OVN options) |
| Port: 6081 | semantics | Port: 6081 |
+--------+----------+ differ +--------+----------+
| |
v v
Physical network (L3 routed communication only)
Hardware Offload Status for GENEVE
Hardware offload is critical for overlay performance at 25+ Gbps. Without offload, the host CPU must compute checksums, perform segmentation, and handle encap/decap -- consuming significant CPU resources.
| Offload Feature | VXLAN Status | GENEVE Status |
|---|---|---|
| TX checksum offload (outer UDP) | Widely supported | Supported on most modern NICs (Mellanox ConnectX-5+, Intel E810+, Broadcom 57500+) |
| TSO (TCP Segmentation Offload through tunnel) | Widely supported | Supported (kernel 4.7+, NIC firmware dependent) |
| GRO (Generic Receive Offload) | Widely supported | Supported (kernel 4.14+) |
| RSS (outer UDP src port entropy) | Widely supported | Supported |
| Full hardware VTEP (encap/decap in NIC) | Some NICs (Mellanox ConnectX-6 Dx) | Limited (Mellanox ConnectX-6 Dx with OVS offload, Intel E810 with TC offload) |
| Fixed-size options only | N/A | Some NICs only offload GENEVE with zero options or fixed option sizes |
Key concern for OVE: OVN uses three 32-bit TLV options (12 bytes total), making the GENEVE header 20 bytes (8 base + 12 options) instead of 8 bytes. Some older NICs can offload GENEVE with zero options but not with variable-length options. Verify that the target NIC firmware supports offload for GENEVE with OVN's specific option layout.
# Verify GENEVE offload capabilities on Linux:
$ ethtool -k ens1f0 | grep -i geneve
tx-udp_tnl-segmentation: on # TSO through GENEVE tunnels
tx-udp_tnl-csum-segmentation: on # TSO with outer checksum
rx-udp_tnl-port-offload: on # RSS entropy from UDP port
$ ethtool --show-offload ens1f0 | grep -i tunnel
tx-tunnel-remcsum-segmentation: on
Why the Industry Is Converging on GENEVE
- Extensibility without protocol revisions: When a new SDN feature needs metadata in the tunnel header (e.g., telemetry INT -- In-band Network Telemetry), GENEVE accommodates it with a new TLV option. VXLAN would require a new RFC or proprietary extensions (VXLAN-GPE, VXLAN-GBP).
- Multi-vendor ecosystem: OVN, VMware NSX, and newer SDN stacks all use GENEVE. The IETF standardized it as the convergence point for network virtualization encapsulation.
- OAM and critical flag support: GENEVE's O bit allows control-plane OAM packets to be distinguished from data packets in the same tunnel. The C bit ensures that if a tunnel endpoint does not understand a critical option, the packet is dropped rather than silently misprocessed.
- Hardware offload is catching up: While VXLAN had a multi-year head start in NIC offload support, GENEVE offload is now available on all major NIC platforms. The remaining gap is in full hardware VTEP offload (encap/decap entirely in the NIC), which is available on high-end NICs like Mellanox ConnectX-6 Dx with SmartNIC firmware.
- VXLAN is not disappearing: VXLAN remains the standard for hardware-based overlays in the physical fabric (EVPN/VXLAN on data center switches). The convergence on GENEVE applies primarily to software-based SDN overlays on hypervisor hosts. In practice, a modern data center may run EVPN/VXLAN on the physical switches and GENEVE on the virtual switches -- two layers of overlay serving different purposes.
How the Candidates Handle This
Comparison Table
| Aspect | VMware (NSX/GENEVE) | OVE (OVN/GENEVE) | Azure Local (VFP/VXLAN) | Swisscom ESC |
|---|---|---|---|---|
| CNI / network plugin | N/A (ESXi, not Kubernetes) | OVN-Kubernetes CNI (CNCF ecosystem) | N/A (Hyper-V, not Kubernetes) | N/A (VMware/ESXi) |
| Multi-network support | vDS port groups (multiple vNICs per VM to different port groups) | Multus meta-CNI + bridge/macvlan/SR-IOV secondary CNI plugins | Hyper-V vSwitch with multiple vNICs, each assigned to a VM network | vDS port groups (same as VMware baseline) |
| Virtual switch | N-VDS (NSX Virtual Distributed Switch) or VDS 7.0+ with NSX | OVS (Open vSwitch) with kernel datapath | VFP (Virtual Filtering Platform) extension on Hyper-V vSwitch | N-VDS or VDS (same as VMware baseline) |
| SDN control plane | NSX Manager (proprietary, 3-node cluster, Corfu DB) | OVN (open source, NB DB + ovn-northd + SB DB + ovn-controller per node) | Network Controller (proprietary, 3-node cluster, REST API) | NSX Manager (provider-operated) |
| Overlay protocol | GENEVE (NSX-T 3.x+), VXLAN (legacy NSX-T 2.x) | GENEVE (OVN default, UDP 6081) | VXLAN (Microsoft SDN, UDP 4789) | GENEVE or VXLAN depending on NSX version |
| Overlay metadata | NSX-specific TLV options (security tags, DFW policy IDs) | OVN TLV options (logical port, datapath, egress port) | VNI only (VXLAN has no TLV space) | Same as VMware (NSX TLV options) |
| Distributed routing | Yes (DR component on every ESXi host) | Yes (OVN logical router distributed to every node) | Yes (VFP implements distributed routing per host) | Yes (same as VMware) |
| ARP suppression | Yes (NSX controller distributes MAC/IP, local ARP proxy) | Yes (OVN SB DB distributes port bindings, OVS ARP proxy) | Yes (Network Controller distributes CA-PA mappings) | Yes (same as VMware) |
| BUM handling | Control-plane-based (no flood-and-learn, NSX distributes MAC tables) | Control-plane-based (OVN SB DB, near-zero BUM) | Control-plane-based (Network Controller distributes mappings) | Same as VMware |
| Overlay MTU overhead | 50-74 bytes (GENEVE base + NSX options) | 62 bytes typical (GENEVE base 8 + OVN options 12 + outer headers 42) | 50 bytes (VXLAN) | Same as VMware |
| Hardware offload | NSX supports NIC offload for GENEVE (validated NIC list) | OVS supports NIC offload for GENEVE (TC flower offload on ConnectX-6 Dx) | VFP supports NIC offload for VXLAN (AccelNet / FPGA SmartNIC) | Same as VMware |
| Debugging tools | NSX Traceflow (UI-based packet trace), NSX CLI, packet capture | ovs-ofctl, ovn-trace, ovn-nbctl, ovn-sbctl, tcpdump | Test-NetConnection, VFP port diagnostics, Network Controller logs | Ticket to Swisscom (no direct access) |
| IPAM | NSX IPAM or external (Infoblox integration) | OVN-internal IPAM for cluster network; whereabouts for secondary networks; external IPAM via DHCP | Network Controller IPAM or SCVMM IPAM | Provider-managed IPAM |
| Network configuration model | vCenter API / NSX Policy API (imperative or declarative) | Kubernetes CRDs: NetworkAttachmentDefinition, NetworkPolicy, NMState (fully declarative, GitOps) | ARM Templates, Bicep, PowerShell (mixed imperative/declarative) | ESC portal or API (limited IaC maturity) |
Key Differences in Prose
Overlay protocol choice and implications: The most visible protocol-level difference is that OVE uses GENEVE while Azure Local uses VXLAN. In practice, both achieve the same core goal (encapsulated overlay with 24-bit segment ID). The difference matters in two areas: (1) GENEVE's TLV options allow OVN to encode logical pipeline metadata in the tunnel header, reducing lookup overhead on the receiving node -- VFP/VXLAN must derive this information from inner packet headers; (2) GENEVE hardware offload is slightly less mature than VXLAN offload, meaning OVE nodes may consume marginally more CPU for overlay processing on older NIC hardware. On modern NICs (ConnectX-5+, E810+), this difference is negligible.
CNI and Multus vs. vDS port groups: The conceptual gap between VMware's port group model and OVE's CNI/Multus model is the largest operational learning curve in the networking domain. In VMware, a VM's network connections are configured by dragging a vNIC to a port group in vCenter. In OVE, a VM's network connections are defined in YAML: a VirtualMachine spec references a NetworkAttachmentDefinition CRD, which contains a CNI plugin configuration. The result is equivalent (a VM with an interface on a specific network), but the mechanism is fundamentally different. The team must build comfort with this model before migration.
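To make the contrast concrete, the following is a minimal sketch of that YAML model. All names (vlan100-net, br1, vm-workloads) are placeholders, and the exact fields depend on the chosen CNI plugin and IPAM strategy:
$ kubectl apply -f - <<'EOF'
apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
  name: vlan100-net          # placeholder name
  namespace: vm-workloads    # placeholder namespace
spec:
  config: |
    {
      "cniVersion": "0.3.1",
      "type": "bridge",
      "bridge": "br1",
      "vlan": 100,
      "ipam": { "type": "whereabouts", "range": "10.0.100.0/24" }
    }
EOF
# The VirtualMachine spec then references it (abbreviated):
#   networks:
#     - name: vlan100
#       multus:
#         networkName: vm-workloads/vlan100-net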
OVN vs. NSX control plane: Both are distributed SDN control planes that compile high-level network intent into per-host forwarding rules. The architectural parallel is strong: NSX Manager corresponds to the OVN Northbound DB; the NSX Central Control Plane corresponds to ovn-northd plus the Southbound DB; the NSX Local Control Plane (nsx-proxy) corresponds to ovn-controller. The key operational difference is transparency: OVN's databases are queryable with ovn-nbctl and ovn-sbctl, and the flow pipeline is traceable with ovn-trace. NSX provides similar functionality via the Traceflow UI and CLI, but the internal flow tables are less directly accessible. For an organization that values deep troubleshooting capability, OVN's open tooling is a significant advantage.
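A sketch of what that transparency looks like in practice -- read-only queries against the OVN databases (object and port names are illustrative):
$ ovn-nbctl show                                       # logical switches, routers, ports (intent)
$ ovn-sbctl show                                       # chassis and the ports bound to each (realized state)
$ ovn-sbctl find Port_Binding logical_port=vm-a-port   # where a specific VM port is bound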
Azure Local's VFP vs. OVS: VFP (Virtual Filtering Platform) is Microsoft's proprietary virtual switch extension in Hyper-V. It performs the same role as OVS (flow-based packet processing, encap/decap, ACL enforcement) but is not programmable via OpenFlow. VFP rules are pushed by the Network Controller via a proprietary protocol. This means the Azure Local team cannot inspect forwarding rules with a standard tool like ovs-ofctl dump-flows; they must use Microsoft-specific diagnostic commands. VFP supports hardware offload via AccelNet (Azure's FPGA-based SmartNIC), which provides wire-speed overlay processing -- but AccelNet is not available in on-premises Azure Local; only standard NIC offload is available.
Swisscom ESC operational model: ESC uses the same NSX/GENEVE stack as VMware. The critical difference is operational: the customer has no access to NSX Manager, cannot run ovn-trace equivalents, and cannot inspect flow tables. All overlay networking is a black box managed by Swisscom. For routine operations this is acceptable (fewer things for the customer to break). For incident response and troubleshooting, it introduces a dependency on Swisscom's response time and expertise -- the customer cannot independently diagnose overlay issues.
Key Takeaways
- CNI is the plug, not the plumbing. CNI defines the interface between the container runtime and the network plugin. The choice of CNI plugin (OVN-Kubernetes, Calico, Cilium) determines the entire networking architecture. OVE uses OVN-Kubernetes, which provides an overlay network with distributed routing, ACLs, and native KubeVirt integration. The team must understand that changing the CNI is a cluster-reinstall decision, not a configuration change.
- Multus is essential for VM migration from VMware. VMware VMs with multiple vNICs on different port groups/VLANs require Multus in OVE to replicate the same multi-network connectivity. Each secondary network requires a NetworkAttachmentDefinition CRD with the correct CNI plugin configuration (bridge, macvlan, or SR-IOV) and IPAM settings. This is the most direct operational mapping from VMware port groups to OVE networking.
- OVS/OVN is the new NSX. The team that currently operates NSX must learn OVS/OVN. The concepts map closely (logical switches, logical routers, distributed routing, ACLs, tunnel encapsulation), but the tooling is different (ovn-nbctl instead of the NSX Manager UI, ovs-ofctl instead of NSX Traceflow, ovn-trace instead of NSX packet capture). Invest in OVN training before the migration PoC begins.
- OVN eliminates BUM flooding through control-plane MAC learning. Unlike basic VXLAN (which floods ARP and unknown unicast), OVN distributes MAC/IP bindings via the Southbound DB and programs OVS with ARP proxy rules. This means that a 5,000+ VM environment on OVN generates near-zero BUM traffic in the overlay -- a significant scalability advantage over flood-and-learn VXLAN implementations.
- GENEVE is the industry convergence point for overlay encapsulation. Both OVE (OVN) and VMware (NSX) use GENEVE. Azure Local uses VXLAN. The practical difference for the organization is: if the physical fabric uses EVPN/VXLAN on the switches, OVE adds a second overlay layer (GENEVE) on top, while Azure Local uses the same encapsulation as the fabric. Neither scenario is inherently better -- both work, but the MTU math differs (GENEVE overhead is 62 bytes with OVN options vs. 50 bytes for VXLAN).
- Verify NIC hardware offload for GENEVE with OVN options. Not all NICs that offload GENEVE with zero options also offload GENEVE with OVN's 12-byte TLV options. During PoC hardware selection, test actual offload behavior with ethtool and measure CPU utilization under load with overlay traffic. The difference between hardware-offloaded and software-only GENEVE processing can be 2-5x in CPU consumption at 25+ Gbps.
- The overlay does not eliminate the need for physical VLAN understanding. Many migrated VMs will need direct access to physical VLANs (via Multus bridge or macvlan), not just the OVN overlay. Storage traffic, backup traffic, and legacy applications with hardcoded VLAN dependencies require physical VLAN trunking on the host bonds and corresponding NetworkAttachmentDefinition CRDs. The physical network team and the platform team must coordinate VLAN trunk configurations.
- ovn-trace is the most important debugging tool the team does not yet know. It traces a hypothetical packet through the entire OVN logical pipeline -- every table, every match, every action -- without sending a real packet. It answers "why is this VM's traffic being dropped?" in seconds rather than hours. Every platform engineer on the OVE team must be proficient with ovn-trace, ovn-nbctl, ovn-sbctl, and ovs-ofctl dump-flows.
Discussion Guide
The following questions target overlay networking specifics during vendor workshops, SME deep-dives, and PoC validation sessions. They are designed to reveal whether the vendor or SME understands the overlay stack at the packet level -- not just at the marketing level.
1. OVN Logical Pipeline and Packet Tracing
"Walk us through the OVN logical pipeline for a packet from VM-A on Node-1 (subnet 10.128.0.0/23) to VM-B on Node-3 (subnet 10.128.4.0/23). Which OVN logical objects does the packet traverse? Where does the routing decision happen -- on Node-1, Node-3, or a dedicated router node? Show us how to trace this path using ovn-trace. What OpenFlow table sequence will we see on Node-1's br-int?"
Purpose: Tests understanding of distributed routing in OVN. The correct answer describes the packet entering the local logical switch, being forwarded to the local copy of the distributed logical router (routing happens on the source node), then being encapsulated in a GENEVE tunnel to Node-3's chassis. ovn-trace output should show the logical switch ingress pipeline, the logical router pipeline, and the logical switch egress pipeline on the destination. If the vendor cannot demonstrate ovn-trace, they have not operated OVN in production.
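A hedged sketch of the kind of demonstration to ask for. The logical switch name, port name, MAC addresses, and OpenFlow port/table numbers below are illustrative and vary by OVN version and deployment:
# Trace the hypothetical packet through the OVN logical pipeline (no packet is sent):
$ ovn-trace --summary ls-node1 'inport=="vm-a-port" && eth.src==52:54:00:aa:bb:01 && eth.dst==0a:58:0a:80:00:01 && ip4.src==10.128.0.10 && ip4.dst==10.128.4.10 && ip.ttl==64'
# Then confirm the realized OpenFlow pipeline on Node-1's br-int:
$ ovs-appctl ofproto/trace br-int 'in_port=12,dl_src=52:54:00:aa:bb:01,dl_dst=0a:58:0a:80:00:01,ip,nw_src=10.128.0.10,nw_dst=10.128.4.10'
$ ovs-ofctl dump-flows br-int | grep 10.128.4.10      # flows matching the destination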
2. GENEVE vs. VXLAN for This Deployment
"OVE uses GENEVE with OVN TLV options. Azure Local uses VXLAN. Our physical fabric uses EVPN/VXLAN on the switches. If we choose OVE, we have GENEVE overlays running on top of a VXLAN underlay. What is the total encapsulation overhead? How does this affect our MTU planning? Is there any interaction between the two overlay layers (e.g., inner GENEVE VNI vs. outer VXLAN VNI)? Can the physical switches' ECMP hash function see the GENEVE outer UDP source port for entropy?"
Purpose: Tests awareness of overlay-on-overlay scenarios and MTU math. The correct answer: the two overlays are independent (the fabric's EVPN/VXLAN treats GENEVE traffic as normal IP/UDP payload). MTU math: physical MTU (e.g., 9216) minus fabric VXLAN overhead (50 bytes) = 9166 bytes available for the host underlay; then minus GENEVE overhead (62 bytes for OVN) = 9104 bytes effective inner MTU. The ECMP hash on the physical switches operates on the GENEVE outer IP/UDP headers, not the VXLAN tunnel headers -- so entropy works correctly.
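A short worked sketch of that MTU budget, using the example values above (confirm the actual fabric MTU with the network team before committing to an overlay MTU):
#   9216  physical fabric MTU (jumbo frames)
#  -  50  fabric EVPN/VXLAN overhead                 -> 9166 underlay MTU presented to the hosts
#  -  62  GENEVE base + OVN options + outer headers  -> 9104 usable MTU inside the OVN overlay
$ ip link show ens1f0 | grep -o 'mtu [0-9]*'   # confirm the MTU actually configured on the host uplink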
3. Multus and Secondary Network Design
"We have 200 VMs that currently connect to VLAN 100, VLAN 200, and VLAN 300 via VMware port groups. After migration to OVE, these VMs need the same VLAN connectivity. Walk us through the complete configuration: the NMState policy for VLAN trunking on the host bonds, the NetworkAttachmentDefinition CRDs for each VLAN, the VirtualMachine spec with Multus annotations, and the IPAM strategy (static, DHCP, or whereabouts). How do we ensure that VLAN 100 on Node-1 is the same Layer-2 domain as VLAN 100 on Node-50?"
Purpose: Tests the complete secondary network stack. The answer must include: (1) NMState NNCP to create VLAN sub-interfaces or bridge interfaces on the bond; (2) one NAD per VLAN with bridge or macvlan CNI config; (3) VirtualMachine spec with multus network references; (4) IPAM choice (whereabouts for cross-node subnet sharing, or DHCP for enterprise IPAM integration). The Layer-2 domain question tests awareness that bridge CNI creates a local bridge per node -- VLAN 100 frames from Node-1 reach the physical wire and are switched by the ToR to Node-50, making it the same L2 domain. This is not a GENEVE overlay -- it is real VLAN traffic on the physical network.
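A sketch of the first step -- an NMState policy attaching the existing bond to a VLAN-aware bridge that the per-VLAN NADs (as in the earlier example) then reference. The bridge name, bond name, and node selector are placeholders:
$ kubectl apply -f - <<'EOF'
apiVersion: nmstate.io/v1
kind: NodeNetworkConfigurationPolicy
metadata:
  name: br1-vlan-trunk            # placeholder
spec:
  nodeSelector:
    node-role.kubernetes.io/worker: ""
  desiredState:
    interfaces:
      - name: br1                 # VLAN-aware Linux bridge used by the bridge CNI NADs
        type: linux-bridge
        state: up
        bridge:
          options:
            stp:
              enabled: false
          port:
            - name: bond0         # existing host bond carrying the VLAN trunk
              vlan:
                mode: trunk
                trunk-tags:
                  - id: 100
                  - id: 200
                  - id: 300
EOF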
4. OVS Flow Cache Performance
"In a 5,000+ VM environment, how many OVS datapath flow entries should we expect per node? What happens if the megaflow cache is full -- are new flows dropped or processed in the slow path? How do we monitor cache hit rate and eviction rate? What workload patterns can cause excessive cache churn, and how do we mitigate them?"
Purpose: Tests OVS performance internals. The answer should reference: megaflow cache size (configurable, default varies by kernel version, typically 200K-2M entries); cache full behavior (new flows processed via slow-path upcall, not dropped, but with higher latency); monitoring via ovs-appctl dpctl/show (hit/miss counters) and ovs-appctl dpif-netdev/pmd-stats-show (for DPDK); cache churn causes (port scanning, DDoS, excessive short-lived connections); mitigation (connection tracking to consolidate stateful flows, adjusting the exact-match cache size).
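Illustrative commands for watching the kernel datapath cache during the PoC (output abbreviated):
$ ovs-appctl dpctl/show                  # per-datapath lookup hit/missed/lost counters and flow count
$ ovs-appctl dpctl/dump-flows | wc -l    # number of megaflow entries currently installed
$ ovs-appctl upcall/show                 # handler/revalidator load, a proxy for slow-path pressure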
5. OVN Control Plane Scalability
"Our target cluster has 100 nodes and 5,000 VMs. How does the OVN Southbound DB scale with this number of port bindings? What is the expected size of the Southbound DB? How many OpenFlow rules per node does ovn-controller typically program for a cluster this size? What happens during a 'thundering herd' scenario where 100 VMs start simultaneously on 100 different nodes -- does OVN serialize the port binding updates or process them in parallel?"
Purpose: Tests OVN scalability knowledge. The answer should reference: SB DB size grows linearly with port bindings (5,000 ports is modest for OVN); OpenFlow rules per node are proportional to local ports + ACL complexity (typically 1,000-10,000 rules per node at this scale); ovn-controller processes SB DB changes incrementally (only computes flows for ports local to its chassis); thundering herd behavior is handled by ovn-northd processing NB DB changes and writing to SB DB, with each ovn-controller watching for changes relevant to its chassis. The bottleneck is typically ovn-northd throughput, not SB DB size.
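Rough sizing checks that can ground these numbers during the PoC (the SB DB file path varies by distribution and by whether OVN runs containerized):
$ ovn-sbctl list Port_Binding | grep -c '^_uuid'   # total logical ports bound cluster-wide
$ ovs-ofctl dump-flows br-int | wc -l              # OpenFlow rules programmed on this node
$ ls -lh /var/lib/ovn/ovnsb_db.db                  # on-disk SB DB size (path is deployment-specific)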
6. NetworkPolicy to OVN ACL Translation
"We need to implement the following policy: all VMs in namespace 'finance' can talk to each other on any port, can reach the database VMs in namespace 'shared-db' on TCP/5432 only, and cannot reach any other namespace. Show us the Kubernetes NetworkPolicy YAML. Then show us what OVN ACLs this translates to (using ovn-nbctl acl-list). How are these ACLs realized as OpenFlow rules on br-int? What is the performance impact of having 500 such policies across the cluster?"
Purpose: Tests the full policy-to-flow translation chain. The answer must show: (1) a NetworkPolicy with ingress and egress rules using namespace selectors and port specifications; (2) the resulting OVN ACLs on the relevant logical switches (direction, priority, match expression, action); (3) the OpenFlow table entries in the ingress/egress ACL tables; (4) performance: OVN ACLs with connection tracking (ct_state) are efficient because they match on flow state, not per-packet inspection. With 500 policies across 100 nodes, each node only programs the ACLs whose member ports are local to it, so the per-node flow count remains a small fraction of the cluster-wide total -- negligible for OVS performance.
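A skeleton of the policy described above, plus the query used to inspect the resulting ACLs. The selector label and the port-group name are assumptions, and a production version would also allow DNS egress:
$ kubectl apply -f - <<'EOF'
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: finance-isolation
  namespace: finance
spec:
  podSelector: {}                       # applies to every VM pod in 'finance'
  policyTypes: ["Ingress", "Egress"]
  ingress:
    - from:
        - podSelector: {}               # intra-namespace, any port
  egress:
    - to:
        - podSelector: {}               # intra-namespace, any port
    - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: shared-db
      ports:
        - protocol: TCP
          port: 5432
EOF
# OVN-Kubernetes realizes the policy as ACLs attached to a port group:
$ ovn-nbctl acl-list <port-group-for-finance>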
7. Overlay Tunnel Health and Failure Detection
"GENEVE tunnels between nodes use UDP port 6081 on the underlay. If the physical path between Node-1 and Node-7 degrades (50% packet loss on one spine link), how does OVN detect this? Does OVS have BFD for tunnel health? What is the failure detection time? How does OVN reroute traffic when a tunnel is declared down -- does it update the Southbound DB, or does OVS handle it locally?"
Purpose: Tests tunnel health monitoring understanding. The answer: OVS supports BFD (Bidirectional Forwarding Detection) on tunnel interfaces, with configurable detection intervals (e.g., 1-second intervals, 3-miss detection = 3-second detection). When BFD declares a tunnel down, ovn-controller updates the chassis availability in the SB DB, and OVN can reschedule gateway routing to avoid the failed node. For east-west traffic, if the tunnel to a specific node is down, traffic to VMs on that node is black-holed until the tunnel recovers -- OVN does not reroute VM-to-VM traffic via a third node (the VMs are bound to their chassis). The correct solution is to fix the underlay path.
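A sketch of inspecting BFD on a tunnel interface. The interface name follows OVN's ovn-<chassis>-N convention and is illustrative; OVN enables BFD itself where it needs it (e.g., gateway chassis HA), so forcing it on by hand should be treated as a diagnostic step:
$ ovs-vsctl list interface ovn-node7-0 | grep -E 'bfd|options'   # tunnel options and BFD config
$ ovs-vsctl get interface ovn-node7-0 bfd_status                 # forwarding/state/diagnostic fields
$ ovs-vsctl set interface ovn-node7-0 bfd:enable=true            # force BFD on for troubleshooting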
8. Day-2 CNI and Overlay Changes
"Six months after go-live, we need to add a new secondary network (VLAN 500) for a new application tier. This VLAN must be available on 50 of our 100 nodes (only nodes with the right label). Walk us through the day-2 process: NMState policy for the new VLAN, NetworkAttachmentDefinition, testing, and validation. Can this be done without disrupting existing VMs on other VLANs? What rollback mechanism exists if the new VLAN configuration causes an issue?"
Purpose: Tests day-2 operational maturity. The answer should describe: (1) create a NMState NodeNetworkConfigurationPolicy with a nodeSelector targeting the 50 nodes, adding a VLAN sub-interface or bridge for VLAN 500; (2) NMState applies the change in a rolling fashion, one node at a time, with automatic rollback if the configuration fails validation (the node must remain reachable after the change is applied); (3) create a NAD referencing the new bridge/VLAN; (4) test by creating a test VM attached to the new NAD; (5) existing VMs are not affected because NMState only modifies the new interfaces, not existing ones. Rollback: NMState reverts the configuration on the affected node if the config causes connectivity loss to the API server.
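A sketch of step (1), assuming the 50 target nodes carry a hypothetical label network.example.com/vlan500=true:
$ kubectl apply -f - <<'EOF'
apiVersion: nmstate.io/v1
kind: NodeNetworkConfigurationPolicy
metadata:
  name: vlan500-app-tier
spec:
  nodeSelector:
    network.example.com/vlan500: "true"   # only the labeled 50 nodes
  maxUnavailable: 1                       # roll out one node at a time
  desiredState:
    interfaces:
      - name: bond0.500                   # VLAN sub-interface on the existing bond
        type: vlan
        state: up
        vlan:
          base-iface: bond0
          id: 500
EOF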
9. Performance Baseline: Overlay vs. Direct VLAN
"What latency and throughput overhead does the GENEVE overlay add compared to a direct VLAN attachment via Multus bridge CNI? For a latency-sensitive application (e.g., database replication requiring sub-500-microsecond round-trip), should the VMs use the OVN overlay network or a direct VLAN attachment? What are the tradeoffs?"
Purpose: Tests performance awareness and architectural judgment. The answer: GENEVE overlay adds 10-50 microseconds of latency per hop due to encap/decap processing (exact figure depends on NIC offload capability and CPU speed). Throughput overhead is minimal with hardware offload (TSO/GRO through GENEVE). For sub-500-microsecond latency requirements, a direct VLAN via Multus bridge CNI or SR-IOV eliminates overlay overhead entirely. The tradeoff: direct VLAN bypasses OVN's NetworkPolicy enforcement and distributed routing -- the VM is directly on the physical network, and security must be enforced by physical firewalls or guest-OS firewalls. Most VMs should use the overlay; only latency-critical exceptions should use direct VLAN or SR-IOV.
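One way to produce that baseline during the PoC: run the same VM pair twice, once on the OVN pod network and once on a Multus VLAN attachment, and compare the results (addresses, durations, and stream counts below are illustrative):
$ ping -c 1000 -i 0.01 10.128.4.10 | tail -1    # RTT min/avg/max/mdev over 1,000 probes
$ iperf3 -c 10.128.4.10 -t 30 -P 4              # sustained throughput, 4 parallel streams
$ mpstat -P ALL 5 6                             # host CPU during the test, to expose encap/decap cost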
10. Migration Coexistence: NSX and OVN Side by Side
"During the migration, we will have VMs on VMware/NSX and VMs on OVE/OVN running simultaneously for months. Both use GENEVE, but with different TLV options. How do VMs on the two platforms communicate? Is there a shared overlay, or must traffic be routed through the physical network? What are the implications for firewall rules that currently reference NSX groups -- how do we enforce policies across the two platforms during coexistence?"
Purpose: Tests migration planning realism. The correct answer: there is no shared overlay. NSX GENEVE and OVN GENEVE are incompatible at the tunnel level (different option classes, different control planes). Communication between NSX VMs and OVN VMs must be routed through the physical network (Layer-3 routing via physical switches, NSX Tier-0 gateway, or OVN gateway node). Firewall rules that reference NSX security groups cannot be applied on the OVN side -- equivalent Kubernetes NetworkPolicies must be created on OVE. During coexistence, the security perimeter has a gap at the L3 boundary between the two platforms, which must be addressed with physical firewall rules or a shared policy engine.