Routing & Security
Why This Matters
The previous chapters covered the data plane -- how packets move between VMs through overlays (OVN/GENEVE), physical fabrics (spine-leaf, LACP, ECMP), and advanced paths (SR-IOV, DPDK). But moving packets is only half the problem. The other half is deciding where packets go (routing) and whether packets are allowed (security). In a VMware environment with NSX, these two concerns are handled by the NSX Distributed Router (DR) for routing and the NSX Distributed Firewall (DFW) for micro-segmentation. The DFW is the single most operationally significant NSX feature for a Tier-1 financial enterprise -- it enforces zero-trust segmentation policies across every VM, and FINMA compliance depends on it.
This chapter covers the technologies that replace, extend, or re-implement these capabilities in OVE, Azure Local, and Swisscom ESC. The scope is wide because routing and security intersect at every layer:
- DVR (Distributed Virtual Routing) -- how routing decisions happen locally on every hypervisor instead of through a central chokepoint
- VRF (Virtual Routing and Forwarding) -- how multiple isolated routing tables coexist on a single node for tenant separation
- eBPF (Extended Berkeley Packet Filter) -- the programmable kernel technology that is replacing iptables for network policy enforcement, observability, and runtime security
- Micro-segmentation -- the security model that replaces perimeter firewalls with per-workload policies (the NSX DFW replacement)
- Network Policies (Kubernetes) -- the Kubernetes-native mechanism for micro-segmentation, including the new AdminNetworkPolicy resource that addresses key DFW feature gaps
- QoS (Quality of Service) -- traffic classification, prioritization, and bandwidth management for storage, migration, and management traffic separation
- VPN / IPsec Tunneling -- encrypted site-to-site connectivity for multi-datacenter and DR scenarios
The micro-segmentation and Network Policy sections are the most critical in this chapter. For a financial enterprise, the ability to enforce granular, auditable, per-workload security policies is not a feature -- it is a regulatory requirement. NSX DFW provides this today with a mature, well-understood policy model. The replacement must deliver equivalent or superior capability, or the migration is blocked.
Concepts
1. DVR (Distributed Virtual Routing)
The Problem with Centralized Routing
In a traditional network architecture, routing between subnets is performed by a centralized device -- a physical router, a firewall, or a dedicated routing VM. Every packet that crosses a subnet boundary must traverse the network to reach the router, get a forwarding decision, and traverse the network back to the destination. This creates three problems:
Hairpinning: Two VMs on the same physical host, in different subnets, must send their traffic up to the centralized router and back down to the same host. The packet traverses the physical network twice for no reason.
Single bottleneck: All inter-subnet traffic funnels through the router. At 5,000+ VMs with hundreds of subnets, the centralized router becomes a bandwidth and CPU bottleneck. Even high-end physical routers struggle with the aggregate east-west traffic of a large virtualized environment.
North-south-only optimization: Centralized routers are positioned at the north-south boundary (data center to external). East-west traffic (VM to VM within the data center) was historically a small fraction of total traffic. In modern virtualized environments, east-west traffic is 70-80% of all traffic. The architecture is optimized for the wrong traffic pattern.
Centralized Routing -- The Hairpin Problem:
Host A Central Router Host A
+------------------+ +-------------+ (same host!)
| VM-1 (10.1.1.5) | | |
| Subnet: 10.1.1/24|---+ | Routing | +---| VM-2 (10.2.1.8) |
+------------------+ | | Table: | | | Subnet: 10.2.1/24|
| | 10.1.1/24 | | +------------------+
+----->| 10.2.1/24 |------+
leaf | 10.3.1/24 | leaf
switch | default gw | switch
uplink +-------------+ uplink
Problem: VM-1 -> VM-2 traffic crosses the physical network TWICE
even though both VMs are on the same host.
Path: VM-1 -> leaf -> spine -> router -> spine -> leaf -> VM-2
Hops: 6 (should be 0 -- same host)
Latency: ~500-1000 us (should be ~10-50 us)
DVR Concept: Every Hypervisor Hosts the Routing Function
Distributed Virtual Routing (DVR) solves this by placing the routing function on every hypervisor. Instead of a centralized router making forwarding decisions, each host has a local instance of the logical router that can route between subnets directly. East-west traffic between VMs on the same host never touches the physical network. East-west traffic between VMs on different hosts takes the shortest path -- directly between the two hosts, without a detour through a centralized router.
Distributed Virtual Routing -- Local Routing:
Host A Host B
+-------------------------------------------+ +----------------------------+
| | | |
| VM-1 (10.1.1.5) VM-2 (10.2.1.8) | | VM-3 (10.3.1.12) |
| | | | | | |
| v v | | v |
| +----+----+ +----+----+ | | +----+----+ |
| | ls-sub1 | | ls-sub2 | | | | ls-sub3 | |
| +---------+ +---------+ | | +---------+ |
| | | | | | |
| +----------+------------+ | | | |
| | | | | |
| +------+------+ | | +-----+------+ |
| | Distributed | | | | Distributed| |
| | Router | | | | Router | |
| | (local copy)| | | | (local copy)| |
| +------+------+ | | +-----+------+ |
| | | | | |
+-------------------------------------------+ +----------------------------+
| |
+------ GENEVE Tunnel ----------------+
(only for cross-host traffic)
Case 1: VM-1 -> VM-2 (same host, different subnet)
Path: VM-1 -> local DVR -> VM-2
Hops: 0 physical hops
Latency: ~10-50 us (OVS flow processing only)
Case 2: VM-1 -> VM-3 (different host, different subnet)
Path: VM-1 -> local DVR -> GENEVE tunnel -> remote DVR -> VM-3
Hops: 1 physical hop (direct host-to-host)
Latency: ~50-200 us (OVS + encap/decap + wire)
How OVN Implements DVR
OVN (Open Virtual Network), the control plane for OVS in OVE, implements DVR natively. Every OVN logical router is distributed across all nodes by default -- there is no "centralized mode" to accidentally configure.
Logical Router Distributed Component: When an OVN logical router is created, every chassis (hypervisor) that hosts VMs connected to that router's subnets gets a local copy of the router's forwarding pipeline. This is implemented as OVS flows on br-int. The logical router does not exist as a separate process or VM -- it is a set of OpenFlow rules in the OVS flow table on each node.
Gateway Nodes for External Traffic: While east-west routing is fully distributed, north-south traffic (to/from external networks) must exit through a physical uplink. OVN designates specific nodes as gateway chassis for each logical router. External traffic (default route, SNAT) is forwarded through GENEVE tunnels to the gateway chassis, which performs SNAT/DNAT and sends the traffic out its physical interface. Multiple gateway chassis can be configured for redundancy (active-passive failover via BFD).
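Gateway chassis assignment is visible and adjustable with ovn-nbctl. The following is a minimal sketch with hypothetical router-port and chassis names, showing an active-passive pair where the higher priority value is the active gateway:
# Pin the router's external-facing port to two gateway chassis (names are hypothetical).
# The chassis with the higher priority is active; the other takes over on BFD failure.
ovn-nbctl lrp-set-gateway-chassis lr-tenant-a-public gw-node-01 20
ovn-nbctl lrp-set-gateway-chassis lr-tenant-a-public gw-node-02 10
# Verify the configured gateway chassis and their priorities
ovn-nbctl lrp-get-gateway-chassis lr-tenant-a-public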
OVN DVR flow example:
# On a host with VMs in subnets 10.1.1.0/24 and 10.2.1.0/24:
# OVS flow table on br-int shows local routing entries:
# 1. ARP responder for router port on subnet 1 (gateway MAC)
table=12, priority=100, arp, arp_tpa=10.1.1.1, arp_op=1
actions=move:NXM_OF_ETH_SRC->NXM_OF_ETH_DST,
mod_dl_src:fa:16:3e:aa:bb:01,
load:2->NXM_OF_ARP_OP,
move:NXM_NX_ARP_SHA->NXM_NX_ARP_THA,
load:0xfa163eaabb01->NXM_NX_ARP_SHA,
move:NXM_OF_ARP_SPA->NXM_OF_ARP_TPA,
load:0x0a010101->NXM_OF_ARP_SPA,
IN_PORT
# 2. Routing decision: packets from subnet 1 destined for subnet 2
table=17, priority=100, ip, nw_dst=10.2.1.0/24, metadata=0x3
actions=mod_dl_src:fa:16:3e:aa:bb:02, # router MAC for subnet 2
dec_ttl,
mod_dl_dst:<destination VM MAC>,
load:0x4->NXM_NX_REG15, # output port for destination
resubmit(,32) # continue to egress pipeline
# 3. External traffic -> forward to gateway chassis via tunnel
table=17, priority=50, ip, metadata=0x3
actions=mod_dl_src:fa:16:3e:aa:bb:03, # router MAC toward gateway chassis
dec_ttl,
load:0x1->NXM_NX_REG15, # tunnel port to gateway chassis
resubmit(,32)
How NSX Implements DVR
VMware NSX uses the same conceptual split but with different terminology:
- Distributed Router (DR): Runs on every transport node (ESXi host). Handles all east-west routing. Implemented in the NSX kernel module (the vmkernel datapath). The DR component has a router port on every connected logical switch.
- Service Router (SR): Runs on NSX Edge nodes (dedicated VMs or bare-metal). Handles north-south traffic, NAT, VPN, load balancing, and any service that requires a centralized function. Two deployment modes: active-standby (simpler, stateful failover) or active-active (ECMP, higher throughput, stateless services only).
The key architectural difference: in OVN, the gateway chassis is a regular worker node with an external uplink. In NSX, the Edge node is a dedicated, purpose-built component with its own resource pool, sizing requirements, and failure domain. NSX Edge sizing mistakes (under-provisioned CPU, memory, or uplink bandwidth) are a common cause of north-south bottlenecks in production.
Performance Benefit
The performance benefit of DVR is dramatic for east-west traffic:
| Traffic Pattern | Centralized Routing | DVR |
|---|---|---|
| Same host, same subnet | ~5 us (L2 switching) | ~5 us (same) |
| Same host, different subnet | ~500-1000 us (hairpin) | ~10-50 us (local OVS flows) |
| Cross-host, different subnet | ~500-1500 us (via router) | ~50-200 us (direct tunnel) |
| North-south (external) | ~200-500 us | ~200-500 us (via gateway, same) |
For east-west traffic, DVR eliminates the centralized router as a latency source and bandwidth bottleneck. For north-south traffic, the architecture is the same -- both approaches require traffic to traverse a gateway/edge node. The difference is that with DVR, this only applies to external traffic, not to every inter-subnet packet.
2. VRF (Virtual Routing and Forwarding)
Concept: Multiple Routing Tables on a Single Device
VRF is a technology that creates multiple independent routing table instances on a single router, switch, or Linux host. Each VRF instance has its own routing table, its own ARP table, and its own set of interfaces. Traffic in one VRF cannot reach another VRF without an explicit route leak or inter-VRF forwarding policy.
Think of VRFs as VLANs for Layer 3. VLANs isolate broadcast domains at Layer 2. VRFs isolate routing domains at Layer 3. A packet in VRF "tenant-A" uses tenant-A's routing table, sees tenant-A's default gateway, and can only reach destinations within tenant-A's routing table -- even if tenant-B uses the same IP address range on the same physical device.
VRF Isolation:
Single Linux Host / Router
+------------------------------------------------------------------+
| |
| VRF: tenant-a VRF: tenant-b |
| +---------------------------+ +---------------------------+ |
| | Routing Table: | | Routing Table: | |
| | 10.0.0.0/8 -> eth1.100 | | 10.0.0.0/8 -> eth1.200 | |
| | default -> 10.0.0.1 | | default -> 10.0.0.1 | |
| | | | | |
| | ARP Table: | | ARP Table: | |
| | 10.0.1.5 -> aa:bb:cc:01| | 10.0.1.5 -> dd:ee:ff:01| |
| | | | | |
| | Interfaces: eth1.100 | | Interfaces: eth1.200 | |
| +---------------------------+ +---------------------------+ |
| |
| Both tenants use 10.0.0.0/8 -- no conflict because separate |
| routing tables. Packets in tenant-a can NEVER reach tenant-b |
| unless explicitly configured (route leaking). |
+------------------------------------------------------------------+
VRF in Linux
Linux has native VRF support since kernel 4.3 (2015), implemented as a network device type (l3mdev). A VRF device is a "routing context" that enslaves other network interfaces. All traffic arriving on enslaved interfaces is processed using the VRF's routing table, not the default table.
# Create a VRF device for tenant-a, using routing table 100
ip link add vrf-tenant-a type vrf table 100
ip link set vrf-tenant-a up
# Enslave an interface to the VRF
ip link set eth1.100 master vrf-tenant-a
# Add routes in the VRF's routing table
ip route add 10.0.0.0/8 dev eth1.100 table 100
ip route add default via 10.0.0.1 table 100
# Verify routing table isolation
ip route show vrf vrf-tenant-a
# Shows only tenant-a routes
ip route show
# Shows only main/default table routes -- tenant-a routes are invisible
# Bind a socket to a VRF (application isolation)
# setsockopt(fd, SOL_SOCKET, SO_BINDTODEVICE, "vrf-tenant-a", ...)
# Or run a process in a VRF context:
ip vrf exec vrf-tenant-a ping 10.0.1.5
The l3mdev (Layer 3 Master Device) framework is the kernel mechanism that redirects routing lookups from the default FIB (Forwarding Information Base) to the VRF-specific FIB. When a packet arrives on an interface enslaved to a VRF, the kernel's ip_route_input() function uses the VRF's table ID for the lookup instead of the main table. This is implemented via the l3mdev_fib_table() callback in the network device structure.
VRF Lite vs MPLS VRF
VRF Lite: VRF without a signaling protocol. Each VRF is locally significant -- the router at each end must be manually configured with matching VRF definitions and interface assignments. Traffic between sites in the same VRF is carried over dedicated VLAN subinterfaces or GRE tunnels. This is simple and sufficient for small-scale tenant separation (dozens of VRFs).
MPLS VRF (BGP/MPLS IP VPN, RFC 4364): VRF with BGP/MPLS VPN signaling. VPN routes are exchanged via MP-BGP between PE (Provider Edge) routers, with route distinguishers keeping overlapping prefixes unique and route targets controlling import/export between VRFs. Traffic is MPLS-encapsulated with VPN labels. This is the carrier-grade approach for large-scale multi-tenant networks (thousands of VRFs, hundreds of sites). The complexity is significantly higher, but the automation and scalability are essential at scale.
For this organization, VRF Lite is the likely deployment model. The number of network segments (tens to low hundreds) does not justify MPLS complexity, and the OVN overlay provides equivalent isolation at the virtual network layer.
How VRFs Map to OVN Logical Routers and Kubernetes Namespaces
In OVN, each logical router is conceptually a VRF -- it has its own routing table and its own set of connected logical switches. Two logical routers can use overlapping IP ranges without conflict because their forwarding tables are independent.
In Kubernetes, namespaces provide a soft isolation boundary. OVN-Kubernetes assigns each namespace a subnet from the cluster network CIDR, and the per-node logical router handles routing between namespace subnets. However, by default, all namespace subnets share a single routing context -- there is no VRF-level isolation. To achieve true VRF-like isolation, Kubernetes relies on NetworkPolicies (covered later in this chapter) rather than separate routing tables.
For workloads that require genuine L3 isolation (e.g., PCI DSS cardholder data environment separated from general corporate workloads), the approach in OVE is:
- Secondary networks via Multus: Attach VMs to dedicated OVN secondary networks with their own logical router, achieving VRF-equivalent isolation at the OVN level (a sketch follows this list)
- NetworkPolicies: Default-deny policies that enforce isolation even within the shared cluster network
- Node isolation: Dedicated node pools with separate physical network segments for the most sensitive workloads
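For the Multus option above, a minimal sketch of a dedicated OVN secondary network is shown below. All names and the subnet are assumptions, and the config follows the ovn-k8s-cni-overlay plugin conventions used for OVN-Kubernetes secondary networks:
apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
  name: pci-cde-net                # hypothetical isolated network for the cardholder environment
  namespace: cardholder-env
spec:
  config: |
    {
      "cniVersion": "0.4.0",
      "name": "pci-cde-net",
      "type": "ovn-k8s-cni-overlay",
      "topology": "layer2",
      "subnets": "192.168.77.0/24",
      "netAttachDefName": "cardholder-env/pci-cde-net"
    }
VMs attached to this network share an isolated logical switch that is separate from the cluster default network, so their traffic never shares a forwarding context with general-purpose workloads.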
3. eBPF (Extended Berkeley Packet Filter)
Evolution from BPF to eBPF
The original BPF (Berkeley Packet Filter), now called cBPF (classic BPF), was created in 1992 at Lawrence Berkeley National Laboratory. Its purpose was narrow: efficiently filter network packets in the kernel so that tools like tcpdump could capture specific traffic without copying every packet to userspace. cBPF programs are small, register-based bytecode programs that run in a kernel virtual machine with two registers (A and X), a scratch memory area, and a limited instruction set (load, store, jump, arithmetic).
eBPF (Extended BPF), introduced in Linux 3.18 (2014) by Alexei Starovoitov, expanded this concept from a packet filter into a general-purpose, safe, in-kernel programmable execution environment. The "extended" part is an understatement -- eBPF is to cBPF what a modern CPU is to a calculator.
| Aspect | cBPF (classic) | eBPF (extended) |
|---|---|---|
| Registers | 2 (A, X) | 11 (R0-R10), 64-bit; R10 is a read-only frame pointer |
| Instruction set | ~30 instructions | ~100+ instructions, function calls |
| Program size | 4,096 instructions max | 1 million verified instructions |
| Data structures | Fixed scratch memory (16 slots) | Maps (hash, array, LRU, ring buffer, etc.) |
| Attach points | Socket filter only | XDP, TC, tracepoints, kprobes, cgroups, LSM, etc. |
| Helper functions | None | 200+ kernel helper functions |
| JIT compilation | Optional, limited | Mandatory on most architectures |
| Tail calls | No | Yes (up to 33 chain depth) |
eBPF Architecture
eBPF Architecture Overview:
Userspace Kernel
+----------------------------+ +-------------------------------------+
| | | |
| eBPF Program (C) | | |
| +----------------------+ | | |
| | #include <bpf/bpf.h> | | | |
| | SEC("xdp") | | | |
| | int prog(ctx) { | | | |
| | ... | | | |
| | return XDP_PASS; | | | |
| | } | | | |
| +----------+-----------+ | | |
| | | | |
| Clang/LLVM compile | | |
| to eBPF bytecode | | |
| | | | |
| v | | |
| +----------------------+ | bpf() | +-------------------------------+ |
| | ELF object file | | syscall | | eBPF Verifier | |
| | (.o with BPF section)| +------------->| | | |
| +----------------------+ | | | 1. DAG check (no loops) | |
| | | | 2. Simulate all paths | |
| +----------------------+ | | | 3. Type checking (BTF) | |
| | Loader (libbpf, | | | | 4. Memory bounds checking | |
| | bpftool, cilium- | | | | 5. Verify helper call args | |
| | agent) | | | | 6. Stack depth <= 512 bytes | |
| +----------+-----------+ | | | 7. Max 1M verified insns | |
| | | | +---------------+---------------+ |
| | map ops | | | PASS |
| +<------------>+ | v |
| | | +-------------------------------+ |
| +----------------------+ | | | JIT Compiler | |
| | Userspace tools: | | | | eBPF bytecode -> native x86 | |
| | bpftool | | | | (or ARM64, s390x, etc.) | |
| | cilium monitor | | | +---------------+---------------+ |
| | hubble | | | | |
| | bpftrace | | | v |
| +----------------------+ | | +-------------------------------+ |
| | | | eBPF Program (native) | |
+----------------------------+ | | attached to hook point: | |
| | | |
| | XDP ---------> NIC driver | |
| | TC ---------> qdisc | |
| | socket ------> socket ops | |
| | kprobe ------> kernel func | |
| | tracepoint --> trace event | |
| | cgroup ------> cgroup hook | |
| | LSM ---------> security hook | |
| +-------------------------------+ |
| |
| +-------------------------------+ |
| | eBPF Maps | |
| | (shared data between programs | |
| | and between kernel/user) | |
| | | |
| | Hash Map Array LRU Hash | |
| | Per-CPU Ring Buffer | |
| | LPM Trie Stack Trace | |
| | Sockmap DevMap (XDP) | |
| +-------------------------------+ |
+-------------------------------------+
The Verifier is the critical safety mechanism. Every eBPF program must pass the verifier before it can be loaded into the kernel. The verifier performs static analysis to guarantee that the program:
- Terminates: The control flow graph is a DAG (directed acyclic graph) -- no backward jumps, no unbounded loops (bounded loops were added in kernel 5.3 with a provable iteration limit).
- Is memory-safe: Every memory access is bounds-checked. Pointer arithmetic is tracked and constrained. The program cannot read or write arbitrary kernel memory.
- Does not leak kernel pointers: Scalar values derived from kernel pointers are tracked and cannot be returned to userspace (preventing KASLR bypass).
- Has bounded stack usage: The eBPF stack is limited to 512 bytes per program (tail calls can extend this, but each frame is bounded).
- Calls only allowed helpers: Each program type has a specific set of allowed helper functions. An XDP program cannot call bpf_probe_read() (a tracing helper), and a tracing program cannot call bpf_redirect() (a networking helper).
The verifier simulates every possible execution path of the program, tracking the type and range of every register and every stack slot. For a complex Cilium network policy program, verification can take millions of simulated instructions and hundreds of milliseconds. The 1-million-instruction complexity limit is per-verification (not per-runtime), and recent kernels (5.18+) allow partial verification of subprograms to reduce complexity.
JIT Compilation: After verification, the eBPF bytecode is compiled to native machine code by the JIT compiler. On x86_64, eBPF instructions map nearly 1:1 to native instructions (eBPF was designed for this). JIT-compiled eBPF programs run at near-native speed -- the overhead compared to hand-written C code in the kernel is typically <5%.
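A minimal sketch of the load-and-verify workflow with standard tooling (source file, pin path, and program names are assumptions):
# Compile restricted C to eBPF bytecode (source file name is hypothetical)
clang -O2 -g -target bpf -c xdp_filter.c -o xdp_filter.o
# Load the program; the verifier runs here and rejects anything it cannot prove safe
bpftool prog load xdp_filter.o /sys/fs/bpf/xdp_filter
# Inspect the loaded program -- "jited" indicates native code was generated
bpftool prog show pinned /sys/fs/bpf/xdp_filter
# Confirm the JIT is enabled on this host (1 = on)
sysctl net.core.bpf_jit_enable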
Program Types: XDP, TC, Socket, Tracing
XDP (eXpress Data Path): XDP programs attach to the network driver's receive path and execute before the kernel allocates an sk_buff (socket buffer). This is the fastest possible interception point -- the program operates on the raw DMA buffer (the xdp_buff structure) with the packet data still in the driver's ring buffer. XDP programs can return:
| Return Code | Action | Use Case |
|---|---|---|
| XDP_PASS | Continue normal kernel processing | Default, allowed traffic |
| XDP_DROP | Drop the packet immediately | DDoS mitigation, firewall deny |
| XDP_TX | Transmit back out the same NIC | Load balancer reply, reflection |
| XDP_REDIRECT | Redirect to another NIC, CPU, or AF_XDP socket | L3 forwarding, container redirect |
| XDP_ABORTED | Drop with error trace | Debugging |
XDP can operate in three modes:
- Native XDP (xdpdrv): Loaded into the NIC driver, runs at driver level. Requires driver support (all modern drivers: mlx5, i40e, ice, bnxt). Best performance.
- Generic XDP (xdpgeneric): Runs in the kernel networking stack after sk_buff allocation. Works with any NIC. Slower than native XDP (defeats the purpose of avoiding sk_buff allocation).
- Offloaded XDP: Loaded directly onto the NIC hardware (SmartNIC FPGA/ASIC). Programs execute in NIC silicon, never touching the host CPU. Supported on Netronome Agilio and some Broadcom NICs.
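A sketch of attaching the same object in native versus generic mode with iproute2 (interface and section names are assumptions; detach before switching modes):
# Native XDP -- requires driver support
ip link set dev ens1f0 xdpdrv obj xdp_filter.o sec xdp
# Detach, then attach in generic mode (works on any NIC, after sk_buff allocation)
ip link set dev ens1f0 xdp off
ip link set dev ens1f0 xdpgeneric obj xdp_filter.o sec xdp
# Show which XDP mode and program ID are active
ip -details link show dev ens1f0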
TC (Traffic Control): TC eBPF programs attach to the Linux Traffic Control layer, specifically to the clsact qdisc (classful action queueing discipline). TC programs operate on sk_buff (fully allocated socket buffers), giving them access to more packet metadata than XDP but running later in the processing pipeline. TC programs can be attached at both ingress and egress.
Packet Processing Pipeline with eBPF Hook Points:
NIC Hardware
|
v
+----------+
| NIC | XDP runs HERE (before sk_buff allocation)
| Driver | (xdp_buff -- raw DMA buffer)
+----+-----+
|
| sk_buff allocation
v
+----------+
| TC | TC ingress eBPF runs HERE
| ingress | (sk_buff -- full metadata)
+----+-----+
|
v
+----------+
| Netfilter| (iptables/nftables -- traditional path)
| (if any) |
+----+-----+
|
v
+----------+
| Routing | ip_forward() / ip_local_deliver()
| Decision |
+----+-----+
|
v
+----------+
| TC | TC egress eBPF runs HERE
| egress |
+----+-----+
|
v
+----------+
| NIC | XDP_TX / XDP_REDIRECT (if redirecting)
| Driver |
+----------+
|
v
Wire
Socket programs: Attach to individual sockets or cgroups of sockets. Use cases: socket-level load balancing (BPF_PROG_TYPE_SK_LOOKUP), message redirection between sockets (BPF_PROG_TYPE_SK_MSG with SOCKMAP), TCP congestion control customization (BPF_PROG_TYPE_STRUCT_OPS).
Tracing programs: Attach to kernel tracepoints, kprobes (dynamic kernel function instrumentation), and uprobes (userspace function instrumentation). These do not modify packet processing but observe it -- they are the foundation of eBPF-based observability tools like Hubble, bpftrace, and tcplife.
eBPF Maps
Maps are the shared data structures that allow eBPF programs to maintain state between invocations and to communicate with userspace. Key map types:
| Map Type | Structure | Use Case |
|---|---|---|
| BPF_MAP_TYPE_HASH | Hash table (key-value) | Connection tracking, policy lookup |
| BPF_MAP_TYPE_ARRAY | Fixed-size array | Per-CPU counters, configuration |
| BPF_MAP_TYPE_LRU_HASH | LRU eviction hash | NAT tables, flow tracking at scale |
| BPF_MAP_TYPE_PERCPU_HASH | Per-CPU hash table | Lock-free counters, per-CPU state |
| BPF_MAP_TYPE_RINGBUF | MPSC ring buffer | Event streaming to userspace |
| BPF_MAP_TYPE_LPM_TRIE | Longest-prefix match trie | IP CIDR lookups for routing/policy |
| BPF_MAP_TYPE_DEVMAP | Device redirect map | XDP redirect targets |
| BPF_MAP_TYPE_SOCKMAP | Socket redirect map | Socket-level load balancing |
Maps are created via the bpf() syscall and persist in kernel memory as long as at least one reference exists (a loaded program, a pinned path in bpffs, or a userspace file descriptor). Map size is limited by available kernel memory, not by the program -- a single hash map can hold millions of entries.
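A sketch of creating and inspecting a pinned map from userspace with bpftool (the pin path, key/value sizes, and entry count are illustrative):
# Create a standalone LRU hash map pinned to bpffs (key/value sizes in bytes)
bpftool map create /sys/fs/bpf/flow_ct type lru_hash key 16 value 32 \
entries 1000000 name flow_ct
# List maps known to the kernel, then dump the pinned one
bpftool map show
bpftool map dump pinned /sys/fs/bpf/flow_ct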
eBPF for Networking: Cilium as a CNI
Cilium is a CNI plugin (and more) that replaces the traditional Linux networking stack (iptables, kube-proxy, conntrack) with eBPF programs. In an OVE context, Cilium is not the primary CNI (OVN-Kubernetes is), but it is available as a secondary network provider and its concepts are important because:
- OVN-Kubernetes is adopting eBPF concepts -- the OVN northbound database translates to OVS flows today, but eBPF-based OVN datapath implementations are being developed
- Cilium is the reference implementation for understanding what eBPF can do for networking
- Some organizations deploy Cilium alongside OVN for enhanced observability (Hubble) or cluster mesh
Cilium replaces iptables for three core Kubernetes networking functions:
- Service load balancing: Instead of kube-proxy's iptables DNAT chains (which scale O(n) with services), Cilium uses an eBPF hash map keyed by {service IP, port, protocol}. Lookup is O(1) regardless of the number of services.
- NetworkPolicy enforcement: Instead of iptables rules per pod (which scale O(n^2) with pods * policies), Cilium attaches TC eBPF programs to each pod's veth interface. Policy decisions are made via map lookups on the pod's identity (a numeric label hash), not by IP address.
- Connection tracking: Instead of the kernel's nf_conntrack (which has a single conntrack table shared by all pods, causing lock contention at scale), Cilium maintains per-CPU conntrack maps in eBPF, eliminating lock contention.
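Where Cilium is deployed, these structures can be inspected from inside the agent pod. A sketch with a hypothetical pod name (recent releases name the in-pod binary cilium-dbg; older releases use cilium):
# Service load-balancing map (the eBPF replacement for kube-proxy's iptables chains)
kubectl -n kube-system exec cilium-x7k2p -- cilium bpf lb list
# Per-CPU connection-tracking entries
kubectl -n kube-system exec cilium-x7k2p -- cilium bpf ct list global
# Numeric security identities derived from pod labels
kubectl -n kube-system exec cilium-x7k2p -- cilium identity list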
eBPF for Observability: Hubble, bpftrace, tcplife
eBPF's tracing capabilities enable deep observability without modifying application code:
- Hubble (Cilium): A network observability platform that uses eBPF to capture L3/L4/L7 flow data for every packet processed by Cilium. Provides a service dependency map, DNS query logging, HTTP request/response tracing, and network policy verdict logging -- all from eBPF programs, with no application instrumentation.
- bpftrace: A high-level tracing language (inspired by DTrace and awk) that compiles to eBPF programs. Example: bpftrace -e 'tracepoint:tcp:tcp_retransmit_skb { printf("retransmit to %s\n", ntop(args->daddr)); }' traces every TCP retransmission in real time.
- tcplife: From the BCC toolkit, traces TCP connection lifecycle (connect, accept, close) with duration, bytes transferred, and process information. Uses kprobes on tcp_set_state().
eBPF vs iptables/nftables: Performance at Scale
This comparison is directly relevant to the migration decision. VMware NSX DFW rules are enforced in the ESXi kernel module as per-packet processing. The OVE equivalent uses OVN ACLs (translated to OVS flows) or eBPF programs. The performance characteristics differ fundamentally:
| Aspect | iptables | eBPF (Cilium/OVN) |
|---|---|---|
| Rule matching | Linear chain traversal, O(n) | Map lookup, O(1) or O(log n) |
| Rule update | Full chain replacement (atomic swap) | Single map entry update |
| Latency per 1,000 rules | ~50-100 us (chain walk) | ~1-5 us (hash lookup) |
| Latency per 10,000 rules | ~500-1000 us | ~1-5 us (same) |
| Conntrack scalability | Single table, spinlock contention | Per-CPU tables, no locks |
| Memory overhead per rule | ~200 bytes (iptables entry) | ~64 bytes (map entry) |
| Visibility | iptables -L (static) | Maps + ring buffer (live) |
At 5,000+ VMs with hundreds of security policies, the difference between O(n) iptables chains and O(1) eBPF map lookups is the difference between milliseconds and microseconds of per-packet policy evaluation overhead. This is why the industry is moving from iptables to eBPF for network policy enforcement.
4. Micro-segmentation
Definition: Security at the Workload Level
Micro-segmentation is a security model that enforces access control policies at the individual workload (VM, pod, container) level, rather than at the network perimeter. In a traditional perimeter security model, a firewall at the data center edge inspects north-south traffic, but east-west traffic between VMs inside the data center is either unfiltered or filtered by a small number of internal firewall zones. Once an attacker breaches the perimeter, lateral movement is unrestricted.
Micro-segmentation inverts this model. Every workload has its own firewall policy. A web server can reach the application server on port 8080, but not the database on port 5432. The application server can reach the database on port 5432, but not the management server on port 22. A compromised web server cannot pivot to the database, even though both are on the same subnet. The "blast radius" of any breach is limited to the compromised workload and its explicitly allowed connections.
Micro-segmentation vs Perimeter Security:
Traditional Perimeter Model:
+-------------------------------------------------------+
| Perimeter Firewall (north-south only) |
+-------------------------------------------------------+
| |
| Data Center -- FLAT east-west |
| |
| +------+ +------+ +------+ +------+ |
| | Web |<--->| App |<--->| DB |<--->| Mgmt | |
| +------+ +------+ +------+ +------+ |
| |
| All VMs can reach all VMs. Compromised Web -> |
| attacker has access to DB, Mgmt, everything. |
+-------------------------------------------------------+
Micro-segmented Model:
+-------------------------------------------------------+
| Perimeter Firewall (north-south) |
+-------------------------------------------------------+
| |
| Data Center -- Per-workload policy |
| |
| +------+ :8080 +------+ :5432 +------+ |
| | Web |-------->| App |-------->| DB | |
| +------+ ALLOW +------+ ALLOW +------+ |
| | | |
| X (DENY all other) X |
| | | |
| +------+ +------+ |
| | Mgmt | :22 ALLOW from | Jump | DENY |
| +------+ admin VLAN only | Host | from all |
| +------+ |
| |
| Compromised Web -> attacker can ONLY reach App:8080. |
| Cannot reach DB, Mgmt, or any other workload. |
+-------------------------------------------------------+
Zero-Trust Networking for Financial Enterprises
For a Tier-1 financial enterprise subject to FINMA regulation, micro-segmentation is not optional -- it is a regulatory expectation. FINMA Circular 2023/1 on operational risks and resilience (and its predecessor guidance on IT risks) requires:
- Least privilege access: Workloads should only be able to communicate with explicitly authorized endpoints
- Network segmentation: Critical systems (payment processing, core banking, market data) must be isolated from general-purpose workloads
- Auditability: Security policies must be documentable, version-controlled, and auditable
- Incident containment: Architecture must limit lateral movement in case of compromise
NSX DFW delivers all of these today. The replacement platform must deliver equivalent capabilities. This is a migration blocker -- if the replacement cannot enforce per-workload policies with equivalent granularity, auditability, and operational manageability, the migration cannot proceed for regulated workloads.
Implementation Approaches
NSX Distributed Firewall (DFW): The incumbent. NSX DFW enforces policies in the ESXi kernel module at the vNIC level -- every packet entering or leaving a VM passes through the DFW filter. Policies are defined in NSX Manager using a multi-tier category model:
- Ethernet category: L2 rules (MAC-based)
- Emergency category: Override rules for incident response
- Infrastructure category: Rules for shared services (DNS, NTP, AD)
- Environment category: Zone-based rules (production vs. development)
- Application category: Application-specific micro-segmentation rules
Within each category, rules are evaluated in priority order. The first match wins. This category model provides a structured way to layer policies -- infrastructure rules take precedence over application rules, and emergency rules override everything.
NSX DFW supports identity-based rules using dynamic security groups populated by vCenter tags, Active Directory group membership, or third-party integrations (vulnerability scanners, CMDB). A rule can say "allow DNS servers to receive UDP 53 from all workloads" where "DNS servers" is a dynamic group that includes any VM tagged with role:dns -- adding a new DNS server automatically applies the policy.
OVN ACLs: OVN implements stateful firewall rules as ACLs (Access Control Lists) on logical switch ports. OVN ACLs are translated into OVS flows on every node, providing distributed enforcement equivalent to NSX DFW. OVN ACLs support:
- L3/L4 matching (IP, port, protocol)
- Direction (ingress/egress relative to the logical port)
- Connection tracking integration (stateful rules with ct_state matching)
- Priority-based evaluation (higher priority ACL wins)
- Logging (OVN ACL log action for audit)
In OVE, OVN ACLs are not configured directly by administrators. They are generated by the OVN-Kubernetes CNI from Kubernetes NetworkPolicy resources (covered in section 5). This is a key architectural difference from NSX: in NSX, the administrator defines DFW rules in a dedicated security policy UI. In OVE, the administrator defines Kubernetes NetworkPolicy objects, and the OVN-Kubernetes controller translates them into OVN ACLs.
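The generated ACLs can still be inspected with ovn-nbctl, and on a standalone OVN deployment equivalent rules can be created by hand. A sketch with hypothetical switch names:
# List the ACLs attached to a logical switch or port group
ovn-nbctl acl-list ls-banking-app
# A hand-written stateful allow rule of the shape OVN-Kubernetes generates,
# with per-rule logging enabled for audit:
ovn-nbctl --log --severity=info acl-add ls-banking-app to-lport 1001 \
'ip4.src == 10.128.2.0/24 && tcp.dst == 5432' allow-related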
eBPF/Cilium: As covered in section 3, Cilium enforces micro-segmentation using eBPF programs attached to pod veth interfaces. Cilium's policy model is identity-based -- each pod receives a numeric identity derived from its Kubernetes labels, and policies are evaluated against identities rather than IP addresses. This avoids the "stale IP" problem where a policy referencing an IP address becomes incorrect when a pod restarts with a new IP.
Hyper-V Virtual Filtering Platform (VFP): Azure Local's equivalent. VFP is a programmable virtual switch extension in the Hyper-V vSwitch that enforces ACLs, encapsulation, and NAT rules. VFP rules are programmed by the Azure Local Network Controller via the OVSDB protocol (yes, Azure Local uses OVSDB for southbound communication to VFP). VFP ACLs provide per-VM stateful filtering with L3/L4 matching and connection tracking.
Policy Model: Allow-List vs Deny-List
Default deny (allow-list): No traffic is permitted unless a rule explicitly allows it. This is the zero-trust model. NSX DFW in production environments almost always operates with a "deny all" default rule at the bottom of the rule table. Every permitted flow requires an explicit allow rule.
Default allow (deny-list): All traffic is permitted unless a rule explicitly blocks it. This is simpler to implement initially but offers weak security -- any missed deny rule is an open path.
For FINMA compliance, default deny is the required posture. Both NSX DFW and Kubernetes NetworkPolicy support default deny. The implementation differs:
- NSX DFW: Add a rule at the bottom of the rule table with action "Drop" and scope "any-any"
- Kubernetes NetworkPolicy: Apply a NetworkPolicy with podSelector: {} (all pods) and empty ingress/egress rules. Once any NetworkPolicy selects a pod, all traffic not explicitly allowed by a policy is denied.
Challenges at Scale
Policy management: With 5,000+ VMs and hundreds of micro-segmentation rules, policy management becomes a significant operational challenge. Common problems:
- Rule sprawl: old rules are never cleaned up, policy tables grow to thousands of entries
- Conflicting rules: a deny rule in one category conflicts with an allow rule in another
- Testing: how do you verify that a policy change does not break a production application?
- Change management: who approves security policy changes? How are they version-controlled?
Troubleshooting: When a connection fails, determining whether the failure is caused by a security policy (expected behavior) or a network outage (unexpected behavior) requires visibility into the policy evaluation process. NSX DFW provides rule hit counters and flow logs. OVN provides ACL logging. Cilium provides Hubble flow verdicts. Without these tools, micro-segmentation troubleshooting becomes a guessing game.
Audit compliance: Auditors need to see a complete, point-in-time snapshot of all security policies and evidence that they are enforced. NSX provides policy export via API. In Kubernetes, policies are stored as API resources and can be exported via kubectl get networkpolicy -A -o yaml. GitOps (storing policies in Git) provides version history and change audit trail.
5. Network Policies (Kubernetes)
The NetworkPolicy Resource
Kubernetes NetworkPolicy is the native mechanism for micro-segmentation in Kubernetes clusters, including OVE with KubeVirt VMs. A NetworkPolicy is a namespaced resource that selects pods (including KubeVirt VM pods) via labels and defines allowed ingress and/or egress traffic.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: allow-app-to-db
namespace: banking-app
spec:
podSelector:
matchLabels:
role: database # This policy applies to pods with role=database
policyTypes:
- Ingress
- Egress
ingress:
- from:
- podSelector:
matchLabels:
role: app-server # Allow from app-server pods
namespaceSelector:
matchLabels:
env: production # ... in production namespaces only
- ipBlock:
cidr: 10.0.50.0/24 # Allow from monitoring subnet
except:
- 10.0.50.100/32 # ... except this specific IP
ports:
- protocol: TCP
port: 5432 # PostgreSQL port only
egress:
- to:
- podSelector:
matchLabels:
role: dns # Allow DNS lookups
ports:
- protocol: UDP
port: 53
- protocol: TCP
port: 53
- to:
- ipBlock:
cidr: 10.0.100.0/24 # Allow NTP, AD to infra subnet
ports:
- protocol: UDP
port: 123
NetworkPolicy Evaluation Flow
NetworkPolicy Evaluation for an Incoming Packet:
Incoming packet to pod "db-0" (role=database, ns=banking-app)
|
v
+----+--------------------------------------------+
| Is there ANY NetworkPolicy selecting this pod |
| with policyType "Ingress"? |
+----+--------------------------------------------+
| |
| YES | NO
v v
+----+----+ +----+----+
| Evaluate | | ALLOW |
| all | | (no |
| matching | | policy |
| policies | | = open)|
+----+-----+ +---------+
|
v
+----+--------------------------------------------+
| For each matching NetworkPolicy: |
| Does the packet match ANY ingress rule? |
+----+--------------------------------------------+
| |
| YES (any policy) | NO (no policy matches)
v v
+----+----+ +----+----+
| ALLOW | | DENY |
| (union | | (default|
| of all | | deny |
| rules) | | once |
+---------+ | selected|
+---------+
Key semantics:
1. If NO policy selects a pod, ALL traffic is allowed (Kubernetes default)
2. Once ANY policy selects a pod, only explicitly allowed traffic passes
3. Multiple policies selecting the same pod are UNION-ed (additive, never subtractive)
4. There is NO deny rule in standard NetworkPolicy -- only "allow" or "implicit deny"
5. podSelector + namespaceSelector in the SAME "from" entry = AND
podSelector and namespaceSelector as SEPARATE "from" entries = OR
Default Deny: Implementing Zero-Trust in Kubernetes
To achieve zero-trust (deny-all-by-default) in a Kubernetes namespace:
# Default deny ALL ingress traffic to all pods in the namespace
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: default-deny-ingress
namespace: banking-app
spec:
podSelector: {} # empty selector = selects ALL pods
policyTypes:
- Ingress
# No ingress rules = nothing is allowed
---
# Default deny ALL egress traffic from all pods in the namespace
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: default-deny-egress
namespace: banking-app
spec:
podSelector: {}
policyTypes:
- Egress
# No egress rules = nothing is allowed (including DNS!)
# IMPORTANT: denying egress without allowing DNS (UDP/TCP 53)
# breaks pod name resolution. Always add a DNS allow rule.
After applying default deny, every allowed flow must be explicitly defined via additional NetworkPolicy resources. This is operationally equivalent to NSX DFW with a "deny all" default rule.
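A companion policy that re-allows DNS after the default egress deny could look like the sketch below. The DNS namespace and port assume an OpenShift-based platform, where the CoreDNS pods run in openshift-dns and listen on 5353 (upstream clusters typically use kube-system and port 53):
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-dns-egress
  namespace: banking-app
spec:
  podSelector: {}              # all pods in the namespace
  policyTypes:
  - Egress
  egress:
  - to:
    - namespaceSelector:
        matchLabels:
          kubernetes.io/metadata.name: openshift-dns
    ports:
    - protocol: UDP
      port: 5353
    - protocol: TCP
      port: 5353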
NetworkPolicy Limitations
Standard Kubernetes NetworkPolicy (GA since Kubernetes 1.7) has significant limitations compared to NSX DFW:
| Limitation | Impact | NSX DFW Comparison |
|---|---|---|
| No explicit deny rules | Cannot deny specific traffic; can only "not allow" it. Cannot create exception rules (allow all except X). | DFW has both Allow and Drop actions |
| No cluster-wide scope | Policies are namespaced. Cannot create a single policy that applies to all namespaces. | DFW rules can scope to "any" |
| No prioritized evaluation | Multiple policies are unioned. Cannot express "this rule overrides that rule." | DFW has categories with strict priority order |
| No L7 filtering | Cannot filter by HTTP path, method, header, or TLS SNI. | DFW with ALB or NSX Intelligence supports L7 context |
| No FQDN-based rules | Cannot write "allow egress to api.example.com." Egress peers are limited to label selectors and ipBlock CIDRs. | DFW supports FQDN in rules (DNS-based resolution) |
| No logging in spec | No standard way to specify per-rule logging. Implementation-dependent. | DFW has per-rule logging toggle |
| No action on rule match | Rules can only allow. No concept of "log and allow" or "reject with RST." | DFW supports Allow, Drop, Reject actions |
These limitations are significant for an NSX DFW replacement. The most critical gaps -- no deny rules, no cluster-wide scope, and no priority -- are addressed by AdminNetworkPolicy (covered next).
OVN-Kubernetes NetworkPolicy Implementation
When a NetworkPolicy is created in Kubernetes, the OVN-Kubernetes controller translates it into OVN ACLs:
- Watch: The OVN-Kubernetes controller watches the Kubernetes API for NetworkPolicy create/update/delete events
- Translate: The controller converts the NetworkPolicy podSelector, namespaceSelector, and ipBlock rules into OVN Address Sets (named lists of IP addresses) and OVN ACLs (match + action rules)
- Program: The OVN northbound database receives the ACLs, and ovn-northd compiles them into logical flows in the southbound database
- Distribute: ovn-controller on each node reads the southbound database and programs OVS flows on br-int
NetworkPolicy -> OVN ACL Translation:
Kubernetes API OVN Northbound DB OVS (per node)
+-------------------+ +------------------+ +------------------+
| NetworkPolicy: | | Address Set: | | OVS Flow: |
| podSelector: | ---> | "banking-app_ | --> | table=44, |
| role: database| | role_database" | | priority=1001, |
| ingress: | | = {10.128.2.5, | | ip, |
| from: | | 10.128.3.8} | | nw_src=10.128..|
| podSelector:| | | | nw_dst=10.128..|
| role: app | | ACL: | | tp_dst=5432, |
| ports: | | match: "ip4.src| | ct_state=+new, |
| - TCP/5432 | | == $app_ips | | actions= |
+-------------------+ | && tcp.dst == | | ct(commit), |
| 5432" | | resubmit |
| action: allow- | +------------------+
| related |
| priority: 1001 |
+------------------+
The Address Set mechanism is critical for performance. When a pod is added or removed, only the Address Set is updated (a single OVSDB transaction), not every ACL that references pods with that label. This is equivalent to NSX DFW's dynamic security groups -- the group membership changes, but the rules referencing the group remain static.
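The generated objects are visible in the northbound database; a short sketch (names vary per cluster and are not a stable interface):
# Address sets generated from pod/namespace selectors
ovn-nbctl list Address_Set
# Port groups that carry the per-policy ACLs
ovn-nbctl list Port_Group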
AdminNetworkPolicy and BaselineAdminNetworkPolicy
AdminNetworkPolicy (ANP) and BaselineAdminNetworkPolicy (BANP) are the Kubernetes solution to the three most critical NetworkPolicy limitations: no deny rules, no cluster-wide scope, and no priority-based evaluation. These resources are defined in the policy.networking.k8s.io API group (KEP-2091) and are supported in OVN-Kubernetes starting with OpenShift 4.14+.
AdminNetworkPolicy (ANP):
apiVersion: policy.networking.k8s.io/v1alpha1
kind: AdminNetworkPolicy
metadata:
name: cluster-dns-allow # cluster-scoped (no namespace)
spec:
priority: 10 # lower number = higher priority
subject:
namespaces:
matchLabels:
kubernetes.io/metadata.name: banking-app
ingress: [] # no ingress rules (not affected)
egress:
- name: "allow-dns"
action: Allow # NEW: explicit Allow action
to:
- namespaces:
matchLabels:
role: dns-service
ports:
- portNumber:
protocol: UDP
port: 53
- name: "deny-external-db"
action: Deny # NEW: explicit Deny action
to:
- networks:
- 10.99.0.0/16 # deny access to legacy DB subnet
Key ANP features that address NSX DFW gaps:
| Feature | NetworkPolicy | AdminNetworkPolicy | NSX DFW Equivalent |
|---|---|---|---|
| Scope | Namespace | Cluster-wide | DFW rule scope: "DFW" (all) |
| Actions | Implicit allow only | Allow, Deny, Pass | Allow, Drop, Reject |
| Priority | No priority (union) | Numeric priority (0-1000) | Category + rule order |
| Subject | podSelector | namespaces + pods | Applied-To groups |
| Admin control | Any namespace user | Cluster admin only | NSX admin role |
The Pass action is unique to ANP and critical for delegated policy models. When an ANP rule matches with action Pass, it delegates the decision to namespace-level NetworkPolicies. This enables a tiered model:
ANP/BANP Evaluation Order (maps to NSX DFW Categories):
+----------------------------------------------------------+
| Priority 0-99: AdminNetworkPolicy (Emergency/Infra) |
| Equivalent to: NSX DFW Emergency + Infrastructure |
| Action: Allow / Deny / Pass |
| Who manages: Platform team / Security team |
+----------------------------------------------------------+
|
| If no ANP matches, or ANP action = Pass:
v
+----------------------------------------------------------+
| Namespace NetworkPolicy (Application rules) |
| Equivalent to: NSX DFW Application category |
| Action: Implicit allow (union) |
| Who manages: Application team (namespace owner) |
+----------------------------------------------------------+
|
| If no NetworkPolicy selects the pod:
v
+----------------------------------------------------------+
| BaselineAdminNetworkPolicy (Default posture) |
| Equivalent to: NSX DFW default rule at bottom |
| Action: Allow / Deny |
| Who manages: Platform team |
+----------------------------------------------------------+
|
| If no BANP matches:
v
+----------------------------------------------------------+
| Kubernetes default: ALLOW |
+----------------------------------------------------------+
BaselineAdminNetworkPolicy (BANP): A single cluster-scoped resource (only one can exist) that defines the baseline posture. It is evaluated after all ANPs and namespace NetworkPolicies. Think of it as the "default deny all" rule at the bottom of the NSX DFW rule table:
apiVersion: policy.networking.k8s.io/v1alpha1
kind: BaselineAdminNetworkPolicy
metadata:
name: default # only one BANP allowed, name must be "default"
spec:
subject:
namespaces: {} # applies to ALL namespaces
ingress:
- name: "default-deny-ingress"
action: Deny
from:
- namespaces: {} # from any namespace
egress:
- name: "default-deny-egress"
action: Deny
to:
- namespaces: {} # to any namespace
- networks:
- 0.0.0.0/0 # to any external IP
With ANP + BANP, the OVE platform achieves feature parity with the NSX DFW category model:
| NSX DFW Category | Kubernetes Equivalent | Managed By |
|---|---|---|
| Emergency | ANP priority 0-9 | Security team |
| Infrastructure | ANP priority 10-99 | Platform team |
| Environment | ANP priority 100-499 | Platform team |
| Application | Namespace NetworkPolicy | App team |
| Default rule | BaselineAdminNetworkPolicy | Platform team |
Comparison to NSX DFW Categories and Applied-To Scope
NSX DFW's "Applied-To" field controls where a rule is enforced. Options include: the entire DFW (all hosts), specific security groups, specific logical ports, or specific logical switches. This is an optimization -- a rule with Applied-To set to a small group is only programmed on the hosts where those VMs run, reducing the rule set size on most hosts.
In Kubernetes, the equivalent is the podSelector (in NetworkPolicy) or subject (in ANP). OVN-Kubernetes is smart about rule distribution -- it only programs OVN ACLs on the chassis where the selected pods actually run, not on every node in the cluster. This is equivalent to NSX DFW's Applied-To optimization.
Remaining gaps between ANP/NetworkPolicy and NSX DFW:
- L7 context: NSX DFW with Application Rule Manager can create rules based on L7 application signatures (e.g., "allow MySQL protocol, deny raw TCP on port 3306"). Kubernetes NetworkPolicy is L3/L4 only. Cilium's CiliumNetworkPolicy resource supports L7 (HTTP, Kafka, DNS) but is not part of the upstream Kubernetes API.
- FQDN-based rules: NSX DFW can match on DNS names ("allow egress to *.github.com"). This requires DNS snooping. OVN-Kubernetes added experimental FQDN support in OpenShift 4.15+ via EgressFirewall resources, but it is not part of the NetworkPolicy API.
- Rule hit counters in the UI: NSX Manager shows per-rule hit counts, enabling identification of unused rules. OVN ACL counters exist but are not exposed in a comparable management UI.
- Graphical policy builder: NSX Manager provides a drag-and-drop policy editor with group visualization. Kubernetes policies are YAML manifests. Tooling like Calico Enterprise or Red Hat ACS provides a graphical UI, but none match the maturity of NSX Manager's interface.
KubeVirt VMs and NetworkPolicy
KubeVirt VMs run inside pods. From the Kubernetes networking perspective, a VM pod is a pod like any other -- it has an IP address, it is connected to the OVN overlay, and it is subject to NetworkPolicy selection via pod labels.
When a NetworkPolicy selects a KubeVirt VM pod, the OVN ACLs are applied to the VM's OVS port on br-int. All traffic entering or leaving the VM passes through these ACLs. The VM itself is unaware of the policy -- from the guest OS perspective, packets that violate the policy simply never arrive (ingress) or are silently dropped (egress).
# Example: NetworkPolicy for KubeVirt VMs
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: allow-ssh-to-legacy-vms
namespace: legacy-workloads
spec:
podSelector:
matchLabels:
vm.kubevirt.io/name: legacy-db-01 # KubeVirt VM label
policyTypes:
- Ingress
ingress:
- from:
- podSelector:
matchLabels:
role: jump-host
ports:
- protocol: TCP
port: 22
Important consideration: KubeVirt VMs with Multus secondary network interfaces (e.g., a VLAN-attached management interface) are partially outside the scope of Kubernetes NetworkPolicy. NetworkPolicy applies only to the pod's primary OVN interface (the one connected to br-int). Traffic on secondary interfaces (connected via bridge, macvlan, or SR-IOV) bypasses OVN ACLs entirely. For workloads with secondary interfaces, micro-segmentation for the secondary network must be handled at the physical switch/firewall level or via host-level iptables/nftables rules.
6. QoS (Quality of Service)
Traffic Classification and Marking
QoS begins with classifying traffic into categories and marking packets so that network devices along the path can apply differentiated treatment. Two marking mechanisms dominate:
DSCP (Differentiated Services Code Point): A 6-bit field in the IP header's ToS (Type of Service) byte. 64 possible values (codepoints), grouped into per-hop behaviors (PHBs):
| DSCP Value | PHB | Typical Use |
|---|---|---|
| 46 (EF) | Expedited Forwarding | VoIP, real-time |
| 34 (AF41) | Assured Forwarding 4, low drop | Video, interactive |
| 26 (AF31) | Assured Forwarding 3, low drop | Business-critical data |
| 18 (AF21) | Assured Forwarding 2, low drop | Transactional data |
| 10 (AF11) | Assured Forwarding 1, low drop | Bulk data |
| 0 (BE) | Best Effort | Default, everything else |
| 8 (CS1) | Scavenger | Background, low-priority backups |
802.1p / PCP (Priority Code Point): A 3-bit field in the 802.1Q VLAN tag. 8 priority levels (0-7). Operates at Layer 2 -- effective within a single switched domain but lost at L3 router hops unless explicitly mapped to DSCP.
For a virtualized environment, the critical question is: who marks the traffic? Options:
- Guest OS marks DSCP: The VM itself sets the DSCP value. The hypervisor must trust or re-mark it.
- Virtual switch marks DSCP: OVS or VFP classifies traffic by port/IP/protocol and sets DSCP. The guest OS marking is overridden.
- Physical switch marks DSCP: The ToR switch classifies and marks. Only useful for non-overlay traffic (SR-IOV, VLAN-direct).
In OVE, OVS can mark DSCP via flow actions (mod_nw_tos), OVN provides a QoS table on logical switches (DSCP marking and rate limiting), and the OVN-Kubernetes CNI exposes per-pod bandwidth limits through annotations.
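A sketch of both marking mechanisms (bridge, switch, and subnet names are assumptions; the OVS flow example assumes a standalone bridge rather than the OVN-managed br-int):
# Mark DSCP EF (46) with an OVS flow action; mod_nw_tos takes the full ToS byte (46 << 2 = 184)
ovs-ofctl add-flow br0 "priority=100,ip,nw_dst=10.0.200.0/24,actions=mod_nw_tos:184,normal"
# The same marking expressed as an OVN QoS rule on a logical switch
ovn-nbctl qos-add ls-voice from-lport 100 'ip4.dst == 10.0.200.0/24' dscp=46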
Queuing Disciplines
Once traffic is classified and marked, queuing disciplines (qdiscs) on the host determine how packets are dequeued and transmitted:
FIFO (First In, First Out): The default. No differentiation. Simple, but high-priority traffic waits behind bulk transfers.
Priority Queuing (prio): Multiple queues with strict priority. Higher-priority queues are drained completely before lower-priority queues. Risk: priority inversion -- bulk traffic in a low-priority queue starves if the high-priority queue is always non-empty.
WFQ (Weighted Fair Queuing): Each queue gets a weighted share of bandwidth. A queue with weight 50% gets at least 50% of bandwidth, even when all queues are busy. Surplus bandwidth is shared proportionally.
HTB (Hierarchical Token Bucket): The most commonly used qdisc in Linux. Organizes queues in a hierarchy with guaranteed rates (floor) and maximum rates (ceiling). Child classes borrow bandwidth from parents when available.
HTB Configuration for Host Traffic Separation:
Root (total link: 25 Gbps)
+--------------------------------------------------------+
| |
| Class 1: VM Traffic Class 2: Storage Class 3: Management
| Rate: 15 Gbps Rate: 8 Gbps Rate: 2 Gbps
| Ceil: 25 Gbps Ceil: 15 Gbps Ceil: 5 Gbps
| (can burst to 25G (can burst to 15G (can burst to 5G
| if others idle) when VM is light) during migration)
| |
| Sub-classes: DSCP: AF41 DSCP: CS6
| +----+----+----+ |
| |High|Med |Low | |
| |EF |AF21|BE | |
| |3G |7G |5G | |
| +----+----+----+ |
+--------------------------------------------------------+
Linux TC (Traffic Control)
Linux TC is the kernel framework for traffic shaping, scheduling, and policing. It consists of three components:
- qdisc (queuing discipline): Attached to a network interface, defines how packets are enqueued and dequeued. Each interface has a root qdisc.
- class: Subdivisions within a classful qdisc (like HTB). Each class has its own rate limits and priority.
- filter: Matches packets and assigns them to classes. Filters can match on any packet header field, DSCP, flow label, or even eBPF program result.
# Example: HTB configuration for separating VM, storage, and management traffic
# 1. Add root HTB qdisc on the bond interface
tc qdisc add dev bond0 root handle 1: htb default 30
# 2. Root class (total bandwidth)
tc class add dev bond0 parent 1: classid 1:1 htb rate 25gbit
# 3. VM traffic class (guaranteed 15 Gbps, burst to 25 Gbps)
tc class add dev bond0 parent 1:1 classid 1:10 htb rate 15gbit ceil 25gbit
# 4. Storage traffic class (guaranteed 8 Gbps, burst to 15 Gbps)
tc class add dev bond0 parent 1:1 classid 1:20 htb rate 8gbit ceil 15gbit
# 5. Management traffic class (guaranteed 2 Gbps, burst to 5 Gbps)
tc class add dev bond0 parent 1:1 classid 1:30 htb rate 2gbit ceil 5gbit
# 6. Classify by DSCP
tc filter add dev bond0 parent 1: protocol ip prio 1 \
u32 match ip tos 0xb8 0xfc flowid 1:10 # EF (DSCP 46) -> VM high-prio
tc filter add dev bond0 parent 1: protocol ip prio 2 \
u32 match ip tos 0x88 0xfc flowid 1:20 # AF41 (DSCP 34) -> Storage
tc filter add dev bond0 parent 1: protocol ip prio 3 \
u32 match ip tos 0xc0 0xfc flowid 1:30 # CS6 (DSCP 48) -> Management
OVS QoS
OVS provides two QoS mechanisms:
Ingress policing: Rate-limits traffic arriving at an OVS port (typically a VM's veth interface). Excess traffic is dropped. This prevents a single VM from consuming all host bandwidth.
# Rate-limit VM port to 1 Gbps; burst sized at roughly 10% of the rate per OVS guidance
ovs-vsctl set interface veth-vm1 ingress_policing_rate=1000000    # kbps (1 Gbps)
ovs-vsctl set interface veth-vm1 ingress_policing_burst=100000    # kbit (~100 Mbit burst)
Egress shaping: Uses Linux HTB queues on OVS ports for traffic shaping (as opposed to hard policing). Provides smoother bandwidth limiting with burst accommodation.
# Create QoS record for 500 Mbps with queues
ovs-vsctl set port veth-vm1 qos=@newqos -- \
--id=@newqos create qos type=linux-htb \
other-config:max-rate=500000000 \
queues:0=@q0 queues:1=@q1 -- \
--id=@q0 create queue other-config:min-rate=100000000 \
other-config:max-rate=500000000 -- \
--id=@q1 create queue other-config:min-rate=400000000 \
other-config:max-rate=500000000
QoS for Traffic Separation
In a converged infrastructure where VM traffic, storage traffic (iSCSI, NFS, Ceph), live migration traffic, and management traffic share the same physical NICs, QoS is essential for preventing any single traffic type from starving the others:
| Traffic Type | Priority | DSCP | Min Bandwidth | Max Bandwidth |
|---|---|---|---|---|
| Management (API, SSH, monitoring) | Highest | CS6 (48) | 1 Gbps | 5 Gbps |
| Storage (Ceph, iSCSI) | High | AF41 (34) | 8 Gbps | 15 Gbps |
| Live Migration | Medium | AF31 (26) | 0 (on-demand) | 10 Gbps |
| VM production traffic | Normal | AF21 (18) | 10 Gbps | 25 Gbps |
| Backup, replication | Low | CS1 (8) | 1 Gbps | 10 Gbps |
VMware vSphere handles this via Network I/O Control (NIOC), which classifies traffic by type (vMotion, vSAN, management, VM) and applies shares and reservations. The OVE equivalent requires manual configuration of Linux TC and OVS QoS, or automation via MachineConfig and custom operators. This is an operational maturity gap -- NSX/vSphere NIOC is integrated and UI-driven; OVE's QoS is CLI-driven and requires expertise in Linux TC.
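A minimal sketch of what that automation could look like: a MachineConfig that installs a systemd unit on worker nodes to apply the HTB classes from the earlier example at boot. The interface name, rates, and the object/unit names are assumptions to adapt, not a definitive implementation:
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  name: 99-worker-qos-htb                  # illustrative name
  labels:
    machineconfiguration.openshift.io/role: worker
spec:
  config:
    ignition:
      version: 3.2.0
    systemd:
      units:
      - name: host-qos.service             # illustrative unit name
        enabled: true
        contents: |
          [Unit]
          Description=Apply HTB traffic classes on bond0
          After=network-online.target
          Wants=network-online.target
          [Service]
          Type=oneshot
          RemainAfterExit=yes
          ExecStart=/usr/sbin/tc qdisc replace dev bond0 root handle 1: htb default 30
          ExecStart=/usr/sbin/tc class replace dev bond0 parent 1: classid 1:1 htb rate 25gbit
          ExecStart=/usr/sbin/tc class replace dev bond0 parent 1:1 classid 1:10 htb rate 15gbit ceil 25gbit
          ExecStart=/usr/sbin/tc class replace dev bond0 parent 1:1 classid 1:20 htb rate 8gbit ceil 15gbit
          ExecStart=/usr/sbin/tc class replace dev bond0 parent 1:1 classid 1:30 htb rate 2gbit ceil 5gbit
          [Install]
          WantedBy=multi-user.target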
Kubernetes Pod QoS Classes
Kubernetes assigns QoS classes to pods based on their resource requests and limits. While this is primarily a scheduling and eviction mechanism, it affects networking indirectly:
| QoS Class | Criteria | Eviction Priority | Network Implication |
|---|---|---|---|
| Guaranteed | requests == limits (CPU and memory) | Last to be evicted | Highest scheduling priority, often used for critical VMs |
| Burstable | requests < limits | Middle | Standard workloads |
| BestEffort | No requests or limits | First to be evicted | Lowest priority, may experience resource contention |
For KubeVirt VMs, setting explicit CPU and memory requests/limits equal to each other (Guaranteed QoS class) ensures the VM pod is not evicted under memory pressure and receives dedicated CPU time. This does not directly affect network QoS, but it prevents the VM from being killed during resource contention -- which looks like a network outage from the VM's perspective.
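A hedged fragment of a KubeVirt VirtualMachine that lands in the Guaranteed class by setting requests equal to limits; the name and sizes are illustrative, and disks/interfaces are omitted for brevity:
apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
  name: core-banking-vm          # illustrative name
spec:
  running: true
  template:
    spec:
      domain:
        resources:
          requests:
            cpu: "4"
            memory: 16Gi
          limits:                # equal to requests -> Guaranteed QoS class
            cpu: "4"
            memory: 16Gi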
7. VPN / IPsec Tunneling
IPsec Architecture
IPsec (Internet Protocol Security) is a suite of protocols for securing IP communications through authentication and encryption. For a financial enterprise, IPsec is the standard mechanism for encrypted site-to-site connectivity between data centers, DR sites, and cloud environments.
The IPsec framework consists of three core components:
IKE (Internet Key Exchange): The control-plane protocol that negotiates cryptographic parameters and establishes security associations. IKE runs on UDP port 500 (and port 4500 for NAT traversal).
SA (Security Association): A unidirectional agreement between two peers on the encryption algorithm, authentication method, key material, and lifetime. Each data flow requires two SAs -- one for each direction. SAs are stored in the Security Association Database (SAD).
ESP (Encapsulating Security Payload): The data-plane protocol that encrypts and authenticates packets. ESP is IP protocol 50. It provides confidentiality (encryption), integrity (authentication), and anti-replay (sequence numbers).
AH (Authentication Header): An older data-plane protocol (IP protocol 51) that provides integrity and authentication without encryption. AH is rarely used today because ESP with authentication provides equivalent integrity protection plus confidentiality. AH also has the disadvantage of authenticating the IP header, which breaks NAT. For all practical purposes, modern IPsec means IKE + ESP.
IKE Phases
IPsec Tunnel Establishment (IKEv2):
Initiator (DC-A) Responder (DC-B)
10.0.1.1:500 10.0.2.1:500
| |
| IKE_SA_INIT (Message 1) |
| - SA proposal (AES-256-GCM, SHA-384, DH-20) |
| - Key exchange (DH public value) |
| - Nonce (Ni) |
+--------------------------------------------->|
| |
| IKE_SA_INIT (Message 2) |
| - SA accepted (AES-256-GCM, SHA-384, DH-20) |
| - Key exchange (DH public value) |
| - Nonce (Nr) |
|<---------------------------------------------+
| |
| Both sides now compute: |
| SKEYSEED = prf(Ni | Nr, DH-shared-secret) |
| SK_d, SK_ai, SK_ar, SK_ei, SK_er, SK_pi, SK_pr
| (derived key material for IKE SA) |
| |
| IKE_AUTH (Message 3) -- encrypted with SK_ei |
| - Identity (ID_i) |
| - Certificate (or PSK auth) |
| - AUTH payload (signature over messages 1-2) |
| - SA proposal for Child SA (ESP transform) |
| - Traffic Selectors (TSi: 10.1.0.0/16) |
+--------------------------------------------->|
| |
| IKE_AUTH (Message 4) -- encrypted with SK_er |
| - Identity (ID_r) |
| - Certificate (or PSK auth) |
| - AUTH payload |
| - SA accepted for Child SA |
| - Traffic Selectors (TSr: 10.2.0.0/16) |
|<---------------------------------------------+
| |
| IPsec tunnel established. |
| Child SA (ESP) active for: |
| 10.1.0.0/16 <-> 10.2.0.0/16 |
  |  SPI_out=0xABCD1234   SPI_in=0xEF015678     |
| Cipher: AES-256-GCM |
| Lifetime: 3600s / 1GB (rekey before) |
| |
| Data flow (ESP-encrypted): |
| [IP HDR: 10.0.1.1->10.0.2.1][ESP HDR: SPI |
| =0xABCD1234][Encrypted: original IP packet |
| 10.1.x.x->10.2.x.x][ESP Auth Tag] |
+--------------------------------------------->|
| |
IKEv1 vs IKEv2: IKEv1 uses two phases with multiple modes (Main Mode + Quick Mode for identity protection, or Aggressive Mode for faster setup). IKEv2 (RFC 7296) simplifies this to a 4-message exchange (IKE_SA_INIT + IKE_AUTH) that establishes both the IKE SA and the first Child SA (IPsec SA) in a single round-trip pair. IKEv2 also adds built-in NAT traversal (UDP encapsulation on port 4500), MOBIKE (endpoint mobility), and improved dead peer detection. All new deployments should use IKEv2.
Tunnel Mode vs Transport Mode
Tunnel mode: The entire original IP packet is encapsulated inside a new IP header and encrypted. The outer IP header has the tunnel endpoints (e.g., gateway IPs), and the inner IP header has the actual source and destination. This is the standard mode for site-to-site VPNs.
Tunnel Mode Encapsulation:
Original packet:
[IP HDR: 10.1.5.10 -> 10.2.8.20] [TCP HDR] [Payload]
After IPsec tunnel mode encryption:
[New IP HDR: 10.0.1.1 -> 10.0.2.1] [ESP HDR: SPI, Seq#]
[Encrypted: [IP HDR: 10.1.5.10 -> 10.2.8.20] [TCP HDR] [Payload]]
[ESP Auth Tag (ICV)]
The physical network sees: 10.0.1.1 -> 10.0.2.1, protocol ESP
The actual source/destination (10.1.5.10 / 10.2.8.20) are encrypted
Transport mode: Only the payload is encrypted; the original IP header is preserved (but authenticated). Used for host-to-host encryption (e.g., encrypting OVN GENEVE tunnel traffic between nodes). Lower overhead than tunnel mode because no additional IP header is added.
Transport Mode Encapsulation:
Original packet:
[IP HDR: 10.0.1.1 -> 10.0.2.1] [UDP HDR: GENEVE] [Inner packet]
After IPsec transport mode encryption:
[IP HDR: 10.0.1.1 -> 10.0.2.1] [ESP HDR: SPI, Seq#]
[Encrypted: [UDP HDR: GENEVE] [Inner packet]]
[ESP Auth Tag (ICV)]
The original IP header is preserved, only the payload is encrypted
Useful for: encrypting overlay (GENEVE) traffic between hosts
IPsec for Site-to-Site Connectivity
For a Tier-1 financial enterprise, site-to-site IPsec is used in several scenarios:
- Data center interconnect: Connecting primary and secondary data centers over a WAN link. All east-west traffic between the two sites is encrypted.
- DR site connectivity: Encrypted replication traffic to the disaster recovery site.
- Cloud connectivity: Connecting on-premises data centers to cloud provider virtual networks (e.g., Azure ExpressRoute with IPsec overlay for additional encryption, or direct IPsec VPN as a backup path).
- Branch office / partner connectivity: Connecting branch offices or business partner networks to the data center.
Key design considerations for high-throughput site-to-site IPsec:
- AES-NI hardware acceleration: Modern x86 CPUs (Intel since Westmere, AMD since Bulldozer) include AES-NI instructions that accelerate AES encryption/decryption by 5-10x. With AES-NI, a single CPU core can encrypt at 10+ Gbps with AES-256-GCM. Without AES-NI, IPsec becomes a significant CPU bottleneck.
- Multiple SA parallelism: A single IPsec SA (a single encryption context) is processed serially. To scale beyond 10-15 Gbps, multiple SAs must be used in parallel, each handled by a different CPU core. This can be achieved through multiple tunnel interfaces or by using IPsec offload to the NIC.
- MTU and fragmentation: IPsec tunnel mode adds 50-73 bytes of overhead (IP + ESP + IV + padding + ICV). If the original packet is at the interface MTU, IPsec encapsulation causes fragmentation. The recommended practice is to reduce the tunnel interface MTU to account for IPsec overhead (e.g., MTU 1400 for a 1500-byte underlay MTU with tunnel mode).
- DPD (Dead Peer Detection): IKEv2 includes liveness checks (empty INFORMATIONAL exchange) to detect when a peer is unreachable. DPD interval should be set short enough to detect failures quickly (10-30 seconds) but not so short that it causes unnecessary rekeying on flaky WAN links.
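For concreteness, the sketch below shows how the DC-A to DC-B tunnel from the IKEv2 diagram could be expressed in a strongSwan swanctl.conf. Addresses, certificate names, identities, and lifetimes are illustrative assumptions; treat this as a starting point to validate, not a definitive configuration:
connections {
  dc-a-to-dc-b {
    version = 2                               # IKEv2 only
    local_addrs  = 10.0.1.1
    remote_addrs = 10.0.2.1
    proposals = aes256gcm16-prfsha384-ecp384  # IKE SA: AES-256-GCM, SHA-384 PRF, DH group 20
    dpd_delay = 30s                           # liveness check interval
    local {
      auth = pubkey
      certs = dc-a-gateway.pem                # illustrative certificate
    }
    remote {
      auth = pubkey
      id = "CN=dc-b-gateway"                  # illustrative peer identity
    }
    children {
      dc-interconnect {
        local_ts  = 10.1.0.0/16               # traffic selector, DC-A side
        remote_ts = 10.2.0.0/16               # traffic selector, DC-B side
        esp_proposals = aes256gcm16-ecp384    # Child SA cipher with PFS
        rekey_time = 3600s                    # rekey before SA lifetime expires
        start_action = start                  # bring the tunnel up immediately
        dpd_action = restart
      }
    }
  }
}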
WireGuard as a Modern Alternative
WireGuard is a modern VPN protocol that has been part of the Linux kernel since 5.6 (2020). It was designed as a simpler, faster, more auditable alternative to IPsec:
| Aspect | IPsec (IKEv2 + ESP) | WireGuard |
|---|---|---|
| Codebase | ~400,000 lines (Linux kernel IPsec + strongSwan) | ~4,000 lines (kernel module) |
| Cipher choice | Negotiated (many options) | Fixed: ChaCha20-Poly1305 (or AES-256-GCM with hardware support) |
| Key exchange | IKE (complex protocol) | Noise Protocol Framework (1-RTT) |
| Configuration | Complex (ipsec.conf, swanctl.conf, certificates) | Simple (wg genkey, wg set, single config file) |
| Performance | 3-5 Gbps (single SA, software) | 5-10 Gbps (single interface, software) |
| NAT traversal | UDP encapsulation (port 4500) | UDP natively (configurable port) |
| Statefulness | Stateful (SA negotiation, rekeying) | Stateless (cryptokey routing) |
| Audit surface | Large (cipher agility, mode negotiation, legacy compatibility) | Small (fixed ciphers, minimal protocol) |
WireGuard's main advantage is simplicity. Its main disadvantage in enterprise environments is the lack of IKE -- key distribution must be handled out-of-band (manual, or via a management layer like Tailscale/Headscale). For site-to-site VPN between data centers, this is acceptable (keys can be distributed via configuration management). For dynamic, certificate-based authentication with many peers, IPsec's IKE is more mature.
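To illustrate the simplicity argument, this is roughly the entire site-to-site configuration on one gateway in WireGuard (keys abbreviated, addresses and port illustrative):
# /etc/wireguard/wg0.conf on the DC-A gateway (illustrative)
[Interface]
PrivateKey = <DC-A private key>            # generated with: wg genkey
Address = 172.16.0.1/30                    # tunnel interior addressing (illustrative)
ListenPort = 51820

[Peer]
PublicKey = <DC-B public key>
Endpoint = 10.0.2.1:51820                  # DC-B gateway
AllowedIPs = 10.2.0.0/16, 172.16.0.2/32    # cryptokey routing: prefixes reachable via this peer
PersistentKeepalive = 25                   # keep NAT/firewall state alive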
For overlay tunnel encryption (encrypting GENEVE traffic between hosts in OVE), WireGuard's simplicity and kernel-native implementation make it an attractive option. However, OVN's IPsec support (using strongSwan with transport-mode ESP) is the officially supported path in OpenShift.
VPN in Each Platform
NSX IPsec: NSX Edge nodes provide full IPsec VPN functionality (route-based and policy-based) with IKEv1/IKEv2, multiple cipher suites, and certificate or PSK authentication. NSX supports both site-to-site IPsec and route-based VPN with BGP over IPsec (for dynamic routing over encrypted tunnels). The NSX Edge is the VPN termination point, and its throughput is determined by the Edge VM sizing (or bare-metal Edge for higher throughput).
OVN IPsec: OVN supports encrypting the GENEVE tunnel traffic between nodes using IPsec in transport mode. This is configured cluster-wide and encrypts all east-west overlay traffic. Implementation uses strongSwan (IKEv2) and the Linux kernel's xfrm framework. In OpenShift 4.14+, this is enabled via the OVNKubernetes network operator configuration. Note: this is tunnel encryption (encrypting overlay traffic between hosts), not site-to-site VPN. For site-to-site VPN connectivity in OVE, an external VPN gateway is required -- either a dedicated router/firewall VM, a physical VPN appliance, or a dedicated VPN operator (e.g., strongSwan running as a pod).
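As a sketch, enabling overlay encryption is a small change to the cluster network operator configuration; the exact field shape (the mode field appears around OpenShift 4.14) should be validated against the release documentation:
# Enable IPsec for east-west GENEVE traffic (OpenShift 4.14+ style)
oc patch networks.operator.openshift.io cluster --type=merge \
  -p '{"spec":{"defaultNetwork":{"ovnKubernetesConfig":{"ipsecConfig":{"mode":"Full"}}}}}'

# Verify that the IPsec daemonset has rolled out on all nodes
oc get daemonset -n openshift-ovn-kubernetes | grep ipsec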
Azure Local VPN Gateway: Azure Local can deploy an Azure VPN Gateway as a VM that provides site-to-site IPsec VPN with IKEv2 and BGP. This integrates with Azure's VPN Gateway service for hybrid connectivity. The VPN Gateway supports multiple tunnels, traffic selectors, and certificate-based authentication. Azure Local also supports ExpressRoute for dedicated, private WAN connectivity (not IPsec-encrypted unless an additional IPsec overlay is configured).
Swisscom ESC S2S VPN: Swisscom provides managed site-to-site VPN connectivity as part of the ESC service. The customer specifies the VPN endpoints, traffic selectors, and authentication method, and Swisscom provisions the tunnel. The customer has limited visibility into the VPN configuration details (managed service model). VPN throughput depends on the contracted service level.
How the Candidates Handle This
Comparison Table
| Capability | VMware (NSX) | OVE (OVN/eBPF) | Azure Local | Swisscom ESC |
|---|---|---|---|---|
| DVR | NSX DR on every transport node + SR on Edge VMs/bare-metal; mature, proven at scale | OVN logical routers distributed on all nodes by default; gateway chassis for external traffic; equivalent functionality, different management model | Hyper-V Network Virtualization (HNV) with distributed routing via VFP; routing on every host, gateway for external | Managed by Swisscom; customer has no visibility into routing architecture |
| VRF / Network Isolation | NSX logical routers provide VRF-equivalent isolation; VRF Lite on physical ToR switches | OVN logical routers for L3 isolation; Linux VRF for host-level separation; secondary networks via Multus for hard isolation | VRF on Hyper-V vSwitch via VFP; Azure Stack HCI logical networks provide isolation | Managed network segmentation via Swisscom provisioning; customer requests segments |
| eBPF | Not available (ESXi uses vmkernel datapath, not Linux eBPF) | Full Linux eBPF stack available; OVN-Kubernetes uses OVS flows (not eBPF today, but migration in progress); Cilium available as secondary CNI | Not available (Hyper-V uses VFP, not Linux eBPF) | Not available (managed service) |
| Micro-segmentation | NSX DFW: mature, multi-category policy model, identity-based groups, L7 context, graphical UI, per-rule logging | OVN ACLs via NetworkPolicy + AdminNetworkPolicy; cluster-scoped policies; L3/L4 only (L7 via Cilium); YAML-driven, GitOps-friendly | VFP ACLs via Network Controller; NSG-like rules (similar to Azure cloud NSGs); limited identity integration | Managed firewall rules via Swisscom; customer defines rules in service portal; limited granularity |
| Network Policy Scope | DFW rules scope to any, security group, logical switch, or port | NetworkPolicy (namespace), AdminNetworkPolicy (cluster), BaselineAdminNetworkPolicy (cluster default) | NSG-like rules scoped to subnets or VM NICs | Service-portal-defined rules scoped to customer-visible segments |
| Policy Management UI | NSX Manager: graphical rule editor, rule search, hit counters, flow visualization | Kubernetes API (kubectl/oc), YAML manifests, GitOps; graphical options via Red Hat ACS, Calico Enterprise | Windows Admin Center, Azure Portal (limited), PowerShell | Swisscom service portal |
| QoS | NIOC (Network I/O Control): per-traffic-type shares and reservations, UI-driven | Linux TC + OVS QoS: manually configured or via MachineConfig; no integrated UI equivalent to NIOC | QoS via Network ATC (intent-based classification for storage, management, compute traffic) | Managed by Swisscom; customer has no QoS controls |
| VPN / IPsec | NSX Edge: full IPsec VPN (site-to-site, route-based, IKEv1/v2, BGP-over-IPsec) | OVN IPsec for overlay encryption (transport mode, inter-node only); site-to-site VPN requires external gateway (strongSwan pod, physical appliance) | Azure VPN Gateway VM: site-to-site IPsec with IKEv2, BGP; ExpressRoute for private WAN | Managed S2S VPN via Swisscom; customer specifies endpoints and traffic selectors |
| Overlay Encryption | NSX supports GENEVE encryption between transport nodes | OVN IPsec (strongSwan, transport mode ESP on GENEVE tunnels); supported in OpenShift 4.14+ | Not available by default on Azure Local; WAN traffic assumed on private ExpressRoute or VPN | Managed by Swisscom |
Key Differences in Prose
Micro-segmentation maturity: NSX DFW is the most mature micro-segmentation solution in the comparison. It has 10+ years of enterprise deployment experience, a polished graphical management interface, identity-based dynamic security groups integrated with vCenter and Active Directory, L7 application awareness, per-rule hit counters, and a multi-category policy model that maps naturally to enterprise security team workflows. OVE's combination of NetworkPolicy + AdminNetworkPolicy achieves functional parity for L3/L4 policy enforcement, but the operational experience is different: policies are YAML manifests managed via GitOps, there is no integrated graphical rule builder (third-party tools like Red Hat ACS or Calico Enterprise fill this gap), and L7 context requires Cilium's CiliumNetworkPolicy extension. For a team accustomed to NSX Manager's drag-and-drop policy editor, the transition to YAML + GitOps is a significant workflow change -- but it is arguably a better model for audit compliance (version-controlled, reviewable, CI/CD-testable policies).
AdminNetworkPolicy is the key to DFW parity. Without AdminNetworkPolicy, Kubernetes NetworkPolicy cannot match NSX DFW's category-based, priority-ordered, cluster-scoped policy model. With ANP + BANP, the functional gap closes significantly. The platform team can define emergency, infrastructure, and environment rules at the cluster level (ANP), delegate application-specific rules to namespace owners (NetworkPolicy), and enforce a default-deny baseline (BANP). This mirrors the NSX DFW category model. ANP is supported in OVN-Kubernetes starting with OpenShift 4.14, but it is still an evolving API (v1alpha1 as of 2025). The team should validate ANP behavior and maturity during the PoC.
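A hedged sketch of how an NSX "Infrastructure"-category rule could land as an ANP, using the v1alpha1 API; the name, priority, selectors, and ports are illustrative:
apiVersion: policy.networking.k8s.io/v1alpha1
kind: AdminNetworkPolicy
metadata:
  name: infrastructure-allow-dns        # illustrative name
spec:
  priority: 20                          # evaluated before higher-numbered ANPs and all NetworkPolicies
  subject:
    namespaces: {}                      # applies to workloads in every namespace
  egress:
  - name: allow-dns-to-openshift-dns
    action: Allow
    to:
    - namespaces:
        matchLabels:
          kubernetes.io/metadata.name: openshift-dns
    ports:
    - portNumber:
        protocol: UDP
        port: 53
    - portNumber:
        protocol: TCP
        port: 53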
eBPF is an OVE-only advantage. Neither VMware (ESXi kernel) nor Azure Local (Hyper-V/Windows kernel) can leverage eBPF. OVE's Linux foundation means the organization gains access to the entire eBPF ecosystem -- Cilium for enhanced policy enforcement, Hubble for L7 observability, bpftrace for ad-hoc debugging, and future eBPF-based OVN datapaths. This is a strategic technology advantage, not an immediate functional one -- OVN flows work well today -- but it positions the platform for ongoing innovation.
QoS is an operational gap in OVE. VMware's NIOC provides an integrated, UI-driven QoS management experience for traffic separation. OVE requires manual Linux TC and OVS QoS configuration, or custom automation via MachineConfig. Azure Local's Network ATC provides a middle ground -- intent-based classification (storage, management, compute) that is simpler than raw TC but less granular than NIOC. For an organization converging storage, VM, migration, and management traffic on shared NICs, the lack of an integrated QoS management tool in OVE is a day-2 operational concern that should be addressed during platform engineering.
VPN/IPsec: NSX Edge vs external gateway. NSX Edge provides a fully integrated VPN gateway with IKEv2, BGP, and management through NSX Manager. In OVE, site-to-site VPN is not an integrated platform feature -- it requires deploying an external VPN gateway (strongSwan pod, physical appliance, or cloud VPN service). OVN IPsec covers overlay encryption (node-to-node), but not site-to-site connectivity. This is a valid architectural concern for multi-datacenter deployments. Azure Local's VPN Gateway VM provides a middle ground -- integrated but less feature-rich than NSX Edge.
Swisscom ESC abstraction level. Swisscom ESC abstracts away all routing, security, and VPN infrastructure. The customer consumes pre-defined network segments, firewall rules via a service portal, and managed VPN connectivity. This is the simplest operational model but the least flexible. Workloads requiring custom micro-segmentation rules, advanced QoS, or eBPF-based observability cannot run on ESC. The ESC model is appropriate for standard enterprise workloads that fit within Swisscom's predefined security templates.
Key Takeaways
- AdminNetworkPolicy + BaselineAdminNetworkPolicy is the critical enabler for NSX DFW migration. Without ANP/BANP, Kubernetes NetworkPolicy lacks the cluster-scoped, priority-ordered, deny-capable policy model that NSX DFW provides. With ANP/BANP, the functional gap is manageable. Validate ANP behavior, performance, and OVN-Kubernetes integration maturity in the PoC. This is the single most important feature to validate for the security team.
- Default deny must be implemented from day one. In both NSX and Kubernetes, the default behavior is "allow all." The default-deny posture must be explicitly configured -- via NSX DFW default drop rule or via BaselineAdminNetworkPolicy with Deny action. Migrating 5,000+ VMs to OVE without implementing default deny is a security regression that will not pass FINMA audit.
- Micro-segmentation policy migration is the hardest part of the platform migration. The technical migration of VMs is relatively straightforward (disk image conversion, network adapter reconfiguration). The migration of thousands of NSX DFW rules to Kubernetes NetworkPolicy + ANP is a policy-by-policy translation effort that requires deep understanding of both policy models, extensive testing, and careful cutover planning. Budget significant effort for this.
- DVR is functionally equivalent across all platforms. OVN DVR, NSX DVR, and HNV distributed routing all achieve the same result: local routing on every host, centralized gateway for external traffic. The implementation details differ, but the performance characteristics are comparable. This is not a differentiator.
- eBPF is a strategic advantage for OVE but not an immediate migration requirement. OVN flows handle NetworkPolicy enforcement effectively today. eBPF (via Cilium or future OVN eBPF datapath) provides better scalability at extreme rule counts, O(1) policy evaluation, and deep observability. For an organization with 5,000+ VMs, the eBPF scalability advantage becomes meaningful when rule counts exceed ~10,000 entries per node. Evaluate whether the current rule count justifies Cilium deployment during the PoC.
- QoS requires proactive engineering in OVE. VMware NIOC handles traffic separation automatically. OVE does not. If the organization converges storage, VM, live migration, and management traffic on shared NICs (which is the standard OVE deployment model with 2x 25 GbE bonds), Linux TC and OVS QoS must be explicitly configured to prevent storage or migration traffic from starving VM connectivity. Automate this via MachineConfig day-1.
- Site-to-site VPN is not an integrated OVE feature. Plan for an external VPN gateway solution (physical appliance, strongSwan pod, or cloud VPN service) if multi-datacenter encrypted connectivity is required. OVN IPsec covers overlay encryption (node-to-node within the cluster) but does not provide site-to-site VPN functionality.
- KubeVirt VMs with Multus secondary interfaces are partially outside NetworkPolicy scope. Traffic on secondary network interfaces (VLAN-direct, bridge, macvlan, SR-IOV) is not filtered by OVN ACLs. If these interfaces carry sensitive traffic, additional micro-segmentation controls (physical firewall, host-level iptables, or Multus-aware CNI policies) must be deployed. This is a common gap in OVE deployments that is often discovered late.
- VRF-level isolation in OVE requires architectural planning. The default OVE model (single cluster network, namespace-based soft isolation) does not provide VRF-equivalent hard isolation. For PCI DSS or FINMA-mandated network separation, use OVN secondary networks with separate logical routers, dedicated node pools, or physical network segmentation. Define the isolation requirements before designing the OVE network architecture.
Discussion Guide
The following questions target routing and security capabilities during vendor workshops, SME deep-dives, and PoC validation sessions. They are designed to test whether the vendor or SME has actual production experience with micro-segmentation, NetworkPolicy, and IPsec in real enterprise deployments -- not just slide-deck familiarity.
1. NSX DFW Rule Migration Strategy
"We have 3,200 NSX DFW rules across five categories (Emergency, Infrastructure, Environment, Application, Default). Walk us through the migration strategy to Kubernetes NetworkPolicy + AdminNetworkPolicy. How do we map the NSX category model to ANP priorities? How do we handle NSX rules that use dynamic security groups populated by vCenter tags? What about rules with L7 Application signatures? What is the expected timeline for migrating this rule set?"
Purpose: Tests the vendor's understanding of the DFW-to-NetworkPolicy mapping and realistic migration planning. The correct answer should include: (1) Emergency/Infrastructure/Environment categories map to ANPs with non-overlapping priority bands (e.g., 0-9/10-99/100-499); Application category maps to namespace NetworkPolicies; Default rule maps to BANP. (2) vCenter tag-based security groups must be replaced with Kubernetes labels on VM pods -- this requires a label taxonomy design and automated label assignment during VM migration. (3) L7 rules cannot be expressed in NetworkPolicy; they require CiliumNetworkPolicy or application-level controls (TLS mutual auth, API gateway rules). (4) Migration timeline for 3,200 rules: 3-6 months minimum with dedicated security engineering resources, including rule consolidation (many NSX rules are redundant or obsolete), testing, and phased cutover.
2. AdminNetworkPolicy Maturity and Limitations
"AdminNetworkPolicy is still v1alpha1 in upstream Kubernetes. What is the maturity level in the OpenShift version you are proposing? Have you deployed ANP in production with more than 500 policies? What are the known limitations? How does ANP priority evaluation interact with namespace NetworkPolicies? Can you demonstrate the Pass action with a concrete example?"
Purpose: Tests real-world ANP experience. The answer should address: OpenShift 4.14+ supports ANP via OVN-Kubernetes with tech-preview stability; OpenShift 4.16+ targets GA. Known limitations include: limited tooling for policy visualization, no ANP-specific hit counters in OVN (logging is available), and ANP updates trigger OVN ACL recompilation which may cause brief policy enforcement gaps on large clusters. The Pass action delegates to namespace NetworkPolicy -- if an ANP with priority 10 matches with action Pass, evaluation continues to lower-priority ANPs, then to namespace NetworkPolicies, then to BANP.
3. Default Deny Implementation and Impact
"We need to implement default deny across the entire cluster for FINMA compliance. Walk us through the implementation. What happens to existing workloads when we enable default deny? How do we ensure DNS, monitoring, and infrastructure services continue to work? What is the rollback plan if the default deny breaks production?"
Purpose: Tests operational readiness for zero-trust deployment. The correct answer: implement BANP with Deny action for all traffic, then create ANPs for infrastructure services (DNS, monitoring, kube-apiserver, ingress controller, OVN internal traffic) before enabling the BANP. The safest approach: (1) deploy all infrastructure ANPs first, (2) deploy application NetworkPolicies for known flows, (3) enable BANP in "audit mode" (if available -- OVN ACL logging) to identify traffic that would be denied, (4) resolve all identified gaps, (5) switch BANP to enforcing mode. Rollback: delete the BANP resource, which immediately reverts to Kubernetes default-allow behavior. Critical gotcha: OVN internal traffic (ovnkube-node, ovnkube-controller communications) must be explicitly allowed, or the cluster control plane breaks.
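A sketch of the BaselineAdminNetworkPolicy that implements the cluster-wide default deny described above; per the v1alpha1 API the resource is cluster-scoped and must be named default, and the rule names are illustrative:
apiVersion: policy.networking.k8s.io/v1alpha1
kind: BaselineAdminNetworkPolicy
metadata:
  name: default                 # the API allows exactly one BANP, named "default"
spec:
  subject:
    namespaces: {}              # all namespaces
  ingress:
  - name: default-deny-ingress
    action: Deny
    from:
    - namespaces: {}            # denied unless an ANP or NetworkPolicy allows the flow first
  egress:
  - name: default-deny-egress
    action: Deny
    to:
    - namespaces: {}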
4. NetworkPolicy for KubeVirt VMs with Multus
"We have KubeVirt VMs with two network interfaces: a primary OVN interface and a secondary VLAN interface via Multus bridge CNI. How does NetworkPolicy apply to this VM? Is the VLAN interface protected by NetworkPolicy? If not, how do we enforce micro-segmentation on the VLAN traffic? What about VMs with SR-IOV secondary interfaces?"
Purpose: Tests understanding of the Multus + NetworkPolicy interaction. The correct answer: NetworkPolicy applies only to the primary OVN interface (the pod's cluster network attachment). Traffic on the Multus secondary interface bypasses OVN ACLs entirely. For VLAN traffic micro-segmentation, options include: (1) physical firewall/ACLs on the ToR switch for the VLAN, (2) OVN secondary networks (using Multus with OVN as the secondary CNI, which does support ACLs), (3) in-guest firewall rules (iptables/firewalld inside the VM). SR-IOV interfaces are even further outside NetworkPolicy scope -- traffic goes directly from the VF to the physical NIC, bypassing OVS entirely. For SR-IOV-attached VMs, micro-segmentation must be handled at the physical switch or via eSwitch ACLs (if using switchdev mode).
5. eBPF and Cilium Evaluation
"Should we deploy Cilium alongside OVN-Kubernetes for enhanced micro-segmentation and observability? What are the trade-offs? Can Cilium replace OVN-Kubernetes as the primary CNI in OVE? What is the operational overhead of running two networking stacks? How does Hubble compare to our existing NSX Intelligence deployment for flow visualization?"
Purpose: Tests architectural judgment on eBPF strategy. The answer should include: Cilium can run alongside OVN-Kubernetes as a secondary CNI (via Multus) but cannot replace it in OVE -- OVN-Kubernetes is the supported and tested CNI for OpenShift. Running two CNIs adds operational complexity (two control planes, two sets of flow tables, two troubleshooting paths). The recommended approach: use OVN-Kubernetes for all NetworkPolicy enforcement (it handles the DFW replacement use case effectively), and evaluate Cilium Hubble specifically for observability if NSX Intelligence-equivalent flow visualization is a requirement. Hubble provides per-flow verdict logging, service dependency maps, DNS query logging, and HTTP request tracing -- comparable to NSX Intelligence for L3/L4 and superior for L7 (with CiliumNetworkPolicy).
6. QoS and Traffic Separation
"In VMware, we use NIOC to separate storage (vSAN), vMotion, management, and VM traffic with guaranteed bandwidth reservations. How do we achieve equivalent traffic separation in OVE? What happens if storage traffic (Ceph) consumes all bandwidth during a recovery event and starves VM production traffic? Show us the specific configuration."
Purpose: Tests practical QoS implementation knowledge. The answer should demonstrate: Linux TC with HTB qdisc on the bond interface, with classes for storage (Ceph OSD), live migration (KubeVirt virt-handler), management (API server, monitoring), and VM traffic. DSCP marking at the OVS level for traffic classification. The specific risk: Ceph recovery after an OSD failure generates enormous rebalancing traffic that, without QoS, can saturate the entire network link. The TC configuration must guarantee minimum bandwidth for VM production traffic even during Ceph recovery. The answer should also mention that this is a day-1 configuration requirement, not a day-2 optimization.
7. IPsec and Multi-Datacenter Connectivity
"We operate two data centers (active-active) with NSX Federation providing stretched networking and cross-site DFW policy synchronization. How do we replicate this in OVE? What replaces NSX Federation for cross-site policy management? How is the inter-DC traffic encrypted? What is the failover behavior when one site loses connectivity?"
Purpose: Tests multi-datacenter architecture knowledge. The correct answer: OVE does not have a direct equivalent to NSX Federation. Cross-site connectivity options include: (1) Submariner (Red Hat's multi-cluster networking project) for cross-cluster service discovery and L3 connectivity with IPsec encryption, (2) site-to-site VPN gateway connecting the two clusters' external networks, (3) Red Hat Advanced Cluster Management (ACM) for cross-cluster policy distribution (pushing NetworkPolicy and ANP resources to both clusters via GitOps). The critical gap: NSX Federation's stretched logical switch (L2 stretch across sites) is not replicated in OVE -- workloads are expected to be L3-routable across sites, not L2-adjacent. This may require application architecture changes for workloads that depend on L2 adjacency across sites.
8. Overlay Encryption Performance
"We are considering enabling OVN IPsec to encrypt all east-west overlay traffic for compliance. What is the performance impact? How many Gbps of encrypted throughput can we expect per node with AES-NI? Does IPsec encryption add latency to every packet? Is there a hardware offload option? What is the operational overhead (certificate management, rekeying)?"
Purpose: Tests IPsec operational readiness. The answer should include: with AES-256-GCM and AES-NI, a single core can encrypt at 10-15 Gbps; with multi-core parallelism (Linux xfrm uses per-CPU SAs), aggregate throughput of 40-80 Gbps per node is achievable on modern CPUs. Latency impact: ~5-15 us per packet for encryption/decryption (added to the existing OVS processing latency). Hardware offload: inline IPsec offload is available on ConnectX-6 Dx and newer NICs (offloads ESP processing to NIC hardware, reducing CPU overhead to near-zero and latency to ~1-2 us). Certificate management: OVN IPsec uses self-signed certificates managed by the ovn-ipsec daemonset, with automatic rotation. For enterprise PKI integration (using the organization's CA), the certificate issuance and rotation process must be customized -- this is an operational integration task.
9. Micro-segmentation Audit and Compliance
"Our auditors require a complete, point-in-time snapshot of all security policies, evidence that they are enforced, and a change history. How do we provide this in OVE? In NSX, we export DFW rules via API and provide flow logs as enforcement evidence. What is the equivalent?"
Purpose: Tests audit compliance capability. The answer should include: (1) Policy snapshot: oc get networkpolicy,adminnetworkpolicy,baselineadminnetworkpolicy -A -o yaml provides a complete export of all policies. If policies are managed via GitOps (ArgoCD, Flux), the Git repository itself is the version-controlled audit trail with full change history, author attribution, and approval records. (2) Enforcement evidence: OVN ACL logging (ovn-nbctl set ACL <uuid> log=true) generates syslog entries for every packet that matches an ACL. These logs can be shipped to the SIEM for audit. (3) Flow logs: OVN flow export via IPFIX or Hubble (if Cilium is deployed) provides flow-level visibility comparable to NSX flow logs. (4) Change history: GitOps commit log provides the who/what/when/why for every policy change -- arguably superior to NSX Manager's audit log because it includes the approval workflow context (PR reviews, CI test results).
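As a sketch of the enforcement-evidence piece, OVN-Kubernetes exposes ACL verdict logging per namespace via an annotation; the namespace name and severity levels below are illustrative:
# Enable ACL logging for deny and allow verdicts in a namespace
oc annotate namespace payments \
  k8s.ovn.org/acl-logging='{"deny": "alert", "allow": "notice"}'

# Matching flows are then written to the ovn-controller ACL audit log on each node,
# from where they can be shipped to the SIEM as enforcement evidence.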
10. VPN Gateway Architecture for OVE
"We need site-to-site IPsec VPN between our OVE cluster and three remote sites (DR site, partner network, cloud VPC). In NSX, we use NSX Edge VPN. What is the equivalent in OVE? Do we deploy strongSwan as a pod? Do we use a physical appliance? How do we handle BGP over IPsec for dynamic routing? What is the HA model for the VPN gateway?"
Purpose: Tests VPN design in an OVE context. The correct answer: OVE does not include an integrated VPN gateway. Options: (1) Physical VPN appliance (Cisco, Palo Alto, Fortinet) at the data center edge -- simplest, well-understood, but outside the Kubernetes management plane. (2) strongSwan pod with host networking (running as a DaemonSet on dedicated gateway nodes) -- provides IKEv2, BGP (via FRR sidecar), and HA via keepalived or MetalLB for VIP management. (3) VPN operator (e.g., kube-router with BGP, or third-party operators). For BGP over IPsec: deploy FRRouting (FRR) as a sidecar container in the VPN pod, establish BGP sessions over the IPsec tunnel, and inject learned routes into the cluster network. HA model: active-passive with VRRP (keepalived) or active-active with ECMP (multiple gateway pods, each with its own IPsec tunnel to the remote site, physical router load-balancing across them).
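For option (2), a structural sketch of the strongSwan gateway DaemonSet is shown below; the image, namespace, node label, and secret name are assumptions to adapt, and the FRR sidecar and VIP management are omitted for brevity:
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: s2s-vpn-gateway                    # illustrative name
  namespace: vpn-system                    # illustrative namespace
spec:
  selector:
    matchLabels:
      app: s2s-vpn-gateway
  template:
    metadata:
      labels:
        app: s2s-vpn-gateway
    spec:
      hostNetwork: true                    # terminate IKE/ESP on the node's own IP
      nodeSelector:
        node-role.kubernetes.io/vpn-gateway: ""   # dedicated gateway nodes (illustrative label)
      containers:
      - name: strongswan
        image: registry.example.com/strongswan:latest   # placeholder image
        securityContext:
          capabilities:
            add: ["NET_ADMIN"]             # required to install xfrm policies and SAs on the host
        volumeMounts:
        - name: swanctl-conf
          mountPath: /etc/swanctl
      volumes:
      - name: swanctl-conf
        secret:
          secretName: swanctl-conf         # swanctl.conf plus certificates (illustrative)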