Modern datacenters and beyond

Routing & Security

Why This Matters

The previous chapters covered the data plane -- how packets move between VMs through overlays (OVN/GENEVE), physical fabrics (spine-leaf, LACP, ECMP), and advanced paths (SR-IOV, DPDK). But moving packets is only half the problem. The other half is deciding where packets go (routing) and whether packets are allowed (security). In a VMware environment with NSX, these two concerns are handled by the NSX Distributed Router (DR) for routing and the NSX Distributed Firewall (DFW) for micro-segmentation. The DFW is the single most operationally significant NSX feature for a Tier-1 financial enterprise -- it enforces zero-trust segmentation policies across every VM, and FINMA compliance depends on it.

This chapter covers the technologies that replace, extend, or re-implement these capabilities in OVE, Azure Local, and Swisscom ESC. The scope is wide because routing and security intersect at every layer:

  1. DVR (Distributed Virtual Routing) -- how routing decisions happen locally on every hypervisor instead of through a central chokepoint
  2. VRF (Virtual Routing and Forwarding) -- how multiple isolated routing tables coexist on a single node for tenant separation
  3. eBPF (Extended Berkeley Packet Filter) -- the programmable kernel technology that is replacing iptables for network policy enforcement, observability, and runtime security
  4. Micro-segmentation -- the security model that replaces perimeter firewalls with per-workload policies (the NSX DFW replacement)
  5. Network Policies (Kubernetes) -- the Kubernetes-native mechanism for micro-segmentation, including the new AdminNetworkPolicy resource that addresses key DFW feature gaps
  6. QoS (Quality of Service) -- traffic classification, prioritization, and bandwidth management for storage, migration, and management traffic separation
  7. VPN / IPsec Tunneling -- encrypted site-to-site connectivity for multi-datacenter and DR scenarios

The micro-segmentation and Network Policy sections are the most critical in this chapter. For a financial enterprise, the ability to enforce granular, auditable, per-workload security policies is not a feature -- it is a regulatory requirement. NSX DFW provides this today with a mature, well-understood policy model. The replacement must deliver equivalent or superior capability, or the migration is blocked.


Concepts

1. DVR (Distributed Virtual Routing)

The Problem with Centralized Routing

In a traditional network architecture, routing between subnets is performed by a centralized device -- a physical router, a firewall, or a dedicated routing VM. Every packet that crosses a subnet boundary must traverse the network to reach the router, get a forwarding decision, and traverse the network back to the destination. This creates three problems:

Hairpinning: Two VMs on the same physical host, in different subnets, must send their traffic up to the centralized router and back down to the same host. The packet traverses the physical network twice for no reason.

Single bottleneck: All inter-subnet traffic funnels through the router. At 5,000+ VMs with hundreds of subnets, the centralized router becomes a bandwidth and CPU bottleneck. Even high-end physical routers struggle with the aggregate east-west traffic of a large virtualized environment.

North-south-only optimization: Centralized routers are positioned at the north-south boundary (data center to external). East-west traffic (VM to VM within the data center) was historically a small fraction of total traffic. In modern virtualized environments, east-west traffic is 70-80% of all traffic. The architecture is optimized for the wrong traffic pattern.

Centralized Routing -- The Hairpin Problem:

  Host A                         Central Router          Host A (same host!)
  +------------------+          +-------------+          +------------------+
  | VM-1 (10.1.1.5)  |          |  Routing    |          | VM-2 (10.2.1.8)  |
  | Subnet: 10.1.1/24|---+      |  Table:     |      +---| Subnet: 10.2.1/24|
  +------------------+   |      |  10.1.1/24  |      |   +------------------+
                         +----->|  10.2.1/24  |------+
                         leaf   |  10.3.1/24  |  leaf
                         switch |  default gw |  switch
                         uplink +-------------+  uplink

  Problem: VM-1 -> VM-2 traffic crosses the physical network TWICE
           even though both VMs are on the same host.

           Path: VM-1 -> leaf -> spine -> router -> spine -> leaf -> VM-2
           Hops: 6 (should be 0 -- same host)
           Latency: ~500-1000 us (should be ~10-50 us)

DVR Concept: Every Hypervisor Hosts the Routing Function

Distributed Virtual Routing (DVR) solves this by placing the routing function on every hypervisor. Instead of a centralized router making forwarding decisions, each host has a local instance of the logical router that can route between subnets directly. East-west traffic between VMs on the same host never touches the physical network. East-west traffic between VMs on different hosts takes the shortest path -- directly between the two hosts, without a detour through a centralized router.

Distributed Virtual Routing -- Local Routing:

  Host A                                        Host B
  +------------------------------------------+  +--------------------------+
  |                                          |  |                          |
  |  VM-1 (10.1.1.5)        VM-2 (10.2.1.8)  |  |  VM-3 (10.3.1.12)        |
  |       |                       |          |  |       |                  |
  |       v                       v          |  |       v                  |
  |  +----+----+             +----+----+     |  |  +----+----+             |
  |  | ls-sub1 |             | ls-sub2 |     |  |  | ls-sub3 |             |
  |  +----+----+             +----+----+     |  |  +----+----+             |
  |       |                       |          |  |       |                  |
  |       +----------+------------+          |  |       |                  |
  |                  |                       |  |       |                  |
  |           +------+------+                |  |  +----+--------+         |
  |           | Distributed |                |  |  | Distributed |         |
  |           |   Router    |                |  |  |   Router    |         |
  |           | (local copy)|                |  |  | (local copy)|         |
  |           +------+------+                |  |  +----+--------+         |
  |                  |                       |  |       |                  |
  +------------------+-----------------------+  +-------+------------------+
                     |                                  |
                     +--------- GENEVE Tunnel ----------+
                       (only for cross-host traffic)

  Case 1: VM-1 -> VM-2 (same host, different subnet)
    Path: VM-1 -> local DVR -> VM-2
    Hops: 0 physical hops
    Latency: ~10-50 us (OVS flow processing only)

  Case 2: VM-1 -> VM-3 (different host, different subnet)
    Path: VM-1 -> local DVR -> GENEVE tunnel -> remote DVR -> VM-3
    Hops: 1 physical hop (direct host-to-host)
    Latency: ~50-200 us (OVS + encap/decap + wire)

How OVN Implements DVR

OVN (Open Virtual Network), the control plane for OVS in OVE, implements DVR natively. Every OVN logical router is distributed across all nodes by default -- there is no "centralized mode" to accidentally configure.

Logical Router Distributed Component: When an OVN logical router is created, every chassis (hypervisor) that hosts VMs connected to that router's subnets gets a local copy of the router's forwarding pipeline. This is implemented as OVS flows on br-int. The logical router does not exist as a separate process or VM -- it is a set of OpenFlow rules in the OVS flow table on each node.

Gateway Nodes for External Traffic: While east-west routing is fully distributed, north-south traffic (to/from external networks) must exit through a physical uplink. OVN designates specific nodes as gateway chassis for each logical router. External traffic (default route, SNAT) is forwarded through GENEVE tunnels to the gateway chassis, which performs SNAT/DNAT and sends the traffic out its physical interface. Multiple gateway chassis can be configured for redundancy (active-passive failover via BFD).

OVN DVR flow example:

# On a host with VMs in subnets 10.1.1.0/24 and 10.2.1.0/24:
# OVS flow table on br-int shows local routing entries:

# 1. ARP responder for router port on subnet 1 (gateway MAC)
table=12, priority=100, arp, arp_tpa=10.1.1.1, arp_op=1
    actions=move:NXM_OF_ETH_SRC->NXM_OF_ETH_DST,
            mod_dl_src:fa:16:3e:aa:bb:01,
            load:2->NXM_OF_ARP_OP,
            move:NXM_NX_ARP_SHA->NXM_NX_ARP_THA,
            load:0xfa163eaabb01->NXM_NX_ARP_SHA,
            move:NXM_OF_ARP_SPA->NXM_OF_ARP_TPA,
            load:0x0a010101->NXM_OF_ARP_SPA,
            IN_PORT

# 2. Routing decision: packets from subnet 1 destined for subnet 2
table=17, priority=100, ip, nw_dst=10.2.1.0/24, metadata=0x3
    actions=mod_dl_src:fa:16:3e:aa:bb:02,   # router MAC for subnet 2
            dec_ttl,
            mod_dl_dst:<destination VM MAC>,
            load:0x4->NXM_NX_REG15,          # output port for destination
            resubmit(,32)                     # continue to egress pipeline

# 3. External traffic -> forward to gateway chassis via tunnel
table=17, priority=50, ip, metadata=0x3
    actions=mod_dl_src:fa:16:3e:aa:bb:03,  # MAC of the gateway router port
            dec_ttl,
            load:0x1->NXM_NX_REG15,          # tunnel port to gateway chassis
            resubmit(,32)
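The configuration that produces these flows can be sketched with ovn-nbctl. This is an illustrative sequence, not this environment's actual topology -- all names (lr0, ls-sub1, chassis-gw1) and MAC/IP values are placeholders:

```shell
# Create a logical router -- distributed across all chassis by default
ovn-nbctl lr-add lr0

# Attach router ports for the two subnets (these become the gateway
# MACs/IPs that the ARP-responder flows answer for on every host)
ovn-nbctl lrp-add lr0 lrp-sub1 fa:16:3e:aa:bb:01 10.1.1.1/24
ovn-nbctl lrp-add lr0 lrp-sub2 fa:16:3e:aa:bb:02 10.2.1.1/24

# Patch logical switch ls-sub1 to the router (repeat for ls-sub2)
ovn-nbctl lsp-add ls-sub1 ls-sub1-to-lr0
ovn-nbctl lsp-set-type ls-sub1-to-lr0 router
ovn-nbctl lsp-set-addresses ls-sub1-to-lr0 router
ovn-nbctl lsp-set-options ls-sub1-to-lr0 router-port=lrp-sub1

# North-south: pin the external router port to gateway chassis.
# Higher priority wins; the second entry is the BFD-monitored standby.
ovn-nbctl lrp-add lr0 lrp-ext fa:16:3e:aa:bb:03 203.0.113.2/24
ovn-nbctl lrp-set-gateway-chassis lrp-ext chassis-gw1 20
ovn-nbctl lrp-set-gateway-chassis lrp-ext chassis-gw2 10
```

Note the asymmetry this encodes: east-west routing needs no placement decision at all, while only the external port carries a gateway-chassis binding.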

How NSX Implements DVR

VMware NSX uses the same conceptual split with different terminology: the distributed component is the Distributed Router (DR), instantiated on every transport node (the counterpart of OVN's per-chassis router pipeline), while centralized services -- SNAT/DNAT, north-south routing, edge firewall, load balancing -- run in the Service Router (SR), placed on dedicated Edge nodes (the counterpart of OVN's gateway chassis).

The key architectural difference: in OVN, the gateway chassis is a regular worker node with an external uplink. In NSX, the Edge node is a dedicated, purpose-built component with its own resource pool, sizing requirements, and failure domain. NSX Edge sizing mistakes (under-provisioned CPU, memory, or uplink bandwidth) are a common cause of north-south bottlenecks in production.

Performance Benefit

The performance benefit of DVR is dramatic for east-west traffic:

Traffic Pattern               Centralized Routing        DVR
Same host, same subnet        ~5 us (L2 switching)       ~5 us (same)
Same host, different subnet   ~500-1000 us (hairpin)     ~10-50 us (local OVS flows)
Cross-host, different subnet  ~500-1500 us (via router)  ~50-200 us (direct tunnel)
North-south (external)        ~200-500 us                ~200-500 us (via gateway, same)

For east-west traffic, DVR eliminates the centralized router as a latency source and bandwidth bottleneck. For north-south traffic, the architecture is the same -- both approaches require traffic to traverse a gateway/edge node. The difference is that with DVR, this only applies to external traffic, not to every inter-subnet packet.


2. VRF (Virtual Routing and Forwarding)

Concept: Multiple Routing Tables on a Single Device

VRF is a technology that creates multiple independent routing table instances on a single router, switch, or Linux host. Each VRF instance has its own routing table, its own ARP table, and its own set of interfaces. Traffic in one VRF cannot reach another VRF without an explicit route leak or inter-VRF forwarding policy.

Think of VRFs as VLANs for Layer 3. VLANs isolate broadcast domains at Layer 2. VRFs isolate routing domains at Layer 3. A packet in VRF "tenant-A" uses tenant-A's routing table, sees tenant-A's default gateway, and can only reach destinations within tenant-A's routing table -- even if tenant-B uses the same IP address range on the same physical device.

VRF Isolation:

  Single Linux Host / Router
  +------------------------------------------------------------------+
  |                                                                  |
  |  VRF: tenant-a                    VRF: tenant-b                  |
  |  +---------------------------+    +---------------------------+  |
  |  | Routing Table:            |    | Routing Table:            |  |
  |  |   10.0.0.0/8 -> eth1.100  |    |   10.0.0.0/8 -> eth1.200  |  |
  |  |   default -> 10.0.0.1     |    |   default -> 10.0.0.1     |  |
  |  |                           |    |                           |  |
  |  | ARP Table:                |    | ARP Table:                |  |
  |  |   10.0.1.5 -> aa:bb:cc:01 |    |   10.0.1.5 -> dd:ee:ff:01 |  |
  |  |                           |    |                           |  |
  |  | Interfaces: eth1.100      |    | Interfaces: eth1.200      |  |
  |  +---------------------------+    +---------------------------+  |
  |                                                                  |
  |  Both tenants use 10.0.0.0/8 -- no conflict because separate     |
  |  routing tables. Packets in tenant-a can NEVER reach tenant-b    |
  |  unless explicitly configured (route leaking).                   |
  +------------------------------------------------------------------+

VRF in Linux

Linux has native VRF support since kernel 4.3 (2015), implemented as a network device type (l3mdev). A VRF device is a "routing context" that enslaves other network interfaces. All traffic arriving on enslaved interfaces is processed using the VRF's routing table, not the default table.

# Create a VRF device for tenant-a, using routing table 100
ip link add vrf-tenant-a type vrf table 100
ip link set vrf-tenant-a up

# Enslave an interface to the VRF
ip link set eth1.100 master vrf-tenant-a

# Add routes in the VRF's routing table
ip route add 10.0.0.0/8 dev eth1.100 table 100
ip route add default via 10.0.0.1 table 100

# Verify routing table isolation
ip route show vrf vrf-tenant-a
# Shows only tenant-a routes

ip route show
# Shows only main/default table routes -- tenant-a routes are invisible

# Bind a socket to a VRF (application isolation)
# setsockopt(fd, SOL_SOCKET, SO_BINDTODEVICE, "vrf-tenant-a", ...)
# Or run a process in a VRF context:
ip vrf exec vrf-tenant-a ping 10.0.1.5
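Route leaking -- the one sanctioned way across the VRF boundary -- is itself just a routing-table entry pointing at an interface enslaved to the other VRF. A hedged sketch continuing the example above; the shared-services prefix 172.16.0.0/24 and the next-hop address are illustrative assumptions:

```shell
# Leak a single prefix from tenant-b into tenant-a's table (table 100).
# tenant-a can now reach 172.16.0.0/24 via tenant-b's interface, while
# the overlapping 10.0.0.0/8 space remains fully isolated.
ip route add 172.16.0.0/24 via 10.0.0.254 dev eth1.200 table 100

# Verify: the leaked route appears only in tenant-a's VRF table
ip route show vrf vrf-tenant-a
```

Because the leak is per-prefix and explicit, it doubles as audit evidence: every permitted cross-tenant path is a visible line in the table, not an implicit side effect.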

The l3mdev (Layer 3 Master Device) framework is the kernel mechanism that redirects routing lookups from the default FIB (Forwarding Information Base) to the VRF-specific FIB. When a packet arrives on an interface enslaved to a VRF, the kernel's ip_route_input() function uses the VRF's table ID for the lookup instead of the main table. This is implemented via the l3mdev_fib_table() callback in the network device structure.

VRF Lite vs MPLS VRF

VRF Lite: VRF without a signaling protocol. Each VRF is locally significant -- the router at each end must be manually configured with matching VRF definitions and interface assignments. Traffic between sites in the same VRF is carried over dedicated VLAN subinterfaces or GRE tunnels. This is simple and sufficient for small-scale tenant separation (dozens of VRFs).

MPLS VRF (BGP/MPLS IP VPNs, RFC 4364): VRF with BGP/MPLS VPN signaling. VRF membership, route targets, and route distinguishers are exchanged via MP-BGP between PE (Provider Edge) routers. Traffic is MPLS-encapsulated with VPN labels. This is the carrier-grade approach for large-scale multi-tenant networks (thousands of VRFs, hundreds of sites). The complexity is significantly higher, but the automation and scalability are essential at scale.

For this organization, VRF Lite is the likely deployment model. The number of network segments (tens to low hundreds) does not justify MPLS complexity, and the OVN overlay provides equivalent isolation at the virtual network layer.

How VRFs Map to OVN Logical Routers and Kubernetes Namespaces

In OVN, each logical router is conceptually a VRF -- it has its own routing table and its own set of connected logical switches. Two logical routers can use overlapping IP ranges without conflict because their forwarding tables are independent.

In Kubernetes, namespaces provide a soft isolation boundary. OVN-Kubernetes assigns each namespace a subnet from the cluster network CIDR, and the per-node logical router handles routing between namespace subnets. However, by default, all namespace subnets share a single routing context -- there is no VRF-level isolation. To achieve true VRF-like isolation, Kubernetes relies on NetworkPolicies (covered later in this chapter) rather than separate routing tables.

For workloads that require genuine L3 isolation (e.g., PCI DSS cardholder data environment separated from general corporate workloads), the approach in OVE is:

  1. Secondary networks via Multus: Attach VMs to dedicated OVN secondary networks with their own logical router, achieving VRF-equivalent isolation at the OVN level
  2. NetworkPolicies: Default-deny policies that enforce isolation even within the shared cluster network
  3. Node isolation: Dedicated node pools with separate physical network segments for the most sensitive workloads
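Approach 1 can be sketched as a NetworkAttachmentDefinition for an OVN-Kubernetes secondary network. The field names follow the OVN-Kubernetes multi-homing convention and should be verified against the deployed version; the namespace, network name, and subnet are illustrative placeholders:

```shell
kubectl apply -f - <<'EOF'
apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
  name: pci-isolated
  namespace: cardholder-env
spec:
  config: |
    {
      "cniVersion": "0.4.0",
      "name": "pci-isolated",
      "type": "ovn-k8s-cni-overlay",
      "topology": "layer2",
      "subnets": "192.168.50.0/24",
      "netAttachDefName": "cardholder-env/pci-isolated"
    }
EOF
```

Workloads attached to this network get interfaces on a dedicated OVN logical topology -- the VRF-equivalent isolation described above -- while their primary interface stays on the cluster network.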

3. eBPF (Extended Berkeley Packet Filter)

Evolution from BPF to eBPF

The original BPF (Berkeley Packet Filter), now called cBPF (classic BPF), was created in 1992 at Lawrence Berkeley National Laboratory. Its purpose was narrow: efficiently filter network packets in the kernel so that tools like tcpdump could capture specific traffic without copying every packet to userspace. cBPF programs are small, register-based bytecode programs that run in a kernel virtual machine with two registers (A and X), a scratch memory area, and a limited instruction set (load, store, jump, arithmetic).

eBPF (Extended BPF), introduced in Linux 3.18 (2014) by Alexei Starovoitov, expanded this concept from a packet filter into a general-purpose, safe, in-kernel programmable execution environment. The "extended" part is an understatement -- eBPF is to cBPF what a modern CPU is to a calculator.

Aspect            cBPF (classic)                   eBPF (extended)
Registers         2 (A, X)                         11 64-bit registers (R0-R10; R10 is a read-only frame pointer)
Instruction set   ~30 instructions                 ~100+ instructions, function calls
Program size      4,096 instructions max           1 million verified instructions
Data structures   Fixed scratch memory (16 slots)  Maps (hash, array, LRU, ring buffer, etc.)
Attach points     Socket filter only               XDP, TC, tracepoints, kprobes, cgroups, LSM, etc.
Helper functions  None                             200+ kernel helper functions
JIT compilation   Optional, limited                Mandatory on most architectures
Tail calls        No                               Yes (up to 33 chain depth)
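The classic BPF VM is still easy to inspect today: tcpdump compiles its filter expression to cBPF and can dump the resulting bytecode with -d. The interface name is a placeholder, and the command needs permission to open the capture device:

```shell
# Dump the cBPF program generated for a filter expression; each line is
# one instruction of the two-register VM (ld/jeq/ret ...)
tcpdump -d -i eth0 'tcp port 80'
```

Comparing this output with a bpftool dump of a modern eBPF program (shown later) makes the scale of the "extension" concrete.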

eBPF Architecture

eBPF Architecture Overview:

  Userspace                                    Kernel
  +----------------------------+              +-------------------------------------+
  |                            |              |                                     |
  |  eBPF Program (C)         |              |                                     |
  |  +----------------------+ |              |                                     |
  |  | #include <bpf/bpf.h> | |              |                                     |
  |  | SEC("xdp")           | |              |                                     |
  |  | int prog(ctx) {      | |              |                                     |
  |  |   ...                 | |              |                                     |
  |  |   return XDP_PASS;   | |              |                                     |
  |  | }                     | |              |                                     |
  |  +----------+-----------+ |              |                                     |
  |             |              |              |                                     |
  |     Clang/LLVM compile    |              |                                     |
  |     to eBPF bytecode      |              |                                     |
  |             |              |              |                                     |
  |             v              |              |                                     |
  |  +----------------------+ |   bpf()      |  +-------------------------------+  |
  |  | ELF object file      | |   syscall    |  |         eBPF Verifier         |  |
  |  | (.o with BPF section)| +------------->|  |                               |  |
  |  +----------------------+ |              |  | 1. DAG check (no loops)       |  |
  |                            |              |  | 2. Simulate all paths         |  |
  |  +----------------------+ |              |  | 3. Type checking (BTF)        |  |
  |  | Loader (libbpf,      | |              |  | 4. Memory bounds checking     |  |
  |  |  bpftool, cilium-    | |              |  | 5. Verify helper call args    |  |
  |  |  agent)              | |              |  | 6. Stack depth <= 512 bytes   |  |
  |  +----------+-----------+ |              |  | 7. Max 1M verified insns      |  |
  |             |              |              |  +---------------+---------------+  |
  |             |   map ops    |              |                  | PASS              |
  |             +<------------>+              |                  v                  |
  |                            |              |  +-------------------------------+  |
  |  +----------------------+ |              |  |      JIT Compiler             |  |
  |  | Userspace tools:     | |              |  | eBPF bytecode -> native x86   |  |
  |  |   bpftool            | |              |  | (or ARM64, s390x, etc.)       |  |
  |  |   cilium monitor     | |              |  +---------------+---------------+  |
  |  |   hubble             | |              |                  |                  |
  |  |   bpftrace           | |              |                  v                  |
  |  +----------------------+ |              |  +-------------------------------+  |
  |                            |              |  |     eBPF Program (native)     |  |
  +----------------------------+              |  |     attached to hook point:   |  |
                                              |  |                               |  |
                                              |  |  XDP ---------> NIC driver    |  |
                                              |  |  TC  ---------> qdisc         |  |
                                              |  |  socket ------> socket ops    |  |
                                              |  |  kprobe ------> kernel func   |  |
                                              |  |  tracepoint --> trace event   |  |
                                              |  |  cgroup ------> cgroup hook   |  |
                                              |  |  LSM ---------> security hook |  |
                                              |  +-------------------------------+  |
                                              |                                     |
                                              |  +-------------------------------+  |
                                              |  |         eBPF Maps             |  |
                                              |  | (shared data between programs |  |
                                              |  |  and between kernel/user)     |  |
                                              |  |                               |  |
                                              |  |  Hash Map   Array   LRU Hash  |  |
                                              |  |  Per-CPU    Ring Buffer       |  |
                                              |  |  LPM Trie   Stack Trace      |  |
                                              |  |  Sockmap    DevMap (XDP)     |  |
                                              |  +-------------------------------+  |
                                              +-------------------------------------+

The Verifier is the critical safety mechanism. Every eBPF program must pass the verifier before it can be loaded into the kernel. The verifier performs static analysis to guarantee that the program:

  1. Terminates: The control flow graph is a DAG (directed acyclic graph) -- no backward jumps, no unbounded loops (bounded loops were added in kernel 5.3 with a provable iteration limit).
  2. Is memory-safe: Every memory access is bounds-checked. Pointer arithmetic is tracked and constrained. The program cannot read or write arbitrary kernel memory.
  3. Does not leak kernel pointers: Scalar values derived from kernel pointers are tracked and cannot be returned to userspace (preventing KASLR bypass).
  4. Has bounded stack usage: The eBPF stack is limited to 512 bytes per program (tail calls can extend this, but each frame is bounded).
  5. Calls only allowed helpers: Each program type has a specific set of allowed helper functions. An XDP program cannot call bpf_probe_read() (a tracing helper), and a tracing program cannot call bpf_redirect() (a networking helper).

The verifier simulates every possible execution path of the program, tracking the type and range of every register and every stack slot. For a complex Cilium network policy program, verification can take millions of simulated instructions and hundreds of milliseconds. The 1-million-instruction complexity limit is per-verification (not per-runtime), and recent kernels (5.18+) allow partial verification of subprograms to reduce complexity.

JIT Compilation: After verification, the eBPF bytecode is compiled to native machine code by the JIT compiler. On x86_64, eBPF instructions map nearly 1:1 to native instructions (eBPF was designed for this). JIT-compiled eBPF programs run at near-native speed -- the overhead compared to hand-written C code in the kernel is typically <5%.
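Both stages are observable from userspace with bpftool on any system with loaded eBPF programs. These are standard bpftool subcommands; the program ID 42 is an illustrative example, and the commands require root:

```shell
# List loaded programs with type, load time, and translated size
bpftool prog show

# Dump the verifier-checked (translated) bytecode of program ID 42
bpftool prog dump xlated id 42

# Dump the JIT-compiled native machine code of the same program
bpftool prog dump jited id 42
```

Diffing the xlated and jited output of the same program illustrates the near-1:1 instruction mapping claimed above.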

Program Types: XDP, TC, Socket, Tracing

XDP (eXpress Data Path): XDP programs attach to the network driver's receive path and execute before the kernel allocates an sk_buff (socket buffer). This is the fastest possible interception point -- the program operates on the raw DMA buffer (the xdp_buff structure) with the packet data still in the driver's ring buffer. XDP programs can return:

Return Code    Action                                          Use Case
XDP_PASS       Continue normal kernel processing               Default, allowed traffic
XDP_DROP       Drop the packet immediately                     DDoS mitigation, firewall deny
XDP_TX         Transmit back out the same NIC                  Load balancer reply, reflection
XDP_REDIRECT   Redirect to another NIC, CPU, or AF_XDP socket  L3 forwarding, container redirect
XDP_ABORTED    Drop with error trace                           Debugging
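A minimal sketch of the build-and-attach workflow for these return codes, using generic mode so it works without driver support. The interface eth0 and file names are placeholders; it assumes clang with a BPF target, kernel headers, and root:

```shell
# Smallest possible XDP program: pass every packet unchanged
cat > xdp_pass.c <<'EOF'
#include <linux/bpf.h>

__attribute__((section("xdp"), used))
int xdp_pass(struct xdp_md *ctx)
{
    return XDP_PASS;   /* change to XDP_DROP for a one-line packet filter */
}
EOF

# Compile to eBPF bytecode and attach in generic (skb) mode
clang -O2 -g -target bpf -c xdp_pass.c -o xdp_pass.o
ip link set dev eth0 xdpgeneric obj xdp_pass.o sec xdp

# Detach
ip link set dev eth0 xdpgeneric off
```

Switching `xdpgeneric` to `xdp` requests the native driver hook instead, which fails cleanly if the driver lacks XDP support.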

XDP can operate in three modes:

  1. Native mode: the program runs inside the NIC driver's receive path, before sk_buff allocation. Requires driver support (most modern drivers have it); this is the full-performance mode.
  2. Generic mode (xdpgeneric): the kernel runs the program after sk_buff allocation, in the common receive path. Works with any driver but forfeits most of the performance benefit -- useful for development and testing.
  3. Offloaded mode: the program is compiled onto the NIC hardware itself (supported by a small set of SmartNICs), consuming zero host CPU but with tighter limits on program complexity and usable map types.

TC (Traffic Control): TC eBPF programs attach to the Linux Traffic Control layer, specifically to the clsact qdisc -- a pseudo-qdisc that does no queueing and exists only to expose ingress and egress classifier/action hooks. TC programs operate on sk_buff (fully allocated socket buffers), giving them access to more packet metadata than XDP but running later in the processing pipeline. TC programs can be attached at both ingress and egress.

Packet Processing Pipeline with eBPF Hook Points:

  NIC Hardware
       |
       v
  +----------+
  | NIC      |  XDP runs HERE (before sk_buff allocation)
  | Driver   |  (xdp_buff -- raw DMA buffer)
  +----+-----+
       |
       | sk_buff allocation
       v
  +----------+
  | TC       |  TC ingress eBPF runs HERE
  | ingress  |  (sk_buff -- full metadata)
  +----+-----+
       |
       v
  +----------+
  | Netfilter|  (iptables/nftables -- traditional path)
  | (if any) |
  +----+-----+
       |
       v
  +----------+
  | Routing  |  ip_forward() / ip_local_deliver()
  | Decision |
  +----+-----+
       |
       v
  +----------+
  | TC       |  TC egress eBPF runs HERE
  | egress   |
  +----+-----+
       |
       v
  +----------+
  | NIC      |  XDP_TX / XDP_REDIRECT (if redirecting)
  | Driver   |
  +----------+
       |
       v
     Wire
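The TC hook points in this pipeline are wired up through the clsact qdisc with standard iproute2 commands. A sketch assuming an already-compiled eBPF object prog.o with a section named tc; the interface name is a placeholder:

```shell
# Add the clsact qdisc (no queueing -- it only exposes the two hook points)
tc qdisc add dev eth0 clsact

# Attach an eBPF classifier at ingress in direct-action (da) mode, so the
# program's return value is the verdict without a separate TC action
tc filter add dev eth0 ingress bpf da obj prog.o sec tc

# Same at egress
tc filter add dev eth0 egress bpf da obj prog.o sec tc

# Inspect what is attached
tc filter show dev eth0 ingress
```

Direct-action mode is what CNIs like Cilium use: one program per direction, returning the verdict itself, rather than chaining classifiers and actions.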

Socket programs: Attach to individual sockets or cgroups of sockets. Use cases: socket-level load balancing (BPF_PROG_TYPE_SK_LOOKUP), message redirection between sockets (BPF_PROG_TYPE_SK_MSG with SOCKMAP), TCP congestion control customization (BPF_PROG_TYPE_STRUCT_OPS).

Tracing programs: Attach to kernel tracepoints, kprobes (dynamic kernel function instrumentation), and uprobes (userspace function instrumentation). These do not modify packet processing but observe it -- they are the foundation of eBPF-based observability tools like Hubble, bpftrace, and tcplife.

eBPF Maps

Maps are the shared data structures that allow eBPF programs to maintain state between invocations and to communicate with userspace. Key map types:

Map Type                  Structure                  Use Case
BPF_MAP_TYPE_HASH         Hash table (key-value)     Connection tracking, policy lookup
BPF_MAP_TYPE_ARRAY        Fixed-size array           Per-CPU counters, configuration
BPF_MAP_TYPE_LRU_HASH     LRU eviction hash          NAT tables, flow tracking at scale
BPF_MAP_TYPE_PERCPU_HASH  Per-CPU hash table         Lock-free counters, per-CPU state
BPF_MAP_TYPE_RINGBUF      MPSC ring buffer           Event streaming to userspace
BPF_MAP_TYPE_LPM_TRIE     Longest-prefix match trie  IP CIDR lookups for routing/policy
BPF_MAP_TYPE_DEVMAP       Device redirect map        XDP redirect targets
BPF_MAP_TYPE_SOCKMAP      Socket redirect map        Socket-level load balancing

Maps are created via the bpf() syscall and persist in kernel memory as long as at least one reference exists (a loaded program, a pinned path in bpffs, or a userspace file descriptor). Map size is limited by available kernel memory, not by the program -- a single hash map can hold millions of entries.
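Map lifetime and pinning can be demonstrated directly with bpftool. This requires root and a mounted bpffs; the map name, key/value sizes, and the example entry are illustrative:

```shell
# Create a hash map and pin it to bpffs, so it persists even with no
# program loaded and no userspace fd open
bpftool map create /sys/fs/bpf/policy_map type hash \
    key 8 value 8 entries 65536 name policy_map

# Update and read back an entry from userspace
bpftool map update pinned /sys/fs/bpf/policy_map \
    key hex 0a 01 01 05 00 00 1f 90 value hex 01 00 00 00 00 00 00 00
bpftool map dump pinned /sys/fs/bpf/policy_map

# Removing the pin drops the last reference and frees the map
rm /sys/fs/bpf/policy_map
```

This pin-and-dump workflow is also how operators inspect a live CNI's policy or conntrack state without touching the datapath.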

eBPF for Networking: Cilium as a CNI

Cilium is a CNI plugin (and more) that replaces the traditional Linux networking stack (iptables, kube-proxy, conntrack) with eBPF programs. In an OVE context, Cilium is not the primary CNI (OVN-Kubernetes is), but it is available as a secondary network provider and its concepts are important because:

  1. OVN-Kubernetes is adopting eBPF concepts -- the OVN northbound database translates to OVS flows today, but eBPF-based OVN datapath implementations are being developed
  2. Cilium is the reference implementation for understanding what eBPF can do for networking
  3. Some organizations deploy Cilium alongside OVN for enhanced observability (Hubble) or cluster mesh

Cilium replaces iptables for three core Kubernetes networking functions:

  1. Service load balancing (kube-proxy replacement): Service-VIP-to-backend translation happens in eBPF at the socket or TC/XDP layer, eliminating the per-Service iptables DNAT chains.
  2. NetworkPolicy enforcement: policy verdicts are O(1) eBPF map lookups keyed by security identity, instead of linear traversal of iptables chains.
  3. Connection tracking and masquerading: Cilium maintains its own eBPF conntrack and NAT maps, bypassing netfilter's shared conntrack table.

eBPF for Observability: Hubble, bpftrace, tcplife

eBPF's tracing capabilities enable deep observability without modifying application code:

Hubble: Cilium's observability layer. It reads flow events from eBPF ring buffers and exposes per-pod network flows and policy verdicts via CLI, UI, and Prometheus metrics.

bpftrace: a high-level tracing language with awk-like syntax that compiles one-liners and short scripts into eBPF programs attached to kprobes, uprobes, and tracepoints.

tcplife: a BCC tool that logs the lifespan of TCP sessions -- process, addresses, bytes transferred, duration -- giving per-connection visibility with negligible overhead.
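A few representative bpftrace one-liners give a feel for this style of observability. They require root; the attach points are standard kernel tracepoints and kprobes:

```shell
# Count syscalls per process name until Ctrl-C
bpftrace -e 'tracepoint:raw_syscalls:sys_enter { @[comm] = count(); }'

# Count outbound TCP connection attempts per process name
bpftrace -e 'kprobe:tcp_connect { @connects[comm] = count(); }'

# Log every program executed on the host, with the invoking process
bpftrace -e 'tracepoint:syscalls:sys_enter_execve { printf("%s -> %s\n", comm, str(args->filename)); }'
```

None of these require restarting or instrumenting the observed workloads -- the programs attach to already-running kernel code paths and detach cleanly on exit.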

eBPF vs iptables/nftables: Performance at Scale

This comparison is directly relevant to the migration decision. VMware NSX DFW rules are enforced in the ESXi kernel module as per-packet processing. The OVE equivalent uses OVN ACLs (translated to OVS flows) or eBPF programs. The performance characteristics differ fundamentally:

Aspect                    iptables                              eBPF (Cilium/OVN)
Rule matching             Linear chain traversal, O(n)          Map lookup, O(1) or O(log n)
Rule update               Full chain replacement (atomic swap)  Single map entry update
Latency per 1,000 rules   ~50-100 us (chain walk)               ~1-5 us (hash lookup)
Latency per 10,000 rules  ~500-1000 us                          ~1-5 us (same)
Conntrack scalability     Single table, spinlock contention     Per-CPU tables, no locks
Memory overhead per rule  ~200 bytes (iptables entry)           ~64 bytes (map entry)
Visibility                iptables -L (static)                  Maps + ring buffer (live)

At 5,000+ VMs with hundreds of security policies, the difference between O(n) iptables chains and O(1) eBPF map lookups is the difference between milliseconds and microseconds of per-packet policy evaluation overhead. This is why the industry is moving from iptables to eBPF for network policy enforcement.


4. Micro-segmentation

Definition: Security at the Workload Level

Micro-segmentation is a security model that enforces access control policies at the individual workload (VM, pod, container) level, rather than at the network perimeter. In a traditional perimeter security model, a firewall at the data center edge inspects north-south traffic, but east-west traffic between VMs inside the data center is either unfiltered or filtered by a small number of internal firewall zones. Once an attacker breaches the perimeter, lateral movement is unrestricted.

Micro-segmentation inverts this model. Every workload has its own firewall policy. A web server can reach the application server on port 8080, but not the database on port 5432. The application server can reach the database on port 5432, but not the management server on port 22. A compromised web server cannot pivot to the database, even though both are on the same subnet. The "blast radius" of any breach is limited to the compromised workload and its explicitly allowed connections.

Micro-segmentation vs Perimeter Security:

  Traditional Perimeter Model:
  +-------------------------------------------------------+
  |  Perimeter Firewall (north-south only)                |
  +-------------------------------------------------------+
  |                                                       |
  |  Data Center -- FLAT east-west                        |
  |                                                       |
  |  +------+     +------+     +------+     +------+      |
  |  | Web  |<--->| App  |<--->|  DB  |<--->| Mgmt |      |
  |  +------+     +------+     +------+     +------+      |
  |                                                       |
  |  All VMs can reach all VMs. Compromised Web ->        |
  |  attacker has access to DB, Mgmt, everything.         |
  +-------------------------------------------------------+

  Micro-segmented Model:
  +-------------------------------------------------------+
  |  Perimeter Firewall (north-south)                     |
  +-------------------------------------------------------+
  |                                                       |
  |  Data Center -- Per-workload policy                   |
  |                                                       |
  |  +------+  :8080  +------+  :5432  +------+           |
  |  | Web  |-------->| App  |-------->|  DB  |           |
  |  +------+  ALLOW  +------+  ALLOW  +------+           |
  |      |                                 |              |
  |      X  (DENY all other)               X              |
  |      |                                 |              |
  |  +------+                          +------+           |
  |  | Mgmt |  :22 ALLOW from          | Jump |  DENY     |
  |  +------+  admin VLAN only         | Host |  from all |
  |                                    +------+           |
  |                                                       |
  |  Compromised Web -> attacker can ONLY reach App:8080. |
  |  Cannot reach DB, Mgmt, or any other workload.        |
  +-------------------------------------------------------+

Zero-Trust Networking for Financial Enterprises

For a Tier-1 financial enterprise subject to FINMA regulation, micro-segmentation is not optional -- it is a regulatory expectation. FINMA Circular 2023/1 on operational risks and resilience (and its predecessor guidance on IT risks) sets expectations for the segmentation of critical systems and for access controls that are documented, auditable, and demonstrably enforced.

NSX DFW delivers all of this today. The replacement platform must deliver equivalent capabilities. This is a migration blocker -- if the replacement cannot enforce per-workload policies with equivalent granularity, auditability, and operational manageability, the migration cannot proceed for regulated workloads.

Implementation Approaches

NSX Distributed Firewall (DFW): The incumbent. NSX DFW enforces policies in the ESXi kernel module at the vNIC level -- every packet entering or leaving a VM passes through the DFW filter. Policies are defined in NSX Manager using a multi-tier category model:

  1. Ethernet category: L2 rules (MAC-based)
  2. Emergency category: Override rules for incident response
  3. Infrastructure category: Rules for shared services (DNS, NTP, AD)
  4. Environment category: Zone-based rules (production vs. development)
  5. Application category: Application-specific micro-segmentation rules

Within each category, rules are evaluated in priority order. The first match wins. This category model provides a structured way to layer policies -- infrastructure rules take precedence over application rules, and emergency rules override everything.

NSX DFW supports identity-based rules using dynamic security groups populated by vCenter tags, Active Directory group membership, or third-party integrations (vulnerability scanners, CMDB). A rule can say "allow DNS servers to receive UDP 53 from all workloads" where "DNS servers" is a dynamic group that includes any VM tagged with role:dns -- adding a new DNS server automatically applies the policy.

OVN ACLs: OVN implements stateful firewall rules as ACLs (Access Control Lists) on logical switch ports. OVN ACLs are translated into OVS flows on every node, providing distributed enforcement equivalent to NSX DFW. OVN ACLs support stateful matching via connection tracking, L2-L4 match expressions, allow / allow-related / drop / reject actions, rule priorities, and optional per-ACL logging.

In OVE, OVN ACLs are not configured directly by administrators. They are generated by the OVN-Kubernetes CNI from Kubernetes NetworkPolicy resources (covered in section 5). This is a key architectural difference from NSX: in NSX, the administrator defines DFW rules in a dedicated security policy UI. In OVE, the administrator defines Kubernetes NetworkPolicy objects, and the OVN-Kubernetes controller translates them into OVN ACLs.

eBPF/Cilium: As covered in section 3, Cilium enforces micro-segmentation using eBPF programs attached to pod veth interfaces. Cilium's policy model is identity-based -- each pod receives a numeric identity derived from its Kubernetes labels, and policies are evaluated against identities rather than IP addresses. This avoids the "stale IP" problem where a policy referencing an IP address becomes incorrect when a pod restarts with a new IP.
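
A toy model (hypothetical code, not Cilium's implementation) shows why identity-based policy is immune to pod IP churn: the identity derives from labels, so a restarted pod with a new IP keeps its identity and every policy verdict remains valid:

```python
# Toy model of label-derived identities (illustrative only; Cilium allocates
# identities cluster-wide via its agent, not with a local hash).
def identity_for(labels: dict) -> int:
    # Stable within a process: derived from the sorted label set, not the IP.
    return hash(tuple(sorted(labels.items()))) & 0xFFFF

app = {"role": "app-server", "env": "production"}
db = {"role": "database", "env": "production"}

# Policy: app-server identities may reach database identities on 5432
allowed = {(identity_for(app), identity_for(db), 5432)}

def verdict(src_labels, dst_labels, port) -> bool:
    return (identity_for(src_labels), identity_for(dst_labels), port) in allowed

# A pod restart changes the IP but not the labels, so the identity --
# and therefore the verdict -- is unchanged.
print(verdict(app, db, 5432))   # True
print(verdict(db, app, 5432))   # False (no rule in the reverse direction)
```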

Hyper-V Virtual Filtering Platform (VFP): Azure Local's equivalent. VFP is a programmable virtual switch extension in the Hyper-V vSwitch that enforces ACLs, encapsulation, and NAT rules. VFP rules are programmed by the Azure Local Network Controller via the OVSDB protocol (yes, Azure Local uses OVSDB for southbound communication to VFP). VFP ACLs provide per-VM stateful filtering with L3/L4 matching and connection tracking.

Policy Model: Allow-List vs Deny-List

Default deny (allow-list): No traffic is permitted unless a rule explicitly allows it. This is the zero-trust model. NSX DFW in production environments almost always operates with a "deny all" default rule at the bottom of the rule table. Every permitted flow requires an explicit allow rule.

Default allow (deny-list): All traffic is permitted unless a rule explicitly blocks it. This is simpler to implement initially but offers weak security -- any missed deny rule is an open path.

For FINMA compliance, default deny is the required posture. Both NSX DFW and Kubernetes NetworkPolicy support default deny, but the implementation differs: NSX DFW uses an explicit "deny all" rule at the bottom of the rule table, while Kubernetes reaches the same posture with an empty-podSelector NetworkPolicy in each namespace that specifies no allow rules (covered in section 5).

Challenges at Scale

Policy management: With 5,000+ VMs and hundreds of micro-segmentation rules, policy management becomes a significant operational challenge. Common problems include rule sprawl (rules accumulate but are rarely retired), shadowed or overlapping rules whose interactions are hard to reason about, stale references to decommissioned workloads, and the difficulty of tracing an observed flow back to the rule that permitted it.

Troubleshooting: When a connection fails, determining whether the failure is caused by a security policy (expected behavior) or a network outage (unexpected behavior) requires visibility into the policy evaluation process. NSX DFW provides rule hit counters and flow logs. OVN provides ACL logging. Cilium provides Hubble flow verdicts. Without these tools, micro-segmentation troubleshooting becomes a guessing game.

Audit compliance: Auditors need to see a complete, point-in-time snapshot of all security policies and evidence that they are enforced. NSX provides policy export via API. In Kubernetes, policies are stored as API resources and can be exported via kubectl get networkpolicy -A -o yaml. GitOps (storing policies in Git) provides version history and change audit trail.


5. Network Policies (Kubernetes)

The NetworkPolicy Resource

Kubernetes NetworkPolicy is the native mechanism for micro-segmentation in Kubernetes clusters, including OVE with KubeVirt VMs. A NetworkPolicy is a namespaced resource that selects pods (including KubeVirt VM pods) via labels and defines allowed ingress and/or egress traffic.

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-app-to-db
  namespace: banking-app
spec:
  podSelector:
    matchLabels:
      role: database          # This policy applies to pods with role=database
  policyTypes:
    - Ingress
    - Egress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              role: app-server    # Allow from app-server pods
          namespaceSelector:
            matchLabels:
              env: production     # ... in production namespaces only
        - ipBlock:
            cidr: 10.0.50.0/24   # Allow from monitoring subnet
            except:
              - 10.0.50.100/32   # ... except this specific IP
      ports:
        - protocol: TCP
          port: 5432             # PostgreSQL port only
  egress:
    - to:
        - podSelector:
            matchLabels:
              role: dns           # Allow DNS lookups
      ports:
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53
    - to:
        - ipBlock:
            cidr: 10.0.100.0/24  # Allow NTP, AD to infra subnet
      ports:
        - protocol: UDP
          port: 123

NetworkPolicy Evaluation Flow

NetworkPolicy Evaluation for an Incoming Packet:

  Incoming packet to pod "db-0" (role=database, ns=banking-app)
       |
       v
  +----+--------------------------------------------+
  | Is there ANY NetworkPolicy selecting this pod   |
  | with policyType "Ingress"?                       |
  +----+--------------------------------------------+
       |                            |
       | YES                        | NO
       v                            v
  +----+----+                  +----+----+
  | Evaluate |                 | ALLOW   |
  | all      |                 | (no     |
  | matching |                 |  policy |
  | policies |                 |  = open)|
  +----+-----+                 +---------+
       |
       v
  +----+--------------------------------------------+
  | For each matching NetworkPolicy:                 |
  | Does the packet match ANY ingress rule?          |
  +----+--------------------------------------------+
       |                            |
       | YES (any policy)           | NO (no policy matches)
       v                            v
   +----+-----+                 +----+------+
   | ALLOW    |                 | DENY      |
   | (union   |                 | (default  |
   |  of all  |                 |  deny once|
   |  rules)  |                 |  selected)|
   +----------+                 +-----------+

  Key semantics:
  1. If NO policy selects a pod, ALL traffic is allowed (Kubernetes default)
  2. Once ANY policy selects a pod, only explicitly allowed traffic passes
  3. Multiple policies selecting the same pod are UNION-ed (additive, never subtractive)
  4. There is NO deny rule in standard NetworkPolicy -- only "allow" or "implicit deny"
  5. podSelector + namespaceSelector in the SAME "from" entry = AND
     podSelector and namespaceSelector as SEPARATE "from" entries = OR
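
These semantics can be condensed into a toy evaluator (illustrative Python; selectors reduced to label subsets, ports omitted) that reproduces rules 1-3 above:

```python
def selects(selector: dict, labels: dict) -> bool:
    # An empty selector ({}) matches everything, as in Kubernetes.
    return all(labels.get(k) == v for k, v in selector.items())

def ingress_allowed(pod_labels: dict, peer_labels: dict, policies: list) -> bool:
    matching = [p for p in policies if selects(p["podSelector"], pod_labels)]
    if not matching:
        return True   # no policy selects the pod -> open (Kubernetes default)
    # Once selected, allow only if ANY rule in ANY matching policy permits
    # the peer: policies are unioned, never subtractive.
    return any(
        selects(rule, peer_labels)
        for p in matching
        for rule in p["ingress_from"]
    )

db, app, web = {"role": "database"}, {"role": "app-server"}, {"role": "web"}
policies = [
    {"podSelector": {"role": "database"},
     "ingress_from": [{"role": "app-server"}]},
]

print(ingress_allowed(db, app, policies))   # True:  explicit allow
print(ingress_allowed(db, web, policies))   # False: selected, no rule matches
print(ingress_allowed(web, db, policies))   # True:  nothing selects "web"
```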

Default Deny: Implementing Zero-Trust in Kubernetes

To achieve zero-trust (deny-all-by-default) in a Kubernetes namespace:

# Default deny ALL ingress traffic to all pods in the namespace
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: banking-app
spec:
  podSelector: {}    # empty selector = selects ALL pods
  policyTypes:
    - Ingress
  # No ingress rules = nothing is allowed

---
# Default deny ALL egress traffic from all pods in the namespace
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-egress
  namespace: banking-app
spec:
  podSelector: {}
  policyTypes:
    - Egress
  # No egress rules = nothing is allowed (including DNS!)
  # IMPORTANT: denying egress without allowing DNS (UDP/TCP 53)
  # breaks pod name resolution. Always add a DNS allow rule.

After applying default deny, every allowed flow must be explicitly defined via additional NetworkPolicy resources. This is operationally equivalent to NSX DFW with a "deny all" default rule.

NetworkPolicy Limitations

Standard Kubernetes NetworkPolicy (GA since Kubernetes 1.7) has significant limitations compared to NSX DFW:

| Limitation | Impact | NSX DFW Comparison |
|---|---|---|
| No explicit deny rules | Cannot deny specific traffic; can only "not allow" it. Cannot create exception rules (allow all except X). | DFW has both Allow and Drop actions |
| No cluster-wide scope | Policies are namespaced. Cannot create a single policy that applies to all namespaces. | DFW rules can scope to "any" |
| No prioritized evaluation | Multiple policies are unioned. Cannot express "this rule overrides that rule." | DFW has categories with strict priority order |
| No L7 filtering | Cannot filter by HTTP path, method, header, or TLS SNI. | DFW with ALB or NSX Intelligence supports L7 context |
| No FQDN-based rules | Cannot write "allow egress to api.example.com." IP-based only (via ipBlock). | DFW supports FQDN in rules (DNS-based resolution) |
| No logging in spec | No standard way to specify per-rule logging. Implementation-dependent. | DFW has per-rule logging toggle |
| No action on rule match | Rules can only allow. No concept of "log and allow" or "reject with RST." | DFW supports Allow, Drop, Reject actions |

These limitations are significant for an NSX DFW replacement. The most critical gaps -- no deny rules, no cluster-wide scope, and no priority -- are addressed by AdminNetworkPolicy (covered next).

OVN-Kubernetes NetworkPolicy Implementation

When a NetworkPolicy is created in Kubernetes, the OVN-Kubernetes controller translates it into OVN ACLs:

  1. Watch: The OVN-Kubernetes controller watches the Kubernetes API for NetworkPolicy create/update/delete events
  2. Translate: The controller converts the NetworkPolicy podSelector, namespaceSelector, and ipBlock rules into OVN Address Sets (named lists of IP addresses) and OVN ACLs (match + action rules)
  3. Program: The OVN northbound database receives the ACLs, and ovn-northd compiles them into logical flows in the southbound database
  4. Distribute: ovn-controller on each node reads the southbound database and programs OVS flows on br-int
  5. Enforce: OVS enforces the flows at the per-port level -- every packet entering or leaving a pod's OVS port is matched against the ACL-derived flows

NetworkPolicy -> OVN ACL Translation:

  Kubernetes API                OVN Northbound DB         OVS (per node)
  +-------------------+        +-------------------+     +------------------+
  | NetworkPolicy:    |        | Address Set:      |     | OVS Flow:        |
  |   podSelector:    |  --->  |  "banking-app_    | --> | table=44,        |
  |     role: database|        |   role_database"  |     |   priority=1001, |
  |   ingress:        |        |   = {10.128.2.5,  |     |   ip,            |
  |     from:         |        |      10.128.3.8}  |     |   nw_src=10.128..|
  |       podSelector:|        |                   |     |   nw_dst=10.128..|
  |         role: app |        | ACL:              |     |   tp_dst=5432,   |
  |     ports:        |        |   match: "ip4.src |     |   ct_state=+new, |
  |       - TCP/5432  |        |    == $app_ips    |     |   actions=       |
  +-------------------+        |    && tcp.dst ==  |     |     ct(commit),  |
                               |    5432"          |     |     resubmit     |
                               |   action: allow-  |     +------------------+
                               |    related        |
                               |   priority: 1001  |
                               +-------------------+

The Address Set mechanism is critical for performance. When a pod is added or removed, only the Address Set is updated (a single OVSDB transaction), not every ACL that references pods with that label. This is equivalent to NSX DFW's dynamic security groups -- the group membership changes, but the rules referencing the group remain static.
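
A toy sketch of this indirection (illustrative data structures, not OVN's OVSDB schema): the ACL references the set by name, so pod churn touches only the set, never the rule.

```python
# ACLs reference address sets by name -- rule text never changes on pod churn.
address_sets = {
    "banking-app_role_app-server": {"10.128.2.5", "10.128.3.8"},
}
acls = [
    # match: ip4.src in $set && tcp.dst == 5432 -> allow-related
    {"set": "banking-app_role_app-server", "port": 5432, "action": "allow-related"},
]

def evaluate(src_ip: str, dst_port: int) -> str:
    for acl in acls:
        if src_ip in address_sets[acl["set"]] and dst_port == acl["port"]:
            return acl["action"]
    return "drop"  # default deny once the port is policed

print(evaluate("10.128.2.5", 5432))   # allow-related

# A new app-server pod appears: one set update, zero ACL updates.
address_sets["banking-app_role_app-server"].add("10.128.4.2")
print(evaluate("10.128.4.2", 5432))   # allow-related
```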

AdminNetworkPolicy and BaselineAdminNetworkPolicy

AdminNetworkPolicy (ANP) and BaselineAdminNetworkPolicy (BANP) are the Kubernetes solution to the three most critical NetworkPolicy limitations: no deny rules, no cluster-wide scope, and no priority-based evaluation. These resources are defined in the policy.networking.k8s.io API group (KEP-2091) and are supported in OVN-Kubernetes starting with OpenShift 4.14+.

AdminNetworkPolicy (ANP):

apiVersion: policy.networking.k8s.io/v1alpha1
kind: AdminNetworkPolicy
metadata:
  name: cluster-dns-allow         # cluster-scoped (no namespace)
spec:
  priority: 10                     # lower number = higher priority
  subject:
    namespaces:
      matchLabels:
        kubernetes.io/metadata.name: banking-app
  ingress: []                      # no ingress rules (not affected)
  egress:
    - name: "allow-dns"
      action: Allow                # NEW: explicit Allow action
      to:
        - namespaces:
            matchLabels:
              role: dns-service
      ports:
        - portNumber:
            protocol: UDP
            port: 53
    - name: "deny-external-db"
      action: Deny                 # NEW: explicit Deny action
      to:
        - networks:
            - 10.99.0.0/16         # deny access to legacy DB subnet

Key ANP features that address NSX DFW gaps:

| Feature | NetworkPolicy | AdminNetworkPolicy | NSX DFW Equivalent |
|---|---|---|---|
| Scope | Namespace | Cluster-wide | DFW rule scope: "DFW" (all) |
| Actions | Implicit allow only | Allow, Deny, Pass | Allow, Drop, Reject |
| Priority | No priority (union) | Numeric priority (0-1000) | Category + rule order |
| Subject | podSelector | namespaces + pods | Applied-To groups |
| Admin control | Any namespace user | Cluster admin only | NSX admin role |

The Pass action is unique to ANP and critical for delegated policy models. When an ANP rule matches with action Pass, it delegates the decision to namespace-level NetworkPolicies. This enables a tiered model:

ANP/BANP Evaluation Order (maps to NSX DFW Categories):

  +----------------------------------------------------------+
  | Priority 0-99: AdminNetworkPolicy (Emergency/Infra)      |
  |   Equivalent to: NSX DFW Emergency + Infrastructure      |
  |   Action: Allow / Deny / Pass                             |
  |   Who manages: Platform team / Security team              |
  +----------------------------------------------------------+
           |
           | If no ANP matches, or ANP action = Pass:
           v
  +----------------------------------------------------------+
  | Namespace NetworkPolicy (Application rules)               |
  |   Equivalent to: NSX DFW Application category             |
  |   Action: Implicit allow (union)                          |
  |   Who manages: Application team (namespace owner)         |
  +----------------------------------------------------------+
           |
           | If no NetworkPolicy selects the pod:
           v
  +----------------------------------------------------------+
  | BaselineAdminNetworkPolicy (Default posture)              |
  |   Equivalent to: NSX DFW default rule at bottom           |
  |   Action: Allow / Deny                                    |
  |   Who manages: Platform team                              |
  +----------------------------------------------------------+
           |
           | If no BANP matches:
           v
  +----------------------------------------------------------+
  | Kubernetes default: ALLOW                                 |
  +----------------------------------------------------------+

BaselineAdminNetworkPolicy (BANP): A single cluster-scoped resource (only one can exist) that defines the baseline posture. It is evaluated after all ANPs and namespace NetworkPolicies. Think of it as the "default deny all" rule at the bottom of the NSX DFW rule table:

apiVersion: policy.networking.k8s.io/v1alpha1
kind: BaselineAdminNetworkPolicy
metadata:
  name: default                    # only one BANP allowed, name must be "default"
spec:
  subject:
    namespaces: {}                 # applies to ALL namespaces
  ingress:
    - name: "default-deny-ingress"
      action: Deny
      from:
        - namespaces: {}           # from any namespace
  egress:
    - name: "default-deny-egress"
      action: Deny
      to:
        - namespaces: {}           # to any namespace
        - networks:
            - 0.0.0.0/0            # to any external IP
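
The combined ANP -> NetworkPolicy -> BANP -> default evaluation order can be condensed into a toy evaluator (illustrative Python with a deliberately simplified rule model; real matching covers namespaces, pods, and ports):

```python
def evaluate(packet, anps, netpols, banp):
    # 1. AdminNetworkPolicies, lowest priority number first.
    for anp in sorted(anps, key=lambda a: a["priority"]):
        for rule in anp["rules"]:
            if rule["match"](packet):
                if rule["action"] in ("Allow", "Deny"):
                    return rule["action"]
                break            # action == "Pass": delegate downward
        else:
            continue             # no rule in this ANP matched; try the next
        break                    # a Pass matched: stop ANP evaluation
    # 2. Namespace NetworkPolicies: union of allows, default deny once selected.
    selecting = [np for np in netpols if np["selects"](packet)]
    if selecting:
        allowed = any(r(packet) for np in selecting for r in np["allow"])
        return "Allow" if allowed else "Deny"
    # 3. BaselineAdminNetworkPolicy.
    for rule in banp["rules"]:
        if rule["match"](packet):
            return rule["action"]
    # 4. Kubernetes default.
    return "Allow"

anps = [{"priority": 10, "rules": [
    {"match": lambda p: p["dst"] == "dns", "action": "Allow"},
    {"match": lambda p: p["dst"] == "db", "action": "Pass"},
]}]
netpols = [{"selects": lambda p: p["dst"] == "db",
            "allow": [lambda p: p["src"] == "app"]}]
banp = {"rules": [{"match": lambda p: True, "action": "Deny"}]}

print(evaluate({"src": "web", "dst": "dns"}, anps, netpols, banp))   # Allow (ANP)
print(evaluate({"src": "app", "dst": "db"}, anps, netpols, banp))    # Allow (Pass -> NetworkPolicy)
print(evaluate({"src": "web", "dst": "mgmt"}, anps, netpols, banp))  # Deny (BANP baseline)
```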

With ANP + BANP, the OVE platform achieves feature parity with the NSX DFW category model:

| NSX DFW Category | Kubernetes Equivalent | Managed By |
|---|---|---|
| Emergency | ANP priority 0-9 | Security team |
| Infrastructure | ANP priority 10-99 | Platform team |
| Environment | ANP priority 100-499 | Platform team |
| Application | Namespace NetworkPolicy | App team |
| Default rule | BaselineAdminNetworkPolicy | Platform team |

Comparison to NSX DFW Categories and Applied-To Scope

NSX DFW's "Applied-To" field controls where a rule is enforced. Options include: the entire DFW (all hosts), specific security groups, specific logical ports, or specific logical switches. This is an optimization -- a rule with Applied-To set to a small group is only programmed on the hosts where those VMs run, reducing the rule set size on most hosts.

In Kubernetes, the equivalent is the podSelector (in NetworkPolicy) or subject (in ANP). OVN-Kubernetes is smart about rule distribution -- it only programs OVN ACLs on the chassis where the selected pods actually run, not on every node in the cluster. This is equivalent to NSX DFW's Applied-To optimization.

Remaining gaps between ANP/NetworkPolicy and NSX DFW: L7 filtering (HTTP path, method, TLS SNI), FQDN-based egress rules, and standardized per-rule logging remain outside the core specifications and depend on CNI-specific extensions, and the Deny action does not distinguish between a silent drop and a reject with TCP RST.

KubeVirt VMs and NetworkPolicy

KubeVirt VMs run inside pods. From the Kubernetes networking perspective, a VM pod is a pod like any other -- it has an IP address, it is connected to the OVN overlay, and it is subject to NetworkPolicy selection via pod labels.

When a NetworkPolicy selects a KubeVirt VM pod, the OVN ACLs are applied to the VM's OVS port on br-int. All traffic entering or leaving the VM passes through these ACLs. The VM itself is unaware of the policy -- from the guest OS perspective, packets that violate the policy simply never arrive (ingress) or are silently dropped (egress).

# Example: NetworkPolicy for KubeVirt VMs
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-ssh-to-legacy-vms
  namespace: legacy-workloads
spec:
  podSelector:
    matchLabels:
      vm.kubevirt.io/name: legacy-db-01    # KubeVirt VM label
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              role: jump-host
      ports:
        - protocol: TCP
          port: 22

Important consideration: KubeVirt VMs with Multus secondary network interfaces (e.g., a VLAN-attached management interface) are partially outside the scope of Kubernetes NetworkPolicy. NetworkPolicy applies only to the pod's primary OVN interface (the one connected to br-int). Traffic on secondary interfaces (connected via bridge, macvlan, or SR-IOV) bypasses OVN ACLs entirely. For workloads with secondary interfaces, micro-segmentation for the secondary network must be handled at the physical switch/firewall level or via host-level iptables/nftables rules.


6. QoS (Quality of Service)

Traffic Classification and Marking

QoS begins with classifying traffic into categories and marking packets so that network devices along the path can apply differentiated treatment. Two marking mechanisms dominate:

DSCP (Differentiated Services Code Point): A 6-bit field in the IP header's ToS (Type of Service) byte. 64 possible values (codepoints), grouped into per-hop behaviors (PHBs):

| DSCP Value | PHB | Typical Use |
|---|---|---|
| 46 (EF) | Expedited Forwarding | VoIP, real-time |
| 34 (AF41) | Assured Forwarding 4, low drop | Video, interactive |
| 26 (AF31) | Assured Forwarding 3, low drop | Business-critical data |
| 18 (AF21) | Assured Forwarding 2, low drop | Transactional data |
| 10 (AF11) | Assured Forwarding 1, low drop | Bulk data |
| 0 (BE) | Best Effort | Default, everything else |
| 8 (CS1) | Scavenger | Background, low-priority backups |

802.1p / PCP (Priority Code Point): A 3-bit field in the 802.1Q VLAN tag. 8 priority levels (0-7). Operates at Layer 2 -- effective within a single switched domain but lost at L3 router hops unless explicitly mapped to DSCP.
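
Because the DSCP occupies the upper six bits of the ToS byte (the lower two are ECN), tools that operate on the full byte use the value DSCP << 2 with mask 0xfc -- which is exactly how the tc filters later in this chapter are written. A quick check:

```python
def dscp_to_tos(dscp: int) -> int:
    # DSCP occupies bits 7..2 of the ToS byte; bits 1..0 carry ECN.
    return dscp << 2

# The ToS values matched by the tc filters in the Linux TC section:
print(hex(dscp_to_tos(46)))  # 0xb8  EF
print(hex(dscp_to_tos(34)))  # 0x88  AF41
print(hex(dscp_to_tos(48)))  # 0xc0  CS6

# And back: recover the DSCP from a ToS byte by masking off the ECN bits.
assert (0xb8 & 0xfc) >> 2 == 46
```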

For a virtualized environment, the critical question is: who marks the traffic? Options:

  1. Guest OS marks DSCP: The VM itself sets the DSCP value. The hypervisor must trust or re-mark it.
  2. Virtual switch marks DSCP: OVS or VFP classifies traffic by port/IP/protocol and sets DSCP. The guest OS marking is overridden.
  3. Physical switch marks DSCP: The ToR switch classifies and marks. Only useful for non-overlay traffic (SR-IOV, VLAN-direct).

In OVE, OVS can mark DSCP via flow actions (mod_nw_tos). OVN does not have a native QoS policy abstraction, but the OVN-Kubernetes CNI exposes QoS configuration through pod annotations.

Queuing Disciplines

Once traffic is classified and marked, queuing disciplines (qdiscs) on the host determine how packets are dequeued and transmitted:

FIFO (First In, First Out): The default. No differentiation. Simple, but high-priority traffic waits behind bulk transfers.

Priority Queuing (prio): Multiple queues with strict priority. Higher-priority queues are drained completely before lower-priority queues. Risk: starvation -- bulk traffic in a low-priority queue never transmits if the high-priority queue is always non-empty.

WFQ (Weighted Fair Queuing): Each queue gets a weighted share of bandwidth. A queue with weight 50% gets at least 50% of bandwidth, even when all queues are busy. Surplus bandwidth is shared proportionally.

HTB (Hierarchical Token Bucket): The most commonly used qdisc in Linux. Organizes queues in a hierarchy with guaranteed rates (floor) and maximum rates (ceiling). Child classes borrow bandwidth from parents when available.

HTB Configuration for Host Traffic Separation:

  Root (total link: 25 Gbps)
  +--------------------------------------------------------+
  |                                                          |
  |  Class 1: VM Traffic        Class 2: Storage        Class 3: Management
  |  Rate: 15 Gbps              Rate: 8 Gbps            Rate: 2 Gbps
  |  Ceil: 25 Gbps              Ceil: 15 Gbps           Ceil: 5 Gbps
  |  (can burst to 25G           (can burst to 15G       (can burst to 5G
  |   if others idle)            when VM is light)        during migration)
  |                                                          |
  |  Sub-classes:               DSCP: AF41               DSCP: CS6
  |  +----+----+----+                                        |
  |  |High|Med |Low |                                        |
  |  |EF  |AF21|BE  |                                        |
  |  |3G  |7G  |5G  |                                        |
  |  +----+----+----+                                        |
  +--------------------------------------------------------+
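
HTB's building block is the token bucket: tokens accumulate at the configured rate up to the burst size, and a packet is sent immediately only if enough tokens are available. A minimal single-bucket sketch (illustrative; HTB layers the class hierarchy and parent borrowing on top of this):

```python
class TokenBucket:
    def __init__(self, rate_bps: float, burst_bits: float):
        self.rate = rate_bps        # token refill rate, bits/second
        self.capacity = burst_bits  # bucket depth = allowed burst
        self.tokens = burst_bits    # bucket starts full
        self.last = 0.0

    def allow(self, packet_bits: int, now: float) -> bool:
        # Refill proportionally to elapsed time, capped at the burst size.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= packet_bits:
            self.tokens -= packet_bits
            return True
        return False   # over rate: HTB would queue; a policer would drop

# 1 Mbit/s with a 100 kbit burst
tb = TokenBucket(rate_bps=1_000_000, burst_bits=100_000)
print(tb.allow(100_000, now=0.0))   # True:  burst absorbed
print(tb.allow(1, now=0.0))         # False: bucket empty, no time elapsed
print(tb.allow(50_000, now=0.1))    # True:  0.1 s refills 100 kbit
```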

Linux TC (Traffic Control)

Linux TC is the kernel framework for traffic shaping, scheduling, and policing. It consists of three components: qdiscs (queuing disciplines that schedule packet transmission), classes (subdivisions of a classful qdisc, each with its own rate guarantees), and filters (classifiers that steer packets into classes):

# Example: HTB configuration for separating VM, storage, and management traffic

# 1. Add root HTB qdisc on the bond interface
tc qdisc add dev bond0 root handle 1: htb default 30

# 2. Root class (total bandwidth)
tc class add dev bond0 parent 1: classid 1:1 htb rate 25gbit

# 3. VM traffic class (guaranteed 15 Gbps, burst to 25 Gbps)
tc class add dev bond0 parent 1:1 classid 1:10 htb rate 15gbit ceil 25gbit

# 4. Storage traffic class (guaranteed 8 Gbps, burst to 15 Gbps)
tc class add dev bond0 parent 1:1 classid 1:20 htb rate 8gbit ceil 15gbit

# 5. Management traffic class (guaranteed 2 Gbps, burst to 5 Gbps)
tc class add dev bond0 parent 1:1 classid 1:30 htb rate 2gbit ceil 5gbit

# 6. Classify by DSCP (u32 matches the ToS byte: DSCP << 2, mask 0xfc)
tc filter add dev bond0 parent 1: protocol ip prio 1 \
    u32 match ip tos 0xb8 0xfc flowid 1:10    # EF (46) -> VM high-prio
tc filter add dev bond0 parent 1: protocol ip prio 2 \
    u32 match ip tos 0x88 0xfc flowid 1:20    # AF41 (34) -> Storage
tc filter add dev bond0 parent 1: protocol ip prio 3 \
    u32 match ip tos 0xc0 0xfc flowid 1:30    # CS6 (48) -> Management

OVS QoS

OVS provides two QoS mechanisms:

Ingress policing: Rate-limits traffic arriving at an OVS port (typically a VM's veth interface). Excess traffic is dropped. This prevents a single VM from consuming all host bandwidth.

# Rate-limit VM port to 1 Gbps with 100 Kbit burst
ovs-vsctl set interface veth-vm1 ingress_policing_rate=1000000  # kbps
ovs-vsctl set interface veth-vm1 ingress_policing_burst=100     # kbit

Egress shaping: Uses Linux HTB queues on OVS ports for traffic shaping (as opposed to hard policing). Provides smoother bandwidth limiting with burst accommodation.

# Create QoS record for 500 Mbps with queues
ovs-vsctl set port veth-vm1 qos=@newqos -- \
  --id=@newqos create qos type=linux-htb \
  other-config:max-rate=500000000 \
  queues:0=@q0 queues:1=@q1 -- \
  --id=@q0 create queue other-config:min-rate=100000000 \
                          other-config:max-rate=500000000 -- \
  --id=@q1 create queue other-config:min-rate=400000000 \
                          other-config:max-rate=500000000

QoS for Traffic Separation

In a converged infrastructure where VM traffic, storage traffic (iSCSI, NFS, Ceph), live migration traffic, and management traffic share the same physical NICs, QoS is essential for preventing any single traffic type from starving the others:

| Traffic Type | Priority | DSCP | Min Bandwidth | Max Bandwidth |
|---|---|---|---|---|
| Management (API, SSH, monitoring) | Highest | CS6 (48) | 1 Gbps | 5 Gbps |
| Storage (Ceph, iSCSI) | High | AF41 (34) | 8 Gbps | 15 Gbps |
| Live Migration | Medium | AF31 (26) | 0 (on-demand) | 10 Gbps |
| VM production traffic | Normal | AF21 (18) | 10 Gbps | 25 Gbps |
| Backup, replication | Low | CS1 (8) | 1 Gbps | 10 Gbps |

VMware vSphere handles this via Network I/O Control (NIOC), which classifies traffic by type (vMotion, vSAN, management, VM) and applies shares and reservations. The OVE equivalent requires manual configuration of Linux TC and OVS QoS, or automation via MachineConfig and custom operators. This is an operational maturity gap -- NSX/vSphere NIOC is integrated and UI-driven; OVE's QoS is CLI-driven and requires expertise in Linux TC.

Kubernetes Pod QoS Classes

Kubernetes assigns QoS classes to pods based on their resource requests and limits. While this is primarily a scheduling and eviction mechanism, it affects networking indirectly:

| QoS Class | Criteria | Eviction Priority | Network Implication |
|---|---|---|---|
| Guaranteed | requests == limits (CPU and memory) | Last to be evicted | Highest scheduling priority, often used for critical VMs |
| Burstable | At least one request or limit set, but not Guaranteed | Middle | Standard workloads |
| BestEffort | No requests or limits | First to be evicted | Lowest priority, may experience resource contention |

For KubeVirt VMs, setting explicit CPU and memory requests/limits equal to each other (Guaranteed QoS class) ensures the VM pod is not evicted under memory pressure and receives dedicated CPU time. This does not directly affect network QoS, but it prevents the VM from being killed during resource contention -- which looks like a network outage from the VM's perspective.
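
The classification logic is simple enough to state directly (a sketch of the documented Kubernetes behavior, reduced to CPU and memory and ignoring request-defaulting rules):

```python
def qos_class(requests: dict, limits: dict) -> str:
    # Kubernetes QoS classification, reduced to cpu/memory resources.
    if not requests and not limits:
        return "BestEffort"
    resources = {"cpu", "memory"}
    if (set(requests) >= resources and set(limits) >= resources
            and all(requests[r] == limits[r] for r in resources)):
        return "Guaranteed"
    return "Burstable"

# A KubeVirt VM pod pinned to Guaranteed: requests == limits for cpu and memory
print(qos_class({"cpu": "4", "memory": "16Gi"},
                {"cpu": "4", "memory": "16Gi"}))   # Guaranteed
print(qos_class({"cpu": "2"}, {"cpu": "4"}))       # Burstable
print(qos_class({}, {}))                           # BestEffort
```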


7. VPN / IPsec Tunneling

IPsec Architecture

IPsec (Internet Protocol Security) is a suite of protocols for securing IP communications through authentication and encryption. For a financial enterprise, IPsec is the standard mechanism for encrypted site-to-site connectivity between data centers, DR sites, and cloud environments.

The IPsec framework consists of three core components:

IKE (Internet Key Exchange): The control-plane protocol that negotiates cryptographic parameters and establishes security associations. IKE runs on UDP port 500 (and port 4500 for NAT traversal).

SA (Security Association): A unidirectional agreement between two peers on the encryption algorithm, authentication method, key material, and lifetime. Each data flow requires two SAs -- one for each direction. SAs are stored in the Security Association Database (SAD).

ESP (Encapsulating Security Payload): The data-plane protocol that encrypts and authenticates packets. ESP is IP protocol 50. It provides confidentiality (encryption), integrity (authentication), and anti-replay (sequence numbers).

AH (Authentication Header): An older data-plane protocol (IP protocol 51) that provides integrity and authentication without encryption. AH is rarely used today because ESP with authentication provides equivalent integrity protection plus confidentiality. AH also has the disadvantage of authenticating the IP header, which breaks NAT. For all practical purposes, modern IPsec means IKE + ESP.
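
The anti-replay protection mentioned above is a sliding window over ESP sequence numbers (RFC 4303; a 64-packet window is typical). A minimal receiver-side sketch of the window check:

```python
class ReplayWindow:
    """Receiver-side ESP anti-replay check: 64-entry sliding bitmap."""
    SIZE = 64

    def __init__(self):
        self.highest = 0   # highest sequence number accepted so far
        self.bitmap = 0    # bit i set => (highest - i) already received

    def check(self, seq: int) -> bool:
        if seq == 0:
            return False                     # seq 0 is never valid in ESP
        if seq > self.highest:               # new highest: slide the window
            shift = seq - self.highest
            self.bitmap = ((self.bitmap << shift) | 1) & ((1 << self.SIZE) - 1)
            self.highest = seq
            return True
        offset = self.highest - seq
        if offset >= self.SIZE:
            return False                     # too old: outside the window
        if self.bitmap & (1 << offset):
            return False                     # duplicate: replay detected
        self.bitmap |= 1 << offset           # late but new: accept and record
        return True

w = ReplayWindow()
print(w.check(5))    # True:  first packet
print(w.check(3))    # True:  late, but inside the window
print(w.check(5))    # False: replayed packet, rejected
print(w.check(200))  # True:  window slides forward
print(w.check(100))  # False: now far outside the 64-packet window
```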

IKE Phases

IPsec Tunnel Establishment (IKEv2):

  Initiator (DC-A)                              Responder (DC-B)
  10.0.1.1:500                                  10.0.2.1:500
       |                                              |
       |  IKE_SA_INIT (Message 1)                     |
       |  - SA proposal (AES-256-GCM, SHA-384, DH-20) |
       |  - Key exchange (DH public value)             |
       |  - Nonce (Ni)                                 |
       +--------------------------------------------->|
       |                                              |
       |  IKE_SA_INIT (Message 2)                     |
       |  - SA accepted (AES-256-GCM, SHA-384, DH-20) |
       |  - Key exchange (DH public value)             |
       |  - Nonce (Nr)                                 |
       |<---------------------------------------------+
       |                                              |
       |  Both sides now compute:                      |
       |  SKEYSEED = prf(Ni | Nr, DH-shared-secret)  |
       |  SK_d, SK_ai, SK_ar, SK_ei, SK_er, SK_pi, SK_pr
       |  (derived key material for IKE SA)            |
       |                                              |
       |  IKE_AUTH (Message 3) -- encrypted with SK_ei |
       |  - Identity (ID_i)                            |
       |  - Certificate (or PSK auth)                  |
       |  - AUTH payload (signature over messages 1-2) |
       |  - SA proposal for Child SA (ESP transform)   |
       |  - Traffic Selectors (TSi: 10.1.0.0/16)      |
       +--------------------------------------------->|
       |                                              |
       |  IKE_AUTH (Message 4) -- encrypted with SK_er |
       |  - Identity (ID_r)                            |
       |  - Certificate (or PSK auth)                  |
       |  - AUTH payload                               |
       |  - SA accepted for Child SA                   |
       |  - Traffic Selectors (TSr: 10.2.0.0/16)      |
       |<---------------------------------------------+
       |                                              |
       |  IPsec tunnel established.                    |
       |  Child SA (ESP) active for:                   |
       |    10.1.0.0/16 <-> 10.2.0.0/16              |
       |    SPI_out=0xABCD1234  SPI_in=0xEF015678    |
       |    Cipher: AES-256-GCM                       |
       |    Lifetime: 3600s / 1GB (rekey before)      |
       |                                              |
       |  Data flow (ESP-encrypted):                   |
       |  [IP HDR: 10.0.1.1->10.0.2.1][ESP HDR: SPI  |
       |   =0xABCD1234][Encrypted: original IP packet |
       |   10.1.x.x->10.2.x.x][ESP Auth Tag]          |
       +--------------------------------------------->|
       |                                              |

IKEv1 vs IKEv2: IKEv1 uses two phases with multiple modes (Main Mode + Quick Mode for identity protection, or Aggressive Mode for faster setup). IKEv2 (RFC 7296) simplifies this to a 4-message exchange (IKE_SA_INIT + IKE_AUTH) that establishes both the IKE SA and the first Child SA (IPsec SA) in just two round trips. IKEv2 also adds built-in NAT traversal (UDP encapsulation on port 4500), MOBIKE (endpoint mobility), and improved dead peer detection. All new deployments should use IKEv2.
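The IKEv2 exchange above maps directly onto a strongSwan configuration. A sketch of a site-to-site tunnel in swanctl.conf syntax, using the addresses and algorithms from the diagram (names, identities, and certificate filenames are illustrative; verify option names against the strongSwan documentation for your release):

```
connections {
  dc-a-to-dc-b {
    version = 2                                # IKEv2 only
    local_addrs  = 10.0.1.1
    remote_addrs = 10.0.2.1
    proposals = aes256gcm16-prfsha384-ecp384   # AES-256-GCM, SHA-384 PRF, DH group 20
    local {
      auth = pubkey
      certs = dc-a.crt
      id = dc-a.example.com
    }
    remote {
      auth = pubkey
      id = dc-b.example.com
    }
    children {
      lan-lan {
        local_ts  = 10.1.0.0/16                # TSi
        remote_ts = 10.2.0.0/16                # TSr
        esp_proposals = aes256gcm16
        rekey_time = 3600s                     # rekey before SA lifetime expires
        start_action = trap                    # establish on first matching packet
      }
    }
  }
}
```

The traffic selectors (local_ts/remote_ts) correspond to the TSi/TSr payloads in the IKE_AUTH messages, and the proposals line is the SA proposal carried in IKE_SA_INIT.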

Tunnel Mode vs Transport Mode

Tunnel mode: The entire original IP packet is encapsulated inside a new IP header and encrypted. The outer IP header has the tunnel endpoints (e.g., gateway IPs), and the inner IP header has the actual source and destination. This is the standard mode for site-to-site VPNs.

Tunnel Mode Encapsulation:

  Original packet:
  [IP HDR: 10.1.5.10 -> 10.2.8.20] [TCP HDR] [Payload]

  After IPsec tunnel mode encryption:
  [New IP HDR: 10.0.1.1 -> 10.0.2.1] [ESP HDR: SPI, Seq#]
  [Encrypted: [IP HDR: 10.1.5.10 -> 10.2.8.20] [TCP HDR] [Payload]]
  [ESP Auth Tag (ICV)]

  The physical network sees: 10.0.1.1 -> 10.0.2.1, protocol ESP
  The actual source/destination (10.1.5.10 / 10.2.8.20) are encrypted

Transport mode: Only the payload is encrypted; the original IP header is preserved (but authenticated). Used for host-to-host encryption (e.g., encrypting OVN GENEVE tunnel traffic between nodes). Lower overhead than tunnel mode because no additional IP header is added.

Transport Mode Encapsulation:

  Original packet:
  [IP HDR: 10.0.1.1 -> 10.0.2.1] [UDP HDR: GENEVE] [Inner packet]

  After IPsec transport mode encryption:
  [IP HDR: 10.0.1.1 -> 10.0.2.1] [ESP HDR: SPI, Seq#]
  [Encrypted: [UDP HDR: GENEVE] [Inner packet]]
  [ESP Auth Tag (ICV)]

  The original IP header is preserved, only the payload is encrypted
  Useful for: encrypting overlay (GENEVE) traffic between hosts
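The overhead difference between the two modes can be quantified with a back-of-the-envelope calculation. The sketch below assumes an IPv4 outer header and AES-256-GCM with an 8-byte explicit IV, a 16-byte ICV, and 4-byte ESP payload alignment; NAT-T UDP encapsulation, if active, would add a further 8 bytes not counted here.

```python
# ESP per-packet overhead estimate: tunnel mode vs transport mode.
# Assumed constants (typical values for IPv4 + AES-256-GCM):
IP_HDR = 20        # outer IPv4 header (tunnel mode only)
ESP_HDR = 8        # SPI (4 bytes) + sequence number (4 bytes)
GCM_IV = 8         # AES-GCM explicit IV
ESP_TRAILER = 2    # pad length + next header
GCM_ICV = 16       # integrity check value (auth tag)

def esp_overhead(payload_len: int, tunnel: bool) -> int:
    """Bytes ESP adds to a packet, including 4-byte alignment padding."""
    pad = (-(payload_len + ESP_TRAILER)) % 4   # align payload+trailer to 4 bytes
    fixed = ESP_HDR + GCM_IV + ESP_TRAILER + GCM_ICV
    return fixed + pad + (IP_HDR if tunnel else 0)

if __name__ == "__main__":
    for size in (64, 1400):
        t = esp_overhead(size, tunnel=True)
        x = esp_overhead(size, tunnel=False)
        print(f"payload {size:5d} B  tunnel +{t} B  transport +{x} B")
```

Under these assumptions a 1400-byte packet picks up 56 bytes in tunnel mode and 36 in transport mode; the 20-byte difference is exactly the extra outer IP header, which is why transport mode is preferred for host-to-host overlay encryption. The overhead also matters for MTU planning: the tunnel MTU (or TCP MSS) must be reduced accordingly to avoid fragmentation.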

IPsec for Site-to-Site Connectivity

For a Tier-1 financial enterprise, site-to-site IPsec is used in several scenarios:

  1. Data center interconnect: Connecting primary and secondary data centers over a WAN link. All east-west traffic between the two sites is encrypted.
  2. DR site connectivity: Encrypted replication traffic to the disaster recovery site.
  3. Cloud connectivity: Connecting on-premises data centers to cloud provider virtual networks (e.g., Azure ExpressRoute with IPsec overlay for additional encryption, or direct IPsec VPN as a backup path).
  4. Branch office / partner connectivity: Connecting branch offices or business partner networks to the data center.

Key design considerations for high-throughput site-to-site IPsec:

  1. Per-SA throughput: a single SA is processed on one CPU core, so one tunnel tops out at roughly single-core AES throughput. Scale out with multiple Child SAs or parallel tunnels load-balanced via ECMP.
  2. Cipher selection: use an AEAD cipher (AES-256-GCM) with AES-NI rather than separate encryption + HMAC transforms, which roughly double the per-packet CPU cost.
  3. MTU and fragmentation: ESP tunnel mode adds roughly 50-60 bytes of overhead. Clamp TCP MSS or lower the tunnel MTU to avoid fragmentation, which severely degrades throughput.
  4. Hardware offload: NICs with inline IPsec offload (e.g., ConnectX-6 Dx) move ESP processing into hardware, freeing CPU and reducing latency.
  5. Routing and HA: prefer route-based tunnels with BGP so failover is handled by routing convergence rather than static tunnel reconfiguration, and plan rekey lifetimes and dead peer detection accordingly.

WireGuard as a Modern Alternative

WireGuard is a modern VPN protocol that has been part of the Linux kernel since 5.6 (2020). It was designed as a simpler, faster, more auditable alternative to IPsec:

Aspect | IPsec (IKEv2 + ESP) | WireGuard
------ | ------------------- | ---------
Codebase | ~400,000 lines (Linux kernel IPsec + strongSwan) | ~4,000 lines (kernel module)
Cipher choice | Negotiated (many options) | Fixed: ChaCha20-Poly1305 (or AES-256-GCM with hardware support)
Key exchange | IKE (complex protocol) | Noise Protocol Framework (1-RTT)
Configuration | Complex (ipsec.conf, swanctl.conf, certificates) | Simple (wg genkey, wg set, single config file)
Performance | 3-5 Gbps (single SA, software) | 5-10 Gbps (single interface, software)
NAT traversal | UDP encapsulation (port 4500) | UDP natively (configurable port)
Statefulness | Stateful (SA negotiation, rekeying) | Stateless (cryptokey routing)
Audit surface | Large (cipher agility, mode negotiation, legacy compatibility) | Small (fixed ciphers, minimal protocol)

WireGuard's main advantage is simplicity. Its main disadvantage in enterprise environments is the lack of IKE -- key distribution must be handled out-of-band (manual, or via a management layer like Tailscale/Headscale). For site-to-site VPN between data centers, this is acceptable (keys can be distributed via configuration management). For dynamic, certificate-based authentication with many peers, IPsec's IKE is more mature.
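The simplicity claim is easy to illustrate: one side of a site-to-site WireGuard tunnel is a single short file in wg-quick format (keys are placeholders; addresses, port, and prefixes are illustrative):

```
# /etc/wireguard/wg0.conf on the DC-A gateway
[Interface]
PrivateKey = <dc-a-private-key>
Address    = 172.31.255.1/30        # tunnel interior addressing
ListenPort = 51820

[Peer]
# DC-B gateway
PublicKey  = <dc-b-public-key>
Endpoint   = 10.0.2.1:51820
AllowedIPs = 10.2.0.0/16            # cryptokey routing: prefixes reachable via this peer
PersistentKeepalive = 25            # keep NAT mappings alive
```

The AllowedIPs line is WireGuard's entire routing and policy model -- it both routes traffic for 10.2.0.0/16 into the tunnel and rejects packets from the peer with source addresses outside that prefix. The keys themselves must be generated (wg genkey) and distributed out-of-band, which is the trade-off discussed above.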

For overlay tunnel encryption (encrypting GENEVE traffic between hosts in OVE), WireGuard's simplicity and kernel-native implementation make it an attractive option. However, OVN's IPsec support (using strongSwan with transport-mode ESP) is the officially supported path in OpenShift.

VPN in Each Platform

NSX IPsec: NSX Edge nodes provide full IPsec VPN functionality (route-based and policy-based) with IKEv1/IKEv2, multiple cipher suites, and certificate or PSK authentication. NSX supports both site-to-site IPsec and route-based VPN with BGP over IPsec (for dynamic routing over encrypted tunnels). The NSX Edge is the VPN termination point, and its throughput is determined by the Edge VM sizing (or bare-metal Edge for higher throughput).

OVN IPsec: OVN supports encrypting the GENEVE tunnel traffic between nodes using IPsec in transport mode. This is configured cluster-wide and encrypts all east-west overlay traffic. Implementation uses strongSwan (IKEv2) and the Linux kernel's xfrm framework. In OpenShift 4.14+, this is enabled via the OVNKubernetes network operator configuration. Note: this is tunnel encryption (encrypting overlay traffic between hosts), not site-to-site VPN. For site-to-site VPN connectivity in OVE, an external VPN gateway is required -- either a dedicated router/firewall VM, a physical VPN appliance, or a dedicated VPN operator (e.g., strongSwan running as a pod).
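As a sketch of how this is enabled in OpenShift, the cluster network operator configuration is patched; the exact ipsecConfig schema has evolved across releases (an empty object in 4.14, an explicit mode field in newer versions), so verify against the documentation for your version:

```shell
# Enable OVN-Kubernetes overlay IPsec (east-west GENEVE encryption).
oc patch networks.operator.openshift.io cluster --type=merge \
  -p '{"spec":{"defaultNetwork":{"ovnKubernetesConfig":{"ipsecConfig":{"mode":"Full"}}}}}'

# Confirm the ovn-ipsec pods roll out on every node:
oc get pods -n openshift-ovn-kubernetes -l app=ovn-ipsec
```

Note that this toggles cluster-wide overlay encryption only; it does not create any site-to-site tunnels.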

Azure Local VPN Gateway: Azure Local can deploy an Azure VPN Gateway as a VM that provides site-to-site IPsec VPN with IKEv2 and BGP. This integrates with Azure's VPN Gateway service for hybrid connectivity. The VPN Gateway supports multiple tunnels, traffic selectors, and certificate-based authentication. Azure Local also supports ExpressRoute for dedicated, private WAN connectivity (not IPsec-encrypted unless an additional IPsec overlay is configured).

Swisscom ESC S2S VPN: Swisscom provides managed site-to-site VPN connectivity as part of the ESC service. The customer specifies the VPN endpoints, traffic selectors, and authentication method, and Swisscom provisions the tunnel. The customer has limited visibility into the VPN configuration details (managed service model). VPN throughput depends on the contracted service level.


How the Candidates Handle This

Comparison Table

Capability | VMware (NSX) | OVE (OVN/eBPF) | Azure Local | Swisscom ESC
---------- | ------------ | -------------- | ----------- | ------------
DVR | NSX DR on every transport node + SR on Edge VMs/bare-metal; mature, proven at scale | OVN logical routers distributed on all nodes by default; gateway chassis for external traffic; equivalent functionality, different management model | Hyper-V Network Virtualization (HNV) with distributed routing via VFP; routing on every host, gateway for external | Managed by Swisscom; customer has no visibility into routing architecture
VRF / Network Isolation | NSX logical routers provide VRF-equivalent isolation; VRF Lite on physical ToR switches | OVN logical routers for L3 isolation; Linux VRF for host-level separation; secondary networks via Multus for hard isolation | VRF on Hyper-V vSwitch via VFP; Azure Stack HCI logical networks provide isolation | Managed network segmentation via Swisscom provisioning; customer requests segments
eBPF | Not available (ESXi uses vmkernel datapath, not Linux eBPF) | Full Linux eBPF stack available; OVN-Kubernetes uses OVS flows (not eBPF today, but migration in progress); Cilium available as secondary CNI | Not available (Hyper-V uses VFP, not Linux eBPF) | Not available (managed service)
Micro-segmentation | NSX DFW: mature, multi-category policy model, identity-based groups, L7 context, graphical UI, per-rule logging | OVN ACLs via NetworkPolicy + AdminNetworkPolicy; cluster-scoped policies; L3/L4 only (L7 via Cilium); YAML-driven, GitOps-friendly | VFP ACLs via Network Controller; NSG-like rules (similar to Azure cloud NSGs); limited identity integration | Managed firewall rules via Swisscom; customer defines rules in service portal; limited granularity
Network Policy Scope | DFW rules scope to any, security group, logical switch, or port | NetworkPolicy (namespace), AdminNetworkPolicy (cluster), BaselineAdminNetworkPolicy (cluster default) | NSG-like rules scoped to subnets or VM NICs | Service-portal-defined rules scoped to customer-visible segments
Policy Management UI | NSX Manager: graphical rule editor, rule search, hit counters, flow visualization | Kubernetes API (kubectl/oc), YAML manifests, GitOps; graphical options via Red Hat ACS, Calico Enterprise | Windows Admin Center, Azure Portal (limited), PowerShell | Swisscom service portal
QoS | NIOC (Network I/O Control): per-traffic-type shares and reservations, UI-driven | Linux TC + OVS QoS: manually configured or via MachineConfig; no integrated UI equivalent to NIOC | QoS via Network ATC (intent-based classification for storage, management, compute traffic) | Managed by Swisscom; customer has no QoS controls
VPN / IPsec | NSX Edge: full IPsec VPN (site-to-site, route-based, IKEv1/v2, BGP-over-IPsec) | OVN IPsec for overlay encryption (transport mode, inter-node only); site-to-site VPN requires external gateway (strongSwan pod, physical appliance) | Azure VPN Gateway VM: site-to-site IPsec with IKEv2, BGP; ExpressRoute for private WAN | Managed S2S VPN via Swisscom; customer specifies endpoints and traffic selectors
Overlay Encryption | NSX supports GENEVE encryption between transport nodes | OVN IPsec (strongSwan, transport mode ESP on GENEVE tunnels); supported in OpenShift 4.14+ | Not available by default on Azure Local; WAN traffic assumed on private ExpressRoute or VPN | Managed by Swisscom

Key Differences in Prose

Micro-segmentation maturity: NSX DFW is the most mature micro-segmentation solution in the comparison. It has 10+ years of enterprise deployment experience, a polished graphical management interface, identity-based dynamic security groups integrated with vCenter and Active Directory, L7 application awareness, per-rule hit counters, and a multi-category policy model that maps naturally to enterprise security team workflows. OVE's combination of NetworkPolicy + AdminNetworkPolicy achieves functional parity for L3/L4 policy enforcement, but the operational experience is different: policies are YAML manifests managed via GitOps, there is no integrated graphical rule builder (third-party tools like Red Hat ACS or Calico Enterprise fill this gap), and L7 context requires Cilium's CiliumNetworkPolicy extension. For a team accustomed to NSX Manager's drag-and-drop policy editor, the transition to YAML + GitOps is a significant workflow change -- but it is arguably a better model for audit compliance (version-controlled, reviewable, CI/CD-testable policies).

AdminNetworkPolicy is the key to DFW parity. Without AdminNetworkPolicy, Kubernetes NetworkPolicy cannot match NSX DFW's category-based, priority-ordered, cluster-scoped policy model. With ANP + BANP, the functional gap closes significantly. The platform team can define emergency, infrastructure, and environment rules at the cluster level (ANP), delegate application-specific rules to namespace owners (NetworkPolicy), and enforce a default-deny baseline (BANP). This mirrors the NSX DFW category model. ANP is supported in OVN-Kubernetes starting with OpenShift 4.14, but it is still an evolving API (v1alpha1 as of 2025). The team should validate ANP behavior and maturity during the PoC.
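A minimal sketch of this layering, assuming the v1alpha1 policy API (field names should be verified against the OpenShift release in use; labels, names, and the DNS port are illustrative): one cluster-scoped ANP allowing DNS egress for all workloads, plus a BANP default-deny baseline that namespace NetworkPolicies can punch holes in.

```yaml
apiVersion: policy.networking.k8s.io/v1alpha1
kind: AdminNetworkPolicy
metadata:
  name: allow-dns
spec:
  priority: 10                      # evaluated before namespace NetworkPolicies
  subject:
    namespaces: {}                  # empty selector = all namespaces
  egress:
  - name: to-cluster-dns
    action: Allow
    to:
    - namespaces:
        matchLabels:
          kubernetes.io/metadata.name: openshift-dns
    ports:
    - portNumber: { protocol: UDP, port: 5353 }
---
apiVersion: policy.networking.k8s.io/v1alpha1
kind: BaselineAdminNetworkPolicy
metadata:
  name: default                     # BANP is a singleton named "default"
spec:
  subject:
    namespaces: {}
  ingress:
  - name: default-deny-ingress
    action: Deny
    from:
    - namespaces: {}
```

The ANP fires first (platform-team rules), namespace NetworkPolicies apply next (application-team rules), and the BANP catches whatever neither allowed -- mirroring the Emergency/Infrastructure -> Application -> Default ordering of the DFW category model.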

eBPF is an OVE-only advantage. Neither VMware (ESXi kernel) nor Azure Local (Hyper-V/Windows kernel) can leverage eBPF. OVE's Linux foundation means the organization gains access to the entire eBPF ecosystem -- Cilium for enhanced policy enforcement, Hubble for L7 observability, bpftrace for ad-hoc debugging, and future eBPF-based OVN datapaths. This is a strategic technology advantage, not an immediate functional one -- OVN flows work well today -- but it positions the platform for ongoing innovation.

QoS is an operational gap in OVE. VMware's NIOC provides an integrated, UI-driven QoS management experience for traffic separation. OVE requires manual Linux TC and OVS QoS configuration, or custom automation via MachineConfig. Azure Local's Network ATC provides a middle ground -- intent-based classification (storage, management, compute) that is simpler than raw TC but less granular than NIOC. For an organization converging storage, VM, migration, and management traffic on shared NICs, the lack of an integrated QoS management tool in OVE is a day-2 operational concern that should be addressed during platform engineering.

VPN/IPsec: NSX Edge vs external gateway. NSX Edge provides a fully integrated VPN gateway with IKEv2, BGP, and management through NSX Manager. In OVE, site-to-site VPN is not an integrated platform feature -- it requires deploying an external VPN gateway (strongSwan pod, physical appliance, or cloud VPN service). OVN IPsec covers overlay encryption (node-to-node), but not site-to-site connectivity. This is a valid architectural concern for multi-datacenter deployments. Azure Local's VPN Gateway VM provides a middle ground -- integrated but less feature-rich than NSX Edge.

Swisscom ESC abstraction level. Swisscom ESC abstracts away all routing, security, and VPN infrastructure. The customer consumes pre-defined network segments, firewall rules via a service portal, and managed VPN connectivity. This is the simplest operational model but the least flexible. Workloads requiring custom micro-segmentation rules, advanced QoS, or eBPF-based observability cannot run on ESC. The ESC model is appropriate for standard enterprise workloads that fit within Swisscom's predefined security templates.


Key Takeaways


Discussion Guide

The following questions target routing and security capabilities during vendor workshops, SME deep-dives, and PoC validation sessions. They are designed to test whether the vendor or SME has actual production experience with micro-segmentation, NetworkPolicy, and IPsec in real enterprise deployments -- not just slide-deck familiarity.

1. NSX DFW Rule Migration Strategy

"We have 3,200 NSX DFW rules across five categories (Emergency, Infrastructure, Environment, Application, Default). Walk us through the migration strategy to Kubernetes NetworkPolicy + AdminNetworkPolicy. How do we map the NSX category model to ANP priorities? How do we handle NSX rules that use dynamic security groups populated by vCenter tags? What about rules with L7 Application signatures? What is the expected timeline for migrating this rule set?"

Purpose: Tests the vendor's understanding of the DFW-to-NetworkPolicy mapping and realistic migration planning. The correct answer should include: (1) Emergency/Infrastructure/Environment categories map to ANPs with priority bands of roughly 0-9/10-99/100-499; Application category maps to namespace NetworkPolicies; Default rule maps to BANP. (2) vCenter tag-based security groups must be replaced with Kubernetes labels on VM pods -- this requires a label taxonomy design and automated label assignment during VM migration. (3) L7 rules cannot be expressed in NetworkPolicy; they require CiliumNetworkPolicy or application-level controls (TLS mutual auth, API gateway rules). (4) Migration timeline for 3,200 rules: 3-6 months minimum with dedicated security engineering resources, including rule consolidation (many NSX rules are redundant or obsolete), testing, and phased cutover.

2. AdminNetworkPolicy Maturity and Limitations

"AdminNetworkPolicy is still v1alpha1 in upstream Kubernetes. What is the maturity level in the OpenShift version you are proposing? Have you deployed ANP in production with more than 500 policies? What are the known limitations? How does ANP priority evaluation interact with namespace NetworkPolicies? Can you demonstrate the Pass action with a concrete example?"

Purpose: Tests real-world ANP experience. The answer should address: OpenShift 4.14+ supports ANP via OVN-Kubernetes with tech-preview stability; OpenShift 4.16+ targets GA. Known limitations include: limited tooling for policy visualization, no ANP-specific hit counters in OVN (logging is available), and ANP updates trigger OVN ACL recompilation which may cause brief policy enforcement gaps on large clusters. The Pass action delegates to namespace NetworkPolicy -- if an ANP with priority 10 matches with action Pass, evaluation continues to lower-priority ANPs, then to namespace NetworkPolicies, then to BANP.

3. Default Deny Implementation and Impact

"We need to implement default deny across the entire cluster for FINMA compliance. Walk us through the implementation. What happens to existing workloads when we enable default deny? How do we ensure DNS, monitoring, and infrastructure services continue to work? What is the rollback plan if the default deny breaks production?"

Purpose: Tests operational readiness for zero-trust deployment. The correct answer: implement BANP with Deny action for all traffic, then create ANPs for infrastructure services (DNS, monitoring, kube-apiserver, ingress controller, OVN internal traffic) before enabling the BANP. The safest approach: (1) deploy all infrastructure ANPs first, (2) deploy application NetworkPolicies for known flows, (3) enable BANP in "audit mode" (if available -- OVN ACL logging) to identify traffic that would be denied, (4) resolve all identified gaps, (5) switch BANP to enforcing mode. Rollback: delete the BANP resource, which immediately reverts to Kubernetes default-allow behavior. Critical gotcha: OVN internal traffic (ovnkube-node, ovnkube-controller communications) must be explicitly allowed, or the cluster control plane breaks.

4. NetworkPolicy for KubeVirt VMs with Multus

"We have KubeVirt VMs with two network interfaces: a primary OVN interface and a secondary VLAN interface via Multus bridge CNI. How does NetworkPolicy apply to this VM? Is the VLAN interface protected by NetworkPolicy? If not, how do we enforce micro-segmentation on the VLAN traffic? What about VMs with SR-IOV secondary interfaces?"

Purpose: Tests understanding of the Multus + NetworkPolicy interaction. The correct answer: NetworkPolicy applies only to the primary OVN interface (the pod's cluster network attachment). Traffic on the Multus secondary interface bypasses OVN ACLs entirely. For VLAN traffic micro-segmentation, options include: (1) physical firewall/ACLs on the ToR switch for the VLAN, (2) OVN secondary networks (using Multus with OVN as the secondary CNI, which does support ACLs), (3) in-guest firewall rules (iptables/firewalld inside the VM). SR-IOV interfaces are even further outside NetworkPolicy scope -- traffic goes directly from the VF to the physical NIC, bypassing OVS entirely. For SR-IOV-attached VMs, micro-segmentation must be handled at the physical switch or via eSwitch ACLs (if using switchdev mode).

5. eBPF and Cilium Evaluation

"Should we deploy Cilium alongside OVN-Kubernetes for enhanced micro-segmentation and observability? What are the trade-offs? Can Cilium replace OVN-Kubernetes as the primary CNI in OVE? What is the operational overhead of running two networking stacks? How does Hubble compare to our existing NSX Intelligence deployment for flow visualization?"

Purpose: Tests architectural judgment on eBPF strategy. The answer should include: Cilium can run alongside OVN-Kubernetes as a secondary CNI (via Multus) but cannot replace it in OVE -- OVN-Kubernetes is the supported and tested CNI for OpenShift. Running two CNIs adds operational complexity (two control planes, two sets of flow tables, two troubleshooting paths). The recommended approach: use OVN-Kubernetes for all NetworkPolicy enforcement (it handles the DFW replacement use case effectively), and evaluate Cilium Hubble specifically for observability if NSX Intelligence-equivalent flow visualization is a requirement. Hubble provides per-flow verdict logging, service dependency maps, DNS query logging, and HTTP request tracing -- comparable to NSX Intelligence for L3/L4 and superior for L7 (with CiliumNetworkPolicy).
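If Hubble is evaluated during the PoC, the flow-verdict and L7 visibility claims can be tested directly from its CLI (the flags shown exist in current Hubble releases; verify against the deployed version):

```shell
# Recent flows dropped by policy -- the enforcement-evidence view:
hubble observe --verdict DROPPED --last 100

# DNS query logging, an L7 capability NSX Intelligence does not surface per-query:
hubble observe --protocol dns --last 50
```

Asking the vendor to run these live against a test workload is a quick way to separate hands-on experience from slide-deck familiarity.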

6. QoS and Traffic Separation

"In VMware, we use NIOC to separate storage (vSAN), vMotion, management, and VM traffic with guaranteed bandwidth reservations. How do we achieve equivalent traffic separation in OVE? What happens if storage traffic (Ceph) consumes all bandwidth during a recovery event and starves VM production traffic? Show us the specific configuration."

Purpose: Tests practical QoS implementation knowledge. The answer should demonstrate: Linux TC with HTB qdisc on the bond interface, with classes for storage (Ceph OSD), live migration (KubeVirt virt-handler), management (API server, monitoring), and VM traffic. DSCP marking at the OVS level for traffic classification. The specific risk: Ceph recovery after an OSD failure generates enormous rebalancing traffic that, without QoS, can saturate the entire network link. The TC configuration must guarantee minimum bandwidth for VM production traffic even during Ceph recovery. The answer should also mention that this is a day-1 configuration requirement, not a day-2 optimization.
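A sketch of what that TC configuration could look like, assuming a 25 Gbps bond and DSCP marking applied at the OVS level (device name, classids, rates, and DSCP values are illustrative; a real deployment would ship this via MachineConfig):

```shell
DEV=bond0

# Root HTB qdisc; unclassified traffic falls into class 1:40 (management).
tc qdisc add dev $DEV root handle 1: htb default 40
tc class add dev $DEV parent 1: classid 1:1 htb rate 25gbit ceil 25gbit

# Guaranteed floors per traffic type; each class may borrow up to its ceiling
# when others are idle, so Ceph recovery cannot starve VM production traffic.
tc class add dev $DEV parent 1:1 classid 1:10 htb rate 10gbit ceil 25gbit  # VM production
tc class add dev $DEV parent 1:1 classid 1:20 htb rate 8gbit  ceil 20gbit  # storage (Ceph)
tc class add dev $DEV parent 1:1 classid 1:30 htb rate 4gbit  ceil 10gbit  # live migration
tc class add dev $DEV parent 1:1 classid 1:40 htb rate 1gbit  ceil 5gbit   # management

# Classify by DSCP (0xfc mask ignores the ECN bits):
tc filter add dev $DEV parent 1: protocol ip prio 1 u32 \
  match ip tos 0xb8 0xfc flowid 1:10   # DSCP EF (46) -> VM production
tc filter add dev $DEV parent 1: protocol ip prio 2 u32 \
  match ip tos 0xa0 0xfc flowid 1:20   # DSCP CS5 (40) -> storage
```

The key property to validate in the PoC is the floor, not the ceiling: with this configuration, a Ceph rebalance can burst to 20 Gbps only while VM traffic is below its 10 Gbps guarantee.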

7. IPsec and Multi-Datacenter Connectivity

"We operate two data centers (active-active) with NSX Federation providing stretched networking and cross-site DFW policy synchronization. How do we replicate this in OVE? What replaces NSX Federation for cross-site policy management? How is the inter-DC traffic encrypted? What is the failover behavior when one site loses connectivity?"

Purpose: Tests multi-datacenter architecture knowledge. The correct answer: OVE does not have a direct equivalent to NSX Federation. Cross-site connectivity options include: (1) Submariner (Red Hat's multi-cluster networking project) for cross-cluster service discovery and L3 connectivity with IPsec encryption, (2) site-to-site VPN gateway connecting the two clusters' external networks, (3) Red Hat Advanced Cluster Management (ACM) for cross-cluster policy distribution (pushing NetworkPolicy and ANP resources to both clusters via GitOps). The critical gap: NSX Federation's stretched logical switch (L2 stretch across sites) is not replicated in OVE -- workloads are expected to be L3-routable across sites, not L2-adjacent. This may require application architecture changes for workloads that depend on L2 adjacency across sites.

8. Overlay Encryption Performance

"We are considering enabling OVN IPsec to encrypt all east-west overlay traffic for compliance. What is the performance impact? How many Gbps of encrypted throughput can we expect per node with AES-NI? Does IPsec encryption add latency to every packet? Is there a hardware offload option? What is the operational overhead (certificate management, rekeying)?"

Purpose: Tests IPsec operational readiness. The answer should include: with AES-256-GCM and AES-NI, a single core can encrypt at 10-15 Gbps; with multi-core parallelism (Linux xfrm uses per-CPU SAs), aggregate throughput of 40-80 Gbps per node is achievable on modern CPUs. Latency impact: ~5-15 us per packet for encryption/decryption (added to the existing OVS processing latency). Hardware offload: inline IPsec offload is available on ConnectX-6 Dx and newer NICs (offloads ESP processing to NIC hardware, reducing CPU overhead to near-zero and latency to ~1-2 us). Certificate management: OVN IPsec uses self-signed certificates managed by the ovn-ipsec daemonset, with automatic rotation. For enterprise PKI integration (using the organization's CA), the certificate issuance and rotation process must be customized -- this is an operational integration task.

9. Micro-segmentation Audit and Compliance

"Our auditors require a complete, point-in-time snapshot of all security policies, evidence that they are enforced, and a change history. How do we provide this in OVE? In NSX, we export DFW rules via API and provide flow logs as enforcement evidence. What is the equivalent?"

Purpose: Tests audit compliance capability. The answer should include: (1) Policy snapshot: oc get networkpolicy,adminnetworkpolicy,baselineadminnetworkpolicy -A -o yaml provides a complete export of all policies. If policies are managed via GitOps (ArgoCD, Flux), the Git repository itself is the version-controlled audit trail with full change history, author attribution, and approval records. (2) Enforcement evidence: OVN ACL logging (ovn-nbctl set ACL <uuid> log=true) generates syslog entries for every packet that matches an ACL. These logs can be shipped to the SIEM for audit. (3) Flow logs: OVN flow export via IPFIX or Hubble (if Cilium is deployed) provides flow-level visibility comparable to NSX flow logs. (4) Change history: GitOps commit log provides the who/what/when/why for every policy change -- arguably superior to NSX Manager's audit log because it includes the approval workflow context (PR reviews, CI test results).
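The snapshot and enforcement-evidence steps above can be sketched as two commands (the ACL UUID is a placeholder to be looked up in the OVN northbound database; run ovn-nbctl from an ovnkube pod):

```shell
# (1) Point-in-time policy snapshot across all policy types:
oc get networkpolicy,adminnetworkpolicy,baselineadminnetworkpolicy -A -o yaml \
  > policy-snapshot-$(date +%Y%m%d).yaml

# (2) Per-ACL logging as enforcement evidence, shipped to the SIEM via syslog:
ovn-nbctl set ACL <acl-uuid> log=true severity=info name=audit-dfw
```

If the snapshot is taken from the GitOps repository instead of the live cluster, the same export doubles as drift detection: a diff between the two is itself audit-relevant.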

10. VPN Gateway Architecture for OVE

"We need site-to-site IPsec VPN between our OVE cluster and three remote sites (DR site, partner network, cloud VPC). In NSX, we use NSX Edge VPN. What is the equivalent in OVE? Do we deploy strongSwan as a pod? Do we use a physical appliance? How do we handle BGP over IPsec for dynamic routing? What is the HA model for the VPN gateway?"

Purpose: Tests VPN design in an OVE context. The correct answer: OVE does not include an integrated VPN gateway. Options: (1) Physical VPN appliance (Cisco, Palo Alto, Fortinet) at the data center edge -- simplest, well-understood, but outside the Kubernetes management plane. (2) strongSwan pod with host networking (running as a DaemonSet on dedicated gateway nodes) -- provides IKEv2, BGP (via FRR sidecar), and HA via keepalived or MetalLB for VIP management. (3) VPN operator (e.g., kube-router with BGP, or third-party operators). For BGP over IPsec: deploy FRRouting (FRR) as a sidecar container in the VPN pod, establish BGP sessions over the IPsec tunnel, and inject learned routes into the cluster network. HA model: active-passive with VRRP (keepalived) or active-active with ECMP (multiple gateway pods, each with its own IPsec tunnel to the remote site, physical router load-balancing across them).