Modern datacenters and beyond

Platform-Specific Networking

Why This Matters

The previous chapters covered the building blocks: physical connectivity (LACP, ECMP, spine-leaf), overlays (OVN, GENEVE, Multus), advanced data paths (SR-IOV, DPDK), and routing/security (DVR, NetworkPolicy, micro-segmentation). Those chapters answered "how do packets move between workloads?" This chapter answers a different question: how does Kubernetes itself organize, expose, and load-balance network traffic -- and how do KubeVirt VMs participate in that model?

This is the chapter where the VMware networking mental model meets the Kubernetes networking model head-on. In VMware, the networking primitives are port groups, VLANs, distributed switches, and NSX load balancers. A VM gets an IP from DHCP or static assignment, it is placed on a port group, and external access is routed through a physical or NSX Edge load balancer. The model is familiar, well-understood, and has been in production for over a decade.

In Kubernetes -- and by extension in OpenShift Virtualization Engine (OVE) -- the networking model is fundamentally different. Every Pod gets a unique, routable IP address. Services provide stable virtual IPs that load-balance across Pod endpoints. Ingress controllers and OpenShift Routes handle L7 north-south traffic. And on bare-metal infrastructure (which is what OVE is), there is no cloud provider to hand out external IP addresses for LoadBalancer Services -- that gap is filled by MetalLB, which announces service IPs to the physical network via ARP or BGP.

For an organization migrating 5,000+ VMs from VMware to OVE, understanding this model is not optional. Every VM that becomes a KubeVirt VirtualMachine gets a Pod IP. Every VM service that was previously exposed through an NSX load balancer or a physical F5 must now be exposed through a Kubernetes Service, an OpenShift Route, or MetalLB. The networking team must understand how these mechanisms work at the packet level -- not at the PowerPoint level -- because when a production VM service becomes unreachable at 3:00 AM, the troubleshooting path goes through kube-proxy rules, HAProxy router pods, and MetalLB speaker elections.

This chapter covers three tightly related topics:

  1. Kubernetes Networking Model -- the four fundamental requirements, Service types, kube-proxy modes, CoreDNS, and how KubeVirt VMs participate
  2. OpenShift Routes / Ingress Controllers -- how external L7 traffic enters the cluster, HAProxy internals, router sharding, and TLS termination
  3. MetalLB / Load Balancing -- how bare-metal clusters get external IPs for LoadBalancer Services, L2 vs BGP mode, and integration with physical network infrastructure

These three topics form a dependency chain: the Kubernetes networking model defines how Pods and Services communicate internally; Ingress/Routes define how L7 external traffic reaches Services; MetalLB defines how L4 external IPs are allocated and advertised for LoadBalancer Services. Together, they replace the NSX Edge load balancer, the NSX T1 gateway, and the physical F5/Citrix ADC -- or they work alongside external load balancers in a hybrid architecture.


Concepts

1. Kubernetes Networking Model

The Four Fundamental Requirements

The Kubernetes networking model is defined by four non-negotiable requirements. Every CNI plugin, every cluster implementation, and every networking solution must satisfy all four:

  1. Every Pod gets its own IP address -- No port-mapping tricks. A Pod's IP is its identity on the network. Two Pods never share an IP within a cluster.
  2. Pod-to-Pod communication without NAT -- Any Pod can reach any other Pod using the Pod's IP directly. The source IP seen by the destination is the real Pod IP, not a translated address.
  3. Pod-to-Service communication -- Pods access Services via a stable virtual IP (ClusterIP). The Service IP load-balances across the backing Pod endpoints.
  4. External-to-Service communication -- External clients reach cluster workloads through NodePort, LoadBalancer, or Ingress/Route mechanisms.

These requirements are simple to state but have profound implications. Requirement #2 -- no NAT between Pods -- means that every Pod IP must be routable across all nodes in the cluster. This is why Kubernetes clusters use a Pod CIDR (e.g., 10.128.0.0/14) that is divided into per-node subnets (e.g., /23 per node), and the CNI plugin (OVN-Kubernetes in OVE) handles the routing between nodes via overlay tunnels (GENEVE) or direct routing.
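
To see this split on a running cluster, OVN-Kubernetes records each node's Pod subnet in a node annotation. A minimal inspection sketch -- the node name worker-1 is a placeholder, and the exact output shape varies by OpenShift version:

  # Show the Pod subnet OVN-Kubernetes allocated to one node
  oc get node worker-1 \
    -o jsonpath='{.metadata.annotations.k8s\.ovn\.org/node-subnets}'
  # example output: {"default":["10.128.0.0/23"]}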

How This Differs from VMware Networking

The contrast with VMware networking is stark:

  Workload identity:
    VMware:     VM has a MAC + IP on a port group. IP assigned by DHCP or static.
    Kubernetes: Pod gets an IP from the Pod CIDR, assigned automatically by the CNI IPAM.

  Network isolation:
    VMware:     VLANs + port groups. Each VLAN is a broadcast domain.
    Kubernetes: Namespaces + NetworkPolicies. By default, all Pods can reach all Pods (flat network).

  IP lifetime:
    VMware:     VM IP is stable -- survives reboots, migrations.
    Kubernetes: Pod IP is ephemeral -- changes on every Pod restart. Service IP is stable.

  Load balancing:
    VMware:     NSX Edge LB, physical F5/Citrix, or DNS round-robin.
    Kubernetes: Service (L4 via kube-proxy), Ingress/Route (L7 via HAProxy/NGINX), MetalLB (L4 external IP).

  Service discovery:
    VMware:     DNS + manual configuration. VM IPs are often hardcoded.
    Kubernetes: CoreDNS + Service names. my-svc.my-namespace.svc.cluster.local resolves to the ClusterIP.

  Multi-tenancy:
    VMware:     Separate vSphere clusters, separate NSX T0/T1 gateways.
    Kubernetes: Namespaces, NetworkPolicies, AdminNetworkPolicies, separate node pools.

The most disorienting change for a VMware networking engineer is the ephemerality of Pod IPs. In VMware, a VM's IP address is a stable reference point -- it appears in monitoring dashboards, firewall rules, DNS records, and application configuration files. In Kubernetes, the Pod IP changes every time the Pod is rescheduled. The stable reference point is the Service, not the Pod. Every firewall rule, monitoring query, and application reference must use the Service name or Service IP, never the Pod IP.

For KubeVirt VMs, this distinction is partially relaxed: a VirtualMachine (as opposed to a VirtualMachineInstance) can be configured with a stable hostname and, if needed, a stable IP via Multus secondary networks. But on the primary cluster network, the VM still participates in the Kubernetes model -- it gets a Pod IP from the Pod CIDR, and that IP changes if the VM is migrated or restarted.

VMware Networking Model vs Kubernetes Networking Model:

VMware:
+------------------------------------------------------------------+
|  vSphere Cluster                                                  |
|                                                                    |
|  +--Host-1--(ESXi)--+           +--Host-2--(ESXi)--+             |
|  |                    |           |                    |             |
|  | VM-A (10.1.1.5)   |           | VM-B (10.1.1.6)   |             |
|  |   |               |           |   |               |             |
|  |   +-- port group  |           |   +-- port group  |             |
|  |   |   VLAN 100    |           |   |   VLAN 100    |             |
|  |   +-- vDS --------+--trunk----+---+-- vDS --------+             |
|  +--------------------+           +--------------------+             |
|                                                                    |
|  IP lifetime: stable (years)                                       |
|  Isolation: VLAN + NSX DFW                                        |
|  Discovery: DNS (manual) or NSX LB VIP                            |
+------------------------------------------------------------------+

Kubernetes (OVE):
+------------------------------------------------------------------+
|  OpenShift Cluster                                                |
|                                                                    |
|  +--Node-1--(RHCOS)--+          +--Node-2--(RHCOS)--+            |
|  |                     |          |                     |            |
|  | Pod-A (10.128.0.5) |          | Pod-B (10.128.2.8) |            |
|  |   |                |          |   |                |            |
|  |   +-- veth pair    |          |   +-- veth pair    |            |
|  |   |                |          |   |                |            |
|  |   +-- br-int (OVS) |          |   +-- br-int (OVS) |            |
|  |        |            |          |        |            |            |
|  +--------+------------+          +--------+------------+            |
|           |     GENEVE tunnel (UDP 6081)    |                       |
|           +--------------------------------+                       |
|                                                                    |
|  IP lifetime: ephemeral (Pod restart = new IP)                    |
|  Isolation: NetworkPolicy + AdminNetworkPolicy                    |
|  Discovery: CoreDNS (automatic via Service names)                 |
|  Stable endpoint: Service ClusterIP (e.g., 172.30.45.12)         |
+------------------------------------------------------------------+

Service Types

A Kubernetes Service is an abstraction that provides a stable network endpoint for a set of Pods (or KubeVirt VMs). The Service selects its backends using label selectors and maintains an up-to-date list of healthy endpoints. Kubernetes defines four Service types:

ClusterIP (default): Allocates a virtual IP from the Service CIDR (e.g., 172.30.0.0/16 in OpenShift) that is reachable only from within the cluster. kube-proxy programs iptables/IPVS rules on every node to DNAT traffic destined for the ClusterIP to one of the backend Pod IPs. This is the most common Service type -- it is the internal load balancer for east-west traffic.

apiVersion: v1
kind: Service
metadata:
  name: my-database
  namespace: production
spec:
  type: ClusterIP          # default -- internal only
  selector:
    app: postgres
    tier: database
  ports:
  - port: 5432             # Service port (clients connect here)
    targetPort: 5432        # Pod port (actual backend)
    protocol: TCP

NodePort: Extends ClusterIP by additionally exposing the Service on a static port (30000-32767) on every node's IP address. External clients can reach the Service via <any-node-IP>:<NodePort>. kube-proxy programs rules to forward traffic from the NodePort to the ClusterIP, which then DNATs to a backend Pod. NodePort is simple but has operational drawbacks: the port range is limited, firewall rules must allow the NodePort range on every node, and there is no built-in external load balancing across nodes.
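
A minimal NodePort sketch for the same database Service shown above (the explicit nodePort value is optional -- if omitted, Kubernetes picks a free port from the 30000-32767 range):

apiVersion: v1
kind: Service
metadata:
  name: my-database-nodeport
  namespace: production
spec:
  type: NodePort
  selector:
    app: postgres
    tier: database
  ports:
  - port: 5432             # ClusterIP port (still allocated)
    targetPort: 5432       # Pod port
    nodePort: 30432        # static port opened on every node
    protocol: TCP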

LoadBalancer: Extends NodePort by additionally requesting an external IP address from the cluster's load balancer provider. In cloud environments (AWS, Azure, GCP), the cloud controller manager provisions a cloud load balancer (ELB, Azure LB, Cloud LB) automatically. On bare-metal, there is no cloud provider -- this is where MetalLB steps in. MetalLB allocates an IP from a configured pool and advertises it via ARP (Layer 2 mode) or BGP (BGP mode). The external IP is the entry point for external clients.

ExternalName: A special Service type that does not proxy traffic. It returns a CNAME record pointing to an external DNS name. Used to alias external services (e.g., my-rds-instance.us-east-1.rds.amazonaws.com) with a cluster-internal Service name. No kube-proxy rules are created. Rarely used in bare-metal OVE deployments but useful for hybrid scenarios where some backends remain outside the cluster.
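
A minimal ExternalName sketch -- the external hostname is an illustrative placeholder:

apiVersion: v1
kind: Service
metadata:
  name: legacy-billing-db
  namespace: production
spec:
  type: ExternalName
  externalName: billing-db.corp.example.com   # returned as a CNAME, no proxying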

Service Type Hierarchy:

  ExternalName ─── CNAME alias only, no proxying

  ClusterIP ──── internal virtual IP, cluster-only reachability
      |
      +── NodePort ──── ClusterIP + static port on every node IP
             |
             +── LoadBalancer ──── NodePort + external IP (MetalLB on bare-metal)

  Each type builds on the previous:
    LoadBalancer = external IP + NodePort + ClusterIP + backend Pods

kube-proxy Modes

kube-proxy is the component that implements Services at the node level. It watches the Kubernetes API for Service and Endpoints/EndpointSlice resources and programs the node's networking rules to forward traffic from Service IPs to backend Pod IPs. kube-proxy operates in one of three modes:

iptables mode (default in most clusters):

kube-proxy creates iptables rules in the nat and filter tables. For each Service, it creates a chain of DNAT rules that randomly select a backend Pod. The randomization is achieved through iptables' --probability extension (statistic module).

iptables Rules for a ClusterIP Service (simplified):

  Chain KUBE-SERVICES (target for all traffic to Service CIDRs):
  -A KUBE-SERVICES -d 172.30.45.12/32 -p tcp --dport 5432 \
     -j KUBE-SVC-XXXXXXXXXXXXXXXX

  Chain KUBE-SVC-XXXXXXXXXXXXXXXX (load balancing across 3 endpoints):
  -A KUBE-SVC-XXXX -m statistic --mode random --probability 0.33333 \
     -j KUBE-SEP-AAAA    (endpoint 1: 10.128.0.5:5432)
  -A KUBE-SVC-XXXX -m statistic --mode random --probability 0.50000 \
     -j KUBE-SEP-BBBB    (endpoint 2: 10.128.2.8:5432)
  -A KUBE-SVC-XXXX \
     -j KUBE-SEP-CCCC    (endpoint 3: 10.128.4.3:5432)

  Chain KUBE-SEP-AAAA (single endpoint):
  -A KUBE-SEP-AAAA -p tcp -j DNAT --to-destination 10.128.0.5:5432

  Chain KUBE-SEP-BBBB:
  -A KUBE-SEP-BBBB -p tcp -j DNAT --to-destination 10.128.2.8:5432

  Chain KUBE-SEP-CCCC:
  -A KUBE-SEP-CCCC -p tcp -j DNAT --to-destination 10.128.4.3:5432

iptables mode is simple and well-understood, but it does not scale well. The time to update iptables rules is proportional to the total number of rules (Services x Endpoints). At 10,000+ Services, rule updates become slow (seconds to minutes) and the iptables lock creates contention. The probability-based load balancing is statistically uniform but does not consider connection count, latency, or backend health.

IPVS mode:

IPVS (IP Virtual Server) is a kernel-level L4 load balancer built into the Linux kernel's netfilter framework. Instead of iptables chains with probability-based selection, IPVS uses a hash table for Service-to-endpoint mapping. Lookups are O(1) regardless of the number of Services or endpoints. IPVS supports multiple load-balancing algorithms:

  rr (Round Robin)                 -- Cycles through backends sequentially. Default; simple and fair for uniform backends.
  lc (Least Connections)           -- Routes to the backend with the fewest active connections. Suits backends with varying request durations.
  dh (Destination Hashing)         -- Consistent hashing based on destination IP. Session affinity without connection tracking.
  sh (Source Hashing)              -- Consistent hashing based on source IP. Client-sticky sessions.
  wrr (Weighted Round Robin)       -- Round robin with per-backend weights. Heterogeneous backend capacity.
  wlc (Weighted Least Connections) -- Least connections with per-backend weights. Heterogeneous backends with varying loads.

IPVS mode is recommended for clusters with more than 1,000 Services. It provides O(1) lookup performance, better load-balancing algorithms, and connection tracking via the kernel's conntrack subsystem. The trade-off is operational complexity: IPVS requires the ip_vs, ip_vs_rr, ip_vs_wrr, ip_vs_sh kernel modules and the ipvsadm tool for debugging.

eBPF mode (Cilium kube-proxy replacement):

Cilium can replace kube-proxy entirely by implementing Service load balancing in eBPF programs attached to the TC (Traffic Control) or XDP (eXpress Data Path) hooks. This eliminates iptables/IPVS entirely and processes Service translations in the kernel's fast path before the standard networking stack. Benefits: O(1) lookup, lower latency than iptables/IPVS (no conntrack overhead in many cases), support for Maglev consistent hashing, DSR (Direct Server Return) for LoadBalancer Services, and integrated L7 visibility.

In OVE, the kube-proxy role is filled by OVN-Kubernetes itself. OVN-Kubernetes does not use iptables or IPVS for Service load balancing -- it implements Services as OVN load balancer objects defined in the OVN northbound database, and the OVS flows on each node perform the DNAT directly. This is functionally equivalent to IPVS mode but managed by OVN's control plane rather than kube-proxy. When running OVN-Kubernetes as the CNI in OpenShift, kube-proxy is not deployed as a separate component.
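
To confirm that a Service is realized as an OVN load balancer rather than an iptables rule, the OVN databases can be queried from inside an OVN-Kubernetes pod. A rough sketch -- the pod and container names vary by OpenShift version, so treat them as assumptions to verify:

  # List OVN load balancers and look for the Service's ClusterIP
  oc -n openshift-ovn-kubernetes exec ovnkube-node-xxxxx -c nbdb -- \
    ovn-nbctl lb-list | grep 172.30.45.12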

kube-proxy Mode Comparison:

                    iptables          IPVS              eBPF (Cilium)     OVN-Kubernetes
                    ─────────         ────              ─────────────     ──────────────
Data structure:     Linear chains     Hash table        BPF maps          OVS flow table
Lookup complexity:  O(n)              O(1)              O(1)              O(1)
Update cost:        O(n) full         O(1) incremental  O(1) incremental  O(1) incremental
                    rewrite
LB algorithms:      Random            rr, lc, dh, sh,   rr, Maglev,      OVN built-in
                    probability       wrr, wlc          random, session   (random/affinity)
DSR support:        No                Limited           Yes               No (SNAT default)
L7 visibility:      No                No                Yes (Hubble)      No (L3/L4 only)
Max Services:       ~5,000 (perf      ~100,000          ~100,000          ~50,000+
                    degrades)
Used in OVE:        No                No                No (optional)     Yes (default)

DNS: CoreDNS and Service Discovery

CoreDNS is the cluster DNS server in Kubernetes (and OpenShift). In OpenShift it runs as the dns-default DaemonSet in the openshift-dns namespace (upstream Kubernetes typically runs it as a Deployment in kube-system) and provides automatic DNS resolution for Services.

Every Service automatically gets a DNS record:

  A/AAAA (Service):   <svc>.<namespace>.svc.cluster.local
                      e.g., postgres.production.svc.cluster.local -> ClusterIP (172.30.45.12)
  SRV (Service port): _<port-name>._<protocol>.<svc>.<namespace>.svc.cluster.local
                      e.g., _postgres._tcp.postgres.production.svc.cluster.local (port named "postgres") -> port + target
  A/AAAA (Pod):       <pod-ip-dashed>.<namespace>.pod.cluster.local
                      e.g., 10-128-0-5.production.pod.cluster.local -> Pod IP
  A/AAAA (Headless):  <svc>.<namespace>.svc.cluster.local
                      e.g., postgres-headless.production.svc.cluster.local -> all Pod IPs (no ClusterIP)

Headless Services are Services with clusterIP: None. Instead of a single virtual IP, a DNS query for a headless Service returns the individual Pod IPs of all ready endpoints. This is critical for stateful applications (databases, message queues) where clients need to connect to specific instances rather than a load-balanced VIP. KubeVirt VMs running database replicas often use headless Services for peer discovery.

apiVersion: v1
kind: Service
metadata:
  name: postgres-headless
  namespace: production
spec:
  clusterIP: None           # headless -- returns all Pod IPs
  selector:
    app: postgres
  ports:
  - port: 5432
    targetPort: 5432

DNS resolution comparison:

  Normal ClusterIP Service:
    dig postgres.production.svc.cluster.local
    -> 172.30.45.12  (single VIP, kube-proxy/OVN load-balances to backends)

  Headless Service:
    dig postgres-headless.production.svc.cluster.local
    -> 10.128.0.5    (Pod A -- primary)
    -> 10.128.2.8    (Pod B -- replica 1)
    -> 10.128.4.3    (Pod C -- replica 2)
    (client decides which backend to connect to)

CoreDNS in OpenShift is configured via the dns.operator.openshift.io/default resource, which exposes the tuning knobs that matter most in large clusters: node placement of the DNS pods, log level, caching behavior, and upstream forwarding.
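
A minimal tuning sketch against that resource -- field availability differs slightly across OpenShift releases, so verify each field against the installed version:

apiVersion: operator.openshift.io/v1
kind: DNS
metadata:
  name: default
spec:
  logLevel: Normal
  nodePlacement:
    nodeSelector:
      node-role.kubernetes.io/worker: ""   # pin DNS pods to worker nodes
  upstreamResolvers:                       # forward non-cluster names upstream
    upstreams:
    - type: Network
      address: 10.0.0.2                    # example corporate resolver
      port: 53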

EndpointSlices

EndpointSlices are the scalable replacement for the original Endpoints resource. In early Kubernetes, each Service had a single Endpoints object containing all backend Pod IPs. For Services with hundreds of endpoints (e.g., a large DaemonSet), every Pod change required updating the entire Endpoints object and broadcasting it to all nodes. This created a quadratic scaling problem: N endpoints x M nodes = N*M API events per change.

EndpointSlices solve this by splitting endpoints into groups of up to 100 (configurable). Each EndpointSlice is an independent API object that can be updated without affecting other slices. This reduces the API event fan-out and allows kube-proxy (or OVN-Kubernetes) to process incremental updates.

Endpoints vs EndpointSlices:

  Endpoints (legacy):
    Service "web-frontend" -> Endpoints object:
      [10.128.0.5, 10.128.0.6, 10.128.0.7, ..., 10.128.2.105]
      (single object, 200 IPs, any change = full rewrite + broadcast)

  EndpointSlices (modern):
    Service "web-frontend" ->
      EndpointSlice-1: [10.128.0.5, ..., 10.128.0.104]   (100 IPs)
      EndpointSlice-2: [10.128.2.5, ..., 10.128.2.105]   (100 IPs)
      (update only the affected slice, broadcast only the changed slice)

  Impact at scale:
    5,000 VMs across 50 nodes, average 10 endpoints per Service:
      Endpoints:      every endpoint change -> 1 API update -> 50 node watches
      EndpointSlices: every endpoint change -> 1 API update -> 50 node watches
                      (same for small Services, but for Services with 500+ endpoints,
                       EndpointSlices reduce update size by 5-10x)

EndpointSlices are the default in Kubernetes 1.21+ and OpenShift 4.8+. The original Endpoints resource is still maintained for backward compatibility but is no longer the primary data source for kube-proxy or OVN-Kubernetes.
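
EndpointSlices can be listed directly; each slice carries a standard label pointing back to its Service. A quick check, using a hypothetical web-frontend Service:

  # Show how the Service's endpoints are split across slices
  oc get endpointslices -n production \
    -l kubernetes.io/service-name=web-frontend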

Pod CIDR, Service CIDR, and Physical Network Interaction

An OpenShift cluster uses three IP address spaces that must not overlap with each other or with the physical network:

  Pod CIDR     -- IP addresses for Pods (and KubeVirt VMs on the primary network).
                  Default (OpenShift): 10.128.0.0/14. Divided into per-node subnets
                  (e.g., /23 per node = 510 Pods per node).
  Service CIDR -- Virtual IPs for Services (ClusterIP).
                  Default (OpenShift): 172.30.0.0/16. Cluster-wide, not routable outside the cluster.
  Node Network -- IP addresses for the nodes themselves (host IPs).
                  Customer-defined (e.g., 10.0.0.0/24). Routable on the physical network.

CIDR Interaction with Physical Network:

  Physical Network (Data Center)
  +=======================================================================+
  |  Node Network: 10.0.0.0/24                                            |
  |  (routable on physical switches, assigned to node NICs)               |
  |                                                                        |
  |  +--Node 1 (10.0.0.11)--+     +--Node 2 (10.0.0.12)--+              |
  |  |                        |     |                        |              |
  |  |  Pod CIDR slice:       |     |  Pod CIDR slice:       |              |
  |  |  10.128.0.0/23         |     |  10.128.2.0/23         |              |
  |  |  (Pods: 10.128.0.x)   |     |  (Pods: 10.128.2.x)   |              |
  |  |                        |     |                        |              |
  |  |  br-int (OVS)          |     |  br-int (OVS)          |              |
  |  |       |                |     |       |                |              |
  |  +-------+----------------+     +-------+----------------+              |
  |          |                              |                               |
  |          +-- GENEVE tunnel (10.0.0.11 <-> 10.0.0.12) --+               |
  |                                                                        |
  |  Service CIDR: 172.30.0.0/16                                           |
  |  (virtual IPs, never appear on the wire -- OVN/kube-proxy DNATs them) |
  |                                                                        |
  |  MetalLB External IPs: 10.0.1.0/24                                    |
  |  (allocated from a separate pool, advertised via ARP or BGP)          |
  +=======================================================================+

  Key interactions:
  1. Pod CIDR is internal -- physical switches never see 10.128.x.x
     (encapsulated in GENEVE tunnels between node IPs 10.0.0.x)
  2. Service CIDR is virtual -- 172.30.x.x never appears on any wire
     (OVN/kube-proxy DNATs to Pod IPs before the packet leaves the node)
  3. MetalLB IPs must be routable on the physical network
     (either L2-adjacent or BGP-announced to physical routers)
  4. Node IPs must be routable on the physical network
     (they are the GENEVE tunnel endpoints and the NodePort listeners)

The critical planning consideration: Pod CIDR and Service CIDR must not overlap with any existing IP range on the corporate network. In a large enterprise with thousands of VLANs, IP address space exhaustion is real. The default 10.128.0.0/14 Pod CIDR provides 262,142 addresses -- sufficient for 5,000+ VMs plus containerized workloads -- but if the enterprise already uses 10.128.x.x for other purposes, the Pod CIDR must be changed at cluster install time. Changing it after installation is not supported.
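
The CIDRs are declared in install-config.yaml at install time. A sketch of the relevant networking stanza, using the OpenShift defaults (adjust only if they collide with existing corporate ranges):

# install-config.yaml (excerpt)
networking:
  networkType: OVNKubernetes
  clusterNetwork:
  - cidr: 10.128.0.0/14      # Pod CIDR
    hostPrefix: 23           # per-node subnet size (/23 = 510 Pods per node)
  serviceNetwork:
  - 172.30.0.0/16            # Service CIDR
  machineNetwork:
  - cidr: 10.0.0.0/24        # node network (example)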

How KubeVirt VMs Participate in the Kubernetes Networking Model

A KubeVirt VirtualMachineInstance (VMI) runs inside a Pod. From the Kubernetes networking perspective, the VM is a Pod -- it gets a Pod IP, it is reachable by other Pods, and it can be fronted by a Service. But VMs have networking requirements that containers typically do not:

  1. The VM has its own network stack. Inside the Pod, the VM guest OS runs its own kernel with its own IP configuration. The Pod IP is mapped to the VM's virtio NIC (or other virtual NIC) via a bridge or masquerade binding. The VM guest OS typically receives the same IP as the Pod (bridge mode) or a different IP that is NATed to the Pod IP (masquerade mode).

  2. VMs may need VLAN access. Many enterprise VMs need direct access to physical VLANs (e.g., a database VM on the storage VLAN, a legacy application on a specific management VLAN). This is handled by Multus secondary interfaces -- the VM gets a second (or third) NIC attached to a specific VLAN via bridge, macvlan, or SR-IOV CNI plugins.

  3. VMs may need stable MAC addresses. Some applications are licensed to specific MAC addresses, or use MAC-based authentication (802.1X). KubeVirt allows specifying a static MAC address on the VM's NIC, which persists across restarts and migrations.

  4. VMs may need DHCP. The VM guest OS expects to configure its network via DHCP. KubeVirt provides an internal DHCP server for the primary Pod network interface. For secondary VLAN interfaces, the VM can obtain its IP from the enterprise DHCP server on that VLAN.

KubeVirt VM Networking -- Primary + Secondary Interface:

  +--Node (10.0.0.11)-----------------------------------------------------+
  |                                                                         |
  |  +--Pod (virt-launcher-myvm-xxxxx)-----------------------------------+ |
  |  |  Pod IP: 10.128.0.55 (from Pod CIDR, assigned by OVN-Kubernetes)  | |
  |  |                                                                     | |
  |  |  +--VM Guest (myvm)-----------------------------------------+     | |
  |  |  |                                                           |     | |
  |  |  |  eth0: 10.128.0.55  (primary -- bridge binding to Pod IP)|     | |
  |  |  |      |              Routes: default gw via OVN            |     | |
  |  |  |      |                                                    |     | |
  |  |  |  eth1: 10.5.20.100  (secondary -- VLAN 520, via Multus)  |     | |
  |  |  |      |              Routes: 10.5.20.0/24 via eth1         |     | |
  |  |  |      |                                                    |     | |
  |  |  +--+---+----+----------------------------------------------+     | |
  |  |     |        |                                                      | |
  |  +-----+--------+-----------------------------------------------------+ |
  |        |        |                                                        |
  |    [veth pair]  [Multus bridge/macvlan/SR-IOV]                          |
  |        |        |                                                        |
  |    br-int (OVS) |                                                       |
  |    GENEVE overlay|                                                       |
  |        |        |                                                        |
  |   [bond0/ens1]  [bond1/ens3 -- trunk port, VLAN 520 tagged]            |
  +--------+--------+-------------------------------------------------------+
           |        |
     To spine-leaf  To physical switch trunk (VLAN 520)
     fabric         (direct L2 access to storage VLAN)

  Traffic flows:
  1. eth0 (primary): Pod-to-Pod via OVN overlay, Pod-to-Service via OVN LB
     -> Full Kubernetes networking model: Services, DNS, NetworkPolicy
  2. eth1 (secondary): Direct VLAN access, bypasses OVN
     -> No NetworkPolicy enforcement (OVN ACLs do not apply)
     -> No Service load balancing (direct L2/L3 on the VLAN)
     -> Useful for: storage traffic, legacy protocols, VLAN-specific access

The dual-network model (primary OVN + secondary VLAN) is the standard pattern for KubeVirt VMs in enterprise environments. The primary interface provides Kubernetes-native networking (Services, DNS, NetworkPolicy), while the secondary interface provides direct VLAN access for traffic that must bypass the overlay. The trade-off is clear: secondary interfaces gain direct VLAN access but lose all Kubernetes networking features (no Services, no NetworkPolicy, no DNS-based discovery).

For VMs that only need Kubernetes-native networking (web servers, application servers, modern microservices), a single primary interface is sufficient. For VMs that need VLAN access (databases on storage VLANs, legacy applications with VLAN-specific requirements), the Multus secondary interface is required.
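
A sketch of the dual-network pattern on a VirtualMachine spec (excerpt -- disks, volumes, and resources omitted). The NetworkAttachmentDefinition name vlan520-bridge and the static MAC are illustrative assumptions:

apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
  name: myvm
  namespace: production
spec:
  template:
    spec:
      domain:
        devices:
          interfaces:
          - name: default
            masquerade: {}                    # primary: Pod network (Services, DNS, NetworkPolicy)
          - name: storage-vlan
            bridge: {}                        # secondary: direct VLAN access via Multus
            macAddress: "02:00:00:00:05:20"   # optional static MAC (illustrative)
      networks:
      - name: default
        pod: {}
      - name: storage-vlan
        multus:
          networkName: vlan520-bridge         # NetworkAttachmentDefinition (assumed to exist)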


2. OpenShift Routes / Ingress Controllers

Ingress Resource: The Kubernetes Standard

The Kubernetes Ingress resource is the standard API for L7 (HTTP/HTTPS) traffic routing into the cluster. An Ingress defines rules that map external hostnames and URL paths to backend Services:

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: web-app-ingress
  namespace: production
spec:
  rules:
  - host: app.example.com
    http:
      paths:
      - path: /api
        pathType: Prefix
        backend:
          service:
            name: api-service
            port:
              number: 8080
      - path: /
        pathType: Prefix
        backend:
          service:
            name: frontend-service
            port:
              number: 80
  tls:
  - hosts:
    - app.example.com
    secretName: app-tls-cert

The Ingress resource is only a declaration of intent. It does nothing by itself. An Ingress Controller must be running in the cluster to watch for Ingress resources and configure the actual reverse proxy (HAProxy, NGINX, Envoy, Traefik) to route traffic according to the rules.

OpenShift Route: The OpenShift-Native Extension

OpenShift introduced the Route resource years before the Kubernetes Ingress API existed. Routes provide capabilities that were not originally available in the Ingress spec:

  Host-based routing            -- Ingress: yes.                        Route: yes.
  Path-based routing            -- Ingress: yes.                        Route: yes.
  TLS termination (edge)        -- Ingress: yes (via annotation).       Route: yes (native field).
  TLS termination (passthrough) -- Ingress: vendor-dependent.           Route: yes (native field).
  TLS termination (re-encrypt)  -- Ingress: no standard API.            Route: yes (native field).
  Weighted traffic splitting    -- Ingress: no (requires Gateway API).  Route: yes (native field, alternateBackends).
  IP whitelisting               -- Ingress: annotation-dependent.       Route: yes (via annotation, standardized).
  Rate limiting                 -- Ingress: annotation-dependent.       Route: yes (via annotation).
  Wildcard routes               -- Ingress: limited.                    Route: yes (*.example.com).
  Route admission policy        -- Ingress: none.                       Route: yes (per-namespace restrictions).

The three TLS termination modes are critical for enterprise deployments:

Edge termination: TLS is terminated at the router (HAProxy). The client connects via HTTPS; the router decrypts and forwards plain HTTP to the backend Pod. The router holds the TLS certificate. This is the simplest and most common mode.

Passthrough termination: TLS is NOT terminated at the router. The router passes the encrypted TCP stream directly to the backend Pod, which must hold the TLS certificate and terminate TLS itself. The router uses SNI (Server Name Indication) to route traffic to the correct backend based on the hostname in the TLS ClientHello. No L7 features (path-based routing, header manipulation) are available because the router cannot inspect the encrypted payload.

Re-encrypt termination: TLS is terminated at the router and re-established to the backend Pod. The router holds a "front-end" certificate for external clients and uses a separate "back-end" certificate (or the Pod's serving certificate) for the connection to the backend. This provides end-to-end encryption with the router able to inspect and route based on L7 attributes (host, path, headers).

TLS Termination Modes:

Edge:
  Client ──HTTPS──> Router (HAProxy) ──HTTP──> Pod
                    [terminates TLS]
                    [can inspect/route on L7]
  Certificate: on the Router (from Route spec or cert-manager)
  Use case: standard web apps, APIs

Passthrough:
  Client ──HTTPS──────────────────────────────> Pod
                    Router (HAProxy)
                    [SNI-based routing only]
                    [no L7 inspection]
  Certificate: on the Pod/VM itself
  Use case: apps managing their own certs, mutual TLS, non-HTTP TLS

Re-encrypt:
  Client ──HTTPS──> Router (HAProxy) ──HTTPS──> Pod
                    [terminates + re-encrypts]
                    [can inspect/route on L7]
  Front-end cert: on the Router
  Back-end cert: on the Pod (or cluster CA-signed)
  Use case: compliance requiring end-to-end encryption
            with L7 routing capability
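
A minimal edge-terminated Route sketch for the frontend Service from the Ingress example earlier (no inline certificate -- the router's default wildcard certificate or a cert-manager-provisioned one would be used):

apiVersion: route.openshift.io/v1
kind: Route
metadata:
  name: web-app
  namespace: production
spec:
  host: app.example.com
  to:
    kind: Service
    name: frontend-service
    weight: 100
  port:
    targetPort: 80
  tls:
    termination: edge
    insecureEdgeTerminationPolicy: Redirect   # HTTP requests redirected to HTTPS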

HAProxy-Based OpenShift Router

The default OpenShift Ingress Controller uses HAProxy as the reverse proxy engine. The router runs as a Deployment in the openshift-ingress namespace, typically with two replicas for high availability. The router pods are usually scheduled on dedicated infrastructure nodes (or worker nodes with a specific label) and listen on ports 80 (HTTP) and 443 (HTTPS).

HAProxy is configured dynamically by the OpenShift router process, which watches for Route and Ingress resources via the Kubernetes API and generates HAProxy configuration stanzas. The key HAProxy internals that matter for operations:

Frontend/Backend model: HAProxy uses "frontends" (listeners) and "backends" (server pools). Each Route creates entries in the HAProxy configuration that map the host/path to a backend pool of Pod IPs. The frontend handles TLS termination (for edge and re-encrypt routes) and routes to the appropriate backend.

Health checking: HAProxy performs active health checks on backend Pods. If a Pod fails its health check, HAProxy removes it from the load-balancing rotation. This is independent of Kubernetes readiness probes -- HAProxy has its own health check mechanism. The health check interval, timeout, and failure thresholds are configurable via the IngressController resource.

Connection handling: HAProxy supports HTTP/1.1 keep-alive, HTTP/2, WebSockets, and gRPC. For passthrough routes, it operates as a TCP proxy and supports any TLS-based protocol. Connection timeouts (client, server, tunnel) are tunable and critical for long-lived connections (WebSockets, database connections through the router).

Reload mechanism: When a Route is added, modified, or deleted, the router process regenerates the HAProxy configuration and performs a "hot reload." HAProxy supports seamless reloads -- it spawns a new process with the new configuration, the old process drains existing connections, and there is no connection drop during the reload. However, frequent reloads (e.g., during a mass Route creation) can cause CPU spikes and brief latency increases. In clusters with thousands of Routes, the reload frequency should be tuned via the RELOAD_INTERVAL environment variable.

OpenShift Router Architecture:

  External Client (app.example.com)
        |
        | DNS resolves to:
        | - Router node IPs (if DNS points to nodes), or
        | - MetalLB external IP (if router has LoadBalancer Service), or
        | - Physical LB VIP (if F5/Citrix fronts the routers)
        |
        v
  +--Router Pod (openshift-ingress namespace)-----------------------------+
  |                                                                         |
  |  openshift-router process                                              |
  |    |                                                                    |
  |    +-- watches: Routes, Ingresses, Endpoints (via K8s API)             |
  |    +-- generates: haproxy.config                                       |
  |    +-- signals: HAProxy reload                                         |
  |                                                                         |
  |  HAProxy process                                                       |
  |    |                                                                    |
  |    +-- Frontend: *:443 (HTTPS)                                         |
  |    |     +-- SNI map: app.example.com -> backend_app_production         |
  |    |     +-- SNI map: api.example.com -> backend_api_production         |
  |    |     +-- default: 503 Service Unavailable                          |
  |    |                                                                    |
  |    +-- Frontend: *:80 (HTTP)                                           |
  |    |     +-- ACL: redirect to HTTPS (if route has insecureEdgePolicy   |
  |    |        = Redirect)                                                 |
  |    |     +-- ACL: allow HTTP (if insecureEdgePolicy = Allow)           |
  |    |                                                                    |
  |    +-- Backend: backend_app_production                                 |
  |    |     +-- server pod1 10.128.0.55:8080 check inter 5000             |
  |    |     +-- server pod2 10.128.2.33:8080 check inter 5000             |
  |    |     +-- balance: leastconn (default: random with weights)         |
  |    |                                                                    |
  |    +-- Backend: backend_api_production                                 |
  |          +-- server pod3 10.128.4.10:8443 check inter 5000 ssl        |
  |          +-- (re-encrypt: HAProxy connects to backend via HTTPS)       |
  |                                                                         |
  +-------------------------------------------------------------------------+
        |                              |
        v                              v
    Pod 10.128.0.55:8080          Pod 10.128.4.10:8443
    (frontend app)                (API backend, re-encrypt)

Router Sharding

In a large cluster, a single router pair handling all Routes can become a bottleneck -- both in terms of HAProxy configuration size (thousands of backends) and in terms of traffic throughput. Router sharding splits the ingress workload across multiple Ingress Controllers (router deployments), each handling a subset of Routes.

Sharding is configured by creating additional IngressController resources that select Routes by namespace labels (namespaceSelector), Route labels (routeSelector), or both:

apiVersion: operator.openshift.io/v1
kind: IngressController
metadata:
  name: production-external
  namespace: openshift-ingress-operator
spec:
  replicas: 3
  domain: prod.example.com
  namespaceSelector:
    matchLabels:
      environment: production
  routeSelector:
    matchLabels:
      exposure: external
  endpointPublishingStrategy:
    type: LoadBalancerService      # MetalLB assigns external IP
  nodePlacement:
    nodeSelector:
      matchLabels:
        node-role.kubernetes.io/infra: ""

A typical enterprise sharding pattern:

  default             -- all namespaces (fallback); domain apps.cluster.example.com; 2 replicas on infrastructure nodes
  production-external -- environment=production + exposure=external; domain prod.example.com; 3 replicas on infrastructure nodes
  internal-api        -- tier=api + exposure=internal; domain api.internal.example.com; 2 replicas on worker nodes
  legacy-vms          -- type=kubevirt-vm; domain legacy.example.com; 2 replicas on infrastructure nodes

Each sharded router has its own HAProxy deployment, its own external IP (via MetalLB or physical LB), and its own DNS wildcard record. This provides traffic isolation, independent scaling, and failure domain separation.
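
Routes and namespaces opt into a shard simply by carrying the labels that the IngressController selects on. A quick sketch for the production-external shard defined above:

  # Label the namespace and the Route so the production-external shard admits them
  oc label namespace production environment=production
  oc label route web-app -n production exposure=external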

Ingress for KubeVirt VMs

Exposing KubeVirt VM services externally follows the same mechanisms as containerized workloads, but with some VM-specific considerations:

Service + Route (L7): Create a Service that selects the VM's Pod (using the VM labels) and a Route that points to the Service. The HAProxy router terminates TLS and forwards traffic to the VM's Pod IP. This works well for HTTP/HTTPS services running inside VMs.

Service type LoadBalancer (L4): Create a Service with type: LoadBalancer that selects the VM's Pod. MetalLB assigns an external IP. This works for non-HTTP services (SSH, RDP, database ports, custom TCP/UDP protocols).

NodePort (L4, simple): Create a Service with type: NodePort. The VM service is available on <any-node-IP>:<NodePort>. Simple but requires firewall rules for the NodePort range and provides no external load balancing.

Secondary interface with direct VLAN IP: If the VM has a Multus secondary interface on a routable VLAN, it already has a directly reachable IP. No Kubernetes Service or Route is needed. This is the "VMware-like" model where the VM has a stable VLAN IP. The downside: no Kubernetes load balancing, no automatic DNS via CoreDNS, no NetworkPolicy on that interface.

Exposing VM Services -- Decision Flow:

  Is the service HTTP/HTTPS?
    |
    +-- Yes --> Use OpenShift Route (edge TLS termination)
    |           + Service (ClusterIP) + VM labels
    |           External URL: https://myvm-app.prod.example.com
    |
    +-- No --> Is it a standard TCP/UDP service?
                |
                +-- Yes --> Use Service type LoadBalancer (MetalLB)
                |           External IP: 10.0.1.50:5432 (e.g., PostgreSQL)
                |
                +-- No --> Does it need a stable, routable IP?
                            |
                            +-- Yes --> Use Multus secondary VLAN interface
                            |           Direct VLAN IP: 10.5.20.100
                            |           (VMware-like model, no K8s features)
                            |
                            +-- No --> Use NodePort (simple, temporary)
                                        External: <node-ip>:30432
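
For the L4 branches of this flow, virtctl can generate the Service directly from a VirtualMachine. A sketch for a hypothetical VM named myvm exposing PostgreSQL (verify the flags against the installed virtctl version):

  # Create a LoadBalancer Service that selects the VM's virt-launcher Pod
  virtctl expose virtualmachine myvm -n production \
    --name myvm-postgres --port 5432 --target-port 5432 --type LoadBalancer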

TLS Certificate Management

TLS certificate management in OpenShift has two main paths:

OpenShift built-in: OpenShift automatically generates TLS certificates for the default Ingress Controller using the cluster's internal CA. Routes without explicit certificates get a certificate from this CA (which is not publicly trusted -- suitable for internal services). The OpenShift service-ca controller can also issue certificates to Services via the service.beta.openshift.io/serving-cert-secret-name annotation.

cert-manager: The standard Kubernetes certificate management solution. cert-manager integrates with public CAs (Let's Encrypt), enterprise CAs (HashiCorp Vault, AWS ACM PCA, Venafi), and internal CAs (self-signed, CA issuer). For production Routes with publicly trusted certificates, cert-manager automates the entire lifecycle: CSR generation, CA verification (HTTP-01 or DNS-01 challenge), certificate issuance, secret creation, and automatic renewal before expiration.

# cert-manager Certificate for an OpenShift Route:
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: app-tls
  namespace: production
spec:
  secretName: app-tls-cert
  issuerRef:
    name: letsencrypt-production
    kind: ClusterIssuer
  dnsNames:
  - app.example.com
  - www.app.example.com
  duration: 2160h        # 90 days
  renewBefore: 720h      # renew 30 days before expiry

Comparison to VMware / NSX Load Balancing

In VMware environments, north-south L7 load balancing is typically handled by the NSX Edge load balancer, NSX Advanced Load Balancer (Avi), or a physical F5 BIG-IP / Citrix ADC appliance.

The OpenShift Route + HAProxy router replaces the basic L7 load balancing function. For advanced ADC features (WAF, content switching, GSLB, rate limiting, bot protection), the organization may still need to deploy F5 BIG-IP or Citrix ADC in front of the OpenShift routers, or deploy the F5 CIS (Container Ingress Services) operator or Citrix Ingress Controller that can program external load balancers directly from Kubernetes resources.

  L7 routing (host/path)  -- NSX ALB/Avi: yes.  F5 BIG-IP: yes.                    Router (HAProxy): yes.
  TLS termination         -- NSX ALB/Avi: yes.  F5 BIG-IP: yes.                    Router: yes (edge, passthrough, re-encrypt).
  WAF                     -- NSX ALB/Avi: yes (built-in).  F5: yes (ASM module).   Router: no (external WAF needed).
  GSLB                    -- NSX ALB/Avi: yes.  F5: yes (GTM/DNS).                 Router: no (external DNS-based GSLB).
  Health checks           -- NSX ALB/Avi: advanced (L3-L7, scripted).  F5: advanced (monitors).  Router: basic (TCP, HTTP).
  Connection multiplexing -- NSX ALB/Avi: yes.  F5: yes.                           Router: limited.
  Rate limiting           -- NSX ALB/Avi: yes.  F5: yes (iRules).                  Router: yes (annotation-based, limited).
  Analytics               -- NSX ALB/Avi: yes (deep, per-request).  F5: yes.       Router: limited (HAProxy stats).
  Kubernetes integration  -- NSX ALB/Avi: CRD-based.  F5: F5 CIS operator.         Router: native (Route/Ingress).

3. MetalLB / Load Balancing

The Problem: Bare-Metal Has No Cloud LoadBalancer

In cloud environments (AWS, Azure, GCP), creating a type: LoadBalancer Service automatically provisions a cloud load balancer with a public or private IP address. The cloud controller manager handles the integration -- Kubernetes says "I need an external IP for this Service," and the cloud responds with a provisioned load balancer, an IP, and health checks.

On bare-metal infrastructure -- which is what OVE runs on -- there is no cloud provider. Creating a type: LoadBalancer Service leaves the Service's external IP in the Pending state indefinitely, because nothing responds to the request for an external IP. The Service still functions as a NodePort, which works but is inconvenient (high port numbers, no dedicated external IP, no automatic failover).

MetalLB solves this by providing a software LoadBalancer implementation for bare-metal Kubernetes clusters. It allocates IP addresses from a configured pool and advertises them to the network so that external clients can reach them.
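
From the Service author's point of view nothing changes except the type. A sketch that also pins the assignment to a specific pool -- the metallb.universe.tf/address-pool annotation and the Pod label selecting the VM are conventions to verify in the target environment:

apiVersion: v1
kind: Service
metadata:
  name: myvm-postgres
  namespace: production
  annotations:
    metallb.universe.tf/address-pool: production-pool   # optional: pin to one pool
spec:
  type: LoadBalancer
  selector:
    vm.kubevirt.io/name: myvm    # label KubeVirt places on the virt-launcher Pod
  ports:
  - port: 5432
    targetPort: 5432
    protocol: TCP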

MetalLB Architecture

MetalLB consists of two components:

Controller (Deployment): A single-replica deployment that watches for Services of type: LoadBalancer and assigns IP addresses from configured IPAddressPool resources. The Controller performs IP allocation logic: selecting an IP from the correct pool based on pool selectors, ensuring no duplicate assignments, and recording the assignment in the Service's .status.loadBalancer.ingress field.

Speaker (DaemonSet): Runs on every node (or a subset of nodes, controlled by node selectors). The Speaker is responsible for announcing the assigned IP to the physical network. Depending on the mode (Layer 2 or BGP), the Speaker uses different mechanisms to make the IP reachable.

MetalLB Architecture:

  +--Kubernetes Cluster-----------------------------------------------------+
  |                                                                           |
  |  Controller (Deployment, 1 replica)                                      |
  |    +-- watches: Services type=LoadBalancer                               |
  |    +-- reads: IPAddressPool, L2Advertisement, BGPAdvertisement           |
  |    +-- action: assigns IP from pool, writes to Service .status           |
  |                                                                           |
  |  Speaker (DaemonSet, runs on every node)                                 |
  |    +-- reads: Service IP assignments from Controller                     |
  |    +-- Layer 2 mode: responds to ARP/NDP requests for assigned IPs      |
  |    +-- BGP mode: announces assigned IPs as /32 routes to BGP peers      |
  |                                                                           |
  |  +--Node 1---------+  +--Node 2---------+  +--Node 3---------+         |
  |  |  Speaker pod     |  |  Speaker pod     |  |  Speaker pod     |         |
  |  |  - ARP responder |  |  - ARP responder |  |  - ARP responder |         |
  |  |    or BGP peer   |  |    or BGP peer   |  |    or BGP peer   |         |
  |  |                  |  |                  |  |                  |         |
  |  |  Service backend |  |  Service backend |  |  (no backend on  |         |
  |  |  Pods running    |  |  Pods running    |  |   this node)     |         |
  |  +------------------+  +------------------+  +------------------+         |
  |                                                                           |
  +---------------------------------------------------------------------------+
           |                      |                      |
     +-----+----------------------+----------------------+------+
     |                Physical Network (ToR switches)            |
     +-----------------------------------------------------------+

MetalLB Layer 2 Mode

In Layer 2 mode, MetalLB makes the assigned Service IP reachable by responding to ARP requests (IPv4) or NDP Neighbor Solicitation messages (IPv6) on the local network segment. One Speaker pod "wins" a leader election for each Service IP and becomes the sole responder for that IP. The winning node attracts all traffic for that IP.

How it works:

  1. A Service gets assigned IP 10.0.1.50 from the pool.
  2. MetalLB Speakers on all nodes participate in a leader election (using memberlist, a gossip-based protocol) for this IP.
  3. Node 1 wins the election and becomes the "owner" of 10.0.1.50.
  4. Node 1's Speaker starts responding to ARP requests for 10.0.1.50 with Node 1's MAC address.
  5. The physical switch learns that 10.0.1.50 lives on Node 1's port.
  6. All external traffic for 10.0.1.50 arrives at Node 1.
  7. kube-proxy/OVN on Node 1 DNATs the traffic to a backend Pod (which may be on Node 1, Node 2, or any other node).

MetalLB Layer 2 Mode -- Traffic Flow:

  External Client                Physical Switch              Cluster Nodes
  (10.0.0.100)                   (ToR / Leaf)
       |                              |
       | 1. "Who has 10.0.1.50?"      |
       +--ARP Request (broadcast)---->|
       |                              |
       |                              | 2. Forward ARP to all ports
       |                              +--------+--------+--------+
       |                              |        |        |        |
       |                              |   Node 1   Node 2   Node 3
       |                              |   (leader)  (standby) (standby)
       |                              |        |
       |                              | 3. Node 1 Speaker responds:
       |                              |    "10.0.1.50 is at MAC aa:bb:cc:01"
       |                              |<-------+
       |                              |
       | 4. ARP Reply                 |
       |<-----------------------------+
       |
       | 5. Traffic to 10.0.1.50      |
       +---TCP SYN to 10.0.1.50----->|
       |                              |
       |                              | 6. Switch forwards to Node 1
       |                              |    (MAC aa:bb:cc:01)
       |                              +------->+
       |                                       |
       |                                  Node 1 (kube-proxy/OVN):
       |                                  7. DNAT 10.0.1.50 -> 10.128.2.8
       |                                     (backend Pod, may be on Node 2)
       |                                       |
       |                                  8. Forward to Node 2 via overlay
       |                                       +---------> Node 2
       |                                                    Pod 10.128.2.8
       |                                                    (actual backend)

  Failover:
  - If Node 1 fails, leader election selects Node 2
  - Node 2 sends gratuitous ARP: "10.0.1.50 is now at MAC aa:bb:cc:02"
  - Physical switch updates MAC table
  - Failover time: ~10 seconds (gratuitous ARP + switch MAC table update)

Layer 2 mode limitations:

  Single-node bottleneck -- All traffic for a given Service IP enters through one node, which must forward it to backend Pods on other nodes. At high throughput (>10 Gbps per Service), the leader node becomes a bandwidth bottleneck.
  No true load balancing -- The leader node receives all traffic; there is no distribution across nodes at the network level. kube-proxy/OVN distributes traffic to backend Pods, but the inbound link on the leader is still a single point.
  Slow failover          -- Leader election + gratuitous ARP + switch MAC table update takes 5-15 seconds. Existing TCP connections are broken during failover; a short-lived disruption.
  Requires L2 adjacency  -- The Service IP must be on the same L2 segment as the node IPs (same VLAN/subnet). IPs from a different VLAN cannot be used unless that VLAN is trunked to the nodes.
  ARP table pressure     -- Each Service IP adds an entry to the physical switch's ARP/MAC table. With hundreds of Service IPs this can pressure smaller switches, though it is typically not an issue for enterprise-grade switches (16K-64K MAC entries).

Layer 2 mode is simple to deploy (no BGP configuration on physical switches), works in any L2 network segment, and is sufficient for most small-to-medium deployments. For a Tier-1 enterprise with 5,000+ VMs and hundreds of LoadBalancer Services, the single-node bottleneck and slow failover are significant concerns. BGP mode is recommended for production.

MetalLB BGP Mode

In BGP mode, MetalLB Speakers establish BGP peering sessions with the physical ToR (Top of Rack) switches or upstream routers. Instead of using ARP to claim an IP, the Speakers announce the Service IP as a /32 route via BGP. The physical routers learn the route and can use ECMP (Equal-Cost Multi-Path) to distribute traffic across multiple nodes simultaneously.

How it works:

  1. A Service gets assigned IP 10.0.1.50 from the pool.
  2. MetalLB Speakers establish BGP sessions with the ToR switches. By default (externalTrafficPolicy: Cluster) every Speaker node announces the Service IP; with externalTrafficPolicy: Local, only nodes hosting backend Pods for the Service announce it.
  3. Each Speaker announces 10.0.1.50/32 as a route with the node's IP as the next-hop.
  4. The ToR switch receives the same /32 route from multiple nodes and installs all of them as ECMP paths.
  5. External traffic for 10.0.1.50 is distributed across all announcing nodes using the switch's ECMP hash (typically based on source IP, destination IP, source port, destination port -- a 5-tuple hash).
  6. Each node that receives traffic uses kube-proxy/OVN to DNAT to a local backend Pod (or forward to another node via the overlay).

MetalLB BGP Mode -- Traffic Flow with ECMP:

  External Client (10.0.0.100)
       |
       v
  Physical Router / ToR Switch
  +--------------------------------------------------------------+
  |  BGP RIB (Routing Information Base):                          |
  |                                                                |
  |  Prefix: 10.0.1.50/32                                        |
  |    Next-hop: 10.0.0.11 (Node 1) -- via BGP peer, AS 64500    |
  |    Next-hop: 10.0.0.12 (Node 2) -- via BGP peer, AS 64500    |
  |    Next-hop: 10.0.0.13 (Node 3) -- via BGP peer, AS 64500    |
  |                                                                |
  |  ECMP hash (5-tuple): distribute across all 3 next-hops       |
  +------+----------------+----------------+----------------------+
         |                |                |
         v                v                v
    Node 1 (10.0.0.11) Node 2 (10.0.0.12) Node 3 (10.0.0.13)
    Speaker: BGP peer   Speaker: BGP peer  Speaker: BGP peer
    announces 10.0.1.50 announces 10.0.1.50 announces 10.0.1.50
         |                |                |
    kube-proxy/OVN:   kube-proxy/OVN:   kube-proxy/OVN:
    DNAT to local     DNAT to local     DNAT to local
    backend Pod       backend Pod       backend Pod
         |                |                |
    Pod 10.128.0.55   Pod 10.128.2.33   Pod 10.128.4.10
    (backend 1)       (backend 2)       (backend 3)

  Failover:
  - If Node 2 fails, its BGP session drops
  - ToR switch removes Node 2 from ECMP paths within 3-10 seconds
    (BFD can reduce this to <1 second)
  - Traffic redistributes across Node 1 and Node 3
  - No gratuitous ARP, no leader election -- pure BGP convergence

BGP mode advantages over Layer 2:

  True load balancing       -- Traffic is distributed across multiple nodes via ECMP; no single-node bottleneck.
  Fast failover             -- BGP session timeout (default 90s, configurable) or BFD (Bidirectional Forwarding Detection, <1s). Much faster than L2 gratuitous ARP.
  No L2 adjacency required  -- Service IPs can come from any routable prefix. The ToR learns the route via BGP, so the IPs do not need to share an L2 segment with the nodes.
  Scales to hundreds of IPs -- No ARP table pressure; routes live in the routing table, which scales better on enterprise switches.
  Multi-rack support        -- Works across racks: each rack's ToR peers with the MetalLB Speakers on the nodes in that rack, and spine switches aggregate the routes.

BGP mode requirements:

  BGP peering on ToR switches: The physical switches must support BGP and be configured to peer with MetalLB Speakers. Most enterprise switches (Arista, Cisco Nexus, Juniper, Cumulus) support this.
  ASN allocation: MetalLB Speakers need an ASN (Autonomous System Number). Typically a private ASN (64512-65534 for 16-bit, or 4200000000-4294967294 for 32-bit).
  ECMP support: The ToR switch must support ECMP for the announced /32 routes. Most modern switches do.
  BFD (optional but recommended): BFD provides sub-second failure detection. Without BFD, failover depends on the BGP hold timer (default 90s, typically tuned to 9-15s with keepalive 3-5s).
  Network team coordination: The network team must configure BGP peering, route policies (prefix filters to accept only the MetalLB IP pool prefixes), and ECMP on the physical switches. This is a cross-team dependency.

MetalLB CRDs (Custom Resource Definitions)

MetalLB is configured entirely via Kubernetes CRDs. The key resources:

IPAddressPool: Defines a pool of IP addresses that MetalLB can assign to Services.

apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: production-pool
  namespace: metallb-system
spec:
  addresses:
  - 10.0.1.0/24           # a CIDR range
  - 10.0.2.100-10.0.2.200 # or an explicit range
  autoAssign: true          # automatically assign from this pool
  avoidBuggyIPs: true       # skip .0 and .255 addresses

L2Advertisement: Configures Layer 2 mode advertisement for an IPAddressPool.

apiVersion: metallb.io/v1beta1
kind: L2Advertisement
metadata:
  name: production-l2
  namespace: metallb-system
spec:
  ipAddressPools:
  - production-pool
  interfaces:              # optional: restrict to specific interfaces
  - bond0
  nodeSelectors:           # optional: restrict to specific nodes
  - matchLabels:
      node-role.kubernetes.io/worker: ""

BGPAdvertisement: Configures BGP mode advertisement for an IPAddressPool.

apiVersion: metallb.io/v1beta1
kind: BGPAdvertisement
metadata:
  name: production-bgp
  namespace: metallb-system
spec:
  ipAddressPools:
  - production-pool
  aggregationLength: 32     # announce as /32 (default)
  communities:
  - 64500:100               # BGP community for route policy
  localPref: 100            # BGP local preference

BGPPeer: Defines a BGP peering session with an external router.

apiVersion: metallb.io/v1beta1
kind: BGPPeer
metadata:
  name: tor-switch-1
  namespace: metallb-system
spec:
  myASN: 64500              # MetalLB's ASN
  peerASN: 64501            # ToR switch's ASN
  peerAddress: 10.0.0.1     # ToR switch's IP
  nodeSelectors:             # which nodes peer with this switch
  - matchLabels:
      rack: rack-01
  bfdProfile: fast-detect    # optional BFD for sub-second failover
  holdTime: 15s              # BGP hold timer
  keepaliveTime: 5s          # BGP keepalive interval

BFDProfile: Defines BFD parameters for fast failure detection.

apiVersion: metallb.io/v1beta1
kind: BFDProfile
metadata:
  name: fast-detect
  namespace: metallb-system
spec:
  receiveInterval: 300       # ms between BFD packets (receive)
  transmitInterval: 300      # ms between BFD packets (transmit)
  detectMultiplier: 3        # detect failure after 3 missed packets
                             # = 900ms detection time
  echoMode: false            # echo mode disabled (requires kernel support)
  passiveMode: false         # active mode (initiates BFD session)
  minimumTtl: 254            # TTL for BFD packets (security: single-hop)

MetalLB for KubeVirt VM Services

MetalLB is particularly valuable for KubeVirt VMs that expose non-HTTP services -- services that cannot use an OpenShift Route (which is L7/HTTP only). Common patterns:

Database VMs: A PostgreSQL or Oracle database running in a KubeVirt VM needs a stable, external IP on port 5432/1521. A Service with type: LoadBalancer and MetalLB provides this.

apiVersion: v1
kind: Service
metadata:
  name: oracle-db-external
  namespace: databases
  annotations:
    metallb.universe.tf/address-pool: database-pool
    metallb.universe.tf/loadBalancerIPs: 10.0.1.100  # request specific IP
spec:
  type: LoadBalancer
  selector:
    kubevirt.io/domain: oracle-prod-01
  ports:
  - name: oracle
    port: 1521
    targetPort: 1521
    protocol: TCP
  - name: oracle-https
    port: 5500
    targetPort: 5500
    protocol: TCP
  externalTrafficPolicy: Local  # preserve client source IP

SSH/RDP access to VMs: VMs that need direct SSH or RDP access can use MetalLB LoadBalancer Services. This replaces the VMware pattern of directly reaching a VM on its VLAN IP.

Legacy protocol servers: VMs running LDAP, SMTP, SIP, or other non-HTTP protocols need L4 load balancing, not L7. MetalLB provides this without requiring an external load balancer.

externalTrafficPolicy: Local: By default, kube-proxy/OVN may forward traffic to a backend Pod on a different node, which causes SNAT (the Pod sees the node IP as the client IP, not the real client IP). Setting externalTrafficPolicy: Local restricts traffic to backend Pods on the same node that received the traffic. This preserves the client source IP (important for access logging, IP-based authentication) but means that nodes without local backend Pods will not attract traffic. In BGP mode with externalTrafficPolicy: Local, MetalLB only announces the route from nodes that have a local backend Pod -- providing optimal traffic distribution.

externalTrafficPolicy comparison:

  Cluster (default):
    Client (10.0.0.100) --> Node 1 (MetalLB) --> SNAT --> Node 2 --> Pod
    Pod sees source IP: 10.0.0.11 (Node 1's IP)  <-- client IP lost
    Pro: traffic reaches any backend Pod, even if not on the receiving node
    Con: SNAT hides client IP

  Local:
    Client (10.0.0.100) --> Node 2 (MetalLB, only if Pod is on Node 2) --> Pod
    Pod sees source IP: 10.0.0.100 (real client IP)  <-- client IP preserved
    Pro: client source IP preserved, no extra hop
    Con: only nodes with local Pods attract traffic (BGP mode handles this
         correctly by only announcing from nodes with backends)

Alternative Load Balancers

MetalLB is not the only option for bare-metal load balancing. Enterprise environments may use external load balancers alongside or instead of MetalLB:

F5 BIG-IP + Container Ingress Services (CIS): F5's CIS operator runs in the cluster and watches for Kubernetes Services, Ingresses, and custom CRDs (VirtualServer, TransportServer). It programs the external F5 BIG-IP appliance with pool members (node IPs + NodePorts) and virtual servers. This allows the existing F5 infrastructure to serve as the LoadBalancer provider for Kubernetes Services. Advantages: mature ADC features (WAF, iRules, persistence profiles, SSL offload). Disadvantages: requires maintaining external F5 infrastructure, licensing costs, and the F5 is outside the Kubernetes management plane.
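
As a sketch of how CIS maps Kubernetes resources to the appliance, a minimal VirtualServer custom resource is shown below -- the hostname, VIP, and Service name are hypothetical, and field names should be checked against the installed CIS CRD version:

apiVersion: cis.f5.com/v1
kind: VirtualServer
metadata:
  name: web-app-vs
  namespace: web-apps
spec:
  host: app.example.com             # hostname matched by the BIG-IP virtual server
  virtualServerAddress: 10.0.5.10   # VIP programmed on the external BIG-IP
  pools:
  - path: /
    service: web-app-svc            # Kubernetes Service backing the BIG-IP pool
    servicePort: 8080

CIS translates this into a virtual server at 10.0.5.10 on the BIG-IP, with a pool whose members are the endpoints of web-app-svc (node/NodePort or Pod IPs, depending on the CIS mode).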

Citrix ADC + Citrix Ingress Controller: Similar to F5 CIS but for Citrix ADC (NetScaler). The Citrix Ingress Controller programs the external Citrix ADC with configuration based on Kubernetes resources. Supports content switching, SSL offload, rate limiting, and WAF features.

NSX Advanced Load Balancer (Avi): Can be used in OVE environments as an external load balancer. The Avi Kubernetes Operator (AKO) integrates with Kubernetes to provide L4/L7 load balancing, WAF, GSLB, and analytics. This is a viable option for organizations already using Avi in their VMware environment -- they can continue using the same Avi infrastructure after migrating to OVE.

kube-vip: A lightweight alternative to MetalLB that provides VIP management using either ARP (similar to MetalLB L2) or BGP. kube-vip can also manage the control plane VIP (the API server endpoint), which MetalLB does not handle. kube-vip is simpler but less feature-rich than MetalLB.

OpenShift's built-in MetalLB Operator: OpenShift includes MetalLB as a supported operator via the OperatorHub. The MetalLB Operator manages the lifecycle of MetalLB (installation, upgrades, configuration) and integrates with OpenShift's monitoring stack for alerting on MetalLB speaker failures, BGP session drops, and IP pool exhaustion.
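
After the operator is installed, deployment of the controller and speaker components is triggered by a single MetalLB custom resource -- a minimal sketch is shown below (the resource must be named metallb; the node selector is optional and shown only as an example):

apiVersion: metallb.io/v1beta1
kind: MetalLB
metadata:
  name: metallb
  namespace: metallb-system
spec:
  nodeSelector:                        # optional: run speakers only on worker nodes
    node-role.kubernetes.io/worker: ""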

Load Balancer Options for Bare-Metal OVE:

  +--Cluster---------------------------------------------+
  |                                                       |
  |  Option A: MetalLB (in-cluster, L4)                  |
  |    Speaker DaemonSet announces IPs via ARP or BGP    |
  |    Pro: native, simple, no external infrastructure    |
  |    Con: L4 only, no WAF/ADC features                 |
  |                                                       |
  |  Option B: F5 CIS / Citrix IC (external LB)         |
  |    Operator in cluster programs external F5/Citrix    |
  |    Pro: enterprise ADC features, existing infra       |
  |    Con: external dependency, licensing cost            |
  |                                                       |
  |  Option C: MetalLB + external LB (hybrid)            |
  |    MetalLB for L4 services (databases, SSH)           |
  |    F5/Citrix for L7 services (web apps, APIs)         |
  |    Pro: best of both worlds                            |
  |    Con: two load balancing systems to manage           |
  |                                                       |
  |  Option D: Avi (AKO) -- external L4/L7              |
  |    Avi SE VMs provide both L4 and L7 LB               |
  |    Pro: if already using Avi, continuity               |
  |    Con: VM-based SE overhead, Avi licensing            |
  |                                                       |
  +-------------------------------------------------------+

Comparison to Traditional Load Balancer Appliances

For a team accustomed to F5 BIG-IP or Citrix ADC, the shift to MetalLB + HAProxy-based OpenShift Router requires a mental model change:

  Architecture: F5/Citrix -- dedicated appliance (hardware or VM) external to the compute cluster. MetalLB + Router -- software running inside the cluster (Pods/DaemonSets).
  L4 load balancing: F5/Citrix -- virtual server on the appliance, pool of real servers. MetalLB + Router -- MetalLB assigns the external IP, kube-proxy/OVN handles DNAT to backend Pods.
  L7 load balancing: F5/Citrix -- content switching, URL rewriting, header manipulation, iRules. MetalLB + Router -- OpenShift Route + HAProxy (host/path routing, TLS termination).
  Health checks: F5/Citrix -- advanced monitors (L3-L7, scripted, adaptive). MetalLB + Router -- HAProxy: basic TCP/HTTP checks; Kubernetes readiness/liveness probes influence endpoint membership, not LB health.
  Persistence: F5/Citrix -- cookie, source IP, SSL session ID, universal persistence. MetalLB + Router -- HAProxy cookie or source IP (via Route annotation); MetalLB source IP hash (BGP mode with externalTrafficPolicy: Local).
  SSL offload: F5/Citrix -- hardware-accelerated SSL, client certificate auth, SSL profiles. MetalLB + Router -- HAProxy software TLS termination (CPU-bound, but fast on modern CPUs).
  WAF: F5/Citrix -- built-in (F5 ASM, Citrix AppFirewall). MetalLB + Router -- not available; requires an external WAF (ModSecurity, cloud WAF, or F5/Citrix in front).
  Management: F5/Citrix -- GUI (F5 TMUI, Citrix ADM), API, Terraform provider. MetalLB + Router -- Kubernetes API (YAML manifests), GitOps, oc CLI.
  HA model: F5/Citrix -- active-standby or active-active pair (dedicated failover protocol). MetalLB + Router -- MetalLB leader election (L2) or ECMP (BGP); router runs multiple replicas behind a MetalLB or physical LB.
  Cost: F5/Citrix -- significant licensing plus hardware/VM costs. MetalLB + Router -- MetalLB is free (open source); HAProxy is included in the OpenShift subscription.
  Operational model: F5/Citrix -- network team manages the appliance independently. MetalLB + Router -- platform team manages LB as part of the Kubernetes platform (or delegates via CRDs).

The shift in operational model is the most significant change. In the VMware world, the load balancer is typically managed by a dedicated network team using the F5 TMUI or Citrix ADM GUI. In the OVE world, load balancer configuration is a Kubernetes resource (Service, Route, MetalLB CRDs) managed by the platform team or application teams via YAML and GitOps. The network team's role shifts from "configure the F5" to "provide BGP peering and IP pools for MetalLB."
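
As a concrete example of that shift, behavior an F5 administrator would set in a persistence or timeout profile is declared on the Route itself through annotations -- a hedged sketch with hypothetical names:

apiVersion: route.openshift.io/v1
kind: Route
metadata:
  name: legacy-app
  namespace: legacy-apps
  annotations:
    haproxy.router.openshift.io/balance: source   # source-IP-based persistence
    haproxy.router.openshift.io/timeout: 120s     # backend response timeout
spec:
  host: legacy-app.example.com
  to:
    kind: Service
    name: legacy-app-svc
  port:
    targetPort: 8080
  tls:
    termination: edge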


How the Candidates Handle This

Comparison Table

Networking model
  VMware (current): VM gets IP on a port group/VLAN. vDS handles L2 switching. IP is stable, long-lived.
  OVE: Kubernetes model: every Pod/VM gets a Pod IP from the Pod CIDR. IP is ephemeral (stable Service IPs via ClusterIP). Secondary VLAN IPs via Multus for VMs needing stable addresses.
  Azure Local: Hyper-V vSwitch with SDN Network Controller. VMs get IPs from HNV-managed logical networks. Model similar to VMware (IP on a logical network).
  Swisscom ESC: Managed VMs on Swisscom-provided network segments. Customer-assigned IPs on Swisscom-managed VLANs. VMware-like model (stable IPs on segments).

Service discovery
  VMware (current): DNS (manual registration or IPAM integration). No automatic service discovery. Applications hardcode IPs or use DNS.
  OVE: CoreDNS: automatic svc.ns.svc.cluster.local resolution. Headless Services for stateful apps. Full DNS-based service discovery.
  Azure Local: Azure DNS integration. Limited automatic service discovery within the cluster.
  Swisscom ESC: Customer-managed DNS. No automatic service discovery.

L7 ingress
  VMware (current): NSX ALB (Avi), F5 BIG-IP, Citrix ADC, or NSX T1 LB. External to the virtualization platform.
  OVE: OpenShift Route + HAProxy router (built-in). Supports edge, passthrough, re-encrypt TLS. Router sharding for scale.
  Azure Local: Azure Application Gateway (if deployed). NGINX Ingress Controller available on AKS-HCI.
  Swisscom ESC: Swisscom-managed reverse proxy or customer-deployed LB. Limited L7 features.

L4 load balancing
  VMware (current): NSX T1 LB, F5 BIG-IP, or physical LB appliance.
  OVE: MetalLB: L2 (ARP) or BGP mode for LoadBalancer Services. External F5/Citrix via CIS operator also supported.
  Azure Local: Azure SDN Load Balancer (software-defined, integrated with Network Controller).
  Swisscom ESC: Swisscom-managed LB infrastructure. Customer has limited visibility.

External IP assignment
  VMware (current): Static IP on VLAN (manual or IPAM). VIP on F5/NSX LB.
  OVE: MetalLB allocates from IPAddressPool. Advertised via ARP (L2) or BGP.
  Azure Local: SDN Network Controller allocates VIPs. Software-defined LB assigns IPs.
  Swisscom ESC: Swisscom allocates IPs from managed pools.

BGP integration
  VMware (current): NSX Edge supports BGP for dynamic routing to upstream routers.
  OVE: MetalLB BGP mode: peers with ToR switches, announces Service IPs as /32 routes with ECMP. Full BGP support.
  Azure Local: Limited BGP support. Azure SDN Network Controller manages routing internally. No customer-facing BGP.
  Swisscom ESC: Managed by Swisscom.

TLS termination
  VMware (current): NSX ALB / F5: hardware-accelerated SSL, client cert auth, SSL profiles.
  OVE: HAProxy: software TLS (edge, passthrough, re-encrypt). cert-manager for certificate lifecycle.
  Azure Local: NGINX / Application Gateway: software TLS.
  Swisscom ESC: Managed TLS via Swisscom reverse proxy.

WAF
  VMware (current): F5 ASM, Citrix AppFirewall, or NSX ALB WAF. Integrated with LB.
  OVE: Not built-in. Requires external WAF (ModSecurity, cloud WAF, F5 in front of router).
  Azure Local: Azure WAF (if Application Gateway deployed).
  Swisscom ESC: Swisscom-managed WAF (optional add-on).

Health checks
  VMware (current): F5: advanced L3-L7 monitors, scripted checks. NSX ALB: adaptive health scoring.
  OVE: HAProxy: TCP/HTTP checks. Kubernetes readiness probes control endpoint membership. Limited compared to F5/Avi.
  Azure Local: Standard health probes via SDN LB.
  Swisscom ESC: Managed by Swisscom.

HA model for LB
  VMware (current): F5: active-standby pair with failover. NSX ALB: distributed SE (multi-node active-active).
  OVE: MetalLB L2: leader election, ~10s failover. MetalLB BGP: ECMP, <1s with BFD. Router: multiple replicas.
  Azure Local: SDN LB: managed by Network Controller. HA is infrastructure-level.
  Swisscom ESC: Managed HA by Swisscom.

Operational model
  VMware (current): Network team manages F5/Citrix via GUI. Separate from vSphere administration.
  OVE: Platform team manages MetalLB + Routes via Kubernetes API (YAML, GitOps). Network team provides BGP peering.
  Azure Local: Hybrid: some via Windows Admin Center, some via Azure Portal, PowerShell for advanced tasks.
  Swisscom ESC: Swisscom manages everything. Customer opens tickets for changes.

KubeVirt VM ingress
  VMware (current): N/A (VMs are directly reachable on their VLAN IPs).
  OVE: Service + Route (HTTP), Service type LoadBalancer (non-HTTP), or secondary VLAN IP via Multus.
  Azure Local: VMs directly reachable on logical network IPs. LoadBalancer via SDN LB.
  Swisscom ESC: VMs directly reachable on managed network IPs.

Key Differences in Prose

The networking model paradigm shift is unique to OVE. VMware, Azure Local, and Swisscom ESC all follow the "VM gets a stable IP on a network segment" model. Only OVE introduces the Kubernetes networking model where Pod IPs are ephemeral and Services provide stable endpoints. This is the most significant operational change for the networking team. It does not make OVE worse -- it makes it different. The Kubernetes model provides automatic service discovery (CoreDNS), declarative load balancing (Services), and self-healing endpoint management (EndpointSlices). But it requires the team to abandon the mental model of "a VM has an IP" and adopt the model of "a Service has an IP, and Pods behind it come and go."

MetalLB BGP mode is the production recommendation for OVE. Layer 2 mode is acceptable for lab environments and small clusters, but for a Tier-1 enterprise with hundreds of LoadBalancer Services, BGP mode with ECMP provides true load distribution, sub-second failover (with BFD), and scalability beyond the L2 single-node bottleneck. This requires coordination with the network team to configure BGP peering on the ToR switches -- which is a cross-team dependency that should be addressed early in the project.

Azure Local has a built-in software-defined load balancer. The Azure Local SDN Network Controller provides LB functionality integrated with the Hyper-V networking stack. This is simpler to set up than MetalLB (no BGP configuration needed) but less flexible (limited community and tooling, no BGP announcement to upstream routers). Azure Local's LB model is closer to VMware's NSX LB than to MetalLB.

Swisscom ESC abstracts load balancing entirely. The customer consumes managed LB services through the service portal. There is no MetalLB, no BGP configuration, and no HAProxy router to manage. This is the simplest operational model but the least flexible. Workloads that need custom load balancing configurations, specific health check logic, or direct BGP integration cannot be accommodated within the ESC model.

The HAProxy-based OpenShift router is not a full ADC replacement. Organizations currently using F5 BIG-IP or Citrix ADC for WAF, GSLB, advanced health monitoring, or iRules/policies will find the OpenShift router lacks these features. The recommended pattern is hybrid: use the OpenShift router for standard L7 routing (host/path-based, TLS termination) and keep the existing F5/Citrix for advanced ADC features (WAF, GSLB, connection multiplexing). The F5 CIS operator or Citrix Ingress Controller can bridge both worlds by programming the external appliance from Kubernetes resources.

Router sharding is the scaling mechanism for L7 ingress. A single HAProxy deployment handling 10,000+ Routes will experience reload latency and configuration bloat. Router sharding splits the workload across multiple HAProxy deployments, each with its own external IP and DNS domain. For an organization with 5,000+ VMs, many of which expose HTTP services, plan for 3-5 sharded routers from day one: separate production external, internal API, development, and legacy VM routers.


Key Takeaways


Discussion Guide

The following questions are designed for vendor workshops, SME deep-dives, and PoC validation sessions. They test whether the vendor or SME has actual production experience with Kubernetes networking, MetalLB BGP, and OpenShift routing at enterprise scale -- not just documentation-level familiarity.

1. Kubernetes Networking Model for 5,000+ VMs

"We are migrating 5,000+ VMs from VMware to OVE. In VMware, every VM has a stable IP on a port group. In Kubernetes, Pod IPs are ephemeral. Walk us through how we maintain service reachability for these VMs. Which VMs should use the primary OVN interface only, which should get Multus secondary VLAN interfaces, and what is the decision framework? How do we handle VMs with hardcoded IP dependencies?"

Purpose: Tests understanding of the dual-network model and practical migration strategy. The answer should include: (1) VMs that expose HTTP services and integrate with the Kubernetes ecosystem (service mesh, monitoring, auto-scaling) should use the primary OVN interface + Services. (2) VMs that need stable IPs for legacy protocols, VLAN-specific access, or storage network traffic should use Multus secondary interfaces. (3) Hardcoded IP dependencies are the biggest migration challenge -- they require either application refactoring (use DNS/Service names), static IP assignment on a Multus VLAN interface (preserving the VMware-like model), or a combination where the VM has both a primary OVN interface (for Kubernetes integration) and a secondary VLAN interface (for legacy IP reachability). (4) A realistic decision framework: modern, HTTP-based, stateless services use primary only; databases and stateful services use primary + VLAN secondary; legacy, unreachable-for-refactoring VMs use VLAN secondary as their primary network path.
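
A hedged sketch of the combined pattern from point (3): a KubeVirt VirtualMachine with a primary pod-network interface plus a Multus secondary VLAN interface. The names, namespace, NetworkAttachmentDefinition, and DataVolume are hypothetical placeholders:

apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
  name: legacy-app-01
  namespace: legacy-apps
spec:
  runStrategy: Always
  template:
    spec:
      domain:
        resources:
          requests:
            memory: 4Gi
        devices:
          disks:
          - name: rootdisk
            disk:
              bus: virtio
          interfaces:
          - name: default
            masquerade: {}              # primary interface on the OVN pod network
          - name: vlan110
            bridge: {}                  # secondary interface bridged to the VLAN
      networks:
      - name: default
        pod: {}
      - name: vlan110
        multus:
          networkName: vlan-110-nad     # NetworkAttachmentDefinition for the VLAN
      volumes:
      - name: rootdisk
        dataVolume:
          name: legacy-app-01-root      # references an existing DataVolume/PVC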

2. MetalLB BGP Deployment and Network Team Coordination

"We want to deploy MetalLB in BGP mode. Our data center uses a spine-leaf fabric with Arista ToR switches. Walk us through the BGP peering configuration on both the MetalLB side and the switch side. What ASN should we use? How do we configure prefix filters to ensure MetalLB only announces the IP pools and nothing else? What happens if a MetalLB Speaker announces a route that conflicts with an existing network prefix?"

Purpose: Tests real BGP operational experience. The answer should include: (1) Use a private ASN (e.g., 64500 for MetalLB, 64501 for ToR switches or use the existing ToR ASN). (2) On the MetalLB side: BGPPeer CRD with the ToR switch IP, ASN, node selectors per rack, hold/keepalive timers, and BFD profile. (3) On the Arista side: router bgp <ASN>, neighbor <node-ip> remote-as 64500, neighbor <node-ip> route-map METALLB-IN in with a prefix-list that accepts only the MetalLB IP pool ranges (e.g., ip prefix-list METALLB-ALLOWED permit 10.0.1.0/24 le 32). (4) If MetalLB announces a conflicting prefix, the longest-match routing rule means the /32 announcement takes precedence over a shorter prefix -- which can hijack traffic from the existing network. This is why prefix filters on the ToR are critical. (5) Test BGP announcements in a staging environment before production. Use show ip bgp neighbors <node-ip> received-routes on the Arista switch to verify what MetalLB is announcing.

3. OpenShift Router Sharding Strategy

"We expect 3,000+ Routes across production, staging, and internal services. How should we shard the OpenShift routers? How many IngressControllers do we need? How do we handle DNS for each shard? What is the blast radius if one sharded router fails? How do we monitor HAProxy reload times and detect configuration errors?"

Purpose: Tests L7 ingress scaling knowledge. The answer should include: (1) Recommended sharding: 4 IngressControllers -- default (catch-all, staging/dev), production-external (production Routes with exposure=external label), internal-api (internal-only Routes), legacy-vms (Routes for KubeVirt VM services). (2) Each IngressController gets its own MetalLB LoadBalancer Service (external IP) or is fronted by the physical F5. (3) DNS: wildcard records per shard -- *.prod.example.com -> production-external IP, *.internal.example.com -> internal-api IP. (4) Blast radius: router failure affects only the Routes in its shard. Production external traffic is isolated from internal API failures. (5) Monitoring: HAProxy stats endpoint (enable via IngressController spec), Prometheus metrics for reload duration, active connections, error rates, and 5xx responses. Alert on HAProxy reload time > 5s (indicates large config) and on repeated reload failures.
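
A hedged sketch of one such shard -- the domain, label convention (exposure: external), and infra-node placement are assumptions to adapt:

apiVersion: operator.openshift.io/v1
kind: IngressController
metadata:
  name: production-external
  namespace: openshift-ingress-operator
spec:
  domain: prod.example.com              # Routes admitted here live under *.prod.example.com
  replicas: 3
  routeSelector:
    matchLabels:
      exposure: external                # only Routes carrying this label are admitted
  endpointPublishingStrategy:
    type: LoadBalancerService           # external IP comes from MetalLB on bare metal
  nodePlacement:
    nodeSelector:
      matchLabels:
        node-role.kubernetes.io/infra: ""

Routes opt into this shard by carrying the exposure: external label, and the wildcard DNS record *.prod.example.com points at the external IP MetalLB assigns to this IngressController's LoadBalancer Service.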

4. MetalLB Layer 2 vs BGP Mode Decision

"We have two data centers. DC-1 has a spine-leaf fabric with BGP-capable Arista switches. DC-2 has an older flat network with Cisco Catalyst switches that do not support BGP peering. How do we deploy MetalLB in each DC? Can we mix L2 and BGP mode in the same cluster? What are the failover characteristics in each mode?"

Purpose: Tests practical deployment experience across different network environments. The answer should include: (1) DC-1: BGP mode with ECMP on Arista ToR switches. Full benefits: true load balancing, sub-second failover with BFD. (2) DC-2: Layer 2 mode. Simpler deployment, works on any switch. Limitations: single-node bottleneck per Service IP, ~10s failover via gratuitous ARP. (3) Mixing modes: yes, MetalLB supports this -- create separate IPAddressPool resources with separate L2Advertisement and BGPAdvertisement resources for each pool. Nodes in DC-1 use BGP for their pool; nodes in DC-2 use L2 for theirs. (4) For stretched clusters across two DCs (not recommended but sometimes required), MetalLB BGP mode can announce the same /32 from nodes in both DCs if both ToR switches peer with MetalLB. ECMP distributes traffic across DCs -- but this requires careful BGP community and local-preference tuning to prefer local-DC backends.
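
A hedged sketch of the per-DC split: one pool advertised via BGP from DC-1 nodes and one via L2 from DC-2 nodes (the CIDRs and zone labels are hypothetical):

apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: dc1-pool
  namespace: metallb-system
spec:
  addresses:
  - 10.10.1.0/24
---
apiVersion: metallb.io/v1beta1
kind: BGPAdvertisement
metadata:
  name: dc1-bgp
  namespace: metallb-system
spec:
  ipAddressPools:
  - dc1-pool
  nodeSelectors:
  - matchLabels:
      topology.kubernetes.io/zone: dc-1   # announce only from DC-1 nodes
---
apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: dc2-pool
  namespace: metallb-system
spec:
  addresses:
  - 10.20.1.0/24
---
apiVersion: metallb.io/v1beta1
kind: L2Advertisement
metadata:
  name: dc2-l2
  namespace: metallb-system
spec:
  ipAddressPools:
  - dc2-pool
  nodeSelectors:
  - matchLabels:
      topology.kubernetes.io/zone: dc-2   # gratuitous ARP only from DC-2 nodes

A BGPPeer resource scoped to the DC-1 nodes (via its own node selectors) is still required so those Speakers actually peer with the Arista ToR switches.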

5. TLS Certificate Lifecycle for Routes

"We have 500+ Routes that need publicly trusted TLS certificates from our enterprise CA (managed by Venafi). How do we automate certificate issuance and renewal? What happens when a certificate expires? How do we handle certificate rotation without downtime? What is the integration path between cert-manager and Venafi?"

Purpose: Tests certificate management automation knowledge. The answer should include: (1) Deploy cert-manager with a Venafi ClusterIssuer (cert-manager has a native Venafi integration). (2) Create Certificate resources that reference the ClusterIssuer and the Route's domain name. cert-manager creates the TLS secret and auto-renews before expiration (configurable via renewBefore). (3) If a certificate expires without renewal: the OpenShift router continues serving the expired certificate (HAProxy does not crash, but clients will get TLS warnings/errors). (4) Rotation without downtime: cert-manager updates the TLS secret, the router detects the secret change and performs a hot reload. HAProxy's seamless reload mechanism ensures no connection drops. (5) Venafi integration: cert-manager's Venafi issuer supports both Venafi Cloud and Venafi TPP. It submits a CSR to Venafi, Venafi issues the certificate according to the enterprise policy (key type, validity, SANs), and cert-manager stores it as a Kubernetes secret.
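
A hedged sketch of the Venafi TPP integration: a ClusterIssuer plus one Certificate whose secret a Route (or the router default certificate) can consume. The TPP URL, zone path, and secret names are hypothetical:

apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: venafi-tpp
spec:
  venafi:
    zone: 'OVE\Production'                   # Venafi policy folder
    tpp:
      url: https://tpp.example.com/vedsdk    # Venafi TPP API endpoint
      credentialsRef:
        name: venafi-tpp-credentials         # access-token secret in the cert-manager namespace
---
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: app-example-com
  namespace: web-apps
spec:
  secretName: app-example-com-tls            # TLS secret consumed by the Route
  dnsNames:
  - app.example.com
  duration: 2160h                            # 90-day certificate
  renewBefore: 360h                          # renew 15 days before expiry
  issuerRef:
    name: venafi-tpp
    kind: ClusterIssuer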

6. Service Exposure Strategy for KubeVirt VMs

"We have a mix of workloads migrating from VMware: 200 web application VMs (HTTP/HTTPS), 50 database VMs (PostgreSQL, Oracle), 30 legacy VMs running proprietary TCP protocols on non-standard ports, and 20 VMs that require direct VLAN access for storage. Design the service exposure strategy for each group."

Purpose: Tests practical architecture design. The answer should include: (1) Web apps: OpenShift Routes with edge TLS termination, cert-manager for certificates, router sharding for production vs staging. (2) Databases: MetalLB LoadBalancer Services with specific IPs from a dedicated database-pool. externalTrafficPolicy: Local for source IP preservation. Consider headless Services for database replicas that need peer discovery. (3) Legacy TCP VMs: MetalLB LoadBalancer Services with custom port definitions. If the protocols require the client source IP, use externalTrafficPolicy: Local. If protocols are latency-sensitive, consider whether MetalLB's DNAT hop is acceptable or if a Multus VLAN interface with direct access is better. (4) Storage VMs: Multus secondary interfaces on the storage VLAN. No Kubernetes Service or Route -- direct VLAN IP access. These VMs bypass the Kubernetes networking model on their storage interface. Ensure the storage VLAN is trunked to the correct worker nodes.

7. Troubleshooting a "Service Not Reachable" Scenario

"A production KubeVirt VM is fronted by a MetalLB LoadBalancer Service in BGP mode. External clients report the service is unreachable. Walk us through the troubleshooting process from the physical network to the VM guest OS. What tools do you use at each layer?"

Purpose: Tests troubleshooting depth and tool familiarity. The answer should cover the full stack: (1) Physical network: Check BGP session status on the ToR switch (show ip bgp summary). Is the MetalLB Speaker's BGP session established? Is the /32 route present in the routing table (show ip route <service-ip>)? If the session is down, check node connectivity and MetalLB Speaker pod logs. (2) MetalLB: oc get svc -o wide -- does the Service have an external IP? oc logs -n metallb-system -- any errors in Speaker or Controller? oc get bgppeer -- is the peer configured correctly? (3) Node level: Is the Service IP programmed in OVN? ovn-nbctl lb-list -- is the load balancer entry present? ovn-sbctl list port_binding -- is the backend port bound? (4) OVS flows: ovs-ofctl dump-flows br-int -- are there flow rules for the Service IP DNAT? (5) Pod level: oc get endpoints <svc-name> -- are there healthy endpoints? oc get pods -l kubevirt.io/domain=<vm-name> -- is the virt-launcher pod running? (6) VM level: virtctl console <vm-name> -- is the VM guest OS responding? Check guest networking (ip addr, ss -tlnp -- is the service listening on the expected port?).

8. MetalLB IP Address Pool Sizing and Management

"How do we size the MetalLB IP address pools for a cluster with 5,000+ VMs? How many LoadBalancer Services do we expect? How do we prevent pool exhaustion? Can we dynamically expand a pool? How do we handle overlapping pools or IP conflicts with existing DHCP ranges?"

Purpose: Tests capacity planning knowledge. The answer should include: (1) Not every VM needs a LoadBalancer Service -- only VMs with externally-facing services. A typical ratio is 10-20% of VMs needing external L4 IPs (the rest use Routes for HTTP or are internal-only). For 5,000 VMs, plan for 500-1,000 LoadBalancer Service IPs. (2) Pool sizing: allocate a /22 (1,022 usable IPs) or two /23s. Use separate pools for different tiers (production, staging, database). (3) Pool exhaustion: monitor via Prometheus metrics (metallb_allocator_addresses_in_use_total vs metallb_allocator_addresses_total). Alert at 80% utilization. (4) Dynamic expansion: add new CIDR ranges to the IPAddressPool resource -- MetalLB picks up the change immediately. No restart needed. (5) IP conflicts: coordinate with the IPAM team to ensure MetalLB pools are reserved in the corporate IPAM database and excluded from DHCP ranges. MetalLB does not check for IP conflicts -- if an IP in the pool is already in use on the network, MetalLB will assign it, causing a conflict.
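
A hedged sketch of the 80% utilization alert, using the MetalLB metrics named above (the rule name, threshold, and severity label are assumptions to adapt):

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: metallb-pool-alerts
  namespace: metallb-system
spec:
  groups:
  - name: metallb.pools
    rules:
    - alert: MetalLBAddressPoolNearlyExhausted
      expr: |
        metallb_allocator_addresses_in_use_total
          / metallb_allocator_addresses_total > 0.8
      for: 15m
      labels:
        severity: warning
      annotations:
        summary: "MetalLB pool {{ $labels.pool }} is more than 80% allocated"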

9. HAProxy Performance and Tuning

"Our production OpenShift router handles 50,000 active connections and 20,000 requests per second. We notice intermittent 503 errors during Route changes. What is causing this? How do we tune HAProxy for this scale? What are the key parameters to adjust?"

Purpose: Tests HAProxy operational knowledge at scale. The answer should include: (1) 503 errors during Route changes are caused by HAProxy reloads. Each Route change triggers a config regeneration and hot reload. The hot reload spawns a new HAProxy process -- during the transition, new connections may briefly fail if the new process has not fully started. (2) Tuning: increase ROUTER_MAX_CONNECTIONS (default 20000, set to 100000+), set RELOAD_INTERVAL to batch Route changes (e.g., 5s instead of immediate reload on every change), increase ROUTER_THREADS to match available CPU cores. (3) Key parameters: timeout client (client-side timeout), timeout server (backend timeout), timeout tunnel (WebSocket/long-lived connection timeout -- default 1h, increase for long-lived VM connections). (4) For 50K active connections, ensure the router pods have sufficient memory (HAProxy uses ~2KB per connection in keep-alive mode) and CPU (2-4 cores for 20K req/s). (5) Consider router sharding to split the load -- 3 sharded routers with 7,000 connections each are more manageable than 1 router with 50,000.
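
A hedged sketch of how this tuning maps onto the IngressController API (tuningOptions fields vary by OpenShift version, so verify availability before applying; the values mirror the guidance above):

apiVersion: operator.openshift.io/v1
kind: IngressController
metadata:
  name: default
  namespace: openshift-ingress-operator
spec:
  replicas: 3
  tuningOptions:
    maxConnections: 100000     # raise the per-pod HAProxy maxconn
    threadCount: 8             # match HAProxy threads to available CPU cores
    reloadInterval: 5s         # batch Route changes to reduce reload churn
    tunnelTimeout: 4h          # long-lived WebSocket / console connections
    clientTimeout: 60s
    serverTimeout: 60s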

10. End-to-End Packet Walk: External Client to KubeVirt VM

"An external client sends an HTTPS request to https://app.example.com, which is served by a KubeVirt VM running NGINX on its primary OVN interface. The VM is behind an OpenShift Route with edge TLS termination, the router has a MetalLB BGP-assigned external IP, and the VM is on a different node than the router. Walk us through every packet transformation from the client to the VM and back."

Purpose: Tests complete, end-to-end packet-level understanding. The correct answer traces the full path: (1) Client DNS resolves app.example.com to MetalLB IP (e.g., 10.0.1.50). (2) Client sends TCP SYN to 10.0.1.50:443. (3) ToR switch has /32 BGP route for 10.0.1.50 via ECMP -- selects Node 1 (router node) based on 5-tuple hash. (4) Packet arrives at Node 1. OVN load balancer matches the MetalLB IP -> router pod Service -> DNATs to router Pod IP 10.128.0.20:443. (5) HAProxy in the router pod receives the TLS ClientHello, extracts SNI (app.example.com), terminates TLS, reads the HTTP request. (6) HAProxy matches the Route for app.example.com -> backend Service nginx-svc -> endpoint Pod IP 10.128.4.15:8080 (VM on Node 3). (7) HAProxy sends plain HTTP to 10.128.4.15:8080. This packet enters the OVN overlay. (8) OVN on Node 1 encapsulates the packet in GENEVE: outer IP 10.0.0.11 (Node 1) -> 10.0.0.13 (Node 3), outer UDP port 6081. (9) Packet traverses spine-leaf fabric (standard IP routing) to Node 3. (10) OVN on Node 3 decapsulates GENEVE, delivers to the virt-launcher pod's veth interface. (11) The tap device in the pod bridges to the VM's virtio NIC. (12) NGINX inside the VM receives the HTTP request on port 8080 and responds. (13) Response follows the reverse path: VM -> tap -> veth -> OVN -> GENEVE (Node 3 -> Node 1) -> HAProxy (re-encrypts response with TLS for client) -> OVN -> MetalLB IP -> ToR -> Client.