Networking Foundational Concepts
Why This Matters
Every VM running in a data center depends on the network for three things: reaching other VMs (east-west), reaching users and external systems (north-south), and reaching shared storage (storage traffic). When you operate 5,000+ VMs, the network is not "plumbing" -- it is the primary determinant of latency, security posture, failure domain size, and operational complexity. A misconfigured MTU silently fragments storage replication traffic. A bonding mode mismatch causes asymmetric hashing and saturates one link while the other idles. A missing VLAN trunk drops an entire application tier during live migration.
The platform candidates under evaluation -- OVE, Azure Local, and Swisscom ESC -- each make fundamentally different architectural choices about how networking is realized. Understanding the foundational concepts below is necessary to evaluate those choices, to ask the right questions during vendor workshops, and to design a network architecture that will serve the organization for the next 5-10 years.
This document covers the "vocabulary layer" -- the concepts that every subsequent networking topic (overlays, SR-IOV, micro-segmentation, ingress routing) builds upon. If the team owns these eight topics, the advanced material becomes a matter of learning product-specific implementations rather than learning entirely new ideas.
Concepts
1. VLANs (Virtual Local Area Networks)
What It Is and Why It Exists
A VLAN is a logical partitioning of a single physical Layer-2 network into multiple isolated broadcast domains. Without VLANs, every device connected to the same set of switches shares a single broadcast domain -- every ARP request, every DHCP discover, every broadcast storm reaches every port. At 5,000+ VMs, an unpartitioned flat network would collapse under broadcast traffic alone.
VLANs solve three problems simultaneously:
- Broadcast containment: Each VLAN is its own broadcast domain. A broadcast in VLAN 100 never reaches VLAN 200.
- Security isolation: Traffic between VLANs must traverse a Layer-3 device (router or firewall), where policies can be enforced.
- Operational flexibility: VMs can be placed on any physical host but remain in the same logical network segment, which is the foundation for live migration.
How It Works
VLANs are defined by IEEE 802.1Q. When a frame enters a switch port configured as a "trunk" (carrying multiple VLANs), the switch inserts a 4-byte 802.1Q tag into the Ethernet frame header between the Source MAC and the EtherType/Length field.
Standard Ethernet Frame (untagged):
+----------+----------+-----------+---------------------+-----+
| Dst MAC | Src MAC | EtherType | Payload | FCS |
| 6 bytes | 6 bytes | 2 bytes | 46-1500 bytes | 4 B |
+----------+----------+-----------+---------------------+-----+
802.1Q Tagged Frame:
+----------+----------+------+------+-----------+---------------------+-----+
| Dst MAC | Src MAC | TPID | TCI | EtherType | Payload | FCS |
| 6 bytes | 6 bytes | 2 B | 2 B | 2 bytes | 46-1500 bytes | 4 B |
+----------+----------+------+------+-----------+---------------------+-----+
| |
+-- 802.1Q --+
Tag
(4 bytes)
TPID (Tag Protocol Identifier): 0x8100 -- signals "this frame is tagged"
TCI (Tag Control Information):
+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
| PCP |DEI| VLAN ID (VID) |
| 3 bits (priority) |1b | 12 bits (0-4095) |
+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
PCP = Priority Code Point (802.1p QoS, 0-7)
DEI = Drop Eligible Indicator
VID = VLAN Identifier (12 bits = 4096 possible VLANs, 0 and 4095 reserved)
Port types on a switch:
- Access port: Carries a single VLAN. Frames leave the port untagged. The switch strips the tag on egress and adds it on ingress.
- Trunk port: Carries multiple VLANs. Frames leave the port tagged (except the "native VLAN," which can be sent untagged).
On a Linux host (which is what OVE worker nodes are; Azure Local nodes are Windows-based and handle VLANs through the Hyper-V vSwitch instead), VLANs are created as sub-interfaces:
# Create VLAN 100 on top of physical interface ens1f0
ip link add link ens1f0 name ens1f0.100 type vlan id 100
ip link set ens1f0.100 up
ip addr add 10.20.100.1/24 dev ens1f0.100
The kernel's VLAN driver (8021q module) handles tag insertion and removal transparently. The physical NIC sees tagged frames on the wire; the VLAN sub-interface presents untagged frames to the IP stack.
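A quick verification sketch (assuming the interface names from the example above) confirms the tag configuration from the host side:
# Show the VLAN protocol and ID on the sub-interface
ip -d link show ens1f0.100
# ... vlan protocol 802.1Q id 100 ...
# The kernel's VLAN table lists every sub-interface and its parent device
cat /proc/net/vlan/config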
Practical Impact at Scale
With 5,000+ VMs, VLAN planning is critical:
- VLAN exhaustion: The 12-bit VID field allows only 4,094 usable VLANs. If the organization assigns one VLAN per tenant or per application tier, this limit can be reached surprisingly quickly. This is one of the primary motivations for overlay networks (VXLAN/GENEVE), which extend the ID space to 24 bits (16 million segments).
- Trunk configuration consistency: Every physical host must have identical trunk configurations, or live migration will fail silently -- the VM arrives on the destination host but its VLAN is not trunked there, so it loses connectivity.
- Native VLAN mismatches: If two sides of a trunk disagree on the native VLAN, untagged traffic is placed on the wrong VLAN. This is a common cause of subtle security breaches.
Relationship to Other Topics
VLANs are the Layer-2 foundation on which everything else is built. SDN overlays (VXLAN, GENEVE) encapsulate VLAN-like segments inside UDP packets to overcome the 4,094 VLAN limit. Bonding modes determine how trunk traffic is distributed across physical links. MTU settings must account for the 4-byte 802.1Q tag (and the much larger overlay headers). BGP can be used to advertise VLAN subnets between leaf switches in a spine-leaf fabric.
2. East-West vs. North-South Traffic
What It Is and Why It Exists
These terms describe the two fundamental traffic flow directions inside a data center:
- North-South traffic: Flows between the data center and external networks (the internet, branch offices, partner connections). This is the traffic that crosses the data center perimeter.
- East-West traffic: Flows between servers, VMs, or containers within the data center. This is internal traffic -- application server to database, web frontend to API tier, storage replication between nodes.
External Users / Internet
|
[ Perimeter FW ]
|
[ Core Router ] <-- North-South
/ \ boundary
/ \
[ Spine 1 ] [ Spine 2 ]
/ | \ / | \
/ | \ / | \
[Leaf1] [Leaf2] [Leaf3] [Leaf4] [Leaf5]
| | | | |
+---+ +---+ +---+ +---+ +---+
|VM1| |VM2| |VM3| |VM4| |VM5|
+---+ +---+ +---+ +---+ +---+
\ | / \ | /
`---+---' `--+---'
East-West East-West
traffic traffic
Why the Distinction Matters
In a modern data center running microservices, multi-tier applications, and storage replication, east-west traffic typically dominates north-south traffic by a factor of 5:1 to 20:1. A single user HTTP request entering the data center (north-south) may trigger dozens of internal API calls, database queries, cache lookups, and storage I/O operations (all east-west).
This ratio has profound architectural implications:
- Firewall placement: Traditional perimeter firewalls only see north-south traffic. If a compromised VM starts scanning other VMs internally (east-west), the perimeter firewall never sees it. This is why micro-segmentation (firewalling at the VM/pod level) exists.
- Bandwidth provisioning: If you size your network only for north-south traffic, the internal fabric will be the bottleneck. East-west traffic must be distributed across multiple paths (ECMP, LAG) and optimized for low latency.
- Latency sensitivity: East-west traffic between an application and its database is often latency-critical (sub-millisecond). Adding unnecessary hops (e.g., routing through a centralized firewall for east-west traffic) degrades application performance.
Traffic Patterns in a 5,000+ VM Environment
Typical traffic composition (5,000+ VM data center):
North-South (user-facing): ~10-20% of total bandwidth
East-West (app-to-app): ~30-40% of total bandwidth
East-West (storage): ~30-40% of total bandwidth
East-West (management/control): ~5-10% of total bandwidth
----
100%
Storage replication traffic is often the largest single contributor to east-west traffic in hyperconverged environments (OVE with ODF, Azure Local with S2D), because every write is replicated to 2-3 nodes across the fabric.
Relationship to Other Topics
- SDN provides distributed firewalling for east-west traffic without forcing it through a central chokepoint.
- MTU / Jumbo Frames are most critical for east-west storage traffic, where the per-packet overhead of small MTU sizes significantly impacts throughput.
- BGP in a spine-leaf fabric ensures that east-west traffic takes optimal paths via ECMP rather than hair-pinning through a core router.
- Bonding modes determine how east-west traffic is distributed across the multiple physical uplinks from each server to the leaf switches.
3. MTU / Jumbo Frames
What It Is and Why It Exists
The Maximum Transmission Unit (MTU) is the largest Layer-3 packet (IP datagram) that can be transmitted over a network link without fragmentation. The standard Ethernet MTU is 1500 bytes. A "Jumbo Frame" is any Ethernet frame carrying more than 1500 bytes of payload; jumbo MTUs are typically set to 9000 bytes (though some vendors support up to 9216).
Standard MTU (1500 bytes):
+------------------+--------------------+
| IP + TCP Header | Payload |
| 40 bytes | 1460 bytes |
+------------------+--------------------+
Overhead: 40/1500 = 2.67%
Jumbo Frame MTU (9000 bytes):
+------------------+--------------------+
| IP + TCP Header | Payload |
| 40 bytes | 8960 bytes |
+------------------+--------------------+
Overhead: 40/9000 = 0.44%
Reduction in per-byte overhead: ~6x
Reduction in packets for a 1 GiB transfer:
MTU 1500: ~735,000 packets (1,073,741,824 / 1460)
MTU 9000: ~120,000 packets (1,073,741,824 / 8960) -- ~6.1x fewer packets
Every packet incurs CPU overhead for processing (interrupt handling, checksum computation, buffer allocation). With 6x fewer packets for the same data volume, jumbo frames significantly reduce CPU utilization for high-throughput transfers -- exactly the kind of transfers that happen during storage replication, VM live migration, and backup operations.
How It Works
MTU is configured at the interface level. On Linux:
# Set MTU to 9000 on a physical interface
ip link set dev ens1f0 mtu 9000
# Verify
ip link show ens1f0
# 2: ens1f0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 ...
# Set MTU on a VLAN sub-interface (must be <= parent MTU)
ip link set dev ens1f0.100 mtu 9000
# Set MTU on a bond interface
ip link set dev bond0 mtu 9000
The critical rule: MTU must be consistent across the entire Layer-2 path. If any link in the chain has a smaller MTU, packets will be fragmented (if DF bit is not set) or dropped (if DF bit is set, which is the default for most modern stacks). This includes:
- The VM's virtual NIC
- The virtual switch (OVS, Hyper-V vSwitch)
- The physical NIC on the host
- Every physical switch port in the path
- The destination host's physical NIC
- The destination VM's virtual NIC
MTU Path Consistency Example:
VM-A (MTU 9000)
|
v
[OVS bridge] (MTU 9000)
|
v
[Physical NIC ens1f0] (MTU 9000)
|
v
[Leaf Switch Port] (MTU 9216) <-- switch MTU must be >= frame MTU
|
v
[Spine Switch Port] (MTU 9216)
|
v
[Leaf Switch Port] (MTU 9216)
|
v
[Physical NIC ens1f1] (MTU 9000)
|
v
[OVS bridge] (MTU 9000)
|
v
VM-B (MTU 9000)
If ANY hop drops to MTU 1500:
- TCP will negotiate a smaller MSS via Path MTU Discovery (PMTUD)
- OR packets will be silently dropped if ICMP "Fragmentation Needed"
messages are blocked by a firewall (a very common misconfiguration)
- Result: "black hole" connections that establish but hang on large transfers
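One practical way to verify the end-to-end MTU from a Linux host is to send pings with the Don't Fragment bit set; the sketch below assumes jumbo frames and an illustrative destination address (8972 bytes of ICMP payload plus 28 bytes of ICMP/IP headers equals 9000 bytes on the wire):
# Should succeed if every hop supports MTU 9000
ping -M do -s 8972 -c 3 10.20.100.6
# Also test the standard size (1472 + 28 = 1500)
ping -M do -s 1472 -c 3 10.20.100.6
# "Message too long" or silent loss at the larger size indicates a smaller-MTU hop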
Overlay Encapsulation and MTU
This is where MTU becomes especially important for our evaluation. SDN overlays (VXLAN, GENEVE, which OVE uses) add encapsulation headers to every packet:
VXLAN overhead:
Outer Ethernet: 14 bytes
Outer IP: 20 bytes
Outer UDP: 8 bytes
VXLAN header: 8 bytes
-------------------------
Total overhead: 50 bytes
GENEVE overhead (OVN default):
Outer Ethernet: 14 bytes
Outer IP: 20 bytes
Outer UDP: 8 bytes
GENEVE header: 8 bytes + variable options (typically 4-32 bytes)
-------------------------
Total overhead: 50-74 bytes
If physical MTU = 1500 and overlay overhead = 50 bytes:
Effective inner MTU = 1500 - 50 = 1450 bytes
--> VMs see MTU 1450, not 1500
--> Applications expecting MTU 1500 will experience fragmentation or drops
If physical MTU = 9000 and overlay overhead = 50 bytes:
Effective inner MTU = 9000 - 50 = 8950 bytes
--> VMs see MTU 8950, which is effectively a jumbo frame
--> No performance degradation from encapsulation
This is why jumbo frames on the physical underlay are effectively mandatory for any platform using overlay networking at scale. Without them, the inner MTU drops below 1500, which breaks assumptions in many applications and TCP stacks.
Practical Impact at Scale
- Storage replication (ODF, S2D): With 5,000+ VMs generating continuous write I/O, storage replication traffic can easily saturate 25 GbE links. The difference between MTU 1500 and MTU 9000 on these links translates to measurable CPU savings and higher achievable throughput.
- Live migration: A VM with 128 GB RAM generates ~128 GB of network traffic during a live migration. At MTU 1500, that is ~91 million packets. At MTU 9000, it is ~15 million packets. The migration completes faster and generates fewer interrupts on both source and destination hosts.
- Path MTU Discovery failures: In complex networks with multiple overlay layers, PMTUD failures are the number-one cause of "mystery" connectivity issues. The symptom is that small packets (ping, DNS, TCP SYN/ACK) work fine, but large transfers (file copies, database dumps) hang. Always ensure ICMP type 3 code 4 ("Fragmentation Needed") is permitted through all firewalls.
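As an illustrative sketch on a Linux-based enforcement point (commercial firewalls use their own syntax), permitting the PMTUD signaling for both address families looks like this:
# IPv4: allow ICMP type 3 code 4 ("Fragmentation Needed") through the forwarding path
iptables -A FORWARD -p icmp --icmp-type fragmentation-needed -j ACCEPT
# IPv6: the equivalent is ICMPv6 "Packet Too Big" (type 2), mandatory for IPv6 PMTUD
ip6tables -A FORWARD -p icmpv6 --icmpv6-type packet-too-big -j ACCEPT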
Relationship to Other Topics
- Bonding: The bond interface MTU must match the member interfaces. Setting mtu 9000 on bond0 automatically propagates to member NICs in most configurations, but verify.
- VLANs: VLAN sub-interfaces inherit the parent interface's MTU by default but can be set independently. The parent's MTU must be >= the child's.
- SDN overlays: As shown above, overlay encapsulation reduces effective MTU. This is the primary driver for jumbo frame adoption in overlay-based environments.
- IPv6: The minimum MTU for IPv6 is 1280 bytes (vs. 68 bytes for IPv4). IPv6 does not support router fragmentation -- only the source can fragment. This makes Path MTU Discovery even more critical in dual-stack environments.
4. Bonding Modes
What It Is and Why It Exists
NIC bonding (also called "link aggregation" or "NIC teaming") combines multiple physical network interfaces into a single logical interface. The goals are:
- Redundancy: If one NIC or cable fails, traffic continues on the surviving links.
- Throughput: Traffic is distributed across multiple links, increasing aggregate bandwidth.
On Linux, bonding is implemented by the bonding kernel module. The bond interface is created and managed via ip link or through configuration management tools like NMState (used by OVE) or Network ATC (used by Azure Local).
# Create a bond interface
ip link add bond0 type bond mode 802.3ad
# Add member interfaces
ip link set ens1f0 master bond0
ip link set ens1f1 master bond0
# Set bond parameters
ip link set bond0 type bond miimon 100 lacp_rate fast xmit_hash_policy layer3+4
# Bring up
ip link set bond0 up
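After the bond is up, the kernel exposes its negotiated state; a quick check using the interface names above:
# Full bond status: mode, hash policy, LACP partner details, per-member link state
cat /proc/net/bonding/bond0
# Or just the key lines
grep -E "Bonding Mode|Transmit Hash Policy|MII Status" /proc/net/bonding/bond0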
Bonding Modes in Detail
Linux supports seven bonding modes. Not all are equally relevant for data center virtualization:
+------+------------------+------------+------------+--------------------------------+
| Mode | Name | Redundancy | Throughput | Mechanism |
+------+------------------+------------+------------+--------------------------------+
| 0 | balance-rr | Yes | Yes | Round-robin packet distribution |
| 1 | active-backup | Yes | No | One active, others standby |
| 2 | balance-xor | Yes | Yes | XOR hash on MAC addresses |
| 3 | broadcast | Yes | No | Send on all interfaces |
| 4 | 802.3ad (LACP) | Yes | Yes | Dynamic link aggregation |
| 5 | balance-tlb | Yes | Tx only | Adaptive transmit load balance |
| 6 | balance-alb | Yes | Yes | Adaptive load balancing |
+------+------------------+------------+------------+--------------------------------+
Mode 1 (active-backup) -- the simplest and most universally compatible:
bond0 (active-backup)
/ \
[ens1f0 ACTIVE] [ens1f1 STANDBY]
| |
[Leaf Switch A] [Leaf Switch B]
- Only one NIC transmits and receives at any time
- If the active NIC fails, the standby takes over (failover time = miimon interval)
- No switch configuration required
- Maximum throughput = single link speed (e.g., 25 Gbps on 25 GbE NICs)
- Use case: management networks, low-bandwidth VLANs, or when switches
do not support LACP
Mode 4 (802.3ad / LACP) -- the enterprise standard for high-throughput:
bond0 (802.3ad)
/ \
[ens1f0] [ens1f1]
| |
[Leaf Switch, LAG group / Port-Channel]
- Both NICs transmit and receive simultaneously
- Requires switch-side LACP configuration (both ends must negotiate)
- Traffic distribution is based on a hash function (xmit_hash_policy):
layer2: Hash on src/dst MAC (poor distribution for VMs)
layer3+4: Hash on src/dst IP + port (best distribution for VMs)
encap3+4: Hash on inner headers (for encapsulated traffic)
- Maximum throughput = N x link speed (e.g., 2 x 25 Gbps = 50 Gbps aggregate)
BUT: a single flow still maxes out at a single link's speed
- Both NICs must connect to the same switch or to an MLAG/SMLT pair
The xmit_hash_policy is critical for VM environments. With layer2 hashing, all VMs on a host that communicate with the same destination MAC (e.g., the default gateway) will hash to the same link, creating an imbalance. With layer3+4, the hash uses IP addresses and TCP/UDP ports, which distributes VM traffic much more evenly.
Hash policy comparison for 100 VMs on one host, all talking to the gateway:
layer2 hash: hash(src_mac, dst_mac)
--> All 100 VMs produce the same hash (same dst_mac = gateway)
--> 100% of traffic on one link, 0% on the other
layer3+4 hash: hash(src_ip, dst_ip, src_port, dst_port)
--> Each VM has a different src_ip and different ports
--> ~50/50 distribution across both links
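To confirm the configured policy actually balances traffic, compare the per-member counters under load; a minimal sketch with the interface names used earlier:
# Hash policy currently in effect on the bond
cat /sys/class/net/bond0/bonding/xmit_hash_policy
# Per-member byte/packet counters -- sample twice and compare the deltas
ip -s link show ens1f0
ip -s link show ens1f1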
Mode 6 (balance-alb) -- no switch configuration needed, works across switches:
bond0 (balance-alb)
/ \
[ens1f0] [ens1f1]
| |
[Leaf Switch A] [Leaf Switch B] <-- can be different switches!
- Outgoing: distributes based on per-peer adaptive load balancing
- Incoming: uses ARP negotiation to balance receive traffic
- No switch-side configuration required
- Works across two independent switches (no MLAG needed)
- Drawback: ARP manipulation can confuse some network monitoring tools
and may cause issues with certain switch security features (DAI, port security)
Practical Considerations for 5,000+ VMs
- LACP with MLAG/SMLT: In production data centers, LACP bonds typically connect to a pair of leaf switches configured with MLAG (Arista and others), vPC (Cisco Nexus), SMLT (Extreme/Avaya), or MC-LAG (Juniper). This provides switch-level redundancy while maintaining full LACP bandwidth. Without MLAG, an LACP bond can only connect to a single switch, which becomes a single point of failure.
- Separate bonds for separate traffic types: A common architecture uses two bonds -- one for VM/overlay traffic and one for storage replication traffic -- to isolate failure domains and simplify QoS:
Typical dual-bond architecture per host:
bond0 (LACP, 2x25GbE) bond1 (LACP, 2x25GbE)
xmit_hash_policy: layer3+4 xmit_hash_policy: layer3+4
MTU: 9000 MTU: 9000
| |
VM traffic Storage replication
Overlay (VXLAN/GENEVE) Ceph/S2D cluster traffic
Management Live migration
- NMState (OVE): OVE uses the NMState operator to declare network configurations as Kubernetes custom resources (NodeNetworkConfigurationPolicy). Bonding modes, VLANs, and bridges are defined declaratively and applied consistently across all worker nodes (see the sketch below).
- Network ATC (Azure Local): Azure Local uses Network ATC (Adaptive Traffic Control) to define "intents" (management, compute, storage) and automatically configures bonding, VLANs, and QoS based on best practices.
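A minimal sketch of the declarative form in OVE, assuming the bond and VLAN from the earlier examples (policy name and interface names are illustrative; field names follow the upstream NMState schema):
cat <<'EOF' | oc apply -f -
apiVersion: nmstate.io/v1
kind: NodeNetworkConfigurationPolicy
metadata:
  name: bond0-vlan100
spec:
  desiredState:
    interfaces:
      - name: bond0                # LACP bond across two NICs
        type: bond
        state: up
        mtu: 9000
        link-aggregation:
          mode: 802.3ad
          options:
            miimon: "100"
          port:
            - ens1f0
            - ens1f1
      - name: bond0.100            # VLAN 100 on top of the bond
        type: vlan
        state: up
        mtu: 9000
        vlan:
          base-iface: bond0
          id: 100
EOF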
Relationship to Other Topics
- MTU: Bond MTU must match member NIC MTU. Set it on the bond; it propagates down.
- VLANs: VLAN sub-interfaces are created on top of the bond interface, not on individual member NICs.
- SDN: OVN-Kubernetes (OVE) and Microsoft SDN (Azure Local) both operate on top of the bonded interface. The bond provides the physical redundancy and bandwidth; the SDN provides the logical abstraction.
- East-West traffic: The bonding mode and hash policy directly determine how east-west traffic is distributed across physical links. A bad hash policy can leave one link at 100% while the other sits at 0%.
5. DNS / DHCP in Virtualized Environments
What It Is and Why It Exists
DNS (Domain Name System) translates human-readable names to IP addresses. DHCP (Dynamic Host Configuration Protocol) automatically assigns IP addresses, subnet masks, gateways, and DNS servers to devices when they join a network. In a physical data center with a few hundred servers, DNS and DHCP are straightforward services. In a virtualized environment with 5,000+ VMs that are created, cloned, migrated, and destroyed dynamically, they become critical infrastructure that must scale, respond quickly, and integrate tightly with the orchestration layer.
How DNS Works in Virtualized Environments
DNS in a virtualized data center has two dimensions:
1. Infrastructure DNS (the platform's own name resolution): The virtualization platform itself needs DNS to function. Kubernetes (OVE) relies on CoreDNS for internal service discovery. Azure Local uses Active Directory DNS for cluster operations. These are internal, platform-managed DNS systems that tenants do not directly interact with.
OVE / Kubernetes DNS architecture:
Pod/VM makes DNS query for "my-database.my-namespace.svc.cluster.local"
|
v
CoreDNS (runs as Pods in openshift-dns namespace)
|
+-- Is it a cluster-internal name? --> Resolve from Kubernetes API
| (Services, Endpoints, Pods)
|
+-- Is it an external name? --> Forward to upstream DNS servers
(configured in cluster DNS operator config)
DNS search path inside a Pod/VM:
my-namespace.svc.cluster.local
svc.cluster.local
cluster.local
<external domain>
2. Tenant/workload DNS: The VMs themselves need DNS for their own applications -- resolving database hostnames, external APIs, internal service names. This is typically served by the organization's existing enterprise DNS infrastructure (e.g., Windows DNS, Infoblox, or BIND), not by the virtualization platform.
The integration challenge: When a VM is created or migrated, its DNS record must be updated to reflect its new IP address. In VMware, this is often handled by the VM's DHCP lease triggering a Dynamic DNS (DDNS) update. In Kubernetes/OVE, the platform itself manages DNS for Services and Pods, but VMs exposed on secondary networks (via Multus) need explicit DNS integration -- either via DDNS from the guest OS or via external-dns operators.
How DHCP Works in Virtualized Environments
DHCP follows a four-step process (DORA):
VM (Client) DHCP Server
| |
|--- DHCP Discover (broadcast) -------->|
| |
|<------ DHCP Offer (unicast) ----------|
| (IP: 10.20.100.50, |
| Mask: 255.255.255.0, |
| GW: 10.20.100.1, |
| DNS: 10.1.1.53, |
| Lease: 8 hours) |
| |
|--- DHCP Request (broadcast) --------->|
| ("I accept 10.20.100.50") |
| |
|<------ DHCP ACK (unicast) ------------|
| ("Confirmed, it's yours") |
| |
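To observe the DORA exchange on the wire from a hypervisor node (interface name assumed), a capture on the DHCP ports is usually the fastest diagnostic:
# Capture DHCP traffic (server port 67, client port 68) without name resolution
tcpdump -ni ens1f0 'udp and (port 67 or port 68)'
# A healthy sequence shows Discover, Offer, Request, ACK;
# a missing Offer typically points to a broken DHCP relay or a wrong VLAN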
DHCP in virtualized environments has specific challenges:
- Broadcast domain scope: DHCP Discover is a broadcast. In a VLAN-segmented network, the DHCP server must be in the same VLAN as the VM, or a DHCP relay agent (ip helper-address on the router/switch) must forward the broadcast to the DHCP server's subnet.
- Lease churn: With 5,000+ VMs and dynamic provisioning, DHCP lease tables can grow large. Lease times must be balanced -- too short means excessive DHCP traffic; too long means IP exhaustion when VMs are destroyed but leases are not released.
- IP address management (IPAM): At scale, manual DHCP scope management becomes untenable. Enterprise IPAM solutions (Infoblox, BlueCat, NetBox) integrate with the virtualization platform to automate IP allocation and DNS record creation.
- Static vs. dynamic: Many enterprise workloads (databases, middleware, load balancers) require static IPs for stability. In OVE, static IPs can be assigned via cloud-init or via the VM's network configuration in the VirtualMachine CR. In Azure Local, static IPs are assigned via ARM templates or SCVMM.
Platform-Specific DNS/DHCP Behavior
OVE: CoreDNS handles cluster-internal resolution. For VMs on secondary networks (via Multus/bridge CNI), DNS and DHCP are typically provided by the organization's existing infrastructure. OVE does not inject itself into the tenant DNS path for secondary networks. The kubevirt network binding can optionally configure cloud-init to set static IPs and DNS servers at VM boot time.
Azure Local: Relies heavily on Active Directory DNS. Cluster nodes must be domain-joined. DHCP for VMs can be served by Windows DHCP Server roles, by the SDN's built-in DHCP, or by external DHCP infrastructure.
Swisscom ESC: DNS and DHCP are managed services provided by Swisscom within the tenant. The customer configures their zones and scopes through the ESC portal or API, but does not operate the underlying infrastructure.
Relationship to Other Topics
- VLANs: Each VLAN typically maps to a DHCP scope/subnet. DHCP relay agents are needed when the DHCP server is not in the same VLAN.
- IPv4/IPv6 Dual-Stack: DHCPv6 operates differently from DHCPv4 -- it can work in stateful mode (assigning addresses) or stateless mode (only providing DNS/domain info, with SLAAC handling addresses). Both must be planned.
- SDN: Some SDN solutions (NSX, OVN) include built-in DHCP servers for overlay networks, eliminating the need for external DHCP infrastructure for certain segments.
6. SDN (Software-Defined Networking)
What It Is and Why It Exists
Software-Defined Networking separates the control plane (the logic that decides where traffic should go) from the data plane (the hardware and software that actually forwards packets). In traditional networking, every switch and router contains both planes -- each device independently runs routing protocols, maintains its own forwarding tables, and makes its own decisions. SDN centralizes the control plane into a software controller that programs the forwarding tables of all switches from a single point of authority.
Traditional Networking: Software-Defined Networking:
+--------+ +--------+ +--------+ +-----------------------+
|Switch 1| |Switch 2| |Switch 3| | SDN Controller |
| Control| | Control| | Control| | (centralized brain) |
| Plane | | Plane | | Plane | +-----------+-----------+
| Data | | Data | | Data | |
| Plane | | Plane | | Plane | Southbound API (OpenFlow,
+--------+ +--------+ +--------+ OVSDB, gRPC, REST...)
|
Each device makes independent +--------+--------+
forwarding decisions. | | |
Configuration is per-device. +--------+--------+--------+
Changes require touching |Switch 1|Switch 2|Switch 3|
every device individually. | Data | Data | Data |
| Plane | Plane | Plane |
| only | only | only |
+--------+--------+--------+
Controller pushes forwarding
rules to all switches.
Configuration is centralized.
Changes are applied globally.
Why SDN Matters for Virtualization
In a data center running 5,000+ VMs, the network must support:
- Rapid provisioning: A new VM needs connectivity within seconds. Traditional networking requires switch port configuration, VLAN provisioning, and firewall rule updates -- all manual or semi-automated steps. SDN automates all of this.
- Micro-segmentation: Per-VM firewall rules enforced at the virtual switch level, not at a central chokepoint. Without SDN, east-west firewall rules require traffic to hairpin through a physical firewall.
- Multi-tenancy: Tenants need isolated network segments that can overlap in address space (e.g., two tenants both using 10.0.0.0/24). Overlay networks (VXLAN/GENEVE) provide this isolation.
- Live migration transparency: When a VM migrates from host A to host B, the network must seamlessly redirect traffic to the new location. SDN controllers update forwarding tables automatically.
SDN Architecture: Control Plane, Data Plane, Management Plane
+---------------------------------------------------------------+
| Management Plane |
| (UI, CLI, API for human and automation interaction) |
| Examples: OpenShift Console, Azure Portal, vCenter |
+----------------------------+----------------------------------+
|
Northbound API
(REST, gRPC)
|
+----------------------------v----------------------------------+
| Control Plane |
| (Centralized logic: topology, routing, policy, ACLs) |
| |
| OVE: OVN Northbound DB + ovn-northd |
| Azure Local: Network Controller (Windows SDN) |
| VMware: NSX Manager |
| ESC: NSX Manager (Swisscom-operated) |
+----------------------------+----------------------------------+
|
Southbound API
(OVSDB, OpenFlow, gRPC)
|
+----------------------------v----------------------------------+
| Data Plane |
| (Packet forwarding, encap/decap, ACL enforcement) |
| |
| OVE: OVS (Open vSwitch) on every node |
| Azure Local: Hyper-V Virtual Filtering Platform (VFP) |
| VMware: NSX-T Distributed Switch (N-VDS) |
| ESC: NSX-T (same as VMware, Swisscom-managed) |
+---------------------------------------------------------------+
OVN/OVS: The SDN Stack in OVE
OVE uses OVN (Open Virtual Network) as its SDN controller and OVS (Open vSwitch) as its data plane. This is the most important SDN stack for our evaluation because OVE is the shortlist favorite.
OVN Architecture in an OVE Cluster:
+---------------------------+
| OVN Northbound Database | <-- Logical network topology
| (logical switches, | (what the admin sees)
| logical routers, |
| ACLs, NAT rules) |
+-------------+-------------+
|
ovn-northd <-- Translates logical topology
| to physical flow rules
v
+---------------------------+
| OVN Southbound Database | <-- Physical bindings
| (chassis table, | (which VM is on which host,
| port bindings, | which tunnel to use)
| datapath flows) |
+-------------+-------------+
|
ovn-controller <-- Runs on every node
(per node) Reads Southbound DB,
| programs local OVS
v
+---------------------------+
| Open vSwitch (OVS) | <-- Data plane on every node
| (br-int bridge) | Forwards packets according
| | to OpenFlow rules from
| - Encap/decap GENEVE | ovn-controller
| - ACL enforcement |
| - ARP proxy |
| - DHCP responder |
+---------------------------+
Key OVS/OVN commands for troubleshooting on an OVE node:
# List OVS bridges
ovs-vsctl list-br
# Output: br-int, br-ex
# Show ports on br-int (the integration bridge where all VMs connect)
ovs-vsctl list-ports br-int
# Dump OpenFlow rules (the actual forwarding logic)
ovs-ofctl dump-flows br-int
# Show OVN logical switches (the "virtual switches" tenants see)
ovn-nbctl ls-list
# Show OVN logical routers
ovn-nbctl lr-list
# Show port bindings (which VM port is on which chassis/node)
ovn-sbctl show
# Trace a packet path through OVN (extremely useful for debugging)
ovn-trace <logical-switch> 'inport == "vm-port" && eth.src == ... && ip4.src == ...'
Microsoft SDN: The Stack in Azure Local
Azure Local uses the Microsoft SDN stack with these components:
- Network Controller: Centralized REST API for network configuration (analogous to OVN Northbound DB)
- Software Load Balancer (SLB): Distributed load balancing in the data plane
- Datacenter Firewall: Distributed ACL enforcement at the Hyper-V vSwitch level
- RAS Gateway: Multi-tenant gateway for VPN and routing
- Virtual Filtering Platform (VFP): The data plane extension in the Hyper-V virtual switch that enforces policies (analogous to OVS flow rules)
Azure Local SDN Architecture:
+---------------------------+
| Network Controller | <-- REST API, stores network
| (3-node cluster) | policy and topology
+-------------+-------------+
|
WCF / REST <-- Southbound communication
|
v
+---------------------------+
| Hyper-V Virtual Switch | <-- Data plane on every node
| + VFP (Virtual Filtering |
| Platform extension) |
| |
| - VXLAN encap/decap |
| - ACL enforcement |
| - SLB DIP/VIP translation |
| - Metering |
+---------------------------+
Network ATC (used since Azure Local 23H2) simplifies configuration by allowing administrators to declare "intents" instead of manually configuring each component:
# Declare intents -- Network ATC configures bonding, VLANs, QoS automatically
Add-NetIntent -Name "ConvergedIntent" `
-Management -Compute -Storage `
-AdapterName "pNIC1", "pNIC2"
Relationship to Other Topics
- VLANs: SDN overlays extend the concept of VLANs beyond the 4,094 limit. The physical network carries tagged VLANs; the SDN overlay carries logical segments inside VXLAN/GENEVE tunnels on top of those VLANs.
- East-West traffic: SDN enables distributed firewalling of east-west traffic without a central chokepoint. The virtual switch on each host enforces ACLs locally.
- BGP: SDN controllers often use BGP to advertise routes to the physical network (e.g., OVN can peer with physical routers via BGP; the Azure Local RAS Gateway uses BGP for multi-tenant routing).
- MTU: Overlay encapsulation (VXLAN/GENEVE) adds 50-74 bytes of overhead per packet. The physical MTU must accommodate this, as discussed in the MTU section.
7. BGP (Border Gateway Protocol)
What It Is and Why It Exists
BGP is the routing protocol that holds the internet together. Every autonomous system (AS) on the internet uses BGP to announce its IP prefixes to its neighbors and to learn about prefixes from other autonomous systems. BGP is a path-vector protocol -- it selects routes based on a list of attributes, not just a simple metric like hop count or link cost.
Traditionally, BGP was only used at the network edge (connecting to ISPs). In modern data centers, BGP has become the preferred routing protocol inside the data center fabric, replacing OSPF and Spanning Tree for several reasons:
- ECMP (Equal-Cost Multi-Path): BGP natively supports ECMP, allowing traffic to be distributed across multiple spine switches simultaneously.
- Simplicity at scale: In a spine-leaf fabric, every leaf switch peers with every spine switch via eBGP. The configuration is uniform and scales linearly.
- Failure isolation: Unlike OSPF (which floods state changes across the entire area), BGP failures are contained to the affected peering session.
- Multi-tenancy: BGP VRFs (Virtual Routing and Forwarding) allow multiple tenants to use overlapping IP spaces with separate routing tables.
How BGP Works
BGP operates over TCP (port 179). Two BGP routers ("peers" or "neighbors") establish a TCP session and exchange routing information in the form of UPDATE messages.
BGP Peering Establishment (Finite State Machine):
Idle --> Connect --> OpenSent --> OpenConfirm --> Established
| | | | |
| TCP SYN/ACK OPEN msg KEEPALIVE UPDATE msgs
| attempt exchanged exchanged exchanged
| (AS number, (route
| hold time, announcements
| router ID) & withdrawals)
|
+-- Back to Idle on any fatal error
BGP UPDATE message structure (simplified):
+--------------------------------------------------+
| Withdrawn Routes Length (2 bytes) |
| Withdrawn Routes (variable) |
| - List of prefixes being withdrawn |
+--------------------------------------------------+
| Total Path Attribute Length (2 bytes) |
| Path Attributes (variable) |
| - ORIGIN (IGP, EGP, or Incomplete) |
| - AS_PATH (list of AS numbers traversed) |
| - NEXT_HOP (IP of next router) |
| - LOCAL_PREF (preference within an AS) |
| - MED (Multi-Exit Discriminator) |
| - COMMUNITY (tagging for policy) |
+--------------------------------------------------+
| NLRI (Network Layer Reachability Information) |
| - List of prefixes being announced |
| e.g., 10.20.0.0/16, 192.168.100.0/24 |
+--------------------------------------------------+
BGP best path selection algorithm (simplified, in order of priority):
- Highest LOCAL_PREF (prefer paths marked as preferred within our AS)
- Shortest AS_PATH (fewer AS hops = closer)
- Lowest ORIGIN type (IGP < EGP < Incomplete)
- Lowest MED (prefer the entrance point the remote AS suggests)
- eBGP over iBGP (prefer externally learned routes)
- Lowest IGP metric to NEXT_HOP
- Oldest route (stability preference)
- Lowest router ID (tiebreaker)
BGP in a Data Center Spine-Leaf Fabric
BGP Peering in Spine-Leaf (eBGP design):
Each device runs its own AS number:
AS 65000 AS 65001
+-----------+ +-----------+
| Spine-1 | | Spine-2 |
| BGP RR/ | | BGP RR/ |
| Transit | | Transit |
+--+--+--+--+ +--+--+--+--+
| | | | | |
| | +-------+-------+ | |
| | | | |
| +----+-----+-----+---+ |
| | | | |
AS 65010 AS 65011 AS 65012 AS 65013
+------+ +------+ +------+ +------+
|Leaf-1| |Leaf-2| |Leaf-3| |Leaf-4|
+------+ +------+ +------+ +------+
| | | |
Hosts Hosts Hosts Hosts
- Every leaf peers with every spine (full mesh at L3)
- Each leaf announces its connected subnets
- Spines reflect routes between leaves
- ECMP across all spines for any leaf-to-leaf path
- No Spanning Tree, no Layer-2 loops, no broadcast storms
eBGP vs. iBGP in the data center:
- eBGP (external BGP): Peers have different AS numbers. Each leaf has its own AS. This is the dominant data center design because eBGP does not require full-mesh peering or route reflectors within an AS.
- iBGP (internal BGP): Peers share the same AS number. Requires route reflectors or full-mesh peering. Less common in modern leaf-spine designs but still used in some legacy architectures.
BGP in the Candidate Platforms
OVE (MetalLB + OVN): MetalLB is a load-balancer implementation for bare-metal Kubernetes clusters. It uses BGP to announce Service VIPs (Virtual IPs) to the physical network:
MetalLB BGP peering in OVE:
Physical Network OVE Cluster
+-----------+
| Leaf | eBGP +------------+
| Switch |<----------->| MetalLB |
| (peer) | AS 65010 | Speaker |
+-----------+ AS 65020 | (on node) |
| +------------+
|
+-- Learns: "10.30.0.100/32 via <node-IP>"
(the Service VIP is reachable through this node)
When a Kubernetes Service of type LoadBalancer is created:
1. MetalLB assigns a VIP from a configured pool
2. MetalLB Speaker on the node establishes BGP session with leaf switch
3. Speaker announces the VIP as a /32 route
4. Leaf switch installs the route and forwards traffic to the node
5. If the node fails, the Speaker on another node takes over
and announces the VIP from the new location
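A hedged sketch of the MetalLB resources behind these steps (the address pool, AS numbers, and peer address echo the diagram above and are purely illustrative):
cat <<'EOF' | oc apply -f -
apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: prod-vips
  namespace: metallb-system
spec:
  addresses:
    - 10.30.0.100-10.30.0.199
---
apiVersion: metallb.io/v1beta2
kind: BGPPeer
metadata:
  name: leaf-a
  namespace: metallb-system
spec:
  myASN: 65020
  peerASN: 65010
  peerAddress: 10.20.0.1
---
apiVersion: metallb.io/v1beta1
kind: BGPAdvertisement
metadata:
  name: prod-vips-bgp
  namespace: metallb-system
spec:
  ipAddressPools:
    - prod-vips
EOF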
Azure Local: The RAS Gateway in Microsoft SDN uses BGP to peer with the physical network for multi-tenant routing and VPN termination. Network Controller manages the BGP configuration.
VMware / ESC: NSX-T uses BGP peering between Tier-0 gateways and the physical fabric for north-south routing. The Tier-0 gateway runs as a VM or on bare-metal Edge nodes.
Practical Impact at Scale
- Convergence time: When a node fails, BGP must withdraw routes and re-converge. With BFD (Bidirectional Forwarding Detection), failures can be detected in under 1 second. Without BFD, default BGP hold timers (typically 90-180 seconds, vendor-dependent) cause unacceptable failover delays. Always configure BFD for sub-second failover.
- Route table size: In a 5,000+ VM environment with per-VM or per-Service routes, the leaf switches must handle thousands of BGP routes. Verify that your switching hardware supports the required FIB (Forwarding Information Base) size.
- BGP communities: Use communities to tag routes by function (e.g., 65000:100 for production VIPs, 65000:200 for management). This enables fine-grained route filtering and policy control at the spine and border layers.
Relationship to Other Topics
- SDN: SDN controllers (OVN, NSX, Microsoft SDN) use BGP as the "glue" between the virtual overlay and the physical underlay. The SDN controller manages virtual routing internally; BGP peers the SDN gateway with the physical fabric.
- IPv4/IPv6 Dual-Stack: BGP supports both address families via Multi-Protocol BGP (MP-BGP). A single BGP session can carry both IPv4 and IPv6 routes using AFI/SAFI (Address Family Identifier / Subsequent AFI).
- East-West traffic: In a spine-leaf fabric with BGP ECMP, east-west traffic between any two leaf switches traverses exactly one spine hop, and traffic is distributed across all available spines.
- VLANs: BGP operates at Layer 3 and advertises IP prefixes, not VLANs. However, each VLAN typically maps to a subnet, and BGP advertises those subnets between leaf switches.
8. IPv4 / IPv6 Dual-Stack
What It Is and Why It Exists
IPv4 uses 32-bit addresses (e.g., 192.168.1.1), providing approximately 4.3 billion unique addresses. This space has been exhausted globally since 2011. IPv6 uses 128-bit addresses (e.g., 2001:0db8:85a3::8a2e:0370:7334), providing 3.4 x 10^38 unique addresses -- effectively unlimited.
Dual-stack means running both IPv4 and IPv6 simultaneously on every interface, switch, router, and service in the infrastructure. This is the industry-recommended transition strategy because it avoids the complexity of protocol translation (NAT64/DNS64) while allowing gradual migration.
IPv4 vs. IPv6 Header Comparison
IPv4 Header (20-60 bytes, variable):
+-------+------+----------+----------------+
| Ver=4 | IHL | DSCP/ECN | Total Length |
| 4 bit | 4 bit| 8 bits | 16 bits |
+-------+------+----------+----------------+
| Identification | Flags|Frag Off|
| 16 bits | 3b | 13 bits|
+-------+------+----------+----------------+
| TTL |Protocol| Header Checksum |
| 8 bits| 8 bits | 16 bits |
+-------+--------+-------------------------+
| Source IPv4 Address (32 bits) |
+------------------------------------------+
| Destination IPv4 Address (32 bits) |
+------------------------------------------+
| Options (variable, 0-40 bytes) |
+------------------------------------------+
IPv6 Header (40 bytes, fixed):
+-------+----------+----+------------------+
| Ver=6 | Traffic |Flow| Payload Length |
| 4 bit | Class 8b |Lab | 16 bits |
| | |20b | |
+-------+----------+----+------------------+
| Next Header | Hop Limit |
| 8 bits | 8 bits |
+------------------+----------------------+
| |
| Source IPv6 Address (128 bits) |
| |
| |
+------------------------------------------+
| |
| Destination IPv6 Address (128 bits) |
| |
| |
+------------------------------------------+
Key differences:
- IPv6 has NO header checksum (saves processing on every hop)
- IPv6 has NO fragmentation fields in the base header
(fragmentation is handled by extension headers, only at the source)
- IPv6 flow label enables per-flow handling without deep packet inspection
- IPv6 minimum MTU is 1280 (vs. 68 for IPv4)
Address Architecture in Dual-Stack Data Centers
Dual-Stack Interface Configuration (Linux):
$ ip addr show ens1f0
2: ens1f0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000
inet 10.20.100.5/24 brd 10.20.100.255 scope global ens1f0
inet6 2001:db8:100::5/64 scope global
inet6 fe80::a00:27ff:fe4e:66a1/64 scope link
Three addresses on one interface:
10.20.100.5 -- IPv4 (manually or DHCP assigned)
2001:db8:100::5 -- IPv6 global unicast (SLAAC or DHCPv6)
fe80::... -- IPv6 link-local (auto-generated, always present)
IPv6 address types relevant for data center operations:
| Type | Prefix | Scope | Use |
|---|---|---|---|
| Link-local | fe80::/10 | Single link | Always present, used for neighbor discovery, routing protocol peering |
| Unique Local (ULA) | fd00::/8 | Organization-wide | Analogous to RFC1918 private addresses, not routable on the internet |
| Global Unicast | 2000::/3 | Internet | Globally routable, assigned by RIR or ISP |
| Multicast | ff00::/8 | Variable | Replaces broadcast (IPv6 has no broadcast) |
SLAAC (Stateless Address Autoconfiguration): IPv6 can self-assign addresses without a DHCP server. The router sends Router Advertisement (RA) messages containing the network prefix; the host appends its own interface identifier (derived from the MAC address, generated as a stable opaque value per RFC 7217, or created as a temporary privacy address per RFC 4941) to form a full address. This eliminates the need for DHCPv6 in many scenarios but complicates IP address tracking in regulated environments.
Dual-Stack Implications for 5,000+ VMs
Address planning:
IPv4 address planning for 5,000 VMs:
Using /24 subnets: 5000 / 254 hosts = ~20 subnets needed
Using /22 subnets: 5000 / 1022 hosts = ~5 subnets needed
Total IPv4 addresses: 5,000 (tight, requires careful management)
IPv6 address planning for 5,000 VMs:
A single /64 subnet provides 2^64 = 18,446,744,073,709,551,616 addresses
One /64 per VLAN is the standard practice
Address exhaustion is not a concern
But: address tracking requires IPAM tooling (SLAAC makes addresses dynamic)
Firewall rules: Every firewall rule, network policy, and ACL must exist in both IPv4 and IPv6 versions. A common security mistake is implementing strict IPv4 firewall rules while leaving IPv6 wide open because "we don't use IPv6" -- but IPv6 is enabled by default on most modern operating systems, and link-local addresses are always active.
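On Linux-based enforcement points, one way to keep the two families from diverging is the nftables inet family, where a single rule applies to both IPv4 and IPv6; a minimal sketch (table, chain, and port are illustrative):
# A single inet-family table covers both IPv4 and IPv6
nft add table inet vmfw
nft add chain inet vmfw forward '{ type filter hook forward priority 0; policy drop; }'
# One rule, enforced identically for both address families
nft add rule inet vmfw forward tcp dport 5432 ct state new,established accept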
DNS: In dual-stack, DNS queries return both A records (IPv4) and AAAA records (IPv6). The application or OS selects which to use based on the "Happy Eyeballs" algorithm (RFC 8305), which prefers IPv6 but falls back to IPv4 if the IPv6 connection attempt fails within a short timeout. This can cause confusing behavior if IPv6 connectivity is partially broken (e.g., router advertisements are received but the path is blackholed).
Monitoring and logging:
All monitoring, logging, and SIEM tools must handle both address families. An alert rule matching 10.20.100.0/24 will miss malicious traffic from the same host if it arrives on the 2001:db8:100::/64 address. NetFlow/sFlow collectors must process both IPv4 and IPv6 flow records.
Dual-Stack in Kubernetes / OVE
Kubernetes has supported dual-stack since version 1.23 (stable). In OVE, dual-stack is configured at cluster installation time:
# OVE cluster network configuration (install-config.yaml excerpt)
networking:
clusterNetwork:
- cidr: 10.128.0.0/14 # IPv4 pod CIDR
hostPrefix: 23
- cidr: fd01::/48 # IPv6 pod CIDR
hostPrefix: 64
serviceNetwork:
- 172.30.0.0/16 # IPv4 service CIDR
- fd02::/112 # IPv6 service CIDR
networkType: OVNKubernetes
Each Pod and VM gets both an IPv4 and an IPv6 address. Services can be configured as ipFamilyPolicy: PreferDualStack or RequireDualStack. CoreDNS returns both A and AAAA records for services.
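A minimal sketch of requesting dual-stack on a Service (the name, selector, and port are illustrative; the fields are standard Kubernetes Service API):
cat <<'EOF' | oc apply -f -
apiVersion: v1
kind: Service
metadata:
  name: my-database
spec:
  selector:
    app: my-database
  ipFamilyPolicy: RequireDualStack   # refuse to create unless both families are available
  ipFamilies:
    - IPv4
    - IPv6
  ports:
    - port: 5432
      protocol: TCP
EOF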
Relationship to Other Topics
- MTU: IPv6 minimum MTU is 1280 bytes. IPv6 does not allow router fragmentation (only the source can fragment via a Fragment extension header). Path MTU Discovery is therefore even more critical for IPv6.
- BGP: MP-BGP (Multi-Protocol BGP) carries both IPv4 and IPv6 routes in a single session or in separate AFI sessions. MetalLB in OVE must be configured to announce both address families.
- SDN: OVN-Kubernetes supports dual-stack natively. Microsoft SDN in Azure Local supports dual-stack for VM networking. NSX-T (ESC) supports dual-stack.
- DNS/DHCP: Dual-stack requires both DHCPv4 and either DHCPv6 or SLAAC. DNS must return both A and AAAA records. Reverse DNS (PTR) must be maintained for both in-addr.arpa (IPv4) and ip6.arpa (IPv6).
- Bonding: Bonding is Layer-2 and is protocol-agnostic -- the same bond carries both IPv4 and IPv6 traffic. No special bonding configuration is needed for dual-stack.
How the Candidates Handle This
Comparison Table
| Aspect | VMware (Current) | OVE | Azure Local | Swisscom ESC |
|---|---|---|---|---|
| VLAN management | vDS port groups, per-host VLAN trunk config via vCenter | NMState operator + NetworkAttachmentDefinitions (declarative, GitOps-compatible) | Network ATC intents + Hyper-V vSwitch (PowerShell/ARM) | Managed by Swisscom via NSX, tenant sees logical networks |
| East-West firewalling | NSX Distributed Firewall (per-VM, micro-segmentation) | OVN ACLs via NetworkPolicies and MultiNetworkPolicies | Datacenter Firewall via VFP (SDN-managed ACLs) | NSX Distributed Firewall (same as VMware, provider-managed) |
| MTU configuration | vDS MTU setting, per-vmnic MTU, per-port-group override | NMState declares MTU per interface/bond/VLAN; cluster-wide consistency enforced by operator | Network ATC auto-configures MTU for storage (jumbo) and compute (standard) based on intent | Provider-managed; customer has no direct MTU control |
| Bonding | vDS with multiple uplinks, load-based teaming or LACP | Linux bonding via NMState (mode 1, 4, or 6; LACP with xmit_hash_policy configurable) | SET (Switch Embedded Teaming) in Hyper-V vSwitch, or LACP via Network ATC | Provider-managed; Dell VxBlock standard bonding |
| DNS internal | vCenter DNS dependency; VMs use enterprise DNS | CoreDNS for cluster-internal; enterprise DNS for VM secondary networks | Active Directory DNS for cluster; Windows DNS or external for VMs | Managed DNS within tenant; Swisscom-operated |
| DHCP | Enterprise DHCP for VM networks; NSX can provide DHCP for overlay segments | Optional: OVN can serve DHCP on overlay networks; enterprise DHCP for bridged/secondary networks | Windows DHCP Server or SDN built-in DHCP | Managed DHCP as part of ESC service |
| SDN stack | NSX-T (proprietary, distributed switching + routing + firewalling + LB) | OVN-Kubernetes (open source, GENEVE overlays, distributed routing, ACLs) | Microsoft SDN (Network Controller + VFP + SLB + RAS Gateway, VXLAN overlays) | NSX-T (same as VMware baseline, fully managed by Swisscom) |
| SDN overlay protocol | GENEVE (NSX-T 3.x+) or VXLAN (NSX-T 2.x) | GENEVE (OVN default) | VXLAN (Microsoft SDN) | GENEVE / VXLAN depending on NSX version |
| BGP integration | NSX Tier-0 Gateway peers with physical fabric via BGP | MetalLB (BGP mode) for Service VIPs; OVN gateway router can peer via BGP | RAS Gateway uses BGP for multi-tenant routing; SDN integrates with ToR switches | NSX Tier-0 BGP peering (provider-managed) |
| IPv4/IPv6 Dual-Stack | NSX-T supports dual-stack; vDS supports dual-stack | Native dual-stack since OCP 4.12+ (OVN-Kubernetes) | Native dual-stack (Hyper-V + SDN) | Supported (NSX-T + VMware) |
| Max VLAN segments | 4,094 (physical) + millions via NSX overlay | 4,094 (physical) + 16M via GENEVE overlay | 4,094 (physical) + 16M via VXLAN overlay | Provider-managed; transparent to customer |
| Network config as code | Limited (vCenter API, PowerCLI); NSX has REST API but no native GitOps | Full: NMState CRDs, NetworkAttachmentDefinitions, NetworkPolicies -- all YAML, all GitOps-compatible | ARM Templates, Bicep, Terraform; Network ATC intents via PowerShell | ESC API exists but limited IaC maturity; Terraform provider maturity unclear |
| Troubleshooting tools | NSX UI, packet tracing, traceflow | ovs-ofctl, ovn-trace, ovn-nbctl, ovn-sbctl, tcpdump, Network Observability Operator | Test-NetConnection, Get-VMNetworkAdapter, Network Controller diagnostics, WAC | Ticket to Swisscom; no direct platform access |
Key Differences in Prose
SDN maturity and control: The most significant difference between the candidates is the degree of control over the SDN layer. OVE exposes the full OVN/OVS stack -- operators can inspect OpenFlow tables, trace packets through the overlay, and debug connectivity issues at the flow level. Azure Local abstracts this behind Network ATC and the Network Controller API, offering less visibility but simpler configuration. Swisscom ESC provides no SDN control at all -- the entire networking stack is a black box operated by the provider. For an organization that wants to deeply understand and optimize its network (as is necessary at 5,000+ VMs), OVE offers the most transparency, Azure Local offers a middle ground, and ESC offers the least.
VLAN and overlay model: OVE and Azure Local both use overlay networks as the default for inter-VM traffic, with bridge/trunk interfaces for VMs that need direct access to physical VLANs (e.g., legacy applications with hardcoded VLAN dependencies). The critical difference is that OVE uses Multus + bridge CNI to attach VMs to physical VLANs as secondary interfaces, which requires explicit NetworkAttachmentDefinition resources -- a new concept for teams coming from VMware. Azure Local uses Hyper-V vSwitch port profiles, which are conceptually closer to VMware port groups. ESC mirrors the current VMware model exactly (NSX + vDS port groups).
Bonding and NIC configuration: OVE's NMState operator is the most "infrastructure-as-code" approach to NIC configuration. Bonding modes, VLANs, and MTU are declared as Kubernetes CRDs and reconciled automatically. Azure Local's Network ATC is intent-based (you declare "I want management, compute, and storage on these NICs") and the system auto-configures -- convenient but less transparent when debugging. VMware vDS teaming policies are configured per port group in vCenter. ESC is provider-managed.
BGP integration depth: OVE with MetalLB requires explicit BGP peering configuration between the cluster and the physical fabric. This means the network team must understand BGP and must coordinate AS numbers, peer IPs, and route filters with the Kubernetes platform team. Azure Local's BGP integration is handled by the SDN Network Controller and is less exposed to the administrator. ESC's BGP peering is entirely Swisscom's responsibility. For an organization with strong network engineering skills, OVE's model is preferred because it provides full control. For an organization that wants networking to "just work," Azure Local or ESC may be more appropriate.
Dual-stack readiness: All three candidates support IPv4/IPv6 dual-stack. OVE's dual-stack is a cluster-wide install-time decision and is well-documented for OVN-Kubernetes. Azure Local's dual-stack works at the Hyper-V and SDN level. ESC's dual-stack support depends on the specific NSX version Swisscom has deployed. The key evaluation question is not whether dual-stack is "supported" but whether the operational tooling (monitoring, logging, firewall policies, IPAM) handles both address families equally well.
Key Takeaways
- VLANs are necessary but insufficient at scale. The 4,094 VLAN limit, combined with the need for multi-tenancy and address space overlap, drives the adoption of overlay networks (VXLAN/GENEVE) in all three candidates. Understanding the physical VLAN layer is still essential because overlay traffic is carried inside VLAN-tagged underlay frames.
- East-west traffic dominates. In a 5,000+ VM environment, internal traffic between VMs and storage replication traffic vastly exceeds north-south traffic. Network architecture must prioritize east-west bandwidth, low latency, and distributed security enforcement (micro-segmentation) over perimeter-centric designs.
- MTU consistency is non-negotiable. Every link in the path -- from VM virtual NIC through the virtual switch, physical NIC, every physical switch, and to the destination -- must support the same MTU. Overlay encapsulation adds 50-74 bytes, making jumbo frames (MTU 9000) on the physical underlay effectively mandatory for overlay-based platforms.
- Bonding mode selection directly impacts throughput distribution. LACP (mode 4) with layer3+4 or encap3+4 hash policy is the enterprise standard for VM traffic. Incorrect hash policies (e.g., layer2) cause asymmetric load and waste half the available bandwidth. The bonding configuration must be validated with actual traffic patterns during the PoC.
- DNS and DHCP at scale require IPAM integration. Manual DHCP scope management and static DNS record maintenance do not scale to 5,000+ VMs with dynamic provisioning and live migration. The chosen platform must integrate with the organization's IPAM system (Infoblox, BlueCat, or equivalent).
- SDN is the differentiating technology choice. The candidates use fundamentally different SDN stacks: OVN/OVS (open source, full transparency), Microsoft SDN/VFP (proprietary, intent-based), and NSX (proprietary, managed by provider in the case of ESC). This choice determines the team's ability to debug network issues, enforce security policies, and integrate with the physical fabric. The SDN decision has a 5-10 year lock-in effect.
- BGP knowledge is now required for platform teams, not just network teams. All three candidates use BGP to integrate their overlay networks with the physical fabric. In OVE, the platform team configures MetalLB BGP peering directly. This blurs the traditional boundary between "network engineering" and "platform engineering" and requires cross-team collaboration.
- Dual-stack is a "when," not "if." Even if the organization runs pure IPv4 today, the platform must support dual-stack for future requirements. The more important question is whether the operational tooling -- monitoring, logging, firewall policies, IPAM, compliance reporting -- handles both address families with equal maturity.
Discussion Guide
The following questions are designed to reveal the depth of a vendor's or SME's understanding of their networking stack. They should be asked during vendor workshops, PoC planning sessions, and architecture reviews.
1. VLAN and Overlay Architecture
"Walk us through the packet path when VM-A on Node-1 in VLAN 100 sends a TCP packet to VM-B on Node-7 in VLAN 200. Include the encapsulation and decapsulation steps, the routing decision point, and the exact headers present on the wire at each hop. Where does the 802.1Q tag get stripped and the overlay header get added?"
Purpose: Tests whether the vendor understands the interaction between physical VLANs and overlay encapsulation. A weak answer will describe it abstractly ("the SDN handles it"); a strong answer will name the specific components (OVS br-int, GENEVE tunnel, OVN logical router) and describe the header transformations.
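As an illustration of the expected level of detail, the sketch below (Python, illustrative only) lists the header layers for a GENEVE overlay carried over a VLAN-tagged underlay. The sizes are the standard base-header lengths and the component behavior described in the comments reflects a distributed-router design; both are stated as assumptions, not as a description of any specific candidate.

# Illustrative only: header layers for one VM-A -> VM-B TCP packet, assuming a
# GENEVE overlay over a VLAN-tagged underlay. Sizes are base header lengths;
# GENEVE options and IPv6 underlays add more.

INNER = [
    ("Inner Ethernet (VM-A -> logical router MAC)", 14),
    ("Inner IPv4     (VM-A -> VM-B)",               20),
    ("TCP",                                          20),
]

OUTER = [
    ("Outer Ethernet (Node-1 NIC -> leaf switch)",  14),
    ("802.1Q tag     (underlay/transport VLAN)",     4),
    ("Outer IPv4     (Node-1 VTEP -> Node-7 VTEP)", 20),
    ("UDP            (dst port 6081 for GENEVE)",    8),
    ("GENEVE base header (+ options, if any)",       8),
]

def show(stack, title):
    print(title)
    for name, size in stack:
        print(f"  {size:>2} B  {name}")
    print(f"  total: {sum(s for _, s in stack)} B\n")

# Inside the VM and on the node's virtual switch, only the inner headers exist;
# the VM-facing VLAN (100/200) is a logical construct, so no 802.1Q tag for it
# appears once the SDN maps the port to a logical switch.
show(INNER, "Headers the VM sees (before encapsulation on Node-1):")

# On the wire between Node-1 and Node-7, the full outer + inner stack is present.
# In a distributed-router design, the VLAN 100 -> VLAN 200 routing decision is
# made on Node-1 before encapsulation.
show(OUTER + INNER, "Headers on the physical wire (Node-1 -> leaf -> Node-7):")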
2. MTU and Fragmentation
"Our physical switches are configured with MTU 9216. Our current VMs expect MTU 1500. After migration to your platform, what is the effective MTU inside the VM on an overlay network? What happens if a VM sends a 1500-byte packet and the overlay encapsulation pushes it above the physical MTU? Have you encountered Path MTU Discovery black holes in production deployments, and how did you diagnose them?"
Purpose: Tests MTU planning discipline. The correct answer includes the specific overhead (50-74 bytes for GENEVE/VXLAN), the resulting inner MTU, and the PMTUD failure mode. A vendor who says "it just works" has not deployed at scale.
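The overhead arithmetic can be made concrete with a small sketch. The figures assume GENEVE over IPv4 with no options, which is the low end of the 50-74 byte planning range quoted above.

# Back-of-the-envelope inner-MTU calculation, assuming GENEVE over IPv4.
# Base overhead between the inner IP packet and the outer IP packet the node
# sends: outer IPv4 (20) + UDP (8) + GENEVE base header (8) + inner Ethernet (14).
GENEVE_IPV4_BASE_OVERHEAD = 20 + 8 + 8 + 14  # = 50 bytes; GENEVE options add more

def inner_ip_mtu(host_nic_mtu: int, geneve_options: int = 0) -> int:
    """Largest IP packet a VM can send without oversizing the overlay packet."""
    return host_nic_mtu - GENEVE_IPV4_BASE_OVERHEAD - geneve_options

for nic_mtu in (1500, 9000):
    print(f"host NIC MTU {nic_mtu} -> max inner VM MTU {inner_ip_mtu(nic_mtu)}")

# A VM that believes its MTU is 1500 on a 1500-byte underlay produces overlay
# packets of 1550 bytes. Depending on the stack these are fragmented (performance
# collapse) or dropped; if the resulting ICMP "fragmentation needed" message
# never reaches the sender, the result is a classic PMTUD black hole.
vm_packet = 1500
print("resulting overlay packet size:", vm_packet + GENEVE_IPV4_BASE_OVERHEAD)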
3. Bonding and NIC Failover
"We plan to use LACP bonding with two 25 GbE NICs per host. Describe the hash algorithm your platform uses to distribute VM traffic across the two links. How do you verify that traffic is actually balanced? What happens during the 3-second LACP convergence window if a single link fails -- do VMs experience packet loss? How does this interact with your overlay tunnel endpoints?"
Purpose: Tests understanding of bonding internals. The answer should reference xmit_hash_policy, how to measure per-link utilization (e.g., ethtool -S, ip -s link), and the relationship between LACP failover and overlay tunnel re-establishment.
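The following toy model is not the actual Linux bonding hash; it only illustrates which header fields feed the hash under each policy, and therefore why layer2 collapses all tunnel traffic between two hosts onto one link while layer3+4 spreads flows roughly evenly.

# Toy model of bond transmit hashing over two links. The real bonding driver
# uses different hash functions; this only shows the distribution behavior.
import random
from collections import Counter

NUM_LINKS = 2

def pick_link_layer2(src_mac, dst_mac):
    # layer2: only MAC addresses. All overlay traffic between the same two
    # hypervisors shares one MAC pair -> every flow lands on the same link.
    return hash((src_mac, dst_mac)) % NUM_LINKS

def pick_link_layer3_4(src_ip, dst_ip, src_port, dst_port):
    # layer3+4: IPs and L4 ports. The varying outer UDP source port of the
    # tunnel gives each flow its own hash input.
    return hash((src_ip, dst_ip, src_port, dst_port)) % NUM_LINKS

# Simulate 10,000 flows between the same two hypervisors (fixed MACs, fixed
# tunnel endpoint IPs, varying ephemeral source ports).
flows = [("00:11:22:33:44:55", "66:77:88:99:aa:bb",
          "10.0.0.1", "10.0.0.2", random.randint(1024, 65535), 6081)
         for _ in range(10_000)]

l2 = Counter(pick_link_layer2(f[0], f[1]) for f in flows)
l34 = Counter(pick_link_layer3_4(f[2], f[3], f[4], f[5]) for f in flows)
print("layer2   link usage:", dict(l2))   # everything on one link
print("layer3+4 link usage:", dict(l34))  # roughly 50/50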
4. East-West Micro-Segmentation at Scale
"We have 5,000 VMs across approximately 200 application groups. Each application group needs firewall rules allowing traffic within the group but blocking traffic between groups. How does your platform implement these rules? Where in the data path are they enforced? What is the performance overhead per packet? How many rules can a single host support before performance degrades? How do we audit that the rules are actually enforced?"
Purpose: Tests micro-segmentation scalability. The answer should describe distributed enforcement (at the vSwitch, not a central appliance), the specific mechanism (OVN ACLs, VFP rules, NSX DFW rules), and the performance impact. Ask for specific numbers -- "we tested 10,000 rules per host with less than 2% throughput degradation" is better than "it scales well."
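A rough rule-count estimate shows why distributed, group-based enforcement matters. The allow-within-group/deny-between model and the even group sizes below are simplifying assumptions, not measurements from any candidate.

# Rough rule-count estimate for group-based segmentation. Real platforms
# (OVN ACLs with address sets, NSX DFW groups, VFP) sit close to the second
# number because membership is resolved dynamically, not expanded per VM pair.
VMS = 5000
GROUPS = 200
VMS_PER_GROUP = VMS // GROUPS  # 25, assuming an even spread

# Naive expansion: one allow rule per ordered VM pair inside each group.
naive_rules = GROUPS * VMS_PER_GROUP * (VMS_PER_GROUP - 1)

# Group/address-set model: one "allow group -> same group" rule per group,
# plus a single default-deny; the SDN updates membership as VMs come and go.
group_rules = GROUPS + 1

print(f"naive per-pair rules:    {naive_rules:,}")  # 120,000
print(f"group/address-set rules: {group_rules:,}")  # 201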
5. BGP and Physical Fabric Integration
"Our data center runs a spine-leaf fabric with eBGP (one AS per leaf). How does your platform announce service VIPs and VM network routes to the leaf switches? What AS number does the platform use? How do we control which prefixes are announced and which are filtered? If a node hosting a VIP fails, how quickly does the BGP route withdrawal propagate, and what is the resulting traffic disruption?"
Purpose: Tests BGP integration maturity. The answer should include specific convergence times (with and without BFD), route filtering mechanisms (prefix lists, communities), and the interaction between the platform's internal routing and the external BGP peering. For OVE, expect discussion of MetalLB BGP configuration and FRR (Free Range Routing) under the hood.
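The detection-time arithmetic can be sketched as below. The timer values are commonly used examples, not any candidate's defaults, and route-withdrawal propagation plus FIB reprogramming come on top of detection.

# Worst-case time a dead next-hop keeps attracting traffic, assuming failure
# detection must come from timers rather than a link-down event.

def detection_time(hold_s=None, bfd_interval_ms=None, bfd_multiplier=3):
    if bfd_interval_ms is not None:
        return bfd_interval_ms * bfd_multiplier / 1000.0  # seconds
    return hold_s  # without BFD the peer is only declared down at hold-timer expiry

print("BGP timers only (hold 90s):      ", detection_time(hold_s=90), "s")
print("Aggressive BGP timers (hold 9s): ", detection_time(hold_s=9), "s")
print("BFD 300 ms x 3:                  ", detection_time(bfd_interval_ms=300), "s")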
6. DNS/DHCP Integration and IPAM
"When a new VM is provisioned, how does it get its IP address and DNS record? Is this integrated with our existing Infoblox IPAM, or does the platform maintain its own address pool? What happens to the DNS record when the VM is live-migrated to a different node? What happens when the VM is deleted -- is the DNS record cleaned up automatically, or is it orphaned?"
Purpose: Tests operational maturity around IP lifecycle management. A platform that provisions VMs without cleaning up DNS/DHCP records will accumulate stale entries over time, leading to IP conflicts and resolution failures. Look for answers that describe explicit integration with enterprise IPAM (Infoblox, BlueCat) rather than "we have our own DHCP."
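The failure mode can be summarized with a toy lifecycle model. Class and method names below are hypothetical, not a real IPAM API; an actual integration would call the vendor's API instead of mutating dictionaries.

# Toy model of the IP/DNS lifecycle the question probes.

class Ipam:
    def __init__(self):
        self.leases = {}   # VM name -> IP
        self.dns = {}      # FQDN -> IP

    def provision(self, vm, fqdn, ip):
        self.leases[vm] = ip
        self.dns[fqdn] = ip          # A/AAAA record registered at create time

    def live_migrate(self, vm):
        pass                         # IP and DNS record must NOT change

    def delete(self, vm, fqdn, cleanup=True):
        ip = self.leases.pop(vm)
        if cleanup:
            self.dns.pop(fqdn)       # record released together with the lease
        return ip                    # without cleanup the record is orphaned

ipam = Ipam()
ipam.provision("vm-0423", "vm-0423.prod.example.net", "10.20.30.40")
ipam.delete("vm-0423", "vm-0423.prod.example.net", cleanup=False)
print("orphaned records:", ipam.dns)  # stale entry waiting to cause a conflict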
7. SDN Troubleshooting Under Pressure
"A production VM loses network connectivity at 2:00 AM. Your SDN overlay is healthy at the control plane level -- no errors in the controller. The physical network shows no link failures. Walk us through your step-by-step troubleshooting methodology. What tools do you use? How do you determine whether the problem is in the overlay, the underlay, the VM's guest OS, or a firewall rule? How long does this diagnosis typically take?"
Purpose: Tests real-world troubleshooting capability. For OVE, expect references to ovs-ofctl dump-flows, ovn-trace, tcpdump on the node, and checking ovn-sbctl port bindings. For Azure Local, expect Test-NetConnection, Hyper-V network adapter diagnostics, and Network Controller event logs. For ESC, the honest answer is "we open a ticket with Swisscom" -- which is a valid model but must be evaluated against the required incident response time.
8. Dual-Stack Operational Readiness
"We currently run IPv4-only. Your platform supports dual-stack. If we enable IPv6 on day one, what changes in our operational model? Specifically: do our existing IPv4 firewall rules automatically apply to IPv6? Do our monitoring dashboards show IPv6 traffic? Does our SIEM correlate events across both address families? Can a VM configured with only IPv4 still communicate with a dual-stack VM?"
Purpose: Tests whether dual-stack is truly production-ready or just "technically supported." The critical failure mode is enabling IPv6 without corresponding security policies, creating an unmonitored path through the infrastructure. A mature answer will describe how the platform ensures policy parity between IPv4 and IPv6.
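One way to make "policy parity" concrete is an audit like the sketch below. The rule structure is hypothetical; a real check would export rules from the SDN's policy API and normalize them first.

# Sketch of a policy-parity audit: flag IPv4 rules with no IPv6 counterpart.

rules = [
    {"name": "app1-web-to-db", "family": "ipv4", "action": "allow", "port": 5432},
    {"name": "app1-web-to-db", "family": "ipv6", "action": "allow", "port": 5432},
    {"name": "app2-deny-ssh",  "family": "ipv4", "action": "deny",  "port": 22},
    # no IPv6 twin for app2-deny-ssh -> SSH over IPv6 is silently unfiltered
]

def parity_gaps(rules):
    by_family = {"ipv4": set(), "ipv6": set()}
    for r in rules:
        by_family[r["family"]].add((r["name"], r["action"], r["port"]))
    return by_family["ipv4"] - by_family["ipv6"]

for gap in parity_gaps(rules):
    print("IPv4 rule without IPv6 counterpart:", gap)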
9. Network Performance Baseline and SLA
"What network latency should we expect for east-west VM-to-VM traffic on the same leaf switch? On different leaf switches (one spine hop)? Through the SDN overlay? How does this compare to our current VMware NSX latency baseline? Do you have published benchmarks, and can we reproduce them in our PoC environment?"
Purpose: Establishes concrete performance expectations. Typical east-west latency in a well-designed overlay network is 50-150 microseconds per hop. If the vendor cannot provide specific numbers or says "it depends," they have not benchmarked their own stack.
10. Day-2 Network Changes Without Downtime
"Six months after go-live, we need to add a new VLAN (VLAN 500) to all 100 worker nodes for a new application tier. Describe the process. Can this be done without draining VMs or rebooting nodes? How do we verify that the VLAN is correctly trunked on all physical switch ports and correctly configured on all host interfaces before connecting any VMs to it?"
Purpose: Tests Day-2 operational maturity. For OVE, expect a description of creating a NodeNetworkConfigurationPolicy via NMState, which rolls out the VLAN configuration across all nodes automatically with validation and rollback. For Azure Local, expect Network ATC intent updates. For ESC, expect a Swisscom change request with an SLA. The key differentiator is whether the change is self-service and automated or requires a ticket and a maintenance window.
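Regardless of which mechanism applies the change, the "verify before connecting VMs" step can be automated on the host side. The sketch below assumes SSH access to the nodes and a hypothetical inventory and interface naming scheme; the switch-side trunk configuration still needs separate verification (LLDP, the fabric controller, or the switch CLI).

# Pre-flight check: confirm a VLAN 500 sub-interface exists and carries the
# expected VLAN ID on every worker before any VM is attached to it.
import subprocess

NODES = [f"worker-{i:03d}" for i in range(1, 101)]   # hypothetical inventory
IFACE = "bond0.500"                                   # assumed naming scheme

def vlan_ok(node: str) -> bool:
    out = subprocess.run(
        ["ssh", node, "ip", "-d", "link", "show", IFACE],
        capture_output=True, text=True)
    # iproute2 prints the 802.1Q details (e.g. "vlan protocol 802.1Q id 500")
    return out.returncode == 0 and "vlan" in out.stdout and "id 500" in out.stdout

missing = [n for n in NODES if not vlan_ok(n)]
print("nodes missing VLAN 500:", missing or "none")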