Modern datacenters and beyond

Networking Foundational Concepts

Why This Matters

Every VM running in a data center depends on the network for three things: reaching other VMs (east-west), reaching users and external systems (north-south), and reaching shared storage (storage traffic). When you operate 5,000+ VMs, the network is not "plumbing" -- it is the primary determinant of latency, security posture, failure domain size, and operational complexity. A misconfigured MTU silently fragments storage replication traffic. A bonding mode mismatch causes asymmetric hashing and saturates one link while the other idles. A missing VLAN trunk drops an entire application tier during live migration.

The platform candidates under evaluation -- OVE, Azure Local, and Swisscom ESC -- each make fundamentally different architectural choices about how networking is realized. Understanding the foundational concepts below is necessary to evaluate those choices, to ask the right questions during vendor workshops, and to design a network architecture that will serve the organization for the next 5-10 years.

This document covers the "vocabulary layer" -- the concepts that every subsequent networking topic (overlays, SR-IOV, micro-segmentation, ingress routing) builds upon. If the team owns these eight topics, the advanced material becomes a matter of learning product-specific implementations rather than learning entirely new ideas.


Concepts

1. VLANs (Virtual Local Area Networks)

What It Is and Why It Exists

A VLAN is a logical partitioning of a single physical Layer-2 network into multiple isolated broadcast domains. Without VLANs, every device connected to the same set of switches shares a single broadcast domain -- every ARP request, every DHCP discover, every broadcast storm reaches every port. At 5,000+ VMs, an unpartitioned flat network would collapse under broadcast traffic alone.

VLANs solve three problems simultaneously:

- Broadcast containment: each VLAN is its own broadcast domain, so ARP and DHCP chatter stays inside the segment
- Security segmentation: traffic cannot cross a VLAN boundary without passing through a router or firewall, where policy can be enforced
- Logical grouping: hosts can share a Layer-2 segment regardless of where they sit physically in the fabric

How It Works

VLANs are defined by IEEE 802.1Q. When a frame enters a switch port configured as a "trunk" (carrying multiple VLANs), the switch inserts a 4-byte 802.1Q tag into the Ethernet frame header between the Source MAC and the EtherType/Length field.

Standard Ethernet Frame (untagged):
+----------+----------+-----------+---------------------+-----+
| Dst MAC  | Src MAC  | EtherType |       Payload       | FCS |
| 6 bytes  | 6 bytes  | 2 bytes   |    46-1500 bytes    | 4 B |
+----------+----------+-----------+---------------------+-----+

802.1Q Tagged Frame:
+----------+----------+------+------+-----------+---------------------+-----+
| Dst MAC  | Src MAC  | TPID | TCI  | EtherType |       Payload       | FCS |
| 6 bytes  | 6 bytes  | 2 B  | 2 B  | 2 bytes   |    46-1500 bytes    | 4 B |
+----------+----------+------+------+-----------+---------------------+-----+
                       |            |
                       +-- 802.1Q --+
                           Tag
                          (4 bytes)

TPID (Tag Protocol Identifier): 0x8100 -- signals "this frame is tagged"
TCI  (Tag Control Information):
  +---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
  |        PCP        |DEI|              VLAN ID (VID)             |
  | 3 bits (priority) |1b |           12 bits (0-4095)            |
  +---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+

PCP  = Priority Code Point (802.1p QoS, 0-7)
DEI  = Drop Eligible Indicator
VID  = VLAN Identifier (12 bits = 4096 possible VLANs, 0 and 4095 reserved)
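Because the TCI is just packed bit fields, it can be unpacked with plain shell arithmetic -- a small sketch (the value 0xA064 is an illustrative TCI encoding priority 5, VLAN 100):

```shell
# Decode an 802.1Q TCI: PCP (top 3 bits) | DEI (1 bit) | VID (low 12 bits)
tci=$(( 0xA064 ))
pcp=$(( (tci >> 13) & 0x7 ))     # Priority Code Point
dei=$(( (tci >> 12) & 0x1 ))     # Drop Eligible Indicator
vid=$((  tci        & 0xFFF ))   # VLAN ID
echo "PCP=$pcp DEI=$dei VID=$vid"   # PCP=5 DEI=0 VID=100
```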

Port types on a switch:

- Access port: belongs to exactly one VLAN; frames enter and leave untagged, and the switch associates them with the port's VLAN internally
- Trunk port: carries multiple VLANs; frames are 802.1Q-tagged, except for the configured "native VLAN," which travels untagged

On a Linux host (which is what OVE worker nodes are -- Azure Local nodes run a Windows-based OS and handle VLANs through Network ATC), VLANs are created as sub-interfaces:

# Create VLAN 100 on top of physical interface ens1f0
ip link add link ens1f0 name ens1f0.100 type vlan id 100
ip link set ens1f0.100 up
ip addr add 10.20.100.1/24 dev ens1f0.100

The kernel's VLAN driver (8021q module) handles tag insertion and removal transparently. The physical NIC sees tagged frames on the wire; the VLAN sub-interface presents untagged frames to the IP stack.
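Whether the sub-interface is really tagging can be confirmed with `ip -d` -- a quick check, assuming the interface names from the example above:

```shell
# -d (details) prints type-specific info, including the VLAN protocol and ID
ip -d link show ens1f0.100
# Look for a line of the form:
#   vlan protocol 802.1Q id 100 <REORDER_HDR>
```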

Practical Impact at Scale

With 5,000+ VMs, VLAN planning is critical:

- The 12-bit VID leaves at most 4,094 usable segments -- a real ceiling for multi-tenant designs
- Every VLAN must be trunked consistently to every switch port and host that may run its VMs; a missing trunk entry surfaces only when a live migration lands a VM on the unlucky host
- Broadcast domains must be sized so ARP and DHCP traffic stays manageable as VM counts grow

Relationship to Other Topics

VLANs are the Layer-2 foundation on which everything else is built. SDN overlays (VXLAN, GENEVE) encapsulate VLAN-like segments inside UDP packets to overcome the 4,094 VLAN limit. Bonding modes determine how trunk traffic is distributed across physical links. MTU settings must account for the 4-byte 802.1Q tag (and the much larger overlay headers). BGP can be used to advertise VLAN subnets between leaf switches in a spine-leaf fabric.


2. East-West vs. North-South Traffic

What It Is and Why It Exists

These terms describe the two fundamental traffic flow directions inside a data center:

                        External Users / Internet
                                  |
                          [ Perimeter FW ]
                                  |
                          [ Core Router ]          <-- North-South
                           /          \                 boundary
                          /            \
                   [ Spine 1 ]    [ Spine 2 ]
                    / | \            / | \
                   /  |  \          /  |  \
             [Leaf1] [Leaf2] [Leaf3] [Leaf4] [Leaf5]
               |       |       |       |       |
             +---+   +---+   +---+   +---+   +---+
             |VM1|   |VM2|   |VM3|   |VM4|   |VM5|
             +---+   +---+   +---+   +---+   +---+
                  \    |    /     \    |    /
                   `---+---'       `--+---'
                   East-West       East-West
                   traffic         traffic

Why the Distinction Matters

In a modern data center running microservices, multi-tier applications, and storage replication, east-west traffic typically dominates north-south traffic by a factor of 5:1 to 20:1. A single user HTTP request entering the data center (north-south) may trigger dozens of internal API calls, database queries, cache lookups, and storage I/O operations (all east-west).

This ratio has profound architectural implications:

- Bandwidth must be provisioned inside the fabric (spine-leaf, generous leaf uplinks), not just at the perimeter
- Security cannot rely on a perimeter firewall that 80-90% of traffic never crosses -- hence micro-segmentation
- Failure domains and oversubscription ratios are defined by leaf-to-leaf paths, not by the internet edge

Traffic Patterns in a 5,000+ VM Environment

Typical traffic composition (5,000+ VM data center):

  North-South (user-facing):     ~10-20% of total bandwidth
  East-West (app-to-app):        ~30-40% of total bandwidth
  East-West (storage):           ~30-40% of total bandwidth
  East-West (management/control): ~5-10% of total bandwidth
                                  ----
                                  100%

Storage replication traffic is often the largest single contributor to east-west traffic in hyperconverged environments (OVE with ODF, Azure Local with S2D), because every write is replicated to 2-3 nodes across the fabric.

Relationship to Other Topics


3. MTU / Jumbo Frames

What It Is and Why It Exists

The Maximum Transmission Unit (MTU) is the largest Layer-3 packet (IP datagram) that can be transmitted over a network link without fragmentation. The standard Ethernet MTU is 1500 bytes. A "jumbo frame" is any Ethernet frame carrying more than 1500 bytes of payload; in data center practice the MTU is set to 9000 bytes (some switch vendors support up to 9216).

Standard MTU (1500 bytes):
+------------------+--------------------+
| IP + TCP Header  |     Payload        |
|   40 bytes       |   1460 bytes       |
+------------------+--------------------+
   Overhead: 40/1500 = 2.67%

Jumbo Frame MTU (9000 bytes):
+------------------+--------------------+
| IP + TCP Header  |     Payload        |
|   40 bytes       |   8960 bytes       |
+------------------+--------------------+
   Overhead: 40/9000 = 0.44%

Reduction in per-byte overhead: ~6x
Reduction in packets for a 1 GiB transfer:
  MTU 1500: ~735,440 packets
  MTU 9000: ~119,838 packets  (6.1x fewer packets)

Every packet incurs CPU overhead for processing (interrupt handling, checksum computation, buffer allocation). With 6x fewer packets for the same data volume, jumbo frames significantly reduce CPU utilization for high-throughput transfers -- exactly the kind of transfers that happen during storage replication, VM live migration, and backup operations.
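The packet-count arithmetic can be reproduced directly -- a sketch assuming a 1 GiB transfer and the 40-byte IP+TCP header from the diagrams above:

```shell
# Packets needed to move 1 GiB of payload at each MTU (ceiling division)
GIB=$(( 1024 * 1024 * 1024 ))
pkts_1500=$(( (GIB + 1460 - 1) / 1460 ))   # 1460-byte payload per packet
pkts_9000=$(( (GIB + 8960 - 1) / 8960 ))   # 8960-byte payload per packet
echo "MTU 1500: $pkts_1500 packets"        # 735440
echo "MTU 9000: $pkts_9000 packets"        # 119838
```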

How It Works

MTU is configured at the interface level. On Linux:

# Set MTU to 9000 on a physical interface
ip link set dev ens1f0 mtu 9000

# Verify
ip link show ens1f0
# 2: ens1f0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 ...

# Set MTU on a VLAN sub-interface (must be <= parent MTU)
ip link set dev ens1f0.100 mtu 9000

# Set MTU on a bond interface
ip link set dev bond0 mtu 9000

The critical rule: MTU must be consistent across the entire Layer-2 path. If any link in the chain has a smaller MTU, packets will be fragmented (if the DF bit is not set) or dropped (if the DF bit is set, which is the default for most modern stacks). This includes:

- The VM's virtual NIC
- The host's virtual switch or bridge (and any bond or VLAN sub-interfaces)
- The physical NICs
- Every physical switch port along the path

MTU Path Consistency Example:

VM-A (MTU 9000)
  |
  v
[OVS bridge] (MTU 9000)
  |
  v
[Physical NIC ens1f0] (MTU 9000)
  |
  v
[Leaf Switch Port] (MTU 9216)    <-- switch MTU must be >= frame MTU
  |
  v
[Spine Switch Port] (MTU 9216)
  |
  v
[Leaf Switch Port] (MTU 9216)
  |
  v
[Physical NIC ens1f1] (MTU 9000)
  |
  v
[OVS bridge] (MTU 9000)
  |
  v
VM-B (MTU 9000)

If ANY hop drops to MTU 1500:
  - TCP will negotiate a smaller MSS via Path MTU Discovery (PMTUD)
  - OR packets will be silently dropped if ICMP "Fragmentation Needed"
    messages are blocked by a firewall (a very common misconfiguration)
  - Result: "black hole" connections that establish but hang on large transfers
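A suspected black hole can be probed from any Linux host with DF-bit pings -- a diagnostic sketch (the peer address is illustrative):

```shell
# ICMP payload = MTU - 20 (IP header) - 8 (ICMP header)
jumbo_probe=$(( 9000 - 28 ))   # 8972
std_probe=$((  1500 - 28 ))    # 1472

# -M do sets the DF bit, so the probe must cross every hop unfragmented
ping -c 3 -M do -s "$jumbo_probe" 10.20.100.2

# If the jumbo probe fails but this one succeeds, some hop is at MTU 1500:
ping -c 3 -M do -s "$std_probe" 10.20.100.2
```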

Overlay Encapsulation and MTU

This is where MTU becomes especially important for our evaluation. SDN overlays (VXLAN, GENEVE, which OVE uses) add encapsulation headers to every packet:

VXLAN overhead:
  Outer Ethernet:  14 bytes
  Outer IP:        20 bytes
  Outer UDP:        8 bytes
  VXLAN header:     8 bytes
  -------------------------
  Total overhead:  50 bytes

GENEVE overhead (OVN default):
  Outer Ethernet:  14 bytes
  Outer IP:        20 bytes
  Outer UDP:        8 bytes
  GENEVE header:    8 bytes + variable options (typically 4-32 bytes)
  -------------------------
  Total overhead:  50-74 bytes

If physical MTU = 1500 and overlay overhead = 50 bytes:
  Effective inner MTU = 1500 - 50 = 1450 bytes
  --> VMs see MTU 1450, not 1500
  --> Applications expecting MTU 1500 will experience fragmentation or drops

If physical MTU = 9000 and overlay overhead = 50 bytes:
  Effective inner MTU = 9000 - 50 = 8950 bytes
  --> VMs see MTU 8950, which is effectively a jumbo frame
  --> No performance degradation from encapsulation

This is why jumbo frames on the physical underlay are effectively mandatory for any platform using overlay networking at scale. Without them, the inner MTU drops below 1500, which breaks assumptions in many applications and TCP stacks.
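The same subtraction works for any physical MTU -- a one-line check using the overhead figures above:

```shell
# Effective inner MTU = physical MTU - encapsulation overhead
for phys in 1500 9000; do
  echo "physical $phys -> VXLAN inner $(( phys - 50 )), GENEVE inner $(( phys - 74 )) to $(( phys - 50 ))"
done
```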

Practical Impact at Scale

Relationship to Other Topics


4. Bonding Modes

What It Is and Why It Exists

NIC bonding (also called "link aggregation" or "NIC teaming") combines multiple physical network interfaces into a single logical interface. The goals are:

On Linux, bonding is implemented by the bonding kernel module. The bond interface is created and managed via ip link or through configuration management tools like NMState (used by OVE) or Network ATC (used by Azure Local).

# Create a bond; mode and LACP parameters must be set while the bond is down
ip link add bond0 type bond mode 802.3ad miimon 100 \
    lacp_rate fast xmit_hash_policy layer3+4

# Member interfaces must be down before they can be enslaved
ip link set ens1f0 down
ip link set ens1f1 down
ip link set ens1f0 master bond0
ip link set ens1f1 master bond0

# Bring up (member interfaces come up with the bond)
ip link set bond0 up
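Once the bond is up, the bonding driver reports its negotiated state under /proc -- a quick diagnostic (assumes a bond0 like the one above exists):

```shell
# Full report: mode, MII status per member, 802.3ad aggregator and partner info
cat /proc/net/bonding/bond0

# Just the headline facts:
grep -E 'Bonding Mode|Transmit Hash Policy|MII Status' /proc/net/bonding/bond0
```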

Bonding Modes in Detail

Linux supports seven bonding modes. Not all are equally relevant for data center virtualization:

+------+------------------+------------+------------+---------------------------------+
| Mode | Name             | Redundancy | Throughput | Mechanism                       |
+------+------------------+------------+------------+---------------------------------+
|  0   | balance-rr       | Yes        | Yes        | Round-robin packet distribution |
|  1   | active-backup    | Yes        | No         | One active, others standby      |
|  2   | balance-xor      | Yes        | Yes        | XOR hash on MAC addresses       |
|  3   | broadcast        | Yes        | No         | Send on all interfaces          |
|  4   | 802.3ad (LACP)   | Yes        | Yes        | Dynamic link aggregation        |
|  5   | balance-tlb      | Yes        | Tx only    | Adaptive transmit load balance  |
|  6   | balance-alb      | Yes        | Yes        | Adaptive load balancing         |
+------+------------------+------------+------------+---------------------------------+

Mode 1 (active-backup) -- the simplest and most universally compatible:

                 bond0 (active-backup)
                /                     \
        [ens1f0 ACTIVE]         [ens1f1 STANDBY]
              |                       |
        [Leaf Switch A]         [Leaf Switch B]

- Only one NIC transmits and receives at any time
- If the active NIC fails, the standby takes over (failover time = miimon interval)
- No switch configuration required
- Maximum throughput = single link speed (e.g., 25 Gbps on 25 GbE NICs)
- Use case: management networks, low-bandwidth VLANs, or when switches
  do not support LACP

Mode 4 (802.3ad / LACP) -- the enterprise standard for high-throughput:

                 bond0 (802.3ad)
                /                \
        [ens1f0]              [ens1f1]
              |                   |
        [Leaf Switch, LAG group / Port-Channel]

- Both NICs transmit and receive simultaneously
- Requires switch-side LACP configuration (both ends must negotiate)
- Traffic distribution is based on a hash function (xmit_hash_policy):
    layer2:    Hash on src/dst MAC            (poor distribution for VMs)
    layer3+4:  Hash on src/dst IP + port      (best distribution for VMs)
    encap3+4:  Hash on inner headers          (for encapsulated traffic)
- Maximum throughput = N x link speed (e.g., 2 x 25 Gbps = 50 Gbps aggregate)
  BUT: a single flow still maxes out at a single link's speed
- Both NICs must connect to the same switch or to an MLAG/SMLT pair

The xmit_hash_policy is critical for VM environments. A layer2 hash uses only source and destination MAC addresses. For overlay traffic (VXLAN/GENEVE between nodes) this is the worst case: every encapsulated frame between a given pair of nodes carries the same outer MAC addresses, so the entire tunnel collapses onto one link. With layer3+4, the hash adds IP addresses and TCP/UDP ports -- and because the overlay derives its outer UDP source port from a hash of the inner flow, per-VM flows spread across both links.

Hash policy comparison for tunneled traffic between two nodes
(100 VM flows inside the tunnel):

  layer2 hash:   hash(src_mac, dst_mac) on the outer frame
                 --> Same MAC pair for every encapsulated packet
                 --> 100% of node-to-node traffic on one link, 0% on the other

  layer3+4 hash: hash(src_ip, dst_ip, src_port, dst_port)
                 --> Outer UDP source port varies per inner flow
                 --> ~50/50 distribution across both links

Mode 6 (balance-alb) -- no switch configuration needed, works across switches:

                 bond0 (balance-alb)
                /                \
        [ens1f0]              [ens1f1]
              |                   |
        [Leaf Switch A]     [Leaf Switch B]    <-- can be different switches!

- Outgoing: distributes based on per-peer adaptive load balancing
- Incoming: uses ARP negotiation to balance receive traffic
- No switch-side configuration required
- Works across two independent switches (no MLAG needed)
- Drawback: ARP manipulation can confuse some network monitoring tools
  and may cause issues with certain switch security features (DAI, port security)

Practical Considerations for 5,000+ VMs

Typical dual-bond architecture per host:

    bond0 (LACP, 2x25GbE)              bond1 (LACP, 2x25GbE)
    xmit_hash_policy: layer3+4         xmit_hash_policy: layer3+4
    MTU: 9000                          MTU: 9000
         |                                  |
    VM traffic                         Storage replication
    Overlay (VXLAN/GENEVE)             Ceph/S2D cluster traffic
    Management                         Live migration

Relationship to Other Topics


5. DNS / DHCP in Virtualized Environments

What It Is and Why It Exists

DNS (Domain Name System) translates human-readable names to IP addresses. DHCP (Dynamic Host Configuration Protocol) automatically assigns IP addresses, subnet masks, gateways, and DNS servers to devices when they join a network. In a physical data center with a few hundred servers, DNS and DHCP are straightforward services. In a virtualized environment with 5,000+ VMs that are created, cloned, migrated, and destroyed dynamically, they become critical infrastructure that must scale, respond quickly, and integrate tightly with the orchestration layer.

How DNS Works in Virtualized Environments

DNS in a virtualized data center has two dimensions:

1. Infrastructure DNS (the platform's own name resolution): The virtualization platform itself needs DNS to function. Kubernetes (OVE) relies on CoreDNS for internal service discovery. Azure Local uses Active Directory DNS for cluster operations. These are internal, platform-managed DNS systems that tenants do not directly interact with.

OVE / Kubernetes DNS architecture:

  Pod/VM makes DNS query for "my-database.my-namespace.svc.cluster.local"
       |
       v
  CoreDNS (runs as Pods in openshift-dns namespace)
       |
       +-- Is it a cluster-internal name? --> Resolve from Kubernetes API
       |       (Services, Endpoints, Pods)
       |
       +-- Is it an external name? --> Forward to upstream DNS servers
               (configured in cluster DNS operator config)

  DNS search path inside a Pod/VM:
    my-namespace.svc.cluster.local
    svc.cluster.local
    cluster.local
    <external domain>

2. Tenant/workload DNS: The VMs themselves need DNS for their own applications -- resolving database hostnames, external APIs, internal service names. This is typically served by the organization's existing enterprise DNS infrastructure (e.g., Windows DNS, Infoblox, or BIND), not by the virtualization platform.

The integration challenge: When a VM is created or migrated, its DNS record must be updated to reflect its new IP address. In VMware, this is often handled by the VM's DHCP lease triggering a Dynamic DNS (DDNS) update. In Kubernetes/OVE, the platform itself manages DNS for Services and Pods, but VMs exposed on secondary networks (via Multus) need explicit DNS integration -- either via DDNS from the guest OS or via external-dns operators.
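Inside a Pod or VM on the cluster network, this machinery is visible in the resolver configuration -- a diagnostic sketch (the service and namespace names are illustrative):

```shell
# The kubelet writes the cluster search path into each Pod:
cat /etc/resolv.conf
# Typically contains lines like:
#   search my-namespace.svc.cluster.local svc.cluster.local cluster.local
#   nameserver <cluster DNS Service IP>

# A short name is expanded through the search path, so this tries
# my-database.my-namespace.svc.cluster.local first:
nslookup my-database
```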

How DHCP Works in Virtualized Environments

DHCP follows a four-step process (DORA):

VM (Client)                            DHCP Server
    |                                       |
    |--- DHCP Discover (broadcast) -------->|
    |                                       |
    |<------ DHCP Offer (unicast) ----------|
    |    (IP: 10.20.100.50,                 |
    |     Mask: 255.255.255.0,              |
    |     GW: 10.20.100.1,                  |
    |     DNS: 10.1.1.53,                   |
    |     Lease: 8 hours)                   |
    |                                       |
    |--- DHCP Request (broadcast) --------->|
    |    ("I accept 10.20.100.50")          |
    |                                       |
    |<------ DHCP ACK (unicast) ------------|
    |    ("Confirmed, it's yours")          |
    |                                       |

DHCP in virtualized environments has specific challenges:

- DHCP Discover is a Layer-2 broadcast, so every VM VLAN needs either a local DHCP server or a DHCP relay (IP helper) on its gateway
- Rapid VM creation and deletion churns leases; scope sizes and lease times must match the churn rate or pools exhaust
- Cloned VMs that share a MAC address or client identifier will contend for the same lease, causing address conflicts

Platform-Specific DNS/DHCP Behavior

OVE: CoreDNS handles cluster-internal resolution. For VMs on secondary networks (via Multus/bridge CNI), DNS and DHCP are typically provided by the organization's existing infrastructure. OVE does not inject itself into the tenant DNS path for secondary networks. The kubevirt network binding can optionally configure cloud-init to set static IPs and DNS servers at VM boot time.

Azure Local: Relies heavily on Active Directory DNS. Cluster nodes must be domain-joined. DHCP for VMs can be served by Windows DHCP Server roles, by the SDN's built-in DHCP, or by external DHCP infrastructure.

Swisscom ESC: DNS and DHCP are managed services provided by Swisscom within the tenant. The customer configures their zones and scopes through the ESC portal or API, but does not operate the underlying infrastructure.

Relationship to Other Topics


6. SDN (Software-Defined Networking)

What It Is and Why It Exists

Software-Defined Networking separates the control plane (the logic that decides where traffic should go) from the data plane (the hardware and software that actually forwards packets). In traditional networking, every switch and router contains both planes -- each device independently runs routing protocols, maintains its own forwarding tables, and makes its own decisions. SDN centralizes the control plane into a software controller that programs the forwarding tables of all switches from a single point of authority.

Traditional Networking:                   Software-Defined Networking:

+--------+  +--------+  +--------+       +-----------------------+
|Switch 1|  |Switch 2|  |Switch 3|       |    SDN Controller     |
| Control|  | Control|  | Control|       | (centralized brain)   |
|  Plane |  |  Plane |  |  Plane |       +-----------+-----------+
|  Data  |  |  Data  |  |  Data  |                   |
|  Plane |  |  Plane |  |  Plane |        Southbound API (OpenFlow,
+--------+  +--------+  +--------+        OVSDB, gRPC, REST...)
                                                     |
   Each device makes independent            +--------+--------+
   forwarding decisions.                    |        |        |
   Configuration is per-device.       +--------+--------+--------+
   Changes require touching           |Switch 1|Switch 2|Switch 3|
   every device individually.         | Data   | Data   | Data   |
                                      | Plane  | Plane  | Plane  |
                                      | only   | only   | only   |
                                      +--------+--------+--------+

                                      Controller pushes forwarding
                                      rules to all switches.
                                      Configuration is centralized.
                                      Changes are applied globally.

Why SDN Matters for Virtualization

In a data center running 5,000+ VMs, the network must support:

- Creating, changing, and deleting thousands of isolated network segments via API, without per-switch configuration
- Network policy (ACLs, QoS) that follows a VM when it live-migrates to another host
- Tenant segment counts far beyond the 4,094-VLAN ceiling of the physical fabric
- Self-service networking for tenants without granting them access to physical switches

SDN Architecture: Control Plane, Data Plane, Management Plane

+---------------------------------------------------------------+
|                     Management Plane                          |
|  (UI, CLI, API for human and automation interaction)          |
|  Examples: OpenShift Console, Azure Portal, vCenter           |
+----------------------------+----------------------------------+
                             |
                    Northbound API
                    (REST, gRPC)
                             |
+----------------------------v----------------------------------+
|                      Control Plane                            |
|  (Centralized logic: topology, routing, policy, ACLs)        |
|                                                               |
|  OVE:         OVN Northbound DB + ovn-northd                 |
|  Azure Local: Network Controller (Windows SDN)                |
|  VMware:      NSX Manager                                     |
|  ESC:         NSX Manager (Swisscom-operated)                 |
+----------------------------+----------------------------------+
                             |
                    Southbound API
                    (OVSDB, OpenFlow, gRPC)
                             |
+----------------------------v----------------------------------+
|                       Data Plane                              |
|  (Packet forwarding, encap/decap, ACL enforcement)           |
|                                                               |
|  OVE:         OVS (Open vSwitch) on every node               |
|  Azure Local: Hyper-V Virtual Filtering Platform (VFP)        |
|  VMware:      NSX-T Distributed Switch (N-VDS)                |
|  ESC:         NSX-T (same as VMware, Swisscom-managed)        |
+---------------------------------------------------------------+

OVN/OVS: The SDN Stack in OVE

OVE uses OVN (Open Virtual Network) as its SDN controller and OVS (Open vSwitch) as its data plane. This is the most important SDN stack for our evaluation because OVE is the shortlist favorite.

OVN Architecture in an OVE Cluster:

  +---------------------------+
  | OVN Northbound Database   |  <-- Logical network topology
  | (logical switches,        |      (what the admin sees)
  |  logical routers,         |
  |  ACLs, NAT rules)         |
  +-------------+-------------+
                |
          ovn-northd            <-- Translates logical topology
                |                   to physical flow rules
                v
  +---------------------------+
  | OVN Southbound Database   |  <-- Physical bindings
  | (chassis table,           |      (which VM is on which host,
  |  port bindings,           |       which tunnel to use)
  |  datapath flows)          |
  +-------------+-------------+
                |
       ovn-controller           <-- Runs on every node
       (per node)                   Reads Southbound DB,
                |                   programs local OVS
                v
  +---------------------------+
  | Open vSwitch (OVS)        |  <-- Data plane on every node
  | (br-int bridge)           |      Forwards packets according
  |                           |      to OpenFlow rules from
  | - Encap/decap GENEVE      |      ovn-controller
| - ACL enforcement         |
| - ARP proxy               |
| - DHCP responder          |
  +---------------------------+

Key OVS/OVN commands for troubleshooting on an OVE node:

# List OVS bridges
ovs-vsctl list-br
# Output: br-int, br-ex

# Show ports on br-int (the integration bridge where all VMs connect)
ovs-vsctl list-ports br-int

# Dump OpenFlow rules (the actual forwarding logic)
ovs-ofctl dump-flows br-int

# Show OVN logical switches (the "virtual switches" tenants see)
ovn-nbctl ls-list

# Show OVN logical routers
ovn-nbctl lr-list

# Show port bindings (which VM port is on which chassis/node)
ovn-sbctl show

# Trace a packet path through OVN (extremely useful for debugging)
ovn-trace <logical-switch> 'inport == "vm-port" && eth.src == ... && ip4.src == ...'

Microsoft SDN: The Stack in Azure Local

Azure Local uses the Microsoft SDN stack with these components:

Azure Local SDN Architecture:

  +---------------------------+
  | Network Controller        |  <-- REST API, stores network
  | (3-node cluster)          |      policy and topology
  +-------------+-------------+
                |
          WCF / REST             <-- Southbound communication
                |
                v
  +---------------------------+
  | Hyper-V Virtual Switch    |  <-- Data plane on every node
  | + VFP (Virtual Filtering  |
  |   Platform extension)     |
  |                           |
  | - VXLAN encap/decap       |
| - ACL enforcement         |
| - SLB DIP/VIP translation |
| - Metering                |
  +---------------------------+

Network ATC (used since Azure Local 23H2) simplifies configuration by allowing administrators to declare "intents" instead of manually configuring each component:

# Declare intents -- Network ATC configures bonding, VLANs, QoS automatically
Add-NetIntent -Name "ConvergedIntent" `
  -Management -Compute -Storage `
  -AdapterName "pNIC1", "pNIC2"

Relationship to Other Topics


7. BGP (Border Gateway Protocol)

What It Is and Why It Exists

BGP is the routing protocol that holds the internet together. Every autonomous system (AS) on the internet uses BGP to announce its IP prefixes to its neighbors and to learn about prefixes from other autonomous systems. BGP is a path-vector protocol -- it selects routes based on a list of attributes, not just a simple metric like hop count or link cost.

Traditionally, BGP was only used at the network edge (connecting to ISPs). In modern data centers, BGP has become the preferred routing protocol inside the data center fabric, replacing OSPF and Spanning Tree for several reasons:

- It scales to very large numbers of routes and peers without flooding the whole fabric on every change
- Its policy controls (communities, local preference, filters) are far richer than a single IGP metric
- One protocol serves both the fabric interior and the external edge
- An eBGP session failure affects a single link, instead of triggering an area-wide SPF recalculation

How BGP Works

BGP operates over TCP (port 179). Two BGP routers ("peers" or "neighbors") establish a TCP session and exchange routing information in the form of UPDATE messages.

BGP Peering Establishment (Finite State Machine):

  Idle --> Connect --> OpenSent --> OpenConfirm --> Established
    |         |            |             |              |
    |    TCP SYN/ACK   OPEN msg      KEEPALIVE     UPDATE msgs
    |    attempt       exchanged     exchanged      exchanged
    |                  (AS number,                  (route
    |                   hold time,                   announcements
    |                   router ID)                   & withdrawals)
    |
    +-- Back to Idle on any fatal error

BGP UPDATE message structure (simplified):

+--------------------------------------------------+
| Withdrawn Routes Length (2 bytes)                 |
| Withdrawn Routes (variable)                      |
|   - List of prefixes being withdrawn             |
+--------------------------------------------------+
| Total Path Attribute Length (2 bytes)             |
| Path Attributes (variable)                       |
|   - ORIGIN (IGP, EGP, or Incomplete)             |
|   - AS_PATH (list of AS numbers traversed)        |
|   - NEXT_HOP (IP of next router)                 |
|   - LOCAL_PREF (preference within an AS)          |
|   - MED (Multi-Exit Discriminator)                |
|   - COMMUNITY (tagging for policy)                |
+--------------------------------------------------+
| NLRI (Network Layer Reachability Information)     |
|   - List of prefixes being announced              |
|   e.g., 10.20.0.0/16, 192.168.100.0/24          |
+--------------------------------------------------+

BGP best path selection algorithm (simplified, in order of priority):

  1. Highest LOCAL_PREF (prefer paths marked as preferred within our AS)
  2. Shortest AS_PATH (fewer AS hops = closer)
  3. Lowest ORIGIN type (IGP < EGP < Incomplete)
  4. Lowest MED (prefer the entrance point the remote AS suggests)
  5. eBGP over iBGP (prefer externally learned routes)
  6. Lowest IGP metric to NEXT_HOP
  7. Oldest route (stability preference)
  8. Lowest router ID (tiebreaker)
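The ordering matters: a longer AS_PATH still wins if its LOCAL_PREF is higher. A toy sketch of just the first two tie-breaks (not a real BGP implementation; the route attributes are invented):

```shell
# Candidate A: via spine-1, LOCAL_PREF 200, AS_PATH length 3
# Candidate B: via spine-2, LOCAL_PREF 100, AS_PATH length 2
a_lp=200; a_aspath=3
b_lp=100; b_aspath=2

if   [ "$a_lp" -gt "$b_lp" ];         then best="via spine-1"
elif [ "$b_lp" -gt "$a_lp" ];         then best="via spine-2"
elif [ "$a_aspath" -lt "$b_aspath" ]; then best="via spine-1"
else                                       best="via spine-2"
fi
echo "best path: $best"   # LOCAL_PREF wins despite B's shorter AS_PATH
```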

BGP in a Data Center Spine-Leaf Fabric

BGP Peering in Spine-Leaf (eBGP design):

  Each device runs its own AS number:

       AS 65000              AS 65001
    +-----------+         +-----------+
    | Spine-1   |         | Spine-2   |
    | BGP RR/   |         | BGP RR/   |
    | Transit   |         | Transit   |
    +--+--+--+--+         +--+--+--+--+
       |  |  |               |  |  |
       |  |  +-------+-------+  |  |
       |  |          |          |  |
       |  +----+-----+-----+---+  |
       |       |     |     |      |
    AS 65010  AS 65011  AS 65012  AS 65013
    +------+  +------+  +------+  +------+
    |Leaf-1|  |Leaf-2|  |Leaf-3|  |Leaf-4|
    +------+  +------+  +------+  +------+
       |         |         |         |
    Hosts     Hosts     Hosts     Hosts

  - Every leaf peers with every spine (full mesh at L3)
  - Each leaf announces its connected subnets
  - Spines re-advertise routes between leaves (plain eBGP, no route reflectors needed)
  - ECMP across all spines for any leaf-to-leaf path
  - No Spanning Tree, no Layer-2 loops, no broadcast storms

eBGP vs. iBGP in the data center: large fabrics overwhelmingly use eBGP with one private AS per device (the RFC 7938 design) -- no route reflectors, no full mesh, and AS_PATH provides built-in loop prevention. iBGP designs (one AS for the whole fabric) need route reflectors on the spines and appear mostly where the fabric must integrate with an existing iBGP domain.

BGP in the Candidate Platforms

OVE (MetalLB + OVN): MetalLB is the bare-metal load balancer for Kubernetes. It uses BGP to announce Service VIPs (Virtual IPs) to the physical network:

MetalLB BGP peering in OVE:

  Physical Network          OVE Cluster
  +-----------+
  | Leaf      |  eBGP       +------------+
  | Switch    |<----------->| MetalLB    |
  | (peer)    |  AS 65010   | Speaker    |
  +-----------+  AS 65020   | (on node)  |
       |                    +------------+
       |
       +-- Learns: "10.30.0.100/32 via <node-IP>"
           (the Service VIP is reachable through this node)

  When a Kubernetes Service of type LoadBalancer is created:
    1. MetalLB assigns a VIP from a configured pool
    2. MetalLB Speaker on the node establishes BGP session with leaf switch
    3. Speaker announces the VIP as a /32 route
    4. Leaf switch installs the route and forwards traffic to the node
    5. If the node fails, the Speaker on another node takes over
       and announces the VIP from the new location
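The peering shown above is declared through MetalLB's Kubernetes CRDs. The manifest below is an illustrative sketch: the address pool, peer address, and resource names are placeholders chosen to match the diagram, not values from any real deployment.

```yaml
# Illustrative MetalLB configuration (metallb.io CRDs).
# Pool range, peer address, and names are placeholders.
apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: service-vips
  namespace: metallb-system
spec:
  addresses:
    - 10.30.0.100-10.30.0.200      # VIP pool for LoadBalancer Services
---
apiVersion: metallb.io/v1beta2
kind: BGPPeer
metadata:
  name: leaf-switch
  namespace: metallb-system
spec:
  myASN: 65020                     # cluster-side AS (from diagram)
  peerASN: 65010                   # leaf switch AS (from diagram)
  peerAddress: 10.20.0.1           # placeholder leaf peer IP
---
apiVersion: metallb.io/v1beta1
kind: BGPAdvertisement
metadata:
  name: announce-vips
  namespace: metallb-system
spec:
  ipAddressPools:
    - service-vips                 # announce VIPs from this pool as /32s
```

Splitting pool, peer, and advertisement into separate resources is what lets operators control which prefixes are announced to which peers.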

Azure Local: The RAS Gateway in Microsoft SDN uses BGP to peer with the physical network for multi-tenant routing and VPN termination. Network Controller manages the BGP configuration.

VMware / ESC: NSX-T uses BGP peering between Tier-0 gateways and the physical fabric for north-south routing. The Tier-0 gateway runs as a VM or on bare-metal Edge nodes.

Practical Impact at Scale

Relationship to Other Topics


8. IPv4 / IPv6 Dual-Stack

What It Is and Why It Exists

IPv4 uses 32-bit addresses (e.g., 192.168.1.1), providing approximately 4.3 billion unique addresses. The IANA free pool was exhausted in 2011, and the regional registries have since run out of general allocations as well. IPv6 uses 128-bit addresses (e.g., 2001:0db8:85a3::8a2e:0370:7334), providing 3.4 x 10^38 unique addresses -- effectively unlimited.

Dual-stack means running both IPv4 and IPv6 simultaneously on every interface, switch, router, and service in the infrastructure. This is the industry-recommended transition strategy because it avoids the complexity of protocol translation (NAT64/DNS64) while allowing gradual migration.

IPv4 vs. IPv6 Header Comparison

IPv4 Header (20-60 bytes, variable):
+-------+------+----------+----------------+
| Ver=4 | IHL  | DSCP/ECN | Total Length   |
| 4 bit | 4 bit| 8 bits   | 16 bits        |
+-------+------+----------+----------------+
| Identification           | Flags|Frag Off|
| 16 bits                  | 3b   | 13 bits|
+-------+------+----------+----------------+
| TTL   |Protocol| Header Checksum         |
| 8 bits| 8 bits | 16 bits                 |
+-------+--------+-------------------------+
| Source IPv4 Address (32 bits)             |
+------------------------------------------+
| Destination IPv4 Address (32 bits)        |
+------------------------------------------+
| Options (variable, 0-40 bytes)            |
+------------------------------------------+

IPv6 Header (40 bytes, fixed):
+-------+----------+-----------------------+
| Ver=6 | Traffic  | Flow Label            |
| 4 bit | Class 8b | 20 bits               |
+-------+----------+-----------------------+
| Payload Length | Next Header | Hop Limit |
| 16 bits        | 8 bits      | 8 bits    |
+----------------+-------------+-----------+
|                                          |
| Source IPv6 Address (128 bits)            |
|                                          |
|                                          |
+------------------------------------------+
|                                          |
| Destination IPv6 Address (128 bits)       |
|                                          |
|                                          |
+------------------------------------------+

Key differences:
- IPv6 has NO header checksum (saves processing on every hop)
- IPv6 has NO fragmentation fields in the base header
  (fragmentation is handled by extension headers, only at the source)
- IPv6 flow label enables per-flow handling without deep packet inspection
- IPv6 minimum MTU is 1280 (vs. 68 for IPv4)

Address Architecture in Dual-Stack Data Centers

Dual-Stack Interface Configuration (Linux):

  $ ip addr show ens1f0
  2: ens1f0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000
      inet 10.20.100.5/24 brd 10.20.100.255 scope global ens1f0
      inet6 2001:db8:100::5/64 scope global
      inet6 fe80::a00:27ff:fe4e:66a1/64 scope link

  Three addresses on one interface:
    10.20.100.5         -- IPv4 (manually or DHCP assigned)
    2001:db8:100::5     -- IPv6 global unicast (SLAAC or DHCPv6)
    fe80::...           -- IPv6 link-local (auto-generated, always present)

IPv6 address types relevant for data center operations:

  Type                Prefix      Scope              Use
  ------------------  ----------  -----------------  -----------------------------------------
  Link-local          fe80::/10   Single link        Always present; neighbor discovery and
                                                     routing-protocol peering
  Unique Local (ULA)  fd00::/8    Organization-wide  Analogous to RFC 1918 private addresses;
                                                     not routable on the internet
  Global Unicast      2000::/3    Internet           Globally routable; assigned by RIR or ISP
  Multicast           ff00::/8    Variable           Replaces broadcast (IPv6 has no broadcast)

SLAAC (Stateless Address Autoconfiguration): IPv6 can self-assign addresses without a DHCP server. The router sends Router Advertisement (RA) messages containing the network prefix; the host appends its own interface identifier -- derived from the MAC address (modified EUI-64), generated as a stable opaque value per RFC 7217, or rotated as a temporary privacy address per RFC 4941 -- to form a full address. This eliminates the need for DHCPv6 in many scenarios but complicates IP address tracking in regulated environments.
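The classic modified EUI-64 derivation can be sketched in a few lines. This is an illustrative example (the prefix and MAC are sample values chosen to match the interface listing above), not code from any platform:

```python
import ipaddress

def slaac_eui64(prefix: str, mac: str) -> ipaddress.IPv6Address:
    """Form a SLAAC address from a /64 prefix and a MAC (modified EUI-64).

    Steps: split the 48-bit MAC in half, insert ff:fe in the middle,
    and flip the universal/local bit (0x02) in the first octet.
    """
    octets = [int(b, 16) for b in mac.split(":")]
    octets[0] ^= 0x02                           # flip the U/L bit
    iid = octets[:3] + [0xFF, 0xFE] + octets[3:]
    net = ipaddress.IPv6Network(prefix)
    return net[int.from_bytes(bytes(iid), "big")]

# Sample MAC matching the fe80::a00:27ff:fe4e:66a1 link-local shown earlier:
print(slaac_eui64("fe80::/64", "08:00:27:4e:66:a1"))  # fe80::a00:27ff:fe4e:66a1
```

The ff:fe marker in the middle of the interface identifier is the telltale sign of an EUI-64 address; RFC 7217 and RFC 4941 addresses deliberately avoid it so the MAC cannot be recovered from the address.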

Dual-Stack Implications for 5,000+ VMs

Address planning:

IPv4 address planning for 5,000 VMs:
  Using /24 subnets: 5000 / 254 hosts = ~20 subnets needed
  Using /22 subnets: 5000 / 1022 hosts = ~5 subnets needed
  Total IPv4 addresses: 5,000 (tight, requires careful management)

IPv6 address planning for 5,000 VMs:
  A single /64 subnet provides 2^64 = 18,446,744,073,709,551,616 addresses
  One /64 per VLAN is the standard practice
  Address exhaustion is not a concern
  But: address tracking requires IPAM tooling (SLAAC makes addresses dynamic)

Firewall rules: Every firewall rule, network policy, and ACL must exist in both IPv4 and IPv6 versions. A common security mistake is implementing strict IPv4 firewall rules while leaving IPv6 wide open because "we don't use IPv6" -- but IPv6 is enabled by default on most modern operating systems, and link-local addresses are always active.

DNS: In dual-stack, DNS queries return both A records (IPv4) and AAAA records (IPv6). The application or OS selects which to use based on the "Happy Eyeballs" algorithm (RFC 8305), which prefers IPv6 but falls back to IPv4 if the IPv6 connection attempt fails within a short timeout. This can cause confusing behavior if IPv6 connectivity is partially broken (e.g., router advertisements are received but the path is blackholed).
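The address-sorting step of Happy Eyeballs can be sketched as follows. This is an illustrative simplification: the full RFC 8305 algorithm also staggers the actual connection attempts (roughly 250 ms apart) rather than just reordering the candidate list.

```python
from itertools import zip_longest
import socket

def interleave_by_family(addrinfos):
    """RFC 8305-style reordering: alternate address families in the
    candidate list, starting with the preferred family (IPv6 here)."""
    v6 = [a for a in addrinfos if a[0] == socket.AF_INET6]
    v4 = [a for a in addrinfos if a[0] == socket.AF_INET]
    out = []
    for pair in zip_longest(v6, v4):
        out.extend(p for p in pair if p is not None)
    return out

# Example: DNS returned two A records and one AAAA record (sample addresses)
infos = [(socket.AF_INET, "192.0.2.1"),
         (socket.AF_INET, "192.0.2.2"),
         (socket.AF_INET6, "2001:db8::1")]
ordered = interleave_by_family(infos)  # IPv6 first, then alternate families
```

Interleaving (rather than trying all IPv6 addresses first) is what keeps the fallback fast when the entire IPv6 path is blackholed.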

Monitoring and logging: All monitoring, logging, and SIEM tools must handle both address families. An alert rule matching 10.20.100.0/24 will miss malicious traffic from the same host if it arrives on the 2001:db8:100::/64 address. NetFlow/sFlow collectors must process both IPv4 and IPv6 flow records.

Dual-Stack in Kubernetes / OVE

Kubernetes has supported dual-stack since version 1.23 (stable). In OVE, dual-stack is configured at cluster installation time:

# OVE cluster network configuration (install-config.yaml excerpt)
networking:
  clusterNetwork:
    - cidr: 10.128.0.0/14           # IPv4 pod CIDR
      hostPrefix: 23
    - cidr: fd01::/48                # IPv6 pod CIDR
      hostPrefix: 64
  serviceNetwork:
    - 172.30.0.0/16                  # IPv4 service CIDR
    - fd02::/112                     # IPv6 service CIDR
  networkType: OVNKubernetes

Each Pod and VM gets both an IPv4 and an IPv6 address. Services can be configured as ipFamilyPolicy: PreferDualStack or RequireDualStack. CoreDNS returns both A and AAAA records for services.
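A dual-stack Service manifest using those fields might look like the following sketch (the Service name, selector, and ports are invented placeholders):

```yaml
# Illustrative dual-stack Service; names and ports are placeholders.
apiVersion: v1
kind: Service
metadata:
  name: web-frontend
spec:
  ipFamilyPolicy: PreferDualStack   # RequireDualStack fails if a family is unavailable
  ipFamilies:
    - IPv6                          # list order sets the primary family
    - IPv4
  selector:
    app: web-frontend
  ports:
    - port: 80
      targetPort: 8080
```

With PreferDualStack the Service receives one ClusterIP per configured family; CoreDNS then serves the matching A and AAAA records.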

Relationship to Other Topics


How the Candidates Handle This

Comparison Table

VLAN management
  VMware (Current): vDS port groups; per-host VLAN trunk config via vCenter
  OVE:              NMState operator + NetworkAttachmentDefinitions (declarative, GitOps-compatible)
  Azure Local:      Network ATC intents + Hyper-V vSwitch (PowerShell/ARM)
  Swisscom ESC:     Managed by Swisscom via NSX; tenant sees logical networks

East-West firewalling
  VMware (Current): NSX Distributed Firewall (per-VM, micro-segmentation)
  OVE:              OVN ACLs via NetworkPolicies and MultiNetworkPolicies
  Azure Local:      Datacenter Firewall via VFP (SDN-managed ACLs)
  Swisscom ESC:     NSX Distributed Firewall (same as VMware, provider-managed)

MTU configuration
  VMware (Current): vDS MTU setting; per-vmnic MTU; per-port-group override
  OVE:              NMState declares MTU per interface/bond/VLAN; cluster-wide consistency enforced by operator
  Azure Local:      Network ATC auto-configures MTU for storage (jumbo) and compute (standard) based on intent
  Swisscom ESC:     Provider-managed; customer has no direct MTU control

Bonding
  VMware (Current): vDS with multiple uplinks; load-based teaming or LACP
  OVE:              Linux bonding via NMState (mode 1, 4, or 6; LACP with xmit_hash_policy configurable)
  Azure Local:      SET (Switch Embedded Teaming) in Hyper-V vSwitch, or LACP via Network ATC
  Swisscom ESC:     Provider-managed; Dell VxBlock standard bonding

DNS (internal)
  VMware (Current): vCenter DNS dependency; VMs use enterprise DNS
  OVE:              CoreDNS for cluster-internal; enterprise DNS for VM secondary networks
  Azure Local:      Active Directory DNS for cluster; Windows DNS or external for VMs
  Swisscom ESC:     Managed DNS within tenant; Swisscom-operated

DHCP
  VMware (Current): Enterprise DHCP for VM networks; NSX can provide DHCP for overlay segments
  OVE:              Optional: OVN can serve DHCP on overlay networks; enterprise DHCP for bridged/secondary networks
  Azure Local:      Windows DHCP Server or SDN built-in DHCP
  Swisscom ESC:     Managed DHCP as part of ESC service

SDN stack
  VMware (Current): NSX-T (proprietary; distributed switching + routing + firewalling + LB)
  OVE:              OVN-Kubernetes (open source; GENEVE overlays, distributed routing, ACLs)
  Azure Local:      Microsoft SDN (Network Controller + VFP + SLB + RAS Gateway; VXLAN overlays)
  Swisscom ESC:     NSX-T (same as VMware baseline, fully managed by Swisscom)

SDN overlay protocol
  VMware (Current): GENEVE (NSX-T 3.x+) or VXLAN (NSX-T 2.x)
  OVE:              GENEVE (OVN default)
  Azure Local:      VXLAN (Microsoft SDN)
  Swisscom ESC:     GENEVE / VXLAN depending on NSX version

BGP integration
  VMware (Current): NSX Tier-0 Gateway peers with physical fabric via BGP
  OVE:              MetalLB (BGP mode) for Service VIPs; OVN gateway router can peer via BGP
  Azure Local:      RAS Gateway uses BGP for multi-tenant routing; SDN integrates with ToR switches
  Swisscom ESC:     NSX Tier-0 BGP peering (provider-managed)

IPv4/IPv6 dual-stack
  VMware (Current): NSX-T and vDS both support dual-stack
  OVE:              Native dual-stack since OCP 4.12+ (OVN-Kubernetes)
  Azure Local:      Native dual-stack (Hyper-V + SDN)
  Swisscom ESC:     Supported (NSX-T + VMware)

Max VLAN segments
  VMware (Current): 4,094 (physical) + millions via NSX overlay
  OVE:              4,094 (physical) + 16M via GENEVE overlay
  Azure Local:      4,094 (physical) + 16M via VXLAN overlay
  Swisscom ESC:     Provider-managed; transparent to customer

Network config as code
  VMware (Current): Limited (vCenter API, PowerCLI); NSX has REST API but no native GitOps
  OVE:              Full: NMState CRDs, NetworkAttachmentDefinitions, NetworkPolicies -- all YAML, all GitOps-compatible
  Azure Local:      ARM Templates, Bicep, Terraform; Network ATC intents via PowerShell
  Swisscom ESC:     ESC API exists but limited IaC maturity; Terraform provider maturity unclear

Troubleshooting tools
  VMware (Current): NSX UI, packet tracing, traceflow
  OVE:              ovs-ofctl, ovn-trace, ovn-nbctl, ovn-sbctl, tcpdump, Network Observability Operator
  Azure Local:      Test-NetConnection, Get-VMNetworkAdapter, Network Controller diagnostics, WAC
  Swisscom ESC:     Ticket to Swisscom; no direct platform access

Key Differences in Prose

SDN maturity and control: The most significant difference between the candidates is the degree of control over the SDN layer. OVE exposes the full OVN/OVS stack -- operators can inspect OpenFlow tables, trace packets through the overlay, and debug connectivity issues at the flow level. Azure Local abstracts this behind Network ATC and the Network Controller API, offering less visibility but simpler configuration. Swisscom ESC provides no SDN control at all -- the entire networking stack is a black box operated by the provider. For an organization that wants to deeply understand and optimize its network (as is necessary at 5,000+ VMs), OVE offers the most transparency, Azure Local offers a middle ground, and ESC offers the least.

VLAN and overlay model: OVE and Azure Local both use overlay networks as the default for inter-VM traffic, with bridge/trunk interfaces for VMs that need direct access to physical VLANs (e.g., legacy applications with hardcoded VLAN dependencies). The critical difference is that OVE uses Multus + bridge CNI to attach VMs to physical VLANs as secondary interfaces, which requires explicit NetworkAttachmentDefinition resources -- a new concept for teams coming from VMware. Azure Local uses Hyper-V vSwitch port profiles, which are conceptually closer to VMware port groups. ESC mirrors the current VMware model exactly (NSX + vDS port groups).

Bonding and NIC configuration: OVE's NMState operator is the most "infrastructure-as-code" approach to NIC configuration. Bonding modes, VLANs, and MTU are declared as Kubernetes CRDs and reconciled automatically. Azure Local's Network ATC is intent-based (you declare "I want management, compute, and storage on these NICs") and the system auto-configures -- convenient but less transparent when debugging. VMware vDS teaming policies are configured per port group in vCenter. ESC is provider-managed.

BGP integration depth: OVE with MetalLB requires explicit BGP peering configuration between the cluster and the physical fabric. This means the network team must understand BGP and must coordinate AS numbers, peer IPs, and route filters with the Kubernetes platform team. Azure Local's BGP integration is handled by the SDN Network Controller and is less exposed to the administrator. ESC's BGP peering is entirely Swisscom's responsibility. For an organization with strong network engineering skills, OVE's model is preferred because it provides full control. For an organization that wants networking to "just work," Azure Local or ESC may be more appropriate.

Dual-stack readiness: All three candidates support IPv4/IPv6 dual-stack. OVE's dual-stack is a cluster-wide install-time decision and is well-documented for OVN-Kubernetes. Azure Local's dual-stack works at the Hyper-V and SDN level. ESC's dual-stack support depends on the specific NSX version Swisscom has deployed. The key evaluation question is not whether dual-stack is "supported" but whether the operational tooling (monitoring, logging, firewall policies, IPAM) handles both address families equally well.


Key Takeaways


Discussion Guide

The following questions are designed to reveal the depth of a vendor's or SME's understanding of their networking stack. They should be asked during vendor workshops, PoC planning sessions, and architecture reviews.

1. VLAN and Overlay Architecture

"Walk us through the packet path when VM-A on Node-1 in VLAN 100 sends a TCP packet to VM-B on Node-7 in VLAN 200. Include the encapsulation and decapsulation steps, the routing decision point, and the exact headers present on the wire at each hop. Where does the 802.1Q tag get stripped and the overlay header get added?"

Purpose: Tests whether the vendor understands the interaction between physical VLANs and overlay encapsulation. A weak answer will describe it abstractly ("the SDN handles it"); a strong answer will name the specific components (OVS br-int, GENEVE tunnel, OVN logical router) and describe the header transformations.

2. MTU and Fragmentation

"Our physical switches are configured with MTU 9216. Our current VMs expect MTU 1500. After migration to your platform, what is the effective MTU inside the VM on an overlay network? What happens if a VM sends a 1500-byte packet and the overlay encapsulation pushes it above the physical MTU? Have you encountered Path MTU Discovery black holes in production deployments, and how did you diagnose them?"

Purpose: Tests MTU planning discipline. The correct answer includes the specific overhead (50-74 bytes for GENEVE/VXLAN), the resulting inner MTU, and the PMTUD failure mode. A vendor who says "it just works" has not deployed at scale.

3. Bonding and NIC Failover

"We plan to use LACP bonding with two 25 GbE NICs per host. Describe the hash algorithm your platform uses to distribute VM traffic across the two links. How do you verify that traffic is actually balanced? What happens during the 3-second LACP convergence window if a single link fails -- do VMs experience packet loss? How does this interact with your overlay tunnel endpoints?"

Purpose: Tests understanding of bonding internals. The answer should reference xmit_hash_policy, how to measure per-link utilization (e.g., ethtool -S, ip -s link), and the relationship between LACP failover and overlay tunnel re-establishment.

4. East-West Micro-Segmentation at Scale

"We have 5,000 VMs across approximately 200 application groups. Each application group needs firewall rules allowing traffic within the group but blocking traffic between groups. How does your platform implement these rules? Where in the data path are they enforced? What is the performance overhead per packet? How many rules can a single host support before performance degrades? How do we audit that the rules are actually enforced?"

Purpose: Tests micro-segmentation scalability. The answer should describe distributed enforcement (at the vSwitch, not a central appliance), the specific mechanism (OVN ACLs, VFP rules, NSX DFW rules), and the performance impact. Ask for specific numbers -- "we tested 10,000 rules per host with less than 2% throughput degradation" is better than "it scales well."

5. BGP and Physical Fabric Integration

"Our data center runs a spine-leaf fabric with eBGP (one AS per leaf). How does your platform announce service VIPs and VM network routes to the leaf switches? What AS number does the platform use? How do we control which prefixes are announced and which are filtered? If a node hosting a VIP fails, how quickly does the BGP route withdrawal propagate, and what is the resulting traffic disruption?"

Purpose: Tests BGP integration maturity. The answer should include specific convergence times (with and without BFD), route filtering mechanisms (prefix lists, communities), and the interaction between the platform's internal routing and the external BGP peering. For OVE, expect discussion of MetalLB BGP configuration and FRR (Free Range Routing) under the hood.

6. DNS/DHCP Integration and IPAM

"When a new VM is provisioned, how does it get its IP address and DNS record? Is this integrated with our existing Infoblox IPAM, or does the platform maintain its own address pool? What happens to the DNS record when the VM is live-migrated to a different node? What happens when the VM is deleted -- is the DNS record cleaned up automatically, or is it orphaned?"

Purpose: Tests operational maturity around IP lifecycle management. A platform that provisions VMs without cleaning up DNS/DHCP records will accumulate stale entries over time, leading to IP conflicts and resolution failures. Look for answers that describe explicit integration with enterprise IPAM (Infoblox, BlueCat) rather than "we have our own DHCP."

7. SDN Troubleshooting Under Pressure

"A production VM loses network connectivity at 2:00 AM. Your SDN overlay is healthy at the control plane level -- no errors in the controller. The physical network shows no link failures. Walk us through your step-by-step troubleshooting methodology. What tools do you use? How do you determine whether the problem is in the overlay, the underlay, the VM's guest OS, or a firewall rule? How long does this diagnosis typically take?"

Purpose: Tests real-world troubleshooting capability. For OVE, expect references to ovs-ofctl dump-flows, ovn-trace, tcpdump on the node, and checking ovn-sbctl port bindings. For Azure Local, expect Test-NetConnection, Hyper-V network adapter diagnostics, and Network Controller event logs. For ESC, the honest answer is "we open a ticket with Swisscom" -- which is a valid model but must be evaluated against the required incident response time.

8. Dual-Stack Operational Readiness

"We currently run IPv4-only. Your platform supports dual-stack. If we enable IPv6 on day one, what changes in our operational model? Specifically: do our existing IPv4 firewall rules automatically apply to IPv6? Do our monitoring dashboards show IPv6 traffic? Does our SIEM correlate events across both address families? Can a VM configured with only IPv4 still communicate with a dual-stack VM?"

Purpose: Tests whether dual-stack is truly production-ready or just "technically supported." The critical failure mode is enabling IPv6 without corresponding security policies, creating an unmonitored path through the infrastructure. A mature answer will describe how the platform ensures policy parity between IPv4 and IPv6.

9. Network Performance Baseline and SLA

"What network latency should we expect for east-west VM-to-VM traffic on the same leaf switch? On different leaf switches (one spine hop)? Through the SDN overlay? How does this compare to our current VMware NSX latency baseline? Do you have published benchmarks, and can we reproduce them in our PoC environment?"

Purpose: Establishes concrete performance expectations. Typical east-west latency in a well-designed overlay network is 50-150 microseconds per hop. If the vendor cannot provide specific numbers or says "it depends," they have not benchmarked their own stack.

10. Day-2 Network Changes Without Downtime

"Six months after go-live, we need to add a new VLAN (VLAN 500) to all 100 worker nodes for a new application tier. Describe the process. Can this be done without draining VMs or rebooting nodes? How do we verify that the VLAN is correctly trunked on all physical switch ports and correctly configured on all host interfaces before connecting any VMs to it?"

Purpose: Tests Day-2 operational maturity. For OVE, expect a description of creating a NodeNetworkConfigurationPolicy via NMState, which rolls out the VLAN configuration across all nodes automatically with validation and rollback. For Azure Local, expect Network ATC intent updates. For ESC, expect a Swisscom change request with an SLA. The key differentiator is whether the change is self-service and automated or requires a ticket and a maintenance window.