Modern datacenters and beyond

Physical Connectivity & Redundancy

Why This Matters

Every virtualization platform ultimately depends on physical cables, NICs, and switches. A software-defined overlay can abstract away VLANs, routing, and firewall rules, but it cannot abstract away a severed fiber, a failed switch ASIC, or a misconfigured link aggregation group. When 5,000+ VMs share a physical fabric, a single link failure must not cause an outage -- and a single link must not become a bottleneck. The physical connectivity layer determines how much bandwidth is available, how quickly the network recovers from component failures, and whether traffic is evenly distributed or silently concentrated on a single path.

This document covers four technologies that together form the physical redundancy and bandwidth foundation of a modern data center:

  1. LACP -- how servers and switches negotiate link bundles for bandwidth and redundancy.
  2. LLDP -- how devices discover their physical neighbors for topology mapping and troubleshooting.
  3. SMLT / MLAG / MC-LAG -- how two physical switches present themselves as a single logical switch to eliminate Spanning Tree and provide switch-level redundancy.
  4. ECMP -- how Layer-3 routing distributes traffic across multiple equal-cost paths in a spine-leaf fabric.

These four technologies interact directly with every platform decision. OVE uses Linux bonding (mode 4 / 802.3ad) managed by NMState. Azure Local uses Switch Embedded Teaming (SET) managed by Network ATC. Both platforms expect a multi-chassis LAG (MLAG/SMLT/VPC) pair at the top-of-rack layer. The spine-leaf fabric underneath uses ECMP with BGP to distribute east-west traffic. Getting any of these wrong -- a mismatched LACP key, a broken MLAG peer link, ECMP polarization on the spine tier -- degrades performance or availability for thousands of VMs simultaneously.

The organization must understand these technologies deeply enough to evaluate switch vendor proposals, validate PoC lab configurations, and design a production fabric that serves the next 5-10 years.


Concepts

1. LACP (Link Aggregation Control Protocol)

What It Is and Why It Exists

LACP is defined by IEEE 802.3ad (originally) and IEEE 802.1AX (current standard). It is a control protocol that allows two devices -- typically a server and a switch, or two switches -- to dynamically negotiate the formation of a Link Aggregation Group (LAG). A LAG combines multiple physical links into a single logical interface, providing both increased bandwidth (aggregate throughput of all member links) and redundancy (surviving links continue operating if one fails).

Without LACP, link aggregation requires static configuration on both ends. If one end is configured for a LAG and the other is not (or is configured differently), the result can be loops, black-holed traffic, or silent packet loss. LACP provides a negotiation mechanism that prevents these misconfigurations by requiring both ends to explicitly agree on the LAG membership before forwarding traffic.

LACPDU Frame Format

LACP operates by exchanging LACPDU (Link Aggregation Control Protocol Data Unit) frames between the two ends of a link. LACPDUs are sent to the well-known multicast MAC address 01:80:C2:00:00:02 with EtherType 0x8809 (Slow Protocols). They are never forwarded by switches -- they are processed locally on the port where they arrive.

LACPDU Frame Structure:
+------------------------------------------------------------------+
| Destination MAC: 01:80:C2:00:00:02  (Slow Protocols multicast)   |
| Source MAC:      (sending port's MAC)                             |
| EtherType:       0x8809 (Slow Protocols)                         |
| Subtype:         0x01 (LACP)                                     |
+------------------------------------------------------------------+
| Version: 0x01                                                    |
+------------------------------------------------------------------+
|                     ACTOR INFORMATION                             |
| +--------------------------------------------------------------+ |
| | TLV Type:        0x01 (Actor Information)                     | |
| | Length:           20 bytes                                     | |
| | System Priority: 2 bytes (0-65535, lower = higher priority)   | |
| | System ID:       6 bytes (MAC address of the system)          | |
| | Key:             2 bytes (groups ports that CAN aggregate)    | |
| | Port Priority:   2 bytes (0-65535, lower = higher priority)   | |
| | Port Number:     2 bytes (unique port identifier)             | |
| | State:           1 byte (8 flags, see below)                  | |
| | Reserved:        3 bytes                                      | |
| +--------------------------------------------------------------+ |
+------------------------------------------------------------------+
|                    PARTNER INFORMATION                            |
| +--------------------------------------------------------------+ |
| | TLV Type:        0x02 (Partner Information)                   | |
| | Length:           20 bytes                                     | |
| | (Same fields as Actor, but reflecting what this end           | |
| |  has learned about the remote end from received LACPDUs)     | |
| +--------------------------------------------------------------+ |
+------------------------------------------------------------------+
|                   COLLECTOR INFORMATION                           |
| +--------------------------------------------------------------+ |
| | TLV Type:        0x03                                         | |
| | Length:           16 bytes                                     | |
| | Max Delay:       2 bytes (max collection delay in 10us units) | |
| +--------------------------------------------------------------+ |
+------------------------------------------------------------------+
| Terminator TLV:  Type=0x00, Length=0                             |
| Padding:         to reach 110 bytes minimum                      |
+------------------------------------------------------------------+

State Flags (1 byte, 8 bits):
  Bit 0: LACP_Activity   (1=Active, 0=Passive)
  Bit 1: LACP_Timeout     (1=Short timeout/1s, 0=Long timeout/30s)
  Bit 2: Aggregation      (1=port can aggregate, 0=individual)
  Bit 3: Synchronization  (1=port is in sync with partner)
  Bit 4: Collecting       (1=port is accepting incoming frames)
  Bit 5: Distributing     (1=port is sending outgoing frames)
  Bit 6: Defaulted        (1=using default partner info, no LACPDU received)
  Bit 7: Expired          (1=partner info has expired)

Actor and Partner State Machines

Each end of an LACP-negotiated link maintains two views:

  1. Actor information -- its own system priority, system ID, key, port number, and state flags.
  2. Partner information -- what it has learned about the remote end from received LACPDUs.

When both ends agree -- the actor's information matches the partner's expectation, and vice versa -- the link is placed into the "Collecting and Distributing" state, and traffic flows.

LACP Negotiation Sequence (Server <-> Switch):

  Server (bond0, mode 4)              ToR Switch (LAG/Port-Channel)
  ens1f0  ens1f1                       Eth1/1  Eth1/2
    |       |                            |       |
    |---LACPDU (Actor: Sys=ServerMAC,------>|     |
    |    Key=1, Port=1, State=Active)   |       |
    |       |                            |       |
    |       |---LACPDU (Actor: Sys=ServerMAC,--->|
    |       |    Key=1, Port=2, State=Active)   |
    |       |                            |       |
    |<------LACPDU (Actor: Sys=SwitchMAC,---|    |
    |        Key=100, Port=1, State=Active)|    |
    |       |                            |       |
    |       |<---LACPDU (Actor: Sys=SwitchMAC,--|
    |       |    Key=100, Port=2, State=Active) |
    |       |                            |       |
    |  Both ends see matching keys       |       |
    |  and compatible system IDs         |       |
    |       |                            |       |
    |---LACPDU (State: Sync+Collect+Dist)-->|   |
    |       |---LACPDU (State: Sync+Collect+Dist)->|
    |<------LACPDU (State: Sync+Collect+Dist)---|
    |       |<---LACPDU (State: Sync+Collect+Dist)|
    |       |                            |       |
    |  LAG is now active on both ends    |       |
    |  Traffic flows on both links       |       |
    |       |                            |       |

  Time to establish: typically 1-3 seconds with fast LACP rate,
                      up to 90 seconds with slow LACP rate.

Key matching rule: Two ports can aggregate only if they have the same Key on each end. On the switch, all ports in a port-channel share the same key. On the Linux server, the bonding driver assigns the same key to all member interfaces. If a server NIC is accidentally cabled to a switch port that belongs to a different port-channel (different key), LACP will refuse to aggregate it -- this is a safety mechanism that prevents forwarding loops.

Negotiation Modes

LACP supports two negotiation modes per port:

+---------+--------------------------------------+--------------------------+----------------------+
| Mode    | Behavior                             | Sends LACPDUs?           | Responds to LACPDUs? |
+---------+--------------------------------------+--------------------------+----------------------+
| Active  | Actively initiates LACP negotiation  | Yes (periodically)       | Yes                  |
| Passive | Waits for the other end to initiate  | No (until receiving one) | Yes                  |
+---------+--------------------------------------+--------------------------+----------------------+

Interaction matrix:

+-------------+-------------+-----------------------------------------------+
| Server Mode | Switch Mode | Result                                        |
+-------------+-------------+-----------------------------------------------+
| Active      | Active      | LAG forms (both initiate)                     |
| Active      | Passive     | LAG forms (server initiates, switch responds) |
| Passive     | Active      | LAG forms (switch initiates, server responds) |
| Passive     | Passive     | LAG does NOT form (neither initiates)         |
+-------------+-------------+-----------------------------------------------+

Best practice: Configure at least one side as Active. In data center environments, the standard practice is to set both sides to Active. Linux bonding mode 4 defaults to Active LACP. Most enterprise switches also default to Active when LACP is enabled on a port-channel.

LACP Rate (Fast vs. Slow)

+----------------+-----------------+-----------------------+--------------------------------------------+
| Rate           | LACPDU Interval | Timeout (3x interval) | Use Case                                   |
+----------------+-----------------+-----------------------+--------------------------------------------+
| Slow (default) | 30 seconds      | 90 seconds            | Low overhead, acceptable for non-critical  |
|                |                 |                       | links                                      |
| Fast           | 1 second        | 3 seconds             | Sub-5-second failure detection, required   |
|                |                 |                       | for production VM traffic                  |
+----------------+-----------------+-----------------------+--------------------------------------------+

For a 5,000+ VM environment, always use fast LACP rate. A 90-second timeout before a failed link is removed from the LAG means 90 seconds of potential packet loss on flows hashed to that link. With fast rate, failure detection drops to 3 seconds. Combined with BFD on the routing layer, total convergence time after a link failure can be brought under 5 seconds.

Linux bonding configuration for fast LACP:

# Set LACP rate to fast (1-second PDU interval)
ip link set bond0 type bond lacp_rate fast

# Or in NMState (OVE):
# apiVersion: nmstate.io/v1
# kind: NodeNetworkConfigurationPolicy
# spec:
#   desiredState:
#     interfaces:
#       - name: bond0
#         type: bond
#         link-aggregation:
#           mode: 802.3ad
#           options:
#             lacp_rate: fast
#             miimon: '100'

Hash Algorithms for Traffic Distribution

A LAG does not stripe data across links the way RAID-0 stripes data across disks. Instead, it assigns each traffic flow to a specific link based on a hash function. The hash policy determines which fields of the packet header are used to compute the link assignment. This is critically important for VM environments.

Hash Policy Comparison:

+---------------+------------------+---------------------------------------------+
| Hash Policy   | Fields Hashed    | Impact on VM Traffic                        |
+---------------+------------------+---------------------------------------------+
| layer2        | Src MAC,         | POOR for VMs. All VMs sending to the same   |
|               | Dst MAC          | gateway (same dst MAC) hash to ONE link.    |
|               |                  | Result: 100% on link A, 0% on link B.       |
+---------------+------------------+---------------------------------------------+
| layer2+3      | Src MAC,         | BETTER. Different VM IPs distribute traffic |
|               | Dst MAC,         | even with same dst MAC. But no port-level   |
|               | Src IP, Dst IP   | entropy for same src-dst IP pair.           |
+---------------+------------------+---------------------------------------------+
| layer3+4      | Src IP, Dst IP,  | BEST for most VM traffic. Each TCP/UDP flow |
|               | Src Port,        | gets its own hash. 100 VMs with different   |
|               | Dst Port         | src IPs and ports distribute well.           |
+---------------+------------------+---------------------------------------------+
| encap3+4      | Inner Src IP,    | BEST for overlay traffic (VXLAN/GENEVE).    |
|               | Inner Dst IP,    | Hashes on the inner (VM) headers, not the   |
|               | Inner Src Port,  | outer (tunnel) headers. Without this,        |
|               | Inner Dst Port   | overlay traffic between two TEPs hashes     |
|               |                  | to a SINGLE link (same outer src/dst IP).   |
+---------------+------------------+---------------------------------------------+

Illustration -- why layer2 fails for VM traffic:

  100 VMs on Host-A, all sending to Default Gateway (GW MAC: aa:bb:cc:dd:ee:ff)

  layer2 hash = XOR(src_mac, dst_mac) mod 2

  VM-1:  hash(MAC1, GW_MAC) = 0  --> Link A
  VM-2:  hash(MAC2, GW_MAC) = 0  --> Link A      <-- same dst MAC
  VM-3:  hash(MAC3, GW_MAC) = 0  --> Link A          dominates hash
  ...
  VM-100: hash(MAC100, GW_MAC) = 0 --> Link A

  Result: Link A = 100% utilized, Link B = 0% utilized
          Effective bandwidth = 25 Gbps (not 50 Gbps)

  layer3+4 hash = XOR(src_ip, dst_ip, src_port, dst_port) mod 2

  VM-1:  hash(10.0.0.1, 10.1.1.1, 45200, 443) = 0  --> Link A
  VM-2:  hash(10.0.0.2, 10.1.1.1, 38100, 443) = 1  --> Link B
  VM-3:  hash(10.0.0.3, 10.1.1.2, 51000, 8080) = 0 --> Link A
  ...
  Result: ~50/50 distribution across links

The switch side matters too. The server's hash policy controls outgoing traffic distribution. The switch's hash policy controls incoming traffic distribution. Both must be configured for optimal balance. On most enterprise switches, the hash policy for LAG traffic distribution is configurable (e.g., Cisco port-channel load-balance src-dst-ip-port, Arista port-channel load-balance trident fields ip source-port destination-port).
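
Hash-based link selection is easy to reason about with a toy model. The Python sketch below is illustrative only (the bonding driver hashes raw header bytes in kernel C; CRC32 over a string is a stand-in with similar distribution behavior) and demonstrates the two properties that matter: a single flow is pinned to one link, and many distinct flows spread out:

import zlib
from collections import Counter

def pick_link(*fields, links=2):
    # Stand-in for the driver's hash: a checksum over the chosen
    # header fields, reduced modulo the number of active links.
    return zlib.crc32("|".join(map(str, fields)).encode()) % links

# A single bulk flow: every packet carries the same header fields,
# so every packet maps to the same link -- a LAG never splits a flow.
flow = ("10.0.0.1", "10.1.1.1", 45200, 443)
assert len({pick_link(*flow) for _ in range(1000)}) == 1

# Many flows: distinct IPs and ports produce distinct hashes, so a
# layer3+4 policy spreads them roughly evenly across both links.
spread = Counter(
    pick_link(f"10.0.0.{i % 50}", "10.1.1.1", 40000 + i, 443)
    for i in range(10_000)
)
print(spread)   # roughly Counter({0: ~5000, 1: ~5000})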

Minimum Links and Max-Bundle

Linux bonding supports an important LAG sizing parameter, min_links (switches offer an analogous max-bundle limit that caps the number of active member ports):

# min_links: minimum number of active links required for the bond to be "up"
# If active links drop below this number, the bond interface goes DOWN.
# This triggers higher-layer failover (e.g., routing protocol withdraws routes).
ip link set bond0 type bond min_links 1

# In practice:
#   min_links=1 (default): bond stays up if at least 1 link is active
#   min_links=2: bond goes down if only 1 link remains (forces failover
#                to a backup path rather than running degraded)

For a dual-NIC bond (2x 25 GbE), min_links=1 is the standard production setting. Setting min_links=2 is used in architectures where running on a single link is considered unacceptable (e.g., if one link cannot sustain the combined storage + VM traffic load) and a full failover to a different bond or path is preferred.

Failure Detection and Convergence

When a link in an LACP bond fails, the following sequence occurs:

Link Failure Convergence Timeline (fast LACP rate):

  t=0.0s   Physical link failure (fiber cut, NIC failure, switch port failure)
            |
  t=0.0s   Link-down event detected via MII monitoring (miimon=100ms)
            OR via carrier loss on the NIC
            |
  t=0.1s   Linux bonding driver removes the failed link from the LAG
            Remaining link(s) continue forwarding
            All flows that were hashed to the failed link are REHASHED
            to surviving links
            |
  t=0.1s   LACPDU with updated state sent to switch
            (Synchronization flag cleared for failed port)
            |
  t=0.1-   Switch detects LACPDU timeout (fast rate: 3 missed PDUs = 3s)
  3.0s     OR switch detects link-down via physical layer
            Switch removes port from its LAG
            |
  t=0.5-   Convergence complete. Traffic flows on surviving links.
  3.0s     Total packet loss: 0-3 seconds depending on detection method.
           With MII monitoring (100ms): <1 second.
           With LACP timeout only: up to 3 seconds.

  NOTE: Flows that were on the surviving link are NOT disrupted.
        Only flows hashed to the failed link experience brief loss.

MII monitoring vs. ARP monitoring:

+--------------+---------------------------------------------+-----------------+-------------------------------------------+
| Method       | How It Works                                | Detection Time  | Best For                                  |
+--------------+---------------------------------------------+-----------------+-------------------------------------------+
| miimon (MII) | Polls NIC driver for link status every N ms | 100 ms typical  | Direct-attached links, NIC failures       |
| arp_interval | Sends ARP probes to a target IP, declares   | 1-5 s typical   | Detecting upstream switch/path failures   |
|              | failure if no reply                         |                 | that MII cannot see                       |
+--------------+---------------------------------------------+-----------------+-------------------------------------------+

For data center environments with direct server-to-ToR cabling, miimon=100 is standard. ARP monitoring is used in environments where the NIC can remain "link up" even when the upstream path is broken (e.g., through a media converter or passive patch panel).

Linux Bonding Mode 4 (802.3ad) Implementation

Linux implements LACP as bonding mode 4. The bonding driver manages the state machines, LACPDU transmission and reception, link selection, and hash-based traffic distribution.

# Full production LACP bond configuration via ip commands:
ip link add bond0 type bond \
    mode 802.3ad \
    miimon 100 \
    lacp_rate fast \
    xmit_hash_policy layer3+4 \
    min_links 1

ip link set ens1f0 master bond0
ip link set ens1f1 master bond0
ip link set bond0 mtu 9000
ip link set bond0 up

# Verify LACP state:
cat /proc/net/bonding/bond0
# Output includes:
#   Bonding Mode: IEEE 802.3ad Dynamic link aggregation
#   Transmit Hash Policy: layer3+4
#   LACP rate: fast
#   MII Status: up
#   Partner Mac Address: <switch MAC>
#   Aggregator ID: 1
#   Number of ports: 2
#   Actor Key: 9
#   Partner Key: 100
#   Slave Interface: ens1f0
#     MII Status: up
#     Aggregator ID: 1
#     Actor Churn State: none
#     Partner Churn State: none
#   Slave Interface: ens1f1
#     MII Status: up
#     Aggregator ID: 1

NMState (OVE) declarative bond configuration:

apiVersion: nmstate.io/v1
kind: NodeNetworkConfigurationPolicy
metadata:
  name: bond0-lacp
spec:
  nodeSelector:
    node-role.kubernetes.io/worker: ""
  desiredState:
    interfaces:
      - name: bond0
        type: bond
        state: up
        mtu: 9000
        ipv4:
          dhcp: false
          enabled: false
        link-aggregation:
          mode: 802.3ad
          port:
            - ens1f0
            - ens1f1
          options:
            lacp_rate: fast
            miimon: '100'
            xmit_hash_policy: layer3+4
            min_links: '1'

Network ATC (Azure Local) intent-based configuration:

# Network ATC abstracts bonding configuration behind intents.
# The administrator declares the intent; ATC configures SET (Switch Embedded Teaming).
Add-NetIntent -Name "Compute_Management" `
    -Management -Compute `
    -AdapterName "pNIC01", "pNIC02"

# ATC automatically creates a SET team (a Hyper-V virtual switch
# with embedded teaming), configures the team in Switch Independent
# mode, and applies QoS policies.

# Note: Azure Local uses SET, not Linux bonding.
# SET operates inside the Hyper-V virtual switch and supports
# Switch Independent teaming only -- it does not support LACP.
# The corresponding switch ports are therefore configured as
# individual ports, not as a port-channel.

Common Pitfalls

  1. Mismatched keys: Server port has LACP key 1, switch port is in a different port-channel with key 200. LACP refuses to form the LAG. The link appears "up" at Layer 1 but carries no traffic. Symptom: Partner Churn State: churned in bonding status.

  2. Uneven distribution: Using layer2 hash policy when all VM traffic goes to the same gateway MAC. One link at 100%, the other at 0%. Diagnosed with ethtool -S ens1f0 | grep tx_bytes vs ethtool -S ens1f1 | grep tx_bytes (a sketch automating this check follows this list).

  3. Single-flow bottleneck: A single large flow (e.g., a 100 GB VM live migration, a bulk database replication, or a storage sync) can only use one link because all packets in the flow hash to the same link. No hash policy can solve this -- it is a fundamental property of hash-based distribution. Mitigation: use multiple parallel streams (some storage replication protocols do this automatically) or size links to accommodate the largest expected single flow.

  4. Fast LACP not configured: Default slow rate (30s PDU, 90s timeout) causes unacceptably long failure detection. Production bonds must use lacp_rate fast.

  5. MTU mismatch on bond members: If ens1f0 has MTU 9000 and ens1f1 has MTU 1500 (due to a configuration error), frames larger than 1500 will be dropped on the link with the lower MTU. The bond interface MTU should be set on the bond, which propagates it to all members.

  6. LACP across two independent switches without MLAG: If ens1f0 connects to Switch-A and ens1f1 connects to Switch-B, but Switch-A and Switch-B are not configured as an MLAG/SMLT/VPC pair, LACP will fail because the two switch ports present different System IDs. The bonding driver sees two different partners and cannot form a single LAG. Solution: use active-backup (mode 1) or balance-alb (mode 6) instead, or configure MLAG between the switches (see SMLT section below).
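
Pitfall 2 above lends itself to automation. A minimal Python sketch (the interface names are examples; the /sys/class/net statistics counters are standard Linux) that samples per-member transmit counters and flags a lopsided bond:

import time
from pathlib import Path

def tx_bytes(iface: str) -> int:
    # Standard kernel per-interface transmit byte counter.
    return int(Path(f"/sys/class/net/{iface}/statistics/tx_bytes").read_text())

def check_balance(members, interval=10, threshold=0.8):
    """Sample tx_bytes over `interval` seconds and flag any bond member
    carrying more than `threshold` of the transmitted bytes."""
    before = {m: tx_bytes(m) for m in members}
    time.sleep(interval)
    deltas = {m: tx_bytes(m) - before[m] for m in members}
    total = sum(deltas.values()) or 1
    for member, delta in deltas.items():
        share = delta / total
        warn = "  <-- check xmit_hash_policy" if share > threshold else ""
        print(f"{member}: {delta} bytes ({share:.0%}){warn}")

check_balance(["ens1f0", "ens1f1"])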


2. LLDP (Link Layer Discovery Protocol)

What It Is and Why It Exists

LLDP is defined by IEEE 802.1AB. It is a vendor-neutral, link-layer discovery protocol that allows network devices to advertise their identity, capabilities, and configuration to directly connected neighbors. LLDP frames are sent periodically (default: every 30 seconds) and are never forwarded by switches -- they are consumed by the directly connected device only.

In a data center with hundreds of servers and thousands of cables, LLDP provides automated topology discovery. Instead of manually tracing cables (which is error-prone and does not scale), LLDP allows the network management system to build a real-time map of "server NIC X is connected to switch port Y on switch Z." This is invaluable for:

  1. Verifying that every server NIC is cabled to the intended switch and port before a deployment goes live.
  2. Troubleshooting -- identifying the exact switch port behind a misbehaving link without physically tracing cables.
  3. Keeping topology documentation current automatically instead of relying on manually maintained spreadsheets.

LLDPDU Structure

LLDP frames are sent to the well-known multicast MAC address 01:80:C2:00:00:0E with EtherType 0x88CC. Each LLDPDU contains a series of TLVs (Type-Length-Value) that carry specific information about the sending device.

LLDPDU Frame Structure:

+--------------------------------------------------------------------+
| Ethernet Header                                                    |
| Dst MAC:    01:80:C2:00:00:0E  (LLDP multicast)                   |
| Src MAC:    (sending port's MAC address)                           |
| EtherType:  0x88CC (LLDP)                                         |
+--------------------------------------------------------------------+
| LLDPDU Payload (sequence of TLVs):                                 |
|                                                                    |
| +------+--------+--------------------------------------------+    |
| | Type | Length | Value                                      |    |
| | 7bit | 9bit   | (variable)                                 |    |
| +------+--------+--------------------------------------------+    |
|                                                                    |
| Mandatory TLVs (must appear in this order):                        |
|                                                                    |
| TLV Type 1: Chassis ID                                             |
|   Subtype: 4 (MAC address) or 7 (locally assigned)                 |
|   Value:   e.g., "aa:bb:cc:dd:ee:00" (switch base MAC)            |
|                                                                    |
| TLV Type 2: Port ID                                                |
|   Subtype: 5 (interface name) or 7 (locally assigned)              |
|   Value:   e.g., "Ethernet1/1" or "GigabitEthernet0/1"            |
|                                                                    |
| TLV Type 3: Time to Live (TTL)                                     |
|   Value:   120 seconds (default, 4x the transmit interval)         |
|   After TTL expires without a new LLDPDU, the neighbor             |
|   entry is removed from the local LLDP table.                      |
|                                                                    |
| Optional TLVs (order not fixed):                                   |
|                                                                    |
| TLV Type 4: Port Description                                       |
|   Value:   e.g., "25GbE Server Port - Rack 12, U30"               |
|                                                                    |
| TLV Type 5: System Name                                            |
|   Value:   e.g., "leaf-switch-01.dc1.example.com"                  |
|                                                                    |
| TLV Type 6: System Description                                     |
|   Value:   e.g., "Arista DCS-7050SX3-48YC12, EOS 4.28.1F"        |
|                                                                    |
| TLV Type 7: System Capabilities                                    |
|   Value:   bitmap (bridge, router, WLAN AP, station, etc.)         |
|                                                                    |
| TLV Type 8: Management Address                                     |
|   Value:   e.g., "10.255.0.1" (switch management IP)              |
|                                                                    |
| TLV Type 127: Organizationally Specific TLVs                      |
|   OUI-based extensions:                                            |
|   - IEEE 802.1: VLAN name, Port VLAN ID, Link Aggregation status  |
|   - IEEE 802.3: MAC/PHY config, Maximum Frame Size, Link Agg.     |
|   - LLDP-MED: Media Endpoint Discovery (VoIP, PoE info)           |
|   - Vendor-specific: Cisco, Arista, Juniper each add custom TLVs  |
|                                                                    |
| TLV Type 0: End of LLDPDU                                         |
|   Length: 0 (marks the end of the TLV sequence)                    |
+--------------------------------------------------------------------+
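
The 7-bit type / 9-bit length packing is the part implementations most often get wrong. A minimal Python sketch (illustrative only, with a hand-built two-TLV payload) of walking the TLV headers:

def walk_tlvs(payload: bytes):
    """Yield (tlv_type, value) pairs from a raw LLDPDU payload.
    Each TLV header is 2 bytes: type in the top 7 bits,
    length in the bottom 9 bits."""
    offset = 0
    while offset + 2 <= len(payload):
        header = int.from_bytes(payload[offset:offset + 2], "big")
        tlv_type, tlv_len = header >> 9, header & 0x01FF
        offset += 2
        if tlv_type == 0:                      # End of LLDPDU
            return
        yield tlv_type, payload[offset:offset + tlv_len]
        offset += tlv_len

# Hand-built payload: Chassis ID (type 1, subtype 4 = MAC address),
# TTL (type 3, 120 seconds), then the End of LLDPDU marker.
pdu = (bytes([1 << 1, 7, 4]) + bytes.fromhex("aabbccddee00")
       + bytes([3 << 1, 2]) + (120).to_bytes(2, "big")
       + bytes([0, 0]))
for tlv_type, value in walk_tlvs(pdu):
    print(tlv_type, value.hex())   # 1 04aabbccddee00 / 3 0078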

LLDP on Linux Hosts

Linux hosts can both receive and transmit LLDP frames using an LLDP agent such as lldpd (or lldpad from the open-lldp package). On RHEL / CoreOS (the OS underlying OVE worker nodes), lldpd can be installed and configured to expose neighbor information.

# Query LLDP neighbors seen by the host:
lldpctl
# Output example:
# -------------------------------------------------------------------------------
# LLDP neighbors:
# -------------------------------------------------------------------------------
# Interface:    ens1f0, via: LLDP
#   Chassis:
#     ChassisID:    mac aa:bb:cc:dd:ee:00
#     SysName:      leaf-01.dc1.example.com
#     SysDescr:     Arista Networks EOS version 4.28.1F
#     MgmtIP:       10.255.0.1
#     Capability:   Bridge, Router
#   Port:
#     PortID:       ifname Ethernet1/1
#     PortDescr:    Server-Rack12-U30-NIC1
#     TTL:          120
#     PMD autoneg:  supported: yes, enabled: yes
#
# Interface:    ens1f1, via: LLDP
#   Chassis:
#     ChassisID:    mac aa:bb:cc:dd:ee:01
#     SysName:      leaf-02.dc1.example.com
#     SysDescr:     Arista Networks EOS version 4.28.1F
#     MgmtIP:       10.255.0.2
#     Capability:   Bridge, Router
#   Port:
#     PortID:       ifname Ethernet1/1
#     PortDescr:    Server-Rack12-U30-NIC2

# This tells you:
#   ens1f0 is connected to leaf-01, port Ethernet1/1
#   ens1f1 is connected to leaf-02, port Ethernet1/1
#   Both are Arista switches running EOS 4.28.1F

CDP vs. LLDP Comparison

CDP (Cisco Discovery Protocol) is Cisco's proprietary equivalent of LLDP. Many data centers still use CDP because of Cisco's market dominance in switching.

+----------------------+----------------------------+----------------------------+
| Aspect               | LLDP (802.1AB)             | CDP (Cisco)                |
+----------------------+----------------------------+----------------------------+
| Standard             | IEEE, vendor-neutral       | Cisco proprietary          |
| Multicast MAC        | 01:80:C2:00:00:0E          | 01:00:0C:CC:CC:CC          |
| EtherType            | 0x88CC                     | SNAP encapsulation         |
| Default interval     | 30 seconds                 | 60 seconds                 |
| Default TTL/holdtime | 120 seconds                | 180 seconds                |
| Multi-vendor support | All modern switch vendors  | Cisco, some others         |
|                      |                            | receive-only               |
| Extensibility        | TLV-based, OUI extensions  | Fixed field structure      |
| VoIP/PoE support     | Via LLDP-MED               | Native                     |
+----------------------+----------------------------+----------------------------+

Recommendation for this evaluation: Require LLDP on all switches and hosts. If the physical fabric uses Cisco switches, enable both CDP and LLDP (Cisco switches support running both simultaneously). If the fabric uses Arista, Juniper, Dell, or Extreme switches, LLDP is the standard. OVE worker nodes (RHEL CoreOS) expose LLDP through NMState once reception is enabled per interface. Azure Local nodes (Windows Server) support LLDP via the Data Center Bridging feature.

LLDP in Virtualized Environments

In a physical server, LLDP operates at the NIC level -- the NIC port exchanges LLDP frames with the directly connected switch port. In a virtualized environment, there is an additional question: can VMs see the LLDP information from the physical switch?

Short answer: Generally no, by design. LLDP frames are link-local and consumed by the first device that receives them. The physical NIC receives the LLDP frame, and the host OS processes it. The virtual switch (OVS, Hyper-V vSwitch) does not forward LLDP frames to VMs because LLDP uses a reserved multicast address that bridges are not supposed to forward.

However, the host OS can expose LLDP data to the management plane. In OVE, the NMState operator can read LLDP neighbor information from each node's NICs and expose it as part of the NodeNetworkState custom resource. This allows the platform operator to query LLDP data via kubectl or the OpenShift console, without needing to SSH to each node.

# Example: LLDP data exposed in NodeNetworkState (OVE)
apiVersion: nmstate.io/v1beta1
kind: NodeNetworkState
metadata:
  name: worker-node-01
status:
  currentState:
    interfaces:
      - name: ens1f0
        type: ethernet
        state: up
        lldp:
          enabled: true
          neighbors:
            - - type: 5    # System Name
                value: "leaf-01.dc1.example.com"
              - type: 2    # Port ID
                value: "Ethernet1/1"
              - type: 8    # Management Address
                value: "10.255.0.1"

How Each Platform Uses LLDP

OVE: NMState can enable LLDP reception on physical interfaces and expose neighbor data in NodeNetworkState CRDs. This enables GitOps-style topology validation: a CI pipeline can query all nodes' LLDP data and verify that every server is connected to the expected switch pair and ports before deploying workloads.

Azure Local: Windows Server supports LLDP via the Data Center Bridging (DCB) feature. Network ATC can read LLDP information to validate physical connectivity. The Windows Admin Center displays LLDP neighbor information in the network configuration view.

VMware (current): ESXi supports both CDP and LLDP on the vDS (vSphere Distributed Switch). The N-VDS also supports LLDP. LLDP neighbor data is visible in the vCenter UI under the host's network configuration. This is commonly used by VMware admins to verify physical cabling.

Swisscom ESC: LLDP is managed by Swisscom as part of the underlying infrastructure. Customers do not have direct access to physical LLDP data.


3. SMLT (Split Multi-Link Trunking) and Multi-Chassis LAG

The Generic Problem: Eliminating Spanning Tree at the Access Layer

In a traditional access layer design, each server connects to a single top-of-rack (ToR) switch. If the switch fails, all servers connected to it lose connectivity. Adding a second switch and connecting each server to both switches introduces a Layer-2 loop, which Spanning Tree Protocol (STP) resolves by blocking one link -- eliminating the redundancy benefit.

The Problem: Spanning Tree Blocks Redundant Links

  Server                Server
  [NIC1] [NIC2]         [NIC1] [NIC2]
    |       |             |       |
    |       +-----+-------+       |
    |             |               |
    v             v               v
  +--------+    +--------+
  |Switch A|----|Switch B|   <-- STP blocks one direction
  +---+----+    +---+----+       to prevent loop
      |            |
    Spine         Spine

  With STP: One of the server-to-switch links is BLOCKED.
  Server has single-link connectivity = no bandwidth gain.
  STP convergence after failure: 30-50 seconds (RSTP: 1-5 seconds).

Multi-Chassis Link Aggregation (MC-LAG) is the generic solution to this problem. Two physical switches are configured to act as a single logical switch from the perspective of LACP. The server's LACP bond sees a single partner (single System ID), even though the physical links connect to two different chassis. This eliminates the need for STP on the access links, provides true active-active bandwidth utilization, and delivers sub-second failover.

Vendor Implementations

The concept is the same across vendors; the naming and implementation details differ:

Multi-Chassis LAG -- Vendor Naming:

+--------------------+-------------------------+--------------------+
| Vendor             | Technology Name          | Peer Link Name    |
+--------------------+-------------------------+--------------------+
| Avaya / Extreme    | SMLT (Split Multi-Link   | IST (Inter-Switch |
|                    | Trunking)                | Trunk) / vIST     |
+--------------------+-------------------------+--------------------+
| Cisco Nexus        | VPC (Virtual             | vPC Peer Link     |
|                    | Port-Channel)            |                   |
+--------------------+-------------------------+--------------------+
| Cisco IOS-XE       | StackWise Virtual        | StackWise Virtual |
|                    |                          | Link (SVL)        |
+--------------------+-------------------------+--------------------+
| Arista             | MLAG (Multi-chassis      | Peer Link         |
|                    | Link Aggregation)        |                   |
+--------------------+-------------------------+--------------------+
| Juniper            | MC-LAG (Multi-Chassis    | ICL (Inter-Chassis|
|                    | Link Aggregation Group)  | Link)             |
+--------------------+-------------------------+--------------------+
| Dell OS10          | VLT (Virtual Link        | VLTi (VLT         |
|                    | Trunking)                | Interconnect)     |
+--------------------+-------------------------+--------------------+
| Nokia (ALU)        | MC-LAG                   | MC-LAG ICL        |
+--------------------+-------------------------+--------------------+

All achieve the same goal: two switches, one logical entity for LACP.

How SMLT / MLAG Works

MLAG / SMLT Topology:

                        +------ IST / Peer Link ------+
                        |   (carries sync traffic,     |
                        |    control plane keepalives,  |
                        |    orphan port traffic)       |
                        |                              |
                  +-----v------+              +--------v---+
                  |  Switch A  |              |  Switch B  |
                  | (MLAG Peer)|              | (MLAG Peer)|
                  |            |              |            |
                  | System ID: |              | System ID: |
                  | aa:bb:cc:  |              | aa:bb:cc:  |
                  | dd:ee:01   |              | dd:ee:01   |
                  | (SHARED    |              | (SHARED    |
                  |  LACP ID)  |              |  LACP ID)  |
                  +--+-+--+-+--+              +--+-+--+-+--+
                     | |  | |                    | |  | |
              Server A    Server B         Server C    Server D
              bond0       bond0            bond0       bond0
              (LACP)      (LACP)           (LACP)      (LACP)

  Key Mechanism:
  1. Switch A and Switch B agree on a SHARED LACP System ID.
     They present this shared ID in all LACPDUs on MLAG ports.

  2. The server's bonding driver receives LACPDUs from both
     switch ports with the SAME System ID and the SAME Key.
     It therefore treats both ports as belonging to one LAG.

  3. The IST/Peer Link connects the two switches and carries:
     - MLAG control traffic (keepalives, state sync)
     - Data traffic for "orphan" ports (a port on Switch A
       that needs to reach a port only on Switch B)
     - MAC table synchronization
     - ARP table synchronization

  4. Both switches forward traffic simultaneously (active-active).
     The server gets full LAG bandwidth across both switches.
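
From the server's perspective the whole mechanism reduces to aggregator selection. A simplified Python sketch (not the bonding driver's actual code) of how members are grouped by the (partner system ID, partner key) pair learned from LACPDUs -- and why MLAG's shared system ID is what makes a cross-switch LAG possible:

from collections import defaultdict

def select_aggregators(members):
    """Group bond members by the LACP partner they see.
    members maps iface -> (partner_system_id, partner_key)."""
    aggs = defaultdict(list)
    for iface, partner in members.items():
        aggs[partner].append(iface)
    return dict(aggs)

# With MLAG: both switches advertise the SHARED system ID and key,
# so both NICs land in a single aggregator -> active-active LAG.
print(select_aggregators({
    "ens1f0": ("aa:bb:cc:dd:ee:01", 100),
    "ens1f1": ("aa:bb:cc:dd:ee:01", 100),
}))

# Without MLAG: two independent switches advertise different system
# IDs -> two single-link aggregators, of which only one can be
# active (Pitfall 6 in the LACP section above).
print(select_aggregators({
    "ens1f0": ("aa:bb:cc:dd:ee:01", 100),
    "ens1f1": ("11:22:33:44:55:66", 100),
}))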

IST (Inter-Switch Trunk) and vIST

The IST (Inter-Switch Trunk) is the dedicated link or link bundle connecting the two MLAG/SMLT peer switches. It serves two functions:

  1. Control plane synchronization: The two switches exchange keepalive messages, MAC address table updates, ARP/ND table updates, and MLAG state information over the IST. If the IST fails, the switches lose the ability to coordinate, which triggers a split-brain recovery mechanism.

  2. Data plane forwarding for orphan traffic: If a server is connected only to Switch A (single-attached, not dual-homed), but needs to reach a server connected only to Switch B, the traffic traverses the IST. Similarly, if a BUM frame needs to be flooded across the MLAG domain, it crosses the IST.

vIST (virtual IST) is Extreme Networks' evolution of IST. Instead of requiring a dedicated physical link for the IST, vIST uses an in-band VLAN on the existing uplinks between the two switches. This simplifies cabling but shares bandwidth with data traffic.

IST vs. vIST:

Traditional IST:                    vIST:
+----------+   Dedicated    +----------+     +----------+ Shared +----------+
| Switch A |---IST Link(s)--| Switch B |     | Switch A |--Uplinks-| Switch B |
|          |   (10G/40G/    |          |     |          | (IST    |          |
|          |    100G)       |          |     |          | VLAN    |          |
+----------+                +----------+     +----------+ inside) +----------+

IST: Dedicated bandwidth, no contention.
     Requires additional ports and cables.
     Industry standard (Cisco VPC, Arista MLAG).

vIST: Shares uplink bandwidth with data traffic.
      No additional ports needed.
      Extreme/Avaya specific.
      Risk: congestion on uplinks can starve IST control traffic.

Failure Scenarios and Failover Behavior

Understanding MLAG failure scenarios is critical for availability design:

Scenario 1: Single Server Link Failure
+----------+              +----------+
| Switch A |              | Switch B |
+--+-------+              +----+-----+
   |                            |
   |  X (link fails)            |
   |                            |
  [NIC1]    [NIC2]             bond still has one
  Server bond0 (LACP)          active link via Switch B

  Impact:   Bond removes failed link. Traffic continues on surviving link.
  Failover: < 1 second (MII monitoring + LACP fast rate)
  Traffic:  50% capacity (one link remains)
  User VMs: No disruption (brief packet loss for flows on failed link)

Scenario 2: Entire Switch Failure
+----------+              +----------+
|SWITCH A  |              | Switch B |
|  (DEAD)  |              | (ALIVE)  |
+----------+              +----+-----+
                                |
   X (all links to A dead)      |
                                |
  [NIC1]    [NIC2]             bond removes all links
  Server bond0 (LACP)          to dead switch

  Impact:   All server links to Switch A fail. Bond runs on Switch B links only.
  Failover: < 3 seconds (LACP fast rate timeout)
  Traffic:  50% capacity per server
  Orphans:  Any single-attached device on Switch A is unreachable.
  IST:      Switch B detects IST failure, assumes primary role.
            Switch B takes over MAC addresses learned on Switch A
            via the synchronized MAC table.

Scenario 3: IST / Peer Link Failure (most dangerous)
+----------+     X X X      +----------+
| Switch A |---IST FAILS----| Switch B |
|          |                 |          |
+--+-------+                 +----+-----+
   |                               |
  Servers on A                  Servers on B

  Impact:   Both switches are alive but cannot coordinate.
            SPLIT-BRAIN RISK: both may try to become primary.
  Behavior: Depends on vendor implementation:
    - Cisco VPC: Secondary switch shuts down its VPC member ports
      (all MLAG links on the secondary go down). This forces all
      traffic to the primary switch. Drastic but prevents loops.
    - Arista MLAG: Similar -- secondary disables MLAG interfaces.
    - Extreme SMLT: vIST uses keepalive probes via management
      network as backup. If both IST and keepalive fail,
      secondary shuts down its SMLT ports.
  Recovery: Restore IST link. Secondary re-syncs and re-enables ports.

  CRITICAL: Always dual-home the IST link (two or more physical
  cables in a LAG). Never run IST on a single cable.

Scenario 4: Server NIC Failure (one NIC of a dual-NIC bond)
+----------+              +----------+
| Switch A |              | Switch B |
+--+-------+              +----+-----+
   |                            |
   | (active)         X (NIC fails)
   |                            |
  [NIC1]    [NIC2 DEAD]
  Server bond0 (LACP)

  Impact:   Identical to Scenario 1. Bond removes dead NIC.
  Failover: < 1 second
  Note:     The switch port on Switch B detects LACP timeout and
            removes the port from its MLAG LAG as well.

Why MLAG Eliminates Spanning Tree on Access Links

With MLAG, each server has an LACP bond to a pair of switches that appear as a single logical switch. There is no loop because LACP ensures that only one logical link exists between the server and the (logical) switch. STP is still running on the switches (as a safety net), but no links are blocked because MLAG eliminates the topology loops that STP would need to resolve.

Without MLAG (STP Required):            With MLAG (No STP Needed):

  Server                                   Server
  [NIC1] [NIC2]                            [NIC1] [NIC2]
    |       |                                |       |
    |       | <-- STP blocks                 |       | <-- Both active
    |       |     one link                   |       |     (LACP LAG)
    v       v                                v       v
  Switch A--Switch B                       Switch A==Switch B
  (STP between switches                   (MLAG peer link,
   may also block one                      both switches
   inter-switch link)                      are one entity)

  STP convergence: 1-50 seconds            LACP failover: < 1 second
  Bandwidth: single link                   Bandwidth: both links active

This is why MLAG/SMLT/VPC is considered mandatory for production data center access layers. The combination of active-active bandwidth, sub-second failover, and STP elimination makes MLAG the industry standard for server connectivity. All three platform candidates (OVE, Azure Local, Swisscom ESC) assume an MLAG pair at the top-of-rack layer in their reference architectures.


4. ECMP (Equal-Cost Multi-Path)

What It Is and Why It Exists

ECMP is a routing mechanism that distributes traffic across multiple next-hops when those next-hops have equal routing cost. In a spine-leaf data center fabric, every leaf switch has the same cost to reach every other leaf switch via any spine switch. ECMP exploits this symmetry to distribute traffic across all spine switches simultaneously, providing full bisection bandwidth without any single bottleneck link.

Unlike LACP (which operates at Layer 2 and aggregates physical links into a single logical link), ECMP operates at Layer 3 (routing) and distributes traffic across multiple independent routed paths. This is a fundamental distinction: LACP requires both ends to negotiate and agree on the LAG; ECMP requires only that the routing table contains multiple equal-cost routes to the destination.

How ECMP Works at Layer 3

When a router has multiple routes to the same destination prefix with the same metric/cost, it installs all of them in the forwarding table (FIB) as ECMP next-hops.

ECMP in the Routing Table:

  Leaf-1 routing table:
  +----------------+---------+--------------------+
  | Destination    | Cost    | Next-Hop(s)        |
  +----------------+---------+--------------------+
  | 10.20.0.0/24   | 10      | Spine-1: 10.255.1.1|
  | (Leaf-3 subnet)|         | Spine-2: 10.255.2.1|  <-- 3 equal-cost
  |                |         | Spine-3: 10.255.3.1|      next-hops = ECMP
  +----------------+---------+--------------------+
  | 10.30.0.0/24   | 10      | Spine-1: 10.255.1.1|
  | (Leaf-4 subnet)|         | Spine-2: 10.255.2.1|  <-- Same ECMP
  |                |         | Spine-3: 10.255.3.1|
  +----------------+---------+--------------------+

  When Leaf-1 needs to send a packet to 10.20.0.50 (on Leaf-3):
  1. Routing table lookup: 10.20.0.0/24 has 3 ECMP next-hops
  2. Hash function selects one next-hop based on the packet's flow
  3. Packet is forwarded to the selected spine
  4. The spine forwards to Leaf-3 (direct link)

Hash-Based Forwarding

Like LACP, ECMP uses hash-based forwarding to assign flows to paths. The hash function determines which packet fields are used to compute the path selection.

Per-flow vs. per-packet:

+------------+-------------------------------------------+--------------------------------------------------+
| Approach   | Behavior                                  | Consequence                                      |
+------------+-------------------------------------------+--------------------------------------------------+
| Per-packet | Each packet is sent on a different path   | Packets arrive out of order. TCP interprets this |
|            | (round-robin across ECMP paths)           | as congestion and backs off. Massive performance |
|            |                                           | degradation. Never used in practice.             |
| Per-flow   | All packets of the same flow (same        | Packets arrive in order. Some paths may be more  |
|            | 5-tuple) take the same path               | loaded than others, but correctness is           |
|            |                                           | preserved. This is the standard.                 |
+------------+-------------------------------------------+--------------------------------------------------+

5-tuple hashing:

The standard ECMP hash uses the 5-tuple: source IP, destination IP, IP protocol, source port, destination port. This provides good distribution for diverse traffic patterns (many different source/destination pairs with different ports).

5-Tuple ECMP Hash Illustration:

  Flow 1: 10.10.0.5:45000 -> 10.20.0.50:443 (TCP)
           hash(10.10.0.5, 10.20.0.50, TCP, 45000, 443) mod 3 = 0
           --> Spine-1

  Flow 2: 10.10.0.5:45001 -> 10.20.0.50:443 (TCP)
           hash(10.10.0.5, 10.20.0.50, TCP, 45001, 443) mod 3 = 2
           --> Spine-3

  Flow 3: 10.10.0.6:38000 -> 10.20.0.51:8080 (TCP)
           hash(10.10.0.6, 10.20.0.51, TCP, 38000, 8080) mod 3 = 1
           --> Spine-2

  Three flows, distributed across all three spine switches.
  Each flow stays on its path (no reordering).

ECMP in Spine-Leaf Fabrics

The spine-leaf fabric is the standard data center network architecture because it provides predictable latency (every leaf-to-leaf path is exactly 2 hops: leaf -> spine -> leaf) and full bisection bandwidth via ECMP across all spines.

Spine-Leaf with ECMP (3 spines, 5 leaves):

                 +----------+   +----------+   +----------+
                 | Spine-1  |   | Spine-2  |   | Spine-3  |
                 | AS 65000 |   | AS 65001 |   | AS 65002 |
                 +----------+   +----------+   +----------+
                      |              |              |
           (full mesh: every leaf has one uplink to every spine)
                      |              |              |
      +--------+  +--------+  +--------+  +--------+  +--------+
      |Leaf-1  |  |Leaf-2  |  |Leaf-3  |  |Leaf-4  |  |Leaf-5  |
      |AS 65010|  |AS 65011|  |AS 65012|  |AS 65013|  |AS 65014|
      +---+----+  +---+----+  +---+----+  +---+----+  +---+----+
          |           |           |           |           |
       Servers     Servers     Servers     Servers     Servers
          (each server: bond0 to an MLAG leaf pair)

  Traffic from Leaf-1 to Leaf-4:
  - Leaf-1 has 3 ECMP routes to Leaf-4's subnets (one via each spine)
  - Traffic is hash-distributed across all 3 spines
  - Total bandwidth: 3x the individual spine-leaf link speed
  - If Spine-2 fails: traffic redistributes to Spine-1 and Spine-3
    Total bandwidth: 2x (graceful degradation, not total failure)

  Bisection Bandwidth:
  - If each spine-leaf link is 100 Gbps and there are 3 spines:
    Each leaf has 300 Gbps uplink capacity
  - With 5 leaves, total fabric bandwidth = 5 x 300 Gbps = 1.5 Tbps
  - Full bisection means half the leaves can talk to the other half
    at their full uplink speed simultaneously

ECMP + BGP

In a modern spine-leaf fabric, BGP is the routing protocol that establishes the equal-cost routes that ECMP uses. Each leaf switch runs its own BGP AS number and peers with every spine switch (eBGP). Each spine learns routes from all leaves and re-advertises them to every other leaf, creating the equal-cost multi-path entries.

BGP ECMP Route Advertisement:

  Leaf-1 (AS 65010) announces to all spines:
    "10.10.0.0/24 is reachable through me"

  Leaf-4 (AS 65013) announces to all spines:
    "10.40.0.0/24 is reachable through me"

  Each spine receives both announcements and re-advertises them:

  Spine-1 (AS 65000) tells Leaf-1:
    "10.40.0.0/24 via me, AS path: 65000 65013"

  Spine-2 (AS 65001) tells Leaf-1:
    "10.40.0.0/24 via me, AS path: 65001 65013"

  Spine-3 (AS 65002) tells Leaf-1:
    "10.40.0.0/24 via me, AS path: 65002 65013"

  Leaf-1 now has 3 routes to 10.40.0.0/24, all with the same
  AS path length (2 hops). Because the AS paths differ in content
  (though not in length), BGP multipath must be relaxed to accept
  them all; with that enabled, BGP installs all 3 as ECMP next-hops.

  Configuration (example, Arista EOS on Leaf-1):
    router bgp 65010
      maximum-paths 3                        <-- Allow up to 3 ECMP paths
      bgp bestpath as-path multipath-relax   <-- ECMP across differing AS paths
      neighbor Spine1 remote-as 65000
      neighbor Spine2 remote-as 65001
      neighbor Spine3 remote-as 65002

BGP with unnumbered interfaces: In modern leaf-spine designs, BGP peering uses unnumbered interfaces (IPv6 link-local addresses for the BGP session, with IPv4 routes advertised via MP-BGP). This eliminates the need to assign and manage /31 point-to-point subnets on every spine-leaf link, dramatically simplifying IP address management.

Unnumbered BGP Peering:

  Leaf-1                        Spine-1
  +----------+                  +----------+
  | eth1     |------ link ------| eth1     |
  | fe80::1  |                  | fe80::2  |
  | (no IPv4 |                  | (no IPv4 |
  |  address)|                  |  address)|
  +----------+                  +----------+

  BGP session uses fe80 link-local addresses.
  IPv4 routes are advertised with IPv6 next-hop
  and resolved via the link-local neighbor.

  No /31 subnets to plan, no IP conflicts, no IPAM needed
  for inter-switch links. Each link is self-contained.

Resilient Hashing

When an ECMP path fails (e.g., a spine switch goes down), the hash function's denominator changes (from 3 paths to 2 paths). Without resilient hashing, this causes all flows to be rehashed -- even flows that were on the surviving paths. This means that a flow that was happily traversing Spine-1 might suddenly be moved to Spine-3, causing transient reordering.

Resilient hashing (also called "consistent hashing" or "resilient ECMP") minimizes this disruption by only rehashing flows that were on the failed path. Flows on surviving paths keep their current assignment.

Standard ECMP Rehash (without resilient hashing):

  Before failure (3 spines):
    Flow A: hash mod 3 = 0 --> Spine-1
    Flow B: hash mod 3 = 1 --> Spine-2
    Flow C: hash mod 3 = 2 --> Spine-3
    Flow D: hash mod 3 = 0 --> Spine-1
    Flow E: hash mod 3 = 1 --> Spine-2

  Spine-2 fails. New hash: mod 2

    Flow A: hash mod 2 = 0 --> Spine-1  (same -- lucky)
    Flow B: hash mod 2 = 1 --> Spine-3  (MOVED from Spine-2 -- expected)
    Flow C: hash mod 2 = 0 --> Spine-1  (MOVED from Spine-3 -- unnecessary!)
    Flow D: hash mod 2 = 1 --> Spine-3  (MOVED from Spine-1 -- unnecessary!)
    Flow E: hash mod 2 = 0 --> Spine-1  (MOVED from Spine-2 -- expected)

  Result: 4 of 5 flows were disrupted, but only 2 needed to move.

Resilient ECMP (with resilient hashing):

  Uses a hash bucket table with a fixed size (e.g., 64 or 128 buckets).
  Each bucket is assigned to a next-hop. When a next-hop fails,
  only its buckets are reassigned to surviving next-hops.

  Before failure:
    Buckets 0-21:  Spine-1 (22 buckets)
    Buckets 22-42: Spine-2 (21 buckets)
    Buckets 43-63: Spine-3 (21 buckets)

  Spine-2 fails:
    Buckets 0-21:  Spine-1 (unchanged)
    Buckets 22-31: Spine-1 (reassigned from Spine-2)
    Buckets 32-42: Spine-3 (reassigned from Spine-2)
    Buckets 43-63: Spine-3 (unchanged)

  Result: Only flows in buckets 22-42 are disrupted.
          Flows in buckets 0-21 and 43-63 are NOT moved.

Resilient hashing must be enabled on the switches. It is not enabled by default on all platforms. Confirm during fabric design that the chosen switch ASIC supports resilient ECMP and how it is enabled.
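
The bucket table is simple enough to model directly. A minimal Python sketch of the concept (vendor ASICs implement this in hardware with their own bucket counts and assignment rules):

def build_buckets(next_hops, size=64):
    # Assign each hash bucket to a next-hop round-robin.
    return [next_hops[i % len(next_hops)] for i in range(size)]

def fail_next_hop(buckets, dead, survivors):
    """Resilient rehash: reassign ONLY the dead next-hop's buckets,
    spreading them across the survivors. Every other bucket keeps
    its assignment, so those flows never move."""
    n = 0
    for i, nh in enumerate(buckets):
        if nh == dead:
            buckets[i] = survivors[n % len(survivors)]
            n += 1
    return buckets

buckets = build_buckets(["Spine-1", "Spine-2", "Spine-3"])
before = list(buckets)
fail_next_hop(buckets, "Spine-2", ["Spine-1", "Spine-3"])
moved = sum(a != b for a, b in zip(before, buckets))
print(f"{moved} of {len(buckets)} buckets moved")   # 21 of 64, not 64 of 64

# A flow always hashes to the same bucket index, so any flow whose
# bucket kept its next-hop stays on its original spine.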

ECMP for Overlay Traffic

Overlay traffic (VXLAN, GENEVE) presents a specific challenge for ECMP hashing. The outer IP header of an encapsulated packet has the same source and destination IP for all traffic between the same pair of TEPs (Tunnel Endpoints). If the ECMP hash function only examines the outer header, all overlay traffic between two hosts hashes to a single spine -- completely defeating the purpose of ECMP.

The Overlay ECMP Problem:

  Host-A (TEP: 192.168.250.10) sends to Host-B (TEP: 192.168.250.20)

  100 different VM flows encapsulated in GENEVE:
    Outer src: 192.168.250.10  (always the same)
    Outer dst: 192.168.250.20  (always the same)
    Outer proto: UDP/6081      (always the same)

  If ECMP hashes only on outer 3-tuple:
    hash(192.168.250.10, 192.168.250.20, UDP) = CONSTANT
    --> ALL 100 flows go through the SAME spine
    --> One spine at 100%, other spines at 0%
    --> Effective bisection bandwidth = 1/N of design capacity

The solution: entropy in the outer UDP source port. VXLAN and GENEVE both use a UDP outer header. The standard behavior of overlay encapsulators (OVS, VFP, NSX N-VDS) is to compute the outer UDP source port as a hash of the inner packet's 5-tuple. This means that different inner flows produce different outer UDP source ports, even though the outer IP addresses are identical.

Overlay Entropy via UDP Source Port:

  Inner flow 1: VM-A (10.0.0.1:45000) -> VM-B (10.0.0.2:443)
  Outer: UDP src_port = hash(10.0.0.1, 10.0.0.2, 45000, 443) = 49152
  ECMP hash: hash(192.168.250.10, 192.168.250.20, 49152, 6081) --> Spine-1

  Inner flow 2: VM-C (10.0.0.3:38000) -> VM-D (10.0.0.4:8080)
  Outer: UDP src_port = hash(10.0.0.3, 10.0.0.4, 38000, 8080) = 51234
  ECMP hash: hash(192.168.250.10, 192.168.250.20, 51234, 6081) --> Spine-3

  Inner flow 3: VM-E (10.0.0.5:22000) -> VM-F (10.0.0.6:5432)
  Outer: UDP src_port = hash(10.0.0.5, 10.0.0.6, 22000, 5432) = 50100
  ECMP hash: hash(192.168.250.10, 192.168.250.20, 50100, 6081) --> Spine-2

  Now ECMP distributes overlay traffic across all spines,
  even though outer src/dst IP are the same.
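
A minimal Python sketch of the entropy trick (the hash functions are stand-ins; real encapsulators compute the port in the datapath, so the runtime values will differ from the illustration above):

import zlib

def outer_udp_src_port(inner_5tuple):
    """Derive the outer UDP source port from the inner flow's 5-tuple,
    confined to the ephemeral range 49152-65535, so different inner
    flows produce different outer ports (and different ECMP hashes)."""
    h = zlib.crc32("|".join(map(str, inner_5tuple)).encode())
    return 49152 + (h % 16384)

def ecmp_spine(src_ip, dst_ip, sport, dport, spines=3):
    # The fabric's 5-tuple hash now sees per-flow entropy in sport.
    return zlib.crc32(f"{src_ip}|{dst_ip}|{sport}|{dport}".encode()) % spines

for inner in [("10.0.0.1", "10.0.0.2", "TCP", 45000, 443),
              ("10.0.0.3", "10.0.0.4", "TCP", 38000, 8080),
              ("10.0.0.5", "10.0.0.6", "TCP", 22000, 5432)]:
    sport = outer_udp_src_port(inner)
    spine = ecmp_spine("192.168.250.10", "192.168.250.20", sport, 6081)
    print(inner, "-> outer sport", sport, "-> Spine", spine + 1)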

Switch configuration requirement: The physical switches performing ECMP must include the UDP source port in their hash calculation. This is the default on most modern data center switches when using a 5-tuple hash, but it must be verified. On some older ASICs, the UDP source port is not included in the ECMP hash by default, and the overlay entropy is wasted.

Verifying ECMP hash includes UDP source port:

  Arista EOS:
    show ip ecmp hash-algorithm    # Verify 5-tuple or custom
    ip load-sharing trident fields ip source-port destination-port

  Cisco NX-OS:
    show port-channel load-balance  # Shows hash fields
    port-channel load-balance src-dst-ip-l4port

  Juniper QFX:
    show forwarding-options enhanced-hash-key

The Polarization Problem

ECMP polarization occurs when multiple layers of the network use the same hash function with the same seed, causing traffic that was distributed at one layer to be concentrated at the next layer.

ECMP Polarization:

  Tier 1: Leaf switches hash traffic across spines
  Tier 2: Spine switches hash traffic across super-spines (in a 5-stage fabric)
          OR spine switches hash for LAG distribution toward the leaves

  If both tiers use the same hash function and seed:

  Leaf-1 hashes:
    Flow A --> Spine-1
    Flow B --> Spine-2
    Flow C --> Spine-3

  Spine-1 receives Flow A and must forward to Leaf-4.
  Spine-1 hashes Flow A for its next-hop selection:
    hash(Flow A) = SAME VALUE as Leaf-1's hash
    --> Flow A always goes to the same downstream link
    --> Polarized: traffic is not re-randomized at each layer

  Result: Some links carry disproportionately more traffic.
  In extreme cases, one spine link is saturated while
  parallel links are idle.

How to avoid polarization:

  1. Different hash seeds at each layer: Configure spine switches to use a different hash seed than leaf switches. Most modern switch ASICs support configurable hash seeds.

  2. Different hash algorithms: Use one hash algorithm (e.g., CRC16) on the leaf layer and a different algorithm (e.g., CRC32) on the spine layer.

  3. Symmetric hashing with per-layer randomization: Some ASICs support per-chip unique hash seeds derived from the switch serial number, ensuring that no two switches produce the same hash for the same flow.
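
The polarization effect is easy to reproduce in a few lines. The following Python sketch is illustrative only (zlib.crc32 with a string seed stands in for an ASIC hash): flows that a leaf sent to one spine all collapse onto a single downstream link when the spine reuses the leaf's hash, and spread out again with a different seed.

  import zlib

  def ecmp_pick(flow, n_paths, seed):
      # Hash the flow identifier together with a per-switch seed.
      return zlib.crc32(f"{seed}:{flow}".encode()) % n_paths

  flows = [f"10.0.0.{i}:{40000 + i}->10.0.1.{i}:443" for i in range(100)]

  # Flows that a leaf (seed=7) sends to spine-0:
  spine0 = [f for f in flows if ecmp_pick(f, 4, seed=7) == 0]

  # Spine-0 picks its own downstream link for each of those flows:
  print({ecmp_pick(f, 4, seed=7)  for f in spine0})   # same seed -> {0}, polarized
  print({ecmp_pick(f, 4, seed=42) for f in spine0})   # new seed  -> spread over links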

Anti-Polarization Configuration:

  Arista EOS (per-switch unique seed):
    ip ecmp hash-seed <unique-value-per-switch>
    # Use different values on leaf switches vs. spine switches

  Cisco NX-OS:
    hardware ecmp hash-polynomial CRC10c
    # OR use ip load-sharing address source-destination port source-destination
    #    universal-id <unique-per-switch>

  Juniper QFX:
    set forwarding-options hash-key family inet layer-3 destination-address
    set forwarding-options hash-key family inet layer-4 destination-port
    # Juniper uses a per-chip entropy value by default

Verification: To detect polarization, monitor per-link utilization on all spine uplinks. In a well-balanced ECMP fabric, all links should show similar utilization (within 10-15% of each other) under aggregate load. If one link is consistently 2-3x more loaded than others, polarization is the likely cause.
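
Given per-link throughput samples from the monitoring system, the check is trivial to automate. A minimal Python sketch (link names and rates are placeholder values):

  def polarization_suspects(link_bps, ratio=2.0):
      # Flag links carrying at least `ratio` times the group mean.
      mean = sum(link_bps.values()) / len(link_bps)
      return [name for name, bps in link_bps.items() if bps >= ratio * mean]

  # Four spine-facing uplinks on one leaf, bits per second:
  uplinks = {"Ethernet1": 92e9, "Ethernet2": 31e9, "Ethernet3": 28e9, "Ethernet4": 30e9}
  print(polarization_suspects(uplinks))   # ['Ethernet1'] -> review hash seeds/fields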


How the Candidates Handle This

Comparison Table

Aspect: Server-side link aggregation
  VMware (Current): vDS uplink teaming: LACP or load-based teaming (proprietary). LACP support on vDS only (not the standard vSwitch).
  OVE:              Linux bonding mode 4 (802.3ad / LACP) via the NMState operator. Full control over hash policy, LACP rate, min_links.
  Azure Local:      SET (Switch Embedded Teaming) inside the Hyper-V vSwitch. Switch-independent teaming only -- SET does not support LACP. Configured via Network ATC intents.
  Swisscom ESC:     Provider-managed. Typically Dell VxBlock with pre-configured LACP bonds. Customer has no direct NIC configuration.

Aspect: Hash policy control
  VMware (Current): vDS LACP: configurable (source/dest IP+port, source/dest MAC, VLAN+source).
  OVE:              Full control: xmit_hash_policy set to layer3+4 or encap3+4 via NMState.
  Azure Local:      SET hash: configurable via PowerShell (Set-VMSwitchTeam -LoadBalancingAlgorithm). Default is HyperVPort (per-VM distribution).
  Swisscom ESC:     Provider-managed. Not customer-configurable.

Aspect: LACP rate
  VMware (Current): Configurable per vDS LAG.
  OVE:              Configurable via NMState: lacp_rate: fast.
  Azure Local:      Not applicable -- SET is switch-independent and does not negotiate LACP.
  Swisscom ESC:     Provider-managed.

Aspect: LLDP support
  VMware (Current): vDS supports LLDP (transmit and receive). Visible in the vCenter UI per host.
  OVE:              NMState exposes LLDP neighbor data in NodeNetworkState CRDs. lldpd on CoreOS.
  Azure Local:      The Windows DCB feature provides LLDP. Windows Admin Center shows LLDP neighbors.
  Swisscom ESC:     Provider-managed. Not visible to the customer.

Aspect: MLAG / multi-chassis LAG requirement
  VMware (Current): Reference architecture requires an MLAG pair at the ToR for vDS LACP uplinks.
  OVE:              Reference architecture requires an MLAG pair at the ToR for NMState LACP bonds. Alternatively, mode 1 (active-backup) works without MLAG.
  Azure Local:      SET is switch-independent, so no LACP bundle is formed with the switches; the reference architecture still calls for a redundant ToR pair. Network ATC validates physical connectivity.
  Swisscom ESC:     Provider-managed. Swisscom operates the physical fabric.

Aspect: ECMP / fabric design
  VMware (Current): NSX Edge nodes peer with spines via BGP. Fabric ECMP is a physical network responsibility.
  OVE:              MetalLB speakers peer with leaf switches via BGP. OVN gateway routers can peer via BGP. Fabric ECMP is a physical network responsibility.
  Azure Local:      RAS Gateway / Network Controller peers with the ToR via BGP. Fabric ECMP is a physical network responsibility.
  Swisscom ESC:     Provider-managed. Swisscom operates the spine-leaf fabric.

Aspect: Overlay entropy (UDP src port)
  VMware (Current): NSX GENEVE uses entropy in the outer UDP source port. Verified at scale in production.
  OVE:              OVN/OVS GENEVE uses entropy in the outer UDP source port (default behavior in OVS).
  Azure Local:      Microsoft SDN VXLAN uses entropy in the outer UDP source port (default behavior in VFP).
  Swisscom ESC:     Same as VMware (NSX GENEVE).

Aspect: NIC configuration as code
  VMware (Current): vDS configuration via the vCenter API / PowerCLI. Not natively GitOps-compatible.
  OVE:              NMState NodeNetworkConfigurationPolicy CRDs. Fully declarative, GitOps-compatible, applied uniformly across all worker nodes.
  Azure Local:      Network ATC intents via PowerShell or ARM templates. Intent-based (less transparent than NMState but simpler).
  Swisscom ESC:     Not applicable (provider-managed).

Aspect: Physical topology discovery
  VMware (Current): vCenter shows LLDP/CDP neighbors per host. The NSX UI shows transport node connectivity.
  OVE:              NMState LLDP in NodeNetworkState. Can be queried via kubectl or automated via a CI pipeline for validation.
  Azure Local:      Windows Admin Center shows LLDP neighbors.
  Swisscom ESC:     Not visible to the customer.

Aspect: Failure detection speed
  VMware (Current): Depends on the vDS teaming policy. LACP with fast rate provides <3 s detection. Load-based teaming uses beacon probing (configurable interval).
  OVE:              Configurable: miimon=100 (100 ms link detection) + lacp_rate fast (3 s LACP timeout). Sub-second with MII monitoring.
  Azure Local:      SET detects link failure via the NIC miniport driver (sub-second); no LACP timers apply.
  Swisscom ESC:     Provider-managed. SLA-based (not directly measurable by the customer).

Key Differences in Prose

Level of control over physical connectivity: The most significant difference is operational transparency. OVE exposes every parameter of the bonding configuration -- mode, hash policy, LACP rate, min_links, MTU -- as declarative YAML that is version-controlled and applied uniformly across all nodes via NMState. An engineer can read a NodeNetworkConfigurationPolicy and know exactly how every node's NICs are configured. Azure Local abstracts this behind Network ATC intents, which is simpler to configure but less transparent when troubleshooting (you declare "management + compute on these NICs" and ATC decides the teaming parameters). VMware vDS teaming is configured per port group in vCenter, which is transparent but not declarative or version-controlled by default. Swisscom ESC removes this concern entirely -- the provider manages all physical connectivity.
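
For reference, a minimal NodeNetworkConfigurationPolicy sketch with the parameters discussed above (the interface names, node selector, and MTU are site-specific assumptions):

  apiVersion: nmstate.io/v1
  kind: NodeNetworkConfigurationPolicy
  metadata:
    name: bond0-lacp
  spec:
    nodeSelector:
      node-role.kubernetes.io/worker: ""
    desiredState:
      interfaces:
        - name: bond0
          type: bond
          state: up
          mtu: 9000
          link-aggregation:
            mode: 802.3ad
            port:
              - ens1f0
              - ens1f1
            options:
              miimon: "100"
              lacp_rate: "fast"
              xmit_hash_policy: "layer3+4"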

MLAG dependency: OVE (Linux bonding mode 4) and VMware (vDS LACP uplinks) form LACP bundles toward the top of rack and therefore require an MLAG-capable switch pair; Azure Local's SET is switch-independent and does not negotiate LACP, but its reference architecture still calls for a redundant ToR pair. The specific MLAG technology depends on the switch vendor (Cisco VPC, Arista MLAG, Extreme SMLT, Dell VLT). This means the switch vendor selection is tightly coupled with the platform deployment. During the PoC, the chosen switch vendor must demonstrate MLAG interoperability with the platform's bonding implementation (LACP negotiation, failover timing, LLDP data exposure).

ECMP is the physical network's responsibility, not the platform's. None of the three platforms "do" ECMP themselves -- ECMP is a property of the physical spine-leaf fabric. However, the platforms interact with ECMP indirectly through overlay encapsulation. The platform's overlay implementation must generate entropy in the outer UDP source port so that the physical fabric's ECMP hash can distribute overlay traffic across spines. OVS (OVE) and VFP (Azure Local) both do this correctly by default. The verification responsibility falls on the network engineering team during fabric design and PoC validation.

SET vs. Linux bonding: Azure Local uses Switch Embedded Teaming (SET), which is a Hyper-V virtual switch feature, not a separate NIC teaming layer. SET operates inside the vSwitch and provides teaming at the virtual switch level, not at the OS NIC level. This has a subtle implication: SET's default load-balancing algorithm is "HyperVPort," which distributes traffic per-VM (each VM is assigned to a specific physical NIC). This is different from Linux bonding's per-flow hash, where a single VM's traffic can be distributed across multiple NICs if it has multiple flows. For most VM workloads, per-VM distribution is adequate, but for VMs with very high bandwidth requirements (e.g., a single database VM consuming 40+ Gbps), SET's HyperVPort algorithm may not distribute traffic optimally. SET also supports "Dynamic" mode, which provides per-flow distribution similar to Linux bonding's layer3+4.


Key Takeaways

1. Hash-based distribution -- LACP bonds and ECMP alike -- multiplies aggregate bandwidth, not per-flow bandwidth: a single flow is always pinned to one link, so the highest-bandwidth workloads need parallel streams or multi-path protocols.
2. Overlay traffic only benefits from ECMP when the encapsulator injects entropy into the outer UDP source port and the switch hash includes that port; verify both on the target ASIC during the PoC.
3. Polarization is avoided by using different hash seeds or algorithms at each fabric tier, and detected by comparing utilization across parallel links.
4. Operational transparency differs sharply: OVE exposes bonding as declarative NMState YAML, Azure Local abstracts teaming behind Network ATC intents, and Swisscom ESC removes the concern entirely.
5. ECMP and MLAG are properties of the physical fabric, not the platform -- fabric design, interoperability testing, and failure convergence measurement belong to the network engineering team.

Discussion Guide

The following questions are designed for vendor workshops, network architecture reviews, and PoC validation sessions. They target the interface between the platform and the physical fabric.

1. LACP Configuration and Validation

"Walk us through the exact LACP configuration applied to each server by your platform. What is the hash policy, LACP rate, and MII monitoring interval? How do we verify that traffic is actually distributed across both links in production? Show us the commands or dashboards that display per-link utilization and per-link packet counters for a bond interface."

Purpose: Validates that the vendor understands bonding internals, not just "we use LACP." The answer must include specific hash policy (layer3+4 or equivalent), fast LACP rate, and a method to measure per-link distribution (e.g., ethtool -S, cat /proc/net/bonding/bond0, Network ATC diagnostics).
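
One way to measure per-link distribution on a Linux bond, sketched in Python (the slave names are site-specific assumptions; the kernel exposes these counters under /sys/class/net):

  import time
  from pathlib import Path

  def tx_bytes(iface):
      return int(Path(f"/sys/class/net/{iface}/statistics/tx_bytes").read_text())

  slaves = ["ens1f0", "ens1f1"]                  # bond members -- site-specific
  before = {s: tx_bytes(s) for s in slaves}
  time.sleep(10)                                 # sampling window in seconds
  for s in slaves:
      gbps = (tx_bytes(s) - before[s]) * 8 / 10 / 1e9
      print(f"{s}: {gbps:.2f} Gbps")             # both members should carry traffic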

2. MLAG Interoperability

"Our top-of-rack switches are [vendor X] with [MLAG/VPC/SMLT] configured. Have you tested your platform's LACP bond negotiation against this specific MLAG implementation? What system ID does the MLAG pair present to the server? How do we verify that the server bond sees a single LACP partner, not two? What happens if the IST/peer link between the two switches fails -- does the server bond detect the failure, and how quickly?"

Purpose: Tests whether the vendor has actually validated MLAG interop, not just assumed it works. The answer must describe the MLAG System ID presentation, the expected LACP partner MAC in the bond status, and the IST failure behavior. Specific switch vendor/model testing is expected.

3. Single-Flow Bandwidth Limitation

"We have database replication jobs that can generate a single 80 Gbps flow between two hosts. Each host has a 2x 100 GbE LACP bond. What is the maximum throughput this single flow can achieve? Can your platform split a single flow across multiple links, or is a single flow limited to a single link's speed? How do we design around this limitation for our highest-bandwidth workloads?"

Purpose: Tests understanding of the fundamental hash-based distribution limitation. The correct answer is: a single flow is limited to a single link's speed (100 Gbps in this case). Mitigation strategies include multi-path storage protocols (e.g., Ceph's many parallel OSD connections) and application-level parallel TCP streams. RSS (Receive Side Scaling) helps the receiver spread distinct flows across CPU queues, but it cannot split a single connection across links.

4. ECMP Verification and Polarization

"Our spine-leaf fabric has 4 spines. How do we verify that traffic from your platform is actually distributed across all 4 spines via ECMP? How do we detect ECMP polarization? Have you tested overlay traffic (VXLAN/GENEVE) distribution across spines -- does the outer UDP source port provide sufficient entropy for the ECMP hash on our specific switch ASIC?"

Purpose: Tests awareness of ECMP verification methodology and overlay entropy. The vendor should describe how to check per-spine link utilization, mention the importance of the outer UDP source port for overlay ECMP, and reference testing methodology (e.g., iperf3 with multiple parallel streams between hosts on different leaf switches, monitoring per-spine counters).

5. LLDP Integration and Topology Validation

"After initial deployment, how can we programmatically verify that every server NIC is connected to the correct switch and port? Can we integrate LLDP data from all hosts into our DCIM system? If a cable is accidentally swapped (server A's NIC1 is now plugged into the wrong switch port), how quickly is this detected, and does the platform alert or prevent misconfigured connectivity?"

Purpose: Tests operational maturity around physical topology management. For OVE, expect a description of NMState LLDP in NodeNetworkState CRDs with automated validation. For Azure Local, expect Windows Admin Center LLDP display and Network ATC validation. For any vendor, the answer should go beyond "we support LLDP" to describe automated topology verification workflows.
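
As an illustration for the OVE case, a Python sketch that diffs LLDP neighbors reported in NodeNetworkState CRs against an expected cabling map. The JSON field paths and the TLV layout are assumptions to be verified against the deployed NMState version; the cabling map is site-specific:

  import json
  import subprocess

  # Expected cabling: (node, nic) -> (switch system-name, switch port). Site-specific.
  EXPECTED = {("worker-01", "ens1f0"): ("leaf-01", "Ethernet10")}

  raw = subprocess.run(["kubectl", "get", "nodenetworkstates", "-o", "json"],
                       capture_output=True, text=True, check=True).stdout
  for nns in json.loads(raw)["items"]:
      node = nns["metadata"]["name"]
      for iface in nns["status"]["currentState"].get("interfaces", []):
          for neighbor in iface.get("lldp", {}).get("neighbors", []):
              # Each neighbor is reported as a list of TLV dicts; flatten them.
              tlvs = {k: v for tlv in neighbor for k, v in tlv.items()}
              seen = (tlvs.get("system-name"), tlvs.get("port-id"))
              want = EXPECTED.get((node, iface["name"]))
              if want and seen != want:
                  print(f"MISCABLED {node}/{iface['name']}: saw {seen}, expected {want}")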

6. Failure Convergence Testing

"During the PoC, we will simulate the following failures and measure convergence time: (a) single NIC failure on a server, (b) single ToR switch power-off, (c) IST/peer-link failure between MLAG switches, (d) single spine switch failure. For each scenario, what is the expected packet loss duration? What monitoring should we have in place to measure actual convergence time? Are there any platform-specific behaviors that could extend convergence beyond what the physical network provides?"

Purpose: Establishes a concrete failure testing plan. The expected convergence times are: (a) < 1 second (MII monitoring), (b) < 3 seconds (LACP timeout), (c) switch-vendor-dependent (typically < 5 seconds for secondary shutdown), (d) < 5 seconds (ECMP reconvergence + resilient hashing). The vendor should identify any platform-layer convergence overhead (e.g., overlay tunnel re-establishment, BGP route withdrawal).

7. Physical NIC Requirements and Validated Hardware

"What are the minimum NIC specifications for your platform? Which NIC vendors and models are validated? Do you require specific firmware versions? How many physical NICs per server do you recommend for a deployment supporting 50 VMs per host with both overlay network traffic and HCI storage replication traffic? Does the NIC need to support specific offload features (checksum offload, TSO, GRO, tunnel offload for GENEVE/VXLAN)?"

Purpose: Reveals hardware constraints and validated configurations. OVE's NMState works with any Linux-supported NIC but has a validated hardware list. Azure Local has a stricter hardware compatibility list (HCL). NIC offload capabilities (especially tunnel offload for GENEVE) directly affect CPU overhead for overlay networking at 100 GbE speeds.

8. Day-2 Bond Reconfiguration

"Six months after go-live, we need to change the bonding hash policy from layer3+4 to encap3+4 on all 100 worker nodes because we have deployed an overlay network that is causing uneven link utilization with the current hash. Describe the process. Can this be done without VM downtime? Is it applied rolling (node by node) or simultaneously? How do we validate that the new hash policy is active on every node and that traffic distribution has improved?"

Purpose: Tests Day-2 operational maturity for a common network tuning operation. For OVE, expect: update the NodeNetworkConfigurationPolicy YAML, NMState applies it node-by-node with automatic rollback on failure, verify via cat /proc/net/bonding/bond0 on each node. For Azure Local, expect: update Network ATC intent via PowerShell, ATC reconfigures SET on each node. The key is that this should be non-disruptive and automated, not a manual per-node change.

9. Switch Vendor Neutrality

"Our current data center uses [Vendor A] switches. We are evaluating [Vendor B] for the new fabric. Does your platform have any dependency on or preference for a specific switch vendor? Are there switch features that your platform requires (e.g., specific MLAG implementation, specific ECMP hash capabilities, specific LLDP TLV support)? Have you tested your platform on both vendor's switches?"

Purpose: Validates that the platform does not have a hidden switch vendor dependency. OVE and Azure Local are switch-vendor-neutral in principle, but NMState LLDP parsing and Network ATC validation may behave differently with different switch vendors' LLDP TLV formats. The PoC must test with the actual target switch vendor.

10. Capacity Planning for Physical Links

"We plan to deploy 100 servers, each with 2x 25 GbE NICs for VM/overlay traffic and 2x 25 GbE NICs for storage replication. Our spine-leaf fabric has 4 spines with 100 GbE spine-leaf links. Is this sufficient for 5,000 VMs with an east-west-dominant traffic pattern? How do we model the oversubscription ratio? What is the maximum number of live migrations we can run simultaneously without saturating the VM traffic bonds? At what point should we consider upgrading to 100 GbE server NICs?"

Purpose: Tests capacity planning methodology. The answer should include oversubscription ratio calculation (leaf uplink bandwidth / server port bandwidth), ECMP bandwidth per leaf (4 spines x 100 GbE = 400 Gbps per leaf), live migration bandwidth estimation (VM memory size / target migration time = required bandwidth per migration), and a discussion of when 25 GbE becomes a bottleneck (typically when HCI storage replication + VM traffic + live migration exceed the 50 Gbps aggregate bond capacity per host).
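
The purpose above implies straightforward arithmetic. A worked Python sketch using the figures from the question; the servers-per-leaf count, VM memory size, and migration target time are illustrative assumptions:

  # Leaf uplink capacity: 4 spines x 100 GbE
  leaf_uplink_gbps = 4 * 100                               # 400 Gbps per leaf

  # Assumption: 16 servers per leaf, each with 2x25 GbE VM + 2x25 GbE storage
  server_edge_gbps = 16 * (2 * 25 + 2 * 25)                # 1600 Gbps server-facing
  print(f"oversubscription {server_edge_gbps / leaf_uplink_gbps:.1f}:1")   # 4.0:1

  # Live migration bandwidth: VM memory / target time (copy overhead ignored)
  vm_mem_gib, target_s = 64, 60                            # illustrative assumptions
  per_migration_gbps = vm_mem_gib * 8 / target_s           # ~8.5 Gbps per migration
  bond_gbps = 2 * 25
  print(f"{int(bond_gbps / per_migration_gbps)} concurrent migrations per 50 Gbps bond")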