Physical Design & Management
Why This Matters
The physical data center is the foundation that no software layer can abstract away. An overlay network can virtualize segments, a distributed firewall can enforce micro-segmentation, and an SDN controller can automate routing -- but all of this runs on physical switches, physical cables, physical racks, and physical power feeds. When any of those fail, the software layer fails with them. When any of those are poorly designed, the software layer performs poorly on top of them.
For an organization running 5,000+ VMs, the physical design determines three things that no amount of software can compensate for:
- Maximum available bandwidth. The number of spines, the speed of spine-leaf links, and the oversubscription ratio at the leaf tier set a hard ceiling on east-west throughput. If the fabric cannot deliver the aggregate bandwidth that 5,000 VMs demand, no tuning of overlay encapsulation or hash policies will help.
- Failure blast radius. The topology determines what breaks when a component fails. In a well-designed spine-leaf fabric, a single spine failure reduces total bandwidth by 1/N (where N is the number of spines) but causes zero connectivity loss. In a poorly designed three-tier fabric, a single aggregation switch failure can partition half the data center.
- Operational sustainability. The physical infrastructure must be tracked, maintained, and evolved over a 10-15 year lifecycle. Servers are racked and decommissioned. Cables are added and rerouted. Power and cooling capacity must be planned against actual consumption. Without systematic infrastructure management (DCIM), the organization accumulates technical debt in the physical layer that makes every platform migration harder, every capacity expansion slower, and every outage longer to diagnose.
This document covers two topics that bridge the gap between the logical networking concepts (covered in earlier documents) and the physical reality of the data center:
- Spine-Leaf Architecture -- the standard physical network topology for modern data centers, why it replaced three-tier designs, how to size it for 5,000+ VMs, and how it integrates with BGP-based routing.
- DCIM (Data Center Infrastructure Management) -- the discipline and tooling for tracking, monitoring, and planning the physical infrastructure that the platform runs on.
Both topics are directly relevant to the platform evaluation. OVE, Azure Local, and Swisscom ESC each have different physical requirements, hardware compatibility lists, and assumptions about the underlying fabric. Understanding spine-leaf design enables the team to evaluate switch vendor proposals critically. Understanding DCIM enables the team to plan the physical migration -- which racks will host the new platform, how much power is available, how many rack units are needed, and how the cabling will be organized.
Concepts
1. Spine-Leaf Architecture
Why Traditional Three-Tier Fails for Modern Workloads
The traditional data center network was built as a three-tier hierarchy: core, aggregation (or distribution), and access. This design was optimized for north-south traffic patterns -- clients outside the data center connecting to servers inside. Traffic flowed from the access layer up through aggregation to the core, and back down. The architecture worked well when the dominant traffic pattern was client-server (north-south) and each server was relatively independent.
Traditional Three-Tier Architecture:
+-------------------+
| Core Switch |
| (Layer 3 routing) |
+---+---+---+---+---+
| | | |
+-------------+ | | +-------------+
| | | |
+----v----+ +----v----+ +-----v---+
| Agg-1 |------| Agg-2 | | Agg-3 |
| (STP | | (STP | | |
| Active) | | Stdby) | | |
+--+---+--+ +--+--+--+ +--+---+--+
| | | | | |
+-----+ +-----+ +--+ +--+ +----+ +----+
| | | | | |
+--v--+ +--v--+ +--v--+ +--v--+ +--v--+ +--v--+
|Acc-1| |Acc-2| |Acc-3| |Acc-4| |Acc-5| |Acc-6|
+--+--+ +--+--+ +--+--+ +--+--+ +--+--+ +--+--+
| | | | | |
Servers Servers Servers Servers Servers Servers
Problems:
1. STP blocks redundant links --> wasted bandwidth
2. Agg-1 to Agg-2 is a bottleneck for east-west traffic
3. Server on Acc-1 to server on Acc-6: 5 hops (up-up-across-down-down)
4. Adding capacity requires forklift upgrades at higher tiers
5. Oversubscription compounds at each tier
Three specific failures of this design make it unsuitable for modern virtualized workloads:
1. Spanning Tree Protocol (STP) limits. In a three-tier design, redundant links between tiers create Layer-2 loops. STP prevents loops by blocking redundant links -- which means that 50% or more of the available bandwidth is unused in steady state. Rapid STP (RSTP) improves convergence time from 30-50 seconds to 1-5 seconds but does not solve the blocked-bandwidth problem. With 5,000+ VMs generating heavy east-west traffic, wasting half the fabric bandwidth is unacceptable.
2. East-west bottleneck at the aggregation tier. In a three-tier design, traffic between two servers on different access switches must traverse up to the aggregation layer and back down. If the servers are in different aggregation "pods," traffic must go all the way to the core. The aggregation-to-core links become the bottleneck for east-west traffic. Modern workloads (HCI storage replication, VM live migration, microservices communication) generate far more east-west traffic than north-south, often at ratios of 80:20 or higher.
3. Oversubscription compounds unpredictably. In a three-tier design, oversubscription is applied at each tier. If the access-to-aggregation oversubscription is 4:1 and the aggregation-to-core oversubscription is 3:1, the effective end-to-end oversubscription for cross-pod traffic is 12:1. At 5,000+ VMs, this means that aggregate demand can exceed available bandwidth by a factor of 12 during peak periods, causing congestion, packet loss, and application timeouts.
Spine-Leaf Topology: The Modern Standard
The spine-leaf topology eliminates all three problems by flattening the network into exactly two tiers. Every leaf switch connects to every spine switch. There are no leaf-to-leaf links and no spine-to-spine links. Every path from any leaf to any other leaf traverses exactly one spine -- a consistent 2-hop path.
Spine-Leaf Topology (4 Spines, 8 Leaves):
+--------+ +--------+ +--------+ +--------+
|Spine-1 | |Spine-2 | |Spine-3 | |Spine-4 |
+--++++--+ +--++++--+ +--++++--+ +--++++--+
|||| |||| |||| ||||
+--------+|||--------+|||--------+|||--------+|||---------+
| +------+||--------+|+|--------+||+--------+||------+ |
| | +----+|--------+||+--------+|+|--------+|+--+ | |
| | | +--+--------+|+|--------+||+--------++| | | |
| | | | |||| |||| || | | |
| | | | |||| |||| || | | |
+--v--v--v--v-+ +-------vvvv+ +----vvvv-----+ +v--v--v--v-+
| Leaf-1 | | Leaf-2 | | Leaf-3 | | Leaf-4 |
+------+------+ +-----+-----+ +------+------+ +-----+-----+
| | | |
Servers Servers Servers Servers
(Rack 1) (Rack 2) (Rack 3) (Rack 4)
... (Leaves 5-8 follow the same pattern)
Rules:
1. Every leaf connects to EVERY spine (full mesh between tiers)
2. No leaf-to-leaf direct links
3. No spine-to-spine direct links
4. Every leaf-to-leaf path is exactly 2 hops: leaf -> spine -> leaf
5. Traffic is ECMP-distributed across ALL spines (see document 03)
Key properties of spine-leaf:
| Property | Three-Tier | Spine-Leaf |
|---|---|---|
| Path length (worst case) | 5+ hops | Exactly 2 hops |
| Latency predictability | Variable (depends on path) | Consistent (always 2 hops) |
| STP dependency | Yes (blocks links) | No (all links active, L3 routed) |
| Bandwidth utilization | 50% (STP blocks) | 100% (all links active) |
| East-west bandwidth | Limited by aggregation tier | Scales linearly with spine count |
| Adding capacity | Forklift upgrades at higher tiers | Add another leaf (or spine) |
| Oversubscription | Compounds across tiers | Single, calculable ratio |
Design Math: Spines, Leaves, Oversubscription, Port Density
The fundamental sizing calculation for a spine-leaf fabric depends on four parameters:
- Server port speed -- the NIC speed on each server (typically 25 GbE or 100 GbE).
- Leaf downlink port count -- how many server-facing ports each leaf switch has.
- Leaf uplink speed -- the speed of each leaf-to-spine connection (typically 100 GbE or 400 GbE).
- Number of spines -- how many spine switches exist (determines total uplink bandwidth per leaf).
Oversubscription Ratio Calculation:
Oversubscription = Total Downlink Bandwidth / Total Uplink Bandwidth
Example:
Leaf switch: 48x 25 GbE downlinks (server ports) = 1,200 Gbps total downlink
Leaf switch: 6x 100 GbE uplinks (to spines) = 600 Gbps total uplink
Oversubscription ratio = 1,200 / 600 = 2:1
Interpretation:
At 2:1, if ALL servers transmit at line rate simultaneously,
only 50% of the traffic can be forwarded to the spine tier.
The other 50% is queued or dropped.
In practice, not all servers transmit simultaneously at line
rate. 2:1 to 3:1 is standard for general-purpose compute.
1:1 (non-oversubscribed) is required for HCI storage traffic.
Common Oversubscription Targets:
+------------------+-------------------------------+
| Workload Type | Acceptable Oversubscription |
+------------------+-------------------------------+
| HCI storage | 1:1 (no oversubscription) |
| Live migration | 2:1 or better |
| General VM | 3:1 |
| Desktop VDI | 4:1 to 6:1 |
| Dev/test | 6:1 to 10:1 |
+------------------+-------------------------------+
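The arithmetic above is easy to script as a design check. A minimal sketch in Python, using the port counts from the example and the workload targets from the table (all values here are illustrative assumptions, not vendor data):

```python
# Minimal sketch: leaf oversubscription check against per-workload targets.
# Port counts and targets are the illustrative values from this section.

TARGETS = {            # acceptable downlink:uplink ratios from the table above
    "hci_storage": 1.0,
    "live_migration": 2.0,
    "general_vm": 3.0,
    "vdi": 6.0,
    "dev_test": 10.0,
}

def oversubscription(downlinks: int, downlink_gbps: int,
                     uplinks: int, uplink_gbps: int) -> float:
    """Downlink:uplink oversubscription ratio for one leaf switch."""
    return (downlinks * downlink_gbps) / (uplinks * uplink_gbps)

ratio = oversubscription(downlinks=48, downlink_gbps=25, uplinks=6, uplink_gbps=100)
print(f"Oversubscription: {ratio:.1f}:1")                     # 2.0:1

for workload, target in TARGETS.items():
    verdict = "OK" if ratio <= target else "TOO HIGH"
    print(f"  {workload:<15} target {target:>4.1f}:1 -> {verdict}")
```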
Port density and spine count relationship:
The number of spines is limited by the number of uplink ports on each leaf switch. If a leaf switch has 6 uplink ports, you can have at most 6 spines. The number of leaves is limited by the number of downlink ports on each spine switch. If a spine switch has 32 downlink ports, you can have at most 32 leaves.
Spine-Leaf Scaling Constraints:
Max Spines = Leaf Uplink Port Count
Max Leaves = Spine Downlink Port Count
Example with a common switch platform:
Leaf: 48x 25G downlinks + 8x 100G uplinks
Spine: 32x 100G ports (all used for leaf connections)
Max spines: 8
Max leaves: 32
Max servers: 32 leaves x 48 ports = 1,536 servers
Max VMs (at 50 VMs/server): 76,800 VMs
For higher density:
Leaf: 48x 25G downlinks + 8x 400G uplinks
Spine: 64x 400G ports
Max spines: 8
Max leaves: 64
Max servers: 64 x 48 = 3,072 servers
Max VMs: 153,600 VMs
Note: with MLAG dual-homing (2 leaf switches per rack), each server
bond consumes one downlink port on each leaf of the pair, so the
single-homed server counts above are halved. The server sees one
logical switch, but each physical leaf keeps its own uplinks
to all spines.
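The port-density ceilings reduce to a few multiplications. A small sketch with the assumed port counts from the example above:

```python
# Minimal sketch: scaling ceilings for a single-tier spine-leaf fabric.
# Port counts and VM density are the assumed values from the example above.

def fabric_ceiling(leaf_uplink_ports: int, spine_ports: int,
                   leaf_server_ports: int, vms_per_server: int) -> dict:
    """Maximum spines, leaves, servers, and VMs a single fabric can reach."""
    max_spines = leaf_uplink_ports        # one leaf uplink per spine
    max_leaves = spine_ports              # one spine port per leaf
    max_servers = max_leaves * leaf_server_ports   # single-homed servers
    return {"max_spines": max_spines, "max_leaves": max_leaves,
            "max_servers": max_servers, "max_vms": max_servers * vms_per_server}

# 48x 25G + 8x 100G leaf, 32x 100G spine, 50 VMs per server
print(fabric_ceiling(leaf_uplink_ports=8, spine_ports=32,
                     leaf_server_ports=48, vms_per_server=50))
# {'max_spines': 8, 'max_leaves': 32, 'max_servers': 1536, 'max_vms': 76800}
```

As noted above, MLAG dual-homing halves the effective server count per leaf pair.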
Full Bisection Bandwidth
Full bisection bandwidth means that if you split all the servers in the fabric into two equal halves, the aggregate bandwidth between the two halves equals the aggregate bandwidth of all server ports on one half. In other words, the fabric never becomes the bottleneck -- the spine tier can carry as much traffic as the servers can generate.
Full Bisection Bandwidth:
Example: 8 leaves, each with 48x 25G = 1,200 Gbps downlink
Total server bandwidth (all 8 leaves):
8 x 1,200 Gbps = 9,600 Gbps
For full bisection, the fabric must be able to carry
9,600 / 2 = 4,800 Gbps across any balanced cut of the servers.
In a two-tier spine-leaf, the binding constraint is per leaf:
each leaf needs as much uplink bandwidth as downlink bandwidth
(1:1 oversubscription).
Required per leaf: uplink >= 1,200 Gbps = 12x 100G = 12 spines
With 6 spines and 6x 100G uplinks per leaf:
Each leaf: 6 x 100G = 600 Gbps uplink
Oversubscription: 1,200 / 600 = 2:1 --> only half bisection bandwidth
This is expensive. In practice, 2:1 to 3:1 is used,
accepting that worst-case simultaneous load is throttled.
When is 1:1 justified?
- HCI clusters where every write is replicated to 2-3 other nodes
- Financial trading systems with strict latency requirements
- Large-scale machine learning training with all-to-all communication
- Environments where east-west traffic consistently exceeds 50%
of total server bandwidth
Failure Domains
Spine-leaf provides graceful degradation under failure, not catastrophic failure:
Spine Failure:
+--------+ +--------+ +--------+ +--------+
|Spine-1 | |Spine-2 | |SPINE-3 | |Spine-4 |
| (OK) | | (OK) | | (DEAD) | | (OK) |
+--------+ +--------+ +--------+ +--------+
Impact:
- ALL leaf switches lose 1 of 4 uplinks (25% bandwidth reduction)
- NO leaf switch loses connectivity (remaining 3 spines carry traffic)
- ECMP reconverges in <5 seconds (BGP withdrawal + resilient hashing)
- Traffic on the failed spine's paths is rehashed to surviving spines
- With resilient ECMP hashing, only flows on the dead spine are moved
Total bandwidth reduction: 25% (from 4 spines to 3)
Connectivity loss: NONE
VM impact: Brief packet loss (<5s) for flows that were on Spine-3
Leaf Failure:
+--------+ +--------+ +--------+ +--------+
|Spine-1 | |Spine-2 | |Spine-3 | |Spine-4 |
+--------+ +--------+ +--------+ +--------+
| | | |
+---v---+ +----v---+ +---v---+ +---v---+
|Leaf-1 | |Leaf-2 | |LEAF-3 | |Leaf-4 |
| (OK) | | (OK) | |(DEAD) | | (OK) |
+-------+ +--------+ +-------+ +-------+
| | | |
Servers Servers Servers Servers
(OK) (OK) (ISOLATED) (OK)
Impact:
- ONLY servers connected to Leaf-3 lose connectivity
- All other leaf switches and their servers are unaffected
- With MLAG: each rack has 2 leaf switches. If one fails,
servers fail over to the second leaf (see document 03, SMLT section).
No servers are isolated.
Total bandwidth reduction: 1/N_leaves (e.g., 12.5% with 8 leaves)
Connectivity loss: Only servers on the dead leaf (or zero with MLAG)
VM impact: With MLAG, <3 seconds failover via LACP
Layer-3 Spine-Leaf with BGP
Modern spine-leaf fabrics are fully routed -- every link between a leaf and a spine is a Layer-3 routed link, not a Layer-2 switched link. This eliminates all Layer-2 concerns (STP, broadcast storms, MAC table overflow) from the fabric. The routing protocol used is almost universally eBGP (external BGP), with each switch running its own autonomous system number (ASN).
BGP ASN Allocation in a Spine-Leaf Fabric:
Spine Tier (unique ASN per spine)
+----------+ +----------+ +----------+ +----------+
|Spine-1 | |Spine-2 | |Spine-3 | |Spine-4 |
|AS 65000 | |AS 65001 | |AS 65002 | |AS 65003 |
+--+--+--+-+ +--+--+--+-+ +--+--+--+-+ +--+--+--+-+
| | | | | | | | | | | |
| | +--------+ | +--------+ | +--------+ | |
| +----+---------+----+---------+----+ | |
+----+ | | +----+ | |
| | | | | |
+--v--v--+ +---v---+ +---v---v--+
|Leaf-1 | |Leaf-2 | |Leaf-3 |
|AS 65010| |AS 65011| |AS 65012 |
+--------+ +--------+ +----------+
Leaf Tier (unique ASN per leaf)
ASN Allocation Scheme (RFC 6996 private ASN range: 64512-65534):
+------------------+----------------------------------+
| Role | ASN Assignment |
+------------------+----------------------------------+
| Spines | 65000 - 65003 (4 spines) |
| Leaves | 65010 - 65041 (up to 32 leaves) |
+------------------+----------------------------------+
Alternative: Use 4-byte private ASNs (4200000000 - 4294967294)
for larger fabrics. FRRouting and all modern switch NOSes support
4-byte ASNs.
eBGP Peering Rules:
1. Every leaf peers with EVERY spine (full mesh between tiers)
2. No leaf-to-leaf BGP sessions
3. No spine-to-spine BGP sessions
4. Each leaf advertises its connected subnets (server networks)
5. Each spine re-advertises routes learned from one leaf to all other leaves (standard eBGP behavior, no route reflectors needed)
6. ECMP is automatic: leaf sees N equal-length AS paths (one per spine)
With unnumbered interfaces (IPv6 link-local peering):
+----------+ +----------+
| Leaf-1 | | Spine-1 |
| AS 65010 | | AS 65000 |
| | | |
| Eth49 |---- 100G link ---| Eth1 |
| fe80::1 | | fe80::2 |
| (no IPv4)| | (no IPv4)|
+----------+ +----------+
BGP session established over IPv6 link-local addresses.
IPv4 routes advertised via MP-BGP with IPv6 next-hop.
No /31 point-to-point subnets needed on inter-switch links.
Arista EOS configuration example (Leaf-1):
router bgp 65010
router-id 10.255.255.10 ! Loopback address
no bgp default ipv4-unicast
maximum-paths 4 ! ECMP across all 4 spines
neighbor SPINES peer group
neighbor SPINES remote-as-range 65000-65003
neighbor Ethernet49 peer group SPINES
neighbor Ethernet49 interface ! Unnumbered
neighbor Ethernet50 peer group SPINES
neighbor Ethernet50 interface
neighbor Ethernet51 peer group SPINES
neighbor Ethernet51 interface
neighbor Ethernet52 peer group SPINES
neighbor Ethernet52 interface
address-family ipv4 unicast
neighbor SPINES activate
redistribute connected ! Advertise server subnets
FRRouting configuration example (for Cumulus Linux / SONiC):
router bgp 65010
bgp router-id 10.255.255.10
bgp bestpath as-path multipath-relax
neighbor fabric peer-group
neighbor fabric remote-as external
neighbor swp49 interface peer-group fabric
neighbor swp50 interface peer-group fabric
neighbor swp51 interface peer-group fabric
neighbor swp52 interface peer-group fabric
address-family ipv4 unicast
redistribute connected
maximum-paths 4
Why eBGP and not OSPF or IS-IS?
| Criteria | eBGP | OSPF / IS-IS |
|---|---|---|
| Multi-path support | Native with maximum-paths | ECMP supported, but equal-cost metric management is more complex |
| ASN-based policy control | Yes -- per-switch ASN allows granular route filtering | No -- area-based, less granular |
| Convergence speed | Sub-second with BFD | Sub-second with BFD (comparable) |
| Operational complexity | Higher initial setup, but simpler at scale (no areas, no LSA flooding storms) | Simpler initial setup, but flooding storms at scale (hundreds of switches) |
| Vendor support for DC fabric | Universal -- every DC switch vendor supports eBGP as the primary fabric protocol | Supported but not the recommended approach for most vendors |
| Unnumbered interface support | Yes (IPv6 link-local) | Yes (OSPF unnumbered) but less common in practice |
| Multi-vendor compatibility | Excellent -- eBGP is the internet routing protocol | Good, but OSPF/IS-IS interop quirks between vendors exist |
The data center networking industry has converged on eBGP as the standard for spine-leaf fabrics. RFC 7938 ("Use of BGP for Routing in Large-Scale Data Centers") documents this practice and provides design guidance. All major switch vendors (Arista, Cisco, Juniper, Dell, NVIDIA/Mellanox) provide reference designs based on eBGP spine-leaf.
Multi-Tier Spine-Leaf (Super-Spine) for Very Large Deployments
When a single spine-leaf fabric is not large enough -- because the spine switches run out of ports to connect all the leaves -- the solution is a multi-tier (5-stage) Clos topology. A layer of "super-spine" switches connects multiple spine-leaf "pods."
Multi-Tier Spine-Leaf (5-Stage Clos) with Super-Spines:
+----------+ +----------+ +----------+
|Super- | |Super- | |Super- |
|Spine-1 | |Spine-2 | |Spine-3 |
|AS 64999 | |AS 64998 | |AS 64997 |
+--++--++--+ +--++--++--+ +--++--++--+
|| || || || || ||
+-------------+| |+-------+| |+--------+| ||
| +----------+ +|-------+| || +------+ ||
| | || | || | ||
+----v---v----+ +----v---v----+ +--v-----v----+
| Pod A | | Pod B | | Pod C |
| | | | | |
| +---------+ | | +---------+ | | +---------+ |
| |Spine-A1 | | | |Spine-B1 | | | |Spine-C1 | |
| |AS 65000 | | | |AS 65100 | | | |AS 65200 | |
| +---------+ | | +---------+ | | +---------+ |
| |Spine-A2 | | | |Spine-B2 | | | |Spine-C2 | |
| |AS 65001 | | | |AS 65101 | | | |AS 65201 | |
| +---------+ | | +---------+ | | +---------+ |
| | | | | |
| Leaf Leaf | | Leaf Leaf | | Leaf Leaf |
| -A1 -A2 ...| | -B1 -B2 ..| | -C1 -C2 ..|
| 65010 65011 | | 65110 65111| | 65210 65211|
+---+----+----+ +---+----+---+ +---+----+---+
| | | | | |
Servers Servers Servers
When is this needed?
- Single spine switch has 32 ports, so max 32 leaves per pod
- Each pod: 32 leaves x 48 servers = 1,536 servers
- 3 pods: 4,608 servers (far beyond what a ~5,000 VM environment needs; shown to illustrate the scale at which super-spines become relevant)
- Super-spines connect pods together for cross-pod traffic
Path for cross-pod traffic (Pod A to Pod C):
Leaf-A1 -> Spine-A1 -> Super-Spine-1 -> Spine-C1 -> Leaf-C1
= 4 hops (vs. 2 hops within a pod)
ASN scheme for multi-tier:
+------------------+----------------------------------+
| Role | ASN Assignment |
+------------------+----------------------------------+
| Super-Spines | 64997 - 64999 |
| Pod A Spines | 65000 - 65001 |
| Pod A Leaves | 65010 - 65041 |
| Pod B Spines | 65100 - 65101 |
| Pod B Leaves | 65110 - 65141 |
| Pod C Spines | 65200 - 65201 |
| Pod C Leaves | 65210 - 65241 |
+------------------+----------------------------------+
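The ASN plan is mechanical enough to generate rather than maintain by hand. A sketch that reproduces the table above (the numbering convention is this document's, not a standard):

```python
# Minimal sketch: generate the pod/super-spine ASN plan shown above.
# Convention (from this document, not a standard): pod N spines start at
# 65000 + N*100, pod N leaves at 65010 + N*100, super-spines count down from 64999.

def asn_plan(num_pods: int, spines_per_pod: int, leaves_per_pod: int,
             num_super_spines: int) -> dict:
    plan = {"super_spines": [64999 - i for i in range(num_super_spines)]}
    for pod in range(num_pods):
        base = 65000 + pod * 100
        plan[f"pod_{chr(ord('A') + pod)}"] = {
            "spines": [base + i for i in range(spines_per_pod)],
            "leaves": [base + 10 + i for i in range(leaves_per_pod)],
        }
    return plan

print(asn_plan(num_pods=3, spines_per_pod=2, leaves_per_pod=4, num_super_spines=3))
# pod_A spines [65000, 65001], leaves [65010, ...]; super-spines [64999, 64998, 64997]
```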
For most enterprises running 5,000+ VMs, a single spine-leaf tier (the two-tier leaf-spine fabric described above) is sufficient. Super-spine architectures are typically needed only when the server count exceeds what a single spine-leaf pod can accommodate (typically 1,500-3,000 servers per pod, depending on switch port density). However, the organization should understand the super-spine option because:
- Multiple data centers or data center halls may each have their own pod, connected via super-spines.
- A phased migration (old VMware fabric in Pod A, new OVE/Azure Local fabric in Pod B) may use super-spines for cross-pod connectivity during the transition period.
Cabling: Fiber Types, Breakout Cables, and Structured Cabling
The physical cabling in a spine-leaf fabric is one of the highest-cost and hardest-to-change components. Getting it right during initial deployment avoids years of operational pain.
Fiber Types for Data Center Cabling:
+--------+---------------------+----------------+-------------------+
| Type   | Core Size           | Max Distance   | Common Use        |
+--------+---------------------+----------------+-------------------+
| OM3    | 50 um multimode     | 300m @ 10G     | Short runs,       |
|        |                     | 100m @ 40G     | intra-rack        |
|        |                     | 70m @ 100G     |                   |
+--------+---------------------+----------------+-------------------+
| OM4    | 50 um multimode     | 400m @ 10G     | Intra-building,   |
|        | (laser-optimized)   | 150m @ 40G     | ToR to spine      |
|        |                     | 100m @ 100G    |                   |
+--------+---------------------+----------------+-------------------+
| OM5    | 50 um multimode     | As OM4 for -SR | Future-proofing   |
|        | (wideband)          | optics;        | for SWDM/WDM      |
|        |                     | 150m @ 100G    | over multimode    |
|        |                     | (SWDM4)        |                   |
+--------+---------------------+----------------+-------------------+
| OS2    | 9 um single-mode    | 10 km+ @ any   | Inter-building,   |
|        |                     | speed          | long runs, DCI    |
|        |                     |                | (Data Center      |
|        |                     |                | Interconnect)     |
+--------+---------------------+----------------+-------------------+
Decision Guide:
- Server to ToR leaf (< 10m): OM3/OM4 or DAC (Direct Attach Copper)
- Leaf to spine (< 100m): OM4 multimode or OS2 single-mode
- Spine to super-spine (< 500m): OS2 single-mode
- Building to building: OS2 single-mode
DAC (Direct Attach Copper) and AOC cables:
- Passive DAC: up to 5m @ 25G (typically 3m @ 100G). Cheapest option for short runs.
- Active optical cables (AOC): up to 30m @ 25G/100G. More expensive than passive DAC, but like DAC they ship with integrated ends, so no separate transceivers are needed.
- Use for server-to-leaf connections within the same rack or adjacent racks.
Breakout cables are used to split a single high-speed port into multiple lower-speed connections. This is critical for cost-effective spine-leaf designs:
Breakout Cable Configurations:
100G QSFP28 --> 4x 25G SFP28 (4x breakout)
+----+ +----+
|100G| -- breakout ---> |25G | (to server 1)
|port| |25G | (to server 2)
| | |25G | (to server 3)
| | |25G | (to server 4)
+----+ +----+
400G QSFP-DD --> 4x 100G QSFP28 (4x breakout)
+----+ +-----+
|400G| -- breakout ---> |100G | (to leaf 1)
|port| |100G | (to leaf 2)
| | |100G | (to leaf 3)
| | |100G | (to leaf 4)
+----+ +-----+
Why breakouts matter:
A spine switch with 32x 400G ports can serve:
- 32 leaf switches at 400G each, OR
- 128 leaf switches at 100G each (via 4x breakout on every port)
This dramatically increases the number of leaves a single spine can support
and often makes a super-spine architecture unnecessary for medium deployments.
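The same arithmetic as a quick check, assuming a 32x 400G spine with 4x 100G breakouts:

```python
# Minimal sketch: leaf count a spine can serve, with and without breakouts.

def max_leaves(spine_ports: int, breakout_factor: int = 1) -> int:
    """Each leaf consumes one (possibly broken-out) spine-facing link."""
    return spine_ports * breakout_factor

print(max_leaves(32))                     # 32 leaves at native 400G
print(max_leaves(32, breakout_factor=4))  # 128 leaves at 100G via 4x breakout
```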
Structured cabling standards:
The physical cabling must follow a structured approach to remain manageable over a 10-15 year lifecycle:
- TIA-942 (Telecommunications Infrastructure Standard for Data Centers) defines the structured cabling model: entrance room, main distribution area (MDA), horizontal distribution area (HDA), equipment distribution area (EDA), and zone distribution area (ZDA).
- End-of-row (EoR) vs. Top-of-Rack (ToR): ToR places leaf switches in every rack (shortest cable runs to servers). EoR places switches at the end of each row (fewer switches, longer cable runs). ToR is the dominant model for spine-leaf because it minimizes cable length and simplifies troubleshooting.
- Patch panels and fiber trunks: Use pre-terminated fiber trunk cables between patch panels (MTP/MPO connectors) for clean, repeatable cabling between racks. Individual fiber patch cables connect from the patch panel to the switch/server ports.
Structured Cabling Example (ToR Model):
Spine Rack (Middle of Row)
+-------------------------+
| Spine-1 Spine-2 |
| Spine-3 Spine-4 |
| |
| Top: Fiber patch panels |
| MTP/MPO trunks to |
| each leaf rack |
+-------------------------+
| | | | | |
MTP trunk cables (pre-terminated, 12/24 fibers each)
| | | | | |
+-----+ | | | +--+-------+
| | | | |
v v v v v
Rack 1 Rack 2 Rack 3 Rack N
+------+ +------+ +------+ +------+
|Patch | |Patch | |Patch | |Patch |
|Panel | |Panel | |Panel | |Panel |
|------| |------| |------| |------|
|Leaf | |Leaf | |Leaf | |Leaf |
|Sw-1A | |Sw-2A | |Sw-3A | |Sw-NA |
|Leaf | |Leaf | |Leaf | |Leaf |
|Sw-1B | |Sw-2B | |Sw-3B | |Sw-NB |
|(MLAG)| |(MLAG)| |(MLAG)| |(MLAG)|
|------| |------| |------| |------|
|Srv 1 | |Srv 1 | |Srv 1 | |Srv 1 |
|Srv 2 | |Srv 2 | |Srv 2 | |Srv 2 |
|... | |... | |... | |... |
|Srv N | |Srv N | |Srv N | |Srv N |
+------+ +------+ +------+ +------+
Each rack has:
- 2 leaf switches (MLAG pair) at the top
- Fiber patch panel above the switches
- Servers below, cabled to both leaf switches via DAC or short fiber
- MTP/MPO trunk cables run from patch panel to spine rack
Border Leaf / Services Leaf for North-South Traffic
Not all traffic stays within the spine-leaf fabric. North-south traffic (to/from the internet, WAN, partner networks) enters and exits through dedicated "border leaf" or "services leaf" switches. These leaf switches connect to firewalls, load balancers, WAN routers, and internet edge routers.
Border Leaf / Services Leaf Placement:
+--------+ +--------+ +--------+ +--------+
|Spine-1 | |Spine-2 | |Spine-3 | |Spine-4 |
+--------+ +--------+ +--------+ +--------+
|||| |||| |||| ||||
+--------+||+--------+||+--------+||+--------+||+--------+
| || || || || |
+--v--+ +--v--+ +---v--+ +---v--+ +----v--+ +---v---+
|Leaf | |Leaf | |Leaf | |Leaf | |Border | |Border |
| -1 | | -2 | | -3 | | -4 | |Leaf-1 | |Leaf-2 |
+--+--+ +--+--+ +--+---+ +--+---+ +---+---+ +---+---+
| | | | | |
Servers Servers Servers Servers +---v---+ +---v---+
|Firewall| |WAN |
|Cluster | |Router |
+---+---+ +---+---+
| |
+---v---+ |
|Load | Internet
|Balancer| / WAN
+-------+
Design principles:
1. Border leaves are just normal leaf switches connected to the spines
2. They host firewalls, LBs, and WAN routers as "servers" (downlinks)
3. Traffic from any server leaf to the border leaf traverses the spine
tier (2 hops) -- same as any other east-west traffic
4. Firewalls/LBs connect to border leaves via MLAG bonds, same as servers
5. Redundancy: at least 2 border leaves (MLAG pair) for each service type
Spine-Leaf Sizing for 5,000+ VMs: Worked Example
The following worked example sizes a spine-leaf fabric for the organization's target of 5,000+ VMs. This is a concrete calculation with real port counts and oversubscription ratios.
=================================================================
SPINE-LEAF SIZING: 5,000 VMs
=================================================================
STEP 1: Server Count
--------------------------------------------------------------
Target VMs: 5,000
VMs per server: 50 (typical for 2-socket server with
512 GB RAM, 64 cores)
Required servers: 5,000 / 50 = 100 servers
Add 20% headroom for failover capacity (N+1 within cluster):
Servers with headroom: 120 servers
Add management/infra nodes (OVE control plane, monitoring, etc.):
Infrastructure nodes: 10 (3 control plane + 3 infra + 4 misc)
Total servers: 130
STEP 2: NIC Configuration per Server
--------------------------------------------------------------
NICs per server: 2x dual-port 25 GbE NICs = 4 ports total
Traffic separation:
- 2x 25G ports: VM/overlay traffic (LACP bond = 50 Gbps)
- 2x 25G ports: Storage replication (LACP bond = 50 Gbps)
Total switch ports per server: 4 (but with MLAG, each port
connects to one of two leaf switches in the MLAG pair)
Alternative for higher bandwidth:
- 2x dual-port 100 GbE NICs = 4 ports total
- 2x 100G: VM/overlay (bond = 200 Gbps)
- 2x 100G: Storage (bond = 200 Gbps)
This example uses 25 GbE NICs (common, cost-effective).
STEP 3: Leaf Switch Selection and Rack Design
--------------------------------------------------------------
Leaf switch model (example): 48x 25G + 8x 100G
- 48 downlink ports @ 25 GbE (server-facing)
- 8 uplink ports @ 100 GbE (spine-facing)
MLAG pair per rack: 2 leaf switches
Usable server ports per rack (MLAG pair):
Each leaf has 48 ports. With MLAG, each server connects one
port to Leaf-A and one port to Leaf-B.
Servers per rack (VM/overlay bond): 48 ports / 1 port per server
= 48 servers per MLAG pair
(if using only one bond)
With 2 bonds (VM + storage): each server uses 2 ports per leaf
Servers per rack: 48 / 2 = 24 servers per MLAG pair
Racks required: 130 servers / 24 servers per rack = 6 racks
(round up to 6, with some spare ports)
Physical rack layout (each rack):
+-------------------------------------------+
| U42 | Patch Panel (fiber to spines) |
| U41 | Leaf Switch A (MLAG primary) |
| U40 | Leaf Switch B (MLAG secondary) |
| U39 | 1U cable management |
| U38 | Server 1 (2U) |
| U36 | Server 2 (2U) |
| U34 | Server 3 (2U) |
| ... | ... |
| U02 | Server 18 (2U) |
| U01 | (empty / PDU space) |
+-------------------------------------------+
With 2U servers: 18-20 servers per 42U rack
(leaves headroom for cable management, PDU, patch panels)
Adjusted: 130 servers / 20 servers per rack = 7 racks
Leaf switch pairs: 7 racks x 2 leaves = 14 leaf switches total
STEP 4: Spine Count and Oversubscription
--------------------------------------------------------------
Each leaf has 8x 100G uplinks to spines.
Reserve 2 uplinks for future expansion.
Active uplinks per leaf: 6 (connecting to 6 spine switches)
Total uplink bandwidth per leaf: 6 x 100 Gbps = 600 Gbps
Total downlink bandwidth per leaf: 48 x 25 Gbps = 1,200 Gbps
Oversubscription ratio: 1,200 / 600 = 2:1
Is 2:1 acceptable?
- For VM/overlay traffic: Yes (general-purpose compute)
- For HCI storage traffic: Marginal (1:1 is the usual target; see the oversubscription table above)
- With separate storage bond on separate leaf ports, storage
traffic has its own oversubscription calculation
Spine switch model (example): 32x 100G ports
Each spine connects to all 14 leaf switches:
14 ports used out of 32 = 44% utilization
18 ports available for future leaf expansion
Can support up to 32 leaves (32 racks) without adding spines.
32 racks x 20 servers = 640 servers = 32,000 VMs at 50 VMs/server.
STEP 5: Total Fabric Bandwidth
--------------------------------------------------------------
East-west bisection bandwidth:
14 leaves x 600 Gbps uplink = 8,400 Gbps total uplink
Bisection = 8,400 / 2 = 4,200 Gbps
Per-server east-west bandwidth (worst case, all talking):
4,200 Gbps / 130 servers = ~32 Gbps per server
(vs. 50 Gbps bond capacity -- headroom exists)
Per-VM bandwidth (fair share):
4,200 Gbps / 5,000 VMs = ~840 Mbps per VM
(sufficient for most workloads; bursty traffic is smoothed
by statistical multiplexing across 5,000 VMs)
STEP 6: Summary Bill of Materials
--------------------------------------------------------------
+--------------------+-------+----------------------------------+
| Component | Count | Specification |
+--------------------+-------+----------------------------------+
| Spine switches | 6 | 32x 100G (e.g., Arista 7050X3, |
| | | Cisco N9K-C9332D, Dell Z9332F) |
+--------------------+-------+----------------------------------+
| Leaf switches | 14 | 48x 25G + 8x 100G (e.g., Arista |
| | | 7050SX3-48YC8, Cisco N9K-C93180, |
| | | Dell S5248F-ON) |
+--------------------+-------+----------------------------------+
| Border leaf | 2 | Same model as leaf (MLAG pair |
| switches | | for firewall/LB/WAN connections) |
+--------------------+-------+----------------------------------+
| Server NICs | 130 | 2x dual-port 25G (e.g., |
| | | Mellanox CX-6 Lx, Intel E810) |
+--------------------+-------+----------------------------------+
| Spine-leaf fiber | 84 | 100G QSFP28-QSFP28 OM4 (14 |
| | | leaves x 6 spines = 84 links) |
+--------------------+-------+----------------------------------+
| Server-leaf DAC | 520 | 25G SFP28 DAC/AOC, 1-3m |
| | | (130 servers x 4 ports) |
+--------------------+-------+----------------------------------+
| Server racks | 7 | 42U with dual PDU, 15-20 kW |
+--------------------+-------+----------------------------------+
| Spine/mgmt rack | 1 | 42U for 6 spines + 2 border |
| | | leaves + management switches |
+--------------------+-------+----------------------------------+
| Total racks | 8 | 7 compute + 1 spine/network |
+--------------------+-------+----------------------------------+
Total switch ports consumed:
Spine: 6 x 14 = 84 ports (of 192 available)
Leaf: 14 x (48 + 6) = 756 ports (48 down + 6 up per leaf)
Border: 2 x ~10 = 20 ports (for FW/LB/WAN connections)
=================================================================
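The worked example is worth keeping as a small calculator so the team can re-run Steps 1-5 with different assumptions (VM density, servers per rack, uplink count). The defaults below are the values used above; nothing here is vendor-specific:

```python
# Minimal sketch: re-run the Steps 1-5 sizing with different assumptions.
import math

def size_fabric(target_vms=5000, vms_per_server=50, headroom=0.20, infra_nodes=10,
                servers_per_rack=20, leaf_downlinks=48, downlink_gbps=25,
                uplinks_per_leaf=6, uplink_gbps=100):
    # Step 1: server count (failover headroom + infrastructure nodes)
    servers = math.ceil(target_vms / vms_per_server)
    servers = math.ceil(servers * (1 + headroom)) + infra_nodes
    # Step 3: racks and leaves (one MLAG pair of leaves per rack)
    racks = math.ceil(servers / servers_per_rack)
    leaves = racks * 2
    # Step 4: per-leaf oversubscription
    downlink_total = leaf_downlinks * downlink_gbps
    uplink_total = uplinks_per_leaf * uplink_gbps
    # Step 5: bisection bandwidth and per-VM fair share
    bisection_gbps = leaves * uplink_total / 2
    return {"servers": servers, "racks": racks, "leaf_switches": leaves,
            "spine_switches": uplinks_per_leaf,
            "oversubscription": f"{downlink_total / uplink_total:.1f}:1",
            "bisection_gbps": bisection_gbps,
            "per_vm_mbps": round(bisection_gbps * 1000 / target_vms)}

print(size_fabric())
# 130 servers, 7 racks, 14 leaves, 6 spines, 2.0:1, 4200 Gbps, 840 Mbps per VM
```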
Reference Designs from Switch Vendors
All major switch vendors publish spine-leaf reference architectures. When evaluating switch vendors for this deployment, request the following from each:
| Vendor | Reference Design | Key Switch Models (2024-2026) |
|---|---|---|
| Arista | Arista Validated Design (AVD) for Data Center, CloudVision as fabric manager | 7050X3 (spine), 7050SX3 (leaf), 7060X (high-density spine) |
| Cisco | Cisco ACI (application-centric) or NDFC (Nexus Dashboard Fabric Controller) for VXLAN-EVPN | Nexus 9336C-FX2 (spine), Nexus 93180YC-FX3 (leaf) |
| Dell | Dell SmartFabric with OS10 | Z9332F-ON (spine), S5248F-ON (leaf) |
| Juniper | Juniper Apstra for intent-based fabric management | QFX5130 (spine), QFX5120 (leaf) |
| NVIDIA/Mellanox | Spectrum-based fabric with Cumulus Linux or SONiC | SN4700 (spine), SN3700 (leaf) |
When evaluating, ask each vendor to map their reference design to the sizing example above (130 servers, 14 leaves, 6 spines, 2:1 oversubscription) and demonstrate:
- MLAG configuration with LACP to Linux bonding (OVE) and SET (Azure Local)
- ECMP with eBGP and unnumbered interfaces
- Resilient ECMP hashing
- LLDP interoperability with NMState (OVE) and Network ATC (Azure Local)
- Breakout cable support for future 100G server NIC migration
2. DCIM (Data Center Infrastructure Management)
What DCIM Covers
DCIM is the practice and tooling for managing the physical infrastructure of a data center. It bridges the gap between the facilities world (power, cooling, physical space) and the IT world (servers, switches, cables). In a traditional organization, these two worlds are managed by different teams with different tools. DCIM integrates them into a single system of record.
DCIM Scope:
+------------------------------------------------------------------+
| DCIM System |
| |
| +------------------+ +------------------+ +------------------+ |
| | Asset Management | | Power Monitoring | | Capacity | |
| | - Server serial | | - Per-PDU watts | | Planning | |
| | numbers | | - Per-rack power | | - Rack unit | |
| | - Switch models | | consumption | | availability | |
| | - NIC MACs | | - Phase balance | | - Power budget | |
| | - Firmware ver. | | - UPS capacity | | remaining | |
| | - Warranty dates | | - PUE tracking | | - Cooling margin | |
| +------------------+ +------------------+ | - Weight limits | |
| +------------------+ |
| +------------------+ +------------------+ +------------------+ |
| | Rack Layout | | Cable Management | | Cooling | |
| | - U-position of | | - Port-to-port | | - Hot/cold aisle | |
| | each device | | patch records | | containment | |
| | - Front/rear | | - Fiber/copper | | - CRAC/CRAH | |
| | orientation | | type, length | | utilization | |
| | - Elevation | | - Cable pathway | | - Temperature | |
| | diagrams | | and tray usage | | monitoring | |
| | - Adjacency rules| | - Structured | | - Humidity | |
| | (keep apart / | | cabling IDs | | tracking | |
| | keep together) | | | | | |
| +------------------+ +------------------+ +------------------+ |
| |
| Integrations: |
| +------------------+ +------------------+ +------------------+ |
| | CMDB | | Ticketing | | Monitoring | |
| | (ServiceNow, | | (Jira, SNow) | | (Prometheus, | |
| | i-doit, etc.) | | | | Zabbix, PRTG) | |
| +------------------+ +------------------+ +------------------+ |
+------------------------------------------------------------------+
Power Management
Power is the most critical physical constraint in a data center. Every rack has a power budget, every PDU (Power Distribution Unit) has a capacity limit, and every UPS has a runtime. Running over budget trips breakers, causes unplanned outages, and in extreme cases damages equipment.
PDU monitoring:
Modern "intelligent" or "managed" PDUs (e.g., Raritan, APC/Schneider, ServerTech) provide per-outlet or per-circuit power monitoring via SNMP, HTTP API, or Modbus. DCIM tools poll these PDUs to build real-time power consumption dashboards.
Per-Rack Power Budget Example:
Rack Capacity: 20 kW nameplate (dual feed, two independent circuits)
PDU A: 16 kW capacity (primary feed)
PDU B: 16 kW capacity (redundant feed)
Design rule: In a 2N power design, the total rack load must never
exceed the capacity of a SINGLE feed, so that either
feed can carry the full rack if the other fails.
In normal operation the load is balanced, so each
PDU runs at no more than 50% of its capacity.
Usable power: 16 kW x 50% = 8 kW per feed (normal operation)
Total usable: 16 kW (the capacity of one feed)
Per-Device Power Draw:
+---------------------------+--------+--------+
| Device | Idle W | Max W |
+---------------------------+--------+--------+
| 2U server (2x CPU, 512G) | 350 | 800 |
| Leaf switch (48x25G) | 150 | 250 |
| Spine switch (32x100G) | 200 | 350 |
+---------------------------+--------+--------+
Rack Power Calculation:
2 leaf switches: 2 x 250W = 500W max
20 servers: 20 x 800W = 16,000W max
Overhead (fans, PDU loss): 500W
Total max: 17,000W = 17 kW
PROBLEM: 17 kW exceeds our 16 kW usable budget.
Solution: Reduce to 18 servers (18 x 800W + 500 + 500 = 15.4 kW)
Or: Use power capping (IPMI/BMC power limit) to cap each server
at 600W, allowing 20 servers (20 x 600 + 500 + 500 = 13 kW)
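A sketch of the same budget check, assuming the nameplate draws from the table above and the 2N rule that the total load must fit on a single feed:

```python
# Minimal sketch: rack power budget check under a 2N feed design.
# Device draws are the assumed nameplate values from the table above.

def rack_power_kw(servers: int, server_max_w: int = 800,
                  leaf_switches: int = 2, leaf_max_w: int = 250,
                  overhead_w: int = 500) -> float:
    return (servers * server_max_w + leaf_switches * leaf_max_w + overhead_w) / 1000

FEED_CAPACITY_KW = 16.0   # usable budget = capacity of a single feed

for n in (20, 18):
    kw = rack_power_kw(servers=n)
    print(f"{n} servers: {kw:.1f} kW -> "
          f"{'OK' if kw <= FEED_CAPACITY_KW else 'over budget'}")
# 20 servers: 17.0 kW -> over budget
# 18 servers: 15.4 kW -> OK
```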
PUE (Power Usage Effectiveness):
PUE measures how efficiently a data center uses power. It is the ratio of total facility power to IT equipment power.
PUE = Total Facility Power / IT Equipment Power
PUE = 1.0: Perfect efficiency (impossible -- cooling needs power)
PUE = 1.2: Excellent (modern hyperscale data centers)
PUE = 1.4: Good (well-designed enterprise data center)
PUE = 1.6: Average (typical legacy enterprise DC)
PUE = 2.0: Poor (significant cooling/lighting/UPS overhead)
Example:
IT equipment draws 500 kW.
Total facility draws 700 kW (IT + cooling + UPS losses + lighting).
PUE = 700 / 500 = 1.4
Implication: For every 1 kW of compute, you pay for 1.4 kW of electricity.
In a year: 500 kW x 8,760 hours x CHF 0.20/kWh = CHF 876,000 IT power
Total cost: 700 kW x 8,760 hours x CHF 0.20/kWh = CHF 1,226,400 total
Overhead cost: CHF 350,400 for cooling/UPS/lighting
DCIM tracks PUE continuously by monitoring both the facility power meter and the IT load meters (sum of all PDU readings). Trends in PUE indicate whether efficiency is improving (due to better cooling, server consolidation) or degrading (due to hot spots, aging cooling equipment).
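The PUE arithmetic also belongs in the DCIM toolbox as a reusable snippet (a sketch; the CHF 0.20/kWh rate is the assumption used in the example above):

```python
# Minimal sketch: PUE and annual power cost (assumed rate of CHF 0.20/kWh).

def pue_report(it_kw: float, facility_kw: float,
               chf_per_kwh: float = 0.20, hours_per_year: int = 8760) -> dict:
    pue = facility_kw / it_kw
    it_cost = it_kw * hours_per_year * chf_per_kwh
    total_cost = facility_kw * hours_per_year * chf_per_kwh
    return {"pue": round(pue, 2),
            "it_cost_chf": round(it_cost),
            "overhead_cost_chf": round(total_cost - it_cost)}

print(pue_report(it_kw=500, facility_kw=700))
# {'pue': 1.4, 'it_cost_chf': 876000, 'overhead_cost_chf': 350400}
```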
Cooling
Cooling is the second-hardest physical constraint after power. Heat generated by servers must be removed from the data center. Inadequate cooling causes thermal throttling (CPUs reduce clock speed to avoid overheating), component failures, and in extreme cases fire.
Hot aisle / cold aisle containment:
Hot Aisle / Cold Aisle Layout:
Cold Aisle (contained, pressurized with cold air)
+----------+ +----------+
| Rack 1 | | Rack 2 |
| (front | | (front |
| faces | | faces |
| cold | | cold |
| aisle) | | aisle) |
+----+-----+ +-----+----+
| |
v servers draw v
cold air from
the front
Hot Aisle (contained, collects exhaust heat)
+----------+ +----------+
| Rack 1 | | Rack 3 |
| (rear | | (rear |
| faces | | faces |
| hot | | hot |
| aisle) | | aisle) |
+----+-----+ +-----+----+
| |
v servers exhaust v
hot air to
the rear
CRAC/CRAH (Computer Room Air Conditioner/Handler) units push
cold air under the raised floor into the cold aisles.
Hot air rises from the hot aisle and returns to the CRAC/CRAH.
Containment: Physical barriers (curtains, rigid panels, doors)
seal the cold aisle or hot aisle to prevent mixing. Without
containment, hot exhaust recirculates into cold air intake,
reducing cooling efficiency and creating hot spots.
Key cooling concepts for platform planning:
- Blanking panels: Empty rack units (where no server is installed) must be covered with blanking panels. Without blanking panels, hot exhaust air from the rear flows through the empty space to the front, bypassing the server air intake. This raises inlet temperatures and reduces effective cooling capacity. During a platform migration where servers are added and removed, blanking panel management is essential.
- In-row cooling: For high-density racks (20+ kW), CRAC/CRAH units under the raised floor may not provide sufficient airflow. In-row cooling units are placed between racks, directly inside the row, providing targeted cooling close to the heat source.
- Rear-door heat exchangers (RDHx): A heat exchanger mounted on the rear door of the rack captures exhaust heat before it enters the hot aisle. Uses chilled water to absorb heat. Effective for high-density racks without requiring in-row units. Requires plumbing to each rack.
- Per-server power and thermal monitoring: Modern servers report inlet temperature, exhaust temperature, fan speed, and CPU temperature via IPMI/Redfish. DCIM tools collect these readings to detect hot spots before they cause failures.
Cable Management
At 130 servers with 4 NIC ports each, plus 14 leaf switches with 48 downlinks and 6 uplinks each, the data-plane cable count alone exceeds 600; counting management, power, and console cabling, the total passes 1,000 (see the estimate below). Without systematic cable management, the data center becomes a rat's nest that impedes airflow, makes troubleshooting impossible, and increases the risk of accidental disconnections.
Cable Count Estimate for 5,000 VM Fabric:
Server-to-leaf DAC cables: 130 x 4 = 520 cables
Leaf-to-spine fiber cables: 14 x 6 = 84 cables
Border leaf cables: 2 x 10 = 20 cables
Management network cables: 130 x 1 = 130 cables (IPMI/iDRAC)
PDU power cables: 130 x 2 = 260 cables (dual PSU)
Console cables (switches): 22 x 1 = 22 cables
-------------------------------------------------------
Total cables: ~1,036 cables
Without structured labeling and patch management, this is
unmanageable within 6 months of initial deployment.
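The estimate above can be parameterized so it tracks the design as server and switch counts change. A sketch using the counts from this section (all assumptions):

```python
# Minimal sketch: cable count estimate as a function of the fabric size.
# Counts and per-device assumptions match the estimate above.

def cable_count(servers=130, nic_ports_per_server=4, leaves=14, uplinks_per_leaf=6,
                border_leaf_cables=20, mgmt_per_server=1, psu_per_server=2,
                switches=22) -> dict:
    counts = {
        "server_to_leaf": servers * nic_ports_per_server,
        "leaf_to_spine": leaves * uplinks_per_leaf,
        "border_leaf": border_leaf_cables,
        "management": servers * mgmt_per_server,
        "power": servers * psu_per_server,
        "console": switches,
    }
    counts["total"] = sum(counts.values())
    return counts

print(cable_count())   # total -> 1036
```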
Structured cable management practices:
- Labeling standard: Every cable must be labeled at both ends with a machine-readable (barcode or QR code) and human-readable identifier. The label encodes: source device, source port, destination device, destination port. Example: SRV042-P1 -> LF03A-E12 (Server 42, Port 1 to Leaf 3A, Ethernet 12).
- Patch panel management: All fiber connections between racks pass through patch panels. The patch panel provides a clean demarcation point where trunk cables (running through cable trays between racks) meet patch cables (short cables within the rack). This allows a trunk cable to be replaced without touching the switch, and a patch cable to be replaced without touching the trunk.
- Cable pathways: Cables run through overhead cable trays or under-floor pathways, never draped loosely between racks. Fiber and copper cables should be separated (fiber is fragile and can be damaged by sharing a tray with heavy copper bundles). Cable trays should never exceed 50% fill capacity (for airflow and future additions).
- Color coding: Use cable jacket color to distinguish cable types:
- Blue: VM/overlay network
- Red: Storage replication network
- Yellow: Management/IPMI network
- Green: inter-switch links (leaf-to-spine)
- Fiber: follow the industry jacket-color convention rather than a custom scheme (OM1/OM2 orange, OM3/OM4 aqua, single-mode OS2 yellow)
- Cable length management: Use the shortest cable that reaches. Excess cable length creates airflow obstruction and visual clutter. Pre-measure runs and order cables in specific lengths (0.5m, 1m, 2m, 3m, 5m) rather than using one universal long length.
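The labeling convention lends itself to generation from the cable plan rather than hand-writing. A sketch in the format described above (the SRV/LF identifiers follow this document's example, not a standard):

```python
# Minimal sketch: generate both-end cable labels in the
# "<source device>-<port> -> <destination device>-<port>" format described above.

def cable_labels(src_dev: str, src_port: str, dst_dev: str, dst_port: str) -> tuple:
    """Label text for each end of one cable (each end names the far side too)."""
    a_end = f"{src_dev}-{src_port} -> {dst_dev}-{dst_port}"
    b_end = f"{dst_dev}-{dst_port} -> {src_dev}-{src_port}"
    return a_end, b_end

for label in cable_labels("SRV042", "P1", "LF03A", "E12"):
    print(label)
# SRV042-P1 -> LF03A-E12
# LF03A-E12 -> SRV042-P1
```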
Capacity Planning
Capacity planning for physical infrastructure involves tracking four dimensions over time: rack units, power, cooling, and weight.
Capacity Planning Dimensions:
+------------------+------------------+-----------------+------------------+
| Dimension | Unit | Typical Limit | Tracking Method |
+------------------+------------------+-----------------+------------------+
| Rack space | U (rack units) | 42U per rack | DCIM rack |
| | 1U = 44.45mm | (usable: 38-40U | elevation diagram|
| | | after cabling/ | |
| | | switch/panels) | |
+------------------+------------------+-----------------+------------------+
| Power | Watts (W) or | 15-20 kW per | PDU monitoring |
| | kilowatts (kW) | rack (typical) | (real-time) |
| | | Up to 40 kW | |
| | | high-density | |
+------------------+------------------+-----------------+------------------+
| Cooling | kW of heat | Equal to power | Temperature |
| | dissipation | draw (1 kW | sensors, CRAC |
| | | electrical = | capacity reports |
| | | 1 kW thermal) | |
+------------------+------------------+-----------------+------------------+
| Weight | kg per rack | 800-1,200 kg | Manual tracking |
| | | (varies by | (weigh each |
| | | floor rating) | device on intake)|
+------------------+------------------+-----------------+------------------+
Weight is often overlooked. A fully loaded 42U rack with 20 servers,
2 switches, and 2 PDUs can weigh 800-1,000 kg. Raised floor tiles
typically support 450-550 kg per tile. If a heavy rack is placed on
a weak tile, the floor can collapse. DCIM must track per-rack weight
against the floor's rated capacity.
Capacity planning for platform migration:
During a migration from VMware to a new platform, the physical capacity plan must account for:
- Coexistence period: Old VMware hosts and new platform hosts run simultaneously during migration. This temporarily doubles the rack, power, and cooling requirements. DCIM must model this coexistence period and verify that the data center can sustain both platforms concurrently.
- Different hardware profiles: The new platform may require different server hardware (e.g., OVE may need servers from the Red Hat Hardware Certification List, Azure Local requires Microsoft-certified hardware). New hardware may have different power draw, cooling requirements, and form factors (1U vs. 2U vs. 4-node chassis).
- Network fabric transition: If the spine-leaf fabric is being built alongside the migration (replacing a three-tier fabric), additional rack space is needed for new spine and leaf switches. The old and new fabrics must coexist until the migration is complete.
- Decommissioning timeline: As VMware hosts are decommissioned, their rack space, power, and cooling are freed. DCIM must track this timeline to ensure the data center does not exceed capacity during the peak coexistence period.
Migration Capacity Timeline (Simplified):
Quarter | VMware Hosts | New Platform | Total Racks | Power (kW)
-----------+--------------+--------------+-------------+-----------
Q1 (Start) | 100 | 0 | 6 | 96
Q2 (Pilot) | 100 | 20 | 7 | 112
Q3 (Wave 1)| 80 | 50 | 8 | 128 <-- peak!
Q4 (Wave 2)| 40 | 80 | 8 | 128
Q5 (Wave 3)| 10 | 100 | 7 | 112
Q6 (Done) | 0 | 130 | 8 | 110
Peak capacity (Q3-Q4): 8 racks, 128 kW
DCIM must verify: facility has 8 racks available with 128 kW
total power and corresponding cooling capacity.
If not: migration must be phased more slowly or hardware
decommissioned faster.
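A sketch of the peak-coexistence check DCIM needs to run before the migration starts; the quarterly figures are the ones from the simplified timeline above, and the facility limit is an assumption:

```python
# Minimal sketch: find the peak quarter of the coexistence period and check it
# against what the facility can deliver. Figures are from the timeline above.

plan = [  # (quarter, vmware_hosts, new_platform_hosts, power_kw)
    ("Q1", 100, 0, 96), ("Q2", 100, 20, 112), ("Q3", 80, 50, 128),
    ("Q4", 40, 80, 128), ("Q5", 10, 100, 112), ("Q6", 0, 130, 110),
]

FACILITY_LIMIT_KW = 128.0   # assumed deliverable power for these racks

quarter, old, new, kw = max(plan, key=lambda row: row[3])
print(f"Peak: {quarter} -- {old + new} hosts, {kw} kW")
print("Fits within the facility limit" if kw <= FACILITY_LIMIT_KW
      else "Exceeds the limit -- rephase or decommission faster")
```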
Integration with CMDB, Ticketing, and Monitoring
DCIM does not exist in isolation. It must integrate with the organization's IT management systems to provide end-to-end visibility:
- CMDB (Configuration Management Database): DCIM exports physical asset data (serial numbers, rack positions, power connections) to the CMDB. The CMDB correlates physical assets with logical assets (VMs, clusters, services). When an incident occurs ("VM xyz is slow"), the operations team can trace from the VM to the host to the rack to the switch port to the cable to the PDU to the power circuit -- all from a single pane.
- Ticketing system: When DCIM detects a capacity threshold (rack power approaching budget, cooling margin below 20%), it creates a ticket automatically. When a new server needs to be racked, the ticketing system triggers a DCIM workflow: select rack, assign U-position, verify power budget, generate cable plan, print labels.
- Monitoring: DCIM provides the physical context for infrastructure monitoring. Prometheus/Zabbix/PRTG monitors CPU, memory, NIC utilization. DCIM monitors inlet temperature, PDU load, and cooling capacity. Together, they answer questions like "Is this server slow because it is thermally throttling?" or "Can we add 10 more servers to Rack 5 without exceeding the power budget?"
DCIM Tools
| Tool | Type | Strengths | Weaknesses |
|---|---|---|---|
| NetBox | Open source (DigitalOcean/NetBox Labs) | Excellent IPAM and rack management, strong API, large community, integrates well with Ansible/Terraform. Industry standard for network-centric DCIM. | Weaker on power/cooling monitoring (requires plugins or external integration). Not a full facilities DCIM. |
| Device42 | Commercial | Strong auto-discovery, CMDB integration, dependency mapping. Good for hybrid physical+virtual inventory. | License cost. Can be complex to configure. |
| Nlyte | Commercial | Full facilities DCIM: power, cooling, capacity planning, what-if modeling. Enterprise-grade. | Expensive. Overkill for organizations with simple DC footprints. |
| Sunbird dcTrack | Commercial | Strong cable management and capacity planning. Visual rack layouts. Workflow automation for provisioning. | License cost. Steep learning curve. |
| openDCIM | Open source | Simple, lightweight rack management. | Limited features, small community, no power monitoring integration. |
| Ralph | Open source (Allegro) | Asset management with DC features. Good for CMDB-centric organizations. | Less focused on power/cooling than commercial alternatives. |
Recommendation for this evaluation:
For an organization with 5,000+ VMs and 7-8 racks of physical infrastructure, NetBox is the pragmatic starting point. It is free, has a mature API, integrates with automation tools (Ansible, Terraform), and covers the essential DCIM functions: rack layouts, device inventory, cable tracking, and IP address management. Its API-first design makes it suitable for integration with both OVE's Kubernetes-based management (via custom controllers that sync rack data to NetBox) and Azure Local's PowerShell/ARM-based management.
For organizations with stricter compliance requirements around power and cooling tracking (e.g., banking regulators requiring documented PUE reporting and capacity audit trails), a commercial tool like Nlyte or Sunbird dcTrack provides the facilities-grade monitoring and reporting that NetBox lacks out of the box.
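If NetBox is adopted, the pynetbox client gives scripts a quick way to cross-check physical placement. A minimal sketch, assuming a reachable NetBox instance, a valid API token, and device names that exist in its inventory (the URL, token, and names below are hypothetical):

```python
# Minimal sketch: query NetBox for a host's physical placement with pynetbox.
# The URL, token, and object names are placeholders, not real values.
import pynetbox

nb = pynetbox.api("https://netbox.example.com", token="REDACTED")

server = nb.dcim.devices.get(name="srv042")          # hypothetical device name
if server and server.rack:
    print(f"{server.name}: rack {server.rack}, U{server.position}, "
          f"type {server.device_type}")

    # What else shares the rack (useful before adding load to it)?
    for dev in nb.dcim.devices.filter(rack_id=server.rack.id):
        print(f"  U{dev.position}: {dev.name} ({dev.device_type})")

    # Which interfaces are cabled, and to which cable records?
    for iface in nb.dcim.interfaces.filter(device_id=server.id):
        if iface.cable:
            print(f"  {iface.name}: cable #{iface.cable.id} ({iface.cable.label})")
```

A similar script driven by NMState LLDP data (OVE) or PowerShell inventory (Azure Local) can keep the same records current on the platform side.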
Why DCIM Matters for Platform Migration
DCIM is not a "nice to have" during a platform migration. It directly answers questions that block migration progress:
- "Where do we rack the new servers?" DCIM shows which racks have available U positions, power budget, and cooling margin. Without DCIM, the answer requires physically walking the data center floor with a clipboard.
- "Which servers are still running VMware?" DCIM cross-referenced with the CMDB shows which physical hosts are assigned to the old platform and which have been migrated. This is the single source of truth for migration progress tracking at the physical layer.
- "Can we decommission Rack 3?" DCIM shows that Rack 3 still has 2 VMware hosts with 100 VMs scheduled for migration in Q4. Decommissioning must wait. Without DCIM, someone has to physically inspect the rack.
- "What cables need to move when we migrate Host 12 to the new fabric?" DCIM's cable tracking shows exactly which switch ports Host 12 is connected to on the old fabric and where it needs to be reconnected on the new fabric. It generates a cable change plan with labels.
- "Do we have enough power for the PoC cluster?" The PoC requires 4 servers and 2 switches. DCIM calculates: 4 x 800W + 2 x 250W = 3,700W needed. Rack 7 has 4.2 kW available. Approved -- rack in position U22-U30.
How the Candidates Handle This
| Aspect | VMware (Current) | OVE | Azure Local | Swisscom ESC |
|---|---|---|---|---|
| Physical fabric assumption | Three-tier or spine-leaf. NSX is fabric-agnostic. vDS connects to any ToR switch via LACP. | Spine-leaf recommended. Red Hat reference architecture assumes MLAG leaf pair at ToR with eBGP spine. No proprietary fabric dependency. | Spine-leaf recommended. Microsoft reference architecture specifies MLAG leaf pair with BGP peering to Network Controller. | Provider-managed. Swisscom operates the fabric; customer has no visibility or influence on physical topology. |
| Switch vendor requirements | None. vDS/N-VDS works with any switch vendor. NSX requires BGP peering on Edge nodes but is vendor-neutral for the underlay fabric. | None. NMState LACP works with any switch vendor. MetalLB BGP works with any BGP-speaking switch. OVN gateway BGP requires FRRouting or a BGP-speaking ToR. | None officially, but Microsoft validates specific switch models for Network ATC. Dell, Arista, and Cisco Nexus are most commonly validated. | Not applicable (Swisscom selects the hardware). |
| Hardware Compatibility List (HCL) | VMware HCL for servers and NICs (ESXi driver support). Extensive, covers most enterprise hardware. | Red Hat Ecosystem Catalog (hardware certification). Must use RHEL-certified servers. NIC support depends on RHEL kernel drivers. | Windows Server Catalog + Azure Local Catalog. Stricter than VMware and OVE -- specific server models validated as "Integrated Systems" or "Validated Nodes." | Swisscom selects the hardware. Customer chooses from Swisscom's offering (typically Dell or HPE). |
| Server form factor | Any server from HCL. Commonly 2U dual-socket. | Any server from Red Hat Ecosystem Catalog. Commonly 2U dual-socket. Also supports 1U and 4-node chassis (e.g., Dell C6620). | Validated nodes or integrated systems only. Common: Dell AX series, HPE ProLiant, Lenovo ThinkSystem. Must match Azure Local hardware catalog. | Swisscom specifies the hardware. |
| Minimum cluster size | 3 hosts for vSAN, 2 hosts without shared storage. | 3 control-plane + 2 worker nodes minimum (compact cluster: 3 nodes total). | 2-node minimum (with witness), 3-node recommended for Storage Spaces Direct. | Provider-determined. Minimum depends on service tier. |
| NIC requirements | 2x 10G minimum (ESXi 7.x), 2x 25G recommended. RDMA NICs for vSAN over RDMA. | 2x 25G minimum recommended. RDMA optional (for ODF storage). Any RHEL-supported NIC. Intel E810, Mellanox CX-5/6 most common. | 2x 25G minimum. RDMA mandatory for Storage Spaces Direct (iWARP or RoCE). Intel E810, Mellanox CX-5/6 required for RDMA. | Provider-managed. |
| DCIM integration | vCenter API provides host inventory. No native DCIM integration -- requires third-party tools (NetBox, Device42) and custom scripts to sync physical data. | Kubernetes API provides node inventory. Can sync to NetBox via custom operators or Ansible. NMState LLDP data provides physical topology for DCIM cross-reference. | Windows Admin Center provides host inventory. PowerShell/ARM API for automation. Can sync to DCIM via scheduled scripts. | Not applicable. Swisscom manages the physical infrastructure. Customer DCIM (if any) does not include Swisscom-managed hardware. |
| Reference cabling architecture | Not prescriptive. VMware does not provide cabling guidance beyond "connect ESXi to ToR switch." | Red Hat reference architecture documents specify NIC count, bond configuration, and switch connectivity. Cabling follows the spine-leaf ToR model described above. | Microsoft Azure Local deployment guide specifies NIC count, SET configuration, and switch connectivity per validated hardware model. Specific cable types documented per reference design. | Not applicable. |
| Power/cooling guidance | VMware Sizing Tool estimates server count but not power/cooling. Physical planning is the customer's responsibility. | Red Hat does not provide power/cooling planning tools. Physical planning is the customer's responsibility. Use server vendor specs. | Microsoft does not provide power/cooling planning tools. Integrated System vendors (Dell, HPE) provide power/cooling specs per validated configuration. | Swisscom manages power and cooling. Customer receives SLA, not physical specifications. |
Key differences in prose:
The most significant difference is operational ownership. With VMware, OVE, and Azure Local, the organization owns the physical infrastructure end to end -- fabric design, switch selection, cabling, power, cooling, rack layout, and DCIM. The organization must build or procure the competency to design and operate a spine-leaf fabric. With Swisscom ESC, the physical layer is entirely provider-managed. The organization trades control for simplicity: no spine-leaf design decisions, no switch vendor evaluation, no DCIM responsibility for the compute infrastructure -- but also no ability to optimize, troubleshoot, or scale the physical layer independently.
For OVE and Azure Local specifically:
-
OVE is the most transparent about physical infrastructure. NMState CRDs expose every bond, VLAN, and LLDP neighbor as Kubernetes resources. The platform actively inspects and reports on the physical layer. This makes OVE the best candidate for DCIM integration via API: a custom Kubernetes operator can continuously sync physical topology data (NIC model, firmware version, LLDP neighbor, bond status) from all nodes to NetBox, as sketched after this list.
-
Azure Local is more opinionated about hardware. The Azure Local hardware catalog defines exactly which server models, NIC models, and switch models are supported. This reduces design freedom but also reduces risk -- the vendor has pre-validated the physical configuration. Network ATC automates NIC configuration based on intents, reducing the chance of misconfiguration but also reducing the operator's visibility into what is actually configured at the physical layer.
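To illustrate the OVE-to-DCIM integration path mentioned above, the following is a minimal sketch of the sync step, assuming the pynetbox client library, an existing device record in NetBox, and node facts already extracted from the NodeNetworkState status by separate tooling. The URL, token, hostnames, interface type, and the custom field name are placeholders, and MAC handling differs on recent NetBox releases -- treat this as an outline, not a finished operator.

```python
# Minimal sketch: push NMState-derived interface facts into NetBox.
# Assumptions: pynetbox is installed, the device already exists in NetBox,
# and `node_facts` was extracted from NodeNetworkState status by other
# tooling (the structure shown here is illustrative, not the CRD schema).
import pynetbox

nb = pynetbox.api("https://netbox.example.internal", token="REDACTED")

node_facts = {
    "device": "ove-worker-03",                       # hostname as known to NetBox
    "interfaces": [
        {"name": "eno1", "mac": "3c:ec:ef:aa:bb:01",
         "lldp_neighbor": "leaf-07", "lldp_port": "Ethernet1/12"},
        {"name": "eno2", "mac": "3c:ec:ef:aa:bb:02",
         "lldp_neighbor": "leaf-08", "lldp_port": "Ethernet1/12"},
    ],
}

device = nb.dcim.devices.get(name=node_facts["device"])
for fact in node_facts["interfaces"]:
    iface = nb.dcim.interfaces.get(device_id=device.id, name=fact["name"])
    if iface is None:
        iface = nb.dcim.interfaces.create(
            device=device.id, name=fact["name"], type="25gbase-x-sfp28")
    iface.mac_address = fact["mac"]
    # Record the observed LLDP neighbor in a custom field for later
    # cross-checking against cable records (field name is an assumption).
    iface.custom_fields["observed_lldp_neighbor"] = (
        f'{fact["lldp_neighbor"]} {fact["lldp_port"]}')
    iface.save()
```

In practice this would run as a reconcile loop in the operator or as a scheduled job keyed on the node name, so the DCIM data tracks the fabric as nodes are added, recabled, or decommissioned.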
Key Takeaways
-
Spine-leaf is the only viable fabric topology for 5,000+ VMs. The traditional three-tier design cannot deliver the east-west bandwidth, consistent latency, and graceful degradation that modern virtualized workloads (especially HCI storage replication) require. The sizing example shows that 130 servers fit into a spine-leaf fabric with a single spine tier (no super-spine): 6 spines, 14 leaves, and 2:1 oversubscription -- well within the capacity of commonly available switch platforms.
-
Oversubscription must be calculated explicitly and documented. A 2:1 oversubscription ratio means the fabric can sustain 50% of aggregate server bandwidth simultaneously. For general compute, this is acceptable. For HCI storage replication, consider a separate storage network with 1:1 oversubscription or higher-speed uplinks. The oversubscription ratio is the single most important design decision in the physical fabric; a worked calculation for the sizing example follows this list.
-
eBGP on unnumbered interfaces is the standard fabric routing protocol. Using a unique ASN per switch, eBGP on every spine-leaf link, and ECMP across all spines provides a scalable, vendor-neutral, well-understood routing design. This is documented in RFC 7938 and supported by every data center switch vendor.
-
Breakout cables dramatically increase spine-leaf scale. A spine switch with 32x 400G ports can support up to 128 leaf switches at 100G each via 4x breakout cables. This is often the difference between needing a super-spine tier and staying within a single-tier design.
-
DCIM is not optional for a migration of this scale. With 130+ servers, 20+ switches, and 1,000+ cables, the physical infrastructure must be systematically tracked. DCIM provides the data needed to plan rack placement, verify power budgets, track migration progress, and generate cable change plans. NetBox is the pragmatic starting point for organizations without an existing DCIM tool.
-
The coexistence period during migration is the peak demand on physical infrastructure. Old and new platforms run simultaneously, potentially doubling rack, power, and cooling requirements. DCIM must model this peak demand before migration begins. If the data center cannot sustain both platforms concurrently, the migration must be phased more slowly or hardware must be decommissioned faster.
-
Physical design decisions are cross-platform. The spine-leaf fabric, cabling, power, and cooling infrastructure will outlive any single virtualization platform. Design the physical layer for a 10-15 year lifecycle, not for a single platform. This means choosing switch vendors that support eBGP spine-leaf (all major vendors do), using structured cabling that can be re-patched for different server/switch configurations, and sizing power and cooling for future density increases (plan for 25-30 kW per rack even if current density is 15-20 kW).
-
Azure Local's stricter hardware requirements may limit physical design flexibility. The Azure Local hardware catalog specifies exact server and NIC models. If the organization wants to reuse existing servers or select a non-validated NIC, Azure Local may not support it. OVE and VMware are more flexible in hardware selection, though all three platforms have hardware compatibility lists that must be checked.
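To make the oversubscription and sizing takeaways concrete, here is a worked sketch of the arithmetic for the 130-server / 14-leaf / 6-spine example. The leaf port counts and link speeds are illustrative assumptions (48x 25G server ports per leaf, one 100G uplink per spine) and should be replaced with the actual values from the switch vendor's proposal.

```python
# Worked example of the leaf-tier oversubscription and fabric bandwidth math.
# All port counts and speeds below are illustrative assumptions; substitute
# the actual values from the switch vendor's proposal.

def oversubscription(server_ports: int, server_speed_gbps: int,
                     uplinks: int, uplink_speed_gbps: int) -> float:
    """Downlink capacity divided by uplink capacity for one leaf switch."""
    downlink = server_ports * server_speed_gbps
    uplink = uplinks * uplink_speed_gbps
    return downlink / uplink

# Assumed leaf: 48x 25G server-facing ports, one 100G uplink to each of 6 spines.
worst_case = oversubscription(48, 25, 6, 100)          # fully populated leaf

# As-deployed: 130 dual-homed servers spread across 14 leaves
# -> roughly 130 * 2 / 14 = ~19 active 25G server links per leaf.
as_deployed = oversubscription(round(130 * 2 / 14), 25, 6, 100)

# Aggregate leaf-to-spine capacity across the fabric (one direction).
fabric_uplink_gbps = 14 * 6 * 100

print(f"worst-case leaf oversubscription : {worst_case:.1f}:1")   # 2.0:1
print(f"as-deployed leaf oversubscription: {as_deployed:.2f}:1")  # ~0.79:1
print(f"aggregate leaf uplink capacity   : {fabric_uplink_gbps} Gbps")
```

The point of writing the calculation down is that the 2:1 figure describes the fully populated worst case; the as-deployed ratio is lower until the leaves fill up, which is exactly the headroom the design should document.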
Discussion Guide
The following questions are designed for vendor workshops, switch vendor evaluations, DC facilities reviews, and PoC planning sessions. They target the physical layer decisions that directly impact platform deployment and operation.
1. Fabric Topology Validation
"Present the spine-leaf design for our 5,000 VM deployment. Show the spine count, leaf count, oversubscription ratio at the leaf tier, and total bisection bandwidth. Walk us through the calculation. What is the maximum number of simultaneous live migrations this fabric can sustain without degrading VM traffic below acceptable thresholds?"
Purpose: Forces the switch vendor or integrator to present a concrete, calculated design rather than marketing slides. The answer must include specific numbers: port counts, link speeds, oversubscription ratio, and a live migration bandwidth analysis (e.g., 10 simultaneous live migrations of VMs with 4 GB of memory each, each consuming X Gbps).
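As a sanity check on the vendor's migration analysis, a back-of-envelope sketch of the arithmetic; the memory size, dirty-page overhead factor, and target completion time are assumptions chosen for illustration.

```python
# Back-of-envelope check for the live-migration question: how much fabric
# bandwidth do N concurrent migrations need to finish within a target time?
# The VM memory size, dirty-page overhead factor, and target time are
# illustrative assumptions.

concurrent_migrations = 10
vm_memory_gib = 4              # memory to transfer per VM
dirty_page_factor = 1.3        # assumed overhead for re-copying dirtied pages
target_seconds = 30            # desired completion time per migration

bits_per_vm = vm_memory_gib * 8 * 2**30 * dirty_page_factor
per_migration_gbps = bits_per_vm / target_seconds / 1e9
aggregate_gbps = per_migration_gbps * concurrent_migrations

print(f"per migration : {per_migration_gbps:.2f} Gbps")   # ~1.5 Gbps
print(f"aggregate     : {aggregate_gbps:.1f} Gbps")       # ~15 Gbps
# Compare the aggregate against the leaf uplink headroom left over after
# normal VM and storage traffic, not against the raw uplink capacity.
```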
2. BGP and ECMP Configuration
"Show us the BGP configuration for the spine-leaf fabric. What ASN scheme do you use? Is it eBGP with unique ASN per switch, or iBGP with route reflectors? How many ECMP paths are configured per prefix? Are you using unnumbered interfaces, and if so, how are IPv4 routes resolved via IPv6 link-local next-hops? What is the BFD interval for fast failure detection on BGP sessions?"
Purpose: Validates that the fabric design follows modern best practices (eBGP, unique ASN, unnumbered interfaces, BFD). The answer must include specific configuration examples or snippets. If the vendor proposes iBGP or OSPF, ask them to justify the deviation from RFC 7938.
3. Oversubscription Under Load
"At 2:1 oversubscription, what happens when 60% of our servers simultaneously generate east-west traffic at 80% of their NIC capacity? How much congestion occurs? What queuing or QoS mechanisms are in place on the leaf switches to manage the oversubscription? Can the fabric signal back-pressure to the servers (ECN/WRED), or does it silently drop excess traffic?"
Purpose: Tests understanding of real-world oversubscription behavior. The answer should address: leaf switch buffer sizes, QoS policies (DSCP-based queuing), ECN (Explicit Congestion Notification) marking and PFC (Priority Flow Control) for lossless Ethernet classes (relevant for RDMA/RoCE storage traffic), and whether the switch platform supports micro-burst absorption.
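The load pattern in the question can be pre-computed before the workshop so the vendor's answer can be checked against simple arithmetic; the per-leaf link counts and speeds below are the same illustrative assumptions used in the sizing sketch earlier.

```python
# Rough estimate of leaf-uplink demand for the load pattern in the question:
# 60% of servers active, each driving 80% of its NIC capacity east-west.
# Per-leaf link counts, NIC speed, and uplink capacity are assumptions.

nic_gbps = 25                  # per active server link on this leaf
uplink_gbps = 6 * 100          # one 100G uplink per spine, 6 spines
active_fraction = 0.60
load_fraction = 0.80

for label, links in (("as-deployed (~19 links/leaf)", 19),
                     ("fully populated (48 links/leaf)", 48)):
    offered = links * nic_gbps * active_fraction * load_fraction
    print(f"{label}: {offered:.0f} Gbps offered, "
          f"{offered / uplink_gbps:.0%} of uplink capacity")
# ~38% utilisation as deployed, ~96% when the leaf is fully populated --
# the latter is where buffers, ECN/PFC, and QoS policy actually matter.
```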
4. Failure Impact Demonstration
"During the PoC, we will power off one spine switch and measure the impact on active VM traffic. What is the expected behavior? How quickly does ECMP reconverge? What packet loss should we expect during reconvergence? Is resilient ECMP hashing enabled, and if so, what is the expected disruption to flows that were NOT on the failed spine?"
Purpose: Establishes a concrete failure test plan for the spine tier. Expected answers: ECMP reconverges within 1-5 seconds (BFD detection followed by BGP route withdrawal; aggressive BFD timers can bring detection below one second), resilient hashing limits disruption to only flows on the failed spine, and total bandwidth reduces by 1/N_spines (e.g., ~17% with 6 spines).
5. Cable and Patch Management Plan
"Describe the cabling plan for the initial deployment. What labeling standard will be used? How are cable records maintained -- manually, or via DCIM integration? When we need to migrate Server 42 from the old VMware ToR switch to the new OVE ToR switch, what is the change management process for the cable move? How do we verify after the move that the server is connected to the correct switch and port?"
Purpose: Tests operational maturity around physical cable management. The answer must describe a labeling standard, a cable record system (DCIM or spreadsheet at minimum), and a post-move verification method (LLDP validation or manual port check).
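For the post-move verification step, the following is a minimal sketch of an LLDP check run on the server after the cable move. It assumes lldpd is installed on the host (so `lldpctl -f keyvalue` is available); the expected switch and port values would normally be pulled from the DCIM cable records rather than hard-coded, and the exact LLDP key names depend on how the switch advertises its port ID.

```python
# Post-move verification sketch: confirm via LLDP that a server's NICs are
# cabled to the expected leaf switch and port. Hostnames, interface names,
# and expected ports below are placeholders.
import subprocess

EXPECTED = {
    "eno1": ("leaf-07", "Ethernet1/12"),
    "eno2": ("leaf-08", "Ethernet1/12"),
}

def lldp_facts() -> dict:
    """Parse `lldpctl -f keyvalue` into {interface: {key: value}}."""
    out = subprocess.run(["lldpctl", "-f", "keyvalue"],
                         capture_output=True, text=True, check=True).stdout
    facts: dict = {}
    for line in out.splitlines():
        if "=" not in line:
            continue
        key, value = line.split("=", 1)
        parts = key.split(".")          # e.g. lldp.eno1.chassis.name
        if len(parts) < 3:
            continue
        facts.setdefault(parts[1], {})[".".join(parts[2:])] = value
    return facts

observed = lldp_facts()
for nic, (want_switch, want_port) in EXPECTED.items():
    got = observed.get(nic, {})
    ok = (got.get("chassis.name") == want_switch
          and got.get("port.ifname", got.get("port.descr")) == want_port)
    print(f"{nic}: expected {want_switch}/{want_port}, "
          f"observed {got.get('chassis.name')}/{got.get('port.ifname')} "
          f"-> {'OK' if ok else 'MISMATCH'}")
```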
6. Power and Cooling for Peak Migration Period
"During the migration, we will run both the old VMware cluster (100 servers) and the new platform cluster (up to 130 servers) simultaneously. What is the peak power draw during this period? Does the data center have sufficient power capacity and cooling capacity for this peak? If not, what is the maximum number of new servers we can deploy before we must decommission old servers?"
Purpose: Forces a capacity planning conversation before migration begins. The answer requires knowledge of the data center's total power budget, available rack space, and cooling capacity. DCIM data is essential to answer this accurately.
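A rough version of the peak-coexistence estimate can be prepared ahead of the conversation; the per-server draw, switch draw, PUE, and facility budget below are illustrative assumptions, and real numbers should come from vendor power calculators and DCIM metering.

```python
# Peak-coexistence power estimate for the question above.
# All input values are assumptions for illustration.

old_servers, new_servers = 100, 130
watts_per_old_server = 550      # measured average draw, assumed
watts_per_new_server = 700      # denser HCI nodes, assumed
switch_count, watts_per_switch = 24, 350

it_load_kw = (old_servers * watts_per_old_server
              + new_servers * watts_per_new_server
              + switch_count * watts_per_switch) / 1000

pue = 1.4                       # assumed facility efficiency
facility_kw = it_load_kw * pue
budget_kw = 250                 # assumed available power budget for these rows

print(f"IT load at peak coexistence : {it_load_kw:.0f} kW")
print(f"Facility load (PUE {pue})   : {facility_kw:.0f} kW")
print(f"Within budget of {budget_kw} kW? "
      f"{'yes' if facility_kw <= budget_kw else 'no'}")
```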
7. DCIM Tool Selection and Data Flow
"What DCIM tool do we currently use, and does it meet the requirements for tracking the new platform's physical infrastructure? If we need to deploy NetBox (or another DCIM tool), how does it integrate with our CMDB (ServiceNow)? Can we auto-populate NetBox with device data from OVE's NMState or Azure Local's Windows Admin Center? How do we keep rack layouts, power budgets, and cable records accurate as the migration progresses?"
Purpose: Assesses readiness for systematic physical infrastructure management. If the organization has no DCIM tool, this question triggers the planning for one. If a DCIM tool exists, it validates whether it can handle the new platform's hardware.
8. Hardware Certification and NIC Compatibility
"Show us the hardware compatibility list for your platform. Are our target server models (specify exact models) on the list? Are the NICs we plan to use (specify models) certified for LACP bonding, RDMA (if applicable), and overlay offload (GENEVE/VXLAN)? What firmware versions are required for the NICs? If we need to update NIC firmware during deployment, what is the process -- does the platform automate firmware management, or is it a manual pre-deployment step?"
Purpose: Prevents the discovery of hardware incompatibility after hardware has been purchased. NIC firmware incompatibilities are one of the most common causes of PoC failures. The answer must include specific firmware versions and a firmware update process.
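As part of the pre-deployment checks, a minimal sketch of a per-host NIC firmware audit using `ethtool -i`; the required firmware versions, driver names, and interface names are placeholders, and firmware version strings are vendor-specific, so a real check needs per-vendor parsing rather than a plain string comparison.

```python
# Pre-deployment firmware check sketch: compare each NIC's driver/firmware
# (as reported by `ethtool -i`) against the versions required by the
# platform's HCL. Required versions and NIC names below are placeholders.
import subprocess

REQUIRED_FIRMWARE = {           # driver -> minimum firmware string (assumed)
    "ice": "4.30",              # Intel E810
    "mlx5_core": "22.36",       # Mellanox ConnectX-5/6
}

def nic_info(dev: str) -> dict:
    """Parse `ethtool -i <dev>` output into a dict."""
    out = subprocess.run(["ethtool", "-i", dev],
                         capture_output=True, text=True, check=True).stdout
    return dict(line.split(": ", 1) for line in out.splitlines() if ": " in line)

for dev in ("eno1", "eno2"):
    info = nic_info(dev)
    driver, fw = info.get("driver", "?"), info.get("firmware-version", "?")
    required = REQUIRED_FIRMWARE.get(driver)
    print(f"{dev}: driver={driver} firmware={fw} required>={required or 'n/a'}")
    # A real audit would also record the results in DCIM/CMDB so that
    # firmware drift can be tracked across the fleet over time.
```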
9. Rack Layout and Physical Security
"Show us the proposed rack layout for the new platform cluster. Where do the leaf switches sit in the rack? Where are the servers positioned? Is the layout optimized for airflow (switches at top, servers below, no gaps without blanking panels)? How does the rack layout accommodate dual PDU power feeds? Does the rack layout support rear cable management arms, or are fixed cable management trays used?"
Purpose: Validates that the physical deployment is planned to data center standards, not improvised during installation. The answer should include a rack elevation diagram showing U-positions for every device, PDU placement, and cable management positions.
10. Future Proofing and Scaling
"The current requirement is 5,000 VMs on 130 servers. If demand grows to 10,000 VMs, what physical changes are needed? Can we add leaf switches to the existing spines without reconfiguring the fabric? At what point do we need to add more spines? Is the cabling infrastructure (fiber trunks, patch panels) designed to accommodate the additional leaves, or will we need new trunk cables? Have we sized the power and cooling infrastructure for this growth?"
Purpose: Ensures the physical infrastructure is designed for growth, not just for day-one requirements. The answer should reference the spine switch's remaining port capacity (e.g., "our spines have 18 of 32 ports unused, so we can add 18 more leaf switches without touching the spines") and the data center's power/cooling headroom.
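The growth-headroom part of this answer can be checked with the same simple arithmetic used for the breakout-cable takeaway; the spine port count and leaf count below match the illustrative values used elsewhere in this document, and the breakout assumption is called out inline.

```python
# Growth headroom sketch: how many more leaves fit on the existing spines,
# with and without breakout cables? Port counts are illustrative.

spine_ports = 32
leaves_today = 14

# Without breakouts: one spine port per leaf.
headroom_direct = spine_ports - leaves_today

# With 4x breakout (e.g. 400G -> 4x 100G): four leaf uplinks per spine port.
headroom_breakout = spine_ports * 4 - leaves_today

print(f"additional leaves without breakout : {headroom_direct}")    # 18
print(f"additional leaves with 4x breakout : {headroom_breakout}")  # 114
# Doubling from 130 to ~260 servers needs roughly 14 more leaves at the
# current servers-per-leaf density, which fits without breakouts here;
# power, cooling, and fiber trunk capacity become the binding constraints.
```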