Storage Architectures
Why This Matters
The choice of storage architecture -- SAN, NAS, or HCI/SDS -- is the single most consequential infrastructure decision in this platform migration. It determines the physical bill of materials, the operational skill set required, the failure domains, the performance ceiling, and the day-2 operational model for the next 5-10 years. In the current VMware environment, the storage architecture is already HCI (vSAN), potentially supplemented by external SAN for specific workloads. Each candidate platform makes a different architectural bet:
- OVE bets on HCI via Ceph/ODF, with optional external SAN/NAS via CSI drivers.
- Azure Local bets on HCI via Storage Spaces Direct (S2D), with optional external SAN via iSCSI/FC passthrough.
- Swisscom ESC bets on traditional SAN (Dell PowerMax/PowerStore behind VxBlock), fully managed and invisible to the customer.
Understanding the architectural trade-offs between SAN, NAS, and HCI is essential for three reasons:
- Informed vendor challenge. When a vendor says "our HCI replaces your SAN," the evaluation team must understand what is actually being replaced (dedicated storage controllers, deterministic latency, mature snapshot/replication) and what new complexities are being introduced (distributed consensus, rebalancing storms, capacity overhead from replication).
- Hybrid architecture design. The answer may not be "all SAN" or "all HCI." Some workloads (high-frequency trading databases, Oracle RAC on shared storage) may justify retaining external SAN alongside HCI for the bulk of VMs. The decision framework must be explicit.
- Operational model shift. SAN is operated by storage administrators using vendor-specific tools (NetApp System Manager, Pure1, Dell Unisphere). HCI is operated by platform engineers using Kubernetes operators (Rook-Ceph) or PowerShell (S2D). NAS is operated by file services teams. The organizational impact is as significant as the technical change.
This page covers the three architectures in depth, with particular emphasis on HCI/SDS because that is the model used by the two self-operated candidates (OVE and Azure Local).
Concepts
1. SAN (Storage Area Network)
Architecture Overview
A SAN is a dedicated, purpose-built network that connects servers to centralized block storage arrays. The defining characteristic is separation of concerns: compute nodes contain no persistent storage; all data lives on purpose-built storage controllers with enterprise-grade data services (snapshots, replication, deduplication, encryption, QoS). The SAN fabric provides deterministic, low-latency connectivity between compute and storage.
SAN Architecture -- Dual-Fabric Topology (Standard Enterprise)
================================================================
Compute Nodes (Servers / Hypervisor Hosts)
+-----------+ +-----------+ +-----------+ +-----------+
| Host 01 | | Host 02 | | Host 03 | | Host 04 |
| [HBA-A]---|--+-----------+--+-----------+--+-- Fabric A |
| [HBA-B]---|--+-----------+--+-----------+--+-- Fabric B |
+-----------+ +-----------+ +-----------+ +-----------+
SAN Fabric A (Primary) SAN Fabric B (Secondary)
+---------------------+ +---------------------+
| FC Switch A1 | | FC Switch B1 |
| (Brocade G720 / | | (Brocade G720 / |
| Cisco MDS 9148T) | | Cisco MDS 9148T) |
+-----+-----+---------+ +-----+-----+---------+
| | | |
| +------ ISL ------+ | +------ ISL ------+
| | | |
+-----+-----+---------+          +-----+-----+---------+
| FC Switch A2        |          | FC Switch B2        |
| (edge / director)   |          | (edge / director)   |
+-----+-----+---------+          +-----+-----+---------+
      |     |                          |     |
      v     v                          v     v
+---------------------+          +---------------------+
| Storage Array       |          | Storage Array       |
| Controller A        |          | Controller B        |
| (active)            |          | (standby/active)    |
+---------------------+          +---------------------+
+-----------------------------------------------+
| Disk Shelves (SSD / NVMe / HDD)               |
| [shelf-1] [shelf-2] [shelf-3] [shelf-4]       |
+-----------------------------------------------+
Key design principles:
- Two independent fabrics (A and B) -- no single point of failure
- Each host has two HBAs, one per fabric
- Each storage controller has ports on both fabrics
- ISLs (Inter-Switch Links) connect switches within a fabric
- No cross-connections between Fabric A and Fabric B
- Dual-controller array: active/active or active/standby
SAN Components
Host Bus Adapters (HBAs): Dedicated PCIe cards that provide Fibre Channel connectivity from the server to the SAN fabric. Each HBA port has a globally unique World Wide Port Name (WWPN) and World Wide Node Name (WWNN), analogous to a MAC address on Ethernet. Modern HBAs operate at 32 Gbps FC (Gen 6) or 64 Gbps FC (Gen 7). Each server requires at least two HBA ports -- ideally on separate physical cards -- for multipath redundancy, one per fabric.
Fibre Channel Switches: Purpose-built network switches that forward FC frames between HBAs and storage array ports. Major vendors are Brocade (Broadcom) and Cisco (MDS series). Switches operate at wire speed with cut-through forwarding and deterministic latency (typically 2-5 microseconds per hop). Enterprise deployments use director-class switches (Brocade X7-8, Cisco MDS 9700) with hundreds of ports and redundant control planes. Smaller environments use fixed-port switches (Brocade G720, Cisco MDS 9148T).
Storage Arrays: The centralized storage controllers that own and manage all persistent data. A storage array consists of:
| Component | Function |
|---|---|
| Controllers (2 or more) | Process I/O, run data services (snapshots, replication, dedup), manage cache |
| Cache (DRAM + NVMe write buffer) | Absorb writes, accelerate reads; 64-512 GB per controller is typical |
| Backend connectivity | SAS/NVMe connections to disk shelves |
| Disk shelves | House the physical media (NVMe, SSD, HDD); connected via dual-path SAS or NVMe fabric |
| Management interfaces | Out-of-band management (REST API, GUI, CLI) for provisioning and monitoring |
SAN Vendors Relevant to Financial Enterprises:
| Vendor | Platform | Architecture | Sweet Spot |
|---|---|---|---|
| NetApp | AFF A-Series / C-Series | Unified (block + file), ONTAP OS, active/active controllers | Environments needing block + NFS + SnapMirror replication |
| Pure Storage | FlashArray//X, //XL, //E | All-flash, NVMe-native, active/active, Evergreen subscription | Simplicity-focused, high-IOPS, no tuning needed |
| Dell | PowerStore, PowerMax | PowerStore: unified mid-range; PowerMax: enterprise block | PowerMax for extreme scale; PowerStore for mixed workloads |
| HPE | Alletra 9000 / MP | Alletra 9000 (Primera heritage): mission-critical block | Environments requiring guaranteed latency SLAs |
Zoning and LUN Masking
Zoning and LUN masking are the two layers of access control in a SAN. They are conceptually similar to network firewall rules and application-level authentication, respectively.
Zoning (Fabric-Level Access Control): Zoning restricts which HBA ports can communicate with which storage ports within a fabric. Without zoning, every HBA port can see every storage port -- a security and stability risk.
Zoning Example -- Fabric A
============================
Zone: Zone_Host01_ArrayA
Members:
- 21:00:00:1b:32:a1:00:01 (Host01 HBA-A WWPN)
- 50:00:09:73:f0:10:00:0a (Array Controller-A Port 1)
Zone: Zone_Host02_ArrayA
Members:
- 21:00:00:1b:32:a2:00:01 (Host02 HBA-A WWPN)
- 50:00:09:73:f0:10:00:0a (Array Controller-A Port 1)
Zone: Zone_Host03_ArrayA
Members:
- 21:00:00:1b:32:a3:00:01 (Host03 HBA-A WWPN)
- 50:00:09:73:f0:10:00:0b (Array Controller-A Port 2)
Zoneset: Production_FabricA
Active zones: Zone_Host01_ArrayA,
Zone_Host02_ArrayA,
Zone_Host03_ArrayA
Zoning types:
- WWPN zoning (name-based): recommended; the zone follows the HBA regardless of which switch port it is cabled to
- Port zoning (switch-port-based): tied to the physical switch port; breaks if a cable is moved
- Smart zoning (Cisco) / peer zoning (Brocade): optimizations that reduce zone database size for single-initiator configurations
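The single-initiator zoning pattern in the example above is mechanical enough to generate. The sketch below builds one zone per (host HBA, array port) pair from a WWPN inventory; the zone-naming convention and data structures are hypothetical, not any vendor's required syntax.

```python
# Hypothetical WWPN inventory for Fabric A, mirroring the example above.
HOSTS = {
    "Host01": "21:00:00:1b:32:a1:00:01",  # Host01 HBA-A WWPN
    "Host02": "21:00:00:1b:32:a2:00:01",
    "Host03": "21:00:00:1b:32:a3:00:01",
}
ARRAY_PORTS = {
    "ArrayA_P1": "50:00:09:73:f0:10:00:0a",
    "ArrayA_P2": "50:00:09:73:f0:10:00:0b",
}

def build_zones(hosts, array_ports):
    """One zone per (host, array port) pair: single-initiator zoning."""
    zones = {}
    for host, hba_wwpn in hosts.items():
        for port, port_wwpn in array_ports.items():
            zones[f"Zone_{host}_{port}"] = [hba_wwpn, port_wwpn]
    return zones

zones = build_zones(HOSTS, ARRAY_PORTS)
# 3 hosts x 2 array ports = 6 zones, each with exactly two members
```

In practice the generated zone list would be pushed to both fabrics via the switch CLI or REST API; keeping the inventory in code is what makes the dual-fabric step repeatable.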
LUN Masking (Array-Level Access Control): LUN masking controls which hosts can access which LUNs (Logical Unit Numbers) on the storage array. Even if zoning allows an HBA to talk to a storage port, LUN masking restricts which LUNs that HBA can discover and use.
LUN Masking Workflow
=====================
1. Create LUN on array:
lun create -name db_prod_01 -size 500GB -pool ssd_tier1
2. Create host object (identifies the server to the array):
host create -name esxi-host-01 \
-initiators 21:00:00:1b:32:a1:00:01,21:00:00:1b:32:a1:00:02 \
-type vmware
3. Map LUN to host (LUN masking):
lun map -name db_prod_01 -host esxi-host-01 -lun-id 5
4. Result: Host01 sees LUN 5 (500 GB) when it scans the fabric
Other hosts see nothing -- LUN is masked from them
Host Group (for shared-access scenarios like VMFS):
hostgroup create -name esxi-cluster-01 \
-hosts esxi-host-01,esxi-host-02,esxi-host-03,esxi-host-04
lun map -name vmfs_shared_01 -hostgroup esxi-cluster-01 -lun-id 10
-> All four hosts see LUN 10 (shared VMFS datastore)
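The masking behaviour in that workflow can be modelled in a few lines: a LUN is invisible to a host until an explicit mapping exists. The class below is an illustrative model of the array's masking table, not a real vendor API; object names mirror the CLI example above.

```python
class Array:
    """Toy model of array-side LUN masking state."""

    def __init__(self):
        self.luns = {}        # LUN name -> size in GB
        self.hosts = {}       # host name -> set of initiator WWPNs
        self.mappings = {}    # (host name, LUN ID) -> LUN name

    def lun_create(self, name, size_gb):
        self.luns[name] = size_gb

    def host_create(self, name, initiators):
        self.hosts[name] = set(initiators)

    def lun_map(self, lun, host, lun_id):
        self.mappings[(host, lun_id)] = lun

    def visible_luns(self, host):
        """What this host discovers on a fabric rescan."""
        return {lun_id: lun for (h, lun_id), lun in self.mappings.items() if h == host}

arr = Array()
arr.lun_create("db_prod_01", 500)
arr.host_create("esxi-host-01",
                ["21:00:00:1b:32:a1:00:01", "21:00:00:1b:32:a1:00:02"])
arr.lun_map("db_prod_01", "esxi-host-01", lun_id=5)

arr.visible_luns("esxi-host-01")   # {5: 'db_prod_01'}
arr.visible_luns("esxi-host-02")   # {} -- masked, nothing to discover
```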
LUN Presentation and Multipathing
When a host has two HBAs (one per fabric) and the storage array has controllers on both fabrics, the host discovers the same LUN through multiple paths. Multipathing software (dm-multipath on Linux, Microsoft MPIO on Windows, VMware NMP) aggregates these paths into a single logical device.
Multipath Topology -- Single LUN, Four Paths
===============================================
Host: esxi-host-01
|
+-- HBA-A (Fabric A) ------> Array Controller-A, Port 1
| |
| +-> LUN 5 (optimized path)
|
+-- HBA-A (Fabric A) ------> Array Controller-B, Port 1
| |
| +-> LUN 5 (non-optimized path)
|
+-- HBA-B (Fabric B) ------> Array Controller-A, Port 2
| |
| +-> LUN 5 (optimized path)
|
+-- HBA-B (Fabric B) ------> Array Controller-B, Port 2
|
+-> LUN 5 (non-optimized path)
ALUA (Asymmetric Logical Unit Access):
- LUN 5 is "owned" by Controller-A
- Paths through Controller-A are "Active/Optimized" (AO)
- Paths through Controller-B are "Active/Non-Optimized" (ANO)
- I/O normally flows through AO paths
- If Controller-A fails, ANO paths become AO (controller failover)
Multipath policies:
- Round Robin: alternates I/Os across active paths (best throughput)
- Fixed: uses one path, failover to another on failure
- MRU: most recently used path (avoids unnecessary path changes)
- ALUA-aware: respects AO/ANO status, load-balances within AO paths
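The interaction between the ALUA states and the multipath policy can be sketched as a small scheduler: round-robin across Active/Optimized paths only, with ANO paths promoted on controller failover. This is a simplified illustration under assumed path names, not any vendor's MPIO implementation.

```python
from itertools import cycle

class MultipathDevice:
    def __init__(self, paths):
        # paths: list of (name, state), state is "AO" or "ANO"
        self.paths = dict(paths)
        self._rebuild()

    def _rebuild(self):
        active = [p for p, s in self.paths.items() if s == "AO"]
        # Fall back to non-optimized paths only if no AO path remains.
        self._rr = cycle(active or [p for p, s in self.paths.items() if s == "ANO"])

    def next_path(self):
        return next(self._rr)

    def fail_controller(self, prefix):
        """Controller failover: its paths vanish; surviving ANO paths become AO."""
        for p in list(self.paths):
            if p.startswith(prefix):
                del self.paths[p]
            else:
                self.paths[p] = "AO"
        self._rebuild()

dev = MultipathDevice([
    ("ctlA_p1", "AO"), ("ctlA_p2", "AO"),     # LUN owned by Controller-A
    ("ctlB_p1", "ANO"), ("ctlB_p2", "ANO"),
])
picks = [dev.next_path() for _ in range(4)]   # alternates across the two AO paths
dev.fail_controller("ctlA")
after = dev.next_path()                       # now a former ANO path on Controller-B
```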
Thin Provisioning at the Array Level
Modern storage arrays support thin provisioning at the pool level, distinct from thin provisioning within the guest OS (LVM thin pools) or at the hypervisor level (thin VMDKs):
Array-Level Thin Provisioning
===============================
Storage Pool: ssd_tier1 (physical capacity: 50 TB)
|
+-- LUN: db_prod_01 provisioned: 500 GB used: 120 GB
+-- LUN: db_prod_02 provisioned: 500 GB used: 85 GB
+-- LUN: vmfs_shared_01 provisioned: 10 TB used: 3.2 TB
+-- LUN: vmfs_shared_02 provisioned: 10 TB used: 4.1 TB
+-- LUN: oracle_rac_01 provisioned: 2 TB used: 800 GB
+-- LUN: oracle_rac_02 provisioned: 2 TB used: 750 GB
...
Total provisioned: 120 TB (2.4x overcommit ratio)
Total used: 38 TB (76% physical utilization)
Free physical: 12 TB
Warning thresholds:
- 80% physical: alert
- 90% physical: critical alert + auto-pause thin provisioning
- 95% physical: emergency -- risk of pool full, I/O errors
The danger: if total used exceeds 50 TB, the pool is physically
full and new writes FAIL. This is the "thin provisioning time bomb."
Financial environments must monitor this and set conservative thresholds.
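The pool arithmetic and threshold logic above are worth automating rather than eyeballing. A minimal sketch, using the ssd_tier1 figures from the example (the function name and return shape are illustrative):

```python
def pool_status(physical_tb, provisioned_tb, used_tb,
                thresholds=(0.80, 0.90, 0.95)):
    """Overcommit ratio, physical utilization, and alert level for a thin pool."""
    overcommit = provisioned_tb / physical_tb
    utilization = used_tb / physical_tb
    warn, crit, emerg = thresholds
    if utilization >= emerg:
        level = "EMERGENCY"   # risk of pool full, I/O errors
    elif utilization >= crit:
        level = "CRITICAL"    # auto-pause further thin provisioning
    elif utilization >= warn:
        level = "ALERT"
    else:
        level = "OK"
    return {"overcommit": overcommit, "utilization": utilization, "level": level}

pool_status(physical_tb=50, provisioned_tb=120, used_tb=38)
# -> 2.4x overcommit, 76% utilization, level "OK" (just below the 80% alert line)
```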
SAN Management Workflow
The end-to-end workflow for provisioning a new LUN to a host on a SAN illustrates the operational complexity:
SAN Provisioning Workflow (New LUN for a VM Cluster)
======================================================
Step 1: Capacity planning
- Identify target pool (SSD tier, HDD tier)
- Check free physical capacity and overcommit ratio
- Determine RAID/protection level
Step 2: Create LUN on storage array
- Vendor GUI / CLI / REST API
- Set size, thin/thick, tiering policy, QoS limits
- Assign to storage pool
Step 3: Create zoning on BOTH fabrics
- Log into Fabric A switch: create zone, add to active zoneset
- Log into Fabric B switch: create zone, add to active zoneset
- Activate zonesets (causes brief fabric reconfiguration)
Step 4: Create LUN masking on storage array
- Identify host/hostgroup by WWPN
- Map LUN with a specific LUN ID
- Verify host can discover the LUN
Step 5: Host-side discovery
- Rescan HBA (Linux: echo "- - -" > /sys/class/scsi_host/hostX/scan)
- Verify multipath device appears (multipathd show paths)
- Verify ALUA path states
Step 6: Present to hypervisor
- VMware: create VMFS datastore on the new LUN
- Or: use as RDM (Raw Device Mapping) for direct VM access
Step 7: Document in CMDB
- Record WWPN mappings, zone names, LUN IDs, multipath policy
Total time: 30-90 minutes per LUN (manual process)
Automation potential: high (Ansible, Terraform, vendor REST APIs)
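That automation potential can be shown as an orchestration skeleton. Every helper below is a hypothetical stub standing in for a vendor REST call or switch CLI session; the point is the step ordering and the fact that zoning must touch both fabrics.

```python
AUDIT = []   # ordered record of what the workflow did

def step(msg):
    # Stand-in for a real API/CLI action plus audit logging.
    AUDIT.append(msg)

def provision_lun(name, size_gb, pool, hostgroup, lun_id, fabrics=("A", "B")):
    step(f"plan: check capacity and overcommit in pool {pool}")
    step(f"array: create LUN {name} ({size_gb} GB) in {pool}")
    for fabric in fabrics:                           # zoning on BOTH fabrics
        step(f"fabric {fabric}: create zone + activate zoneset")
    step(f"array: map {name} to {hostgroup} as LUN {lun_id}")
    step("hosts: rescan HBAs, verify multipath and ALUA states")
    step(f"hypervisor: create datastore on LUN {lun_id}")
    step("cmdb: record WWPNs, zone names, LUN ID, multipath policy")
    return AUDIT

provision_lun("vmfs_shared_01", 10240, "ssd_tier1", "esxi-cluster-01", lun_id=10)
```

Replacing each stub with an Ansible module or REST call turns the 30-90-minute manual process into a repeatable, auditable pipeline.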
Performance Characteristics
SAN storage delivers the most deterministic performance of any architecture because:
- Dedicated bandwidth. The FC fabric carries only storage traffic -- no competition with management, vMotion, or tenant network traffic.
- Deterministic latency. FC switches use credit-based flow control (buffer-to-buffer credits), which prevents congestion drops. There is no TCP congestion window, no retransmissions, no head-of-line blocking. Latency is predictable and consistent.
- Controller-side intelligence. Enterprise arrays have massive DRAM caches (64-512 GB per controller), NVMe write journals, and intelligent prefetch algorithms. Hot data is served from cache at DRAM speed.
- No replication overhead on the write path. Unlike HCI, a SAN write goes to one storage controller (which handles internal redundancy via RAID or erasure coding within the array). There is no cross-node network replication on every write.
Typical SAN performance numbers (single array, all-flash):
| Metric | Mid-Range (PowerStore, AFF A250) | High-End (PowerMax, AFF A900) |
|---|---|---|
| Random 4K read IOPS | 200K - 500K | 1M - 10M |
| Random 4K write IOPS | 100K - 300K | 500K - 5M |
| Read latency (avg) | 100 - 300 us | 50 - 150 us |
| Write latency (avg) | 100 - 500 us | 50 - 200 us |
| Sequential throughput | 10 - 25 GB/s | 50 - 150 GB/s |
| Max capacity (usable) | 100 - 500 TB | 500 TB - 4 PB |
SAN in a VM Environment
In a VMware environment, SAN LUNs are consumed in two ways:
- VMFS Datastores: A LUN is formatted with VMFS (VMware's cluster filesystem) and shared across all ESXi hosts in the cluster. Multiple VMDKs from different VMs reside on the same VMFS datastore. This is the standard approach for most workloads.
- Raw Device Mappings (RDMs): A LUN is mapped directly to a single VM, bypassing VMFS. Used for workloads that need raw device access (some databases, Microsoft Failover Clustering with shared disks). RDMs come in physical mode (VM sees the full SCSI interface) and virtual mode (VM sees a VMFS-like abstraction).
Boot from SAN: Servers can boot their hypervisor OS from a SAN LUN, eliminating the need for local disks entirely. This simplifies server hardware (diskless blade servers) but adds a dependency on SAN availability for basic compute. Boot from SAN is common in Tier-1 financial environments with existing SAN infrastructure.
When SAN Still Makes Sense
Despite the industry trend toward HCI, external SAN remains justified in specific scenarios:
| Scenario | Why SAN Wins |
|---|---|
| Regulatory mandate | Some regulators or audit frameworks require that storage infrastructure is a separate failure domain from compute. SAN provides physical separation. |
| Existing investment | If you have a recently purchased (< 3 years) SAN array with active support contracts, migrating its workloads to HCI gains nothing and wastes CapEx. |
| Performance-critical databases | Workloads requiring sub-100-us latency, deterministic QoS, and zero noisy-neighbor risk (e.g., high-frequency trading, real-time risk calculations) benefit from dedicated SAN. |
| Oracle RAC / shared-disk clusters | Clustered databases using shared raw block devices (RDMs) with fencing rely on SAN semantics that HCI does not natively provide. |
| Very large single volumes | Volumes > 16 TB that need efficient snapshots and replication are better served by array-native data services than by distributed HCI replication. |
| Disaggregated scaling | When compute and storage need to scale independently (e.g., adding 50 TB of storage without adding compute nodes), SAN decouples the two. |
2. NAS (Network Attached Storage)
Architecture Overview
A NAS system is a file server appliance that serves files over standard Ethernet using file-level protocols (NFS, SMB/CIFS). Unlike SAN (which provides raw block devices), NAS provides a shared filesystem -- clients mount directories and access files through standard POSIX or SMB semantics. The storage appliance owns the filesystem and handles locking, permissions, and data protection.
NAS Architecture -- Enterprise Deployment
===========================================
Clients (VMs, Hypervisors, Application Servers)
+-----------+ +-----------+ +-----------+ +-----------+
| Client 01 | | Client 02 | | Client 03 | | Client 04 |
| NFS/SMB | | NFS/SMB | | NFS/SMB | | NFS/SMB |
+-----+-----+ +-----+-----+ +-----+-----+ +-----+-----+
| | | |
+------+-------+------+------+------+-------+
| | |
+------+------+ +----+-----+ +-----+------+
| Ethernet | | Ethernet | | Ethernet |
| Switch L1 | | Switch | | Switch L2 |
| (25/100GbE) | | (core) | | (25/100GbE)|
+------+------+ +----+-----+ +-----+------+
| | |
+------+------+------+------+
| |
+-----+------+ +---+--------+
| NAS Head 1 | | NAS Head 2 |
| (active) | | (standby / |
| | | active) |
+-----+------+ +---+--------+
| |
+-----+-------------+------+
| Internal Disk Shelves |
| (SSD / HDD / hybrid) |
| [shelf-1] [shelf-2] ... |
+---------------------------+
NAS protocols:
NFS (TCP port 2049): POSIX file semantics, used by Linux/Unix/VMware
SMB (TCP port 445): Windows file semantics, used by Windows VMs/clients
Network:
- Shared Ethernet (same physical network as other traffic, or VLAN-separated)
- No dedicated fabric required (unlike SAN/FC)
- Jumbo frames (MTU 9000) recommended for performance
NAS Protocols: NFS vs SMB
| Dimension | NFS | SMB (SMB2/SMB3) |
|---|---|---|
| Native OS | Linux, Unix, macOS, VMware ESXi | Windows, macOS, Linux (via cifs.ko) |
| Filesystem semantics | POSIX (uid/gid, mode bits, ACLs via NFSv4) | Windows ACLs (NTFS-style DACL/SACL) |
| Locking | Advisory (NFSv3), mandatory (NFSv4) | Mandatory (oplocks, leases) |
| Authentication | AUTH_SYS (uid/gid, insecure), Kerberos (NFSv4) | NTLM, Kerberos (integrated with AD) |
| Encryption | Kerberos privacy (krb5p) for NFSv4 | SMB3 encryption (AES-128/256-GCM) |
| Multichannel | Not natively (client-side bonding instead) | SMB3 Multichannel (built-in NIC aggregation) |
| Typical use case | VMware NFS datastores, Linux app data, Kubernetes NFS provisioner | Windows file shares, DFS namespaces, user home drives |
When to use NFS: VMware NFS datastores, Linux application shared data, Kubernetes PVs via NFS provisioner, cross-platform file sharing in Linux-dominated environments.
When to use SMB: Windows VM file shares, Active Directory-integrated environments, user home directories for Windows desktops, SQL Server filegroups on SMB shares (supported since SQL 2012).
NAS for VM Storage
NFS is used as a VM storage backend in two scenarios:
VMware NFS Datastores: ESXi mounts NFS exports as datastores. VMDKs are stored as regular files on the NFS volume. This is operationally simpler than SAN (no zoning, no LUN masking, no multipath configuration), but performance depends on the NFS implementation and network.
VMware NFS Datastore Architecture
====================================
ESXi Host 01 ESXi Host 02 ESXi Host 03
+------------------+ +------------------+ +------------------+
| NFS Client | | NFS Client | | NFS Client |
| (vmkernel port) | | (vmkernel port) | | (vmkernel port) |
+--------+---------+ +--------+---------+ +--------+---------+
| | |
+----------+--------------+----------+--------------+
| |
+-----+------+ +-----+------+
| NFS Server | | NFS Server |
| (NetApp | | (NetApp |
| node 1) | | node 2) |
+-----+------+ +-----+------+
| |
+-----+-------------------------+------+
| Shared Storage Pool |
| /vol/vmware_ds01 (NFS export) |
| vm01-flat.vmdk |
| vm02-flat.vmdk |
| vm03-flat.vmdk |
| /vol/vmware_ds02 (NFS export) |
| vm04-flat.vmdk |
| ... |
+--------------------------------------+
Advantages over SAN:
- No zoning, no LUN masking, no HBA configuration
- VMDKs are individual files (easy to list, copy, manage)
- Storage vMotion between NFS datastores is a file copy
- Easy to add datastores (just mount a new NFS export)
Disadvantages vs SAN:
- Higher latency (NFS/TCP overhead vs FC direct)
- Shared Ethernet bandwidth (unless dedicated NFS VLAN)
- No RDM equivalent (cannot pass raw device to VM)
- NFS locking complexity for certain workloads
Kubernetes NFS Provisioner: In a Kubernetes environment, NFS is consumed via CSI drivers (e.g., NFS Subdir External Provisioner, NetApp Trident for ONTAP NFS) to dynamically provision PersistentVolumes. Each PVC creates a subdirectory on the NFS export. This is simple but has performance limitations (NFS metadata overhead, single-server bottleneck for non-scale-out NAS).
Scale-Out NAS vs Single-Controller NAS
| Aspect | Single-Controller / HA Pair | Scale-Out NAS |
|---|---|---|
| Architecture | One or two controllers (active/standby or active/active) | Many nodes (4-100+) forming a distributed filesystem |
| Throughput scaling | Limited by controller CPU and network ports | Scales linearly with nodes |
| Capacity scaling | Add disk shelves to existing controllers | Add nodes (compute + storage together) |
| Failure impact | Controller failure = failover (brief interruption) | Node failure = redistributed load (minimal impact) |
| Examples | NetApp FAS/AFF (HA pair mode), QNAP, Synology | NetApp ONTAP (cluster mode), Dell PowerScale (Isilon), VAST Data |
| Use case | Small-medium file shares, VMware NFS datastores | Large-scale analytics, media, research, massive file counts |
Performance Characteristics vs SAN
NAS performance is fundamentally limited by protocol overhead and shared network bandwidth:
Latency Comparison: SAN vs NAS
================================
Operation: 4K random read
SAN (FC):
App -> Guest kernel -> virtio-blk -> QEMU -> Host block layer
-> HBA -> FC Switch (2-5 us) -> Array controller -> Cache/SSD
Total: 100-300 us
NAS (NFS over 25GbE):
App -> Guest kernel -> NFS client -> RPC/XDR serialization
-> TCP/IP stack -> NIC -> Ethernet switch -> NAS head
-> NFS server -> Filesystem -> Cache/SSD
Total: 300-800 us
Why NAS is slower:
1. RPC/XDR serialization overhead: ~10-30 us
2. TCP/IP processing: ~5-20 us (vs FC credit-based flow: ~2-5 us)
3. NFS server filesystem layer: ~10-50 us
4. Shared Ethernet bandwidth: potential queueing delay
5. No equivalent of FC buffer-to-buffer credits (TCP uses congestion windows)
When this gap does NOT matter:
- Sequential I/O (throughput-bound, not latency-bound)
- Workloads with > 5 ms tolerance (file servers, web servers, logs)
- NFS over RDMA (eliminates TCP overhead, latency approaches SAN)
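The latency gap above is just the sum of the extra protocol layers. The sketch below adds up representative values taken from within the per-component ranges quoted in the comparison -- they are illustrative midpoints, not measurements.

```python
# Representative per-component latencies (microseconds) for a 4K random
# read, drawn from the ranges in the comparison above.
SAN_US = {
    "guest + hypervisor + hba": 5,
    "fc fabric (per hop)": 3,
    "array controller + cache/ssd": 150,
}
NAS_US = {
    "guest + nfs client": 10,
    "rpc/xdr serialization": 20,
    "tcp/ip stack + nic": 12,
    "ethernet switch": 5,
    "nfs server + filesystem": 30,
    "cache/ssd": 150,
}

san_total = sum(SAN_US.values())   # 158 us
nas_total = sum(NAS_US.values())   # 227 us
# The media cost (cache/SSD) is identical; the whole gap is protocol stack.
```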
NAS Vendors
| Vendor | Platform | Strengths | Use Case |
|---|---|---|---|
| NetApp ONTAP | AFF / FAS (HA-pair and cluster mode) | Unified block+file, SnapMirror, FlexClone, multi-protocol | Primary enterprise NAS, VMware NFS datastores |
| Dell PowerScale (Isilon) | Scale-out NAS cluster | Massive scale (100+ nodes, 100+ PB), parallel throughput | Analytics, media, large unstructured data sets |
| VAST Data | Universal Storage | NFS, SMB, S3 on one platform, disaggregated shared-nothing | Next-gen unified storage for AI/ML and enterprise |
| QNAP / Synology | Desktop/rackmount NAS | Low cost, easy management, consumer-grade | Lab, dev/test, non-production file shares |
3. HCI / Software-Defined Storage (SDS)
This is the most critical section for the evaluation. Both OVE (via Ceph/ODF) and Azure Local (via S2D) use HCI as their primary storage model. Understanding HCI internals -- replication mechanics, consistency models, failure handling, write paths -- is essential for evaluating their claims and planning PoCs.
The HCI Concept
Hyper-Converged Infrastructure (HCI) eliminates the dedicated storage array by distributing storage across the same servers that run compute workloads. Each node contributes its local disks to a shared storage pool managed by software-defined storage (SDS). The SDS layer presents this distributed pool as a single logical storage system to the hypervisor or container runtime.
HCI Architecture -- Compute + Storage Converged
=================================================
Traditional (3-Tier: Compute + Network + Storage)
==================================================
+--------+ +--------+ +--------+ +------------------+
| Compute| | Compute| | Compute| --> | SAN Fabric |
| Node 1 | | Node 2 | | Node 3 | | (FC Switches) |
| (no | | (no | | (no | +--------+---------+
| disks)| | disks)| | disks)| |
+--------+ +--------+ +--------+ +--------+---------+
| Storage Array |
| (controllers + |
| disk shelves) |
+------------------+
vs. HCI (Converged)
====================
+------------------+ +------------------+ +------------------+
| HCI Node 1 | | HCI Node 2 | | HCI Node 3 |
| | | | | |
| [Compute: VMs] | | [Compute: VMs] | | [Compute: VMs] |
| [SDS daemon] | | [SDS daemon] | | [SDS daemon] |
| [NVMe] [NVMe] | | [NVMe] [NVMe] | | [NVMe] [NVMe] |
| [SSD] [SSD] | | [SSD] [SSD] | | [SSD] [SSD] |
+--------+---------+ +--------+---------+ +--------+---------+
| | |
+----------+----------+----------+----------+
| |
Storage Network Storage Network
(dedicated VLAN (dedicated VLAN
or separate NICs) or separate NICs)
Key differences:
- No dedicated storage hardware (arrays, shelves, FC switches)
- Every node is both compute and storage
- SDS software creates a distributed storage pool
- Data is replicated across nodes for redundancy
- Scaling: add a node = add compute AND storage simultaneously
How SDS Works: Distributed Storage Pooling
Software-defined storage takes the local disks from each node and combines them into a cluster-wide storage pool. The mechanism varies by implementation, but the core concept is the same:
SDS Data Distribution -- Conceptual View
==========================================
Physical disks across 4 nodes:
Node 1: [NVMe-1a] [NVMe-1b] [SSD-1a] [SSD-1b]
Node 2: [NVMe-2a] [NVMe-2b] [SSD-2a] [SSD-2b]
Node 3: [NVMe-3a] [NVMe-3b] [SSD-3a] [SSD-3b]
Node 4: [NVMe-4a] [NVMe-4b] [SSD-4a] [SSD-4b]
SDS Pool (logical view):
+---------------------------------------------------------------+
| Distributed Storage Pool |
| Total raw: 16 disks x 3.84 TB = 61.44 TB |
| Usable (with 3-way replication): ~20 TB |
| |
| Volume: vm-boot-01 50 GB (replicated across 3 nodes) |
| Volume: vm-boot-02 50 GB (replicated across 3 nodes) |
| Volume: db-data-01 500 GB (replicated across 3 nodes) |
| Volume: db-data-02 500 GB (replicated across 3 nodes) |
| ... |
+---------------------------------------------------------------+
The SDS layer handles:
1. Splitting volumes into fixed-size chunks/objects (4 MB typical for Ceph)
2. Placing replicas of each chunk on different nodes (placement algorithm)
3. Routing I/O to the correct node for each chunk
4. Replicating writes to all replica nodes before acknowledging
5. Detecting and recovering from node/disk failures
6. Rebalancing data when nodes are added or removed
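Steps 1-3 of that list -- splitting a volume into chunks and routing I/O to the right node -- can be sketched as follows. The hash-based placement here is a deliberately simplified stand-in for Ceph's CRUSH algorithm or the S2D allocator; what it preserves is the key property that any client can compute a chunk's location without asking a central metadata server.

```python
CHUNK_SIZE = 4 * 1024 * 1024          # 4 MB, the typical Ceph object size
NODES = ["node1", "node2", "node3", "node4"]

def locate(volume, offset, rf=3):
    """Map a volume byte offset to (chunk index, primary node, replica nodes)."""
    chunk_idx = offset // CHUNK_SIZE
    # Deterministic pseudo-placement: every client computes the same answer.
    start = hash((volume, chunk_idx)) % len(NODES)
    replicas = [NODES[(start + i) % len(NODES)] for i in range(rf)]
    return chunk_idx, replicas[0], replicas[1:]

chunk, primary, secondaries = locate("db-data-01", offset=9_000_000)
# Offset 9 MB falls in chunk 2 (the third 4 MB chunk); rf=3 gives
# three distinct nodes, one primary and two replicas.
```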
Data Placement and Replication
Data placement is how the SDS decides where to store each piece of data. The goal is to distribute data evenly across nodes while ensuring replicas are on different failure domains.
Replica Factor (RF): The number of copies of each data chunk stored across the cluster.
| RF | Capacity Overhead | Fault Tolerance | Use Case |
|---|---|---|---|
| RF=2 (2-way mirror) | 50% usable (2x raw) | Tolerates 1 node failure | Development, non-critical workloads |
| RF=3 (3-way mirror) | 33% usable (3x raw) | Tolerates 2 node failures | Production, Tier-1 workloads |
| Erasure Coding (e.g., 4+2) | 67% usable (1.5x raw) | Tolerates 2 failures | Cold data, backups, archival storage |
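The capacity column of that table is simple arithmetic, applied here to the 61.44 TB raw pool from the example above. The function name and scheme strings ("rf3", "4+2") are illustrative conventions:

```python
def usable_tb(raw_tb, scheme):
    """Usable capacity under replication ("rf2", "rf3", ...) or EC ("4+2")."""
    if scheme.startswith("rf"):
        copies = int(scheme[2:])
        return raw_tb / copies                  # n full copies of everything
    k, m = (int(x) for x in scheme.split("+"))  # k data + m parity fragments
    return raw_tb * k / (k + m)

raw = 16 * 3.84          # 61.44 TB raw, as in the 4-node example
usable_tb(raw, "rf3")    # ~20.48 TB -- the "~20 TB" figure quoted above
usable_tb(raw, "rf2")    # 30.72 TB
usable_tb(raw, "4+2")    # 40.96 TB -- same 2-failure tolerance as rf3, twice the usable space
```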
Replica Placement Across Nodes:
3-Way Replication -- Data Distribution Example
================================================
Volume: db-data-01 (500 GB, split into 4 MB chunks)
Chunk 001: Primary: Node 1 Replica: Node 2 Replica: Node 4
Chunk 002: Primary: Node 2 Replica: Node 3 Replica: Node 1
Chunk 003: Primary: Node 3 Replica: Node 4 Replica: Node 2
Chunk 004: Primary: Node 4 Replica: Node 1 Replica: Node 3
Chunk 005: Primary: Node 1 Replica: Node 3 Replica: Node 4
...
Placement rules (Ceph CRUSH / S2D):
- No two replicas of the same chunk on the same node
- No two replicas on the same failure domain (rack, if configured)
- Distribute evenly: each node stores ~25% of data in a 4-node cluster
- Weight-based: nodes with more/larger disks store proportionally more
Visual distribution of chunk replicas:
          Node 1      Node 2      Node 3      Node 4
          ------      ------      ------      ------
C001:   [PRIMARY]   [replica]       -       [replica]
C002:   [replica]   [PRIMARY]   [replica]       -
C003:       -       [replica]   [PRIMARY]   [replica]
C004:   [replica]       -       [replica]   [PRIMARY]
C005:   [PRIMARY]       -       [replica]   [replica]
...
Each node holds roughly equal data volume.
Each node serves roughly equal I/O load.
No single node failure loses more than 1 copy of any chunk.
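Those placement invariants are exactly what a real placement engine (CRUSH, the S2D allocator) enforces. A sketch that checks them against the example chunk map above -- no co-located replicas, roughly even load per node:

```python
from collections import Counter

# The chunk map from the distribution example: chunk -> (primary, replica, replica)
PLACEMENT = {
    "C001": ("node1", "node2", "node4"),
    "C002": ("node2", "node3", "node1"),
    "C003": ("node3", "node4", "node2"),
    "C004": ("node4", "node1", "node3"),
    "C005": ("node1", "node3", "node4"),
}

def check(placement):
    """Verify placement rules; return copies stored per node."""
    load = Counter()
    for chunk, replicas in placement.items():
        # Rule: no two replicas of the same chunk on the same node.
        assert len(set(replicas)) == len(replicas), f"{chunk}: co-located replicas"
        load.update(replicas)
    return dict(load)

check(PLACEMENT)
# 5 chunks x 3 copies = 15 copies; each of the 4 nodes holds 3 or 4 of them
```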
Erasure Coding:
An alternative to replication that provides data protection with less capacity overhead. Data is split into k data fragments and m parity fragments. Any k fragments can reconstruct the original data.
Erasure Coding Example: k=4, m=2 (4+2)
=========================================
Original 4 MB chunk split into 4 data fragments + 2 parity fragments:
Fragment: D1 D2 D3 D4 P1 P2
Size: 1 MB 1 MB 1 MB 1 MB 1 MB 1 MB
Stored on: Node1 Node2 Node3 Node4 Node5 Node6
Capacity overhead: 6 MB stored for 4 MB of data = 1.5x (vs 3x for RF=3)
Fault tolerance: Any 2 fragments can be lost; data is recoverable
Trade-offs vs replication:
+ 50% less capacity overhead (1.5x vs 3x)
- Higher CPU cost (Galois field math for parity calculation)
- Higher read latency for degraded reads (must reconstruct from k fragments)
- Higher write latency (must compute parity and write to k+m nodes)
- Requires more nodes (at least k+m nodes for optimal placement)
Best for: cold data, backups, object storage, large sequential reads
Avoid for: OLTP databases, latency-sensitive block storage, small random I/O
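The reconstruction idea can be demonstrated with single-parity XOR (k=4, m=1). A real 4+2 scheme needs two independent parity fragments computed with Reed-Solomon over a Galois field -- the "higher CPU cost" noted above -- but with m=1 the same recover-from-survivors principle fits in a few lines:

```python
from functools import reduce

def encode(data: bytes, k: int = 4):
    """Split data into k fragments plus one XOR parity fragment (m=1)."""
    frag_len = len(data) // k
    frags = [data[i * frag_len:(i + 1) * frag_len] for i in range(k)]
    parity = bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*frags))
    return frags + [parity]            # k data fragments + 1 parity

def reconstruct(frags, lost):
    """XOR of all surviving fragments rebuilds the single lost one."""
    survivors = [f for i, f in enumerate(frags) if i != lost]
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*survivors))

data = bytes(range(256)) * 16          # a 4 KB "chunk"
frags = encode(data)                   # five 1 KB fragments on five nodes
assert reconstruct(frags, lost=2) == frags[2]   # degraded read: rebuild D3
```

Note the degraded-read cost is visible even here: recovering one fragment requires reading k surviving fragments, which is why EC reads are slower when a node is down.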
Consistency Models
Distributed storage systems must choose how to handle concurrent reads and writes across replicas. The consistency model determines what a reader sees when a writer is updating data simultaneously.
Strong Consistency (Linearizability): Every read returns the most recent write. All replicas agree on the current state before acknowledging the write. This is what block storage requires -- a VM must never read stale data from its own disk.
Both Ceph and S2D implement strong consistency for block storage:
- Ceph: Writes are acknowledged only after all replicas confirm. Reads go to the primary OSD, which always has the latest data.
- S2D: Writes are acknowledged after all mirror copies are written. Reads can come from any mirror (all are identical due to synchronous writes).
Eventual Consistency: Writes are acknowledged after some (but not all) replicas confirm. Other replicas catch up asynchronously. Readers may see stale data temporarily. This model is acceptable for object storage (S3) but NOT for block storage.
Quorum-Based Consistency:
A write succeeds if a majority (quorum) of replicas acknowledge. A read succeeds if a majority responds. By requiring W + R > N (write quorum + read quorum > total replicas), at least one reader is guaranteed to see the latest write. This is a middle ground used by some distributed databases but NOT by Ceph or S2D for block I/O (they use full synchronous replication instead).
Consistency Model Comparison
==============================
Strong Consistency (Ceph RBD, S2D block):
Write: Client -> Primary OSD -> [replicate to ALL replicas] -> ACK
Read: Client -> Primary OSD -> return latest data
Guarantee: read-after-write consistency, always
Cost: write latency = slowest replica
Eventual Consistency (Ceph RGW / S3-compatible):
Write: Client -> Primary -> ACK -> [replicate asynchronously]
Read: Client -> any replica -> may return stale data
Guarantee: all replicas converge eventually
Cost: lower write latency, but stale reads possible
Quorum (N=3, W=2, R=2):
Write: Client -> write to 3, ACK after 2 confirm
Read: Client -> read from 3, use value from 2 agreeing
Guarantee: W + R > N, so at least 1 reader sees latest write
Cost: balanced latency, but requires quorum logic
For VM block storage: ONLY strong consistency is acceptable.
A VM's filesystem assumes its disk is a single coherent device.
Stale reads would cause filesystem corruption.
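The quorum-intersection guarantee can be checked exhaustively for the N=3, W=2, R=2 example above; this is a small illustrative sketch, not production quorum logic:

```python
import itertools

N, W, R = 3, 2, 2   # the quorum example above

def overlap_guaranteed(n, w, r):
    """W + R > N guarantees every read quorum intersects every write quorum."""
    return w + r > n

# Exhaustive check for N=3: every write set of size W shares at
# least one replica with every read set of size R.
replicas = range(N)
for wset in itertools.combinations(replicas, W):
    for rset in itertools.combinations(replicas, R):
        assert set(wset) & set(rset), "quorums failed to overlap"

print(overlap_guaranteed(N, W, R))   # True
```

With W=1, R=1 the condition fails, which is exactly the configuration where a reader can miss the latest write.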
Write Path in a Distributed Storage System
Understanding the write path is critical for performance analysis and troubleshooting. Every VM write traverses multiple layers before being durable.
Write Path -- HCI with SDS (Generic Model)
============================================
1. VM Application issues write(fd, buf, 4096)
|
v
2. Guest Kernel: filesystem (ext4/XFS) -> bio -> virtio-blk driver
| ~1-5 us
v
3. Hypervisor: QEMU I/O thread receives virtio request
| -> translates to SDS client call ~2-5 us
v
4. SDS Client Library (e.g., librbd for Ceph; the CSV/Storage Spaces stack for S2D)
| a) Identify which chunk/slab this LBA belongs to
| b) Look up primary node for this chunk ~1-3 us
| c) Send write request to primary node
v
5. Primary Storage Node receives write
| a) Write to local journal (WAL / write-ahead log) ~10-50 us
| (fast NVMe device, sequential write, durable)
| b) Send replication requests to replica nodes
v
6. Replica Nodes (in parallel):
| a) Receive write over network ~5-30 us
| b) Write to local journal ~10-50 us
| c) Acknowledge to primary
v
7. Primary collects ALL replica acknowledgements
| (waits for slowest replica -- "tail latency") ~0-20 us
v
8. Primary acknowledges write to SDS client
| ~1-3 us
v
9. SDS Client returns completion to QEMU
| ~1-3 us
v
10. QEMU signals completion to guest virtio driver
| ~1-3 us
v
11. Guest kernel marks bio complete, returns to application
Total write latency (NVMe, 3-way replica, RDMA network):
Best case: ~100-200 us
Typical: ~200-500 us
Worst case: ~1-5 ms (during rebalancing or under load)
Journal flush (background, async):
- Periodically (every few seconds), journal entries are
flushed to the main data store (LSM tree / extent map)
- This is NOT on the write path -- it happens asynchronously
- If the journal fills up, writes stall until flush completes
(this is the "journal full" condition -- a major performance risk)
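As a rough sanity check, the per-step figures can be summed into an additive latency model. The values below are midpoints of the ranges in the diagram and are illustrative only; the model ignores queueing, so it approximates the best case, not the tail:

```python
# Additive model of the 11-step write path above (microseconds).
steps_us = {
    "guest_kernel":      3,    # step 2: fs -> bio -> virtio
    "qemu_and_client":   6,    # steps 3-4: QEMU + SDS client lookup
    "primary_journal":  30,    # step 5a: NVMe WAL write
    "replica_network":  20,    # steps 5b/6a: RDMA hop (parallel)
    "replica_journal":  30,    # step 6b: replica WAL (parallel)
    "tail_wait":        10,    # step 7: slowest replica
    "ack_path":          6,    # steps 8-11: completion chain
}
total = sum(steps_us.values())
print(f"modelled write latency: ~{total} us")
```

This lands at the bottom of the "best case" range above; real p99 latency is dominated by the slowest replica and queueing, which an additive model cannot capture.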
Journal / Write-Ahead Log (WAL):
The journal is the key performance mechanism in SDS. All writes first go to a fast, durable device (NVMe), which provides sequential write performance. Later, the data is flushed to the main data store in larger, more efficient batches.
Journal / WAL Mechanics
=========================
Write arrives -> Journal (NVMe WAL device)
|
| (sequential write, ~10-50 us)
|
v
Data is durable
(ACK sent to client)
|
| (async background flush)
| (triggered by: time interval, journal fullness, idle periods)
v
Main Data Store (larger SSD/NVMe capacity devices)
|
| (written as sorted runs / large extents)
| (more efficient than random writes)
v
Journal entry freed
Why this matters:
- Journal absorbs random writes as sequential writes (much faster)
- Journal device must be low-latency NVMe (not SATA SSD)
- Journal size determines how long writes can burst before stalling
- Ceph: WAL + DB on separate NVMe, data on larger SSD/HDD
- S2D: Cache tier (NVMe) acts as journal, capacity tier is SSD/HDD
Journal full scenario (performance cliff):
1. Burst of writes fills journal faster than background flush can drain it
2. New writes must wait for journal space -> latency spikes to 10-100 ms
3. Common causes: undersized journal, too few NVMe WAL devices, burst I/O
4. Mitigation: size journal to handle 30-60 seconds of peak write rate
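The 30-60 second sizing rule translates directly into a capacity formula. The `headroom` factor below is an assumed safety margin, not a vendor recommendation:

```python
def journal_size_gb(peak_write_gbs, burst_seconds=60, headroom=1.3):
    """WAL/journal capacity needed to absorb a burst before stalling.

    peak_write_gbs: peak write rate hitting this journal device (GB/s)
    burst_seconds:  burst duration to absorb (the 30-60 s rule above)
    headroom:       assumed safety margin (1.3 is illustrative)
    """
    return peak_write_gbs * burst_seconds * headroom

# A node peaking at 1.5 GB/s of writes, sized for a 60 s burst:
print(round(journal_size_gb(1.5)))   # 117 -> provision ~120 GB of WAL
```

The peak write rate should come from the measured VMware baseline, not from averages: it is the burst, not the mean, that fills the journal.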
Read Path
The read path is simpler than the write path because reads do not require replication. However, cache hierarchy and data locality significantly affect performance.
Read Path -- HCI with SDS
===========================
1. VM Application issues read(fd, buf, 4096)
|
v
2. Guest Kernel: filesystem -> bio -> virtio-blk driver
|
v
3. Hypervisor: QEMU I/O thread -> SDS client library
|
v
4. SDS Client: identify chunk and primary node
|
+--- Case A: Data is LOCAL (on this node)
| |
| v
| Local SDS daemon -> check page cache
| |
| +-- Cache HIT: return from RAM ~5-20 us
| |
| +-- Cache MISS: read from local NVMe ~50-100 us
| | read from local SSD ~100-300 us
| |
| Result: LOCAL READ latency = 50-300 us
|
+--- Case B: Data is REMOTE (on another node)
|
v
Network request to remote node ~5-30 us (RDMA)
| ~50-200 us (TCP)
v
Remote SDS daemon -> check page cache
|
+-- Cache HIT: return from RAM ~5-20 us
|
+-- Cache MISS: read from NVMe ~50-100 us
|
Network response ~5-30 us (RDMA)
|
Result: REMOTE READ latency = 100-500 us
Cache Hierarchy (typical HCI node):
Layer 1: Guest page cache (RAM inside VM) ~1 us
Layer 2: Host page cache (hypervisor RAM) ~2-5 us
Layer 3: SDS cache (dedicated RAM or NVMe) ~5-50 us
Layer 4: Local SSD/NVMe (data tier) ~50-300 us
Layer 5: Remote SSD/NVMe (over network) ~100-500 us
Data Locality Optimization:
Some SDS implementations try to place the primary replica on the
same node as the VM (e.g., Ceph via primary affinity; vSAN offers
this only as read locality in stretched clusters).
This maximizes local reads and minimizes network traffic.
However, after live migration (vMotion equivalent), the VM
moves to a different node, and all reads become remote until
the SDS rebalances the primary replica to the new node.
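A weighted sum over the cache hierarchy gives an expected read latency. The latencies below are midpoints of the ranges above; the hit rates are assumed for illustration and should be replaced with measured telemetry:

```python
# Expected read latency as a weighted sum over the cache hierarchy.
layers = [
    (0.30, 3),     # host page cache hit
    (0.20, 25),    # SDS cache (RAM/NVMe) hit
    (0.35, 150),   # local SSD/NVMe data tier
    (0.15, 300),   # remote SSD/NVMe over the network
]
assert abs(sum(hit for hit, _ in layers) - 1.0) < 1e-9
expected = sum(hit * lat_us for hit, lat_us in layers)
print(f"expected read latency: ~{expected:.0f} us")
```

Even a modest remote-read fraction dominates the average, which is why data locality after live migration has a visible performance effect.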
Failure Domains
A failure domain is a group of components that can fail together due to a shared dependency (power, network, physical location). HCI must be configured with failure domain awareness to ensure that replicas are spread across independent failure domains.
Failure Domain Hierarchy
==========================
Level 0: Disk
- Single disk failure: SDS rebuilds data from other replicas
- Impact: reduced redundancy for affected chunks until rebuild completes
- Rebuild time: minutes to hours (depends on disk size and cluster I/O)
Level 1: Node
- Node failure (hardware, OS crash, reboot): all disks on that node offline
- Impact: RF=3 -> 2 copies remain, cluster still healthy
- Recovery: automatic, SDS re-replicates data to surviving nodes
- Rebuild time: 30 min to several hours (depends on data volume)
Level 2: Rack
- Rack failure (PDU failure, top-of-rack switch failure)
- Impact: all nodes in the rack offline simultaneously
- Design rule: never place more than (RF - 1) replicas in the same rack
- Ceph CRUSH: configure rack-level failure domain rules
- S2D: configure fault domains (rack, chassis, site)
Level 3: Site / Data Center
- Site failure (power, network, natural disaster)
- Stretched cluster: replicas across two sites + witness at third site
- Ceph: stretched mode with crush rules for site affinity
- S2D: stretch cluster with site-aware volume placement
Example: 12 nodes across 3 racks, RF=3
Rack 1 Rack 2 Rack 3
+------+ +------+ +------+
|Node01| |Node05| |Node09|
|Node02| |Node06| |Node10|
|Node03| |Node07| |Node11|
|Node04| |Node08| |Node12|
+------+ +------+ +------+
Chunk placement (rack-aware, RF=3):
Chunk 001: Node02 (Rack1) Node06 (Rack2) Node10 (Rack3)
Chunk 002: Node04 (Rack1) Node07 (Rack2) Node11 (Rack3)
Chunk 003: Node01 (Rack1) Node08 (Rack2) Node12 (Rack3)
-> Any single rack can fail completely, data remains accessible
-> Two racks failing simultaneously: data loss possible (RF=3
can survive 2 node failures, not 2 rack failures with 4 nodes each)
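Rack-safe placement from the 12-node example can be validated mechanically; this is a minimal sketch whose node-to-rack mapping mirrors the diagram above:

```python
# Node -> rack mapping from the 12-node example (4 nodes per rack)
rack_of = {f"Node{n:02d}": f"Rack{(n - 1) // 4 + 1}" for n in range(1, 13)}

def rack_safe(replica_nodes, rf=3):
    """A placement is rack-safe if all RF replicas land in distinct racks."""
    racks = {rack_of[node] for node in replica_nodes}
    return len(racks) == rf

print(rack_safe(["Node02", "Node06", "Node10"]))   # True  (Chunk 001)
print(rack_safe(["Node01", "Node02", "Node05"]))   # False (two replicas in Rack1)
```

In production this check is enforced by the placement engine itself (CRUSH rules in Ceph, fault domains in S2D), not by an external script.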
Rebalancing and Recovery After Failures
When a node fails or is added, the SDS must redistribute data to maintain even distribution and the target replica factor. This process has significant performance implications.
Rebalancing Scenarios
=======================
Scenario 1: Node failure (unplanned)
-------------------------------------
Initial state: 4 nodes, RF=3, each node holds ~25% of data
Node 1: [C01,C02,C05,C08,...] <- FAILED
Node 2: [C01,C03,C04,C06,...]
Node 3: [C02,C04,C07,C08,...]
Node 4: [C03,C05,C06,C07,...]
After Node 1 failure:
- Chunks that had a replica on Node 1 now have only 2 copies
- SDS detects under-replicated chunks
- Recovery: re-replicate missing copies to surviving nodes
Node 2: [C01,C03,C04,C06,...] + [C05*,C08*] <- new replicas
Node 3: [C02,C04,C07,C08,...] + [C01*,C05*] <- new replicas
Node 4: [C03,C05,C06,C07,...] + [C02*,C08*] <- new replicas
* = newly replicated chunks
Recovery data volume: every chunk replica that resided on Node 1
must be rebuilt -- read from a surviving copy and written to a
new location -- so the volume to move equals the data on Node 1.
Example: 10 TB of replicas on Node 1 -> ~10 TB to re-replicate
At 2 GB/s recovery rate: ~83 minutes
PERFORMANCE IMPACT during recovery:
- Recovery I/O competes with production VM I/O
- Expect 10-30% IOPS degradation during recovery
- Ceph: "osd recovery max active" limits concurrent recoveries
- S2D: repair jobs are background-prioritized but still consume I/O
Scenario 2: Node addition (planned)
-------------------------------------
Adding Node 5 to a 4-node cluster:
Before: each node holds 25% of data
After: each node should hold 20% of data
SDS moves ~5% of total data from each existing node to Node 5:
Node 1: 25% -> 20% (moves 5% out)
Node 2: 25% -> 20% (moves 5% out)
Node 3: 25% -> 20% (moves 5% out)
Node 4: 25% -> 20% (moves 5% out)
Node 5: 0% -> 20% (receives 20% total)
This is a gradual, background process:
- Ceph: PG (placement group) remapping, controlled by "backfill" limits
- S2D: storage job with low priority, can take hours for large clusters
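The data volume moved during a planned node addition follows directly from the share arithmetic above, assuming the SDS restores a perfectly even (uniform) distribution:

```python
def rebalance_volume_tb(stored_tb, nodes_before, nodes_after):
    """Data that must migrate after adding node(s) so each node
    again holds an even share of the stored data."""
    per_node_delta = stored_tb * (1 / nodes_before - 1 / nodes_after)
    return per_node_delta * nodes_before   # summed over existing nodes

# 4 -> 5 nodes with 100 TB stored (raw, including replicas):
print(round(rebalance_volume_tb(100, 4, 5), 1))   # 20.0 TB moves to Node 5
```

Dividing that volume by the configured backfill/repair throughput gives a first-order estimate of how long the rebalancing window lasts.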
HCI Networking Requirements
HCI places heavy demands on the storage network because every write generates cross-node replication traffic. The network is now part of the storage path, not just a management channel.
HCI Network Bandwidth Calculation
====================================
Assumptions:
- Cluster aggregate write throughput: 5 GB/s
- RF=3 (each write generates 2 replication copies)
- Total storage network traffic: 5 GB/s x 2 = 10 GB/s replication
- Plus read traffic (remote reads): ~2 GB/s
- Plus recovery/rebalancing (background): ~1-2 GB/s
- Total sustained storage network: ~14 GB/s
Minimum network per node (4-node cluster):
- Each node generates/receives: ~3.5 GB/s = 28 Gbps
- Minimum: 2 x 25 GbE dedicated to storage (50 Gbps)
- Recommended: 2 x 100 GbE with RDMA (200 Gbps headroom)
Network design for HCI:
+------------------+
| HCI Node |
| |
| [NIC Port 1] ----+--> Management + VM traffic (25 GbE)
| [NIC Port 2] ----+--> Management + VM traffic (25 GbE)
| [NIC Port 3] ----+--> Storage replication (25 GbE, dedicated)
| [NIC Port 4] ----+--> Storage replication (25 GbE, dedicated)
+------------------+
or (converged with QoS):
+------------------+
| HCI Node |
| |
| [NIC Port 1] ----+--> Converged: Mgmt + VM + Storage (100 GbE)
| [NIC Port 2] ----+--> Converged: Mgmt + VM + Storage (100 GbE)
+------------------+
(RDMA traffic class separated via PFC/ECN/DCBX)
Critical requirements:
- Storage network latency: < 50 us for RDMA, < 200 us for TCP
- Jumbo frames (MTU 9000) mandatory for storage VLAN
- For RDMA: lossless Ethernet (PFC, ECN, DCBX) required
- Leaf-spine topology with non-blocking bandwidth
- No oversubscription on storage VLAN switches
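The bandwidth calculation above can be parameterized so it is easy to re-run with measured numbers; the inputs below are the stated assumptions, not measurements:

```python
def per_node_storage_gbps(write_gbs, rf, remote_read_gbs, recovery_gbs, nodes):
    """Sustained storage-network load per node (Gbps), mirroring the
    worked example above; inputs are cluster-wide GB/s figures."""
    replication = write_gbs * (rf - 1)          # RF=3 -> 2 extra copies
    total_gbs = replication + remote_read_gbs + recovery_gbs
    return total_gbs / nodes * 8                # GB/s -> Gbps, per node

# 5 GB/s writes, RF=3, 2 GB/s remote reads, 2 GB/s recovery, 4 nodes:
print(per_node_storage_gbps(5, 3, 2, 2, 4))    # 28.0 Gbps per node
```

Note how the per-node figure rises as the cluster shrinks: smaller clusters need proportionally more network headroom per node.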
Trade-Offs: SAN vs NAS vs HCI Decision Framework
The decision between architectures is not binary. The right answer depends on workload requirements, existing infrastructure, operational skills, and regulatory constraints. The following framework provides a structured approach.
Decision Framework: When to Use What
=======================================
+------------------------------------------+
| Do you need sub-200 us deterministic |
| latency with zero noisy-neighbor risk? |
+-----+------------------------------------+
|
Yes | No
| | |
v | v
+--------+ | +------------------------------------+
| SAN | | | Is shared file access (NFS/SMB) |
| (keep | | | the primary use case (not block)? |
| or add)| | +-----+------------------------------+
+--------+ | |
| Yes | No
| | | |
| v | v
| +---+ | +-------------------------------+
| |NAS| | | Are you willing to operate |
| +---+ | | distributed storage software |
| | | (Ceph, S2D) with the required |
| | | team skills? |
| | +-----+-------------------------+
| | |
| | Yes | No
| | | | |
| | v | v
| | +---+ | +-----------+
| | |HCI| | | SAN or |
| | +---+ | | Managed |
| | | | Service |
| | | | (ESC) |
| | | +-----------+
| | |
+--------+--------+
Quantified Trade-Off Matrix:
| Dimension | SAN | NAS | HCI (SDS) |
|---|---|---|---|
| CapEx (initial) | High (array + FC switches + HBAs) | Medium (NAS head + Ethernet) | Low-Medium (commodity servers + disks) |
| OpEx (ongoing) | Medium (array support contracts, FC admin) | Low (simple management) | Medium-High (SDS expertise, more nodes to manage) |
| Scalability model | Scale storage independently of compute | Scale NAS heads or shelves | Scale by adding nodes (compute + storage together) |
| Typical latency (4K random write) | 50-200 us | 300-1000 us | 200-500 us (local), 300-800 us (remote) |
| Latency consistency | Excellent (deterministic) | Good (shared network) | Moderate (varies with load, rebalancing) |
| Failure blast radius | Array failure = all VMs lose storage | NAS head failure = all NFS clients impacted | Node failure = degraded, not down (replicated) |
| Capacity overhead | 10-30% (RAID, hot spare) | 10-30% (RAID) | 50% (RF=2) to 67% (RF=3), or 33-50% (erasure coding) |
| Data services maturity | Excellent (20+ years, enterprise-grade) | Good (snapshots, replication, quotas) | Improving (Ceph and S2D mature, but less polished) |
| Vendor lock-in | Array vendor (NetApp, Pure, Dell, HPE) | NAS vendor (NetApp, Dell) | Platform vendor (Red Hat/Ceph, Microsoft/S2D) |
| Skills required | Storage admin (array + FC + zoning) | NAS admin (NFS/SMB, Ethernet) | SDS + platform engineer (Kubernetes/Ceph or Windows/S2D) |
Hybrid Architecture: When to Mix SAN and HCI
For a financial enterprise with 5,000+ VMs, the pragmatic answer is often a hybrid architecture:
Hybrid Architecture -- SAN + HCI
===================================
Tier 1: Performance-Critical (5-10% of VMs)
- Oracle RAC, high-frequency databases, real-time analytics
- External SAN (NetApp AFF, Pure FlashArray)
- Connected to HCI nodes via iSCSI or FC (CSI driver)
- Dedicated latency SLA: < 200 us p99
Tier 2: General Enterprise (80-90% of VMs)
- Application servers, web servers, middleware, dev/test
- HCI storage (Ceph/ODF or S2D)
- Standard SLA: < 1 ms p99
Tier 3: Archival / Bulk (5-10% of VMs)
- Backup targets, log aggregation, cold storage
- HCI with erasure coding (lower cost)
- Relaxed SLA: < 5 ms p99
+-------------------+ +------------------------------------+
| External SAN | | HCI Cluster |
| (NetApp/Pure/Dell)| | |
| [Tier 1 LUNs] | | Node1 [Tier2] [Tier3-EC] |
+--------+----------+ | Node2 [Tier2] [Tier3-EC] |
| | Node3 [Tier2] [Tier3-EC] |
iSCSI/FC/NVMe-oF | Node4 [Tier2] [Tier3-EC] |
| | ... |
+--------+----------+ +------------------------------------+
| CSI Driver |
| (Trident/Pure CSI) |
+--------------------+
|
v
Kubernetes/OpenShift presents all tiers
uniformly via StorageClasses:
storageclass: tier1-san -> External SAN (iSCSI/FC)
storageclass: tier2-hci -> HCI replicated (RF=3)
storageclass: tier3-archive -> HCI erasure coded (4+2)
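One way to make the tiering policy operational is a small placement helper. The profile keys (`p99_latency_us`, `archival`) and thresholds here are hypothetical; only the StorageClass names come from the diagram above:

```python
# Hypothetical mapping from the tiering policy to StorageClass names.
TIER_TO_SC = {
    "tier1": "tier1-san",       # external SAN via CSI
    "tier2": "tier2-hci",       # HCI replicated, RF=3
    "tier3": "tier3-archive",   # HCI erasure coded, 4+2
}

def storage_class_for(vm_profile):
    """Pick a StorageClass from a coarse VM profile dict (illustrative)."""
    if vm_profile.get("p99_latency_us", 10**6) <= 200:
        return TIER_TO_SC["tier1"]      # sub-200 us p99 -> external SAN
    if vm_profile.get("archival", False):
        return TIER_TO_SC["tier3"]      # bulk/cold -> erasure coding
    return TIER_TO_SC["tier2"]          # everything else -> HCI RF=3

print(storage_class_for({"p99_latency_us": 150}))   # tier1-san
print(storage_class_for({"archival": True}))        # tier3-archive
print(storage_class_for({}))                        # tier2-hci
```

In practice this classification belongs in provisioning automation (or a defaulting policy), so that workload owners request a tier, not a technology.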
The "Disaggregated HCI" Middle Ground
Pure HCI couples compute and storage on every node: adding storage means adding compute (and vice versa). This is inefficient when scaling is asymmetric. Disaggregated HCI introduces storage-only nodes alongside compute-storage nodes, or fully separates compute and storage into different node pools while keeping the software-defined storage layer.
Disaggregated HCI Architectures
==================================
Model A: Compute-Storage Nodes + Storage-Only Nodes
+------------------+ +------------------+ +------------------+
| Compute-Storage | | Compute-Storage | | Compute-Storage |
| Node 1 | | Node 2 | | Node 3 |
| [VMs] [SDS] | | [VMs] [SDS] | | [VMs] [SDS] |
| [NVMe] [SSD] | | [NVMe] [SSD] | | [NVMe] [SSD] |
+--------+---------+ +--------+---------+ +--------+---------+
| | |
+----------+----------+----------+----------+
| |
+--------+---------+ +--------+---------+
| Storage-Only | | Storage-Only |
| Node 4 | | Node 5 |
| [SDS only] | | [SDS only] |
| [NVMe] [SSD] | | [NVMe] [SSD] |
| [SSD] [SSD] | | [SSD] [SSD] |
+------------------+ +------------------+
Use case: need more storage capacity without adding compute
Model B: Fully Disaggregated (Compute Nodes + Storage Nodes)
Compute Nodes (VMs only, no local storage for SDS):
+----------+ +----------+ +----------+ +----------+
| Compute | | Compute | | Compute | | Compute |
| Node 1 | | Node 2 | | Node 3 | | Node 4 |
| [VMs] | | [VMs] | | [VMs] | | [VMs] |
+----+-----+ +----+-----+ +----+-----+ +----+-----+
| | | |
+------+-------+------+------+------+-------+
| | |
v v v
Storage Nodes (SDS daemons + disks, no VMs):
+----------+ +----------+ +----------+
| Storage | | Storage | | Storage |
| Node 1 | | Node 2 | | Node 3 |
| [SDS] | | [SDS] | | [SDS] |
| [NVMe x4]| | [NVMe x4]| | [NVMe x4]|
| [SSD x8] | | [SSD x8] | | [SSD x8] |
+----------+ +----------+ +----------+
Use case: independent scaling, no resource contention between VMs and SDS
Platform support:
- OVE/ODF: supports "infra nodes" for Ceph OSDs (disaggregated model B)
- Azure Local: does NOT support disaggregated S2D; all nodes must participate
- Ceph (upstream): fully supports disaggregated with separate MON/OSD/MDS nodes
How the Candidates Handle This
Comparison Table
| Aspect | VMware (Current) | OVE (OpenShift Virtualization Engine) | Azure Local | Swisscom ESC |
|---|---|---|---|---|
| Primary storage architecture | HCI (vSAN) | HCI (Ceph/ODF) | HCI (S2D) | SAN (Dell PowerMax/PowerStore) |
| SDS engine | vSAN (proprietary) | Ceph (open-source, Red Hat supported) | Storage Spaces Direct (proprietary) | N/A (managed SAN) |
| Replication model | vSAN RAID-1/5/6 (policy-based) | Ceph: RF=2/3 or erasure coding (per pool) | S2D: 2-way/3-way mirror or parity (per volume) | Array-level RAID (managed by Swisscom) |
| Consistency model | Strong (synchronous replication) | Strong (synchronous replication, primary-copy) | Strong (synchronous replication) | Strong (array controller) |
| Write path | VM -> vSAN DOM -> CLOM -> journal -> replicate | VM -> QEMU -> librbd -> primary OSD -> journal -> replicate | VM -> Hyper-V -> ReFS -> CSV -> S2D cache -> replicate | VM -> array controller -> cache -> RAID write |
| Cache / Journal tier | Dedicated cache device per disk group | WAL + DB on NVMe, data on SSD/HDD | NVMe cache tier (read + write) | Array DRAM cache + NVMe journal |
| Failure domain awareness | Host-level (rack-aware with stretched cluster) | CRUSH map: disk, host, rack, row, site | Fault domains: node, chassis, rack, site | Array controller HA (managed) |
| Minimum nodes (HA) | 3 (witness for 2-node) | 3 (ODF compact mode) | 2 (with witness) | N/A (managed) |
| Maximum nodes per cluster | 64 | No hard limit (practical: 100+) | 16 | N/A (managed) |
| Disaggregated mode | Compute-only hosts allowed | ODF infra nodes (storage-only), fully disaggregated Ceph | Not supported -- all nodes must run S2D | N/A |
| External SAN integration | VMFS on SAN LUNs, RDMs | CSI drivers (Trident, Pure CSI, Dell CSI) | iSCSI/FC passthrough to VMs | Included (it IS the SAN) |
| Erasure coding | Yes (RAID-5/6 equivalent) | Yes (Ceph EC pools, configurable k+m) | Yes (parity volumes, single/dual parity) | N/A (array-level) |
| Data locality optimization | Read locality in stretched clusters only | Primary affinity (configurable, not default) | Automatic (CSV ownership, SMB redirect) | N/A (SAN has no locality concept) |
| Capacity overhead (RF=3) | 3x raw (RAID-1 mirror across 3 hosts) | 3x raw (3 replicas) | 3x raw (3-way mirror) | 1.2-1.5x (RAID-6 or dual parity) |
| Encryption at rest | vSAN encryption (AES-256, vCenter KMS) | Ceph OSD encryption (dm-crypt/LUKS, KMIP) | BitLocker on CSV volumes (TPM-backed) | Managed by Swisscom (vendor encryption) |
| Snapshot mechanism | vSAN snapshots (redo logs, COW) | Ceph RBD snapshots (COW, instant) | Storage Spaces checkpoints (COW) | Array-native snapshots (managed) |
Detailed Comparison
VMware vSAN (Current Baseline): vSAN is the HCI storage engine that the organization already operates. It pools local disks across ESXi hosts into a shared datastore, using policy-based storage management (SPBM) to define replication level, stripe width, and failure tolerance per VM. vSAN's CLOM (Cluster Level Object Manager) automatically places replicas across hosts, and DOM (Distributed Object Manager) handles I/O routing. The familiar operational model includes vCenter health checks, automatic rebalancing, and policy-driven provisioning. The exit motivation is Broadcom licensing, not vSAN's technical capabilities.
OVE -- Ceph/ODF: OVE uses OpenShift Data Foundation (ODF), which is Red Hat's productized distribution of Ceph. Ceph is the most widely deployed open-source distributed storage system, providing block (RBD), file (CephFS), and object (RGW) storage from a single platform.
Key Ceph architecture points for this evaluation:
- CRUSH algorithm: Ceph's data placement algorithm. Unlike a central metadata server, CRUSH computes the location of every data object algorithmically based on the CRUSH map (a hierarchical description of the physical topology). This eliminates a metadata bottleneck and allows clients to compute data locations directly.
- OSD daemons: One OSD (Object Storage Daemon) per disk. Each OSD manages its local disk and participates in replication, recovery, and rebalancing. A 12-disk node runs 12 OSD processes.
- MON daemons: 3 or 5 monitors maintain the cluster map (CRUSH map, OSD states) and enforce quorum. Monitors use Paxos consensus.
- Placement Groups (PGs): The intermediate mapping layer between volumes (RBD images) and OSDs. A volume is split into objects, each object is mapped to a PG, each PG is mapped to a set of OSDs by CRUSH. Typical pool has 128-256 PGs.
- BlueStore: Ceph's storage backend since Luminous. Writes directly to raw block devices (no filesystem on OSDs), with RocksDB for metadata. Supports inline compression (snappy, zstd, lz4) and checksumming.
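The idea of algorithmic placement can be illustrated with a toy object-to-PG mapping. This is a deliberate simplification: real Ceph uses rjenkins hashing with a stable-mod mask (not MD5 and plain modulo), and CRUSH then maps the PG to OSDs:

```python
import hashlib

def object_to_pg(object_name, pg_num):
    """Toy illustration of algorithmic placement: hash the object
    name and reduce modulo the PG count. Not Ceph's real algorithm."""
    digest = hashlib.md5(object_name.encode()).hexdigest()
    return int(digest, 16) % pg_num

# An RBD image is striped into objects; each object hashes to a PG,
# and CRUSH then maps the PG to a set of OSDs -- no metadata lookup.
pg = object_to_pg("rbd_data.abc123.0000000000000042", 128)
print(f"object maps to pg {pg} of 128")
```

The key property is that any client can compute the location deterministically from the cluster map alone, which is what eliminates the metadata-server bottleneck.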
ODF operational model on OpenShift:
- Deployed via the ODF Operator (which internally deploys Rook-Ceph).
- Ceph cluster is managed as Kubernetes custom resources (CephCluster, CephBlockPool, CephFilesystem).
- Scaling: add storage nodes, the operator automatically deploys new OSDs and rebalances.
- Monitoring: Ceph metrics exposed via Prometheus, integrated into OpenShift monitoring stack.
Azure Local -- Storage Spaces Direct (S2D): S2D is Microsoft's HCI storage engine built into Windows Server. It pools local NVMe, SSD, and HDD drives across cluster nodes into a unified storage pool, presented as Cluster Shared Volumes (CSVs) formatted with ReFS.
Key S2D architecture points:
- Software Storage Bus: The internal transport layer that connects local disks on each node into a cluster-wide pool. Uses SMB3/RDMA for inter-node communication.
- Storage tiers: S2D automatically creates tiers (NVMe cache + SSD capacity, or SSD cache + HDD capacity). The cache tier absorbs hot I/O; the capacity tier stores bulk data.
- Resiliency: Per-volume choice of 2-way mirror, 3-way mirror, mirror-accelerated parity (combines mirroring for hot data with parity for cold data), or dual parity.
- ReFS filesystem: The Resilient File System (ReFS) replaces NTFS on S2D volumes. ReFS provides integrity streams (checksumming), block cloning (instant VM copy), and allocate-on-write (no in-place overwrites, COW by default).
- CSV (Cluster Shared Volumes): The cluster filesystem layer that allows all nodes to access the same volumes simultaneously. Each volume has an "owner node" that coordinates metadata; data I/O can be either direct (through the Software Storage Bus) or redirected (through SMB3 if the owner is remote).
S2D constraints for this evaluation:
- Maximum 16 nodes per cluster. This is a hard architectural limit. For 5,000+ VMs, you will need multiple S2D clusters. Ceph has no such limit.
- No disaggregated mode. Every node in the S2D cluster must contribute storage. You cannot have compute-only nodes. This means scaling compute requires scaling storage (and vice versa).
- Windows-only. S2D runs exclusively on Windows Server. The operational model is PowerShell, Windows Admin Center, and Azure portal -- a fundamentally different skill set from Linux/Kubernetes.
Swisscom ESC -- Managed SAN: ESC uses traditional SAN (Dell PowerMax/PowerStore behind VxBlock). The customer has zero visibility into or control over the storage architecture. Capacity is consumed as a managed service: request storage, get a volume, attach to VM. The trade-off is clear: no operational burden, but no optimization capability, no architecture choice, and complete dependency on Swisscom for performance, capacity planning, and data protection.
Key Takeaways
- SAN is not dead, but it is no longer the default. For this evaluation, SAN remains relevant for two scenarios: (a) Swisscom ESC, where it is the underlying architecture hidden behind a managed service, and (b) hybrid architectures where performance-critical Tier-1 workloads on OVE or Azure Local consume external SAN via CSI drivers. For the 80-90% of general-purpose VMs, HCI is the architecturally appropriate choice. Do not invest in new SAN infrastructure unless a specific workload justifies it with measured performance data.
- HCI capacity overhead is the hidden cost. RF=3 replication means only 33% of raw storage is usable. For 5,000+ VMs requiring, say, 200 TB usable, you need 600 TB raw NVMe/SSD -- a significant hardware cost. Compare this to SAN with RAID-6, where 200 TB usable requires approximately 250-280 TB raw. Erasure coding on HCI (e.g., Ceph EC 4+2) improves efficiency to ~67% usable, but with higher latency and CPU cost. The PoC should include a detailed capacity model comparing raw-to-usable ratios for each candidate with the actual workload profile.
- The write path determines the performance floor. In HCI, every write traverses the network at least twice (primary journal + replica journal). This means the storage network is on the critical path for write latency. A misconfigured or congested storage network does not just slow down storage -- it directly increases write latency for every VM on the cluster. Dedicated storage network interfaces and correct RDMA/PFC/ECN configuration are not optional -- they are prerequisites for acceptable HCI write performance.
- Ceph and S2D are architecturally different in meaningful ways. Ceph is a distributed object store with no filesystem on the data path (BlueStore writes directly to raw devices). S2D is a distributed block layer that sits underneath ReFS, which is a filesystem. Ceph uses CRUSH for algorithmic placement (no metadata server for block storage). S2D uses the Software Storage Bus with CSV owner coordination. These differences matter for failure behavior, recovery speed, and scalability limits (S2D: 16 nodes; Ceph: effectively unlimited).
- Disaggregated HCI is a competitive advantage of OVE. ODF supports dedicated storage nodes (infra nodes) that run only Ceph OSDs, separate from compute nodes running VMs. This allows independent scaling of compute and storage, approaching the flexibility of traditional SAN without the dedicated hardware. Azure Local does not support disaggregated S2D -- every node must contribute storage. For an environment with 5,000+ VMs where compute and storage growth rates may differ, this is a meaningful architectural difference.
- Journal sizing is the most common HCI performance mistake. An undersized journal (WAL) device causes write stalls when burst I/O fills the journal faster than it can be flushed. Both Ceph and S2D use NVMe devices as the journal/cache tier. The PoC must test sustained write performance under load to validate that the journal tier can absorb peak write rates without stalling. Ask vendors for journal sizing guidelines based on your measured write profile from the VMware baseline.
- Rebalancing is the operational unknown. Adding or removing nodes from an HCI cluster triggers data redistribution that consumes network bandwidth and disk I/O for hours. During rebalancing, production VM performance degrades. The PoC should explicitly test rebalancing impact: add a node during a load test and measure the latency degradation. Both Ceph and S2D provide throttling controls, but defaults may not be tuned for a 5,000+ VM production environment.
- NAS remains relevant for file-level workloads. HCI replacing SAN does not mean HCI replaces NAS. Shared file access (configuration repositories, user home directories, application data shared across VMs) is still best served by NFS/SMB from a dedicated NAS or from the SDS file layer (CephFS, S2D SMB shares). The evaluation should clarify whether OVE's CephFS or Azure Local's SMB shares provide adequate NAS functionality or whether an external NAS (NetApp, PowerScale) remains necessary.
- SAN operational skills do not transfer to HCI. A team skilled in FC zoning, LUN masking, ALUA multipathing, and array-level snapshots will find HCI operations fundamentally different. Ceph operations involve CRUSH map management, PG balancing, OSD lifecycle, and Rook-Ceph operator troubleshooting. S2D operations involve PowerShell, CSV ownership, storage tier management, and S2D repair jobs. The training and skill transition plan must be explicit, with a timeline that precedes production deployment.
Discussion Guide
The following questions are designed for vendor deep-dives, PoC planning sessions, and internal architecture reviews. They probe the practical implications of storage architecture choices.
Questions for All Candidates
- Capacity overhead quantification: "For our target of X TB usable block storage for VMs, how much raw disk capacity is required under your recommended resiliency configuration? Break this down: raw capacity, capacity after replication/parity, capacity after metadata/filesystem overhead, and final usable capacity. How does this change if we use erasure coding for Tier-2/3 workloads?"
- Write latency under replication: "Walk us through the exact write path from VM application to durable commit, including all replication hops. What is the measured p50, p95, p99, and p99.9 write latency for 4K random writes at 80% cluster utilization with your recommended replication level? How does write latency change when one node is in recovery (degraded mode)?"
- Rebalancing impact: "Add a node to a 50%-utilized cluster during a sustained fio workload. What is the measured IOPS and latency degradation during rebalancing? How long does rebalancing take for 10 TB of data? What controls exist to throttle rebalancing and protect production I/O?"
- Failure recovery time: "A node with 10 TB of data fails at 3 AM. How long until the cluster is back to full redundancy (RF=3 restored)? What is the VM-visible latency impact during recovery? How does the system prioritize recovery I/O vs production I/O? Is there a risk of cascading failure if a second node fails during recovery?"
- Journal/cache tier sizing: "How should we size the NVMe journal/cache tier for a workload profile of X IOPS write, Y GB/s sequential write? What happens when the journal fills up? How do we monitor journal utilization and set alerts before a performance cliff?"
Questions Specific to OVE (Ceph/ODF)
- CRUSH map design for our topology: "We have X racks across 2 data center rooms. Design a CRUSH map that ensures RF=3 replicas are placed on different racks. How does the CRUSH map change when we add a third data center room? Can CRUSH rules be changed online without data movement?"
-
PG count and autoscaling: "What PG count should we configure for our pool sizes? Does the ODF operator enable PG autoscaling by default? What are the risks of PG splits and merges on production workload performance? Can PG operations be scheduled during maintenance windows?"
-
BlueStore tuning for NVMe: "Is the default BlueStore configuration optimized for all-NVMe deployments? What bluestore_cache_size, bluestore_min_alloc_size, and WAL/DB sizing are recommended for our workload profile? Should we allocate separate NVMe devices for WAL+DB, or collocate them with data?"
-
Disaggregated ODF: "We want to run compute-only worker nodes alongside dedicated storage (infra) nodes. What is the minimum number of storage nodes for ODF? Can we mix node sizes (different disk counts) without CRUSH weight imbalances? What network bandwidth is required between compute and storage nodes?"
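As a sanity check on vendor answers to the PG question, the long-standing Ceph rule of thumb targets roughly 100 PGs per OSD, divided by the pool's replication size and rounded to a power of two. A sketch of that rule; the ODF pg-autoscaler may legitimately compute different values, so treat this only as a baseline for challenge:

```python
import math

def suggested_pg_count(num_osds, pool_size=3, target_pgs_per_osd=100):
    """Classic Ceph rule of thumb: size the pool so each OSD carries
    ~100 PGs, rounded up to the next power of two. A starting point
    for discussion, not a substitute for the pg-autoscaler."""
    raw = math.ceil(num_osds * target_pgs_per_osd / pool_size)
    return 1 << (raw - 1).bit_length()  # round up to a power of two

# Example: a 48-OSD pool at RF=3 suggests 2048 PGs.
print(suggested_pg_count(48))
```

If the operator's autoscaled value diverges sharply from this baseline, that is exactly the kind of discrepancy the deep-dive question should surface.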
Questions Specific to Azure Local (S2D)
- 16-node cluster limit: "We need to host 5,000+ VMs. With a 16-node cluster limit, how many S2D clusters do we need? How do VMs on one cluster access storage on another? Is there a shared-nothing penalty for cross-cluster storage access? How does management scale across multiple clusters?"
- Mirror-accelerated parity: "For cold/warm data, S2D offers mirror-accelerated parity (MAP). How does the system decide when to tier data from the mirror layer to the parity layer? What is the read latency from the parity layer compared to the mirror layer? Can we control tiering thresholds?"
- ReFS vs NTFS implications: "Microsoft recommends ReFS for S2D volumes. What limitations does ReFS impose compared to NTFS (e.g., deduplication behavior, backup agent compatibility, file-level restore)? Are all our backup tools (Veeam, Commvault) fully compatible with ReFS on CSV volumes?"
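The cluster-count question can be framed numerically before the vendor meeting. A rough sketch, where vms_per_node and the N+1 reserve are planning assumptions on our side, not platform figures:

```python
import math

def clusters_needed(total_vms, vms_per_node, max_nodes=16, ha_reserve_nodes=1):
    """Rough S2D cluster count under a fixed node cap.

    Assumptions (ours, not Microsoft's):
      vms_per_node     -- planned VM density per host
      ha_reserve_nodes -- N+1 capacity kept free to absorb a node failure
    """
    usable_nodes = max_nodes - ha_reserve_nodes
    vms_per_cluster = usable_nodes * vms_per_node
    return math.ceil(total_vms / vms_per_cluster)

# Example: 5,000 VMs at 30 VMs/node with N+1 reserve per cluster.
print(clusters_needed(5000, vms_per_node=30))
```

The interesting vendor follow-up is not the count itself but the consequences: each additional cluster is a separate storage pool, failure domain, and patching unit, which is what the cross-cluster and management-scaling sub-questions probe.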
Questions Specific to Swisscom ESC
- Storage architecture transparency: "We understand ESC uses Dell VxBlock with PowerMax/PowerStore. Can you confirm the RAID level and protection scheme used for our tenant? Is our data on dedicated LUNs or shared with other tenants? What is our performance isolation guarantee -- is there QoS at the array level, and what are our IOPS/latency SLAs?"
- Storage scaling and limits: "What is the maximum storage capacity available per tenant? If we need to scale from 100 TB to 500 TB, is this a config change or a hardware procurement? What is the lead time? Can we burst beyond our committed capacity?"
Architecture-Level Questions (for Internal Discussion)
- Hybrid architecture decision: "For our 5,000+ VMs, what percentage realistically need SAN-grade performance (sub-200 µs, deterministic)? Should we provision an external SAN for those workloads and use HCI for the rest? What is the TCO difference between full-HCI and hybrid (HCI + external SAN for Tier-1)? Consider hardware, licensing, support contracts, and operational skill sets."
- Capacity planning model: "Build a 5-year capacity model for each candidate, starting from our current VMware storage consumption. Account for: replication overhead (RF=3 vs RAID-6), growth rate (15-20% annual), snapshot space, thin provisioning overcommit safety margin, and recovery headroom (enough free space to absorb a node failure's re-replication). What is the year-1 and year-5 raw capacity requirement for each candidate?"
- NAS strategy alongside HCI: "If we adopt HCI for block storage, do we still need a dedicated NAS (NetApp, PowerScale) for file shares, or can CephFS / S2D SMB shares replace it? What are the feature gaps (Active Directory integration, DFS namespaces, qtrees, quotas, virus scanning integration)? Is a dedicated NAS for file workloads + HCI for block workloads a cleaner operational model than trying to do everything on HCI?"
- Storage operations skill transition: "Our current team operates VMware vSAN and NetApp ONTAP. Map the skill equivalencies to Ceph/ODF and S2D. For each new skill area (CRUSH map management, OSD lifecycle, PG balancing for Ceph; PowerShell storage cmdlets, CSV management, ReFS administration for S2D), what training is available, what is the estimated ramp-up time, and when in the project timeline must the team be proficient?"
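The 5-year capacity model asked for above can be prototyped in a few lines and run per candidate with that candidate's real overheads. The growth rate, snapshot fraction, and recovery headroom below are illustrative placeholders drawn from the question text, not validated figures:

```python
def five_year_raw_tb(current_used_tb, growth=0.18, rf=3,
                     snapshot_frac=0.15, headroom_frac=0.25, years=5):
    """Year-by-year raw capacity need for a replicated HCI cluster.

    Placeholder assumptions (replace per candidate):
      growth        -- annual consumption growth (midpoint of 15-20%)
      rf            -- replication factor (use parity overhead for RAID-6)
      snapshot_frac -- snapshot space as a fraction of used data
      headroom_frac -- free-space reserve for re-replication after a
                       node failure
    """
    projections = []
    used = current_used_tb
    for year in range(1, years + 1):
        used *= 1 + growth
        usable_needed = used * (1 + snapshot_frac)
        raw_needed = usable_needed * rf / (1 - headroom_frac)
        projections.append((year, round(used), round(raw_needed)))
    return projections

for year, used, raw in five_year_raw_tb(100):
    print(f"year {year}: ~{used} TB used -> ~{raw} TB raw")
```

Running the same model with RF=3 versus a RAID-6-style overhead makes the full-HCI versus hybrid TCO question concrete: the raw-capacity gap between the two protection schemes is often the largest single line item in the comparison.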
Next: 05-sds-platforms.md -- Software-Defined Storage Platforms (Ceph/ODF, Storage Spaces Direct, and direct comparison)