Modern datacenters and beyond

Storage Architectures

Why This Matters

The choice of storage architecture -- SAN, NAS, or HCI/SDS -- is the single most consequential infrastructure decision in this platform migration. It determines the physical bill of materials, the operational skill set required, the failure domains, the performance ceiling, and the day-2 operational model for the next 5-10 years. In the current VMware environment, the storage architecture is already HCI (vSAN), potentially supplemented by external SAN for specific workloads. Each candidate platform makes a different architectural bet:

  1. OVE bets on HCI via Ceph/ODF, with optional external SAN/NAS via CSI drivers.
  2. Azure Local bets on HCI via Storage Spaces Direct (S2D), with optional external SAN via iSCSI/FC passthrough.
  3. Swisscom ESC bets on traditional SAN (Dell PowerMax/PowerStore behind VxBlock), fully managed and invisible to the customer.

Understanding the architectural trade-offs between SAN, NAS, and HCI is therefore essential: it is what allows the platform bets above to be compared on equal terms, the proof-of-concept scenarios to be planned, and the target hardware to be sized.

This page covers the three architectures in depth, with particular emphasis on HCI/SDS because that is the model used by the two self-operated candidates (OVE and Azure Local).


Concepts

1. SAN (Storage Area Network)

Architecture Overview

A SAN is a dedicated, purpose-built network that connects servers to centralized block storage arrays. The defining characteristic is separation of concerns: compute nodes contain no persistent storage; all data lives on purpose-built storage controllers with enterprise-grade data services (snapshots, replication, deduplication, encryption, QoS). The SAN fabric provides deterministic, low-latency connectivity between compute and storage.

SAN Architecture -- Dual-Fabric Topology (Standard Enterprise)
================================================================

  Compute Nodes (Servers / Hypervisor Hosts)
  +-----------+  +-----------+  +-----------+  +-----------+
  | Host 01   |  | Host 02   |  | Host 03   |  | Host 04   |
  | [HBA-A]---|--+-----------+--+-----------+--+-- Fabric A |
  | [HBA-B]---|--+-----------+--+-----------+--+-- Fabric B |
  +-----------+  +-----------+  +-----------+  +-----------+

  SAN Fabric A (Primary)                SAN Fabric B (Secondary)
  +---------------------+              +---------------------+
  | FC Switch A1        |              | FC Switch B1        |
  | (Brocade G720 /     |              | (Brocade G720 /     |
  |  Cisco MDS 9148T)   |              |  Cisco MDS 9148T)   |
  +-----+-----+---------+              +-----+-----+---------+
        |     |                              |     |
        +--+--+                              +--+--+
           | ISL                                | ISL
  +--------+------------+              +--------+------------+
  | FC Switch A2        |              | FC Switch B2        |
  | (edge / director)   |              | (edge / director)   |
  +--------+------------+              +--------+------------+
           |                                    |
           v                                    v
  +---------------------+              +---------------------+
  | Storage Array       |              | Storage Array       |
  | Controller A        |              | Controller B        |
  | (active)            |              | (standby/active)    |
  +---------------------+--------------+---------------------+
  |  Disk Shelves (SSD / NVMe / HDD)                         |
  |  [shelf-1] [shelf-2] [shelf-3] [shelf-4]                 |
  +----------------------------------------------------------+

  Key design principles:
  - Two independent fabrics (A and B) -- no single point of failure
  - Each host has two HBAs, one per fabric
  - Each storage controller has ports on both fabrics
  - ISLs (Inter-Switch Links) connect switches within a fabric
  - No cross-connections between Fabric A and Fabric B
  - Dual-controller array: active/active or active/standby

SAN Components

Host Bus Adapters (HBAs): Dedicated PCIe cards that provide Fibre Channel connectivity from the server to the SAN fabric. Each HBA has a globally unique World Wide Port Name (WWPN) and World Wide Node Name (WWNN), analogous to a MAC address on Ethernet. Modern HBAs operate at 32 Gbps FC (Gen 6) or 64 Gbps FC (Gen 7). Each server requires at least two HBAs for multipath redundancy -- one per fabric.

Fibre Channel Switches: Purpose-built network switches that forward FC frames between HBAs and storage array ports. Major vendors are Brocade (Broadcom) and Cisco (MDS series). Switches operate at wire speed with cut-through forwarding and deterministic latency (typically 2-5 microseconds per hop). Enterprise deployments use director-class switches (Brocade X7-8, Cisco MDS 9700) with hundreds of ports and redundant control planes. Smaller environments use fixed-port switches (Brocade G720, Cisco MDS 9148T).

Storage Arrays: The centralized storage controllers that own and manage all persistent data. A storage array consists of:

Component                        | Function
Controllers (2 or more)          | Process I/O, run data services (snapshots, replication, dedup), manage cache
Cache (DRAM + NVMe write buffer) | Absorb writes, accelerate reads; 64-512 GB per controller is typical
Backend connectivity             | SAS/NVMe connections to disk shelves
Disk shelves                     | House the physical media (NVMe, SSD, HDD); connected via dual-path SAS or NVMe fabric
Management interfaces            | Out-of-band management (REST API, GUI, CLI) for provisioning and monitoring

SAN Vendors Relevant to Financial Enterprises:

Vendor       | Platform                 | Architecture                                                  | Sweet Spot
NetApp       | AFF A-Series / C-Series  | Unified (block + file), ONTAP OS, active/active controllers   | Environments needing block + NFS + SnapMirror replication
Pure Storage | FlashArray//X, //XL, //E | All-flash, NVMe-native, active/active, Evergreen subscription | Simplicity-focused, high-IOPS, no tuning needed
Dell         | PowerStore, PowerMax     | PowerStore: unified mid-range; PowerMax: enterprise block     | PowerMax for extreme scale; PowerStore for mixed workloads
HPE          | Alletra 9000 / MP        | Alletra 9000 (Primera heritage): mission-critical block       | Environments requiring guaranteed latency SLAs

Zoning and LUN Masking

Zoning and LUN masking are the two layers of access control in a SAN. They are conceptually similar to network firewall rules and application-level authentication, respectively.

Zoning (Fabric-Level Access Control): Zoning restricts which HBA ports can communicate with which storage ports within a fabric. Without zoning, every HBA port can see every storage port -- a security and stability risk.

Zoning Example -- Fabric A
============================

Zone: Zone_Host01_ArrayA
  Members:
    - 21:00:00:1b:32:a1:00:01   (Host01 HBA-A WWPN)
    - 50:00:09:73:f0:10:00:0a   (Array Controller-A Port 1)

Zone: Zone_Host02_ArrayA
  Members:
    - 21:00:00:1b:32:a2:00:01   (Host02 HBA-A WWPN)
    - 50:00:09:73:f0:10:00:0a   (Array Controller-A Port 1)

Zone: Zone_Host03_ArrayA
  Members:
    - 21:00:00:1b:32:a3:00:01   (Host03 HBA-A WWPN)
    - 50:00:09:73:f0:10:00:0b   (Array Controller-A Port 2)

Zoneset: Production_FabricA
  Active zones: Zone_Host01_ArrayA,
                Zone_Host02_ArrayA,
                Zone_Host03_ArrayA

Zoning types:
  - WWPN zoning (port-based): recommended, follows the HBA regardless of switch port
  - Port zoning (switch port-based): tied to physical switch port, breaks if cable moves
  - Smart zoning / peer zoning: Brocade optimization reducing zone database size
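
The single-initiator / single-target pattern above lends itself to scripting. The following is a minimal sketch in Python: the WWPN inventory, the zone-naming scheme, and the printed output are invented for illustration, and a real deployment would push the result to the switches through the Brocade or Cisco CLI/REST interface rather than printing it.

import hashlib  # not needed here, removed -- see note below

# Minimal sketch: generate single-initiator / single-target zones for Fabric A.
# Inventory and naming scheme are illustrative only.

hosts = {
    "Host01": "21:00:00:1b:32:a1:00:01",   # HBA-A WWPN (hypothetical)
    "Host02": "21:00:00:1b:32:a2:00:01",
    "Host03": "21:00:00:1b:32:a3:00:01",
}

array_ports = {
    "ArrayA_P1": "50:00:09:73:f0:10:00:0a",
    "ArrayA_P2": "50:00:09:73:f0:10:00:0b",
}

def build_zones(hosts, array_ports):
    """One zone per (host HBA, array port) pair -- single initiator, single target."""
    zones = {}
    ports = list(array_ports.items())
    for i, (host, hba_wwpn) in enumerate(sorted(hosts.items())):
        port_name, port_wwpn = ports[i % len(ports)]   # spread hosts across array ports
        zones[f"Zone_{host}_{port_name}"] = [hba_wwpn, port_wwpn]
    return zones

if __name__ == "__main__":
    zones = build_zones(hosts, array_ports)
    for name, members in zones.items():
        print(f"Zone: {name}")
        for wwpn in members:
            print(f"  member: {wwpn}")
    print("Zoneset: Production_FabricA ->", ", ".join(zones))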

LUN Masking (Array-Level Access Control): LUN masking controls which hosts can access which LUNs (Logical Unit Numbers) on the storage array. Even if zoning allows an HBA to talk to a storage port, LUN masking restricts which LUNs that HBA can discover and use.

LUN Masking Workflow
=====================

1. Create LUN on array:
   lun create -name db_prod_01 -size 500GB -pool ssd_tier1

2. Create host object (identifies the server to the array):
   host create -name esxi-host-01 \
     -initiators 21:00:00:1b:32:a1:00:01,21:00:00:1b:32:a1:00:02 \
     -type vmware

3. Map LUN to host (LUN masking):
   lun map -name db_prod_01 -host esxi-host-01 -lun-id 5

4. Result: Host01 sees LUN 5 (500 GB) when it scans the fabric
   Other hosts see nothing -- LUN is masked from them

Host Group (for shared-access scenarios like VMFS):
   hostgroup create -name esxi-cluster-01 \
     -hosts esxi-host-01,esxi-host-02,esxi-host-03,esxi-host-04
   lun map -name vmfs_shared_01 -hostgroup esxi-cluster-01 -lun-id 10
   -> All four hosts see LUN 10 (shared VMFS datastore)

LUN Presentation and Multipathing

When a host has two HBAs (one per fabric) and the storage array has controllers on both fabrics, the host discovers the same LUN through multiple paths. Multipathing software (dm-multipath on Linux, Microsoft MPIO on Windows, VMware NMP) aggregates these paths into a single logical device.

Multipath Topology -- Single LUN, Four Paths
===============================================

Host: esxi-host-01
  |
  +-- HBA-A (Fabric A) ------> Array Controller-A, Port 1
  |                              |
  |                              +-> LUN 5 (optimized path)
  |
  +-- HBA-A (Fabric A) ------> Array Controller-B, Port 1
  |                              |
  |                              +-> LUN 5 (non-optimized path)
  |
  +-- HBA-B (Fabric B) ------> Array Controller-A, Port 2
  |                              |
  |                              +-> LUN 5 (optimized path)
  |
  +-- HBA-B (Fabric B) ------> Array Controller-B, Port 2
                                 |
                                 +-> LUN 5 (non-optimized path)

ALUA (Asymmetric Logical Unit Access):
  - LUN 5 is "owned" by Controller-A
  - Paths through Controller-A are "Active/Optimized" (AO)
  - Paths through Controller-B are "Active/Non-Optimized" (ANO)
  - I/O normally flows through AO paths
  - If Controller-A fails, ANO paths become AO (controller failover)

Multipath policies:
  - Round Robin:  alternates I/Os across active paths (best throughput)
  - Fixed:        uses one path, failover to another on failure
  - MRU:          most recently used path (avoids unnecessary path changes)
  - ALUA-aware:   respects AO/ANO status, load-balances within AO paths
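
To make the policy behaviour concrete, here is a small sketch (Python, toy data structures only, not tied to dm-multipath, MPIO, or NMP) of what an ALUA-aware round-robin policy does: rotate across Active/Optimized paths, and fall back to Active/Non-Optimized paths only when no AO path is left.

from dataclasses import dataclass

@dataclass
class Path:
    name: str        # e.g. "HBA-A -> Ctrl-A P1" (illustrative)
    alua: str        # "AO" (Active/Optimized) or "ANO" (Active/Non-Optimized)
    alive: bool = True

class AluaRoundRobin:
    """Round-robin across AO paths; use ANO paths only if no AO path is alive."""
    def __init__(self, paths):
        self.paths = paths

    def _candidates(self):
        ao = [p for p in self.paths if p.alive and p.alua == "AO"]
        return ao or [p for p in self.paths if p.alive and p.alua == "ANO"]

    def next_path(self):
        cands = self._candidates()
        if not cands:
            raise RuntimeError("no paths available -- all-paths-down")
        chosen = cands[0]
        # rotate: move the chosen path to the back so the next I/O uses another path
        self.paths.remove(chosen)
        self.paths.append(chosen)
        return chosen

paths = [
    Path("HBA-A -> Ctrl-A P1", "AO"),
    Path("HBA-B -> Ctrl-A P2", "AO"),
    Path("HBA-A -> Ctrl-B P1", "ANO"),
    Path("HBA-B -> Ctrl-B P2", "ANO"),
]

mp = AluaRoundRobin(paths)
for _ in range(4):
    print("I/O via", mp.next_path().name)

# Simulate a Controller-A failure: AO paths go down, ANO paths take over
for p in mp.paths:
    if "Ctrl-A" in p.name:
        p.alive = False
print("after failover:", mp.next_path().name)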

Thin Provisioning at the Array Level

Modern storage arrays support thin provisioning at the pool level, distinct from thin provisioning within the guest OS (LVM thin pools) or at the hypervisor level (thin VMDKs):

Array-Level Thin Provisioning
===============================

Storage Pool: ssd_tier1 (physical capacity: 50 TB)
  |
  +-- LUN: db_prod_01     provisioned: 500 GB   used: 120 GB
  +-- LUN: db_prod_02     provisioned: 500 GB   used:  85 GB
  +-- LUN: vmfs_shared_01 provisioned:  10 TB   used: 3.2 TB
  +-- LUN: vmfs_shared_02 provisioned:  10 TB   used: 4.1 TB
  +-- LUN: oracle_rac_01  provisioned:   2 TB   used: 800 GB
  +-- LUN: oracle_rac_02  provisioned:   2 TB   used: 750 GB
  ...
  Total provisioned: 120 TB   (2.4x overcommit ratio)
  Total used:         38 TB   (76% physical utilization)
  Free physical:      12 TB

  Warning thresholds:
    - 80% physical: alert
    - 90% physical: critical alert + auto-pause thin provisioning
    - 95% physical: emergency -- risk of pool full, I/O errors

  The danger: if total used exceeds 50 TB, the pool is physically
  full and new writes FAIL. This is the "thin provisioning time bomb."
  Financial environments must monitor this and set conservative thresholds.
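
The pool arithmetic and thresholds above are straightforward to reproduce in a monitor. A minimal sketch (Python; the LUN list is abbreviated and the alert actions are just prints -- a real monitor would pull these values from the array's management API):

# Minimal sketch of the pool math behind the example above.
POOL_PHYSICAL_TB = 50.0

luns = {                        # name: (provisioned_TB, used_TB)
    "db_prod_01":      (0.5, 0.12),
    "vmfs_shared_01": (10.0, 3.2),
    "vmfs_shared_02": (10.0, 4.1),
    "oracle_rac_01":   (2.0, 0.8),
    # ... remaining LUNs omitted
}

provisioned = sum(p for p, _ in luns.values())
used        = sum(u for _, u in luns.values())

overcommit  = provisioned / POOL_PHYSICAL_TB
utilization = used / POOL_PHYSICAL_TB

print(f"provisioned: {provisioned:.1f} TB  ({overcommit:.1f}x overcommit)")
print(f"used:        {used:.1f} TB  ({utilization:.0%} of physical)")

# Conservative thresholds from the table above
if utilization >= 0.95:
    print("EMERGENCY: pool nearly full -- writes may start failing")
elif utilization >= 0.90:
    print("CRITICAL: pause new thin provisioning")
elif utilization >= 0.80:
    print("WARNING: plan capacity expansion")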

SAN Management Workflow

The end-to-end workflow for provisioning a new LUN to a host on a SAN illustrates the operational complexity:

SAN Provisioning Workflow (New LUN for a VM Cluster)
======================================================

Step 1: Capacity planning
  - Identify target pool (SSD tier, HDD tier)
  - Check free physical capacity and overcommit ratio
  - Determine RAID/protection level

Step 2: Create LUN on storage array
  - Vendor GUI / CLI / REST API
  - Set size, thin/thick, tiering policy, QoS limits
  - Assign to storage pool

Step 3: Create zoning on BOTH fabrics
  - Log into Fabric A switch: create zone, add to active zoneset
  - Log into Fabric B switch: create zone, add to active zoneset
  - Activate zonesets (causes brief fabric reconfiguration)

Step 4: Create LUN masking on storage array
  - Identify host/hostgroup by WWPN
  - Map LUN with a specific LUN ID
  - Verify host can discover the LUN

Step 5: Host-side discovery
  - Rescan HBA (Linux: echo "- - -" > /sys/class/scsi_host/hostX/scan)
  - Verify multipath device appears (multipathd show paths)
  - Verify ALUA path states

Step 6: Present to hypervisor
  - VMware: create VMFS datastore on the new LUN
  - Or: use as RDM (Raw Device Mapping) for direct VM access

Step 7: Document in CMDB
  - Record WWPN mappings, zone names, LUN IDs, multipath policy

Total time: 30-90 minutes per LUN (manual process)
Automation potential: high (Ansible, Terraform, vendor REST APIs)
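
Steps 2-4 are the natural automation target. The sketch below (Python with the requests library) shows the shape such automation could take against a generic array REST API; every endpoint, payload field, URL, and credential here is hypothetical, and a real implementation would use the vendor's documented API or an Ansible/Terraform module instead.

# Hypothetical provisioning sketch -- endpoints and payloads are invented for
# illustration; substitute the vendor's real REST API or automation modules.
import requests

ARRAY = "https://array.example.local/api/v1"   # hypothetical base URL
AUTH = ("svc-provisioning", "********")

def create_lun(session, name, size_gb, pool):
    r = session.post(f"{ARRAY}/luns",
                     json={"name": name, "size_gb": size_gb,
                           "pool": pool, "thin": True})
    r.raise_for_status()
    return r.json()["id"]                      # assumed response shape

def map_lun(session, lun_id, hostgroup, lun_number):
    r = session.post(f"{ARRAY}/mappings",
                     json={"lun_id": lun_id, "hostgroup": hostgroup,
                           "lun_number": lun_number})
    r.raise_for_status()

with requests.Session() as s:
    s.auth = AUTH
    s.verify = "/etc/pki/array-ca.pem"         # validate the array's TLS certificate
    lun_id = create_lun(s, "vmfs_shared_03", 10_240, "ssd_tier1")
    map_lun(s, lun_id, "esxi-cluster-01", 11)
    print("LUN created and masked; zoning still has to be pushed to both fabrics")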

Performance Characteristics

SAN storage delivers the best deterministic performance of any architecture because:

  1. Dedicated bandwidth. The FC fabric carries only storage traffic -- no competition with management, vMotion, or tenant network traffic.
  2. Deterministic latency. FC switches use credit-based flow control (buffer-to-buffer credits), which prevents congestion drops. There is no TCP congestion window, no retransmissions, no head-of-line blocking. Latency is predictable and consistent.
  3. Controller-side intelligence. Enterprise arrays have massive DRAM caches (64-512 GB per controller), NVMe write journals, and intelligent prefetch algorithms. Hot data is served from cache at DRAM speed.
  4. No replication overhead on the write path. Unlike HCI, a SAN write goes to one storage controller (which handles internal redundancy via RAID or erasure coding within the array). There is no cross-node network replication on every write.

Typical SAN performance numbers (single array, all-flash):

Metric                | Mid-Range (PowerStore, AFF A250) | High-End (PowerMax, AFF A900)
Random 4K read IOPS   | 200K - 500K                      | 1M - 10M
Random 4K write IOPS  | 100K - 300K                      | 500K - 5M
Read latency (avg)    | 100 - 300 us                     | 50 - 150 us
Write latency (avg)   | 100 - 500 us                     | 50 - 200 us
Sequential throughput | 10 - 25 GB/s                     | 50 - 150 GB/s
Max capacity (usable) | 100 - 500 TB                     | 500 TB - 4 PB

SAN in a VM Environment

In a VMware environment, SAN LUNs are consumed in two ways: as VMFS datastores (a shared LUN formatted with VMFS that holds the VMDK files of many VMs) or as RDMs (Raw Device Mappings that pass a LUN directly to a single VM, typically for clustered databases).

Boot from SAN: Servers can boot their hypervisor OS from a SAN LUN, eliminating the need for local disks entirely. This simplifies server hardware (diskless blade servers) but adds a dependency on SAN availability for basic compute. Boot from SAN is common in Tier-1 financial environments with existing SAN infrastructure.

When SAN Still Makes Sense

Despite the industry trend toward HCI, external SAN remains justified in specific scenarios:

Scenario                          | Why SAN Wins
Regulatory mandate                | Some regulators or audit frameworks require that storage infrastructure is a separate failure domain from compute. SAN provides physical separation.
Existing investment               | If you have a recently purchased (< 3 years) SAN array with active support contracts, migrating its workloads to HCI gains nothing and wastes CapEx.
Performance-critical databases    | Workloads requiring sub-100-us latency, deterministic QoS, and zero noisy-neighbor risk (e.g., high-frequency trading, real-time risk calculations) benefit from dedicated SAN.
Oracle RAC / shared-disk clusters | Clustered databases using shared raw block devices (RDMs) with fencing rely on SAN semantics that HCI does not natively provide.
Very large single volumes         | Volumes > 16 TB that need efficient snapshots and replication are better served by array-native data services than by distributed HCI replication.
Disaggregated scaling             | When compute and storage need to scale independently (e.g., adding 50 TB of storage without adding compute nodes), SAN decouples the two.

2. NAS (Network Attached Storage)

Architecture Overview

A NAS system is a file server appliance that serves files over standard Ethernet using file-level protocols (NFS, SMB/CIFS). Unlike SAN (which provides raw block devices), NAS provides a shared filesystem -- clients mount directories and access files through standard POSIX or SMB semantics. The storage appliance owns the filesystem, handles locking, permissions, and data protection.

NAS Architecture -- Enterprise Deployment
===========================================

  Clients (VMs, Hypervisors, Application Servers)
  +-----------+  +-----------+  +-----------+  +-----------+
  | Client 01 |  | Client 02 |  | Client 03 |  | Client 04 |
  |  NFS/SMB  |  |  NFS/SMB  |  |  NFS/SMB  |  |  NFS/SMB  |
  +-----+-----+  +-----+-----+  +-----+-----+  +-----+-----+
        |              |              |              |
        +------+-------+------+------+------+-------+
               |              |             |
        +------+------+ +----+-----+ +-----+------+
        | Ethernet    | | Ethernet | | Ethernet   |
        | Switch L1   | | Switch   | | Switch L2  |
        | (25/100GbE) | | (core)   | | (25/100GbE)|
        +------+------+ +----+-----+ +-----+------+
               |              |             |
               +------+------+------+------+
                      |             |
                +-----+------+ +---+--------+
                | NAS Head 1 | | NAS Head 2 |
                | (active)   | | (standby / |
                |            | |  active)   |
                +-----+------+ +---+--------+
                      |             |
                +-----+-------------+------+
                | Internal Disk Shelves    |
                | (SSD / HDD / hybrid)     |
                | [shelf-1] [shelf-2] ...  |
                +---------------------------+

  NAS protocols:
    NFS (TCP port 2049): POSIX file semantics, used by Linux/Unix/VMware
    SMB (TCP port 445):  Windows file semantics, used by Windows VMs/clients

  Network:
    - Shared Ethernet (same physical network as other traffic, or VLAN-separated)
    - No dedicated fabric required (unlike SAN/FC)
    - Jumbo frames (MTU 9000) recommended for performance

NAS Protocols: NFS vs SMB

Dimension            | NFS                                                               | SMB (SMB2/SMB3)
Native OS            | Linux, Unix, macOS, VMware ESXi                                   | Windows, macOS (via SMB), Linux (via cifs.ko)
Filesystem semantics | POSIX (uid/gid, mode bits, ACLs via NFSv4)                        | Windows ACLs (NTFS-style DACL/SACL)
Locking              | Advisory (NFSv3), mandatory (NFSv4)                               | Mandatory (oplocks, leases)
Authentication       | AUTH_SYS (uid/gid, insecure), Kerberos (NFSv4)                    | NTLM, Kerberos (integrated with AD)
Encryption           | Kerberos privacy (krb5p) for NFSv4                                | SMB3 encryption (AES-128/256-GCM)
Multichannel         | Not natively (client-side bonding instead)                        | SMB3 Multichannel (built-in NIC aggregation)
Typical use case     | VMware NFS datastores, Linux app data, Kubernetes NFS provisioner | Windows file shares, DFS namespaces, user home drives

When to use NFS: VMware NFS datastores, Linux application shared data, Kubernetes PVs via NFS provisioner, cross-platform file sharing in Linux-dominated environments.

When to use SMB: Windows VM file shares, Active Directory-integrated environments, user home directories for Windows desktops, SQL Server filegroups on SMB shares (supported since SQL 2012).

NAS for VM Storage

NFS is used as a VM storage backend in two scenarios:

VMware NFS Datastores: ESXi mounts NFS exports as datastores. VMDKs are stored as regular files on the NFS volume. This is operationally simpler than SAN (no zoning, no LUN masking, no multipath configuration), but performance depends on the NFS implementation and network.

VMware NFS Datastore Architecture
====================================

ESXi Host 01              ESXi Host 02              ESXi Host 03
+------------------+      +------------------+      +------------------+
| NFS Client       |      | NFS Client       |      | NFS Client       |
| (vmkernel port)  |      | (vmkernel port)  |      | (vmkernel port)  |
+--------+---------+      +--------+---------+      +--------+---------+
         |                         |                         |
         +----------+--------------+----------+--------------+
                    |                         |
              +-----+------+           +-----+------+
              | NFS Server |           | NFS Server |
              | (NetApp    |           | (NetApp    |
              |  node 1)   |           |  node 2)   |
              +-----+------+           +-----+------+
                    |                         |
              +-----+-------------------------+------+
              |        Shared Storage Pool          |
              |  /vol/vmware_ds01  (NFS export)     |
              |    vm01-flat.vmdk                    |
              |    vm02-flat.vmdk                    |
              |    vm03-flat.vmdk                    |
              |  /vol/vmware_ds02  (NFS export)     |
              |    vm04-flat.vmdk                    |
              |    ...                               |
              +--------------------------------------+

  Advantages over SAN:
  - No zoning, no LUN masking, no HBA configuration
  - VMDKs are individual files (easy to list, copy, manage)
  - Storage vMotion between NFS datastores is a file copy
  - Easy to add datastores (just mount a new NFS export)

  Disadvantages vs SAN:
  - Higher latency (NFS/TCP overhead vs FC direct)
  - Shared Ethernet bandwidth (unless dedicated NFS VLAN)
  - No RDM equivalent (cannot pass raw device to VM)
  - NFS locking complexity for certain workloads

Kubernetes NFS Provisioner: In a Kubernetes environment, NFS is consumed via CSI drivers (e.g., NFS Subdir External Provisioner, NetApp Trident for ONTAP NFS) to dynamically provision PersistentVolumes. Each PVC creates a subdirectory on the NFS export. This is simple but has performance limitations (NFS metadata overhead, single-server bottleneck for non-scale-out NAS).
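
On the consumption side, a workload only references a StorageClass in its PersistentVolumeClaim. The sketch below builds such a claim in Python and prints it as YAML; the class name nfs-subdir and the namespace are assumptions that depend on how the provisioner was installed.

# Minimal sketch: a PVC bound to an (assumed) NFS-backed StorageClass.
# Printed as YAML; apply with `kubectl apply -f -` or via a GitOps pipeline.
import yaml

pvc = {
    "apiVersion": "v1",
    "kind": "PersistentVolumeClaim",
    "metadata": {"name": "app-data", "namespace": "demo"},
    "spec": {
        "accessModes": ["ReadWriteMany"],          # NFS allows shared access
        "storageClassName": "nfs-subdir",          # assumed class name
        "resources": {"requests": {"storage": "50Gi"}},
    },
}

print(yaml.safe_dump(pvc, sort_keys=False))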

Scale-Out NAS vs Single-Controller NAS

Aspect             | Single-Controller / HA Pair                              | Scale-Out NAS
Architecture       | One or two controllers (active/standby or active/active) | Many nodes (4-100+) forming a distributed filesystem
Throughput scaling | Limited by controller CPU and network ports              | Scales linearly with nodes
Capacity scaling   | Add disk shelves to existing controllers                 | Add nodes (compute + storage together)
Failure impact     | Controller failure = failover (brief interruption)       | Node failure = redistributed load (minimal impact)
Examples           | NetApp FAS/AFF (HA pair mode), QNAP, Synology            | NetApp ONTAP (cluster mode), Dell PowerScale (Isilon), Vast Data
Use case           | Small-medium file shares, VMware NFS datastores          | Large-scale analytics, media, research, massive file counts

Performance Characteristics vs SAN

NAS performance is fundamentally limited by protocol overhead and shared network bandwidth:

Latency Comparison: SAN vs NAS
================================

Operation: 4K random read

SAN (FC):
  App -> Guest kernel -> virtio-blk -> QEMU -> Host block layer
  -> HBA -> FC Switch (2-5 us) -> Array controller -> Cache/SSD
  Total: 100-300 us

NAS (NFS over 25GbE):
  App -> Guest kernel -> NFS client -> RPC/XDR serialization
  -> TCP/IP stack -> NIC -> Ethernet switch -> NAS head
  -> NFS server -> Filesystem -> Cache/SSD
  Total: 300-800 us

Why NAS is slower:
  1. RPC/XDR serialization overhead: ~10-30 us
  2. TCP/IP processing: ~5-20 us (vs FC credit-based flow: ~2-5 us)
  3. NFS server filesystem layer: ~10-50 us
  4. Shared Ethernet bandwidth: potential queueing delay
  5. No equivalent of FC buffer-to-buffer credits (TCP uses congestion windows)

When this gap does NOT matter:
  - Sequential I/O (throughput-bound, not latency-bound)
  - Workloads with > 5 ms tolerance (file servers, web servers, logs)
  - NFS over RDMA (eliminates TCP overhead, latency approaches SAN)

NAS Vendors

Vendor          | Platform                                   | Strengths                                                                       | Use Case
NetApp          | ONTAP AFF / FAS (HA-pair and cluster mode) | Unified block+file, SnapMirror, FlexClone, multi-protocol                       | Primary enterprise NAS, VMware NFS datastores
Dell            | PowerScale (Isilon)                        | Scale-out NAS cluster; massive scale (100+ nodes, 100+ PB), parallel throughput | Analytics, media, large unstructured data sets
Vast Data       | Universal Storage                          | NFS, SMB, S3 on one platform, disaggregated shared-nothing                      | Next-gen unified storage for AI/ML and enterprise
QNAP / Synology | Desktop/rackmount NAS                      | Low cost, easy management, consumer-grade                                       | Lab, dev/test, non-production file shares

3. HCI / Software-Defined Storage (SDS)

This is the most critical section for the evaluation. Both OVE (via Ceph/ODF) and Azure Local (via S2D) use HCI as their primary storage model. Understanding HCI internals -- replication mechanics, consistency models, failure handling, write paths -- is essential for evaluating their claims and planning PoCs.

The HCI Concept

Hyper-Converged Infrastructure (HCI) eliminates the dedicated storage array by distributing storage across the same servers that run compute workloads. Each node contributes its local disks to a shared storage pool managed by software-defined storage (SDS). The SDS layer presents this distributed pool as a single logical storage system to the hypervisor or container runtime.

HCI Architecture -- Compute + Storage Converged
=================================================

  Traditional (3-Tier: Compute + Network + Storage)
  ==================================================

  +--------+ +--------+ +--------+     +------------------+
  | Compute| | Compute| | Compute| --> | SAN Fabric       |
  | Node 1 | | Node 2 | | Node 3 |     | (FC Switches)    |
  | (no    | | (no    | | (no    |     +--------+---------+
  |  disks)| |  disks)| |  disks)|              |
  +--------+ +--------+ +--------+     +--------+---------+
                                        | Storage Array    |
                                        | (controllers +   |
                                        |  disk shelves)   |
                                        +------------------+

  vs. HCI (Converged)
  ====================

  +------------------+ +------------------+ +------------------+
  | HCI Node 1       | | HCI Node 2       | | HCI Node 3       |
  |                  | |                  | |                  |
  | [Compute: VMs]   | | [Compute: VMs]   | | [Compute: VMs]   |
  | [SDS daemon]     | | [SDS daemon]     | | [SDS daemon]     |
  | [NVMe] [NVMe]    | | [NVMe] [NVMe]    | | [NVMe] [NVMe]    |
  | [SSD]  [SSD]     | | [SSD]  [SSD]     | | [SSD]  [SSD]     |
  +--------+---------+ +--------+---------+ +--------+---------+
           |                     |                     |
           +----------+----------+----------+----------+
                      |                     |
               Storage Network        Storage Network
               (dedicated VLAN         (dedicated VLAN
                or separate NICs)       or separate NICs)

  Key differences:
  - No dedicated storage hardware (arrays, shelves, FC switches)
  - Every node is both compute and storage
  - SDS software creates a distributed storage pool
  - Data is replicated across nodes for redundancy
  - Scaling: add a node = add compute AND storage simultaneously

How SDS Works: Distributed Storage Pooling

Software-defined storage takes the local disks from each node and combines them into a cluster-wide storage pool. The mechanism varies by implementation, but the core concept is the same:

SDS Data Distribution -- Conceptual View
==========================================

Physical disks across 4 nodes:

Node 1:  [NVMe-1a] [NVMe-1b] [SSD-1a] [SSD-1b]
Node 2:  [NVMe-2a] [NVMe-2b] [SSD-2a] [SSD-2b]
Node 3:  [NVMe-3a] [NVMe-3b] [SSD-3a] [SSD-3b]
Node 4:  [NVMe-4a] [NVMe-4b] [SSD-4a] [SSD-4b]

SDS Pool (logical view):
+---------------------------------------------------------------+
|                    Distributed Storage Pool                    |
|  Total raw: 16 disks x 3.84 TB = 61.44 TB                    |
|  Usable (with 3-way replication): ~20 TB                      |
|                                                                |
|  Volume: vm-boot-01    50 GB   (replicated across 3 nodes)    |
|  Volume: vm-boot-02    50 GB   (replicated across 3 nodes)    |
|  Volume: db-data-01   500 GB   (replicated across 3 nodes)    |
|  Volume: db-data-02   500 GB   (replicated across 3 nodes)    |
|  ...                                                           |
+---------------------------------------------------------------+

The SDS layer handles:
  1. Splitting volumes into fixed-size chunks/objects (4 MB typical for Ceph)
  2. Placing replicas of each chunk on different nodes (placement algorithm)
  3. Routing I/O to the correct node for each chunk
  4. Replicating writes to all replica nodes before acknowledging
  5. Detecting and recovering from node/disk failures
  6. Rebalancing data when nodes are added or removed

Data Placement and Replication

Data placement is how the SDS decides where to store each piece of data. The goal is to distribute data evenly across nodes while ensuring replicas are on different failure domains.

Replica Factor (RF): The number of copies of each data chunk stored across the cluster.

RF                         | Capacity Overhead     | Fault Tolerance           | Use Case
RF=2 (2-way mirror)        | 50% usable (2x raw)   | Tolerates 1 node failure  | Development, non-critical workloads
RF=3 (3-way mirror)        | 33% usable (3x raw)   | Tolerates 2 node failures | Production, Tier-1 workloads
Erasure Coding (e.g., 4+2) | 67% usable (1.5x raw) | Tolerates 2 failures      | Cold data, backups, archival storage

Replica Placement Across Nodes:

3-Way Replication -- Data Distribution Example
================================================

Volume: db-data-01 (500 GB, split into 4 MB chunks)

Chunk 001:  Primary: Node 1   Replica: Node 2   Replica: Node 4
Chunk 002:  Primary: Node 2   Replica: Node 3   Replica: Node 1
Chunk 003:  Primary: Node 3   Replica: Node 4   Replica: Node 2
Chunk 004:  Primary: Node 4   Replica: Node 1   Replica: Node 3
Chunk 005:  Primary: Node 1   Replica: Node 3   Replica: Node 4
...

Placement rules (Ceph CRUSH / S2D):
  - No two replicas of the same chunk on the same node
  - No two replicas on the same failure domain (rack, if configured)
  - Distribute evenly: each node stores ~25% of data in a 4-node cluster
  - Weight-based: nodes with more/larger disks store proportionally more

Visual distribution of chunk replicas:

       Node 1      Node 2      Node 3      Node 4
       ------      ------      ------      ------
 C001: [PRIMARY]   [replica]               [replica]
 C002: [replica]   [PRIMARY]   [replica]
 C003:             [replica]   [PRIMARY]   [replica]
 C004: [replica]               [replica]   [PRIMARY]
 C005: [PRIMARY]               [replica]   [replica]
 C006:             [PRIMARY]   [replica]   [replica]
 ...

Each node holds roughly equal data volume.
Each node serves roughly equal I/O load.
No single node failure loses more than 1 copy of any chunk.
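
The placement rules themselves can be illustrated with a toy function. The sketch below (Python) is a deliberately simplified stand-in for CRUSH or S2D placement: it picks RF distinct nodes per chunk deterministically from a per-chunk hash and avoids reusing a rack while enough racks exist. The node/rack topology is invented.

import hashlib

# node -> rack (illustrative topology)
NODES = {"node1": "rack1", "node2": "rack1",
         "node3": "rack2", "node4": "rack2",
         "node5": "rack3", "node6": "rack3"}

def place(chunk_id: str, rf: int = 3):
    """Toy stand-in for CRUSH: deterministic, spreads replicas across racks."""
    # rank nodes by a per-chunk hash so placement is stable but looks random
    ranked = sorted(NODES, key=lambda n: hashlib.sha256(
        f"{chunk_id}/{n}".encode()).hexdigest())
    chosen, used_racks = [], set()
    for node in ranked:                       # first pass: one replica per rack
        if NODES[node] not in used_racks:
            chosen.append(node)
            used_racks.add(NODES[node])
        if len(chosen) == rf:
            return chosen
    for node in ranked:                       # fallback if racks < rf
        if node not in chosen:
            chosen.append(node)
        if len(chosen) == rf:
            return chosen
    return chosen

for chunk in ("chunk-001", "chunk-002", "chunk-003"):
    print(chunk, "->", place(chunk))

Real placement algorithms additionally weight nodes by capacity and minimise data movement when the topology changes; this sketch ignores both.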

Erasure Coding: An alternative to replication that provides data protection with less capacity overhead. Data is split into k data fragments and m parity fragments. Any k fragments can reconstruct the original data.

Erasure Coding Example: k=4, m=2 (4+2)
=========================================

Original 4 MB chunk split into 4 data fragments + 2 parity fragments:

Fragment:   D1     D2     D3     D4     P1     P2
Size:       1 MB   1 MB   1 MB   1 MB   1 MB   1 MB
Stored on:  Node1  Node2  Node3  Node4  Node5  Node6

Capacity overhead:  6 MB stored for 4 MB of data = 1.5x (vs 3x for RF=3)
Fault tolerance:    Any 2 fragments can be lost; data is recoverable

Trade-offs vs replication:
  + 50% less capacity overhead (1.5x vs 3x)
  - Higher CPU cost (Galois field math for parity calculation)
  - Higher read latency for degraded reads (must reconstruct from k fragments)
  - Higher write latency (must compute parity and write to k+m nodes)
  - Requires more nodes (at least k+m nodes for optimal placement)

Best for:  cold data, backups, object storage, large sequential reads
Avoid for: OLTP databases, latency-sensitive block storage, small random I/O
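
The capacity and fault-tolerance figures above follow from simple arithmetic; a minimal sketch (Python, pure calculation matching the numbers quoted here):

def replication(rf):
    """Capacity multiplier and failure tolerance for RF-way mirroring."""
    return {"scheme": f"RF={rf}", "raw_per_usable": rf,
            "usable_pct": 100 / rf, "failures_tolerated": rf - 1}

def erasure_coding(k, m):
    """Capacity multiplier and failure tolerance for k+m erasure coding."""
    return {"scheme": f"EC {k}+{m}", "raw_per_usable": (k + m) / k,
            "usable_pct": 100 * k / (k + m), "failures_tolerated": m}

for s in (replication(2), replication(3), erasure_coding(4, 2), erasure_coding(8, 3)):
    print(f"{s['scheme']:8s} raw/usable={s['raw_per_usable']:.2f}x "
          f"usable={s['usable_pct']:.0f}%  tolerates {s['failures_tolerated']} failures")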

Consistency Models

Distributed storage systems must choose how to handle concurrent reads and writes across replicas. The consistency model determines what a reader sees when a writer is updating data simultaneously.

Strong Consistency (Linearizability): Every read returns the most recent write. All replicas agree on the current state before acknowledging the write. This is what block storage requires -- a VM must never read stale data from its own disk.

Both Ceph and S2D implement strong consistency for block storage.

Eventual Consistency: Writes are acknowledged after some (but not all) replicas confirm. Other replicas catch up asynchronously. Readers may see stale data temporarily. This model is acceptable for object storage (S3) but NOT for block storage.

Quorum-Based Consistency: A write succeeds if a majority (quorum) of replicas acknowledge. A read succeeds if a majority responds. By requiring W + R > N (write quorum + read quorum > total replicas), at least one reader is guaranteed to see the latest write. This is a middle ground used by some distributed databases but NOT by Ceph or S2D for block I/O (they use full synchronous replication instead).

Consistency Model Comparison
==============================

Strong Consistency (Ceph RBD, S2D block):
  Write: Client -> Primary OSD -> [replicate to ALL replicas] -> ACK
  Read:  Client -> Primary OSD -> return latest data
  Guarantee: read-after-write consistency, always
  Cost: write latency = slowest replica

Eventual Consistency (Ceph RGW / S3-compatible):
  Write: Client -> Primary -> ACK -> [replicate asynchronously]
  Read:  Client -> any replica -> may return stale data
  Guarantee: all replicas converge eventually
  Cost: lower write latency, but stale reads possible

Quorum (N=3, W=2, R=2):
  Write: Client -> write to 3, ACK after 2 confirm
  Read:  Client -> read from 3, use value from 2 agreeing
  Guarantee: W + R > N, so at least 1 reader sees latest write
  Cost: balanced latency, but requires quorum logic

For VM block storage: ONLY strong consistency is acceptable.
A VM's filesystem assumes its disk is a single coherent device.
Stale reads would cause filesystem corruption.
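
The quorum condition is mechanical to check; a minimal sketch (Python) of the W + R > N rule:

def quorum_is_consistent(n: int, w: int, r: int) -> bool:
    """Read and write quorums overlap in at least one replica iff W + R > N."""
    return w + r > n

for n, w, r in [(3, 2, 2), (3, 1, 1), (3, 3, 1), (5, 3, 3)]:
    verdict = "guaranteed overlap" if quorum_is_consistent(n, w, r) else "stale reads possible"
    print(f"N={n} W={w} R={r}: {verdict}")

As noted above, Ceph RBD and S2D do not take this route for block I/O: they acknowledge a write only after every replica has journaled it.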

Write Path in a Distributed Storage System

Understanding the write path is critical for performance analysis and troubleshooting. Every VM write traverses multiple layers before being durable.

Write Path -- HCI with SDS (Generic Model)
============================================

1. VM Application issues write(fd, buf, 4096)
   |
   v
2. Guest Kernel: filesystem (ext4/XFS) -> bio -> virtio-blk driver
   |                                                   ~1-5 us
   v
3. Hypervisor: QEMU I/O thread receives virtio request
   |           -> translates to SDS client call         ~2-5 us
   v
4. SDS Client Library (e.g., librbd for Ceph, ReFS for S2D)
   |  a) Identify which chunk/slab this LBA belongs to
   |  b) Look up primary node for this chunk             ~1-3 us
   |  c) Send write request to primary node
   v
5. Primary Storage Node receives write
   |  a) Write to local journal (WAL / write-ahead log)  ~10-50 us
   |     (fast NVMe device, sequential write, durable)
   |  b) Send replication requests to replica nodes
   v
6. Replica Nodes (in parallel):
   |  a) Receive write over network                      ~5-30 us
   |  b) Write to local journal                          ~10-50 us
   |  c) Acknowledge to primary
   v
7. Primary collects ALL replica acknowledgements
   |  (waits for slowest replica -- "tail latency")      ~0-20 us
   v
8. Primary acknowledges write to SDS client
   |                                                     ~1-3 us
   v
9. SDS Client returns completion to QEMU
   |                                                     ~1-3 us
   v
10. QEMU signals completion to guest virtio driver
    |                                                    ~1-3 us
    v
11. Guest kernel marks bio complete, returns to application

Total write latency (NVMe, 3-way replica, RDMA network):
  Best case:  ~100-200 us
  Typical:    ~200-500 us
  Worst case: ~1-5 ms (during rebalancing or under load)

Journal flush (background, async):
  - Periodically (every few seconds), journal entries are
    flushed to the main data store (LSM tree / extent map)
  - This is NOT on the write path -- it happens asynchronously
  - If the journal fills up, writes stall until flush completes
    (this is the "journal full" condition -- a major performance risk)
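
The per-step estimates add up to the quoted totals. A small sketch (Python; the ranges simply mirror the walkthrough above and are not measurements) makes the budget explicit and shows why the slowest replica sets the pace:

# Illustrative latency budget for one 3-way replicated write (microseconds).
steps_us = {
    "guest virtio + QEMU":         (3, 10),
    "SDS client lookup + send":    (2, 6),
    "primary journal write":       (10, 50),
    "network to replicas":         (5, 30),
    "replica journal write":       (10, 50),   # replicas are written in parallel
    "collect slowest replica ACK": (0, 20),
    "completion back to guest":    (3, 9),
}

best = sum(lo for lo, _ in steps_us.values())
worst = sum(hi for _, hi in steps_us.values())
print(f"sum of itemized steps: ~{best}-{worst} us")
print("end-to-end figures typically land higher (the 100-500 us band above) once")
print("queueing, context switches, and software overhead not itemized here are added;")
print("the ACK still waits for the slowest replica, so one slow disk or congested")
print("link sets the latency for every write it touches")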

Journal / Write-Ahead Log (WAL):

The journal is the key performance mechanism in SDS. All writes first go to a fast, durable device (NVMe), which provides sequential write performance. Later, the data is flushed to the main data store in larger, more efficient batches.

Journal / WAL Mechanics
=========================

Write arrives -> Journal (NVMe WAL device)
                    |
                    | (sequential write, ~10-50 us)
                    |
                    v
               Data is durable
               (ACK sent to client)
                    |
                    | (async background flush)
                    | (triggered by: time interval, journal fullness, idle periods)
                    v
               Main Data Store (larger SSD/NVMe capacity devices)
                    |
                    | (written as sorted runs / large extents)
                    | (more efficient than random writes)
                    v
               Journal entry freed

Why this matters:
  - Journal absorbs random writes as sequential writes (much faster)
  - Journal device must be low-latency NVMe (not SATA SSD)
  - Journal size determines how long writes can burst before stalling
  - Ceph: WAL + DB on separate NVMe, data on larger SSD/HDD
  - S2D: Cache tier (NVMe) acts as journal, capacity tier is SSD/HDD

Journal full scenario (performance cliff):
  1. Burst of writes fills journal faster than background flush can drain it
  2. New writes must wait for journal space -> latency spikes to 10-100 ms
  3. Common causes: undersized journal, too few NVMe WAL devices, burst I/O
  4. Mitigation: size journal to handle 30-60 seconds of peak write rate
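
The sizing rule in point 4 is plain arithmetic; a minimal sketch (Python; the rates are example inputs, not recommendations):

def journal_size_gb(peak_write_mbps: float, burst_seconds: float,
                    drain_mbps: float) -> float:
    """Journal capacity needed to absorb a burst that arrives faster than the
    background flush can drain it, without stalling incoming writes."""
    net_fill = max(peak_write_mbps - drain_mbps, 0)   # MB/s accumulating in the WAL
    return net_fill * burst_seconds / 1024            # GB

# Example: a node peaks at 1.5 GB/s of writes, the flush drains 0.8 GB/s,
# and we want to ride out a 60-second burst without hitting "journal full".
print(f"{journal_size_gb(1500, 60, 800):.0f} GB of WAL per node (plus headroom)")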

Read Path

The read path is simpler than the write path because reads do not require replication. However, cache hierarchy and data locality significantly affect performance.

Read Path -- HCI with SDS
===========================

1. VM Application issues read(fd, buf, 4096)
   |
   v
2. Guest Kernel: filesystem -> bio -> virtio-blk driver
   |
   v
3. Hypervisor: QEMU I/O thread -> SDS client library
   |
   v
4. SDS Client: identify chunk and primary node
   |
   +--- Case A: Data is LOCAL (on this node)
   |    |
   |    v
   |    Local SDS daemon -> check page cache
   |    |
   |    +-- Cache HIT: return from RAM          ~5-20 us
   |    |
   |    +-- Cache MISS: read from local NVMe    ~50-100 us
   |    |                read from local SSD     ~100-300 us
   |    |
   |    Result: LOCAL READ latency = 50-300 us
   |
   +--- Case B: Data is REMOTE (on another node)
        |
        v
        Network request to remote node           ~5-30 us (RDMA)
        |                                        ~50-200 us (TCP)
        v
        Remote SDS daemon -> check page cache
        |
        +-- Cache HIT: return from RAM           ~5-20 us
        |
        +-- Cache MISS: read from NVMe           ~50-100 us
        |
        Network response                         ~5-30 us (RDMA)
        |
        Result: REMOTE READ latency = 100-500 us

Cache Hierarchy (typical HCI node):
  Layer 1: Guest page cache (RAM inside VM)         ~1 us
  Layer 2: Host page cache (hypervisor RAM)          ~2-5 us
  Layer 3: SDS cache (dedicated RAM or NVMe)         ~5-50 us
  Layer 4: Local SSD/NVMe (data tier)                ~50-300 us
  Layer 5: Remote SSD/NVMe (over network)            ~100-500 us

Data Locality Optimization:
  Some SDS implementations (Ceph with primary affinity, vSAN)
  try to place the primary replica on the same node as the VM.
  This maximizes local reads and minimizes network traffic.

  However, after live migration (vMotion equivalent), the VM
  moves to a different node, and all reads become remote until
  the SDS rebalances the primary replica to the new node.

Failure Domains

A failure domain is a group of components that can fail together due to a shared dependency (power, network, physical location). HCI must be configured with failure domain awareness to ensure that replicas are spread across independent failure domains.

Failure Domain Hierarchy
==========================

Level 0: Disk
  - Single disk failure: SDS rebuilds data from other replicas
  - Impact: reduced redundancy for affected chunks until rebuild completes
  - Rebuild time: minutes to hours (depends on disk size and cluster I/O)

Level 1: Node
  - Node failure (hardware, OS crash, reboot): all disks on that node offline
  - Impact: RF=3 -> 2 copies remain, cluster still healthy
  - Recovery: automatic, SDS re-replicates data to surviving nodes
  - Rebuild time: 30 min to several hours (depends on data volume)

Level 2: Rack
  - Rack failure (PDU failure, top-of-rack switch failure)
  - Impact: all nodes in the rack offline simultaneously
  - Design rule: never place more than (RF - 1) replicas in the same rack
  - Ceph CRUSH: configure rack-level failure domain rules
  - S2D: configure fault domains (rack, chassis, site)

Level 3: Site / Data Center
  - Site failure (power, network, natural disaster)
  - Stretched cluster: replicas across two sites + witness at third site
  - Ceph: stretched mode with crush rules for site affinity
  - S2D: stretch cluster with site-aware volume placement

Example: 12 nodes across 3 racks, RF=3

  Rack 1          Rack 2          Rack 3
  +------+        +------+        +------+
  |Node01|        |Node05|        |Node09|
  |Node02|        |Node06|        |Node10|
  |Node03|        |Node07|        |Node11|
  |Node04|        |Node08|        |Node12|
  +------+        +------+        +------+

  Chunk placement (rack-aware, RF=3):
  Chunk 001:  Node02 (Rack1)  Node06 (Rack2)  Node10 (Rack3)
  Chunk 002:  Node04 (Rack1)  Node07 (Rack2)  Node11 (Rack3)
  Chunk 003:  Node01 (Rack1)  Node08 (Rack2)  Node12 (Rack3)

  -> Any single rack can fail completely, data remains accessible
  -> Two racks failing simultaneously: data loss possible (RF=3
     can survive 2 node failures, not 2 rack failures with 4 nodes each)

Rebalancing and Recovery After Failures

When a node fails or is added, the SDS must redistribute data to maintain even distribution and the target replica factor. This process has significant performance implications.

Rebalancing Scenarios
=======================

Scenario 1: Node failure (unplanned)
-------------------------------------
Initial state: 4 nodes, RF=3, each node holds ~25% of data

  Node 1: [C01,C02,C05,C08,...]      <- FAILED
  Node 2: [C01,C03,C04,C06,...]
  Node 3: [C02,C04,C07,C08,...]
  Node 4: [C03,C05,C06,C07,...]

After Node 1 failure:
  - Chunks that had a replica on Node 1 now have only 2 copies
  - SDS detects under-replicated chunks
  - Recovery: re-replicate missing copies to surviving nodes

  Node 2: [C01,C03,C04,C06,...] + [C05*,C08*]  <- new replicas
  Node 3: [C02,C04,C07,C08,...] + [C01*,C05*]  <- new replicas
  Node 4: [C03,C05,C06,C07,...] + [C02*,C08*]  <- new replicas

  * = newly replicated chunks

  Recovery bandwidth consumed: roughly the data held on Node 1
  must be read from surviving replicas and written to new locations.
  Example: 10 TB stored on Node 1 -> ~10 TB of data to re-replicate
  At 2 GB/s recovery rate: ~85 minutes

  PERFORMANCE IMPACT during recovery:
  - Recovery I/O competes with production VM I/O
  - Expect 10-30% IOPS degradation during recovery
  - Ceph: "osd recovery max active" limits concurrent recoveries
  - S2D: repair jobs are background-prioritized but still consume I/O


Scenario 2: Node addition (planned)
-------------------------------------
Adding Node 5 to a 4-node cluster:

  Before: each node holds 25% of data
  After:  each node should hold 20% of data

  SDS moves ~5% of total data from each existing node to Node 5:

  Node 1: 25% -> 20%  (moves 5% out)
  Node 2: 25% -> 20%  (moves 5% out)
  Node 3: 25% -> 20%  (moves 5% out)
  Node 4: 25% -> 20%  (moves 5% out)
  Node 5:  0% -> 20%  (receives 20% total)

  This is a gradual, background process:
  - Ceph: PG (placement group) remapping, controlled by "backfill" limits
  - S2D: storage job with low priority, can take hours for large clusters
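
Both scenarios reduce to "how much data has to move, and how long does that take at the recovery rate the cluster can sustain without hurting production I/O." A minimal sketch (Python; the volumes are the example figures above):

def move_time_hours(data_to_move_tb: float, rate_gbps: float) -> float:
    """Hours to copy data_to_move_tb at rate_gbps (GB per second)."""
    return data_to_move_tb * 1024 / rate_gbps / 3600

# Scenario 1: node failure -- roughly the data held on the failed node
# has to be re-replicated onto the survivors.
failed_node_tb = 10
print(f"re-replication after node loss: "
      f"{move_time_hours(failed_node_tb, 2):.1f} h at 2 GB/s")

# Scenario 2: adding a 5th node to a 4-node cluster -- about 1/5 of all
# stored data migrates to the new node to rebalance.
total_stored_tb = 40
print(f"rebalance onto new node: "
      f"{move_time_hours(total_stored_tb / 5, 2):.1f} h at 2 GB/s")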

HCI Networking Requirements

HCI places heavy demands on the storage network because every write generates cross-node replication traffic. The network is now part of the storage path, not just a management channel.

HCI Network Bandwidth Calculation
====================================

Assumptions:
  - Cluster aggregate write throughput: 5 GB/s
  - RF=3 (each write generates 2 replication copies)
  - Total storage network traffic: 5 GB/s x 2 = 10 GB/s replication
  - Plus read traffic (remote reads): ~2 GB/s
  - Plus recovery/rebalancing (background): ~1-2 GB/s
  - Total sustained storage network: ~14 GB/s

Minimum network per node (4-node cluster):
  - Each node generates/receives: ~3.5 GB/s = 28 Gbps
  - Minimum: 2 x 25 GbE dedicated to storage (50 Gbps)
  - Recommended: 2 x 100 GbE with RDMA (200 Gbps headroom)

Network design for HCI:
  +------------------+
  | HCI Node         |
  |                  |
  | [NIC Port 1] ----+--> Management + VM traffic (25 GbE)
  | [NIC Port 2] ----+--> Management + VM traffic (25 GbE)
  | [NIC Port 3] ----+--> Storage replication (25 GbE, dedicated)
  | [NIC Port 4] ----+--> Storage replication (25 GbE, dedicated)
  +------------------+

  or (converged with QoS):

  +------------------+
  | HCI Node         |
  |                  |
  | [NIC Port 1] ----+--> Converged: Mgmt + VM + Storage (100 GbE)
  | [NIC Port 2] ----+--> Converged: Mgmt + VM + Storage (100 GbE)
  +------------------+
  (RDMA traffic class separated via PFC/ECN/DCBX)

Critical requirements:
  - Storage network latency: < 50 us for RDMA, < 200 us for TCP
  - Jumbo frames (MTU 9000) mandatory for storage VLAN
  - For RDMA: lossless Ethernet (PFC, ECN, DCBX) required
  - Leaf-spine topology with non-blocking bandwidth
  - No oversubscription on storage VLAN switches
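
The same arithmetic generalizes to other write rates and cluster sizes; a short sketch (Python) for sizing the per-node storage NICs:

def storage_network_per_node_gbit(cluster_write_gbs: float, rf: int, nodes: int,
                                  remote_read_gbs: float = 0.0,
                                  recovery_gbs: float = 0.0) -> float:
    """Sustained storage-network load per node in Gbit/s.
    Inputs are cluster-wide rates in GB/s."""
    replication = cluster_write_gbs * (rf - 1)      # extra copies cross the network
    total_gbs = replication + remote_read_gbs + recovery_gbs
    return total_gbs / nodes * 8                    # GB/s -> Gbit/s

# Example matching the calculation above: 5 GB/s writes, RF=3, 4 nodes,
# 2 GB/s remote reads, 2 GB/s background recovery/rebalancing.
need = storage_network_per_node_gbit(5, 3, 4, remote_read_gbs=2, recovery_gbs=2)
print(f"~{need:.0f} Gbit/s per node -> 2 x 25 GbE is the floor, "
      f"2 x 100 GbE leaves headroom")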

Trade-Offs: SAN vs NAS vs HCI Decision Framework

The decision between architectures is not binary. The right answer depends on workload requirements, existing infrastructure, operational skills, and regulatory constraints. The following framework provides a structured approach.

Decision Framework: When to Use What
=======================================

                    +------------------------------------------+
                    |  Do you need sub-200 us deterministic     |
                    |  latency with zero noisy-neighbor risk?   |
                    +-----+------------------------------------+
                          |
                   Yes    |    No
                   |      |    |
                   v      |    v
              +--------+  |  +------------------------------------+
              | SAN    |  |  | Is shared file access (NFS/SMB)    |
              | (keep  |  |  | the primary use case (not block)?  |
              | or add)|  |  +-----+------------------------------+
              +--------+  |        |
                          |  Yes   |    No
                          |  |     |    |
                          |  v     |    v
                          | +---+  |  +-------------------------------+
                          | |NAS|  |  | Are you willing to operate    |
                          | +---+  |  | distributed storage software  |
                          |        |  | (Ceph, S2D) with the required |
                          |        |  | team skills?                  |
                          |        |  +-----+-------------------------+
                          |        |        |
                          |        |  Yes   |    No
                          |        |  |     |    |
                          |        |  v     |    v
                          |        | +---+  |  +-----------+
                          |        | |HCI|  |  | SAN or    |
                          |        | +---+  |  | Managed   |
                          |        |        |  | Service   |
                          |        |        |  | (ESC)     |
                          |        |        |  +-----------+
                          |        |        |
                          +--------+--------+

Quantified Trade-Off Matrix:

Dimension                     | SAN                                        | NAS                                          | HCI (SDS)
CapEx (initial)               | High (array + FC switches + HBAs)          | Medium (NAS head + Ethernet)                 | Low-Medium (commodity servers + disks)
OpEx (ongoing)                | Medium (array support contracts, FC admin) | Low (simple management)                      | Medium-High (SDS expertise, more nodes to manage)
Scalability model             | Scale storage independently of compute     | Scale NAS heads or shelves                   | Scale by adding nodes (compute + storage together)
Max latency (4K random write) | 50-200 us                                  | 300-1000 us                                  | 200-500 us (local), 300-800 us (remote)
Latency consistency           | Excellent (deterministic)                  | Good (shared network)                        | Moderate (varies with load, rebalancing)
Failure blast radius          | Array failure = all VMs lose storage       | NAS head failure = all NFS clients impacted  | Node failure = degraded, not down (replicated)
Capacity overhead             | 10-30% (RAID, hot spare)                   | 10-30% (RAID)                                | 50-67% (RF=3) or 33-50% (erasure coding)
Data services maturity        | Excellent (20+ years, enterprise-grade)    | Good (snapshots, replication, quotas)        | Improving (Ceph and S2D mature, but less polished)
Vendor lock-in                | Array vendor (NetApp, Pure, Dell, HPE)     | NAS vendor (NetApp, Dell)                    | Platform vendor (Red Hat/Ceph, Microsoft/S2D)
Skills required               | Storage admin (array + FC + zoning)        | NAS admin (NFS/SMB, Ethernet)                | SDS + platform engineer (Kubernetes/Ceph or Windows/S2D)
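
The capacity-overhead row is frequently the deciding factor in the hardware bill of materials. A minimal sketch (Python) of the raw-capacity arithmetic for a given usable target: the 200 TB target reuses the example from the key takeaways at the end of this page, the RAID-6 factor is an assumption (it depends on RAID group size), and real sizing adds headroom on top.

def raw_tb_required(usable_tb: float, raw_per_usable: float) -> float:
    """Raw capacity needed to deliver a usable target under a given protection scheme."""
    return usable_tb * raw_per_usable

schemes = {
    "HCI RF=3 (3-way replica)": 3.0,
    "HCI erasure coding 4+2":   1.5,
    "SAN RAID-6":               1.3,   # assumption: ~1.25-1.4x depending on group size
}

target_usable = 200   # TB
for name, factor in schemes.items():
    print(f"{name:26s} -> {raw_tb_required(target_usable, factor):5.0f} TB raw")

# Note: real sizing adds headroom on top of this (distributed stores should not
# run close to full, and rebuild capacity must be reserved for node failures).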

Hybrid Architecture: When to Mix SAN and HCI

For a financial enterprise with 5,000+ VMs, the pragmatic answer is often a hybrid architecture:

Hybrid Architecture -- SAN + HCI
===================================

Tier 1: Performance-Critical (5-10% of VMs)
  - Oracle RAC, high-frequency databases, real-time analytics
  - External SAN (NetApp AFF, Pure FlashArray)
  - Connected to HCI nodes via iSCSI or FC (CSI driver)
  - Dedicated latency SLA: < 200 us p99

Tier 2: General Enterprise (80-90% of VMs)
  - Application servers, web servers, middleware, dev/test
  - HCI storage (Ceph/ODF or S2D)
  - Standard SLA: < 1 ms p99

Tier 3: Archival / Bulk (5-10% of VMs)
  - Backup targets, log aggregation, cold storage
  - HCI with erasure coding (lower cost)
  - Relaxed SLA: < 5 ms p99

  +-------------------+    +------------------------------------+
  | External SAN      |    | HCI Cluster                        |
  | (NetApp/Pure/Dell)|    |                                    |
  | [Tier 1 LUNs]     |    | Node1 [Tier2] [Tier3-EC]           |
  +--------+----------+    | Node2 [Tier2] [Tier3-EC]           |
           |               | Node3 [Tier2] [Tier3-EC]           |
    iSCSI/FC/NVMe-oF       | Node4 [Tier2] [Tier3-EC]           |
           |               | ...                                 |
  +--------+----------+    +------------------------------------+
  | CSI Driver         |
  | (Trident/Pure CSI) |
  +--------------------+
           |
           v
  Kubernetes/OpenShift presents all tiers
  uniformly via StorageClasses:
    storageclass: tier1-san     -> External SAN (iSCSI/FC)
    storageclass: tier2-hci     -> HCI replicated (RF=3)
    storageclass: tier3-archive -> HCI erasure coded (4+2)

The "Disaggregated HCI" Middle Ground

Pure HCI couples compute and storage on every node: adding storage means adding compute (and vice versa). This is inefficient when scaling is asymmetric. Disaggregated HCI introduces storage-only nodes alongside compute-storage nodes, or fully separates compute and storage into different node pools while keeping the software-defined storage layer.

Disaggregated HCI Architectures
==================================

Model A: Compute-Storage Nodes + Storage-Only Nodes

  +------------------+  +------------------+  +------------------+
  | Compute-Storage  |  | Compute-Storage  |  | Compute-Storage  |
  | Node 1           |  | Node 2           |  | Node 3           |
  | [VMs] [SDS]      |  | [VMs] [SDS]      |  | [VMs] [SDS]      |
  | [NVMe] [SSD]     |  | [NVMe] [SSD]     |  | [NVMe] [SSD]     |
  +--------+---------+  +--------+---------+  +--------+---------+
           |                     |                     |
           +----------+----------+----------+----------+
                      |                     |
  +--------+---------+  +--------+---------+
  | Storage-Only     |  | Storage-Only     |
  | Node 4           |  | Node 5           |
  | [SDS only]       |  | [SDS only]       |
  | [NVMe] [SSD]     |  | [NVMe] [SSD]     |
  | [SSD] [SSD]      |  | [SSD] [SSD]      |
  +------------------+  +------------------+

  Use case: need more storage capacity without adding compute


Model B: Fully Disaggregated (Compute Nodes + Storage Nodes)

  Compute Nodes (VMs only, no local storage for SDS):
  +----------+  +----------+  +----------+  +----------+
  | Compute  |  | Compute  |  | Compute  |  | Compute  |
  | Node 1   |  | Node 2   |  | Node 3   |  | Node 4   |
  | [VMs]    |  | [VMs]    |  | [VMs]    |  | [VMs]    |
  +----+-----+  +----+-----+  +----+-----+  +----+-----+
       |              |              |              |
       +------+-------+------+------+------+-------+
              |              |             |
              v              v             v
  Storage Nodes (SDS daemons + disks, no VMs):
  +----------+  +----------+  +----------+
  | Storage  |  | Storage  |  | Storage  |
  | Node 1   |  | Node 2   |  | Node 3   |
  | [SDS]    |  | [SDS]    |  | [SDS]    |
  | [NVMe x4]|  | [NVMe x4]|  | [NVMe x4]|
  | [SSD x8] |  | [SSD x8] |  | [SSD x8] |
  +----------+  +----------+  +----------+

  Use case: independent scaling, no resource contention between VMs and SDS

Platform support:
  - OVE/ODF: supports "infra nodes" for Ceph OSDs (disaggregated model B)
  - Azure Local: does NOT support disaggregated S2D; all nodes must participate
  - Ceph (upstream): fully supports disaggregated with separate MON/OSD/MDS nodes

How the Candidates Handle This

Comparison Table

Aspect | VMware (Current) | OVE (OpenShift Virtualization Engine) | Azure Local | Swisscom ESC
Primary storage architecture | HCI (vSAN) | HCI (Ceph/ODF) | HCI (S2D) | SAN (Dell PowerMax/PowerStore)
SDS engine | vSAN (proprietary) | Ceph (open-source, Red Hat supported) | Storage Spaces Direct (proprietary) | N/A (managed SAN)
Replication model | vSAN RAID-1/5/6 (policy-based) | Ceph: RF=2/3 or erasure coding (per pool) | S2D: 2-way/3-way mirror or parity (per volume) | Array-level RAID (managed by Swisscom)
Consistency model | Strong (synchronous replication) | Strong (synchronous replication, primary-copy) | Strong (synchronous replication) | Strong (array controller)
Write path | VM -> vSAN DOM -> CLOM -> journal -> replicate | VM -> QEMU -> librbd -> primary OSD -> journal -> replicate | VM -> Hyper-V -> ReFS -> CSV -> S2D cache -> replicate | VM -> array controller -> cache -> RAID write
Cache / Journal tier | Dedicated cache device per disk group | WAL + DB on NVMe, data on SSD/HDD | NVMe cache tier (read + write) | Array DRAM cache + NVMe journal
Failure domain awareness | Host-level (rack-aware with stretched cluster) | CRUSH map: disk, host, rack, row, site | Fault domains: node, chassis, rack, site | Array controller HA (managed)
Minimum nodes (HA) | 3 (witness for 2-node) | 3 (ODF compact mode) | 2 (with witness) | N/A (managed)
Maximum nodes per cluster | 64 | No hard limit (practical: 100+) | 16 | N/A (managed)
Disaggregated mode | Compute-only hosts allowed | ODF infra nodes (storage-only), fully disaggregated Ceph | Not supported -- all nodes must run S2D | N/A
External SAN integration | VMFS on SAN LUNs, RDMs | CSI drivers (Trident, Pure CSI, Dell CSI) | iSCSI/FC passthrough to VMs | Included (it IS the SAN)
Erasure coding | Yes (RAID-5/6 equivalent) | Yes (Ceph EC pools, configurable k+m) | Yes (parity volumes, single/dual parity) | N/A (array-level)
Data locality optimization | Automatic (vSAN locality awareness) | Primary affinity (configurable, not default) | Automatic (CSV ownership, SMB redirect) | N/A (SAN has no locality concept)
Capacity overhead (RF=3) | 3x raw (RAID-1 mirror across 3 hosts) | 3x raw (3 replicas) | 3x raw (3-way mirror) | 1.2-1.5x (RAID-6 or dual parity)
Encryption at rest | vSAN encryption (AES-256, vCenter KMS) | Ceph OSD encryption (dm-crypt/LUKS, KMIP) | BitLocker on CSV volumes (TPM-backed) | Managed by Swisscom (vendor encryption)
Snapshot mechanism | vSAN snapshots (redo logs, COW) | Ceph RBD snapshots (COW, instant) | Storage Spaces checkpoints (COW) | Array-native snapshots (managed)

Detailed Comparison

VMware vSAN (Current Baseline): vSAN is the HCI storage engine that the organization already operates. It pools local disks across ESXi hosts into a shared datastore, using policy-based storage management (SPBM) to define replication level, stripe width, and failure tolerance per VM. vSAN's CLOM (Cluster Level Object Manager) automatically places replicas across hosts, and DOM (Distributed Object Manager) handles I/O routing. The familiar operational model includes vCenter health checks, automatic rebalancing, and policy-driven provisioning. The exit motivation is Broadcom licensing, not vSAN's technical capabilities.

OVE -- Ceph/ODF: OVE uses OpenShift Data Foundation (ODF), which is Red Hat's productized distribution of Ceph. Ceph is the most widely deployed open-source distributed storage system, providing block (RBD), file (CephFS), and object (RGW) storage from a single platform.

Key Ceph architecture points for this evaluation:

  - RADOS is the underlying distributed object store; block (RBD), file (CephFS), and object (RGW) access are layered on top of the same cluster.
  - Placement is algorithmic via the CRUSH map (disk, host, rack, row, site failure domains); there is no metadata server on the block I/O path.
  - BlueStore writes directly to raw devices with no intervening filesystem; the WAL and DB typically sit on NVMe, with data on SSD/HDD.
  - Resiliency is configured per pool: RF=2/3 replication or erasure coding with configurable k+m.
  - There is no hard cluster size limit, and storage-only (disaggregated) nodes are fully supported.

ODF operational model on OpenShift:

  - ODF is deployed and lifecycle-managed by the Rook-Ceph operator; day-2 troubleshooting happens at the operator level rather than against a hand-built Ceph cluster.
  - VMs and containers consume ODF through standard Kubernetes StorageClasses and PersistentVolumeClaims (a minimal provisioning sketch follows below).
  - Compact mode runs Ceph on three converged nodes; infra nodes allow a storage-only, disaggregated footprint that scales independently of compute.
  - External SAN/NAS remains reachable via CSI drivers for workloads that need it.
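
To make the consumption model concrete, here is a minimal sketch of requesting an ODF-backed block volume through the Kubernetes API using the official Python client. The StorageClass name, PVC name, and namespace are illustrative assumptions -- actual StorageClass names are defined by the ODF deployment -- so this shows the consumption pattern, not a recommended provisioning workflow.

```python
# Minimal sketch: requesting ODF-backed block storage as a Kubernetes PVC.
# Assumes the kubernetes Python client is installed and a kubeconfig is available;
# the StorageClass name below is a typical ODF RBD class and may differ per deployment.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside the cluster

pvc = client.V1PersistentVolumeClaim(
    metadata=client.V1ObjectMeta(name="vm-data-disk-01"),          # hypothetical name
    spec=client.V1PersistentVolumeClaimSpec(
        access_modes=["ReadWriteOnce"],                             # single-VM block access
        storage_class_name="ocs-storagecluster-ceph-rbd",           # assumed ODF RBD class
        volume_mode="Block",                                        # raw block device for a VM disk
        resources=client.V1ResourceRequirements(requests={"storage": "100Gi"}),
    ),
)

client.CoreV1Api().create_namespaced_persistent_volume_claim(
    namespace="demo-vms",  # hypothetical namespace
    body=pvc,
)
```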

Azure Local -- Storage Spaces Direct (S2D): S2D is Microsoft's HCI storage engine built into Windows Server. It pools local NVMe, SSD, and HDD drives across cluster nodes into a unified storage pool, presented as Cluster Shared Volumes (CSVs) formatted with ReFS.

Key S2D architecture points:

  - Local NVMe/SSD/HDD devices on every node are pooled over the Software Storage Bus and presented as Cluster Shared Volumes formatted with ReFS.
  - Resiliency is set per volume: 2-way/3-way mirror or single/dual parity, with mirror-accelerated parity available for mixed hot/cold data.
  - NVMe devices form the cache tier (read + write); fault domains can be defined at node, chassis, rack, or site level.
  - CSV ownership determines which node coordinates I/O for a volume, with SMB redirection between nodes.

S2D constraints for this evaluation:

  - Hard limit of 16 nodes per cluster, so a 5,000+ VM estate implies multiple clusters and a multi-cluster management model.
  - No disaggregated mode: every node in the cluster must contribute storage.
  - ReFS is mandatory on CSVs, which affects deduplication behavior and backup tooling compatibility and must be validated against the existing backup stack.
  - Minimum of 2 nodes plus a witness for HA.

Swisscom ESC -- Managed SAN: ESC uses traditional SAN (Dell PowerMax/PowerStore behind VxBlock). The customer has zero visibility into or control over the storage architecture. Capacity is consumed as a managed service: request storage, get a volume, attach to VM. The trade-off is clear: no operational burden, but no optimization capability, no architecture choice, and complete dependency on Swisscom for performance, capacity planning, and data protection.


Key Takeaways

  1. SAN is not dead, but it is no longer the default. For this evaluation, SAN remains relevant for two scenarios: (a) Swisscom ESC, where it is the underlying architecture hidden behind a managed service, and (b) hybrid architectures where performance-critical Tier-1 workloads on OVE or Azure Local consume external SAN via CSI drivers. For the 80-90% of general-purpose VMs, HCI is the architecturally appropriate choice. Do not invest in new SAN infrastructure unless a specific workload justifies it with measured performance data.

  2. HCI capacity overhead is the hidden cost. RF=3 replication means only 33% of raw storage is usable. For 5,000+ VMs requiring, say, 200 TB usable, you need 600 TB raw NVMe/SSD -- a significant hardware cost. Compare this to SAN with RAID-6, where 200 TB usable requires approximately 250-280 TB raw. Erasure coding on HCI (e.g., Ceph EC 4+2) improves efficiency to ~67% usable, but with higher latency and CPU cost. The PoC should include a detailed capacity model comparing raw-to-usable ratios for each candidate with the actual workload profile. (A worked raw-to-usable sketch follows these takeaways.)

  3. The write path determines the performance floor. In HCI, every write traverses the network at least twice (primary journal + replica journal). This means the storage network is on the critical path for write latency. A misconfigured or congested storage network does not just slow down storage -- it directly increases write latency for every VM on the cluster. Dedicated storage network interfaces and correct RDMA/PFC/ECN configuration are not optional -- they are prerequisites for acceptable HCI write performance. (A simplified latency-floor sketch follows these takeaways.)

  4. Ceph and S2D are architecturally different in meaningful ways. Ceph is a distributed object store with no filesystem on the data path (BlueStore writes directly to raw devices). S2D is a distributed block layer that sits underneath ReFS, which is a filesystem. Ceph uses CRUSH for algorithmic placement (no metadata server for block storage). S2D uses the Software Storage Bus with CSV owner coordination. These differences matter for failure behavior, recovery speed, and scalability limits (S2D: 16 nodes; Ceph: effectively unlimited).

  5. Disaggregated HCI is a competitive advantage of OVE. ODF supports dedicated storage nodes (infra nodes) that run only Ceph OSDs, separate from compute nodes running VMs. This allows independent scaling of compute and storage, approaching the flexibility of traditional SAN without the dedicated hardware. Azure Local does not support disaggregated S2D -- every node must contribute storage. For an environment with 5,000+ VMs where compute and storage growth rates may differ, this is a meaningful architectural difference.

  6. Journal sizing is the most common HCI performance mistake. An undersized journal (WAL) device causes write stalls when burst I/O fills the journal faster than it can be flushed. Both Ceph and S2D use NVMe devices as the journal/cache tier. The PoC must test sustained write performance under load to validate that the journal tier can absorb peak write rates without stalling. Ask vendors for journal sizing guidelines based on your measured write profile from the VMware baseline.

  7. Rebalancing is the operational unknown. Adding or removing nodes from an HCI cluster triggers data redistribution that consumes network bandwidth and disk I/O for hours. During rebalancing, production VM performance degrades. The PoC should explicitly test rebalancing impact: add a node during a load test and measure the latency degradation. Both Ceph and S2D provide throttling controls, but defaults may not be tuned for a 5,000+ VM production environment.

  8. NAS remains relevant for file-level workloads. HCI replacing SAN does not mean HCI replaces NAS. Shared file access (configuration repositories, user home directories, application data shared across VMs) is still best served by NFS/SMB from a dedicated NAS or from the SDS file layer (CephFS, S2D SMB shares). The evaluation should clarify whether OVE's CephFS or Azure Local's SMB shares provide adequate NAS functionality or whether an external NAS (NetApp, PowerScale) remains necessary.

  9. SAN operational skills do not transfer to HCI. A team skilled in FC zoning, LUN masking, ALUA multipathing, and array-level snapshots will find HCI operations fundamentally different. Ceph operations involve CRUSH map management, PG balancing, OSD lifecycle, and Rook-Ceph operator troubleshooting. S2D operations involve PowerShell, CSV ownership, storage tier management, and S2D repair jobs. The training and skill transition plan must be explicit, with a timeline that precedes production deployment.
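
Takeaway 2 can be made concrete with a back-of-envelope raw-to-usable calculation. The sketch below uses the overhead factors from the comparison table (RF=3, Ceph EC 4+2, and a RAID-6 8+2 group) and the 200 TB target from the takeaway; it deliberately ignores metadata/filesystem overhead and free-space headroom, both of which a real capacity model must add.

```python
# Back-of-envelope raw capacity needed for a usable-capacity target under
# different protection schemes (factors from the comparison table above).
# Ignores metadata/filesystem overhead and recovery headroom.

def raw_required(usable_tb: float, data_fraction: float) -> float:
    """Raw TB needed when only `data_fraction` of raw capacity holds user data."""
    return usable_tb / data_fraction

TARGET_USABLE_TB = 200  # illustrative target from takeaway 2

schemes = {
    "RF=3 replication (3 copies)":      1 / 3,        # ~33% usable
    "Ceph EC 4+2 (k=4, m=2)":           4 / (4 + 2),  # ~67% usable
    "RAID-6 / dual parity (8+2 group)": 8 / (8 + 2),  # ~80% usable
}

for name, fraction in schemes.items():
    print(f"{name:38s} -> {raw_required(TARGET_USABLE_TB, fraction):6.0f} TB raw")
# RF=3 -> 600 TB, EC 4+2 -> 300 TB, RAID-6 8+2 -> 250 TB
```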
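
Takeaway 3 can likewise be illustrated with a simplified latency-floor model: assuming the primary forwards writes to its replicas in parallel and acknowledges only after the slowest replica commits its journal, the network round-trip dominates the achievable write latency. The RTT and journal-commit figures below are placeholders, not measurements, and the model omits queuing, software-stack overhead, and degraded-mode behavior.

```python
# Simplified write-latency floor for replicated HCI writes (illustrative model,
# not a vendor formula): the client sees a commit only after the primary's local
# journal write AND the slowest replica's journal write (forwarded in parallel).

def write_latency_floor_us(client_to_primary_rtt_us: float,
                           primary_to_replica_rtt_us: float,
                           journal_commit_us: float,
                           replica_count: int) -> float:
    local = journal_commit_us
    # Parallel fan-out to (replica_count - 1) replicas; the slowest gates the ack.
    remote = primary_to_replica_rtt_us + journal_commit_us if replica_count > 1 else 0.0
    return client_to_primary_rtt_us + max(local, remote)

# Placeholder numbers: ~30 us RTT on a healthy RDMA fabric vs ~300 us on a
# congested/lossy fabric, ~20 us NVMe journal commit, RF=3.
for label, rtt_us in [("healthy RDMA fabric", 30.0), ("congested/lossy fabric", 300.0)]:
    floor = write_latency_floor_us(rtt_us, rtt_us, 20.0, replica_count=3)
    print(f"{label:24s} -> ~{floor:5.0f} us minimum per 4K write")
```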


Discussion Guide

The following questions are designed for vendor deep-dives, PoC planning sessions, and internal architecture reviews. They probe the practical implications of storage architecture choices.

Questions for All Candidates

  1. Capacity overhead quantification: "For our target of X TB usable block storage for VMs, how much raw disk capacity is required under your recommended resiliency configuration? Break this down: raw capacity, capacity after replication/parity, capacity after metadata/filesystem overhead, and final usable capacity. How does this change if we use erasure coding for Tier-2/3 workloads?"

  2. Write latency under replication: "Walk us through the exact write path from VM application to durable commit, including all replication hops. What is the measured p50, p95, p99, and p99.9 write latency for 4K random writes at 80% cluster utilization with your recommended replication level? How does write latency change when one node is in recovery (degraded mode)?"

  3. Rebalancing impact: "Add a node to a 50%-utilized cluster during a sustained fio workload. What is the measured IOPS and latency degradation during rebalancing? How long does rebalancing take for 10 TB of data? What controls exist to throttle rebalancing and protect production I/O?"

  4. Failure recovery time: "A node with 10 TB of data fails at 3 AM. How long until the cluster is back to full redundancy (RF=3 restored)? What is the VM-visible latency impact during recovery? How does the system prioritize recovery I/O vs production I/O? Is there a risk of cascading failure if a second node fails during recovery?" (A back-of-envelope recovery-time sketch follows this question list.)

  5. Journal/cache tier sizing: "How should we size the NVMe journal/cache tier for a workload profile of X IOPS write, Y GB/s sequential write? What happens when the journal fills up? How do we monitor journal utilization and set alerts before a performance cliff?"
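
For questions 3 and 4, a rough estimator helps frame vendor answers: the time to restore redundancy is essentially the data to be re-replicated divided by the bandwidth recovery is allowed to consume. The node count, NIC speed, and throttle fraction below are placeholders; in practice, disk I/O contention and small-object overhead stretch the result considerably.

```python
# Rough estimator for questions 3-4: time to restore full redundancy after a
# node failure (or to rebalance after adding a node), given how much data must
# move and how much bandwidth recovery is allowed to consume. Illustrative only;
# real recovery is also gated by repair scheduling and production I/O contention.

def recovery_hours(data_to_move_tb: float,
                   surviving_nodes: int,
                   per_node_recovery_gbps: float,
                   throttle_fraction: float) -> float:
    effective_gbps = surviving_nodes * per_node_recovery_gbps * throttle_fraction
    effective_tb_per_hour = effective_gbps / 8 * 3600 / 1000  # Gbit/s -> TB/h
    return data_to_move_tb / effective_tb_per_hour

# Example: 10 TB on the failed node, 7 surviving nodes, 25 Gbit/s NICs,
# recovery throttled to 20% so production I/O keeps priority.
print(f"~{recovery_hours(10, 7, 25.0, 0.20):.1f} h to restore RF=3 (best case)")
```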
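
For question 5, a simple burst-absorption model gives a first-order journal size: the journal must hold whatever arrives faster than the backing tier can flush, for the length of the burst, plus headroom. All figures below are placeholders to be replaced with the measured VMware write profile and vendor WAL/DB or cache-sizing guidance.

```python
# Simple burst-absorption model for question 5: the NVMe journal/WAL must absorb
# whatever arrives faster than the backing tier can flush, for the length of the
# burst, plus headroom. Placeholder numbers only.

def journal_size_gb(peak_write_gbps: float,
                    sustained_flush_gbps: float,
                    burst_seconds: float,
                    headroom: float = 2.0) -> float:
    backlog_gbps = max(peak_write_gbps - sustained_flush_gbps, 0.0)
    backlog_gb = backlog_gbps / 8 * burst_seconds  # Gbit/s -> GB over the burst
    return backlog_gb * headroom

# Example: 20 Gbit/s peak writes per node, backing tier flushes 8 Gbit/s,
# 2-minute burst, 2x headroom.
print(f"~{journal_size_gb(20.0, 8.0, 120):.0f} GB journal per node")  # ~360 GB
```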

Questions Specific to OVE (Ceph/ODF)

  1. CRUSH map design for our topology: "We have X racks across 2 data center rooms. Design a CRUSH map that ensures RF=3 replicas are placed on different racks. How does the CRUSH map change when we add a third data center room? Can CRUSH rules be changed online without data movement?"

  2. PG count and autoscaling: "What PG count should we configure for our pool sizes? Does the ODF operator enable PG autoscaling by default? What are the risks of PG splits and merges on production workload performance? Can PG operations be scheduled during maintenance windows?" (A common starting-point formula is sketched after this question list.)

  3. BlueStore tuning for NVMe: "Is the default BlueStore configuration optimized for all-NVMe deployments? What bluestore_cache_size, bluestore_min_alloc_size, and WAL/DB sizing are recommended for our workload profile? Should we allocate separate NVMe devices for WAL+DB, or collocate them with data?"

  4. Disaggregated ODF: "We want to run compute-only worker nodes alongside dedicated storage (infra) nodes. What is the minimum number of storage nodes for ODF? Can we mix node sizes (different disk counts) without CRUSH weight imbalances? What network bandwidth is required between compute and storage nodes?"
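
For question 2, the commonly cited starting point for PG count is (number of OSDs x target PGs per OSD) / pool replica size, rounded up to a power of two, with roughly 100 PGs per OSD as the target. The sketch below encodes that rule of thumb as a sanity check; the ODF/Ceph PG autoscaler and vendor guidance should take precedence.

```python
# Back-of-envelope PG count for a replicated pool (question 2). Rule of thumb:
# (OSDs x ~100 target PGs per OSD) / pool size, rounded up to a power of two.
# Treat as a sanity check, not a substitute for the PG autoscaler or vendor advice.

def suggested_pg_count(num_osds: int, pool_size: int, target_pgs_per_osd: int = 100) -> int:
    raw = num_osds * target_pgs_per_osd / pool_size
    power = 1
    while power < raw:   # round up to the next power of two
        power *= 2
    return power

# Example: 36 OSDs (3 storage nodes x 12 devices), RF=3 pool.
print(suggested_pg_count(36, 3))   # -> 2048
```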

Questions Specific to Azure Local (S2D)

  1. 16-node cluster limit: "We need to host 5,000+ VMs. With a 16-node cluster limit, how many S2D clusters do we need? How do VMs on one cluster access storage on another? Is there a shared-nothing penalty for cross-cluster storage access? How does management scale across multiple clusters?" (A simple cluster-count sketch follows this question list.)

  2. Mirror-accelerated parity: "For cold/warm data, S2D offers mirror-accelerated parity (MAP). How does the system decide when to tier data from the mirror layer to the parity layer? What is the read latency from the parity layer compared to the mirror layer? Can we control tiering thresholds?"

  3. ReFS vs NTFS implications: "S2D requires ReFS. What limitations does ReFS impose compared to NTFS (e.g., deduplication behavior, backup agent compatibility, file-level restore)? Are all our backup tools (Veeam, Commvault) fully compatible with ReFS on CSV volumes?"
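
For question 1, a trivial calculation frames the multi-cluster discussion: given a VM density per node and the 16-node limit, how many S2D clusters does the estate imply? The VMs-per-node figure below is a placeholder; the measured VMware consolidation ratio should be used instead.

```python
# Rough cluster-count estimate for question 1: with a 16-node-per-cluster limit,
# how many S2D clusters does the VM estate imply? Keeps one node per cluster free
# as failure/maintenance headroom. VM density is a placeholder.
import math

def s2d_clusters_needed(total_vms: int, vms_per_node: int,
                        max_nodes_per_cluster: int = 16,
                        reserve_nodes_per_cluster: int = 1) -> int:
    usable_nodes = max_nodes_per_cluster - reserve_nodes_per_cluster
    vms_per_cluster = usable_nodes * vms_per_node
    return math.ceil(total_vms / vms_per_cluster)

# Example: 5,000 VMs at ~40 VMs per node -> 9 clusters of 16 nodes.
print(s2d_clusters_needed(5000, 40))
```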

Questions Specific to Swisscom ESC

  1. Storage architecture transparency: "We understand ESC uses Dell VxBlock with PowerMax/PowerStore. Can you confirm the RAID level and protection scheme used for our tenant? Is our data on dedicated LUNs or shared with other tenants? What is our performance isolation guarantee -- is there QoS at the array level, and what are our IOPS/latency SLAs?"

  2. Storage scaling and limits: "What is the maximum storage capacity available per tenant? If we need to scale from 100 TB to 500 TB, is this a config change or a hardware procurement? What is the lead time? Can we burst beyond our committed capacity?"

Architecture-Level Questions (for Internal Discussion)

  1. Hybrid architecture decision: "For our 5,000+ VMs, what percentage realistically need SAN-grade performance (sub-200 us, deterministic)? Should we provision an external SAN for those workloads and use HCI for the rest? What is the TCO difference between full-HCI and hybrid (HCI + external SAN for Tier-1)? Consider hardware, licensing, support contracts, and operational skill sets."

  2. Capacity planning model: "Build a 5-year capacity model for each candidate, starting from our current VMware storage consumption. Account for: replication overhead (RF=3 vs RAID-6), growth rate (15-20% annual), snapshot space, thin provisioning overcommit safety margin, and recovery headroom (enough free space to absorb a node failure's re-replication). What is the year-1 and year-5 raw capacity requirement for each candidate?" (A skeleton capacity model is sketched after these questions.)

  3. NAS strategy alongside HCI: "If we adopt HCI for block storage, do we still need a dedicated NAS (NetApp, PowerScale) for file shares, or can CephFS / S2D SMB shares replace it? What are the feature gaps (Active Directory integration, DFS namespaces, qtrees, quotas, virus scanning integration)? Is a dedicated NAS for file workloads + HCI for block workloads a cleaner operational model than trying to do everything on HCI?"

  4. Storage operations skill transition: "Our current team operates VMware vSAN and NetApp ONTAP. Map the skill equivalencies to Ceph/ODF and S2D. For each new skill area (CRUSH map management, OSD lifecycle, PG balancing for Ceph; PowerShell storage cmdlets, CSV management, ReFS administration for S2D), what training is available, what is the estimated ramp-up time, and when in the project timeline must the team be proficient?"
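
For question 2, the skeleton below shows one way to structure the 5-year model: grow current usable consumption, add a snapshot allowance, then convert to raw per candidate using protection overhead and free-space headroom for post-failure re-replication. Every numeric input is a placeholder to be replaced with measured VMware data and vendor-confirmed overheads.

```python
# Skeleton 5-year capacity model (question 2). All figures are placeholders:
# usable consumption, growth rate, snapshot allowance, protection overhead, and
# free-space headroom must come from measured data and vendor guidance.

def raw_capacity_tb(year: int,
                    usable_now_tb: float,
                    annual_growth: float,
                    data_fraction: float,       # usable share of raw: RF=3 -> 1/3, RAID-6 -> ~0.8
                    snapshot_overhead: float,   # e.g. 0.15 = 15% extra for snapshots
                    free_space_headroom: float  # e.g. 0.25 = keep 25% free for recovery
                    ) -> float:
    usable = usable_now_tb * (1 + annual_growth) ** year
    needed = usable * (1 + snapshot_overhead)
    return needed / data_fraction / (1 - free_space_headroom)

candidates = {
    "OVE / Ceph RF=3":          dict(data_fraction=1 / 3, free_space_headroom=0.25),
    "Azure Local 3-way mirror": dict(data_fraction=1 / 3, free_space_headroom=0.25),
    "ESC managed RAID-6":       dict(data_fraction=0.80, free_space_headroom=0.10),
}

for name, p in candidates.items():
    y1 = raw_capacity_tb(1, 200, 0.18, snapshot_overhead=0.15, **p)
    y5 = raw_capacity_tb(5, 200, 0.18, snapshot_overhead=0.15, **p)
    print(f"{name:26s} year-1 ~{y1:5.0f} TB raw, year-5 ~{y5:5.0f} TB raw")
```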


Next: 05-sds-platforms.md -- Software-Defined Storage Platforms (Ceph/ODF, Storage Spaces Direct, and direct comparison)