Modern datacenters and beyond

vSAN -- Current Storage Baseline

Why This Matters

vSAN is the storage engine underneath every VM in the current VMware environment. Before evaluating Ceph/ODF, Storage Spaces Direct, or managed SAN services as replacements, we need a precise understanding of what vSAN actually does -- not at the marketing level ("hyper-converged software-defined storage") but at the level of internal components, data paths, failure handling, and operational behavior. Every candidate will be measured against the operational reality of vSAN: its strengths (policy-based provisioning, transparent rebalancing, integrated health checks) and its weaknesses (Broadcom licensing, opaque internals, scaling limits, multicast dependencies).

This document serves three purposes:

  1. Establish the baseline. Quantify what vSAN delivers today -- IOPS, latency, capacity overhead, failure recovery time -- so PoC acceptance criteria are grounded in reality, not vendor datasheets.
  2. Identify dependencies. Discover which vSAN behaviors (SPBM integration, CLOM placement, DOM object structure) our operations team relies on explicitly or implicitly. These need conceptual equivalents in any replacement.
  3. Know what to leave behind. Some vSAN mechanisms are VMware-proprietary couplings that should not be replicated -- they should be replaced with better abstractions in the target platform.

Concepts

1. vSAN Architecture Overview

vSAN is a distributed storage layer embedded in the ESXi hypervisor kernel. It pools local disks (NVMe, SSD, HDD) across all ESXi hosts in a vSAN cluster into a single shared datastore. Unlike external SAN/NAS, there is no dedicated storage controller -- every ESXi host participates as both a compute node and a storage node.

Cluster Topology

vSAN Cluster (typical: 4-64 hosts)
=================================================

  +-----------+    +-----------+    +-----------+    +-----------+
  | ESXi Host |    | ESXi Host |    | ESXi Host |    | ESXi Host |
  |    01     |    |    02     |    |    03     |    |    04     |
  +-----------+    +-----------+    +-----------+    +-----------+
  | Disk Grp 1|    | Disk Grp 1|    | Disk Grp 1|    | Disk Grp 1|
  |  [NVMe-C] |    |  [NVMe-C] |    |  [NVMe-C] |    |  [NVMe-C] |
  |  [SSD-1]  |    |  [SSD-1]  |    |  [SSD-1]  |    |  [SSD-1]  |
  |  [SSD-2]  |    |  [SSD-2]  |    |  [SSD-2]  |    |  [SSD-2]  |
  |  [SSD-3]  |    |  [SSD-3]  |    |  [SSD-3]  |    |  [SSD-3]  |
  | Disk Grp 2|    | Disk Grp 2|    | Disk Grp 2|    | Disk Grp 2|
  |  [NVMe-C] |    |  [NVMe-C] |    |  [NVMe-C] |    |  [NVMe-C] |
  |  [SSD-4]  |    |  [SSD-4]  |    |  [SSD-4]  |    |  [SSD-4]  |
  |  [SSD-5]  |    |  [SSD-5]  |    |  [SSD-5]  |    |  [SSD-5]  |
  |  [SSD-6]  |    |  [SSD-6]  |    |  [SSD-6]  |    |  [SSD-6]  |
  +-----------+    +-----------+    +-----------+    +-----------+
        |                |                |                |
        +----------------+----------------+----------------+
                                 |
                vSAN Network (vmk1, 25 GbE per host)
                                 |
                      Shared vSAN Datastore
                    (single namespace, all hosts)

Disk Groups

A disk group is the fundamental storage unit on each host. Each disk group contains exactly one cache device and one or more capacity devices.

Disk Group Structure
======================

+-------------------------------------------+
|                Disk Group                 |
|                                           |
|   +-------------------+                   |
|   |   Cache Tier      |  1 device only    |
|   |   (NVMe or SSD)   |  hybrid: 70% read |
|   |                   |  cache, 30% write |
|   |                   |  buffer           |
|   |                   |  AF: 100% write   |
|   |                   |  buffer           |
|   +-------------------+                   |
|                                           |
|   +--------+ +--------+ +--------+        |
|   | Cap    | | Cap    | | Cap    |  1-7   |
|   | Disk 1 | | Disk 2 | | Disk 3 |  devs  |
|   | (SSD)  | | (SSD)  | | (SSD)  |        |
|   +--------+ +--------+ +--------+        |
+-------------------------------------------+

Rules:
- Max 5 disk groups per host
- Max 7 capacity devices per disk group
- Max 35 capacity devices per host (5 x 7)
- Cache device must be >= 10% of total capacity
  in the disk group (recommendation, not hard limit)
- In all-flash configurations:
    Cache = write buffer only (100% write buffer)
    No read cache (flash capacity is fast enough)
- In hybrid configurations (SSD cache + HDD capacity):
    Cache = 70% read cache + 30% write buffer
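
The limits above can be encoded as a quick sanity check. This is a sketch with hypothetical helper names; the 10% figure is treated as a warning rather than an error, per the note that it is a recommendation:

```python
def check_disk_group(cache_tb: float, capacity_tb: list[float]) -> list[str]:
    """Return rule violations/warnings for one disk group."""
    issues = []
    if not 1 <= len(capacity_tb) <= 7:
        issues.append("capacity devices must number 1-7 per disk group")
    if cache_tb < 0.10 * sum(capacity_tb):
        issues.append("cache below 10% of group capacity (recommendation)")
    return issues

def check_host(disk_groups: list[tuple[float, list[float]]]) -> list[str]:
    """disk_groups: list of (cache_tb, [capacity_tb, ...]) per group."""
    issues = []
    if len(disk_groups) > 5:
        issues.append("more than 5 disk groups on host")
    if sum(len(caps) for _, caps in disk_groups) > 35:
        issues.append("more than 35 capacity devices on host")
    for i, (cache, caps) in enumerate(disk_groups, 1):
        issues += [f"DG{i}: {msg}" for msg in check_disk_group(cache, caps)]
    return issues
```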

All-flash vs. hybrid behavior: In an all-flash (AF) configuration, which is standard for new deployments, the cache tier serves exclusively as a write buffer. Reads go directly to the capacity SSDs because random read performance on flash is already high enough. In legacy hybrid configurations (SSD cache + HDD capacity), 70% of the cache device is dedicated to read caching to avoid hitting slow HDD random reads.

Witness Nodes and Stretched Clusters

For 2-node vSAN clusters or stretched clusters across two sites, a witness node is required to maintain quorum. The witness stores only metadata (witness components), not data.

Stretched Cluster Topology
============================

    Site A                          Site B
  +----------+                    +----------+
  | Host A-1 |  <-- sync repl --> | Host B-1 |
  | Host A-2 |  <-- sync repl --> | Host B-2 |
  +----------+                    +----------+
       \                               /
        \                             /
         \       +----------+        /
          +----->| Witness  |<------+
                 | (Site C) |
                 |  ESXi VM |
                 | metadata |
                 |   only   |
                 +----------+

Witness role:
- Stores witness components (small metadata objects, ~2 MB each)
- Breaks quorum tie when one site is unreachable
- Does NOT store data components
- Must be in a third failure domain (different rack/site)
- Can be a nested ESXi VM (VMware provides the appliance OVA)

Quorum rule:
  Object accessible IF > 50% of votes (components) are reachable
  With FTT=1 (mirroring): data on Site A, data on Site B, witness on Site C
    Site A fails -> Site B + Witness = 2/3 votes = quorum maintained
    Site C fails -> Site A + Site B = 2/3 votes = quorum maintained
    Site A + Site C fail -> only Site B = 1/3 votes = NO quorum, object offline
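
The quorum rule above can be written down directly. A small sketch; site names and vote counts are illustrative:

```python
def object_accessible(site_votes: dict[str, int], failed: set[str]) -> bool:
    """An object stays accessible only while a strict majority (> 50%)
    of its votes sits on reachable sites."""
    total = sum(site_votes.values())
    reachable = sum(v for site, v in site_votes.items() if site not in failed)
    return reachable * 2 > total

# FTT=1 stretched cluster: data on Site A, data on Site B, witness on Site C
votes = {"A": 1, "B": 1, "C": 1}
```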

2. vSAN Data Path Internals

vSAN's I/O stack is composed of five components that live in or alongside the ESXi vmkernel (CLOM runs as a user-space daemon, clomd; DOM, LSOM, CMMDS, and RDT are vmkernel modules). Understanding these is essential for performance troubleshooting and for identifying which behaviors need conceptual replacements.

vSAN I/O Stack (Kernel-Space Components)
==========================================

  VM issues I/O (SCSI command via vmhba adapter)
       |
       v
  +----------------------------------------------------------+
  |  CLOM  (Cluster-Level Object Manager)                    |
  |  - Policy engine: translates SPBM policies into          |
  |    placement decisions                                    |
  |  - Decides WHERE components live (which hosts, which DGs) |
  |  - Enforces FTT, stripe width, locality rules             |
  |  - Runs as a user-space daemon (clomd) on every host      |
  +----------------------------------------------------------+
       |  (placement map: "put component X on host Y, DG Z")
       v
  +----------------------------------------------------------+
  |  DOM  (Distributed Object Manager)                       |
  |  - Object I/O coordinator                                 |
  |  - Owns the object tree (components, witnesses, RAID)     |
  |  - Routes reads to nearest replica                        |
  |  - Sends writes to all replicas (sync replication)        |
  |  - Handles object-level operations (create, delete, grow) |
  |  - Runs on the host that owns the object (object owner)   |
  +----------------------------------------------------------+
       |  (I/O dispatched to component hosts)
       v
  +----------------------------------------------------------+
  |  LSOM  (Local Log-Structured Object Manager)             |
  |  - Local I/O engine on each host                          |
  |  - Writes: append to write buffer (cache tier), then      |
  |    destages to capacity tier in background                 |
  |  - Reads: check write buffer -> check capacity tier        |
  |  - Manages on-disk layout (log-structured)                |
  |  - Handles dedup/compression at the component level       |
  +----------------------------------------------------------+
       |
       v
  +----------------------------------------------------------+
  |  CMMDS  (Cluster Monitoring, Membership, and Directory   |
  |          Services)                                        |
  |  - Cluster membership protocol (heartbeat, health)        |
  |  - Distributed metadata directory                         |
  |  - Tracks: object locations, component states, disk UUIDs |
  |  - Uses gossip protocol between hosts                     |
  |  - Previously multicast-based, now unicast (vSAN 6.6+)   |
  +----------------------------------------------------------+
       |
       v
  +----------------------------------------------------------+
  |  RDT  (Reliable Datagram Transport)                      |
  |  - vSAN's custom network transport protocol               |
  |  - Sits on top of TCP (port 2233)                         |
  |  - Provides: congestion control, flow control,            |
  |    retransmission, ordering                               |
  |  - Carries DOM and CMMDS inter-host traffic               |
  |  - Optimized for datacenter latency (not WAN)             |
  +----------------------------------------------------------+
       |
       v
  Physical Network (25/100 GbE)

Write Path in Detail

Understanding the write path is critical because write latency directly impacts VM performance for database and transactional workloads.

vSAN Write Path (FTT=1, RAID-1 Mirror)
=========================================

VM on Host 1 writes 4 KiB block
       |
       v
  DOM (Host 1, object owner)
       |
       +---> LSOM on Host 1 (local replica)
       |       |
       |       +-> Write to cache device (NVMe write buffer)
       |       +-> Acknowledge to DOM
       |
       +---> RDT network --> LSOM on Host 3 (remote replica)
                |
                +-> Write to cache device (NVMe write buffer)
                +-> Acknowledge to DOM (via RDT)
       |
       v
  DOM receives ACKs from BOTH replicas
       |
       v
  ACK returned to VM
       |
  (write complete -- latency = max(local_cache, network + remote_cache))

Background destage (async, not on write path):
  LSOM periodically flushes write buffer to capacity tier
  - Destage threshold: ~70-80% write buffer utilization
  - Destage I/O is sequential, optimized for capacity device bandwidth
  - If write buffer fills to 100%: writes block until destage frees space
    (this is a performance emergency -- "write buffer full" alarm)

Typical write latency breakdown (all-flash, 25 GbE):
  Local cache write:     ~50-100 us (NVMe)
  Network RTT:           ~50-100 us (25 GbE, same rack)
  Remote cache write:    ~50-100 us (NVMe)
  DOM coordination:      ~10-30 us
  Total:                 ~150-300 us (typical)
  Total:                 ~300-600 us (cross-rack, busy network)
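
The max() relationship at the end of the write-path diagram can be made explicit. A sketch using the microsecond figures above as example inputs (function name is illustrative):

```python
def mirror_write_latency_us(local_cache_us: float, remote_cache_us: float,
                            network_rtt_us: float, dom_us: float) -> float:
    """FTT=1 RAID-1 write: completes when the slower of the local path
    and the network + remote path has acknowledged, plus DOM overhead."""
    local_path = local_cache_us
    remote_path = network_rtt_us + remote_cache_us
    return max(local_path, remote_path) + dom_us

# Mid-range figures from the breakdown above: ~170 us total
latency = mirror_write_latency_us(75, 75, 75, 20)
```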

Read Path in Detail

vSAN Read Path
================

VM on Host 1 reads 4 KiB block
       |
       v
  DOM (Host 1, object owner)
       |
       +---> Is data in local LSOM? (Host 1 has a component?)
       |       |
       |     YES --> Read from local capacity tier (or write buffer if recent)
       |             Latency: ~100-200 us (NVMe)
       |
       |     NO  --> Route to remote host with nearest component
       |             RDT network --> LSOM on Host 3
       |             Read from capacity tier on Host 3
       |             Latency: ~200-400 us (network + NVMe)
       |
       v
  Data returned to VM

Read optimization (all-flash):
  - No read cache (cache device = write buffer only)
  - Reads go directly to capacity flash
  - DOM prefers local replica if available (data locality)
  - If VM migrates via vMotion, reads temporarily go remote
    until vSAN rebalances components (can take hours)

LSOM Internals: Log-Structured I/O

LSOM uses a log-structured write model on the cache device. All writes are appended sequentially to the write buffer log, regardless of the original I/O pattern. This converts random writes into sequential writes on the cache device, which is beneficial for NAND endurance and write performance.

LSOM Write Buffer (Cache Device)
==================================

  Cache NVMe (e.g., 800 GB)
  +-------------------------------------------------------+
  |  Log Head --> [W1][W2][W3][W4][W5][W6] ... [Wn] <--   |
  |                                                        |
  |  Destage pointer -->  [W1] already flushed to capacity |
  |                       [W2] flushing now                |
  |                       [W3..Wn] pending destage         |
  +-------------------------------------------------------+

  Write buffer behavior:
  1. New write appended at log head
  2. Destage thread reads from destage pointer, writes to capacity tier
  3. Freed space reclaimed for new writes
  4. If log wraps around (buffer full):
     - Write latency spikes (destage becomes synchronous)
     - "Write buffer full" condition --> CRITICAL ALARM

  Monitoring:
    esxcli vsan debug disk info (shows write buffer utilization)
    vSAN Performance Service: "Write Buffer Fill %" metric
    Alert threshold: > 80% sustained for > 30 seconds
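
The alert rule (> 80% sustained for > 30 seconds) can be sketched as a check over sampled fill percentages. The function name is hypothetical, not a vSAN API:

```python
def buffer_alarm(samples: list[tuple[float, float]],
                 threshold_pct: float = 80.0, hold_s: float = 30.0) -> bool:
    """samples: (timestamp_s, fill_pct) in time order.
    Alarm only if fill stays above threshold_pct for at least hold_s."""
    breach_start = None
    for t, pct in samples:
        if pct > threshold_pct:
            if breach_start is None:
                breach_start = t
            if t - breach_start >= hold_s:
                return True
        else:
            breach_start = None  # dip below threshold resets the window
    return False
```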

3. Storage Policies (SPBM)

Storage Policy-Based Management (SPBM) is vSAN's mechanism for defining storage characteristics declaratively. Instead of configuring RAID levels, stripe counts, and failure tolerance at the storage layer, administrators define policies that describe the desired behavior. vSAN (specifically CLOM) translates these policies into component placement decisions.

Policy Capabilities

Policy Attribute               Values                       Effect
----------------               ------                       ------
failures to tolerate (FTT)     0, 1, 2, 3                   Number of host/disk failures the object survives
failure tolerance method       RAID-1 (mirror),             How redundancy is implemented
                               RAID-5/6 (erasure coding)
stripe width                   1-12                         Number of capacity devices data is striped across
force provisioning             yes/no                       Provision even if policy cannot be satisfied
object space reservation       0-100%                       Percentage of thick provisioning (0 = fully thin)
flash read cache reservation   0-100%                       Reserved read cache (hybrid only, ignored in AF)
IOPS limit                     0 (unlimited) or value       Per-object IOPS cap (basic QoS)
disable object checksum        yes/no                       Disable end-to-end checksum (not recommended)
storage tier                   (vSAN with HCI Mesh)         Target a specific tier in the disaggregated model

FTT and Host Requirements

FTT   Method      Min Hosts   Capacity Overhead   Description
---   ------      ---------   ----------------    -----------
0     None            1       1x (no redundancy)  Dev/test only
1     RAID-1          3       2x                  Default. Mirror to 2 hosts.
1     RAID-5          4       1.33x               Erasure coding (3+1)
2     RAID-1          5       3x                  Triple mirror.
2     RAID-6          6       1.5x                Erasure coding (4+2)
3     RAID-1          7       4x                  Quadruple mirror. Rare.

Example: 1 TB VMDK with different policies
-------------------------------------------
FTT=1, RAID-1:  2 TB consumed (1 TB x 2 replicas)
FTT=1, RAID-5:  1.33 TB consumed (1 TB data + 0.33 TB parity)
FTT=2, RAID-1:  3 TB consumed (1 TB x 3 replicas)
FTT=2, RAID-6:  1.5 TB consumed (1 TB data + 0.5 TB parity)

For a financial enterprise: FTT=1 RAID-1 is the minimum for production.
FTT=2 recommended for data that must survive double-failure scenarios.
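
The table maps directly onto a lookup of capacity multipliers and minimum host counts. A sketch reproducing the 1 TB VMDK examples above:

```python
# Capacity multiplier and minimum host count per (FTT, method) row above
OVERHEAD = {
    (0, "None"):   (1.0, 1),
    (1, "RAID-1"): (2.0, 3),
    (1, "RAID-5"): (4 / 3, 4),
    (2, "RAID-1"): (3.0, 5),
    (2, "RAID-6"): (1.5, 6),
    (3, "RAID-1"): (4.0, 7),
}

def consumed_tb(vmdk_tb: float, ftt: int, method: str) -> float:
    """Datastore capacity consumed by a VMDK under the given policy."""
    multiplier, _min_hosts = OVERHEAD[(ftt, method)]
    return vmdk_tb * multiplier

def min_hosts(ftt: int, method: str) -> int:
    return OVERHEAD[(ftt, method)][1]
```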

CLOM Placement Decisions

CLOM is the brain of vSAN storage policy enforcement. When a VM is created or a policy is applied, CLOM calculates where to place each component across the cluster.

CLOM Placement Algorithm (simplified)
========================================

Input:
  - SPBM policy (e.g., FTT=1, RAID-1, stripe width=1)
  - Cluster topology (hosts, disk groups, fault domains)
  - Current capacity utilization per host/disk group
  - Current component count per host (balance target)

Processing:
  1. Determine required components:
     FTT=1 RAID-1, 1 stripe -> 2 data components + 1 witness
  2. Select hosts for components:
     - Each data component on a DIFFERENT fault domain
     - Witness on a THIRD fault domain
     - Prefer hosts with most free capacity
     - Prefer hosts with fewest components (balance)
     - Respect affinity rules (if configured)
  3. Select disk group within each chosen host:
     - Prefer disk group with most free capacity
     - Balance component count across disk groups
  4. Validate:
     - Enough capacity? If not, fail (or force-provision)
     - Enough fault domains? If not, fail (or force-provision)

Output:
  Component placement map:
    Data Component 1 -> Host 02, Disk Group 1
    Data Component 2 -> Host 04, Disk Group 2
    Witness Component -> Host 01, Disk Group 1

CLOM recalculates when:
  - Policy changes (VM storage policy update)
  - Host fails or returns
  - Disk fails or is added
  - Rebalance triggered (proactive rebalancing)
  - Maintenance mode entered/exited
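
A toy model of steps 1-3, assuming hypothetical Host records. This illustrates the greedy preferences listed above (distinct fault domains, most free capacity, fewest components), not the actual clomd algorithm:

```python
from dataclasses import dataclass

@dataclass
class Host:
    name: str
    fault_domain: str
    free_tb: float
    components: int

def place(hosts: list[Host], n_data: int, size_tb: float) -> tuple[list[str], str]:
    """Greedy placement sketch: one data component per fault domain,
    preferring free capacity then balance; witness in a further domain."""
    ranked = sorted(hosts, key=lambda h: (-h.free_tb, h.components))
    chosen: list[Host] = []
    used_fd: set[str] = set()
    for h in ranked:
        if h.fault_domain in used_fd or h.free_tb < size_tb:
            continue
        chosen.append(h)
        used_fd.add(h.fault_domain)
        if len(chosen) == n_data:
            break
    if len(chosen) < n_data:
        raise RuntimeError("not enough fault domains or capacity")
    witness = next((h for h in ranked if h.fault_domain not in used_fd), None)
    if witness is None:
        raise RuntimeError("no fault domain left for the witness")
    return [h.name for h in chosen], witness.name
```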

Erasure Coding on vSAN

vSAN supports RAID-5 (FTT=1, 3 data + 1 parity) and RAID-6 (FTT=2, 4 data + 2 parity) as erasure coding alternatives to mirroring. This saves significant capacity but has performance trade-offs.

RAID-5 on vSAN (FTT=1)
=========================

Object: 1 TB VMDK

  Host 1        Host 2        Host 3        Host 4
  +-------+     +-------+     +-------+     +-------+
  | D0    |     | D1    |     | D2    |     | P     |
  | 333GB |     | 333GB |     | 333GB |     | 333GB |
  +-------+     +-------+     +-------+     +-------+

  Total consumed: 1.33 TB (vs. 2 TB for RAID-1)
  Savings: 33%

Write penalty:
  - Full stripe write: 4 I/O ops (write D0, D1, D2, P) -- efficient
  - Partial stripe write: read-modify-write cycle:
    1. Read old data chunk
    2. Read old parity chunk
    3. Compute new parity
    4. Write new data chunk + new parity chunk
    = 4 I/O ops for 1 application write
  - Write latency: 2-4x higher than RAID-1 for small random writes
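
The penalty arithmetic above can be captured in one expression. A sketch for the 3+1 layout, parameterized by the fraction of writes that are partial-stripe (function name is illustrative):

```python
def raid5_io_amplification(partial_fraction: float) -> float:
    """Backend I/Os per application write on a 3+1 RAID-5 layout.

    Partial-stripe: read old data + read old parity + write data
    + write parity = 4 I/Os for 1 write.
    Full-stripe: 4 device writes carry 3 chunks of new data = 4/3 per write.
    """
    return partial_fraction * 4 + (1 - partial_fraction) * (4 / 3)
```

For a pure small-random-write workload this yields 4x backend I/O per write, consistent with the 2-4x latency penalty noted above; a pure full-stripe workload drops to ~1.33x.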

RAID-6 on vSAN (FTT=2)
=========================

  Host 1   Host 2   Host 3   Host 4   Host 5   Host 6
  +-----+  +-----+  +-----+  +-----+  +-----+  +-----+
  | D0  |  | D1  |  | D2  |  | D3  |  | P   |  | Q   |
  | 250G|  | 250G|  | 250G|  | 250G|  | 250G|  | 250G|
  +-----+  +-----+  +-----+  +-----+  +-----+  +-----+

  Total consumed: 1.5 TB (vs. 3 TB for triple mirror FTT=2 RAID-1)
  Savings: 50%

When to use erasure coding on vSAN:
  - Read-heavy workloads (no write penalty on reads)
  - Large sequential writes (full-stripe writes are efficient)
  - Capacity-constrained environments
  - Warm/cold data (templates, archives, compliance stores)

When NOT to use erasure coding:
  - OLTP databases (small random writes, latency-sensitive)
  - VDI boot storms (high write amplification)
  - Stretched clusters (not supported with erasure coding)

4. vSAN File System and Object Structure

vSAN does not use VMFS. It has its own on-disk format called VSAN-FS (or vSAN on-disk format). Understanding the object structure is important because it determines how failures affect individual VMs and how capacity is consumed.

Object Hierarchy

Every entity on vSAN is an object. A single VM consists of multiple objects, and each object is decomposed into components spread across the cluster.

VM Object Structure
=====================

VM: db-prod-01
  |
  +-- VM Home Namespace Object
  |     (VMX file, log files, snapshots metadata)
  |     Size: small (few MB)
  |     Components: 2 data + 1 witness (FTT=1 RAID-1)
  |
  +-- VMDK Object (boot disk, 100 GB)
  |     |
  |     +-- Component 1 (data, on Host 02, DG 1)
  |     |     50 GB (or less, depending on stripe width)
  |     +-- Component 2 (data, on Host 04, DG 2)  -- mirror of C1
  |     |     50 GB
  |     +-- Component 3 (witness, on Host 01, DG 1)
  |           ~2 MB (metadata only, votes in quorum)
  |
  +-- VMDK Object (data disk, 500 GB)
  |     |
  |     +-- Striped? If stripe width=2 and FTT=1 RAID-1:
  |     |
  |     |   RAID-1 tree:
  |     |     Mirror Leg A (stripe width=2):
  |     |       Component A1 (Host 02, DG 1) -- 250 GB
  |     |       Component A2 (Host 02, DG 2) -- 250 GB
  |     |     Mirror Leg B (stripe width=2):
  |     |       Component B1 (Host 04, DG 1) -- 250 GB
  |     |       Component B2 (Host 04, DG 2) -- 250 GB
  |     |     Witness (Host 01) -- ~2 MB
  |     |
  |     +-- Total components: 4 data + 1 witness
  |
  +-- Swap Object (VM swap file, = RAM size)
  |     Components: typically FTT=1 RAID-1 (thin-provisioned)
  |
  +-- Snapshot Delta Objects (if snapshots exist)
        Each snapshot creates additional delta VMDK objects

Total for this VM: ~5 objects decomposed into 12-15+ components spread across the cluster

Component Size Limits

vSAN has a maximum component size of 255 GB. Objects larger than 255 GB are automatically split into multiple components.

Large VMDK Decomposition
==========================

VMDK: 1 TB, FTT=1 RAID-1, stripe width=1

  Since 1 TB > 255 GB, vSAN splits into segments:
    Segment 1: 255 GB
    Segment 2: 255 GB
    Segment 3: 255 GB
    Segment 4: 235 GB  (remainder)

  Each segment is independently mirrored:
    Segment 1: Component on Host A + Component on Host B + Witness
    Segment 2: Component on Host C + Component on Host D + Witness
    Segment 3: Component on Host A + Component on Host D + Witness
    Segment 4: Component on Host B + Component on Host C + Witness

  Total components: 4 segments x (2 data + 1 witness) = 12 components

  Impact: A 1 TB VMDK with FTT=1 creates 12 components.
  A 5,000-VM cluster with average 500 GB per VM =
    ~2,500,000 GB / 255 GB per segment = ~10,000 segments
    x 3 components each = ~30,000 components (data + witness)
    Plus VM home, swap, snapshots = 50,000-100,000+ total components

  vSAN scaling limit: ~9,000 components per host (check release notes)
  At 64 hosts: ~576,000 components cluster-wide -- sufficient for 5,000 VMs
  At 32 hosts: ~288,000 -- potentially tight, monitor component count
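
The decomposition arithmetic above can be sketched as follows (FTT=1 RAID-1, stripe width 1; helper name hypothetical):

```python
import math

COMPONENT_MAX_GB = 255  # maximum vSAN component size

def vmdk_components(size_gb: float, replicas: int = 2,
                    witnesses_per_segment: int = 1) -> int:
    """Components for one VMDK: each <= 255 GB segment is mirrored
    across `replicas` hosts and gets its own witness component."""
    segments = math.ceil(size_gb / COMPONENT_MAX_GB)
    return segments * (replicas + witnesses_per_segment)

# 1 TB VMDK -> 4 segments x 3 = 12 components, as in the example above
boot_disk = vmdk_components(100)      # 3 components
cluster_estimate = 5000 * vmdk_components(500)
```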

On-Disk Format (VSAN-FS)

vSAN On-Disk Format v5+ (vSAN 7.0+)
=======================================

Capacity Device Layout:
+------------------------------------------------------------+
| Partition 1: Metadata (small, fixed size)                   |
|   - Device UUID, disk group membership                      |
|   - Component table (which components live here)            |
|   - Bitmap of allocated/free blocks                         |
+------------------------------------------------------------+
| Partition 2: Data (remainder of device)                     |
|   - Fixed-size allocation units (1 MiB)                     |
|   - Log-structured writes from LSOM destage                 |
|   - Checksum interleaved with data blocks                   |
|   - Dedup hash table (if dedup enabled)                     |
|   - Compression metadata (if compression enabled)           |
+------------------------------------------------------------+

Cache Device Layout:
+------------------------------------------------------------+
| Write Buffer Log (log-structured)                           |
|   - Sequential append of incoming writes                    |
|   - Each entry: component UUID + offset + data + checksum   |
|   - Entries destaged to capacity tier, then reclaimed        |
+------------------------------------------------------------+

Key characteristics:
  - Block size: 1 MiB allocation unit on capacity tier
  - Thin by default: unwritten blocks consume no space
  - Checksum: CRC-32C on every block (end-to-end data integrity)
  - On-disk format is proprietary -- no third-party tools can read it
  - Disk replacement: new disk auto-formats with VSAN-FS on claim

5. Networking Requirements

vSAN places specific requirements on the network that differ from general VM traffic networking. Misconfigured vSAN networking is a leading cause of performance problems and split-brain scenarios.

Dedicated vmknic

ESXi Host Network Configuration for vSAN
===========================================

Physical NICs (typical):
  +----------+  +----------+  +----------+  +----------+
  | NIC 0    |  | NIC 1    |  | NIC 2    |  | NIC 3    |
  | 25 GbE   |  | 25 GbE   |  | 25 GbE   |  | 25 GbE   |
  +----------+  +----------+  +----------+  +----------+
       |              |              |              |
  +----+--------------+----+   +----+--------------+----+
  | vSwitch0 / VDS         |   | vSwitch1 / VDS         |
  | (Management + vMotion) |   | (vSAN traffic)         |
  +------------------------+   +------------------------+
       |                              |
  vmk0 (Management)             vmk1 (vSAN)
  10.0.1.x/24                   10.0.10.x/24
       |                              |
  vmk2 (vMotion)                 VLAN 10 (dedicated)
  10.0.2.x/24                   MTU 9000 (jumbo frames)

Requirements:
  - Dedicated vmknic (vmk1) tagged for vSAN traffic
  - Minimum 10 GbE (25 GbE recommended, 100 GbE for large clusters)
  - Jumbo frames (MTU 9000) -- reduces CPU overhead by ~30%
  - Dedicated VLAN -- isolate vSAN traffic from VM and management
  - NIC teaming for redundancy (active-active or active-standby)
  - Layer 2 adjacency between all hosts in the vSAN cluster
  - Latency between hosts: < 1 ms (same rack/site)
    Stretched cluster: < 5 ms RTT between sites (< 1 ms recommended)
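
As a sketch, standing up the vSAN vmknic described above maps to an esxcli sequence like the following; the portgroup name, IP address, and host numbering are placeholders for this environment:

```shell
# Create the vSAN vmknic on the dedicated portgroup (name is a placeholder)
esxcli network ip interface add --interface-name=vmk1 --portgroup-name=vSAN-PG

# Static address on the dedicated vSAN subnet (10.0.10.x/24 per host)
esxcli network ip interface ipv4 set -i vmk1 -t static \
    -I 10.0.10.11 -N 255.255.255.0

# Jumbo frames on the vSAN interface
esxcli network ip interface set -i vmk1 -m 9000

# Tag vmk1 for vSAN traffic, then verify
esxcli vsan network ip add -i vmk1
esxcli vsan network list
```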

Multicast to Unicast Transition

This is historically significant and still relevant for environments that have not upgraded.

vSAN Network Protocol Evolution
==================================

vSAN 6.5 and earlier:
  CMMDS used MULTICAST for cluster membership and metadata exchange
  - Required IGMP snooping on all switches in the vSAN VLAN
  - Required multicast groups configured end-to-end
  - Common failure: misconfigured IGMP snooping causes cluster partition
  - Troubleshooting nightmare in large L2 domains
  - Multicast addresses: 224.1.2.3 and 224.2.3.4 (default)

vSAN 6.6 and later:
  CMMDS switched to UNICAST
  - No multicast requirements
  - Each host communicates directly with all other hosts
  - CMMDS master elected per cluster (handles directory)
  - Significantly simplifies network configuration
  - Eliminates the #1 vSAN networking troubleshooting issue

Migration impact:
  If the current environment is on vSAN 6.5 or earlier: multicast dependencies exist
  If on vSAN 6.6+: unicast, no multicast concerns
  For target platform comparison: Ceph uses unicast, S2D uses SMB Direct
  (unicast). None of the candidates require multicast.

RDMA Support

vSAN RDMA Support (vSAN 7.0 U1+)
====================================

vSAN supports RDMA via RoCE v2 (RDMA over Converged Ethernet v2)
  - Bypasses TCP/IP stack entirely for vSAN data traffic
  - Reduces latency by ~30-50% (eliminates kernel network stack overhead)
  - Reduces CPU utilization for storage I/O by ~20-40%

Requirements:
  - RoCE v2 capable NICs (Mellanox ConnectX-5/6/7, Broadcom P2100G)
  - Lossless Ethernet (PFC -- Priority Flow Control, ECN enabled)
  - DCB (Data Center Bridging) configured on switches
  - Dedicated traffic class for vSAN RDMA

Typical latency impact:
  Without RDMA (TCP/IP):   200-400 us per remote I/O
  With RDMA (RoCE v2):     100-200 us per remote I/O

Comparison to candidates:
  - S2D: native SMB Direct / RDMA support (mature, well-integrated)
  - Ceph/ODF: RDMA support via msgr v2 protocol (experimental, not
    yet production-grade in ODF as of 2025)
  - vSAN: RDMA via RoCE v2 (production-supported since 7.0 U1)

6. Capacity Management

Deduplication and Compression

vSAN supports inline deduplication and compression, enabled together as a single cluster-wide setting. This is a significant capacity optimization but carries CPU and latency overhead.

vSAN Dedup + Compression Pipeline
====================================

Write path with dedup+compression enabled:
  1. Data written to write buffer (cache tier) -- unmodified
  2. During destage from cache to capacity tier:
     a. Data block hashed (SHA-1 for dedup fingerprint)
     b. Hash compared against dedup table
        - Match: increment refcount, skip write (dedup hit)
        - No match: compress block (LZ4 algorithm)
          - Compressed block written to capacity tier
          - Hash + location recorded in dedup table
  3. Dedup table stored on capacity tier (memory-mapped for speed)

Performance impact:
  - Dedup hash computation: ~5-15 us per 4 KiB block (CPU bound)
  - Compression (LZ4): ~2-10 us per block (very fast)
  - Combined overhead on write path: minimal (happens during destage,
    not on the synchronous write path)
  - Memory overhead: dedup hash table uses ~1-2 GB RAM per TB of
    deduplicated data on the host

Dedup/compression ratios (typical for enterprise VMs):
  - Windows VMs (similar OS installs): 1.5-2.5x dedup ratio
  - Linux VMs (similar OS installs): 1.3-2.0x dedup ratio
  - Database volumes: 1.0-1.3x (already unique data, poor dedup)
  - VDI desktops: 2.0-5.0x (many identical OS images)
  - Overall cluster average: 1.5-2.5x combined savings

Important limitations:
  - Dedup+compression is a single cluster-wide setting (vSAN 7.0 U1+
    adds a compression-only mode; dedup cannot be enabled alone)
  - Applies per disk group, not per VM or per VMDK
  - Requires all-flash configuration (not available in hybrid)
  - Cannot be enabled on existing data retroactively without
    a full data evacuation and re-ingestion
  - Dedup table memory consumption scales with data volume
    (plan ~1.5 GB RAM per TB of unique data)
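
The memory-planning rule of thumb above, as a pair of one-liners. The 1.5 GB/TB default is the planning figure from the list, not a measured constant; function names are illustrative:

```python
def dedup_ram_gb(unique_data_tb: float, gb_per_tb: float = 1.5) -> float:
    """Approximate per-host dedup hash-table RAM for a given amount
    of unique deduplicated data."""
    return unique_data_tb * gb_per_tb

def effective_capacity_tb(usable_tb: float, dedup_ratio: float) -> float:
    """Usable capacity scaled by the observed dedup+compression ratio."""
    return usable_tb * dedup_ratio
```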

TRIM/UNMAP Processing

vSAN TRIM/UNMAP Flow
=======================

Guest VM: fstrim /mountpoint (or continuous discard via mount -o discard)
       |
       v
Guest kernel: issues SCSI UNMAP to virtual disk
       |
       v
ESXi vmkernel: translates to vSAN UNMAP on the object
       |
       v
DOM: propagates UNMAP to all component replicas
       |
       v
LSOM (each host with a component):
  - Marks blocks as free in allocation bitmap
  - Freed blocks available for reuse immediately
  - Physical space returned to disk group capacity pool

Automatic reclamation:
  - vSAN 6.7+: automatic UNMAP processing (no manual trigger needed)
  - Processing rate throttled to avoid impacting production I/O
  - Reclamation is asynchronous -- space may not appear freed immediately
  - Monitor via: vSAN capacity overview in vSphere Client

SSD TRIM passthrough:
  - vSAN does NOT pass TRIM to the underlying SSD firmware
  - SSD garbage collection handles wear leveling independently
  - vSAN manages its own free space tracking at the VSAN-FS level

Overhead Calculations

Understanding the real usable capacity of a vSAN cluster requires accounting for multiple overhead layers.

vSAN Capacity Overhead Calculation
=====================================

Raw capacity example: 16 hosts x 6 SSDs x 3.84 TB = 368.64 TB raw

Subtract: Cache tier overhead
  16 hosts x 2 disk groups x 1 NVMe cache device (1.6 TB each)
  = 51.2 TB reserved for cache (NOT usable for data)
  Remaining: 368.64 - 51.2 = 317.44 TB

Subtract: vSAN metadata overhead (~1-2% of capacity)
  ~3.17 - 6.35 TB
  Remaining: ~311 - 314 TB

Subtract: FTT overhead
  FTT=1 RAID-1:  314 / 2 = 157 TB usable
  FTT=1 RAID-5:  314 / 1.33 = 236 TB usable
  FTT=2 RAID-1:  314 / 3 = 105 TB usable
  FTT=2 RAID-6:  314 / 1.5 = 209 TB usable

Subtract: Slack space (25-30% recommended free)
  vSAN requires ~25-30% free capacity for:
    - Rebuild headroom (if a host fails, data must be rebuilt elsewhere)
    - Rebalancing operations
    - Write buffer destage efficiency (degrades when capacity > 80%)
    - Maintenance mode operations (data evacuation)

  FTT=1 RAID-1 example: 157 TB x 0.70 = 110 TB truly usable
  FTT=1 RAID-5 example: 236 TB x 0.70 = 165 TB truly usable

Add back: Dedup/compression savings (data reduction applied last):
  If dedup+compression ratio = 2x:
  FTT=1 RAID-1: 110 TB x 2 = 220 TB effective
  FTT=1 RAID-5: 165 TB x 2 = 330 TB effective

Summary (16 hosts x 6 x 3.84 TB SSDs, 2 cache devices each):
+--------------------+--------+-------------+------------------+
| Configuration      | Usable | After Slack | With 2x Dedup    |
+--------------------+--------+-------------+------------------+
| FTT=1 RAID-1       | 157 TB | 110 TB      | 220 TB effective |
| FTT=1 RAID-5       | 236 TB | 165 TB      | 330 TB effective |
| FTT=2 RAID-1       | 105 TB |  73 TB      | 147 TB effective |
| FTT=2 RAID-6       | 209 TB | 146 TB      | 293 TB effective |
+--------------------+--------+-------------+------------------+
| Raw total:         | 369 TB (including cache tier)            |
| Efficiency (RAID-1)| 110/369 = 29.8% (without dedup)         |
| Efficiency (RAID-5)| 165/369 = 44.7% (without dedup)         |
+--------------------+--------+-------------+------------------+

Key insight: raw-to-usable efficiency is 30-45% for production
configurations. This is the number to compare against Ceph/ODF
and S2D, not the raw capacity number vendors quote.
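The waterfall above can be expressed as a small helper for sizing comparisons across candidate platforms. A minimal Python sketch; the function name and the default metadata/slack fractions are illustrative choices drawn from this section, not vSAN constants:

```python
def usable_capacity_tb(hosts, devices_per_host, device_tb,
                       cache_tb_per_host, ftt_factor,
                       metadata_frac=0.015, slack_frac=0.30,
                       data_reduction=1.0):
    """Walk raw capacity down to effective usable capacity.

    ftt_factor: space multiplier of the protection scheme
      (RAID-1 FTT=1 -> 2.0, RAID-5 FTT=1 -> 1.33,
       RAID-1 FTT=2 -> 3.0, RAID-6 FTT=2 -> 1.5)
    data_reduction: dedup+compression ratio (1.0 = none)
    """
    raw = hosts * devices_per_host * device_tb
    after_cache = raw - hosts * cache_tb_per_host      # cache tier is not usable
    after_meta = after_cache * (1 - metadata_frac)     # ~1-2% vSAN metadata
    after_ftt = after_meta / ftt_factor                # replication/EC overhead
    after_slack = after_ftt * (1 - slack_frac)         # rebuild/rebalance headroom
    return after_slack * data_reduction                # effective, with reduction

# 16 hosts x 6 x 3.84 TB SSDs, 2 x 1.6 TB cache devices per host
base = dict(hosts=16, devices_per_host=6, device_tb=3.84, cache_tb_per_host=3.2)
print(round(usable_capacity_tb(ftt_factor=2.0, **base)))   # ~109 (table: 110 TB)
print(round(usable_capacity_tb(ftt_factor=1.33, **base)))  # ~165 TB
```

With the example's rounding this lands within a TB or two of the summary table above; the small differences come from where the ~1-2% metadata estimate is applied.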

7. Failure Handling

Failure handling is the most critical area for a financial enterprise. Understanding exactly what happens when hardware fails -- and how long recovery takes -- determines whether the platform meets availability requirements.

Component States

vSAN Component State Machine
===============================

  +----------+
  | Active   |  <-- Normal state, component healthy and accessible
  +----------+
       |
       |  The failure type determines the next state. In both branches
       |  the object itself remains available as long as quorum is
       |  maintained via the remaining components.
       |
       |  Permanent failure                Transient failure
       |  (disk dead, I/O errors)          (host down, network partition)
       v                                   v
  +----------+                        +----------+
  | Degraded |                        | Absent   |
  +----------+                        +----------+
       |                                   |
       | Rebuild starts immediately        | 60-minute repair delay
       | (a failed device will             | timer starts
       | not come back)                    |
       |                                   +--> Returns within 60 min?
       |                                   |      YES -> delta resync,
       |                                   |             back to Active
       |                                   |      NO  -> full rebuild
       v                                   v
  +-----------+
  | Rebuilding|  <-- New component being created on another host/DG
  +-----------+
       |
       | (data copied from surviving replica)
       v
  +----------+
  | Active   |  <-- New component online, object fully protected again
  +----------+

Failure Scenarios

Scenario 1: Single Disk Failure
==================================

Event: One capacity SSD fails in Host 03, Disk Group 1

Impact:
  - All components on that specific device become Degraded/Absent
  - Objects with replicas on other hosts/DGs remain accessible
  - Objects lose one level of fault tolerance
    (e.g., FTT=1 objects have 0 remaining tolerance until rebuild)

Response:
  1. vSAN detects the device failure via SMART/health checks (~seconds)
  2. Components on the failed disk are marked Degraded immediately (the
     60-min repair delay applies only to Absent components, e.g. host
     failures; a dead disk will not return, so rebuild starts at once)
  3. CLOM calculates new placement for each affected component
  4. Rebuild starts: data read from surviving replicas, written to
     new component on a healthy disk group
  5. Rebuild rate: ~100-200 MB/s per component (throttled to avoid
     saturating network/disk bandwidth)

Timeline:
  Detection:           ~5-30 seconds
  Rebuild start:       ~1-5 minutes (CLOM calculation time)
  Rebuild completion:  depends on data volume
    100 GB of components: ~10-20 minutes
    1 TB of components:   ~90-180 minutes
    10 TB of components:  ~15-30 hours

During rebuild:
  - VM I/O continues normally (reads from surviving replicas)
  - Write latency may increase ~10-20% (network/disk contention)
  - A second disk failure in the same fault domain during rebuild
    can cause data loss for objects that lost both copies

Scenario 2: Single Host Failure
===================================

Event: Host 03 powers off unexpectedly (hardware failure)

Impact:
  - ALL components on ALL disk groups on Host 03 become Degraded
  - VMs running on Host 03: HA restarts them on other hosts
  - VMs on other hosts with components on Host 03:
    still accessible (read from other replicas)

Response:
  1. CMMDS detects host absence via heartbeat (~5-15 seconds)
  2. Components marked Absent (NOT Degraded -- the host may return)
  3. 60-MINUTE REPAIR DELAY TIMER starts
     Why? The host might be rebooting. If it comes back within 60 min,
     only a delta resync is needed (much faster than a full rebuild).
  4. If host returns within 60 min:
     - Stale components catch up via delta resync (only the writes
       missed while the host was down are copied over)
     - Minimal data movement
  5. If host does NOT return within 60 min:
     - Full rebuild initiated on other hosts (same process as a disk
       failure, but usually much more data)

Timeline:
  Host with 30 TB of components:
    Wait period:     60 minutes (configurable via advanced setting)
    Rebuild start:   60 min + 1-5 min (CLOM calculation)
    Rebuild complete: ~3-8 hours (30 TB at cluster rebuild bandwidth)
    Total time at reduced redundancy: ~4-9 hours

  The 60-minute timer is configurable:
    VSAN.ClomRepairDelay (default: 60 minutes)
    For financial workloads, consider reducing to 30 minutes
    Trade-off: shorter delay = faster rebuild but more unnecessary
    rebuilds for transient failures (e.g., host reboot for patching)

Scenario 3: Network Partition (Split-Brain)
=============================================

Event: Network switch failure partitions the cluster into two groups

  Partition A: Host 01, Host 02     (2 hosts)
  Partition B: Host 03, Host 04     (2 hosts)

Impact depends on component placement:
  - Objects with components only in Partition A: accessible from A
  - Objects with components only in Partition B: accessible from B
  - Objects with components split across both:
    QUORUM VOTE determines accessibility

Quorum example (FTT=1 RAID-1):
  Object X: Data Component on Host 01, Data Component on Host 03,
            Witness on Host 02

  Partition A has: Data Component (1 vote) + Witness (1 vote) = 2 votes
  Partition B has: Data Component (1 vote) = 1 vote
  Quorum threshold: > 50% of votes = > 1.5 votes = need 2

  Result: Object X accessible in Partition A, INACCESSIBLE in Partition B

  If a VM on Host 03 needs Object X:
    VM stalls (I/O hangs) until partition heals or VM is restarted
    on a host in Partition A

Prevention:
  - Redundant network paths (NIC teaming, dual switches)
  - vSAN fault domains aligned with network failure domains
  - Stretched cluster with witness in third failure domain
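The quorum rule in the example reduces to a strict-majority check. A minimal Python sketch (the vote layout mirrors Object X above; the function is illustrative, not a vSAN API):

```python
def partition_has_quorum(votes_in_partition, total_votes):
    """An object is accessible in a partition only if that partition
    holds a strict majority (> 50%) of the object's votes."""
    return votes_in_partition * 2 > total_votes

# Object X (FTT=1 RAID-1): data on Host 01, data on Host 03, witness on Host 02
votes = {"host01": 1, "host03": 1, "host02_witness": 1}
total = sum(votes.values())

partition_a = votes["host01"] + votes["host02_witness"]  # 2 votes
partition_b = votes["host03"]                            # 1 vote

print(partition_has_quorum(partition_a, total))  # True  -> accessible in A
print(partition_has_quorum(partition_b, total))  # False -> inaccessible in B
```

The strict majority (not >=) is why the witness exists at all: with an even vote split, neither side could serve I/O, so every vSAN object carries an odd total vote count.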

Resync and Rebuild Throttling

vSAN Rebuild Bandwidth Management
====================================

vSAN throttles rebuild/resync I/O to protect production workload
performance. This is a critical trade-off:

  Too aggressive rebuild = production VMs experience latency spikes
  Too conservative rebuild = cluster stays at reduced redundancy longer

Throttling parameters:
  VSAN.ResyncThrottleThreshold     Default: 70 (percentage)
    If host I/O utilization > 70%, rebuild I/O is throttled

  VSAN.ResyncIoSize                Default: 256 KiB
    I/O size for resync operations

  VSAN.ResyncEtaPcnt               (monitoring only)
    Estimated time remaining for resync

Monitoring rebuild progress:
  vSphere Client: Monitor > vSAN > Resyncing Components
    Shows: bytes remaining, estimated time, resync reason

  esxcli:
    esxcli vsan debug resync summary
    esxcli vsan debug object health summary

Rebuild bandwidth (typical, all-flash, 25 GbE):
  Per-host rebuild read:   200-500 MB/s
  Per-host rebuild write:  200-500 MB/s
  Network impact:          5-20% of available bandwidth during rebuild
  Production latency impact: 10-30% higher during active rebuild

  Cluster-wide rebuild bandwidth: parallelized across all participating
  hosts. A 30 TB rebuild across 15 remaining hosts:
    Each host contributes ~2 TB of reads and ~2 TB of writes
    At 300 MB/s per host: ~7,000 seconds = ~2 hours
    With throttling (50% duty cycle): ~4 hours
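The back-of-envelope in the last paragraph generalizes to a small estimator. A Python sketch; the 300 MB/s per-host rate and the duty-cycle model are the assumptions stated above, not measured values:

```python
def rebuild_hours(data_tb, remaining_hosts, per_host_mb_s=300.0,
                  duty_cycle=1.0):
    """Estimate rebuild wall-clock time when the copy work is spread
    evenly across the surviving hosts.

    duty_cycle: fraction of time rebuild I/O actually runs once
    throttling backs it off (1.0 = unthrottled, 0.5 = 50% duty cycle).
    """
    per_host_tb = data_tb / remaining_hosts            # each host's share
    seconds = per_host_tb * 1_000_000 / per_host_mb_s  # TB -> MB at MB/s
    return seconds / duty_cycle / 3600

print(round(rebuild_hours(30, 15), 1))                  # ~1.9 h (doc: ~2 hours)
print(round(rebuild_hours(30, 15, duty_cycle=0.5), 1))  # ~3.7 h (doc: ~4 hours)
```

The same estimator applies to the host-failure timeline in Scenario 2: add the 60-minute repair delay on top of the rebuild time to get total time at reduced redundancy.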

Data Evacuation Modes

When a host enters maintenance mode (for patching, hardware replacement, etc.), vSAN must handle the data on that host. Three modes are available.

Maintenance Mode Data Evacuation Options
===========================================

Option 1: "Full data migration" (safest, slowest)
  - ALL components on the host are rebuilt on other hosts
  - Host can be down indefinitely
  - Requires enough free capacity on remaining hosts
  - Time: hours (proportional to data on host)
  - Use when: hardware replacement, decommissioning host

Option 2: "Ensure accessibility" (default, balanced)
  - Only components that would lose quorum are migrated
  - Objects that still have quorum without this host: no action
  - Much faster than full migration
  - Risk: if a second host fails during maintenance, some objects
    may become inaccessible
  - Time: minutes to tens of minutes
  - Use when: routine patching, short maintenance windows

Option 3: "No data migration" (fastest, riskiest)
  - No data movement at all
  - Components on the host become Degraded while host is in MM
  - If another failure occurs, data loss possible
  - Time: immediate
  - Use when: emergency maintenance where speed is critical
  - NOT recommended for production financial workloads

Capacity planning for maintenance:
  To support "full data migration" for 1 host at a time:
    Free capacity needed >= data on the largest host
    In a 16-host cluster with 20 TB of data per host:
      Need >= 20 TB free across the remaining hosts; the recommended
      25-30% slack space comfortably covers this

  To support "full data migration" for 2 hosts simultaneously
  (rolling upgrade scenario):
    Need >= 40 TB free, plus headroom for any rebuilds running at the
    same time -- another reason vSAN clusters should not exceed ~70%
    utilization
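The headroom check above can be stated as a one-line rule. A Python sketch using the example's numbers (20 TB of data per host, 25% of the cluster's 320 TB kept free -- illustrative values):

```python
def mm_headroom_ok(free_tb, largest_host_tb, simultaneous_hosts=1):
    """Check whether 'full data migration' is possible for N hosts at
    once: the cluster must absorb all data from the evacuated hosts."""
    return free_tb >= largest_host_tb * simultaneous_hosts

free = 320 * 0.25  # 80 TB free in the 16-host example
print(mm_headroom_ok(free, 20))                        # True: single host OK
print(mm_headroom_ok(free, 20, simultaneous_hosts=2))  # True: rolling upgrade OK
```

In practice the check should also subtract any rebuild traffic in flight, which is why the utilization ceiling matters more than the raw free-space number.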

8. Performance Characteristics

Cache Hit Ratios

vSAN Cache Behavior by Configuration
=======================================

All-Flash Configuration (standard for new deployments):
  Cache tier = WRITE BUFFER ONLY
  - No read cache (reads go directly to capacity flash)
  - "Cache hit ratio" concept does not apply to reads
  - Write buffer hit: if a read targets data still in the write
    buffer (recently written), it is served from cache
    Typical write buffer read hit rate: 5-15% (very workload dependent)

Hybrid Configuration (legacy, SSD cache + HDD capacity):
  Cache tier = 70% READ CACHE + 30% WRITE BUFFER
  - Read cache hit ratio determines HDD avoidance
  - Target: > 90% read cache hit rate
  - Below 90%: HDD I/O dominates, latency spikes to 5-15 ms
  - Cache sizing rule: cache should be >= 10% of working set size
    (not total data size -- working set is the frequently accessed portion)

  Cache hit ratio monitoring:
    vSAN Performance Service > Cache > "Read Cache Hit Rate"
    esxcli vsan debug disk stats

  Example:
    Working set: 200 GB (active data across all VMs on this host)
    Cache read area: 560 GB (70% of 800 GB cache device)
    Cache coverage: 560/200 = 2.8x the working set -> 95%+ hit rate

    If the working set grows to 800 GB:
    Coverage drops: 560/800 = 0.7x -> ~70% hit rate
    Result: 30% of reads hit HDD -> average latency jumps from
    ~500 us to ~5 ms -> VM performance degrades significantly
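The working-set arithmetic above amounts to a coverage ratio. A rough Python model; the 70% read-area split is the hybrid default described above, and the linear fall-through approximation is a simplification (real hit rates depend on access skew):

```python
def cache_coverage(working_set_gb, cache_device_gb, read_area_frac=0.70):
    """Fraction of the working set that fits in the read-cache area.

    Coverage >= 1.0: the whole working set is cacheable (hit rate then
    limited only by churn). Below 1.0, roughly (1 - coverage) of reads
    fall through to HDD -- a crude linear approximation.
    """
    read_area = cache_device_gb * read_area_frac
    return min(read_area / working_set_gb, 1.0)

print(cache_coverage(200, 800))  # 1.0 -> working set fully cacheable
print(cache_coverage(800, 800))  # 0.7 -> ~30% of reads hit HDD
```

This is the model to apply per host, since each disk group's cache only serves the components placed on that host.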

Write Buffer Behavior

Write Buffer Performance Model
=================================

Write buffer capacity: determined by cache device size
  All-flash: 100% of cache device = write buffer
  Typical: 800 GB - 1.6 TB NVMe cache device

Write buffer fill dynamics:
  - Incoming write rate: W MB/s (application writes to this host)
  - Destage rate: D MB/s (background flush to capacity tier)
  - Net fill rate: W - D MB/s

  If W > D sustained: write buffer fills up
  If W < D sustained: write buffer drains, steady state

Destage performance:
  Sequential writes to capacity SSD: 500-2000 MB/s per device
  With 6 capacity devices: 3-12 GB/s aggregate destage bandwidth
  In practice, destage is throttled to ~30-50% of capacity device
  bandwidth to avoid starving production reads

Write buffer states:
  0-30% full:    Normal operation, no concern
  30-70% full:   Normal, destage running at standard rate
  70-80% full:   Elevated destage rate, performance monitoring advised
  80-90% full:   WARNING: destage rate increased aggressively
  90-95% full:   CRITICAL: write latency increasing (backpressure)
  95-100% full:  EMERGENCY: writes stall, VMs experience I/O hangs
                 (this triggers vSAN alarm: "write buffer full")

  Recovery: once destage catches up, write buffer drains and
  performance normalizes. But the stall can last seconds to minutes,
  causing application timeouts.
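The fill dynamics reduce to a simple rate balance. A Python sketch estimating time-to-stall at current rates; the buffer size and rates in the example call are hypothetical:

```python
def seconds_until_full(buffer_gb, fill_pct, write_mb_s, destage_mb_s):
    """Time until the write buffer reaches 100% at the current rates.
    Returns None if destage keeps up (buffer draining or steady)."""
    net_mb_s = write_mb_s - destage_mb_s
    if net_mb_s <= 0:
        return None  # W <= D: no fill, no stall
    remaining_mb = buffer_gb * 1024 * (1 - fill_pct / 100)
    return remaining_mb / net_mb_s

# 1.6 TB buffer at 70% full, 2 GB/s incoming, 1.5 GB/s destage
t = seconds_until_full(1600, 70, 2000, 1500)
print(round(t / 60))  # ~16 minutes of headroom before writes stall
```

A sustained burst like this is exactly the situation the 80-90% warning band is meant to flag before backpressure sets in.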

Latency Profiles by Media Type

vSAN Latency Reference (approximate, per I/O operation)
==========================================================

                        4 KiB Random   4 KiB Random   Sequential
Configuration           Read           Write          Read (1 MiB)
-----------------------+-------------+-------------+--------------
All-NVMe (cache+cap)   |             |             |
  Local component       | 100-200 us  | 100-200 us  | 100-200 us
  Remote component      | 200-400 us  | 200-500 us  | 200-400 us
  With RDMA             | 100-250 us  | 150-350 us  | 150-300 us
                        |             |             |
All-Flash (NVMe cache + |             |             |
SSD capacity)           |             |             |
  Local component       | 150-300 us  | 100-200 us* | 200-400 us
  Remote component      | 300-600 us  | 200-500 us* | 300-600 us
                        |             |             |
Hybrid (SSD cache +     |             |             |
HDD capacity)           |             |             |
  Cache hit (read)      | 200-500 us  | 100-200 us* | 200-500 us
  Cache miss (HDD read) | 5-15 ms     | 100-200 us* | 5-15 ms
  Remote + HDD miss     | 8-20 ms     | 200-500 us* | 8-20 ms

  * Write latency = cache write latency (writes always go to cache first)

Latency percentiles (all-flash, FTT=1 RAID-1, 25 GbE):
  p50:    200 us
  p95:    400 us
  p99:    800 us
  p99.9:  2-5 ms   (tail latency from network/GC spikes)
  p99.99: 5-20 ms  (extreme outliers: GC, rebuild, destage contention)

Factors that increase latency:
  1. Network congestion (vSAN traffic competes with vMotion or VM traffic)
  2. Write buffer approaching full (destage contention)
  3. Active rebuild/resync (I/O contention on capacity devices)
  4. Dedup hash computation during destage (CPU contention)
  5. SSD garbage collection cycles (firmware-level, ~100-500 us spikes)
  6. Large stripe width (more hosts involved per I/O)
  7. Erasure coding writes (read-modify-write penalty)
  8. Cross-rack traffic (additional switch hops: +50-100 us)

Aggregate Cluster Performance

vSAN Cluster IOPS Estimation
===============================

Variables:
  H = number of hosts
  D = capacity devices per host
  IOPSdev = IOPS per capacity device (SSD: ~50K, NVMe: ~500K)
  FTT_factor = write amplification from replication
    RAID-1 FTT=1: 2x writes
    RAID-5 FTT=1: ~1.5-2x writes (depends on I/O size)
  Read_pct = percentage of reads in workload

Example: 16 hosts, 6 NVMe capacity devices each, FTT=1 RAID-1
  Total devices: 96
  Raw device IOPS: 96 x 500,000 = 48,000,000 (theoretical max)

  Reality check (70/30 read/write, 4K random):
    Read IOPS capacity:  96 devices x 500K x 0.70 = 33.6M reads/s
    Write IOPS capacity: 96 devices x 500K / 2 x 0.30 = 7.2M writes/s
    Effective total: ~40M IOPS (theoretical, never achieved)

  Practical achievable (with overhead, queueing, coordination):
    Cluster aggregate: 2-5M IOPS (4K random mixed)
    Per-host: 125-310K IOPS
    Per-VM average: 400-1000 IOPS (5,000 VMs)

  This aligns with VMware published benchmarks and customer reports.
  The gap between theoretical and practical is due to:
    - DOM/LSOM coordination overhead
    - Network latency for remote components
    - Write buffer destage contention
    - CLOM/CMMDS metadata operations
    - ESXi kernel scheduling overhead
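The estimation walk-through can be packaged as one function. A Python sketch; the 8% efficiency factor is an assumption chosen to land inside the 2-5M practical range quoted above, not a published constant:

```python
def effective_cluster_iops(hosts, devices_per_host, iops_per_device,
                           read_frac, write_amp, efficiency=0.08):
    """Back-of-envelope cluster IOPS: raw device IOPS, discounted for
    write amplification on the write fraction, then scaled by an
    empirical efficiency factor covering DOM/LSOM coordination, network
    hops, destage contention, and kernel scheduling overhead."""
    devices = hosts * devices_per_host
    read_iops = devices * iops_per_device * read_frac
    write_iops = devices * iops_per_device * (1 - read_frac) / write_amp
    return (read_iops + write_iops) * efficiency

# 16 hosts, 6 NVMe devices each, FTT=1 RAID-1 (2x write amp), 70/30 mix
print(f"{effective_cluster_iops(16, 6, 500_000, 0.70, 2.0):,.0f}")  # ~3.3M
```

The useful part of this exercise is not the absolute number but repeating it with each candidate platform's write amplification and efficiency assumptions, so the comparison is apples-to-apples.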

9. vSAN HCI Mesh and Disaggregated Storage

vSAN HCI Mesh (introduced in vSAN 7.0 U1) allows compute-only hosts to consume storage from storage-rich hosts in the same or a different vSAN cluster.

vSAN HCI Mesh Architecture
=============================

Standard HCI (every host has local storage):

  +----------+  +----------+  +----------+  +----------+
  | Host 01  |  | Host 02  |  | Host 03  |  | Host 04  |
  | Compute  |  | Compute  |  | Compute  |  | Compute  |
  | + Storage|  | + Storage|  | + Storage|  | + Storage|
  +----------+  +----------+  +----------+  +----------+

  Limitation: compute and storage scale together (buy more CPUs
  even if you only need more disk, or vice versa)


HCI Mesh (disaggregated):

  Cluster A (Compute-heavy)     Cluster B (Storage-heavy)
  +----------+ +----------+     +----------+ +----------+
  | Host 01  | | Host 02  |     | Host 05  | | Host 06  |
  | CPU: 128c| | CPU: 128c|     | CPU: 32c | | CPU: 32c |
  | Disk: 2TB| | Disk: 2TB|     | Disk: 50T| | Disk: 50T|
  +----------+ +----------+     +----------+ +----------+
       |            |                 |            |
       +-----+------+                 +-----+------+
             |                              |
        vSAN Network  <-------------->  vSAN Network
             |                              |
     Cluster A mounts                Cluster B serves
     Cluster B's datastore           remote storage

Benefits:
  - Scale compute independently from storage
  - Dedicated storage nodes with dense disk configurations
  - Compute nodes can be diskless (or minimal local storage)
  - Better hardware utilization (right-size each node type)

Limitations:
  - All remote I/O traverses the network (no data locality for
    VMs on compute-only nodes)
  - Latency for remote storage: 200-600 us (vs. 100-200 us local)
  - Network bandwidth becomes the bottleneck
  - Requires vSAN Enterprise or VCF licensing
  - Maximum 2 remote datastores per cluster (vSAN 8.0)

Relevance to migration:
  HCI Mesh is VMware's answer to the "we need more storage without
  buying more compute" problem. Ceph/ODF handles this natively
  (separate storage nodes are a first-class concept). S2D does NOT
  support disaggregated storage -- all nodes must have local disks.

10. Monitoring and Troubleshooting

vSAN Health Checks

vSAN Health Service (built into vCenter)
===========================================

Category              Key Checks
--------------------  ---------------------------------------------------
Cluster               Cluster health state, CMMDS membership consistency,
                      time sync across hosts (NTP), stretched cluster
                      witness connectivity

Network               vSAN vmknic configuration, MTU consistency (all hosts
                      must match), multicast/unicast connectivity, network
                      latency between hosts (threshold: <1 ms warning),
                      NIC speed consistency

Disk                  Disk health (SMART status), disk balance (capacity
                      distribution across hosts and disk groups), metadata
                      health, software state consistency

Object Health         Objects with reduced redundancy, inaccessible objects,
                      compliance status (objects not matching their SPBM
                      policy), invalid/orphaned objects

Capacity              Overall utilization, per-host utilization, thin
                      provisioning overcommit ratio, slack space sufficiency

Performance           Write buffer utilization, congestion events per host,
                      latency outliers, throughput bottlenecks

Data Integrity        Checksum errors (silent data corruption detection),
                      component CRC mismatches

Critical health checks for financial operations:
  - "Objects with reduced availability": any object with FTT=0 in degraded
    state = one failure away from data loss
  - "vSAN cluster partition": indicates network issue causing split-brain
  - "Component metadata consistency": detects CMMDS/DOM desynchronization
  - "Write buffer full": indicates imminent write stall

Performance Service

vSAN Performance Service
==========================

Enabled by default in vSAN 7.0+. Stores performance metrics in a
vSAN object on the cluster itself (self-contained, no external DB).

Key metrics available:
  Per-VM:
    - IOPS (read/write)
    - Throughput (read/write MB/s)
    - Latency (read/write, average and p95)
    - Outstanding I/O (queue depth)

  Per-Host:
    - Backend IOPS (vSAN stack to disk)
    - Congestion value (0-255, higher = more contention)
    - CPU utilization for vSAN kernel threads
    - Write buffer fill percentage
    - Destage rate (MB/s)

  Per-Disk-Group:
    - Device latency (read/write per individual disk)
    - IOPS per capacity device
    - Cache tier hit rates (hybrid only)

  Per-Cluster:
    - Aggregate IOPS/throughput/latency
    - Resync bandwidth and progress
    - Dedup/compression savings ratio
    - Component count and distribution

Retention: 90 days by default (configurable)
Overhead: ~1-2% CPU and ~50 GB storage for metric database

Export options:
  - vRealize Operations Manager (now Aria Operations)
  - Syslog/SNMP for alerting
  - vSAN API (PowerCLI: Get-VsanStat, Get-VsanDisk)
  - No native Prometheus endpoint (third-party exporters exist)

esxcli vsan Namespace

Essential esxcli vsan Commands
=================================

# Cluster status and membership
esxcli vsan cluster get
  Output: Sub-cluster Master UUID, Local Node UUID, Node Count

# Health summary
esxcli vsan health cluster list
  Output: test results per health check category

# Disk information
esxcli vsan storage list
  Output: all claimed disks, group, tier (cache/capacity), state

# Debug: object placement
esxcli vsan debug object list
  Output: object UUID, components, placement hosts, health state

# Debug: resync status
esxcli vsan debug resync summary
  Output: bytes remaining to resync, ETA, resync reason

# Debug: disk stats
esxcli vsan debug disk stats
  Output: per-disk IOPS, latency, congestion, write buffer stats

# Debug: component health
esxcli vsan debug object health summary
  Output: count of healthy, degraded, absent, inaccessible objects

# Network diagnostics
esxcli vsan network list
  Output: vSAN vmknic configuration, traffic type (vSAN, witness)

# Advanced: LSOM internals
esxcli vsan debug disk info --disk <uuid>
  Output: write buffer fill %, destage stats, component count

# Performance diagnostics
vscsiStats (command-line tool, separate from esxcli)
  - Captures per-VMDK I/O histograms (latency, IOPS, block size)
  - Essential for identifying individual VM performance issues
  - Syntax: vscsiStats -s -w <world_id>

11. Licensing and Broadcom Impact

Historical vSAN Licensing

vSAN Licensing History
========================

Pre-Broadcom (VMware era):
  vSAN Standard:    Basic features (RAID-1, 5 disk groups, no stretch)
  vSAN Advanced:    + dedup/compression, RAID-5/6, stretch cluster
  vSAN Enterprise:  + encryption, HCI Mesh, file services
  vSAN Enterprise+: + data persistence platform (cloud-native storage)

  Licensed per CPU socket, perpetual or subscription
  Typical cost: $2,500-$6,500 per socket (list, before EA discounts)

Post-Broadcom (2024+):
  VMware Cloud Foundation (VCF):
    - vSAN is ONLY available as part of VCF bundle
    - No standalone vSAN licensing for new purchases
    - VCF includes: ESXi + vCenter + vSAN + NSX + Aria
    - Licensed per core (not per socket)
    - Significant price increase for most customers
    - Minimum 16 cores per CPU counted

  VCF Pricing impact:
    Pre-Broadcom: ~$5,000 per socket (2 sockets per host)
      16-host cluster: ~$160,000 for vSAN licensing
    Post-Broadcom: ~$400-$600 per core (estimated, varies by EA)
      16-host cluster, 2 x 32-core CPUs per host:
      16 x 64 cores x $500 = ~$512,000 for VCF licensing
      (includes all components, not just vSAN)

  Impact on migration decision:
    - vSAN can no longer be purchased standalone
    - Must buy entire VCF stack even if only vSAN is needed
    - Per-core licensing penalizes high-core-count servers
    - Existing perpetual licenses honored (for now) but no new purchases
    - Support renewal costs increasing significantly
    - This licensing change is the PRIMARY driver for many organizations
      evaluating alternative platforms
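The per-core arithmetic above, including the 16-core minimum, is easy to sanity-check in code. A Python sketch; the $500/core figure is this section's estimate, not published pricing:

```python
def vcf_annual_cost(hosts, cores_per_cpu, cpus_per_host, price_per_core,
                    min_cores_per_cpu=16):
    """Post-Broadcom VCF-style per-core licensing with a per-CPU core
    minimum: each CPU is billed for at least min_cores_per_cpu cores."""
    billed_per_cpu = max(cores_per_cpu, min_cores_per_cpu)
    return hosts * cpus_per_host * billed_per_cpu * price_per_core

print(vcf_annual_cost(16, 32, 2, 500))  # 512000: 16 hosts x 64 cores x $500
print(vcf_annual_cost(1, 8, 2, 500))    # 16000: 8-core CPUs billed as 16-core
```

The second call shows why the core minimum matters: low-core-count hosts pay for cores they do not have, which changes the hardware-refresh calculus alongside the migration decision.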

Licensing Comparison Context

Cost Comparison Framework (approximate, 16 hosts)
====================================================

                        Licensing Model     Estimated Annual Cost
                        ----------------    --------------------
VMware VCF (vSAN incl.) Per core            $300K-$600K
                        (bundled with ESXi,  (depends on core count
                         NSX, Aria, etc.)     and EA negotiation)

OVE / OpenShift         Per core            $150K-$350K
                        (OpenShift sub       (includes ODF/Ceph,
                         includes ODF)        OpenShift platform)

Azure Local             Per core (if AKS)   $100K-$300K
                        or per physical      (Azure Stack HCI sub
                         core (Azure sub)     + Windows Server)

Swisscom ESC            Per VM or resource   Contract-dependent
                        unit (managed        (includes hardware,
                         service pricing)     operations, SLA)

Note: These are rough estimates for a 16-host, ~1000-core environment.
Actual pricing depends heavily on enterprise agreement negotiations,
existing licensing investments, and specific configuration choices.
The point is not the exact numbers but the structural shift:
  vSAN-on-VCF is the most expensive option due to bundling
  OVE/OpenShift includes storage (ODF) in the platform subscription
  Azure Local has the lowest licensing cost but may need more nodes
  ESC amortizes all costs into a managed service fee

What to Preserve vs. What to Leave Behind

Must Preserve (Essential Capabilities)

Policy-based provisioning (SPBM)
  Why essential: Operators define intent ("FTT=1"); the system handles
    placement. Scales to 5,000 VMs without manual per-VM storage config.
  Replacement: Kubernetes StorageClasses (OVE), S2D volume policies
    (Azure Local), service tiers (ESC)

Automatic component rebalancing
  Why essential: When hosts are added/removed, data redistributes
    without manual intervention.
  Replacement: Ceph CRUSH reweight/rebalance (OVE), S2D automatic
    rebalance (Azure Local), provider-managed (ESC)

Transparent failure recovery
  Why essential: Disk/host fails, data rebuilds automatically on
    surviving hardware. No operator action needed for normal failures.
  Replacement: Ceph self-healing via PG recovery (OVE), S2D
    mirror/parity rebuild (Azure Local), provider SLA (ESC)

End-to-end checksums
  Why essential: Silent data corruption detected at the storage layer
    before it reaches the VM.
  Replacement: Ceph BlueStore checksums (OVE), ReFS integrity streams
    (Azure Local), array-level checksums (ESC)

Thin provisioning with UNMAP
  Why essential: Capacity overcommit + automatic space reclaim when VMs
    delete data.
  Replacement: Ceph RBD thin + DISCARD (OVE), S2D thin + TRIM
    (Azure Local), provider-managed (ESC)

Integrated health monitoring
  Why essential: Single-pane health dashboard showing disk health,
    object compliance, network status, capacity.
  Replacement: Ceph Dashboard / ODF console (OVE), Windows Admin Center
    (Azure Local), provider portal (ESC)

Nice to Have (Valuable but Not Blocking)

Dedup + compression (inline)
  Why nice: Saves 1.5-2.5x capacity on mixed VM workloads.
  Alternative: Ceph BlueStore compression (OVE -- no dedup), ReFS dedup
    (limited in S2D), provider-managed (ESC)

HCI Mesh (disaggregated compute/storage)
  Why nice: Scale compute independently from storage.
  Alternative: Native in Ceph/ODF (separate storage nodes). Not
    available in S2D (all nodes must have disks).

Erasure coding (RAID-5/6)
  Why nice: Significant capacity savings for read-heavy or warm/cold
    data.
  Alternative: Ceph EC pools (OVE), S2D parity volumes (Azure Local)

Per-object IOPS limits
  Why nice: Basic QoS to cap noisy-neighbor VMs.
  Alternative: Ceph rbd_qos (OVE), S2D Storage QoS policies (Azure Local)

vSAN file services
  Why nice: NFS shares served directly from the vSAN datastore.
  Alternative: CephFS (OVE), SMB shares on S2D (Azure Local), NFS
    managed service (ESC)

Leave Behind (VMware-Specific, Do Not Replicate)

CLOM (Cluster-Level Object Manager)
  Why leave behind: Tightly coupled to ESXi/vCenter. CLOM's placement
    logic is an implementation detail, not a portable concept.
  What replaces it: Ceph CRUSH maps (OVE) and the S2D storage bus layer
    (Azure Local) each have their own, more transparent, placement
    algorithms.

DOM object tree (component/witness structure)
  Why leave behind: The specific decomposition of a VMDK into components
    with quorum voting is vSAN-specific.
  What replaces it: Ceph PG + CRUSH (OVE), S2D mirror/parity at the
    volume level (Azure Local). Different decomposition models, same
    outcome.

CMMDS (cluster metadata gossip)
  Why leave behind: vSAN's internal metadata directory.
  What replaces it: etcd (OVE), S2D metadata service (Azure Local). No
    action needed -- each platform has its own cluster state management.

RDT (Reliable Datagram Transport)
  Why leave behind: vSAN's custom node-to-node transport. Not portable.
  What replaces it: Ceph msgr v2 over TCP/RDMA (OVE), SMB Direct / RDMA
    (Azure Local). Standard transports.

Disk group concept (1 cache + N capacity)
  Why leave behind: The rigid cache-tier-per-disk-group model is a vSAN
    design constraint.
  What replaces it: Ceph: WAL/DB on NVMe, OSDs on capacity devices -- a
    more flexible mapping. S2D: automatic tiering across all devices in
    the pool -- no disk group concept.

vSAN Performance Service (proprietary metrics)
  Why leave behind: Locked to vCenter.
  What replaces it: Prometheus + Grafana (OVE/Ceph), Azure Monitor +
    Windows Admin Center (Azure Local), provider monitoring (ESC).

Key Takeaways

  1. vSAN is deeply embedded in the VMware stack. It is not a detachable storage layer -- it depends on ESXi, vCenter, SPBM, CMMDS, and the entire vSphere ecosystem. Migration means replacing the entire storage architecture, not just swapping out a backend.

  2. The write path is vSAN's performance-critical path. Writes go through DOM -> LSOM -> cache (write buffer) -> network -> remote LSOM -> remote cache. Write latency = max(local cache write, network + remote cache write). Any candidate must match or beat the 150-300 us write latency for all-flash local writes.

  3. vSAN's 60-minute rebuild delay is a design choice, not a limitation. It optimizes for transient failures (host reboots) but means the cluster operates at reduced redundancy for at least 60 minutes after any host failure. Understand whether your team has modified this default and why.

  4. Capacity overhead is substantial. With FTT=1 RAID-1 and recommended slack space, only ~30% of raw capacity is usable. With dedup/compression, effective usable can reach ~60%. Every candidate has similar overhead -- the key is to compare using the same methodology (raw -> usable -> effective with data reduction).

  5. The cache tier is not "cache" in all-flash. In all-flash (the modern standard), the cache device is exclusively a write buffer. There is no read cache. This is counterintuitive and a common source of confusion. Understanding this is essential for sizing cache devices during migration planning.

  6. Component count is a scaling consideration. A 5,000-VM cluster can easily generate 50,000-100,000+ components. vSAN has per-host component limits (~9,000). This constrains how small a cluster can be for a given VM count. Ceph's PG model and S2D's volume model have analogous but different scaling constraints.

  7. The Broadcom licensing change is the forcing function. vSAN is no longer available standalone -- it is bundled into VCF at per-core pricing. For many enterprises, this licensing change alone justifies the migration evaluation. However, the technical migration complexity is the real risk, not the licensing cost.

  8. vSAN networking was historically fragile. Multicast dependencies (pre-7.0), MTU mismatches, and shared uplinks with VM traffic caused the majority of vSAN outages. Any replacement must have cleaner network requirements. All three candidates use unicast-only protocols.

  9. SPBM is the one concept that must survive in some form. The ability to define storage intent declaratively ("I want 2 failures tolerated with erasure coding") and have the system handle placement automatically is what makes vSAN manageable at scale. Kubernetes StorageClasses (OVE), S2D policies (Azure Local), and ESC service tiers are the equivalents -- but none offer the same granularity of per-VM policy attachment that SPBM provides.

  10. Know what you monitor today. Before migrating, catalog every vSAN alarm, health check, and performance metric that your operations team actively monitors. Each one represents an operational dependency that must have an equivalent in the target platform. Missing a single critical alarm (e.g., "write buffer full" -> equivalent in Ceph: "OSD nearfull") can cause the first post-migration incident.
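
One way to run this cataloging exercise is a literal mapping table where every monitored vSAN signal either gets a named target-platform equivalent or an explicit gap. The Ceph health-check names below are illustrative examples of the format, not a complete or authoritative mapping:

```python
# Skeleton for the alarm-mapping exercise: every actively monitored vSAN
# signal maps to a target-platform equivalent or to the literal "GAP".

vsan_to_ceph = {
    "Write buffer full / congestion":    "OSD nearfull / backfillfull alerts",
    "Host disconnected from cluster":    "MON reports OSDs down on host",
    "Object health: reduced redundancy": "PGs degraded / undersized",
    "Resync in progress":                "PGs backfilling / recovering",
    "Physical disk health":              "SMART-based disk failure warnings",
}

unmapped = [alarm for alarm, target in vsan_to_ceph.items() if target == "GAP"]
print(len(unmapped))  # 0 -> every monitored alarm has a target equivalent
```

The table is trivially reviewable in a PoC exit meeting: any remaining "GAP" entry is either an accepted risk (documented) or a blocker.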


Discussion Guide

These questions are for your own storage and infrastructure team. The goal is to surface the implicit knowledge about how vSAN operates in your specific environment -- knowledge that may not be documented but is critical for migration planning.

Current State Understanding

  1. What is our current vSAN version and on-disk format version? Are we on vSAN 7.x or 8.x? Have we upgraded the on-disk format to v15+? Are there clusters still on v2/v5 that would need format migration before any platform migration?

  2. What SPBM policies are actively in use? List every storage policy applied to production VMs. What FTT levels, failure tolerance methods (RAID-1 vs. RAID-5/6), stripe widths, and object space reservations are configured? How many distinct policies exist? Are any VMs using "force provisioning" to bypass policy compliance?

  3. What is our actual capacity utilization? Not the vCenter summary, but the real breakdown: raw capacity, cache overhead, FTT overhead, metadata overhead, slack space, dedup/compression ratio, and final usable capacity. What is our physical utilization percentage right now? How fast is it growing (monthly trend)?

  4. Have we ever experienced a write buffer full event? If yes, what caused it and how long did the I/O stall last? Which VMs were affected? What was the business impact? This scenario will exist in some form on any replacement platform -- how did we handle it?

  5. What is our measured rebuild time for a host failure? Not the theoretical calculation, but the actual observed time from host failure to full redundancy restoration. Has this been tested recently, or are we relying on assumptions?

Operational Dependencies

  1. Which vSAN health checks trigger automated remediation or paging? Walk through the alerting chain: vSAN health check -> vCenter alarm -> monitoring system (Aria, Zabbix, PagerDuty, etc.) -> on-call page. Which specific checks are wired to P1/P2 pages? These must be replicated in the target platform.

  2. Do we use vSAN stretched clusters or 2-node clusters? If yes, where are the witness nodes? What is the inter-site latency? What is the RPO/RTO for site failure? Stretched cluster migration is significantly more complex than single-site.

  3. How do we handle maintenance mode today? Which evacuation mode is used for patching (full data migration, ensure accessibility, or no migration)? How long does a typical host maintenance window take? This directly impacts the patching cadence on any replacement platform.

  4. Are there VMs with specific data locality requirements? Are any VMs pinned to specific hosts for licensing or compliance reasons? Does the storage team manually influence component placement, or is it fully automatic?

  5. What custom vSAN advanced settings have been modified? Check VSAN.ClomRepairDelay, resync throttle settings, and any other non-default advanced parameters. Each one represents a tuning decision that encodes operational knowledge about our specific environment.

Performance Baseline

  1. Can we export 30 days of vSAN Performance Service data? We need per-VM IOPS, latency, and throughput histograms to establish the acceptance criteria for the PoC. Without this data, we cannot objectively compare the candidates against the current baseline.

  2. Which VMs are our top 50 storage consumers? Identify the VMs that generate the most IOPS, the most throughput, and the most write load. These are the VMs that will stress-test any replacement platform. They should be the first candidates for PoC migration.

  3. What is our peak I/O period? Is it end-of-month batch processing, morning login storms, overnight backups? The replacement must handle peak load without degradation, not just average load.

  4. Have we measured end-to-end I/O latency from inside the guest VMs? vSAN metrics show storage-level latency, but application-perceived latency also includes guest kernel, paravirtual storage driver, and hypervisor overhead. Run fio inside representative VMs to establish the guest-perceived baseline.
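
As a sketch of that guest-side baseline run, a minimal fio job file might look like the following. All parameters (file size, runtime, queue depth, paths) are placeholders to adapt per VM; the point is to measure the same block size and access pattern the application actually generates:

```ini
# baseline.fio -- illustrative guest-perceived latency baseline
[global]
ioengine=libaio
direct=1
time_based=1
runtime=120
filename=/var/tmp/fio-baseline.dat
size=4g

[randwrite-4k]
rw=randwrite
bs=4k
iodepth=16
numjobs=1

[randread-4k]
stonewall
rw=randread
bs=4k
iodepth=16
numjobs=1
```

Run with `fio baseline.fio` and compare the completion-latency percentiles (not just averages) against the vSAN-reported latency for the same VM; the delta is the guest-stack overhead that will also exist, in different form, on any replacement platform.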

Pain Points and Known Issues

  1. What are the top three vSAN-related incidents in the last 12 months? What was the root cause, impact, and resolution for each? These incidents reveal the operational weaknesses of the current platform -- and may or may not exist on the replacement.

  2. What vSAN operations do we avoid or defer because they are risky? Examples: firmware upgrades on cache devices, disk group reconfiguration, enabling dedup on existing clusters. These operational pain points should be compared against the operational model of each candidate.

  3. How confident are we in our vSAN capacity forecasting? Have we ever been surprised by capacity growth? Do we account for dedup/compression ratio changes as workload mix evolves? A platform migration is an opportunity to reset capacity planning methodology.

  4. What is our experience with vSAN upgrades? How disruptive are rolling upgrades? How long does a cluster-wide upgrade take? What has broken during upgrades? This experience calibrates expectations for Day-2 lifecycle management on any platform.

Migration-Specific

  1. What is the total data footprint to migrate? Not virtual provisioned size, but actual consumed data (after thin provisioning, before dedup). This determines the migration timeline and network bandwidth requirements.
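
The footprint number translates directly into a copy-window estimate. A rough model, with all inputs as placeholders for your own numbers (migration traffic rarely gets a link to itself, hence the utilization factor):

```python
# Migration-window estimate: consumed data over effective copy bandwidth.

def migration_days(consumed_tb, link_gbps, utilization=0.5):
    """utilization is the assumed sustained fraction of the link available
    to migration traffic alongside production I/O."""
    effective_gbps = link_gbps * utilization
    seconds = consumed_tb * 8_000 / effective_gbps   # TB -> Gb
    return seconds / 86_400

# 500 TB consumed, 25 GbE link, 50% sustained utilization
print(round(migration_days(500, 25), 1))  # ~3.7 days of pure copy time
```

Note this is copy time only; per-VM cutover windows, conversion time, and validation typically stretch the calendar duration far beyond the raw transfer estimate.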

  2. Do we have any vSAN-specific integrations? Examples: backup solutions that use vSAN snapshot APIs, monitoring tools that query the vSAN Performance Service API, automation scripts that use PowerCLI Get-VsanDisk cmdlets. Each integration is a migration dependency.

  3. What is our appetite for "big bang" vs. gradual migration? Can we run the old and new platforms in parallel during migration, or must we cut over? vSAN-to-Ceph data migration typically requires V2V conversion (VMDK to QCOW2/raw) -- there is no live migration path between fundamentally different hypervisors.
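
One common tool for the V2V step is qemu-img's documented VMDK-to-QCOW2 conversion. The helper below only builds the command line (a dry run) so a batch plan can be reviewed before anything executes; the file names are placeholders:

```python
# Build (but do not run) a qemu-img convert command for one VM disk,
# following qemu-img's documented convert usage.

def build_convert_cmd(vmdk_path, qcow2_path):
    return [
        "qemu-img", "convert",
        "-p",               # show progress
        "-f", "vmdk",       # source format
        "-O", "qcow2",      # output format
        vmdk_path, qcow2_path,
    ]

cmd = build_convert_cmd("app01.vmdk", "app01.qcow2")
print(" ".join(cmd))
# qemu-img convert -p -f vmdk -O qcow2 app01.vmdk app01.qcow2
```

Generating the full command list up front also gives an easy review artifact: total disk count, ordering, and which VMs are in each cutover batch.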


Next: 03-storage-protocols.md -- Storage Protocols (iSCSI, NVMe-oF, Fibre Channel, NFS, SMB)