vSAN -- Current Storage Baseline
Why This Matters
vSAN is the storage engine underneath every VM in the current VMware environment. Before evaluating Ceph/ODF, Storage Spaces Direct, or managed SAN services as replacements, we need a precise understanding of what vSAN actually does -- not at the marketing level ("hyper-converged software-defined storage") but at the level of internal components, data paths, failure handling, and operational behavior. Every candidate will be measured against the operational reality of vSAN: its strengths (policy-based provisioning, transparent rebalancing, integrated health checks) and its weaknesses (Broadcom licensing, opaque internals, scaling limits, multicast dependencies).
This document serves three purposes:
- Establish the baseline. Quantify what vSAN delivers today -- IOPS, latency, capacity overhead, failure recovery time -- so PoC acceptance criteria are grounded in reality, not vendor datasheets.
- Identify dependencies. Discover which vSAN behaviors (SPBM integration, CLOM placement, DOM object structure) our operations team relies on explicitly or implicitly. These need conceptual equivalents in any replacement.
- Know what to leave behind. Some vSAN mechanisms are VMware-proprietary couplings that should not be replicated -- they should be replaced with better abstractions in the target platform.
Concepts
1. vSAN Architecture Overview
vSAN is a distributed storage layer embedded in the ESXi hypervisor kernel. It pools local disks (NVMe, SSD, HDD) across all ESXi hosts in a vSAN cluster into a single shared datastore. Unlike external SAN/NAS, there is no dedicated storage controller -- every ESXi host participates as both a compute node and a storage node.
Cluster Topology
vSAN Cluster (typical: 4-64 hosts)
=================================================
+-----------+ +-----------+ +-----------+ +-----------+
| ESXi Host | | ESXi Host | | ESXi Host | | ESXi Host |
| 01 | | 02 | | 03 | | 04 |
+-----------+ +-----------+ +-----------+ +-----------+
| Disk Grp 1| | Disk Grp 1| | Disk Grp 1| | Disk Grp 1|
| [NVMe-C] | | [NVMe-C] | | [NVMe-C] | | [NVMe-C] |
| [SSD-1] | | [SSD-1] | | [SSD-1] | | [SSD-1] |
| [SSD-2] | | [SSD-2] | | [SSD-2] | | [SSD-2] |
| [SSD-3] | | [SSD-3] | | [SSD-3] | | [SSD-3] |
| Disk Grp 2| | Disk Grp 2| | Disk Grp 2| | Disk Grp 2|
| [NVMe-C] | | [NVMe-C] | | [NVMe-C] | | [NVMe-C] |
| [SSD-4] | | [SSD-4] | | [SSD-4] | | [SSD-4] |
| [SSD-5] | | [SSD-5] | | [SSD-5] | | [SSD-5] |
| [SSD-6] | | [SSD-6] | | [SSD-6] | | [SSD-6] |
+-----------+ +-----------+ +-----------+ +-----------+
| | | |
+--------+-------+--------+-------+--------+------+
| | |
vSAN Network vSAN Network vSAN Network
(vmk1, 25 GbE) (vmk1, 25 GbE) (vmk1, 25 GbE)
Shared vSAN Datastore
(single namespace, all hosts)
Disk Groups
A disk group is the fundamental storage unit on each host. Each disk group contains exactly one cache device and one or more capacity devices.
Disk Group Structure
======================
+------------------------------------------+
| Disk Group |
| |
| +-------------------+ |
| | Cache Tier | 1 device only |
| | (NVMe or SSD) | 70% read cache |
| | | 30% write buffer|
| +-------------------+ |
| |
| +--------+ +--------+ +--------+ |
| | Cap | | Cap | | Cap | 1-7 |
| | Disk 1 | | Disk 2 | | Disk 3 | devs |
| | (SSD) | | (SSD) | | (SSD) | |
| +--------+ +--------+ +--------+ |
+------------------------------------------+
Rules:
- Max 5 disk groups per host
- Max 7 capacity devices per disk group
- Max 35 capacity devices per host (5 x 7)
- Cache device must be >= 10% of total capacity
in the disk group (recommendation, not hard limit)
- In all-flash configurations:
Cache = write buffer only (100% write buffer)
No read cache (flash capacity is fast enough)
- In hybrid configurations (SSD cache + HDD capacity):
Cache = 70% read cache + 30% write buffer
All-flash vs. hybrid behavior: In an all-flash (AF) configuration, which is standard for new deployments, the cache tier serves exclusively as a write buffer. Reads go directly to the capacity SSDs because random read performance on flash is already high enough. In legacy hybrid configurations (SSD cache + HDD capacity), 70% of the cache device is dedicated to read caching to avoid hitting slow HDD random reads.
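The disk group rules above lend themselves to a simple sanity check. The following is a minimal Python sketch (not a vSAN API; all names and values are illustrative) that flags a proposed per-host layout that violates the device-count rules or the 10% cache recommendation:

```python
# Minimal sketch (not a vSAN API) that checks a proposed disk group layout
# against the rules listed above. All names and values are illustrative.

def validate_disk_group(cache_gb: float, capacity_gb: list) -> list:
    """Return a list of rule violations for one disk group (empty = OK)."""
    issues = []
    if not 1 <= len(capacity_gb) <= 7:
        issues.append("a disk group needs 1-7 capacity devices")
    # Recommendation, not a hard limit: cache >= 10% of group capacity
    if cache_gb < 0.10 * sum(capacity_gb):
        issues.append(f"cache {cache_gb} GB < 10% of {sum(capacity_gb)} GB capacity")
    return issues

def validate_host(disk_groups: list) -> list:
    issues = []
    if len(disk_groups) > 5:
        issues.append("max 5 disk groups per host")
    if sum(len(dg["capacity_gb"]) for dg in disk_groups) > 35:
        issues.append("max 35 capacity devices per host")
    for i, dg in enumerate(disk_groups, start=1):
        issues += [f"DG{i}: {msg}" for msg in
                   validate_disk_group(dg["cache_gb"], dg["capacity_gb"])]
    return issues

host = [{"cache_gb": 1600, "capacity_gb": [3840] * 3},
        {"cache_gb": 1600, "capacity_gb": [3840] * 3}]
print(validate_host(host) or "layout satisfies the rules above")
```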
Witness Nodes and Stretched Clusters
For 2-node vSAN clusters or stretched clusters across two sites, a witness node is required to maintain quorum. The witness stores only metadata (witness components), not data.
Stretched Cluster Topology
============================
Site A Site B
+----------+ +----------+
| Host A-1 | <-- sync repl -> | Host B-1 |
| Host A-2 | <-- sync repl -> | Host B-2 |
+----------+ +----------+
\ /
\ /
\ +----------+ /
+----->| Witness |<------+
| (Site C) |
| ESXi VM |
| metadata |
| only |
+----------+
Witness role:
- Stores witness components (small metadata objects, ~2 MB each)
- Breaks quorum tie when one site is unreachable
- Does NOT store data components
- Must be in a third failure domain (different rack/site)
- Can be a nested ESXi VM (VMware provides the appliance OVA)
Quorum rule:
Object accessible IF > 50% of votes (components) are reachable
With FTT=1 (mirroring): data on Site A, data on Site B, witness on Site C
Site A fails -> Site B + Witness = 2/3 votes = quorum maintained
Site C fails -> Site A + Site B = 2/3 votes = quorum maintained
Site A + Site C fail -> only Site B = 1/3 votes = NO quorum, object offline
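The quorum rule can be stated in a few lines. A minimal sketch, assuming one vote per component (a simplification -- vSAN can assign extra votes to avoid ties), showing why losing any one of the three sites keeps the object accessible but losing two does not:

```python
# Sketch of the quorum rule above: an object stays accessible only while the
# reachable components hold strictly more than 50% of the votes. One vote per
# component is a simplification (vSAN can assign extra votes to avoid ties).

def accessible(reachable_votes: int, total_votes: int) -> bool:
    return 2 * reachable_votes > total_votes      # strictly more than 50%

# FTT=1 mirror: data at Site A, data at Site B, witness at Site C (1 vote each)
votes = {"site_a": 1, "site_b": 1, "witness_c": 1}
total = sum(votes.values())

for failed in [{"site_a"}, {"witness_c"}, {"site_a", "witness_c"}]:
    reachable = sum(v for site, v in votes.items() if site not in failed)
    print(f"failed={sorted(failed)}: accessible={accessible(reachable, total)}")
# Losing any single site keeps quorum (2/3 votes); losing two does not (1/3).
```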
2. vSAN Data Path Internals
vSAN's I/O stack is composed of five core components running inside the ESXi hypervisor. Understanding these is essential for performance troubleshooting and for identifying which behaviors need conceptual replacements.
vSAN I/O Stack (Kernel-Space Components)
==========================================
VM issues I/O (SCSI command via vmhba adapter)
|
v
+----------------------------------------------------------+
| CLOM (Cluster-Level Object Manager) |
| - Policy engine: translates SPBM policies into |
| placement decisions |
| - Decides WHERE components live (which hosts, which DGs) |
| - Enforces FTT, stripe width, locality rules |
| - Runs as a user-space daemon (clomd) on every host |
+----------------------------------------------------------+
| (placement map: "put component X on host Y, DG Z")
v
+----------------------------------------------------------+
| DOM (Distributed Object Manager) |
| - Object I/O coordinator |
| - Owns the object tree (components, witnesses, RAID) |
| - Routes reads to nearest replica |
| - Sends writes to all replicas (sync replication) |
| - Handles object-level operations (create, delete, grow) |
| - Runs on the host that owns the object (object owner) |
+----------------------------------------------------------+
| (I/O dispatched to component hosts)
v
+----------------------------------------------------------+
| LSOM (Local Log-Structured Object Manager) |
| - Local I/O engine on each host |
| - Writes: append to write buffer (cache tier), then |
| destages to capacity tier in background |
| - Reads: check write buffer -> check capacity tier |
| - Manages on-disk layout (log-structured) |
| - Handles dedup/compression at the component level |
+----------------------------------------------------------+
|
v
+----------------------------------------------------------+
| CMMDS (Cluster Monitoring, Membership, and Directory |
| Services) |
| - Cluster membership protocol (heartbeat, health) |
| - Distributed metadata directory |
| - Tracks: object locations, component states, disk UUIDs |
| - Uses gossip protocol between hosts |
| - Previously multicast-based, now unicast (vSAN 6.6+) |
+----------------------------------------------------------+
|
v
+----------------------------------------------------------+
| RDT (Reliable Datagram Transport) |
| - vSAN's custom network transport protocol |
| - Runs over TCP (port 2233) between hosts |
| - Manages connection setup, teardown, and multiplexing |
| for vSAN inter-host traffic |
| - Optimized for datacenter latency (not WAN) |
+----------------------------------------------------------+
|
v
Physical Network (25/100 GbE)
Write Path in Detail
Understanding the write path is critical because write latency directly impacts VM performance for database and transactional workloads.
vSAN Write Path (FTT=1, RAID-1 Mirror)
=========================================
VM on Host 1 writes 4 KiB block
|
v
DOM (Host 1, object owner)
|
+---> LSOM on Host 1 (local replica)
| |
| +-> Write to cache device (NVMe write buffer)
| +-> Acknowledge to DOM
|
+---> RDT network --> LSOM on Host 3 (remote replica)
|
+-> Write to cache device (NVMe write buffer)
+-> Acknowledge to DOM (via RDT)
|
v
DOM receives ACKs from BOTH replicas
|
v
ACK returned to VM
|
(write complete -- latency = max(local_cache, network + remote_cache))
Background destage (async, not on write path):
LSOM periodically flushes write buffer to capacity tier
- Destage threshold: ~70-80% write buffer utilization
- Destage I/O is sequential, optimized for capacity device bandwidth
- If write buffer fills to 100%: writes block until destage frees space
(this is a performance emergency -- "write buffer full" alarm)
Typical write latency breakdown (all-flash, 25 GbE):
Local cache write: ~50-100 us (NVMe)
Network RTT: ~50-100 us (25 GbE, same rack)
Remote cache write: ~50-100 us (NVMe)
DOM coordination: ~10-30 us
Total: ~150-300 us (typical)
Total: ~300-600 us (cross-rack, busy network)
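Because the write is acknowledged only after both replicas have landed in their write buffers, the latency model reduces to a max() plus coordination overhead. A rough, illustrative sketch using the ranges above (not measured values):

```python
# Back-of-the-envelope model of the FTT=1 RAID-1 write path above: the VM sees
# the slower of the two replica writes plus DOM coordination. The input values
# are the rough ranges quoted in this section, not measurements.

def write_latency_us(local_cache, network_rtt, remote_cache, dom_coord):
    return max(local_cache, network_rtt + remote_cache) + dom_coord

print(write_latency_us(75, 75, 75, 20))     # same rack, quiet network: ~170 us
print(write_latency_us(100, 400, 100, 30))  # cross-rack, busy network: ~530 us
```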
Read Path in Detail
vSAN Read Path
================
VM on Host 1 reads 4 KiB block
|
v
DOM (Host 1, object owner)
|
+---> Is data in local LSOM? (Host 1 has a component?)
| |
| YES --> Read from local capacity tier (or write buffer if recent)
| Latency: ~100-200 us (NVMe)
|
| NO --> Route to remote host with nearest component
| RDT network --> LSOM on Host 3
| Read from capacity tier on Host 3
| Latency: ~200-400 us (network + NVMe)
|
v
Data returned to VM
Read optimization (all-flash):
- No read cache (cache device = write buffer only)
- Reads go directly to capacity flash
- DOM prefers local replica if available (data locality)
- If VM migrates via vMotion, reads temporarily go remote
until vSAN rebalances components (can take hours)
LSOM Internals: Log-Structured I/O
LSOM uses a log-structured write model on the cache device. All writes are appended sequentially to the write buffer log, regardless of the original I/O pattern. This converts random writes into sequential writes on the cache device, which is beneficial for NAND endurance and write performance.
LSOM Write Buffer (Cache Device)
==================================
Cache NVMe (e.g., 800 GB)
+-------------------------------------------------------+
| Log Head --> [W1][W2][W3][W4][W5][W6] ... [Wn] <-- |
| |
| Destage pointer --> [W1] already flushed to capacity |
| [W2] flushing now |
| [W3..Wn] pending destage |
+-------------------------------------------------------+
Write buffer behavior:
1. New write appended at log head
2. Destage thread reads from destage pointer, writes to capacity tier
3. Freed space reclaimed for new writes
4. If log wraps around (buffer full):
- Write latency spikes (destage becomes synchronous)
- "Write buffer full" condition --> CRITICAL ALARM
Monitoring:
esxcli vsan debug disk info (shows write buffer utilization)
vSAN Performance Service: "Write Buffer Fill %" metric
Alert threshold: > 80% sustained for > 30 seconds
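The alert threshold above maps naturally to a sustained-threshold rule. A sketch of such a rule, assuming a 5-second polling interval; the sample values are placeholders rather than output from any vSAN API:

```python
# Sketch of the alert rule above (fill > 80% sustained for 30 s). The sample
# lists are stand-ins -- in practice the values come from the vSAN Performance
# Service or esxcli, polled at a fixed interval.

from collections import deque

THRESHOLD_PCT = 80
SUSTAIN_S = 30

def should_alert(samples_pct, poll_interval_s=5):
    """samples_pct: recent write-buffer fill percentages, oldest first."""
    window = deque(maxlen=SUSTAIN_S // poll_interval_s)
    for sample in samples_pct:
        window.append(sample)
        if len(window) == window.maxlen and min(window) > THRESHOLD_PCT:
            return True
    return False

print(should_alert([60, 72, 81, 83, 85, 84, 86, 88]))  # True: >80% for 30 s
print(should_alert([60, 90, 70, 92, 75, 91, 78, 93]))  # False: spikes, not sustained
```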
3. Storage Policies (SPBM)
Storage Policy-Based Management (SPBM) is vSAN's mechanism for defining storage characteristics declaratively. Instead of configuring RAID levels, stripe counts, and failure tolerance at the storage layer, administrators define policies that describe the desired behavior. vSAN (specifically CLOM) translates these policies into component placement decisions.
Policy Capabilities
| Policy Attribute | Values | Effect |
|---|---|---|
| failures to tolerate (FTT) | 0, 1, 2, 3 | Number of host/disk failures the object survives |
| failure tolerance method | RAID-1 (mirror), RAID-5/6 (erasure coding) | How redundancy is implemented |
| stripe width | 1-12 | Number of capacity devices data is striped across |
| force provisioning | yes/no | Provision even if policy cannot be satisfied |
| object space reservation | 0-100% | Percentage of thick provisioning (0 = fully thin) |
| flash read cache reservation | 0-100% | Reserved read cache (hybrid only, ignored in AF) |
| IOPS limit | 0 (unlimited) or value | Per-object IOPS cap (basic QoS) |
| disable object checksum | yes/no | Disable end-to-end checksum (not recommended) |
| storage tier | vSAN with HCI Mesh | Target a specific tier in the disaggregated model |
FTT and Host Requirements
FTT Method Min Hosts Capacity Overhead Description
--- ------ --------- ---------------- -----------
0 None 1 1x (no redundancy) Dev/test only
1 RAID-1 3 2x Default. Mirror to 2 hosts.
1 RAID-5 4 1.33x Erasure coding (3+1)
2 RAID-1 5 3x Triple mirror.
2 RAID-6 6 1.5x Erasure coding (4+2)
3 RAID-1 7 4x Quadruple mirror. Rare.
Example: 1 TB VMDK with different policies
-------------------------------------------
FTT=1, RAID-1: 2 TB consumed (1 TB x 2 replicas)
FTT=1, RAID-5: 1.33 TB consumed (1 TB data + 0.33 TB parity)
FTT=2, RAID-1: 3 TB consumed (1 TB x 3 replicas)
FTT=2, RAID-6: 1.5 TB consumed (1 TB data + 0.5 TB parity)
For a financial enterprise: FTT=1 RAID-1 is the minimum for production.
FTT=2 recommended for data that must survive double-failure scenarios.
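The overhead column translates directly into consumed capacity per VMDK. A small arithmetic sketch (multipliers taken from the table above):

```python
# Arithmetic sketch of the table above: raw capacity consumed by a VMDK under
# each protection scheme. The multipliers mirror the "Capacity Overhead" column.

FTT_MULTIPLIER = {
    ("RAID-1", 1): 2.0,    # 2-way mirror
    ("RAID-5", 1): 4 / 3,  # 3 data + 1 parity
    ("RAID-1", 2): 3.0,    # 3-way mirror
    ("RAID-6", 2): 1.5,    # 4 data + 2 parity
    ("RAID-1", 3): 4.0,    # 4-way mirror
}

def consumed_tb(vmdk_tb, method, ftt):
    return vmdk_tb * FTT_MULTIPLIER[(method, ftt)]

for (method, ftt) in FTT_MULTIPLIER:
    print(f"1 TB VMDK, FTT={ftt} {method}: {consumed_tb(1.0, method, ftt):.2f} TB consumed")
```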
CLOM Placement Decisions
CLOM is the brain of vSAN storage policy enforcement. When a VM is created or a policy is applied, CLOM calculates where to place each component across the cluster.
CLOM Placement Algorithm (simplified)
========================================
Input:
- SPBM policy (e.g., FTT=1, RAID-1, stripe width=1)
- Cluster topology (hosts, disk groups, fault domains)
- Current capacity utilization per host/disk group
- Current component count per host (balance target)
Processing:
1. Determine required components:
FTT=1 RAID-1, 1 stripe -> 2 data components + 1 witness
2. Select hosts for components:
- Each data component on a DIFFERENT fault domain
- Witness on a THIRD fault domain
- Prefer hosts with most free capacity
- Prefer hosts with fewest components (balance)
- Respect affinity rules (if configured)
3. Select disk group within each chosen host:
- Prefer disk group with most free capacity
- Balance component count across disk groups
4. Validate:
- Enough capacity? If not, fail (or force-provision)
- Enough fault domains? If not, fail (or force-provision)
Output:
Component placement map:
Data Component 1 -> Host 02, Disk Group 1
Data Component 2 -> Host 04, Disk Group 2
Witness Component -> Host 01, Disk Group 1
CLOM recalculates when:
- Policy changes (VM storage policy update)
- Host fails or returns
- Disk fails or is added
- Rebalance triggered (proactive rebalancing)
- Maintenance mode entered/exited
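To make the placement step concrete, here is a heavily simplified greedy sketch: one fault domain per data component, ranked by free capacity, plus a witness in a further domain. Real CLOM weighs many more inputs (component balance, affinity rules, disk group selection); the data model and function below are invented for illustration:

```python
# Heavily simplified sketch of the placement step above: one fault domain per
# data component (most free capacity first), plus a witness in another domain.

def place(data_components, fault_domains):
    """fault_domains: {name: free_capacity_gb}. Returns component -> domain."""
    if data_components + 1 > len(fault_domains):      # +1 for the witness
        raise ValueError("not enough fault domains to satisfy the policy")
    ranked = sorted(fault_domains, key=fault_domains.get, reverse=True)
    placement = {f"data-{i + 1}": fd for i, fd in enumerate(ranked[:data_components])}
    placement["witness"] = ranked[data_components]
    return placement

# FTT=1 RAID-1, stripe width 1 -> 2 data components + 1 witness
print(place(2, {"host01": 9000, "host02": 14000, "host03": 11000, "host04": 8000}))
# -> data on host02 and host03 (most free capacity), witness on host01
```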
Erasure Coding on vSAN
vSAN supports RAID-5 (FTT=1, 3 data + 1 parity) and RAID-6 (FTT=2, 4 data + 2 parity) as erasure coding alternatives to mirroring. This saves significant capacity but has performance trade-offs.
RAID-5 on vSAN (FTT=1)
=========================
Object: 1 TB VMDK
Host 1 Host 2 Host 3 Host 4
+-------+ +-------+ +-------+ +-------+
| D0 | | D1 | | D2 | | P |
| 333GB | | 333GB | | 333GB | | 333GB |
+-------+ +-------+ +-------+ +-------+
Total consumed: 1.33 TB (vs. 2 TB for RAID-1)
Savings: 33%
Write penalty:
- Full stripe write: 4 I/O ops (write D0, D1, D2, P) -- efficient
- Partial stripe write: read-modify-write cycle:
1. Read old data chunk
2. Read old parity chunk
3. Compute new parity
4. Write new data chunk + new parity chunk
= 4 I/O ops for 1 application write
- Write latency: 2-4x higher than RAID-1 for small random writes
RAID-6 on vSAN (FTT=2)
=========================
Host 1 Host 2 Host 3 Host 4 Host 5 Host 6
+-----+ +-----+ +-----+ +-----+ +-----+ +-----+
| D0 | | D1 | | D2 | | D3 | | P | | Q |
| 250G| | 250G| | 250G| | 250G| | 250G| | 250G|
+-----+ +-----+ +-----+ +-----+ +-----+ +-----+
Total consumed: 1.5 TB (vs. 3 TB for triple mirror FTT=2 RAID-1)
Savings: 50%
When to use erasure coding on vSAN:
- Read-heavy workloads (no write penalty on reads)
- Large sequential writes (full-stripe writes are efficient)
- Capacity-constrained environments
- Warm/cold data (templates, archives, compliance stores)
When NOT to use erasure coding:
- OLTP databases (small random writes, latency-sensitive)
- VDI boot storms (high write amplification)
- Stretched clusters (not supported with erasure coding)
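The write penalty above can be expressed as backend device I/Os per application write. A small sketch of that arithmetic, following the description in this section:

```python
# Sketch of the write penalty described above, expressed as backend device I/Os
# per application write: RAID-1 writes both mirror legs; a partial-stripe
# RAID-5 write needs a read-modify-write cycle, while a full-stripe write
# amortizes the parity over three data chunks.

def write_amplification(method, full_stripe=False):
    if method == "RAID-1 FTT=1":
        return 2.0                               # one write per mirror leg
    if method == "RAID-5 FTT=1":
        return 4 / 3 if full_stripe else 4.0     # 4 writes / 3 data chunks vs. RMW
    raise ValueError(method)

for case in [("RAID-1 FTT=1", False), ("RAID-5 FTT=1", True), ("RAID-5 FTT=1", False)]:
    print(case, "->", round(write_amplification(*case), 2), "backend I/Os per app write")
```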
4. vSAN File System and Object Structure
vSAN does not use VMFS. It has its own on-disk format called VSAN-FS (or vSAN on-disk format). Understanding the object structure is important because it determines how failures affect individual VMs and how capacity is consumed.
Object Hierarchy
Every entity on vSAN is an object. A single VM consists of multiple objects, and each object is decomposed into components spread across the cluster.
VM Object Structure
=====================
VM: db-prod-01
|
+-- VM Home Namespace Object
| (VMX file, log files, snapshots metadata)
| Size: small (few MB)
| Components: 2 data + 1 witness (FTT=1 RAID-1)
|
+-- VMDK Object (boot disk, 100 GB)
| |
| +-- Component 1 (data, on Host 02, DG 1)
| | 50 GB (or less, depending on stripe width)
| +-- Component 2 (data, on Host 04, DG 2) -- mirror of C1
| | 50 GB
| +-- Component 3 (witness, on Host 01, DG 1)
| ~2 MB (metadata only, votes in quorum)
|
+-- VMDK Object (data disk, 500 GB)
| |
| +-- Striped? If stripe width=2 and FTT=1 RAID-1:
| |
| | RAID-1 tree:
| | Mirror Leg A (stripe width=2):
| | Component A1 (Host 02, DG 1) -- 250 GB
| | Component A2 (Host 02, DG 2) -- 250 GB
| | Mirror Leg B (stripe width=2):
| | Component B1 (Host 04, DG 1) -- 250 GB
| | Component B2 (Host 04, DG 2) -- 250 GB
| | Witness (Host 01) -- ~2 MB
| |
| +-- Total components: 4 data + 1 witness
|
+-- Swap Object (VM swap file, = RAM size)
| Components: typically FTT=1 RAID-1 (thin-provisioned)
|
+-- Snapshot Delta Objects (if snapshots exist)
Each snapshot creates additional delta VMDK objects
Total components for this VM: 12-15+ objects spread across cluster
Component Size Limits
vSAN has a maximum component size of 255 GB. Objects larger than 255 GB are automatically split into multiple components.
Large VMDK Decomposition
==========================
VMDK: 1 TB, FTT=1 RAID-1, stripe width=1
Since 1 TB > 255 GB, vSAN splits into segments:
Segment 1: 255 GB
Segment 2: 255 GB
Segment 3: 255 GB
Segment 4: 235 GB (remainder)
Each segment is independently mirrored:
Segment 1: Component on Host A + Component on Host B + Witness
Segment 2: Component on Host C + Component on Host D + Witness
Segment 3: Component on Host A + Component on Host D + Witness
Segment 4: Component on Host B + Component on Host C + Witness
Total components: 4 segments x (2 data + 1 witness) = 12 components
Impact: A 1 TB VMDK with FTT=1 creates 12 components.
A 5,000-VM cluster with average 500 GB per VM =
~2,500,000 GB / 255 GB per segment = ~10,000 segments
x 3 components each = ~30,000 components (data + witness)
Plus VM home, swap, snapshots = 50,000-100,000+ total components
vSAN scaling limit: ~9,000 components per host (check release notes)
At 64 hosts: ~576,000 components cluster-wide -- sufficient for 5,000 VMs
At 32 hosts: ~288,000 -- potentially tight, monitor component count
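The component arithmetic above is easy to reproduce. An order-of-magnitude sketch (data disk segments plus VM home and swap objects; snapshots and stripe width ignored), useful for checking a proposed cluster size against the per-host component limit:

```python
# Order-of-magnitude sketch of the component arithmetic above: 255 GB segment
# split plus per-VM home and swap objects. Snapshots and stripe width are
# ignored, so this is a rough check, not a sizing tool.

import math

MAX_COMPONENT_GB = 255

def components_per_vmdk(vmdk_gb, replicas=2, witnesses=1):
    segments = max(1, math.ceil(vmdk_gb / MAX_COMPONENT_GB))
    return segments * (replicas + witnesses)

def cluster_component_estimate(vm_count, avg_vmdk_gb, hosts, per_host_limit=9000):
    per_vm = (components_per_vmdk(avg_vmdk_gb)  # data disk
              + 3                               # VM home namespace (2 data + witness)
              + 3)                              # swap object (2 data + witness)
    total = vm_count * per_vm
    print(f"~{total:,} components vs. cluster limit {hosts * per_host_limit:,}")

cluster_component_estimate(vm_count=5000, avg_vmdk_gb=500, hosts=32)
# 500 GB -> 2 segments x 3 = 6 components, +6 overhead = 12 per VM -> ~60,000
```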
On-Disk Format (VSAN-FS)
vSAN On-Disk Format v5+ (vSAN 7.0+)
=======================================
Capacity Device Layout:
+------------------------------------------------------------+
| Partition 1: Metadata (small, fixed size) |
| - Device UUID, disk group membership |
| - Component table (which components live here) |
| - Bitmap of allocated/free blocks |
+------------------------------------------------------------+
| Partition 2: Data (remainder of device) |
| - Fixed-size allocation units (1 MiB) |
| - Log-structured writes from LSOM destage |
| - Checksum interleaved with data blocks |
| - Dedup hash table (if dedup enabled) |
| - Compression metadata (if compression enabled) |
+------------------------------------------------------------+
Cache Device Layout:
+------------------------------------------------------------+
| Write Buffer Log (log-structured) |
| - Sequential append of incoming writes |
| - Each entry: component UUID + offset + data + checksum |
| - Entries destaged to capacity tier, then reclaimed |
+------------------------------------------------------------+
Key characteristics:
- Block size: 1 MiB allocation unit on capacity tier
- Thin by default: unwritten blocks consume no space
- Checksum: CRC-32C on every block (end-to-end data integrity)
- On-disk format is proprietary -- no third-party tools can read it
- Disk replacement: new disk auto-formats with VSAN-FS on claim
5. Networking Requirements
vSAN places specific requirements on the network that differ from general VM traffic networking. Misconfigured vSAN networking is a leading cause of performance problems and split-brain scenarios.
Dedicated vmknic
ESXi Host Network Configuration for vSAN
===========================================
Physical NICs (typical):
+----------+ +----------+ +----------+ +----------+
| NIC 0 | | NIC 1 | | NIC 2 | | NIC 3 |
| 25 GbE | | 25 GbE | | 25 GbE | | 25 GbE |
+----------+ +----------+ +----------+ +----------+
| | | |
+----+--------------+----+ +----+--------------+----+
| vSwitch0 / VDS | | vSwitch1 / VDS |
| (Management + vMotion) | | (vSAN traffic) |
+------------------------+ +------------------------+
| |
vmk0 (Management) vmk1 (vSAN)
10.0.1.x/24 10.0.10.x/24
| |
vmk2 (vMotion) VLAN 10 (dedicated)
10.0.2.x/24 MTU 9000 (jumbo frames)
Requirements:
- Dedicated vmknic (vmk1) tagged for vSAN traffic
- Minimum 10 GbE (25 GbE recommended, 100 GbE for large clusters)
- Jumbo frames (MTU 9000) -- reduces CPU overhead by ~30%
- Dedicated VLAN -- isolate vSAN traffic from VM and management
- NIC teaming for redundancy (active-active or active-standby)
- Layer 2 adjacency between all hosts in the vSAN cluster
- Latency between hosts: < 1 ms (same rack/site)
Stretched cluster: < 5 ms RTT between sites (< 1 ms recommended)
Multicast to Unicast Transition
This is historically significant and still relevant for environments that have not upgraded.
vSAN Network Protocol Evolution
==================================
vSAN 6.5 and earlier:
CMMDS used MULTICAST for cluster membership and metadata exchange
- Required IGMP snooping on all switches in the vSAN VLAN
- Required multicast groups configured end-to-end
- Common failure: misconfigured IGMP snooping causes cluster partition
- Troubleshooting nightmare in large L2 domains
- Multicast addresses: 224.1.2.3 and 224.2.3.4 (default)
vSAN 6.6 and later:
CMMDS switched to UNICAST
- No multicast requirements
- Each host communicates directly with all other hosts
- CMMDS master elected per cluster (handles directory)
- Significantly simplifies network configuration
- Eliminates the #1 vSAN networking troubleshooting issue
Migration impact:
If the current environment is on vSAN 6.5 or earlier: multicast dependencies exist
If on vSAN 6.6 or later: unicast, no multicast concerns
For target platform comparison: Ceph uses unicast, S2D uses SMB Direct
(unicast). None of the candidates require multicast.
RDMA Support
vSAN RDMA Support (vSAN 7.0 U2+)
====================================
vSAN supports RDMA via RoCE v2 (RDMA over Converged Ethernet v2)
- Bypasses TCP/IP stack entirely for vSAN data traffic
- Reduces latency by ~30-50% (eliminates kernel network stack overhead)
- Reduces CPU utilization for storage I/O by ~20-40%
Requirements:
- RoCE v2 capable NICs (Mellanox ConnectX-5/6/7, Broadcom P2100G)
- Lossless Ethernet (PFC -- Priority Flow Control, ECN enabled)
- DCB (Data Center Bridging) configured on switches
- Dedicated traffic class for vSAN RDMA
Typical latency impact:
Without RDMA (TCP/IP): 200-400 us per remote I/O
With RDMA (RoCE v2): 100-200 us per remote I/O
Comparison to candidates:
- S2D: native SMB Direct / RDMA support (mature, well-integrated)
- Ceph/ODF: RDMA support via msgr v2 protocol (experimental, not
yet production-grade in ODF as of 2025)
- vSAN: RDMA via RoCE v2 (production-supported since 7.0 U2)
6. Capacity Management
Deduplication and Compression
vSAN supports inline deduplication and compression (combined, cannot enable one without the other). This is a significant capacity optimization but carries CPU and latency overhead.
vSAN Dedup + Compression Pipeline
====================================
Write path with dedup+compression enabled:
1. Data written to write buffer (cache tier) -- unmodified
2. During destage from cache to capacity tier:
a. Data block hashed (SHA-1 for dedup fingerprint)
b. Hash compared against dedup table
- Match: increment refcount, skip write (dedup hit)
- No match: compress block (LZ4 algorithm)
- Compressed block written to capacity tier
- Hash + location recorded in dedup table
3. Dedup table stored on capacity tier (memory-mapped for speed)
Performance impact:
- Dedup hash computation: ~5-15 us per 4 KiB block (CPU bound)
- Compression (LZ4): ~2-10 us per block (very fast)
- Combined overhead on write path: minimal (happens during destage,
not on the synchronous write path)
- Memory overhead: dedup hash table uses ~1-2 GB RAM per TB of
deduplicated data on the host
Dedup/compression ratios (typical for enterprise VMs):
- Windows VMs (similar OS installs): 1.5-2.5x dedup ratio
- Linux VMs (similar OS installs): 1.3-2.0x dedup ratio
- Database volumes: 1.0-1.3x (already unique data, poor dedup)
- VDI desktops: 2.0-5.0x (many identical OS images)
- Overall cluster average: 1.5-2.5x combined savings
Important limitations:
- Dedup+compression is all-or-nothing (cannot enable separately)
- Applies per disk group, not per VM or per VMDK
- Requires all-flash configuration (not available in hybrid)
- Cannot be enabled on existing data retroactively without
a full data evacuation and re-ingestion
- Dedup table memory consumption scales with data volume
(plan ~1.5 GB RAM per TB of unique data)
TRIM/UNMAP Processing
vSAN TRIM/UNMAP Flow
=======================
Guest VM: fstrim /mountpoint (or continuous discard via mount -o discard)
|
v
Guest kernel: issues SCSI UNMAP to virtual disk
|
v
ESXi vmkernel: translates to vSAN UNMAP on the object
|
v
DOM: propagates UNMAP to all component replicas
|
v
LSOM (each host with a component):
- Marks blocks as free in allocation bitmap
- Freed blocks available for reuse immediately
- Physical space returned to disk group capacity pool
Automatic reclamation:
- vSAN 6.7+: automatic UNMAP processing (no manual trigger needed)
- Processing rate throttled to avoid impacting production I/O
- Reclamation is asynchronous -- space may not appear freed immediately
- Monitor via: vSAN capacity overview in vSphere Client
SSD TRIM passthrough:
- vSAN does NOT pass TRIM to the underlying SSD firmware
- SSD garbage collection handles wear leveling independently
- vSAN manages its own free space tracking at the VSAN-FS level
Overhead Calculations
Understanding the real usable capacity of a vSAN cluster requires accounting for multiple overhead layers.
vSAN Capacity Overhead Calculation
=====================================
Raw capacity example: 16 hosts x 6 SSDs x 3.84 TB = 368.64 TB raw
Subtract: Cache tier overhead
16 hosts x 2 disk groups x 1 NVMe cache device (1.6 TB each)
= 51.2 TB reserved for cache (NOT usable for data)
Remaining: 368.64 - 51.2 = 317.44 TB
Subtract: vSAN metadata overhead (~1-2% of capacity)
~3.17 - 6.35 TB
Remaining: ~311 - 314 TB
Subtract: FTT overhead
FTT=1 RAID-1: 314 / 2 = 157 TB usable
FTT=1 RAID-5: 314 / 1.33 = 236 TB usable
FTT=2 RAID-1: 314 / 3 = 105 TB usable
FTT=2 RAID-6: 314 / 1.5 = 209 TB usable
Subtract: Slack space (25-30% recommended free)
vSAN requires ~25-30% free capacity for:
- Rebuild headroom (if a host fails, data must be rebuilt elsewhere)
- Rebalancing operations
- Write buffer destage efficiency (degrades when capacity > 80%)
- Maintenance mode operations (data evacuation)
FTT=1 RAID-1 example: 157 TB x 0.70 = 110 TB truly usable
FTT=1 RAID-5 example: 236 TB x 0.70 = 165 TB truly usable
Subtract: Dedup/compression savings (adds capacity back):
If dedup+compression ratio = 2x:
FTT=1 RAID-1: 110 TB x 2 = 220 TB effective
FTT=1 RAID-5: 165 TB x 2 = 330 TB effective
Summary (16 hosts x 6 x 3.84 TB SSDs, 2 cache devices each):
+--------------------+--------+-------------+------------------+
| Configuration | Usable | After Slack | With 2x Dedup |
+--------------------+--------+-------------+------------------+
| FTT=1 RAID-1 | 157 TB | 110 TB | 220 TB effective |
| FTT=1 RAID-5 | 236 TB | 165 TB | 330 TB effective |
| FTT=2 RAID-1 | 105 TB | 73 TB | 147 TB effective |
| FTT=2 RAID-6 | 209 TB | 146 TB | 293 TB effective |
+--------------------+--------+-------------+------------------+
| Raw total: | 369 TB (including cache tier) |
| Efficiency (RAID-1)| 110/369 = 29.8% (without dedup) |
| Efficiency (RAID-5)| 165/369 = 44.7% (without dedup) |
+--------------------+--------+-------------+------------------+
Key insight: raw-to-usable efficiency is 30-45% for production
configurations. This is the number to compare against Ceph/ODF
and S2D, not the raw capacity number vendors quote.
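The waterfall above can be captured in a short function. A sketch that reproduces the FTT=1 RAID-1 row of the summary table; the metadata and slack percentages follow this section and the function is illustrative, not a VMware sizing formula:

```python
# Sketch of the capacity waterfall above, reproducing the FTT=1 RAID-1 row of
# the summary table. The metadata and slack percentages follow this section.

def capacity_waterfall(raw_tb, cache_tb, ftt_multiplier,
                       metadata_pct=0.015, slack_pct=0.30, data_reduction=1.0):
    after_cache = raw_tb - cache_tb                # cache tier is not usable capacity
    after_meta = after_cache * (1 - metadata_pct)  # vSAN metadata overhead
    usable = after_meta / ftt_multiplier           # FTT replication/parity overhead
    after_slack = usable * (1 - slack_pct)         # keep rebuild/rebalance headroom
    return {"usable": round(usable), "after_slack": round(after_slack),
            "effective": round(after_slack * data_reduction)}

# 16 hosts x 6 x 3.84 TB capacity SSDs, 2 x 1.6 TB cache devices per host
print(capacity_waterfall(raw_tb=368.64, cache_tb=51.2, ftt_multiplier=2.0,
                         data_reduction=2.0))
# -> roughly {'usable': 156, 'after_slack': 109, 'effective': 219},
#    within rounding of the 157 / 110 / 220 TB row above
```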
7. Failure Handling
Failure handling is the most critical area for a financial enterprise. Understanding exactly what happens when hardware fails -- and how long recovery takes -- determines whether the platform meets availability requirements.
Component States
vSAN Component State Machine
===============================
+----------+
| Active   | <-- Normal state, component healthy and accessible
+----------+
     |
     | (disk failure / host failure / network partition)
     v
vSAN classifies the failure:
     |
     +--> +----------+
     |    | Degraded | <-- Permanent failure detected (e.g., device reports
     |    +----------+     I/O errors). Object stays available via the
     |         |           remaining components; CLOM initiates the rebuild
     |         |           immediately (no wait timer).
     |         v
     |    (rebuild -- see below)
     |
     +--> +----------+
          | Absent   | <-- Transient or unknown failure (host down, device
          +----------+     removed). Object stays available via the
               |           remaining components.
               |
               | (60-minute repair delay timer starts)
               |
               +-------> Component returns within 60 min?
               |           YES -> delta resync, returns to Active
               |           NO  -> treated as failed, CLOM initiates rebuild
               v
          +-----------+
          | Rebuilding| <-- New component being created on another host/DG
          +-----------+
               |
               | (data copied from surviving replica)
               v
          +----------+
          | Active   | <-- New component online, object fully protected again
          +----------+
Failure Scenarios
Scenario 1: Single Disk Failure
==================================
Event: One capacity SSD fails in Host 03, Disk Group 1
Impact:
- All components on that specific device become Degraded
- Objects with replicas on other hosts/DGs remain accessible
- Objects lose one level of fault tolerance
(e.g., FTT=1 objects have 0 remaining tolerance until rebuild)
Response:
1. vSAN detects disk absence via SMART/health check (~seconds)
2. Components on the failed disk are marked Degraded (permanent failure),
so rebuild starts immediately -- the 60-min repair delay applies only
to Absent components (e.g., a host that may still come back)
3. CLOM calculates new placement for each affected component
4. Rebuild starts: data read from surviving replicas, written to
new component on a healthy disk group
5. Rebuild rate: ~100-200 MB/s per component (throttled to avoid
saturating network/disk bandwidth)
Timeline:
Detection: ~5-30 seconds
Rebuild start: ~1-5 minutes (CLOM calculation time)
Rebuild completion: depends on data volume
100 GB of components: ~10-20 minutes
1 TB of components: ~90-180 minutes
10 TB of components: ~15-30 hours
During rebuild:
- VM I/O continues normally (reads from surviving replicas)
- Write latency may increase ~10-20% (network/disk contention)
- A second failure that takes out a surviving replica during the rebuild
window can cause data loss for FTT=1 objects (both copies gone)
Scenario 2: Single Host Failure
===================================
Event: Host 03 powers off unexpectedly (hardware failure)
Impact:
- ALL components on ALL disk groups on Host 03 become Absent
- VMs running on Host 03: HA restarts them on other hosts
- VMs on other hosts with components on Host 03:
still accessible (read from other replicas)
Response:
1. CMMDS detects host absence via heartbeat (~5-15 seconds)
2. Components marked Absent (transient failure; object keeps quorum)
3. 60-MINUTE WAIT TIMER starts
Why? The host might be rebooting. If it comes back within 60 min,
only a delta resync is needed (much faster than full rebuild).
4. If host returns within 60 min:
- Components resync (only the delta written while the host was absent is copied)
- Minimal data movement
5. If host does NOT return within 60 min:
- Repair delay expires; components treated as permanently failed
- Full rebuild initiated (same as disk failure, but much more data)
Timeline:
Host with 30 TB of components:
Wait period: 60 minutes (configurable via advanced setting)
Rebuild start: 60 min + 1-5 min (CLOM calculation)
Rebuild complete: ~3-8 hours (30 TB at cluster rebuild bandwidth)
Total time at reduced redundancy: ~4-9 hours
The 60-minute timer is configurable:
VSAN.ClomRepairDelay (default: 60 minutes)
For financial workloads, consider reducing to 30 minutes
Trade-off: shorter delay = faster rebuild but more unnecessary
rebuilds for transient failures (e.g., host reboot for patching)
Scenario 3: Network Partition (Split-Brain)
=============================================
Event: Network switch failure partitions the cluster into two groups
Partition A: Host 01, Host 02 (2 hosts)
Partition B: Host 03, Host 04 (2 hosts)
Impact depends on component placement:
- Objects with components only in Partition A: accessible from A
- Objects with components only in Partition B: accessible from B
- Objects with components split across both:
QUORUM VOTE determines accessibility
Quorum example (FTT=1 RAID-1):
Object X: Data Component on Host 01, Data Component on Host 03,
Witness on Host 02
Partition A has: Data Component (1 vote) + Witness (1 vote) = 2 votes
Partition B has: Data Component (1 vote) = 1 vote
Quorum threshold: > 50% of votes = > 1.5 votes = need 2
Result: Object X accessible in Partition A, INACCESSIBLE in Partition B
If a VM on Host 03 needs Object X:
VM stalls (I/O hangs) until partition heals or VM is restarted
on a host in Partition A
Prevention:
- Redundant network paths (NIC teaming, dual switches)
- vSAN fault domains aligned with network failure domains
- Stretched cluster with witness in third failure domain
Resync and Rebuild Throttling
vSAN Rebuild Bandwidth Management
====================================
vSAN throttles rebuild/resync I/O to protect production workload
performance. This is a critical trade-off:
Too aggressive rebuild = production VMs experience latency spikes
Too conservative rebuild = cluster stays at reduced redundancy longer
Throttling parameters:
VSAN.ResyncThrottleThreshold Default: 70 (percentage)
If host I/O utilization > 70%, rebuild I/O is throttled
VSAN.ResyncIoSize Default: 256 KiB
I/O size for resync operations
VSAN.ResyncEtaPcnt (monitoring only)
Estimated time remaining for resync
Monitoring rebuild progress:
vSphere Client: Monitor > vSAN > Resyncing Components
Shows: bytes remaining, estimated time, resync reason
esxcli:
esxcli vsan debug resync summary
esxcli vsan debug object health summary
Rebuild bandwidth (typical, all-flash, 25 GbE):
Per-host rebuild read: 200-500 MB/s
Per-host rebuild write: 200-500 MB/s
Network impact: 5-20% of available bandwidth during rebuild
Production latency impact: 10-30% higher during active rebuild
Cluster-wide rebuild bandwidth: parallelized across all participating
hosts. A 30 TB rebuild across 15 remaining hosts:
Each host contributes ~2 TB of reads and ~2 TB of writes
At 300 MB/s per host: ~7,000 seconds = ~2 hours
With throttling (50% duty cycle): ~4 hours
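The parallel-rebuild arithmetic generalizes into a one-line estimate. A sketch using the per-host throughput figure from this section (a simplification that ignores ramp-up and contention):

```python
# Sketch of the rebuild-time arithmetic above: a failed host's data is rebuilt
# in parallel by the surviving hosts, each throttled to protect production I/O.

def rebuild_hours(data_tb, surviving_hosts, per_host_mb_s=300, duty_cycle=1.0):
    mb_per_host = data_tb * 1_000_000 / surviving_hosts   # MB each host must write
    return mb_per_host / (per_host_mb_s * duty_cycle) / 3600

print(round(rebuild_hours(30, 15), 1))                   # ~1.9 h unthrottled
print(round(rebuild_hours(30, 15, duty_cycle=0.5), 1))   # ~3.7 h at a 50% duty cycle
```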
Data Evacuation Modes
When a host enters maintenance mode (for patching, hardware replacement, etc.), vSAN must handle the data on that host. Three modes are available.
Maintenance Mode Data Evacuation Options
===========================================
Option 1: "Full data migration" (safest, slowest)
- ALL components on the host are rebuilt on other hosts
- Host can be down indefinitely
- Requires enough free capacity on remaining hosts
- Time: hours (proportional to data on host)
- Use when: hardware replacement, decommissioning host
Option 2: "Ensure accessibility" (default, balanced)
- Only components that would lose quorum are migrated
- Objects that still have quorum without this host: no action
- Much faster than full migration
- Risk: if a second host fails during maintenance, some objects
may become inaccessible
- Time: minutes to tens of minutes
- Use when: routine patching, short maintenance windows
Option 3: "No data migration" (fastest, riskiest)
- No data movement at all
- Components on the host become Degraded while host is in MM
- If another failure occurs, data loss possible
- Time: immediate
- Use when: emergency maintenance where speed is critical
- NOT recommended for production financial workloads
Capacity planning for maintenance:
To support "full data migration" for 1 host at a time:
Free capacity needed >= data on the largest host
In a 16-host cluster with 20 TB per host:
Need ~20 TB free -- the recommended 25-30% slack space covers this
To support "full data migration" for 2 hosts simultaneously
(rolling upgrade scenario):
Need ~40 TB free -- plan additional headroom beyond the standard 25-30% slack
This is why vSAN clusters should not exceed 70% utilization
8. Performance Characteristics
Cache Hit Ratios
vSAN Cache Behavior by Configuration
=======================================
All-Flash Configuration (standard for new deployments):
Cache tier = WRITE BUFFER ONLY
- No read cache (reads go directly to capacity flash)
- "Cache hit ratio" concept does not apply to reads
- Write buffer hit: if a read targets data still in the write
buffer (recently written), it is served from cache
Typical write buffer read hit rate: 5-15% (very workload dependent)
Hybrid Configuration (legacy, SSD cache + HDD capacity):
Cache tier = 70% READ CACHE + 30% WRITE BUFFER
- Read cache hit ratio determines HDD avoidance
- Target: > 90% read cache hit rate
- Below 90%: HDD I/O dominates, latency spikes to 5-15 ms
- Cache sizing rule: cache should be >= 10% of working set size
(not total data size -- working set is the frequently accessed portion)
Cache hit ratio monitoring:
vSAN Performance Service > Cache > "Read Cache Hit Rate"
esxcli vsan debug disk stats
Example:
Working set: 200 GB (active data across all VMs on this host)
Cache read area: 560 GB (70% of 800 GB cache device)
Cache-to-working-set ratio: 560/200 = 2.8x -> expect a 95%+ hit rate
If the working set grows to 800 GB:
Cache-to-working-set ratio drops: 560/800 = 0.7x -> ~70% hit rate
Result: 30% of reads hit HDD -> average latency jumps from
~500 us to ~5 ms -> VM performance degrades significantly
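The effect of a falling hit rate on average latency is simple weighted arithmetic. A sketch using the rough figures from this section:

```python
# Sketch of the hybrid read-latency math above: average latency is a weighted
# mix of cache hits and HDD misses. The 400 us and 10 ms figures are the rough
# values used in this section.

def avg_read_latency_ms(hit_rate, cache_hit_us=400, hdd_miss_ms=10.0):
    return hit_rate * cache_hit_us / 1000 + (1 - hit_rate) * hdd_miss_ms

for hit_rate in (0.95, 0.90, 0.70):
    print(f"hit rate {hit_rate:.0%}: ~{avg_read_latency_ms(hit_rate):.1f} ms average read")
# A modest drop in hit rate multiplies the average latency, which is why the
# >90% hit-rate target matters in hybrid configurations.
```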
Write Buffer Behavior
Write Buffer Performance Model
=================================
Write buffer capacity: determined by cache device size
All-flash: 100% of cache device = write buffer
Typical: 800 GB - 1.6 TB NVMe cache device
Write buffer fill dynamics:
- Incoming write rate: W MB/s (application writes to this host)
- Destage rate: D MB/s (background flush to capacity tier)
- Net fill rate: W - D MB/s
If W > D sustained: write buffer fills up
If W < D sustained: write buffer drains, steady state
Destage performance:
Sequential writes to capacity SSD: 500-2000 MB/s per device
With 6 capacity devices: 3-12 GB/s aggregate destage bandwidth
In practice, destage is throttled to ~30-50% of capacity device
bandwidth to avoid starving production reads
Write buffer states:
0-30% full: Normal operation, no concern
30-70% full: Normal, destage running at standard rate
70-80% full: Elevated destage rate, performance monitoring advised
80-90% full: WARNING: destage rate increased aggressively
90-95% full: CRITICAL: write latency increasing (backpressure)
95-100% full: EMERGENCY: writes stall, VMs experience I/O hangs
(this triggers vSAN alarm: "write buffer full")
Recovery: once destage catches up, write buffer drains and
performance normalizes. But the stall can last seconds to minutes,
causing application timeouts.
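The fill dynamics above can be turned into a quick "time until trouble" estimate. An illustrative sketch, assuming a constant burst rate and a constant effective destage rate:

```python
# Sketch of the fill dynamics above: net fill rate = incoming writes minus
# destage. Estimates how long a sustained burst can run before the buffer
# crosses the warning and stall thresholds. Purely illustrative arithmetic.

def seconds_until(target_pct, buffer_gb, current_pct, write_mb_s, destage_mb_s):
    net_mb_s = write_mb_s - destage_mb_s
    if net_mb_s <= 0:
        return float("inf")        # buffer drains; steady state
    headroom_mb = buffer_gb * 1024 * (target_pct - current_pct) / 100
    return headroom_mb / net_mb_s

# 800 GB write buffer at 30% full, 2.5 GB/s incoming vs. 1 GB/s effective destage
for target in (80, 95):
    t = seconds_until(target, buffer_gb=800, current_pct=30,
                      write_mb_s=2500, destage_mb_s=1000)
    print(f"time until {target}% full: ~{t / 60:.0f} minutes")
```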
Latency Profiles by Media Type
vSAN Latency Reference (approximate, per I/O operation)
==========================================================
4 KiB Random 4 KiB Random Sequential
Configuration Read Write Read (1 MiB)
-----------------------+-------------+-------------+--------------
All-NVMe (cache+cap) | | |
Local component | 100-200 us | 100-200 us | 100-200 us
Remote component | 200-400 us | 200-500 us | 200-400 us
With RDMA | 100-250 us | 150-350 us | 150-300 us
| | |
All-Flash (NVMe cache + | | |
SSD capacity) | | |
Local component | 150-300 us | 100-200 us* | 200-400 us
Remote component | 300-600 us | 200-500 us* | 300-600 us
| | |
Hybrid (SSD cache + | | |
HDD capacity) | | |
Cache hit (read) | 200-500 us | 100-200 us* | 200-500 us
Cache miss (HDD read) | 5-15 ms | 100-200 us* | 5-15 ms
Remote + HDD miss | 8-20 ms | 200-500 us* | 8-20 ms
* Write latency = cache write latency (writes always go to cache first)
Latency percentiles (all-flash, FTT=1 RAID-1, 25 GbE):
p50: 200 us
p95: 400 us
p99: 800 us
p99.9: 2-5 ms (tail latency from network/GC spikes)
p99.99: 5-20 ms (extreme outliers: GC, rebuild, destage contention)
Factors that increase latency:
1. Network congestion (vSAN traffic competes with vMotion or VM traffic)
2. Write buffer approaching full (destage contention)
3. Active rebuild/resync (I/O contention on capacity devices)
4. Dedup hash computation during destage (CPU contention)
5. SSD garbage collection cycles (firmware-level, ~100-500 us spikes)
6. Large stripe width (more hosts involved per I/O)
7. Erasure coding writes (read-modify-write penalty)
8. Cross-rack traffic (additional switch hops: +50-100 us)
Aggregate Cluster Performance
vSAN Cluster IOPS Estimation
===============================
Variables:
H = number of hosts
D = capacity devices per host
IOPSdev = IOPS per capacity device (SSD: ~50K, NVMe: ~500K)
FTT_factor = write amplification from replication
RAID-1 FTT=1: 2x writes
RAID-5 FTT=1: ~1.5-2x writes (depends on I/O size)
Read_pct = percentage of reads in workload
Example: 16 hosts, 6 NVMe capacity devices each, FTT=1 RAID-1
Total devices: 96
Raw device IOPS: 96 x 500,000 = 48,000,000 (theoretical max)
Reality check (70/30 read/write, 4K random):
Read IOPS capacity: 96 devices x 500K x 0.70 = 33.6M reads/s
Write IOPS capacity: 96 devices x 500K / 2 x 0.30 = 7.2M writes/s
Effective total: ~40M IOPS (theoretical, never achieved)
Practical achievable (with overhead, queueing, coordination):
Cluster aggregate: 2-5M IOPS (4K random mixed)
Per-host: 125-310K IOPS
Per-VM average: 400-1000 IOPS (5,000 VMs)
This aligns with VMware published benchmarks and customer reports.
The gap between theoretical and practical is due to:
- DOM/LSOM coordination overhead
- Network latency for remote components
- Write buffer destage contention
- CLOM/CMMDS metadata operations
- ESXi kernel scheduling overhead
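The estimation above can be packaged as a small function. The sketch below derives the device-level ceiling for a given read/write mix and FTT write amplification, then applies a stack-efficiency factor chosen so the result lands in the practical range quoted here -- that factor is an assumption, not a measured constant:

```python
# Sketch of the IOPS estimation above: device-level ceiling for a read/write
# mix and FTT write amplification, scaled by an assumed stack-efficiency factor.

def cluster_iops(hosts, devices_per_host, iops_per_device,
                 read_pct, write_amp, stack_efficiency=0.08):
    devices = hosts * devices_per_host
    raw = devices * iops_per_device
    # each application write consumes write_amp device writes
    theoretical = raw / (read_pct + (1 - read_pct) * write_amp)
    return {"theoretical_Miops": round(theoretical / 1e6, 1),
            "practical_Miops": round(theoretical * stack_efficiency / 1e6, 1)}

# 16 hosts x 6 NVMe devices, ~500K IOPS each, 70/30 read/write, FTT=1 RAID-1
print(cluster_iops(16, 6, 500_000, read_pct=0.70, write_amp=2.0))
# -> ~37M IOPS device ceiling, ~3M practical (consistent with the 2-5M range above)
```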
9. vSAN HCI Mesh and Disaggregated Storage
vSAN HCI Mesh (introduced in vSAN 7.0 U1) allows compute-only hosts to consume storage from storage-rich hosts in the same or a different vSAN cluster.
vSAN HCI Mesh Architecture
=============================
Standard HCI (every host has local storage):
+----------+ +----------+ +----------+ +----------+
| Host 01 | | Host 02 | | Host 03 | | Host 04 |
| Compute | | Compute | | Compute | | Compute |
| + Storage| | + Storage| | + Storage| | + Storage|
+----------+ +----------+ +----------+ +----------+
Limitation: compute and storage scale together (buy more CPUs
even if you only need more disk, or vice versa)
HCI Mesh (disaggregated):
Cluster A (Compute-heavy) Cluster B (Storage-heavy)
+----------+ +----------+ +----------+ +----------+
| Host 01 | | Host 02 | | Host 05 | | Host 06 |
| CPU: 128c| | CPU: 128c| | CPU: 32c | | CPU: 32c |
| Disk: 2TB| | Disk: 2TB| | Disk: 50T| | Disk: 50T|
+----------+ +----------+ +----------+ +----------+
| | | |
+------+-+---+---------+------+----+-------+
| | | |
vSAN Network vSAN Network
| |
Cluster A mounts Cluster B serves
Cluster B's datastore remote storage
Benefits:
- Scale compute independently from storage
- Dedicated storage nodes with dense disk configurations
- Compute nodes can be diskless (or minimal local storage)
- Better hardware utilization (right-size each node type)
Limitations:
- All remote I/O traverses the network (no data locality for
VMs on compute-only nodes)
- Latency for remote storage: 200-600 us (vs. 100-200 us local)
- Network bandwidth becomes the bottleneck
- Requires vSAN Enterprise or VCF licensing
- Maximum 2 remote datastores per cluster (vSAN 8.0)
Relevance to migration:
HCI Mesh is VMware's answer to the "we need more storage without
buying more compute" problem. Ceph/ODF handles this natively
(separate storage nodes are a first-class concept). S2D does NOT
support disaggregated storage -- all nodes must have local disks.
10. Monitoring and Troubleshooting
vSAN Health Checks
vSAN Health Service (built into vCenter)
===========================================
Category Key Checks
-------------------- ---------------------------------------------------
Cluster Cluster health state, CMMDS membership consistency,
time sync across hosts (NTP), stretched cluster
witness connectivity
Network vSAN vmknic configuration, MTU consistency (all hosts
must match), multicast/unicast connectivity, network
latency between hosts (threshold: <1 ms warning),
NIC speed consistency
Disk Disk health (SMART status), disk balance (capacity
distribution across hosts and disk groups), metadata
health, software state consistency
Object Health Objects with reduced redundancy, inaccessible objects,
compliance status (objects not matching their SPBM
policy), invalid/orphaned objects
Capacity Overall utilization, per-host utilization, thin
provisioning overcommit ratio, slack space sufficiency
Performance Write buffer utilization, congestion events per host,
latency outliers, throughput bottlenecks
Data Integrity Checksum errors (silent data corruption detection),
component CRC mismatches
Critical health checks for financial operations:
- "Objects with reduced availability": any object with FTT=0 in degraded
state = one failure away from data loss
- "vSAN cluster partition": indicates network issue causing split-brain
- "Component metadata consistency": detects CMMDS/DOM desynchronization
- "Write buffer full": indicates imminent write stall
Performance Service
vSAN Performance Service
==========================
Enabled by default in vSAN 7.0+. Stores performance metrics in a
vSAN object on the cluster itself (self-contained, no external DB).
Key metrics available:
Per-VM:
- IOPS (read/write)
- Throughput (read/write MB/s)
- Latency (read/write, average and p95)
- Outstanding I/O (queue depth)
Per-Host:
- Backend IOPS (vSAN stack to disk)
- Congestion value (0-255, higher = more contention)
- CPU utilization for vSAN kernel threads
- Write buffer fill percentage
- Destage rate (MB/s)
Per-Disk-Group:
- Device latency (read/write per individual disk)
- IOPS per capacity device
- Cache tier hit rates (hybrid only)
Per-Cluster:
- Aggregate IOPS/throughput/latency
- Resync bandwidth and progress
- Dedup/compression savings ratio
- Component count and distribution
Retention: 90 days by default (configurable)
Overhead: ~1-2% CPU and ~50 GB storage for metric database
Export options:
- vRealize Operations Manager (now Aria Operations)
- Syslog/SNMP for alerting
- vSAN API (PowerCLI: Get-VsanStat, Get-VsanDisk)
- No native Prometheus endpoint (third-party exporters exist)
esxcli vsan Namespace
Essential esxcli vsan Commands
=================================
# Cluster status and membership
esxcli vsan cluster get
Output: Sub-cluster Master UUID, Local Node UUID, Node Count
# Health summary
esxcli vsan health cluster list
Output: test results per health check category
# Disk information
esxcli vsan storage list
Output: all claimed disks, group, tier (cache/capacity), state
# Debug: object placement
esxcli vsan debug object list
Output: object UUID, components, placement hosts, health state
# Debug: resync status
esxcli vsan debug resync summary
Output: bytes remaining to resync, ETA, resync reason
# Debug: disk stats
esxcli vsan debug disk stats
Output: per-disk IOPS, latency, congestion, write buffer stats
# Debug: component health
esxcli vsan debug object health summary
Output: count of healthy, degraded, absent, inaccessible objects
# Network diagnostics
esxcli vsan network list
Output: vSAN vmknic configuration, traffic type (vSAN, witness)
# Advanced: LSOM internals
esxcli vsan debug disk info --disk <uuid>
Output: write buffer fill %, destage stats, component count
# Performance diagnostics
vscsiStats (command-line tool, separate from esxcli)
- Captures per-VMDK I/O histograms (latency, IOPS, block size)
- Essential for identifying individual VM performance issues
- Syntax: vscsiStats -s -w <world_id>
11. Licensing and Broadcom Impact
Historical vSAN Licensing
vSAN Licensing History
========================
Pre-Broadcom (VMware era):
vSAN Standard: Basic features (RAID-1, 5 disk groups, no stretch)
vSAN Advanced: + dedup/compression, RAID-5/6, stretch cluster
vSAN Enterprise: + encryption, HCI Mesh, file services
vSAN Enterprise+: + data persistence platform (cloud-native storage)
Licensed per CPU socket, perpetual or subscription
Typical cost: $2,500-$6,500 per socket (list, before EA discounts)
Post-Broadcom (2024+):
VMware Cloud Foundation (VCF):
- vSAN is ONLY available as part of VCF bundle
- No standalone vSAN licensing for new purchases
- VCF includes: ESXi + vCenter + vSAN + NSX + Aria
- Licensed per core (not per socket)
- Significant price increase for most customers
- Minimum 16 cores per CPU counted
VCF Pricing impact:
Pre-Broadcom: ~$5,000 per socket (2 sockets per host)
16-host cluster: ~$160,000 for vSAN licensing
Post-Broadcom: ~$400-$600 per core (estimated, varies by EA)
16-host cluster, 2 x 32-core CPUs per host:
16 x 64 cores x $500 = ~$512,000 for VCF licensing
(includes all components, not just vSAN)
Impact on migration decision:
- vSAN can no longer be purchased standalone
- Must buy entire VCF stack even if only vSAN is needed
- Per-core licensing penalizes high-core-count servers
- Existing perpetual licenses honored (for now) but no new purchases
- Support renewal costs increasing significantly
- This licensing change is the PRIMARY driver for many organizations
evaluating alternative platforms
Licensing Comparison Context
Cost Comparison Framework (approximate, 16 hosts)
====================================================
| Platform | Licensing Model | Estimated Annual Cost |
|---|---|---|
| VMware VCF (vSAN incl., bundled with ESXi, NSX, Aria, etc.) | Per core | $300K-$600K (depends on core count and EA negotiation) |
| OVE / OpenShift (OpenShift subscription includes ODF) | Per core | $150K-$350K (includes ODF/Ceph, OpenShift platform) |
| Azure Local | Per core (if AKS) or per physical core (Azure sub) | $100K-$300K (Azure Stack HCI sub + Windows Server) |
| Swisscom ESC | Per VM or resource unit (managed service pricing) | Contract-dependent (includes hardware, operations, SLA) |
Note: These are rough estimates for a 16-host, ~1000-core environment.
Actual pricing depends heavily on enterprise agreement negotiations,
existing licensing investments, and specific configuration choices.
The point is not the exact numbers but the structural shift:
vSAN-on-VCF is the most expensive option due to bundling
OVE/OpenShift includes storage (ODF) in the platform subscription
Azure Local has the lowest licensing cost but may need more nodes
ESC amortizes all costs into a managed service fee
What to Preserve vs. What to Leave Behind
Must Preserve (Essential Capabilities)
| vSAN Capability | Why Essential | Conceptual Replacement |
|---|---|---|
| Policy-based provisioning (SPBM) | Operators define intent ("FTT=1"), system handles placement. Scales to 5,000 VMs without manual per-VM storage config. | Kubernetes StorageClasses (OVE), S2D volume policies (Azure Local), service tiers (ESC) |
| Automatic component rebalancing | When hosts are added/removed, data redistributes without manual intervention. | Ceph CRUSH reweight/rebalance (OVE), S2D automatic rebalance (Azure Local), provider-managed (ESC) |
| Transparent failure recovery | Disk/host fails, data rebuilds automatically on surviving hardware. No operator action needed for normal failures. | Ceph self-healing via PG recovery (OVE), S2D mirror/parity rebuild (Azure Local), provider SLA (ESC) |
| End-to-end checksums | Silent data corruption detected at the storage layer before it reaches the VM. | Ceph BlueStore checksums (OVE), ReFS integrity streams (Azure Local), array-level checksums (ESC) |
| Thin provisioning with UNMAP | Capacity overcommit + automatic space reclaim when VMs delete data. | Ceph RBD thin + DISCARD (OVE), S2D thin + TRIM (Azure Local), provider-managed (ESC) |
| Integrated health monitoring | Single-pane health dashboard showing disk health, object compliance, network status, capacity. | Ceph Dashboard / ODF console (OVE), Windows Admin Center (Azure Local), provider portal (ESC) |
Nice to Have (Valuable but Not Blocking)
| vSAN Capability | Why Nice | Alternative Approach |
|---|---|---|
| Dedup + compression (inline) | Saves 1.5-2.5x capacity on mixed VM workloads. | Ceph BlueStore compression (OVE -- no dedup), ReFS dedup (limited in S2D), provider-managed (ESC) |
| HCI Mesh (disaggregated compute/storage) | Scale compute independently from storage. | Native in Ceph/ODF (separate storage nodes). Not available in S2D (all nodes must have disks). |
| Erasure coding (RAID-5/6) | Significant capacity savings for read-heavy or warm/cold data. | Ceph EC pools (OVE), S2D parity volumes (Azure Local) |
| Per-object IOPS limits | Basic QoS to cap noisy neighbor VMs. | Ceph rbd_qos (OVE), S2D Storage QoS policies (Azure Local) |
| vSAN file services | NFS shares served directly from the vSAN datastore. | CephFS (OVE), SMB shares on S2D (Azure Local), NFS managed service (ESC) |
Leave Behind (VMware-Specific, Do Not Replicate)
| vSAN Mechanism | Why Leave Behind | What Replaces It |
|---|---|---|
| CLOM (Cluster-Level Object Manager) | Tightly coupled to ESXi/vCenter. CLOM's placement logic is an implementation detail, not a portable concept. | Ceph CRUSH maps (OVE), S2D storage bus layer (Azure Local) each have their own, more transparent, placement algorithms. |
| DOM object tree (component/witness structure) | The specific decomposition of a VMDK into components with quorum voting is vSAN-specific. | Ceph PG + CRUSH (OVE), S2D mirror/parity at the volume level (Azure Local). Different decomposition models, same outcome. |
| CMMDS (cluster metadata gossip) | vSAN's internal metadata directory. Replaced by etcd (OVE), S2D metadata service (Azure Local). | No action needed -- each platform has its own cluster state management. |
| RDT (Reliable Datagram Transport) | vSAN's custom inter-host transport (runs over TCP). Not portable. | Ceph msgr v2 over TCP/RDMA (OVE), SMB Direct / RDMA (Azure Local). Standard transports. |
| Disk group concept (1 cache + N capacity) | The rigid cache-tier-per-disk-group model is a vSAN design constraint. | Ceph: WAL/DB on NVMe, OSD on capacity -- more flexible mapping. S2D: automatic tiering across all devices in the pool -- no disk group concept. |
| vSAN Performance Service (proprietary metrics) | Locked to vCenter. | Replace with Prometheus + Grafana (OVE/Ceph), Azure Monitor + Windows Admin Center (Azure Local), provider monitoring (ESC). |
Key Takeaways
- vSAN is deeply embedded in the VMware stack. It is not a detachable storage layer -- it depends on ESXi, vCenter, SPBM, CMMDS, and the entire vSphere ecosystem. Migration means replacing the entire storage architecture, not just swapping out a backend.
- The write path is vSAN's performance-critical path. Writes go through DOM -> LSOM -> cache (write buffer) -> network -> remote LSOM -> remote cache. Write latency = max(local cache write, network + remote cache write). Any candidate must match or beat the 150-300 us write latency for all-flash local writes (see the latency sketch after this list).
- vSAN's 60-minute rebuild delay is a design choice, not a limitation. It optimizes for transient failures (host reboots) but means the cluster operates at reduced redundancy for at least 60 minutes after any host failure. Understand whether your team has modified this default and why.
- Capacity overhead is substantial. With FTT=1 RAID-1 and recommended slack space, only ~30% of raw capacity is usable. With dedup/compression, effective usable can reach ~60%. Every candidate has similar overhead -- the key is to compare using the same methodology (raw -> usable -> effective with data reduction); the sizing sketch after this list walks through the arithmetic.
- The cache tier is not "cache" in all-flash. In all-flash (the modern standard), the cache device is exclusively a write buffer. There is no read cache. This is counterintuitive and a common source of confusion. Understanding this is essential for sizing cache devices during migration planning.
- Component count is a scaling consideration. A 5,000-VM cluster can easily generate 50,000-100,000+ components. vSAN has per-host component limits (~9,000). This constrains how small a cluster can be for a given VM count; the sizing sketch after this list includes a component-count estimate. Ceph's PG model and S2D's volume model have analogous but different scaling constraints.
- The Broadcom licensing change is the forcing function. vSAN is no longer available standalone -- it is bundled into VCF at per-core pricing. For many enterprises, this licensing change alone justifies the migration evaluation. However, the technical migration complexity is the real risk, not the licensing cost.
- vSAN networking was historically fragile. Multicast dependencies (pre-7.0), MTU mismatches, and shared uplinks with VM traffic caused the majority of vSAN outages. Any replacement must have cleaner network requirements. All three candidates use unicast-only protocols.
- SPBM is the one concept that must survive in some form. The ability to define storage intent declaratively ("I want 2 failures tolerated with erasure coding") and have the system handle placement automatically is what makes vSAN manageable at scale. Kubernetes StorageClasses (OVE), S2D policies (Azure Local), and ESC service tiers are the equivalents -- but none offer the same granularity of per-VM policy attachment that SPBM provides.
- Monitor what you measure today. Before migrating, catalog every vSAN alarm, health check, and performance metric that your operations team actively monitors. Each one represents an operational dependency that must have an equivalent in the target platform. Missing a single critical alarm (e.g., "write buffer full" -> equivalent in Ceph: "OSD nearfull") can cause the first post-migration incident.
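To make the write-path takeaway concrete, here is a minimal back-of-envelope latency model. It is not vSAN code; the local, network, and remote figures are assumptions to be replaced with your own measurements. It simply expresses that a mirrored write is acknowledged only when the slower of the local and remote cache commits completes, and checks the result against the 150-300 us baseline quoted above.

```python
# Back-of-envelope write-latency model for a mirrored (FTT=1, RAID-1) write.
# All numbers are illustrative assumptions -- substitute measured values.

def write_latency_us(local_cache_us: float, network_rtt_us: float,
                     remote_cache_us: float) -> float:
    """Acknowledgement requires both replicas committed, so effective
    latency is the slower of the local and remote paths."""
    local_path = local_cache_us
    remote_path = network_rtt_us + remote_cache_us
    return max(local_path, remote_path)

if __name__ == "__main__":
    # Assumed figures: ~100 us NVMe cache commit, ~80 us RTT on 25 GbE.
    latency = write_latency_us(local_cache_us=100,
                               network_rtt_us=80,
                               remote_cache_us=100)
    target_low, target_high = 150, 300  # us, all-flash baseline from this document
    print(f"modelled write latency: {latency:.0f} us")
    print(f"within baseline window: {target_low <= latency <= target_high}")
```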
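The capacity-overhead and component-count takeaways can be sanity-checked with the same kind of arithmetic. The sketch below is illustrative only: the raw capacity, slack fraction, data-reduction ratio, host count, and components-per-VM figure are assumptions; only the FTT=1 RAID-1 factor of 2x and the ~9,000 per-host component limit come from this document.

```python
# Sizing arithmetic: raw -> usable -> effective capacity, plus a rough
# component-count check against the per-host limit. Assumed inputs.

RAW_TB = 500.0              # raw capacity across the cluster (assumption)
FTT1_RAID1_FACTOR = 2.0     # two full copies of every object
SLACK_FRACTION = 0.30       # recommended free space for rebuilds/rebalancing
DEDUP_COMPRESS_RATIO = 1.8  # assumed data-reduction ratio

usable_tb = RAW_TB / FTT1_RAID1_FACTOR * (1 - SLACK_FRACTION)
effective_tb = usable_tb * DEDUP_COMPRESS_RATIO
print(f"raw {RAW_TB:.0f} TB -> usable {usable_tb:.0f} TB "
      f"({usable_tb / RAW_TB:.0%} of raw) -> effective {effective_tb:.0f} TB")

# Component-count check (per-host limit ~9,000 per this document).
VM_COUNT = 5_000
COMPONENTS_PER_VM = 15      # assumption: objects x replicas x stripes + witnesses
HOSTS = 16
PER_HOST_LIMIT = 9_000

per_host = VM_COUNT * COMPONENTS_PER_VM / HOSTS
print(f"~{per_host:.0f} components per host "
      f"({'OK' if per_host < PER_HOST_LIMIT else 'over limit'})")
```

Run the same arithmetic against each candidate's own redundancy and placement model so the comparison uses identical inputs.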
Discussion Guide
These questions are for your own storage and infrastructure team. The goal is to surface the implicit knowledge about how vSAN operates in your specific environment -- knowledge that may not be documented but is critical for migration planning.
Current State Understanding
- What is our current vSAN version and on-disk format version? Are we on vSAN 7.x or 8.x? Have we upgraded the on-disk format to v15+? Are there clusters still on v2/v5 that would need format migration before any platform migration?
- What SPBM policies are actively in use? List every storage policy applied to production VMs. What FTT levels, failure tolerance methods (RAID-1 vs. RAID-5/6), stripe widths, and object space reservations are configured? How many distinct policies exist? Are any VMs using "force provisioning" to bypass policy compliance?
- What is our actual capacity utilization? Not the vCenter summary, but the real breakdown: raw capacity, cache overhead, FTT overhead, metadata overhead, slack space, dedup/compression ratio, and final usable capacity. What is our physical utilization percentage right now? How fast is it growing (monthly trend)?
- Have we ever experienced a write buffer full event? If yes, what caused it and how long did the I/O stall last? Which VMs were affected? What was the business impact? This scenario will exist in some form on any replacement platform -- how did we handle it?
- What is our measured rebuild time for a host failure? Not the theoretical calculation, but the actual observed time from host failure to full redundancy restoration. Has this been tested recently, or are we relying on assumptions? (A rough estimation model follows this list.)
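As a starting point for the rebuild-time question, a rough model is simply the consumed data on the failed host divided by the sustained resync throughput, plus the repair delay before rebuild begins. The figures below are assumptions purely to frame the discussion; the real answer should come from an observed failure or a deliberate test, not this sketch.

```python
# Rough rebuild-time estimate after a host failure. Assumed inputs only --
# compare against an actually observed rebuild, not this model.

data_on_failed_host_tb = 30.0   # consumed capacity on the failed host (assumption)
resync_throughput_gbps = 8.0    # sustained cluster-wide resync rate, Gbit/s (assumption)
repair_delay_min = 60           # default repair delay before rebuild starts

# 1 TB ~= 8,000 Gbit (decimal units), so hours = Gbit / Gbit-per-second / 3600.
rebuild_hours = (data_on_failed_host_tb * 8_000) / resync_throughput_gbps / 3600
total_hours = repair_delay_min / 60 + rebuild_hours
print(f"time at reduced redundancy: ~{total_hours:.1f} h "
      f"({repair_delay_min} min delay + {rebuild_hours:.1f} h resync)")
```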
Operational Dependencies
- Which vSAN health checks trigger automated remediation or paging? Walk through the alerting chain: vSAN health check -> vCenter alarm -> monitoring system (Aria, Zabbix, PagerDuty, etc.) -> on-call page. Which specific checks are wired to P1/P2 pages? These must be replicated in the target platform.
- Do we use vSAN stretched clusters or 2-node clusters? If yes, where are the witness nodes? What is the inter-site latency? What is the RPO/RTO for site failure? Stretched cluster migration is significantly more complex than single-site.
- How do we handle maintenance mode today? Which evacuation mode is used for patching (full data migration, ensure accessibility, or no migration)? How long does a typical host maintenance window take? This directly impacts the patching cadence on any replacement platform.
- Are there VMs with specific data locality requirements? Are any VMs pinned to specific hosts for licensing or compliance reasons? Does the storage team manually influence component placement, or is it fully automatic?
- What custom vSAN advanced settings have been modified? Check `VSAN.ClomRepairDelay`, resync throttle settings, and any other non-default advanced parameters. Each one represents a tuning decision that encodes operational knowledge about our specific environment. (A drift-check sketch follows this list.)
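One way to answer the advanced-settings question systematically is to export the vSAN-related advanced settings from each host (PowerCLI `Get-AdvancedSetting` or `esxcfg-advcfg` can produce such an export) and diff them against known defaults. The sketch below assumes a hypothetical exported dict; only the `VSAN.ClomRepairDelay` default of 60 minutes is taken from this document, and the host names are placeholders.

```python
# Flag vSAN advanced settings that deviate from defaults.
# The exported settings dict below is hypothetical -- populate it from your hosts.

KNOWN_DEFAULTS = {
    "VSAN.ClomRepairDelay": "60",   # minutes; default per this document
    # add further defaults your team tracks here
}

exported = {                         # hypothetical per-host export
    "esx01": {"VSAN.ClomRepairDelay": "120"},
    "esx02": {"VSAN.ClomRepairDelay": "60"},
}

for host, settings in exported.items():
    for name, default in KNOWN_DEFAULTS.items():
        value = settings.get(name)
        if value is not None and value != default:
            print(f"{host}: {name} = {value} (default {default}) -- document why")
```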
Performance Baseline
- Can we export 30 days of vSAN Performance Service data? We need per-VM IOPS, latency, and throughput histograms to establish the acceptance criteria for the PoC. Without this data, we cannot objectively compare the candidates against the current baseline.
- Which VMs are our top 50 storage consumers? Identify the VMs that generate the most IOPS, the most throughput, and the most write load. These are the VMs that will stress-test any replacement platform. They should be the first candidates for PoC migration. (A ranking sketch follows this list.)
- What is our peak I/O period? Is it end-of-month batch processing, morning login storms, overnight backups? The replacement must handle peak load without degradation, not just average load.
- Have we measured end-to-end I/O latency from inside the guest VMs? vSAN metrics show storage-level latency, but application-perceived latency includes guest kernel, virtual storage driver, and hypervisor overhead. Run `fio` inside representative VMs to establish the guest-perceived baseline.
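Ranking the top consumers is straightforward once per-VM metrics have been exported, for example as CSV from the Performance Service or the monitoring stack. The file name and column names in the sketch below are assumptions about such an export, not a documented format.

```python
# Rank VMs by write IOPS from an exported per-VM metrics CSV.
# The file name and columns ("vm", "read_iops", "write_iops", "throughput_mbps")
# are assumptions about the export format.
import csv

with open("vsan_vm_metrics_30d.csv", newline="") as f:
    rows = list(csv.DictReader(f))

top_writers = sorted(rows, key=lambda r: float(r["write_iops"]), reverse=True)[:50]
for rank, row in enumerate(top_writers, start=1):
    print(f"{rank:>2}. {row['vm']:<30} {float(row['write_iops']):>10.0f} write IOPS")
```

The same ranking repeated for read IOPS and throughput gives the shortlist of VMs to migrate first in the PoC.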
Pain Points and Known Issues
- What are the top three vSAN-related incidents in the last 12 months? What was the root cause, impact, and resolution for each? These incidents reveal the operational weaknesses of the current platform -- and may or may not exist on the replacement.
- What vSAN operations do we avoid or defer because they are risky? Examples: firmware upgrades on cache devices, disk group reconfiguration, enabling dedup on existing clusters. These operational pain points should be compared against the operational model of each candidate.
- How confident are we in our vSAN capacity forecasting? Have we ever been surprised by capacity growth? Do we account for dedup/compression ratio changes as workload mix evolves? A platform migration is an opportunity to reset capacity planning methodology.
- What is our experience with vSAN upgrades? How disruptive are rolling upgrades? How long does a cluster-wide upgrade take? What has broken during upgrades? This experience calibrates expectations for Day-2 lifecycle management on any platform.
Migration-Specific
- What is the total data footprint to migrate? Not virtual provisioned size, but actual consumed data (after thin provisioning, before dedup). This determines the migration timeline and network bandwidth requirements. (An estimation sketch follows this list.)
- Do we have any vSAN-specific integrations? Examples: backup solutions that use vSAN snapshot APIs, monitoring tools that query the vSAN Performance Service API, automation scripts that use PowerCLI `Get-VsanDisk` cmdlets. Each integration is a migration dependency.
- What is our appetite for "big bang" vs. gradual migration? Can we run the old and new platforms in parallel during migration, or must we cut over? vSAN-to-Ceph data migration typically requires V2V conversion (VMDK to QCOW2/raw) -- there is no live migration path between fundamentally different hypervisors.
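To frame the footprint and cutover discussion, the sketch below estimates bulk transfer time from consumed data and the bandwidth reserved for migration, and adds an assumed per-VM overhead for the V2V conversion step (e.g. running `qemu-img convert -f vmdk -O qcow2 ...`). Every input is a planning assumption, not a measurement; real plans overlap the phases and migrate in waves.

```python
# Rough migration-window estimate. All inputs are planning assumptions.

consumed_data_tb = 400.0         # actual consumed data (thin-provisioned, pre-dedup)
migration_bandwidth_gbps = 10.0  # sustained bandwidth reserved for migration, Gbit/s
vm_count = 5_000
per_vm_conversion_min = 5.0      # assumed average V2V time, e.g. running
                                 # "qemu-img convert -p -f vmdk -O qcow2 src.vmdk dst.qcow2"
parallel_conversions = 20        # conversion jobs running concurrently

# 1 TB ~= 8,000 Gbit (decimal units).
transfer_hours = (consumed_data_tb * 8_000) / migration_bandwidth_gbps / 3600
conversion_hours = vm_count * per_vm_conversion_min / 60 / parallel_conversions
print(f"bulk transfer:  ~{transfer_hours:.0f} h at {migration_bandwidth_gbps} Gbit/s")
print(f"V2V conversion: ~{conversion_hours:.0f} h with {parallel_conversions} parallel jobs")
print(f"naive total:    ~{transfer_hours + conversion_hours:.0f} h "
      "(serialized; overlap and wave-based migration reduce this)")
```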
Next: 03-storage-protocols.md -- Storage Protocols (iSCSI, NVMe-oF, Fibre Channel, NFS, SMB)