Storage Foundational Concepts
Why This Matters
Every VM running on our platform ultimately reads and writes data. The storage subsystem determines whether a database query returns in 1 ms or 100 ms, whether a failed disk causes an outage or is silently absorbed, and whether we can run 5,000+ VMs without hitting a capacity or performance cliff. Understanding the foundational concepts below is essential for three reasons:
- Evaluation accuracy. Each candidate (OVE, Azure Local, Swisscom ESC) makes different storage architecture decisions. Without understanding what block storage actually is, how RAID protects data, or what IOPS really measures, we cannot critically evaluate vendor claims or compare them against our current VMware/vSAN baseline.
- Day-2 operations. When a VM experiences slow I/O at 2 AM, the on-call engineer needs to know whether the bottleneck is at the LVM layer, the RAID rebuild, the thin provisioning overcommit, or the storage tier placement. These concepts are the diagnostic vocabulary.
- Architecture decisions. Choosing between HCI storage (ODF/Ceph, S2D) and traditional SAN, sizing disk pools, setting QoS policies, and planning capacity all depend on a solid grasp of these fundamentals.
This page covers the six building blocks that every other storage topic in this study rests on. Master these, and the platform-specific topics (vSAN, ODF, S2D) in later pages will make immediate sense.
Concepts
1. Block vs File vs Object Storage
What It Is and Why It Exists
Storage systems present data to consumers through one of three fundamental interfaces. The choice of interface determines how applications address data, what protocols carry the traffic, and what performance and scalability characteristics are possible.
| Dimension | Block | File | Object |
|---|---|---|---|
| Abstraction | Raw disk (LBA addresses) | Hierarchical filesystem (paths) | Flat namespace (bucket + key) |
| Protocol examples | iSCSI, FC, NVMe-oF, virtio-blk | NFS, SMB/CIFS, CephFS | S3, Swift, HTTP REST |
| Typical latency | Sub-millisecond to low ms | Low ms to mid ms | Mid ms to high ms |
| Use case | Databases, VM boot disks, OLTP | Shared home dirs, media, config | Backups, logs, archives, artifacts |
| Metadata model | None (raw bytes at offsets) | POSIX (owner, perms, timestamps) | Custom key-value pairs per object |
| Filesystem | Consumer creates it (ext4, XFS, NTFS) | Server provides it | None (flat key-value) |
How It Works
Block storage exposes a linear array of fixed-size blocks (typically 512 bytes or 4 KiB sectors) identified by Logical Block Addresses (LBAs). The consumer (a kernel filesystem driver, a database engine with raw I/O) issues read/write commands to specific LBAs. There is no notion of "files" or "directories" at this layer. The kernel's block layer (struct bio, struct request) manages I/O scheduling, merging, and dispatch.
Application (e.g., PostgreSQL)
|
v
Filesystem (ext4 / XFS / NTFS)
| translates file offsets to LBAs
v
Block Layer (Linux: bio -> request -> dispatch)
|
v
Device Driver (virtio-blk, SCSI, NVMe)
|
v
Storage Target (local disk / SAN LUN / Ceph RBD)
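The "linear array of blocks" model can be made concrete with positional reads and writes. A minimal sketch, using a regular file as a stand-in for a block device (a real device like /dev/sdb would need root privileges and, for O_DIRECT, aligned buffers):

```python
import os

SECTOR = 512  # classic logical block size; modern disks often use 4 KiB

# Stand-in for a block device: a plain file. On real hardware you would
# open("/dev/sdX", os.O_RDWR) instead.
fd = os.open("/tmp/fake_disk.img", os.O_RDWR | os.O_CREAT, 0o600)
os.ftruncate(fd, 1024 * SECTOR)  # a 1024-sector "disk"

def write_lba(lba: int, data: bytes) -> None:
    """Write one sector at a given Logical Block Address."""
    assert len(data) == SECTOR
    os.pwrite(fd, data, lba * SECTOR)

def read_lba(lba: int) -> bytes:
    """Read one sector at a given Logical Block Address."""
    return os.pread(fd, SECTOR, lba * SECTOR)

# There are no files or directories at this layer -- only numbered blocks.
write_lba(42, b"\xAB" * SECTOR)
assert read_lba(42) == b"\xAB" * SECTOR
os.close(fd)
```

Everything above this interface (filenames, permissions, directories) is the filesystem's invention; the device only ever sees (LBA, bytes) pairs.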
File storage adds a POSIX-compatible layer on top. A file server (NFS daemon, Samba) or a distributed filesystem (CephFS, GlusterFS) manages the inode table, directory tree, permissions, and locks. The client mounts the filesystem over the network and interacts via standard open(), read(), write(), stat() syscalls. The wire protocol handles serialization, caching, and delegation.
Application
|
v
VFS (Virtual Filesystem Switch)
|
+---> Local FS (ext4, XFS) -> Block Layer -> Disk
|
+---> NFS Client (nfs4_proc_*) -> RPC/XDR -> Network -> NFS Server
|
+---> SMB Client (cifs.ko) -> SMB3 -> Network -> Samba / Windows
Object storage abandons the filesystem hierarchy entirely. Data is stored as immutable objects in flat namespaces (buckets). Each object has a unique key, a blob of data, and arbitrary metadata. Access is via HTTP REST APIs (PUT, GET, DELETE). There is no seek(), no partial update, no directory listing in the traditional sense. This model scales to exabytes because there is no global inode table to bottleneck.
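The flat-namespace semantics can be mimicked in a few lines. A toy sketch (an in-memory dict standing in for an S3-style store, with whole-object PUT/GET/DELETE and per-object metadata; no seek, no partial update):

```python
class Bucket:
    """Toy object store: flat namespace, whole-object operations only."""
    def __init__(self, name: str):
        self.name = name
        self._objects = {}  # key -> (blob, metadata)

    def put(self, key: str, data: bytes, metadata=None) -> None:
        # PUT replaces the entire object; there is no partial update.
        self._objects[key] = (data, metadata or {})

    def get(self, key: str) -> bytes:
        return self._objects[key][0]

    def delete(self, key: str) -> None:
        self._objects.pop(key, None)

    def list_keys(self, prefix: str = ""):
        # "Directories" are just a naming convention on keys, not hierarchy.
        return sorted(k for k in self._objects if k.startswith(prefix))

b = Bucket("backup-bucket")
b.put("db/2025-01-15/pg_dump.sql.gz", b"...", {"retention": "10y"})
b.put("db/2025-01-16/pg_dump.sql.gz", b"...")
assert b.list_keys("db/2025-01-15/") == ["db/2025-01-15/pg_dump.sql.gz"]
```

Because there is no shared inode table or directory tree to coordinate, each key can be placed independently across nodes, which is exactly what allows the model to scale out.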
Why This Matters for 5,000+ VMs
- VM boot disks and database volumes require block storage -- low latency, direct LBA access, no protocol overhead beyond the block layer.
- Shared configuration repositories, user home directories, and media files use file storage -- multiple VMs mount the same NFS/SMB export.
- Backups, log archives, VM image templates, and compliance data naturally fit object storage -- write once, read rarely, scale massively.
Every candidate must provide all three. The question is how they provide them and what performance trade-offs they make.
Multi-Protocol Access in Virtualized Environments
A single VM may consume all three storage types simultaneously. Understanding how these coexist is critical for architecture planning:
Typical Enterprise VM: Database Server
========================================
VM: db-prod-01
|
+-- /dev/vda (boot disk) --> Block (Ceph RBD / S2D / SAN LUN)
| 100 GB, ext4, OS + binaries
|
+-- /dev/vdb (data disk) --> Block (Ceph RBD / S2D / SAN LUN)
| 500 GB, XFS, raw tablespace
| Requires: low latency, high IOPS
|
+-- /mnt/share (config mount) --> File (NFS v4 / SMB3 / CephFS)
| Shared config across DB cluster
| Requires: consistency, locking
|
+-- s3://backup-bucket --> Object (Ceph RGW / Azure Blob)
Nightly pg_dump uploads
Requires: durability, capacity, cost efficiency
The protocol choice per data type directly impacts performance, availability, and cost. A common mistake in IaaS evaluations is to focus exclusively on block storage (because VM disks are block devices) and neglect file and object storage requirements until post-migration, when teams discover their NFS workflows or backup pipelines do not have equivalent services on the new platform.
Consistency Semantics
The three models differ fundamentally in consistency guarantees, which matters for multi-VM or clustered applications:
| Model | Consistency | Implication |
|---|---|---|
| Block | Single-writer exclusive access (one VM per LUN/RBD image) | No concurrent-writer conflicts by design; multi-attach requires a cluster filesystem (GFS2, OCFS2) |
| File | POSIX semantics with locking (fcntl, flock) | Safe for concurrent multi-VM access; performance depends on lock contention |
| Object | Read-after-write for new objects; overwrites historically eventual (AWS S3 has been strongly consistent since late 2020) | Not suitable for real-time multi-writer coordination |
2. LVM (Logical Volume Management)
What It Is and Why It Exists
LVM is the Linux kernel subsystem (implemented via the device-mapper framework) that inserts a flexible abstraction layer between physical disks and filesystems. It solves a fundamental problem: physical disks have fixed sizes and fixed partition layouts, but workloads need volumes that can grow, shrink, move, snapshot, and stripe across multiple physical devices -- all without downtime.
In virtualized environments, LVM is doubly relevant: the hypervisor host often uses LVM to manage its own local storage, and guest VMs may use LVM internally for their own volumes. Storage backends like Ceph OSD daemons and S2D may also leverage LVM (or device-mapper directly) under the hood.
How It Works
LVM operates in three layers:
+------------------------------------------------------------------+
| Filesystem (ext4 / XFS) |
+------------------------------------------------------------------+
| Logical Volume (LV) |
| /dev/vg_data/lv_database (500 GB, thin) |
+------------------------------------------------------------------+
| Volume Group (VG) |
| vg_data = pv1 + pv2 + pv3 (pooled capacity) |
+------------------------------------------------------------------+
| Physical Volumes (PV) |
| /dev/sda2 /dev/sdb1 /dev/nvme0n1p1 |
| (500 GB HDD) (500 GB HDD) (1 TB NVMe) |
+------------------------------------------------------------------+
| Physical Disks / Partitions |
+------------------------------------------------------------------+
Physical Volumes (PVs): Any block device (whole disk, partition, RAID array, iSCSI LUN) is initialized with pvcreate, which writes an LVM metadata header and divides the device into Physical Extents (PEs), typically 4 MiB each.
Volume Groups (VGs): One or more PVs are combined into a VG with vgcreate. The VG is the pool from which Logical Volumes are allocated. PEs from different PVs become interchangeable within the VG.
Logical Volumes (LVs): Carved from a VG with lvcreate. An LV is a contiguous virtual block device backed by a set of PEs (which may be physically scattered across multiple disks). LVs appear as /dev/<vg_name>/<lv_name> and are formatted with a filesystem or used as raw block devices.
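The PE pooling described above can be sketched as a tiny allocator: PEs from all PVs form one pool, and an LV is just an ordered list of (PV, extent) pairs that may scatter across devices. Class and method names below are illustrative, not real LVM APIs:

```python
PE_SIZE_MIB = 4  # default LVM physical-extent size

class VolumeGroup:
    """Toy VG: pools physical extents (PEs) from several PVs."""
    def __init__(self, name: str, pvs):
        # pvs: device name -> size in MiB
        self.name = name
        self.free = [(dev, pe) for dev, size in pvs.items()
                     for pe in range(size // PE_SIZE_MIB)]
        self.lvs = {}  # lv name -> list of (device, extent) pairs

    def lvcreate(self, lv_name: str, size_mib: int) -> None:
        needed = size_mib // PE_SIZE_MIB
        if needed > len(self.free):
            raise RuntimeError("insufficient free extents in VG")
        # PEs are interchangeable: the LV may land on any mix of PVs.
        self.lvs[lv_name] = [self.free.pop(0) for _ in range(needed)]

    def lvextend(self, lv_name: str, extra_mib: int) -> None:
        # Online growth: append more PEs; no data movement required.
        self.lvs[lv_name] += [self.free.pop(0)
                              for _ in range(extra_mib // PE_SIZE_MIB)]

vg = VolumeGroup("vg_data", {"/dev/sda2": 512, "/dev/sdb1": 512})
vg.lvcreate("lv_database", 600)           # larger than either single PV
vg.lvextend("lv_database", 100)
assert len(vg.lvs["lv_database"]) == 175  # 700 MiB / 4 MiB per PE
assert {dev for dev, _ in vg.lvs["lv_database"]} == {"/dev/sda2", "/dev/sdb1"}
```

Note the LV deliberately spans both PVs: this is the core LVM property that decouples volume sizes from physical disk sizes.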
Kernel Internals: device-mapper
LVM is built on top of the Linux device-mapper (DM) subsystem (drivers/md/dm.c). Device-mapper creates virtual block devices by mapping I/O requests to underlying physical devices according to mapping tables. Key DM targets used by LVM:
| DM Target | Purpose | Used By |
|---|---|---|
| dm-linear | Maps a range of virtual sectors to a range on a physical device | Standard LVs |
| dm-striped | Stripes I/O across multiple devices (like RAID 0) | Striped LVs |
| dm-snapshot | Copy-on-write snapshots | LVM snapshots |
| dm-thin-pool | Thin provisioning with on-demand block allocation | Thin LVs |
| dm-cache | SSD caching for HDD-backed volumes | lvmcache |
| dm-crypt | Encryption (LUKS) | Encrypted LVs |
The I/O path for an LVM volume:
Application write()
|
v
Filesystem (ext4): maps file offset -> LBA on /dev/vg/lv
|
v
device-mapper: translates virtual LBA -> physical LBA on /dev/sdX
| (lookup in DM mapping table, O(1) for linear, O(log n) for thin)
v
Block layer: I/O scheduler (mq-deadline / none / bfq)
|
v
SCSI / NVMe driver -> physical disk
Key Operations
| Operation | Command | Impact |
|---|---|---|
| Extend LV online | lvextend -L +100G /dev/vg/lv && resize2fs /dev/vg/lv | Zero downtime, no data movement |
| Shrink LV | resize2fs /dev/vg/lv 400G && lvreduce -L 400G /dev/vg/lv | Requires unmount for ext4; XFS cannot shrink |
| Create snapshot | lvcreate -s -L 10G -n snap_lv /dev/vg/lv | COW snapshot; reads from origin until blocks are modified |
| Mirror LV | lvcreate --type mirror -m 1 -L 100G -n lv_mirror vg | Synchronous mirror across PVs |
| Move PV data | pvmove /dev/sda2 /dev/sdb1 | Online data migration between physical devices |
Thin Provisioning via LVM
LVM thin provisioning (dm-thin-pool) deserves special mention because it is the mechanism used by many storage backends to overcommit capacity:
+----------------------------------------------------+
| Thin Pool (VG: vg_data) |
| Actual allocated: 200 GB |
| Metadata device: /dev/vg_data/tp_meta (small) |
| Data device: /dev/vg_data/tp_data (1 TB) |
+----------------------------------------------------+
| | | |
thin-lv1 thin-lv2 thin-lv3 thin-lv4
(100 GB) (200 GB) (300 GB) (400 GB)
used: 50G used: 60G used: 40G used: 50G
Sum of virtual sizes: 1,000 GB
Actual physical usage: 200 GB
Overcommit ratio: 5:1
Blocks are allocated from the thin pool only when actually written. The metadata device tracks which blocks belong to which thin LV using a B-tree structure. This is critical: if the thin pool fills to 100%, all thin LVs freeze and data corruption can occur. Monitoring thin pool usage is a non-negotiable operational requirement.
Relationship to the Stack
- RAID sits below LVM: PVs can be RAID arrays (md devices or hardware RAID LUNs).
- Thin Provisioning is implemented as a DM target within LVM.
- Storage Tiering can be implemented via dm-cache (an LV that caches a slow HDD-backed LV on fast SSD).
- Ceph OSD uses LVM to manage its local BlueStore volumes (ceph-volume lvm commands).
- CSI drivers in Kubernetes dynamically create LVM thin LVs for PersistentVolumes (e.g., TopoLVM, LVM Operator).
3. RAID Levels
What It Is and Why It Exists
RAID (Redundant Array of Independent Disks) combines multiple physical disks into a single logical unit to achieve one or more of: (a) data redundancy (survive disk failures), (b) improved performance (parallelize I/O), and (c) increased capacity. RAID was invented to solve a simple problem: individual disks fail, and the larger the storage pool, the more frequently failures occur. At 5,000+ VMs, disk failures are not exceptional events -- they are routine operations that must be handled transparently.
RAID Levels in Detail
RAID 0 — Striping (No Redundancy)
===========================================
Disk 0: [A1] [A3] [A5] [A7]
Disk 1: [A2] [A4] [A6] [A8]
- Data striped across N disks in chunks (typically 64-512 KiB)
- Read/write performance: N x single disk (theoretical)
- Capacity: 100% (sum of all disks)
- Fault tolerance: NONE. One disk failure = total data loss
- Use case: Scratch space, temporary data, caches
RAID 1 — Mirroring
===========================================
Disk 0: [A1] [A2] [A3] [A4] (primary)
Disk 1: [A1] [A2] [A3] [A4] (mirror)
- Every write goes to both disks
- Read performance: up to 2x (reads can be split)
- Write performance: 1x (must write to both)
- Capacity: 50% (half lost to mirroring)
- Fault tolerance: 1 disk failure
- Use case: OS boot drives, small critical volumes
RAID 5 — Striping + Distributed Parity
===========================================
Disk 0: [A1] [B2] [C3] [Dp]
Disk 1: [A2] [B3] [Cp] [D1]
Disk 2: [A3] [Bp] [C1] [D2]
Disk 3: [Ap] [B1] [C2] [D3]
p = parity block (XOR of data blocks in the stripe)
- Parity is distributed across all disks (no single bottleneck)
- Read performance: (N-1) x single disk
- Write performance: reduced due to parity calculation
- Small random writes require read-modify-write:
1. Read old data block
2. Read old parity block
3. Compute new parity = old_parity XOR old_data XOR new_data
4. Write new data + new parity (= 4 I/O ops per write)
- Capacity: (N-1)/N (one disk worth lost to parity)
- Fault tolerance: 1 disk failure
- DANGER: Rebuild times on large disks (10+ TB) can exceed 24 hours,
during which a second failure causes total data loss (URE risk)
- Use case: General-purpose, read-heavy workloads
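The parity arithmetic in steps 1-4 above is plain XOR, and the same identity drives reconstruction after a failure. A minimal sketch:

```python
def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

# One RAID 5 stripe across 4 disks: 3 data blocks + 1 parity block.
d = [b"\x11" * 8, b"\x22" * 8, b"\x33" * 8]
parity = xor(xor(d[0], d[1]), d[2])

# Small-write path (read-modify-write): update d[1] without touching d[0], d[2].
new_d1 = b"\x99" * 8
parity = xor(xor(parity, d[1]), new_d1)  # new_p = old_p XOR old_data XOR new_data
d[1] = new_d1

# Disk 0 fails: its block is recoverable as the XOR of all survivors.
rebuilt = xor(xor(d[1], d[2]), parity)
assert rebuilt == d[0]
```

The read-modify-write shortcut is why a single logical write costs 4 physical I/Os, and the rebuild line is why recovery must read every surviving disk in the stripe.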
RAID 6 — Striping + Double Distributed Parity
===========================================
Disk 0: [A1] [B2] [C3] [Dp] [Eq]
Disk 1: [A2] [B3] [Cp] [Dq] [E1]
Disk 2: [A3] [Bp] [Cq] [D1] [E2]
Disk 3: [Ap] [Bq] [C1] [D2] [E3]
Disk 4: [Aq] [B1] [C2] [D3] [Ep]
p = P parity (XOR), q = Q parity (Reed-Solomon / GF(2^8))
- Two independent parity calculations per stripe
- Write penalty: even higher than RAID 5 (6 I/O ops per small write)
- Capacity: (N-2)/N
- Fault tolerance: 2 simultaneous disk failures
- Use case: Large arrays (>6 disks), compliance data, archive
RAID 10 — Mirror + Stripe (Nested)
===========================================
Stripe (RAID 0)
/ | \
Mirror Mirror Mirror
(RAID 1) (RAID 1) (RAID 1)
D0 D1 D2 D3 D4 D5
Disk 0: [A1] [A3] [A5] Disk 1: [A1] [A3] [A5] (mirror of D0)
Disk 2: [A2] [A4] [A6] Disk 3: [A2] [A4] [A6] (mirror of D2)
Disk 4: [A7] [A9] [A11] Disk 5: [A7] [A9] [A11] (mirror of D4)
- Stripe across mirrored pairs
- Read performance: up to N x single disk
- Write performance: (N/2) x single disk
- Capacity: 50%
- Fault tolerance: 1 disk per mirror pair (survives up to N/2 failures
if no two failures hit the same pair)
- Rebuild time: FAST (only mirror the failed disk from its pair,
not the entire array)
- Use case: Databases, OLTP, latency-sensitive workloads
This is the gold standard for enterprise storage performance.
RAID Performance Summary
| Level | Min Disks | Capacity | Read IOPS | Write IOPS | Write Penalty | Fault Tolerance | Rebuild Speed |
|---|---|---|---|---|---|---|---|
| RAID 0 | 2 | 100% | Nx | Nx | 1 | None | N/A |
| RAID 1 | 2 | 50% | 2x | 1x | 2 | 1 disk | Fast |
| RAID 5 | 3 | (N-1)/N | (N-1)x | ~(N-1)/4 x | 4 | 1 disk | Slow (full stripe read) |
| RAID 6 | 4 | (N-2)/N | (N-2)x | ~(N-2)/6 x | 6 | 2 disks | Very slow |
| RAID 10 | 4 | 50% | Nx | (N/2)x | 2 | 1 per pair | Fast (mirror only) |
Rebuild Risk: The Hidden Danger of Large Disks
Rebuild time is often overlooked but is critical for availability planning. When a disk fails, the RAID system must reconstruct the lost data from parity or mirror copies. During this rebuild:
- The array operates in a degraded state with reduced redundancy
- Rebuild I/O competes with production I/O, increasing latency by 20-50%
- A second disk failure during rebuild can cause data loss (RAID 5) or further degradation (RAID 6)
Rebuild time is proportional to disk size, not data size:
Rebuild Time Estimates (single disk failure)
=============================================
Disk Size RAID 5 Rebuild RAID 6 Rebuild RAID 10 (mirror)
--------- --------------- --------------- -----------------
1 TB 2-4 hours 2-4 hours 1-2 hours
4 TB 8-16 hours 8-16 hours 4-8 hours
10 TB 20-40 hours 20-40 hours 8-16 hours
16 TB 32-64 hours 32-64 hours 12-24 hours
20 TB 40-80 hours 40-80 hours 15-30 hours
Assumptions: 200 MB/s sustained rebuild rate with production I/O running.
Actual rates vary with workload intensity and controller/software capability.
DANGER ZONE: With 16+ TB disks (common in HDD-based capacity tiers),
RAID 5 rebuild exceeds 24 hours. The probability of a second disk
failure during a 48-hour rebuild window on a 12-disk array is
non-trivial (~1-5% depending on disk age and environment).
This is why:
- Enterprise storage has moved to RAID 6 minimum for HDDs
- RAID 10 is preferred for latency-sensitive workloads
- Distributed SDS (Ceph, S2D) rebuild faster because they
parallelize across ALL remaining nodes, not just the spare disk
Distributed storage systems have a major advantage here: when a node or disk fails, the rebuild is distributed across all remaining nodes in the cluster. A 30 TB node loss in a 100-node Ceph cluster means each of the 99 remaining nodes contributes ~300 GB of rebuild capacity, completing in minutes to hours rather than days. This is a key reason why HCI/SDS architectures are preferred at scale.
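The estimates above reduce to simple arithmetic: rebuild time scales with disk size over the effective rebuild rate, and distributed rebuild divides the work across survivors. A back-of-envelope sketch (the rates are assumptions in the spirit of the table, not measurements; ~100 MB/s models the slowdown under production I/O):

```python
def rebuild_hours(disk_tb: float, rate_mb_s: float) -> float:
    """Hours to re-stream an entire disk at a sustained rebuild rate."""
    return disk_tb * 1_000_000 / rate_mb_s / 3600  # TB -> MB, seconds -> hours

# Single-spare RAID rebuild: one target disk is the write bottleneck.
assert round(rebuild_hours(16, 100), 1) == 44.4   # 16 TB under production load

# Distributed SDS rebuild: a 30 TB node loss in a 100-node cluster is
# re-replicated in parallel by the 99 survivors (~300 GB each).
per_node_tb = 30 / 99
assert round(rebuild_hours(per_node_tb, 200), 2) == 0.42  # ~25 minutes
```

The two asserts capture the two orders of magnitude between spare-disk rebuild and cluster-wide parallel rebuild.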
Software RAID vs Hardware RAID
In modern storage stacks, software RAID (Linux md driver, or distributed software like Ceph's CRUSH + erasure coding) has largely replaced hardware RAID controllers:
| Aspect | Hardware RAID | Software RAID (md / Ceph) |
|---|---|---|
| Controller | Dedicated RAID card (LSI/Broadcom, Adaptec) | CPU-based, kernel module md |
| Battery/Capacitor | BBU/CacheVault for write cache | No hardware write cache (relies on disk cache/fsync) |
| Hot-swap | Firmware-managed | mdadm / Ceph-managed |
| Performance | HW write cache improves small writes | NVMe drives make HW cache irrelevant |
| Flexibility | Fixed to controller firmware | Full kernel control, scriptable |
| Ceph/ODF approach | Disks presented as individual JBODs (HBA mode) | Ceph manages replication/EC at the software layer |
Critical note for OVE/ODF: Ceph requires disks in HBA passthrough mode (JBOD), not behind a hardware RAID controller. Ceph handles data redundancy itself via replication (replica=3) or erasure coding (e.g., k=4, m=2, equivalent to RAID 6). Hardware RAID underneath Ceph would double the redundancy overhead without benefit and mask disk failures from Ceph's self-healing.
Critical note for Azure Local/S2D: Storage Spaces Direct similarly expects HBA passthrough mode. S2D implements its own mirroring (2-way, 3-way) and parity at the software layer. Hardware RAID underneath S2D is explicitly unsupported.
Erasure Coding: The Distributed RAID
In distributed storage systems (Ceph, S2D), erasure coding (EC) replaces traditional RAID parity:
Erasure Coding (k=4, m=2) — analogous to RAID 6
================================================
Original data: split into 4 data chunks (k)
Parity: 2 coding chunks computed (m)
OSD 0 OSD 1 OSD 2 OSD 3 OSD 4 OSD 5
[D0] [D1] [D2] [D3] [C0] [C1]
data data data data code code
- Survives any 2 OSD/node failures
- Storage overhead: (k+m)/k = 6/4 = 1.5x (vs 3x for replica=3)
- CPU cost: Reed-Solomon encode/decode on every write/recovery
- Read: any k of k+m chunks suffice
- Write: must write k+m chunks (all 6)
- Latency: higher than replication (tail latency from slowest chunk)
Erasure coding is typically used for "warm" and "cold" data (backups, logs, infrequently accessed VM images) where the capacity savings outweigh the latency penalty. Hot data (VM boot disks, databases) typically uses replication (replica=3 in Ceph, 3-way mirror in S2D) for lower latency.
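The replication-vs-EC trade-off comes down to two numbers: storage overhead (k+m)/k and failures survived (m). A small sketch comparing the schemes mentioned above:

```python
def ec_profile(k: int, m: int, usable_tb: float) -> dict:
    """Raw capacity and fault tolerance for k data + m coding chunks."""
    overhead = (k + m) / k
    return {
        "overhead": overhead,          # raw bytes stored per usable byte
        "raw_tb": usable_tb * overhead,
        "survives_failures": m,
        "chunks_per_write": k + m,     # every write touches k+m OSDs
    }

def replica_profile(copies: int, usable_tb: float) -> dict:
    return {"overhead": float(copies), "raw_tb": usable_tb * copies,
            "survives_failures": copies - 1, "chunks_per_write": copies}

# 100 TB usable: replica=3 vs EC 4+2 (both survive 2 failures)
r3 = replica_profile(3, 100)
ec42 = ec_profile(4, 2, 100)
assert r3["raw_tb"] == 300.0 and ec42["raw_tb"] == 150.0
assert r3["survives_failures"] == ec42["survives_failures"] == 2
```

Same fault tolerance, half the raw capacity: that is the entire argument for EC on warm/cold data, paid for in CPU and tail latency on the hot path.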
4. Thin Provisioning
What It Is and Why It Exists
Thin provisioning is the practice of presenting a storage volume to a consumer with a larger virtual size than the physical space actually allocated to it. Physical blocks are allocated only when data is first written to them. This is the storage equivalent of memory overcommit in virtualization -- it assumes that not every consumer will use 100% of their allocated space simultaneously.
Without thin provisioning, an environment with 5,000 VMs where each VM is allocated a 100 GB disk would need 500 TB of physical storage, even if the average actual usage is only 30 GB per VM (150 TB total). Thin provisioning lets us provision those 5,000 VMs from a 250 TB pool and monitor actual consumption.
How It Works
Thin provisioning is implemented at various layers in the stack. The mechanism differs, but the principle is the same: deferred allocation.
Layer 1: Storage array / SDS level
Traditional SAN arrays (Dell PowerMax, NetApp) and software-defined storage (Ceph, S2D) implement thin provisioning at the pool level. When a LUN or volume is created as "thin," the array metadata records the virtual size but allocates physical extents only on first write.
Layer 2: LVM thin pools (dm-thin-pool)
As described in the LVM section, dm-thin-pool implements thin provisioning at the Linux device-mapper level. A thin pool has a fixed physical data device and a metadata device. Thin LVs draw blocks from the pool on demand.
Layer 3: Filesystem level
Some filesystems (ReFS on Windows, XFS, ZFS) support sparse files, where unwritten regions consume no disk space. This is a file-level form of thin provisioning.
Layer 4: Virtual disk format
Virtual disk formats like QCOW2 (QEMU), VMDK-thin (VMware), and VHDX-dynamic (Hyper-V) are inherently thin-provisioned: the virtual disk file grows as data is written.
Thin Provisioning Across the Stack
====================================
+--------------------------------------------------+
| VM sees: /dev/vda = 100 GB disk |
+--------------------------------------------------+
| QCOW2 file: actual size on host = 32 GB | <-- VM disk level
+--------------------------------------------------+
| Ceph RBD image: 100 GB virtual, 32 GB allocated | <-- SDS level
| (4 MiB objects, only written objects exist) |
+--------------------------------------------------+
| Ceph OSD thin LV: physical extents on NVMe | <-- LVM level
+--------------------------------------------------+
| NVMe SSD: physical NAND cells | <-- Hardware level
+--------------------------------------------------+
Each layer may independently thin-provision.
Actual physical usage: ~32 GB
Virtual allocation: 100 GB
Overcommit ratio: ~3:1
The Overcommit Trap
Thin provisioning enables overcommitment, which is powerful but dangerous. The critical failure mode is pool exhaustion:
Thin Pool Exhaustion Scenario
==============================
Time T0: Pool = 10 TB physical, allocated 8 TB virtually (80%)
Actual usage: 6 TB (60% physical utilization)
Time T1: Batch job runs, writes 3 TB of new data
Actual usage: 9 TB (90% physical utilization)
Time T2: More writes, pool hits 100% physical capacity
ALL thin volumes FREEZE
I/O errors cascade to ALL 5,000 VMs on the pool
Potential filesystem corruption in guests
MITIGATION:
- Set alerts at 70%, 80%, 90% physical utilization
- Configure autoextend: /etc/lvm/lvm.conf
thin_pool_autoextend_threshold = 80
thin_pool_autoextend_percent = 20
- Monitor "actual vs virtual" ratio continuously
- Set per-VM I/O quotas to prevent runaway writes
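The exhaustion scenario can be simulated in a few lines: allocation happens on first write only, overcommit is virtual over physical, and the pool stalls at 100%. A toy sketch (the freeze is modeled as an exception; real dm-thin queues or errors I/O depending on configuration):

```python
class ThinPool:
    """Toy thin pool: physical blocks are allocated on first write only."""
    def __init__(self, physical_blocks: int):
        self.physical = physical_blocks
        self.allocated = {}      # (lv, block) -> True once backed
        self.virtual_sizes = {}  # lv -> advertised size in blocks

    def create_thin_lv(self, name: str, virtual_blocks: int) -> None:
        # Creation consumes no data blocks -- only a metadata entry.
        self.virtual_sizes[name] = virtual_blocks

    def write(self, lv: str, block: int) -> None:
        if (lv, block) not in self.allocated:
            if len(self.allocated) >= self.physical:
                raise IOError("thin pool full: all thin LVs freeze")
            self.allocated[(lv, block)] = True  # first-write allocation

    @property
    def overcommit(self) -> float:
        return sum(self.virtual_sizes.values()) / self.physical

pool = ThinPool(physical_blocks=200)
for i in range(4):
    pool.create_thin_lv(f"thin-lv{i+1}", virtual_blocks=(i + 1) * 100)
assert pool.overcommit == 5.0          # 1,000 virtual blocks over 200 physical

for blk in range(150):                 # normal load: pool at 75%
    pool.write("thin-lv4", blk)
try:
    for blk in range(100):             # batch job pushes past capacity...
        pool.write("thin-lv3", blk)
except IOError:
    frozen = True                      # ...and every thin LV on the pool stalls
assert frozen and len(pool.allocated) == 200
```

Note that the failure punishes all tenants of the pool, not just the writer that tipped it over; that is why the alert thresholds above apply to the pool, not to individual volumes.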
TRIM/DISCARD: Reclaiming Space
When a VM deletes files, the guest filesystem marks blocks as free, but the thin provisioning layer does not automatically know. The TRIM/DISCARD mechanism propagates "this block is no longer needed" down the stack:
Guest VM: rm /var/log/big.log
|
v
Guest filesystem (ext4): marks inodes/blocks as free
| (fstrim / mount -o discard)
v
virtio-blk / virtio-scsi: UNMAP/WRITE_ZEROES command
|
v
Ceph RBD / S2D: deallocates the corresponding objects/extents
|
v
Physical pool: space returned for reuse
Without TRIM/DISCARD, thin pools exhibit "phantom usage" -- physical space remains allocated for data the VM has long since deleted. In a 5,000-VM environment, this can easily waste 20-30% of pool capacity. All three candidates must support TRIM propagation end-to-end.
Performance Characteristics
Thin provisioning introduces a measurable overhead:
| Operation | Thick (pre-allocated) | Thin (on-demand) |
|---|---|---|
| First write to new block | Direct write (block exists) | Allocate block + write (metadata update + write) |
| Subsequent writes | Direct write | Direct write (same as thick) |
| Latency overhead (first write) | 0 | 5-15 us (metadata lookup + allocate) |
| Steady-state overhead | 0 | Negligible (< 1%) |
| Snapshot creation | Full copy or COW setup | Instant (metadata-only, shared blocks) |
| Space reclaim | N/A | Requires TRIM/DISCARD propagation |
For enterprise workloads, the first-write penalty is acceptable because it is amortized over the lifetime of the volume. The real risk is operational (pool exhaustion), not performance.
Monitoring Thin Provisioning: Essential Metrics
In a 5,000-VM environment, thin provisioning monitoring is not optional. These are the critical metrics and the commands/tools to collect them:
LVM Thin Pool Monitoring
==========================
# Check thin pool utilization
$ lvs -o lv_name,data_percent,metadata_percent vg_data/tp_data
LV Data% Meta%
tp_data 72.3 14.5
# Per-thin-LV actual usage
$ lvs -o lv_name,lv_size,data_percent --select 'pool_lv=tp_data'
LV LSize Data%
thin-lv-001 100.0g 32.10
thin-lv-002 200.0g 15.75
thin-lv-003 300.0g 41.20
Ceph RBD Monitoring (ODF)
===========================
# Pool utilization
$ ceph df
POOLS:
NAME USED %USED MAX AVAIL OBJECTS
rbd-pool 12 TiB 48.2% 12.8 TiB 3.2M
# Per-image actual usage
$ rbd du rbd-pool/vm-disk-001
NAME PROVISIONED USED
vm-disk-001 100 GiB 32 GiB
S2D Monitoring (Azure Local)
===============================
# Volume utilization (PowerShell)
Get-VirtualDisk | Select FriendlyName, Size, FootprintOnPool,
@{N='UsedPct';E={[math]::Round($_.FootprintOnPool/$_.Size*100,1)}}
Alert Thresholds (recommended):
70% -> WARNING (plan capacity expansion)
80% -> CRITICAL (start blocking new VM provisioning)
90% -> EMERGENCY (begin emergency space reclamation)
95% -> P1 ALERT (risk of I/O freeze imminent)
5. Storage Tiering (Hot / Warm / Cold)
What It Is and Why It Exists
Storage tiering is the practice of placing data on different classes of storage media based on access frequency, latency requirements, and cost. The fundamental insight is that not all data is equal: a database transaction log needs sub-millisecond latency on NVMe, while a two-year-old audit report can sit on slow, cheap HDDs or even object storage.
In a 5,000+ VM environment, the cost difference is enormous:
| Media | $/GB (approx. 2025-2026) | IOPS (4K random read) | Latency | Endurance |
|---|---|---|---|---|
| NVMe SSD (datacenter, TLC) | $0.10 - $0.20 | 500,000 - 1,000,000+ | 50-100 us | 1-3 DWPD |
| SATA SSD (datacenter) | $0.06 - $0.12 | 50,000 - 100,000 | 100-500 us | 0.3-1 DWPD |
| HDD (nearline, 7200 RPM) | $0.01 - $0.03 | 100-200 | 5-15 ms | N/A (mechanical) |
| Object storage (S3-compat) | $0.005 - $0.02 | N/A (HTTP API) | 10-100 ms | N/A |
At 500 TB total capacity for 5,000 VMs:
- All-NVMe: $50,000 - $100,000 in media cost
- All-HDD: $5,000 - $15,000 in media cost
- Tiered (20% NVMe + 30% SSD + 50% HDD): ~$25,000 in media cost
Tiering captures 80% of the performance of all-flash at 50% of the cost.
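The cost figures above are straightforward to reproduce. A quick calculator using midpoint $/GB prices from the media table (the prices are the table's approximations, and the tiered result lands in the same ballpark as the ~$25K estimate):

```python
# $/GB midpoints taken from the media table above (approximate 2025-2026 figures).
PRICE_PER_GB = {"nvme": 0.15, "sata_ssd": 0.09, "hdd": 0.02}

def media_cost(total_tb: float, mix: dict) -> float:
    """Media cost in USD for a capacity mix, e.g. {'nvme': 0.2, 'hdd': 0.8}."""
    assert abs(sum(mix.values()) - 1.0) < 1e-9  # fractions must cover the pool
    total_gb = total_tb * 1000
    return sum(total_gb * frac * PRICE_PER_GB[tier] for tier, frac in mix.items())

# 500 TB for 5,000 VMs
all_nvme = media_cost(500, {"nvme": 1.0})
tiered = media_cost(500, {"nvme": 0.2, "sata_ssd": 0.3, "hdd": 0.5})
assert round(all_nvme) == 75_000
assert round(tiered) == 33_500  # under half the all-flash media cost
```

Media cost is of course only one line item (controllers, networking, power, and licensing dominate real TCO), but the relative gap between the mixes is what tiering exploits.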
How It Works
Tiering can be implemented at several layers:
Manual tiering (StorageClass-based): The administrator defines multiple storage tiers as distinct pools or StorageClasses. Workloads are assigned to a tier at provisioning time based on their performance requirements. No automatic data movement.
StorageClass: "fast" --> NVMe pool, replica=3, max IOPS
StorageClass: "standard" --> SSD pool, replica=3, balanced
StorageClass: "archive" --> HDD pool, EC 4+2, cost-optimized
VM provisioning:
Database VM --> PVC with StorageClass "fast"
App Server VM --> PVC with StorageClass "standard"
Log Collector --> PVC with StorageClass "archive"
Automatic tiering (data movement based on access patterns): The storage system monitors I/O patterns and automatically moves hot data to fast media and cold data to slow media. This is implemented as a background process that periodically analyzes access statistics and migrates data blocks between tiers.
Automatic Tiering Flow
========================
Access Frequency Analysis
(per block / per extent)
|
+------------+------------+
| | |
v v v
Hot Tier Warm Tier Cold Tier
(NVMe SSD) (SATA SSD) (HDD)
+----------+ +-----------+ +---------+
| Block A |<--->| Block B |<--->| Block C |
| 500 IOPS | | 10 IOPS | | 0 IOPS |
| last 24h | | last 24h | | 90 days |
+----------+ +-----------+ +---------+
^ |
| Promotion (cold -> hot) |
+----------------------------------+
| Demotion (hot -> cold) |
+----------------------------------+
(background I/O, scheduled)
Tiering granularity:
- VMware vSAN: per-object (typically 1 MiB blocks)
- S2D: per-slab (256 MiB slabs)
- Ceph: per-pool (manual) or per-object with tiering agents
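The promotion/demotion decision in the flow above is essentially a classification on recent access statistics. A simplified sketch (the thresholds are illustrative; real implementations like S2D operate on slab-level heat maps rather than per-block counters):

```python
def classify_tier(iops_24h: float, days_since_access: float) -> str:
    """Toy placement policy mirroring the hot/warm/cold split above."""
    if iops_24h >= 100:
        return "hot"       # NVMe: actively hammered blocks
    if days_since_access <= 30:
        return "warm"      # SATA SSD: recently touched, low rate
    return "cold"          # HDD: untouched for a month or more

# The three example blocks from the diagram above:
blocks = [
    {"id": "A", "iops_24h": 500, "days_since_access": 0},
    {"id": "B", "iops_24h": 10,  "days_since_access": 2},
    {"id": "C", "iops_24h": 0,   "days_since_access": 90},
]
placement = {b["id"]: classify_tier(b["iops_24h"], b["days_since_access"])
             for b in blocks}
assert placement == {"A": "hot", "B": "warm", "C": "cold"}
```

A background task re-evaluates this classification periodically and schedules the actual block migrations during low-load windows, which is why demotion lags the access pattern rather than tracking it instantly.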
Cache tiering (read/write cache on fast media):
Instead of moving data between tiers, a fast device acts as a read/write cache in front of a slow device. Linux dm-cache (and its variants dm-writecache, bcache) implement this at the device-mapper level.
dm-cache Architecture
======================
I/O Request
|
v
+---------------+
| dm-cache |
| (cache policy:|
| smq / mq) |
+-------+-------+
|
+--------+--------+
| |
v v
+----------+ +-----------+
| SSD Cache | | HDD Origin |
| (fast, | | (slow, |
| small) | | large) |
+----------+ +-----------+
Cache policies:
smq (stochastic multiqueue): default, low memory overhead
mq (multiqueue): original, higher memory, more tunable
Modes:
writeback: writes go to SSD first, flushed to HDD later (risky if SSD fails)
writethrough: writes go to both SSD and HDD (safe, slower writes)
passthrough: SSD only caches reads (safest, no write acceleration)
Tiering in HCI Environments
In HCI (where compute and storage share the same physical nodes), tiering has an additional dimension: data locality. Ideally, hot data should reside on the same node as the VM consuming it, eliminating network round-trips.
HCI Node with Tiered Storage
==============================
Node 1 (runs VM-A, VM-B)
+------------------------------------------+
| NVMe (cache tier) [VM-A hot blocks] |
| SSD (capacity tier) [VM-A warm blocks] |
| HDD (archive tier) [VM-B cold blocks] |
+------------------------------------------+
|
Cluster Network (25/100 GbE)
|
Node 2 (runs VM-C, VM-D)
+------------------------------------------+
| NVMe (cache tier) [VM-C hot blocks] |
| SSD (capacity tier) [VM-C warm blocks] |
| HDD (archive tier) [VM-A replicas] |
+------------------------------------------+
Data locality + tiering = optimal performance:
Hot data for VM-A on Node 1's NVMe = local read, ~100 us
Cold data for VM-A on Node 2's HDD = network + HDD, ~10 ms
Data Lifecycle and Tier Migration for Financial Workloads
In a financial institution, data has a well-defined lifecycle that maps naturally to storage tiers:
Financial Data Lifecycle and Tier Mapping
==========================================
Phase 1: Active (0-90 days)
Data: live transactions, current balances, active sessions
Access: thousands of reads/writes per second
Tier: HOT (NVMe, replica=3)
Latency requirement: < 1 ms
Phase 2: Recent (90 days - 1 year)
Data: completed transactions, recent reports, audit trails
Access: occasional queries, compliance checks
Tier: WARM (SSD, replica=3 or EC 4+2)
Latency requirement: < 10 ms
Phase 3: Archive (1-7 years)
Data: regulatory retention (FINMA: 10 years for some records)
Access: rare, typically for audits or legal discovery
Tier: COLD (HDD, EC 8+3 or object storage)
Latency requirement: < 1 second (acceptable for batch access)
Phase 4: Deep Archive (7-10+ years)
Data: long-term compliance retention
Access: exceptional (legal/regulatory only)
Tier: COLD/ARCHIVE (object storage, tape, or immutable vault)
Latency requirement: minutes (acceptable for retrieval requests)
Cost optimization: moving 1 PB of archive data from NVMe ($100K+)
to object storage ($5K-20K) saves $80K+ per year.
Automated lifecycle policies (similar to AWS S3 Lifecycle Rules) should be a key evaluation criterion. The question is whether each candidate can automate the transition between phases based on data age, access frequency, or explicit policy tags.
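As a sketch of what such policy automation could look like, the phase boundaries above map to a simple age-based tier function. The tier names and thresholds below mirror the lifecycle phases in this section; they are illustrative, not any candidate's actual API:

```python
from datetime import date, timedelta

# Hypothetical lifecycle policy mirroring the four phases above.
# Thresholds (90 days, 1 year, 7 years) come from the text; tier
# names are illustrative placeholders.
PHASES = [
    (timedelta(days=90),      "hot"),   # Phase 1: NVMe, replica=3
    (timedelta(days=365),     "warm"),  # Phase 2: SSD
    (timedelta(days=7 * 365), "cold"),  # Phase 3: HDD / object
]
DEEP_ARCHIVE = "deep-archive"           # Phase 4: tape / immutable vault

def tier_for(record_date: date, today: date) -> str:
    """Return the target tier for a record based on its age."""
    age = today - record_date
    for threshold, tier in PHASES:
        if age <= threshold:
            return tier
    return DEEP_ARCHIVE

today = date(2025, 1, 1)
print(tier_for(date(2024, 12, 1), today))  # recent transaction -> hot
print(tier_for(date(2016, 1, 1), today))   # 9-year-old record  -> deep-archive
```

A real implementation would also honor access frequency and explicit policy tags, as noted above, not just age.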
Relationship to Other Concepts
- RAID protects each tier independently. NVMe tier might use replication (mirror), HDD tier might use erasure coding.
- Thin provisioning applies within each tier -- a thin pool on the NVMe tier can overcommit the hot tier capacity.
- IOPS/Throughput/Latency metrics differ dramatically per tier and determine correct workload placement.
- StorageClasses in Kubernetes map directly to storage tiers, allowing workload owners to self-service their tier selection.
6. IOPS / Throughput / Latency
What It Is and Why It Exists
These are the three fundamental performance dimensions of any storage system. They are independent metrics, and optimizing for one often trades off against another. Understanding them is essential because vendor claims like "millions of IOPS" are meaningless without context (block size, queue depth, read/write mix, sequential vs random).
Definitions and Relationships
IOPS (Input/Output Operations Per Second): The number of discrete read or write operations a storage system can perform per second. Each operation transfers one block of data (typically 4 KiB for random workloads, 64-256 KiB for sequential). IOPS measures the "operations per second" capability, regardless of how much data each operation transfers.
Throughput (MB/s or GB/s): The total data transfer rate. Throughput = IOPS x Block Size. A system doing 100,000 IOPS at 4 KiB blocks delivers ~400 MB/s of throughput; a system doing only 1,000 IOPS at 1 MiB blocks delivers ~1,000 MB/s: more than double the throughput at one-hundredth the operation rate.
Latency (microseconds or milliseconds): The time from when an I/O request is submitted until the response is received. This includes queueing time, processing time, network transit (for networked storage), and media access time. Latency is the metric end-users feel most directly -- it determines how "snappy" a database or application is.
The Relationship Triangle
==========================

            IOPS
           /    \
          /      \
         /        \
  Throughput ---- Latency
Throughput = IOPS x Block Size
IOPS = Throughput / Block Size
Latency ≈ 1 / IOPS (at low queue depth, single-threaded)
At high queue depths:
IOPS = Queue Depth / Latency
(Little's Law applied to storage)
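These relationships can be sanity-checked numerically. The figures below reuse the document's own examples (4 KiB blocks at 100,000 IOPS, and the QD=32 point from the NVMe queue-depth table later in this section):

```python
# Sanity-checking the relationships above with the document's own numbers.
def throughput_mib_s(iops: float, block_kib: float) -> float:
    """Throughput = IOPS x block size (KiB/s converted to MiB/s)."""
    return iops * block_kib / 1024

def iops_from_littles_law(queue_depth: int, latency_us: float) -> float:
    """Little's Law: IOPS = in-flight operations / average latency."""
    return queue_depth / (latency_us / 1_000_000)

# 100,000 IOPS at 4 KiB -> ~391 MiB/s (the text rounds this to ~400 MB/s)
print(round(throughput_mib_s(100_000, 4)))    # 391
# QD=32 at 178 us average latency -> ~180K IOPS
print(round(iops_from_littles_law(32, 178)))  # 179775
```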
I/O Patterns and Their Metrics
| Workload | Pattern | Primary Metric | Typical Target |
|---|---|---|---|
| OLTP Database (PostgreSQL, Oracle) | Random 4-8K read/write | IOPS + Latency | 10,000-50,000 IOPS, <1 ms |
| Data Warehouse (analytics query) | Sequential 64-256K read | Throughput | 1-10 GB/s |
| VM boot | Sequential 4-64K read | Throughput + Latency | >500 MB/s, <5 ms |
| Log ingestion | Sequential 4-64K write | Throughput | 200-1,000 MB/s |
| Email server | Random 4K mixed R/W | IOPS | 5,000-20,000 IOPS |
| VDI (Virtual Desktop) | Random 4K read-heavy | IOPS + Latency | 20-50 IOPS/desktop, <5 ms |
| Backup write | Sequential 1M write | Throughput | 1-5 GB/s |
The I/O Path: Where Latency Accumulates
Understanding where latency comes from is essential for diagnosing storage performance problems:
End-to-End I/O Path (VM on HCI with SDS)
==========================================
Application: write(fd, buf, 4096)
| ~1-5 us
v
Guest Kernel: VFS -> ext4 -> submit_bio()
| ~1-3 us
v
virtio-blk / virtio-scsi driver
| ~2-5 us
v
Hypervisor: QEMU I/O thread -> host block layer
| ~1-3 us
v
SDS Client (Ceph librbd / S2D ReFS)
| ~5-20 us
v
+--- Local path (data on this node) -----> ~50-100 us (NVMe)
| ~5-15 ms (HDD)
|
+--- Network path (data on remote node)
| ~5-30 us (RDMA/RoCE)
v ~50-200 us (TCP/IP)
Network transit
|
v
Remote SDS daemon (Ceph OSD / S2D)
| ~5-20 us
v
Local block layer -> NVMe/SSD/HDD
~50-100 us (NVMe)
~100-500 us (SSD)
~5-15 ms (HDD)
Total end-to-end latency examples:
Local NVMe path: ~100-200 us
Remote NVMe path: ~200-400 us (TCP), ~150-250 us (RDMA)
Remote HDD path: ~10-20 ms
For comparison, VMware vSAN typical latency:
All-flash local: ~200-300 us
All-flash remote: ~300-500 us
Queue Depth and Its Effect on IOPS
Queue depth (QD) is the number of I/O operations in flight simultaneously. It is the single most important tuning parameter for storage performance:
IOPS vs Queue Depth (typical NVMe SSD)
========================================
QD | IOPS (4K rand read) | Latency (avg)
-----+-----------------------+----------------
1 | 10,000 | 100 us
2 | 19,000 | 105 us
4 | 36,000 | 110 us
8 | 65,000 | 123 us
16 | 110,000 | 145 us
32 | 180,000 | 178 us
64 | 300,000 | 213 us
128 | 450,000 | 284 us
256 | 500,000 | 512 us <-- saturation
512 | 500,000 | 1,024 us <-- queueing delay
Observations:
- IOPS scales almost linearly with QD up to device saturation
- Latency increases slowly at first, then sharply past saturation
- Optimal QD: highest IOPS before latency degrades significantly
- For VMs: each VM typically generates QD 1-4, so 100 VMs on
one node generate aggregate QD 100-400 across shared storage
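Finding the "optimal QD" can be automated: sweep queue depth (e.g. with fio) and pick the knee of the IOPS curve. A minimal sketch using the measured points from the table above; the 15% marginal-gain threshold is an arbitrary assumption, not a standard:

```python
# Illustrative knee-finding over the QD curve above: keep doubling QD
# as long as the marginal IOPS gain still exceeds a chosen threshold.
curve = [  # (queue_depth, iops, latency_us) -- copied from the table
    (1, 10_000, 100), (2, 19_000, 105), (4, 36_000, 110),
    (8, 65_000, 123), (16, 110_000, 145), (32, 180_000, 178),
    (64, 300_000, 213), (128, 450_000, 284),
    (256, 500_000, 512), (512, 500_000, 1024),
]

def optimal_qd(points, min_gain=0.15):
    """Return the last QD where doubling QD still adds >= min_gain IOPS."""
    best = points[0][0]
    for (_, iops_a, _), (qd_b, iops_b, _) in zip(points, points[1:]):
        if iops_b / iops_a - 1 >= min_gain:
            best = qd_b
        else:
            break
    return best

print(optimal_qd(curve))  # 128: beyond this, IOPS flatlines and latency doubles
```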
Linux I/O Schedulers: Controlling Fairness
The Linux kernel's I/O scheduler sits between the filesystem/block layer and the device driver. It reorders, merges, and prioritizes I/O requests. The choice of scheduler directly impacts fairness between VMs sharing the same physical storage.
| Scheduler | Kernel Config | Best For | VM Relevance |
|---|---|---|---|
| `none` / `noop` | `elevator=none` | NVMe devices (already have internal schedulers) | Default for NVMe, correct for HCI |
| `mq-deadline` | `elevator=mq-deadline` | SSD/HDD devices, ensures bounded latency | Good for SCSI/SATA, prevents starvation |
| `bfq` (Budget Fair Queueing) | `elevator=bfq` | Desktop/interactive, per-process fairness | Rarely used in server/VM contexts |
| `kyber` | `elevator=kyber` | Fast SSDs, two-level priority (read/write) | Experimental, not widely deployed |
In a virtualized HCI environment:
- Host-level scheduler (`none` for NVMe) handles physical device queues
- Guest-level scheduler (`mq-deadline` or `none`) handles the virtual disk queues
- SDS layer (Ceph, S2D) may implement its own I/O prioritization between clients
The interaction between these three levels of scheduling is where "noisy neighbor" problems originate. A VM with aggressive ionice settings inside the guest cannot override the host-level or SDS-level fair-share policies -- but misconfigured schedulers can allow it.
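To verify which scheduler is active on each device (host or guest), the kernel exposes it in sysfs: the bracketed entry in `/sys/block/<dev>/queue/scheduler` is the active one. A small read-only helper:

```python
from pathlib import Path

def active_scheduler(sysfs_text: str) -> str:
    """Parse a scheduler sysfs line; the active scheduler is bracketed,
    e.g. '[none] mq-deadline kyber bfq' -> 'none'."""
    if "[" in sysfs_text:
        return sysfs_text.split("[", 1)[1].split("]", 1)[0]
    return sysfs_text.strip()  # some virtual devices list a single bare value

# On a Linux host this prints the active scheduler per block device
# (device names vary; nothing is written, so it is safe to run anywhere):
for f in Path("/sys/block").glob("*/queue/scheduler"):
    print(f.parts[3], active_scheduler(f.read_text()))
```

For an NVMe device under an HCI host, the expected output is `none`; anything else is worth investigating.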
Benchmarking: How to Measure Correctly
The standard tool for storage benchmarking is fio (Flexible I/O Tester). Key parameters that must be controlled for meaningful results:
# Random 4K read IOPS test (typical OLTP simulation)
# ioengine=libaio : Linux async I/O (closest to real VM I/O)
# direct=1        : bypass page cache (measures actual storage)
# bs=4k           : block size = 4 KiB
# rw=randread     : random read pattern
# iodepth=32      : queue depth
# numjobs=4       : 4 parallel workers
# runtime=300     : 5-minute sustained test
fio --name=rand_read \
    --ioengine=libaio \
    --direct=1 \
    --bs=4k \
    --rw=randread \
    --iodepth=32 \
    --numjobs=4 \
    --size=10G \
    --runtime=300 \
    --group_reporting \
    --filename=/dev/vdb    # test directly on the block device

# Sequential throughput test (typical backup/restore simulation)
# bs=1M    : large block size for throughput
# rw=write : sequential write
fio --name=seq_write \
    --ioengine=libaio \
    --direct=1 \
    --bs=1M \
    --rw=write \
    --iodepth=8 \
    --numjobs=1 \
    --size=100G \
    --runtime=300 \
    --filename=/dev/vdb

# Mixed workload (70% read, 30% write, typical enterprise)
# rwmixread=70 : 70% reads, 30% writes
fio --name=mixed \
    --ioengine=libaio \
    --direct=1 \
    --bs=8k \
    --rw=randrw \
    --rwmixread=70 \
    --iodepth=16 \
    --numjobs=8 \
    --size=10G \
    --runtime=300 \
    --filename=/dev/vdb
Common benchmarking mistakes to avoid:
- Not using `--direct=1`: Without this, Linux page cache absorbs reads/writes, and you measure RAM speed, not storage speed.
- Too short runtime: SSDs and distributed storage need 60+ seconds to reach steady state. Burst performance is not sustained performance.
- Wrong block size: 4K random is not the same as 1M sequential. Match the block size to the actual workload.
- Ignoring latency percentiles: Average latency hides tail latency. Report p50, p95, p99, p99.9. A system with 200 us average but 50 ms p99 will cause application timeouts.
- Testing empty volumes: Thin-provisioned volumes perform differently when first written (allocation overhead) vs. when overwritten (steady state).
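To report percentiles rather than averages, fio can emit JSON (`--output-format=json`). A hedged sketch of extracting tail latency: the `clat_ns` percentile layout below matches recent fio versions (completion latency in nanoseconds, percentile keys formatted like "99.000000"), but verify against your fio build before relying on it:

```python
import json

def tail_latency_ms(fio_json: dict, job: int = 0, op: str = "read") -> dict:
    """Pull p50/p95/p99/p99.9 completion latency (ms) from fio JSON output."""
    pct = fio_json["jobs"][job][op]["clat_ns"]["percentiles"]
    wanted = ("50.000000", "95.000000", "99.000000", "99.900000")
    return {p: ns / 1_000_000 for p, ns in pct.items() if p in wanted}

# Real usage: data = json.load(open("fio-result.json"))
# Minimal fabricated sample for illustration:
sample = {"jobs": [{"read": {"clat_ns": {"percentiles": {
    "50.000000": 200_000, "99.000000": 1_500_000, "99.900000": 4_800_000}}}}]}
print(tail_latency_ms(sample))
# {'50.000000': 0.2, '99.000000': 1.5, '99.900000': 4.8}
```

A run with 0.2 ms median but 50 ms p99 would pass an "average latency" check and still cause application timeouts, which is exactly why the percentile view matters.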
Performance Targets for 5,000+ VMs
Based on typical enterprise workload profiles:
| Metric | Target (aggregate cluster) | Per-VM average | Notes |
|---|---|---|---|
| Random 4K read IOPS | 500,000 - 2,000,000 | 100 - 400 | Assumes 50% of VMs are active simultaneously |
| Random 4K write IOPS | 100,000 - 500,000 | 20 - 100 | Writes are typically 20-30% of total I/O |
| Sequential throughput | 10 - 50 GB/s | 2 - 10 MB/s | Aggregate across all VMs |
| Read latency (p99) | < 1 ms | < 1 ms | For OLTP workloads |
| Write latency (p99) | < 2 ms | < 2 ms | For OLTP workloads |
| Read latency (p99.9) | < 5 ms | < 5 ms | Tail latency SLA |
These numbers are achievable with all-NVMe or NVMe+SSD HCI configurations with 50-100 nodes. They should be validated in the PoC against the actual VMware baseline using identical fio test profiles.
Profiling Real Workloads: Establishing the VMware Baseline
Before evaluating candidates, we must establish the performance profile of our current VMware environment. This means instrumenting the existing 5,000+ VMs to understand the actual I/O demand, not theoretical maximums.
Step 1: Capture aggregate storage statistics from vCenter
Key metrics to export from vCenter / esxtop / vRealize Operations:
- Per-VM: avg IOPS (read/write), avg latency, avg throughput
- Per-datastore: aggregate IOPS, capacity used vs provisioned
- Per-host: storage adapter queue depth, device latency
- Time range: at least 30 days to capture monthly patterns
Step 2: Categorize VMs by I/O profile
Typical Distribution (enterprise, 5,000 VMs):
Profile Category VMs Avg IOPS/VM Pattern Tier
----------------- ----- ----------- --------------- --------
Idle / near-idle 2,500 0-5 Negligible Cold
Light I/O 1,500 5-50 Mostly reads Warm
Moderate I/O 700 50-500 Mixed R/W Standard
Database / OLTP 250 500-5,000 Random, write-heavy Hot
High-performance 50 5,000+ Extreme random Hot+
This distribution is heavily skewed: the ~300 database and
high-performance VMs (~6% of the fleet) generate roughly 60% of
total storage I/O. These are the VMs that determine the storage
architecture requirements.
Step 3: Calculate aggregate demand
Using the example distribution above:
- Total cluster IOPS: ~500,000 (sum of all VMs at peak)
- Required aggregate throughput: ~5 GB/s
- Critical latency VMs: 300 VMs need p99 < 1 ms
This baseline becomes the acceptance criterion for the PoC: the candidate platform must match or exceed these numbers under equivalent load.
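The Step 3 arithmetic can be reproduced from the Step 2 distribution. The per-category IOPS figures below are midpoint-style assumptions chosen to land near the stated ~500K aggregate; the real numbers must come from the vCenter export described in Step 1:

```python
# Recomputing the aggregate-demand estimate from the example distribution.
# Per-VM IOPS values are assumptions within the ranges given in Step 2.
profiles = [  # (category, vm_count, assumed avg IOPS/VM at peak)
    ("idle",      2500,    3),
    ("light",     1500,   30),
    ("moderate",   700,  200),
    ("oltp",       250,  800),
    ("high-perf",   50, 2500),
]

total_iops = sum(n * iops for _, n, iops in profiles)
hot_iops = sum(n * iops for cat, n, iops in profiles
               if cat in ("oltp", "high-perf"))
print(f"total ~{total_iops:,} IOPS, hot-tier share {hot_iops / total_iops:.0%}")
# total ~517,500 IOPS, hot-tier share 63%
```

Swapping in measured per-category averages turns this into the actual PoC acceptance number.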
Write Amplification in Distributed Storage
One critical factor that vendor IOPS numbers often omit is write amplification -- the ratio of actual physical writes to application-level writes. In distributed storage:
Write Amplification Examples
==============================
Application writes 1 x 4 KiB block:
Ceph (replica=3):
-> 1 primary write + 2 replica writes = 3 physical writes
-> Plus WAL/journal write on each OSD = up to 6 writes
-> Write amplification: 3-6x
S2D (3-way mirror):
-> 1 primary + 2 mirror copies = 3 physical writes
-> Plus ReFS metadata update = ~3.5 writes
-> Write amplification: 3-4x
Ceph (EC 4+2):
-> 4 data chunks + 2 coding chunks = 6 writes for 4 blocks
-> Effective amplification per block: 1.5x (better for large writes)
-> But small writes require read-modify-write: up to 4-6x
Implication: If your baseline VMware IOPS is 500,000 writes/sec,
the physical storage must handle 1.5-3M writes/sec in an HCI model.
Size NVMe endurance (DWPD) accordingly.
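The endurance implication can be quantified. A sketch using the example figures from this section; the 2 PB usable NVMe pool size is an assumption for illustration:

```python
# Hedged endurance sizing: translate baseline write IOPS plus write
# amplification into a required DWPD (Drive Writes Per Day) rating.
def required_dwpd(write_iops: float, block_kib: float,
                  amplification: float, usable_capacity_tb: float) -> float:
    """Physical bytes written per day / usable capacity = DWPD requirement."""
    bytes_per_day = write_iops * block_kib * 1024 * amplification * 86_400
    return bytes_per_day / (usable_capacity_tb * 1e12)

# 500K baseline write IOPS at 4 KiB, 3x amplification, 2 PB usable NVMe:
print(round(required_dwpd(500_000, 4, 3.0, 2_000), 2))  # 0.27
```

At petabyte scale the per-drive DWPD requirement stays modest; shrink the pool or raise amplification (small EC writes) and the requirement climbs quickly, which is why the amplification factor belongs in procurement specs.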
How the Candidates Handle This
Comparison Table
| Aspect | VMware (Current) | OVE (OpenShift Virtualization Engine) | Azure Local | Swisscom ESC |
|---|---|---|---|---|
| Block storage | vSAN objects / VMFS datastores on SAN LUNs | Ceph RBD via CSI (ODF), or external SAN via CSI | Storage Spaces Direct (S2D) volumes on ReFS | Dell PowerMax/PowerStore LUNs, abstracted as service tiers |
| File storage | NFS datastores, VMFS shared | CephFS via CSI, NFS via external, Multus-attached NFS | SMB3 shares on S2D, native SMB Direct with RDMA | NFS/SMB as managed service |
| Object storage | Not native (3rd party: MinIO, Dell ECS) | Ceph Object Gateway (RGW) via ODF, S3-compatible | Azure Blob (cloud), no native on-prem object | Object storage as managed service |
| LVM usage | ESXi does not use LVM; VMFS is proprietary | Ceph OSDs use LVM (BlueStore); host OS may use LVM; TopoLVM for local PVs | Not applicable (S2D uses ReFS, no LVM) | Abstracted by provider; Dell arrays use proprietary volume management |
| RAID approach | vSAN: software-defined (FTT=1 mirror, FTT=2, erasure coding); SAN: array-managed | Ceph: replica=2/3 or erasure coding (k+m), per-pool configurable | S2D: 2-way mirror, 3-way mirror, or parity (single/dual), per-volume | Provider-managed, typically Dell RAID + array replication |
| Thin provisioning | vSAN thin by default; VMDK thin/thick; SAN depends on array | Ceph RBD thin by default (objects created on write); QCOW2 thin | S2D thin provisioning via ReFS; VHDX dynamic | Provider-managed, transparent to consumer |
| Storage tiering | vSAN: all-flash (NVMe cache + SSD capacity) or hybrid (SSD cache + HDD capacity) | ODF: manual via StorageClasses (NVMe pool vs HDD pool); no automatic intra-pool tiering | S2D: automatic tiering across NVMe/SSD/HDD within a single volume (slab-based, 256 MiB granularity) | Service tiers (e.g., "Platinum", "Gold", "Silver") mapped to different media classes |
| IOPS/perf model | vSAN: SIOC for QoS, per-VM IOPS limits; SAN: array-level QoS | ODF: StorageClass-based QoS, Ceph QoS (`rbd_qos_*`), rate limiting at CephFS/RBD level | S2D: Storage QoS policies (min/max IOPS per volume), integrated with Failover Clustering | SLA-based, contractually defined per service class |
| TRIM/DISCARD | VMFS UNMAP, vSAN automatic space reclamation | Full TRIM chain: guest -> virtio-scsi -> Ceph RBD DISCARD -> BlueStore | Full TRIM chain: guest -> Hyper-V -> S2D -> physical SSD TRIM | Provider-managed |
| Benchmarking | VMware vdbench, esxtop, vscsiStats | `fio` inside VMs, Ceph built-in benchmarks (`rados bench`, `rbd bench`), node-level `iostat` | `diskspd` (Windows `fio` equivalent), S2D performance counters, `Get-StorageSubsystem` | Not customer-accessible; SLA adherence measured by provider |
| Max cluster IOPS (all-NVMe, estimated) | vSAN: ~2-4M IOPS per cluster (64-node, all-flash) | ODF: ~2-5M IOPS per cluster (100-node, all-NVMe, replica=3) | S2D: ~5-13M IOPS per cluster (16-node, all-NVMe) per Microsoft claims | Provider SLA, not disclosed per cluster |
Detailed Analysis
OVE / OpenShift Data Foundation (ODF): ODF wraps Ceph, which is the most flexible storage engine among the candidates. It provides unified block (RBD), file (CephFS), and object (RGW) storage from a single platform. Ceph's CRUSH algorithm distributes data across failure domains (racks, nodes, DCs) with configurable redundancy per pool. This means you can have a replica-3 pool for hot VM disks and an EC 4+2 pool for cold data on the same cluster. The trade-off is complexity: Ceph has many tuning knobs (PG count, OSD memory, BlueStore cache size, recovery throttling) and requires deep operational expertise to run well at scale. Red Hat abstracts some of this through the Rook-Ceph operator, but the underlying Ceph architecture must be understood for troubleshooting.
ODF does not currently offer automatic intra-pool tiering (moving individual data blocks between NVMe and HDD within a single pool based on access frequency). Tiering is achieved by defining separate Ceph pools on different media types and assigning workloads to the appropriate pool via StorageClasses. This is simpler to reason about but requires the workload owner to make the right tier selection at provisioning time.
Azure Local / Storage Spaces Direct (S2D): S2D is the most integrated and opinionated storage solution. It is built into the Windows Server kernel and manages the full I/O path from ReFS filesystem through the software storage bus to physical disks. S2D's key differentiator is automatic tiering: it moves data at 256 MiB slab granularity between NVMe, SSD, and HDD tiers based on access frequency, entirely transparently. A single volume can span all tiers without the workload owner needing to choose.
S2D also offers SMB Direct with RDMA (Remote Direct Memory Access) for storage traffic, which bypasses the TCP/IP stack and delivers near-local-disk latency for remote reads. This is a significant performance advantage for cross-node I/O in HCI. However, S2D is limited to 16 nodes per cluster, which constrains the total IOPS and capacity pool. For 5,000+ VMs, multiple clusters are required, and storage is not shared across clusters.
S2D lacks native object storage (S3-compatible). Object storage requires Azure Blob (cloud) or a third-party solution. This is a gap for on-premises backup and archival workflows.
Swisscom ESC: As a managed service, ESC abstracts all storage internals. The customer selects a service tier ("Platinum" for high-IOPS NVMe, "Gold" for balanced SSD, etc.) and receives a volume with contractually guaranteed performance. The underlying hardware is Dell PowerMax/PowerStore arrays connected via Fibre Channel to Dell VxBlock compute nodes. This is traditional SAN architecture -- the most proven and well-understood model, but also the least flexible and most expensive per GB.
The customer has no visibility into RAID configurations, thin provisioning ratios, or tiering policies. Performance is governed by SLAs, not by architectural understanding. This is a feature for organizations that want to outsource complexity, but a limitation for organizations that want to optimize for specific workload patterns. There is no ability to run custom benchmarks against the storage infrastructure or tune performance parameters.
Key Takeaways
- Block storage is the foundation for VM workloads. All candidates provide it, but the implementation (Ceph RBD vs S2D vs Dell SAN) determines the performance profile, operational model, and failure domain characteristics.
- LVM and device-mapper are the invisible plumbing in Linux-based platforms. Ceph OSDs use LVM internally (BlueStore on LVM), and understanding `dm-thin-pool` is essential for diagnosing ODF capacity issues. Azure Local uses ReFS instead and has no LVM dependency.
- RAID is now a software concept in all HCI candidates. Both Ceph (ODF) and S2D expect disks in JBOD/HBA passthrough mode and handle redundancy in software. This is fundamentally different from the traditional SAN model (Dell PowerMax behind ESC) where RAID is handled by the array controller firmware.
- Thin provisioning is universal but dangerous. All candidates thin-provision by default. The critical operational discipline is monitoring actual physical utilization vs. virtual allocation and alerting before pool exhaustion. At 5,000+ VMs, a runaway write workload on one VM can cascade into a cluster-wide I/O freeze if the thin pool is full.
- Storage tiering is the biggest architectural differentiator between the candidates. S2D (Azure Local) offers fully automatic, transparent tiering within a single volume. ODF (OVE) requires manual tier selection via StorageClasses. ESC offers tiering as opaque service classes. The right model depends on whether you want operational simplicity (S2D), fine-grained control (ODF), or fully outsourced management (ESC).
- IOPS numbers without context are meaningless. Always specify block size, read/write ratio, queue depth, and whether the measurement is at the VM level or the cluster level. The PoC must establish a common benchmarking methodology (fio with identical parameters) across the VMware baseline and all candidates.
- Latency matters more than aggregate IOPS for most enterprise workloads. A database VM does not need 1M IOPS; it needs 5,000 IOPS with p99 latency under 1 ms. Focus PoC measurements on latency percentiles (p50, p95, p99, p99.9), not peak IOPS.
- TRIM/DISCARD propagation is a silent capacity killer. Verify that each candidate supports the full TRIM chain from guest filesystem through the hypervisor/SDS layer to the physical device. Without this, thin-provisioned pools will bloat over months and require manual intervention.
- S2D's 16-node limit has a storage consequence: the total storage pool per cluster is capped at 16 nodes x local disks. For 5,000+ VMs, this means storage is fragmented across multiple independent clusters with no cross-cluster pool. OVE/ODF can have a single 100+ node Ceph cluster. This is a fundamental architectural difference.
Discussion Guide
The following questions are designed to probe vendor and SME understanding of storage architecture in the context of our 5,000+ VM environment. They should be asked during PoC planning, vendor deep-dives, and architecture review sessions.
Questions for All Candidates
- Thin provisioning overcommit policy: "What is your recommended maximum overcommit ratio for thin-provisioned storage in a 5,000-VM environment? What automated actions does the platform take when physical utilization exceeds 85%? 90%? 95%? Can we define per-namespace or per-tenant capacity quotas that enforce hard limits before pool exhaustion?"
- IOPS isolation and noisy neighbor: "If one VM generates 50,000 IOPS of random write load (e.g., a runaway batch job), what mechanisms prevent that VM from degrading the latency of the other 4,999 VMs on the same storage pool? Show us the QoS enforcement path -- where exactly in the I/O stack is the rate limit applied, and what is the granularity (per-VM, per-volume, per-node)?"
- Latency percentiles under load: "At 80% of rated cluster IOPS capacity, what is the p99.9 read latency for a 4K random read? We need this number from a real benchmark, not a datasheet. Can you run this test during our PoC on a cluster sized for 500 VMs?"
- TRIM/DISCARD end-to-end: "Walk us through the exact TRIM/DISCARD propagation path from a Linux VM guest running ext4 with `fstrim`, through the hypervisor, through the SDS layer, to the physical NVMe device. At which layer, if any, does space reclamation happen asynchronously? What is the delay between guest TRIM and physical space return?"
- Failure domain and rebuild impact: "A physical node with 8 NVMe drives (total 30 TB) fails permanently. Describe the data rebuild process: how much data needs to be reconstructed, how long it takes, what I/O overhead the rebuild imposes on running VMs, and what redundancy level remains during the rebuild. What happens if a second node fails during the rebuild?"
Questions Specific to OVE / ODF
- Ceph PG autoscaler and pool sizing: "How does ODF determine the number of Placement Groups per pool? In our scenario with 200+ OSDs across 100 nodes, what is the target PG-per-OSD ratio, and what happens if PG count is misconfigured (too low or too high)? How does Rook-Ceph's PG autoscaler behave during cluster expansion from 50 to 100 nodes?"
- BlueStore tuning for all-NVMe: "The default BlueStore configuration was designed for mixed SSD/HDD environments. For an all-NVMe ODF cluster, what tuning parameters need to change (`bluestore_cache_size`, `bluestore_min_alloc_size`, `rocksdb_cache_size`)? Has Red Hat published a validated all-NVMe ODF performance profile?"
Questions Specific to Azure Local / S2D
- Automatic tiering transparency: "S2D moves data between NVMe, SSD, and HDD tiers automatically. How can we monitor which percentage of a specific VM's data is on which tier at any given time? Is there a way to pin a critical VM's data to the NVMe tier and prevent demotion? What is the tiering slab granularity, and can it cause performance cliffs when a hot 256 MiB slab is partially cold?"
- Cross-cluster storage for 5,000 VMs: "With the 16-node cluster limit, we need approximately 5-6 clusters for 5,000 VMs. How do we handle a VM that needs to access a volume on a different cluster? Is there a federated storage namespace, or is storage strictly cluster-local? How does this affect DR and backup architecture?"
Questions Specific to Swisscom ESC
- Storage performance guarantees and observability: "Your 'Platinum' service tier guarantees X IOPS and Y ms latency. Can we access real-time storage metrics (IOPS, latency percentiles, queue depth) for our tenant via API? If we observe performance below SLA, what is the escalation path and the mean time to resolution? Can we run our own `fio` benchmarks inside VMs to independently verify SLA adherence?"
Architecture-Level Questions (for Internal Discussion)
- Write amplification budget: "Given our measured baseline of N write IOPS on VMware, what is the expected write amplification factor on each candidate platform? How does this affect NVMe endurance planning over a 5-year lifecycle? What DWPD (Drive Writes Per Day) rating should we specify in hardware procurement?"
- Data gravity and migration complexity: "For each candidate, how is VM storage data physically organized? If we need to migrate a 2 TB database VM from one candidate platform to another (or back to VMware as a fallback), what is the data export/import path? Is the VM disk format portable (QCOW2, VMDK, VHDX), or does migration require full data copy through a conversion pipeline?"
- Encryption key management integration: "For encryption at rest, where are the encryption keys stored? Does the platform integrate with our existing HSM (Hardware Security Module) or external KMS (Key Management Service)? What happens to encrypted volumes if the KMS is temporarily unavailable? Who has access to the keys -- our team, the vendor, or both?"
- Capacity planning model: "Walk us through your capacity planning methodology for a 5,000-VM environment with 20% annual growth. How do we model thin provisioning overcommit ratios over time? At what physical utilization percentage do we need to order additional hardware, and what is the lead time from order to operational capacity?"
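For the internal discussion, the capacity-planning question can be grounded in a simple growth model. All parameters below (growth rate, reorder threshold, starting utilization) are assumptions to be replaced with measured values:

```python
# Illustrative capacity model: given annual data growth and a reorder
# threshold, how many quarters until physical utilization crosses it?
def quarters_until_reorder(physical_tb: float, used_tb: float,
                           annual_growth: float = 0.20,
                           reorder_at: float = 0.70) -> int:
    quarterly = (1 + annual_growth) ** 0.25  # compound quarterly factor
    q = 0
    while used_tb / physical_tb < reorder_at:
        used_tb *= quarterly
        q += 1
    return q

# 2 PB physical, 1 PB used today, 20% annual growth, reorder at 70% full:
print(quarters_until_reorder(2000, 1000))  # 8 quarters (~2 years)
```

Comparing this horizon against each vendor's hardware lead time answers the "when do we order" question concretely; thin-provisioning overcommit shortens the horizon and should be modeled with the measured allocation growth, not the virtual allocation.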
Next: 02-vmware-baseline.md -- Current-State VMware Storage Architecture (vSAN, VMFS, SAN Integration)