Storage Foundational Concepts
Why This Matters
Every VM running on our platform ultimately reads and writes data. The storage subsystem determines whether a database query returns in 1 ms or 100 ms, whether a failed disk causes an outage or is silently absorbed, and whether we can run 5,000+ VMs without hitting a capacity or performance cliff. Understanding the foundational concepts below is essential for three reasons:
- Evaluation accuracy. Each candidate (OVE, Azure Local, Swisscom ESC) makes different storage architecture decisions. Without understanding what block storage actually is, how RAID protects data, or what IOPS really measures, we cannot critically evaluate vendor claims or compare them against our current VMware/vSAN baseline.
- Day-2 operations. When a VM experiences slow I/O at 2 AM, the on-call engineer needs to know whether the bottleneck is at the LVM layer, the RAID rebuild, the thin provisioning overcommit, or the storage tier placement. These concepts are the diagnostic vocabulary.
- Architecture decisions. Choosing between HCI storage (ODF/Ceph, S2D) and traditional SAN, sizing disk pools, setting QoS policies, and planning capacity all depend on a solid grasp of these fundamentals.
This page covers the six building blocks that every other storage topic in this study rests on. Master these, and the platform-specific topics (vSAN, ODF, S2D) in later pages will make immediate sense.
Concepts
1. Block vs File vs Object Storage
What It Is and Why It Exists
Storage systems present data to consumers through one of three fundamental interfaces. The choice of interface determines how applications address data, what protocols carry the traffic, and what performance and scalability characteristics are possible.
| Dimension | Block | File | Object |
|---|---|---|---|
| Abstraction | Raw disk (LBA addresses) | Hierarchical filesystem (paths) | Flat namespace (bucket + key) |
| Protocol examples | iSCSI, FC, NVMe-oF, virtio-blk | NFS, SMB/CIFS, CephFS | S3, Swift, HTTP REST |
| Typical latency | Sub-millisecond to low ms | Low ms to mid ms | Mid ms to high ms |
| Use case | Databases, VM boot disks, OLTP | Shared home dirs, media, config | Backups, logs, archives, artifacts |
| Metadata model | None (raw bytes at offsets) | POSIX (owner, perms, timestamps) | Custom key-value pairs per object |
| Filesystem | Consumer creates it (ext4, XFS, NTFS) | Server provides it | None (flat key-value) |
How It Works
Block storage exposes a linear array of fixed-size blocks (typically 512 bytes or 4 KiB sectors) identified by Logical Block Addresses (LBAs). The consumer (a kernel filesystem driver, a database engine with raw I/O) issues read/write commands to specific LBAs. There is no notion of "files" or "directories" at this layer. The kernel's block layer (struct bio, struct request) manages I/O scheduling, merging, and dispatch.
Application (e.g., PostgreSQL)
|
v
Filesystem (ext4 / XFS / NTFS)
| translates file offsets to LBAs
v
Block Layer (Linux: bio -> request -> dispatch)
|
v
Device Driver (virtio-blk, SCSI, NVMe)
|
v
Storage Target (local disk / SAN LUN / Ceph RBD)
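The "linear array of blocks" model can be made concrete with positional reads and writes. A minimal sketch, using a regular file as a stand-in for a block device (a real device like /dev/sdb would need root privileges and, for O_DIRECT, aligned buffers):

```python
import os

SECTOR = 512  # classic logical block size; modern disks often use 4 KiB

# Stand-in for a block device: a plain file. On real hardware you would
# open("/dev/sdX", os.O_RDWR) instead.
fd = os.open("/tmp/fake_disk.img", os.O_RDWR | os.O_CREAT, 0o600)
os.ftruncate(fd, 1024 * SECTOR)  # a 1024-sector "disk"

def write_lba(lba: int, data: bytes) -> None:
    """Write one sector at a given Logical Block Address."""
    assert len(data) == SECTOR
    os.pwrite(fd, data, lba * SECTOR)

def read_lba(lba: int) -> bytes:
    """Read one sector at a given Logical Block Address."""
    return os.pread(fd, SECTOR, lba * SECTOR)

# There are no files or directories at this layer -- only numbered blocks.
write_lba(42, b"\xAB" * SECTOR)
assert read_lba(42) == b"\xAB" * SECTOR
os.close(fd)
```

Everything above this interface (filenames, permissions, directories) is the filesystem's invention; the device only ever sees (LBA, bytes) pairs.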
File storage adds a POSIX-compatible layer on top. A file server (NFS daemon, Samba) or a distributed filesystem (CephFS, GlusterFS) manages the inode table, directory tree, permissions, and locks. The client mounts the filesystem over the network and interacts via standard open(), read(), write(), stat() syscalls. The wire protocol handles serialization, caching, and delegation.
Application
|
v
VFS (Virtual Filesystem Switch)
|
+---> Local FS (ext4, XFS) -> Block Layer -> Disk
|
+---> NFS Client (nfs4_proc_*) -> RPC/XDR -> Network -> NFS Server
|
+---> SMB Client (cifs.ko) -> SMB3 -> Network -> Samba / Windows
Object storage abandons the filesystem hierarchy entirely. Data is stored as immutable objects in flat namespaces (buckets). Each object has a unique key, a blob of data, and arbitrary metadata. Access is via HTTP REST APIs (PUT, GET, DELETE). There is no seek(), no partial update, no directory listing in the traditional sense. This model scales to exabytes because there is no global inode table to bottleneck.
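The flat-namespace semantics can be mimicked in a few lines. A toy sketch (an in-memory dict standing in for an S3-style store, with whole-object PUT/GET/DELETE and per-object metadata; no seek, no partial update):

```python
class Bucket:
    """Toy object store: flat namespace, whole-object operations only."""
    def __init__(self, name: str):
        self.name = name
        self._objects = {}  # key -> (blob, metadata)

    def put(self, key: str, data: bytes, metadata=None) -> None:
        # PUT replaces the entire object; there is no partial update.
        self._objects[key] = (data, metadata or {})

    def get(self, key: str) -> bytes:
        return self._objects[key][0]

    def delete(self, key: str) -> None:
        self._objects.pop(key, None)

    def list_keys(self, prefix: str = ""):
        # "Directories" are just a naming convention on keys, not hierarchy.
        return sorted(k for k in self._objects if k.startswith(prefix))

b = Bucket("backup-bucket")
b.put("db/2025-01-15/pg_dump.sql.gz", b"...", {"retention": "10y"})
b.put("db/2025-01-16/pg_dump.sql.gz", b"...")
assert b.list_keys("db/2025-01-15/") == ["db/2025-01-15/pg_dump.sql.gz"]
```

Because there is no shared inode table or directory tree to coordinate, each key can be placed independently across nodes, which is exactly what allows the model to scale out.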
Why This Matters for 5,000+ VMs
- VM boot disks and database volumes require block storage -- low latency, direct LBA access, no protocol overhead beyond the block layer.
- Shared configuration repositories, user home directories, and media files use file storage -- multiple VMs mount the same NFS/SMB export.
- Backups, log archives, VM image templates, and compliance data naturally fit object storage -- write once, read rarely, scale massively.
Every candidate must provide all three. The question is how they provide them and what performance trade-offs they make.
Multi-Protocol Access in Virtualized Environments
A single VM may consume all three storage types simultaneously. Understanding how these coexist is critical for architecture planning:
Typical Enterprise VM: Database Server
========================================
VM: db-prod-01
|
+-- /dev/vda (boot disk) --> Block (Ceph RBD / S2D / SAN LUN)
| 100 GB, ext4, OS + binaries
|
+-- /dev/vdb (data disk) --> Block (Ceph RBD / S2D / SAN LUN)
| 500 GB, XFS, raw tablespace
| Requires: low latency, high IOPS
|
+-- /mnt/share (config mount) --> File (NFS v4 / SMB3 / CephFS)
| Shared config across DB cluster
| Requires: consistency, locking
|
+-- s3://backup-bucket --> Object (Ceph RGW / Azure Blob)
Nightly pg_dump uploads
Requires: durability, capacity, cost efficiency
The protocol choice per data type directly impacts performance, availability, and cost. A common mistake in IaaS evaluations is to focus exclusively on block storage (because VM disks are block devices) and neglect file and object storage requirements until post-migration, when teams discover their NFS workflows or backup pipelines do not have equivalent services on the new platform.
Consistency Semantics
The three models differ fundamentally in consistency guarantees, which matters for multi-VM or clustered applications:
| Model | Consistency | Implication |
|---|---|---|
| Block | Single-writer exclusive access (one VM per LUN/RBD image) | No concurrent-writer conflicts by design; multi-attach requires a cluster filesystem (GFS2, OCFS2) |
| File | POSIX semantics with locking (fcntl, flock) | Safe for concurrent multi-VM access; performance depends on lock contention |
| Object | Read-after-write for new objects; overwrites historically eventual (AWS S3 has been strongly consistent since late 2020) | Not suitable for real-time multi-writer coordination |
2. LVM (Logical Volume Management)
What It Is and Why It Exists
LVM is the Linux kernel subsystem (implemented via the device-mapper framework) that inserts a flexible abstraction layer between physical disks and filesystems. It solves a fundamental problem: physical disks have fixed sizes and fixed partition layouts, but workloads need volumes that can grow, shrink, move, snapshot, and stripe across multiple physical devices -- all without downtime.
In virtualized environments, LVM is doubly relevant: the hypervisor host often uses LVM to manage its own local storage, and guest VMs may use LVM internally for their own volumes. Storage backends like Ceph OSD daemons and S2D may also leverage LVM (or device-mapper directly) under the hood.
How It Works
LVM operates in three layers:
+------------------------------------------------------------------+
| Filesystem (ext4 / XFS) |
+------------------------------------------------------------------+
| Logical Volume (LV) |
| /dev/vg_data/lv_database (500 GB, thin) |
+------------------------------------------------------------------+
| Volume Group (VG) |
| vg_data = pv1 + pv2 + pv3 (pooled capacity) |
+------------------------------------------------------------------+
| Physical Volumes (PV) |
| /dev/sda2 /dev/sdb1 /dev/nvme0n1p1 |
| (500 GB HDD) (500 GB HDD) (1 TB NVMe) |
+------------------------------------------------------------------+
| Physical Disks / Partitions |
+------------------------------------------------------------------+
Physical Volumes (PVs): Any block device (whole disk, partition, RAID array, iSCSI LUN) is initialized with pvcreate, which writes an LVM metadata header and divides the device into Physical Extents (PEs), typically 4 MiB each.
Volume Groups (VGs): One or more PVs are combined into a VG with vgcreate. The VG is the pool from which Logical Volumes are allocated. PEs from different PVs become interchangeable within the VG.
Logical Volumes (LVs): Carved from a VG with lvcreate. An LV is a contiguous virtual block device backed by a set of PEs (which may be physically scattered across multiple disks). LVs appear as /dev/<vg_name>/<lv_name> and are formatted with a filesystem or used as raw block devices.
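The PE pooling described above can be sketched as a tiny allocator: PEs from all PVs form one pool, and an LV is just an ordered list of (PV, extent) pairs that may scatter across devices. Class and method names below are illustrative, not real LVM APIs:

```python
PE_SIZE_MIB = 4  # default LVM physical-extent size

class VolumeGroup:
    """Toy VG: pools physical extents (PEs) from several PVs."""
    def __init__(self, name: str, pvs):
        # pvs: device name -> size in MiB
        self.name = name
        self.free = [(dev, pe) for dev, size in pvs.items()
                     for pe in range(size // PE_SIZE_MIB)]
        self.lvs = {}  # lv name -> list of (device, extent) pairs

    def lvcreate(self, lv_name: str, size_mib: int) -> None:
        needed = size_mib // PE_SIZE_MIB
        if needed > len(self.free):
            raise RuntimeError("insufficient free extents in VG")
        # PEs are interchangeable: the LV may land on any mix of PVs.
        self.lvs[lv_name] = [self.free.pop(0) for _ in range(needed)]

    def lvextend(self, lv_name: str, extra_mib: int) -> None:
        # Online growth: append more PEs; no data movement required.
        self.lvs[lv_name] += [self.free.pop(0)
                              for _ in range(extra_mib // PE_SIZE_MIB)]

vg = VolumeGroup("vg_data", {"/dev/sda2": 512, "/dev/sdb1": 512})
vg.lvcreate("lv_database", 600)           # larger than either single PV
vg.lvextend("lv_database", 100)
assert len(vg.lvs["lv_database"]) == 175  # 700 MiB / 4 MiB per PE
assert {dev for dev, _ in vg.lvs["lv_database"]} == {"/dev/sda2", "/dev/sdb1"}
```

Note the LV deliberately spans both PVs: this is the core LVM property that decouples volume sizes from physical disk sizes.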
Kernel Internals: device-mapper
LVM is built on top of the Linux device-mapper (DM) subsystem (drivers/md/dm.c). Device-mapper creates virtual block devices by mapping I/O requests to underlying physical devices according to mapping tables. Key DM targets used by LVM:
| DM Target | Purpose | Used By |
|---|---|---|
| dm-linear | Maps a range of virtual sectors to a range on a physical device | Standard LVs |
| dm-striped | Stripes I/O across multiple devices (like RAID 0) | Striped LVs |
| dm-snapshot | Copy-on-write snapshots | LVM snapshots |
| dm-thin-pool | Thin provisioning with on-demand block allocation | Thin LVs |
| dm-cache | SSD caching for HDD-backed volumes | lvmcache |
| dm-crypt | Encryption (LUKS) | Encrypted LVs |
The I/O path for an LVM volume:
Application write()
|
v
Filesystem (ext4): maps file offset -> LBA on /dev/vg/lv
|
v
device-mapper: translates virtual LBA -> physical LBA on /dev/sdX
| (lookup in DM mapping table, O(1) for linear, O(log n) for thin)
v
Block layer: I/O scheduler (mq-deadline / none / bfq)
|
v
SCSI / NVMe driver -> physical disk
Key Operations
| Operation | Command | Impact |
|---|---|---|
| Extend LV online | lvextend -L +100G /dev/vg/lv && resize2fs /dev/vg/lv | Zero downtime, no data movement |
| Shrink LV | resize2fs /dev/vg/lv 400G && lvreduce -L 400G /dev/vg/lv | Requires unmount for ext4; XFS cannot shrink |
| Create snapshot | lvcreate -s -L 10G -n snap_lv /dev/vg/lv | COW snapshot; reads from origin until blocks are modified |
| Mirror LV | lvcreate --type mirror -m 1 -L 100G -n lv_mirror vg | Synchronous mirror across PVs |
| Move PV data | pvmove /dev/sda2 /dev/sdb1 | Online data migration between physical devices |
Thin Provisioning via LVM
LVM thin provisioning (dm-thin-pool) deserves special mention because it is the mechanism used by many storage backends to overcommit capacity:
+----------------------------------------------------+
| Thin Pool (VG: vg_data) |
| Actual allocated: 200 GB |
| Metadata device: /dev/vg_data/tp_meta (small) |
| Data device: /dev/vg_data/tp_data (1 TB) |
+----------------------------------------------------+
| | | |
thin-lv1 thin-lv2 thin-lv3 thin-lv4
(100 GB) (200 GB) (300 GB) (400 GB)
used: 50G used: 60G used: 40G used: 50G
Sum of virtual sizes: 1,000 GB
Actual physical usage: 200 GB
Overcommit ratio: 5:1
Blocks are allocated from the thin pool only when actually written. The metadata device tracks which blocks belong to which thin LV using a B-tree structure. This is critical: if the thin pool fills to 100%, all thin LVs freeze and data corruption can occur. Monitoring thin pool usage is a non-negotiable operational requirement.
Relationship to the Stack
- RAID sits below LVM: PVs can be RAID arrays (md devices or hardware RAID LUNs).
- Thin Provisioning is implemented as a DM target within LVM.
- Storage Tiering can be implemented via dm-cache (an LV that caches a slow HDD-backed LV on fast SSD).
- Ceph OSD uses LVM to manage its local BlueStore volumes (ceph-volume lvm commands).
- CSI drivers in Kubernetes dynamically create LVM thin LVs for PersistentVolumes (e.g., TopoLVM, LVM Operator).
3. RAID Levels
What It Is and Why It Exists
RAID (Redundant Array of Independent Disks) combines multiple physical disks into a single logical unit to achieve one or more of: (a) data redundancy (survive disk failures), (b) improved performance (parallelize I/O), and (c) increased capacity. RAID was invented to solve a simple problem: individual disks fail, and the larger the storage pool, the more frequently failures occur. At 5,000+ VMs, disk failures are not exceptional events -- they are routine operations that must be handled transparently.
RAID Levels in Detail
RAID 0 — Striping (No Redundancy)
===========================================
Disk 0: [A1] [A3] [A5] [A7]
Disk 1: [A2] [A4] [A6] [A8]
- Data striped across N disks in chunks (typically 64-512 KiB)
- Read/write performance: N x single disk (theoretical)
- Capacity: 100% (sum of all disks)
- Fault tolerance: NONE. One disk failure = total data loss
- Use case: Scratch space, temporary data, caches
RAID 1 — Mirroring
===========================================
Disk 0: [A1] [A2] [A3] [A4] (primary)
Disk 1: [A1] [A2] [A3] [A4] (mirror)
- Every write goes to both disks
- Read performance: up to 2x (reads can be split)
- Write performance: 1x (must write to both)
- Capacity: 50% (half lost to mirroring)
- Fault tolerance: 1 disk failure
- Use case: OS boot drives, small critical volumes
RAID 5 — Striping + Distributed Parity
===========================================
Disk 0: [A1] [B2] [C3] [Dp]
Disk 1: [A2] [B3] [Cp] [D1]
Disk 2: [A3] [Bp] [C1] [D2]
Disk 3: [Ap] [B1] [C2] [D3]
p = parity block (XOR of data blocks in the stripe)
- Parity is distributed across all disks (no single bottleneck)
- Read performance: (N-1) x single disk
- Write performance: reduced due to parity calculation
- Small random writes require read-modify-write:
1. Read old data block
2. Read old parity block
3. Compute new parity = old_parity XOR old_data XOR new_data
4. Write new data + new parity (= 4 I/O ops per write)
- Capacity: (N-1)/N (one disk worth lost to parity)
- Fault tolerance: 1 disk failure
- DANGER: Rebuild times on large disks (10+ TB) can exceed 24 hours,
during which a second failure causes total data loss (URE risk)
- Use case: General-purpose, read-heavy workloads
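The parity arithmetic in steps 1-4 above is plain XOR, and the same identity drives reconstruction after a failure. A minimal sketch:

```python
def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

# One RAID 5 stripe across 4 disks: 3 data blocks + 1 parity block.
d = [b"\x11" * 8, b"\x22" * 8, b"\x33" * 8]
parity = xor(xor(d[0], d[1]), d[2])

# Small-write path (read-modify-write): update d[1] without touching d[0], d[2].
new_d1 = b"\x99" * 8
parity = xor(xor(parity, d[1]), new_d1)  # new_p = old_p XOR old_data XOR new_data
d[1] = new_d1

# Disk 0 fails: its block is recoverable as the XOR of all survivors.
rebuilt = xor(xor(d[1], d[2]), parity)
assert rebuilt == d[0]
```

The read-modify-write shortcut is why a single logical write costs 4 physical I/Os, and the rebuild line is why recovery must read every surviving disk in the stripe.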
RAID 6 — Striping + Double Distributed Parity
===========================================
Disk 0: [A1] [B2] [C3] [Dp] [Eq]
Disk 1: [A2] [B3] [Cp] [Dq] [E1]
Disk 2: [A3] [Bp] [Cq] [D1] [E2]
Disk 3: [Ap] [Bq] [C1] [D2] [E3]
Disk 4: [Aq] [B1] [C2] [D3] [Ep]
p = P parity (XOR), q = Q parity (Reed-Solomon / GF(2^8))
- Two independent parity calculations per stripe
- Write penalty: even higher than RAID 5 (6 I/O ops per small write)
- Capacity: (N-2)/N
- Fault tolerance: 2 simultaneous disk failures
- Use case: Large arrays (>6 disks), compliance data, archive
RAID 10 — Mirror + Stripe (Nested)
===========================================
Stripe (RAID 0)
/ | \
Mirror Mirror Mirror
(RAID 1) (RAID 1) (RAID 1)
D0 D1 D2 D3 D4 D5
Disk 0: [A1] [A3] [A5] Disk 1: [A1] [A3] [A5] (mirror of D0)
Disk 2: [A2] [A4] [A6] Disk 3: [A2] [A4] [A6] (mirror of D2)
Disk 4: [A7] [A9] [A11] Disk 5: [A7] [A9] [A11] (mirror of D4)
- Stripe across mirrored pairs
- Read performance: up to N x single disk
- Write performance: (N/2) x single disk
- Capacity: 50%
- Fault tolerance: 1 disk per mirror pair (survives up to N/2 failures
if no two failures hit the same pair)
- Rebuild time: FAST (only mirror the failed disk from its pair,
not the entire array)
- Use case: Databases, OLTP, latency-sensitive workloads
This is the gold standard for enterprise storage performance.
RAID Performance Summary
| Level | Min Disks | Capacity | Read IOPS | Write IOPS | Write Penalty | Fault Tolerance | Rebuild Speed |
|---|---|---|---|---|---|---|---|
| RAID 0 | 2 | 100% | Nx | Nx | 1 | None | N/A |
| RAID 1 | 2 | 50% | 2x | 1x | 2 | 1 disk | Fast |
| RAID 5 | 3 | (N-1)/N | (N-1)x | ~(N-1)/4 x | 4 | 1 disk | Slow (full stripe read) |
| RAID 6 | 4 | (N-2)/N | (N-2)x | ~(N-2)/6 x | 6 | 2 disks | Very slow |
| RAID 10 | 4 | 50% | Nx | (N/2)x | 2 | 1 per pair | Fast (mirror only) |
Rebuild Risk: The Hidden Danger of Large Disks
Rebuild time is often overlooked but is critical for availability planning. When a disk fails, the RAID system must reconstruct the lost data from parity or mirror copies. During this rebuild:
- The array operates in a degraded state with reduced redundancy
- Rebuild I/O competes with production I/O, increasing latency by 20-50%
- A second disk failure during rebuild can cause data loss (RAID 5) or further degradation (RAID 6)
Rebuild time is proportional to disk size, not data size:
Rebuild Time Estimates (single disk failure)
=============================================
Disk Size RAID 5 Rebuild RAID 6 Rebuild RAID 10 (mirror)
--------- --------------- --------------- -----------------
1 TB 2-4 hours 2-4 hours 1-2 hours
4 TB 8-16 hours 8-16 hours 4-8 hours
10 TB 20-40 hours 20-40 hours 8-16 hours
16 TB 32-64 hours 32-64 hours 12-24 hours
20 TB 40-80 hours 40-80 hours 15-30 hours
Assumptions: 200 MB/s sustained rebuild rate with production I/O running.
Actual rates vary with workload intensity and controller/software capability.
DANGER ZONE: With 16+ TB disks (common in HDD-based capacity tiers),
RAID 5 rebuild exceeds 24 hours. The probability of a second disk
failure during a 48-hour rebuild window on a 12-disk array is
non-trivial (~1-5% depending on disk age and environment).
This is why:
- Enterprise storage has moved to RAID 6 minimum for HDDs
- RAID 10 is preferred for latency-sensitive workloads
- Distributed SDS (Ceph, S2D) rebuild faster because they
parallelize across ALL remaining nodes, not just the spare disk
Distributed storage systems have a major advantage here: when a node or disk fails, the rebuild is distributed across all remaining nodes in the cluster. A 30 TB node loss in a 100-node Ceph cluster means each of the 99 remaining nodes contributes ~300 GB of rebuild capacity, completing in minutes to hours rather than days. This is a key reason why HCI/SDS architectures are preferred at scale.
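The estimates above reduce to simple arithmetic: rebuild time scales with disk size over the effective rebuild rate, and distributed rebuild divides the work across survivors. A back-of-envelope sketch (the rates are assumptions in the spirit of the table, not measurements; ~100 MB/s models the slowdown under production I/O):

```python
def rebuild_hours(disk_tb: float, rate_mb_s: float) -> float:
    """Hours to re-stream an entire disk at a sustained rebuild rate."""
    return disk_tb * 1_000_000 / rate_mb_s / 3600  # TB -> MB, seconds -> hours

# Single-spare RAID rebuild: one target disk is the write bottleneck.
assert round(rebuild_hours(16, 100), 1) == 44.4   # 16 TB under production load

# Distributed SDS rebuild: a 30 TB node loss in a 100-node cluster is
# re-replicated in parallel by the 99 survivors (~300 GB each).
per_node_tb = 30 / 99
assert round(rebuild_hours(per_node_tb, 200), 2) == 0.42  # ~25 minutes
```

The two asserts capture the two orders of magnitude between spare-disk rebuild and cluster-wide parallel rebuild.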
Software RAID vs Hardware RAID
In modern storage stacks, software RAID (Linux md driver, or distributed software like Ceph's CRUSH + erasure coding) has largely replaced hardware RAID controllers:
| Aspect | Hardware RAID | Software RAID (md / Ceph) |
|---|---|---|
| Controller | Dedicated RAID card (LSI/Broadcom, Adaptec) | CPU-based, kernel module md |
| Battery/Capacitor | BBU/CacheVault for write cache | No hardware write cache (relies on disk cache/fsync) |
| Hot-swap | Firmware-managed | mdadm / Ceph-managed |
| Performance | HW write cache improves small writes | NVMe drives make HW cache irrelevant |
| Flexibility | Fixed to controller firmware | Full kernel control, scriptable |
| Ceph/ODF approach | Disks presented as individual JBODs (HBA mode) | Ceph manages replication/EC at the software layer |
Critical note for OVE/ODF: Ceph requires disks in HBA passthrough mode (JBOD), not behind a hardware RAID controller. Ceph handles data redundancy itself via replication (replica=3) or erasure coding (e.g., k=4, m=2, equivalent to RAID 6). Hardware RAID underneath Ceph would double the redundancy overhead without benefit and mask disk failures from Ceph's self-healing.
Critical note for Azure Local/S2D: Storage Spaces Direct similarly expects HBA passthrough mode. S2D implements its own mirroring (2-way, 3-way) and parity at the software layer. Hardware RAID underneath S2D is explicitly unsupported.
Erasure Coding: The Distributed RAID
In distributed storage systems (Ceph, S2D), erasure coding (EC) replaces traditional RAID parity:
Erasure Coding (k=4, m=2) — analogous to RAID 6
================================================
Original data: split into 4 data chunks (k)
Parity: 2 coding chunks computed (m)
OSD 0 OSD 1 OSD 2 OSD 3 OSD 4 OSD 5
[D0] [D1] [D2] [D3] [C0] [C1]
data data data data code code
- Survives any 2 OSD/node failures
- Storage overhead: (k+m)/k = 6/4 = 1.5x (vs 3x for replica=3)
- CPU cost: Reed-Solomon encode/decode on every write/recovery
- Read: any k of k+m chunks suffice
- Write: must write k+m chunks (all 6)
- Latency: higher than replication (tail latency from slowest chunk)
Erasure coding is typically used for "warm" and "cold" data (backups, logs, infrequently accessed VM images) where the capacity savings outweigh the latency penalty. Hot data (VM boot disks, databases) typically uses replication (replica=3 in Ceph, 3-way mirror in S2D) for lower latency.
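The replication-vs-EC trade-off comes down to two numbers: storage overhead (k+m)/k and failures survived (m). A small sketch comparing the schemes mentioned above:

```python
def ec_profile(k: int, m: int, usable_tb: float) -> dict:
    """Raw capacity and fault tolerance for k data + m coding chunks."""
    overhead = (k + m) / k
    return {
        "overhead": overhead,          # raw bytes stored per usable byte
        "raw_tb": usable_tb * overhead,
        "survives_failures": m,
        "chunks_per_write": k + m,     # every write touches k+m OSDs
    }

def replica_profile(copies: int, usable_tb: float) -> dict:
    return {"overhead": float(copies), "raw_tb": usable_tb * copies,
            "survives_failures": copies - 1, "chunks_per_write": copies}

# 100 TB usable: replica=3 vs EC 4+2 (both survive 2 failures)
r3 = replica_profile(3, 100)
ec42 = ec_profile(4, 2, 100)
assert r3["raw_tb"] == 300.0 and ec42["raw_tb"] == 150.0
assert r3["survives_failures"] == ec42["survives_failures"] == 2
```

Same fault tolerance, half the raw capacity: that is the entire argument for EC on warm/cold data, paid for in CPU and tail latency on the hot path.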
4. Thin Provisioning
What It Is and Why It Exists
Thin provisioning is the practice of presenting a storage volume to a consumer with a larger virtual size than the physical space actually allocated to it. Physical blocks are allocated only when data is first written to them. This is the storage equivalent of memory overcommit in virtualization -- it assumes that not every consumer will use 100% of their allocated space simultaneously.
Without thin provisioning, an environment with 5,000 VMs where each VM is allocated a 100 GB disk would need 500 TB of physical storage, even if the average actual usage is only 30 GB per VM (150 TB total). Thin provisioning lets us provision those 5,000 VMs from a 250 TB pool and monitor actual consumption.
How It Works
Thin provisioning is implemented at various layers in the stack. The mechanism differs, but the principle is the same: deferred allocation.
Layer 1: Storage array / SDS level
Traditional SAN arrays (Dell PowerMax, NetApp) and software-defined storage (Ceph, S2D) implement thin provisioning at the pool level. When a LUN or volume is created as "thin," the array metadata records the virtual size but allocates physical extents only on first write.
Layer 2: LVM thin pools (dm-thin-pool)
As described in the LVM section, dm-thin-pool implements thin provisioning at the Linux device-mapper level. A thin pool has a fixed physical data device and a metadata device. Thin LVs draw blocks from the pool on demand.
Layer 3: Filesystem level
Some filesystems (ReFS on Windows, XFS, ZFS) support sparse files, where unwritten regions consume no disk space. This is a file-level form of thin provisioning.
Layer 4: Virtual disk format
Virtual disk formats like QCOW2 (QEMU), VMDK-thin (VMware), and VHDX-dynamic (Hyper-V) are inherently thin-provisioned: the virtual disk file grows as data is written.
Thin Provisioning Across the Stack
====================================
+--------------------------------------------------+
| VM sees: /dev/vda = 100 GB disk |
+--------------------------------------------------+
| QCOW2 file: actual size on host = 32 GB | <-- VM disk level
+--------------------------------------------------+
| Ceph RBD image: 100 GB virtual, 32 GB allocated | <-- SDS level
| (4 MiB objects, only written objects exist) |
+--------------------------------------------------+
| Ceph OSD thin LV: physical extents on NVMe | <-- LVM level
+--------------------------------------------------+
| NVMe SSD: physical NAND cells | <-- Hardware level
+--------------------------------------------------+
Each layer may independently thin-provision.
Actual physical usage: ~32 GB
Virtual allocation: 100 GB
Overcommit ratio: ~3:1
The Overcommit Trap
Thin provisioning enables overcommitment, which is powerful but dangerous. The critical failure mode is pool exhaustion:
Thin Pool Exhaustion Scenario
==============================
Time T0: Pool = 10 TB physical, allocated 8 TB virtually (80%)
Actual usage: 6 TB (60% physical utilization)
Time T1: Batch job runs, writes 3 TB of new data
Actual usage: 9 TB (90% physical utilization)
Time T2: More writes, pool hits 100% physical capacity
ALL thin volumes FREEZE
I/O errors cascade to ALL 5,000 VMs on the pool
Potential filesystem corruption in guests
MITIGATION:
- Set alerts at 70%, 80%, 90% physical utilization
- Configure autoextend: /etc/lvm/lvm.conf
thin_pool_autoextend_threshold = 80
thin_pool_autoextend_percent = 20
- Monitor "actual vs virtual" ratio continuously
- Set per-VM I/O quotas to prevent runaway writes
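The exhaustion scenario can be simulated in a few lines: allocation happens on first write only, overcommit is virtual over physical, and the pool stalls at 100%. A toy sketch (the freeze is modeled as an exception; real dm-thin queues or errors I/O depending on configuration):

```python
class ThinPool:
    """Toy thin pool: physical blocks are allocated on first write only."""
    def __init__(self, physical_blocks: int):
        self.physical = physical_blocks
        self.allocated = {}      # (lv, block) -> True once backed
        self.virtual_sizes = {}  # lv -> advertised size in blocks

    def create_thin_lv(self, name: str, virtual_blocks: int) -> None:
        # Creation consumes no data blocks -- only a metadata entry.
        self.virtual_sizes[name] = virtual_blocks

    def write(self, lv: str, block: int) -> None:
        if (lv, block) not in self.allocated:
            if len(self.allocated) >= self.physical:
                raise IOError("thin pool full: all thin LVs freeze")
            self.allocated[(lv, block)] = True  # first-write allocation

    @property
    def overcommit(self) -> float:
        return sum(self.virtual_sizes.values()) / self.physical

pool = ThinPool(physical_blocks=200)
for i in range(4):
    pool.create_thin_lv(f"thin-lv{i+1}", virtual_blocks=(i + 1) * 100)
assert pool.overcommit == 5.0          # 1,000 virtual blocks over 200 physical

for blk in range(150):                 # normal load: pool at 75%
    pool.write("thin-lv4", blk)
try:
    for blk in range(100):             # batch job pushes past capacity...
        pool.write("thin-lv3", blk)
except IOError:
    frozen = True                      # ...and every thin LV on the pool stalls
assert frozen and len(pool.allocated) == 200
```

Note that the failure punishes all tenants of the pool, not just the writer that tipped it over; that is why the alert thresholds above apply to the pool, not to individual volumes.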
TRIM/DISCARD: Reclaiming Space
When a VM deletes files, the guest filesystem marks blocks as free, but the thin provisioning layer does not automatically know. The TRIM/DISCARD mechanism propagates "this block is no longer needed" down the stack:
Guest VM: rm /var/log/big.log
|
v
Guest filesystem (ext4): marks inodes/blocks as free
| (fstrim / mount -o discard)
v
virtio-blk / virtio-scsi: UNMAP/WRITE_ZEROES command
|
v
Ceph RBD / S2D: deallocates the corresponding objects/extents
|
v
Physical pool: space returned for reuse
Without TRIM/DISCARD, thin pools exhibit "phantom usage" -- physical space remains allocated for data the VM has long since deleted. In a 5,000-VM environment, this can easily waste 20-30% of pool capacity. All three candidates must support TRIM propagation end-to-end.
Performance Characteristics
Thin provisioning introduces a measurable overhead:
| Operation | Thick (pre-allocated) | Thin (on-demand) |
|---|---|---|
| First write to new block | Direct write (block exists) | Allocate block + write (metadata update + write) |
| Subsequent writes | Direct write | Direct write (same as thick) |
| Latency overhead (first write) | 0 | 5-15 us (metadata lookup + allocate) |
| Steady-state overhead | 0 | Negligible (< 1%) |
| Snapshot creation | Full copy or COW setup | Instant (metadata-only, shared blocks) |
| Space reclaim | N/A | Requires TRIM/DISCARD propagation |
For enterprise workloads, the first-write penalty is acceptable because it is amortized over the lifetime of the volume. The real risk is operational (pool exhaustion), not performance.
Monitoring Thin Provisioning: Essential Metrics
In a 5,000-VM environment, thin provisioning monitoring is not optional. These are the critical metrics and the commands/tools to collect them:
LVM Thin Pool Monitoring
==========================
# Check thin pool utilization
$ lvs -o lv_name,data_percent,metadata_percent vg_data/tp_data
LV Data% Meta%
tp_data 72.3 14.5
# Per-thin-LV actual usage
$ lvs -o lv_name,lv_size,data_percent --select 'pool_lv=tp_data'
LV LSize Data%
thin-lv-001 100.0g 32.10
thin-lv-002 200.0g 15.75
thin-lv-003 300.0g 41.20
Ceph RBD Monitoring (ODF)
===========================
# Pool utilization
$ ceph df
POOLS:
NAME USED %USED MAX AVAIL OBJECTS
rbd-pool 12 TiB 48.2% 12.8 TiB 3.2M
# Per-image actual usage
$ rbd du rbd-pool/vm-disk-001
NAME PROVISIONED USED
vm-disk-001 100 GiB 32 GiB
S2D Monitoring (Azure Local)
===============================
# Volume utilization (PowerShell)
Get-VirtualDisk | Select FriendlyName, Size, FootprintOnPool,
@{N='UsedPct';E={[math]::Round($_.FootprintOnPool/$_.Size*100,1)}}
Alert Thresholds (recommended):
70% -> WARNING (plan capacity expansion)
80% -> CRITICAL (start blocking new VM provisioning)
90% -> EMERGENCY (begin emergency space reclamation)
95% -> P1 ALERT (risk of I/O freeze imminent)
5. Storage Tiering (Hot / Warm / Cold)
What It Is and Why It Exists
Storage tiering is the practice of placing data on different classes of storage media based on access frequency, latency requirements, and cost. The fundamental insight is that not all data is equal: a database transaction log needs sub-millisecond latency on NVMe, while a two-year-old audit report can sit on slow, cheap HDDs or even object storage.
In a 5,000+ VM environment, the cost difference is enormous:
| Media | $/GB (approx. 2025-2026) | IOPS (4K random read) | Latency | Endurance |
|---|---|---|---|---|
| NVMe SSD (datacenter, TLC) | $0.10 - $0.20 | 500,000 - 1,000,000+ | 50-100 us | 1-3 DWPD |
| SATA SSD (datacenter) | $0.06 - $0.12 | 50,000 - 100,000 | 100-500 us | 0.3-1 DWPD |
| HDD (nearline, 7200 RPM) | $0.01 - $0.03 | 100-200 | 5-15 ms | N/A (mechanical) |
| Object storage (S3-compat) | $0.005 - $0.02 | N/A (HTTP API) | 10-100 ms | N/A |
At 500 TB total capacity for 5,000 VMs:
- All-NVMe: $50,000 - $100,000 in media cost
- All-HDD: $5,000 - $15,000 in media cost
- Tiered (20% NVMe + 30% SSD + 50% HDD): ~$25,000 in media cost
Tiering captures 80% of the performance of all-flash at 50% of the cost.
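The cost figures above are straightforward to reproduce. A quick calculator using midpoint $/GB prices from the media table (the prices are the table's approximations, and the tiered result lands in the same ballpark as the ~$25K estimate):

```python
# $/GB midpoints taken from the media table above (approximate 2025-2026 figures).
PRICE_PER_GB = {"nvme": 0.15, "sata_ssd": 0.09, "hdd": 0.02}

def media_cost(total_tb: float, mix: dict) -> float:
    """Media cost in USD for a capacity mix, e.g. {'nvme': 0.2, 'hdd': 0.8}."""
    assert abs(sum(mix.values()) - 1.0) < 1e-9  # fractions must cover the pool
    total_gb = total_tb * 1000
    return sum(total_gb * frac * PRICE_PER_GB[tier] for tier, frac in mix.items())

# 500 TB for 5,000 VMs
all_nvme = media_cost(500, {"nvme": 1.0})
tiered = media_cost(500, {"nvme": 0.2, "sata_ssd": 0.3, "hdd": 0.5})
assert round(all_nvme) == 75_000
assert round(tiered) == 33_500  # under half the all-flash media cost
```

Media cost is of course only one line item (controllers, networking, power, and licensing dominate real TCO), but the relative gap between the mixes is what tiering exploits.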
How It Works
Tiering can be implemented at several layers:
Manual tiering (StorageClass-based): The administrator defines multiple storage tiers as distinct pools or StorageClasses. Workloads are assigned to a tier at provisioning time based on their performance requirements. No automatic data movement.
StorageClass: "fast" --> NVMe pool, replica=3, max IOPS
StorageClass: "standard" --> SSD pool, replica=3, balanced
StorageClass: "archive" --> HDD pool, EC 4+2, cost-optimized
VM provisioning:
Database VM --> PVC with StorageClass "fast"
App Server VM --> PVC with StorageClass "standard"
Log Collector --> PVC with StorageClass "archive"
Automatic tiering (data movement based on access patterns): The storage system monitors I/O patterns and automatically moves hot data to fast media and cold data to slow media. This is implemented as a background process that periodically analyzes access statistics and migrates data blocks between tiers.
Automatic Tiering Flow
========================
Access Frequency Analysis
(per block / per extent)
|
+------------+------------+
| | |
v v v
Hot Tier Warm Tier Cold Tier
(NVMe SSD) (SATA SSD) (HDD)
+----------+ +-----------+ +---------+
| Block A |<--->| Block B |<--->| Block C |
| 500 IOPS | | 10 IOPS | | 0 IOPS |
| last 24h | | last 24h | | 90 days |
+----------+ +-----------+ +---------+
^ |
| Promotion (cold -> hot) |
+----------------------------------+
| Demotion (hot -> cold) |
+----------------------------------+
(background I/O, scheduled)
Tiering granularity:
- VMware vSAN: per-object (typically 1 MiB blocks)
- S2D: per-slab (256 MiB slabs)
- Ceph: per-pool (manual) or per-object with tiering agents
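The promotion/demotion decision in the flow above is essentially a classification on recent access statistics. A simplified sketch (the thresholds are illustrative; real implementations like S2D operate on slab-level heat maps rather than per-block counters):

```python
def classify_tier(iops_24h: float, days_since_access: float) -> str:
    """Toy placement policy mirroring the hot/warm/cold split above."""
    if iops_24h >= 100:
        return "hot"       # NVMe: actively hammered blocks
    if days_since_access <= 30:
        return "warm"      # SATA SSD: recently touched, low rate
    return "cold"          # HDD: untouched for a month or more

# The three example blocks from the diagram above:
blocks = [
    {"id": "A", "iops_24h": 500, "days_since_access": 0},
    {"id": "B", "iops_24h": 10,  "days_since_access": 2},
    {"id": "C", "iops_24h": 0,   "days_since_access": 90},
]
placement = {b["id"]: classify_tier(b["iops_24h"], b["days_since_access"])
             for b in blocks}
assert placement == {"A": "hot", "B": "warm", "C": "cold"}
```

A background task re-evaluates this classification periodically and schedules the actual block migrations during low-load windows, which is why demotion lags the access pattern rather than tracking it instantly.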
Cache tiering (read/write cache on fast media):
Instead of moving data between tiers, a fast device acts as a read/write cache in front of a slow device. Linux dm-cache (and its variants dm-writecache, bcache) implement this at the device-mapper level.
dm-cache Architecture
======================
I/O Request
|
v
+---------------+
| dm-cache |
| (cache policy:|
| smq / mq) |
+-------+-------+
|
+--------+--------+
| |
v v
+----------+ +-----------+
| SSD Cache | | HDD Origin |
| (fast, | | (slow, |
| small) | | large) |
+----------+ +-----------+
Cache policies:
smq (stochastic multiqueue): default, low memory overhead
mq (multiqueue): original, higher memory, more tunable
Modes:
writeback: writes go to SSD first, flushed to HDD later (risky if SSD fails)
writethrough: writes go to both SSD and HDD (safe, slower writes)
passthrough: SSD only caches reads (safest, no write acceleration)
Tiering in HCI Environments
In HCI (where compute and storage share the same physical nodes), tiering has an additional dimension: data locality. Ideally, hot data should reside on the same node as the VM consuming it, eliminating network round-trips.
HCI Node with Tiered Storage
==============================
Node 1 (runs VM-A, VM-B)
+------------------------------------------+
| NVMe (cache tier) [VM-A hot blocks] |
| SSD (capacity tier) [VM-A warm blocks] |
| HDD (archive tier) [VM-B cold blocks] |
+------------------------------------------+
|
Cluster Network (25/100 GbE)
|
Node 2 (runs VM-C, VM-D)
+------------------------------------------+
| NVMe (cache tier) [VM-C hot blocks] |
| SSD (capacity tier) [VM-C warm blocks] |
| HDD (archive tier) [VM-A replicas] |
+------------------------------------------+
Data locality + tiering = optimal performance:
Hot data for VM-A on Node 1's NVMe = local read, ~100 us
Cold data for VM-A on Node 2's HDD = network + HDD, ~10 ms
Data Lifecycle and Tier Migration for Financial Workloads
In a financial institution, data has a well-defined lifecycle that maps naturally to storage tiers:
Financial Data Lifecycle and Tier Mapping
==========================================
Phase 1: Active (0-90 days)
Data: live transactions, current balances, active sessions
Access: thousands of reads/writes per second
Tier: HOT (NVMe, replica=3)
Latency requirement: < 1 ms
Phase 2: Recent (90 days - 1 year)
Data: completed transactions, recent reports, audit trails
Access: occasional queries, compliance checks
Tier: WARM (SSD, replica=3 or EC 4+2)
Latency requirement: < 10 ms
Phase 3: Archive (1-7 years)
Data: regulatory retention (FINMA: 10 years for some records)
Access: rare, typically for audits or legal discovery
Tier: COLD (HDD, EC 8+3 or object storage)
Latency requirement: < 1 second (acceptable for batch access)
Phase 4: Deep Archive (7-10+ years)
Data: long-term compliance retention
Access: exceptional (legal/regulatory only)
Tier: COLD/ARCHIVE (object storage, tape, or immutable vault)
Latency requirement: minutes (acceptable for retrieval requests)
Cost optimization: moving 1 PB of archive data from NVMe ($100K+)
to object storage ($5K-20K) saves $80K+ per year.
Automated lifecycle policies (similar to AWS S3 Lifecycle Rules) should be a key evaluation criterion. The question is whether each candidate can automate the transition between phases based on data age, access frequency, or explicit policy tags.
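As a sketch of what such policy automation could look like, the phase boundaries above map to a simple age-based tier function. The tier names and thresholds below mirror the lifecycle phases in this section; they are illustrative, not any candidate's actual API:

```python
from datetime import date, timedelta

# Hypothetical lifecycle policy mirroring the four phases above.
# Thresholds (90 days, 1 year, 7 years) come from the text; tier
# names are illustrative placeholders.
PHASES = [
    (timedelta(days=90),      "hot"),   # Phase 1: NVMe, replica=3
    (timedelta(days=365),     "warm"),  # Phase 2: SSD
    (timedelta(days=7 * 365), "cold"),  # Phase 3: HDD / object
]
DEEP_ARCHIVE = "deep-archive"           # Phase 4: tape / immutable vault

def tier_for(record_date: date, today: date) -> str:
    """Return the target tier for a record based on its age."""
    age = today - record_date
    for threshold, tier in PHASES:
        if age <= threshold:
            return tier
    return DEEP_ARCHIVE

today = date(2025, 1, 1)
print(tier_for(date(2024, 12, 1), today))  # recent transaction -> hot
print(tier_for(date(2016, 1, 1), today))   # 9-year-old record  -> deep-archive
```

A real implementation would also honor access frequency and explicit policy tags, as noted above, not just age.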
Relationship to Other Concepts
- RAID protects each tier independently. NVMe tier might use replication (mirror), HDD tier might use erasure coding.
- Thin provisioning applies within each tier -- a thin pool on the NVMe tier can overcommit the hot tier capacity.
- IOPS/Throughput/Latency metrics differ dramatically per tier and determine correct workload placement.
- StorageClasses in Kubernetes map directly to storage tiers, allowing workload owners to self-service their tier selection.
6. IOPS / Throughput / Latency
What It Is and Why It Exists
These are the three fundamental performance dimensions of any storage system. They are independent metrics, and optimizing for one often trades off against another. Understanding them is essential because vendor claims like "millions of IOPS" are meaningless without context (block size, queue depth, read/write mix, sequential vs random).
Definitions and Relationships
IOPS (Input/Output Operations Per Second): The number of discrete read or write operations a storage system can perform per second. Each operation transfers one block of data (typically 4 KiB for random workloads, 64-256 KiB for sequential). IOPS measures the "operations per second" capability, regardless of how much data each operation transfers.
Throughput (MB/s or GB/s): The total data transfer rate. Throughput = IOPS x Block Size. A system doing 100,000 IOPS at 4 KiB blocks delivers ~400 MB/s of throughput; a system doing only 1,000 IOPS at 1 MiB blocks delivers ~1,000 MB/s: more than double the throughput at one-hundredth the operation rate.
Latency (microseconds or milliseconds): The time from when an I/O request is submitted until the response is received. This includes queueing time, processing time, network transit (for networked storage), and media access time. Latency is the metric end-users feel most directly -- it determines how "snappy" a database or application is.
The Relationship Triangle
==========================

            IOPS
           /    \
          /      \
         /        \
  Throughput ---- Latency
Throughput = IOPS x Block Size
IOPS = Throughput / Block Size
Latency ≈ 1 / IOPS (at low queue depth, single-threaded)
At high queue depths:
IOPS = Queue Depth / Latency
(Little's Law applied to storage)
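These relationships can be sanity-checked numerically. The figures below reuse the document's own examples (4 KiB blocks at 100,000 IOPS, and the QD=32 point from the NVMe queue-depth table later in this section):

```python
# Sanity-checking the relationships above with the document's own numbers.
def throughput_mib_s(iops: float, block_kib: float) -> float:
    """Throughput = IOPS x block size (KiB/s converted to MiB/s)."""
    return iops * block_kib / 1024

def iops_from_littles_law(queue_depth: int, latency_us: float) -> float:
    """Little's Law: IOPS = in-flight operations / average latency."""
    return queue_depth / (latency_us / 1_000_000)

# 100,000 IOPS at 4 KiB -> ~391 MiB/s (the text rounds this to ~400 MB/s)
print(round(throughput_mib_s(100_000, 4)))    # 391
# QD=32 at 178 us average latency -> ~180K IOPS
print(round(iops_from_littles_law(32, 178)))  # 179775
```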
I/O Patterns and Their Metrics
| Workload | Pattern | Primary Metric | Typical Target |
|---|---|---|---|
| OLTP Database (PostgreSQL, Oracle) | Random 4-8K read/write | IOPS + Latency | 10,000-50,000 IOPS, <1 ms |
| Data Warehouse (analytics query) | Sequential 64-256K read | Throughput | 1-10 GB/s |
| VM boot | Sequential 4-64K read | Throughput + Latency | >500 MB/s, <5 ms |
| Log ingestion | Sequential 4-64K write | Throughput | 200-1,000 MB/s |
| Email server | Random 4K mixed R/W | IOPS | 5,000-20,000 IOPS |
| VDI (Virtual Desktop) | Random 4K read-heavy | IOPS + Latency | 20-50 IOPS/desktop, <5 ms |
| Backup write | Sequential 1M write | Throughput | 1-5 GB/s |
The I/O Path: Where Latency Accumulates
Understanding where latency comes from is essential for diagnosing storage performance problems:
End-to-End I/O Path (VM on HCI with SDS)
==========================================
Application: write(fd, buf, 4096)
| ~1-5 us
v
Guest Kernel: VFS -> ext4 -> submit_bio()
| ~1-3 us
v
virtio-blk / virtio-scsi driver
| ~2-5 us
v
Hypervisor: QEMU I/O thread -> host block layer
| ~1-3 us
v
SDS Client (Ceph librbd / S2D ReFS)
| ~5-20 us
v
+--- Local path (data on this node) -----> ~50-100 us (NVMe)
| ~5-15 ms (HDD)
|
+--- Network path (data on remote node)
| ~5-30 us (RDMA/RoCE)
v ~50-200 us (TCP/IP)
Network transit
|
v
Remote SDS daemon (Ceph OSD / S2D)
| ~5-20 us
v
Local block layer -> NVMe/SSD/HDD
~50-100 us (NVMe)
~100-500 us (SSD)
~5-15 ms (HDD)
Total end-to-end latency examples:
Local NVMe path: ~100-200 us
Remote NVMe path: ~200-400 us (TCP), ~150-250 us (RDMA)
Remote HDD path: ~10-20 ms
For comparison, VMware vSAN typical latency:
All-flash local: ~200-300 us
All-flash remote: ~300-500 us
Queue Depth and Its Effect on IOPS
Queue depth (QD) is the number of I/O operations in flight simultaneously. It is the single most important tuning parameter for storage performance:
IOPS vs Queue Depth (typical NVMe SSD)
========================================
QD | IOPS (4K rand read) | Latency (avg)
-----+-----------------------+----------------
1 | 10,000 | 100 us
2 | 19,000 | 105 us
4 | 36,000 | 110 us
8 | 65,000 | 123 us
16 | 110,000 | 145 us
32 | 180,000 | 178 us
64 | 300,000 | 213 us
128 | 450,000 | 284 us
256 | 500,000 | 512 us <-- saturation
512 | 500,000 | 1,024 us <-- queueing delay
Observations:
- IOPS scales almost linearly with QD up to device saturation
- Latency increases slowly at first, then sharply past saturation
- Optimal QD: highest IOPS before latency degrades significantly
- For VMs: each VM typically generates QD 1-4, so 100 VMs on
one node generate aggregate QD 100-400 across shared storage
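Finding the "optimal QD" can be automated: sweep queue depth (e.g. with fio) and pick the knee of the IOPS curve. A minimal sketch using the measured points from the table above; the 15% marginal-gain threshold is an arbitrary assumption, not a standard:

```python
# Illustrative knee-finding over the QD curve above: keep doubling QD
# as long as the marginal IOPS gain still exceeds a chosen threshold.
curve = [  # (queue_depth, iops, latency_us) -- copied from the table
    (1, 10_000, 100), (2, 19_000, 105), (4, 36_000, 110),
    (8, 65_000, 123), (16, 110_000, 145), (32, 180_000, 178),
    (64, 300_000, 213), (128, 450_000, 284),
    (256, 500_000, 512), (512, 500_000, 1024),
]

def optimal_qd(points, min_gain=0.15):
    """Return the last QD where doubling QD still adds >= min_gain IOPS."""
    best = points[0][0]
    for (_, iops_a, _), (qd_b, iops_b, _) in zip(points, points[1:]):
        if iops_b / iops_a - 1 >= min_gain:
            best = qd_b
        else:
            break
    return best

print(optimal_qd(curve))  # 128: beyond this, IOPS flatlines and latency doubles
```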
Linux I/O Schedulers: Controlling Fairness
The Linux kernel's I/O scheduler sits between the filesystem/block layer and the device driver. It reorders, merges, and prioritizes I/O requests. The choice of scheduler directly impacts fairness between VMs sharing the same physical storage.
| Scheduler | Kernel Config | Best For | VM Relevance |
|---|---|---|---|
| `none` / `noop` | `elevator=none` | NVMe devices (already have internal schedulers) | Default for NVMe, correct for HCI |
| `mq-deadline` | `elevator=mq-deadline` | SSD/HDD devices, ensures bounded latency | Good for SCSI/SATA, prevents starvation |
| `bfq` (Budget Fair Queueing) | `elevator=bfq` | Desktop/interactive, per-process fairness | Rarely used in server/VM contexts |
| `kyber` | `elevator=kyber` | Fast SSDs, two-level priority (read/write) | Experimental, not widely deployed |
In a virtualized HCI environment:
- Host-level scheduler (`none` for NVMe) handles physical device queues
- Guest-level scheduler (`mq-deadline` or `none`) handles the virtual disk queues
- SDS layer (Ceph, S2D) may implement its own I/O prioritization between clients
The interaction between these three levels of scheduling is where "noisy neighbor" problems originate. A VM with aggressive ionice settings inside the guest cannot override the host-level or SDS-level fair-share policies -- but misconfigured schedulers can allow it.
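To verify which scheduler is active on each device (host or guest), the kernel exposes it in sysfs: the bracketed entry in `/sys/block/<dev>/queue/scheduler` is the active one. A small read-only helper:

```python
from pathlib import Path

def active_scheduler(sysfs_text: str) -> str:
    """Parse a scheduler sysfs line; the active scheduler is bracketed,
    e.g. '[none] mq-deadline kyber bfq' -> 'none'."""
    if "[" in sysfs_text:
        return sysfs_text.split("[", 1)[1].split("]", 1)[0]
    return sysfs_text.strip()  # some virtual devices list a single bare value

# On a Linux host this prints the active scheduler per block device
# (device names vary; nothing is written, so it is safe to run anywhere):
for f in Path("/sys/block").glob("*/queue/scheduler"):
    print(f.parts[3], active_scheduler(f.read_text()))
```

For an NVMe device under an HCI host, the expected output is `none`; anything else is worth investigating.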
Benchmarking: How to Measure Correctly
The standard tool for storage benchmarking is fio (Flexible I/O Tester). Key parameters that must be controlled for meaningful results:
# Random 4K read IOPS test (typical OLTP simulation)
# ioengine=libaio : Linux async I/O (closest to real VM I/O)
# direct=1        : bypass page cache (measures actual storage)
# bs=4k           : block size = 4 KiB
# rw=randread     : random read pattern
# iodepth=32      : queue depth
# numjobs=4       : 4 parallel workers
# runtime=300     : 5-minute sustained test
fio --name=rand_read \
    --ioengine=libaio \
    --direct=1 \
    --bs=4k \
    --rw=randread \
    --iodepth=32 \
    --numjobs=4 \
    --size=10G \
    --runtime=300 \
    --group_reporting \
    --filename=/dev/vdb    # test directly on the block device

# Sequential throughput test (typical backup/restore simulation)
# bs=1M    : large block size for throughput
# rw=write : sequential write
fio --name=seq_write \
    --ioengine=libaio \
    --direct=1 \
    --bs=1M \
    --rw=write \
    --iodepth=8 \
    --numjobs=1 \
    --size=100G \
    --runtime=300 \
    --filename=/dev/vdb

# Mixed workload (70% read, 30% write, typical enterprise)
# rwmixread=70 : 70% reads, 30% writes
fio --name=mixed \
    --ioengine=libaio \
    --direct=1 \
    --bs=8k \
    --rw=randrw \
    --rwmixread=70 \
    --iodepth=16 \
    --numjobs=8 \
    --size=10G \
    --runtime=300 \
    --filename=/dev/vdb
Common benchmarking mistakes to avoid:
- Not using `--direct=1`: Without this, Linux page cache absorbs reads/writes, and you measure RAM speed, not storage speed.
- Too short runtime: SSDs and distributed storage need 60+ seconds to reach steady state. Burst performance is not sustained performance.
- Wrong block size: 4K random is not the same as 1M sequential. Match the block size to the actual workload.
- Ignoring latency percentiles: Average latency hides tail latency. Report p50, p95, p99, p99.9. A system with 200 us average but 50 ms p99 will cause application timeouts.
- Testing empty volumes: Thin-provisioned volumes perform differently when first written (allocation overhead) vs. when overwritten (steady state).
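To report percentiles rather than averages, fio can emit JSON (`--output-format=json`). A hedged sketch of extracting tail latency: the `clat_ns` percentile layout below matches recent fio versions (completion latency in nanoseconds, percentile keys formatted like "99.000000"), but verify against your fio build before relying on it:

```python
import json

def tail_latency_ms(fio_json: dict, job: int = 0, op: str = "read") -> dict:
    """Pull p50/p95/p99/p99.9 completion latency (ms) from fio JSON output."""
    pct = fio_json["jobs"][job][op]["clat_ns"]["percentiles"]
    wanted = ("50.000000", "95.000000", "99.000000", "99.900000")
    return {p: ns / 1_000_000 for p, ns in pct.items() if p in wanted}

# Real usage: data = json.load(open("fio-result.json"))
# Minimal fabricated sample for illustration:
sample = {"jobs": [{"read": {"clat_ns": {"percentiles": {
    "50.000000": 200_000, "99.000000": 1_500_000, "99.900000": 4_800_000}}}}]}
print(tail_latency_ms(sample))
# {'50.000000': 0.2, '99.000000': 1.5, '99.900000': 4.8}
```

A run with 0.2 ms median but 50 ms p99 would pass an "average latency" check and still cause application timeouts, which is exactly why the percentile view matters.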
Performance Targets for 5,000+ VMs
Based on typical enterprise workload profiles:
| Metric | Target (aggregate cluster) | Per-VM average | Notes |
|---|---|---|---|
| Random 4K read IOPS | 500,000 - 2,000,000 | 100 - 400 | Assumes 50% of VMs are active simultaneously |
| Random 4K write IOPS | 100,000 - 500,000 | 20 - 100 | Writes are typically 20-30% of total I/O |
| Sequential throughput | 10 - 50 GB/s | 2 - 10 MB/s | Aggregate across all VMs |
| Read latency (p99) | < 1 ms | < 1 ms | For OLTP workloads |
| Write latency (p99) | < 2 ms | < 2 ms | For OLTP workloads |
| Read latency (p99.9) | < 5 ms | < 5 ms | Tail latency SLA |
These numbers are achievable with all-NVMe or NVMe+SSD HCI configurations with 50-100 nodes. They should be validated in the PoC against the actual VMware baseline using identical fio test profiles.
Profiling Real Workloads: Establishing the VMware Baseline
Before evaluating candidates, we must establish the performance profile of our current VMware environment. This means instrumenting the existing 5,000+ VMs to understand the actual I/O demand, not theoretical maximums.
Step 1: Capture aggregate storage statistics from vCenter
Key metrics to export from vCenter / esxtop / vRealize Operations:
- Per-VM: avg IOPS (read/write), avg latency, avg throughput
- Per-datastore: aggregate IOPS, capacity used vs provisioned
- Per-host: storage adapter queue depth, device latency
- Time range: at least 30 days to capture monthly patterns
Step 2: Categorize VMs by I/O profile
Typical Distribution (enterprise, 5,000 VMs):
Profile Category VMs Avg IOPS/VM Pattern Tier
----------------- ----- ----------- --------------- --------
Idle / near-idle 2,500 0-5 Negligible Cold
Light I/O 1,500 5-50 Mostly reads Warm
Moderate I/O 700 50-500 Mixed R/W Standard
Database / OLTP 250 500-5,000 Random, write-heavy Hot
High-performance 50 5,000+ Extreme random Hot+
This distribution is heavily skewed: the ~300 database and
high-performance VMs (~6% of the fleet) generate roughly 60% of
total storage I/O. These are the VMs that determine the storage
architecture requirements.
Step 3: Calculate aggregate demand
Using the example distribution above:
- Total cluster IOPS: ~500,000 (sum of all VMs at peak)
- Required aggregate throughput: ~5 GB/s
- Critical latency VMs: 300 VMs need p99 < 1 ms
This baseline becomes the acceptance criterion for the PoC: the candidate platform must match or exceed these numbers under equivalent load.
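The Step 3 arithmetic can be reproduced from the Step 2 distribution. The per-category IOPS figures below are midpoint-style assumptions chosen to land near the stated ~500K aggregate; the real numbers must come from the vCenter export described in Step 1:

```python
# Recomputing the aggregate-demand estimate from the example distribution.
# Per-VM IOPS values are assumptions within the ranges given in Step 2.
profiles = [  # (category, vm_count, assumed avg IOPS/VM at peak)
    ("idle",      2500,    3),
    ("light",     1500,   30),
    ("moderate",   700,  200),
    ("oltp",       250,  800),
    ("high-perf",   50, 2500),
]

total_iops = sum(n * iops for _, n, iops in profiles)
hot_iops = sum(n * iops for cat, n, iops in profiles
               if cat in ("oltp", "high-perf"))
print(f"total ~{total_iops:,} IOPS, hot-tier share {hot_iops / total_iops:.0%}")
# total ~517,500 IOPS, hot-tier share 63%
```

Swapping in measured per-category averages turns this into the actual PoC acceptance number.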
Write Amplification in Distributed Storage
One critical factor that vendor IOPS numbers often omit is write amplification -- the ratio of actual physical writes to application-level writes. In distributed storage:
Write Amplification Examples
==============================
Application writes 1 x 4 KiB block:
Ceph (replica=3):
-> 1 primary write + 2 replica writes = 3 physical writes
-> Plus WAL/journal write on each OSD = up to 6 writes
-> Write amplification: 3-6x
S2D (3-way mirror):
-> 1 primary + 2 mirror copies = 3 physical writes
-> Plus ReFS metadata update = ~3.5 writes
-> Write amplification: 3-4x
Ceph (EC 4+2):
-> 4 data chunks + 2 coding chunks = 6 writes for 4 blocks
-> Effective amplification per block: 1.5x (better for large writes)
-> But small writes require read-modify-write: up to 4-6x
Implication: If your baseline VMware IOPS is 500,000 writes/sec,
the physical storage must handle 1.5-3M writes/sec in an HCI model.
Size NVMe endurance (DWPD) accordingly.
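The endurance implication can be quantified. A sketch using the example figures from this section; the 2 PB usable NVMe pool size is an assumption for illustration:

```python
# Hedged endurance sizing: translate baseline write IOPS plus write
# amplification into a required DWPD (Drive Writes Per Day) rating.
def required_dwpd(write_iops: float, block_kib: float,
                  amplification: float, usable_capacity_tb: float) -> float:
    """Physical bytes written per day / usable capacity = DWPD requirement."""
    bytes_per_day = write_iops * block_kib * 1024 * amplification * 86_400
    return bytes_per_day / (usable_capacity_tb * 1e12)

# 500K baseline write IOPS at 4 KiB, 3x amplification, 2 PB usable NVMe:
print(round(required_dwpd(500_000, 4, 3.0, 2_000), 2))  # 0.27
```

At petabyte scale the per-drive DWPD requirement stays modest; shrink the pool or raise amplification (small EC writes) and the requirement climbs quickly, which is why the amplification factor belongs in procurement specs.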
How the Candidates Handle This
Comparison Table
| Aspect | VMware (Current) | OVE (OpenShift Virtualization Engine) | Azure Local | Swisscom ESC |
|---|---|---|---|---|
| Block storage | vSAN objects / VMFS datastores on SAN LUNs | Ceph RBD via CSI (ODF), or external SAN via CSI | Storage Spaces Direct (S2D) volumes on ReFS | Dell PowerMax/PowerStore LUNs, abstracted as service tiers |
| File storage | NFS datastores, VMFS shared | CephFS via CSI, NFS via external, Multus-attached NFS | SMB3 shares on S2D, native SMB Direct with RDMA | NFS/SMB as managed service |
| Object storage | Not native (3rd party: MinIO, Dell ECS) | Ceph Object Gateway (RGW) via ODF, S3-compatible | Azure Blob (cloud), no native on-prem object | Object storage as managed service |
| LVM usage | ESXi does not use LVM; VMFS is proprietary | Ceph OSDs use LVM (BlueStore); host OS may use LVM; TopoLVM for local PVs | Not applicable (S2D uses ReFS, no LVM) | Abstracted by provider; Dell arrays use proprietary volume management |
| RAID approach | vSAN: software-defined (FTT=1 mirror, FTT=2, erasure coding); SAN: array-managed | Ceph: replica=2/3 or erasure coding (k+m), per-pool configurable | S2D: 2-way mirror, 3-way mirror, or parity (single/dual), per-volume | Provider-managed, typically Dell RAID + array replication |
| Thin provisioning | vSAN thin by default; VMDK thin/thick; SAN depends on array | Ceph RBD thin by default (objects created on write); QCOW2 thin | S2D thin provisioning via ReFS; VHDX dynamic | Provider-managed, transparent to consumer |
| Storage tiering | vSAN: all-flash (NVMe cache + SSD capacity) or hybrid (SSD cache + HDD capacity) | ODF: manual via StorageClasses (NVMe pool vs HDD pool); no automatic intra-pool tiering | S2D: automatic tiering across NVMe/SSD/HDD within a single volume (slab-based, 256 MiB granularity) | Service tiers (e.g., "Platinum", "Gold", "Silver") mapped to different media classes |
| IOPS/perf model | vSAN: SIOC for QoS, per-VM IOPS limits; SAN: array-level QoS | ODF: StorageClass-based QoS, Ceph QoS (`rbd_qos_*`), rate limiting at CephFS/RBD level | S2D: Storage QoS policies (min/max IOPS per volume), integrated with Failover Clustering | SLA-based, contractually defined per service class |
| TRIM/DISCARD | VMFS UNMAP, vSAN automatic space reclamation | Full TRIM chain: guest -> virtio-scsi -> Ceph RBD DISCARD -> BlueStore | Full TRIM chain: guest -> Hyper-V -> S2D -> physical SSD TRIM | Provider-managed |
| Benchmarking | VMware vdbench, esxtop, vscsiStats | `fio` inside VMs, Ceph built-in benchmarks (`rados bench`, `rbd bench`), node-level `iostat` | `diskspd` (Windows `fio` equivalent), S2D performance counters, `Get-StorageSubsystem` | Not customer-accessible; SLA adherence measured by provider |
| Max cluster IOPS (all-NVMe, estimated) | vSAN: ~2-4M IOPS per cluster (64-node, all-flash) | ODF: ~2-5M IOPS per cluster (100-node, all-NVMe, replica=3) | S2D: ~5-13M IOPS per cluster (16-node, all-NVMe) per Microsoft claims | Provider SLA, not disclosed per cluster |
Detailed Analysis
OVE / OpenShift Data Foundation (ODF): ODF wraps Ceph, which is the most flexible storage engine among the candidates. It provides unified block (RBD), file (CephFS), and object (RGW) storage from a single platform. Ceph's CRUSH algorithm distributes data across failure domains (racks, nodes, DCs) with configurable redundancy per pool. This means you can have a replica-3 pool for hot VM disks and an EC 4+2 pool for cold data on the same cluster. The trade-off is complexity: Ceph has many tuning knobs (PG count, OSD memory, BlueStore cache size, recovery throttling) and requires deep operational expertise to run well at scale. Red Hat abstracts some of this through the Rook-Ceph operator, but the underlying Ceph architecture must be understood for troubleshooting.
ODF does not currently offer automatic intra-pool tiering (moving individual data blocks between NVMe and HDD within a single pool based on access frequency). Tiering is achieved by defining separate Ceph pools on different media types and assigning workloads to the appropriate pool via StorageClasses. This is simpler to reason about but requires the workload owner to make the right tier selection at provisioning time.
Azure Local / Storage Spaces Direct (S2D): S2D is the most integrated and opinionated storage solution. It is built into the Windows Server kernel and manages the full I/O path from ReFS filesystem through the software storage bus to physical disks. S2D's key differentiator is automatic tiering: it moves data at 256 MiB slab granularity between NVMe, SSD, and HDD tiers based on access frequency, entirely transparently. A single volume can span all tiers without the workload owner needing to choose.
S2D also offers SMB Direct with RDMA (Remote Direct Memory Access) for storage traffic, which bypasses the TCP/IP stack and delivers near-local-disk latency for remote reads. This is a significant performance advantage for cross-node I/O in HCI. However, S2D is limited to 16 nodes per cluster, which constrains the total IOPS and capacity pool. For 5,000+ VMs, multiple clusters are required, and storage is not shared across clusters.
S2D lacks native object storage (S3-compatible). Object storage requires Azure Blob (cloud) or a third-party solution. This is a gap for on-premises backup and archival workflows.
Swisscom ESC: As a managed service, ESC abstracts all storage internals. The customer selects a service tier ("Platinum" for high-IOPS NVMe, "Gold" for balanced SSD, etc.) and receives a volume with contractually guaranteed performance. The underlying hardware is Dell PowerMax/PowerStore arrays connected via Fibre Channel to Dell VxBlock compute nodes. This is traditional SAN architecture -- the most proven and well-understood model, but also the least flexible and most expensive per GB.
The customer has no visibility into RAID configurations, thin provisioning ratios, or tiering policies. Performance is governed by SLAs, not by architectural understanding. This is a feature for organizations that want to outsource complexity, but a limitation for organizations that want to optimize for specific workload patterns. There is no ability to run custom benchmarks against the storage infrastructure or tune performance parameters.
Key Takeaways
- Block storage is the foundation for VM workloads. All candidates provide it, but the implementation (Ceph RBD vs S2D vs Dell SAN) determines the performance profile, operational model, and failure domain characteristics.
- LVM and device-mapper are the invisible plumbing in Linux-based platforms. Ceph OSDs use LVM internally (BlueStore on LVM), and understanding `dm-thin-pool` is essential for diagnosing ODF capacity issues. Azure Local uses ReFS instead and has no LVM dependency.
- RAID is now a software concept in all HCI candidates. Both Ceph (ODF) and S2D expect disks in JBOD/HBA passthrough mode and handle redundancy in software. This is fundamentally different from the traditional SAN model (Dell PowerMax behind ESC) where RAID is handled by the array controller firmware.
- Thin provisioning is universal but dangerous. All candidates thin-provision by default. The critical operational discipline is monitoring actual physical utilization vs. virtual allocation and alerting before pool exhaustion. At 5,000+ VMs, a runaway write workload on one VM can cascade into a cluster-wide I/O freeze if the thin pool is full.
- Storage tiering is the biggest architectural differentiator between the candidates. S2D (Azure Local) offers fully automatic, transparent tiering within a single volume. ODF (OVE) requires manual tier selection via StorageClasses. ESC offers tiering as opaque service classes. The right model depends on whether you want operational simplicity (S2D), fine-grained control (ODF), or fully outsourced management (ESC).
- IOPS numbers without context are meaningless. Always specify block size, read/write ratio, queue depth, and whether the measurement is at the VM level or the cluster level. The PoC must establish a common benchmarking methodology (fio with identical parameters) across the VMware baseline and all candidates.
- Latency matters more than aggregate IOPS for most enterprise workloads. A database VM does not need 1M IOPS; it needs 5,000 IOPS with p99 latency under 1 ms. Focus PoC measurements on latency percentiles (p50, p95, p99, p99.9), not peak IOPS.
- TRIM/DISCARD propagation is a silent capacity killer. Verify that each candidate supports the full TRIM chain from guest filesystem through the hypervisor/SDS layer to the physical device. Without this, thin-provisioned pools will bloat over months and require manual intervention.
- S2D's 16-node limit has a storage consequence: the total storage pool per cluster is capped at 16 nodes x local disks. For 5,000+ VMs, this means storage is fragmented across multiple independent clusters with no cross-cluster pool. OVE/ODF can have a single 100+ node Ceph cluster. This is a fundamental architectural difference.
Discussion Guide
The following questions are designed to probe vendor and SME understanding of storage architecture in the context of our 5,000+ VM environment. They should be asked during PoC planning, vendor deep-dives, and architecture review sessions.
Questions for All Candidates
- Thin provisioning overcommit policy: "What is your recommended maximum overcommit ratio for thin-provisioned storage in a 5,000-VM environment? What automated actions does the platform take when physical utilization exceeds 85%? 90%? 95%? Can we define per-namespace or per-tenant capacity quotas that enforce hard limits before pool exhaustion?"
- IOPS isolation and noisy neighbor: "If one VM generates 50,000 IOPS of random write load (e.g., a runaway batch job), what mechanisms prevent that VM from degrading the latency of the other 4,999 VMs on the same storage pool? Show us the QoS enforcement path -- where exactly in the I/O stack is the rate limit applied, and what is the granularity (per-VM, per-volume, per-node)?"
- Latency percentiles under load: "At 80% of rated cluster IOPS capacity, what is the p99.9 read latency for a 4K random read? We need this number from a real benchmark, not a datasheet. Can you run this test during our PoC on a cluster sized for 500 VMs?"
- TRIM/DISCARD end-to-end: "Walk us through the exact TRIM/DISCARD propagation path from a Linux VM guest running ext4 with `fstrim`, through the hypervisor, through the SDS layer, to the physical NVMe device. At which layer, if any, does space reclamation happen asynchronously? What is the delay between guest TRIM and physical space return?"
- Failure domain and rebuild impact: "A physical node with 8 NVMe drives (total 30 TB) fails permanently. Describe the data rebuild process: how much data needs to be reconstructed, how long it takes, what I/O overhead the rebuild imposes on running VMs, and what redundancy level remains during the rebuild. What happens if a second node fails during the rebuild?"
Questions Specific to OVE / ODF
- Ceph PG autoscaler and pool sizing: "How does ODF determine the number of Placement Groups per pool? In our scenario with 200+ OSDs across 100 nodes, what is the target PG-per-OSD ratio, and what happens if PG count is misconfigured (too low or too high)? How does Rook-Ceph's PG autoscaler behave during cluster expansion from 50 to 100 nodes?"
- BlueStore tuning for all-NVMe: "The default BlueStore configuration was designed for mixed SSD/HDD environments. For an all-NVMe ODF cluster, what tuning parameters need to change (`bluestore_cache_size`, `bluestore_min_alloc_size`, `rocksdb_cache_size`)? Has Red Hat published a validated all-NVMe ODF performance profile?"
Questions Specific to Azure Local / S2D
- Automatic tiering transparency: "S2D moves data between NVMe, SSD, and HDD tiers automatically. How can we monitor which percentage of a specific VM's data is on which tier at any given time? Is there a way to pin a critical VM's data to the NVMe tier and prevent demotion? What is the tiering slab granularity, and can it cause performance cliffs when a hot 256 MiB slab is partially cold?"
- Cross-cluster storage for 5,000 VMs: "With the 16-node cluster limit, we need approximately 5-6 clusters for 5,000 VMs. How do we handle a VM that needs to access a volume on a different cluster? Is there a federated storage namespace, or is storage strictly cluster-local? How does this affect DR and backup architecture?"
Questions Specific to Swisscom ESC
- Storage performance guarantees and observability: "Your 'Platinum' service tier guarantees X IOPS and Y ms latency. Can we access real-time storage metrics (IOPS, latency percentiles, queue depth) for our tenant via API? If we observe performance below SLA, what is the escalation path and the mean time to resolution? Can we run our own `fio` benchmarks inside VMs to independently verify SLA adherence?"
Architecture-Level Questions (for Internal Discussion)
- Write amplification budget: "Given our measured baseline of N write IOPS on VMware, what is the expected write amplification factor on each candidate platform? How does this affect NVMe endurance planning over a 5-year lifecycle? What DWPD (Drive Writes Per Day) rating should we specify in hardware procurement?"
- Data gravity and migration complexity: "For each candidate, how is VM storage data physically organized? If we need to migrate a 2 TB database VM from one candidate platform to another (or back to VMware as a fallback), what is the data export/import path? Is the VM disk format portable (QCOW2, VMDK, VHDX), or does migration require full data copy through a conversion pipeline?"
- Encryption key management integration: "For encryption at rest, where are the encryption keys stored? Does the platform integrate with our existing HSM (Hardware Security Module) or external KMS (Key Management Service)? What happens to encrypted volumes if the KMS is temporarily unavailable? Who has access to the keys -- our team, the vendor, or both?"
- Capacity planning model: "Walk us through your capacity planning methodology for a 5,000-VM environment with 20% annual growth. How do we model thin provisioning overcommit ratios over time? At what physical utilization percentage do we need to order additional hardware, and what is the lead time from order to operational capacity?"
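For the internal discussion, the capacity-planning question can be grounded in a simple growth model. All parameters below (growth rate, reorder threshold, starting utilization) are assumptions to be replaced with measured values:

```python
# Illustrative capacity model: given annual data growth and a reorder
# threshold, how many quarters until physical utilization crosses it?
def quarters_until_reorder(physical_tb: float, used_tb: float,
                           annual_growth: float = 0.20,
                           reorder_at: float = 0.70) -> int:
    quarterly = (1 + annual_growth) ** 0.25  # compound quarterly factor
    q = 0
    while used_tb / physical_tb < reorder_at:
        used_tb *= quarterly
        q += 1
    return q

# 2 PB physical, 1 PB used today, 20% annual growth, reorder at 70% full:
print(quarters_until_reorder(2000, 1000))  # 8 quarters (~2 years)
```

Comparing this horizon against each vendor's hardware lead time answers the "when do we order" question concretely; thin-provisioning overcommit shortens the horizon and should be modeled with the measured allocation growth, not the virtual allocation.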
Next: 02-vmware-baseline.md -- Current-State VMware Storage Architecture (vSAN, VMFS, SAN Integration)