Software-Defined Storage Platforms

Why This Matters

This is the most consequential technical comparison in the entire storage evaluation. The previous page (04-architectures.md) established that both self-operated candidates -- OVE and Azure Local -- use HCI with software-defined storage. This page goes deep into the specific SDS platforms that power each candidate: Ceph/ODF (OVE) and Storage Spaces Direct (Azure Local). Every operational decision -- capacity planning, failure recovery, performance tuning, disaster recovery, day-2 troubleshooting -- depends on understanding the internals of these two platforms.

The comparison is not symmetric. Ceph and S2D are fundamentally different systems:

Ceph is a distributed object store that provides block, file, and object interfaces. It uses algorithmic data placement (CRUSH), manages one daemon per disk (OSD), and stores data directly on raw block devices (BlueStore). It is Linux-native, Kubernetes-native (via Rook), and has no hard cluster size limit.
S2D is a distributed block storage layer integrated into Windows Server. It uses a Software Storage Bus to pool local disks, layers ReFS on top of storage spaces, and presents volumes as Cluster Shared Volumes (CSVs). It is Windows-native, managed via PowerShell and Azure Arc, and has a hard limit of 16 nodes per cluster.

For a Tier-1 financial enterprise running 5,000+ VMs, this page answers three critical questions:

Architectural depth. What actually happens when a VM writes a 4 KiB block? How does each platform place, replicate, cache, and protect that data? Where are the performance cliffs and failure boundaries?
Operational model. How does each platform handle day-2 operations -- disk replacement, capacity expansion, node failure, performance troubleshooting? What skills does the team need?
Platform selection. Given our scale (5,000+ VMs, multi-rack, potentially multi-site), which platform's architecture better fits our requirements for performance, resilience, and growth?

Concepts

1. Ceph / Rook-Ceph

Ceph Architecture: RADOS

Ceph is built on RADOS -- the Reliable Autonomic Distributed Object Store. RADOS is the foundational layer that handles data distribution, replication, recovery, and consistency. Everything in Ceph -- whether you use block (RBD), file (CephFS), or object (RGW) storage -- ultimately stores data as objects in RADOS.

Ceph RADOS Architecture
=========================

                         Client Access Layer
    +------------+     +------------+     +------------+
    | RBD Client |     | CephFS     |     | RGW        |
    | (Block)    |     | Client     |     | (S3/Swift) |
    | librbd     |     | (File)     |     | HTTP API   |
    +------+-----+     +------+-----+     +------+-----+
           |                  |                  |
           +--------+---------+--------+---------+
                    |                  |
                    v                  v
    +--------------------------------------------------+
    |                    librados                       |
    |          (RADOS client library)                   |
    |  - CRUSH calculation (no central lookup)          |
    |  - Direct OSD communication                      |
    |  - Connection pooling, failure detection          |
    +--------------------------------------------------+
                    |
                    | RADOS protocol (msgr2, TCP/RDMA)
                    |
    +--------------------------------------------------+
    |                  RADOS Cluster                    |
    |                                                  |
    |   MON (x3/5)    MGR (x2)     MDS (x2+, CephFS)  |
    |   +---------+   +--------+   +---------+         |
    |   | Paxos   |   | Metrics|   | Metadata|         |
    |   | quorum  |   | Dashbd |   | for     |         |
    |   | Cluster |   | Alerts |   | CephFS  |         |
    |   | map     |   | Balancr|   | POSIX   |         |
    |   +---------+   +--------+   +---------+         |
    |                                                  |
    |   OSD Daemons (one per physical disk)             |
    |   +-------+ +-------+ +-------+ +-------+       |
    |   |OSD.0  | |OSD.1  | |OSD.2  | |OSD.3  |       |
    |   |NVMe-0 | |NVMe-1 | |SSD-0  | |SSD-1  |       |
    |   |Node 1 | |Node 1 | |Node 2 | |Node 2 |       |
    |   +-------+ +-------+ +-------+ +-------+       |
    |   +-------+ +-------+ +-------+ +-------+       |
    |   |OSD.4  | |OSD.5  | |OSD.6  | |OSD.7  |       |
    |   |SSD-2  | |SSD-3  | |NVMe-0 | |NVMe-1 |       |
    |   |Node 3 | |Node 3 | |Node 4 | |Node 4 |       |
    |   +-------+ +-------+ +-------+ +-------+       |
    |          ... (one OSD per disk, any scale) ...    |
    +--------------------------------------------------+

Ceph Components in Detail

MON (Monitor Daemon): Monitors maintain the cluster map -- the authoritative description of the cluster's topology and state. The cluster map consists of:

Map	Contents
Monitor Map	Monitor hostnames, addresses, epochs
OSD Map	OSD states (up/down, in/out), pool definitions, PG counts, flags
CRUSH Map	Physical topology hierarchy (datacenters, racks, hosts, OSDs), placement rules
MDS Map	Metadata server states, active/standby assignments (CephFS only)
Manager Map	Manager daemon states, active/standby, enabled modules

Monitors use Paxos consensus to agree on map updates. A majority quorum (2-of-3 or 3-of-5) must agree before any map change is committed. When an OSD fails, the surviving OSDs detect the failure via heartbeats and report to the monitors. The monitors update the OSD map, increment the epoch, and distribute the new map to all clients and OSDs. Clients then recalculate data placement using the updated CRUSH map.

Monitor resource requirements (ODF defaults):

3 MON pods (or 5 for large clusters), each requiring 2 GiB RAM + 50 GiB persistent storage
MON data is small but latency-sensitive -- place on SSD/NVMe, never HDD
MON network traffic is minimal except during map distribution bursts

OSD (Object Storage Daemon): The workhorse of Ceph. One OSD daemon runs per physical disk (NVMe or SSD). Each OSD:

Owns a set of Placement Groups (PGs)
Processes client reads and writes for PGs it owns
Replicates writes to peer OSDs (secondary and tertiary replicas)
Sends heartbeats to neighboring OSDs (every 6 seconds, timeout at 20 seconds by default)
Performs scrubbing (background data integrity checking)
Participates in recovery and backfill when the PG membership changes

A node with 12 NVMe drives runs 12 independent OSD processes. Each OSD manages its own BlueStore instance, its own RocksDB for metadata, and its own journal (WAL). OSDs are the primary consumers of RAM and CPU on storage nodes.

OSD resource requirements (per OSD):

Default: osd_memory_target = 4 GiB (auto-tuned, can use more under load)
CPU: 1-2 cores per OSD (more for NVMe-backed OSDs with high IOPS)
Network: each OSD needs both a "public" network (client traffic) and a "cluster" network (replication traffic)

MDS (Metadata Server): Required only for CephFS (the POSIX-compatible distributed filesystem). MDS manages the filesystem metadata -- the directory hierarchy, file names, permissions, timestamps, and inode data. Data blocks are still stored in RADOS as objects; MDS handles only the namespace.

MDS is relevant for VMs that mount CephFS for shared file access (ReadWriteMany volumes in Kubernetes). For pure block storage (VM boot/data disks via RBD), MDS is not involved.

MGR (Manager Daemon): Runs as active/standby pair. Hosts modules:

Module	Function
Dashboard	Web UI for cluster monitoring and management
Prometheus	Exports Ceph metrics in Prometheus format
Balancer	Automatically optimizes PG distribution across OSDs
Rook	Integration module for Rook operator orchestration
Telemetry	Optional anonymous usage reporting to upstream
Crash	Collects and stores daemon crash dumps

CRUSH Algorithm: Data Placement Without a Lookup Table

CRUSH (Controlled Replication Under Scalable Hashing) is the algorithm that determines where every object in Ceph is stored. Unlike traditional storage systems that use a central metadata server to look up data locations, CRUSH is a deterministic, pseudo-random algorithm that any client can execute independently. Given an object name and the current CRUSH map, any node can compute the exact OSDs that hold (or should hold) that object's replicas.

This is Ceph's most important architectural advantage: no metadata bottleneck, no single point of failure for data location, and placement scales to thousands of OSDs without a lookup directory.

CRUSH Placement Flow
======================

Input:
  Object name:  "rbd_data.1a2b3c.0000000000000042"
  Pool ID:      3 (ceph-blockpool, RF=3)
  PG count:     256 (for this pool)
  CRUSH map:    current cluster topology
  CRUSH rule:   "replicated_rack" (failure domain = rack)

Step 1: Object -> Placement Group (PG)
  PG_id = hash(object_name) mod PG_count
  PG_id = hash("rbd_data.1a2b3c.0000000000000042") mod 256
  PG_id = 3.a7   (pool 3, PG 0xa7 = 167)

Step 2: PG -> OSD set (CRUSH calculation)
  Input:  PG_id (3.a7), CRUSH map, CRUSH rule

  CRUSH Rule "replicated_rack":
    Step 1: take(root default)           # start at top of hierarchy
    Step 2: chooseleaf 3 type rack       # select 3 items of type "rack"
             using straw2 algorithm       # weighted pseudo-random selection
    Step 3: emit                          # output the selected OSDs

  CRUSH Map Hierarchy (simplified):

    root default
    +-- rack rack01
    |   +-- host node01
    |   |   +-- osd.0  (weight 3.49, 3.49 TiB NVMe)
    |   |   +-- osd.1  (weight 3.49)
    |   |   +-- osd.2  (weight 3.49)
    |   +-- host node02
    |       +-- osd.3  (weight 3.49)
    |       +-- osd.4  (weight 3.49)
    |       +-- osd.5  (weight 3.49)
    +-- rack rack02
    |   +-- host node03
    |   |   +-- osd.6  (weight 3.49)
    |   |   +-- osd.7  (weight 3.49)
    |   |   +-- osd.8  (weight 3.49)
    |   +-- host node04
    |       +-- osd.9  (weight 3.49)
    |       +-- osd.10 (weight 3.49)
    |       +-- osd.11 (weight 3.49)
    +-- rack rack03
        +-- host node05
        |   +-- osd.12 (weight 3.49)
        |   +-- osd.13 (weight 3.49)
        |   +-- osd.14 (weight 3.49)
        +-- host node06
            +-- osd.15 (weight 3.49)
            +-- osd.16 (weight 3.49)
            +-- osd.17 (weight 3.49)

  CRUSH straw2 Selection (for PG 3.a7):
    1. Compute pseudo-random hash for each rack:
       hash(3.a7, rack01) * weight(rack01) = 0.72
       hash(3.a7, rack02) * weight(rack02) = 0.91  <- highest
       hash(3.a7, rack03) * weight(rack03) = 0.45
    2. Select rack02 for first replica -> descend into rack02
       -> chooseleaf selects host, then OSD within host
       -> Selected: osd.8 (on node03, rack02)
    3. Repeat for second replica (exclude rack02):
       hash(3.a7+1, rack01) * weight(rack01) = 0.63
       hash(3.a7+1, rack03) * weight(rack03) = 0.88  <- highest
       -> Selected: osd.15 (on node06, rack03)
    4. Repeat for third replica (exclude rack02, rack03):
       -> Selected: osd.1 (on node01, rack01)

Result:
  PG 3.a7 -> [osd.8 (primary), osd.15, osd.1]
              rack02         rack03     rack01

  Every client, OSD, and monitor computes this SAME result
  independently. No lookup table. No metadata server.

CRUSH Map Structure:

The CRUSH map is a hierarchical tree of "buckets" that mirrors the physical topology:

CRUSH Map Hierarchy (Enterprise Example)
==========================================

root default                              # Logical root
|
+-- datacenter dc-zurich                  # Site level
|   +-- room room-a
|   |   +-- rack rack-a01
|   |   |   +-- host node-01
|   |   |   |   +-- osd.0  (class: nvme, weight: 3.49)
|   |   |   |   +-- osd.1  (class: nvme, weight: 3.49)
|   |   |   |   +-- osd.2  (class: ssd,  weight: 7.28)
|   |   |   |   +-- osd.3  (class: ssd,  weight: 7.28)
|   |   |   +-- host node-02
|   |   |       +-- osd.4  (class: nvme, weight: 3.49)
|   |   |       +-- osd.5  (class: nvme, weight: 3.49)
|   |   |       +-- osd.6  (class: ssd,  weight: 7.28)
|   |   |       +-- osd.7  (class: ssd,  weight: 7.28)
|   |   +-- rack rack-a02
|   |       +-- host node-03 ...
|   |       +-- host node-04 ...
|   +-- room room-b
|       +-- rack rack-b01 ...
|       +-- rack rack-b02 ...
|
+-- datacenter dc-bern                    # Second site
    +-- room room-c
        +-- rack rack-c01 ...
        +-- rack rack-c02 ...

Device Classes:
  "nvme"  -> fast tier (VM boot disks, database volumes)
  "ssd"   -> standard tier (general VM storage)
  "hdd"   -> capacity tier (backups, archives -- if used)

CRUSH Rules (examples):
  rule replicated_rack_nvme:
    take(root default, class nvme)
    chooseleaf 3 type rack
    emit
    -> 3 replicas on NVMe, each on a different rack

  rule replicated_host_ssd:
    take(root default, class ssd)
    chooseleaf 3 type host
    emit
    -> 3 replicas on SSD, each on a different host

  rule ec_rack_ssd:
    take(root default, class ssd)
    chooseleaf 6 type rack      # 4 data + 2 parity = 6
    emit
    -> erasure-coded across 6 racks on SSD

Key CRUSH Concepts:

Straw2 algorithm: The selection algorithm within each bucket. Each item (OSD, host, rack) draws a "straw" with length = hash * weight. The longest straw wins. Straw2 ensures that adding or removing an OSD causes minimal data movement -- only data mapped to the changed OSD moves.
Weights: Each OSD's weight is proportional to its capacity (in TiB). A 3.49 TiB NVMe gets weight 3.49; a 7.28 TiB SSD gets weight 7.28. CRUSH distributes data proportionally to weights.
Device classes: A tag on each OSD (nvme, ssd, hdd) that allows CRUSH rules to select only devices of a specific class. This enables separate pools for different performance tiers using the same physical topology.
Failure domain: The CRUSH rule's type parameter (host, rack, datacenter) determines the minimum failure domain separation. With type rack, CRUSH guarantees each replica is on a different rack.

Data Flow: Write Path in Detail

Understanding the exact write path is critical for performance analysis and troubleshooting. Here is what happens when a VM writes a 4 KiB block to a Ceph RBD volume:

Ceph Write Path (RBD, RF=3)
==============================

1. VM Application: write(fd, data, 4096)
   |
   v
2. Guest Kernel: filesystem (ext4/XFS) -> bio -> virtio-blk
   |
   v
3. QEMU (hypervisor): virtio-blk backend -> librbd
   |
   v
4. librbd (Ceph client library):
   a. Determine RBD object: offset / object_size (default 4 MiB)
      -> object name: "rbd_data.<image_id>.<stripe_unit>"
   b. Compute PG: hash(object_name) mod pg_num -> PG 3.a7
   c. Run CRUSH: PG 3.a7 -> [osd.8, osd.15, osd.1]
   d. Send write to PRIMARY OSD (osd.8)
   |
   v
5. Primary OSD (osd.8):
   a. Receive write request
   b. Assign version number (epoch.version)
   c. Write to local BlueStore WAL (Write-Ahead Log on NVMe)
   d. Forward write to SECONDARY OSDs (osd.15, osd.1) in parallel
   |           |                |
   v           v                v
6. osd.8     osd.15          osd.1
   (local)   (replica 2)     (replica 3)
   WAL write WAL write       WAL write
   ~50 us    ~50 us + net    ~50 us + net
   |           |                |
   |           +--- ACK --------+----> Primary OSD
   |                                   (waits for all ACKs)
   v
7. Primary OSD: all replicas confirmed WAL write
   -> Send ACK to client
   |
   v
8. librbd: write acknowledged
   -> Return to QEMU -> guest sees write as durable
   |
   (Asynchronously, in background:)
   v
9. Each OSD: BlueStore deferred write
   - Flush WAL entry to data area on the main block device
   - Update RocksDB metadata (object -> extent mapping)
   - Reclaim WAL space

Total write latency (all-NVMe, same datacenter):
  - Local WAL write:   ~30-50 us
  - Network RTT:       ~10-30 us (RDMA) or ~50-200 us (TCP)
  - Remote WAL writes: ~30-50 us (parallel with network)
  - Coordination:      ~10-20 us
  -----------------------------------------------
  Total p50:           ~150-300 us (RDMA)
  Total p50:           ~300-700 us (TCP)
  Total p99:           ~500-2000 us (depends on load)

Key observations about the write path:

The write is durable once all replicas have written to WAL (journal). This is the "commit" point.
The client does NOT wait for data to be flushed from WAL to the main data area. WAL persistence is the durability guarantee.
Network latency is on the critical path for every write (replication). This is why RDMA (RoCE v2) is strongly recommended for Ceph storage networks.
The primary OSD serializes writes to the same object. Concurrent writes to different objects in different PGs can be served by different primary OSDs in parallel.

BlueStore: The OSD Storage Backend

BlueStore is Ceph's default (and only production-supported) OSD backend since the Luminous release. It replaced FileStore, which layered Ceph on top of a POSIX filesystem (XFS). BlueStore writes directly to a raw block device, eliminating the double-write penalty and filesystem overhead.

BlueStore On-Disk Layout (Single OSD)
========================================

Physical Block Device: /dev/nvme0n1 (3.49 TiB NVMe SSD)

  +------------------------------------------------------------------+
  |                     Raw Block Device                              |
  |                                                                  |
  |  +-------------------+  +------+  +---------------------------+  |
  |  | BlueFS (internal  |  | WAL  |  | Data Area               |  |
  |  | filesystem)       |  |      |  | (main object storage)   |  |
  |  |                   |  |      |  |                         |  |
  |  | Contains:         |  | Write|  | - Raw object data       |  |
  |  | - RocksDB SST     |  | Ahead|  | - Allocated in extents  |  |
  |  |   files           |  | Log  |  | - No filesystem         |  |
  |  | - RocksDB WAL     |  |      |  | - Direct I/O            |  |
  |  | - Metadata        |  |      |  | - Checksummed (crc32c)  |  |
  |  |                   |  |      |  |                         |  |
  |  | Size: ~1-5% of    |  | Size:|  | Size: ~93-98% of        |  |
  |  | device            |  | ~1%  |  | device                  |  |
  |  +-------------------+  +------+  +---------------------------+  |
  +------------------------------------------------------------------+

  Alternative: Separate Devices for WAL + DB
  ===========================================

  When mixing device types (e.g., NVMe for WAL+DB, SSD for data):

  +---------------------+       +---------------------+
  | NVMe Device         |       | SSD Device          |
  | /dev/nvme0n1        |       | /dev/sda            |
  | (fast, small)       |       | (larger, slower)    |
  |                     |       |                     |
  | +---------+------+  |       | +------------------+|
  | |RocksDB  | WAL  |  |       | | Data Area        ||
  | |DB       |      |  |       | | (main objects)   ||
  | |(SST     |      |  |       | |                  ||
  | | files)  |      |  |       | |                  ||
  | |         |      |  |       | |                  ||
  | | ~30 GiB | ~2GiB|  |       | | ~7 TiB           ||
  | +---------+------+  |       | +------------------+|
  +---------------------+       +---------------------+

  Recommended WAL+DB sizing (ODF guidance):
    WAL:  1-2 GiB per OSD (absorbs burst writes)
    DB:   ~30-50 GiB per OSD (RocksDB metadata, grows with data)
    Rule: DB should be ~1-4% of OSD data capacity

BlueStore Internal Write Flow:
  1. Client write arrives at OSD
  2. Small write (<= min_alloc_size, default 4K for SSD, 64K for HDD):
     -> Write data directly to WAL (deferred write)
     -> Update RocksDB key: object -> [offset, length, checksum]
     -> Later: async flush from WAL to data area
  3. Large write (> min_alloc_size):
     -> Allocate new extent in data area
     -> Write data directly to data area (non-overwrite, COW)
     -> Update RocksDB key: object -> [new_extent, checksum]
     -> No WAL involvement for data (only metadata in WAL)

Key BlueStore Parameters:
  bluestore_min_alloc_size:     4096 (SSD) / 65536 (HDD)
    Minimum allocation unit. Smaller = less wasted space, more metadata.
    For all-NVMe: 4096 is correct.

  bluestore_cache_size:         Auto (based on osd_memory_target)
    In-memory cache for hot data and metadata. Managed by the OSD
    memory auto-tuner. Typical effective cache: 2-3 GiB per OSD.

  osd_memory_target:            4294967296 (4 GiB default)
    Total memory budget per OSD. The OSD auto-tuner distributes
    this among BlueStore cache, RocksDB block cache, PG metadata,
    and operational buffers. For NVMe-dense nodes, increase to 6-8 GiB.

  bluestore_compression_algorithm:  none (default), snappy, zstd, lz4
    Inline compression. zstd gives best ratio; snappy gives best speed.
    Enable per-pool for compressible workloads (logs, backups).
    Caution: compression increases CPU usage per OSD significantly.

Placement Groups (PGs): The Hashing Intermediary

Objects in Ceph are not mapped directly to OSDs. Instead, there is an intermediate layer called Placement Groups (PGs). PGs serve as the unit of replication, recovery, and rebalancing.

Object-to-OSD Mapping via Placement Groups
=============================================

                Objects (millions)
   +--------+--------+--------+--------+--------+
   | obj_01 | obj_02 | obj_03 | obj_04 | obj_05 | ...
   +---+----+---+----+---+----+---+----+---+----+
       |        |        |        |        |
       | hash mod pg_num (256)
       v        v        v        v        v
   +--------+--------+--------+--------+
   | PG 0   | PG 1   | PG 2   | PG 255 |    256 PGs per pool
   | (holds | (holds | (holds | (holds |    (manageable number)
   | ~N/256 | ~N/256 | ~N/256 | ~N/256 |
   | objs)  | objs)  | objs)  | objs)  |
   +---+----+---+----+---+----+---+----+
       |        |        |        |
       | CRUSH algorithm
       v        v        v        v
   +--------+--------+--------+--------+
   |[osd.2, |[osd.8, |[osd.0, |[osd.5, |    Each PG maps to
   | osd.7, | osd.1, | osd.11,| osd.9, |    RF OSDs via CRUSH
   | osd.14]| osd.16]| osd.4] | osd.13]|
   +--------+--------+--------+--------+

Why PGs exist (and why not map objects directly to OSDs):
  1. Tracking millions of objects individually would require a huge
     metadata table. PGs reduce the tracking to ~100-300 entries per pool.
  2. Recovery and rebalancing operate at PG granularity, not per-object.
     Moving a PG moves all its objects as a batch.
  3. Peering (the process where OSDs agree on PG state after a failure)
     operates per-PG. Fewer PGs = faster peering.
  4. Scrubbing (background integrity checking) is scheduled per-PG.

PG Count Guidelines (per pool):
  Target: ~100-200 PGs per OSD (across all pools on that OSD)
  Formula: total_PGs = (num_OSDs * target_PGs_per_OSD) / RF
  Example: 18 OSDs, target 100, RF=3:
           total_PGs = (18 * 100) / 3 = 600
           If 2 pools: ~300 PGs each (round to 256 = nearest power of 2)

  Too few PGs:  uneven data distribution, some OSDs overfull
  Too many PGs: excessive memory per OSD, slow peering, longer recovery
  Ceph Reef+:   PG autoscaler (on by default) adjusts PG counts
                automatically based on pool usage

PG States and Health

PG states are the primary diagnostic tool for Ceph cluster health. Understanding states is essential for operations:

State	Meaning	Action Required
`active+clean`	Normal, healthy. All replicas present and serving I/O.	None.
`active+degraded`	Serving I/O but missing one or more replicas. Data is still accessible.	OSD may be down. Check `ceph osd tree`. Recovery will start automatically.
`active+recovering`	Serving I/O while re-replicating data to restore full redundancy.	Normal after OSD failure/restart. Monitor recovery speed vs I/O impact.
`active+backfilling`	Like recovering but for PGs that were remapped (e.g., after adding a new OSD).	Normal after cluster topology change. Backfill is lower priority than recovery.
`active+remapped`	PG is active but on a different set of OSDs than CRUSH dictates. Waiting for backfill.	Temporary state. Will clear after backfill completes.
`peering`	OSDs are negotiating PG state (who has the latest data). No I/O served during peering.	Brief during startup or after failure. If stuck: investigate OSD connectivity.
`stale`	The PG's primary OSD has not reported to monitors recently. Data may be inaccessible.	OSD is likely down or network-partitioned. Urgent: check OSD and network.
`inactive`	PG cannot serve I/O. No active primary OSD.	Critical. Usually means multiple OSDs in the PG's acting set are down.
`inconsistent`	Scrubbing found mismatches between replicas. Data corruption possible.	Run `ceph pg repair <pgid>`. Investigate cause (disk errors, bit rot).
`incomplete`	PG has not had a complete set of replicas since a particular epoch. Cannot serve I/O.	Critical. May require manual intervention. Check for permanently lost OSDs.

Monitoring PG Health (Operational Commands)
=============================================

# Overall health
$ ceph health detail
HEALTH_WARN 2 pgs degraded; 2 pgs undersized
  PG 3.a7 is active+degraded (2 of 3 replicas, osd.15 is down)
  PG 3.b2 is active+degraded (2 of 3 replicas, osd.15 is down)

# PG distribution per OSD
$ ceph osd df tree
ID  CLASS  WEIGHT   REWEIGHT  SIZE    RAW USE  %USE  AVAIL   PGS  TYPE
-1              62.82            62 TiB   42 TiB  66.8   20 TiB      root default
-3              20.94            20 TiB   14 TiB  66.7    6 TiB      rack rack01
-5              10.47            10 TiB    7 TiB  66.8    3 TiB      host node01
 0   nvme   3.49  1.00000   3.49 TiB  2.3 TiB  66.7  1.1 TiB  142  osd.0
 1   nvme   3.49  1.00000   3.49 TiB  2.3 TiB  66.9  1.1 TiB  138  osd.1
 2   ssd    3.49  1.00000   3.49 TiB  2.3 TiB  66.7  1.1 TiB  140  osd.2

# Recovery progress
$ ceph -s
  cluster: (id)
  health: HEALTH_WARN
  services: mon: 3 daemons, mgr: active+standby
  data: 256 pgs (254 active+clean, 2 active+recovering)
        42 TiB / 63 TiB raw used (66.8%)
  io: recovery: 512 MiB/s, 1200 objects/s

Storage Interfaces: RBD, CephFS, RGW

RBD (RADOS Block Device): The block interface to Ceph. RBD provides virtual block devices that are thin-provisioned, snapshottable, and cloneable. For VM storage on OVE, RBD is the primary interface -- every VM disk is an RBD image.

RBD features relevant to VM workloads:

Feature	Description	Impact for VMs
Thin provisioning	Space is allocated only as data is written. A 100 GiB RBD image with 10 GiB written consumes 10 GiB (x RF).	Allows overcommitting storage; must monitor actual usage vs provisioned.
Snapshots	Copy-on-write point-in-time captures. Creating a snapshot is instant (metadata-only).	VM snapshots for backup, rollback. Snapshot overhead: only changed blocks since snapshot.
Clones	A writable snapshot -- a new RBD image that shares unchanged data with its parent.	Fast VM provisioning from golden images. Clone is instant; divergent writes are COW.
Mirroring	Replicate RBD images to a remote Ceph cluster for DR. Two modes: journal-based (synchronous log shipping) and snapshot-based (periodic async).	DR for VM storage. Snapshot-based is simpler and lower overhead; journal-based provides lower RPO.
Exclusive lock	Only one client can write to the image at a time. Prevents split-brain for non-shared disks.	Required for KubeVirt: ensures a VM disk is not accidentally mounted by two VMs.
Object map	Tracks which 4 MiB objects within the image have been allocated. Accelerates snapshot diff, export, and discard operations.	Speeds up backup (only export allocated blocks), improves discard/TRIM performance.
Striping	Distributes a single image across multiple RADOS objects. Default: 4 MiB object size, 1 stripe.	Larger stripe_count increases parallelism for sequential workloads (rare to change default).

CephFS (Ceph File System): A POSIX-compatible distributed filesystem. Relevant for shared file access (ReadWriteMany) use cases. Uses MDS for metadata and RADOS for data. Not typically used for VM boot disks.

RGW (RADOS Gateway): S3/Swift-compatible object storage. Relevant for backup targets (Veeam, Kasten storing backups to S3), log aggregation, and artifact repositories. ODF bundles RGW via the NooBaa Multi-Cloud Gateway.

Performance Tuning Essentials

Parameter	Default	Tuning Guidance
`osd_memory_target`	4 GiB	Increase to 6-8 GiB for NVMe-dense nodes (12+ OSDs). Monitor with `ceph daemon osd.X perf dump`.
`bluestore_cache_size_ssd`	3 GiB	Auto-managed by `osd_memory_target`. Only override if manual tuning is needed.
`osd_recovery_max_active`	3	Limits concurrent recovery operations per OSD. Reduce to 1-2 to minimize production impact during recovery. Increase to 5+ to speed recovery during maintenance windows.
`osd_recovery_sleep`	0	Add 0.05-0.1 seconds sleep between recovery operations to reduce I/O impact.
`osd_max_backfills`	1	Max concurrent backfill operations per OSD. Keep low for production; increase during maintenance.
`osd_op_queue`	wpq	Use `mclock_scheduler` for QoS (Ceph Quincy+). Allows prioritizing client I/O over recovery/scrub.
`rbd_cache`	true	Client-side writeback cache in librbd. Default 32 MiB. Improves small-write performance.
`rbd_cache_writethrough_until_flush`	true	Starts in writethrough mode (safe), switches to writeback after first flush (fast).

Rook: The Kubernetes Operator for Ceph

Rook is the open-source Kubernetes operator that automates the deployment, configuration, and lifecycle management of Ceph on Kubernetes. ODF uses Rook internally. Understanding Rook's pod topology is critical for troubleshooting ODF clusters.

Rook-Ceph Pod Topology on a 6-Node ODF Cluster
==================================================

Control Plane Pods (scheduled on infra/storage nodes):
  +-----------------------------------------------------------+
  | Namespace: openshift-storage                               |
  |                                                            |
  | rook-ceph-operator-7f8b4c5d9-xxxxx    (1 pod, Deployment) |
  |   - Watches CRDs (CephCluster, CephBlockPool, etc.)       |
  |   - Reconciles desired state -> deploys/updates Ceph pods  |
  |   - Controls OSD provisioning, pool creation, upgrades     |
  |                                                            |
  | rook-ceph-mon-a-xxxx  (on node01, storage node)            |
  | rook-ceph-mon-b-xxxx  (on node03, storage node)            |
  | rook-ceph-mon-c-xxxx  (on node05, storage node)            |
  |   - Paxos quorum: 2 of 3 must be available                |
  |   - Each MON has a PVC for persistent data (~10 GiB)       |
  |   - Anti-affinity: spread across failure domains           |
  |                                                            |
  | rook-ceph-mgr-a-xxxx  (active manager, on node02)         |
  | rook-ceph-mgr-b-xxxx  (standby manager, on node04)        |
  |   - Dashboard, Prometheus exporter, balancer module        |
  |                                                            |
  | rook-ceph-mds-cephfs-a-xxxx  (active, on node01)          |
  | rook-ceph-mds-cephfs-b-xxxx  (standby, on node03)         |
  |   - Only if CephFS is deployed (for ReadWriteMany PVCs)   |
  +-----------------------------------------------------------+

OSD Pods (one per disk, on storage nodes):
  +-----------------------------------------------------------+
  | node01 (storage node, 3 NVMe drives):                     |
  |   rook-ceph-osd-0-xxxx   -> /dev/nvme0n1 (3.49 TiB)      |
  |   rook-ceph-osd-1-xxxx   -> /dev/nvme1n1 (3.49 TiB)      |
  |   rook-ceph-osd-2-xxxx   -> /dev/nvme2n1 (3.49 TiB)      |
  |                                                            |
  | node02 (storage node, 3 NVMe drives):                     |
  |   rook-ceph-osd-3-xxxx   -> /dev/nvme0n1                  |
  |   rook-ceph-osd-4-xxxx   -> /dev/nvme1n1                  |
  |   rook-ceph-osd-5-xxxx   -> /dev/nvme2n1                  |
  |                                                            |
  | node03-node06: similar (osd.6 through osd.17)             |
  +-----------------------------------------------------------+

CSI Driver Pods (on ALL nodes that consume storage):
  +-----------------------------------------------------------+
  | csi-cephfsplugin-xxxxx    (DaemonSet, every node)          |
  |   - Mounts CephFS PVCs into pods/VMs                      |
  |                                                            |
  | csi-rbdplugin-xxxxx       (DaemonSet, every node)          |
  |   - Maps RBD images as block devices for pods/VMs         |
  |   - Handles volume attach/detach lifecycle                 |
  |                                                            |
  | csi-cephfsplugin-provisioner-xxxxx  (2 replicas)           |
  | csi-rbdplugin-provisioner-xxxxx    (2 replicas)            |
  |   - Handle dynamic PVC provisioning (create RBD/CephFS)   |
  |   - Handle snapshot creation and deletion                  |
  +-----------------------------------------------------------+

CRDs managed by Rook:
  CephCluster         -> defines the overall Ceph cluster config
  CephBlockPool       -> defines an RBD pool (size, PG count, device class)
  CephFilesystem      -> defines a CephFS filesystem (MDS count, pools)
  CephObjectStore     -> defines an RGW endpoint (S3-compatible)
  CephObjectStoreUser -> creates S3 access credentials
  CephClient          -> defines a Ceph client with specific capabilities

2. OpenShift Data Foundation (ODF)

What ODF Adds on Top of Rook-Ceph

ODF is not a different storage system -- it is Red Hat's productized, supported, and integrated distribution of Ceph for OpenShift. The relationship is analogous to RHEL and Linux: the kernel is the same, but the packaging, support, certification, and integration are what you pay for.

Aspect	Raw Rook-Ceph (upstream)	ODF (Red Hat)
Support	Community (mailing list, IRC, GitHub)	Red Hat 24/7 support with SLA
Certification	Self-tested	Certified with specific OCP versions, hardware
Testing	Community CI	Red Hat QE, regression testing across matrix
Upgrades	Manual operator + Ceph upgrades	OLM-managed, tested upgrade paths
Integration	Manual Prometheus/Grafana setup	Built-in OCP console plugin, pre-configured alerts
Multi-cloud object	Manual RGW deployment	NooBaa (Multi-Cloud Gateway) included
Disaster recovery	Manual mirroring configuration	ODF-DR operator for Metro-DR and Regional-DR
Lifecycle	Roll your own	Operator-managed: install, configure, upgrade, scale

ODF Components

ODF Component Stack
======================

+--------------------------------------------------------------+
| OpenShift Console                                            |
|  +---------------------------+                               |
|  | ODF Console Plugin        |  Storage dashboard, health,   |
|  | (integrated into OCP UI)  |  capacity, topology view      |
|  +---------------------------+                               |
+--------------------------------------------------------------+
         |                            |
         v                            v
+----------------------------+ +----------------------------+
| ODF Operator               | | NooBaa Operator            |
| (manages Rook-Ceph)        | | (Multi-Cloud Gateway)      |
|                            | |                            |
| - CephCluster lifecycle    | | - S3-compatible endpoint   |
| - StorageCluster CRD       | | - Namespace buckets        |
| - OSD provisioning         | | - Multi-cloud data policy  |
| - Pool management          | | - Caching, tiering to      |
| - Upgrade orchestration    | |   AWS S3, Azure Blob, GCS  |
+----------------------------+ +----------------------------+
         |
         v
+--------------------------------------------------------------+
| Rook-Ceph (Ceph on Kubernetes)                               |
|  MON pods | MGR pods | OSD pods | MDS pods | RGW pods        |
|  CSI drivers (RBD + CephFS)                                  |
+--------------------------------------------------------------+
         |
         v
+--------------------------------------------------------------+
| Physical Storage (NVMe / SSD per node)                       |
+--------------------------------------------------------------+

ODF Deployment Modes

Internal Mode (Converged): ODF deploys and manages Ceph entirely within the OpenShift cluster. The Ceph MONs, MGRs, and OSDs run as pods on designated storage nodes (typically labeled with cluster.ocs.openshift.io/openshift-storage). This is the standard deployment for OVE.

Minimum: 3 storage nodes (for MON quorum and RF=3)
Compact mode: allows MON/MGR to run on the same nodes as worker pods (for smaller clusters)
Storage nodes require dedicated raw disks (not partitioned, not formatted) for OSDs

External Mode: ODF connects to an existing, externally managed Ceph cluster. ODF deploys only the CSI drivers and monitoring components on OpenShift; the Ceph daemons run outside the cluster. Use case: large shared Ceph cluster serving multiple OpenShift clusters.

OpenShift nodes do not need local storage
Ceph cluster is managed separately (cephadm, ceph-ansible, or another Rook instance)
Lower OCP resource overhead (no OSD/MON pods)
More complex network design (OCP nodes must reach external Ceph MON/OSD endpoints)

ODF StorageClasses

When ODF is deployed on OpenShift, it creates several StorageClasses that workloads (including VMs via KubeVirt) use to request storage:

StorageClass	Backend	Access Modes	Use Case
`ocs-storagecluster-ceph-rbd`	Ceph RBD (block)	RWO (ReadWriteOnce)	VM boot/data disks, databases, anything needing a dedicated block device
`ocs-storagecluster-ceph-rbd-virtualization`	Ceph RBD (block, tuned)	RWO	Optimized for KubeVirt VMs: thick provisioning, immediate binding
`ocs-storagecluster-cephfs`	CephFS (file)	RWX (ReadWriteMany)	Shared filesystems across pods/VMs, config mounts
`openshift-storage.noobaa.io`	NooBaa (S3-compatible object)	N/A (S3 API)	Backup targets, artifact storage, log archives

ODF for KubeVirt: How VM Disks Are Stored

When a VM runs on OVE (KubeVirt on OpenShift), its disks are typically PVCs backed by the ocs-storagecluster-ceph-rbd StorageClass. Here is the full path from VM disk to physical media:

VM Disk Storage Path (ODF + KubeVirt)
=======================================

VM Definition (VirtualMachine CR):
  spec:
    template:
      spec:
        volumes:
        - name: rootdisk
          persistentVolumeClaim:
            claimName: vm-prod-db01-rootdisk
        - name: datadisk
          persistentVolumeClaim:
            claimName: vm-prod-db01-datadisk

PVC: vm-prod-db01-rootdisk
  storageClassName: ocs-storagecluster-ceph-rbd
  capacity: 100Gi
  accessModes: [ReadWriteOnce]
      |
      v
PV (dynamically provisioned):
  csi-driver: rbd.csi.ceph.com
  rbd-image: csi-vol-abcdef-1234-5678
  pool: ocs-storagecluster-cephblockpool
      |
      v
Ceph RBD Image: csi-vol-abcdef-1234-5678
  Size: 100 GiB (thin provisioned -> actual used ~30 GiB)
  Features: layering, exclusive-lock, object-map, fast-diff
  Objects: 100 GiB / 4 MiB = 25,600 RADOS objects (max)
  PGs: objects distributed across ~256 PGs in the pool
      |
      v
RADOS Objects -> PGs -> OSDs (via CRUSH)
  Each 4 MiB object replicated to 3 OSDs on different racks

On the compute node (where the VM runs):
  1. CSI driver (csi-rbdplugin) maps the RBD image:
     rbd map ocs-storagecluster-cephblockpool/csi-vol-abcdef-1234-5678
     -> creates /dev/rbd0
  2. /dev/rbd0 is passed to the virt-launcher pod
  3. QEMU uses /dev/rbd0 as a virtio-blk backend
  4. Guest sees /dev/vda (block device)

Thick vs Thin Provisioning:
  Thin (default):  Space allocated on write. Risk of overcommit.
                   PVC reports full size, actual usage is less.
  Thick:           Pre-allocates all blocks at PVC creation.
                   No overcommit risk but slow creation (must write zeros).
                   Use StorageClass volumeMode: Block + annotation.

Capacity Planning for ODF

Capacity planning for ODF must account for replication overhead, metadata overhead, and operational headroom. Here is the math for a 5,000+ VM environment:

ODF Capacity Planning Example
================================

Assumptions:
  - 5,000 VMs, average 100 GiB provisioned per VM (thin)
  - Actual average utilization: 40% (40 GiB used per VM)
  - Total usable capacity needed: 5,000 * 40 GiB = 200 TiB usable
  - Replication factor: RF=3 (3-way replication)
  - Target cluster utilization: max 70% (30% headroom for recovery)

Step 1: Raw capacity for data
  Usable data:   200 TiB
  x RF=3:        200 * 3 = 600 TiB raw for data replicas

Step 2: Metadata and OSD overhead
  BlueStore metadata (RocksDB): ~1-2% of data capacity
  OSD journal (WAL): ~1% of data capacity
  Ceph internal overhead: ~3-5% total
  600 TiB * 1.05 = 630 TiB

Step 3: Operational headroom (30% free for recovery + rebalancing)
  630 TiB / 0.70 = 900 TiB raw capacity required

Step 4: Node count calculation
  If each storage node has 6 x 3.84 TiB NVMe SSDs = 23 TiB raw per node
  900 TiB / 23 TiB = ~40 storage nodes (rounded up)

  Alternative: 12 x 3.84 TiB NVMe per node = 46 TiB raw per node
  900 TiB / 46 TiB = ~20 storage nodes

Step 5: MON/MGR resource overhead (per storage cluster)
  3 MON pods:  3 * 2 GiB RAM = 6 GiB, 3 * 1 vCPU = 3 vCPU
  2 MGR pods:  2 * 3 GiB RAM = 6 GiB, 2 * 1 vCPU = 2 vCPU
  Total control plane: ~12 GiB RAM, 5 vCPU (negligible at this scale)

Step 6: OSD resource overhead
  Per OSD: 4-8 GiB RAM, 1-2 vCPU
  20 nodes * 12 OSDs = 240 OSDs
  RAM: 240 * 6 GiB = 1,440 GiB = 1.4 TiB total OSD RAM
  CPU: 240 * 1.5 = 360 vCPU total OSD CPU

Summary:
  +--------------------------+------------------+
  | Item                     | Value            |
  +--------------------------+------------------+
  | Usable capacity          | 200 TiB          |
  | Raw capacity (RF=3)      | 600 TiB          |
  | With overhead + headroom | 900 TiB          |
  | Storage nodes            | 20 (12 NVMe each)|
  | Total OSDs               | 240              |
  | OSD RAM total            | ~1.4 TiB         |
  | OSD CPU total            | ~360 vCPU        |
  | Efficiency               | 22% (usable/raw) |
  +--------------------------+------------------+

  Efficiency note: 22% usable-to-raw is the "worst case" with RF=3
  and 30% headroom. With erasure coding (4+2) for cold data:
    Usable 200 TiB = 50 TiB hot (RF=3) + 150 TiB cold (EC 4+2)
    Raw = 50*3 + 150*1.5 = 150 + 225 = 375 TiB
    With headroom: 375 / 0.70 = 536 TiB raw
    Efficiency: 200 / 536 = 37% (significant improvement)

ODF Disaster Recovery

ODF provides two DR architectures, both built on Ceph RBD mirroring:

Metro-DR (Stretched Cluster): A single Ceph cluster stretched across two sites within a metro area (< 10 ms RTT). Data is synchronously replicated to both sites. An arbiter node (MON only, no OSDs) at a third site breaks ties in a split-brain scenario.

ODF Metro-DR Architecture
============================

Site A (Primary)              Site B (Secondary)          Site C (Arbiter)
+-------------------+        +-------------------+       +----------+
| OCP Cluster       |        | OCP Cluster       |       | Arbiter  |
|                   |        |                   |       | MON only |
| MON-a, MGR-a      |        | MON-b, MGR-b      |       | MON-c    |
| OSD.0 - OSD.11   |  <-->  | OSD.12 - OSD.23   |       | (no OSDs)|
| (12 OSDs, Site A) | sync   | (12 OSDs, Site B) |       | Tiebreak |
|                   | replic | (each write goes  |       +----------+
| CRUSH rule:       | ation  |  to both sites)   |
|  2 replicas in    |        |  1 replica in     |
|  Site A           |        |  Site B           |
+-------------------+        +-------------------+

Requirements:
  - Latency between sites: < 10 ms RTT (recommended < 5 ms)
  - Bandwidth: sufficient for full write throughput (every write
    crosses the inter-site link)
  - CRUSH rule: "2 replicas at primary site, 1 at secondary"
    or "1 at each site, 1 floating"
  - Arbiter site: only needs network connectivity, minimal compute

RPO: 0 (synchronous replication, no data loss)
RTO: minutes (automatic failover if one site is lost)

Regional-DR (Async Mirroring): Two independent Ceph clusters at geographically separated sites. RBD images are asynchronously mirrored using snapshot-based replication. A hub cluster (can be lightweight) orchestrates failover.

ODF Regional-DR Architecture
===============================

Site A (Primary Cluster)         Site B (Secondary Cluster)
+----------------------+        +----------------------+
| OCP Cluster A        |        | OCP Cluster B        |
| ODF (full Ceph)      |  -->   | ODF (full Ceph)      |
|                      | async  |                      |
| RBD images:          | mirror | RBD images:          |
|   vm-db01-root       | -----> |   vm-db01-root       |
|   vm-db01-data       | -----> |   vm-db01-data       |
|   vm-app01-root      | -----> |   vm-app01-root      |
+----------------------+        +----------------------+

Mirroring modes:
  Snapshot-based (recommended):
    - Periodic snapshots (e.g., every 5 min)
    - Delta between snapshots shipped to remote
    - RPO = snapshot interval (e.g., 5 min)
    - Lower performance impact than journal-based

  Journal-based:
    - Every write is journaled and shipped to remote
    - RPO ~= network propagation delay (seconds)
    - Higher performance overhead (double journaling)
    - Deprecated in newer ODF versions in favor of snapshot

Orchestration:
  - ODF-DR operator (on hub cluster) manages failover
  - Integrates with RHACM (Red Hat Advanced Cluster Management)
  - Failover: promote secondary images to primary, redirect VMs
  - Failback: re-sync, reverse mirroring direction

ODF Monitoring

ODF integrates with the OpenShift monitoring stack (Prometheus + AlertManager + Grafana):

Pre-configured alerts: ODF ships with ~40 Prometheus alert rules covering Ceph health (CephClusterWarningState, CephClusterErrorState), OSD status (CephOSDNearFull, CephOSDDiskNotResponding), PG health (CephPGNotScrubbed, CephPGRepairNeeded), and cluster capacity (CephClusterNearFull at 75%, CephClusterCriticallyFull at 85%).
Ceph Dashboard: Accessible via the ODF Console Plugin in the OpenShift web console. Shows cluster health, OSD utilization, pool I/O, and host performance.
Key metrics for VM workloads:
- ceph_pool_rd / ceph_pool_wr: read/write operations per pool
- ceph_osd_op_r_latency / ceph_osd_op_w_latency: OSD operation latency (histogram)
- ceph_pool_stored: actual data stored per pool (before replication)
- ceph_osd_apply_latency_ms: time to apply (commit) operations on OSDs
- ceph_pg_degraded: number of PGs in degraded state (should be 0)

3. Storage Spaces Direct (S2D)

S2D Architecture Overview

Storage Spaces Direct is Microsoft's software-defined storage engine, built into Windows Server and tightly integrated with Azure Local (formerly Azure Stack HCI). S2D pools local disks from multiple cluster nodes into a unified, resilient storage pool presented as Cluster Shared Volumes (CSVs).

Unlike Ceph, which is a distributed object store, S2D is a distributed block storage layer that extends the traditional Windows storage stack (Storage Spaces, Storage Bus, ReFS) with cross-node capabilities.

S2D Architecture Layers
=========================

+--------------------------------------------------------------+
|                     VM / Hyper-V Layer                        |
|  +--------+  +--------+  +--------+  +--------+             |
|  | VM 01  |  | VM 02  |  | VM 03  |  | VM 04  |             |
|  | VHDX   |  | VHDX   |  | VHDX   |  | VHDX   |             |
|  +---+----+  +---+----+  +---+----+  +---+----+             |
|      |           |           |           |                    |
+------+-----------+-----------+-----------+--------------------+
       |           |           |           |
+------+-----------+-----------+-----------+--------------------+
|                Cluster Shared Volumes (CSV)                   |
|  +--------------------------+  +--------------------------+  |
|  | Volume 1                 |  | Volume 2                 |  |
|  | ReFS formatted           |  | ReFS formatted           |  |
|  | C:\ClusterStorage\Vol1   |  | C:\ClusterStorage\Vol2   |  |
|  | Owner: Node01            |  | Owner: Node03            |  |
|  | 3-way mirror             |  | Mirror-accelerated parity|  |
|  +--------------------------+  +--------------------------+  |
+--------------------------------------------------------------+
       |
+--------------------------------------------------------------+
|              Storage Spaces (Virtual Disks)                   |
|  Resiliency applied here:                                    |
|  - Mirror (2-way, 3-way)                                     |
|  - Parity (single, dual)                                     |
|  - Mirror-accelerated parity (MAP)                           |
|  - Nested resiliency (mirror-of-mirror, mirror-of-parity)    |
+--------------------------------------------------------------+
       |
+--------------------------------------------------------------+
|           Software Storage Bus (SSB)                         |
|  +---------------------------------------------------+      |
|  | Bus Layer (SBL)                                    |      |
|  | - Discovers all local disks on all cluster nodes   |      |
|  | - Creates unified view of all storage devices      |      |
|  | - Handles disk health, SMART, slot mapping         |      |
|  +---------------------------------------------------+      |
|  +---------------------------------------------------+      |
|  | Cache Layer                                        |      |
|  | - Binds cache devices to capacity devices          |      |
|  | - Manages read cache + write cache on NVMe/SSD     |      |
|  | - Transparent to upper layers                      |      |
|  +---------------------------------------------------+      |
|  +---------------------------------------------------+      |
|  | Network Transport                                  |      |
|  | - SMB Direct (RDMA: RoCE v2 or iWARP)              |      |
|  | - Handles cross-node I/O                           |      |
|  | - Provides authentication and encryption           |      |
|  +---------------------------------------------------+      |
+--------------------------------------------------------------+
       |
+--------------------------------------------------------------+
|              Physical Disks (per node)                        |
|  Node 01:                                                    |
|  [NVMe-0] [NVMe-1]  (cache tier)                             |
|  [SSD-0] [SSD-1] [SSD-2] [SSD-3] [SSD-4] [SSD-5] (capacity)|
|                                                              |
|  Node 02:                                                    |
|  [NVMe-0] [NVMe-1]  (cache tier)                             |
|  [SSD-0] [SSD-1] [SSD-2] [SSD-3] [SSD-4] [SSD-5] (capacity)|
|                                                              |
|  ... (up to 16 nodes)                                        |
+--------------------------------------------------------------+

S2D Components in Detail

Software Storage Bus (SSB): The foundational layer that unifies disks across all cluster nodes into a single pool. The SSB consists of three sub-layers:

Storage Bus Layer (SBL): Enumerates all physical disks on all cluster nodes, creates a unified namespace, and tracks disk health. Every node can "see" every disk in the cluster through the SBL, even though physical access is local. Remote disk access goes through SMB Direct over the storage network.

Cache Layer: S2D automatically designates the fastest devices as cache and the slower/larger devices as capacity:

S2D Cache + Capacity Device Binding
======================================

Auto-tiering rules (S2D auto-detects device types):

  Configuration 1: NVMe (cache) + SSD (capacity)
    NVMe devices -> cache tier (read + write cache)
    SSD devices  -> capacity tier (persistent data)
    Binding ratio: 1 NVMe : up to 4-6 SSDs (recommended)

  Configuration 2: SSD (cache) + HDD (capacity)
    SSD devices  -> cache tier (read + write cache)
    HDD devices  -> capacity tier (persistent data)
    Binding ratio: 1 SSD : up to 4-6 HDDs (recommended)

  Configuration 3: All-NVMe (no separate cache tier)
    All NVMe devices -> capacity tier (no cache)
    Cache is implicit in NVMe's internal DRAM/SLC cache
    Highest performance, simplest configuration

  Configuration 4: NVMe + SSD + HDD (three-tier)
    NVMe -> cache
    SSD  -> performance capacity tier
    HDD  -> capacity tier
    Rarely recommended due to complexity

Cache Binding Example (4-node cluster):

  Node 01:                         Node 02:
  Cache:  [NVMe-0]  [NVMe-1]      Cache:  [NVMe-0]  [NVMe-1]
            |   |      |   |                |   |      |   |
  Bound to: |   |      |   |      Bound to: |   |      |   |
            v   v      v   v                v   v      v   v
  Capacity:[SSD0][SSD1][SSD2][SSD3]       [SSD0][SSD1][SSD2][SSD3]

  Each cache device is bound to specific capacity devices.
  If a cache device fails: S2D destages dirty data from
  remaining cache, then re-binds capacity devices to
  surviving cache devices.

Cache Behavior:
  Writes: 100% write-through to cache tier FIRST
    1. Client write arrives at S2D
    2. Write lands on local cache device (NVMe) -- ACK to client
    3. Write simultaneously mirrored to cache device on partner
       node (for redundancy of cached data)
    4. Destaging: background process flushes cache to capacity tier
    5. Cache is always mirrored across at least 2 nodes for
       write durability, regardless of volume resiliency setting

  Reads: Read-through cache
    1. First read: fetched from capacity tier, cached on NVMe
    2. Subsequent reads: served from NVMe cache (if still hot)
    3. Cache eviction: LRU (least recently used) policy
    4. Read cache is LOCAL only (not mirrored)

S2D Cache Write Flow (Detailed)
==================================

1. VM writes 4 KiB to VHDX
   |
   v
2. Hyper-V: VHD stack -> NTFS/ReFS filter -> Storage Spaces
   |
   v
3. Storage Spaces: determine which cache device owns this extent
   |
   v
4. Write to LOCAL cache device (NVMe):
   |  ~10-30 us (NVMe latency)
   |
   +--- Simultaneously: mirror write to REMOTE cache (partner node)
   |    via SMB Direct / RDMA
   |    ~10-30 us (RDMA network) + ~10-30 us (remote NVMe write)
   |
   v
5. Both cache writes confirmed -> ACK to VM
   Total write latency: ~30-80 us (all-NVMe + RDMA)

   (Background destage, not on write path:)
   |
   v
6. Destage thread: read from cache, write to capacity device(s)
   - Applies volume resiliency (e.g., 3-way mirror:
     write to capacity on Node A, Node B, and Node C)
   - Destage rate adapts to cache fullness
   - At 25% cache used: low-priority destage
   - At 50% cache used: medium-priority destage
   - At 75% cache used: high-priority destage
   - At 90%+ cache used: aggressive destage (may impact latency)

Resiliency Options

S2D provides multiple resiliency options at the virtual disk (volume) level:

Resiliency Type	Copies	Capacity Efficiency	Fault Tolerance	Best For
2-way mirror	2	50%	1 fault (disk or node)	Small clusters (2-3 nodes)
3-way mirror	3	33%	2 faults	Production workloads, VMs (standard)
Single parity	1 parity stripe	67-80% (depends on columns)	1 fault	Not recommended for S2D (poor perf)
Dual parity	2 parity stripes	50-67%	2 faults	Cold data, archival
Mirror-accelerated parity (MAP)	Mirror (hot) + Parity (cold)	40-60% (blended)	2 faults	Mixed workloads, cost-optimized
Nested: Mirror of mirrors	4	25%	3 faults (survives any 2-node failure)	2-node clusters at same site
Nested: Mirror of parity	varies	~40%	Node + disk failure combined	2-node clusters needing capacity

Mirror-Accelerated Parity (MAP): The most interesting resiliency option for large deployments. MAP combines a mirror tier (fast, space-expensive) with a parity tier (slower, space-efficient) within a single volume:

Mirror-Accelerated Parity (MAP) Volume
=========================================

Single ReFS Volume:
  +--------------------------------------------------------------+
  |                        ReFS Filesystem                        |
  |                                                               |
  |  Hot data (recent writes, frequently accessed):               |
  |  +--------------------------------------------------+        |
  |  | Mirror Tier                                       |        |
  |  | 3-way mirror (33% efficiency)                     |        |
  |  | All recent writes land here first                 |        |
  |  | Fast random I/O (NVMe/SSD speed)                  |        |
  |  +--------------------------------------------------+        |
  |                                                               |
  |  Cold data (aged out, infrequently accessed):                 |
  |  +--------------------------------------------------+        |
  |  | Parity Tier                                       |        |
  |  | Dual parity (50-67% efficiency)                   |        |
  |  | Data rotated here by ReFS real-time tiering       |        |
  |  | Good sequential read, poor random write           |        |
  |  +--------------------------------------------------+        |
  |                                                               |
  |  ReFS Real-Time Tiering:                                      |
  |  - Tracks data temperature at 64 KiB granularity              |
  |  - Hot data stays in mirror tier (fast, protected)             |
  |  - Cold data (not accessed in X hours) rotated to parity tier  |
  |  - Rotation is a metadata operation (block cloning), instant   |
  |  - If cold data becomes hot again, it is promoted back         |
  +--------------------------------------------------------------+

Capacity calculation (MAP, 4-node cluster, 1 volume):
  Total raw capacity: 4 nodes * 6 SSD * 3.84 TiB = 92 TiB raw
  Mirror portion (30% of volume): 30 TiB raw -> 10 TiB usable (3-way)
  Parity portion (70% of volume): 62 TiB raw -> 41 TiB usable (dual parity)
  Total usable: ~51 TiB from 92 TiB raw = 55% efficiency
  Compare to pure 3-way mirror: 92 / 3 = 30.7 TiB usable = 33%

Volume Management: Pool to CSV

The S2D storage hierarchy is:

S2D Volume Hierarchy
======================

Level 1: Storage Pool
  - A single pool per cluster (S2D auto-creates it)
  - Contains ALL disks from ALL nodes (unified pool)
  - You do NOT create separate pools per node or per tier
  - Pool is managed via: Get-StoragePool, PowerShell

Level 2: Virtual Disk (Storage Spaces term)
  - Carved from the pool with a resiliency setting
  - Example: New-VirtualDisk -FriendlyName "VMs-Prod" \
              -Size 10TB -ResiliencySettingName Mirror \
              -NumberOfDataCopies 3
  - Maps to a set of extents (slabs) distributed across nodes

Level 3: Volume (Partition + Filesystem)
  - A volume is a partition on the virtual disk, formatted with ReFS
  - Example: New-Volume -FriendlyName "VMs-Prod" \
              -FileSystem ReFS -Size 10TB
  - Mounted at C:\ClusterStorage\VMs-Prod\ on the owner node

Level 4: Cluster Shared Volume (CSV)
  - The volume is registered as a CSV, making it accessible
    from ALL cluster nodes simultaneously
  - Owner node: handles metadata operations (create, rename, delete)
  - Non-owner nodes: perform direct I/O for reads/writes
    (bypass the owner for data operations using Block Redirected I/O)
  - If the owner node fails: another node takes ownership (~3 seconds)

  CSV I/O Modes:
  +---------------------+--------------------------+
  | Direct I/O          | Redirected I/O           |
  +---------------------+--------------------------+
  | Data reads/writes   | Metadata operations      |
  | bypass owner node   | go through owner node    |
  | Go directly to the  | (file create/delete/     |
  | Software Storage    |  rename, NTFS/ReFS       |
  | Bus -> disk         |  metadata)               |
  | Lowest latency      | Higher latency           |
  +---------------------+--------------------------+

  If direct I/O fails (e.g., storage path issue), CSV
  falls back to File System Redirected I/O (all I/O
  through the owner node via SMB). This is a degraded
  mode with higher latency and should trigger alerts.

ReFS Role in S2D

ReFS (Resilient File System) is the required filesystem for S2D volumes. It replaces NTFS and provides specific features that S2D depends on:

ReFS Feature	Description	S2D Relevance
Integrity streams	Per-block checksums (CRC64). Detects silent corruption (bit rot).	S2D uses integrity streams to detect corruption on capacity disks. When a checksum mismatch is found, S2D automatically repairs from a mirror copy.
Block cloning	Instant copy of file extents (metadata-only operation).	Used by MAP's real-time tiering to move data between mirror and parity tiers without copying bytes. Also used for instant VHDX snapshots.
Allocate-on-write	New writes allocate new blocks; old blocks are not overwritten.	Prevents torn writes (partial writes on power failure). Every write is atomic at the block level. Critical for data integrity.
No in-place metadata update	ReFS never updates metadata in place. All metadata writes go to new locations.	Eliminates the need for a metadata journal. Faster crash recovery (no journal replay).
Real-time tier optimization	ReFS tracks data temperature and moves data between tiers.	Enables MAP volumes: hot data in mirror tier, cold data in parity tier, with automatic promotion/demotion.

ReFS limitations compared to NTFS:

No file-level compression (NTFS supports per-file compression; ReFS does not)
No Encrypted File System (EFS) -- use BitLocker for full-volume encryption instead
No disk quotas (NTFS supports per-user quotas; ReFS does not)
No hard links (some applications depend on these)
No named streams beyond the default (limited alternate data stream support)
No data deduplication on ReFS until Windows Server 2019+ (and only for specific workloads)

For VM workloads (VHDX files on ReFS), these limitations rarely matter because the guest filesystem inside the VM handles file-level operations.

Failure Handling

S2D failure handling is designed around the concept of transient fault tolerance -- the ability to distinguish between temporary glitches and permanent failures:

S2D Failure Handling Timeline
================================

Event: Disk Failure
  t=0s:   Disk stops responding / reports SMART failure
  t=0s:   SBL marks disk as "communication lost"
  t=30s:  If disk does not recover: SBL marks disk as "retired"
          (30-second transient fault window)
  t=30s+: Storage Spaces begins repair:
          - Reads data from surviving mirror copies / parity
          - Writes new copies to other disks in the pool
          - Repair rate: prioritized but background
  t=hours: Repair completes (depends on data volume)

  Impact on VMs: NONE. Write I/O was served by cache tier.
  Read I/O seamlessly served from surviving mirror copies.

Event: Node Failure
  t=0s:   Node stops responding (heartbeat lost)
  t=0s:   Cluster Service detects failure
  t=5s:   Cluster quarantines the node (safety check)
  t=5-10s: CSV ownership of volumes on failed node transfers
           to surviving nodes
  t=10s:  VMs on failed node: 
          - HA VMs restart on surviving nodes (Cluster Aware
            Updating, Live Migration)
          - VM restart time: 30s - 2 min (depends on VM size)
  t=30s:  Storage Spaces starts repair:
          - Data that had copies only on the failed node is
            re-replicated to surviving nodes
          - The 30-second delay prevents premature repair for
            transient reboots

  Impact on storage:
  - 3-way mirror: 2 copies remain, I/O unaffected
  - 2-way mirror: 1 copy remains, I/O continues but no redundancy
  - Repair bandwidth: controlled by Storage Spaces repair priority
    (Low, Medium, High -- configurable via PowerShell)

Event: Second Node Failure (During Recovery)
  Scenario: 4-node cluster, 3-way mirror, Node A failed,
            repair is 50% complete, Node B now fails

  - All data that was fully repaired: still 2 copies (safe)
  - Data not yet repaired that had copies on both A and B:
    only 1 copy remaining on Node C or D
  - If a THIRD fault hits that 1 remaining copy: DATA LOSS
  - S2D elevates repair priority to "High" automatically
  - Monitor closely: Get-VirtualDisk | Get-StorageReliabilityCounter

Nested Resiliency (2-node clusters):
  For 2-node Azure Local clusters, S2D offers "nested resiliency":
  - Nested two-way mirror: 4 copies (2 on each node)
    -> Survives 1 node failure + 1 disk failure simultaneously
  - Nested mirror-accelerated parity:
    -> Mirror tier: nested two-way mirror
    -> Parity tier: mirror + single parity
    -> Survives 1 node + 1 disk failure
  - Capacity efficiency: ~25% (nested mirror), ~40% (nested MAP)

S2D Networking

S2D relies heavily on the network for cross-node storage I/O. The network is a critical component of the storage path.

S2D Network Architecture
===========================

Recommended Network Design (per node):

  +--Node-01--------------------------------------------+
  |                                                     |
  |  NIC 1 (25/100 GbE)                                 |
  |  +--------------------------------------------+    |
  |  | vSwitch (SET - Switch Embedded Teaming)     |    |
  |  |  Port 1: Management VLAN 10                 |    |
  |  |  Port 2: VM Traffic VLAN 20                 |    |
  |  |  Port 3: Live Migration VLAN 30             |    |
  |  +--------------------------------------------+    |
  |                                                     |
  |  NIC 2 (25/100 GbE) - RDMA enabled                  |
  |  +--------------------------------------------+    |
  |  | Storage Network A (VLAN 40)                 |    |
  |  | SMB Direct (RDMA)                           |    |
  |  | IP: 10.0.40.1/24                            |    |
  |  +--------------------------------------------+    |
  |                                                     |
  |  NIC 3 (25/100 GbE) - RDMA enabled                  |
  |  +--------------------------------------------+    |
  |  | Storage Network B (VLAN 41)                 |    |
  |  | SMB Direct (RDMA)                           |    |
  |  | IP: 10.0.41.1/24                            |    |
  |  +--------------------------------------------+    |
  |                                                     |
  +-----------------------------------------------------+

RDMA Protocols:
  +---------------+------------------+-------------------+
  | Protocol      | RoCE v2          | iWARP             |
  +---------------+------------------+-------------------+
  | Transport     | UDP/IP           | TCP/IP            |
  | Switch req.   | DCB (PFC, ETS,   | Standard Ethernet |
  |               | ECN) required    | (no DCB needed)   |
  | Latency       | ~2-5 us          | ~5-10 us          |
  | Throughput    | Higher           | Slightly lower    |
  | Complexity    | Higher (DCB      | Lower (TCP-based) |
  |               | config critical) |                   |
  | Vendors       | Mellanox/NVIDIA, | Intel, Chelsio,   |
  |               | Broadcom         | Broadcom          |
  +---------------+------------------+-------------------+

SMB Direct (RDMA for Storage):
  - S2D uses SMB3 as the transport between nodes
  - SMB Direct enables RDMA bypass: data transfer directly
    from NIC to memory, bypassing the CPU and OS network stack
  - Kernel bypass reduces latency from ~50-100 us (TCP) to
    ~2-10 us (RDMA)
  - Required for production S2D performance

Storage Network Bandwidth Calculation:
  - 4-node cluster, 3-way mirror
  - Aggregate write throughput: 5 GB/s
  - Each write generates 2 remote copies: 10 GB/s cross-node
  - Plus read traffic (cache misses): ~2 GB/s
  - Plus repair/rebuild: ~1 GB/s (background)
  - Total: ~13 GB/s across cluster
  - Per node: ~3.25 GB/s = 26 Gbps
  - Minimum: 2 x 25 GbE RDMA NICs per node (50 Gbps)
  - Recommended: 2 x 100 GbE RDMA NICs (200 Gbps headroom)

Performance Characteristics

Configuration	Random 4K Read IOPS (cluster)	Random 4K Write IOPS (cluster)	Sequential Read MB/s	Sequential Write MB/s
All-NVMe (4 nodes, no cache)	2-4 million	500K - 1 million	20-40 GB/s	10-20 GB/s
NVMe cache + SSD capacity (4 nodes)	500K - 1.5 million	300K - 800K	10-20 GB/s	5-15 GB/s
SSD cache + HDD capacity (4 nodes)	50K - 200K (cache hit dependent)	100K - 300K (cache absorbs)	2-8 GB/s	2-5 GB/s

These numbers are Microsoft's published benchmarks for optimized configurations. Real-world performance depends heavily on workload patterns, cache hit rates, and network configuration.

S2D Capacity Planning

S2D Capacity Planning Example
================================

Assumptions (same as ODF example for comparison):
  - 5,000 VMs, average 100 GiB provisioned (thin via dynamic VHDX)
  - Actual average utilization: 40% (40 GiB used per VM)
  - Total usable capacity needed: 200 TiB usable
  - Resiliency: 3-way mirror for production VMs
  - Target cluster utilization: max 70%

Step 1: Raw capacity for data
  Usable: 200 TiB
  x 3 (3-way mirror): 600 TiB raw for data copies

Step 2: ReFS and CSV overhead
  ReFS metadata: ~1-2% of volume capacity
  CSV metadata: negligible
  Storage Spaces metadata: ~1-2%
  600 TiB * 1.04 = 624 TiB

Step 3: Operational headroom (30% free)
  624 TiB / 0.70 = 891 TiB raw capacity tier needed

Step 4: Cache tier sizing
  S2D cache is in addition to capacity
  Rule of thumb: cache = 10% of capacity per node
  Cache purpose: absorb writes, cache reads
  Cache is mirrored across 2 nodes (write cache durability)

  For 891 TiB capacity: ~89 TiB cache total
  But cache devices are typically 1.6-3.2 TiB NVMe
  Practical: 2 x 3.2 TiB NVMe per node = 6.4 TiB cache/node

Step 5: Node count (16-node max per cluster)
  Per node capacity: 891 TiB / N nodes

  Option A: 16 nodes (max per cluster)
    891 / 16 = 55.7 TiB capacity per node
    = ~15 x 3.84 TiB SSD per node (capacity)
    + 2 x 3.2 TiB NVMe per node (cache)
    Total: 1 cluster of 16 nodes

  Option B: 8 nodes per cluster, 2 clusters
    891 / 8 = ~112 TiB capacity per node
    = ~30 x 3.84 TiB SSD per node (needs large servers)
    OR: split across 2 clusters of 8 nodes each
    = ~56 TiB per node per cluster = ~15 SSDs
    + 2 NVMe cache per node
    Total: 2 clusters of 8 nodes

  Option C: MAP for cold data (capacity optimization)
    Hot data (30%): 60 TiB usable, 3-way mirror -> 180 TiB raw
    Cold data (70%): 140 TiB usable, dual parity -> 210 TiB raw
    Total raw: 390 TiB capacity + ~40 TiB cache
    390 / 0.70 = 557 TiB (with headroom)
    557 / 16 = 35 TiB per node = ~9 SSDs per node + 2 NVMe cache
    Total: 1 cluster of 16 nodes (much more efficient)

Summary Comparison (3-way mirror, no EC/MAP optimization):
  +---------------------------+-------------+-------------+
  | Item                      | ODF (Ceph)  | S2D         |
  +---------------------------+-------------+-------------+
  | Usable capacity           | 200 TiB     | 200 TiB     |
  | Raw capacity (RF/mirror)  | 600 TiB     | 600 TiB     |
  | With overhead + headroom  | 900 TiB     | 891 TiB     |
  | + Cache tier              | N/A (WAL)   | ~89 TiB     |
  | Total raw disk            | 900 TiB     | 980 TiB     |
  | Max nodes per cluster     | Unlimited   | 16          |
  | Minimum clusters          | 1           | 1-2         |
  | Disaggregated option      | Yes         | No          |
  +---------------------------+-------------+-------------+

  Note: S2D cache tier is an ADDITIONAL investment beyond the
  capacity tier. Ceph uses the same NVMe devices for both WAL
  and data (or separate partitions on the same device), so the
  cache cost is embedded in the capacity devices.

Azure Local Integration

S2D on Azure Local adds cloud-management capabilities:

Azure Arc: Nodes are registered as Azure Arc-enabled servers. Storage can be monitored and managed from the Azure portal.
Azure portal integration: Volume creation, monitoring, and alerting through Azure Monitor. No on-premises management tools required (though PowerShell and Windows Admin Center remain available).
Cloud witness: Instead of a physical witness server for cluster quorum, Azure Local uses an Azure Blob Storage account as the witness. Eliminates the need for a third site or dedicated witness hardware.
Azure Backup integration: Azure Backup can protect VMs on Azure Local, including application-consistent backups using VSS (Volume Shadow Copy Service).
Azure Monitor: S2D metrics (disk health, volume utilization, cache hit rate, latency) flow to Azure Monitor for centralized alerting and dashboards.

How the Candidates Handle This

Comparison Table

Aspect	OVE (ODF / Ceph)	Azure Local (S2D)	Swisscom ESC
SDS engine	Ceph (open-source, Red Hat supported via ODF)	Storage Spaces Direct (proprietary, built into Windows Server)	N/A -- managed SAN (Dell PowerMax/PowerStore)
Architecture type	Distributed object store with block/file/object interfaces	Distributed block layer with ReFS filesystem	Centralized SAN controllers
Data placement	CRUSH algorithm (deterministic, no lookup table)	Software Storage Bus with CSV owner coordination	Array controller, hardware-managed
Daemon model	1 OSD per disk (Linux process in K8s pod)	Integrated service (Storage Spaces, Cluster Service)	N/A
Cache / journal	BlueStore WAL on NVMe partition or device; metadata in RocksDB DB	Separate cache tier: NVMe/SSD devices bound to capacity devices; 100% write-cache	Array DRAM + NVMe write buffer
Cache write behavior	WAL absorbs small writes; large writes go directly to data area (COW)	All writes land on cache tier first, mirrored to partner node's cache, then destaged	Controller cache absorbs writes, destages to RAID
Filesystem on storage	None -- BlueStore writes directly to raw block devices	ReFS required on all volumes	Array-internal (not customer-visible)
Replication	Per-pool: RF=2, RF=3, or erasure coding (k+m)	Per-volume: 2-way mirror, 3-way mirror, parity, MAP, nested	Array-level RAID (managed)
Max nodes per cluster	No hard limit (tested at 100+, practical at 200+)	16 (hard limit)	N/A
Minimum nodes (production)	3 (ODF compact)	2 (with cloud witness) or 3 (recommended)	N/A
Disaggregated storage	Yes -- ODF infra nodes (storage-only), fully disaggregated Ceph	No -- all nodes must contribute storage	N/A
Failure domain granularity	CRUSH: disk, host, rack, row, room, datacenter (arbitrary hierarchy)	Fault domains: node, chassis, rack, site (fixed levels)	Array HA (dual controller)
Data integrity checking	Ceph scrubbing (deep scrub checksums all data), BlueStore checksums	ReFS integrity streams (CRC64 per block), auto-repair from mirror	Array-level scrubbing (managed)
Snapshot mechanism	RBD snapshots (COW, instant, unlimited), layered clones	ReFS block cloning (COW, VHDX checkpoints)	Array-native snapshots
DR: synchronous	Metro-DR (stretched cluster, < 10 ms RTT, arbiter)	Stretch cluster (site-aware volumes, < 5 ms RTT)	Managed by Swisscom (SAN replication)
DR: asynchronous	Regional-DR (RBD snapshot-based mirroring, configurable RPO)	Azure Site Recovery (VM-level replication, 30s RPO)	Managed by Swisscom
Encryption at rest	dm-crypt/LUKS per OSD (KMIP integration)	BitLocker per CSV volume (TPM-backed)	Vendor-managed
Compression	Inline (snappy, zstd, lz4), per-pool setting	Not available at S2D level (ReFS does not compress)	Vendor-managed
Erasure coding	Yes, configurable k+m per pool (e.g., 4+2, 8+3)	Parity volumes (single/dual), MAP	Array-level RAID
Object storage (S3)	Built-in: Ceph RGW + NooBaa MCG	Not built-in; requires separate solution	Not included
Monitoring	Ceph metrics via Prometheus, 40+ ODF alerts, Ceph Dashboard	Azure Monitor, Windows Admin Center, Performance Monitor	Swisscom portal (managed)
Management interface	Kubernetes CRDs, `ceph` CLI, ODF Console Plugin	PowerShell, Windows Admin Center, Azure Portal	Swisscom portal
Upgrade model	OLM operator upgrades (rolling, non-disruptive)	Windows Update / Azure Update Manager (rolling, node-by-node)	Managed by Swisscom
Multi-cluster storage	Ceph can serve multiple OCP clusters (external mode)	Each S2D cluster is independent; no native cross-cluster storage	Shared SAN (multi-tenant)
Skill set required	Linux, Kubernetes, Ceph administration	Windows Server, PowerShell, Hyper-V, Azure Arc	None (managed)

Detailed Analysis

OVE (ODF/Ceph) -- Strengths:

Unlimited scale. No hard cluster size limit. A single Ceph cluster can scale from 3 nodes to hundreds of nodes. For 5,000+ VMs, you can grow the cluster organically without hitting a "cluster split" problem. You never need to architect multi-cluster storage federation.
CRUSH-based placement. The CRUSH algorithm provides unmatched flexibility in data placement. You can define custom failure domain hierarchies (per-rack, per-room, per-site) and create rules that match your physical topology exactly. Moving data to new racks or sites is a CRUSH map update, not a data migration.
Disaggregated architecture. ODF supports dedicated storage nodes (infra nodes) that run only Ceph OSDs. This allows independent scaling of compute and storage. If you need more VM capacity, add compute nodes. If you need more storage, add storage nodes. This decoupling is a significant architectural advantage for a growing environment.
Unified storage. A single Ceph cluster provides block (RBD for VM disks), file (CephFS for shared mounts), and object (RGW/NooBaa for S3-compatible backups). You do not need separate products for each storage interface. The operational model is one platform for all storage needs.
Open source with commercial support. The Ceph codebase is open source with a large community. ODF is Red Hat's supported distribution. There is no vendor lock-in at the storage engine level -- if Red Hat's pricing becomes unfavorable, the core technology remains accessible.
Inline compression. Ceph supports per-pool inline compression (snappy, zstd, lz4), which can significantly reduce capacity requirements for compressible workloads. S2D has no equivalent feature.

OVE (ODF/Ceph) -- Weaknesses:

Operational complexity. Ceph has a steeper learning curve than S2D. Concepts like PG autoscaling, CRUSH map management, OSD lifecycle (prepare, activate, purge), scrub scheduling, and BlueStore tuning require specialist knowledge. A misconfigured PG count or CRUSH rule can cause data imbalance or performance degradation.
Memory consumption. With the default osd_memory_target of 4 GiB and 12 OSDs per node, a storage node dedicates 48 GiB of RAM to OSD processes alone. Add MON, MGR, and system overhead, and a storage node may need 96-128 GiB RAM before running a single VM. This is a significant cost if compute and storage share nodes (converged deployment).
Write amplification. The Ceph write path (WAL + metadata update + data write) generates more I/O than raw writes. BlueStore mitigates this compared to FileStore, but write amplification ratios of 2-4x on small writes are common. This reduces effective NVMe endurance.
Recovery impact. Ceph recovery (after OSD or node failure) generates significant cross-cluster I/O. Default recovery settings can visibly impact VM latency. Tuning osd_recovery_max_active and osd_recovery_sleep is a day-1 operational task, not a day-2 optimization.

Azure Local (S2D) -- Strengths:

Write performance via cache tier. S2D's 100% write-cache design (all writes land on NVMe cache first) provides very low write latency -- typically 30-80 microseconds for 4K random writes in all-NVMe/SSD configurations with RDMA. The cache absorbs burst writes independently of the capacity tier, providing consistent write latency.
Simplicity. S2D has fewer knobs than Ceph. There are no PG counts to tune, no CRUSH maps to design, no OSD daemons to manage individually. The storage pool is automatic (all disks are pooled), and volume creation is straightforward PowerShell. For organizations with existing Windows infrastructure skills, the learning curve is manageable.
Integrated Hyper-V stack. S2D, CSV, ReFS, Hyper-V, and Failover Clustering are all integrated components of Windows Server. There is no "integration gap" -- the storage and hypervisor were designed together. VSS-based backups, live migration, and VM checkpoints work seamlessly.
Azure cloud integration. Azure Arc management, Azure Monitor telemetry, Azure Backup, Azure Site Recovery, and cloud witness provide a cloud-connected operational model. For organizations invested in Azure, this extends their existing management plane to on-premises infrastructure.
MAP for capacity efficiency. Mirror-accelerated parity is a unique S2D feature that automatically tiers data between a fast mirror layer and an efficient parity layer within a single volume. This provides a good balance of performance and capacity efficiency (40-55% usable vs 33% for pure 3-way mirror) without manual tiering management.
ReFS data integrity. ReFS integrity streams provide per-block CRC64 checksums with automatic repair from mirror copies. This provides end-to-end data integrity that is transparent to VMs. While Ceph also checksums (via BlueStore), ReFS integrates this at the filesystem level.

Azure Local (S2D) -- Weaknesses:

16-node cluster limit. This is the most significant architectural constraint. For 5,000+ VMs, you will likely need 2-3 S2D clusters. Each cluster is an independent storage domain -- data does not span clusters. VM placement, capacity management, and DR must be architected per-cluster. Ceph has no equivalent limitation.
No disaggregation. Every node in an S2D cluster must contribute storage. You cannot have compute-only nodes or storage-only nodes. This means scaling compute requires scaling storage, and vice versa. For a growing environment where compute and storage needs grow at different rates, this is an expensive constraint.
No built-in object storage. S2D provides block storage (CSV/VHDX) and file storage (SMB shares). There is no S3-compatible object storage layer. If workloads need S3 (backups, artifacts, logs), you must deploy a separate solution (MinIO, Azure Blob gateway, etc.).
No inline compression. S2D and ReFS do not provide inline compression at the storage layer. Ceph's per-pool compression can save 30-50% capacity on compressible workloads. This increases S2D's raw capacity requirements.
Windows-only operational model. S2D is managed via PowerShell, Windows Admin Center, and Azure Portal. The skill set is Windows Server administration -- fundamentally different from the Linux/Kubernetes skills required for OVE. If the organization is moving toward a Linux/Kubernetes-first strategy, S2D requires maintaining a parallel Windows skill set.
Cache tier as additional cost. S2D's cache tier (NVMe devices) is dedicated to caching and cannot be used for capacity. This means buying NVMe devices that only serve as cache, plus additional SSD/HDD devices for capacity. In an all-NVMe configuration (no separate cache), this is not an issue, but all-NVMe at scale is expensive.

Swisscom ESC -- Storage as a Service:

Swisscom ESC uses Dell PowerMax/PowerStore behind VxBlock. The customer has no visibility into or control over the SDS platform because there is no SDS platform -- it is traditional SAN. Storage is consumed as provisioned capacity with defined IOPS/latency SLAs.

Advantages: zero storage operations burden, no capacity planning, no failure handling, no upgrade management. Disadvantages: no architecture flexibility, no performance optimization capability, no cost optimization through erasure coding or compression, complete dependency on Swisscom for every storage decision. For a Tier-1 financial institution that values control over its infrastructure, this trade-off deserves careful evaluation.

Key Takeaways

This is the core platform decision. Choosing between ODF/Ceph and S2D is not just a storage decision -- it is a platform decision that determines the operating system (Linux vs Windows), the management paradigm (Kubernetes operators vs PowerShell/Azure Arc), the scalability model (elastic vs bounded), and the skill set required for the next 5-10 years. The storage comparison cannot be separated from the broader platform strategy.
The 16-node limit is a real architectural constraint. For 5,000+ VMs requiring ~200 TiB usable storage, S2D's 16-node limit means operating 1-3 clusters. Each cluster is an independent failure domain and management unit. Multi-cluster S2D adds operational complexity (which cluster hosts which VMs, how do you balance load, how do you do DR per-cluster). Ceph scales in a single cluster. The PoC should quantify the multi-cluster management overhead for S2D.
Cache architecture is fundamentally different. Ceph uses WAL on NVMe as a write journal (small, millisecond-level persistence, then flushed to data area). S2D uses a full-sized cache tier (larger NVMe devices) as a 100% write-cache that absorbs all writes before destaging. The S2D approach provides more consistent write latency at the cost of additional NVMe investment. The Ceph approach is more space-efficient but may show higher tail latency under sustained write pressure. The PoC should compare write latency under identical workloads.
Disaggregation is a differentiator for ODF. ODF's ability to run dedicated storage nodes separately from compute nodes allows independent scaling and resource optimization. For 5,000+ VMs where compute growth (new VMs) and storage growth (more data per VM) happen at different rates, disaggregation prevents forced overprovisioning. S2D's all-in-one model means every compute addition brings unwanted storage and vice versa.
Compression gives Ceph a capacity advantage. Ceph's inline compression (zstd, snappy) can reduce raw capacity requirements by 30-50% for compressible workloads. S2D has no equivalent. For a 200 TiB usable environment, this could mean 250-350 TiB less raw capacity needed -- a meaningful cost savings that should be validated against the actual workload profile.
CRUSH flexibility is both a strength and a responsibility. CRUSH allows Ceph to place data with arbitrary failure domain rules (per-rack, per-room, per-site). This enables topologies that S2D cannot match (e.g., a single cluster spanning two datacenters with site-aware placement). However, CRUSH requires expert configuration -- an incorrect CRUSH rule can create data hotspots, uneven distribution, or inadequate failure domain separation. The team must invest in CRUSH map design expertise before production deployment.
MAP is S2D's answer to capacity efficiency. Where Ceph uses erasure coding for cold data, S2D uses mirror-accelerated parity (MAP) with ReFS real-time tiering. Both achieve similar capacity efficiency improvements (from ~33% to ~50-55%). The difference is operational: Ceph requires you to create separate pools with different replication policies and assign workloads to the right pool. S2D's MAP handles tiering automatically within a single volume. For operational simplicity, MAP is superior. For control, Ceph's per-pool policies are superior.
Both platforms require RDMA for production performance. Ceph (via msgr2) and S2D (via SMB Direct) both achieve their performance targets only with RDMA-capable networking. RoCE v2 requires DCB-capable switches with proper PFC/ETS/ECN configuration. This network requirement is identical for both candidates and must be accounted for in the infrastructure bill of materials. Do not plan either platform on TCP-only storage networks for production workloads -- the latency penalty is 3-10x.
Operational skill sets do not overlap. Ceph/ODF operations (CRUSH map management, PG tuning, OSD lifecycle, Rook operator troubleshooting, ceph CLI) are Linux/Kubernetes skills. S2D operations (PowerShell cmdlets, Windows Admin Center, CSV management, ReFS administration, Failover Cluster Manager) are Windows Server skills. The organization cannot train one team for both -- it must choose a platform and invest deeply in the corresponding skill set. This training investment is a significant non-hardware cost that should be included in TCO calculations.
Validate with your workload, not vendor benchmarks. Both Ceph and S2D can publish impressive benchmark numbers. What matters is performance with your actual workload profile: your mix of 4K random vs 64K sequential, your read/write ratio, your VM density, your snapshot frequency. The PoC must use representative workloads derived from VMware metrics (IOPS, latency, throughput histograms from vRealize or Aria Operations) to produce comparable results.

Discussion Guide

The following questions are designed for vendor deep-dives, PoC design sessions, and internal architecture reviews. They focus on the SDS platform specifics that differentiate ODF/Ceph from S2D.

Questions for OVE / ODF (Red Hat)

CRUSH map design for our topology: "We have N racks across 2 datacenter rooms in Zurich, with a DR site in Bern. Design a CRUSH map that ensures RF=3 replicas are placed on different racks within a single site for normal operations, and show how the map changes for Metro-DR (stretched cluster) and Regional-DR (async mirroring). Can CRUSH rules be modified live without triggering data movement? How do we validate a CRUSH rule change before applying it?"
BlueStore tuning for all-NVMe: "We plan to deploy all-NVMe storage nodes. What is the recommended bluestore_min_alloc_size for NVMe (4 KiB vs 16 KiB)? Should WAL and DB be on the same NVMe as data, or on a separate dedicated NVMe? What is the performance difference? What osd_memory_target do you recommend for NVMe-backed OSDs under sustained 4K random write load?"
PG autoscaling behavior: "We will run multiple pools (block pool for VMs, CephFS metadata/data pools, RGW pools for object storage). How does PG autoscaling handle the distribution across pools? What happens during a PG split or merge operation -- is there a measurable latency impact on VM I/O? Can we schedule PG operations during maintenance windows?"
OSD failure and recovery impact: "Simulate a single OSD failure (one NVMe disk) on a 20-node, 240-OSD cluster at 65% utilization during a sustained 50,000 IOPS workload. What is the p99 latency increase during recovery? How long until the cluster is back to full health? What is the write amplification on surviving OSDs during recovery?"
ODF upgrade path and disruption: "Walk us through an ODF minor version upgrade (e.g., 4.15 to 4.16) and a Ceph major version upgrade. Which pods restart? Is I/O affected during OSD pod restarts? How long does a rolling upgrade take for a 20-node cluster? Is there a rollback procedure if the upgrade fails mid-way?"

Questions for Azure Local / S2D (Microsoft)

Multi-cluster architecture for 5,000+ VMs: "With a 16-node limit per S2D cluster, we anticipate needing 2-3 clusters. How do we design VM placement policies across clusters? Is there a unified management plane that spans multiple S2D clusters? How do we handle capacity imbalance across clusters (one cluster at 80%, another at 40%)? Can VMs access storage across cluster boundaries?"
Cache tier failure scenario: "If one NVMe cache device fails on a node, what happens to the dirty (not yet destaged) data in that cache? How does S2D re-bind the orphaned capacity devices to surviving cache devices? What is the latency impact during cache rebuild? What is the worst case -- both NVMe cache devices on one node fail simultaneously?"
MAP tuning and behavior: "For mirror-accelerated parity volumes, how do we control the mirror-to-parity ratio? What is the data temperature threshold for rotation from mirror to parity tier? Can we observe the current distribution of data between tiers? What is the read latency difference between data in the mirror tier vs the parity tier under production load?"
S2D repair priority and VM impact: "When a node fails and S2D starts repair, what is the measured IOPS degradation on surviving nodes? Can we set repair priority per volume (e.g., production VMs = high priority, dev/test = low priority)? What happens if we lose a second node while repair is at 50% -- how much data is at risk?"
ReFS limitations for our workloads: "We run VMs with Windows Server, Linux, and database workloads. Are there known ReFS limitations that affect VHDX performance (e.g., large file fragmentation, metadata update bottlenecks under high VM density)? How does ReFS block cloning performance scale with thousands of VHDX files on a single volume? Are all our backup tools (Veeam, Commvault) fully certified for ReFS-on-CSV?"

Questions for Swisscom ESC

Storage performance guarantees: "What are the contractual IOPS and latency SLAs for block storage? Are these per-VM, per-tenant, or per-array? What QoS mechanism enforces performance isolation between tenants? Can we purchase dedicated storage tiers (e.g., guaranteed NVMe tier for latency-sensitive databases)?"

Cross-Platform / Internal Architecture Questions

Head-to-head PoC design: "Design a PoC that runs identical workloads (fio benchmarks + representative VM workloads) on ODF/Ceph and S2D with equivalent hardware. Define the metrics: 4K random read/write IOPS at p50/p99/p99.9, 64K sequential throughput, write latency under 80% utilization, recovery impact during node failure, time-to-full-health after node loss. Both platforms must use all-NVMe storage and RDMA networking."
5-year capacity model: "Build a 5-year capacity model comparing ODF (RF=3 + EC for cold data, compression enabled) vs S2D (3-way mirror + MAP for cold data, no compression). Use our current 200 TiB usable baseline with 15% annual growth. Include: raw disk cost, NVMe cache cost (S2D only), server cost (including OSD RAM overhead for Ceph), and licensing cost. Which platform has a lower 5-year TCO at our scale?"
Failure scenario matrix: "For each platform, document the impact of: (a) single disk failure, (b) single node failure, (c) two simultaneous disk failures on different nodes, (d) simultaneous node + disk failure, (e) storage network partition between racks. For each scenario: is data accessible? Is there data loss risk? What is the automatic recovery process? What is the time to full health?"

Previous: 04-architectures.md -- Storage Architectures (SAN, NAS, HCI) Next: 06-kubernetes-storage.md -- Kubernetes Storage Model (CSI, PV/PVC, StorageClasses)