
Storage -- Quick Reference

Key Numbers

| Parameter | VMware/vSAN | OVE (ODF/Ceph) | Azure Local (S2D) |
|---|---|---|---|
| Max nodes/cluster (storage) | 64 (vSAN) | No hard limit (tested 100+) | 16 nodes |
| Replication factor | FTT=1 or 2 (RAID-1 mirroring); RAID-5/6 erasure coding optional | RF=2 or RF=3 (configurable per pool) | 2-way or 3-way mirror |
| Usable capacity (RF=3 / 3-way) | ~33% of raw | ~33% of raw | ~33% of raw |
| Usable capacity (RF=2 / 2-way) | ~50% of raw | ~50% of raw | ~50% of raw |
| Typical NVMe IOPS/OSD | N/A | 10,000-50,000 per OSD | N/A (pooled) |
| Latency target (NVMe, 4K rw) | <200 us | <500 us (Ceph RBD, tuned) | <200 us (S2D all-NVMe) |
| Latency target (SSD, 4K rw) | <500 us | <1 ms (Ceph RBD) | <500 us |
| OSD memory default | N/A | 4 GiB per OSD (osd_memory_target) | N/A |
| MON count | N/A | 3 (small) or 5 (large) | N/A |
| PG count (per pool, default) | N/A | 128-256 (autoscaler adjusts) | N/A |
| Max disk/PVC size | VMDK: 62 TB | No hard limit (tested multi-TB) | VHDX: 64 TB |
| Volume snapshot support | VMDK delta disks | CSI VolumeSnapshot (Ceph, instant) | Hyper-V checkpoints |
| Online volume expansion | Yes (vSAN) | Yes (CSI ExpandVolume) | Yes (S2D + ReFS) |
| Thin provisioning | Yes | Yes (Ceph RBD default) | Yes (ReFS) |
| Encryption at rest | vSAN encryption | ODF cluster/PV encryption (LUKS) | BitLocker |
| Deduplication | vSAN dedup (all-flash) | No native dedup in RBD | ReFS dedup + compression |
| Compression | vSAN compression | BlueStore compression (zstd/lz4) | ReFS compression |
| Scrubbing interval | N/A | Daily (light), weekly (deep) | Background (Storage Spaces) |
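
Several of the Ceph values above (osd_memory_target, pool replication size, PG counts) can be read back from a running cluster through the rook-ceph toolbox. A minimal check, assuming the default openshift-storage namespace; pool names vary per deployment, so list them first:

# Confirm the per-OSD memory target (4 GiB default)
oc exec -n openshift-storage deploy/rook-ceph-tools -- ceph config get osd osd_memory_target
# List pools with their replication size and pg_num
oc exec -n openshift-storage deploy/rook-ceph-tools -- ceph osd pool ls detail
# Show what the PG autoscaler currently recommends per pool
oc exec -n openshift-storage deploy/rook-ceph-tools -- ceph osd pool autoscale-status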

Decision Matrix

| Scenario | Recommended Backend | Volume Mode | StorageClass | Notes |
|---|---|---|---|---|
| VM boot disk (general) | Ceph RBD | Block | ocs-storagecluster-ceph-rbd | Block mode = less overhead, better perf |
| VM data disk (DB, high IOPS) | Ceph RBD (NVMe pool) | Block | ceph-rbd-nvme (custom) | Separate pool on NVMe-only OSDs |
| Shared filesystem (RWX) | CephFS | Filesystem | ocs-storagecluster-cephfs | For VMs needing a shared mount |
| Bulk/archive storage | CephFS or NFS | Filesystem | cephfs-bulk (custom) | Cheaper tier, HDD-backed pool |
| External SAN integration | NetApp ONTAP (Trident) | Block | ontap-san | iSCSI or FC; existing SAN investment |
| External NAS integration | NetApp ONTAP (Trident) | Filesystem | ontap-nas | NFS; good for legacy NFS mounts |
| Live migration support | Any RWX-capable backend | Block (RWX) | Ceph RBD (default RWX) | RWO blocks live migration |
| VM template / golden image | Ceph RBD | Block | Same as target | Use CSI clone (COW, near-instant) |
| Disaster recovery (sync) | ODF Metro DR | Block | DR-enabled StorageClass | RPO=0, requires stretched cluster |
| Disaster recovery (async) | ODF Regional DR | Block | DR-enabled StorageClass | RPO=minutes, multi-site |

Access mode cheat sheet:

| Access Mode | Meaning | Live Migration? | Use Case |
|---|---|---|---|
| RWO (ReadWriteOnce) | Single node read/write | No (pinned to one node) | Non-migratable VMs, temp disks |
| RWX (ReadWriteMany) | Multi-node read/write | Yes | Production VMs (default for Ceph RBD block); see example below |
| ROX (ReadOnlyMany) | Multi-node read-only | N/A | Shared config disks, ISOs |
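
For reference, a block-mode RWX PVC of the kind the matrix recommends for live-migratable VM disks can be created as below. This is a minimal sketch: the name vm-disk-1, namespace my-vms, and 100Gi size are placeholders; the StorageClass is the ODF default RBD class.

# Create a 100 GiB block-mode RWX PVC on the default Ceph RBD StorageClass
cat <<'EOF' | oc apply -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: vm-disk-1
  namespace: my-vms
spec:
  accessModes:
    - ReadWriteMany
  volumeMode: Block
  resources:
    requests:
      storage: 100Gi
  storageClassName: ocs-storagecluster-ceph-rbd
EOF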

Essential Commands

| Task | VMware/vSAN | OVE (ODF/Ceph) | Azure Local (S2D) |
|---|---|---|---|
| List datastores/pools | govc datastore.info | oc get cephblockpool -n openshift-storage | Get-StoragePool -CimSession $c |
| List storage classes | N/A (SPBM policies) | oc get sc | Get-StoragePool |
| List PVCs | N/A | oc get pvc -A | Get-Volume -CimSession $c |
| List PVs | N/A | oc get pv | Get-VirtualDisk -CimSession $c |
| Check storage capacity | govc datastore.info ds1 | oc exec -n openshift-storage deploy/rook-ceph-tools -- ceph df | Get-StoragePool \| select FriendlyName,Size,AllocatedSize |
| Check OSD/disk status | N/A | oc exec -n openshift-storage deploy/rook-ceph-tools -- ceph osd tree | Get-PhysicalDisk -CimSession $c |
| Check cluster health | esxcli vsan health cluster list | oc exec ... -- ceph status | Get-HealthFault -CimSession $c |
| Create PVC (block) | N/A | oc apply -f pvc-block.yaml | N/A (Azure portal/PS) |
| Expand PVC | Extend VMDK | oc patch pvc my-pvc -p '{"spec":{"resources":{"requests":{"storage":"200Gi"}}}}' | Resize-VirtualDisk |
| Create snapshot | govc snapshot.create | oc apply -f volume-snapshot.yaml | Checkpoint-VM |
| Restore from snapshot | govc snapshot.revert | oc apply -f volume-snapshot-restore.yaml | Restore-VMSnapshot |
| Clone PVC | N/A | DataVolume with source.pvc (CSI clone) | Copy-VHD |
| Upload disk image | Content Library | virtctl image-upload dv my-dv --image-path=disk.qcow2 --size=50Gi | Azure portal |
| Check I/O stats | esxtop | oc exec ... -- ceph osd perf | Get-StorageQoSFlow |
| List slow OSD ops | N/A | oc exec ... -- ceph daemon osd.0 dump_blocked_ops | N/A |
| Repair/scrub | esxcli vsan trace | oc exec ... -- ceph pg deep-scrub <pg-id> | Repair-VirtualDisk |
| Check rebalance progress | N/A | oc exec ... -- ceph -w (watch) | Get-StorageJob -CimSession $c |
| Set pool replication | N/A | oc exec ... -- ceph osd pool set <pool> size 3 | Set-StorageTier |
| Ceph toolbox shell | N/A | oc rsh -n openshift-storage deploy/rook-ceph-tools | N/A |
| ODF dashboard | N/A | OpenShift Console > Storage > Data Foundation | N/A |
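
The snapshot and clone rows above reference manifests rather than one-liners. The following is a sketch of what those manifests might contain; PVC, namespace, and size values are placeholders, and the VolumeSnapshotClass shown is the usual ODF default (confirm with oc get volumesnapshotclass on your cluster):

# Snapshot an existing PVC (a volume-snapshot.yaml equivalent)
cat <<'EOF' | oc apply -f -
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: vm-disk-1-snap
  namespace: my-vms
spec:
  volumeSnapshotClassName: ocs-storagecluster-rbdplugin-snapclass
  source:
    persistentVolumeClaimName: vm-disk-1
EOF
# Clone a PVC with a CDI DataVolume (CSI clone, copy-on-write)
cat <<'EOF' | oc apply -f -
apiVersion: cdi.kubevirt.io/v1beta1
kind: DataVolume
metadata:
  name: vm-disk-1-clone
  namespace: my-vms
spec:
  source:
    pvc:
      name: vm-disk-1
      namespace: my-vms
  storage:
    resources:
      requests:
        storage: 100Gi
    storageClassName: ocs-storagecluster-ceph-rbd
EOF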

Architecture at a Glance (OVE/ODF)

+===============================================================================+
| VM Guest OS: /dev/vda (virtio-blk) or /dev/sda (virtio-scsi)                  |
+===============================================================================+
| QEMU block layer (inside virt-launcher pod)                                   |
|   raw block device (block-mode PVC) or disk image file (filesystem-mode PVC) |
+===============================================================================+
| Kubernetes PVC (PersistentVolumeClaim)                                        |
|   StorageClass -> CSI driver -> Ceph RBD image                                |
+===============================================================================+
| CSI Layer: ceph-csi (rbd plugin)                                              |
|   Controller (Deployment): CreateVolume, Snapshot, Expand, Attach             |
|   Node (DaemonSet): Stage, Publish (map RBD to /dev on host, bind-mount)      |
+===============================================================================+
| Rook-Ceph Operator (manages all Ceph daemons as K8s workloads)                |
+===============================================================================+
| RADOS Cluster                                                                 |
|   MON (x3): Paxos quorum, cluster map, CRUSH map                             |
|   MGR (x2): Dashboard, Prometheus metrics, balancer                           |
|   OSD (1 per disk): BlueStore -> raw NVMe/SSD                                |
|     Pool -> PG -> OSD set (selected by CRUSH algorithm)                       |
|     Replication: primary writes -> replicate to secondary + tertiary OSD      |
+===============================================================================+
| Physical: NVMe/SSD drives, 25GbE NICs (public + cluster network)              |
+===============================================================================+
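
To see this layering on a live cluster, a PVC can be walked down to its RBD image and CRUSH placement. A rough sketch; my-pvc and my-namespace are placeholders, and the volumeAttributes keys shown are what ceph-csi sets today and may differ between versions:

# PVC -> backing PV name
oc get pvc my-pvc -n my-namespace -o jsonpath='{.spec.volumeName}{"\n"}'
# PV -> Ceph pool and RBD image name (ceph-csi volumeAttributes)
oc get pv <pv-name> -o jsonpath='{.spec.csi.volumeAttributes.pool}{" "}{.spec.csi.volumeAttributes.imageName}{"\n"}'
# RBD image details (size, features, parent if cloned)
oc exec -n openshift-storage deploy/rook-ceph-tools -- rbd info <pool>/<image>
# Which PG and OSD set an object maps to (CRUSH placement)
oc exec -n openshift-storage deploy/rook-ceph-tools -- ceph osd map <pool> <object-name>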

Storage Troubleshooting Quick Checks

1. Ceph cluster not healthy (HEALTH_WARN or HEALTH_ERR)

# Check overall status
oc exec -n openshift-storage deploy/rook-ceph-tools -- ceph status
# Check specific health warnings
oc exec -n openshift-storage deploy/rook-ceph-tools -- ceph health detail
# Check OSD status (look for down/out)
oc exec -n openshift-storage deploy/rook-ceph-tools -- ceph osd tree

2. PVC stuck in Pending state

# Check PVC events for provisioning errors
oc describe pvc my-pvc -n my-namespace
# Check CSI controller logs for CreateVolume failures
oc logs -n openshift-storage -l app=csi-rbdplugin-provisioner -c csi-rbdplugin --tail=50
# Check StorageClass exists and is default
oc get sc

3. VM disk I/O slow (high latency)

# Check OSD latency (commit_latency_ms > 10 = problem)
oc exec -n openshift-storage deploy/rook-ceph-tools -- ceph osd perf
# Check for slow ops (blocked > 30s = degraded)
oc exec -n openshift-storage deploy/rook-ceph-tools -- ceph daemon osd.0 dump_blocked_ops
# Check if recovery/backfill is consuming I/O
oc exec -n openshift-storage deploy/rook-ceph-tools -- ceph -s | grep -E 'recovery|backfill'
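
If recovery or backfill is competing with client I/O, it can be throttled until latency recovers. A hedged example with illustrative values; clusters running the mClock scheduler (recent Ceph releases) manage recovery rates differently, so verify the effect before relying on it, and revert afterwards:

# Throttle recovery/backfill concurrency
oc exec -n openshift-storage deploy/rook-ceph-tools -- ceph config set osd osd_max_backfills 1
oc exec -n openshift-storage deploy/rook-ceph-tools -- ceph config set osd osd_recovery_max_active 1
# Confirm the change
oc exec -n openshift-storage deploy/rook-ceph-tools -- ceph config get osd osd_max_backfills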

4. OSD crashed or not starting

# Check OSD pod status
oc get pods -n openshift-storage -l app=rook-ceph-osd --field-selector=status.phase!=Running
# Check OSD pod logs
oc logs -n openshift-storage rook-ceph-osd-<id>-<hash> --previous
# Check underlying disk health
oc debug node/<node-name> -- chroot /host smartctl -a /dev/nvme0n1
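
Ceph also keeps its own crash reports, which are often more informative than pod logs for a crashed OSD:

# List recorded daemon crashes and inspect one
oc exec -n openshift-storage deploy/rook-ceph-tools -- ceph crash ls
oc exec -n openshift-storage deploy/rook-ceph-tools -- ceph crash info <crash-id>
# Acknowledge crashes once investigated (clears the RECENT_CRASH health warning)
oc exec -n openshift-storage deploy/rook-ceph-tools -- ceph crash archive-all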

5. Volume not attaching to VM (VM stuck in scheduling)

# Check VolumeAttachment objects
oc get volumeattachment | grep <pv-name>
# Check CSI node plugin logs on target node
oc get pods -n openshift-storage -l app=csi-rbdplugin --field-selector spec.nodeName=<node> -o name
oc logs -n openshift-storage <csi-rbdplugin-pod> -c csi-rbdplugin --tail=50
# Check if RBD image is locked by another node (stale mapping)
oc exec -n openshift-storage deploy/rook-ceph-tools -- rbd status <pool>/<image>
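
The <pool>/<image> values for that last command can be read from the PV itself, and watchers/locks can be listed before deciding a mapping is stale. A sketch assuming the ceph-csi volumeAttributes layout:

# Find pool and image name for the stuck PV
oc get pv <pv-name> -o jsonpath='{.spec.csi.volumeAttributes.pool}{" "}{.spec.csi.volumeAttributes.imageName}{"\n"}'
# Show current watchers (a watcher from a node no longer running the VM suggests a stale mapping)
oc exec -n openshift-storage deploy/rook-ceph-tools -- rbd status <pool>/<image>
# List any advisory locks held on the image
oc exec -n openshift-storage deploy/rook-ceph-tools -- rbd lock ls <pool>/<image>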