Kubernetes Storage Model
Why This Matters
The previous pages established the storage fundamentals (01), the VMware baseline (02), the protocols (03), the architectures (04), and the SDS platforms (05). This page bridges the gap between the raw storage platform and the workloads that consume it. In Kubernetes-based platforms (OVE and Azure Local), no VM or container can access storage without passing through the Kubernetes storage model. CSI, PV/PVC, and StorageClasses are the control-plane mechanisms that connect a VM's disk request to an actual block device on Ceph or S2D.
For a Tier-1 financial enterprise running 5,000+ VMs, this page answers three critical questions:
- How does a VM get a disk? In VMware, you create a VMDK on a datastore. In Kubernetes, a chain of abstractions -- StorageClass, PVC, PV, CSI driver, SDS platform -- must all function correctly for a VM to receive a writable disk. Understanding this chain is essential for troubleshooting, capacity planning, and performance analysis.
- How do we map our existing storage policies? VMware's SPBM (Storage Policy Based Management) assigns VMs to storage tiers (gold, silver, bronze) with defined performance and protection levels. Kubernetes StorageClasses serve the same purpose but with different mechanics. A clean migration requires mapping every SPBM policy to a corresponding StorageClass with equivalent parameters.
- What changes operationally? In VMware, storage is configured through vCenter's GUI -- datastore creation, policy assignment, capacity monitoring. In Kubernetes, storage is configured via YAML manifests, managed by operators, and provisioned dynamically through CSI drivers. The operational model shifts from GUI-driven administration to declarative, API-driven automation. The team must understand this shift to operate the platform effectively.
This page covers the three pillars of Kubernetes storage: CSI (how Kubernetes talks to storage backends), PV/PVC (how storage is represented and consumed), and StorageClasses (how storage tiers are defined and selected). Together, they form the complete path from a VM's disk request to a provisioned block device.
Concepts
1. CSI (Container Storage Interface)
Why CSI Exists
Before CSI, Kubernetes supported storage backends through in-tree volume plugins -- Go code compiled directly into the Kubernetes controller-manager and kubelet binaries. Each storage vendor (AWS EBS, GCE PD, Ceph RBD, vSphere VMDK, etc.) had their driver code embedded in the Kubernetes source tree. This created three problems:
- Coupling. Adding or fixing a storage driver required modifying and releasing the entire Kubernetes codebase. A bug in the Ceph RBD plugin meant waiting for the next Kubernetes release -- even if Ceph itself was already fixed.
- Vendor burden. Storage vendors had to submit PRs to kubernetes/kubernetes, pass the full Kubernetes CI/CD pipeline, and align their release cycles with Kubernetes. This was slow, error-prone, and politically contentious.
- Security and stability risk. Third-party storage driver code ran inside the kubelet and controller-manager with full privileges. A bug in one driver could crash the kubelet, affecting all pods on the node.
CSI solves these problems by defining a standard gRPC interface between Kubernetes and external storage drivers. Storage vendors implement CSI drivers as independent binaries (running in separate containers), communicate with Kubernetes through a well-defined API, and release on their own schedule. Kubernetes ships with CSI client code; the driver code runs out-of-tree.
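Each out-of-tree driver announces itself to the cluster through a CSIDriver object, which tells Kubernetes which optional behaviors the driver requires (attach/detach, pod info on mount, and so on). A minimal sketch is shown below; the driver name and capability flags are illustrative, not taken from a specific product.
# Sketch of a CSIDriver registration object (driver name and flags are illustrative)
apiVersion: storage.k8s.io/v1
kind: CSIDriver
metadata:
  name: rbd.csi.ceph.com           # Must match the name the driver reports via GetPluginInfo
spec:
  attachRequired: true             # Driver implements ControllerPublish/UnpublishVolume
  podInfoOnMount: false            # Kubelet need not pass pod metadata on NodePublishVolume
  fsGroupPolicy: File              # How fsGroup ownership is applied to mounted volumes
  volumeLifecycleModes:
    - Persistent                   # PV/PVC-backed volumes only (no ephemeral inline volumes)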
Evolution of Kubernetes Storage Plugins
=========================================
Phase 1: In-Tree (Kubernetes 1.0 - 1.22)
+-----------------------------------------+
| Kubernetes Binary (kubelet, controller) |
| |
| +--------+ +--------+ +--------+ |
| | AWS EBS| | GCE PD | | Ceph | |
| | Plugin | | Plugin | | RBD | |
| +--------+ +--------+ | Plugin | |
| +--------+ +--------+ +--------+ |
| |vSphere | | Azure | +--------+ |
| | VMDK | | Disk | |Portworx| |
| | Plugin | | Plugin | | Plugin | |
| +--------+ +--------+ +--------+ |
| |
| All compiled into kubernetes/kubernetes|
| ~30+ plugins, all in-tree |
+-----------------------------------------+
Problem: tightly coupled, slow release
cycle, security risk
Phase 2: CSI Out-of-Tree (Kubernetes 1.13+, GA)
+-----------------------------------------+
| Kubernetes Binary |
| |
| +----------------------------------+ |
| | CSI Client (kube-controller-mgr) | |
| | CSI Client (kubelet) | |
| +----------------------------------+ |
+------------|----------------------------+
| gRPC (Unix Domain Socket)
v
+-------------------------------------------+
| CSI Driver (separate container/process) |
| |
| +----------+ +-----------+ +----------+ |
| | ceph-csi | | smb.csi | | csi-proxy|| |
| | (ODF) | | (Azure | | (Windows || |
| | | | Local) | | nodes) || |
| +----------+ +-----------+ +----------+ |
+-------------------------------------------+
Decoupled: vendor releases independently,
runs in own process, own security context
CSI Architecture: Controller Plugin and Node Plugin
A CSI driver is split into two components that run as separate Kubernetes workloads:
Controller Plugin -- runs as a Deployment (1-3 replicas for HA) or StatefulSet. Handles cluster-wide operations that do not require access to a specific node's local devices. The controller plugin implements the Controller Service RPCs:
| gRPC Method | Purpose | When Called |
|---|---|---|
| CreateVolume | Provisions a new volume on the storage backend (e.g., creates an RBD image in Ceph, allocates a VHDX on S2D) | PVC created with dynamic provisioning |
| DeleteVolume | Removes the volume from the storage backend | PV deleted (reclaim policy = Delete) |
| ControllerPublishVolume | Attaches the volume to a specific node (e.g., maps RBD image to a node, connects iSCSI target) | Pod scheduled to a node, volume needs attaching |
| ControllerUnpublishVolume | Detaches the volume from a node | Pod deleted or moved, volume no longer needed on that node |
| CreateSnapshot | Creates a point-in-time snapshot of a volume | VolumeSnapshot CRD created |
| DeleteSnapshot | Removes a snapshot | VolumeSnapshotContent deleted |
| ControllerExpandVolume | Expands volume capacity on the storage backend | PVC .spec.resources.requests.storage increased |
| ValidateVolumeCapabilities | Checks if a volume supports requested capabilities | Pre-flight validation |
| ListVolumes | Lists all volumes managed by this driver | Inventory and reconciliation |
| ControllerGetCapabilities | Returns which optional RPCs this driver supports | Discovery during startup |
Node Plugin -- runs as a DaemonSet (one pod per node). Handles node-local operations that require access to the host's devices, mount namespaces, and block device layer. The node plugin implements the Node Service RPCs:
| gRPC Method | Purpose | When Called |
|---|---|---|
| NodeStageVolume | Mounts the volume to a node-global staging directory (e.g., formats and mounts a block device to /var/lib/kubelet/plugins/kubernetes.io/csi/pv/<pv>/globalmount) | First pod on this node uses the volume |
| NodeUnstageVolume | Unmounts from the staging directory | Last pod on this node releases the volume |
| NodePublishVolume | Bind-mounts from the staging directory into the pod's volume directory (or creates a device symlink for raw block) | Pod starts and needs the volume |
| NodeUnpublishVolume | Removes the bind-mount from the pod | Pod terminates |
| NodeExpandVolume | Expands the filesystem on a volume after the controller has expanded the backend volume | After ControllerExpandVolume, filesystem needs resizing |
| NodeGetInfo | Returns node topology information (zone, rack, hostname) | Node registration |
| NodeGetCapabilities | Returns which optional RPCs this node plugin supports | Discovery during startup |
| NodeGetVolumeStats | Returns capacity and inode usage for a mounted volume | Monitoring, kubectl describe pvc |
CSI Driver Architecture (Controller + Node Plugins with Sidecars)
===================================================================
CONTROLLER DEPLOYMENT (1-3 replicas)
+-----------------------------------------------------------------------+
| Pod: csi-controller-0 |
| |
| +---------------------+ +---------------------+ |
| | external-provisioner| | external-attacher | |
| | Watches PVC objects | | Watches VolumeAttach| |
| | Calls CreateVolume / | | ment objects | |
| | DeleteVolume via gRPC| | Calls Controller- | |
| +----------|----------+ | PublishVolume via | |
| | | gRPC | |
| | +----------|----------+ |
| | | |
| +---------------------+ +---------------------+ |
| | external-snapshotter| | external-resizer | |
| | Watches VolumeSnap- | | Watches PVC size | |
| | shot objects | | changes | |
| | Calls CreateSnapshot | | Calls Controller- | |
| | via gRPC | | ExpandVolume via | |
| +----------|----------+ | gRPC | |
| | +----------|----------+ |
| | | |
| +--------+ +----------+ |
| | | |
| v v |
| +---------------------+ +----------------+ |
| | CSI Driver | | livenessprobe | |
| | (Controller Service)| | Health checks | |
| | e.g., ceph-csi, | | the CSI driver | |
| | csi-smb-controller| | via gRPC | |
| | Listens on Unix | | Exposes /healthz| |
| | domain socket | +----------------+ |
| +---------------------+ |
+-----------------------------------------------------------------------+
NODE DAEMONSET (one pod per node)
+-----------------------------------------------------------------------+
| Pod: csi-node-xxxxx (on every schedulable node) |
| |
| +------------------------+ |
| | node-driver-registrar | |
| | Registers CSI driver | |
| | with kubelet via the | |
| | kubelet plugin | |
| | registration mechanism | |
| | (fsnotify on | |
| | /registration/) | |
| +-----------|------------+ |
| | |
| v |
| +---------------------+ +----------------+ |
| | CSI Driver | | livenessprobe | |
| | (Node Service) | +----------------+ |
| | e.g., ceph-csi, | |
| | csi-smb-node | |
| | Has access to: | |
| | - Host /dev devices | |
| | - Host mount ns | |
| | - kubelet dir | |
| +---------------------+ |
+-----------------------------------------------------------------------+
Communication: All sidecar-to-driver communication uses gRPC over
a shared Unix domain socket mounted as an emptyDir volume within
the pod. No network traffic. No TLS overhead.
CSI Sidecar Containers
CSI sidecar containers are Kubernetes-maintained helper containers that run alongside the CSI driver in the same pod. They watch Kubernetes API objects and translate them into CSI gRPC calls. The CSI driver itself never watches the Kubernetes API directly -- it only responds to gRPC calls. This separation means a CSI driver can be written without any Kubernetes-specific code.
| Sidecar | Watches | Calls CSI RPC | Runs In |
|---|---|---|---|
| external-provisioner | PersistentVolumeClaim (unbound, matching StorageClass) | CreateVolume, DeleteVolume | Controller |
| external-attacher | VolumeAttachment (created by AD controller) | ControllerPublishVolume, ControllerUnpublishVolume | Controller |
| external-snapshotter | VolumeSnapshot, VolumeSnapshotContent | CreateSnapshot, DeleteSnapshot, ListSnapshots | Controller |
| external-resizer | PersistentVolumeClaim (size increase detected) | ControllerExpandVolume | Controller |
| livenessprobe | (none -- polls CSI driver) | Probe (gRPC health check) | Controller + Node |
| node-driver-registrar | (none -- registers with kubelet) | NodeGetInfo | Node |
The sidecar versions must be compatible with the CSI driver version. Mismatched sidecar versions are a common source of subtle bugs (e.g., external-provisioner v3.x calling a driver that only supports CSI spec 1.5 features).
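Putting the pieces together, the controller side of a driver is typically packaged as a single Deployment in which the sidecars and the vendor driver share a Unix domain socket through an emptyDir volume. The sketch below is illustrative only (container image names, tags, and the service account are assumptions; real drivers such as ceph-csi ship their own manifests).
# Sketch of a CSI controller Deployment; image names/tags and RBAC are illustrative
apiVersion: apps/v1
kind: Deployment
metadata:
  name: csi-example-controller
spec:
  replicas: 2
  selector:
    matchLabels: {app: csi-example-controller}
  template:
    metadata:
      labels: {app: csi-example-controller}
    spec:
      serviceAccountName: csi-example-controller-sa   # Needs RBAC for PVCs, PVs, VolumeAttachments
      containers:
      - name: csi-driver
        image: example.registry/csi-driver:v1.0.0      # Vendor's controller-service binary
        args: ["--endpoint=unix:///csi/csi.sock"]
        volumeMounts:
        - {name: socket-dir, mountPath: /csi}
      - name: external-provisioner
        image: registry.k8s.io/sig-storage/csi-provisioner:v4.0.0
        args: ["--csi-address=/csi/csi.sock"]
        volumeMounts:
        - {name: socket-dir, mountPath: /csi}
      - name: external-attacher
        image: registry.k8s.io/sig-storage/csi-attacher:v4.5.0
        args: ["--csi-address=/csi/csi.sock"]
        volumeMounts:
        - {name: socket-dir, mountPath: /csi}
      volumes:
      - name: socket-dir
        emptyDir: {}                                   # Shared gRPC socket; never leaves the pod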
gRPC Interface Between Kubernetes and the CSI Driver
All communication between Kubernetes components and the CSI driver uses gRPC over Unix domain sockets. There is no network involved. The socket file is shared between containers in the same pod via an emptyDir volume.
gRPC Communication Flow
=========================
Kubernetes API Server
|
| (watches via informers)
v
+---------------------+ Unix Domain Socket
| Sidecar Container | ------> /csi/csi.sock ------> CSI Driver Container
| (e.g., external- | |
| provisioner) | gRPC Request: | Executes storage
| | CreateVolumeRequest { | backend operations
| | name: "pvc-abc123" | (e.g., rbd create,
| | capacity_range: { | ceph osd pool,
| | required_bytes: 107374.. | New-VirtualDisk)
| | } |
| | volume_capabilities: [...] |
| | parameters: { |
| | "pool": "ocs-storagecl.." |
| | "imageFeatures": "layeri."|
| | } |
| | } |
| | |
| | <--- CreateVolumeResponse { |
| | volume: { |
| | volume_id: "0001-0024-.."|
| | capacity_bytes: 10737.. |
| | } |
| | } |
+---------------------+ |
| |
| Creates PV object |
v |
Kubernetes API Server |
| |
| PV bound to PVC |
v |
Scheduler places Pod |
| |
| VolumeAttachment created |
v |
external-attacher -----> ControllerPublishVolume ------->
|
| Attachment confirmed
v
kubelet (on target node)
|
| Calls CSI Node Plugin via local socket
v
Node Plugin: NodeStageVolume (format + mount to staging dir)
|
Node Plugin: NodePublishVolume (bind-mount to pod dir)
|
v
Pod sees mounted volume at /var/lib/kubelet/pods/<uid>/volumes/...
The CSI specification (currently v1.9+) defines the exact protobuf message types. Here are the key message structures:
// Simplified CSI CreateVolume request/response
message CreateVolumeRequest {
string name = 1; // Unique name (PVC UID)
CapacityRange capacity_range = 2; // Min/max bytes
repeated VolumeCapability volume_capabilities = 3; // Block or Mount
map<string, string> parameters = 4; // StorageClass parameters
map<string, string> secrets = 5; // Auth credentials
VolumeContentSource volume_content_source = 6; // Clone or snapshot source
TopologyRequirement accessibility_requirements = 7; // Topology
}
message CreateVolumeResponse {
Volume volume = 1;
}
message Volume {
int64 capacity_bytes = 1;
string volume_id = 2; // Backend-specific ID
map<string, string> volume_context = 3; // Key-value metadata
VolumeContentSource content_source = 4;
repeated Topology accessible_topology = 5;
}
Volume Lifecycle Through CSI
A volume goes through a defined state machine as it moves from creation to consumption to deletion. Each state transition corresponds to a specific CSI gRPC call.
CSI Volume Lifecycle State Machine
=====================================
PVC Created
|
v
+--------+ CreateVolume +--------+
| | ----------------> | |
| (none) | | CREATED| Volume exists on storage backend
| | | | (e.g., RBD image exists in Ceph pool)
+--------+ +---+----+
|
ControllerPublishVolume
(attach to node)
|
v
+---------+
| |
|NODE_ | Volume is attached to a specific node
|READY | (e.g., RBD mapped via krbd or rbd-nbd)
| |
+----+----+
|
NodeStageVolume
(format + mount to
staging directory)
|
v
+---------+
| |
|VOL_ | Volume is mounted at a node-global
|READY | staging path, filesystem created
| | (e.g., /var/lib/kubelet/plugins/
+----+----+ kubernetes.io/csi/pv/<name>/
| globalmount)
NodePublishVolume
(bind-mount to pod)
|
v
+---------+
| |
|PUBLISHED| Volume is accessible inside the pod
| | at the specified mount path
| | (e.g., /data inside the container)
+----+----+
|
NodeUnpublishVolume
(remove bind-mount)
|
v
+---------+
| |
|VOL_ | Back to staged but not published
|READY | (still mounted at staging path)
| |
+----+----+
|
NodeUnstageVolume
(unmount from staging)
|
v
+---------+
| |
|NODE_ | Back to attached but not mounted
|READY |
| |
+----+----+
|
ControllerUnpublishVolume
(detach from node)
|
v
+---------+
| |
| CREATED | Volume exists but not attached
| |
+----+----+
|
DeleteVolume
|
v
+---------+
| |
| (none) | Volume removed from storage backend
| |
+---------+
Note: Not all CSI drivers implement all stages. Some drivers do not
support ControllerPublishVolume (e.g., NFS-based drivers where the
volume is network-accessible from all nodes without explicit attach).
These drivers report NO_CONTROLLER_PUBLISH_UNPUBLISH capability, and
Kubernetes skips the attach/detach steps.
CSI for KubeVirt: How a VM Disk Becomes a PVC
In KubeVirt (used by OVE), virtual machines run inside pods. Each VM disk is backed by a PVC. The CSI layer is responsible for provisioning and mounting the underlying volume. The flow is:
- A VirtualMachine CR is created with a dataVolumeTemplate or a reference to an existing PVC.
- The CDI (Containerized Data Importer) operator creates a PVC using the specified StorageClass.
- The external-provisioner sidecar detects the PVC and calls CreateVolume on the CSI driver.
- The CSI driver provisions the volume (e.g., creates an RBD image in Ceph via rbd create).
- When the VM is scheduled, the CSI volume goes through the full lifecycle (attach, stage, publish).
- KubeVirt's virt-launcher pod receives the mounted volume (as a block device or filesystem mount).
- QEMU uses the volume as the VM's virtual disk (via virtio-blk or virtio-scsi).
KubeVirt VM Disk via CSI (OVE / ODF)
=======================================
VirtualMachine CR CDI Operator
spec: |
dataVolumeTemplates: | Creates DataVolume
- metadata: | which creates PVC
name: vm-boot-disk |
spec: v
storage: PersistentVolumeClaim
storageClassName: metadata:
ocs-storagecluster- name: vm-boot-disk
ceph-rbd spec:
resources: storageClassName: ocs-storagecluster-ceph-rbd
requests: volumeMode: Block
storage: 100Gi resources:
source: requests:
http: storage: 100Gi
url: "https://..." |
v
CSI Driver (ceph-csi)
|
| CreateVolume gRPC call
v
Ceph Cluster (ODF)
rbd create replicapool/csi-vol-<uuid> --size 100G
|
v
RBD Image: replicapool/csi-vol-<uuid>
|
(Pod scheduled, volume attached+staged+published)
|
v
+----------------------------+
| virt-launcher Pod |
| |
| +-----------------------+ |
| | QEMU/KVM Process | |
| | | |
| | VM: vm-boot-disk | |
| | virtio-blk --> /dev/ | |
| | rbd0 (block device) | |
| | | |
| | Guest OS sees: | |
| | /dev/vda (100 GiB) | |
| +-----------------------+ |
+----------------------------+
Key detail: KubeVirt prefers volumeMode: Block for VM disks.
This avoids the double-filesystem overhead (host filesystem +
guest filesystem). The raw block device is passed directly to
QEMU, which presents it to the guest as a virtual disk.
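The diagram above condenses the VirtualMachine spec; expressed as a full manifest it looks roughly like the sketch below. Names, the source image URL, and resource sizes are illustrative assumptions, not values from a specific environment.
# Sketch of a KubeVirt VirtualMachine whose boot disk is provisioned via CDI + ceph-csi
# (names, URL, and sizes are illustrative)
apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
  name: database-prod-01
  namespace: production-vms
spec:
  running: true
  dataVolumeTemplates:
  - metadata:
      name: database-prod-01-boot
    spec:
      storage:
        storageClassName: ocs-storagecluster-ceph-rbd
        volumeMode: Block                    # Raw block: no host filesystem in the I/O path
        accessModes: [ReadWriteOnce]
        resources:
          requests:
            storage: 100Gi
      source:
        http:
          url: "https://images.example.com/rhel9-golden.qcow2"   # Hypothetical golden image
  template:
    spec:
      domain:
        devices:
          disks:
          - name: boot
            disk: {bus: virtio}              # Presented to the guest via virtio-blk
        resources:
          requests: {memory: 8Gi, cpu: "4"}
      volumes:
      - name: boot
        dataVolume:
          name: database-prod-01-boot        # CDI creates the PVC; ceph-csi provisions the RBD image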
Key CSI Drivers for the Evaluation
| CSI Driver | Used By | Storage Backend | Protocols | Key Features |
|---|---|---|---|---|
| ceph-csi | OVE (ODF) | Ceph RADOS | RBD (block), CephFS (file) | Snapshots, clones, volume expansion, encryption, topology-aware provisioning, raw block mode, multi-attach (CephFS), RWX via CephFS |
| smb.csi.k8s.io | Azure Local | SMB shares on S2D / CSV | SMB 3.x | File-level access for Windows VMs, Kerberos auth, DFS support |
| disk.csi.azure.com | Azure Local | S2D managed disks | Virtual disk (VHDX on CSV) | Block volumes for VMs, snapshots via VSS, volume expansion |
| csi-proxy | Azure Local (Windows nodes) | (proxy for Windows-native CSI operations) | Named pipes | Enables CSI node operations on Windows nodes where Unix domain sockets are not available |
| nfs.csi.k8s.io | Both (optional) | NFS exports | NFS v3/v4 | Simple shared storage, no special driver needed, ReadWriteMany |
ceph-csi in detail (OVE/ODF):
ceph-csi is the CSI driver that connects Kubernetes to Ceph storage. In an ODF deployment, Rook-Ceph deploys and manages ceph-csi automatically. The driver communicates with Ceph via librbd (for RBD block volumes) and the CephFS kernel client or ceph-fuse (for CephFS filesystem volumes).
ceph-csi configuration is stored in a ConfigMap and Secret:
# ConfigMap: ceph-csi-config (simplified)
apiVersion: v1
kind: ConfigMap
metadata:
name: ceph-csi-config
namespace: openshift-storage
data:
config.json: |
[
{
"clusterID": "openshift-storage",
"monitors": [
"10.0.1.10:6789",
"10.0.1.11:6789",
"10.0.1.12:6789"
]
}
]
---
# Secret: csi-rbd-secret (credentials for Ceph auth)
apiVersion: v1
kind: Secret
metadata:
name: csi-rbd-secret
namespace: openshift-storage
type: Opaque
stringData:
userID: csi-rbd-node
userKey: AQD3o+1hxxxxxxxxxxxxxxxxxxxxxxxxxx==
csi-proxy for Windows nodes (Azure Local):
Azure Local runs Windows Server nodes with Hyper-V. CSI drivers on Linux use Unix domain sockets for communication. Windows does not support Unix domain sockets natively. csi-proxy runs as a Windows service on each node and exposes a set of Windows named pipe endpoints that CSI node plugins use instead of direct host access. The CSI node plugin communicates with csi-proxy via gRPC over named pipes, and csi-proxy executes the actual disk, volume, and filesystem operations on the Windows host.
CSI on Windows Nodes (Azure Local)
====================================
Linux Node (standard CSI):
+------------------------------------------+
| CSI Node Plugin Container |
| - Directly accesses /dev, /sys, /mount |
| - Uses Unix domain socket for gRPC |
| - Calls mount, mkfs, etc. directly |
+------------------------------------------+
Windows Node (via csi-proxy):
+------------------------------------------+
| CSI Node Plugin Container (Windows) |
| - Cannot access host devices directly |
| - Calls csi-proxy via named pipes |
| - csi-proxy translates to Win32 APIs |
+----------|-------------------------------+
| gRPC over named pipes
v
+------------------------------------------+
| csi-proxy.exe (Windows Service) |
| Disk API: Initialize, Partition |
| Volume API: Format, Mount, Resize |
| FS API: CreateSymlink, PathExists |
| SMB API: NewSMBGlobalMapping |
+------------------------------------------+
|
v
Windows Host (PowerShell / Win32 APIs)
CSI Features
| Feature | Description | CSI RPCs Involved | Relevance for 5,000+ VMs |
|---|---|---|---|
| Snapshots | Point-in-time copy of a volume; used for backup, cloning, testing | CreateSnapshot, DeleteSnapshot | Pre-upgrade snapshots, backup integration (Kasten, Veeam) |
| Cloning | Creates a new volume from an existing volume (copy-on-write where supported) | CreateVolume with volume_content_source | Rapid VM provisioning from golden images |
| Volume Expansion | Increase volume capacity without downtime | ControllerExpandVolume, NodeExpandVolume | Online disk growth for VMs running out of space |
| Topology-Aware Provisioning | Places volumes in the same failure domain as the consuming pod/VM | CreateVolume with accessibility_requirements | Data locality, rack-awareness, zone-awareness |
| Raw Block Volumes | Presents a volume as a raw block device (no filesystem) | Standard RPCs with VolumeCapability_Block | KubeVirt VM disks (avoids double filesystem overhead) |
| Volume Health Monitoring | Reports volume health conditions | NodeGetVolumeStats, volume condition reporting | Proactive detection of degraded volumes |
| Ephemeral Inline Volumes | Volumes tied to pod lifecycle (no PVC) | CSI ephemeral volume mode | Not relevant for persistent VM disks |
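As an example of the snapshot feature in the table above, two objects are involved: a VolumeSnapshotClass (which CSI driver and deletion policy to use) and the VolumeSnapshot itself. The sketch below uses illustrative names; ODF ships its own default snapshot classes.
# Sketch: taking a CSI snapshot of a VM disk before an upgrade (names are illustrative)
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotClass
metadata:
  name: ocs-rbd-snapclass
driver: openshift-storage.rbd.csi.ceph.com    # Must match the CSI driver of the source PVC
deletionPolicy: Retain                        # Keep the backend snapshot even if the object is deleted
---
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: vm-database-prod-snapshot-20260428
  namespace: production-vms
spec:
  volumeSnapshotClassName: ocs-rbd-snapclass
  source:
    persistentVolumeClaimName: vm-database-prod-01-boot   # PVC backing the VM disk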
CSI vs VAAI: Conceptual Comparison
VMware administrators are familiar with VAAI (vStorage APIs for Array Integration) -- a set of offload APIs that let ESXi delegate storage operations to the array hardware. CSI and VAAI solve conceptually similar problems: abstracting storage operations behind a standard API so the hypervisor/orchestrator does not need to know the backend implementation.
| Dimension | VAAI (VMware) | CSI (Kubernetes) |
|---|---|---|
| Purpose | Offload storage operations to the array | Standardize storage provisioning and lifecycle |
| Interface | T10 SCSI commands (XCOPY, WRITE_SAME, ATS, UNMAP) | gRPC (protobuf over Unix domain socket) |
| Scope | Data-plane offload (copy, zero, lock, thin reclaim) | Full lifecycle (provision, attach, mount, snapshot, expand, delete) |
| Who implements | Storage array firmware | Storage vendor's CSI driver (container) |
| Deployment | Built into ESXi + array firmware | Kubernetes pods (Deployment + DaemonSet) |
| Feature discovery | ESXi queries array for VAAI primitive support | CSI driver reports GetCapabilities |
| Storage policies | SPBM (Storage Policy Based Management) | StorageClasses (see section 3) |
| Snapshots | VADP (vStorage APIs for Data Protection) | CSI CreateSnapshot / VolumeSnapshot CRD |
| Thin provisioning | VAAI UNMAP / WRITE_SAME with zeros | CSI driver handles thin provisioning internally |
Key difference: VAAI is a data-plane optimization (let the array do the heavy lifting for copy/zero/lock operations). CSI is a control-plane abstraction (let the driver handle the entire volume lifecycle). CSI does not specify how the data plane works -- it only standardizes the management operations. The data-plane path (how I/O flows from VM to disk) is determined by the underlying protocol (RBD, iSCSI, NVMe-oF) and is outside the CSI specification.
2. Persistent Volumes and Claims (PV / PVC)
PV: The Cluster-Level Storage Resource
A PersistentVolume (PV) is a cluster-scoped Kubernetes object that represents a piece of provisioned storage. It is the Kubernetes equivalent of a LUN on a SAN, a VMDK on a datastore, or a VHDX on a CSV. PVs exist independently of any namespace or workload -- they are infrastructure-level objects managed by cluster administrators or provisioned automatically by CSI drivers.
A PV captures the following properties:
| Field | Purpose | Example |
|---|---|---|
| spec.capacity.storage | Size of the volume | 100Gi |
| spec.accessModes | How the volume can be mounted (RWO, ROX, RWX, RWOP) | [ReadWriteOnce] |
| spec.persistentVolumeReclaimPolicy | What happens when the PVC is deleted (Retain or Delete) | Retain |
| spec.storageClassName | Which StorageClass this PV belongs to | ocs-storagecluster-ceph-rbd |
| spec.volumeMode | Filesystem (default) or Block | Block |
| spec.csi | CSI-specific fields (driver name, volume handle, volume attributes) | See YAML below |
| spec.nodeAffinity | Topology constraints (which nodes can access this volume) | Rack/zone labels |
| status.phase | Current lifecycle phase (Available, Bound, Released, Failed) | Bound |
# PV provisioned by ceph-csi for a KubeVirt VM disk
apiVersion: v1
kind: PersistentVolume
metadata:
name: pvc-3f8e9a12-7b4c-4d5e-a1f0-9c8d7e6b5a43
annotations:
pv.kubernetes.io/provisioned-by: openshift-storage.rbd.csi.ceph.com
spec:
capacity:
storage: 100Gi
accessModes:
- ReadWriteOnce
persistentVolumeReclaimPolicy: Delete
storageClassName: ocs-storagecluster-ceph-rbd
volumeMode: Block # Raw block for KubeVirt VM disk
csi:
driver: openshift-storage.rbd.csi.ceph.com
volumeHandle: "0001-0024-openshift-storage-0000000000000001-3f8e9a12"
volumeAttributes:
clusterID: "openshift-storage"
pool: "ocs-storagecluster-cephblockpool"
imageFeatures: "layering,deep-flatten,exclusive-lock,object-map,fast-diff"
storage.kubernetes.io/csiProvisionerIdentity: "1682000000000-8081-openshift-storage.rbd.csi.ceph.com"
nodeStageSecretRef:
name: rook-csi-rbd-node
namespace: openshift-storage
controllerExpandSecretRef:
name: rook-csi-rbd-provisioner
namespace: openshift-storage
nodeAffinity:
required:
nodeSelectorTerms:
- matchExpressions:
- key: topology.rbd.csi.ceph.com/openshift-storage
operator: Exists
# PV for S2D-backed volume (Azure Local)
apiVersion: v1
kind: PersistentVolume
metadata:
name: pvc-a7b2c3d4-e5f6-7890-abcd-ef1234567890
annotations:
pv.kubernetes.io/provisioned-by: disk.csi.azure.com
spec:
capacity:
storage: 100Gi
accessModes:
- ReadWriteOnce
persistentVolumeReclaimPolicy: Delete
storageClassName: azurelocal-premium
volumeMode: Filesystem
csi:
driver: disk.csi.azure.com
volumeHandle: "/subscriptions/.../virtualharddisks/vm-boot-disk"
volumeAttributes:
storage.kubernetes.io/csiProvisionerIdentity: "..."
PVC: The Namespace-Scoped Request
A PersistentVolumeClaim (PVC) is a namespace-scoped object that represents a user's request for storage. It is the Kubernetes equivalent of saying "I need a 100 GiB disk from the gold tier." The PVC does not specify which physical volume to use -- it specifies what it needs, and Kubernetes (via the CSI driver and the PV/PVC binding controller) finds or creates a matching PV.
| Field | Purpose | Example |
|---|---|---|
| spec.accessModes | Required access mode | [ReadWriteOnce] |
| spec.resources.requests.storage | Minimum capacity needed | 100Gi |
| spec.storageClassName | Which StorageClass to use for provisioning | ocs-storagecluster-ceph-rbd |
| spec.volumeMode | Filesystem or Block | Block |
| spec.selector | Label-based selector for manual PV matching (static provisioning) | matchLabels: {tier: gold} |
| spec.dataSource | Clone from existing PVC or restore from VolumeSnapshot | See snapshot example below |
| spec.dataSourceRef | Extended data source (cross-namespace, custom resources) | DataVolume reference |
| status.phase | Current phase (Pending, Bound, Lost) | Bound |
| status.capacity | Actual provisioned capacity (may exceed request) | 100Gi |
# PVC for a KubeVirt VM boot disk (OVE)
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: vm-database-prod-01-boot
namespace: production-vms
labels:
app: database
tier: production
vm.kubevirt.io/name: database-prod-01
spec:
accessModes:
- ReadWriteOnce
volumeMode: Block # Raw block for VM disk
storageClassName: ocs-storagecluster-ceph-rbd
resources:
requests:
storage: 100Gi
# PVC for a shared configuration volume (RWX via CephFS)
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: shared-config
namespace: production-vms
spec:
accessModes:
- ReadWriteMany # Multiple VMs can mount simultaneously
volumeMode: Filesystem
storageClassName: ocs-storagecluster-cephfs
resources:
requests:
storage: 10Gi
# PVC restored from a VolumeSnapshot (clone for testing)
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: vm-database-test-clone
namespace: test-vms
spec:
accessModes:
- ReadWriteOnce
volumeMode: Block
storageClassName: ocs-storagecluster-ceph-rbd
resources:
requests:
storage: 100Gi
dataSource:
name: vm-database-prod-snapshot-20260428
kind: VolumeSnapshot
apiGroup: snapshot.storage.k8s.io
Binding: How PV and PVC Match
When a PVC is created, the PV controller in kube-controller-manager attempts to find a matching PV. The matching algorithm considers: access modes, capacity (PV capacity >= PVC request), StorageClass name, volume mode, and optional label selectors. There are two provisioning models:
Static provisioning: An administrator pre-creates PV objects. When a PVC is created, the PV controller searches existing Available PVs for a match. If found, the PV and PVC are bound. If no match exists, the PVC remains Pending.
Dynamic provisioning: The StorageClass specifies a CSI provisioner. When a PVC references a StorageClass with a provisioner, the external-provisioner sidecar creates a new volume via the CSI CreateVolume call, then creates a corresponding PV object. The PV controller then binds the PVC to the newly created PV.
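The static model described above looks like the sketch below in practice: an administrator pre-creates a PV that points at an existing backend volume, and a PVC pins itself to that PV by name. The volumeHandle and object names are hypothetical, and driver secrets/volume attributes that a real CSI PV would also need are omitted for brevity.
# Sketch of static provisioning (volumeHandle and names are hypothetical; secrets omitted)
apiVersion: v1
kind: PersistentVolume
metadata:
  name: legacy-reporting-disk
spec:
  capacity:
    storage: 500Gi
  accessModes: [ReadWriteOnce]
  persistentVolumeReclaimPolicy: Retain
  storageClassName: ""                  # Empty class: never targeted by dynamic provisioning
  csi:
    driver: openshift-storage.rbd.csi.ceph.com
    volumeHandle: "0001-0024-example-preexisting-rbd-image"   # Hypothetical backend volume ID
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: legacy-reporting-disk
  namespace: production-vms
spec:
  accessModes: [ReadWriteOnce]
  storageClassName: ""                  # Must match the PV's (empty) class
  volumeName: legacy-reporting-disk     # Bind directly to the pre-created PV
  resources:
    requests:
      storage: 500Gi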
PV-PVC Binding Flow (Dynamic Provisioning)
=============================================
1. User creates PVC
+---------------------------+
| PVC: vm-boot-disk |
| storageClassName: gold |
| storage: 100Gi |
| volumeMode: Block |
| accessModes: [RWO] |
| status.phase: Pending |
+---------------------------+
|
v
2. external-provisioner detects unbound PVC
matching its StorageClass provisioner
|
v
3. external-provisioner calls CSI CreateVolume
+---------------------------+
| gRPC: CreateVolume |
| name: pvc-<uuid> |
| capacity: 100 GiB |
| parameters: {pool: rbd, |
| imageFeatures: layering}|
+---------------------------+
|
v
4. CSI driver provisions volume on backend
(e.g., rbd create replicapool/csi-vol-<uuid> --size 100G)
|
v
5. CSI driver returns CreateVolumeResponse
with volume_id
|
v
6. external-provisioner creates PV object
+---------------------------+
| PV: pvc-<uuid> |
| capacity: 100Gi |
| csi.volumeHandle: vol-id |
| storageClassName: gold |
| volumeMode: Block |
| accessModes: [RWO] |
| reclaimPolicy: Delete |
| status.phase: Available |
+---------------------------+
|
v
7. PV controller (kube-controller-manager)
detects matching PV and PVC
- capacity: 100Gi >= 100Gi OK
- accessModes: [RWO] matches OK
- storageClass: gold matches OK
- volumeMode: Block matches OK
|
v
8. PV controller binds PV <-> PVC
- Sets PV.spec.claimRef to PVC
- Sets PVC.spec.volumeName to PV
+---------------------------+
| PV: pvc-<uuid> |
| status.phase: Bound |
| claimRef: |
| name: vm-boot-disk |
| namespace: prod-vms |
+---------------------------+
+---------------------------+
| PVC: vm-boot-disk |
| status.phase: Bound |
| volumeName: pvc-<uuid> |
+---------------------------+
Access Modes
Access modes define how many nodes can mount a volume simultaneously. They are specified in both the PV and PVC.
| Mode | Abbreviation | Description | Ceph RBD | CephFS | S2D (VHDX) | S2D (SMB) |
|---|---|---|---|---|---|---|
| ReadWriteOnce | RWO | Mounted read-write by a single node | Yes | Yes | Yes | Yes |
| ReadOnlyMany | ROX | Mounted read-only by many nodes | Yes | Yes | Yes | Yes |
| ReadWriteMany | RWX | Mounted read-write by many nodes | Block mode only | Yes | No | Yes |
| ReadWriteOncePod | RWOP | Mounted read-write by a single pod (K8s 1.27+) | Yes | Yes | Yes | Yes |
Which access modes matter for VMs:
- RWO is the standard mode for VM boot and data disks. Each VM disk is exclusively owned by one VM. In KubeVirt, the virt-launcher pod mounts the volume in RWO mode.
- RWX is required for live migration. When a VM migrates from node A to node B, both the source and target virt-launcher pods must access the volume simultaneously during the migration window. RBD does not offer RWX for Filesystem-mode volumes, but CephFS does; for Block-mode volumes, ceph-csi can provide RWX, which is what ODF uses for live-migratable VM disks. The RBD exclusive-lock feature ensures only one writer is active at a time, with the lock handed over between nodes during the migration window rather than true concurrent multi-writer access.
- RWOP is useful for ensuring a VM disk is never accidentally mounted by two pods on the same node (e.g., during a failed migration cleanup). It provides stronger isolation than RWO.
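A PVC requesting RWOP looks like an ordinary RWO claim with a stricter access mode. The sketch below assumes the CSI driver advertises support for the mode; the claim name and size are illustrative.
# Sketch: a VM data disk requested with ReadWriteOncePod so that no second pod can
# ever mount it, even on the same node (requires driver support for the mode)
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: vm-batch-worker-01-data
  namespace: production-vms
spec:
  accessModes:
  - ReadWriteOncePod              # Single pod, not just single node
  volumeMode: Block
  storageClassName: ocs-storagecluster-ceph-rbd
  resources:
    requests:
      storage: 500Gi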
Access Mode Impact on VM Live Migration
==========================================
RWO Volume (Ceph RBD):
+----------+ +----------+
| Node A | | Node B |
| (source) | | (target) |
| virt- | RBD exclusive lock | virt- |
| launcher | transferred during | launcher |
| pod | --------migration--------> | pod |
| [RWO] | | [RWO] |
+----------+ +----------+
| |
v v
RBD device mapped RBD device mapped
on Node A on Node B
(lock released) (lock acquired)
Key: Only one node holds the exclusive lock at a time.
KubeVirt coordinates the handoff. This works for RBD
but requires RBD exclusive-lock feature enabled.
RWX Volume (CephFS):
+----------+ +----------+
| Node A | | Node B |
| (source) | | (target) |
| virt- | Both mount | virt- |
| launcher | simultaneously | launcher |
| pod | | pod |
| [RWX] | | [RWX] |
+----------+ +----------+
| |
v v
CephFS mounted CephFS mounted
on Node A on Node B
(concurrent access) (concurrent access)
Key: Both nodes mount the filesystem simultaneously.
No lock transfer needed. Simpler migration but CephFS
has higher latency than RBD for random I/O.
Volume Modes: Filesystem vs Block
| Mode | Description | How Data Is Accessed | KubeVirt Usage |
|---|---|---|---|
| Filesystem (default) | Volume is formatted with a filesystem (ext4, XFS) and mounted at a path | Standard file I/O (open, read, write) | VM disk image stored as a file on the mounted FS (e.g., disk.img on XFS) |
| Block | Volume is exposed as a raw block device (/dev/...) | Direct block I/O (ioctl, read/write on device) | Raw block device passed directly to QEMU/KVM |
Why KubeVirt prefers Block mode for VM disks:
With Filesystem mode, the I/O path is: Guest OS --> virtio-blk --> QEMU --> host filesystem (XFS on staged volume) --> block layer --> CSI volume (RBD/S2D). There are two filesystems in the path -- the guest's filesystem inside the VM and the host's filesystem wrapping the disk image file. This adds overhead: double metadata updates, double journaling, fragmentation of the disk image file.
With Block mode, the I/O path is: Guest OS --> virtio-blk --> QEMU --> raw block device --> CSI volume (RBD/S2D). There is only one filesystem -- the guest's. The host sees a raw block device, and QEMU reads/writes directly to it. This eliminates the overhead of the host filesystem and provides better performance for VM workloads.
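For reference, this is how a Block-mode PVC is consumed at the pod level: via volumeDevices (a device path) rather than volumeMounts (a directory). KubeVirt's virt-launcher does the equivalent internally before handing the device to QEMU; the pod and image names below are illustrative.
# Sketch: consuming a Block-mode PVC through volumeDevices (pod/image names illustrative)
apiVersion: v1
kind: Pod
metadata:
  name: block-volume-demo
  namespace: production-vms
spec:
  containers:
  - name: app
    image: registry.example.com/tools:latest      # Illustrative image
    volumeDevices:                                 # Block mode: a device, not a mount point
    - name: vm-disk
      devicePath: /dev/xvda                        # Raw device path seen inside the container
  volumes:
  - name: vm-disk
    persistentVolumeClaim:
      claimName: vm-database-prod-01-boot          # PVC created with volumeMode: Block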
I/O Path Comparison: Filesystem vs Block Volume Mode
======================================================
Filesystem Mode (volumeMode: Filesystem):
+-------+ +--------+ +--------+ +--------+ +---------+
| Guest | -> | virtio | -> | QEMU | -> | Host | -> | CSI |
| FS | | -blk | | File | | FS | | Volume |
| (ext4)| | | | I/O | | (XFS) | | (RBD) |
+-------+ +--------+ +--------+ +--------+ +---------+
^^^
Extra layer:
- Double metadata
- Fragmentation
- Journal overhead
- Potential misalignment
Block Mode (volumeMode: Block):
+-------+ +--------+ +--------+ +---------+
| Guest | -> | virtio | -> | QEMU | -> | CSI |
| FS | | -blk | | Block | | Volume |
| (ext4)| | | | I/O | | (RBD) |
+-------+ +--------+ +--------+ +---------+
^^^
Direct path:
- No host FS overhead
- No fragmentation
- No double journaling
- Native I/O alignment
Performance impact: Block mode typically shows 10-20% better
IOPS and lower p99 latency for random I/O workloads (databases,
OLTP) compared to Filesystem mode with an image file.
Reclaim Policies
The reclaim policy determines what happens to the PV (and the underlying storage) when its bound PVC is deleted.
| Policy | Behavior | Use Case |
|---|---|---|
| Delete | PV and the underlying volume on the storage backend are automatically deleted when the PVC is deleted. | Ephemeral/replaceable workloads, dev/test environments. Default for dynamically provisioned volumes. |
| Retain | PV transitions to Released state. The underlying volume is preserved. An administrator must manually clean up and re-create or delete the PV. | Production databases, regulated workloads where data must be explicitly reviewed before deletion. Critical for financial compliance. |
For a Tier-1 financial enterprise:
- Production VM disks should use Retain. Accidental PVC deletion must not cause data loss. An operational procedure for reviewing and releasing retained PVs should be established.
- Dev/test VM disks can use Delete to avoid accumulating orphaned volumes.
- Snapshot-protected volumes can use Delete if the backup/snapshot strategy ensures data recoverability independent of the PV lifecycle.
Reclaim Policy Behavior
=========================
Delete Policy:
PVC deleted --> PV released --> CSI DeleteVolume --> Backend volume removed
(rbd rm, Remove-VirtualDisk)
Timeline: [PVC exists]-----[PVC deleted]---[PV gone]---[Data gone]
Data accessible Seconds later Permanent loss
Retain Policy:
PVC deleted --> PV Released (data preserved) --> Manual admin action required
|
+-> Option A: Delete PV + data
+-> Option B: Remove claimRef,
PV becomes Available again,
new PVC can bind to it
+-> Option C: Backup data, then
delete PV
Timeline: [PVC exists]-----[PVC deleted]---[PV Released]---[Admin decides]
Data accessible Data preserved Data safe Explicit action
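Option B in the diagram above (returning a Released PV to Available) amounts to clearing the PV's claimRef. A hedged sketch of the merge patch an administrator might apply, for example with kubectl patch using a patch file:
# Sketch of a merge patch that clears claimRef on a Released PV so it can be re-bound
# (apply only after confirming the data should be reused or has been backed up)
spec:
  claimRef: null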
PV Lifecycle
PV Lifecycle State Machine
=============================
+-------------+ PV created (static) or +-------------+
| | CSI CreateVolume (dynamic) | |
| (does not | -----------------------------> | Available |
| exist) | | |
+-------------+ +------+------+
|
PV controller
finds matching PVC
(capacity, mode,
class, selector)
|
v
+------+------+
| |
| Bound | <-- PV.claimRef set
| | PVC.volumeName set
+------+------+
|
PVC deleted
|
+-------------------------------+-------------------------------+
| |
reclaimPolicy: Delete reclaimPolicy: Retain
| |
v v
+------+------+ +------+------+
| | | |
| (deleted) | CSI DeleteVolume called, | Released | Data preserved,
| | PV object removed, | | claimRef still set,
| | backend volume destroyed | | cannot be re-bound
+-------------+ +------+------+
|
Admin removes
claimRef manually
|
v
+------+------+
| |
| Available | Can be bound
| | to a new PVC
+-------------+
Note: A PV can also enter "Failed" state if the backend volume
becomes inaccessible (e.g., Ceph cluster unreachable, RBD image
corrupted). Failed PVs require manual intervention.
PVC Resizing: Online Volume Expansion for VM Disks
PVC resizing allows increasing the storage capacity of a bound volume without downtime. This is critical for VMs that accumulate data over time. The process involves two steps:
- Controller-side expansion: The external-resizer sidecar detects the PVC size change and calls ControllerExpandVolume. The CSI driver expands the volume on the backend (e.g., rbd resize for Ceph).
- Node-side expansion: After the backend volume is expanded, the kubelet calls NodeExpandVolume on the node plugin. The node plugin resizes the filesystem (if volumeMode is Filesystem) or notifies the block device layer (if volumeMode is Block).
# Expand a VM disk from 100Gi to 200Gi (edit the PVC)
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: vm-database-prod-01-data
namespace: production-vms
spec:
accessModes:
- ReadWriteOnce
volumeMode: Block
storageClassName: ocs-storagecluster-ceph-rbd # Must have allowVolumeExpansion: true
resources:
requests:
storage: 200Gi # Changed from 100Gi to 200Gi
The StorageClass must have allowVolumeExpansion: true for resizing to work. After the expansion:
- Block mode (KubeVirt VMs): The block device grows. The guest OS must rescan and extend its internal filesystem/partition. This is identical to growing a VMDK in VMware -- the guest sees a larger disk and must run growpart + resize2fs (Linux) or Resize-Partition (Windows).
- Filesystem mode: The host filesystem (XFS, ext4) is expanded automatically by the CSI node plugin. No guest action needed.
Shrinking volumes is not supported by CSI. If a volume needs to be smaller, the data must be migrated to a new, smaller volume. This matches VMware behavior where shrinking a VMDK in-place is also not supported.
Multi-Attach Considerations
| Storage Type | Access Mode | Multi-Attach | Use Case | Protocol |
|---|---|---|---|---|
| Ceph RBD | RWO | No (exclusive lock) | VM boot/data disks | RADOS/krbd |
| CephFS | RWX | Yes (POSIX filesystem) | Shared config, home dirs | CephFS kernel client |
| NFS | RWX | Yes (NFS protocol) | Legacy shared storage | NFS v3/v4 |
| S2D VHDX | RWO | No (exclusive) | VM boot/data disks | SMB Direct (RDMA) |
| S2D SMB Share | RWX | Yes (SMB3 protocol) | Shared file storage | SMB 3.x |
RWX for VM live migration vs RWX for multi-writer: These are different requirements. Live migration needs temporary dual-node access during migration (seconds). Multi-writer RWX means persistent concurrent access by multiple pods. KubeVirt handles live migration of RWO/RBD volumes via exclusive lock transfer, which is architecturally cleaner than requiring RWX for all VM disks.
3. StorageClasses
What StorageClasses Solve
Without StorageClasses, every PVC would need to be manually matched to a pre-created PV (static provisioning). This does not scale for 5,000+ VMs. StorageClasses enable dynamic provisioning -- the user requests storage by specifying a class name, and the system automatically provisions a volume with the right characteristics.
StorageClasses are the Kubernetes equivalent of VMware's Storage Policy Based Management (SPBM). In VMware, you define storage policies (e.g., "Gold: RAID-1 mirroring, SSD tier, no compression") and assign them to VMs. In Kubernetes, you define StorageClasses with provider-specific parameters and reference them in PVCs.
StorageClass Selection Flow
==============================
User creates PVC StorageClass Definition
+--------------------+ +----------------------------------+
| PVC: my-disk | | StorageClass: gold |
| storageClassName: |-------->| provisioner: rbd.csi.ceph.com |
| gold | | parameters: |
| storage: 100Gi | | pool: nvme-replicapool |
+--------------------+ | imageFeatures: layering,... |
| reclaimPolicy: Retain |
| volumeBindingMode: |
| WaitForFirstConsumer |
| allowVolumeExpansion: true |
+----------------------------------+
|
+----------------------+
|
v
external-provisioner:
"PVC 'my-disk' references StorageClass 'gold'.
I am the provisioner for rbd.csi.ceph.com.
I will call CreateVolume with these parameters."
|
v
CSI CreateVolume(
name: "pvc-<uuid>",
capacity: 100 GiB,
parameters: {
pool: "nvme-replicapool",
imageFeatures: "layering,deep-flatten,exclusive-lock,object-map,fast-diff"
}
)
|
v
Ceph: rbd create nvme-replicapool/csi-vol-<uuid> --size 100G
|
v
PV created, bound to PVC
StorageClass Definition
A StorageClass has the following key fields:
| Field | Purpose | Values |
|---|---|---|
| provisioner | The CSI driver that provisions volumes for this class | openshift-storage.rbd.csi.ceph.com, disk.csi.azure.com, etc. |
| parameters | Driver-specific key-value pairs passed to CreateVolume | Pool name, replication factor, features, encryption settings |
| reclaimPolicy | Default reclaim policy for PVs created by this class | Delete (default) or Retain |
| volumeBindingMode | When to bind and provision the volume | Immediate or WaitForFirstConsumer |
| allowVolumeExpansion | Whether PVCs using this class can be resized | true or false |
| mountOptions | Additional mount options for filesystem volumes | ["discard", "noatime"] |
Volume Binding Modes
| Mode | Behavior | When to Use |
|---|---|---|
| Immediate | Volume is provisioned and bound as soon as the PVC is created, regardless of whether a pod exists to consume it. | Simple environments without topology constraints. The volume may be provisioned in a zone/rack where no pod will ever run. |
| WaitForFirstConsumer | Volume provisioning is delayed until a pod referencing the PVC is scheduled. The scheduler determines which node the pod will run on, and the volume is provisioned in the same topology domain. | Recommended for production. Ensures volumes are provisioned in the correct failure domain (rack, zone). Essential for topology-aware storage like Ceph with CRUSH rules or S2D with site-aware placement. |
Volume Binding Modes: Immediate vs WaitForFirstConsumer
=========================================================
Immediate Binding:
+-------+ +---------+ +--------+ +--------+
| PVC | ----> | external| ---> | CSI | --> | Volume |
| create| | provis- | | Create | | on |
| d | | ioner | | Volume | | Rack A |
+-------+ +---------+ +--------+ +--------+
|
Later: Pod scheduled to Rack B |
Problem: Volume is on Rack A, |
pod is on Rack B. Cross-rack I/O |
or scheduling failure. |
WaitForFirstConsumer Binding:
+-------+ +---------+
| PVC | ----> | external|
| create| | provis- |
| d | | ioner | "PVC has WaitForFirstConsumer.
+-------+ +---------+ I will wait."
|
Later: Pod scheduled |
to Rack B |
+-------+ |
| Pod | ----------->|
| sched | |
| Rack B| v
+-------+ +--------+ +--------+
| CSI | --> | Volume |
| Create | | on |
| Volume | | Rack B | <-- Correct topology!
+--------+ +--------+
StorageClass Parameters for Ceph RBD (OVE / ODF)
# StorageClass: Gold tier (NVMe pool, 3-way replication)
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
name: ocs-gold-nvme
annotations:
description: "NVMe-backed, 3-replica, for latency-sensitive production VMs"
provisioner: openshift-storage.rbd.csi.ceph.com
parameters:
clusterID: openshift-storage
pool: nvme-replicapool # Ceph pool backed by NVMe OSDs
imageFormat: "2" # Always "2" (layering support)
imageFeatures: "layering,deep-flatten,exclusive-lock,object-map,fast-diff"
# imageFeatures explained:
# layering - COW cloning support (instant VM provisioning from templates)
# deep-flatten - Required for snapshot deletion without affecting clones
# exclusive-lock - Single-writer guarantee (KubeVirt live migration lock transfer)
# object-map - Tracks which 4 MiB objects are allocated (speeds up diff/export)
# fast-diff - Uses object-map for efficient delta calculation (faster snapshots)
csi.storage.k8s.io/provisioner-secret-name: rook-csi-rbd-provisioner
csi.storage.k8s.io/provisioner-secret-namespace: openshift-storage
csi.storage.k8s.io/controller-expand-secret-name: rook-csi-rbd-provisioner
csi.storage.k8s.io/controller-expand-secret-namespace: openshift-storage
csi.storage.k8s.io/node-stage-secret-name: rook-csi-rbd-node
csi.storage.k8s.io/node-stage-secret-namespace: openshift-storage
csi.storage.k8s.io/fstype: ext4 # Only used if volumeMode: Filesystem
reclaimPolicy: Retain # Production: never auto-delete
volumeBindingMode: WaitForFirstConsumer # Topology-aware placement
allowVolumeExpansion: true
# StorageClass: Silver tier (SSD pool, 3-way replication)
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
name: ocs-silver-ssd
annotations:
description: "SSD-backed, 3-replica, for general production workloads"
provisioner: openshift-storage.rbd.csi.ceph.com
parameters:
clusterID: openshift-storage
pool: ssd-replicapool # Ceph pool on SSD-class OSDs
imageFormat: "2"
imageFeatures: "layering,deep-flatten,exclusive-lock,object-map,fast-diff"
csi.storage.k8s.io/provisioner-secret-name: rook-csi-rbd-provisioner
csi.storage.k8s.io/provisioner-secret-namespace: openshift-storage
csi.storage.k8s.io/controller-expand-secret-name: rook-csi-rbd-provisioner
csi.storage.k8s.io/controller-expand-secret-namespace: openshift-storage
csi.storage.k8s.io/node-stage-secret-name: rook-csi-rbd-node
csi.storage.k8s.io/node-stage-secret-namespace: openshift-storage
csi.storage.k8s.io/fstype: ext4
reclaimPolicy: Retain
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
# StorageClass: Bronze tier (HDD pool, erasure-coded for capacity)
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
name: ocs-bronze-ec
annotations:
description: "HDD-backed, erasure-coded (4+2), for capacity-heavy cold workloads"
provisioner: openshift-storage.rbd.csi.ceph.com
parameters:
clusterID: openshift-storage
pool: hdd-ec-datapool # EC data pool
dataPool: hdd-ec-datapool # Used for RBD with EC
# Note: RBD on erasure-coded pools requires a metadata pool
# (replicated) and a data pool (EC). The "pool" parameter
# points to the replicated metadata pool, "dataPool" to the EC pool.
imageFormat: "2"
imageFeatures: "layering,exclusive-lock,object-map,fast-diff"
# Note: deep-flatten is not supported on EC pools
csi.storage.k8s.io/provisioner-secret-name: rook-csi-rbd-provisioner
csi.storage.k8s.io/provisioner-secret-namespace: openshift-storage
csi.storage.k8s.io/controller-expand-secret-name: rook-csi-rbd-provisioner
csi.storage.k8s.io/controller-expand-secret-namespace: openshift-storage
csi.storage.k8s.io/node-stage-secret-name: rook-csi-rbd-node
csi.storage.k8s.io/node-stage-secret-namespace: openshift-storage
csi.storage.k8s.io/fstype: ext4
reclaimPolicy: Delete # Cold data: auto-cleanup acceptable
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
# StorageClass: CephFS (shared filesystem, RWX)
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
name: ocs-shared-cephfs
annotations:
description: "CephFS, 3-replica, ReadWriteMany for shared filesystem access"
provisioner: openshift-storage.cephfs.csi.ceph.com
parameters:
clusterID: openshift-storage
fsName: ocs-storagecluster-cephfilesystem # CephFS filesystem name
pool: ocs-storagecluster-cephfilesystem-data0 # CephFS data pool
csi.storage.k8s.io/provisioner-secret-name: rook-csi-cephfs-provisioner
csi.storage.k8s.io/provisioner-secret-namespace: openshift-storage
csi.storage.k8s.io/controller-expand-secret-name: rook-csi-cephfs-provisioner
csi.storage.k8s.io/controller-expand-secret-namespace: openshift-storage
csi.storage.k8s.io/node-stage-secret-name: rook-csi-cephfs-node
csi.storage.k8s.io/node-stage-secret-namespace: openshift-storage
reclaimPolicy: Retain
volumeBindingMode: Immediate # CephFS is accessible from all nodes
allowVolumeExpansion: true
StorageClass Parameters for S2D (Azure Local)
Azure Local uses CSI drivers that map to S2D storage pools and volume tiers. The parameters differ fundamentally from Ceph because S2D uses a different storage architecture (cache tier + capacity tier, ReFS, CSVs).
# StorageClass: Azure Local Premium (all-NVMe tier)
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
name: azurelocal-premium
annotations:
description: "S2D all-NVMe, 3-way mirror, for latency-sensitive VMs"
provisioner: disk.csi.azure.com
parameters:
storagePool: "S2D-NVMe-Pool" # S2D storage pool name
resiliencySettingName: "Mirror" # Mirror (2-way, 3-way) or Parity
numberOfCopies: "3" # 3-way mirror for production
# S2D determines cache behavior automatically based on device classes:
# - All-NVMe: no cache tier (all devices serve as capacity)
# - NVMe + SSD: NVMe acts as cache for SSD capacity
# - NVMe + HDD: NVMe acts as cache for HDD capacity
reclaimPolicy: Retain
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
# StorageClass: Azure Local Standard (SSD with NVMe cache)
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
name: azurelocal-standard
annotations:
description: "S2D SSD capacity with NVMe cache, 3-way mirror"
provisioner: disk.csi.azure.com
parameters:
storagePool: "S2D-SSD-Pool"
resiliencySettingName: "Mirror"
numberOfCopies: "3"
reclaimPolicy: Retain
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
# StorageClass: Azure Local Capacity (MAP for cost efficiency)
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
name: azurelocal-capacity
annotations:
description: "S2D mirror-accelerated parity, for capacity-heavy cold workloads"
provisioner: disk.csi.azure.com
parameters:
storagePool: "S2D-Capacity-Pool"
resiliencySettingName: "Parity" # Or "Mirror" with MAP tiering
numberOfCopies: "1" # Parity provides redundancy differently
# MAP parameters (mirror-accelerated parity):
# Hot data sits in the mirror tier (fast writes)
# Cold data automatically tiers to parity (space efficient)
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
Default StorageClass Behavior
One StorageClass can be marked as the default using the annotation storageclass.kubernetes.io/is-default-class: "true". When a PVC does not specify a storageClassName, the default StorageClass is used. If no default is set and no class is specified, the PVC remains Pending (dynamic provisioning will not occur).
For 5,000+ VMs, the default StorageClass should be the most commonly used tier (typically the "silver" equivalent). Setting the default to the most expensive tier (gold/NVMe) risks accidental overprovisioning of premium storage.
# Setting the default StorageClass
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
name: ocs-silver-ssd
annotations:
storageclass.kubernetes.io/is-default-class: "true" # <-- This is the default
provisioner: openshift-storage.rbd.csi.ceph.com
# ... parameters ...
Multiple defaults: If more than one StorageClass is marked as default, PVCs without an explicit class will fail with an error. This is a common misconfiguration during initial setup. Enforce a single default via admission webhooks or policy (e.g., OPA/Gatekeeper).
Mapping VMware SPBM to Kubernetes StorageClasses
VMware's Storage Policy Based Management (SPBM) defines storage capabilities through rules. Each rule specifies a set of capabilities (replication factor, stripe width, disk type, QoS limits) that the datastore (vSAN, VMFS, etc.) must satisfy. VMs are assigned policies, and vCenter ensures the underlying storage matches.
The conceptual mapping to Kubernetes StorageClasses is direct, but the implementation mechanism differs:
SPBM to StorageClass Mapping
===============================
VMware SPBM Policy Kubernetes StorageClass
+-----------------------------------+ +-----------------------------------+
| Policy: "Gold-Production" | | StorageClass: "ocs-gold-nvme" |
| | | |
| Rules: | | Parameters: |
| - VSAN.hostFailuresToTolerate=2 | | pool: nvme-replicapool |
| (3 replicas) | | (Ceph pool with RF=3 via |
| - VSAN.stripeWidth=2 | | CRUSH rule, min_size=2) |
| - VSAN.forceProvisioning=false | | imageFeatures: layering,... |
| | | |
| Capability Profile: | | volumeBindingMode: |
| - Storage Type: All Flash | | WaitForFirstConsumer |
| - Encryption: Required | | |
| - QoS IOPS Limit: 10000 | | reclaimPolicy: Retain |
| | | allowVolumeExpansion: true |
+-----------------------------------+ +-----------------------------------+
VMware SPBM Policy Kubernetes StorageClass
+-----------------------------------+ +-----------------------------------+
| Policy: "Silver-Standard" | | StorageClass: "ocs-silver-ssd" |
| | | |
| Rules: | | Parameters: |
| - VSAN.hostFailuresToTolerate=1 | | pool: ssd-replicapool |
| (2 replicas) | | (Ceph pool with RF=3 or RF=2 |
| - VSAN.stripeWidth=1 | | via CRUSH rule) |
| - VSAN.forceProvisioning=false | | |
| | | |
| Capability Profile: | | reclaimPolicy: Retain |
| - Storage Type: Hybrid | | allowVolumeExpansion: true |
| - Encryption: Optional | | |
+-----------------------------------+ +-----------------------------------+
VMware SPBM Policy Kubernetes StorageClass
+-----------------------------------+ +-----------------------------------+
| Policy: "Bronze-Archive" | | StorageClass: "ocs-bronze-ec" |
| | | |
| Rules: | | Parameters: |
| - VSAN.hostFailuresToTolerate=1 | | pool: hdd-ec-metadatapool |
| - VSAN.stripeWidth=1 | | dataPool: hdd-ec-datapool |
| - VSAN.forceProvisioning=true | | (EC 4+2 for capacity |
| | | efficiency) |
| Capability Profile: | | |
| - Storage Type: Magnetic | | reclaimPolicy: Delete |
| - IOPS Limit: 500 | | allowVolumeExpansion: true |
+-----------------------------------+ +-----------------------------------+
Key differences between SPBM and StorageClasses:
| Dimension | VMware SPBM | Kubernetes StorageClass |
|---|---|---|
| QoS enforcement | SPBM can set IOPS limits directly in the policy. vSAN SIOC enforces them at the datastore level. | StorageClasses do not natively support QoS. IOPS limits must be enforced via Ceph QoS (rbd_qos_iops_limit), resource quotas, or external tools. |
| Compliance checking | vSAN continuously monitors whether VMs comply with their assigned policy. Non-compliant VMs are flagged in vCenter. | No equivalent built-in compliance monitoring. Custom tooling (Prometheus alerts, OPA policies) is needed to detect mismatches. |
| Policy reassignment | A VM's storage policy can be changed in-place. vSAN will rebalance the data to comply with the new policy (e.g., from 2 replicas to 3). | Changing a PVC's StorageClass is not supported. The volume must be migrated (snapshot + restore to a new PVC with the new class). |
| Encryption | Encryption is a policy rule. Enable encryption per-policy; vSAN encrypts transparently. | Encryption is a Ceph pool-level or OSD-level setting (dm-crypt/LUKS), not a StorageClass parameter. Separate encrypted pools and StorageClasses are needed for per-tier encryption. |
| Thin vs Thick provisioning | SPBM supports both thin and thick provisioning as a policy attribute. | CSI provisioning is thin by default (both Ceph and S2D thin-provision). Thick provisioning, where supported, requires explicit driver-specific configuration. |
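To make the Gold mapping concrete, the following is a minimal sketch of the ocs-gold-nvme StorageClass implied by the diagram above. It assumes a Rook/ODF deployment in the openshift-storage namespace, a pre-created Ceph pool named nvme-replicapool, and the standard Rook CSI secret names; verify every name and parameter against the actual cluster before use.
# ocs-gold-nvme StorageClass (sketch -- pool and secret names are assumptions)
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ocs-gold-nvme
provisioner: openshift-storage.rbd.csi.ceph.com
parameters:
  clusterID: openshift-storage
  pool: nvme-replicapool                  # Ceph RBD pool, RF=3 via CRUSH rule on NVMe OSDs
  imageFormat: "2"
  imageFeatures: layering,exclusive-lock  # exclusive-lock is needed for live migration; verify kernel/driver support
  csi.storage.k8s.io/provisioner-secret-name: rook-csi-rbd-provisioner
  csi.storage.k8s.io/provisioner-secret-namespace: openshift-storage
  csi.storage.k8s.io/controller-expand-secret-name: rook-csi-rbd-provisioner
  csi.storage.k8s.io/controller-expand-secret-namespace: openshift-storage
  csi.storage.k8s.io/node-stage-secret-name: rook-csi-rbd-node
  csi.storage.k8s.io/node-stage-secret-namespace: openshift-storage
reclaimPolicy: Retain
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true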
StorageClass Hierarchy for 5,000+ VMs
For a Tier-1 financial enterprise migrating 5,000+ VMs, the following StorageClass hierarchy maps the existing VMware tiering to Kubernetes:
StorageClass Hierarchy Design
================================
Tier StorageClass Name Backend (OVE) Backend (Azure Local)
----- ---------------------- ------------------------- -------------------------
Gold ocs-gold-nvme Ceph pool: nvme-rpl S2D: all-NVMe, 3-way mirror
(VM block, RWO) RF=3, NVMe OSDs Premium tier
Retain, WaitForFirst CRUSH: host failure domain
Expansion: yes
Use: Databases, OLTP,
latency-sensitive
Silver ocs-silver-ssd (DEFAULT) Ceph pool: ssd-rpl S2D: SSD + NVMe cache,
(VM block, RWO) RF=3, SSD OSDs 3-way mirror
Retain, WaitForFirst CRUSH: host failure domain Standard tier
Expansion: yes
Use: General prod VMs,
app servers, web tier
Bronze ocs-bronze-ec Ceph pool: hdd-ec S2D: SSD/HDD, MAP
(VM block, RWO) EC 4+2, HDD/SSD OSDs (mirror-accelerated parity)
Delete, WaitForFirst CRUSH: host failure domain Capacity tier
Expansion: yes
Use: Dev/test VMs,
cold data, archives
Archive ocs-archive-compressed Ceph pool: hdd-ec-comp (not available -- S2D
(VM block, RWO) EC 8+3, HDD OSDs has no inline compression)
Delete, WaitForFirst Compression: zstd
Expansion: yes CRUSH: host failure domain
Use: Compliance logs,
rarely accessed data
Shared ocs-shared-cephfs CephFS: ocs-cephfs S2D: SMB share on CSV
(filesystem, RWX) RF=3, SSD data pool Standard tier, SMB3
Retain, Immediate MDS active/standby
Expansion: yes
Use: Shared config,
NFS/SMB replacement
VM-Images ocs-template-rbd Ceph pool: ssd-rpl S2D: SSD + NVMe cache
(VM block, RWO) RF=3, SSD OSDs Standard tier
Delete, Immediate Used for golden images
Expansion: no + layering/cloning
Use: OS templates,
golden VM images
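For the RWX tier in this hierarchy, a minimal CephFS StorageClass sketch might look like the following. The filesystem name ocs-cephfs comes from the table above; the data pool name ocs-cephfs-data0 and the Rook secret names are assumptions and should be confirmed against the deployed CephFilesystem CR.
# ocs-shared-cephfs StorageClass (sketch -- fsName, pool, and secret names are assumptions)
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ocs-shared-cephfs
provisioner: openshift-storage.cephfs.csi.ceph.com
parameters:
  clusterID: openshift-storage
  fsName: ocs-cephfs                      # CephFS filesystem served by active/standby MDS
  pool: ocs-cephfs-data0                  # CephFS data pool, RF=3 on SSD
  csi.storage.k8s.io/provisioner-secret-name: rook-csi-cephfs-provisioner
  csi.storage.k8s.io/provisioner-secret-namespace: openshift-storage
  csi.storage.k8s.io/controller-expand-secret-name: rook-csi-cephfs-provisioner
  csi.storage.k8s.io/controller-expand-secret-namespace: openshift-storage
  csi.storage.k8s.io/node-stage-secret-name: rook-csi-cephfs-node
  csi.storage.k8s.io/node-stage-secret-namespace: openshift-storage
reclaimPolicy: Retain
volumeBindingMode: Immediate
allowVolumeExpansion: true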
Capacity planning (OVE / ODF example):
+--------------------+---------+----------+---------+
| Tier | VM Count| Avg Size | Total |
+--------------------+---------+----------+---------+
| Gold (NVMe, RF=3) | 500 | 200 Gi | 100 Ti |
| Silver (SSD, RF=3) | 3,000 | 150 Gi | 450 Ti |
| Bronze (HDD, EC) | 1,000 | 300 Gi | 300 Ti |
| Archive (HDD, EC) | 300 | 500 Gi | 150 Ti |
| Shared (CephFS) | -- | 5 Ti | 5 Ti |
| Templates | 20 | 50 Gi | 1 Ti |
+--------------------+---------+----------+---------+
| Usable total | 4,820 | |~1,006 Ti|
+--------------------+---------+----------+---------+
| Raw required (RF=3)| | |~1,668 Ti| (Gold+Silver+Shared+Templates: 556 Ti x 3)
| Raw required (EC)  | | |  ~656 Ti| (Bronze 4+2: 1.5x = 450 Ti, Archive 8+3: 1.375x = ~206 Ti)
| Raw grand total    | | |~2,324 Ti|
+--------------------+---------+----------+---------+
VolumeSnapshot Example
VolumeSnapshots provide CSI-native, point-in-time snapshots that can be used for backup, cloning, and disaster recovery. They are the Kubernetes equivalent of VMware's VADP snapshots.
# VolumeSnapshotClass (defines the snapshot provider)
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotClass
metadata:
name: ocs-rbd-snapclass
annotations:
snapshot.storage.kubernetes.io/is-default-class: "true"
driver: openshift-storage.rbd.csi.ceph.com
parameters:
clusterID: openshift-storage
csi.storage.k8s.io/snapshotter-secret-name: rook-csi-rbd-provisioner
csi.storage.k8s.io/snapshotter-secret-namespace: openshift-storage
deletionPolicy: Delete # Delete snapshot when VolumeSnapshot CR is deleted
# VolumeSnapshot (take a snapshot of a VM disk)
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
name: vm-database-prod-snapshot-20260428
namespace: production-vms
labels:
app: database
backup-schedule: daily
spec:
volumeSnapshotClassName: ocs-rbd-snapclass
source:
persistentVolumeClaimName: vm-database-prod-01-data
# VolumeSnapshotContent (created automatically by external-snapshotter)
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotContent
metadata:
name: snapcontent-3f8e9a12-7b4c-4d5e-a1f0-9c8d7e6b5a43
spec:
deletionPolicy: Delete
driver: openshift-storage.rbd.csi.ceph.com
source:
volumeHandle: "0001-0024-openshift-storage-0000000000000001-3f8e9a12"
volumeSnapshotRef:
name: vm-database-prod-snapshot-20260428
namespace: production-vms
volumeSnapshotClassName: ocs-rbd-snapclass
status:
snapshotHandle: "0001-0024-openshift-storage-snap-3f8e9a12"
creationTime: 1777334400000000000 # nanoseconds since epoch (2026-04-28T00:00:00Z)
readyToUse: true
restoreSize: 107374182400 # 100 GiB in bytes
The snapshot flow:
VolumeSnapshot Lifecycle
==========================
1. User creates VolumeSnapshot CR
(references PVC name + VolumeSnapshotClass)
|
v
2. external-snapshotter sidecar detects VolumeSnapshot
|
v
3. external-snapshotter calls CSI CreateSnapshot
CreateSnapshot(source_volume_id, name, parameters)
|
v
4. CSI driver creates snapshot on backend
Ceph: rbd snap create replicapool/csi-vol-<uuid>@snap-<uuid>
S2D: VSS snapshot of VHDX
|
v
5. CSI driver returns CreateSnapshotResponse
(snapshot_id, creation_time, ready_to_use)
|
v
6. external-snapshotter creates VolumeSnapshotContent CR
(cluster-scoped, stores backend snapshot details)
|
v
7. VolumeSnapshot status updated:
status.readyToUse: true
status.restoreSize: 100Gi
To restore: Create a new PVC with dataSource referencing
the VolumeSnapshot (see PVC example above).
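A minimal restore sketch, equivalent to the dataSource PVC example referenced above and restated here so the lifecycle is self-contained. The namespace, StorageClass, and VM disk names reuse the examples on this page; the StorageClass is assumed to be the gold tier and must use the same CSI driver as the snapshot.
# Restore: new PVC provisioned from the snapshot (sketch -- names assume the examples above)
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: vm-database-prod-01-data-restored
  namespace: production-vms
spec:
  storageClassName: ocs-gold-nvme          # must be served by the same CSI driver as the snapshot
  volumeMode: Block                        # raw block for a KubeVirt VM disk
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 100Gi                       # must be >= the snapshot's restoreSize
  dataSource:
    apiGroup: snapshot.storage.k8s.io
    kind: VolumeSnapshot
    name: vm-database-prod-snapshot-20260428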
How the Candidates Handle This
| Aspect | VMware (VMFS/vSAN) | OVE (CSI/ODF) | Azure Local (CSI/S2D) | Swisscom ESC |
|---|---|---|---|---|
| Storage abstraction | VMDK on Datastore | PVC backed by CSI volume (RBD image in Ceph pool) | PVC backed by CSI volume (VHDX on S2D CSV) | VM disk (managed by Swisscom, opaque to customer) |
| Plugin model | In-tree VMFS/vSAN/NFS drivers (ESXi built-in) | CSI out-of-tree (ceph-csi via Rook operator) | CSI out-of-tree (disk.csi.azure.com, smb.csi.k8s.io, csi-proxy) | N/A (fully managed) |
| Storage policy / tiering | SPBM policies (GUI + API, compliance monitoring, in-place reassignment) | StorageClasses (YAML manifests, no built-in compliance monitoring, no in-place reassignment) | StorageClasses (YAML manifests, Azure Arc integration for monitoring) | SLA tiers (contractual, no customer control over placement) |
| Dynamic provisioning | vSAN creates VMDK on demand based on policy | CSI external-provisioner calls CreateVolume, ceph-csi creates RBD image | CSI external-provisioner calls CreateVolume, disk.csi creates VHDX | Swisscom provisions on request (ticket/API) |
| Snapshot mechanism | VADP (vStorage APIs for Data Protection), VMDK snapshots (redo logs) | VolumeSnapshot CRD, CSI CreateSnapshot, RBD COW snapshots (instant, space-efficient) | VolumeSnapshot CRD, CSI CreateSnapshot, VSS-based VHDX checkpoints | Swisscom-managed snapshots (SAN-level) |
| Cloning | vSphere API clone (full or linked clone) | PVC with dataSource (CSI volume cloning, RBD layering for COW clones) | PVC with dataSource (VHDX copy or ReFS block cloning) | Not available to customer |
| Volume expansion | Edit VMDK size in vCenter (online, guest must extend FS) | Edit PVC size (CSI ControllerExpandVolume + NodeExpandVolume, guest must extend FS for block mode) | Edit PVC size (CSI expansion, guest must extend FS) | Request increase from Swisscom |
| Access modes | VMDK: single VM (exclusive). VMFS: multi-VM read. NFS: shared. Multi-writer VMDK for clustering (SCSI reservations) | RWO (RBD, standard VM disks). RWX (CephFS, shared FS). RWOP (strict single-pod). Live migration via lock transfer. | RWO (VHDX). RWX (SMB share). Shared VHDX for guest clustering (limited). | Single-VM access (standard). Shared storage via NFS/SMB (managed). |
| Volume mode | Always virtual disk (VMDK wraps a block device with metadata) | Block (preferred for VMs, raw device to QEMU) or Filesystem (image file on host FS) | Filesystem (VHDX on ReFS/CSV, standard Hyper-V model) | N/A (managed) |
| Topology awareness | vSAN fault domains, SPBM rack-aware policies | CSI topology keys + WaitForFirstConsumer + CRUSH rules (rack, zone, site) | CSI topology keys + WaitForFirstConsumer + S2D fault domains (node, chassis, rack, site) | N/A (managed) |
| QoS / IOPS limits | SIOC (Storage I/O Control) per-VM IOPS/throughput limits | Not built into CSI/StorageClass. Requires Ceph rbd_qos_* settings per image or cgroup I/O limits. | Not built into CSI/StorageClass. Requires Hyper-V QoS policies or S2D bandwidth reservation. | SLA-based (contractual IOPS guarantee per tier) |
| Encryption at rest | vSAN encryption (per-policy, KMS integration) | Ceph OSD-level encryption (dm-crypt/LUKS, KMIP). Per-pool, not per-StorageClass. | BitLocker per CSV volume (TPM-backed). Per-volume, not per-StorageClass. | Swisscom-managed (assumed encrypted) |
| Multi-cluster storage | vSAN stretched cluster, cross-vCenter | Ceph external mode (single Ceph cluster serves multiple OCP clusters, single CSI config) | Each S2D cluster is independent. No cross-cluster storage sharing. | Shared SAN backend (multi-tenant) |
| Operational tooling | vCenter GUI, PowerCLI, ESXCLI | kubectl, oc, ceph CLI, ODF Console Plugin, Rook CRDs, Prometheus metrics | kubectl, PowerShell, Windows Admin Center, Azure Portal, Azure Arc | Swisscom portal |
| Migration from VMware | N/A (source) | MTV (Migration Toolkit for Virtualization) converts VMDK to RBD-backed PVC | Azure Migrate converts VMDK to VHDX-backed PVC | Swisscom handles migration (P2V/V2V) |
| Backup integration | VADP-aware: Veeam, Commvault, Dell Avamar, Cohesity | CSI snapshot + Kasten K10, Veeam Kasten, Trilio. VolumeSnapshot is the backup API. | CSI snapshot + Azure Backup, Veeam. VSS integration for app-consistent snapshots. | Swisscom-managed backup |
| Maturity for VMs | 20+ years (VMware invented virtual disk management) | ~4 years (KubeVirt CSI integration GA since OCP 4.10+, rapidly maturing) | ~3 years (Azure Local CSI for AKS hybrid, actively evolving) | Mature (traditional IaaS model) |
Key Takeaways
-
CSI is the universal storage interface -- but the devil is in the driver. CSI standardizes the control-plane API between Kubernetes and storage. However, the quality, feature completeness, and maturity of the CSI driver varies significantly between platforms. ceph-csi is battle-tested in production (part of the CNCF ecosystem, deployed at scale by many organizations). Azure Local's CSI drivers are newer and less proven at scale. During the PoC, test every CSI feature the organization needs: provisioning, snapshots, cloning, expansion, raw block mode. Do not assume feature parity between drivers.
-
Block mode is non-negotiable for VM performance. KubeVirt VMs using Filesystem volume mode pay a measurable performance penalty (double filesystem overhead, fragmentation, alignment issues). The PoC must validate that all VM disks use volumeMode: Block and that the CSI driver and storage backend handle block volumes correctly. This is standard for ceph-csi/RBD but needs validation for Azure Local's CSI implementation.
-
StorageClasses replace SPBM -- but with gaps. StorageClasses provide dynamic provisioning and tiering similar to SPBM. However, SPBM offers in-place policy reassignment, continuous compliance monitoring, and integrated QoS enforcement that StorageClasses do not. The organization must build supplementary tooling (Prometheus alerts, OPA/Gatekeeper policies, custom controllers) to close these gaps. This is an operational cost unique to the Kubernetes model.
-
WaitForFirstConsumer is mandatory for production. Using Immediate binding mode in a multi-rack environment will cause topology mismatches -- volumes provisioned in rack A while VMs run in rack B. Set all production StorageClasses to WaitForFirstConsumer. This is a day-1 configuration decision that is difficult to change retroactively (existing volumes cannot change their topology).
-
Reclaim policies require an organizational decision. For a Tier-1 financial institution, Retain should be the default for production StorageClasses. Accidental PVC deletion must not cause immediate data loss. However, Retain creates a management burden -- orphaned PVs accumulate and consume storage. An operational process for reviewing and cleaning up Released PVs is required. Automate this with a custom controller or scheduled job (see the CronJob sketch after this list).
-
Volume expansion works -- but shrinking does not. PVC resizing (growth) is supported by both ceph-csi and Azure Local CSI drivers. But volume shrinking is not supported by CSI. Over-provisioned volumes cannot be reclaimed. For 5,000+ VMs, right-sizing volumes at creation (or using thin provisioning with monitoring) is critical to avoid storage waste.
-
Live migration changes the access mode equation. In VMware, live migration (vMotion) is transparent -- the VMDK stays on shared storage, only compute moves. In KubeVirt, live migration of RBD-backed (RWO) volumes requires the exclusive-lock feature and a lock-transfer mechanism. This works but adds complexity. CephFS (RWX) simplifies migration but adds filesystem overhead. The PoC should test live migration under load with both RBD and CephFS to determine the preferred approach.
-
The snapshot model is fundamentally different. VMware VMDK snapshots use redo logs (delta disks) that can cause performance degradation when stacked. Ceph RBD snapshots are COW at the object level -- instant, space-efficient, and do not degrade read performance. This is an architectural improvement for backup workflows. However, the integration with enterprise backup tools (Veeam, Commvault) via VolumeSnapshot CRDs is newer and less mature than VADP integration. Validate backup tool compatibility during the PoC.
-
QoS is the biggest gap versus VMware. VMware's SIOC provides per-VM IOPS and throughput limits enforceable at the datastore level. Neither CSI nor StorageClasses offer built-in QoS. For a shared platform hosting 5,000+ VMs from different business units, noisy-neighbor isolation is critical. The organization must implement QoS through alternative mechanisms: Ceph per-image QoS (rbd_qos_iops_limit), cgroup I/O controllers, or platform-level resource quotas. This gap must be addressed before production.
-
The migration path determines the StorageClass design. When migrating from VMware, each VM's current SPBM policy must map to a StorageClass. Document every SPBM policy in use today, its parameters (replication, encryption, QoS, disk tier), and the number of VMs assigned to it. This inventory drives the StorageClass hierarchy design. Do not design StorageClasses in the abstract -- derive them from the actual VMware policy landscape.
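A report-only sketch of the scheduled cleanup job mentioned in the reclaim-policy takeaway above. It only lists Released PVs (deletion is deliberately left to a reviewed process); the namespace, image, and ServiceAccount are assumptions, and the ServiceAccount needs RBAC permission to list PersistentVolumes.
# Released-PV audit CronJob (sketch -- namespace, image, and ServiceAccount are assumptions)
apiVersion: batch/v1
kind: CronJob
metadata:
  name: released-pv-report
  namespace: storage-ops
spec:
  schedule: "0 6 * * *"                    # daily report at 06:00
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: pv-auditor   # needs RBAC: get/list persistentvolumes
          restartPolicy: OnFailure
          containers:
            - name: report
              image: bitnami/kubectl:latest
              command:
                - /bin/sh
                - -c
                - >
                  kubectl get pv
                  -o jsonpath='{range .items[?(@.status.phase=="Released")]}{.metadata.name}{"\t"}{.spec.capacity.storage}{"\t"}{.spec.storageClassName}{"\n"}{end}'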
Discussion Guide
The following questions are designed for vendor deep-dives, PoC planning, and internal architecture reviews. They focus on the Kubernetes storage model specifics that affect VM operations at scale.
Questions for OVE / ODF (Red Hat)
-
CSI driver maturity and feature coverage: "List every CSI feature that ceph-csi supports as GA (not tech preview) in the ODF version you are proposing. Specifically: volume snapshots, volume cloning, online volume expansion, raw block volumes, topology-aware provisioning, volume health monitoring, and volume group snapshots. For any feature that is tech preview or unsupported, what is the GA timeline?"
-
Block mode performance validation: "Show benchmark data comparing volumeMode: Block vs volumeMode: Filesystem for a KubeVirt VM running a 4K random write workload on Ceph RBD. We expect 10-20% better IOPS with block mode. Confirm this and explain the overhead sources in filesystem mode (double journaling, metadata updates, alignment)."
-
StorageClass design review: "We plan to implement four tiers (gold/NVMe, silver/SSD, bronze/EC, archive/EC-compressed) plus CephFS for shared volumes. Review our StorageClass YAML definitions and Ceph pool configuration. Are the CRUSH rules correct for our rack topology? Are the imageFeatures flags optimal? Should we use different PG counts per pool for our expected capacity?"
-
Live migration with RBD exclusive lock: "Walk us through the exact sequence of events when a KubeVirt VM with an RBD-backed (RWO, Block) volume live-migrates from node A to node B. How is the exclusive lock transferred? What is the blackout window (if any) where I/O is paused? What happens if the source node crashes during lock transfer? How does this compare to vMotion in terms of migration downtime?"
-
QoS enforcement for multi-tenant VMs: "We need per-VM IOPS limits equivalent to VMware SIOC. Ceph supports rbd_qos_iops_limit and rbd_qos_bps_limit per image. Can these be set via StorageClass parameters, or do they require post-provisioning configuration? Can they be changed dynamically without VM restart? What is the enforcement granularity (strict rate limiting vs best-effort)?"
Questions for Azure Local / S2D (Microsoft)
-
CSI driver feature parity with ceph-csi: "Which CSI features does the Azure Local CSI driver support as GA? Specifically: raw block volumes (volumeMode: Block), volume snapshots (VolumeSnapshot CRD), volume cloning (PVC dataSource), online volume expansion, and topology-aware provisioning. For KubeVirt VMs on Azure Local, is block mode supported and recommended?"
-
CSI-proxy reliability for Windows nodes: "csi-proxy is a critical dependency for CSI operations on Windows nodes. What is csi-proxy's failure model? If csi-proxy crashes, can existing VMs continue running (data-plane unaffected) or does the failure propagate? How is csi-proxy upgraded (rolling, requires node drain)? Is there a Linux-node alternative where AKS hybrid runs Linux workers instead of Windows?"
-
StorageClass mapping from SPBM for S2D: "We currently have VMware SPBM policies mapping to three tiers. Design equivalent StorageClasses for Azure Local / S2D that match our performance and redundancy requirements: (a) gold: 3-way mirror on NVMe, (b) silver: 3-way mirror on SSD with NVMe cache, (c) bronze: MAP for capacity. Show the StorageClass YAML with S2D-specific parameters."
-
Multi-cluster StorageClass consistency: "With 2-3 S2D clusters needed for 5,000+ VMs, how do we maintain consistent StorageClass definitions across clusters? Is there a federated StorageClass mechanism, or must each cluster be configured independently? How does Azure Arc handle cross-cluster storage policy governance?"
-
Snapshot and backup integration via CSI: "Demonstrate the VolumeSnapshot workflow for a Hyper-V VM on Azure Local: create snapshot, verify consistency (VSS-aware?), restore to a new PVC, attach to a new VM. How does this integrate with Azure Backup and Veeam? Is the CSI snapshot app-consistent or crash-consistent?"
Questions for Swisscom ESC
- Storage abstraction transparency: "Since Swisscom ESC is a managed IaaS, the Kubernetes storage model (CSI, PV/PVC, StorageClasses) may not apply. How does VM storage provisioning work? Is there a Kubernetes-compatible interface, or is it purely vSphere/PowerMax-based? If we need to run containerized workloads alongside VMs, what storage model do they consume?"
Cross-Platform / Internal Architecture Questions
-
SPBM-to-StorageClass migration inventory: "Before designing StorageClasses, we need a complete inventory of current VMware SPBM policies. For each policy: (a) policy name and parameters, (b) number of VMs assigned, (c) total capacity consumed, (d) performance profile (IOPS, latency from vRealize/Aria). This inventory is the input for StorageClass hierarchy design. Who owns this data extraction?"
-
CSI driver failure impact analysis: "For each candidate, document the failure impact of: (a) CSI controller pod crash (no new provisioning, existing volumes unaffected), (b) CSI node pod crash (no new mounts, existing mounts stable), (c) CSI driver version mismatch after partial upgrade, (d) loss of CSI credentials (Secret deleted). For each scenario, what is the blast radius and recovery procedure?"
-
Volume lifecycle automation: "Design a GitOps-driven workflow for the full volume lifecycle: (a) PVC creation via Helm chart or Kustomize, (b) StorageClass selection via namespace-level defaults, (c) snapshot scheduling via a CronJob or VolumeSnapshotSchedule controller, (d) PVC expansion via policy-driven automation, (e) orphaned PV cleanup via a garbage collection controller. Which components exist as open-source projects and which require custom development?"
Previous: 05-sds-platforms.md -- Software-Defined Storage Platforms (Ceph/ODF, S2D) Next: 07-data-protection.md -- Data Protection and Operations (Snapshots, DR, Encryption, Backup)