Modern datacenters and beyond

Kubernetes Storage Model

Why This Matters

The previous pages established the storage fundamentals (01), the VMware baseline (02), the protocols (03), the architectures (04), and the SDS platforms (05). This page bridges the gap between the raw storage platform and the workloads that consume it. In Kubernetes-based platforms (OVE and Azure Local), no VM or container can access storage without passing through the Kubernetes storage model. CSI, PV/PVC, and StorageClasses are the control-plane mechanisms that connect a VM's disk request to an actual block device on Ceph or S2D.

For a Tier-1 financial enterprise running 5,000+ VMs, this page answers three critical questions:

  1. How does a VM get a disk? In VMware, you create a VMDK on a datastore. In Kubernetes, every link in a chain of abstractions -- StorageClass, PVC, PV, CSI driver, SDS platform -- must function correctly for a VM to receive a writable disk. Understanding this chain is essential for troubleshooting, capacity planning, and performance analysis.

  2. How do we map our existing storage policies? VMware's SPBM (Storage Policy Based Management) assigns VMs to storage tiers (gold, silver, bronze) with defined performance and protection levels. Kubernetes StorageClasses serve the same purpose but with different mechanics. A clean migration requires mapping every SPBM policy to a corresponding StorageClass with equivalent parameters.

  3. What changes operationally? In VMware, storage is configured through vCenter's GUI -- datastore creation, policy assignment, capacity monitoring. In Kubernetes, storage is configured via YAML manifests, managed by operators, and provisioned dynamically through CSI drivers. The operational model shifts from GUI-driven administration to declarative, API-driven automation. The team must understand this shift to operate the platform effectively.

This page covers the three pillars of Kubernetes storage: CSI (how Kubernetes talks to storage backends), PV/PVC (how storage is represented and consumed), and StorageClasses (how storage tiers are defined and selected). Together, they form the complete path from a VM's disk request to a provisioned block device.


Concepts

1. CSI (Container Storage Interface)

Why CSI Exists

Before CSI, Kubernetes supported storage backends through in-tree volume plugins -- Go code compiled directly into the Kubernetes controller-manager and kubelet binaries. Each storage vendor (AWS EBS, GCE PD, Ceph RBD, vSphere VMDK, etc.) had its driver code embedded in the Kubernetes source tree. This created three problems:

  1. Coupling. Adding or fixing a storage driver required modifying and releasing the entire Kubernetes codebase. A bug in the Ceph RBD plugin meant waiting for the next Kubernetes release -- even if Ceph itself was already fixed.
  2. Vendor burden. Storage vendors had to submit PRs to kubernetes/kubernetes, pass the full Kubernetes CI/CD pipeline, and align their release cycles with Kubernetes. This was slow, error-prone, and politically contentious.
  3. Security and stability risk. Third-party storage driver code ran inside the kubelet and controller-manager with full privileges. A bug in one driver could crash the kubelet, affecting all pods on the node.

CSI solves these problems by defining a standard gRPC interface between Kubernetes and external storage drivers. Storage vendors implement CSI drivers as independent binaries (running in separate containers), communicate with Kubernetes through a well-defined API, and release on their own schedule. Kubernetes ships with CSI client code; the driver code runs out-of-tree.

Evolution of Kubernetes Storage Plugins
=========================================

Phase 1: In-Tree (Kubernetes 1.0 - 1.22)
+-----------------------------------------+
| Kubernetes Binary (kubelet, controller) |
|                                         |
|   +--------+ +--------+ +--------+     |
|   | AWS EBS| | GCE PD | | Ceph   |     |
|   | Plugin | | Plugin | | RBD    |     |
|   +--------+ +--------+ | Plugin |     |
|   +--------+ +--------+ +--------+     |
|   |vSphere | | Azure  | +--------+     |
|   | VMDK   | | Disk   | |Portworx|     |
|   | Plugin | | Plugin | | Plugin |     |
|   +--------+ +--------+ +--------+     |
|                                         |
|   All ~30 plugins compiled into the     |
|   kubernetes/kubernetes source tree     |
+-----------------------------------------+
       Problem: tightly coupled, slow release
       cycle, security risk

Phase 2: CSI Out-of-Tree (Kubernetes 1.13+, GA)
+-----------------------------------------+
| Kubernetes Binary                       |
|                                         |
|   +----------------------------------+  |
|   | CSI Client (kube-controller-mgr) |  |
|   | CSI Client (kubelet)             |  |
|   +----------------------------------+  |
+------------|----------------------------+
             | gRPC (Unix Domain Socket)
             v
+-------------------------------------------+
| CSI Driver (separate container/process)   |
|                                           |
| +----------+  +-----------+  +----------+ |
| | ceph-csi |  | smb.csi   |  | csi-proxy| |
| | (ODF)    |  | (Azure    |  | (Windows | |
| |          |  |  Local)   |  |  nodes)  | |
| +----------+  +-----------+  +----------+ |
+-------------------------------------------+
       Decoupled: vendor releases independently,
       runs in own process, own security context

CSI Architecture: Controller Plugin and Node Plugin

A CSI driver is split into two components that run as separate Kubernetes workloads:

Controller Plugin -- runs as a Deployment (1-3 replicas for HA) or StatefulSet. Handles cluster-wide operations that do not require access to a specific node's local devices. The controller plugin implements the Controller Service RPCs:

  CreateVolume -- provisions a new volume on the storage backend (e.g., creates an RBD image in Ceph, allocates a VHDX on S2D). Called when a PVC is created with dynamic provisioning.
  DeleteVolume -- removes the volume from the storage backend. Called when a PV is deleted (reclaim policy = Delete).
  ControllerPublishVolume -- attaches the volume to a specific node (e.g., maps an RBD image to a node, connects an iSCSI target). Called when a pod is scheduled to a node and the volume needs attaching.
  ControllerUnpublishVolume -- detaches the volume from a node. Called when a pod is deleted or moved and the volume is no longer needed on that node.
  CreateSnapshot -- creates a point-in-time snapshot of a volume. Called when a VolumeSnapshot CRD is created.
  DeleteSnapshot -- removes a snapshot. Called when a VolumeSnapshotContent is deleted.
  ControllerExpandVolume -- expands volume capacity on the storage backend. Called when a PVC's .spec.resources.requests.storage is increased.
  ValidateVolumeCapabilities -- checks whether a volume supports the requested capabilities. Called during pre-flight validation.
  ListVolumes -- lists all volumes managed by this driver. Used for inventory and reconciliation.
  ControllerGetCapabilities -- returns which optional RPCs this driver supports. Called during discovery at startup.
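
To make the trigger concrete, here is a minimal PVC of the kind that drives the provisioning path through CreateVolume (names and StorageClass mirror the ODF examples elsewhere on this page; treat it as a sketch, not a copy-paste manifest):

# PVC whose creation triggers dynamic provisioning via CreateVolume
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: vm-boot-disk
spec:
  storageClassName: ocs-storagecluster-ceph-rbd
  accessModes:
    - ReadWriteOnce
  volumeMode: Block        # raw block, as KubeVirt prefers for VM disks
  resources:
    requests:
      storage: 100Gi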

Node Plugin -- runs as a DaemonSet (one pod per node). Handles node-local operations that require access to the host's devices, mount namespaces, and block device layer. The node plugin implements the Node Service RPCs:

  NodeStageVolume -- mounts the volume to a node-global staging directory (e.g., formats and mounts a block device to /var/lib/kubelet/plugins/kubernetes.io/csi/pv/<pv>/globalmount). Called when the first pod on the node uses the volume.
  NodeUnstageVolume -- unmounts the volume from the staging directory. Called when the last pod on the node releases the volume.
  NodePublishVolume -- bind-mounts from the staging directory into the pod's volume directory (or creates a device symlink for raw block). Called when a pod starts and needs the volume.
  NodeUnpublishVolume -- removes the bind-mount from the pod. Called when the pod terminates.
  NodeExpandVolume -- expands the filesystem on a volume after the controller has expanded the backend volume. Called after ControllerExpandVolume, when the filesystem needs resizing.
  NodeGetInfo -- returns node topology information (zone, rack, hostname). Called during node registration.
  NodeGetCapabilities -- returns which optional RPCs this node plugin supports. Called during discovery at startup.
  NodeGetVolumeStats -- returns capacity and inode usage for a mounted volume. Used for monitoring and kubectl describe pvc.

CSI Driver Architecture (Controller + Node Plugins with Sidecars)
===================================================================

                     CONTROLLER DEPLOYMENT (1-3 replicas)
  +-----------------------------------------------------------------------+
  |  Pod: csi-controller-0                                                |
  |                                                                       |
  |  +---------------------+  +---------------------+                    |
  |  | external-provisioner|  | external-attacher   |                    |
  |  | Watches PVC objects  |  | Watches VolumeAttach|                    |
  |  | Calls CreateVolume / |  | ment objects        |                    |
  |  | DeleteVolume via gRPC|  | Calls Controller-   |                    |
  |  +----------|----------+  | PublishVolume via    |                    |
  |             |             | gRPC                 |                    |
  |             |             +----------|----------+                    |
  |             |                        |                                |
  |  +---------------------+  +---------------------+                    |
  |  | external-snapshotter|  | external-resizer    |                    |
  |  | Watches VolumeSnap- |  | Watches PVC size    |                    |
  |  | shot objects         |  | changes             |                    |
  |  | Calls CreateSnapshot |  | Calls Controller-   |                    |
  |  | via gRPC             |  | ExpandVolume via    |                    |
  |  +----------|----------+  | gRPC                 |                    |
  |             |             +----------|----------+                    |
  |             |                        |                                |
  |             +--------+   +----------+                                |
  |                      |   |                                            |
  |                      v   v                                            |
  |             +---------------------+   +----------------+              |
  |             | CSI Driver          |   | livenessprobe  |              |
  |             | (Controller Service)|   | Health checks  |              |
  |             | e.g., ceph-csi,     |   | the CSI driver |              |
  |             |   csi-smb-controller|   | via gRPC       |              |
  |             | Listens on Unix     |   | Exposes /healthz|             |
  |             | domain socket       |   +----------------+              |
  |             +---------------------+                                   |
  +-----------------------------------------------------------------------+

                     NODE DAEMONSET (one pod per node)
  +-----------------------------------------------------------------------+
  |  Pod: csi-node-xxxxx (on every schedulable node)                      |
  |                                                                       |
  |  +------------------------+                                           |
  |  | node-driver-registrar  |                                           |
  |  | Registers CSI driver   |                                           |
  |  | with kubelet via the   |                                           |
  |  | kubelet plugin         |                                           |
  |  | registration mechanism |                                           |
  |  | (fsnotify on           |                                           |
  |  |  /registration/)       |                                           |
  |  +-----------|------------+                                           |
  |              |                                                        |
  |              v                                                        |
  |  +---------------------+   +----------------+                        |
  |  | CSI Driver           |   | livenessprobe  |                        |
  |  | (Node Service)       |   +----------------+                        |
  |  | e.g., ceph-csi,      |                                             |
  |  |   csi-smb-node       |                                             |
  |  | Has access to:       |                                             |
  |  |  - Host /dev devices |                                             |
  |  |  - Host mount ns     |                                             |
  |  |  - kubelet dir       |                                             |
  |  +---------------------+                                              |
  +-----------------------------------------------------------------------+

  Communication: All sidecar-to-driver communication uses gRPC over
  a shared Unix domain socket mounted as an emptyDir volume within
  the pod. No network traffic. No TLS overhead.

CSI Sidecar Containers

CSI sidecar containers are Kubernetes-maintained helper containers that run alongside the CSI driver in the same pod. They watch Kubernetes API objects and translate them into CSI gRPC calls. The CSI driver itself never watches the Kubernetes API directly -- it only responds to gRPC calls. This separation means a CSI driver can be written without any Kubernetes-specific code.

  external-provisioner -- watches PersistentVolumeClaim objects (unbound, matching its StorageClass); calls CreateVolume and DeleteVolume. Runs in the controller pod.
  external-attacher -- watches VolumeAttachment objects (created by the attach/detach controller); calls ControllerPublishVolume and ControllerUnpublishVolume. Runs in the controller pod.
  external-snapshotter -- watches VolumeSnapshot and VolumeSnapshotContent objects; calls CreateSnapshot, DeleteSnapshot, and ListSnapshots. Runs in the controller pod.
  external-resizer -- watches PersistentVolumeClaims for size increases; calls ControllerExpandVolume. Runs in the controller pod.
  livenessprobe -- watches nothing; polls the CSI driver via the gRPC Probe health check. Runs in both controller and node pods.
  node-driver-registrar -- watches nothing; registers the driver with the kubelet and calls NodeGetInfo. Runs in the node pod.

The sidecar versions must be compatible with the CSI driver version. Mismatched sidecar versions are a common source of subtle bugs (e.g., external-provisioner v3.x calling a driver that only supports CSI spec 1.5 features).
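
In manifest form, the socket-sharing arrangement looks roughly like this (heavily trimmed; image names and versions are illustrative, not taken from any specific distribution):

# Sketch of a CSI controller pod: sidecar and driver share one emptyDir socket
apiVersion: apps/v1
kind: Deployment
metadata:
  name: csi-controller
spec:
  replicas: 2
  selector:
    matchLabels:
      app: csi-controller
  template:
    metadata:
      labels:
        app: csi-controller
    spec:
      containers:
      - name: csi-driver                          # vendor's CSI driver binary
        image: example.com/csi-driver:v1.0        # illustrative image
        volumeMounts:
        - name: socket-dir
          mountPath: /csi                         # driver listens on /csi/csi.sock
      - name: external-provisioner
        image: registry.k8s.io/sig-storage/csi-provisioner:v3.6.0   # version illustrative
        args: ["--csi-address=/csi/csi.sock"]
        volumeMounts:
        - name: socket-dir
          mountPath: /csi
      volumes:
      - name: socket-dir
        emptyDir: {}                              # the Unix domain socket lives here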

gRPC Interface Between Kubernetes and the CSI Driver

All communication between Kubernetes components and the CSI driver uses gRPC over Unix domain sockets. There is no network involved. The socket file is shared between containers in the same pod via an emptyDir volume.

gRPC Communication Flow
=========================

  Kubernetes API Server
       |
       | (watches via informers)
       v
  +---------------------+         Unix Domain Socket
  | Sidecar Container   | ------> /csi/csi.sock ------> CSI Driver Container
  | (e.g., external-    |                                |
  |  provisioner)       |  gRPC Request:                 | Executes storage
  |                     |  CreateVolumeRequest {         | backend operations
  |                     |    name: "pvc-abc123"          | (e.g., rbd create,
  |                     |    capacity_range: {           |  ceph osd pool,
  |                     |      required_bytes: 107374..  |  New-VirtualDisk)
  |                     |    }                           |
  |                     |    volume_capabilities: [...]  |
  |                     |    parameters: {               |
  |                     |      "pool": "ocs-storagecl.." |
  |                     |      "imageFeatures": "layeri."|
  |                     |    }                           |
  |                     |  }                             |
  |                     |                                |
  |                     |  <--- CreateVolumeResponse {   |
  |                     |    volume: {                   |
  |                     |      volume_id: "0001-0024-.."|
  |                     |      capacity_bytes: 10737..   |
  |                     |    }                           |
  |                     |  }                             |
  +---------------------+                                |
       |                                                  |
       | Creates PV object                                |
       v                                                  |
  Kubernetes API Server                                   |
       |                                                  |
       | PV bound to PVC                                  |
       v                                                  |
  Scheduler places Pod                                    |
       |                                                  |
       | VolumeAttachment created                         |
       v                                                  |
  external-attacher -----> ControllerPublishVolume ------->
       |
       | Attachment confirmed
       v
  kubelet (on target node)
       |
       | Calls CSI Node Plugin via local socket
       v
  Node Plugin: NodeStageVolume (format + mount to staging dir)
       |
  Node Plugin: NodePublishVolume (bind-mount to pod dir)
       |
       v
  Pod sees mounted volume at /var/lib/kubelet/pods/<uid>/volumes/...

The CSI specification (currently v1.9+) defines the exact protobuf message types. Here are the key message structures:

// Simplified CSI CreateVolume request/response
message CreateVolumeRequest {
    string name = 1;                           // Unique name (PVC UID)
    CapacityRange capacity_range = 2;          // Min/max bytes
    repeated VolumeCapability volume_capabilities = 3;  // Block or Mount
    map<string, string> parameters = 4;        // StorageClass parameters
    map<string, string> secrets = 5;           // Auth credentials
    VolumeContentSource volume_content_source = 6;  // Clone or snapshot source
    AccessibilityRequirements accessibility_requirements = 7;  // Topology
}

message CreateVolumeResponse {
    Volume volume = 1;
}

message Volume {
    int64 capacity_bytes = 1;
    string volume_id = 2;                      // Backend-specific ID
    VolumeContext volume_context = 3;          // Key-value metadata
    VolumeContentSource content_source = 4;
    AccessibleTopology accessible_topology = 5;
}

Volume Lifecycle Through CSI

A volume goes through a defined state machine as it moves from creation to consumption to deletion. Each state transition corresponds to a specific CSI gRPC call.

CSI Volume Lifecycle State Machine
=====================================

                      PVC Created
                          |
                          v
   +--------+   CreateVolume    +--------+
   |        | ----------------> |        |
   | (none) |                   | CREATED|  Volume exists on storage backend
   |        |                   |        |  (e.g., RBD image exists in Ceph pool)
   +--------+                   +---+----+
                                    |
                         ControllerPublishVolume
                           (attach to node)
                                    |
                                    v
                               +---------+
                               |         |
                               |NODE_    |  Volume is attached to a specific node
                               |READY    |  (e.g., RBD mapped via krbd or rbd-nbd)
                               |         |
                               +----+----+
                                    |
                              NodeStageVolume
                           (format + mount to
                            staging directory)
                                    |
                                    v
                               +---------+
                               |         |
                               |VOL_     |  Volume is mounted at a node-global
                               |READY    |  staging path, filesystem created
                               |         |  (e.g., /var/lib/kubelet/plugins/
                               +----+----+   kubernetes.io/csi/pv/<name>/
                                    |        globalmount)
                              NodePublishVolume
                           (bind-mount to pod)
                                    |
                                    v
                               +---------+
                               |         |
                               |PUBLISHED|  Volume is accessible inside the pod
                               |         |  at the specified mount path
                               |         |  (e.g., /data inside the container)
                               +----+----+
                                    |
                              NodeUnpublishVolume
                           (remove bind-mount)
                                    |
                                    v
                               +---------+
                               |         |
                               |VOL_     |  Back to staged but not published
                               |READY    |  (still mounted at staging path)
                               |         |
                               +----+----+
                                    |
                             NodeUnstageVolume
                           (unmount from staging)
                                    |
                                    v
                               +---------+
                               |         |
                               |NODE_    |  Back to attached but not mounted
                               |READY    |
                               |         |
                               +----+----+
                                    |
                        ControllerUnpublishVolume
                           (detach from node)
                                    |
                                    v
                               +---------+
                               |         |
                               | CREATED |  Volume exists but not attached
                               |         |
                               +----+----+
                                    |
                              DeleteVolume
                                    |
                                    v
                               +---------+
                               |         |
                               | (none)  |  Volume removed from storage backend
                               |         |
                               +---------+

  Note: Not all CSI drivers implement all stages. Some drivers do not
  support ControllerPublishVolume (e.g., NFS-based drivers where the
  volume is network-accessible from all nodes without explicit attach).
  These drivers report NO_CONTROLLER_PUBLISH_UNPUBLISH capability, and
  Kubernetes skips the attach/detach steps.

CSI for KubeVirt: How a VM Disk Becomes a PVC

In KubeVirt (used by OVE), virtual machines run inside pods. Each VM disk is backed by a PVC. The CSI layer is responsible for provisioning and mounting the underlying volume. The flow is:

  1. A VirtualMachine CR is created with a dataVolumeTemplate or a reference to an existing PVC.
  2. The CDI (Containerized Data Importer) operator creates a PVC using the specified StorageClass.
  3. The external-provisioner sidecar detects the PVC and calls CreateVolume on the CSI driver.
  4. The CSI driver provisions the volume (e.g., creates an RBD image in Ceph via rbd create).
  5. When the VM is scheduled, the CSI volume goes through the full lifecycle (attach, stage, publish).
  6. KubeVirt's virt-launcher pod receives the mounted volume (as a block device or filesystem mount).
  7. QEMU uses the volume as the VM's virtual disk (via virtio-blk or virtio-scsi).

KubeVirt VM Disk via CSI (OVE / ODF)
=======================================

  VirtualMachine CR                     CDI Operator
  spec:                                 |
    dataVolumeTemplates:                | Creates DataVolume
    - metadata:                         | which creates PVC
        name: vm-boot-disk              |
      spec:                             v
        storage:                   PersistentVolumeClaim
          storageClassName:        metadata:
            ocs-storagecluster-      name: vm-boot-disk
            ceph-rbd               spec:
          resources:                 storageClassName: ocs-storagecluster-ceph-rbd
            requests:                volumeMode: Block
              storage: 100Gi         resources:
        source:                        requests:
          http:                          storage: 100Gi
            url: "https://..."      |
                                    v
                              CSI Driver (ceph-csi)
                              |
                              | CreateVolume gRPC call
                              v
                         Ceph Cluster (ODF)
                         rbd create replicapool/csi-vol-<uuid> --size 100G
                              |
                              v
                         RBD Image: replicapool/csi-vol-<uuid>
                              |
                         (Pod scheduled, volume attached+staged+published)
                              |
                              v
                    +----------------------------+
                    | virt-launcher Pod          |
                    |                            |
                    |  +-----------------------+ |
                    |  | QEMU/KVM Process      | |
                    |  |                       | |
                    |  | VM: vm-boot-disk      | |
                    |  | virtio-blk --> /dev/   | |
                    |  |   rbd0 (block device)  | |
                    |  |                       | |
                    |  | Guest OS sees:        | |
                    |  |   /dev/vda (100 GiB)  | |
                    |  +-----------------------+ |
                    +----------------------------+

  Key detail: KubeVirt prefers volumeMode: Block for VM disks.
  This avoids the double-filesystem overhead (host filesystem +
  guest filesystem). The raw block device is passed directly to
  QEMU, which presents it to the guest as a virtual disk.
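
Putting the whole flow into one manifest, a sketch of the VirtualMachine CR from the diagram above (trimmed to the storage-relevant fields; the VM name and disk wiring are illustrative):

# KubeVirt VM whose boot disk is provisioned through CDI + ceph-csi
apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
  name: example-vm              # illustrative name
spec:
  dataVolumeTemplates:
  - metadata:
      name: vm-boot-disk
    spec:
      storage:
        storageClassName: ocs-storagecluster-ceph-rbd
        volumeMode: Block       # raw block device handed to QEMU
        resources:
          requests:
            storage: 100Gi
      source:
        http:
          url: "https://..."    # source image URL (elided, as in the diagram)
  template:
    spec:
      domain:
        devices:
          disks:
          - name: bootdisk
            disk:
              bus: virtio       # guest sees /dev/vda via virtio-blk
      volumes:
      - name: bootdisk
        dataVolume:
          name: vm-boot-disk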

Key CSI Drivers for the Evaluation

  ceph-csi -- used by OVE (ODF). Backend: Ceph RADOS; protocols: RBD (block) and CephFS (file). Key features: snapshots, clones, volume expansion, encryption, topology-aware provisioning, raw block mode, multi-attach and RWX via CephFS.
  smb.csi.k8s.io -- used by Azure Local. Backend: SMB shares on S2D / CSV; protocol: SMB 3.x. Key features: file-level access for Windows VMs, Kerberos auth, DFS support.
  disk.csi.azure.com -- used by Azure Local. Backend: S2D managed disks (VHDX on CSV). Key features: block volumes for VMs, snapshots via VSS, volume expansion.
  csi-proxy -- used by Azure Local (Windows nodes). A proxy for Windows-native CSI operations over named pipes. Enables CSI node operations on Windows nodes, where Unix domain sockets are not available.
  nfs.csi.k8s.io -- optional on both platforms. Backend: NFS exports; protocols: NFS v3/v4. Key features: simple shared storage, ReadWriteMany, no special backend integration needed.

ceph-csi in detail (OVE/ODF):

ceph-csi is the CSI driver that connects Kubernetes to Ceph storage. In an ODF deployment, Rook-Ceph deploys and manages ceph-csi automatically. The driver communicates with Ceph via librbd (for RBD block volumes) and the CephFS kernel client or ceph-fuse (for CephFS filesystem volumes).

ceph-csi configuration is stored in a ConfigMap and Secret:

# ConfigMap: ceph-csi-config (simplified)
apiVersion: v1
kind: ConfigMap
metadata:
  name: ceph-csi-config
  namespace: openshift-storage
data:
  config.json: |
    [
      {
        "clusterID": "openshift-storage",
        "monitors": [
          "10.0.1.10:6789",
          "10.0.1.11:6789",
          "10.0.1.12:6789"
        ]
      }
    ]
---
# Secret: csi-rbd-secret (credentials for Ceph auth)
apiVersion: v1
kind: Secret
metadata:
  name: csi-rbd-secret
  namespace: openshift-storage
type: Opaque
stringData:
  userID: csi-rbd-node
  userKey: AQD3o+1hxxxxxxxxxxxxxxxxxxxxxxxxxx==
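
A StorageClass then ties the driver, cluster, and pool together. A simplified sketch (a real ODF install generates this automatically; the secret-related parameters here are illustrative):

# StorageClass for RBD volumes through ceph-csi (sketch)
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ocs-storagecluster-ceph-rbd
provisioner: openshift-storage.rbd.csi.ceph.com
parameters:
  clusterID: openshift-storage               # matches ceph-csi-config above
  pool: ocs-storagecluster-cephblockpool
  imageFeatures: layering
  csi.storage.k8s.io/provisioner-secret-name: csi-rbd-secret            # illustrative
  csi.storage.k8s.io/provisioner-secret-namespace: openshift-storage
reclaimPolicy: Delete
allowVolumeExpansion: true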

csi-proxy for Windows nodes (Azure Local):

Azure Local runs Windows Server nodes with Hyper-V. CSI drivers on Linux use Unix domain sockets for communication. Windows does not support Unix domain sockets natively. csi-proxy runs as a Windows service on each node and exposes a set of Windows named pipe endpoints that CSI node plugins use instead of direct host access. The CSI node plugin communicates with csi-proxy via gRPC over named pipes, and csi-proxy executes the actual disk, volume, and filesystem operations on the Windows host.

CSI on Windows Nodes (Azure Local)
====================================

  Linux Node (standard CSI):
  +------------------------------------------+
  | CSI Node Plugin Container                |
  |   - Directly accesses /dev, /sys, /mount |
  |   - Uses Unix domain socket for gRPC     |
  |   - Calls mount, mkfs, etc. directly     |
  +------------------------------------------+

  Windows Node (via csi-proxy):
  +------------------------------------------+
  | CSI Node Plugin Container (Windows)      |
  |   - Cannot access host devices directly  |
  |   - Calls csi-proxy via named pipes      |
  |   - csi-proxy translates to Win32 APIs   |
  +----------|-------------------------------+
             | gRPC over named pipes
             v
  +------------------------------------------+
  | csi-proxy.exe (Windows Service)          |
  |   Disk API:   Initialize, Partition      |
  |   Volume API: Format, Mount, Resize      |
  |   FS API:     CreateSymlink, PathExists  |
  |   SMB API:    NewSMBGlobalMapping        |
  +------------------------------------------+
             |
             v
  Windows Host (PowerShell / Win32 APIs)

CSI Features

  Snapshots -- point-in-time copy of a volume, used for backup, cloning, and testing. RPCs: CreateSnapshot, DeleteSnapshot. At 5,000+ VMs: pre-upgrade snapshots and backup integration (Kasten, Veeam).
  Cloning -- creates a new volume from an existing volume (copy-on-write where supported). RPC: CreateVolume with volume_content_source. At 5,000+ VMs: rapid VM provisioning from golden images.
  Volume expansion -- increases volume capacity without downtime. RPCs: ControllerExpandVolume, NodeExpandVolume. At 5,000+ VMs: online disk growth for VMs running out of space.
  Topology-aware provisioning -- places volumes in the same failure domain as the consuming pod/VM. RPC: CreateVolume with accessibility_requirements. At 5,000+ VMs: data locality, rack- and zone-awareness.
  Raw block volumes -- presents a volume as a raw block device (no filesystem). Standard RPCs with VolumeCapability_Block. At 5,000+ VMs: KubeVirt VM disks (avoids double-filesystem overhead).
  Volume health monitoring -- reports volume health conditions. RPCs: NodeGetVolumeStats plus volume condition reporting. At 5,000+ VMs: proactive detection of degraded volumes.
  Ephemeral inline volumes -- volumes tied to the pod lifecycle (no PVC). CSI ephemeral volume mode. Not relevant for persistent VM disks.
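
As a concrete illustration of the snapshot feature, creating a VolumeSnapshot object is all it takes to drive CreateSnapshot on the driver (the VolumeSnapshotClass name is illustrative):

# Snapshot request: creating this CRD triggers CreateSnapshot
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: vm-boot-disk-pre-upgrade
spec:
  volumeSnapshotClassName: ocs-rbd-snapclass   # illustrative name
  source:
    persistentVolumeClaimName: vm-boot-disk

Volume expansion, by contrast, needs no new object at all: increasing spec.resources.requests.storage on the PVC triggers ControllerExpandVolume, followed by NodeExpandVolume to resize the filesystem.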

CSI vs VAAI: Conceptual Comparison

VMware administrators are familiar with VAAI (vStorage APIs for Array Integration) -- a set of offload APIs that let ESXi delegate storage operations to the array hardware. CSI and VAAI solve conceptually similar problems: abstracting storage operations behind a standard API so the hypervisor/orchestrator does not need to know the backend implementation.

  Purpose -- VAAI: offload storage operations to the array. CSI: standardize storage provisioning and lifecycle.
  Interface -- VAAI: T10 SCSI commands (XCOPY, WRITE_SAME, ATS, UNMAP). CSI: gRPC (protobuf over a Unix domain socket).
  Scope -- VAAI: data-plane offload (copy, zero, lock, thin reclaim). CSI: full lifecycle (provision, attach, mount, snapshot, expand, delete).
  Who implements it -- VAAI: storage array firmware. CSI: the storage vendor's CSI driver (a container).
  Deployment -- VAAI: built into ESXi and array firmware. CSI: Kubernetes pods (Deployment + DaemonSet).
  Feature discovery -- VAAI: ESXi queries the array for VAAI primitive support. CSI: the driver reports its capabilities via GetCapabilities.
  Storage policies -- VAAI: SPBM (Storage Policy Based Management). CSI: StorageClasses (see section 3).
  Snapshots -- VAAI: VADP (vStorage APIs for Data Protection). CSI: CreateSnapshot / the VolumeSnapshot CRD.
  Thin provisioning -- VAAI: UNMAP / WRITE_SAME with zeros. CSI: the driver handles thin provisioning internally.

Key difference: VAAI is a data-plane optimization (let the array do the heavy lifting for copy/zero/lock operations). CSI is a control-plane abstraction (let the driver handle the entire volume lifecycle). CSI does not specify how the data plane works -- it only standardizes the management operations. The data-plane path (how I/O flows from VM to disk) is determined by the underlying protocol (RBD, iSCSI, NVMe-oF) and is outside the CSI specification.


2. Persistent Volumes and Persistent Volume Claims (PV/PVC)

PV: The Cluster-Level Storage Resource

A PersistentVolume (PV) is a cluster-scoped Kubernetes object that represents a piece of provisioned storage. It is the Kubernetes equivalent of a LUN on a SAN, a VMDK on a datastore, or a VHDX on a CSV. PVs exist independently of any namespace or workload -- they are infrastructure-level objects managed by cluster administrators or provisioned automatically by CSI drivers.

A PV captures the following properties:

spec.capacity.storage -- size of the volume (e.g., 100Gi).
spec.accessModes -- how the volume can be mounted: RWO, ROX, RWX, RWOP (e.g., [ReadWriteOnce]).
spec.persistentVolumeReclaimPolicy -- what happens when the PVC is deleted: Retain or Delete.
spec.storageClassName -- which StorageClass this PV belongs to (e.g., ocs-storagecluster-ceph-rbd).
spec.volumeMode -- Filesystem (the default) or Block.
spec.csi -- CSI-specific fields: driver name, volume handle, volume attributes (see the YAML below).
spec.nodeAffinity -- topology constraints restricting which nodes can access this volume (rack/zone labels).
status.phase -- current lifecycle phase: Available, Bound, Released, or Failed.
# PV provisioned by ceph-csi for a KubeVirt VM disk
apiVersion: v1
kind: PersistentVolume
metadata:
  name: pvc-3f8e9a12-7b4c-4d5e-a1f0-9c8d7e6b5a43
  annotations:
    pv.kubernetes.io/provisioned-by: openshift-storage.rbd.csi.ceph.com
spec:
  capacity:
    storage: 100Gi
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Delete
  storageClassName: ocs-storagecluster-ceph-rbd
  volumeMode: Block                              # Raw block for KubeVirt VM disk
  csi:
    driver: openshift-storage.rbd.csi.ceph.com
    volumeHandle: "0001-0024-openshift-storage-0000000000000001-3f8e9a12"
    volumeAttributes:
      clusterID: "openshift-storage"
      pool: "ocs-storagecluster-cephblockpool"
      imageFeatures: "layering,deep-flatten,exclusive-lock,object-map,fast-diff"
      storage.kubernetes.io/csiProvisionerIdentity: "1682000000000-8081-openshift-storage.rbd.csi.ceph.com"
    nodeStageSecretRef:
      name: rook-csi-rbd-node
      namespace: openshift-storage
    controllerExpandSecretRef:
      name: rook-csi-rbd-provisioner
      namespace: openshift-storage
  nodeAffinity:
    required:
      nodeSelectorTerms:
        - matchExpressions:
            - key: topology.rbd.csi.ceph.com/openshift-storage
              operator: Exists
# PV for S2D-backed volume (Azure Local)
apiVersion: v1
kind: PersistentVolume
metadata:
  name: pvc-a7b2c3d4-e5f6-7890-abcd-ef1234567890
  annotations:
    pv.kubernetes.io/provisioned-by: disk.csi.azure.com
spec:
  capacity:
    storage: 100Gi
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Delete
  storageClassName: azurelocal-premium
  volumeMode: Filesystem
  csi:
    driver: disk.csi.azure.com
    volumeHandle: "/subscriptions/.../virtualharddisks/vm-boot-disk"
    volumeAttributes:
      storage.kubernetes.io/csiProvisionerIdentity: "..."

PVC: The Namespace-Scoped Request

A PersistentVolumeClaim (PVC) is a namespace-scoped object that represents a user's request for storage. It is the Kubernetes equivalent of saying "I need a 100 GiB disk from the gold tier." The PVC does not specify which physical volume to use -- it specifies what it needs, and Kubernetes (via the CSI driver and the PV/PVC binding controller) finds or creates a matching PV.

spec.accessModes -- required access mode (e.g., [ReadWriteOnce]).
spec.resources.requests.storage -- minimum capacity needed (e.g., 100Gi).
spec.storageClassName -- which StorageClass to use for provisioning (e.g., ocs-storagecluster-ceph-rbd).
spec.volumeMode -- Filesystem or Block.
spec.selector -- label-based selector for manual PV matching in static provisioning (e.g., matchLabels: {tier: gold}).
spec.dataSource -- clone from an existing PVC or restore from a VolumeSnapshot (see the snapshot example below).
spec.dataSourceRef -- extended data source (cross-namespace, custom resources such as a DataVolume reference).
status.phase -- current phase: Pending, Bound, or Lost.
status.capacity -- actual provisioned capacity, which may exceed the request (e.g., 100Gi).
# PVC for a KubeVirt VM boot disk (OVE)
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: vm-database-prod-01-boot
  namespace: production-vms
  labels:
    app: database
    tier: production
    vm.kubevirt.io/name: database-prod-01
spec:
  accessModes:
    - ReadWriteOnce
  volumeMode: Block                 # Raw block for VM disk
  storageClassName: ocs-storagecluster-ceph-rbd
  resources:
    requests:
      storage: 100Gi
# PVC for a shared configuration volume (RWX via CephFS)
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: shared-config
  namespace: production-vms
spec:
  accessModes:
    - ReadWriteMany                # Multiple VMs can mount simultaneously
  volumeMode: Filesystem
  storageClassName: ocs-storagecluster-cephfs
  resources:
    requests:
      storage: 10Gi
# PVC restored from a VolumeSnapshot (clone for testing)
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: vm-database-test-clone
  namespace: test-vms
spec:
  accessModes:
    - ReadWriteOnce
  volumeMode: Block
  storageClassName: ocs-storagecluster-ceph-rbd
  resources:
    requests:
      storage: 100Gi
  dataSource:
    name: vm-database-prod-snapshot-20260428
    kind: VolumeSnapshot
    apiGroup: snapshot.storage.k8s.io
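
The PVC above restores from a VolumeSnapshot. As a sketch, the snapshot itself would be created with a manifest like the following; the VolumeSnapshotClass name is an assumption (it must reference the same CSI driver), and note that spec.dataSource can only reference a VolumeSnapshot in the restoring PVC's own namespace:

```yaml
# VolumeSnapshot of a production VM disk (class name is illustrative)
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: vm-database-prod-snapshot-20260428
  namespace: test-vms                  # Same namespace as the restoring PVC
spec:
  volumeSnapshotClassName: ocs-storagecluster-rbdplugin-snapclass   # Assumed name
  source:
    persistentVolumeClaimName: vm-database-prod-01-boot             # PVC to snapshot
```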

Binding: How PV and PVC Match

When a PVC is created, the PV controller in kube-controller-manager attempts to find a matching PV. The matching algorithm considers: access modes, capacity (PV capacity >= PVC request), StorageClass name, volume mode, and optional label selectors. There are two provisioning models:

Static provisioning: An administrator pre-creates PV objects. When a PVC is created, the PV controller searches existing Available PVs for a match. If found, the PV and PVC are bound. If no match exists, the PVC remains Pending.
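
As a minimal static-provisioning sketch, the administrator pre-creates a labeled PV and a PVC selects it via spec.selector. All names and the NFS server address are hypothetical; on OVE and Azure Local, dynamic provisioning is the norm and this pattern is mainly for legacy shares:

```yaml
# Pre-created PV (static provisioning); the label lets a PVC select it
apiVersion: v1
kind: PersistentVolume
metadata:
  name: legacy-nfs-share-01
  labels:
    tier: gold
spec:
  capacity:
    storage: 500Gi
  accessModes:
    - ReadWriteMany
  persistentVolumeReclaimPolicy: Retain
  storageClassName: ""              # Empty: excluded from dynamic provisioning
  nfs:                              # In-tree NFS keeps the example simple
    server: 10.0.0.50               # Hypothetical NFS server
    path: /exports/legacy-share
---
# PVC that binds to the PV above via label selector
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: legacy-share-claim
  namespace: production-vms
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: ""              # Must match the PV's (empty) class
  selector:
    matchLabels:
      tier: gold
  resources:
    requests:
      storage: 500Gi
```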

Dynamic provisioning: The StorageClass specifies a CSI provisioner. When a PVC references a StorageClass with a provisioner, the external-provisioner sidecar creates a new volume via the CSI CreateVolume call, then creates a corresponding PV object. The PV controller then binds the PVC to the newly created PV.

PV-PVC Binding Flow (Dynamic Provisioning)
=============================================

  1. User creates PVC
     +---------------------------+
     | PVC: vm-boot-disk         |
     | storageClassName: gold    |
     | storage: 100Gi            |
     | volumeMode: Block         |
     | accessModes: [RWO]        |
     | status.phase: Pending     |
     +---------------------------+
              |
              v
  2. external-provisioner detects unbound PVC
     matching its StorageClass provisioner
              |
              v
  3. external-provisioner calls CSI CreateVolume
     +---------------------------+
     | gRPC: CreateVolume        |
     | name: pvc-<uuid>          |
     | capacity: 100 GiB         |
     | parameters: {pool: rbd,   |
     |   imageFeatures: layering}|
     +---------------------------+
              |
              v
  4. CSI driver provisions volume on backend
     (e.g., rbd create replicapool/csi-vol-<uuid> --size 100G)
              |
              v
  5. CSI driver returns CreateVolumeResponse
     with volume_id
              |
              v
  6. external-provisioner creates PV object
     +---------------------------+
     | PV: pvc-<uuid>            |
     | capacity: 100Gi           |
     | csi.volumeHandle: vol-id  |
     | storageClassName: gold    |
     | volumeMode: Block         |
     | accessModes: [RWO]        |
     | reclaimPolicy: Delete     |
     | status.phase: Available   |
     +---------------------------+
              |
              v
  7. PV controller (kube-controller-manager)
     detects matching PV and PVC
     - capacity:      100Gi >= 100Gi    OK
     - accessModes:   [RWO] matches     OK
     - storageClass:  gold matches      OK
     - volumeMode:    Block matches     OK
              |
              v
  8. PV controller binds PV <-> PVC
     - Sets PV.spec.claimRef to PVC
     - Sets PVC.spec.volumeName to PV
     +---------------------------+
     | PV: pvc-<uuid>            |
     | status.phase: Bound       |
     | claimRef:                 |
     |   name: vm-boot-disk      |
     |   namespace: prod-vms     |
     +---------------------------+
     +---------------------------+
     | PVC: vm-boot-disk         |
     | status.phase: Bound       |
     | volumeName: pvc-<uuid>    |
     +---------------------------+

Access Modes

Access modes define how many nodes can mount a volume simultaneously. They are specified in both the PV and PVC.

Mode              Abbr.  Description                                     Ceph RBD  CephFS  S2D (VHDX)  S2D (SMB)
ReadWriteOnce     RWO    Mounted read-write by a single node             Yes       Yes     Yes         Yes
ReadOnlyMany      ROX    Mounted read-only by many nodes                 Yes       Yes     Yes         Yes
ReadWriteMany     RWX    Mounted read-write by many nodes                No        Yes     No          Yes
ReadWriteOncePod  RWOP   Mounted read-write by a single pod (K8s 1.27+)  Yes       Yes     Yes         Yes

Which access modes matter for VMs:

Access Mode Impact on VM Live Migration
==========================================

  RWO Volume (Ceph RBD):
  +----------+                            +----------+
  |  Node A  |                            |  Node B  |
  | (source) |                            | (target) |
  | virt-    |    RBD exclusive lock       | virt-    |
  | launcher |    transferred during       | launcher |
  | pod      | --------migration-------->  | pod      |
  | [RWO]    |                            | [RWO]    |
  +----------+                            +----------+
       |                                       |
       v                                       v
  RBD device mapped                    RBD device mapped
  on Node A                            on Node B
  (lock released)                      (lock acquired)

  Key: Only one node holds the exclusive lock at a time.
  KubeVirt coordinates the handoff. This works for RBD
  but requires the exclusive-lock image feature to be enabled.

  RWX Volume (CephFS):
  +----------+                            +----------+
  |  Node A  |                            |  Node B  |
  | (source) |                            | (target) |
  | virt-    |    Both mount              | virt-    |
  | launcher |    simultaneously          | launcher |
  | pod      |                            | pod      |
  | [RWX]    |                            | [RWX]    |
  +----------+                            +----------+
       |                                       |
       v                                       v
  CephFS mounted                       CephFS mounted
  on Node A                            on Node B
  (concurrent access)                  (concurrent access)

  Key: Both nodes mount the filesystem simultaneously.
  No lock transfer needed. Simpler migration but CephFS
  has higher latency than RBD for random I/O.

Volume Modes: Filesystem vs Block

Filesystem (default) -- the volume is formatted with a filesystem (ext4, XFS) and mounted at a path; data is accessed via standard file I/O (open, read, write). KubeVirt stores the VM disk image as a file on the mounted filesystem (e.g., disk.img on XFS).
Block -- the volume is exposed as a raw block device (/dev/...); data is accessed via direct block I/O (ioctl, read/write on the device). KubeVirt passes the raw block device directly to QEMU/KVM.

Why KubeVirt prefers Block mode for VM disks:

With Filesystem mode, the I/O path is: Guest OS --> virtio-blk --> QEMU --> host filesystem (XFS on staged volume) --> block layer --> CSI volume (RBD/S2D). There are two filesystems in the path -- the guest's filesystem inside the VM and the host's filesystem wrapping the disk image file. This adds overhead: double metadata updates, double journaling, fragmentation of the disk image file.

With Block mode, the I/O path is: Guest OS --> virtio-blk --> QEMU --> raw block device --> CSI volume (RBD/S2D). There is only one filesystem -- the guest's. The host sees a raw block device, and QEMU reads/writes directly to it. This eliminates the overhead of the host filesystem and provides better performance for VM workloads.

I/O Path Comparison: Filesystem vs Block Volume Mode
======================================================

  Filesystem Mode (volumeMode: Filesystem):
  +-------+    +--------+    +--------+    +--------+    +---------+
  | Guest | -> | virtio | -> | QEMU   | -> | Host   | -> | CSI     |
  | FS    |    | -blk   |    | File   |    | FS     |    | Volume  |
  | (ext4)|    |        |    | I/O    |    | (XFS)  |    | (RBD)   |
  +-------+    +--------+    +--------+    +--------+    +---------+
                                              ^^^
                                          Extra layer:
                                          - Double metadata
                                          - Fragmentation
                                          - Journal overhead
                                          - Potential misalignment

  Block Mode (volumeMode: Block):
  +-------+    +--------+    +--------+    +---------+
  | Guest | -> | virtio | -> | QEMU   | -> | CSI     |
  | FS    |    | -blk   |    | Block  |    | Volume  |
  | (ext4)|    |        |    | I/O    |    | (RBD)   |
  +-------+    +--------+    +--------+    +---------+
                                 ^^^
                             Direct path:
                             - No host FS overhead
                             - No fragmentation
                             - No double journaling
                             - Native I/O alignment

  Performance impact: Block mode typically shows 10-20% better
  IOPS and lower p99 latency for random I/O workloads (databases,
  OLTP) compared to Filesystem mode with an image file.
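
To make the Block-mode path concrete, a KubeVirt VirtualMachine attaches the raw-block PVC roughly as follows. This is a fragment, not a complete VM spec; the names reuse the boot-disk PVC from earlier, and the virtio bus choice is illustrative:

```yaml
# Fragment of a KubeVirt VirtualMachine spec consuming a Block-mode PVC
apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
  name: database-prod-01
  namespace: production-vms
spec:
  template:
    spec:
      domain:
        devices:
          disks:
            - name: bootdisk
              disk:
                bus: virtio          # virtio-blk path from the guest
      volumes:
        - name: bootdisk
          persistentVolumeClaim:
            claimName: vm-database-prod-01-boot   # PVC with volumeMode: Block
```

Because the PVC is Block mode, virt-launcher receives the volume as a device node rather than a mount, and QEMU opens it directly.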

Reclaim Policies

The reclaim policy determines what happens to the PV (and the underlying storage) when its bound PVC is deleted.

Delete -- the PV and the underlying volume on the storage backend are automatically deleted when the PVC is deleted. Suited to ephemeral/replaceable workloads and dev/test environments; this is the default for dynamically provisioned volumes.
Retain -- the PV transitions to the Released state and the underlying volume is preserved; an administrator must manually clean up and then re-create or delete the PV. Suited to production databases and regulated workloads where data must be explicitly reviewed before deletion -- critical for financial compliance.

For a Tier-1 financial enterprise:

Reclaim Policy Behavior
=========================

  Delete Policy:
  PVC deleted --> PV released --> CSI DeleteVolume --> Backend volume removed
                                                       (rbd rm, Remove-VirtualDisk)
  Timeline:  [PVC exists]-----[PVC deleted]---[PV gone]---[Data gone]
             Data accessible    Seconds later   Permanent loss

  Retain Policy:
  PVC deleted --> PV Released (data preserved) --> Manual admin action required
                                                    |
                                                    +-> Option A: Delete PV + data
                                                    +-> Option B: Remove claimRef,
                                                        PV becomes Available again,
                                                        new PVC can bind to it
                                                    +-> Option C: Backup data, then
                                                        delete PV

  Timeline:  [PVC exists]-----[PVC deleted]---[PV Released]---[Admin decides]
             Data accessible    Data preserved   Data safe      Explicit action
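
Option B (returning a Released PV to the Available pool) can be performed with a merge patch that clears the stale claim reference. A sketch, keeping the document's pvc-<uuid> placeholder for the PV name:

```yaml
# clear-claimref.yaml -- removes the stale claimRef so the PV becomes Available
# Apply with: kubectl patch pv pvc-<uuid> --type merge --patch-file clear-claimref.yaml
spec:
  claimRef: null
```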

PV Lifecycle

PV Lifecycle State Machine
=============================

  +-------------+    PV created (static) or     +-------------+
  |             |    CSI CreateVolume (dynamic)  |             |
  |  (does not  | -----------------------------> |  Available  |
  |   exist)    |                                |             |
  +-------------+                                +------+------+
                                                        |
                                                  PV controller
                                                  finds matching PVC
                                                  (capacity, mode,
                                                   class, selector)
                                                        |
                                                        v
                                                 +------+------+
                                                 |             |
                                                 |   Bound     |  <-- PV.claimRef set
                                                 |             |      PVC.volumeName set
                                                 +------+------+
                                                        |
                                                  PVC deleted
                                                        |
                        +-------------------------------+-------------------------------+
                        |                                                               |
                  reclaimPolicy: Delete                                     reclaimPolicy: Retain
                        |                                                               |
                        v                                                               v
                 +------+------+                                                +------+------+
                 |             |                                                |             |
                 |  (deleted)  |  CSI DeleteVolume called,                      |  Released   |  Data preserved,
                 |             |  PV object removed,                            |             |  claimRef still set,
                 |             |  backend volume destroyed                      |             |  cannot be re-bound
                 +-------------+                                                +------+------+
                                                                                       |
                                                                                Admin removes
                                                                                claimRef manually
                                                                                       |
                                                                                       v
                                                                                +------+------+
                                                                                |             |
                                                                                |  Available  |  Can be bound
                                                                                |             |  to a new PVC
                                                                                +-------------+

  Note: A PV can also enter "Failed" state if the backend volume
  becomes inaccessible (e.g., Ceph cluster unreachable, RBD image
  corrupted). Failed PVs require manual intervention.

PVC Resizing: Online Volume Expansion for VM Disks

PVC resizing allows increasing the storage capacity of a bound volume without downtime. This is critical for VMs that accumulate data over time. The process involves two steps:

  1. Controller-side expansion: The external-resizer sidecar detects the PVC size change and calls ControllerExpandVolume. The CSI driver expands the volume on the backend (e.g., rbd resize for Ceph).
  2. Node-side expansion: After the backend volume is expanded, the kubelet calls NodeExpandVolume on the node plugin. The node plugin resizes the filesystem (if volumeMode is Filesystem) or notifies the block device layer (if volumeMode is Block).
# Expand a VM disk from 100Gi to 200Gi (edit the PVC)
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: vm-database-prod-01-data
  namespace: production-vms
spec:
  accessModes:
    - ReadWriteOnce
  volumeMode: Block
  storageClassName: ocs-storagecluster-ceph-rbd    # Must have allowVolumeExpansion: true
  resources:
    requests:
      storage: 200Gi    # Changed from 100Gi to 200Gi

The StorageClass must have allowVolumeExpansion: true for resizing to work. After the expansion completes on the host side, the guest OS inside the VM must still rescan the block device and grow its own partition and filesystem -- Kubernetes expands the volume only up to the device it presents to QEMU.

Shrinking volumes is not supported by CSI. If a volume needs to be smaller, the data must be migrated to a new, smaller volume. This matches VMware behavior where shrinking a VMDK in-place is also not supported.

Multi-Attach Considerations

Storage Type   Access Mode  Multi-Attach            Use Case                  Protocol
Ceph RBD       RWO          No (exclusive lock)     VM boot/data disks        RADOS/krbd
CephFS         RWX          Yes (POSIX filesystem)  Shared config, home dirs  CephFS kernel client
NFS            RWX          Yes (NFS protocol)      Legacy shared storage     NFS v3/v4
S2D VHDX       RWO          No (exclusive)          VM boot/data disks        SMB Direct (RDMA)
S2D SMB Share  RWX          Yes (SMB3 protocol)     Shared file storage       SMB 3.x

RWX for VM live migration vs RWX for multi-writer: These are different requirements. Live migration needs temporary dual-node access during migration (seconds). Multi-writer RWX means persistent concurrent access by multiple pods. KubeVirt handles live migration of RWO/RBD volumes via exclusive lock transfer, which is architecturally cleaner than requiring RWX for all VM disks.


3. StorageClasses

What StorageClasses Solve

Without StorageClasses, every PVC would need to be manually matched to a pre-created PV (static provisioning). This does not scale for 5,000+ VMs. StorageClasses enable dynamic provisioning -- the user requests storage by specifying a class name, and the system automatically provisions a volume with the right characteristics.

StorageClasses are the Kubernetes equivalent of VMware's Storage Policy Based Management (SPBM). In VMware, you define storage policies (e.g., "Gold: RAID-1 mirroring, SSD tier, no compression") and assign them to VMs. In Kubernetes, you define StorageClasses with provider-specific parameters and reference them in PVCs.

StorageClass Selection Flow
==============================

  User creates PVC               StorageClass Definition
  +--------------------+         +----------------------------------+
  | PVC: my-disk       |         | StorageClass: gold               |
  | storageClassName:  |-------->| provisioner: rbd.csi.ceph.com    |
  |   gold             |         | parameters:                      |
  | storage: 100Gi     |         |   pool: nvme-replicapool         |
  +--------------------+         |   imageFeatures: layering,...    |
                                 | reclaimPolicy: Retain            |
                                 | volumeBindingMode:               |
                                 |   WaitForFirstConsumer           |
                                 | allowVolumeExpansion: true       |
                                 +----------------------------------+
                                          |
                   +----------------------+
                   |
                   v
  external-provisioner:
  "PVC 'my-disk' references StorageClass 'gold'.
   I am the provisioner for rbd.csi.ceph.com.
   I will call CreateVolume with these parameters."
                   |
                   v
  CSI CreateVolume(
    name: "pvc-<uuid>",
    capacity: 100 GiB,
    parameters: {
      pool: "nvme-replicapool",
      imageFeatures: "layering,deep-flatten,exclusive-lock,object-map,fast-diff"
    }
  )
                   |
                   v
  Ceph: rbd create nvme-replicapool/csi-vol-<uuid> --size 100G
                   |
                   v
  PV created, bound to PVC

StorageClass Definition

A StorageClass has the following key fields:

provisioner -- the CSI driver that provisions volumes for this class (e.g., openshift-storage.rbd.csi.ceph.com, disk.csi.azure.com).
parameters -- driver-specific key-value pairs passed to CreateVolume (pool name, replication factor, features, encryption settings).
reclaimPolicy -- default reclaim policy for PVs created by this class: Delete (the default) or Retain.
volumeBindingMode -- when to bind and provision the volume: Immediate or WaitForFirstConsumer.
allowVolumeExpansion -- whether PVCs using this class can be resized: true or false.
mountOptions -- additional mount options for filesystem volumes (e.g., ["discard", "noatime"]).
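
A minimal class exercising each of these fields; the provisioner and pool names are placeholders, and mountOptions applies only to Filesystem-mode volumes:

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: example-tier
provisioner: rbd.csi.ceph.com            # Placeholder CSI driver name
parameters:
  pool: example-pool                     # Driver-specific; passed to CreateVolume
reclaimPolicy: Retain
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
mountOptions:
  - discard                              # Pass discards through (thin reclaim)
  - noatime                              # Skip access-time updates
```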

Volume Binding Modes

Immediate -- the volume is provisioned and bound as soon as the PVC is created, regardless of whether a pod exists to consume it. Suitable for simple environments without topology constraints; the volume may be provisioned in a zone/rack where no pod will ever run.
WaitForFirstConsumer -- provisioning is delayed until a pod referencing the PVC is scheduled. The scheduler determines which node the pod will run on, and the volume is provisioned in the same topology domain. Recommended for production: it ensures volumes land in the correct failure domain (rack, zone), which is essential for topology-aware storage such as Ceph with CRUSH rules or S2D with site-aware placement.
Volume Binding Modes: Immediate vs WaitForFirstConsumer
=========================================================

  Immediate Binding:
  +-------+       +---------+      +--------+     +--------+
  | PVC   | ----> | external| ---> | CSI    | --> | Volume |
  | create|       | provis- |      | Create |     | on     |
  | d     |       | ioner   |      | Volume |     | Rack A |
  +-------+       +---------+      +--------+     +--------+
                                                       |
  Later: Pod scheduled to Rack B                       |
  Problem: Volume is on Rack A,                        |
  pod is on Rack B. Cross-rack I/O                     |
  or scheduling failure.                               |

  WaitForFirstConsumer Binding:
  +-------+       +---------+
  | PVC   | ----> | external|
  | create|       | provis- |
  | d     |       | ioner   |   "PVC has WaitForFirstConsumer.
  +-------+       +---------+    I will wait."
                       |
  Later: Pod scheduled  |
  to Rack B             |
  +-------+             |
  | Pod   | ----------->|
  | sched |             |
  | Rack B|             v
  +-------+       +--------+     +--------+
                  | CSI    | --> | Volume |
                  | Create |     | on     |
                  | Volume |     | Rack B |  <-- Correct topology!
                  +--------+     +--------+

StorageClass Parameters for Ceph RBD (OVE / ODF)

# StorageClass: Gold tier (NVMe pool, 3-way replication)
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ocs-gold-nvme
  annotations:
    description: "NVMe-backed, 3-replica, for latency-sensitive production VMs"
provisioner: openshift-storage.rbd.csi.ceph.com
parameters:
  clusterID: openshift-storage
  pool: nvme-replicapool                         # Ceph pool backed by NVMe OSDs
  imageFormat: "2"                                # Always "2" (layering support)
  imageFeatures: "layering,deep-flatten,exclusive-lock,object-map,fast-diff"
  # imageFeatures explained:
  #   layering       - COW cloning support (instant VM provisioning from templates)
  #   deep-flatten   - Required for snapshot deletion without affecting clones
  #   exclusive-lock - Single-writer guarantee (KubeVirt live migration lock transfer)
  #   object-map     - Tracks which 4 MiB objects are allocated (speeds up diff/export)
  #   fast-diff      - Uses object-map for efficient delta calculation (faster snapshots)
  csi.storage.k8s.io/provisioner-secret-name: rook-csi-rbd-provisioner
  csi.storage.k8s.io/provisioner-secret-namespace: openshift-storage
  csi.storage.k8s.io/controller-expand-secret-name: rook-csi-rbd-provisioner
  csi.storage.k8s.io/controller-expand-secret-namespace: openshift-storage
  csi.storage.k8s.io/node-stage-secret-name: rook-csi-rbd-node
  csi.storage.k8s.io/node-stage-secret-namespace: openshift-storage
  csi.storage.k8s.io/fstype: ext4                # Only used if volumeMode: Filesystem
reclaimPolicy: Retain                            # Production: never auto-delete
volumeBindingMode: WaitForFirstConsumer          # Topology-aware placement
allowVolumeExpansion: true
# StorageClass: Silver tier (SSD pool, 3-way replication)
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ocs-silver-ssd
  annotations:
    description: "SSD-backed, 3-replica, for general production workloads"
provisioner: openshift-storage.rbd.csi.ceph.com
parameters:
  clusterID: openshift-storage
  pool: ssd-replicapool                          # Ceph pool on SSD-class OSDs
  imageFormat: "2"
  imageFeatures: "layering,deep-flatten,exclusive-lock,object-map,fast-diff"
  csi.storage.k8s.io/provisioner-secret-name: rook-csi-rbd-provisioner
  csi.storage.k8s.io/provisioner-secret-namespace: openshift-storage
  csi.storage.k8s.io/controller-expand-secret-name: rook-csi-rbd-provisioner
  csi.storage.k8s.io/controller-expand-secret-namespace: openshift-storage
  csi.storage.k8s.io/node-stage-secret-name: rook-csi-rbd-node
  csi.storage.k8s.io/node-stage-secret-namespace: openshift-storage
  csi.storage.k8s.io/fstype: ext4
reclaimPolicy: Retain
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
# StorageClass: Bronze tier (HDD pool, erasure-coded for capacity)
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ocs-bronze-ec
  annotations:
    description: "HDD-backed, erasure-coded (4+2), for capacity-heavy cold workloads"
provisioner: openshift-storage.rbd.csi.ceph.com
parameters:
  clusterID: openshift-storage
  pool: hdd-metadata-replicapool                 # Replicated metadata pool
  dataPool: hdd-ec-datapool                      # EC data pool
  # Note: RBD on erasure-coded pools requires a replicated metadata pool
  # and an EC data pool. The "pool" parameter points to the replicated
  # metadata pool, "dataPool" to the EC pool.
  imageFormat: "2"
  imageFeatures: "layering,exclusive-lock,object-map,fast-diff"
  # Note: deep-flatten is not supported on EC pools
  csi.storage.k8s.io/provisioner-secret-name: rook-csi-rbd-provisioner
  csi.storage.k8s.io/provisioner-secret-namespace: openshift-storage
  csi.storage.k8s.io/controller-expand-secret-name: rook-csi-rbd-provisioner
  csi.storage.k8s.io/controller-expand-secret-namespace: openshift-storage
  csi.storage.k8s.io/node-stage-secret-name: rook-csi-rbd-node
  csi.storage.k8s.io/node-stage-secret-namespace: openshift-storage
  csi.storage.k8s.io/fstype: ext4
reclaimPolicy: Delete                            # Cold data: auto-cleanup acceptable
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
# StorageClass: CephFS (shared filesystem, RWX)
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ocs-shared-cephfs
  annotations:
    description: "CephFS, 3-replica, ReadWriteMany for shared filesystem access"
provisioner: openshift-storage.cephfs.csi.ceph.com
parameters:
  clusterID: openshift-storage
  fsName: ocs-storagecluster-cephfilesystem      # CephFS filesystem name
  pool: ocs-storagecluster-cephfilesystem-data0  # CephFS data pool
  csi.storage.k8s.io/provisioner-secret-name: rook-csi-cephfs-provisioner
  csi.storage.k8s.io/provisioner-secret-namespace: openshift-storage
  csi.storage.k8s.io/controller-expand-secret-name: rook-csi-cephfs-provisioner
  csi.storage.k8s.io/controller-expand-secret-namespace: openshift-storage
  csi.storage.k8s.io/node-stage-secret-name: rook-csi-cephfs-node
  csi.storage.k8s.io/node-stage-secret-namespace: openshift-storage
reclaimPolicy: Retain
volumeBindingMode: Immediate                     # CephFS is accessible from all nodes
allowVolumeExpansion: true
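
A PVC consuming this class requests ReadWriteMany access. A minimal sketch (the claim name, namespace, and size are illustrative, not taken from the platform):

```yaml
# PVC requesting a shared RWX volume from the CephFS class above
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: shared-config            # illustrative name
  namespace: production-vms      # illustrative namespace
spec:
  accessModes:
    - ReadWriteMany              # RWX: mountable by multiple pods/VMs at once
  volumeMode: Filesystem         # CephFS is filesystem-mode only
  storageClassName: ocs-shared-cephfs
  resources:
    requests:
      storage: 100Gi
```

Any pod or VM in the namespace can then mount the same claim concurrently, which is what replaces classic NFS/SMB shared-folder use cases.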

StorageClass Parameters for S2D (Azure Local)

Azure Local uses CSI drivers that map to S2D storage pools and volume tiers. The parameters differ fundamentally from Ceph because S2D uses a different storage architecture (cache tier + capacity tier, ReFS, CSVs).

# StorageClass: Azure Local Premium (all-NVMe tier)
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: azurelocal-premium
  annotations:
    description: "S2D all-NVMe, 3-way mirror, for latency-sensitive VMs"
provisioner: disk.csi.azure.com
parameters:
  storagePool: "S2D-NVMe-Pool"                  # S2D storage pool name
  resiliencySettingName: "Mirror"                 # Mirror (2-way, 3-way) or Parity
  numberOfCopies: "3"                            # 3-way mirror for production
  # S2D determines cache behavior automatically based on device classes:
  # - All-NVMe: no cache tier (all devices serve as capacity)
  # - NVMe + SSD: NVMe acts as cache for SSD capacity
  # - NVMe + HDD: NVMe acts as cache for HDD capacity
reclaimPolicy: Retain
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
---
# StorageClass: Azure Local Standard (SSD with NVMe cache)
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: azurelocal-standard
  annotations:
    description: "S2D SSD capacity with NVMe cache, 3-way mirror"
provisioner: disk.csi.azure.com
parameters:
  storagePool: "S2D-SSD-Pool"
  resiliencySettingName: "Mirror"
  numberOfCopies: "3"
reclaimPolicy: Retain
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
---
# StorageClass: Azure Local Capacity (MAP for cost efficiency)
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: azurelocal-capacity
  annotations:
    description: "S2D mirror-accelerated parity, for capacity-heavy cold workloads"
provisioner: disk.csi.azure.com
parameters:
  storagePool: "S2D-Capacity-Pool"
  resiliencySettingName: "Parity"                # Or "Mirror" with MAP tiering
  numberOfCopies: "1"                            # Parity provides redundancy differently
  # MAP parameters (mirror-accelerated parity):
  # Hot data sits in the mirror tier (fast writes)
  # Cold data automatically tiers to parity (space efficient)
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true

Default StorageClass Behavior

One StorageClass can be marked as the default using the annotation storageclass.kubernetes.io/is-default-class: "true". When a PVC does not specify a storageClassName, the default StorageClass is used. If no default is set and no class is specified, the PVC remains Pending (dynamic provisioning will not occur).

For 5,000+ VMs, the default StorageClass should be the most commonly used tier (typically the "silver" equivalent). Setting the default to the most expensive tier (gold/NVMe) risks accidental overprovisioning of premium storage.

# Setting the default StorageClass
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ocs-silver-ssd
  annotations:
    storageclass.kubernetes.io/is-default-class: "true"    # <-- This is the default
provisioner: openshift-storage.rbd.csi.ceph.com
# ... parameters ...
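
A PVC that omits storageClassName then binds through the default class. A minimal sketch (the claim name, namespace, and size are illustrative):

```yaml
# PVC with no storageClassName -- the default class (ocs-silver-ssd) is used
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: vm-appserver-01-data     # illustrative name
  namespace: production-vms      # illustrative namespace
spec:
  accessModes:
    - ReadWriteOnce
  volumeMode: Block              # raw block device for a VM disk
  resources:
    requests:
      storage: 150Gi
  # storageClassName intentionally omitted: the admission controller
  # fills in the cluster's default StorageClass at creation time
```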

Multiple defaults: If more than one StorageClass is marked as default, behavior depends on the Kubernetes version: older releases reject PVCs that omit a class, while Kubernetes 1.28+ uses the most recently created default. Either way, multiple defaults are a common misconfiguration during initial setup. Enforce a single default via admission webhooks or policy (e.g., OPA/Gatekeeper).

Mapping VMware SPBM to Kubernetes StorageClasses

VMware's Storage Policy Based Management (SPBM) defines storage capabilities through rules. Each rule specifies a set of capabilities (replication factor, stripe width, disk type, QoS limits) that the datastore (vSAN, VMFS, etc.) must satisfy. VMs are assigned policies, and vCenter ensures the underlying storage matches.

The conceptual mapping to Kubernetes StorageClasses is direct, but the implementation mechanism differs:

SPBM to StorageClass Mapping
===============================

VMware SPBM Policy                    Kubernetes StorageClass
+-----------------------------------+ +-----------------------------------+
| Policy: "Gold-Production"        | | StorageClass: "ocs-gold-nvme"    |
|                                   | |                                   |
| Rules:                            | | Parameters:                       |
|   - VSAN.hostFailuresToTolerate=2 | |   pool: nvme-replicapool         |
|     (3 replicas)                  | |   (Ceph pool with RF=3 via       |
|   - VSAN.stripeWidth=2            | |    CRUSH rule, min_size=2)       |
|   - VSAN.forceProvisioning=false  | |   imageFeatures: layering,...    |
|                                   | |                                   |
| Capability Profile:               | | volumeBindingMode:                |
|   - Storage Type: All Flash       | |   WaitForFirstConsumer           |
|   - Encryption: Required          | |                                   |
|   - QoS IOPS Limit: 10000        | | reclaimPolicy: Retain            |
|                                   | | allowVolumeExpansion: true       |
+-----------------------------------+ +-----------------------------------+

VMware SPBM Policy                    Kubernetes StorageClass
+-----------------------------------+ +-----------------------------------+
| Policy: "Silver-Standard"        | | StorageClass: "ocs-silver-ssd"   |
|                                   | |                                   |
| Rules:                            | | Parameters:                       |
|   - VSAN.hostFailuresToTolerate=1 | |   pool: ssd-replicapool          |
|     (2 replicas)                  | |   (Ceph pool with RF=3 or RF=2   |
|   - VSAN.stripeWidth=1            | |    via CRUSH rule)               |
|   - VSAN.forceProvisioning=false  | |                                   |
|                                   | |                                   |
| Capability Profile:               | | reclaimPolicy: Retain            |
|   - Storage Type: Hybrid          | | allowVolumeExpansion: true       |
|   - Encryption: Optional          | |                                   |
+-----------------------------------+ +-----------------------------------+

VMware SPBM Policy                    Kubernetes StorageClass
+-----------------------------------+ +-----------------------------------+
| Policy: "Bronze-Archive"         | | StorageClass: "ocs-bronze-ec"    |
|                                   | |                                   |
| Rules:                            | | Parameters:                       |
|   - VSAN.hostFailuresToTolerate=1 | |   pool: hdd-ec-metadatapool      |
|   - VSAN.stripeWidth=1            | |   dataPool: hdd-ec-datapool      |
|   - VSAN.forceProvisioning=true   | |   (EC 4+2 for capacity           |
|                                   | |    efficiency)                    |
| Capability Profile:               | |                                   |
|   - Storage Type: Magnetic        | | reclaimPolicy: Delete            |
|   - IOPS Limit: 500              | | allowVolumeExpansion: true       |
+-----------------------------------+ +-----------------------------------+

Key differences between SPBM and StorageClasses:

  QoS enforcement
    SPBM: Can set IOPS limits directly in the policy; vSAN SIOC enforces
      them at the datastore level.
    StorageClass: No native QoS support. IOPS limits must be enforced via
      Ceph QoS (rbd_qos_iops_limit), resource quotas, or external tools.

  Compliance checking
    SPBM: vSAN continuously monitors whether VMs comply with their assigned
      policy; non-compliant VMs are flagged in vCenter.
    StorageClass: No built-in compliance monitoring. Custom tooling
      (Prometheus alerts, OPA policies) is needed to detect mismatches.

  Policy reassignment
    SPBM: A VM's storage policy can be changed in-place; vSAN rebalances the
      data to comply with the new policy (e.g., from 2 replicas to 3).
    StorageClass: Changing a PVC's StorageClass is not supported. The volume
      must be migrated (snapshot + restore to a new PVC with the new class).

  Encryption
    SPBM: Encryption is a policy rule; enable it per-policy and vSAN
      encrypts transparently.
    StorageClass: Encryption is a Ceph pool-level or OSD-level setting
      (dm-crypt/LUKS), not a StorageClass parameter. Per-tier encryption
      requires separate encrypted pools and StorageClasses.

  Thin vs thick provisioning
    SPBM: Supports both thin and thick provisioning as a policy attribute.
    StorageClass: CSI provisioning is thin by default (Ceph and S2D both
      thin-provision); thick provisioning requires explicit configuration.
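
The policy-reassignment gap is typically worked around with a snapshot-and-restore migration between tiers. A hedged sketch, assuming the gold/silver classes above and an RBD VolumeSnapshotClass named ocs-rbd-snapclass exist (all resource names are illustrative):

```yaml
# Step 1: snapshot the existing PVC (currently on the silver class)
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: vm-appserver-01-tier-migration   # illustrative name
  namespace: production-vms
spec:
  volumeSnapshotClassName: ocs-rbd-snapclass
  source:
    persistentVolumeClaimName: vm-appserver-01-data
---
# Step 2: restore into a new PVC on the target (gold) class
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: vm-appserver-01-data-gold        # illustrative name
  namespace: production-vms
spec:
  storageClassName: ocs-gold-nvme        # the new tier
  accessModes:
    - ReadWriteOnce
  volumeMode: Block
  dataSource:
    apiGroup: snapshot.storage.k8s.io
    kind: VolumeSnapshot
    name: vm-appserver-01-tier-migration
  resources:
    requests:
      storage: 150Gi                     # must be >= the snapshot's restoreSize
```

Once the restored PVC is Bound, the VM definition is repointed at the new claim and the old PVC is retired according to its reclaim policy.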

StorageClass Hierarchy for 5,000+ VMs

For a Tier-1 financial enterprise migrating 5,000+ VMs, the following StorageClass hierarchy maps the existing VMware tiering to Kubernetes:

StorageClass Hierarchy Design
================================

  Tier     StorageClass Name         Backend (OVE)              Backend (Azure Local)
  -----    ----------------------    -------------------------  -------------------------
  Gold     ocs-gold-nvme             Ceph pool: nvme-rpl        S2D: all-NVMe, 3-way mirror
           (VM block, RWO)           RF=3, NVMe OSDs            Premium tier
           Retain, WaitForFirst      CRUSH: host failure domain
           Expansion: yes
           Use: Databases, OLTP,
           latency-sensitive

  Silver   ocs-silver-ssd (DEFAULT)  Ceph pool: ssd-rpl         S2D: SSD + NVMe cache,
           (VM block, RWO)           RF=3, SSD OSDs             3-way mirror
           Retain, WaitForFirst      CRUSH: host failure domain Standard tier
           Expansion: yes
           Use: General prod VMs,
           app servers, web tier

  Bronze   ocs-bronze-ec             Ceph pool: hdd-ec          S2D: SSD/HDD, MAP
           (VM block, RWO)           EC 4+2, HDD/SSD OSDs      (mirror-accelerated parity)
           Delete, WaitForFirst      CRUSH: host failure domain Capacity tier
           Expansion: yes
           Use: Dev/test VMs,
           cold data, archives

  Archive  ocs-archive-compressed    Ceph pool: hdd-ec-comp     (not available -- S2D
           (VM block, RWO)           EC 8+3, HDD OSDs          has no inline compression)
           Delete, WaitForFirst      Compression: zstd
           Expansion: yes            CRUSH: host failure domain
           Use: Compliance logs,
           rarely accessed data

  Shared   ocs-shared-cephfs         CephFS: ocs-cephfs         S2D: SMB share on CSV
           (filesystem, RWX)         RF=3, SSD data pool        Standard tier, SMB3
           Retain, Immediate         MDS active/standby
           Expansion: yes
           Use: Shared config,
           NFS/SMB replacement

  VM-Images  ocs-template-rbd        Ceph pool: ssd-rpl         S2D: SSD + NVMe cache
           (VM block, RWO)           RF=3, SSD OSDs             Standard tier
           Delete, Immediate         Used for golden images
           Expansion: no             + layering/cloning
           Use: OS templates,
           golden VM images

  Capacity planning (OVE / ODF example):
  +--------------------+---------+----------+---------+
  | Tier               | VM Count| Avg Size | Total   |
  +--------------------+---------+----------+---------+
  | Gold (NVMe, RF=3)  |     500 |   200 Gi |  100 Ti |
  | Silver (SSD, RF=3) |   3,000 |   150 Gi |  450 Ti |
  | Bronze (HDD, EC)   |   1,000 |   300 Gi |  300 Ti |
  | Archive (HDD, EC)  |     300 |   500 Gi |  150 Ti |
  | Shared (CephFS)    |      -- |     5 Ti |    5 Ti |
  | Templates          |      20 |    50 Gi |    1 Ti |
  +--------------------+---------+----------+---------+
  | Usable total       |   4,820 |          |~1,006 Ti|
  +--------------------+---------+----------+---------+
  | Raw required (RF=3)|         |          |~1,668 Ti|  (Gold+Silver+Shared+Templates: 556 Ti x 3)
  | Raw required (EC)  |         |          |  ~656 Ti|  (Bronze 4+2: 1.5x, Archive 8+3: 1.375x)
  | Raw grand total    |         |          |~2,324 Ti|
  +--------------------+---------+----------+---------+

VolumeSnapshot Example

VolumeSnapshots provide CSI-native, point-in-time snapshots that can be used for backup, cloning, and disaster recovery. They are the Kubernetes analog of the VM snapshots that VMware's VADP-based backup tools consume.

# VolumeSnapshotClass (defines the snapshot provider)
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotClass
metadata:
  name: ocs-rbd-snapclass
  annotations:
    snapshot.storage.kubernetes.io/is-default-class: "true"
driver: openshift-storage.rbd.csi.ceph.com
parameters:
  clusterID: openshift-storage
  csi.storage.k8s.io/snapshotter-secret-name: rook-csi-rbd-provisioner
  csi.storage.k8s.io/snapshotter-secret-namespace: openshift-storage
deletionPolicy: Delete                          # Delete snapshot when VolumeSnapshot CR is deleted
---
# VolumeSnapshot (take a snapshot of a VM disk)
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: vm-database-prod-snapshot-20260428
  namespace: production-vms
  labels:
    app: database
    backup-schedule: daily
spec:
  volumeSnapshotClassName: ocs-rbd-snapclass
  source:
    persistentVolumeClaimName: vm-database-prod-01-data
---
# VolumeSnapshotContent (created automatically by external-snapshotter)
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotContent
metadata:
  name: snapcontent-3f8e9a12-7b4c-4d5e-a1f0-9c8d7e6b5a43
spec:
  deletionPolicy: Delete
  driver: openshift-storage.rbd.csi.ceph.com
  source:
    volumeHandle: "0001-0024-openshift-storage-0000000000000001-3f8e9a12"
  volumeSnapshotRef:
    name: vm-database-prod-snapshot-20260428
    namespace: production-vms
  volumeSnapshotClassName: ocs-rbd-snapclass
status:
  snapshotHandle: "0001-0024-openshift-storage-snap-3f8e9a12"
  creationTime: 1777334400000000000                # nanoseconds since epoch (2026-04-28T00:00:00Z)
  readyToUse: true
  restoreSize: 107374182400                        # 100 GiB in bytes

The snapshot flow:

VolumeSnapshot Lifecycle
==========================

  1. User creates VolumeSnapshot CR
     (references PVC name + VolumeSnapshotClass)
          |
          v
  2. external-snapshotter sidecar detects VolumeSnapshot
          |
          v
  3. external-snapshotter calls CSI CreateSnapshot
     CreateSnapshot(source_volume_id, name, parameters)
          |
          v
  4. CSI driver creates snapshot on backend
     Ceph: rbd snap create replicapool/csi-vol-<uuid>@snap-<uuid>
     S2D:  VSS snapshot of VHDX
          |
          v
  5. CSI driver returns CreateSnapshotResponse
     (snapshot_id, creation_time, ready_to_use)
          |
          v
  6. external-snapshotter creates VolumeSnapshotContent CR
     (cluster-scoped, stores backend snapshot details)
          |
          v
  7. VolumeSnapshot status updated:
     status.readyToUse: true
     status.restoreSize: 100Gi

  To restore: Create a new PVC with dataSource referencing
  the VolumeSnapshot (see PVC example above).

How the Candidates Handle This

  Storage abstraction
    VMware (VMFS/vSAN): VMDK on Datastore
    OVE (CSI/ODF): PVC backed by CSI volume (RBD image in Ceph pool)
    Azure Local (CSI/S2D): PVC backed by CSI volume (VHDX on S2D CSV)
    Swisscom ESC: VM disk (managed by Swisscom, opaque to customer)

  Plugin model
    VMware: In-tree VMFS/vSAN/NFS drivers (ESXi built-in)
    OVE: CSI out-of-tree (ceph-csi via Rook operator)
    Azure Local: CSI out-of-tree (disk.csi.azure.com, smb.csi.k8s.io,
      csi-proxy)
    Swisscom ESC: N/A (fully managed)

  Storage policy / tiering
    VMware: SPBM policies (GUI + API, compliance monitoring, in-place
      reassignment)
    OVE: StorageClasses (YAML manifests, no built-in compliance monitoring,
      no in-place reassignment)
    Azure Local: StorageClasses (YAML manifests, Azure Arc integration for
      monitoring)
    Swisscom ESC: SLA tiers (contractual, no customer control over placement)

  Dynamic provisioning
    VMware: vSAN creates the VMDK on demand based on policy
    OVE: CSI external-provisioner calls CreateVolume; ceph-csi creates an
      RBD image
    Azure Local: CSI external-provisioner calls CreateVolume; disk.csi
      creates a VHDX
    Swisscom ESC: Swisscom provisions on request (ticket/API)

  Snapshot mechanism
    VMware: VADP (vStorage APIs for Data Protection), VMDK snapshots
      (redo logs)
    OVE: VolumeSnapshot CRD, CSI CreateSnapshot, RBD COW snapshots
      (instant, space-efficient)
    Azure Local: VolumeSnapshot CRD, CSI CreateSnapshot, VSS-based VHDX
      checkpoints
    Swisscom ESC: Swisscom-managed snapshots (SAN-level)

  Cloning
    VMware: vSphere API clone (full or linked clone)
    OVE: PVC with dataSource (CSI volume cloning, RBD layering for COW clones)
    Azure Local: PVC with dataSource (VHDX copy or ReFS block cloning)
    Swisscom ESC: Not available to customer

  Volume expansion
    VMware: Edit VMDK size in vCenter (online; guest must extend FS)
    OVE: Edit PVC size (CSI ControllerExpandVolume + NodeExpandVolume;
      guest must extend FS for block mode)
    Azure Local: Edit PVC size (CSI expansion; guest must extend FS)
    Swisscom ESC: Request increase from Swisscom

  Access modes
    VMware: VMDK: single VM (exclusive); VMFS: multi-VM read; NFS: shared;
      multi-writer VMDK for clustering (SCSI reservations)
    OVE: RWO (RBD, standard VM disks); RWX (CephFS, shared FS); RWOP
      (strict single-pod); live migration via lock transfer
    Azure Local: RWO (VHDX); RWX (SMB share); shared VHDX for guest
      clustering (limited)
    Swisscom ESC: Single-VM access (standard); shared storage via NFS/SMB
      (managed)

  Volume mode
    VMware: Always virtual disk (VMDK wraps a block device with metadata)
    OVE: Block (preferred for VMs, raw device to QEMU) or Filesystem
      (image file on host FS)
    Azure Local: Filesystem (VHDX on ReFS/CSV, standard Hyper-V model)
    Swisscom ESC: N/A (managed)

  Topology awareness
    VMware: vSAN fault domains, SPBM rack-aware policies
    OVE: CSI topology keys + WaitForFirstConsumer + CRUSH rules (rack,
      zone, site)
    Azure Local: CSI topology keys + WaitForFirstConsumer + S2D fault
      domains (node, chassis, rack, site)
    Swisscom ESC: N/A (managed)

  QoS / IOPS limits
    VMware: SIOC (Storage I/O Control) per-VM IOPS/throughput limits
    OVE: Not built into CSI/StorageClass; requires Ceph rbd_qos_* settings
      per image or cgroup I/O limits
    Azure Local: Not built into CSI/StorageClass; requires Hyper-V QoS
      policies or S2D bandwidth reservation
    Swisscom ESC: SLA-based (contractual IOPS guarantee per tier)

  Encryption at rest
    VMware: vSAN encryption (per-policy, KMS integration)
    OVE: Ceph OSD-level encryption (dm-crypt/LUKS, KMIP); per-pool, not
      per-StorageClass
    Azure Local: BitLocker per CSV volume (TPM-backed); per-volume, not
      per-StorageClass
    Swisscom ESC: Swisscom-managed (assumed encrypted)

  Multi-cluster storage
    VMware: vSAN stretched cluster, cross-vCenter
    OVE: Ceph external mode (single Ceph cluster serves multiple OCP
      clusters, single CSI config)
    Azure Local: Each S2D cluster is independent; no cross-cluster storage
      sharing
    Swisscom ESC: Shared SAN backend (multi-tenant)

  Operational tooling
    VMware: vCenter GUI, PowerCLI, ESXCLI
    OVE: kubectl, oc, ceph CLI, ODF Console Plugin, Rook CRDs, Prometheus
      metrics
    Azure Local: kubectl, PowerShell, Windows Admin Center, Azure Portal,
      Azure Arc
    Swisscom ESC: Swisscom portal

  Migration from VMware
    VMware: N/A (source)
    OVE: MTV (Migration Toolkit for Virtualization) converts VMDK to
      RBD-backed PVC
    Azure Local: Azure Migrate converts VMDK to VHDX-backed PVC
    Swisscom ESC: Swisscom handles migration (P2V/V2V)

  Backup integration
    VMware: VADP-aware: Veeam, Commvault, Dell Avamar, Cohesity
    OVE: CSI snapshot + Kasten K10, Veeam Kasten, Trilio; VolumeSnapshot is
      the backup API
    Azure Local: CSI snapshot + Azure Backup, Veeam; VSS integration for
      app-consistent snapshots
    Swisscom ESC: Swisscom-managed backup

  Maturity for VMs
    VMware: 20+ years (VMware pioneered virtual disk management)
    OVE: ~4 years (KubeVirt CSI integration GA since OCP 4.10+, rapidly
      maturing)
    Azure Local: ~3 years (Azure Local CSI for AKS hybrid, actively evolving)
    Swisscom ESC: Mature (traditional IaaS model)

Key Takeaways

  1. CSI is the universal storage interface -- but the devil is in the driver. CSI standardizes the control-plane API between Kubernetes and storage. However, the quality, feature completeness, and maturity of the CSI driver varies significantly between platforms. ceph-csi is battle-tested in production (part of the CNCF ecosystem, deployed at scale by many organizations). Azure Local's CSI drivers are newer and less proven at scale. During the PoC, test every CSI feature the organization needs: provisioning, snapshots, cloning, expansion, raw block mode. Do not assume feature parity between drivers.

  2. Block mode is non-negotiable for VM performance. KubeVirt VMs using Filesystem volume mode pay a measurable performance penalty (double filesystem overhead, fragmentation, alignment issues). The PoC must validate that all VM disks use volumeMode: Block and that the CSI driver and storage backend handle block volumes correctly. This is standard for ceph-csi/RBD but needs validation for Azure Local's CSI implementation.

  3. StorageClasses replace SPBM -- but with gaps. StorageClasses provide dynamic provisioning and tiering similar to SPBM. However, SPBM offers in-place policy reassignment, continuous compliance monitoring, and integrated QoS enforcement that StorageClasses do not. The organization must build supplementary tooling (Prometheus alerts, OPA/Gatekeeper policies, custom controllers) to close these gaps. This is an operational cost unique to the Kubernetes model.

  4. WaitForFirstConsumer is mandatory for production. Using Immediate binding mode in a multi-rack environment will cause topology mismatches -- volumes provisioned in rack A while VMs run in rack B. Set all production StorageClasses to WaitForFirstConsumer. This is a day-1 configuration decision that is difficult to change retroactively (existing volumes cannot change their topology).

  5. Reclaim policies require an organizational decision. For a Tier-1 financial institution, Retain should be the default for production StorageClasses. Accidental PVC deletion must not cause immediate data loss. However, Retain creates a management burden -- orphaned PVs accumulate and consume storage. An operational process for reviewing and cleaning up Released PVs is required. Automate this with a custom controller or scheduled job.

  6. Volume expansion works -- but shrinking does not. PVC resizing (growth) is supported by both ceph-csi and Azure Local CSI drivers. But volume shrinking is not supported by CSI. Over-provisioned volumes cannot be reclaimed. For 5,000+ VMs, right-sizing volumes at creation (or using thin provisioning with monitoring) is critical to avoid storage waste.

  7. Live migration changes the access mode equation. In VMware, live migration (vMotion) is transparent -- the VMDK stays on shared storage, only compute moves. In KubeVirt, live migration of RBD-backed (RWO) volumes requires the exclusive-lock feature and a lock-transfer mechanism. This works but adds complexity. CephFS (RWX) simplifies migration but adds filesystem overhead. The PoC should test live migration under load with both RBD and CephFS to determine the preferred approach.

  8. The snapshot model is fundamentally different. VMware VMDK snapshots use redo logs (delta disks) that can cause performance degradation when stacked. Ceph RBD snapshots are COW at the object level -- instant, space-efficient, and do not degrade read performance. This is an architectural improvement for backup workflows. However, the integration with enterprise backup tools (Veeam, Commvault) via VolumeSnapshot CRDs is newer and less mature than VADP integration. Validate backup tool compatibility during the PoC.

  9. QoS is the biggest gap versus VMware. VMware's SIOC provides per-VM IOPS and throughput limits enforceable at the datastore level. Neither CSI nor StorageClasses offer built-in QoS. For a shared platform hosting 5,000+ VMs from different business units, noisy-neighbor isolation is critical. The organization must implement QoS through alternative mechanisms: Ceph per-image QoS (rbd_qos_iops_limit), cgroup I/O controllers, or platform-level resource quotas. This gap must be addressed before production.

  10. The migration path determines the StorageClass design. When migrating from VMware, each VM's current SPBM policy must map to a StorageClass. Document every SPBM policy in use today, its parameters (replication, encryption, QoS, disk tier), and the number of VMs assigned to it. This inventory drives the StorageClass hierarchy design. Do not design StorageClasses in the abstract -- derive them from the actual VMware policy landscape.


Discussion Guide

The following questions are designed for vendor deep-dives, PoC planning, and internal architecture reviews. They focus on the Kubernetes storage model specifics that affect VM operations at scale.

Questions for OVE / ODF (Red Hat)

  1. CSI driver maturity and feature coverage: "List every CSI feature that ceph-csi supports as GA (not tech preview) in the ODF version you are proposing. Specifically: volume snapshots, volume cloning, online volume expansion, raw block volumes, topology-aware provisioning, volume health monitoring, and volume group snapshots. For any feature that is tech preview or unsupported, what is the GA timeline?"

  2. Block mode performance validation: "Show benchmark data comparing volumeMode: Block vs volumeMode: Filesystem for a KubeVirt VM running a 4K random write workload on Ceph RBD. We expect 10-20% better IOPS with block mode. Confirm this and explain the overhead sources in filesystem mode (double journaling, metadata updates, alignment)."

  3. StorageClass design review: "We plan to implement four tiers (gold/NVMe, silver/SSD, bronze/EC, archive/EC-compressed) plus CephFS for shared volumes. Review our StorageClass YAML definitions and Ceph pool configuration. Are the CRUSH rules correct for our rack topology? Are the imageFeatures flags optimal? Should we use different PG counts per pool for our expected capacity?"

  4. Live migration with RBD exclusive lock: "Walk us through the exact sequence of events when a KubeVirt VM with an RBD-backed (RWO, Block) volume live-migrates from node A to node B. How is the exclusive lock transferred? What is the blackout window (if any) where I/O is paused? What happens if the source node crashes during lock transfer? How does this compare to vMotion in terms of migration downtime?"

  5. QoS enforcement for multi-tenant VMs: "We need per-VM IOPS limits equivalent to VMware SIOC. Ceph supports rbd_qos_iops_limit and rbd_qos_bps_limit per image. Can these be set via StorageClass parameters, or do they require post-provisioning configuration? Can they be changed dynamically without VM restart? What is the enforcement granularity (strict rate limiting vs best-effort)?"

Questions for Azure Local / S2D (Microsoft)

  1. CSI driver feature parity with ceph-csi: "Which CSI features does the Azure Local CSI driver support as GA? Specifically: raw block volumes (volumeMode: Block), volume snapshots (VolumeSnapshot CRD), volume cloning (PVC dataSource), online volume expansion, and topology-aware provisioning. For KubeVirt VMs on Azure Local, is block mode supported and recommended?"

  2. CSI-proxy reliability for Windows nodes: "csi-proxy is a critical dependency for CSI operations on Windows nodes. What is csi-proxy's failure model? If csi-proxy crashes, can existing VMs continue running (data-plane unaffected) or does the failure propagate? How is csi-proxy upgraded (rolling, requires node drain)? Is there a Linux-node alternative where AKS hybrid runs Linux workers instead of Windows?"

  3. StorageClass mapping from SPBM for S2D: "We currently have VMware SPBM policies mapping to three tiers. Design equivalent StorageClasses for Azure Local / S2D that match our performance and redundancy requirements: (a) gold: 3-way mirror on NVMe, (b) silver: 3-way mirror on SSD with NVMe cache, (c) bronze: MAP for capacity. Show the StorageClass YAML with S2D-specific parameters."

  4. Multi-cluster StorageClass consistency: "With 2-3 S2D clusters needed for 5,000+ VMs, how do we maintain consistent StorageClass definitions across clusters? Is there a federated StorageClass mechanism, or must each cluster be configured independently? How does Azure Arc handle cross-cluster storage policy governance?"

  5. Snapshot and backup integration via CSI: "Demonstrate the VolumeSnapshot workflow for a Hyper-V VM on Azure Local: create snapshot, verify consistency (VSS-aware?), restore to a new PVC, attach to a new VM. How does this integrate with Azure Backup and Veeam? Is the CSI snapshot app-consistent or crash-consistent?"

Questions for Swisscom ESC

  1. Storage abstraction transparency: "Since Swisscom ESC is a managed IaaS, the Kubernetes storage model (CSI, PV/PVC, StorageClasses) may not apply. How does VM storage provisioning work? Is there a Kubernetes-compatible interface, or is it purely vSphere/PowerMax-based? If we need to run containerized workloads alongside VMs, what storage model do they consume?"

Cross-Platform / Internal Architecture Questions

  1. SPBM-to-StorageClass migration inventory: "Before designing StorageClasses, we need a complete inventory of current VMware SPBM policies. For each policy: (a) policy name and parameters, (b) number of VMs assigned, (c) total capacity consumed, (d) performance profile (IOPS, latency from vRealize/Aria). This inventory is the input for StorageClass hierarchy design. Who owns this data extraction?"

  2. CSI driver failure impact analysis: "For each candidate, document the failure impact of: (a) CSI controller pod crash (no new provisioning, existing volumes unaffected), (b) CSI node pod crash (no new mounts, existing mounts stable), (c) CSI driver version mismatch after partial upgrade, (d) loss of CSI credentials (Secret deleted). For each scenario, what is the blast radius and recovery procedure?"

  3. Volume lifecycle automation: "Design a GitOps-driven workflow for the full volume lifecycle: (a) PVC creation via Helm chart or Kustomize, (b) StorageClass selection via namespace-level defaults, (c) snapshot scheduling via a CronJob or VolumeSnapshotSchedule controller, (d) PVC expansion via policy-driven automation, (e) orphaned PV cleanup via a garbage collection controller. Which components exist as open-source projects and which require custom development?"


Previous: 05-sds-platforms.md -- Software-Defined Storage Platforms (Ceph/ODF, S2D) Next: 07-data-protection.md -- Data Protection and Operations (Snapshots, DR, Encryption, Backup)