Modern datacenters and beyond

Data Protection & Operations

Why This Matters

The previous pages built the storage stack from the ground up: foundational concepts (01), the VMware baseline (02), protocols (03), architectures (04), SDS platforms (05), and the Kubernetes storage model (06). This page addresses the question that keeps CIOs and risk officers awake at night: what happens when things go wrong? Disk failures, ransomware, site disasters, accidental deletions, regulatory audits demanding proof of data-at-rest encryption -- all of these are scenarios that a Tier-1 financial enterprise must handle with documented, tested, and audited procedures.

For an organization running 5,000+ VMs, data protection is not an afterthought bolted on after go-live. It is a foundational architectural decision that shapes storage backend selection, network design (replication bandwidth), hardware procurement (encryption acceleration), and tooling investment (backup software licensing). Getting this wrong has regulatory consequences: FINMA Circular 2023/1 on operational risks and resilience (which superseded Circular 2008/21) explicitly requires financial institutions to maintain tested disaster recovery capabilities, define RPO/RTO targets, and protect data at rest.

This page covers four pillars of data protection:

  1. Snapshots & Clones -- point-in-time copies at the storage level. The fastest form of data protection, but not a substitute for backups.
  2. Storage Replication / DR -- synchronous and asynchronous data replication between sites. The mechanism that determines RPO (how much data you lose) and RTO (how long recovery takes).
  3. Encryption at Rest -- protecting data on physical media from unauthorized access. A regulatory requirement, not an optional feature.
  4. Backup Integration -- how enterprise backup tools (Veeam, Kasten K10, Velero) integrate with each candidate platform. This is where the maturity gap between VMware and Kubernetes is most visible.

The central tension in this evaluation is maturity. VMware's data protection ecosystem is 20+ years old: VADP (vStorage APIs for Data Protection) is a deeply integrated, widely supported API that every enterprise backup vendor understands. The Kubernetes ecosystem is younger -- CSI VolumeSnapshots reached GA in Kubernetes 1.20 (December 2020), and Kubernetes-native backup tools like Kasten K10 and Velero are maturing rapidly but have not yet matched the depth of VADP integration. This page documents exactly where the gaps are and what compensating controls exist.


Concepts

1. Snapshots & Clones (Storage-Level)

What a Snapshot Is

A storage-level snapshot is a point-in-time, read-only record of a volume's state. When you create a snapshot, the storage system does not immediately copy all the data -- that would be prohibitively slow and expensive. Instead, it records a reference point and uses one of two mechanisms to preserve the original data as the volume continues to change.

The two fundamental snapshot mechanisms are Copy-on-Write (COW) and Redirect-on-Write (ROW). Understanding which mechanism each platform uses is critical because it determines snapshot performance impact, space consumption, and operational behavior.

Copy-on-Write (COW) vs Redirect-on-Write (ROW)

Copy-on-Write (COW) Snapshot Mechanism
=========================================

Before Snapshot:
+---+---+---+---+---+---+---+---+
| A | B | C | D | E | F | G | H |   <-- Active volume (blocks)
+---+---+---+---+---+---+---+---+

Step 1: Create Snapshot (instant -- only metadata)
+---+---+---+---+---+---+---+---+
| A | B | C | D | E | F | G | H |   <-- Active volume
+---+---+---+---+---+---+---+---+
  ^   ^   ^   ^   ^   ^   ^   ^
  |   |   |   |   |   |   |   |
  Snapshot S1 references same blocks (no data copied yet)

Step 2: Write new data to block C
  Before writing, OLD data (C) is copied to snapshot area:

  Snapshot S1 area:
  +---------+
  | C (old) |   <-- preserved original data
  +---------+

  Active volume:
  +---+---+----+---+---+---+---+---+
  | A | B | C' | D | E | F | G | H |   <-- C replaced with C'
  +---+---+----+---+---+---+---+---+

  Read from snapshot S1:
    Block C -> read from snapshot area (C old)
    Block A -> read from active volume (unchanged)

PERFORMANCE IMPACT:
  Every first write to a block after snapshot creation incurs:
  1. READ old block from active volume
  2. WRITE old block to snapshot area (COW copy)
  3. WRITE new data to active volume
  = 1 read + 2 writes per first modification = "write amplification"

  Subsequent writes to same block: normal (1 write only)
  Reads: no impact (active volume reads are direct)

Redirect-on-Write (ROW) Snapshot Mechanism
=============================================

Before Snapshot:
Block Map:  A -> loc0, B -> loc1, C -> loc2, D -> loc3
+------+------+------+------+------+------+------+------+
| loc0 | loc1 | loc2 | loc3 | loc4 | loc5 | loc6 | loc7 |
|  A   |  B   |  C   |  D   |      |      |      |      |
+------+------+------+------+------+------+------+------+

Step 1: Create Snapshot (instant -- freeze current block map)
Snapshot S1 Map: A -> loc0, B -> loc1, C -> loc2, D -> loc3
Active Map:     A -> loc0, B -> loc1, C -> loc2, D -> loc3

Step 2: Write new data (C') -- redirected to new location
+------+------+------+------+------+------+------+------+
| loc0 | loc1 | loc2 | loc3 | loc4 | loc5 | loc6 | loc7 |
|  A   |  B   |  C   |  D   |  C'  |      |      |      |
+------+------+------+------+------+------+------+------+
                 ^                     ^
                 |                     |
   Snapshot S1 Map still -> loc2    Active Map now -> loc4

PERFORMANCE IMPACT:
  Writes: NO penalty (write goes to new location, single I/O)
  Reads: May be slightly slower (data fragmented across locations)
  Cleanup: Garbage collection needed when snapshots deleted
  Space: Only new writes consume additional space (same as COW)

COW is used by VMware VMDK snapshots (redo logs / delta disks) and many traditional SAN arrays (e.g., Dell PowerMax). ROW is used by NetApp WAFL, Ceph RBD (object-level, see below), ZFS, Btrfs, and ReFS block cloning. In practice, modern systems often use a hybrid approach. The critical operational difference: COW penalizes writes (the first write to each block after a snapshot), while ROW penalizes sequential reads (fragmentation) and requires garbage collection when snapshots are deleted.
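The I/O accounting behind this difference can be sketched in a few lines. The following is a toy model (the classes, dicts, and per-I/O counting are illustrative simplifications, not any vendor's implementation) showing why COW turns the first post-snapshot write into three I/Os while ROW keeps every write to one:

```python
# Toy I/O accounting for COW vs ROW snapshots (illustration only; real
# implementations batch, cache, and coalesce, so absolute numbers differ).

class CowVolume:
    def __init__(self, blocks):
        self.active = dict(blocks)      # block id -> data (active volume)
        self.snap_area = {}             # preserved originals (snapshot area)
        self.unpreserved = set()        # blocks the snapshot still shares
        self.io_count = 0

    def snapshot(self):
        # Metadata-only: every block is now shared with the snapshot.
        self.unpreserved = set(self.active)
        self.snap_area = {}

    def write(self, block, data):
        if block in self.unpreserved:
            # First write since snapshot: read old + write old to snapshot
            # area + write new data = 3 I/Os ("write amplification").
            self.snap_area[block] = self.active[block]
            self.io_count += 3
            self.unpreserved.discard(block)
        else:
            self.io_count += 1          # subsequent writes: normal cost
        self.active[block] = data

class RowVolume:
    def __init__(self, blocks):
        self.store = dict(enumerate(blocks.values()))    # location -> data
        self.active_map = dict(zip(blocks, self.store))  # block id -> location
        self.snap_map = None
        self.next_loc = len(self.store)
        self.io_count = 0

    def snapshot(self):
        self.snap_map = dict(self.active_map)  # freeze the map (metadata only)

    def write(self, block, data):
        # Every write is redirected to a fresh location: always a single I/O.
        # Cost shows up later as fragmentation and garbage collection.
        self.store[self.next_loc] = data
        self.active_map[block] = self.next_loc
        self.next_loc += 1
        self.io_count += 1

blocks = {b: b for b in "ABCDEFGH"}
cow, row = CowVolume(blocks), RowVolume(blocks)
for vol in (cow, row):
    vol.snapshot()
    vol.write("C", "C'")    # first write after snapshot
    vol.write("C", "C''")   # second write to the same block

print(cow.io_count)   # 4: 3 for the first write, 1 for the second
print(row.io_count)   # 2: 1 per write
print(cow.snap_area)  # {'C': 'C'} -- original preserved in snapshot area
print(row.store[row.snap_map["C"]])  # 'C' -- frozen map still sees original
```

Note how `row.store` grows with every write: the old locations are never reclaimed while the snapshot holds them, which is exactly the fragmentation/garbage-collection cost noted above.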

Ceph RBD Snapshots

Ceph RBD snapshots use an object-level COW mechanism within the RADOS layer. An RBD image is striped across many RADOS objects (default 4 MiB each). When a snapshot is created:

  1. The snapshot is recorded in the RBD image's metadata (the rbd_header object). This is instantaneous.
  2. No data is copied. All existing RADOS objects are now shared between the active image and the snapshot.
  3. When a write arrives for an object that has not been modified since the snapshot, the OSD performs a COW operation: it creates a new copy of the object for the active image, preserving the original object for the snapshot.
  4. Subsequent writes to the same object modify the active copy directly -- no further COW overhead.

Ceph RBD Snapshot Internals (Object-Level COW)
=================================================

RBD Image "vm-boot-disk" (100 GiB, 25,600 x 4 MiB objects)

  RADOS object: rbd_data.<image_id>.00000000  (obj 0, bytes 0-4M)
  RADOS object: rbd_data.<image_id>.00000001  (obj 1, bytes 4M-8M)
  RADOS object: rbd_data.<image_id>.00000002  (obj 2, bytes 8M-12M)
  ...
  RADOS object: rbd_data.<image_id>.000063ff  (obj 25,599 -- the last of 25,600)

Step 1: rbd snap create pool/vm-boot-disk@snap1
  - Writes snap metadata to rbd_header.<image_id>
  - Takes microseconds, no data movement
  - All 25,600 objects now shared: active + snap1

Step 2: VM writes 4 KiB to byte offset 9,000,000 (falls in obj 2)

  OSD handling object ...00000002:
  +-----------------------------------------------------+
  | 1. Check: does snap1 reference this object? YES      |
  | 2. COW: clone object to snap1 namespace              |
  |    snap1::rbd_data.<image_id>.00000002 = original    |
  | 3. Apply write to active object                      |
  |    rbd_data.<image_id>.00000002 = modified           |
  +-----------------------------------------------------+

  After COW:
  Active namespace:  ...00000002 = modified (4 MiB object, 4 KiB changed)
  snap1 namespace:   ...00000002 = original (4 MiB object, preserved)

  NOTE: COW granularity is the full RADOS object (4 MiB default).
        A 4 KiB write triggers a 4 MiB COW copy. This is the
        write amplification cost of RBD snapshots.

Layered Clones (Writable Snapshots):
  rbd snap create pool/vm-boot-disk@golden-image
  rbd snap protect pool/vm-boot-disk@golden-image
  rbd clone pool/vm-boot-disk@golden-image pool/vm-clone-001

  Result: vm-clone-001 is a full writable RBD image that reads
  from the parent snapshot for unmodified objects and stores
  only the differences. Thin provisioned from day one.

  Parent (golden-image):  [A][B][C][D][E][F]...
                            ^  ^     ^
  Clone (vm-clone-001):    [A][B][C'][D][E'][F]...
                                  ^       ^
                            only C' and E' stored in clone
                            A,B,D,F read from parent (COW)

Performance impact at scale: With 5,000 VMs and a policy of keeping 3 daily snapshots per VM, the system manages 15,000 active snapshots. Each snapshot adds COW overhead only to the first write per object after the snapshot is created. For write-heavy VMs (databases), this overhead is significant during the first pass after snapshot creation. For read-heavy or idle VMs, the overhead is negligible. Ceph handles this well because COW happens at the OSD level -- distributed across many OSDs -- so no single node bears the full cost.
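The object-granularity cost called out in the diagram is simple arithmetic; a quick sanity check using only the numbers quoted above:

```python
# Back-of-envelope COW amplification for RBD snapshots, using the figures
# from the diagram above (4 MiB default object size, 100 GiB image).

OBJ_SIZE = 4 * 1024 * 1024    # default RADOS object size (4 MiB)
IMAGE_SIZE = 100 * 1024**3    # the 100 GiB boot disk from the diagram
WRITE_SIZE = 4 * 1024         # a typical 4 KiB guest write

objects = IMAGE_SIZE // OBJ_SIZE
amplification = OBJ_SIZE // WRITE_SIZE

print(objects)          # 25600 RADOS objects per image
print(amplification)    # 1024x: a 4 KiB first-touch write moves 4 MiB
# Worst case: random 4 KiB writes touching every object once after a
# snapshot would COW-copy the entire 100 GiB image.

# Fleet-level snapshot count from the policy described above:
vms, snaps_per_vm = 5000, 3
print(vms * snaps_per_vm)   # 15000 active snapshots to manage
```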

Critical rule: Snapshots are NOT backups. A snapshot exists on the same storage system as the original data. If the Ceph cluster suffers catastrophic data loss (e.g., multiple simultaneous OSD failures beyond the replication factor), both the active image and all its snapshots are lost. Snapshots protect against accidental deletion and enable fast rollback. Backups protect against infrastructure failure and are stored on a separate storage system.

S2D / ReFS Snapshots

Storage Spaces Direct on Azure Local uses a different snapshot model. ReFS (Resilient File System) supports block cloning -- a metadata-only operation where multiple files can share physical extents on disk. This is conceptually similar to ROW: new writes go to new extents, and the original extents are preserved by the snapshot reference.

For VM disks (VHDX files on ReFS), checkpoints function as the snapshot mechanism:

S2D does not expose raw block-level snapshots through a public API in the same way Ceph does. Instead, VM-level checkpoints (backed by VHDX differencing disks, AVHDX files) serve as the snapshot mechanism. Each checkpoint creates a new differencing disk -- writes go to the new disk, reads fall through to the parent. In the COW/ROW taxonomy above, this is redirect-on-write behavior implemented at the VHDX layer: the parent is frozen and nothing is copied on write.
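The differencing-disk read/write path can be sketched as follows (a deliberately minimal two-level model; real VHDX chains can be deeper and track block presence in a block allocation table, not a dict):

```python
# Toy model of a VHDX parent + AVHDX differencing disk after a checkpoint.
parent = {0: "base-0", 1: "base-1", 2: "base-2"}   # VHDX frozen at checkpoint
child = {}                                          # AVHDX created by checkpoint

def write(block, data):
    child[block] = data    # writes always land in the differencing disk

def read(block):
    # Reads hit the child first and fall through to the frozen parent.
    return child[block] if block in child else parent[block]

write(1, "new-1")
print(read(1))   # new-1  (from the differencing disk)
print(read(2))   # base-2 (falls through to the parent)
print(parent)    # unchanged -- this is the checkpoint's point-in-time view
```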

CSI VolumeSnapshot: The Kubernetes-Native Snapshot API

Both OVE and Azure Local expose snapshots through the Kubernetes CSI VolumeSnapshot API. This is the standard way to create, manage, and consume snapshots in Kubernetes.

CSI VolumeSnapshot Architecture
==================================

User creates VolumeSnapshot CR:
+---------------------------+       +--------------------------+
| VolumeSnapshot            |       | VolumeSnapshotClass      |
| apiVersion: snapshot.      |       | driver: rbd.csi.ceph.com |
|   storage.k8s.io/v1       |       |   (or disk.csi.azure.com)|
| spec:                     |       | deletionPolicy: Delete   |
|   volumeSnapshotClassName:|------>| parameters:              |
|     csi-rbd-snapclass     |       |   pool: replicapool      |
|   source:                 |       |   clusterID: rook-ceph   |
|     persistentVolumeClaim:|       +--------------------------+
|       name: vm-boot-disk  |
+-------------|-------------+
              |
              v
+---------------------------+       CSI Controller Plugin
| snapshot-controller       |       +------------------------+
| (cluster-wide, not per    |       | csi-snapshotter sidecar|
|  CSI driver)              |------>| calls CreateSnapshot   |
| Watches VolumeSnapshot CR |       | gRPC to CSI driver     |
| Creates VolumeSnapshotCont|       +-----------|------------+
|   ent (VSC) object        |                   |
+---------------------------+                   v
                                    +------------------------+
                                    | Storage Backend        |
                                    | Ceph: rbd snap create  |
                                    | S2D: VHDX checkpoint   |
                                    +-----------|------------+
                                                |
                                                v
+---------------------------+       +------------------------+
| VolumeSnapshotContent     |       | Physical snapshot       |
| (cluster-scoped)          |<------| exists on backend      |
| status:                   |       +------------------------+
|   readyToUse: true        |
|   snapshotHandle:         |
|     csi-snap-<uuid>       |
+---------------------------+

Restore: Create PVC with dataSource pointing to VolumeSnapshot
+---------------------------+
| PersistentVolumeClaim     |
| spec:                     |
|   dataSource:             |
|     name: vm-snap-01      |
|     kind: VolumeSnapshot  |
|     apiGroup: snapshot.   |
|       storage.k8s.io      |
|   accessModes: [RWO]     |
|   resources:              |
|     requests:             |
|       storage: 100Gi      |
+---------------------------+
  -> CSI driver creates new volume from snapshot
  -> For Ceph: rbd clone from snapshot (instant, COW)
  -> For S2D: new VHDX from checkpoint (copy or clone)

The three CRDs in the VolumeSnapshot ecosystem:

  CRD                     Scope       Purpose
  ---                     -----       -------
  VolumeSnapshot          Namespaced  User-facing: "I want a snapshot of this PVC"
  VolumeSnapshotContent   Cluster     Admin/system-facing: "This is the actual
                                      snapshot on the backend"
  VolumeSnapshotClass     Cluster     Policy: "Use this CSI driver with these
                                      parameters for snapshots"

This mirrors the PVC/PV/StorageClass pattern from page 06.
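For reference, the two user-facing objects from the diagram expressed as complete manifests. They are built here as Python dicts and dumped as JSON (which kubectl accepts alongside YAML); the `vms` namespace and the restored-PVC name are illustrative assumptions, while the field structure follows the snapshot.storage.k8s.io/v1 API:

```python
import json

# Snapshot request: "take a snapshot of PVC vm-boot-disk using the RBD class".
snapshot = {
    "apiVersion": "snapshot.storage.k8s.io/v1",
    "kind": "VolumeSnapshot",
    "metadata": {"name": "vm-snap-01", "namespace": "vms"},
    "spec": {
        "volumeSnapshotClassName": "csi-rbd-snapclass",
        "source": {"persistentVolumeClaim": {"name": "vm-boot-disk"}},
    },
}

# Restore: a new PVC whose dataSource points at the VolumeSnapshot.
restore_pvc = {
    "apiVersion": "v1",
    "kind": "PersistentVolumeClaim",
    "metadata": {"name": "vm-boot-disk-restored", "namespace": "vms"},
    "spec": {
        "dataSource": {
            "name": "vm-snap-01",
            "kind": "VolumeSnapshot",
            "apiGroup": "snapshot.storage.k8s.io",
        },
        "accessModes": ["ReadWriteOnce"],
        "resources": {"requests": {"storage": "100Gi"}},
    },
}

# Both documents can be piped straight to `kubectl apply -f -`.
print(json.dumps(snapshot, indent=2))
print(json.dumps(restore_pvc, indent=2))
```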

Application-Consistent Snapshots

A storage-level snapshot captures a crash-consistent state -- the equivalent of pulling the power cord. For databases, message queues, and other stateful applications, crash consistency may result in data corruption or long recovery times. Application-consistent snapshots require quiescing the application before the snapshot is taken.

Quiescing mechanisms per platform:

  VMware -- VMware Tools + VSS (Windows) or freeze/thaw scripts (Linux):
    vCenter instructs VMware Tools to quiesce the guest OS before the VMDK
    snapshot. On Windows, VSS flushes all file buffers and notifies
    applications (SQL Server, Exchange, etc.) to flush their write caches. On
    Linux, fsfreeze --freeze suspends writes to the filesystem.

  OVE (KubeVirt) -- QEMU Guest Agent (qemu-ga):
    The QEMU Guest Agent runs inside the VM and exposes commands like
    guest-fsfreeze-freeze and guest-fsfreeze-thaw. Before taking a CSI
    VolumeSnapshot, an orchestrator (Kasten, Velero, or a custom script) calls
    guest-fsfreeze-freeze via the KubeVirt API, takes the snapshot, then calls
    guest-fsfreeze-thaw. On Windows guests, qemu-ga can invoke VSS.

  Azure Local (Hyper-V) -- Hyper-V Integration Services + VSS:
    Hyper-V's production checkpoints use VSS inside the guest to create
    application-consistent snapshots. The Hyper-V VSS Writer coordinates with
    application VSS Writers (SQL Server, Exchange, Active Directory). This is
    mature and well-tested.

  Swisscom ESC -- managed by Swisscom (VMware Tools + SAN-level snapshots):
    Swisscom handles quiescing as part of their managed backup service. The
    customer does not directly control the quiescing mechanism.

Gap analysis: VMware and Hyper-V have deep, mature VSS integration -- every enterprise backup vendor relies on this. KubeVirt's QEMU Guest Agent is functional but less widely integrated. Kasten K10 supports qemu-ga for KubeVirt VMs (since Kasten 6.0+), but the integration is newer and requires explicit configuration. During the PoC, validate that qemu-ga-based quiescing works correctly for Windows VMs with SQL Server and for Linux VMs with PostgreSQL/Oracle.
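Whichever orchestrator performs the quiesce, the control flow must guarantee the guest is thawed even when the snapshot call fails -- a stuck freeze is an outage. A minimal sketch of that invariant (`freeze_vm`, `thaw_vm`, and `create_snapshot` are hypothetical stand-ins for the qemu-ga and CSI calls described above, not real API bindings):

```python
# Freeze -> snapshot -> thaw with a hard guarantee that thaw always runs.

def take_consistent_snapshot(vm, pvc, freeze_vm, thaw_vm, create_snapshot):
    freeze_vm(vm)                     # e.g., guest-fsfreeze-freeze via qemu-ga
    try:
        return create_snapshot(pvc)   # CSI VolumeSnapshot while I/O is quiesced
    finally:
        thaw_vm(vm)                   # ALWAYS thaw, even if the snapshot fails

# Dry run with fake backends to show the ordering guarantee:
log = []
snap = take_consistent_snapshot(
    "vm-001", "vm-boot-disk",
    freeze_vm=lambda vm: log.append(f"freeze {vm}"),
    thaw_vm=lambda vm: log.append(f"thaw {vm}"),
    create_snapshot=lambda pvc: (log.append(f"snap {pvc}"), f"{pvc}-snap")[1],
)
print(log)   # ['freeze vm-001', 'snap vm-boot-disk', 'thaw vm-001']
print(snap)  # vm-boot-disk-snap
```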

Snapshot Management at Scale

With 5,000 VMs, snapshot management becomes an operational challenge in its own right.

Best practice: Snapshot != Backup. This bears repeating. A snapshot schedule (daily COW snapshots on the primary storage) provides fast rollback for operational recovery (accidental deletion, bad update). A backup schedule (daily image-level backup to a separate storage system) provides protection against infrastructure failure, ransomware, and regulatory retention requirements. Both are needed. Neither substitutes for the other.


2. Storage Replication / DR

Synchronous vs Asynchronous Replication

Storage replication copies data from a primary site to a secondary site. The two modes define the tradeoff between data loss and performance:

Synchronous Replication (RPO = 0)
====================================

  Site A (Primary)                    Site B (Secondary)
  +-----------------+                +-----------------+
  | VM writes block |                |                 |
  | to storage      |                |                 |
  +--------|--------+                +-----------------+
           |
           v
  +-----------------+    1. Write    +-----------------+
  | Storage Primary |  ---------->  | Storage Replica |
  | Acknowledge     |  <----------  | Acknowledge     |
  | write to VM     |    2. ACK     |                 |
  +-----------------+                +-----------------+
           |
           v
  +-----------------+
  | VM receives     |
  | write ACK       |   Only AFTER both sites confirm
  +-----------------+

  RPO = 0 (zero data loss -- every committed write exists on both sites)
  Latency = local write + network RTT + remote write
  Requirement: < 10 ms RTT between sites (typically < 100 km)
  Risk: if Site B is slow, ALL writes at Site A are throttled


Asynchronous Replication (RPO > 0)
====================================

  Site A (Primary)                    Site B (Secondary)
  +-----------------+                +-----------------+
  | VM writes block |                |                 |
  | to storage      |                |                 |
  +--------|--------+                +-----------------+
           |
           v
  +-----------------+                +-----------------+
  | Storage Primary |                | Storage Replica |
  | Acknowledge     |   Background  | Receives data   |
  | write to VM     |   ========>   | periodically    |
  | IMMEDIATELY     |   (journal/   | (seconds to     |
  +-----------------+   snapshot     | minutes behind) |
                        shipping)    +-----------------+

  RPO = replication interval (e.g., 5 min snapshot schedule = up to 5 min data loss)
  Latency = local write only (no remote dependency)
  No RTT requirement (works across continents)
  Risk: data loss window = time since last successful replication

For a financial enterprise, the choice between synchronous and asynchronous replication is driven by regulatory and business requirements:

  Factor                Synchronous                            Asynchronous
  ------                -----------                            ------------
  RPO                   0 (zero data loss)                     Seconds to minutes
  Distance              < 100 km (< 10 ms RTT)                 Unlimited
  Write latency impact  +1-5 ms per write (depends on RTT)     None
  Bandwidth             Must sustain peak write throughput     Can burst; must keep up
                        continuously                           on average
  Use case              Active-active metro sites, critical    Cross-region DR,
                        financial systems (trading, payments)  non-latency-sensitive
                                                               workloads
  FINMA relevance       Required for CID systems with an       Acceptable for systems
                        RPO=0 requirement                      with RPO > 0 tolerance
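The latency and RPO figures in this comparison reduce to simple arithmetic. A sketch with illustrative numbers (measure real baselines during the PoC; the fiber assumption is roughly 5 microseconds per km one way, i.e. about 1 ms RTT per 100 km plus switching):

```python
# Rules of thumb for synchronous latency and asynchronous RPO.

def sync_write_latency_ms(local_ms, rtt_ms, remote_ms):
    # Synchronous: ACK only after local write + round trip + remote write.
    return local_ms + rtt_ms + remote_ms

def async_worst_case_rpo_s(interval_s, copy_time_s):
    # Asynchronous snapshot shipping: you can lose up to one full cycle.
    return interval_s + copy_time_s

# Metro pair ~80 km apart: assume 2 ms RTT and 0.5 ms per NVMe write.
print(sync_write_latency_ms(0.5, 2.0, 0.5))   # 3.0 ms per write, RPO = 0

# 5-minute mirror schedule with 1 minute of copy time:
print(async_worst_case_rpo_s(300, 60))        # 360 s -> RPO up to 6 minutes
```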

Ceph RBD Mirroring

Ceph provides native RBD (block device) mirroring between independent Ceph clusters. This is the DR mechanism for OVE/ODF environments.

Snapshot-based mirroring (the current recommended mode, replacing the older journal-based mode):

Ceph RBD Snapshot-Based Mirroring
====================================

  Site A: Ceph Cluster A              Site B: Ceph Cluster B
  (Primary for pool "prod")           (Secondary for pool "prod")

  +----------------------------+      +----------------------------+
  | rbd-mirror daemon          |      | rbd-mirror daemon          |
  | (runs on Site B, pulls     |      | (active for this pool)     |
  |  from Site A)              |      |                            |
  +----------------------------+      +----------------------------+

  Mirroring Cycle:
  1. Schedule triggers (e.g., every 5 minutes)
  2. Primary creates a mirror-snapshot on the RBD image
  3. rbd-mirror daemon on secondary compares:
     - Previous mirror-snapshot (already replicated)
     - Current mirror-snapshot (new)
  4. Calculates delta (changed objects between snapshots)
  5. Copies only changed objects to secondary cluster
  6. Creates matching mirror-snapshot on secondary
  7. Deletes old mirror-snapshots per retention policy

  Timeline:
  t=0   t=5   t=10  t=15  t=20 (minutes)
  |-----|-----|-----|-----|
  S1    S2    S3    S4    S5    <- mirror-snapshots on primary
  |     |     |     |     |
  |---->|---->|---->|---->|    <- delta replication to secondary
        S1'   S2'   S3'   S4'  <- mirror-snapshots on secondary
                                  (slight delay due to copy time)

  RPO = snapshot interval + copy time
  Example: 5-minute schedule + 1-minute copy = RPO up to 6 minutes

  Configuration per pool:
    rbd mirror pool enable <pool> image    # enable per-image mirroring
    rbd mirror image enable <pool>/<image> snapshot
    rbd mirror pool peer add <pool> <remote-cluster-spec>

Journal-based mirroring (legacy, not recommended for new deployments): Every write to a mirrored RBD image is first written to a journal (a separate RADOS object). The rbd-mirror daemon on the secondary cluster reads the journal and replays writes in order. This provides near-synchronous replication (RPO of seconds) but at significant performance cost -- every write is doubled (once to the image, once to the journal). Journal-based mirroring was deprecated in favor of snapshot-based mirroring for most use cases.

ODF Metro-DR and Regional-DR

Red Hat OpenShift Data Foundation (ODF) provides two DR topologies built on Ceph RBD mirroring and orchestrated by the ODF Multicluster Orchestrator:

ODF DR Topologies
===================

Metro-DR (Stretched Cluster, Synchronous, RPO=0)
-------------------------------------------------

          < 10 ms RTT, dedicated low-latency link
  Site A  <======================================>  Site B
  +------------------+                +------------------+
  | OCP Cluster      |                | OCP Cluster      |
  | (Zone A workers) |                | (Zone B workers) |
  +------------------+                +------------------+
  | ODF / Ceph       |                | ODF / Ceph       |
  | OSDs (Zone A)    |                | OSDs (Zone B)    |
  +------------------+                +------------------+
          |                                    |
          +------------+  +--------------------+
                       |  |
                  +-----------+
                  | Arbiter   |    (Site C or cloud)
                  | Node      |    MON only, no OSDs
                  | (Zone C)  |    Breaks quorum ties
                  +-----------+

  - Single stretched Ceph cluster spanning two sites
  - Data replicated synchronously (3 replicas: 2 in Zone A + 1 in Zone B,
    or 2+2 with arbiter tie-breaking)
  - RPO = 0 (every write confirmed on both sites before ACK)
  - RTO = minutes (failover = reschedule pods to surviving site)
  - Requires: < 10 ms RTT, sufficient bandwidth for ALL write traffic
  - Arbiter node required to maintain MON quorum during site failure
  - LIMITATION: 2 data sites only, arbiter has no data


Regional-DR (Independent Clusters, Asynchronous, RPO>0)
---------------------------------------------------------

  Site A                                Site B
  +------------------+                +------------------+
  | OCP Cluster A    |                | OCP Cluster B    |
  | ODF / Ceph       |                | ODF / Ceph       |
  | (independent)    |                | (independent)    |
  +---------|--------+                +---------|--------+
            |                                   |
            |    RBD snapshot-based mirroring    |
            +=================>================+
                (async, RPO = mirror interval)

  +------------------+
  | Hub Cluster      |    RHACM (Advanced Cluster Management)
  | RHACM + ODF MCO  |    ODF Multicluster Orchestrator
  | DR Policy CRDs   |    Manages failover/failback
  +------------------+

  - Two independent Ceph clusters, each with their own MON quorum
  - Async replication via rbd-mirror (snapshot-based)
  - RPO = snapshot schedule interval (typically 5-15 minutes)
  - RTO = minutes to tens of minutes (depends on workload complexity)
  - No RTT requirement (works across any distance)
  - RHACM orchestrates failover: updates DNS, reschedules workloads,
    promotes secondary to primary
  - DR Policy CRD defines: schedule, PVC selectors, failover order

Failover procedure (Regional-DR):

  1. RHACM detects or operator declares disaster at Site A.
  2. ODF Multicluster Orchestrator triggers failover via DRPlacementControl CRD.
  3. rbd-mirror on Site B promotes secondary images to primary (read-write).
  4. RHACM redeploys application workloads (VMs, pods) on Site B OCP cluster.
  5. DNS/load balancer entries updated to point to Site B.
  6. RTO depends on: number of VMs, application startup time, DNS propagation.

Failback procedure:

  1. Site A is restored and Ceph cluster is healthy.
  2. Reverse mirroring: Site B (now primary) replicates back to Site A.
  3. Once Site A is caught up, operator initiates planned failback.
  4. Workloads rescheduled back to Site A, Site B returns to secondary role.

S2D Storage Replica

Microsoft's Storage Replica is the replication engine for Azure Local / S2D. It operates at the volume (CSV) level and supports both synchronous and asynchronous modes.

S2D Storage Replica Topology
===============================

Synchronous (Stretch Cluster):
  Site A                              Site B
  +--------------------+            +--------------------+
  | Azure Local Node 1 |            | Azure Local Node 3 |
  | Azure Local Node 2 |            | Azure Local Node 4 |
  +---------|----------+            +---------|----------+
            |                                 |
  +---------|----------+            +---------|----------+
  | S2D Pool (Site A)  |<=========>| S2D Pool (Site B)  |
  | CSV Volume(s)      | sync rep  | CSV Volume(s)      |
  +--------------------+            +--------------------+
            |
  +--------------------+
  | Cloud Witness or   |   (Azure blob or file share witness)
  | File Share Witness |   Quorum tie-breaking
  +--------------------+

  - Windows Server Failover Cluster spans both sites
  - Storage Replica replicates at the block level (SMB 3.x transport)
  - Synchronous: RPO=0, requires < 5 ms RTT (Microsoft recommendation)
  - Asynchronous: RPO > 0, no distance limit
  - Failover: cluster role moves to surviving site
  - Limitations: max 16 nodes per cluster (including both sites)

Asynchronous (Cluster-to-Cluster):
  - Two independent WSFC clusters, each with S2D
  - Storage Replica configured between specific volumes
  - RPO = replication lag (depends on bandwidth and change rate)
  - Failover is manual or scripted (not automatic)

Key differences from Ceph RBD mirroring:

  Aspect                   Ceph RBD Mirroring               S2D Storage Replica
  ------                   ------------------               -------------------
  Replication granularity  Per RBD image (per VM disk)      Per CSV volume (may contain
                                                            multiple VMs)
  Synchronous mode         ODF Metro-DR stretched cluster   Storage Replica sync mode
  Synchronous RTT limit    < 10 ms                          < 5 ms (Microsoft
                                                            recommendation)
  Async mechanism          Snapshot-based delta copy        Continuous log shipping
  Orchestration            RHACM + ODF MCO                  WSFC + PowerShell / WAC
                           (Kubernetes-native)
  Failover automation      DRPlacementControl CRD           Cluster role failover
                           (declarative)                    (imperative + scripts)

DR Testing

FINMA requires regular DR testing -- not just documented procedures, but actual failover drills that validate RTO and RPO. Prefer non-disruptive test approaches (test failovers against replica copies) so that drills do not interrupt production replication.

RTO Analysis

RTO (Recovery Time Objective) is how long it takes to resume operations after a disaster. This depends on multiple factors:

  Detect failure:
    VMware SRM: vCenter alerts, SRM monitoring
    OVE / ODF Regional-DR: RHACM cluster health, manual declaration
    Azure Local / SR: WSFC heartbeat, manual declaration
    Swisscom ESC: Swisscom NOC monitoring

  Initiate failover:
    VMware SRM: SRM recovery plan (1-click)
    OVE / ODF Regional-DR: DRPlacementControl CRD update
    Azure Local / SR: PowerShell / WAC failover
    Swisscom ESC: Swisscom runbook

  Promote storage:
    VMware SRM: seconds (VMFS resignature/mount)
    OVE / ODF Regional-DR: seconds (rbd-mirror promote)
    Azure Local / SR: seconds (SR role switch)
    Swisscom ESC: minutes (SAN failover)

  Start VMs:
    VMware SRM: minutes (SRM starts VMs in order)
    OVE / ODF Regional-DR: minutes (Kubernetes reschedules pods)
    Azure Local / SR: minutes (Hyper-V starts VMs)
    Swisscom ESC: minutes (managed by Swisscom)

  Network cutover:
    VMware SRM: SRM handles IP re-mapping
    OVE / ODF Regional-DR: MetalLB/DNS update, manual or automated
    Azure Local / SR: WSFC handles IP failover
    Swisscom ESC: Swisscom handles network

  Application recovery: application-dependent on every platform.

  Typical total RTO:
    VMware SRM: 15-60 minutes (well-tested)
    OVE / ODF Regional-DR: 15-60 minutes (depends on workload count)
    Azure Local / SR: 15-60 minutes (WSFC mature)
    Swisscom ESC: SLA-defined (typically 4-8 hours)
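RTO budgeting is additive across these phases. A sketch with illustrative per-phase numbers consistent with the 15-60 minute band quoted above (replace with measured values from actual DR drills):

```python
# RTO as a sum of phase durations (all numbers illustrative).

phases_min = {
    "detect failure":        5,   # monitoring alert -> human declaration
    "initiate failover":     2,
    "promote storage":       1,   # rbd-mirror promote / SR role switch
    "start VMs":            20,   # dominated by VM count and boot order
    "network cutover":       5,   # DNS TTLs often dominate here
    "application recovery": 10,   # DB crash recovery, service checks
}
print(sum(phases_min.values()))   # 43 minutes -- within the 15-60 min band
```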

FINMA Requirements for DR

FINMA Circular 2023/1 (Operational Risks and Resilience), which superseded Circular 2008/21, establishes requirements relevant to storage replication and DR:

  1. Business Continuity Management (BCM): Financial institutions must maintain BCM plans that include IT disaster recovery. Plans must define RPO and RTO for each critical system.
  2. Testing frequency: DR plans must be tested regularly (at least annually for critical systems, more frequently recommended). Test results must be documented and reviewed by management.
  3. Data residency: For Client Identifying Data (CID), data must remain within Switzerland (or in jurisdictions with equivalent data protection). DR sites must comply with the same data residency requirements as primary sites.
  4. Third-party risk (outsourcing): If DR is provided by a third party (Swisscom ESC), FINMA requires the institution to maintain oversight and control. The institution must be able to verify DR capabilities independently.
  5. Auditability: DR configurations, replication status, failover logs, and test results must be available for FINMA auditors. This requires logging and monitoring infrastructure at both primary and DR sites.

3. Encryption at Rest

Why Encryption at Rest Matters

Encryption at rest protects data stored on physical media (SSDs, NVMe drives, HDDs) from unauthorized access. For a financial enterprise, three scenarios drive this requirement:

  1. Hardware decommissioning. When drives reach end-of-life, they are returned to the vendor or disposed of. Without encryption, data on those drives can be recovered. With encryption, the data is unreadable without the encryption key -- secure disposal is as simple as destroying the key (crypto-erase).
  2. Physical theft. If a drive is stolen from the datacenter (insider threat or physical breach), encrypted data is unreadable without the key management infrastructure.
  3. Regulatory compliance. FINMA, ISO 27001 (Annex A.8.24 -- use of cryptography), and GDPR Article 32 all require appropriate technical measures to protect data. Encryption at rest is the standard technical control.
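Crypto-erase (scenario 1) is worth seeing end to end: when data is encrypted before it hits the media, destroying the key is equivalent to destroying the data. The sketch below uses a toy SHA-256 counter-mode keystream purely for illustration -- production systems use AES-XTS via dm-crypt/LUKS, BitLocker, or self-encrypting drives, never a hand-rolled cipher:

```python
# Crypto-erase in miniature. TOY cipher for illustration only.
import hashlib, secrets

def keystream_xor(key: bytes, data: bytes) -> bytes:
    # Derive a keystream from SHA-256(key || counter) and XOR it with data.
    # Symmetric: the same call encrypts and decrypts.
    out = bytearray()
    for counter in range((len(data) + 31) // 32):
        out += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
    return bytes(b ^ k for b, k in zip(data, out))

key = secrets.token_bytes(32)            # held in the KMS, never on the drive
plaintext = b"account 1234: balance CHF 1,000,000"
on_disk = keystream_xor(key, plaintext)  # what physically lands on the media

print(keystream_xor(key, on_disk) == plaintext)  # True: key present -> readable
key = None                               # "crypto-erase": destroy the key
# Without the key there is no feasible path from on_disk back to plaintext;
# secure disposal of the drive reduces to secure disposal of the key.
```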

Encryption Layers

Encryption can be applied at multiple layers. Each layer has different tradeoffs for performance, key management complexity, and protection scope:

Encryption Layers Stack
==========================

Layer 5: Application-Level Encryption
+-------------------------------------------------------+
| Application encrypts data before writing to storage    |
| Examples: TDE (SQL Server, Oracle), column-level       |
|           encryption, application-managed key wrapping  |
| Pro: finest granularity, application controls keys     |
| Con: application changes required, no protection for   |
|      filesystem metadata, logs, temp files             |
+-------------------------------------------------------+
         |
         v  (encrypted data written to filesystem)
Layer 4: Filesystem-Level Encryption
+-------------------------------------------------------+
| Filesystem encrypts files/directories individually     |
| Examples: fscrypt (ext4/F2FS), EFS (NTFS)             |
| Pro: per-file granularity, different keys per dir      |
| Con: metadata (filenames, sizes) may leak, complex     |
|      key management per file/directory                 |
+-------------------------------------------------------+
         |
         v  (encrypted files stored on block device)
Layer 3: Block-Level / OS-Level Encryption
+-------------------------------------------------------+
| OS encrypts entire block device transparently          |
| Examples: dm-crypt/LUKS (Linux), BitLocker (Windows)   |
| Pro: transparent to applications, encrypts everything  |
|      including swap, temp files, filesystem metadata   |
| Con: key management at OS level, slight CPU overhead   |
|      (mitigated by AES-NI hardware acceleration)       |
+-------------------------------------------------------+
         |
         v  (encrypted blocks written to physical disk)
Layer 2: Controller-Level Encryption
+-------------------------------------------------------+
| HBA/RAID controller encrypts data before writing       |
| Examples: LSI MegaRAID encryption, Dell PERC            |
| Pro: transparent to OS, no CPU overhead on host        |
| Con: controller-specific, key tied to hardware,        |
|      does not protect data in host memory              |
+-------------------------------------------------------+
         |
         v  (encrypted data on physical media)
Layer 1: Self-Encrypting Drives (SED / OPAL)
+-------------------------------------------------------+
| Drive firmware encrypts all data on the media          |
| Standards: TCG Opal 2.0, IEEE 1667                     |
| Pro: zero performance overhead (hardware AES engine),  |
|      instant crypto-erase (destroy media encryption    |
|      key), transparent to OS                           |
| Con: trust the drive firmware (vulnerabilities found   |
|      in SED implementations, e.g., Crucial/Samsung     |
|      2018 CVEs), key management is drive-local         |
+-------------------------------------------------------+
         |
         v  (physical platters/NAND -- always encrypted)

RECOMMENDATION for Tier-1 financial enterprise:
  Layer 3 (LUKS or BitLocker) + Layer 1 (SED) = defense in depth
  Layer 3 provides OS-controlled key management (integrates with
  Vault/KMIP/Tang). Layer 1 provides hardware-accelerated encryption
  and instant crypto-erase for decommissioning.

LUKS (Linux Unified Key Setup)

LUKS is the standard for block-level encryption on Linux. It is the encryption mechanism used by Ceph/ODF for OSD-level encryption and by RHEL/CoreOS for node-level encryption on OVE clusters.

Architecture:

LUKS / dm-crypt Architecture
===============================

+------------------------------------------+
| Application / VM                         |
+------------------------------------------+
| Filesystem (XFS, ext4)                   |
+------------------------------------------+
| dm-crypt (kernel crypto module)          |
|   - Encrypts/decrypts blocks in-flight   |
|   - AES-256-XTS (default cipher)         |
|   - Uses kernel crypto API               |
|   - Hardware acceleration: AES-NI        |
+------------------------------------------+
| LUKS Header (stored on disk)             |
|   +------------------------------------+ |
|   | Magic: LUKS\xba\xbe               | |
|   | Version: 2                         | |
|   | Cipher: aes-xts-plain64            | |
|   | Key size: 512 bits (256-bit AES    | |
|   |           key + 256-bit tweak key) | |
|   | Hash: sha256                       | |
|   | Key Slots (0-31):                  | |
|   |   Slot 0: passphrase-derived key   | |
|   |   Slot 1: recovery key             | |
|   |   Slot 2: KMIP/Vault-managed key   | |
|   |   Slot 3-31: (unused)              | |
|   | Master Key (encrypted by slot key):| |
|   |   The actual data encryption key   | |
|   |   (DEK) -- never stored in clear   | |
|   +------------------------------------+ |
+------------------------------------------+
| Physical Block Device (NVMe, SSD)        |
+------------------------------------------+

Key Hierarchy:
  User Key (passphrase, keyfile, Vault token, Tang response)
       |
       v (PBKDF2 or Argon2)
  Slot Key (key encryption key -- KEK)
       |
       v (AES key-unwrap)
  Master Key (data encryption key -- DEK)
       |
       v (AES-XTS encryption of each sector)
  Encrypted data on disk

LUKS key slots allow up to 32 different passphrases or key sources to unlock the same volume (LUKS2; LUKS1 supported 8). This is critical for operations: a human recovery passphrase can coexist with an automated Vault-, KMIP-, or Tang-provided key, and individual slots can be revoked without re-encrypting the volume.
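The indirection that makes this possible -- every slot wraps the same master key -- can be illustrated with a stdlib-only Python sketch. This is a toy model under stated simplifications: real LUKS uses Argon2id (LUKS2) or PBKDF2 (LUKS1) for key derivation and AES with an anti-forensic splitter for wrapping; plain XOR stands in for the wrap step here, and all names are illustrative.

```python
import hashlib
import secrets

# One master key (DEK) encrypts the data; each key slot stores the DEK
# wrapped under a different slot key (KEK). Unlocking any one slot
# recovers the same DEK, so passphrases can be added or revoked
# without re-encrypting the disk.

def derive_kek(passphrase: bytes, salt: bytes, iterations: int = 100_000) -> bytes:
    # PBKDF2 shown because it is in the standard library; LUKS2
    # defaults to Argon2id.
    return hashlib.pbkdf2_hmac("sha256", passphrase, salt, iterations, dklen=32)

def wrap(dek: bytes, kek: bytes) -> bytes:
    # Toy stand-in for AES key-wrap: XOR of two 32-byte strings.
    return bytes(a ^ b for a, b in zip(dek, kek))

unwrap = wrap  # XOR is its own inverse

dek = secrets.token_bytes(32)  # the volume's data encryption key

# Two slots, two independent passphrases, one DEK
slots = {}
for slot, phrase in [(0, b"operator-passphrase"), (1, b"recovery-passphrase")]:
    salt = secrets.token_bytes(16)
    slots[slot] = (salt, wrap(dek, derive_kek(phrase, salt)))

# Unlock via slot 1 (recovery): recovers the identical DEK
salt, blob = slots[1]
recovered = unwrap(blob, derive_kek(b"recovery-passphrase", salt))
assert recovered == dek
```

Revoking a slot (the real-world `cryptsetup luksKillSlot`) just deletes that slot's wrapped blob; the DEK, and therefore the on-disk ciphertext, is untouched.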

Detached LUKS headers: The LUKS header can be stored on a separate device or in a key management system. If the header is detached, the encrypted disk looks like random data -- there is no indication that it is even encrypted. This is relevant for high-security environments where the existence of encrypted data must not be discoverable.

BitLocker (Windows)

BitLocker is the Windows equivalent of LUKS, used for full-volume encryption on Azure Local / S2D nodes and Hyper-V VMs.

Key features:

  1. Full-volume AES-XTS encryption with AES-NI hardware acceleration.
  2. TPM-sealed keys for unattended boot; recovery passwords can be escrowed in Active Directory or Azure Key Vault.
  3. Used-space-only encryption for fast initial enablement on large volumes.

BitLocker on S2D CSV volumes: BitLocker supports Cluster Shared Volumes. The volume is protected with an Active Directory-based cluster protector, so any node in the failover cluster can unlock it after a failover.

Ceph OSD Encryption

In ODF/Rook-Ceph, encryption at rest is implemented at the OSD level using dm-crypt/LUKS. Each OSD's underlying block device is encrypted before BlueStore accesses it.

Ceph OSD Encryption with dm-crypt
====================================

Without Encryption:
  +----------+     +------------+     +----------------+
  | BlueStore| --> | Raw Block  | --> | Physical NVMe  |
  | (OSD)    |     | Device     |     | /dev/nvme0n1p1 |
  +----------+     +------------+     +----------------+

With dm-crypt Encryption:
  +----------+     +------------+     +----------------+     +----------------+
  | BlueStore| --> | dm-crypt   | --> | LUKS Container | --> | Physical NVMe  |
  | (OSD)    |     | (virtual   |     | (encrypted)    |     | /dev/nvme0n1p1 |
  |          |     |  device)   |     |                |     |                |
  +----------+     +------------+     +----------------+     +----------------+
                        ^
                        | Key from:
                        | - Kubernetes Secret (default)
                        | - HashiCorp Vault (recommended)
                        | - KMIP server (enterprise)

Rook-Ceph Configuration (CephCluster CR):
  spec:
    storage:
      encryptedDevice: true    # Enable dm-crypt on all OSDs
    security:
      kms:
        connectionDetails:
          KMS_PROVIDER: vault
          VAULT_ADDR: https://vault.example.com:8200
          VAULT_BACKEND_PATH: secret/data/rook
        tokenSecretName: rook-vault-token

Key Management Options:
  1. Kubernetes Secret (default, NOT recommended for production)
     - DEK stored as a K8s Secret in the Rook namespace
     - Encrypted only if etcd encryption is enabled
     - Risk: etcd backup exposes all DEKs

  2. HashiCorp Vault (recommended)
     - DEK stored in Vault's KV secret engine
     - Vault provides audit logging, key rotation, access policies
     - Rook retrieves key from Vault during OSD startup
     - Key never stored on the Kubernetes node

  3. KMIP (Key Management Interoperability Protocol)
     - Enterprise key management (Thales, Entrust, Gemalto)
     - Standard protocol, auditable, HSM-backed
     - Highest compliance level for financial institutions

Important: Ceph encryption is per-OSD, not per-volume or per-pool. All data on an encrypted OSD is encrypted, regardless of which pool or RBD image it belongs to. This simplifies configuration (encrypt all OSDs) but does not allow selective encryption (e.g., "encrypt only the gold tier"). For most financial enterprises, encrypting all data at rest is the simplest and most compliant approach.

S2D Encryption

Azure Local uses BitLocker for encryption at rest on S2D volumes: both the node OS volumes and the Cluster Shared Volumes are encrypted, with TPM-backed unattended boot and recovery keys escrowed to Active Directory.

Network-Bound Disk Encryption (NBDE) with Tang/Clevis

For OVE/RHEL environments, NBDE provides automatic unlock of LUKS-encrypted disks during boot without manual intervention or TPM dependency. A Clevis client on each node binds a LUKS key slot to one or more Tang servers (via clevis luks bind); at boot, Clevis contacts Tang over the datacenter network, re-derives the key material, and unlocks the disk unattended. A drive removed from the datacenter cannot reach the Tang servers and therefore cannot be unlocked.

This is the recommended approach for OVE bare-metal nodes: LUKS-encrypted OS and OSD disks, with Tang/Clevis providing unattended boot while maintaining security.

Performance Impact of Encryption

Modern x86 CPUs (Intel Xeon Scalable, AMD EPYC) include AES-NI (Advanced Encryption Standard New Instructions) -- hardware acceleration for AES encryption operations. With AES-NI:

Operation                  | Without AES-NI      | With AES-NI          | Impact
---------------------------+---------------------+----------------------+--------------
AES-256-XTS throughput     | ~500 MB/s per core  | ~5-10 GB/s per core  | 10-20x faster
Per-I/O latency overhead   | 10-50 us            | < 1 us               | Negligible
CPU utilization overhead   | 5-15%               | < 2%                 | Negligible

Practical impact: With AES-NI (standard on all modern server CPUs), encryption at rest adds negligible overhead. Benchmarks consistently show < 3% throughput reduction for both LUKS and BitLocker when AES-NI is available. Encryption at rest should be enabled unconditionally on all storage nodes and VM volumes. The performance argument against encryption is obsolete with modern hardware.
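A quick sizing check makes the point concrete: how many CPU cores would AES-256-XTS consume at a node's peak storage throughput. The figures below are the conservative ends of the ranges in the table; the function name is illustrative.

```python
# Back-of-envelope: cores needed to encrypt at a given node throughput.
def cores_needed(node_throughput_gbs: float, per_core_gbs: float) -> float:
    return node_throughput_gbs / per_core_gbs

# A node pushing 20 GB/s:
with_aesni = cores_needed(20, 5)    # AES-NI, ~5 GB/s per core -> 4.0 cores
without    = cores_needed(20, 0.5)  # software AES, ~0.5 GB/s  -> 40.0 cores

print(f"with AES-NI: {with_aesni:.0f} cores, without: {without:.0f} cores")
```

Four cores out of a modern 64-core server is background noise; forty is a capacity problem. This is the arithmetic behind "the performance argument against encryption is obsolete."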

Compliance Requirements

Encryption-at-rest requirements by standard:

  • FINMA Circular 2023/1 -- requires "appropriate technical and organizational measures" for data protection. Encryption at rest is considered a baseline measure for sensitive financial data (CID). Not explicitly mandated by name, but expected by auditors.
  • ISO 27001 -- Annex A.8.24 (Use of cryptography): "A policy on the use of cryptographic controls for protection of information shall be developed and implemented." Encryption at rest is a primary control.
  • GDPR -- Article 32: "appropriate technical measures... including... encryption of personal data." Encryption at rest is explicitly listed as a technical measure.
  • PCI DSS -- Requirement 3.4: "Render PAN unreadable anywhere it is stored" by using strong cryptography. Directly requires encryption at rest for cardholder data.
  • Swiss FADP (nDSG) -- Article 8: requires appropriate technical measures to ensure data security. Encryption at rest is a recognized measure.

4. Backup Integration (Veeam, Kasten)

Backup Architecture Fundamentals

Before comparing platform-specific backup solutions, understand the three backup architecture models:

Backup Architecture Models
==============================

Model 1: Agent-Based Backup
+-------------------+        +-------------------+
| VM / Server       |        | Backup Server     |
|                   |        |                   |
| +---------------+ |  TCP   | +---------------+ |
| | Backup Agent  |--------->| | Backup Engine | |
| | (runs inside  | |        | | (catalogs,    | |
| |  the guest)   | |        | |  dedup, store)| |
| +---------------+ |        | +---------------+ |
+-------------------+        +---------|---------+
                                       |
                              +--------|--------+
                              | Backup Storage  |
                              | (disk, tape, S3)|
                              +-----------------+
  Pro: works on any platform, application-aware (database agents)
  Con: agent must be installed and maintained in every VM

Model 2: Agentless / Image-Level Backup (VMware VADP Model)
+-------------------+        +--------------------+
| Hypervisor (ESXi) |        | Backup Proxy       |
|                   |        | (Veeam proxy,      |
| +------+ +------+ |        |  separate host)    |
| | VM 1 | | VM 2 | |  VADP  | +----------------+ |
| +------+ +------+ |  API   | | Backup Engine  | |
| +------+ +------+ |------->| | (reads VMDK    | |
| | VM 3 | | VM 4 | |  CBT   | |  blocks via    | |
| +------+ +------+ |------->| |  hypervisor)   | |
+-------------------+        | +----------------+ |
                             +--------------------+
  VMware VADP provides:
  - Snapshot coordination (quiesce guest, snapshot VMDK)
  - Changed Block Tracking (CBT): bitmap of changed blocks
    since last backup -- enables fast incremental backups
  - HotAdd/SAN/NFS transport modes for data transfer
  - Application-consistent snapshots via VSS/VMware Tools

  Pro: no agent inside VM, CBT makes incrementals very fast
  Con: tightly coupled to VMware API

Model 3: Kubernetes-Native Backup (CSI Snapshot Model)
+-------------------+        +--------------------+
| Kubernetes Cluster|        | Backup Controller  |
|                   |        | (Kasten K10 or     |
| +------+ +------+ |        |  Velero in-cluster)|
| |VM Pod| |VM Pod| |  CSI   | +----------------+ |
| |+PVC  | |+PVC  | |  Snap  | | Backup Policy  | |
| +------+ +------+ |  API   | | Engine         | |
| +------+ +------+ |------->| | (snapshot,     | |
| |VM Pod| |VM Pod| |        | |  export, ship) | |
| |+PVC  | |+PVC  | |        | +----------------+ |
| +------+ +------+ |        +----------|---------+
+-------------------+                   |
                               +--------|--------+
                               | Backup Target   |
                               | (S3, NFS, etc.) |
                               +-----------------+
  CSI VolumeSnapshot provides:
  - Crash-consistent snapshots (application-consistent
    with qemu-ga or pre/post hooks)
  - No CBT equivalent in CSI (full snapshot each time,
    dedup/diff handled by backup tool or object store)
  - Export: backup tool copies snapshot data to external
    storage (S3-compatible object store is standard)

  Pro: Kubernetes-native, works with any CSI driver
  Con: younger ecosystem, no CBT, less enterprise tooling

VMware VADP: The Baseline (What We Have Today)

VMware's vStorage APIs for Data Protection (VADP) is the most mature VM backup API in the industry. Understanding it is essential to evaluate what we gain and lose in migration.

VADP key capabilities:

  1. Snapshot coordination: VADP creates a snapshot of the VMDK and optionally quiesces the guest OS (VMware Tools + VSS). The backup proxy reads from the snapshot while the VM continues running.
  2. Changed Block Tracking (CBT): A bitmap maintained by ESXi that records which blocks on a VMDK have changed since a given point in time. CBT is the cornerstone of efficient incremental backups. Without CBT, a backup tool must read and hash every block to determine changes -- orders of magnitude slower.
  3. Transport modes: VADP supports multiple ways to move data from the VMDK to the backup proxy:
    • HotAdd: Backup proxy is a VM on the same host; VMDK is "hot-added" to the proxy VM for direct read. Fastest mode.
    • SAN: Backup proxy reads VMDK blocks directly from the SAN (Fibre Channel or iSCSI). No ESXi involvement in data transfer.
    • NFS: For NFS-backed datastores, backup proxy reads directly from the NFS server.
    • Network (NBD/NBDSSL): Backup proxy reads blocks over the network through ESXi. Slowest but most universally compatible.
  4. Application-aware processing: With VMware Tools and VSS, VADP creates application-consistent snapshots for Windows workloads. For Linux, pre/post scripts can be configured.

Why VADP is hard to replace: Every enterprise backup vendor (Veeam, Commvault, Veritas NetBackup, Cohesity, Dell Avamar, Rubrik) has deep VADP integration. CBT alone provides a 10-100x speedup for incremental backups. There is no equivalent CBT API in the Kubernetes CSI specification. This is the single largest backup maturity gap between VMware and Kubernetes-based platforms.
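A minimal model shows where the CBT speedup comes from: the hypervisor maintains a dirty bitmap per virtual disk, and an incremental backup reads only the flagged blocks instead of scanning and hashing the entire disk. The class below illustrates the concept only; it is not the ESXi implementation.

```python
# Toy model of Changed Block Tracking: a per-disk dirty bitmap that the
# "hypervisor" updates on every write and the backup tool consumes.

class TrackedDisk:
    def __init__(self, num_blocks: int):
        self.blocks = [b"\x00" * 4096] * num_blocks
        self.dirty = [False] * num_blocks   # the CBT bitmap

    def write(self, idx: int, data: bytes) -> None:
        self.blocks[idx] = data
        self.dirty[idx] = True              # hypervisor flips the bit

    def incremental_backup(self) -> dict:
        # Read only blocks whose bit is set, then reset the bitmap.
        changed = {i: self.blocks[i] for i, d in enumerate(self.dirty) if d}
        self.dirty = [False] * len(self.blocks)
        return changed

disk = TrackedDisk(num_blocks=1_000_000)    # ~4 GiB virtual disk
disk.write(7, b"\x01" * 4096)
disk.write(42, b"\x02" * 4096)

delta = disk.incremental_backup()
# Reads 2 blocks instead of 1,000,000 -- the source of the 10-100x claim.
print(f"blocks read: {len(delta)} of {len(disk.blocks)}")
```

Without the bitmap, the backup tool's only option is the full scan-and-compare that CSI-based tools fall back to today.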

Kasten K10 (Veeam Kasten)

Kasten K10 is a Kubernetes-native backup and disaster recovery platform. Veeam acquired Kasten in 2020, and it is now branded as "Veeam Kasten." K10 is the primary backup solution for OVE and other Kubernetes platforms.

Kasten K10 Architecture
==========================

+------------------------------------------------------------------+
| Kubernetes Cluster                                                |
|                                                                   |
|  kasten-io namespace:                                             |
|  +-------------------------------------------------------------+ |
|  | K10 Control Plane                                            | |
|  |                                                              | |
|  | +------------+  +------------+  +------------+              | |
|  | | Catalog    |  | Dashboard  |  | Auth /     |              | |
|  | | Service    |  | (Web UI)   |  | RBAC       |              | |
|  | | (metadata  |  | (React app)|  | (OIDC,     |              | |
|  | |  store,    |  |            |  |  LDAP,     |              | |
|  | |  restore   |  |            |  |  K8s RBAC) |              | |
|  | |  points)   |  |            |  |            |              | |
|  | +------------+  +------------+  +------------+              | |
|  |                                                              | |
|  | +------------+  +------------+  +------------+              | |
|  | | Policy     |  | Actions    |  | Kanister   |              | |
|  | | Engine     |  | Controller |  | Sidecar    |              | |
|  | | (schedule, |  | (execute   |  | (app-aware |              | |
|  | |  select,   |  |  backup/   |  |  hooks,    |              | |
|  | |  retain)   |  |  restore   |  |  blueprint |              | |
|  | |            |  |  actions)  |  |  execution)|              | |
|  | +------------+  +-----+------+  +------------+              | |
|  +------------------------|------------------------------------+ |
|                           |                                       |
|  Per Backup Action:       v                                       |
|  +-------------------------------------------------------------+ |
|  | 1. Pre-hook (optional): qemu-ga fsfreeze, DB quiesce         | |
|  | 2. CSI VolumeSnapshot: create snapshot of each PVC           | |
|  | 3. Post-hook: thaw filesystem, resume DB                     | |
|  | 4. Export snapshot data to Location Profile target            | |
|  |    (read snapshot, stream to S3/NFS/Azure Blob)              | |
|  | 5. Catalog: record restore point metadata                    | |
|  +-------------------------------------------------------------+ |
|                                                                   |
+------------------------------------------------------------------+
           |
           | Export (S3 API, NFS, etc.)
           v
+---------------------------+
| Location Profile Target   |
| +----------+ +---------+  |
| | S3 Bucket| | NFS     |  |
| | (MinIO,  | | Share   |  |
| | AWS S3,  | |         |  |
| | Ceph RGW)| |         |  |
| +----------+ +---------+  |
+---------------------------+

K10 key concepts:

  • Policy -- defines what to back up (namespace selector, label selector), when (schedule), how long to keep it (retention), and where to send it (Location Profile). Policies are Kubernetes CRDs.
  • Location Profile -- target for backup data export: S3-compatible object store, NFS, Azure Blob, Google Cloud Storage, Veeam Backup Repository.
  • Blueprint (Kanister) -- application-aware hooks for databases and stateful applications. Kanister is an open-source framework (from Kasten) that defines pre/post actions for backup. Example: a PostgreSQL blueprint runs pg_dump before the snapshot and verifies the data after a restore.
  • Restore Point -- a catalog entry representing a complete, restorable backup. Contains metadata about all backed-up resources (PVCs, ConfigMaps, Secrets, CRDs) and their storage locations.
  • Transform -- a mutation applied during restore, e.g. change StorageClass, modify resource names, update ConfigMap values. Used for DR (restore to a different cluster with different StorageClasses) or cloning (create a test copy with modified configuration).

K10 for KubeVirt VMs:

Kasten K10 supports backing up KubeVirt VMs as Kubernetes workloads. The backup captures the VirtualMachine and related CRDs, the VM's disks (PVCs and DataVolumes, snapshotted via CSI), and associated objects such as Secrets and ConfigMaps.

For application-consistent backups of KubeVirt VMs, K10 can invoke the QEMU Guest Agent to freeze filesystems before snapshotting. This requires:

  1. QEMU Guest Agent installed and running inside the VM
  2. K10 configured with VM-aware hooks (KubeVirt integration, available since Kasten 6.0+)
  3. The VirtualMachine definition must expose the QEMU guest agent channel (a virtio-serial device, which KubeVirt attaches by default when the agent is present in the guest)
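The pre-hook/snapshot/post-hook sequence reduces to one ordering invariant: the guest must always be thawed, even when the snapshot step fails, or the VM's filesystems stay frozen. A sketch with stub functions, standing in for the qemu-ga calls and the CSI VolumeSnapshot API (all names are hypothetical):

```python
# Sketch of an application-consistent snapshot sequence. The stubs
# represent qemu-ga fsfreeze calls and CSI snapshot creation; the point
# is the try/finally ordering.

def freeze(vm: str) -> None:
    print(f"qemu-ga fsfreeze-freeze on {vm}")   # stub

def thaw(vm: str) -> None:
    print(f"qemu-ga fsfreeze-thaw on {vm}")     # stub

def csi_snapshot(pvc: str) -> str:
    print(f"VolumeSnapshot created for {pvc}")  # stub
    return f"snap-{pvc}"

def consistent_snapshot(vm: str, pvcs: list) -> list:
    freeze(vm)                                   # 1. pre-hook: quiesce the guest
    try:
        return [csi_snapshot(p) for p in pvcs]   # 2. snapshot each PVC
    finally:
        thaw(vm)                                 # 3. post-hook: always resume

snaps = consistent_snapshot("vm-payments-01", ["rootdisk", "datadisk"])
```

The try/finally is the whole design: a failed snapshot must never leave a production database with frozen filesystems.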

Gap vs VMware/VADP:

  • Changed Block Tracking -- VMware + Veeam (VADP): native CBT in ESXi, 10-100x faster incrementals. OVE + Kasten K10: no CBT equivalent in CSI; K10 uses full snapshot + dedup at the export target, so incrementals are slower.
  • Transport modes -- VMware: HotAdd, SAN, NFS, NBD (choose the optimal path). OVE: CSI snapshot + export via K10 data mover pod (one mode).
  • Application consistency -- VMware: VMware Tools + VSS, mature, tested with 100+ apps. OVE: qemu-ga + Kanister blueprints, growing but smaller ecosystem.
  • Backup proxy scaling -- VMware: deploy Veeam proxies as needed, scale independently. OVE: K10 runs in-cluster, scales with cluster resources.
  • Granular restore -- VMware: file-level restore from image backup (Veeam FLR), instant VM recovery (boot from backup). OVE: restore full PVC or full VM; file-level restore requires mounting the PVC to a helper pod; instant VM recovery not available.
  • Ecosystem maturity -- VMware: 15+ years, battle-tested at 100,000+ VM scale. OVE: ~5 years, proven at 1,000s of VMs, rapidly maturing.

Velero

Velero is the open-source Kubernetes backup tool (originally Heptio Ark). It is a lighter-weight alternative to Kasten K10 and is often used as a baseline backup solution.

Velero architecture: a Velero server Deployment runs in the cluster, alongside an optional node-agent DaemonSet for filesystem-level copies (restic/Kopia). Backups are declared as Backup CRDs; Velero serializes the selected Kubernetes resources to an object-store bucket (S3 API) and triggers CSI VolumeSnapshots for PVC data.

Velero vs Kasten K10 for VMs:

  • Cost -- Velero: open source (free). K10: commercial license (Veeam).
  • KubeVirt VM support -- Velero: basic (backs up VM CRDs + PVC snapshots, no guest agent integration). K10: advanced (VM-aware hooks, qemu-ga integration, application-consistent snapshots).
  • Application awareness -- Velero: pre/post hooks (bash scripts in pod annotations). K10: Kanister blueprints (structured, reusable, parameterized).
  • UI -- Velero: CLI-only (kubectl plugin). K10: web dashboard with policy management, monitoring, RBAC.
  • DR orchestration -- Velero: manual restore to the DR cluster. K10: built-in DR policies with automated failover/failback.
  • Enterprise features -- Velero: none (no RBAC, no multi-tenancy, no audit logs). K10: RBAC, multi-cluster management, audit logs, compliance reports.

Recommendation: For a Tier-1 financial enterprise, Velero alone is insufficient. It lacks the enterprise features (RBAC, audit logging, compliance reporting, VM-aware hooks) required by FINMA and internal governance. Kasten K10 is the recommended Kubernetes-native backup solution. Velero can serve as a secondary backup mechanism for non-critical workloads or as a fallback.

Hyper-V Backup (Azure Local)

For Azure Local running Hyper-V VMs, backup integration uses the mature Windows / Hyper-V ecosystem: Veeam Backup & Replication in Hyper-V mode, VSS for application-consistent snapshots, Resilient Change Tracking (RCT, Windows Server 2016+) for changed-block incrementals, and optionally Azure Backup for cloud-integrated protection.

Key advantage of Hyper-V backup: The backup maturity gap between VMware and Hyper-V is small. Veeam for Hyper-V has CBT (via RCT), VSS integration, and application-consistent snapshots -- nearly matching the VMware VADP experience. If the organization migrates from VMware to Azure Local, the backup tooling transition is relatively straightforward.

Backup Storage Targets

All backup architectures require a target to store backup data. The standard options:

  • S3-compatible object store (S3 API over HTTP) -- primary backup target for Kasten and Velero. Scalable, cheap, durable. Dedup depends on the implementation (MinIO: no; Ceph RGW: no; AWS S3: no native dedup).
  • NFS share (NFSv3/v4) -- traditional backup target, easy to set up. No native dedup (relies on the backup software).
  • Deduplication appliance (NFS, S3, or proprietary) -- enterprise backup target with hardware-accelerated dedup: Dell DataDomain, HPE StoreOnce, ExaGrid. Inline or post-process dedup.
  • Veeam Backup Repository (proprietary) -- Veeam-specific target with built-in dedup and compression. Can run on Linux or Windows.
  • Tape (LTO via SAS) -- long-term archival, air-gapped copy for ransomware protection. No dedup (sequential, append-only).

The 3-2-1 Backup Rule

The 3-2-1 rule is a well-established backup best practice: maintain 3 copies of data, on 2 different media types, with 1 copy off-site. For a financial enterprise, this is the minimum. The modern extension is 3-2-1-1-0: add 1 air-gapped or immutable copy, and verify 0 errors through regular restore testing.

3-2-1-1-0 Backup Rule Implementation
========================================

Copy 1: Primary Data
+-------------------+
| Production Cluster|  Ceph RBD / S2D volumes (the live data)
| (Site A)          |  This is NOT a backup -- it is the primary copy
+-------------------+

Copy 2: On-Site Backup (different media)
+-------------------+
| Backup Target     |  S3-compatible object store or dedup appliance
| (Site A, separate |  At Site A, but on separate storage infrastructure
|  storage system)  |  Daily incremental, weekly full
+-------------------+

Copy 3: Off-Site Backup (different location)
+-------------------+
| DR Backup Target  |  S3 bucket at Site B, or replicated dedup appliance
| (Site B)          |  Protects against site-level disaster
+-------------------+

Copy 4 (1 in 3-2-1-1-0): Immutable / Air-Gapped Copy
+-------------------+
| Immutable Storage |  S3 Object Lock (WORM), tape library, or
| (Site B or Cloud) |  cloud storage with immutability policy
|                   |  Protects against ransomware (cannot be
|                   |  deleted or modified, even by admins)
+-------------------+

0 Errors: Regular Restore Testing
+-------------------+
| Restore Test      |  Monthly: restore random VMs to isolated network
| (Automated)       |  Verify boot, application function, data integrity
|                   |  Document results for FINMA auditors
+-------------------+
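The structural requirements above lend themselves to an automated policy check. A sketch with illustrative field names, flagging which of the 3-2-1-1 requirements a backup plan misses (the trailing "0 errors" is proven by restore testing, not by inspection):

```python
# Validate a backup plan against the 3-2-1-1 structural rules.
# Field names ("media", "offsite", "immutable") are illustrative.

def check_321110(copies: list) -> list:
    failures = []
    if len(copies) < 3:
        failures.append("need at least 3 copies")
    if len({c["media"] for c in copies}) < 2:
        failures.append("need at least 2 media types")
    if not any(c["offsite"] for c in copies):
        failures.append("need at least 1 off-site copy")
    if not any(c.get("immutable") for c in copies):
        failures.append("need at least 1 immutable/air-gapped copy")
    return failures   # empty list == plan passes the structural checks

plan = [
    {"media": "ceph-rbd", "offsite": False},                    # primary data
    {"media": "s3",       "offsite": False},                    # on-site backup
    {"media": "s3",       "offsite": True},                     # Site B copy
    {"media": "s3",       "offsite": True, "immutable": True},  # Object Lock
]
print(check_321110(plan))   # -> []
```

A check like this belongs in CI for the backup configuration repository: a plan change that drops the immutable copy should fail review before it fails an audit.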

Implementation per platform:

  • Copy 1 -- OVE + Kasten K10: Ceph RBD (3x replicated). Azure Local + Veeam: S2D (3-way mirror). Swisscom ESC: managed by Swisscom.
  • Copy 2 -- OVE: K10 export to S3 (Ceph RGW or MinIO at Site A). Azure Local: Veeam backup to a local repository. ESC: managed by Swisscom.
  • Copy 3 -- OVE: K10 export to S3 at Site B. Azure Local: Veeam backup copy job to the DR site. ESC: managed by Swisscom.
  • Copy 4 -- OVE: S3 Object Lock (immutable bucket). Azure Local: Veeam hardened repository or tape. ESC: managed by Swisscom.
  • Restore test -- OVE: K10 restore to a test namespace with automated validation. Azure Local: Veeam SureBackup (automated restore verification). ESC: contractual SLA with Swisscom.

RPO/RTO Targets for Financial Enterprises

Industry-standard RPO/RTO targets for a Swiss Tier-1 financial institution:

Workload Tier              | RPO                | RTO        | Example Systems
---------------------------+--------------------+------------+-----------------------------------------------
Tier 0 (Critical)          | 0 (zero data loss) | < 1 hour   | Core banking, payment processing, trading
Tier 1 (Business Critical) | < 15 minutes       | < 4 hours  | CRM, regulatory reporting, customer portal
Tier 2 (Important)         | < 1 hour           | < 8 hours  | Email, collaboration, internal apps
Tier 3 (Standard)          | < 24 hours         | < 24 hours | Development, test, non-critical internal tools

These RPO/RTO targets determine the backup and replication configuration: Tier 0 requires synchronous replication (stretched cluster), Tier 1 asynchronous replication with frequent snapshots, and Tiers 2-3 can be met with scheduled snapshot and backup jobs.
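One way to make the mapping explicit is a small selection function. The thresholds mirror the tier table above; the mechanism names are illustrative, not prescriptive.

```python
# Map RPO/RTO targets (in minutes) to protection mechanisms.
# Thresholds follow the tier table; labels are illustrative.

def protection_for(rpo_min: float, rto_min: float) -> list:
    plan = []
    if rpo_min == 0:
        plan.append("synchronous replication (stretched cluster)")
    elif rpo_min <= 15:
        plan.append("asynchronous replication, <= 15 min interval")
    elif rpo_min <= 60:
        plan.append("hourly snapshots shipped to DR site")
    else:
        plan.append("daily backup to off-site target")
    if rto_min <= 60:
        plan.append("orchestrated automatic failover")
    else:
        plan.append("documented manual failover runbook")
    return plan

print(protection_for(0, 60))     # Tier 0: sync replication + orchestrated failover
print(protection_for(15, 240))   # Tier 1: async replication + manual runbook
```

Encoding the mapping this way keeps tier assignments reviewable: a workload's tier is data, and the resulting protection design is derived rather than hand-picked per system.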


How the Candidates Handle This

Snapshot mechanism
  • VMware (current): VMDK snapshots (redo logs, COW). Mature but can cause performance issues when stacked (long chains).
  • OVE (ODF/Ceph): RBD snapshots (object-level COW). Instant, space-efficient, no chain penalty. CSI VolumeSnapshot API.
  • Azure Local (S2D): Hyper-V checkpoints (VHDX differencing disks). Production checkpoints with VSS. CSI VolumeSnapshot API.
  • Swisscom ESC: SAN-level snapshots (Dell PowerMax/PowerStore). Managed by Swisscom, opaque to the customer.

Application-consistent snapshots
  • VMware (current): VMware Tools + VSS (Windows), freeze/thaw scripts (Linux). Extremely mature.
  • OVE (ODF/Ceph): QEMU Guest Agent + qemu-ga fsfreeze. Works but less widely tested. Kanister blueprints for app-aware hooks.
  • Azure Local (S2D): Hyper-V Integration Services + VSS. Mature, comparable to VMware.
  • Swisscom ESC: managed by Swisscom.

Cloning
  • VMware (current): vSphere linked clones (COW, fast) or full clones. Mature, GUI-driven.
  • OVE (ODF/Ceph): RBD layered clones (COW from a protected snapshot). Instant, space-efficient. PVC dataSource API.
  • Azure Local (S2D): VHDX differencing disks or ReFS block clone. Functional but less automated via CSI.
  • Swisscom ESC: not available to the customer.

Synchronous replication (RPO=0)
  • VMware (current): vSAN stretched cluster; vSphere Replication is async only (sync via 3rd party such as RecoverPoint).
  • OVE (ODF/Ceph): ODF Metro-DR (stretched Ceph cluster, arbiter node, < 10 ms RTT). GA since ODF 4.12+.
  • Azure Local (S2D): Storage Replica synchronous mode (stretch cluster, < 5 ms RTT). Mature (Windows Server 2016+).
  • Swisscom ESC: Swisscom manages replication. SLA-defined RPO.

Asynchronous replication (RPO>0)
  • VMware (current): vSphere Replication (5 min to 24 hour RPO). SRM for orchestrated failover. Mature.
  • OVE (ODF/Ceph): ODF Regional-DR (RBD snapshot-based mirroring). RHACM orchestrates failover. GA since ODF 4.12+.
  • Azure Local (S2D): Storage Replica async mode (cluster-to-cluster). PowerShell/WAC orchestration. Mature.
  • Swisscom ESC: Swisscom manages. SLA-defined RPO.

DR orchestration
  • VMware (current): Site Recovery Manager (SRM). Automated recovery plans, non-disruptive DR testing, IP re-mapping. Industry benchmark.
  • OVE (ODF/Ceph): RHACM + ODF Multicluster Orchestrator. DRPlacementControl CRD. Younger but functional.
  • Azure Local (S2D): Windows Server Failover Cluster + Storage Replica. PowerShell scripts for orchestration. Manual or semi-automated.
  • Swisscom ESC: Swisscom runbook. SLA-defined RTO. Customer has limited visibility.

DR test capability
  • VMware (current): SRM test failover (bubble network, non-disruptive). Best-in-class.
  • OVE (ODF/Ceph): test failover via cloned PVCs on the DR site. Manual setup, not as automated as SRM.
  • Azure Local (S2D): Storage Replica test failover (mount snapshot). Functional.
  • Swisscom ESC: contractual DR test (annual).

Encryption at rest
  • VMware (current): vSAN encryption (per SPBM policy). vSphere VM Encryption (per VM). KMS integration (KMIP). Mature.
  • OVE (ODF/Ceph): Ceph OSD encryption (dm-crypt/LUKS). Vault/KMIP key management. Tang/Clevis for NBDE. Per-cluster, not per-VM.
  • Azure Local (S2D): BitLocker on CSV volumes. TPM integration. AD-based key escrow. Per-volume.
  • Swisscom ESC: assumed encrypted (Swisscom responsibility). No customer control over encryption.

Key management
  • VMware (current): external KMS (KMIP). HyTrust, Thales, etc.
  • OVE (ODF/Ceph): HashiCorp Vault, KMIP, Tang/Clevis (NBDE).
  • Azure Local (S2D): AD-based, Azure Key Vault. TPM for boot volumes.
  • Swisscom ESC: managed by Swisscom.

Backup API
  • VMware (current): VADP (vStorage APIs for Data Protection). CBT for fast incrementals. 15+ years, universal vendor support.
  • OVE (ODF/Ceph): CSI VolumeSnapshot. No CBT equivalent. Younger API, growing vendor support.
  • Azure Local (S2D): VSS + Hyper-V RCT (Resilient Change Tracking). CBT equivalent available. Mature.
  • Swisscom ESC: managed by Swisscom.

Primary backup tool
  • VMware (current): Veeam Backup & Replication (VADP). Best-in-class.
  • OVE (ODF/Ceph): Kasten K10 (Veeam Kasten). Kubernetes-native. Growing maturity.
  • Azure Local (S2D): Veeam Backup & Replication (Hyper-V). Mature, near-VADP feature parity.
  • Swisscom ESC: Swisscom managed backup.

CBT / incremental efficiency
  • VMware (current): native CBT in ESXi. 10-100x faster incrementals.
  • OVE (ODF/Ceph): no CBT in CSI. K10 uses full snapshot + dedup. Incrementals are full-snapshot-based.
  • Azure Local (S2D): Hyper-V RCT provides CBT. Veeam leverages RCT for fast incrementals.
  • Swisscom ESC: N/A (managed).

Granular restore
  • VMware (current): Veeam FLR (file-level restore from image). Instant VM Recovery (boot from backup). Application item recovery (Exchange, SQL, AD).
  • OVE (ODF/Ceph): K10 restores a full PVC or VM. File-level restore requires a helper pod. No Instant VM Recovery equivalent.
  • Azure Local (S2D): Veeam FLR, Instant VM Recovery (mount from backup). Application item recovery.
  • Swisscom ESC: request from Swisscom (ticket-based).

Immutable backup
  • VMware (current): Veeam hardened repository, S3 Object Lock, tape.
  • OVE (ODF/Ceph): S3 Object Lock on the export target. Immutable backup location profile in K10.
  • Azure Local (S2D): Veeam hardened repository, S3 Object Lock, tape.
  • Swisscom ESC: managed by Swisscom.

Backup maturity for VMs
  • VMware (current): 15+ years. Gold standard.
  • OVE (ODF/Ceph): ~4-5 years. Rapidly maturing. Largest gap vs VMware.
  • Azure Local (S2D): ~10 years (Hyper-V backup). Near-VMware maturity.
  • Swisscom ESC: mature (VMware-based backend).

Key Takeaways

  1. Snapshots are fast recovery, not backup. Ceph RBD snapshots are architecturally superior to VMware VMDK snapshots (no chain degradation, instant creation, space-efficient COW at the object level). However, snapshots exist on the same storage system as the primary data. For a financial enterprise, snapshots complement backups -- they never replace them. Implement automated snapshot schedules (CSI VolumeSnapshot + CronJob or K10 policy) with strict retention policies. Monitor snapshot space consumption across 5,000+ VMs to prevent capacity surprises.

  2. The backup maturity gap is real -- but the Hyper-V path is shorter. VMware VADP with Veeam is the industry gold standard for VM backup. Moving to OVE/Kasten K10 means accepting a younger backup ecosystem: no CBT (slower incrementals), less mature application-consistent snapshot integration, no Instant VM Recovery. Moving to Azure Local/Veeam preserves most of the backup maturity -- Hyper-V RCT provides CBT, and Veeam for Hyper-V is nearly feature-complete with Veeam for VMware. If backup maturity is a primary concern, Azure Local has an advantage over OVE in this specific dimension.

  3. DR is table stakes -- validate failover, not just replication. Both OVE (ODF Metro-DR/Regional-DR) and Azure Local (Storage Replica) provide synchronous and asynchronous replication. The technology works. The risk is in the orchestration: can the team execute a failover under pressure? DR testing must be a PoC requirement. Simulate a site failure, execute the failover procedure, measure actual RTO, and validate that all 5,000 VMs come up in the correct order with correct network configuration. SRM sets the benchmark for DR orchestration maturity -- neither ODF MCO nor WSFC fully matches it yet.

  4. Encryption at rest is non-negotiable -- enable it everywhere. With AES-NI hardware acceleration, encryption at rest costs less than 3% in performance. There is no valid reason to leave data unencrypted on physical media. For OVE, enable Ceph OSD encryption with Vault/KMIP key management from day one. For Azure Local, enable BitLocker on all CSV volumes and OS volumes. Ensure key management is centralized, auditable, and integrated with the organization's existing key management infrastructure.

  5. Key management is the hard part of encryption. Encryption algorithms are well-understood and hardware-accelerated. Key management is where most organizations stumble. Questions to answer before PoC: Where are encryption keys stored? How are they rotated? What happens if the key management system is unavailable during node boot? Who has access to recovery keys? Are key access events auditable? For OVE, HashiCorp Vault with KMIP backend is the recommended approach. For Azure Local, AD-based escrow with Azure Key Vault provides a mature solution.

  6. Plan for the CBT gap in OVE backup. The absence of Changed Block Tracking in the CSI specification is the most impactful backup gap for OVE. Without CBT, every incremental backup involves creating a full CSI snapshot and exporting the entire snapshot to the backup target. The backup software (K10) handles deduplication at the export target, but the data movement is still significantly higher than CBT-based incrementals. For 5,000 VMs, this means higher backup window durations, more backup network bandwidth, and more backup storage consumption. Mitigations: (a) stagger backup schedules across VMs, (b) use a high-throughput S3 target (Ceph RGW on dedicated nodes), (c) evaluate whether future CSI enhancements (CSI CBT proposal is in discussion) close this gap, (d) for database VMs with very large disks, consider application-level backup (pg_dump, RMAN) in addition to image-level backup.

  7. Swisscom ESC shifts the burden but limits control. With ESC, data protection is Swisscom's responsibility. This simplifies operations (no backup infrastructure to manage) but reduces visibility and control. FINMA still holds the institution accountable for data protection, even if outsourced. The institution must contractually ensure: documented RPO/RTO per workload tier, regular DR testing with customer observation, audit access to backup and replication logs, and the right to independent DR testing. Verify that Swisscom's SLA matches the institution's RPO/RTO requirements for all workload tiers.

  8. Immutable backups are mandatory for ransomware protection. All platforms must provide at least one immutable backup copy that cannot be deleted or modified by any administrator -- including a compromised admin account. For Kasten K10: use S3 Object Lock (Compliance mode, not Governance mode) on the export bucket. For Veeam: use hardened Linux repository with immutable flag. For Swisscom ESC: contractually require immutable backup retention. This is not optional for a financial enterprise -- ransomware is an existential threat, and mutable backups are not backups if an attacker can delete them.

  9. FINMA auditors will ask for evidence. Every data protection capability claimed in this document must be demonstrable with evidence: snapshot schedules in K10/Veeam policy exports, replication lag dashboards in Prometheus/Grafana, encryption configuration in Ceph/BitLocker status reports, DR test reports with measured RTO/RPO, backup success/failure logs, key management audit trails. Build the monitoring and reporting infrastructure alongside the data protection infrastructure -- not after go-live.

  10. Test restore, not just backup. The most dangerous false confidence is an untested backup. Schedule monthly restore tests: pick random VMs from the backup catalog, restore them to an isolated namespace/network, verify boot and application functionality, document the results. Kasten K10 supports policy-driven restore validation. Veeam SureBackup automates this for VMware and Hyper-V. For Swisscom ESC, contractually require quarterly restore testing with customer-witnessed verification.
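The CBT gap described in takeaways 2 and 6 can be made concrete with back-of-the-envelope arithmetic. The sketch below uses the fleet profile stated in this document (5,000 VMs, 100 GiB average disk, 10% daily change rate); the dedup/compression ratio for the full-snapshot export path is an illustrative assumption, not a measured K10 value -- actual reduction depends on workload compressibility.

```python
# Back-of-the-envelope comparison of nightly incremental data movement:
# CBT-based incrementals (VADP/RCT) vs full-snapshot export (CSI/K10).
# Fleet profile is from the text; DEDUP_RATIO is an illustrative assumption.

VMS = 5_000
AVG_DISK_GIB = 100
DAILY_CHANGE_RATE = 0.10   # 10% of each disk changes per day (stated input)
DEDUP_RATIO = 0.5          # ASSUMED: exporter moves 50% after dedup/compression

total_data_gib = VMS * AVG_DISK_GIB                    # primary data footprint
cbt_transfer_gib = total_data_gib * DAILY_CHANGE_RATE  # only changed blocks move
full_export_gib = total_data_gib * DEDUP_RATIO         # whole snapshot read each night

print(f"Primary data:          {total_data_gib / 1024:8.1f} TiB")
print(f"CBT incremental:       {cbt_transfer_gib / 1024:8.1f} TiB/night")
print(f"Full-snapshot export:  {full_export_gib / 1024:8.1f} TiB/night")
print(f"Extra movement factor: {full_export_gib / cbt_transfer_gib:.1f}x")
```

Even under a generous dedup assumption, the full-export path moves several times the data of a CBT incremental every night -- which is the quantitative basis for the mitigation list in takeaway 6.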


Discussion Guide

The following questions are designed for vendor deep-dives, PoC planning, and internal architecture reviews. They address the data protection gaps and requirements specific to a Tier-1 financial enterprise migrating from VMware.

Questions for OVE / ODF (Red Hat / Kasten)

  1. CBT gap mitigation: "CSI VolumeSnapshot does not provide Changed Block Tracking. For 5,000 VMs with an average 100 GiB disk and 10% daily change rate, quantify the impact on backup window duration, backup network bandwidth, and backup storage consumption compared to VMware VADP with CBT. What is your roadmap for CBT-like functionality in CSI or Kasten K10? Is the CSI CBT proposal (KEP-3314) on your radar, and when do you expect it to be available?"

  2. Application-consistent backup for KubeVirt VMs: "Demonstrate a Kasten K10 backup of a KubeVirt VM running Windows Server 2022 with SQL Server 2022. Show that: (a) the QEMU Guest Agent is invoked, (b) VSS is triggered inside the guest, (c) SQL Server VSS Writer participates, (d) the resulting backup is application-consistent (not just crash-consistent). Show the same for a Linux VM with PostgreSQL. What is the failure mode if qemu-ga is not installed or not responding?"

  3. Kasten K10 scale validation: "K10 will manage backup policies for 5,000+ VMs, each with 2-3 PVCs. That is 10,000-15,000 PVCs with daily snapshots and exports. What are K10's proven scale limits? How many concurrent snapshot and export operations can K10 handle? What is the expected load on the Kubernetes API server and the CSI driver during a nightly backup window? Show benchmark data from a comparable-scale deployment."

  4. DR failover end-to-end demonstration: "Execute a full Regional-DR failover scenario: (a) Primary site with 50 VMs running on ODF, async mirroring to DR site. (b) Simulate primary site failure. (c) Initiate failover via RHACM DRPlacementControl. (d) Measure time from failover initiation to all 50 VMs running on DR site. (e) Execute failback. (f) Verify data integrity post-failback. What was the measured RTO? What was the actual RPO (data loss)?"

  5. Encryption key management integration: "We use [Thales CipherTrust / Entrust KeyControl / specific KMIP provider] for enterprise key management. Demonstrate integration with Ceph OSD encryption via KMIP. Show the key lifecycle: provisioning, rotation, revocation, audit logging. What happens if the KMIP server is unreachable during OSD startup? Is there a local key cache or does the OSD refuse to start?"

Questions for Azure Local (Microsoft / Veeam)

  1. Veeam for Hyper-V feature parity: "Compare Veeam Backup & Replication for Hyper-V vs VMware feature-by-feature: CBT (RCT vs VADP CBT), transport modes, application-consistent snapshots (VSS), Instant VM Recovery, file-level restore, application item recovery (Exchange, SQL, AD, SharePoint). Identify any features available for VMware but not for Hyper-V. For any gaps, what is the timeline for parity?"

  2. Storage Replica RPO validation: "Configure Storage Replica in asynchronous mode between two Azure Local clusters with a 5-minute replication schedule. Under production load (simulated I/O representing 500 VMs), measure the actual replication lag. Does the 5-minute schedule hold, or does lag accumulate under heavy write load? What is the measured RPO under peak load?"

  3. BitLocker key management at scale: "For 5,000+ VMs across multiple Azure Local clusters, each with BitLocker-encrypted CSV volumes, describe the key management architecture. Where are recovery keys stored? How is key rotation performed without downtime? What happens if the Active Directory is unavailable during a node reboot? Is Azure Key Vault integration supported for on-premises Azure Local deployments?"

  4. DR testing without production impact: "Demonstrate a non-disruptive DR test using Storage Replica: mount a test copy of the replicated volume on the DR site, start VMs from the test copy in an isolated network, validate functionality, then tear down the test without affecting the replication relationship. Compare this to VMware SRM's test failover capability."
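Question 2 above (Storage Replica RPO under load) hinges on a simple queueing argument: if the sustained write rate exceeds the replication link's drain rate, the unreplicated backlog -- and therefore the effective RPO -- grows without bound. The toy model below illustrates this; `lag_after` is a hypothetical helper, and all rates are illustrative assumptions rather than Storage Replica measurements.

```python
# Toy model of async replication lag: new dirty data arrives at the write
# rate, the replication link drains the backlog at its own rate. If writes
# outpace the link, backlog (effective RPO) accumulates every minute.
# All rates are illustrative assumptions, not measured values.

def lag_after(minutes: int, write_mib_s: float, link_mib_s: float) -> float:
    """Return queued-but-unreplicated data (MiB) after `minutes` of load."""
    backlog = 0.0
    for _ in range(minutes):
        backlog += write_mib_s * 60               # dirty data produced this minute
        backlog -= min(backlog, link_mib_s * 60)  # data replicated this minute
    return backlog

# Link comfortably faster than writes: backlog stays at zero.
print(lag_after(60, write_mib_s=200, link_mib_s=400))  # -> 0.0
# Writes exceed link capacity: backlog grows linearly, RPO is unbounded.
print(lag_after(60, write_mib_s=500, link_mib_s=400))
```

This is why the PoC question asks for measured lag under peak load, not just confirmation that the 5-minute schedule is configured.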

Questions for Swisscom ESC

  1. Data protection SLA specifics: "For each workload tier (Tier 0-3), document the contractual RPO, RTO, backup frequency, backup retention, and DR test frequency. Are RPO/RTO targets guaranteed or best-effort? What are the financial penalties for SLA breaches? Can we participate in or observe DR tests?"

  2. Backup and encryption audit access: "FINMA requires us to maintain oversight of data protection, even when outsourced. Provide: (a) access to backup success/failure logs (daily), (b) replication lag dashboards (real-time), (c) encryption status reports (quarterly), (d) DR test reports with measured RTO/RPO (per test). What format are these provided in? Can they be integrated into our SIEM/monitoring infrastructure?"

Cross-Platform / Internal Architecture Questions

  1. RPO/RTO classification of all 5,000 VMs: "Before designing backup and DR policies, every VM must be classified into a workload tier (0-3) with explicit RPO/RTO targets approved by the business owner. This classification drives backup schedule, replication mode (sync/async), and DR site requirements. Who owns this classification process? What is the timeline to complete it for all 5,000+ VMs?"

  2. Backup network bandwidth planning: "For each candidate platform, calculate the backup network bandwidth required for the nightly backup window (assuming an 8-hour window). Inputs: 5,000 VMs, average 100 GiB per VM, 10% daily change rate. For VMware/Veeam: CBT means transferring ~50 TiB of changes. For OVE/K10: full snapshot export means transferring more (quantify). For Azure Local/Veeam: RCT means transferring ~50 TiB. Is the existing backup network sufficient, or does it need to be upgraded?"

  3. Immutable backup implementation: "Design the immutable backup architecture for each candidate: (a) S3 Object Lock configuration (Compliance mode, retention period per tier), (b) access control (no admin can disable Object Lock), (c) monitoring (alert if immutable backup creation fails), (d) restore procedure from immutable backup. Validate that a simulated ransomware scenario (attacker with admin credentials attempts to delete all backups) fails to destroy the immutable copies."

  4. Key management consolidation: "We currently use [existing KMS] for VMware vSAN encryption. When migrating to the new platform, should we: (a) extend the existing KMS to cover Ceph/BitLocker encryption, (b) deploy a new KMS (Vault, Azure Key Vault), or (c) use platform-native key management? Evaluate each option for: FINMA compliance, operational complexity, cost, and integration depth. Recommend a single key management architecture that covers all encryption use cases (storage encryption, backup encryption, TLS certificates)."
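For question 2 above (backup network bandwidth), the nightly data volume converts directly into a sustained bandwidth requirement over the 8-hour window. The ~50 TiB CBT figure follows from the stated inputs (5,000 VMs x 100 GiB x 10% daily change); the OVE/K10 full-export figure is an illustrative assumption, since full-snapshot exports move several times the changed-block volume.

```python
# Sustained network bandwidth needed to move a nightly backup volume
# through an 8-hour window. The 250 TiB OVE/K10 figure is an assumed
# illustration of full-snapshot export overhead, not a vendor number.

WINDOW_HOURS = 8

def gbit_per_s(tib_per_night: float) -> float:
    bits = tib_per_night * 1024**4 * 8           # TiB -> bits
    return bits / (WINDOW_HOURS * 3600) / 1e9    # sustained Gbit/s

for platform, tib in [("VMware/Veeam (CBT)", 50),
                      ("Azure Local/Veeam (RCT)", 50),
                      ("OVE/K10 (full export, assumed)", 250)]:
    print(f"{platform:32s} {gbit_per_s(tib):6.1f} Gbit/s")
```

A ~15 Gbit/s sustained requirement for the CBT platforms fits a 25 GbE backup network; the full-export figure would not, which is exactly the upgrade question the PoC must answer.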


Previous: 06-kubernetes-storage.md -- Kubernetes Storage Model (CSI, PV/PVC, StorageClasses)
Next: 08-advanced-topics.md -- Advanced Topics (Object Storage, Data Locality)