Advanced Storage Topics
Why This Matters
The previous seven pages built a complete picture of the storage stack: foundational concepts (01), the VMware baseline (02), protocols (03), architectures (04), SDS platforms (05), the Kubernetes storage model (06), and data protection (07). This final page covers two advanced topics that do not fit neatly into any single layer but have significant impact on architecture, operations, and compliance: Object Storage (S3-compatible) and Data Locality.
Object Storage is the glue that connects backup targets, container image registries, log aggregation, artifact repositories, and VM disk image repositories. Every platform in this evaluation depends on S3-compatible object storage for critical infrastructure services. Understanding the options -- Ceph RGW, MinIO, NooBaa, Azure Blob -- and their architectural differences is essential for designing backup targets, achieving FINMA-compliant immutable storage, and enabling multi-cloud tiering.
Data Locality is the hidden performance factor in HCI. When a VM's compute and storage reside on the same physical node, I/O bypasses the network entirely, reducing latency by 50-80%. When they do not -- after a live migration, after a node failure, or in a disaggregated architecture -- every I/O traverses the storage network. For an organization running 5,000+ VMs, data locality decisions affect aggregate performance, network bandwidth consumption, and tail-latency behavior. Understanding how each platform handles (or does not handle) locality is essential for capacity planning and SLA-driven workload placement.
Together, these two topics close the remaining gaps in the storage evaluation and ensure that the final platform decision accounts for operational infrastructure (object storage) and physical-layer performance characteristics (data locality).
Concepts
1. Object Storage (S3-compatible)
Object Storage Fundamentals
Object storage is a flat-namespace storage paradigm where data is organized into buckets containing objects. Unlike block storage (fixed-size blocks, no metadata, filesystem required) or file storage (hierarchical directory tree, POSIX semantics), object storage treats each piece of data as a self-contained unit with three components:
| Component | Description | Example |
|---|---|---|
| Key | A unique identifier (string) within a bucket. Can contain "/" characters to simulate directory hierarchy, but the underlying storage is flat. | backups/2026/04/vm-db01/full-20260428.tar |
| Data | The binary payload. Can be any size from 0 bytes to multiple terabytes. | A 50 GiB VM disk image, a 2 KiB JSON log entry |
| Metadata | Key-value pairs attached to the object. System metadata (content-type, content-length, ETag, last-modified) + user-defined metadata. | x-amz-meta-vm-name: db-prod-01, Content-Type: application/octet-stream |
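In code, the key/data/metadata triple maps directly onto PUT and HEAD calls. A minimal boto3 (Python) sketch, reusing the hypothetical endpoint and names from the examples above:

```python
import boto3

# Endpoint and bucket names reuse the examples above and are hypothetical.
s3 = boto3.client('s3', endpoint_url='https://s3.internal.bank.example.com')

# User-defined metadata is transmitted as x-amz-meta-* headers on PUT.
s3.put_object(
    Bucket='backup-prod-2026',
    Key='backups/2026/04/vm-db01/full-20260428.tar',
    Body=open('/backups/vm-db01/full-20260428.tar', 'rb'),
    Metadata={'vm-name': 'db-prod-01', 'backup-type': 'full'},
)

# HeadObject returns system + user metadata without downloading the data.
head = s3.head_object(Bucket='backup-prod-2026',
                      Key='backups/2026/04/vm-db01/full-20260428.tar')
print(head['Metadata'])                     # {'vm-name': 'db-prod-01', ...}
print(head['ETag'], head['ContentLength'])  # system metadata
```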
Object Storage vs Block vs File -- Structural Comparison
===========================================================
Block Storage (RBD, VHDX, LUN):
+---+---+---+---+---+---+---+---+---+---+
| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | <-- Fixed-size blocks (4 KiB)
+---+---+---+---+---+---+---+---+---+---+
No metadata per block. Client (filesystem) gives meaning.
Access: SCSI/NVMe commands (read block N, write block N)
Latency: < 1 ms (local), 1-5 ms (network)
File Storage (NFS, SMB, CephFS):
/
├── home/
│ ├── user1/
│ │ ├── doc.txt (inode, permissions, timestamps)
│ │ └── report.pdf (inode, permissions, timestamps)
│ └── user2/
│ └── data.csv (inode, permissions, timestamps)
└── var/
└── log/
└── syslog (inode, permissions, timestamps)
Hierarchical namespace. POSIX semantics (open/read/write/close).
Access: NFS/SMB protocol (RPC-based)
Latency: 1-10 ms (network)
Object Storage (S3, Ceph RGW, MinIO):
Bucket: "backup-prod-2026"
+--------------------------------------------------+
| Key: "vm-db01/full-20260428.tar" |
| Data: [50 GiB binary blob] |
| Metadata: |
| Content-Type: application/x-tar |
| x-amz-meta-vm-name: db-prod-01 |
| x-amz-meta-backup-type: full |
| ETag: "d41d8cd98f00b204e9800998ecf8427e" |
| Last-Modified: 2026-04-28T03:15:00Z |
+--------------------------------------------------+
| Key: "vm-db01/incr-20260429.tar" |
| Data: [2 GiB binary blob] |
| Metadata: ... |
+--------------------------------------------------+
| Key: "vm-web01/full-20260428.tar" |
| Data: [30 GiB binary blob] |
| Metadata: ... |
+--------------------------------------------------+
Flat namespace (no directories -- "/" in key is just a character).
Access: HTTP REST API (GET, PUT, DELETE, HEAD)
Latency: 10-100 ms (HTTP overhead, but scales horizontally)
Why object storage scales better than file storage: In a file system, every directory listing requires traversing the metadata tree. In object storage, the namespace is a flat, indexed map -- listing a page of objects costs roughly the same whether the bucket holds thousands or billions of objects, because the cost scales with the number of results returned, not with the total object count. This is why object stores can hold billions of objects in a single bucket while NFS mounts struggle with directories containing millions of files.
Why object storage is not suitable for VM boot disks: The HTTP REST API introduces 10-100x more latency than block protocols (SCSI, NVMe). Object storage is designed for large sequential reads/writes (backups, images, logs), not random 4 KiB IOPS. VM boot disks require block storage (RBD, VHDX) for acceptable performance.
The S3 API: The De Facto Standard
Amazon S3 (Simple Storage Service), launched in 2006, established the de facto API standard for object storage. Virtually every object storage implementation -- Ceph RGW, MinIO, Azure Blob (via S3-compatible layer), Google Cloud Storage (interoperability API), Wasabi, Backblaze B2 -- implements the S3 API to some degree. For this evaluation, "S3-compatible" means the implementation supports the core S3 API operations that backup and infrastructure tools require.
Core S3 API Operations:
| Operation | HTTP Method | Description | Used By |
|---|---|---|---|
| PutObject | PUT /{bucket}/{key} | Upload an object (up to 5 GiB in a single PUT) | Backup export, artifact upload |
| GetObject | GET /{bucket}/{key} | Download an object (supports range requests for partial reads) | Backup restore, image pull |
| DeleteObject | DELETE /{bucket}/{key} | Delete a single object | Retention policy cleanup |
| HeadObject | HEAD /{bucket}/{key} | Get object metadata without downloading data | Existence checks, metadata queries |
| ListObjectsV2 | GET /{bucket}?list-type=2 | List objects in a bucket (paginated, up to 1000 per page) | Backup catalog enumeration |
| CreateMultipartUpload | POST /{bucket}/{key}?uploads | Initiate a multipart upload for large objects | Large backup files (> 5 GiB) |
| UploadPart | PUT /{bucket}/{key}?partNumber=N&uploadId=X | Upload a part of a multipart upload (5 MiB - 5 GiB per part) | Parallel upload of large files |
| CompleteMultipartUpload | POST /{bucket}/{key}?uploadId=X | Finalize multipart upload, assemble parts into a single object | After all parts uploaded |
| CreateBucket | PUT /{bucket} | Create a new bucket | Initial setup, automation |
| DeleteBucket | DELETE /{bucket} | Delete an empty bucket | Cleanup |
| PutBucketPolicy | PUT /{bucket}?policy | Set access policy (JSON document) on a bucket | Security, multi-tenancy |
| PutObjectLockConfiguration | PUT /{bucket}?object-lock | Enable WORM (Write Once Read Many) on a bucket | Immutable backups, FINMA compliance |
| PutObjectRetention | PUT /{bucket}/{key}?retention | Set retention period on a specific object | Per-object immutability |
| PutObjectLegalHold | PUT /{bucket}/{key}?legal-hold | Place a legal hold on an object (prevents deletion until removed) | Regulatory holds, litigation |
Multipart upload is critical for large backup files. A 50 GiB VM disk image cannot be uploaded in a single PUT (5 GiB limit). Multipart upload splits the file into parts (e.g., 100 parts x 500 MiB), uploads them in parallel (maximizing network throughput), and assembles them server-side. If a part fails, only that part is retried -- not the entire 50 GiB upload. All major backup tools (Kasten K10, Velero, Veeam) use multipart upload automatically.
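In practice this is driven through an SDK rather than raw HTTP. A hedged boto3 sketch (endpoint and file paths are hypothetical); once the file exceeds the configured threshold, boto3 issues CreateMultipartUpload / UploadPart / CompleteMultipartUpload internally, uploads parts in parallel, and retries failed parts individually:

```python
import boto3
from boto3.s3.transfer import TransferConfig

s3 = boto3.client('s3', endpoint_url='https://rgw.odf.internal')

# Anything above 5 GiB MUST be multipart; here multipart kicks in above
# 64 MiB, with 512 MiB parts and 8 parts uploaded concurrently.
config = TransferConfig(
    multipart_threshold=64 * 1024 * 1024,
    multipart_chunksize=512 * 1024 * 1024,
    max_concurrency=8,
)

s3.upload_file(
    Filename='/backups/vm-db01/full-20260428.tar',  # hypothetical path
    Bucket='backup-prod',
    Key='vm-db01/full-20260428.tar',
    Config=config,
)
```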
S3 Authentication: AWS Signature v4 (SigV4):
Every S3 request is authenticated using HMAC-SHA256 signatures derived from an Access Key ID and Secret Access Key. The signature covers the HTTP method, URI, headers, and query parameters -- preventing request tampering. This is the same authentication mechanism used by AWS S3, and all S3-compatible implementations support it.
S3 Request Authentication Flow (SigV4)
==========================================
1. Client has credentials:
Access Key ID: AKIAIOSFODNN7EXAMPLE
Secret Access Key: wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
2. Client constructs canonical request:
PUT /backup-prod/vm-db01/full-20260428.tar HTTP/1.1
Host: s3.internal.bank.example.com
Content-Type: application/octet-stream
x-amz-date: 20260428T031500Z
x-amz-content-sha256: e3b0c44298fc1c14...
3. Client derives signing key:
DateKey = HMAC-SHA256("AWS4" + SecretKey, "20260428")
RegionKey = HMAC-SHA256(DateKey, "us-east-1")
ServiceKey = HMAC-SHA256(RegionKey, "s3")
SigningKey = HMAC-SHA256(ServiceKey, "aws4_request")
4. Client calculates signature:
StringToSign = "AWS4-HMAC-SHA256\n" + timestamp + "\n" +
scope + "\n" + hash(canonical_request)
Signature = HMAC-SHA256(SigningKey, StringToSign)
5. Client sends request with Authorization header:
Authorization: AWS4-HMAC-SHA256
Credential=AKIAIOSFODNN7EXAMPLE/20260428/us-east-1/s3/aws4_request,
SignedHeaders=content-type;host;x-amz-content-sha256;x-amz-date,
Signature=fe5f80f77d5fa3beca038a248ff027d0445342fe2855ddc963176630326f1024
6. Server verifies:
- Looks up SecretKey for the given AccessKeyID
- Reconstructs the signing key and signature
- Compares: if match, request is authenticated
- Also checks timestamp to prevent replay attacks (within 15 min)
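Steps 3-4 are plain HMAC-SHA256 chaining and can be reproduced with the Python standard library. A sketch using the example values from the flow above (the canonical-request hash is abbreviated to a placeholder):

```python
import hashlib
import hmac

def hmac_sha256(key: bytes, msg: str) -> bytes:
    return hmac.new(key, msg.encode('utf-8'), hashlib.sha256).digest()

secret = 'wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY'
date, region, service = '20260428', 'us-east-1', 's3'

# Step 3: derive the signing key (four chained HMACs)
date_key = hmac_sha256(('AWS4' + secret).encode('utf-8'), date)
region_key = hmac_sha256(date_key, region)
service_key = hmac_sha256(region_key, service)
signing_key = hmac_sha256(service_key, 'aws4_request')

# Step 4: sign the string-to-sign (canonical request hash abbreviated)
canonical_hash = hashlib.sha256(b'<canonical request>').hexdigest()
string_to_sign = ('AWS4-HMAC-SHA256\n'
                  '20260428T031500Z\n'
                  f'{date}/{region}/{service}/aws4_request\n'
                  f'{canonical_hash}')
signature = hmac.new(signing_key, string_to_sign.encode('utf-8'),
                     hashlib.sha256).hexdigest()
print(signature)  # value placed in the Authorization header
```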
Why Object Storage Matters for Platform Operations
Object storage is not just for application data. In a Kubernetes-based infrastructure platform, S3-compatible object storage is a foundational service consumed by multiple infrastructure components:
Object Storage as Platform Infrastructure
============================================
+-------------------------------------------------------------------+
| Platform Consumers of S3 |
| |
| +-----------------+ +-----------------+ +-----------------+ |
| | Backup System | | Container | | Log Aggregation | |
| | (Kasten K10, | | Image Registry | | (Loki, Elastic, | |
| | Velero, Veeam) | | (Quay, Harbor, | | Splunk) | |
| | | | ACR) | | | |
| | Exports backup | | Stores image | | Ships log chunks| |
| | data to S3 | | layers in S3 | | to S3 for long- | |
| | (largest volume)| | (100s of GiB) | | term retention | |
| +-----------------+ +-----------------+ +-----------------+ |
| | | | |
| v v v |
| +-----------------------------------------------------------+ |
| | S3-Compatible Object Store | |
| | (Ceph RGW / MinIO / NooBaa / Azure Blob) | |
| +-----------------------------------------------------------+ |
| ^ ^ ^ |
| | | | |
| +-----------------+ +-----------------+ +-----------------+ |
| | Artifact Store | | VM Disk Image | | Monitoring / | |
| | (Terraform | | Repository | | Metrics Archive | |
| | state, Ansible | | (qcow2, VMDK | | (Thanos, Cortex | |
| | artifacts, | | gold images | | long-term | |
| | Helm charts) | | for templating)| | Prometheus) | |
| +-----------------+ +-----------------+ +-----------------+ |
+-------------------------------------------------------------------+
Specific use cases by volume and criticality:
| Use Case | Typical Volume | Criticality | S3 Features Required |
|---|---|---|---|
| Backup targets (Kasten, Velero, Veeam) | 100+ TiB (largest consumer by far) | Critical -- data protection depends on it | Multipart upload, Object Lock (WORM), versioning, lifecycle policies |
| Container image registry (Quay, Harbor) | 500 GiB - 5 TiB | High -- cluster cannot pull images without it | GET/PUT, moderate IOPS, high availability |
| Log aggregation (Loki, Elasticsearch) | 10-50 TiB (grows continuously) | Medium -- operational visibility | Lifecycle policies (auto-delete after 90 days), cheap capacity |
| Terraform/Ansible state | < 100 GiB | Critical -- infrastructure-as-code depends on it | Versioning, locking (DynamoDB-style or external) |
| VM disk image repository | 1-10 TiB | Medium -- used during provisioning | Large object support, range requests |
| Monitoring archive (Thanos) | 5-20 TiB | Medium -- historical metrics | Lifecycle policies, cheap capacity |
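Several of these requirements map onto single S3 API calls. For example, the 90-day log expiry in the table can be expressed as a lifecycle rule; a boto3 sketch with a hypothetical bucket name, valid against any S3-compatible store that implements the lifecycle API (Ceph RGW, MinIO):

```python
import boto3

s3 = boto3.client('s3', endpoint_url='https://rgw.odf.internal')

s3.put_bucket_lifecycle_configuration(
    Bucket='loki-chunks',   # hypothetical log bucket
    LifecycleConfiguration={
        'Rules': [{
            'ID': 'expire-logs-after-90-days',
            'Status': 'Enabled',
            'Filter': {'Prefix': ''},      # applies to all objects
            'Expiration': {'Days': 90},    # auto-delete after 90 days
        }]
    },
)
```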
Ceph RGW (RADOS Gateway): Ceph's S3-Compatible Object Gateway
For OVE environments running ODF (OpenShift Data Foundation), Ceph RGW is the native S3-compatible object storage service. RGW runs as a process (daemon) that translates S3 API calls into RADOS operations, storing objects directly in the Ceph cluster alongside RBD block volumes and CephFS file data.
Architecture:
Ceph RGW Architecture
========================
S3 Clients
(Kasten K10, Velero, Quay, Loki, Terraform)
|
| HTTPS (port 443)
v
+-------------------+
| Load Balancer / |
| Ingress (HAProxy, | <-- HA entry point
| OCP Route, MetalLB| (distributes across RGW pods)
+--------+----------+
|
+------------+------------+
| | |
v v v
+-----------+ +-----------+ +-----------+
| RGW Pod 0 | | RGW Pod 1 | | RGW Pod 2 |
| | | | | |
| Beast | | Beast | | Beast | <-- HTTP frontend
| (async | | (async | | (async | (replaced Civetweb
| HTTP | | HTTP | | HTTP | in recent versions)
| server) | | server) | | server) |
| | | | | |
| RGW | | RGW | | RGW | <-- S3 API logic:
| Engine | | Engine | | Engine | bucket ops, auth,
| | | | | | multipart, ACLs,
| | | | | | object lock
| | | | | |
| librados | | librados | | librados | <-- Direct RADOS access
+-----------+ +-----------+ +-----------+ (no OSD intermediary)
| | |
+------------+------------+
|
v
+--------------------------------------------------+
| RADOS Cluster (Ceph OSDs) |
| |
| Pool: ".rgw.root" (RGW realm config) |
| Pool: "default.rgw.log" (intent log, gc) |
| Pool: "default.rgw.meta" (bucket metadata, |
| user info, ACLs) |
| Pool: "default.rgw.buckets.index" |
| (bucket index -- object listing) |
| Pool: "default.rgw.buckets.data" |
| (actual object data -- largest pool) |
| |
| Data placement: CRUSH rules determine which |
| OSDs store each RADOS object (3x replication |
| or erasure coding, configurable per pool) |
+--------------------------------------------------+
RGW internal data path -- what happens when Kasten K10 uploads a backup:
1. K10 sends PUT /backup-bucket/vm-db01/full-20260428.tar as a multipart upload.
2. The load balancer routes the request to an available RGW pod.
3. RGW authenticates the request (SigV4, checks the access key against the user DB in the default.rgw.meta pool).
4. For each multipart part, RGW splits the data into RADOS objects (default 4 MiB stripe size).
5. RGW writes each RADOS object to the default.rgw.buckets.data pool via librados.
6. RADOS (the OSD layer) applies the pool's replication or erasure coding rules (e.g., 3x replication, or 4+2 EC).
7. On CompleteMultipartUpload, RGW assembles the manifest in the bucket index (default.rgw.buckets.index pool).
8. The object is now retrievable, replicated, and durable.
Bucket index sharding: For buckets with millions of objects (common for backup buckets with long retention), the bucket index can become a bottleneck. A single RADOS object holding the index for 10 million objects is too large to scan efficiently. RGW supports bucket index sharding -- splitting the index across multiple RADOS objects:
Bucket Index Sharding
========================
Without sharding (default for small buckets):
Bucket: "backup-prod"
+---------------------------------------------+
| Single index object (one RADOS object) |
| Contains: list of all object keys + metadata |
| Problem: with 10M objects, listing is slow |
| and index RADOS object is huge |
+---------------------------------------------+
With sharding (rgw_override_bucket_index_max_shards / dynamic resharding):
Bucket: "backup-prod" (index_shard_count = 32)
+----------+ +----------+ +----------+ +----------+
| Shard 0 | | Shard 1 | | Shard 2 | ... | Shard 31 |
| keys A-B | | keys C-D | | keys E-F | | keys Y-Z |
+----------+ +----------+ +----------+ +----------+
Each shard is a separate RADOS object.
LIST operation parallelizes across shards.
Recommended: enable dynamic resharding (default in recent Ceph)
Ceph auto-reshards when a shard exceeds a threshold (rgw_max_objs_per_shard, default ~100K entries)
ODF default: dynamic resharding enabled.
Monitor: ceph dashboard or "radosgw-admin bucket stats"
RGW performance characteristics:
| Metric | Typical Value (NVMe-backed Ceph, 3 RGW pods) | Notes |
|---|---|---|
| Small object PUT (4 KiB) | 5,000-15,000 ops/sec | Limited by per-object RADOS overhead |
| Large object PUT (100 MiB) | 2-5 GiB/s aggregate throughput | Limited by network bandwidth and OSD write throughput |
| Small object GET (4 KiB) | 10,000-30,000 ops/sec | Reads are faster (no replication write path) |
| Large object GET (100 MiB) | 3-8 GiB/s aggregate throughput | Reads scale with OSD count and network |
| LIST (1000 objects/page) | 50-200 ms per page | Depends on bucket index shard count |
| Multipart upload (10 GiB file, 100 MiB parts) | 60-120 seconds | Parallelism of parts improves throughput |
RGW resource requirements (ODF defaults):
- 3 RGW pods (can scale to more for higher throughput)
- Per pod: 2-4 CPU cores, 4-8 GiB RAM
- RGW data shares the RADOS cluster with RBD -- capacity planning must account for both
- Dedicated pools for RGW data allow independent erasure coding policies (e.g., RBD uses 3x replication, RGW backup data uses 4+2 EC for better space efficiency)
Ceph RGW Multi-Site Replication for DR
For a financial institution requiring geo-redundant object storage (backup data must exist at two sites), Ceph RGW supports multi-site replication -- asynchronous replication of buckets between two or more RGW deployments at different sites.
Ceph RGW Multi-Site Architecture
====================================
Site A (Primary) Site B (Secondary / DR)
Zurich DC Bern DC
+---------------------------+ +---------------------------+
| Zonegroup: "ch-prod" | | Zonegroup: "ch-prod" |
| | | |
| Zone: "zurich" | | Zone: "bern" |
| | | |
| +------- RGW Pods ------+ | | +------- RGW Pods ------+ |
| | RGW-0 RGW-1 RGW-2 | | | | RGW-0 RGW-1 RGW-2 | |
| +-----------+-----------+ | | +-----------+-----------+ |
| | | | | |
| +-----------v-----------+ | | +-----------v-----------+ |
| | RADOS Cluster | | | | RADOS Cluster | |
| | (local Ceph OSDs) | | | | (local Ceph OSDs) | |
| +-----------------------+ | | +-----------------------+ |
+-------------|-------------+ +-------------|-------------+
| |
| Async Replication (data log) |
+<========================================>+
| |
| RGW sync agents read data log from |
| remote zone and replicate objects |
| Replication lag: seconds to minutes |
| (depends on WAN bandwidth + write rate) |
| |
| Direction: active-active or active- |
| passive (configurable per bucket) |
Realm: "bank-prod"
Zonegroup: "ch-prod"
Zone: "zurich" (Site A)
Zone: "bern" (Site B)
Concepts:
Realm: Top-level container. All zones in a realm share the
same namespace (buckets and users).
Zonegroup: A group of zones that replicate data between each other.
Equivalent to an AWS S3 region.
Zone: A single Ceph cluster's RGW deployment. Each zone has
its own RADOS cluster and RGW instances.
Multi-site replication behavior:
1. Client writes object to Zone "zurich" (Site A)
2. RGW records the write in the data log (a per-shard log)
3. Sync agent in Zone "bern" (Site B) reads the data log
4. Sync agent fetches the object from Zone "zurich" via RGW API
5. Sync agent writes the object to local RADOS (Site B)
6. Object is now available in both zones
RPO = replication lag (typically seconds to low minutes)
If Site A fails, Site B has all data up to the last synced entry
Active-Active:
Both zones accept writes. Conflict resolution: last-write-wins
(based on timestamp). Suitable when both sites have active workloads.
Active-Passive:
All writes go to the primary zone. Secondary zone is read-only.
Simpler conflict model. Suitable for backup-only DR sites.
Why multi-site matters for backup:
Kasten K10 exports backup data to an S3 bucket. If that bucket is on a local Ceph RGW in Zurich, and Zurich suffers a site failure, the backups are lost -- together with the production data. With RGW multi-site, the backup bucket is replicated to Bern. After a site failure, K10 can restore from the Bern copy. This is the object storage equivalent of off-site backup (the "1" in 3-2-1).
RGW IAM, Bucket Policies, and Quotas:
RGW supports a subset of the AWS IAM model:
- Users and access keys: Each S3 client (Kasten, Quay, Loki) gets its own RGW user with a unique Access Key ID and Secret Key, created via radosgw-admin user create.
- Bucket policies: JSON documents (same syntax as AWS S3 bucket policies) that control access at the bucket level. Example: restrict the backup bucket to only the Kasten service account, and deny all DELETE operations from any user except the retention-policy cleanup service.
- Quotas: Per-user and per-bucket quotas (max objects, max bytes). Prevents a single consumer from exhausting the object storage pool.
Example: RGW Bucket Policy for Immutable Backup
===================================================
{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "AllowKastenWriteOnly",
"Effect": "Allow",
"Principal": {"AWS": ["arn:aws:iam:::user/kasten-backup"]},
"Action": ["s3:PutObject", "s3:GetObject", "s3:ListBucket"],
"Resource": [
"arn:aws:s3:::backup-immutable",
"arn:aws:s3:::backup-immutable/*"
]
},
{
"Sid": "DenyDeleteForEveryone",
"Effect": "Deny",
"Principal": "*",
"Action": ["s3:DeleteObject", "s3:DeleteBucket"],
"Resource": [
"arn:aws:s3:::backup-immutable",
"arn:aws:s3:::backup-immutable/*"
]
}
]
}
Note: This policy alone is NOT sufficient for true immutability.
An admin with radosgw-admin access can bypass bucket policies.
For FINMA-grade immutability, use S3 Object Lock (see below).
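Applying such a policy is a single S3 API call. A boto3 sketch (endpoint hypothetical; the JSON is abbreviated here to the deny statement from the example above):

```python
import json
import boto3

s3 = boto3.client('s3', endpoint_url='https://rgw.odf.internal')

policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Sid": "DenyDeleteForEveryone",
        "Effect": "Deny",
        "Principal": "*",
        "Action": ["s3:DeleteObject", "s3:DeleteBucket"],
        "Resource": [
            "arn:aws:s3:::backup-immutable",
            "arn:aws:s3:::backup-immutable/*",
        ],
    }],
}

# The policy document is submitted as a JSON string.
s3.put_bucket_policy(Bucket='backup-immutable', Policy=json.dumps(policy))
```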
MinIO: High-Performance S3-Compatible Server
MinIO is an alternative S3-compatible object store optimized for high performance on modern hardware (NVMe, 100 GbE). Unlike Ceph RGW (which layers S3 on top of a general-purpose distributed object store), MinIO is purpose-built for S3 workloads.
Architecture:
MinIO Distributed Mode Architecture
=======================================
S3 Clients
|
| HTTPS
v
+-------------------+
| Load Balancer |
+--------+----------+
|
+------------+------------+
| | |
v v v
+-----------+ +-----------+ +-----------+
| MinIO | | MinIO | | MinIO |
| Node 1 | | Node 2 | | Node 3 |
| | | | | |
| MinIO | | MinIO | | MinIO | <-- Single Go binary
| Server | | Server | | Server | per node
| Process | | Process | | Process |
| | | | | |
| +-------+ | | +-------+ | | +-------+ |
| |Drive 0| | | |Drive 0| | | |Drive 0| |
| |Drive 1| | | |Drive 1| | | |Drive 1| |
| |Drive 2| | | |Drive 2| | | |Drive 2| | <-- Direct disk
| |Drive 3| | | |Drive 3| | | |Drive 3| | access (XFS)
| +-------+ | | +-------+ | | +-------+ | per drive
+-----------+ +-----------+ +-----------+
+--- MinIO Node 4 (same structure) ---+
Erasure Coding (per-object):
With 16 drives across 4 nodes (4 drives/node, one erasure set):
Default EC: 12 data shards + 4 parity shards (EC:4)
Can tolerate loss of 4 drives (or 1 full node)
Storage overhead: 1.33x (vs 3x for replication)
Object: "vm-db01/full-20260428.tar" (50 GiB)
+----+----+----+-----+-----+----+----+----+----+
| D0 | D1 | D2 | ... | D11 | P0 | P1 | P2 | P3 |
+----+----+----+-----+-----+----+----+----+----+
 N1-0 N1-1 N1-2  ...  N3-3  N4-0 N4-1 N4-2 N4-3
Shards spread 4 per node, one per drive (16 drives total)
Each shard ~= 50 GiB / 12 ~= 4.2 GiB
Total storage: 50 GiB * 4/3 ~= 67 GiB (vs 150 GiB for 3x replication)
MinIO vs Ceph RGW -- when to use which:
| Aspect | Ceph RGW | MinIO |
|---|---|---|
| Best for | Environments already running Ceph (ODF/OVE). Unified cluster for block + file + object. | Dedicated object storage with highest performance. Environments without existing Ceph. |
| Architecture | S3 gateway on top of general-purpose RADOS. Shares cluster with RBD/CephFS. | Purpose-built S3 server. Dedicated hardware. |
| Performance (large objects) | Good (limited by RADOS object size, replication overhead) | Excellent (direct-to-disk writes, minimal overhead) |
| Performance (small objects) | Moderate (RADOS object creation overhead per object) | Good (optimized metadata handling) |
| Storage efficiency | Supports EC per pool (e.g., 4+2 for backup data) | Supports EC per erasure set (e.g., 8+4) |
| Multi-protocol | Block (RBD) + File (CephFS) + Object (RGW) from same cluster | Object only (S3 API) |
| Operational complexity | Managed by ODF operator (automated). Shares Ceph operations. | Separate deployment, separate operations. MinIO Operator for Kubernetes. |
| Object Lock / WORM | Supported (since Nautilus / Ceph 14) | Supported (since RELEASE.2019-09-13) |
| License | Ceph: LGPL. ODF: Red Hat subscription. | MinIO: AGPLv3 (server). Commercial license available. |
Recommendation for OVE environments: Use Ceph RGW as the primary object store. It is integrated into ODF, managed by the same operator, shares the same RADOS cluster, and avoids deploying a separate storage system. Deploy MinIO only if specific performance benchmarks show RGW is insufficient for the workload (unlikely for backup/registry/log use cases at 5,000 VM scale). For Azure Local environments without Ceph, MinIO deployed on Kubernetes is a viable S3-compatible option.
Azure Blob Storage / Azure Local Storage Accounts
For Azure Local environments, the native object storage option is Azure Blob Storage accessed via Azure Local storage accounts or Azure cloud storage accounts.
Azure Local (version 23H2+) supports local storage accounts that provide Azure Blob-compatible APIs running on-premises. These are backed by S2D storage and present a subset of the Azure Blob Storage API. However, as of early 2026, on-premises Azure Local storage accounts are primarily designed for:
- Azure IoT / edge computing scenarios
- Tiering data to Azure cloud
- Local caching for Azure Blob workloads
For general-purpose S3-compatible object storage on Azure Local (backup targets, registry, logs), the options are:
- MinIO on Kubernetes (on Azure Local): Deploy MinIO as a Kubernetes workload on the Azure Local AKS-HCI cluster. MinIO runs on S2D-backed PVCs. Provides full S3 API including Object Lock.
- Azure Blob Storage (cloud): Use Azure cloud storage accounts with Blob Storage. Requires WAN connectivity. Latency is higher (cloud round-trip), but storage is effectively infinite and fully managed.
- Azure Blob + S3-compatible gateway: Azure Blob does not natively speak S3, but tools like MinIO Gateway (deprecated) or Azure Blob Storage's S3-compatible endpoint (preview) can bridge the gap. Alternatively, many backup tools (Veeam, Kasten) support Azure Blob natively alongside S3.
Key difference from Ceph RGW: Azure Local does not have a native, integrated, on-premises S3-compatible object store equivalent to Ceph RGW. This means backup targets, registry backends, and log sinks require either an additional MinIO deployment or cloud connectivity to Azure Blob. This is an architectural gap relative to OVE/ODF, where Ceph RGW is built into the platform.
NooBaa (Multi-Cloud Gateway): ODF's S3 Abstraction Layer
NooBaa is the Multi-Cloud Gateway (MCG) component of ODF. It provides an S3-compatible endpoint that abstracts over multiple backing stores -- local Ceph RGW, remote S3 buckets, Azure Blob, Google Cloud Storage -- presenting them as a unified S3 namespace with policy-driven data placement.
NooBaa Multi-Cloud Gateway Architecture
==========================================
S3 Clients
(Kasten K10, Quay, Loki, Terraform)
|
| HTTPS (S3 API)
v
+---------------------+
| NooBaa S3 |
| Endpoint Pods | <-- Stateless S3 frontend
| (2-3 replicas) | (Deployed by ODF operator)
+----------+----------+
|
+----------v----------+
| NooBaa Core | <-- Policy engine, data placement
| (Operator + | decisions, lifecycle management
| Core Pod) |
+----------+----------+
|
+---------------+----------------+
| | |
v v v
+-----------+ +-----------+ +----------------+
| Backing | | Backing | | Backing |
| Store 1 | | Store 2 | | Store 3 |
| | | | | |
| Local | | Remote | | Azure Blob |
| Ceph RGW | | AWS S3 | | (Cloud) |
| (on-prem, | | (off-site | | (archival |
| fast, | | DR copy) | | tier, cold |
| primary) | | | | storage) |
+-----------+ +-----------+ +----------------+
Data Placement Policies (BucketClass CRD):
Policy: "backup-tiered"
+---------------------------------------------------------+
| Tier 1 (hot): Local Ceph RGW |
| - First copy of all backup data |
| - Fast restore (local network, NVMe-backed) |
| - Retention: 30 days |
| |
| Tier 2 (warm): Remote AWS S3 (cross-region DR) |
| - Asynchronous copy from Tier 1 |
| - Protects against site failure |
| - Retention: 90 days |
| |
| Tier 3 (cold): Azure Blob Archive |
| - Objects older than 90 days auto-tiered |
| - Lowest cost per GiB |
| - Access: retrieval takes hours (archive tier) |
| - Retention: 7 years (FINMA regulatory requirement) |
+---------------------------------------------------------+
Policy: "backup-mirror"
+---------------------------------------------------------+
| Mirror mode: write to BOTH local Ceph RGW AND remote |
| S3 simultaneously |
| Every object exists in at least 2 locations |
| No tiering delay -- instant DR copy |
| Higher write latency (waits for slowest backend) |
| Higher cost (2x storage) |
+---------------------------------------------------------+
NooBaa BucketClass and ObjectBucketClaim (OBC):
NooBaa integrates with Kubernetes through two CRDs:
- BucketClass: Defines data placement policy (which backing stores, what tiering/mirroring rules).
- ObjectBucketClaim (OBC): A Kubernetes resource that requests a bucket (similar to PVC for block storage). The ODF operator provisions the bucket on NooBaa according to the BucketClass, and creates a Secret containing the S3 endpoint, access key, and secret key. Applications consume the Secret to access the bucket.
# Example: ObjectBucketClaim for Kasten K10 backup target
apiVersion: objectbucket.io/v1alpha1
kind: ObjectBucketClaim
metadata:
name: kasten-backup-bucket
namespace: kasten-io
spec:
bucketName: kasten-backup-prod
storageClassName: openshift-storage.noobaa.io
additionalConfig:
bucketclass: backup-tiered # References the BucketClass above
# ODF operator creates:
# 1. The bucket "kasten-backup-prod" on NooBaa
# 2. A Secret "kasten-backup-bucket" with:
# AWS_ACCESS_KEY_ID
# AWS_SECRET_ACCESS_KEY
# 3. A ConfigMap "kasten-backup-bucket" with:
# BUCKET_HOST (NooBaa S3 endpoint)
# BUCKET_PORT
# BUCKET_NAME
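On the consuming side, the application typically mounts the OBC-created Secret and ConfigMap as environment variables (e.g., via envFrom). A sketch of what the client code looks like under that assumption:

```python
import os
import boto3

# Injected from the OBC's ConfigMap and Secret (names as in the example).
endpoint = (f"https://{os.environ['BUCKET_HOST']}"
            f":{os.environ.get('BUCKET_PORT', '443')}")
bucket = os.environ['BUCKET_NAME']

s3 = boto3.client(
    's3',
    endpoint_url=endpoint,
    aws_access_key_id=os.environ['AWS_ACCESS_KEY_ID'],
    aws_secret_access_key=os.environ['AWS_SECRET_ACCESS_KEY'],
)

# Sanity check: enumerate the bucket contents.
for obj in s3.list_objects_v2(Bucket=bucket).get('Contents', []):
    print(obj['Key'], obj['Size'])
```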
NooBaa vs direct Ceph RGW:
| Aspect | Direct Ceph RGW | NooBaa MCG |
|---|---|---|
| Performance | Higher (direct RADOS access, no abstraction layer) | Lower (additional hop through NooBaa endpoint to backing store) |
| Multi-cloud | Single Ceph cluster only | Multiple backends (Ceph + cloud S3 + Azure Blob) |
| Tiering | Manual (lifecycle policies on Ceph pools) | Automated (BucketClass policies, transparent tiering) |
| Kubernetes integration | Manual (radosgw-admin user create, manual secrets) | Native (OBC CRD, automated provisioning, auto-created secrets) |
| Use case | Primary data (performance-critical, large volume) | Multi-cloud abstraction, tiered backup, DR copies to cloud |
Recommendation: For OVE environments, use direct Ceph RGW for performance-critical, high-volume object storage (primary backup target, container registry). Use NooBaa MCG for multi-cloud tiering scenarios (replicate backups to off-site S3 or cloud archive for FINMA long-term retention requirements). The two are complementary, not competing.
Compliance: WORM, Object Lock, and Legal Hold
For a Tier-1 financial institution under FINMA supervision, the ability to create immutable backups that cannot be deleted or modified -- even by administrators with full system access -- is a critical requirement. This protects against:
- Ransomware: An attacker who gains admin credentials attempts to delete all backups before encrypting production data. Immutable backups survive this attack.
- Insider threats: A malicious or compromised administrator attempts to destroy evidence or tamper with audit logs stored in object storage.
- Accidental deletion: Operational error (misconfigured lifecycle policy, wrong bucket deleted) destroys critical backup data.
- Regulatory retention: FINMA and banking regulations require certain data (audit logs, transaction records, compliance artifacts) to be retained for defined periods (typically 7-10 years) with proof of integrity.
S3 Object Lock provides WORM (Write Once Read Many) semantics at the object level:
S3 Object Lock Architecture
===============================
Bucket: "backup-immutable"
Object Lock: ENABLED (set at bucket creation, cannot be disabled)
Default Retention: COMPLIANCE mode, 365 days
Object: "vm-db01/full-20260428.tar"
+--------------------------------------------------+
| Data: [50 GiB backup file] |
| |
| Retention: |
| Mode: COMPLIANCE |
| Retain Until: 2027-04-28T00:00:00Z |
| |
| Legal Hold: OFF |
| |
| What can happen before 2027-04-28: |
| READ: Allowed (normal GET) |
| DELETE: DENIED (403 AccessDenied) |
| MODIFY: DENIED (cannot overwrite) |
| EXTEND: Allowed (can increase retention) |
| SHORTEN: DENIED in COMPLIANCE mode |
| |
| What happens after 2027-04-28: |
| Normal object -- can be read, deleted, or |
| overwritten (unless new retention is set) |
+--------------------------------------------------+
Two Retention Modes:
COMPLIANCE Mode (recommended for FINMA):
+-------------------------------------------------------+
| - NO ONE can delete or modify before retention expires |
| - Not even the root/admin account |
| - Not even the S3 service administrator |
| - Retention period CANNOT be shortened |
| - Retention period CAN be extended |
| - Bucket with COMPLIANCE objects CANNOT be deleted |
| (must wait for all retentions to expire) |
+-------------------------------------------------------+
GOVERNANCE Mode (less restrictive):
+-------------------------------------------------------+
| - MOST users cannot delete or modify |
| - Users with s3:BypassGovernanceRetention permission |
| CAN delete/modify before retention expires |
| - Useful for development/testing, NOT for regulatory |
| compliance (admin can bypass) |
+-------------------------------------------------------+
Legal Hold (independent of retention):
+-------------------------------------------------------+
| - Separate flag: ON or OFF per object |
| - When ON: object cannot be deleted regardless of |
| retention period (even if retention has expired) |
| - Used for litigation holds: "preserve all data |
| related to case X" |
| - Can be set/removed by users with |
| s3:PutObjectLegalHold permission |
| - Does not have an expiry -- must be explicitly |
| removed by authorized user |
+-------------------------------------------------------+
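Both retention and legal hold are plain S3 API calls against an Object Lock-enabled bucket. A boto3 sketch, reusing the hypothetical bucket and object from the diagram:

```python
from datetime import datetime, timezone
import boto3

s3 = boto3.client('s3', endpoint_url='https://rgw.odf.internal')

# COMPLIANCE-mode retention: no one can delete before RetainUntilDate.
s3.put_object_retention(
    Bucket='backup-immutable',
    Key='vm-db01/full-20260428.tar',
    Retention={
        'Mode': 'COMPLIANCE',
        'RetainUntilDate': datetime(2027, 4, 28, tzinfo=timezone.utc),
    },
)

# Independent legal hold: blocks deletion even after retention expires,
# until explicitly removed by an authorized user.
s3.put_object_legal_hold(
    Bucket='backup-immutable',
    Key='vm-db01/full-20260428.tar',
    LegalHold={'Status': 'ON'},
)
```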
Object Lock support by platform:
| Platform | Object Lock Support | COMPLIANCE Mode | Legal Hold | Notes |
|---|---|---|---|---|
| Ceph RGW (ODF) | Yes (since Ceph Nautilus / 14.x) | Yes | Yes | Fully supported. Requires bucket created with Object Lock enabled. |
| MinIO | Yes (since RELEASE.2019-09-13) | Yes | Yes | Fully supported. MinIO is frequently used as immutable backup target. |
| NooBaa MCG | Partial (depends on backing store) | Depends on backing store | Depends on backing store | NooBaa passes Object Lock to the backing store. If backing store is Ceph RGW or MinIO, Object Lock works. If backing store is Azure Blob, Object Lock maps to Azure immutable blob storage (different API, translated by NooBaa). |
| Azure Blob | Azure Immutable Blob Storage (different API, not S3 Object Lock) | Yes (via Azure immutability policies) | Yes (via Azure legal hold) | Different API (Azure REST), but functionally equivalent. Backup tools must support Azure Blob natively or use a gateway. |
Implementation for Kasten K10 with immutable S3 target:
Immutable Backup Architecture with Kasten K10
=================================================
Step 1: Create Object Lock-enabled bucket on Ceph RGW
$ radosgw-admin user create --uid=kasten-backup \
--display-name="Kasten Backup Service Account"
# Object Lock must be enabled at bucket creation and requires versioning;
# creating the bucket with --object-lock-enabled-for-bucket implicitly
# enables versioning:
$ aws s3api create-bucket \
--bucket backup-immutable \
--object-lock-enabled-for-bucket \
--endpoint-url https://rgw.odf.internal
$ aws s3api put-object-lock-configuration \
--bucket backup-immutable \
--object-lock-configuration '{
"ObjectLockEnabled": "Enabled",
"Rule": {
"DefaultRetention": {
"Mode": "COMPLIANCE",
"Days": 365
}
}
}' \
--endpoint-url https://rgw.odf.internal
Step 2: Configure K10 Location Profile
K10 Location Profile:
Name: immutable-backup-target
Type: S3 Compatible
Endpoint: https://rgw.odf.internal
Bucket: backup-immutable
Access Key: <kasten-access-key>
Secret Key: <kasten-secret-key>
Protection: Enable Immutable Backups
Retention: 365 days (COMPLIANCE mode)
Step 3: Verify immutability
# Attempt to delete a specific (locked) backup object version.
# Object Lock protects object versions; on a versioned bucket, a delete
# without --version-id may only add a delete marker.
$ aws s3api delete-object \
    --bucket backup-immutable \
    --key "k10/vm-db01/full-20260428.tar" \
    --version-id <version-id> \
    --endpoint-url https://rgw.odf.internal
Expected: 403 AccessDenied
"An error occurred (AccessDenied) when calling the
DeleteObject operation: Access Denied"
# Even the root RGW admin cannot delete in COMPLIANCE mode.
# The only way to remove the data is to wait for retention to expire.
FINMA audit evidence for immutable backups:
An auditor will verify:
- Object Lock is enabled on the backup bucket (show get-object-lock-configuration output).
- Retention mode is COMPLIANCE, not GOVERNANCE (GOVERNANCE can be bypassed by admins).
- Retention period meets regulatory requirements (e.g., 365 days for operational backups, 7 years for compliance archives).
- Attempted deletions are logged and denied (show CloudTrail-equivalent audit logs from RGW or MinIO).
- No administrative bypass exists (in COMPLIANCE mode, this is enforced by the S3 implementation itself -- no configuration can override it).
- Regular backup success verification (show K10/Veeam backup reports with success/failure status).
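Part of this evidence can be collected programmatically. A sketch of such an audit check under the same assumptions as the example above; a denied delete on a specific version confirms COMPLIANCE-mode enforcement:

```python
import boto3
from botocore.exceptions import ClientError

s3 = boto3.client('s3', endpoint_url='https://rgw.odf.internal')

# Evidence 1: Object Lock configuration on the bucket.
cfg = s3.get_object_lock_configuration(Bucket='backup-immutable')
print(cfg['ObjectLockConfiguration'])   # expect COMPLIANCE mode, 365 days

# Evidence 2: deleting a locked object version is denied.
key = 'k10/vm-db01/full-20260428.tar'
versions = s3.list_object_versions(Bucket='backup-immutable', Prefix=key)
version_id = versions['Versions'][0]['VersionId']
try:
    s3.delete_object(Bucket='backup-immutable', Key=key,
                     VersionId=version_id)
    print('UNEXPECTED: delete succeeded -- bucket is NOT immutable')
except ClientError as err:
    print('Delete denied as expected:', err.response['Error']['Code'])
```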
2. Data Locality
The Concept
In a Hyper-Converged Infrastructure (HCI), compute (VMs) and storage (disks) share the same physical nodes. Data locality refers to the degree to which a VM's data resides on the same physical node that runs the VM. When locality is high, the VM reads and writes its own node's local disks without traversing the storage network. When locality is low -- because the VM migrated to a different node, or because the data was placed on remote OSDs by the storage algorithm -- every I/O traverses the network.
Data Locality: Local vs Remote I/O Path
==========================================
HIGH LOCALITY (VM and data on same node):
+---------------------------------------------------------+
| Node 1 |
| |
| +----------+ Local I/O +------------------+ |
| | VM-01 | =================> | OSD.0 (NVMe-0) | |
| | (vCPU, | (memory bus, | Contains VM-01 | |
| | vRAM) | no network) | primary replica | |
| +----------+ +------------------+ |
| |
| Latency: 100-200 us (NVMe only) |
| Bandwidth: Full NVMe bandwidth (3+ GiB/s per drive) |
| CPU overhead: Minimal (no network stack processing) |
+---------------------------------------------------------+
LOW LOCALITY (VM on Node 1, data on Node 2):
+---------------------------------------------------------+
| Node 1 Node 2 |
| |
| +----------+ Network I/O +------------------+ |
| | VM-01 | ==================> | OSD.4 (NVMe-0) | |
| | (vCPU, | (RDMA/TCP, | Contains VM-01 | |
| | vRAM) | 25-100 GbE) | primary replica | |
| +----------+ +------------------+ |
| |
| Latency: 200-500 us (NVMe + network round-trip) |
| Bandwidth: Limited by network (~12.5 GB/s for 100 GbE)|
| CPU overhead: Higher (network stack, TCP/RDMA, S2D/Ceph|
| protocol processing) |
+---------------------------------------------------------+
The difference matters at scale:
- 100 us vs 400 us per I/O = 4x latency increase
- For a database VM doing 50,000 IOPS:
Local: 50,000 * 100 us = 5 seconds of I/O time per second
Remote: 50,000 * 400 us = 20 seconds of I/O time per second
(The VM effectively stalls because I/O cannot keep up)
Why Data Locality Matters for HCI Performance
In traditional SAN architecture, all storage is remote by definition -- every I/O traverses the Fibre Channel or iSCSI network. Data locality is not a concept because there is no local path. In HCI, the introduction of local disks creates a performance asymmetry: local I/O is faster than remote I/O. Exploiting this asymmetry is one of the primary performance advantages of HCI over SAN.
Quantified performance impact (based on published benchmarks and vendor documentation):
| Metric | Local Read | Remote Read (25 GbE) | Remote Read (100 GbE RDMA) | Ratio (Local vs 25 GbE) |
|---|---|---|---|---|
| Latency (4 KiB random read, NVMe) | 80-150 us | 300-600 us | 150-300 us | 2-4x |
| Latency (4 KiB random write, NVMe) | 100-200 us (+ replication) | 200-500 us (+ replication) | 150-300 us (+ replication) | 1.5-2.5x |
| Throughput (128 KiB sequential read) | 3-6 GiB/s per NVMe | Limited by 25 GbE (~3 GiB/s) | Limited by 100 GbE (~12 GiB/s) | 1-2x |
| CPU overhead per I/O | ~5 us (block layer only) | ~15-30 us (network stack + protocol) | ~8-15 us (RDMA bypass) | 2-6x |
Key observations:
- Read latency is where locality matters most. Writes always traverse the network for replication (Ceph writes to 3 replicas, S2D writes to 2-3 mirrors), so the write path includes remote I/O regardless of locality. But reads can be served entirely locally if the data is on a local OSD/drive.
- RDMA (RoCE v2, iWARP) narrows the gap significantly. With 100 GbE RDMA, remote read latency drops to 150-300 us -- only 1.5-2x worse than local. Without RDMA (TCP-based storage networks), the gap widens to 3-5x.
- The aggregate impact scales with VM count. A single VM losing locality is barely noticeable. When 1,000 VMs simultaneously lose locality after a cluster rebalancing event, the storage network saturates and all VMs experience degraded performance.
Ceph Data Locality
Ceph uses CRUSH (Controlled Replication Under Scalable Hashing) to determine where data is placed. By default, CRUSH distributes data across OSDs based on the failure-domain hierarchy (host, rack, datacenter) to maximize resilience -- not to maximize locality. This means a VM running on Node 1 may have its primary OSD on Node 3 and replicas on Nodes 1 and 5. Reads go to the primary OSD (Node 3), which is remote.
Primary OSD Affinity (CRUSH Primary Affinity)
Every PG (Placement Group) has a primary OSD that handles all client reads and coordinates writes. By default, the primary is the first OSD in the CRUSH-calculated acting set. Ceph provides a mechanism to influence which OSD becomes primary:
CRUSH Primary Affinity
=========================
Default behavior (no locality optimization):
PG 3.1a:
Acting set: [OSD.7 (Node 3), OSD.2 (Node 1), OSD.11 (Node 4)]
Primary: OSD.7 (Node 3)
VM on Node 1 reads from PG 3.1a:
-> Read request goes to OSD.7 on Node 3 (REMOTE read)
-> Even though OSD.2 on Node 1 has a replica
This is suboptimal: Node 1 has the data locally, but Ceph
reads from the primary (which is remote).
Primary affinity adjustment:
ceph osd primary-affinity osd.2 1.0 (prefer as primary)
ceph osd primary-affinity osd.7 0.5 (less preferred as primary)
Effect: CRUSH recalculates primary assignments.
More PGs will choose local OSDs as primary when possible.
WARNING: Changing primary affinity is a blunt instrument.
It affects ALL PGs on those OSDs, not just the ones serving
a specific VM. Use with caution and monitor for imbalance.
OSD Read Affinity (Read from Closest Replica)
Since Ceph Octopus (15.x), and available in recent ODF versions, Ceph supports replica read affinity at the client level. Instead of always reading from the primary OSD, the librbd client can be configured to read from the closest replica -- ideally, one on the same node.
OSD Read Affinity (read_from_replica)
========================================
Without read affinity (default -- all reads from primary):
PG 3.1a acting set: [OSD.7 (Node 3), OSD.2 (Node 1), OSD.11 (Node 4)]
VM on Node 1:
+----------+ +----------+
| VM-01 | ---network---> | OSD.7 | Primary OSD (Node 3)
| Node 1 | READ request | Node 3 | ALWAYS reads from primary
+----------+ +----------+
OSD.2 on Node 1 has the same data, but is unused for reads.
With read affinity enabled:
Config: rbd_read_from_replica_policy = localize
CRUSH location: each OSD tagged with host=node1, rack=rack1, etc.
VM on Node 1:
+----------+ local I/O +----------+
| VM-01 | ==============> | OSD.2 | Closest replica (Node 1)
| Node 1 | READ request | Node 1 | LOCAL read -- no network!
+----------+ +----------+
Ceph client (librbd) checks CRUSH location of all replicas,
finds OSD.2 is on the same host as the client, reads locally.
Latency improvement:
Before: 300-500 us (remote primary read)
After: 100-200 us (local replica read)
Improvement: 50-70% latency reduction for reads
Trade-off:
- Reads are slightly stale (replica may lag primary by microseconds)
- For VM workloads, this staleness is irrelevant (single-writer model)
- Writes still go to primary (write path unchanged)
Ceph Read Affinity: Decision Flow
====================================
Client (librbd) issues READ for PG 3.1a:
1. Client knows acting set: [OSD.7, OSD.2, OSD.11]
2. Client knows its own CRUSH location: host=node1
if rbd_read_from_replica_policy == "localize":
3. Check each replica's CRUSH location:
OSD.7 -> host=node3 (remote)
OSD.2 -> host=node1 (LOCAL MATCH!)
OSD.11 -> host=node4 (remote)
4. Read from OSD.2 (local replica)
elif rbd_read_from_replica_policy == "balance":
3. Distribute reads across all replicas (round-robin)
4. Balances load across OSDs at cost of some remote reads
elif rbd_read_from_replica_policy == "default":  # read from primary
3. Always read from OSD.7 (primary)
4. Consistent but ignores locality
Recommendation for HCI (OVE):
Use "localize" for VM workloads (maximizes local reads)
Use "balance" for throughput-heavy workloads across many PGs
CRUSH Rules for Locality Awareness
CRUSH rules can be designed to prefer placing the primary replica on the same host as the client. This is more architectural than the per-OSD affinity tuning above:
CRUSH Rule for Primary Locality
===================================
Standard CRUSH rule (no locality preference):
rule replicated_rule {
id 0
type replicated
step take default # Start from root of CRUSH tree
step chooseleaf firstn 0 # Choose N OSDs from distinct hosts
type host
step emit
}
Result: Primary OSD could be on ANY host. No locality guarantee.
Locality-aware CRUSH rule (primary on same host):
Note: Ceph does not natively support "place primary on client's host"
because CRUSH is a static algorithm (it does not know which host the
client/VM is on at rule-evaluation time).
Workaround: Use CRUSH device classes + per-pool configuration
to influence placement, OR rely on read affinity (above) which
achieves the same performance benefit without CRUSH rule changes.
The read-affinity approach (rbd_read_from_replica_policy=localize) is
the recommended method for Ceph locality optimization in ODF/OVE.
It is simpler, safer, and does not require custom CRUSH rules.
BlueStore Cache
Every Ceph OSD includes a BlueStore cache -- an in-memory cache that holds recently read and written data blocks. This cache provides a form of implicit data locality: even if the primary OSD is remote, the local OSD's BlueStore cache may hold recently written data (since the local OSD is a replica).
BlueStore Cache Behavior
===========================
Per-OSD Cache (in-process memory):
OSD.2 (Node 1, NVMe-backed):
+----------------------------------------------+
| BlueStore Cache (default: from osd_memory_target)
| Typically 1-3 GiB per OSD (auto-tuned) |
| |
| +------------------------------------------+ |
| | Block Cache (data blocks) | |
| | - Recently read blocks | |
| | - Recently written blocks | |
| | - LRU eviction | |
| +------------------------------------------+ |
| +------------------------------------------+ |
| | KV Cache (RocksDB metadata) | |
| | - Object headers, omap data | |
| | - Critical for metadata-heavy ops | |
| +------------------------------------------+ |
+----------------------------------------------+
With read affinity enabled:
VM on Node 1 reads from local OSD.2
First read: fetched from NVMe, cached in BlueStore
Subsequent reads: served from memory cache (< 10 us!)
Cache hit rate depends on working set size vs cache size
Configuration:
osd_memory_target = 4294967296 (4 GiB, default)
bluestore_cache_autotune = true (default)
For NVMe-only clusters: consider increasing osd_memory_target
to 6-8 GiB per OSD if RAM allows, to increase cache hit rate
and further improve local read performance.
S2D Data Locality
Storage Spaces Direct takes a fundamentally different approach to data locality than Ceph. S2D is designed with strong locality awareness built into its core architecture through the CSV (Cluster Shared Volume) ownership model and the Storage Bus Layer (SBL).
ReFS Read-Local Optimization
S2D uses ReFS (Resilient File System) on top of Storage Spaces virtual disks. ReFS includes a read-local optimization that is enabled by default:
S2D Read-Local Optimization
===============================
3-way mirror volume: data exists on 3 nodes
Volume: "ClusterStorage\Vol1"
Data block X:
Copy 1: Node 1, SSD-2 (local to Node 1)
Copy 2: Node 2, SSD-4 (remote)
Copy 3: Node 3, SSD-1 (remote)
VM running on Node 1 reads block X:
+-----------+ +-----------+
| Node 1 | | Node 2 |
| | NOT used for read | |
| +------+ | | +-------+ |
| | VM | | | |Copy 2 | |
| +--+---+ | | +-------+ |
| | | +-----------+
| v |
| +------+ | +-----------+
| |Copy 1| | <-- READ LOCAL | Node 3 |
| |SSD-2 | | No network involved | |
| +------+ | | +-------+ |
| | NOT used for read | |Copy 3 | |
+-----------+ | +-------+ |
+-----------+
S2D ALWAYS reads from the local copy when one exists.
This is automatic -- no configuration required.
Result: every read operation on a 3-way mirror volume
is served locally, as long as the local copy is healthy.
Write path (always involves network):
VM writes block X on Node 1:
1. Write to local copy (Node 1, SSD-2) -- local
2. Write to mirror copy (Node 2, SSD-4) -- network (SMB Direct)
3. Write to mirror copy (Node 3, SSD-1) -- network (SMB Direct)
4. ACK to VM after all 3 writes complete -- synchronous mirrors
Write latency = max(local write, remote write 1, remote write 2)
Typically: 100-300 us (dominated by network round-trip)
CSV Ownership and Preferred Node
Every CSV volume has an owner node -- the node that coordinates metadata operations for that volume. Other nodes access the volume through the owner (for metadata) or directly via the Storage Bus Layer (for data I/O, which uses read-local).
CSV Ownership Model
======================
4-node cluster, 4 CSV volumes:
Node 1 (owner of Vol1):
+--------------------------------------------------+
| CSV Owner for C:\ClusterStorage\Vol1 |
| - Handles NTFS metadata operations for Vol1 |
| - All metadata (create file, rename, ACL change)|
| goes through Node 1's NTFS stack |
| - Data I/O from local VMs: LOCAL (SBL direct) |
| - Data I/O from remote VMs: redirected I/O |
| or direct I/O via SBL (depends on access mode)|
+--------------------------------------------------+
Node 2 (owner of Vol2):
+--------------------------------------------------+
| CSV Owner for C:\ClusterStorage\Vol2 |
+--------------------------------------------------+
Node 3 (owner of Vol3, Vol4):
+--------------------------------------------------+
| CSV Owner for C:\ClusterStorage\Vol3 and Vol4 |
+--------------------------------------------------+
Moving CSV ownership (PowerShell):
Move-ClusterSharedVolume -Name "Vol1" -Node "Node1"
After live migration, CSV ownership can be moved to follow
the VM, keeping metadata operations local.
Best practice:
- Assign CSV ownership to the node running the majority of
VMs on that volume
- After live migration of a heavy VM, consider moving CSV
ownership to follow the VM
- Azure Local automates some of this via VM placement heuristics
Storage Bus Layer: Local Path vs Remote Path
The Software Storage Bus (SBL) in S2D provides two I/O paths:
S2D Storage Bus Layer I/O Paths
==================================
LOCAL PATH (data on same node as VM):
+----------------------------------------------------------+
| Node 1 |
| +------+ +----------+ +--------+ +-----------+ |
| | VM | -> | Hyper-V | -> | ReFS / | -> | SBL Local | |
| | | | VHD | | Storage| | Path | |
| | | | Stack | | Spaces | | (no SMB) | |
| +------+ +----------+ +--------+ +-----+-----+ |
| | |
| +-----v-----+ |
| | NVMe/SSD | |
| | (local | |
| | disk) | |
| +-----------+ |
+----------------------------------------------------------+
Latency: 100-200 us
No network stack, no SMB, no TCP/RDMA overhead
Highest performance path in S2D
REMOTE PATH (data on different node from VM):
+----------------------------------------------------------+
| Node 1 Node 2 |
| +------+ +----------+ +----------+ |
| | VM | -> | Hyper-V | -SMB->| SBL | |
| | | | VHD | Direct| Remote | |
| | | | Stack | (RDMA)| Handler | |
| +------+ +----------+ +-----+----+ |
| | |
| +-----v-----+ |
| | NVMe/SSD | |
| | (Node 2 | |
| | disk) | |
| +-----------+ |
+----------------------------------------------------------+
Latency: 200-500 us (with RDMA), 400-1000 us (TCP)
Requires: SMB Direct stack, RDMA NIC (recommended)
Network bandwidth consumed for every I/O
IMPORTANT S2D BEHAVIOR:
For 3-way mirrors, READ operations use the LOCAL path
(read-local optimization, see above).
For WRITE operations, the local path writes the local copy,
then SBL sends the write to remote nodes via SMB Direct.
Result: READS are always local (when mirror copy exists locally).
WRITES involve remote I/O (for mirror copies) regardless.
Data Locality vs Live Migration: The Fundamental Tension
Live migration (vMotion in VMware, Live Migration in Hyper-V, KubeVirt live migration) moves a running VM from one physical node to another without downtime. In HCI, live migration breaks data locality -- the VM is now on a different node than its primary data.
The Locality-Migration Tension
==================================
Before migration:
+------------------+ +------------------+
| Node 1 | | Node 2 |
| | | |
| +------+ | | |
| | VM-01| LOCAL | | |
| +--+---+ reads | | |
| | | | |
| +--v---------+ | | +------------+ |
| | OSD.0 | | | | OSD.4 | |
| | VM-01 data | | | | VM-01 | |
| | (primary) | | | | (replica) | |
| +------------+ | | +------------+ |
+------------------+ +------------------+
Latency: 100-200 us (local reads from OSD.0)
After live migration (VM-01 moves to Node 2):
+------------------+ +------------------+
| Node 1 | | Node 2 |
| | | |
| | | +------+ |
| | | | VM-01| REMOTE |
| | | +--+---+ reads |
| | | | |
| +------------+ | | +--v---------+ |
| | OSD.0 | | <--+ | OSD.4 | |
| | VM-01 data | | net | | VM-01 | |
| | (primary) | | | | (replica) | |
| +------------+ | | +------------+ |
+------------------+ +------------------+
Without read affinity:
Reads go to primary OSD.0 on Node 1 = REMOTE (300-500 us)
With Ceph read affinity (localize):
Reads go to replica OSD.4 on Node 2 = LOCAL (100-200 us)
Locality is RESTORED without moving data!
With S2D read-local:
Reads go to local mirror copy on Node 2 = LOCAL (100-200 us)
Locality is MAINTAINED automatically!
How each platform handles post-migration data rebalancing:
| Platform | Read Locality After Migration | Write Locality After Migration | Data Rebalancing |
|---|---|---|---|
| VMware vSAN | Reads from the nearest replica (locality-aware reads since vSAN 7.0). | Writes go to the owner component and may be remote. vSAN rebalances over time. | Automatic: a proactive rebalance task moves data components closer to the VM's current host. |
| Ceph (OVE/ODF) | With rbd_read_from_replica_policy=localize: reads from a local replica if one exists. Without it: reads from the remote primary. | Writes always go to the primary OSD (may be remote), with replicas on other nodes. | No automatic rebalancing for locality. CRUSH does not move data to follow VMs; data stays where CRUSH placed it. Locality depends on read affinity and replica distribution. |
| S2D (Azure Local) | Read-local is always active: reads come from the local mirror copy, no configuration needed. | Writes go to all mirror copies (local + remote); the local write is part of the path. | No explicit rebalancing: with a 3-way mirror, most reads find a local copy on small clusters (a 3-node cluster guarantees one). If no local copy exists (e.g., 2-way mirror, or a larger cluster where the block's copies live elsewhere), reads become remote. |
| Swisscom ESC | Managed by Swisscom (SAN-based, locality not applicable). | N/A | N/A |
Storage migration to restore locality (Ceph):
When Ceph read affinity cannot restore locality (because no replica exists on the VM's current node), the only option is storage migration -- moving the data to a new location. In Kubernetes/KubeVirt, this means creating a new PVC on the target node and copying data:
Storage Migration in KubeVirt (OVE)
======================================
Scenario: VM-01 live-migrated from Node 1 to Node 3.
No Ceph replica exists on Node 3.
All reads are remote (high latency).
Option 1: Accept the latency (most common approach)
- If read affinity finds a replica on a nearby node (same rack),
the latency penalty is small
- For most VM workloads, the difference is acceptable
- Recommendation: monitor latency, only act if SLA is breached
Option 2: VM live storage migration (KubeVirt 1.1+)
- KubeVirt supports live storage migration (moving PVCs)
- Creates a new PVC, copies data block-by-block, then switches
- However: Ceph CRUSH may place the new PVC's data on the same
set of OSDs (CRUSH is deterministic based on PG, not affinity)
- This does NOT guarantee locality unless CRUSH rules are modified
Option 3: Rely on Ceph replica distribution
- With 3x replication and many OSDs per node, the probability
that at least one replica is on the VM's current node is:
P(local replica) = 1 - P(no replica on this node)
For a 6-node cluster with 6 OSDs/node (36 OSDs total):
PG has 3 replicas across 3 different nodes (CRUSH host rule)
P(at least one replica on a specific node) = 3/6 = 50%
For a 4-node cluster: P = 3/4 = 75%
For a 3-node cluster: P = 3/3 = 100% (guaranteed locality)
Larger clusters = lower probability of automatic locality
Read affinity only helps when a local replica exists
Conclusion:
- 3-4 node clusters: read affinity provides near-guaranteed locality
- 6+ node clusters: locality probability drops; accept remote reads
or use node affinity rules to keep VMs on preferred nodes
- S2D has an advantage here: 3-way mirror across 3+ nodes means
read-local almost always finds a local copy
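The replica-placement arithmetic above generalizes to any cluster size. A quick way to tabulate it during planning (a minimal sketch, assuming uniform CRUSH host placement of 3 replicas, per the symmetry argument above):

```bash
# P(at least one of r replica hosts is a specific node) = r/n for n >= r,
# by symmetry under uniform placement across distinct hosts.
for n in 3 4 6 8 12; do
  awk -v r=3 -v n="$n" \
    'BEGIN { printf "%2d nodes: P(local replica) = %3.0f%%\n", n, 100 * r / n }'
done
```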
Data Locality in Disaggregated Architectures
Not all deployments use HCI. Some architectures disaggregate compute and storage, running compute-only nodes (no local storage) and storage-only nodes (no VMs). In this model, data locality is zero by design -- every I/O traverses the network.
HCI vs Disaggregated Architecture
=====================================
HCI (Hyper-Converged -- locality possible):
+------------------+ +------------------+ +------------------+
| Node 1 | | Node 2 | | Node 3 |
| VMs + Storage | | VMs + Storage | | VMs + Storage |
| +------+ | | +------+ | | +------+ |
| | VM-01| | | | VM-03| | | | VM-05| |
| | VM-02| | | | VM-04| | | | VM-06| |
| +------+ | | +------+ | | +------+ |
| +------+ | | +------+ | | +------+ |
| |OSD.0 | | | |OSD.2 | | | |OSD.4 | |
| |OSD.1 | | | |OSD.3 | | | |OSD.5 | |
| +------+ | | +------+ | | +------+ |
+------------------+ +------------------+ +------------------+
Local reads possible. Network only for replication writes.
Higher resource contention (VMs compete with OSDs for CPU/RAM).
Disaggregated (Compute + Storage separated -- zero locality):
Compute Tier: Storage Tier:
+------------------+ +------------------+
| Compute Node 1 | | Storage Node 1 |
| VMs only | ALL I/O | OSDs only |
| +------+ | traverses | +------+ |
| | VM-01| ------+---network-----> | |OSD.0 | |
| | VM-02| | | |OSD.1 | |
| +------+ | | |OSD.2 | |
| No local disks | | +------+ |
+------------------+ +------------------+
+------------------+ +------------------+
| Compute Node 2 | | Storage Node 2 |
| VMs only | ALL I/O | OSDs only |
| +------+ ------+---network-----> | +------+ |
| | VM-03| | | |OSD.3 | |
| | VM-04| | | |OSD.4 | |
| +------+ | | |OSD.5 | |
+------------------+ +------------------+
No local reads. Every I/O = network round-trip.
No resource contention (VMs get full CPU/RAM).
Requires 25-100 GbE RDMA storage network.
When disaggregated makes sense:
- GPU-heavy workloads (VMs need all GPU + CPU, no room for OSDs)
- Very large storage clusters (storage nodes with 24+ drives each;
  the storage-to-compute ratio is too high for HCI)
- Independent scaling (add compute without adding storage, or vice versa)
When HCI is better:
- General VM workloads (the majority of the 5,000 VM estate)
- Latency-sensitive workloads (databases, OLTP)
- Cost efficiency (fewer total nodes, no dedicated storage hardware)
ODF supports both models:
- HCI: "internal" mode (OSDs on worker nodes alongside VMs)
- Disaggregated: "external" mode (OSDs on dedicated storage nodes)
- Hybrid: some worker nodes with OSDs, some without
Performance Impact: Measured Latency Comparison
The following table summarizes published and expected latency values for local vs remote storage access across the candidate platforms. These values should be validated during the PoC with the organization's specific hardware and workload profiles.
Measured Latency: Local vs Remote Storage Access
====================================================
Test: 4 KiB random read, queue depth 1, direct I/O (iodepth=1)
Hardware: NVMe drives, 25 GbE or 100 GbE RDMA storage network
Platform | Local Read | Remote Read | Remote Read | Locality
| (same node) | (25 GbE TCP) | (100 GbE RDMA) | Benefit
------------------+--------------+----------------+-----------------+----------
Ceph (ODF/OVE) | | | |
read-from- | | | |
primary | N/A * | 250-500 us | 150-300 us | N/A
(default) | | | |
| | | |
Ceph (ODF/OVE) | | | |
read-affinity | 80-180 us | (fallback | (fallback | 50-70%
localize | | if no local | if no local | latency
| | replica) | replica) | reduction
| | | |
S2D (Azure Local) | | | |
read-local | 80-150 us | 200-500 us | 120-250 us | 50-70%
(default on) | | (2-way mirror | (2-way mirror | latency
| | no local | no local | reduction
| | copy) | copy) |
| | | |
vSAN (VMware) | | | |
nearest- | 80-180 us | 200-400 us | N/A (vSAN uses | 50-60%
replica read | | (remote | TCP typically) | latency
| | component) | | reduction
------------------+--------------+----------------+-----------------+----------
* Ceph default: reads always from primary OSD. If primary happens
to be local, latency is 80-180 us. If remote, 250-500 us.
With default settings, locality is essentially random.
Test: 4 KiB random write, queue depth 1, direct I/O
(Writes always involve replication -- local + remote writes)
Platform | Write Latency | Notes
------------------+---------------+--------------------------------------
Ceph (ODF/OVE) | 200-600 us | Primary OSD writes locally + sends to
3x replication | | 2 replicas. ACK after all 3 complete.
| | Dominated by slowest replica write.
| |
S2D (Azure Local) | 150-400 us | Write to local mirror + SMB Direct to
3-way mirror | | 2 remote mirrors. ACK after all 3.
| | RDMA reduces remote write latency.
| |
vSAN (VMware) | 200-500 us | Write to owner component + mirror
RAID-1 (mirror) | | components. ACK after all mirrors.
------------------+---------------+--------------------------------------
Key insight: WRITE latency is similar across platforms because all
require network replication. The locality advantage is primarily
in READ latency.
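These figures can be reproduced during the PoC with fio. A minimal sketch of the read test defined above, run inside the guest against its virtual disk (the device path and runtime are placeholders):

```bash
# 4 KiB random read, queue depth 1, direct I/O -- matches the table's
# test definition. /dev/vdb and the 60 s runtime are assumptions.
fio --name=qd1-randread --filename=/dev/vdb \
    --rw=randread --bs=4k --iodepth=1 --direct=1 \
    --ioengine=libaio --runtime=60 --time_based

# For the write test, substitute --rw=randwrite. This is destructive:
# only run it against a scratch disk, never one holding data.
```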
How the Candidates Handle This
Object Storage Comparison
| Aspect | VMware (vSAN) | OVE (ODF/Ceph) | Azure Local (S2D) | Swisscom ESC |
|---|---|---|---|---|
| Native S3 service | None. VMware does not include an S3-compatible object store. Requires separate deployment (MinIO, Dell ECS, etc.). | Ceph RGW -- built into ODF. S3-compatible, managed by ODF operator. Production-grade. | None native on-premises. Azure Blob in cloud. MinIO deployable on AKS-HCI for on-prem S3. | Managed by Swisscom. S3 access depends on contracted services. |
| S3 API completeness | N/A | Ceph RGW: high (covers most S3 operations including multipart, versioning, Object Lock, bucket policies). Some advanced features (S3 Select, Batch Operations) not supported. | MinIO (if deployed): very high S3 API completeness. Azure Blob: different API (Azure REST), S3 compatibility layer is partial. | Unknown (managed). |
| Object Lock / WORM | N/A | Ceph RGW: full Object Lock support (COMPLIANCE + GOVERNANCE modes, legal hold). Since Ceph Nautilus. | MinIO: full Object Lock support. Azure Blob: Azure Immutable Blob Storage (functionally equivalent, different API). | Contractual (must verify with Swisscom). |
| Multi-site replication | N/A | Ceph RGW multi-site: async replication across zones. Active-active or active-passive. Built-in. | MinIO: site replication (active-active, async). Azure Blob: GRS/GZRS (Microsoft-managed, cloud-only). | Managed by Swisscom. |
| Multi-cloud tiering | N/A | NooBaa MCG: tiering across local Ceph + remote S3 + Azure Blob. BucketClass policies. | Manual (scripted tiering from MinIO to Azure Blob). No native multi-cloud gateway. | N/A (managed). |
| Kubernetes integration | N/A | ObjectBucketClaim (OBC) CRD for automated bucket provisioning. Secrets auto-generated for applications. | MinIO Operator for Kubernetes. OBC support available. Less mature than ODF integration. | N/A. |
| Backup tool integration | Veeam targets external S3 (MinIO, AWS, etc.) or Veeam repository. | K10 exports to Ceph RGW S3 natively. Velero uses S3 natively. Quay/Harbor use S3 for registry backend. | Veeam targets MinIO S3 or Azure Blob natively. K10 on AKS-HCI targets MinIO S3. | Managed backup targets. |
| Capacity for backup | Separate infrastructure. | Shared Ceph cluster (data pool can use EC for efficiency, e.g., 4+2 = 1.5x overhead vs 3x for replicated). Capacity scales with cluster. | MinIO: separate EC-protected capacity. Azure Blob: cloud-billed per GiB. | Managed capacity (SLA). |
Data Locality Comparison
| Aspect | VMware (vSAN) | OVE (ODF/Ceph) | Azure Local (S2D) | Swisscom ESC |
|---|---|---|---|---|
| Read locality mechanism | Read from nearest replica (since vSAN 7.0). Automatic. | `osd_read_from_replica=localize` (since Ceph Reef / ODF 4.14+). Must be enabled. | Read-local optimization in SBL. Automatic, always on for mirror volumes. | N/A (SAN-based, all I/O is remote by design). |
| Write locality | Writes always involve replication (remote I/O). Owner component may be local or remote. | Writes always go to primary OSD (may be local or remote) + replicas. No locality optimization for writes. | Writes go to all mirror copies (local + remote via SMB Direct). Local copy is always written as part of the path. | N/A. |
| Locality after live migration | vSAN proactive rebalance moves components closer to VM's new host over time. Reads may be temporarily remote. | Reads may become remote if no replica on new host. Read affinity restores locality if a replica exists. No automatic data rebalancing. | Read-local continues to work if a mirror copy exists on the new host (likely for 3-way mirror). Immediate locality. | N/A. |
| Locality probability (6-node cluster) | ~50% immediate (depends on RAID policy and placement). Improves over time with rebalancing. | ~50% with 3x replication (3 replicas across 6 nodes). 100% with 3-node clusters. | Near 100% with 3-way mirror (3 copies across 3+ nodes, S2D places copies across all nodes). | N/A. |
| Configuration required | Automatic (vSAN 7.0+). | Must enable `osd_read_from_replica=localize`. Not enabled by default. | Automatic (SBL read-local is always on). | N/A. |
| Disaggregated support | Traditional vSAN is HCI only (no compute-only nodes); vSAN ESA supports a disaggregated deployment model. | ODF supports external mode (compute-only + storage-only nodes). Locality = zero in this mode. | S2D is HCI only. No disaggregated mode. | N/A (SAN is inherently disaggregated). |
Key Takeaways
- Object storage is infrastructure, not optional. Every candidate platform needs S3-compatible object storage for backup targets, container image registries, log aggregation, and artifact storage. OVE has a significant advantage here: Ceph RGW is built into ODF, managed by the same operator, and provides production-grade S3 with Object Lock, multi-site replication, and Kubernetes-native provisioning (OBC). Azure Local lacks an equivalent native on-premises S3 service and requires deploying MinIO as an additional component or relying on Azure Blob in the cloud. This gap adds operational complexity and an additional software component to manage.
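  A minimal sketch of the Kubernetes-native provisioning flow via the OBC CRD (the namespace, claim name, and storage-class name are assumptions; the RGW object storage class name varies by installation):

  ```bash
  # The ODF/Rook operator fulfills the claim and generates a Secret and
  # ConfigMap of the same name containing the endpoint and credentials.
  cat <<'EOF' | kubectl apply -f -
  apiVersion: objectbucket.io/v1alpha1
  kind: ObjectBucketClaim
  metadata:
    name: k10-backup-bucket
    namespace: kasten-io
  spec:
    generateBucketName: k10-backup
    storageClassName: ocs-storagecluster-ceph-rgw
  EOF
  ```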
- Immutable backups via S3 Object Lock are non-negotiable for FINMA compliance. Ransomware is an existential threat to a 5,000+ VM environment. Mutable backups are not backups if an attacker with admin credentials can delete them. Deploy all backup buckets with S3 Object Lock in COMPLIANCE mode (not GOVERNANCE mode). COMPLIANCE mode prevents deletion by any user, including root -- this is the only mode that provides true WORM semantics. Verify during PoC that attempted deletions return 403 AccessDenied and that no administrative bypass exists.
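  A hedged verification sketch using the AWS CLI against an RGW endpoint (the endpoint URL, bucket, and object names are placeholders):

  ```bash
  RGW=https://rgw.example.internal   # placeholder endpoint

  # Object Lock must be enabled at bucket creation; it cannot be
  # retrofitted onto an existing bucket.
  aws --endpoint-url "$RGW" s3api create-bucket \
      --bucket backups-immutable --object-lock-enabled-for-bucket

  # Default COMPLIANCE retention of 365 days for all new objects
  aws --endpoint-url "$RGW" s3api put-object-lock-configuration \
      --bucket backups-immutable \
      --object-lock-configuration \
      'ObjectLockEnabled=Enabled,Rule={DefaultRetention={Mode=COMPLIANCE,Days=365}}'

  # Deleting a specific locked object VERSION must be denied. A delete
  # without --version-id merely adds a delete marker, so the PoC test
  # must target the version ID.
  aws --endpoint-url "$RGW" s3api delete-object \
      --bucket backups-immutable --key test.bin --version-id "<version-id>"
  ```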
- Ceph RGW multi-site replication is the answer to off-site backup durability. Storing backups on the same Ceph cluster as production data does not protect against site failure. Configure RGW multi-site replication to asynchronously copy the backup bucket to a second site. This provides the "1" in the 3-2-1 backup rule at the object storage level. Measure replication lag under production write load during PoC to validate RPO for backup data.
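  Replication lag can be observed from the secondary zone with the built-in sync status commands (the bucket name is a placeholder):

  ```bash
  # Overall metadata/data sync state relative to the master zone
  radosgw-admin sync status

  # Per-bucket progress for the bucket that actually carries the
  # backup traffic -- the relevant number for RPO validation
  radosgw-admin bucket sync status --bucket=backups-immutable
  ```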
- NooBaa MCG enables FINMA-compliant long-term retention without operational burden. Financial regulations require retaining certain data for 7-10 years. Storing 7 years of backups on expensive NVMe-backed Ceph storage is cost-prohibitive. NooBaa's BucketClass tiering policy automates the lifecycle: hot data on local Ceph RGW (fast restore), warm data on remote S3 (DR protection), cold data on Azure Blob Archive (cheapest $/GiB, 7-year retention). This is a compelling architecture that leverages ODF's integrated components.
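  A sketch of the BucketClass shape with a two-tier placement (the backing-store names are assumptions and must exist as NooBaa BackingStore resources; the full three-tier lifecycle described above would add a further backing store or lifecycle policy):

  ```bash
  cat <<'EOF' | kubectl apply -f -
  apiVersion: noobaa.io/v1alpha1
  kind: BucketClass
  metadata:
    name: finma-retention
    namespace: openshift-storage
  spec:
    placementPolicy:
      tiers:
        - backingStores: [local-rgw]       # hot tier: local Ceph RGW
        - backingStores: [azure-archive]   # cold tier: Azure Blob
  EOF
  ```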
- Data locality is a hidden performance multiplier -- but only for reads. Write latency is dominated by replication (network round-trip to mirror/replica nodes) regardless of locality. Read latency benefits enormously from locality: 80-180 us local vs 250-500 us remote (2-4x improvement). For read-heavy VM workloads (databases, analytics, file servers), data locality is a critical performance factor. For write-heavy workloads, locality provides minimal benefit.
- S2D has a structural advantage in read locality over Ceph. S2D's read-local optimization is automatic, always-on, and works reliably with 3-way mirrors. Ceph's read affinity must be explicitly enabled (`osd_read_from_replica=localize`) and only works when a replica happens to exist on the VM's current node (50% probability in a 6-node cluster). In small clusters (3-4 nodes), both platforms provide near-guaranteed locality. In larger clusters, S2D's design advantage becomes more pronounced.
- Live migration is the enemy of data locality -- but read affinity is the antidote. When a VM live-migrates, it loses locality with its primary OSD (Ceph) or may move to a node without a mirror copy (S2D, if < 3-way mirror). For Ceph, enabling `osd_read_from_replica=localize` restores read locality without moving data -- the client reads from the nearest replica. For S2D, read-local finds the local mirror copy automatically. Neither platform rebalances data to follow VMs (vSAN's proactive rebalance is unique in this regard).
- Disaggregated architectures sacrifice locality for flexibility. If the architecture separates compute-only nodes from storage-only nodes (ODF external mode), data locality is zero by design. Every I/O traverses the network. This is acceptable when: (a) the storage network is 100 GbE RDMA (latency penalty is only 1.5-2x), (b) the workload is not latency-sensitive, or (c) the architecture requires independent compute/storage scaling. For the majority of the 5,000 VM estate running general workloads, HCI with locality provides better performance per dollar.
- Enable Ceph read affinity in OVE from day one. The `osd_read_from_replica=localize` setting is not enabled by default in ODF. This is the single most impactful performance tuning for Ceph in an HCI deployment. Enable it during initial ODF installation, not as an afterthought. Validate the latency improvement during PoC by measuring 4 KiB random read latency with and without the setting enabled.
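  A sketch, run from the ODF toolbox pod. This document refers to the setting as `osd_read_from_replica`; Ceph exposes it to RBD clients as `rbd_read_from_replica_policy`, so verify the exact option name against the deployed release:

  ```bash
  # Enable localized reads for all RBD clients, then confirm
  ceph config set client rbd_read_from_replica_policy localize
  ceph config get client rbd_read_from_replica_policy

  # Note: clients must also report their position in the CRUSH map
  # (the crush_location client option) for 'localize' to pick the
  # nearest replica.
  ```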
- Validate bucket index sharding for backup buckets. A Kasten K10 deployment backing up 5,000 VMs with 365-day retention will accumulate millions of objects in the backup bucket. Without bucket index sharding, listing operations become prohibitively slow. Ceph's dynamic resharding handles this automatically in recent versions, but verify during PoC that resharding is enabled and functioning. Monitor bucket index shard count and per-shard object count via `radosgw-admin bucket stats`.
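  A sketch of the monitoring commands (the bucket name is a placeholder; exact stats fields vary by Ceph release):

  ```bash
  # Shard count and object totals for the backup bucket
  radosgw-admin bucket stats --bucket=backups-immutable \
    | grep -E 'num_shards|num_objects'

  # Any buckets currently queued for dynamic resharding
  radosgw-admin reshard list
  ```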
Discussion Guide
The following questions are designed for vendor deep-dives, PoC planning, and internal architecture reviews. They address the object storage and data locality considerations specific to a Tier-1 financial enterprise migrating from VMware.
Questions for OVE / ODF (Red Hat)
- Ceph RGW performance at backup scale: "Our Kasten K10 deployment will back up 5,000 VMs nightly. Assuming an average backup export of 10 GiB per VM (50 TiB total nightly export), how many RGW pods are required to sustain this throughput within an 8-hour backup window (~1.7 GiB/s sustained)? What is the impact on OSD performance for the shared RADOS cluster -- do RGW backup writes compete with RBD VM I/O? Should we use separate RADOS pools with erasure coding for RGW data to minimize the impact on replicated RBD pools?"
- S3 Object Lock validation: "Demonstrate S3 Object Lock in COMPLIANCE mode on Ceph RGW. Create a bucket with Object Lock enabled, upload an object with 365-day COMPLIANCE retention, then attempt to delete the object using: (a) the S3 API with the bucket owner's credentials, (b) the `radosgw-admin` command-line tool with admin privileges, (c) direct RADOS object deletion via `rados rm`. Confirm that all three deletion attempts fail. What is the FINMA-auditable evidence trail for these denied operations?"
- RGW multi-site replication lag: "Configure RGW multi-site replication between two ODF clusters at two sites connected by a 10 Gbps WAN link. Under sustained write load (1 GiB/s of backup exports to the primary zone), measure the replication lag to the secondary zone. What is the steady-state lag? What is the maximum lag during peak write bursts? How long does catch-up replication take after a 4-hour WAN outage?"
- Read affinity configuration and validation: "Enable `osd_read_from_replica=localize` on an ODF cluster running KubeVirt VMs. Demonstrate the latency improvement using fio (4 KiB random read, iodepth=1) from inside a VM: (a) with read affinity disabled (reads from primary), (b) with read affinity enabled (reads from local replica). Show the `ceph osd perf` output and the Prometheus `ceph_osd_op_r_latency` metric before and after. What is the measured latency reduction?"
- NooBaa tiering for FINMA retention: "Configure a NooBaa BucketClass with three tiers: local Ceph RGW (hot, 30 days), remote S3 at DR site (warm, 90 days), Azure Blob Archive (cold, 7 years). Demonstrate the lifecycle: upload an object, observe it on local Ceph, then simulate time progression and verify the object is tiered to Azure Blob Archive. What is the retrieval time from Azure Blob Archive tier? Can K10 restore from the Archive tier, and what is the RTO impact?"
Questions for Azure Local (Microsoft)
- On-premises S3 strategy: "Azure Local does not include a native S3-compatible object store. What is Microsoft's recommended approach for on-premises S3 storage for Kasten K10 backup targets, container image registry storage, and log aggregation (Loki)? Is MinIO on AKS-HCI the recommended approach? If so, what is the support model -- does Microsoft support MinIO, or is it community/MinIO commercial support? How does this compare to the integrated Ceph RGW in ODF?"
- Azure Blob immutability for backup: "If we use Azure Blob Storage (cloud) as the backup target for Veeam or Kasten K10, demonstrate Azure Immutable Blob Storage in WORM-compliant mode. Show that a backup object with a time-based retention policy (365 days) cannot be deleted by: (a) the storage account owner, (b) an Azure subscription administrator, (c) a Global Administrator in Azure AD. How does this compare to S3 Object Lock COMPLIANCE mode in terms of FINMA auditability?"
- S2D read-local validation: "Demonstrate S2D read-local optimization. Run a VM on Node 1, measure 4 KiB random read latency with fio. Live-migrate the VM to Node 2, measure latency again. Confirm that read-local continues to provide local reads after migration (latency should remain similar). Then create a 2-way mirror volume (instead of 3-way), migrate the VM to a node that does not hold a mirror copy, and measure the latency degradation."
- Data locality after node failure: "In a 4-node Azure Local cluster with 3-way mirror volumes: Node 2 fails. S2D rebuilds the missing mirror copies on remaining nodes. During rebuild: (a) what is the I/O latency impact on running VMs? (b) does read-local continue to function for VMs on surviving nodes? (c) how long does the rebuild take for 10 TiB of data per node? After rebuild: is read-local restored for all VMs, or are some mirror copies now on non-optimal nodes?"
Questions for Swisscom ESC
- Object storage services: "Does Swisscom ESC provide S3-compatible object storage as a managed service? If yes: what S3 API operations are supported? Is Object Lock (WORM) supported in COMPLIANCE mode? What is the capacity limit? What is the SLA for availability and durability? If no: what is the backup target architecture (NFS, proprietary appliance)? How does the customer verify immutability of backup data?"
Cross-Platform / Internal Architecture Questions
- Object storage sizing for 5,000 VMs: "Calculate the object storage capacity required for: (a) Kasten K10 / Veeam backup data with 365-day retention, daily incremental + weekly full, 5,000 VMs with average 100 GiB disk. (b) Container image registry (Quay/Harbor) with estimated 500 images, 5 tags each, average 2 GiB per image. (c) Log retention (Loki) at 90 days with estimated 10 GiB/day of log ingestion. (d) Monitoring archive (Thanos) at 1 year with estimated 5 GiB/day of metric ingestion. Total estimated object storage: [X] TiB. Plan for this capacity in the Ceph cluster (OVE) or MinIO deployment (Azure Local)."
- Data locality policy decision: "For our 5,000 VM estate, what is the recommended cluster size (number of nodes)? Larger clusters provide more capacity and fault isolation but reduce data locality probability in Ceph (the three replicas cover a smaller fraction of the nodes). Smaller clusters maximize locality but limit scale and fault domains. Evaluate: (a) 3-node clusters with maximum locality (100%) but limited scale. (b) 6-8 node clusters with moderate locality (~50%) and good scale. (c) Multiple smaller clusters (e.g., 5 x 4-node clusters) vs fewer larger clusters (e.g., 2 x 10-node clusters). Recommend a cluster topology that balances locality, scale, and failure domain requirements."
- RDMA network requirement: "Data locality discussions assume that remote I/O (when locality is lost) has acceptable latency. With TCP-based storage networks, remote latency is 400-1000 us. With RDMA (RoCE v2 or iWARP), remote latency drops to 150-300 us. For our 5,000 VM deployment, is RDMA a requirement or a nice-to-have? What is the hardware cost difference (RDMA NICs, RDMA-capable switches) vs the performance benefit? For Azure Local, RDMA is strongly recommended by Microsoft. For OVE, is RDMA supported with ODF/Ceph?"
- Immutable backup architecture end-to-end: "Design the complete immutable backup architecture: (a) S3 bucket with Object Lock COMPLIANCE mode (Ceph RGW or MinIO), (b) Kasten K10 or Veeam configured to use immutable location profile, (c) bucket policies that deny DELETE for all principals, (d) monitoring/alerting for backup failures, (e) quarterly audit report generation showing: retention policies active, deletion attempts denied, backup success rates, restore test results. Present this as a FINMA-audit-ready package."
Previous: 07-data-protection.md -- Data Protection & Operations (Snapshots, Replication, Encryption, Backup)
Next: This is the final page in the storage study series.