Building the First OVE Cluster: A System Engineer's Journey
The story of Marco Baumann, Senior Infrastructure Engineer at a Swiss financial institution, as he builds a production OpenShift Virtualization Engine cluster from bare metal to running workloads -- with all the mistakes, revelations, and hard-won lessons along the way.
Chapter 1: The Hardware Arrives
The loading dock confirmation email arrived at 6:47 AM on a Tuesday. Marco Baumann read it on his phone while standing in line at the coffee machine, and for a moment his stomach dropped the way it used to before exams at ETH. Twelve servers. Eight switches. More than four hundred terabytes of NVMe flash. The hardware for the bank's first OpenShift Virtualization Engine cluster had landed in the Basel data center, and there was nothing theoretical about the project anymore.
Marco had spent ten years in this building managing VMware -- first as a junior vSphere admin, then as the senior infrastructure engineer who knew every VLAN, every datastore, every DRS rule in the estate. Five thousand VMs. Two data centers. A team of four. He could deploy an ESXi host in his sleep: boot from USB, assign a management IP, add to vCenter, configure vDS uplinks, attach datastores, done. Twenty minutes if the network team had the port configs ready.
This was going to be different.
He pulled up NetBox on his laptop and reviewed the rack plan he had submitted three weeks ago. The physical layout was the one thing that felt familiar. Servers go in racks. Cables connect to switches. Power comes from PDUs. The laws of physics and rack unit math do not care whether you are running ESXi or RHCOS.
The cluster would occupy two adjacent racks in hall B:
Rack B-14 (Network + Control Plane) Rack B-15 (Compute/Storage Workers)
============================================= =============================================
U42 [Spine Switch 1 - Arista 7050X3] U42 [Spine Switch 2 - Arista 7050X3]
U41 [ -- patch panel -- ] U41 [ -- patch panel -- ]
U40 [Leaf Switch 1 - Arista 7020R] U40 [Leaf Switch 4 - Arista 7020R]
U39 [Leaf Switch 2 - Arista 7020R] U39 [Leaf Switch 5 - Arista 7020R]
U38 [Leaf Switch 3 - Arista 7020R] U38 [Leaf Switch 6 - Arista 7020R]
U37 [ -- patch panel -- ] U37 [ -- patch panel -- ]
U36 [Control Plane 1 - HPE DL380 Gen11] U36 [Worker 4 - HPE DL380 Gen11]
U35 [Control Plane 2 - HPE DL380 Gen11] U35 [Worker 5 - HPE DL380 Gen11]
U34 [Control Plane 3 - HPE DL380 Gen11] U34 [Worker 6 - HPE DL380 Gen11]
U33 [Worker 1 - HPE DL380 Gen11] U33 [Worker 7 - HPE DL380 Gen11]
U32 [Worker 2 - HPE DL380 Gen11] U32 [Worker 8 - HPE DL380 Gen11]
U31 [Worker 3 - HPE DL380 Gen11] U31 [Worker 9 - HPE DL380 Gen11]
... ...
U01 [PDU-A / PDU-B] U01 [PDU-A / PDU-B]
Three control plane nodes and nine worker/storage nodes. The control plane nodes were configured with 32 cores, 256 GB RAM, and two 1 TB NVMe drives each -- enough for etcd and the OpenShift control plane services, but they would not run any VMs. The worker nodes were the heavy hitters: 96 cores (2x AMD EPYC 9454, 48 cores each), 1 TB RAM, and twelve 3.84 TB NVMe U.2 drives per server. Each worker also carried two dual-port 25GbE NICs -- four ports total.
Marco had done the power budget in NetBox. Each DL380 Gen11 fully loaded drew about 1,400 watts at peak. Twelve servers at peak was 16,800 watts, plus eight switches at roughly 250 watts each -- 18,800 watts total. Each rack had two 30A 208V PDUs (6,240 watts per PDU, 12,480 watts per rack, 24,960 watts across both racks). Comfortable, with headroom for the next expansion wave. He had learned years ago that the number-one cause of unplanned outages was not software bugs -- it was tripped breakers.
The first difference from VMware hit him when he started planning the out-of-band management setup. In the VMware world, each ESXi host had an iLO/iDRAC/IPMI interface for remote management, and vCenter was the brain. You deployed vCenter as a VM on one of the hosts (or as an appliance on a management cluster), and from there you managed everything. The IPMI interface was a nice-to-have for emergencies.
In the OVE world, the IPMI interface was not a nice-to-have. It was critical infrastructure. The OpenShift installer used BMC (Baseboard Management Controller) access to orchestrate the bootstrap process -- it would PXE-boot the servers, push the ignition configs, and monitor the installation remotely. Marco realized he needed every iLO interface on a dedicated management VLAN (VLAN 10), reachable from the installer machine, before he could even begin the OpenShift installation.
He spent the first day and a half racking servers and running cables. The cabling plan was straightforward but tedious:
Each server had four 25GbE ports:
- Ports 1 and 2 (NIC 1, dual-port): bonded for cluster traffic (API, pod network, GENEVE overlay)
- Ports 3 and 4 (NIC 2, dual-port): bonded for storage traffic (Ceph public + cluster network)
Each bond went to a pair of leaf switches -- port 1 to leaf-A, port 2 to leaf-B -- forming an LACP bond (mode 4, 802.3ad) across the MLAG pair. This was the same architecture he used for ESXi hosts: two uplinks per function, bonded for bandwidth and redundancy, each uplink to a different switch in the pair. The leaf switches were paired using MLAG (Arista's implementation) so that the server saw a single logical switch at each bond endpoint.
The spine-leaf fabric was where things got more interesting. Marco had never configured BGP on a data center switch before. In the VMware world, the network team handled the physical fabric, and he consumed VLANs. Now, as part of the OVE project, the infrastructure team owned the entire stack -- from the switch ASIC to the VM YAML.
He followed the design he had worked through in the networking study sessions. Each leaf pair connected to both spine switches with 100GbE uplinks. The spines ran eBGP with a unique ASN per switch, and each leaf pair shared an ASN within the MLAG domain. The fabric used a /31 point-to-point link between each leaf and spine, and ECMP across the four paths (2 spines x 2 uplinks per spine) to distribute east-west traffic evenly.
Spine-Leaf Fabric (BGP ASN Layout):
        +----------------+            +----------------+
        |    Spine 1     |            |    Spine 2     |
        |   ASN 65000    |            |   ASN 65001    |
        +----------------+            +----------------+
             (100GbE uplinks: every leaf connects to both spines,
              two links per spine, ECMP across all four paths)

  +-------+ +-------+   +-------+ +-------+   +-------+ +-------+
  |Leaf 1 | |Leaf 2 |   |Leaf 3 | |Leaf 4 |   |Leaf 5 | |Leaf 6 |
  |ASN    | |ASN    |   |ASN    | |ASN    |   |ASN    | |ASN    |
  |65010  | |65010  |   |65011  | |65011  |   |65012  | |65012  |
  +-------+ +-------+   +-------+ +-------+   +-------+ +-------+
   \_ MLAG pair 1 _/     \_ MLAG pair 2 _/     \_ MLAG pair 3 _/
           |                     |                     |
        Servers               Servers               Servers
  (ctrl 1-3, wkr 1-3)        (wkr 4-6)             (wkr 7-9)
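The per-leaf switch configuration followed directly from that layout. A sketch of the BGP stanza for Leaf 1 (Arista EOS; the interface numbers, router ID, and point-to-point addresses are illustrative, and only one uplink per spine is shown):
# Leaf 1 -- eBGP to both spines over /31 point-to-point links, ECMP across the uplinks
interface Ethernet49/1
   description uplink-to-spine1
   no switchport
   mtu 9214
   ip address 10.0.1.1/31
interface Ethernet50/1
   description uplink-to-spine2
   no switchport
   mtu 9214
   ip address 10.0.2.1/31
!
router bgp 65010
   router-id 10.0.0.11
   maximum-paths 4 ecmp 4
   neighbor 10.0.1.0 remote-as 65000
   neighbor 10.0.2.0 remote-as 65001
   address-family ipv4
      neighbor 10.0.1.0 activate
      neighbor 10.0.2.0 activate
      network 10.20.0.0/24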
He configured the bonds on the first control plane node using nmcli from the iLO console. Two bonds, each with two slave interfaces:
# Bond for cluster traffic (VLAN 20 for nodes, VLAN 30 for pods)
nmcli connection add type bond con-name bond0 ifname bond0 \
bond.options "mode=802.3ad,miimon=100,lacp_rate=fast,xmit_hash_policy=layer3+4"
nmcli connection add type ethernet con-name bond0-port1 ifname ens1f0 master bond0
nmcli connection add type ethernet con-name bond0-port2 ifname ens1f1 master bond0
# Bond for storage traffic (VLAN 40 for Ceph public, VLAN 41 for Ceph cluster)
nmcli connection add type bond con-name bond1 ifname bond1 \
bond.options "mode=802.3ad,miimon=100,lacp_rate=fast,xmit_hash_policy=layer3+4"
nmcli connection add type ethernet con-name bond1-port1 ifname ens2f0 master bond1
nmcli connection add type ethernet con-name bond1-port2 ifname ens2f1 master bond1
This looked right. He pinged the gateway, traffic flowed, LACP negotiated successfully. He felt good. Then he hit his first real lesson.
He was testing network connectivity between worker-1 and worker-2 using iperf3. The raw throughput on the bond was hitting about 24.5 Gbps -- close to the theoretical 25 Gbps of a single link, which is what a single flow should get, since LACP hashes each flow onto one member of the bond. But when he ran the same test using GENEVE-encapsulated traffic (simulating what the OVN-Kubernetes overlay would do), the throughput dropped to about 23.2 Gbps, and -- worse -- he saw occasional packet drops.
The problem was MTU.
GENEVE encapsulation adds headers to every packet: an outer Ethernet header (14 bytes), an outer IP header (20 bytes), a UDP header (8 bytes), the GENEVE base header (8 bytes), and typically 4-8 bytes of GENEVE options -- roughly 54-58 bytes of overhead in total. If the inner packet was 1500 bytes (standard MTU), the outer frame grew to roughly 1560 bytes. The leaf switches had their interfaces set to the default 1500-byte MTU. Frames over 1500 bytes were being silently dropped.
Marco stared at the packet capture for ten minutes before the light went on. He had configured jumbo frames on his ESXi environment years ago for vSAN traffic and iSCSI, but the rule of thumb there had been "set MTU to 9000 on everything and forget about it." With GENEVE, it was more nuanced. The inner MTU for pods and VMs could stay at 1500, but the outer MTU on the physical infrastructure had to be large enough to accommodate the encapsulation overhead. OVN-Kubernetes defaulted to an inner MTU of 1400, which meant the outer frames were approximately 1500 bytes -- just barely fitting. But if any GENEVE option headers were added, or if the inner workload assumed 1500-byte MTU (which VMs typically did), the outer frame would exceed 1500 bytes.
The fix was to set the physical MTU on all interfaces in the path -- server NICs, switch ports, and inter-switch links -- to at least 9000 bytes (jumbo frames), which gave ample headroom for any encapsulation overhead:
# On each leaf switch (Arista EOS):
interface Ethernet1/1-48
mtu 9214
interface Port-Channel1-24
mtu 9214
# On each spine switch:
interface Ethernet1/1-32
mtu 9214
After the MTU change, the GENEVE test ran clean at full line rate with zero drops. Marco made a note in his runbook: "MTU matters more in an overlay world than it ever did in a VLAN world. Set physical MTU to 9214 everywhere before deploying OpenShift. Do not assume 1500 works."
He would later learn that the OpenShift installer's install-config.yaml allowed setting the cluster network MTU, and that OVN-Kubernetes would automatically calculate the outer MTU by adding the encapsulation overhead to the configured inner MTU. But the physical fabric had to support whatever outer MTU resulted. Getting this wrong produced the kind of intermittent packet loss that could haunt you for weeks -- large packets dropped, small packets fine, making the problem look like an application bug rather than an infrastructure misconfiguration.
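He added one more check to the runbook to prove the jumbo path end to end before the install: a do-not-fragment ping sized to fill a 9000-byte frame -- 8972 bytes of ICMP payload plus 28 bytes of IP and ICMP headers (the target address is illustrative):
# From worker-1 toward worker-2 on the machine network; this succeeds only if every
# hop in the path forwards 9000-byte frames without fragmentation
$ ping -M do -s 8972 -c 4 10.20.0.22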
By the end of day three, all twelve servers were racked, cabled, powered, and reachable via iLO on the management VLAN. The spine-leaf fabric was passing traffic with jumbo frames. The bonds were negotiating LACP correctly on all servers. Marco had a spreadsheet with MAC addresses, BMC IPs, and serial numbers for every component, all mirrored in NetBox.
He leaned back in his chair and looked at the racks through the glass wall of the data center office. In the VMware world, this was where he would have booted the first ESXi installer USB. In the OVE world, the next step was considerably more complex -- but also, he had to admit, more interesting.
Chapter 2: Installing OpenShift
Marco had installed vCenter and added ESXi hosts to clusters perhaps fifty times in his career. The process was always the same: deploy the vCenter Server Appliance (VCSA) from an ISO, connect to the web UI, run the Add Host wizard, accept the SSL certificate, enter root credentials, assign a license, attach to a distributed switch. Imperative, sequential, one thing at a time.
The OpenShift installation was a completely different philosophy. There was no GUI installer. There was no "click here to add a node." The entire cluster -- control plane, workers, networking, authentication -- was declared in a single YAML file and then brought into existence by a bootstrap process that felt more like igniting a chain reaction than building a house.
The heart of it was install-config.yaml. Marco spent an entire morning crafting it, referencing the Red Hat documentation and the networking design he had finalized the week before:
apiVersion: v1
metadata:
name: ove-prod-01
baseDomain: infra.bank.ch
networking:
networkType: OVNKubernetes
clusterNetwork:
- cidr: 10.128.0.0/14
hostPrefix: 23
serviceNetwork:
- 172.30.0.0/16
machineNetwork:
- cidr: 10.20.0.0/24
platform:
baremetal:
apiVIPs:
- 10.20.0.10
ingressVIPs:
- 10.20.0.11
hosts:
- name: ctrl-1
role: master
bmc:
        address: redfish-virtualmedia://10.10.0.101/redfish/v1/Systems/1
username: admin
password: <redacted>
bootMACAddress: "b4:96:91:1a:2c:00"
rootDeviceHints:
deviceName: /dev/nvme0n1
- name: ctrl-2
role: master
bmc:
        address: redfish-virtualmedia://10.10.0.102/redfish/v1/Systems/1
username: admin
password: <redacted>
bootMACAddress: "b4:96:91:1a:2c:10"
rootDeviceHints:
deviceName: /dev/nvme0n1
- name: ctrl-3
role: master
bmc:
        address: redfish-virtualmedia://10.10.0.103/redfish/v1/Systems/1
username: admin
password: <redacted>
bootMACAddress: "b4:96:91:1a:2c:20"
rootDeviceHints:
deviceName: /dev/nvme0n1
# Workers defined similarly (9 entries)...
controlPlane:
  name: master
  replicas: 3
  platform:
    baremetal: {}
compute:
- name: worker
replicas: 9
platform:
baremetal: {}
pullSecret: '<redacted>'
sshKey: 'ssh-ed25519 AAAA... marco@infra.bank.ch'
The CIDRs had taken him the longest to decide. clusterNetwork was the Pod CIDR -- every pod (and every VM, since VMs run as pods) would get an IP from this range. With a /14 and a /23 per node, he had 512 possible nodes with 512 pod IPs each. At 50 VMs per worker node, that was plenty. The serviceNetwork was for Kubernetes Services -- the stable ClusterIPs that fronted groups of pods. The machineNetwork was the physical node network -- the VLAN where the servers' bond0 interfaces lived.
He ran openshift-install create cluster --dir=./install-dir and watched the output scroll past.
The bootstrap process was unlike anything he had seen in VMware. The installer first provisioned a temporary bootstrap node -- an ephemeral virtual machine the installer created with libvirt on the provisioning host -- to bring the control plane to life and then self-destruct. The bootstrap node booted RHCOS with an Ignition config and started a temporary Kubernetes API server. This temporary API server then provisioned the three control plane nodes through their BMCs; they booted RHCOS, joined the cluster, and took over the control plane duties. Once the control plane was stable, the bootstrap node was no longer needed and was removed.
Bootstrap Process Timeline:
T+0 min     Installer creates the bootstrap VM on the provisioning host
            Bootstrap boots RHCOS with its Ignition config
T+5 min Bootstrap starts temporary etcd + API server
Bootstrap starts machine-api-operator
T+10 min Machine-api provisions 3 control plane nodes via BMC
Control plane nodes boot RHCOS
T+20 min Control plane nodes start etcd (3-node quorum)
Control plane takes over API serving
etcd cluster is healthy (3 members)
T+30 min Bootstrap detects control plane is healthy
Bootstrap tears itself down
T+35 min Machine-api provisions 9 worker nodes via BMC
Workers boot RHCOS, join cluster
T+60 min All 12 nodes are Ready
Cluster operators stabilize
Installation complete
Marco watched the terminal for the better part of an hour. At T+22 minutes, the installer reported "Waiting up to 40m0s for the Kubernetes API at https://api.ove-prod-01.infra.bank.ch:6443..." and he felt a knot in his stomach. The wait cursor blinked. Fourteen seconds later: "API v1.29.1+6afe7c1 up". He exhaled.
By T+55 minutes, the cluster was fully installed. All twelve nodes showed Ready:
$ oc get nodes
NAME STATUS ROLES AGE VERSION
ctrl-1 Ready control-plane,master 42m v1.29.1+6afe7c1
ctrl-2 Ready control-plane,master 41m v1.29.1+6afe7c1
ctrl-3 Ready control-plane,master 40m v1.29.1+6afe7c1
worker-1 Ready worker 18m v1.29.1+6afe7c1
worker-2 Ready worker 17m v1.29.1+6afe7c1
worker-3 Ready worker 17m v1.29.1+6afe7c1
worker-4 Ready worker 16m v1.29.1+6afe7c1
worker-5 Ready worker 15m v1.29.1+6afe7c1
worker-6 Ready worker 15m v1.29.1+6afe7c1
worker-7 Ready worker 14m v1.29.1+6afe7c1
worker-8 Ready worker 14m v1.29.1+6afe7c1
worker-9 Ready worker 13m v1.29.1+6afe7c1
Marco stared at the output. In vSphere, "adding 9 compute hosts" meant running the Add Host wizard nine times, each time entering credentials, accepting certificates, configuring networking, and assigning licenses. Here, the installer had booted all nine workers in parallel, pushed identical RHCOS images via Ignition, and they had all joined the cluster autonomously. The declarative model -- define the desired state, let the system converge -- was philosophically different from the imperative, step-by-step approach he had spent a decade using.
And then RHCOS itself. Marco SSH'd into worker-1 to look around:
$ ssh core@worker-1
Red Hat Enterprise Linux CoreOS 416.94.202403051920-0
[core@worker-1 ~]$
The OS was sparse. No yum. No apt. No package manager at all. RHCOS was an immutable, image-based operating system. The root filesystem was read-only, delivered as an OSTree commit, and updated atomically. You could not install packages, edit configuration files under /etc, or customize the OS in the way you would customize a RHEL or Ubuntu server. It was, Marco thought, the polar opposite of ESXi -- which was also a locked-down hypervisor OS, but one that let you install VIBs, edit /etc/vmware/, and tweak advanced settings through the host client.
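A look at the node's deployment stack made the model concrete -- rpm-ostree, not a package manager, owns the filesystem:
[core@worker-1 ~]$ rpm-ostree status
# lists the booted RHCOS deployment (an OSTree commit) and, after an MCO update,
# the one staged for the next reboot; there is nothing to "yum install" --
# changes arrive as a whole new image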
In the RHCOS world, OS configuration was done through MachineConfig objects -- Kubernetes custom resources that the Machine Config Operator (MCO) applied to nodes. Want to add a kernel parameter? Create a MachineConfig. Want to add a systemd unit? MachineConfig. Want to change the chrony NTP configuration? MachineConfig.
Marco learned this the hard way.
His NTP servers were corporate-internal (ntp1.bank.ch and ntp2.bank.ch), not the Red Hat defaults. He SSH'd into worker-1 and edited /etc/chrony.conf manually, replacing the default NTP pool with his corporate servers. He ran systemctl restart chronyd, verified time sync with chronyc sources, and moved on to the next task, feeling productive.
Two days later, the MCO rolled out a configuration update to the worker pool (a routine certificate rotation). As part of this update, the MCO rebooted worker-1 with a fresh RHCOS image -- and Marco's hand-edited chrony.conf was gone. NTP was pointing at the default servers again. Three VMs on worker-1 had drifted by 4 seconds during the reboot window, triggering a Kerberos authentication failure on a Windows SQL Server instance.
The post-mortem was educational. Marco wrote the fix as a Butane file -- the human-readable format that transpiles into the Ignition config a MachineConfig carries -- and rendered it into a proper MachineConfig for the worker pool:
variant: openshift
version: 4.16.0
metadata:
  name: 99-worker-custom-chrony
  labels:
    machineconfiguration.openshift.io/role: worker
storage:
  files:
    - path: /etc/chrony.conf
      mode: 0644
      overwrite: true
      contents:
        inline: |
          server ntp1.bank.ch iburst
          server ntp2.bank.ch iburst
          driftfile /var/lib/chrony/drift
          makestep 1.0 3
          rtcsync
          logdir /var/log/chrony
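Rendering the Butane file into the MachineConfig and applying it took two commands (the file names follow the manifest above):
$ butane 99-worker-custom-chrony.bu -o 99-worker-custom-chrony.yaml
$ oc apply -f 99-worker-custom-chrony.yaml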
After applying this, the MCO rolled out the change to all worker nodes -- rebooting each one sequentially, applying the new chrony configuration as part of the immutable image, and confirming convergence. The NTP configuration was now part of the cluster's desired state, version-controlled and reproducible. If worker-7 died tomorrow and was replaced with new hardware, the MCO would apply the same chrony config to the new node automatically.
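Watching the rollout was a matter of following the pool status:
$ oc get machineconfigpool worker -w
# UPDATED stays False and UPDATING True while the MCO cordons, drains, reboots,
# and uncordons one worker at a time; READYMACHINECOUNT climbs back to 9 at the end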
"In vSphere," Marco told his colleague Anna over coffee, "I would have used a Host Profile or a kickstart script to enforce this. The MachineConfig is the same idea, but it is part of the cluster itself -- not a side process. The node does not need to know about a PXE server or a configuration management tool. The cluster is the configuration management tool."
Anna, who managed the Windows VMs and would be doing much of the migration work, looked skeptical. "And if you need to debug something on the node? What if chronyd is crashing and you need to see why?"
"You can still SSH in and debug," Marco said. "You just cannot make changes that stick. It is like debugging in a container -- you read logs, you look at process state, you trace syscalls, but you do not install packages or edit files. If you need to fix something permanently, you fix it in the MachineConfig and let the MCO roll it out."
This was the conceptual shift that took Marco the longest to internalize. In the VMware world, each ESXi host was a pet -- individually configured, individually maintained, with a unique set of tweaks that had accumulated over years. In the OVE world, nodes were cattle. Identical, replaceable, and configured by the cluster, not by the admin. It was uncomfortable. It was also, he had to admit, more reliable.
By the end of week one, the base OpenShift cluster was running. Twelve nodes, healthy, with OVN-Kubernetes providing the pod network, CoreDNS resolving cluster services, and the OpenShift web console accessible at https://console-openshift-console.apps.ove-prod-01.infra.bank.ch. Marco had logged in with the kubeadmin credentials and was exploring the dashboard. It looked nothing like vCenter. There were no host views, no VM trees, no datastores. There were projects (namespaces), operators, workloads, and monitoring graphs. The learning curve felt vertical.
But the cluster was alive, and that was something.
Chapter 3: Setting Up Storage (ODF)
Marco had managed vSAN for six years. He understood disk groups, storage policies, deduplication and compression, witness appliances for stretched clusters. He had replaced failed disks in production vSAN clusters under pressure. Storage was his comfort zone.
OpenShift Data Foundation -- ODF -- was Ceph in a Kubernetes suit. Rook was the operator that managed Ceph daemons (MON, OSD, MGR, MDS) as Kubernetes pods. The Ceph cluster ran on the same worker nodes that would host the VMs. This was the same hyperconverged model as vSAN: compute and storage co-located on the same servers.
He installed the ODF operator from the OpenShift OperatorHub -- a one-click install through the web console that deployed the Rook-Ceph operator, the CSI drivers (ceph-csi for RBD and CephFS), and the ODF console plugin. Then he created a StorageCluster custom resource that described the desired Ceph cluster:
apiVersion: ocs.openshift.io/v1
kind: StorageCluster
metadata:
name: ocs-storagecluster
namespace: openshift-storage
spec:
manageNodes: false
monDataDirHostPath: /var/lib/rook
storageDeviceSets:
- name: ocs-deviceset-nvme
    count: 36  # 36 sets x 3 replicas = 108 OSDs (12 NVMe drives x 9 worker nodes)
dataPVCTemplate:
spec:
storageClassName: local-nvme
accessModes:
- ReadWriteOnce
volumeMode: Block
resources:
requests:
storage: "3840Gi" # 3.84 TB NVMe drives
replica: 3
portable: false
placement:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: cluster.ocs.openshift.io/openshift-storage
operator: In
values:
- ""
The first thing Marco needed to understand was how Ceph's architecture mapped to what he knew from vSAN.
In vSAN, each host contributed disk groups -- each group containing one cache tier device (SSD or NVMe) and one or more capacity tier devices (SSD or HDD). vSAN striped and replicated data across disk groups using its own distributed object store and vSAN-specific on-disk format (distinct from VMFS).
In Ceph, each physical disk ran its own OSD (Object Storage Daemon). There were no disk groups. Each NVMe drive on each worker node became an independent OSD. With 9 worker nodes and 12 NVMe drives per node, Marco was looking at 108 OSDs. Each OSD managed its own BlueStore instance -- a purpose-built storage engine that wrote directly to the raw block device, bypassing any Linux filesystem.
vSAN vs. Ceph Mapping:
vSAN Ceph (ODF)
==== ==========
Disk Group (cache + capacity) OSD (one per disk, no separate cache tier)
Storage Policy (RAID/mirror) CRUSH Rule + Pool (replication or EC)
VMFS / vSAN on-disk format BlueStore (direct block device access)
vSAN Witness MON (Paxos quorum, min 3)
vSAN Performance Service MGR + Prometheus exporter
Virtual Machine Storage Policy StorageClass (maps to Ceph pool)
Marco labeled the worker nodes for ODF and applied the StorageCluster. He watched the operator create pods in the openshift-storage namespace:
$ oc get pods -n openshift-storage -w
NAME READY STATUS AGE
noobaa-core-0 1/1 Running 5m
noobaa-db-pg-0 1/1 Running 5m
ocs-operator-6d5f6b8c9-xk7qp 1/1 Running 8m
rook-ceph-mgr-a-5f7b8d9c6-lm2np 2/2 Running 3m
rook-ceph-mon-a-7c9f8d6b5-qr4st 2/2 Running 4m
rook-ceph-mon-b-6b8e7c5d4-mn3pq 2/2 Running 4m
rook-ceph-mon-c-5a7d6b4c3-jk2mn 2/2 Running 4m
rook-ceph-osd-0-6c8f7d5e4-rs5tu 2/2 Running 2m
rook-ceph-osd-1-5b7e6c4d3-pq4rs 2/2 Running 2m
...
The MON pods came up first -- three of them, forming the Paxos quorum that maintained the cluster map. Then the MGR pod (the manager daemon, responsible for metrics, dashboard, and the balancer module). Then the OSDs started appearing, one per NVMe drive.
Marco opened the Ceph toolbox to check cluster status:
$ oc rsh -n openshift-storage deploy/rook-ceph-tools
[root@rook-ceph-tools /]# ceph status
  cluster:
    id:     a1b2c3d4-e5f6-7890-abcd-ef1234567890
    health: HEALTH_OK
  services:
    mon: 3 daemons, quorum a,b,c
    mgr: a(active)
    osd: 72 osds: 72 up, 72 in
  data:
    pools:   0 pools, 0 pgs
    objects: 0 objects, 0 B
    usage:   72 GiB used, 251 TiB / 251 TiB avail
    pgs:
Only 72 OSDs -- not the 108 he expected. The last three worker nodes (worker-7 through worker-9) had not been labeled for ODF yet -- Marco had only labeled six workers initially while testing the configuration -- so Rook had never created OSD pods for the 36 drives in those servers.
He labeled the remaining three workers:
for i in 7 8 9; do
oc label node worker-$i cluster.ocs.openshift.io/openshift-storage=""
done
And then he watched the rebalance. This was where his vSAN experience helped -- and where Ceph surprised him.
When he added the last three nodes, Ceph's CRUSH algorithm recalculated the placement of every Placement Group (PG). With the existing data on 72 OSDs now needing to be redistributed across 108 OSDs, Ceph began migrating PGs. The toolbox showed the rebalance in real time:
[root@rook-ceph-tools /]# ceph -w
cluster:
health: HEALTH_WARN
Degraded data redundancy: 128/384 objects degraded
128 pgs not deep-scrubbed in time
2026-03-18 14:32:01.234 osd.73 [INF] 3.2a starting backfill to osd.73 from osd.12
2026-03-18 14:32:01.456 osd.74 [INF] 3.4c starting backfill to osd.74 from osd.31
2026-03-18 14:32:02.789 osd.75 [INF] 2.1f starting backfill to osd.75 from osd.5
...
The cluster was not yet holding production data -- they were still in the setup phase -- so the rebalance was fast. But Marco could imagine what this would look like with 30 TB of VM data spread across the cluster. He remembered from the storage fundamentals study material that CRUSH used a deterministic pseudo-random algorithm: given an object name, a pool, and the CRUSH map, any client could calculate exactly which OSDs held that object. When the CRUSH map changed (new OSDs added), the algorithm produced different OSD assignments for a fraction of the PGs, and Ceph migrated data to match.
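The determinism was easy to see from the toolbox: ask Ceph where an object lands, and any client holding the same CRUSH map computes the same answer (the pool is ODF's default block pool; the object name is made up):
[root@rook-ceph-tools /]# ceph osd map ocs-storagecluster-cephblockpool rbd_data.example.0001
# prints the placement group and the acting set of OSDs for that object,
# e.g. "... -> pg 2.1f ... acting ([17,64,92], p17)"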
The PG states were initially confusing. He saw PGs in states like active+remapped+backfilling and active+undersized+degraded. The Ceph documentation was dense, and the PG state machine had dozens of possible states. But the key insight was simple: active meant the PG could serve I/O, and backfilling meant data was being moved to achieve the desired placement. Once all PGs reached active+clean, the rebalance was complete.
# 15 minutes later:
[root@rook-ceph-tools /]# ceph status
cluster:
health: HEALTH_OK
services:
mon: 3 daemons, quorum a,b,c
mgr: a(active)
osd: 108 total, 108 up, 108 in
data:
pools: 3 pools, 384 pgs
objects: 0 objects, 0 B
    usage:   108 GiB used, 377 TiB / 377 TiB avail
pgs: 384 active+clean
377 TiB of raw capacity. Marco did the math he already knew but that still stung: with 3-way replication (the default for production block storage), usable capacity was roughly one-third of raw. 377 TiB raw became approximately 125 TiB usable. His twelve 3.84 TB drives per node, nine nodes, was 414 TB (about 377 TiB) raw, minus some overhead for BlueStore metadata, OSD journals (WAL/DB), and Ceph system pools.
"One hundred and twenty-five terabytes," he said to himself. He had been running 180 TB on the VMware estate (across vSAN and SAN). The capacity math for 3x replication was something he had known intellectually from the study material, but seeing it as a real number on his real cluster made it visceral. He would need to request more storage in the next procurement cycle, or use erasure coding for the less critical tiers.
Now for the StorageClasses. Marco wanted to offer three tiers, mirroring the storage policies he had in vSphere:
# Gold: NVMe-backed, 3-way replication, for tier-1 databases
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
name: ceph-rbd-gold
provisioner: openshift-storage.rbd.csi.ceph.com
parameters:
clusterID: openshift-storage
pool: ocs-storagecluster-cephblockpool # 3-replica on NVMe
imageFeatures: layering,exclusive-lock,object-map,fast-diff
csi.storage.k8s.io/provisioner-secret-name: rook-csi-rbd-provisioner
csi.storage.k8s.io/provisioner-secret-namespace: openshift-storage
csi.storage.k8s.io/node-stage-secret-name: rook-csi-rbd-node
csi.storage.k8s.io/node-stage-secret-namespace: openshift-storage
csi.storage.k8s.io/fstype: ext4
reclaimPolicy: Delete
allowVolumeExpansion: true
volumeBindingMode: Immediate
---
# Silver: SSD-backed, 3-way replication, for general workloads
# (In this cluster, all drives are NVMe, so silver = same hardware
# but with different IOPS limits via Ceph QoS in a future release.
# For now, silver is a logical separation for capacity planning.)
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
name: ceph-rbd-silver
provisioner: openshift-storage.rbd.csi.ceph.com
parameters:
clusterID: openshift-storage
pool: ocs-storagecluster-cephblockpool
imageFeatures: layering,exclusive-lock,object-map,fast-diff
csi.storage.k8s.io/provisioner-secret-name: rook-csi-rbd-provisioner
csi.storage.k8s.io/provisioner-secret-namespace: openshift-storage
csi.storage.k8s.io/node-stage-secret-name: rook-csi-rbd-node
csi.storage.k8s.io/node-stage-secret-namespace: openshift-storage
reclaimPolicy: Delete
allowVolumeExpansion: true
volumeBindingMode: Immediate
---
# Bronze: Erasure-coded pool, for dev/test and bulk data
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
name: ceph-rbd-bronze
provisioner: openshift-storage.rbd.csi.ceph.com
parameters:
clusterID: openshift-storage
  pool: ocs-storagecluster-cephblockpool  # replicated pool for RBD metadata (omap cannot live on an EC pool)
  dataPool: ceph-ec-data-pool             # EC 4+2 pool holds the data objects (see below)
imageFeatures: layering,exclusive-lock,object-map,fast-diff
csi.storage.k8s.io/provisioner-secret-name: rook-csi-rbd-provisioner
csi.storage.k8s.io/provisioner-secret-namespace: openshift-storage
csi.storage.k8s.io/node-stage-secret-name: rook-csi-rbd-node
csi.storage.k8s.io/node-stage-secret-namespace: openshift-storage
reclaimPolicy: Delete
allowVolumeExpansion: true
volumeBindingMode: Immediate
The erasure-coded pool for bronze was a calculation. EC 4+2 (4 data chunks + 2 parity chunks) gave approximately 66% usable capacity from raw (compared to 33% for 3x replication), but at the cost of higher write latency and CPU overhead. For dev/test VMs and bulk storage, this was an acceptable trade-off.
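The erasure-coded pool itself was another custom resource -- a Rook CephBlockPool with an EC profile. A sketch, matching the pool name used by the bronze StorageClass above (RBD keeps its metadata in the replicated pool referenced there):
apiVersion: ceph.rook.io/v1
kind: CephBlockPool
metadata:
  name: ceph-ec-data-pool
  namespace: openshift-storage
spec:
  failureDomain: host
  erasureCoded:
    dataChunks: 4      # EC 4+2: data survives the loss of any two hosts
    codingChunks: 2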
There was also the NetApp question. The bank had a large NetApp ONTAP deployment (FAS and AFF arrays) that was not going away. The storage team was not ready to move everything to Ceph, and several applications had hard dependencies on NFS exports from NetApp. Marco installed NetApp's Trident CSI driver and created additional StorageClasses:
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
name: ontap-san-gold
provisioner: csi.trident.netapp.io
parameters:
backendType: "ontap-san"
storagePools: "aff-a400-pool1"
fsType: "ext4"
encryption: "true"
reclaimPolicy: Delete
allowVolumeExpansion: true
---
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
name: ontap-nas-shared
provisioner: csi.trident.netapp.io
parameters:
backendType: "ontap-nas"
storagePools: "fas-9500-pool1"
reclaimPolicy: Retain
allowVolumeExpansion: true
This gave him a complete storage menu: Ceph RBD for VMs that should run on the converged cluster, and NetApp Trident for VMs that needed access to existing SAN/NAS infrastructure. The beauty of the CSI abstraction was that a VM did not know or care which backend served its disk. The StorageClass was the only decision point.
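In practice the decision point looked like this: a DataVolume -- the disk half of a VM definition -- picks its backend simply by naming a StorageClass, and swapping ontap-san-gold for ceph-rbd-gold changes nothing else (the name and size here are illustrative):
apiVersion: cdi.kubevirt.io/v1beta1
kind: DataVolume
metadata:
  name: legacy-app-disk
  namespace: sandbox
spec:
  source:
    blank: {}
  storage:
    storageClassName: ontap-san-gold   # the only backend-specific line
    accessModes:
      - ReadWriteOnce
    resources:
      requests:
        storage: 200Gi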
Marco set up the Ceph monitoring dashboards in Grafana. The ODF operator deployed a pre-configured Grafana instance with dashboards for cluster health, OSD utilization, pool I/O, and PG states. He stared at the OSD utilization graph for a while -- all 108 OSDs showing near-zero utilization, flat lines waiting for workloads. That would change soon enough.
He ran a quick fio benchmark to establish baseline performance, from a throwaway test pod (image nixery.dev/fio) that had a ceph-rbd-gold PVC mounted at /data:
$ oc exec fio-test -- \
    fio --name=test --ioengine=libaio --direct=1 --rw=randwrite \
        --bs=4k --numjobs=8 --iodepth=32 --size=10G \
        --filename=/data/testfile --runtime=60 --group_reporting
# Results:
write: IOPS=48.2k, BW=188MiB/s (197MB/s)
lat (usec): min=128, max=12045, avg=425.3, stdev=312.7
clat percentiles (usec):
| 50th=[ 326], 75th=[ 474], 90th=[ 750], 99th=[ 1876]
48,000 random write IOPS at 4K block size from a single client. The latency was higher than raw NVMe (which would be sub-100 microseconds), reflecting the overhead of the Ceph replication path: client write -> primary OSD -> replicate to secondary + tertiary OSD -> acknowledge. That replication latency was the tax you paid for data protection. vSAN had the same tax, and in his experience the numbers were comparable. Good enough.
The storage foundation was laid. Now he needed to build the network layer on top of it.
Chapter 4: Configuring the Network
The network was where Marco felt the biggest gap between his VMware knowledge and the OVE world. In vSphere, networking was port groups and distributed switches. You created a port group on a vDS, assigned a VLAN, and connected VM NICs to it. NSX added an overlay network with microsegmentation, but even NSX mapped reasonably well to physical networking concepts.
In OVE, the primary network layer was OVN-Kubernetes -- an overlay network built on Open Virtual Switch (OVS) using GENEVE tunnels. Every pod (and every VM) got an IP from the pod CIDR, and OVN handled routing between pods across nodes through GENEVE encapsulation. This was the default network, and it worked out of the box after the cluster installation.
But VMs were not containers. The bank's VMs needed to connect to existing VLANs: production VLAN 200, database VLAN 300, management VLAN 100. These were physical VLANs that existed on the leaf switches, carried traffic to and from systems outside the OVE cluster, and had firewall rules that referenced specific IP ranges. The VMs needed "real" IPs on these VLANs, not pod IPs on the GENEVE overlay.
This was where Multus came in. Multus was a CNI meta-plugin that allowed a pod (or VM) to have multiple network interfaces. The default interface (eth0) connected to the OVN-Kubernetes overlay. Additional interfaces (net1, net2, ...) connected to secondary networks defined by NetworkAttachmentDefinitions (NADs). Each NAD specified a CNI plugin and configuration for the secondary network.
Marco created NADs for each VLAN:
apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
name: vlan200-production
namespace: production-vms
annotations:
k8s.v1.cni.cncf.io/resourceName: bridge.network.kubevirt.io/br-prod
spec:
config: |
{
"cniVersion": "0.3.1",
"name": "vlan200-production",
"type": "bridge",
"bridge": "br-prod",
"vlan": 200,
"ipam": {},
"macspoofchk": true,
"preserveDefaultVlan": false
}
---
apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
name: vlan300-database
namespace: database-vms
annotations:
k8s.v1.cni.cncf.io/resourceName: bridge.network.kubevirt.io/br-db
spec:
config: |
{
"cniVersion": "0.3.1",
"name": "vlan300-database",
"type": "bridge",
"bridge": "br-db",
"vlan": 300,
"ipam": {},
"macspoofchk": true,
"preserveDefaultVlan": false
}
---
apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
name: vlan100-management
namespace: default
annotations:
k8s.v1.cni.cncf.io/resourceName: bridge.network.kubevirt.io/br-mgmt
spec:
config: |
{
"cniVersion": "0.3.1",
"name": "vlan100-management",
"type": "bridge",
"bridge": "br-mgmt",
"vlan": 100,
"ipam": {}
}
He configured the host-level bridges using NMState, the operator that managed node network configuration declaratively:
apiVersion: nmstate.io/v1
kind: NodeNetworkConfigurationPolicy
metadata:
name: br-prod-workers
spec:
nodeSelector:
node-role.kubernetes.io/worker: ""
desiredState:
interfaces:
- name: br-prod
type: linux-bridge
state: up
bridge:
options:
stp:
enabled: false
port:
- name: bond0
vlan:
mode: trunk
trunk-tags:
- id: 200
Marco realized that this was the OVE version of creating a port group on a vDS. The NetworkAttachmentDefinition was the port group -- it defined which network a VM NIC connected to. The NMState-managed bridge was the virtual switch -- it bridged the VM's traffic onto the physical bond with the correct VLAN tag. The analogy was not perfect (a vDS was more sophisticated, with traffic shaping, port mirroring, and LACP management built in), but the function was the same: connect VMs to VLANs.
NSX / vDS Mapping to OVE:
vSphere OVE
======= ===
vDS Port Group (VLAN 200) -> NetworkAttachmentDefinition (VLAN 200)
vDS (Distributed Switch) -> Linux bridge (br-prod) via NMState
VM NIC on port group -> VM spec: interface with multus NAD reference
NSX Segment -> OVN logical switch (for overlay networks)
NSX T1 Router -> OVN logical router (cluster-internal)
NSX DFW rule -> NetworkPolicy / MultiNetworkPolicy
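Connecting a VM NIC to one of those VLANs was then a small addition to the VM spec: the interface references the NetworkAttachmentDefinition by name. A sketch, abridged to the relevant fields and assuming the VM lives in the same namespace as the vlan200-production NAD defined above:
# Inside the VirtualMachine's template spec (abridged)
spec:
  domain:
    devices:
      interfaces:
        - name: default
          masquerade: {}        # eth0 stays on the OVN-Kubernetes overlay
        - name: prod-vlan
          bridge: {}            # second NIC bridges onto VLAN 200
  networks:
    - name: default
      pod: {}
    - name: prod-vlan
      multus:
        networkName: vlan200-production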
Next was MetalLB. In the VMware world, VMs got their IPs from DHCP or static assignment, and those IPs were routable on the physical network because the VMs were directly connected to VLANs. In OVE, VMs on the default overlay network had pod IPs that were not routable outside the cluster. For services that needed to be reachable from the corporate network -- load balancers, monitoring endpoints, management interfaces -- Marco needed MetalLB.
MetalLB operated in two modes: Layer 2 (ARP-based, simple but limited to a single node as the traffic entry point) and BGP (the MetalLB speaker on each node announced service IPs to the ToR switches via BGP, providing true ECMP load distribution). Marco chose BGP mode, since his spine-leaf fabric was already running BGP:
apiVersion: metallb.io/v1beta1
kind: MetalLB
metadata:
name: metallb
namespace: metallb-system
---
apiVersion: metallb.io/v1beta2
kind: BGPPeer
metadata:
name: leaf-switch-1
namespace: metallb-system
spec:
myASN: 65020
peerASN: 65010
peerAddress: 10.20.0.1
holdTime: "90s"
keepaliveTime: "30s"
---
apiVersion: metallb.io/v1beta2
kind: BGPPeer
metadata:
name: leaf-switch-2
namespace: metallb-system
spec:
myASN: 65020
peerASN: 65010
peerAddress: 10.20.0.2
holdTime: "90s"
keepaliveTime: "30s"
---
apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
name: production-pool
namespace: metallb-system
spec:
addresses:
- 10.20.100.0/24 # External IPs for LoadBalancer services
---
apiVersion: metallb.io/v1beta1
kind: BGPAdvertisement
metadata:
name: production-advertisement
namespace: metallb-system
spec:
ipAddressPools:
- production-pool
peers:
- leaf-switch-1
- leaf-switch-2
With this configuration, any Kubernetes Service of type LoadBalancer would receive an IP from the 10.20.100.0/24 pool, and MetalLB's speaker pods would announce that IP to the leaf switches via BGP. The leaf switches would install the route and forward traffic to the announcing worker node. If the worker node failed, MetalLB on another node would take over the announcement.
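Consuming the pool was then ordinary Kubernetes: a Service of type LoadBalancer in front of a VM's virt-launcher pod picks up an external IP and is announced over BGP. A sketch, with an illustrative VM name and MetalLB's pool-selection annotation:
apiVersion: v1
kind: Service
metadata:
  name: legacy-app-https
  namespace: production-vms
  annotations:
    metallb.universe.tf/address-pool: production-pool
spec:
  type: LoadBalancer
  selector:
    kubevirt.io/domain: legacy-app-01   # label carried by the VM's virt-launcher pod
  ports:
    - name: https
      port: 443
      targetPort: 443
      protocol: TCP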
Then came the security layer. In vSphere, Marco had used NSX Distributed Firewall (DFW) rules to implement microsegmentation -- controlling which VMs could talk to which other VMs at the granularity of individual VM NICs. The rules were defined in terms of security groups, tags, and IP sets, and applied consistently across all hosts by the NSX distributed firewall kernel module.
In OVE, the equivalent was NetworkPolicy (for pod network traffic) and MultiNetworkPolicy (for secondary network traffic on Multus interfaces). Marco created a NetworkPolicy to isolate the database VMs:
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: database-isolation
namespace: database-vms
spec:
podSelector:
matchLabels:
tier: database
policyTypes:
- Ingress
- Egress
ingress:
- from:
- namespaceSelector:
matchLabels:
network-tier: application
podSelector:
matchLabels:
app: backend-api
ports:
- port: 5432
protocol: TCP
- port: 1521
protocol: TCP
egress:
- to:
- namespaceSelector:
matchLabels:
network-tier: infrastructure
ports:
- port: 53
protocol: UDP
- port: 53
protocol: TCP
- port: 123
protocol: UDP
This policy said: "Database VMs can only receive inbound connections from backend-api pods in the application namespace, on PostgreSQL (5432) and Oracle (1521) ports. Database VMs can only make outbound connections to the infrastructure namespace for DNS (53) and NTP (123)."
It worked. But Marco immediately noticed the gaps compared to NSX DFW.
First, there was no Traceflow equivalent. In NSX, when a firewall rule blocked traffic unexpectedly, he could inject a traceflow packet from the source VM and watch it traverse the data path, seeing exactly which rule on which host dropped it. In OVE, the closest tool was ovn-trace, which simulated packet forwarding through the OVN logical pipeline:
$ oc rsh -n openshift-ovn-kubernetes ovnkube-node-xxxxx
# Simulate a packet from a database VM to an application VM
$ ovn-trace --summary sw0 \
'inport=="database-vms_oracle-db-01" && eth.src==0a:58:0a:80:02:05 &&
eth.dst==0a:58:0a:80:00:0a && ip4.src==10.128.2.5 &&
ip4.dst==10.128.0.10 && tcp.dst==8080'
# Output shows each logical flow the packet matches:
# ingress(ls=sw0): check source MAC/IP binding -> allow
# ingress(ls=sw0): ACL tier="NetworkPolicy" match -> drop
It was useful but not as intuitive as NSX Traceflow's GUI, which showed the packet path on a visual topology diagram. Marco missed that.
Second, NetworkPolicies were namespace-scoped and used label selectors, not IP-based rules. In NSX, he could write a rule that said "allow 10.1.1.0/24 to 10.2.2.0/24 on port 443." In Kubernetes, the rule was "allow pods with label app=frontend in namespace web-tier to pods with label app=backend in namespace api-tier on port 443." This was actually more flexible -- labels were dynamic and followed the workload regardless of IP -- but it required a mental shift from IP-centric to identity-centric security policies.
Third, there was no stateful Layer 7 inspection. NSX DFW could inspect application protocols (HTTP, SQL, DNS) and make policy decisions based on content. Kubernetes NetworkPolicies operated at Layer 3/4 only -- IP addresses, ports, and protocols. For Layer 7 policies, Marco would need a service mesh (Istio/OpenShift Service Mesh), which was a separate project he was not ready to tackle yet.
He made a note: "NetworkPolicies cover 80% of what NSX DFW did for us. The remaining 20% needs MultiNetworkPolicy for Multus interfaces, AdminNetworkPolicy for cluster-wide default rules, and eventually a service mesh for L7. Accept this as a phased implementation."
Finally, Marco installed the Network Observability Operator. This deployed flow collectors on each node that captured network flows (using eBPF) and forwarded them to a Loki-based backend for storage and querying. The operator provided a dashboard in the OpenShift console that showed traffic topology, flow rates, and DNS queries. It was not vRealize Network Insight, but it was enough to answer the question "who is talking to whom?" -- which was essential for troubleshooting and capacity planning.
$ oc get pods -n netobserv
NAME READY STATUS AGE
flowlogs-pipeline-7c9f8d6b5-abc12 1/1 Running 5m
flowlogs-pipeline-7c9f8d6b5-def34 1/1 Running 5m
netobserv-controller-manager-5f7b8d9c6-gh56 1/1 Running 8m
netobserv-plugin-6d5f6b8c9-ij78 1/1 Running 5m
The network layer was complete. OVN-Kubernetes for the overlay, Multus with bridge CNI for VLAN access, MetalLB for external IP advertisement, NetworkPolicies for microsegmentation, and the Network Observability Operator for visibility. It was not a single product like NSX -- it was a composition of operators and plugins -- but each piece did its job, and together they provided the network foundation the VMs needed.
Chapter 5: The First VM
Marco had been looking at YAML for two weeks straight. Networking YAML. Storage YAML. Machine configuration YAML. It was time to create something that actually did something -- a virtual machine.
In vSphere, creating a VM was a GUI workflow: right-click a cluster, select New Virtual Machine, walk through the wizard (name, folder, datastore, network, guest OS, CPU, memory, disks), click Finish. Two minutes, maybe three if you were being careful about the storage policy.
In OVE, a VM was a YAML manifest. Marco opened his editor and started typing:
apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
name: test-linux-01
namespace: sandbox
labels:
app: test
os: rhel9
spec:
running: true
template:
metadata:
labels:
app: test
kubevirt.io/vm: test-linux-01
spec:
domain:
cpu:
cores: 4
sockets: 1
threads: 1
memory:
guest: 8Gi
machine:
type: q35
firmware:
bootloader:
efi: {}
devices:
disks:
- name: rootdisk
disk:
bus: virtio
- name: cloudinitdisk
disk:
bus: virtio
interfaces:
- name: default
masquerade: {}
rng: {}
networks:
- name: default
pod: {}
volumes:
- name: rootdisk
dataVolume:
name: test-linux-01-root
- name: cloudinitdisk
cloudInitNoCloud:
userData: |
#cloud-config
hostname: test-linux-01
user: marco
ssh_authorized_keys:
- ssh-ed25519 AAAA... marco@infra.bank.ch
packages:
- qemu-guest-agent
runcmd:
- systemctl enable --now qemu-guest-agent
evictionStrategy: LiveMigrate
dataVolumeTemplates:
- metadata:
name: test-linux-01-root
spec:
source:
http:
url: "https://images.internal.bank.ch/rhel9/rhel-9.4-x86_64-kvm.qcow2"
storage:
resources:
requests:
storage: 50Gi
storageClassName: ceph-rbd-gold
accessModes:
- ReadWriteMany
He applied it:
$ oc apply -f test-linux-01.yaml
virtualmachine.kubevirt.io/test-linux-01 created
And watched:
$ oc get vm,vmi,dv -n sandbox -w
NAME AGE STATUS READY
virtualmachine.kubevirt.io/test-linux-01 5s Provisioning False
NAME AGE PHASE
datavolume.cdi.kubevirt.io/test-linux-01-root 5s ImportScheduled
...
datavolume.cdi.kubevirt.io/test-linux-01-root 12s ImportInProgress 23.00%
...
datavolume.cdi.kubevirt.io/test-linux-01-root 87s ImportInProgress 94.00%
...
datavolume.cdi.kubevirt.io/test-linux-01-root 98s Succeeded
NAME AGE PHASE IP NODENAME
virtualmachineinstance.kubevirt.io/test-linux-01 10s Running 10.128.2.47 worker-3
The DataVolume first downloaded the RHEL 9 qcow2 image from the internal image server, converted it to raw format, and wrote it to a Ceph RBD PVC. Then the VirtualMachine transitioned to Running, and a VirtualMachineInstance appeared -- the running instance, mapped to a virt-launcher pod on worker-3.
Marco connected:
$ virtctl console test-linux-01 -n sandbox
Successfully connected to test-linux-01 console. The escape sequence is ^]
Red Hat Enterprise Linux 9.4 (Plow)
Kernel 5.14.0-427.el9.x86_64 on an x86_64
test-linux-01 login: marco
[marco@test-linux-01 ~]$
It was alive. A RHEL 9 VM, running on KVM inside a virt-launcher pod inside a Kubernetes cluster on RHCOS. The cloud-init had set the hostname, created his user, installed the QEMU guest agent, and injected his SSH key. The whole process from oc apply to login prompt had taken about two minutes.
Now the harder one: Windows.
Marco downloaded the Windows Server 2022 ISO and the VirtIO driver ISO (virtio-win.iso) to the image server. Windows VMs on KVM needed VirtIO drivers for disk and network -- without them, the Windows installer could not see the virtual disk (the VirtIO block device was invisible to Windows until the viostor driver was loaded) and would drop to a "no drives found" screen.
He created the Windows VM with an install ISO and the VirtIO driver ISO attached as CD-ROM drives:
apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
name: test-win-01
namespace: sandbox
spec:
running: true
template:
spec:
domain:
cpu:
cores: 4
sockets: 1
threads: 2
memory:
guest: 8Gi
machine:
type: q35
firmware:
bootloader:
efi:
secureBoot: false # Secure Boot after driver install
clock:
utc: {}
timer:
hpet:
present: false
hyperv: {}
rtc:
tickPolicy: catchup
features:
acpi: {}
apic: {}
hyperv:
relaxed: {}
vapic: {}
spinlocks:
spinlocks: 8191
vpindex: {}
runtime: {}
synic: {}
stimer:
direct: {}
frequencies: {}
ipi: {}
tlbflush: {}
devices:
disks:
- name: rootdisk
disk:
bus: virtio
- name: install-iso
cdrom:
bus: sata
- name: virtio-drivers
cdrom:
bus: sata
interfaces:
- name: default
masquerade: {}
tpm: {}
networks:
- name: default
pod: {}
volumes:
- name: rootdisk
dataVolume:
name: test-win-01-root
- name: install-iso
persistentVolumeClaim:
claimName: windows-2022-iso
- name: virtio-drivers
persistentVolumeClaim:
claimName: virtio-win-iso
dataVolumeTemplates:
- metadata:
name: test-win-01-root
spec:
source:
blank: {}
storage:
resources:
requests:
storage: 100Gi
storageClassName: ceph-rbd-gold
accessModes:
- ReadWriteMany
He booted the VM and connected via VNC:
$ virtctl vnc test-win-01 -n sandbox
The Windows Server 2022 installer came up. He selected "Custom: Install Windows only." And there it was -- the dreaded "Where do you want to install Windows?" screen with an empty drive list. No drives found.
This was expected. The VirtIO SCSI disk was invisible to the Windows installer without the VirtIO driver. Marco clicked "Load driver," browsed to the VirtIO CD-ROM drive (D:), navigated to viostor\2k22\amd64\, and selected the VirtIO storage driver. The 100 GB disk appeared. He also loaded the VirtIO network driver (NetKVM\2k22\amd64\) and the balloon driver (Balloon\2k22\amd64\) while he was at it -- better to install all VirtIO drivers during initial setup than to fight with Device Manager later.
Installation took about 15 minutes. He configured the administrator password, logged in, and verified that the network was working (the VirtIO NIC was recognized and had received a DHCP address from KubeVirt's masquerade DHCP server). He installed the QEMU Guest Agent for Windows (qemu-ga-x86_64.msi from the VirtIO ISO), which would allow KubeVirt to report the VM's IP address, hostname, and filesystem information through the VirtualMachineInstance status.
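With the agent running, the cluster could see inside the guest. Two quick checks, both reading what the agent reports into the VirtualMachineInstance status:
# OS details as reported by the guest agent
$ oc get vmi test-win-01 -n sandbox -o jsonpath='{.status.guestOSInfo.prettyName}'
# Filesystems and free space as seen from inside the guest
$ virtctl fslist test-win-01 -n sandbox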
With both test VMs running, Marco explored the operations that mapped to his daily vSphere workflows.
Snapshots:
# Create a snapshot
$ cat <<EOF | oc apply -f -
apiVersion: snapshot.kubevirt.io/v1beta1
kind: VirtualMachineSnapshot
metadata:
name: test-linux-01-snap-pre-update
namespace: sandbox
spec:
source:
apiGroup: kubevirt.io
kind: VirtualMachine
name: test-linux-01
EOF
virtualmachinesnapshot.snapshot.kubevirt.io/test-linux-01-snap-pre-update created
# Check status
$ oc get vmsnapshot -n sandbox
NAME SOURCEKIND SOURCENAME PHASE READYTOUSE AGE
test-linux-01-snap-pre-update VirtualMachine test-linux-01 Succeeded true 45s
The snapshot had frozen the guest filesystem (via the QEMU guest agent), created a Ceph RBD snapshot of the PVC, and recorded the VM's metadata. In vSphere, this was a right-click -> "Take Snapshot." Here it was a YAML manifest. Different ergonomics, same result.
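Rolling back was the mirror image: stop the VM, then create a VirtualMachineRestore that points at the snapshot (a sketch, reusing the snapshot created above):
apiVersion: snapshot.kubevirt.io/v1beta1
kind: VirtualMachineRestore
metadata:
  name: test-linux-01-restore-pre-update
  namespace: sandbox
spec:
  target:
    apiGroup: kubevirt.io
    kind: VirtualMachine
    name: test-linux-01
  virtualMachineSnapshotName: test-linux-01-snap-pre-update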
Cloning:
apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
name: test-linux-02
namespace: sandbox
spec:
running: true
dataVolumeTemplates:
- metadata:
name: test-linux-02-root
spec:
source:
pvc:
namespace: sandbox
name: test-linux-01-root
storage:
resources:
requests:
storage: 50Gi
storageClassName: ceph-rbd-gold
accessModes:
- ReadWriteMany
template:
metadata:
labels:
kubevirt.io/vm: test-linux-02
spec:
# ... same spec as test-linux-01 ...
volumes:
- name: rootdisk
dataVolume:
name: test-linux-02-root
- name: cloudinitdisk
cloudInitNoCloud:
userData: |
#cloud-config
hostname: test-linux-02
# cloud-init changes the hostname on first boot
The clone used Ceph RBD's copy-on-write mechanism -- the ceph rbd clone operation created a new RBD image that shared the parent's data blocks and only stored the delta. The clone was nearly instantaneous regardless of disk size. In vSphere, the equivalent was a linked clone, which used a delta VMDK. Same concept, different storage engine.
Templates and cloud-init:
Marco realized that the combination of DataVolumeTemplates and cloud-init was actually more powerful than vSphere templates. In vSphere, a template was a frozen VM image. You cloned it to create a new VM, then used Guest OS Customization (limited to hostname, IP, domain join) or ran a script to finish the configuration. Cloud-init, which worked in vSphere too (via open-vm-tools), was often an afterthought.
In OVE, cloud-init was a first-class citizen. The cloud-init data was part of the VM's YAML definition -- written inline, as in the examples above, or referenced from a Secret -- and presented to the guest as a small virtual disk using the NoCloud datasource. That meant the cloud-init configuration was version-controlled in Git alongside the VM spec. Change the hostname, the SSH keys, the NTP config, the DNS settings, the CA certificates -- all in YAML, all tracked, all reproducible.
# A cloud-init config that would take 10 clicks in vSphere
cloudInitNoCloud:
userData: |
#cloud-config
hostname: app-server-42
fqdn: app-server-42.prod.bank.ch
users:
- name: ansible
ssh_authorized_keys:
- ssh-ed25519 AAAA... ansible@bank.ch
sudo: ALL=(ALL) NOPASSWD:ALL
write_files:
- path: /etc/pki/ca-trust/source/anchors/bank-root-ca.pem
content: |
-----BEGIN CERTIFICATE-----
MIIFxTCCA62gAwIBAgIU...
-----END CERTIFICATE-----
packages:
- qemu-guest-agent
- chrony
- tuned
runcmd:
- update-ca-trust
- systemctl enable --now qemu-guest-agent
- tuned-adm profile virtual-guest
ntp:
servers:
- ntp1.bank.ch
- ntp2.bank.ch
networkData: |
version: 2
ethernets:
eth0:
addresses:
- 10.200.1.42/24
gateway4: 10.200.1.1
nameservers:
addresses: [10.100.1.10, 10.100.1.11]
search: [prod.bank.ch, bank.ch]
And then the moment that changed Marco's perspective on the whole project. He was writing the third VM definition, copy-pasting YAML and changing names and IPs. And it hit him: these are just files. Text files. He could put them in a Git repository. He could write a script that generated 50 VM YAMLs from a CSV input. He could use Helm charts or Kustomize overlays to manage variations across environments. He could run oc apply -f . to deploy an entire application stack -- VMs, services, network policies -- from a single directory.
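Even the crude version of that idea worked. A few lines of shell stamp out one manifest per VM from a template file -- a sketch, assuming a vm-template.yaml with a __VM_NAME__ placeholder:
# Render one manifest per VM name, then apply the whole directory
mkdir -p rendered
for name in app-server-41 app-server-42 app-server-43; do
  sed "s/__VM_NAME__/${name}/g" vm-template.yaml > "rendered/${name}.yaml"
done
oc apply -f rendered/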
In vSphere, the VM configuration lived in vCenter's PostgreSQL database and in .vmx files on VMFS datastores. You could export OVAs, but the VM definition was not natively version-controllable. You could not diff two VM configurations, review a change in a pull request, or roll back to a previous version by reverting a commit.
In OVE, a VM was a YAML file. Its storage was a PVC. Its network was a NAD reference. Its security policy was a NetworkPolicy. Everything was a Kubernetes resource, and every Kubernetes resource was a YAML manifest, and every YAML manifest could live in Git.
Marco created a Git repository called vm-definitions, committed his three test VM YAMLs, and pushed them. He stared at the commit log in GitLab. Three VMs, fully defined, version-controlled, reviewable, auditable.
"This," he said quietly to his monitor, "is actually better."
Chapter 6: Security and Compliance
The bank's CISO had sent a requirements document three weeks before the first server arrived. It was twelve pages long and referenced FINMA Circular 2018/3 (outsourcing), ISO 27001, and the bank's internal security standard. The non-negotiable items were: access control with least privilege, encryption at rest, encryption in transit, immutable audit logs, secure boot for Windows VMs, and quarterly compliance evidence.
Marco had been doing this in vSphere for years. vCenter permissions with roles (VM Administrator, Read-Only, Network Admin) assigned to AD groups at the folder or cluster level. vSAN encryption or SAN-level encryption at rest. TLS everywhere. ESXi audit logs shipped to Splunk. Secure Boot enabled on UEFI VMs.
In OVE, the security model was different -- and in some ways, stronger.
RBAC in Kubernetes was more granular than vSphere permissions. In vSphere, permissions were assigned at the inventory level (folder, cluster, resource pool) and could control broad actions (VM Power User, Datastore User). In Kubernetes, RBAC operated at the API verb level for specific resource types within specific namespaces.
Marco created a role for the application team that could manage VMs in their namespace but could not modify cluster resources or access other namespaces:
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
name: vm-operator
namespace: app-team-alpha
rules:
- apiGroups: ["kubevirt.io"]
resources: ["virtualmachines", "virtualmachineinstances"]
verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
- apiGroups: ["kubevirt.io"]
resources: ["virtualmachineinstances/console", "virtualmachineinstances/vnc"]
verbs: ["get"]
- apiGroups: ["subresources.kubevirt.io"]
resources: ["virtualmachines/start", "virtualmachines/stop", "virtualmachines/restart"]
verbs: ["update"]
- apiGroups: ["cdi.kubevirt.io"]
resources: ["datavolumes"]
verbs: ["get", "list", "create", "delete"]
- apiGroups: [""]
resources: ["persistentvolumeclaims"]
verbs: ["get", "list", "create", "delete"]
# Cannot modify NetworkPolicies, Roles, or cluster resources
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
name: app-team-alpha-vm-operators
namespace: app-team-alpha
subjects:
- kind: Group
name: AD-APP-TEAM-ALPHA-VMOPS
apiGroup: rbac.authorization.k8s.io
roleRef:
kind: Role
name: vm-operator
apiGroup: rbac.authorization.k8s.io
This was far more precise than vSphere's built-in roles. The team could create and manage VMs but could not modify network policies (which the security team controlled), could not access other namespaces (which other teams owned), and could not touch cluster-level resources (which the platform team managed). Each action was an API call, and each API call was authorized against the role.
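That also made the permissions testable. A quick spot check with oc auth can-i -- the impersonated user name below is a placeholder -- confirms what the role does and does not allow:

$ oc auth can-i create virtualmachines.kubevirt.io -n app-team-alpha \
    --as=alice --as-group=AD-APP-TEAM-ALPHA-VMOPS
yes
$ oc auth can-i create networkpolicies.networking.k8s.io -n app-team-alpha \
    --as=alice --as-group=AD-APP-TEAM-ALPHA-VMOPS
no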
The AD group binding connected to the bank's Active Directory through OpenShift's OAuth identity provider (LDAP backend). Marco had configured this during the cluster setup:
apiVersion: config.openshift.io/v1
kind: OAuth
metadata:
name: cluster
spec:
identityProviders:
- name: bank-ldap
type: LDAP
mappingMethod: claim
ldap:
url: "ldaps://ldap.bank.ch:636/ou=Users,dc=bank,dc=ch?sAMAccountName"
bindDN: "cn=openshift-svc,ou=ServiceAccounts,dc=bank,dc=ch"
bindPassword:
name: ldap-bind-password
ca:
name: ldap-ca-bundle
insecure: false
Encryption at rest was handled at multiple levels. First, etcd -- the Kubernetes data store that held all cluster state, including VM definitions, secrets, and configuration -- was encrypted:
apiVersion: config.openshift.io/v1
kind: APIServer
metadata:
name: cluster
spec:
encryption:
type: aescbc # AES-CBC encryption for etcd values
Second, the Ceph OSD data was encrypted using LUKS (Linux Unified Key Setup). ODF supported per-OSD encryption, which meant every NVMe drive in the cluster had its data encrypted with a unique key. Marco configured the encryption keys to be stored in HashiCorp Vault (the bank's existing secret management platform) through ODF's key management service (KMS) integration:
apiVersion: ocs.openshift.io/v1
kind: StorageCluster
metadata:
name: ocs-storagecluster
namespace: openshift-storage
spec:
encryption:
enable: true
kms:
connectionDetails:
KMS_PROVIDER: vault
KMS_SERVICE_NAME: vault
VAULT_ADDR: https://vault.bank.ch:8200
VAULT_BACKEND_PATH: odf-encryption
tokenSecretName: vault-token
Audit logging was already built into OpenShift. Every API call -- every oc apply, every virtctl start, every console login -- was logged by the Kubernetes audit backend. Marco configured the audit log to be forwarded to the bank's Splunk SIEM:
apiVersion: logging.openshift.io/v1
kind: ClusterLogForwarder
metadata:
name: instance
namespace: openshift-logging
spec:
outputs:
- name: splunk-audit
type: splunk
url: https://splunk-hec.bank.ch:8088
secret:
name: splunk-hec-token
pipelines:
- name: audit-to-splunk
inputRefs:
- audit
outputRefs:
- splunk-audit
Each audit log entry contained the user identity, the action performed, the resource affected, the timestamp, and the source IP. For FINMA compliance, this provided the immutable audit trail the CISO required. Marco tested it by creating a VM and searching in Splunk:
index=openshift_audit action=create resource=virtualmachines user="marco.baumann@bank.ch"
The event was there, timestamped, with the full request body.
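Stripped down, a single audit event carries everything an auditor asks for. The shape below is illustrative -- field names follow the Kubernetes audit.k8s.io/v1 Event schema, while the values are invented:

apiVersion: audit.k8s.io/v1
kind: Event
level: RequestResponse
stage: ResponseComplete
verb: create
user:
  username: marco.baumann@bank.ch
  groups: ["AD-PLATFORM-ADMINS", "system:authenticated"]
sourceIPs: ["10.100.20.14"]
objectRef:
  apiGroup: kubevirt.io
  resource: virtualmachines
  namespace: sandbox
  name: test-linux-01
requestReceivedTimestamp: "2026-03-02T09:14:31.000000Z"
responseStatus:
  code: 201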
Secure Boot and vTPM for Windows VMs were configured in the VM spec. Marco enabled them on the Windows test VM:
spec:
template:
spec:
domain:
firmware:
bootloader:
efi:
secureBoot: true
features:
smm: {} # System Management Mode, required for Secure Boot
devices:
tpm: {persistent: true} # Virtual TPM 2.0, state persisted across restarts and migrations
The vTPM was backed by swtpm (a software TPM emulator) running inside the virt-launcher pod. Because the TPM state was persisted by KubeVirt, it survived VM restarts and live migrations, which allowed Windows features like BitLocker and Credential Guard to function inside the VM.
Marco stepped back and looked at the security posture as a whole. In some ways, it was stronger than vSphere:
- Namespace isolation was a hard boundary. A user in namespace app-team-alpha could not even see resources in namespace app-team-beta, let alone modify them. In vSphere, permissions were enforced, but the inventory tree was visible to anyone with at least Read-Only access.
- Pod Security Standards (restricted, baseline, privileged) enforced at the namespace level prevented VMs from running with elevated privileges unless explicitly allowed. This was a defense-in-depth layer that vSphere did not have.
- Immutable nodes (RHCOS) meant that even if an attacker gained SSH access to a worker node, they could not install a rootkit or modify system binaries that would survive a reboot. The next MCO reconciliation would restore the node to its declared state.
- NetworkPolicies made deny-by-default practical: once a policy selected a namespace's pods, only explicitly allowed traffic was permitted -- a positive security model (a minimal example follows this list). In NSX, default-allow was common and deny rules had to be explicitly created.
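The deny-by-default posture is only a few lines of YAML. A minimal sketch -- applied to a namespace, it blocks all inbound traffic to every pod (and therefore every VM) until more specific policies allow it:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: app-team-alpha
spec:
  podSelector: {}        # selects every pod in the namespace
  policyTypes:
    - Ingress            # no ingress rules listed, so nothing inbound is allowed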
The area where vSphere was still ahead was operational maturity around compliance reporting. The Compliance Operator in OpenShift could scan nodes against CIS benchmarks and NIST 800-53 controls, but generating the formatted compliance evidence that auditors expected required more work than running a vSphere Hardening Guide assessment in vROps. Marco filed a backlog item to automate compliance report generation.
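The starting point for that backlog item already existed in the cluster: a ScanSettingBinding tells the Compliance Operator which profiles to evaluate on a schedule. A sketch -- the profile names depend on which profile bundles are installed:

apiVersion: compliance.openshift.io/v1alpha1
kind: ScanSettingBinding
metadata:
  name: cis-scan
  namespace: openshift-compliance
profiles:
  - apiGroup: compliance.openshift.io/v1alpha1
    kind: Profile
    name: ocp4-cis
  - apiGroup: compliance.openshift.io/v1alpha1
    kind: Profile
    name: ocp4-cis-node
settingsRef:
  apiGroup: compliance.openshift.io/v1alpha1
  kind: ScanSetting
  name: default

The results land as ComplianceCheckResult objects that can be queried with oc -- raw material for the automated report, not yet the auditor-ready document.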
Chapter 7: The Migration Begins
The sandbox phase was over. Management wanted production VMs running on the OVE cluster within four weeks. The plan called for migrating the first 50 VMs -- a mix of RHEL application servers, Ubuntu utility VMs, and Windows Server database hosts -- as the first wave of the 5,000-VM migration.
Marco installed the Migration Toolkit for Virtualization (MTV) from the OperatorHub. MTV was a Red Hat-maintained operator that connected to a VMware vCenter, discovered VMs, mapped networks and storage from the source to the target, and executed migrations -- cold or warm.
$ oc get pods -n openshift-mtv
NAME READY STATUS AGE
forklift-controller-7c9f8d6b5-xk7qp 2/2 Running 5m
forklift-must-gather-api-6b8e7c5d4-mn3pq 1/1 Running 5m
forklift-ui-5a7d6b4c3-jk2mn 1/1 Running 5m
forklift-validation-5f7b8d9c6-lm2np 1/1 Running 5m
virt-v2v-7c9f8d6b5-rs5tu 1/1 Running 5m
First, he created a Provider that pointed MTV at the existing vCenter:
apiVersion: forklift.konveyor.io/v1beta1
kind: Provider
metadata:
name: vcenter-prod
namespace: openshift-mtv
spec:
type: vsphere
url: https://vcenter.bank.ch/sdk
secret:
name: vcenter-credentials
namespace: openshift-mtv
MTV connected to vCenter, inventoried all VMs, and presented them in its web UI. Marco could see the familiar vSphere inventory -- clusters, resource pools, VMs -- rendered in the OpenShift console. Each VM showed its configuration: CPU, memory, disks, networks, guest OS, VMware Tools status.
Next, he created network and storage mappings:
apiVersion: forklift.konveyor.io/v1beta1
kind: NetworkMap
metadata:
name: prod-network-map
namespace: openshift-mtv
spec:
map:
- source:
id: dvportgroup-100 # vDS port group "Production-VLAN200"
destination:
name: vlan200-production
namespace: production-vms
type: multus
- source:
id: dvportgroup-200 # vDS port group "Database-VLAN300"
destination:
name: vlan300-database
namespace: database-vms
type: multus
- source:
id: dvportgroup-50 # vDS port group "Management-VLAN100"
destination:
name: vlan100-management
namespace: default
type: multus
provider:
source:
name: vcenter-prod
namespace: openshift-mtv
destination:
name: host
namespace: openshift-mtv
---
apiVersion: forklift.konveyor.io/v1beta1
kind: StorageMap
metadata:
name: prod-storage-map
namespace: openshift-mtv
spec:
map:
- source:
id: datastore-10 # vSAN-Gold
destination:
storageClass: ceph-rbd-gold
accessMode: ReadWriteMany
- source:
id: datastore-20 # vSAN-Silver
destination:
storageClass: ceph-rbd-silver
accessMode: ReadWriteMany
- source:
id: datastore-30 # NetApp-NFS
destination:
storageClass: ontap-nas-shared
accessMode: ReadWriteMany
provider:
source:
name: vcenter-prod
namespace: openshift-mtv
destination:
name: host
namespace: openshift-mtv
Marco started with a single test migration -- a RHEL 8 application server with 4 vCPUs, 16 GB RAM, and a 100 GB VMDK on vSAN. He chose warm migration, which used VMware's Changed Block Tracking (CBT) to pre-copy the disk data while the source VM was still running, minimizing the cutover window.
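Behind the UI, a migration is just two more custom resources. Roughly -- the VM reference below is a placeholder for the vSphere managed object ID that MTV's inventory reports:

apiVersion: forklift.konveyor.io/v1beta1
kind: Plan
metadata:
  name: wave1-rhel-test
  namespace: openshift-mtv
spec:
  warm: true                      # pre-copy while the source VM keeps running
  targetNamespace: production-vms
  provider:
    source:
      name: vcenter-prod
      namespace: openshift-mtv
    destination:
      name: host
      namespace: openshift-mtv
  map:
    network:
      name: prod-network-map
      namespace: openshift-mtv
    storage:
      name: prod-storage-map
      namespace: openshift-mtv
  vms:
    - id: vm-1042                 # placeholder vSphere VM id
---
apiVersion: forklift.konveyor.io/v1beta1
kind: Migration
metadata:
  name: wave1-rhel-test-run1
  namespace: openshift-mtv
spec:
  plan:
    name: wave1-rhel-test
    namespace: openshift-mtv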
The warm migration flow:
Warm Migration Timeline:
T+0 min MTV starts initial disk copy
Connects to vCenter via VDDK (VMware Virtual Disk Development Kit)
Reads VM disk via NBD (Network Block Device) protocol
Streams disk data to CDI importer pod on OVE cluster
Writes to Ceph RBD PVC (target disk)
T+45 min Initial copy complete (100 GB at ~40 MB/s; the VDDK/NBD transfer path, not the 10GbE wire, was the bottleneck)
MTV takes CBT snapshot (tracks changed blocks)
Source VM continues running normally
T+46 min First incremental copy
Only changed blocks since initial copy (~2 GB)
Copies in 30 seconds
T+47 min Second incremental copy
Only changed blocks since first incremental (~200 MB)
Copies in 3 seconds
T+48 min Cutover initiated by Marco
Source VM is shut down
Final incremental copy (remaining dirty blocks)
MTV creates VirtualMachine CR on OVE cluster
VirtIO drivers injected (virt-v2v conversion)
VM boots on OVE
T+50 min VM is running on OVE
Total cutover downtime: ~2 minutes
The test migration succeeded. The RHEL VM booted on OVE with its IP address, hostname, and application stack intact. Marco SSH'd in, checked the running services, verified database connectivity, and ran a quick smoke test. Everything worked. The VirtIO drivers had been injected automatically by virt-v2v (the conversion engine inside MTV) -- replacing the VMware PVSCSI and VMXNET3 drivers with VirtIO equivalents.
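The smoke test itself fit in a handful of commands (the SSH target and expected services are placeholders for whatever the migrated VM actually runs):

$ ssh ansible@<migrated-vm-ip>
$ lsmod | grep -E 'virtio_(blk|scsi|net)'   # confirm VirtIO drivers are in use
$ systemctl is-active qemu-guest-agent      # guest agent reports IP/hostname back to KubeVirt
$ systemctl --failed                        # expect "0 loaded units listed"
$ ss -tlnp                                  # application ports still listening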
Linux VMs were the easy ones. Windows VMs were harder.
Marco migrated a Windows Server 2022 VM running SQL Server. The migration completed, but the VM failed to boot with a blue screen: INACCESSIBLE_BOOT_DEVICE. The Windows boot disk was using the VMware PVSCSI controller driver, and even though virt-v2v had injected the VirtIO SCSI driver, the Windows boot configuration was still referencing the PVSCSI driver.
The fix was to install the VirtIO drivers inside the Windows VM before migration. Marco logged into the source VM on vSphere, mounted the virtio-win.iso, and ran the VirtIO driver installer (which installed all VirtIO drivers silently without rebooting). After the drivers were installed, he re-ran the migration. This time, Windows booted successfully on OVE.
He wrote this up as the first item in his migration runbook:
Windows Pre-Migration Checklist:
- Install VirtIO drivers (virtio-win MSI) on the source VM while it is still running on VMware
- Install QEMU Guest Agent (qemu-ga MSI) on the source VM
- Verify VirtIO drivers are loaded: Device Manager should show "Red Hat VirtIO SCSI controller" (not as the boot device, but present)
- Consolidate all VMware snapshots before migration
- Record the NIC MAC address if the application is MAC-bound
- Remove any VMware-specific hardware (virtual serial ports, shared folders)
The next failure was more interesting. One of the VMs on the migration list -- an old RHEL 7 server running a legacy monitoring application -- had a physical Raw Device Mapping (RDM) for a SAN LUN that was passed through directly to the VM. MTV reported the migration as failed:
$ oc get migration test-migration-rdm -n openshift-mtv -o yaml
...
status:
conditions:
- type: Failed
status: "True"
reason: "UnsupportedDiskFormat"
message: "VM has disks with unsupported backing: RDM (physical compatibility mode)"
This was correct. An RDM was a direct mapping of a physical LUN to a VM -- the data was not in a VMDK file that could be copied. MTV could not migrate it automatically. Marco had two options: convert the RDM to a VMDK on vSphere first (using Storage vMotion), or manually copy the LUN data to a PVC on OVE using dd over SSH. He chose the former -- Storage vMotion the RDM to a thick-provisioned VMDK, then migrate with MTV. It added half a day to the migration timeline for that single VM.
After two weeks of iterating, Marco had a repeatable process. He could migrate 10 VMs per day with a single person, or 20 per day with two people (himself and Anna, who had become proficient with the MTV UI and the Windows pre-migration checklist). The bottleneck was not the tool -- it was the validation. Each migrated VM needed a smoke test: verify services, check connectivity, validate application functionality. The migration itself (disk copy + cutover) was largely automated. The human part was confirming that the application worked correctly in its new home.
At the end of week four, 47 of the 50 planned VMs were running on OVE. Three had been deferred: the RDM VM (resolved later), a VM with a USB dongle passthrough (required VFIO configuration that was not yet ready), and a VM running an ancient RHEL 5 guest that lacked VirtIO driver support (would be rebuilt from scratch on RHEL 9).
Marco updated the project dashboard. Wave 1: 94% complete. 4,953 VMs remaining.
Chapter 8: Day 2 -- Operations
The first real operational event happened on a Wednesday at 10:15 AM, three weeks after the first production VMs had landed on the cluster. An OpenShift update was available -- a patch release (4.16.12 to 4.16.14) that included security fixes for the Linux kernel and a performance improvement for the OVN-Kubernetes CNI.
In the VMware world, patching ESXi hosts meant putting each host into maintenance mode (which triggered vMotion to evacuate all VMs), installing the patch, rebooting, and exiting maintenance mode. Marco would do this in sequence -- one host at a time -- during a late-night maintenance window, because even though VMs were live-migrated automatically, he wanted to watch the process.
In OVE, the Machine Config Operator handled patching. The MCO managed a "machine config pool" for each node role (master, worker). When an update was applied, the MCO updated the RHCOS image for each pool and then rolled it out node by node: cordon the node (mark it unschedulable), drain the node (evict all pods, including VMs), apply the new RHCOS image, reboot, uncordon the node (mark it schedulable again).
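The whole rollout can also be requested and followed from the CLI -- a sketch, using the target version from this update:

$ oc adm upgrade --to=4.16.14     # request the update
$ oc get clusterversion           # overall progress
$ watch oc get mcp                # per-pool node rollout
$ oc get nodes -o wide            # OS/kernel versions once a pool finishes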
Marco applied the update from the web console and watched the MCO work:
$ oc get mcp
NAME CONFIG UPDATED UPDATING DEGRADED MACHINECOUNT READYMACHINECOUNT
master rendered-master-abc123 True False False 3 3
worker rendered-worker-def456 False True False 9 6
The MCO was updating the worker pool. Six of nine workers were already updated. The seventh worker -- worker-7 -- was being drained. Marco watched the VMs migrate:
$ oc get vmim -A
NAMESPACE NAME PHASE VMI
production-vms kubevirt-evacuate-worker7-001 Succeeded app-server-12
production-vms kubevirt-evacuate-worker7-002 Running db-oracle-03
production-vms kubevirt-evacuate-worker7-003 Pending web-proxy-07
sandbox kubevirt-evacuate-worker7-004 Pending test-linux-01
The VMs were live-migrating off worker-7 automatically. The evictionStrategy: LiveMigrate he had set on every production VM was doing its job -- Kubernetes was draining the node, and each virt-launcher pod eviction triggered a KubeVirt live migration. The VMs moved to other workers with zero downtime. The whole process -- drain, reboot, uncordon -- took about 8 minutes per worker node. Nine workers, sequentially, meant roughly 72 minutes for the worker pool.
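The only per-VM prerequisite was that one field in the template spec -- a minimal excerpt:

apiVersion: kubevirt.io/v1
kind: VirtualMachine
spec:
  template:
    spec:
      evictionStrategy: LiveMigrate   # a node drain triggers a live migration instead of a shutdown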
Marco realized this was smoother than his ESXi patching workflow. In vSphere, he had to initiate maintenance mode manually on each host and monitor the vMotion progress. Here, the MCO did it automatically, respecting PodDisruptionBudgets (which ensured that at most one VM from a replicated set was migrating at a time). He could have walked away and let the MCO finish the update unattended.
He did not walk away. Not the first time. But the fact that he could have was significant.
The monitoring setup was the next operational gap he needed to fill. OpenShift came with a built-in Prometheus-based monitoring stack that collected metrics from every component -- kubelet, API server, etcd, OVN-Kubernetes, and KubeVirt. But the built-in dashboards were container-centric, not VM-centric. They showed CPU and memory usage per pod, not per guest OS. They showed network bytes per interface, not per VM NIC.
Marco needed dashboards that answered the same questions he answered in vRealize Operations Manager (vROps): Which VMs are consuming the most CPU? Which VMs have high memory utilization? Which datastores (now: PVCs) are running low on space? Which hosts (now: nodes) are overloaded?
He stood up a Grafana instance against the cluster's Prometheus and created custom dashboards that queried KubeVirt-specific metrics:
# CPU utilization per VM (as a percentage of allocated cores)
sum by (name, namespace) (
rate(kubevirt_vmi_cpu_usage_seconds_total[5m])
) / on(name, namespace) group_left()
(kubevirt_vmi_resource_requests{resource="cpu"}) * 100
# Memory utilization per VM (RSS as a percentage of allocated memory)
kubevirt_vmi_memory_resident_bytes
/ on(name, namespace) group_left()
kubevirt_vmi_memory_available_bytes * 100
# Storage IOPS per VM
sum by (name, namespace) (
rate(kubevirt_vmi_storage_iops_read_total[5m]) +
rate(kubevirt_vmi_storage_iops_write_total[5m])
)
# Network throughput per VM (bytes/sec)
sum by (name, namespace) (
rate(kubevirt_vmi_network_receive_bytes_total[5m]) +
rate(kubevirt_vmi_network_transmit_bytes_total[5m])
)
He set up alerts for the critical conditions:
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
name: vm-alerts
namespace: openshift-monitoring
spec:
groups:
- name: vm.rules
rules:
- alert: VMHighCPUUsage
expr: |
sum by (name, namespace) (
rate(kubevirt_vmi_cpu_usage_seconds_total[5m])
) / on(name, namespace) group_left()
(kubevirt_vmi_resource_requests{resource="cpu"}) > 0.9
for: 15m
labels:
severity: warning
annotations:
summary: "VM {{ $labels.name }} in {{ $labels.namespace }} CPU > 90%"
- alert: CephOSDDown
expr: ceph_osd_up == 0
for: 5m
labels:
severity: critical
annotations:
summary: "Ceph OSD {{ $labels.ceph_daemon }} is down"
Then the disk failure happened. Week six. Worker-5, OSD number 47 (one of the twelve NVMe drives in that server). The Ceph health check turned yellow:
$ oc rsh -n openshift-storage deploy/rook-ceph-tools
[root@rook-ceph-tools /]# ceph status
cluster:
health: HEALTH_WARN
1 osds down
Degraded data redundancy: 834/15012 objects degraded (5.55%)
services:
osd: 108 total, 107 up, 107 in; 12 remapped pgs
OSD 47 had reported a hardware error and marked itself down. The MON cluster updated the OSD map, and Ceph immediately began recovering the data that had been stored on OSD 47 by re-replicating it from the surviving copies on the other two replicas (remember: 3-way replication meant every object existed on three different OSDs, on three different hosts). The recovery was automatic -- no human intervention required.
Marco watched the recovery progress:
[root@rook-ceph-tools /]# ceph -w
2026-04-15 10:23:45.123 [WRN] osd.47 reported failed by osd.46
2026-04-15 10:23:46.234 [INF] osd.47 marked down (agent)
2026-04-15 10:23:47.345 [INF] pgmap: 384 pgs; 12 active+recovering, 372 active+clean
2026-04-15 10:24:12.456 [INF] pgmap: 384 pgs; 8 active+recovering, 376 active+clean
2026-04-15 10:25:33.567 [INF] pgmap: 384 pgs; 3 active+recovering, 381 active+clean
2026-04-15 10:26:47.678 [INF] pgmap: 384 pgs; 384 active+clean
2026-04-15 10:26:47.789 [INF] cluster health changed: HEALTH_WARN -> HEALTH_OK
Three minutes. From OSD failure to full data redundancy recovery, three minutes. The VMs on worker-5 had continued running the entire time -- the disk failure affected only the storage capacity of that OSD, not the VMs' ability to access their data (which was served from the surviving replicas on other OSDs). No VM had experienced any downtime or data loss.
Marco opened a hardware support ticket for the failed NVMe drive. When the replacement arrived two days later, he hot-swapped the drive (the HPE DL380 Gen11 supported NVMe hot-swap), and the Rook-Ceph operator detected the new drive, created a new OSD on it, and began backfilling data. The cluster was back to 108 OSDs within the hour.
This was the same self-healing behavior he expected from vSAN -- when a vSAN disk failed, vSAN rebuilt the data from its mirror copies automatically. The mechanism was different (CRUSH-based PG re-replication vs. vSAN object rebuilds), but the operational outcome was identical: disk fails, data is protected, replace the disk when convenient.
The IOPS incident was more interesting. One of the developers had deployed a load test against a PostgreSQL VM that was generating 40,000 random write IOPS. The VM was on the ceph-rbd-gold StorageClass, sharing the Ceph cluster with all the other production VMs. The latency graphs for other VMs on the same storage pool spiked from 400 microseconds to 3 milliseconds.
Marco's vSphere reflex was "Storage I/O Control" (SIOC) on traditional datastores and per-VM IOPS limits in vSAN storage policies -- both ways of capping the IOPS a VM could consume when the datastore was under contention. The OVE equivalent was less mature but workable.
He applied a ResourceQuota to the developer's namespace to cap how much Ceph-backed storage it could request -- quotas limit capacity and PVC counts, not IOPS -- and then looked into Ceph-level QoS for the IOPS problem itself:
# Set IOPS limit on a specific RBD image (the VM's disk PVC)
[root@rook-ceph-tools /]# rbd config image set \
ocs-storagecluster-cephblockpool/csi-vol-abcd1234 \
rbd_qos_iops_limit 5000
This capped the offending VM's disk at 5,000 IOPS, and the latency for other VMs returned to normal within seconds. It was not as seamless as SIOC (which was automatic and policy-driven), but it worked. Marco added a note to his operational runbook and filed a feature request with Red Hat for namespace-scoped Ceph QoS integration.
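The namespace quota itself was plain Kubernetes. A sketch with illustrative limits and a placeholder namespace name -- it constrains requested capacity and PVC counts, which is why the RBD-level IOPS limit was still needed:

apiVersion: v1
kind: ResourceQuota
metadata:
  name: storage-limits
  namespace: dev-loadtest            # placeholder for the developer's namespace
spec:
  hard:
    persistentvolumeclaims: "20"
    requests.storage: 2Ti
    ceph-rbd-gold.storageclass.storage.k8s.io/requests.storage: 500Gi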
The Terraform moment came on a Friday afternoon. Marco had been creating VMs by hand-editing YAML and applying it with oc apply. This was fine for 5 VMs. It was tedious for 50. It would be insane for 500.
He installed the Kubernetes Terraform provider and wrote his first Terraform module for VM provisioning:
resource "kubernetes_manifest" "vm" {
manifest = {
apiVersion = "kubevirt.io/v1"
kind = "VirtualMachine"
metadata = {
name = var.vm_name
namespace = var.namespace
labels = {
app = var.app_label
environment = var.environment
team = var.team
}
}
spec = {
running = true
dataVolumeTemplates = [{
metadata = { name = "${var.vm_name}-root" }
spec = {
source = {
pvc = {
namespace = "golden-images"
name = var.golden_image
}
}
storage = {
resources = {
requests = { storage = var.disk_size }
}
storageClassName = var.storage_class
accessModes = ["ReadWriteMany"]
}
}
}]
template = {
metadata = {
labels = {
"kubevirt.io/vm" = var.vm_name
app = var.app_label
}
}
spec = {
domain = {
cpu = { cores = var.cpu_cores, sockets = 1, threads = 1 }
memory = { guest = var.memory }
# ... additional domain config ...
}
evictionStrategy = "LiveMigrate"
volumes = [
{ name = "rootdisk", dataVolume = { name = "${var.vm_name}-root" } },
{ name = "cloudinitdisk", cloudInitNoCloud = {
userData = templatefile("${path.module}/templates/cloud-init.yaml.tpl", {
hostname = var.vm_name
domain = var.domain
ssh_keys = var.ssh_keys
ntp_servers = var.ntp_servers
})
}}
]
networks = [{ name = "default", pod = {} }]
}
}
}
}
}
He created a terraform.tfvars file with the parameters for the next batch of 20 VMs, ran terraform plan to preview the changes, and then terraform apply to create them all at once. Twenty VMs, provisioned in parallel, each cloned from a golden image, each with unique cloud-init configuration, each tracked in Terraform state.
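The per-VM input was nothing more than a block of variables. An illustrative excerpt for a single VM -- the values are invented, the variable names match the module above:

# terraform.tfvars (excerpt)
vm_name       = "app-server-51"
namespace     = "production-vms"
app_label     = "billing"
environment   = "prod"
team          = "app-team-alpha"
golden_image  = "rhel9-golden"
disk_size     = "100Gi"
storage_class = "ceph-rbd-gold"
cpu_cores     = 4
memory        = "16Gi"
domain        = "prod.bank.ch"
ssh_keys      = ["ssh-ed25519 AAAA... ansible@bank.ch"]
ntp_servers   = ["ntp1.bank.ch", "ntp2.bank.ch"]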
He checked the Terraform state into a Git repository alongside the variable files. Every VM was now code. The provisioning, the configuration, the lifecycle -- all of it was in Git, reviewable, auditable, reproducible. If someone asked "what changed on vm-app-server-42 last Tuesday?" he could show them a Git commit diff.
This was the "everything as code" moment the platform strategy document had described. Marco had understood it intellectually when he read the strategy. Now he was living it. VMs were YAML. Networks were YAML. Storage was YAML. Security policies were YAML. And all the YAML lived in Git.
That Friday night, Marco slept through the night for the first time since the project had started. The monitoring was in place, the alerts were configured, the runbooks were written, the team was trained. The cluster would page him if something went wrong. Nothing went wrong.
Chapter 9: Reflections
Six months after the first server was racked, the OVE cluster was running 312 VMs in production. The second wave of 200 VMs was in progress. The team had grown from Marco working alone to a three-person platform team (Marco, Anna, and Lukas, a former network engineer who had embraced the Kubernetes model with unexpected enthusiasm). The weekly VM migration rate had settled at 40-50 VMs per week -- sustainable, predictable, and backed by tested runbooks.
Marco sat down one evening and wrote a retrospective document for the infrastructure leadership. He titled it "VMware to OVE: What I Wish I Had Known."
What is genuinely better:
- Infrastructure as Code, natively. In vSphere, he had Terraform providers and PowerCLI scripts, but they always felt like afterthoughts -- bolted onto a platform designed for GUI-driven administration. In OVE, everything was a Kubernetes resource, and every resource was a YAML manifest. Version control, code review, CI/CD pipelines for infrastructure -- these were not bolt-ons. They were the natural way to operate the platform. This was the single biggest improvement in his daily workflow.
- Unified platform. The bank was starting to deploy containerized applications alongside the VMs. In vSphere, this would have required a separate platform (Tanzu, or a standalone Kubernetes cluster). In OVE, containers and VMs shared the same cluster, the same monitoring, the same RBAC, the same networking. When the application team asked "can my container talk to my database VM?" the answer was "yes, they are on the same network. Here is a Service and a NetworkPolicy." No cross-platform integration headaches.
- Immutable nodes. The RHCOS immutable OS was uncomfortable at first. But after six months, he had not had a single "configuration drift" incident. Every node was identical. Every patch was applied atomically. The MCO handled updates without human intervention. He had spent less time on node maintenance in six months of OVE than he typically spent in two months of ESXi.
- Cloud-init templating. Provisioning a VM from a golden image with full cloud-init customization was faster and more reliable than the vSphere Guest OS Customization mechanism. The ability to inject CA certificates, SSH keys, NTP configuration, and package installations -- all from a YAML spec -- eliminated the need for post-deployment configuration management for 90% of his VMs.
- Namespace isolation for multi-tenancy. Different teams got different namespaces with different quotas, different RBAC, different network policies. Self-service VM provisioning was real -- the application team could create VMs in their namespace without opening a ticket. The guardrails (quotas, policies) prevented them from doing damage, and the platform team focused on maintaining those guardrails rather than executing individual provisioning requests.
What is genuinely worse:
- Operational complexity when things go wrong. When a VM failed to start in vSphere, the error was usually in the vmkernel log or the vCenter event log. In OVE, a VM failure could originate in six different layers: Kubernetes scheduling, CRI-O container runtime, CNI networking, CSI storage, KubeVirt, or QEMU. Diagnosing which layer was responsible required knowledge of all six. The learning curve for troubleshooting was steep and never fully plateaued.
- No DRS equivalent. The Kubernetes scheduler placed VMs at creation time but did not rebalance them. After a node drain, VMs clustered on the surviving nodes and did not automatically spread back out. The descheduler helped, but it was less mature than DRS. Marco spent time each month manually checking resource distribution and triggering migrations to rebalance. In vSphere, DRS did this automatically, every five minutes.
- Storage Live Migration. vSphere's Storage vMotion could move a VM's disk from one datastore to another while the VM was running. KubeVirt had no direct equivalent. Moving a VM's PVC from one StorageClass to another required creating a new PVC, cloning the data, updating the VM spec, and restarting the VM. This was a significant operational gap for scenarios like storage tier rebalancing or migrating away from a decommissioned storage backend.
- Windows VM experience. Windows VMs worked on OVE, but the experience was rougher than on vSphere. VirtIO driver installation was an extra step. The VNC console was adequate but not as responsive as VMRC. Sysprep integration for Windows templates was more manual than vSphere's Guest OS Customization. Windows-heavy organizations would feel this friction more acutely.
- Ecosystem maturity. Twenty years of VMware meant twenty years of third-party integration. Veeam, Commvault, SolarWinds, ServiceNow -- every enterprise tool had a vSphere plugin. The KubeVirt ecosystem was growing (OADP/Velero for backup, Kasten K10, the OpenShift monitoring stack) but was not yet as deep. Marco spent more time integrating tools than he had in vSphere.
What is just different (neither better nor worse):
- CLI vs. GUI. vSphere was GUI-first with CLI available. OVE was CLI-first with a GUI available. Neither was inherently better. The GUI was important for situational awareness (dashboard views, topology maps), and the CLI was important for automation and repeatability. Marco used both daily and would not want to give up either.
- Storage model. vSphere used monolithic datastores (VMFS, vSAN) that contained multiple VMDKs. OVE used per-disk PVCs, each independently provisioned and managed. The PVC model was more flexible (each disk could use a different StorageClass) but less familiar and harder to visualize (there was no "datastore browser" equivalent).
- Networking model. vSphere networking was VLAN-centric with optional overlays (NSX). OVE networking was overlay-first (OVN/GENEVE) with VLAN access via Multus. Both models achieved the same outcome -- VMs connected to the networks they needed -- but the mental model was inverted.
Skills that transferred directly:
- Capacity planning (CPU, memory, storage, network bandwidth)
- Storage architecture (replication, erasure coding, tiering, performance benchmarking)
- Network design (spine-leaf, LACP, BGP, VLANs, MTU considerations)
- Live migration concepts (pre-copy, convergence, bandwidth requirements)
- Security fundamentals (encryption, RBAC, audit logging, compliance)
- Hardware lifecycle management (rack and stack, firmware updates, disk replacement)
- Troubleshooting methodology (isolate the layer, check the logs, reproduce the issue)
Skills that required learning from scratch:
- Kubernetes fundamentals (pods, deployments, services, namespaces, RBAC)
- YAML-driven infrastructure (declarative model, desired-state reconciliation)
- Container runtime internals (CRI-O, pod lifecycle, image management)
- Ceph storage operations (PG states, CRUSH maps, OSD management, the ceph CLI)
- OVN networking (logical flows, ovn-trace, ovs-ofctl)
- GitOps workflow (version control for infrastructure, CI/CD for cluster configuration)
- The Operator pattern (understanding how operators manage complex applications)
What Marco wished he had known on day one:
- Start with the study material. The networking, storage, and virtualization fundamentals documents were dense, but they mapped every VMware concept to its OVE equivalent. He should have spent the first two weeks reading before touching hardware. He would have avoided the MTU mistake, the manual chrony edit, and two days of confusion about PG states.
- Build the monitoring stack before deploying VMs. He had deployed VMs first and monitoring second. This meant the first week of production was blind -- he had no dashboards, no alerts, and no baselines. Build monitoring first. Always.
- Migrate Linux VMs first, always. Linux VMs with VirtIO drivers were trivial to migrate. Windows VMs required pre-migration driver installation and had more failure modes. He should have migrated all Linux VMs in wave 1 and saved Windows for wave 2, when the team had more experience.
- The learning curve is real, but it has a plateau. The first month was overwhelming. Kubernetes, Ceph, OVN, KubeVirt -- each was a complex system, and he was learning all four simultaneously. By month three, the daily operations felt manageable. By month six, he was writing Terraform modules and troubleshooting Ceph PG states without looking at the documentation. The curve was steep, but it was finite.
- Everything is a pod. This was the most important mental model shift. Every VM is a pod. Every storage daemon is a pod. Every network component is a pod. Every operator controller is a pod. Once this clicked, the entire platform made sense. oc get pods -A was the universal starting point for any investigation. Every problem manifested as a pod in a bad state -- and Kubernetes had mature tools for diagnosing pods.
- Get help. Marco had tried to learn everything himself. He should have engaged Red Hat professional services for the first two weeks of cluster setup. Having an expert in the room during the initial ODF deployment and network configuration would have saved a week of trial and error. The technology was open source and well-documented, but operational experience was not something you could absorb from documentation alone.
Marco closed his laptop and looked out the window. The Basel skyline was dark except for the red aviation lights on the Roche towers. Somewhere in the data center below, 312 virtual machines were running on a platform that had not existed six months ago. By this time next year, all 5,000 VMs would be there.
It had been the hardest project of his career. It had also been the most rewarding. The platform was not perfect -- no platform was -- but it was modern, it was automated, it was auditable, and it was his team's. They had built it from bare metal, mistake by mistake, lesson by lesson.
And tomorrow, there were 40 more VMs to migrate.
This story is based on real OpenShift Virtualization Engine architecture and operational patterns. Technical details -- including YAML manifests, commands, error messages, and architectural descriptions -- are accurate as of the study material. Names, dates, and the specific financial institution are fictional.