Modern datacenters and beyond

Infrastructure as Code

Why This Matters

The platform strategy mandates "Everything as Code" -- all infrastructure must be version-controlled, peer-reviewed, and automated. This is not a preference; it is a governance requirement. For a Tier-1 financial enterprise running 5,000+ VMs, manual provisioning through web consoles is a compliance risk, an audit liability, and an operational bottleneck. Every VM, every network policy, every storage volume must be traceable to a Git commit, reproducible from a pipeline, and destroyable without human intervention.

Today, the VMware estate is managed through a combination of vCenter UI operations, PowerCLI scripts, Terraform with the vSphere provider, and Ansible playbooks using the community.vmware collection. This hybrid approach works, but it is tightly coupled to VMware-specific APIs. Migrating to OVE, Azure Local, or Swisscom ESC means replacing or rewriting every IaC integration -- Terraform providers, Ansible collections, CI/CD pipelines, and approval workflows.

This chapter covers the two primary IaC toolchains (Terraform and Ansible), the Kubernetes-native alternative (GitOps with ArgoCD/Flux), and the emerging Crossplane model. Each approach has distinct strengths, and the right choice depends on the target platform. For OVE, GitOps may be the preferred model because VMs are Kubernetes-native objects. For Azure Local, Terraform with the Azure provider ecosystem is the natural fit. For Swisscom ESC, the IaC surface depends on what APIs Swisscom exposes.

The chapter assumes working knowledge of Git, YAML, and basic Kubernetes concepts. It provides complete, copy-pasteable examples that can be adapted for proof-of-concept deployments.


Concepts

1. Terraform Provider for KubeVirt / Hyper-V

Terraform Fundamentals

Terraform is a declarative infrastructure provisioning tool. You describe the desired end state of your infrastructure in HCL (HashiCorp Configuration Language) files, and Terraform calculates the difference between the current state and the desired state, then executes the necessary API calls to converge reality to the declaration.

Core concepts:

  Terraform Execution Flow

  +-------------------+
  | .tf files (HCL)   |   Developer writes declarative configuration
  | - main.tf         |   in version-controlled .tf files
  | - variables.tf    |
  | - outputs.tf      |
  +--------+----------+
           |
           v
  +--------+----------+
  | terraform init    |   Downloads provider plugins, initializes
  | (one-time setup)  |   backend for state storage
  +--------+----------+
           |
           v
  +--------+----------+
  | terraform plan    |   Reads current state + provider API,
  |                   |   computes diff, shows proposed changes
  |  + = create       |
  |  ~ = modify       |   "Plan: 3 to add, 1 to change, 0 to destroy"
  |  - = destroy      |
  +--------+----------+
           |
           v  (human or pipeline reviews plan)
  +--------+----------+
  | terraform apply   |   Executes API calls to converge
  |                   |   real infrastructure to desired state
  |  Provider API     |
  |  calls: POST,     |   Updates terraform.tfstate with new
  |  PUT, DELETE      |   resource IDs and attributes
  +--------+----------+
           |
           v
  +--------+----------+
  | terraform.tfstate |   JSON file mapping HCL resources to
  | (state file)      |   real infrastructure object IDs
  |                   |   MUST be stored securely (S3, GCS,
  |                   |   Terraform Cloud, GitLab backend)
  +-------------------+

State management is Terraform's most critical operational concern. The state file contains resource IDs, IP addresses, and sometimes secrets. For a team of 10+ engineers managing 5,000+ VMs, local state files are untenable: state must live in a remote, encrypted backend with locking (for example the S3-plus-DynamoDB backend shown below), access to it must be restricted and audited, and it should be segmented by team or environment so a single plan does not have to walk thousands of resources.

KubeVirt Terraform Provider

The KubeVirt Terraform provider (kubevirt/kubevirt in the Terraform registry) enables declarative management of KubeVirt VirtualMachine resources via Terraform. It speaks directly to the Kubernetes API server using a kubeconfig file and translates HCL resource definitions into KubeVirt Custom Resource operations.

Provider configuration:

# providers.tf

terraform {
  required_version = ">= 1.5.0"

  required_providers {
    kubevirt = {
      source  = "kubevirt/kubevirt"
      version = "~> 0.1"
    }
  }

  backend "s3" {
    bucket         = "terraform-state-prod"
    key            = "kubevirt/vm-workloads/terraform.tfstate"
    region         = "eu-central-1"
    dynamodb_table = "terraform-locks"
    encrypt        = true
  }
}

provider "kubevirt" {
  # Option 1: Use the default kubeconfig
  # (reads from KUBECONFIG env var or ~/.kube/config)

  # Option 2: Explicit path
  config_path = var.kubeconfig_path

  # Option 3: In-cluster (when Terraform runs inside the cluster)
  # No configuration needed -- uses the pod's service account token
}

Resource types available in the KubeVirt provider:

  Resource Type               Description
  kubevirt_virtual_machine    A VirtualMachine CR (persistent, restartable VM)
  kubevirt_data_volume        A CDI DataVolume (disk image import or clone)

The KubeVirt Terraform provider is relatively young compared to the vSphere provider. It covers the core VM and DataVolume resources but does not yet expose every KubeVirt CRD (e.g., MigrationPolicy, VirtualMachineClusterInstancetype, VirtualMachinePool are not directly supported). For resources not covered by the provider, the Kubernetes Terraform provider's kubernetes_manifest resource can fill the gap (see "Alternative" section below).

Complete example -- provisioning a VM with disks, network, and cloud-init:

# variables.tf

variable "kubeconfig_path" {
  description = "Path to the kubeconfig file"
  type        = string
  default     = "~/.kube/config"
}

variable "namespace" {
  description = "Kubernetes namespace for the VM"
  type        = string
  default     = "vm-workloads"
}

variable "vm_name" {
  description = "Name of the virtual machine"
  type        = string
  default     = "rhel9-appserver-01"
}

variable "cpu_cores" {
  description = "Number of vCPU cores"
  type        = number
  default     = 4
}

variable "memory" {
  description = "Memory allocation (e.g., 8Gi)"
  type        = string
  default     = "8Gi"
}

variable "disk_size" {
  description = "Root disk size (e.g., 100Gi)"
  type        = string
  default     = "100Gi"
}

variable "ssh_public_key" {
  description = "SSH public key for cloud-init"
  type        = string
}

variable "network_name" {
  description = "Name of the NetworkAttachmentDefinition for the VM network"
  type        = string
  default     = "vlan-100-prod"
}

variable "storage_class" {
  description = "Kubernetes StorageClass for the VM disk"
  type        = string
  default     = "ocs-storagecluster-ceph-rbd"
}

variable "golden_image_namespace" {
  description = "Namespace containing the golden image PVC"
  type        = string
  default     = "golden-images"
}

variable "golden_image_pvc" {
  description = "Name of the golden image PVC to clone"
  type        = string
  default     = "rhel9-golden-20240601"
}

# main.tf

# --- Data Volume: Clone the root disk from a golden image ---

resource "kubevirt_data_volume" "root_disk" {
  metadata {
    name      = "${var.vm_name}-rootdisk"
    namespace = var.namespace

    labels = {
      "app.kubernetes.io/name"       = var.vm_name
      "app.kubernetes.io/managed-by" = "terraform"
      "app.kubernetes.io/component"  = "rootdisk"
    }
  }

  spec {
    source {
      pvc {
        name      = var.golden_image_pvc
        namespace = var.golden_image_namespace
      }
    }

    pvc {
      access_modes = ["ReadWriteMany"]
      resources {
        requests = {
          storage = var.disk_size
        }
      }
      storage_class_name = var.storage_class
    }
  }
}

# --- Data Volume: Additional data disk ---

resource "kubevirt_data_volume" "data_disk" {
  metadata {
    name      = "${var.vm_name}-datadisk"
    namespace = var.namespace

    labels = {
      "app.kubernetes.io/name"       = var.vm_name
      "app.kubernetes.io/managed-by" = "terraform"
      "app.kubernetes.io/component"  = "datadisk"
    }
  }

  spec {
    source {
      blank {}
    }

    pvc {
      access_modes = ["ReadWriteMany"]
      resources {
        requests = {
          storage = "200Gi"
        }
      }
      storage_class_name = var.storage_class
    }
  }
}

# --- Virtual Machine ---

resource "kubevirt_virtual_machine" "vm" {
  metadata {
    name      = var.vm_name
    namespace = var.namespace

    labels = {
      "app.kubernetes.io/name"       = var.vm_name
      "app.kubernetes.io/managed-by" = "terraform"
      "app.kubernetes.io/part-of"    = "appserver-fleet"
      "env"                          = "production"
    }

    annotations = {
      "vm.kubevirt.io/os" = "rhel9"
    }
  }

  spec {
    run_strategy = "Always"

    template {
      metadata {
        labels = {
          "app.kubernetes.io/name"    = var.vm_name
          "kubevirt.io/domain"        = var.vm_name
        }
      }

      spec {
        domain {
          cpu {
            cores   = var.cpu_cores
            sockets = 1
            threads = 1
          }

          resources {
            requests = {
              memory = var.memory
            }
            limits = {
              memory = var.memory
            }
          }

          devices {
            disk {
              name = "rootdisk"
              disk {
                bus = "virtio"
              }
            }
            disk {
              name = "datadisk"
              disk {
                bus = "virtio"
              }
            }
            disk {
              name = "cloudinit"
              disk {
                bus = "virtio"
              }
            }

            interface {
              name                     = "prod-net"
              bridge {}
            }

            rng {}
          }
        }

        network {
          name = "prod-net"
          multus {
            network_name = var.network_name
          }
        }

        volume {
          name = "rootdisk"
          data_volume {
            name = kubevirt_data_volume.root_disk.metadata[0].name
          }
        }

        volume {
          name = "datadisk"
          data_volume {
            name = kubevirt_data_volume.data_disk.metadata[0].name
          }
        }

        volume {
          name = "cloudinit"
          cloud_init_no_cloud {
            user_data = <<-CLOUDINIT
              #cloud-config
              hostname: ${var.vm_name}
              fqdn: ${var.vm_name}.internal.example.com
              manage_etc_hosts: true

              users:
                - name: sysadmin
                  sudo: ALL=(ALL) NOPASSWD:ALL
                  shell: /bin/bash
                  ssh_authorized_keys:
                    - ${var.ssh_public_key}

              packages:
                - qemu-guest-agent
                - nfs-utils
                - python3

              runcmd:
                - systemctl enable --now qemu-guest-agent
                - echo "Provisioned by Terraform at $(date)" > /etc/motd

              write_files:
                - path: /etc/sysctl.d/99-tuning.conf
                  content: |
                    net.core.somaxconn = 65535
                    vm.swappiness = 10
                  permissions: '0644'
            CLOUDINIT
          }
        }
      }
    }
  }

  depends_on = [
    kubevirt_data_volume.root_disk,
    kubevirt_data_volume.data_disk,
  ]
}

# outputs.tf

output "vm_name" {
  description = "Name of the created VM"
  value       = kubevirt_virtual_machine.vm.metadata[0].name
}

output "vm_namespace" {
  description = "Namespace of the created VM"
  value       = kubevirt_virtual_machine.vm.metadata[0].namespace
}

output "root_disk_name" {
  description = "Name of the root disk DataVolume"
  value       = kubevirt_data_volume.root_disk.metadata[0].name
}

output "data_disk_name" {
  description = "Name of the data disk DataVolume"
  value       = kubevirt_data_volume.data_disk.metadata[0].name
}

Lifecycle management -- what happens on plan, apply, and destroy:

  terraform plan
      Reads the current state of the VM and DataVolumes from the Kubernetes
      API. Computes the diff. Shows whether the VM would be created, modified,
      or destroyed.

  terraform apply (create)
      Creates the DataVolume CRs first (CDI begins cloning/importing disk
      images). Then creates the VirtualMachine CR. KubeVirt's virt-controller
      creates the virt-launcher pod and starts the VM. Terraform waits for the
      resources to be created in the API but does not wait for the VM to
      finish booting.

  terraform apply (modify)
      Changes to certain fields (labels, annotations, resource requests) can
      be applied in-place. Changes to immutable fields (disk bus type, network
      interface type) force a destroy-and-recreate. The run_strategy can be
      changed in-place.

  terraform destroy
      Deletes the VirtualMachine CR first (which triggers graceful shutdown of
      the VMI and deletion of the virt-launcher pod), then deletes the
      DataVolume CRs (which deletes the underlying PVCs and PVs, releasing the
      storage). Data is permanently lost.

State management -- how VM state maps to Terraform state:

The Terraform state file records the metadata (name, namespace, UID, resourceVersion) and the full spec of each managed resource. When the state file says a VM exists with 4 cores and 8Gi memory, Terraform trusts this until the next plan or apply, at which point it queries the Kubernetes API to detect drift.

Drift detection: If someone manually edits the VM via kubectl or virtctl (e.g., adds a label, changes memory), the next terraform plan will detect the drift and propose changes to bring the real resource back in line with the HCL definition. This is a strength of Terraform -- it enforces the declared state. It is also a risk: if the operations team makes an emergency change via kubectl and then someone runs terraform apply, Terraform will revert the emergency change.
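
A short sketch of handling drift safely with standard Terraform commands (the workflow, not the exact output, is the point here):

# Show what has drifted without proposing any infrastructure changes
terraform plan -refresh-only

# If the manual change should be kept: update the HCL to match, then re-plan.
# If it should be reverted: a normal plan/apply converges the VM back to the
# declared state.
terraform plan
terraform apply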

Limitations and workarounds:

  1. Incomplete CRD coverage. The KubeVirt provider does not expose all KubeVirt CRDs. MigrationPolicy, VirtualMachineClusterInstancetype, VirtualMachineClusterPreference, VirtualMachinePool, and many others must be managed via the kubernetes_manifest resource or via raw kubectl apply outside Terraform.

  2. No lifecycle operation support. Terraform can set run_strategy: Always or run_strategy: Halted, but it cannot trigger a live migration, open a console, or invoke a restart. These imperative operations are outside Terraform's declarative model. Use virtctl or Ansible for day-2 lifecycle operations (see the virtctl sketch after this list).

  3. Provider maturity. The KubeVirt Terraform provider is a community-maintained project. It does not have the same level of investment, documentation, or testing as the vSphere or AWS providers. Expect rough edges. Pin the provider version and test upgrades carefully.

  4. State file size at scale. Each VM resource in the state file is approximately 5--10 KB of JSON. At 5,000 VMs, the state file would be 25--50 MB -- manageable, but slow to plan. Segment state files by team, environment, or application.

  5. Import existing VMs. If VMs were created manually (e.g., during a PoC), terraform import can bring them under Terraform management. However, you must write the matching HCL configuration first; do not assume that Terraform 1.5+ configuration generation (import blocks with -generate-config-out) will produce usable HCL for KubeVirt resources the way it does for some more mature providers.
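
As a sketch of the imperative operations from limitation 2, the usual tool is virtctl, shown here against the example VM used throughout this chapter:

# Day-2 operations that Terraform cannot express; virtctl talks to the
# KubeVirt subresource API via the current kubeconfig.
virtctl restart rhel9-appserver-01 -n vm-workloads    # graceful guest restart
virtctl migrate rhel9-appserver-01 -n vm-workloads    # trigger a live migration
virtctl console rhel9-appserver-01 -n vm-workloads    # attach to the serial console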

Alternative: Kubernetes Terraform Provider with Raw Manifests

For KubeVirt resources not covered by the dedicated provider, the hashicorp/kubernetes provider offers the kubernetes_manifest resource, which can manage any Kubernetes resource by accepting raw YAML/JSON manifests.

# Using the Kubernetes provider for resources not in the KubeVirt provider

terraform {
  required_providers {
    kubernetes = {
      source  = "hashicorp/kubernetes"
      version = "~> 2.30"
    }
  }
}

provider "kubernetes" {
  config_path = var.kubeconfig_path
}

# Example: Create a MigrationPolicy (not supported by KubeVirt provider)
resource "kubernetes_manifest" "migration_policy" {
  manifest = {
    apiVersion = "migrations.kubevirt.io/v1alpha1"
    kind       = "MigrationPolicy"
    metadata = {
      name = "high-bandwidth-policy"
    }
    spec = {
      selectors = {
        namespaceSelector = {
          "migration-tier" = "premium"
        }
      }
      bandwidthPerMigration    = "1Gi"
      completionTimeoutPerGiB  = 800
      allowAutoConverge        = true
      allowPostCopy            = false
    }
  }
}

# Example: Create a VirtualMachineClusterInstancetype
resource "kubernetes_manifest" "instancetype_large" {
  manifest = {
    apiVersion = "instancetype.kubevirt.io/v1beta1"
    kind       = "VirtualMachineClusterInstancetype"
    metadata = {
      name = "large"
      labels = {
        "instancetype.kubevirt.io/vendor" = "internal"
      }
    }
    spec = {
      cpu = {
        guest = 8
      }
      memory = {
        guest = "32Gi"
      }
    }
  }
}

Tradeoff: kubernetes_manifest is fully generic but has weaker type safety. The KubeVirt provider validates HCL against the KubeVirt CRD schema at plan time; kubernetes_manifest validates only at apply time when the API server rejects invalid fields. For core VM resources, prefer the dedicated provider. For supplementary CRDs, use kubernetes_manifest.

OpenTofu as an Alternative to HashiCorp Terraform

In August 2023, HashiCorp changed Terraform's license from the Mozilla Public License 2.0 (MPL-2.0) to the Business Source License 1.1 (BSL 1.1). The BSL restricts using Terraform in products that compete with HashiCorp's commercial offerings. While this does not directly affect end-users who use Terraform to manage their own infrastructure, it created uncertainty for:

  1. Vendors and service providers that embed or resell Terraform-based tooling.
  2. Organizations whose open-source licensing policies exclude BSL-licensed software.
  3. Teams that want to avoid future license or pricing changes to a tool at the center of their provisioning workflow.

In response, the Linux Foundation forked Terraform 1.5.x under the name OpenTofu and released it under the MPL-2.0 license. OpenTofu is a drop-in replacement for Terraform: it uses the same HCL language, the same state format, the same provider ecosystem, and the same CLI commands (tofu init, tofu plan, tofu apply).
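
Because the state format and provider protocol are shared, trying OpenTofu against an existing configuration is usually just a re-initialization; a minimal sketch:

# In a root module previously managed with terraform:
tofu init    # re-initializes the same backend and downloads the same providers
tofu plan    # should report no changes if the state is current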

What this means for the evaluation: because OpenTofu consumes the same providers, state format, and HCL, everything in this chapter applies equally to Terraform and OpenTofu. The choice between them is a licensing, support, and procurement decision rather than a technical one, and it can be revisited later without rewriting any configuration.

Hyper-V / Azure Local Terraform

For Azure Local, the Terraform story is fundamentally different from KubeVirt. Azure Local is managed through Azure Resource Manager (ARM), which means the standard Azure Terraform provider (azurerm) is the primary IaC interface. VMs on Azure Local are represented as Azure Arc-enabled resources in the Azure control plane.

Provider configuration:

# providers.tf for Azure Local

terraform {
  required_version = ">= 1.5.0"

  required_providers {
    azurerm = {
      source  = "hashicorp/azurerm"
      version = "~> 3.100"
    }
  }

  backend "azurerm" {
    resource_group_name  = "terraform-state-rg"
    storage_account_name = "tfstateprodstorage"
    container_name       = "tfstate"
    key                  = "azure-local/vm-workloads.tfstate"
  }
}

provider "azurerm" {
  features {}
  subscription_id = var.subscription_id
}

Resource types for Azure Local / Azure Stack HCI:

  Resource Type                                  Description
  azurerm_stack_hci_cluster                      The Azure Stack HCI cluster registration
  azurerm_stack_hci_logical_network              Logical network on the HCI cluster
  azurerm_stack_hci_storage_path                 Storage path (local or shared)
  azurerm_stack_hci_marketplace_gallery_image    Marketplace image for VM creation
  azurerm_stack_hci_network_interface            NIC for an HCI VM
  azurerm_stack_hci_virtual_hard_disk            Virtual hard disk (VHDX)
  azurerm_arc_machine                            Arc-connected machine representation

Example -- provisioning a VM on Azure Local:

# main.tf for Azure Local VM

variable "resource_group_name" {
  type    = string
  default = "hci-workloads-rg"
}

variable "location" {
  type    = string
  default = "switzerlandnorth"
}

variable "custom_location_id" {
  description = "Azure Custom Location ID for the HCI cluster"
  type        = string
}

variable "logical_network_id" {
  description = "ID of the logical network on the HCI cluster"
  type        = string
}

variable "marketplace_image_id" {
  description = "ID of the marketplace gallery image"
  type        = string
}

# --- Network Interface ---

resource "azurerm_stack_hci_network_interface" "vm_nic" {
  name                = "rhel9-appserver-01-nic"
  resource_group_name = var.resource_group_name
  location            = var.location
  custom_location_id  = var.custom_location_id

  ip_configuration {
    name                          = "ipconfig1"
    private_ip_address            = "10.100.1.50"
    subnet_id                     = var.logical_network_id
  }

  tags = {
    managed-by = "terraform"
    env        = "production"
  }
}

# --- Virtual Hard Disk ---

resource "azurerm_stack_hci_virtual_hard_disk" "data_disk" {
  name                = "rhel9-appserver-01-datadisk"
  resource_group_name = var.resource_group_name
  location            = var.location
  custom_location_id  = var.custom_location_id

  disk_size_gb = 200
  dynamic_enabled = true

  storage_path_id = var.storage_path_id

  tags = {
    managed-by = "terraform"
  }
}

# Note: As of the azurerm provider ~3.100, the full
# azurerm_stack_hci_virtual_machine resource is still
# evolving. Check the provider changelog for the latest
# supported attributes. The resource may require using
# azapi_resource for full control.

The Azure API provider (azapi) for cutting-edge resources:

Because Azure Local resources are evolving rapidly, the azurerm provider sometimes lags behind the Azure API. The azapi provider allows direct Azure REST API calls using Terraform:

terraform {
  required_providers {
    azapi = {
      source  = "Azure/azapi"
      version = "~> 1.12"
    }
  }
}

resource "azapi_resource" "hci_vm" {
  type      = "Microsoft.AzureStackHCI/virtualMachineInstances@2024-01-01"
  parent_id = azurerm_arc_machine.vm.id

  body = jsonencode({
    extendedLocation = {
      type = "CustomLocation"
      name = var.custom_location_id
    }
    properties = {
      hardwareProfile = {
        vmSize  = "Custom"
        processors = 4
        memoryMB   = 8192
      }
      osProfile = {
        computerName  = "rhel9-appserver-01"
        adminUsername  = "sysadmin"
        linuxConfiguration = {
          ssh = {
            publicKeys = [{
              keyData = var.ssh_public_key
              path    = "/home/sysadmin/.ssh/authorized_keys"
            }]
          }
        }
      }
      storageProfile = {
        imageReference = {
          id = var.marketplace_image_id
        }
        dataDisks = [{
          id = azurerm_stack_hci_virtual_hard_disk.data_disk.id
        }]
      }
      networkProfile = {
        networkInterfaces = [{
          id = azurerm_stack_hci_network_interface.vm_nic.id
        }]
      }
    }
  })
}

Key differences from the KubeVirt model:

  API endpoint
      KubeVirt (OVE): Kubernetes API server (on-premises)
      Azure Local:    Azure Resource Manager (cloud control plane)

  Authentication
      KubeVirt (OVE): kubeconfig / ServiceAccount token
      Azure Local:    Azure AD / Service Principal / Managed Identity

  Provider
      KubeVirt (OVE): kubevirt/kubevirt + hashicorp/kubernetes
      Azure Local:    hashicorp/azurerm + Azure/azapi

  Provider maturity
      KubeVirt (OVE): Community-maintained, limited CRD coverage
      Azure Local:    Microsoft-backed, but HCI-specific resources are still evolving

  Offline operation
      KubeVirt (OVE): Fully functional without internet
      Azure Local:    Requires connectivity to Azure ARM (cloud control plane)

  Resource model
      KubeVirt (OVE): Kubernetes CRDs (VirtualMachine, DataVolume)
      Azure Local:    Azure resource types (Microsoft.AzureStackHCI/*)

VMware Terraform Provider (Current State)

The current VMware estate uses the hashicorp/vsphere Terraform provider. Understanding what the team currently has helps scope the migration effort.

# Current VMware configuration (for migration context)

provider "vsphere" {
  user                 = var.vsphere_user
  password             = var.vsphere_password
  vsphere_server       = var.vsphere_server
  allow_unverified_ssl = false
}

data "vsphere_datacenter" "dc" {
  name = "DC-ZRH-01"
}

data "vsphere_compute_cluster" "cluster" {
  name          = "Cluster-Prod-01"
  datacenter_id = data.vsphere_datacenter.dc.id
}

data "vsphere_datastore" "datastore" {
  name          = "VSAN-Prod-01"
  datacenter_id = data.vsphere_datacenter.dc.id
}

data "vsphere_network" "network" {
  name          = "VLAN-100-Prod"
  datacenter_id = data.vsphere_datacenter.dc.id
}

data "vsphere_virtual_machine" "template" {
  name          = "templates/rhel9-golden"
  datacenter_id = data.vsphere_datacenter.dc.id
}

resource "vsphere_virtual_machine" "vm" {
  name             = "rhel9-appserver-01"
  resource_pool_id = data.vsphere_compute_cluster.cluster.resource_pool_id
  datastore_id     = data.vsphere_datastore.datastore.id
  num_cpus         = 4
  memory           = 8192
  guest_id         = data.vsphere_virtual_machine.template.guest_id

  network_interface {
    network_id   = data.vsphere_network.network.id
    adapter_type = data.vsphere_virtual_machine.template.network_interface_types[0]
  }

  disk {
    label            = "disk0"
    size             = 100
    thin_provisioned = true
  }

  clone {
    template_uuid = data.vsphere_virtual_machine.template.id
    customize {
      linux_options {
        host_name = "rhel9-appserver-01"
        domain    = "internal.example.com"
      }
      network_interface {
        ipv4_address = "10.100.1.50"
        ipv4_netmask = 24
      }
      ipv4_gateway = "10.100.1.1"
    }
  }
}

Migration from vSphere provider to KubeVirt / Azure provider:

The migration is not a find-and-replace exercise. The resource models are fundamentally different:

  1. The vSphere provider clones a template into a single vsphere_virtual_machine resource, with disks, NICs, and guest customization defined inline and placement expressed through datacenter, cluster, and datastore data sources.
  2. The KubeVirt provider splits the same VM into separate DataVolume resources (cloned from a golden-image PVC) plus a VirtualMachine resource, with guest customization handled by cloud-init and networking by Multus NetworkAttachmentDefinitions.
  3. The Azure Local model splits it further into network interface, virtual hard disk, and VM-instance resources tied to an Azure Custom Location, with the VM registered as an Arc resource in the cloud control plane.

Migration strategy:

  1. Inventory all existing Terraform-managed VMware resources.
  2. Write equivalent KubeVirt or Azure HCL configurations for each resource.
  3. Use MTV (Migration Toolkit for Virtualization) to migrate the VM data (disks, configurations).
  4. Import the migrated VMs into the new Terraform state using terraform import.
  5. Run terraform plan to verify that the imported state matches the new HCL. Resolve any drift.
  6. Decommission the old VMware Terraform configurations.

This is a labor-intensive process. Budget for it explicitly.
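
A sketch of steps 4 and 5 for a single migrated VM. The import ID format is provider-specific; for Kubernetes-backed resources it is typically namespace/name, but verify against the KubeVirt provider documentation before relying on it:

# Step 4: adopt the MTV-migrated VM into the new Terraform state
terraform import kubevirt_virtual_machine.vm vm-workloads/rhel9-appserver-01

# Step 5: the plan should report no changes once the HCL matches the live VM
terraform plan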

Crossplane as a Kubernetes-Native Alternative to Terraform

Crossplane is a CNCF project that brings the Terraform-style declarative resource management model into Kubernetes itself. Instead of running Terraform from a CI/CD pipeline or a developer workstation, Crossplane runs as a set of controllers inside the Kubernetes cluster and manages infrastructure through Custom Resources.

  Crossplane Architecture

  +====================================================================+
  |  Kubernetes Cluster (same cluster as OVE, or a dedicated mgmt      |
  |  cluster)                                                          |
  |                                                                    |
  |  +-------------------------------+                                 |
  |  | Crossplane Core Controllers   |                                 |
  |  | - Package manager             |  Watches Crossplane CRDs        |
  |  | - Composition engine          |  and reconciles external         |
  |  | - Provider reconcilers        |  resources via provider APIs     |
  |  +-------------------------------+                                 |
  |                                                                    |
  |  +-------------------------------+  +----------------------------+ |
  |  | Provider: provider-kubernetes |  | Provider: provider-azure   | |
  |  | Manages K8s resources         |  | Manages Azure resources    | |
  |  | (VMs, DataVolumes, etc.)      |  | (HCI VMs, networks, etc.)  | |
  |  +-------------------------------+  +----------------------------+ |
  |                                                                    |
  |  Custom Resources:                                                 |
  |  +-------------------------------+                                 |
  |  | kind: VirtualMachine          |  Crossplane XRD (Composite      |
  |  | apiVersion: infra.example/v1  |  Resource Definition) that      |
  |  | spec:                         |  abstracts the platform-         |
  |  |   cpu: 4                      |  specific details behind a      |
  |  |   memory: 8Gi                 |  unified API                    |
  |  |   image: rhel9                |                                 |
  |  +-------------------------------+                                 |
  +====================================================================+
          |                                       |
          v                                       v
  +------------------+                 +--------------------+
  | KubeVirt API     |                 | Azure ARM API      |
  | (on-prem cluster)|                 | (cloud control     |
  |                  |                 |  plane)             |
  +------------------+                 +--------------------+

Why Crossplane matters for this evaluation:

  1. It can expose a single, platform-neutral VM abstraction (a Composite Resource Definition such as the VirtualMachine sketched above), while Compositions map that abstraction onto KubeVirt CRs or Azure Local resources, reducing rework if the target platform changes.
  2. It reconciles continuously: drift is corrected by controllers running in the cluster rather than detected only when someone runs a plan.
  3. There is no external state file to store, lock, and protect; the desired state lives in the Kubernetes API and pairs naturally with the GitOps model described later in this chapter.
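
To make the model concrete, a claim against the XRD sketched in the diagram might look like the following (infra.example/v1 and the compositionSelector labels are illustrative, not an off-the-shelf API):

apiVersion: infra.example/v1
kind: VirtualMachine
metadata:
  name: rhel9-appserver-01
  namespace: vm-workloads
spec:
  cpu: 4
  memory: 8Gi
  image: rhel9
  # A Composition carrying this label renders the claim into KubeVirt CRs;
  # a different Composition could target Azure Local instead.
  compositionSelector:
    matchLabels:
      platform: kubevirt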

Tradeoff: Crossplane adds complexity to the Kubernetes cluster (more controllers, more CRDs, more RBAC to manage). It is a good fit for organizations that are all-in on Kubernetes; it is a poor fit if the team wants to keep IaC separate from the runtime platform.


2. Ansible Modules for VM Management

Ansible Fundamentals

Ansible is an agentless automation engine. It connects to target systems over SSH (Linux) or WinRM (Windows), executes modules to achieve a desired state, and reports results. Unlike Terraform (which is purely declarative and focused on provisioning), Ansible is procedural-first (playbooks execute tasks in order) with idempotent modules that can be used declaratively.

Core concepts:

  Ansible Execution Model

  +==================================================================+
  |  Control Node (where ansible-playbook runs)                      |
  |                                                                  |
  |  +----------------------------+                                  |
  |  | ansible-playbook site.yml  |                                  |
  |  +----------------------------+                                  |
  |        |                                                         |
  |        v                                                         |
  |  +----------------------------+                                  |
  |  | Parse playbook YAML        |                                  |
  |  | Resolve roles, variables   |                                  |
  |  | Build task execution plan  |                                  |
  |  +----------------------------+                                  |
  |        |                                                         |
  |        v                                                         |
  |  +----------------------------+                                  |
  |  | For each host in inventory |                                  |
  |  |   For each task:           |                                  |
  |  |     1. Generate module     |                                  |
  |  |        code + arguments    |                                  |
  |  |     2. Transfer to host    |   <--- SSH (Linux) or            |
  |  |        (via SSH/WinRM)     |        WinRM (Windows)           |
  |  |     3. Execute module      |                                  |
  |  |     4. Collect result JSON |                                  |
  |  |     5. Report changed/ok/  |                                  |
  |  |        failed              |                                  |
  |  +----------------------------+                                  |
  |        |                                                         |
  |        v                                                         |
  |  Results:                                                        |
  |  ok=12  changed=3  unreachable=0  failed=0                       |
  +==================================================================+

  For KubeVirt management, the "host" is localhost (the control node),
  and modules communicate with the Kubernetes API server via kubeconfig
  instead of SSH to individual VMs.

Ansible vs. Terraform -- when to use which:

  IaC Tool Decision Matrix

  +------------------------------------------------------------------+
  | Provisioning new infrastructure?                                  |
  |   (VMs, networks, storage, DNS)                                  |
  |                                                                  |
  |   YES --> Terraform / OpenTofu / Crossplane                      |
  |           Declarative, state-tracked, plan-before-apply           |
  |           Good at: creating, modifying, destroying infra          |
  |           Bad at: configuring software inside VMs                 |
  |                                                                  |
  | Configuring existing infrastructure?                              |
  |   (OS patching, package install, config files, certificates)     |
  |                                                                  |
  |   YES --> Ansible                                                |
  |           Procedural + idempotent, agentless, SSH-based           |
  |           Good at: OS config, app deployment, day-2 operations   |
  |           Bad at: tracking infrastructure state, destroy lifecycle|
  |                                                                  |
  | Both?                                                            |
  |                                                                  |
  |   YES --> Terraform for provisioning + Ansible for configuration |
  |           Terraform creates the VM, Ansible configures the OS    |
  |           Common pattern: Terraform outputs VM IP, Ansible uses  |
  |           it as dynamic inventory                                |
  +------------------------------------------------------------------+

For KubeVirt/OVE, the boundary blurs. Ansible's kubevirt.core collection can both provision VMs (create VirtualMachine CRs) and configure guest OSes (via SSH after the VM boots). A single playbook can do both. Whether to use Terraform or Ansible for provisioning is a team preference and organizational standard, not a technical requirement.

KubeVirt Ansible

Two Ansible collections are relevant for managing KubeVirt VMs:

  1. kubernetes.core -- the general-purpose Kubernetes collection. Its k8s module can create, update, and delete any Kubernetes resource, including KubeVirt CRDs. This is the "raw manifest" approach.

  2. kubevirt.core -- a KubeVirt-specific collection with purpose-built modules:

    • kubevirt_vm -- manage VirtualMachine CRs
    • kubevirt_vmi -- manage VirtualMachineInstance CRs (for direct VMI manipulation, rare in practice)
    • inventory plugin -- dynamically discover running VMs as Ansible inventory hosts

Installation:

# Install required collections
ansible-galaxy collection install kubernetes.core kubevirt.core

# Verify
ansible-galaxy collection list | grep -E "kubernetes|kubevirt"
# kubernetes.core   3.2.0
# kubevirt.core     1.5.0

Full playbook example -- VM provisioning, configuration, and lifecycle operations:

This is a complete, production-oriented playbook that demonstrates end-to-end VM management with KubeVirt.

# file: playbooks/provision-appserver.yml
# Purpose: Provision a RHEL 9 VM on KubeVirt, wait for it to boot,
#          configure the guest OS, and verify readiness.
#
# Usage:
#   ansible-playbook playbooks/provision-appserver.yml \
#     -e vm_name=rhel9-appserver-01 \
#     -e namespace=vm-workloads \
#     -e cpu_cores=4 \
#     -e memory=8Gi \
#     -e disk_size=100Gi

---
- name: Provision KubeVirt VM
  hosts: localhost
  connection: local
  gather_facts: false

  vars:
    vm_name: "rhel9-appserver-01"
    namespace: "vm-workloads"
    cpu_cores: 4
    memory: "8Gi"
    disk_size: "100Gi"
    storage_class: "ocs-storagecluster-ceph-rbd"
    golden_image_pvc: "rhel9-golden-20240601"
    golden_image_namespace: "golden-images"
    network_name: "vlan-100-prod"
    ssh_public_key: "{{ lookup('file', '~/.ssh/id_ed25519.pub') }}"

  tasks:
    # ---------------------------------------------------------------
    # Step 1: Ensure namespace exists
    # ---------------------------------------------------------------
    - name: Create namespace if it does not exist
      kubernetes.core.k8s:
        state: present
        definition:
          apiVersion: v1
          kind: Namespace
          metadata:
            name: "{{ namespace }}"
            labels:
              managed-by: ansible
              migration-tier: premium

    # ---------------------------------------------------------------
    # Step 2: Create the DataVolume (clone from golden image)
    # ---------------------------------------------------------------
    - name: Create root disk DataVolume
      kubernetes.core.k8s:
        state: present
        definition:
          apiVersion: cdi.kubevirt.io/v1beta1
          kind: DataVolume
          metadata:
            name: "{{ vm_name }}-rootdisk"
            namespace: "{{ namespace }}"
            labels:
              app.kubernetes.io/name: "{{ vm_name }}"
              app.kubernetes.io/managed-by: ansible
          spec:
            source:
              pvc:
                name: "{{ golden_image_pvc }}"
                namespace: "{{ golden_image_namespace }}"
            pvc:
              accessModes:
                - ReadWriteMany
              resources:
                requests:
                  storage: "{{ disk_size }}"
              storageClassName: "{{ storage_class }}"

    - name: Wait for DataVolume to complete cloning
      kubernetes.core.k8s_info:
        api_version: cdi.kubevirt.io/v1beta1
        kind: DataVolume
        name: "{{ vm_name }}-rootdisk"
        namespace: "{{ namespace }}"
      register: dv_status
      until: >-
        dv_status.resources | length > 0 and
        dv_status.resources[0].status.phase | default('') == 'Succeeded'
      retries: 60
      delay: 10

    # ---------------------------------------------------------------
    # Step 3: Create the VirtualMachine
    # ---------------------------------------------------------------
    - name: Create VirtualMachine
      kubevirt.core.kubevirt_vm:
        state: present
        name: "{{ vm_name }}"
        namespace: "{{ namespace }}"
        labels:
          app.kubernetes.io/name: "{{ vm_name }}"
          app.kubernetes.io/managed-by: ansible
          env: production
        running: true
        spec:
          domain:
            cpu:
              cores: "{{ cpu_cores }}"
              sockets: 1
              threads: 1
            resources:
              requests:
                memory: "{{ memory }}"
              limits:
                memory: "{{ memory }}"
            devices:
              disks:
                - name: rootdisk
                  disk:
                    bus: virtio
                - name: cloudinit
                  disk:
                    bus: virtio
              interfaces:
                - name: prod-net
                  bridge: {}
              rng: {}
          networks:
            - name: prod-net
              multus:
                networkName: "{{ network_name }}"
          volumes:
            - name: rootdisk
              dataVolume:
                name: "{{ vm_name }}-rootdisk"
            - name: cloudinit
              cloudInitNoCloud:
                userData: |
                  #cloud-config
                  hostname: {{ vm_name }}
                  fqdn: {{ vm_name }}.internal.example.com
                  manage_etc_hosts: true
                  users:
                    - name: sysadmin
                      sudo: ALL=(ALL) NOPASSWD:ALL
                      shell: /bin/bash
                      ssh_authorized_keys:
                        - {{ ssh_public_key }}
                  packages:
                    - qemu-guest-agent
                    - python3
                  runcmd:
                    - systemctl enable --now qemu-guest-agent

    # ---------------------------------------------------------------
    # Step 4: Wait for VM to be running and guest agent to report
    # ---------------------------------------------------------------
    - name: Wait for VMI to reach Running phase
      kubernetes.core.k8s_info:
        api_version: kubevirt.io/v1
        kind: VirtualMachineInstance
        name: "{{ vm_name }}"
        namespace: "{{ namespace }}"
      register: vmi_status
      until: >-
        vmi_status.resources | length > 0 and
        vmi_status.resources[0].status.phase | default('') == 'Running'
      retries: 30
      delay: 10

    - name: Wait for guest agent to report IP address
      kubernetes.core.k8s_info:
        api_version: kubevirt.io/v1
        kind: VirtualMachineInstance
        name: "{{ vm_name }}"
        namespace: "{{ namespace }}"
      register: vmi_info
      until: >-
        vmi_info.resources[0].status.interfaces | default([]) | length > 0 and
        vmi_info.resources[0].status.interfaces[0].ipAddress | default('') != ''
      retries: 30
      delay: 10

    - name: Extract VM IP address
      ansible.builtin.set_fact:
        vm_ip: "{{ vmi_info.resources[0].status.interfaces[0].ipAddress }}"

    - name: Display VM information
      ansible.builtin.debug:
        msg: |
          VM provisioned successfully:
            Name:      {{ vm_name }}
            Namespace: {{ namespace }}
            IP:        {{ vm_ip }}
            CPU:       {{ cpu_cores }} cores
            Memory:    {{ memory }}
            Disk:      {{ disk_size }}

    # ---------------------------------------------------------------
    # Step 5: Add the VM to in-memory inventory for configuration
    # ---------------------------------------------------------------
    - name: Add VM to runtime inventory
      ansible.builtin.add_host:
        name: "{{ vm_ip }}"
        groups: new_vms
        ansible_user: sysadmin
        ansible_ssh_private_key_file: "~/.ssh/id_ed25519"
        ansible_ssh_common_args: "-o StrictHostKeyChecking=no"

# ===================================================================
# Play 2: Configure the guest OS
# ===================================================================
- name: Configure VM guest OS
  hosts: new_vms
  become: true
  gather_facts: true

  tasks:
    - name: Wait for SSH to become available
      ansible.builtin.wait_for_connection:
        delay: 10
        timeout: 300

    - name: Gather facts after connection
      ansible.builtin.setup:

    - name: Update all packages
      ansible.builtin.dnf:
        name: "*"
        state: latest
      register: pkg_update

    - name: Install standard tooling
      ansible.builtin.dnf:
        name:
          - vim
          - tmux
          - htop
          - net-tools
          - bind-utils
          - lsof
          - strace
          - tcpdump
          - chrony
        state: present

    - name: Configure chrony for NTP
      ansible.builtin.copy:
        dest: /etc/chrony.conf
        content: |
          server ntp1.internal.example.com iburst
          server ntp2.internal.example.com iburst
          driftfile /var/lib/chrony/drift
          makestep 1.0 3
          rtcsync
        owner: root
        group: root
        mode: "0644"
      notify: restart chrony

    - name: Configure sysctl tuning
      ansible.posix.sysctl:
        name: "{{ item.key }}"
        value: "{{ item.value }}"
        sysctl_set: true
        reload: true
      loop:
        - { key: "net.core.somaxconn", value: "65535" }
        - { key: "vm.swappiness", value: "10" }
        - { key: "net.ipv4.tcp_max_syn_backlog", value: "65535" }
        - { key: "fs.file-max", value: "2097152" }

    - name: Ensure firewalld is running
      ansible.builtin.systemd:
        name: firewalld
        state: started
        enabled: true

    - name: Open application port
      ansible.posix.firewalld:
        port: 8443/tcp
        permanent: true
        state: enabled
        immediate: true

    - name: Verification -- display system info
      ansible.builtin.debug:
        msg: |
          Configuration complete:
            Hostname:   {{ ansible_hostname }}
            OS:         {{ ansible_distribution }} {{ ansible_distribution_version }}
            Kernel:     {{ ansible_kernel }}
            CPUs:       {{ ansible_processor_vcpus }}
            Memory:     {{ ansible_memtotal_mb }} MB
            IP:         {{ ansible_default_ipv4.address | default('N/A') }}
            Updates:    {{ pkg_update.results | default([]) | length }} packages updated

  handlers:
    - name: restart chrony
      ansible.builtin.systemd:
        name: chronyd
        state: restarted

Day-2 operations playbook -- patching, scaling, certificates:

# file: playbooks/day2-operations.yml
# Purpose: Common day-2 lifecycle operations for KubeVirt VMs
#
# Usage (patching):
#   ansible-playbook playbooks/day2-operations.yml \
#     --tags patch -e target_vms=vm-workloads
#
# Usage (scale CPU):
#   ansible-playbook playbooks/day2-operations.yml \
#     --tags scale -e vm_name=rhel9-appserver-01 \
#     -e namespace=vm-workloads -e new_cpu_cores=8

---
# --- Tag: patch -- Patch guest OS packages ---
- name: "Day-2: Patch guest OS"
  hosts: "{{ target_vms | default('all') }}"
  become: true
  tags: [patch]

  tasks:
    - name: Update all packages
      ansible.builtin.dnf:
        name: "*"
        state: latest
      register: patch_result

    - name: Display patch results
      ansible.builtin.debug:
        msg: "{{ patch_result.results | default([]) | length }} packages updated"

    - name: Check if reboot is required
      ansible.builtin.stat:
        path: /var/run/reboot-required
      register: reboot_flag

    - name: Reboot if required (with wait)
      ansible.builtin.reboot:
        msg: "Rebooting for kernel update"
        reboot_timeout: 600
      when: reboot_flag.stat.exists | default(false)

# --- Tag: scale -- Scale VM CPU/Memory via KubeVirt API ---
- name: "Day-2: Scale VM resources"
  hosts: localhost
  connection: local
  gather_facts: false
  tags: [scale]

  vars:
    vm_name: ""
    namespace: "vm-workloads"
    new_cpu_cores: 4
    new_memory: "8Gi"

  tasks:
    - name: Patch VirtualMachine CPU and memory
      kubernetes.core.k8s:
        state: present
        merge_type: merge
        definition:
          apiVersion: kubevirt.io/v1
          kind: VirtualMachine
          metadata:
            name: "{{ vm_name }}"
            namespace: "{{ namespace }}"
          spec:
            template:
              spec:
                domain:
                  cpu:
                    cores: "{{ new_cpu_cores }}"
                  resources:
                    requests:
                      memory: "{{ new_memory }}"
                    limits:
                      memory: "{{ new_memory }}"

    - name: Restart VM to apply changes (if hot-plug not available)
      ansible.builtin.command:
        cmd: "virtctl restart {{ vm_name }} -n {{ namespace }}"
      changed_when: true
      # Note: For VMs with LiveUpdate enabled and maxSockets configured,
      # CPU hot-plug applies without restart. Otherwise, a restart is
      # needed. This task shells out to virtctl, which must be available
      # on the control node; the restart can also be triggered by
      # toggling the VM's runStrategy.

# --- Tag: certs -- Rotate TLS certificates ---
- name: "Day-2: Rotate application certificates"
  hosts: "{{ target_vms | default('all') }}"
  become: true
  tags: [certs]

  vars:
    cert_source_dir: "files/certs"
    cert_dest_dir: "/etc/pki/tls"

  tasks:
    - name: Copy new certificate
      ansible.builtin.copy:
        src: "{{ cert_source_dir }}/app.crt"
        dest: "{{ cert_dest_dir }}/certs/app.crt"
        owner: root
        group: root
        mode: "0644"
      notify: reload application

    - name: Copy new private key
      ansible.builtin.copy:
        src: "{{ cert_source_dir }}/app.key"
        dest: "{{ cert_dest_dir }}/private/app.key"
        owner: root
        group: root
        mode: "0600"
      notify: reload application

    - name: Verify certificate validity
      ansible.builtin.command:
        cmd: >-
          openssl x509 -in {{ cert_dest_dir }}/certs/app.crt
          -noout -dates -subject
      register: cert_info
      changed_when: false

    - name: Display certificate info
      ansible.builtin.debug:
        msg: "{{ cert_info.stdout }}"

  handlers:
    - name: reload application
      ansible.builtin.systemd:
        name: myapp
        state: reloaded

KubeVirt dynamic inventory plugin:

The kubevirt.core collection includes an inventory plugin that discovers running VMs and makes them available as Ansible hosts. This eliminates the need to maintain a static inventory file.

# file: inventory/kubevirt.yml
# Ansible dynamic inventory for KubeVirt VMs

plugin: kubevirt.core.kubevirt
connections:
  - namespaces:
      - vm-workloads
      - vm-development
    # Filter: only VMs with the "ansible-managed: true" label
    label_selector: "ansible-managed=true"
    # Network interface to use for SSH connection
    network_name: default
    # Use the guest agent-reported IP address
    use_service: false

# Set per-host connection variables from the discovered VMI status
compose:
  ansible_host: >-
    status.interfaces[0].ipAddress
  ansible_user: "'sysadmin'"
  ansible_ssh_private_key_file: "'~/.ssh/id_ed25519'"

# Group VMs by labels
keyed_groups:
  - key: labels['env']
    prefix: env
    separator: "_"
  - key: labels['app.kubernetes.io/part-of']
    prefix: app
    separator: "_"

# Test the dynamic inventory
ansible-inventory -i inventory/kubevirt.yml --list

# Use in a playbook
ansible-playbook -i inventory/kubevirt.yml playbooks/day2-operations.yml --tags patch

Hyper-V Ansible

Managing Hyper-V / Azure Local VMs with Ansible requires the ansible.windows and community.windows collections and WinRM connectivity to the Hyper-V hosts.

Prerequisites:
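
On the control node, the WinRM connection plugin with CredSSP needs a couple of Python packages in addition to the collections; a minimal sketch (package names are the standard ones, pin versions to your environment):

# Python dependencies for WinRM with the CredSSP transport
pip install pywinrm requests-credssp

# Windows automation collections
ansible-galaxy collection install ansible.windows community.windows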

# file: inventory/hyperv-hosts.yml
all:
  children:
    hyperv_hosts:
      hosts:
        hv-node-01.internal.example.com:
          ansible_connection: winrm
          ansible_winrm_transport: credssp
          ansible_winrm_server_cert_validation: validate
          ansible_port: 5986
        hv-node-02.internal.example.com:
          ansible_connection: winrm
          ansible_winrm_transport: credssp
          ansible_winrm_server_cert_validation: validate
          ansible_port: 5986
      vars:
        ansible_user: "DOMAIN\\svc-ansible"
        ansible_password: "{{ vault_hyperv_password }}"

# file: playbooks/hyperv-provision-vm.yml
# Provision a VM on Hyper-V using the community.windows collection

---
- name: Provision Hyper-V VM
  hosts: hyperv_hosts[0]
  gather_facts: false

  vars:
    vm_name: "rhel9-appserver-01"
    vm_cpu: 4
    vm_memory_mb: 8192
    vm_disk_path: "C:\\ClusterStorage\\Volume1\\VMs\\{{ vm_name }}"
    vm_vhdx_size_bytes: 107374182400  # 100 GB
    vm_switch: "Prod-vSwitch"
    iso_path: "C:\\ISOs\\rhel-9.3-x86_64-dvd.iso"

  tasks:
    - name: Create VM directory
      ansible.windows.win_file:
        path: "{{ vm_disk_path }}"
        state: directory

    - name: Create VHDX disk
      ansible.windows.win_powershell:
        script: |
          $vhdx = "{{ vm_disk_path }}\\{{ vm_name }}.vhdx"
          if (-not (Test-Path $vhdx)) {
              New-VHD -Path $vhdx -SizeBytes {{ vm_vhdx_size_bytes }} -Dynamic
              Write-Output "created"
          } else {
              Write-Output "exists"
          }
      register: vhdx_result

    - name: Create Hyper-V VM
      ansible.windows.win_powershell:
        script: |
          $vm = Get-VM -Name "{{ vm_name }}" -ErrorAction SilentlyContinue
          if (-not $vm) {
              New-VM -Name "{{ vm_name }}" `
                     -MemoryStartupBytes {{ vm_memory_mb }}MB `
                     -VHDPath "{{ vm_disk_path }}\\{{ vm_name }}.vhdx" `
                     -SwitchName "{{ vm_switch }}" `
                     -Generation 2
              Set-VM -Name "{{ vm_name }}" `
                     -ProcessorCount {{ vm_cpu }} `
                     -DynamicMemory `
                     -MemoryMinimumBytes 2GB `
                     -MemoryMaximumBytes {{ vm_memory_mb }}MB
              # Enable Secure Boot with Microsoft UEFI CA
              Set-VMFirmware -VMName "{{ vm_name }}" `
                             -SecureBootTemplate MicrosoftUEFICertificateAuthority
              # Attach ISO for installation
              Add-VMDvdDrive -VMName "{{ vm_name }}" `
                             -Path "{{ iso_path }}"
              Write-Output "created"
          } else {
              Write-Output "exists"
          }
      register: vm_create_result

    - name: Start VM
      ansible.windows.win_powershell:
        script: |
          Start-VM -Name "{{ vm_name }}"
      when: vm_create_result.output[0] == "created"

Note: Neither the ansible.windows nor the community.windows collection provides a dedicated Hyper-V guest module with idempotency guarantees comparable to the kubevirt.core.kubevirt_vm module. Most Hyper-V operations require ansible.windows.win_powershell with custom PowerShell scripts and manual idempotency checks (the if (-not $vm) pattern above). This is a significant maturity gap compared to the KubeVirt Ansible integration.

For Azure Local specifically, Microsoft recommends using the Azure CLI (az stack-hci-vm create) or ARM templates rather than direct Hyper-V PowerShell commands, because the VMs must be registered as Arc resources. Ansible can invoke Azure CLI commands via ansible.builtin.command or use the azure.azcollection collection for ARM-level operations.

VMware Ansible (Current State)

The current VMware estate uses the community.vmware collection, which provides mature, well-tested modules:

  Module                                       Purpose
  community.vmware.vmware_guest                Create, manage, and delete VMs
  community.vmware.vmware_guest_disk           Manage VM disks
  community.vmware.vmware_guest_network        Manage VM network adapters
  community.vmware.vmware_guest_powerstate     Control VM power state
  community.vmware.vmware_guest_snapshot       Manage VM snapshots
  community.vmware.vmware_vmotion              Trigger vMotion migrations
  community.vmware.vmware_cluster_info         Query cluster information

The community.vmware collection is one of the most mature Ansible collections. Any replacement must provide comparable module coverage and idempotency guarantees. As of 2026, the kubevirt.core collection is functional but has fewer modules. The community.windows collection for Hyper-V is even thinner. This gap must be factored into the migration timeline.

Ansible Automation Platform (AAP)

For an enterprise with 5,000+ VMs, running ansible-playbook from a developer's laptop is not a governance-compliant operating model. Red Hat's Ansible Automation Platform (AAP) -- formerly Ansible Tower -- provides:

  1. A central automation controller with a web UI and REST API, role-based access control, a managed credential store, job scheduling, and approval workflows.
  2. Containerized execution environments with pinned collections and Python dependencies, so playbooks run identically everywhere.
  3. Audit logging of who ran which job, against which inventory, with what result.

  Ansible Automation Platform Architecture

  +====================================================================+
  |  Ansible Automation Platform (AAP)                                 |
  |                                                                    |
  |  +-----------------------------+                                   |
  |  | Automation Controller       |   Web UI + REST API               |
  |  | (formerly Tower)            |   RBAC, credential store,         |
  |  |                             |   job scheduling, approvals       |
  |  +-------------+---------------+                                   |
  |                |                                                   |
  |                v                                                   |
  |  +-----------------------------+                                   |
  |  | Execution Environments      |   Containerized Ansible           |
  |  | (container images with      |   execution with pinned           |
  |  |  collections + Python deps) |   dependencies                    |
  |  +-------------+---------------+                                   |
  |                |                                                   |
  +====================================================================+
                   |
        +----------+----------+------------------+
        |                     |                  |
        v                     v                  v
  +------------+       +------------+     +------------+
  | KubeVirt   |       | Hyper-V    |     | VM Guest   |
  | API Server |       | Hosts      |     | OS (SSH)   |
  |(kubeconfig)|       | (WinRM)    |     |            |
  +------------+       +------------+     +------------+

AAP is particularly relevant for the OVE evaluation because Red Hat bundles AAP with OpenShift Platform Plus. If the organization selects OVE, AAP becomes a natural fit for VM lifecycle automation -- provisioning VMs through KubeVirt, configuring guest OSes via SSH, and managing day-2 operations, all through a single automation platform with audit logging and approval workflows.
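
Where the team wants AAP objects themselves under version control, the awx.awx collection can define job templates as code. The sketch below is illustrative only: the organization, project, inventory, and credential names are assumptions, and controller connection details are expected via environment variables (for example CONTROLLER_HOST and CONTROLLER_OAUTH_TOKEN).

# Hedged sketch: declare an AAP job template as code with the awx.awx collection.
# Organization, project, inventory, and credential names are placeholders.
- name: Ensure the KubeVirt VM provisioning job template exists
  awx.awx.job_template:
    name: provision-kubevirt-vm
    organization: platform-engineering
    project: infra-automation            # AAP project pointing at the playbook Git repo
    playbook: playbooks/provision_vm.yml
    inventory: ove-clusters
    credentials:
      - ove-kubeconfig                   # kubeconfig credential stored in AAP
    ask_variables_on_launch: true        # prompt for VM name and sizing at launch
    state: present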

GitOps with ArgoCD/Flux: The Kubernetes-Native IaC Model

For organizations adopting OVE, GitOps is potentially the most natural IaC model because KubeVirt VMs are Kubernetes-native Custom Resources. Instead of running Terraform from a pipeline or Ansible from AAP, you store VirtualMachine manifests in a Git repository and let a GitOps controller (ArgoCD or Flux) synchronize them to the cluster.

What is GitOps?

GitOps is an operational model where:

  1. The desired state of all infrastructure is declared in Git (the single source of truth).
  2. A controller running in the cluster continuously compares the desired state (Git) with the actual state (cluster).
  3. When drift is detected, the controller automatically reconciles (applies the changes from Git to the cluster).
  4. All changes go through Git -- pull requests, code review, approval, merge. No direct kubectl apply.

  GitOps Sync Loop

  +============================================================+
  |                                                            |
  |  Git Repository                                            |
  |  (e.g., GitLab, GitHub)                                    |
  |                                                            |
  |  repo: infra-gitops/vm-workloads                           |
  |  +------------------------------------------------------+  |
  |  | main branch                                          |  |
  |  |                                                      |  |
  |  | vm-workloads/                                        |  |
  |  |   rhel9-appserver-01.yaml    (VirtualMachine CR)     |  |
  |  |   rhel9-appserver-02.yaml    (VirtualMachine CR)     |  |
  |  |   rhel9-database-01.yaml     (VirtualMachine CR)     |  |
  |  |   network-policy.yaml        (NetworkPolicy)         |  |
  |  |   resource-quota.yaml        (ResourceQuota)         |  |
  |  |   kustomization.yaml         (Kustomize overlay)     |  |
  |  +------------------------------------------------------+  |
  |                                                            |
  +=======================+====================================+
                          |
              1. ArgoCD polls Git repo
                 (or webhook triggers sync)
                          |
                          v
  +=======================+====================================+
  |  Kubernetes Cluster (OVE)                                  |
  |                                                            |
  |  +------------------------------------------------------+  |
  |  | ArgoCD Controller                                    |  |
  |  |                                                      |  |
  |  | 2. Compare Git manifests with live cluster state     |  |
  |  |                                                      |  |
  |  | 3a. If in sync --> no action (healthy)               |  |
  |  | 3b. If out of sync --> apply diff to cluster         |  |
  |  |     (create/update/delete resources)                 |  |
  |  |                                                      |  |
  |  | 4. Report sync status back to ArgoCD UI / Git        |  |
  |  +------------------------------------------------------+  |
  |                                                            |
  |  +------------------------------------------------------+  |
  |  | Live Resources                                       |  |
  |  |   VirtualMachine: rhel9-appserver-01  (Running)      |  |
  |  |   VirtualMachine: rhel9-appserver-02  (Running)      |  |
  |  |   VirtualMachine: rhel9-database-01   (Running)      |  |
  |  +------------------------------------------------------+  |
  |                                                            |
  +============================================================+

How ArgoCD syncs VirtualMachine manifests from Git to the cluster:

ArgoCD works with any Kubernetes resource, including KubeVirt CRDs. No special configuration is needed to manage VirtualMachines -- ArgoCD treats them like any other CR.

Step 1: Store VM manifests in Git.

# file: vm-workloads/base/rhel9-appserver-01.yaml
apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
  name: rhel9-appserver-01
  namespace: vm-workloads
  labels:
    app.kubernetes.io/name: rhel9-appserver-01
    app.kubernetes.io/managed-by: argocd
    env: production
  annotations:
    argocd.argoproj.io/sync-options: SkipDryRunOnMissingResource=true
spec:
  runStrategy: Always
  template:
    metadata:
      labels:
        kubevirt.io/domain: rhel9-appserver-01
    spec:
      domain:
        cpu:
          cores: 4
          sockets: 1
          threads: 1
        resources:
          requests:
            memory: 8Gi
          limits:
            memory: 8Gi
        devices:
          disks:
            - name: rootdisk
              disk:
                bus: virtio
            - name: cloudinit
              disk:
                bus: virtio
          interfaces:
            - name: prod-net
              bridge: {}
          rng: {}
      networks:
        - name: prod-net
          multus:
            networkName: vlan-100-prod
      volumes:
        - name: rootdisk
          dataVolume:
            name: rhel9-appserver-01-rootdisk
        - name: cloudinit
          cloudInitNoCloud:
            userData: |
              #cloud-config
              hostname: rhel9-appserver-01
              users:
                - name: sysadmin
                  sudo: ALL=(ALL) NOPASSWD:ALL
                  ssh_authorized_keys:
                    - ssh-ed25519 AAAA... admin@example.com
              packages:
                - qemu-guest-agent
              runcmd:
                - systemctl enable --now qemu-guest-agent
---
apiVersion: cdi.kubevirt.io/v1beta1
kind: DataVolume
metadata:
  name: rhel9-appserver-01-rootdisk
  namespace: vm-workloads
spec:
  source:
    pvc:
      name: rhel9-golden-20240601
      namespace: golden-images
  pvc:
    accessModes:
      - ReadWriteMany
    resources:
      requests:
        storage: 100Gi
    storageClassName: ocs-storagecluster-ceph-rbd

Step 2: Use Kustomize for environment-specific overlays.

# file: vm-workloads/base/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - rhel9-appserver-01.yaml
  - rhel9-appserver-02.yaml
  - rhel9-database-01.yaml
commonLabels:
  managed-by: argocd
  team: platform-engineering

# file: vm-workloads/overlays/production/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../../base
namespace: vm-workloads-prod
patches:
  - target:
      kind: VirtualMachine
    patch: |
      - op: replace
        path: /spec/template/spec/domain/resources/requests/memory
        value: 16Gi
      - op: replace
        path: /spec/template/spec/domain/resources/limits/memory
        value: 16Gi

# file: vm-workloads/overlays/staging/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../../base
namespace: vm-workloads-staging
patches:
  - target:
      kind: VirtualMachine
    patch: |
      - op: replace
        path: /spec/template/spec/domain/cpu/cores
        value: 2
      - op: replace
        path: /spec/template/spec/domain/resources/requests/memory
        value: 4Gi
      - op: replace
        path: /spec/template/spec/domain/resources/limits/memory
        value: 4Gi

Step 3: Create an ArgoCD Application.

# file: argocd/applications/vm-workloads-prod.yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: vm-workloads-prod
  namespace: argocd
  finalizers:
    - resources-finalizer.argocd.argoproj.io
spec:
  project: infrastructure
  source:
    repoURL: https://gitlab.internal.example.com/infra-gitops/vm-workloads.git
    targetRevision: main
    path: overlays/production
  destination:
    server: https://kubernetes.default.svc
    namespace: vm-workloads-prod
  syncPolicy:
    automated:
      prune: false         # Do NOT delete VMs removed from Git (safety)
      selfHeal: true       # Re-apply if someone manually changes a VM
    syncOptions:
      - CreateNamespace=true
      - ServerSideApply=true
      - RespectIgnoreDifferences=true
    retry:
      limit: 3
      backoff:
        duration: 30s
        factor: 2
        maxDuration: 5m
  ignoreDifferences:
    # Ignore status fields that KubeVirt controllers update
    - group: kubevirt.io
      kind: VirtualMachine
      jsonPointers:
        - /status
    - group: cdi.kubevirt.io
      kind: DataVolume
      jsonPointers:
        - /status
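
The Application above references project: infrastructure, so that AppProject must exist first. A minimal sketch follows; the repository wildcard and namespace pattern are assumptions to adapt, and cluster-scoped permissions (for example to let CreateNamespace=true work) may need to be added depending on the ArgoCD setup.

# file: argocd/projects/infrastructure.yaml  (illustrative sketch)
apiVersion: argoproj.io/v1alpha1
kind: AppProject
metadata:
  name: infrastructure
  namespace: argocd
spec:
  description: Platform infrastructure (VMs, networks, quotas)
  sourceRepos:
    - https://gitlab.internal.example.com/infra-gitops/*
  destinations:
    - server: https://kubernetes.default.svc
      namespace: vm-workloads-*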

Key ArgoCD configuration decisions for VMs:

  Automated sync -- Yes, with selfHeal: true. Detects and reverts manual kubectl changes; enforces Git as the source of truth.
  Pruning -- prune: false (initially). Removing a VM YAML from Git should NOT automatically delete a running production VM; this is a safety measure. Enable selective pruning per resource type after the team gains confidence.
  Server-side apply -- Yes. Required for large CRDs like VirtualMachine; client-side apply can hit annotation size limits.
  Ignore differences -- Ignore /status on VirtualMachines and DataVolumes. KubeVirt controllers continuously update the status subresource; ArgoCD would otherwise show perpetual "OutOfSync" status.
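
One way to implement the pruning safeguard per resource is ArgoCD's per-resource sync-options annotation, shown here combined with the option already used in Step 1; treat this as a sketch to validate in a non-production Application first.

# Fragment (spec omitted): even if the Application later enables pruning,
# this VirtualMachine is excluded from prune operations.
apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
  name: rhel9-database-01
  namespace: vm-workloads
  annotations:
    argocd.argoproj.io/sync-options: SkipDryRunOnMissingResource=true,Prune=false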

GitOps vs. Terraform for VM lifecycle management:

  Aspect                 Terraform                                GitOps (ArgoCD/Flux)
  Source of truth        .tf files + terraform.tfstate            Git repository (manifests in YAML)
  State storage          External backend (S3, Azure Blob, etc.)  Kubernetes API server (etcd)
  Reconciliation         On-demand (terraform apply)              Continuous (controller polls or watches Git)
  Drift detection        On terraform plan                        Continuous (controller compares Git vs. live)
  Drift correction       Manual (terraform apply)                 Automatic (if selfHeal: true)
  Multi-platform         Excellent (any provider)                 Kubernetes-only (CRDs must live in K8s)
  Learning curve         HCL language, state management           YAML/Kustomize/Helm, ArgoCD configuration
  Destroy workflow       terraform destroy (explicit)             Remove YAML from Git + enable pruning (risky)
  Imperative operations  Not supported (use virtctl/Ansible)      Not supported (use virtctl/Ansible)
  Audit trail            CI/CD pipeline logs                      Git commit history (who, when, what, why)
  Best fit               Multi-cloud, hybrid environments         Kubernetes-native platforms (OVE)

The GitOps recommendation for OVE:

For an OVE deployment, GitOps is the strongest IaC model because:

  1. VMs are Kubernetes CRDs -- they are native Git-syncable objects.
  2. No external state file to manage, lock, or lose.
  3. The Git commit history is the audit trail. Every VM change is a reviewed, approved pull request.
  4. Self-healing reverts unauthorized manual changes -- a strong governance control.
  5. The same ArgoCD instance that manages application deployments also manages VMs -- one operational model.

However, GitOps alone does not cover everything:

  - Guest OS configuration, patching, certificate rotation, and compliance remain day-2 tasks for Ansible/AAP.
  - Imperative operations such as live migration and console access still go through virtctl or Ansible, not Git.

The combined model:

  Recommended IaC Architecture for OVE

  +====================================================================+
  |                                                                    |
  |  Git Repository                                                    |
  |  +--------------------------------------------------------------+  |
  |  | VM Definitions (YAML)     | Kustomize Overlays | Helm Charts |  |
  |  +--------------------------------------------------------------+  |
  |                |                                                   |
  |    Pull Request --> Review --> Approve --> Merge to main            |
  |                |                                                   |
  +================+===================================================+
                   |
          +--------+--------+
          |                 |
          v                 v
  +-------+------+  +------+-------+
  | ArgoCD       |  | AAP / Ansible|
  | (GitOps)     |  | (Day-2 Ops)  |
  |              |  |              |
  | Manages:     |  | Manages:     |
  | - VM CRDs    |  | - Guest OS   |
  | - DataVolumes|  | - Patching   |
  | - Networks   |  | - Certs      |
  | - Quotas     |  | - Config     |
  | - Policies   |  | - Compliance |
  +--------------+  +--------------+
          |                 |
          v                 v
  +====================================================================+
  |  OVE Cluster (Kubernetes + KubeVirt)                               |
  |  - VirtualMachine CRDs managed by ArgoCD                          |
  |  - Guest OS configured by Ansible via SSH                          |
  |  - Imperative ops (migrate, console) via virtctl                   |
  +====================================================================+

How the Candidates Handle This

  Terraform Provider
    VMware (Current):       hashicorp/vsphere -- mature, full-featured, widely used. Covers VMs, networks, storage, clusters, resource pools.
    OVE (KubeVirt):         kubevirt/kubevirt -- community-maintained, covers VMs and DataVolumes. Supplement with kubernetes_manifest for other CRDs. Less mature.
    Azure Local (Hyper-V):  hashicorp/azurerm + Azure/azapi -- Microsoft-backed, evolving HCI resource coverage. Strong for Azure-native operations.
    Swisscom ESC:           Depends on Swisscom API exposure. If VMware-based, uses hashicorp/vsphere. If API is abstracted, may require custom provider.

  Ansible Collection
    VMware (Current):       community.vmware -- 50+ modules, mature, well-documented. vmware_guest is the gold standard for VM management.
    OVE (KubeVirt):         kubevirt.core + kubernetes.core -- functional but fewer modules. kubevirt_vm for provisioning, k8s for raw manifests. Dynamic inventory plugin available.
    Azure Local (Hyper-V):  community.windows -- requires PowerShell scripting for Hyper-V. No dedicated win_hyperv_guest module with full idempotency. azure.azcollection for ARM-level operations.
    Swisscom ESC:           Depends on API exposure. Ansible can wrap any REST API via ansible.builtin.uri module.

  GitOps (ArgoCD/Flux)
    VMware (Current):       Not applicable. VMware resources are not Kubernetes CRDs. PowerCLI/govc scripts can be wrapped but this is not native GitOps.
    OVE (KubeVirt):         Native fit. VMs are Kubernetes CRDs. ArgoCD/Flux sync VM manifests from Git. Self-healing, drift detection, audit trail via Git history. Recommended model.
    Azure Local (Hyper-V):  Not natively applicable. Azure Local VMs are ARM resources, not Kubernetes CRDs. Azure GitOps (Flux on Arc-enabled clusters) manages Kubernetes workloads but not Hyper-V VMs.
    Swisscom ESC:           Not applicable unless Swisscom exposes a Kubernetes-native API.

  Crossplane
    VMware (Current):       Possible via provider-terraform wrapping the vSphere provider, but adds complexity without clear benefit.
    OVE (KubeVirt):         Native fit. provider-kubernetes manages KubeVirt CRDs. Enables platform abstraction via XRDs. Good for multi-cluster or multi-platform scenarios.
    Azure Local (Hyper-V):  Possible via provider-azure for ARM resources. Useful for hybrid OVE + Azure Local environments.
    Swisscom ESC:           Unlikely unless Swisscom provides a Crossplane-compatible API.

  State Management
    VMware (Current):       Terraform state file. Remote backend required. State segmentation by cluster/team.
    OVE (KubeVirt):         GitOps: Kubernetes etcd (no external state). Terraform: standard state file. Crossplane: Kubernetes etcd.
    Azure Local (Hyper-V):  Terraform state in Azure Blob Storage. ARM template state managed by Azure.
    Swisscom ESC:           Depends on tooling choice.

  Enterprise Automation
    VMware (Current):       vRealize Automation (vRA) / Aria Automation. Mature self-service portal with approval workflows, blueprints, catalog.
    OVE (KubeVirt):         AAP (Ansible Automation Platform). RBAC, credential management, approval workflows, job scheduling. Bundled with OpenShift Platform Plus.
    Azure Local (Hyper-V):  Azure Automation / Azure DevOps pipelines. ARM templates, Bicep. RBAC via Azure AD.
    Swisscom ESC:           Swisscom-managed. Self-service capabilities depend on ESC portal features.

  Offline Operation
    VMware (Current):       Fully offline. vCenter + Terraform run on-premises.
    OVE (KubeVirt):         Fully offline. Kubernetes API + ArgoCD + Ansible all run on-premises. Git server can be on-premises (GitLab).
    Azure Local (Hyper-V):  Partially offline. VMs run locally but ARM control plane requires Azure connectivity. Terraform apply requires Azure AD authentication.
    Swisscom ESC:           Depends on Swisscom architecture. ESC is a managed service -- likely requires Swisscom network connectivity.

  IaC Migration Effort
    VMware (Current):       N/A (baseline).
    OVE (KubeVirt):         High. Rewrite all Terraform HCL from vSphere to KubeVirt resources. Rewrite Ansible from community.vmware to kubevirt.core. Build GitOps repo structure. Estimated: 3-6 months for 5,000+ VMs.
    Azure Local (Hyper-V):  Medium-High. Rewrite Terraform HCL from vSphere to azurerm/azapi. Rewrite Ansible from community.vmware to community.windows + PowerShell. Estimated: 3-6 months.
    Swisscom ESC:           Low (if Swisscom manages IaC) to High (if customer manages IaC through Swisscom APIs).
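
For the Swisscom ESC row, a minimal sketch of wrapping a REST API with ansible.builtin.uri follows; the endpoint, payload fields, and token variable are entirely hypothetical until the actual ESC API contract is known.

# Hypothetical sketch only: URL, payload, and credential variable are placeholders.
- name: Request a VM through a provider REST API
  ansible.builtin.uri:
    url: "https://esc.example.swisscom.com/api/v1/vms"    # placeholder endpoint
    method: POST
    headers:
      Authorization: "Bearer {{ esc_api_token }}"          # placeholder credential variable
    body_format: json
    body:
      name: rhel9-appserver-01
      cpu: 4
      memory_gb: 8
    status_code: [201, 202]
  register: esc_vm_request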

Key Takeaways

  1. The IaC toolchain is not portable across platforms. Every Terraform configuration, every Ansible playbook, every CI/CD pipeline that touches VMware APIs must be rewritten for the target platform. This is not a configuration change -- it is a development project. Budget for it as a workstream with its own timeline, testing, and rollout plan.

  2. GitOps is the strongest IaC model for OVE. Because KubeVirt VMs are Kubernetes CRDs, they integrate natively with ArgoCD/Flux. The Git repository becomes the single source of truth. The commit history becomes the audit trail. Self-healing reverts unauthorized changes. No external state file. For a regulated financial institution, the Git-based approval workflow (pull request --> review --> approve --> merge --> auto-sync) is a natural fit for change management processes.

  3. Terraform remains the right choice for Azure Local. Azure Local VMs are ARM resources managed through the Azure control plane. The azurerm and azapi Terraform providers are the natural IaC interface. GitOps does not apply because Azure Local VMs are not Kubernetes CRDs. Consider Terraform Cloud or GitLab-managed Terraform state for enterprise state management.

  4. Ansible bridges the provisioning-configuration gap. Regardless of platform, Ansible is the right tool for day-2 guest OS operations: patching, certificate rotation, configuration management, compliance scanning. The kubevirt.core collection enables a single-playbook workflow that provisions the VM via the Kubernetes API and then configures the guest OS via SSH. For enterprise deployments, AAP provides the RBAC, credential management, and audit trail that ansible-playbook on a laptop does not.

  5. The KubeVirt Terraform provider is less mature than the vSphere provider. The team currently relies on a battle-tested Terraform provider with comprehensive resource coverage. The KubeVirt provider covers core resources but not the full CRD surface. The kubernetes_manifest resource fills gaps but with weaker type safety. If Terraform is the chosen IaC tool for OVE (instead of GitOps), budget for workarounds and monitor provider releases closely.

  6. OpenTofu mitigates the Terraform license risk. The BSL license change does not affect end-users today, but organizational legal teams may flag it. OpenTofu is a drop-in replacement under the MPL-2.0 license. All examples in this chapter work with both tools. The decision between Terraform and OpenTofu is a legal and strategic question, not a technical one.

  7. Crossplane is worth evaluating for multi-platform scenarios. If the organization operates both OVE and Azure Local (or plans to), Crossplane's Composite Resource Definitions can provide a unified VM provisioning API that abstracts platform-specific details. This reduces cognitive load for application teams but adds operational complexity to the platform team.

  8. Offline operation is a differentiator. OVE's IaC stack (GitOps with ArgoCD, Ansible, on-premises Git server) operates fully offline. Azure Local's IaC stack (Terraform with azurerm, Azure AD authentication) requires connectivity to Azure. For financial institutions with strict air-gap or data sovereignty requirements, this is a material consideration.

  9. The "destroy" workflow deserves special attention. Terraform destroy and GitOps pruning both delete VMs permanently. For 5,000+ VMs in production, accidental deletion is a catastrophic risk. Implement safeguards: prevent_destroy lifecycle blocks in Terraform, prune: false in ArgoCD, deletion protection annotations, and mandatory approval gates before any destructive operation.

  10. IaC is a team capability, not just a tooling choice. The team must be trained on the new IaC tools, workflows, and operational patterns. A VMware team that has used PowerCLI for a decade will not become proficient in Kustomize overlays and ArgoCD sync policies overnight. Invest in training as part of the migration budget.


Discussion Guide

Use these questions when engaging with vendors, Red Hat/Microsoft/Swisscom field teams, or internal subject matter experts.

Terraform and Provider Maturity

  1. What is the current release cadence and maintainer status of the KubeVirt Terraform provider? Is it backed by Red Hat or purely community-maintained? What is the commitment to keeping it aligned with new KubeVirt CRD versions? Why this matters: A community-maintained provider with a single maintainer is a supply-chain risk for an enterprise managing 5,000+ VMs. If the provider lags behind KubeVirt releases, the team is stuck with kubernetes_manifest workarounds or forking the provider.

  2. For Azure Local: demonstrate provisioning a VM entirely through Terraform (azurerm or azapi provider) with network, storage, and guest configuration. Which resource types are GA in the provider, and which require the azapi escape hatch? Why this matters: If critical resource types are not yet in the azurerm provider and require raw API calls via azapi, the IaC experience is significantly less mature than the vSphere Terraform provider the team currently uses.

  3. Show a Terraform plan for modifying a running VM's CPU count from 4 to 8. Does the plan show an in-place update or a destroy-and-recreate? What happens to the VM during the apply? Why this matters: If CPU changes require destroy-and-recreate, Terraform-managed scaling becomes disruptive. The team needs to understand which fields are mutable in-place and which trigger recreation.

GitOps and ArgoCD

  1. Demonstrate an ArgoCD-managed VirtualMachine deployment: commit a new VM manifest to Git, show ArgoCD detecting the change, syncing it to the cluster, and reporting the VM as healthy. What is the sync latency from Git merge to VM creation? Why this matters: GitOps sync latency directly affects provisioning SLAs. If ArgoCD polls every 3 minutes, a VM request can wait up to 3 minutes before creation even starts. Webhook-triggered syncs are faster but require integration with the Git server.

  2. Show what happens when someone manually changes a GitOps-managed VM via kubectl (e.g., adds extra memory). Does ArgoCD detect the drift? Does self-healing revert it? How quickly? Why this matters: Self-healing is the key governance feature of GitOps. If it does not work reliably for VirtualMachine CRDs (e.g., due to status field noise), the model loses its value.

  3. How should the team handle the "delete a VM" workflow in GitOps? If a developer removes a VM YAML from the Git repo and merges, should ArgoCD automatically delete the running VM? What safeguards exist? Why this matters: Accidental VM deletion is the single biggest risk of GitOps with prune: true. The team needs a clear operational model -- deletion protection annotations, manual prune approval, or a two-step workflow (stop first, then delete).

Ansible and Day-2 Operations

  1. Show a complete workflow: provision a VM via Ansible (kubevirt.core), wait for it to boot, SSH into the guest, install packages, configure NTP and firewall rules, and verify. How long does the end-to-end workflow take? Why this matters: End-to-end provisioning time (from playbook start to VM ready for application deployment) is the metric that matters for provisioning SLAs. The team needs to benchmark this against the current VMware + Ansible workflow.

  2. Demonstrate rolling OS patching across 50 VMs using Ansible with serial execution (e.g., 5 at a time), health checks between batches, and automatic rollback if a health check fails. Is this workflow achievable with the kubevirt.core collection? Why this matters: Day-2 patching at scale is the most frequent operational task. The workflow must support batch execution with health gates to avoid taking down all instances of a service simultaneously.
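
A minimal sketch of the batch-patching pattern behind this question, using only builtin modules: the inventory group, health-check endpoint, and batch size are assumptions, and the play halts the rollout on failure rather than rolling back (true rollback would need platform-level snapshots).

# Hedged sketch: rolling guest OS patching, 5 VMs per batch, stop on failure.
# Inventory group and health-check endpoint are placeholders.
- name: Rolling OS patching across the appserver fleet
  hosts: rhel9_appservers
  serial: 5                      # patch five VMs per batch
  max_fail_percentage: 0         # abort the rollout if any host in a batch fails
  become: true
  tasks:
    - name: Apply all pending updates
      ansible.builtin.dnf:
        name: "*"
        state: latest

    - name: Reboot into the patched kernel
      ansible.builtin.reboot:
        reboot_timeout: 600

    - name: Wait for the application health endpoint before the next batch
      ansible.builtin.uri:
        url: "http://{{ inventory_hostname }}:8080/healthz"   # placeholder health endpoint
        status_code: 200
      register: health
      retries: 10
      delay: 15
      until: health.status == 200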

Enterprise Automation and Governance

  1. For OVE: show how AAP (Ansible Automation Platform) integrates with KubeVirt -- credential management for kubeconfig, RBAC for playbook execution, approval workflows for production changes, and audit logging. Is AAP bundled with the OVE subscription? Why this matters: Running Ansible from a laptop is not governance-compliant. The team needs an enterprise automation platform with audit trails, RBAC, and approval workflows. If AAP is included, it simplifies the commercial model.

  2. What is the recommended IaC operating model for the target platform? Should the team use Terraform, GitOps, Ansible, Crossplane, or a combination? Provide a reference architecture with clear boundaries between tools (which tool does what). Why this matters: Tool proliferation is a real risk. If the team ends up using Terraform for some VMs, ArgoCD for others, Ansible for day-2, and Crossplane for cross-platform abstraction, the operational complexity is worse than the current VMware model. The vendor should provide an opinionated reference architecture.