
Storage Protocols

Why This Matters

Storage protocols are the wire-level languages through which servers and storage systems communicate. In a VMware environment, the protocol choice was often hidden behind VMFS and vSAN abstractions -- the hypervisor handled the details. When migrating to OVE, Azure Local, or Swisscom ESC, the protocol layer becomes explicit and consequential: each candidate platform supports a different subset of protocols, with different performance ceilings, different operational complexity, and different hardware prerequisites.

For a Tier-1 financial enterprise running 5,000+ VMs, protocol selection has three concrete impacts:

  1. Performance ceiling. The difference between iSCSI over 25 GbE and NVMe-oF over 100 GbE RDMA is not incremental -- it can mean roughly 10x more IOPS and 5x lower latency. Choosing the wrong protocol for latency-sensitive workloads (databases, trading systems, real-time analytics) means leaving performance on the table that no amount of tuning can recover.

  2. Infrastructure dependencies. Fibre Channel requires dedicated HBAs and FC switches. NVMe/RDMA requires lossless Ethernet (DCB/PFC/ECN). iSCSI runs on commodity Ethernet but performs best with jumbo frames and dedicated NICs. Each protocol choice has a bill of materials and a network design consequence.

  3. Operational model. NFS and SMB are file-level protocols that simplify shared access but introduce locking complexity. Block protocols (iSCSI, FC, NVMe-oF) deliver raw performance but limit sharing to one writer per LUN without cluster filesystems. The operational team needs to understand what they are operating, not just what the vendor configured.

This page dissects each protocol at the wire level -- PDU structures, queue models, frame formats -- so the evaluation team can ask precise questions and make informed trade-offs during PoC planning.


Concepts

1. iSCSI (Internet Small Computer Systems Interface)

Protocol Architecture

iSCSI encapsulates SCSI commands inside TCP/IP packets, enabling block storage access over standard Ethernet networks. It was standardized as RFC 7143 (consolidating the original RFC 3720) and has been the workhorse of IP-based SAN for two decades.

The key architectural insight is that iSCSI is a transport mapping, not a new storage protocol. The SCSI command set (READ, WRITE, INQUIRY, REPORT LUNS, etc.) remains identical to what a local SCSI disk uses -- iSCSI simply carries those commands over TCP instead of a parallel SCSI cable or Fibre Channel link.

iSCSI Protocol Stack
======================

  +----------------------------------+
  | SCSI Command Layer               |
  | (CDB: READ_10, WRITE_10, etc.)   |
  +----------------------------------+
  | iSCSI Layer                      |
  | (PDU framing, session mgmt,      |
  |  login, authentication)          |
  +----------------------------------+
  | TCP (port 3260)                  |
  | (reliable, ordered byte stream)  |
  +----------------------------------+
  | IP (IPv4 or IPv6)                |
  +----------------------------------+
  | Ethernet (1/10/25/100 GbE)       |
  +----------------------------------+

  vs. Fibre Channel:

  +----------------------------------+
  | SCSI Command Layer               |
  +----------------------------------+
  | FCP (Fibre Channel Protocol)     |
  +----------------------------------+
  | FC-2 (framing, flow control)     |
  +----------------------------------+
  | FC-1 (encoding: 8b/10b, 64b/66b) |
  +----------------------------------+
  | FC-0 (physical: SFP, fiber optic)|
  +----------------------------------+

iSCSI PDU Structure

Every iSCSI exchange is carried in Protocol Data Units (PDUs). A PDU consists of up to five segments:

iSCSI PDU Layout
==================

+--------------------------------------------------+
|     Basic Header Segment (BHS)    48 bytes        |
|  +--------------------------------------------+  |
|  | Opcode (1 byte)                             |  |
|  | Flags (1 byte)                              |  |
|  | Total AHS Length (1 byte)                   |  |
|  | Data Segment Length (3 bytes)               |  |
|  | LUN (8 bytes)                               |  |
|  | Initiator Task Tag (4 bytes)                |  |
|  | ... (remaining fields opcode-specific)      |  |
|  +--------------------------------------------+  |
+--------------------------------------------------+
|     Additional Header Segment (AHS)  variable     |
|     (extended CDB, bi-directional read length)    |
+--------------------------------------------------+
|     Header Digest (optional)         4 bytes      |
|     (CRC32C of BHS + AHS)                         |
+--------------------------------------------------+
|     Data Segment                     variable     |
|     (SCSI CDB, data, parameters, text)            |
+--------------------------------------------------+
|     Data Digest (optional)           4 bytes      |
|     (CRC32C of Data Segment)                      |
+--------------------------------------------------+
|     Padding to 4-byte boundary       0-3 bytes    |
+--------------------------------------------------+

Key opcodes (initiator -> target):
  0x01  SCSI Command (carries CDB + optional immediate data)
  0x03  Login Request
  0x04  Text Request
  0x05  SCSI Data-Out (write data)
  0x06  Logout Request
  0x10  SNACK Request (selective retransmission)
  0x1c-0x1e  Vendor-specific

Key opcodes (target -> initiator):
  0x21  SCSI Response (status, sense data)
  0x23  Login Response
  0x25  SCSI Data-In (read data)
  0x31  Ready to Transfer (R2T) -- flow control for writes
  0x32  Async Message (target-initiated events)

Initiator/Target Model

iSCSI uses a strict client-server model:

  - Initiator: the client side (the host or hypervisor) that originates SCSI commands. It can be a software initiator (open-iscsi, the Windows iSCSI Initiator), an offload-capable NIC, or a dedicated iSCSI HBA.
  - Target: the server side (storage array, Linux LIO/tgt, Ceph iSCSI gateway) that exposes LUNs and executes the commands it receives.

IQN Naming: Every iSCSI entity is identified by an iSCSI Qualified Name (IQN):

iqn.YYYY-MM.reverse.domain:unique-identifier
iqn.2024-01.com.example.dc1:storage-array-01.lun5
iqn.2024-01.com.example.dc1:initiator.esxi-host-03

Discovery mechanisms:

  - Static configuration: the target portal (IP:3260) and target IQN are entered manually on the initiator.
  - SendTargets: the initiator queries a discovery portal and the target returns the list of IQNs it serves (the iscsiadm sendtargets discovery shown later uses this).
  - iSNS (Internet Storage Name Service, RFC 4171): a central registry with which initiators and targets register; powerful but rarely deployed in practice.

Session and Connection Management

An iSCSI session has two phases:

  1. Login Phase: Negotiates parameters (authentication, header/data digests, max burst length, max connections, initial R2T preference). Uses text-based key=value exchanges inside Login PDUs.

  2. Full Feature Phase: The actual SCSI command exchange happens here. Multiple TCP connections can be aggregated into a single session (MC/S -- Multiple Connections per Session) for bandwidth aggregation and failover.

iSCSI Session Model
=====================

Initiator                          Target
   |                                  |
   |--- Login Request (credentials) ->|
   |<-- Login Response (challenge) ---|
   |--- Login Request (response) ---->|
   |<-- Login Response (success) -----|
   |                                  |
   |  === Full Feature Phase ===      |
   |                                  |
   |--- SCSI Command (READ_10) ------>|
   |<-- Data-In (read payload) -------|
   |<-- SCSI Response (GOOD) ---------|
   |                                  |
   |--- SCSI Command (WRITE_10) ----->|
   |<-- R2T (ready to receive) -------|
   |--- Data-Out (write payload) ---->|
   |<-- SCSI Response (GOOD) ---------|
   |                                  |

Key session parameters negotiated at login:
  MaxRecvDataSegmentLength   (default 8192, typically tuned to 262144)
  MaxBurstLength             (max data per sequence, default 262144)
  FirstBurstLength           (unsolicited data before R2T, default 65536)
  InitialR2T                 (Yes = wait for R2T; No = send immediately)
  MaxOutstandingR2T          (parallel write streams, default 1)
  MaxConnections             (per session, default 1)
  DataPDUInOrder             (Yes for most implementations)
  HeaderDigest / DataDigest  (None or CRC32C)
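
The same parameters can be pre-set for all new sessions in open-iscsi's /etc/iscsi/iscsid.conf. A minimal sketch with illustrative values, not a blanket tuning recommendation:

# /etc/iscsi/iscsid.conf -- illustrative values only
node.session.iscsi.InitialR2T = No
node.session.iscsi.ImmediateData = Yes
node.session.iscsi.FirstBurstLength = 262144
node.session.iscsi.MaxBurstLength = 16776192
node.conn[0].iscsi.MaxRecvDataSegmentLength = 262144
node.conn[0].iscsi.HeaderDigest = None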

Authentication

iSCSI authentication is negotiated during the Login Phase. The standard mechanism is CHAP (Challenge-Handshake Authentication Protocol): the target challenges the initiator with a random value and the initiator proves knowledge of a shared secret without sending the secret itself; mutual (bidirectional) CHAP additionally lets the initiator authenticate the target. CHAP protects against unauthorized logins but does not encrypt the data stream -- confidentiality requires IPsec or strict isolation of the storage VLANs.

Performance Considerations

iSCSI's fundamental performance limitation is TCP overhead. Every SCSI I/O requires TCP processing (segmentation, checksumming, ACKs, congestion control), which consumes CPU cycles and adds latency.

Mitigation strategies, ranked by effectiveness:

  Strategy                          Latency impact                   CPU impact         Complexity
  --------------------------------  -------------------------------  -----------------  ------------------------------------------
  iSER (iSCSI Extensions for RDMA)  -60-80% (bypasses TCP entirely)  -90% (zero-copy)   High (requires RDMA NICs, lossless fabric)
  TCP Offload Engine (TOE)          -20-40%                          -60-80%            Medium (specialized NIC)
  Jumbo Frames (MTU 9000)           -5-15%                           -10-20%            Low (switch + NIC config)
  Dedicated storage network         Eliminates contention            N/A                Medium (separate NICs/VLANs)
  Multiple sessions (MC/S)          Increases throughput             Slight increase    Low
  Interrupt coalescing              Trades latency for CPU           -30-50%            Low (NIC tuning)

Typical performance numbers (single-target, 4K random read, queue depth 32):

  Configuration                          IOPS       Latency (avg)   CPU overhead
  -------------------------------------  ---------  --------------  -------------------
  iSCSI / 10 GbE / software initiator    100-150K   200-400 us      15-25% of one core
  iSCSI / 25 GbE / software initiator    200-300K   150-300 us      20-30% of one core
  iSCSI / 25 GbE / TOE                   250-350K   100-200 us      5-10% of one core
  iSER / 25 GbE RDMA                     400-600K   50-100 us       2-5% of one core
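
These figures are workload- and array-dependent. A representative way to measure the 4K random read profile yourself is an fio job like the following; the device path is a placeholder for the multipath device backed by the iSCSI LUN:

# 4K random read, queue depth 32, against a multipath-backed iSCSI LUN (path is a placeholder)
fio --name=4k-randread --filename=/dev/mapper/mpatha \
    --rw=randread --bs=4k --iodepth=32 --numjobs=1 \
    --ioengine=libaio --direct=1 --time_based --runtime=60 --group_reporting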

Linux Initiator Stack

The standard Linux iSCSI implementation is open-iscsi, consisting of:

  - iscsid: the userspace daemon that handles login, session recovery, and error handling
  - iscsiadm: the CLI for discovery, login/logout, and per-node parameter changes
  - kernel modules: scsi_transport_iscsi, libiscsi, and iscsi_tcp (the software initiator transport)
  - configuration under /etc/iscsi/ (iscsid.conf defaults, initiatorname.iscsi, and the node/send_targets database)

Key operational commands:

# Discover targets on a portal
iscsiadm -m discovery -t sendtargets -p 10.0.100.1:3260

# Login to a specific target
iscsiadm -m node -T iqn.2024-01.com.storage:array01 -p 10.0.100.1 --login

# Show active sessions
iscsiadm -m session -P 3

# Set CHAP credentials
iscsiadm -m node -T iqn.2024-01.com.storage:array01 \
  --op update -n node.session.auth.authmethod -v CHAP
iscsiadm -m node -T iqn.2024-01.com.storage:array01 \
  --op update -n node.session.auth.username -v initiator01
iscsiadm -m node -T iqn.2024-01.com.storage:array01 \
  --op update -n node.session.auth.password -v <secret>

# Set replacement_timeout (seconds before failing I/O after path loss)
iscsiadm -m node -T iqn.2024-01.com.storage:array01 \
  --op update -n node.session.timeo.replacement_timeout -v 20

The critical parameter for VM workloads is node.session.timeo.replacement_timeout. This determines how long the initiator waits for a failed session to reconnect before failing I/Os back to the upper layers. The default of 120 seconds is far too long for production VMs -- it means I/O hangs for 2 minutes on a path failure. Financial environments typically set this to 5-20 seconds and rely on multipath for redundancy.

iSCSI in the Candidate Platforms

OVE (OpenShift Virtualization Engine): ODF/Ceph does not expose storage to VMs via iSCSI natively. Ceph RBD images are attached to VMs through the CSI driver, which maps RBD directly into the QEMU process using librbd (a user-space library, not a kernel iSCSI initiator). However, external iSCSI targets (NetApp, Pure Storage, Dell PowerStore) can be consumed via third-party CSI drivers that use iSCSI under the hood. The CSI driver manages the iSCSI session lifecycle transparently.

Azure Local: S2D does not use iSCSI for internal cluster storage. All intra-cluster I/O uses the SMB3/RDMA (Software Storage Bus) path. However, Azure Local nodes can attach external iSCSI targets using the Windows iSCSI Initiator (built into Windows Server). This is relevant for consuming existing SAN infrastructure during a migration transition period.

Swisscom ESC: The underlying Dell VxBlock infrastructure uses Fibre Channel to connect compute nodes to PowerMax/PowerStore arrays. iSCSI is not the primary protocol but may be available for specific workloads on request. The customer has no visibility into or control over the transport protocol.


2. NVMe-oF (Non-Volatile Memory Express over Fabrics)

Why NVMe Needed a Fabric Extension

NVMe was designed from scratch for flash storage, replacing the SCSI command set with a protocol optimized for parallelism and low latency. The key difference is the queue model:

SCSI vs NVMe Queue Model
===========================

SCSI (iSCSI, FC):
  Single command queue, single completion queue
  Queue depth: typically 32-256 per LUN
  Commands processed serially by storage controller

  Application
       |
       v
  +------------------+
  | Single Queue     |  max ~256 outstanding commands
  |  cmd1, cmd2, ... |
  +------------------+
       |
       v
  Storage Controller (processes commands one-by-one from queue)


NVMe:
  Up to 65,535 Submission Queues (SQ), each with up to 65,536 entries
  Up to 65,535 Completion Queues (CQ)
  Per-CPU queue pairs -- no locking, no contention

  CPU 0          CPU 1          CPU 2          CPU N
    |              |              |              |
    v              v              v              v
  +------+      +------+      +------+      +------+
  | SQ 1 |      | SQ 2 |      | SQ 3 |      | SQ N |
  | CQ 1 |      | CQ 2 |      | CQ 3 |      | CQ N |
  +------+      +------+      +------+      +------+
    |              |              |              |
    +------+-------+------+------+------+-------+
           |              |              |
        NVMe Controller (processes all queues in parallel)

  Result:
  - No lock contention between CPUs
  - Direct doorbell register writes (MMIO) instead of interrupts
  - Fixed 64-byte command entries vs variable-length (6-32 byte) SCSI CDBs
  - Completion via MSI-X interrupt per CQ (one per CPU)

When NVMe drives are local (PCIe-attached), they deliver 500K-1M+ IOPS at 10-20 us latency. The challenge was extending this performance across the network to shared storage arrays without reintroducing the bottlenecks that SCSI/iSCSI created. NVMe-oF (NVM Express over Fabrics, NVMe-oF specification 1.0, June 2016) solves this by defining transport bindings that carry the NVMe command set over network fabrics with minimal protocol overhead.

NVMe Architecture Internals

NVMe Command Flow (Local PCIe)
=================================

1. Host writes command to Submission Queue (SQ) in host memory
2. Host writes SQ doorbell (MMIO register on NVMe controller)
3. Controller fetches command from SQ via PCIe DMA
4. Controller executes command (read/write to flash)
5. Controller writes completion entry to Completion Queue (CQ)
6. Controller generates MSI-X interrupt to host CPU
7. Host processes completion entry, writes CQ doorbell

+--Host Memory-----------+        +--NVMe Controller--+
|                         |        |                    |
| Submission Queue (SQ)   |<--DMA--| Fetch commands     |
|  [cmd][cmd][cmd][...]   |        |                    |
|                         |        |                    |
| Completion Queue (CQ)   |--DMA-->| Post completions   |
|  [cpl][cpl][cpl][...]   |        |                    |
|                         |        | SQ Doorbell (MMIO) |
| Data Buffers            |<--DMA--| Data transfer      |
|  [page][page][page]     |        |                    |
+-------------------------+        +--------------------+

Key data structures:
  - Namespace: a logical partition of NVMe storage (like a LUN)
  - NQN (NVMe Qualified Name): identifier for hosts and subsystems
    nqn.2014-08.org.nvmexpress:uuid:<UUID>
  - Subsystem: collection of controllers and namespaces
  - Controller: a physical or virtual interface to a subsystem

Transport Bindings

NVMe-oF defines three transport bindings, each with different trade-offs:

NVMe/TCP (NVMe over TCP, TP-8000): Carries NVMe capsules over standard TCP (default port 4420). It works on any Ethernet network with any NIC, at the cost of TCP stack CPU overhead and higher latency than RDMA. Ratified in 2018 and now the most common on-ramp to NVMe-oF.

NVMe/RDMA (NVMe over RDMA): Uses RDMA verbs (RoCE v2, iWARP, or InfiniBand) for zero-copy, kernel-bypass transfers. It delivers the lowest latency and CPU overhead, but RoCE v2 requires a lossless Ethernet fabric (PFC/ECN/DCB) and RDMA-capable NICs on both ends.

NVMe/FC (NVMe over Fibre Channel): Defined by the FC-NVMe standard; carries NVMe over an existing FC fabric, reusing HBAs, switches, and zoning. Attractive for organizations with a large FC installed base that want NVMe semantics without a network redesign.

NVMe-oF Transport Comparison
===============================

                   NVMe/TCP         NVMe/RDMA (RoCE v2)    NVMe/FC
                   --------         -------------------    --------
Network            Standard         Lossless Ethernet      Fibre Channel
                   Ethernet         (PFC/ECN/DCB)          32/64G FC

NIC                Standard         RDMA NIC               FC HBA
                   25/100 GbE       (Mellanox CX-6/7)      (NVMe-capable)

CPU overhead       Medium           Very low               Low
                   (TCP stack)      (kernel bypass)        (FC offload)

Latency (4K        100-200 us       20-50 us               30-80 us
random read)

IOPS (single       300-500K         800K-1.5M              500K-1M
target, QD=32)

Deployment         Low              High                   Medium
complexity         (standard net)   (lossless fabric       (FC zoning,
                                    tuning)                 HBA firmware)

Existing infra     Reuses all       Reuses Ethernet        Reuses FC
reuse              Ethernet         (may need switch       fabric
                                    upgrade for PFC)

Discovery Service

NVMe-oF uses a dedicated discovery mechanism. Each storage fabric exposes one or more discovery controllers that respond to nvme discover commands and return a list of available subsystems, transport types, and addresses.

NVMe-oF Discovery Flow
=========================

Initiator Host                    Discovery Controller        Storage Target
     |                                   |                          |
     |-- nvme discover (traddr, -t) ---->|                          |
     |                                   |                          |
     |<-- Discovery Log Page ------------|                          |
     |    Entry 1: nqn=nqn.xxx,                                     |
     |             trtype=tcp,                                      |
     |             traddr=10.0.100.1,                               |
     |             trsvcid=4420                                     |
     |    Entry 2: nqn=nqn.yyy,                                     |
     |             trtype=rdma,                                     |
     |             traddr=10.0.100.2,                               |
     |             trsvcid=4420                                     |
     |                                                              |
     |-- nvme connect (nqn=nqn.xxx, trtype=tcp, traddr=...) ------>|
     |<-- Controller ready, namespace(s) available -----------------|
     |                                                              |
     |   /dev/nvmeXnY appears (block device ready for I/O)          |

Referrals: A discovery controller can point the initiator to other
discovery controllers, enabling multi-site or hierarchical discovery.

Performance Characteristics: NVMe-oF vs iSCSI

The performance gap is not just about raw numbers -- it is about where the bottleneck sits:

  Metric                            iSCSI (25 GbE, software)   NVMe/TCP (25 GbE)      NVMe/RDMA (100 GbE RoCE)   Factor (iSCSI vs RDMA)
  --------------------------------  -------------------------  ---------------------  -------------------------  ----------------------
  4K random read IOPS               200-300K                   400-600K               1.0-1.5M                   4-5x
  4K random read latency (avg)      200-400 us                 100-200 us             20-50 us                   5-10x
  4K random read latency (p99)      500-1000 us                200-400 us             40-80 us                   8-15x
  128K sequential read throughput   2.5 GB/s (line rate)       3.0 GB/s (line rate)   12 GB/s (line rate)        4-5x
  CPU per 100K IOPS                 20-30% core                10-15% core            2-5% core                  5-10x

The latency advantage of NVMe/RDMA is most pronounced at high percentiles (p99, p99.9), which is where user-visible performance problems manifest. A database query that hits 1000 random reads will see its total latency dominated by the slowest reads -- and at p99.9, iSCSI can spike to 2-5 ms while NVMe/RDMA stays under 200 us.

Multipath: NVMe Native vs Device-Mapper

NVMe defines its own multipath mechanism called ANA (Asymmetric Namespace Access), which is conceptually similar to ALUA in SCSI: the target reports a per-path state for each namespace (Optimized, Non-Optimized, Inaccessible, Persistent Loss, or Change) and the host sends I/O to Optimized paths first. On Linux there are two ways to consume this: the native multipathing built into the nvme driver, or legacy dm-multipath stacked on top of the per-controller block devices.

Best practice (2024+): Use NVMe native multipath. It is lighter weight, understands ANA natively, and avoids the device-mapper overhead. Enable with nvme_core.multipath=Y kernel parameter.
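
A quick way to confirm which multipath model is in effect; commands are a sketch for a RHEL-family host, and the grubby step assumes that bootloader tooling is in use:

# Y (or 1) means the nvme driver is handling multipath natively
cat /sys/module/nvme_core/parameters/multipath

# Make the setting persistent across reboots (RHEL-family example using grubby; requires a reboot)
grubby --update-kernel=ALL --args="nvme_core.multipath=Y"

# After connecting to a target, show controllers per subsystem and their ANA state
nvme list-subsys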

Linux NVMe Stack

# Install NVMe CLI tools
dnf install nvme-cli

# Discover available subsystems on a target
nvme discover -t tcp -a 10.0.100.1 -s 4420

# Connect to a subsystem
nvme connect -t tcp -n nqn.2024-01.com.storage:array01 \
  -a 10.0.100.1 -s 4420

# Connect to all discovered subsystems
nvme connect-all -t tcp -a 10.0.100.1 -s 4420

# List connected NVMe devices
nvme list

# Show NVMe multipath topology
nvme list-subsys

# Show controller details (queues, firmware, capabilities)
nvme id-ctrl /dev/nvme0

# Show namespace details (size, block size, capacity)
nvme id-ns /dev/nvme0n1

# Persistent connections (survive reboot)
# /etc/nvme/discovery.conf:
# -t tcp -a 10.0.100.1 -s 4420
# Then: systemctl enable --now nvmf-autoconnect.service

NVMe-oF in the Candidate Platforms

OVE: ODF (Ceph) has experimental NVMe-oF gateway support (SPDK-based). The Ceph NVMe-oF gateway exports Ceph RBD images as NVMe-oF namespaces, allowing external hosts to consume Ceph storage via NVMe/TCP. For VMs running inside OVE, storage access is via librbd (user-space), which is already highly optimized and does not benefit significantly from NVMe-oF. NVMe-oF matters most for OVE when consuming external storage arrays (NetApp, Pure) that expose NVMe-oF targets.

Azure Local: S2D uses the Software Storage Bus (SSB) for intra-cluster communication, which is an SMB3/RDMA-based transport. Azure Local does not use NVMe-oF for internal cluster storage. However, Azure Local 23H2+ supports connecting to external NVMe/TCP targets. NVMe/TCP is gaining relevance as Microsoft expands Azure Local's external storage integration.

Swisscom ESC: The Dell PowerMax/PowerStore arrays behind ESC support NVMe/FC as a front-end protocol. Whether NVMe/FC is enabled in the ESC tenant depends on Swisscom's infrastructure configuration. The customer cannot select or configure the transport protocol.


3. MPIO (Multipath I/O)

Why Multipath Exists

In enterprise storage, a single cable failure or switch failure must never cause a storage outage. Multipath I/O provides two capabilities simultaneously:

  1. Redundancy: If one path fails (cable cut, NIC failure, switch failure), I/O continues on the remaining paths without interruption.
  2. Performance: I/O can be distributed across multiple active paths, multiplying available bandwidth.

MPIO Path Layout (Typical Dual-Fabric Design)
================================================

  +--------------------+
  |   Compute Host     |
  |                    |
  | NIC A      NIC B   |   (or HBA-A, HBA-B for FC)
  +--+------------+----+
     |            |
     v            v
  +-------+    +-------+
  |Switch |    |Switch |     Fabric A    Fabric B
  |  A    |    |  B    |     (separate failure domains)
  +--+----+    +--+----+
     |            |
     v            v
  +--+------------+----+
  | Port A     Port B  |
  |                    |
  | Storage Target     |
  |   (SAN Array,      |
  |    Ceph Gateway,   |
  |    NVMe Target)    |
  +--------------------+

  From host's perspective WITHOUT multipath:
    /dev/sda  (via NIC-A -> Switch-A -> Port-A)    path 1
    /dev/sdb  (via NIC-B -> Switch-B -> Port-B)    path 2
    --> Same physical LUN appears as TWO devices!
    --> Filesystem sees two disks, data corruption risk

  From host's perspective WITH multipath:
    /dev/sda  (path 1)  --+
                           +--> /dev/mapper/mpath0   (single device)
    /dev/sdb  (path 2)  --+

    multipathd manages failover and load balancing

Linux Device-Mapper Multipath

The standard Linux multipath implementation is device-mapper-multipath, consisting of:

  - multipathd: the daemon that monitors path health, applies the configuration, and performs failover/failback
  - multipath: the CLI for listing topologies (-ll), flushing maps (-f), and dry-run evaluation (-d)
  - dm-multipath: the kernel device-mapper target that routes I/O across the grouped paths
  - kpartx: creates device-mapper nodes for partitions on multipath devices

Key configuration sections in /etc/multipath.conf:

defaults {
    polling_interval     5          # seconds between path health checks
    path_checker         tur        # Test Unit Ready (SCSI command)
    path_grouping_policy failover   # or multibus, group_by_prio
    failback             immediate  # switch back to preferred path when available
    no_path_retry        5          # I/O retries when ALL paths fail (or "queue")
    user_friendly_names  yes        # use mpath0, mpath1 instead of WWID
    find_multipaths      yes        # only multipath devices with 2+ paths
}

devices {
    device {
        vendor               "NETAPP"
        product              "LUN.*"
        path_grouping_policy group_by_prio
        prio                 alua           # use ALUA priority
        path_checker         tur
        failback             immediate
        no_path_retry        queue
        features             "3 queue_if_no_path pg_init_retries 50"
    }
}
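
A typical workflow for applying and verifying this configuration, sketched with commands from the device-mapper-multipath package on a RHEL-family host:

# Enable multipathd with a default /etc/multipath.conf (RHEL-family helper)
mpathconf --enable --with_multipathd y

# Reload after editing /etc/multipath.conf
systemctl reload multipathd

# Show the assembled topology: path groups, priorities, path states
multipath -ll

# Verbose dry run -- useful to see why a device was or was not multipathed
multipath -v3 -d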

Path Grouping Policies

How paths are organized into groups determines the failover and load balancing behavior:

  failover            One active path per group; groups are priority-ordered. Only the
                      highest-priority group handles I/O; if all paths in a group fail, the
                      next group takes over.
                      Use case: active-passive arrays, conservative deployments

  multibus            All paths in a single group; I/O is distributed across all paths
                      simultaneously.
                      Use case: active-active arrays with symmetric access

  group_by_prio       Paths are grouped by ALUA priority. The highest-priority group is
                      active; lower-priority groups are standby.
                      Use case: ALUA-capable arrays (most modern SAN arrays)

  group_by_serial     Paths are grouped by target serial number.
                      Use case: multi-target configurations

  group_by_node_name  Paths are grouped by target node name (WWNN for FC).
                      Use case: specific FC topologies

Path Checkers and Failover Timing

Path health is verified by periodic checks. The checker type must match the storage protocol:

  Checker                Protocol               Mechanism
  ---------------------  ---------------------  ---------------------------------------------------------
  tur (Test Unit Ready)  SCSI (iSCSI, FC)       Sends SCSI TEST UNIT READY; device responds with status
  readsector0            SCSI                   Reads LBA 0; more invasive but catches more failure modes
  directio               Any block device       Reads a sector with O_DIRECT; generic fallback
  none                   NVMe native multipath  Not needed; the NVMe driver tracks path state via ANA

Failover timing is determined by the interaction of several parameters:

Failover Timeline (iSCSI example)
====================================

  t=0s     Path failure occurs (cable cut)
  t=0-5s   TCP retransmissions (kernel TCP timeout)
  t=5s     iSCSI session detects failure (replacement_timeout)
  t=5s     Path marked as "failed" by iscsi_tcp
  t=5s     multipathd detects failed path on next poll
  t=5-10s  I/O rerouted to surviving path(s)
  t=10s    I/O resumes on healthy path

  Total failover time: 5-10 seconds (with tuning)
  Default failover time: 60-120 seconds (without tuning!)

  Critical tuning parameters:
    iSCSI:
      node.session.timeo.replacement_timeout = 5-20
    multipathd:
      polling_interval = 5
      no_path_retry = 5 (or "queue" for indefinite retry)
    Kernel:
      net.ipv4.tcp_retries2 = 5  (reduce from default 15)
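
To make the kernel-level piece of this tuning persistent, a sysctl drop-in is the usual approach; the file name below is illustrative:

# Persist the TCP retransmission tuning (file name is illustrative)
echo "net.ipv4.tcp_retries2 = 5" > /etc/sysctl.d/90-iscsi-failover.conf
sysctl --system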

ALUA (Asymmetric Logical Unit Access)

ALUA (SPC-3/SPC-4 standard) is a SCSI mechanism that allows a storage array to communicate path preferences to the host. This is critical for dual-controller arrays where a LUN is owned by one controller but accessible (at reduced performance) through the other:

ALUA Path States
==================

  +-----------------------------+     +-----------------------------+
  |   Controller A (Owner)      |     |   Controller B (Peer)       |
  |                             |     |                             |
  |   LUN 1: Active-Optimized   |     |   LUN 1: Active-Non-Opt.    |
  |   LUN 2: Active-Non-Opt.    |     |   LUN 2: Active-Optimized   |
  |   LUN 3: Active-Optimized   |     |   LUN 3: Active-Non-Opt.    |
  +-----------------------------+     +-----------------------------+
         |                                   |
         v                                   v
  Paths to Controller A             Paths to Controller B
  Priority: 50 (preferred)          Priority: 10 (non-preferred)
  for LUN 1 and LUN 3               for LUN 1 and LUN 3

  ALUA states:
    Active-Optimized (AO)    = Best path, full performance
    Active-Non-Optimized (ANO) = Accessible but cross-controller hop
    Standby                  = Path available but not serving I/O
    Unavailable              = Path offline
    Transitioning            = Controller ownership is changing

  multipathd with prio=alua:
    - Queries ALUA target port group descriptor
    - Assigns priority based on AO (50) vs ANO (10)
    - Routes I/O to AO paths first
    - On controller failover, ALUA states update, multipathd re-prioritizes
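
To see what the array actually reports, the target port group states can be decoded directly from a path device with sg3_utils and compared with multipathd's view; the device name below is a placeholder:

# Decode ALUA target port group states from one path device (sg3_utils)
sg_rtpg --decode /dev/sdc

# multipathd's view of path states and priority groups
multipathd show paths
multipathd show maps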

Multipath with Each Storage Protocol

  Protocol        Multipath method             Path identity              Notes
  --------------  ---------------------------  -------------------------  -----------------------------------------------------------------
  Fibre Channel   dm-multipath with ALUA       WWPN + LUN ID              Most mature; well-tested with all enterprise arrays
  iSCSI           dm-multipath with ALUA       IQN + target portal + LUN  Multiple iSCSI sessions (one per path); requires multiple NICs or VLANs
  NVMe/TCP        NVMe native multipath (ANA)  NQN + controller ID        Preferred over dm-multipath; lighter weight
  NVMe/RDMA       NVMe native multipath (ANA)  NQN + controller ID        Same as NVMe/TCP
  NVMe/FC         NVMe native multipath (ANA)  NQN + controller ID        Requires FC HBA with NVMe support


4. Fibre Channel

FC Protocol Stack

Fibre Channel is a lossless, high-speed network protocol designed exclusively for storage traffic. Unlike Ethernet, FC guarantees in-order delivery and provides credit-based flow control that prevents frame drops.

FC Protocol Stack (FC-0 through FC-4)
========================================

+--------------------------------------------------+
| FC-4: Upper Layer Protocols                      |
|   FCP (Fibre Channel Protocol for SCSI)          |
|   NVMe/FC (NVMe over Fibre Channel)              |
|   FICON (mainframe channel)                      |
+--------------------------------------------------+
| FC-3: Common Services                            |
|   (hunt groups, multicast -- rarely used)        |
+--------------------------------------------------+
| FC-2: Framing and Flow Control                   |
|   Frame structure, credit-based flow control,    |
|   classes of service (Class 3 most common),      |
|   exchange and sequence management               |
+--------------------------------------------------+
| FC-1: Encode/Decode                              |
|   8b/10b (up to 8G FC)                           |
|   64b/66b (16G FC and above)                     |
+--------------------------------------------------+
| FC-0: Physical Interface                         |
|   Optical transceivers (SFP+, SFP28, QSFP)       |
|   Cable types (OM3/OM4 multimode, OS2 singlemode)|
|   Speeds: 8/16/32/64/128 GFC                     |
+--------------------------------------------------+

Speed evolution:
  1G FC  (1997)  -> 1.0625 Gbps
  2G FC  (2001)  -> 2.125 Gbps
  4G FC  (2004)  -> 4.25 Gbps
  8G FC  (2007)  -> 8.5 Gbps
  16G FC (2011)  -> 14.025 Gbps (64b/66b encoding)
  32G FC (2016)  -> 28.05 Gbps
  64G FC (2020)  -> 57.2 Gbps (PAM4 signaling)
  128G FC (2024) -> 112.0 Gbps (expected Gen 8)

Topology

FC supports three topologies, though only one is used in modern data centers:

  - Point-to-point (FC-P2P): a direct link between one initiator port and one target port, no switch involved.
  - Arbitrated loop (FC-AL): up to 126 devices sharing a loop; legacy and effectively obsolete.
  - Switched fabric (FC-SW): all devices attach to FC switches that route frames by N_Port ID. This is the only topology deployed today, normally as two independent fabrics (A and B) for redundancy.

Zoning

Zoning is the FC equivalent of firewall rules. It controls which initiators (HBAs) can see which targets (storage ports):

Zoning Example (Single-Initiator Best Practice)
==================================================

Zone: host01_array01
  Members:
    21:00:00:e0:8b:01:01:01   (Host 01, HBA Port A, WWPN)
    50:00:00:00:c9:aa:bb:01   (Array 01, Port A1, WWPN)
    50:00:00:00:c9:aa:bb:02   (Array 01, Port A2, WWPN)

Zone: host01_array01_fabricB
  Members:
    21:00:00:e0:8b:01:01:02   (Host 01, HBA Port B, WWPN)
    50:00:00:00:c9:aa:bb:03   (Array 01, Port B1, WWPN)
    50:00:00:00:c9:aa:bb:04   (Array 01, Port B2, WWPN)

Zone: host02_array01
  Members:
    21:00:00:e0:8b:02:01:01   (Host 02, HBA Port A, WWPN)
    50:00:00:00:c9:aa:bb:01   (Array 01, Port A1, WWPN)
    50:00:00:00:c9:aa:bb:02   (Array 01, Port A2, WWPN)

Zoneset: production_zoneset (activated on all switches in fabric)
  Contains: host01_array01, host01_array01_fabricB, host02_array01

WWNN/WWPN Addressing and Login

Every FC port has two addresses:

  - WWNN (World Wide Node Name): identifies the node as a whole (the HBA or the array controller).
  - WWPN (World Wide Port Name): identifies an individual port on that node; zoning and LUN masking are done against WWPNs.

Both are 8-byte addresses, typically displayed as colon-separated hex: 21:00:00:e0:8b:01:01:01.
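
On a Linux host, the HBA's WWNN/WWPN values (the inputs to any zoning request) are exposed through sysfs:

# WWPN and WWNN of each local FC HBA port
cat /sys/class/fc_host/host*/port_name
cat /sys/class/fc_host/host*/node_name

# Link state and negotiated speed per port
cat /sys/class/fc_host/host*/port_state
cat /sys/class/fc_host/host*/speed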

The login process establishes communication:

FC Login Sequence
===================

  Host HBA                FC Switch                Storage Array
     |                       |                          |
     |-- FLOGI ------------->|                          |
     |   (Fabric Login)      |                          |
     |<-- FLOGI Accept ------|                          |
     |   (assigned N_Port ID |                          |
     |    e.g., 0x010200)    |                          |
     |                       |                          |
     |-- PLOGI (Port Login, to storage target WWPN) --->|
     |<-- PLOGI Accept (buffer credits, capabilities) --|
     |                       |                          |
     |-- PRLI (Process Login, FCP/NVMe protocol) ------>|
     |<-- PRLI Accept (protocol ready) -----------------|
     |                       |                          |
     |   === Ready for SCSI/NVMe commands ===           |

  FLOGI: Host registers with the fabric, receives a 3-byte N_Port ID
         (24-bit address used for frame routing within the fabric)
  PLOGI: Host establishes a session with a specific target port
  PRLI:  Host and target agree on the upper-layer protocol (FCP or NVMe)

FC Frame Structure

FC Frame Format
=================

+--------+--------+--------+--------+-----------+--------+--------+
| SOF    | Frame  | D_ID   | S_ID   |  Payload  | CRC    | EOF    |
| (4B)   | Header | (3B)   | (3B)   | (0-2112B) | (4B)   | (4B)   |
|        | (24B)  |        |        |           |        |        |
+--------+--------+--------+--------+-----------+--------+--------+

Frame Header (24 bytes, six 32-bit words):
  Word 0:  R_CTL (1B)  | D_ID (3B)    -- routing control, destination N_Port ID
  Word 1:  CS_CTL (1B) | S_ID (3B)    -- class-specific control, source N_Port ID
  Word 2:  TYPE (1B)   | F_CTL (3B)   -- upper-layer protocol, frame control flags
  Word 3:  SEQ_ID (1B) | DF_CTL (1B) | SEQ_CNT (2B)
  Word 4:  OX_ID (2B)  | RX_ID (2B)
  Word 5:  Parameter / Relative Offset (4B)

Key fields:
  R_CTL:  Routing control (data frame, link control, etc.)
  D_ID:   Destination N_Port ID (assigned during FLOGI)
  S_ID:   Source N_Port ID
  TYPE:   Upper layer protocol (0x08 = FCP/SCSI, 0x28 = NVMe)
  OX_ID:  Originator exchange ID (tracks the I/O operation)
  RX_ID:  Responder exchange ID

Max payload: 2112 bytes (2048 data + 64 optional header)
Max frame size with overhead: 2148 bytes

Credit-Based Flow Control

FC's lossless guarantee comes from buffer-to-buffer (BB) credits. Each port allocates a fixed number of frame buffers. Before sending a frame, the sender must hold a credit. Credits are returned when the receiver processes the frame.

This means FC never drops frames due to congestion -- instead, a saturated link causes the sender to wait (back-pressure). This is fundamentally different from Ethernet, where congestion causes frame drops and TCP retransmissions.

The implication for long distances: credits are finite, and they must cover the round-trip time. For a 100 km link at 32G FC you need on the order of 1,500 or more credits to keep the link busy, which requires expensive long-distance buffering on the FC switch ports (see the worked example below).
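
A back-of-the-envelope sizing, assuming roughly 5 us/km propagation delay in fiber and full-size frames (smaller frames serialize faster and need proportionally more credits):

  frame serialization at 32G FC:  2148 bytes x 8 bits / 28.05 Gbps   ~ 0.6 us
  round trip for 100 km:          2 x 100 km x 5 us/km               = 1000 us
  credits to keep the link full:  1000 us / 0.6 us                   ~ 1600 BB credits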

FCoE (Fibre Channel over Ethernet)

FCoE encapsulates FC frames inside Ethernet frames, allowing FC and Ethernet traffic to share the same physical cable and switch. It was designed to reduce cabling complexity in data centers.

DCB (Data Center Bridging) requirements for FCoE:

  - PFC (Priority Flow Control, IEEE 802.1Qbb): per-priority pause so the FCoE traffic class is lossless
  - ETS (Enhanced Transmission Selection, IEEE 802.1Qaz): bandwidth guarantees per traffic class
  - DCBX (DCB Exchange): negotiates the PFC/ETS configuration between switch and adapter
  - A dedicated lossless priority (CoS) reserved for FCoE traffic end to end

CNA (Converged Network Adapter): A single adapter that provides both Ethernet and FC connectivity. The CNA presents virtual Ethernet and virtual FC interfaces to the OS.

Current status (2025): FCoE adoption has been limited. Most enterprises that invested in FC stayed with native FC rather than converging. Most enterprises on Ethernet chose iSCSI or NVMe/TCP. FCoE occupies a narrow niche for organizations with existing FC infrastructure that want to reduce cabling in new deployments.

FC in Modern Data Centers -- When It Still Makes Sense

Fibre Channel is not dead, but its role is narrowing:

FC still makes sense when:

  - A large, already amortized FC fabric and FC-attached array estate exists and must be consumed for years to come
  - Policy or regulation mandates physical separation of storage traffic from the data network
  - Deterministic, lossless transport is required without the operational effort of tuning lossless Ethernet (PFC/ECN)
  - The platform dictates it (as with Swisscom ESC's VxBlock design)

FC is losing ground to converged Ethernet because:

  - 25/100 GbE ports cost far less per Gbps than 32/64G FC ports, and Ethernet skills are a commodity
  - A separate FC fabric means separate switches, HBAs, firmware lifecycles, and a specialized skill set
  - NVMe/TCP and RDMA-based protocols are closing the performance gap on standard Ethernet
  - HCI storage models (vSAN, ODF/Ceph, S2D) remove the external SAN, and with it the need for FC, entirely

Fibre Channel in the Candidate Platforms

OVE: OpenShift has no native Fibre Channel support at the Kubernetes level. FC-attached storage can be consumed via CSI drivers from storage vendors (e.g., NetApp Trident, Dell CSI, Pure CSI) that handle the FC session management outside of Kubernetes. This means the FC zoning, HBA configuration, and multipath setup happen at the bare-metal Linux layer, not within the Kubernetes/OpenShift abstraction. It works, but it is a bolt-on integration, not a first-class citizen.

Azure Local: Windows Server has mature native FC support via the Windows FC HBA driver stack. Azure Local nodes can connect to FC SAN arrays for VM storage. However, S2D (the HCI storage model) uses local disks and SMB3 for inter-node traffic -- FC is only relevant when consuming external SAN storage alongside or instead of S2D.

Swisscom ESC: FC is the primary storage transport. The Dell VxBlock infrastructure connects compute nodes to PowerMax/PowerStore arrays via 32G Fibre Channel. The customer has no involvement in FC operations -- Swisscom manages zoning, HBA firmware, and multipath configuration.


5. NFSv3

RPC/XDR Foundation

NFS is built on Sun RPC (Remote Procedure Call) and XDR (External Data Representation). Every NFS operation is an RPC call: the client marshals the operation (READ, WRITE, GETATTR, LOOKUP) into XDR format, sends it to the server, and waits for the XDR-encoded response.

NFS Protocol Stack (v3)
==========================

  +----------------------------------+
  | NFS v3 Protocol (RFC 1813)        |
  | Operations: READ, WRITE, CREATE,  |
  | REMOVE, RENAME, GETATTR, LOOKUP,  |
  | READDIR, COMMIT, etc.             |
  +----------------------------------+
  | Sun RPC v2 (RFC 5531)             |
  | Program number: 100003 (NFS)      |
  | Version: 3                         |
  | Procedure numbers: 0-21           |
  +----------------------------------+
  | XDR (External Data Representation)|
  | (RFC 4506 -- serialization format)|
  +----------------------------------+
  | TCP (port assigned by portmapper)  |
  | or UDP (legacy, less reliable)     |
  +----------------------------------+

  Supporting services:
    portmapper / rpcbind (port 111):
      Maps RPC program numbers to TCP/UDP ports
      Client asks: "What port is NFS (program 100003) on?"
      portmapper responds: "Port 2049"

    mountd (mount daemon):
      Handles mount requests from clients
      Verifies export permissions (/etc/exports)
      Returns a file handle for the exported directory

    statd / lockd (NLM -- Network Lock Manager):
      Provides advisory file locking for NFSv3
      statd tracks client state for crash recovery
      lockd handles lock/unlock requests

    rquotad (quota daemon):
      Reports quota information to clients
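
The service sprawl is easy to observe from any client; these standard queries against an NFSv3 server (the hostname is a placeholder) also show why NFSv3 is awkward to pass through firewalls:

# Which RPC programs are registered, and on which ports
rpcinfo -p nfs-server.example.com

# Which directories the server exports (talks to mountd)
showmount -e nfs-server.example.com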

Stateless Design

NFSv3 is stateless -- the server maintains no per-client session state. Every request is self-contained and can be answered independently. This has profound implications:

Advantages:

  - Crash recovery is trivial: a rebooted server simply resumes answering; clients retry and continue
  - No session or lock state has to be rebuilt when failing over between server heads
  - Simple to scale out and load balance across clustered NAS heads

Disadvantages:

  - File locking is not part of the protocol; it is bolted on via NLM/statd and is fragile across failures
  - The server has no notion of open files, so it cannot coordinate concurrent writers
  - Only weak, close-to-open / attribute-cache consistency between clients
  - Security is limited to AUTH_SYS plus export-level IP restrictions (next section)

AUTH_SYS and Export Controls

NFSv3 authentication is primitive by modern standards:

  - AUTH_SYS (AUTH_UNIX): the client asserts a numeric UID and GID list and the server trusts them; any host that can reach the export and spoof a UID can act as that user.
  - Export controls: /etc/exports restricts access by client IP, hostname, or subnet, and options such as root_squash map remote root to an unprivileged user.
  - There is no cryptographic authentication, integrity, or encryption; security rests entirely on network segmentation.

Performance Considerations

  TCP vs UDP
      TCP provides reliable delivery; UDP relies on NFS-level retransmission. TCP is
      mandatory for modern deployments.
      Recommendation: always use TCP (-o proto=tcp).

  wsize / rsize
      Read and write transfer sizes; larger means fewer round trips and higher throughput.
      Recommendation: 1048576 (1 MB) for bulk workloads (-o rsize=1048576,wsize=1048576).

  Attribute caching (ac / noac)
      The client caches file attributes locally to avoid GETATTR round trips; noac disables
      caching for strict consistency.
      Recommendation: keep the default (ac) for performance; use noac only when multiple
      clients write and read the same files simultaneously.

  Read-ahead
      The client pre-fetches sequential data; on Linux this is governed by the read_ahead_kb
      setting of the NFS mount's backing device info (BDI).
      Recommendation: tune for sequential workloads (backup, media).

  Write-behind / async writes
      The client buffers writes and flushes asynchronously; the sync mount option forces
      synchronous writes.
      Recommendation: default (async) for performance, sync for strict durability requirements.

  Jumbo frames
      MTU 9000 reduces per-frame overhead for large transfers.
      Recommendation: enable if supported end-to-end.

  nconnect
      Linux 5.3+: opens multiple TCP connections per mount and distributes I/O across them.
      Recommendation: -o nconnect=8 for high-throughput workloads; a major improvement on
      multi-core servers.
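
A mount line that combines the recommendations above might look like this; server, export, and mount point are placeholders:

# Illustrative NFSv3 mount with the tuning options discussed above
mount -t nfs -o vers=3,proto=tcp,hard,rsize=1048576,wsize=1048576,nconnect=8 \
  nas01.example.com:/vol/vmdata /mnt/vmdata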

Typical NFSv3 performance (single client, 25 GbE, enterprise NAS):

  Workload                        Throughput      IOPS      Latency
  ------------------------------  --------------  --------  ------------
  1 MB sequential read            2.0-2.5 GB/s    N/A       N/A
  1 MB sequential write (sync)    1.0-1.5 GB/s    N/A       N/A
  4K random read                  N/A             50-100K   200-500 us
  4K random write (sync)          N/A             10-30K    500-2000 us

NFS for VM Storage

In VMware environments, NFS is commonly used as a datastore protocol. ESXi mounts an NFS export and stores VMDK files on it, similar to VMFS on block storage.

VMware NFS datastore specifics:

  - ESXi supports NFSv3 and NFSv4.1 datastores; VMDK files live directly on the export, with no VMFS layer in between
  - On NFSv3 datastores ESXi uses its own .lck lock files instead of SCSI reservations
  - VMDKs on NFS are thin-provisioned by default
  - VAAI-NAS vendor plugins offload operations such as full file clone and space reservation to the array

Implication for migration: If the current VMware environment uses NFS datastores, the NFS server itself may survive the migration (it is independent of the hypervisor), but the integration model changes completely. OVE can consume NFS via the Kubernetes CSI NFS driver, but there is no equivalent of VAAI-NAS offload. Azure Local can mount NFS shares but prefers SMB/S2D for VM storage.


6. NFSv4

Stateful Design

NFSv4 (RFC 7530) is a fundamental redesign of NFS, introducing statefulness. The server now tracks open files, locks, and delegations per client. This eliminates the bolted-on NLM locking mechanism and enables features impossible in a stateless protocol.

Key stateful operations:

  - OPEN / CLOSE: the server issues a stateid for each open file and tracks it until close
  - LOCK / LOCKU: byte-range locking is integrated into the protocol (no separate NLM)
  - Delegations: the server can delegate a file to one client for aggressive local caching and recall the delegation via a callback when another client needs it
  - Lease renewal: all of a client's state hangs off a lease that the client must renew periodically

NFSv4 Stateful Interaction
============================

Client A                              Server
   |                                     |
   |-- OPEN file.txt ------------------->|  Server creates state
   |<-- stateid=0x0001 ------------------|  Client holds state
   |                                     |
   |-- READ (stateid=0x0001, off=0) ---->|  Server verifies state
   |<-- Data ----------------------------|
   |                                     |
   |-- LOCK (stateid=0x0001, 0-4095) --->|  Integrated locking
   |<-- Lock granted --------------------|  (no separate NLM!)
   |                                     |
   |-- WRITE (stateid=0x0001, off=0) --->|  Write within lock
   |<-- OK ------------------------------|
   |                                     |
   |-- CLOSE (stateid=0x0001) ---------->|  Releases state + lock
   |<-- OK ------------------------------|

Lease management:
  - Client renews lease periodically (default 90s)
  - If client fails to renew (crash, network partition):
    Server reclaims state after lease expiry
    Locks are released, delegations recalled
  - Grace period after server restart:
    Clients reclaim pre-existing locks
    New locks are rejected until grace period ends

Single-Port Operation and Pseudo-Filesystem

NFSv4 consolidates all operations on a single well-known port: TCP 2049. No portmapper, no mountd, no lockd, no statd. This is a major operational improvement for environments with strict firewall policies (like financial institutions).

Pseudo-filesystem: NFSv4 does not use the mount protocol. Instead, the server presents a virtual filesystem tree (pseudo-filesystem) that maps exported directories into a unified namespace. Clients connect to the server's root (/) and navigate to the export via the pseudo-filesystem path.

NFSv4 Pseudo-Filesystem
==========================

Server exports:
  /data/project-a    -> available at /project-a
  /data/project-b    -> available at /project-b
  /backup/2024       -> available at /backup/2024

Pseudo-filesystem (as seen by client):
  /                        (pseudo-root, not a real directory)
  +-- project-a/           (real export: /data/project-a)
  +-- project-b/           (real export: /data/project-b)
  +-- backup/
       +-- 2024/           (real export: /backup/2024)

Client mount:
  mount -t nfs4 server:/project-a /mnt/project-a
  (no portmapper query, no mountd call, just TCP 2049)

Referrals: The pseudo-filesystem can contain referrals that redirect clients to different servers. This enables a federated namespace spanning multiple NFS servers without client-side configuration changes.

Security Model

NFSv4 replaces AUTH_SYS with RPCSEC_GSS (RFC 2203), which provides:

  Security flavor   Authentication     Integrity          Encryption        Performance impact
  ----------------  -----------------  -----------------  ----------------  ------------------
  AUTH_SYS          None (trust UID)   None               None              Baseline
  krb5              Kerberos ticket    None               None              ~5-10% overhead
  krb5i             Kerberos ticket    Per-message HMAC   None              ~15-25% overhead
  krb5p             Kerberos ticket    Per-message HMAC   AES encryption    ~30-50% overhead

For a financial institution, krb5i or krb5p should be mandatory for any NFS export carrying regulated data. The performance overhead is significant but unavoidable for compliance.
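
On the client side this is just a mount option; a sketch that assumes the host is joined to the Kerberos realm, has a keytab, and runs rpc.gssd (server and export names are placeholders):

# NFSv4.1 with Kerberos integrity (krb5i) or privacy (krb5p)
mount -t nfs4 -o vers=4.1,sec=krb5i nas01.example.com:/project-a /mnt/project-a
mount -t nfs4 -o vers=4.1,sec=krb5p nas01.example.com:/finance   /mnt/finance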

ACLs: NFSv4 defines its own ACL model (not POSIX ACLs) based on the Windows/NT ACL model. NFSv4 ACLs support ALLOW and DENY entries, inheritance, and a richer set of permissions than POSIX mode bits. This is important for interoperability with Windows environments (the same ACL semantics map to NTFS).

pNFS (Parallel NFS)

pNFS (NFSv4.1, RFC 5661) separates metadata operations from data operations, allowing clients to perform data I/O directly to storage devices in parallel, bypassing the NFS server as a data bottleneck:

pNFS Architecture
===================

Without pNFS (traditional NFS):
  Client --> NFS Server --> Storage
  (all data flows through the server -- bottleneck)

With pNFS:
  Client --metadata--> MDS (Metadata Server)
  Client --data------> DS1 (Data Server / Storage Device)
  Client --data------> DS2 (Data Server / Storage Device)
  Client --data------> DS3 (Data Server / Storage Device)

  1. Client requests layout from MDS
  2. MDS returns layout describing where data blocks reside
  3. Client performs I/O directly to data servers
  4. Client commits/returns layout to MDS

Layout types:
  - Files:     Data striped across NFS data servers (NetApp, Linux knfsd)
  - Blocks:    Data on block devices (SAN LUNs accessible to client)
  - Objects:   Data in object storage (Panasas, rare)
  - SCSI:      Data on SCSI LUNs (SPC-4 based)
  - FlexFiles: Enhanced file layout (NetApp ONTAP, most common in practice)

pNFS is important for high-throughput workloads (media, genomics, HPC) but is less relevant for typical VM storage where block protocols dominate.

NFSv4.1 and NFSv4.2 Enhancements

NFSv4.1 (RFC 5661):

  - Sessions with exactly-once semantics, making retransmission of non-idempotent operations safe
  - pNFS layouts (described above)
  - Callback (backchannel) traffic multiplexed over the client's existing connection -- no server-initiated TCP connection required
  - Directory delegations

NFSv4.2 (RFC 7862):

  - Server-side copy, so clone/copy operations avoid a data round trip through the client
  - Sparse-file support (SEEK, ALLOCATE/DEALLOCATE) and space reservation
  - Application I/O hints (IO_ADVISE)
  - Labeled NFS (carries SELinux/MAC security labels)

NFSv4 in the Candidate Platforms

OVE: The Kubernetes NFS CSI driver supports NFSv4.x for mounting external NFS exports as Persistent Volumes. CephFS (part of ODF) uses its own protocol (not NFS) for file access, but the Ceph NFS-Ganesha gateway can export CephFS directories via NFSv4. This is primarily used for shared filesystem access by VMs, not for boot disks (which use Ceph RBD).

Azure Local: Windows Server supports NFSv4.1 as a client (mounting external NFS shares) but the native VM storage path is SMB3/S2D. NFS is relevant when Azure Local VMs need to access existing NFS infrastructure (e.g., NAS appliances shared with Linux workloads).

Swisscom ESC: NFS access is available as a managed service option. The protocol version (v3 vs v4) depends on the underlying NAS infrastructure (typically Dell PowerStore NAS or Isilon). The customer consumes NFS exports without managing the server configuration.


7. SMB / CIFS

Protocol Evolution

SMB (Server Message Block) is the file sharing protocol native to Windows environments. Understanding its evolution is critical because the version determines security posture, performance capabilities, and protocol compatibility.

SMB Version Timeline
======================

SMB1 / CIFS (1983-2006):
  - Original protocol, designed for LAN Manager / Windows for Workgroups
  - Chatty, insecure, single-threaded
  - Vulnerable: WannaCry (EternalBlue, MS17-010) exploited SMBv1
  - STATUS: Deprecated. MUST be disabled. Removed from Windows 11 24H2+.
  - Financial institutions: SMBv1 should be blocked at the network level.

SMB2.0 (Windows Vista / 2006):
  - Reduced chattiness (compound commands, pipelining)
  - Larger reads/writes (up to 1 MB vs 64 KB)
  - Improved caching (oplock/lease model)
  - Durable handles (survive brief disconnects)

SMB2.1 (Windows 7 / 2009):
  - Large MTU support (up to 1 MB per SMB packet)
  - Client oplock leasing
  - BranchCache support (WAN optimization)

SMB3.0 (Windows 8 / Server 2012):
  - Multichannel (aggregate multiple NICs)
  - SMB Direct (RDMA -- zero-copy, kernel bypass)
  - SMB Encryption (AES-128-CCM)
  - Transparent failover (continuously available shares)
  - VSS remote snapshots
  - Scale-out file server (SOFS) support

SMB3.02 (Windows 8.1 / Server 2012 R2):
  - SMB1 can be fully disabled
  - Improved SOFS performance

SMB3.1.1 (Windows 10 / Server 2016+):
  - AES-128-GCM encryption (faster than CCM)
  - Pre-authentication integrity (SHA-512 hash chain)
  - Cluster dialect fencing
  - AES-256-CCM/GCM (Windows Server 2022+)

SMB3 Key Features for Enterprise Storage

Multichannel: SMB3 can detect multiple network interfaces between client and server and automatically establish parallel connections across them. This provides both bandwidth aggregation and NIC-level failover without any multipath software.

SMB3 Multichannel
===================

Client                        Server
+--------+                    +--------+
| NIC 1  |----Connection 1--->| NIC 1  |
| 25 GbE |                    | 25 GbE |
+--------+                    +--------+
| NIC 2  |----Connection 2--->| NIC 2  |
| 25 GbE |                    | 25 GbE |
+--------+                    +--------+

Total bandwidth: 50 Gbps (aggregate)
If NIC 1 fails: all I/O shifts to NIC 2 (automatic)
No multipath daemon, no dm-mpath, no configuration.

SMB Direct (RDMA): When both client and server have RDMA-capable NICs (RoCE v2 or iWARP), SMB3 can perform direct memory-to-memory data transfers that bypass the TCP/IP stack and the kernel entirely. This is the transport used by Storage Spaces Direct for intra-cluster storage traffic.

Performance comparison (single session, 4K random read):

  Transport                           IOPS        Latency (avg)   CPU overhead
  ----------------------------------  ----------  --------------  -------------
  SMB3 over TCP (25 GbE)              150-250K    150-300 us      15-25% core
  SMB3 over RDMA (25 GbE RoCE v2)     500-800K    30-60 us        3-8% core
  SMB3 over RDMA (100 GbE RoCE v2)    1.0-1.5M    20-40 us        2-5% core

SMB Encryption: SMB3+ supports per-share or per-server encryption using AES-128-CCM (SMB 3.0/3.0.2) or AES-GCM (SMB 3.1.1+, with AES-256-GCM on Windows Server 2022+). Encryption is negotiated during session setup and applies to all data in transit. Unlike NFSv4 krb5p, which encrypts at the RPC level, SMB encryption operates at the SMB transport level and has lower performance overhead (5-15% vs 30-50% for krb5p).
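
Both encryption and multichannel can also be exercised from a Linux client via the kernel cifs module; a sketch assuming a recent kernel with multichannel support (share, credentials file, and mount point are placeholders):

# SMB 3.1.1 mount with encryption (seal) and multichannel from a Linux client
mount -t cifs -o vers=3.1.1,seal,multichannel,credentials=/etc/smb-cred \
  //fileserver.example.com/data /mnt/data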

Continuously Available (CA) shares: SMB3 supports transparent failover between clustered file servers. If the node serving a share fails, the client's SMB session seamlessly moves to another cluster node. Open file handles, locks, and oplock leases are preserved. This is the foundation for Scale-Out File Server (SOFS) in Windows Server and for S2D's shared storage model.

SMB in Azure Local

SMB3 is the native storage protocol for Azure Local. Every I/O operation between compute and storage within an Azure Local / S2D cluster uses SMB3/RDMA:

Azure Local Storage Path
===========================

  VM (Hyper-V)
       |
       v
  Virtual Disk (VHDX file on CSV)
       |
       v
  Cluster Shared Volume (CSVFS)
       |
       v
  Storage Spaces Direct (S2D)
       |
       v
  Software Storage Bus (SSB)
       |
       v
  SMB3 Direct (RDMA)          <-- All inter-node I/O
       |
       v
  Remote node's local NVMe/SSD

  Key point: Even when a VM accesses data on a remote node,
  the I/O goes over SMB3/RDMA. This is transparent to the VM.
  The VM sees a local VHDX; CSVFS and S2D handle the remoting.

This makes SMB3/RDMA performance the single most critical factor for Azure Local storage performance. If RDMA is misconfigured (wrong MTU, PFC not enabled, ECN not tuned), the cluster falls back to SMB3/TCP with 5-10x worse latency. Verifying RDMA health is a Day-1 priority for Azure Local deployments.

Samba on Linux

Samba is the open-source implementation of the SMB protocol for Linux/UNIX systems. It enables Linux servers to serve files to Windows clients and vice versa. Samba 4.x supports SMB3.1.1 including encryption and multichannel.

Relevance to the evaluation: In a mixed OVE/Linux environment with Windows VMs that require SMB file shares, there are two options:

  1. External Windows file server or NAS appliance serving SMB3 shares. The VMs access the shares over the network.
  2. Samba running on a Linux VM or container within OVE, providing SMB3 shares to Windows VMs. This avoids a dependency on Windows infrastructure but introduces the complexity of running Samba in production (AD integration, Kerberos, CTDB for clustering).

Neither option is as integrated as the Azure Local model, where SMB3 is the native fabric protocol and Windows file sharing is a first-class capability.
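
If the Samba route (option 2 above) is chosen, the baseline hardening maps to a few smb.conf parameters; a minimal sketch with a placeholder share name and path, omitting AD and CTDB integration:

# /etc/samba/smb.conf -- minimal fragment enforcing SMB3 with encryption
[global]
    server min protocol = SMB3_11
    smb encrypt = required

[vm-share]
    path = /srv/smb/vm-share
    read only = no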

When SMB Matters in the Evaluation

SMB matters in three scenarios:

  1. Windows VM workloads: Windows VMs that access shared file servers (DFS namespaces, department shares, application data) expect SMB. This is a workload requirement, not a platform requirement -- it applies to all candidates.

  2. Azure Local platform internals: SMB3/RDMA is the storage fabric protocol for S2D. Its performance directly determines Azure Local's storage performance. Understanding SMB3 internals is therefore essential for Azure Local evaluation.

  3. Migration of existing Windows file services: If the current VMware environment hosts Windows file servers that serve SMB shares to the enterprise, these file servers must continue operating on the target platform. On OVE, they run as KubeVirt VMs; on Azure Local, they run as Hyper-V VMs or as Scale-Out File Server roles; on ESC, they run as managed VMs.


How the Candidates Handle This

Protocol Support Matrix

  iSCSI
    VMware (current):  ESXi software initiator; iSCSI datastores common
    OVE:               Via CSI drivers for external arrays; not used for ODF/Ceph internal traffic
    Azure Local:       Windows iSCSI Initiator for external SAN; not used for S2D internal traffic
    Swisscom ESC:      Available via VxBlock infrastructure (secondary to FC)

  NVMe-oF
    VMware (current):  vSphere 7.0+ supports NVMe/TCP and NVMe/FC datastores
    OVE:               Ceph NVMe-oF gateway (experimental); external arrays via CSI
    Azure Local:       NVMe/TCP support for external storage (23H2+); S2D uses local NVMe directly
    Swisscom ESC:      Dell arrays support NVMe/FC; customer has no protocol choice

  Fibre Channel
    VMware (current):  Full native support; common in enterprise VMware environments
    OVE:               Not Kubernetes-native; supported via vendor CSI drivers at the bare-metal level
    Azure Local:       Native Windows FC HBA support; usable for external SAN
    Swisscom ESC:      Primary storage transport (32G FC to Dell PowerMax/PowerStore)

  MPIO
    VMware (current):  ESXi native multipath (NMP/PSA): round-robin, fixed, MRU
    OVE:               dm-multipath at the Linux host level; NVMe native multipath for NVMe-oF
    Azure Local:       Windows MPIO (MSDSM); SMB Multichannel for S2D internal traffic
    Swisscom ESC:      Managed by Swisscom; customer has no visibility

  NFSv3
    VMware (current):  Full support; NFS datastores widely used
    OVE:               Kubernetes NFS CSI driver; CephFS via NFS-Ganesha gateway
    Azure Local:       Windows NFS client; not a primary storage path
    Swisscom ESC:      Available as managed NAS service

  NFSv4
    VMware (current):  vSphere 6.0+ supports NFSv4.1 datastores
    OVE:               Kubernetes NFS CSI driver; NFS-Ganesha gateway supports v4.1/v4.2
    Azure Local:       Windows NFSv4.1 client (Server 2022+)
    Swisscom ESC:      Available as managed NAS service

  SMB3
    VMware (current):  Not used for datastores; only for Windows VM guest access
    OVE:               Samba on Linux or an external Windows file server
    Azure Local:       Native fabric protocol (SMB3/RDMA for S2D); first-class citizen
    Swisscom ESC:      Available for Windows VM guest access

Platform-Specific Protocol Analysis

OVE -- Protocol-Independent by Design: OVE's storage architecture is deliberately protocol-agnostic at the Kubernetes layer. VMs consume storage through the CSI interface, which abstracts the underlying protocol. ODF/Ceph communicates internally using its own RADOS protocol (a custom binary protocol over TCP or RDMA/msgr2), not iSCSI, NVMe-oF, or FC. When external storage is consumed, the CSI driver (NetApp Trident, Pure CSI, Dell CSI) handles the protocol negotiation. This means the OVE operations team needs protocol expertise only when integrating external storage arrays -- the internal storage path is Ceph-specific and does not use any of the traditional storage protocols.

Azure Local -- SMB3 as the Backbone: Azure Local's architecture is built entirely on SMB3. The Software Storage Bus uses SMB3/RDMA for all inter-node storage traffic. CSV (Cluster Shared Volumes) uses SMB3 for redirected I/O when the coordinating node is not the same as the owner node. This deep integration means that SMB3/RDMA performance is Azure Local's performance. The operations team must understand SMB3 multichannel, RDMA configuration (RoCE v2, PFC, DCBX), and the Windows SMB client/server architecture. For external storage, Azure Local can consume iSCSI, FC, and NVMe/TCP, but S2D is the primary and recommended storage model.

Swisscom ESC -- FC Behind the Curtain: ESC's storage transport is Fibre Channel between VxBlock compute and PowerMax/PowerStore arrays. The customer never interacts with FC, MPIO, or any storage protocol directly. NFS and SMB are available as managed add-on services for file-level access. The protocol choice is Swisscom's operational decision, and changes (e.g., migration from FC to NVMe/FC) would happen transparently. This is the trade-off of a managed service: zero protocol operational burden, but also zero protocol optimization capability.


Key Takeaways

  1. Protocol choice is a platform choice, not a storage choice. Selecting OVE means accepting Ceph RADOS as the internal storage protocol (with CSI as the abstraction layer). Selecting Azure Local means committing to SMB3/RDMA as the storage fabric. Selecting ESC means accepting FC as the transport (managed by Swisscom). The traditional freedom to choose between iSCSI, FC, and NFS for datastore connectivity (as in VMware) does not exist in any of the candidates -- the protocol is architecturally determined.

  2. NVMe-oF is the future but not the present for any candidate. None of the three candidates use NVMe-oF as their primary internal storage protocol today. OVE uses RADOS, Azure Local uses SMB3/RDMA, ESC uses FC. NVMe-oF is relevant for consuming external storage arrays and will likely become the standard for external connectivity within 2-3 years. Ensure your network infrastructure (RDMA-capable NICs, lossless Ethernet) can support NVMe-oF when it matures.

  3. RDMA is the performance differentiator. Whether it is SMB Direct (Azure Local) or NVMe/RDMA (future external storage), RDMA-capable networking delivers 5-10x latency reduction compared to TCP-based protocols. Any new hardware procurement should specify RDMA-capable NICs (Mellanox/NVIDIA ConnectX-6 or later) and switches that support DCB (PFC, ECN, DCBX). This is a one-time infrastructure investment that benefits all three candidates.

  4. Fibre Channel is a Swisscom ESC dependency, not a platform requirement. If you choose OVE or Azure Local with HCI storage (ODF or S2D), FC is not needed for primary storage. FC remains relevant only if you choose to consume external SAN arrays alongside HCI. Given the operational cost and skill specialization of FC, HCI-native storage (eliminating FC) is one of the clearest cost-reduction opportunities in this migration.

  5. NFSv4 with Kerberos should be the standard for file shares. For any NFS-based file sharing (configuration repositories, shared data, inter-application communication), mandate NFSv4 with krb5i or krb5p. NFSv3 with AUTH_SYS is not acceptable for a financial institution, regardless of network segmentation. The performance overhead of Kerberos is the cost of doing business in a regulated environment. A simple host-side check for this, shared with takeaway 7, is sketched after this list.

  6. SMB expertise is non-negotiable for Azure Local. If Azure Local is selected, the operations team must develop deep SMB3/RDMA expertise -- this is not just a "Windows file share" protocol, it is the storage fabric. RDMA misconfiguration (PFC errors, ECN tuning, incorrect MTU) will directly impact every VM's storage performance. This is analogous to understanding vSAN's RDT protocol in the current VMware environment.

  7. Multipath design must match the protocol. iSCSI and FC use dm-multipath with ALUA. NVMe-oF uses NVMe native multipath with ANA. SMB3 uses built-in multichannel. Mixing multipath approaches or using the wrong mechanism (e.g., dm-multipath for NVMe) introduces unnecessary complexity and may miss failover events. Define the multipath standard for each protocol before the PoC; the sketch after this list shows a basic host-side verification of both this takeaway and takeaway 5.

  8. The iSCSI performance tax is real but manageable. iSCSI on modern 25/100 GbE with jumbo frames and dedicated NICs delivers adequate performance for the vast majority of enterprise workloads. The 5-10x latency advantage of NVMe/RDMA matters only for latency-sensitive tier-1 databases and trading systems. Do not over-engineer the storage network for NVMe/RDMA unless a measurable workload justifies it.
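
Takeaways 5 and 7 translate directly into host-level checks. The following is a minimal sketch, assuming a Linux host with multipath-tools and nvme-cli installed, that flags NFS mounts lacking NFSv4 with Kerberos and prints the multipath state for SCSI and NVMe paths. It is illustrative, not a production compliance tool.

```python
#!/usr/bin/env python3
"""Hedged sketch of a host-level check for takeaways 5 and 7: (a) NFS mounts
should be NFSv4 with Kerberos integrity or privacy, and (b) block protocols
should use the intended multipath mechanism (dm-multipath/ALUA for iSCSI and
FC, NVMe native multipath/ANA for NVMe-oF). Assumes a Linux host with
multipath-tools and nvme-cli installed; illustrative, not a compliance tool."""
import re
import subprocess

def run(cmd: list[str]) -> str:
    """Run a command and return stdout, or a placeholder string on failure."""
    try:
        return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
    except (OSError, subprocess.CalledProcessError) as exc:
        return f"<could not run {' '.join(cmd)}: {exc}>"

def check_nfs_mounts(mounts_path: str = "/proc/mounts") -> None:
    """Flag NFS mounts that are not NFSv4 with sec=krb5i or sec=krb5p."""
    with open(mounts_path) as fh:
        for line in fh:
            _device, mountpoint, fstype, options = line.split()[:4]
            if not fstype.startswith("nfs"):
                continue
            match = re.search(r"sec=(\w+)", options)
            sec_flavor = match.group(1) if match else "sys (default)"
            compliant = fstype == "nfs4" and sec_flavor in ("krb5i", "krb5p")
            print(f"[{'OK' if compliant else 'REVIEW'}] {mountpoint}: "
                  f"{fstype}, sec={sec_flavor}")

def check_multipath() -> None:
    """Print dm-multipath maps (iSCSI/FC) and NVMe subsystems (native multipath)."""
    dm_maps = run(["multipath", "-ll"])
    print("== dm-multipath maps (expected for iSCSI / FC) ==")
    print(dm_maps)
    print("== NVMe subsystems (expected to use native multipath / ANA) ==")
    print(run(["nvme", "list-subsys"]))
    if "nvme" in dm_maps.lower():
        print("WARNING: NVMe devices appear under dm-multipath -- verify that "
              "NVMe native multipath is the intended mechanism on this host.")

if __name__ == "__main__":
    check_nfs_mounts()
    check_multipath()
```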


Discussion Guide

The following questions are designed for vendor deep-dives, PoC planning sessions, and internal architecture reviews. They probe the practical implications of storage protocol choices across the candidate platforms.

Questions for All Candidates

  1. Internal storage protocol and performance ceiling: "What protocol does your platform use for internal storage traffic between compute and storage components? What is the maximum measured single-VM IOPS and p99 latency on this internal protocol at 80% cluster utilization? We need these numbers from a real benchmark on hardware comparable to our deployment size, not from a datasheet."

  2. RDMA readiness and fallback behavior: "Does your platform support RDMA for storage traffic? If RDMA is enabled and a transient PFC/ECN misconfiguration causes RDMA failures, does the platform fall back to TCP gracefully, or does it fail hard? How do we monitor RDMA health proactively (counter thresholds for PFC pause frames, RoCE retransmissions, ECN-marked packets)?" A minimal host-side counter check is sketched after this question list.

  3. Multipath failover timing: "Walk us through the exact failover sequence when a storage path fails. What is the measured time from cable disconnection to I/O resumption on the surviving path? What I/Os are in-flight during failover -- are they retried or failed back to the application? What is the impact on VM-visible latency during a failover event?"

  4. Protocol upgrade path: "Our current infrastructure uses iSCSI/FC for external storage connectivity. What is the migration path to NVMe-oF (TCP or RDMA) for consuming external storage arrays? Can both protocols coexist during a transition period? What hardware (NICs, switches, HBAs) needs to change?"

  5. Encryption in transit: "How is storage traffic encrypted between compute and storage nodes? Is encryption enabled by default? What cipher and key length are used? What is the measured performance overhead of encryption on storage I/O? For regulatory compliance, we require encryption of all data in transit -- confirm that this is achieved for all storage protocol paths, including inter-node replication traffic."
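
Question 2 above asks how RDMA health should be monitored proactively. As a host-side starting point on Linux, the following hedged sketch pulls the relevant NIC counters via ethtool -S; the counter-name substrings are assumptions and must be adapted to the NIC vendor's driver naming.

```python
#!/usr/bin/env python3
"""Hedged sketch: pull NIC statistics that commonly indicate RDMA/lossless
Ethernet trouble (PFC pause frames, ECN/CNP activity, discards). Counter names
are driver-specific, so the substrings below are assumptions to adapt to the
NIC vendor's naming."""
import subprocess
import sys

# Substrings that typically identify the interesting counters (assumed; adjust per driver).
INTERESTING = ("pause", "ecn", "cnp", "discard", "out_of_buffer")

def nic_counters(interface: str) -> dict[str, int]:
    """Return `ethtool -S <interface>` statistics as a name -> value mapping."""
    out = subprocess.run(
        ["ethtool", "-S", interface], capture_output=True, text=True, check=True,
    ).stdout
    counters: dict[str, int] = {}
    for line in out.splitlines():
        name, _, value = line.partition(":")
        try:
            counters[name.strip()] = int(value.strip())
        except ValueError:
            continue  # skip headers and non-numeric lines
    return counters

if __name__ == "__main__":
    iface = sys.argv[1] if len(sys.argv) > 1 else "eth0"
    for name, value in sorted(nic_counters(iface).items()):
        if any(key in name.lower() for key in INTERESTING):
            print(f"{name:45s} {value}")
```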

Questions Specific to OVE

  1. Ceph RADOS protocol security: "Ceph uses its own RADOS protocol (msgr2) for internal cluster communication. Is msgr2 encryption (cephx + on-wire encryption) enabled by default in ODF? What is the performance impact? Can we enforce mutual authentication between OSDs and clients to prevent a compromised node from reading another tenant's data?"

  2. External storage protocol support via CSI: "If we connect a NetApp/Pure/Dell array via CSI, which protocol does the CSI driver use -- iSCSI, FC, or NVMe-oF? Can we influence the protocol selection? What multipath mechanism is used at the node level for CSI-attached external volumes, and how is it configured (dm-multipath vs NVMe native)?"

Questions Specific to Azure Local

  1. SMB3/RDMA health verification: "Show us how to verify that RDMA is active and healthy on all cluster nodes. What PowerShell cmdlets or monitoring tools report RDMA connection status, PFC statistics, and ECN marking rates? What are the alert thresholds for RDMA degradation that should trigger proactive intervention?"

  2. SMB Direct fallback behavior: "If RDMA fails on a subset of NICs (e.g., due to firmware bug or cable issue), does the Software Storage Bus fall back to SMB3/TCP for those paths? Is this fallback automatic and transparent? What is the latency impact, and how quickly does the system recover when RDMA is restored?"

  3. SMB encryption for compliance: "Is SMB encryption (AES-256-GCM) enabled by default for S2D inter-node traffic? If not, what is the performance overhead of enabling it? Does enabling encryption conflict with SMB Direct/RDMA (since RDMA bypasses the kernel where encryption would typically be applied)?"

Questions Specific to Swisscom ESC

  1. Storage protocol transparency: "Which storage protocol connects our VMs to the underlying storage? Is it FC, iSCSI, or NVMe/FC? Can we see storage path status (active/standby, failover events, latency per path) in the self-service portal? If a path failover occurs, is it logged and visible to us, or only to Swisscom's operations team?"

  2. Protocol evolution roadmap: "Is Swisscom planning to migrate the ESC storage backend from FC to NVMe/FC or NVMe/TCP? If so, what is the timeline, and will this migration be transparent to tenants? Will there be a performance improvement, and how will it be reflected in SLA terms?"

Architecture-Level Questions (for Internal Discussion)

  1. Protocol skills assessment: "Which storage protocols does our current operations team have hands-on experience with? FC? iSCSI? SMB3? NFS? Do we have RDMA networking experience? The skill gap between our current VMware/FC/iSCSI world and the target platform's protocol stack determines the training investment and the risk of early-lifecycle incidents."

  2. Converged vs dedicated storage network: "In the target architecture, should storage traffic run on a dedicated physical network (separate NICs, separate switches) or on a converged network with QoS/VLAN separation? What is the cost delta? What is the risk delta? For NVMe/RDMA or SMB Direct, lossless Ethernet configuration (PFC, ECN) on a converged network is complex -- is a dedicated storage network simpler and safer even if more expensive?"

  3. Protocol standardization policy: "Should we standardize on a single storage protocol for external storage connectivity (e.g., NVMe/TCP for all external array access), or allow protocol diversity (iSCSI for legacy, NVMe/TCP for new)? Standardization reduces operational complexity but may require hardware refresh. What is the 5-year total cost of each approach?"


Next: 04-storage-architectures.md -- Storage Architectures (SAN, NAS, HCI / Software-Defined Storage)