Storage Protocols
Why This Matters
Storage protocols are the wire-level languages through which servers and storage systems communicate. In a VMware environment, the protocol choice was often hidden behind VMFS and vSAN abstractions -- the hypervisor handled the details. When migrating to OVE, Azure Local, or Swisscom ESC, the protocol layer becomes explicit and consequential: each candidate platform supports a different subset of protocols, with different performance ceilings, different operational complexity, and different hardware prerequisites.
For a Tier-1 financial enterprise running 5,000+ VMs, protocol selection has three concrete impacts:
- Performance ceiling. The difference between iSCSI over 25 GbE and NVMe-oF over 100 GbE RDMA is not incremental -- it can be 10x in IOPS and 5x in latency. Choosing the wrong protocol for latency-sensitive workloads (databases, trading systems, real-time analytics) means leaving performance on the table that no amount of tuning can recover.
- Infrastructure dependencies. Fibre Channel requires dedicated HBAs and FC switches. NVMe/RDMA requires lossless Ethernet (DCB/PFC/ECN). iSCSI runs on commodity Ethernet but performs best with jumbo frames and dedicated NICs. Each protocol choice has a bill of materials and a network design consequence.
- Operational model. NFS and SMB are file-level protocols that simplify shared access but introduce locking complexity. Block protocols (iSCSI, FC, NVMe-oF) deliver raw performance but limit sharing to one writer per LUN without cluster filesystems. The operational team needs to understand what they are operating, not just what the vendor configured.
This page dissects each protocol at the wire level -- PDU structures, queue models, frame formats -- so the evaluation team can ask precise questions and make informed trade-offs during PoC planning.
Concepts
1. iSCSI (Internet Small Computer Systems Interface)
Protocol Architecture
iSCSI encapsulates SCSI commands inside TCP/IP packets, enabling block storage access over standard Ethernet networks. It was standardized as RFC 7143 (consolidating the original RFC 3720) and has been the workhorse of IP-based SAN for two decades.
The key architectural insight is that iSCSI is a transport mapping, not a new storage protocol. The SCSI command set (READ, WRITE, INQUIRY, REPORT LUNS, etc.) remains identical to what a local SCSI disk uses -- iSCSI simply carries those commands over TCP instead of a parallel SCSI cable or Fibre Channel link.
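Because the CDB is unchanged, the point can be made concrete. A minimal Python sketch (illustrative, not taken from any initiator implementation) building the ten-byte READ(10) CDB that an iSCSI initiator places in a SCSI Command PDU -- byte-identical to what a locally attached SCSI disk receives:

```python
import struct

# READ(10) CDB layout: opcode 0x28 | flags | 4-byte LBA | group number
# | 2-byte transfer length (in blocks) | control byte.
def read10_cdb(lba: int, num_blocks: int) -> bytes:
    return struct.pack(">BBIBHB", 0x28, 0, lba, 0, num_blocks, 0)

cdb = read10_cdb(lba=2048, num_blocks=8)
print(len(cdb), cdb.hex())  # → 10 28000000080000000800
```

Whether these ten bytes travel over a parallel SCSI bus, an FC frame, or an iSCSI Data Segment is invisible to the SCSI layer on both ends.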
iSCSI Protocol Stack
======================
+----------------------------------+
| SCSI Command Layer |
| (CDB: READ_10, WRITE_10, etc.) |
+----------------------------------+
| iSCSI Layer |
| (PDU framing, session mgmt, |
| login, authentication) |
+----------------------------------+
| TCP (port 3260) |
| (reliable, ordered byte stream) |
+----------------------------------+
| IP (IPv4 or IPv6) |
+----------------------------------+
| Ethernet (1/10/25/100 GbE) |
+----------------------------------+
vs. Fibre Channel:
+----------------------------------+
| SCSI Command Layer |
+----------------------------------+
| FCP (Fibre Channel Protocol) |
+----------------------------------+
| FC-2 (framing, flow control) |
+----------------------------------+
| FC-1 (encoding: 8b/10b, 64b/66b) |
+----------------------------------+
| FC-0 (physical: SFP, fiber optic) |
+----------------------------------+
iSCSI PDU Structure
Every iSCSI exchange is carried in Protocol Data Units (PDUs). A PDU consists of up to five segments:
iSCSI PDU Layout
==================
+--------------------------------------------------+
| Basic Header Segment (BHS)              48 bytes |
|  +--------------------------------------------+  |
|  | Opcode (1 byte, incl. Immediate bit)       |  |
|  | Flags (1 byte)                             |  |
|  | Opcode-specific (2 bytes)                  |  |
|  | Total AHS Length (1 byte)                  |  |
|  | Data Segment Length (3 bytes)              |  |
|  | LUN (8 bytes)                              |  |
|  | Initiator Task Tag (4 bytes)               |  |
|  | ... (remaining fields opcode-specific)     |  |
|  +--------------------------------------------+  |
+--------------------------------------------------+
| Additional Header Segment (AHS)         variable |
|   (extended CDB, bi-directional read length)     |
+--------------------------------------------------+
| Header Digest (optional)                 4 bytes |
|   (CRC32C of BHS + AHS)                          |
+--------------------------------------------------+
| Data Segment                            variable |
|   (SCSI CDB, data, parameters, text)             |
+--------------------------------------------------+
| Data Digest (optional)                   4 bytes |
|   (CRC32C of Data Segment)                       |
+--------------------------------------------------+
| Padding to 4-byte boundary             0-3 bytes |
+--------------------------------------------------+
Key opcodes (initiator -> target):
0x01 SCSI Command (carries CDB + optional immediate data)
0x05 SCSI Data-Out (write data)
0x03 Login Request
0x04 Text Request
0x10 SNACK Request (selective retransmission)
0x1c-0x1e Vendor-specific
Key opcodes (target -> initiator):
0x21 SCSI Response (status, sense data)
0x25 SCSI Data-In (read data)
0x23 Login Response
0x31 Ready to Transfer (R2T) -- flow control for writes
0x32 Async Message (target-initiated events)
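The BHS layout above can be exercised with a short parser. A hedged Python sketch (byte offsets per RFC 7143; parse_bhs is an illustrative helper, not a real library API) that builds and unpacks a 48-byte BHS:

```python
import struct

# RFC 7143 BHS offsets: byte 0 carries the opcode (low 6 bits) plus the
# Immediate bit (0x40); byte 4 is TotalAHSLength (in 4-byte words);
# bytes 5-7 are the 24-bit DataSegmentLength; bytes 8-15 the LUN;
# bytes 16-19 the Initiator Task Tag.
def parse_bhs(bhs: bytes) -> dict:
    if len(bhs) != 48:
        raise ValueError("BHS must be exactly 48 bytes")
    return {
        "immediate": bool(bhs[0] & 0x40),
        "opcode": bhs[0] & 0x3F,
        "total_ahs_len": bhs[4] * 4,                      # stored in 4-byte words
        "data_seg_len": int.from_bytes(bhs[5:8], "big"),  # 24-bit big-endian
        "lun": bhs[8:16].hex(),
        "initiator_task_tag": struct.unpack(">I", bhs[16:20])[0],
    }

# Synthetic SCSI Command PDU header (opcode 0x01) with the Immediate
# bit set, 4 KiB of data, and task tag 0xDEADBEEF.
bhs = bytearray(48)
bhs[0] = 0x40 | 0x01
bhs[5:8] = (4096).to_bytes(3, "big")
bhs[16:20] = struct.pack(">I", 0xDEADBEEF)

pdu = parse_bhs(bytes(bhs))
print(pdu["opcode"], pdu["immediate"], pdu["data_seg_len"])  # → 1 True 4096
```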
Initiator/Target Model
iSCSI uses a strict client-server model:
- Initiator: The server (compute node) that sends SCSI commands. Runs as a kernel module (iscsi_tcp) or as hardware offload (iSCSI HBA / TCP Offload Engine).
- Target: The storage system (SAN array, software-defined storage node) that receives SCSI commands and returns data.
IQN Naming: Every iSCSI entity is identified by an iSCSI Qualified Name (IQN):
iqn.YYYY-MM.reverse.domain:unique-identifier
iqn.2024-01.com.example.dc1:storage-array-01.lun5
iqn.2024-01.com.example.dc1:initiator.esxi-host-03
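The IQN grammar above is simple enough to check mechanically. A simplified Python sketch (the regex covers only the iqn. form shown here; RFC 7143 also permits eui. and naa. naming formats):

```python
import re

# Simplified pattern for iqn.YYYY-MM.reverse.domain[:unique-identifier].
# Real validators also enforce length limits and allowed characters more
# strictly than this sketch does.
IQN_RE = re.compile(r"^iqn\.\d{4}-\d{2}\.[a-z0-9.\-]+(:[^\s]+)?$")

for name in ("iqn.2024-01.com.example.dc1:storage-array-01.lun5",
             "iqn.2024-1.com.example",      # bad: month must be two digits
             "foo.2024-01.com.example"):    # bad: wrong prefix
    print(name, "->", bool(IQN_RE.match(name)))
# → iqn.2024-01.com.example.dc1:storage-array-01.lun5 -> True
#   iqn.2024-1.com.example -> False
#   foo.2024-01.com.example -> False
```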
Discovery mechanisms:
- SendTargets: The initiator connects to a known target portal (IP:port) and requests a list of available targets. Simple but requires knowing at least one portal address.
- iSNS (Internet Storage Name Service): A centralized directory (like DNS for storage) where targets register and initiators query. More scalable but adds an infrastructure dependency. Defined in RFC 4171.
- Static configuration: Manually configured target addresses. Common in small environments.
Session and Connection Management
An iSCSI session has two phases:
1. Login Phase: Negotiates parameters (authentication, header/data digests, max burst length, max connections, initial R2T preference). Uses text-based key=value exchanges inside Login PDUs.
2. Full Feature Phase: The actual SCSI command exchange happens here. Multiple TCP connections can be aggregated into a single session (MC/S -- Multiple Connections per Session) for bandwidth aggregation and failover.
iSCSI Session Model
=====================
Initiator Target
| |
|--- Login Request (credentials) ->|
|<-- Login Response (challenge) ---|
|--- Login Request (response) ---->|
|<-- Login Response (success) -----|
| |
| === Full Feature Phase === |
| |
|--- SCSI Command (READ_10) ------>|
|<-- Data-In (read payload) -------|
|<-- SCSI Response (GOOD) ---------|
| |
|--- SCSI Command (WRITE_10) ----->|
|<-- R2T (ready to receive) -------|
|--- Data-Out (write payload) ---->|
|<-- SCSI Response (GOOD) ---------|
| |
Key session parameters negotiated at login:
MaxRecvDataSegmentLength (default 8192, typically tuned to 262144)
MaxBurstLength (max data per sequence, default 262144)
FirstBurstLength (unsolicited data before R2T, default 65536)
InitialR2T (Yes = wait for R2T; No = send immediately)
MaxOutstandingR2T (parallel write streams, default 1)
MaxConnections (per session, default 1)
DataPDUInOrder (Yes for most implementations)
HeaderDigest / DataDigest (None or CRC32C)
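Login negotiation resolves each key with a per-key result function. A hedged Python sketch of the three common rule types from RFC 7143 (minimum for numeric keys, OR for InitialR2T-style booleans, AND for ImmediateData); the function and dictionary names are illustrative:

```python
# Per RFC 7143: MaxBurstLength/FirstBurstLength/MaxRecvDataSegmentLength
# resolve to the minimum of the two offers; InitialR2T and DataPDUInOrder
# resolve with OR ("Yes" from either side wins); ImmediateData resolves
# with AND (both sides must offer "Yes").
NUMERIC_MIN = {"MaxBurstLength", "FirstBurstLength", "MaxRecvDataSegmentLength"}
BOOLEAN_OR = {"InitialR2T", "DataPDUInOrder"}
BOOLEAN_AND = {"ImmediateData"}

def negotiate(initiator: dict, target: dict) -> dict:
    result = {}
    for key, offered in initiator.items():
        answer = target[key]
        if key in NUMERIC_MIN:
            result[key] = min(int(offered), int(answer))
        elif key in BOOLEAN_OR:
            result[key] = "Yes" if "Yes" in (offered, answer) else "No"
        elif key in BOOLEAN_AND:
            result[key] = "Yes" if (offered, answer) == ("Yes", "Yes") else "No"
    return result

print(negotiate(
    {"MaxBurstLength": 1048576, "InitialR2T": "No", "ImmediateData": "Yes"},
    {"MaxBurstLength": 262144, "InitialR2T": "No", "ImmediateData": "Yes"},
))
# → {'MaxBurstLength': 262144, 'InitialR2T': 'No', 'ImmediateData': 'Yes'}
```

This is why a target with conservative defaults silently caps an initiator tuned for large bursts: the minimum always wins for the numeric keys.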
Authentication
- CHAP (Challenge Handshake Authentication Protocol): One-way authentication where the target challenges the initiator. Uses shared secrets. Prevents unauthorized initiators from connecting to targets.
- Mutual CHAP: Bidirectional -- the initiator also challenges the target. Prevents a rogue target from impersonating a legitimate storage array (important in multi-tenant environments).
- No authentication: Common in isolated storage networks. Not recommended for financial environments even on dedicated VLANs.
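The CHAP exchange reduces to a single hash computation. A sketch following RFC 1994 as used in iSCSI login (MD5 over the ID byte, the shared secret, and the challenge); variable names are illustrative:

```python
import hashlib
import os

# One-way CHAP: the target sends an ID byte and a random challenge; the
# initiator proves knowledge of the shared secret by returning
# MD5(ID || secret || challenge).  Mutual CHAP runs this twice, once in
# each direction with independent secrets.
def chap_response(chap_id: int, secret: bytes, challenge: bytes) -> bytes:
    return hashlib.md5(bytes([chap_id]) + secret + challenge).digest()

secret = b"initiator-shared-secret"
chap_id, challenge = 0x27, os.urandom(16)      # generated by the target

resp = chap_response(chap_id, secret, challenge)          # initiator side
assert resp == chap_response(chap_id, secret, challenge)  # target verifies
print(len(resp))  # → 16
```

Because the challenge is random per login, a captured response cannot be replayed; the secret itself never crosses the wire.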
Performance Considerations
iSCSI's fundamental performance limitation is TCP overhead. Every SCSI I/O requires TCP processing (segmentation, checksumming, ACKs, congestion control), which consumes CPU cycles and adds latency.
Mitigation strategies, ranked by effectiveness:
| Strategy | Latency Impact | CPU Impact | Complexity |
|---|---|---|---|
| iSER (iSCSI Extensions for RDMA) | -60-80% (bypasses TCP entirely) | -90% (zero-copy) | High (requires RDMA NICs, lossless fabric) |
| TCP Offload Engine (TOE) | -20-40% | -60-80% | Medium (specialized NIC) |
| Jumbo Frames (MTU 9000) | -5-15% | -10-20% | Low (switch + NIC config) |
| Dedicated storage network | Eliminates contention | N/A | Medium (separate NICs/VLANs) |
| Multiple sessions (MC/S) | Increases throughput | Slight increase | Low |
| Interrupt coalescing | Trades latency for CPU | -30-50% | Low (NIC tuning) |
Typical performance numbers (single-target, 4K random read, queue depth 32):
| Configuration | IOPS | Latency (avg) | CPU overhead |
|---|---|---|---|
| iSCSI / 10 GbE / software initiator | 100-150K | 200-400 us | 15-25% of one core |
| iSCSI / 25 GbE / software initiator | 200-300K | 150-300 us | 20-30% of one core |
| iSCSI / 25 GbE / TOE | 250-350K | 100-200 us | 5-10% of one core |
| iSER / 25 GbE RDMA | 400-600K | 50-100 us | 2-5% of one core |
Linux Initiator Stack
The standard Linux iSCSI implementation is open-iscsi, consisting of:
- iscsid: User-space daemon managing sessions and error recovery.
- iscsiadm: CLI for discovery, login, logout, and session management.
- iscsi_tcp (kernel module): The actual TCP transport and PDU processing.
- /etc/iscsi/initiatorname.iscsi: File containing the node's IQN.
- /etc/iscsi/iscsid.conf: Global configuration (timeouts, CHAP credentials, replacement_timeout).
Key operational commands:
# Discover targets on a portal
iscsiadm -m discovery -t sendtargets -p 10.0.100.1:3260
# Login to a specific target
iscsiadm -m node -T iqn.2024-01.com.storage:array01 -p 10.0.100.1 --login
# Show active sessions
iscsiadm -m session -P 3
# Set CHAP credentials
iscsiadm -m node -T iqn.2024-01.com.storage:array01 \
--op update -n node.session.auth.authmethod -v CHAP
iscsiadm -m node -T iqn.2024-01.com.storage:array01 \
--op update -n node.session.auth.username -v initiator01
iscsiadm -m node -T iqn.2024-01.com.storage:array01 \
--op update -n node.session.auth.password -v <secret>
# Set replacement_timeout (seconds before failing I/O after path loss)
iscsiadm -m node -T iqn.2024-01.com.storage:array01 \
--op update -n node.session.timeo.replacement_timeout -v 20
The critical parameter for VM workloads is node.session.timeo.replacement_timeout. This determines how long the initiator waits for a failed session to reconnect before failing I/Os back to the upper layers. The default of 120 seconds is far too long for production VMs -- it means I/O hangs for 2 minutes on a path failure. Financial environments typically set this to 5-20 seconds and rely on multipath for redundancy.
iSCSI in the Candidate Platforms
OVE (OpenShift Virtualization Engine): ODF/Ceph does not expose storage to VMs via iSCSI natively. Ceph RBD images are attached to VMs through the CSI driver, which maps RBD directly into the QEMU process using librbd (a user-space library, not a kernel iSCSI initiator). However, external iSCSI targets (NetApp, Pure Storage, Dell PowerStore) can be consumed via third-party CSI drivers that use iSCSI under the hood. The CSI driver manages the iSCSI session lifecycle transparently.
Azure Local: S2D does not use iSCSI for internal cluster storage. All intra-cluster I/O uses the SMB3/RDMA (Software Storage Bus) path. However, Azure Local nodes can attach external iSCSI targets using the Windows iSCSI Initiator (built into Windows Server). This is relevant for consuming existing SAN infrastructure during a migration transition period.
Swisscom ESC: The underlying Dell VxBlock infrastructure uses Fibre Channel to connect compute nodes to PowerMax/PowerStore arrays. iSCSI is not the primary protocol but may be available for specific workloads on request. The customer has no visibility into or control over the transport protocol.
2. NVMe-oF (Non-Volatile Memory Express over Fabrics)
Why NVMe Needed a Fabric Extension
NVMe was designed from scratch for flash storage, replacing the SCSI command set with a protocol optimized for parallelism and low latency. The key difference is the queue model:
SCSI vs NVMe Queue Model
===========================
SCSI (iSCSI, FC):
Single command queue, single completion queue
Queue depth: typically 32-256 per LUN
Commands processed serially by storage controller
Application
|
v
+------------------+
| Single Queue | max ~256 outstanding commands
| cmd1, cmd2, ... |
+------------------+
|
v
Storage Controller (processes commands one-by-one from queue)
NVMe:
Up to 65,535 Submission Queues (SQ), each with up to 65,536 entries
Up to 65,535 Completion Queues (CQ)
Per-CPU queue pairs -- no locking, no contention
CPU 0 CPU 1 CPU 2 CPU N
| | | |
v v v v
+------+ +------+ +------+ +------+
| SQ 1 | | SQ 2 | | SQ 3 | | SQ N |
| CQ 1 | | CQ 2 | | CQ 3 | | CQ N |
+------+ +------+ +------+ +------+
| | | |
+------+-------+------+------+------+-------+
| | |
NVMe Controller (processes all queues in parallel)
Result:
- No lock contention between CPUs
- Direct doorbell register writes (MMIO) instead of interrupts
- Fixed 64-byte command entries vs variable 6-32 byte SCSI CDBs
- Completion via MSI-X interrupt per CQ (one per CPU)
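The per-CPU queue-pair model can be sketched in a few lines. This toy Python model (deques standing in for DMA-visible rings; method calls standing in for MMIO doorbells and interrupts) shows the submit/complete cycle that each CPU runs independently:

```python
from collections import deque

# Toy model of one NVMe submission/completion queue pair.  Real queues
# are fixed-size rings in host memory with MMIO doorbell registers.
class QueuePair:
    def __init__(self, qid: int, depth: int = 1024):
        self.qid, self.depth = qid, depth
        self.sq, self.cq = deque(), deque()

    def submit(self, command_id: int, opcode: str):
        # Host writes an SQ entry, then rings the SQ doorbell
        if len(self.sq) >= self.depth:
            raise RuntimeError("submission queue full")
        self.sq.append((command_id, opcode))

    def controller_poll(self):
        # Controller fetches commands (DMA), executes, posts completions
        while self.sq:
            cid, _opcode = self.sq.popleft()
            self.cq.append((cid, "SUCCESS"))

    def reap(self):
        # Host consumes completions, then rings the CQ doorbell
        done = list(self.cq)
        self.cq.clear()
        return done

# One queue pair per CPU: no cross-CPU locking is ever needed.
qps = [QueuePair(qid) for qid in range(4)]
qps[0].submit(1, "READ")
qps[0].submit(2, "WRITE")
qps[0].controller_poll()
print(qps[0].reap())  # → [(1, 'SUCCESS'), (2, 'SUCCESS')]
```

The SCSI model in the diagram above corresponds to all CPUs contending for a single QueuePair instance behind a lock; NVMe gives each CPU its own.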
When NVMe drives are local (PCIe-attached), they deliver 500K-1M+ IOPS at 10-20 us latency. The challenge was extending this performance across the network to shared storage arrays without reintroducing the bottlenecks that SCSI/iSCSI created. NVMe-oF (NVM Express over Fabrics; specification 1.0, June 2016) solves this by defining transport bindings that carry the NVMe command set over network fabrics with minimal protocol overhead.
NVMe Architecture Internals
NVMe Command Flow (Local PCIe)
=================================
1. Host writes command to Submission Queue (SQ) in host memory
2. Host writes SQ doorbell (MMIO register on NVMe controller)
3. Controller fetches command from SQ via PCIe DMA
4. Controller executes command (read/write to flash)
5. Controller writes completion entry to Completion Queue (CQ)
6. Controller generates MSI-X interrupt to host CPU
7. Host processes completion entry, writes CQ doorbell
+--Host Memory-----------+ +--NVMe Controller--+
| | | |
| Submission Queue (SQ) |<--DMA--| Fetch commands |
| [cmd][cmd][cmd][...] | | |
| | | |
| Completion Queue (CQ) |--DMA-->| Post completions |
| [cpl][cpl][cpl][...] | | |
| | | SQ Doorbell (MMIO) |
| Data Buffers |<--DMA--| Data transfer |
| [page][page][page] | | |
+-------------------------+ +--------------------+
Key data structures:
- Namespace: a logical partition of NVMe storage (like a LUN)
- NQN (NVMe Qualified Name): identifier for subsystems
nqn.2014-08.org.nvmexpress:uuid:<UUID>
- Subsystem: collection of controllers and namespaces
- Controller: a physical or virtual interface to a subsystem
Transport Bindings
NVMe-oF defines three transport bindings, each with different trade-offs:
NVMe/TCP (NVMe over TCP, TP8000):
- Runs on standard Ethernet, no special NIC or switch requirements
- Uses TCP for reliable delivery (like iSCSI, but carrying NVMe commands instead of SCSI)
- Higher latency than RDMA (TCP processing overhead), but dramatically simpler deployment
- Typical latency: 100-200 us (vs iSCSI 200-400 us for equivalent hardware)
- Supported in Linux kernel since 5.0 (2019)
NVMe/RDMA (NVMe over RDMA):
- Uses RDMA (Remote Direct Memory Access) to bypass the kernel network stack entirely
- Two RDMA technologies: RoCE v2 (RDMA over Converged Ethernet, runs on standard Ethernet with DCB) and iWARP (runs on standard TCP/IP, less common)
- Data is transferred directly between application memory buffers -- zero CPU copy
- Typical latency: 20-50 us (approaching local NVMe latency of 10-20 us)
- Requires: RDMA-capable NICs (ConnectX-6/7), lossless Ethernet fabric (PFC + ECN for RoCE v2)
NVMe/FC (NVMe over Fibre Channel):
- Carries NVMe commands over existing FC infrastructure (32/64G FC)
- Leverages existing FC switch investment and operational expertise
- Typical latency: 30-80 us (better than iSCSI, slightly higher than NVMe/RDMA due to FC framing)
- Requires: FC HBAs with NVMe support, FC switches with NVMe/FC firmware
NVMe-oF Transport Comparison
===============================
| Attribute | NVMe/TCP | NVMe/RDMA (RoCE v2) | NVMe/FC |
|---|---|---|---|
| Network | Standard Ethernet | Lossless Ethernet (PFC/ECN/DCB) | Fibre Channel 32/64G FC |
| NIC | Standard 25/100 GbE | RDMA NIC (Mellanox CX-6/7) | FC HBA (NVMe-capable) |
| CPU overhead | Medium (TCP stack) | Very low (kernel bypass) | Low (FC offload) |
| Latency (4K random read) | 100-200 us | 20-50 us | 30-80 us |
| IOPS (single target, QD=32) | 300-500K | 800K-1.5M | 500K-1M |
| Deployment complexity | Low (standard network) | High (lossless fabric tuning) | Medium (FC zoning, HBA firmware) |
| Existing infra reuse | Reuses all Ethernet | Reuses Ethernet (may need switch upgrade for PFC) | Reuses FC fabric |
Discovery Service
NVMe-oF uses a dedicated discovery mechanism. Each storage fabric exposes one or more discovery controllers that respond to nvme discover commands and return a list of available subsystems, transport types, and addresses.
NVMe-oF Discovery Flow
=========================
Initiator Host Discovery Controller Storage Target
| | |
|-- nvme discover (traddr, -t) ---->| |
| | |
|<-- Discovery Log Page ------------| |
| Entry 1: nqn=nqn.xxx, |
| trtype=tcp, |
| traddr=10.0.100.1, |
| trsvcid=4420 |
| Entry 2: nqn=nqn.yyy, |
| trtype=rdma, |
| traddr=10.0.100.2, |
| trsvcid=4420 |
| |
|-- nvme connect (nqn=nqn.xxx, trtype=tcp, traddr=...) ------>|
|<-- Controller ready, namespace(s) available -----------------|
| |
| /dev/nvmeXnY appears (block device ready for I/O) |
Referrals: A discovery controller can point the initiator to other
discovery controllers, enabling multi-site or hierarchical discovery.
Performance Characteristics: NVMe-oF vs iSCSI
The performance gap is not just about raw numbers -- it is about where the bottleneck sits:
| Metric | iSCSI (25 GbE, software) | NVMe/TCP (25 GbE) | NVMe/RDMA (100 GbE RoCE) | Factor (iSCSI vs RDMA) |
|---|---|---|---|---|
| 4K random read IOPS | 200-300K | 400-600K | 1.0-1.5M | 4-5x |
| 4K random read latency (avg) | 200-400 us | 100-200 us | 20-50 us | 5-10x |
| 4K random read latency (p99) | 500-1000 us | 200-400 us | 40-80 us | 8-15x |
| 128K sequential read throughput | 2.5 GB/s (line rate) | 3.0 GB/s (line rate) | 12 GB/s (line rate) | 4-5x |
| CPU per 100K IOPS | 20-30% core | 10-15% core | 2-5% core | 5-10x |
The latency advantage of NVMe/RDMA is most pronounced at high percentiles (p99, p99.9), which is where user-visible performance problems manifest. A database query that hits 1000 random reads will see its total latency dominated by the slowest reads -- and at p99.9, iSCSI can spike to 2-5 ms while NVMe/RDMA stays under 200 us.
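The arithmetic behind that claim is worth making explicit: with n independent reads, the probability that at least one lands in the p99.9 tail grows quickly with n.

```python
# A query issuing n independent reads hits the p99.9 tail with
# probability 1 - 0.999^n; at n=1000 that is near-certain, so the
# query's total latency is governed by tail behavior, not the average.
n = 1000
p_slow = 1 - 0.999 ** n
print(f"P(>=1 read at/above p99.9 in {n} reads) = {p_slow:.1%}")  # → 63.2%
```

This is why PoC benchmarks should report p99/p99.9 latency, not just averages: the protocol with the better tail wins for multi-I/O transactions even when mean latencies look similar.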
Multipath: NVMe Native vs Device-Mapper
NVMe defines its own multipath mechanism called ANA (Asymmetric Namespace Access), which is conceptually similar to ALUA in SCSI:
- NVMe native multipath (nvme-core kernel module with multipath=Y): The kernel natively understands multiple paths to the same NVMe namespace. Each path can be marked as optimized, non-optimized, or inaccessible. The kernel aggregates paths at the nvme driver level, presenting a single /dev/nvmeXnY device. No device-mapper or multipathd needed.
- Device-mapper multipath (dm-mpath): The traditional Linux approach. Each NVMe path appears as a separate block device, and multipathd aggregates them into a single /dev/dm-X device. Works but adds a layer and may not support ANA state transitions as efficiently.
Best practice (2024+): Use NVMe native multipath. It is lighter weight, understands ANA natively, and avoids the device-mapper overhead. Enable with nvme_core.multipath=Y kernel parameter.
Linux NVMe Stack
# Install NVMe CLI tools
dnf install nvme-cli
# Discover available subsystems on a target
nvme discover -t tcp -a 10.0.100.1 -s 4420
# Connect to a subsystem
nvme connect -t tcp -n nqn.2024-01.com.storage:array01 \
-a 10.0.100.1 -s 4420
# Connect to all discovered subsystems
nvme connect-all -t tcp -a 10.0.100.1 -s 4420
# List connected NVMe devices
nvme list
# Show NVMe multipath topology
nvme list-subsys
# Show controller details (queues, firmware, capabilities)
nvme id-ctrl /dev/nvme0
# Show namespace details (size, block size, capacity)
nvme id-ns /dev/nvme0n1
# Persistent connections (survive reboot)
# /etc/nvme/discovery.conf:
# -t tcp -a 10.0.100.1 -s 4420
# Then: systemctl enable --now nvmf-autoconnect.service
NVMe-oF in the Candidate Platforms
OVE: ODF (Ceph) has experimental NVMe-oF gateway support (SPDK-based). The Ceph NVMe-oF gateway exports Ceph RBD images as NVMe-oF namespaces, allowing external hosts to consume Ceph storage via NVMe/TCP. For VMs running inside OVE, storage access is via librbd (user-space), which is already highly optimized and does not benefit significantly from NVMe-oF. NVMe-oF matters most for OVE when consuming external storage arrays (NetApp, Pure) that expose NVMe-oF targets.
Azure Local: S2D uses the Software Storage Bus (SSB) for intra-cluster communication, which is an SMB3/RDMA-based transport. Azure Local does not use NVMe-oF for internal cluster storage. However, Azure Local 23H2+ supports connecting to external NVMe/TCP targets. NVMe/TCP is gaining relevance as Microsoft expands Azure Local's external storage integration.
Swisscom ESC: The Dell PowerMax/PowerStore arrays behind ESC support NVMe/FC as a front-end protocol. Whether NVMe/FC is enabled in the ESC tenant depends on Swisscom's infrastructure configuration. The customer cannot select or configure the transport protocol.
3. MPIO (Multipath I/O)
Why Multipath Exists
In enterprise storage, a single cable failure or switch failure must never cause a storage outage. Multipath I/O provides two capabilities simultaneously:
- Redundancy: If one path fails (cable cut, NIC failure, switch failure), I/O continues on the remaining paths without interruption.
- Performance: I/O can be distributed across multiple active paths, multiplying available bandwidth.
MPIO Path Layout (Typical Dual-Fabric Design)
================================================
+--------------------+
| Compute Host |
| |
| NIC A NIC B | (or HBA-A, HBA-B for FC)
+--+------------+----+
| |
v v
+------+ +------+
|Switch | |Switch | Fabric A Fabric B
| A | | B | (separate failure domains)
+--+----+ +--+----+
| |
v v
+--+------------+----+
| Port A Port B |
| |
| Storage Target |
| (SAN Array, |
| Ceph Gateway, |
| NVMe Target) |
+--------------------+
From host's perspective WITHOUT multipath:
/dev/sda (via NIC-A -> Switch-A -> Port-A) path 1
/dev/sdb (via NIC-B -> Switch-B -> Port-B) path 2
--> Same physical LUN appears as TWO devices!
--> Filesystem sees two disks, data corruption risk
From host's perspective WITH multipath:
/dev/sda (path 1) --+
+--> /dev/mapper/mpath0 (single device)
/dev/sdb (path 2) --+
multipathd manages failover and load balancing
Linux Device-Mapper Multipath
The standard Linux multipath implementation is device-mapper-multipath, consisting of:
- multipathd: User-space daemon that monitors paths, detects failures, and updates the device-mapper table.
- kpartx: Utility for creating device nodes for partitions on multipath devices.
- /etc/multipath.conf: Configuration file for path grouping, failover policy, timeouts, and device-specific overrides.
- dm-mpath (kernel): The device-mapper target that aggregates multiple paths into a single block device.
Key configuration sections in /etc/multipath.conf:
defaults {
polling_interval 5 # seconds between path health checks
path_checker tur # Test Unit Ready (SCSI command)
path_grouping_policy failover # or multibus, group_by_prio
failback immediate # switch back to preferred path when available
no_path_retry 5 # I/O retries when ALL paths fail (or "queue")
user_friendly_names yes # use mpath0, mpath1 instead of WWID
find_multipaths yes # only create mpath devices for disks seen on 2+ paths
}
devices {
device {
vendor "NETAPP"
product "LUN.*"
path_grouping_policy group_by_prio
prio alua # use ALUA priority
path_checker tur
failback immediate
no_path_retry queue
features "3 queue_if_no_path pg_init_retries 50"
}
}
Path Grouping Policies
How paths are organized into groups determines the failover and load balancing behavior:
| Policy | Behavior | Use Case |
|---|---|---|
| failover | One active path per group; groups are priority-ordered. Only the highest-priority group handles I/O. If all paths in a group fail, the next group takes over. | Active-passive arrays, conservative deployments |
| multibus | All paths in a single group; I/O is distributed across all paths simultaneously. | Active-active arrays with symmetric access |
| group_by_prio | Paths are grouped by ALUA priority. Highest-priority group is active; lower-priority groups are standby. | ALUA-capable arrays (most modern SAN arrays) |
| group_by_serial | Paths are grouped by target serial number. | Multi-target configurations |
| group_by_node_name | Paths are grouped by target node name (WWNN for FC). | Specific FC topologies |
Path Checkers and Failover Timing
Path health is verified by periodic checks. The checker type must match the storage protocol:
| Checker | Protocol | Mechanism |
|---|---|---|
| tur (Test Unit Ready) | SCSI (iSCSI, FC) | Sends SCSI TEST UNIT READY command; device responds with status |
| readsector0 | SCSI | Reads LBA 0; more invasive but catches more failure modes |
| directio | Any block device | Reads a sector with O_DIRECT; generic fallback |
| none | NVMe native multipath | Not needed; NVMe driver handles path state via ANA |
Failover timing is determined by the interaction of several parameters:
Failover Timeline (iSCSI example)
====================================
t=0s Path failure occurs (cable cut)
t=0-5s TCP retransmissions (kernel TCP timeout)
t=5s iSCSI session detects failure (replacement_timeout)
t=5s Path marked as "failed" by iscsi_tcp
t=5s multipathd detects failed path on next poll
t=5-10s I/O rerouted to surviving path(s)
t=10s I/O resumes on healthy path
Total failover time: 5-10 seconds (with tuning)
Default failover time: 60-120 seconds (without tuning!)
Critical tuning parameters:
iSCSI:
node.session.timeo.replacement_timeout = 5-20
multipathd:
polling_interval = 5
no_path_retry = 5 (or "queue" for indefinite retry)
Kernel:
net.ipv4.tcp_retries2 = 5 (reduce from default 15)
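These parameters compose into a failover budget. A back-of-envelope Python sketch (the formula is a simplification of the timeline above, not an exact kernel model -- real timing also depends on TCP retransmission behavior and checker latency):

```python
# Rough worst case: path-failure detection (replacement_timeout for
# iSCSI) plus one multipathd polling interval to notice it, plus the
# retry window (no_path_retry * polling_interval) during which I/O is
# queued if ALL paths are down before erroring out.
def failover_budget(replacement_timeout: int, polling_interval: int,
                    no_path_retry: int) -> tuple:
    detect = replacement_timeout + polling_interval
    retry_window = no_path_retry * polling_interval
    return detect, detect + retry_window   # (single-path fail, all-paths fail)

tuned = failover_budget(replacement_timeout=5, polling_interval=5, no_path_retry=5)
default = failover_budget(replacement_timeout=120, polling_interval=5, no_path_retry=5)
print(tuned)    # → (10, 35)
print(default)  # → (125, 150)
```

The 10x gap between the tuned and default budgets is exactly the difference between a VM that rides through a path failure and one whose guest OS marks its disk offline.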
ALUA (Asymmetric Logical Unit Access)
ALUA (SPC-3/SPC-4 standard) is a SCSI mechanism that allows a storage array to communicate path preferences to the host. This is critical for dual-controller arrays where a LUN is owned by one controller but accessible (at reduced performance) through the other:
ALUA Path States
==================
+-----------------------------+ +-----------------------------+
| Controller A (Owner) | | Controller B (Peer) |
| | | |
| LUN 1: Active-Optimized | | LUN 1: Active-Non-Opt. |
| LUN 2: Active-Non-Opt. | | LUN 2: Active-Optimized |
| LUN 3: Active-Optimized | | LUN 3: Active-Non-Opt. |
+-----------------------------+ +-----------------------------+
| |
v v
Paths to Controller A Paths to Controller B
Priority: 50 (preferred) Priority: 10 (non-preferred)
for LUN 1 and LUN 3 for LUN 1 and LUN 3
ALUA states:
Active-Optimized (AO) = Best path, full performance
Active-Non-Optimized (ANO) = Accessible but cross-controller hop
Standby = Path available but not serving I/O
Unavailable = Path offline
Transitioning = Controller ownership is changing
multipathd with prio=alua:
- Queries ALUA target port group descriptor
- Assigns priority based on AO (50) vs ANO (10)
- Routes I/O to AO paths first
- On controller failover, ALUA states update, multipathd re-prioritizes
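multipathd's group_by_prio behavior can be modeled directly. A Python sketch (the path records and priority values are illustrative; real priorities come from the ALUA target port group query, AO=50 and ANO=10 as above):

```python
from itertools import groupby

# Group live paths by priority, descending; route I/O only to the
# highest-priority group that still has active paths -- this is the
# essence of path_grouping_policy group_by_prio with prio alua.
paths = [
    {"dev": "sda", "prio": 50, "state": "active"},   # via Controller A (AO)
    {"dev": "sdb", "prio": 50, "state": "active"},
    {"dev": "sdc", "prio": 10, "state": "active"},   # via Controller B (ANO)
    {"dev": "sdd", "prio": 10, "state": "active"},
]

def active_group(paths: list) -> list:
    live = [p for p in paths if p["state"] == "active"]
    live.sort(key=lambda p: -p["prio"])
    groups = [list(g) for _, g in groupby(live, key=lambda p: p["prio"])]
    return groups[0] if groups else []

print([p["dev"] for p in active_group(paths)])  # → ['sda', 'sdb']

# Controller A fails: both AO paths go down, I/O fails over to the ANO group.
for p in paths[:2]:
    p["state"] = "failed"
print([p["dev"] for p in active_group(paths)])  # → ['sdc', 'sdd']
```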
Multipath with Each Storage Protocol
| Protocol | Multipath Method | Path Identity | Notes |
|---|---|---|---|
| Fibre Channel | dm-multipath with ALUA | WWPN + LUN ID | Most mature; well-tested with all enterprise arrays |
| iSCSI | dm-multipath with ALUA | IQN + target portal + LUN | Multiple iSCSI sessions (one per path); requires multiple NICs or VLANs |
| NVMe/TCP | NVMe native multipath (ANA) | NQN + controller ID | Preferred over dm-multipath; lighter weight |
| NVMe/RDMA | NVMe native multipath (ANA) | NQN + controller ID | Same as NVMe/TCP |
| NVMe/FC | NVMe native multipath (ANA) | NQN + controller ID | Requires FC HBA with NVMe support |
4. Fibre Channel
FC Protocol Stack
Fibre Channel is a lossless, high-speed network protocol designed exclusively for storage traffic. Unlike Ethernet, FC guarantees in-order delivery and provides credit-based flow control that prevents frame drops.
FC Protocol Stack (FC-0 through FC-4)
========================================
+--------------------------------------------------+
| FC-4: Upper Layer Protocols |
| FCP (Fibre Channel Protocol for SCSI) |
| NVMe/FC (NVMe over Fibre Channel) |
| FICON (mainframe channel) |
+--------------------------------------------------+
| FC-2: Framing and Flow Control |
| Frame structure, credit-based flow control, |
| classes of service (Class 3 most common), |
| exchange and sequence management |
+--------------------------------------------------+
| FC-1: Encode/Decode |
| 8b/10b (up to 8G FC) |
| 64b/66b (16G FC and above) |
+--------------------------------------------------+
| FC-0: Physical Interface |
| Optical transceivers (SFP+, SFP28, QSFP) |
| Cable types (OM3/OM4 multimode, OS2 singlemode) |
| Speeds: 8/16/32/64/128 GFC |
+--------------------------------------------------+
Speed evolution:
1G FC (1997) -> 1.0625 Gbps
2G FC (2001) -> 2.125 Gbps
4G FC (2004) -> 4.25 Gbps
8G FC (2007) -> 8.5 Gbps
16G FC (2011) -> 14.025 Gbps (64b/66b encoding)
32G FC (2016) -> 28.05 Gbps
64G FC (2020) -> 57.2 Gbps (PAM4 signaling)
128G FC (2024) -> 112.0 Gbps (expected Gen 8)
Topology
FC supports three topologies, though only one is used in modern data centers:
- Point-to-Point (N_Port to N_Port): Direct connection between two devices. Rarely used outside direct-attach storage.
- Arbitrated Loop (FC-AL): Legacy shared-medium topology where up to 126 devices share bandwidth. Obsolete; do not deploy.
- Switched Fabric (FC-SW): Every device connects to an FC switch. The fabric provides non-blocking, full-bandwidth connectivity between any pair of ports. This is the standard topology.
Zoning
Zoning is the FC equivalent of firewall rules. It controls which initiators (HBAs) can see which targets (storage ports):
- Soft zoning (WWN-based): Zone membership defined by WWPN. If a device moves to a different switch port, the zone follows it. Most common.
- Hard zoning (port-based): Zone membership defined by switch port number. More restrictive but breaks when devices are re-cabled.
- Single-initiator zoning (best practice): Each zone contains exactly one initiator WWPN and one or more target WWPNs. This is the industry standard because it prevents accidental cross-talk between initiators.
Zoning Example (Single-Initiator Best Practice)
==================================================
Zone: host01_array01
Members:
21:00:00:e0:8b:01:01:01 (Host 01, HBA Port A, WWPN)
50:00:00:00:c9:aa:bb:01 (Array 01, Port A1, WWPN)
50:00:00:00:c9:aa:bb:02 (Array 01, Port A2, WWPN)
Zone: host01_array01_fabricB
Members:
21:00:00:e0:8b:01:01:02 (Host 01, HBA Port B, WWPN)
50:00:00:00:c9:aa:bb:03 (Array 01, Port B1, WWPN)
50:00:00:00:c9:aa:bb:04 (Array 01, Port B2, WWPN)
Zone: host02_array01
Members:
21:00:00:e0:8b:02:01:01 (Host 02, HBA Port A, WWPN)
50:00:00:00:c9:aa:bb:01 (Array 01, Port A1, WWPN)
50:00:00:00:c9:aa:bb:02 (Array 01, Port A2, WWPN)
Zoneset: production_zoneset (activated on all switches in fabric)
Contains: host01_array01, host01_array01_fabricB, host02_array01
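Single-initiator zoning is mechanical enough to generate from inventory rather than write by hand. A minimal sketch (host names, array name, and WWPNs are the illustrative values from the example above; this is not any vendor's switch CLI):

```python
# Generate single-initiator zones: one zone per host HBA port, each
# containing exactly one initiator WWPN plus all target ports of the array.
def single_initiator_zones(hosts, array, targets):
    """hosts: {host: [hba_wwpn, ...]}; targets: [array_port_wwpn, ...]."""
    zones = {}
    for host, hba_ports in hosts.items():
        for i, hba in enumerate(hba_ports):
            zones[f"{host}_{array}_p{i}"] = [hba] + list(targets)
    return zones

zones = single_initiator_zones(
    {"host01": ["21:00:00:e0:8b:01:01:01", "21:00:00:e0:8b:01:01:02"]},
    "array01",
    ["50:00:00:00:c9:aa:bb:01", "50:00:00:00:c9:aa:bb:02"],
)
for name, members in sorted(zones.items()):
    print(name, "->", ", ".join(members))
```

Generating zones this way enforces the one-initiator-per-zone rule structurally instead of relying on review discipline.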
WWNN/WWPN Addressing and Login
Every FC port has two addresses:
- WWNN (World Wide Node Name): Identifies the physical device (server, array). One per device.
- WWPN (World Wide Port Name): Identifies a specific port on the device. One per physical or virtual port.
Both are 8-byte addresses, typically displayed as colon-separated hex: 21:00:00:e0:8b:01:01:01.
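WWPNs show up in several formats across switch and array CLIs (colon-separated, dashed, raw hex), which makes zoning diffs error-prone. A small normalizer sketch (the accepted input formats are assumptions):

```python
# Normalize a WWPN into canonical colon-separated lowercase hex.
def normalize_wwpn(raw: str) -> str:
    hexstr = raw.replace(":", "").replace("-", "").lower()
    if len(hexstr) != 16 or any(c not in "0123456789abcdef" for c in hexstr):
        raise ValueError(f"not an 8-byte WWPN: {raw!r}")
    # Re-group into 8 colon-separated byte pairs.
    return ":".join(hexstr[i:i + 2] for i in range(0, 16, 2))

print(normalize_wwpn("210000E08B010101"))  # 21:00:00:e0:8b:01:01:01
```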
The login process establishes communication:
FC Login Sequence
===================
Host HBA FC Switch Storage Array
| | |
|-- FLOGI (Fabric --------> |
| Login) | |
|<-- FLOGI Accept ------| |
| (assigned N_Port ID | |
| e.g., 0x010200) | |
| | |
|-- PLOGI (Port Login, to storage target WWPN) --->|
|<-- PLOGI Accept (buffer credits, capabilities) --|
| | |
|-- PRLI (Process Login, FCP/NVMe protocol) ------>|
|<-- PRLI Accept (protocol ready) -----------------|
| | |
| === Ready for SCSI/NVMe commands === |
FLOGI: Host registers with the fabric, receives a 3-byte N_Port ID
(24-bit address used for frame routing within the fabric)
PLOGI: Host establishes a session with a specific target port
PRLI: Host and target agree on the upper-layer protocol (FCP or NVMe)
FC Frame Structure
FC Frame Format
=================
+--------+--------+--------+--------+-----------+--------+--------+
| SOF | Frame | D_ID | S_ID | Payload | CRC | EOF |
| (4B) | Header | (3B) | (3B) | (0-2112B) | (4B) | (4B) |
| | (24B) | | | | | |
+--------+--------+--------+--------+-----------+--------+--------+
Frame Header (24 bytes):
+------+------+------+------+------+------+------+------+
| R_CTL| D_ID (dest) | CS_ | S_ID (source) | TYPE |
| (1B) | (3B) | CTL | (3B) | (1B) |
| | | (1B) | | |
+------+------+------+------+------+------+------+------+
| SEQ_ | DF_ | SEQ_ | OX_ID | RX_ID |
| ID | CTL | CNT | (2B) | (2B) |
| (1B) | (1B) | (2B) | | |
+------+------+------+------+------+------+------+------+
| Relative Offset (Parameter) |
| (4B) |
+------------------------------------+
Key fields:
R_CTL: Routing control (data frame, link control, etc.)
D_ID: Destination N_Port ID (assigned during FLOGI)
S_ID: Source N_Port ID
TYPE: Upper layer protocol (0x08 = FCP/SCSI, 0x28 = NVMe)
OX_ID: Originator exchange ID (tracks the I/O operation)
RX_ID: Responder exchange ID
Max payload: 2112 bytes (2048 data + 64 optional header)
Max frame size with overhead: 2148 bytes
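From these field sizes, the wire efficiency of a full-size frame is easy to estimate. A back-of-envelope sketch (ignores inter-frame primitives and lower-level link overhead):

```python
# Wire efficiency of a full-size FC frame carrying 2048 bytes of SCSI data.
SOF, HEADER, MAX_PAYLOAD, CRC, EOF = 4, 24, 2112, 4, 4
DATA = 2048                                     # payload minus optional headers
FRAME = SOF + HEADER + MAX_PAYLOAD + CRC + EOF  # 2148 bytes on the wire
print(f"{DATA / FRAME:.1%} of wire bytes are data")  # ~95.3%
```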
Credit-Based Flow Control
FC's lossless guarantee comes from buffer-to-buffer (BB) credits. Each port allocates a fixed number of frame buffers. Before sending a frame, the sender must hold a credit. Credits are returned when the receiver processes the frame.
This means FC never drops frames due to congestion -- instead, a saturated link causes the sender to wait (back-pressure). This is fundamentally different from Ethernet, where congestion causes frame drops and TCP retransmissions.
The implication for long distances: credits are finite, and they must cover the round-trip time. For a 100 km link at 32G FC, keeping the link busy requires more than 1,500 credits, which in turn requires expensive long-distance buffer allocations on the FC switch ports.
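The credit requirement falls out of dividing the round-trip time by the frame serialization time. A sketch (the 5 us/km fiber propagation delay and full-size frames are assumptions):

```python
import math

FRAME_BITS = 2148 * 8     # full-size FC frame incl. SOF/header/CRC/EOF
FIBER_US_PER_KM = 5.0     # one-way propagation delay in fiber, ~5 us/km

def bb_credits_needed(distance_km, line_rate_gbps):
    """Frames that must be in flight to keep the link busy over the RTT."""
    rtt_us = 2 * distance_km * FIBER_US_PER_KM
    serialization_us = FRAME_BITS / (line_rate_gbps * 1000)  # Gbps = 1000 bits/us
    return math.ceil(rtt_us / serialization_us)

# 100 km at 32G FC (28.05 Gbps line rate):
print(bb_credits_needed(100, 28.05))   # ~1600 credits
```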
FCoE (Fibre Channel over Ethernet)
FCoE encapsulates FC frames inside Ethernet frames, allowing FC and Ethernet traffic to share the same physical cable and switch. It was designed to reduce cabling complexity in data centers.
DCB (Data Center Bridging) requirements for FCoE:
- PFC (Priority Flow Control, 802.1Qbb): Per-priority pause frames to prevent FCoE traffic from being dropped. FCoE traffic is assigned a dedicated priority (typically priority 3).
- ETS (Enhanced Transmission Selection, 802.1Qaz): Bandwidth allocation between FCoE and regular Ethernet traffic.
- DCBX (Data Center Bridging Exchange): Protocol for auto-negotiating DCB parameters between switches and CNAs.
CNA (Converged Network Adapter): A single adapter that provides both Ethernet and FC connectivity. The CNA presents virtual Ethernet and virtual FC interfaces to the OS.
Current status (2025): FCoE adoption has been limited. Most enterprises that invested in FC stayed with native FC rather than converging. Most enterprises on Ethernet chose iSCSI or NVMe/TCP. FCoE occupies a narrow niche for organizations with existing FC infrastructure that want to reduce cabling in new deployments.
FC in Modern Data Centers -- When It Still Makes Sense
Fibre Channel is not dead, but its role is narrowing:
FC still makes sense when:
- The organization has an existing FC fabric with significant switch/HBA investment and operational expertise.
- Latency-sensitive, mission-critical workloads (databases, ERP systems) require the guaranteed lossless behavior of FC credit-based flow control.
- The storage arrays are FC-native (Dell PowerMax, HPE Primera, Hitachi VSP) and have limited or no NVMe/TCP support.
- Regulatory or compliance frameworks mandate physical network segregation of storage traffic.
FC is losing ground to converged Ethernet because:
- 100/200/400 GbE provides raw bandwidth parity with 64/128G FC.
- NVMe/TCP delivers competitive latency without dedicated fabric infrastructure.
- NVMe/RDMA (RoCE v2) delivers latency parity with NVMe/FC on converged Ethernet with DCB.
- Operational teams increasingly prefer converged networks to reduce skill set requirements (one network team, not separate Ethernet and FC teams).
- Cloud-native platforms (Kubernetes, OpenShift) have no native FC support -- FC requires out-of-band integration.
Fibre Channel in the Candidate Platforms
OVE: OpenShift has no native Fibre Channel support at the Kubernetes level. FC-attached storage can be consumed via CSI drivers from storage vendors (e.g., NetApp Trident, Dell CSI, Pure CSI) that handle the FC session management outside of Kubernetes. This means the FC zoning, HBA configuration, and multipath setup happen at the bare-metal Linux layer, not within the Kubernetes/OpenShift abstraction. It works, but it is a bolt-on integration, not a first-class citizen.
Azure Local: Windows Server has mature native FC support via the Windows FC HBA driver stack. Azure Local nodes can connect to FC SAN arrays for VM storage. However, S2D (the HCI storage model) uses local disks and SMB3 for inter-node traffic -- FC is only relevant when consuming external SAN storage alongside or instead of S2D.
Swisscom ESC: FC is the primary storage transport. The Dell VxBlock infrastructure connects compute nodes to PowerMax/PowerStore arrays via 32G Fibre Channel. The customer has no involvement in FC operations -- Swisscom manages zoning, HBA firmware, and multipath configuration.
5. NFSv3
RPC/XDR Foundation
NFS is built on Sun RPC (Remote Procedure Call) and XDR (External Data Representation). Every NFS operation is an RPC call: the client marshals the operation (READ, WRITE, GETATTR, LOOKUP) into XDR format, sends it to the server, and waits for the XDR-encoded response.
NFS Protocol Stack (v3)
==========================
+----------------------------------+
| NFS v3 Protocol (RFC 1813) |
| Operations: READ, WRITE, CREATE, |
| REMOVE, RENAME, GETATTR, LOOKUP, |
| READDIR, COMMIT, etc. |
+----------------------------------+
| Sun RPC v2 (RFC 5531) |
| Program number: 100003 (NFS) |
| Version: 3 |
| Procedure numbers: 0-21 |
+----------------------------------+
| XDR (External Data Representation)|
| (RFC 4506 -- serialization format)|
+----------------------------------+
| TCP (port assigned by portmapper) |
| or UDP (legacy, less reliable) |
+----------------------------------+
Supporting services:
portmapper / rpcbind (port 111):
Maps RPC program numbers to TCP/UDP ports
Client asks: "What port is NFS (program 100003) on?"
portmapper responds: "Port 2049"
mountd (mount daemon):
Handles mount requests from clients
Verifies export permissions (/etc/exports)
Returns a file handle for the exported directory
statd / lockd (NLM -- Network Lock Manager):
Provides advisory file locking for NFSv3
statd tracks client state for crash recovery
lockd handles lock/unlock requests
rquotad (quota daemon):
Reports quota information to clients
Stateless Design
NFSv3 is stateless -- the server maintains no per-client session state. Every request is self-contained and can be answered independently. This has profound implications:
Advantages:
- Server crash recovery is trivial: the server restarts, clients retry their requests, and operations resume. No session state to rebuild.
- Server-side simplicity: no memory consumed for tracking client sessions, open files, or lock state.
Disadvantages:
- Locking is bolted on. Because the core protocol is stateless, file locking requires a separate stateful protocol: the Network Lock Manager (NLM, rpc.lockd). NLM is notoriously fragile -- if statd/lockd processes crash or the client-server NLM state becomes inconsistent, locks can be lost or orphaned.
- No delegation. The server cannot grant a client exclusive access to a file for local caching. Every attribute check requires a network round-trip (mitigated by client-side attribute caching via the ac/noac mount options).
- Mount protocol is separate. The mount operation itself uses a different RPC program (mountd, program 100005), which requires its own port and firewall rule.
AUTH_SYS and Export Controls
NFSv3 authentication is primitive by modern standards:
- AUTH_SYS (AUTH_UNIX): The client sends its UID and GID list in the RPC header. The server trusts these values. There is no cryptographic verification -- a client can claim to be any UID. This is fundamentally insecure and relies entirely on network segmentation for access control.
- Export controls (/etc/exports on the server):

  /data    10.0.100.0/24(rw,sync,no_root_squash,no_subtree_check)
  /backup  10.0.200.0/24(ro,sync,root_squash)

  Key export options:
  - root_squash: Maps UID 0 (root) to nobody (65534). Prevents root on a client from having root access on the exported filesystem.
  - no_root_squash: Allows root access. Required for VM storage datastores where the hypervisor needs root-level access to create/modify VM disk files.
  - sync: Server commits writes to stable storage before responding. Required for data integrity. The async option is faster but risks data loss on server crash.
  - no_subtree_check: Disables verification that accessed files are within the exported subtree. Improves performance and is recommended for all exports.
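Because AUTH_SYS pushes all real access control into the export options, those options are worth auditing mechanically. A simplified sketch (the parser assumes one client spec per line, which real /etc/exports files do not guarantee):

```python
import re

# Options that deserve scrutiny in a regulated environment, and why.
RISKY = {
    "no_root_squash": "client root has root on the export",
    "async": "data loss possible on server crash",
}

def audit_exports(text):
    """Return (path, client, option, reason) tuples for risky export options."""
    findings = []
    for line in text.splitlines():
        line = line.split("#")[0].strip()          # drop comments
        m = re.match(r"(\S+)\s+(\S+?)\((.*)\)", line)
        if not m:
            continue
        path, client, opts = m.groups()
        for opt in opts.split(","):
            if opt in RISKY:
                findings.append((path, client, opt, RISKY[opt]))
    return findings

sample = """\
/data    10.0.100.0/24(rw,sync,no_root_squash,no_subtree_check)
/backup  10.0.200.0/24(ro,sync,root_squash)
"""
for finding in audit_exports(sample):
    print(finding)
```

Note that no_root_squash is flagged, not forbidden: as stated above, VM datastores legitimately need it, so the audit output is a review list, not a block list.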
Performance Considerations
| Factor | Impact | Recommendation |
|---|---|---|
| TCP vs UDP | TCP provides reliable delivery; UDP relies on NFS-level retransmission. TCP is mandatory for modern deployments. | Always use TCP (-o proto=tcp) |
| wsize/rsize | Read and write transfer sizes. Larger = fewer round trips = higher throughput. | Set to 1048576 (1 MB) for bulk workloads: -o rsize=1048576,wsize=1048576 |
| Attribute caching (ac) | Client caches file attributes locally to avoid GETATTR round-trips. noac disables caching for strict consistency. | Use default ac for performance; use noac only when multiple clients write and read the same files simultaneously. |
| Read-ahead | Client pre-fetches sequential data. Controlled by nfs.nfs_congestion_kb and client-side readahead (blockdev --setra). | Tune for sequential workloads (backup, media) |
| Write-behind / async writes | Client buffers writes and flushes asynchronously. The sync mount option forces synchronous writes. | Use default (async) for performance, sync for strict durability requirements |
| Jumbo frames | MTU 9000 reduces per-frame overhead for large transfers | Enable if supported end-to-end |
| nconnect | Linux 5.3+: opens multiple TCP connections per mount, distributing I/O. | -o nconnect=8 for high-throughput workloads. Major improvement on multi-core servers. |
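The tuning options above combine into a single mount invocation. A sketch assembling one (the server name and paths are hypothetical; the values are the examples from the table, not universal defaults):

```python
# Build the -o option string for a throughput-oriented NFSv3 mount.
opts = {
    "proto": "tcp",       # TCP is mandatory for modern deployments
    "rsize": 1048576,     # 1 MB reads
    "wsize": 1048576,     # 1 MB writes
    "nconnect": 8,        # multiple TCP connections per mount (Linux 5.3+)
}
optstr = ",".join(f"{k}={v}" for k, v in opts.items())
print(f"mount -t nfs -o {optstr} nas01:/export/data /mnt/data")
```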
Typical NFSv3 performance (single client, 25 GbE, enterprise NAS):
| Workload | Throughput | IOPS | Latency |
|---|---|---|---|
| 1 MB sequential read | 2.0-2.5 GB/s | N/A | N/A |
| 1 MB sequential write (sync) | 1.0-1.5 GB/s | N/A | N/A |
| 4K random read | N/A | 50-100K | 200-500 us |
| 4K random write (sync) | N/A | 10-30K | 500-2000 us |
NFS for VM Storage
In VMware environments, NFS is commonly used as a datastore protocol. ESXi mounts an NFS export and stores VMDK files on it, similar to VMFS on block storage.
VMware NFS datastore specifics:
- ESXi supports NFSv3 (all versions) and NFSv4.1 (vSphere 6.0+).
- VAAI-NAS (vStorage APIs for Array Integration -- NAS) offloads operations to the NFS server: full file clone, extended statistics, reserve space. Without VAAI-NAS, every VM clone requires reading and writing every byte through ESXi.
- Datastore-level locking: VMware uses a proprietary locking mechanism on NFS datastores (not NLM) to prevent multiple ESXi hosts from corrupting the same VMDK. This lock mechanism is VMware-specific and does not exist in any of the candidate platforms.
Implication for migration: If the current VMware environment uses NFS datastores, the NFS server itself may survive the migration (it is independent of the hypervisor), but the integration model changes completely. OVE can consume NFS via the Kubernetes CSI NFS driver, but there is no equivalent of VAAI-NAS offload. Azure Local can mount NFS shares but prefers SMB/S2D for VM storage.
6. NFSv4
Stateful Design
NFSv4 (RFC 7530) is a fundamental redesign of NFS, introducing statefulness. The server now tracks open files, locks, and delegations per client. This eliminates the bolted-on NLM locking mechanism and enables features impossible in a stateless protocol.
Key stateful operations:
- OPEN: Client opens a file and receives a stateid (an opaque handle that represents the open state on the server). All subsequent reads, writes, and locks reference this stateid.
- LOCK: Integrated byte-range locking, replacing NLM. Locks are tied to the client's lease and are automatically released if the client fails to renew.
- CLOSE: Releases the stateid and associated locks.
- Delegation: The server can grant a client exclusive rights to a file (read delegation or write delegation). While a client holds a delegation, it can perform reads/writes and cache data locally without consulting the server. This dramatically reduces latency for workloads with single-writer patterns.
NFSv4 Stateful Interaction
============================
Client A Server
| |
|-- OPEN file.txt ------------------->| Server creates state
|<-- stateid=0x0001 ------------------| Client holds state
| |
|-- READ (stateid=0x0001, off=0) ---->| Server verifies state
|<-- Data ----------------------------|
| |
|-- LOCK (stateid=0x0001, 0-4095) --->| Integrated locking
|<-- Lock granted --------------------| (no separate NLM!)
| |
|-- WRITE (stateid=0x0001, off=0) --->| Write within lock
|<-- OK ------------------------------|
| |
|-- CLOSE (stateid=0x0001) ---------->| Releases state + lock
|<-- OK ------------------------------|
Lease management:
- Client renews lease periodically (default 90s)
- If client fails to renew (crash, network partition):
Server reclaims state after lease expiry
Locks are released, delegations recalled
- Grace period after server restart:
Clients reclaim pre-existing locks
New locks are rejected until grace period ends
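On the server side, the lease mechanics reduce to a timestamp comparison. A toy model of the behavior described above (the 90-second lease matches the common Linux default; everything else is illustrative):

```python
LEASE_SECONDS = 90

class ClientState:
    """Per-client state an NFSv4 server tracks: last renewal plus locks."""
    def __init__(self, now):
        self.last_renewal = now
        self.locks = []

    def renew(self, now):
        self.last_renewal = now

    def expired(self, now):
        # Once the lease lapses, the server reclaims locks and delegations.
        return now - self.last_renewal > LEASE_SECONDS

state = ClientState(now=0)
state.locks.append(("file.txt", 0, 4095))  # byte-range lock held under lease
state.renew(now=60)                        # client renews in time
print(state.expired(now=120))  # False: 60s since renewal, lease still valid
print(state.expired(now=200))  # True: 140s since renewal -> state reclaimed
```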
Single-Port Operation and Pseudo-Filesystem
NFSv4 consolidates all operations on a single well-known port: TCP 2049. No portmapper, no mountd, no lockd, no statd. This is a major operational improvement for environments with strict firewall policies (like financial institutions).
Pseudo-filesystem: NFSv4 does not use the mount protocol. Instead, the server presents a virtual filesystem tree (pseudo-filesystem) that maps exported directories into a unified namespace. Clients connect to the server's root (/) and navigate to the export via the pseudo-filesystem path.
NFSv4 Pseudo-Filesystem
==========================
Server exports:
/data/project-a -> available at /project-a
/data/project-b -> available at /project-b
/backup/2024 -> available at /backup/2024
Pseudo-filesystem (as seen by client):
/ (pseudo-root, not a real directory)
+-- project-a/ (real export: /data/project-a)
+-- project-b/ (real export: /data/project-b)
+-- backup/
+-- 2024/ (real export: /backup/2024)
Client mount:
mount -t nfs4 server:/project-a /mnt/project-a
(no portmapper query, no mountd call, just TCP 2049)
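The pseudo-filesystem is essentially a path trie stitched over the exports. A sketch of how a server might build it, using the exports from the example above (nested dicts stand in for the real namespace):

```python
def build_pseudo_fs(exports):
    """exports: {pseudo_path: real_path} -> nested dict rooted at '/'."""
    root = {}
    for pseudo, real in exports.items():
        node = root
        parts = [p for p in pseudo.split("/") if p]
        for part in parts[:-1]:
            node = node.setdefault(part, {})  # intermediate pseudo-directory
        node[parts[-1]] = real                # leaf maps to the real export
    return root

tree = build_pseudo_fs({
    "/project-a": "/data/project-a",
    "/project-b": "/data/project-b",
    "/backup/2024": "/backup/2024",
})
print(tree)
# {'project-a': '/data/project-a', 'project-b': '/data/project-b', 'backup': {'2024': '/backup/2024'}}
```

Note that `backup/` exists only as a pseudo-directory: it is navigable but is not itself an export.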
Referrals: The pseudo-filesystem can contain referrals that redirect clients to different servers. This enables a federated namespace spanning multiple NFS servers without client-side configuration changes.
Security Model
NFSv4 replaces AUTH_SYS with RPCSEC_GSS (RFC 2203), which provides:
- Kerberos 5 authentication (krb5): Client and server authenticate each other via Kerberos tickets. No more trusting UID/GID values from untrusted clients.
- Kerberos 5 integrity (krb5i): Authentication plus cryptographic integrity checking of every RPC message (prevents tampering).
- Kerberos 5 privacy (krb5p): Authentication plus integrity plus encryption of the entire RPC payload. Required for sensitive data in transit.
| Security Flavor | Authentication | Integrity | Encryption | Performance Impact |
|---|---|---|---|---|
| AUTH_SYS | None (trust UID) | None | None | Baseline |
| krb5 | Kerberos ticket | None | None | ~5-10% overhead |
| krb5i | Kerberos ticket | Per-message HMAC | None | ~15-25% overhead |
| krb5p | Kerberos ticket | Per-message HMAC | AES encryption | ~30-50% overhead |
For a financial institution, krb5i or krb5p should be mandatory for any NFS export carrying regulated data. The performance overhead is significant but unavoidable for compliance.
ACLs: NFSv4 defines its own ACL model (not POSIX ACLs) based on the Windows/NT ACL model. NFSv4 ACLs support ALLOW and DENY entries, inheritance, and a richer set of permissions than POSIX mode bits. This is important for interoperability with Windows environments (the same ACL semantics map to NTFS).
pNFS (Parallel NFS)
pNFS (NFSv4.1, RFC 5661) separates metadata operations from data operations, allowing clients to perform data I/O directly to storage devices in parallel, bypassing the NFS server as a data bottleneck:
pNFS Architecture
===================
Without pNFS (traditional NFS):
Client --> NFS Server --> Storage
(all data flows through the server -- bottleneck)
With pNFS:
Client --metadata--> MDS (Metadata Server)
Client --data------> DS1 (Data Server / Storage Device)
Client --data------> DS2 (Data Server / Storage Device)
Client --data------> DS3 (Data Server / Storage Device)
1. Client requests layout from MDS
2. MDS returns layout describing where data blocks reside
3. Client performs I/O directly to data servers
4. Client commits/returns layout to MDS
Layout types:
- Files: Data striped across NFS data servers (NetApp, Linux knfsd)
- Blocks: Data on block devices (SAN LUNs accessible to client)
- Objects: Data in object storage (Panasas, rare)
- SCSI: Data on SCSI LUNs (SPC-4 based)
- FlexFiles: Enhanced file layout (NetApp ONTAP, most common in practice)
pNFS is important for high-throughput workloads (media, genomics, HPC) but is less relevant for typical VM storage where block protocols dominate.
NFSv4.1 and NFSv4.2 Enhancements
NFSv4.1 (RFC 5661):
- Sessions: Multi-connection sessions with exactly-once semantics. Eliminates the duplicate request cache (DRC) reliability issues of NFSv4.0.
- Trunking: Multiple network paths between client and server for bandwidth aggregation and failover. Comparable to iSCSI MC/S.
- pNFS: Parallel NFS (described above).
- Directory delegation: Clients can cache directory listings locally.
NFSv4.2 (RFC 7862):
- Server-side copy: The client instructs the server to copy data between files without transferring data over the network. Critical for VM clone operations -- instead of reading 100 GB from the server and writing it back, the server copies internally. Massive performance improvement for provisioning workflows.
- Space reservations (fallocate): Pre-allocate space for files on the server. Prevents out-of-space errors during writes.
- Sparse files: Efficient handling of files with large zero-filled regions (common in VM disk images).
- Application data blocks: Identify data patterns for deduplication hints.
- Labeled NFS: SELinux label support on NFS files for mandatory access control.
NFSv4 in the Candidate Platforms
OVE: The Kubernetes NFS CSI driver supports NFSv4.x for mounting external NFS exports as Persistent Volumes. CephFS (part of ODF) uses its own protocol (not NFS) for file access, but the Ceph NFS-Ganesha gateway can export CephFS directories via NFSv4. This is primarily used for shared filesystem access by VMs, not for boot disks (which use Ceph RBD).
Azure Local: Windows Server supports NFSv4.1 as a client (mounting external NFS shares) but the native VM storage path is SMB3/S2D. NFS is relevant when Azure Local VMs need to access existing NFS infrastructure (e.g., NAS appliances shared with Linux workloads).
Swisscom ESC: NFS access is available as a managed service option. The protocol version (v3 vs v4) depends on the underlying NAS infrastructure (typically Dell PowerStore NAS or Isilon). The customer consumes NFS exports without managing the server configuration.
7. SMB / CIFS
Protocol Evolution
SMB (Server Message Block) is the file sharing protocol native to Windows environments. Understanding its evolution is critical because the version determines security posture, performance capabilities, and protocol compatibility.
SMB Version Timeline
======================
SMB1 / CIFS (1983-2006):
- Original protocol, designed for LAN Manager / Windows for Workgroups
- Chatty, insecure, single-threaded
- Vulnerable: WannaCry (EternalBlue, MS17-010) exploited SMBv1
- STATUS: Deprecated. MUST be disabled. Removed from Windows 11 24H2+.
- Financial institutions: SMBv1 should be blocked at the network level.
SMB2.0 (Windows Vista / 2006):
- Reduced chattiness (compound commands, pipelining)
- Larger reads/writes (up to 1 MB vs 64 KB)
- Improved caching (oplock/lease model)
- Durable handles (survive brief disconnects)
SMB2.1 (Windows 7 / 2009):
- Large MTU support (up to 1 MB per SMB packet)
- Client oplock leasing
- BranchCache support (WAN optimization)
SMB3.0 (Windows 8 / Server 2012):
- Multichannel (aggregate multiple NICs)
- SMB Direct (RDMA -- zero-copy, kernel bypass)
- SMB Encryption (AES-128-CCM)
- Transparent failover (continuously available shares)
- VSS remote snapshots
- Scale-out file server (SOFS) support
SMB3.02 (Windows 8.1 / Server 2012 R2):
- SMB1 can be fully disabled
- Improved SOFS performance
SMB3.1.1 (Windows 10 / Server 2016+):
- AES-128-GCM encryption (faster than CCM)
- Pre-authentication integrity (SHA-512 hash chain)
- Cluster dialect fencing
- AES-256-CCM/GCM (Windows Server 2022+)
SMB3 Key Features for Enterprise Storage
Multichannel: SMB3 can detect multiple network interfaces between client and server and automatically establish parallel connections across them. This provides both bandwidth aggregation and NIC-level failover without any multipath software.
SMB3 Multichannel
===================
Client Server
+--------+ +--------+
| NIC 1 |----Connection 1--->| NIC 1 |
| 25 GbE | | 25 GbE |
+--------+ +--------+
| NIC 2 |----Connection 2--->| NIC 2 |
| 25 GbE | | 25 GbE |
+--------+ +--------+
Total bandwidth: 50 Gbps (aggregate)
If NIC 1 fails: all I/O shifts to NIC 2 (automatic)
No multipath daemon, no dm-mpath, no configuration.
SMB Direct (RDMA): When both client and server have RDMA-capable NICs (RoCE v2 or iWARP), SMB3 can perform direct memory-to-memory data transfers that bypass the TCP/IP stack and the kernel entirely. This is the transport used by Storage Spaces Direct for intra-cluster storage traffic.
Performance comparison (single session, 4K random read):
| Transport | IOPS | Latency (avg) | CPU overhead |
|---|---|---|---|
| SMB3 over TCP (25 GbE) | 150-250K | 150-300 us | 15-25% core |
| SMB3 over RDMA (25 GbE RoCE v2) | 500-800K | 30-60 us | 3-8% core |
| SMB3 over RDMA (100 GbE RoCE v2) | 1.0-1.5M | 20-40 us | 2-5% core |
SMB Encryption: SMB3+ supports per-share or per-server encryption, using AES-128-CCM (SMB 3.0/3.0.2), AES-128-GCM (SMB 3.1.1+), or AES-256-GCM (Server 2022+). Encryption is negotiated during session setup and applies to all data in transit. Unlike NFSv4 krb5p, which encrypts at the RPC level, SMB encryption operates at the SMB transport level and has lower performance overhead (5-15% vs 30-50% for krb5p).
Continuously Available (CA) shares: SMB3 supports transparent failover between clustered file servers. If the node serving a share fails, the client's SMB session seamlessly moves to another cluster node. Open file handles, locks, and oplock leases are preserved. This is the foundation for Scale-Out File Server (SOFS) in Windows Server and for S2D's shared storage model.
SMB in Azure Local
SMB3 is the native storage protocol for Azure Local. Every I/O operation between compute and storage within an Azure Local / S2D cluster uses SMB3/RDMA:
Azure Local Storage Path
===========================
VM (Hyper-V)
|
v
Virtual Disk (VHDX file on CSV)
|
v
Cluster Shared Volume (CSVFS)
|
v
Storage Spaces Direct (S2D)
|
v
Software Storage Bus (SSB)
|
v
SMB3 Direct (RDMA) <-- All inter-node I/O
|
v
Remote node's local NVMe/SSD
Key point: Even when a VM accesses data on a remote node,
the I/O goes over SMB3/RDMA. This is transparent to the VM.
The VM sees a local VHDX; CSVFS and S2D handle the remoting.
This makes SMB3/RDMA performance the single most critical factor for Azure Local storage performance. If RDMA is misconfigured (wrong MTU, PFC not enabled, ECN not tuned), the cluster falls back to SMB3/TCP with 5-10x worse latency. Verifying RDMA health is a Day-1 priority for Azure Local deployments.
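Those failure modes can be captured in a pre-flight checklist. A sketch (the profile dict and its field names are invented for illustration -- a real check would query the NICs and switches, e.g. via PowerShell on Azure Local):

```python
# Flag the common misconfigurations that silently drop an S2D cluster
# from SMB Direct (RDMA) back to SMB over TCP.
def rdma_config_issues(profile):
    """profile: observed NIC/switch settings -> list of warning strings."""
    issues = []
    if not profile.get("pfc_enabled", False):
        issues.append("PFC disabled: RoCE v2 frames can be dropped under congestion")
    elif profile.get("pfc_priority") != 3:
        issues.append("PFC not on priority 3 (the usual SMB Direct traffic class)")
    if not profile.get("ecn_enabled", False):
        issues.append("ECN disabled: congestion handled by pause frames alone")
    if profile.get("mtu", 1500) < 9000:
        issues.append("MTU below 9000: avoidable per-frame overhead on storage traffic")
    return issues

print(rdma_config_issues(
    {"pfc_enabled": True, "pfc_priority": 3, "ecn_enabled": False, "mtu": 1514}
))  # two findings: ECN off, MTU too small
```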
Samba on Linux
Samba is the open-source implementation of the SMB protocol for Linux/UNIX systems. It enables Linux servers to serve files to Windows clients and vice versa. Samba 4.x supports SMB3.1.1 including encryption and multichannel.
Relevance to the evaluation: In a mixed OVE/Linux environment with Windows VMs that require SMB file shares, there are two options:
- External Windows file server or NAS appliance serving SMB3 shares. The VMs access the shares over the network.
- Samba running on a Linux VM or container within OVE, providing SMB3 shares to Windows VMs. This avoids a dependency on Windows infrastructure but introduces the complexity of running Samba in production (AD integration, Kerberos, CTDB for clustering).
Neither option is as integrated as the Azure Local model, where SMB3 is the native fabric protocol and Windows file sharing is a first-class capability.
When SMB Matters in the Evaluation
SMB matters in three scenarios:
-
Windows VM workloads: Windows VMs that access shared file servers (DFS namespaces, department shares, application data) expect SMB. This is a workload requirement, not a platform requirement -- it applies to all candidates.
-
Azure Local platform internals: SMB3/RDMA is the storage fabric protocol for S2D. Its performance directly determines Azure Local's storage performance. Understanding SMB3 internals is therefore essential for Azure Local evaluation.
-
Migration of existing Windows file services: If the current VMware environment hosts Windows file servers that serve SMB shares to the enterprise, these file servers must continue operating on the target platform. On OVE, they run as KubeVirt VMs; on Azure Local, they run as Hyper-V VMs or as Scale-Out File Server roles; on ESC, they run as managed VMs.
How the Candidates Handle This
Protocol Support Matrix
| Protocol | VMware (Current) | OVE (OpenShift Virtualization Engine) | Azure Local | Swisscom ESC |
|---|---|---|---|---|
| iSCSI | ESXi software initiator; iSCSI datastores common | Via CSI drivers for external arrays; not used for ODF/Ceph internal | Windows iSCSI Initiator for external SAN; not used for S2D internal | Available via VxBlock infrastructure (secondary to FC) |
| NVMe-oF | vSphere 7.0+ supports NVMe/TCP and NVMe/FC datastores | Ceph NVMe-oF gateway (experimental); external arrays via CSI | NVMe/TCP support for external storage (23H2+); S2D uses local NVMe directly | Dell arrays support NVMe/FC; customer has no protocol choice |
| FC | Full native support; common in enterprise VMware environments | Not Kubernetes-native; supported via vendor CSI drivers at bare-metal level | Native Windows FC HBA support; usable for external SAN | Primary storage transport (32G FC to Dell PowerMax/PowerStore) |
| MPIO | ESXi native multipath (NMP/PSA), round-robin, fixed, MRU | dm-multipath at Linux host level; NVMe native multipath for NVMe-oF | Windows MPIO (MSDSM); SMB Multichannel for S2D internal | Managed by Swisscom; customer has no visibility |
| NFSv3 | Full support; NFS datastores widely used | Kubernetes NFS CSI driver; CephFS via NFS-Ganesha gateway | Windows NFS client; not primary storage path | Available as managed NAS service |
| NFSv4 | vSphere 6.0+ supports NFSv4.1 datastores | Kubernetes NFS CSI driver; NFS-Ganesha gateway supports v4.1/v4.2 | Windows NFSv4.1 client (Server 2022+) | Available as managed NAS service |
| SMB3 | Not used for datastores; only for Windows VM guest access | Samba on Linux or external Windows file server | Native fabric protocol (SMB3/RDMA for S2D); first-class citizen | Available for Windows VM guest access |
Platform-Specific Protocol Analysis
OVE -- Protocol-Independent by Design: OVE's storage architecture is deliberately protocol-agnostic at the Kubernetes layer. VMs consume storage through the CSI interface, which abstracts the underlying protocol. ODF/Ceph communicates internally using its own RADOS protocol (a custom binary protocol over TCP or RDMA/msgr2), not iSCSI, NVMe-oF, or FC. When external storage is consumed, the CSI driver (NetApp Trident, Pure CSI, Dell CSI) handles the protocol negotiation. This means the OVE operations team needs protocol expertise only when integrating external storage arrays -- the internal storage path is Ceph-specific and does not use any of the traditional storage protocols.
Azure Local -- SMB3 as the Backbone: Azure Local's architecture is built entirely on SMB3. The Software Storage Bus uses SMB3/RDMA for all inter-node storage traffic. CSV (Cluster Shared Volumes) uses SMB3 for redirected I/O when the coordinating node is not the same as the owner node. This deep integration means that SMB3/RDMA performance is Azure Local's performance. The operations team must understand SMB3 multichannel, RDMA configuration (RoCE v2, PFC, DCBX), and the Windows SMB client/server architecture. For external storage, Azure Local can consume iSCSI, FC, and NVMe/TCP, but S2D is the primary and recommended storage model.
Swisscom ESC -- FC Behind the Curtain: ESC's storage transport is Fibre Channel between VxBlock compute and PowerMax/PowerStore arrays. The customer never interacts with FC, MPIO, or any storage protocol directly. NFS and SMB are available as managed add-on services for file-level access. The protocol choice is Swisscom's operational decision, and changes (e.g., migration from FC to NVMe/FC) would happen transparently. This is the trade-off of a managed service: zero protocol operational burden, but also zero protocol optimization capability.
Key Takeaways
-
Protocol choice is a platform choice, not a storage choice. Selecting OVE means accepting Ceph RADOS as the internal storage protocol (with CSI as the abstraction layer). Selecting Azure Local means committing to SMB3/RDMA as the storage fabric. Selecting ESC means accepting FC as the transport (managed by Swisscom). The traditional freedom to choose between iSCSI, FC, and NFS for datastore connectivity (as in VMware) does not exist in any of the candidates -- the protocol is architecturally determined.
-
NVMe-oF is the future but not the present for any candidate. None of the three candidates use NVMe-oF as their primary internal storage protocol today. OVE uses RADOS, Azure Local uses SMB3/RDMA, ESC uses FC. NVMe-oF is relevant for consuming external storage arrays and will likely become the standard for external connectivity within 2-3 years. Ensure your network infrastructure (RDMA-capable NICs, lossless Ethernet) can support NVMe-oF when it matures.
-
RDMA is the performance differentiator. Whether it is SMB Direct (Azure Local) or NVMe/RDMA (future external storage), RDMA-capable networking cuts latency by a factor of 5-10 compared to TCP-based protocols. Any new hardware procurement should specify RDMA-capable NICs (Mellanox/NVIDIA ConnectX-6 or later) and switches that support DCB (PFC, ECN, DCBX). This is a one-time infrastructure investment that benefits all three candidates.
-
Fibre Channel is a Swisscom ESC dependency, not a platform requirement. If you choose OVE or Azure Local with HCI storage (ODF or S2D), FC is not needed for primary storage. FC remains relevant only if you choose to consume external SAN arrays alongside HCI. Given the operational cost and skill specialization of FC, HCI-native storage (eliminating FC) is one of the clearest cost-reduction opportunities in this migration.
-
NFSv4 with Kerberos should be the standard for file shares. For any NFS-based file sharing (configuration repositories, shared data, inter-application communication), mandate NFSv4 with krb5i or krb5p. NFSv3 with AUTH_SYS is not acceptable for a financial institution, regardless of network segmentation. The performance overhead of Kerberos is the cost of doing business in a regulated environment.
-
SMB expertise is non-negotiable for Azure Local. If Azure Local is selected, the operations team must develop deep SMB3/RDMA expertise -- this is not just a "Windows file share" protocol, it is the storage fabric. RDMA misconfiguration (PFC errors, ECN tuning, incorrect MTU) will directly impact every VM's storage performance. This is analogous to understanding vSAN's RDT protocol in the current VMware environment.
-
Multipath design must match the protocol. iSCSI and FC use dm-multipath with ALUA. NVMe-oF uses NVMe native multipath with ANA. SMB3 uses built-in multichannel. Mixing multipath approaches or using the wrong mechanism (e.g., dm-multipath for NVMe) introduces unnecessary complexity and may miss failover events. Define the multipath standard for each protocol before the PoC.
-
The iSCSI performance tax is real but manageable. iSCSI on modern 25/100 GbE with jumbo frames and dedicated NICs delivers adequate performance for the vast majority of enterprise workloads. The 5-10x latency advantage of NVMe/RDMA matters only for latency-sensitive tier-1 databases and trading systems. Do not over-engineer the storage network for NVMe/RDMA unless a measurable workload justifies it.
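The multipath takeaway above can be turned into a concrete pre-PoC check. The following is a minimal sketch, assuming a Linux node with multipath-tools (`multipath`) and nvme-cli (`nvme`) installed; it is an illustrative audit, not a platform-supplied tool, and the "conflict" heuristic at the end is our own assumption.

```python
#!/usr/bin/env python3
"""Sketch: audit which multipath stack is active on a Linux host before a PoC."""
import shutil
import subprocess


def run(cmd):
    """Run a command and return its stdout, or '' if the tool is missing or fails."""
    if shutil.which(cmd[0]) is None:
        return ""
    try:
        result = subprocess.run(cmd, capture_output=True, text=True, timeout=10)
        return result.stdout
    except (subprocess.SubprocessError, OSError):
        return ""


def audit_multipath():
    """Report which multipath mechanisms currently claim devices."""
    dm_out = run(["multipath", "-ll"])       # dm-multipath: iSCSI/FC with ALUA
    nvme_out = run(["nvme", "list-subsys"])  # NVMe native multipath with ANA
    return {
        "dm_multipath_active": bool(dm_out.strip()),
        "nvme_native_active": "nvme-subsys" in nvme_out,
        # dm-multipath claiming NVMe namespaces is the mismatch the takeaway
        # above warns against -- flag it for manual review.
        "dm_claims_nvme": "nvme" in dm_out.lower(),
    }


if __name__ == "__main__":
    for check, state in audit_multipath().items():
        print(f"{check}: {state}")
```

Running this on every candidate node during the PoC makes the "define the multipath standard per protocol" rule verifiable rather than aspirational.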
Discussion Guide
The following questions are designed for vendor deep-dives, PoC planning sessions, and internal architecture reviews. They probe the practical implications of storage protocol choices across the candidate platforms.
Questions for All Candidates
-
Internal storage protocol and performance ceiling: "What protocol does your platform use for internal storage traffic between compute and storage components? What is the maximum measured single-VM IOPS and p99 latency on this internal protocol at 80% cluster utilization? We need these numbers from a real benchmark on hardware comparable to our deployment size, not from a datasheet."
-
RDMA readiness and fallback behavior: "Does your platform support RDMA for storage traffic? If RDMA is enabled and a transient PFC/ECN misconfiguration causes RDMA failures, does the platform fall back to TCP gracefully, or does it fail hard? How do we monitor RDMA health proactively (counter thresholds for PFC pause frames, RoCE retransmissions, ECN-marked packets)?"
-
Multipath failover timing: "Walk us through the exact failover sequence when a storage path fails. What is the measured time from cable disconnection to I/O resumption on the surviving path? What I/Os are in-flight during failover -- are they retried or failed back to the application? What is the impact on VM-visible latency during a failover event?"
-
Protocol upgrade path: "Our current infrastructure uses iSCSI/FC for external storage connectivity. What is the migration path to NVMe-oF (TCP or RDMA) for consuming external storage arrays? Can both protocols coexist during a transition period? What hardware (NICs, switches, HBAs) needs to change?"
-
Encryption in transit: "How is storage traffic encrypted between compute and storage nodes? Is encryption enabled by default? What cipher and key length are used? What is the measured performance overhead of encryption on storage I/O? For regulatory compliance, we require encryption of all data in transit -- confirm that this is achieved for all storage protocol paths, including inter-node replication traffic."
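To collect the benchmark evidence the first question demands (rather than accepting datasheet numbers), a fio-based probe can be scripted. This sketch assumes fio 3.x, whose JSON output reports completion-latency percentiles under `clat_ns`; the target device and job parameters are illustrative and must be adapted to the PoC hardware.

```python
"""Sketch: measure p99 read latency with fio and parse the JSON result."""
import json
import subprocess

FIO_JOB = [
    "fio", "--name=p99probe",
    "--filename=/dev/nvme0n1",          # hypothetical test device -- adjust
    "--rw=randread", "--bs=4k", "--iodepth=32", "--numjobs=4",
    "--runtime=60", "--time_based", "--direct=1", "--group_reporting",
    "--output-format=json",
]


def p99_latency_us(fio_json):
    """Extract the 99th-percentile read completion latency in microseconds."""
    read_stats = fio_json["jobs"][0]["read"]
    # fio reports clat percentiles in nanoseconds, keyed as "99.000000"
    return read_stats["clat_ns"]["percentile"]["99.000000"] / 1000.0


def run_probe():
    """Run the fio job and return p99 read latency in microseconds."""
    out = subprocess.run(FIO_JOB, capture_output=True, text=True, check=True)
    return p99_latency_us(json.loads(out.stdout))
```

Repeating the probe at increasing cluster utilization levels (idle, 50%, 80%) produces exactly the "80% utilization" data point the question asks each vendor to substantiate.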
Questions Specific to OVE
-
Ceph RADOS protocol security: "Ceph uses its own RADOS protocol (msgr2) for internal cluster communication. Is msgr2 encryption (cephx + on-wire encryption) enabled by default in ODF? What is the performance impact? Can we enforce mutual authentication between OSDs and clients to prevent a compromised node from reading another tenant's data?"
-
External storage protocol support via CSI: "If we connect a NetApp/Pure/Dell array via CSI, which protocol does the CSI driver use -- iSCSI, FC, or NVMe-oF? Can we influence the protocol selection? What multipath mechanism is used at the node level for CSI-attached external volumes, and how is it configured (dm-multipath vs NVMe native)?"
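The msgr2 question can be pre-answered on any ODF cluster with `ceph` CLI access. The option names below are real Ceph settings; the enforcement rule is our reading of them: only a mode list of exactly `secure` guarantees on-wire encryption, because `crc secure` still permits unencrypted (crc-only) sessions. This is a sketch, and the `mon` scope may need adjusting for a given deployment.

```python
"""Sketch: check whether Ceph msgr2 on-wire encryption is enforced."""
import subprocess

MSGR2_MODE_OPTIONS = ["ms_cluster_mode", "ms_service_mode", "ms_client_mode"]


def get_mode(option):
    """Read a messenger mode option for the monitors (adjust scope as needed)."""
    out = subprocess.run(["ceph", "config", "get", "mon", option],
                         capture_output=True, text=True, check=True)
    return out.stdout.strip()


def enforces_encryption(mode):
    """True only when 'secure' is the sole permitted messenger mode."""
    return mode.split() == ["secure"]


def audit_msgr2():
    """Map each messenger option to whether it enforces encryption."""
    return {opt: enforces_encryption(get_mode(opt)) for opt in MSGR2_MODE_OPTIONS}
```

Any `False` in the audit result is the performance-versus-compliance discussion to have with the vendor before, not after, the PoC.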
Questions Specific to Azure Local
-
SMB3/RDMA health verification: "Show us how to verify that RDMA is active and healthy on all cluster nodes. What PowerShell cmdlets or monitoring tools report RDMA connection status, PFC statistics, and ECN marking rates? What are the alert thresholds for RDMA degradation that should trigger proactive intervention?"
-
SMB Direct fallback behavior: "If RDMA fails on a subset of NICs (e.g., due to firmware bug or cable issue), does the Software Storage Bus fall back to SMB3/TCP for those paths? Is this fallback automatic and transparent? What is the latency impact, and how quickly does the system recover when RDMA is restored?"
-
SMB encryption for compliance: "Is SMB encryption (AES-256-GCM) enabled by default for S2D inter-node traffic? If not, what is the performance overhead of enabling it? Does enabling encryption conflict with SMB Direct/RDMA (since RDMA bypasses the kernel where encryption would typically be applied)?"
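The RDMA health-verification question above can be partially scripted from an automation host. `Get-NetAdapterRdma` is a real Windows cmdlet; the PowerShell remoting wrapper and node name below are illustrative assumptions about the automation setup.

```python
"""Sketch: probe RDMA-enabled adapters on an Azure Local node via PowerShell."""
import json
import subprocess

PS_SNIPPET = "Get-NetAdapterRdma | Select-Object Name, Enabled | ConvertTo-Json"


def parse_rdma_enabled(json_text):
    """Return the names of adapters reporting RDMA as enabled."""
    data = json.loads(json_text)
    adapters = data if isinstance(data, list) else [data]  # one adapter = object
    return [a["Name"] for a in adapters if a.get("Enabled")]


def probe_node(node):
    """Query one cluster node over PowerShell remoting (hypothetical setup)."""
    cmd = ["pwsh", "-Command",
           f"Invoke-Command -ComputerName {node} -ScriptBlock {{ {PS_SNIPPET} }}"]
    out = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return parse_rdma_enabled(out.stdout)
```

A node whose storage-intent adapters drop out of this list is exactly the silent SMB3/TCP fallback scenario the previous question probes.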
Questions Specific to Swisscom ESC
-
Storage protocol transparency: "Which storage protocol connects our VMs to the underlying storage? Is it FC, iSCSI, or NVMe/FC? Can we see storage path status (active/standby, failover events, latency per path) in the self-service portal? If a path failover occurs, is it logged and visible to us, or only to Swisscom's operations team?"
-
Protocol evolution roadmap: "Is Swisscom planning to migrate the ESC storage backend from FC to NVMe/FC or NVMe/TCP? If so, what is the timeline, and will this migration be transparent to tenants? Will there be a performance improvement, and how will it be reflected in SLA terms?"
Architecture-Level Questions (for Internal Discussion)
-
Protocol skills assessment: "Which storage protocols does our current operations team have hands-on experience with? FC? iSCSI? SMB3? NFS? Do we have RDMA networking experience? The skill gap between our current VMware/FC/iSCSI world and the target platform's protocol stack determines the training investment and the risk of early-lifecycle incidents."
-
Converged vs dedicated storage network: "In the target architecture, should storage traffic run on a dedicated physical network (separate NICs, separate switches) or on a converged network with QoS/VLAN separation? What is the cost delta? What is the risk delta? For NVMe/RDMA or SMB Direct, lossless Ethernet configuration (PFC, ECN) on a converged network is complex -- is a dedicated storage network simpler and safer even if more expensive?"
-
Protocol standardization policy: "Should we standardize on a single storage protocol for external storage connectivity (e.g., NVMe/TCP for all external array access), or allow protocol diversity (iSCSI for legacy, NVMe/TCP for new)? Standardization reduces operational complexity but may require hardware refresh. What is the 5-year total cost of each approach?"
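The 5-year cost question in the last item reduces to a simple comparison of one-time refresh cost against ongoing dual-stack operations. The sketch below shows the shape of that calculation; every figure is a placeholder assumption to be replaced with real quotes, not an estimate of actual costs.

```python
"""Worked sketch: undiscounted 5-year cost comparison for protocol standardization."""

def five_year_cost(capex, annual_opex, years=5):
    """Total cost of ownership over the planning horizon (no discounting)."""
    return capex + annual_opex * years

# Option A: standardize on NVMe/TCP -- hardware refresh now, one stack to run
standardized = five_year_cost(capex=1_200_000, annual_opex=150_000)
# Option B: keep iSCSI alongside NVMe/TCP -- no refresh, two stacks to operate
diverse = five_year_cost(capex=300_000, annual_opex=320_000)

print(f"standardize: {standardized:>12,.0f}")
print(f"diversify:   {diverse:>12,.0f}")
```

With these placeholder inputs the two options land within a few percent of each other, which is precisely why the question belongs in an internal architecture review rather than being decided by default.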
Next: 04-storage-architectures.md -- Storage Architectures (SAN, NAS, HCI / Software-Defined Storage)