Network Observability
Why This Matters
The previous chapters covered how packets move: physical connectivity (LACP, ECMP, spine-leaf), overlays (OVN, GENEVE, Multus), advanced data paths (SR-IOV, DPDK), routing and security (DVR, NetworkPolicy, micro-segmentation), and platform networking (Services, Routes, MetalLB). All of that knowledge is about building and configuring the network. This chapter is about seeing what the network is doing after it is built -- and about finding what is broken when it stops working.
Network observability is the difference between a team that operates by hypothesis ("it is probably a firewall rule") and a team that operates by evidence ("the flow logs show 14,000 dropped packets from subnet 10.128.4.0/23 to port 5432, matching NetworkPolicy deny-db-access applied at 14:32 UTC"). For an organization running 5,000+ VMs, spread across hundreds of OVN logical switches and multiple physical racks, operating without network observability is operating blind.
In the VMware world, observability was provided by a combination of tools: vRealize Network Insight (now Aria Operations for Networks) for flow analysis and topology visualization, NSX DFW flow logs for micro-segmentation audit, vCenter performance charts for NIC utilization, and IPFIX export from the vDS for third-party collectors. These tools are mature, well-integrated with vSphere, and familiar to the operations team. They are also gone the moment VMware is decommissioned.
The replacement stack depends on the target platform:
- OVE uses the OpenShift Network Observability Operator -- an eBPF-based flow capture pipeline that feeds into Loki and surfaces in the OpenShift console, plus the standard Prometheus/Grafana stack for metrics, OVS/OVN native tools for troubleshooting, and IPFIX export from OVS for external collectors.
- Azure Local uses Windows Admin Center for basic network monitoring, Azure Monitor (via Arc) for metrics and alerts, SDN diagnostics for the Microsoft SDN stack, and SNMP/NetFlow from the physical switches.
- Swisscom ESC provides managed monitoring through the Swisscom operations portal. The customer has limited direct visibility into network flows and must rely on Swisscom's operations team for deep diagnostics.
This chapter covers five topics that together form a complete network observability strategy:
- Network Flow Monitoring -- capturing and analyzing traffic flows (NetFlow, IPFIX, sFlow, eBPF)
- Packet Capture and Analysis -- deep inspection of individual packets for root-cause analysis
- Network Metrics and Dashboards -- continuous monitoring of network health indicators
- Troubleshooting Methodology -- the systematic, layer-by-layer approach to diagnosing network problems
- Compliance and Audit -- meeting FINMA requirements for network logging, retention, and forensics
The troubleshooting methodology section is the most operationally critical. When a production VM loses connectivity at 2:00 AM, the on-call engineer needs a repeatable, systematic procedure -- not guesswork. The method presented here works bottom-up through the stack: physical NIC, virtual switch, overlay tunnel, logical switch, pod network, application. Each layer has specific tools and specific failure modes.
Concepts
1. Network Flow Monitoring
The Three Pillars of Network Observability
Network observability rests on three complementary data sources. Each provides a different level of detail, operates at a different cost, and answers different questions:
| Pillar | What It Captures | Granularity | Overhead | Use Case |
|---|---|---|---|---|
| Metrics | Aggregate counters: bytes/sec, packets/sec, errors, drops, retransmits | Per-interface, per-Service, per-node | Very low (<1% CPU) | Dashboards, alerting, capacity planning |
| Logs | Discrete events: connection accepted/denied, policy matched, error occurred | Per-event | Low-medium (depends on volume) | Audit trails, policy debugging, compliance |
| Flows | Conversation records: src/dst IP, ports, protocol, bytes transferred, duration | Per-connection or per-sample | Medium (depends on sampling rate) | Traffic analysis, anomaly detection, forensics, capacity planning |
The Three Pillars and Their Data Sources:
+-----------------------------------------------------------------------+
| Network Observability |
+-----------------------------------------------------------------------+
| | |
v v v
+-------------+ +-------------+ +-------------+
| METRICS | | LOGS | | FLOWS |
|-------------| |-------------| |-------------|
| Prometheus | | Loki | | IPFIX |
| SNMP | | syslog | | sFlow |
| ethtool | | OVN ACL log | | NetFlow |
| node_exp. | | audit log | | eBPF Hubble |
+------+------+ +------+------+ +------+------+
| | |
v v v
+-----------------------------------------------------------------------+
| Visualization & Analysis (Grafana, OpenShift Console, SIEM) |
+-----------------------------------------------------------------------+
| | |
v v v
"How much traffic?" "What happened?" "Who talked to whom?"
"Is the link healthy?" "Why was it denied?" "What is the pattern?"
Metrics answer quantitative questions: bandwidth utilization, error rates, packet drop counts. They are cheap to collect (counters that increment) and cheap to store (time-series databases compress well). Prometheus is the standard metrics store in OVE. Metrics are essential for alerting ("interface error rate > 0.1% for 5 minutes") and dashboards ("top-10 busiest nodes by network throughput").
Logs answer event-level questions: "why was this connection rejected?" "which NetworkPolicy denied traffic from pod X to pod Y?" Logs are structured or semi-structured records of discrete events. In OVE, OVN ACL logs capture every packet that matches a deny rule. In VMware, the NSX DFW log served the same purpose. Logs are essential for compliance (FINMA requires audit trails for security policy enforcement) and for debugging specific incidents.
Flows answer relationship and pattern questions: "which VMs are talking to each other?" "what is the traffic distribution across subnets?" "is there an anomalous spike in east-west traffic from the database tier to an unexpected destination?" Flows aggregate packets into conversations (a "flow" is defined by a 5-tuple: source IP, destination IP, source port, destination port, protocol). Flow records capture metadata about the conversation -- not the packet payload. This makes flows cheaper than full packet capture but more informative than aggregate metrics.
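The aggregation step a flow exporter performs can be sketched in a few lines of Python. This is a toy model (all names are mine, not any exporter's actual code): packets sharing a 5-tuple fold into a single record that accumulates bytes, packets, and first/last timestamps -- metadata only, never payload.

```python
from collections import namedtuple

# A flow is identified by its 5-tuple; the record holds conversation metadata.
FiveTuple = namedtuple("FiveTuple", "src_ip dst_ip src_port dst_port proto")

def aggregate(packets):
    """Fold individual packets into per-flow records.

    Assumes packets arrive in timestamp order, so the first packet seen
    for a key sets the flow start time.
    """
    flows = {}
    for pkt in packets:  # pkt: dict with 5-tuple fields, size, ts
        key = FiveTuple(pkt["src_ip"], pkt["dst_ip"],
                        pkt["src_port"], pkt["dst_port"], pkt["proto"])
        rec = flows.setdefault(key, {"bytes": 0, "packets": 0,
                                     "first": pkt["ts"], "last": pkt["ts"]})
        rec["bytes"] += pkt["size"]
        rec["packets"] += 1
        rec["last"] = max(rec["last"], pkt["ts"])
    return flows
```

Two packets with the same 5-tuple produce one record; a third packet from a different source port starts a second flow. This is why flow data is far smaller than packet data while still answering "who talked to whom, and how much".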
A mature observability strategy uses all three pillars. Metrics trigger the alert. Flows narrow the scope to specific conversations. Logs confirm the policy or event. If all three fail to explain the problem, packet capture provides the definitive answer -- but packet capture is expensive and targeted, not continuous.
Flow Protocols: NetFlow, IPFIX, sFlow
Three major protocols exist for exporting flow data from network devices and virtual switches. Understanding their differences is essential because the choice affects what data is available, how it is collected, and what tools can consume it.
NetFlow v5 and v9
NetFlow was developed by Cisco in the mid-1990s. NetFlow v5 is a fixed-format protocol: every flow record has exactly the same fields (source IP, destination IP, source port, destination port, protocol, bytes, packets, start time, end time, input/output interface, TCP flags, ToS). The fixed format makes v5 simple to parse but inflexible -- it cannot carry IPv6 addresses, MPLS labels, VXLAN VNIs, or any field that was not defined when v5 was created.
NetFlow v9 introduced template-based records. Instead of a fixed format, the exporter sends a template that describes the fields in subsequent data records. The collector must receive and parse the template before it can decode the data. This allows v9 to carry arbitrary fields, including IPv6, MPLS, and vendor-specific extensions.
NetFlow v5 Fixed Record Format (48 bytes per flow):
+0 +4 +8 +12 +16
+-----------+-----------+-----------+-----------+
| src_addr | dst_addr | nexthop | input_if |
+-----------+-----------+-----------+-----------+
| output_if | packets | bytes | first_ts |
+-----------+-----------+-----------+-----------+
| last_ts |src_port|dst_port|pad|flg|prot|tos|
+-----------+-----------+-----------+-----------+
| src_as | dst_as |src_msk|dst_msk| pad |
+-----------+-----------+-----------+-----------+
Limitation: IPv4 only, fixed 48-byte record,
no extensibility, no bidirectional flows.
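The fixed layout is exactly what makes v5 trivial to parse: every record is 48 bytes in network byte order, so a decoder is a single struct unpack. The sketch below (field names and sample values are illustrative, not from any specific collector) shows the idea:

```python
import socket
import struct

# NetFlow v5 flow record: fixed 48 bytes, network byte order.
# Field order: srcaddr, dstaddr, nexthop, input, output, dPkts, dOctets,
# first, last, srcport, dstport, pad1, tcp_flags, prot, tos,
# src_as, dst_as, src_mask, dst_mask, pad2.
V5_RECORD = struct.Struct("!IIIHHIIIIHHBBBBHHBBH")
assert V5_RECORD.size == 48  # the fixed size is also what makes v5 inflexible

def parse_v5_record(data: bytes) -> dict:
    """Decode one 48-byte NetFlow v5 flow record."""
    f = V5_RECORD.unpack(data)
    return {
        "src": socket.inet_ntoa(struct.pack("!I", f[0])),
        "dst": socket.inet_ntoa(struct.pack("!I", f[1])),
        "packets": f[5], "bytes": f[6],
        "src_port": f[9], "dst_port": f[10],
        "tcp_flags": f[12], "proto": f[13],
    }
```

Note that the 4-byte address fields are the whole story: there is simply nowhere to put a 16-byte IPv6 address, which is why v9's templates were needed.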
NetFlow v9 Template-Based Format:
Template Flowset (sent periodically):
+--------------------------------------------------+
| Template ID: 256 |
| Field Count: 8 |
| Field 1: SRC_ADDR (type 8, length 4) |
| Field 2: DST_ADDR (type 28, length 16) <- IPv6! |
| Field 3: SRC_PORT (type 7, length 2) |
| Field 4: DST_PORT (type 11, length 2) |
| ... |
+--------------------------------------------------+
Data Flowset (references template 256):
+--------------------------------------------------+
| Template ID: 256 |
| Record 1: [values in template-defined order] |
| Record 2: [values in template-defined order] |
| ... |
+--------------------------------------------------+
NetFlow v5 is still encountered on older Cisco switches and routers, and it remains useful for basic traffic analysis on legacy infrastructure. NetFlow v9 is the direct predecessor of IPFIX, which inherited its template-based approach. For new deployments, IPFIX has effectively superseded NetFlow v9.
IPFIX (IP Flow Information Export)
IPFIX (RFC 7011) is the IETF-standardized evolution of NetFlow v9. It uses the same template-based approach but adds several important capabilities:
- Variable-length fields: IPFIX supports variable-length information elements (e.g., application name strings, HTTP URLs). NetFlow v9 required fixed-length fields.
- Enterprise-specific information elements: Vendors can define custom fields using a Private Enterprise Number (PEN). This allows OVS, VMware, and others to export platform-specific data (e.g., OVN logical port ID, VXLAN tunnel endpoint) without conflicting with standard fields.
- Bidirectional flows (Biflow): IPFIX can represent both directions of a conversation in a single record (RFC 5103). NetFlow requires two separate unidirectional records.
- SCTP and TCP transport: IPFIX supports reliable delivery over TCP or SCTP, in addition to UDP. This is important for compliance use cases where flow loss is unacceptable.
- Structured data types: IPFIX supports lists (basicList, subTemplateList, subTemplateMultiList) for encoding multiple values in a single field, such as a list of MPLS labels or a list of NAT translations.
IPFIX Template Structure (simplified):
+------------------------------------------------------+
| Message Header |
| +--------------------------------------------------+ |
| | Version: 10 (IPFIX) | |
| | Length: total message length | |
| | Export Time: UNIX timestamp | |
| | Sequence Number: for loss detection | |
| | Observation Domain ID: identifies the exporter | |
| +--------------------------------------------------+ |
| |
| Template Set (Set ID = 2) |
| +--------------------------------------------------+ |
| | Template ID: 256 | |
| | Field Count: 10 | |
| | | |
| | IE 8: sourceIPv4Address (4 bytes) | |
| | IE 12: destinationIPv4Address (4 bytes) | |
| | IE 7: sourceTransportPort (2 bytes) | |
| | IE 11: destinationTransportPort (2 bytes) | |
| | IE 4: protocolIdentifier (1 byte) | |
| | IE 1: octetDeltaCount (8 bytes) | |
| | IE 2: packetDeltaCount (8 bytes) | |
| | IE150: flowStartSeconds (4 bytes) | |
| | IE151: flowEndSeconds (4 bytes) | |
| | IE 6: tcpControlBits (2 bytes) | |
| +--------------------------------------------------+ |
| |
| Data Set (Set ID = 256, references template above) |
| +--------------------------------------------------+ |
| | Record 1: 10.128.0.5 | 10.128.4.15 | 45321 | | |
| | 5432 | 6 (TCP) | 1048576 | 8192 | | |
| | 1714300800 | 1714300860 | 0x12 (SYN+ACK)| |
| | Record 2: ... | |
| +--------------------------------------------------+ |
+------------------------------------------------------+
Key: IE = Information Element number (from IANA registry)
IE 8 = sourceIPv4Address, IE 12 = destinationIPv4Address, etc.
IPFIX in OVS: Open vSwitch has native IPFIX support. OVS can export flow records for every connection traversing a bridge (typically br-int in OVN deployments). The IPFIX exporter in OVS tracks flows based on the standard 5-tuple and exports records when a flow ends, when a flow has been active for a configurable "active timeout" (e.g., 60 seconds), or when the flow cache is full.
Configuring IPFIX export on OVS:
# Enable IPFIX on bridge br-int, export to collector at 10.0.0.50:4739
ovs-vsctl -- set Bridge br-int ipfix=@i \
-- --id=@i create IPFIX targets=\"10.0.0.50:4739\" \
obs_domain_id=1 \
obs_point_id=1 \
sampling=64 \
cache_active_timeout=60 \
cache_max_flows=4096
# Verify IPFIX configuration
ovs-vsctl list IPFIX
# Remove IPFIX configuration
ovs-vsctl clear Bridge br-int ipfix
Key parameters:
- sampling=64 -- OVS samples 1 out of every 64 packets for flow creation. Setting sampling=1 means every packet is tracked, which provides complete visibility but significantly increases CPU and memory usage. For a node running 100+ VMs, sampling=64 to sampling=256 is a practical range.
- cache_active_timeout=60 -- active flows are exported every 60 seconds. This determines how frequently the collector receives updates about long-lived connections.
- cache_max_flows=4096 -- maximum number of concurrent flows in the IPFIX cache. When the cache is full, the oldest flow is evicted and exported. Size this based on the expected number of concurrent connections per node.
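The interaction of these parameters can be modeled in a toy flow cache (class and method names are mine, not OVS internals). The sketch shows two of the three export triggers -- active timeout and cache-full eviction; export on flow end (FIN/RST) is omitted for brevity:

```python
class FlowCache:
    """Toy IPFIX-style cache: export on active timeout, or evict the
    oldest flow when the cache is full (a simplification of OVS behavior)."""

    def __init__(self, active_timeout=60, max_flows=4096):
        self.active_timeout = active_timeout
        self.max_flows = max_flows
        self.flows = {}       # 5-tuple key -> {"bytes", "start"}
        self.exported = []    # records handed to the collector

    def update(self, key, nbytes, now):
        # Cache full and this is a new flow: evict + export the oldest.
        if key not in self.flows and len(self.flows) >= self.max_flows:
            oldest = min(self.flows, key=lambda k: self.flows[k]["start"])
            self.exported.append((oldest, self.flows.pop(oldest)))
        rec = self.flows.setdefault(key, {"bytes": 0, "start": now})
        rec["bytes"] += nbytes
        # Active timeout reached: export so the collector sees long flows.
        if now - rec["start"] >= self.active_timeout:
            self.exported.append((key, self.flows.pop(key)))
```

Sizing cache_max_flows too small shows up exactly as the first branch firing constantly: flows are exported and re-created, inflating record volume and splitting long conversations into fragments.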
sFlow
sFlow (RFC 3176) takes a fundamentally different approach from NetFlow/IPFIX. Instead of tracking flows (maintaining state for every active connection), sFlow uses statistical packet sampling: it copies a random 1-in-N packet and exports the packet header (typically the first 128 bytes) along with interface counter data.
The distinction is important:
| Aspect | IPFIX (Flow-Based) | sFlow (Sample-Based) |
|---|---|---|
| State per connection | Yes -- flow cache maintains one entry per 5-tuple | No -- stateless, samples are independent |
| Accuracy for long flows | High -- every packet is counted | Statistical -- accuracy depends on sampling rate |
| Accuracy for short flows | High | Low -- short flows may not be sampled at all |
| CPU overhead at high speed | High -- must hash every packet for flow lookup | Low -- only processes 1-in-N packets |
| Best for | Detailed per-flow accounting, billing, forensics | High-speed link monitoring (40G/100G), real-time traffic visualization |
| Typical deployment | Virtual switch (OVS), lower-speed physical switches | Physical switches at 40G/100G/400G, spine-leaf fabric |
sFlow is better suited for monitoring the physical spine-leaf fabric (where links are 100G+ and the flow count is enormous), while IPFIX is better suited for monitoring the virtual switch (where the overhead is acceptable and per-flow accuracy matters for compliance and troubleshooting).
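The trade-off is quantifiable. Scaling a 1-in-N sampled counter back up is a simple multiplication, and the sFlow sampling literature gives a standard rule of thumb for the worst-case relative error at roughly 95% confidence: about 196 * sqrt(1/c) percent, where c is the number of samples observed for a traffic class. A hedged sketch (function names are mine):

```python
import math

def scale_up(sampled_count: float, sampling_rate: int) -> float:
    """Estimate the true packet/byte count from 1-in-N sampled counters."""
    return sampled_count * sampling_rate

def sampling_error_pct(samples: int) -> float:
    """Approximate worst-case relative error (~95% confidence) for a traffic
    class observed in `samples` samples; the 196*sqrt(1/c) rule of thumb
    comes from the sFlow sampling literature."""
    return 196.0 * math.sqrt(1.0 / samples)
```

With 10,000 samples of a traffic class, the estimate is within about 2% -- plenty for capacity planning on a 100G link. A short flow that contributes only a handful of samples (or none) has an error bound of tens of percent or worse, which is exactly the "low accuracy for short flows" row in the table above.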
In practice, a complete monitoring architecture uses both:
Combined IPFIX + sFlow Architecture:
Physical Spine-Leaf Fabric Virtual Infrastructure
(sFlow for high-speed links) (IPFIX for per-flow detail)
+--------+ +--------+
| Spine 1| | Spine 2|
| sFlow | | sFlow |
+---++---+ +---++---+
|| ||
+---++---+ +---++---+
| Leaf 1 | | Leaf 2 |
| sFlow | | sFlow |
+---+----+ +---+----+
| |
+---+----+ +---+----+
| Node 1 | | Node 2 |
| +----+ | | +----+ |
| |OVS | | | |OVS | |
| |IPFIX| | | |IPFIX| |
| +----+ | | +----+ |
+--------+ +--------+
| |
v v
+------------------------------+
| Flow Collector |
| (e.g., ElastiFlow, |
| ntopng, Kentik) |
+------------------------------+
|
v
+------------------------------+
| Analysis & Dashboards |
| (Grafana, Kibana, SIEM) |
+------------------------------+
eBPF-Based Flow Capture
eBPF (Extended Berkeley Packet Filter) has become the preferred mechanism for network observability in modern Kubernetes environments. Unlike IPFIX (which requires OVS configuration and a separate collector) or sFlow (which requires switch support), eBPF programs run directly in the Linux kernel on every node. They attach to kernel network hooks (TC ingress/egress, XDP, socket hooks) and capture flow metadata with negligible overhead because the eBPF program runs in kernel space -- no context switch to userspace, no packet copying.
Two major eBPF-based observability tools are relevant:
Cilium Hubble
Hubble is the observability layer of the Cilium CNI. While OVE does not use Cilium as its default CNI (it uses OVN-Kubernetes), Hubble is worth understanding because Cilium is an alternative CNI for OpenShift, and Hubble's architecture influenced the OpenShift Network Observability Operator.
Hubble provides:
- Flow visibility: Every connection between pods, services, and external endpoints is captured as a flow event with source/destination identity (namespace, pod name, service name -- not just IP addresses).
- L7 protocol parsing: Hubble can parse HTTP, gRPC, DNS, and Kafka at the application layer, providing request-level visibility (HTTP method, URL, response code, latency).
- Identity-aware flows: Because Cilium uses identity-based security (not IP-based), Hubble flows include the Kubernetes identity of both endpoints, making analysis dramatically easier than parsing raw IP addresses.
OpenShift Network Observability Operator
The Network Observability Operator is the primary network flow monitoring tool for OVE. It deploys an eBPF agent on every node that captures flow data from the kernel's Traffic Control (TC) hooks, processes the flows through a pipeline, stores them in Loki, and surfaces them in the OpenShift web console.
The architecture has four components:
OpenShift Network Observability Operator -- Pipeline Architecture:
+--Node 1-----------------------------------------------+
| |
| +--Pod A--+ +--Pod B--+ +--KubeVirt VM--+ |
| | | | | | (virt- | |
| +---------+ +---------+ | launcher) | |
| | | +-------+-------+ |
| v v v |
| +----+----+ +----+----+ +-------+-------+ |
| | veth | | veth | | tap / veth | |
| +----+----+ +----+----+ +-------+-------+ |
| | | | |
| +--------------+-----------------+ |
| | |
| +------+------+ |
| | br-int | <-- OVS bridge |
| | (OVS) | |
| +------+------+ |
| | |
| +--------[TC hooks: ingress/egress]--------+ |
| | | |
| v | |
| +-+------------------------------------------+-+ |
| | eBPF Agent (DaemonSet) | |
| | | |
| | - Attaches to TC ingress/egress hooks | |
| | - Captures: src/dst IP, ports, protocol, | |
| | bytes, packets, TCP flags, DSCP, DNS, | |
| | drop reason, direction, interface | |
| | - Enriches with: node name, zone | |
| | - Exports flow records via gRPC/Kafka | |
| +---+------------------------------------------+ |
+------+------------------------------------------------+
|
| gRPC / Kafka
v
+------+------------------------------------------------+
| FLP (Flowlogs Pipeline) |
| |
| - Receives raw flow records from eBPF agents |
| - Enriches with Kubernetes metadata: |
| pod name, namespace, workload name, service, |
| node name, zone, NetworkPolicy match |
| - Performs aggregation (configurable): |
| per-namespace, per-workload, per-node |
| - Computes derived metrics: |
| bytes/sec, packets/sec, RTT, DNS latency |
| - Exports to: |
| -> Loki (flow log storage) |
| -> Prometheus (flow-derived metrics) |
| -> Kafka (external consumers) |
| -> OpenTelemetry (OTLP export) |
+------+-------+----------------+------------------------+
| | |
v v v
+------+-+ +--+-----+ +------+-----------+
| Loki | | Prom. | | External systems |
| | | etheus | | (SIEM, Kafka, |
| | | | | OpenTelemetry) |
+------+-+ +--+-----+ +------------------+
| |
v v
+------+-------+-----------------------------------+
| OpenShift Console Plugin |
| |
| - Network Traffic page (topology view) |
| - Flow table (filterable by namespace, pod, |
| service, port, protocol, direction) |
| - Top-N talkers by bytes/packets |
| - DNS tracking (queries, response times, errors) |
| - Drop analysis (reason codes, affected pods) |
| - Grafana dashboards (pre-built) |
+----------------------------------------------------+
eBPF Agent: The eBPF agent runs as a privileged DaemonSet on every node. It compiles and loads eBPF programs into the kernel that attach to the TC (Traffic Control) ingress and egress hooks on every network interface. These hooks see every packet entering or leaving the node. The eBPF program extracts flow metadata (5-tuple, byte/packet counts, TCP flags, DSCP marking, DNS query data) and stores it in a per-CPU ring buffer. A userspace component reads the ring buffer and exports flow records.
The eBPF agent adds negligible overhead -- typically less than 1-2% CPU per node -- because the eBPF program runs in kernel space and processes only header data, never copying packet payloads. This is orders of magnitude more efficient than IPFIX with sampling=1 or full packet capture.
FLP (Flowlogs Pipeline): The FLP component receives raw flow records from all eBPF agents and enriches them with Kubernetes metadata. A raw flow record contains IP addresses and ports; FLP resolves these to pod names, namespaces, workload names (Deployment, StatefulSet, VirtualMachine), service names, and node names. This enrichment is what makes the data useful -- a flow from 10.128.0.5:45321 to 10.128.4.15:5432 becomes a flow from web-frontend/pod-abc in namespace production to postgres-primary/pod-xyz in namespace databases, matching service postgres-svc.
FLP also computes derived metrics (bytes per second, packets per second, round-trip time, DNS response latency) and performs configurable aggregation (aggregate flows by namespace, by workload, or by node to reduce storage volume).
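The enrichment step itself is conceptually a lookup join. The sketch below is a simplified model, not FLP code: in the real pipeline the index is built from the Kubernetes API and kept current via watches, whereas here it is a hypothetical static dict.

```python
# Hypothetical pod-IP index; FLP builds the real one from the Kubernetes API
# (pods, services, nodes) and keeps it updated via watch events.
POD_INDEX = {
    "10.128.0.5":  {"pod": "web-frontend-abc", "namespace": "production",
                    "workload": "web-frontend"},
    "10.128.4.15": {"pod": "postgres-primary-xyz", "namespace": "databases",
                    "workload": "postgres-primary"},
}

def enrich(flow: dict, index: dict = POD_INDEX) -> dict:
    """Attach source/destination Kubernetes identity to a raw 5-tuple flow."""
    out = dict(flow)
    for side in ("src", "dst"):
        meta = index.get(flow[f"{side}_ip"])
        if meta:  # unknown IPs (external endpoints) stay as bare addresses
            out[f"{side}_pod"] = meta["pod"]
            out[f"{side}_namespace"] = meta["namespace"]
            out[f"{side}_workload"] = meta["workload"]
    return out
```

The design point is that enrichment happens once, centrally, at ingest time -- queries against Loki then filter on namespace or workload labels directly instead of re-resolving ephemeral pod IPs after the fact.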
Loki: Flow log records are stored in Loki, the same log aggregation system used by OpenShift for container logs. Loki is a horizontally scalable, multi-tenant log store designed for cost-efficient storage using object storage backends (S3, ODF). Flow logs in Loki are queryable using LogQL, Loki's query language.
Console Plugin: The OpenShift Network Observability console plugin provides a graphical interface for flow analysis directly in the OpenShift web console. It renders a topology view (which namespaces or workloads are talking to each other), a filterable flow table, top-N talker lists, DNS tracking dashboards, and packet drop analysis. This is the closest equivalent to VMware vRealize Network Insight's topology view.
Deploying the Network Observability Operator:
# 1. Install Loki Operator (prerequisite)
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: loki-operator
  namespace: openshift-operators-redhat
spec:
  channel: stable-6.2
  name: loki-operator
  source: redhat-operators
  sourceNamespace: openshift-marketplace
---
# 2. Install Network Observability Operator
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: netobserv-operator
  namespace: openshift-netobserv-operator
spec:
  channel: stable
  name: netobserv-operator
  source: redhat-operators
  sourceNamespace: openshift-marketplace
---
# 3. Create FlowCollector instance
apiVersion: flows.netobserv.io/v1beta2
kind: FlowCollector
metadata:
  name: cluster
spec:
  namespace: netobserv
  deploymentModel: Direct          # Direct (gRPC) or Kafka
  agent:
    type: eBPF
    ebpf:
      sampling: 50                 # sample 1 in 50 packets (0 = all)
      cacheActiveTimeout: 5s       # export interval for active flows
      cacheMaxFlows: 100000        # max concurrent flows per node
      interfaces: []               # empty = all interfaces
      excludeInterfaces:
        - lo                       # exclude loopback
      features:
        - DNSTracking              # capture DNS queries and responses
        - PacketDrop               # capture packet drop reasons
        - FlowRTT                  # compute TCP round-trip time
      privileged: true             # the PacketDrop feature requires a privileged agent
  processor:
    logTypes: Flows                # export flow logs (not just metrics)
    metrics:
      includeList:
        - namespace_flows_total
        - node_ingress_bytes_total
        - workload_ingress_bytes_total
    conversationTracking:
      endTimeout: 10s
      heartbeatInterval: 30s
  loki:
    mode: LokiStack                # use Loki Operator's LokiStack
    lokiStack:
      name: loki
      namespace: netobserv
  consolePlugin:
    enable: true
    portNaming:
      enable: true                 # resolve well-known ports to names
    quickFilters:                  # pre-defined filters in console
      - name: "DNS traffic"
        filter:
          dst_port: "53"
      - name: "Database traffic"
        filter:
          dst_port: "5432,3306,1521,27017"
      - name: "Dropped packets"
        filter:
          pkt_drop_cause: "*"
  exporters:                       # optional: export to external systems
    - type: Kafka
      kafka:
        address: kafka.monitoring:9092
        topic: network-flows
2. Packet Capture and Analysis
When to Use Packet Capture vs Flow Monitoring
Flow monitoring tells you who is talking to whom, how much, and whether traffic is allowed. Packet capture tells you what is in the packets and why the protocol is failing. They serve different purposes:
| Scenario | Use Flows | Use Packet Capture |
|---|---|---|
| "Which VMs are generating the most east-west traffic?" | Yes | No |
| "Why is the TLS handshake failing between app and database?" | No | Yes |
| "Is NetworkPolicy blocking traffic from namespace A to B?" | Yes | No |
| "Why is the TCP connection resetting after 3 packets?" | No | Yes |
| "What is the traffic distribution across subnets?" | Yes | No |
| "Why does DNS resolution return NXDOMAIN for internal names?" | Partial (DNS tracking) | Yes (full query/response) |
| "Is the MTU too small for encapsulated traffic?" | No | Yes (check fragment/DF flags) |
The general rule: start with flows, escalate to packet capture. Flows are cheap and always-on. Packet capture is expensive and targeted -- use it only when flows have narrowed the problem to a specific connection or interface.
tcpdump and Wireshark Basics for Virtual Environments
tcpdump is the foundational packet capture tool on Linux. In OVE, every node runs RHCOS (Red Hat CoreOS), which includes tcpdump. Capturing packets in a virtualized Kubernetes environment requires knowing where in the stack to capture:
Packet Capture Points in the OVE Stack:
+--External Network--+
| |
+--------+-----------+
|
(1) Physical NIC (bond0 / ens3f0) <-- capture: tcpdump -i bond0
|
+--------+-----------+
| br-ex (OVS bridge) | <-- capture: tcpdump -i br-ex
+--------+-----------+
|
(2) br-int (OVS bridge) <-- capture: ovs-tcpdump -i br-int
| (sees GENEVE-decapsulated traffic)
|
+----+----+
| |
(3) veth-A veth-B <-- capture: nsenter + tcpdump on veth
| | or: tcpdump in pod netns
| |
+---+---+ +--+------+
| Pod A | | KubeVirt |
| | | VM Pod |
+-------+ | +-----+ |
| | tap0| | <-- capture: tcpdump -i tap0 inside
| +--+--+ | the virt-launcher pod
| | |
| +--+--+ |
| | VM | | <-- capture: tcpdump inside the VM guest
| | NIC | | (virtctl console, then tcpdump)
| +-----+ |
+-----------+
Capture at the physical NIC (1): Sees all traffic entering and leaving the node, including GENEVE-encapsulated overlay traffic. Useful for verifying that packets reach the node and for diagnosing underlay issues (MTU, LACP, link errors). The traffic is encapsulated, so you must decode GENEVE headers to see the inner payload.
# Capture all traffic on the physical NIC
tcpdump -i bond0 -nn -s 128 -w /tmp/capture-bond0.pcap
# Capture only GENEVE traffic (UDP port 6081)
tcpdump -i bond0 -nn udp port 6081 -w /tmp/geneve.pcap
# Capture non-overlay traffic (management, storage, etc.)
tcpdump -i bond0 -nn 'not udp port 6081' -w /tmp/non-overlay.pcap
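When reading a UDP/6081 capture without a GENEVE-aware dissector, the 8-byte GENEVE base header (RFC 8926) can be decoded by hand to recover the VNI and inner protocol. A stdlib sketch (function name is mine):

```python
import struct

def parse_geneve_header(data: bytes) -> dict:
    """Decode the 8-byte GENEVE base header (RFC 8926) from a UDP/6081 payload.

    Layout: Ver(2b)+OptLen(6b) | O,C flags | Protocol Type (16b) |
            VNI (24b) + Reserved (8b). Options (OptLen * 4 bytes) follow.
    """
    ver_optlen, flags, proto = struct.unpack("!BBH", data[:4])
    vni_rsvd = int.from_bytes(data[4:8], "big")
    return {
        "version": ver_optlen >> 6,
        "opt_len_bytes": (ver_optlen & 0x3F) * 4,  # options follow base header
        "oam": bool(flags & 0x80),
        "protocol": proto,           # 0x6558 = Transparent Ethernet Bridging
        "vni": vni_rsvd >> 8,        # upper 24 bits; low 8 bits are reserved
    }
```

In practice Wireshark dissects GENEVE automatically, but knowing the layout is useful when grepping hex dumps on a node: the VNI ties an outer packet on bond0 back to a specific OVN datapath.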
Capture at the OVS bridge (2): The ovs-tcpdump utility captures packets on a specific OVS port or bridge. When capturing on br-int, the traffic is decapsulated -- you see the inner packet with the pod/VM IP addresses, not the GENEVE outer headers. This is the most useful capture point for debugging overlay connectivity.
# Capture on the OVS internal port for a specific pod/VM
# First, find the OVS port name for the target pod
ovs-vsctl list-ports br-int | grep <partial-pod-id>
# Capture on that port (sees traffic for that specific pod/VM)
ovs-tcpdump -i <port-name> -w /tmp/capture-pod.pcap
# Capture all traffic on br-int (high volume -- use filters)
ovs-tcpdump -i br-int -nn host 10.128.4.15 -w /tmp/capture-vm.pcap
Capture at the pod veth (3): To capture traffic at the pod level, enter the pod's network namespace:
# Find the pod's PID
POD_NAME="virt-launcher-my-vm-abc123"
NAMESPACE="production"
PID=$(oc get pod $POD_NAME -n $NAMESPACE -o jsonpath='{.status.containerStatuses[0].containerID}' | \
sed 's|cri-o://||' | xargs -I {} crictl inspect {} | jq '.info.pid')
# Enter the pod's network namespace and capture
nsenter -t $PID -n tcpdump -i eth0 -nn -w /tmp/capture-pod-eth0.pcap
# Alternative: use oc debug to start a debug pod on the same node
oc debug node/<node-name> -- chroot /host tcpdump -i <veth-name> -nn -w /tmp/capture.pcap
Capture inside a KubeVirt VM: For capturing traffic inside the VM guest OS itself (after the virtio NIC), use virtctl to access the VM console:
# Access VM console
virtctl console my-vm -n production
# Inside the VM, run tcpdump (if available)
sudo tcpdump -i eth0 -nn -w /tmp/guest-capture.pcap
# Copy the capture file out of the VM
virtctl scp production/my-vm:/tmp/guest-capture.pcap ./guest-capture.pcap
OVS Mirror Ports (NSX Port Mirroring Equivalent)
In VMware NSX, port mirroring allows copying traffic from one or more source ports to a destination port for analysis. OVS provides equivalent functionality through mirror configurations:
# Create a mirror on br-int that copies all traffic from port "vm-port-1"
# to a GRE tunnel for remote capture
# Step 1: Create a GRE tunnel port for the mirror destination
ovs-vsctl add-port br-int mirror-gre -- \
set interface mirror-gre type=gre options:remote_ip=10.0.0.99
# Step 2: Create the mirror
ovs-vsctl -- set bridge br-int mirrors=@m \
-- --id=@src get port vm-port-1 \
-- --id=@dst get port mirror-gre \
-- --id=@m create mirror name=debug-mirror \
select-src-port=@src \
select-dst-port=@src \
output-port=@dst
# Step 3: On the collector (10.0.0.99), capture the mirrored traffic
tcpdump -i gre0 -nn -w /tmp/mirrored-traffic.pcap
# Step 4: Remove the mirror when done (important -- mirrors consume bandwidth)
ovs-vsctl -- --id=@m get mirror debug-mirror -- remove bridge br-int mirrors @m
Warning: Mirror ports in OVS are node-local configurations. In an OVN-managed environment (which OVE is), directly modifying OVS configurations may be overwritten by the OVN controller. For persistent mirror configurations, use the OVN northbound database or the Network Observability Operator's packet capture feature, which provides a CRD-based interface.
OVN Traceflow (ovn-trace)
ovn-trace is the OVN equivalent of VMware NSX's Traceflow feature. It simulates a packet through the OVN logical pipeline without actually sending a packet on the network. This is invaluable for debugging policy and routing issues because it shows exactly which logical flows match, which ACLs are applied, and where the packet would be dropped or forwarded.
# Trace a TCP SYN from VM in namespace "production" to database in namespace "databases"
# First, find the logical port names
ovn-nbctl show | grep -A2 "production"
# Output: port "production_my-vm" (addresses: ["0a:58:0a:80:00:05 10.128.0.5"])
ovn-nbctl show | grep -A2 "databases"
# Output: port "databases_postgres" (addresses: ["0a:58:0a:80:04:0f 10.128.4.15"])
# Run the trace
ovn-trace --ovs my-switch \
'inport=="production_my-vm" && eth.src==0a:58:0a:80:00:05 && \
eth.dst==0a:58:0a:80:04:0f && ip4.src==10.128.0.5 && \
ip4.dst==10.128.4.15 && tcp.dst==5432 && ip.ttl==64'
# The output shows every logical flow table hit:
# ingress(ls_in_port_sec_l2) -> match
# ingress(ls_in_acl) -> allow (or drop, with policy name)
# ingress(ls_in_l2_lkup) -> forward to router
# egress(ls_out_acl) -> allow
# ...
ovn-detrace performs the reverse operation: given a physical OVS flow table entry (from ovs-ofctl dump-flows br-int), it maps the flow back to the OVN logical flow that generated it. This is essential for understanding why a particular OVS flow rule exists.
# Dump a specific OVS flow and trace it back to the OVN logical flow
ovs-ofctl dump-flows br-int | grep "10.128.4.15" | ovn-detrace
3. Network Metrics and Dashboards
Key Network Metrics
The following metrics should be monitored continuously on every node and exposed in dashboards and alerting rules:
| Metric | Source | What It Indicates | Alert Threshold (Guideline) |
|---|---|---|---|
| Interface bandwidth utilization | node_network_transmit_bytes_total, node_network_receive_bytes_total (node_exporter) | Link saturation -- approaching capacity on physical NICs or bonds | >80% sustained for 10 min |
| Packet drops (TX/RX) | node_network_transmit_drop_total, node_network_receive_drop_total | Ring buffer overflow, QoS drops, OVS drops | >0 sustained for 5 min |
| Interface errors | node_network_transmit_errs_total, node_network_receive_errs_total | CRC errors, runt frames, link-layer problems (cable, optics, NIC) | >0 sustained for 5 min |
| TCP retransmits | node_netstat_Tcp_RetransSegs | Packet loss in the network, buffer overflow, congestion | >1% of total segments |
| TCP connection states | node_netstat_Tcp_CurrEstab, node_sockstat_TCP_tw | Connection leaks (TIME_WAIT buildup), overloaded services | TIME_WAIT > 10,000 per node |
| OVS datapath packet drops | ovs-dpctl show counters, or ovs-vsctl get interface <port> statistics | OVS kernel datapath dropping packets (flow miss, buffer full) | >0 sustained |
| OVN southbound port binding latency | OVN controller metrics | Delay between logical port creation and OVS flow programming | >5 s (indicates control-plane lag) |
| DNS query latency | Network Observability Operator DNS tracking | Slow DNS resolution (affects all service discovery) | P99 >100 ms |
| Network policy deny count | OVN ACL log counters, Network Observability Operator drop metrics | Blocked traffic -- legitimate or attack | Spike >10x baseline in 5 min |
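As an illustration of how the guideline thresholds translate into alerting, here is a sketch of a PrometheusRule covering the packet-drop and retransmit rows. The rule and alert names are placeholders, and the thresholds should be tuned to the environment:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: network-health-alerts      # placeholder name
  namespace: openshift-monitoring
spec:
  groups:
  - name: network.rules
    rules:
    - alert: NodeNICPacketDrops
      expr: |
        rate(node_network_receive_drop_total{device=~"bond.*|ens.*"}[5m])
        + rate(node_network_transmit_drop_total{device=~"bond.*|ens.*"}[5m]) > 0
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "NIC {{ $labels.device }} on {{ $labels.instance }} is dropping packets"
    - alert: NodeTCPRetransmitHigh
      expr: |
        rate(node_netstat_Tcp_RetransSegs[5m])
        / rate(node_netstat_Tcp_OutSegs[5m]) * 100 > 1
      for: 10m
      labels:
        severity: warning
      annotations:
        summary: "TCP retransmit rate above 1% on {{ $labels.instance }}"
```

The PromQL expressions mirror the queries shown later in this section; only the thresholds and durations come from the guideline column above.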
Prometheus + Grafana for Network Monitoring in OVE
OpenShift includes an integrated Prometheus instance (managed by the Cluster Monitoring Operator) that scrapes metrics from:
- node_exporter on every node (NIC counters, TCP stats, interface status)
- kube-state-metrics (Service, Endpoint, Pod status)
- OVN metrics (exposed by the OVN northd and OVN controller processes)
- OVS metrics (exposed by the ovs-vswitchd process)
- Network Observability Operator (flow-derived metrics if FLP is configured to export to Prometheus)
Pre-built dashboards are available via the OpenShift console (Observe > Dashboards) and can be extended with custom Grafana dashboards via the community Grafana Operator.
Example PromQL queries for network monitoring:
# Top-10 nodes by network throughput (bytes/sec, last 5 minutes)
topk(10,
rate(node_network_receive_bytes_total{device=~"bond.*|ens.*"}[5m])
+ rate(node_network_transmit_bytes_total{device=~"bond.*|ens.*"}[5m])
)
# Packet drop rate per node (drops/sec)
rate(node_network_receive_drop_total{device=~"bond.*|ens.*"}[5m])
+ rate(node_network_transmit_drop_total{device=~"bond.*|ens.*"}[5m])
# TCP retransmit rate (percentage of total segments)
rate(node_netstat_Tcp_RetransSegs[5m])
/ rate(node_netstat_Tcp_OutSegs[5m]) * 100
# OVN controller flow programming latency (if metric is exposed)
histogram_quantile(0.99,
rate(ovn_controller_if_status_mgr_run_duration_seconds_bucket[5m])
)
OpenShift Network Observability Operator Metrics
Beyond the eBPF flow logs stored in Loki, the Network Observability Operator generates Prometheus metrics that enable alerting without querying Loki:
# Total bytes exchanged between two namespaces
sum(rate(netobserv_workload_egress_bytes_total{
SrcK8S_Namespace="production",
DstK8S_Namespace="databases"
}[5m]))
# Top-5 workloads by ingress traffic
topk(5,
sum by (DstK8S_OwnerName, DstK8S_Namespace) (
rate(netobserv_workload_ingress_bytes_total[5m])
)
)
# Packet drops by cause
sum by (PktDropLatestDropCause) (
rate(netobserv_workload_drop_packets_total[5m])
)
# DNS error rate by namespace
sum by (SrcK8S_Namespace) (
rate(netobserv_workload_dns_rcode_total{DnsFlagsResponseCode!="NoError"}[5m])
)
Windows Admin Center Network Monitoring (Azure Local)
Azure Local provides network monitoring through two primary interfaces:
Windows Admin Center (WAC): On-premises GUI that shows per-node NIC status (link speed, duplex, errors), virtual switch configuration, VM network adapter statistics, SDN logical network status, and basic traffic counters. WAC is adequate for small-to-medium deployments but lacks the flow-level visibility and topology analysis provided by the OpenShift Network Observability Operator.
Azure Monitor (via Arc): Azure Local nodes are registered as Azure Arc resources, enabling Azure Monitor to collect performance counters (including network metrics) and forward them to a Log Analytics workspace. This provides cloud-based dashboards, alerting, and log queries via KQL (Kusto Query Language). Azure Monitor provides better alerting and long-term data retention than WAC but requires Azure connectivity and incurs cloud billing for data ingestion.
Neither WAC nor Azure Monitor provides eBPF-based flow capture, protocol-level analysis, or topology visualization comparable to the OpenShift Network Observability Operator. For detailed flow analysis on Azure Local, the physical switches must export NetFlow/IPFIX/sFlow to a third-party collector.
SNMP for Physical Switch Monitoring
Regardless of the virtualization platform chosen, the physical spine-leaf fabric still needs monitoring. SNMP (Simple Network Management Protocol) remains the standard for physical switch monitoring. Key SNMP MIBs for network observability:
| MIB | OID (Prefix) | Data Provided |
|---|---|---|
| IF-MIB | 1.3.6.1.2.1.2 | Interface counters: bytes, packets, errors, drops, speed, status |
| ENTITY-MIB | 1.3.6.1.2.1.47 | Physical entity inventory (chassis, line cards, optics) |
| BGP4-MIB | 1.3.6.1.2.1.15 | BGP peer status, received/advertised prefixes |
| LLDP-MIB | 1.0.8802.1.1.2 | LLDP neighbor information (topology discovery) |
SNMP data is typically collected by Prometheus (via snmp_exporter), Zabbix, LibreNMS, or Observium, and displayed in Grafana dashboards alongside OVE/Azure Local metrics. For a spine-leaf fabric with 20-40 switches, plan for SNMP polling every 60 seconds from a dedicated collector.
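Collectors turn raw IF-MIB octet counters into utilization by computing a rate from two successive samples and handling counter wrap. A minimal sketch of that arithmetic (sample values are made up; ifHCInOctets is the 64-bit IF-MIB counter):

```python
def utilization_pct(octets_t0: int, octets_t1: int, interval_s: float,
                    if_speed_bps: int, counter_bits: int = 64) -> float:
    """Percent link utilization from two ifHCInOctets samples (IF-MIB)."""
    delta = octets_t1 - octets_t0
    if delta < 0:                     # counter wrapped between polls
        delta += 2 ** counter_bits
    bits_per_sec = delta * 8 / interval_s
    return bits_per_sec * 100 / if_speed_bps

# 75 GB received in a 60 s poll on a 25 GbE link -> 10 Gbit/s -> 40%
print(utilization_pct(0, 75_000_000_000, 60, 25_000_000_000))  # 40.0
```

This is why the 60-second polling interval matters: with 32-bit counters (plain ifInOctets), a 25 GbE link can wrap the counter more than once per poll, which is why IF-MIB's 64-bit HC counters should always be preferred on fast links.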
Comparison to vRealize Network Insight / Aria Operations for Networks
VMware vRealize Network Insight (vRNI), now Aria Operations for Networks, provides:
- Topology visualization: Automatic L2/L3 topology discovery across vDS, NSX, and physical switches
- Flow analysis: IPFIX ingestion from vDS and NSX, with application-level grouping
- Micro-segmentation planning: Recommended NSX DFW rules based on observed traffic flows
- Change detection: Alert when traffic patterns change, new flows appear, or rules are modified
- Path analysis: Trace the path of a flow through the virtual and physical network
The closest equivalent in OVE is the Network Observability Operator combined with OVN diagnostic tools. Feature-by-feature comparison:
| Feature | vRNI / Aria Operations for Networks | OVE Equivalent |
|---|---|---|
| Topology visualization | Automatic, includes physical switches | Network Observability console plugin (K8s topology only, no physical) |
| Flow analysis | IPFIX from vDS/NSX, application grouping | eBPF flows via Network Observability Operator, Kubernetes-aware grouping |
| Micro-segmentation planning | AI-recommended DFW rules from observed flows | Manual analysis of flow logs to derive NetworkPolicy rules |
| Change detection | Built-in anomaly detection | Custom Prometheus alerting rules on flow metrics |
| Path analysis | GUI-based, covers virtual + physical | ovn-trace (CLI, virtual only), no integrated physical path analysis |
| Physical network integration | NSX + SNMP discovery | Separate SNMP monitoring, no unified view |
The gap is real: vRNI/Aria provides a single-pane view across virtual and physical networking with AI-assisted analysis. OVE's observability is Kubernetes-native and eBPF-powered (which is architecturally superior for virtual traffic) but does not automatically integrate with the physical network and lacks the AI-assisted policy recommendation feature. Organizations migrating from VMware should plan for separate physical network monitoring (SNMP/sFlow collectors) and invest in building custom Grafana dashboards that unify virtual and physical network views.
4. Troubleshooting Methodology
The Systematic, Bottom-Up Approach
Network troubleshooting in OVE follows the same principle as in any network environment: work bottom-up through the stack, eliminate layers systematically, and never skip a layer. The temptation to jump directly to the application ("it is probably a firewall rule") leads to wasted time when the actual problem is a failed LACP bond, an MTU black hole, or a misconfigured VLAN trunk.
The following decision tree provides a systematic approach:
Network Troubleshooting Decision Tree:
VM reports "network not working"
|
v
+--------------------+
| 1. PHYSICAL LAYER | Tools: ethtool, ip link, dmesg
| Is the link up? |
+--------+-----------+
|
Yes | No --> Check cable, optic, switch port, LACP bond status
v
+--------------------+
| 2. NODE NETWORKING | Tools: ip addr, ip route, bridge, nmcli
| Does the node have |
| correct IPs/routes?|
+--------+-----------+
|
Yes | No --> Check NMState config, NetworkManager, DHCP lease
v
+--------------------+
| 3. OVS LAYER | Tools: ovs-vsctl, ovs-ofctl, ovs-dpctl
| Is OVS healthy? |
| Are ports present? |
+--------+-----------+
|
Yes | No --> Check ovs-vswitchd logs, OVS DB, restart openvswitch
v
+--------------------+
| 4. OVN LAYER | Tools: ovn-nbctl, ovn-sbctl, ovn-trace
| Is the logical port |
| bound? Are ACLs |
| correct? |
+--------+-----------+
|
Yes | No --> Check OVN controller logs, port bindings, ACL rules
v
+--------------------+
| 5. OVERLAY TUNNEL | Tools: tcpdump on bond0 for GENEVE, ping
| Is GENEVE working | between node IPs
| between nodes? |
+--------+-----------+
|
Yes | No --> Check MTU (need >= 1550 for GENEVE), firewall
| rules on underlay, GENEVE port 6081 not blocked
v
+--------------------+
| 6. POD NETWORKING | Tools: oc exec, nsenter, tcpdump on veth
| Can the pod reach |
| the cluster network?|
+--------+-----------+
|
Yes | No --> Check veth pair, CNI logs, IP assignment
v
+--------------------+
| 7. NETWORK POLICY | Tools: ovn-nbctl acl-list, ovn-trace,
| Is a policy blocking| Network Observability drop analysis
| the traffic? |
+--------+-----------+
|
Yes | No policy match --> Check destination pod/VM health
| (blocked) --> Identify the policy, verify intent
v
+--------------------+
| 8. APPLICATION | Tools: ss, netstat, curl, virtctl console
| Is the service |
| listening? Is DNS |
| resolving? |
+--------------------+
Layer-by-Layer Tools and Commands
Layer 1: Physical NIC and Link Status
# Check link status on all interfaces
ip link show
# Detailed NIC information (speed, duplex, link status, driver)
ethtool bond0
ethtool ens3f0
# Check for interface errors (CRC, alignment, carrier errors)
ethtool -S ens3f0 | grep -i error
ethtool -S ens3f0 | grep -i drop
# Check LACP bond status
cat /proc/net/bonding/bond0
# Look for: MII Status (up/down), Partner MAC (switch port),
# Aggregator ID (should match across slaves)
# Check LLDP neighbors (verify correct switch port connection)
lldptool -t -n -i ens3f0
# or, if lldpd is running:
lldpcli show neighbors
# Check for recent NIC-related kernel messages
dmesg | grep -i -E "ens3f0|bond0|link|carrier" | tail -20
Layer 2: Node Networking
# Verify node IP addresses
ip addr show bond0
ip addr show br-ex
# Check routing table
ip route show
# Expect: default via <gateway>, 10.128.0.0/14 (pod CIDR) via OVN,
# overlay-related routes
# Check ARP table (is the gateway reachable?)
ip neigh show | grep <gateway-ip>
# Verify NMState-managed configuration
oc get nncp # NodeNetworkConfigurationPolicy
oc get nnce # NodeNetworkConfigurationEnactment (per-node status)
Layer 3: OVS (Open vSwitch)
# List all OVS bridges
ovs-vsctl list-br
# List all ports on br-int (the main integration bridge)
ovs-vsctl list-ports br-int
# Show OVS port details (including the OVN logical port mapping)
ovs-vsctl show
# Check OVS datapath statistics (packet counts, drops)
ovs-dpctl show
# Dump flow tables on br-int (the programmed forwarding rules)
ovs-ofctl dump-flows br-int
# Check specific flows for a destination IP
ovs-ofctl dump-flows br-int | grep "10.128.4.15"
# Check OVS interface statistics (TX/RX bytes, packets, errors, drops)
ovs-vsctl get interface <port-name> statistics
# OVS daemon health
systemctl status openvswitch
ovs-appctl coverage/show # internal OVS counters
Layer 4: OVN (Open Virtual Network)
# List all logical switches (in OVN-Kubernetes, one per node, plus
# infrastructure switches such as the join and external switches)
ovn-nbctl ls-list
# List all logical ports on a specific switch
ovn-nbctl lsp-list <switch-name>
# Check port binding (is the logical port bound to a physical chassis/node?)
ovn-sbctl find port_binding logical_port="<port-name>"
# Look for: chassis=<node-uuid> (bound) or chassis=[] (unbound = problem)
# List all ACLs (network policies translated to OVN ACLs)
ovn-nbctl acl-list <switch-name>
# List all load balancers (Kubernetes Services)
ovn-nbctl lb-list
# Check OVN controller status on the local node
ovn-appctl -t ovn-controller connection-status
# Should return: "connected"
# Check OVN northd status
ovn-appctl -t /var/run/ovn/ovn-northd.ctl status
# Simulate a packet through the OVN logical pipeline
ovn-trace <datapath> \
'inport=="<src-port>" && eth.src==<src-mac> && eth.dst==<dst-mac> && \
ip4.src==<src-ip> && ip4.dst==<dst-ip> && tcp.dst==<dst-port> && ip.ttl==64'
Layer 5: Overlay Tunnel
# Verify GENEVE tunnel interfaces exist
ovs-vsctl show | grep genev
# Ping another node's underlay IP (tests physical connectivity and MTU)
ping -M do -s 1472 <other-node-ip>   # 1472 + 28 = 1500 bytes -- baseline
ping -M do -s 1522 <other-node-ip>   # 1522 + 28 = 1550 bytes -- GENEVE headroom
# If the second ping fails with "message too long" -> MTU problem
# GENEVE adds ~50 bytes of overhead, so the underlay MTU must be >= 1550
# With jumbo frames (MTU 9000), this is a non-issue
# Capture GENEVE traffic to verify tunnel is working
tcpdump -i bond0 -nn udp port 6081 -c 10
# Check GENEVE tunnel port status in OVS
ovs-vsctl list interface ovn-<remote-node-id>-0
# Look for: status={tunnel_egress_iface=...}
Layer 6: Pod / VM Networking
# Check if the pod/VM is running
oc get pod -n <namespace> | grep <vm-name>
oc get vmi -n <namespace> # VirtualMachineInstance status
# Check pod IP assignment
oc get pod <pod-name> -n <namespace> -o jsonpath='{.status.podIP}'
# Exec into the virt-launcher pod and check networking
oc exec -it <virt-launcher-pod> -n <namespace> -- ip addr show
oc exec -it <virt-launcher-pod> -n <namespace> -- ip route show
# Verify connectivity from the pod to CoreDNS
oc exec -it <virt-launcher-pod> -n <namespace> -- \
nslookup kubernetes.default.svc.cluster.local
# Test connectivity to the target
oc exec -it <virt-launcher-pod> -n <namespace> -- \
curl -v telnet://<target-ip>:<target-port>
Layer 7: Network Policy
# List NetworkPolicies in the target namespace
oc get networkpolicy -n <namespace>
oc get networkpolicy <policy-name> -n <namespace> -o yaml
# List AdminNetworkPolicies (cluster-scoped)
oc get adminnetworkpolicy
# Use ovn-trace to verify if a policy blocks traffic (see Layer 4 above)
# Check Network Observability Operator for drop events
# In the OpenShift console: Observe > Network Traffic > filter by
# "Dropped packets" and source/destination namespace
# List the ACLs for the switch (acl-list shows the rules, not hit counters)
ovn-nbctl acl-list <switch-name>
# Match counts live in the OVS flows: check n_packets in
# ovs-ofctl dump-flows br-int and map flows back to ACLs with ovn-detrace
Layer 8: Application
# Access the VM console
virtctl console <vm-name> -n <namespace>
# Inside the VM: check if the service is listening
ss -tlnp | grep <port>
netstat -tlnp | grep <port>
# Check DNS resolution inside the VM
nslookup <target-hostname>
dig <target-hostname>
# Test connectivity from inside the VM
curl -v http://<target>:<port>/health
telnet <target> <port>
Common Issues and Diagnosis
MTU Black Holes
Symptom: Small packets (ping, DNS) work fine. Large transfers (HTTP downloads, database queries returning large result sets) hang or time out. TCP connections establish successfully but stall when bulk data transfer begins.
Root Cause: The GENEVE overlay adds ~50 bytes of encapsulation header. If the physical underlay MTU is 1500 (standard Ethernet) and the VM sends a 1500-byte packet, the encapsulated packet becomes ~1550 bytes -- too large for the underlay. If the Don't Fragment (DF) bit is set (which it is for most TCP traffic due to Path MTU Discovery), the underlay switch drops the packet and sends an ICMP "Fragmentation Needed" message back. But if ICMP is blocked by a firewall on the path, the source never learns about the MTU limitation -- creating a "black hole" where large packets silently disappear.
Diagnosis:
# Test underlay MTU (from one node to another)
ping -M do -s 1472 <other-node-ip> # 1472 + 28 (IP+ICMP header) = 1500 -- should work
ping -M do -s 1473 <other-node-ip> # 1473 + 28 = 1501 -- should fail if MTU=1500
# Check current MTU on physical NIC
ip link show bond0 | grep mtu
# Check MTU on OVS bridges
ip link show br-int | grep mtu
ip link show br-ex | grep mtu
ip link show genev_sys_6081 | grep mtu
# Capture and look for ICMP "Fragmentation Needed" messages
tcpdump -i bond0 -nn icmp | grep "need to frag"
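The header arithmetic behind these probe sizes can be sketched in a few lines. The 50-byte GENEVE figure is the approximation used throughout this chapter; actual overhead varies slightly with GENEVE options:

```python
ICMP_OVERHEAD = 28      # 20-byte IPv4 header + 8-byte ICMP header
GENEVE_OVERHEAD = 50    # approximate outer Eth/IP/UDP/GENEVE encapsulation

def ping_payload_for_mtu(mtu: int) -> int:
    """Largest ping -s payload that fits in a given MTU without fragmenting."""
    return mtu - ICMP_OVERHEAD

def min_underlay_mtu(inner_mtu: int) -> int:
    """Smallest underlay MTU that carries a full inner frame over GENEVE."""
    return inner_mtu + GENEVE_OVERHEAD

print(ping_payload_for_mtu(1500))   # 1472 -- matches the probe above
print(min_underlay_mtu(1500))       # 1550 -- matches the chapter's guidance
```

The same functions give the jumbo-frame probe size: ping_payload_for_mtu(9000) is 8972, useful for validating a 9000-byte underlay end to end.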
Fix: Set the physical underlay MTU to at least 1550 (or preferably 9000 for jumbo frames) on all NICs, switch ports, and intermediate links. In OVE, verify with NMState:
apiVersion: nmstate.io/v1
kind: NodeNetworkConfigurationPolicy
metadata:
name: bond0-mtu-9000
spec:
desiredState:
interfaces:
- name: bond0
type: bond
mtu: 9000
# ... (rest of bond configuration)
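A hedged side note on the resulting pod MTU: OVN-Kubernetes in OpenShift derives the cluster (pod) MTU from the underlay MTU by subtracting 100 bytes of headroom by default -- more than the ~50-byte GENEVE header, leaving margin for options. A quick sanity check of the expected pod MTU for a given underlay:

```python
OVNK_OVERHEAD = 100  # bytes OVN-Kubernetes reserves for GENEVE encapsulation
                     # (OpenShift default; verify against your cluster config)

def expected_cluster_mtu(underlay_mtu: int) -> int:
    """Pod MTU OVN-Kubernetes derives from the underlay, by default."""
    return underlay_mtu - OVNK_OVERHEAD

print(expected_cluster_mtu(1500))  # 1400
print(expected_cluster_mtu(9000))  # 8900
```

If `oc get network.config cluster -o yaml` reports a cluster MTU that does not match this arithmetic, someone has overridden the default, which is worth knowing before debugging an MTU black hole.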
OVN Logical Flow Mismatches
Symptom: Traffic between two pods/VMs in the same cluster is dropped, even though no NetworkPolicy should block it. ovn-trace shows the traffic being dropped at an ACL stage.
Diagnosis:
# Check if the OVN northbound and southbound databases are in sync
ovn-nbctl --no-leader-only show
ovn-sbctl --no-leader-only show
# Check OVN controller status (should be "connected")
ovn-appctl -t ovn-controller connection-status
# Check if there are stale port bindings
ovn-sbctl find port_binding chassis=[]
# Unbound ports indicate pods that OVN does not know about
# Check OVN northd replication lag
ovn-appctl -t /var/run/ovn/ovn-northd.ctl status
# In HA deployments, check if the active northd is processing correctly
# Compare the expected ACLs with the actual flows
ovn-nbctl acl-list <switch>
ovs-ofctl dump-flows br-int table=44
# Caution: OVS table numbers for the ACL stage vary across OVN versions
# (44 is common but not guaranteed); pipe flows through ovn-detrace to
# confirm the mapping
Fix: If the OVN controller is disconnected or the databases are out of sync, restart the OVN controller on the affected node:
# On the affected node (via oc debug node/<node>)
systemctl restart ovn-controller
If the issue is a stale port binding, the pod may need to be recreated (delete the pod, let the controller recreate it). In OVE, the OVN control plane is managed by the ovnkube-node and ovnkube-master pods -- check their logs for errors.
VLAN Trunking Misconfiguration
Symptom: KubeVirt VMs with Multus secondary interfaces on a VLAN cannot reach the VLAN gateway or other devices on the same VLAN. The primary OVN interface works fine.
Diagnosis:
# Check Multus NAD (NetworkAttachmentDefinition) configuration
oc get net-attach-def -n <namespace> -o yaml
# Verify the VLAN is trunked on the physical switch port
# (requires switch access -- check with network team)
# On the node, verify the VLAN interface exists
ip link show | grep vlan
bridge vlan show dev bond0 # shows which VLANs are configured on the bond
# Check if the Linux bridge for the VLAN exists
ip link show | grep br-
# Capture on the VLAN interface to see if tagged frames arrive
tcpdump -i bond0 -nn -e vlan 100 -c 10
Common causes:
- Physical switch port is set to "access" mode instead of "trunk" for the required VLAN
- The VLAN ID in the NAD does not match the physical switch VLAN
- The node's NMState configuration does not include the VLAN interface
- The bridge plugin in the NAD creates an isolated bridge not connected to the physical VLAN
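To make the first two causes concrete, here is a sketch of a NetworkAttachmentDefinition for a bridge-plugin network carrying VLAN 100. The names are placeholders, and the vlan field must match both the NMState-managed bridge on the node and the VLAN trunked on the physical switch port (OpenShift Virtualization environments often use the cnv-bridge wrapper instead of the plain bridge plugin):

```yaml
apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
  name: vlan100-net              # placeholder name
  namespace: production          # placeholder namespace
spec:
  config: |
    {
      "cniVersion": "0.3.1",
      "type": "bridge",
      "bridge": "br-vlan",
      "vlan": 100,
      "ipam": {}
    }
```

A mismatch between this vlan value and the switch trunk configuration produces exactly the symptom described above: the interface comes up, but tagged frames never arrive.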
DNS Resolution Failures
Symptom: Pods/VMs cannot resolve internal service names (my-svc.my-ns.svc.cluster.local) or external names. nslookup or dig returns SERVFAIL or times out.
Diagnosis:
# Check CoreDNS pods are running
oc get pods -n openshift-dns
# Check CoreDNS logs for errors
oc logs -n openshift-dns <coredns-pod>
# Test DNS from a pod
oc exec -it <pod> -- nslookup kubernetes.default.svc.cluster.local
oc exec -it <pod> -- cat /etc/resolv.conf
# Verify: nameserver points to CoreDNS ClusterIP (172.30.0.10 default)
# search domain includes svc.cluster.local
# Check if the DNS service has endpoints
oc get endpoints dns-default -n openshift-dns
# For KubeVirt VMs: check if the VM guest is using the correct DNS server
virtctl console <vm> -n <namespace>
# Inside VM: cat /etc/resolv.conf
# VMs on the primary OVN interface should use the pod DNS (172.30.0.10)
# VMs on a secondary VLAN interface may use an external DNS server
Common causes:
- CoreDNS pods are crashlooping (check the logs; a common cause is an unreachable upstream DNS server)
- The pod's /etc/resolv.conf points to the wrong DNS server (check the CNI configuration)
- A NetworkPolicy is blocking DNS traffic (UDP/TCP port 53 to the openshift-dns namespace)
- The KubeVirt VM guest is not configured to use the cluster DNS server
- The DNS search domain list is too long (Kubernetes traditionally limits it to 6 search domains and 256 characters total; newer releases raise this to 32 domains and 2,048 characters)
Network Policy Blocking Legitimate Traffic
Symptom: Application connectivity that previously worked stops after a NetworkPolicy is applied. Alternatively, traffic that should be allowed by policy is still being blocked.
Diagnosis:
# List all policies affecting a namespace
oc get networkpolicy -n <namespace>
oc get adminnetworkpolicy
oc get baselineadminnetworkpolicy
# Check if a default-deny policy exists
oc get networkpolicy -n <namespace> -o yaml | grep -A5 "policyTypes"
# A policy with policyTypes: [Ingress] and empty ingress rules = default deny ingress
# Use ovn-trace to simulate the blocked connection
ovn-trace <datapath> 'inport=="<src-port>" && ip4.src==<src> && \
ip4.dst==<dst> && tcp.dst==<port> && ip.ttl==64'
# Look for: "drop" at the ACL stage, with the matching ACL name
# Use Network Observability Operator to find dropped packets
# OpenShift Console > Observe > Network Traffic > Filter: "Dropped packets"
# Filter by source/destination namespace and port
# Check which OVN ACL rule is matching: list the ACLs, then map the
# OVS flow hit counters (n_packets) back to logical ACLs with ovn-detrace
ovn-nbctl acl-list <switch-name>
ovs-ofctl dump-flows br-int | ovn-detrace
Common mistakes:
- A default-deny policy is applied but the corresponding allow policies do not cover all required traffic (e.g., a missing DNS allow rule to the openshift-dns namespace)
- An AdminNetworkPolicy with a Deny action overrides a namespace-level allow (AdminNetworkPolicies are evaluated before NetworkPolicies, and among them a lower priority number wins)
- Pod label selectors in the policy do not match the actual pod labels (case-sensitive, typo-prone)
- An egress policy forgets to allow DNS (port 53) -- everything breaks because service discovery fails
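One way to avoid the last mistake is to pair every default-deny with an explicit DNS egress allow. A minimal sketch (the names are placeholders; note that in OpenShift the CoreDNS pods listen on port 5353 behind the service's port 53, and policies are enforced after service translation, so both ports should be allowed):

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-dns-egress         # placeholder name
  namespace: production          # placeholder namespace
spec:
  podSelector: {}                # all pods in the namespace
  policyTypes:
  - Egress
  egress:
  - to:
    - namespaceSelector:
        matchLabels:
          kubernetes.io/metadata.name: openshift-dns
    ports:
    - {protocol: UDP, port: 53}
    - {protocol: TCP, port: 53}
    - {protocol: UDP, port: 5353}   # CoreDNS pod port in OpenShift
    - {protocol: TCP, port: 5353}
```

Applying this once per namespace alongside the default-deny keeps service discovery working while the finer-grained allow rules are developed.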
5. Compliance and Audit
FINMA Requirements for Network Logging
Swiss Financial Market Supervisory Authority (FINMA) circulars relevant to network observability include:
- FINMA Circular 2023/1 "Operational risks and resilience -- banks" (replaces Circular 2008/21): Requires adequate logging and monitoring of IT systems, including network infrastructure. Outsourcing to third parties (relevant for Swisscom ESC) does not transfer the regulatory responsibility -- the bank remains accountable for ensuring adequate monitoring and audit capability.
- FINMA Circular 2008/21 "Operational Risks -- Banks" (Appendix 3, Technology risks): Although superseded by Circular 2023/1, its core expectations carry forward: banks must maintain audit trails for IT operations, including network access and changes. Logging must be tamper-proof, retained for an adequate period, and available for regulatory review.
- ISAE 3402 / SOC 2 Type II: While not FINMA-specific, FINMA expects banks to obtain and review ISAE 3402 or SOC 2 reports from cloud/hosting providers (relevant for Swisscom ESC). These reports must cover network security controls, including logging and monitoring.
For practical implementation, the following network logging capabilities are required:
| Requirement | What Must Be Logged | Retention | Platform Implementation |
|---|---|---|---|
| Network access logging | All connections to/from production workloads: src/dst IP, ports, protocol, timestamp, bytes, action (allow/deny) | Minimum 1 year, recommended 3 years | OVE: Network Observability Operator flow logs in Loki, export to SIEM. Azure Local: Azure Monitor + Log Analytics. ESC: Swisscom-managed logging. |
| Security policy enforcement | Every deny action by a network policy: which policy, which rule, which connection was blocked | Minimum 1 year | OVE: OVN ACL logs + Network Observability drop tracking. Azure Local: SDN firewall logs. ESC: Swisscom-managed. |
| Network change audit | Every change to network configuration: who changed what, when, from where | Minimum 5 years | OVE: Kubernetes audit log (API server) captures all NetworkPolicy/NAD/NMState changes. Azure Local: Azure Activity Log. ESC: Swisscom change management. |
| Privileged access logging | All privileged access to network infrastructure (switch config, OVS/OVN commands, firewall changes) | Minimum 1 year | OVE: node auditd logs + API server audit log. Azure Local: Windows Event Log + Azure Activity Log. ESC: Swisscom PAM logs. |
Flow Log Retention Policies
Flow log storage grows linearly with the number of flows captured and the retention period. Sizing is critical:
Flow Log Storage Estimation:
Inputs:
- 5,000 VMs
- Average 50 active connections per VM = 250,000 concurrent flows
- Flow record exported every 60 seconds (active timeout)
- Average enriched flow record size: ~500 bytes (JSON, with K8s metadata)
- eBPF sampling rate: 1:50
Calculation:
- The eBPF agent uses connection tracking, so every connection produces a
  flow record regardless of the 1:50 packet sampling rate -- sampling
  reduces per-packet metric precision, not the number of connection-level
  records. Assume 250,000 flow records/min.
- Data per minute: 250,000 * 500 bytes = 125 MB/min
- Data per day: 125 MB * 1,440 min = ~176 GB/day
- Data per year: 176 GB * 365 = ~64 TB/year (uncompressed)
- With Loki compression (typical 10:1): ~6.4 TB/year
Retention tiers:
- Hot (Loki, local SSD): 7-30 days for interactive queries
- Warm (Loki, object storage / ODF): 90 days for investigation
- Cold (S3-compatible archive): 1-3 years for compliance
- Archive (tape / offline): 3-7 years for regulatory retention
Important: The retention periods above are guidelines. The bank's compliance and legal teams must define the actual retention policy based on their interpretation of FINMA requirements and internal risk appetite. The platform team provides the technical capability; the retention policy is a business decision.
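The sizing arithmetic can be reproduced in a few lines for your own inputs. Computed in pure decimal units it gives 180 GB/day; the chapter's ~176 GB figure treats the same 180,000 MB as binary gigabytes, so small differences are expected:

```python
vms = 5_000
flows_per_vm = 50           # average concurrent connections per VM
record_bytes = 500          # enriched JSON flow record (approximate)

# Connection tracking exports every flow, so packet sampling does not
# reduce the record count.
records_per_min = vms * flows_per_vm                    # 250,000
bytes_per_day = records_per_min * record_bytes * 60 * 24
tb_per_year = bytes_per_day * 365 / 1e12

print(f"{bytes_per_day / 1e9:.0f} GB/day")              # 180 GB/day
print(f"{tb_per_year:.1f} TB/year uncompressed")        # 65.7 TB/year
print(f"{tb_per_year / 10:.1f} TB/year at 10:1")        # 6.6 TB/year
```

Rerunning this with the bank's actual VM count, connection density, and record size is the first step in sizing the Loki hot tier and the object-storage retention tiers above.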
Network Forensics
Network forensics goes beyond routine monitoring. When a security incident occurs (data exfiltration, unauthorized access, lateral movement), the forensics team needs to reconstruct what happened on the network. The key question is: what was captured, and can you go back far enough?
| Data Source | What It Provides | Forensic Value | Limitation |
|---|---|---|---|
| Flow logs (eBPF / IPFIX) | Connection metadata: who talked to whom, when, how much | Reconstruct communication patterns, identify lateral movement, detect data exfiltration (large outbound flows to unknown IPs) | No payload content -- cannot see what data was transferred |
| OVN ACL logs | Policy deny/allow events | Prove that security policies were enforced, identify policy bypasses | Only logged for policies with logging enabled |
| Kubernetes audit log | API server activity: who created/modified/deleted network resources | Identify unauthorized configuration changes (e.g., NetworkPolicy deletion) | Does not capture network traffic content |
| Full packet capture | Complete packet content including payload | Definitive evidence for forensics, can reconstruct entire sessions | Enormous storage cost, typically not feasible for continuous capture at scale |
| DNS query logs | DNS queries and responses | Identify C2 communication (DNS tunneling), data exfiltration via DNS, reconnaissance | Only DNS traffic, not general network content |
For a Tier-1 financial enterprise, the recommended forensic strategy is:
- Always on: Flow logs (eBPF) + OVN ACL logs + Kubernetes audit log + DNS query tracking. These are low-cost and provide sufficient data for most investigations.
- On demand: Full packet capture for specific interfaces or traffic patterns during an active investigation. Use OVS mirror ports or tcpdump targeted at specific pods/VMs.
- Retained for compliance: Flow logs and audit logs retained for 1-3 years (per FINMA requirements). Full packet captures retained for the duration of the investigation plus a legal hold period.
Integration with SIEM
Network flow logs and audit events must be forwarded to the enterprise SIEM (Security Information and Event Management) for correlation with other security data sources (endpoint detection, identity events, application logs).
SIEM Integration Architecture:
+--OVE Cluster------------------------------------------+
| |
| Network Observability Operator |
| +-------------------+ |
| | eBPF Agent | |
| +--------+----------+ |
| | |
| v |
| +--------+----------+ +----------------------+ |
| | FLP (Flowlogs +------>| Kafka (optional) | |
| | Pipeline) | +----------+-----------+ |
| +--------+----------+ | |
| | | |
| v v |
| +--------+----------+ +----------+-----------+ |
| | Loki | | Kafka Mirror/Connect | |
| | (flow log store) | +----------+-----------+ |
| +-------------------+ | |
| | |
| Kubernetes Audit Log ------+ | |
| OVN ACL Logs --------------+ | |
| | | |
+-----------------------------+-----------+--------------+
| |
v v
+--SIEM------------------------------+
| |
| Splunk / Elastic / Azure Sentinel |
| |
| - Flow correlation rules |
| - Anomaly detection |
| - Incident investigation |
| - Compliance reporting |
| - Retention management |
+------------------------------------+
Splunk integration: Use the Kafka Connect Splunk Sink Connector to stream flow logs from Kafka to Splunk HEC (HTTP Event Collector). Alternatively, deploy a Splunk Universal Forwarder on OVE nodes to forward syslog and audit log data. Splunk's CIM (Common Information Model) maps flow log fields to its Network Traffic data model for correlation.
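As a sketch of what the forwarding path produces, a flow record can be wrapped in the Splunk HEC event envelope like this. The flow field names follow the netobserv examples earlier in the chapter; the sourcetype and index are placeholders to adapt:

```python
import json
import time

def flow_to_hec_event(flow: dict, index: str = "netflow") -> str:
    """Wrap an enriched flow record in the Splunk HEC event envelope."""
    event = {
        # HEC expects epoch seconds; flow timestamps are in milliseconds
        "time": flow.get("TimeFlowEndMs", time.time() * 1000) / 1000,
        "host": flow.get("SrcK8S_HostName", "unknown"),
        "source": "netobserv",
        "sourcetype": "ove:flowlog",   # placeholder sourcetype
        "index": index,
        "event": flow,                 # full record for CIM field mapping
    }
    return json.dumps(event)

sample = {
    "SrcK8S_Namespace": "production",
    "DstK8S_Namespace": "databases",
    "DstPort": 5432,
    "Bytes": 8192,
    "TimeFlowEndMs": 1700000000000,
}
print(flow_to_hec_event(sample))
```

In practice the Kafka Connect Splunk Sink performs this wrapping for you; the sketch shows which fields end up where, which matters when mapping flow attributes onto Splunk's Network Traffic data model.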
Elastic (ELK) integration: Use Filebeat or Fluentd to forward flow logs from Loki or Kafka to Elasticsearch. The Elastic Common Schema (ECS) provides a standard mapping for network flow data. Kibana dashboards provide flow visualization comparable to the OpenShift console plugin.
Azure Sentinel integration: For Azure Local environments, network logs flow naturally through Azure Monitor and Log Analytics to Azure Sentinel. For OVE environments that need Sentinel integration, use the Azure Monitor agent on OVE nodes or stream flow logs via Kafka to an Azure Event Hub, which Sentinel can ingest.
How the Candidates Handle This
| Capability | VMware (Current) | OVE | Azure Local | Swisscom ESC |
|---|---|---|---|---|
| Flow monitoring technology | IPFIX from vDS/NSX, consumed by vRNI/Aria Operations for Networks | eBPF-based (Network Observability Operator) + IPFIX from OVS (optional) | NetFlow/IPFIX from physical switches. SDN diagnostics for virtual network. No native eBPF flow capture. | Swisscom-managed. Customer has no direct flow visibility. Must request reports. |
| Flow enrichment | vRNI maps flows to VM names, vSphere tags, NSX groups | FLP enriches with Kubernetes metadata: pod, namespace, workload, service, node | Limited. Azure Monitor provides some workload-level correlation. | N/A -- managed by Swisscom |
| Packet capture | Port mirroring (vDS/NSX), packet capture in NSX UI | tcpdump, ovs-tcpdump, OVS mirror ports, virtctl console for VM-internal capture | Hyper-V port mirroring, netsh trace, Wireshark on host | Must request from Swisscom operations. No self-service packet capture. |
| Traceflow / path analysis | NSX Traceflow (GUI, simulates packet through logical pipeline) | ovn-trace (CLI, simulates packet through OVN logical pipeline), ovn-detrace for reverse mapping | SDN diagnostics (Test-SdnConnection PowerShell cmdlet) | N/A |
| Network topology visualization | vRNI auto-discovers L2/L3 topology, physical + virtual | Network Observability console plugin (K8s topology only, no physical switch discovery) | Windows Admin Center shows physical + logical network topology (limited) | Swisscom portal provides limited topology view |
| Metrics and dashboards | vCenter performance charts (NIC utilization, latency), vRNI dashboards | Prometheus + Grafana (node_exporter, OVN metrics, flow-derived metrics) | Windows Admin Center + Azure Monitor dashboards | Swisscom SLA reports |
| DNS monitoring | Limited (vRNI can track DNS flows) | Network Observability Operator DNS tracking (query names, response codes, latency) | Azure DNS analytics (if using Azure DNS) | N/A |
| Packet drop analysis | NSX DFW drop counters, vDS drop stats | eBPF-based drop tracking with kernel drop reason codes, NetworkPolicy match identification | SDN diagnostics for policy drops | Must request investigation from Swisscom |
| SIEM integration | vRNI + syslog/SNMP export to SIEM, NSX DFW log export | Kafka/OTLP export from FLP, syslog forwarding, direct Loki API queries | Azure Sentinel (native for Azure Monitor data), syslog for physical | Swisscom provides SIEM integration as managed service |
| Compliance logging | NSX DFW audit logs, vCenter event log, vRNI flow archive | Kubernetes audit log, OVN ACL logs, eBPF flow logs in Loki, configurable retention | Azure Activity Log + Azure Monitor Logs with configurable retention | Swisscom manages compliance logging per SLA. Customer receives attestation reports. |
| Physical network monitoring | vRNI SNMP/LLDP discovery of physical switches | Separate tooling required (SNMP exporter, sFlow collector). No unified virtual + physical view. | Windows Admin Center + SNMP. Partial integration via Network ATC. | Swisscom manages physical network. |
| Operational model | vRNI is a dedicated appliance, managed by network team via GUI | Network Observability Operator is a Kubernetes-native operator, managed via CRDs and OpenShift console | Split between WAC (on-prem) and Azure Monitor (cloud). PowerShell for advanced diagnostics. | Fully managed by Swisscom. Customer has observer role. |
| Maturity for 5,000+ VM environments | Very mature. vRNI handles large-scale environments well. | Maturing. Network Observability Operator is GA but younger than vRNI. eBPF capture is architecturally sound. FLP pipeline may need tuning for 5,000+ VM scale. | Moderate. Azure Monitor is mature for cloud workloads but SDN diagnostics on-prem are less polished. | Mature but opaque. Customer trusts Swisscom's capability. |
Key Takeaways
- Deploy the Network Observability Operator from day one on OVE. Do not wait for a production incident to discover that you have no flow visibility. The operator is lightweight (eBPF agents add <2% CPU overhead per node), provides Kubernetes-aware flow enrichment, and integrates directly into the OpenShift console. Configure it during cluster deployment as a standard component, not as an afterthought.
- The eBPF-based flow capture in OVE is architecturally superior to VMware's IPFIX approach. eBPF operates in kernel space, captures every connection with near-zero overhead, and enriches flows with Kubernetes identity (pod name, namespace, service) rather than raw IP addresses. The VMware equivalent (vRNI + IPFIX from vDS) requires a separate appliance, processes only sampled flows, and enriches with vSphere inventory (VM name, port group). The OVE approach is more efficient and more informative for troubleshooting in a Kubernetes-native environment.
- Physical network monitoring remains a separate concern. Neither OVE nor Azure Local integrates physical switch monitoring into its virtual network observability stack. Plan for a separate SNMP/sFlow collector (Prometheus snmp_exporter + Grafana, or LibreNMS/Zabbix) for the spine-leaf fabric. Build Grafana dashboards that display both virtual (eBPF/Prometheus) and physical (SNMP) network metrics side by side.
- `ovn-trace` is your most powerful troubleshooting tool. When traffic between two VMs is unexpectedly blocked or misrouted, `ovn-trace` simulates the packet through the entire OVN logical pipeline and shows exactly which ACL or routing decision dropped or forwarded it. Learn this tool thoroughly -- it replaces NSX Traceflow and is the single fastest way to diagnose OVN connectivity issues.
- MTU black holes are the most common post-migration network issue. GENEVE encapsulation adds ~50 bytes to every packet. If the physical underlay MTU is 1500, large packets are silently dropped. The fix is simple (set the underlay MTU to 9000 / jumbo frames) but must be applied to every NIC, every switch port, and every intermediate link before VMs are migrated. Test with `ping -M do -s 8950 <other-node-ip>` after configuration.
- Network observability is a FINMA compliance requirement, not a nice-to-have. Flow logs, policy enforcement logs, and configuration change audit trails must be captured, retained, and available for regulatory review. Define retention policies (hot/warm/cold/archive tiers) with the compliance team before go-live. Export flow logs and audit events to the enterprise SIEM. For Swisscom ESC, verify that Swisscom's ISAE 3402 / SOC 2 report covers network logging adequacy.
- Azure Local's network observability gap is significant. Azure Local lacks eBPF-based flow capture, has limited on-premises flow analysis tools, and depends on Azure Monitor (cloud connectivity required) for advanced monitoring. Organizations choosing Azure Local must compensate with physical switch flow export (NetFlow/IPFIX/sFlow) to a third-party collector and build their own flow analysis pipeline.
- Swisscom ESC's managed model means trusting Swisscom's observability. The customer has no self-service access to flow logs, packet capture, or network diagnostics. Incident investigation requires opening a ticket with Swisscom and waiting for their L3 team. This is acceptable if the SLA is satisfactory and the compliance team accepts Swisscom's attestation reports in lieu of direct audit access. It is a risk if the organization needs rapid, independent network forensics capability.
- Plan SIEM integration architecture early. The choice of Kafka vs direct export, the flow log schema mapping (to Splunk CIM or Elastic ECS), and the SIEM ingestion capacity must be designed before the Network Observability Operator is deployed. A 5,000-VM environment can generate 100+ GB/day of flow log data -- verify that the SIEM can ingest this volume and that the cost is budgeted.
- The troubleshooting methodology is a team skill, not a document. The eight-layer bottom-up approach (physical, node, OVS, OVN, overlay, pod, policy, application) must be trained through practice, not just reading. Conduct tabletop exercises: "a VM in namespace production cannot reach the database in namespace databases -- walk through each layer." Build runbooks with the specific commands for each layer. The first production incident is not the time to learn `ovn-trace`.
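The MTU arithmetic behind the black-hole takeaway above is worth making explicit. A minimal sketch, assuming the ~50-byte GENEVE overhead figure used in this chapter (actual overhead varies with GENEVE options):

```python
# Largest ICMP payload (the `ping -s` value) that fits unfragmented,
# given an underlay MTU and an encapsulation overhead.
IP_HEADER = 20    # IPv4 header without options
ICMP_HEADER = 8   # ICMP echo header

def max_ping_payload(underlay_mtu: int, encap_overhead: int = 0) -> int:
    """Largest -s value for `ping -M do` that should succeed."""
    return underlay_mtu - encap_overhead - IP_HEADER - ICMP_HEADER

# Plain node-to-node test on a 1500-byte underlay: 1500 - 28 = 1472
print(max_ping_payload(1500))      # -> 1472 (so -s 1473 must fail)
# Inner VM packet crossing a GENEVE tunnel over that same underlay:
print(max_ping_payload(1500, 50))  # -> 1422: the silent-drop threshold
# Jumbo-frame underlay leaves ample headroom for encapsulated traffic:
print(max_ping_payload(9000, 50))  # -> 8922
```

This is why the chapter's verification command uses `-s 8950` against a 9000-byte underlay: it proves jumbo frames end to end while leaving margin below the theoretical maximum.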
Discussion Guide
The following questions are designed for vendor workshops, SME deep-dives, and PoC validation sessions. They test whether the vendor or SME has actual production experience with network observability at enterprise scale -- not just marketing-slide familiarity.
1. eBPF Flow Capture Performance at Scale
"We will run 5,000+ VMs on OVE. Each VM may have 50-200 concurrent connections. Walk us through the eBPF agent's resource consumption at this scale. What is the CPU overhead per node? How does the sampling rate affect accuracy vs performance? What happens if the eBPF ring buffer overflows -- do we lose flows silently or get an alert? How do we monitor the health of the observability pipeline itself?"
Purpose: Tests real-world experience with the Network Observability Operator at scale. The answer should include: (1) eBPF agent CPU overhead is typically 1-3% per node, but depends on connection rate (connections per second) more than total VM count. A node with 100 VMs at 50 conn/sec each = 5,000 conn/sec, which is well within eBPF agent capability. (2) The sampling parameter in the FlowCollector CRD controls packet sampling, not connection sampling -- all TCP connections are tracked via conntrack-based flow identification regardless of sampling. Sampling reduces the accuracy of per-flow byte/packet counters but does not cause flows to be missed. (3) Ring buffer overflow results in flow events being dropped; this is observable via the netobserv_agent_evictions_total Prometheus metric. (4) The pipeline itself should be monitored: FLP processing latency, Loki ingestion rate, Kafka consumer lag (if using Kafka deployment model).
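For reference, the sampling knob discussed in this answer lives under the eBPF agent section of the FlowCollector resource. A sketch with illustrative values (field names follow the FlowCollector schema; verify against your operator version):

```yaml
spec:
  agent:
    type: eBPF
    ebpf:
      sampling: 50           # 1 = every packet; higher values trade counter accuracy for CPU
      cacheMaxFlows: 100000  # per-agent flow cache size before eviction
      cacheActiveTimeout: 5s # how long a flow aggregates before being flushed
```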
2. Flow Log Retention and Compliance
"FINMA requires us to maintain audit trails for security policy enforcement. How do we implement network flow log retention for 1-3 years? What is the storage cost? Can we tier the storage (hot/warm/cold)? How do we ensure flow logs are tamper-proof? Can an auditor query 6-month-old flow data to answer the question: 'did VM X communicate with external IP Y on date Z?'"
Purpose: Tests compliance-oriented observability architecture. The answer should include: (1) Loki supports multi-tier storage: local SSD for recent data (7-30 days), S3-compatible object storage for warm data (30-90 days via retention policies), and S3 Glacier or equivalent for cold/archive (1-3 years). (2) Storage cost depends on flow volume, compression ratio (Loki typically achieves 10:1), and storage tier pricing. Provide a sizing estimate based on the environment. (3) Tamper-proofing: Loki data in object storage is immutable (S3 object lock, ODF WORM mode). Kubernetes RBAC restricts who can modify or delete Loki data. Audit log tracks all access to the Loki API. (4) Yes, historical queries are possible via LogQL -- Loki is designed for exactly this use case. But cold-tier queries are slow (minutes, not seconds). Plan for this in the investigation workflow.
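The auditor's question ("did VM X communicate with external IP Y on date Z?") translates to a LogQL query along these lines. This is a sketch: the stream selector label and the JSON field names (SrcAddr, DstAddr) follow the netobserv flow schema and should be verified against the deployed version, and both addresses are hypothetical:

```logql
{app="netobserv-flowcollector"} | json | SrcAddr="10.128.4.17" | DstAddr="203.0.113.50"
```

The date-Z window is supplied by the query, not the expression -- for example via logcli's `--from` and `--to` flags when querying the Loki API directly.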
3. Packet Capture for Incident Response
"During a security incident, our IR team needs to capture packets from a specific VM for forensic analysis. Walk us through the process: how do you identify the correct capture point, how do you capture without affecting the VM's performance, how do you extract and deliver the capture file to the IR team, and how do you ensure chain of custody for the captured data?"
Purpose: Tests practical incident response capability. The answer should include: (1) Identify the capture point: use oc get pod -o wide to find the virt-launcher pod and its node, then ovs-vsctl list-ports br-int to find the OVS port. (2) Capture without performance impact: use ovs-tcpdump with a BPF filter targeted at the specific VM's IP/port (avoid unfiltered capture on br-int). Set a file size limit (-W 10 -C 100 for 10 rotating files of 100MB each). (3) Extract: copy the pcap file from the node to a secure location (oc debug node/<node> -- cat /tmp/capture.pcap > local-capture.pcap). (4) Chain of custody: hash the capture file immediately (sha256sum capture.pcap), record the hash in the incident ticket, store the file in a tamper-proof location (S3 with object lock), document who accessed the file and when.
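Step (4) can be scripted so that the hash and the custody metadata are produced in one motion, leaving less room for procedural error during an incident. A minimal sketch (the file path and record fields are illustrative, not a mandated format):

```python
import hashlib
from datetime import datetime, timezone

def custody_record(path: str, collected_by: str) -> dict:
    """Hash a capture file in chunks and emit a chain-of-custody record."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        # Read in 1 MiB chunks so large pcaps do not load into memory at once
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return {
        "file": path,
        "sha256": h.hexdigest(),
        "collected_by": collected_by,
        "collected_at": datetime.now(timezone.utc).isoformat(),
    }

# Usage during an incident (path is hypothetical):
# rec = custody_record("/tmp/capture.pcap", "ir-analyst-id")
# Attach `rec` to the incident ticket, then move the pcap to object-locked storage.
```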
4. Troubleshooting a Silent Packet Drop
"A KubeVirt VM in namespace 'trading' reports that connections to a database VM in namespace 'market-data' intermittently fail. Small queries work; large result sets time out. There is no NetworkPolicy blocking the traffic. Walk us through your troubleshooting approach. What is the most likely root cause and how do you prove it?"
Purpose: Tests systematic troubleshooting ability and knowledge of MTU black holes. The answer should start with the bottom-up methodology and quickly converge on the MTU hypothesis: (1) Check physical link (ethtool -- probably fine if small packets work). (2) Check OVS/OVN (ovn-trace shows the path is allowed). (3) Key test: ping -M do -s 1472 <db-node-ip> succeeds, ping -M do -s 1473 <db-node-ip> fails -- confirming underlay MTU is 1500 and cannot carry encapsulated large packets. (4) Capture on the underlay: tcpdump -i bond0 icmp | grep "need to frag" -- look for ICMP Fragmentation Needed messages. If absent, a firewall on the path is blocking ICMP, creating a PMTU black hole. (5) Fix: set underlay MTU to 9000 on all links (NMState + switch configuration). (6) Verify: ping -M do -s 8950 <db-node-ip> succeeds.
5. Network Observability Operator vs vRealize Network Insight
"Our network team is accustomed to vRealize Network Insight (Aria Operations for Networks). They use it for topology visualization, micro-segmentation planning, and flow-based firewall rule recommendations. What is the OVE equivalent? What gaps exist, and how do we fill them? Will the network team feel like they are losing capability?"
Purpose: Tests honest gap analysis. The answer should be candid: (1) The Network Observability Operator provides flow analysis, DNS tracking, and drop analysis -- this covers the core flow monitoring use case. (2) Topology visualization is Kubernetes-native (namespace/workload level) but does not include physical switches -- unlike vRNI, which discovers the physical topology via SNMP/LLDP. (3) Micro-segmentation planning (AI-recommended firewall rules based on observed flows) has no direct equivalent in OVE. The team must manually analyze flow logs to derive NetworkPolicy rules. (4) Gaps to fill: deploy a separate physical network monitoring tool (LibreNMS, Observium, or Prometheus snmp_exporter), build custom Grafana dashboards for unified virtual+physical view, consider third-party tools like Kentik or Datadog for advanced flow analytics. (5) Yes, the network team will initially feel a capability regression. The mitigation is training, custom dashboards, and setting expectations that the tooling is different, not absent.
6. Azure Local Network Monitoring Gaps
"We are evaluating Azure Local as an alternative. What is Azure Local's network observability story? How does it compare to OVE's Network Observability Operator? What do we lose, and what do we gain? How do we monitor the Microsoft SDN stack?"
Purpose: Tests cross-platform knowledge. The answer should include: (1) Azure Local relies on Azure Monitor for metrics, Windows Admin Center for on-prem visibility, and SDN diagnostics (PowerShell cmdlets: Test-SdnConnection, Get-SdnDiagnostics) for the Microsoft SDN stack. (2) There is no eBPF-based flow capture equivalent. Flow monitoring depends on physical switch NetFlow/sFlow export to a third-party collector. (3) Azure Monitor provides good metrics and alerting but requires Azure connectivity -- if the Azure Arc connection is down, monitoring degrades. (4) Gains: Azure Monitor integration with Azure Sentinel for SIEM, native KQL queries, familiar Azure portal experience for teams already in the Microsoft ecosystem. (5) Losses: no eBPF flow enrichment with workload identity, no in-console flow table and topology view, limited packet capture tooling compared to OVS/OVN tools.
7. SIEM Integration Design
"Our SOC uses Splunk for SIEM. Design the integration between OVE's network observability data and Splunk. What data sources do we forward? What is the expected data volume? How do we map OVE flow log fields to Splunk's CIM Network Traffic data model? What correlation rules should we create for detecting lateral movement?"
Purpose: Tests SIEM integration depth. The answer should include: (1) Data sources: eBPF flow logs (via Kafka -> Kafka Connect Splunk Sink), Kubernetes audit log (via Splunk Connect for Kubernetes), OVN ACL logs (via syslog forwarder), node system logs (via Splunk Universal Forwarder or Fluentd). (2) Volume: for 5,000 VMs, estimate 100-200 GB/day of flow log data after compression. Verify Splunk license and indexer capacity. (3) CIM mapping: src_ip -> SrcK8S_HostIP, dest_ip -> DstK8S_HostIP, src_port -> source transport port, action -> allow/deny from OVN ACL, app -> Kubernetes workload name. (4) Correlation rules: detect communications to new external IPs from production VMs (possible C2), detect east-west traffic between namespaces that have no prior communication history (possible lateral movement), detect large data transfers from database namespaces to non-application namespaces (possible exfiltration), detect NetworkPolicy deletions followed by new traffic patterns.
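The field mapping in (3) can be implemented as a small transform in the Kafka-to-Splunk path. The sketch below uses netobserv flow field names on the input side and CIM Network Traffic names on the output side; the exact schema versions and the choice of which Kubernetes enrichment fields to carry should be validated with the SOC (note in particular that SrcAddr/DstAddr are the flow endpoints, while SrcK8S_HostIP is the node address):

```python
# Map a netobserv flow record (parsed JSON) to Splunk CIM Network Traffic fields.
def to_cim(flow: dict) -> dict:
    return {
        "src_ip": flow.get("SrcAddr"),
        "dest_ip": flow.get("DstAddr"),
        "src_port": flow.get("SrcPort"),
        "dest_port": flow.get("DstPort"),
        "transport": flow.get("Proto"),  # numeric IP protocol (6 = TCP)
        "bytes": flow.get("Bytes"),
        "packets": flow.get("Packets"),
        # Kubernetes enrichment carried as custom fields for correlation rules
        "src_namespace": flow.get("SrcK8S_Namespace"),
        "src_workload": flow.get("SrcK8S_OwnerName"),
        "dest_namespace": flow.get("DstK8S_Namespace"),
        "dest_workload": flow.get("DstK8S_OwnerName"),
    }

# Hypothetical flow record shaped like netobserv output:
sample = {"SrcAddr": "10.128.4.17", "DstAddr": "10.131.0.9",
          "SrcPort": 41522, "DstPort": 5432, "Proto": 6,
          "Bytes": 8192, "Packets": 12,
          "SrcK8S_Namespace": "production", "DstK8S_Namespace": "databases"}
```

The lateral-movement correlation rules in (4) then key off `src_namespace`/`dest_namespace` pairs rather than raw IPs, which survives pod rescheduling.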
8. Physical Network Monitoring Strategy
"Our spine-leaf fabric has 24 Arista leaf switches and 4 spine switches. All run EOS with sFlow enabled. How do we integrate sFlow data from the physical switches with the virtual network observability in OVE? Can we build a unified view? What tools do we use?"
Purpose: Tests physical-virtual monitoring integration. The answer should include: (1) Deploy an sFlow collector (e.g., sFlow-RT, ntopng, or InMon's host sFlow agent with Prometheus integration) that receives sFlow from all Arista switches. (2) Deploy Prometheus snmp_exporter to scrape interface counters, BGP peer status, and LLDP neighbor data from all switches via SNMP. (3) Build Grafana dashboards that combine physical (SNMP/sFlow) and virtual (Network Observability metrics, node_exporter) data on the same page: physical switch port utilization alongside the corresponding node's virtual switch throughput. (4) Correlate: when a physical link shows errors (SNMP counters), check if the OVE nodes on that leaf switch show packet drops or retransmits (Prometheus). (5) Current limitation: there is no single tool that automatically discovers and correlates the physical-to-virtual mapping. The team must manually build the correlation (e.g., using LLDP data to map which OVE node is connected to which leaf switch port). Consider network automation tools (Nautobot, NetBox) for maintaining this mapping.
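For step (2), the standard snmp_exporter pattern is a Prometheus scrape job that relabels each switch address into the exporter's `target` parameter. A sketch with placeholder hostnames (the `if_mib` module is the exporter's stock interface-counter module; the exporter service name is an assumption):

```yaml
scrape_configs:
  - job_name: arista-leaf-snmp
    metrics_path: /snmp
    params:
      module: [if_mib]          # interface counters from the generated snmp.yml
    static_configs:
      - targets:                # placeholder switch hostnames
          - leaf01.example.net
          - leaf02.example.net
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target   # switch becomes the SNMP target
      - source_labels: [__param_target]
        target_label: instance         # keep the switch name as the instance label
      - target_label: __address__
        replacement: snmp-exporter.monitoring:9116  # assumed exporter service
```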
9. Observability During Migration
"During the VMware-to-OVE migration, we will have VMs running on both platforms simultaneously for 6-12 months. How do we maintain unified network observability across both environments during the transition? Can we see flows between a VMware VM and an OVE VM in a single dashboard?"
Purpose: Tests migration-phase operational planning. The answer should include: (1) During the transition, VMware VMs export IPFIX via vDS to the existing vRNI/flow collector. OVE VMs are captured by the Network Observability Operator. These are two separate data sources with different schemas. (2) Cross-platform flows (VMware VM to OVE VM via a shared VLAN or routed path) appear in both collectors -- as an egress flow in one and an ingress flow in the other. (3) Unified view: forward both data sources to the SIEM (Splunk/Elastic) and build cross-platform correlation dashboards. Map VMware VM names and OVE workload names to a common application inventory. (4) During migration, monitor for traffic asymmetry: a VMware VM that is migrated to OVE should show the same traffic patterns (same peers, same volume) on the new platform. Deviations indicate a connectivity problem. (5) Decommission vRNI only after all VMs are migrated and verified on OVE.
10. Proving Network Policy Effectiveness to Auditors
"A FINMA auditor asks: 'Prove that your micro-segmentation policies are effective. Show me that the database VMs in namespace databases can only be reached by application VMs in namespace production, and that all other traffic is denied.' How do you demonstrate this using OVE's network observability tools?"
Purpose: Tests audit-readiness. The answer should include: (1) Show the NetworkPolicy definition: oc get networkpolicy -n databases -o yaml -- displays the allow rules (ingress from namespace production on port 5432) and implicit deny-all for other traffic. (2) Show flow logs proving enforcement: query Loki via LogQL or use the Network Observability console to filter flows with destination namespace=databases. Show that all allowed flows have source namespace=production and destination port=5432. Show that denied flows (from other namespaces) appear with PktDropCause indicating policy denial. (3) Show that the ACLs are actually enforced: ovn-nbctl acl-list <databases-switch> lists the deny ACL itself (note that acl-list displays the rules, not hit counters), and the OpenFlow statistics on the node (ovs-ofctl dump-flows br-int, filtered to the corresponding drop flows) show non-zero packet counts, proving that unauthorized traffic was actually blocked, not just theoretically blocked. (4) Show that the policy cannot be modified without audit trail: query the Kubernetes audit log for all events on NetworkPolicy resources in the databases namespace, proving who created/modified the policy and when. (5) Export this evidence as a PDF/report for the auditor.