[TOC]

# EdgeConnect SD-WAN High Availability (HA) Guide

## 1. Introduction

High Availability (HA) in EdgeConnect SD-WAN is designed to minimize network disruption and maximize application continuity during failure events. It is supported by two distinct redundancy models: the Traditional HA model, which is ideal for data center environments, and the simplified EdgeHA model for branch deployments.

### 1.1 Purpose

This document serves as a **comprehensive guide to High Availability (HA) in EdgeConnect SD-WAN**. It covers the full lifecycle of HA design and operation, including:

- **Deployment architectures** – EdgeHA and Traditional topologies, and their design considerations.
- **Configuration** – Baseline, Optimized, and 9.6 Default settings, with trade-offs explained.
- **Operation** – How HA functions under different network conditions, including peer and path selection, tunnel behavior, and flow reclassification.
- **Convergence behaviors** – Measured failover times, packet loss, and recovery analysis across key failure scenarios.

By consolidating **design principles, configuration guidance, operational insights, and empirical test results**, this guide provides a single reference point for architects and engineers to plan, deploy, and validate HA in EdgeConnect SD-WAN networks.

### 1.2 In Scope

This guide focuses on **EdgeConnect SD-WAN High Availability (HA)** from both an **architectural** and an **operational** perspective. The primary emphasis is on how HA topologies, configuration parameters, and system behaviors influence **traffic convergence during failure events**. The initial scope centers on the convergence of **internal SD-WAN fabric traffic**, providing a baseline understanding of recovery performance under controlled test conditions. Future iterations of this document will extend the analysis to additional traffic patterns such as **local internet breakout**, **security service edge (SSE)**, and **Zone-Based Firewall**.
**Failure scenarios examined**

- Gateway power loss
- Active gateway LAN-side failure
- Active gateway WAN-side failure
- HA link loss

**Per-scenario evaluation criteria**

- Convergence duration (time to restore traffic flows)
- Traffic loss (packet loss observed during convergence)

### 1.3 Out of Scope

While this guide provides a **comprehensive reference for EdgeConnect SD-WAN High Availability**, its focus is intentionally limited to **device-level HA topologies and traffic convergence behaviors**. The following areas are **outside the current scope** but may be addressed in future updates or separate documents:

- **Non-HA scale and resiliency mechanisms** (e.g., multi-gateway ECMP).
- **Overlay/underlay interactions** (e.g., Link Bonding).
- **Orchestrator-level redundancy** (e.g., Orchestrator HA).
- **Application-level performance impacts** (i.e., anything beyond traffic convergence).
- **Third-party service redundancy** (e.g., SSE failover validation).
- **Security policy failover** (e.g., ZBFW policy and failover interactions).

## 2. HA Topologies and Deployment Models

EdgeConnect SD-WAN offers two High Availability (HA) deployment models, each designed to meet different site requirements:

- **EdgeHA** is optimized for branch deployments, using a direct HA link between appliances to share WAN connections. It simplifies cabling and setup, making it a practical choice for smaller environments.
- **Traditional HA** is ideal for data centers, campuses, and large branch sites. It requires WAN L2 switching, but offers greater scale opportunities.

Both models deliver **uninterrupted uptime**, **tunnel continuity**, and **redundant pathing**. The right choice depends on your site architecture and scalability needs.
| Model | Best For | Key Mechanism | Advantages | Considerations |
| --------------- | -------------------------- | ---------------------------------- | ------------------------------------- | --------------------------------------------- |
| **EdgeHA** | Branch | Direct HA link with WAN IP sharing | Simple setup; direct WAN connections. | Max 2 ECs |
| **Traditional** | Data centers, campus, hubs | WAN L2 switching layer | Mature, scalable, N+1 scale support. | Requires L2 WAN switches; more configuration. |

### 2.1 Traditional HA (Active/Standby & Active/Active)

**Traditional HA** is the original redundancy model for EdgeConnect. It is most often deployed in **data centers or hub sites** where WAN services terminate into **L2 switch fabrics** and all EdgeConnects have WAN connections to all underlays at the site.

#### 2.1.1 Traditional HA Topology

![Traditional HA](https://hackmd.io/_uploads/SyjeBl89eg.png)

#### 2.1.2 Deployment Models

- **Active/Standby LAN**
  - One EC is designated as the **active node**; the other is **standby**.
  - The preferred active node is selected using mechanisms such as VRRP priority, lower OSPF cost, or higher BGP preference.
  - For WAN Optimization use cases, **symmetric traffic flows** are maintained.
- **Active/Active LAN**
  - All ECs advertise LAN subnets simultaneously toward the SD-WAN fabric.
  - Advertisements toward the LAN are sent with equal cost.
  - This model achieves **true load balancing** across the EC pair for traffic originating from the LAN.
  - For WAN Optimization, asymmetry in optimized TCP traffic is resolved through flow redirection.
  - Note: Flow redirection does not correct asymmetry for traffic other than optimized TCP flows.

#### 2.1.3 Active/Standby LAN-Side Design Options

- **VRRP with IP-SLA (default & simplest)**
  - All ECs share a **Virtual IP (VIP)** as the default gateway.
  - The higher-priority EC becomes master; preemption is typically **disabled**.
  - IP-SLA monitors VRRP and adjusts the **subnet metric delta** (≥2) if the EC loses mastership.
  - Simple to implement; fast failover (hundreds of ms).
- **LAN-Side OSPF**
  - All ECs participate in OSPF.
  - A lower OSPF **interface cost** (e.g., 10 vs 20) designates the preferred EC.
  - Redistributed routes are tagged and assigned different metrics (e.g., 50 vs 70) for **symmetry**.
  - No VRRP/IP-SLA needed.
  - Preferred for **routing-rich data center environments**.
  - Tag-and-filter is required for loop prevention.
  - BFD support: Can be enabled alongside OSPF to dramatically reduce failover time (to sub-second), making it a competitive alternative to VRRP for speed.
- **LAN-Side BGP**
  - All ECs form **eBGP sessions** with LAN routers (or internal iBGP sessions with route reflectors).
  - Admin distance, MED, and Local Preference determine the preferred EC.
  - Ideal when the DC already uses BGP extensively.
  - BFD support: Can be enabled alongside BGP to dramatically reduce failover time (to sub-second), making it a competitive alternative to VRRP for speed.

#### 2.1.4 Active/Active LAN-Side Design Options

- **OSPF w/ECMP**
  - All EdgeConnects participate in OSPF and advertise the same LAN subnets with equal-cost metrics toward the LAN-side routers, leveraging OSPF's native ECMP capability.
  - Traffic distribution: LAN routers install multiple equal-cost paths to the subnets, distributing outbound traffic (LAN to WAN) across all active EC appliances for load balancing and potentially higher throughput.
  - Failover: If an active EC fails, its OSPF advertisements are withdrawn. The LAN router removes the failed path, and all traffic automatically shifts to the remaining active ECs. Failover is fast, relying on OSPF's default convergence time (or BFD, if configured).
  - Implementation is simple in environments that already use OSPF. It provides a strong load-balancing mechanism without the need for additional protocols like VRRP.
- **BGP w/ECMP**
  - ECs form **eBGP sessions** with LAN routers (or internal iBGP sessions with route reflectors).
  - Requires flow redirection for optimized TCP symmetry.
  - Admin distance, MED, and Local Preference determine the preferred EC.
  - Supports **multi-homing** to multiple WAN/PE providers.
  - Ideal when the DC already uses BGP extensively.
  - BFD is a critical component for accelerating failure detection in routing protocols like BGP.

#### 2.1.5 WAN-Side Design

- ECs connect to **WAN-side switches** that in turn connect to carriers (MPLS, DIA, Internet).
- SD-WAN tunnels established by each EC use the EC's **own WAN IPs** for each underlay.
- For **IPsec-UDP**, each EC must still get a unique IPsec UDP port.

#### 2.1.6 Operational Caveats

- **Orchestrator UDP settings**: Changing the global tunnel port does **not retroactively update paired ECs**. Update manually on the secondary EC if needed.
- **VRRP preemption**: Disabling it prevents unnecessary flaps when an EC recovers.
- **Azure/cloud limitations**: VLAN tagging isn't supported in cloud fabrics, so EdgeHA is not applicable in those environments.

----

### 2.2 EdgeHA (Active/Standby Only)

**EdgeHA** is the high-availability model optimized for branch environments. Unlike Traditional HA, it removes the need for WAN-side Layer 2 switches by using a dedicated HA link between appliances. This link allows each EdgeConnect to access the peer's WAN interfaces, enabling a seamless active/active WAN design in which all circuits are utilized simultaneously.

On the LAN side, EdgeHA operates in an active/standby configuration. One appliance actively handles LAN traffic while the other remains on standby, ready to take over instantly if a failure occurs. As illustrated in Figure 1, this design delivers both redundancy and uninterrupted service during failures, maximizing WAN usage without compromising LAN stability.
#### 2.2.1 EdgeHA Topology

![EdgeConnect HA](https://hackmd.io/_uploads/HJpZrl85lg.png)

#### 2.2.2 EdgeHA Deployment Models

**Active/Standby LAN**

- One appliance is active and one is standby when viewed from the **LAN side**; Active/Active LAN is not supported with EdgeHA.
- The Active/Standby LAN requirement stems from EdgeHA's operational model, which uses the dedicated HA link to maintain consistent state and is not designed to support simultaneous, equal-cost LAN traffic distribution.
- From the **WAN side**, both appliances can actively use their WAN circuits, ensuring full utilization of underlay services.

> **Active/Active LAN** is not supported with EdgeHA.
> For customers requiring **ECMP-style active/active load distribution**, use
> **Traditional HA** instead.

#### 2.2.3 EdgeHA HA Link Requirements

The **HA link is the core differentiator** of EdgeHA:

- **Link selection**
  - Mgmt interfaces are not supported; only datapath interfaces (e.g., lan0, wan0, lan1, wan1, etc.).
  - 2x links can be bonded using LACP, if desired. Configure the bonded link first on each EdgeHA device and then set the bonded link as the EdgeHA interface via the deployment page.
- **Latency**:
  - Recommended ≤10 ms RTT (for consistent UX).
  - Maximum ≤500 ms RTT.
  - Any traffic traversing peer interfaces inherits this latency.
- **Bandwidth**:
  - Must equal or exceed the **fastest WAN underlay** in the pair.
- **Connectivity**:
  - Must be **Layer 2 with VLAN tagging** (802.1Q).
  - Default VLANs: 100–164.
  - Can be delivered via trunk, QinQ, VXLAN, xWDM, or dark fiber.
- **HASync VLAN**:
  - Calculated as `VLAN Start + # of HA Subnets`.
  - Must be passed transparently.
- **Optical note**: Direct dark-fiber HA links require **EC-S-P or larger** appliances.

#### 2.2.4 EdgeHA LAN-Side Design Options

**VRRP (most common)**

* EdgeConnect appliances provide a shared Virtual IP as the LAN default gateway.
* The appliance with the higher priority becomes master; preemption is typically disabled (non-revertive).
* IP-SLA tracks the VRRP state and adjusts the advertised subnet metric (≥2) so that traffic from the SD-WAN fabric is always forwarded to the active EC.

**BGP**

* The LAN network peers with the ECs via eBGP or iBGP.
* eBGP: use AS-path prepend or MED to prefer routes through the primary EC.
* iBGP: use Local Preference or AS-path prepend to prefer routes through the primary EC.
* Redistribution metrics (e.g., 50 vs. 70) again ensure symmetric return paths.
* Best suited for routing-rich branches where LAN routers already run BGP.

**OSPF**

* Both EdgeConnect appliances participate in OSPF.
* A lower interface cost (e.g., 10 vs. 20) makes one appliance preferred.
* Redistribution metrics (e.g., 50 vs. 70) ensure symmetric return paths.
* Best suited for routing-rich branch deployments where LAN routers are already running OSPF.

#### 2.2.5 Operational Caveats & Considerations

- **No flow redirection support**: Leave it disabled.
- **DHCP/DNS caching not synchronized**: Deploy external redundancy if needed.
- **Labels must differ between ECs**: Orchestrator enforces this rule.
- **IaaS not supported**: EdgeHA links rely on VLAN tagging, which cloud vNets don't support.

## 3. High Availability Functionality (Failover Mechanics)

### 3.1 Core Operational Components

High Availability in EdgeConnect is achieved through the coordinated action of two main groups of operational components: those responsible for rapid **failure detection** and those responsible for seamless **flow recovery**.

#### 3.1.1 Failure Detection and Control Plane Mechanics

These components ensure the EdgeConnect appliances quickly and accurately recognize a change in state, whether from a peer failure or path degradation, and update the SD-WAN fabric.

- Traffic classification
- Overlay selection
- Peer selection
- Route lookup
- Route updates
- Path selection
- Auto flow reclassification
- Primary vs. backup label handling
- Nuances of idle vs. active tunnels

### 3.2 Failure Scenarios & Expected Behavior

- LAN-side failure
- WAN-side failure (primary/backup label)
- Power loss
- HA/cluster link loss

### 3.3 Traffic Flow Perspective

- Branch → Hub (initiator: Branch)
- Hub → Branch (initiator: Hub)
- Bidirectional considerations

## 4. Convergence Configuration Settings

### 4.1 Tunable Dimensions

The following outlines the **tunable dimensions within the EdgeConnect SD-WAN fabric** that most directly influence how quickly the system responds to LAN-side convergence events and shifts traffic. These settings are grouped into two categories:

**Global Settings**

- Apply universally across all devices in the SD-WAN fabric.
- Typically managed through globally scoped menus in the Orchestrator.
- Affect system-wide behavior and convergence responsiveness across the entire SD-WAN fabric.

**Gateway Settings**

- Configurable on a **per-gateway basis**.
- Usually applied via **templates** or during **initial site setup**.
- Allow for localized tuning of convergence behavior at individual sites.

### 4.2 Global Settings

- **Auto flow re-classify**
  - Orchestration > Orchestration Settings > **Auto Flow Re-classify Timer**
  - When policy changes occur, flow reclassification makes a best-effort attempt to conform the flow to the change. This setting specifies how often to perform a policy lookup. **It controls how often the EdgeConnect checks whether a new remote peer is available for a given flow.** It monitors route changes, policy changes, Internet breakout, QoS policies, Business Intent Overlays, etc. **Set to 60 seconds by default; it can be safely adjusted to 10 in most networks.**
- **Tunnel health retry count**
  - This is the number of seconds a tunnel must be down before BGP and OSPF routes are pulled from any routers downstream or upstream of the EdgeConnect.
  - **This can be adjusted down to 5 for most high-speed Internet and MPLS connections.** Leave it at the default for LTE and other unreliable underlays. Each label has its own unique retry count.
- **Overlay Bonding Policy**

### 4.3 Gateway Settings

- **VRRP**
  - Version, advertisement/hold-down, preemption, priority, VMAC
- **IP SLA**
  - VRRP Interface Monitor
  - Configuration > IPSLA > VRRP Monitor > **Monitor Sampling Interval**
    - This is the frequency at which the VRRP state is checked for changes. Important when failing over from one EdgeHA appliance to another.
- **Quiescent tunnel keep alive time**
  - System Information > System Settings > **Quiescent Tunnel Keep Alive Time (Default = 60 Seconds)**
  - Specifies the rate at which to send keep-alive packets **after** a tunnel has become idle.
  - If a tunnel is in the Idle state, this is the interval at which keep-alives are sent for that tunnel. If a tunnel is in the Idle state and there are no other active tunnels to that peer, it may take up to 60 seconds before the tunnel is taken down, which delays removal of the routes associated with that tunnel. This setting matters ONLY when there are idle tunnels and you want to manage convergence time for failures while all tunnels are in the Idle state. **This timer can be set to 1 second.**
  - You cannot actively measure convergence time while the tunnels are in the Idle state.
- **BGP/OSPF/BFD timers (where applicable)**
  - For dynamic routing protocols (OSPF/BGP), use a dead/hold timer not exceeding 10 seconds. The better solution is to run BFD alongside OSPF or BGP, which allows the larger default OSPF/BGP dead/hold timers to remain.

### ⚠️ Aggressive Path Fail Settings

**Aggressive path failure** is a configuration option available **only via the CLI**, designed for environments with **extremely strict failover requirements**.

- While it **does not enhance LAN-side convergence**, it significantly **accelerates failover** between **Active**, **Standby**, or **Backup tunnels**.
- This setting is particularly useful when rapid failover between **Primary and Backup underlay tunnels** is required.
- However, it comes with a trade-off: it generates an **exceptionally high volume of keepalive packets**, which can drive **network utilization far beyond normal baselines**.

> ⚠️ **Use with caution**: This setting should only be applied in scenarios where performance demands justify the increased overhead.

### 4.4 Baseline / Optimized / 9.6 Default

A configuration model consisting of **Baseline**, **Optimized**, and **Expedited** configurations, together with their respective measured convergence times and packet loss, is proposed in the following tables. This framework enables teams to align failover settings with specific operational priorities and network conditions, without overemphasizing one-size-fits-all outcomes. ECOS 9.6 introduces a new set of functionality that greatly improves failover time in EdgeConnect SD-WAN.

- **Baseline** (stability-first, conservative)
- **Optimized** (balanced)
- **Expedited** (fastest failover; consider metered links, CPU, and stability)

**Global Settings**

| Setting | Baseline<br />~60 sec | Optimized<br />~15 sec | 9.6 Default<br />~3 sec |
| ------------------------ | :---: | :---: | :---: |
| Flow Reclassify Interval | 60 | 10 | 60 |
| Tunnel Health Retry | 30 | 5 | 30 |
| Bonding Policy (tests) | HQ | HQ | HQ |

**VRRP & IP SLA**

| Setting | Baseline | Optimized | 9.6 Default |
|---------------------------------|:-----------------------:|:-----------------------:|:-------------:|
| VRRP Version | 2 | 3 | 3 |
| VRRP Adv / Hold-down | 60 s / 1 s | 10 s / 100 cs | 10 s / 10 cs |
| Preemption | Enabled | Enabled | Enabled |
| Priority | _(site-specific)_ | _(site-specific)_ | _(site-specific)_ |
| VRRP VMAC ^1^ | Disabled | Disabled | Disabled |
| IP SLA – VRRP Monitor Interval | 1 sec | 1 sec | 1 sec |
| IP SLA – VRRP Monitor | Subnet Metric Δ = 100 | Subnet Metric Δ = 100 | Subnet Metric Δ = 100 |

1 - VRRP Virtual MAC (VMAC) is disabled by default on virtual EdgeConnect devices, but enabled by default on physical hardware.

**BGP / OSPF / BFD**

| Setting | Baseline | Optimized | 9.6 Default |
|-------------|:-------:|:-----------:|:-------------:|
| BGP Timers | 30 KA / 90 HD | 5 KA / 15 HD | 30 KA / 90 HD |
| OSPF Timers | N/A | N/A | N/A |
| BFD Timers | 1000/1000/3 ON | 1000/1000/3 ON | 1000/1000/3 ON |

**Other**

| Setting | Baseline | Optimized | 9.6 Default |
| -------------------------------- | :-----: | :---------: | :-----------: |
| Quiescent tunnel keep alive time | 60 | 5 | 60 |
| system ipsec-natd quiesc-th | 120 | 10 | 120 |
| system ipsec-natd quiesc-period | 360 | 15 | 360 |
| system aggressive-path-fail | disable | disable | disable |

---

## 5. HA Testing & Validation

### 5.1 Convergence Measurement Methodology

Figure 1 depicts the logical traffic flows that will be tested. Bidirectional traffic will be sent between the branch and hub, with both serving as initiator and receiver. This allows a complete end-to-end measurement of traffic convergence, recorded from the perspective of the traffic initiator.

To ensure precise measurement of traffic convergence time and packet loss, each flow is generated with high accuracy and temporal precision. Both UDP and ICMP traffic types are included; for both, convergence time is the span from the first impairment to the point when all affected probes recover, and packet loss is the total of lost probes across that same cycle.

- **Traffic types & parameters**
  - **UDP:** 10 pps, 100 ms ± 1 ms, ~361 B, dst port 8888
  - **ICMP:** 10 pps, 100 ms ± 1 ms, ~361 B, Echo with metadata
- **Measurement focus:**
  - Full restoration time & total packet loss

### 5.2 Definition of “Convergence Time” and Measurement Window

Convergence time is defined as the elapsed duration between the initial detection of a failure event and the complete restoration of bidirectional SD-WAN traffic flow across the affected path or service. This measurement accounts for all underlying control-plane and data-plane processes required for recovery, including routing updates, tunnel re-establishment, flow reclassification, and label reassignment. For the purposes of this guide, convergence is considered achieved only when uninterrupted traffic delivery resumes at the configured packet interval with no additional loss beyond the natural steady-state baseline.

A measurement window begins at the precise moment of failure injection and extends until traffic stability is restored and sustained for a minimum of **three consecutive measurement intervals** (e.g., 300 ms at 10 pps).
This approach ensures that transient recovery or oscillation is excluded from the final convergence time value.

### 5.3 Definition of “Packet Loss During Convergence”

Packet loss during convergence is defined as the percentage of traffic packets that fail to reach the intended destination from the moment a failure event is introduced until the network has fully restored stable forwarding of the affected flows. This metric reflects the cumulative disruption experienced by active sessions during the transition from failure detection, to convergence, to recovery.

For the purposes of this guide, packet loss is calculated as:

$$
\text{Packet Loss (\%)} = \frac{\text{Packets Sent} - \text{Packets Received}}{\text{Packets Sent}} \times 100
$$

The measurement interval is bound to the defined convergence window, ensuring that only packets lost during the failover and recovery process are included. Loss outside the convergence window (e.g., due to background jitter, drops, or unrelated congestion) is excluded from this calculation.
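The convergence-window and packet-loss definitions above can be expressed as a small post-processing sketch. This is illustrative only: the probe-log format, the `convergence_stats` function, and all names are hypothetical and not part of any EdgeConnect tooling.

```python
# Sketch of the Section 5.2/5.3 definitions, under assumed inputs:
# send_times[i] is the send timestamp (s) of probe sequence number i,
# and received_seqs is the set of sequence numbers that arrived.

INTERVAL_S = 0.100     # 100 ms probe interval (10 pps, per Section 5.1)
STABLE_INTERVALS = 3   # consecutive good intervals required (Section 5.2)

def convergence_stats(send_times, received_seqs):
    """Return (convergence_time_s, packets_lost, loss_pct) over the
    convergence window: first lost probe through the last lost probe,
    confirmed stable by STABLE_INTERVALS consecutive successes after it."""
    lost = [seq for seq in range(len(send_times)) if seq not in received_seqs]
    if not lost:
        return (0.0, 0, 0.0)  # no impairment observed
    first, last = lost[0], lost[-1]
    # Require a sustained run of successes after the last loss so that
    # transient oscillation is excluded from the window (Section 5.2).
    if last + STABLE_INTERVALS >= len(send_times) or not all(
        seq in received_seqs for seq in range(last + 1, last + 1 + STABLE_INTERVALS)
    ):
        raise ValueError("traffic did not restabilize within the trace")
    convergence_time = send_times[last + 1] - send_times[first]
    window_sent = last - first + 1                # probes sent inside the window
    loss_pct = 100.0 * len(lost) / window_sent    # Section 5.3 formula
    return (convergence_time, len(lost), loss_pct)

# Example: a 10 s trace at 10 pps with probes 20-24 lost (~0.5 s outage).
send_times = [i * INTERVAL_S for i in range(100)]
received = set(range(100)) - {20, 21, 22, 23, 24}
print(convergence_stats(send_times, received))
```

Because the loss percentage is computed only over probes sent inside the convergence window, steady-state background loss outside the window does not inflate the result, matching the exclusion rule in Section 5.3.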
### 5.4 Test Matrix: EdgeHA Branch

- **EdgeHA (VRRP)**
  - Power loss to VRRP active
  - LAN-side failure to VRRP active
  - WAN-side failure (primary label) to VRRP active
  - WAN-side failure (backup label) to VRRP active
  - EdgeHA link failure
- **EdgeHA (OSPF/BGP)**
  - Power loss to OSPF/BGP primary
  - LAN-side failure to OSPF/BGP primary
  - WAN-side failure (primary label) to OSPF/BGP primary
  - WAN-side failure (backup label) to OSPF/BGP primary
  - EdgeHA link failure

### 5.5 Test Matrix: Traditional HA Branch

- **Active/Standby (VRRP)**
  - Power loss to VRRP active
  - LAN-side failure to VRRP active
  - WAN-side failure (primary label) to VRRP active
  - WAN-side failure (backup label) to VRRP active
- **Active/Standby (OSPF/BGP)**
  - Power loss to OSPF/BGP active
  - LAN-side failure to OSPF/BGP active
  - WAN-side failure (primary label) to OSPF/BGP active
  - WAN-side failure (backup label) to OSPF/BGP active
- **Active/Active (OSPF/BGP)**
  - Power loss to (1) Traditional HA cluster member
  - LAN-side failure to (1) Traditional HA cluster member
  - WAN-side failure (primary label) to (1) Traditional HA cluster member
  - WAN-side failure (backup label) to (1) Traditional HA cluster member
  - Traditional HA cluster link failure

### 5.6 Result Artifacts (Per Scenario)

- **Convergence Time Table** (Baseline / Optimized / 9.6 Default)
- **Packet Loss Table** (packets lost during the event)

Software versions tested:

- Orchestrator **9.5.4.40573** with ECOS **9.5.4.0_103613**
- Orchestrator **9.6.0.41106** with ECOS **9.0.0_106331**

**Convergence Time – EdgeHA VRRP Active Gateway**
Note: Default BIO = Both WANs set as Primary

| Test # | Test Description | Baseline <br><i><sub>Network Convergence @ ~60 sec</sub></i> | Optimized <br><i><sub>Network Convergence @ ~15 sec</sub></i> | ECOS 9.6 w/VRRP Tracking <br><i><sub>Network Convergence @ ~5 sec</sub></i> |
|--------|---------------------------------------------------|--------|--------|--------|
| 1 | Power loss to VRRP active EdgeHA | ~58 | ~6 | ~15 |
| 2 | LAN-side failure to VRRP active EdgeHA | ~23 | ~5 | ~4 |
| 3 | WAN failure (primary label) to VRRP active EdgeHA | ~7 | 0 | ~3 |
| 4 | WAN failure (backup label) to VRRP active EdgeHA | 0 | 0 | 0 |
| 5 | EdgeHA cluster link failure | 0 | 0 | 0 |

**Packet Loss – EdgeHA VRRP Active Gateway**

| Test # | Scenario | Baseline | Optimized | 9.6 Default |
| :----: | ------------------------------------------------- | ----------: | --------------: | -------------: |
| 1 | Power loss to VRRP active EdgeHA | ~200 | ~40 | ~125 |
| 2 | LAN-side failure to VRRP active EdgeHA | ~110 | ~53 | ~24 |
| 3 | WAN failure (primary label) to VRRP active EdgeHA | ~4 | 0 | ~20 |
| 4 | WAN failure (backup label) to VRRP active EdgeHA | 0 | 0 | 0 |
| 5 | EdgeHA cluster link failure | 0 | 0 | 0 |

**Convergence Time – EdgeHA VRRP Active Gateway**
(Note: Default BIO = WAN set as Primary & Backup)

| Test # | Test Description | Baseline <br><i><sub>Network Convergence @ ~60 sec</sub></i> | Optimized <br><i><sub>Network Convergence @ ~15 sec</sub></i> | 9.6 Default <br><i><sub>Network Convergence @ ~5 sec</sub></i> |
|--------|---------------------------------------------------|--------|--------|--------|
| 1 | Power loss to VRRP active EdgeHA | ~59 | ~13 | |
| 2 | LAN-side failure to VRRP active EdgeHA | ~56 | ~9 | |
| 3 | WAN failure (primary label) to VRRP active EdgeHA | ~34 | ~2 | |
| 4 | WAN failure (backup label) to VRRP active EdgeHA | 0 | 0 | |
| 5 | EdgeHA cluster link failure | 0 | 0 | |

**Packet Loss – EdgeHA VRRP Active Gateway**

| Test # | Scenario | Baseline | Optimized | 9.6 Default |
| :----: | ------------------------------------------------- | ----------: | --------------: | -------------: |
| 1 | Power loss to VRRP active EdgeHA | ~600 | ~90 | |
| 2 | LAN-side failure to VRRP active EdgeHA | ~590 | ~23 | |
| 3 | WAN failure (primary label) to VRRP active EdgeHA | ~378 | ~15 | |
| 4 | WAN failure (backup label) to VRRP active EdgeHA | 0 | 0 | |
| 5 | EdgeHA cluster link failure | 0 | 0 | |

**Convergence Time – EdgeHA BGP Gateway**
Note: Default BIO = Both WANs set as Primary

| Test # | Test Description | Baseline <br><i><sub>Network Convergence @ ~60 sec</sub></i> | Optimized <br><i><sub>Network Convergence @ ~15 sec</sub></i> | 9.6 Default <br><i><sub>Network Convergence @ ~5 sec</sub></i> |
|--------|---------------------------------------------------|--------|--------|--------|
| 1 | Power loss to BGP EdgeHA | ~24 | ~10 | ~40 |
| 2 | LAN-side failure to BGP EdgeHA | ~19 | ~2.2 | ~26 |
| 3 | WAN failure (primary label) to BGP EdgeHA | 0 | ~2 | ~2 |
| 4 | WAN failure (backup label) to BGP EdgeHA | 0 | 0 | 0 |
| 5 | EdgeHA cluster link failure | 0 | 0 | 0 |

**Packet Loss – EdgeHA BGP Gateway**

| Test # | Scenario | Baseline | Optimized | 9.6 Default |
| :----: | ------------------------------------------------- | ----------: | --------------: | -------------: |
| 1 | Power loss to BGP EdgeHA | ~229 | ~85 | ~251 |
| 2 | LAN-side failure to BGP EdgeHA | ~190 | ~20 | ~15 |
| 3 | WAN failure (primary label) to BGP EdgeHA | 0 | 17 | ~1 |
| 4 | WAN failure (backup label) to BGP EdgeHA | 0 | 0 | 0 |
| 5 | EdgeHA cluster link failure | 0 | 0 | 0 |

**Convergence Time – EdgeHA OSPF Gateway**
Note: Default BIO = Both WANs set as Primary

| Test # | Test Description | Baseline <br><i><sub>Network Convergence @ ~60 sec</sub></i> | Optimized <br><i><sub>Network Convergence @ ~15 sec</sub></i> | 9.6 Default <br><i><sub>Network Convergence @ ~5 sec</sub></i> |
|--------|---------------------------------------------------|--------|--------|--------|
| 1 | Power loss to OSPF EdgeHA | ~31 | ~14 | ~35 |
| 2 | LAN-side failure to OSPF EdgeHA | ~28 | ~12 | ~1 |
| 3 | WAN failure (primary label) to OSPF EdgeHA | ~12 | ~8 | 0 |
| 4 | WAN failure (backup label) to OSPF EdgeHA | 0 | 0 | 0 |
| 5 | EdgeHA cluster link failure | 0 | 0 | ~1.5 |

**Packet Loss – EdgeHA OSPF Gateway**

| Test # | Scenario | Baseline | Optimized | 9.6 Default |
| :----: | ------------------------------------------------- | ----------: | --------------: | -------------: |
| 1 | Power loss to OSPF EdgeHA | 0 | 0 | ~246 |
| 2 | LAN-side failure to OSPF EdgeHA | 0 | 0 | ~2 |
| 3 | WAN failure (primary label) to OSPF EdgeHA | 0 | 0 | 0 |
| 4 | WAN failure (backup label) to OSPF EdgeHA | 0 | 0 | 0 |
| 5 | EdgeHA cluster link failure | 0 | 0 | 12 |

**Convergence Time – Traditional VRRP Active Gateway**
Note: Default BIO = Both WANs set as Primary

| Test # | Test Description | Baseline <br><i><sub>Network Convergence @ ~60 sec</sub></i> | Optimized <br><i><sub>Network Convergence @ ~15 sec</sub></i> | 9.6 Default <br><i><sub>Network Convergence @ ~5 sec</sub></i> |
|--------|---------------------------------------------------|--------|--------|--------|
| 1 | Power loss to VRRP active Traditional | ~11 | ~4 | |
| 2 | LAN-side failure to VRRP active Traditional | ~1 | ~1 | |
| 3 | WAN failure (primary label) to VRRP active Traditional | ~59 | ~11 | |
| 4 | WAN failure (backup label) to VRRP active Traditional | 0 | 0 | |
| 5 | Traditional HA cluster link failure | N/A | N/A | |

**Packet Loss – Traditional VRRP Active Gateway**

| Test # | Scenario | Baseline | Optimized | 9.6 Default |
| :----: | ------------------------------------------------- | ----------: | --------------: | -------------: |
| 1 | Power loss to VRRP active Traditional | ~101 | ~25 | |
| 2 | LAN-side failure to VRRP active Traditional | ~1 | ~1 | |
| 3 | WAN failure (primary label) to VRRP active Traditional | ~591 | ~106 | |
| 4 | WAN failure (backup label) to VRRP active Traditional | 0 | 0 | |
| 5 | Traditional HA cluster link failure | N/A | N/A | |

**Convergence Time – Traditional BGP Gateway**
Note: Default BIO = Both WANs set as Primary
Note: When WAN0 on an EC-V carrying active traffic goes down, traffic moves to the secondary default overlay. When WAN0 comes back online, it does not fail back immediately; failback takes time.
| Test # | Test Description | Baseline<br><i><sub>Network Convergence @ ~60 sec</sub></i> | Optimized<br><i><sub>Network Convergence @ ~15 sec</sub></i> | 9.6 Defaults<br><i><sub>Network Convergence @ ~5 sec</sub></i> |
|--------|------------------|------|------|------|
| 1 | Power loss to BGP                  | ~43 | ~9  | ~3  |
| 2 | LAN-side failure to BGP            | ~57 | ~7  | ~13 |
| 3 | WAN failure (primary label) to BGP | ~54 | ~16 | ~82 |
| 4 | WAN failure (backup label) to BGP  | 0   | 0   | 0   |
| 5 | Cluster link failure               | 0   | 0   | 0   |

**Packet Loss – Traditional BGP Gateway**

| Test # | Scenario                           | Baseline | Optimized | 9.6 Defaults |
| :----: | ---------------------------------- | -------: | --------: | -----------: |
|   1    | Power loss to BGP                  | ~420     | ~90       | ~22          |
|   2    | LAN-side failure to BGP            | ~560     | ~57       | ~120         |
|   3    | WAN failure (primary label) to BGP | ~530     | ~163      | ~760         |
|   4    | WAN failure (backup label) to BGP  | 0        | 0         | 0            |
|   5    | Cluster link failure               | 0        | 0         | 0            |

**Convergence Time – Traditional OSPF Gateway**

Note: Default BIO = Both WAN set as Primary

| Test # | Test Description | Baseline<br><i><sub>Network Convergence @ ~60 sec</sub></i> | Optimized<br><i><sub>Network Convergence @ ~15 sec</sub></i> | 9.6 Defaults<br><i><sub>Network Convergence @ ~5 sec</sub></i> |
|--------|------------------|------|------|------|
| 1 | Power loss to OSPF                  | ~38 | ~2 | ~2 |
| 2 | LAN-side failure to OSPF            | ~42 | ~1 | ~3 |
| 3 | WAN failure (primary label) to OSPF | ~43 | 0  | ~8 |
| 4 | WAN failure (backup label) to OSPF  | 0   | 0  | 0  |
| 5 | Cluster link failure                | 0   | 0  | 0  |

**Packet Loss – Traditional OSPF Gateway**

| Test # | Scenario                            | Baseline | Optimized | 9.6 Defaults |
| :----: | ----------------------------------- | -------: | --------: | -----------: |
|   1    | Power loss to OSPF                  | 0        | ~15       | ~17          |
|   2    | LAN-side failure to OSPF            | 0        | 0         | ~25          |
|   3    | WAN failure (primary label) to OSPF | 421      | 0         | ~70          |
|   4    | WAN failure (backup label) to OSPF  | 0        | 0         | 0            |
|   5    | Cluster link failure                | N/A      | N/A       | N/A          |

---

## 6. Best Practices & Operational Guidance

| **Category** | **Convergence Target** | **Global (Fabric-Wide) Setting** | **Per-Appliance/Site Setting** | **When to Use** | **Key Considerations** |
| ------------------ | ---------------------- | ------------------------------------------------------- | -------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------ |
| **Baseline Mode**  | \~60 sec | Conservative timers (default detection/hello intervals) | Default appliance detection & failover response    | - Large fabrics (>500 appliances)<br>- High traffic scale (flows/sec in hundreds of thousands)<br>- Non-critical downtime tolerance | - Minimizes CPU and control traffic overhead<br>- Scales most reliably in diverse networks<br>- Works well with high-bandwidth/high-flow sites |
| **Optimized Mode** | \~15 sec | Moderate timers (faster hello/dead detection)           | Tuned appliance-level failover triggers            | - Mid-sized fabrics<br>- Branches or apps where <30 s recovery adds value but is not mission-critical | - Increases control-plane activity vs Baseline<br>- Appliances must have resource headroom<br>- Balance between recovery speed and scale |
| **9.6 Defaults**   | \~3 sec  | Aggressive timers (short hello/dead detection)          | Appliances set to fast failover and link detection | - Critical sites (healthcare, trading, contact centers)<br>- Where downtime >5 s is unacceptable | - Highest CPU/memory/control traffic overhead<br>- Risk of false positives in noisy/variable networks<br>- Careful lab validation strongly recommended |

### 6.1 Guidance for Tuning

Always treat global and appliance settings as a pair. For example, you won't achieve 3-second convergence with only global tuning if per-appliance timers remain at defaults.

- **Baseline** settings align with the out-of-the-box defaults and are the safest starting point, especially for large fabrics or diverse traffic profiles.
- **Optimized** settings represent a balanced choice where faster convergence is desired but scale still matters.
- **9.6 Defaults** – the ECOS 9.6 default settings provide the fastest convergence out of the box.

---

## 7. Appendices

- **Appendix A – Test Case Matrix** (all scenarios × flows × profiles)
- **Appendix B – Configuration Snippets** (Orchestrator & CLI)
- **Appendix C – Glossary** (VRRP, BFD, Flow Reclassification, Quiescent timers, etc.)
- **Appendix D – Result Tables** (full datasets per topology)
- **Appendix E – HA and Network Convergence Configuration Options Tree**

```mermaid
flowchart TD
    A["Start"] --> B{"Convergence Tolerance?"}
    B -- ~30-60 sec acceptable --> C["Baseline Mode (~60s)"]
    C --> C1["Use for large fabrics >1000 gateways, or high traffic flows"]
    C1 --> C2["Global = Default timers"] & C3["Appliance = Default detection"]
    B -- ~5–30 sec acceptable --> D["Optimized Mode (~15s)"]
    D --> D1["Balanced choice for mid-sized fabrics"]
    D1 --> D2["Global = Moderate timers"] & D3["Appliance = Tuned failover triggers"]
    B -- &lt;~5 sec required --> E["ECOS 9.6 w/ VRRP Tracking (~3s)"]
    E --> E1["Mission-critical only, validate at scale"]
    E1 --> E2["Global = Aggressive timers"] & E3["Appliance = Fast failover detection"]
```
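The convergence-time and packet-loss tables above are linked by a simple rule of thumb: for a constant-rate test flow, packets lost during a failover scale roughly with convergence duration times the per-flow packet rate. The sketch below is illustrative Python only; the helper names, the 10 pps probe rate, and the linear loss model are assumptions for this document, not part of the EdgeConnect test methodology. The second helper mirrors the mode-selection logic of the Appendix E flowchart.

```python
def estimated_packet_loss(convergence_sec: float, pps: float) -> int:
    """Back-of-the-envelope estimate (assumption, not a product formula):
    packets lost during failover ~= convergence duration x packet rate."""
    return round(convergence_sec * pps)


def recommended_mode(tolerance_sec: float) -> str:
    """Map a site's convergence tolerance to an HA tuning mode,
    following the decision tree in Appendix E."""
    if tolerance_sec < 5:
        return "9.6 Defaults (aggressive timers; validate in the lab first)"
    if tolerance_sec < 30:
        return "Optimized (~15 s; moderate timers)"
    return "Baseline (~60 s; default timers)"


# Example: the Traditional BGP baseline row (~43 s convergence, ~420
# packets lost) is consistent with a test flow running near 10 pps.
print(estimated_packet_loss(43, 10))  # -> 430
print(recommended_mode(3))            # -> 9.6 Defaults (...)
```

Treat the linear estimate as an upper bound for a single constant-rate flow; observed loss also depends on flow reclassification and quiescent-timer behavior (see Appendix C).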