By Yann Rapaport, 6WIND Customer Support and Service Manager
This is the fourth post of a series about High Availability capabilities for packet processing software. In the first post, the HA requirements for high performance packet processing software were presented. The second post addressed the problem of the synchronization of Control Plane protocols. The third one explained how to monitor packet processing software to detect possible issues. Now, let’s discuss how packet processing software should implement graceful restart capabilities.
Graceful restart provides the capability to restart the system without interrupting the flows (no packet loss). Restarts can be planned or unplanned. Planned restarts allow In Service Upgrades (ISU) such as updating a software version or installing a debug version in the system.
Graceful restarts for networking and telecoms equipment rely on Non Stop Forwarding (NSF) and Non Stop Routing (NSR).
NSF means packet processing must never stop. Some tables (ARP/NDP, IP routes, IPSec SAD/SPD, NAT/Firewall connections…) should never be flushed to avoid traffic interruption and should never leak.
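One common way to meet both constraints (no flush, no leak) is a mark-and-sweep reconciliation. Here "leak" is read as stale entries lingering forever, which is an interpretation. On restart, existing entries are marked stale but keep forwarding. Entries that the Control Plane re-announces are refreshed, and whatever is still stale at the end of the grace period is removed. The sketch below illustrates that idea only; it is not a description of 6WINDGate. `ForwardingTable` and its methods are hypothetical names, and the lookup is an exact match rather than a real longest-prefix match.

```python
class ForwardingTable:
    """Toy forwarding table that survives a Control Plane restart (hypothetical)."""

    def __init__(self):
        # prefix -> {"next_hop": str, "stale": bool}
        self._entries = {}

    def install(self, prefix, next_hop):
        # Installing or re-announcing an entry always clears its stale mark.
        self._entries[prefix] = {"next_hop": next_hop, "stale": False}

    def lookup(self, prefix):
        # Stale entries are still used for forwarding: this is the NSF part.
        entry = self._entries.get(prefix)
        return entry["next_hop"] if entry else None

    def begin_restart(self):
        # Do not flush: only mark everything as stale.
        for entry in self._entries.values():
            entry["stale"] = True

    def end_grace_period(self):
        # Sweep entries nobody re-announced, so the table does not leak.
        stale = [p for p, e in self._entries.items() if e["stale"]]
        for prefix in stale:
            del self._entries[prefix]
        return stale
```

A restart would call `begin_restart()`, let the restarted Control Plane replay its routes through `install()`, then call `end_grace_period()` once the grace timer fires.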
NSR takes place at the Control Plane level. Control Plane protocols manage communication with the equipment’s neighbors. Some protocols, such as OSPF, BGP or IKE, hold state that results from a negotiation. When a restart occurs, whether detected or solicited, such protocol exchanges should not request a “flush”; otherwise that state would be lost and new negotiations would be required. When a graceful restart occurs, neighbors should be informed that a subsystem is disabled during a grace period; the grace period is defined as the period during which an object can continue to be used until it is confirmed as dead:
- If a neighbor has left during its grace period, its state should be kept as if it was still up,
- When a subsystem is restarting, the neighbors should be notified and they should be informed about the estimated time to get ready,
- NSR can only work if the data path is providing the Control Plane with NSF capabilities.
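The grace period rules above can be sketched as a small neighbor table. This is a minimal illustration under stated assumptions, not how any particular routing protocol implements it: `NeighborTable`, its method names and the injectable `clock` parameter are all hypothetical, and a real protocol such as OSPF carries the grace period in its own messages.

```python
import time


class NeighborTable:
    """Keeps a restarting neighbor usable until its grace period expires (hypothetical)."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock      # injectable so the logic can be tested
        self._up = set()         # neighbors currently considered up
        self._deadlines = {}     # neighbor id -> end of its grace period

    def neighbor_up(self, neighbor_id):
        # A neighbor that comes (back) up is no longer in a grace period.
        self._up.add(neighbor_id)
        self._deadlines.pop(neighbor_id, None)

    def neighbor_restarting(self, neighbor_id, grace_seconds):
        # The neighbor announced a restart and its estimated time to get ready.
        if neighbor_id in self._up:
            self._deadlines[neighbor_id] = self._clock() + grace_seconds

    def is_usable(self, neighbor_id):
        if neighbor_id not in self._up:
            return False
        deadline = self._deadlines.get(neighbor_id)
        if deadline is not None and self._clock() >= deadline:
            # Grace period elapsed without the neighbor returning: confirmed dead.
            self._up.discard(neighbor_id)
            del self._deadlines[neighbor_id]
            return False
        # Either up, or restarting but still within its grace period.
        return True
```

During the grace period `is_usable()` keeps returning `True`, so the neighbor’s state is kept as if it were still up; only once the deadline passes is the neighbor dropped.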
Graceful restart is a system issue, so each component in the processing loop has to provide graceful restart capabilities. If we take the example of OSPF, it has to be compliant with RFC 3623, but graceful restart capabilities also have to be implemented for the FIB (Forwarding Information Base) manager, the synchronization mechanisms between the Control Plane and the Data Plane, the management system and the synchronization mechanisms between peer Control Planes.
Dynamic components whose activation and configuration are performed by the management system are also restarted through the management system.
More information about 6WINDGate architecture can be found here.
6WINDGate High Availability Architecture Overview is available here.
You can check 6WINDGate FAQ here.
It is true that some routing protocols provide more or less limited graceful restart capabilities. However, there is always a more complex but better option of fully stateful redundancy. Granted, it is not an easy task to implement stateful TCP and BGP redundancy, but it has been done in the past by yours truly at Amber Networks (acquired by Nokia in 2001), providing very efficient implementation with very low overhead and under 50 msec switchover under any conditions.
This comment may belong under post #2, but that post does not describe stateful control plane updates, so I’ve decided to comment here.
Thanks,
Alex