This goal of this post is breif discussion of main factors controlling fast convergence in OSPF-based networks. Network convergence is a term that is sometimes used under various interpretations. Before we discuss the optimization procedures for OSPF, we define network convergence as the process of synchronizing network forwarding tables after a topology change. Network is said to be converged when none of forwarding tables are changing for “some reasonable” amount of time. This “some” amount of time could be defined as some interval, based on the expected maximum time to stabilize after a single topology change. Network convergence based on native IGP mechanisms is also known as network restoration, since it heals the lost connections. Network mechanisms for traffic protection such as ECMP, MPLS FRR or IP FRR offering different approach to failure handling are outside the scope of this article. We are further taking multicast routing fast recovery out of the scope as well, even though this process is tied to IGP re-convergence.
It is interesting to notice that IGP-based “restoration” techniques have one (more or less) important problem. During the time of re-convergence, temporary micro-loops may exist in the topology due to inconsistency of FIB (forwarding) tables of different routers. This behavior is fundamental to link-state algorithms, as routers closer to failure tend to update their forwarding database before the other routers. The only popular routing protocol that lacks this property is EIGRP, which is loop-free at any moment during re-convergence, thanks to the explicit termination of the diffusing computations. For the link state-protocols, there are some enhancements to the FIB update procedures that allow avoiding such micro-loops with link-state routing, described in the document [ORDERED-FIB].
Even though we are mainly concerned with OSPF, ISIS will be mentioned in the discussion as well. It should be noted that compared to IS-IS, OSPF provides less “knobs” for convergence optimization. The main reason is probably the fact that ISIS is being developed and supported by a separate team of developers, more geared towards the ISPs where fast convergence is a critical competitive factor. The common optimization principles, however, are the same for both protocols, and during the conversation will point out at the features that OSPF lacks while IS-IS has for tuning. Finally, we start our discussion with a formula, which is further explained in the text:
Convergence = Failure_Detection_Time + Event_Propagation_Time + SPF_Run_Time + RIB_FIB_Update_Time
The formula reflects the fact that convergence time for a link-state protocol is sum of the following components:
What would you do if your connection is not physical point-to-point or does not allow translating loss of signal information in timely fashion? Good example could be switched Ethernet or Frame-Relay PVC link. Sometimes there are solutions such as Ethernet port failure translation that may detect an upstream switch port failure and reflect it to the downstream ports, which could be reasonably fast. For another example, Frame-Relay may signal PVC loss via asynchronous LMI updates or A-bit (active bit) in LMI status reports. However, such mechanisms, especially the ones relying on Layer 2 feature may not be timely enough to report failure fast. In such cases, it could be a good idea to rely on fast IGP keepalive timers. Both OSPF and ISIS support fast hellos with the dead/hold interval of one second and sub-second hello intervals ([OSPF-FASTHELLO]). Using this medium-agnostic mechanism could reduce fault detection on non point-to-point links to one second, which could be better than relying on Layer-2 specific signaling. However, fast hello timers have one significant drawback: since all hello packets are processes by the router’s main CPU, having hundreds or more of OSPF/IS-IS neighbors may have significant impact on router’s control plane performance. An alternative could be using BFD (bi-directional forwarding detection, see [BFD]), which provides protocol-agnostic failure detection mechanism that could be reused by multiple routing protocols (e.g. OSPF/ISIS/BGP and so on). BFD is based on the same idea of sub-second keepalive timers, that could be implemented in distributed router interface line-cards, therefore saving the control-plane and central CPU from over-utilization.
You may find out typical SPF runtimes for your network (to estimate the total convergence time) by using the command show ip ospf statistics
The next “problem” is known as SPF throttling. Recent Cisco IOS OSPF implementation is designed to use exponential backoff algorithm when scheduling SPF runs. The goal, as usual, is to avoid excessive calculations in the times of high network instability but keep SPF reaction fast for stable networks. Exponential process is identical to the one used for LSA throttling, with the same timer semantics.
So how would one pick up optimal SPF throttling values? As mentioned before, the initial delay should be kept as short as possible to allow for instant reaction to a change but long enough not to trigger the SPF before the LSA is flooded out. It’s hard to determine the delay to flood the LSA, but at least the initial timer should stay above the per-interface LSA flood pacing timer, so that it does not delay two consecutive LSAs flooded through the topology (as you remember, a typical transit link failure results in generation of at least two LSAs). Setting the interface flooding pacing timer to 5ms and initial SPF delay to 10ms should be a good starting point. After the initial run, SPF algorithm should be further held down for at least the amount of time it takes the network to converge after the initial event. This means that the SPF hold-time should be strictly higher than the value “SPF_Initial_Delay + SPF_Runtime + RIB_FIB_Update_Time”. There exists alternate, more pragmatic approach to this timer tuning as well. Let’s say we want to make sure SPF computations do not take more than 50% of the router’s CPU time. For this to happen, the hold time should be at least the same as a typical SPF run time. This value could be found based on the router statistics and tuned individually on every router. Based on our example, we may set the hold interval to 32ms+20% (error margin, set higher to add more safety), which is about 38ms, and the maximum interval could be set to twice the hold time, which translates into 33% CPU usage under the worst condition of non-stop LSA storms. Notice that SPF hold and maximum timers could be tuned per-router, to account for the different CPU powers, if this applies to your scenario. Total network convergence time should be estimated based on the “slowest” router in the area.
If you think of all prefixes that need to be in a typical network core, you would realize that you don’t need any core “transit” link prefixes in there. In fact, all you need are normally the stub links at the edge of your network, e.g. PE router loopbacks or the summary prefixes injected from the lower layers of your network hierarchy. Therefore, it makes sense to suppress the network prefix information advertised for the transit links. One option would be configuring all transit links as IP unnumbered using the IP addresses of the routers’ Loopback interfaces. However, both IS-IS and OSPF has a special protocol capability to implement suppression automatically. In OSPF it is known as “prefix-suppression” and prevents OSPF from including the link type 3 (stub network address) in the router LSA (see [OSPF-PREFIX-SUPPRESS]). As you remember, OSPF represents a point-to-point connection between two routers via two link types in a router LSA: type 1, declaring the connection to another router based on its Router-ID and type 3, describing the stub connection/prefix of a point-to-point link (snmp ifIndex is used if the link is unnunmered). The prefix-suppression feature drops the second link type and leaves only the topological information in the router LSA. As a result, you will not be able to reach the transit link subnet address but still have perfect connectivity within the topology. The command to enable global prefix suppression is entered under OSPF routing process as prefix-suppression to enable suppression globally or per-interface using the syntax ip ospf prefix-suppression [disable]. Notice that by default OSPF does not suppress stub-link advertisement for the router loopback interfaces, unless you have explicitly configured these for suppression.
As soon as you’re done suppressing all transit link subnets, you are normally left with the router loopback interfaces (typically /32 prefixes) and routing information external to the area, such as summary-addresses or external prefixes. Depending on your network configuration the amount of summary addresses could be significant. The best solution to this problem is optimal summarization and filtering unnecessary prefixes, e.g. by means of of summary-address filters and stub area features. Obviously, this requires a hierarchical address plan, which is not always readily available. If re-designing you network’s IP addressing is not an option, you may still rely on Cisco IOS priority prefix sequencing, which is supported in ISIS. Unfortunately, there is no support for this feature in OSPF for IOS yet, though there is support in IOS-XR. You may read more about ISIS support for priority-driven RIB Prefix Installation here ([ISIS-PRIODRIVEN]). The general idea is to expedite some prefix insertion into the forwarding table, starting with the most important ones, such as PE /32 prefixes. It is worth noting that priority sequencing may extend duration of the routing micro-loops during the re-convergence process. In general, the procedure described in ([ORDERED-FIB] works against fast convergence, trading it for loop-free process.
Is there a way to estimate the RIB/FIB manipulation times? As we have seen before, the show ip ospf statistics command provides information on RIB update time, though this output is not provided on every platform, nor there is clear interpretation of the values in Cisco’s documentation, e.g. it’s unclear whether there is a checkpoint mechanism to inform OSPF of the FIB entry updates. Special measurements should be taken to estimate these values, as done in [BLACKBOX-OSPF], and more importantly these values will heavily depend on the platform used. Still the OSPF RIB manipulation statistics could be useful to estimate the lower bound of network convergence time (though we are mostly interested in the accurate upper boundary).
Maximum SPF runtime: 32ms, doubling for safety makes it 64ms
Maximum RIB update: 10ms, doubling for safety makes it 20ms
OSPF interface flood pacing timer: 5ms (does not apply to the initial LSA flooded)
LSA Generation Initial Delay: 10ms (enough to detect multiple link failures resulting from SRLG failure)
SPF Initial Delay: 10ms (enough to hold SPF to allow two consecutive LSAs to be flooded)
Network geographical size: 100 miles (signal propagation is negligible)
Network physical media: 1 Gbps links (serialization delay is negligible)
Estimated network convergence time in response to initial event: 32*2 + 10*2 + 10 + 10 = 40+64 = 100ms. This estimation does not precisely account for FIB update time, but we assume it would be approximately the same as RIB update. We need to make sure out maximum backoff timers exceed this convergence timer to ensure processing is delay above the convergence interval in the worst case scenario.
LSA Generation Hold Time: 100ms (approximately the convergence time)
LSA Generation Maximum Time: 1s (way above the 100ms)
OSPF Arrival Time: 50ms (way below the LSA Generation hold time)
SPF Hold Time: 100ms
SPF Maximum Hold Time: 1s ( Maximum SPF runtime is 32ms, meaning we skip 30 SPF runtimes in the worst condition. This results in SPF consuming no more than 3% of CPU time under worst-case scenario).
Now estimate the worst-case convergence time: LSA_Maximum_Delay (1s) + SPF_Maximum_Delay (1s) + RIB_Update (<1s) < 3 seconds. Even under heavily congested network, CPU usage for SPF calculations will not exceed 3% and network will converge to changes under 3 seconds. Here is a sample OSPF configuration template:
We omitted other fast-convergence elements such as resilient network design, e.g. redundancy resulting in equal-cost multipathing and faster OSPF adjacency restoration or NSF feature which is very helpful to avoid re-convergence during planned downtimes. We also skipped discussing some other features related to OSPF stability such as flooding reduction and LSA group pacing, that could yield performance benefits in networks with large LSDs. It is not possible to cover all relevant technologies in a single blog post, but you may refer to the further reading documents for more information. And finally, if you are planning to tune your IGP for fast convergence, make sure you understand all consequences. Modern routing platforms are capable of handling almost any "stormy" network condition without losing overall network stability, but pushing network to its limits could always be dangerous. Make sure you monitor your OSPF statistics for potentially high or unusual conditions after you performed tuning, or set maximum timers to more conservative values (e.g. 3-5 seconds) to provide additional safety.
[ORDERED-FIB] "Loop-free convergence using oFIB"
[BLACKBOX-OSPF] "Experience in Black-box OSPF Measurement”
[SUBSEC-CONV] “Achieving Sub-second IGP Convergence in Large IP Networks”
[OSPF-FASTHELLO] "OSPF Fast Hello Enhancement"
[SPD] "Understanding Selective Packet Discard"
[TUNING-OSPF] "Tuning OSPF Performance"
[BFD] "Bi-Directional Forwardin Detection"
[OSPF-PREFIX-SUPPRESS] "OSPF Prefix Suppression Feature"
[ROUTED-CAMPUS] "Cisco fully routed campus design guidelines"
[OPT-DAMPENING] "Optimized IP Event Dampening"
[DIJKSTRA-SPF] "Dijkstra SPF algorithm"
It is interesting to notice that IGP-based “restoration” techniques have one (more or less) important problem. During the time of re-convergence, temporary micro-loops may exist in the topology due to inconsistency of FIB (forwarding) tables of different routers. This behavior is fundamental to link-state algorithms, as routers closer to failure tend to update their forwarding database before the other routers. The only popular routing protocol that lacks this property is EIGRP, which is loop-free at any moment during re-convergence, thanks to the explicit termination of the diffusing computations. For the link state-protocols, there are some enhancements to the FIB update procedures that allow avoiding such micro-loops with link-state routing, described in the document [ORDERED-FIB].
Even though we are mainly concerned with OSPF, ISIS will be mentioned in the discussion as well. It should be noted that compared to IS-IS, OSPF provides less “knobs” for convergence optimization. The main reason is probably the fact that ISIS is being developed and supported by a separate team of developers, more geared towards the ISPs where fast convergence is a critical competitive factor. The common optimization principles, however, are the same for both protocols, and during the conversation will point out at the features that OSPF lacks while IS-IS has for tuning. Finally, we start our discussion with a formula, which is further explained in the text:
Convergence = Failure_Detection_Time + Event_Propagation_Time + SPF_Run_Time + RIB_FIB_Update_Time
The formula reflects the fact that convergence time for a link-state protocol is sum of the following components:
- Time to detect the network failure, e.g. interface down condition.
- Time to propagate the event, i.e. flood the LSA across the topology.
- Time to perform SPF calculations on all routers upon reception of the new information.
- Time to update the forwarding tables for all routers in the area.
Part I: Fast Failure Detection
Detecting link and node failures quickly is number one priority for fast convergence. For maximum speed, relying on IGP keepalive times should be avoided whether possible and physical failure detection mechanisms should be used. This implies the use of physical point-to-point links whether possible. As for link technology, it should be able to detect loss of link within shortest interval possible. For example, a point-to-point gigabit Ethernet link may report failure almost instantly (by detecting of network pulses) if there is no Ethernet switch connecting the two nodes. However, there could be some hardware-dependent timers that may delay reporting the physical-layer event, such as debounce timers. With the GigE example, there is carrier-delay timer, which is set per interface using the command carrier-delay (ms). Aiming at fast convergence, you would like to set this time to zero, unless you have special subnetwork technology, such as SONET, which is able to provide protection within a short interval e.g. under 50ms. In that case, you may want to consider setting the technology-specific delay timer to a value higher than the SONET recovery time, so that a non-critical physical failure is never noticed and healed under the network layer. In most cases, it makes sense to rely on subnetwork recovery mechanics if it is available and provides timely repair within your target convergence time. However, more often you have to deal with “cheaper” technology, such as GigE running over DWDM lambdas, and if that’s the case, minimizing the detection/indication timers is your primary goal. Notice that another positive result of using point-to-point link is the fact that OSPF becomes adjacent faster, thanks to the fact that DR elections are no longer needed. Additionally, type 2 LSAs are not generated for point-to-point link, which slightly reduces OSPF LSDB size and topology complexity.What would you do if your connection is not physical point-to-point or does not allow translating loss of signal information in timely fashion? Good example could be switched Ethernet or Frame-Relay PVC link. Sometimes there are solutions such as Ethernet port failure translation that may detect an upstream switch port failure and reflect it to the downstream ports, which could be reasonably fast. For another example, Frame-Relay may signal PVC loss via asynchronous LMI updates or A-bit (active bit) in LMI status reports. However, such mechanisms, especially the ones relying on Layer 2 feature may not be timely enough to report failure fast. In such cases, it could be a good idea to rely on fast IGP keepalive timers. Both OSPF and ISIS support fast hellos with the dead/hold interval of one second and sub-second hello intervals ([OSPF-FASTHELLO]). Using this medium-agnostic mechanism could reduce fault detection on non point-to-point links to one second, which could be better than relying on Layer-2 specific signaling. However, fast hello timers have one significant drawback: since all hello packets are processes by the router’s main CPU, having hundreds or more of OSPF/IS-IS neighbors may have significant impact on router’s control plane performance. An alternative could be using BFD (bi-directional forwarding detection, see [BFD]), which provides protocol-agnostic failure detection mechanism that could be reused by multiple routing protocols (e.g. OSPF/ISIS/BGP and so on). BFD is based on the same idea of sub-second keepalive timers, that could be implemented in distributed router interface line-cards, therefore saving the control-plane and central CPU from over-utilization.
Part II: Event Propagation
In OSPF and IS-IS topology changes (event) are advertised by means of LSA/LSP flooding. For network to completely converge, an LSA/LSP needs to reach every router within its flooding scope. Normally, in properly designed network, the flooding scope is one area (flooding domain), unless the information is flooded as external, i.e. by means of Type-5 LSA in OSPF. In general, LSA/LSP propagation time is determined by the following factors:- LSA generation delay. IGP implementations normally throttle LSA generation to prevent excessive flooding in case of oscillating (constantly flapping) links. Original OSPF specification required every LSA generation to be delayed for a fixed interval that defaulted to one second. To optimize this behavior, Cisco’s OSPF and ISIS implementations use exponential backoff algorithm to dynamically calculate the delay for generating the SAME LSA (same LSA ID, LSA type and originating Router ID) by the router. You may find more information about truncated exponential backoff in [TUNING-OSPF], but in short the process works as following.
Three parameters control the throttling process: initial interval, hold, and max_wait time specified using the command timers throttle lsa initial hold max_wait. Suppose the network was stable for a relatively long time, and then a router link goes down. As a result, the router needs to generate new router LSA, listing the new connection status. The router delays LSA generation the initial amount if milliseconds and sets the next interval to hold milliseconds. This ensures that two consecutive events (e.g. link going down and then back up) will be separated by at least the hold interval. After this, if an additional event occurs after the initial wait window expired, the event would be held for processing until the hold milliseconds window expire. Thus, all events occurring after the initial delay will be accumulated and processed after the hold time expires. This means the next router LSA will be generated no earlier than hold milliseconds. At the same time, the next hold-time would be doubled, i.e. set to 2*hold. Effectively, every time an event occurs during the current wait window, the processing is delayed until the current hold-time expires and the next hold-time interval is doubled. The hold-time grows exponentially as 2^t*hold until it reaches the max_wait value. After this, every event received during current hold-time window would result in the next interval being equal to the constant max_wait. This ensures that exponential growth is limited or in other words the process is truncated. If there are no events for the duration of 2*max_wait milliseconds, the hold-time window is reset back to the initial value, assuming that the flapping link has returned back to the normal condition.
Initial LSA generation delay has significant impact on network convergence time, so it is important to tune it appropriately. The initial delay should be kept to a minimum, such as 5-10 milliseconds – setting it to zero is still not recommended, as multiple link failure may occur synchronously (e.g. SRLG failure) and it could be beneficial to reflect them all in a single LSA/LSP. The hold interval should be tuned so that the next LSA is only sent after the network has converged in response to the first event. This means the LSA hold time should be based on the convergence time per the formula above, or more accurately it should be at least above LSA_Initial_Delay + LSA_Propagation_Delay + SPF_Initial_Delay. You may then set the maximum hold time to at least twice the hold interval to enhance flooding protection against at least two concurrent oscillating processes (having more parallel oscillations in not very probable). Notice that a single link failure normally results in at least two LSAs being generated, by every attached router. - LSA reception delay. This delay is a sum of the ingress queueing delay and LSA arrival delay. When a router receives LSA, it may be subject to ingress queueing, though this effect is not significant unless massive BGP re-convergence is occurring at the same time. Even under heavy BGP TCP ACK storm, Cisco IOS input queue discipline known as Selective Packet Discard (see [SPD]) provides enough room for IGP traffic and handles it as highest priority. The received packets are then rate-limited based on the LSA arrival interval. OSPF rate-limits only reception of the SAME LSAs (see the definition above): there is a fixed delay between reception of the same LSA originated by a peer. This delay should not exceed the hold-time used for LSA generation – otherwise the receiving router may drop the second LSA generated by peer, say upon link recovery. Notice that every router on the LSA flooding path adds cumulative delay to this component, but the good news is that the initial LSA/LSP will not be rate-limited – the arrival delay applies only to the consecutive copy of the same LSA. As such, you may mainly ignore this component for the purpose of the fast reaction to a change, thanks to fast ingress queueing and expedited reception. Keep in mind that if you are tuning the arrival delay you need to adjust the OSPF retransmission timer to be slightly above the first timer. Otherwise, the side that just sent an LSA and has not received an acknowledgemnt may end up re-sending it again just to be dropped by the receiving side. The command to control retransmission interval for the same LSA is timers pacing retransmission
- Processing Delay is the amount of time it takes the router to put the LSA on the outgoing flood lists. This delay could be signification if SPF process starts before flooding the LSA. SPF runtime is not the only contributor to the processing delay, but it’s the one you have control over. If you configured SPF throttling to be fast enough (see next session) – the exact time varies but mainly the initial delays below than 40ms – it may happen so that SPF run occurs before the triggering LSA is flooded to neighbors. This will result in slower flooding process. For faster convergence, it is required that LSAs are always flooded prior to SPF run. ISIS process in Cisco IOS supports the command fast-flood, which ensures the LSPs are flooded ahead of running SPF, irrespective of the initial SPF delay. On contrary, OSPF does not support this feature and your only option (at the moment) is properly tuning SPF runtime delays (see below).
The other component that may affect processing delay is the interface LSA/LSP flooding pacing and egress queueing. Interface flooding pacing is the OSPF feature that mandates a minimum interval between flooding consecutive LSAs out of an interface. This timer runs per interface and only triggers when there is an LSA needed to be sent out right after the previous LSA. The process-level command to control this interval is timers pacing flood (ms) with the default value of 55ms. Note that if there is just one LSA being flooded through the network, this timer will have no effect on its propagation delay, and only the next consecutive LSA could be rate-limited. Therefore, just like with the arrival timer tuning, we can mainly ignore the impact of this delay on the fast convergence process. Still, it is worth tuning the interface flood pacing timers to a small value possible (e.g. 5ms-10ms) to account for the event when multiple LSAs have to be flooded through the topology, since a link failure normally generates at least two LSA/LSPs from both attached routers (we discussed that earlier already). Interesting to note, that a reception of single LSA signaling loss of link from one router is enough to properly rebuild the topology, since SPF algorithm automatically verifies that the link is bidirectional before accounting it for shortest-path computations. Additionally, reducing interface flooding pace timer helps newly attached router to load OSPF database significantly faster, at the expense of some excessive CPU usage. This applies mainly to large OSPF databases and/or flapping link conditions. To protect against frequent massive database reloads on point-to-point links you may additionally use IP Event Dampening feature for suppression of interface status or properly design network for redundancy to avoid full database reloads upon single link restoration. See [OPT-DAMPENING] for information on tuning the IP Event Dampening parameters.
Lastly, egress queueing may result in significant delay on over-utilized links. In short, router’s egress queue depth could be approximated as Q_Depth=Utilization/(1-Utilization), meaning that links with 50% or above constant utilization always result in some queueing delay (in average). Proper QoS configuration, such as reserving enough bandwidth to the control plane packets should neutralize the effect of this component, coupled with the fact that routing update packets normally have higher priority for handling by router processes. - Packet Propagation Delay. This variable depends is a sum of two major contributors: serialization delay at every hop and cumulative signal propagation delay across the topology. The serialization delay is almost negligible on the modern “fast” links (e.g. 12usec for 1500 bytes packet over a 1Gbps link), though it could be more significant on slow WAN links such as series of T1s. Therefore, signal propagation delay is the main contributor due to physical limitations. This value mainly depends on the distance the signal has to travel to cross the whole OSPF/ISIS area. The propagation delay could be roughly approximated as 0.82 ms per 100 miles and have significant impact only for inter-continental deployments or satellite links. For example, it would take at least 41ms to travel a 5000 miles wide topology. However, since most OSPF/ISIS area sizes do not exceed a single continent, this value could not seriously impact total convergence time.
Part III: SPF Calculations
The SPF algorithm complexity could be bounded as O(L+N*log(N)) where N is number of the nodes and L is the number of the links in a topology under consideration. This estimation hold true provided that implementation is optimal (see [DIJKSTRA-SPF]). Worst case complexity for dense topologies could be as high as O(N^2), but this is rarely seen in real-world topologies. SPF runtime used to be a major limiting factor in the routers of 80s (link-state routing was invented in ARPANET) and 90s (initial OSPF/ISIS deployments) that used slow CPUs where SPF computations may have taken seconds to complete. However, progress in modern hardware (the Moore’s Law) allowed significantly reducing the impact of this factor on the network convergence, though it is still one of the major contributors to the convergence time. The use of Incremental SPF (iSPF) allows to further minimize the amount of calculations needed when partial changes occur in the network (see [TUNING-OSPF]). For example, OSPF Type-1 LSA flooding for a leaf connection does not cause complete SPF re-calculation anymore like it would have been when using classic SPF. An important benefit is that the farther away the router is from the failed link, the less time it needs to recompute the SPF. This compensates for the longer propagation delay to deliver the LSA from a distant corner of the network. Notice that OSPF also supports PRC (partial route computation), which takes only a few milliseconds upon reception of Type 3,4,5 LSAs that are treated as distance-vector updates. The PRC process is not delayed and you cannot tune exponential backoff time for PRC, like you can do for IS-IS.You may find out typical SPF runtimes for your network (to estimate the total convergence time) by using the command show ip ospf statistics
show ip ospf statistics OSPF Router with ID (10.4.1.1) (Process ID 1) Area 10: SPF algorithm executed 18 times Summary OSPF SPF statistic SPF calculation time Delta T Intra D-Intra Summ D-Summ Ext D-Ext Total Reason 1w3d 8 0 0 0 0 0 8 R, X 1w3d 12 0 0 0 4 0 16 R, X 1w3d 16 0 0 0 4 0 20 R, X 1w3d 8 0 0 0 0 0 8 R, 1w3d 20 0 0 0 0 0 20 R, X 1w2d 24 0 0 0 8 0 32 R, X 1w2d 8 4 0 0 0 0 12 R, 6d16h 4 0 0 0 0 4 8 R, X 6d16h 4 0 0 0 0 0 4 R, 6d16h 12 0 0 0 8 0 20 R, X RIB manipulation time during SPF (in msec): Delta T RIB Update RIB Delete 1w3d 4 0 1w3d 8 0 1w3d 10 0 1w3d 5 0 1w3d 8 0 1w2d 10 0 1w2d 3 0 6d16h 2 0 6d16h 1 0 6d16h 9 0The above output is divided in two sections: SPF calculation times and RIB manipulation time. For now, we are interested in the values under the “Total” column, which represent the total time it took OSPF process to run SPF. You may see how these values vary, depending on the “Reason” field. You may want to find the maximum value and use it as an upper limit for SPF computation in your network. In our case, it’s 32ms. The other section of the output will be discussed later.
The next “problem” is known as SPF throttling. Recent Cisco IOS OSPF implementation is designed to use exponential backoff algorithm when scheduling SPF runs. The goal, as usual, is to avoid excessive calculations in the times of high network instability but keep SPF reaction fast for stable networks. Exponential process is identical to the one used for LSA throttling, with the same timer semantics.
So how would one pick up optimal SPF throttling values? As mentioned before, the initial delay should be kept as short as possible to allow for instant reaction to a change but long enough not to trigger the SPF before the LSA is flooded out. It’s hard to determine the delay to flood the LSA, but at least the initial timer should stay above the per-interface LSA flood pacing timer, so that it does not delay two consecutive LSAs flooded through the topology (as you remember, a typical transit link failure results in generation of at least two LSAs). Setting the interface flooding pacing timer to 5ms and initial SPF delay to 10ms should be a good starting point. After the initial run, SPF algorithm should be further held down for at least the amount of time it takes the network to converge after the initial event. This means that the SPF hold-time should be strictly higher than the value “SPF_Initial_Delay + SPF_Runtime + RIB_FIB_Update_Time”. There exists alternate, more pragmatic approach to this timer tuning as well. Let’s say we want to make sure SPF computations do not take more than 50% of the router’s CPU time. For this to happen, the hold time should be at least the same as a typical SPF run time. This value could be found based on the router statistics and tuned individually on every router. Based on our example, we may set the hold interval to 32ms+20% (error margin, set higher to add more safety), which is about 38ms, and the maximum interval could be set to twice the hold time, which translates into 33% CPU usage under the worst condition of non-stop LSA storms. Notice that SPF hold and maximum timers could be tuned per-router, to account for the different CPU powers, if this applies to your scenario. Total network convergence time should be estimated based on the “slowest” router in the area.
Part IV: RIB/FIB Update
After completing SPF computation, OSPF performs sequential RIB update to reflect the changed topology. The RIB updates are further propagated to the FIB table – based on the platform architecture this could be either centralized or distributed process. The RIB/FIB update process may contribute the most to the convergence time in the topologies with large amount of prefixes, e.g. thousands or tens of thousands. In such networks, updating RIB and distributed FIB databases on line-cards may take considerable amount of time, such as at the order of 10′s if not 100′s of milliseconds (varies depending on the platform). There are two major ways to minimize the update delay: advertise less prefixes and sequence FIB updates so that important paths are updated before any other.If you think of all prefixes that need to be in a typical network core, you would realize that you don’t need any core “transit” link prefixes in there. In fact, all you need are normally the stub links at the edge of your network, e.g. PE router loopbacks or the summary prefixes injected from the lower layers of your network hierarchy. Therefore, it makes sense to suppress the network prefix information advertised for the transit links. One option would be configuring all transit links as IP unnumbered using the IP addresses of the routers’ Loopback interfaces. However, both IS-IS and OSPF has a special protocol capability to implement suppression automatically. In OSPF it is known as “prefix-suppression” and prevents OSPF from including the link type 3 (stub network address) in the router LSA (see [OSPF-PREFIX-SUPPRESS]). As you remember, OSPF represents a point-to-point connection between two routers via two link types in a router LSA: type 1, declaring the connection to another router based on its Router-ID and type 3, describing the stub connection/prefix of a point-to-point link (snmp ifIndex is used if the link is unnunmered). The prefix-suppression feature drops the second link type and leaves only the topological information in the router LSA. As a result, you will not be able to reach the transit link subnet address but still have perfect connectivity within the topology. The command to enable global prefix suppression is entered under OSPF routing process as prefix-suppression to enable suppression globally or per-interface using the syntax ip ospf prefix-suppression [disable]. Notice that by default OSPF does not suppress stub-link advertisement for the router loopback interfaces, unless you have explicitly configured these for suppression.
As soon as you’re done suppressing all transit link subnets, you are normally left with the router loopback interfaces (typically /32 prefixes) and routing information external to the area, such as summary-addresses or external prefixes. Depending on your network configuration the amount of summary addresses could be significant. The best solution to this problem is optimal summarization and filtering unnecessary prefixes, e.g. by means of of summary-address filters and stub area features. Obviously, this requires a hierarchical address plan, which is not always readily available. If re-designing you network’s IP addressing is not an option, you may still rely on Cisco IOS priority prefix sequencing, which is supported in ISIS. Unfortunately, there is no support for this feature in OSPF for IOS yet, though there is support in IOS-XR. You may read more about ISIS support for priority-driven RIB Prefix Installation here ([ISIS-PRIODRIVEN]). The general idea is to expedite some prefix insertion into the forwarding table, starting with the most important ones, such as PE /32 prefixes. It is worth noting that priority sequencing may extend duration of the routing micro-loops during the re-convergence process. In general, the procedure described in ([ORDERED-FIB] works against fast convergence, trading it for loop-free process.
Is there a way to estimate the RIB/FIB manipulation times? As we have seen before, the show ip ospf statistics command provides information on RIB update time, though this output is not provided on every platform, nor there is clear interpretation of the values in Cisco’s documentation, e.g. it’s unclear whether there is a checkpoint mechanism to inform OSPF of the FIB entry updates. Special measurements should be taken to estimate these values, as done in [BLACKBOX-OSPF], and more importantly these values will heavily depend on the platform used. Still the OSPF RIB manipulation statistics could be useful to estimate the lower bound of network convergence time (though we are mostly interested in the accurate upper boundary).
Sample Fast Convergence Profile
Putting the above information together, let’s try to find an optimum convergence profile based on the fact that we have “show ip ospf statistics” output from the “weakest” router in the area.show ip ospf statistics OSPF Router with ID (10.4.1.1) (Process ID 1) Area 10: SPF algorithm executed 18 times Summary OSPF SPF statistic SPF calculation time Delta T Intra D-Intra Summ D-Summ Ext D-Ext Total Reason 1w3d 8 0 0 0 0 0 8 R, X 1w3d 12 0 0 0 4 0 16 R, X 1w3d 16 0 0 0 4 0 20 R, X 1w3d 8 0 0 0 0 0 8 R, 1w3d 20 0 0 0 0 0 20 R, X 1w2d 24 0 0 0 8 0 32 R, X 1w2d 8 4 0 0 0 0 12 R, 6d16h 4 0 0 0 0 4 8 R, X 6d16h 4 0 0 0 0 0 4 R, 6d16h 12 0 0 0 8 0 20 R, X RIB manipulation time during SPF (in msec): Delta T RIB Update RIB Delete 1w3d 4 0 1w3d 8 0 1w3d 10 0 1w3d 5 0 1w3d 8 0 1w2d 10 0 1w2d 3 0 6d16h 2 0 6d16h 1 0 6d16h 9 0Failure Detection Delay: about 5-10ms worst case to detect/report loss of network pulses.
Maximum SPF runtime: 32ms, doubling for safety makes it 64ms
Maximum RIB update: 10ms, doubling for safety makes it 20ms
OSPF interface flood pacing timer: 5ms (does not apply to the initial LSA flooded)
LSA Generation Initial Delay: 10ms (enough to detect multiple link failures resulting from SRLG failure)
SPF Initial Delay: 10ms (enough to hold SPF to allow two consecutive LSAs to be flooded)
Network geographical size: 100 miles (signal propagation is negligible)
Network physical media: 1 Gbps links (serialization delay is negligible)
Estimated network convergence time in response to initial event: 32*2 + 10*2 + 10 + 10 = 40+64 = 100ms. This estimation does not precisely account for FIB update time, but we assume it would be approximately the same as RIB update. We need to make sure out maximum backoff timers exceed this convergence timer to ensure processing is delay above the convergence interval in the worst case scenario.
LSA Generation Hold Time: 100ms (approximately the convergence time)
LSA Generation Maximum Time: 1s (way above the 100ms)
OSPF Arrival Time: 50ms (way below the LSA Generation hold time)
SPF Hold Time: 100ms
SPF Maximum Hold Time: 1s ( Maximum SPF runtime is 32ms, meaning we skip 30 SPF runtimes in the worst condition. This results in SPF consuming no more than 3% of CPU time under worst-case scenario).
Now estimate the worst-case convergence time: LSA_Maximum_Delay (1s) + SPF_Maximum_Delay (1s) + RIB_Update (<1s) < 3 seconds. Even under heavily congested network, CPU usage for SPF calculations will not exceed 3% and network will converge to changes under 3 seconds. Here is a sample OSPF configuration template:
router ospf 10 ! ! Suppress transit link prefixes ! prefix-suppression ! ! Wait at least 50ms between accepting the same LSA ! timers lsa arrival 50 ! ! Throttle LSA generation ! timers throttle lsa all 10 100 1000 ! ! Throttle SPF runs ! timers throttle spf 10 100 1000 ! ! Pace interface-level flooding ! timers pacing flood 5 ! ! Make retransmission timer > than arrival ! timers pacing retransmission 60 ! ! Enable incremental SPF ! ispf
Conclusions
It has been well known that link-state IGPs could be tuned for sub-second convergence under almost any practical scenario, yet maintain network stability by the virtue of adaptive backoff timers. In this post we tried to provide a practical approach to calculating the optimum throttling timer values based on your recorded network performance. It is worth noting that three most important timers to tune network for sub-second convergence are the failure detection delay, initial LSA generation delay and initial SPF delay. All other timers, such as hold and maximum time serve the purpose of stabilizing network, and affect convergence in "worst-case" unstable network scenarios. Cisco's recommended values for the initial/hold/maximum timers are 10/100/5000 ms (see [ROUTED-CAMPUS], but those may look a bit conservative as they result in the worst-case convergence time above 10 seconds. Additionally, it is important to notice that in large topologies, significant amount of time is spent updating the RIB/FIB updates after reconvergence. Therefore, in addition to tuning the throttling timers you may want to implement other measures such as prefix-suppression, better summarization (e.g. totally stub areas) and minimization of external routing information. If your platform supports the feature, you may also implement priority-driven RIB prefix installation process.We omitted other fast-convergence elements such as resilient network design, e.g. redundancy resulting in equal-cost multipathing and faster OSPF adjacency restoration or NSF feature which is very helpful to avoid re-convergence during planned downtimes. We also skipped discussing some other features related to OSPF stability such as flooding reduction and LSA group pacing, that could yield performance benefits in networks with large LSDs. It is not possible to cover all relevant technologies in a single blog post, but you may refer to the further reading documents for more information. And finally, if you are planning to tune your IGP for fast convergence, make sure you understand all consequences. Modern routing platforms are capable of handling almost any "stormy" network condition without losing overall network stability, but pushing network to its limits could always be dangerous. Make sure you monitor your OSPF statistics for potentially high or unusual conditions after you performed tuning, or set maximum timers to more conservative values (e.g. 3-5 seconds) to provide additional safety.
Further Reading
The following is the minimum list of the publications suggested to read on the topic of fast IGP convergence.[ORDERED-FIB] "Loop-free convergence using oFIB"
[BLACKBOX-OSPF] "Experience in Black-box OSPF Measurement”
[SUBSEC-CONV] “Achieving Sub-second IGP Convergence in Large IP Networks”
[OSPF-FASTHELLO] "OSPF Fast Hello Enhancement"
[SPD] "Understanding Selective Packet Discard"
[TUNING-OSPF] "Tuning OSPF Performance"
[BFD] "Bi-Directional Forwardin Detection"
[OSPF-PREFIX-SUPPRESS] "OSPF Prefix Suppression Feature"
[ROUTED-CAMPUS] "Cisco fully routed campus design guidelines"
[OPT-DAMPENING] "Optimized IP Event Dampening"
[DIJKSTRA-SPF] "Dijkstra SPF algorithm"
No comments:
Post a Comment