SWIFT: Predictive Fast Reroute

SWIFT is the first BGP fast reroute framework that works upon remote transit outages. Remote outages are challenging because they can happen in any network, at any time, and can affect any set of destinations. Furthermore, the Internet converges slowly upon remote outages, resulting in downtime for users.

SWIFT has two key ingredients. First, it quickly localizes the outage and predicts the affected traffic. Second, it reroutes the affected traffic to unaffected backup paths with just a few data-plane rule updates, using precomputed data-plane tags and a 2-stage forwarding table.

[Figure: before the failure]

BGP converges slowly

Upon a remote outage, BGP messages propagate router by router through the network to announce or withdraw routes for prefixes. This propagation is slow, as each router involved takes time to process the messages. For example, in the scenario depicted above, upon the failure of link (5,6), AS 1 may wait tens of seconds before it learns the new routes to reach the 10k prefixes advertised by AS 6. We computed the duration of the bursts of withdrawals observed on 213 BGP sessions during one month, and found that receiving a full burst often takes several seconds. In addition, updating the forwarding entries for all the affected prefixes is also time-consuming. In the end, convergence may last tens of seconds, sometimes minutes.
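
The burst measurement above can be reproduced with a simple grouping heuristic: withdrawals separated by less than a small time gap belong to the same burst. A minimal Python sketch (the 2-second gap threshold and the function name are illustrative assumptions, not the exact methodology used in the measurement):

```python
def group_bursts(timestamps, max_gap=2.0):
    """Group withdrawal timestamps (in seconds) into bursts.

    Two consecutive withdrawals belong to the same burst if they are
    at most `max_gap` seconds apart.
    """
    bursts = []
    current = []
    for t in sorted(timestamps):
        if current and t - current[-1] > max_gap:
            bursts.append(current)
            current = []
        current.append(t)
    if current:
        bursts.append(current)
    return bursts

# Duration of a burst = last withdrawal time - first withdrawal time.
withdrawals = [0.0, 0.3, 0.9, 1.5, 10.0, 10.2, 10.4]
durations = [b[-1] - b[0] for b in group_bursts(withdrawals)]
```

Applied to a month of BGP session logs, this kind of grouping is what yields the per-burst durations discussed above.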

[Figure: before the failure]

Remote outages often happen

Upon the failure of link (5,6), AS 5 starts propagating BGP messages to its neighbors. If AS 5 and AS 2 have no backup routes for the prefixes advertised by AS 6, they drop the traffic for these prefixes and send BGP withdrawals to their neighbors. During the convergence, AS 1 keeps sending traffic on the path (2,5,6), resulting in downtime for AS 1. We found that BGP peers often receive large bursts of withdrawals (see the figure on the right). We confirmed through private discussions with operators that bursts of withdrawals do indeed cause downtime.

[Figure: before the failure]

Operators care about slow BGP convergence! Take a look at our survey of 72 operators.

Fast and accurate outage localization

Immediately after receiving the first BGP messages of a burst, a SWIFTED router runs an inference algorithm to localize the outage and predict which prefixes will be affected—a sort of time-bound Root Cause Analysis (RCA). Based on this inference, the SWIFTED router reroutes the potentially affected prefixes to paths unaffected by the inferred failure. The figure below shows that the SWIFT inference algorithm is accurate for more than 85% of the bursts. Accurate inferences enable SWIFT to fast reroute, and also ensure safety, since SWIFT uses backup routes that are not affected by the failure. For example, upon the failure of the link between AS 5 and AS 6, SWIFT uses AS 3 as next-hop because the path AS 3 advertises does not use the link (5,6).
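
To make the idea concrete, here is a toy Python sketch of such a localization step. It scores each AS link by the fraction of the prefixes traversing it that were already withdrawn, and predicts that every prefix crossing the best-scoring link is affected. The names are illustrative and the real SWIFT algorithm uses a more elaborate fit score; this is a sketch of the principle, not the paper's algorithm:

```python
from collections import defaultdict

def infer_failed_link(rib, withdrawn):
    """Guess which AS link failed from a partial burst of withdrawals.

    rib: dict mapping prefix -> AS path (tuple of AS numbers)
    withdrawn: set of prefixes already withdrawn in the burst
    Returns the suspected link and the set of prefixes crossing it
    (the traffic a SWIFTED router would preventively reroute).
    """
    total = defaultdict(int)   # prefixes routed over each link
    hit = defaultdict(int)     # withdrawn prefixes over each link
    for prefix, path in rib.items():
        for link in zip(path, path[1:]):
            total[link] += 1
            if prefix in withdrawn:
                hit[link] += 1
    # The link whose prefixes were withdrawn in the largest proportion.
    link = max(total, key=lambda l: hit[l] / total[l])
    affected = {p for p, path in rib.items() if link in zip(path, path[1:])}
    return link, affected

# Toy example mirroring the scenario above: prefixes of AS 6 reached
# via the path (1,2,5,6), and link (5,6) failing.
rib = {
    "p1": (1, 2, 5, 6),
    "p2": (1, 2, 5, 6),
    "p3": (1, 2, 5, 7),
    "p4": (1, 3, 6),
}
link, affected = infer_failed_link(rib, withdrawn={"p1", "p2"})
```

Here only "p1" and "p2" are withdrawn so far, yet link (5,6) already stands out: every prefix crossing it was withdrawn, while links (1,2) and (2,5) also carry the unaffected "p3".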

Fast and flexible data-plane rerouting

SWIFT includes a data-plane encoding scheme that enables a router to flexibly match and reroute the traffic for all the prefixes affected by a remote outage with only a few data-plane rule updates. SWIFT relies on a 2-stage forwarding table to speed up data-plane updates. The first stage tags traversing packets with an encoding of the AS path along which they are forwarded, together with the backup next-hop to use should any link on that path fail. The second stage forwards packets according to these tags. By matching on portions of the tags, a SWIFTED router can quickly select the packets traversing any given AS link(s) and reroute them to a precomputed next-hop. The figure below shows the convergence time of a recent Cisco router upon a remote failure affecting 290k prefixes: the router takes 110 seconds to converge, while its SWIFTED version converges within 2 seconds.
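
A toy Python model of the 2-stage table may help. Stage 1 maps each prefix to a tag encoding the AS links on its path plus its primary next-hop; stage 2 normally forwards on that next-hop, but a single reroute rule matching a link identifier inside the tag redirects every affected prefix at once. The class name and the set-based "tag" are simplifications of the paper's bit-level encoding:

```python
class SwiftTable:
    """Simplified model of SWIFT's 2-stage forwarding table."""

    def __init__(self):
        self.stage1 = {}   # prefix -> (tag, primary next-hop)
        self.stage2 = []   # ordered (failed link, backup next-hop) rules

    def install_route(self, prefix, as_path, next_hop):
        # The tag encodes the AS links the traffic will traverse.
        tag = frozenset(zip(as_path, as_path[1:]))
        self.stage1[prefix] = (tag, next_hop)

    def reroute_link(self, failed_link, backup_next_hop):
        # ONE rule update reroutes every prefix crossing failed_link.
        self.stage2.insert(0, (failed_link, backup_next_hop))

    def forward(self, prefix):
        tag, next_hop = self.stage1[prefix]
        for link, backup in self.stage2:   # stage 2: match on the tag
            if link in tag:
                return backup
        return next_hop

# Two prefixes via AS 2; only the one crossing (5,6) gets rerouted.
table = SwiftTable()
table.install_route("p1", (2, 5, 6), next_hop=2)
table.install_route("p2", (2, 5, 7), next_hop=2)
table.reroute_link((5, 6), backup_next_hop=3)
```

The key property this illustrates is that the number of stage-2 updates depends on the number of failed links, not on the number of affected prefixes.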

[Figure: before the failure]

[Figure: before the failure]

Is SWIFT useful if PIC or MPLS Fast Reroute is already used? Yes

Prefix Independent Convergence (PIC) and MPLS fast reroute only work upon local failures (e.g., on your peering links), not remote ones (e.g., in your neighbor's network). SWIFT works upon both local and remote outages.

Is SWIFT deployable on an arbitrary set of nodes? Yes, under reasonable assumptions

We proved that SWIFT rerouting (i) causes no forwarding loop and (ii) can only be beneficial, irrespective of the set of SWIFTED routers. As a result, you can safely deploy SWIFT on one or all of your routers, independently of whether other ASes also deploy SWIFT.

Does SWIFT require cooperation with other networks to be deployable? No

SWIFT does not require any modification of BGP. In addition, data-plane tags are not propagated to other routers: they are added by the SWIFTED router (in the first stage of the forwarding table) and removed by the same router at the egress. For example, if the destination MAC field is used to carry the data-plane tags, the SWIFTED router writes the tag into the destination MAC field in the first stage of the forwarding table, and removes it in the second stage by setting the destination MAC address to that of the actual next-hop. Deploying SWIFT is therefore invisible to other networks, and no cooperation is required.
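
As a sketch of how a 48-bit tag could ride in the destination MAC field (the exact bit layout is an assumption for illustration; the paper's encoding packs path and next-hop information into specific bit ranges):

```python
def tag_to_mac(tag_bits: int) -> str:
    """Pack a 48-bit SWIFT tag into a MAC-address-shaped string."""
    assert 0 <= tag_bits < 2 ** 48, "tag must fit in 48 bits"
    raw = tag_bits.to_bytes(6, "big")
    return ":".join(f"{b:02x}" for b in raw)

def mac_to_tag(mac: str) -> int:
    """Recover the tag from the destination MAC field."""
    return int(mac.replace(":", ""), 16)

# Stage 1 writes the tag; stage 2 matches on it, then overwrites the
# field with the real next-hop's MAC, so the tag never leaves the router.
mac = tag_to_mac(0x123456)
```

Because the rewrite happens before the packet leaves the second stage, neighbors only ever see ordinary Ethernet frames.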

Does SWIFT update the backup next-hops when it receives a BGP advertisement or withdrawal? Yes

Whenever the SWIFTED router receives an advertisement or a withdrawal for a prefix, it recomputes the backup next-hops to use for that prefix in case of a failure, and updates the data-plane tags accordingly. Only one data-plane rule update is required, in the first stage of the forwarding table, to update the data-plane tags of a prefix.

Does SWIFT comply with routing policies? Yes

When SWIFT computes the backup next-hops, it can take into account policies defined by the operators. For example, operators can prevent SWIFT from using an expensive link with a provider rather than a more convenient one with a customer.

Does SWIFT lose traffic if it reroutes unaffected prefixes (false positives)? No, under reasonable assumptions

When SWIFT reroutes, it always uses valid backup paths. Thus, rerouting the traffic for an unaffected prefix does not result in downtime for that prefix. However, other metrics such as delay or jitter can change while that prefix is rerouted.

SWIFT: Predictive Fast Reroute, SIGCOMM'17. Los Angeles, USA. July, 2017.
(paper, technical report, bibtex)

by Thomas Holterbach, Stefano Vissicchio, Alberto Dainotti, Laurent Vanbever

Boosting the BGP convergence in SDXes with SWIFT, SIGCOMM'17. Los Angeles, USA. July, 2017. Demo (paper, poster, bibtex)

by Philipp Mao, Rudiger Birkner, Thomas Holterbach, Laurent Vanbever

SIGCOMM 2017 (August 2017)

by Thomas Holterbach

We release a Python-based implementation of SWIFT and its algorithms.

You can use our implementation to quickly evaluate the benefit SWIFT would provide on your BGP data.
The source code is publicly available on GitHub.

We provide a VM to show SWIFT in action.

Just run one script in the VM and a virtual network with Quagga routers is built.
Simulate a remote failure and see the benefit of SWIFT!
You can download the VM here. The instructions on how to use it are available in the GitHub repository.

Blink is the first data-driven fast reroute framework that works upon remote transit outages. Remote outages are challenging because they can happen in any network, at any time, and can affect any set of destinations. As the Internet converges slowly, remote outages often result in downtime for users.

Blink leverages emerging programmable switches to detect remote failures directly in the data plane by looking at TCP-induced signals. Upon detecting a failure, Blink quickly restores connectivity by rerouting traffic to working backup paths, at line rate.

It can take O(minutes) to receive the first sign of a failure from the control plane

We illustrate slow control-plane convergence with a case study: we measured how long the first BGP updates took to propagate after the Time Warner Cable (TWC) networks were affected by an outage on August 27, 2014. The figure on the right depicts the CDFs of the time between the outage and the first BGP withdrawal that each BGP peer from RIPE RIS and RouteViews received from each TWC AS (continuous lines): more than half of the BGP peers took more than a minute to receive the first update. The CDFs of the time between the outage and the last prefix withdrawal for each AS (dashed lines) show that BGP convergence can be as slow as several minutes.


[Figure: convergence time upon the TWC outage]

TCP flows exhibit a quick and pronounced behavior upon a remote outage

Upon an outage, the retransmission timer of a TCP source eventually expires, triggering it to retransmit the first unacknowledged packet, again and again, following an exponential backoff. Since this behavior is shared by all TCP flows, when multiple flows experience the same failure, the signal obtained by counting the overall retransmissions consists of “retransmission waves”. The figure on the right shows the retransmission count for a trace that we generated with the ns-3 simulator after simulating a link failure. The first retransmissions arrive quickly after the failure, most of them within a short period of time. Blink leverages this signal to quickly infer failures in the data plane.
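
The timing of these waves follows directly from the backoff rule: after each expiry the source retransmits and doubles its retransmission timeout (RTO). A small sketch, where the 200ms initial RTO and 10s horizon are illustrative values:

```python
def retransmission_times(rto, failure_time=0.0, horizon=10.0):
    """Times at which a TCP source retransmits its first unacked
    packet after a failure, with exponential backoff (RTO doubles
    after every retransmission)."""
    times = []
    t = failure_time
    while True:
        t += rto
        if t > horizon:
            break
        times.append(t)
        rto *= 2
    return times

# With a 200ms initial RTO: retransmissions near 0.2s, 0.6s, 1.4s, ...
waves = retransmission_times(0.2)
```

Since many flows share similar RTOs, their first retransmissions cluster into a pronounced wave right after the failure, which is exactly the signal Blink counts.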


[Figure: TCP behavior upon an outage]

Accurate failure inference

Blink infers failures by looking, on a per-prefix basis, at consecutive TCP retransmissions carrying the same next expected sequence number. To fit in a programmable switch with limited resources, Blink monitors a sample of active flows for each monitored prefix. For better accuracy, Blink uses a sliding window to count the number of flows experiencing retransmissions over time. Whenever the majority of the flows destined to a given prefix experience retransmissions within a time window, Blink infers a failure for that prefix. The figure below shows the True Positive Rate of Blink as a function of the traffic pattern, extracted from 15 real traffic traces. Blink achieves a True Positive Rate above 80% for most traffic patterns (green bars), close to an "upper bound" variant of Blink (gray bars) that monitors all flows instead of just a sample.
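
A toy Python version of this sliding-window majority test (the window size, sample size, and class name are illustrative; the real Blink implements this with registers in the switch pipeline, in P4):

```python
from collections import deque

class BlinkMonitor:
    """Per-prefix inference sketch: infer a failure when a majority of
    the sampled flows retransmit within a sliding time window."""

    def __init__(self, n_flows=64, window=0.8):
        self.n_flows = n_flows      # number of sampled flows
        self.window = window        # sliding window, in seconds
        self.events = deque()       # (time, flow_id) retransmissions

    def on_retransmission(self, time, flow_id):
        """Record one retransmission; return True if a failure is
        inferred for this prefix."""
        self.events.append((time, flow_id))
        # Expire events that fell out of the window.
        while self.events and time - self.events[0][0] > self.window:
            self.events.popleft()
        flows_retransmitting = {f for _, f in self.events}
        return len(flows_retransmitting) > self.n_flows // 2
```

With 4 sampled flows, for instance, a third distinct flow retransmitting inside the window tips the count past the majority threshold and triggers the inference.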

Quick connectivity recovery

Whenever Blink detects a failure, it immediately activates, at line rate, backup paths pre-populated by the control plane. Because the rerouting is performed entirely in the data plane, Blink could introduce forwarding issues such as blackholes or loops. The good news is that, upon rerouting, Blink sends a few of the monitored flows to each backup next-hop for a short period of time and uses them to assess (similarly to how it detects failures) which next-hops work. After this period, Blink reroutes all the traffic to the best working backup next-hop. To illustrate how fast Blink reroutes traffic, we ran it on a Barefoot Tofino switch and measured how fast it restores connectivity upon a failure. The figure below shows that the Tofino switch restored connectivity via a backup link within 460ms with sub-1ms RTTs, and within 1.1s with RTTs ranging from 10ms to 300ms.
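
The final selection step can be sketched as follows (names and the boolean probe outcome are simplifications of Blink's actual per-next-hop flow monitoring):

```python
def select_backup(backups, probe_results):
    """Pick the most-preferred backup next-hop whose probe flows
    resumed after rerouting.

    backups: backup next-hops in preference order
    probe_results: dict next_hop -> True if the few flows probed via
                   that next-hop stopped retransmitting (path works)
    Returns the chosen next-hop, or None if no backup works.
    """
    for next_hop in backups:
        if probe_results.get(next_hop):
            return next_hop
    return None

# Next-hop 3 turns out to be broken too, so all traffic goes to 4.
chosen = select_backup([3, 4], {3: False, 4: True})
```

This is why a broken backup path only blackholes the handful of probe flows for a short assessment period, rather than all the rerouted traffic.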



[Figure: before the failure]

[Figure: before the failure]

Blink: Fast Connectivity Recovery Entirely in the Data Plane, NSDI'19. Boston, USA. February, 2019.
(paper, bibtex)

by Thomas Holterbach, Edgar Costa Molero, Maria Apostolaki, Stefano Vissicchio, Alberto Dainotti, Laurent Vanbever

We release a Python-based implementation of Blink.

You can use our implementation to try Blink on traffic traces and quickly see how it would perform.
The source code is publicly available on GitHub.

We provide all the materials required to run Blink in a virtual environment.

This includes the implementation of Blink in P4_16, a set of scripts and tools to easily build a VM using vagrant and run a mininet-based network with p4 switches, as well as the instructions to run Blink.
All of this is also publicly available on GitHub.