BGP Troubleshooting cases
Border Gateway Protocol (BGP) is a key component of Internet routing and is responsible for exchanging information on how Autonomous Systems (ASes) can reach one another. When BGP issues occur, inter-network traffic can be affected, from packet loss and latency to complete loss of connectivity. This makes BGP an important protocol for network operators to be able to troubleshoot.
Using BGP route visualizations from real events, we’ll illustrate four scenarios where BGP may be a factor to consider while troubleshooting:
- Peering changes
- Route flapping
- Route hijacking
- DDoS mitigation
One common scenario where BGP comes into play is when a network operator changes peering with ISPs. Peering can change for a variety of reasons, including commercial peering relationships, equipment failures or maintenance. During and after a peering change, it is important to confirm reachability to your service from networks around the world. ThousandEyes presents reachability and route change views, as well as proactive alerts to help troubleshoot issues that may occur.
An example of a peering change in action is with Github and their upstream ISPs. In Figure 1, we see that the peering relationship between Github (AS 36459) changes, as routes are withdrawn to Level 3 (AS 3356 and AS 3549). This is important data when tracking down network performance, which can be adversely affected by major or frequent peering changes.
Figure 1: Github, in light green, has four upstream ISPs (in dotted blue): NTT America (AS 2914), Comcast (AS 7922) and Level 3 (AS 3356 and AS 3549). In this view we see Github changing peers, withdrawing routes to Level 3 (dotted red) and keeping only NTT and Comcast as peers.
Route flapping occurs when routes alternate or are advertised and then withdrawn in rapid sequence, often resulting from equipment or configuration errors. Flapping often causes packet loss and results in performance degradation for traffic traversing the affected networks. Route flaps are visible in ThousandEyes as repeating spikes in route changes on the timeline.
While monitoring Ancestry.com, a popular genealogy website, we noticed a route flap with their upstream providers. In this case, shown in Figure 2, the route flap with XO Communications lasted for about 15 minutes and disrupted connectivity from networks such as GTT and NTT while others such as Level 3 and Cogent that peered with American Fiber had no issues.
Figure 2: The network for Ancestry.com (AS 36175) flaps routes for a short period of time, disconnecting from one of its primary ISPs, XO Communications (AS 2828), while remaining connected to another, American Fiber (AS 31993).
Route hijacking occurs when a network advertises a prefix that it does not control, either by mistake or in order to maliciously deny service or inspect traffic. Since BGP advertisements are generally trusted among ISPs, errors or improper filtering by an ISP can be propagated quickly throughout routing tables around the Internet. As an AS operator, route hijacking is evident when the origin AS of your prefixes changes or when a more specific prefix is broadcast by another party. In some cases, the effects may be localized to only a few networks, but in serious cases hijacks can affect reachability from the entire Internet. You can set alerts in ThousandEyes to notify you of route changes and new subprefixes.
In early April 2014, Indosat, a large Indonesian telecom, incorrectly advertised a majority of the global routing table, in effect claiming that their network was the destination for a large portion of the Internet. The CDN Akamai was particularly hard hit, with a substantial portion of its customers’ traffic rerouted to the Indosat network for nearly an hour. We can see this hijack play out with a test for one of Akamai’s customers, Paypal. Figure 3 shows the hijack in progress, with two origin ASes, the correct one for Akamai (AS 16625) and the incorrect one for Indosat (AS 4761), which for approximately 30 minutes was the destination for 90% of our public BGP vantage points. While this hijack was not intentional, the effects are nonetheless serious.
Figure 3: Paypal’s routes are hijacked when Indosat (AS 4761) incorrectly advertises routes that divert traffic from the proper network belonging to Akamai (AS 16625), affecting a majority of networks.
For companies using cloud-based DDoS mitigation providers, such as Prolexic and Verisign, BGP is a common way to shift traffic to these providers during an attack. Monitoring BGP routes during a DDoS attack is important to confirm that traffic is being routed properly to the mitigation provider’s scrubbing centers. In the case of DDoS mitigation, you’d expect to see the origin AS for your prefixes change from your own AS name to that of your mitigation provider.
We can see this origin AS change in a real example from a global bank that was subject to a DDoS. In Figure 4 we can see the routes to the bank’s AS are withdrawn and new routes to the cloud-based DDoS mitigation vendor are advertised.
Figure 4: A global bank uses BGP to reroute traffic from their own AS (white circle) to that of their DDoS mitigation provider (green circle), with route changes in red.
Monitoring and BGP troubleshooting is a crucial part of managing most large networks. Visibility of BGP route changes and reachability is a powerful tool for operators to correlate events and diagnose root causes. For more information about tracking and correlating BGP changes please contact us.