How BGP Routing Changes Can Impact On-line Application Performance

While monitoring on-line banking services we came across an interesting event showing that if you really want to understand why things break, it's not enough to look at what's happening on the surface at the application layer.

For this test we begin in the application view, shown below. We see a dip in availability from a variety of locations around the world, with the red circles indicating the regions where availability issues occurred. To understand what happened and why, we need to dig into some of the other views.


Figure 1: Availability to on-line banking services drops to 64%.

Next we move to the Path Visualization view to understand where the loss is occurring. The figure below shows the routes from an agent in Phoenix on the left to the node where the path ends on the far right. Interfaces with significant loss are circled in red. When an interface is selected, it becomes a dashed line, and the information our agents gathered is displayed in a box when you hover over it (Figure 2).


Figure 2: Routes from Phoenix to Ancestry.com terminate.

The probes from the Phoenix agent all terminated at this single location, resulting in 100% packet loss inside the XO Communications network. Because multiple agents were probing this site during the test, we were able to look at the event in greater detail, and an interesting pattern emerged: all five agent locations exhibited the same behavior, 100% packet loss inside a single network, XO Communications.
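This kind of pattern is straightforward to spot programmatically. The sketch below (with hypothetical data and field names, not an actual monitoring API) aggregates per-hop loss measurements from several agents and flags any network where every probe that reached it saw 100% loss:

```python
from collections import defaultdict

# Hypothetical per-agent traceroute results: each hop records the
# interface's owning network and the packet loss measured at that hop
# (1.0 == 100% loss). Names and values are illustrative only.
traces = {
    "Phoenix": [("Agent ISP", 0.0), ("XO Communications", 1.0)],
    "Seattle": [("Transit A", 0.0), ("XO Communications", 1.0)],
    "Chicago": [("Transit B", 0.0), ("XO Communications", 1.0)],
    "Dallas":  [("Transit C", 0.0), ("XO Communications", 1.0)],
    "Denver":  [("Transit D", 0.0), ("XO Communications", 1.0)],
}

def networks_with_total_loss(traces):
    """Return networks where every agent that reaches them sees 100% loss."""
    seen = defaultdict(list)  # network -> list of loss values
    for hops in traces.values():
        for network, loss in hops:
            seen[network].append(loss)
    return [net for net, losses in seen.items()
            if all(l == 1.0 for l in losses)]

print(networks_with_total_loss(traces))  # -> ['XO Communications']
```

When one network shows up for every agent, as XO Communications does here, the fault is almost certainly inside that network rather than at any individual agent.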


Figure 3: All affected locations have routes terminating in XO Communications' network

Why are all these interfaces inside the XO Communications network dropping packets? To answer this question, we turn to another view: the BGP Route Visualization.


Figure 4: BGP routes are revoked between AS36175 and XO Communications AS2828

Before we get into what happened in this example, let's go through what we're looking at in the figure above. Each BGP Autonomous System (AS) is assigned a unique Autonomous System Number (ASN) for routing on the Internet. There are three ASes in this view: AS 2828, registered to XO Communications; AS 31993, American Fiber Systems, Inc.; and AS 36175, myfamily.com, Inc. You can hover over an individual AS to get more information about it.

Destination networks are shown in green in this view; in this case that is myfamily.com (Ancestry.com), the site we were monitoring. Intermediary ASes in the path between the monitors and the origin are shown as grey shaded circles, with the AS number inside the circle. Here the transit networks are AS 31993, American Fiber Systems, and AS 2828, XO Communications. The smaller circles with location names represent BGP routers that export their BGP best paths to data collectors; we also call these routers monitors. The label "3" on the path between Vancouver and AS 2828 indicates there are three AS hops between that monitor and the AS 2828 network. Dotted red lines represent links that were used as a best path at some point during the 15-minute time window the topology refers to, but are no longer used at the end of the bin. In this case the link to the upstream XO (AS 2828) stopped being used in favor of AS 31993.
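The "dotted red line" logic amounts to comparing AS-level links in best paths across consecutive time bins. Here is a minimal sketch, assuming hypothetical best paths (the AS numbers match this example, but the path data itself is illustrative):

```python
# Hypothetical best paths (AS-number sequences toward origin AS 36175)
# for the same monitor in two consecutive 15-minute bins.
before = [2828, 36175]   # via XO Communications (AS 2828)
after  = [31993, 36175]  # via American Fiber Systems (AS 31993)

def as_links(path):
    """Set of directed AS-to-AS links along a best path."""
    return set(zip(path, path[1:]))

# Links present in the old best path but absent from the new one
# correspond to the dotted red lines: used earlier in the bin, then dropped.
withdrawn_links = as_links(before) - as_links(after)
print(withdrawn_links)  # -> {(2828, 36175)}
```

The same diff, taken across all monitors, is what reveals that every affected path shared the AS 2828 to AS 36175 link that disappeared.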

We can now understand why we were seeing 100% packet loss inside the XO Communications network. A BGP route change occurred, and as a result there were no longer routes available via XO. However, due to BGP convergence delay, packets were still being forwarded to XO Communications. While this was happening, traffic en route to the myfamily.com AS could not reach its destination.
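Correlating the two data sets comes down to checking whether a loss interval begins shortly after a route withdrawal. A minimal sketch, with hypothetical timestamps and a made-up slack parameter:

```python
def blackhole_during_convergence(withdrawn_at, loss_intervals, slack=60):
    """Flag loss intervals that start within `slack` seconds after a
    route withdrawal -- the signature of BGP convergence delay, where
    routers keep forwarding toward a path that has been revoked.
    Times are epoch seconds; intervals are (start, end) tuples."""
    return [iv for iv in loss_intervals
            if 0 <= iv[0] - withdrawn_at <= slack]

# Hypothetical: withdrawal at t=1000s; one loss window starting 5s later,
# and one unrelated window well before the withdrawal.
print(blackhole_during_convergence(1000, [(1005, 1150), (400, 450)]))
# -> [(1005, 1150)]
```

Loss that predates the withdrawal would point to a different cause; loss that starts right after it, as in this event, points squarely at convergence-delay blackholing.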

When things break, you really want to understand where, when, and why the issue occurred. You need to look at forwarding paths, dig into BGP, and correlate both with actual application behavior to understand why that dip in availability happened in the first place.