Upgrade to BGP peering and transit routers in London

Event Started: 2021-08-09 10:59
Report Published: 2021-08-11 04:00
Last Updated: 2021-08-13 07:00
Event Finished: 2021-08-13 07:00

Following a year of unprecedented traffic growth at Faelix, we're upgrading our network in London.

We will be performing the work in the following windows:

04:00-05:00 2021-08-11 Telehouse West, including LINX LON1
04:00-05:00 2021-08-12 Interxion LON2, including LINX LON2
04:00-05:00 2021-08-13 Telehouse North, including Voxility

Prior to the router upgrade at each location we will shut down BGP sessions to external peers and downstream customer devices connected to the core router at that site, to keep disruption to a minimum. Customers with multiple BGP sessions to other routers in London should see service transfer gracefully during the maintenance window.
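As a rough illustration of the redundancy point above, the following Python sketch lists which customers would lose every BGP session if one router were taken down. The customer names and the session map are hypothetical, invented for this example; only the router names appear elsewhere in this report, and the mapping shown is not our real topology.

```python
def customers_without_redundancy(sessions, router_under_maintenance):
    """Return customers whose only BGP sessions terminate on the given router."""
    at_risk = []
    for customer, routers in sessions.items():
        # Sessions that would survive while the router is out of service.
        remaining = set(routers) - {router_under_maintenance}
        if not remaining:
            at_risk.append(customer)
    return sorted(at_risk)


# Hypothetical session map: customer -> routers it has BGP sessions with.
example_sessions = {
    "customer-a": ["dekker", "earhart"],
    "customer-b": ["dekker"],
    "customer-c": ["gunn"],
}

print(customers_without_redundancy(example_sessions, "dekker"))
```

In this made-up example only customer-b is single-homed to the router under maintenance, so only customer-b would see an outage rather than a graceful transfer.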

Update 1 (2021-08-11 15:55)

This morning's maintenance appeared to proceed quite smoothly at first:

04:00 - take down BGP sessions to LINX LON1, customers, etc.
05:00 - bring up services on new router, monitor services in the meantime
06:00 - sign-off from LINX NOC that all is smooth from their perspective

Unfortunately circa 07:30, after just a few hours of handling production load traffic, one of the two 40G QSFP+ network cards in the router flapped its interface down, up, down, up. This caused a routing instability as OSPF and BGP (our interior routing protocols) converged and reconverged.
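For readers who want to spot this kind of down/up/down/up pattern in their own logs, here is a minimal Python sketch that counts link-state transitions within a sliding time window. The event timestamps and the five-minute window are assumptions chosen for illustration; they are not taken from our router's logs.

```python
from datetime import datetime, timedelta


def count_flaps(events, window=timedelta(minutes=5)):
    """Return the largest number of link-state changes seen within any `window`.

    `events` is a time-sorted list of (timestamp, state) tuples,
    where state is "up" or "down".
    """
    # Keep only the timestamps at which the state differs from the previous event.
    transitions = [
        ts for (ts, state), (_, prev) in zip(events[1:], events[:-1]) if state != prev
    ]
    best = 0
    start = 0
    for end, ts in enumerate(transitions):
        # Slide the window start forward until it is within `window` of ts.
        while ts - transitions[start] > window:
            start += 1
        best = max(best, end - start + 1)
    return best


# Hypothetical events: a stable link that then goes down, up, down, up.
example_events = [
    (datetime(2021, 8, 11, 7, 0, 0), "up"),
    (datetime(2021, 8, 11, 7, 30, 0), "down"),
    (datetime(2021, 8, 11, 7, 30, 10), "up"),
    (datetime(2021, 8, 11, 7, 30, 20), "down"),
    (datetime(2021, 8, 11, 7, 30, 30), "up"),
]

print(count_flaps(example_events))
```

A monitoring system could alert when this count exceeds a small threshold, since each transition can trigger OSPF and BGP reconvergence.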

By 09:00 we had migrated traffic from the affected interface and onto the other 40G QSFP+ card in the same router. We thought this had mitigated the risks of a recurrence of the incident at 07:30.

Unfortunately we began to receive reports of spikes of high latency. After some debugging, we tracked the problem down to the other 40G QSFP+ link. At around 12:20 we adjusted ring-buffer sizes on the card, hoping that these were the cause of the jitter. Unfortunately, and unexpectedly, this configuration change caused another brief routing instability as OSPF and BGP reconverged, and it made no significant improvement.
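One simple way to flag latency spikes of this kind in a series of ping results is to compare each sample against the median of the series. The Python sketch below does that. The round-trip times and the threshold factor of 3 are illustrative assumptions, not measurements from this incident.

```python
from statistics import median


def find_latency_spikes(rtts_ms, factor=3.0):
    """Return indices of samples exceeding `factor` times the median RTT."""
    # The median is robust to a few large outliers, so it makes a usable baseline.
    baseline = median(rtts_ms)
    return [i for i, rtt in enumerate(rtts_ms) if rtt > factor * baseline]


# Hypothetical round-trip times in milliseconds.
example_rtts = [1.1, 1.2, 1.0, 48.0, 1.1, 1.3, 52.5, 1.2]

print(find_latency_spikes(example_rtts))
```

The median of the example series is 1.2 ms, so only the two samples near 50 ms are reported.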

Having consulted with colleagues, at 13:40 I took the decision to take our router in Telehouse West out of service, pending further investigation. This is the state that we are operating in currently, with two of three core routers in London, albeit with reduced peering capacity as our LINX LON1 port is currently disabled.

We have performed a series of diagnostics, but do not yet feel we have enough information to understand why the problems happened in the way they did this morning. We will endeavour to use tonight's maintenance window to investigate further on the new router in Telehouse West. If appropriate, we will bring the new router into service during that time. Otherwise we will back out our changes and revert to the old router which is still racked up and connected in Telehouse West.

Update 2 (2021-08-12 08:20)

Last night we performed additional diagnostics against dekker, our router in Telehouse West. After those tests we made the following changes:

disabled hyperthreading on both CPUs
changed RAM layout to reduce cross-bus accesses
changed performance profile of motherboard to adjust I/O priorities
continued to keep the 40G QSFP+ interface connected to rs141 out of production use (this is the interface which flapped down/up/down/up at 07:30 yesterday)

We observed stable ping times and latency, and so from 04:20 this morning we slowly began to bring dekker back into internal routing. By 05:00 we had established iBGP, followed by eBGP sessions on LINX LON1 and to downstream customer routers. We monitored performance throughout the process, and observed the situation to be stable and markedly improved compared with yesterday's period from 09:00 until dekker was removed from production at 13:40.

We have continued to monitor and observe the router for the last 4 hours, and so far feel confident that it is performing as designed. The old router in Telehouse West remains available (powered up and running, but with its network interfaces in a disabled state) should we need to revert.

Assuming everything remains positive for the remainder of the day then late tonight (aka Friday morning 4am-5am) we will finish the remaining pieces of work to migrate our London core onto the new platform. Once that work is complete we will publish a full report as is our standard practice, including both a high-level overview in "plain English" along with a low-level post mortem aimed at network engineers and those of our customers who are interested in a "deep dive".

Update 3 (2021-08-13 14:56)

This morning we began, as planned, to complete the remaining upgrade work. Unfortunately our LINX LON2 migration could not be completed in a reasonable time: LINX had made a configuration error at their end, and it required a second-line engineer to resolve. We therefore decided not to start the work in Telehouse North. That said, we are pleased with the progress made so far, and have significantly increased our routing capacity facing two of the largest internet exchange points in the UK: LON1 and LON2.

We will review the situation in a couple of weeks, and plan to upgrade earhart (our router in Telehouse North) during a future maintenance window. In the meantime our network is connected to all DDoS mitigation, transit and peering providers as before, and all our monitoring indicates that it is performing well.

Timeline (most recent first)

2021-08-21
00:00:00

Latency spikes have not been observed for several days. We are keeping our mitigation in place, but are entering a period of change-freeze until 2021-09-17 23:59, after which we will perform additional tests to confirm the root cause.
2021-08-19
21:00:00

Latency spikes have not been observed in over 48 hours.
2021-08-18
17:00:00

Latency spikes have not been observed in over 24 hours.
2021-08-17
13:00:00

We believe we have found the root cause of the latency spikes.
2021-08-16
10:00:00

We have observed latency spikes once more; we are investigating.
2021-08-15
17:00:00

We have implemented a different workaround which has cleared the latency issues.
2021-08-15
14:00:00

We have observed latency spikes once more; we are investigating.
2021-08-14
12:00:00

We have implemented some temporary workarounds while we continue investigating the cause of these latency spikes.
2021-08-13
21:00:00

We have observed latency spikes once more; we are investigating.
2021-08-13
08:21:00

LINX LON2 sessions are confirmed established.
2021-08-13
08:05:00

LINX NOC has "added the static arp entry to the switch, so [we] should see ARP frames from us now. Apologies this was missed."
2021-08-13
06:27:00

Our issue requires a second-line LINX NOC engineer to resolve.
2021-08-13
05:22:00

We are having difficulties establishing BGP sessions on LINX LON2; we await response from their NOC team.
2021-08-13
04:00:00

We are beginning work on earhart and gunn to move our LINX LON2 peering to new hardware.
2021-08-12
08:20:00

Peering on LINX LON1 (on dekker) continues to perform satisfactorily.
2021-08-12
05:00:00

Peerings on LINX LON1 are now up; we are continuing to monitor performance.
2021-08-12
04:20:00

We believe dekker is performing correctly, and are re-establishing iBGP and eBGP sessions.
2021-08-12
04:00:00

We have begun investigatory work on dekker.
2021-08-11
15:55:00

We will use tonight's maintenance window to perform further investigations.
2021-08-11
13:40:00

We have taken the router in Telehouse West out of operation, pending further investigation.
2021-08-11
12:20:00

We have adjusted ring-buffer sizes on the router's network cards; unfortunately this has brought no improvement.
2021-08-11
10:00:00

We have received reports of bursts of high latency; we are investigating.
2021-08-11
09:00:00

All traffic has been migrated away from the QSFP+ interface to mitigate the risks of recurrence.
2021-08-11
07:30:00

One of the new 40G QSFP+ interfaces has flapped its connection, which caused a brief routing instability.
2021-08-11
06:00:00

We have received sign-off from LINX NOC that they are happy with the implementation of our upgraded peering port on LON1.
2021-08-11
05:00:00

Services have been established on the new router; we are monitoring performance.
2021-08-11
04:00:00

BGP sessions to customers and LINX LON1 have been shut down.