Following a year of unprecedented traffic growth at Faelix, we are upgrading our network in London.
The specific windows during which we will be performing this work are:
- 04:00-05:00 2021-08-11 Telehouse West, including LINX LON1
- 04:00-05:00 2021-08-12 Interxion LON2, including LINX LON2
- 04:00-05:00 2021-08-13 Telehouse North, including Voxility
Prior to the router upgrade at each location we will shut down BGP sessions to external peers and to downstream customer devices connected to the core router at that site, to keep disruption to a minimum. Customers with multiple BGP sessions to our other routers in London should see service transfer gracefully during the maintenance window.
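To illustrate what such a drain looks like in practice, here is a sketch assuming a BIRD-based control plane; the protocol/session names are invented, and the exact procedure on our routers may differ:

```shell
# Hypothetical pre-maintenance drain on a BIRD-based router (session names invented).
# Disabling the BGP protocols withdraws our routes, so peers and customers
# converge onto their remaining sessions before we touch the hardware.
birdc show protocols            # confirm which sessions are currently up
birdc disable peer_linx_lon1    # external peering sessions first
birdc disable cust_example      # then downstream customer sessions
sleep 300                       # allow alternate paths to converge before powering down
```

Because the routes are withdrawn before anything is powered off, a customer with a second session to another London router should converge onto that path rather than blackholing traffic.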
Update 1 (2021-08-11 15:55)
This morning's maintenance appeared to proceed quite smoothly at first:
- 04:00 take down BGP sessions to LINX LON1, customers, etc.
- 05:00 bring up services on the new router, monitoring services in the meantime
- 06:00 sign-off from the LINX NOC that all is smooth from their perspective
Unfortunately, circa 07:30, after just a few hours of handling production traffic, one of the two 40G QSFP+ network cards in the router flapped its interface: down, up, down, up. This caused routing instability as OSPF and BGP (our interior routing protocols) converged and reconverged.
By 09:00 we had migrated traffic off the affected interface and onto the other 40G QSFP+ card in the same router. We believed this had mitigated the risk of a recurrence of the 07:30 incident.
Unfortunately, we then began to receive reports of spikes of high latency. After some debugging we tracked the problem down to the other 40G QSFP+ link. At around 12:20 we adjusted the ring-buffer sizes on the card, hoping that these were the cause of the jitter. Unexpectedly, that configuration change caused another brief routing instability of its own as OSPF and BGP reconverged, and it made no significant improvement to the latency.
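For context, on a Linux-based router a ring-buffer adjustment is typically made with ethtool; the interface name and sizes below are illustrative, not our actual configuration:

```shell
# Illustrative NIC ring-buffer tuning (interface name and sizes invented)
ethtool -g enp3s0f0                    # show current and maximum RX/TX ring sizes
ethtool -G enp3s0f0 rx 4096 tx 4096    # enlarge the rings to absorb traffic bursts
# caveat: many NIC drivers re-initialise the interface to apply new ring
# sizes, briefly taking the link down -- exactly the kind of event that
# triggers an OSPF/BGP reconvergence
```

A larger receive ring gives the CPU more slack before packets are dropped under bursts, but the driver-level reset needed to apply it is consistent with the brief instability we observed.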
Having consulted with colleagues, at 13:40 I took the decision to take our router in Telehouse West out of service pending further investigation. This is the state we are currently operating in: two of our three core routers in London, albeit with reduced peering capacity as our LINX LON1 port is currently disabled.
We have performed a series of diagnostics, but do not yet feel we have enough information to understand why the problems happened in the way they did this morning. We will endeavour to use tonight's maintenance window to investigate further on the new router in Telehouse West. If appropriate, we will bring the new router into service during that time. Otherwise we will back out our changes and revert to the old router which is still racked up and connected in Telehouse West.
Update 2 (2021-08-12 08:20)
Last night we performed additional diagnostics against dekker, our router in Telehouse West. Following those tests, we made the following changes:
- disabled hyperthreading on both CPUs
- changed RAM layout to reduce cross-bus accesses
- changed performance profile of motherboard to adjust I/O priorities
- continued to keep the 40G QSFP+ interface connected to rs141 out of production use (this is the interface which flapped down/up/down/up at 07:30 yesterday)
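The first of those changes, disabling hyperthreading, can be done at runtime on a Linux system with a recent kernel (on some platforms it is a BIOS/UEFI setting instead); a sketch, assuming a Linux control plane:

```shell
# Disable SMT (hyperthreading) at runtime via sysfs (kernel 4.19+)
cat /sys/devices/system/cpu/smt/control     # "on", "off", or "notsupported"
echo off > /sys/devices/system/cpu/smt/control
# verify: sibling threads should now be offline
lscpu | grep -i 'thread(s) per core'
```

Turning SMT off trades peak throughput for more predictable per-core latency, which is the property that matters when a router is chasing jitter.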
We observed stable ping times and latency, and so from 04:20 this morning we slowly began to bring dekker back into internal routing. By 05:00 we had established iBGP, followed by eBGP sessions on LINX LON1 and to downstream customer routers. We monitored performance throughout the process, and observed the situation to be stable and markedly improved compared with yesterday's period from 09:00 until dekker was removed from production at 12:20.
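The stability checks above boil down to watching RTT samples for outliers. A minimal sketch of that arithmetic, using invented sample values in milliseconds (the 48.7 stands in for the kind of spike we were hunting):

```shell
# mean RTT and worst deviation from the mean (a crude jitter measure)
# sample values are illustrative, not real measurements
printf '12.1\n12.3\n11.9\n12.0\n48.7\n12.2\n' \
  | awk '{ s += $1; v[NR] = $1 }
         END { m = s / NR
               for (i = 1; i <= NR; i++) { d = v[i] - m; if (d < 0) d = -d; if (d > mx) mx = d }
               printf "mean=%.1fms maxdev=%.1fms\n", m, mx }'
# prints: mean=18.2ms maxdev=30.5ms
```

In production the input would be a stream of RTTs from ping or smokeping-style probes; a large maximum deviation against a modest mean is the signature of the latency spikes described above.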
We have continued to monitor the router's performance for the last four hours, and so far feel confident that it is performing as designed. The old router in Telehouse West remains available (powered up and running, but with its network interfaces disabled) should we need to revert.
Assuming everything remains positive for the remainder of the day, late tonight (Friday morning, 04:00-05:00) we will finish the remaining pieces of work to migrate our London core onto the new platform. Once that work is complete we will publish a full report, as is our standard practice, including both a high-level overview in "plain English" and a low-level post-mortem aimed at network engineers and those of our customers who are interested in a "deep dive".
Update 3 (2021-08-13 14:56)
This morning we began as planned to complete the remaining upgrade work. Unfortunately our LINX LON2 migration could not be completed in a reasonable time: LINX had made a configuration error at their end, and it required their second-line support to resolve. We therefore decided not to start the work in Telehouse North. That said, we are pleased with the progress made so far, and have significantly increased our routing capacity facing two of the largest internet exchange points in the UK: LON1 and LON2.
We will review the situation in a couple of weeks, and plan to upgrade earhart (our router in Telehouse North) during a future maintenance window. In the meantime our network is connected to all DDoS mitigation, transit and peering providers as before, and all our monitoring indicates that it is performing well.