London Ring Fibre Incident Backbone

Event Started
2025-01-16 11:23
Report Published
2025-01-16 11:25
Last Updated
2025-01-26 04:16
Event Finished
2025-01-20 15:34

Linked Events

RFO

Prepared by Marek Isalski, CTO, on 2025-01-24.

Executive Summary

Around midday on 16th January, an optical fibre trunk between Telehouse West (our point of presence we call "THW") and DigitalRealtyTrust LON2 (our "IXN" POP) suffered damage. Initially this caused a complete loss of light, but light levels then crept back up for a time, hovering around the minimum acceptable level for our optics, before dropping back to an unmeasurably low level. This caused instability in the switch-to-switch connectivity as well as high CPU load as the link flapped up and down, and both contributed to a period of very high packet loss. We took the link out of service, and during testing that evening our fibre provider found no issues with the underlying dark fibre segment. We returned the fibre to service, but the following day saw a recurrence at around the same time. During their visit to THW, the dark fibre provider found damage to Faelix's connection within their ODF, and replaced it. The fibre provider has not been able to ascertain how that damage occurred, despite requesting engineer access logs from the site owners.
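
As an illustration of what "hovering around the minimum acceptable level" means in practice, the sketch below classifies a received light reading against a receiver sensitivity figure. The threshold values are hypothetical examples, not the specification of the optics on this ring.

    # Illustrative only: classify a received optical power reading against a
    # hypothetical receiver sensitivity; these figures are example values, not
    # the specification of the optics used on the London ring.
    RX_SENSITIVITY_DBM = -14.0   # minimum power the optic can still decode (example)
    MARGIN_DB = 2.0              # "too close for comfort" band above sensitivity (example)

    def classify_rx_power(rx_dbm: float) -> str:
        """Return 'ok', 'marginal' or 'los' (loss of signal) for a light reading."""
        if rx_dbm < RX_SENSITIVITY_DBM:
            return "los"        # below sensitivity: the link cannot come up
        if rx_dbm < RX_SENSITIVITY_DBM + MARGIN_DB:
            return "marginal"   # hovering around the minimum: expect flapping
        return "ok"

    if __name__ == "__main__":
        for reading in (-7.5, -13.2, -40.0):
            print(f"{reading:>6.1f} dBm -> {classify_rx_power(reading)}")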

Background

Faelix has network capacity on a dark fibre "ring" between THW, Telehouse North (THN) and IXN. A switch at each site connects to a switch at each other site via a LACP link, each member of the LACP pair taking one direction around the ring over DWDM.
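
For clarity, here is a toy model (not our actual configuration) of the ring described above: each pair of sites has a two-member bundle, with one member on the direct fibre segment and the other going the long way around via the third site, so damage to any single segment leaves every site pair with one working member.

    # Illustrative only: a toy model of the three-site ring described above.
    from itertools import combinations

    SITES = ["THW", "THN", "IXN"]

    def segment(a, b):
        return tuple(sorted((a, b)))

    def member_paths(a, b):
        """The two LAG members between a and b: the direct segment, and the
        path the other way around the ring via the third site."""
        third = next(s for s in SITES if s not in (a, b))
        return {"direct": [segment(a, b)],
                "via_" + third: [segment(a, third), segment(third, b)]}

    def surviving_members(failed_segment):
        """Which LAG members stay up when one fibre segment is damaged."""
        return {(a, b): [name for name, segs in member_paths(a, b).items()
                         if failed_segment not in segs]
                for a, b in combinations(SITES, 2)}

    if __name__ == "__main__":
        # Damage on the THW-IXN trunk, as in this incident: every site pair
        # keeps one member, on the remaining direction around the ring.
        print(surviving_members(segment("THW", "IXN")))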

During this incident the light level between THW and IXN dropped, causing one of the LACP members on links traversing that path to drop out of the aggregated link. Each such change causes a small amount of CPU usage on the affected ring switches as they recompute paths, and while recomputation is in progress more packets are steered via the CPU ("slow path") rather than the switch ASIC ("fast path"), increasing CPU usage further. However, because of the nature of the damage to the fibre, the link flapped up and down many times, which caused a prolonged period of high CPU usage and slow-path traffic. This resulted in a large amount of packet loss, as well as a backlog of computation caused by the flapping, which on one of the switches in THW we could only resolve by rebooting it.
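
This is not logic our ring switches currently implement; as a point of reference, the sketch below shows the kind of hold-down ("flap damping") check that would keep a repeatedly bouncing member out of the bundle instead of letting every transition trigger another round of recomputation. The window, threshold and hold-down values are hypothetical.

    # Illustrative only: hold-down / flap-damping logic with hypothetical values.
    import time
    from collections import deque

    FLAP_WINDOW_S = 60     # look at link state changes over the last minute
    FLAP_THRESHOLD = 5     # more than this many changes => hold the member down
    HOLD_DOWN_S = 300      # keep it out of the bundle for five minutes

    class FlapDamper:
        def __init__(self):
            self.transitions = deque()
            self.held_until = 0.0

        def link_state_changed(self, now=None):
            now = time.monotonic() if now is None else now
            self.transitions.append(now)
            # Forget transitions that have aged out of the window.
            while self.transitions and self.transitions[0] < now - FLAP_WINDOW_S:
                self.transitions.popleft()
            if len(self.transitions) > FLAP_THRESHOLD:
                self.held_until = now + HOLD_DOWN_S

        def usable(self, now=None):
            now = time.monotonic() if now is None else now
            return now >= self.held_until

    if __name__ == "__main__":
        d = FlapDamper()
        for t in range(12):                      # a burst of flaps, one per second
            d.link_state_changed(now=float(t))
        print(d.usable(now=12.0))                # False: member is held down
        print(d.usable(now=12.0 + HOLD_DOWN_S))  # True: hold-down has expired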

Further Actions

  1. Bringing the affected path back into service on 2025-01-26.

  2. During this incident we identified that our ring switches perform poorly when the site-to-site connectivity flaps a large number of times, which can occur when light levels are just on the cusp of viable transmission.

  3. We will perform lab testing to see whether an updated software version for these switches resolves this performance problem, or whether we need to plan a technology replacement across the three sites (an outline of the kind of soak test we have in mind follows this list).
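
As referenced in item 3, this is a rough outline of the soak test we have in mind for the lab. toggle_port() and control_plane_latency() are placeholders for whatever the lab setup provides (for example an optical attenuator or a test switch); they are not real APIs of our switches.

    # Illustrative only: outline of a lab flap soak test with placeholder hooks.
    import statistics
    import time

    def toggle_port(port: str) -> None:
        """Placeholder: force one link-down/link-up cycle on the port under test."""
        raise NotImplementedError("wire this up to the lab attenuator / test switch")

    def control_plane_latency(host: str) -> float:
        """Placeholder: seconds for the switch management plane to answer a
        trivial request (SSH command, SNMP get, etc.)."""
        raise NotImplementedError("wire this up to the device under test")

    def flap_soak_test(host: str, port: str, flaps: int = 200, gap_s: float = 2.0):
        """Flap the port `flaps` times and record management-plane latency after
        each cycle; 198 flaps in roughly 8 minutes is what the incident produced."""
        samples = []
        for _ in range(flaps):
            toggle_port(port)
            time.sleep(gap_s)
            samples.append(control_plane_latency(host))
        return {"median_s": statistics.median(samples), "worst_s": max(samples)}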

Incident Timeline

This is shown in full on our status page for the incident but the key times are:

  • 2025-01-16 11:22:51 — initial alerts for switch ports down

  • 2025-01-16 11:26 — incident identified and status page updated

  • 2025-01-16 11:29:29 — another round of link flapping

  • 2025-01-16 11:30:26 – 11:38:12 — transmission levels on the margin of viability causing link up/down flapping (198 link state changes) and very high CPU usage, so high that some devices could not keep up with logging these events

  • 2025-01-16 11:36 — ticket raised with dark fibre provider

  • 2025-01-16 11:38 — decision to reboot switch in THW

  • 2025-01-16 11:45 — switch finally responds to reboot command

  • 2025-01-16 11:47 — switch completes reboot

  • 2025-01-16 11:49:01 — dark fibre provider acknowledges ticket

  • 2025-01-16 13:48:42 – 13:49:57 — more flapping light levels (but circuit is now taken down)

  • 2025-01-16 14:33 — dark fibre provider assigns ticket to field engineering team

  • 2025-01-16 18:36:17 – 19:18:27 — dark fibre provider's engineering team begins intrusive testing

  • 2025-01-16 19:30:00 — we reject the fix by the dark fibre provider as light levels are too low (an additional 4 dB of loss is observed on one of the two fibres in the duplex pair)

  • 2025-01-16 19:36:25 – 19:57:08 — dark fibre provider's engineering team continues tests

  • 2025-01-16 19:52 — we report back that light levels are back to normal, and the dark fibre provider's engineer leaves site

  • 2025-01-17 04:04 — light levels stable for the last approximately 8 hours, circuit brought back into service

  • 2025-01-17 11:51:12 – 11:51:33 — light level drops too low to establish connectivity on the THW-IXN path again

  • 2025-01-17 11:51:57 – 11:52:39 — recurrence of light level loss, CPU usage on switches again very high, decision to take path out of service

  • 2025-01-17 11:56 — ticket re-opened with dark fibre provider

  • 2025-01-17 11:57:44 – 11:58:15 — recurrence of light level loss

  • 2025-01-17 11:58:49 – 11:59:51 — recurrence of light level loss

  • 2025-01-17 12:20 — escalated with dark fibre provider

  • 2025-01-17 12:33 — dark fibre provider begins intrusive testing at THW

  • 2025-01-17 16:14 — dark fibre provider begins intrusive testing at IXN

  • 2025-01-20 14:55 — dark fibre provider states that a damaged patch in THW was replaced on the evening of 16th

  • 2025-01-23 17:11 — the dark fibre provider sends through this picture of the damaged patch cable (cross-connect) taken from within their ODF in THW

[Image: damaged patch cable in Telehouse West ODF]
  • 2025-01-26 04:00 — planned window to bring the affected fibre path back into service

Timeline (most recent first)

  • 2025-01-26
    04:16:00

    All ports are back up with acceptable light levels. We are now monitoring the situation.

  • 2025-01-26
    04:06:00

    We are about to begin work now.

  • 2025-01-24
    15:22:00

    We will be bringing the THW-IXN path back into service at approximately 04:00 on 2025-01-26 (Saturday night/Sunday morning).

  • 2025-01-23
    17:11:00

    Our account manager has sent through a picture showing fibre damage that was repaired on 16th.

  • 2025-01-23
    15:28:00

    Our account manager has escalated the investigation, and is asking the technical teams for a proper RFO. "We do not install auto-destructing patch cables, leave this with me."

  • 2025-01-20
    15:34:00

    Our dark fibre provider has closed the ticket with the following RFO:

    Customer reported service as down and requested an investigation. Upon our initial investigation, we have engaged site owner to check the light readings and our fibre team to investigate further. Site owner could not determine any major network incident or official planned maintenance that could affect the service reported. There was no service affecting incident that would have caused the reported interruption. Our fibre team has checked the dark fibre, confirmed no fault found. They cleaned all ports both ends and restored the circuit. Our engineer confirmed that the issue was a faulty patch in Telehouse which he replaced.

    We are still waiting for an explanation as to how a fibre patch within our dark fibre provider's ODF was damaged (this is the second time this has happened and caused problems on this circuit).

  • 2025-01-18
    16:30:00

    We have seen no flaps or light level problems on the circuit for 24 hours, but it remains out of service while we wait for an update from our dark fibre provider.

  • 2025-01-17
    22:37:00

    Latest non-update from our dark fibre provider:

    We have asked our 3rd-party to confirm. We will keep you updated.

  • 2025-01-17
    22:30:00

    We have chased our dark fibre provider again for an update regarding whether third party engineers accessed the ODFs at the time of the incidents.

  • 2025-01-17
    20:06:00

    The latest update from our dark fibre provider confirms light levels received in Interxion are as expected.

  • 2025-01-17
    18:23:00

    We have chased our dark fibre provider for an update. We are still waiting on them to report back from Interxion, and also to provide us with confirmation that nobody (including third parties) was accessing the ODFs at Telehouse West and Interxion during the times of these two incidents.

  • 2025-01-17
    16:14:00

    Our dark fibre provider is going to perform intrusive testing at Interxion.

  • 2025-01-17
    15:59:00

    Light levels at Telehouse West all look good.

  • 2025-01-17
    12:33:00

    The fibre provider is going to perform intrusive testing on the circuit.

  • 2025-01-17
    12:20:00

    We have taken the affected dark fibre path out of service.

    We have escalated this to account management at our dark fibre provider, asking for a full investigation about what is going on.

  • 2025-01-17
    11:56:00

    We're waiting for an update from our dark fibre provider's NOC, because we are not expecting them to be doing any work on the circuit today.

  • 2025-01-17
    11:53:00

    We've just seen light levels drop to unacceptably low levels, and the connections across the fibre drop again.

  • 2025-01-17
    09:30:00

    Service continues to be stable and light levels remain good.

    We are discussing with the dark fibre provider's NOC next steps, and trying to get an understanding from them how this incident happened.

  • 2025-01-17
    04:20:00

    We are seeing good light levels and normal levels of traffic. We have confirmed with our dark fibre provider's network operations centre that we believe service has been restored, and have asked for an RFO.

  • 2025-01-17
    04:10:00

    Light levels are looking much improved. We're just waiting for traffic to stabilise again.

  • 2025-01-17
    04:04:00

    We are about to try bringing the link back into service.

  • 2025-01-16
    20:23:00

    Our fibre provider has asked that we perform tests. Our initial checks look good, but we are not prepared to even test bringing the link back into service until a quiet period. We'll check this at 4am later tonight.

  • 2025-01-16
    19:49:00

    The fibre provider must be continuing to do works, because the light levels received have dropped to nothing. We've taken the circuit completely out of service while they test this properly themselves.

  • 2025-01-16
    19:27:00

    The fibre provider says the issue is resolved, but we are seeing 4 dB more loss on one of the two fibres in the duplex pair. We have gone back to the fibre provider, rejecting this.

  • 2025-01-16
    19:20:00

    We've seen some changes to light levels on the link that has been out of service. We've been asked to test the circuit by our fibre provider as they think they've fixed this.

  • 2025-01-16
    14:48:00

    The network has been stable for almost 2 hours now. We're continuing to monitor the situation, and work with our fibre provider for our London ring to bring this back into full resilience.

  • 2025-01-16
    14:33:00

    Latest update from dark fibre provider:

    We have engaged Fibre team to check the issue further. Once we receive an update will inform you accordingly.

  • 2025-01-16
    13:33:00

    The fibre provider has said:

    We have engaged site owner to check the light readings. kindly confirm whether we have your approval to disconnect this circuit temporarily to take the light readings.

  • 2025-01-16
    12:51:00

    We've taken the link out of service as it keeps bouncing up and down.

  • 2025-01-16
    12:47:00

    The THW-IXN path is sputtering into life, unexpectedly.

  • 2025-01-16
    12:07:00

    We are seeing traffic levels restore to normal.

    We are continuing to work with the dark fibre provider to fix the THW-IXN path of our fibre ring.

  • 2025-01-16
    12:03:00

    CPU load on the switch in IXN has settled down.

  • 2025-01-16
    12:01:00

    We've seen a large number of link bounce events on that switch at IXN, including some affecting the other path between IXN and THN, such as this:

    Last Link Down Time: Jan/16/2025 12:00:54

    Last Link Up Time: Jan/16/2025 12:00:58

  • 2025-01-16
    11:56:00

    Restarting the switch in THW has caused a switch in IXN to have high load. We're monitoring this to see if it returns to normal.

  • 2025-01-16
    11:53:00

    We have begun initial diagnostics with the dark fibre provider.

  • 2025-01-16
    11:47:00

    The locked-up switch has started responding to ping again. We are checking the situation.

  • 2025-01-16
    11:45:00

    The locked-up switch is now rebooting.

  • 2025-01-16
    11:38:00

    The flapping of connectivity has caused one of our switches in Telehouse West to lock up. We've forced a reboot on it, which may cause a bit more disruption, but is intended to make the problem settle down again.

  • 2025-01-16
    11:36:00

    We've raised a ticket with the dark fibre provider.

  • 2025-01-16
    11:26:00

    One path has restored, but the other path, between THW and IXN, remains down. We are raising this with the underlying dark fibre provider.