High Packet Loss Affecting Core Backbone Link

Event Started
2022-11-15 20:55
Report Published
2022-11-15 21:17
Last Updated
2023-10-27 09:18
Event Finished

We're seeing very high packet loss affecting a core link. We're investigating.

Timeline (most recent first)
  • 2022-11-25

    Following a week of stability, during which we've experienced no further issues with either rs164.w.faelix.net or the replacement core link, we are going to close this issue. A full RFO, with timeline, will be issued in due course.

  • 2022-11-17

    We have re-established BGP sessions in London and are monitoring the network. Alerts are currently clear, and our weathermap looks nominal.

  • 2022-11-17

    OSPF adjacency has established correctly. We are adjusting the cost back to normal to bring the Manchester-London link back into operational service.

  • 2022-11-17

    We are bringing the link back into OSPF, but with its cost set high so that traffic is not rerouted onto it yet.
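    This is a standard "drain" technique: the link participates in OSPF, so the adjacency can be verified, but its artificially high cost keeps it out of the best paths until testing completes. A minimal sketch in FRR/Quagga-style configuration (the interface name and cost values are illustrative assumptions, not our actual settings):

    ```
    ! Hypothetical FRR/Quagga ospfd configuration -- interface name and
    ! cost values are illustrative, not our production settings.
    interface eth1
     ! drain state: the adjacency forms, but the path is unpreferred
     ip ospf cost 65000
    ```

    Restoring a normal, lower cost afterwards is what returns the link to service.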

  • 2022-11-17

    Our Manchester-London link has been tested and passes traffic correctly. We are going to start bringing it back into service.

  • 2022-11-17

    Further investigation reveals that the switch in Williams House may not have been the issue. We are continuing to investigate.

  • 2022-11-17

    The provider's NOC has contacted us to let us know that they have reprovisioned the services for us. We will be carrying out tests.

  • 2022-11-17

    The network is stabilising once again, and alerts are starting to clear.

  • 2022-11-17

    We think the switch might be failing again.

  • 2022-11-16

    Engineers are still on site at Equinix MA1 (Williams House), monitoring the situation, and evaluating options for next steps.

  • 2022-11-16

    Services are beginning to recover.

  • 2022-11-16

    We have power-cycled rs164.w.faelix.net again. This will cause a short period of unreachability between Manchester and London parts of our network while it boots up.

  • 2022-11-16

    While OSPF has reconverged on all our core routers, BGP resolutely refuses to do so. CPU usage on all our peering, transit, and backbone routers is at 200-400%, almost entirely in the bgpd process, and BGP sessions between the routers and the route servers are unstable.
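    As a rough illustration of how this symptom shows up, per-process CPU on a Linux-based router can be listed with ps; bgpd is the FRR/Quagga BGP daemon named above (this is a generic sketch, not our monitoring tooling):

    ```shell
    # List the top CPU consumers; on an affected router, bgpd would
    # dominate this output at several hundred percent CPU.
    ps -eo pid,pcpu,comm --sort=-pcpu | head -n 5
    ```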

  • 2022-11-16

    This is happening again.

  • 2022-11-16

    While engineers were en route to the datacentre, all alerts cleared and normal service resumed.

  • 2022-11-16

    This appears to have happened again.

  • 2022-11-15

    The network has stabilised after power-cycling rs164.w.faelix.net, the device which had locked up and which we could not access remotely.

    We are still waiting for the provider of our other north-south link to update the open fault ticket for the failure which caused us to lose both paths between Manchester and London.

  • 2022-11-15

    Engineers have arrived on site and have power-cycled a device for which the control plane was inaccessible (both in-band and out-of-band).