High Packet Loss Affecting Core Link (Backbone)
We're seeing very high packet loss affecting a core link. We're investigating.
- Event Started: 2022-11-15 20:55
- Report Published: 2022-11-15 21:17
- Last Updated: 2023-10-27 09:18
- Event Finished: Ongoing
Timeline (most recent first)
- 2022-11-25 11:57: Following a week of stability, during which we've experienced no further issues with either rs164.w.faelix.net or the replacement core link, we are going to close this issue. A full RFO, with timeline, will be issued in due course.
- 2022-11-17 15:16: We have re-established BGP sessions in London and are monitoring the network. Alerts are currently clear, and our weathermap looks nominal.
- 2022-11-17 15:05: OSPF adjacency has established correctly. We are adjusting the cost back to normal to bring the Manchester-London link back into operational service.
- 2022-11-17 15:00: We are bringing the link back into OSPF, but with cost set high to avoid rerouting traffic.
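These two updates describe a standard drain-and-restore technique: bring the link up with a very high OSPF cost so the adjacency forms without attracting traffic, then lower the cost once the link is verified. The updates don't say which routing stack is in use (the bgpd process named in the 2022-11-16 20:25 entry suggests a Quagga/FRR-style daemon suite), so this is only a minimal sketch assuming FRR's vtysh and a made-up interface name:

```python
#!/usr/bin/env python3
"""Drain/undrain a link by adjusting its OSPF interface cost.

Illustrative sketch only: assumes FRR's vtysh is on the router and
that the Manchester-London link is interface "eth1" (a hypothetical
name). A cost near the 65535 maximum keeps the adjacency up while
OSPF prefers other paths; restoring the normal cost returns the
link to service.
"""
import subprocess

DRAIN_COST = 65000   # near-maximum: adjacency up, traffic kept off
NORMAL_COST = 10     # assumed normal cost for this link

def set_ospf_cost(interface: str, cost: int) -> None:
    # Each -c argument is one line of configuration, applied in order.
    subprocess.run(
        ["vtysh",
         "-c", "configure terminal",
         "-c", f"interface {interface}",
         "-c", f"ip ospf cost {cost}"],
        check=True,
    )

if __name__ == "__main__":
    set_ospf_cost("eth1", DRAIN_COST)   # bring link into OSPF, drained
    # ...verify the adjacency and test traffic, then:
    set_ospf_cost("eth1", NORMAL_COST)  # return link to full service
```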
- 2022-11-17 14:53: Our Manchester-London link has been tested and passes traffic correctly. We are going to start bringing it back into service.
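For illustration, a link test of this kind often starts with a simple packet-loss measurement to the far side of the link (fitting, given the packet loss that opened this incident). A sketch using iputils ping; the far-end address below is a documentation-range placeholder, not the real link address:

```python
#!/usr/bin/env python3
"""Measure packet loss across a link before returning it to service.

Illustrative sketch: sends a burst of pings to the far end of the
link and parses the summary line printed by iputils ping.
"""
import re
import subprocess

FAR_END = "192.0.2.1"  # placeholder far-side address (TEST-NET-1)

def loss_percent(target: str, count: int = 100) -> float:
    out = subprocess.run(
        # -i 0.2 is the fastest interval allowed without root
        ["ping", "-c", str(count), "-i", "0.2", target],
        capture_output=True, text=True,
    ).stdout
    m = re.search(r"([\d.]+)% packet loss", out)
    # Treat an unparseable result as total loss
    return float(m.group(1)) if m else 100.0

if __name__ == "__main__":
    print(f"{FAR_END}: {loss_percent(FAR_END):.1f}% loss")
```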
- 2022-11-17 14:23: Further investigation reveals that the switch in Williams House may not have been the issue. We are continuing to investigate.
- 2022-11-17 13:54: The provider's NOC has contacted us to let us know that they have reprovisioned the services for us. We will be carrying out tests.
- 2022-11-17 13:52: The network is stabilising once again, and alerts are starting to clear.
- 2022-11-17 13:30: We think the switch might be failing again.
- 2022-11-16 21:30: Engineers are still on site at Equinix MA1 (Williams House), monitoring the situation and evaluating options for next steps.
- 2022-11-16 21:00: Services are beginning to recover.
- 2022-11-16 20:55: We have power-cycled rs164.w.faelix.net again. This will cause a short period of unreachability between the Manchester and London parts of our network while it boots up.
- 2022-11-16 20:25: While OSPF has reconverged on all our core routers, BGP resolutely refuses to do so. CPU usage on all our peering, transit, and backbone routers is at 200-400%, almost entirely in the bgpd process, and BGP sessions between routers and the route-servers are unstable.
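For illustration, the kind of watch an operator might keep on a struggling BGP daemon during reconvergence looks roughly like the sketch below. It assumes a Quagga/FRR-style bgpd (matching the process name in the update above) and vtysh on the router; thresholds and polling intervals would be site-specific:

```python
#!/usr/bin/env python3
"""Periodically sample bgpd CPU usage and BGP session state.

Illustrative sketch only, assuming a Quagga/FRR-style bgpd and vtysh.
CPU above 100% means more than one core's worth of work, as in the
200-400% figures reported during this incident.
"""
import subprocess
import time

def bgpd_cpu_percent() -> float:
    # ps exits non-zero when no bgpd is running, so don't check=True;
    # an empty result simply sums to 0.0.
    out = subprocess.run(
        ["ps", "-C", "bgpd", "-o", "%cpu="],
        capture_output=True, text=True,
    ).stdout
    return sum(float(v) for v in out.split())

def bgp_summary() -> str:
    # FRR spelling; older Quagga uses "show ip bgp summary".
    return subprocess.run(
        ["vtysh", "-c", "show bgp summary"],
        capture_output=True, text=True,
    ).stdout

if __name__ == "__main__":
    while True:
        print(f"bgpd CPU: {bgpd_cpu_percent():.0f}%")
        print(bgp_summary())
        time.sleep(10)
```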
2022-11-16
20:14:00Happening again.
- 2022-11-16 18:50: While engineers were en route to the datacentre, alerts all cleared and normal service resumed.
- 2022-11-16 18:20: This appears to have happened again.
- 2022-11-15 22:30: The network has stabilised after power-cycling rs164.w.faelix.net, the device which had locked up and which we could not access remotely. We are still waiting for the provider of our other north-south link to update the fault we have open, which caused us to lose both paths between Manchester and London.
- 2022-11-15 22:10: Engineers have arrived on site and have power-cycled a device for which the control plane was inaccessible (both in-band and out-of-band).