High Packet Loss Affecting Core Backbone Link

Event Started
2022-11-15 20:55
Report Published
2022-11-15 21:17
Last Updated
2023-10-27 09:18
Event Finished

We're seeing very high packet loss affecting a core link. We're investigating.

Timeline (most recent first)
  • 2022-11-25

    Following a week of stability, during which we've experienced no further issues with either rs164.w.faelix.net or the replacement core link, we are going to close this issue. A full RFO, with timeline, will be issued in due course.

  • 2022-11-17

    We have re-established BGP sessions in London and are monitoring the network. Alerts are currently clear, and our weathermap looks nominal.

  • 2022-11-17

    OSPF adjacency has established correctly. We are adjusting the cost back to normal to bring the Manchester-London link back into operational service.

  • 2022-11-17

    We are bringing the link back into OSPF, but with its cost set high so that traffic is not rerouted onto it yet.
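    This is a standard "drain" technique: the link participates in OSPF, so the adjacency can be verified, but its artificially high cost keeps it out of the best paths until testing completes. A minimal sketch in FRR/Quagga-style configuration (the interface name and cost values are illustrative assumptions, not our actual settings):

    ```
    ! Hypothetical FRR/Quagga ospfd configuration -- interface name and
    ! cost values are illustrative, not our production settings.
    interface eth1
     ! drain state: the adjacency forms, but the path is unpreferred
     ip ospf cost 65000
    ```

    Restoring a normal, lower cost afterwards is what returns the link to service.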

  • 2022-11-17

    Our Manchester-London link has been tested and passes traffic correctly. We are going to start bringing it back into service.

  • 2022-11-17

    Further investigation reveals that the switch in Williams House may not have been the issue. We are continuing to investigate.

  • 2022-11-17

    The provider's NOC has contacted us to let us know that they have reprovisioned the services for us. We will be carrying out tests.

  • 2022-11-17

    The network is stabilising once again, and alerts are starting to clear.

  • 2022-11-17

    We think the switch might be failing again.

  • 2022-11-16

    Engineers are still on site at Equinix MA1 (Williams House), monitoring the situation, and evaluating options for next steps.

  • 2022-11-16

    Services are beginning to recover.

  • 2022-11-16

    We have power-cycled rs164.w.faelix.net again. This will cause a short period of unreachability between Manchester and London parts of our network while it boots up.

  • 2022-11-16

    While OSPF has reconverged on all our core routers, BGP resolutely refuses to do so. CPU usage on all our peering, transit, and backbone routers is at 200-400%, almost entirely in the bgpd process, and BGP sessions between the routers and the route servers are unstable.
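    As a rough illustration of how this symptom shows up, per-process CPU on a Linux-based router can be listed with ps; bgpd is the FRR/Quagga BGP daemon named above (this is a generic sketch, not our monitoring tooling):

    ```shell
    # List the top CPU consumers; on an affected router, bgpd would
    # dominate this output at several hundred percent CPU.
    ps -eo pid,pcpu,comm --sort=-pcpu | head -n 5
    ```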

  • 2022-11-16

    This is happening again.

  • 2022-11-16

    While engineers were en route to the datacentre, all alerts cleared and normal service resumed.

  • 2022-11-16

    This appears to have happened again.

  • 2022-11-15

    The network has stabilised after power-cycling rs164.w.faelix.net, the device which had locked up and which we could not access remotely.

    We are still waiting for the provider of our other north-south link to update the open fault ticket for the failure which caused us to lose both paths between Manchester and London.

  • 2022-11-15

    Engineers have arrived on site and have power-cycled a device for which the control plane was inaccessible (both in-band and out-of-band).