High Packet Loss Affecting Core Link Backbone

Event Started
2022-11-15 20:55
Report Published
2022-11-15 21:17
Last Updated
2023-10-27 09:18
Event Finished
Ongoing

We're seeing very high packet loss affecting a core link. We're investigating.

Timeline (most recent first)
  • 2022-11-25
    11:57:00

    Following a week of stability, during which we've experienced no further issues with either rs164.w.faelix.net or the replacement core link, we are going to close this issue. A full RFO, with timeline, will be issued in due course.

  • 2022-11-17
    15:16:00

    We have re-established BGP sessions in London and are monitoring the network. Alerts are currently clear, and our weathermap looks nominal.

  • 2022-11-17
    15:05:00

    OSPF adjacency has established correctly. We are adjusting the cost back to normal to bring the Manchester-London link back into operational service.

  • 2022-11-17
    15:00:00

    We are bringing the link back into OSPF, but with cost set high to avoid rerouting traffic.

  • 2022-11-17
    14:53:00

    Our Manchester-London link has been tested and passes traffic correctly. We are going to start bringing it back into service.

  • 2022-11-17
    14:23:00

    Further investigation reveals that the switch in Williams House may not have been the cause of the problem. We are continuing to investigate.

  • 2022-11-17
    13:54:00

    The provider's NOC has contacted us to let us know that they have reprovisioned the services for us. We will be carrying out tests.

  • 2022-11-17
    13:52:00

    The network is stabilising once again, and alerts are starting to clear.

  • 2022-11-17
    13:30:00

    We think the switch might be failing again.

  • 2022-11-16
    21:30:00

    Engineers are still on site at Equinix MA1 (Williams House), monitoring the situation, and evaluating options for next steps.

  • 2022-11-16
    21:00:00

    Services are beginning to restore.

  • 2022-11-16
    20:55:00

    We have power-cycled rs164.w.faelix.net again. This will cause a short period of unreachability between the Manchester and London parts of our network while it boots up.

  • 2022-11-16
    20:25:00

    OSPF has reconverged on all our core routers, but BGP has not. CPU usage on all our peering, transit, and backbone routers is at 200-400%, almost entirely in the bgpd process, and BGP sessions between the routers and the route-servers are unstable.

  • 2022-11-16
    20:14:00

    The issue is happening again.

  • 2022-11-16
    18:50:00

    While engineers were en route to the datacentre, all alerts cleared and normal service resumed.

  • 2022-11-16
    18:20:00

    This appears to have happened again.

  • 2022-11-15
    22:30:00

    The network has stabilised after we power-cycled rs164.w.faelix.net, the device which had locked up and which we could not access remotely.

    We are still waiting for an update from the provider of our other north-south link on the fault we have open with them; that fault is why we lost both paths between Manchester and London.

  • 2022-11-15
    22:10:00

    Engineers have arrived on site and have power-cycled a device whose control plane was inaccessible (both in-band and out-of-band).