BGP Router Crash in Telehouse North Backbone

Event Started
2022-04-13 10:46
Report Published
2022-04-13 11:10
Last Updated
2022-04-16 15:16
Event Finished
Ongoing

Timeline (most recent first)
  • 2022-04-16
    15:15:00

    We are tentatively marking this as closed.

  • 2022-04-15
    22:13:00

    The new BGP process on earhart has remained stable throughout the day. We note that average CPU usage over the last 16+ hours is approximately one third of the averages previously recorded for comparable periods and traffic loads.

    We are continuing to monitor.

  • 2022-04-15
    05:57:00

    The new BGP process on earhart remains stable. We are continuing to monitor.

  • 2022-04-15
    04:57:00

    We have enabled interfaces on earhart, which is now running the newer BGP process, and are monitoring the situation.

  • 2022-04-14
    16:30:00

    We have prepared an update to the router, which includes a more recent version of the BGP routing process. We are going to perform some testing before deciding whether to reintroduce the router into the network.

  • 2022-04-14
    09:32:00

    The BGP process on our peering and transit router in Telehouse North, earhart.n.faelix.net, has once again crashed spontaneously, causing a period of network instability.

    We've removed earhart from service while we investigate this issue.

  • 2022-04-13
    19:36:00

    During our ongoing monitoring we noticed that some prefix-lists had become corrupted in the running configurations (deviating from the saved configurations). We have removed and re-applied these, and the affected traffic flows are now going via the expected paths.

  • 2022-04-13
    14:23:00

    The network has been completely stable for the last hour, since we fixed the RIB/FIB mismatches.

    We are continuing to monitor the network.

  • 2022-04-13
    13:24:00

    We have identified and resolved some lingering RIB/FIB mismatches.

  • 2022-04-13
    11:42:00

    The BGP process is still running.

  • 2022-04-13
    11:34:00

    The router has booted up again.

  • 2022-04-13
    11:13:00

    The router ran for 6 minutes, before the BGP process crashed:

    vyos@earhart.n.faelix.net:~$ show ip bgp sum
    vtysh: error reading from bgpd: Connection reset by peer (104)
    Warning: closing connection to bgpd because of an I/O error!
    Warning: connecting to bgpd...failed!
    bgpd is not running

  • 2022-04-13
    11:00:00

    The affected router is finishing its reboot.

  • 2022-04-13
    10:36:00

    We've received alerts about a routing issue in Telehouse North, which is blackholing significant amounts of traffic.
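
Note on RIB/FIB mismatches

The 13:24 update on 2022-04-13 refers to RIB/FIB mismatches, meaning routes the routing process has selected (RIB) that do not match what the forwarding plane has installed (FIB). As an illustration only, the following is a minimal sketch of how such a comparison could be expressed. It is not the tooling used during this incident. The function names, the dictionary layout, and the sample prefixes (taken from documentation address ranges) are all assumptions; obtaining the two tables from a real router is out of scope here.

```python
import ipaddress


def normalise(table):
    """Return {canonical_prefix: next_hop} for a {prefix: next_hop} mapping."""
    return {
        str(ipaddress.ip_network(prefix, strict=False)): next_hop
        for prefix, next_hop in table.items()
    }


def rib_fib_mismatches(rib, fib):
    """Compare two {prefix: next_hop} tables and classify the differences.

    Hypothetical helper: both inputs are assumed to be already exported
    from the router into plain dictionaries.
    """
    rib, fib = normalise(rib), normalise(fib)
    return {
        # Selected by the routing process but never installed for forwarding.
        "missing_from_fib": sorted(p for p in rib if p not in fib),
        # Still installed for forwarding although the RIB no longer has it.
        "stale_in_fib": sorted(p for p in fib if p not in rib),
        # Present in both, but forwarding towards a different next hop.
        "next_hop_differs": sorted(
            p for p in rib if p in fib and rib[p] != fib[p]
        ),
    }


# Hypothetical sample data (documentation ranges, not real routes).
rib = {
    "192.0.2.0/24": "198.51.100.1",
    "203.0.113.0/24": "198.51.100.2",
    "2001:db8::/32": "2001:db8:ffff::1",
}
fib = {
    "192.0.2.0/24": "198.51.100.1",
    "203.0.113.0/24": "198.51.100.9",
    "198.51.100.128/25": "198.51.100.1",
}

report = rib_fib_mismatches(rib, fib)
for kind, prefixes in report.items():
    print(kind, prefixes)
```

Any non-empty list in the report points at prefixes worth inspecting by hand. A stale or missing FIB entry is one way traffic can end up blackholed or sent down an unexpected path.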