Emergency Reboot of gunn (peering and transit router in Interxion London)

Event Started
2022-08-08 13:42
Report Published
2022-08-08 14:17
Last Updated
2022-09-20 09:36
Event Finished
Ongoing

The BGP process on gunn.x.faelix.net in Interxion has crashed and dropped its configuration. We have initiated an emergency reboot of the affected router.

Timeline (most recent first)
  • 2022-09-04
    13:25:00

    This problem may have occurred again today. We have re-opened this issue.

  • 2022-08-21
    13:45:00

    We noticed brief bgpd process instability. This has been resolved, and combined with recent updates, we now believe that part of FRRouting's RPKI validation process is to blame for the bgpd lock-ups. We have temporarily disabled RPKI to determine whether this is the case, and are monitoring the effects.
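    The update does not record the exact commands used to disable RPKI. As a hedged illustration only, an FRR RPKI cache configuration of roughly this shape can be removed with the corresponding `no` form in vtysh (the cache address, port, and preference here are invented for illustration):

```
configure terminal
 rpki
  ! hypothetical cache server; the real cache details are not in this report
  no rpki cache 192.0.2.1 3323 preference 1
 exit
end
```

    Removing the cache stops validation state being applied; any `match rpki` route-map clauses would also need reviewing, depending on how validation is enforced.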

  • 2022-08-19
    13:45:00

    We noticed brief bgpd process instability. This has been resolved.

  • 2022-08-09
    21:16:00

    We've applied those changes, and brought gunn back into IGP and iBGP. We're monitoring the network.

  • 2022-08-09
    21:02:00

    With the test case we've found a workaround we can apply to our BGP routers' prefix-lists which will not compromise routing security and will survive a deliberately-triggered bgpd process restart. We are applying this as an emergency configuration change to our BGP peering and transit routers.
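    The update does not spell out the change itself. As a hypothetical sketch in FRR prefix-list syntax (the list name and entries are invented for illustration), a value duplicated under two sequence numbers can be collapsed into a single entry, preserving the same policy while avoiding the "duplicated prefix list value" rejection on restart:

```
! before (hypothetical): 0.0.0.0/0 appears under two sequence numbers
ip prefix-list TRANSIT-IN seq 10 deny 0.0.0.0/0
ip prefix-list TRANSIT-IN seq 100 deny 0.0.0.0/0
!
! after: one entry carries the same routing policy
ip prefix-list TRANSIT-IN seq 10 deny 0.0.0.0/0
```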

  • 2022-08-09
    18:28:00

    We have identified a test case we can use to reproduce the second of the two issues (the bgpd process not being programmed correctly). We are going to perform some tests to see if we can find a configuration which avoids this buggy behaviour.

  • 2022-08-09
    15:13:00

    We have reached out to the VyOS development team to try to get more information about why this behaviour has been observed, and how we might work with them to fix it.

  • 2022-08-09
    14:30:00

    From the logs of gunn we can see that the bgpd process appeared to lock up and was automatically restarted:

    Aug  9 13:21:07 gunn watchfrr[1169]: [EC 268435457] bgpd state -> unresponsive : no response yet to ping sent 90 seconds ago
    Aug  9 13:21:07 gunn watchfrr[1169]: Forked background command [pid 3477]: /usr/lib/frr/watchfrr.sh restart bgpd
    Aug  9 13:21:27 gunn watchfrr[1169]: Warning: restart bgpd child process 3477 still running after 20 seconds, sending signal 15
    Aug  9 13:21:27 gunn watchfrr[1169]: restart bgpd process 3477 terminated due to signal 15
    Aug  9 13:21:28 gunn bgpd[1214]: [EC 100663313] SLOW THREAD: task bgpd_sync_callback (7f8ec73e85a0) ran for 58360ms (cpu time 58359ms)
    Aug  9 13:21:28 gunn bgpd[1214]: Terminating on signal
    Aug  9 13:21:29 gunn bgpd[1214]: [EC 100663314] Attempting to process an I/O event but for fd: 35(8) no thread to handle this!
    Aug  9 13:21:29 gunn bgpd[1214]: [EC 100663314] Attempting to process an I/O event but for fd: 70(8) no thread to handle this!
    Aug  9 13:21:29 gunn bgpd[1214]: [EC 100663314] Attempting to process an I/O event but for fd: 93(8) no thread to handle this!
    Aug  9 13:21:29 gunn bgpd[1214]: [EC 100663314] Attempting to process an I/O event but for fd: 39(8) no thread to handle this!
    Aug  9 13:21:29 gunn bgpd[1214]: [EC 100663314] Attempting to process an I/O event but for fd: 82(8) no thread to handle this!
    Aug  9 13:21:53 gunn bgpd[1214]: [EC 100663314] Attempting to process an I/O event but for fd: 49(8) no thread to handle this!
    Aug  9 13:21:56 gunn zebra[1207]: [EC 4043309122] Client 'vnc' encountered an error and is shutting down.
    Aug  9 13:21:56 gunn zebra[1207]: [EC 4043309122] Client 'bgp' encountered an error and is shutting down.
    Aug  9 13:21:56 gunn watchfrr[1169]: [EC 268435457] bgpd state -> down : unexpected read error: Connection reset by peer
    

    The next error message shows it has failed to load the previously running configuration:

    Aug  9 13:22:32 gunn bgpd[3741]: [EC 100663337] nb_candidate_commit_prepare: failed to validate candidate configuration
    

    We see something similar happened at almost exactly the same time on earhart:

    Aug  9 13:20:37 earhart bgpd[2505]: [EC 100663313] SLOW THREAD: task bgpd_sync_callback (7f6d217bc5a0) ran for 65325ms (cpu time 65317ms)
    Aug  9 13:21:05 earhart watchfrr[1159]: [EC 268435457] bgpd state -> unresponsive : no response yet to ping sent 90 seconds ago
    Aug  9 13:21:05 earhart watchfrr[1159]: Forked background command [pid 6933]: /usr/lib/frr/watchfrr.sh restart bgpd
    Aug  9 13:21:25 earhart watchfrr[1159]: Warning: restart bgpd child process 6933 still running after 20 seconds, sending signal 15
    Aug  9 13:21:25 earhart watchfrr[1159]: restart bgpd process 6933 terminated due to signal 15
    Aug  9 13:21:43 earhart bgpd[2505]: [EC 100663313] SLOW THREAD: task bgpd_sync_callback (7f6d217bc5a0) ran for 65677ms (cpu time 65675ms)
    Aug  9 13:21:43 earhart bgpd[2505]: Terminating on signal
    Aug  9 13:21:44 earhart bgpd[2505]: [EC 100663314] Attempting to process an I/O event but for fd: 45(8) no thread to handle this!
    Aug  9 13:21:44 earhart bgpd[2505]: [EC 100663314] Attempting to process an I/O event but for fd: 36(8) no thread to handle this!
    Aug  9 13:21:44 earhart bgpd[2505]: [EC 100663314] Attempting to process an I/O event but for fd: 58(8) no thread to handle this!
    Aug  9 13:21:44 earhart bgpd[2505]: [EC 100663314] Attempting to process an I/O event but for fd: 61(8) no thread to handle this!
    Aug  9 13:21:44 earhart bgpd[2505]: [EC 100663314] Attempting to process an I/O event but for fd: 38(8) no thread to handle this!
    Aug  9 13:21:44 earhart bgpd[2505]: [EC 100663314] Attempting to process an I/O event but for fd: 76(8) no thread to handle this!
    Aug  9 13:21:44 earhart bgpd[2505]: [EC 100663314] Attempting to process an I/O event but for fd: 71(8) no thread to handle this!
    Aug  9 13:21:44 earhart bgpd[2505]: [EC 100663314] Attempting to process an I/O event but for fd: 41(8) no thread to handle this!
    Aug  9 13:21:44 earhart bgpd[2505]: [EC 100663314] Attempting to process an I/O event but for fd: 50(8) no thread to handle this!
    Aug  9 13:21:44 earhart bgpd[2505]: [EC 100663314] Attempting to process an I/O event but for fd: 86(8) no thread to handle this!
    Aug  9 13:21:44 earhart bgpd[2505]: [EC 100663314] Attempting to process an I/O event but for fd: 85(8) no thread to handle this!
    Aug  9 13:21:44 earhart bgpd[2505]: [EC 100663314] Attempting to process an I/O event but for fd: 63(8) no thread to handle this!
    Aug  9 13:21:44 earhart bgpd[2505]: [EC 100663314] Attempting to process an I/O event but for fd: 97(8) no thread to handle this!
    Aug  9 13:21:44 earhart bgpd[2505]: [EC 100663314] Attempting to process an I/O event but for fd: 98(8) no thread to handle this!
    Aug  9 13:21:44 earhart bgpd[2505]: [EC 100663314] Attempting to process an I/O event but for fd: 95(8) no thread to handle this!
    Aug  9 13:21:44 earhart bgpd[2505]: [EC 100663314] Attempting to process an I/O event but for fd: 47(8) no thread to handle this!
    Aug  9 13:21:44 earhart bgpd[2505]: [EC 100663314] Attempting to process an I/O event but for fd: 79(8) no thread to handle this!
    Aug  9 13:21:54 earhart bgpd[2505]: [EC 100663314] Attempting to process an I/O event but for fd: 66(8) no thread to handle this!
    Aug  9 13:22:28 earhart bgpd[2505]: [EC 100663314] Attempting to process an I/O event but for fd: 64(8) no thread to handle this!
    Aug  9 13:22:30 earhart watchfrr[1159]: Forked background command [pid 7268]: /usr/lib/frr/watchfrr.sh restart bgpd
    Aug  9 13:22:34 earhart watchfrr[1159]: [EC 268435457] bgpd state -> down : unexpected read error: Connection reset by peer
    

    And similarly:

    Aug  9 13:22:35 earhart bgpd[7322]: [EC 100663337] nb_candidate_commit_prepare: failed to validate candidate configuration
    

    What is currently puzzling our engineers is the following message:

    Aug  9 13:26:55 earhart bgpd[7322]: [EC 100663331] error processing configuration change: duplicated prefix list value: 0.0.0.0/0
    

    This implies that the configuration being applied on a BGP process restart contains two identical entries. But that configuration should be exactly the same as the configuration which was running just before the bgpd process restarted, which must have been a valid configuration: otherwise it would not have been applied correctly when bgpd first started. And we can see from the rebooted gunn router that the configuration is applied correctly at system boot. The bgpd process starts up normally:

    Aug  9 14:06:40 gunn bgpd[1140]: snmp[info]: NET-SNMP version 5.7.3 AgentX subagent connected
    

    We are continuing to investigate.
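    As an illustration of the symptom (a hypothetical sketch, not tooling described in this report), a dumped FRR configuration can be scanned for prefix-list values that repeat under different sequence numbers, which is the condition the "duplicated prefix list value" error appears to describe:

```python
# Hypothetical sketch: find prefix-list entries whose (name, action, prefix)
# repeat under different sequence numbers in a dumped FRR configuration.
import re
from collections import Counter

PREFIX_LIST_RE = re.compile(
    r"^ip(?:v6)? prefix-list (\S+) seq \d+ (permit|deny) (\S+)"
)

def find_duplicates(config_text):
    """Return (name, action, prefix) tuples that occur more than once."""
    seen = Counter()
    for line in config_text.splitlines():
        m = PREFIX_LIST_RE.match(line.strip())
        if m:
            seen[m.groups()] += 1
    return [key for key, count in seen.items() if count > 1]

# Invented example configuration with the default route listed twice.
example = """
ip prefix-list TRANSIT-IN seq 10 deny 0.0.0.0/0
ip prefix-list TRANSIT-IN seq 20 permit 10.0.0.0/8 le 24
ip prefix-list TRANSIT-IN seq 30 deny 0.0.0.0/0
"""
print(find_duplicates(example))  # [('TRANSIT-IN', 'deny', '0.0.0.0/0')]
```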

  • 2022-08-09
    14:10:00

    The network has stabilised and alerts have cleared. We have not yet brought gunn back into service. We are going to investigate its entire configuration to see if there is a reason why the BGP process is misbehaving in this way. But no changes have been deployed to gunn for several weeks, so we are not yet sure why it has started having problems.

  • 2022-08-09
    13:40:00

    Again, we've seen the BGP process crash and auto-restart, failing to load all of its configuration (notably, the prefix-lists were missing).

    And then, about 10 minutes later (at 13:35 UTC), the BGP process on gunn completely failed, and would not restart:

    vyos@gunn.x.faelix.net:~$ show ip bgp sum
    bgpd is not running
    

    We have initiated an emergency reboot of gunn and are now monitoring the rest of the network.

  • 2022-08-09
    13:23:00

    Backfilled: BGP sessions on router earhart drop and then recover, but some of the configured prefix-lists have been dropped as well.

  • 2022-08-09
    13:23:00

    Backfilled: We publish a timeline update, "We've seen a recurrence," and begin investigating.

  • 2022-08-09
    13:22:00

    Backfilled: BGP sessions on router gunn drop.

  • 2022-08-09
    13:20:00

    Backfilled: CPU usage on all of our BGP peering and transit edge routers began to increase massively, with no corresponding increase in traffic levels: first in London, and then across the rest of our UK network, peaking at 13:23 UTC.

  • 2022-08-08
    15:12:00

    IGP adjacencies and eBGP sessions have established. We are monitoring the situation.

  • 2022-08-08
    15:02:00

    The updated router has finished booting up. We are bringing it into IGP adjacency with the rest of the network.

  • 2022-08-08
    14:26:00

    We are applying the update to gunn.

  • 2022-08-08
    14:02:00

    The network is looking stable. We are preparing to bring forward a planned piece of maintenance work to upgrade the FRRouting process on gunn.