The BGP process on gunn.x.faelix.net
in Interxion has crashed and dropped its configuration. We have initiated an emergency reboot of the affected router.
Emergency Reboot of gunn (peering and transit router in Interxion London) Peering and Transit
- Event Started
- 2022-08-08 13:42
- Report Published
- 2022-08-08 14:17
- Last Updated
- 2022-09-20 09:36
- Event Finished
- Ongoing
Timeline (most recent first)
-
2022-09-04
13:25:00This problem may have occurred again today. We have re-opened this issue.
-
2022-08-21
13:45:00We noticed a brief
bgpd
process instability. This has been resolved, and combined with recent updates, we believe that part of the RPKI validation process of FRRouting is to blame for thebgpd
process lock-ups. We have temporarily disabled RPKI to determine if this is the case, and are monitoring for the effects. -
2022-08-19
13:45:00We noticed a brief
bgpd
process instability. This has been resolved. -
2022-08-09
21:16:00We've applied those changes, and brought
gunn
back into IGP and iBGP. We're monitoring the network. -
2022-08-09
21:02:00With the test case we've found a workaround we can apply to our BGP routers' prefix-lists which will not compromise routing security and will survive a deliberately-triggered
bgpd
process restart. We are applying this as an emergency configuration change to our BGP peering and transit routers. -
2022-08-09
18:28:00We have identified a test case we can use to reproduce the second of the two issues (the
bgpd
process not being programmed correctly). We are going to perform some tests to see if we can find a configuration which avoids this buggy behaviour. -
2022-08-09
15:13:00We have reached out to the VyOS development team to try to get more information about why this behaviour has been observed, and how we might work with them to fix it.
-
2022-08-09
14:30:00From the logs of
gunn
we can see that thebgpd
process appeared to lock up and was automatically restarted:Aug 9 13:21:07 gunn watchfrr[1169]: [EC 268435457] bgpd state -> unresponsive : no response yet to ping sent 90 seconds ago Aug 9 13:21:07 gunn watchfrr[1169]: Forked background command [pid 3477]: /usr/lib/frr/watchfrr.sh restart bgpd Aug 9 13:21:27 gunn watchfrr[1169]: Warning: restart bgpd child process 3477 still running after 20 seconds, sending signal 15 Aug 9 13:21:27 gunn watchfrr[1169]: restart bgpd process 3477 terminated due to signal 15 Aug 9 13:21:28 gunn bgpd[1214]: [EC 100663313] SLOW THREAD: task bgpd_sync_callback (7f8ec73e85a0) ran for 58360ms (cpu time 58359ms) Aug 9 13:21:28 gunn bgpd[1214]: Terminating on signal Aug 9 13:21:29 gunn bgpd[1214]: [EC 100663314] Attempting to process an I/O event but for fd: 35(8) no thread to handle this! Aug 9 13:21:29 gunn bgpd[1214]: [EC 100663314] Attempting to process an I/O event but for fd: 70(8) no thread to handle this! Aug 9 13:21:29 gunn bgpd[1214]: [EC 100663314] Attempting to process an I/O event but for fd: 93(8) no thread to handle this! Aug 9 13:21:29 gunn bgpd[1214]: [EC 100663314] Attempting to process an I/O event but for fd: 39(8) no thread to handle this! Aug 9 13:21:29 gunn bgpd[1214]: [EC 100663314] Attempting to process an I/O event but for fd: 82(8) no thread to handle this! Aug 9 13:21:53 gunn bgpd[1214]: [EC 100663314] Attempting to process an I/O event but for fd: 49(8) no thread to handle this! Aug 9 13:21:56 gunn zebra[1207]: [EC 4043309122] Client 'vnc' encountered an error and is shutting down. Aug 9 13:21:56 gunn zebra[1207]: [EC 4043309122] Client 'bgp' encountered an error and is shutting down. Aug 9 13:21:56 gunn watchfrr[1169]: [EC 268435457] bgpd state -> down : unexpected read error: Connection reset by peer
The next error message shows it has failed to load the previously running configuration:
Aug 9 13:22:32 gunn bgpd[3741]: [EC 100663337] nb_candidate_commit_prepare: failed to validate candidate configuration
We see something similar happened at almost exactly the same time on
earhart
:Aug 9 13:20:37 earhart bgpd[2505]: [EC 100663313] SLOW THREAD: task bgpd_sync_callback (7f6d217bc5a0) ran for 65325ms (cpu time 65317ms) Aug 9 13:21:05 earhart watchfrr[1159]: [EC 268435457] bgpd state -> unresponsive : no response yet to ping sent 90 seconds ago Aug 9 13:21:05 earhart watchfrr[1159]: Forked background command [pid 6933]: /usr/lib/frr/watchfrr.sh restart bgpd Aug 9 13:21:25 earhart watchfrr[1159]: Warning: restart bgpd child process 6933 still running after 20 seconds, sending signal 15 Aug 9 13:21:25 earhart watchfrr[1159]: restart bgpd process 6933 terminated due to signal 15 Aug 9 13:21:43 earhart bgpd[2505]: [EC 100663313] SLOW THREAD: task bgpd_sync_callback (7f6d217bc5a0) ran for 65677ms (cpu time 65675ms) Aug 9 13:21:43 earhart bgpd[2505]: Terminating on signal Aug 9 13:21:44 earhart bgpd[2505]: [EC 100663314] Attempting to process an I/O event but for fd: 45(8) no thread to handle this! Aug 9 13:21:44 earhart bgpd[2505]: [EC 100663314] Attempting to process an I/O event but for fd: 36(8) no thread to handle this! Aug 9 13:21:44 earhart bgpd[2505]: [EC 100663314] Attempting to process an I/O event but for fd: 58(8) no thread to handle this! Aug 9 13:21:44 earhart bgpd[2505]: [EC 100663314] Attempting to process an I/O event but for fd: 61(8) no thread to handle this! Aug 9 13:21:44 earhart bgpd[2505]: [EC 100663314] Attempting to process an I/O event but for fd: 38(8) no thread to handle this! Aug 9 13:21:44 earhart bgpd[2505]: [EC 100663314] Attempting to process an I/O event but for fd: 76(8) no thread to handle this! Aug 9 13:21:44 earhart bgpd[2505]: [EC 100663314] Attempting to process an I/O event but for fd: 71(8) no thread to handle this! Aug 9 13:21:44 earhart bgpd[2505]: [EC 100663314] Attempting to process an I/O event but for fd: 41(8) no thread to handle this! Aug 9 13:21:44 earhart bgpd[2505]: [EC 100663314] Attempting to process an I/O event but for fd: 50(8) no thread to handle this! Aug 9 13:21:44 earhart bgpd[2505]: [EC 100663314] Attempting to process an I/O event but for fd: 86(8) no thread to handle this! Aug 9 13:21:44 earhart bgpd[2505]: [EC 100663314] Attempting to process an I/O event but for fd: 85(8) no thread to handle this! Aug 9 13:21:44 earhart bgpd[2505]: [EC 100663314] Attempting to process an I/O event but for fd: 63(8) no thread to handle this! Aug 9 13:21:44 earhart bgpd[2505]: [EC 100663314] Attempting to process an I/O event but for fd: 97(8) no thread to handle this! Aug 9 13:21:44 earhart bgpd[2505]: [EC 100663314] Attempting to process an I/O event but for fd: 98(8) no thread to handle this! Aug 9 13:21:44 earhart bgpd[2505]: [EC 100663314] Attempting to process an I/O event but for fd: 95(8) no thread to handle this! Aug 9 13:21:44 earhart bgpd[2505]: [EC 100663314] Attempting to process an I/O event but for fd: 47(8) no thread to handle this! Aug 9 13:21:44 earhart bgpd[2505]: [EC 100663314] Attempting to process an I/O event but for fd: 79(8) no thread to handle this! Aug 9 13:21:54 earhart bgpd[2505]: [EC 100663314] Attempting to process an I/O event but for fd: 66(8) no thread to handle this! Aug 9 13:22:28 earhart bgpd[2505]: [EC 100663314] Attempting to process an I/O event but for fd: 64(8) no thread to handle this! Aug 9 13:22:30 earhart watchfrr[1159]: Forked background command [pid 7268]: /usr/lib/frr/watchfrr.sh restart bgpd Aug 9 13:22:34 earhart watchfrr[1159]: [EC 268435457] bgpd state -> down : unexpected read error: Connection reset by peer
And similarly:
Aug 9 13:22:35 earhart bgpd[7322]: [EC 100663337] nb_candidate_commit_prepare: failed to validate candidate configuration
What is currently puzzling our engineers is the following message:
Aug 9 13:26:55 earhart bgpd[7322]: [EC 100663331] error processing configuration change: duplicated prefix list value: 0.0.0.0/0
This implies that the configuration that is trying to be applied upon a BGP process restart has two identical entries in it. But that configuration should be exactly the same as the configuration which was running just before the
bgpd
process restarted — which must have been a valid configuration, otherwise it would not have been correctly the first timebgpd
started. And we can see from the rebootedgunn
router that the configuration is correctly applied at system boot. Thebgpd
process starts up normally:Aug 9 14:06:40 gunn bgpd[1140]: snmp[info]: NET-SNMP version 5.7.3 AgentX subagent connected
We are continuing to investigate.
-
2022-08-09
14:10:00The network is stabled and alerts have cleared. We have not yet brought
gunn
back into service. We are going to investigate its entire configuration to see if there is a reason why the BGP process is misbehaving in this way. But no changes have been deployed togunn
for several weeks, so we are not yet sure why it has started having problems. -
2022-08-09
13:40:00Again, we've seen the BGP process crash and auto-restart, and not load all of its configuration (notably, missing out the configuration of prefix-lists).
And then, about 10 minutes later (at 13:35 UTC), the BGP process on
gunn
completely failed, and would not restart:vyos@gunn.x.faelix.net:~$ show ip bgp sum bgpd is not running
We have initiated that reboot and are now monitoring the rest of the network.
-
2022-08-09
13:23:00Backfilled: BGP sessions on router
earhart
drop, which then recover, but have dropped some of the configured prefix-lists as well. -
2022-08-09
13:23:00Backfilled: We publish a timeline update, "We've seen a recurrence," and begin investigating.
-
2022-08-09
13:22:00Backfilled: BGP sessions on router
gunn
drop. -
2022-08-09
13:20:00Backfilled: CPU usage on all our BGP peering and transit edge routers began to massively increase, not corresponding with an increase in traffic levels. First in London, and then across the rest of our UK network, peaking at 13:23 UTC.
-
2022-08-08
15:12:00IGP adjacencies and eBGP sessions have established. We are monitoring the situation.
-
2022-08-08
15:02:00The updated router has finished booting up. We are bringing it into IGP adjacency with the rest of the network.
-
2022-08-08
14:26:00We are applying the update to
gunn
. -
2022-08-08
14:02:00The network is looking stable. We are preparing to bring forward a planned piece of maintenance work to upgrade the FRRouting process on
gunn
.