BGP Process Crash/Restart on Peering Router for LINX LON1

Event Started: 2022-10-31 19:40
Report Published: 2022-10-31 19:53
Last Updated: 2023-10-27 09:21
Event Finished: Ongoing

Earlier today we observed the BGP process restart on dekker.d.faelix.net. When bringing this router back into service, we did not realise that it had not load its route-maps correctly when the BGP process restarted. This caused incorrect propagation of routes within and to outside our network.

We immediately shut BGP sessions on LINX LON1 as a result to investigate further, around 19:45 UTC.

Timeline (most recent first)

2022-11-01
06:53:00

Having monitored the network for some time, we are satisfied with the outcome of the emergency maintenance.

We will go through the debugging logs gathered from the affected device to try to determine why it behaved in the way it did just after 03:00 on 31st October.
2022-11-01
04:50:00

We are establishing LINX LON1 peering sessions.
2022-11-01
04:47:00

We are establishing customer BGP sessions.
2022-11-01
04:42:00

We are now performing checks before bringing this router back into production traffic.
2022-11-01
04:01:00

We are about to reboot dekker.d.faelix.net.
2022-11-01
03:33:00

We will be beginning works shortly.
2022-11-01
03:00:00
We have found that the router's issues began when we costed out the Manchester-London path as part of our final maintenance to evacuate MA2. When we did that work we noticed at 03:12 that it cycled all its BGP sessions. It appears several things happened in its control plane at that moment:
- its OSPF process also reloaded, setting some interfaces costs to defaults rather than configured values
- its BGP process partially reloaded, so that when we tried re-establishing peering sessions it did not have the route-maps properly installed
We are therefore taking the decision to reboot this router at 04:00 tonight as an emergency maintenance
2022-10-31
03:00:00

At 03:00 on 31st October we began final maintenance work for the closure of MA2. Part of this involved "costing out" the link between MA2 and THW (Telehouse West) for the duration of the move. This was carried out by adjusting OSPF costs on interfaces using this link shortly after 03:00.

Unfortunately, dekker.d.faelix.net (our peering and transit router in THW), reponded to the set interface ... ip ospf cost commands being committed by cycling one of its 40GE network interfaces:

Oct 31 03:00:49 dekker kernel: [3367956.254281] mlx4_en: eth7: Port link mode changed, restarting port... Oct 31 03:00:49 dekker kernel: [3367956.293771] mlx4_en: eth7: Steering Mode 1 Oct 31 03:00:49 dekker kernel: [3367956.302836] mlx4_en: eth7: Link Down ... Oct 31 03:00:50 dekker zebra[1199]: [EC 4043309074] 0:254:46.227.200.224/29: Failed to enqueue dataplane install Oct 31 03:00:51 dekker kernel: [3367958.096191] mlx4_en: eth7: Link Up

This then caused the zebra process (part of the routing control plane) to struggle resolving routes:

Oct 31 03:00:53 dekker zebra[1199]: Extended Error: Nexthop id does not exist Oct 31 03:00:53 dekker netplugd[1130]: eth7.3064: state INACTIVE flags 0x00001003 UP,BROADCAST,MULTICAST -> 0x00011043 UP,BRO ADCAST,RUNNING,MULTICAST,10000

Many log entries — but only one second later — the zebra process loses track of the Linux kernel netlink messages it is processing, and starts to fail:

Oct 31 03:00:53 dekker zebra[1199]: [EC 4043309091] netlink-listen (NS 0) recvmsg overrun: No buffer space available Oct 31 03:00:53 dekker watchfrr[1157]: [EC 268435457] zebra state -> down : unexpected read error: Connection reset by peer Oct 31 03:00:55 dekker bgpd[1318]: [EC 100663299] buffer_write: write error on fd 21: Broken pipe Oct 31 03:00:55 dekker bgpd[1318]: [EC 100663302] zclient_send_message: buffer_write failed to zclient fd 21, closing

At this point watchfrr kicks in and restarts all the relevant processes:

Oct 31 03:00:58 dekker watchfrr[1157]: Forked background command [pid 17679]: /usr/lib/frr/watchfrr.sh restart all Oct 31 03:00:58 dekker staticd[1293]: Terminating on signal Oct 31 03:00:58 dekker ospf6d[1260]: Terminating on signal SIGINT Oct 31 03:00:58 dekker ripngd[1307]: Terminating on signal Oct 31 03:00:58 dekker ospfd[1330]: Terminating on signal Oct 31 03:00:58 dekker watchfrr[1157]: [EC 268435457] staticd state -> down : read returned EOF Oct 31 03:00:58 dekker isisd[1303]: Terminating on signal SIGINT Oct 31 03:00:58 dekker watchfrr.sh: Cannot stop zebra: pid 1199 not running

And then triggers again?

Oct 31 03:01:18 dekker watchfrr[1157]: Warning: restart all child process 17679 still running after 20 seconds, sending signa l 15 Oct 31 03:01:18 dekker watchfrr[1157]: restart all process 17679 terminated due to signal 15

By now it looks like FRRouting is in a bad way:

Oct 31 03:01:27 dekker bgpd[1318]: [EC 33554499] sendmsg_nexthop: zclient_send_message() failed

And is auto-restarted again:

Oct 31 03:02:22 dekker watchfrr[1157]: Forked background command [pid 18024]: /usr/lib/frr/watchfrr.sh restart all Oct 31 03:02:22 dekker watchfrr.sh: Cannot stop staticd: pid file not found Oct 31 03:02:22 dekker watchfrr.sh: Cannot stop bfdd: pid file not found Oct 31 03:02:22 dekker watchfrr.sh: Cannot stop isisd: pid file not found Oct 31 03:02:22 dekker watchfrr.sh: Cannot stop ldpd: pid file not found Oct 31 03:02:22 dekker watchfrr.sh: Cannot stop ripd: pid file not found Oct 31 03:02:22 dekker watchfrr.sh: Cannot stop ospf6d: pid file not found Oct 31 03:02:22 dekker watchfrr.sh: Cannot stop ripngd: pid file not found Oct 31 03:02:22 dekker watchfrr.sh: Cannot stop ospfd: pid file not found Oct 31 03:02:22 dekker watchfrr.sh: Cannot stop bgpd: pid 1318 not running Oct 31 03:02:22 dekker watchfrr.sh: Cannot stop zebra: pid 1199 not running

And finally seems to be running:

Oct 31 03:02:23 dekker watchfrr[1157]: ospfd state -> up : connect succeeded [snip] Oct 31 03:02:27 dekker watchfrr[1157]: bgpd state -> up : connect succeeded Oct 31 03:02:27 dekker watchfrr[1157]: staticd state -> up : connect succeeded Oct 31 03:02:27 dekker watchfrr[1157]: zebra state -> up : connect succeeded Oct 31 03:02:28 dekker watchfrr[1157]: bfdd state -> up : connect succeeded Oct 31 03:02:28 dekker watchfrr[1157]: ldpd state -> up : connect succeeded Oct 31 03:02:28 dekker watchfrr[1157]: isisd state -> up : connect succeeded Oct 31 03:02:28 dekker watchfrr[1157]: ripd state -> up : connect succeeded Oct 31 03:02:28 dekker watchfrr[1157]: ripngd state -> up : connect succeeded Oct 31 03:02:28 dekker watchfrr[1157]: ospf6d state -> up : connect succeeded ```

Unfortunately the processes now running have only picked up half their configurations. OSPF interface costs are not all correctly derived from VyOS' configuration; prefix-lists are incomplete; and numerous critical BGP route-maps are missing.

This degrades our network's routing, causing packet loss for a while during reconvergence, before stabilising. However, this leaves the router in a running configuration which does not match the programmed configuration, and this will trip up our engineers when bringing services back later the same day (subsequent status updates on this event).