At 03:00 on 31st October we began final maintenance work for the closure of MA2. Part of this involved "costing out" the link between MA2 and THW (Telehouse West) for the duration of the move. This was carried out by adjusting OSPF costs on interfaces using this link shortly after 03:00.
Unfortunately, dekker.d.faelix.net
(our peering and transit router in THW), reponded to the set interface ... ip ospf cost
commands being committed by cycling one of its 40GE network interfaces:
Oct 31 03:00:49 dekker kernel: [3367956.254281] mlx4_en: eth7: Port link mode changed, restarting port...
Oct 31 03:00:49 dekker kernel: [3367956.293771] mlx4_en: eth7: Steering Mode 1
Oct 31 03:00:49 dekker kernel: [3367956.302836] mlx4_en: eth7: Link Down
...
Oct 31 03:00:50 dekker zebra[1199]: [EC 4043309074] 0:254:46.227.200.224/29: Failed to enqueue dataplane install
Oct 31 03:00:51 dekker kernel: [3367958.096191] mlx4_en: eth7: Link Up
This then caused the zebra
process (part of the routing control plane) to struggle resolving routes:
Oct 31 03:00:53 dekker zebra[1199]: Extended Error: Nexthop id does not exist
Oct 31 03:00:53 dekker netplugd[1130]: eth7.3064: state INACTIVE flags 0x00001003 UP,BROADCAST,MULTICAST -> 0x00011043 UP,BRO
ADCAST,RUNNING,MULTICAST,10000
Many log entries — but only one second later — the zebra
process loses track of the Linux kernel netlink messages it is processing, and starts to fail:
Oct 31 03:00:53 dekker zebra[1199]: [EC 4043309091] netlink-listen (NS 0) recvmsg overrun: No buffer space available
Oct 31 03:00:53 dekker watchfrr[1157]: [EC 268435457] zebra state -> down : unexpected read error: Connection reset by peer
Oct 31 03:00:55 dekker bgpd[1318]: [EC 100663299] buffer_write: write error on fd 21: Broken pipe
Oct 31 03:00:55 dekker bgpd[1318]: [EC 100663302] zclient_send_message: buffer_write failed to zclient fd 21, closing
At this point watchfrr
kicks in and restarts all the relevant processes:
Oct 31 03:00:58 dekker watchfrr[1157]: Forked background command [pid 17679]: /usr/lib/frr/watchfrr.sh restart all
Oct 31 03:00:58 dekker staticd[1293]: Terminating on signal
Oct 31 03:00:58 dekker ospf6d[1260]: Terminating on signal SIGINT
Oct 31 03:00:58 dekker ripngd[1307]: Terminating on signal
Oct 31 03:00:58 dekker ospfd[1330]: Terminating on signal
Oct 31 03:00:58 dekker watchfrr[1157]: [EC 268435457] staticd state -> down : read returned EOF
Oct 31 03:00:58 dekker isisd[1303]: Terminating on signal SIGINT
Oct 31 03:00:58 dekker watchfrr.sh: Cannot stop zebra: pid 1199 not running
And then triggers again?
Oct 31 03:01:18 dekker watchfrr[1157]: Warning: restart all child process 17679 still running after 20 seconds, sending signa
l 15
Oct 31 03:01:18 dekker watchfrr[1157]: restart all process 17679 terminated due to signal 15
By now it looks like FRRouting is in a bad way:
Oct 31 03:01:27 dekker bgpd[1318]: [EC 33554499] sendmsg_nexthop: zclient_send_message() failed
And is auto-restarted again:
Oct 31 03:02:22 dekker watchfrr[1157]: Forked background command [pid 18024]: /usr/lib/frr/watchfrr.sh restart all
Oct 31 03:02:22 dekker watchfrr.sh: Cannot stop staticd: pid file not found
Oct 31 03:02:22 dekker watchfrr.sh: Cannot stop bfdd: pid file not found
Oct 31 03:02:22 dekker watchfrr.sh: Cannot stop isisd: pid file not found
Oct 31 03:02:22 dekker watchfrr.sh: Cannot stop ldpd: pid file not found
Oct 31 03:02:22 dekker watchfrr.sh: Cannot stop ripd: pid file not found
Oct 31 03:02:22 dekker watchfrr.sh: Cannot stop ospf6d: pid file not found
Oct 31 03:02:22 dekker watchfrr.sh: Cannot stop ripngd: pid file not found
Oct 31 03:02:22 dekker watchfrr.sh: Cannot stop ospfd: pid file not found
Oct 31 03:02:22 dekker watchfrr.sh: Cannot stop bgpd: pid 1318 not running
Oct 31 03:02:22 dekker watchfrr.sh: Cannot stop zebra: pid 1199 not running
And finally seems to be running:
Oct 31 03:02:23 dekker watchfrr[1157]: ospfd state -> up : connect succeeded
[snip]
Oct 31 03:02:27 dekker watchfrr[1157]: bgpd state -> up : connect succeeded
Oct 31 03:02:27 dekker watchfrr[1157]: staticd state -> up : connect succeeded
Oct 31 03:02:27 dekker watchfrr[1157]: zebra state -> up : connect succeeded
Oct 31 03:02:28 dekker watchfrr[1157]: bfdd state -> up : connect succeeded
Oct 31 03:02:28 dekker watchfrr[1157]: ldpd state -> up : connect succeeded
Oct 31 03:02:28 dekker watchfrr[1157]: isisd state -> up : connect succeeded
Oct 31 03:02:28 dekker watchfrr[1157]: ripd state -> up : connect succeeded
Oct 31 03:02:28 dekker watchfrr[1157]: ripngd state -> up : connect succeeded
Oct 31 03:02:28 dekker watchfrr[1157]: ospf6d state -> up : connect succeeded
```
Unfortunately the processes now running have only picked up half their configurations. OSPF interface costs are not all correctly derived from VyOS' configuration; prefix-lists are incomplete; and numerous critical BGP route-maps are missing.
This degrades our network's routing, causing packet loss for a while during reconvergence, before stabilising. However, this leaves the router in a running configuration which does not match the programmed configuration, and this will trip up our engineers when bringing services back later the same day (subsequent status updates on this event).