One of our core routers has flapped a physical interface, which has caused BGP instability.
Interface Flap in Interxion LON2 Backbone
- Event Started: 2023-05-23 10:13
- Report Published: 2023-05-23 10:43
- Last Updated: 2023-10-27 09:18
- Event Finished: Ongoing
Timeline (most recent first)
- 2023-05-23 12:49:00
We're continuing to monitor our router gunn.x.faelix.net and have seen no recurrences since this morning. The logs indicate that the router restarted several of the processes managing the dynamic routing protocols which, in turn, flapped its 40G interfaces.
May 23 10:13:05 gunn kernel: [24764987.199489] mlx4_en: eth7: Port link mode changed, restarting port...
May 23 10:13:05 gunn kernel: [24764987.256968] mlx4_en: eth7: Steering Mode 1
May 23 10:13:05 gunn kernel: [24764987.267510] mlx4_en: eth7: Link Down
[snipping large number of Zebra/OSPFv2/OSPFv3 messages]
May 23 10:13:07 gunn watchfrr[1099]: [EC 268435457] zebra state -> down : read returned EOF
May 23 10:13:07 gunn watchfrr[1099]: zebra state -> up : connect succeeded
May 23 10:13:07 gunn watchfrr[1099]: [EC 268435457] zebra state -> down : unexpected read error: Connection reset by peer
[snip]
May 23 10:13:09 gunn kernel: [24764990.705063] mlx4_en: eth7: Link Up
May 23 10:13:09 gunn kernel: [24764990.814894] mlx4_en: eth7: Link Down
May 23 10:13:09 gunn kernel: [24764991.024967] mlx4_en: eth7: Link Up
[snip]
May 23 10:13:12 gunn watchfrr[1099]: Forked background command [pid 24271]: /usr/lib/frr/watchfrr.sh restart all
[snip]
May 23 10:13:32 gunn watchfrr[1099]: Warning: restart all child process 24271 still running after 20 seconds, sending signal 15
May 23 10:13:32 gunn watchfrr[1099]: restart all process 24271 terminated due to signal 15
May 23 10:13:59 gunn watchfrr[1099]: [EC 268435457] bgpd state -> down : unexpected read error: Connection reset by peer
May 23 10:14:37 gunn watchfrr[1099]: Forked background command [pid 24658]: /usr/lib/frr/watchfrr.sh restart all
[snip]
May 23 10:14:42 gunn watchfrr[1099]: ospfd state -> up : connect succeeded
May 23 10:14:42 gunn watchfrr[1099]: zebra state -> up : connect succeeded
May 23 10:14:42 gunn watchfrr[1099]: isisd state -> up : connect succeeded
May 23 10:14:42 gunn watchfrr[1099]: ldpd state -> up : connect succeeded
May 23 10:14:42 gunn watchfrr[1099]: ripngd state -> up : connect succeeded
May 23 10:14:42 gunn watchfrr[1099]: staticd state -> up : connect succeeded
May 23 10:14:42 gunn watchfrr[1099]: bfdd state -> up : connect succeeded
May 23 10:14:42 gunn watchfrr[1099]: ospf6d state -> up : connect succeeded
May 23 10:14:43 gunn watchfrr[1099]: ripd state -> up : connect succeeded
May 23 10:14:43 gunn watchfrr[1099]: bgpd state -> up : connect succeeded
May 23 10:14:57 gunn watchfrr[1099]: Warning: restart all child process 24658 still running after 20 seconds, sending signal 15
We're still investigating what caused this to happen.
- 2023-05-23 10:38:00
We've restarted BGP and OSPF on a router in AQL which appeared to have multiple stuck routes installed. This has cleared up remaining issues.
- 2023-05-23 10:31:00
We've seen BGP instability recur.
- 2023-05-23 10:20:00
The interface has come back up, is showing no errors, and BGP has reconverged.
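
For readers who want to summarise a log excerpt like the one in the 12:49:00 entry, here is a minimal Python sketch. It is an illustration only, not the tooling we used during the incident. It pairs each watchfrr "state -> down" line with the next "state -> up" line for the same daemon and reports how long the daemon appeared to be down. The function name daemon_outages, the regular expression, and the assumed year (syslog timestamps carry none) are all assumptions made for this example. The sample lines are copied from the excerpt above.

```python
import re
from datetime import datetime

# Matches watchfrr state-change lines, with or without the "[EC ...]" prefix.
# This pattern is an assumption based only on the lines quoted in this report.
STATE_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+ \d{2}:\d{2}:\d{2}) \S+ watchfrr\[\d+\]: "
    r"(?:\[EC \d+\] )?(?P<daemon>\w+) state -> (?P<state>up|down)\b"
)


def daemon_outages(lines, year=2023):
    """Return {daemon: [(down_at, up_at, seconds), ...]} from syslog lines.

    Hypothetical helper: 'year' must be supplied because the syslog
    timestamp format used here does not include one.
    """
    went_down = {}  # daemon -> time of the earliest unresolved "down"
    outages = {}
    for line in lines:
        match = STATE_RE.match(line.strip())
        if not match:
            continue  # kernel lines, "[snip]" markers, and so on
        ts = datetime.strptime(f"{year} {match['ts']}", "%Y %b %d %H:%M:%S")
        daemon, state = match["daemon"], match["state"]
        if state == "down":
            went_down.setdefault(daemon, ts)
        elif daemon in went_down:
            start = went_down.pop(daemon)
            seconds = (ts - start).total_seconds()
            outages.setdefault(daemon, []).append((start, ts, seconds))
        # An "up" with no recorded "down" is ignored: the matching "down"
        # line may simply have been snipped from the excerpt.
    return outages


SAMPLE = [
    "May 23 10:13:07 gunn watchfrr[1099]: [EC 268435457] zebra state -> down : read returned EOF",
    "May 23 10:13:07 gunn watchfrr[1099]: zebra state -> up : connect succeeded",
    "May 23 10:13:07 gunn watchfrr[1099]: [EC 268435457] zebra state -> down : unexpected read error: Connection reset by peer",
    "May 23 10:13:59 gunn watchfrr[1099]: [EC 268435457] bgpd state -> down : unexpected read error: Connection reset by peer",
    "May 23 10:14:42 gunn watchfrr[1099]: ospfd state -> up : connect succeeded",
    "May 23 10:14:42 gunn watchfrr[1099]: zebra state -> up : connect succeeded",
    "May 23 10:14:43 gunn watchfrr[1099]: bgpd state -> up : connect succeeded",
]

if __name__ == "__main__":
    for daemon, spans in daemon_outages(SAMPLE).items():
        for down_at, up_at, seconds in spans:
            print(f"{daemon}: down {down_at:%H:%M:%S}, up {up_at:%H:%M:%S} ({seconds:.0f}s)")
```

Because parts of the log were snipped, durations computed this way only describe the lines that remain in the excerpt and should be treated as approximate.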