Interface Flap in Interxion LON2 Backbone

Event Started: 2023-05-23 10:13
Report Published: 2023-05-23 10:43
Last Updated: 2023-10-27 09:18
Event Finished: Ongoing

One of our core routers flapped a physical interface, which caused BGP instability.

Timeline (most recent first)
  • 2023-05-23
    12:49:00

    We're continuing to monitor our router gunn.x.faelix.net and have seen no recurrences since this morning.

    The logs indicate that the router restarted several of the processes that manage its dynamic routing protocols, which in turn flapped its 40G interfaces.

    May 23 10:13:05 gunn kernel: [24764987.199489] mlx4_en: eth7: Port link mode changed, restarting port...
    May 23 10:13:05 gunn kernel: [24764987.256968] mlx4_en: eth7: Steering Mode 1
    May 23 10:13:05 gunn kernel: [24764987.267510] mlx4_en: eth7: Link Down
    
    [snipping large number of Zebra/OSPFv2/OSPFv3 messages]
    
    May 23 10:13:07 gunn watchfrr[1099]: [EC 268435457] zebra state -> down : read returned EOF
    May 23 10:13:07 gunn watchfrr[1099]: zebra state -> up : connect succeeded
    May 23 10:13:07 gunn watchfrr[1099]: [EC 268435457] zebra state -> down : unexpected read error: Connection reset by peer
    
    [snip]
    
    May 23 10:13:09 gunn kernel: [24764990.705063] mlx4_en: eth7: Link Up
    May 23 10:13:09 gunn kernel: [24764990.814894] mlx4_en: eth7: Link Down
    May 23 10:13:09 gunn kernel: [24764991.024967] mlx4_en: eth7: Link Up
    
    [snip]
    
    May 23 10:13:12 gunn watchfrr[1099]: Forked background command [pid 24271]: /usr/lib/frr/watchfrr.sh restart all
    
    [snip]
    
    May 23 10:13:32 gunn watchfrr[1099]: Warning: restart all child process 24271 still running after 20 seconds, sending signal 15
    May 23 10:13:32 gunn watchfrr[1099]: restart all process 24271 terminated due to signal 15
    May 23 10:13:59 gunn watchfrr[1099]: [EC 268435457] bgpd state -> down : unexpected read error: Connection reset by peer
    May 23 10:14:37 gunn watchfrr[1099]: Forked background command [pid 24658]: /usr/lib/frr/watchfrr.sh restart all
    
    [snip]
    
    May 23 10:14:42 gunn watchfrr[1099]: ospfd state -> up : connect succeeded
    May 23 10:14:42 gunn watchfrr[1099]: zebra state -> up : connect succeeded
    May 23 10:14:42 gunn watchfrr[1099]: isisd state -> up : connect succeeded
    May 23 10:14:42 gunn watchfrr[1099]: ldpd state -> up : connect succeeded
    May 23 10:14:42 gunn watchfrr[1099]: ripngd state -> up : connect succeeded
    May 23 10:14:42 gunn watchfrr[1099]: staticd state -> up : connect succeeded
    May 23 10:14:42 gunn watchfrr[1099]: bfdd state -> up : connect succeeded
    May 23 10:14:42 gunn watchfrr[1099]: ospf6d state -> up : connect succeeded
    May 23 10:14:43 gunn watchfrr[1099]: ripd state -> up : connect succeeded
    May 23 10:14:43 gunn watchfrr[1099]: bgpd state -> up : connect succeeded
    May 23 10:14:57 gunn watchfrr[1099]: Warning: restart all child process 24658 still running after 20 seconds, sending signal 15
    

    We're still investigating what caused this to happen.

  • 2023-05-23
    10:38:00

    We've restarted BGP and OSPF on a router in AQL that appeared to have multiple stuck routes installed. This has cleared up the remaining issues.

  • 2023-05-23
    10:31:00

    We've seen BGP instability recur.

  • 2023-05-23
    10:20:00

    The interface has come back up, is showing no errors, and BGP has reconverged.
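
For anyone reviewing similar logs, the excerpt in the 12:49 update can be condensed into a list of link and daemon state changes with a short script. The following is a minimal sketch, not the tooling we used: the function names (summarize, count_transitions) and the regular expressions are illustrative assumptions, and the patterns are written only against the line formats visible in the excerpt above.

```python
import re
from collections import Counter

# Sample lines copied verbatim from the 12:49 log excerpt.
SAMPLE_LOG = """\
May 23 10:13:05 gunn kernel: [24764987.267510] mlx4_en: eth7: Link Down
May 23 10:13:07 gunn watchfrr[1099]: [EC 268435457] zebra state -> down : read returned EOF
May 23 10:13:07 gunn watchfrr[1099]: zebra state -> up : connect succeeded
May 23 10:13:09 gunn kernel: [24764990.705063] mlx4_en: eth7: Link Up
May 23 10:13:09 gunn kernel: [24764990.814894] mlx4_en: eth7: Link Down
May 23 10:13:09 gunn kernel: [24764991.024967] mlx4_en: eth7: Link Up
May 23 10:13:59 gunn watchfrr[1099]: [EC 268435457] bgpd state -> down : unexpected read error: Connection reset by peer
May 23 10:14:43 gunn watchfrr[1099]: bgpd state -> up : connect succeeded
"""

# Assumed patterns, derived only from the lines shown in the excerpt.
TIMESTAMP = r"(\w{3}\s+\d+ [\d:]{8})"
LINK_RE = re.compile(
    rf"^{TIMESTAMP} \S+ kernel: \[\s*[\d.]+\] mlx4_en: (\S+): Link (Up|Down)$"
)
DAEMON_RE = re.compile(
    rf"^{TIMESTAMP} \S+ watchfrr\[\d+\]: (?:\[EC \d+\] )?(\w+) state -> (up|down) : .*$"
)


def summarize(log_text):
    """Return (link_events, daemon_events) as lists of (timestamp, name, state)."""
    link_events, daemon_events = [], []
    for raw in log_text.splitlines():
        line = raw.strip()
        match = LINK_RE.match(line)
        if match:
            ts, iface, state = match.groups()
            link_events.append((ts, iface, state.lower()))
            continue
        match = DAEMON_RE.match(line)
        if match:
            daemon_events.append(match.groups())
        # Any other line (e.g. "Forked background command") is ignored.
    return link_events, daemon_events


def count_transitions(events, state):
    """Count how many times each interface or daemon entered the given state."""
    return Counter(name for _, name, st in events if st == state)


if __name__ == "__main__":
    links, daemons = summarize(SAMPLE_LOG)
    print("link down events:", dict(count_transitions(links, "down")))
    print("daemon down events:", dict(count_transitions(daemons, "down")))
```

Counting the "down" transitions per interface and per daemon makes it easier to line up link flaps against routing-daemon restarts when the raw log is dominated by Zebra and OSPF messages.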