Switch Crash in Manchester Backbone

Event Started
2018-07-24 15:57
Report Published
2018-07-24 15:54
Last Updated
2022-04-13 09:36
Event Finished
2018-07-24 16:16

We are investigating a core switch crash in Manchester that briefly interrupted connectivity to our customer VMs and colocation.

Timeline (most recent first)
  • 2018-07-24
    16:17:00

    We will be working with the switch vendor's support team to determine whether this was caused by a known defect.

  • 2018-07-24
    16:15:00

    We believe the connectivity interruption was caused by a "split-brain" on our core switches: the master failed to respond to heartbeat messages, another master was elected, and then the original master started functioning normally again. The switches "solved" the split-brain by rebooting the older master switch, which caused a brief interruption to connectivity.

  • 2018-07-24
    16:05:00

    The switch appears to have rebooted, coming back into full service just before 16:56:00. We are investigating why this has happened.

  • 2018-07-24
    16:04:00

    A core switch became unresponsive at 16:54:16 local time today. Our engineers began looking into it as soon as alerts were raised a few seconds later.
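
Illustration only: the split-brain sequence described in the 16:15 update can be sketched as a small simulation. This is not the vendor's election algorithm, which we have not confirmed. The names Switch, elect_on_missed_heartbeat and resolve_split_brain, the 3-second timeout, and the timestamps are all hypothetical, chosen only to show the order of events (missed heartbeat, second master elected, older master rebooted).

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Switch:
    """Hypothetical model of one member of a redundant switch pair."""
    name: str
    is_master: bool = False
    master_since: Optional[float] = None  # time at which this switch became master
    reboots: int = 0


def elect_on_missed_heartbeat(standby: Switch, last_heartbeat: float,
                              now: float, timeout: float) -> bool:
    """Promote the standby if the master's last heartbeat is older than the timeout."""
    if now - last_heartbeat > timeout and not standby.is_master:
        standby.is_master = True
        standby.master_since = now
        return True
    return False


def resolve_split_brain(switches: List[Switch]) -> Optional[Switch]:
    """If two switches both claim to be master, reboot the one that has held
    the role longest (the "older master") and return it; otherwise return None."""
    masters = [s for s in switches if s.is_master]
    if len(masters) < 2:
        return None
    older = min(masters, key=lambda s: s.master_since)
    older.is_master = False
    older.master_since = None
    older.reboots += 1  # this reboot is what interrupts connectivity
    return older


# Assumed scenario: switch-a is master but stops answering heartbeats.
a = Switch("switch-a", is_master=True, master_since=0.0)
b = Switch("switch-b")

# The standby sees no heartbeat for 5 s (timeout assumed to be 3 s) and takes over.
promoted = elect_on_missed_heartbeat(b, last_heartbeat=100.0, now=105.0, timeout=3.0)

# switch-a recovers while still believing it is master, so both claim the role.
rebooted = resolve_split_brain([a, b])
```

In this sketch the tie is broken by how long each switch has been master, matching the observed behaviour that the older master was the one rebooted. A real implementation may use priorities, MAC addresses or other criteria.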