We are investigating a core switch crash in Manchester which caused a brief interruption to connectivity to our customer VMs and colocation.
Switch Crash in Manchester Backbone
- Event Started
- 2018-07-24 15:57
- Report Published
- 2018-07-24 15:54
- Last Updated
- 2022-04-13 09:36
- Event Finished
- 2018-07-24 16:16
Timeline (most recent first)
We will be working with the switch vendor's support team to determine whether this was caused by a known defect.
We believe the connectivity interruption was caused by a "split-brain" on our core switches: the master failed to respond to heartbeat messages, another master was elected, and then the original master started functioning normally again. The switches "solved" the split-brain by rebooting the older master switch, which caused a brief interruption to connectivity.
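For readers unfamiliar with the term, the sequence described in this update can be illustrated with a toy simulation. This is only a sketch of the general pattern, not the vendor's actual election protocol: the switch names, the `HEARTBEAT_TIMEOUT` value, the tick-based clock and the `simulate` function are all hypothetical and chosen purely for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical value: consecutive missed heartbeats before a new master is elected.
HEARTBEAT_TIMEOUT = 3


@dataclass
class Switch:
    name: str
    is_master: bool = False
    elected_at: Optional[int] = None  # tick at which this switch became master
    responsive: bool = True
    rebooted: bool = False


def simulate(unresponsive_from: int, recovers_at: int, ticks: int) -> List[str]:
    """Replay a master outage and return a log of the resulting events.

    The original master ignores heartbeats for ticks in
    [unresponsive_from, recovers_at) and behaves normally otherwise.
    """
    a = Switch("switch-a", is_master=True, elected_at=0)  # original master
    b = Switch("switch-b")                                # standby
    log: List[str] = []
    missed = 0
    for tick in range(1, ticks + 1):
        a.responsive = not (unresponsive_from <= tick < recovers_at)

        # The standby counts consecutive missed heartbeats from the master
        # and promotes itself once the (assumed) timeout is reached.
        if not b.is_master:
            missed = 0 if a.responsive else missed + 1
            if missed >= HEARTBEAT_TIMEOUT:
                b.is_master, b.elected_at = True, tick
                log.append(f"t={tick}: {b.name} elected master")

        # Two responsive masters at the same time is the split-brain condition.
        masters = [s for s in (a, b) if s.is_master and s.responsive]
        if len(masters) == 2:
            # Resolution as described in this report: reboot the older master.
            older = min(masters, key=lambda s: s.elected_at)
            older.is_master, older.rebooted = False, True
            log.append(f"t={tick}: split-brain, rebooting {older.name}")
    return log


if __name__ == "__main__":
    for line in simulate(unresponsive_from=2, recovers_at=6, ticks=8):
        print(line)
```

In this sketch a master that recovers before the timeout expires never triggers an election, so no split-brain occurs. The reboot only happens when the original master returns after a replacement has already been elected, which matches the order of events we believe took place.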
The switch appears to have rebooted, coming back into full service just before 16:56:00. We are investigating why this has happened.
A core switch became unresponsive at 16:54:16 local time today. Our engineers began looking into it as soon as alerts were raised a few seconds later.