Service Incident
published: 2018-07-24 15:57
started: 2018-07-24 15:54
expected: 2018-07-24 15:56
finished: 2018-07-24 16:16
resolved: 2018-07-24 16:16
We are investigating a core switch crash in Manchester, which caused a brief interruption to connectivity for our customer VMs and colocation.
- 2018-07-24 16:04
A core switch became unresponsive at 16:54:16 local time today. Our engineers began looking into it as soon as alerts were raised a few seconds later.
- 2018-07-24 16:05
The switch appears to have rebooted, coming back into full service just before 16:56:00. We are investigating why this has happened.
- 2018-07-24 16:15
We believe the connectivity interruption was caused by a "split-brain" on our core switches: the master failed to respond to heartbeat messages, another master was elected, and then the original master started functioning normally again. The switches "solved" the split-brain by rebooting the older master switch, which caused a brief interruption to connectivity.
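The sequence described above can be illustrated with a toy model of a two-switch master/standby pair. This is only a sketch under stated assumptions, not the switch vendor's implementation: the `Switch` class, the `HEARTBEAT_TIMEOUT` value, the function names and the "reboot whichever master was elected first" tie-break are all hypothetical, chosen to mirror the description of the incident.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical value: how long a standby waits without a heartbeat
# before assuming the master has failed.
HEARTBEAT_TIMEOUT = 3.0  # seconds


@dataclass
class Switch:
    name: str
    is_master: bool = False
    elected_at: Optional[float] = None  # time this switch became master
    last_heartbeat: float = 0.0         # time its last heartbeat was seen
    rebooting: bool = False


def check_heartbeat(master: Switch, standby: Switch, now: float) -> bool:
    """Promote the standby if the master has been silent too long."""
    silent_for = now - master.last_heartbeat
    if master.is_master and not standby.is_master and silent_for > HEARTBEAT_TIMEOUT:
        standby.is_master = True
        standby.elected_at = now
        return True
    return False


def resolve_split_brain(a: Switch, b: Switch) -> Optional[Switch]:
    """If both switches claim mastership, reboot the one elected earlier."""
    if not (a.is_master and b.is_master):
        return None
    older = a if a.elected_at < b.elected_at else b
    older.is_master = False
    older.rebooting = True
    return older


def simulate() -> Optional[Switch]:
    # sw1 is the original master; sw2 is the standby.
    sw1 = Switch("sw1", is_master=True, elected_at=0.0, last_heartbeat=0.0)
    sw2 = Switch("sw2")

    # sw1 stops responding; after the timeout sw2 elects itself master.
    check_heartbeat(sw1, sw2, now=5.0)

    # sw1 starts functioning again, still believing it is master.
    sw1.last_heartbeat = 6.0

    # Two masters now exist; the older one (sw1) is rebooted.
    return resolve_split_brain(sw1, sw2)


if __name__ == "__main__":
    rebooted = simulate()
    print(f"rebooted: {rebooted.name}")  # prints "rebooted: sw1"
```

In this model the reboot of the original master is the intended recovery action rather than a second fault, which matches why connectivity was interrupted even though both switches were healthy by that point.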
- 2018-07-24 16:17
We will be working with the switch vendor's support team to determine whether this was caused by a known defect.