Switch Crash in Manchester Incident

We are investigating a core switch crash in Manchester which briefly interrupted connectivity to customer VMs and colocation.


Timeline

2018-07-24 16:04

A core switch became unresponsive at 16:54:16 local time today. Our engineers began looking into it as soon as alerts were raised a few seconds later.

2018-07-24 16:05

The switch appears to have rebooted, coming back into full service just before 16:56:00. We are investigating why this has happened.

2018-07-24 16:15

We believe the connectivity interruption was caused by a "split-brain" on our core switches: the master failed to respond to heartbeat messages, another switch was elected master, and then the original master started functioning normally again. The switches "solved" the split-brain by rebooting the older master switch, which caused a brief interruption to connectivity.
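
For readers unfamiliar with the term, the sequence described above can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the switch vendor's actual election logic: the switch names, the HEARTBEAT_TIMEOUT value, the tick-based clock and all function names are hypothetical and chosen purely for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional

HEARTBEAT_TIMEOUT = 3  # assumed: missed heartbeats before the standby takes over


@dataclass
class Switch:
    name: str
    master_since: Optional[int] = None  # tick at which it became master; None if not master
    reboots: int = 0

    @property
    def is_master(self) -> bool:
        return self.master_since is not None


def elect_if_master_silent(standby: Switch, missed: int, tick: int) -> bool:
    """Promote the standby once the master has missed too many heartbeats."""
    if not standby.is_master and missed >= HEARTBEAT_TIMEOUT:
        standby.master_since = tick
        return True
    return False


def resolve_split_brain(switches: List[Switch]) -> Optional[Switch]:
    """If two masters are visible, reboot the one that has been master longest."""
    masters = [s for s in switches if s.is_master]
    if len(masters) < 2:
        return None
    older = min(masters, key=lambda s: s.master_since)
    older.master_since = None  # it loses the master role while rebooting
    older.reboots += 1
    return older


def simulate():
    a = Switch("switch-a", master_since=0)  # original master
    b = Switch("switch-b")                  # standby
    events = []
    missed = 0
    for tick in range(1, 6):
        a_responds = tick > 3  # switch-a is silent for ticks 1-3, then recovers
        missed = 0 if a_responds else missed + 1
        if elect_if_master_silent(b, missed, tick):
            events.append(f"tick {tick}: {b.name} elected master")
        if a_responds:
            # Both switches can see each other again, so a conflict is detectable.
            rebooted = resolve_split_brain([a, b])
            if rebooted is not None:
                events.append(f"tick {tick}: split-brain, rebooting {rebooted.name}")
    return a, b, events


if __name__ == "__main__":
    for line in simulate()[2]:
        print(line)
```

In this sketch the reboot of the older master is what ends the conflict, and it is also the step that would interrupt traffic passing through that switch.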

2018-07-24 16:17

We will be working with the switch vendor's support team to determine whether this was caused by a known defect.