Incident: storage failure in lexx

One or more drives in one of our virtual-machine hosting servers, lexx, have failed.

We replicate VM storage between physical hosts using DRBD, so most VMs continued running from their secondary copy and were relatively easy to live-migrate onto their secondary node.
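
Purely as an illustration (this is not our exact tooling), the sketch below shows the general shape of that step for a libvirt/KVM guest: check that the guest's DRBD resource is connected and up to date on both sides, then live-migrate it to the peer host. The resource name, domain name, and peer hostname are placeholders.

    # Sketch only: confirm a VM's DRBD replica is healthy, then live-migrate
    # the guest to the peer host. Resource, domain, and host names are
    # placeholders, not real configuration.
    import subprocess

    def drbd_healthy(resource):
        """True if the DRBD resource is connected with both disks UpToDate."""
        cstate = subprocess.run(["drbdadm", "cstate", resource],
                                capture_output=True, text=True,
                                check=True).stdout.strip()
        dstate = subprocess.run(["drbdadm", "dstate", resource],
                                capture_output=True, text=True,
                                check=True).stdout.strip()
        return cstate == "Connected" and dstate == "UpToDate/UpToDate"

    if drbd_healthy("vm123-disk"):
        # Move the running guest onto the host holding its secondary copy.
        subprocess.run(["virsh", "migrate", "--live", "vm123",
                        "qemu+ssh://peer-host.example.com/system"], check=True)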

A small number, however, locked up with I/O errors, segfaults, or kernel panics. We are manually restarting these crashed VMs on another physical host. Unfortunately there is a delay while we make sure that DRBD is up to date; in some cases this requires a full DRBD resync.
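
As a rough sketch of that wait (again, not our exact procedure), the snippet below polls a DRBD resource on the surviving host until its local disk is UpToDate, then promotes it and starts the guest. The resource and domain names are placeholders, and the polling interval is arbitrary.

    # Sketch only: wait for a crashed VM's DRBD device to finish (re)syncing
    # on the surviving host before starting the guest there.
    import subprocess
    import time

    def wait_until_uptodate(resource, interval=30):
        """Block until the local disk state of the DRBD resource is UpToDate."""
        while True:
            dstate = subprocess.run(["drbdadm", "dstate", resource],
                                    capture_output=True, text=True,
                                    check=True).stdout.strip()
            local_state = dstate.split("/")[0]  # e.g. "SyncTarget" during a resync
            if local_state == "UpToDate":
                return
            time.sleep(interval)

    wait_until_uptodate("vm456-disk")
    subprocess.run(["drbdadm", "primary", "vm456-disk"], check=True)  # promote the replica
    subprocess.run(["virsh", "start", "vm456"], check=True)           # boot the guest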


Timeline

2018-04-29 07:00
Most VMs have been moved off lexx with live migration.
2018-04-29 07:30
For the half-dozen VMs that crashed, we are still rebuilding their DRBD devices on other physical hosts.
2018-04-29 08:36
All VMs that crashed have now been moved to different physical hosts, and their backing storage is up to date and replicated to another physical host.
2018-04-29 09:09
We are running some sacrificial VMs on lexx, along with disk diagnostic tools. So far we suspect one storage drive is returning errors, and we are also suspicious of one caching SSD. It will be about four hours until the diagnostic tests complete.
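
We have not listed the exact diagnostic tools here; as one illustrative example, a SMART extended self-test is a common check that takes several hours on a large drive, which is consistent with the wait above. The sketch below assumes smartctl and uses a placeholder device path.

    # Sketch only: start a SMART extended self-test on a suspect drive, then
    # inspect the self-test log and overall health once it has finished.
    # The device path is a placeholder.
    import subprocess

    subprocess.run(["smartctl", "-t", "long", "/dev/sdb"], check=True)
    # ...several hours later:
    subprocess.run(["smartctl", "-l", "selftest", "/dev/sdb"], check=True)
    subprocess.run(["smartctl", "-H", "/dev/sdb"], check=True)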
2018-04-29 09:51
Diagnostic tests haven't completed, but it's clear from our sacrificial VMs that we will need to replace at least one drive in lexx. We will continue to wait for the results of the diagnostics on the other drives in lexx before deciding how to proceed.
2018-04-29 16:56
Diagnostic tests have all completed. One storage drive needs replacing, while the other drives and SSDs report no errors. As all customer VMs have now been evacuated from lexx, we will perform this maintenance out of hours at the next convenient time.
2018-05-09 18:15
Drive replacement is in progress.
2018-05-09 21:20
Pre-flight tests have completed successfully, so lexx is returning to full production status.