Storage failure in lexx

Event Started: 2018-04-29 07:45
Report Published: 2018-04-29 06:33
Last Updated: 2022-04-13 09:27
Event Finished: 2018-05-09 21:20

One or more drives in one of our virtual-machine hosting servers, lexx, have failed.

We replicate storage between physical hosts using DRBD, and most VMs continued running using their secondary storage. These were relatively easy to live-migrate onto their secondary node.

A small number, however, locked up with I/O errors, or in some cases with numerous segfaults and kernel panics. We are manually restarting the crashed VMs on another physical host. Unfortunately this is delayed while we make sure that each VM's DRBD storage is up to date; in some cases this requires a full DRBD resync.
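
As an illustration of the "is the storage up to date" check described above, here is a minimal sketch, not the tooling actually used during this incident. It assumes DRBD 8.x, where /proc/drbd reports each device with cs: (connection state), ro: (roles) and ds: (local/peer disk state) fields. The sample text, device numbers and the function names parse_status, safe_to_start and fully_replicated are hypothetical.

```python
import re

# Sample text in the style of /proc/drbd on DRBD 8.x.
# Illustrative only; these are not real devices from lexx.
SAMPLE = """\
 0: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate C r-----
 1: cs:SyncTarget ro:Secondary/Primary ds:Inconsistent/UpToDate C r-----
 2: cs:WFConnection ro:Primary/Unknown ds:UpToDate/DUnknown C r-----
"""

# minor number, connection state, roles, local disk state, peer disk state
LINE = re.compile(r"^\s*(\d+): cs:(\S+) ro:(\S+) ds:([^/\s]+)/(\S+)")


def parse_status(text):
    """Return {minor: (connection_state, local_disk, peer_disk)}."""
    status = {}
    for line in text.splitlines():
        m = LINE.match(line)
        if m:
            minor, cs, _roles, local, peer = m.groups()
            status[int(minor)] = (cs, local, peer)
    return status


def safe_to_start(status, minor):
    """True only if this host's copy of the device is UpToDate."""
    entry = status.get(minor)
    return entry is not None and entry[1] == "UpToDate"


def fully_replicated(status, minor):
    """True if both copies are UpToDate and the peers are connected."""
    return status.get(minor) == ("Connected", "UpToDate", "UpToDate")


if __name__ == "__main__":
    status = parse_status(SAMPLE)
    for minor in sorted(status):
        print(minor, safe_to_start(status, minor), fully_replicated(status, minor))
```

In this sketch, a device whose local copy is Inconsistent would have to finish resyncing before a VM is started on it, which is the kind of delay mentioned above.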

Timeline (most recent first)

2018-05-09
21:20:00

Pre-flight tests have completed successfully, and so lexx is returning to full production status.
2018-05-09
18:15:00

Drive replacement is in progress.
2018-04-29
16:56:00

Diagnostic tests have all completed. One storage drive needs replacing, while other drives and SSDs report no errors. As all customer VMs have now been evacuated from lexx, we will perform this maintenance out-of-hours when next convenient.
2018-04-29
09:51:00

Diagnostic tests haven't completed, but it's clear from our sacrificial VMs that we will need to replace at least one drive in lexx. We will continue to wait for the results of the diagnostics on the other drives in lexx before deciding how to proceed.
2018-04-29
09:09:00

We are running some sacrificial VMs on lexx, along with disk diagnostic tools. So far we suspect one storage drive is having errors, but are also suspicious about one caching SSD. It will be about four hours till the diagnostic tests complete.
2018-04-29
08:36:00

All VMs that crashed have now been moved to different physical hosts and their backing storage is up-to-date and replicated onto another physical host.
2018-04-29
07:30:00

For the half dozen VMs that crashed, we are still rebuilding their DRBD devices onto other physical hosts.
2018-04-29
07:00:00

Most VMs have been moved off lexx with live migration.