One of our VPS hosts in Geneva has locked up.
Crash of VPS host shnik.g.faelix.net Virtual Servers
- Event Started: 2025-01-02 12:39
- Report Published: 2025-01-02 17:45
- Last Updated: 2025-02-12 11:12
- Event Finished: 2025-01-02 13:11
Timeline (most recent first)
-
2025-02-12
10:47:00 We have now changed the fan arrangement in shnik. A follow-on maintenance will add two more fans to both shnik and uther to ensure redundancy of all cooling paths, so that airflow is maintained even in the event of a fan failure.
-
2025-02-11
21:30:00 The arrangement of fans within the server chassis puts no airflow over the SAS RAID card. As a result, its temperature was 109°C. The storage controller firmware shuts the card down at 116°C. We now believe that this may also have affected shnik when it was under load and crashed. Changing the fan arrangement has mitigated this, reducing temperatures on the RAID card by a significant amount.
Frustratingly, the SAS card's heatsink temperature isn't factored into the server's "health monitoring" by the baseboard management controller. During the thermal shutdown of the storage controller, the BMC continued to report all green because the motherboard components (CPU, RAM, etc.) were within thresholds.
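The two figures above leave very little headroom, which a small check makes concrete. The following is a minimal sketch only: it assumes the RAID card temperature has already been read by some other means (the BMC does not report it), and the warning margin `WARN_MARGIN_C` and the names `classify` and `ThermalStatus` are hypothetical, chosen for illustration rather than taken from our monitoring.

```python
from dataclasses import dataclass

# Figures taken from the timeline entry above.
SHUTDOWN_C = 116.0  # firmware shuts the storage controller down at this temperature
OBSERVED_C = 109.0  # temperature seen on the RAID card before the fan change

# Hypothetical alerting margin; an assumption, not a value from this report.
WARN_MARGIN_C = 15.0


@dataclass
class ThermalStatus:
    temperature_c: float
    headroom_c: float  # degrees remaining before the firmware shutdown point
    state: str         # "ok", "warning" or "critical"


def classify(temperature_c, shutdown_c=SHUTDOWN_C, warn_margin_c=WARN_MARGIN_C):
    """Classify a controller temperature by its distance from the shutdown point."""
    headroom = shutdown_c - temperature_c
    if headroom <= 0:
        state = "critical"
    elif headroom <= warn_margin_c:
        state = "warning"
    else:
        state = "ok"
    return ThermalStatus(temperature_c, headroom, state)


status = classify(OBSERVED_C)
print(f"{status.temperature_c:.0f}°C: {status.state}, "
      f"{status.headroom_c:.0f}°C below shutdown")
```

With the observed 109°C this reports a warning with 7°C of headroom, a condition that a check based only on the motherboard sensors would not surface.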
-
2025-02-11
14:44:00 On one of our other VPS hosts with an identical hardware build, currently carrying only test load, we've noticed some errors related to the SAS RAID card. These appear to indicate a thermal exception. An engineer is en route to investigate further.
-
2025-01-05
13:30:00 Customer servers have been running on this host for almost a day now, and there has been no recurrence of the hardware warning. We are closing this incident.
-
2025-01-04
16:01:00 After a BIOS and BMC update, a full shutdown and power-cycle, shnik is showing no hardware warnings.
-
2025-01-04
15:37:00 The BMC update seems to have been successful. We're going to apply a BIOS update as well, for completeness' sake.
-
2025-01-04
15:23:00 We are going to apply an update to the BMC (baseboard management controller) to see if this sheds further light or clears the warning.
-
2025-01-04
15:19:00 So far we've turned up nothing obvious: all airflow baffles and channels are correctly placed within the server shnik and no components are showing warning lights.
-
2025-01-04
14:58:00 All customer VPSs have been moved off. We are now shutting down shnik for investigation.
-
2025-01-04
14:43:00 Our engineer is on-site at the Geneva DC and is beginning work. First we are going to migrate VPSs off shnik before we investigate the hardware "warning".
-
2025-01-02
13:55:00 The VPS host's hardware is showing a "warning" but none of its sensors are outside their thresholds. No log entries on the system management console show any reason for the "warning" either. We are going to send an engineer to the datacentre on 4th January to investigate further.
-
2025-01-02
13:14:00 All customer VPSs are now running.
-
2025-01-02
13:11:00 Customer VPSs are now beginning to start.
-
2025-01-02
13:04:00 It's taken us longer than we'd hoped to power-cycle the server because, at the time it crashed, it was also hosting the VPS for our documentation, which includes the passwords for the hardware's lights-out management.
-
2025-01-02
12:44:00 The VPS host shnik.g.faelix.net has become unresponsive. We're going to issue a reboot via the "lights out management" for the server.
-
2025-01-02
12:39:00 The first alerts have come in that some of the VPSs are unresponsive. We are investigating.