Crash of VPS host shnik.g.faelix.net Virtual Servers

Event Started: 2025-01-02 12:39
Report Published: 2025-01-02 17:45
Last Updated: 2025-02-12 11:12
Event Finished: 2025-01-02 13:11

One of our VPS hosts in Geneva has locked up.

Timeline (most recent first)
  • 2025-02-12
    10:47:00

    We have now changed the fan arrangement in shnik. Follow-on maintenance will add two more fans to both shnik and uther so that every cooling path is redundant and airflow is maintained even if a fan fails.

  • 2025-02-11
    21:30:00

    The arrangement of fans within the server chassis directs no airflow over the SAS RAID card. As a result, its temperature was 109°C. The storage controller firmware shuts the card down at 116°C. We now believe the same problem may also have affected shnik when it crashed under load. Changing the fan arrangement has mitigated this, significantly reducing the RAID card's temperature.

    Frustratingly, the SAS card's heatsink temperature isn't factored into the server's "health monitoring" by the baseboard management controller. During the thermal shutdown of the storage controller, the BMC continued to report all green because the motherboard components (CPU, RAM, etc.) were within thresholds.

  • 2025-02-11
    14:44:00

    On one of our other VPS hosts with an identical hardware build, currently carrying only test load, we've noticed some errors related to the SAS RAID card. These appear to indicate a thermal exception. An engineer is en route to investigate further.

  • 2025-01-05
    13:30:00

    Customer servers have been running on shnik for almost a day now, and the hardware warning has not recurred. We are closing this incident.

  • 2025-01-04
    16:01:00

    After a BIOS and BMC update, a full shutdown and power-cycle, shnik is showing no hardware warnings.

  • 2025-01-04
    15:37:00

    The BMC update seems to have been successful. We're going to apply a BIOS update as well, for completeness' sake.

  • 2025-01-04
    15:23:00

    We are going to apply an update to the BMC (baseboard management controller) to see if this sheds further light or clears the warning.

  • 2025-01-04
    15:19:00

    So far we've turned up nothing obvious: all airflow baffles and channels are correctly placed within the server shnik and no components are showing warning lights.

  • 2025-01-04
    14:58:00

    All customer VPSs have been moved off shnik. We are now shutting it down for investigation.

  • 2025-01-04
    14:43:00

    Our engineer is on-site at the Geneva DC and is beginning work. We will first migrate VPSs off shnik, then investigate the hardware "warning".

  • 2025-01-02
    13:55:00

    The VPS host's hardware is showing a "warning", but none of its sensors is outside its thresholds. No log entries on the system management console show any reason for the "warning" either. We are going to send an engineer to the datacentre on 4th January to investigate further.

  • 2025-01-02
    13:14:00

    All customer VPSs are now running.

  • 2025-01-02
    13:11:00

    Customer VPSs are now beginning to start.

  • 2025-01-02
    13:04:00

    It's taken us longer than we'd hoped to power-cycle the server because, at the time it crashed, it was also hosting the VPS for our documentation, which includes the passwords for the hardware's lights-out management.

  • 2025-01-02
    12:44:00

    The VPS host shnik.g.faelix.net has become unresponsive. We're going to issue a reboot via the "lights out management" for the server.

  • 2025-01-02
    12:39:00

    The first alerts have come in that some of the VPSs are unresponsive. We are investigating.
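
The monitoring gap described in the 2025-02-11 21:30 entry (the BMC reporting all green while the RAID card approached its shutdown temperature) could be closed by checking the controller's temperature separately and combining it with the BMC status. The following is a minimal sketch in Python, not our actual monitoring. The 116°C shutdown threshold comes from the report; the function names, the warning and critical margins, and the status labels are assumptions made for illustration. How the temperature is read from the card is vendor-specific and is left to the caller.

```python
# Illustrative sketch only: names, margins, and status labels are hypothetical.

# Status labels ordered from least to most severe.
SEVERITY = ("ok", "warning", "critical", "shutdown")

# Firmware shutdown threshold for the storage controller, as quoted in the report.
SHUTDOWN_C = 116.0


def classify_controller_temp(temp_c, shutdown_c=SHUTDOWN_C,
                             warn_margin_c=20.0, crit_margin_c=10.0):
    """Classify a controller temperature relative to the shutdown threshold.

    The margins (in degrees C below the shutdown threshold) are assumed
    values, not vendor recommendations.
    """
    if temp_c >= shutdown_c:
        return "shutdown"
    if temp_c >= shutdown_c - crit_margin_c:
        return "critical"
    if temp_c >= shutdown_c - warn_margin_c:
        return "warning"
    return "ok"


def combined_health(bmc_status, controller_temp_c):
    """Return the more severe of the BMC status and the controller status.

    bmc_status must be one of the labels in SEVERITY.
    """
    controller_status = classify_controller_temp(controller_temp_c)
    return max(bmc_status, controller_status, key=SEVERITY.index)


if __name__ == "__main__":
    # The BMC said "ok" while the card sat at 109 degrees C.
    print(combined_health("ok", 109.0))
```

With these assumed margins, a reading of 109°C is only 7°C below the shutdown threshold and is classified as critical, even when the BMC itself reports no problem.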