Fire Suppression Triggered at MA2 (Reynolds House) - Numerous Hard Disks Damaged
Virtual Servers

Event Started
2022-10-21 09:00
Report Published
2022-10-21 09:54
Last Updated
2022-11-01 06:59
Event Finished
2022-10-31 12:00

The fire detection system at Equinix MA2 triggered and caused the fire suppression system to discharge. The ensuing noise and pressure release have damaged several hard disks in our virtual hosting servers there. We were already in the process of moving all customer workloads (virtual servers, network function virtualisation, etc.) out of MA2, and have accelerated this process as a result of this major incident.

Timeline (most recent first)
  • 2022-10-31
    12:00:00

    The old route server is now completely decommissioned, and we have completely left the Equinix MA2 site. We are therefore closing this incident.

    The root cause of this incident was vibrational damage to hard disks. This in turn was caused by a high-pressure gas discharge, which occurred when the fire detection system classified a refrigerant leak as smoke and triggered a fire suppression event.

  • 2022-10-27
    16:42:00

    We are scheduling emergency maintenance to migrate configuration off the old BGP route server onto the new cluster. This will take place tonight from 00:00 local time. We do not anticipate any issues; this work is intended to address the ongoing problems that this device has caused.

  • 2022-10-27
    14:00:00

    At around 15:00 local time (14:00 UTC) we observed packet loss to our old BGP route server. This caused some routing instability while we worked around the problem.

  • 2022-10-26
    02:30:00

    Maintenance to the route server was a success.

    We still have left to migrate:

    • 2 virtual servers still running in MA1/MA2 on cluster x4.faelix.net (both BGP route servers); the BGP sessions terminating on these will be moved to our new cluster of route servers, and the old BGP route servers (46.227.200.12 and 46.227.203.12) will be decommissioned

    • 35 virtual servers still running in MA2 on cluster x5.faelix.net (including 21 VPSs related to our CRM, with all minio storage slices migrated or in-flight)

    Customers are likely to experience a shutdown/restart of their server over the next 24-48 hours as we continue moving workload out of MA2.

  • 2022-10-26
    00:35:00

    We still have left to migrate:

    • 2 virtual servers still running in MA1/MA2 on cluster x4.faelix.net (both BGP route servers)
    • 38 virtual servers still running in MA2 on cluster x5.faelix.net (including 21 VPSs related to our CRM, with all minio storage slices migrated or in-flight)

    Customers are likely to experience a shutdown/restart of their server over the next 24-48 hours as we continue moving workload out of MA2.

  • 2022-10-25
    22:38:00

    We still have left to migrate:

    • 6 virtual servers still running in MA1/MA2 on cluster x4.faelix.net (including 2 VPSs related to our CRM)
    • 38 virtual servers still running in MA2 on cluster x5.faelix.net (including 21 VPSs related to our CRM, with all minio storage slices migrated or in-flight)

    Customers are likely to experience a shutdown/restart of their server over the next 24-48 hours as we continue moving workload out of MA2.

  • 2022-10-25
    20:00:00

    We need to perform an emergency reboot of one of our core BGP route servers, as it is running on hardware with failing disks. This will be carried out at 03:00 local time (02:00 UTC). It may cause a brief period of routing instability for some services.

  • 2022-10-25
    08:17:00

    We still have left to migrate:

    • 9 virtual servers still running in MA1/MA2 on cluster x4.faelix.net (including 3 VPSs related to our CRM)
    • 41 virtual servers still running in MA2 on cluster x5.faelix.net (including 21 VPSs related to our CRM, and 4TB of minio storage slices)

    Customers are likely to experience a shutdown/restart of their server over the next 24-48 hours as we continue moving workload out of MA2.

  • 2022-10-22
    10:52:00

    We still have left to migrate:

    • 18 virtual servers still running in MA1/MA2 on cluster x4.faelix.net (including 6 VPSs related to our CRM)
    • 57 virtual servers still running in MA2 on cluster x5.faelix.net (including 21 VPSs related to our CRM, and 12TB of minio storage slices)

    Customers are likely to experience a shutdown/restart of their server over the next 24-48 hours as we continue moving workload out of MA2.

  • 2022-10-22
    06:00:00

    We have completed migration of c3.http.faelix.net and old.mail.faelix.net. We will perform migration of shared.mail.faelix.net late tonight.

    Meanwhile, other customer servers are continuing to be migrated off hardware in MA2 as we are now treating all hard disks there as suspect.

  • 2022-10-22
    00:00:00

    Work is ongoing tonight to move customer workloads off servers whose drives, while not currently showing hard errors, may have suffered damage during the suppressant discharge earlier today. We will be moving shared hosting services on:

    • c3.http.faelix.net
    • shared.mail.faelix.net
    • old.mail.faelix.net
    • frontend.mail.faelix.net

    We apologise for the outages to email users and websites hosted on these servers during this emergency maintenance.

  • 2022-10-21
    17:00:00

    Engineers are continuing to work on migrating customer services, with priority given to protecting data integrity.

  • 2022-10-21
    16:02:00

    Equinix IBX site staff report that their investigation determined the root cause of the issue to be a refrigerant leak. The fire detection system is fully operational. The process to fully rectify the fire suppression system is ongoing.

  • 2022-10-21
    15:50:00

    Equinix stands down the alert about power disturbances.

  • 2022-10-21
    12:19:00

    Equinix reports a suspected power disturbance with power failures within Cages MA2:01:00S124:0605, MA2:01:00S144:0102, MA2:01:00S145:0108, MA2:01:00S146:0103, MA2:01:00S146:0204 (including our area).

  • 2022-10-21
    10:16:00

    Numerous hard disks have been damaged, possibly irreparably, by the fire suppression system release at site. Failed or failing drives now account for more than half of all drives in our servers in MA2.

    Our engineers decide to begin migrating all customers hosted on the remaining equipment at MA2 to other virtualisation hosts, as we anticipate more errors and failures will start occurring.

  • 2022-10-21
    09:54:00

    Engineers are at site. Immediately, just from the smell, it is clear that the fire suppression system discharged earlier today. This was our worst fear.

  • 2022-10-21
    09:30:00

    Engineers take initial actions to migrate services which were immediately affected: those where the VM host server had lost both drives in its main operating system's RAID1 boot array.

    We initially suspect two or three drives may have been affected, and that possibly a power supply in a server failed.

  • 2022-10-21
    09:00:00

    Equinix reports that fifteen minutes earlier (site local time 09:45) "a fire alarm has been triggered and the [facility] was evacuated".

  • 2022-10-21
    08:49:00

    Initial alerts were received at 08:49:15 UTC (09:49:15 UK local time), which appeared to indicate a brief disturbance to optical transmission between MA2 and MA1, and to our out-of-band network. Those alerts quickly cleared of their own accord.
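
Several entries above quote a time in both UTC and UK local time (for example 08:49:15 UTC as 09:49:15 local, and 02:00 UTC as 03:00 local). The following is a minimal sketch of that conversion, assuming Python 3.9 or later with time zone data available. The function name utc_to_uk is illustrative only, and the third example uses a hypothetical timestamp to show that the offset disappears after the UK clock change on 2022-10-30.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

# UK local time: BST (UTC+1) in summer, GMT (UTC+0) in winter.
UK = ZoneInfo("Europe/London")


def utc_to_uk(stamp: str) -> str:
    """Convert a 'YYYY-MM-DD HH:MM:SS' UTC timestamp to UK local time.

    Illustrative helper; not taken from the incident report itself.
    """
    utc_time = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S").replace(
        tzinfo=timezone.utc
    )
    return utc_time.astimezone(UK).strftime("%Y-%m-%d %H:%M:%S %Z")


# Initial alerts: 08:49:15 UTC quoted as 09:49:15 UK local time.
print(utc_to_uk("2022-10-21 08:49:15"))

# Emergency reboot: 02:00 UTC quoted as 03:00 local time.
print(utc_to_uk("2022-10-26 02:00:00"))

# Hypothetical timestamp after the clock change: UTC and local time coincide.
print(utc_to_uk("2022-10-31 12:00:00"))
```

Using a named zone rather than a fixed one-hour offset means the same helper stays correct across the end of British Summer Time, which fell within the period covered by this incident.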