On the afternoon of Sunday 09Aug2009, our colo overheated and shut down. The gory details are here, but basically, when the air conditioners failed, the room quickly overheated to unsafe levels, and machines took themselves offline before they were physically damaged. All our build/unittest/talos infrastructure, along with large portions of the rest of Mozilla's infrastructure, came to an abrupt halt.
Matthew (mrz) phoned me soon after the colo went offline to give me a heads-up, so I was able to forewarn others in the group. The rough timeline was:
- 13:30 PDT Sunday afternoon: colo offline
- 21:30 PDT Sunday evening: Mozilla back online
- 01:00 PDT Monday morning: RelEng declares build infrastructure back online
While it's bad for a colo provider to have failures like this, it was impressive to watch how the RelEng and IT groups pitched in together to get things going again so quickly: reviving ~420 RelEng machines in under 12 hours was quite a feat.