While everyone else is talking about the Firefox3 release, here’s a behind-the-scenes story from just before that release.
Technical details of the root causes are here.
From our point of view here in RelEng, as “users” of VMware/NetApp, we would see 8-12 of our 84 VMs randomly lock up at the same instant, for 15-45 seconds at a time.
The first couple of times, we incorrectly thought this hiccup was caused by network outages. However, each time IT confirmed the network was healthy, and we had no way to track down the problem, so we’d repair the VMs and just move on. Looking back, I’ve found bug#435052 and bug#429406, but there were other times when we never even filed bugs, so there is no record of them.
Once the interrupted VMs resumed, the OS within each of those VMs would come back to life… usually in a broken state. The 15-45 second lockup was enough for the OS to time out connections to what it thought was the machine’s local hard disk, just as if someone had unplugged the disk of a running computer and then plugged it back in. Depending on the length of the lockup, the VM would resume with:
- missing or corrupted disks, which we’d have to manually reconstruct. If that failed, we’d delete the VM and recreate it from scratch.
- disks that had become read-only, which a clean reboot would fix, although you could then still have…
- disks that were just fine, but with corrupted application files on the disk (i.e. the build or unittest in progress), which caused subsequent runs to fail with unusual errors. Fixing this required understanding where the different application-level files were buried, and then manually cleaning up until the applications on the VM started working again. Depending on what they were doing, they would fail builds with weird compiler / linker errors, or fail unittests with what looked like random unittest errors.
- no problems at all. This was a rare and very pleasant surprise whenever it happened. It seemed to depend on how long the timeout was, but that situation was so rare that we didn’t even count those!
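For the read-only-disk case in particular, a quick check along these lines can help spot affected Linux VMs. This is only a minimal sketch, not what we actually ran: the function name `read_only_mounts` and the default filesystem types are assumptions for illustration, and it expects text in the `/proc/mounts` format (device, mount point, filesystem type, options).

```python
def read_only_mounts(mounts_text, fstypes=("ext2", "ext3")):
    """Return mount points of the given filesystem types mounted read-only.

    mounts_text is expected to be in /proc/mounts format:
    device, mount point, filesystem type, comma-separated options, ...
    The default fstypes are an assumption; adjust them for your VMs.
    """
    read_only = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        _device, mount_point, fstype, options = fields[:4]
        # Match the exact "ro" option, not e.g. "errors=remount-ro".
        if fstype in fstypes and "ro" in options.split(","):
            read_only.append(mount_point)
    return read_only
```

On a Linux VM you could feed it the contents of `/proc/mounts`. An empty result only rules out the read-only symptom; it says nothing about corrupted application files.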
We’d have to investigate each broken VM, and repair as appropriate. Best case, a builder VM could be repaired from light damage before any other unittest and talos machines noticed a problem. Repair time: a few minutes. Worst case, a buildbot master or unittest master VM would get corrupted, require lengthy repairs, and take down all the slaves attached to it for the duration. Repair time: 5-6 hours.
Sometimes, we’d still be reviving dead VMs when another set of VMs would die… taking down some of the VMs we’d just revived.
It’s worth repeating that this was a problem with *all* our VMs, regardless of branch or purpose. It didn’t matter whether the VM was doing builds or unittests, running on win32 or linux, running as slave or master, or running on 1.8/1.9/moz2 tinderbox trees. And each failure gave different symptoms every time.
After a few days of this continuous behavior, our lives had deteriorated into manically watching tinderbox, getting screenfuls of new nagios emails every time we checked email, and scrambling to prop up whatever VM just died.
So long as we could fix machines faster than they died, and so long as we worked 24 hours a day, 7 days a week, we could keep the trees open with only occasional machine burning problems being visible to developers.
As the failure rate got higher, it turned into a losing battle, and finally, late in the afternoon on Sunday (8th), we had to give up and just close the trees. Not just one tree. Close *all* trees.
By Tuesday (10th), Justin had stable ESXHosts and NetApps, so we started reviving / repairing all our VMs. And this time, the VMs stayed up! 🙂 By Monday (16th), we’d repaired the last of the broken VMs and life returned to normal after a never-boring-for-one-minute 21 days.
Many many thanks to bhearsum and nthomas for all their work continuously reviving VMs. Because of their non-stop repair work, we were able to keep the trees open during the FF3.0rc2 and FF3.0rc3 releases.
…and thanks also to Justin and mrz for all their work chasing this down. Debugging 3 different interwoven problems is not fun.
ps: Confusing the matter was bug#407796, where a linux kernel update was needed in the VM’s OS to prevent the VM disk from going read-only. Doing this kernel update required scheduling downtime for that tinderbox, doing config updates, and a restart. Only after some “updated” VMs re-failed did we finally get confirmation that the kernel version we needed was different from the one we were told to use. We were re-updating kernels to the new “correct” kernel when they also started failing… in sets, just like a network outage…