Last summer we changed from our “build on dedicated machines” approach:
…to instead run builds as jobs submitted to identical slaves within a pool-of-slaves:
That reduced the risk of having to close the tree because a single build machine failed. It also made setting up builds on new branches like TraceMonkey relatively quick and relatively easy. (More details here.)
However, that was only the first milestone.
We still had l10n and unittests running on separate dedicated machines, which meant that l10n and unittests still had the same problems that build machines used to have:
- hard to set up on new active code lines, like when we started TraceMonkey
- tree vulnerable to closure when a machine fails
- spikes in load would backlog on the dedicated machines, rather than load-balance onto other available slaves
Sorting out the differences between the unittest, l10n and build machines was fiddly, as these sets of machines all had different histories. Reconciling all the different user accounts, toolchains, environment variables and directory structures took patient de-tangling. And every change required a bunch of testing to make sure it didn't introduce breakage in some other part of the infrastructure or on some other branch. While some of this could be done in staging, small chunks would be gradually rolled into production, and then, if all still looked good, we'd go back and take on the next part.
In late December 2008, Lukas and Chris AtLee got unittests to run on machines in the pool-of-slaves. This meant that any queued pending unittest job could be handled by any slave in the pool, and after running the two systems side-by-side for a while, we were able to turn off the old dedicated unittest machines, reimage them like the other slaves in the pool-of-slaves, and add them to the pool.
Just a week or two ago, in March 2009, Armen, Axel and Chris Cooper got l10n repacks to run on machines in the pool-of-slaves. This meant that any queued l10n-repack job could be handled by any slave in the pool, regardless of whether it's an l10n-repack-on-change, an l10n nightly or an l10n release. Again, after running both systems side-by-side for a while, we're powering off the old l10n systems, reimaging them and adding them to the pool as more general purpose slaves.
OK, cool pictures, but so what?
Well, this is really exciting because it's:
- More reliable: if one machine dies, we fail over to another machine
- More scalable: incoming jobs from all branches are load-balanced across all available slaves. No more backlog on one branch while a dedicated machine sits idle on another branch. No more trying to predict how busy a project branch will be in order to decide the minimum number of dedicated machines to create for it.
- Quicker setup:
- we can now enable unittests wherever we have build slaves running. One example is unittests being enabled on TryServer, which Lukas announced recently (see Lukas's blog for the linux, mac and win32 announcements).
- setting up a new project branch, running builds *and* unittests and even l10n is now much quicker because we’re not setting up new dedicated machines each time. Instead, we’re scheduling extra jobs in the master queue.
- similarly, scheduling completely new types of jobs, like shark builds or code coverage runs, is getting simpler, again because we're not setting up new dedicated machines; we're just scheduling extra jobs in the master queue.
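To make the scalability point concrete, here's a toy back-of-envelope model comparing dedicated-machines-per-branch with one shared pool. This is not our actual scheduler (Buildbot does the real work), and all the branch names, machine counts and job times below are illustrative:

```python
import math

def wall_time_dedicated(jobs_per_branch, slaves_per_branch, job_minutes):
    """Each branch can only use its own machines, so the busiest branch
    sets the overall pace while machines on quiet branches sit idle."""
    waves = max(math.ceil(jobs / slaves_per_branch[branch])
                for branch, jobs in jobs_per_branch.items())
    return waves * job_minutes

def wall_time_pooled(jobs_per_branch, total_slaves, job_minutes):
    """Any queued job can run on any free slave, so a spike on one
    branch spreads across the whole pool instead of backlogging."""
    total_jobs = sum(jobs_per_branch.values())
    return math.ceil(total_jobs / total_slaves) * job_minutes

# A spike of checkins on one branch while the other is quiet:
jobs = {'mozilla-central': 10, 'tracemonkey': 2}

# 3 dedicated machines per branch vs. the same 6 machines shared:
dedicated = wall_time_dedicated(jobs, {'mozilla-central': 3, 'tracemonkey': 3}, 30)
pooled = wall_time_pooled(jobs, 6, 30)
```

With 30-minute jobs, the dedicated setup takes 120 minutes (four waves on the busy branch) while the shared pool finishes the same work in 60, with no machine idle.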
- Faster end-to-end time: running l10n repacks as individual jobs, all submitted concurrently to the pool of slaves, gives us *much* faster turnaround times. Each repack takes only a few minutes, but with almost 70 locales per OS, that quickly adds up. The FF3.1b3 release was the first time we ran l10n repacks concurrently like this, and we saw the following improvements:
- linux: reduced from ~1h15m to ~20min
- mac: reduced from ~1h to ~20min
- win32: reduced from ~6h to ~1h
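The arithmetic behind those numbers is simple. This sketch uses an illustrative 5-minute per-repack time and slave count, not measured values, but it shows why splitting ~70 locales into individual concurrent jobs helps so much:

```python
import math

def repack_wall_time(locales, minutes_per_repack, free_slaves):
    """Back-of-envelope turnaround when each locale is its own job
    and the jobs are spread across however many slaves are free."""
    return math.ceil(locales / free_slaves) * minutes_per_repack

# One dedicated machine doing all locales in sequence:
serial = repack_wall_time(70, 5, 1)    # 350 minutes, i.e. almost 6 hours
# The same repacks spread across 20 free slaves in the pool:
pooled = repack_wall_time(70, 5, 20)   # 20 minutes
```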
Later, once we get past some unittest framework cleanup, we should be able to run unittests without first requiring an additional unittest-specific build (see blog for details). Once that's fixed, we can then start running individual unittest suites concurrently on the pool-of-slaves, which means:
- developers see much faster turnaround time on unittests.
- we can automate running one suite ‘n’ times in a row on the *same* build, to help QA hunt down intermittent unittest failures.
More reliable. More flexible. Easier setup. Faster end-to-end times. What's not to love?
After all these months of behind-the-scenes work, it's great to finally see these changes hit the light of day, and I'm really proud of all the work people did to make this happen.
Disclaimer: I've excluded FF2 and Talos systems from this blog post, just to keep the diagrams manageable. More on those soon.