Last week, in the frenzied leadup to the FF3.1b1 code freeze, we got into a state where there were too many changes and too many builds being queued for the available pool of slaves to keep up. Literally, new builds were being requested faster than we could generate them. Worst hit was win32, because win32 builds take much longer than the other platforms.
Short answer: Since early summer, we’ve used double the number of slaves we had for FF3.0 – which was more than enough until last week. We’ve since added even more slaves to the pool, which cleared out the backlog and should also prevent this from happening again. At peak demand, jobs were never lost; they just got consolidated together. Having multiple changes consolidated into the same build means the overrun machines can keep up, but it makes it harder to figure out which specific change caused a regression.
First, make some coffee, then keep reading…
If you recall from this earlier post, we’ve moved from having one dedicated machine per build purpose to using a pool of shared, identical machines. Pending jobs/builds are queued up, and then allocated to the next available idle slave, without caring whether it’s an opt build, a debug build, etc. Any slave can do the work. More importantly, failure of any one slave does not close the tree.
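The pool-of-slaves idea above boils down to one shared work queue that any idle machine can pull from. Here’s a minimal sketch of that scheduling model (the names and structure are illustrative, not the actual Buildbot internals):

```python
import queue
import threading

# One shared queue of pending build jobs; any idle slave takes the next
# job regardless of whether it's an opt build, debug build, etc.
job_queue = queue.Queue()
results = []  # (slave, job) pairs, recorded as work completes

def slave_loop(slave_name):
    """Each slave repeatedly takes whatever job is next in the queue."""
    while True:
        job = job_queue.get()
        if job is None:              # sentinel: shut this slave down
            job_queue.task_done()
            return
        results.append((slave_name, job))
        job_queue.task_done()

for job in ["opt-build", "debug-build", "opt-build", "leak-test"]:
    job_queue.put(job)

# Two interchangeable slaves drain the queue; losing one would slow
# things down, but the other would still work through the backlog.
slaves = [threading.Thread(target=slave_loop, args=(f"slave{i}",))
          for i in range(2)]
for s in slaves:
    s.start()
job_queue.join()                     # wait until every job is done
for _ in slaves:
    job_queue.put(None)
for s in slaves:
    s.join()

print(len(results))  # → 4: all jobs completed, by whichever slave was free
```

The key property is that no job is tied to a specific machine: adding a slave immediately increases throughput, and removing one never closes the tree.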
A related topic: it’s tough to predict how many slaves will be enough for future demand. We started in early summer 2008 with twice as many build slaves as we had for Firefox 3.0. That guesstimate was based on the following factors:
- using a shared pool across 2 active code lines (mozilla-central, actionmonkey). This has since changed to 3 active code lines (mozilla-central, TraceMonkey, mobile-browser), with different volumes of traffic on each.
- assuming that the combined number of changes landing across all active code lines would be similar to what we saw in FF3.0. We didn’t have project branches back then, but we had approximately the same number of developers/community members landing changes at about the same rate.
- changing from “build-continuously” to “build-on-checkin”. This greatly reduced the number of “waste” builds using up capacity. We still generate some “waste” builds (no code change, but needed to stop builds falling off tinderbox, and to keep talos slaves busy). The question is how many of these “waste” builds are really needed, and whether we can reduce them further.
This worked fine until last week, when a lot of changes landed in the rush to FF3.1beta1, and then regressions forced a lot of back-out-and-try-again builds. Here’s a graph that might help:
Once we realised the current pool of machines was not able to keep up with demand:
- We added new build machines to the pool. This really helped. As each new slave was added to the production pool, it was immediately assigned one of the pending builds and started working – helping deal with the backlog of jobs. By the time we added the last slave to the production pool, there were no pending jobs, so there was nothing for it to do; it remained idle until new builds were queued, and then processed them immediately.
- On the TraceMonkey branch, we had been triggering a “waste” (“nothing changed”) build every 2 hours. We do this on mozilla-central to ensure that Talos machines are always testing something, and that builds don’t fall off the tinderbox waterfall. We originally set up TraceMonkey to build on the same schedule as mozilla-central, but as we didn’t have any Talos machines on TraceMonkey, we could safely reduce that frequency. In bug#458158, dbaron and nthomas increased the gap between “waste” builds on the TraceMonkey branch from 2 hours to 10 hours – the longest gap we could have while still being frequent enough to prevent builds falling off tinderbox. They also turned off PGO for win32, which seemed fine, as there are no Talos machines measuring performance on the TraceMonkey branch anyway; turning off PGO reduced the TraceMonkey build time, which meant that each slave would be freed up sooner to deal with another pending job. I tried to visualise this drop in pending jobs with the notch in the graph above.
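For the curious, a periodic “waste” build like this is typically driven by Buildbot’s Periodic scheduler. Here’s a hedged sketch of what the relevant master.cfg fragment might look like – the builder names are made up, and the exact import path and parameter names should be checked against the Buildbot version in use:

```python
# Sketch of a master.cfg fragment (assumes the Buildbot Periodic
# scheduler of that era; builder names here are hypothetical).
from buildbot.scheduler import Periodic

# TraceMonkey has no Talos slaves, so a 10-hour gap between
# "nothing changed" builds is enough to keep builds from falling
# off the tinderbox waterfall, while freeing slaves for real work.
tracemonkey_periodic = Periodic(
    name="tracemonkey-periodic",
    branch="tracemonkey",
    periodicBuildTimer=10 * 60 * 60,   # seconds: one build every 10 hours
    builderNames=["tracemonkey-win32", "tracemonkey-linux"],
)
```

Going from a 2-hour to a 10-hour timer means each “waste” cycle ties up a slave one-fifth as often, which is exactly the capacity that was reclaimed here.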
Looking at the graph again, we don’t know if we will see “future#1” or “future#2”. We’re still unable to predict how many slaves are enough for future demand. We’re adding new project branches, hiring people, and adding tests. We’re obsoleting other project branches. Once we fix some infrastructure bugs, we can stop “waste” builds completely. Either way, the infrastructure is designed to handle this flexibility, and we have plenty of room for quick expansion if the need arises…
We don’t yet have an easy way to track how heavily loaded the pool-of-slaves is, so I have to ask for some help.
Until we get a dashboard/console working, can I ask people to watch for the following: whenever you land a change, an idle slave should detect it and start building within 2 minutes (it’s intentionally not immediate – there’s a tree stable timer of a couple of minutes). If we have enough slaves, there should be one build produced per changeset. It’s possible for people to land within 2 minutes of each other, and therefore correctly get included in the same build, but that should be very rare. More usually, each checkin will be in its own build. If you start seeing lots of changesets in the one build, and especially if you see this for a few builds in a row, it may mean that the pool-of-slaves cannot keep up, and queued jobs are being consolidated together. In that situation, please let us know by filing a bug with mozilla.org:ReleaseEngineering; include the buildID of the build, details of your changeset and the other changesets that were also there, which OS, etc., and we’ll investigate. There are many other factors which could be at play, but it *might* be an indicator that there are not enough slaves, and if so, we’ll quickly add some more.
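The warning sign described above – several consecutive builds each bundling multiple changesets – is easy to express as a quick check. Here’s a hypothetical helper (not an existing tool) sketching that heuristic:

```python
# Hypothetical helper for the check described above: flag a run of
# consecutive builds that each contain more than one changeset, which
# may mean pending jobs are being consolidated by an overloaded pool.
def consolidation_warning(changesets_per_build, run_length=3):
    """Return True if `run_length` consecutive builds each bundled
    more than one changeset."""
    run = 0
    for count in changesets_per_build:
        run = run + 1 if count > 1 else 0
        if run >= run_length:
            return True
    return False

# One busy build is normal (two people landed close together)...
print(consolidation_warning([1, 3, 1, 1]))      # → False
# ...but several multi-changeset builds in a row suggests a backlog.
print(consolidation_warning([2, 4, 3, 5, 1]))   # → True
```

None of the thresholds here are official – the point is just that an isolated multi-changeset build is expected noise, while a sustained run of them is worth a bug report.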
Hope all that makes sense, but please ping me if there are any questions, ok?
Thanks for reading this far!
[updated 13oct2008 to fix broken english syntax.]