We used to generate all updates on one dedicated machine (prometheus-vm). We now generate updates as jobs queued to the pool of slaves. This makes our current work faster, and unblocks us to do some awesome stuff. /me bows with gratitude to coop.
More details for the curious:
Why bother refactoring how we create updates? We have had one old machine doing nightly build updates for years, and creating updates is quick, taking just 15 minutes a night:
2.5 minutes per update x 3 OS x 1 en-US locale x 2 branches = 15 minutes
Having the one machine doing this for the two active code branches was trivial, so why not just leave it alone? There are plenty of other things to fix, right?
The problem is that 15 minutes is for en-US nightly updates only. We wanted to treat l10n builds as equals, so we figured out how to produce nightly updates for l10n builds as well. This had never been done in Mozilla before, and was great for the l10n community. However, when we turned on l10n nightly updates in production, it changed the math significantly. What used to take 15 minutes now took:
2.5 minutes per update x 3 OS x 75 locales x 2 branches = 1,125 minutes = 18.75 hours.
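For anyone who wants to play with the numbers, here is a minimal sketch of that arithmetic in Python. The function name and parameters are made up for illustration and are not part of any RelEng tooling. It also assumes, as the figures above do, that every update takes the same 2.5 minutes and that they run one after another on a single machine.

```python
def nightly_update_minutes(minutes_per_update, os_count, locale_count, branch_count):
    """Total serial time, in minutes, to generate one night's updates.

    Hypothetical helper: assumes every update costs the same and that
    updates are generated one at a time on a single machine.
    """
    return minutes_per_update * os_count * locale_count * branch_count


# en-US only: 3 OS, 1 locale, 2 branches
en_us_only = nightly_update_minutes(2.5, 3, 1, 2)

# with l10n turned on: 3 OS, 75 locales, 2 branches
with_l10n = nightly_update_minutes(2.5, 3, 75, 2)

print(en_us_only)        # 15.0 minutes
print(with_l10n)         # 1125.0 minutes
print(with_l10n / 60)    # 18.75 hours
```

Changing the OS, locale, or branch counts shows how quickly the total outgrows a 24 hour day on one machine.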
Compounding the situation, RelEng was being asked to support 3 active fully localized releases (FF3.0, FF3.5, FF3.6) and also 5 project branches that all wanted nightly updates, plus an increase from 3 OS to 7 OS on most of those branches.
Clearly one machine couldn't do all this in a 24 hour day.
One approach we considered was just to clone this machine and do half the work on one, half on another identical machine. This *might* have worked, but wasn't risk free, as the old system isn't documented anywhere, and we’d have to verify that the two systems could not trip each other up, corrupt the updates and break users.
Whatever we changed here had to be so well understood that we would be confident it was not breaking any user updates. And it had to scale. And solve the “single point of failure” problem to be reliable for our needs.
Coop figured out how the old system worked, how it could be broken into independent concurrent chunks, and how it could be integrated into buildbot, so these could be run after each nightly. Details for the curious are here. It's been tricky, because the code is fiddly, there are many sharp edges, and the risk is high: any bug would generate bad updates that completely break a user, so confidence in the accuracy of the updates has to be rock solid. And it has to not break other users of the same patch generation code, like Seamonkey, Camino, etc.
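To give a rough feel for what "independent concurrent chunks" means, here is a small Python sketch. It is not coop's code and it does not use buildbot's API. The function names, the job tuple, and the returned string are all hypothetical stand-ins. It only shows the general idea of splitting the work into one job per (branch, OS, locale) combination and handing those jobs to a pool of workers.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product


def generate_update(job):
    """Hypothetical stand-in for the real update generation step.

    Each job is independent of every other job, which is what makes
    it safe to run them concurrently.
    """
    branch, platform, locale = job
    return f"update for {branch}/{platform}/{locale}"


def run_nightly_updates(branches, platforms, locales, workers=4):
    """Fan one night's work out across a pool of workers (illustrative only)."""
    # One independent job per (branch, OS, locale) combination.
    jobs = list(product(branches, platforms, locales))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns results in the same order as the jobs.
        return list(pool.map(generate_update, jobs))


results = run_nightly_updates(
    branches=["branch-a", "branch-b"],
    platforms=["linux", "mac", "win32"],
    locales=["en-US", "de", "fr"],
)
print(len(results))  # 2 branches x 3 OS x 3 locales = 18 jobs
```

In this toy version the "pool" is just threads on one machine. In production the jobs are queued to the pool of slaves instead, but the shape of the problem is the same.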
Coop rolled this into production a week ago, after months of testing in staging. From the outside looking in, no-one noticed any difference, except that 18 hours of update generation was now being done in under 5 hours. For such a huge change, having nobody notice any problems is a great accomplishment. Within RelEng, this change means that nightly updates for a branch are now done in under 1/3rd of the time. It also means we can now do multiple sets of updates concurrently, spread across the pool, which scales our ability to generate updates. Because of this, we can now generate updates for the new linux64 and osx10.6 64bit builds, as announced here. Because of this, in a 24 hour day, we can now generate updates for 3 fully localized release branches and 5 non-localized project branches… but I’m getting ahead of myself! There’s more in the next post…