On 1.9/trunk, it's important to point out that almost all of these 88 machines need to remain up, and working perfectly, in order to keep the 1.9/trunk tinderbox tree open. If one of these machines dies, we usually have to close the tree.
This is because most of these machines are specialized unique machines, built and assigned to do only one thing. For example, bm-xserve08 only does Firefox mac depend/nightly builds; if the hard disk dies, we don’t automatically load balance over to another identically configured machine that's already up and running in parallel. Instead, we close the tree and quickly try to repair that broken specialized unique machine, or manually build up a new machine to be as close as possible to the unique dead machine. All of this happens in a rush, so we can reopen the tree as soon as possible. Looks like this:
Obviously, the more machines we bring online, the more vulnerable we are to routine hardware failures, network hiccups, etc. Kinda like a string of Christmas tree lights which goes dark when any one bulb burns out. The longer your string of Christmas tree lights, the more bulbs you have, the more chance you have of a single bulb burning out, and the greater the chance of the tree going dark.
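To put rough numbers on the Christmas-lights problem, here is a small sketch. It assumes each machine fails independently and is up 99% of the time; both the independence and the 99% figure are made up for illustration, not measured from our machines. The function name `prob_all_up` is just a name chosen for this example.

```python
def prob_all_up(machines: int, per_machine_uptime: float) -> float:
    """Chance that every machine is up at the same moment.

    Assumes failures are independent, which is a simplification:
    real outages (network, power) often take out several machines at once.
    """
    return per_machine_uptime ** machines


if __name__ == "__main__":
    uptime = 0.99  # hypothetical per-machine availability, not a measured value
    for count in (10, 88, 88 * 5):
        chance = prob_all_up(count, uptime)
        print(f"{count:4d} unique machines: tree can stay open {chance:.1%} of the time")
```

Under those assumptions, 88 must-be-up machines are all healthy only about 41% of the time, and every extra set of 88 for another branch makes it worse.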
When we started working on moz2 infrastructure, the conversation went something like “what do you want on moz2?”, “everything we have on FF3”, “ummm… everything? really?”, “yes, the full set. Oh, and we’ll need a few sets of them for a few different active mercurial branches running concurrently”.
So, how do we scale our infrastructure and also improve reliability? One of the big changes in how we are building out the moz2 infrastructure was to *not* have specialized unique machines. Instead, we have a pool of identical slaves for each OS, each slave equally able to handle whatever bundled work is handed to it. This has a couple of important consequences:
- if one generic slave dies, we dynamically and automatically re-allocate the work to happen on one of the remaining slaves in the pool. Builds would turn around slower, and we’d obviously start repairing the machine, but at least work would continue smoothly, and the tree would not close!
- if we decide we want to add an additional branch, or if we feel the current number of slaves is not able to handle the workload, we can simply add new identical slaves to the pool, and the work is automatically and dynamically re-allocated across the enlarged pool.
Looks like this:
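For the code-minded, the two consequences above can be sketched as a toy scheduler. This is not our actual build scheduler, just a minimal illustration of the idea; the class `SlavePool`, its method names, and the slave and job names are all hypothetical.

```python
from collections import deque


class SlavePool:
    """Toy pool of identical build slaves for one OS (illustrative only)."""

    def __init__(self, slaves):
        self.alive = set(slaves)   # slaves currently up
        self.queue = deque()       # jobs waiting for a free slave
        self.running = {}          # slave name -> job it is working on

    def submit(self, job):
        """Queue a job; any idle slave may pick it up."""
        self.queue.append(job)
        self._dispatch()

    def add_slave(self, slave):
        """Grow the pool; waiting jobs start flowing to the new slave."""
        self.alive.add(slave)
        self._dispatch()

    def slave_died(self, slave):
        """Drop a dead slave and put its job back at the front of the queue."""
        self.alive.discard(slave)
        job = self.running.pop(slave, None)
        if job is not None:
            self.queue.appendleft(job)
        self._dispatch()

    def job_finished(self, slave):
        """Mark a slave idle again and hand it the next waiting job."""
        self.running.pop(slave, None)
        self._dispatch()

    def _dispatch(self):
        # Identical slaves means any idle one can take any job.
        idle = sorted(self.alive - set(self.running))
        while idle and self.queue:
            self.running[idle.pop(0)] = self.queue.popleft()


if __name__ == "__main__":
    pool = SlavePool(["slave01", "slave02"])
    pool.submit("branch-a build")
    pool.submit("branch-b build")
    pool.slave_died("slave01")    # its job goes back in the queue, tree stays open
    pool.add_slave("slave03")     # new identical slave picks the job up
    print(pool.running)
```

The point of the design is that nothing in the pool cares *which* slave does the work, so losing one slave only slows things down, and adding one only speeds things up.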
Adding 88 new unique machines for each of 3-5 additional active branches would be painful to set up, and just about impossible to maintain. And we’d be *guessing* how much development work there would be in the next 18 months, and then building the infrastructure out. Instead of having to SWAG our needs for the next 18 months and then set up frantically now, this shared pool approach allows us to grow gradually as needed. Oh, and it should be more robust. 🙂
(Many thanks to BenT for the Christmas tree lights analogy. I was saying “a chain is only as strong as the weakest link”, but BenT’s analogy offers much better possibilities for awful tree jokes.)