As of Friday afternoon (06jul2012), RelEng started generating a small number of production builds and try builds on Amazon Web Services.
catlee already covered this in Mozilla’s Platform Meeting, but this multi-month project is a massively important step forward for Mozilla’s Release Engineering infrastructure as well as for all Mozilla’s developers, so is worth calling attention to three important details:
- Seamless integration
- Dynamic allocation
The security of our RelEng infrastructure is obviously important to Mozilla, so we setup these Amazon-based VMs inside a Virtual Private Cloud (VPC). While it is technically possible to have the VMs inside the VPC connect directly to the external internet, we felt it was safer to prohibit any access from the VPC to or from the internet. Therefore the only connection we have to/from our VPC is over a VPN link directly into Mozilla’s existing Build network, within Mozilla’s secured infrastructure.
If an Amazon VM needs to reach an external site for any reason, it can only do this by going from Amazon over VPN to Mozilla’s Build Network and then out through Mozilla’s firewalls. If a Mozilla person wants to access one of our Amazon VMs, they have to do this by going through Mozilla BuildNetwork over the VPN link to the VPC. We designed this very restricted access to help protect these vital systems. It was reassuring to also see all the security audits that Amazon has done.
We integrated Amazon’s VMs nicely into our existing mix of VMs and physical machines in the Mozilla build network. The easiest way to see if your specific build was handled by an Amazon VM (called an “EC2 instance” in Amazon-speak), is to look at the machine name on tbpl.mozilla.org.
The only other way that you can tell we are using AWS for some of the builds is that the additional compute capacity is helping reduce wait times for our builds!
As you can see from looking at our monthly load posts, load on our RelEng infrastructure varies over different times of the day, and over different days of the week. To handle this efficiently, we now dynamically add and remove Amazon VMs from production at any given time to match the demand at that time. We do this as follows:
- Our automation monitors the queue for pending builds
- If there is a backlog of pending builds in the queue, our automation dynamically starts reviving enough VMs in our Virtual Private Cloud to handle the backlog.
- As each of these VMs come online, they connect to a buildbot master, indicating they are idle and ready to process jobs.
- A buildbot master assigns a pending build job to the newly available idle slave.
- Once the build job is completed, the slave goes back to the master looking for another job.
- If there are no backlog of pending jobs for 60 minutes, then our automation starts suspending the idle Amazon VMs. Suspending VMs like this allows us to quickly bring VMs back into production in a few seconds to handle any new backlog, while also reducing costs during low load times. Also, note that the 60minute threshold worked well for us in staging, but we’ll likely adjust this in the near future as we more experience with real-world load.
As of today, we only let some B2G builds overflow onto AWS like this, and we continue to monitor builds and the dynamic allocation carefully. Assuming this continues to work well, we will soon let the rest of the B2G builds overflow to AWS. Then next will be fennec/android builds, and then linux desktop builds. Our focus in the immediate short term will be to siphon excess load from our Mozilla build machines over to AWS, allowing us to better handle the increased number of B2G and Fennec builds being enabled in production recently. This also allows us to reimage some/all of our physical linux builders as physical win64 builders to immediately help with our win64 builder wait times. Eventually, we may start running win64 builds, and maybe even some unittests, on AWS but that need further investigation – stay tuned!
Its hard to overstate how important this is for us.
The increase in build types for B2G and fennec and desktop, combined with the increase in number of checkins per day has RelEng systems continually under heavy load. We first tried using AWS in 2008, but the Amazon VMs that we were using kept being restarted, usually before the build completed; the build would automatically restart everything once revived a few seconds later, but it still blocked us from actually being able to use these in production. Some renewed experiments in summer2011, and discussions with others companies who were doing similar investigations looked promising, so we started work on this in full force in Feb2012.
We hope you like this, and of course, if you see any problems, let us know asap, or file a bug!
[UPDATED: fixed typos, joduinn 17jul2012]