Alice, Rob Helmer and Nick recently fixed an important, long-standing problem with how Build Infrastructure handed off builds to the Talos Infrastructure. Their work fixed:
- an intermittent, couple-of-times-a-day Talos outage, which has been happening ever since we started using Talos in production.
- intermittent cases where Talos would skip over a build without testing it.
That’s enough reason to make this an important fix, but it’s also important because it makes some future timestamp cleanup work possible. For the curious, here are some background details:
- When builds were produced by each o.s. builder, the build infrastructure copied generated builds into a specific directory.
- When Talos machines wanted to test a build, they copied builds from that same specific build directory in order to start testing. Talos would then plot test data on the graph server using “testrun time” (time stamp of when Talos started running the test), *not* using the start time of when the build was created. This is an important point, and at the root of a bunch of regression triage complexities.
- Because new builds would be copied into the *same* specific directory, they would overwrite the previous build. This meant that, when testing a build, we didn’t know when that build was actually created. All we could tell was what time the “testrun” started for that build. So long as we tested as quickly as we produced new builds, it was close enough.
…but when we ramped up volume of builds and tests to production levels, we discovered:
- New builds being copied into the same specific directory could collide with Talos downloading the previous build, and cause Talos to fail out with an error. The next Talos attempt would work fine, but because each Talos run takes so long to complete, it would appear that Talos was burning for a couple of hours, until the next test run completed successfully. This happened intermittently a few times every day. This is now fixed.
- Builds are generated at different speeds; Linux builds are quicker than win32 builds, for example. This means that the contents of the specific directory are refreshed at different rates. The Linux build in the dir almost always contains code of a different timestamp from the win32 build in the dir. Enabling PGO caused win32 build times to double, which made this discrepancy even worse. This is now improved, but not fully fixed.
- In situations where the builds were generated quickly enough, and tests ran slowly enough, we could see: a 1st build becomes available, Talos starts testing 1st build, a new 2nd build becomes available, a 3rd build becomes available, overwriting 2nd build. When Talos finishes testing 1st build, Talos detects and starts testing the available build (the 3rd build, skipping over the 2nd build completely). This is now fixed. There is another, similar-sounding but unrelated bug about how Buildbot optimizes pending requests by collapsing them all together; see bug#436213.
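The skipped-build case in that last bullet can be sketched in a few lines of Python. This is only a toy model, not the actual Build or Talos code: the function names, event names and build labels are all made up, and it assumes the fix works by giving each build its own (dated) directory so that nothing gets overwritten.

```python
from collections import deque

# Hypothetical timeline: three builds land while Talos is busy with the first.
TIMELINE = [
    ("build", "b1"),
    ("build", "b2"),
    ("build", "b3"),
    ("talos_done", None),
    ("talos_done", None),
]

def run_shared_dir(timeline):
    """Old behaviour: one shared directory, each new build overwrites the last."""
    latest, busy, tested = None, False, []
    for event, build_id in timeline:
        if event == "build":
            latest = build_id          # overwrites whatever was there
        elif event == "talos_done":
            busy = False
        if not busy and latest is not None:
            tested.append(latest)      # Talos picks up whatever is in the dir
            latest, busy = None, True
    return tested

def run_dated_dirs(timeline):
    """Assumed new behaviour: every build is kept until Talos has tested it."""
    pending, busy, tested = deque(), False, []
    for event, build_id in timeline:
        if event == "build":
            pending.append(build_id)   # each build gets its own directory
        elif event == "talos_done":
            busy = False
        if not busy and pending:
            tested.append(pending.popleft())
            busy = True
    return tested

print(run_shared_dir(TIMELINE))   # b2 never gets tested
print(run_dated_dirs(TIMELINE))   # every build gets tested, in order
```

In the shared-directory model the 2nd build is overwritten before Talos ever looks at it, which is exactly the skip described above.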
That’s it for this fix. There’s still plenty more cleanup needed around how time/date is stored in different parts of the infrastructure, but this was an important, big first step.
Next steps will include:
- fixing how Talos handles re-runs/duplicate data.
- having the dated dir be based on the yet-to-be-enhanced BuildID.
- changing Talos and graph server to use “build time”, not “testrun time”. This will greatly simplify a lot of manual regression triage work for people.
- simplifying the underlying code that lines up builds and test results on tinderbox/waterfall pages.
- figuring out when is a good time to flip the switch in Talos and graph server, marking all data before a certain point as “testrun time”, and data after that point as “build time”.
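For that last item, the “flip the switch” idea could look something like the sketch below. It is purely illustrative: the cutover date and the function name are hypothetical, and nothing here reflects how graph server actually stores its data.

```python
from datetime import datetime

# Hypothetical cutover moment; the real date has not been decided.
CUTOVER = datetime(2008, 7, 1)

def timestamp_meaning(point_time, cutover=CUTOVER):
    """Say how a stored data point's timestamp should be interpreted."""
    # Data recorded before the switch was plotted by when the test started;
    # data recorded from the switch onwards would be plotted by build time.
    return "testrun time" if point_time < cutover else "build time"

print(timestamp_meaning(datetime(2008, 6, 30)))
print(timestamp_meaning(datetime(2008, 7, 1)))
```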
Anyone curious for details should read Alice’s recent post to mozilla.dev.apps.firefox and mozilla.dev.performance (“time stamps of talos performance results & finding regressions”), bug#291167, bug#417633 and bug#419487. BuildID changes are being discussed in bug#431270 and bug#431905.
It’s a tricky, complex area of the infrastructure, so hopefully all that makes sense?!!?