In the last two weeks, we’ve hit problems where critical servers were down, the people who relied on them were (rightly) frustrated waiting for them to come back online, and at the same time, the people who should have been repairing them didn’t know there was a problem, or thought someone else was working on it.
Some recent examples are:
* Jonauth’s dev dashboard was blocked from accessing the graph server, after causing the graph server to crash. Some people thought bug#485928 was tracking the issue, but the one sentence in comment#9 was lost in the noise of the rest of the bug. Fixed within hours of filing bug#486662.
* TryServer not displaying builds on the waterfall. This happened at about the same time as the iscsi outage. Interestingly, TryServer was actually processing jobs fine, but developers had no way to see this. Bug#485380 got lost in the weeds. Fixed within hours of filing bug#485869.
* TryServer builds being queued for more than 24 hours. No bug filed originally. Fixed within hours of filing bug#485869.
In each case, once an explicit bug was filed, the problem was fixed within a few hours.
Obviously, automated monitoring of all critical systems would be ideal, and we continue to bring more and more systems under the watchful eye of nagios. However, in the meantime, if you see a critical system having problems, please file a blocker bug describing exactly what you see is broken. Don’t worry about debugging what the root cause is, or whether you can work around it; the important thing is to make sure the people who can fix it know about it. Check whether you can quickly and easily find a bug focused on just that problem, and if not, file one. If we already know about it, we will happily DUP it and make sure it’s being worked on with the right priority. If we *didn’t* already know about it, we’ll make sure the right folks in RelEng or IT jump on it right away!
Please, don’t be shy about filing bugs… and yes, you can quote me on that! 😉