On Saturday, I sent a mail to the ‘gnome-sysadmin’ list for review before sending it to the larger GNOME community. I pressed ‘send’ and expected our more-than-capable servers to take care of the rest. I got on with some other things.
On Sunday, I looked to see if there were any comments or criticisms of that mail, only to find it didn’t appear to have reached the list. In fact, I didn’t appear to have received any GNOME mail for the last couple of days. I headed for the mail logs on menubar.
A very confusing hour or so later and I still don’t know what’s going on. All I do know is that I need to get this mail sent out now, as I haven’t got all day to investigate mail problems. Oh well.
Somewhat vexed, I couldn’t get on with my other jobs, so I returned to the mystery of why mailman seems to seize up at random. At one point, I discovered that the master ‘named’ for gnome.org had stopped running - oops! I restarted that, restarted mailman and things seemed to work again - for a while. I noticed that the mailman ‘smtp-failure’ log was showing a lot of connection timeouts to the localhost postfix server. There were also tons of (compromised/infected?) machines out there spamming random ‘@gnome.org’ addresses and adding to the load, so ‘telnet localhost smtp’ was taking a very long time. There is also a namazu mail archive re-indexing process going on, but that’ll mostly be blocking on NFS I/O, and not using more than one of the four CPUs. I stopped postfix from accepting external mail for an hour or so, to give it a chance to process the mailman backlog. It seems to be running now, although mailman’s ‘smtp’ log is claiming that each mail is taking on the order of 180 seconds to be accepted - not good. That’ll need more looking into.
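The ‘telnet localhost smtp’ check can be made repeatable with a few lines of Python. This is only a rough sketch: the function name and the 30-second default timeout are my own assumptions, and all it measures is how long the server takes to accept a connection and send its greeting line, not how long a full delivery takes.

```python
import socket
import time


def time_smtp_banner(host="localhost", port=25, timeout=30.0):
    """Return (seconds_elapsed, banner) for one SMTP connection attempt.

    Hypothetical helper: it connects, waits for the server's greeting
    (normally a line starting with "220") and reports how long that took.
    Raises OSError (including timeouts) if the server cannot be reached.
    """
    start = time.monotonic()
    # The timeout applies to the connect and to the recv that follows.
    with socket.create_connection((host, port), timeout=timeout) as sock:
        banner = sock.recv(1024).decode("ascii", errors="replace").strip()
    return time.monotonic() - start, banner
```

Calling `time_smtp_banner()` every minute or so and logging the elapsed time would show whether the delays line up with the spam bursts or with the re-indexing run.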
In other news, signal.gnome.org now seems to be functional as a Nagios monitor (for use by the GNOME sysadmin team only at the moment). It is monitoring all six hosts and most of the known services, although the DNS plugin is broken (which is a shame, given ‘named’ likes to stop working) and the mail plugin is timing out after 10 seconds (understandable given the mail problems above). It should send alerts when these services come back up, or when any other hosts or services go down, so we shouldn’t have to rely on word-of-mouth alone to know about service failures now :) However, it is only very basically configured at the moment (and failure notices are going to ‘@gnome.org’ addresses, which is no good if menubar goes down!). Hopefully, given some time to mix a bit of SNMP in, we should have a fairly decent service monitoring tool.
The Subversion migration is going to be cool. I’m looking forward to being able to set up a simple global post-commit hook that looks up e-mail addresses in a ‘per-repository’ look-up table, so we can send commit notifications to the module maintainers/lists and ‘.po’ file commit notifications per-language to the language maintainers/lists, where requested. However, I’m not going to go too far into it until the migration has been completed and any aftershocks have settled.
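To give an idea of the look-up step, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the table layout, the `recipients_for` name, the placeholder addresses and the idea that translations live in a ‘po/’ directory named by language code. Subversion calls a post-commit hook with the repository path and the revision number, and the list of changed paths could come from ‘svnlook changed’; only the address look-up itself is shown.

```python
import os
import re

# Hypothetical look-up table, keyed by repository name.
# The addresses are placeholders, not real lists.
NOTIFY = {
    "example-module": {
        "default": ["maintainers@example.org"],
        "po": {"de": ["german-team@example.org"]},
    },
}

# Matches paths such as "trunk/po/de.po" and captures the language code.
PO_FILE = re.compile(r"(?:^|/)po/([^/]+)\.po$")


def recipients_for(repo_path, changed_paths, table=NOTIFY):
    """Return the sorted addresses to notify for one commit."""
    entry = table.get(os.path.basename(repo_path.rstrip("/")), {})
    recipients = set(entry.get("default", []))
    for path in changed_paths:
        match = PO_FILE.search(path)
        if match:
            # Add the language team only if one asked for notifications.
            recipients.update(entry.get("po", {}).get(match.group(1), []))
    return sorted(recipients)
```

Keeping the table as plain data means a new module or language team can be added without touching the hook logic, and repositories that are not in the table simply get no mail.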