Zero-effort Monitoring Support

Monitoring systems is vital to pro-active troubleshooting, that is, to fixing a problem before a customer detects it. Setting up monitoring is usually manual work, but with a bit of infrastructure it can be done automatically when packages are installed!

Gift-Wrapped Opportunities

Modern Linux systems all use packages of some sort. Common server packages include scripts with a standardised start/stop interface, a method to indicate whether these scripts should be run at system boot, and PID files to check whether a daemon is running.

Linux, in turn, provides us with an incredible amount of information based on PIDs, including whether a process is running, its %cpu and its network behaviour. Wow, lots of information to monitor.
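
As a minimal sketch of the PID-based check described above, the following Python fragment reads a PID file and asks the kernel whether that process exists. The helper names read_pid_file and pid_is_running are hypothetical, and the PID file path in the comment is only an example; packages differ in where they keep PID files.

```python
import os


def read_pid_file(path):
    """Return the PID stored in a PID file, or None if it is missing or malformed."""
    try:
        with open(path) as handle:
            return int(handle.read().strip())
    except (OSError, ValueError):
        return None


def pid_is_running(pid):
    """Report whether a process with this PID currently exists.

    Signal 0 performs the existence and permission checks without
    actually delivering a signal to the process.
    """
    if pid is None or pid <= 0:
        return False
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        # The process exists but belongs to another user.
        return True
    return True


# Example (path is an assumption, not taken from any particular package):
# pid_is_running(read_pid_file("/run/sshd.pid"))
```

Note that a stale PID file can name a PID that has since been reused by an unrelated process, which is one reason to prefer asking the service manager over trusting PID files alone.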

The one missing link is relaying this information over a monitoring protocol. There are many of those, but only one is a true Internet Standard, namely SNMP: a protocol that identifies variables in a strictly defined framework where everything has a unique ((sub-)sub-)number through which it can be queried. Numbers exist that lend themselves to the purpose of this exercise.

Assignment

In this assignment, we ask you to support two systems to prove the ideas, focussing on Debian and Red Hat for now. You will therefore do some programming.

We ask you to intervene in system control in a generic manner. We have selected the upcoming systemd service as our focus. This is a modern replacement of the init daemon in process #1, which has an overview of the services that should be running on a system, and of whether they actually are running. Find a way to integrate monitoring into this system; ideally this would be part of the systemd daemon, but to demonstrate the power of this approach the scope of this exercise is limited to polling and querying the daemon.

As part of the assignment, you will explain how your approach can be packaged as an add-on or drop-in replacement so that a distribution could benefit from your work. You will also keep a keen eye on any constraints on software that is being controlled by systemd; if any assumptions or requirements crop up you will document those as part of your packaging description.

From systemd, you should be able to retrieve an up-to-date list of the daemons that ought to run. For each daemon, the list also mentions whether it should run at system boot, and this information is kept up-to-date as well. Finally, systemd will tell you whether each daemon is running at that time, and so you have all the information needed to know whether your system is running well -- that is, whether everything that ought to run is indeed running, and whether this will also be true after the next reboot.
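
One possible way to poll this information, sketched here in Python, is to parse the output of "systemctl list-unit-files" and "systemctl list-units". The function names are hypothetical, and the column layout (unit name and enablement state; unit name, load state, active state, sub state) is an assumption that should be verified against the systemd version on each distribution. Querying systemd over D-Bus would be a sturdier alternative.

```python
import subprocess


def parse_unit_files(text):
    """Map service name -> enablement state (assumed second column)."""
    states = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[0].endswith(".service"):
            states[fields[0]] = fields[1]
    return states


def parse_units(text):
    """Map service name -> active state (assumed third column)."""
    states = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) >= 4 and fields[0].endswith(".service"):
            states[fields[0]] = fields[2]
    return states


def service_table(unit_files_text, units_text):
    """Combine both listings into one row per service."""
    enabled = parse_unit_files(unit_files_text)
    active = parse_units(units_text)
    table = {}
    for name in sorted(set(enabled) | set(active)):
        table[name] = {
            "boot": enabled.get(name) == "enabled",
            "running": active.get(name) == "active",
        }
    return table


def run_systemctl(*args):
    """Run systemctl and return its standard output as text."""
    result = subprocess.run(
        ["systemctl", *args, "--no-legend", "--no-pager"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def live_service_table():
    """Build the table from the running system (requires systemd)."""
    unit_files = run_systemctl("list-unit-files", "--type=service")
    units = run_systemctl("list-units", "--type=service", "--all", "--plain")
    return service_table(unit_files, units)
```

The parsers take plain text so that they can be exercised with captured output on a machine without systemd; only live_service_table touches the running system.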

Now, make the information available from systemd through an AgentX sub-agent (RFC 2741). This sidesteps most of the SNMP protocol details, which can be handled in a "master agent" such as snmpd from the Net-SNMP package. Your sub-agent implements network services monitoring (RFC 2788) and will respond to queries for the current status information of a daemon. In addition, it will detect changes in service status, and report these pro-actively through a so-called "SNMP trap".
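
The AgentX wire protocol itself is best left to an existing library, so the sketch below only covers the bookkeeping around it: giving each service a stable table index, naming the applOperStatus instance from RFC 2788 for that index, and detecting the status changes that would trigger a trap. All function names are hypothetical, and the mapping from systemd active states to applOperStatus values is an illustrative assumption, not something either specification prescribes.

```python
# applOperStatus column of the applTable in the Network Services
# Monitoring MIB (RFC 2788), under mib-2.application (27).
APPL_OPER_STATUS_OID = (1, 3, 6, 1, 2, 1, 27, 1, 1, 6)

# Assumed mapping from systemd active states to applOperStatus values:
# up(1), down(2), restarting(5), quiescing(6).
OPER_STATUS = {
    "active": 1,
    "inactive": 2,
    "failed": 2,
    "activating": 5,
    "deactivating": 6,
}


def oper_status(active_state):
    """Translate an active state; anything unknown is reported as down(2)."""
    return OPER_STATUS.get(active_state, 2)


def assign_indexes(names, known=None):
    """Give every service an applIndex; known services keep theirs."""
    indexes = dict(known or {})
    next_index = max(indexes.values(), default=0) + 1
    for name in sorted(names):
        if name not in indexes:
            indexes[name] = next_index
            next_index += 1
    return indexes


def oid_for(index):
    """Dotted OID of the applOperStatus instance for one table row."""
    return ".".join(str(part) for part in APPL_OPER_STATUS_OID + (index,))


def status_changes(previous, current):
    """List (name, old_state, new_state) for every service that changed.

    A service missing from one snapshot is reported with None there.
    """
    changes = []
    for name in sorted(set(previous) | set(current)):
        old = previous.get(name)
        new = current.get(name)
        if old != new:
            changes.append((name, old, new))
    return changes
```

A polling loop would take a snapshot of active states at each interval, pass the previous and current snapshots to status_changes, and hand every reported change to whatever AgentX library is chosen for sending the notification.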

The rest is easy. Any monitoring application (we might use Zabbix or Nagios in this assignment) can monitor SNMP. Moreover, it can be set up with templates that iterate over your tables. This yields the system's own idea of what should run, and whether it is indeed running. The exceptions (stopped even though it should boot automatically, and started while it is not in the autoboot sequence) can be put to good use to display monitoring information. Finally, traffic dumps can be exploited for drawing diagrams per service.
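
The two exceptions named above can be made concrete with a small Python sketch. The function names and the label strings are hypothetical; the input is assumed to be a table with one row per service, holding a "boot" flag (should start at boot) and a "running" flag (is running now).

```python
def classify(boot, running):
    """Label one service by comparing intent (boot) with reality (running)."""
    if boot and not running:
        return "stopped-but-enabled"
    if running and not boot:
        return "running-but-not-enabled"
    return "consistent"


def exceptions(table):
    """Return only the services whose state deviates from the boot setting."""
    found = {}
    for name, row in sorted(table.items()):
        label = classify(row["boot"], row["running"])
        if label != "consistent":
            found[name] = label
    return found
```

In a real setup this comparison would more likely live in the monitoring application's templates; it is spelled out here only to show that both exceptions follow from the two flags alone.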

Research Questions

We are curious whether the approach that we envision is practical. We target both Debian and Red Hat because they are sufficiently different to make your answer broadly applicable. We believe that standardisation and automation of monitoring can greatly aid its usefulness.

Please investigate experiences and reports of administrators, and evaluate the new system based on that. Do not forget to keep your own eyes open either -- what problems are you encountering, can they be remedied or are they show-stoppers? And, is there a need (and a possibility) to monitor more aspects without getting into application-specifics?