Monitoring systems is vital to proactive troubleshooting, that is, to fixing a problem before a customer detects it. Setting up monitoring is usually manual work, but with a bit of infrastructure it can be done automatically when packages are installed!
Modern Linux systems all use packages of some sort. Common server packages include scripts with a standardised start/stop interface, a method to indicate whether these scripts should be started at system boot, and PID files to check whether a daemon is running.
Linux, in turn, provides us with an incredible amount of information based on PIDs, including whether a process is running, its CPU usage and its network behaviour. Wow, that is a lot of information to monitor.
The one link that is missing is a way to relay this information over a monitoring protocol. There are many of those, but only one is a true Internet Standard, namely SNMP. It is a protocol that identifies variables in a strictly defined framework where everything has a unique ((sub-)sub-)number, through which it can be queried. Numbers exist that lend themselves to the purpose of this exercise.
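To make the numbering scheme concrete, here is a minimal Python sketch of how such dotted numbers (OIDs) can be handled. The helper names are hypothetical, and the numbers for the application table are our reading of RFC 2788; please verify them against the RFC before relying on them.

```python
# Assumption: these numbers follow RFC 2788 (NETWORK-SERVICES-MIB) as we
# read it: mib-2 (1.3.6.1.2.1) . application (27) . applTable (1) . applEntry (1)
APPL_ENTRY = (1, 3, 6, 1, 2, 1, 27, 1, 1)

# Assumption: column numbers within applEntry, to be checked against the RFC.
APPL_COLUMNS = {"applName": 2, "applOperStatus": 6}


def parse_oid(text):
    """Turn a dotted string such as '.1.3.6.1' into a tuple of integers."""
    return tuple(int(part) for part in text.strip(".").split("."))


def format_oid(oid):
    """Turn a tuple of integers back into dotted notation."""
    return ".".join(str(number) for number in oid)


def column_instance(column, index):
    """Build the OID of one table cell: entry prefix, column, row index."""
    return APPL_ENTRY + (APPL_COLUMNS[column], index)


def is_under(oid, prefix):
    """True if oid lies in the subtree rooted at prefix."""
    return oid[:len(prefix)] == prefix


if __name__ == "__main__":
    print(format_oid(column_instance("applOperStatus", 3)))
```

Each row in such a table would stand for one daemon, so querying a single cell comes down to asking for one fully numbered variable.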
In this assignment, we ask you to support two systems to prove the ideas, focussing on Debian and Red Hat for now. You will therefore do some programming.
We ask you to intervene in system control in a generic manner. We have selected the upcoming systemd service as our focus. This is a modern replacement for the init daemon in process #1, which has an overview of the services that should be running on a system, and of whether they actually are running. Find a way to integrate monitoring into this system; ideally this would be part of the systemd daemon, but to demonstrate the power of this approach the scope of this exercise is limited to polling and querying the systemd daemon.
As part of the assignment, you will explain how your approach can be packaged as an add-on or drop-in replacement so that a distribution could benefit from your work. You will also keep a keen eye on any constraints on software that is being controlled by systemd; if any assumptions or requirements crop up, you will document those as part of your packaging description.
From systemd, you should be able to retrieve an up-to-date list of the daemons that ought to run. The list would also mention for each daemon whether it should run at system boot, and this information is also kept up-to-date. Finally, systemd will tell you whether a daemon is running at that time, so you have all the information to know whether your system is running well -- that is, whether everything that ought to run is indeed running, and whether this will also be true after the next reboot.
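As a starting point for the polling side, the following Python sketch collects both facts per service by parsing the output of the systemctl command. It is only one possible approach: the function names are hypothetical, and the column layout assumed for `systemctl list-unit-files` and `systemctl list-units` should be checked on both target distributions.

```python
import subprocess


def _fields(line):
    # Some systemctl versions prefix failed units with a bullet; drop it.
    return [field for field in line.split() if field != "●"]


def parse_unit_files(text):
    """Map unit name to enablement state (assumed columns: UNIT FILE, STATE)."""
    states = {}
    for line in text.splitlines():
        fields = _fields(line)
        if len(fields) >= 2:
            states[fields[0]] = fields[1]
    return states


def parse_units(text):
    """Map unit name to active state (assumed columns: UNIT LOAD ACTIVE SUB ...)."""
    states = {}
    for line in text.splitlines():
        fields = _fields(line)
        if len(fields) >= 4:
            states[fields[0]] = fields[2]
    return states


def service_table(unit_files_text, units_text):
    """Combine both listings into {unit: {'boot': bool, 'running': bool}}."""
    enabled = parse_unit_files(unit_files_text)
    active = parse_units(units_text)
    table = {}
    for name in sorted(set(enabled) | set(active)):
        table[name] = {
            "boot": enabled.get(name) == "enabled",
            "running": active.get(name) == "active",
        }
    return table


def systemctl(*args):
    """Run systemctl with the given arguments and return its standard output."""
    result = subprocess.run(["systemctl", *args],
                            capture_output=True, text=True, check=True)
    return result.stdout


def snapshot():
    """Poll systemd once; needs a host that actually runs systemd."""
    return service_table(
        systemctl("list-unit-files", "--type=service", "--no-legend"),
        systemctl("list-units", "--type=service", "--all", "--plain", "--no-legend"),
    )
```

Calling `snapshot()` periodically would give the table that the monitoring side needs. Talking to systemd over its D-Bus interface may be a cleaner alternative to parsing text, and is worth evaluating as part of the assignment.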
Now, make the information from systemd available through an AgentX sub-agent (RFC 2741). This avoids most of the SNMP protocol details, which can be handled in a "master agent" such as the one in the Net-SNMP package. Your sub-agent implements network services monitoring (RFC 2788) and will respond to inquiries to look up the current status information of a daemon. In addition, it will detect changes in service status, and report these pro-actively through a so-called "SNMP trap".
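The change detection in the sub-agent could look roughly like the Python sketch below. It compares two polls of systemd's active states and translates each state into a status number. The status numbers are our reading of applOperStatus in RFC 2788, the mapping from systemd states is merely one possible choice, and all names are hypothetical. Sending the actual trap through an AgentX library is deliberately left out.

```python
# Assumption: applOperStatus values as we read them in RFC 2788.
APPL_OPER_STATUS = {
    "up": 1, "down": 2, "halted": 3,
    "congested": 4, "restarting": 5, "quiescing": 6,
}

# Assumption: one possible mapping from systemd ActiveState values.
ACTIVE_STATE_TO_OPER = {
    "active": "up",
    "inactive": "down",
    "failed": "down",
    "activating": "restarting",
    "reloading": "restarting",
    "deactivating": "quiescing",
}


def oper_status(active_state):
    """Translate a systemd active state into a status number.

    Unknown states are reported as 'down' so that they draw attention.
    """
    return APPL_OPER_STATUS[ACTIVE_STATE_TO_OPER.get(active_state, "down")]


def detect_changes(previous, current):
    """Compare two polls, each mapping unit name to active state.

    Returns a sorted list of (unit, old_state, new_state); a state of None
    means the unit was absent in that poll.
    """
    changes = []
    for name in sorted(set(previous) | set(current)):
        old, new = previous.get(name), current.get(name)
        if old != new:
            changes.append((name, old, new))
    return changes


if __name__ == "__main__":
    before = {"cron.service": "active", "ssh.service": "active"}
    after = {"cron.service": "active", "ssh.service": "failed"}
    for unit, old, new in detect_changes(before, after):
        # This is where the sub-agent would emit an SNMP trap.
        print(unit, old, "->", new, "status", oper_status(new))
```

Each tuple returned by `detect_changes` is a candidate for one trap; how often to poll, and whether to suppress short-lived transitions, are design choices to document.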
The rest is easy. Any monitoring application (we might use Zabbix or Nagios in this assignment) can monitor SNMP. Moreover, it can be set up with templates that iterate over your tables. This yields the system's own idea of what should run, and whether it is indeed running. The exceptions (stopped even though it should start at boot, and started while it is not in the autoboot sequence) can be put to good use to display monitoring information. Finally, traffic dumps can be exploited for drawing diagrams per service.
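The two exceptions mentioned above can be expressed as a small classification over the "should start at boot" and "is running" flags. The Python sketch below illustrates the idea; the labels and function names are hypothetical, and a real setup would express the same rule in the monitoring application's own template language.

```python
def classify(boot, running):
    """Classify one service from its two flags."""
    if boot and running:
        return "ok"
    if boot:
        return "stopped-but-enabled"      # should start at boot, yet not running
    if running:
        return "running-but-not-enabled"  # running now, but gone after a reboot
    return "idle"                         # neither enabled nor running


def find_exceptions(table):
    """Return {unit: label} for every service that deserves attention.

    table maps a unit name to a (boot, running) pair of booleans.
    """
    found = {}
    for name, (boot, running) in sorted(table.items()):
        label = classify(boot, running)
        if label in ("stopped-but-enabled", "running-but-not-enabled"):
            found[name] = label
    return found


if __name__ == "__main__":
    sample = {
        "cron.service": (True, True),
        "ssh.service": (True, False),
        "cups.service": (False, True),
        "nfs.service": (False, False),
    }
    print(find_exceptions(sample))
```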
We are curious whether the approach that we envisage is practical. We target both Debian and Red Hat because they are sufficiently different to make your answer broadly applicable. We believe that standardisation and automation of monitoring can greatly aid its usefulness.
Please investigate the experiences and reports of administrators, and evaluate the new system based on those. Do not forget to keep your own eyes open either -- what problems are you encountering, and can they be remedied or are they show-stoppers? And is there a need (and a possibility) to monitor more aspects without getting into application specifics?