Depending on a company's or institution's setup, the majority of administration work is, or should be, setting up something new, changing existing systems, or removing the obsolete. In reality, though, a good portion of a system administrator's job tends to be maintenance: trying to keep things going just as they are. Doing so requires the ability to detect when things stray from the norm.
Many tools exist to do this. They range from network monitors that ensure one part of the network can communicate with another, to host monitors that periodically check that vital machines are available. Service monitors verify that important services, such as web servers and email servers, are available at all times. Still others probe a network to look for suspicious activity, such as servers running on ports where none are expected.
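To make the idea of a service monitor concrete, here is a minimal sketch in Python. It is not taken from any of the tools discussed here; the function names `service_is_up` and `check_services` are made up for illustration, and the only assumption is that "available" means "accepts a TCP connection". A real monitor would also speak the protocol (HTTP, SMTP, and so on) rather than just connect.

```python
import socket


def service_is_up(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the name and connects; it raises
        # OSError (or a subclass) on refusal, timeout, or lookup failure.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_services(targets, timeout=3.0):
    """Map each (host, port) pair in targets to True (up) or False (down)."""
    return {
        (host, port): service_is_up(host, port, timeout)
        for host, port in targets
    }
```

Calling `check_services([("www.example.org", 80), ("mail.example.org", 25)])` (hypothetical hosts) would return a dictionary that a periodic job could compare against the previous run to decide whether anyone needs to be notified.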
Does one tool exist that does everything? Perhaps. Is it a commercial package? Most likely. Are there tools that do MOST of these things? Certainly. Are they free? They can be, yes. So why write another?
Possibly the most popular tool that exists right now is Nagios, written by Ethan Galstad. It allows you to monitor the services on your network and to set up a flexible system of notifications for when things fail. So why not use that?
When I first tried Nagios, I thought it was terrific. It took me all of half a day to write some XSLT stylesheets to convert the data in Amphioxus to the configuration files needed by Nagios. In no time I had Nagios looking at over 1100 hosts. So what went wrong?
Apart from headaches with permissions relating to the web server and CGI scripts, which I am willing to take some blame for I suppose, I found that the Nagios system was, well, a bit unresponsive. Or maybe flaky? I found that it didn't cope well with new datasets being used (it "cached" its work between shutdown and startup in a file, which doesn't necessarily jibe with the new configuration). Also, the web interface really seemed to have issues. I didn't look into the cause of the problem, but half the time I was told that there was no useful data to report, and other times I was told that I couldn't access this report or that page. CGI problem? Permission problem? Coding problem? I'm not sure.
So Nagios looked neat, but it seemed a bit ... tentative. Perhaps I didn't give it enough of a go to get it working. The documentation does say that you can't just jump right in, which I apparently did. And there are other tools out there, such as NXE, that offer different ways of accessing Nagios's information. Perhaps I should have tried one of those to see whether the inconsistent data I was seeing came only from the web interface while the backend was working properly.
I had recently taken on two different projects: a "showproc" replacement and an "nmap" reporter.
Showproc is a tool used in our department to check all of the machines for processes being run by a given user. The main use for such a utility is to find where someone is logged in when you're about to suspend their account and wish to terminate all of their login sessions and programs. I suppose there might be other uses, too.
Nmap is a useful utility for probing machines on a network. It seeks out the ports those machines have open, and thereby finds servers and services that the machines should probably not be running, such as pirate FTP servers or mailhosts.
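The reporting half of that idea, comparing what a probe finds against what a machine is supposed to be running, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it uses a plain TCP connect probe from the standard library instead of Nmap itself, and the names `open_ports` and `unexpected_ports` are hypothetical.

```python
import socket


def open_ports(host, ports, timeout=0.5):
    """Return the sorted list of ports on host that accept a TCP connection."""
    found = []
    for port in ports:
        try:
            with socket.create_connection((host, port), timeout=timeout):
                found.append(port)
        except OSError:
            # Refused, timed out, or unreachable: treat the port as closed.
            pass
    return sorted(found)


def unexpected_ports(found, expected):
    """Return the ports that were found open but are not on the expected list."""
    return sorted(set(found) - set(expected))
```

For example, `unexpected_ports([21, 22, 80], expected=[22, 80])` returns `[21]`, which is the kind of result a reporter would flag as a possible rogue FTP server.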
At first, these were two separate projects, but as I thought about their implementations, I realized that a general host monitor would encompass both tasks. Seeing Nagios, and seeing how close it came to what I wanted, confirmed it.