<--introduction ^--Soigan--^ design, continued-->

Soigan - a Multicast XML monitoring system - design

Design - Questions

So what should Soigan do? Or more importantly, what should it be designed to expand to do? This software, in the end, should answer questions for a system administrator, possibly without the administrator knowing that they want to ask it. What kinds of questions? Two come to mind: yes-or-no questions (Is this host up? Is this server running?) and more wordy ones (What processes are running on this machine? Who's logged in here? What packages are installed over there?)

Some possible yes/no questions we might want answered:

And some more descriptive results:

And then you might have a different form of the above:

The first group, the yes/no questions, is a subset of the second group, since its result is just a "yes" or "no" instead of a number or a name or a list. A result in the second group might be a list or a multiline result, possibly the output of a command, formatted nicely for processing.
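One way to picture that subset relationship is a single result type that always carries a list of lines, where a yes/no answer is simply the one-line case. This is only a sketch: the Result class and its field names are invented here for illustration and are not part of any settled Soigan design.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Result:
    """Hypothetical container for one host's answer to one question."""
    host: str
    question: str
    lines: List[str] = field(default_factory=list)

    def is_yes_no(self) -> bool:
        # A yes/no answer is a descriptive result with exactly one line
        # that happens to read "yes" or "no".
        return len(self.lines) == 1 and self.lines[0] in ("yes", "no")

    def as_bool(self) -> bool:
        if not self.is_yes_no():
            raise ValueError("not a yes/no result")
        return self.lines[0] == "yes"


# Made-up sample answers for the two kinds of question.
up = Result("hostA", "Is this host up?", ["yes"])
who = Result("hostA", "Who's logged in here?", ["alice", "bob"])
```

With this shape, anything that can process a multiline result can also process a yes/no one, which is the point of treating the first group as a subset.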

Communication

The third group is a little different. It could be answered in two ways: either rephrase the question for every machine (change "where is this user logged in?" to "is this user logged in here?") and go through the results from every machine to find the answer, or "call out" to all of the machines and have them answer only if they have relevant information.

The showproc program that we use takes the first approach. It takes the host list, pings every machine to see if it's up, and then connects to every live machine to get a process list for a given user. The results are then displayed to the user (or, optionally, the found processes can be terminated as they're found). This takes a while. A long while. This is why a rewrite is needed. So how can this be sped up? Multicasting.
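To make the cost of the first approach concrete, here is a rough sketch of that host-by-host loop. It is not showproc's actual code: the function name, the two callables standing in for "ping" and "fetch process list", and the sample data are all assumptions made for illustration.

```python
from typing import Callable, Dict, Iterable, List


def sequential_sweep(
    hosts: Iterable[str],
    is_up: Callable[[str], bool],
    fetch_processes: Callable[[str, str], List[str]],
    user: str,
) -> Dict[str, List[str]]:
    """Visit hosts one at a time: ping first, then ask for the user's processes."""
    found: Dict[str, List[str]] = {}
    for host in hosts:
        if not is_up(host):                  # one round trip per host just to ping
            continue
        procs = fetch_processes(host, user)  # a second round trip per live host
        if procs:
            found[host] = procs
    return found


# Stand-in data instead of real network calls; None marks a host that is down.
fake_net = {"hostA": {"alice": ["vi", "bash"]}, "hostB": {}, "hostC": None}
result = sequential_sweep(
    fake_net,
    is_up=lambda h: fake_net[h] is not None,
    fetch_processes=lambda h, u: fake_net[h].get(u, []),
    user="alice",
)
```

Every host costs at least one round trip, and every live host costs two, whether or not it has anything to report. With hundreds of hosts visited in turn, that is where the long wait comes from.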

Multicasting

While an in-depth discussion of multicasting is beyond the scope of these pages, I'll touch briefly on it. Multicasting allows a single packet on a network to go to multiple, interested interfaces. Instead of having to send out hundreds of ping packets to all the hosts, followed by hundreds more packets to get their process lists, we can send out a single handful of packets on the network, saying, "please send me your process list in the next little while". The "next little while" would be a window of some number of seconds; each host would wait a random amount of time within that window before answering, so that all the hosts don't try to answer the request at once. Even a ten-second window would allow many hundreds of hosts to spread their responses, hopefully evenly, over that span of time.
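The request-plus-random-wait idea can be sketched as follows. The group address, port, and the "window question" message layout are assumptions picked for the example (239.x.x.x is the administratively scoped multicast range); nothing here is a defined Soigan protocol. The send function is shown but not called, and the rest simulates 500 hosts choosing their delays.

```python
import random
import socket
from collections import Counter

GROUP = "239.1.2.3"   # assumed multicast group, for illustration only
PORT = 5007           # assumed port
WINDOW = 10           # seconds over which hosts may spread their replies


def send_request(question: str, window: int = WINDOW) -> None:
    """Send one UDP datagram to the whole group instead of one per host."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    try:
        # A TTL of 1 keeps the packet on the local network segment.
        sock.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, 1)
        sock.sendto(f"{window} {question}".encode(), (GROUP, PORT))
    finally:
        sock.close()


def reply_delay(window: int, rng: random.Random) -> float:
    """Each host picks its own random wait inside the window before replying."""
    return rng.uniform(0, window)


# Simulate 500 hosts choosing delays, then count replies per one-second slot.
rng = random.Random(0)
delays = [reply_delay(WINDOW, rng) for _ in range(500)]
per_second = Counter(int(d) for d in delays)
```

Inspecting per_second shows how many replies land in each second of the window; widening WINDOW thins out each slot at the cost of a longer wait for the full answer.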

At first, this seems like you're going to flood the network with traffic every time you make a request, even if you distribute it over a period of time. And that's true. But this would happen regardless of how you implemented it -- in the end, you have so much data to retrieve from these hosts, and so many packets required to do it. At the moment, showproc does it in the span of minutes, so yes, the flood is more of a trickle. But given the speed of networks these days and the impatience of users, I think this is reasonable. And if you feel it's too noisy? Just increase the response period to the few minutes that showproc already took up.

Additionally, multicasting can reduce the network traffic when the hosts themselves can decide whether or not they have anything important to report. If you ask "tell me about the processes that user X has running", then only those hosts that have any would respond. This is far better than having to ask every host individually, because it cuts down on the individual requests to each host, as well as the number of responses needed.
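The host-side decision described above can be sketched like this: every host hears the same request, but only those with something to say produce a reply. The function name and the sample process tables are hypothetical.

```python
from typing import Dict, List, Optional


def answer_if_relevant(user: str, local_procs: Dict[str, List[str]]) -> Optional[List[str]]:
    """Return this host's reply, or None to stay silent."""
    procs = local_procs.get(user, [])
    return procs if procs else None


# Three made-up hosts hear the same request about user "alice".
hosts = {
    "hostA": {"alice": ["vi"], "bob": ["bash"]},
    "hostB": {"bob": ["top"]},
    "hostC": {"alice": ["make", "cc"]},
}
replies = {}
for name, procs in hosts.items():
    answer = answer_if_relevant("alice", procs)
    if answer is not None:
        replies[name] = answer
```

Here one request goes out and only two replies come back; hostB never sends anything, which is the saving over asking each host individually.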

Unicasting

But do we need multicasting for everything? Would it hurt? No, but perhaps we should support the "older" style of communication by also allowing the system to connect to a host directly and ask one or more questions, one to one. This would be useful if the administrator is interested in the goings-on on a specific machine, and wants to know its users, processes, uptime, disk usage, memory usage and network traffic all at once, for that machine alone.

There's also another way we might want to get information -- unsolicited. It, too, would use unicasting, but rather than being the result of a request, it would be a periodic piece of information ("I'm still here!") or a response to an event ("My /tmp partition is full!").
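The two kinds of unsolicited message might look something like the sketch below. The message wording, the function names, and the 95% threshold are all assumptions for illustration; how the text would actually be sent to the collecting system is left out.

```python
from typing import Optional


def heartbeat(host: str) -> str:
    """Periodic 'I'm still here!' message."""
    return f"{host} alive"


def disk_event(host: str, mount: str, used_bytes: int, total_bytes: int,
               threshold: float = 0.95) -> Optional[str]:
    """Event message when a partition crosses the (assumed) fullness threshold."""
    if total_bytes <= 0:
        return None
    fraction = used_bytes / total_bytes
    if fraction >= threshold:
        return f"{host} event {mount} {fraction:.0%} full"
    return None


# Made-up readings: /tmp is completely full, /home is not.
tmp_msg = disk_event("hostA", "/tmp", 100, 100)
home_msg = disk_event("hostA", "/home", 10, 100)
```

The heartbeat would be sent on a timer, while the event message is only produced when a check trips, so a quiet host costs almost nothing on the network.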

Anonymous probing

And what about getting information from a host without asking? This is pretty much what Nmap does (well, it does go probing, but it doesn't come right out and ask the host "hey, what ports do you have open?"). Is there a need to support that, or should we just make the system a series of services that run on each machine and listen for requests?

Compatibility and Formatting

Since there are likely tools out there that gather every bit of information that an administrator could possibly want, should we try to design the system so that it can use those tools? For instance, I found 51 different check_* programs at the Nagios plugins project. Why not use those?
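Reusing those plugins is plausible because they share a simple convention: the exit code gives the state (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN) and the first line of output gives a human-readable message, optionally followed by "|" and performance data. A minimal wrapper might look like this; the function names are hypothetical and the sample output line is made up rather than captured from a real plugin.

```python
import subprocess
from typing import List, Tuple

# Exit-code convention used by Nagios plugins.
STATES = {0: "OK", 1: "WARNING", 2: "CRITICAL", 3: "UNKNOWN"}


def interpret(returncode: int, stdout: str) -> Tuple[str, str, str]:
    """Split a plugin run into (state, text, perfdata)."""
    state = STATES.get(returncode, "UNKNOWN")
    first_line = stdout.splitlines()[0] if stdout.strip() else ""
    text, _, perfdata = first_line.partition("|")
    return state, text.strip(), perfdata.strip()


def run_plugin(argv: List[str], timeout: float = 10.0) -> Tuple[str, str, str]:
    """Run one check_* program and interpret its result."""
    done = subprocess.run(argv, capture_output=True, text=True, timeout=timeout)
    return interpret(done.returncode, done.stdout)


# Made-up sample of what a disk check might print on a full /tmp.
sample = "DISK CRITICAL - /tmp is full|/tmp=1024MB;900;1000;0;1024\n"
state, text, perfdata = interpret(2, sample)
```

A host-side service could call run_plugin with the path to whichever check_* program answers the question at hand, then pass the three fields back in whatever result format is chosen.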

And how should the results be returned? In web form? Tabular? Comma-delimited? XML? RPC? XML-RPC? Are there existing forms that should be adhered to for compatibility with other tools?
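If XML were the answer, a result could be as small as the sketch below. The element and attribute names ("result", "line", "host", "question") are invented for the example and do not follow any existing schema.

```python
import xml.etree.ElementTree as ET
from typing import List


def result_to_xml(host: str, question: str, lines: List[str]) -> str:
    """Serialize one host's answer as a small XML document."""
    root = ET.Element("result", {"host": host, "question": question})
    for line in lines:
        ET.SubElement(root, "line").text = line
    return ET.tostring(root, encoding="unicode")


# Made-up answer to "who's logged in here?", then parsed back as a receiver would.
xml_text = result_to_xml("hostA", "users", ["alice", "bob"])
parsed = ET.fromstring(xml_text)
users = [element.text for element in parsed.findall("line")]
```

The same host, question, and list of lines could just as easily be written out as a comma-delimited row or a table, so the choice of format can stay separate from how the data is gathered.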

How should results be stored? From what I can see, Nagios doesn't save its data, though it does log the results of each of its checks. It also keeps track of pending work, as mentioned before. Does it make sense to keep track of this information -- for instance, do we care what processes were running on a host last week, or how much swap space that machine was using at noon a month ago? Possibly in very rare instances.

How should the system be configured? Using existing configuration files (such as Nagios's)? An XML format? Windows .ini format? X resource format? And through what interface? A text editor? A web interface? A console app? A GUI app?

Implementation

What platforms should this system run on, both the main process (the "fetching" part), as well as the clients? What language should it be written in? And again, what protocol should be used to pass information back and forth (both from the hosts to the system, and from the system to the user)?
©2002-2017 Wayne Pearson