Soigan - a Multicast XML monitoring system - design
Design - Questions
So what should Soigan do? Or more importantly, what should it be
designed to expand to do? This software, in the end, should answer
questions for a system administrator, possibly without the
administrator knowing that they want to ask it. What kinds of
questions? Two come to mind: yes-or-no questions (Is this host up?
Is this server running?) and more wordy ones (What processes are
running on this machine? Who's logged in here? What packages are
installed over there?)
Some possible yes/no questions we might want answered:
- Is the host up?
- Is the web/ftp/ssh/dns/jabber/news/smtp/imap/pop server running?
And some more descriptive results:
- Who is logged in?
- What processes are running?
- What packages are installed?
- Who last logged in?
- How full are the local partitions?
- Which udp/tcp ports are open?
The first group, the yes/no questions, is a subset of the second group,
as its result is just a "yes" or "no" instead of a number or a name
or a list. The second group would possibly be a list or a multiline
result, possibly output from a command, formatted nicely for processing.
And then you might have a different form of the above:
- Where is this user logged in?
- When did this user last log in?
- Where is this package installed?
- What machines are running web/ftp/ssh/dns/jabber/news/smtp/imap/pop servers?
The third group is a little different. It could be done in two ways:
either rephrase the question to every machine (change "where is this
user logged in?" to "is this user logged in here?") and go through the
results from every machine to find the answer; or "call out" to all of
the machines and get them to answer only if they have anything to report.
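To make the two approaches concrete, here is a minimal sketch that uses an in-memory table in place of real machines. The host names, the HOSTS table and both function names are made up for illustration; nothing here is Soigan code. Each function returns the hosts where the user was found, plus the number of answers that had to come back.

```python
# Hypothetical stand-in for the network: host name -> users logged in there.
HOSTS = {
    "alpha": ["wayne", "root"],
    "beta": ["guest"],
    "gamma": ["wayne"],
}


def ask_every_host(hosts, user):
    """Approach 1: ask each host "is this user logged in here?" and sift the answers."""
    # Every host answers, even if the answer is "no".
    answers = {host: (user in users) for host, users in hosts.items()}
    found = sorted(host for host, present in answers.items() if present)
    return found, len(answers)


def call_out(hosts, user):
    """Approach 2: one question to everyone; only hosts with a match reply."""
    replies = [host for host, users in hosts.items() if user in users]
    return sorted(replies), len(replies)
```

Both return the same host list; the difference is how many answers travel back (three versus two in this tiny table, and a much wider gap on a real network where most hosts have nothing to say).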
The showproc program that we use takes the first approach. It takes
the host list, pings every machine to see if it's up, and then
connects to every live machine to get a process list based on a given
user. The results of all of this are then displayed to the user (or
optionally, the found processes can be terminated as they're found).
This takes a while. A long while. This is why a rewrite is needed.
So how can this be sped up? Multicasting.
While an in-depth discussion of multicasting is beyond the scope of
these pages, I'll touch briefly on it. Multicasting allows a single
packet on a network to go to multiple, interested interfaces. Instead
of having to send out hundreds of ping packets to all the hosts,
followed by hundreds more packets to get their process lists, we can
send out a single handful of packets on the network, saying, "please
send me your process list in the next little while". The "next little
while" would be a window of some number of seconds, within which each
host waits a random delay before answering, so all the hosts don't try
to answer the request at once. Even a ten-second window would allow many
hundreds of hosts to spread their responses more or less evenly over time.
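A rough sketch of that idea follows, under several assumptions: the multicast group address, the port and the XML request format are all invented here (this is not a Soigan schema), and the uniform random delay is just the simplest way to spread replies. send_request uses the standard UDP multicast socket options; spread only simulates how replies from many hosts would fall across the window.

```python
import random
import socket
import xml.etree.ElementTree as ET

GROUP = "239.255.42.42"  # assumption: an administratively scoped multicast group
PORT = 4242              # assumption: an arbitrary port


def build_request(question, window_seconds):
    """Build a request in a made-up XML format; returns bytes."""
    req = ET.Element("request", question=question, window=str(window_seconds))
    return ET.tostring(req)


def send_request(payload, group=GROUP, port=PORT):
    """Send one UDP datagram to the multicast group (TTL 1: local network only)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    try:
        sock.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, 1)
        sock.sendto(payload, (group, port))
    finally:
        sock.close()


def response_delay(window_seconds, rng=random):
    """Each host picks its own random wait inside the requested window."""
    return rng.uniform(0, window_seconds)


def spread(host_count, window_seconds, seed=0):
    """Simulate replies: count how many hosts answer in each one-second slot."""
    rng = random.Random(seed)
    buckets = [0] * window_seconds
    for _ in range(host_count):
        delay = response_delay(window_seconds, rng)
        # uniform() may return the upper bound itself, so clamp to the last slot.
        buckets[min(int(delay), window_seconds - 1)] += 1
    return buckets
```

Calling spread(500, 10) gives ten per-second counts that add up to 500, which is the "hopefully-even" distribution in miniature; send_request(build_request("processes", 10)) would be the single outgoing packet.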
At first, this seems like you're going to flood the network with
traffic every time you make a request, even if you distribute it
over a period of time. And that's true. But this would happen
regardless of how you implemented it -- in the end, you have so
much data to retrieve from these hosts, and so many packets required
to do it. At the moment, showproc does it in the span of minutes,
so yes, the flood is more of a trickle. But given the speed of
networks these days and the impatience of users, I think this is
reasonable. And if you feel it's too noisy? Just increase the
response period to the few minutes that showproc already took up.
Additionally, multicasting can reduce the network traffic
when the hosts themselves can decide whether or not they have
anything important to report. If you ask "tell me about the
processes that user X has running", then only those hosts that
have any would respond. This is far better than having to ask
every host individually, because it cuts down on the individual
requests to each host, as well as the number of responses needed.
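Here is a small sketch of that host-side decision, again with made-up data: FLEET is a hypothetical table of per-host process lists, and handle_request is an assumed name for whatever would run on each machine when a request arrives. A host with nothing to report returns None, meaning it sends nothing at all.

```python
# Hypothetical per-host process tables: host -> list of (user, command).
FLEET = {
    "alpha": [("wayne", "vim"), ("root", "sshd")],
    "beta": [("root", "sshd")],
    "gamma": [("wayne", "mutt"), ("wayne", "bash")],
}


def handle_request(processes, wanted_user):
    """Runs on one host: return the matching commands, or None to stay silent."""
    matches = [command for user, command in processes if user == wanted_user]
    return matches or None


def collect_replies(fleet, wanted_user):
    """Simulate the requester: gather replies only from hosts that chose to answer."""
    replies = {}
    for host, processes in fleet.items():
        reply = handle_request(processes, wanted_user)
        if reply is not None:
            replies[host] = reply
    return replies
```

Asking about "wayne" yields replies from alpha and gamma only; beta never puts a packet on the wire.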
But do we need multicasting for everything? Would it hurt? No, but
perhaps we should support the "older" style of communication by also
allowing the system to directly connect to a host and ask (one or
more) questions directly, one to one. This would be useful if the
administrator is interested in the goings-on on a specific machine, and
wants to know its users, processes, uptime, disk usage, memory usage
and network traffic all at once, specifically for that machine alone.
There's also another way we might want to get information --
unsolicited. It, too, would use unicasting, but this would not be the
result of a request, but rather a periodic piece of information ("I'm
still here!") or a response to an event ("My /tmp partition is full!").
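As an illustration of those two kinds of unsolicited message, the sketch below builds a heartbeat and a "partition full" event. The XML element and attribute names, the 95% threshold and the function names are all assumptions made for this example. Only shutil.disk_usage is a real standard-library call, used to read the actual usage of a mount point.

```python
import shutil
import xml.etree.ElementTree as ET


def heartbeat(host):
    """Periodic "I'm still here!" message in a made-up XML format."""
    return ET.tostring(ET.Element("heartbeat", host=host))


def partition_event(host, mount, used, total, threshold=0.95):
    """Return an event message if the partition is at or over the threshold, else None."""
    if total <= 0 or used / total < threshold:
        return None
    event = ET.Element("event", host=host, kind="partition-full", mount=mount)
    event.set("used_percent", str(round(100 * used / total)))
    return ET.tostring(event)


def check_partition(host, mount):
    """Read real disk usage for a mount point and build an event if warranted."""
    usage = shutil.disk_usage(mount)
    return partition_event(host, mount, usage.used, usage.total)
```

A host would send heartbeat() on a timer and call check_partition() whenever it polls its disks, sending the result only when it is not None.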
And what about getting information from a host without asking? This
is pretty much what Nmap does (well, it does go probing, but it doesn't
come right out and ask the host "hey, what ports do you have open?").
Is there a need to support that, or should we just make the system a
series of services that run on each machine, listening for requests?
Compatibility and Formatting
Since there are likely tools out there that gather every bit of
information that an administrator could possibly want, should we try
to design the system so that it can use those tools? For instance, I found 51
different check_* programs at the Nagios plugins project.
Why not use those?
And how should the results be returned? In web form? Tabular?
Comma-delimited? XML? RPC? XML-RPC? Are there existing forms
that should be adhered to for compatibility with other tools?
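To get a feel for the trade-off, here is the same small result rendered two ways, comma-delimited and XML. The RESULT rows, the field names and the XML layout are invented for this comparison; neither is a proposed Soigan format.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical answer to "where is this user logged in?"
RESULT = [
    {"host": "alpha", "user": "wayne", "tty": "pts/0"},
    {"host": "gamma", "user": "wayne", "tty": "pts/3"},
]


def to_csv(rows):
    """Comma-delimited: a header line, then one line per row."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=list(rows[0]))
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()


def to_xml(rows):
    """XML: one <result> element per row, fields stored as attributes."""
    root = ET.Element("results")
    for row in rows:
        ET.SubElement(root, "result", row)
    return ET.tostring(root, encoding="unicode")
```

The comma-delimited form is trivial for shell tools to chew on, while the XML form carries its own field names on every row and nests more gracefully if a result ever needs to hold a list.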
How should results be stored? From what I can see, Nagios doesn't
save its data, though it does log the results of each of its checks.
It also keeps track of pending work, as mentioned before. Does it
make sense to keep track of this information -- for instance, do we
care what processes were running on a host last week, or how much
swap space that machine was using at noon a month ago? Possibly in
very rare instances.
How should the system be configured? Using existing configuration
files (such as Nagios's)? An XML format? Windows .ini format?
X resource format? And through what interface? A text editor?
A web interface? A console app? A GUI app?
What platforms should this system run on, both the main process
(the "fetching" part), as well as the clients? What language
should it be written in? And again, what protocol should be
used to pass information back and forth (both from the hosts
to the system, and from the system to the user)?
©2002-2018 Wayne Pearson