
Soigan - a Multicast XML monitoring system - planning

Planning

There's no time like the present to start figuring out what our protocol and file formats look like. First, though, we should work out the terminology used later.

Terminology

Host: Any machine or device that we want to monitor
Soigan worker: A daemon that runs on a Host
Worker: A machine running a Soigan worker; also used to refer to the Soigan worker itself
Soigan server: A daemon that collects data from Workers
Server: A machine running a Soigan server; also used to refer to the Soigan server itself
Soigan client: A program that gets data from Servers
Client: A machine running a Soigan client; also used to refer to the Soigan client itself
Request: A message from a Server to a Worker
Response: A message from a Worker to a Server
Query: A message from a Client to a Server
Result: A message from a Server to a Client

Why are the remote machines considered Workers and not Servers? Hopefully that is made clear below. Also, note that Request, Response and Result have the special meanings above only when capitalized; in lowercase they keep their ordinary meanings (for example, an XML-RPC response).

Protocol - XML-RPC

The protocol has to support a handful of things: All of these can be implemented in XML-RPC; the Requests, Messages and Instructions are all procedure calls (an XML-RPC request), and the Responses and Results are the results of the procedure calls (an XML-RPC response). Or are they?

We've already decided that there are many ways to get information -- unicast (asking directly), multicast (calling out generally), plugins (finding out anonymously) and unsolicited. In the first case, when connecting directly to a Worker, we would open a connection, ask our question, and get a response. The multicast method asks anyone appropriate to respond, and then waits to see if we get any responses. The third doesn't expect responses from the Worker. And unsolicited information doesn't require the Server to ask at all! So when considering responses from a Worker in each case, we have "Yes", "Maybe", "No" and "Unexpected". That's quite a mix. Maybe they should be consolidated into one method of information retrieval?

Our basis can be the unsolicited messages, since they don't involve the Server until the Worker has information. This means that the Worker has to connect to the Server to give the information. Since this is something that the Worker and Server have to support, we can make the transfer of data behave consistently in this model for all cases. Our usual unicast request can be a request to call back with the response. Our multicast request has always been that, so we're set there. The plugins can be a special case where the Server is its own Worker, acting on behalf of a remote Host (because the plugins are finding out information without the Host getting involved).

In this way, we can separate the Soigan engine into different parts -- one that makes the requests, and one that handles the requests. But we're getting ahead of ourselves there. So we can change our previous paragraph:

All of these can be implemented in XML-RPC; the Requests, Messages, Instructions and Responses are all procedure calls (an XML-RPC request), and the Results are the results of the Query procedure calls to a Server from a Client.

Note that here, a Client is some program that culls information from Soigan, be it a webpage, a command-line tool or a GUI application. The communication between the Client and the Server will be a little different than the rest of the system, since it will be a "typical" XML-RPC call, where the rest might not be considered in the spirit of XML-RPC.

What we're doing is, in essence, a callback mechanism. The Server has a set of functions that the Workers call to inform the Server of what they know. In all but the unsolicited case, these callbacks are being triggered by something the Server does. Both ends of communication are XML-RPC calls, and that means that they both expect a response, but we really don't have one! So we'll start by figuring out what the responses (lowercase) to Requests look like.

responses

There has, apparently, been some work to introduce a nil value into XML-RPC, but from what I can tell, this has not become standard. Because of this, I will avoid its use, and instead return a boolean value whenever a Worker or Server gets a Request. This value will say whether the Worker plans on acting on the Request or not (perhaps the parameters are incorrect, or the plugin isn't available, or that Server isn't allowed to ask for a Response):
<?xml version="1.0" encoding="ISO-8859-1"?>
  <methodResponse>
    <params><param>
      <value>
        <boolean>1</boolean>
      </value>
    </param></params>
  </methodResponse>

Easy enough. And there's nothing more we need, except for, perhaps, a reason why a zero would be returned. Let's leave that for now.
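As a rough sketch of how a caller might read this acknowledgement, Python's standard xmlrpc.client module can decode the document above. The helper name request_accepted is made up for illustration and is not part of Soigan.

```python
import xmlrpc.client

# The acknowledgement shown above, as it might arrive on the wire.
ACK_XML = b"""<?xml version="1.0" encoding="ISO-8859-1"?>
<methodResponse>
  <params><param>
    <value>
      <boolean>1</boolean>
    </value>
  </param></params>
</methodResponse>"""


def request_accepted(response_xml):
    """Return True if the peer says it plans to act on the Request.

    request_accepted is a hypothetical helper name, not part of Soigan.
    """
    # loads() returns (params, method_name); method_name is None when the
    # document is a methodResponse rather than a methodCall.
    params, _method_name = xmlrpc.client.loads(response_xml)
    return bool(params[0])


# A refusal is the same document carrying a 0 instead of a 1.
REFUSAL_XML = xmlrpc.client.dumps((False,), methodresponse=True)

print(request_accepted(ACK_XML))
print(request_accepted(REFUSAL_XML))
```

The reason for a refusal is deliberately left out here, matching the decision above to leave it for now.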

Requests

So what do the RPC calls look like? All Requests are of the form "Tell me the result of executing this plugin (with these parameters)". In some cases, the plugin resides on the Worker (e.g., disk usage), and in others it resides on the Server, but they're still plugins. So let's try this:
<?xml version="1.0" encoding="ISO-8859-1"?>
  <methodCall>
    <methodName>plugin.run</methodName>
  </methodCall>

Because every plugin will take a different number (and different types) of parameters, we'll have to support this by using a struct. Some plugins will require a hostname. Some plugins will require a username. Many Requests will require a time value, either to specify during a multicast request "please respond over the next N seconds", or to say "please update me every N seconds".

I want to take a moment for an aside.  One issue we've now
stumbled onto is the use of UDP or TCP.  For those of you who know
about this sort of thing, you have already realized this: multicast
uses UDP, whereas XML-RPC uses TCP.

This is okay, though.  Because of the way we're using XML-RPC, we're
really simulating a stateless connection between Worker and Server
anyway, which is very much how multicasting over UDP works.
Internally, we are going to formulate our requests in the same way
regardless of whether we're requesting over multicast UDP to a hundred
machines, or specifically to one machine with a direct TCP connection.
We do this because in the end, Servers just want to make requests,
Workers just want to answer them, and the networking mechanism used is
unimportant.  This is why our previous decision to have responses be a
simple boolean value was easy -- because we don't really care about
that response.  It's the Response that matters.  Make sense?
If not, we'll definitely come back to this.

To make some examples, let's use two plugin ideas previously mentioned: a ping plugin and a users plugin.

ping

The ping plugin will most likely run on a Server. The plugin will provide two pieces of information: whether a given host is up, and what the latency to it is.

We could, of course, just try to connect to the host to determine if it's up, but that can take a while to time out, a failed connection doesn't prove the host is down (the service we attempt to connect to might be down instead), and the host may not run a known service.

This means, then, that the ping plugin will require a hostname to ping. It should probably be given a timeout period, as well, so we don't wait forever for a response.

<?xml version="1.0" encoding="ISO-8859-1"?>
  <methodCall>
    <methodName>plugin.run</methodName>
    <params>
      <param>
        <value>soigan</value>
      </param>
      <param>
        <value><dateTime.iso8601>20040601T09:45:11</dateTime.iso8601></value>
      </param>
      <param>
        <value>ping</value>
      </param>
      <param>
        <value><struct>
          <member>
            <name>host</name><value>mailhost</value>
          </member><member>
            <name>timeout</name><value><int>1</int></value>
          </member>
        </struct></value>
      </param>
    </params>
  </methodCall>
If we're using the Nagios check_ping plugin, this would map into the following command-line:
	check_ping -H mailhost -t 1 -w1000,100% -c1000,100%
Note: the Nagios plugins support being told when to change their status from OK to WARNING to CRITICAL. I don't believe that it should be up to the plugin to determine this -- I think they should just fetch information -- so I will ignore this portion of Nagios plugins' results. The -w and -c options are (unfortunately) required for the check_ping program, so we give it some extravagant values.
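The mapping from the Request's struct to that command line can be sketched in a few lines of Python. The function name check_ping_argv is hypothetical; the only check_ping options used are the ones shown above, and the sketch writes -w and -c with a space before their values, which is assumed to be equivalent to the attached form.

```python
def check_ping_argv(params):
    """Build the check_ping argument list from a ping Request's struct.

    check_ping_argv is a hypothetical helper name; params is the decoded
    fourth parameter of the plugin.run call.
    """
    return [
        "check_ping",
        "-H", str(params["host"]),
        "-t", str(params["timeout"]),
        # check_ping insists on thresholds; we ignore its OK/WARNING/CRITICAL
        # verdict, so these are set high enough never to matter.
        "-w", "1000,100%",
        "-c", "1000,100%",
    ]


argv = check_ping_argv({"host": "mailhost", "timeout": 1})
print(" ".join(argv))
```

Keeping the arguments as a list rather than one string means the hostname is never interpreted by a shell if the list is later handed to something like subprocess.run.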

A typical response from check_ping is

	PING OK - Packet loss = 0%, RTA = 0.90 ms
Which might result in a Response such as
<?xml version="1.0" encoding="ISO-8859-1"?>
  <methodCall>
    <methodName>plugin.results</methodName>
    <params>
      <param>
        <value>mailhost</value>
      </param>
      <param>
        <value><dateTime.iso8601>20040601T09:46:00</dateTime.iso8601></value>
      </param>
      <param>
        <value>ping</value>
      </param>
      <param>
        <value><struct>
          <member>
            <name>packetloss</name><value><int>0</int></value>
          </member><member>
            <name>rta</name><value><double>0.90</double></value>
          </member>
        </struct></value>
      </param>
    </params>
  </methodCall>
One thing we should mention is that both the Request and Response contain the hostname of the Worker or Server and the time that the request was made. This is required because the XML-RPC system doesn't necessarily have access to this information when a function is called. It also allows us to save the Requests and Responses to recreate a session and replay events if needed.

Parameter three is the plugin that was called; it could be the case that a Server asked multiple things of a given Worker, and would thus need to know what each Response was responding to. This makes me wonder if it shouldn't also include the time that the Request was made. For now, we'll leave that out.

The fourth and final parameter is a structure, which is required since different plugins are going to return different types and different amounts of information. In the case of check_ping, we have two values we're concerned with -- packetloss and rta (round trip average).
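To make the four parameters concrete, here is a minimal sketch that turns the check_ping output above into the structure and then encodes the plugin.results call with Python's xmlrpc.client. The function names and the regular expression are assumptions made for illustration; real check_ping output may vary between versions.

```python
import re
import xmlrpc.client

# Matches output such as: PING OK - Packet loss = 0%, RTA = 0.90 ms
# (an assumed pattern based only on the sample line above).
PING_RE = re.compile(r"Packet loss = (\d+)%, RTA = ([\d.]+) ms")


def parse_check_ping(output):
    """Pull packetloss and rta out of one line of check_ping output."""
    match = PING_RE.search(output)
    if match is None:
        raise ValueError("unrecognised check_ping output: %r" % output)
    return {"packetloss": int(match.group(1)), "rta": float(match.group(2))}


def build_response(host, when, plugin, values):
    """Encode a Response: hostname, time, plugin name and a structure."""
    stamp = xmlrpc.client.DateTime(when)
    return xmlrpc.client.dumps((host, stamp, plugin, values),
                               methodname="plugin.results")


values = parse_check_ping("PING OK - Packet loss = 0%, RTA = 0.90 ms")
response_xml = build_response("mailhost", "20040601T09:46:00", "ping", values)
print(response_xml)
```

The plugin only fetches the numbers; the "PING OK" verdict is discarded, in line with the note above about ignoring Nagios status levels.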

This brings up an interesting issue. How do Servers know what kind of data is going to be returned? Should they care? In the end, what are they doing with it? Who are they reporting it to?

Though we haven't figured out much of the Server side yet, I'm thinking we should add another Request called schema.get:

<?xml version="1.0" encoding="ISO-8859-1"?>
  <methodCall>
    <methodName>schema.get</methodName>
    <params>
      <param>
        <value>soigan</value>
      </param>
      <param>
        <value><dateTime.iso8601>20040601T10:10:11</dateTime.iso8601></value>
      </param>
      <param>
        <value>ping</value>
      </param>
    </params>
  </methodCall>
and the Response would be:
<?xml version="1.0" encoding="ISO-8859-1"?>
  <methodCall>
    <methodName>schema.results</methodName>
    <params>
      <param>
        <value>mailhost</value>
      </param>
      <param>
        <value><dateTime.iso8601>20040601T10:10:20</dateTime.iso8601></value>
      </param>
      <param>
        <value>ping</value>
      </param>
      <param>
        <value><struct>
          <member>
            <name>packetloss</name><value>int</value>
          </member><member>
            <name>rta</name><value>double</value>
          </member>
        </struct></value>
      </param>
    </params>
  </methodCall>
This is a Response very similar to the earlier one, except here the values in the structure are the names of the XML-RPC types that get returned for each structure member, instead of the values themselves. This is called Reflection and is used in many languages to look into their objects to figure out their basic structure. This reminds me of something else that we might consider later, which is optional versus required fields. For now, we'll work under the assumption that all plugins return values in all cases, but I'm sure we'll hit some that don't need to, or can't. At that point, we'll change things a bit.
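One way a Worker could produce such a schema, and a Server could check a later Response against it, is sketched below. The helper names are hypothetical, the type table covers only a handful of XML-RPC types, and the check assumes (as stated above) that every member is always present.

```python
# XML-RPC type names for the Python types that xmlrpc.client decodes to.
# bool is listed before int because bool is a subclass of int in Python.
XMLRPC_TYPE_NAMES = [
    (bool, "boolean"),
    (int, "int"),
    (float, "double"),
    (str, "string"),
    (list, "array"),
    (dict, "struct"),
]


def type_name(value):
    """Return the XML-RPC type name for one decoded value."""
    for python_type, name in XMLRPC_TYPE_NAMES:
        if isinstance(value, python_type):
            return name
    raise TypeError("no XML-RPC type known for %r" % (value,))


def schema_for(values):
    """Describe a results structure as member name -> type name."""
    return {member: type_name(value) for member, value in values.items()}


def matches_schema(values, schema):
    """True if the structure has exactly the advertised members and types."""
    return schema_for(values) == schema


ping_schema = schema_for({"packetloss": 0, "rta": 0.90})
print(ping_schema)
print(matches_schema({"packetloss": 3, "rta": 12.5}, ping_schema))
```

Once optional fields are allowed, matches_schema would have to become more forgiving than a plain equality test.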

users

Let's now see if the idea of a users plugin will fit into what we've got so far.
<?xml version="1.0" encoding="ISO-8859-1"?>
  <methodCall>
    <methodName>plugin.run</methodName>
    <params>
      <param>
        <value>soigan</value>
      </param>
      <param>
        <value><dateTime.iso8601>20040601T10:40:11</dateTime.iso8601></value>
      </param>
      <param>
        <value>users</value>
      </param>
      <param>
        <value><struct>
        </struct></value>
      </param>
    </params>
  </methodCall>
and now a Response:
<?xml version="1.0" encoding="ISO-8859-1"?>
  <methodCall>
    <methodName>plugin.results</methodName>
    <params>
      <param>
        <value>mailhost</value>
      </param>
      <param>
        <value><dateTime.iso8601>20040601T10:41:00</dateTime.iso8601></value>
      </param>
      <param>
        <value>users</value>
      </param>
      <param>
        <value><struct>
          <member>
            <name>count</name><value><int>3</int></value>
          </member><member>
            <name>names</name><value><array><data>
                                <value>ben</value>
                                <value>crwth</value>
                                <value>crwth</value>
                              </data></array></value>
          </member>
        </struct></value>
      </param>
    </params>
  </methodCall>
In this example, we returned the hostname, the time and the name of the plugin, as before. This time, the structure contains two members, one called count, the other called names. count saves a caller from counting what's in the names array. names contains duplicate names here; whether duplicates are reported is up to the users plugin. It's also up to the plugin how it finds out this information -- it could use a command-line program (such as users, w or who) or some other method (a service or daemon, or perhaps a /proc entry). Different users plugins might return different information, such as login time, the tty the person is using, and where they're logged in from. This is why the Reflection is important -- to see what kind of information to expect.

And what if there were no users logged in? In that case, we'd have

<?xml version="1.0" encoding="ISO-8859-1"?>
  <methodCall>
    <methodName>plugin.results</methodName>
    <params>
      <param>
        <value>mailhost</value>
      </param>
      <param>
        <value><dateTime.iso8601>20040601T10:41:00</dateTime.iso8601></value>
      </param>
      <param>
        <value>users</value>
      </param>
      <param>
        <value><struct>
          <member>
            <name>count</name><value><int>0</int></value>
          </member>
        </struct></value>
      </param>
    </params>
  </methodCall>
which is still valid, because the structure isn't predefined. And if we didn't supply the count value, an empty machine could get
<?xml version="1.0" encoding="ISO-8859-1"?>
  <methodCall>
    <methodName>plugin.results</methodName>
    <params>
      <param>
        <value>mailhost</value>
      </param>
      <param>
        <value><dateTime.iso8601>20040601T10:41:00</dateTime.iso8601></value>
      </param>
      <param>
        <value>users</value>
      </param>
      <param>
        <value><struct>
        </struct></value>
      </param>
    </params>
  </methodCall>
because structures are allowed to be empty. We do, though, have to declare that there is a <struct>, because our functions (in this case, plugin.results()) always have to have the same number of parameters -- in this case a string, a dateTime, another string, and a structure.
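The fixed four-parameter shape is easy to see from the receiving side. The sketch below registers a plugin.results handler with Python's standard SimpleXMLRPCDispatcher and feeds it the empty-structure Response in-process, without opening a socket. The handler body and the received list are assumptions for illustration, and _marshaled_dispatch is the dispatcher's internal entry point, used here only to keep the example self-contained.

```python
import xmlrpc.client
from xmlrpc.server import SimpleXMLRPCDispatcher

received = []  # Responses the Server has accepted (illustrative storage)


def plugin_results(host, when, plugin, values):
    """Server-side callback: always four parameters, struct may be empty."""
    names = values.get("names", [])
    # Fall back to counting names when the plugin did not supply count.
    count = values.get("count", len(names))
    received.append((host, str(when), plugin, count, names))
    return True  # the lowercase response: the Response was accepted


dispatcher = SimpleXMLRPCDispatcher()
dispatcher.register_function(plugin_results, "plugin.results")

# A Response from an empty machine: the struct is present but has no members.
call_xml = xmlrpc.client.dumps(
    ("mailhost", xmlrpc.client.DateTime("20040601T10:41:00"), "users", {}),
    methodname="plugin.results")

# Decode the call, run the handler and encode its return value.
reply_xml = dispatcher._marshaled_dispatch(call_xml.encode("utf-8"))
reply, _method_name = xmlrpc.client.loads(reply_xml)
print(reply[0], received)
```

Dropping the struct entirely would change the call's arity, and the handler would be invoked with three arguments instead of four, which is why an empty struct is sent rather than nothing.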

Summary

So far we've introduced two Requests that Servers make (plugin.run and schema.get), and two Responses that Workers make in response (plugin.results and schema.results). These names may not be the best -- I'm no interface expert -- but they'll do for now. If better ones are found, these will have to be kept around for backwards compatibility, I suppose.
©2002-2017 Wayne Pearson