
How Internet Rover Works

InetRoverd - Collector

   InetRoverd reads a configuration file that contains a list of network
nodes and associated tests to be performed on each of them.  If the test
fails, a problem entry is added to the problem file.  If the test succeeds,
and a problem entry exists, the problem entry is removed.  Each problem
addition is logged, as well as each problem deletion.
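
   The add/remove rule above can be sketched as follows.  This is only an
illustration, not InetRoverd's actual code: the in-memory set standing in for
the problem file, the function name reconcile, and the choice not to add a
second entry for a problem that is already listed are all assumptions.

```python
import logging


def reconcile(problems, node, test_name, passed, log=logging.getLogger("rover")):
    """Apply one test result to the problem set.

    `problems` is a set of (node, test_name) pairs standing in for the
    problem file (hypothetical; the real file format is not described here).
    Returns "added", "removed", or None when nothing changed.
    """
    key = (node, test_name)
    if not passed and key not in problems:
        # Test failed and no entry exists yet: add one and log the addition.
        problems.add(key)
        log.info("problem added: %s %s", node, test_name)
        return "added"
    if passed and key in problems:
        # Test succeeded and an entry exists: remove it and log the deletion.
        problems.discard(key)
        log.info("problem removed: %s %s", node, test_name)
        return "removed"
    return None
```

   Calling reconcile once per node and test on each pass keeps the problem
set in step with the latest results.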

Display - Display the Problems

   There is also a program that displays the list of current problems.  This
curses-based program periodically polls the problem file and rereads it when
necessary.  Operators can use the display program to add comments to a
problem entry.  These comments are then seen by all of the display programs.

pingd - Handles the Pinging

   The pingd performs all of the pinging tests for InetRoverd.  It performs
a more complicated ping test than the original ping test.  It can, for
example, ping any number of times before marking a node as Not Reachable.
The delay time between sending subsequent pings can also be set.  This allows
you to configure the ping frequency and define the test failure condition
more accurately for a diverse network.

   For example, we have some very slow PDP 11 network machines still in
production that are both very busy and far away across slow dialup lines.
These machines shouldn't get pinged very often, and should be given up to 10
attempts to respond before pingd concludes they are not reachable.

   The pingd handles all ping tests like this by sending all the pings out
in parallel.  It schedules the pings, and uses timers to determine when the
next set of pings is to go out.

-------------------------------
stop here for BASIC ROVER
-------------------------------

NewRover - Collector

   nsfnetRoverd and nsfnett3Roverd discover the T1 and T3 backbone topology
and store the state of the nodes and links in a network status file.  This
network status file contains a list of objects named nodes and links.  The
Node class includes ASs and NSSs.  The Link class contains ISISLinks and
ASLinks.

   The NewRovers take 30 seconds to determine the state of the network using
the standard discovery algorithms.  State transitions and node discoveries
are logged to files named rover.log.YYMMDD.

NewRover meets OldRover

   NewRover will merge with OldRover in the following fashion.  After
NewRover has discovered the state of the network it will add problem entries
to the problem file, and remove the problem entries when the error condition
no longer exists.  Thus, we will still have a Text-Based Display of network
state as well as the X-based graphical displays that will be discussed in
the next section.

   The NewRover will query the backbone nodes using SNMP queries.  Pings
may no longer be necessary since a node's reachability is determined by
whether or not the node answered the SNMP queries.  Similarly, the SNMP
queries will return the link state information, eliminating the need to run
the linkcheck program (and the linkchk configuration files).  FindAS will no
longer be needed since we get that information from the NewRover as well.

   The OldRover will be used to monitor network objects that don't fit into
the discovery algorithm easily.  Mail machines, HIMs, and spare PSPs are
classes of network devices that would not easily be discovered, and
therefore will continue to be monitored using ping tests.

   Both rovers will write problem entries to the same problem file.  Display
programs will therefore be capable of displaying problems detected by both
sets of rovers.

X-based graphical displays

   The graphical front end to the New Rover data collector displays nodes
and links, colored to reflect their state as seen in the network status
file.  As new node and link objects are discovered in the network, the map
is updated to show the new objects.  Node placement is accomplished using
the mouse, and the map is automatically saved upon exit of the program.

   Users can bind arbitrary actions to mouse clicks or key presses on nodes
or links displayed on the map.  These actions are defined in the user's
.Xdefaults file.  For example, defining the following resource

Nsfnett3*draw*XmPushButton.Translations: #augment \
<Btn1Down>:     system( aixterm -e ping $1 ) \n\
<Btn3Down>:     system( aixterm -e telnet $1 )

in the user's .Xdefaults file binds a mouse button 1 down event to opening
up an aixterm window pinging the node in question.  The mouse Button 3 down
event is likewise bound to opening up an aixterm window telnetted into the
node.   The graphical display replaces all instances of $1 with the IP
Address of the node that was pressed.

   The syntax for actions on links is similar:

Nsfnett3*draw.Translations: #override\
<Btn1Down>:     system( aixterm -e ping $1 & aixterm -e ping $2 ) \n\
<Btn3Down>:     system( aixterm -e telnet $1 & aixterm -e telnet $2 )

   This specifies that the mouse button 1 down event invokes two aixterm
windows, which ping the nodes on either side of the link.  The second line
causes Button 3 down mouse events to invoke two aixterm windows, each
telnetting to a node at the end of the link.
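
   The $1 and $2 substitution can be illustrated with a short sketch.  The
function expand_action is hypothetical and is not taken from the display
program; the addresses in the usage note are documentation placeholders.

```python
def expand_action(template, addresses):
    """Replace $1, $2, ... in an action template with IP addresses.

    addresses[0] replaces $1, addresses[1] replaces $2, and so on.
    """
    command = template
    # Substitute higher indexes first so "$1" never clobbers "$10".
    for index in range(len(addresses), 0, -1):
        command = command.replace("$%d" % index, addresses[index - 1])
    return command
```

   For a node, expand_action("aixterm -e ping $1", ["192.0.2.1"]) yields
"aixterm -e ping 192.0.2.1".  For a link, passing the addresses of both end
nodes fills in $1 and $2.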

   We are currently specifying the above tasks in rovers' .Xdefaults file by
binding execution of a shell script rather than coding the actual command in
the .Xdefaults.  This was originally for readability, but experience has
shown that by executing a shell script, we can add a layer of abstraction
relatively easily.  Assume we want to connect to a node.  The shell script
could try to use in-band access first, and if that failed, automatically use
out-of-band access, transparent to the operator.
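
   The fallback idea can be sketched as below.  It is written in Python
rather than as a shell script purely for illustration, and the function
connect_with_fallback and the access-method callables are hypothetical
placeholders, not part of Rover.

```python
def connect_with_fallback(node, methods):
    """Try each access method in order; return the name of the first that works.

    methods: list of (name, connect) pairs, where connect(node) returns True
    on success.  Both the names and the callables are placeholders.
    Returns None if every method fails.
    """
    for name, connect in methods:
        try:
            if connect(node):
                return name
        except OSError:
            # Treat a failed connection attempt as "try the next method".
            continue
    return None
```

   Listing in-band access first and out-of-band access second gives the
behavior described above, and the operator only sees which method finally
succeeded.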

   One could imagine binding a link testing shell script to a mouse or key
click sequence that would perform link diagnosis.  This shell script might
query either end of the link for DSU and interface status, error counts,
etc.  The end result of this might be a filled in link problem template as
defined by the ie folks.  Heuristics could perhaps point out that this
problem appears to be a grey link or black link based on the results.

   Another advantage to having a network status file around is that network
statistics programs now have an automatically maintained list of nodes in
the network to query.  The state of a node might be used to determine
whether or not it makes sense to poll that node for statistics.  Clearly, if
the node is not reachable, it makes no sense to poll it for statistics.
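
   As a rough sketch, a statistics collector could filter its polling list
by node state.  The state names "up" and "not reachable" and the function
nodes_to_poll are assumptions made for this example; the actual contents of
the network status file are not described here.

```python
def nodes_to_poll(node_states, pollable=("up",)):
    """Return the sorted names of nodes whose state makes polling worthwhile.

    node_states: dict mapping node name -> state string, as might be read
    from the network status file (the state names are assumed).
    """
    return sorted(name for name, state in node_states.items() if state in pollable)
```

   Nodes in any state outside the pollable set are simply left out of the
statistics run.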

   Finally, automating diagnosis of problems could be done in the short
term.  Binding execution of a shell script to a node or link state
transition would be one way of automating problem determination.  This
script might automatically update the operator's problem entry with
information it discovered.  Eventually, the problem might be corrected
automatically, and the NOC would be informed of the problem and the
corrective action taken.
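
   One possible shape for binding actions to state transitions is sketched
below.  Everything in it is an assumption made for illustration: the
function dispatch_transitions, the snapshot dictionaries, and the state
names are not taken from Rover, and the handlers are plain callables
standing in for shell scripts.

```python
def dispatch_transitions(previous, current, handlers):
    """Call a handler for every object whose state changed between snapshots.

    previous, current: dicts mapping object name -> state string.
    handlers: dict mapping (old_state, new_state) -> callable(name).
              An object that is new in `current` has old_state None.
    Returns the list of handler results, in object-name order.
    """
    results = []
    for name in sorted(current):
        old, new = previous.get(name), current[name]
        if old != new:
            handler = handlers.get((old, new))
            if handler is not None:
                results.append(handler(name))
    return results
```

   A handler bound to the ("up", "down") transition could then gather
diagnostic information and attach it to the corresponding problem entry.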
