Data Collected by netmain_srvr Probes and Observers

This section describes the data collected by the probes and observers run by netmain_srvr monitors.

The descriptions of items sometimes refer to "output formats" or "display formats." Generate these formats with the Netmain Interactive Tool.

CPU_TIME - Null/AEGIS/user CPU time

Records performance statistics about each node's CPU usage and writes them


Records cumulative information about disk and storage module performance and errors on all nodes and writes them to 'node_data/net_log/net_log.yy.mm.dd or a file you specify.
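As an illustration of the net_log.yy.mm.dd naming pattern, the following minimal Python sketch builds the base file name for a given day. It assumes that yy, mm, and dd are two-digit, zero-padded year, month, and day fields; the function name net_log_name is hypothetical and is not part of netmain_srvr.

```python
from datetime import date


def net_log_name(day: date) -> str:
    """Return the base log file name for a day, assuming the
    net_log.yy.mm.dd pattern uses two-digit, zero-padded fields."""
    return day.strftime("net_log.%y.%m.%d")


if __name__ == "__main__":
    # Example date chosen only for illustration.
    print(net_log_name(date(1987, 3, 9)))  # net_log.87.03.09
```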

Counts the number of requests for disk I/O that could not be serviced because device controller, such as controller memory component errors, internal micro-diagnostic failures, or internal timing problems, can cause equipment checks. On storage modules, only data for Unit 0 is recorded. The display

Storage module reads as percentage of storage module I/O:

ERR_COUNTS - Network error counts (normal traffic)

Records cumulative network performance statistics and error counts on all nodes and writes them to 'node_data/net_log/net_log.yy.mm.dd or a file you specify.

percentage of the node's total network transmits.

Acknowledge parity, receive (ACK):


node's modem hardware cannot lock on the transmission frequency from the node upstream. Biphase errors can result from modem hardware failures, broken cables or connectors, or signal degradation caused by excessive cable lengths between active nodes. The display formats show the statistic as a percentage of the node's total network receives. The ERR_COUNTS probe collects data for this statistic only from nodes running SR8 or later releases.

percentage of the node's total network transmits. The ERR_COUNTS probe collects data for this performance statistic only from nodes running SR8 or

Counts messages that contain Cyclical Redundancy Check (CRC) errors, detected by the ring hardware. CRC errors can result from errors in any part of the received message, except the hardware protocol segments. You cannot disable CRC checking. Compare CRC error statistics to those for RCV header communication frequency. The display formats show the statistic as a percentage of the node's total network receives. The ERR_COUNTS probe collects data for this statistic only from nodes running SR8 or later releases.

ESB errors, transmit:

Counts the number of Elastic Store Buffer (ESB) errors that occur. ESB errors occur when the node is unable to follow a large or sudden change in the network's communication frequency. The display formats show the statistic as a percentage of the node's total network transmits. The ERR_COUNTS probe collects data for this statistic only from nodes running SR8 or later releases.

Header checksum, receive:

Counts messages that contain checksum errors in the message header. Since the operating system program that verifies header checksums is usually disabled, the count for this statistic should be 0. The display formats show the

Counts the number of times the transmitter could not synchronize properly with the network, resulting in an Xmit ESB or biphase error condition. See "transmit biphase errors" and "transmit ESB errors." If the error condition lasts more than one minute, the node broadcasts a "hardware failure report"


Timeout, transmit:

Counts transmitted messages that do not complete their transmission in the expected time. This error often occurs when network traffic is slow, due to repeated attempts to retransmit or regenerate the ring token. The display formats show the statistic as a percentage of the node's total network


percentage of the node's total network transmits.

Transmit error, receive:

Counts the number of times that either the transmitter or another receiver had an error in the packet. For this error to occur, some other error flag must be set. The display formats show the statistic as a percentage of the node's total network receives.
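The display formats described above express an error count as a percentage of the node's total network receives (or transmits). A minimal Python sketch of that arithmetic follows. The function name as_percentage is hypothetical, and returning 0.0 when the node has no traffic is an assumption, since the source does not say how the tool handles that case.

```python
def as_percentage(error_count: int, total_messages: int) -> float:
    """Express an error count as a percentage of total receives or transmits.

    Returning 0.0 for a node with no traffic is an assumption made here
    to avoid dividing by zero; the real tool's behavior is not documented.
    """
    if total_messages == 0:
        return 0.0
    return 100.0 * error_count / total_messages


if __name__ == "__main__":
    # Illustrative numbers only: 5 errors out of 1000 received messages.
    print(as_percentage(5, 1000))  # 0.5
```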

Receives as percentage of network I/O:

Calculates the ratio of incoming messages (receives) to the node's total

HW_FAIL - Hardware failure messages

Records every change in the hardware failure message reported by the

Default Probe Skip Distance: Not Applicable

MEMORY - Records counts of memory errors on nodes in the network

Lists nodes on which correctable memory errors have occurred.

Default Probe Interval Time: 0:30:00

Default Probe Skip Distance: 1

NET_SERVICE - Network service queue statistics

Measures the length of the network service queue backlog on each node.

requests (the queue "backlog") and counts these remaining requests. When there are no requests, the performance statistic adds a 0 to the average. The file server handles only requests from other nodes for operations such as opening, closing, and creating files (not reading or writing). Backlog incidence is shown as a percentage of the queue's capacity.

nodes, for reads, writes, paging, and several internal operating system services.

Incidence plots show the number of times a backlog was present as a percentage of the number of page services requested by other nodes.
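The backlog statistics can be illustrated with a short Python sketch. It assumes that each sample is the number of requests found remaining in the queue at one observation, that empty observations contribute a 0 to the average (as stated above), and that incidence is the count of observations with a backlog divided by the number of services requested. The function names and the sample data are hypothetical.

```python
def backlog_average(samples: list[int]) -> float:
    """Average backlog length; observations with no requests count as 0."""
    if not samples:
        return 0.0
    return sum(samples) / len(samples)


def backlog_incidence(samples: list[int], services_requested: int) -> float:
    """Times a backlog was present, as a percentage of services requested."""
    if services_requested == 0:
        return 0.0
    times_present = sum(1 for length in samples if length > 0)
    return 100.0 * times_present / services_requested


if __name__ == "__main__":
    samples = [0, 2, 0, 4]  # illustrative queue lengths
    print(backlog_average(samples))       # 1.5
    print(backlog_incidence(samples, 8))  # 25.0
```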



remaining requests (the queue "backlog") and counts these remaining requests. When there are no requests, the performance statistic adds a 0 to pages during file I/O, operating system execution, and many other activities.

Output formats that use percentages show the statistic as a percentage of all

Records information about diskless nodes and their paging partners.

Diskless nodes/paging partners:

Default Probe Interval Time: 0:30:00

Default Probe Skip Distance: 1

SWD_10_MSGS - Software Diagnostic Messages (10)

of messages on the node's performance statistics.

SWD ack parity:

Counts the number of parity failures in the hardware protocol segments of software diagnostic messages. The hardware always detects ACK parity errors if any occur. The SWD_10_MSG and SWD_100_MSG probes collect data for this performance statistic. The display formats show the statistic as a percentage of the test messages the node receives.

Counts errors during Direct Memory Access (DMA) from the ring controller, during receipt of software diagnostic messages. The SWD_10_MSG and SWD_100_MSG probes collect data for this performance statistic. The display formats show the statistic as a percentage of the test messages received by the node.

Counts test messages that contain Cyclical Redundancy Check (CRC) errors, detected by the ring hardware. CRC errors can result from errors in any part of the received message except the hardware protocol segments. Note that this performance statistic shows only those errors that occur in software diagnostic messages. The SWD_10_MSG and SWD_100_MSG probes collect data for this performance statistic. The display formats show the statistic as a percentage of the diagnostic messages received.

Counts the number of times that one or both of the message fields in the test

Counts SWD (software diagnostic) messages that contain checksum errors in the message header. Since the operating system program that verifies header checksums is usually disabled, the count for this statistic should be 0. The SWD_10_MSG and SWD_100_MSG probes collect data for this performance statistic. The display formats show the statistic as a percentage of the test

the network, during receipt of Software Diagnostic messages. The conditions that cause this error class are biphase or Elastic Store Buffer (ESB) errors.

The SWD_10_MSG and SWD_100_MSG probes collect data for this performance statistic. The display formats show the statistic as a percentage of the SWD test messages the node receives.

Counts the number of Direct Memory Access (DMA) overruns that occur during software diagnostic messages. The SWD_10_MSG and SWD_100_MSG probes collect data for this performance statistic. The display formats show the statistic as a percentage of the test messages the node receives.



SWD_100_MSG probes collect data for this performance statistic. The display formats show the statistic as a percentage of the test messages the node receives.

SWD receive timeout:

Counts Software Diagnostic messages received that did not complete in the expected time. The SWD_10_MSG and SWD_100_MSG probes collect data for this performance statistic. The display formats show the statistic as a percentage of the test messages received.

SWD transmit errors:

SWD_100_MSGS - Software Diagnostic Messages (100)

Records the effect of 100-message broadcasts, in the same manner as SWD_10_MSGS above. This probe collects the same information as SWD_10_MSGS.

Default Probe Interval Time: 0:30:00

Default Probe Skip Distance: 1

TIME_SKEW - Difference between node clocks

Compute offset times:

Automatically computes offset times for each log file. The computed offset times are derived from the contents of the log files, using results from the TIME_SKEW probe. A variety of conditions can prevent the offset computation from working. For instance, the TIME_SKEW probe may never have executed or may not have operated on each node.

Default Probe Skip Distance: Not Applicable
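The source does not say how the offset times are actually derived from the TIME_SKEW results. Purely as an illustration, the Python sketch below assumes that each log file carries a list of skew readings in seconds and takes their mean as the offset; a log file with no readings gets None, mirroring the case where the probe never ran. The function name, the log file names, and the use of a mean are all assumptions.

```python
from statistics import mean
from typing import Optional


def compute_offsets(skews_by_log: dict[str, list[float]]) -> dict[str, Optional[float]]:
    """Map each log file to the mean of its skew readings (an assumed rule).

    A log file with no TIME_SKEW readings maps to None, meaning that no
    offset could be computed for it.
    """
    return {
        log: (mean(readings) if readings else None)
        for log, readings in skews_by_log.items()
    }


if __name__ == "__main__":
    # Hypothetical log names and skew readings, in seconds.
    readings = {"log_a": [2.0, 4.0], "log_b": []}
    print(compute_offsets(readings))  # {'log_a': 3.0, 'log_b': None}
```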


MODEM_ERRS - Transmit modem errors

This observer reports on nodes that have more than five times the average number of "Transmit Modem Errors."
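The threshold rule for this observer can be sketched in Python as follows. The sketch assumes that the average is taken over all nodes in the sample, including the node being tested; the source does not specify this. The function name and node names are hypothetical.

```python
def nodes_over_average(errors_by_node: dict[str, int], factor: float = 5.0) -> list[str]:
    """Return nodes whose error count exceeds `factor` times the average.

    Assumption: the average is computed over every node in the sample,
    including the node under test.
    """
    if not errors_by_node:
        return []
    average = sum(errors_by_node.values()) / len(errors_by_node)
    return sorted(
        node for node, count in errors_by_node.items() if count > factor * average
    )


if __name__ == "__main__":
    # Hypothetical transmit modem error counts for seven nodes.
    counts = {"n1": 1, "n2": 1, "n3": 1, "n4": 1, "n5": 1, "n6": 1, "n7": 100}
    print(nodes_over_average(counts))  # ['n7']
```

Under this assumption, a network of five or fewer nodes can never trigger the report, because no single count can exceed the sum of all counts.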

Default Probe Interval Time: 0:30:00

Default Recheck Interval: 12:00:00

WIN_CRC - Disk drive errors

This observer reports on nodes that have more than 0.01 percent of Winchester disk drive CRC errors.
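The source does not state what the 0.01 percent is measured against. The Python sketch below assumes the denominator is the node's total number of Winchester disk I/O operations; that denominator, the function name, and the sample figures are all assumptions.

```python
def exceeds_crc_threshold(
    crc_errors: int, total_disk_ops: int, threshold_percent: float = 0.01
) -> bool:
    """Report whether CRC errors exceed the threshold percentage.

    Assumption: the percentage is taken relative to total disk I/O
    operations. A node with no disk activity is never reported.
    """
    if total_disk_ops == 0:
        return False
    return 100.0 * crc_errors / total_disk_ops > threshold_percent


if __name__ == "__main__":
    # Illustrative figures: 2 CRC errors in 10,000 operations is 0.02 percent.
    print(exceeds_crc_threshold(2, 10_000))  # True
    # 1 CRC error in 20,000 operations is 0.005 percent.
    print(exceeds_crc_threshold(1, 20_000))  # False
```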

Default Probe Interval Time: 0:30:00

Default Recheck Interval: 12:00:00
