Monitoring HOWTO


Components Provided for Monitoring

The major components of the Cluster Systems Management monitoring tool are the Resource Monitoring and Control (RMC) subsystem and certain resource managers. These are described in the following sections.


Resource Monitoring and Control Subsystem

The Resource Monitoring and Control (RMC) subsystem monitors and queries resources. The RMC daemon manages an RMC session and recovers from communications problems.

The RMC subsystem is used by its clients to monitor the state of system resources and to send commands to resource managers. The RMC subsystem acts as a broker between the client processes that use it and the resource manager processes that control resources.


Resource Managers

A resource manager is a process that maps resource and resource-class abstractions into calls and commands for one or more specific types of resources. A resource manager is a stand-alone daemon. The resource manager contains definitions of all resource classes that the resource manager supports. A resource class definition includes a description of all attributes, actions, and other characteristics of a resource class.

See the man pages for the RMC and ERRM commands or Cluster Systems Management for Linux Technical Reference to learn how to access the resource classes and manipulate their attributes through the command line interface.

The following resource managers are provided:

Audit Log resource manager (IBM.AuditRM)
Provides a system-wide facility for recording information about the system's operation, which is particularly useful for tracking subsystems running in the background. (See Using the Audit Log to Track Monitoring Activity and Audit Log Resource Manager for details.)

Distributed Management Server Resource Manager (IBM.DMSRM)
Manages a set of nodes that are part of a system management cluster. This includes monitoring the status of the nodes and adding, removing, and changing attributes of the nodes in the cluster. (See Distributed Management Server Resource Manager for details.)

Event Response resource manager (IBM.ERRM)
Provides the ability to take actions in response to conditions occurring on the system. (See Event Response Resource Manager for details.)

File System resource manager (IBM.FSRM)
Monitors file systems. (See File System Resource Manager for details.)

Host resource manager (IBM.HostRM)
Monitors resources related to an individual machine. The types of values that are provided relate to load (processes, paging space, and memory usage) and status of the operating system. It also monitors program activity from initiation until termination. (See Host Resource Manager for details.)

Sensor resource manager (IBM.SensorRM)
Provides a means to create a single user-defined attribute to be monitored by the RMC subsystem. (See Sensor Resource Manager for details.)

Audit Log Resource Manager

The Audit Log subsystem is implemented as a resource manager within the RMC subsystem. It has two resource classes, IBM.AuditLog for subsystem definitions and IBM.AuditLogTemplate for audit-log-template definitions. Entries in the audit log are called records. Records can be added, retrieved, and removed through actions on a specific subsystem or on the subsystem class. The template definition class contains a description of each record type that a subsystem can add to the audit log. The template definition contains the data type, a descriptive message, and other information for each subsystem-specific field within the record.

There are typically two types of clients for the audit-log subsystem: subsystems that need to add records to the audit log, and users who extract records from the audit log through the command line. The formatted message for each record provides a concise description of the situation and allows a user to easily see at a high level what has been happening on the system.

Audit Log Resource Class

Each resource of this class represents a subsystem that will be adding records to the audit log. A resource of this class must be added before the subsystem can add records to the audit log. The resource can be added as part of the installation of the subsystem or at runtime.

The following properties can be monitored for this resource class:

RecordsAdded
Reflects the current number of records in the audit log. Whenever records are added to the audit log, this value is updated.

RecordsRemoved
Conveys which records have been removed. The following data elements comprise the value of this attribute:

RecordCount
Reflects the total number of records in the audit log after the records identified by SeqNumRanges have been removed.

SeqNumCount
Reflects the total number of elements in the SeqNumRanges array. The number of ranges in that array is actually SeqNumCount/2.

SeqNumRanges
Each consecutive pair of CT_INT64 integers defines an inclusive range of sequence numbers of records that have been deleted.

AuditLogSize
Reflects the amount of disk space in bytes that the audit log uses.
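The layout of the RecordsRemoved value described in this list can be unpacked with a few lines of code. The following is a minimal sketch, not part of the product: the function names are hypothetical, and it assumes the value has already been obtained as a count and a flat list of integers.

```python
def decode_removed_ranges(seq_num_count, seq_num_ranges):
    """Turn the flat SeqNumRanges array into a list of (first, last) pairs."""
    if seq_num_count != len(seq_num_ranges) or seq_num_count % 2 != 0:
        raise ValueError("SeqNumCount must equal the array length and be even")
    # Elements 0 and 1 form the first range, 2 and 3 the second, and so on,
    # so the number of ranges is SeqNumCount/2.
    return [
        (seq_num_ranges[i], seq_num_ranges[i + 1])
        for i in range(0, seq_num_count, 2)
    ]


def count_removed(ranges):
    """Number of sequence numbers covered by the inclusive ranges."""
    return sum(last - first + 1 for first, last in ranges)


ranges = decode_removed_ranges(4, [10, 14, 20, 20])
print(ranges)                 # [(10, 14), (20, 20)]
print(count_removed(ranges))  # 6
```

Because each range is inclusive, a pair such as (20, 20) describes exactly one removed record.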

Audit Log Template Resource Class

This resource class holds all audit log templates. An audit log template describes the information that exists in each audit log record that is based on the template. In addition, an audit log template contains information on how to present records that use the template to an end user. Each template corresponds to a resource within this class. The attributes of this resource class are internal.


Distributed Management Server Resource Manager

The distributed management server resource manager (IBM.DMSRM) controls the managed node (IBM.ManagedNode) resource class and the node group (IBM.NodeGroup) resource class. The distributed management server resource manager runs on the node designated as the management server and is automatically started by the RMC subsystem.

Managed Node Resource Class

The program name of this resource class is IBM.ManagedNode. It runs on the management server and is started by the RMC subsystem. It is controlled by the distributed management server resource manager.

The following dynamic attributes can be monitored for the IBM.ManagedNode resource class:

ConfigChanged
When a persistent attribute value changes, this attribute is asserted.

PowerStatus
Monitors the power status of the node. The valid states are OFF (0), ON (1), and UNKNOWN (127).

Status
Represents the current accessibility status of the node. Accessibility is defined as the ability to successfully ping the node. The valid states are UNREACHABLE (0), REACHABLE (1), and UNKNOWN (127).

Predefined Conditions for Managed Node Resource Class

The following predefined conditions and example expressions are available for the IBM.ManagedNode resource class:

NodeReachability
Event expression: Status!=1
Event description: An event is generated when a node in the network cannot be reached from the management server.
Rearm expression: Status=1
Rearm description: The event is rearmed when the node can be reached again.
Notes: None.

NodeChanged
Event expression: ConfigChanged=1
Event description: An event is generated when a node definition in the ManagedNode resource class changes.
Rearm expression: None.
Rearm description: None.
Notes: NodeNames = {localnode}

Node Group Resource Class

The program name of the node group resource class is IBM.NodeGroup. The node group resource class runs on the management server.

The following dynamic attributes of the node group resource class can be monitored:

ConfigChanged
When a persistent attribute value changes, this attribute is asserted.

Event Response Resource Manager

The system administrator interacts with the Event Response resource manager (ERRM) through the ERRM command-line interface.

When an event occurs, ERRM runs user-configured commands, which can include scripts provided by RSCT. A command and its attributes are a type of action, and many actions can be configured for a single Event Response resource. An action consists of a name, a command to be run, and other variables. You specify the range of times when the command is run (day, start time, and end time). If the condition occurs at a time outside the specified time ranges, the command is not run, and if all of the actions within this Event Response resource have the same time ranges, none of the commands are run. If no time ranges are specified, the command is always run. Each action also has an event flag and a rearm event flag that specify the events for which the command is run. Three combinations are allowed: only the event flag set, only the rearm event flag set, or both flags set.
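The rule for deciding whether an action's command runs can be sketched in a few lines. This is an illustration only, not the ERRM implementation: the TimeRange and Action classes, the weekday encoding (Monday is 0), the inclusive end time, and the placeholder command /bin/true are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from datetime import datetime, time


@dataclass
class TimeRange:
    days: frozenset   # weekday numbers, Monday=0 (hypothetical encoding)
    start: time
    end: time         # assumed inclusive; ranges do not cross midnight

    def contains(self, when):
        return when.weekday() in self.days and self.start <= when.time() <= self.end


@dataclass
class Action:
    name: str
    command: str
    time_ranges: list = field(default_factory=list)
    on_event: bool = True    # run the command for events
    on_rearm: bool = False   # run the command for rearm events

    def should_run(self, when, is_rearm):
        # First check the flag that matches the kind of event.
        wanted = self.on_rearm if is_rearm else self.on_event
        if not wanted:
            return False
        # No time ranges means the command is always run.
        if not self.time_ranges:
            return True
        return any(r.contains(when) for r in self.time_ranges)


business_hours = TimeRange(frozenset(range(5)), time(8, 0), time(17, 0))
page_admin = Action("page-admin", "/bin/true", time_ranges=[business_hours])
log_always = Action("log", "/bin/true", on_rearm=True)

monday_morning = datetime(2024, 1, 1, 9, 0)   # 1 January 2024 was a Monday
monday_night = datetime(2024, 1, 1, 22, 0)
print(page_admin.should_run(monday_morning, is_rearm=False))  # True
print(page_admin.should_run(monday_night, is_rearm=False))    # False
print(log_always.should_run(monday_night, is_rearm=True))     # True
```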

The Event Response Resource Manager (ERRM) is automatically started when the RMC subsystem is started.

Although performance is important, ensuring that no events are lost and that the user's commands are run is of greater importance. Other factors outside the control of ERRM may affect performance as well (for example, network load, system load, and the performance of other required subsystems).

The only user ID that can define, undefine, and modify ERRM resources is root. All other users have read access to ERRM resources. Security is governed by the RMC daemon, which authenticates clients and performs authorization checks. No security audits are generated, and no encryption mechanisms are used. ERRM communicates only with other local subsystems on the same node.

Information is handled as follows:

There are three Event Response resource classes:

  1. Condition

    The Condition resource class contains the necessary information (event expression and rearm expression) for the ERRM to register with the RMC for event notifications that the administrator deems important. Conditions contain essential information such as the resource attributes of the resource to be monitored, the event expression, and the optional rearm expression.

    Configuration of ERRM begins with the definition of a set of Condition resources. A Condition resource is registered with the RMC subsystem when the Condition resource is used in the definition of an active Association resource.

    Notes:

    1. Registration with RMC is necessary for monitoring to run. Registration does not occur when a new Condition resource is defined, but rather when the resource is used in the definition of an active Association resource.

    2. While monitoring a Condition on multiple nodes, if the RMC session with any one node is lost, the Condition's monitor status will be "monitored but in error."

  2. Event Response

    An Event Response resource is configured by defining one or more actions. Each action contains the name of the action, a command, and other fields within the action attribute. The Event Response resource runs any number of configured commands when an event with an active association occurs. When an event occurs, all of the actions associated with its Event Response resource are evaluated to determine whether they should be run.

    Predefined responses are available to use and to serve as templates for creating your own responses. For a description of predefined responses and how to use them, see Predefined Responses. Scripts for notification and logging of events and for broadcasting messages to logged-in user consoles are provided in Cluster Systems Management for Linux Technical Reference.

    Note:
    Commands are run in parallel.

    See Getting Started with the Monitoring Application for specific task information on how to configure actions for Event Response resources and Event Response resources for Conditions.

  3. Association

    The Association resource class joins the Condition resource class together with the Event Response resource class. It contains a flag that indicates whether the association between the condition and the event response is active. Event Responses and Conditions are separate entities, but for monitoring to take place, they need to be associated. An event cannot occur unless at least one Event Response is associated with a Condition. You can configure one or more actions for an Event Response, and one or more Event Responses for a Condition.

See Getting Started with the Monitoring Application for information on how to get started using the capabilities of the Event Response resource manager to monitor your system.
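The relationship among the three resource classes can be pictured as a small data model. The sketch below is hypothetical and is not how ERRM stores its resources; the class and function names are invented for illustration. It shows that a Condition counts as registered only when an active Association uses it, and that only actively associated Event Responses are considered when an event occurs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Condition:
    name: str
    event_expression: str
    rearm_expression: str = ""   # the rearm expression is optional


@dataclass(frozen=True)
class EventResponse:
    name: str
    actions: tuple               # names of the configured actions


@dataclass
class Association:
    condition: Condition
    response: EventResponse
    active: bool = False         # flag that turns monitoring on or off


def registered_conditions(associations):
    """Names of Conditions used in at least one active Association."""
    return sorted({a.condition.name for a in associations if a.active})


def responses_to_run(condition_name, associations):
    """Names of Event Responses actively associated with a Condition."""
    return [
        a.response.name
        for a in associations
        if a.active and a.condition.name == condition_name
    ]


tmp_full = Condition("/tmp space used", "PercentTotUsed > 90", "PercentTotUsed < 85")
var_full = Condition("/var space used", "PercentTotUsed > 90", "PercentTotUsed < 85")
log = EventResponse("LogEventsAnyTime", ("logevent",))
mail = EventResponse("EmailEventsToRootAnyTime", ("notifyevent",))

associations = [
    Association(tmp_full, log, active=True),
    Association(tmp_full, mail, active=False),
    Association(var_full, mail, active=False),
]
print(registered_conditions(associations))                # ['/tmp space used']
print(responses_to_run("/tmp space used", associations))  # ['LogEventsAnyTime']
```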


File System Resource Manager

The File System resource manager (FSRM) manages file systems. It can do the following:

There is one File System resource manager (FSRM) on a node. It is started implicitly by the RMC subsystem and is run only when an attribute of an FSRM resource class is monitored (thus cutting down on performance overhead).

To enforce security, only root can start the FSRM resource manager (although it is strongly recommended that the FSRM resource manager not be started manually). Security is governed by the RMC daemon, which authenticates clients and performs authorization checks. No security audits are generated, and no encryption mechanisms are used. The FSRM communicates only with other local subsystems on the same node and with the RMC subsystem. The FSRM has no direct contact with clients.

Information is handled as follows:

These attributes of a file system resource can be monitored:

OpState
Monitors whether the current file system operational state is online (mounted) or offline (unmounted).

PercentTotUsed
Represents the percentage of space that is used in a specific file system so that preventative action can be taken if the amount available is approaching a predefined threshold. For example, /tmp PercentTotUsed, /var PercentTotUsed.

PercentINodeUsed
Represents the percentage of i-nodes that are in use for a specific file system; for example, /tmp PercentINodeUsed.

Predefined Conditions for Monitoring File Systems

The following predefined conditions and example expressions are used to monitor the file system:

File system state
Event expression: OpState != 1
Event description: An event is generated when any file system goes offline.
Rearm expression: OpState == 1
Rearm description: The event is rearmed when any file system comes back online.
Monitored resources: all
Notes: n/a

File system i-nodes used
Event expression: PercentINodeUsed > 90
Event description: An event is generated when more than 90% of the total i-nodes in any file system are in use.
Rearm expression: PercentINodeUsed < 85
Rearm description: The event is rearmed when the percentage of i-nodes used in the file system falls below 85%.
Monitored resources: all
Notes: n/a

File system space used
Event expression: PercentTotUsed > 90
Event description: An event is generated when more than 90% of the total space of any file system is in use.
Rearm expression: PercentTotUsed < 85
Rearm description: The event is rearmed when the space used in the file system falls below 85%.
Monitored resources: all
Notes: n/a

/tmp space used
Event expression: PercentTotUsed > 90
Event description: An event is generated when more than 90% of the total space in the /tmp file system is in use.
Rearm expression: PercentTotUsed < 85
Rearm description: The event is rearmed when the space used in the /tmp file system falls below 85%.
Monitored resources: /tmp
Notes: n/a

/var space used
Event expression: PercentTotUsed > 90
Event description: An event is generated when more than 90% of the total space in the /var file system is in use.
Rearm expression: PercentTotUsed < 85
Rearm description: The event is rearmed when the space used in the /var file system falls below 85%.
Monitored resources: /var
Notes: n/a

AnyNodeFileSystemInodesUsed
Event expression: PercentINodeUsed > 90
Event description: An event is generated when more than 90% of the total i-nodes in the file system are in use.
Rearm expression: PercentINodeUsed < 75
Rearm description: The event is rearmed when the percentage of i-nodes used in the file system falls below 75%.
Monitored resources: all
Notes: n/a

AnyNodeFileSystemSpaceUsed
Event expression: PercentTotUsed > 90
Event description: An event is generated when more than 90% of the total space of the file system is in use.
Rearm expression: PercentTotUsed < 75
Rearm description: The event is rearmed when the percentage of space used in the file system falls below 75%.
Monitored resources: all
Notes: n/a

AnyNodeTmpSpaceUsed
Event expression: PercentTotUsed > 90
Event description: An event is generated when more than 90% of the total space in the /tmp directory is in use.
Rearm expression: PercentTotUsed < 75
Rearm description: The event is rearmed when the percentage of space used in the /tmp directory falls below 75%.
Monitored resources: /tmp
Notes: Use Name='/tmp' for the select string.

AnyNodeVarSpaceUsed
Event expression: PercentTotUsed > 90
Event description: An event is generated when more than 90% of the total space in the /var directory is in use.
Rearm expression: PercentTotUsed < 75
Rearm description: The event is rearmed when the percentage of space used in the /var directory falls below 75%.
Monitored resources: /var
Notes: Use Name='/var' for the select string.
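The way an event expression and a rearm expression work together can be illustrated with a short simulation. This is a sketch under stated assumptions rather than the RMC implementation: the monitor function is hypothetical, and it assumes that after an event is generated only the rearm expression is evaluated until it becomes true, at which point evaluation switches back to the event expression.

```python
def monitor(samples, event_test, rearm_test):
    """Return (index, kind) notifications for a series of attribute values."""
    notifications = []
    armed = True   # True: watching the event expression; False: the rearm expression
    for index, value in enumerate(samples):
        if armed and event_test(value):
            notifications.append((index, "event"))
            armed = False
        elif not armed and rearm_test(value):
            notifications.append((index, "rearm"))
            armed = True
    return notifications


# Successive PercentTotUsed observations for one file system (made-up data).
percent_tot_used = [70, 91, 95, 88, 92, 84, 93]

result = monitor(
    percent_tot_used,
    event_test=lambda v: v > 90,   # PercentTotUsed > 90
    rearm_test=lambda v: v < 85,   # PercentTotUsed < 85
)
print(result)   # [(1, 'event'), (5, 'rearm'), (6, 'event')]
```

The gap between the two thresholds keeps a value that hovers around 90% from generating a new event on every observation: the samples 95, 88, and 92 produce no notifications because the condition has not yet been rearmed.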

Host Resource Manager

The Host resource manager allows system resources for an individual machine to be monitored, particularly resources related to operating system load and status.

The Host resource manager is started implicitly by the RMC subsystem only when an attribute of a Host resource class is first monitored (thus cutting down on performance overhead).

Security is governed by the RMC daemon, which authenticates clients and performs authorization checks. The Host resource manager runs as root. No security audits are generated, no encryption mechanisms are used, and there is no communication outside the node. The RMC daemon detects any unsuccessful authentication or authorization attempts. All interprocess communication is accomplished through pipes and shared memory.

Information is handled as follows:

The Host resource manager consumes minimal system resources during normal operation. This is because the following approaches have been implemented:

  1. Memory, CPU, and other system resources are not consumed for attributes that are not monitored. If no attributes are monitored, the Host resource manager is not started.
  2. To minimize disk access, information is maintained in memory as much as possible.
  3. The sampling of attribute values is aligned as much as possible to minimize the sampling overhead, in particular, thread or process context swaps.

The Host resource manager has the following resource classes that you can use to monitor system resources:

Host (IBM.Host)
This resource class externalizes the attributes of a machine that is running a single copy of an operating system. Primarily the attributes included are those that are advantageous in predicting or indicating when corrective action needs to be taken. See Host Resource Class for more details.

Program (IBM.Program)
This resource class allows a client to monitor attributes of a program that is running on a host. The program to monitor is identified by properties such as program name, arguments, etc. The resource class does not monitor processes as such because processes are very transient and therefore inefficient to monitor individually. See Program Resource Class for more details.

Host Resource Class

The program name of this resource class is IBM.Host. It allows the following resources of a host system to be monitored:

  1. Global state of active paging spaces (see Monitoring the Global State of Active Paging Space).
  2. Total processor utilization across all active processors in the system (see Monitoring Processor Utilization).

Monitoring the Global State of Active Paging Space

The following attribute monitors the percentage of paging space in use:

PctTotalPgSpUsed
Represents the percentage of paging space in use for all active paging space devices in the system.

Predefined Conditions for Monitoring Global State of Active Paging Space

The following predefined condition and example expressions are available for monitoring paging space:

Paging percent space used
Event expression: PctTotalPgSpUsed > 90
Event description: An event is generated when more than 90% of the total paging space is in use.
Rearm expression: PctTotalPgSpUsed < 85
Rearm description: The event is rearmed when the percentage falls below 85%.

Monitoring Processor Utilization

The values represented for this attribute reflect total processor utilization across all of the active processors in a system.

This attribute can be monitored:

PctTotalTimeIdle
Represents the system-wide percentage of time that the processors are idle.

Predefined Conditions for Monitoring Processor Utilization

The following predefined condition and example expressions are available for monitoring system-wide processor idle time:

Processor idle time
Event expression: PctTotalTimeIdle >= 70
Event description: An event is generated when, on average, the processors are idle at least 70% of the time.
Rearm expression: PctTotalTimeIdle < 10
Rearm description: The event is rearmed when the idle time decreases below 10%.

Program Resource Class

The program name of this resource class is IBM.Program. This resource class can monitor a set of processes that are running a specific program or command whose attributes match a filter criterion. The filter criterion includes the real or effective user name of the process, arguments that the process was started with, etc. The primary aspect of a program resource that can be monitored is the set of processes that meet the program definition. A client can be informed when processes with the properties that meet the program definition are initiated and when they are terminated. This resource class typically is used to detect when a required subsystem encounters a problem so that recovery actions can be performed and the administrator can be notified.

Program Definition

A program definition requires the program name and the user name of the owner of the program. The program should be identified by user name in addition to program name to avoid confusion when two or more programs have the same name. These attributes are defined as follows:

ProgramName
Identifies the name of the command or program to be monitored. The program name is the base name of the file containing the program. This name is displayed by the ps command when the -l flag or -o comm is specified. Note that the program name displayed by ps when the -f flag or -o args is specified may not be the same as the base name of the file containing the program.

Filter
Specifies a filter that selects a subset of all processes running the program identified by the attribute ProgramName. For example, the filter may limit the process set to those processes that are running ProgramName under the user name foo.
Note:
Process IDs are not used to specify programs because they are transient and have no prior correlation with the program being run, nor can the restart of a program be detected because there is no way to anticipate the process ID that would be assigned to the restarted application.

For a process to match a program definition and thus be considered to be running the program, its name must match the ProgramName attribute value. In addition, the expression defined by the Filter attribute must evaluate to TRUE by using the properties of the process. The Filter attribute is a string that consists of the names of various properties of a process, comparison operators, and literal values. For example, a value of user==greg restricts the process set to those processes that run ProgramName under the user ID greg. The syntax for the Filter value is the same as for a selection string.

For more information on selection strings, see Using Expressions.

Processes must have a minimum duration (approximately 15 seconds) to be monitored by the IBM.Program resource class. (If a program runs for only a few seconds, all processes that run the program may not be detected.)

This attribute can be monitored: Processes

These elements of the Processes attribute can be monitored:

CurPidCount
Represents the number of processes that currently match the program definition and thus are considered to be running the program.

PrevPidCount
Represents the number of processes that matched the program definition at the last state change (previous value of CurPidCount).

CurrentList
Contains a list of IDs for the processes that currently match the program definition and thus are considered to be running the program.

ChangeList
Contains a list of IDs for the processes that were added to or removed from the CurrentList since the last state change. Whether the list represents additions or deletions can be determined by comparing CurPidCount and PrevPidCount. If CurPidCount is greater, this list contains additions; otherwise, it contains deletions. Additions and deletions are not combined in the same state change.
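The rule for interpreting ChangeList can be written out directly. The following is a small sketch with hypothetical function names; it assumes the four elements of the Processes attribute have already been read into plain Python values.

```python
def classify_change(cur_pid_count, prev_pid_count):
    """Say whether ChangeList holds additions or deletions."""
    # If CurPidCount is greater, the list contains additions;
    # otherwise, it contains deletions.
    return "added" if cur_pid_count > prev_pid_count else "removed"


def previous_list(current_list, change_list, cur_pid_count, prev_pid_count):
    """Reconstruct the process IDs that matched before the state change."""
    if classify_change(cur_pid_count, prev_pid_count) == "added":
        return sorted(set(current_list) - set(change_list))
    return sorted(set(current_list) | set(change_list))


# One process (9074) ended, so the count dropped from 6 to 5.
current = [7786, 8040, 8300, 8558, 8816]
change = [9074]
print(classify_change(5, 6))                 # removed
print(previous_list(current, change, 5, 6))  # [7786, 8040, 8300, 8558, 8816, 9074]
```

Because additions and deletions are never combined in one state change, a single comparison of the two counts is enough to interpret the list.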

For example, assume the six processes shown in the following ps output are running the biod program on node 1:

ps -e -o "ruser,pid,ppid,comm" | grep biod

root    7786    8040 biod
root    8040    5624 biod
root    8300    8040 biod
root    8558    8040 biod
root    8816    8040 biod
root    9074    8040 biod

To be informed when the number of processes running the specified program changes, you can define this event expression:

Processes.CurPidCount!=Processes.PrevPidCount

To be informed when no processes are running the specified program, you can define this event expression:

Processes.CurPidCount==0
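To make the matching rule concrete, the sketch below counts the processes in the sample ps output that match a program definition and then evaluates the two event expressions by hand. It is an illustration only: the function name is hypothetical, the filter is reduced to a single user-name comparison, and the real resource manager does not work by parsing ps output.

```python
PS_OUTPUT = """\
root 7786 8040 biod
root 8040 5624 biod
root 8300 8040 biod
root 8558 8040 biod
root 8816 8040 biod
root 9074 8040 biod
"""


def matching_pids(ps_output, program_name, user=None):
    """Return the PIDs whose command name (and, optionally, user) match."""
    pids = []
    for line in ps_output.splitlines():
        fields = line.split()
        if len(fields) != 4:
            continue   # skip lines that are not ruser, pid, ppid, comm
        ruser, pid, _ppid, comm = fields
        if comm == program_name and (user is None or ruser == user):
            pids.append(int(pid))
    return pids


prev_pid_count = 5   # made-up count from the previous state change
cur_pid_count = len(matching_pids(PS_OUTPUT, "biod", user="root"))

print(cur_pid_count)                     # 6
print(cur_pid_count != prev_pid_count)   # True: the number of processes changed
print(cur_pid_count == 0)                # False: the program is still running
```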

Predefined Conditions for Monitoring Programs

This resource class is typically used to detect when a required subsystem encounters a problem so that some recovery action can be performed or an administrator can be notified. The following predefined conditions and example expressions are available for monitoring programs:

sendmail daemon state
Event expression: Processes.CurPidCount <= 0
Event description: An event is generated whenever the sendmail daemon is not running.
Rearm expression: Processes.CurPidCount > 1
Rearm description: The event is rearmed when the sendmail daemon is running.
Monitored resources: sendmail
Notes: n/a

inetd daemon state
Event expression: Processes.CurPidCount <= 0
Event description: An event is generated whenever the inetd daemon is not running.
Rearm expression: Processes.CurPidCount > 1
Rearm description: The event is rearmed when the inetd daemon is running.
Monitored resources: inetd
Notes: n/a

MgmtSvrCfdStatus
Event expression: Processes.CurPidCount <= 0
Event description: An event is generated when the cfengine daemon stops running.
Rearm expression: Processes.CurPidCount > 1
Rearm description: The event is rearmed when the cfengine daemon starts running again.
Monitored resources: CSM Mgmt Server
Notes: Use ProgramName='cfd' for the select string.

AnyNodeCfdStatus
Event expression: Processes.CurPidCount <= 0
Event description: An event is generated when the cfengine daemon stops running.
Rearm expression: Processes.CurPidCount > 1
Rearm description: The event is rearmed when the cfengine daemon starts running again.
Monitored resources: all nodes
Notes: Use ProgramName='cfd' for the select string.

Sensor Resource Manager

The Sensor resource manager makes the output of a user-written script known to the RMC subsystem as a dynamic attribute of a sensor resource. The Sensor resource manager runs the script at a specified interval to update this attribute. Thus, an administrator can set up a user-defined sensor to monitor an attribute of interest and then define Conditions and Responses with associated actions that are performed when the attribute has a certain value. For example, a script can be written to return the number of users logged on to the system. Then an ERRM Condition and Response can be defined to run an action when the number of users logged on exceeds a certain threshold.

Sensor Resource Class

The Sensor resource manager has one class, IBM.Sensor. Each resource in the IBM.Sensor resource class represents one sensor and includes information such as the script command, the user name under which the command is run, and how often it should be run. The output of the script causes a dynamic attribute within the resource to be set. This attribute can then be monitored in the typical way.

See the mksensor man page for details on how to set up a sensor.
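As an illustration of the logged-on users example, the core of such a sensor script might look like the sketch below. Everything here is an assumption made for the example: the function name is hypothetical, the input is text in the general shape of who output (user name in the first column), and distinct user names are counted rather than sessions. How the script must report its value to the Sensor resource manager is not shown; see the mksensor man page for that.

```python
def count_logged_in_users(who_output):
    """Count distinct user names in who-style output (user name in column 1)."""
    users = {line.split()[0] for line in who_output.splitlines() if line.strip()}
    return len(users)


# Made-up sample in the general shape of who output.
SAMPLE = """\
root   pts/0  2024-01-01 09:00
alice  pts/1  2024-01-01 09:05
alice  pts/2  2024-01-01 09:10
"""

count = count_logged_in_users(SAMPLE)
print(count)        # 2
print(count > 1)    # True: a threshold of 1 user would be exceeded
```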

Predefined Condition for Sensor Resource Class

The following predefined condition and example expression are available for the IBM.Sensor resource class:

CFMRootModTimeChanged
Event expression: "String!=\@P"
Event description: An event is generated when a file under /cfmroot is modified, added, or deleted.
Notes: Selection String = 'Name="CFMRootModTime"'
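The event expression for this condition compares the String attribute with something written as @P. The sketch below assumes that @P stands for the previously observed value of the attribute, so that the expression is true whenever the value changes between two observations; that reading, the function name, and the sample values are assumptions, not statements taken from this section.

```python
def change_events(observations):
    """Return the indexes at which a value differs from the one before it."""
    events = []
    for index in range(1, len(observations)):
        # Equivalent in spirit to String != String@P under the assumption above.
        if observations[index] != observations[index - 1]:
            events.append(index)
    return events


# Made-up successive values of the sensor's String attribute.
mod_times = ["1000", "1000", "1005", "1005", "1010"]
print(change_events(mod_times))   # [2, 4]
```

Under this reading no rearm expression is needed, because the comparison becomes false again on the next observation if the value stays the same.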


Predefined Responses

The following predefined responses are shipped as templates or as starting points for monitoring.

See Using Expressions for a summary of the data types and operators that you can use in selection strings for a customized response.

Response Name Command
BroadcastEventsAnyTime /usr/sbin/rsct/bin/wallevent
CForce /opt/csm/bin/cforce -a
EmailEventsToRootAnyTime /usr/sbin/rsct/bin/notifyevent root
DisplayEventsAnyTime /usr/sbin/rsct/bin/displayevent admindesktop:0
LogEventsAnyTime /usr/sbin/rsct/bin/logevent /var/log/csm/systemEvents
MsgEventsToRootAnytime /usr/sbin/rsct/bin/msgevent root

Predefined Commands, Scripts, Utilities, and Files

You can use the following commands, scripts, utilities, and files to control Monitoring on your system. See the command man pages or Cluster Systems Management for Linux Technical Reference for detailed usage information.

ERRM commands

chcondition
Changes any of the attributes of a defined condition.

lscondition
Lists information about one or more conditions.

mkcondition
Creates a new condition definition that can be monitored.

rmcondition
Removes a condition.

chresponse
Adds or deletes the actions of a response or renames a response.

lsresponse
Lists information about one or more responses.

mkresponse
Creates a new response definition with one action.

rmresponse
Removes a response.

lscondresp
Lists information about a condition and its linked responses, if any.

mkcondresp
Creates a link between a condition and one or more responses.

predefined-condresp
Creates or resets the default monitoring conditions and responses.

rmcondresp
Deletes a link between a condition and one or more responses.

startcondresp
Starts monitoring a condition that has one or more linked responses.

stopcondresp
Stops monitoring a condition that has one or more linked responses.

RMC Commands

chrsrc
Changes the attribute values of a resource or resource class.

lsactdef
Lists (displays) action definitions of a resource or resource class.

lsrsrc
Lists (displays) resources or a resource class.

lsrsrcdef
Lists a resource or resource class definition.

mkrsrc
Defines a new resource.

refrsrc
Refreshes the resources within the specified resource class.

rmrsrc
Removes a defined resource.

Scripts and Utilities

ctsnap
Gathers configuration, log, and trace information for the Reliable Scalable Cluster Technology (RSCT) product.

displayevent
Notifies the specified user of an event by displaying it in an X Window System window on the user's terminal.

logevent
Logs event information generated by the Event Response resource manager to a specified log file.

lsaudrec
Lists records from the audit log.

msgevent
Sends a message to the specified user.

notifyevent
Emails event information generated by the Event Response resource manager to a specified user ID.

rmaudrec
Removes records from the audit log.

rmcctrl
Manages the Resource Monitoring and Control (RMC) subsystem.

wallevent
Broadcasts an event or a rearm event to all users who are logged in.

Files

Resource Data Input File
Defines resources and attribute values of a resource or resource class.

rmccli General Information File
Contains information global to the RMC command line interface.

