The Cluster Systems Management (CSM) Monitoring application offers a comprehensive set of monitoring and response capabilities that lets you detect, and in many cases correct, system resource problems such as a critical filesystem becoming full. You can monitor virtually all aspects of your system resources and specify a wide range of actions to be taken when a problem occurs, from simple notification by e-mail to recovery that runs a user-written script. You can specify an unlimited number of actions to be taken in response to an event.
As system administrator, you have a great deal of flexibility in responding to events. You can respond to an event in different ways based on the day of the week and time of day. The following are some examples of how you can use monitoring:
CSM uses Resource Monitoring and Control (RMC) to monitor the system and to perform many of its operations. For information about the command line interface to the RMC subsystem, see Cluster Systems Management for Linux Technical Reference. For RMC diagnostic information, see Recovering from RMC and Resource Manager Problems. For information about authorization and modifying the ACL file, see Security Considerations.
Monitoring lets you detect conditions of interest in the cluster nodes and their associated resources and automatically take action when those conditions occur. The key elements in monitoring are conditions and responses. A condition identifies one or more resources that you want to monitor, such as the /var file system, and the specific resource state you are interested in, such as /var>90% full. A response specifies one or more actions to be taken when the condition is found to be true. Actions can include notification, running commands, and logging.
System resources that you can monitor are organized into general categories called resource classes. Examples of resource classes include Processor, File System, Physical Volume, and Ethernet Device.
Each resource class includes individual system resources that belong to the class. For example, the File System resource class might include these resources:
When a resource is specified for use in a condition, it is called a monitored resource.
Each resource within a resource class also has a set of attributes that you can monitor. For example, the File System resource class has the following attributes available for monitoring:
For a condition, you specify the monitored attribute of the resource in a logical expression that defines a threshold or state of the monitored resource. When the logical expression is true (the threshold is reached or the state becomes true), an event is generated. The logical expression is the event expression of the condition. Event expressions are typically used to monitor potential problems and significant changes in the system. For example, the event expression for a /var space used condition might be PercentTotUsed > 90.
The rearm expression of a condition is optional. A rearm expression typically indicates when the monitored resource has returned to an acceptable state. When the rearm expression is met, monitoring for the condition resumes. If a rearm expression is not specified, then once the event expression becomes true, an event is generated for certain attributes every time the monitored attribute is evaluated.
If a rearm expression is specified, evaluation of the rearm expression starts after the event expression becomes true. When the rearm expression becomes true, a rearm event is generated; then the evaluation of the event expression starts again. For example, if the event expression for a /var space used condition is PercentTotUsed > 90 and the rearm expression is PercentTotUsed < 80, then an event is generated when /var is more than 90% full. The next time the condition is evaluated, the rearm expression is used. When /var is less than 80% full, a rearm event is generated indicating that the condition has been reset, and the event expression is used again to evaluate the condition.
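The alternation between the event expression and the rearm expression can be sketched as a small state machine. This is an illustration only, not the RMC implementation: the function name `monitor`, the sample values, and the tuple format are hypothetical, and the thresholds are the 90 and 80 used in the /var example above.

```python
def monitor(samples, event_threshold=90, rearm_threshold=80):
    """Return (index, kind, value) tuples for a series of PercentTotUsed samples.

    While armed, only the event expression (value > event_threshold) is
    evaluated. After an event, only the rearm expression
    (value < rearm_threshold) is evaluated until it becomes true.
    """
    events = []
    armed = True  # True: evaluate the event expression; False: the rearm expression
    for index, value in enumerate(samples):
        if armed and value > event_threshold:
            events.append((index, "event", value))
            armed = False
        elif not armed and value < rearm_threshold:
            events.append((index, "rearm", value))
            armed = True
    return events


if __name__ == "__main__":
    # Hypothetical readings of /var usage over successive evaluations.
    for item in monitor([70, 85, 92, 95, 85, 79, 91]):
        print(item)
```

Note that the readings 95 and 85 produce nothing: after the first event, the condition stays quiet until usage drops below the rearm threshold.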
See Using Expressions for more information about data types and operators that you can use in an event expression or a rearm expression.
Predefined conditions are provided with the Monitoring application. To create a new condition, set the following condition components:
Condition Component | Description | Example |
Condition name | Required. The name you want to give the condition. | /var space used |
Resource class | Required. The resource class to be monitored. | FileSystem |
Monitored attribute | Optional. The attribute of the resource class to be monitored. If not specified, it will be extracted from the Event expression. | PercentTotUsed |
Monitored resources | Optional. The specific resources in the resource class that are to be monitored. If not specified, the default is all resources in the specified Resource Class. | /var |
Event expression | Required. A logical expression defining the value or state of the monitored attribute that is to generate an event. | PercentTotUsed > 90 |
Event description | Optional. A text description of the event expression. If not specified, the default is a NULL string. | An event occurs when /var is more than 90% full. |
Rearm expression | Optional. When a rearm expression is specified, the rearm expression is evaluated when the event expression becomes true. When the rearm expression becomes true, the event expression is used for evaluation again. If not specified, this condition will only be monitored with the Event expression. | PercentTotUsed < 80 |
Rearm description | Optional. A text description of the rearm expression. If not specified, the default is a NULL string. | A rearm event occurs when /var is less than 80% full. |
Severity | Optional. The severity of the condition: Informational, Warning, or Critical. If not specified, the default is Informational. | Critical |
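The required and optional components in the table, together with their defaults, can be summarized as a data structure. The sketch below is hypothetical: the class name `Condition`, its field names, and the regular expression used to pull the monitored attribute out of the event expression are assumptions made for illustration, not part of CSM or RMC.

```python
import re
from dataclasses import dataclass
from typing import List, Optional

SEVERITIES = ("Informational", "Warning", "Critical")


@dataclass
class Condition:
    # Required components.
    name: str
    resource_class: str
    event_expression: str
    # Optional components, with the defaults described in the table.
    monitored_attribute: Optional[str] = None       # taken from the event expression
    monitored_resources: Optional[List[str]] = None  # None means all resources in the class
    event_description: str = ""
    rearm_expression: Optional[str] = None           # None: event expression only
    rearm_description: str = ""
    severity: str = "Informational"

    def __post_init__(self):
        if self.severity not in SEVERITIES:
            raise ValueError(f"unknown severity: {self.severity!r}")
        if self.monitored_attribute is None:
            # Assumption: the attribute is the first identifier in the expression.
            match = re.match(r"\s*([A-Za-z_]\w*)", self.event_expression)
            self.monitored_attribute = match.group(1) if match else None


if __name__ == "__main__":
    cond = Condition(
        name="/var space used",
        resource_class="FileSystem",
        event_expression="PercentTotUsed > 90",
        rearm_expression="PercentTotUsed < 80",
        monitored_resources=["/var"],
        severity="Critical",
    )
    print(cond)
```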
Finally, a user-defined sensor can be created to monitor an attribute of interest. Conditions and responses can then be defined, with associated actions to be performed when the attribute has a certain value. For example, a script can be written to return the number of users logged on, and a condition and response can be defined so that a specified action is taken when the number of users exceeds a certain threshold.
A response consists of one or more actions to be performed by the system when an event or rearm event occurs for a condition. In the Monitoring application you can use predefined responses or create new responses and associate them with conditions as needed. You can associate multiple responses with one condition, and one response with multiple conditions.
The responses for a condition remain deactivated until you start monitoring for that condition. When you select a condition to start monitoring, you need to activate at least one of its responses. The responses that are not active remain available to be used at another time. This allows you to use different responses for a condition as needed, without having to redefine them.
Predefined responses are provided with the Monitoring application. To create a new response you will need to set the following response and action components:
Response Component | Description | Example |
Response name | The name you want to give the response. | Response for critical conditions |
Actions | One or more actions to be taken as part of the response. | Log events to a file |
Action Component | Description | Example |
Action name | The name of an action to be taken as part of the response. | Send email to the operator |
When in effect | The days and times when this action is to be used to respond to the condition. | 08:00 - 17:00 Monday - Friday |
Use for event, rearm event, or both | Whether the action is to be used to respond to an event, a rearm event, or both. | Event |
Command | The command to be run when an event or rearm event occurs. | A recovery script |
You can associate multiple responses with a condition if you want to define different responses based on when the event occurs. For example, you might have a work day response and a weekend response, each containing one or more actions. Consider how you might respond to a /var space used condition with the following responses. During working hours, you might want to email the operator, run a command, and broadcast a message to users who are logged on. During weekend hours, you might want to email the system administrator and log a message to a file.
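The way the "When in effect" and "Use for event, rearm event, or both" components select actions can be sketched as follows. This is a simplified illustration, not how the Monitoring application stores responses: the `Action` class, its fields, and the flattening of the work day and weekend responses into one list are all assumptions.

```python
from dataclasses import dataclass
from datetime import datetime, time


@dataclass(frozen=True)
class Action:
    name: str
    weekdays: frozenset     # 0 = Monday ... 6 = Sunday, as in datetime.weekday()
    start: time             # start of the window, inclusive
    end: time               # end of the window, exclusive
    use_for: str = "event"  # "event", "rearm", or "both"

    def applies(self, when, kind):
        """True if this action is in effect at `when` for an event of `kind`."""
        return (
            when.weekday() in self.weekdays
            and self.start <= when.time() < self.end
            and self.use_for in (kind, "both")
        )


WORKDAYS = frozenset(range(5))
WEEKEND = frozenset({5, 6})

# Hypothetical actions for the /var space used condition.
ACTIONS = [
    Action("Send email to the operator", WORKDAYS, time(8, 0), time(17, 0)),
    Action("Broadcast message to logged-on users", WORKDAYS, time(8, 0), time(17, 0)),
    Action("Email the system administrator", WEEKEND, time(0, 0), time(23, 59, 59)),
    Action("Log events to a file", WEEKEND, time(0, 0), time(23, 59, 59), use_for="both"),
]


def actions_for(when, kind):
    """Names of the actions that would run for an event or rearm event at `when`."""
    return [action.name for action in ACTIONS if action.applies(when, kind)]


if __name__ == "__main__":
    print(actions_for(datetime(2024, 1, 3, 10, 0), "event"))  # a Wednesday morning
    print(actions_for(datetime(2024, 1, 6, 12, 0), "rearm"))  # a Saturday
```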
After monitoring for the condition begins, the system evaluates the event expression to see if it is true. When the event expression becomes true, an event occurs that automatically notifies all of the associated event responses, which causes each event response to run its defined actions.
The event expression and the rearm expression work together as follows when a condition is monitored. First, the event expression is evaluated. When the event expression becomes true, an event occurs, the specified actions are taken, and the system begins evaluating the rearm expression. When the rearm expression becomes true, the rearm event occurs, which automatically starts the actions defined for the rearm event. The system then returns to evaluating the event expression.
The mechanisms for authentication and authorization that are provided by CSM are described in the following sections.
CSM provides authentication using the Ident protocol. The daemon identd listens for TCP connections on the well-known TCP port 113. Application servers need to connect to this daemon on the host where the client is running. The servers have to provide identd with the local and remote ports. The daemon then returns the identity of the owner of the process connected to the remote port, if it exists. The application servers can then use this identity as the remote client's Unix identity.
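The exchange described above can be sketched with a small client for the Ident protocol as defined in RFC 1413. This is not CSM code: the function names are hypothetical, error handling is minimal, and stripping white space around the returned user identifier is a simplification of the RFC's rules.

```python
import socket

IDENT_PORT = 113  # well-known TCP port for identd


def format_ident_query(port_on_ident_host, port_on_querying_host):
    """Build the query line: the port on the identd host, then the port on our side."""
    return f"{port_on_ident_host}, {port_on_querying_host}\r\n".encode("ascii")


def parse_ident_reply(line):
    """Return the user identifier from a USERID reply, or None for any other reply."""
    fields = [field.strip() for field in line.strip().split(":", 3)]
    if len(fields) == 4 and fields[1].upper() == "USERID":
        return fields[3]  # simplification: surrounding white space is dropped
    return None


def query_ident(client_host, client_port, server_port, timeout=5.0):
    """Ask identd on client_host who owns the connection client_port -> server_port."""
    with socket.create_connection((client_host, IDENT_PORT), timeout=timeout) as sock:
        sock.sendall(format_ident_query(client_port, server_port))
        with sock.makefile("r", encoding="ascii", errors="replace") as reader:
            reply = reader.readline()
    return parse_ident_reply(reply)


if __name__ == "__main__":
    # Sample reply lines in the RFC 1413 format; no network access is needed here.
    print(parse_ident_reply("6193, 23 : USERID : UNIX : stjohns\r\n"))
    print(parse_ident_reply("6195, 23 : ERROR : NO-USER\r\n"))
```

An application server would call `query_ident` with the address and source port of an accepted connection and its own listening port. The answer is only as trustworthy as the identd on the remote host.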
The security infrastructure assumes that identd is running and listening on port 113. Red Hat Linux includes identd. The identd code can be downloaded from one of the following sites:
The /etc/services file should contain the following:
auth 113/tcp authentication tap ident
The /etc/identd.conf file should contain the following comment:
#-- Disable username lookups (only return uid numbers)
#result:uid-only = no
CSM provides authorization in the form of an access-control-list (ACL) file. You can create an ACL file to apply access control to resource classes. If you do not create an ACL file, then the system uses the following default permissions:
OTHER
    root@LOCALHOST    *    rw
    LOCALHOST         *    r
The ACL file is in stanza format. Each stanza begins with the stanza name, which is the name of a resource class. A stanza with the name of OTHER applies to all resource classes that are not otherwise specified in the file.
Each line of the stanza contains a user identifier, an object type, and an optional set of permissions. A stanza line indicates that the user at the host has the permissions to access the resource class or resource instances (or both) for the resource class named by the stanza. The user identifier can have one of the following three forms:
user_name@host_name
host_name
*
A host_name is a fully qualified host domain name or the keyword LOCALHOST. The first form specifies a user running a Resource Monitoring and Control (RMC) application on the named host. If the host name is the keyword LOCALHOST, then the application is running on the same node as the RMC subsystem. The second form specifies any user running an RMC application on the named host. The third form specifies any user running an RMC application on any host.
The object type is one of the characters C, R or *. The letter C indicates that the permissions provide access to the resource class. The letter R indicates that the permissions provide access to all of the resource instances of the class. The asterisk indicates that the permissions provide access to both the resource class and all resource instances of the class.
The permissions provided are represented by one, both, or none of the characters r and w. The letter r indicates that the specified user at the specified host has read permission. The letter w indicates that the specified user at the specified host has write permission. Both letters indicate the user has read and write permission. If the permissions are omitted, then the user does not have access to the objects specified by the type character. Read permission allows you to register and unregister for events, to query attribute values, and to validate resource handles. Write permission allows you to run all other command interfaces. Note that no permissions are needed to query resource class and attribute definitions.
For any command issued against a resource class or its instances, the RMC subsystem examines the lines of the stanza matching the specified class in the order specified in the ACL file. The first line that contains 1) an identifier that matches the user issuing the command and 2) an object type that matches the objects specified by the command is the line used to determine access permissions. Therefore, lines containing more specific user identifiers and object types should be placed before lines containing less specific user identifiers and object types.
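The first-match rule can be sketched as follows. This is an illustration under stated assumptions, not the RMC implementation: the function names are hypothetical, host names are compared as exact strings, the caller says whether the application is local to the RMC subsystem, each check targets either the class ("C") or its resource instances ("R"), and a user who matches no line is assumed to have no access. The ACL data is taken from the examples in this section.

```python
# {stanza_name: [(user_identifier, object_type, permissions), ...]}, in file order.
EXAMPLE_ACL = {
    "Class_A": [
        ("user1@sys1.pok.ibm.com", "R", "rw"),
        ("root@sys1.pok.ibm.com", "*", "rw"),
        ("sys1.pok.ibm.com", "*", "r"),
        ("user1@sys3.pok.ibm.com", "C", "rw"),
        ("user2@sys3.pok.ibm.com", "*", ""),
        ("sys3.pok.ibm.com", "*", "r"),
        ("root@LOCALHOST", "*", "rw"),
    ],
    "Class_B": [("root@LOCALHOST", "*", "rw"), ("*", "*", "r")],
    "OTHER": [("root@sys1.pok.ibm.com", "*", "r"), ("root@LOCALHOST", "*", "rw")],
}


def identifier_matches(identifier, user, host, is_local):
    """Match the three identifier forms: user@host, host, and *."""
    if identifier == "*":
        return True
    if "@" in identifier:
        id_user, id_host = identifier.split("@", 1)
        if id_user != user:
            return False
    else:
        id_host = identifier
    if id_host == "LOCALHOST":
        return is_local
    return id_host == host


def type_matches(object_type, target):
    """target is "C" (the resource class) or "R" (its resource instances)."""
    return object_type in ("*", target)


def permissions_for(acl, resource_class, user, host, is_local, target):
    """Return the permissions from the first matching line; "" means no access."""
    stanza = acl.get(resource_class, acl.get("OTHER", []))
    for identifier, object_type, permissions in stanza:
        if identifier_matches(identifier, user, host, is_local) and type_matches(object_type, target):
            return permissions
    return ""


if __name__ == "__main__":
    for user, target in (("user1", "C"), ("user1", "R"), ("user2", "R"), ("user3", "R")):
        perms = permissions_for(EXAMPLE_ACL, "Class_A", user, "sys3.pok.ibm.com", False, target)
        print(user, target, repr(perms))
```

Because the first matching line wins, moving the `sys3.pok.ibm.com * r` line above the `user2@sys3.pok.ibm.com *` line would give user2 read access, which is why more specific lines belong first.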
A sample ACL file is provided in /usr/sbin/rsct/cfg/ctrmc.acls. This file contains the following default permissions:
OTHER
    root@LOCALHOST    *    rw
    LOCALHOST         *    r
To change these defaults, you must copy the sample ACL file to /var/ct/cfg/ctrmc.acls and put your modifications in that file (or you can create a new ACL file with the same name and location). Then to activate your new permissions, type:
refresh -s ctrmc
Provided there are no errors in the modified ACL file, the permissions will take effect. If errors are found in the modified ACL file, they are logged to /var/ct/IW/log/mc/default.
The following examples show ways the ACL file can be modified.
In the Class_A stanza, user1 at sys3 has permission to read and write the resource class; user2 at sys3 has no permission to access either the resource class or its instances. All other users at sys3 have permission to read both the resource class and all of its instances.
Finally, root on the machine containing the ACL file can read and write both the resource class and all of its resource instances.
Class_A
    user1@sys1.pok.ibm.com    R    rw
    root@sys1.pok.ibm.com     *    rw
    sys1.pok.ibm.com          *    r
    user1@sys3.pok.ibm.com    C    rw
    user2@sys3.pok.ibm.com    *
    sys3.pok.ibm.com          *    r
    root@LOCALHOST            *    rw
Class_B
    root@LOCALHOST            *    rw
    *                         *    r
OTHER
    root@sys1.pok.ibm.com     *    r
    root@LOCALHOST            *    rw
A stanza begins with a line containing the stanza name, which must start in column 1. A stanza line consists of leading white space (one or more blanks and/or tabs), followed by one or more white-space-separated tokens. Comments may be present in the file. Any line in which the first non-white-space character is a pound sign (#) is a comment. Blank lines are also considered comment lines and are ignored. Any part of a line that begins with two consecutive forward slash characters (//), not surrounded by double quotes ("), is considered to be a comment from that point through the end of the line. The stanza lines in an ACL file each contain two or three tokens:
stanza_name
    user_identifier type permissions
    user_identifier type permissions
    ...
    user_identifier type permissions
The permissions token may be omitted.
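The syntax rules above can be sketched as a small parser. It is a hypothetical illustration, not the parser used by the RMC subsystem: the function names and the choice to raise ValueError on malformed lines are assumptions, and the real subsystem logs errors instead.

```python
def strip_comment(line):
    """Remove a trailing // comment unless the slashes are inside double quotes."""
    in_quotes = False
    for index, char in enumerate(line):
        if char == '"':
            in_quotes = not in_quotes
        elif not in_quotes and line.startswith("//", index):
            return line[:index]
    return line


def parse_acl(text):
    """Parse ACL text into {stanza_name: [(user_identifier, type, permissions), ...]}."""
    acl = {}
    current = None
    for lineno, raw in enumerate(text.splitlines(), start=1):
        # Blank lines and lines whose first non-blank character is # are comments.
        if not raw.strip() or raw.lstrip().startswith("#"):
            continue
        line = strip_comment(raw).rstrip()
        if not line.strip():
            continue
        if not line[0].isspace():
            # A stanza name starts in column 1.
            current = line.strip()
            acl.setdefault(current, [])
            continue
        if current is None:
            raise ValueError(f"line {lineno}: stanza line before any stanza name")
        tokens = line.split()
        if len(tokens) not in (2, 3):
            raise ValueError(f"line {lineno}: expected two or three tokens")
        identifier, object_type = tokens[0], tokens[1]
        permissions = tokens[2] if len(tokens) == 3 else ""  # omitted means no access
        if object_type not in ("C", "R", "*") or set(permissions) - set("rw"):
            raise ValueError(f"line {lineno}: bad type or permissions")
        acl[current].append((identifier, object_type, permissions))
    return acl


if __name__ == "__main__":
    sample = (
        "# default permissions\n"
        "OTHER\n"
        "    root@LOCALHOST * rw   // local root\n"
        "    LOCALHOST * r\n"
    )
    print(parse_acl(sample))
```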
For a complete description of the Resource Monitoring and Control components and how to use them, see Components Provided for Monitoring.