Monitoring HOWTO


Overview of Cluster Systems Management

The Cluster Systems Management (CSM) Monitoring application offers a comprehensive set of monitoring and response capabilities that lets you detect, and in many cases correct, system resource problems such as a critical filesystem becoming full. You can monitor virtually all aspects of your system resources and specify a wide range of actions to be taken when a problem occurs, from simple notification by e-mail to recovery that runs a user-written script. You can specify an unlimited number of actions to be taken in response to an event.

As system administrator, you have a great deal of flexibility in responding to events. You can respond to an event in different ways based on the day of the week and time of day. The following are some examples of how you can use monitoring:

CSM uses Resource Monitoring and Control (RMC) to monitor the system and to perform many of its operations. For information about the command line interface to the RMC subsystem, see Cluster Systems Management for Linux Technical Reference. For RMC diagnostic information, see Recovering from RMC and Resource Manager Problems. For information about authorization and modifying the ACL file, see Security Considerations.


Monitoring Concepts

Monitoring lets you detect conditions of interest in the cluster nodes and their associated resources and automatically take action when those conditions occur. The key elements in monitoring are conditions and responses. A condition identifies one or more resources that you want to monitor, such as the /var file system, and the specific resource state you are interested in, such as /var being more than 90% full. A response specifies one or more actions to be taken when the condition is found to be true. Actions can include notification, running commands, and logging.

About Conditions

System resources that you can monitor are organized into general categories called resource classes. Examples of resource classes include Processor, File System, Physical Volume, and Ethernet Device.

Each resource class includes individual system resources that belong to the class. For example, the File System resource class might include these resources:

When a resource is specified for use in a condition, it is called a monitored resource.

Each resource within a resource class also has a set of attributes that you can monitor. For example, the File System resource class has the following attributes available for monitoring:

For a condition, you specify the monitored attribute of the resource in a logical expression that defines a threshold or state of the monitored resource. When the logical expression is true (the threshold is reached or the state becomes true), an event is generated. The logical expression is the event expression of the condition. Event expressions are typically used to monitor potential problems and significant changes in the system. For example, the event expression for a /var space used condition might be PercentTotUsed > 90.

The rearm expression of a condition is optional. A rearm expression typically indicates when the monitored resource has returned to an acceptable state. When the rearm expression is met, monitoring for the condition resumes. If a rearm expression is not specified, then once the event expression becomes true, an event is generated for certain attributes every time the monitored attribute is evaluated.

If a rearm expression is specified, evaluation of the rearm expression starts after the event expression becomes true. When the rearm expression becomes true, a rearm event is generated; then the evaluation of the event expression starts again. For example, if the event expression for a /var space used condition is PercentTotUsed > 90 and the rearm expression is PercentTotUsed < 80, then an event is generated when /var is more than 90% full. The next time the condition is evaluated, the rearm expression is used. When /var is less than 80% full, a rearm event is generated indicating that the condition has been reset, and the event expression is used again to evaluate the condition.
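The alternation between the two expressions can be sketched in a few lines of Python. This is an illustrative simulation only, not RMC code: the function name simulate and the sample readings are hypothetical, and the thresholds are the ones from the /var example.

```python
from typing import Callable, Iterable, List, Tuple


def simulate(
    samples: Iterable[float],
    event_expr: Callable[[float], bool],
    rearm_expr: Callable[[float], bool],
) -> List[Tuple[float, str]]:
    """Return a (sample, kind) pair for each event or rearm event generated."""
    generated = []
    armed = True  # True: evaluate the event expression; False: evaluate the rearm expression
    for value in samples:
        if armed and event_expr(value):
            generated.append((value, "event"))
            armed = False  # from now on only the rearm expression is evaluated
        elif not armed and rearm_expr(value):
            generated.append((value, "rearm event"))
            armed = True   # the event expression is used again
    return generated


# Hypothetical PercentTotUsed readings for /var
readings = [70, 92, 95, 85, 79, 91]
events = simulate(readings, lambda p: p > 90, lambda p: p < 80)
for value, kind in events:
    print(f"{value}% -> {kind}")
```

In this sketch the readings 95 and 85 generate nothing, because only the rearm expression is being evaluated at that point and neither value is below 80.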

See Using Expressions for more information about data types and operators that you can use in an event expression or a rearm expression.

Predefined conditions are provided with the Monitoring application. To create a new condition, set the following condition components:

Condition name
    Description: Required. The name you want to give the condition.
    Example: /var space used

Resource class
    Description: Required. The resource class to be monitored.
    Example: FileSystem

Monitored attribute
    Description: Optional. The attribute of the resource class to be monitored. If not specified, it will be extracted from the Event expression.
    Example: PercentTotUsed

Monitored resources
    Description: Optional. The specific resources in the resource class that are to be monitored. If not specified, the default is all resources in the specified Resource Class.
    Example: /var

Event expression
    Description: Required. A logical expression defining the value or state of the monitored property that is to generate an event.
    Example: PercentTotUsed > 90

Event description
    Description: Optional. A text description of the event expression. If not specified, the default is a NULL string.
    Example: An event occurs when /var is more than 90% full.

Rearm expression
    Description: Optional. When a rearm expression is specified, the rearm expression is evaluated when the event expression becomes true. When the rearm expression becomes true, the event expression is used for evaluation again. If not specified, this condition will only be monitored with the Event expression.
    Example: PercentTotUsed < 80

Rearm description
    Description: Optional. A text description of the rearm expression. If not specified, the default is a NULL string.
    Example: A rearm event occurs when /var is less than 80% full.

Severity
    Description: Optional. The severity of the condition: Informational, Warning, or Critical. If not specified, the default is Informational.
    Example: Critical
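
The components and defaults listed above can be summarized as a small Python data structure. This is only a sketch: the Condition class and its field names are hypothetical and are not part of CSM or RMC, and taking the first identifier in the event expression as the monitored attribute is an assumption about how that default is derived.

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Condition:
    name: str                                    # required
    resource_class: str                          # required
    event_expression: str                        # required
    monitored_attribute: Optional[str] = None    # default: taken from event_expression
    monitored_resources: List[str] = field(default_factory=list)  # empty = all resources in the class
    event_description: str = ""                  # default: empty (NULL) string
    rearm_expression: Optional[str] = None       # default: only the event expression is used
    rearm_description: str = ""                  # default: empty (NULL) string
    severity: str = "Informational"              # default severity

    def __post_init__(self) -> None:
        if self.severity not in ("Informational", "Warning", "Critical"):
            raise ValueError(f"unknown severity: {self.severity}")
        if self.monitored_attribute is None:
            # Assumption: the attribute is the first identifier in the expression.
            match = re.match(r"\s*([A-Za-z_]\w*)", self.event_expression)
            if match is None:
                raise ValueError("cannot find an attribute name in the event expression")
            self.monitored_attribute = match.group(1)


var_full = Condition(
    name="/var space used",
    resource_class="FileSystem",
    event_expression="PercentTotUsed > 90",
    monitored_resources=["/var"],
    rearm_expression="PercentTotUsed < 80",
    severity="Critical",
)
print(var_full.monitored_attribute)
```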

Finally, a user-defined sensor can be created to monitor an attribute of interest. Conditions and responses can then be defined, with associated actions to be performed when the attribute has a certain value. For example, a script can be written to return the number of users logged on, and a condition and response can be defined so that a specified action is taken when the number of users exceeds a certain threshold.
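
A script of the kind described might look like the following Python sketch. It counts distinct user names in text formatted like the output of the who command. The function names, the sample data, and the threshold are hypothetical, and how a sensor script must report its value to RMC is not covered here.

```python
import subprocess


def count_logged_on_users(who_output: str) -> int:
    """Count distinct user names in text formatted like the output of `who`."""
    users = {line.split()[0] for line in who_output.splitlines() if line.strip()}
    return len(users)


def current_user_count() -> int:
    """Run `who` and count the users currently logged on."""
    result = subprocess.run(["who"], capture_output=True, text=True, check=True)
    return count_logged_on_users(result.stdout)


# Hypothetical sample of `who` output; alice has two sessions but counts once.
sample = (
    "root     tty1         2001-06-01 08:15\n"
    "alice    pts/0        2001-06-01 09:02\n"
    "alice    pts/1        2001-06-01 09:40\n"
    "bob      pts/2        2001-06-01 10:11\n"
)
MAX_USERS = 2  # hypothetical threshold that a condition might test
count = count_logged_on_users(sample)
print(count, count > MAX_USERS)
```

Counting distinct names rather than sessions is a design choice in this sketch; a site that cares about the number of sessions would count lines instead.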

About Responses

A response consists of one or more actions to be performed by the system when an event or rearm event occurs for a condition. In the Monitoring application you can use predefined responses or create new responses and associate them with conditions as needed. You can associate multiple responses with one condition, and one response with multiple conditions.

The responses for a condition remain deactivated until you start monitoring for that condition. When you select a condition to start monitoring, you need to activate at least one of its responses. The responses that are not active remain available to be used at another time. This allows you to use different responses for a condition as needed, without having to redefine them.

Predefined responses are provided with the Monitoring application. To create a new response you will need to set the following response and action components:


Response components:

Response name
    Description: The name you want to give the response.
    Example: Response for critical conditions

Actions
    Description: One or more actions to be taken as part of the response.
    Example: Log events to a file

Action components:

Action name
    Description: The name of an action to be taken as part of the response.
    Example: Send email to the operator

When in effect
    Description: The days and times when this action is to be used to respond to the condition.
    Example: 08:00 - 17:00 Monday - Friday

Use for event, rearm event, or both
    Description: Whether the action is to be used to respond to an event, a rearm event, or both.
    Example: Event

Command
    Description: The command to be run when an event or rearm event occurs.
    Example: A recovery script

You can associate multiple responses with a condition if you want to define different responses based on when the event occurs. For example, you might have a work day response and a weekend response, each containing one or more actions. Consider how you might respond to a /var space used condition with the following responses. During working hours, you might want to email the operator, run a command, and broadcast a message to users who are logged on. During weekend hours, you might want to email the system administrator and log a message to a file.
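
The work day and weekend example can be modeled with a short Python sketch. It is a simplification for illustration: the Action class, its fields, and the actions_for function are hypothetical, all actions are kept in one flat list rather than grouped into responses, and the time windows are assumptions based on the "When in effect" example.

```python
from dataclasses import dataclass
from datetime import datetime, time
from typing import List, Set


@dataclass
class Action:
    name: str
    days: Set[int]   # weekday numbers, Monday = 0 ... Sunday = 6
    start: time      # start of the window in which the action is in effect
    end: time        # end of the window (inclusive)
    use_for: str     # "event", "rearm event", or "both"

    def applies(self, when: datetime, kind: str) -> bool:
        in_window = when.weekday() in self.days and self.start <= when.time() <= self.end
        return in_window and self.use_for in (kind, "both")


WORK_DAYS = {0, 1, 2, 3, 4}
WEEKEND = {5, 6}
actions = [
    Action("Send email to the operator", WORK_DAYS, time(8, 0), time(17, 0), "event"),
    Action("Broadcast a message to users", WORK_DAYS, time(8, 0), time(17, 0), "event"),
    Action("Send email to the system administrator", WEEKEND, time.min, time.max, "both"),
    Action("Log events to a file", WEEKEND, time.min, time.max, "both"),
]


def actions_for(when: datetime, kind: str) -> List[str]:
    """Return the names of the actions in effect for this moment and event kind."""
    return [a.name for a in actions if a.applies(when, kind)]


print(actions_for(datetime(2024, 1, 3, 10, 30), "event"))  # a Wednesday morning
print(actions_for(datetime(2024, 1, 6, 22, 0), "event"))   # a Saturday evening
```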

How Conditions and Responses Work Together

After monitoring for the condition begins, the system evaluates the event expression to see if it is true. When the event expression becomes true, an event occurs that automatically notifies all of the associated event responses, which causes each event response to run its defined actions.

The event expression and the rearm expression work together as follows when a condition is monitored. First, the event expression is evaluated. When the event expression becomes true, an event occurs, the specified actions are taken, and the system begins evaluating the rearm expression instead. When the rearm expression becomes true, the rearm event occurs, which automatically starts the actions defined for the rearm event. The system then returns to evaluating the event expression.


Security Considerations

The mechanisms for authentication and authorization that are provided by CSM are described in the following sections.

Authentication

CSM provides authentication using the Ident protocol. The identd daemon listens for TCP connections on the well-known TCP port 113. Application servers need to connect to this daemon on the host where the client is running. The servers have to provide identd with the local and remote ports. The daemon then returns the identity of the owner of the process connected to the remote port, if it exists. The application servers can then use this identity as the remote client's Unix identity.
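
The exchange with identd can be sketched in Python as follows. The wire format shown is the one defined for the Ident protocol in RFC 1413; the function names are hypothetical, and this is not the code CSM itself uses. The sample reply lines are illustrative.

```python
import socket
from typing import Optional

IDENT_PORT = 113


def build_query(port_on_ident_host: int, port_on_querier: int) -> bytes:
    """Format the query line: the port on the identd host, then the port on the querying host."""
    return f"{port_on_ident_host}, {port_on_querier}\r\n".encode("ascii")


def parse_response(line: str) -> Optional[str]:
    """Return the user ID from a USERID reply, or None for an ERROR or malformed reply.

    Strict RFC 1413 parsing treats everything after the third colon as the
    user ID; this sketch strips surrounding whitespace for convenience.
    """
    fields = [f.strip() for f in line.strip().split(":", 3)]
    if len(fields) == 4 and fields[1] == "USERID":
        return fields[3]
    return None


def query_identd(host: str, port_on_ident_host: int, port_on_querier: int,
                 timeout: float = 5.0) -> Optional[str]:
    """Ask identd on `host` who owns the given connection (needs a reachable identd)."""
    with socket.create_connection((host, IDENT_PORT), timeout=timeout) as conn:
        conn.sendall(build_query(port_on_ident_host, port_on_querier))
        reply = conn.makefile("r", encoding="ascii", errors="replace").readline()
    return parse_response(reply)


print(build_query(6193, 23))
print(parse_response("6193, 23 : USERID : UNIX : stjohns"))
print(parse_response("6195, 23 : ERROR : NO-USER"))
```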

The security infrastructure assumes that identd is running and listening on port 113. Red Hat Linux includes identd. The identd code can be downloaded from one of the following sites:

identd needs to be started from /etc/rc.d/init.d.

The /etc/services file should contain the following:

auth        113/tcp        authentication tap ident

The /etc/identd.conf file should contain the following comment:

#-- Disable username lookups (only return uid numbers)
#result:uid-only = no

Authorization

CSM provides authorization in the form of an access-control-list (ACL) file. You can create an ACL file to apply access control to resource classes. If you do not create an ACL file, then the system uses the following default permissions:

OTHER
   root@LOCALHOST      *   rw
   LOCALHOST           *   r

The ACL file is in stanza format. Each stanza begins with the stanza name, which is the name of a resource class. A stanza with the name of OTHER applies to all resource classes that are not otherwise specified in the file.

Each line of the stanza contains a user identifier, an object type, and an optional set of permissions. A stanza line indicates that the user at the host has the permissions to access the resource class or resource instances (or both) for the resource class named by the stanza. The user identifier can have one of the following three forms:

  1. user_name@host_name
    
  2. host_name
    
  3. *

A host_name is a fully qualified host domain name or the keyword LOCALHOST. The first form specifies a user running a Resource Monitoring and Control (RMC) application on the named host. If the host name is the keyword LOCALHOST, then the application is running on the same node as the RMC subsystem. The second form specifies any user running an RMC application on the named host. The third form specifies any user running an RMC application on any host.

The object type is one of the characters C, R or *. The letter C indicates that the permissions provide access to the resource class. The letter R indicates that the permissions provide access to all of the resource instances of the class. The asterisk indicates that the permissions provide access to both the resource class and all resource instances of the class.

The permissions provided are represented by one, both, or none of the characters r and w. The letter r indicates that the specified user at the specified host has read permission. The letter w indicates that the specified user at the specified host has write permission. Both letters indicate the user has read and write permission. If the permissions are omitted, then the user does not have access to the objects specified by the type character. Read permission allows you to register and unregister for events, to query attribute values, and to validate resource handles. Write permission allows you to run all other command interfaces. Note that no permissions are needed to query resource class and attribute definitions.

For any command issued against a resource class or its instances, the RMC subsystem examines the lines of the stanza matching the specified class in the order specified in the ACL file. The first line that contains 1) an identifier that matches the user issuing the command and 2) an object type that matches the objects specified by the command is the line used to determine access permissions. Therefore, lines containing more specific user identifiers and object types should be placed before lines containing less specific user identifiers and object types.
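
The first-match rule can be illustrated with a small Python sketch. The function names are hypothetical and this is not the RMC implementation; the stanza data mirrors the Class_A example in this document, the local host name is made up, and returning None when no line matches is an assumption, since the text does not say what happens in that case.

```python
from typing import List, Optional, Tuple

AclLine = Tuple[str, str, str]  # (user identifier, object type, permissions)


def identifier_matches(identifier: str, user: str, host: str, local_host: str) -> bool:
    """Match the three identifier forms: user@host, host, and *."""
    if identifier == "*":
        return True
    acl_user, sep, acl_host = identifier.rpartition("@")
    if acl_host == "LOCALHOST":
        acl_host = local_host  # keyword for the node running the RMC subsystem
    if acl_host != host:
        return False
    return not sep or acl_user == user  # no "@" means any user on that host


def permissions_for(stanza: List[AclLine], user: str, host: str,
                    target: str, local_host: str) -> Optional[str]:
    """Return the permissions on the first matching line; target is "C" or "R"."""
    for identifier, acl_type, perms in stanza:
        type_ok = acl_type == "*" or acl_type == target
        if type_ok and identifier_matches(identifier, user, host, local_host):
            return perms  # later lines are never consulted
    return None  # assumption: no matching line means no access


class_a = [
    ("user1@sys1.pok.ibm.com", "R", "rw"),
    ("root@sys1.pok.ibm.com", "*", "rw"),
    ("sys1.pok.ibm.com", "*", "r"),
    ("user1@sys3.pok.ibm.com", "C", "rw"),
    ("user2@sys3.pok.ibm.com", "*", ""),
    ("sys3.pok.ibm.com", "*", "r"),
    ("root@LOCALHOST", "*", "rw"),
]
LOCAL = "mgmt.pok.ibm.com"  # hypothetical name of the node holding the ACL file

print(permissions_for(class_a, "user1", "sys1.pok.ibm.com", "R", LOCAL))
print(permissions_for(class_a, "user1", "sys1.pok.ibm.com", "C", LOCAL))
print(repr(permissions_for(class_a, "user2", "sys3.pok.ibm.com", "R", LOCAL)))
```

Because the search stops at the first matching line, moving the sys3.pok.ibm.com line above the user2 line in this list would give user2 read access, which is why the more specific lines come first.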

How to Create and Modify the ACL File

A sample ACL file is provided in /usr/sbin/rsct/cfg/ctrmc.acls. This file contains the following default permissions:

OTHER
   root@LOCALHOST      *   rw
   LOCALHOST           *   r

To change these defaults, you must copy the sample ACL file to /var/ct/cfg/ctrmc.acls and put your modifications in that file (or you can create a new ACL file with the same name and location). Then to activate your new permissions, type:

refresh -s ctrmc

Provided there are no errors in the modified ACL file, the permissions will take effect. If errors are found in the modified ACL file, they are logged to /var/ct/IW/log/mc/default.

Examples of ACL File Stanzas

The following examples show ways the ACL file can be modified.

  1. For resource class Class_A, user1 at sys1 has permission to read and write all resource instances, and root at sys1 has permission to read and write both the resource class and all of its resource instances. All other users at sys1 have permission to read both the resource class and all of its instances. Note that this gives user1 permission to read the resource class.

    At sys3, user1 has permission to read and write the resource class; user2 has no permission to access either the resource class or its instances. All other users at sys3 have permission to read both the resource class and all of its instances.

    Note:
    If the line containing user2's user ID and the following line were positionally reversed, then the line containing user2's ID would be rendered ineffective.

    Finally, root on the machine containing the ACL file can read and write both the resource class and all of its resource instances.

    Class_A
       user1@sys1.pok.ibm.com    R    rw
       root@sys1.pok.ibm.com     *    rw
       sys1.pok.ibm.com          *    r

       user1@sys3.pok.ibm.com    C    rw
       user2@sys3.pok.ibm.com    *
       sys3.pok.ibm.com          *    r

       root@LOCALHOST            *    rw

  2. For Class_B, root on the machine containing the ACL file can read and write both the resource class and all of its resource instances. All other users on all hosts can read both the resource class and all of its resource instances.

    Class_B
       root@LOCALHOST     *   rw
       *                  *   r

  3. For all other resource classes (represented by OTHER), root at sys1 has permission to read both the resource class and all of its resource instances, and root on the machine containing the ACL file can read and write both the resource class and all of its resource instances.

    OTHER
       root@sys1.pok.ibm.com      *   r
       root@LOCALHOST             *   rw
    

ACL File Stanza Syntax

A stanza begins with a line containing the stanza name, which must start in column 1. A stanza line consists of leading white space (one or more blanks, tabs, or both), followed by one or more white-space-separated tokens. Comments may be present in the file. Any line in which the first non-white-space character is a pound sign (#) is a comment. Blank lines are also considered comment lines and are ignored. Any part of a line that begins with two consecutive forward slash characters (//), not surrounded by double quotes ("), is considered to be a comment from that point through the end of the line. The stanza lines in an ACL file each contain two or three tokens:

stanza_name
     user_identifier    type    permissions
     user_identifier    type    permissions
          .
          .
     user_identifier    type    permissions
The permissions token may be omitted.
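
The syntax rules above are enough to write a small reader for the file. The following Python sketch is one possible interpretation, not the parser used by the RMC subsystem: the function names are hypothetical, and rejecting a stanza line that has other than two or three tokens is an assumption about error handling.

```python
from typing import Dict, List, Optional, Tuple

AclLine = Tuple[str, str, str]  # (user identifier, object type, permissions)


def strip_comment(line: str) -> str:
    """Remove a trailing // comment unless the slashes are inside double quotes."""
    in_quotes = False
    for i, ch in enumerate(line):
        if ch == '"':
            in_quotes = not in_quotes
        elif not in_quotes and line.startswith("//", i):
            return line[:i]
    return line


def parse_acl(text: str) -> Dict[str, List[AclLine]]:
    """Parse ACL file text into {stanza name: [(identifier, type, permissions), ...]}."""
    stanzas: Dict[str, List[AclLine]] = {}
    current: Optional[List[AclLine]] = None
    for raw in text.splitlines():
        if not raw.strip() or raw.lstrip().startswith("#"):
            continue  # blank lines and # lines are comments
        line = strip_comment(raw).rstrip()
        if not line.strip():
            continue  # the line held only a // comment
        if not line[0].isspace():
            # Text starting in column 1 is a stanza name.
            current = stanzas.setdefault(line.strip(), [])
            continue
        tokens = line.split()
        if current is None or len(tokens) not in (2, 3):
            raise ValueError(f"malformed stanza line: {raw!r}")
        perms = tokens[2] if len(tokens) == 3 else ""  # omitted permissions = no access
        current.append((tokens[0], tokens[1], perms))
    return stanzas


sample = """\
# default permissions
OTHER
   root@LOCALHOST   *   rw   // full access for local root
   LOCALHOST        *   r

Class_A
   user2@sys3.pok.ibm.com   *
"""
acl = parse_acl(sample)
print(acl)
```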

For a complete description of the Resource Monitoring and Control components and how to use them, see Components Provided for Monitoring.

