Dealing with stuck processes

A process is said to be "stuck" when it cannot proceed because it is waiting for an event that cannot, or does not, occur. There are several possible causes:

Application design errors

A stuck process may be caused by a program logic error. For example, consider the following scenarios:

  1. Outstanding user events:
    1. One of the process’s activities returns from what it believes to be its final activation. It issues an EXEC CICS RETURN command without the ENDACTIVITY option.
    2. There are no events on the activity’s reattachment queue, but there is a user event in its event pool.
    3. There is no means for the event to be fired. Perhaps it is an input event which has fired, caused reattachment, and been retrieved, but which the activity has neglected to delete.

    In a case like this, the activity becomes dormant, and there is no way for it to be reactivated. The process is stuck.

    The recommended way to prevent this scenario is to add the ENDACTIVITY option to the EXEC CICS RETURN command that ends the final activation of the activity. Coding RETURN ENDACTIVITY deletes any outstanding events--other than activity completion events for child activities, which the activity must deal with properly--and allows the activity to complete normally.

  2. Waiting for an external interaction:

    A user-related activity returns from its initial activation and becomes dormant, waiting for an external interaction to occur. (User-related activities are described in Acquiring an activity.) However, the expected user input doesn’t happen. Perhaps the clerk is sick, or the data she requires is not available. The process is stuck.

    The recommended way to recover from this scenario is to set a timer which, if the expected external interaction does not occur within a specified period, will cause the activity (or its parent) to be reactivated anyway.

  3. Timer error:

    A programming error results in a timer being set to expire in five days rather than five minutes. The process is stuck. See Restarting stuck processes.

    Note:
    To force a timer to expire before its specified time, use the FORCE TIMER command.

Restarting stuck processes

For advice on restarting processes that are stuck because of unserviceable requests, see Dealing with unserviceable requests.

For advice on restarting processes that are stuck because of a CICS failure, see Dealing with CICS failures.

Using activity timers

The best way to restart processes that are stuck for other reasons--including application errors--is to use timers. For example, a parent may set a timer which will cause it to be reactivated after a specified period, if a particular child activity does not complete. (The parent names the timer in a way that associates it with a particular child. If the child completes within the specified period, the parent deletes the timer.)
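
The bookkeeping described above can be modeled in a few lines. This is a minimal sketch in Python, not CICS code: the class `ParentActivity`, the `T-` timer-name prefix, and the child names are hypothetical, and a real parent would use the BTS timer commands rather than a dictionary.

```python
from dataclasses import dataclass, field


@dataclass
class ParentActivity:
    """Toy model of a parent activity's timers (not the CICS API)."""
    timers: dict = field(default_factory=dict)  # timer name -> expiry time in seconds

    @staticmethod
    def timer_name(child: str) -> str:
        # Name the timer after the child so that the two can be associated.
        return f"T-{child}"

    def run_child(self, child: str, now: float, limit: float) -> None:
        # Set a timer at the moment the child is started.
        self.timers[self.timer_name(child)] = now + limit

    def on_child_complete(self, child: str) -> None:
        # The child completed within the period, so delete its timer.
        self.timers.pop(self.timer_name(child), None)

    def expired_children(self, now: float) -> list:
        # Children whose timers have expired are candidates for recovery.
        return sorted(name[2:] for name, expiry in self.timers.items() if expiry <= now)


parent = ParentActivity()
parent.run_child("CREDIT", now=0, limit=300)
parent.run_child("STOCK", now=0, limit=300)
parent.on_child_complete("STOCK")
print(parent.expired_children(now=301))  # ['CREDIT']
```

Because each timer name is derived from the child's name, the parent can tell from the expired timer alone which child needs attention when it is reactivated.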

One reason for making the application responsible for restarting itself is that it is difficult from outside a process to tell whether the process is stuck or merely dormant, particularly if the process is long-lived. Processes of different types may have varying "natural" lifespans, and these lifespans may vary according to system load, availability of remote regions, and so on. The application itself is best placed to know how long each of its activities should run before it can be assumed to be stuck.

You will probably not want to set timers for all your activities. For example, you might think it unnecessary to set a timer for a simple activity that completes its processing in one activation, has no children, and is to be run synchronously. On the other hand, you might want to set a timer for an activity to which one or more of the following apply:

Using process timers

As well as, or instead of, setting timers for individual child activities, you could set a timer for the process itself. That is, the root activity could set a timer with an expiry time some time after the whole process could reasonably be expected to have completed.

If the process is short-lived, you may decide not to set any activity timers, but to set a process timer instead.

If the process is long-lived, do not set a process timer without also setting timers for at least some individual activities. Otherwise, there could be a long delay before a stuck process is restarted. For example, if a process that is expected to last six months becomes stuck after one day while processing its first activity, and you have set only a process timer, the process could lie dormant for, say, seven months before the root activity is reactivated to deal with the problem.
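
The arithmetic behind this example can be checked with a short calculation. The figures below are assumptions chosen for illustration: a process timer of 210 days (roughly seven months) and a hypothetical three-day timer on the first activity.

```python
from datetime import timedelta


def dormant_time(stuck_after: timedelta, timer_after: timedelta) -> timedelta:
    """Time a stuck process waits before the given timer reactivates it."""
    return max(timer_after - stuck_after, timedelta(0))


stuck_after = timedelta(days=1)        # the process sticks after one day
process_timer = timedelta(days=210)    # roughly seven months (assumed value)
activity_timer = timedelta(days=3)     # hypothetical limit for the first activity

print(dormant_time(stuck_after, process_timer).days)   # 209
print(dormant_time(stuck_after, activity_timer).days)  # 2
```

With only the process timer, the problem goes unnoticed for about seven months; with an activity timer as well, it is noticed within days.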

If the root activity is activated by the process timer, it could, for example:

  1. Browse and inquire on each of its descendant activities, checking completion status and mode. (For examples of the use of the BTS browsing and inquiry commands, see Browsing examples.)
  2. If it succeeds in identifying the stuck activity, issue a CANCEL command to cancel it. (If the stuck activity is not a child but a lower-level descendant of the root activity, the root must first acquire the stuck activity.)
  3. The stuck activity’s completion event fires, causing the parent activity to be reactivated. The CHECK ACTIVITY command issued by the parent returns a completion status of FORCED. The parent should be coded to handle the abnormal completion of one of its children. The process is no longer stuck.

Using status containers

To make it easier for a root activity to identify which of its descendant activities are stuck, you could use status containers. Status containers are simply data-containers that contain information about what an activity is currently doing. Whereas you can use an INQUIRE ACTIVITYID command to discover the mode and completion status of an activity, the information in a status container is likely to be at a more detailed level. For example, each activity in a process might have a data-container called, perhaps, STATUS, which it regularly updates--perhaps at the beginning and end of each activation, and each time it starts new work. A status container might, for instance, contain the date and time, and a string describing the work that the activity has just started or ended, or the fact that it is dormant because it is waiting for the completion of a particular child activity.

You can think of an activity as a finite state machine--it will always be in one of a limited number of processing states. (The "processing states" we refer to here are application-dependent and quite distinct from the BTS-defined modes of an activity.) Each activity could regularly update its status container with its current processing state.
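
As an illustration, the contents of a status container could be laid out as follows. This is a sketch in Python under stated assumptions: the state names, the `StatusRecord` class, and the `|`-separated text layout are all hypothetical, and an application is free to choose any states and any layout.

```python
from dataclasses import dataclass
from datetime import datetime
from enum import Enum


class State(Enum):
    # Hypothetical application-defined processing states (not BTS modes).
    STARTING = "STARTING"
    WAITING_FOR_CHILD = "WAITING_FOR_CHILD"
    WAITING_FOR_USER = "WAITING_FOR_USER"
    FINISHING = "FINISHING"


@dataclass
class StatusRecord:
    """What an activity might write to its STATUS container."""
    state: State
    updated: datetime
    detail: str = ""

    def encode(self) -> str:
        # Flat text layout: timestamp, state, free-form detail.
        return f"{self.updated.isoformat()}|{self.state.value}|{self.detail}"

    @classmethod
    def decode(cls, text: str) -> "StatusRecord":
        # Split at most twice so that the detail may itself contain "|".
        stamp, state, detail = text.split("|", 2)
        return cls(State(state), datetime.fromisoformat(stamp), detail)


record = StatusRecord(State.WAITING_FOR_CHILD, datetime(2024, 1, 15, 9, 30), "child=CREDIT")
print(record.encode())  # 2024-01-15T09:30:00|WAITING_FOR_CHILD|child=CREDIT
```

Keeping the timestamp and the state in fixed positions makes it straightforward for another program to decide how long the activity has been in its current state.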

Using a utility program

We have said that it is difficult from outside a process to tell whether the process is stuck or merely dormant. To help you decide, you can use a utility program.

CICS-supplied utility programs

CICS supplies two utility programs for diagnostic purposes:

The audit trail utility, DFHATUP
You can use DFHATUP to print selected audit records from a logstream. If you use auditing to track the progress of your processes across the sysplex, you could print the audit records of a stuck process to investigate it.

DFHATUP is described in Creating a BTS audit trail.

The repository utility, DFHBARUP
You can use DFHBARUP to print selected records from a repository. To investigate a stuck process, you could print its repository records.

DFHBARUP is described in Examining BTS repository records.

User-written utility programs

You could write a utility program to check for and restart stuck processes, particularly if your activities use status containers. Your utility program could, for example:

  1. Browse all processes of a specified process-type.
  2. Browse the descendant activities of each process returned in step 1.
  3. Inquire on the status data-container of each activity, and retrieve its contents.
  4. Identify a stuck activity from the contents of its status container.
  5. Issue an ACQUIRE command to acquire the stuck activity.
  6. Issue a CANCEL command to cancel the stuck activity. The cancelled activity’s completion event fires, causing its parent to be reactivated. The CHECK ACTIVITY command issued by the parent returns a completion status of FORCED. The parent should be coded to handle the abnormal completion of one of its children. The process is no longer stuck.
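
The decision logic of steps 1 through 4 can be sketched as follows. This is a Python model under stated assumptions, not a CICS program: the browse and inquiry results are represented by an in-memory dictionary, and the function `find_stuck`, the state name, and the two-day limit are hypothetical. Steps 5 and 6 would be carried out with the ACQUIRE and CANCEL commands and are not modeled here.

```python
from datetime import datetime, timedelta


def find_stuck(processes, now, limits):
    """Return (process, activity) pairs whose status looks stale.

    processes: {process name: {activity name: (state, time of last update)}}
    limits:    {state: longest time an activity may remain in that state}
    """
    stuck = []
    for process, activities in processes.items():               # step 1
        for activity, (state, updated) in activities.items():   # steps 2 and 3
            limit = limits.get(state)
            # Step 4: a state with no configured limit is never treated as stuck.
            if limit is not None and now - updated > limit:
                stuck.append((process, activity))
    return stuck


now = datetime(2024, 1, 20, 12, 0)
processes = {
    "ORDER001": {
        "CREDIT": ("WAITING_FOR_USER", datetime(2024, 1, 10, 9, 0)),
        "STOCK": ("WAITING_FOR_USER", datetime(2024, 1, 20, 9, 0)),
    },
}
limits = {"WAITING_FOR_USER": timedelta(days=2)}

print(find_stuck(processes, now, limits))  # [('ORDER001', 'CREDIT')]
```

Setting the limit per state rather than per activity reflects the point made earlier: how long an activity may reasonably wait depends on what it is waiting for.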

Related tasks
Dealing with activity abends
Dealing with unserviceable requests
Dealing with CICS failures