This topic deals with recovery from failures that arise from the use of the coupling facility and that affect CICS® units of work. It covers:
A coupling facility cache structure failure affects only data sets opened in RLS mode.
SMSVSAM supports cache set definitions that allow you to define multiple cache structures within a cache set across one or more coupling facilities. To insure against a cache structure failure, use at least two coupling facilities and define each cache structure, within the cache set, on a different coupling facility.
In the event of a cache structure failure, SMSVSAM attempts to rebuild the structure. If the rebuild fails, SMSVSAM switches data sets that were using the failed structure to use another cache structure in the cache set. If SMSVSAM is successful in either rebuilding or switching to another cache structure, processing continues normally, and the failure is transparent to CICS regions. Because the cache is used as a store-through cache, no committed data has been lost.
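The order of recovery attempts described above can be pictured with a small Python model. This is an illustrative sketch only, not SMSVSAM code or any IBM-supplied interface: the function name, the outcome strings, and the structure names are all hypothetical, and whether a rebuild succeeds is simply passed in as an input.

```python
def recover_cache(cache_set, failed, rebuild_succeeds, unavailable=()):
    """Model the outcome of a cache structure failure (hypothetical sketch).

    cache_set        -- list of cache structure names in the cache set
    failed           -- name of the structure that failed
    rebuild_succeeds -- whether the rebuild attempt works (an input, not computed)
    unavailable      -- other structures in the cache set that cannot be used
    """
    # First choice: rebuild the failed structure.
    if rebuild_succeeds:
        return ("rebuilt", failed)
    # Second choice: switch to another cache structure in the same cache set.
    for name in cache_set:
        if name != failed and name not in unavailable:
            return ("switched", name)
    # Neither worked: the failure is no longer transparent to CICS.
    return ("error_reported_to_cics", None)
```

In this model, the first two outcomes correspond to the cases in which the failure is transparent to CICS regions. The third outcome is the case in which RLS cannot recover and the error is reported to CICS.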
The support for rebuilding cache structures enables coupling facility storage to be used effectively. It is not necessary to reserve space for a rebuild to recover from a cache structure failure; SMSVSAM uses any available space.
If RLS is unable to recover from the cache failure for any reason, the error is reported to CICS when it tries to access a data set that is bound to the failed cache, and CICS issues message DFHFC0162 followed by DFHFC0158. CICS defers any activity on data sets bound to the failed cache by abending UOWs that attempt to access the data sets. When "cache failed" responses are encountered during dynamic backout of the abended UOWs, CICS invokes backout failure support (see Backout-failed recovery). RLS open requests for data sets that must bind to the failed cache, and RLS record access requests for open data sets that are already bound to the failed cache, receive error responses from SMSVSAM.
When either the failed cache becomes available again, or SMSVSAM is able to connect to another cache in a data set’s cache set, CICS is notified by the SMSVSAM quiesce protocols. CICS then retries all backouts that were deferred because of cache failures.
Whenever CICS is notified that a cache is available, it also drives backout retries for other types of backout failure, because this notification provides an opportunity to complete backouts that may have failed for some transient condition.
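The retry behavior can be sketched as a simple queue of deferred backouts that is driven whenever a cache-available notification arrives. The following Python model is an assumption-laden illustration, not CICS internals: the class name, the reason strings, and the `try_backout` callback are hypothetical.

```python
class BackoutRetryQueue:
    """Hypothetical model of backouts deferred after a backout failure."""

    def __init__(self):
        # Each entry is (uow_id, reason); reasons are illustrative labels only.
        self.deferred = []

    def defer(self, uow_id, reason):
        self.deferred.append((uow_id, reason))

    def on_cache_available(self, try_backout):
        """Retry every deferred backout, whatever the original failure reason.

        try_backout(uow_id) is a caller-supplied function that returns True
        if the backout now completes. Returns the UOWs that completed.
        """
        completed = []
        still_deferred = []
        for uow_id, reason in self.deferred:
            if try_backout(uow_id):
                completed.append(uow_id)
            else:
                still_deferred.append((uow_id, reason))
        self.deferred = still_deferred
        return completed
```

The model retries all entries, not only those deferred because of a cache failure, which reflects the point that the notification is also an opportunity to complete backouts that failed for some other transient condition.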
CICS recovers after a cache failure automatically. There is no need for manual intervention (other than the prerequisite action of resolving the underlying cause of the cache failure).
The failure of a coupling facility lock structure that cannot be rebuilt by VSAM creates the lost locks condition. The lost locks condition can occur only for data sets opened in RLS mode. A data set can be made available for general use only after all the CICS regions that were accessing the data set at the time of the lock structure failure have completed the process known as lost locks recovery.
When a coupling facility lock structure fails, the SMSVSAM servers attempt to rebuild their locks in the lock structure from their own locally-held copies of the locks. If this is successful, the failure is transparent to CICS.
If the rebuild fails, all SMSVSAM servers abend and restart, but they are not available for service until they can successfully connect to a new coupling facility lock structure. Thus a lock structure failure is initially detected by CICS as an SMSVSAM server failure, and CICS issues message DFHFC0153.
When the SMSVSAM servers abend because of this failure, the sharing control data set is updated to reflect the lost locks condition. The sharing control data set records:
When the SMSVSAM servers are able to connect to a new lock structure, they use the MVS ENF to notify the CICS regions that their SMSVSAM server is available again. CICS is informed during dynamic RLS restart about those data sets for which it must perform lost locks recovery. CICS issues a message (DFHFC0555) to inform you that lost locks recovery is to be performed for one or more data sets.
If a lost-locks condition occurs and is not resolved when a CICS restart (warm or emergency) occurs, CICS is notified during file control restart about any data sets for which it must perform lost locks recovery. On a cold start, CICS does not perform any lost locks recovery, and the information in the sharing control data set, which records that CICS must complete lost locks recovery, is cleared for each data set. This does not affect the information recorded for other CICS regions.
Only UOWs performing lost locks recovery can use data sets affected by lost locks. Error responses are returned on open requests issued by any CICS region that was not sharing the data set at the time the lost locks condition occurred, and on RLS access requests issued by any new UOWs in CICS regions that were sharing the data set.
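These access rules can be summarized as a small decision function. The Python sketch below is a hypothetical model, not a CICS or SMSVSAM interface. It assumes that an open request from a region that was sharing the data set is accepted so that recovery can proceed, which the text above implies but does not state.

```python
def lost_locks_response(request, region_was_sharing, uow_is_recovering=False):
    """Return "ok" or "error" for a request against a data set in lost locks.

    request            -- "open" or "access" (hypothetical labels)
    region_was_sharing -- True if the region was sharing the data set when
                          the lost locks condition occurred
    uow_is_recovering  -- True if the UOW is performing lost locks recovery
    """
    # Regions that were not sharing the data set cannot open it.
    if not region_was_sharing:
        return "error"
    # In sharing regions, only UOWs performing recovery may access records.
    if request == "access" and not uow_is_recovering:
        return "error"
    return "ok"
```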
Lost locks recovery requires that any UOWs that had been updating the data set at the time of the failure must complete before the data set can be made available for general use. This is because their updates are no longer protected by the record locks, so access by other UOWs and other CICS regions cannot be permitted until the updates have been either committed or backed out. Therefore, each CICS region performing lost locks recovery must complete all UOWs before notifying VSAM that it has completed all affected units of work.
CICS takes the following actions during dynamic RLS restart to expedite lost locks recovery:
Following any failure of the SMSVSAM server, CICS abends and backs out any UOWs that had made RLS requests before the failure and that then attempt further RLS requests when the restarted SMSVSAM server is available. They are not backed out until they make a further request to RLS.
In the case of an SMSVSAM server failure that is caused by a lock structure failure, this would mean that in-flight units of work could delay the recovery from the lost locks condition until the UOWs make further RLS updates. To avoid this potential delay, CICS purges the transactions to expedite lost locks recovery. CICS issues message DFHFC0171 if any in-flight transactions cannot be purged, warning that lost locks recovery could potentially be delayed.
When a CICS region has completed lost locks recovery for a data set, it informs SMSVSAM. This is done once for each data set. When all CICS regions have informed SMSVSAM that they have completed their lost locks recovery for a particular data set, that data set is no longer in a lost locks condition, and is made available for general access. Although the lost locks condition affects simultaneously all data sets in use when the lock structure fails, each data set can be restored to service individually as soon as all its sharing CICS regions have completed lost locks recovery.
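The per-data-set completion rule can be modeled by tracking, for each data set, the regions that still have to report completion. The following Python sketch is illustrative only: the class, its methods, and the data set and region names are hypothetical and do not represent how the sharing control data set is actually structured.

```python
class LostLocksTracker:
    """Hypothetical model of lost locks recovery status per data set."""

    def __init__(self, sharers):
        # sharers maps a data set name to the regions that were sharing it
        # when the lock structure failed.
        self.pending = {ds: set(regions) for ds, regions in sharers.items()}

    def region_completed(self, data_set, region):
        """Record that a region has finished recovery for one data set.

        Returns True if the data set is now available for general access.
        """
        self.pending[data_set].discard(region)
        return self.is_available(data_set)

    def is_available(self, data_set):
        # Available only when no sharing region still has recovery to do.
        return not self.pending.get(data_set)
```

For example, with `LostLocksTracker({"DS.A": ["CICSA", "CICSB"], "DS.B": ["CICSA"]})`, data set `DS.B` becomes available as soon as `CICSA` reports completion for it, while `DS.A` remains in the lost locks condition until `CICSB` has also reported.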
If connection to a coupling facility cache structure is lost, DFSMS™ attempts to rebuild the cache in a structure to which all the SMSVSAM servers have connectivity. If the rebuild is successful, the failure is transparent to CICS.
If DFSMS is unable to recover transparently from a connectivity failure to a coupling facility cache structure, CICS issues message DFHFC0163 (followed by DFHFC0158) on detecting the condition. The recovery process from this failure is the same as for a cache structure failure.
If an SMSVSAM server loses connectivity to the coupling facility lock structure, and it is not possible to rebuild locks in another lock structure to which all the SMSVSAM servers in the sysplex have access, the SMSVSAM server abends and restarts itself. CICS issues message DFHFC0153 when it detects that the server is not available for service.
The restarted SMSVSAM is not available for service until it is successfully connected to its coupling facility lock structure. When it does become available, recovery follows dynamic RLS restart in the same way as for any other server failure, because no lock information has been lost.
When an MVS image fails, all CICS regions and the SMSVSAM server in that image also fail. All RLS locks belonging to CICS regions in the image are retained by SMSVSAM servers in the other MVS systems.
When the MVS image restarts, recovery for all resources is through CICS emergency restart. If any CICS region completes emergency restart before the SMSVSAM server becomes available, it performs dynamic RLS restart as soon as the server is available.
The surviving MVS images should be affected by the failure only to the extent that more work is routed to them. Also, tasks that attempt to access records that are locked by CICS regions in the failed MVS image receive the LOCKED response.
If all the MVS images in a sysplex fail, the first SMSVSAM server to restart reconnects to the lock structure in the coupling facility and converts all the locks into retained locks for the whole sysplex.
Recovery from the failure of a sysplex is just the equivalent of multiple MVS failure recoveries.