IBM(R) Multipath Subsystem Device Driver Path Control Module (PCM)
Version 2.6.2.1

README FOR AIX

December 12, 2011
---------------------------------------------------------------------------
CONTENTS

1.0 About this README file
    1.1 Who should read this README file
    1.2 How to get latest support version information
2.0 Prerequisites for SDDPCM
3.0 SDDPCM change history
    3.1 Defects Fixed
    3.2 New Features
    3.3 Feature Details
4.0 User license agreement for IBM device drivers
    4.1 Background / Purpose
    4.2 Definitions
    4.3 License grant
    4.4 Responsibilities
    4.5 Confidential information
    4.6 Limitation of liability
    4.7 Termination
    4.8 Representations and warranties
    4.9 General provisions
    4.10 Appendix A
5.0 Notices
6.0 Trademarks and service marks
---------------------------------------------------------------------------
1.0 About this README file

Welcome to IBM Multipath Subsystem Device Driver Path Control Module
(SDDPCM). This README file contains the most recent information about the
IBM Multipath Subsystem Device Driver PCM, Version 2 Release 6
Modification 2 Level 1 (SDDPCM 2.6.2.1) for AIX.

IBM recommends that you print and review the contents of this README file
before installing and using SDDPCM on AIX with an MPIO-capable disk driver.

1.1 Who should read this README file

This README file is in general intended for storage administrators,
system programmers, and performance and capacity analysts.

The information in this file only applies to customers who run:

1. IBM BladeCenter S SAS RAID Controller Module

Please refer to the SDD support matrix to determine the supported SDD
levels for the above storage subsystem.

1.2 How to get latest support version information

Go to the following Web site for SDD/SDDPCM technical support and for the
most current SDD documentation and support information:

http://www.ibm.com/servers/storage/support/software/sdd/
---------------------------------------------------------------------------
2.0 Prerequisites for SDDPCM

You must install the following host attachment for SDDPCM to support IBM
BladeCenter S SAS RAID Controller Module on AIX blade servers.

IBM BladeCenter S SAS RAID Controller Module:
    devices.sas.disk.ibm.mpio.rte (version 1.0.0.0)

Note: Go to the following Web site for the latest RSSM document, which
      lists the supported blades:
      http://www.ibm.com/systems/support/supportsite.wss/docdisplay?lndocid=MIGR-5078491&brandind=5000020

The following lists of required AIX and VIOS APARs and service packs are
cumulative. APARs required in earlier versions of SDDPCM packages are also
required for later versions of SDDPCM packages, unless specifically noted
as 'no longer applicable' or 'not required'.

sddpcm 2.6.0.0 or higher supports ESS/DS8000/DS6000/SVC/DS4000/DS5000/
DS5020/DS3950/IBM BladeCenter S SAS RAID Controller Module storage
devices. It requires the following AIX releases and ifix if the AIX blade
server has IBM BladeCenter S SAS RAID Controller Module devices
configured:

AIX Level        APAR
AIX61 TL03-02    IZ68569, IZ62617, IZ68572, IZ70078
AIX61 TL04-01    IZ68571, IZ68574, IZ70079

For VIO servers:

sddpcm 2.6.0.0 or higher supports VIO Servers connected to IBM BladeCenter
S SAS RAID Controller Module devices. Please contact the IBM AIX support
team for the appropriate VIOS level.
---------------------------------------------------------------------------
3.0 SDDPCM change history
===============================================================================
3.1 Defects Fixed

v2.6.0.0
---------
4722 (AIX defect 736797) Need to synchronize device cfg/uncfg with pcmsrv
     device interface calls to prevent system crashing.
4723 pcmsrv checking trace buffer interval is reduced from 5 seconds to
     1 second to prevent losing trace data.
4727 smit mpio change path function fails on Active/Active storage
     devices.

v2.6.0.1
---------
4748 SVC Concurrent Code Download (CCDL) or node reset may fail
     application in VIOS environment due to the change made in the AIX
     disk driver for handling check condition sense code from SVC.
4750 Incorrect adapter/port state (FAILED) after CLOSED path being removed
4766 Failed to handle SC_TRANSPORT_FAULT adapter error for non-RSSM
     devices.
4769 Need to set path to FAILED state if healthchecker command hit TIMEOUT
     (A/P)
4770 Setting hc_interval/hc_mode for DS4K/DSK devices caused system crash
     or hanging (A/P)
4772 System hang when controller failover occurred on open device
4781 (PMR40765) Need to retry inquiry command when hit ACA_ACTIVE SCSI
     error, to avoid path open failure in dual VIOS environment
4782 (PMR33750) System hung when mounting remote FS in GPFS environment
     with DS4K/5K devices
4833 Newly added path(s) may be set to OFFLINE state mistakenly when a new
     FC HBA is configured

v2.6.0.2
---------
4729 Deinstalling the SDDPCM package stops pcmsrv automatically. Stopping
     pcmsrv prior to SDDPCM deinstallation is not required.
4810 New script commands are provided to start and stop pcmsrv daemon.
4867 Unmatched lock/unlock call caused system crash
4868 (PMR30514) Replacing a controller on DS4K/5K storage subsystem caused
     system crash
4895 (PMR40765) Need to return EBUSY to delay path open retry calls,
     allowing the disk driver to have time to clear ACA_ACTIVE condition
     on the device.
4907 Distinguish device state byte with different FW level, in order to
     handle LUN unmap scenario correctly. (A/P SDDPCM)
4908 Fix buffer overrun problem to close TCP/IP security issue when TCP/IP
     port is enabled.

v2.6.0.3
--------
4850 query port does not display 2 hdisks of disk pair from one machine
4929 Only fail IO if both prefer and non-prefer paths are unavailable
4930 Adapter state is displayed as Degraded after removing a few paths
4940 Fix buffer overrun in pcmsrv tracing code

v2.6.0.4
--------
4950 Add '-s' option for 'pcmpath query portmap' and 'pcmpath query
     essmap' commands to refresh the information after LUN configuration
     or sddpcm upgrade
4953 Add 'sddpcm_get_config' and 'manage_disk_driver' commands in
     sddpcmgetdata script
4955 Allow 'pcmpath enable ports' or 'pcmpath disable ports' command to
     put paths on closed devices online/offline
4956 (AIX Defect 777020) SDDPCM should not switch dump path after selected
     as long as the path is functional
4960 Correct the sense data length caused by incorrect documentation
     (A/P SDDPCM)
4963 (CR186949/AIX Defect 779402) Need to clear Mode Sense Done flag after
     path open failed, to prevent incorrect path configuration, which
     caused I/O failure (A/P SDDPCM)

v2.6.1.0
--------
4961 Support FC hba 'add_adap_status' new value - SCSI_ADAP_NOT_PATH_ERR,
     to avoid counting path error by mistake.
4970 DS45K Check Condition with '8B02' sense code should be logged as
     'INFO' type
4979 respawnpcmsrv consumes too much CPU usage, that impacts I/O
     performance
4981 checking last path error count may use incorrect path pointer
4984 Display Obsidian SAS wwpn for PS70x blade connected to RSSM
4989 DS45K device may put last path in FAILED state when all the
     connections are disconnected.

v2.6.2.0
--------
5032 (AIX Defect 798502) pcmquerypr usage display needs to correct a few
     typos
5038 (AIX Defect 798746) Command hung when issuing R/W to a DS4K/5K LUN
     which is reserved with SCSI-2 by other initiator under Failover_Only
     path selection algorithm (A/P SDDPCM)

v2.6.2.1
--------
5054 (PMR80851) Use SRC mkssys "-R" option to create pcmsrv subsystem to
     allow SRC to restart pcmsrv if pcmsrv stops abnormally
5065 For RSSM only, keep retrying IOs on last path until it exceeds
     retry_timeout setting.
===============================================================================
3.2 New Features

2.4.0.5
---------
4585 Add a new device attribute - 'retry_timeout', which sets the timeout
     value for I/O retry on last path.

2.5.0.0
---------
4546 Add a new command 'lspcmcfg' to display SDDPCM device information
4556 Add DS5020 new device model support

2.5.1.0
---------
4666 Add DS3950 new device model support
4676 Support DS4000/5000/DS5020/DS3950 devices as SAN Boot disks and the
     AIX 'manage_disk_drivers' tool to switch drivers for
     DS4000/5000/DS5020/DS3950 storage devices configuration.
     Note: This function requires certain AIX APARs to be installed on
     the system. APAR information for different AIX OS or TL is listed
     above under the 'Prerequisites for SDDPCM' section.

2.5.2.0
---------
4713 Support DS4K and DS5K family device controller health check function
4721 Add two new pcmpath commands to dynamically change controller health
     check attributes

2.6.0.0
---------
4573 Support 'pcmpath query wwpn' command to display SAS adapter WWPN
4717 Support IBM BladeCenter S SAS RAID Controller Module R2.1 GA
4713 Support DS4K and DS5K family device controller health check function
4721 Add two new pcmpath commands to dynamically change controller health
     check attributes

2.6.0.3
---------
4934 Support DS8800 8 port FC adapter by 'pcmpath query portmap' and
     'pcmpath query essmap' commands.

2.6.1.0
---------
4959 Add device attribute to allow user to dynamically turn ON/OFF health
     check recovery of Distributed Error Detection (DED) failed paths.
     Default setting is 'no'.

2.6.2.0
---------
4993 (MR02114111425) Add new 'pcmpath query device' option '-s' to display
     device flashcopy status. To support this, a new device attribute
     'flashcpy_tgtvol' is added.
     Note: This feature does not support SAS RSSM devices.
===============================================================================
3.3 Feature/defect Details

4585 Starting from v2.4.0.3, a new device attribute - 'retry_timeout' is
     added for ESS/DS6K/DS8K/SVC devices. This attribute allows user to
     set the timeout value for I/O retry on the last path. The default
     value of this attribute is 120 seconds and it is user-changeable
     with the valid range of 30 to 600 seconds.

     pcmpath provides a new CLI command to dynamically change this device
     retry_timeout attribute. The syntax of this command:

       pcmpath set device /( ) retry_timeout

     This feature enables user to adjust the timeout value to control how
     long SDDPCM will retry the I/O on the last path before it fails the
     I/O to the application.

     When access to a device is permanently lost on all the paths,
     commands fail with a TIMEOUT error, and each command takes
     rw_timeout time to fail. Setting retry_timeout to the default value
     (120 seconds) or a smaller value enables fast path failure and
     avoids the I/O hanging problem under this condition.

     When the loss of access to a device is only temporary, the
     retry_timeout value may need to be set to a larger value. For
     example, during certain storage concurrent code download or storage
     warmstart, it is possible that all the paths to a device may lose
     access for a short period of time, such as from a few seconds to a
     few minutes. In such a situation, I/O should be retried on the last
     path for an extended period of time to prevent failing the I/O back
     to the application.
     This will open the window for path recovery from temporary errors.

4713 This feature is for Active/Passive storage devices, which include the
     DS4K and DS5K family of devices. This controller health check
     function, when enabled, is invoked when an enabled path becomes
     unavailable due to transport problems. Enabling this feature results
     in faster I/O failure time when the access to a LUN is lost. The
     faster failure time is controlled by one of the controller health
     check attributes, the cntl_delay_time setting.

     By default, this feature is DISABLED. To enable this feature, the
     user needs to set the following two ODM attributes to non-zero
     values for the active/passive storage devices:

     cntl_delay_time:
         The amount of time, in seconds, that the storage device's
         controller(s) will be health checked after a transport failure.
         At the end of this period, if no paths are detected as good,
         then all pending and subsequent I/O to the device will be
         failed, until the device health checker detects that a failed
         path has recovered.

     cntl_hcheck_int:
         The first controller health check will only be issued after a
         storage fabric transport failure has been detected.
         cntl_hcheck_int is the amount of time, in seconds, after which
         the next controller health check command will be issued. This
         value must be less than the cntl_delay_time (unless set to "0",
         to disable this feature).

     NOTE: Setting either value to "0" disables this feature. Setting
           cntl_delay_time to '1' also disables this feature.

     For example, if you wish to allow the storage device 30 seconds to
     come back on the fabric (after leaving the fabric), then you can set
     cntl_delay_time=30 and cntl_hcheck_int=2.

     The device, /dev/hdisk#, must not be in use when setting the ODM
     values with the 'chdev' command (or the chdev "-P" option must be
     used, which requires a reboot for the change to take effect).

     CAUTION: There are cases where the storage device may reboot both of
     the controllers and become inaccessible for a period of time.
     If the controller health check feature is enabled and
     'cntl_delay_time' is set too short, this may result in an I/O
     failure. It is recommended to make sure you have a mirrored volume
     to fail over to, or a GPFS configuration, if you are running with
     controller health check enabled and the cntl_delay_time setting is
     under 60 seconds, or is not long enough to cover the temporary
     device inaccessible conditions that may occur during the storage
     concurrent code load operation or other error injection operations.

4721 This feature allows user to change both controller health check ODM
     attributes dynamically (using the 'chdev' command requires
     unconfiguring and reconfiguring the device).

     pcmpath set device cntlhc_interval

     The pcmpath set device cntlhc_interval command dynamically changes
     the Active/Passive MPIO device controller health check time interval
     or disables this feature.

     Syntax

       pcmpath set device <num1> [num2] cntlhc_interval <t>

     Parameters

     num1 [ num2 ]
       * When only num1 is specified, the command applies to the hdisk
         specified by num1.
       * When 2 device logical numbers are entered, this command applies
         to all the Active/Passive devices whose logical numbers fit
         within the range of the two device logical numbers.

     t
       The range of supported values for controller health check time
       interval is 0-300 seconds. Setting the value to 0 will disable
       this feature.

     Examples

       If you enter pcmpath set device 2 10 cntlhc_interval 3, the
       controller health check time interval of hdisk2 to hdisk10 is
       immediately changed to 3 seconds, if hdisk2 to hdisk10 are all
       Active/Passive devices.

     pcmpath set device cntlhc_delay

     The pcmpath set device cntlhc_delay command dynamically changes the
     Active/Passive MPIO device controller health check delay time or
     disables this feature.

     Syntax

       pcmpath set device <num1> [num2] cntlhc_delay <t>

     Parameters

     num1 [ num2 ]
       * When only num1 is specified, the command applies to the hdisk
         specified by num1.
       * When 2 device logical numbers are entered, this command applies
         to all the Active/Passive devices whose logical numbers fit
         within the range of the two device logical numbers.

     t
       The range of supported values for controller health check delay
       time is 0-300 seconds. Setting the value to 0 will disable this
       feature.

     Examples

       If you enter pcmpath set device 2 10 cntlhc_delay 30, the
       controller health check delay time of hdisk2 to hdisk10 is
       immediately changed to 30 seconds, if hdisk2 to hdisk10 are all
       Active/Passive devices.

     Note: If cntl_delay_time is set to '1', this will disable the
     controller health check feature, just like setting it to '0'.

     If user tries to set cntl_hcheck_int with a value larger than
     cntl_delay_time, then cntl_hcheck_int will be set to the same value
     as cntl_delay_time.

     If user tries to set cntl_delay_time with a value smaller than
     cntl_hcheck_int, the command will fail with INVALID parameter.

4810 Because of the pcmsrv respawning function, "startsrc -s pcmsrv" and
     "stopsrc -s pcmsrv" should not be used for starting and stopping
     pcmsrv: pcmsrv will keep respawning. Two new scripts which prevent
     pcmsrv from respawning are provided to start and stop pcmsrv.
     "startpcmsrv" will start the pcmsrv daemon and "stoppcmsrv" will
     stop the pcmsrv daemon.

     e.g. To start pcmsrv, issue "startpcmsrv" on system command prompt
          #startpcmsrv
          To stop pcmsrv, issue "stoppcmsrv" on system command prompt
          #stoppcmsrv

4908 Fix sddsrv buffer overrun problem to close TCP/IP security issue

     Fix potential security problem due to buffer overrun problem in
     sddsrv when TCP/IP port is enabled. This defect only impacts
     customers who have enabled the TCP/IP port. The TCP/IP port is
     disabled by default.

4934 Support DS8800 8 port FC adapter by 'pcmpath query portmap' and
     'pcmpath query essmap' commands. The output is changed in order to
     display 8 ports per adapter.

     Example of 'pcmpath query portmap' display:

                                         BAY-X                               BAY-Y
ESSID   DISK                 H1       H2       H3       H4       H1       H2       H3       H4
                          ABCDEFGH ABCDEFGH ABCDEFGH ABCDEFGH ABCDEFGH ABCDEFGH ABCDEFGH ABCDEFGH
75TL811 hdisk1 Bay3-Bay4: -------- -------- Y-Y----- -------- Y-Y----- -------- -------- --------
75TL811 hdisk2 Bay3-Bay4: -------- -------- Y-Y----- -------- Y-Y----- -------- -------- --------
75TL811 hdisk3 Bay3-Bay4: -------- -------- O-O----- -------- O-O----- -------- -------- --------
75TL811 hdisk4 Bay3-Bay4: -------- -------- O-O----- -------- O-O----- -------- -------- --------
75TL811 hdisk5 Bay3-Bay4: -------- -------- O-O----- -------- O-O----- -------- -------- --------
75TL811 hdisk6 Bay3-Bay4: -------- -------- O-O----- -------- O-O----- -------- -------- --------
75TL811 hdisk7 Bay3-Bay4: -------- -------- O-O----- -------- O-O----- -------- -------- --------
75TL811 hdisk8 Bay3-Bay4: -------- -------- O-O----- -------- O-O----- -------- -------- --------

Y  = online/open           y = (alternate path) online/open
O  = online/closed         o = (alternate path) online/closed
N  = offline               n = (alternate path) offline
-  = path not configured   ? = path information not available
PD = path down

Note: 2105 devices' essid has 5 digits, while 1750/2107 device's essid
      has 7 digits.

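     The legend above can be applied mechanically. The following is a
     minimal Python sketch, not part of SDDPCM, that decodes one data row
     of the example display into a list of configured paths. The names
     decode_portmap_row and PortState are hypothetical. The sketch
     assumes that the first four 8-character groups belong to the first
     bay named in the row label (BAY-X) and the last four to the second
     (BAY-Y), as the header rows suggest, and it does not handle the
     two-character 'PD' state.

```python
from typing import List, NamedTuple

# State letters copied from the legend in this README.
# "PD" (path down) is two characters wide and is not handled here.
STATE = {
    "Y": "online/open",
    "y": "(alternate path) online/open",
    "O": "online/closed",
    "o": "(alternate path) online/closed",
    "N": "offline",
    "n": "(alternate path) offline",
    "?": "path information not available",
}

PORTS = "ABCDEFGH"                    # eight ports per adapter
ADAPTERS = ("H1", "H2", "H3", "H4")   # four adapters per bay


class PortState(NamedTuple):
    disk: str
    bay: str
    adapter: str
    port: str
    state: str


def decode_portmap_row(row: str) -> List[PortState]:
    """Decode one data row; '-' (path not configured) entries are skipped."""
    _essid, disk, bays, *groups = row.split()
    # Assumption: a label such as "Bay3-Bay4:" names BAY-X, then BAY-Y.
    bay_x, bay_y = bays.rstrip(":").split("-")
    if len(groups) != 8 or any(len(g) != 8 for g in groups):
        raise ValueError("expected 8 groups of 8 port characters")
    paths = []
    for index, group in enumerate(groups):
        bay = bay_x if index < 4 else bay_y
        adapter = ADAPTERS[index % 4]
        for port, flag in zip(PORTS, group):
            if flag != "-":
                state = STATE.get(flag, "unknown")
                paths.append(PortState(disk, bay, adapter, port, state))
    return paths


if __name__ == "__main__":
    sample = ("75TL811 hdisk1 Bay3-Bay4: -------- -------- Y-Y----- "
              "-------- Y-Y----- -------- -------- --------")
    for path in decode_portmap_row(sample):
        print(path.disk, path.bay, path.adapter, path.port, path.state)
```

     For the hdisk1 row, the sketch reports ports A and C on adapter H3
     of Bay3 and ports A and C on adapter H1 of Bay4 as online/open.
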
     Example of 'pcmpath query essmap' display:

Disk    Path  P Location     adapter  LUN SN      Type         Size  LSS Vol Rank C/A S Connection  port RaidMode
------- ----- - ------------ -------- ----------- ------------ ----- --- --- ---- --- - ----------- ---- --------
hdisk1  path0   03-00-01[FC] fscsi0   75TL811010D IBM 2107-900 1.0GB 1   13  0001 17  Y R1-B3-H3-ZA 230  RAID5
hdisk1  path1   03-00-01[FC] fscsi0   75TL811010D IBM 2107-900 1.0GB 1   13  0001 17  Y R1-B4-H1-ZA 300  RAID5
hdisk1  path2   03-01-01[FC] fscsi1   75TL811010D IBM 2107-900 1.0GB 1   13  0001 17  Y R1-B4-H1-ZC 302  RAID5
hdisk1  path3   03-01-01[FC] fscsi1   75TL811010D IBM 2107-900 1.0GB 1   13  0001 17  Y R1-B3-H3-ZC 232  RAID5
hdisk2  path0   03-00-01[FC] fscsi0   75TL811010E IBM 2107-900 1.0GB 1   14  0001 17  Y R1-B3-H3-ZA 230  RAID5
hdisk2  path1   03-00-01[FC] fscsi0   75TL811010E IBM 2107-900 1.0GB 1   14  0001 17  Y R1-B4-H1-ZA 300  RAID5
hdisk2  path2   03-01-01[FC] fscsi1   75TL811010E IBM 2107-900 1.0GB 1   14  0001 17  Y R1-B4-H1-ZC 302  RAID5
hdisk2  path3   03-01-01[FC] fscsi1   75TL811010E IBM 2107-900 1.0GB 1   14  0001 17  Y R1-B3-H3-ZC 232  RAID5
hdisk3  path0   03-00-01[FC] fscsi0   75TL811010F IBM 2107-900 1.0GB 1   15  0001 17  Y R1-B3-H3-ZA 230  RAID5
hdisk3  path1   03-00-01[FC] fscsi0   75TL811010F IBM 2107-900 1.0GB 1   15  0001 17  Y R1-B4-H1-ZA 300  RAID5
hdisk3  path2   03-01-01[FC] fscsi1   75TL811010F IBM 2107-900 1.0GB 1   15  0001 17  Y R1-B4-H1-ZC 302  RAID5
hdisk3  path3   03-01-01[FC] fscsi1   75TL811010F IBM 2107-900 1.0GB 1   15  0001 17  Y R1-B3-H3-ZC 232  RAID5
hdisk4  path0   03-00-01[FC] fscsi0   75TL8110113 IBM 2107-900 1.0GB 1   19  0001 17  Y R1-B3-H3-ZA 230  RAID5
hdisk4  path1   03-00-01[FC] fscsi0   75TL8110113 IBM 2107-900 1.0GB 1   19  0001 17  Y R1-B4-H1-ZA 300  RAID5
hdisk4  path2   03-01-01[FC] fscsi1   75TL8110113 IBM 2107-900 1.0GB 1   19  0001 17  Y R1-B4-H1-ZC 302  RAID5
hdisk4  path3   03-01-01[FC] fscsi1   75TL8110113 IBM 2107-900 1.0GB 1   19  0001 17  Y R1-B3-H3-ZC 232  RAID5
hdisk5  path0   03-00-01[FC] fscsi0   75TL8110114 IBM 2107-900 1.0GB 1   20  0001 17  Y R1-B3-H3-ZA 230  RAID5
hdisk5  path1   03-00-01[FC] fscsi0   75TL8110114 IBM 2107-900 1.0GB 1   20  0001 17  Y R1-B4-H1-ZA 300  RAID5
hdisk5  path2   03-01-01[FC] fscsi1   75TL8110114 IBM 2107-900 1.0GB 1   20  0001 17  Y R1-B4-H1-ZC 302  RAID5
hdisk5  path3   03-01-01[FC] fscsi1   75TL8110114 IBM 2107-900 1.0GB 1   20  0001 17  Y R1-B3-H3-ZC 232  RAID5
hdisk6  path0   03-00-01[FC] fscsi0   75TL8110115 IBM 2107-900 1.0GB 1   21  0001 17  Y R1-B3-H3-ZA 230  RAID5
hdisk6  path1   03-00-01[FC] fscsi0   75TL8110115 IBM 2107-900 1.0GB 1   21  0001 17  Y R1-B4-H1-ZA 300  RAID5
hdisk6  path2   03-01-01[FC] fscsi1   75TL8110115 IBM 2107-900 1.0GB 1   21  0001 17  Y R1-B4-H1-ZC 302  RAID5
hdisk6  path3   03-01-01[FC] fscsi1   75TL8110115 IBM 2107-900 1.0GB 1   21  0001 17  Y R1-B3-H3-ZC 232  RAID5

4950 After LUN reconfiguration or sddpcm upgrade, the 'pcmpath query
     essmap' or 'pcmpath query portmap' command output may not match the
     latest configuration. Using the '-s' option rescans the configured
     LUNs to ensure that the essmap and portmap outputs match the latest
     configuration.

4955 In previous sddpcm releases, the 'pcmpath enable ports' or 'pcmpath
     disable ports' command only put paths on opened devices
     online/offline. This fix also puts paths on closed devices
     online/offline, as long as those paths are connected to storage
     target ports which have had one of the connected paths opened once.

2.6.1.0
---------
4959 Add device attribute to allow user to dynamically turn ON/OFF health
     check recovery of Distributed Error Detection (DED) failed paths.
     The default setting is 'no', which means no path health check is
     performed on DED failed paths.

     Attention: With this attribute set to 'no', the DED failed paths
     won't be recovered even if the error condition no longer exists. In
     order to recover DED failed paths, user has to manually set this
     attribute to 'yes' by pcmpath command; then the health check will
     check and recover the DED failed paths.

     A new device attribute - "recoverDEDpath" is added to control the
     path health check function on DED failed paths of a device.
     A new pcmpath command is added for user to dynamically change this
     attribute setting. Following is the detail information of the
     command syntax:

     pcmpath set device recoverDEDpath

     The pcmpath set device recoverDEDpath command dynamically changes
     the SDDPCM MPIO device recoverDEDpath option

     Syntax

       pcmpath set device [num2] recoverDEDpath