Power8 System Firmware Fix History - Release levels SC8xx

Firmware Description and History

SC860
For Impact, Severity, and other firmware definitions, please refer to the 'Glossary of firmware terms' at the following URL:
http://www14.software.ibm.com/webapp/set2/sas/f/power5cm/home.html#termdefs
SC860_165_165 / FW860.51

05/22/18
Impact:  Security      Severity:  SPE

Response for Recent Security Vulnerabilities

  • DISRUPTIVE:  In response to recently reported security vulnerabilities, this firmware update is being released to address Common Vulnerabilities and Exposures issue number CVE-2018-3639.  In addition, Operating System updates are required in conjunction with this FW level for CVE-2018-3639.
SC860_160_056 / FW860.50

05/03/18
Impact:  Availability      Severity:  SPE

New features and functions

  • Support was added to allow V9R910 and later HMC levels to query Live Partition Mobility (LPM) performance data after an LPM operation.
  • Support was added to the Advanced System Management Interface (ASMI) to provide customer control over speculative execution  in response to CVE-2017-5753 and CVE-2017-5715 (collectively known as Spectre) and CVE-2017-5754 (known as Meltdown).   The ASMI "System Configuration/Speculative Execution Control" provides two options that can only be set when the system is powered off:
    1) Speculative execution controls to mitigate user-to-kernel and user-to-user side-channel attacks.  This mode is designed for systems that need to mitigate exposures of the hypervisor, operating systems, and user application data to untrusted code.   This mode is set as the default.
    2) Speculative execution fully enabled:  This optional mode is designed for systems where the hypervisor, operating system, and applications can be fully trusted.
    Note:  Enabling this option could expose the system to CVE-2017-5753, CVE-2017-5715, and CVE-2017-5754.  This includes any partitions that are migrated (using Live Partition Mobility) to this system.
  • Support was added to allow a periodic data capture from the PCIe3 I/O expansion drawer (with feature code #EMX0) cable card links.
  • On systems with an IBM i partition, support was added for multipliers for IBM i MATMATR fields that are limited to four characters.  When retrieving server metrics via IBM i MATMATR calls on a system that contains more than 9999 GB, MATMATR has an architected "multiplier" field such that, for example, 10,000 GB can be represented by 5,000 GB * a multiplier of 2, so '5000' and '2' are returned in the quantity and multiplier fields, respectively, to handle these extended values.  The IBM i OS also requires a PTF to support the MATMATR field multipliers.  A small sketch of this encoding follows this list.
  • On systems with redundant service processors, a health check was added for the state of the secondary service processor to verify it matches the state of the primary service processor.  If the state of the secondary service processor is an unexpected value such as in termination, an SRC is logged and a call home is done for service processor FRU that has failed.
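
  The following is a minimal sketch of the quantity/multiplier encoding described in the MATMATR item above.  The helper names are hypothetical and do not reflect actual MATMATR field names or IBM i interfaces; the sketch only illustrates how a capacity larger than 9999 GB can be split into, and recovered from, a four-character quantity field and a multiplier.

      # Illustrative only: the four-character quantity field caps out at 9999,
      # so larger totals are returned as a (quantity, multiplier) pair, e.g.
      # 10,000 GB is returned as 5000 and 2.  Function and field names here
      # are hypothetical, not the actual MATMATR layout.
      def encode_capacity_gb(total_gb, max_field=9999):
          """Split a capacity into a quantity that fits the field plus a multiplier."""
          multiplier = 1
          while total_gb // multiplier > max_field:
              multiplier += 1
          return total_gb // multiplier, multiplier   # integer division; a sketch, not firmware behavior

      def decode_capacity_gb(quantity, multiplier):
          """Recover the capacity from the two returned fields."""
          return quantity * multiplier

      assert encode_capacity_gb(10000) == (5000, 2)
      assert decode_capacity_gb(5000, 2) == 10000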

System firmware changes that affect all systems

  • DEFERRED:  A problem was fixed for a PCIe3 I/O expansion drawer (with feature code #EMX0) where control path stability issues may cause certain SRCs to be logged.  Systems using copper cables may log SRC B7006A87 or similar SRCs, and the fanout module may fail to become active.  Systems using optical cables may log SRC B7006A22 or similar SRCs.  For this problem, the errant I/O drawer may be recovered by a re-IPL of the system.
  • A problem was fixed for error logs being collected twice by the HMC, potentially causing an extra call home for an issue that was already resolved.  This problem was caused by a failover to the backup service processor whose error log was missing the acknowledgement from the HMC that error logs had been collected.  This resulted  in the error logs being copied onto the HMC as PELs for a second time.
  • A problem was fixed in which deconfigured-resource records can become malformed and cause the loss of service processor for both redundant and non-redundant service processor systems.  These failures can occur during or after firmware updates to the FW860.40, FW860.41, or FW860.42 levels.  The complete loss of service processor results in the loss of HMC (or FSP stand-alone) management of the server and loss of any further error logging.  The server itself will continue to run.  Without the fix, the loss of the service processor could happen within one month of the deconfiguration records being encountered.  It is highly recommended to install the fix.  Recovery from the problem, once encountered, requires a full server AC power cycle and clearing of deconfiguration records to avoid reoccurrence.  Clearing deconfiguration records exposes the server to repeat hardware failures and possible unplanned outages.
  • A problem was fixed for the guard reminder processing of guarded FRUs and error logs that can cause a system power off to hang and time out with a service processor reset.
  • A problem was fixed for a system termination that can occur when doing a concurrent code update from the FW860.30 level with a clock card deconfigured in the system.  Without the fix, this problem can be avoided by repairing the clock card prior to the code update or by doing a disruptive code update.
  • A problem was fixed for a Coherent Accelerator Processor Proxy (CAPP) unit hardware failure that caused a hypervisor hang with SRC B7000602.  This failure is very rare and can only occur during the early IPL of the hypervisor, before any partitions are started.   A re-IPL will recover from the problem.
  • A problem was fixed for a Live Partition Mobility migration hang that could occur if one of its VIOS Mover Service Partitions (MSPs) goes into a failover at the start of the LPM operation.  This problem is rare because it requires a MSP error to force a MSP failover at the very start of the LPM migration to get the LPM timing error.  The LPM hang can be recovered by using the "migrlpar -o s" and "migrlpar -o r" commands on the HMC.
  • A problem was fixed for incorrect low affinity scores for a partition reported from the HMC "lsmemopt" command when a partition has filled an entire drawer.  A low score indicates the placement is poor but in this case the placement is actually good.  More information on affinity scores for partitions and the Dynamic Platform Optimizer can be found at the IBM Knowledge Center:  https://www.ibm.com/support/knowledgecenter/en/POWER8/p8hat/p8hat_dpoovw.htm.
  • A problem was fixed to allow the management console to display the Active Memory Mirroring (AMM) licensed capability.  Without the fix, the AMM licensed capability of a server will always show as "off" on the management console, even when it is present.
  • A problem was fixed for a rare hypervisor hang for systems with shared processors with a sharing mode of uncapped.  If this hang occurs, all partitions of the system will become unresponsive and the HMC will go to an "Incomplete" state.
  • A problem was fixed for a Live Partition Mobility migration abort that could occur if one of its VIOS Mover Service Partitions (MSPs) goes into a failover during the LPM operation.  This problem is rare because it requires a MSP error to force a MSP failover during the LPM migration to get the LPM timing error.  The LPM abort can be recovered by retrying the LPM migration.
  • A problem was fixed for the FRU callouts for the BA188001 and BA188002 EEH errors to include the PCI Host Bridge (PHB) FRU which had been excluded.  For the P8 systems, these rare errors will more typically isolate to the processor instead of the adapter or slot planar.   In the pre-P8 systems, the I/O planar also included the PHB, but for P8 systems, the PHB was moved to the processor complex.
  • A  problem was fixed for an internal error in the SR-IOV adapter firmware that resets the adapter and logs a B400FF01 reference code.  This error happens in rare cases when there are multiple partitions actively running traffic through the adapter and a subset of the partitions are shutdown hard.  The error causes a temporary disruption of traffic but recovery from the error is automatic with no user intervention needed.
    This fix updates adapter firmware to 10.2.252.1931, for the following Feature Codes: EN15, EN16, EN17, EN18, EN0H, EN0J, EN0M, EN0N, EN0K, and EN0L.
    The SR-IOV adapter firmware level update for the shared-mode adapters happens under user control to prevent unexpected temporary outages on the adapters.  A system reboot will update all SR-IOV shared-mode adapters with the new firmware level.  In addition, when an adapter is first set to SR-IOV shared mode, the adapter firmware is updated to the latest level available with the system firmware (and it is also updated automatically during maintenance operations, such as when the adapter is stopped or replaced).  And lastly, selective manual updates of the SR-IOV adapters can be performed using the Hardware Management Console (HMC).  To selectively update the adapter firmware, follow the steps given at the IBM Knowledge Center for using HMC to make the updates:   https://www.ibm.com/support/knowledgecenter/HW4M4/p8efd/p8efd_updating_sriov_firmware.htm.
    Note: Adapters that are capable of running in SR-IOV mode, but are currently running in dedicated mode and assigned to a partition, can be updated concurrently either by the OS that owns the adapter or the managing HMC (if OS is AIX or VIOS and RMC is running).
  • A problem was fixed for the wrong Redfish method (PATCH or POST) passed for a valid Uniform Resource Identifier (URI) causing an incorrect error message of "501 - Not Implemented".  With the fix, the message returned is "Invalid Method on URI" which is more helpful to the user.
  • A problem was fixed for SRC call home reminders for bad FRUs causing service processor dumps with SRC B181E911 and reset/reloads.  This occurred if the FRU callout was missing a CCIN number in the error log.  This can happen because some error logs only have "Symbolic FRUs" and these were not being handled correctly.
  • A problem was fixed for  a PCIe3 I/O expansion drawer (with feature code #EMX0)  failing to initialize during the IPL with a SRC B7006A88 logged.  The error is infrequent.  The errant I/O drawer can be recovered by a re-IPL of the system.
  • A problem was fixed for the SR-IOV firmware adapter updates using the HMC GUI or CLI to only reboot one SR-IOV adapter at a time.  If multiple adapters are updated at the same time, the HMC error message HSCF0241E may occur:  "HSCF0241E Could not read firmware information from SR-IOV device ...".  This fix prevents the system network from being disrupted by the SR-IOV adapter updates when redundant configurations are being used for the network.  The problem can be circumvented by using the HMC GUI to update the SR-IOV firmware one adapter at a time using the following steps: 
     https://www.ibm.com/support/knowledgecenter/en/8247-22L/p8efd/p8efd_updating_sriov_firmware.htm

System firmware changes that affect certain systems

  • On systems with a shared processor pool, a very rare problem was fixed for the hypervisor not responding to partition requests such as power off and Live Partition Mobility (LPM).  This error is caused by a hung request to guard a failed processor when there are not any available spare processors.
  • On systems with mirrored memory running IBM i partitions, a problem was fixed for un-mirrored nodal memory errors in the partition that also caused the system to crash.   With the fix, the memory failure is isolated to the impacted partition, leaving the rest of the system unaffected.  This fix improves on an earlier fix delivered for  IBM i memory errors  in FW840.60 by handling the errors in nodal memory.
  • On systems with Huge Page (16 GB) memory enabled for an AIX partition,  a problem was fixed for the OS failing to boot with an 0607 SRC displayed.  This error occurs on systems with  FW860.40, FW860.41 or FW860.42 installed.  To circumvent the problem, disable Huge Pages for the AIX partition.  For information on viewing and setting values for AIX huge-page memory allocation, see the following link in the IBM Knowledge Center: https://www.ibm.com/support/knowledgecenter/en/POWER8/p8hat/p8hat_aixviewhgpgmem.htm
  • On systems with an IBM i partition, a problem was fixed for 64 bytes overwritten in a portion of the  IBM i Main Storage Dump (MSD).  Approximately 64 bytes are overwritten just beyond the 17 MB (0x11000000) address on P8 systems.  This problem is cosmetic as the dump is still readable for problem diagnostics and no customer operations are affected by it.
  • On systems with a partition with a Fibre Channel Adapter (FCA) or a Fibre Channel over Ethernet (FCoE) adapter,  a problem was fixed for bootable disks attached to the FCA or FCoE adapter not being seen in the System Management Services (SMS) menus for selection as boot devices.  This problem is likely to occur if the only I/O device in the partition is a FCA or FCoE adapter.  If other I/O devices are present, the problem may still occur if the FCA or FCoE  is the first adapter discovered by SMS.  A work-around to this problem is to define a virtual Ethernet adapter in the partition profile.  The virtual adapter does not need to have any physical backing device,  as just having the VLAN defined is sufficient to avoid the problem.  The FCA has feature codes #EN0A, #EN0B, #EN0F, #EN0G, #EN0Y, #EN12, #5729, #5774, #5735, and #5723.  The FCoE adapter has feature codes  #5708, #EN0H, #EN0J, #EN0K, and #EN0L
  • On systems with a partition with a USB 3.0 controller, a problem was fixed for a partition boot failure involving the USB 3.0 controller adapter card with feature code #EC45 or #EC46.  The boot failure is triggered by a fault in the USB controller, but instead of just the USB controller failing, the entire partition fails.  With the fix, the failure is limited to the USB controller.
  • On a system in a Power Enterprise Pool (PEP) with Mobile Resources,  a problem was fixed for Mobile Resources not being restored after an IPL.  The missing resources can be started  temporarily with Trial COD or some other methods, or  the PEP recovery steps can be used to get the Mobile Resources restored.  For more information, see the Change CoD Pool command on the HMC:  https://www.ibm.com/support/knowledgecenter/en/POWER8/p8edm/chcodpool.html.
SC860_138_056 / FW860.42

01/09/18
Impact:  Security      Severity:  SPE

New features and functions

  • In response to recently reported security vulnerabilities, this firmware update is being released to address Common Vulnerabilities and Exposures issue numbers CVE-2017-5715, CVE-2017-5753 and CVE-2017-5754.  Operating System updates are required in conjunction with this FW level for CVE-2017-5753 and CVE-2017-5754.
SC860_127_056 / FW860.41

12/08/17
Impact:  Availability      Severity:  SPE

System firmware changes that affect certain systems

  • On systems using PowerVM firmware that are co-managed with HMC and  PowerVM NovaLink, a problem was fixed for the HMC going into the Incomplete state after deleting a NovaLink partition or after using the  HMC "chsyscfg powervm_mgmt_capable=0" command to remove the NovaLink attribute from a partition.  Partitions will continue running but cannot be changed by the management console, and Live Partition Mobility (LPM) will not function in this state.  A power off of the system will remove it from the Incomplete state, but the NovaLink partition will not have been deleted.  To force the delete of the NovaLink partition or partitions without the fix,  erase the service processor NVRAM and then restore the HMC partition data.
  • On systems using PowerVM firmware with PowerVM NovaLink, a problem was fixed for the HMC going into the incomplete state when restoring HMC profile data after deleting a NovaLink partition.  This fix will prevent but not repair the problem once it has occurred.  Recovery from the problem is to erase the service processor NVRAM and then restore the HMC partition data.
SC860_118_056 / FW860.40

11/08/17
Impact:  Availability      Severity:  SPE

New features and functions

  • Support was added to the Advanced System Management Interface (ASMI) for providing an "All of the above" cable validation display option so that each individual cable option does not have to be selected to get a full report on the cable status.  Select "System Service Aids -> Cable Validation -> Display Cable Status", choose "All of the above", and click "Continue" to see the status of all the cables.
System firmware changes that affect all systems
  • A problem was fixed for recovery from clock card loss of lock failures that resulted in a clock card FRU unnecessarily being called out for repair.  This error happened whenever there was a loss of lock (PLL or CRC) for the clock card.  With the fix, the firmware will not be calling out the failing clock card, but rather it will be reconfigured as the new backup clock card after doing a clock card failover.  Customers will see a benefit from improved system availability by the avoidance of disruptive clock card repairs.
  • A problem was fixed for the "Minimum code level supported" not being shown by the Advanced System Management Interface (ASMI) when selecting the "System Configuration/Firmware Update Policy" menu.  The message shown is "Minimum code level supported value has not been set".  The workaround to find this value is to use the ASMI command line interface with the "registry -l cupd/MinMifLevel" command.
  • A problem was fixed for "sh: errl: not found " error messages to the service processor console whenever the Advanced System Management Interface (ASMI) was used to display error logs.  These messages did not cause any problems except to clutter the console output as seen in the service processor traces.
  • A problem was fixed for the LineInputVoltage and LastPowerOutputWatts being displayed in millivolts and milliwatts, respectively,  instead of volts and watts for the output from the Redfish API for power properties for the chassis.  The URL affected is the following:  "https://<fsp ip>/redfish/v1/Chassis/<id>/Power"
  • A problem was fixed for system node fans going to maximum RPM speeds after a service processor failover that needed the On-Chip Controllers (OCC) to be reloaded.  Without the fix, the system node fan speeds can be restored to normal speed by changing the Power Mode in the Advanced System Management Interface using steps from the IBM Knowledge Center:  https://www.ibm.com/support/knowledgecenter/en/POWER8/p8hby/areaa_pmms.htm.  After changing the Power Mode, wait about 10 minutes to change the Power Mode back to the original setting.
    If the fix is applied without rebooting the system, the system node fan speeds can be corrected by either changing the Power Mode as above or using the HMC to do an Administrative Failover (AFO).
  • A problem was fixed for a Power Supply Unit (PSU) failure of  SRC 110015xF  logged with a power supply fan call out when doing a hot re-plug of a PSU.   The power supply may be made operational again by doing a dummy replace of the PSU that was called out (keeping the same PSU for the replace operation).  A re-IPL of the system will also recover the PSU.
  • A problem was fixed for the service processor low-level boot code always running off the same side of the flash image, regardless of what side has been selected for boot ( P-side or T-side).  Because this low-level boot code rarely changes, this should not cause a problem unless corruption occurs in the flash image of the boot code.  This problem does not affect firmware side-switches as the service processor initialization code (higher-level code than the boot code) is running correctly from the selected side.  Without the fix, there is no recovery for boot corruption for systems with a single service processor as the service processor must be replaced.
  • A problem was fixed for a missing serviceable event from a periodic call home reminder.  This occurred if there was an FRU deconfigured for the serviceable event.
  • A problem was fixed for help text in the Advanced System Management Interface (ASMI) not informing the user that system fan speeds would increase if the system Power Mode was changed to "Fixed Maximum Frequency" mode.  If ASMI panel function "System Configuration->Power Management->Power Mode Setup" "Enable Fixed Maximum Frequency mode" help is selected, the updated text states "...This setting will result in the fans running at the maximum speed for proper cooling."
  • A problem was fixed for a degraded PCI link causing a Predictive SRC for a non-cacheable unit (NCU) store time-out that occurred with SRC B113E540 or B181E450 and PRD signature  "(NCUFIR[9]) STORE_TIMEOUT: Store timed out on PB".  With the fix, the error is changed to be an Informational as the problem is not with the processor core and the processor should not be replaced.  The solution for degraded PCI links is different from the fix for this problem, but a re-IPL of the CEC or a reset of the PCI adapters could help to recover the PCI links from their degraded mode.
  • A problem was fixed for a Redfish Patch on the "Chassis"  "HugeDynamicDMAWindowSlotCount" for the validation of incorrect values.  Without the fix, the user will not get proper error messages when providing bad values to the patch.

System firmware changes that affect certain systems

  • DEFERRED:  On systems using PowerVM firmware, a problem was fixed for DPO (Dynamic Platform Optimizer) operations taking a very long time and impacting the server system with a performance degradation.  The problem is triggered by a DPO operation being done on a system with unlicensed processor cores and a very high I/O load.  The fix involves using a different lock type for the memory relocation activities (to prevent lock contention between memory relocation threads and partition threads) that is created at IPL time, so an IPL is needed to activate the fix.  More information on the DPO function can be found at the IBM Knowledge Center:  https://www.ibm.com/support/knowledgecenter/en/8247-42L/p8hat/p8hat_dpoovw.htm
  • On systems using PowerVM firmware,  a problem was fixed for an intermittent service processor core dump and a callout for netsCommonMSGServer with SRC B181EF88.   The HMC connection to the service processor automatically recovers with a new session.
  • On systems using PowerVM firmware, a problem was fixed where the Power Enterprise Pool (PEP) grace period expired early, being short by one hour.  For example, 71 hours may be provided instead of 72 hours in some cases. See https://www.ibm.com/support/knowledgecenter/en/POWER8/p8ha2/entpool_cod_compliance.htm for more information about the PEP grace period.
  • On systems using PowerVM firmware, a problem was fixed for a concurrent firmware update failure with HMC error message "E302F865-PHYPTooBusyToQuiesce".  This error can occur when the error log is full on the hypervisor and it cannot accept more error logs from the service processor.  But the service processor keeps retrying the send of an error log, resulting in a "denial of service" scenario where the hypervisor is kept busy rejecting the error logging attempts.  Without the fix, the problem may be circumvented by starting a  logical partition (if none are running) or by purging the error logs on the service processor.
  • On systems using PowerVM firmware with mirrored memory running IBM i partitions, a problem was fixed for memory fails in the partition that also caused the system to crash.  The system failure will occur any time that IBM i partition memory towards the beginning of the partition's assigned memory fails.  With the fix, the memory failure is isolated to the impacted partition, leaving the rest of the system unaffected.
  • On systems using PowerVM firmware, a problem was fixed for failures deconfiguring SR-IOV Virtual Functions (VFs).  This can occur during Live Partition Mobility (LPM) migrations with HMC error messages of  HSCLAF16, HSCLAF15 and HSCLB602 shown. This results in an LPM migration failure and a system reboot is required to recover the VFs for the I/O adapters.  This error may occur more frequently in cases where the I/O adapter has pending I/O at the time of the deconfigure request for the VF.
  • On systems using PowerVM firmware, a problem was fixed for a vNIC client that has backing devices being assigned an active server that was not the one intended by an HMC user failover for the client adapter.  This can only happen if the vNIC client adapter had never been activated.  A circumvention is to activate the client OS and initialize the vNIC device (ifconfig "xxx" up), and an active backing device will then be selected.
  • On systems using PowerVM firmware, a problem was fixed for partitions with more than 32TB memory failing to IPL with memory space errors.  This can occur if the logical memory block (LMB) size is small as there is a memory loss associated with each LMB.  The problem can be circumvented by reducing the amount of partition memory or increasing the LMB size to reduce the total number of LMBs needed for the memory allocation.
  • On systems using PowerVM firmware,  a problem was fixed for the error handling of EEH events for the SR-IOV Virtual Functions (VFs) that can result in IPL failure with B7006971, B400FF05, and BA210000 SRCs logged.  In these cases, the partition console stops at an OFDBG prompt.  Also, a DLPAR add of a VF may result in a partition crash due to a 300 DSI exception because of a low-level EEH event.  A circumvention for the problem would be to debug the EEH events which should be recovered errors and eliminate the cause of the EEH events.  With the fix, the EEH events still log Predictive Errors but do not cause a partition failure.
  • On systems using PowerVM firmware, a problem was fixed for Power Enterprise Pool (PEP) "not applicable" error messages being displayed when re-entering PEP XML files for PEP updates, in which one of the XML operations calls for Conversion of Perm Resources to PEP Resources.  There is no error as the PEP key was accepted on the first use.  The following message may be seen on the HMC and can be ignored:   "...HSCL0520 A Mobile CoD processor conversion code to convert 0 permanently activated processors to Mobile CoD processors on the managed system has been entered.  HSCL050F This CoD code is not valid for your managed system.  Contact your CoD administrator."
  • On systems using PowerVM firmware, a problem was fixed for Power Enterprise Pool (PEP) busy errors from the system anchor card when creating or updating a PEP pool.    The error returned by the HMC is "HSCL9015 The managed system cannot currently process this operation.  This
    condition is temporary.  Please try the operation again."  To try again, the customer needs to update the pool again.  Typically on the second PEP update, the code is accepted.
    The problem is intermittent and occurs only rarely.
  • On systems using PowerVM firmware, a problem was fixed for an invalid date from the service processor causing the customer date and time to go to the Epoch value (01/01/1970) without a warning or chance for a correction.  With the fix,  the first IPL attempted on an invalid date will be rejected with a message alerting the user to set the time correctly in the service processor.  If the warning is ignored and the date/time is not corrected, the next IPL attempt will complete to the OS with the time reverted to the Epoch time and date.  This problem is very rare but it has been known to occur on service processor replacements when the repair step to set the date and time on the new service processor was inadvertently skipped by the service representative.
  • On systems using PowerVM firmware, a problem was fixed for a Power Enterprise Pool (PEP) system losing its assigned processor and memory resources after an IPL of the system.  This is an intermittent problem caused by a small timing window that makes it possible for the server to not get the IPL-time assignment of resources from the HMC.  If this problem occurs, it can be corrected by the HMC to recover the pool without needing another IPL of the system.
  • On systems using PowerVM firmware with PowerVM NovaLink, a problem was fixed for the loss of a communications channel between the hypervisor and the PowerVM NovaLink during a reset of the service processor.  Various NovaLink tasks, including deploy, could fail with a "No valid host was found" error.  With the fix, PowerVM NovaLink prevents normal operations from being impacted by a reset of the service processor.
  • On systems using PowerVM firmware, a problem was fixed for a rare system hang caused by a process dispatcher deadlock timing window.  If this problem occurs, the HMC will also go to an "Incomplete" state for the managed system.
  • On systems using PowerVM firmware,  a  problem was fixed for communication failures on adapters in SR-IOV shared mode.  This communication failure only occurs when a logical port's VLAN ID ( PVID) is dynamically changed from non-zero to zero.  An SR-IOV logical port is an I/O device created for a partition or a partition profile using the management console (HMC) when a user intends for the partition to access an SR-IOV adapter Virtual Function.  The error can be recovered from by a reboot of the partition.
    This fix updates adapter firmware to 10.2.252.1929, for the following Feature Codes: EN15, EN16, EN17, EN18, EN0H, EN0J, EN0M, EN0N, EN0K, EN0L, EL38, EL3C, EL56, and EL57.
    The SR-IOV adapter firmware level update for the shared-mode adapters happens under user control to prevent unexpected temporary outages on the adapters.  A system reboot will update all SR-IOV shared-mode adapters with the new firmware level.  In addition, when an adapter is first set to SR-IOV shared mode, the adapter firmware is updated to the latest level available with the system firmware (and it is also updated automatically during maintenance operations, such as when the adapter is stopped or replaced).  And lastly, selective manual updates of the SR-IOV adapters can be performed using the Hardware Management Console (HMC).  To selectively update the adapter firmware, follow the steps given at the IBM Knowledge Center for using HMC to make the updates:   https://www.ibm.com/support/knowledgecenter/HW4M4/p8efd/p8efd_updating_sriov_firmware.htm.
    Note: Adapters that are capable of running in SR-IOV mode, but are currently running in dedicated mode and assigned to a partition, can be updated concurrently either by the OS that owns the adapter or the managing HMC (if OS is AIX or VIOS and RMC is running).
  • On systems using PowerVM firmware, a problem was fixed for error logs not getting sent to the OS running in a partition.   This problem could occur if the error log buffer was full in the hypervisor and then a re-IPL of the system occurred.  The error log full condition was persisting across the re-IPL, preventing further logs from being sent to the OS.
  • On systems using PowerVM firmware, a problem was fixed in the text for the Firmware License agreement to correct a link that pointed to a URL that was not specific to microcode licensing.  The message is displayed for a machine during its initial power on.  Once accepted, the message is not displayed again.  The fixed link in the licensing agreement is the following: http://www.ibm.com/support/docview.wss?uid=isg3T1025362.
SC860_103_056 / FW860.30

06/30/17
Impact:  Availability      Severity:  SPE

New features and functions

  • Support was added for the Redfish API to allow the ISO 8601 extended format for the time and date so that the date/time can be represented as an offset from UTC (Coordinated Universal Time).
  • Support for the Redfish API for power and thermal properties for the chassis.  The new URIs are as follows (a short client sketch follows this item):
    https://<fsp ip>/redfish/v1/Chassis/<id>/Power  : Provides power supply data
    https://<fsp ip>/redfish/v1/Chassis/<id>/Thermal : Provides fan data
    Only the Redfish GET operation is supported for these resources.
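
  The following is a minimal sketch, using the Python requests library, of reading the two chassis resources listed above.  The URIs come from the list above; the service processor address, chassis id, credentials, and the use of HTTP basic authentication with certificate verification disabled are placeholder assumptions for illustration only.

      # Sketch of reading the chassis Power and Thermal resources over the
      # Redfish REST API.  Replace the placeholders with real values; basic
      # authentication and verify=False are assumptions for this sketch.
      import requests

      FSP = "https://<fsp ip>"            # service processor address (placeholder)
      CHASSIS = FSP + "/redfish/v1/Chassis/<id>"
      AUTH = ("user", "password")         # placeholder credentials

      for resource in ("Power", "Thermal"):
          resp = requests.get(CHASSIS + "/" + resource, auth=AUTH, verify=False)
          resp.raise_for_status()
          print(resource, resp.json())    # only GET is supported for these resources
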
System firmware changes that affect all systems
  • A problem was fixed for service actions with SRC B150F138 missing an Advanced System Management Interface (ASMI) Deconfiguration Record.  The deconfiguration records make it easier to organize the repairs that are needed for the system and they need to be consistent with the periodic maintenance reminders that are logged for the failed FRUs.
  • A problem was fixed for a false 1100026B1 (12V power good failure) caused by an I2C bus write error for a LED state.  This error can be triggered by the fan LEDs changing state.
  • A problem was fixed for a fan LED turning amber on solid when there is no fan fault, or when the fan fault is for a different fan.  This error can be triggered anytime a fan LED needs to change its state.  The fan LEDs can be recovered to a normal state concurrently using the following link steps for a soft reset of the service processor:  https://www.ibm.com/support/knowledgecenter/POWER8/p8hby/p8hby_softreset.htm
  • A problem was fixed for sporadic blinking amber LEDs for the system fans with no SRCs logged.  There was no problem with the fans.  The LED corruption occurred when two service processor tasks attempted to update the LED state at the same time.  The fan LEDs can be recovered to a normal state concurrently using the following link steps for a soft reset of the service processor:  https://www.ibm.com/support/knowledgecenter/POWER8/p8hby/p8hby_softreset.htm
  • A problem was fixed for a Redfish Patch on the "Chassis" or "IBMEnterpriseComputerSystem" with empty data that caused a "500 Internal Server Error".  Validation for the empty data case has been added to prevent the server error.
  • A problem was fixed for hardware dumps only collecting data for the master processor if a run-time service processor failover had occurred prior to the dump.  Therefore, there would be only master chip and master core data in the event of a core unit checkstop.  To recover to a system state that is able to do a full collection of debug data for all processors and cores after a run-time failover, a re-IPL of the system is needed.
  • A problem was fixed for a Redfish PATCH on power mode to "MaxPowerSaver" that caused a  "500 Internal Server Error" when that power mode was not supported on the system.  With the fix, the Redfish server response is a list of the valid power modes that can be used for the system.
  • A problem was fixed for the loss of Operations Panel function 30 (displaying ethernet port  HMC1 and HMC2 IP addresses) after a concurrent repair of the Operations Panel.  Operations  Panel function 30 can be restored concurrently using the following link steps for a soft reset of the service processor:  https://www.ibm.com/support/knowledgecenter/POWER8/p8hby/p8hby_softreset.htm
  • A problem was fixed for a core dump of the rtiminit (service processor time of day) process that logs an SRC B15A3303  and could invalidate the time on the service processor.  If the error occurs while the system is powered on, the hypervisor has the master time and will refresh the service processor time, so no action is needed for recovery.  If the error occurs while the system is powered off, the service processor time must be corrected on the systems having only a single service processor.  Use the following steps from the IBM Knowledge Center to change the UTC time with the Advanced System Management Interface:  https://www.ibm.com/support/knowledgecenter/en/POWER8/p8hby/viewtime.htm.
  • A problem was fixed for the service processor boot watch-dog timer expiring too soon during DRAM initialization in the reset/reload, causing the service processor to go unresponsive.  On systems with a single service processor, the SRC B1817212 was displayed on the control panel.  For systems with redundant service processors, the failing service processor was deconfigured.  To recover the failed service processor, the system will need to be powered off with AC power removed during a regularly scheduled system service action.  This problem is intermittent and very infrequent as most of the reset/reloads of the service processor will work correctly to restore the service processor to a normal operating state.
  • A problem was fixed for host-initiated resets of the service processor causing the system to terminate.  A prior fix for this problem did not work correctly because some of the host-initiated resets were being translated to unknown reset types that caused the system to terminate.  With this new correction for failed host-initiated resets, the service processor will still be unresponsive but the system and partitions will continue to run.  On systems with a single service processor, the SRC B1817212 will be displayed on the control panel.  For systems with redundant service processors, the failing service processor will be deconfigured.  To recover the failed service processor, the system will need to be powered off with AC power removed during a regularly scheduled system service action.  This problem is intermittent and very infrequent as most of the host-initiated resets of the service processor will work correctly to restore the service processor to a normal operating state.
  • A problem was fixed for a service processor reset triggered by a spurious false IIC interrupt request in the kernel.  On systems with a single service processor, the SRC B1817201 is displayed on the Operator Panel.  For systems with redundant service processors, an error failover to the backup service processor occurs.  The problem is extremely infrequent and does not impact processes on the running system.
  • A problem was fixed for the System Attention LED failing to light for an error failover for the redundant service processors with an SRC B1812028 logged.
  • A problem was fixed for a system failure at run time with SRC B111E450 corefir(55) after which the system could not re-IPL.  A system node should have been deconfigured for an ABUS error on a processor chip but instead, the system was terminated.  To recover from this problem, manually guard the node containing the failed processor and then the IPL will be successful.
  • A problem was fixed for an incorrect Redfish error message when trying to use the $metadata URI:   "The resource at the URI https://<systemip>/redfish/v1/%24metadata was not found.".  The "%24" is the URL-encoded form of "$" and has been replaced with a "$" in the error message.  The Redfish $metadata URI is not supported.
  • A problem was fixed for a system failure caused by Hostboot problems on one node while the other nodes are good.  With the fix, the node that is failing Hostboot is deconfigured and the system is able to IPL on the remaining nodes.  To recover from this problem, manually guard the node that is failing and re-IPL.

System firmware changes that affect certain systems

  • DEFERRED: On systems using PowerVM firmware, a fix was made to improve the stability of PCIe3 I/O expansion drawer (#EMX0) links.  The settings for the continuous time linear equalizers (CTLE) were updated for all the PCIe adapters for the PCIe links to the expansion drawer.  The system must be re-IPLed for the fix to activate.
  •  On systems using PowerVM firmware with a Linux Little Endian (LE) partition, a problem was fixed for system reset interrupts returning the wrong values in the debug output for the NIP and MSR registers.  This problem reduces the ability to debug hung Linux partitions using system reset interrupts.  The error occurs every time a system reset interrupt is used on a Linux LE partition.
  • On systems using PowerVM firmware, a problem was fixed for "Time Power On" enabled partitions not being capable of suspend and resume operations.  This means Live Partition Mobility (LPM) would not be able to migrate this type of partition.  As a workaround, the partition could be transitioned to a "Non-time Power On" state and then made capable of suspend and resume operations.
  • On systems using PowerVM firmware, a problem was fixed for manual vNIC failovers (from the HMC, manually "Make the Backing Device Active") so that the selected server was chosen for the failover, regardless of its priority.  With the problem, the server chosen for the VNIC failover will be the one with the most favorable priority. 
    There are two possible workarounds to the problem:
    (1) Disable auto-priority-failover; Change priority to the server that is needed as the  target of the failover; Force the vNIC failover; Change priority back to original setting.
    (2) Or use auto-priority-failover and change the priority so the server that is needed as the target of the failover is favored.
  • On systems using PowerVM firmware, a problem was fixed for extra error logs in the VIOS due to failovers taking place while the client vNIC is inactive.  The inactive client vNIC failovers are skipped unless the force flag is on.  With the problem occurring, Enhanced Error Handling (EEH) Freeze/Temporary Error/Recovery logs posted in the VIOS error log of the client partition boot can be ignored unless an actual problem is experienced.
  • On systems using PowerVM firmware, a problem was fixed for a Live Partition Mobility (LPM) migration abort and reboot on the FW860  target CEC caused by a mismatched address space for the source and target partition.  The occurrence of this problem is very rare and related to performance improvements made in the memory management on the FW860 system that exposed a timing window in the partition memory validation for the migration.  The reboot of the migrated partition recovers from the problem as the migration was otherwise successful.
  • On systems using PowerVM firmware, a problem was fixed for reboot retries for IBM i partitions such that the first load source I/O adapter (IOA) is retried instead of bypassed after the first failed attempt.  The reboot retries are done for an hour before the reboot process gives up.  This error can occur if there is more than one known load source, and the IOA of the first load source is different from the IOA of the last load source.  The error can be circumvented by retrying the boot of the partition after the load source device has become available.
  • On systems using PowerVM firmware, a problem was fixed for adapters failing to transition to shared SR-IOV mode on the IPL after changing the adapter from dedicated mode.  This intermittent problem could occur on systems using SR-IOV with very large memory configurations.
  • On systems using PowerVM firmware,  a  problem was fixed for SR-IOV adapters in shared mode for a transmission stall or time out with SRC B400FF01 logged.  The time out happens during Virtual Function (VF) shutdowns and during Function Level Resets (FLRs) with network traffic running.
    This fix updates adapter firmware to 10.2.252.1927, for the following Feature Codes: EN15, EN16, EN17, EN18, EN0H, EN0J, EN0M, EN0N, EN0K, EN0L, EL38, EL3C, EL56, and EL57.
    The SR-IOV adapter firmware level update for the shared-mode adapters happens under user control to prevent unexpected temporary outages on the adapters.  A system reboot will update all SR-IOV shared-mode adapters with the new firmware level.  In addition, when an adapter is first set to SR-IOV shared mode, the adapter firmware is updated to the latest level available with the system firmware (and it is also updated automatically during maintenance operations, such as when the adapter is stopped or replaced).  And lastly, selective manual updates of the SR-IOV adapters can be performed using the Hardware Management Console (HMC).  To selectively update the adapter firmware, follow the steps given at the IBM Knowledge Center for using HMC to make the updates:   https://www.ibm.com/support/knowledgecenter/HW4M4/p8efd/p8efd_updating_sriov_firmware.htm.
    Note: Adapters that are capable of running in SR-IOV mode, but are currently running in dedicated mode and assigned to a partition, can be updated concurrently either by the OS that owns the adapter or the managing HMC (if OS is AIX or VIOS and RMC is running). 
  • On systems with maximum memory configurations (where every DIMM slot is populated - size of DIMM does not matter), a  problem has been fixed for systems losing performance and going into Safe mode (a power mode with reduced processor frequencies intended to protect the system from overheating and excessive power consumption) with B1xx2AC3/B1xx2AC4 SRCs logged.  This happened because of On-Chip Controller (OCC) timeout errors when collecting Analog Power Subsystem Sweep (APSS) data, used by the OCC to tune the processor frequency.  This problem occurs more frequently on systems that are running heavy workloads.  Recovery from Safe mode back to normal performance can be done with a re-IPL of the system, or concurrently using the following link steps for a soft reset of the service processor:  https://www.ibm.com/support/knowledgecenter/POWER8/p8hby/p8hby_softreset.htm.
    To check or validate that Safe mode is not active on the system will require a dynamic celogin password from IBM Support to use the service processor command line:
    1) Log into ASMI as celogin with  dynamic celogin password generated by IBM Support
    2) Select System Service Aids
    3) Select Service Processor Command Line
    4) Enter "tmgtclient --query_mode_and_function" from the command line
    The first line of the output, "currSysPwrMode" should say "NOMINAL" and this means the system is in normal mode and that Safe mode is not active.
  • A  problem has been fixed for systems losing performance and going into Safe mode (a power mode with reduced processor frequencies intended to protect the system from overheating and excessive power consumption) with B1xx2AC3/B1xx2AC4 SRCs logged.  This happened because of an On-Chip Controller (OCC) internal queue overflow. The problem has only been observed for systems running heavy workloads with maximum memory configurations (where every DIMM slot is populated - size of DIMM does not matter), but this may not be required to encounter the problem.  Recovery from Safe mode back to normal performance can be done with a re-IPL of the system, or concurrently using the following link steps for a soft reset of the service processor:  https://www.ibm.com/support/knowledgecenter/POWER8/p8hby/p8hby_softreset.htm.
    To check or validate that Safe mode is not active on the system will require a dynamic celogin password from IBM Support to use the service processor command line:
    1) Log into ASMI as celogin with  dynamic celogin password generated by IBM Support
    2) Select System Service Aids
    3) Select Service Processor Command Line
    4) Enter "tmgtclient --query_mode_and_function" from the command line
    The first line of the output, "currSysPwrMode" should say "NOMINAL" and this means the system is in normal mode and that Safe mode is not active.
  • On systems using PowerVM firmware,  a  problem was fixed for a partition boot from a USB 3.0 device that has an error log SRC BA210003.  The error is triggered by an Open Firmware entry to the trace buffer during the partition boot.  The error log can be ignored as the boot is successful to the OS.
  • On systems using PowerVM firmware,  a  problem was fixed for a partition boot failure or hang from a Fibre Channel device having fabric faults.  Some of the fabric errors returned by the VIOS are not interpreted correctly by the Open Firmware VFC driver, causing the hang instead of generating helpful error logs.
  • On systems with redundant service processors,  a problem was fixed for an extra SRC B150F138 logged for a power supply that had already been replaced.  The problem was triggered by a service processor failover and an old power supply fault event that was not cleared on the backup service processor.  This caused the SRC B150F138 to be logged for a second time.  This problem can be circumvented by clearing the error log associated with the bad FRU when the FRU is replaced.
  • On systems using PowerVM firmware, a problem was fixed for a Power Enterprise Pool (PEP) resource Grace Period not being reset when the server is in the "Out of Compliance" state and the resource has been returned to put the server back in Compliance.  The Grace Period was not being reset after a double-commit of a resource (doing a "remove" of an active resource) was resolved by restarting the server with the double-committed resource. When the Grace Period ends, the "double-committed" resources on the server have to have been freed up from use to prevent the server from going to "Out of Compliance".  If the user fails to free up the resource, the PEP is in an "Out of Compliance" state, and the only PEP actions allowed are ones to free up the double-commit. Once that is completed, the PEP is back In Compliance. The loss of the Grace Period for the error makes it difficult to move resources around in the PEP.  Without the fix, the user can  "Add" another PEP resource to the server, and the action of adding a PEP resource resets the Grace Period timer.  One could then "Remove" that one PEP resource just added, and then any further "removes" of PEP resources would behave as expected with the full Grace Period in effect.
  • On systems using PowerVM firmware,  a problem was fixed for  Power Enterprise Pool (PEP) IFL processors assignments causing an "Out of Compliance" for normal processor licenses.  The number of IFL processors purchased was first credited as satisfying any "unreturned" PEP processor resources, thus potentially leaving the system "Out Of Compliance" since IFL processors should not be taking the place of the normal (expensive) processor usage.  In this situation, without the fix, the user will need to either purchase more "expensive" non-IFL processors to satisfy the non-IFL workloads or adjust the partitions to reduce the usage of non-IFL processors.  This is a very infrequent problem for the following reasons: 
    1) PEP processors are infrequently left "unreturned" for short periods of time for specialized operations such as LPM migrations
    2) The user would have to purchase IFL processors from IBM, which is not a common occurrence.
    3) The user would have to put in a COD key for IFL processors while a PEP processor is still "unreturned"
  • On systems using PowerVM firmware,  a  problem was fixed for a power off hanging at D200C1FF caused by a vNIC VF failover error with SRC B200F011.  The power off hang error is infrequent because it requires that a VF failover error has occurred first.  The system can be recovered by using the power off immediate option from the Hardware Management Console (HMC).
  • On systems using PowerVM firmware, a problem was fixed for the incorrect reporting of the Universally Unique Identifier (UUID) to the OS, which prevented the tracking of a partition as it moved within a data center.  The UUID value as seen on HMC or the NovaLink did not match the value as displayed in the OS.
  • On systems using PowerVM firmware, a problem was fixed for an error finding the partition load source that has a GPT format.  GUID Partition Table (GPT) is a standard for the layout of the partition table on a physical storage device used in the server, such as a hard disk drive or solid-state drive, using globally unique identifiers (GUID).  Other drives that are working may be using the older master boot record (MBR) partition table format.  This problem occurs whenever load sources utilizing the GPT format occur in other than the first entry of the boot table.  Without the fix, a GPT disk drive must be the first entry in the boot table to be able to use it to boot a partition.
  • On systems using PowerVM firmware, a problem was fixed for an SRC BA090006 serviceable event log occurring whenever an attempt was made to boot from an ALUA  (Asymmetric Logical Unit Access) drive.  These drives are always busy by design and cannot be used for a partition boot, but no service action is required if a user inadvertently tries to do that.  Therefore, the SRC was changed to be an informational log.
SC860_082_056 / FW860.20

03/17/17
Impact:  Availability      Severity:  SPE 

New features and functions

  • Support for the Redfish API for provisioning of Power Management tunable (EnergyScale) parameters.  The Redfish Scalable Platforms Management API ("Redfish") is a DMTF specification that uses RESTful  interface semantics to perform out-of-band systems management. (http://www.dmtf.org/standards/redfish). 
    Redfish service enables platform management tasks to be controlled by client scripts developed using secure and modern programming paradigms.
    For systems with redundant service processors, the Redfish service is accessible only on the primary service processor.   Usage information for the Redfish service is available at the following IBM  Knowledge Center link:  https://www.ibm.com/support/knowledgecenter/en/POWER8/p8hdx/p8_workingwithconsoles.htm.
    The IBM Power server supports DMTF Redfish API (DSP0266, version 1.0.3 published 2016-06-17) for systems management.
    A copy of the Redfish schema files in JSON format published by the DMTF (http://redfish.dmtf.org/schemas/v1/) is packaged in the firmware image.
    The schema files are shipped with the firmware so that the Redfish service functions properly in deployments with no WAN connectivity.
    IBM extensions to the Redfish schema are published at http://public.dhe.ibm.com/systems/power/redfish/schemas/v1. Copyright notices for the DMTF Redfish API and schemas are at: (a) http://www.dmtf.org/about/policies/copyright, and (b) http://redfish.dmtf.org/schemas/README8010.html.  A brief client sketch follows this list.
  • Support added to reduce memory usage for shared SR-IOV adapters.
  • Support for the Advanced System Management Interface (ASMI) was changed to allow the special characters of "I", "O", and "Q" to be entered for the serial number of the I/O Enclosure under the Configure I/O Enclosure option.  These characters have only been found in an IBM serial number rarely, so typing in these characters will normally be an incorrect action.  However, the special character entry is not blocked by ASMI anymore so it is able to support the exception case.  Without the enhancement, the typing of one of the special characters causes message "Invalid serial number" to be displayed.
  • Support was added to the Advanced System Management Interface (ASMI) "System Service Aids => Cable Validation" to add a timestamp for when the last time the cables were validated.
System firmware changes that affect all systems
  • A problem was fixed for disabling, from the Advanced System Management Interface (ASMI), the periodic notification for call home error log SRC B150F138 for Memory Buffer (membuf) resources.
  • A problem was fixed for the call home data for the B1xx2A01 SRC to include the min/max/average readings for more values.  The values for processor utilization, memory utilization, and node power usage were added.
  • A problem was fixed for incorrect callouts of the Power Management Controller (PMC) hardware with SRC  B1112AC4 and SRC B1112AB2 logged.  These extra callouts occur when the On-Chip Controller (OCC) has placed the system in the safe state for a prior failure that is the real problem that needs to be resolved.
  • A problem was fixed for System Vital Product Data (SVPD) FRUs  being guarded but not having a corresponding error log entry.  This is a failure to commit the error log entry that has occurred only rarely.
  • A problem was fixed for the failover to the backup PNOR on a Hostboot Self Boot Engine (SBE) failure.  Without the fix, the failed SBE causes loss of processors and memory with B15050AD logged.  With the fix, the SBE is able to access the backup PNOR and IPL successfully by deconfiguring the failing PNOR and calling it out as a failed FRU.
  • A problem was fixed for the Advanced System Management Interface (ASMI) "System Service Aids => Error/Event Logs" panel not showing the "Clear" and "Show" log options and also having a truncated error log when there are a large number of error logs on the system.
  • A problem was fixed for a system going into safe mode with SRC B1502616 logged as informational without a call home notification.  Notification is needed because the system is running with reduced performance.  If there are unrecoverable error logs and any are marked with reduced performance and the system has not been rebooted, then the system is probably running in safe mode with reduced performance.  With the fix, the SRC B1502616 is an Unrecoverable Error (UE).
  • A problem was fixed for valid IPv4 static IP addresses not being allowed to be configured and, when already configured, not being able to communicate on the network.
     The Advanced System Management Interface (ASMI) static IPv4 address configuration was not allowing "255" in the IP address subfields.  The corrected range checking is as follows (a sketch of the check appears after this list):
    Allowed values:  x.255.x.x, x.x.255.x, x.255.255.x
    Disallowed values:  x.x.x.255
    The failure to communicate on the network is seen if the problematic IP addresses were in use prior to a firmware update to 860.00, 860.10, 860.11, or 860.12.  After the firmware update, the service processor is unable to communicate on the network.  The problem can be circumvented by changing the service processor to use DHCP addressing, or by moving the IP address to a different static IP range, prior to doing the firmware update.
  • A problem was fixed for corrupt service processor error log entries caused by incorrect error log synchronization between the primary and backup service processors during firmware updates.  At the time of the corruption, an SRC B1818601 is logged with a fipsdump generated.  Then, during normal operations, periodic B1818A12 SRCs may be logged as the corrupted error log entries are encountered.  No service action is needed for the corrupted error logs, as the old corrupted entries will be deleted as new error logs are added as part of the error log housekeeping.
  • A problem was fixed for an unneeded service action request for an informational VRM redundant phase fail error logged with SRC 11002701.  If reminders for service action with SRC B150F138 are occurring for this problem, then firmware containing the fix needs to be installed and the ASMI error logs need to be cleared in order to stop the periodic reminder.
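    Static IPv4 range-check sketch (for the ASMI address fix noted above):  the function below is an illustrative Python rendering of the corrected rule, not the service processor's actual code; the handling of the first octet is not specified by the list above and is left to the basic 0-255 check.
        # Illustrative rendering of the corrected static IPv4 range check described above.
        # Octets must be 0-255, 255 is now accepted in the middle octets, but a trailing
        # octet of 255 (x.x.x.255) remains disallowed.
        def static_ipv4_allowed(address: str) -> bool:
            parts = address.split(".")
            if len(parts) != 4:
                return False
            try:
                octets = [int(p) for p in parts]
            except ValueError:
                return False
            if any(o < 0 or o > 255 for o in octets):
                return False
            return octets[3] != 255              # x.x.x.255 is still rejected

        # Examples matching the allowed/disallowed values listed above.
        assert static_ipv4_allowed("10.255.0.1")        # x.255.x.x
        assert static_ipv4_allowed("10.0.255.1")        # x.x.255.x
        assert static_ipv4_allowed("10.255.255.1")      # x.255.255.x
        assert not static_ipv4_allowed("10.0.0.255")    # x.x.x.255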

System firmware changes that affect certain systems

  • On systems using PowerVM firmware,  a problem was fixed for a blank SRC in the LPA dump for user-initiated non-disruptive adjunct dumps.  The  A2D03004 SRC is needed for problem determination and dump analysis.
  • On a system using PowerVM firmware with an IBM i partition and VIOS, a problem was fixed for a Live Partition Mobility migration of an IBM i partition that fails if there is a VIOS failover during the migration suspended window.
  • On a system using PowerVM firmware and VIOS, a problem was fixed for an HMC "Incomplete State" after a Live Partition Mobility migration followed by a VIOS failover.  The error is triggered by a delete operation on a migration adapter on the VIOS that did the failover.  The HMC "Incomplete State" can be recovered from by doing a re-IPL of the system.  This error can also prevent a VIOS from activating.
  • On systems using PowerVM firmware, a problem was fixed with SR-IOV adapter error recovery where the adapter is left in a failed state in nested error cases for some adapter errors.  The probability of this occurring is very low since the problem trigger is multiple low-level adapter failures.  With the fix, the adapter is recovered and returned to an operational state.
  • On systems using PowerVM firmware  with PCIe adapters in Single Root I/O Virtualization (SR-IOV) shared mode, a problem was fixed for the hypervisor SR-IOV adjunct partition failing during the IPL with SRCs B200F011 and B2009014 logged. The SR-IOV adjunct partition successfully recovers after it reboots and the system is operational.
  • On systems using PowerVM firmware with PCIe adapters in Single Root I/O Virtualization (SR-IOV) shared-mode in a PCIe slot with Enlarged IO Capacity and 2TB or more of system memory, a problem was fixed for the hypervisor SR-IOV adjunct partition failing during the IPL with SRCs B200F011 and B2009014 logged.  In this configuration, it is possible the SR-IOV adapter will not become functional following a system reboot or when an adapter is first configured into shared-mode.  Larger system memory configurations of 2TB or more are more likely to encounter the problem.  The problem can be avoided by reducing the number of PCIe slots with Enlarged IO Capacity enabled so it does not include adapters in SR-IOV shared-mode.  Another circumvention option is to move the adapter to an SR-IOV capable PCIe slot where Enlarged IO Capacity is not enabled.
  • On a system using PowerVM firmware and VIOS,  a problem was fixed for a Live Partition Mobility (LPM) migration for an Active Memory Sharing (AMS) partition that hangs if there is a VIOS failover during the migration.
  • On systems using PowerVM firmware, a problem was fixed for the PCIe3 Optical Cable Adapter for the PCIe3 Expansion Drawer failing with SRC B7006A84 error logged during the IPL.  The failed cable adapter can be recovered by using a concurrent repair operation to power it off and on.  Or  the system can be re-IPLed to recover the cable adapter.  The affected optical cable adapters have feature codes #EJ05, #EJ06, and #EJ08 with CCINs 2B1C, 6B52, and 2CE2, respectively.
  • On systems using PowerVM firmware, the hypervisor "vsp" macro was enhanced to show the type of the adjunct partition.  The "vsp -longname" macro option was also updated to list the location codes for the SR-IOV adjunct partitions.  The hypervisor macros are used by IBM support  to help debug Power system problems.
  • On systems using PowerVM firmware, a problem was fixed for PCIe Host Bridge (PHB) outages and PCIe adapter failures in the PCIe I/O expansion drawer caused by error thresholds being exceeded for the LEM bit [21] errors in the FIR accumulator.  These are typically minor and expected errors in the PHB that occur during adapter updates and do not warrant  a reset of the PHB and the PCIe adapter failures.  Therefore, the threshold LEM[21] error limit has been increased and the LEM fatal error has been changed to a Predictive Error to avoid the outages for this condition.
  • On systems using PowerVM firmware, a problem was fixed to improve PCIe3 I/O expansion drawer (#EMX0) link stability.  The settings for the continuous time linear equalizers (CTLE) were updated for all the PCIe adapters for the PCIe links to the expansion drawer.  The CEC must be re-IPLed for the fix to activate.
  • On systems using PowerVM firmware with IBM i partitions, a problem was fixed for frequent logging of informational B7005120 errors due to communications path closed conditions during messaging from HMCs to IBMi partitions.  In the majority of cases these errors are due to normal operating conditions and not due to errors that require service or attention.  The logging of informational errors due to this specific communications path closed condition that are the result of normal operating conditions has been removed.
  • On a system using PowerVM firmware with an IBM i partition, a problem was fixed for a D-mode boot failure for IBM i from a USB RDX cartridge.  There is a hang at the LPAR progress code C2004130 for a period of time and then a failure with SRC B2004158 logged.  There is a USB External Dock (FC #EU04) and Removable Disk Cartridge (RDX) 63B8-005 attached.  The error is intermittent, so the RDX can be powered off and back on to retry the D-mode boot to recover.
  • On systems using PowerVM firmware,  the following problems were fixed for SR-IOV adapters:
    1) Insufficient resources were reported for an SR-IOV logical port configured with promiscuous mode enabled and a Port VLAN ID (PVID) when creating a new interface on the SR-IOV adapters.
    2) Spontaneous dumps and reboots of the adjunct partition for SR-IOV adapters.
    3) The adapter enters a firmware loop when a single bit ECC error is detected.  System firmware detects this condition as an adapter command time out.  System firmware will reset and restart the adapter to recover the adapter functionality.  This condition will be reported as a temporary adapter hardware failure.
    4) vNIC interfaces not being deleted correctly, causing SRC B400FF01 to be logged and Data Storage Interrupt (DSI) errors with a failure on boot of the LPAR.
    This set of fixes updates adapter firmware to 10.2.252.1926, for the following Feature Codes: EN15, EN16, EN17, EN18, EN0H, EN0J, EN0M, EN0N, EN0K, EN0L, EL38 , EL3C, EL56, and EL57.
    The SR-IOV adapter firmware level update for the shared-mode adapters happens under user control to prevent unexpected temporary outages on the adapters.  A system reboot will update all SR-IOV shared-mode adapters with the new firmware level.  In addition, when an adapter is first set to SR-IOV shared mode, the adapter firmware is updated to the latest level available with the system firmware (and it is also updated automatically during maintenance operations, such as when the adapter is stopped or replaced).  And lastly, selective manual updates of the SR-IOV adapters can be performed using the Hardware Management Console (HMC).  To selectively update the adapter firmware, follow the steps given at the IBM Knowledge Center for using HMC to make the updates:   https://www.ibm.com/support/knowledgecenter/HW4M4/p8efd/p8efd_updating_sriov_firmware.htm.
    Note: Adapters that are capable of running in SR-IOV mode, but are currently running in dedicated mode and assigned to a partition, can be updated concurrently either by the OS that owns the adapter or the managing HMC (if OS is AIX or VIOS and RMC is running).
  • On systems using PowerVM firmware with an IBM i partition, a problem was fixed for incorrect maximum performance reports based on the wrong number of "maximum" processors for the system.   Certain performance reports that can be generated on IBMi systems contain not only the existing machine information, but also "what-if" information, such as "how would this system perform if it had all the processors possible installed in this system".  This "what-if" report was in error because the maximum number of processors possible was too high for the system.
  • On systems using PowerVM firmware, a problem was fixed for degraded PCIe3 links for the PCIe3 expansion drawer with SRC B7006A8F not being visible on the HMC.  This occurred because the SRC was informational.  The problem occurs when the link attaching a drawer to the system trains to x8 instead of x16.  With the fix, the SRC has been changed to a B7006A8B permanent error for the degraded link.
  • On systems using PowerVM firmware, a problem was fixed for a concurrent exchange of a CAPI adapter that left the new adapter in a deactivated state.   The system can be powered off and IPLed again to recover the new adapter.  The CAPI adapters have the following feature codes:  #EC3E, #EC3F, #EC3L, #EC3M, #EC3T, #EC3U, #EJ16, #EJ17, #EJ18, #EJ1A, and #EJ1B.
  • On a system using PowerVM firmware with SR-IOV adapters, a problem was fixed for a DLPAR remove on a Virtual Function (VF) of a ConnectX-4 (CX4) adapter that failed with AIX error "0931-013 Unable to isolate the resource".  The HMC reported error is "HSCL12B5 The operation to remove SR-IOV logical port xx failed because of the following error: HSCL131D The SR-IOV logical port is still in use by the partition".  The failing PCIe3 adapters are sourced from Mellanox Corporation based on ConnectX-4 technology and have the following feature codes and CCINs:  #EC3E and #EC3F with CCIN 2CEA; #EC3L and #EC3M with CCIN 2CEC; and #EC3T and #EC3U with CCIN 2CEB.  The issue occurs each time a DLPAR remove operation is attempted on the VF.  Restarting the partition after a failed DLPAR remove recovers from the error.
  • On systems using PowerVM firmware, a problem was fixed for NVRAM corruption that can occur when deleting a partition that owns a CAPI adapter, if that CAPI adapter is not assigned to another partition before the system is powered off.  On a subsequent IPL, the system will come up in recovery mode if there is NVRAM corruption.  To recover, the partitions must be restored from the HMC.  The frequency of this error is expected to be rare.  The CAPI adapters have the following feature codes:  #EC3E, #EC3F, #EC3L, #EC3M, #EC3T, #EC3U, #EJ16, #EJ17, #EJ18, #EJ1A, and #EJ1B.
  • On systems using PowerVM firmware, a problem was fixed for NVRAM corruption and an HMC recovery state when using Simplified Remote Restart partitions.  The failing systems will have at least one Remote Restart partition, and on the failed IPL there will be a B70005301 SRC with word 7 being 0x00000002.
  • On systems using PowerVM firmware, a problem was fixed for a group of shared processor partitions being able to exceed the designated capacity placed on a shared processor pool.  This error can be triggered by using the DLPAR move function for the shared processor partitions, if the pool has already reached its maximum specified capacity.  To prevent this problem from occurring when making DLPAR changes when the pool is at the maximum capacity, do not use the DLPAR move operation but instead break it into two steps:  DLPAR remove followed by DLPAR add.  This gives enough time for the DLPAR remove to be fully completed prior to starting the DLPAR add request.
  • On systems using PowerVM firmware, a problem was fixed for partition boot failures and run time DLPAR failures when adding I/O that log BA210000, BA210003, and/or BA210005 errors.  The fix also applies to run time failures configuring an I/O adapter following an EEH recovery that log BA188001 events.  The problem can impact IBMi partitions running in any processor mode or AIX/Linux partitions running  in P7 (or older) processor compatibility modes.  The problem is most likely to occur when the system is configured in the Manufacturing Default Configuration (MDC) mode.  The trigger for the problem is a race-condition between the hypervisor and the physical operations panel with a very rare frequency of occurrence.
SC860_070_056 / FW860.12

01/13/17
Impact:  Availability      Severity:  SPE

System firmware changes that affect certain systems

  • On a system using PowerVM firmware, a problem was fixed for the System Management Services (SMS) SAS utility showing very large (incorrect) disk capacity values depending on the size of the disk or Volume Set/Array.  The problem occurs when the number of blocks on a disk is 2 G or more (see the sketch after this list).
  • On a system using PowerVM firmware running a Linux OS, a problem was fixed for support for Coherent Accelerator Processor Interface (CAPI) adapters.  The CAPI related RTAS h-calls for the CAPI devices could not be made by the Linux OS, impacting the CAPI adapter functionality and usability.  This problem involves the following adapters:  the PCIe3 LP CAPI Accelerator Adapter with F/C #EJ16 that is used on the S812L (8247-21L) and S822L (8247-22L) models;  the PCIe3 CAPI FlashSystem Accelerator Adapter with F/C #EJ17 that is used on the S814 (8286-41A) and S824 (8286-42A) models;  and the PCIe3 CAPI FlashSystem Accelerator Adapter with F/C #EJ18 that is used on the S822 (8284-22A), E870 (9119-MME), and E880 (9119-MHE) models.  This problem does not pertain to PowerVM AIX partitions using CAPI adapters.
  • On a system using PowerVM firmware, a problem was fixed for Live Partition Mobility (LPM) migrations to FW860.10 or FW860.11 from any other level of firmware (i.e. not FW860.10 or FW860.11) that caused errors in the output of the AIX "lsattr -El mem0" command and Dynamic LPAR (DLPAR) operations.  The "lsattr" command will report the partition only has one logical memory block (LMB) of memory assigned to it, even though there is more memory assigned to the partition.  Also, as a result of this problem, DLPAR operations will fail with an error indicating the request could not be completed.  This issue affects AIX 5.3, AIX 6.1, AIX 7.1, and AIX 7.2 TL 0, and may result in the AIX DLPAR error message "0931-032 Firmware failure.  Data may be out of sync and the system may require a reboot."  This issue also affects all levels of Linux.  Not affected by this issue are AIX 7.2 TL 1, VIOS, and IBM i partitions.
    In addition, after performing LPM from FW860 to earlier versions of firmware,  the DLPAR of Virtual Adapters will fail with HMC error message HSCL294C, which contains text similar to the following:  "0931-007 You have specified an invalid drc_name."
    Without the fix, a reboot of the migrated partition will correct the problem.
  • On a system using PowerVM firmware, a problem was fixed for I/O DLPARs that result in partition hangs.  To trigger the problem, the DLPAR operation must be performed on a partition which has been migrated via a Live Partition Mobility (LPM) operation from a P6 or P7 system to a P8 system.  Additionally, DLPAR of I/O will fail when performed on a partition which has been migrated via an LPM operation from a P8 system to a P6 or P7 system.  The failure will produce HMC error message HSCL2928, which contains text similar to the following: "0931-011  Unable to allocate the resource to the partition." DLPAR operations for memory or CPU are not affected.  This issue affects all Linux and AIX partitions.  IBMi partitions are not affected.
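    Disk capacity overflow sketch (for the SMS SAS utility fix noted above):  the snippet below illustrates why a block count of 2 G or more can produce a nonsensical capacity if the count passes through signed 32-bit arithmetic; the 32-bit cause is an inference from the 2 G threshold, not a statement of the actual SMS implementation.
        # Illustrative only: a 2G-block disk (1 TiB at 512-byte blocks) computed correctly
        # with full-width arithmetic versus a block count wrapped to a signed 32-bit value.
        import ctypes

        BLOCK_SIZE = 512                         # bytes per block, typical for SAS disks
        blocks = 2 * 1024**3                     # 2 G blocks

        capacity_ok = blocks * BLOCK_SIZE        # Python integers do not overflow
        print(capacity_ok // 1024**4, "TiB")     # prints 1

        blocks_32 = ctypes.c_int32(blocks).value # wraps to -2147483648
        capacity_bad = blocks_32 * BLOCK_SIZE
        print(capacity_bad)                      # negative; shown unsigned it looks enormous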
SC860_063_056 / FW860.11

12/05/16
Impact:  N/A      Severity:  N/A
  • This Service Pack contained updates for MANUFACTURING ONLY.
SC860_056_056 / FW860.10

11/18/16
Impact:  New      Severity:  New

New features and functions

  • Support enabled for Live Partition Mobility (LPM) operations.
  • Support enabled for partition Suspend and Resume from the HMC.
  • Support enabled for partition Remote Restart.
  • Support enabled for PowerVM vNIC.  PowerVM vNIC combines many of the best features of SR-IOV and PowerVM SEA to provide a network solution with options for advanced functions such as Live Partition Mobility, along with better performance and I/O efficiency when compared to PowerVM SEA.  In addition, PowerVM vNIC provides users with bandwidth control (QoS) capability by leveraging SR-IOV logical ports as the physical interface to the network.
  • Support for dynamic setting of the Simplified Remote Restart VM property, which enables this property to be turned on or off dynamically with the partition running.
  • Support for PowerVM and HMC  to get and set the boot list of a partition.
  • Support for PowerVM partition restart in a Disaster Recovery (DR) environment.
  • Support on PowerVM for a partition with 32 TB memory.  AIX, IBM i, and Linux are supported, but IBM i must be IBM i 7.3 TR1.  IBM i 7.2 has a limit of 16 TB per partition and IBM i 7.1 has a limit of 8 TB per partition.  AIX level must be 7.1S or later.  Linux distributions supported are RHEL 7.2 P8, SLES 12 SP1, Ubuntu 16.04 LTS, RHEL 7.3 P8, SLES 12 SP2, Ubuntu 16.04.1, and SLES 11 SP4 for SAP HANA.
  • Support for PowerVM and PowerNV (non-virtualized or OPAL bare-metal) booting from a PCIe Non-Volatile Memory express (NVMe) flash adapter.  The adapters include feature codes #EC54 and #EC55 - 1.6 TB,  and #EC56 and #EC57 - 3.2 TB  NVMe flash adapters with CCIN 58CB and 58CC respectively.
  • Support for PowerVM NovaLink V1.0.0.4 which includes the following features:
    - IBM i network boot
    - Live Partition Mobility (LPM) support for inactive source VIOS
    - Support for SR-IOV configurations, vNIC, and vNIC failover
    - Partition support for Red Hat Enterprise Linux
  • Support for a decrease in the amount of PowerVM memory needed to support Huge Dynamic DMA Window (HDDW) for a PCI slot by using 64K pages instead of 4K pages.  The hypervisor now allocates only enough storage for the Enlarged IO Capacity (Huge Dynamic DMA Window) capable slots to map every page in main storage with 64K pages rather than the 4K pages used previously.  This affects only the Linux OS, as AIX and IBM i do not use HDDW.  A rough sizing comparison is sketched after this list.
  • Support added to reduce the number of  error logs and call homes for the non-critical FRUs for the power and thermal faults of the system.
  • Support for redundancy in the transfer of partition state for Live Partition Mobility (LPM) migration operations.  Redundant VIOS Mover Service Partitions (MSPs) can be defined along with redundant network paths at the VIOS/MSP level.  When redundant MSP pairs are used, the migrating memory pages of the logical partition are transferred from the source system to the target system by using two MSP pairs simultaneously.  If one MSP pair fails, the migration operation continues by using the other MSP pair.  In some scenarios, where a common shared Ethernet adapter is not used, use redundant MSP pairs to improve performance and reliability.
    Note:  For an LPM migration of a partition using Active Memory Sharing (AMS) in a dual (redundant) MSP configuration, the LPM operation may hang if the MSP connection fails during the LPM migration.  To avoid this issue, which applies only to AMS partitions, AMS migrations should be done only from the HMC command line using the migrlpar command and specifying --redundentmsp 0 to disable the redundant MSPs.
    Note: To use redundant MSP pairs, all VIOS MSPs must be at version 2.2.5.00 or later, the HMC at version 8.6.0 or later, and the firmware level FW860 or later.
    For more information on LPM and VIOS supported levels and restrictions, refer to the following links on the IBM Knowledge Center:
    http://www.ibm.com/support/knowledgecenter/PurePower/p8hc3/p8hc3_firmwaresupportmatrix.htm
    https://www.ibm.com/support/knowledgecenter/HW4L4/p8eeo/p8eeo_ipeeo_main.htm
  • Support for failover capability for vNIC client adapters in the PowerVM hypervisor, rather than requiring the failover configuration to be done in the client OS.  To create a redundant connection, the HMC adds another vNIC server with the same remote lpar ID and remote DRC as the first, giving each server its own priority.
  • Support for SAP HANA with Solution edition with feature code #EPVR on 3.65 GHz processors with 12-core activations and 512 GB memory activations on SUSE Linux.  SAP HANA is an in-memory platform for processing high volumes of data in real time.  HANA allows data analysts to query large volumes of data in real time.  HANA's in-memory database infrastructure frees analysts from having to load or write back data.
  • Support for the Hardware Management Console (HMC)  to access the service processor IPMI credentials and to retrieve Performance and Capacity Monitor (PCM) data for viewing in a tabular format or for exporting as CSV values. The enhanced HMC interface can now start and stop VIOS Shared Storage Pool (SSP) monitoring from the HMC and start and stop SSP historical data aggregation.
  • Support for the Advanced System Management Interface (ASMI) was changed to not create VPD deconfiguration records and call home alerts for hardware FRUs that have one VPD chip of a redundant pair broken or inaccessible.  The backup VPD chip for the FRU allows continued use of the hardware resource.  The notification of the need for service for the FRU VPD is not provided until both of the redundant VPD chips have failed for a FRU.
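    HDDW sizing sketch (for the Enlarged IO Capacity change noted above):  the snippet below gives a rough feel for the memory reduction from mapping main storage with 64K pages instead of 4K pages, assuming one 8-byte translation entry per mapped page per HDDW-capable slot; the entry size and per-slot layout are assumptions for illustration, not the hypervisor's internal format.
        # Rough illustration of why 64K pages shrink the Huge Dynamic DMA Window tables.
        # Assumes one 8-byte translation entry per mapped page per HDDW-capable slot.
        MAIN_STORAGE = 2 * 1024**4           # example: 2 TB of main storage
        ENTRY_BYTES = 8                      # assumed translation entry size
        SLOTS = 4                            # example count of HDDW-capable slots

        def table_bytes(page_size: int) -> int:
            pages = MAIN_STORAGE // page_size
            return pages * ENTRY_BYTES * SLOTS

        gb = 1024**3
        print("4K pages :", table_bytes(4 * 1024) / gb, "GB")    # 16.0 GB in this example
        print("64K pages:", table_bytes(64 * 1024) / gb, "GB")   # 1.0 GB, a 16x reduction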

System firmware changes that affect all systems

  • A problem was fixed for a failed IPL with SRC UE BC8A090F that does not have a hardware callout or a guard of the failing hardware.  The system may be recovered by guarding out the processor associated with the error and re-IPLing the system.  With the fix, the bad processor core is guarded and the system is able to IPL.
  • A problem was fixed for an infrequent service processor failover hang that results in a reset of the backup service processor that is trying to become the new primary.  This error occurs more often on a failover to a backup service processor that has been in that role for a long period of time (many months).  This error can cause a concurrent firmware update to fail.  To reduce the chance of a firmware update failure because of a bad failover, an Administrative Failover (AFO) can be requested from the HMC prior to the start of the firmware update.  When the AFO has completed, the firmware update can be started as normally done.
  • A problem was fixed for an Operations Panel Function 04 (Lamp test) during an IPL causing the IPL to fail.  With the fix, the lamp test request is rejected during the IPL until the hypervisor is available.  The lamp test can be requested without problems anytime after the system is powered on to hypervisor ready or an OS is running in a partition.
  • A problem was fixed for On-Chip Controller (OCC) errors that had excessive callouts for processor FRUs.  Many of the OCC errors are recoverable and do not require that the processor be called out and guarded.  With the fix, the processors will only be called out for OCC errors if there are three or more OCC failures during a time period of a week.
  • A problem was fixed for the loss of the setting for the disable of a periodic notification for a call home error log after a failover to the backup service processor on a redundant service processor system.  The call home for the presence of a failed resource can get re-enabled (if manually disabled in ASMI on the primary service processor) after a concurrent firmware update or any scenario that causes the service processor to fail over and change roles.  With the fix, the periodic notification flag is synchronized between the service processors when the flag value is changed.
  • A problem was fixed for the On-Chip Controller (OCC) incorrectly calling out processors with SRC B1112A16 for L4 Cache DIMM failures with SRC B124E504.  This false error logging can occur if the DIMM slot that is failing is adjacent to two unoccupied DIMM slots.
  • A problem was fixed for CEC drawer deconfiguration during an IPL due to SRCs BC8A0307 and BC8A1701 that did not have the correct hardware callout for the failing SCM.  With the fix, the failing SCM is called out and guarded so the CEC drawer will IPL even though there is a failed processor.
  • A problem was fixed for device time outs during an IPL logged with an SRC B18138B4.  This error is intermittent and no action is needed for the error log.  The service processor hardware server has allotted more time for the device transactions to allow the transactions to complete without a time-out error.

System firmware changes that affect certain systems

  • DISRUPTIVE:  On systems using the PowerVM firmware, a problem was fixed for an "Incomplete" state caused by initiating a resource dump with selector macros from NovaLink (vio -dump -lp 1 -fr).  The failure causes the stack frame size of the communication process HVHMCCMDRTRTASK to be exceeded, with a hypervisor page fault that disrupts the NovaLink and/or HMC communications.  The recovery action is to re-IPL the CEC, but that will need to be done without the assistance of the management console.  For each partition that has an OS running on the system, shut down each partition from the OS.  Then, from the Advanced System Management Interface (ASMI), power off the managed system.  Alternatively, the system power button may also be used to do the power off.  If the management console Incomplete state persists after the power off, the managed system should be rebuilt from the management console.  For more information on management console recovery steps, refer to this IBM Knowledge Center link: https://www.ibm.com/support/knowledgecenter/en/POWER7/p7eav/aremanagedsystemstate_incomplete.htm.  The fix is disruptive because the size of the PowerVM hypervisor must be increased to accommodate the over-sized stack frame of the failing task.
  • DEFERRED:  On systems using the PowerVM firmware, a problem was fixed for a CAPI function unavailable condition on a system with the maximum number of CAPI adapters and partitions.  Not enough bytes were allocated for CAPI for the maximum configuration case.  The problem may be circumvented by reducing the number of active partitions or CAPI adapters.   The fix is deferred because the size of the hypervisor must be increased to provide the additional CAPI space.
  • DEFERRED:   On systems using PowerVM firmware, a problem was fixed for cable card capable PCI slots that fail during the IPL.  Hypervisor I/O Bus Interface UE B7006A84 is reported for each cable card capable PCI  slot that doesn't contain a PCIe3 Optical Cable Adapter for the PCIe Expansion Drawer (feature code #EJ05).  PCI slots containing a cable card will not report an error but will not be functional.  The problem can be resolved by performing an AC cycle of the system.  The trigger for the failure is the I2C devices used to detect the cable cards are not coming out of the power on reset process in the correct state due to a race condition.
  • On systems using PowerVM firmware, a problem was fixed for network issues, causing critical situations for customers, when an SR-IOV logical port or vNIC is configured with a non-zero Port VLAN ID (PVID).  This fix updates adapter firmware to 10.2.252.1922, for the following Feature Codes: EN15, EN16, EN17, EN18, EN0H, EN0J, EL38, EN0M, EN0N, EN0K, EN0L, and EL3C.
    The SR-IOV adapter firmware level update for the shared-mode adapters happens under user control to prevent unexpected temporary outages on the adapters.  A system reboot will update all SR-IOV shared-mode adapters with the new firmware level.  In addition, when an adapter is first set to SR-IOV shared mode, the adapter firmware is updated to the latest level available with the system firmware (and it is also updated automatically during maintenance operations, such as when the adapter is stopped or replaced).  And lastly, selective manual updates of the SR-IOV adapters can be performed using the Hardware Management Console (HMC).  To selectively update the adapter firmware, follow the steps given at the IBM Knowledge Center for using HMC to make the updates:   https://www.ibm.com/support/knowledgecenter/HW4M4/p8efd/p8efd_updating_sriov_firmware.htm.
    Note: Adapters that are capable of running in SR-IOV mode, but are currently running in dedicated mode and assigned to a partition, can be updated concurrently either by the OS that owns the adapter or the managing HMC (if OS is AIX or VIOS and RMC is running).
  • On systems using the PowerVM firmware, a problem was fixed for a Live Partition Mobility migration that resulted in the source managed system going to the management console Incomplete state after the migration to the target system was completed.  This problem is very rare and has only been detected once.  The problem trigger is that the source partition does not halt execution after the migration to the target system.  The management console went to the Incomplete state for the source managed system when it failed to delete the source partition because the partition would not stop running.  When this problem occurred, the customer network was running very slowly and this may have contributed to the failure.  The recovery action is to re-IPL the source system, but that will need to be done without the assistance of the management console.  For each partition that has an OS running on the source system, shut down each partition from the OS.  Then, from the Advanced System Management Interface (ASMI), power off the managed system.  Alternatively, the system power button may also be used to do the power off.  If the management console Incomplete state persists after the power off, the managed system should be rebuilt from the management console.  For more information on management console recovery steps, refer to this IBM Knowledge Center link: https://www.ibm.com/support/knowledgecenter/en/POWER7/p7eav/aremanagedsystemstate_incomplete.htm
  • On systems using PowerVM firmware,  a problem was fixed for a shared processor pool partition showing an incorrect zero "Available Pool Processor" (APP) value after a concurrent firmware update.  The zero APP value means that no idle cycles are present in the shared processor pool but in this case it stays zero even when idle cycles are available.  This value can be displayed using the AIX "lparstat" command.  If this problem is encountered, the partitions in the affected shared processor pool can be dynamically moved to a different shared processor pool.  Before the dynamic move, the  "uncapped" partitions should be changed to "capped" to avoid a system hang. The old affected pool would continue to have the APP error until the system is re-IPLed.
  • On systems using PowerVM firmware, a problem was fixed for a latency time of about 2 seconds being added to a target Live Partition Mobility (LPM) migration system when there is a latency time check failure.  With the fix, in the case of a latency time check failure, a much smaller default latency is used instead of two seconds.  This error would not be noticed if the customer system is using a NTP time server to maintain the time.
  • On multi-node systems with an incorrect memory configuration of DDR3 and DDR4 DIMMs, a problem was fixed for the IPL hanging for four hours instead of terminating immediately.
  • On systems using PowerVM firmware,  a rare problem was fixed for a system hang that can occur  when dynamically moving "uncapped" partitions to a different shared processor pool.  To prevent a system hang, the "uncapped" partitions should be changed to "capped" before doing the move.
  • On systems using the PowerVM firmware, support was added for a new utility option for the System Management Services (SMS) menus.  This is the SMS SAS I/O Information Utility.  It has been introduced to allow a user to get additional information about the attached SAS devices.  The utility is accessed by selecting option 3 (I/O Device Information) from the main SMS menu, and then selecting the option for "SAS Device Information".
  • On systems using the PowerVM hypervisor firmware and Novalink, a problem was fixed for a NovaLink installation error where the hypervisor was unable to get the maximum logical memory buffer (LMB) size from the service processor.  The maximum supported LMB size should be 0xFFFFFFFF but in some cases it was initialized to a value that was less than the amount of configured memory, causing the service processor read failure with error code 0X00000134.
  • On systems using the PowerVM hypervisor firmware and CAPI adapters, a problem was fixed for CAPI adapter error recovery.  When the CAPI adapter goes into the error recovery state, the Memory Mapped I/O (MMIO) traffic to the adapter from the OS continues, disrupting the recovery.  With the fix, the MMIO and DMA traffic to the adapter are now frozen until the CAPI adapter is fully recovered.   If the adapter becomes unusable because of this error, it can be recovered using concurrent maintenance steps from the HMC, keeping the adapter in place during the repair.  The error has a low frequency since it only occurs when the adapter has failed for another reason and needs recovery.
  • On systems using the PowerVM hypervisor firmware, when using affinity groups, if the group includes a VIOS, ensure the group is placed in the same drawer where the VIOS physical I/O is located.  Prior to this change, if the VIOS was in an affinity group with other partitions, the partition placement could override the VIOS adapter placement rules and the VIOS could end up in a different drawer from the I/O adapters.
  • On systems using PowerVM firmware,  a problem was fixed to improve error recovery when attempting to boot an iSCSI target backed by a drive formatted with a block size other than 512 bytes.  Instead of stopping on this error, the boot attempt fails and then continues with the next potential boot device.  Information regarding the reason for the boot failure is available in an error log entry.  The 512 byte block size for backing devices for iSCSI targets is a partition firmware requirement.
  • On systems using PowerVM firmware, a problem was fixed for extra resources being assigned in a Power Enterprise Pool (PEP).   This only occurs if all of these things happen:
     o  Power server is in a PEP pool
     o  Power server has PEP resources assigned to it
     o  Power server powered down
     o  User uses HMC to 'remove' resources from the powered-down server
     o  Power server is then restarted. It should come up with no PEP resources, but it starts up and shows it still is using PEP resources it should not have. 
    To recover from this problem, the HMC 'remove' of the PEP resources from the server can be performed again.
  • On systems using PowerVM firmware, a problem was fixed for a false thermal alarm in the active optical cables (AOC) for the PCIe3 expansion drawer with SRCs B7006AA6 and B7006AA7 being logged every 24 hours.  The AOC cables have feature codes of #ECC6 through #ECC9, depending on the length of the cable.  The SRCs should be ignored as they call for the replacement of the cable, cable card, or the expansion drawer module.  With the fix, the false AOC thermal alarms are no longer reported.
  • On systems using PowerVM firmware that have an attached HMC,  a problem was fixed for a Live Partition Mobility migration that resulted in a system hang when an EEH error occurred simultaneously with a request for a page migration operation.  On the HMC, it shows an incomplete state for the managed system with reference code A181D000.  The recovery action is to re-IPL the source system but that will need to be done without the assistance of the HMC.  From the Advanced System Management Interface (ASMI),  power off the managed system.  Alternatively, the system power button may also be used to do the power off.  If the HMC Incomplete state persists after the power off, the managed system should be rebuilt from the HMC.  For more information on HMC recovery steps, refer to this IBM Knowledge Center link: https://www.ibm.com/support/knowledgecenter/en/POWER7/p7eav/aremanagedsystemstate_incomplete.htm


SC840
For Impact, Severity and other Firmware definitions, Please refer to the below 'Glossary of firmware terms' url:
http://www14.software.ibm.com/webapp/set2/sas/f/power5cm/home.html#termdefs
SC840_177_056 / FW840.60

09/29/17
Impact:  Availability      Severity:  SPE

System firmware changes that affect all systems

  • A problem was fixed for a false 110026B1 (12V power good failure) caused by an I2C bus write error for a LED state.  This error can be triggered by the fan LEDs changing state.
  • A problem was fixed for a fan LED turning amber on solid when there is no fan fault, or when the fan fault is for a different fan.  This error can be triggered anytime a fan LED needs to change its state.  The fan LEDs can be recovered to a normal state concurrently using the following link steps for a soft reset of the service processor:  https://www.ibm.com/support/knowledgecenter/POWER8/p8hby/p8hby_softreset.htm
  • A problem was fixed for sporadic blinking amber LEDs for the system fans with no SRCs logged.  There was no problem with the fans.  The LED corruption occurred when two service processor tasks attempted to update the LED state at the same time.  The fan LEDs can be recovered to a normal state concurrently using the following link steps for a soft reset of the service processor:  https://www.ibm.com/support/knowledgecenter/POWER8/p8hby/p8hby_softreset.htm
  • A problem was fixed for the loss of Operations Panel function 30 (displaying ethernet port  HMC1 and HMC2 IP addresses) after a concurrent repair of the Operations Panel.  Operations  Panel function 30 can be restored concurrently using the following link steps for a soft reset of the service processor:  https://www.ibm.com/support/knowledgecenter/POWER8/p8hby/p8hby_softreset.htm
  • A problem was fixed for a core dump of the rtiminit (service processor time of day) process that logs an SRC B15A3303  and could invalidate the time on the service processor.  If the error occurs while the system is powered on, the hypervisor has the master time and will refresh the service processor time, so no action is needed for recovery.  If the error occurs while the system is powered off, the service processor time must be corrected on the systems having only a single service processor.  Use the following steps from the IBM Knowledge Center to change the UTC time with the Advanced System Management Interface:  https://www.ibm.com/support/knowledgecenter/en/POWER8/p8hby/viewtime.htm.
  • A problem was fixed for the "Minimum code level supported" not being shown by the Advanced System Menu Interface when selecting the "System Configuration/Firmware Update Policy" menu.  The message shown is "Minimum code level supported value has not been set".  The workaround to find this value is to use the ASMI command line interface with the "registry -l cupd/MinMifLevel" command.
  • A problem was fixed for a degraded PCI link causing a Predictive SRC for a non-cacheable unit (NCU) store time-out that occurred with SRC B113E540 or B181E450 and PRD signature  "(NCUFIR[9]) STORE_TIMEOUT: Store timed out on PB".  With the fix, the error is changed to be an Informational as the problem is not with the processor core and the processor should not be replaced.  The solution for degraded PCI links is different from the fix for this problem, but a re-IPL of the CEC or a reset of the PCI adapters could help to recover the PCI links from their degraded mode.
  • A problem was fixed for system node fans going to maximum RPM speeds after a service processor failover that needed the On-Chip Controllers (OCC) to be reloaded.  Without the fix, the system node fan speeds can be restored to normal speed by changing the Power Mode in the Advanced System Management Interface (ASMI) using steps from the IBM Knowledge Center:  https://www.ibm.com/support/knowledgecenter/en/POWER8/p8hby/areaa_pmms.htm.  After changing the Power Mode, wait about 10 minutes before changing the Power Mode back to the original setting.
    If the fix is applied concurrently and the fans are already in the maximum RPM speed condition, the system node fan speeds can be corrected by either changing the Power Mode as above, or using the HMC to do an Administrative Failover (AFO).
  • A problem was fixed for the System Attention LED failing to light for an error failover for the redundant service processors with an SRC B1812028 logged.
  • A problem was fixed for a service processor reset triggered by a spurious false IIC interrupt request in the kernel.  On systems with a single service processor, the SRC B1817201 is displayed on the Operator Panel.  For systems with redundant service processors, an error failover to the backup service processor occurs.  The problem is extremely infrequent and does not impact processes on the running system.
  • A problem was fixed for the service processor low-level boot code always running off the same side of the flash image, regardless of what side has been selected for boot ( P-side or T-side).  Because this low-level boot code rarely changes, this should not cause a problem unless corruption occurs in the flash image of the boot code.  This problem does not affect firmware side-switches as the service processor initialization code (higher-level  code than the boot code) is running correctly from the selected side.  Without the fix, there is no recovery for boot corruption for systems with a single service processor as the service processor must be replaced.
  • A problem was fixed for a system failure caused by Hostboot problems with one node but where the other nodes are good.  With the fix, the node that is failing the Hostboot is deconfigured and the system is able to IPL on the remaining nodes.  To recover from this problem, manually guard the node that is failing and re-IPL.
  • A problem was fixed for help text in the Advanced System Management Interface (ASMI) not informing the user that system fan speeds would increase if the system Power Mode was changed to "Fixed Maximum Frequency" mode.  If ASMI panel function "System Configuration->Power Management->Power Mode Setup" "Enable Fixed Maximum Frequency mode" help is selected, the updated text states "...This setting will result in the fans running at the maximum speed for proper cooling."
  • A problem was fixed for a Power Supply Unit (PSU) failure with SRC 110015xF logged with a power supply fan callout when doing a hot re-plug of a PSU.  The power supply may be made operational again by doing a dummy replace of the PSU that was called out (keeping the same PSU for the replace operation).  A re-IPL of the system will also recover the PSU.
  • A problem was fixed for recovery from clock card loss of lock failures that resulted in a clock card FRU unnecessarily being called out for repair.  This error happened whenever there was a loss of lock (PLL or CRC) for the clock card.  With the fix, firmware will not be calling out the failing clock card, but rather it will be re-configured as the new backup clock card after doing a clock card failover.  Customers will see a benefit from improved system availability by the avoidance of disruptive clock card repairs.

System firmware changes that affect certain systems

  • DEFERRED:  On systems using PowerVM firmware, a problem was fixed to improve PCIe3 I/O expansion drawer (#EMX0) link stability.  The settings for the continuous time linear equalizers (CTLE) were updated for all the PCIe adapters for the PCIe links to the expansion drawer.  The CEC must be re-IPLed for the fix to activate.
  • On systems using PowerVM firmware,  a problem was fixed for an intermittent service processor core dump and callout for netsCommonMSGServer with SRC B181EF88.   The HMC connection to the service processor automatically recovers with a new session.
  • On systems using PowerVM firmware with a Linux Little Endian (LE) partition, a problem was fixed for system reset interrupts returning the wrong values in the debug output for the NIP and MSR registers.  This problem reduces the ability to debug hung Linux partitions using system reset interrupts.  The error occurs every time a system reset interrupt is used on a Linux LE partition.
  • On systems using PowerVM firmware, a problem was fixed for "Time Power On" enabled partitions not being capable of suspend and resume operations.  This means Live Partition Mobility (LPM) would not be able to migrate this type of partition.  As a workaround, the partition could be transitioned to a "Non-time Power On" state and then made capable of suspend and resume operations.
  • On systems using PowerVM firmware,  a problem was fixed for  Power Enterprise Pool (PEP) IFL processors assignments causing an "Out of Compliance" for normal processor licenses.  The number of IFL processors purchased was first credited as satisfying any "unreturned" PEP processor resources, thus potentially leaving the system "Out Of Compliance" since IFL processors should not be taking the place of the normal (expensive) processor usage.  In this situation, without the fix, the user will need to either purchase more "expensive" non-IFL processors to satisfy the non-IFL workloads or adjust the partitions to reduce the usage of non-IFL processors.  This is a very infrequent problem for the following reasons: 
    1) PEP processors are infrequently left "unreturned" for short periods of time for specialized operations such as LPM migrations
    2) The user would have to purchase IFL processors from IBM, which is not a common occurrence.
    3) The user would have to put in a COD key for IFL processors while a PEP processor is still "unreturned"
  • On systems using PowerVM firmware, a problem was fixed for a Power Enterprise Pool (PEP) resource Grace Period being short by one hour, with 71 hours provided instead of 72 hours.  The Grace Period is provided when all PEP resources are assigned and the user double-uses these resources (typically this is done for a Live Partition Mobility (LPM) migration).  This "borrowing" is temporarily permitted in this case even if there are not enough licenses to cover resources in both servers.  The PEP goes into "Approaching Out Of Compliance", indicating the user has a certain amount of time to resolve this double-use.  The problem here is that the Grace Period lasts one hour less than stated.  For a 72-hour Grace Period (the standard setting), the user only gets 71 hours.  The user sees "71 hours remaining" (correct) on the first display, but an immediate refresh shows 70 hours remaining.  Thereafter, the Grace Period time decrements correctly for the time remaining.
  • On systems using PowerVM firmware, a problem was fixed for Power Enterprise Pool (PEP) non-applicable error messages being displayed when re-entering PEP XML files for PEP updates, in which one of the XML operations calls for Conversion of Perm Resources to PEP Resources.  There is no error as the PEP key was accepted on the first use.  The following message may be seen on the HMC and can be ignored:   "...HSCL0520 A Mobile CoD processor conversion code to convert 0 permanently activated processors to Mobile CoD processors on the managed system has been entered.  HSCL050F This CoD code is not valid for your managed system.  Contact your CoD administrator."
  • On systems using PowerVM firmware, a problem was fixed for reboot retries for IBM i partitions such that the first load source I/O adapter (IOA) is retried instead of bypassed after the first failed attempt.  The reboot retries are done for an hour before the reboot process gives up.  This error can occur if there is more than one known load source, and the IOA of the first load source is different from the IOA of the last load source.  The error can be circumvented by retrying the boot of the partition after the load source device has become available.
  • On systems using PowerVM firmware with mirrored memory running IBM i partitions, a problem was fixed for memory fails in the partition that also caused the system to crash.  The system failure will occur any time that IBM i partition memory towards the beginning of the partition's assigned memory fails.  With the fix, the memory failure is isolated to the impacted partition, leaving the rest of the system unaffected.
  • On systems using PowerVM firmware, a problem was fixed for failures deconfiguring SR-IOV Virtual Functions (VFs).  This can occur during Live Partition Mobility (LPM) migrations with HMC error messages HSCLAF16, HSCLAF15, and HSCLB602 shown.  This results in an LPM migration failure, and a system reboot is required to recover the VFs for the I/O adapters.  This error may occur more frequently in cases where the I/O adapter has pending I/O at the time of the deconfigure request for the VF.
  • On systems using PowerVM firmware, a problem was fixed for the incorrect reporting of the Universally Unique Identifier (UUID) to the OS, which prevented the tracking of a partition as it moved within a data center.  The UUID value as seen on HMC or the NovaLink did not match the value as displayed in the OS.
  • On systems using PowerVM firmware,  a  problem was fixed for a partition boot from a USB 3.0 device that has an error log SRC BA210003.  The error is triggered by an Open Firmware entry to the trace buffer during the partition boot.  The error log can be ignored as the boot is successful to the OS.
  • On systems using PowerVM firmware, a problem was fixed for a partition boot failure or hang from a Fibre Channel device having fabric faults.  Some of the fabric errors returned by the VIOS are not interpreted correctly by the Open Firmware VFC driver, causing the hang instead of generating helpful error logs.
  • On systems using PowerVM firmware,  problems were fixed for communication failures on adapters in SR-IOV shared mode:
    1) A problem  was fixed for SR-IOV adapters in shared mode for a transmission stall or time out with SRC B400FF01 logged.  The time out happens during Virtual Function (VF) shutdowns and during Function Level Resets (FLRs) with network traffic running.
    2) A problem was fixed for an SR-IOV logical port whose Port VLAN ID (PVID) changing from non-zero to zero causes a communication failure under certain conditions.  The communication failure only occurs when a logical port's PVID is dynamically changed from non-zero to zero.  An SR-IOV logical port is an I/O device created for a partition or a partition profile using the management console (HMC) when a user intends for the partition to access an SR-IOV adapter Virtual Function.  The error can be recovered from by a reboot of the partition.
    These fixes update adapter firmware to 10.2.252.1929, for the following Feature Codes: EN15, EN16, EN17, EN18, EN0H, EN0J, EN0M, EN0N, EN0K, EN0L, EL38, EL3C, EL56, and EL57.
    The SR-IOV adapter firmware level update for the shared-mode adapters happens under user control to prevent unexpected temporary outages on the adapters.  A system reboot will update all SR-IOV shared-mode adapters with the new firmware level.  In addition, when an adapter is first set to SR-IOV shared mode, the adapter firmware is updated to the latest level available with the system firmware (and it is also updated automatically during maintenance operations, such as when the adapter is stopped or replaced).  And lastly, selective manual updates of the SR-IOV adapters can be performed using the Hardware Management Console (HMC).  To selectively update the adapter firmware, follow the steps given at the IBM Knowledge Center for using HMC to make the updates:   https://www.ibm.com/support/knowledgecenter/HW4M4/p8efd/p8efd_updating_sriov_firmware.htm.
    Note: Adapters that are capable of running in SR-IOV mode, but are currently running in dedicated mode and assigned to a partition, can be updated concurrently either by the OS that owns the adapter or the managing HMC (if OS is AIX or VIOS and RMC is running).
  • On systems using PowerVM firmware with PowerVM NovaLink, a problem was fixed for the loss of a communications channel between the hypervisor and PowerVM NovaLink during a reset of the service processor.  Various NovaLink tasks, including deploy, could fail with a "No valid host was found" error.  With the fix, PowerVM NovaLink prevents normal operations from being impacted by a reset of the service processor.
  • On systems using PowerVM firmware with PowerVM NovaLink, a problem was fixed for returning to HMC-only management from co-management when a NovaLink partition holding master mode is deleted.  A circumvention is to release master mode before deleting the NovaLink partition and then reconnect the disconnected management console.  Please refer to IBM Knowledge Center link "http://ibm.biz/novalink-kc" for more information on the PowerVM NovaLink feature and changing the master authority when doing co-management.
  • On systems using PowerVM firmware with PowerVM NovaLink, a problem was fixed for a master management console becoming disconnected and blocking other management consoles from performing virtualization changes. A circumvention is to use the HMC CLI on another management console to request the master mode with the force option.   Please refer to IBM Knowledge Center link "http://ibm.biz/novalink-kc" for more information on the PowerVM NovaLink feature and changing the master authority when doing co-management.
  • On systems using PowerVM firmware, a problem was fixed for Power Enterprise Pool (PEP) busy errors from the system anchor card when creating or updating a PEP pool.    The error returned by the HMC is "HSCL9015 The managed system cannot currently process this operation.  This condition is temporary.  Please try the operation again."  To try again, the customer needs to update the pool again.  Typically on the second PEP update, the code is accepted.
    The problem is intermittent and occurs only rarely.
  • On systems using PowerVM firmware, a problem was fixed for an invalid date from the service processor causing the customer date and time to go to the Epoch value (01/01/1970) without a warning or chance for a correction.  With the fix,  the first IPL attempted on an invalid date will be rejected with a message alerting the user to set the time correctly in the service processor.  If the warning is ignored and the date/time is not corrected, the next IPL attempt will complete to the OS with the time reverted to the Epoch time and date.  This problem is very rare but it has been known to occur on service processor replacements when the repair step to set the date and time on the new service processor was inadvertently skipped by the service representative.
  • On systems using PowerVM firmware, a problem was fixed for a Power Enterprise Pool (PEP) system losing its assigned processor and memory resources after an IPL of the system.  This is an intermittent problem caused by a small timing window that makes it possible for the server to not get the IPL-time assignment of resources from the HMC.  If this problem occurs, it can be corrected by the HMC to recover the pool without needing another IPL of the system.
  • On systems using PowerVM firmware, a problem was fixed for the error handling of EEH events for SR-IOV Virtual Functions (VFs) that can result in an IPL failure with B7006971, B400FF05, and BA210000 SRCs logged.  In these cases, the partition console stops at an OFDBG prompt.  A DLPAR add of a VF may also result in a partition crash due to a 300 DSI exception caused by a low-level EEH event.  A circumvention is to debug the EEH events, which should be recovered errors, and eliminate their cause.  With the fix, the EEH events still log Predictive Errors but do not cause a partition failure.
  • On systems using PowerVM firmware, a problem was fixed for an error finding a partition load source that has a GPT format.  GUID Partition Table (GPT) is a standard for the layout of the partition table on a physical storage device used in the server, such as a hard disk drive or solid-state drive, using globally unique identifiers (GUIDs).  Other working drives may be using the older master boot record (MBR) partition table format.  This problem occurs whenever a load source using the GPT format appears in any entry of the boot table other than the first.  Without the fix, a GPT disk drive must be the first entry in the boot table for it to be usable to boot a partition.  (A minimal sketch of how the two formats can be distinguished follows this list.)
  • On systems using PowerVM firmware, a problem was fixed for an SRC BA090006 serviceable event log occurring whenever an attempt was made to boot from an ALUA  (Asymmetric Logical Unit Access) drive.  These drives are always busy by design and cannot be used for a partition boot, but no service action is required if a user inadvertently tries to do that.  Therefore, the SRC was changed to be an informational log.
  • On systems using PowerVM firmware, a problem was fixed for Live Partition Mobility (LPM) migrations from FW860.12 or later to the FW840.50 level of firmware. Subsequent DLPAR add operations of Virtual Adapters will fail with HMC error message HSCLAB2B, which contains text similar to the following:  "The operation to add a virtual NIC in slot 8 on partition 9 failed. The requested amounts of slot(s) to be added is 1 and the completed amount is 0."  The  AIX OS standard error message with return code 3 is the following: "0931-007 You have specified an invalid drc_name."   This issue affects partitions installed with AIX 7.2 TL 1 and later.   Not affected by this issue are partitions installed with VIOS, IBM i, or earlier levels of AIX.  The error can be recovered by a reboot of the affected partition.
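
    The GPT load source fix above distinguishes GUID Partition Table drives from master boot record drives by their on-disk signatures.  The following minimal sketch (Python, for illustration only) shows one way such a check can be made; the device path and the 512-byte logical sector size are assumptions of the sketch, not part of the firmware fix.

        # Minimal sketch: distinguish a GPT-formatted load source from an MBR one.
        # The device path and the 512-byte logical sector size are illustrative
        # assumptions; they are not part of the firmware fix itself.
        def detect_partition_table(device, sector_size=512):
            with open(device, "rb") as disk:
                lba0 = disk.read(sector_size)   # MBR or protective MBR
                lba1 = disk.read(sector_size)   # GPT header lives at LBA 1
            has_mbr_signature = lba0[510:512] == b"\x55\xaa"
            has_gpt_signature = lba1[0:8] == b"EFI PART"
            if has_gpt_signature:
                return "GPT"                    # GUID Partition Table
            if has_mbr_signature:
                return "MBR"                    # classic master boot record
            return "unknown"

        # Example usage (the path is a placeholder): detect_partition_table("/dev/sda")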
SC840_168_056 / FW840.50

04/21/17
Impact:  Availability      Severity:  SPE

New features and functions

  • Support for the Advanced System Management Interface (ASMI) was changed to allow the special characters "I", "O", and "Q" to be entered for the serial number of the I/O Enclosure under the Configure I/O Enclosure option.  These characters are only rarely found in an IBM serial number, so typing them will normally be an incorrect action.  However, ASMI no longer blocks these characters, so the exception case is now supported.  Without the enhancement, typing one of these characters causes the message "Invalid serial number" to be displayed.
  • On systems using PowerVM firmware, support was added  for the Universally Unique IDentifier (UUID) property for each partition.  The UUID provides each partition with an identifier that is persisted by the platform across partition reboots, reconfigurations, OS reinstalls, partition migration,  and hibernation.

System firmware changes that affect all systems

  • A problem was fixed for disabling, from the Advanced System Management Interface (ASMI), the periodic notification of a call home error log with SRC B150F138 for Memory Buffer (membuf) resources.
  • A problem was fixed for incorrect callouts of the Power Management Controller (PMC) hardware with SRC B1112AC4 and SRC B1112AB2 logged.  These extra callouts occur when the On-Chip Controller (OCC) has placed the system in the safe state for a prior failure, which is the real problem that needs to be resolved.
  • A problem was fixed for device time-outs during an IPL, logged with SRC B18138B4.  This error is intermittent and no action is needed for the error log.  The service processor hardware server has allotted more time for the device transactions to allow them to complete without a time-out error.
  • A problem was fixed for the Advanced System Management Interface (ASMI) "System Service Aids => Error/Event Logs" panel not showing the "Clear" and "Show" log options and also having a truncated error log when there are a large number of error logs on the system.
  • A problem was fixed for the failover to the backup PNOR on a Hostboot Self Boot Engine (SBE) failure.  Without the fix, the failed SBE causes loss of processors and memory with B15050AD logged.  With the fix, the SBE is able to access the backup PNOR and IPL successfully by deconfiguring the failing PNOR and calling it out as a failed FRU.
  • A problem was fixed for System Vital Product Data (SVPD) FRUs  being guarded but not having a corresponding error log entry.  This is a failure to commit the error log entry that has occurred only rarely.
  • A problem was fixed for a system going into safe mode with SRC B1502616 logged as informational without a call home notification.  Notification is needed because the system is running with reduced performance.  If there are unrecoverable error logs, any of them are marked with reduced performance, and the system has not been rebooted, then the system is probably running in safe mode with reduced performance.  With the fix, the SRC B1502616 is an Unrecoverable Error (UE).
  • A problem was fixed for the service processor boot watch-dog timer expiring too soon during DRAM initialization in the reset/reload, causing the service processor to go unresponsive.  On systems with a single service processor, the SRC B1817212 was displayed on the control panel.  For systems with redundant service processors, the failing service processor was deconfigured.  To recover the failed service processor, the system will need to be powered off with AC power removed during a regularly scheduled system service action.  This problem is intermittent and very infrequent as most of the reset/reloads of the service processor will work correctly to restore the service processor to a normal operating state.
  • A problem was fixed for host-initiated resets of the service processor causing the system to terminate.  A prior fix for this problem did not work correctly because some of the host-initiated resets were being translated to unknown reset types that caused the system to terminate.  With this new correction for failed host-initiated resets, the service processor will still be unresponsive but the system and partitions will continue to run.  On systems with a single service processor, the SRC B1817212 will be displayed on the control panel.  For systems with redundant service processors, the failing service processor will be deconfigured.  To recover the failed service processor, the system will need to be powered off with AC power removed during a regularly scheduled system service action.  This problem is intermittent and very infrequent as most of the host-initiated resets of the service processor will work correctly to restore the service processor to a normal operating state.
  • A problem was fixed for hardware dumps only collecting data for the master processor if a run-time service processor failover had occurred prior to the dump.  Therefore, there would be only master chip and master core data in the event of a core unit checkstop.  To recover to a system state that is able to do a full collection of debug data for all processors and cores after a run-time failover, a re-IPL of the system is needed.
  • A problem was fixed for incorrect error messages from the Advanced System Management Interface (ASMI) functions when the system is powered on but in the  "Incomplete State".  For this condition, ASMI was assuming the system was powered off because it could not communicate to the PowerVM hypervisor.  With the fix, the ASMI error messages will indicate that ASMI functions have failed because of the bad hypervisor connection instead of falsely stating that the system is powered off.
  • A problem was fixed for a single node failure on a multi-node system preventing an IPL.  The error occurred if Hostboot hung on a node and timed out without calling out problem hardware.  With the fix, a service processor failover is used to IPL on an alternate path to recover from the error.  In addition, an error log has been added for the IPL timeout for the node, with SRC B111BAAB and a callout for the master processor and PNOR.
  • A  problem has been fixed for systems losing performance and going into Safe mode (a power mode with reduced processor frequencies intended to protect the system from over-heating and excessive power consumption) with B1xx2AC3/B1xx2AC4 SRCs logged.  This happened  because of an On-Chip Controller (OCC) internal queue overflow. The problem has only been observed for systems running heavy workloads with maximum memory configurations (where every DIMM slot is populated - size of DIMM does not matter), but this may not be required to encounter the problem.  Recovery from Safe mode back to normal performance can be done with a re-IPL of the system, or concurrently using the following link steps for a soft reset of the service processor:  https://www.ibm.com/support/knowledgecenter/POWER8/p8hby/p8hby_softreset.htm.
    Checking that Safe mode is not active on the system requires a dynamic celogin password from IBM Support to use the service processor command line:
    1) Log into ASMI as celogin with a dynamic celogin password generated by IBM Support
    2) Select System Service Aids
    3) Select Service Processor Command Line
    4) Enter "tmgtclient --query_mode_and_function" from the command line
    The first line of the output, "currSysPwrMode", should say "NOMINAL", which means the system is in normal mode and Safe mode is not active.  (A small parsing sketch follows this list.)
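
    The check above can be scripted against the captured command output.  The following minimal sketch (Python, for illustration only) parses the output of the "tmgtclient --query_mode_and_function" command described above and confirms that Safe mode is not active; the sample output line is illustrative.

        # Minimal sketch: confirm Safe mode is not active from the output of
        # "tmgtclient --query_mode_and_function" collected on the service
        # processor command line as described above.
        def safe_mode_inactive(tmgt_output):
            for line in tmgt_output.splitlines():
                if "currSysPwrMode" in line:
                    # Expect something like "currSysPwrMode : NOMINAL"
                    return "NOMINAL" in line
            return False   # field not found; treat as not confirmed

        sample = "currSysPwrMode : NOMINAL"   # illustrative output line
        print("Safe mode inactive:", safe_mode_inactive(sample))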

System firmware changes that affect certain systems

  • On systems using PowerVM firmware, a problem was fixed for cable card (PCIe3 Optical Cable Adapter for the PCIe3 Expansion Drawer) capable PCI slots that fail during the IPL.  Hypervisor I/O Bus Interface UE B7006A84 is reported for each cable card capable PCI slot that does not contain a cable card.  PCI slots containing a cable card will not report an error but will not be functional.  The problem can be resolved by doing a "power off/power on" re-IPL of the system.  The trigger for the failure is that the I2C devices used to detect the cable cards do not come out of the power-on reset process in the correct state, due to a race condition.  The affected optical cable adapters have feature codes #EJ05, #EJ07, and #EJ08 with CCINs 2B1C, 6B52, and 2CE2, respectively.
  • On systems using PowerVM firmware,  a problem was fixed for a blank SRC in the LPA dump for user-initiated non-disruptive adjunct dumps.  The SRC is needed for problem determination and dump analysis.
  • On systems using PowerVM firmware, a problem was fixed with SR-IOV adapter error recovery where the adapter is left in a failed state in nested error cases for some adapter errors.  The probability of this occurring is very low since the problem trigger is multiple low-level adapter failures.  With the fix, the adapter is recovered and returned to an operational state.
  • On systems using PowerVM firmware  with PCIe adapters in Single Root I/O Virtualization (SR-IOV) shared mode, a problem was fixed for the hypervisor SR-IOV adjunct partition failing during the IPL with SRCs B200F011 and B2009014 logged. The SR-IOV adjunct partition successfully recovers after it reboots and the system is operational.
  • On systems using PowerVM firmware, a problem was fixed for PCIe Host Bridge (PHB) outages and PCIe adapter failures in the PCIe I/O expansion drawer caused by error thresholds being exceeded for the LEM bit [21] errors in the FIR accumulator.  These are typically minor and expected errors in the PHB that occur during adapter updates and do not warrant  a reset of the PHB and the PCIe adapter failures.  Therefore, the threshold LEM[21] error limit has been increased and the LEM fatal error has been changed to a Predictive Error to avoid the outages for this condition.
  • On systems using PowerVM firmware with a large memory configuration (greater than 8 TB), a problem was fixed for a SR-IOV adjunct failure during the IPL, causing loss of SR-IOV function.  The large system memory space causes an overflow in the space calculations for SR-IOV adapters in PCIe slots with Enlarged IO Capacity enabled.  The problem can be avoided by reducing the number of PCIe slots with Enlarged IO Capacity enabled so it does not include adapters in SR-IOV shared-mode.  Another circumvention option is to move the SR-IOV adapters to  SR-IOV capable PCIe slots where Enlarged IO Capacity is not enabled.   Reducing system physical memory to below 8 TB will also work as a circumvention.
  • On systems using PowerVM firmware, a problem was fixed for Live Partition Mobility (LPM) migrations from FW860.10 or FW860.11 to older levels of firmware. Subsequent DLPAR of Virtual Adapters will fail with HMC error message HSCL294C, which contains text similar to the following:  "0931-007 You have specified an invalid drc_name." This issue affects partitions installed with AIX 7.2 TL 1 and later. Not affected by this issue are partitions installed with VIOS, IBM i, or earlier levels of AIX.
  • On a system using PowerVM firmware running a Linux OS, a problem was fixed for support for Coherent Accelerator Processor Interface (CAPI) adapters.  The CAPI-related RTAS h-calls for the CAPI devices could not be made by the Linux OS, impacting the CAPI adapter functionality and usability.  This problem involves the following adapters:  the PCIe3 LP CAPI Accelerator Adapter with F/C #EJ16 that is used on the S812L (8247-21L) and S822L (8247-22L) models;  the PCIe3 CAPI FlashSystem Accelerator Adapter with F/C #EJ17 that is used on the S814 (8286-41A) and S824 (8286-42A) models;  and the PCIe3 CAPI FlashSystem Accelerator Adapter with F/C #EJ18 that is used on the S822 (8284-22A), E870 (9119-MME), and E880 (9119-MHE) models.  This problem does not pertain to PowerVM AIX partitions using CAPI adapters.
  • On a system using PowerVM firmware, a problem was fixed for corruption of the partition data in the service processor NVRAM during a power off that causes the managed system to go into the HMC "Recovery" error state.  A circumvention for the error is to restore the partition data from the HMC.  If using NovaLink to manage the partitions, a recovery can be done from the NovaLink backup.  The error is very infrequent but more likely to occur on an immediate power off of the system.  If a delayed power off is used instead, the hypervisor is allowed to complete all pending operations before shutting down cleanly.
  • On systems using PowerVM firmware, a problem was fixed for a group of shared processor partitions being able to exceed the designated capacity placed on a shared processor pool.  This error can be triggered by using the DLPAR move function for the shared processor partitions, if the pool has already reached its maximum specified capacity.  To prevent this problem from occurring when making DLPAR changes when the pool is at the maximum capacity, do not use the DLPAR move operation but instead break it into two steps:  DLPAR remove followed by DLPAR add.  This gives enough time for the DLPAR remove to be fully completed prior to starting the DLPAR add request.
  • On systems using PowerVM firmware, a problem was fixed for NVRAM corruption and an HMC recovery state when using Simplified Remote Restart partitions.  The failing systems will have at least one Remote Restart partition, and on the failed IPL there will be a B70005301 SRC with word 7 being 0x00000002.
  • On systems using PowerVM firmware with an IBM i partition, a problem was fixed for incorrect maximum performance reports based on the wrong number of "maximum" processors for the system.   Certain performance reports that can be generated on IBMi systems contain not only the existing machine information, but also "what-if" information, such as "how would this system perform if it had all the processors possible installed in this system".  This "what-if" report was in error because the maximum number of processors possible was too high for the system.
  • On systems using PowerVM firmware, a problem was fixed for NVRAM corruption that can occur when deleting a partition that owns a CAPI adapter, if that CAPI adapter is not assigned to another partition before the system is powered off.  On a subsequent IPL, the system will come up in recovery mode if there is NVRAM corruption.  To recover, the partitions must be restored from the HMC.  The frequency of this error is expected to be rare.  The CAPI adapters have the following feature codes:  #EC3E, #EC3F, #EC3L, #EC3M, #EC3T, #EC3U, #EJ16, #EJ17, #EJ18, #EJ1A, and #EJ1B.
  • On systems using PowerVM firmware, a problem was fixed to improve the link stability of the PCIe3 I/O expansion drawer (#EMX0).  The settings for the continuous time linear equalizers (CTLE) were updated for all the PCIe adapters on the PCIe links to the expansion drawer.  The CEC must be re-IPLed for the fix to activate.
  • On systems using PowerVM firmware,  the following problems were fixed for SR-IOV adapters:
    1) Insufficient resources reported for an SR-IOV logical port configured with promiscuous mode enabled and a Port VLAN ID (PVID) when creating a new interface on the SR-IOV adapters.
    2) Spontaneous dumps and reboots of the adjunct partition for SR-IOV adapters.
    3) Adapter enters a firmware loop when a single-bit ECC error is detected.  System firmware detects this condition as an adapter command time-out.  System firmware will reset and restart the adapter to recover the adapter functionality.  This condition will be reported as a temporary adapter hardware failure.
    4) vNIC interfaces not being deleted correctly, causing SRC B400FF01 to be logged and Data Storage Interrupt (DSI) errors with a failure on boot of the LPAR.
    This set of fixes updates adapter firmware to 10.2.252.1926 for the following Feature Codes: EN15, EN16, EN17, EN18, EN0H, EN0J, EN0M, EN0N, EN0K, EN0L, EL38, EL3C, EL56, and EL57.  (A minimal sketch for checking whether an adapter is covered by this update follows this list.)
    The SR-IOV adapter firmware level update for the shared-mode adapters happens under user control to prevent unexpected temporary outages on the adapters.  A system reboot will update all SR-IOV shared-mode adapters with the new firmware level.  In addition, when an adapter is first set to SR-IOV shared mode, the adapter firmware is updated to the latest level available with the system firmware (and it is also updated automatically during maintenance operations, such as when the adapter is stopped or replaced).  And lastly, selective manual updates of the SR-IOV adapters can be performed using the Hardware Management Console (HMC).  To selectively update the adapter firmware, follow the steps given at the IBM Knowledge Center for using HMC to make the updates:   https://www.ibm.com/support/knowledgecenter/HW4M4/p8efd/p8efd_updating_sriov_firmware.htm.
    Note: Adapters that are capable of running in SR-IOV mode, but are currently running in dedicated mode and assigned to a partition, can be updated concurrently either by the OS that owns the adapter or the managing HMC (if OS is AIX or VIOS and RMC is running).
  • On systems using PowerVM firmware, a problem was fixed for partition boot failures and run time DLPAR failures when adding I/O that log BA210000, BA210003, and/or BA210005 errors.  The fix also applies to run time failures configuring an I/O adapter following an EEH recovery that log BA188001 events.  The problem can impact IBM i partitions running in any processor mode or AIX/Linux partitions running in P7 (or older) processor compatibility modes.  The problem is most likely to occur when the system is configured in the Manufacturing Default Configuration (MDC) mode.  The trigger for the problem is a race condition between the hypervisor and the physical operations panel with a very rare frequency of occurrence.
  • On systems with maximum memory configurations (where every DIMM slot is populated - size of DIMM does not matter), a  problem has been fixed for systems losing performance and going into Safe mode (a power mode with reduced processor frequencies intended to protect the system from over-heating and excessive power consumption) with B1xx2AC3/B1xx2AC4 SRCs logged.  This happened  because of On-Chip Controller (OCC) time out errors when collecting Analog Power Subsystem Sweep (APSS) data, used by the OCC to tune the processor frequency.  This problem occurs more frequently on systems that are running heavy workloads.  Recovery from Safe mode back to normal performance can be done with a re-IPL of the system, or concurrently using the following link steps for a soft reset of the service processor:  https://www.ibm.com/support/knowledgecenter/POWER8/p8hby/p8hby_softreset.htm.
    Checking that Safe mode is not active on the system requires a dynamic celogin password from IBM Support to use the service processor command line:
    1) Log into ASMI as celogin with a dynamic celogin password generated by IBM Support
    2) Select System Service Aids
    3) Select Service Processor Command Line
    4) Enter "tmgtclient --query_mode_and_function" from the command line
    The first line of the output, "currSysPwrMode", should say "NOMINAL", which means the system is in normal mode and Safe mode is not active.
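
    For the SR-IOV fix set above, the affected Feature Codes and the target adapter firmware level are listed in that item.  The following minimal sketch (Python, for illustration only) checks whether a given adapter is covered and still below the 10.2.252.1926 level; how the installed level is queried from the HMC or the OS is outside the sketch.

        # Minimal sketch: decide whether an SR-IOV adapter is covered by the fix
        # set above and still below the 10.2.252.1926 adapter firmware level.
        AFFECTED_FEATURE_CODES = {
            "EN15", "EN16", "EN17", "EN18", "EN0H", "EN0J", "EN0M", "EN0N",
            "EN0K", "EN0L", "EL38", "EL3C", "EL56", "EL57",
        }
        TARGET_LEVEL = (10, 2, 252, 1926)

        def parse_level(level):
            return tuple(int(part) for part in level.split("."))

        def needs_update(feature_code, installed_level):
            return (feature_code in AFFECTED_FEATURE_CODES
                    and parse_level(installed_level) < TARGET_LEVEL)

        print(needs_update("EN0H", "10.2.252.1922"))   # True
        print(needs_update("EN0H", "10.2.252.1926"))   # False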
SC840_147_056 / FW840.40

10/28/16
Impact:  Availability      Severity:  SPE

New features and functions

  • The requirement to upgrade the managing HMCs from HMC 840 to HMC 850 before installing FW840.40 on the E870 (9119-MME), E880 (9119-MHE), E870C (9080-MME) and E880C (9080-MHE) systems has been removed.  However, to properly manage the E870C and E880C systems, the managing HMC(s) must be at V8 R8.5.0 SP1 or later.
  • Support was added to protect the service processor from booting on a level of firmware that is below the minimum MIF level.  If this is detected, a SRC B18130A0 is logged.  A disruptive firmware update would then need to be done to the minimum firmware level or higher.  This new support has no effect on the system being updated with the service pack but has been put in place to provide an enhanced firmware level for the IBM field stock service processors.
  • Support for the Advanced System Management Interface (ASMI) was changed to not create VPD deconfiguration records and call home alerts for hardware FRUs  that have one VPD chip of a redundant pair broken or inaccessible.  The backup VPD chip for the FRU allows continued use of the hardware resource.  The notification of the need for service for the FRU VPD is not provided until both of the redundant VPD chips have failed for a FRU.

System firmware changes that affect all systems

  • A problem was fixed for an infrequent IPL hang and termination that can occur if the backup clock card is failing.  The following SRCs may be logged with this termination:  B1813450, B181460B, B181BA07, B181E6C7 and B181E6F1.  If the IPL error occurs, the system can be re-IPLed to recover from the problem.
  • A problem was fixed for the Advanced System Management Interface "Network Services/Network Configuration" "Reset Network Configuration" button that was not resetting the static routes to the default factory setting.  The manufacturing default is to have no static routes defined so the fix clears any static routes that had been added.  A circumvention to the problem is to use the ASMI "Network Services/Network Configuration/Static Route Configuration" "Delete" button before resetting the network configuration.
  • A problem was fixed for a partial callout for a failed SPIVID (Serial Peripheral Interface Voltage Identification) interface on the power supply VRM (Voltage Regulator Module).  The SPIVID interface allows the processor to control its external voltage supply level, but if it fails, only the processor FRU (SCM) is called out and not the VRM.
    The system IPL will complete with a CEC drawer deconfigured.  The error log will only contain the processor but not the defective processor VRM.  Hostboot does not detect a SPIVID error, but fails on a SCOM operation to the processor chip.  The errors show up with SRC BCxx090F logged by Hostboot and word 7 containing one of three error values for a SPIVID_SLAVE_PART callout (a minimal decode helper is sketched after this list):
    1) RC_SBE_SET_VID_TIMEOUT = 0x005ec1b2
    2) RC_SBE_SPIVID_STATUS_ERROR = 0x00902aac
    3) RC_SBE_SPIVID_WRITE_RETURN_STATUS_ERROR = 0x0045d3cd with HWP Error description : "Procedure: proc_sbe_setup_evid SPIVID Device did not return good status the Boot Voltage Write operation" and HWSV RC of BA24.
    Without the fix, replace both the identified SCM and the associated VRM.
  • A problem was fixed for the HMC Exchange FRU procedure for DVD drive with MTM 7226-1U3 and feature codes 5757/5762/5763 where it did not verify the DVD drive was plugged in at the end of the exchange procedure.  Without the fix,  the user must manually verify that the DVD drive is plugged in.
  • A problem was fixed for a 3.3V power fault on the primary system clock card causing a failover to the backup clock without an error log and a call out for the primary clock card.  This clock card is part of a redundant set in the System Control Unit with CCIN 6B49.
  • A problem was fixed for a PLL unlock error on the backup clock card by using spread spectrum to maintain the phased locked loop for the clock frequency.  This technique was already in use for the primary clock card.  The PLL unlock error is rare in the backup clock for the Power systems but it has been seen more frequently for the same part in other IBM systems.  This clock card is part of a redundant set in the System Control Unit with CCIN 6B49.
  • A problem was fixed for the Advanced System Management Interface (ASMI) incorrectly showing the Anchor card as guarded whenever any redundant VPD chip is guarded.
  • A problem was fixed for the health monitoring of the NVRAM and DRAM in the service processor that had been disabled.  The monitoring has been re-established, and early warnings of a service processor memory failure are logged with one of the following Predictive Error SRCs:  B151F107, B151F109, B151F10A, or B151F10D.
  • A problem was fixed for infrequent VPD cache read failures during an IPL causing an unnecessary guarding of DIMMs with SRC B123A80F logged.  With the fix, the VPD cache read failures cause a temporary deconfiguration of the associated DIMM, but the DIMM is recovered on the next IPL.
  • A problem was fixed for a processor hang where the error recovery was not guarding the failing processor.  The failure causes a SRC B111E540 to be logged with Signature Description of "ex(n0p3c1) (COREFIR[55]) NEST_HANG_DETECT: External Hang detected".  With the fix, the failing processor FRU is called out and guarded so that the error does not re-occur when the system is re-IPLed.
  • A problem was fixed for the service processor recovery from intermittent MAX31760 fan controller faults logged with SRC B1504804.  The fan controller faults caused an out of memory condition on the service processor, forcing it to reset and failover to the backup service processor with SRCs B181720D, B181E6E9,  and B182951C logged.  With the fix, the fan controller faults are handled without memory loss and the only SRC logged is B1504804 for each fan controller fault.
  • A problem was fixed for a DDR4 memory training step during hostboot that incorrectly failed DIMMs on the timing margins for the HOLD limit.  The DIMMs may be recovered by manually unguarding the failed DIMM hardware.  This affects the 256GB DDR4 memory DIMM with feature code #EM8Y.
  • A problem was fixed for a failed IPL with SRC UE BC8A090F that does not have a hardware callout or a guard of the failing hardware.  The system may be recovered by guarding out the processor associated with the error and re-IPLing the system.  With the fix, the bad processor core is guarded and the system is able to IPL.
  • A problem was fixed for the loss of the setting for the disable of a periodic notification for a call home error log after a failover to the backup service processor on a redundant service processor system.  The call home for the presence of a failed resource can get re-enabled (if manually disabled in ASMI on the primary service processor) after a concurrent firmware update or any scenario that causes the service processor to fail over and change roles.  With the fix, the periodic notification flag is synchronized between the service processors when the flag value is changed.
  • A problem was fixed for a shortened "Grace Period" for "Out of Compliance" users of a Power Enterprise Pool (PEP).   The "Grace Period" is short by one hour, so the user has one less hour to resolve compliance issues before the HMC disallows any more borrowing of PEP resources.  For example, if the "Grace Period" should have been 48 hours as shown in the "Out of Compliance" message, it really is 47 hours in the hypervisor firmware.  The borrowing of PEP resources is not a common usage scenario.  It is most often found in Live Partition Mobility (LPM) migrations where PEP resources are borrowed from the source server and loaned to the target server.
  • A problem was fixed for an infrequent service processor failover hang that results in a reset of the backup service processor that is trying to become the new primary.  This error occurs more often on a failover to a backup service processor that has been in that role for a long period of time (many months).  This error can cause a concurrent firmware update to fail.  To reduce the chance of a firmware update failure because of a bad failover, an Administrative Failover (AFO) can be requested from the HMC prior to the start of the firmware update.  When the AFO has completed, the firmware update can be started as normally done.
  • A problem was fixed for On-Chip Controller (OCC) errors that had excessive callouts for processor FRUs.  Many of the OCC errors are recoverable and do not require that the processor be called out and guarded.  With the fix, the processors will only be called out for OCC errors if there are three or more OCC failures during a time period of a week.
  • A problem was fixed for an Operations Panel Function 04 (Lamp test) during an IPL causing the IPL to fail.  With the fix, the lamp test request is rejected during the IPL until the hypervisor is available.  The lamp test can be requested without problems anytime after the system is powered on to hypervisor ready or an OS is running in a partition.
  • A problem was fixed for a false thermal alarm in the active optical cables (AOC) for the PCIe3 expansion drawer with SRCs B7006AA6 and B7006AA7 being logged every 24 hours.  The AOC cables have feature codes of #ECC6 through #ECC9, depending on the length of the cable.  The SRCs should be ignored as they call for the replacement of the cable, cable card, or the expansion drawer module.  With the fix, the false AOC thermal alarms are no longer reported.
  • A problem was fixed for CEC drawer deconfiguration during an IPL due to SRCs BC8A0307 and BC8A1701 that did not have the correct hardware callout for the failing SCM.  With the fix, the failing SCM is called out and guarded so the CEC drawer will IPL even though there is a failed processor.
  • A problem was fixed for extra resources being assigned in a Power Enterprise Pool (PEP).   This only occurs if all of these things happen:
     o  Power server is in a PEP pool
     o  Power server has PEP resources assigned to it
     o  Power server powered down
     o  User uses HMC to 'remove' resources from the powered-down server
     o  Power server is then restarted. It should come up with no PEP resources, but it starts up and shows it still is using PEP resources it should not have. 
    To recover from this problem, the HMC 'remove' of the PEP resources from the server can be performed again.
  • A problem was fixed for the On-Chip Controller (OCC) incorrectly calling out processors with SRC B1112A16 for L4 Cache DIMM failures with SRC B124E504.  This false error logging can occur if the DIMM slot that is failing is adjacent to two unoccupied DIMM slots.
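
    For the SPIVID callout item above, the three word 7 values map directly to the listed return codes.  The following minimal decode helper (Python, for illustration only) covers just those three values; anything else is reported as unrecognized.

        # Minimal sketch: map the word 7 value of an SRC BCxx090F log to the
        # SPIVID_SLAVE_PART return codes listed above.
        SPIVID_RETURN_CODES = {
            0x005EC1B2: "RC_SBE_SET_VID_TIMEOUT",
            0x00902AAC: "RC_SBE_SPIVID_STATUS_ERROR",
            0x0045D3CD: "RC_SBE_SPIVID_WRITE_RETURN_STATUS_ERROR",
        }

        def decode_word7(word7):
            return SPIVID_RETURN_CODES.get(word7, "unrecognized word 7 value")

        print(decode_word7(0x005EC1B2))   # RC_SBE_SET_VID_TIMEOUT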

System firmware changes that affect certain systems

  • On systems using PowerVM firmware, a problem was fixed for network issues, causing critical situations for customers, when an SR-IOV logical port or vNIC is configured with a non-zero Port VLAN ID (PVID).  This fix updates adapter firmware to 10.2.252.1922, for the following Feature Codes: EN15, EN16, EN17, EN18, EN0H, EN0J, EL38, EN0M, EN0N, EN0K, EN0L, and EL3C.
    The SR-IOV adapter firmware level update for the shared-mode adapters happens under user control to prevent unexpected temporary outages on the adapters.  A system reboot will update all SR-IOV shared-mode adapters with the new firmware level.  In addition, when an adapter is first set to SR-IOV shared mode, the adapter firmware is updated to the latest level available with the system firmware (and it is also updated automatically during maintenance operations, such as when the adapter is stopped or replaced).  And lastly, selective manual updates of the SR-IOV adapters can be performed using the Hardware Management Console (HMC).  To selectively update the adapter firmware, follow the steps given at the IBM Knowledge Center for using HMC to make the updates:   https://www.ibm.com/support/knowledgecenter/HW4M4/p8efd/p8efd_updating_sriov_firmware.htm.
    Note: Adapters that are capable of running in SR-IOV mode, but are currently running in dedicated mode and assigned to a partition, can be updated concurrently either by the OS that owns the adapter or the managing HMC (if OS is AIX or VIOS and RMC is running).
  • On systems using the PowerVM hypervisor firmware and NovaLink, a problem was fixed for a NovaLink installation error where the hypervisor was unable to get the maximum logical memory block (LMB) size from the service processor.  The maximum supported LMB size should be 0xFFFFFFFF, but in some cases it was initialized to a value that was less than the amount of configured memory, causing the service processor read failure with error code 0x00000134.
  • On systems using PowerVM firmware with a system partition with more than 64 cores, a problem was fixed for Live Partition Mobility (LPM)  migration operations failing with HSCL365C.  The partition migration is stopped because the platform detects a firmware error anytime the partition has more than 64 cores.
  • On systems using PowerVM firmware, a problem was fixed for an AIX or Linux partition failing with a SRC B2008105 LP 00005 on a re-IPL after a dump (firmware assisted or error generated dump) following a Live Partition Mobility (LPM) migration operation.  The problem does not occur if the migrated partition completes a normal IPL after the migration.
  • On systems using PowerVM firmware, a problem was fixed to prevent NovaLink managed or co-managed systems from blocking SR-IOV configurations.  When configuring or deconfiguring SR-IOV, it is highly likely that the NovaLink VMC virtual device will interfere with SR-IOV virtual devices.  Without the fix, SR-IOV ignores the NovaLink VMC device and tries to use the same virtual slot.
  • On systems using PowerVM firmware, a problem was fixed for intermittent long delays in the NX co-processor for asynchronous requests such as NX 842 compressions.  This problem was observed for AIX DB2 when it was doing hardware-accelerated compressions of data but could occur on any asynchronous request to the NX co-processor.
  • On systems using PowerVM firmware that have an attached HMC, a problem was fixed for a Live Partition Mobility migration that resulted in the source managed system going to the Hardware Management Console (HMC) Incomplete state after the migration to the target system was completed.  This problem is very rare and has only been detected once.  The problem trigger is that the source partition does not halt execution after the migration to the target system.  The HMC went to the Incomplete state for the source managed system when it failed to delete the source partition because the partition would not stop running.  When this problem occurred, the customer network was running very slowly and this may have contributed to the failure.  The recovery action is to re-IPL the source system, but that will need to be done without the assistance of the HMC.  For each partition that has an OS running on the source system, shut down each partition from the OS.  Then from the Advanced System Management Interface (ASMI), power off the managed system.  Alternatively, the system power button may also be used to do the power off.  If the HMC Incomplete state persists after the power off, the managed system should be rebuilt from the HMC.  For more information on HMC recovery steps, refer to this IBM Knowledge Center link: https://www.ibm.com/support/knowledgecenter/en/POWER7/p7eav/aremanagedsystemstate_incomplete.htm
  • On systems using PowerVM firmware, a problem was fixed for a latency time of about 2 seconds being added to a target Live Partition Mobility (LPM) migration system when there is a latency time check failure.  With the fix, in the case of a latency time check failure, a much smaller default latency is used instead of two seconds.  This error would not be noticed if the customer system is using a NTP time server to maintain the time.
  • On systems using PowerVM firmware that have an attached HMC,  a problem was fixed for a Live Partition Mobility migration that resulted in a system hang when an EEH error occurred simultaneously with a request for a page migration operation.  On the HMC, it shows an incomplete state for the managed system with reference code A181D000.  The recovery action is to re-IPL the source system but that will need to be done without the assistance of the HMC.  From the Advanced System Management Interface (ASMI),  power off the managed system.  Alternatively, the system power button may also be used to do the power off.  If the HMC Incomplete state persists after the power off, the managed system should be rebuilt from the HMC.  For more information on HMC recovery steps, refer to this IBM Knowledge Center link: https://www.ibm.com/support/knowledgecenter/en/POWER7/p7eav/aremanagedsystemstate_incomplete.htm
  • On systems using PowerVM firmware, a problem was fixed for a system dump post-dump IPL that resulted in adjunct partition errors of SRC BA54504D, B7005191, and BA220020 when they could not be created due to false space constraints.  These adjunct partition failures will prevent normal operations of the hypervisor such as creating new partitions, so a power off and power on of the system is needed to recover it.  If the customer system is experiencing this error (only some systems will be impacted), it is expected to occur for each system dump post-dump IPL until the fix is applied.
  • On systems using PowerVM firmware, a problem was fixed for a shared processor pool partition showing an incorrect zero "Available Pool Processor" (APP) value after a concurrent firmware update.  The zero APP value means that no idle cycles are present in the shared processor pool, but in this case it stays zero even when idle cycles are available.  This value can be displayed using the AIX "lparstat" command (a small parsing sketch follows this list).  If this problem is encountered, the partitions in the affected shared processor pool can be dynamically moved to a different shared processor pool.  Before the dynamic move, the "uncapped" partitions should be changed to "capped" to avoid a system hang.  The old affected pool would continue to have the APP error until the system is re-IPLed.
  • On systems using PowerVM firmware,  a rare problem was fixed for a system hang that can occur  when dynamically moving "uncapped" partitions to a different shared processor pool.  To prevent a system hang, the "uncapped" partitions should be changed to "capped" before doing the move.
  • On systems using PowerVM firmware, a problem was fixed for a DLPAR add of the USB 3.0 adapter (#EC45 and #EC46) to an AIX partition where the adapter could not be configured with the AIX "cfgmgr" command, which fails with EEH errors and an outstanding illegal DMA transaction.  The trigger for the problem is the DLPAR add operation of a USB 3.0 adapter that has a USB External Dock (#EU04) and RDX Removable Disk Drives attached, or a USB 3.0 adapter that has a flash drive attached.  The PCI slot can be powered off and on to recover the USB 3.0 adapter.
  • On systems using PowerVM firmware,  a problem was fixed for a missing OF trace buffer in the resource dump.  This happens any time a resource dump is requested.  The missing FFDC data may require that problems be recreated before they can be debugged.
  • On systems using PowerVM firmware, a problem was fixed for a Live Partition Mobility (LPM) error where the target partition migration is failed with HSCLB98C error.  Frequency of this error can be moderate with source partitions that have a vNIC resource but extremely low if the source partition does not have a vNIC resource.  The failure originates at the VIOS VF level, so recovery from this error may need a re-IPL of the system to regain full use of the vNIC resources.
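
    For the "Available Pool Processor" item above, the APP value can be watched from the AIX "lparstat" interval output.  The following minimal sketch (Python, for illustration only) pulls the "app" column out of captured output and flags a value that is stuck at zero; the column name and the sample lines are assumptions about the lparstat output format.

        # Minimal sketch: extract the "app" (Available Pool Processor) column
        # from captured "lparstat" interval output and flag a stuck-at-zero value.
        def app_values(lparstat_output):
            lines = [l for l in lparstat_output.splitlines() if l.strip()]
            header = next(l for l in lines if "app" in l.split())
            app_col = header.split().index("app")
            for line in lines[lines.index(header) + 1:]:
                fields = line.split()
                if len(fields) > app_col:
                    yield float(fields[app_col])

        sample = (   # illustrative output only
            "%user  %sys  %wait  %idle physc %entc  lbusy   app  vcsw phint\n"
            " 10.0   5.0    0.0   85.0  0.50  50.0    2.0  1.75  1200     3\n"
        )
        values = list(app_values(sample))
        print("suspect zero APP" if values and all(v == 0.0 for v in values) else "APP looks normal")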
SC840_139_056 / FW840.30

09/28/16
Impact:  Availability      Severity:  SPE

New features and functions

  • Support for the E870C (9080-MME) and E880C (9080-MHE) systems.  These systems are cloud-enabled and require a minimum HMC level of V8.R8.5.0 SP1.
  • The certificate store on the service processor has been upgraded to include the changes contained in version 2.6 of the CA certificate list published by the Mozilla Foundation at the mozilla.org website as part of the Network Security Services (NSS) version 3.21.

System firmware changes that affect all systems

  • A problem was fixed for host-initiated resets of the service processor that can cause the service processor to terminate.  In this state, the service processor will be unresponsive but the system and partitions will continue to run.  On systems with a single service processor, the SRC B1817212 will be displayed on the control panel.  For systems with redundant service processors, the failing service processor will be deconfigured.  To recover the failed service processor, the system will need to be powered off with AC power removed during a regularly scheduled system service action.  The problem is intermittent and very infrequent as most of the host-initiated resets of the service processor will work correctly to restore the service processor to a normal operating state.
SC840_132_056 / FW840.24

08/31/16
Impact:  Availability      Severity:  HIPER

System firmware changes that affect certain systems

  • HIPER/Non-Pervasive: For a system using PowerVM firmware at a FW840 level and having an AIX partition or VIOS partition at specific back levels,  a problem was fixed for PCI adapters not getting configured in the OS.  DVD boots hang with status code 518 when attempts are made to boot off the AIX or VIOS DVD image.  NIM installs hang with status code 608.  If the firmware is updated to 840_104 through 840_118 for a SAS booted system, the subsequent reboot will hang with status code 554.
    The failing AIX and VIOS levels are as follows:
    AIX:
    AIX 7100-02-06 - AIX 7100-02-07
    AIX 6100-08-06 - AIX 6100-08-07
    VIOS:
    VIOS 2.2.2.6 - VIOS 2.2.2.70
    Without the fix, the problem may be circumvented by upgrading the AIX to 7100-03-03 or 6100-09-03 and the VIOS to 2.2.3.4.
    Depending on the adapter not getting configured, the error may result in Defined devices, EEH errors, and/or failure to boot the partition (if the failing adapter is the boot device).  These errors may also be seen for a rebooted partition after a LPM migration to FW840.
    With the fix applied, the error state for some of the  adapters in the running OS may persist and it will be necessary to reboot the OS to recover from those errors.
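
    The affected AIX levels above can be checked against the output of the AIX "oslevel -s" command (and "ioslevel" for VIOS).  The following minimal sketch (Python, for illustration only) compares the first three fields of an "oslevel -s" style string against the affected ranges; the sample level strings are illustrative.

        # Minimal sketch: check whether an AIX level (first three fields of an
        # "oslevel -s" style string, e.g. "7100-02-06-1316") falls in one of the
        # affected ranges listed above.  VIOS levels (2.2.2.6 - 2.2.2.70) would be
        # checked the same way against "ioslevel" output.
        AFFECTED_AIX_RANGES = [
            (("7100", "02", "06"), ("7100", "02", "07")),
            (("6100", "08", "06"), ("6100", "08", "07")),
        ]

        def aix_level_affected(oslevel_s):
            level = tuple(oslevel_s.split("-")[:3])
            return any(low <= level <= high for low, high in AFFECTED_AIX_RANGES)

        print(aix_level_affected("7100-02-06-1316"))   # True
        print(aix_level_affected("7100-03-03-1415"))   # False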
SC840_118_056 / FW840.23

07/28/16
Impact: Data            Severity:  HIPER

System firmware changes that affect certain systems

  • HIPER/NON-PERVASIVE: DEFERRED:  On systems with DDR4 memory installed, a problem was fixed for the handling of data errors in the L4 cache.   If a data error occurs in the L4 cache of the memory buffer on an affected system and it is pushed out to mainline memory, the data error will not be correctly handled.   A data error originating in the L4 cache may result in incorrect data being stored into memory.  The DDR4 DRAM has feature code (FC) EM8Y for a 256GB 1600 MHz CDIMM.
    At this firmware level, DDR4 and DDR3 memory cannot be mixed in the system.  At FW860.10, DDR4 and DDR3 can be mixed in a system, but each system node must have either DDR3 or DDR4 only.
    IBM strongly recommends that the customer should plan an outage to install the firmware fix immediately.  Fix activation requires a subsequent platform IPL following the installation of the firmware fix to eliminate any exposure to this issue.
SC840_113_056 / FW840.22

07/06/16
Impact:  Availability      Severity:  ATT

New features and functions

  • Support was added to Live Partition Mobility to allow migrations between partitions at firmware level FW760 and FW840.22 or later.  Previously, migration operations were not allowed between FW760 and FW840 partitions.

System firmware changes that affect all systems

  • Support was added for additional First Failure Data Capture (FFDC) data for processor clock failover errors provided by creating daily clock status reports with SRC B150CCDA informational error logs.  This clock status SRC log is written into the Hardware Management Console (HMC) iqyylog.log as a platform error log (PEL) event.  The PEL event contains a dump of the clock registers.  If a processor clock fails over with SRC B158CC62 posted to the serviceable events log, the iqyylog.log file on the HMC should be collected to help debug the clock problem using the B150CCDA data.  This support had been dropped in FW840.21 because of an IPL initialization conflict that has been resolved, and the support is now reinstated.

System firmware changes that affect certain systems

  • On systems using PowerVM firmware, a problem was fixed for a sequence of two or more Live Partition Mobility migrations that caused a partition to crash with a SRC BA330000 logged (Memory allocation error in partition firmware).  The sequence of LPM migrations that can trigger the partition crash are as follows:
    The original source partition level can be any FW760.xx, FW763.xx, FW770.xx, FW773.xx, FW780.xx, or FW783.xx P7 level or any FW810.xx, FW820.xx, FW830.xx, or FW840.xx P8 level.  It is migrated first to a system running one of the following levels:
    1) FW730.70 or later 730 firmware or
    2) FW740.60 or later 740 firmware
    And then a second migration is needed to a system running one of the following levels:
    1) FW760.00 - FW760.20 or
    2) FW770.00 - FW770.10
    The twice-migrated system partition is now susceptible to the BA330000 partition crash during normal operations until the partition is rebooted.  If an additional LPM migration is done to any firmware level, the thrice-migrated partition is also susceptible to the partition crash until it is rebooted.
    With the fix applied, the susceptible partitions may still log multiple BA330000 errors but there will be no partition crash.  A reboot of the partition will stop the logging of the BA330000 SRC.
SC840_111_056 / FW840.21

06/24/16
Impact:  Availability      Severity:  SPE

NOTE:

Critical firmware update for FW840.20 (SC840_104) level systems

System IPLed with FW840.20:  A critical firmware update is required for all 9119-MME and 9119-MHE systems that have been IPLed with FW840.20 (SC840_104). The FW840.20 level can cause a failed IPL or a potential unplanned outage. If the server is already in production, then the customer should plan an outage at a convenient time to apply FW840.21 (SC840_111) or higher and IPL.

System had FW840.20 concurrently applied:  If firmware level FW840.20 was concurrently installed (i.e. system was NOT IPL'ed after installing the level) customers are not impacted by this issue provided they apply FW840.21 (SC840_111) or higher prior to next planned system reboot. NOTE: FW 840.21 can be applied concurrently.

System IPLed with any other version of Firmware:  If the current firmware level of the system is not FW840.20, the system is not exposed to this issue. Customers can install this level or later at the next scheduled update window.

To verify the firmware level installed on the server, select “Updates” from the left side of the HMC and place a check mark on the server of interest. Then select “View system information” from the bottom view, select “None - Display current values”. The Platform IPL Level will indicate the last level the system was booted on.
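
The three cases above can be reduced to a simple decision on the Platform IPL Level and the currently activated level shown on that HMC panel.  The following minimal sketch (Python, for illustration only) applies those rules; the parameter names and the way the level strings are read from the HMC are assumptions of the sketch.

    # Minimal sketch: apply the three exposure cases described above, using the
    # "SC840_104 / FW840.20" level naming from this history.  How the two level
    # strings are read from the HMC panel is outside this sketch.
    def fw840_20_exposure(platform_ipl_level, activated_level):
        if platform_ipl_level == "SC840_104":
            return ("Critical: system was IPLed on FW840.20 - plan an outage, "
                    "apply FW840.21 (SC840_111) or later, and re-IPL.")
        if activated_level == "SC840_104":
            return ("FW840.20 was concurrently applied - apply FW840.21 "
                    "(SC840_111) or later before the next planned reboot.")
        return "Not exposed to this issue; install this level or later at the next scheduled update window."

    print(fw840_20_exposure("SC840_104", "SC840_104"))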

System firmware changes that affect all systems

  • A problem was fixed for an intermittent failure in Hostboot during the system IPL resulting in SRCs BC70090F and BC8A1701 logged with a hardware procedure return code of "RC_PROC_BUILD_SMP_ADU_STATUS_MISMATCH".  The system terminates with a Terminate Immediate (TI) condition.  The system must be re-IPLed to recover.  The failure is very infrequent and was caused by a race condition introduced as part of clock card failure data collection procedure which has now been removed (see below).
  • Support was removed for additional First Failure Data Capture (FFDC) data for processor clock failover errors added in FW840.20.   The FFDC was provided by creating daily clock status reports with SRC B150CCDA informational error logs.  This change was removed because it could trigger intermittent IPL & initialization failures.
SC840_104_056 / FW840.20

05/31/16
Impact:  Availability      Severity:  SPE

New features and functions

  • Support for a system control unit (SCU) with three fans instead of four on the E870 (9119-MME) and E880 (9119-MHE) system models.  The SCU fan has CCIN 6B44 with part number 00FV798.
  • Support was added for the Stevens6+ option of the internal tray loading DVD-ROM drive with F/C #EU13.  This is an 8X/24X(max) Slimline SATA DVD-ROM Drive.  The Stevens6+ option is a FRU hardware replacement for the Stevens3+.  MTM 7226-1U3 (Oliver)  FC 5757/5762/5763 attaches to IBM Power Systems and lists Stevens6+ as optional for Stevens3+.  If the Stevens6+  DVD drive is installed on the system without the required firmware support, the boot of an AIX partition will fail when the DVD is used as the load source.  Also, an IBM i partition cannot consistently boot from the DVD drive using D-mode IPL.  A SRC C2004130 may be logged for the load source not found error.
  • Support for the IBM PCIe3 12GB cache RAID plus SAS dual 4-port 6Gb x8 adapter with feature code #EJ14 and CCIN 57B1.  This adapter is very similar to the #EJ0L SAS adapter, but it uses a second chip in the card to provide more IOPS capacity (significant performance improvement) and can attach more SSD.  This adapter uses integrated flash memory to provide protection of the write cache, without need for batteries, in case of power failure.
  • Support for PowerVM vNIC extended to Linux OS Ubuntu 16.04 LE with up to ten vNIC client adapters for each partition.  PowerVM vNIC combines many of the best features of SR-IOV and PowerVM SEA to provide a network solution with options for advanced functions such as Live Partition Mobility along with better performance and I/O efficiency when compared to PowerVM SEA.  In addition PowerVM vNIC provides users with bandwidth control (QoS) capability by leveraging SR-IOV logical ports as the physical interface to the network.
  • PowerVM CoD was enhanced to eliminate the yearly Utility CoD renewal on systems using Utility CoD.  The Utility CoD usage is already monitored to make sure systems are running within the prescribed threshold limit of unreported usage, so a yearly customer renewal is not needed to manage the Utility CoD processor usage.
  • Support was added to the DHCP client on the service processor for non-random backoff mode needed for Data Center Manageability Interface (DCMI) V1.5  compliance.  By default, the DHCP client does random backoff delays for retries during DHCP discovery.  For DCMI V1.5, non-random backoff delays were introduced as an option.  Disabling the random back-off mode is not required for normal operations, but if wanted, the system administrator can override the default and disable the random back-off mode by sending the “SET DCMI Configuration Parameters” for the random back-off property of the Discovery Configuration parameter.  A value of "0" for the bit means "Disabled".  Or, the DHCP configuration file can be modified to add "random-backoff off", causing the non-random mode for the retry delays to be used during DHCP discovery.
  • Support was added for enhanced diagnostics for PowerVM Simplified Remote Restart (SRR) partitions.  This service pack level is recommended when using SRR partitions.  You can learn more about SRR partitions at the IBM Knowledge Center: "http://www.ibm.com/support/knowledgecenter/HW4P4/p8hat/p8hat_createremotereslpar.htm".
  • Support was added for auto-correction in the Advanced System Manager Interface (ASMI) for the "Feature Code/Sequence Number" field of the "System Configuration/Program Vital Product Data/System Enclosures" menu selection.  Lower case letters are invalid in the "Feature Code/Sequence Number" field so these are now changed to upper case letters to help form a valid entry.  For example, if  "78c9-001" was entered, it would be changed to "78C9-001".
  • Support was added for HTTP Strict Transport Security (HSTS) compliance for The Advanced System Management Interface (ASMI) web connection.  Even without this feature, any attempt to access ASMI with the HTTP protocol was rejected because the service processor firewall blocks port 80 (HTTP).  But enabling HSTS for ASMI prevents HSTS security warnings for the service processor during network scans by security scanner programs such as IBM AppScan.
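
    HSTS compliance means the ASMI HTTPS responses carry a Strict-Transport-Security header.  The following minimal sketch (Python, for illustration only) checks a response for that header; the hostname is a placeholder, and certificate verification is disabled only because a service processor typically presents a self-signed certificate.

        # Minimal sketch: confirm that an HTTPS response carries the
        # Strict-Transport-Security header that HSTS compliance adds.
        import http.client
        import ssl

        def has_hsts_header(host):
            ctx = ssl.create_default_context()
            ctx.check_hostname = False          # self-signed ASMI certificate
            ctx.verify_mode = ssl.CERT_NONE
            conn = http.client.HTTPSConnection(host, context=ctx, timeout=30)
            conn.request("GET", "/")
            response = conn.getresponse()
            return response.getheader("Strict-Transport-Security") is not None

        # Example usage (placeholder hostname): has_hsts_header("asmi.example.com")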

System firmware changes that affect all systems

  • DEFERRED:  A problem was fixed in the dynamic RAM (DRAM) initialization to update the VREF on the DIMMs to the optimal settings and to add an additional margin check test to improve the reliability of the DRAM by screening out more marginal DIMMs before they can result in a run-time memory fault.
  • A problem was fixed for a degraded PCI link causing a processor core to be guarded if a non-cacheable unit (NCU) store time-out occurred with SRC B113E540 and PRD signature "(NCUFIR[9]) STORE_TIMEOUT: Store timed out on PB".  With the fix, the processor core is not guarded because of the NCU error.  If this problem occurs and a core is deconfigured, clear the guard record and re-IPL to regain the processor core.  The solution for degraded PCI links is different from the fix for this problem, but a re-IPL of the CEC or a reset of the PCI adapters could help to recover the PCI links from their degraded mode.
  • A problem was fixed for an incorrect reduction in FRU callouts for Processor Run-time Diagnostic (PRD) errors after a reference oscillator clock (OSCC) error has been logged.  Hardware resources are not called out and guarded as expected.  Some of the missing PRD data can be found in the secondary SRC of B181BAF5 logged by hardware services.  The callouts that PRD would have made are in the user data of that error log.
  • A problem was fixed for a Qualys network scan for security vulnerabilities causing a core dump in the Intelligent Platform Management Interface (IPMI)  process on the service processor with SRC B181EF88.  The error occurs anytime the Qualys scan is run because it sends an invalid IPMI session id that should have been handled and discarded without a core dump.
  • A security problem was fixed in OpenSSL for a possible service processor reset on a null pointer de-reference during RSA PSS signature verification. The Common Vulnerabilities and Exposures issue number is CVE-2015-3194.
  • A security problem was fixed in the lighttpd server on the service processor, where a remote attacker, while attempting authentication, could insert strings into the lighttpd server log file.  Under normal operations on the service processor, this does not impact anything because the log is disabled by default.  The Common Vulnerabilities and Exposures issue number is CVE-2015-3200.
  • Support was added for a cable validation option in the Advanced System Management Interface (ASMI).  A new panel option called "Cable Validation" has been added to the "System Service Aids" menu.  Cable validation can be performed on the FSP, Clock, UPIC, and SMP cables.
  • A problem was fixed for a missing error log when a clock card fails over to the backup clock card.  This problem causes loss of redundancy on the clock cards without a callout notification that there is a problem with the FRU.  If the fix is applied to a system that already has a failed clock card, that condition will not be known until the system is IPLed again, at which time an error log and callout of the clock card will occur if it is still in a persistent failed state.
  • A problem was fixed for the Hardware Management Console (HMC) "chpwrmgmt" command not providing a meaningful error message when used to try to enable an unsupported power saver mode of "dynamic_favor_power" on the 9119-MME or 9119-MHE models.  This power saver mode is not available on these models, but the error message issued was "HSCL1400 An error has occurred during the operation to the managed system. Try the task again."  The following is the corrected error message:  "HSCL1402 This operation failed due to the following reasons: HSCL02F3 The managed system does not support the specified power saver mode."
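    For reference, the following is a minimal sketch of how this mode change is typically attempted from the HMC command line; the managed system name is hypothetical, and the exact flag set is an assumption based on common HMC CLI usage, so verify it against the documentation for the installed HMC level:
    # Query the current power saver mode settings of the managed system
    lspwrmgmt -m Server-9119-MHE-SN1234567 -r sys
    # Attempt to enable the unsupported mode; on 9119-MME/9119-MHE models this now fails with HSCL1402/HSCL02F3
    chpwrmgmt -m Server-9119-MHE-SN1234567 -r sys -o enable -t dynamic_favor_power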
  • A problem was fixed for a secondary clock card (CCIN 6B49) failure on the system control unit (SCU) being called out as a local clock card (CCIN 6B2D) failure on the node with SRC B158E504.  For this failure to occur, the primary clock card on the SCU must have previously failed and been guarded.
  • Support was added for additional First Failure Data Capture (FFDC) data for processor clock failover errors provided by creating daily clock status reports with SRC B150CCDA informational error logs.  This clock status SRC log is written into the Hardware Management Console (HMC) iqyylog.log as a platform error log (PEL) event.  The PEL event contains a dump of the clock registers.  If a processor clock failover with SRC B158CC62 occurs on the service processor, the iqyylog.log file on the HMC should be collected to help debug the clock problem using the B150CCDA data.
  • A problem was fixed for the service processor going to the reset state instead of the termination state when the anchor card is missing or broken.  At the termination state, the Advanced System Management Interface (ASMI) can be used to collect failure data and debug the problem with the anchor card.
  • A problem was fixed for error log entries created by Hostboot not getting written to the error log in some situations.  This can cause hardware detected as failed by Hostboot to not get reported or have a call-home generated.  This problem will occur whenever Hostboot commits a recovered or informational error as its last error log in the current IPL.  In the next IPL,  one or more error logs from Hostboot will be lost.
  • A problem was fixed for a service processor failure during a system power off that causes a reset of the service processor.  The service processor is in the correct state for a normal system power on after the error.  The frequency for this error should be low as it is caused by a very rare race condition in the power off process.
  • A problem was fixed so that service processor NVRAM bit flips are now detected and reported as predictive errors after a certain threshold of failures have occurred.  The SRCs reported are B151F109 (threshold of NVRAM errors was reached) or B151F10A (an NVRAM address has failed multiple times).  Previously, these normal wear errors in the NVRAM were ignored.  The bit flip is self-corrected and does not cause a problem, but a high occurrence of these could mean that a service processor card FRU or system backplane FRU, as called out in the SRC, is in need of service.
  • A security problem was fixed in OpenSSL for a possible service processor reset on a null pointer de-reference during SSL certificate management. The Common Vulnerabilities and Exposures issue number is CVE-2016-0797.

System firmware changes that affect certain systems

  • DEFERRED:  On systems using PowerVM firmware, a performance improvement was made by disabling the Hot/Cold Affinity (HCA) hardware feature, which gathers memory usage statistics for consumption by partition operating system memory management algorithms.  The statistics gathering can, in rare cases, cause performance to degrade.  The workloads that may experience issues are memory-intensive workloads that have little locality of reference and thus cannot take advantage of hardware memory cache.  As a consequence, the problem occurs very infrequently or not at all except for very specific workloads in an HPC environment.  This performance fix requires an IPL of the system to activate it after it is applied.
  • DEFERRED:  On systems using 256GB DDR4 DIMMs, a problem was fixed in the 3DS packaging that could result in a recoverable memory error.  This fix requires an IPL of the system to take effect.  Any system with DDR4 DIMMs should be re-IPLed at the next opportunity to do so after applying this service pack to provide the best running conditions for the DDR4 DIMMs for reliable operation.
  • On systems with DDR4 memory DIMMs installed, a fix was made for the longer IPL times needed to initialize DDR4 memory.  The time needed for the IPL has been reduced to be comparable to systems using other DIMM types such as DDR3.
  • On systems with a PowerVM Active Memory Sharing (AMS) partition with AIX  Level 7.2.0.0 or later with Firmware Assisted Dump enabled, a problem was fixed for a Restart Dump operation failing into KDB mode.  If "q" is entered to exit from KDB mode, the partition fails to start.  The AIX partition must be powered off and back on to recover.  The problem can be circumvented by disabling Firmware Assisted Dump (default is enabled in AIX 7.2).
  • On a PowerVM system, a problem was fixed for an incorrect date in partitions created with a Simplified Remote Restart-Capable (SRR) attribute where the date is created as Epoch 01/01/1970 (MM/DD/YYYY).  Without the fix, the user must change the partition time of day when starting the partition for the first time to make it correct.  This problem only occurs with SRR partitions.
  • On a PowerVM system with licensed Power Integrated Facility for Linux (IFL) processors, a problem was fixed for a system hang that could occur if the system contains both 1) dedicated processor partitions configured to share processors while active and  2) shared processor partitions.  This problem is more likely to occur on a system with a low number of non-IFL processors.
  • On systems using PowerVM firmware with dedicated processor partitions,  a problem was fixed for the dedicated processor partition becoming intermittently unresponsive. The problem can be circumvented by changing the partition to use shared processors.  This is a follow-on to the fix provided in 840.11 for a different issue for delays in dedicated processor partitions that were caused by low I/O utilization.
  • A problem was fixed for transmit time-outs on a Virtual Function (VF) during stressful network traffic, on systems using PCIe adapters in Single Root I/O Virtualization (SR-IOV) shared-mode.  This fix updates adapter firmware to 10.2.252.1918, for the following Feature Codes: EN15, EN16, EN17, EN18, EN0H, EN0J, EL38, EN0M, EN0N, EN0K, EN0L, and EL3C.
    The SR-IOV adapter firmware level update for the shared-mode adapters happens under user control to prevent unexpected temporary outages on the adapters.  A system reboot will update all SR-IOV shared-mode adapters with the new firmware level.  In addition, when an adapter is first set to SR-IOV shared mode, the adapter firmware is updated to the latest level available with the system firmware (and it is also updated automatically during maintenance operations, such as when the adapter is stopped or replaced).  And lastly, selective manual updates of the SR-IOV adapters can be performed using the Hardware Management Console (HMC).  To selectively update the adapter firmware, follow the steps given at the IBM Knowledge Center for using HMC to make the updates:   https://www.ibm.com/support/knowledgecenter/HW4M4/p8efd/p8efd_updating_sriov_firmware.htm.
    Note: Adapters that are capable of running in SR-IOV mode, but are currently running in dedicated mode and assigned to a partition, can be updated concurrently either by the OS that owns the adapter or the managing HMC (if OS is AIX or VIOS and RMC is running).
  • On PowerVM systems using Elastic Capacity on Demand (CoD) (also known as On/Off CoD), a problem was fixed for losing entitlement amounts when upgrading from FW820 or FW830.  If you upgrade to a service pack level that does not have this fix and lose the entitlement, you can get another On/Off (Elastic) CoD Enablement code from IBM Support.  This problem only pertains to the E850 (8408-E8E), E870 (9119-MME), and E880 (9119-MHE) models.
SC840_087_056 / FW840.11

03/18/16
Impact:  Availability      Severity:  ATT

New features and functions

  • The default setting for the "Enlarged I/O Memory Capacity" feature was disabled on newly manufactured E850, E870 & E880 models to reduce hypervisor memory usage.  Customers of the new systems using PCI adapters that leverage "Enlarged I/O Memory Capacity" will need to explicitly enable this feature for the supported PCI slots, using ASMI Menus while the system is powered off.  Existing systems will not see a change in their current setting.  For existing systems with only AIX and IBM i partitions that do not benefit from this feature, it can be disabled by using the Advanced System Management Interface (ASMI) for the "System Configuration-> I/O Adapter Enlarged Capacity" panel to uncheck the option for the "I/O Adapter Enlarged Capacity" feature.

System firmware changes that affect certain systems

  • On systems using PowerVM partitions, a problem was fixed for error recovery from failed Live Partition Mobility (LPM) migrations.  The recovery error is caused by a partition reset that leaves the partition in an unclean state with the following consequences:  1) A retry on the migration for the failed source partition may not be allowed; and 2) With enough failed migration recovery errors, it is possible that any new migration attempts for any partition will be denied.  This error condition can be cleared by a re-IPL of the system.  The partition recovery error after a failed migration is much more likely to occur for partitions managed by NovaLink, but it is still possible to occur for Hardware Management Console (HMC) managed partitions.
SC840_079_056 / FW840.10

03/04/16
Impact:  Availability      Severity:  SPE

New features and functions

  • Support for a 256GB DDR4 memory DIMM.  Memory feature code #EM8Y provides a total of 1024GB of memory with four 256GB CDIMMs (1600 MHz, 8Gbit DDR4).  Note that DDR4 and DDR3 DIMMs cannot be mixed in a system for FW840.  Also, the minimum firmware level needed for DDR4 usage is FW840.23 due to a fix needed for a data integrity problem.  At firmware level FW860.10, DDR4 and DDR3 DIMMs can be mixed in a system, but no mixing is allowed in a node.
  • Support was added to block a full Hardware Management Console (HMC) connection to the service processor when the HMC is at a lower firmware major and minor release level than the service processor.  In the past, this check was done only for the major version of the firmware release but it now has been extended to the minor release version level as well.  The HMC at the lower firmware level can still make a limited connection to the higher firmware level service processor.  This will put the CEC in a "Version Mismatch" state.  Firmware updates are allowed with the CEC in the "Version Mismatch" state so that the condition can be corrected with either a HMC update or a firmware update of the CEC.
  • Support for PowerVM vNIC with more vNIC client adapters for each partition, up to 10 from a limit of 6 at the FW840.00 level.  PowerVM vNIC combines many of the best features of SR-IOV and PowerVM SEA to provide a network solution with options for advanced functions such as Live Partition Mobility along with better performance and I/O efficiency when compared to PowerVM SEA.  In addition PowerVM vNIC provides users with bandwidth control (QoS) capability by leveraging SR-IOV logical ports as the physical interface to the network.
  • Support for a 10-core 4.19 GHz Power8 processor with feature code #EPBS on the IBM Power System E880 (9119-MHE).  This feature provides a 40-core processor planar containing four ten-core processor SCMs.  Each processor core has 512KB of L2 cache and 8MB of L3 cache.
  • The default setting for the "Enlarged I/O Memory Capacity" feature was disabled on newly manufactured E850, E870 & E880 models to reduce hypervisor memory usage.  Customers using PCI adapters that leverage "Enlarged I/O Memory Capacity" will need to explicitly enable this feature for the supported PCI slots, using ASMI Menus while the system is powered off.

System firmware changes that affect all systems

  • On multi-node systems with a power fault, a problem was fixed for On-Chip Controller errors caused by the power fault being reported as predictive errors for SRC B1602ACB.  These have been corrected to be informational error logs.  If running without the fix, the predictive and unrecoverable errors logged for the OCC on loss of power to the node can be ignored.
  • A problem was fixed for a system IPL hang at C100C1B0 with SRC 1100D001 when the power supplies have failed to supply the necessary 12-volt output for the system.   The 1100D001 SRC was calling out the planar when it should have called out the power supplies.  With the fix, the system will terminate as needed and call out the power supply for replacement.  One mode of power supply failure that could trigger the hang is sync-FET failures that disrupt the 12-volt output.
  • A problem was fixed for a PCIe3 I/O expansion drawer (#EMX0) not getting all error logs reported when its error log queue is full.  In the case where the error log queue is full with 16 entries, only one entry is returned to the hypervisor for reporting.  This error log truncation only occurs during periods of high error activity in the expansion drawer.
  • A problem was fixed for the callout of a VPD collection fault and system termination with SRC 11008402 to include the 1.2vcs VRM FRU.  A power good fault for the 1.2 volts would be a primary cause of this error.  Without the fix, the VRM is missing in the callout list and only has the VPDPART isolation procedure.
  • A problem was fixed for excessive logging of the SRC 11002610 on a power good (pgood) fault when detected by the Digital Power Subsystem Sweep (DPSS).  Multiple pgood interrupts are signaled by the DPSS in the interval between the first pgood failure and the node power down.  A threshold was added to limit the number of error logs for the condition.
  • A problem was fixed for redundant logging of the SRC B1504804 for a fan failure, once every five seconds.  With the fix, the failure is logged only at the initial time of failure in the IPL.
  • A problem was fixed to speed recovery for VPD collection time-out errors for PCIe resources in an I/O drawer logged with SRC 10009133 during concurrent firmware updates.  With the fix, the hypervisor is notified as soon as the VPD collection has finished, so the PCIe resources can report as available.  Without the fix, there is a delay as long as two hours for the recovery to complete.
  • A problem was fixed to allow IPMI entity IDs to be used in ipmitool raw commands on the service processor to get the temperature reading.  Without the fix, the DCMI entity IDs have to be used in the raw command for the "Get temperature" function.
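    For illustration, the sketch below shows how such a temperature query is typically issued with ipmitool; the raw bytes follow the DCMI "Get Temperature Reading" command layout (NetFn 0x2C, command 0x10), and the entity ID values are assumptions that should be checked against the DCMI and IPMI specifications:
    # DCMI convenience subcommand that reports the standard inlet, CPU, and baseboard temperatures
    ipmitool dcmi get_temp_reading
    # Raw form using the DCMI entity ID for inlet air temperature (0x40)
    ipmitool raw 0x2c 0x10 0xdc 0x01 0x40 0x00 0x00
    # With this fix, the equivalent IPMI entity ID (0x37, air inlet) can also be used
    ipmitool raw 0x2c 0x10 0xdc 0x01 0x37 0x00 0x00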
  • A problem was fixed for a false unrecoverable error (UE) logged for B1822713 when an invalid cooling zone is found during the adjustment of the system fan speeds.  This error can be ignored as it does not represent a problem with the fans.
  • A problem was fixed for a processor clock failover error with SRC B158CC62 calling out all processors instead of isolating to the suspect processor.  The callout priority correctly has a clock and a procedure callout as the highest priority, and these should be performed first to resolve the problem before moving on to the processors.
  • A problem was fixed for loss of back-level protection during firmware updates if an anchor card has been replaced.  The Power system manufacturing process sets the minimum code level a system is allowed to have for proper operation.  If an anchor card is replaced, it is possible that the replacement anchor card is one that has the Minimum MIF Level (MinMifLevel) given as "blank", and this removes the system back-level protection. With the fix, blanks or nulls on the anchor card for this field are handled correctly to preserve the back-level protection.  Systems that have already lost the back-level protection due to anchor card replacement remain vulnerable to an accidental downgrade of code level by operator error, so code updates to a lower level for these systems should only be performed under guidance from IBM Support.  The following command can be run from the Advanced System Management Interface (ASMI) to determine if the system has lost the back-level protection with the presence of "blanks" or ASCII 20 values for MinMifLevel:
    "registry -l cupd/MinMifLevel" with output:
    "cupd/MinMifLevel:
    2020202020202020 2020202020202020 [ ]
    2020202020202020 2020202020202020 [ ]"
  • A problem was fixed for a code update from FW830 to a FW840 level that causes temperature sensors to be lost, so that the ipmitool command to list the temperature sensors fails with an IPMI program core dump.  If the temperature sensors are already corrupted due to a preceding code update, this fix adds back the temperature sensors to allow the ipmitool to work for listing the temperature sensors.
  • A problem was fixed for a system checkstop caused by a L2 cache least-recently used (LRU) error that should have been a recoverable error for the processor and the cache.  The cache error should not have caused a L2 HW CTL error checkstop.
  • A problem was fixed for a re-IPL with power on failure with B181A40F SRC logged for VPD not found for a DIMM FRU.  The DIMM had been moved to another slot or just removed.  In this situation, an IPL of the system from power off will work without errors, but a re-IPL with power on, such as that done after processing a hardware dump, will fail with the B181A40F.  Power off the system and IPL to recover.  Until the fix is applied, the problem can be circumvented after a DIMM memory move by putting the PNOR flash memory in genesis mode by running the following commands in ASMI with the CEC powered off:
            1) hwsvPnorCmd -c
            2) hwsvPnorCmd -g
  • A problem was fixed for the service processor becoming inaccessible when having a dynamic IP address and being in DCMI "non-random" mode for DHCP discovery by customer configuration.  The problem can occur intermittently during an AC power on of the system.  If the service processor does not respond on the network, AC power cycle to recover.  Without the fix, the problem can be circumvented by using the DHCP client in the DCMI "random" mode for DHCP discovery, which is the default on the service processor.
  • A problem was fixed for priority callouts for system clock card errors with SRC B158CC62.  These errors had high priority callouts for the system clock card and medium callouts for FRUs in the clock path.  With the fix, all callouts are set to medium priority as the clock card is not the most probable FRU to have failed but is just a candidate among the many FRUs along the clock path.
  • A problem was fixed for a memory initialization error reported with SRC BC8A0506 that terminates the IPL.  This problem is unlikely to occur because it depends on a specific memory location being used by the code load. The system can be recovered from the error by doing another IPL.

System firmware changes that affect certain systems

  • On PowerVM systems, a problem was fixed to address a performance degradation.  The problem surfaces under the following conditions:
    1)    There is at least one VIOS or Linux partition that is running with dedicated processors AND
    2)    There is at least one VIOS or Linux partition running with shared processors AND
    3)    There is at least one AIX or IBM i partition configured with shared processors.
    If ALL the above conditions are met AND one of the following actions occurs,
    1)    A VIOS/Linux dedicated processor partition is configured to share processors while active OR
    2)    A dynamic platform optimization operation (HMC 'optmem' command) is performed OR
    3)    Processors are unlicensed via a capacity on demand operation
    there is an exposure for a loss in performance.
  • On systems using PowerVM firmware, a problem was fixed for PCIe switch recovery to prevent a partition switch failure during the IPL with error logs for SRC B7006A22 and B7006971  reported.  This problem can occur when doing recovery for an informational error on the switch.  If this problem occurs, the partition must be restarted to recover the affected I/O adapters.
  • On systems using PowerVM firmware, a problem was fixed for a concurrent FRU exchange of a CAPI  (Coherent Accelerator Processor Interface) adapter for a standard I/O adapter that results in a vary off failure.  If this failure occurs, the system needs to be re-IPLed to fix the adapter.  The trigger for this failure is a dual exchange where the CAPI adapter is exchanged first for a standard (non-like-typed) adapter.  Then an attempt is made to exchange the standard adapter for a CAPI adapter which fails.
  • On systems using PowerVM firmware, a problem was fixed for a CAPI  (Coherent Accelerator Processor Interface) device going to a "Defined" state instead of "Available" after a partition boot.  If the CAPI device is doing recovery and logging error data at the time of the partition boot, the error may occur.  To recover from the error, reboot the partition.  With the fix, the hypervisor will wait for the logging of error data from the CAPI device to finish before proceeding with the partition boot.
  • On systems using PowerVM firmware, a problem was fixed for a hypervisor adjunct partition failed with "SRC B2009008 LP=32770" for an unexpected SR-IOV adapter configuration.  Without the fix, the system must be re-IPLed to correct the adjunct error.  This error is infrequent and can only occur if an adapter port configuration is being changed at the same time that error recovery is occurring for the adapter.
  • On systems using PowerVM firmware and PCIe adapters in SR-IOV mode,  the following problem was addressed with a Broadcom Limited (formerly known as Avago Technologies and Emulex) adapter firmware update to 10.2.252.1913:  Transmit time-outs on a Virtual Function (VF) during stressful network traffic.
  • On systems using PowerVM firmware with an invalid P-side or T-side in the firmware, a problem was fixed in the partition firmware Run-Time Abstraction Services (RTAS) so that system Vital Product Data (VPD) is returned at least from the valid side instead of returning no VPD data.  This allows AIX host commands such as lsmcode, lsvpd, and lsattr that rely on the VPD data to work to some extent even if there is one bad code side.  Without the fix, all the VPD data is blocked from the OS until the invalid code side is recovered by either rejecting the firmware update or attempting to update the system firmware again.
  • On systems using PowerVM firmware without an HMC (and in Manufacturing Default Configuration (MDC) mode with a single host partition), a problem was fixed for missing dumps of type SYSDUMP, FSPDUMP, LOGDUMP, and RSCDUMP that were not off-loaded to the host OS.  This is an infrequent error caused by a timing error that causes the dump notification signal to the host OS to be lost.  The missing/pending dumps can be retrieved by rebooting the host OS partition.  The rebooted host OS will receive new notifications of the dumps that have to be off-loaded.
  • On systems using PowerVM firmware, a problem was fixed for truncation on the memory fields displayed in the Advanced System Management Interface on the COD panels.  ASMI shows three fields of memory called "Installed memory", "Permanent memory", and "Inactive memory".  The largest value that could be displayed in the fields was "9999" GB.  This has been expanded to a maximum of "999999" GB for each of the ASMI fields.  The truncation was only in the displayed memory value, not in the actual memory size being used by the system, which was correct.
  • On systems using PowerVM firmware and a partition using Active Memory Sharing (AMS), a problem was fixed for a Live Partition Mobility (LPM) migration of the AMS partition that can hang the hypervisor on the target CEC.  When an AMS partition migrates to the target CEC, a hang condition can occur after processors are resumed on the target CEC, but before the migration operation completes.  The hang will prevent the migration from completing, and will likely require a CEC reboot to recover the hung processors.  For this problem to occur, there needs to be memory page-based activity (e.g. AMS dedup or Pool paging) that occurs exactly at the same time that the Dirty Page Manager's PSR data for that page is being sent to the target CEC.
  • On systems using PowerVM firmware and having an IBM i partition with more than 64 cores, a performance problem was fixed with the choice of processor cores assigned to the partition.
    This problem only applies to the E870 (9119-MME) and E880 (9119-MHE) models.
  • On systems using PowerVM firmware, a problem was fixed for PCIe adapter hangs and network traffic error recovery during Live Partition Mobility (LPM) and SR-IOV vNIC (virtual ethernet adapter)  operations.  An error in the PCI Host Bridge (PHB) hardware can persist in the L3 cache and fail all subsequent network traffic through the PHB.  The PHB  error recovery was enhanced to flush the PHB L3 cache to allow network traffic to resume.
  • On systems using PowerVM firmware with AIX or Linux partitions with greater than 8TB of memory, a problem was fixed for Dynamic DMA Window (DDW) enabled adapters IPLing into a "Defined" state,  instead of "Available", and unusable with a "0" size DMA window.  If a DDW enabled adapter is plugged into an HDDW (Huge Dynamic DMA Window) slot in a partition with the large memory size, the OS changes the default DMA window to "0" in size.  To prevent this problem, the Advanced System Management Interface (ASMI) in the service processor can be used to set "I/O Enlarged Capacity" to "0" (which is off), and all the DDW enabled adapters will work on the next IPL.
  • On a multi-node system,  a problem was fixed for a power fault with SRC 11002610 having incorrect FRU callouts.  The wrong second FRU callout is made on nodes 2, 3, and 4 of a multi-node system.  Instead of calling out the processor FRU, the enclosure FRU is called out.  The first FRU callout is correct.
  • On PowerVM systems with partitions running Linux, a problem was fixed for intermittent hangs following a Live Partition Mobility (LPM) migration of a Linux partition.  A partition migrating from a source system running FW840.00 to a system running any other supported firmware level may become unresponsive and unusable once it arrives on the target system.  The problem only affects Linux partitions and is intermittent.  Only partitions that have previously been migrated to a FW840.00 system are susceptible to a hang on subsequent migration to another system.  If a partition is hung following a LPM migration, it must be rebooted on the target system to resume operations.
  • On systems using OPAL firmware, a problem was fixed that prevented multiple NVIDIA Tesla K80 GPUs from being attached to one PCIe adapter.  This prevented using a PCIe attached GPU drawer.  This fix increases the PCIe MMIO (memory-mapped I/O) space to 1 TB from a previous maximum of 64 GB per PHB/PCIe slot.
  • On PowerVM systems with dedicated processor partitions with low I/O utilization, the dedicated processor partition may become intermittently unresponsive. The problem can be circumvented by changing the partition to use shared processors.
  • On systems using OPAL firmware, a problem was fixed in OPAL to identify the PCI Host Bridge (PHB) on CAPI adapter errors and not always assume PHB0.
  • On systems using OPAL firmware, a problem was fixed in the OPAL gard utility to remove gard records after guarded components have been replaced.  Without the fix, Hostboot and the gard utility could be in disagreement on the replaced components, causing some components to still display as guarded after a repair.
  • On systems using PowerVM firmware with partitions with a very large number of PCIe adapters, a problem was fixed for partitions that would hang because the partition firmware ran out of memory for the OpenFirmware FCode device drivers for PCIe adapters.  With the fix, the hypervisor is able to dynamically increase the memory to accommodate the larger partition configurations of I/O slots and adapters.
  • On PowerVM systems with vNIC adapters, a problem was fixed for doing a network boot or install from the adapter using a VLAN tag.  Without the fix, the support is missing for doing a network boot from the VLAN tag from the SMS RIPL menu.
  • On systems using PowerVM firmware, a problem was fixed for a Live Partition Mobility (LPM) migration of a partition with large memory that had a migration abort when the partition took longer than five minutes to suspend.  This is a rare problem and is triggered by an abnormally slow response time from the migrating partition.  With the fix, the five minute time limit on the suspend operation has been removed.
  • On systems using PowerVM firmware at FW840.00 with an AIX VIO client partition at level 7.1 TL04 SP03 or 7.2 TL01 SP00 or later, a problem was fixed for virtual ethernet adapters with IPv6 largesend packets (i.e., data packets larger than the maximum transmission unit (MTU)) that hung and/or ran slow because the largesend packets were discarded by the hypervisor.  For example, telnet and ping commands for the system will be working, but as soon as a send of a large packet of data is attempted, the network connection hangs.  This firmware fix requires AIX levels 7.1 TL04 SP03 or 7.2 TL01 SP00 or later for the largesend feature to work.
    The problem can be circumvented by disabling "mtu_bypass" (largesend) on the AIX VIO client.  The "mtu_bypass" attribute is disabled by default, but many network administrators enable it for a performance gain.  To disable "mtu_bypass" on the AIX VIO client, use the following steps:
    (0) This change may impact existing connections, so shut down the affected network interfaces (where X is the interface number) prior to the change
    (1) Login to AIX VIO client from console as root
    (2) ifconfig enX down;ifconfig enX detach
    (3) chdev -l enX -a mtu_bypass=off
    (4) chdev -l enX -a state=up
    (5) mkdev -l inet0
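    The following optional step is a suggested verification only; it assumes the "mtu_bypass" attribute is reported for the interface at this AIX level:
    (6) lsattr -El enX -a mtu_bypass     (confirms that mtu_bypass is now "off")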
SC840_056_056 / FW840.00

12/04/15
Impact:  New      Severity:  New

New Features and Functions

NOTE:
  • POWER8 (and later) servers include an “update access key” that is checked when system firmware updates are applied to the system.  The initial update access keys include an expiration date which is tied to the product warranty. System firmware updates will not be processed if the GA date of the desired firmware level occurred after the update access key’s expiration date.  As these update access keys expire, they need to be replaced using either the Hardware Management Console (HMC) or the Advanced Management Interface (ASMI) on the service processor.  Update access keys can be obtained via the key management website: http://www.ibm.com/servers/eserver/ess/index.wss.
  • Support for allowing the PowerVM hypervisor to continue to run when communication between the service processor and platform firmware has been lost and cannot be re-established.  A SRC B1817212 may be logged and any active partitions will continue to run but they will not be able to be managed by the management console.  The partitions can be allowed to run until the next scheduled service window at which time the service processor can be recovered with an AC power cycle or a pin-hole reset from the operator panel.  This error condition would only be seen on a system that had been running with a single service processor (no redundancy for the service processor).
  • Support in the Advanced Systems Management Interface (ASMI) for managing certificates on the service processor with option "System Configuration/Security/Certificate Management".  Certificate management includes 1) Generation of Certificate Signing Request (CSR) 2) Download of CSR and 3) Upload of signed certificates.  For more information on managing certificates, go to the IBM KnowledgeCenter link for "Certificate Management"
    (https://www-01.ibm.com/support/knowledgecenter/P8ESS/p8hby/p8hby_securitycertificate.htm)
  • Support for concurrent add of the PCIe expansion drawer (F/C #EMX0) and concurrent add of PCIe optical cable adapters (F/C EJ07 and CCIN 6B52).  For concurrent add guidance, go to the IBM KnowledgeCenter links for "Connecting a PCIe Gen3 I/O expansion drawer to your system"(https://www-01.ibm.com/support/knowledgecenter/9119-MHE/p8egp/p8egp_connect_kickoff.htm?lang=en-us) and for  "PCIe adapters for the 9119-MHE and 9119-MME" (https://www-01.ibm.com/support/knowledgecenter/9119-MHE/p8hak/p8hak_87x_88x_kickoff.htm?lang=en-us).
  • Support for concurrent repair/exchange of the PCIe3 6-slot Fanout module for the PCIe3 Expansion Drawer,  PCIe Optical Cable adapters and PCIe3 Optical Cable.  For concurrent repair/exchange guidance for these parts, go to the IBM KnowledgeCenter link for "Removing and replacing parts in the PCIe Gen3 I/O expansion drawer"(https://www-01.ibm.com/support/knowledgecenter/9119-MHE/p8egr/p8egr_emx0_kickoff.htm?lang=en-us). Below are the feature codes for the affected parts:
    #EMX0 - PCIe3 Expansion Drawer
    #EMXF - PCIe3 6-Slot Fanout Module for PCIe3 Expansion Drawer (all server models)
    #EJ07 (CCIN 6B52) - PCIe3 Optical Cable Adapter for PCIe3 Expansion Drawer
    #ECC6 - 2M Optical Cable Pair for PCIe3 Expansion Drawer
    #ECC8 - 10M Optical Cable Pair for PCIe3 Expansion Drawer
    #ECC9 - 20M Optical Cable Pair for PCIe3 Expansion Drawer
  • PowerVM support for Coherent Accelerator Processor Interface (CAPI) adapters.  The PCIe3 LP CAPI Accelerator Adapter with F/C #EJ16 is used on the S812L (8247-21L) and S822L (8247-22L) models.  The PCIe3 CAPI FlashSystem Accelerator Adapter with F/C #EJ17 is used on the S814 (8286-41A) and S824 (8286-42A) models.  The PCIe3 CAPI FlashSystem Accelerator Adapter with F/C #EJ18 is used on the S822 (8284-22A), E870 (9119-MME), and E880 (9119-MHE) models.  This feature does not apply to the S824L (8247-42L) model.
  • Management console enhancements for support of concurrent maintenance of CAPI-enabled adapters.
  • Support for PCIe3 Expansion Drawer (#EMX0) lower cable failover, using lane reversal mode to bring up the expansion drawer from the top cable.  This eliminates a single point of failure by supporting lane reversal in case of problems with the lower cable.
  • Expanded support of Virtual Ethernet Large send from IPv4 to the IPv6 protocol in PowerVM.
  • Support for IBM i network install on an IEEE 802.1Q VLAN.  The supported OS levels are IBM i 7.2 TR3 or later.  This feature applies only to S814 (8286-41A), S824 (8286-42A), E870 (9119-MME), and E880 (9119-MHE) models.
  • Support for PowerVM vNIC with up to six vNIC client adapters for each partition.  PowerVM vNIC combines many of the best features of SR-IOV and PowerVM SEA to provide a network solution with options for advanced functions such as Live Partition Mobility along with better performance and I/O efficiency when compared to PowerVM SEA.  In addition PowerVM vNIC provides users with bandwidth control (QoS) capability by leveraging SR-IOV logical ports as the physical interface to the network.
    Note:  If more than six vNIC client adapters are used in a partition, the partition will run, as there is no check to prevent the extra adapters, but certain operations such as Live Partition Mobility may fail.
  • Enhanced handling of errors to allow partial data in a Shared Storage Pool (SSP) cluster.  Under partial data error conditions, the management console "Manage PowerVM" gui will correctly show the working VIOS clusters along with information about the broken VIOS clusters, instead of showing no data.
  • Live Partition Mobility (LPM) was enhanced to allow the user to specify VIOS concurrency level overrides.
  • Support was added for PowerVM hard compliance enforcement of the Power Integrated Facility for Linux (IFL).  IFL is an optional lower cost per processor core activation for Linux-only workloads on IBM Power Systems.  Power IFL processor cores can be activated that are restricted to running Linux workloads.  In contrast, processor cores that are activated for general-purpose workloads can run any supported operating system.  PowerVM will block partition activation, LPM and DLPAR requests on a system with IFL processors configured if the total entitlement of AIX and IBMi partitions exceeds the amount of licensed general-purpose processors.  For AIX and IBMi partitions configured with uncapped processors, the PowerVM hypervisor will limit the entitlement and uncapped resources consumed to the amount of general-purpose processors that are currently licensed.
  • Support was added to allow Power Enterprise Pools to convert permanently-licensed (static) processors to Pool Processors using a CPOD COD activation code provided by the management console.  Previously, only unlicensed processors were able to become Pool Processors.
  • The management console was enhanced to allow a Live Partition Mobility (LPM) if there is a failed VIOS in a redundant pair.  During LPM, if the VIOS is inactive, the management console will use stored configuration information to perform the LPM.
  • The firmware update process from the management console and from in-band OS (except for IBM i PTFs) has been enhanced to download new "Update access keys" as needed to prevent the access key from expiring.  This provides an automatic renewal process for the entitled customer.
  • Live Partition Mobility support was added to allow the user to specify a different virtual Ethernet switch on the target server.
  • PowerVM was enhanced to support AIX Live Update, where the AIX kernel is updated without restarting the operating system.  The AIX OS level must be 7.2 or later.  Starting with AIX Version 7.2, the AIX operating system provides the AIX Live Update function, which eliminates downtime associated with patching the AIX operating system. Previous releases of AIX
    required systems to be rebooted after an interim fix was applied to a running system. This new feature allows workloads to remain active during a Live Update operation and the operating system
    can use the interim fix immediately without needing to restart the entire system. In the first release of this feature, AIX Live Update will allow customers to install interim fixes (ifixes) only. For more information on AIX Live Update,  go to the IBM KnowledgeCenter link for "Live Update" 
    (https://www-01.ibm.com/support/knowledgecenter//ssw_aix_72/com.ibm.aix.install/live_update_install.htm).
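    As a reference only, an AIX Live Update of an interim fix is typically driven with the geninstall command; the fix file name and directory below are hypothetical, and the flags should be confirmed against the AIX 7.2 documentation:
    # Preview the Live Update operation for a hypothetical interim fix package
    geninstall -k -p -d /tmp/ifixes IV12345m1a.epkg.Z
    # Perform the Live Update; workloads remain active while the interim fix is applied
    geninstall -k -d /tmp/ifixes IV12345m1a.epkg.Z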
  • The management console has been enhanced to use standard FTP in its firmware update process instead of a custom implementation.  This will provide a more consistent interface for the users.
  • Support for setting Power Management Tuning Parameters from the management console (Fixed Maximum Frequency (FMF), Idle Power Save, and DPS Tunables) without needing to use the Advanced System Management Interface (ASMI) on the service processor.  This allows FMF mode to be set by default without having to modify any tunable parameters using ASMI.
  • Support for a Corsa PCIe adapter with accelerator FPGA for low latency connection using CAPI (Coherent Accelerator Processor Interface) attached to a FlashSystem 900 using two 8Gb optical SR Fibre Channel (FC) connections.
    Supported IBM Power Systems for this feature are the following:
    1) E880 (9119-MHE) with CAPI Activation feature #EC19 and  Corsa adapter #EJ18 Low profile on AIX.
    2) E870 (9119-MME) with CAPI Activation feature #EC18 and Corsa adapter #EJ18 Low profile on AIX.
    3) S822 (8284-22A) with CAPI Activation feature #EC2A and Corsa adapter #EJ18 Low profile on AIX.
    4) S814 (8286-41A) with CAPI Activation feature #EC2A and Corsa adapter #EJ17 Full height on AIX.
    5) S824 (8286-42A) with CAPI Activation feature #EC2A and Corsa adapter #EJ17 Full height on AIX.
    6) S812L (8247-21L) with CAPI Activation feature #EC2A and Corsa adapter #EJ16 Low profile on Linux.
    7) S822L (8247-22L)  with CAPI Activation feature #EC2A and Corsa adapter #EJ16 Low profile on Linux.
    OS levels that support this feature are PowerVM AIX 7.2 or later and OPAL bare-metal Linux Ubuntu 15.10.
    The IBM FlashSystem 900 storage system is model 9840-AE2 (one year warranty) or 9843-AE2 (three year warranty) at the 1.4.0.0 or later firmware level with feature codes #AF23, #AF24, and #AF25 supported for 1.2 TB, 2.9 TB, 5.7 TB modules, respectively.
  • The Digital Power Subsystem Sweep (DPSS) FPGA, used to control P8 fan speeds and memory voltages, was enhanced to support the 840 GA level. This DPSS update is delayed to the next IPL of the CEC and adds 18 to 20 minutes to the IPL.  See the "Concurrent Firmware Updates" section above for details.
  • Support for Data Center Manageability Interface (DCMI) V1.5 and Energy Star compliance.  DCMI features were added to the Intelligent Platform Management Interface (IPMI) 2.0 implementation on the service processor.  DCMI adds platform management capability for monitoring elements such as system temperatures, power supplies, and bus errors.  It also includes automatic and manually driven recovery capabilities such as local or remote system resets, power on/off operations, and logging of abnormal or "out-of-range" conditions for later examination.  In addition, it allows querying for inventory information that can help identify a failed hardware unit, along with power management options for getting and setting power limits.
    Note:  A deviation from the DCMI V1.5 specification exists for 840.00 for the DCMI Configuration Parameters for DHCP Discovery.  Random back-off mode is enabled by default instead of being disabled.  The random back-off puts a random variation delay in the DHCP retry interval so that the DHCP clients are not responding at the same time. Disabling the back-off time is not required for normal operations, but if wanted, the system administrator can override the default and disable the random back-off mode by sending the “SET DCMI Configuration Parameters” for the random back-off property of the Discovery Configuration parameter.  A value of "0" for the bit means "Disabled".
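    For illustration, the following are typical ipmitool DCMI subcommands for the monitoring and power management capabilities described above; this is a sketch only, and the availability of individual subcommands depends on the ipmitool version in use:
    # Discover which DCMI capabilities the service processor advertises
    ipmitool dcmi discover
    # Read the current DCMI power reading and the configured power limit
    ipmitool dcmi power reading
    ipmitool dcmi power get_limit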


SC830
For Impact, Severity and other Firmware definitions, Please refer to the below 'Glossary of firmware terms' url:
http://www14.software.ibm.com/webapp/set2/sas/f/power5cm/home.html#termdefs
SC830_106_048 / FW830.50

04/27/17
Impact: Availability    Severity: SPE

New features and functions

  • Support for the Advanced System Management Interface (ASMI) was changed to allow the special characters of "I", "O", and "Q" to be entered for the serial number of the I/O Enclosure under the Configure I/O Enclosure option.  These characters have only been found in an IBM serial number rarely, so typing in these characters will normally be an incorrect action.  However, the special character entry is not blocked by ASMI anymore so it is able to support the exception case.  Without the enhancement, the typing of one of the special characters causes message "Invalid serial number" to be displayed.
  • Support was added  for the Universally Unique IDentifier (UUID) property for each partition.  The UUID provides each partition with an identifier that is persisted by the platform across partition reboots, reconfigurations, OS reinstalls, partition migration,  and hibernation.
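    As a reference sketch only, the UUID is typically visible from the HMC command line; the managed system name below is hypothetical, and whether "uuid" is available as an output attribute depends on the HMC level, so treat the field name as an assumption:
    # List partition names with their platform-assigned UUIDs
    lssyscfg -r lpar -m Server-9119-MHE-SN1234567 -F name,uuid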

System firmware changes that affect all systems

  • A problem was fixed for System Vital Product Data (SVPD) FRUs being guarded but not having a corresponding error log entry.  This failure to commit the error log entry has occurred only rarely.
  • A problem was fixed for a system going into safe mode with SRC B1502616 logged as informational without a call home notification.  Notification is needed because the system is running with reduced performance.  If there are unrecoverable error logs and any are marked with reduced performance and the system has not been rebooted, then the system is probably running in safe mode with reduced performance.  With the fix, the SRC B1502616 is an Unrecoverable Error (UE).
  • A problem was fixed for the PCIe3 Optical Cable Adapter for the PCIe3 Expansion Drawer failing with SRC B7006A84 error logged during the IPL.  The failed cable adapter can be recovered by using a concurrent repair operation to power it off and on.  Or  the system can be re-IPLed to recover the cable adapter.  The affected optical cable adapters have feature codes #EJ05, #EJ06, and #EJ08 with CCINs 2B1C, 6B52, and 2CE2, respectively.
  • A problem was fixed for PCIe Host Bridge (PHB) outages and PCIe adapter failures in the PCIe I/O expansion drawer caused by error thresholds being exceeded for the LEM bit [21] errors in the FIR accumulator.  These are typically minor and expected errors in the PHB that occur during adapter updates and do not warrant  a reset of the PHB and the PCIe adapter failures.  Therefore, the threshold LEM[21] error limit has been increased and the LEM fatal error has been changed to a Predictive Error to avoid the outages for this condition.
  • A problem was fixed to improve PCIe3 I/O expansion drawer (#EMX0) link stability.  The settings for the continuous time linear equalizers (CTLE) were updated for all the PCIe adapters for the PCIe links to the expansion drawer.  The CEC must be re-IPLed for the fix to activate.
  • The following problems were fixed for SR-IOV adapters:
    1) Insufficient resources reported for an SR-IOV logical port configured with promiscuous mode enabled and a Port VLAN ID (PVID) when creating a new interface on the SR-IOV adapters.
    2) Spontaneous dumps and reboot of the adjunct partition for SR-IOV adapters.
    3) Adapter enters a firmware loop when a single bit ECC error is detected.  System firmware detects this condition as an adapter command time-out.  System firmware will reset and restart the adapter to recover the adapter functionality.  This condition will be reported as a temporary adapter hardware failure.
    4) vNIC interfaces not being deleted correctly, causing SRC B400FF01 to be logged and Data Storage Interrupt (DSI) errors with failure on boot of the LPAR.
    This set of fixes updates adapter firmware to 10.2.252.1926, for the following Feature Codes: EN15, EN16, EN17, EN18, EN0H, EN0J, EN0M, EN0N, EN0K, EN0L, EL38 , EL3C, EL56, and EL57.
    The SR-IOV adapter firmware level update for the shared-mode adapters happens under user control to prevent unexpected temporary outages on the adapters.  A system reboot will update all SR-IOV shared-mode adapters with the new firmware level.  In addition, when an adapter is first set to SR-IOV shared mode, the adapter firmware is updated to the latest level available with the system firmware (and it is also updated automatically during maintenance operations, such as when the adapter is stopped or replaced).  And lastly, selective manual updates of the SR-IOV adapters can be performed using the Hardware Management Console (HMC).  To selectively update the adapter firmware, follow the steps given at the IBM Knowledge Center for using HMC to make the updates:   https://www.ibm.com/support/knowledgecenter/HW4M4/p8efd/p8efd_updating_sriov_firmware.htm.
    Note: Adapters that are capable of running in SR-IOV mode, but are currently running in dedicated mode and assigned to a partition, can be updated concurrently either by the OS that owns the adapter or the managing HMC (if OS is AIX or VIOS and RMC is running).
  • A problem was fixed for Live Partition Mobility (LPM) migrations from FW860.10 or FW860.11 to older levels of firmware.  Subsequent  DLPAR of Virtual Adapters will fail with HMC error message HSCL294C, which contains text similar to the following:  "0931-007 You have specified an invalid drc_name." This issue affects partitions installed with AIX 7.2 TL 1 and later. Not affected by this issue are partitions installed with VIOS, IBM i, or earlier levels of AIX.
  • A problem was fixed for incorrect callouts of the Power Management Controller (PMC) hardware with SRC  B1112AC4 and SRC B1112AB2 logged.  These extra callouts occur when the On-Chip Controller (OCC) has placed the system in the Safe mode state for a prior failure that is the real problem that needs to be resolved.
  • A problem was fixed for a failure in launching the Advanced System Management Interface (ASMI) from the HMC local console for the HMC levels of V8R8.3.0 SP2 and V8R8.4.0 SP1.  There was a frozen window displayed  instead of the ASMI login panel.  A circumvention to the problem is to connect to ASMI from a remote browser session.
  • A problem was fixed for the Advanced System Management Interface (ASMI) "System Service Aids => Error/Event Logs" panel not showing the "Clear" and "Show" log options and also having a truncated error log when there are a large number of error logs on the system.
  • A problem was fixed for sporadic blinking amber LEDs for the system fans with no SRCs logged.  There was no problem with the fans.  The LED corruption occurred when two service processor tasks attempted to update the LED state at the same time.  The fan LEDs can be recovered to a normal state concurrently using the following link steps for a soft reset of the service processor:  https://www.ibm.com/support/knowledgecenter/POWER8/p8hby/p8hby_softreset.htm
  • A problem was fixed for hardware dumps only collecting data for the master processor if a run-time service processor failover had occurred prior to the dump.  Therefore, there would be only master chip and master core data in the event of a core unit checkstop.  To recover to a system state that is able to do a full collection of debug data for all processors and cores after a run-time failover, a re-IPL of the system is needed.
  • A problem was fixed for the loss of Operations Panel function 30 (displaying ethernet port  HMC1 and HMC2 IP addresses) after a concurrent repair of the Operations Panel.  Operations  Panel function 30 can be restored concurrently using the following link steps for a soft reset of the service processor:  https://www.ibm.com/support/knowledgecenter/POWER8/p8hby/p8hby_softreset.htm
  • A problem was fixed for the service processor boot watch-dog timer expiring too soon during DRAM initialization in the reset/reload, causing the service processor to go unresponsive.  On systems with a single service processor, the SRC B1817212 was displayed on the control panel.  For systems with redundant service processors, the failing service processor was deconfigured.  To recover the failed service processor, the system will need to be powered off with AC power removed during a regularly scheduled system service action.  This problem is intermittent and very infrequent as most of the reset/reloads of the service processor will work correctly to restore the service processor to a normal operating state.
  • A problem was fixed for host-initiated resets of the service processor causing the system to terminate.  A prior fix for this problem did not work correctly because some of the host-initiated resets were being translated to unknown reset types that caused the system to terminate.  With this new correction for failed host-initiated resets, the service processor will still be unresponsive but the system and partitions will continue to run.  On systems with a single service processor, the SRC B1817212 will be displayed on the control panel.  For systems with redundant service processors, the failing service processor will be deconfigured.  To recover the failed service processor, the system will need to be powered off with AC power removed during a regularly scheduled system service action.  This problem is intermittent and very infrequent as most of the host-initiated resets of the service processor will work correctly to restore the service processor to a normal operating state.
  • A problem was fixed for incorrect error messages from the Advanced System Management Interface (ASMI) functions when the system is powered on but in the  "Incomplete State".  For this condition, ASMI was assuming the system was powered off because it could not communicate to the PowerVM hypervisor.  With the fix, the ASMI error messages will indicate that ASMI functions have failed because of the bad hypervisor connection instead of falsely stating that the system is powered off.
  • A problem was fixed for a single node failure on a multi-node system preventing an IPL.  The error occurred if Hostboot hung on a node and timed out  without calling out problem hardware.  With the fix, a service processor failover is used to IPL on an alternate path to recover from the error.  And an error log has been added for the IPL timeout for the node with SRC B111BAAB and a callout for the master processor and PNOR.
  • A problem was fixed for the System Attention LED failing to light for an error failover for the redundant service processors with a SRC B1812028 logged.

System firmware changes that affect certain systems

  • On systems with PCIe adapters in Single Root I/O Virtualization (SR-IOV) shared mode, a problem was fixed for the hypervisor SR-IOV adjunct partition failing during the IPL with SRCs B200F011 and B2009014 logged. The SR-IOV adjunct partition successfully recovers after it reboots and the system is operational.
  • On systems with maximum memory configurations (where every DIMM slot is populated - size of DIMM does not matter), a  problem has been fixed for systems losing performance and going into Safe mode (a power mode with reduced processor frequencies intended to protect the system from over-heating and excessive power consumption) with B1xx2AC3/B1xx2AC4 SRCs logged.  This happened  because of On-Chip Controller (OCC) time out errors when collecting Analog Power Subsystem Sweep (APSS) data, used by the OCC to tune the processor frequency.  This problem occurs more frequently on systems that are running heavy workloads.  Recovery from Safe mode back to normal performance can be done with a re-IPL of the system, or concurrently using the following link steps for a soft reset of the service processor:  https://www.ibm.com/support/knowledgecenter/POWER8/p8hby/p8hby_softreset.htm.
    To check or validate that Safe mode is not active on the system will require a dynamic celogin password from IBM Support to use the service processor command line:
    1) Log into ASMI as celogin with  dynamic celogin password generated by IBM Support
    2) Select System Service Aids
    3) Select Service Processor Command Line
    4) Enter "tmgtclient --query_mode_and_function" from the command line
    The first line of the output, "currSysPwrMode", should say "NOMINAL"; this means the system is in normal mode and that Safe mode is not active.
SC830_101_048 / FW830.40

12/08/16
Impact: Availability    Severity: ATT

New features and functions

  • Support for the Advanced System Management Interface (ASMI) was changed to not create VPD deconfiguration records and call home alerts for hardware FRUs that have one VPD chip of a redundant pair broken or inaccessible.  The backup VPD chip for the FRU allows continued use of the hardware resource.  The notification of the need for service for the FRU VPD is not provided until both of the redundant VPD chips have failed for a FRU.
  • Support was added for systems to be able to automatically convert permanently activated resources (processor and memory) to Mobile CoD resources for use in a Power Enterprise Pool (PEP).  The ability to do a CoD resource license conversion requires a minimum HMC level of V8R8.4.0 or later.  More information on how to use a PEP for a group of systems to share Mobile Capacity on Demand (CoD) processor resources and memory resources can be found in the IBM Knowledge Center at the following link: https://www.ibm.com/support/knowledgecenter/HW4M4/p8ha2/systempool_cod.htm

System firmware changes that affect all systems

  • A problem was fixed for an infrequent IPL hang and termination that can occur if the backup clock card is failing.  The following SRCs may be logged with this termination:  B1813450, B181460B, B181BA07, B181E6C7 and B181E6F1.  If the IPL error occurs, the system can be re-IPLed to recover from the problem.
  • A problem was fixed for an infrequent service processor failover hang that results in a reset of the backup service processor that is trying to become the new primary.  This error occurs more often on a failover to a backup service processor that has been in that role for a long period of time (many months).  This error can cause a concurrent firmware update to fail.  To reduce the chance of a firmware update failure because of a bad failover, an Administrative Failover (AFO) can be requested from the HMC prior to the start of the firmware update.  When the AFO has completed, the firmware update can be started as normally done.
  • A problem was fixed for the loss of the setting that disables periodic notification for a call home error log after a failover to the backup service processor on a redundant service processor system.  The call home for the presence of a failed resource can get re-enabled (if manually disabled in ASMI on the primary service processor) after a concurrent firmware update or any scenario that causes the service processor to fail over and change roles.  With the fix, the periodic notification flag is synchronized between the service processors when the flag value is changed.
  • A problem was fixed for On-Chip Controller (OCC) errors that had excessive callouts for processor FRUs.  Many of the OCC errors are recoverable and do not require that the processor be called out and guarded.  With the fix, the processors will only be called out for OCC errors if there are three or more OCC failures during a time period of a week.
  • A problem was fixed for an Operations Panel Function 04 (Lamp test) during an IPL causing the IPL to fail.  With the fix, the lamp test request is rejected during the IPL until the hypervisor is available.  The lamp test can be requested without problems anytime after the system is powered on to hypervisor ready or an OS is running in a partition.
  • A problem was fixed for a 3.3V power fault on the primary system clock card causing a failover to the backup clock without an error log and a call out for the primary clock card.  This clock card is part of a redundant set in the System Control Unit with CCIN 6B49.
  • A problem was fixed for a Phase-Locked Loop (PLL) unlock error on the backup clock card by using spread spectrum to maintain the phase-locked loop for the clock frequency.  This technique was already in use for the primary clock card.  The PLL unlock error is rare in the backup clock for the Power systems but it has been seen more frequently for the same part in other IBM systems.  This clock card is part of a redundant set in the System Control Unit with CCIN 6B49.
  • A problem was fixed for infrequent VPD cache read failures during an IPL causing an unnecessary guarding of DIMMs with SRC B123A80F logged.  With the fix, VPD cache read failures cause a temporary deconfiguration of the associated DIMM, but the DIMM is recovered on the next IPL.
  • A problem was fixed for extra resources being assigned in a Power Enterprise Pool (PEP).   This only occurs if all of these things happen:
     o  Power server is in a PEP pool
     o  Power server has PEP resources assigned to it
     o  Power server powered down
     o  User uses HMC to 'remove' resources from the powered-down server
     o  Power server is then restarted. It should come up with no PEP resources, but it starts up and shows it still is using PEP resources it should not have. 
    To recover from this problem, the HMC 'remove' of the PEP resources from the server can be performed again.
  • A problem was fixed for a Live Partition Mobility (LPM) error where the target partition migration fails with an HSCLB98C error.  The frequency of this error can be moderate with source partitions that have a vNIC resource but extremely low if the source partition does not have a vNIC resource.  The failure originates at the VIOS VF level, so recovery from this error may need a re-IPL of the system to regain full use of the vNIC resources.
  • A problem was fixed for a latency time of about 2 seconds being added to a target Live Partition Mobility (LPM) migration system when there is a latency time check failure.  With the fix, in the case of a latency time check failure, a much smaller default latency is used instead of two seconds.  This error would not be noticed if the customer system is using a NTP time server to maintain the time.
  • A problem was fixed for a system dump post-dump IPL that resulted in adjunct partition errors of SRC BA54504D, B7005191, and BA220020 when they could not be created due to false space constraints.  These adjunct partition failures will prevent normal operations of the hypervisor such as creating new partitions, so a power off and power on of the system is needed to recover it.  If the customer system is experiencing this error (only some systems will be impacted), it is expected to occur for each system dump post-dump IPL until the fix is applied.
  • A problem was fixed for a shared processor pool partition showing an incorrect zero "Available Pool Processor" (APP) value after a concurrent firmware update.  The zero APP value means that no idle cycles are present in the shared processor pool but in this case it stays zero even when idle cycles are available.  This value can be displayed using the AIX "lparstat" command (an illustrative invocation is sketched after this fix list).  If this problem is encountered, the partitions in the affected shared processor pool can be dynamically moved to a different shared processor pool.  Before the dynamic move, the "uncapped" partitions should be changed to "capped" to avoid a system hang.  The old affected pool would continue to have the APP error until the system is re-IPLed.
  • A rare problem was fixed for a system hang that can occur  when dynamically moving "uncapped" partitions to a different shared processor pool.  To prevent a system hang, the "uncapped" partitions should be changed to "capped" before doing the move.
  • A problem was fixed for a DLPAR add of the USB 3.0 adapter (#EC45 and #EC46) to an AIX partition where the adapter could not be configured with the AIX "cfgmgr" command that fails with EEH errors and an outstanding illegal DMA transaction.  The trigger for the problem is the DLPAR add operation of the USB 3.0 adapter that has a USB External Dock (#EU04) and RDX Removable Disk Drives attached, or a USB 3.0 adapter that has a flash drive attached.  The PCI slot can be powered off and on to recover the USB 3.0 adapter.
  • A problem was fixed for network issues, causing critical situations for customers, when an SR-IOV logical port or vNIC is configured with a non-zero Port VLAN ID (PVID).  This fix updates adapter firmware to 10.2.252.1922, for the following Feature Codes: EN15, EN16, EN17, EN18, EN0H, EN0J, EL38, EN0M, EN0N, EN0K, EN0L, and EL3C.
    The SR-IOV adapter firmware level update for the shared-mode adapters happens under user control to prevent unexpected temporary outages on the adapters.  A system reboot will update all SR-IOV shared-mode adapters with the new firmware level.  In addition, when an adapter is first set to SR-IOV shared mode, the adapter firmware is updated to the latest level available with the system firmware (and it is also updated automatically during maintenance operations, such as when the adapter is stopped or replaced).  And lastly, selective manual updates of the SR-IOV adapters can be performed using the Hardware Management Console (HMC).  To selectively update the adapter firmware, follow the steps given at the IBM Knowledge Center for using HMC to make the updates:   https://www.ibm.com/support/knowledgecenter/HW4M4/p8efd/p8efd_updating_sriov_firmware.htm
    Note: Adapters that are capable of running in SR-IOV mode, but are currently running in dedicated mode and assigned to a partition, can be updated concurrently either by the OS that owns the adapter or the managing HMC (if OS is AIX or VIOS and RMC is running).
  • A problem was fixed for a failed IPL with SRC UE BC8A090F that does not have a hardware callout or a guard of the failing hardware.  The system may be recovered by guarding out the processor associated with the error and re-IPLing the system.  With the fix, the bad processor core is guarded and the system is able to IPL.
  • A problem was fixed for the On-Chip Controller (OCC) incorrectly calling out processors with SRC B1112A16 for L4 Cache DIMM failures with SRC B124E504.  This false error logging can occur if the DIMM slot that is failing is adjacent to two unoccupied DIMM slots.
  • A problem was fixed for host-initiated resets of the service processor that can cause the service processor to terminate.  In this state, the service processor will be unresponsive but the system and partitions will continue to run.  On systems with a single service processor, the SRC B1817212 will be displayed on the control panel.  For systems with redundant service processors, the failing service processor will be deconfigured.  To recover the failed service processor, the system will need to be powered off with AC power removed during a regularly scheduled system service action.  The problem is intermittent and very infrequent as most of the host-initiated resets of the service processor will work correctly to restore the service processor to a normal operating state.
  • A problem was fixed for device time-outs during an IPL logged with SRC B18138B4.  This error is intermittent and no action is needed for the error log.  The service processor hardware server now allots more time for the device transactions to allow them to complete without a time-out error.
  • A problem was fixed for cable card capable PCI slots that fail during the IPL.  Hypervisor I/O Bus Interface UE B7006A84 is reported for each cable card capable PCI slot that doesn't contain a PCIe3 Optical Cable Adapter for the PCIe Expansion Drawer (feature code #EJ05).  PCI slots containing a cable card will not report an error but will not be functional.  The problem can be resolved by performing an AC cycle of the system.  The trigger for the failure is that the I2C devices used to detect the cable cards do not come out of the power-on reset process in the correct state, due to a race condition.
  • A problem was fixed with SR-IOV adapter error recovery where the adapter is left in a failed state in nested error cases for some adapter errors.  The probability of this occurring is very low since the problem trigger is multiple low-level adapter failures.  With the fix, the adapter is recovered and returned to an operational state.
  • A problem was fixed for setting, from the Advanced System Management Interface (ASMI), the disable of periodic notification for call home error log SRC B150F138 for Memory Buffer resources (membuf).
  • A problem was fixed for a blank SRC in the LPA dump for user-initiated non-disruptive adjunct dumps.  The SRC is needed for problem determination and dump analysis.
  • A problem was fixed for a missing processor FRU callout for SRC BC8A0307 for a node deconfiguration during the IPL.  The failing SCM is now provided on the callout when this error occurs during the IPL.  This callout allows the guard of the failing  processor to occur so that the IPL is successful.
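    The following is a minimal sketch of checking the Available Pool Processor value from an AIX partition in the affected shared processor pool, as referenced in the shared processor pool fix above.  The interval and count arguments and all output values are illustrative only, and the "app" column is reported only when performance information collection (pool utilization authority) is enabled for the partition:
      $ lparstat 5 1
      System configuration: type=Shared mode=Uncapped smt=4 lcpu=8 mem=8192MB psize=16 ent=1.00

      %user  %sys  %wait  %idle physc %entc  lbusy   app  vcsw phint
      -----  ----  -----  ----- ----- -----  -----  ----  ---- -----
        0.2   0.5    0.0   99.3  0.01   1.2    0.5  15.8   310     2
    A persistent "app" value of 0.0 while the pool is known to have idle capacity matches the symptom described in that fix.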

System firmware changes that affect certain systems

  • On systems using the PowerVM hypervisor firmware and NovaLink, a problem was fixed for a NovaLink installation error where the hypervisor was unable to get the maximum logical memory block (LMB) size from the service processor.  The maximum supported LMB size should be 0xFFFFFFFF but in some cases it was initialized to a value that was less than the amount of configured memory, causing the service processor read failure with error code 0X00000134.
  • On systems that have an attached HMC, a problem was fixed for a Live Partition Mobility migration that resulted in the source managed system going to the Hardware Management Console (HMC) Incomplete state after the migration to the target system was completed.  This problem is very rare and has only been detected once.  The problem trigger is that the source partition does not halt execution after the migration to the target system.  The HMC went to the Incomplete state for the source managed system when it failed to delete the source partition because the partition would not stop running.  When this problem occurred, the customer network was running very slowly and this may have contributed to the failure.  The recovery action is to re-IPL the source system but that will need to be done without the assistance of the HMC.  For each partition that has an OS running on the source system, shut down each partition from the OS.  Then from the Advanced System Management Interface (ASMI), power off the managed system.  Alternatively, the system power button may also be used to do the power off.  If the HMC Incomplete state persists after the power off, the managed system should be rebuilt from the HMC.  For more information on HMC recovery steps, refer to this IBM Knowledge Center link: https://www.ibm.com/support/knowledgecenter/en/POWER7/p7eav/aremanagedsystemstate_incomplete.htm
  • On systems that have an attached HMC,  a problem was fixed for a Live Partition Mobility migration that resulted in a system hang when an EEH error occurred simultaneously with a request for a page migration operation.  On the HMC, it shows an incomplete state for the managed system with reference code A181D000.  The recovery action is to re-IPL the source system but that will need to be done without the assistance of the HMC.  From the Advanced System Management Interface (ASMI),  power off the managed system.  Alternatively, the system power button may also be used to do the power off.  If the HMC Incomplete state persists after the power off, the managed system should be rebuilt from the HMC.  For more information on HMC recovery steps, refer to this IBM Knowledge Center link: https://www.ibm.com/support/knowledgecenter/en/POWER7/p7eav/aremanagedsystemstate_incomplete.htm
SC830_097_048 / FW830.30

08/24/16
Impact: Availability    Severity: SPE

New features and functions

  • The certificate store on the service processor has been upgraded to include the changes contained in version 2.6 of the CA certificate list published by the Mozilla Foundation at the mozilla.org website as part of the Network Security Services (NSS) version 3.21.
  • Support was added to the Advanced System Management Interface (ASMI) for the Intelligent Platform Management Interface (IPMI) to be able to change the IPMI password.  On the "Login Profile/Change Password" menu, a user ID of "IPMI" can be selected.  Changing the password for IPMI changes the password for the default IPMI user ID.  IPMI is not a user ID for logging into ASMI.  The IPMI function on the service processor can be accessed using the "ipmitool" tool from a client system that has a network connection to the service processor (an illustrative invocation is sketched after this list).
  • Support was added to protect the service processor from booting on a level of firmware that is below the minimum MIF level.  If this is detected, a SRC B18130A0 is logged.  A disruptive firmware update would then need to be done to the minimum firmware level or higher.  This new support has no effect on the system being updated with the service pack but has been put in place to provide an enhanced firmware level for the IBM field stock service processors.
  • Support was added for the Stevens6+ option of the internal tray loading DVD-ROM drive with F/C #EU13.  This is an 8X/24X(max) Slimline SATA DVD-ROM Drive.  The Stevens6+ option is a FRU hardware replacement for the Stevens3+.  MTM 7226-1U3 (Oliver)  FC 5757/5762/5763 attaches to IBM Power Systems and lists Stevens6+ as optional for Stevens3+.  If the Stevens6+  DVD drive is installed on the system without the required firmware support, the boot of an AIX partition will fail when the DVD is used as the load source.  Also, an IBM i partition cannot consistently boot from the DVD drive using D-mode IPL.  A SRC C2004130 may be logged for the load source not found error.
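    The following is a minimal sketch, referenced in the IPMI password item above, of exercising the IPMI function on the service processor from a client system once the IPMI password has been set.  The host name and credentials are placeholders, "chassis status" is only an example query, and the exact ipmitool options required (for example the interface selected with -I) depend on the client and network configuration:
      $ ipmitool -I lanplus -H <service-processor-hostname> -U <IPMI user ID> -P <IPMI password> chassis status
    A successful response confirms that the IPMI interface on the service processor is reachable with the configured password.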

System firmware changes that affect all systems

  • DEFERRED:  A performance improvement was made by disabling the Hot/Cold Affinity (HCA) hardware feature which gathers memory usage statistics for consumption by partition operating system memory management algorithms.  The statistics gathering can, in rare cases, cause performance to degrade.  The workloads that may experience issues are memory-intensive workloads that have little locality of reference and thus cannot take advantage of hardware memory cache.  As a consequence, the problem occurs very infrequently or not at all except for very specific workloads in a HPC environment.  This performance fix requires an IPL of the system to activate it after it is applied.
  • A problem was fixed for the service processor going to the reset state instead of the termination state when the anchor card is missing or broken.  At the termination state, the Advanced System Management Interface (ASMI) can be used to collect failure data and debug the problem with the anchor card.
  • A problem was fixed for error log entries created by Hostboot not getting written to the error log in some situations.  This can cause hardware detected as failed by Hostboot to not get reported or have a call-home generated.  This problem will occur whenever Hostboot commits a recovered or informational error as its last error log in the current IPL.  In the next IPL,  one or more error logs from Hostboot will be lost.
  • A problem was fixed for the Hardware Management Console (HMC) "chpwrmgmt" command not providing a meaningful error message when used to try to enable an invalid power saver mode of "dynamic_favor_power" on the 9119-MME or 9119-MHE models.  This power saver mode is not available on these models but the error message issued was "HSCL1400 An error has occurred during the operation to the managed system. Try the task again."  The following is the corrected error message:  "HSCL1402 This operation failed due to the following reasons: HSCL02F3 The managed system does not support the specified power saver mode."
  • A problem was fixed for the health monitoring of the NVRAM and DRAM in the service processor that had been disabled.  The monitoring has been re-established and early warnings of service processor memory failure are logged with one of the following Predictive Error SRCs:  B151F107, B151F109, B151F10A, or B151F10D.
  • A  problem was fixed for an incorrect date in partitions created with a Simplified Remote Restart-Capable (SRR) attribute where the date is created as Epoch 01/01/1970 (MM/DD/YYYY).  Without the fix, the user must change the partition time of day when starting the partition for the first time to make it correct.  This problem only occurs with SRR partitions.
  • A problem was fixed for hypervisor task failures in adjunct partitions with a SRC B7000602 reported in the error log.  These failures occur during adjunct partition reboots for concurrent firmware updates but are extremely rare and require a re-IPL of the system to recover from the task failure.  The adjunct partitions may be associated with the VIOS or I/O virtualization for the physical adapters such as done for SR-IOV.
  • A problem was fixed for a shortened "Grace Period" for "Out of Compliance" users of a Power Enterprise Pool (PEP).   The "Grace Period" is short by one hour, so the user has one less hour to resolve compliance issues before the HMC disallows any more borrowing of PEP resources.  For example, if the "Grace Period" should have been 48 hours as shown in the "Out of Compliance" message, it really is 47 hours in the hypervisor firmware.  The borrowing of PEP resources is not a common usage scenario.  It is most often found in Live Partition Mobility (LPM) migrations where PEP resources are borrowed from the source server and loaned to the target server.
  • A problem was fixed for an AIX or Linux partition failing with a SRC B2008105 LP 00005 on a re-IPL after a dump (firmware assisted or error generated dump) following a Live Partition Mobility (LPM) migration operation.  The problem does not occur if the migrated partition completes a normal IPL after the migration.
  • A problem was fixed for intermittent long delays in the NX co-processor for asynchronous requests such as NX 842 compressions.  This problem was observed for AIX DB2 when it was doing hardware-accelerated compressions of data but could occur on any asynchronous request to the NX co-processor.
  • A problem was fixed for transmit time-outs on a Virtual Function (VF) during stressful network traffic, on systems using PCIe adapters in Single Root I/O Virtualization (SR-IOV) shared-mode.  This fix updates adapter firmware to 10.2.252.1918, for the following Feature Codes: EN15, EN16, EN17, EN18, EN0H, EN0J, EL38, EN0M, EN0N, EN0K, EN0L, and EL3C.
    The SR-IOV adapter firmware level update for the shared-mode adapters happens under user control to prevent unexpected temporary outages on the adapters.  A system reboot will update all SR-IOV shared-mode adapters with the new firmware level.  In addition, when an adapter is first set to SR-IOV shared mode, the adapter firmware is updated to the latest level available with the system firmware (and it is also updated automatically during maintenance operations, such as when the adapter is stopped or replaced).  And lastly, selective manual updates of the SR-IOV adapters can be performed using the Hardware Management Console (HMC).  To selectively update the adapter firmware, follow the steps given at the IBM Knowledge Center for using HMC to make the updates:   https://www.ibm.com/support/knowledgecenter/HW4M4/p8efd/p8efd_updating_sriov_firmware.htm.
    Note:  Adapters that are capable of running in SR-IOV mode, but are currently running in dedicated mode and assigned to a partition, can only be updated concurrently by the OS that owns the adapter.
  • A security problem was fixed in OpenSSL for a possible service processor reset on a null pointer de-reference during SSL certificate management. The Common Vulnerabilities and Exposures issue number is CVE-2016-0797.
  • A problem was fixed for missing dumps for  service processor failures during firmware updates.
  • A problem was fixed for a service processor failure during a system power off that causes a reset of the service processor.  The service processor is in the correct state for a normal system power on after the error.  The frequency for this error should be low as it is caused by a very rare race condition in the power off process.
  • A problem was fixed for a processor hang where the error recovery was not guarding the failing processor.  The failure causes an SRC B111E540 to be logged with Signature Description of "ex(n0p3c1) (COREFIR[55]) NEST_HANG_DETECT: External Hang detected".  With the fix, the failing processor FRU is called out and guarded so that the error does not re-occur when the system is re-IPLed.
  • A problem was fixed for a sequence of two or more Live Partition Mobility migrations that caused a partition to crash with a SRC BA330000 logged (Memory allocation error in partition firmware).  The sequence of LPM migrations that can trigger the partition crash are as follows:
    The original source partition level can be any FW760.xx, FW763.xx, FW770.xx, FW773.xx, FW780.xx, or FW783.xx P7 level or any FW810.xx, FW820.xx, FW830.xx, or FW840.xx P8 level.  It is migrated first to a system running one of the following levels:
    1) FW730.70 or later 730 firmware or
    2) FW740.60 or later 740 firmware
    And then a second migration is needed to a system running one of the following levels:
    1) FW760.00 - FW760.20 or
    2) FW770.00 - FW770.10
    The twice-migrated system partition is now susceptible to the BA330000 partition crash during normal operations until the partition is rebooted.  If an additional LPM migration is done to any firmware level, the thrice-migrated partition is also susceptible to the partition crash until it is rebooted.
    With the fix applied, the susceptible partitions may still log multiple BA330000 errors but there will be no partition crash.  A reboot of the partition will stop the logging of the BA330000 SRC.
  • A problem was fixed for the Advanced System Management Interface "Network Services/Network Configuration" "Reset Network Configuration" button that was not resetting the static routes to the default factory setting.  The manufacturing default is to have no static routes defined so the fix clears any static routes that had been added.  A circumvention to the problem is to use the ASMI "Network Services/Network Configuration/Static Route Configuration" "Delete" button before resetting the network configuration.
  • A problem was fixed for a partial callout for a failed SPIVID (Serial Peripheral Interface Voltage Identification) interface on the power supply VRM (Voltage Regulator Module).  The SPIVID interface allows the processor to control its external voltage supply level, but if it fails, only the processor FRU (SCM) is called out and not the VRM.
    The system IPL will complete with a CEC drawer deconfigured.  The error log will only contain the processor but not the defective processor VRM.  Hostboot does not detect a SPIVID error, but fails on a SCOM operation to the processor chip.  The errors show up with SRC BCxx090F logged by Hostboot and word 7 containing  one of three error values for a SPIVID_SLAVE_PART callout:
    1) RC_SBE_SET_VID_TIMEOUT = 0x005ec1b2
    2) RC_SBE_SPIVID_STATUS_ERROR = 0x00902aac
    3) RC_SBE_SPIVID_WRITE_RETURN_STATUS_ERROR = 0x0045d3cd with HWP Error description : "Procedure: proc_sbe_setup_evid SPIVID Device did not return good status the Boot Voltage Write operation" and HWSV RC of BA24.
    Without the fix, replace both the identified SCM and the associated VRM.
  • A problem was fixed for the HMC Exchange FRU procedure for DVD drive with MTM 7226-1U3 and feature codes 5757/5762/5763 where it did not verify the DVD drive was plugged in at the end of the exchange procedure.  Without the fix,  the user must manually verify that the DVD drive is plugged in.
  • A problem was fixed for the Advanced System Management Interface (ASMI) incorrectly showing the Anchor card as guarded whenever any redundant VPD chip is guarded.

System firmware changes that affect certain systems

  • A problem was fixed for the service processor recovery from intermittent MAX31760 fan controller faults logged with SRC B1504804.  The fan controller faults caused an out of memory condition on the service processor, forcing it to reset and failover to the backup service processor with SRCs B181720D, B181E6E9,  and B182951C logged.  With the fix, the fan controller faults are handled without memory loss and the only SRC logged is B1504804 for each fan controller fault.
  • On systems with a PowerVM Active Memory Sharing (AMS) partition with AIX  Level 7.2.0.0 or later with Firmware Assisted Dump enabled, a problem was fixed for a Restart Dump operation failing into KDB mode.  If "q" is entered to exit from KDB mode, the partition fails to start.  The AIX partition must be powered off and back on to recover.  The problem can be circumvented by disabling Firmware Assisted Dump (default is enabled in AIX 7.2).
  • For a system partition with more than 64 cores, a problem was fixed for Live Partition Mobility (LPM)  migration operations failing with HSCL365C.  The partition migration is stopped because the platform detects a firmware error anytime the partition has more than 64 cores.
SC830_093_048 / FW830.22

06/28/16
Impact: Availability    Severity: SPE

Critical firmware update for FW830.21 (SC830_092) level systems

System IPLed with FW830.21:  A critical firmware update is required for all 9119-MME and 9119-MHE systems that have been IPLed with FW830.21 (SC830_092). The FW830.21 level can cause a failed IPL or a potential unplanned outage. If the server is already in production, then the customer should plan an outage at a convenient time to apply FW830.22 (SC830_093) or higher and IPL.

System had FW830.21 concurrently applied:  If firmware level FW830.21 was concurrently installed (i.e. system was NOT IPL'ed after installing the level) customers are not impacted by this issue provided they apply FW830.22 (SC830_093) or higher prior to next planned system reboot. NOTE: FW 830.22 can be applied concurrently.

System IPLed with any other version of Firmware:  If the current firmware level of the system is not FW830.21, the system is not exposed to this issue. Customers can install this level or later at the next scheduled update window.

To verify the firmware level installed on the server, select "Updates" from the left side of the HMC and place a check mark next to the server of interest.  Then select "View system information" from the bottom view and select "None - Display current values".  The Platform IPL Level will indicate the last level the system was booted on.  An HMC command line alternative is sketched below.
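The firmware levels can typically also be checked from the HMC command line.  The following is a sketch only; the managed system name is a placeholder and the attribute names shown in the output can vary by HMC release:

    lslic -m <managed-system> -t sys

The activated and installed levels reported in the output (including the platform IPL level, where the HMC release reports it) can be compared against FW830.21 (SC830_092) to determine whether the system has been exposed to this issue.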

System firmware changes that affect all systems

  • A problem was fixed for an intermittent failure in Hostboot during the system IPL resulting in SRCs BC70090F and BC8A1701 logged with a hardware procedure return code of "RC_PROC_BUILD_SMP_ADU_STATUS_MISMATCH".  The system terminates with a Terminate Immediate (TI) condition.  The system must be re-IPLed to recover.  The failure is very infrequent and was caused by a race condition introduced as part of clock card failure data collection procedure which has now been corrected.
SC830_092_048 / FW830.21

06/01/16
Impact: Availability    Severity: SPE

System firmware changes that affect all systems

  • Support was added for additional First Failure Data Capture (FFDC) data for processor clock failover errors, provided by creating daily clock status reports with SRC B150CCDA informational error logs.  This clock status SRC log is written into the Hardware Management Console (HMC) iqyylog.log as a platform error log (PEL) event.  The PEL event contains a dump of the clock registers.  If a processor clock failover with SRC B158CC62 occurs on the service processor, the iqyylog.log file on the HMC should be collected to help debug the clock problem using the B150CCDA data.
  • A problem was fixed for a missing error log when a clock card fails over to the backup clock card.  This problem causes loss of redundancy on the clock cards without a callout notification that there is a problem with the FRU.  If the fix is applied to a system that had a failed clock, that condition will not be known until the system is IPLed again, when an error log and callout of the clock card will occur if it is in a persisted failed state.
  • On systems using PowerVM firmware with dedicated processor partitions,  a problem was fixed for the dedicated processor partition becoming intermittently unresponsive. The problem can be circumvented by changing the partition to use shared processors.  This is a follow-on to the fix provided in 830.20 for a different issue for delays in dedicated processor partitions that were caused by low I/O utilization.
  • A problem was fixed for a secondary clock card (CCIN 6B49 ) failure on the system control unit (SCU) being called out as a local clock card (CCIN 6B2D) failure on the node with SRC B158E504.  For this failure to occur, the primary clock card on the SCU must have been previously failed and guarded.
SC830_086_048 / FW830.20

04/01/16
Impact: Availability    Severity: SPE

New features and functions

  • Support was added to the Advanced System Management Interface (ASMI) to be able to add an IPv4 static route definition for each ethernet interface on the service processor.  Using a static route definition, a Hardware Management Console (HMC) configured on a private subnet that is different from the service processor subnet is now able to connect to the service processor and manage the CEC.  A static route persists until it is deleted or until the service processor settings are restored to manufacturing defaults.  The static route is managed with the ASMI panel "Network Services/Network Configuration/Static Route Configuration" IPv4 radio button.  The "Add" button is used to add a static route (only one is allowed for each ethernet interface) and the "Delete" button is used to delete the static route.
  • Support was added to the Advanced System Management Interface (ASMI) to display the environmental info section of error logs in the "System Service Aids-> Error->Event logs" panel.  The following is an example of the information displayed:
    |------------------------------------------------------
    |                              Environmental Info      
    |------------------------------------------------------
    | Section Version          : 1                         
    | Sub-section type         : 0                        
    | Created by               : powr                                   
    | Genesis Record Time-Stamp: 03/12/2015 15:31:21
    | Genesis Corr-Resistance  : 4.687847
    | Genesis Ambient-Temp(C)  : 28.000000
    | Genesis Corrosion-Rate   : 0           
    |                                                       
    | Corrosion Rate Status    : 1             
    | Presence of UsrDataSec   : 1
    | Num Corrosion Readings   : 1        
    |                                                      
    | Daily Corr-Resistance    : 4.804206          
    | Daily Ambient-Tempr(C)   : 35.312500      
    | Daily Corrosion-Rate     : 12C                  
    |------------------------------------------------------

System firmware changes that affect all systems

  • A problem was fixed for a power fault on a single node with SRC 11002610 that terminates the multi-node system.  The problem can be circumvented by unplugging the failing node and the system will IPL.  With the fix, the failing node is guarded on the power fault and the rest of the system is able to IPL.
  • A problem was fixed for Advanced System Management Interface (ASMI) TTY to allow "admin" passwords to be greater than eight characters in length to be consistent with prior generations of the product.  The ASMI web interface works correctly for user "admin" passwords with no truncation in the length of the passwords.
  • A problem was fixed for the recovery of a failing PCI clock so that a failover to the backup PCI clock occurs without a node failing and being deconfigured.  Without the fix, the PCI clock does not behave as a redundant FRU and faults on it will cause the CEC to terminate.  A re-IPL of the CEC recovers it from the PCI clock error with the bad clock guarded so that the other PCI clock is used.
  • A problem was fixed for an intermittent IPL failure with SRC B181E6C7 for a deadlock condition when testing the clocks during the IPL.  The problem state can be recovered by doing another IPL.  The problem is triggered by an error in the IPL clock test causing an interrupt handler to switch to the redundant clock and deadlock.  With the fix, the clock fault is handled and the bad clock is guarded, with the IPL completing on the redundant clock.
  • A problem was fixed for a system IPL hang at C100C1B0 with SRC 1100D001 when the power supplies have failed to supply the necessary 12-volt output for the system.   The 1100D001 SRC was calling out the planar when it should have called out the power supplies.  With the fix, the system will terminate as needed and call out the power supply for replacement.  One mode of power supply failure that could trigger the hang is sync-FET failures that disrupt the 12-volt output.
  • A problem was fixed for recovery from PNOR flash memory corruption that causes the IPL to fail with SRC D143900C.  This is very rare and only has happened in IBM internal labs.  Without the fix, the service processor cannot correct the corruption in the PNOR.  If a system has the problem SRC and cannot IPL,  then that system must be disruptively firmware updated to apply the fix to be able to IPL again.
  • A problem was fixed for a PCIe3 I/O expansion drawer (#EMX0) not getting all error logs reported when its error log queue is full.  In the case where the error log queue is full with 16 entries, only one entry is returned to the hypervisor for reporting.  This error log truncation only occurs during periods of high error activity in the expansion drawer.
  • A problem was fixed for recovering from a misplug of the service processor FSI cables (U2-P1-C10-T2 and U1-P1-C9-T2) where the plug locations are reversed from what would be a proper connection.  Without the fix, the bad FSI connections cause the service processors to go to the service processor stop state.  With the fix applied, the error logs call out the bad cables so they can be repaired and the service processor remains in a working state.
  • A problem was fixed for hardware system dump collection after a hardware checkstop that was missing scan ring data.  This is a very infrequent problem caused by an error with timing in the multi-threaded dump collection process.  Until this fix is applied, the debug of some hardware dump problems may require doing multiple dump collections to get all the data.
  • A problem was fixed for an Advanced System Management Interface (ASMI) error that occurred when trying to display detail on a deconfigured Anchor Card VPD.  If the error log for the selected deconfiguration record had been deleted, it caused ASMI to core dump.  With the fix,  if the error log for deconfiguration record is missing, the error log details such as failing SRC for the deconfiguration record are returned as blank.
  • A problem was fixed for an On-Chip Controller error with SRC B1702AC4 that was logged as an unrecoverable error without hardware callouts.  This occurred when the slave OCC failed to receive any Analog Power Subsystem Sweep (APSS) data over a long time interval.  With the fix, if the OCC fails in the same manner, the error is predictive with hardware callouts in the error log.
  • A problem was fixed in the Advanced System Management Interface (ASMI) for a FRU exchange of a DVD where the DVD was not being powered off as needed for the exchange.  The missing power off of the FRU could cause a data read or write error if the DVD is in use when the DVD is removed.  With the fix, the ASMI deactivate DVD button turns off the DVD green power LED during the exchange procedure, so it is known when it is safe to continue with the exchange procedure steps and remove the DVD.
  • A problem was fixed so that error logs are now generated for thermal errors detected by the service processor.  Without the fix, thermal errors such as a temperature over the threshold will not get reported in the error log but higher fan speeds will be present as an indicator of the thermal problem.  Until the fix is applied, the error log and call home mechanism cannot be relied on to monitor for system thermal problems.
  • A problem was fixed for processor core checkstops that cause an LPAR outage but do not create hardware errors and service events.  The processor core is deconfigured correctly for the error.  This can happen if the hypervisor forces processor checkstops in response to excessive processor recovery.
  • A problem was fixed for the callout of a VPD collection fault and system termination with SRC 11008402 to include the 1.2vcs VRM FRU.  A power good fault for the 1.2 volt supply would be a primary cause of this error.  Without the fix, the VRM is missing in the callout list and only has the VPDPART isolation procedure.
  • A problem was fixed for excessive logging of the SRC 11002610 on a power good (pgood) fault when detected by the Digital Power Subsystem Sweep (DPSS).  Multiple pgood interrupts are signaled by the DPSS in the interval between the first pgood failure and the node power down.  A threshold was added to limit the number of error logs for the condition.
  • A problem was fixed for redundant logging of the SRC B1504804 for a fan failure, once every five seconds.  With the fix, the failure is logged only at the initial time of failure in the IPL.
  • A problem was fixed to speed up recovery for VPD collection time-out errors for PCIe resources in an I/O drawer logged with SRC 10009133 during concurrent firmware updates.  With the fix, the hypervisor is notified as soon as the VPD collection has finished so the PCIe resources can report as available.  Without the fix, there is a delay of as long as two hours for the recovery to complete.
  • A problem was fixed for a false unrecoverable error (UE) logged for B1822713 when an invalid cooling zone is found during the adjustment of the system fan speeds.  This error can be ignored as it does not represent a problem with the fans.
  • A problem was fixed for a processor clock failover error with SRC B158CC62 calling out all processors instead of isolating to the suspect processor.  The callout priority correctly has a clock and a procedure callout as the highest priority, and these should be performed first to resolve the problem before moving on to the processors.
  • A problem was fixed for loss of back-level protection during firmware updates if an anchor card has been replaced.  The Power system manufacturing process sets the minimum code level a system is allowed to have for proper operation.  If an anchor card is replaced, it is possible that the replacement anchor card is one that has the Minimum MIF Level (MinMifLevel) given as "blank", and this removes the system back-level protection.  With the fix, blanks or nulls on the anchor card for this field are handled correctly to preserve the back-level protection.  Systems that have already lost the back-level protection due to anchor card replacement remain vulnerable to an accidental downgrade of code level by operator error, so code updates to a lower level for these systems should only be performed under guidance from IBM Support.  The following command can be run from the Advanced System Management Interface (ASMI) to determine if the system has lost the back-level protection with the presence of "blanks" or ASCII 20 values for MinMifLevel:
    "registry -l cupd/MinMifLevel" with output:
    "cupd/MinMifLevel:
    2020202020202020 2020202020202020 [ ]
    2020202020202020 2020202020202020 [ ]"
  • A problem was fixed for a system checkstop caused by a L2 cache least-recently used (LRU) error that should have been a recoverable error for the processor and the cache.  The cache error should not have caused a L2 HW CTL error checkstop.
  • A problem was fixed that corrupted the Update Access Key (UAK) date to "1900".  The user should correct the UAK date, if needed, to allow the firmware update to proceed, by using the original UAK key for the system.  On the Management Console, enter the original update access key via the "Enter COD Code" panel.  Or on the Advanced System Management Interface (ASMI), enter the original update access key via the "On Demand Utilities/COD Activation" panel.
  • A problem was fixed for PCIe switch recovery to prevent a partition switch failure during the IPL with error logs for SRC B7006A22 and B7006971 reported.  This problem can occur when doing recovery for an informational error on the switch.  If this problem occurs, the partition must be restarted to recover the affected I/O adapters.
  • A problem was fixed to correct the error messages for early failures in the Live Partition Mobility (LPM) migration of a partition.  The management console might report an unrelated error such as "HSCLA27E The operation to lock the physical device location for target adapter" when the actual error might be insufficient available memory on the target CEC to run the migration.  With the fix, the correct error code is returned so there is enough information to correct the error and retry the migration.
  • A problem was fixed for a hypervisor task hang during a FRU exchange on the PCIe3 I/O expansion drawer (#EMX0) that requires the entire drawer to power off and power on again.  The activation phase for the power on may never complete if a very rare sequence of events occurs during the power on step.  The FRUs to exchange that would cause the expansion drawer to power off  and power on are the following:  midplane, I/O module, I/O module VRM, chassis management card (CMC), cable card, and active optical cable.
  • A problem was fixed for PCIe adapter hangs and network traffic error recovery during Live Partition Mobility (LPM) and SR-IOV vNIC (virtual ethernet adapter)  operations.  An error in the PCI Host Bridge (PHB) hardware can persist in the L3 cache and fail all subsequent network traffic through the PHB.  The PHB  error recovery was enhanced to flush the PHB L3 cache to allow network traffic to resume.
  • A problem was fixed for a network boot/install failure using bootp in a network with switches using the Spanning Tree Protocol (STP).  A network boot/install using lpar_netboot on the management console was enhanced to allow the number of retries to be increased.  If the user is not using lpar_netboot, the number of bootp retries can be increased using the SMS menus.  If the SMS menus are not an option, the STP in the switch can be set up to allow packets to pass through while the switch is learning the network configuration.
  • A problem was fixed for a hypervisor adjunct partition failed with "SRC B2009008 LP=32770" for an unexpected SR-IOV adapter configuration.  Without the fix, the system must be re-IPLed to correct the adjunct error.  This error is infrequent and can only occur if an adapter port configuration is being changed at the same time that error recovery is occurring for the adapter.
  • A problem was fixed for recovering from FSI interrupt overruns (too many FSI interrupts at one time that cause the service processor to go interrupt-bound and get stuck in a loop) that caused the service processor to go to a failed state with SRC B1817212 on systems with a single service processor.  On systems with redundant service processors, the failed service processor would get guarded with a B151E6D0 or B152E6D0 SRC depending on which service processor fails.  With the fix, the FSI interrupt generation is reset if a threshold is exceeded, allowing the service processor to continue normal processing.  The failure trigger is a rare hardware fault condition that does not persist in the service processor.
  • A problem was fixed for priority callouts for system clock card errors with SRC B158CC62.  These errors had high priority callouts for the system clock card and medium callouts for FRUs in the clock path.  With the fix, all callouts are set to medium priority as the clock card is not the most probable FRU to have failed but is just a candidate among the many FRUs along the clock path.
  • A problem was fixed for a degraded PCI link causing a processor core to be guarded if a non-cacheable unit (NCU) store time-out occurred with SRC B113E540 and PRD signature "(NCUFIR[9]) STORE_TIMEOUT: Store timed out on PB".  With the fix, the processor core is not guarded for the NCU error.  If this problem occurs and a core is deconfigured, clear the guard record and re-IPL to regain the processor core.  The solution for degraded PCI links is different from the fix for this problem, but a re-IPL of the CEC or a reset of the PCI adapters could help to recover the PCI links from their degraded mode.
  • A problem was fixed for an L2 cache error on the service processor that caused the service processor to reset or go to a failed state with SRC B1817212 on systems with a single service processor.  On systems with redundant service processors, the failed service processor would get guarded with a B151E6D0 or B152E6D0 SRC depending on which service processor fails.  With the fix, the L2 cache error is handled with single-bit correction and no error to the service processor, so it can continue normal processing.  The L2 cache data error that causes this failure is infrequent and the service processor requires its limit of three resets in fifteen minutes to be exceeded for the service processor to fail, so the service processor failure rate for this problem is low.
  • A problem was fixed for an incorrect reduction in FRU callouts for Processor Run-time Diagnostic (PRD) errors after a reference oscillator clock (OSCC) error has been logged.  Hardware resources are not called out and guarded as expected.  Some of the missing PRD data can be found in the secondary SRC of B181BAF5 logged by hardware services.  The callouts that PRD would have made are in the user data of that error log.
  • A problem was fixed for error recovery from failed Live Partition Mobility (LPM) migrations.  The recovery error is caused by a partition reset that leaves the partition in an unclean state with the following consequences:  1) A retry on the migration for the failed source partition may not be allowed; and 2) With enough failed migration recovery errors, it is possible that any new migration attempts for any partition will be denied.  This error condition can be cleared by a re-IPL of the system.  The partition recovery error after a failed migration is much more likely to occur for partitions managed by NovaLink but it is still possible to occur for Hardware Management Console (HMC) managed partitions.
  • A problem was fixed for a Qualys network scan for security vulnerabilities causing a core dump in the Intelligent Platform Management Interface (IPMI)  process on the service processor with SRC B181EF88.  The error occurs anytime the Qualys scan is run because it sends an invalid IPMI session id that should have been handled and discarded without a core dump.
  • A security problem was fixed in the lighttpd server on the service processor, where a remote attacker, while attempting authentication, could insert strings into the lighttpd server log file.  Under normal operations on the service processor, this does not impact anything because the log is disabled by default.  The Common Vulnerabilities and Exposures issue number is CVE-2015-3200.
  • A security problem was fixed in OpenSSL for a possible service processor reset on a null pointer de-reference during RSA PSS signature verification. The Common Vulnerabilities and Exposures issue number is CVE-2015-3194.
  • A problem was fixed to guard a failed processor core to allow the system to IPL.  The processor core chiplet FRU was failing to be called out and guarded on a RC_PMPROC_CHKSLW_ADDRESS_MISMATCH error and this prevented the system from being able to IPL.

System firmware changes that affect certain systems

  • On multi-node systems with a power fault, a problem was fixed for On-Chip Controller errors caused by the power fault being reported as predictive errors with SRC B1602ACB.  These have been corrected to be informational error logs.  If running without the fix, the predictive and unrecoverable errors logged for the OCC on loss of power to the node can be ignored.
  • On a multi-node system,  a problem was fixed for a power fault with SRC 11002610 having incorrect FRU callouts.  The wrong second FRU callout is made on nodes 2, 3, and 4 of a multi-node system.  Instead of calling out the processor FRU, the enclosure FRU is called out.  The first FRU callout is correct.
  • On PowerVM systems with dedicated processor partitions with low I/O utilization, a problem was fixed where the dedicated processor partition may become intermittently unresponsive.  The problem can be circumvented by changing the partition to use shared processors.
  • On systems where memory relocation (as done by using Live Partition Mobility (LPM)) and a partition reboot are occurring simultaneously, a problem that caused a system termination was fixed.  The potential for the problem existed between the active migration and the partition reboot.
  • On a system running an IBM i partition, a problem was fixed for a machine check incorrectly issued to an IBM i partition running 7.2 or later with 4K sector disks.  This problem only pertains to the IBM Power System S814 (8286-41A), S824 (8286-42A), E870 (9119-MME), and E880 (9119-MHE) models.
  • A problem was fixed that limited Virtual Functions (VFs) to a maximum of 50 on a single PCIe3 10GbE  adapter (feature codes #EN15, #EN16, #EN17, and #EN18; and CCINs 2CE3 and 2CE4) when 64 should have been allowed.  This problem only occurs for two of the SR-IOV capable slot locations in the Power Systems:  slot C4 in the PCIe3 I/O expansion drawer (#EMX0) and slot C7 in the Power System E850 (8408-E8E).
  • A problem was fixed for an extraneous PCIe switch SRC B7006A22 being called out when there is a valid PCIe  expansion drawer cable problem with SRC B7006A88 reported.  The callout for SRC B7006A22 should be ignored as the PCIe switch hardware is working for this case.
  • On a system with an AIX partition and a Linux partition, a problem was fixed for dynamically moving an adapter that uses DMA from the Linux partition to the AIX partition, which caused AIX to fail by going into KDB mode (0c20 crash).  The management console showed the following message for the partition operation:  "Dynamic move of I/O resources failed.  The I/O slot dynamic partitioning operation failed.".  The error was caused by Linux using 64K mappings for the DMA window and AIX using 4K mappings for the DMA window, causing incorrect calculations in AIX when it received the adapter.  Until the fix is applied, the adapters that use DMA should only be moved from Linux to AIX when the partitions are powered off.  This problem does not pertain to Power System S812L (8247-21L), S822L (8247-22L), and S824L (8247-42L) models.
  • A problem was fixed for a Live Partition Mobility migration failure of a time reference partition (TRP) to a FW830 system when setting partition hibernate capable "false".  This happens any time the TRP partition is attempted to be migrated.  To circumvent the problem, set the partition's Time Reference Property to disabled and retry the migration.
  • On systems with a partition using Active Memory Sharing (AMS), a problem was fixed for a Live Partition Mobility (LPM) migration of the AMS partition that can hang the hypervisor on the target CEC.  When an AMS partition migrates to the target CEC, a hang condition can occur after processors are resumed on the target CEC, but before the migration operation completes.  The hang will prevent the migration from completing, and will likely require a CEC reboot to recover the hung processors.  For this problem to occur, there needs to be memory page-based activity (e.g. AMS dedup or Pool paging) that occurs exactly at the same time that the Dirty Page Manager's PSR data for that page is being sent to the target CEC.
  • On systems with an invalid P-side or T-side in the firmware, a problem was fixed in the partition firmware Run-Time Abstraction Services (RTAS) so that system Vital Product Data (VPD) is returned at least from the valid side instead of returning no VPD data.  This allows AIX host commands such as lsmcode, lsvpd, and lsattr that rely on the VPD data to work to some extent even if there is one bad code side (illustrative invocations are sketched after this list).  Without the fix, all the VPD data is blocked from the OS until the invalid code side is recovered by either rejecting the firmware update or attempting to update the system firmware again.
  • On systems using PCIe adapters in SR-IOV mode, a problem was fixed for occasional B200F011 and B2009008 SRCs that can occur during an IPL, moving an adapter into SR-IOV mode, or with SR-IOV link up/down activity.
  • On systems using PCIe adapters in SR-IOV mode,  the following problems were addressed with a Broadcom Limited (formerly known as Avago Technologies and Emulex) adapter firmware update to 10.2.252.1905:  1) Eliminating virtual function (VF) transmit errors during VF resets and 2) Preventing  loss of legacy flow control when an adapter port is connected to a priority flow control (PFC) capable switch.
  • On systems with a AIX or Linux encapsulated state partitions, a problem was fixed for a Live Partition Mobility migration failure for the encapsulated state partitions.  The migration fails on the target CEC when the associated paging space needed to support the encapsulated state is not available.  Removing the "Encapsulated State" attribute from the partition would allow the migration to succeed.  However, removing this attribute can only be accomplished if the partition in the powered off state.  Encapsulated State partitions are needed for the remote restart feature.  An encapsulated state partition is a partition in which the configuration information and the persistent data are stored external to the server on persistent storage.  A partition that supports remote restart can be restarted remotely.  For more information on the remote start feature, refer to this IBM Knowledge Center link: http://www.ibm.com/support/knowledgecenter/P8DEA/p8efd/p8efd_lpar_general_props.htm
  • Support was added to eliminate the yearly Utility COD renewal on systems using Utility COD.  The Utility COD usage is already monitored to make sure systems are running within the prescribed threshold limit of unreported usage, so a yearly customer renewal is not needed to manage the Utility COD processor usage.
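    For the DMA window item above: a minimal, illustrative C sketch of why the same DMA window produces different mapping counts under 64K and 4K I/O page sizes.  The window size, page sizes, and variable names are examples only and do not reflect the actual hypervisor or OS data structures.

        #include <stdio.h>
        #include <stdint.h>

        /*
         * Illustrative only: the same DMA window size yields a different number of
         * I/O page mappings depending on the page size the owning OS assumes.  If
         * the mapping count is carried across the move without rescaling, the
         * receiving OS computes the wrong window geometry.
         */
        int main(void)
        {
            uint64_t window_bytes = 2ULL * 1024 * 1024 * 1024;  /* example 2 GB DMA window */
            uint64_t linux_page   = 64 * 1024;                   /* 64K mappings (Linux side) */
            uint64_t aix_page     = 4 * 1024;                    /* 4K mappings (AIX side) */

            printf("mappings at 64K pages: %llu\n", (unsigned long long)(window_bytes / linux_page));
            printf("mappings at 4K pages:  %llu\n", (unsigned long long)(window_bytes / aix_page));
            return 0;
        }
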
SC830_075_048 / FW830.11

11/11/15
Impact: Availability    Severity: HIPER

System firmware changes that affect all systems

  • HIPER/Pervasive:  A problem was fixed for recovering from embedded MultiMediaCard (eMMC) flash NAND errors that caused the service processor to go to a failed state with SRC B1817212 on systems with a single service processor.  On systems with redundant service processors, the failed service processor would get guarded with a B151E6D0 or B152E6D0 SRC depending on which service processor fails.
  • HIPER/Pervasive: A problem associated with workloads using transactional memory on PowerVM was discovered and is fixed in this service pack. The effect of the problem is non-deterministic but may include undetected corruption of data.
  • DEFERRED:  A problem was fixed for memory on-die termination (ODT) settings to improve the signal integrity of the memory channel.
  • A problem was fixed for recovery from unaligned addresses for MSI interrupts from PCIe adapters.  The recovery prevents an adapter timeout caused by resource exhaustion.  With the fix, the resources for each bad interrupt are returned, allowing the PCIe adapter to continue to run for the normal traffic.
  • A problem was fixed for an Operations Panel SRC of B1504804 with no FRU callout.  A callout of the failed hardware has been added.
  • A problem was fixed to prevent recoverable power faults of short duration from causing the system to lose power supply redundancy.  Without the fix, the faulted state persisted for the recovered power fault, causing a problem with a system power off if other power supplies were lost at a later time.
  • A problem was fixed for a PCIe3 I/O expansion drawer (#EMX0) link failure with SRC B7006A8B.  The settings for the continuous time linear equalizers (CTLE) were adjusted to improve the incoming signal strength and the stability of the links.  The expansion drawer must be power cycled, or the CEC re-IPLed, for the fix to activate.
  • A problem was fixed for recovery from a processor local bus (PLB) hang on the service processor.  The errant PLB hang recovery would be seen in concurrent firmware updates that, on rare occasions, fail to do a side switch to activate the new level of firmware.  On the management console, the error message would be "HSCF010180E Operation failed ... E302F873 is the error code."  Other than the failed code level activation, the firmware update is successful.  If this problem occurs, the system can be set to the new firmware level by doing a power off from the management console and then doing a power on with side switch selected in the advanced properties.

System firmware changes that affect certain systems

  • A problem was fixed for the System Feature Code for the E870 (9119-MME) being displayed as "EPBB" by IBM i "DSPSYSVAL QPRCFEAT"  when it should be "EPBA".  This created a problem for certain IBM i software packages whose license was tied to the System Feature Code.  This fix has a concurrent activation.  For FW830.10, a similar, non-concurrent fix for the feature codes was made but the System Feature Code, as seen in IBM i  partitions, did not update immediately.
SC830_068_048 / FW830.10

09/10/15
Impact: Availability    Severity: HIPER

New features and functions

  • The firmware code update process was enhanced with a feature to block a firmware "downgrade" to a level that is below the system's manufactured code level.

System firmware changes that affect all systems

  • HIPER/Pervasive: DEFERRED:  A problem was fixed for a TCP/IP performance degradation on PCIe ethernet adapters with Remote Direct Memory Access (RDMA) over Converged Ethernet (RoCE).  By adjusting the system memory caching, a significant improvement was made to the data throughput speed to restore performance to expected levels.  This fix requires a system re-IPL to take effect.  This problem affects the E850 (8408-E8E), E870 (9119-MME), and E880 (9119-MHE) systems.
  • HIPER/Pervasive:  A problem was fixed for an ethernet adapter hanging on the service processor.  This hang prevents TCP/IP network traffic from the management console and the Advanced System Management Interface (ASMI) browsers.  It makes it appear as if the service processor is unresponsive and can be confused with a service processor in the stopped state.  An A/C power cycle would recover a hung ethernet adapter.
  • HIPER/Pervasive:  A problem was fixed for missing the interrupts for processor local bus (PLB) time-outs.  This problem could hang the service processor or cause it to panic with a reset/reload of the service processor.  There is a possibility the reset of the service processor could take it to a stopped state where the service processor would be unresponsive.  In the service processor stopped state, any active partitions will continue to run but they will not be able to be managed by the management console.  The partitions can be allowed to run until the next scheduled service window at which time the service processor can be recovered with an AC power cycle or a pin-hole reset from the operator panel.
  • HIPER/Pervasive:  A problem was fixed for a system reset to clear the boot registers to prevent the reset from being mishandled as a chip reset.   If a "system reset" is misinterpreted as a "chip reset", the boot of the service processor can go inadvertently to a stopped state and be unresponsive.  Pin-hole resets from the operations panel could also fail to the service processor stopped state.  In the service processor stopped state, any active partitions will continue to run but they will not be able to be managed by the management console.  The partitions can be allowed to run until the next scheduled service window at which time the service processor can be recovered with an AC power cycle or a pin-hole reset from the operator panel.
  • HIPER/Pervasive:  A problem was fixed so a corrupted file system partition table can be recovered and not have the service processor lose the ability to do P and T-side switches.  In error recovery situations, the loss of the side-switch option could present itself as an unresponsive service processor if it was needed to prevent a failure to the service processor stopped state.
  • HIPER/Pervasive:  A problem was fixed for a runaway interrupt request (IRQ) condition that caused the service processor to go to a stopped state.  In the service processor stopped state, any active partitions will continue to run but they will not be able to be managed by the management console.  The partitions can be allowed to run until the next scheduled service window at which time the service processor can be recovered with an AC power cycle or a pin-hole reset from the operator panel.
  • HIPER/Pervasive:  A problem was fixed for a dump partition full condition that caused the service processor to go to a stopped state.  In the service processor stopped state, any active partitions will continue to run but they will not be able to be managed by the management console.  The partitions can be allowed to run until the next scheduled service window at which time the service processor can be recovered with an AC power cycle or a pin-hole reset from the operator panel.
  • DEFERRED:  A problem was fixed for a PCIe3 I/O expansion drawer (#EMX0) link failure with SRC B7006A8B.  Data packet send retries were increased and link recovery was enabled to improve the stability of the links.  The CEC must be re-IPLed for the fix to activate.
  • A problem was fixed for a SRC 11002613 logged during a concurrent repair of a power supply.  This SRC was erroneously logged and did not represent a real problem.
  • A problem was fixed for an intermittent SRC B1504804 logged on a re-IPL of the CEC that did not result in an IPL failure.
  • A problem was fixed for the capture of the registers for the Hostboot Self-Boot Engine (SBE) for SBE failures.  These registers had been missing from failure data for SBE failures, making these problems more difficult to debug.
  • A problem was fixed to remove an unnecessary delay in the system IPL to reduce the time needed to IPL by 30 seconds.
  • A problem was fixed for an unneeded error log with SRC B181DB04 that occurred in a failed IPL for a normal condition of lost PNOR flash access after a reIPL process had started and taken over the access.
  • A problem was fixed for an Advanced System Management Interface (ASMI) error message of "Error in function 'connect', error code 111" when a browser attempted to connect before the service processor was ready.  The browser connection through the web server is now held off until the ASMI process is ready after a reset of the service processor or an AC power cycle of the system.
  • A problem was fixed for an incorrect call home for SRC B1818A0F.  There was no real problem so this call home should have been ignored.
  • A problem was fixed for a dump reIPL that failed with SRC B1818601 and B181460B after processor checkstops had terminated the system.
  • A problem was fixed for an infrequent service processor database corruption during concurrent firmware update that caused the system to terminate.
  • A problem was fixed for a failed PCI oscillator that was not guarded, causing repeated errors with SRC B15050A6 and B158E504 logged on each IPL of the system.
  • A problem was fixed for a local clock card (LCC)  failure with SRC 11001515 that was missing a part number and location code.  This information has been added for LCC faults so the FRU to replace is properly identified.
  • A problem was fixed for a defective PCI oscillator in the local clock card (LCC) with SRC BC58090F that caused an IPL failure for the node instead of failing over to the redundant LCC.
  • A problem was fixed for a service processor dump with error logs  B181E911 and B181D172 during an IPL.  The error logs were for the detection of defunct processes but otherwise the IPL was successful.
  • A problem was fixed for Digital Power Subsystem Sweep (DPSS) firmware updates that caused an error log with SRC B1819906 but otherwise was successful.
  • A problem was fixed for missing Keyword (KW) and Resource ID (RID) for SRC B181A40F.
  • A problem was fixed for an I2C bus lock error during a CEC power off that caused a ten minute delay for the power off and error log SRCs B1561314 and B1814803 with error number (errno) 3E.
  • A problem was fixed for the System Feature Code for the E870(9119-MME) being displayed as "EPBB" by IBM i "DSPSYSVAL QPRCFEAT"  when it should be "EPBA".  This created a problem for certain  IBM i software packages whose license was tied to the System Feature Code.  The System Feature Code, as seen in IBM i  partitions, does not update immediately with concurrent activation of the fix pack, but it will eventually change to the correct "EPBA" value within 24 hours.  If it is necessary to see the new System Feature Code value immediately,  a re-IPL of the system is needed.
  • A problem was fixed for concurrent firmware updates to a system that needed to be re-IPLed after getting a B113E504 SRC during activation of the new firmware level on the hypervisor.  The code update activation failed if the Sleep Winkle (SLW) images were significantly different between the firmware levels.  The SLW contains the state of the processor and cache so it can be restored after sleep or power saving operations.
  • A problem was fixed for System Power Control Network (SPCN) failover for an I/O module A/C power fault on the PCIe3 I/O expansion drawer (#EMX0).  A sideband failure on one I/O module was blocking SPCN commands for the entire drawer instead of SPCN failing over to a working I/O module.  The broken SPCN communications path prevented concurrent maintenance operations on the expansion drawer.
  • A problem was fixed for a possible lack of recovery for an A/C power loss condition on the PCIe3  I/O expansion drawer (#EMX0).   If there was an outstanding problem on the expansion drawer and an A/C loss occurred while the earlier error was still unprocessed, the auto-recovery for the A/C power loss would not have happened.
  • A problem was fixed for a missing FRU call out for error SRC B7006A87  when unable to read the drawer module logical flash VPD for the PCIe3 I/O expansion drawer (#EMX0).
  • For a partition that has been migrated with Live Partition Mobility (LPM) from FW730 to FW740 or later, a problem was fixed for a Main Storage Dump (MSD) IPL failing with SRC B2006008.  The MSD IPL can happen after a system failure and is used to collect failure data.  If the partition is rebooted anytime after the migration, the problem cannot happen.  The potential for the problem existed between the active migration and a partition reboot.
  • A problem was fixed for partial loss of Entitlement for On/Off Memory Capacity On Demand (also called Elastic COD).  Users with large amounts of Entitlement on the system, greater than "65535 GB * Days", could have had a truncation of the Entitlement value on a re-IPL of the system (see the truncation sketch following this list).  To recover lost Entitlement, the customer can request another On/Off Enablement Code from IBM support to "re-fill" their entitlement.
  • A problem was fixed for a management console command line failure with a return code 0x40000147 (invalid lock state) when trying to delete SR-IOV shared mode configurations.  This could have occurred if the adapter slot had been re-purposed without involvement of the management console and was owned and operational at the time of the requested delete.  With the fix, the current ownership of the slot is honored and only the SR-IOV shared mode configuration data is deleted on the force delete.
  • A problem was fixed for an  incorrect restriction on the amount of "Unreturned"  resources allowed for a Power Enterprise Pool (PEP).  PEP allows for logical moving of resources (processors and memory) from one server to another.  Part of this is 'borrowing' resources from one server to move to another. This may result in "Unreturned" resources on the source server. The management console controls how many total "Unreturned" PEP resources can exist.  For this problem,  the user had some "Unreturned" PEP memory and asked to borrow more but this request was incorrectly refused by the hypervisor.
  • A problem was fixed for a PCIe3 I/O expansion drawer (#EMX0) error with SRCs  B7006A82 and B7004137 for a missing FRU location code.  The FRU location code for the Active Optical Cable (AOC)  was added to identify the failing drawer side.
  • A problem was fixed for a PCIe3 I/O expansion drawer (#EMX0)  failing to IPL when the IPL includes a FPGA update for the drawer.  The FPGA update is actually good but perceived as a failure when the FPGA resets as part of the update.  For the problem, a re-IPL of the system would have fixed the drawer.
  • A problem was fixed for Live Partition Mobility (LPM) to prevent a memory access error during LPM operations with unpredictable effects.  When data is moved by LPM, the underlying firmware code requires that the buffers be 4K aligned.  The fix now forces the buffers to be 4K aligned, and if there is still an alignment issue, the LPM operation will fail without impacting the system (see the alignment sketch following this list).
  • A problem was fixed for an On-Chip Controller (OCC) failure after a system dump with SRCs B18B2616 and BC822024 reported.  This resulted in the system running with reduced performance in safe mode, where processor clock frequencies are lowered to minimum levels to avoid hardware errors since the OCC is not available to monitor the system.   A re-IPL of the system would have resolved the problem.
  • A  performance problem was fixed for systems entering processor hang recovery prematurely with SRC B111E504 and PBCENTFIR(9) "PB_CENT_HANG_RECOV".  The ability of the L3 cache to prefetch memory was extended to speed the memory accesses and prevent a processor hang condition for applications running with lower memory affinity.
  • A problem was fixed for a processor error causing a Hostboot terminate instead of a deconfiguration of the bad hardware and continuation of the IPL.  The state of the processors was synchronized between the service processor and the Hostboot process to correct the error.
  • A problem was fixed for a USB Save and Restore of machine configuration to not lose the system name.
  • A problem was fixed for Advanced System Management Interface (ASMI) help text for menu "I/O Adapter Enlarged Capacity" being missing with the system IPLed and partitions running.  The help text is now available for the system in the powered on state as well as in the powered off state.
  • A problem was fixed for an intermittent power supply error SRC 1100D008 with a flood of VPD SRC B1504804 with errno 3Es logged on a re-IPL of the CEC that did not result in an IPL failure.
  • A problem was fixed for an LED intermittently not lighting for an enclosure with a fault.
  • A problem was fixed for an intermittent PSI link error with SRC B15CDA27 after a firmware update or reset/reload of the service processor.
  • A problem was fixed for PCIe3 adapters failing when requesting more than 32 Message Signaled Interrupts (MSI-X).  The adapter may fail to ping or cause OS tasks to hang that are using the adapter.  This problem was found specifically on the 10 Gb Ethernet-SR (Short Range) PCIe3 adapter with feature codes #5275 and #5769 and on the 56 Gb Infiniband (IB) Fourteen Data Rate (FDR) adapter with feature codes #EC32, #EC33, #EL3D, and #EL50 and CCIN 2CE7.  However, other PCIe adapters may also be affected.
  • A problem was fixed for IBM copyright statements being displayed on the System Management Services (SMS) menu after a repair or replacement of system hardware.
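    For the On/Off (Elastic) COD Entitlement item above: a minimal, hypothetical C sketch of how a total above 65535 GB * Days wraps if it is persisted in a 16-bit field.  The 16-bit storage is an assumption inferred from the 65535 limit quoted in the item, not a description of the actual hypervisor representation.

        #include <stdio.h>
        #include <stdint.h>

        int main(void)
        {
            uint32_t entitlement_gb_days = 70000;               /* example total > 65535 */
            uint16_t persisted = (uint16_t)entitlement_gb_days; /* assumed 16-bit persistence */

            /* 70000 modulo 65536 is 4464, so most of the entitlement appears lost */
            printf("before re-IPL: %u GB*Days\n", (unsigned)entitlement_gb_days);
            printf("after re-IPL:  %u GB*Days\n", (unsigned)persisted);
            return 0;
        }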
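    For the LPM buffer alignment item above: a minimal C sketch of the standard round-up-to-4K computation that the fix enforces on data buffers.  The function and variable names are illustrative, not the firmware's internal interfaces.

        #include <stdio.h>
        #include <stdint.h>

        #define PAGE_4K 4096ULL

        /* Round an address up to the next 4K boundary (no change if already aligned). */
        static uint64_t align_up_4k(uint64_t addr)
        {
            return (addr + PAGE_4K - 1) & ~(PAGE_4K - 1);
        }

        int main(void)
        {
            uint64_t buf = 0x100123;   /* example unaligned buffer address */
            uint64_t aligned = align_up_4k(buf);

            printf("0x%llx -> 0x%llx (aligned: %s)\n",
                   (unsigned long long)buf, (unsigned long long)aligned,
                   (aligned % PAGE_4K == 0) ? "yes" : "no");
            return 0;
        }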

System firmware changes that affect certain systems

  • HIPER/Pervasive:  For partitions with a graphics console and USB keyboard, a problem was fixed for an OS boot hang at the CA00E100 progress SRC.  For the problem, the hang can be avoided by issuing the boot command from the Open Firmware (OF) prompt.
  • HIPER/Pervasive:  On systems using PowerVM with shared processor partitions that are configured as capped or in a shared processor pool, there was a problem found that delayed the dispatching of the virtual processors which caused performance to be degraded in some situations.  Partitions with dedicated processors are not affected.   The problem is rare and can be mitigated, until the service pack is applied, by creating a new shared processor AIX or Linux partition and booting it to the SMS prompt; there is no need to install an operating system on this partition.  Refer to help document http://www.ibm.com/support/docview.wss?uid=nas8N1020863 for additional details.
  • DEFERRED:  A problem was fixed for Non-Volatile Memory express (NVMe) adapters, plugged into PCIe3 switches, mis-training to generation 1 instead of generation 3.   NVMe adapters attached directly to the PCIe3 slots trained correctly to the generation 3 specification. This fix requires a re-IPL of the system to correct the training of any mis-trained adapters.
  • On multiple-node systems, a problem was fixed for a missing location code, part, and serial number for a faulty symmetric multiprocessing (SMP) cable in the call home B1504922 error log.
  • On multiple-node systems, a problem was fixed for a two hour IPL hang in HostBoot caused by multiple B18ABAAB errors from more than one node.  The Hostboot process failed to go into its reconfiguration loop to do error recovery and continue the IPL.
  • On a system with redundant service processors,  a problem was fixed for an IPL failure for a bad service processor cable on the primary service processor with SRCs B1504904 and B18ABAAB logged.  The system should have done an error failover to the backup service processor and continued the IPL to get the partitions running.
  • On a system with redundant service processors where redundancy is disabled, a problem was fixed for an unrecoverable (UE) SRC B181DA19 being logged on a re-IPL after a checkstop error.  The error log did not interfere with the reIPL which was successful.
  • On multiple-node systems, a problem was fixed for extraneous error logs after a 12V power fault.  After termination, there were additional 110026Bx error log entries that should have been ignored.
  • On a system with redundant service processors, a problem was fixed for the isolation procedures for an Anchor card error and system VPD collection failure with termination SRC B181A40F.  FSPSP04 and FSPSP06 are no longer called out as part of reporting the VPD collection failure.  FSPSP30 has been updated with isolation steps for this problem and is called out and should be used for the problem isolation.  Retain tip H213935 also provides the FRU isolation steps.  Procedure FSPSP30 tries to replace the service processor first.  If that does not work, then the procedure has the Anchor card replaced.
  • On multiple-node systems, a problem was fixed to isolate a power fault during IPL to the specific node and guard the node, and allow the rest of the system to IPL.  Previously, the power fault would not be localized to the problem node and it caused the IPL of all the nodes of the system to fail.
  • On a system with redundant service processors, a problem was fixed for failovers to the backup service processor that caused an On-Chip Controller (OCC) abort.  This placed the CEC in a "safe" mode where it ran at reduced processor clock frequencies to prevent exceeding the power limits while not under OCC control.
  • On a system with an IBM i partition using Active Memory Sharing (AMS),  a problem was fixed for internal memory management errors caused by deleting an IBM i partition that had been powered off in the middle of a Main Storage Dump (MSD).  Until the fix is installed, if an MSD is interrupted for an IBM i partition that has AMS, the partition should be powered on and powered off normally before a delete of the partition is done to prevent errors with unpredictable effects.  This problem does not affect the S822 (8284-22A), S812L (8247-21L), S822L (8247-22L), S824L (8247-42L), and E850 (8408-E8E) models.
  • On a system with redundant service processors, a problem was fixed for a failover to the backup service processor during a power off of the CEC that caused a hypervisor time-out with SRC B182953C.  This error was caused by a delay in synchronizing the state of the hypervisor to the backup service processor but it did not prevent the power off from completing successfully.
  • On a system with redundant service processors, a problem was fixed for a firmware update causing an error log server dump with SRC B1818601.  The error log server restarted automatically to recover from the error and the firmware update was successful.
SC830_048_048 / FW830.00

06/08/15
Impact:  New      Severity:  New

New Features and Functions

NOTE:
  • POWER8 (and later) servers include an “update access key” that is checked when system firmware updates are applied to the system.  The initial update access keys include an expiration date which is tied to the product warranty. System firmware updates will not be processed if the calendar date has passed the update access key’s expiration date, until the key is replaced.  As these update access keys expire, they need to be replaced using either the Hardware Management Console (HMC) or the Advanced System Management Interface (ASMI) on the service processor.  Update access keys can be obtained via the key management website: http://www.ibm.com/servers/eserver/ess/index.wss.
  • Support for Little Endian (LE) Linux in PowerVM.  With PowerVM LE guest support, all three Linux on Power distribution partners (SUSE, Canonical, and Red Hat) with LE operating systems can run on the same IBM Power Systems.
  • Support for allowing the PowerVM hypervisor to continue to run after the service processor has become unresponsive with a SRC B1817212.  Any active partitions will continue to run but they will not be able to be managed by the management console.  The partitions can be allowed to run until the next scheduled service window at which time the service processor can be recovered with an AC power cycle or a pin-hole reset from the operator panel.  This error condition would only be seen on a system that had been running with a single service processor (no redundancy for the service processor).
  • Support for three and four node configurations of the E880 (9119-MHE) system.
  • Support for an increase of the maximum number of PCIe3 I/O expansion drawers (#EMX0) that can be attached to an E870/E880 node from two to four.
  • Support for Single Root I/O Virtualization (SR-IOV) that enables the hypervisor to share a SR-IOV-capable PCI-Express adapter across multiple partitions. Twelve ethernet adapters are supported with the SR-IOV NIC capability, when placed in the P8 system  (SR-IOV supported in both native mode and through VIOS):
    - PCIe3 4-port 10GbE SR Adapter (F/C EN15 and CCIN 2CE3)
    - PCIe3 4-port 10GbE SR Adapter (F/C EN16 and CCIN 2CE3).  Fits E870/E880 system node PCIe slot.
    - PCIe3 4-port 10GbE SFP+ Copper Adapter (F/C EN17 and CCIN 2CE4)
    - PCIe3 4-port 10GbE SFP+ Copper Adapter (F/C EN18 and CCIN 2CE4).  Fits E870/E880 system node PCIe slot.
    - PCIe2 4-port (10Gb FCoE & 1GbE) SR and RJ45 SFP+ Adapter (F/C EN0H and CCIN 2B93)
    - PCIe2 LP 4-port (10Gb FCoE & 1GbE) SR and RJ45 SFP+ Adapter (F/C EN0J and CCIN 2B93)
    - PCIe2 LP Linux 4-port (10Gb FCoE & 1GbE) SR and RJ45 SFP+ Adapter (F/C EL38 and CCIN 2B93)
    - PCIe2 4-port (10Gb FCoE & 1GbE) LR and RJ45 Adapter (F/C EN0M and CCIN 2CC0)
    - PCIe2 LP 4-port (10Gb FCoE & 1GbE) LR and RJ45 Adapter (F/C EN0N and CCIN 2CC0)
    - PCIe2 4-port (10Gb FCoE & 1GbE) SFP+ Copper and RJ45 Adapter (F/C EN0K and CCIN 2CC1)
    - PCIe2 LP 4-port (10Gb FCoE & 1GbE) SFP+ Copper and RJ45 Adapter (F/C EN0L and CCIN 2CC1)
    - PCIe2 LP Linux 4-port (10Gb FCoE & 1Gb Ethernet) SFP+ Copper and RJ45 Adapter (F/C EL3C and CCIN 2CC1)
    These adapters each have four ports, and all four ports are enabled with SR-IOV function. The entire adapter (all four ports) is configured for SR-IOV or none of the ports is.
    System firmware updates the adapter firmware level on these adapters to 10.2.252.16 when a supported adapter is placed into SR-IOV mode.
    Support for SR-IOV adapter sharing is now available for adapters in the PCIe3 I/O Expansion Drawer with F/C #EMX0.
    SR-IOV NIC on the Power P8 systems is supported by:
        - AIX 6.1 TL9 SP4 and APAR IV63331, or later
        - AIX 7.1 TL3 SP4 and APAR IV63332, or later
        - IBM i 7.1 TR8, or later (Supported on S824/S814)
        - IBM i 7.2  or later  (Supported on S824/S814)
        - IBM i 7.1 TR9, or later (Supported on E870/E880)
        - IBM i 7.2 TR1, or later  (Supported on E870/E880)
        - Red Hat Enterprise Linux 6.5 or later (Supported on E870/E880/S812L/S822/S822L/S814/S824/S824L except for adapters with F/Cs EN15/EN16/EN17/EN18)
        - Red Hat Enterprise Linux 6.6, or later (Supported on E850 and minimum level needed for adapters with F/Cs EN15/EN16/EN17/EN18)
        - Red Hat Enterprise Linux 7.1, or later
        - SUSE Linux Enterprise Server 11 SP1 or later  (Supported on S812L/S822/S822L/S814/S824/S824L)
        - SUSE Linux Enterprise Server 11 SP3 or later  (Supported on E870/E880)
        - SUSE Linux Enterprise Server 12, or later  (Supported on E850)
        - Ubuntu 15.04 or later (Supported on E850/S812L/S822/S822L/S814/S824/S824L) 
        - VIOS 2.2.3.4 with interim fix IV63331, or later
  • Support for an upgrade from 8-core processors to 12-core processors for the E880 (9119-MHE) system.
  • Support for dynamically adjusting the input voltage of voltage regulators, based on regulator slave failures, to achieve the optimal voltage for system operation under normal and degraded conditions.
System firmware changes that affect all systems
  • A problem was fixed to eliminate unneeded guard data from call home messages for the cases where there is no hardware error in the system.
  • On systems with redundant service processors, a problem was fixed in the run-time error failover to the backup service processor so it does not terminate on FRU support interface (FSI) errors.  In the case of FSI errors on the new primary service processor, the primary will do a reset/reload instead of a terminate.
  • A problem was fixed to call home guarded FRUs on each IPL.  Only the initial failure of the hardware was being reported to the error log.
  • Support was added to the Advanced System Management Interface (ASMI) USB menu to allow a system dump to be collected to USB with the power on to the system.  This allows the dump to be collected with the system memory state intact.
  • A problem was fixed for the service processor error log handling that caused SRC B150BAC5 errors when converting an error log entry from an object into a flattened array of bytes.
  • A problem was fixed that prevented a second management console from being added to the CEC.  In some cases, network outages caused defunct management console connection entries to remain in the service processor connection table, making connection slots unavailable for new management consoles.  A reset of the service processor could be used to remove the defunct entries.
  • A problem was fixed to eliminate a false error log and call home for a SRC 1100154F fan fault caused by an unplugged power cable.
  • A problem was fixed for a highly intermittent IPL failure with SRC B18187D9 caused by a defunct attention handler process.  For this problem, the IPL will continue to fail until the service processor is reset.
  • A problem was fixed for missing FRU information in SRC 11001515.  SRC 11001515 was logged indicating replacement of power supply hardware, but did not include the location code, the part number, the CCIN, or the serial number.
  • A problem was fixed for systems with a corrupted date of "1900" showing for the Update Access Key (UAK).  The firmware update is allowed to proceed on systems with a bad UAK date because the fix is in an emergency service pack.  After the fix is installed, the user should correct the UAK date, if needed, by using the original UAK key for the system.  On the Management Console,  enter the original update access key via the "Enter COD Code" panel. Or on the Advanced System Manager Interface (ASMI),  enter the original update access key via the "On Demand Utilities/COD Activation" panel.
  • A problem with concurrent PCIe adapter maintenance was fixed that caused On-Chip Controller (OCC) resets with SRCs logged of B18B2616 and BC822029, forcing the system into safe mode (processor voltage/frequency reduced to a "safe" level where thermal monitoring is not required).  Recovery from safe mode requires a system re-IPL.
  • A problem was fixed for I/O adapters so that BA400002 errors were changed to informational for memory boundary adjustments made to the size of DMA map-in requests.  These DMA size adjustments were marked as UE previously for a condition that is normal.
System firmware changes that affect certain systems
  • On systems in PowerVM mode, a problem was fixed for unresponsive PCIe adapters after a partition power off or a partition reboot.
  • On systems using Virtual Shared Processor Pools (VSPP), a problem was fixed for an inaccurate pool idle count over a small sampling period.
  • On systems with partitions using shared processors, a problem was fixed that could result in latency or timeout issues with I/O devices.
  • On systems using PowerVM,  a problem was fixed for a hypervisor deadlock that results in the system being in an "Incomplete" state as seen on the management console.  This deadlock is the result of two hypervisor tasks using the same locking mechanism for handling requests between the partitions and the management console.  Except for the loss of the management console control of the system, the system is operating normally when the "Incomplete" state occurs.
  • On systems with memory mirroring enabled, a problem was fixed for PowerVM over-estimating its memory needs, allowing more memory to be used by the partitions.
  • On systems using PowerVM, a problem was fixed for the handling of the error of multiple cache hits in the instruction effective-to-real address translation cache (IERAT).  A multi-hit IERAT error was causing system termination with SRC B700F105.  The multi-hit IERAT is now recognized by the hypervisor and reported to the OS where it is handled.
  • On systems using PowerVM, a problem was fixed to allow booting off an iSCSI device.  For the failure, the partition firmware error logs had SRC BA012010 "Opening the TCP node failed." and SRC BA010013 "The information in the error log entry for this SRC provides network trace data."  The open firmware standard output trace showed SRC BA012014  "The TCP re-transmission count of 8 was exceeded. This indicates a large number of lost packets between this client and the boot or installation server" followed by SRC BA012010.
  • On systems using PowerVM, support was added for USB 2.0 HUBs so that a keyboard plugged into the USB 2.0 HUB will work correctly at the SMS menus.  Previously, a keyboard plugged into a USB 2.0 HUB was not a recognized device.



SC820
For Impact, Severity and other Firmware definitions, Please refer to the below 'Glossary of firmware terms' url:
http://www14.software.ibm.com/webapp/set2/sas/f/power5cm/home.html#termdefs
SC820_103_047 / FW820.50

09/26/16
Impact:  Availability      Severity:  SPE

New Features and Functions

  • Support was added to protect the service processor from booting on a level of firmware that is below the minimum MIF level.  If this is detected, a SRC B18130A0 is logged.  A disruptive firmware update would then need to be done to the minimum firmware level or higher.  This new support has no effect on the system being updated with the service pack but has been put in place to provide an enhanced firmware level for the IBM field stock service processors.
  • The certificate store on the service processor has been upgraded to include the changes contained in version 2.6 of the CA certificate list published by the Mozilla Foundation at the mozilla.org website as part of the Network Security Services (NSS) version 3.21.
  • Support was added for systems to be able to automatically convert permanently activated resources (processor and memory) to Mobile CoD resources for use in a Power Enterprise Pool (PEP).  The ability to do a CoD resource license conversion requires a minimum HMC level of V8R8.4.0 or later.  More information on how to use a PEP for a group of systems to share Mobile Capacity on Demand (CoD) processor resources and memory resources can be found in the IBM Knowledge Center at the following link: https://www.ibm.com/support/knowledgecenter/HW4M4/p8ha2/systempool_cod.htm.

System firmware changes that affect all systems

  • A problem was fixed for an unneeded error log with SRC B181DB04 that occurred in a failed IPL for a normal condition of lost PNOR flash access after a reIPL process had started and taken over the access.
  • A problem was fixed for the Advanced System Management Interface "Network Services/Network Configuration" "Reset Network Configuration" button that was not resetting the static routes to the default factory setting.  The manufacturing default is to have no static routes defined so the fix clears any static routes that had been added.  A circumvention to the problem is to use the ASMI "Network Services/Network Configuration/Static Route Configuration" "Delete" button before resetting the network configuration.
  • Support was added for additional First Failure Data Capture (FFDC) data for processor clock failover errors provided by creating daily clock status reports with SRC B150CCDA informational error logs.  This clock status SRC log is written into the Hardware Management Console (HMC) iqyylog.log as a platform error log (PEL) event.  The PEL event contains a dump of the clock registers.  If a processor clock fails over with SRC B158CC62 posted to the serviceable events log, the iqyylog.log file on the HMC should be collected to help debug the clock problem using the B150CCDA data.
  • A problem was fixed for the service processor recovery from intermittent MAX31760 fan controller faults logged with SRC B1504804.  The fan controller faults caused an out of memory condition on the service processor, forcing it to reset and failover to the backup service processor with SRCs B181720D, B181E6E9,  and B182951C logged.  With the fix, the fan controller faults are handled without memory loss and the only SRC logged is B1504804 for each fan controller fault.
  • A problem was fixed for a sequence of two or more Live Partition Mobility migrations that caused a partition to crash with a SRC BA330000 logged (Memory allocation error in partition firmware).  The sequence of LPM migrations that can trigger the partition crash are as follows:
    The original source partition level can be any FW760.xx, FW763.xx, FW770.xx, FW773.xx, FW780.xx, or FW783.xx P7 level or any FW810.xx, FW820.xx, FW830.xx, or FW840.xx P8 level.  It is migrated first to a system running one of the following levels:
    1) FW730.70 or later 730 firmware or
    2) FW740.60 or later 740 firmware
    And then a second migration is needed to a system running one of the following levels:
    1) FW760.00 - FW760.20 or
    2) FW770.00 - FW770.10
    The twice-migrated system partition is now susceptible to the BA330000 partition crash during normal operations until the partition is rebooted.  If an additional LPM migration is done to any firmware level, the thrice-migrated partition is also susceptible to the partition crash until it is rebooted.
    With the fix applied, the susceptible partitions may still log multiple BA330000 errors but there will be no partition crash.  A reboot of the partition will stop the logging of the BA330000 SRC.
  • A problem was fixed for a service processor failure during a system power off that causes a reset of the service processor.  The service processor is in the correct state for a normal system power on after the error.  The frequency for this error should be low as it is caused by a very rare race condition in the power off process.
  • A problem was fixed for the health monitoring of the NVRAM and DRAM in the service processor that had been disabled.  The monitoring has been re-established and early warnings of service processor memory failure are logged with one of the following Predictive Error SRCs:  B151F107, B151F109, B151F10A, or B151F10D.
  • A security problem was fixed in OpenSSL for a possible service processor reset on a null pointer de-reference during SSL certificate management. The Common Vulnerabilities and Exposures issue number is CVE-2016-0797.
  • A problem was fixed for an infrequent IPL hang and termination that can occur if the backup clock card is failing.  The following SRCs may be logged with this termination:  B1813450, B181460B, B181BA07, B181E6C7 and B181E6F1.  If the IPL error occurs, the system can be re-IPLed to recover from the problem.
  • A problem was fixed for the Advanced System Management Interface (ASMI) incorrectly showing the Anchor card as guarded whenever any redundant VPD chip is guarded.
  • A problem was fixed for hypervisor task failures in adjunct partitions with a SRC B7000602 reported in the error log.  These failures occur during adjunct partition reboots for concurrent firmware updates but are extremely rare and require a re-IPL of the system to recover from the task failure.  The adjunct partitions may be associated with the VIOS or I/O virtualization for the physical adapters such as done for SR-IOV.
  • A problem was fixed for a shortened "Grace Period" for "Out of Compliance" users of a Power Enterprise Pool (PEP).   The "Grace Period" is short by one hour, so the user has one less hour to resolve compliance issues before the HMC disallows any more borrowing of PEP resources.  For example, if the "Grace Period" should have been 48 hours as shown in the "Out of Compliance" message, it really is 47 hours in the hypervisor firmware.  The borrowing of PEP resources is not a common usage scenario.  It is most often found in Live Partition Mobility (LPM) migrations where PEP resources are borrowed from the source server and loaned to the target server.
  • A problem was fixed for intermittent long delays in the NX co-processor for asynchronous requests such as NX 842 compressions.  This problem was observed for AIX DB2 when it was doing hardware-accelerated compressions of data but could occur on any asynchronous request to the NX co-processor.
  • A problem was fixed for a Live Partition Mobility migration that resulted in the source managed system going to the Hardware Management Console (HMC) Incomplete state after the migration to the target system was completed.  This problem is very rare and has only been detected once.  The problem trigger is that the source partition does not halt execution after the migration to the target system.   The HMC went to the Incomplete state for the source managed system when it failed to delete the source partition because the partition would not stop running.  When this problem occurred, the customer network was running very slowly and this may have contributed to the failure.  The recovery action is to re-IPL the source system but that will need to be done without the assistance of the HMC.  For each partition that has an OS running on the source system, shut down each partition from the OS.  Then from the Advanced System Management Interface (ASMI),  power off the managed system.  Alternatively, the system power button may also be used to do the power off.  If the HMC Incomplete state persists after the power off, the managed system should be rebuilt from the HMC.  For more information on HMC recovery steps, refer to this IBM Knowledge Center link: https://www.ibm.com/support/knowledgecenter/en/POWER7/p7eav/aremanagedsystemstate_incomplete.htm
  • A problem was fixed for transmit time-outs on a Virtual Function (VF) during stressful network traffic, on systems using PCIe adapters in Single Root I/O Virtualization (SR-IOV) shared-mode.  This fix updates adapter firmware to 10.2.252.1918, for the following Feature Codes: EN15, EN16, EN17, EN18, EN0H, EN0J, EL38, EN0M, EN0N, EN0K, EN0L, and EL3C.
    The SR-IOV adapter firmware level update for the shared-mode adapters happens under user control to prevent unexpected temporary outages on the adapters.  A system reboot will update all SR-IOV shared-mode adapters with the new firmware level.  In addition, when an adapter is first set to SR-IOV shared mode, the adapter firmware is updated to the latest level available with the system firmware (and it is also updated automatically during maintenance operations, such as when the adapter is stopped or replaced).  And lastly, selective manual updates of the SR-IOV adapters can be performed using the Hardware Management Console (HMC).  To selectively update the adapter firmware, follow the steps given at the IBM Knowledge Center for using HMC to make the updates:   https://www.ibm.com/support/knowledgecenter/HW4M4/p8efd/p8efd_updating_sriov_firmware.htm.
    Note: Adapters that are capable of running in SR-IOV mode, but are currently running in dedicated mode and assigned to a partition, can be updated concurrently either by the OS that owns the adapter or the managing HMC (if OS is AIX or VIOS and RMC is running).
  • A problem was fixed for an unplugged or missing clock cable during an IPL that logs an expected SRC B158CC62 for the clock cable problem but also results in an unexpected system checkstop with a DIMM failure with SRC B124E504 and error signature  "mb(n1p0) () PLL".  With the fix, the clock cable problems are detected and handled without incurring secondary faults.
  • A problem was fixed for infrequent VPD cache read failures during an IPL causing an unnecessary guarding of DIMMs with SRC B123A80F logged.  With the fix, the VPD cache read failures cause a temporary deconfiguration of the associated DIMM but the DIMM is recovered on the next IPL.
System firmware changes that affect certain systems
  • On systems with a PowerVM Active Memory Sharing (AMS) partition with AIX  Level 7.2.0.0 or later with Firmware Assisted Dump enabled, a problem was fixed for a Restart Dump operation failing into KDB mode.  If "q" is entered to exit from KDB mode, the partition fails to start.  The AIX partition must be powered off and back on to recover.  The problem can be circumvented by disabling Firmware Assisted Dump (default is enabled in AIX 7.2).
  • For a system partition with more than 64 cores, a problem was fixed for Live Partition Mobility (LPM)  migration operations failing with HSCL365C.  The partition migration is stopped because the platform detects a firmware error anytime the partition has more than 64 cores.
  • On multiple-node systems, a problem was fixed for a two hour IPL hang in HostBoot caused by multiple B18ABAAB errors from more than one node.  The Hostboot process failed to go into its reconfiguration loop to do error recovery and continue the IPL.
SC820_099_047 / FW820.40

05/04/16
Impact:  Availability      Severity:  SPE

New Features and Functions

  • Support was added for the Stevens6+ option of the internal tray loading DVD-ROM drive with F/C #EU13.  This is an 8X/24X(max) Slimline SATA DVD-ROM Drive.  The Stevens6+ option is a FRU hardware replacement for the Stevens3+.  MTM 7226-1U3 (Oliver)  FC 5757/5762/5763 attaches to IBM Power Systems and lists Stevens6+ as optional for Stevens3+.  If the Stevens6+  DVD drive is installed on the system without the required firmware support, the boot of an AIX partition will fail when the DVD is used as the load source.  Also, an IBM i partition cannot consistently boot from the DVD drive using D-mode IPL.  A SRC C2004130 may be logged for the load source not found error.

System firmware changes that affect all systems

  • A problem was fixed for a system IPL hang at C100C1B0 with SRC 1100D001 when the power supplies have failed to supply the necessary 12-volt output for the system.   The 1100D001 SRC was calling out the planar when it should have called out the power supplies.  With the fix, the system will terminate as needed and call out the power supply for replacement.  One mode of power supply failure that could trigger the hang is sync-FET failures that disrupt the 12-volt output.
  • A problem was fixed for the callout of a VPD collection fault and system termination with SRC 11008402 to include the 1.2vcs VRM FRU.  The power good fault for the 1.2 volts would be a primary cause of this error.  Without the fix, the VRM is missing in the callout list and only has the VPDPART isolation procedure.
  • On multi-node systems with a power fault, a problem was fixed for On-Chip Controller errors caused by the power fault being reported as predictive errors for SRC B1602ACB.  These have been corrected to be informational error logs.  If running without the fix, the predictive and unrecoverable errors logged for the OCC on loss of power to the node can be ignored.
  • A problem was fixed for excessive logging of the SRC 11002610 on a power good (pgood) fault when detected by the Digital Power Subsystem Sweep (DPSS).  Multiple pgood interrupts are signaled by the DPSS in the interval between the first pgood failure and the node power down.  A threshold was added to limit the number of error logs for the condition (see the throttling sketch following this list).
  • A problem was fixed for redundant logging of the SRC B1504804 for a fan failure, once every five seconds.  With the fix, the failure is logged only at the initial time of failure in the IPL.
  • A problem was fixed for a false unrecoverable error (UE) logged for B1822713 when an invalid cooling zone is found during the adjustment of the system fan speeds.  This error can be ignored as it does not represent a problem with the fans.
  • On a multi-node system,  a problem was fixed for a power fault with SRC 11002610 having incorrect FRU callouts.  The wrong second FRU callout is made on nodes 2, 3, and 4 of a multi-node system.  Instead of calling out the processor FRU, the enclosure FRU is called out.  The first FRU callout is correct.
  • A problem was fixed for a processor clock failover error with SRC B158CC62 calling out all processors instead of isolating to the suspect processor.  The callout priority correctly has a clock and a procedure callout as the highest priority, and these should be performed first to resolve the problem before moving on to the processors.
  • A problem was fixed for a system checkstop caused by a L2 cache least-recently used (LRU) error that should have been a recoverable error for the processor and the cache.  The cache error should not have caused a L2 HW CTL error checkstop.
  • A problem was fixed for priority callouts for system clock card errors with SRC B158CC62.  These errors had high priority callouts for the system clock card and medium callouts for FRUs in the clock path.  With the fix, all callouts are set to medium priority as the clock card is not the most probable FRU to have failed but is just a candidate among the many FRUs along the clock path.
  • A problem was fixed for PCIe switch recovery to prevent a partition switch failure during the IPL with error logs for SRC B7006A22 and B7006971 reported.  This problem can occur when doing recovery for an informational error on the switch.  If this problem occurs, the partition must be restarted to recover the affected I/O adapters.
  • A problem was fixed to correct the error messages for early failures in the Live Partition Mobility (LPM) migration of a partition.  The management console might report an unrelated error such as  "HSCLA27E The operation to lock the physical device location for target adapter" when the actual error might be not enough available memory on the target CEC to run the migration.  With the fix, the correct error code is returned so there is enough information to correct the error and retry the migration.
  • A problem was fixed for a hypervisor task hang during a FRU exchange on the PCIe3 I/O expansion drawer (#EMX0) that requires the entire drawer to power off and power on again.  The activation phase for the power on may never complete if a very rare sequence of events occurs during the power on step.  The FRUs to exchange that would cause the expansion drawer to power off  and power on are the following:  midplane, I/O module, I/O module VRM, chassis management card (CMC), cable card, and active optical cable.
  • A problem was fixed for PCIe adapter hangs and network traffic error recovery during Live Partition Mobility (LPM) and SR-IOV vNIC (virtual ethernet adapter)  operations.  An error in the PCI Host Bridge (PHB) hardware can persist in the L3 cache and fail all subsequent network traffic through the PHB.  The PHB  error recovery was enhanced to flush the PHB L3 cache to allow network traffic to resume.
  • A problem was fixed for a Qualys network scan for security vulnerabilities causing a core dump in the Intelligent Platform Management Interface (IPMI)  process on the service processor with SRC B181EF88.  The error occurs anytime the Qualys scan is run because it sends an invalid IPMI session id that should have been handled and discarded without a core dump.
  • A problem was fixed for error recovery from failed Live Partition Mobility (LPM) migrations.  The recovery error is caused by a partition reset that leaves the partition in an unclean state with the following consequences:  1) A retry on the migration for the failed source partition may not be allowed; and 2) With enough failed migration recovery errors, it is possible that any new migration attempts for any partition will be denied.  This error condition can be cleared by a re-IPL of the system. The partition recovery error after a failed migration is much more likely to occur for partitions managed by the Integrated Virtualization Manager (IVM) but it is still possible to occur for Hardware Management Console (HMC) managed partitions.
  • A problem was fixed for an L2 cache error on the service processor that caused the service processor to reset or go to a failed state with SRC B1817212 on systems with a single service processor.  On systems with redundant service processors, the failed service processor would get guarded with a B151E6D0 or B152E6D0 SRC depending on which service processor fails.  With the fix, the L2 cache error is handled with single-bit correction and no error to the service processor, so it can continue normal processing.  The L2 cache data error that causes this failure is infrequent, and the service processor requires its limit of three resets in fifteen minutes to be exceeded before the service processor fails, so the service processor failure rate for this problem is low.
  • A security problem was fixed in OpenSSL for a possible service processor reset on a null pointer de-reference during RSA PSS signature verification. The Common Vulnerabilities and Exposures issue number is CVE-2015-3194.
  • A security problem was fixed in the lighttpd server on the service processor, where a remote attacker, while attempting authentication, could insert strings into the lighttpd server log file.  Under normal operations on the service processor, this does not impact anything because the log is disabled by default.  The Common Vulnerabilities and Exposures issue number is CVE-2015-3200.
  • A problem was fixed for a hypervisor adjunct partition failed with "SRC B2009008 LP=32770" for an unexpected SR-IOV adapter configuration.  Without the fix, the system must be re-IPLed to correct the adjunct error.  This error is infrequent and can only occur if an adapter port configuration is being changed at the same time that error recovery is occurring for the adapter.
  • A problem was fixed for a missing error log when a clock card fails over to the backup clock card.  This problem causes loss of redundancy on the clock cards without a callout notification that there is a problem with the FRU.  If the fix is applied to a system that had a failed clock, that condition will not be known until the system is IPLed again, when an error log and callout of the clock card will occur if it is in a persisted failed state.
  • A problem was fixed for the service processor going to the reset state instead of the termination state when the anchor card is missing or broken.  At the termination state, the Advanced System Manager Interface (ASMI) can be used to collect failure data and debug the problem with the anchor card.
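    For the SRC 11002610 thresholding item above: a minimal C sketch of the kind of per-condition logging threshold described, where repeated identical pgood interrupts are counted but only a bounded number of error logs is created.  The threshold value and function names are assumptions for illustration only.

        #include <stdio.h>

        #define PGOOD_LOG_THRESHOLD 1   /* assumed limit; the actual value is not published */

        static unsigned int pgood_logged;

        /* Log the SRC only until the threshold is reached; ignore further identical events. */
        static void handle_pgood_interrupt(int event)
        {
            if (pgood_logged < PGOOD_LOG_THRESHOLD) {
                pgood_logged++;
                printf("logging SRC 11002610 for pgood event %d\n", event);
            }
        }

        int main(void)
        {
            for (int i = 0; i < 5; i++)   /* repeated interrupts before the node powers down */
                handle_pgood_interrupt(i);
            return 0;
        }
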
System firmware changes that affect certain systems
  • On systems with AIX or Linux encapsulated state partitions, a problem was fixed for a Live Partition Mobility migration failure for the encapsulated state partitions.  The migration fails on the target CEC when the associated paging space needed to support the encapsulated state is not available.  Removing the "Encapsulated State" attribute from the partition would allow the migration to succeed.  However, removing this attribute can only be accomplished if the partition is in the powered off state.  Encapsulated State partitions are needed for the remote restart feature.  An encapsulated state partition is a partition in which the configuration information and the persistent data are stored external to the server on persistent storage.  A partition that supports remote restart can be restarted remotely.  For more information on the remote restart feature, refer to this IBM Knowledge Center link: http://www.ibm.com/support/knowledgecenter/P8DEA/p8efd/p8efd_lpar_general_props.htm
  • For Integrated Virtualization Manager (IVM) managed systems with more than 64 active partitions, a problem was fixed for recovery from Live Partition Mobility (LPM) errors.  Without the fix, the IVM  managed system partition can appear to still be running LPM after LPM has aborted, preventing retries of the LPM operation.  In this case, the partition must be stopped and restarted to clear the LPM error state.  The problem is not frequent because it requires a failed LPM on a partition with a partition ID that is greater than 64.
  • On systems with an invalid P-side or T-side in the firmware, a problem was fixed in the partition firmware Real-Time Abstraction System (RTAS) so that system Vital Product Data (VPD) is returned at least from the valid side instead of returning no VPD data.   This allows AIX host commands such as lsmcode, lsvpd, and lsattr that rely on the VPD data to work to some extent even if there is one bad code side.  Without the fix,  all the VPD data is blocked from the OS until the invalid code side is recovered by either rejecting the firmware update or attempting to update the system firmware again.
  • A problem was fixed for an incorrect date in partitions created with a Simplified Remote Restart-Capable (SRR) attribute where the date is created as Epoch 01/01/1970 (MM/DD/YYYY).  Without the fix, the user must change the partition time of day when starting the partition for the first time to make it correct.  This problem only occurs with SRR partitions.
  • On systems using PowerVM firmware with dedicated processor partitions,  a problem was fixed for the dedicated processor partition becoming intermittently unresponsive. The problem can be circumvented by changing the partition to use shared processors.
SC820_091_047 / FW820.30

11/18/15
Impact:  Availability      Severity:  HIPER

New Features and Functions

  • The firmware code update process was enhanced with a feature to block a firmware "downgrade" to a level that is below the system's manufactured code level.
  • Support was added to the Advanced System Management Interface (ASMI) to be able to add an IPv4 static route definition for each ethernet interface on the service processor.  Using a static route definition,  a Hardware Management Console (HMC) configured on a private subnet that is different from the service processor subnet is now able to connect to the service processor and manage the CEC.  A static route persists until it is deleted or until the service processor settings are restored to manufacturing defaults.  The static route is managed with the ASMI panel "Network Services/Network Configuration/Static Route Configuration" IPv4 radio button.  The "Add" button is used to add a static route (only one is allowed for each ethernet interface) and the "Delete" button is used to delete the static route.

System firmware changes that affect all systems

  • HIPER/Pervasive:  A problem was fixed for recovering from embedded MultiMediaCard (eMMC) flash NAND errors and three other low-level boot errors that caused the service processor to go to a failed state with SRC B1817212 on systems with a single service processor.  On systems with redundant service processors, the failed service processor would get guarded with a B151E6D0 or B152E6D0 SRC depending on which service processor fails.  Other low-level boot errors included in this fix:
    1) A system reset to clear the boot registers may be erroneously handled as a chip reset causing the service processor to enter a stopped state and become unresponsive.
    2) Recovery was improved for a defective file system partition table that caused the service processor to lose the ability to perform P-side and T-side (Permanent and Temporary) switches.
    3) A full dump partition no longer causes a failure, as this condition is normal when the service processor holds its maximum number of service processor dumps.
    For each of these additional issues, on systems with redundant service processors, the failed service processor would likewise be guarded with a B151E6D0 or B152E6D0 SRC depending on which service processor fails.
  • HIPER/Non-Pervasive: A problem associated with workloads using transactional memory on PowerVM was discovered and is fixed in this service pack. The effect of the problem is non-deterministic but may include undetected corruption of data.
  • HIPER/Non-Pervasive:  A problem was fixed for recovery from PNOR flash memory corruption that causes the IPL to fail with SRC D143900C.  This is very rare and only has happened in IBM internal labs.  Without the fix, the service processor cannot correct the corruption in the PNOR.  If a system has the problem SRC and  cannot IPL,  then that system must be disruptively firmware updated to apply the fix to be able to IPL again.
  • DEFERRED:  A problem was fixed for memory on-die termination (ODT) settings to improve the signal integrity of the memory channel.
  • DEFERRED:  A problem was fixed for a TCP/IP performance degradation on PCIe ethernet adapters with Remote Direct Memory Access (RDMA) over Converged Ethernet (RoCE).  By adjusting the system memory caching, a significant improvement was made to the data throughput speed to restore performance to expected levels.  This fix requires a system re-IPL to take effect.
  • DEFERRED:  A problem was fixed for a hang in the processor and cache memory that causes a system checkstop with SRC B181E540 logged with a processor FRU callout.  The error log details include  "Description:  Runtime diagnostics has detected a problem on a memory bus" and "Signature Description:  mcs(n0p0c6) (MCIFIR[40]) CHANNEL TIMEOUT ERROR" and "Multi-Signature List:  ex(n0p0c14) (L3FIR[24]) L3 Hw Control Error".  The trigger for the hang error is speculative DMA partial writes into cache and the frequency of the error varies with the workload, but may happen several times a month.  A re-IPL of the system is needed for this fix to take effect after a concurrent firmware update of the service pack.
  • A problem was fixed for certain error logs not being reported to the OS.  The error occurs when the hypervisor is not ready to receive an error log message and rejects it.  The error log handler on the service processor was not retrying until the error log was successfully delivered.  Until the fix is applied, there will be a small loss of error logs while the hypervisor is initializing during the IPL, as these will get discarded until the hypervisor is ready.  The missing error logs may be viewed from the service processor using the Advanced System Management Interface (ASMI) or may be viewed as serviceable events on the management console if there is one attached.
  • A problem was fixed for the error reporting of multiple AC power losses so that all occurrences of the power losses are logged.  With the problem, only the first AC power loss for SRC 10001510 is reported, with subsequent power faults not being reported.  Until the fix is applied, a re-IPL of the CEC will re-enable power supply problem reporting.
  • A problem was fixed for a SRC 11002613 logged during a concurrent repair of a power supply.  This SRC was erroneously logged and did not represent a real problem.
  • A problem was fixed for an intermittent SRC B1504804 logged on a re-IPL of the CEC that did not result in an IPL failure.  This problem is an inability of the service processor to do a read from the IIC bus resulting from incorrect device lock management.  This problem has no adverse impact on the system other than a predictive error log and can be ignored until the fix is applied.
  • A problem was fixed for a bad Time of Day (TOD) battery with SRC B15A3305 calling out the P1 Backplane instead of the P1-E2 Battery.  This occurs whenever the TOD battery becomes bad.  Until the fix is applied, always replace the battery FRU for this SRC as the first repair action.
  • A problem was fixed for the capture of the registers for the Hostboot Self-Boot Engine (SBE) for SBE failures.  These registers had been missing from failure data for SBE failures, making these problems more difficult to debug.
  • A problem was fixed for an Advanced System Management Interface (ASMI) error message of "Error in function 'connect', error code 111" when a browser attempted to connect before the service processor was ready.  The browser connection through the web server is now held off until the ASMI process is ready after a reset of the service processor or an AC power cycle of the system.  Until the fix is applied, the ASMI user can wait one or two minutes and then retry the operation.
  • A problem was fixed for an incorrect call home for SRC B1818A0F.  This call home can be ignored.  It occurs rarely, and only in the case of dynamic IP configuration for the service processor when it fails to acquire an IP address from the Dynamic Host Configuration Protocol (DHCP) server.  Until the fix is applied, use the information from the SRC and the network topology to understand why the DHCP client cannot acquire an IP address, as this is normally a network configuration error.
  • A problem was fixed for a system dump re-IPL that failed with SRC B1818601 and B181460B after processor core checkstops had terminated the system.  The failed processor cores created a complex condition that prevented a successful dump collection of all the hardware objects.  Until the fix is applied, the checkstop processor problems will have to be debugged with partial data from the degraded dump collections that have the failure SRCs.
  • A problem was fixed for an infrequent service processor database corruption during concurrent firmware update that caused the system to terminate with a UIRA impact to the customer.  The cause of the database corruption is undetermined but the problem is resolved by the service processor making a backup of the data that can be restored, if needed, to allow the firmware updates to complete successfully.
  • A problem was fixed for Advanced System Management Interface (ASMI) TTY to allow "admin" passwords to be greater than eight characters in length to be consistent with prior generations of the product.  The ASMI web interface works correctly for user "admin" passwords with no truncation in the length of the passwords.
  • A problem was fixed for a local clock card (LCC)  failure with SRC 11001515 that was missing a part number and location code.  This information has been added for LCC faults so the FRU to replace is properly identified.
  • A problem was fixed for a defective PCI oscillator in the local clock card (LCC) with SRC BC58090F that caused an IPL failure for the node instead of failing over to the redundant LCC.  For a multi-node system,  the failure is isolated to the node with the bad LCC and the other nodes are able to IPL.
  • A problem was fixed for a service processor dump with error logs  B181E911 and B181D172 during an IPL.  The error logs were for the detection of defunct processes but otherwise the IPL was successful.
  • A problem was fixed for missing Keyword (KW) and Resource ID (RID) for SRC B181A40F.
  • A problem was fixed for an I2C bus lock error during a CEC power off that caused a ten minute delay for the power off and error log SRCs B1561314 and B1814803 with error number (errno) 3E.
  • A problem was fixed for Advanced System Management Interface (ASMI) help text for menu "I/O Adapter Enlarged Capacity" being missing with the system IPLed and partitions running.  The help text, shown below, is now available for the system in the powered on state as well as in the powered off state.
    "I/O Adapter Enlarged Capacity
    This option controls the size of PCI memory space allocated to each PCI slot.
    When enabled, the selected number of PCI slots, including those in external I/O subsystems, receive the larger DMA and memory mapped address space.
    Some PCI adapters may require this additional DMA or memory space, per the adapter specification.
    This option increases system mainstore allocation to these selected PCI slots.
    Enabling this option may result in some PCI host bridges and slots not being configured because the installed mainstore is insufficient to configure all installed PCI slots."
  • A problem was fixed for recovering from a misplug of the service processor FSI cables (U2-P1-C10-T2 and U1-P1-C9-T2) where the plug locations are reversed from what would be a proper connection.  Without the fix, the bad FSI connections cause the service processors to go to the service processor stop state.  With the fix applied, the error logs call out the bad cables so they can be repaired and the service processor remains in a working state.
  • For a partition that has been migrated with Live Partition Mobility (LPM) from FW730 to FW740 or later, a problem was fixed for a Main Storage Dump (MSD) IPL failing with SRC B2006008.  The MSD IPL can happen after a system failure and is used to collect failure data.  If the partition is rebooted anytime after the migration, the problem cannot happen.  The potential for the problem existed between the active migration and a partition reboot.
  • A problem was fixed for partial loss of Entitlement for On/Off Memory Capacity On Demand (also called Elastic COD).  Users with large amounts of Entitlement on the system, greater than "65535 GB * Days", could have had a truncation of the Entitlement value on a re-IPL of the system (an illustrative truncation sketch follows at the end of this list).  To recover lost Entitlement, the customer can request another On/Off Enablement Code from IBM support to "re-fill" their entitlement.
  • A problem was fixed for a management console command line failure with a return code 0x40000147 (invalid lock state) when trying to delete SR-IOV shared mode configurations.  This could have occurred if the adapter slot had been re-purposed without involvement of the management console and was owned and operational at the time of the requested delete.  With the fix, the current ownership of the slot is honored and only the SR-IOV shared mode configuration data is deleted on the force delete.
  • A problem was fixed for an incorrect restriction on the amount of "Unreturned"  resources allowed for a Power Enterprise Pool (PEP).  PEP allows for logical moving of resources (processors and memory) from one server to another.  Part of this is 'borrowing' resources from one server to move to another. This may result in "Unreturned" resources on the source server. The management console controls how many total "Unreturned" PEP resources can exist.  For this problem,  the user had some "Unreturned" PEP memory and asked to borrow more but this request was incorrectly refused by the hypervisor.
  • On systems where memory relocation (as done by using Live Partition Mobility (LPM)) and a partition reboot are occurring simultaneously, a problem that caused a system termination was fixed.  The potential for the problem existed between the active migration and the partition reboot.
  • A problem was fixed that corrupted the Update Access Key (UAK) date, setting it to "1900".   The user should correct the UAK date, if needed, to allow the firmware update to proceed, by using the original UAK key for the system.  On the Management Console,  enter the original update access key via the "Enter COD Code" panel. Or on the Advanced System Management Interface (ASMI),  enter the original update access key via the "On Demand Utilities/COD Activation" panel.
  • A problem was fixed for recovery from unaligned addresses for MSI interrupts from PCIe adapters.  The recovery prevents an adapter timeout caused by resource exhaustion.  With the fix, the resources for each bad interrupt are returned, allowing the PCIe adapter to continue to run for the normal traffic.
  • A problem was fixed for a machine check incorrectly issued to an IBM i partition running 7.2 or later with 4K sector disks.
  • A problem was fixed for an extraneous PCIe switch SRC B7006A22 being called out when there is a valid PCIe expansion drawer cable problem with SRC B7006A88 reported.  The callout for SRC B7006A22 should be ignored as the PCIe switch hardware is working for this case.
  • A problem was fixed for a Network boot/install failure using bootp in a network with switches using the Spanning Tree Protocol (STP).  A Network boot/install using lpar_netboot on the management console was enhanced to allow the number of retries to be increased.  If the user is not using lpar_netboot, the number of bootp retries can be increased using the SMS menus.  If the SMS menus are not an option, the STP in the switch can be set up to allow packets to pass through while the switch is learning the network configuration.
  • A problem was fixed for PCIe3 adapters failing when requesting more than 32 Message Signaled Interrupts (MSI-X).  The adapter may fail to ping or cause OS tasks to hang that are using the adapter.  This problem was found specifically on the 10 Gb Ethernet-SR (Short Range) PCIe3 adapter with feature codes #5275 and #5769 and on the 56 Gb Infiniband (IB) Fourteen Data Rate (FDR) adapter with feature codes #EC32, #EC33, #EL3D, and #EL50 and CCIN 2CE7.  However, other PCIe adapters may also be affected.
  • A security problem was fixed for an OpenSSL specially crafted X.509 certificate that could cause the service processor to reset in a denial-of-service (DOS) attack.  The Common Vulnerabilities and Exposures issue number is CVE-2015-1789.
  • A problem was fixed for false errors reported with SRC B1812663 for the On-Chip Controller (OCC).  These error logs can be ignored as these are caused by a prior error log using a buffer that is not properly sized for the log data.
  • A problem was fixed to prevent recoverable power faults of short duration from causing the system to lose power supply redundancy.  Without the fix, the faulted state persisted for the recovered power fault, causing a problem with a system power off if other power supplies were lost at a later time.
  • A problem was fixed to guard a failed processor during an IPL instead of hanging with SRC B1813450 reported to the error log.
  • A problem was fixed for an intermittent PSI link error with SRC B15CDA27 after a firmware update or reset/reload of the service processor.
  • A problem was fixed for hardware system dump collection after a hardware checkstop that was missing scan ring data.  This is a very infrequent problem caused by an error with timing in the multi-threaded dump collection process.  Until this fix is applied, the debug of some hardware dump problems may require doing multiple dump collections to get all the data.
  • A problem was fixed for an Advanced System Management Interface (ASMI) error that occurred when trying to display detail on a deconfigured Anchor Card VPD.  If the error log for the selected deconfiguration record had been deleted, it caused ASMI to core dump.  With the fix,  if the error log for the deconfiguration record is missing, the error log details, such as the failing SRC for the deconfiguration record, are returned as blank.
  • A problem was fixed for an Operations Panel SRC of B1504804 with no FRU callout.  A callout of the failed hardware has been added.
  • A problem was fixed for guarding failed hardware dynamically during the IPL to prevent the IPL from terminating.  Without the fix,  certain hardware failures will not be called out to be handled by the reconfiguration loop.  Until the fix is applied, multiple IPL attempts may be needed if hardware is failing.
  • A problem was fixed for a processor error causing a Hostboot terminate instead of a deconfiguration of the bad hardware and continuation of the IPL.  The state of the processors was synchronized between the service processor and the Hostboot process to correct the error.
  • A problem was fixed for the recovery of a failing PCI clock so that a failover to the backup PCI clock occurs without a node failing and being deconfigured.  Without the fix, the PCI clock does not behave as a redundant FRU and faults on it will cause the CEC to terminate.  A re-IPL of the CEC recovers it from the PCI clock error with the bad clock guarded so that the other PCI clock is used.
  • A problem was fixed so that error logs are now generated for thermal errors detected by the service processor.  Without the fix, thermal errors such as a temperature over the threshold will not get reported in the error log, but higher fan speeds will be present as an indicator of the thermal problem.  Until the fix is applied, the error log and call home mechanism cannot be relied on to monitor for system thermal problems.
  • A problem was fixed for processor core checkstops that cause an LPAR outage but do not create hardware errors and service events.  The processor core is deconfigured correctly for the error.  This can happen if the hypervisor forces processor checkstops in response to excessive processor recovery.
  • A problem was fixed for recovery from a processor local bus (PLB) hang on the service processor.  The errant PLB hang recovery would be seen in concurrent firmware updates that, on rare occasions, fail to do a side switch to activate the new level of firmware.  On the management console, the error message would be "HSCF010180E Operation failed ... E302F873 is the error code."  Other than the failed code level activation, the firmware update is successful.  If this problem occurs, the system can be set to the new firmware level by doing a power off from the management console and then doing a power on with side switch selected in the advanced properties.
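
  Illustrative note for the On/Off Memory Capacity On Demand entitlement fix above:  the "65535 GB * Days" ceiling suggests that the running entitlement total was persisted in an unsigned 16-bit field (65535 = 2^16 - 1).  The short C sketch below is illustrative only; it is not the hypervisor code, and the variable names are hypothetical.  It shows how a value above 65,535 wraps when narrowed to 16 bits, which is the kind of truncation the fix prevents.

      #include <stdint.h>
      #include <stdio.h>

      int main(void)
      {
          /* Hypothetical example: 70,000 GB*Days of On/Off entitlement. */
          uint32_t entitlement_gb_days = 70000;

          /* Narrowing to a 16-bit field silently discards the high-order bits. */
          uint16_t persisted = (uint16_t)entitlement_gb_days;

          printf("before re-IPL: %u GB*Days, after truncation: %u GB*Days\n",
                 (unsigned)entitlement_gb_days, (unsigned)persisted);   /* prints 70000 and 4464 */
          return 0;
      }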

System firmware changes that affect certain systems

  • On a system with redundant service processors where redundancy is disabled, a problem was fixed for an unrecoverable (UE) SRC B181DA19 being logged on a re-IPL after a checkstop error.  The error log did not interfere with the re-IPL which was successful.  The error log is for an active Processor Support Interface (PSI) link not being found for the backup service processor.  This is a correct condition when redundancy is disabled, so the error log should not have been generated.  Until the fix is applied, the error code can be ignored.
  • On multiple-node systems, a problem was fixed for extraneous error logs after a 12V power fault with SRC 11002610.  After system termination, there are additional 110026B0 and 110026B3 error log entries that can be ignored.
  • On a system with redundant service processors, a problem was fixed for the isolation procedures for an Anchor card error and system VPD collection failure with termination SRC B181A40F.  FSPSP04 and FSPSP06 are no longer called out as part of reporting the VPD collection failure.  FSPSP30 has been updated with isolation steps for this problem and is called out and should be used for the problem isolation.  Retain tip H213935 also provides the FRU isolation steps.  Procedure FSPSP30 tries to replace the service processor first.  If that does not work, then the procedure has the Anchor card replaced.
  • On a system with redundant service processors, a problem was fixed for failovers to the backup service processor that caused an On-Chip Controller (OCC) abort.  This placed the CEC in a "safe" mode where it ran at reduced processor clock frequencies to prevent exceeding the power limits while not under OCC control.
  • On a system with an IBM i partition using Active Memory Sharing (AMS),  a problem was fixed for internal memory management errors caused by deleting an IBM i partition that had been powered off in the middle of a Main Storage Dump (MSD).  Until the fix is installed, if a MSD is interrupted for an IBM i partition that has AMS, the partition should be powered on and powered off normally before a delete of the partition is done to prevent errors with unpredictable effects.
  • On systems using PCIe adapters in SR-IOV mode, a problem was fixed for occasional B200F011 and B2009008 SRCs that can occur during an IPL, moving an adapter into SR-IOV mode, or with SR-IOV link up/down activity.
  • On systems using PCIe adapters in SR-IOV mode,  the following problems were addressed with an Avago Technologies adapter firmware update to 10.2.252.1905:  1) Eliminating virtual function (VF) transmit errors during VF resets and 2) Preventing  loss of legacy flow control when an adapter port is connected to a priority flow control (PFC) capable switch.
  • On a system with redundant service processors, a problem was fixed for a firmware update causing an error log server dump with SRC B1818601.  The error log server restarted automatically to recover from the error and the firmware update was successful.
  • On a system with an AIX partition and a Linux partition, a problem was fixed for dynamically moving an adapter that uses DMA from the Linux partition to the AIX partition, which caused AIX to fail by going into KDB mode (0c20 crash).  The management console showed the following message for the partition operation:  "Dynamic move of I/O resources failed.  The I/O slot dynamic partitioning operation failed.".  The error was caused by Linux using 64K mappings for the DMA window and AIX using 4K mappings for the DMA window, causing incorrect calculations on AIX when it received the adapter.  Until the fix is applied, adapters that use DMA should only be moved from Linux to AIX when the partitions are powered off.
  • On a system with redundant service processors, a problem was fixed for an IPL failure for a bad service processor cable on the primary service processor with SRCs B1504904 and B18ABAAB logged.  The system should have done an error failover to the backup service processor and continued the IPL to get the partitions running.
SC820_087_047 / FW820.21

09/24/15
Impact:  Performance    Severity:  HIPER

System firmware changes that affect certain systems

  • HIPER/Pervasive:  On systems using PowerVM with shared processor partitions that are configured as capped or in a shared processor pool, there was a problem found that delayed the dispatching of the virtual processors which caused performance to be degraded in some situations.  Partitions with dedicated processors are not affected.   The problem is rare and can be mitigated, until the service pack is applied, by creating a new shared processor AIX or Linux partition and booting it to the SMS prompt; there is no need to install an operating system on this partition.  Refer to help document http://www.ibm.com/support/docview.wss?uid=nas8N1020863 for additional details.
SC820_085_047 / FW820.20

07/16/15
Impact:  Availability      Severity:  SPE

New Features and Functions

  • Support was added to the Advanced System Management Interface (ASMI) to display Anchor card VPD failures in the "Deconfigurations records" menu.

System firmware changes that affect all systems

  • DEFERRED: A problem was fixed for the fabric bus to allow a processor clock failover to be completed without a checkstop of the CEC.   A skew between the primary and secondary processor clock signal was eliminated to fix the problem.
  • DEFERRED: On systems with memory mirroring enabled, a problem was fixed for PowerVM over-estimating its memory needs, allowing more memory to be used by the partitions.  To free up the memory that the hypervisor does not need so it can be used by the partitions, the CEC must be re-IPLed after the fix is applied.
  • DEFERRED: A problem was fixed for the hypervisor being unable to make a partition configuration change when all licensed memory is in use by the partitions. An insufficient storage error is returned to the management console and the management console may go to the incomplete state for the CEC.  The hypervisor management of memory fragments has been improved so that partition configuration changes can be made when all licensed memory is in use.  To make this additional memory available for the partition changes,  the CEC must be re-IPLed after the fix is applied.
  • A problem was fixed for a missing SRC if the operations panel failed while the system was running.  A B156A023 SRC is now logged if the operations panel fails or is removed while the system is running.
  • A problem was fixed that prevented a second management console from being added to the CEC.  In some cases, network outages caused defunct management console connection entries to remain in the service processor connection table,  making connection slots unavailable for new management consoles.  A reset of the service processor could be used to remove the defunct entries.
  • A problem was fixed for a missing SRC when a Universal Power Interconnect Cable (UPIC) to the system control unit (SCU) failed or became loose while the system was running.  Up to four hot pluggable UPIC cables (#ECCA and #ECCB) provide redundant power to the SCU but only one is needed for operation.  Now when a UPIC cable fails, an SRC 11008802 is logged and the loss of one of the redundant power cables is called out.
  • A problem was fixed for a false guarding and call out of a PSI link with SRC B15CDA27.  This failure is very infrequent but sometimes seen after the reset/reload of the service processor during a concurrent firmware update.   Since there is no actual hardware failure, a manual unguarding of the PSI link allows it to be reused.
  • A problem has been fixed for the LED lights being interchanged for the Universal Power Interconnect Cable (UPIC) and the GFSP interface card FRUs on the system node.  The GFSP interface card has CCIN 6B2E and part number 00E2598 with location codes of Un-P1-C9-T2 and Un-P1-C10-T2.  The UPIC cables have part numbers 00FX185 and 00FX186 with location codes Un-P1-C9-T1 and Un-P1-C10-T1.
  • A problem was fixed for a CEC power off error with SRC B1818903 logged.  The error causes a dump and reset of the service processor that allows the power off operation to complete.
  • A problem was fixed for a two to four minute delay that could occur when performing an Administrative Failover (AFO) of the service processor.  An On-Chip Controller (OCC) deadlock was occurring in the service processor, leaving both service processors in the backup role.   This error state is automatically corrected by the hypervisor with a host-initiated reset/reload when it cannot find a service processor in the primary role after the delay time-out period.
  • A problem was fixed for losing power capping capability in the On-Chip Controllers (OCCs) after a service processor failover.  When this occurs, an UE B1702A03 SRC is logged by the OCC.  To restore power capping,  shut down all partitions, power off the CEC, and then IPL the CEC again.
  • A problem was fixed for the error handling of a Local Clock and Control (LCC) card failure in a system node that triggers a flood of FDAL informational SRCs of B1504800 to the error log, causing the service processor to run out of memory and reset with a failover to the backup service processor.  The LCC has CCIN 682D and part number 00E2394 with location codes Un-P1-C11 and Un-P1-C12 as it is redundant in each system node.
  • A problem was fixed for an IPL failure with SRC B181BC04 when a system node was added to the CEC at service processor standby.  The new system node hardware was not added correctly to the hardware scan ring and an AC power cycle of the CEC was needed to fix the error.
  • A problem was fixed for missing hardware data in system dumps created for hardware checkstops.  A certain class of hardware scan rings were being skipped during the dump collection and these are now included so that all the hardware data is available for problem debug.
  • A problem was fixed for missing "fastarray" data in hardware dump type HWPROC.  The "fastarray" contains debug information for the processor cores.
  • A problem was fixed for the Advanced System Management Interface (ASMI) to allow removal of Hardware Management Console (HMC) connections that have been temporarily disconnected.  In some instances, the ASMI "System Configuration/Hardware Management Consoles" button for  "Remove Connection"  was not being shown.
  • A problem was fixed for the Advanced System Management Interface (ASMI) IPv4 Network Configuration where the IP address was being overwritten by the value in the subnet mask field for the initial values of the panel.  If the network configuration was saved without fixing the IP address, the wrong IP address was also saved.
  • A problem was fixed for missing call outs when having multiple "Memory Card/FRU" failures with SRC B124E504.  There is a call out for the first memory FRU of the failures but any other memory FRUs failing at the same time were not reported.
  • A problem was fixed for Administrative Failover (AFO) having error log SRC B1818601.  This error did not prevent the AFO from completing as the backup service processor became the primary service processor.
  • A problem was fixed for an intermittent problem in a CEC IPL where an On-Chip Controller is stuck in a reset loop, logging repeated SRCs for B1702A17, and eventually places the CEC in safe mode, running at minimum processor clock frequencies.
  • A problem was fixed for errors during a CEC power off with SRCs B1812616 and B1812601.  These occurred if the CEC was powered off immediately after a power on such that the On-Chip Controllers (OCCs) had to shutdown during their initialization.
  • A problem was fixed for a highly intermittent IPL failure with SRC B18187D9 caused by a defunct attention handler process.  Without this fix, the IPL will continue to fail until the service processor is reset.
  • A problem was fixed to add the callouts for the fan FRUs for system fan faults with SRCs 11007610, 11007620, and 11007630.  The fan FRU with CCIN 6B42, part number 00E9335, and location code Un-A1 is now included as needed.
  • A problem was fixed for an Administrative Failover (AFO) having error log SRC B185270E.  This error did not prevent the AFO from completing as the backup service processor became the primary service processor.   The error log has been made informational as it is a normal occurrence when fan speeds are adjusted.
  • A problem was fixed to allow adding a system node with only one working Local Clock and Control (LCC) card and being able to IPL the system node.  The LCC is redundant, so a broken or missing LCC should not cause an IPL to fail.  The problem can be circumvented by using the Advanced System Management Interface (ASMI) command line on the primary service processor to run this command "rmgrcmd --primary-lcc force-init" and then do the IPL.
  • A problem was fixed for finding the path to the second Local Clock and Control (LCC) card when a LCC card has failed to ensure proper redundancy for the LCC and the system node.
  • A problem was fixed for incorrect FRU callouts for Power Line Disturbance (PLD) and Processor clock errors.
  • A problem was fixed for extra FRU callouts being listed for SRCs with multiple FRU callouts.  The extra callouts are from previous SRCs and should not have been listed for the current error log entry.
  • A problem was fixed for the Advanced System Management Interface (ASMI) being allowed to deconfigure a node in a single-node system.  A safe guard was added so that ASMI can only deconfigure nodes in multi-node CECs.
  • A problem was fixed to include PCIe clocks as part of the minimum hardware check during an IPL.  Previously, no error was logged when a system had no functional PCIe clocks, causing run-time failures for PCIe I/O operations in partitions.
  • A problem was fixed for missing FRU information in SRC 11001515.   SRC 11001515 was logged indicating replacement of power supply hardware, but did not include the location code, the part number, the CCIN, or the serial number.
  • A problem was fixed for concurrent firmware update after concurrent PCIe adapter maintenance (add, remove, exchange, etc.) causing the CEC to enter safe mode with its reduced performance.  In safe mode, the processor voltage/frequency is reduced to a "safe" level where thermal monitoring is not required.  Recovery from safe mode requires a system re-IPL.
  • A problem was fixed for an Administrative Failover (AFO) failing with the backup service processor terminating with UE SRCs B15738FD and B1573838.  This failure  was caused by an intermittent error with the operations panel presence detection during failover.
  • A problem was fixed for an Administrative Failover (AFO) having error log SRC B1814616 and a fwdbserver core dump.  This error did not prevent the AFO from completing as the backup service processor became the primary service processor.
  • A problem was fixed for a hypervisor deadlock that results in the system being in an "Incomplete state" as seen on the management console.  This deadlock is the result of two hypervisor tasks using the same locking mechanism for handling requests between the partitions and the management console.  Except for the loss of the management console control of the system, the system is operating normally when the "Incomplete state" occurs.
  • A problem was fixed for Live Partition Mobility (LPM) migrations of Linux partitions running in P8 compatibility mode.  After an active migration, the resumed partition may experience performance degradation.
  • A problem was fixed for a false error message with error code 0x8006 when creating a virtual ethernet adapter with the Integrated Virtualization Manager (IVM).  The error message can be ignored as the virtual ethernet slot is fully functional.
  • A problem was fixed for the recovery of PCIe adapters for a device outage occurring on the PCIe3 6-slot fanout module from the PCIe3  I/O expansion drawer (#EMX0).  One or more of the adapters on the fanout module failed to recover with SRC BA188002.
  • A problem was fixed for an unexpected interrupt from a PCIe adapter that causes the AIX OS to abend.  The extra interrupt comes in from the adapter before it has been enabled for interrupts, after it has reached End of Information (EOI) for its previous session.  The double interrupt from the adapter has been corrected.
  • On systems using PowerVM, a problem was fixed for the handling of the error of multiple cache hits in the instruction effective-to-real address translation cache (IERAT).  A multi-hit IERAT error was causing system termination with SRC B700F105.  The multi-hit IERAT is now recognized by the hypervisor and reported to the OS where it is handled.
  • A problem was fixed for a MDC D-mode IPL that failed if the MDC load source slots were unoccupied.
  • A problem was fixed for systems with a corrupted date of "1900" showing for the Update Access Key (UAK).  The firmware update is allowed to proceed on systems with a bad UAK date because the override is set for the service pack.  After the fix is installed, the user should correct the UAK date, if needed, by using the original UAK key for the system.  On the Management Console,  enter the original update access key via the "Enter COD Code" panel. Or on the Advanced System Management Interface (ASMI),  enter the original update access key via the "On Demand Utilities/COD Activation" panel.
  • A problem was fixed for a hang during a Dynamic Platform Optimizer (DPO) operation. A system re-IPL was needed to end the DPO operation.
  • A problem was fixed for concurrent firmware updates to a system that needed to be re-IPLed after getting a B113E504 SRC during activation of the new firmware level on the hypervisor.  The code update activate failed if the Sleep Winkle (SLW) images were significantly different between the firmware levels.  The SLW contains the state of the processor and cache so it can be restored after sleep or power saving operations.
  • Support was added for USB 2.0 HUBs so that a keyboard plugged into the USB 2.0 HUB will work correctly at the SMS menus.  Previously, a keyboard plugged into a USB 2.0 HUB was not a recognized device.
  • A problem was fixed for Live Partition Mobility (LPM) to prevent a system failure with SRC B700F103 during LPM operations.  When data is moved by LPM, the underlying firmware code requires that the buffers be 4K aligned, otherwise a system failure could result.  The fix now forces the buffers to be 4K aligned, and if there is still an alignment issue, the LPM operation will fail without impacting the system (a short alignment sketch follows at the end of this list).
  • A problem was fixed in the run-time abstraction services (RTAS) extended error handling (EEH) recovery for EEH events for SR-IOV Virtual Functions (VFs) to fully reconfigure the VF devices after an EEH event.  Since the physical adapter does recover from the EEH event itself, and there are no error logs generated, it might not be immediately apparent that the VF did not fully reconfigure.  This prevents certain PCIe settings from being established for interrupts and performance settings, leading to unexpected adapter behavior and errors in the partition.
  • A security problem was fixed in OpenSSL where a remote attacker could crash the service processor with a specially crafted X.509 certificate that causes an invalid pointer or an out-of-bounds write. The Common Vulnerabilities and Exposures issue numbers are CVE-2015-0286 and CVE-2015-0287.
  • A problem was fixed for an error log SRC B15738B0 with no FRU callout for a FSI bus error.
  • A problem was fixed for an error log SRC B1504803 with no FRU callout for an IIC bus error.
  • A problem was fixed for a memory error that prevented the CEC from doing an IPL.  The failing DIMM is now deconfigured during the HostBoot part of the IPL and the failing section of the boot is retried to get a successful IPL.
  • A problem was fixed for a checkstop that occurred for a failed Local Clock and Control (LCC) card instead of a failover to the backup LCC card.   The fabric bus erroneously detected a TOD step error during the failover and triggered the checkstop.
  • A problem was fixed for an On-Chip Controller (OCC) failure after a system dump with SRCs B18B2616 and BC822024 reported.  This resulted in the system running with reduced performance in safe mode, where processor clock frequencies are lowered to minimum levels to avoid hardware errors since the OCC is not available to monitor the system.   A re-IPL of the system would resolve the problem.
  • A problem was fixed for new service processor error logs not getting created if too many old error logs exist.  This problem can occur if a large number of small error logs get created and use up all the available inodes (directory entries) for the file system.  The error log garbage collector was not checking the available number of inodes correctly, so it was not always deleting old error logs before attempting to create a new error log (an illustrative inode-check sketch follows this list).   Without the fix,  this problem will continue until some error logs are purged.
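
  Illustrative note for the Live Partition Mobility buffer alignment fix above:  forcing a buffer address onto a 4 KB boundary is a standard round-up-and-mask technique.  The C sketch below is illustrative only; it is not PowerVM source, and the function name is hypothetical.

      #include <stdint.h>
      #include <stdio.h>

      #define ALIGN_4K 4096u

      /* Round an address up to the next 4 KB boundary (unchanged if already aligned). */
      static uintptr_t align_up_4k(uintptr_t addr)
      {
          return (addr + ALIGN_4K - 1) & ~(uintptr_t)(ALIGN_4K - 1);
      }

      int main(void)
      {
          uintptr_t unaligned = 0x1000A234u;
          printf("0x%lx rounds up to 0x%lx\n",
                 (unsigned long)unaligned,
                 (unsigned long)align_up_4k(unaligned));   /* 0x1000a234 -> 0x1000b000 */
          return 0;
      }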
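
  Illustrative note for the error log inode fix above:  on a Linux-based service processor, the free inode count of a file system can be read with statvfs().  The C sketch below is an assumption about the general approach, not the actual error log garbage collector; the path "/var/log" and the threshold are hypothetical.

      #include <sys/statvfs.h>
      #include <stdio.h>

      /* Return nonzero if the file system holding 'path' still has at least
         'min_free' inodes available, so a new log file can safely be created. */
      static int has_free_inodes(const char *path, unsigned long min_free)
      {
          struct statvfs vfs;
          if (statvfs(path, &vfs) != 0)
              return 0;                      /* treat a failed query as "no room" */
          return vfs.f_favail >= min_free;
      }

      int main(void)
      {
          if (!has_free_inodes("/var/log", 2))
              puts("low on inodes: purge the oldest error logs before creating a new one");
          return 0;
      }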
SC820_075_047 / FW820.12

05/18/15
Impact: Function         Severity:  ATT

System firmware changes that affect all systems

  • A problem was fixed so that all guard records associated with one error log entry are cleared together.  If a FRU is replaced for any of the related guard records, all of the related guard records are now cleared.  Previously, only the guard record for the replaced FRU was cleared and the association was lost.
  • A fix was made to prevent processor speculative memory loads from the service processor mailbox Direct Memory Access (DMA) area in the CEC memory.  The speculative loads caused memory cache faults and system checkstops with SRC B181E540.
  • A problem was fixed to reduce switching noise on the memory address bus for DIMMs.  Noise on the bus could cause a failure for a marginal DIMM, so this fix has the effect of potentially improving the reliability of the memory.
SC820_070_047 / FW820.11

04/03/15
Impact: Function         Severity:  SPE

System firmware changes that affect certain systems

  • On systems with a large number of memory DIMMs (64 or more) and redundant service processors, a problem was fixed for a firmware update failure with SRC E302F966 when a failover was attempted as part of the firmware update, but the service processors did not change roles.  This also fixes failing Administrative Failovers (AFOs) for systems with large memory.  The performance of the CEC memory initialization was improved to prevent the hypervisor time-outs for service processor failovers.
SC820_067_047 / FW820.10

03/12/15
Impact:  Security      Severity:  HIPER

New Features and Functions

  • Support was added for setting Power Management Tuning Parameters from the management console (Fixed Maximum Frequency (FMF), Idle Power Save, and DPS Tunables) without needing to use the Advanced System Management Interface (ASMI) on the service processor.  This allows FMF mode to be set by default without having to modify any tunable parameters using ASMI.
  • Support for SSLv3 has been discontinued to reduce security vulnerabilities in the secured connections to the service processor.
  • Support was added for Single Root I/O Virtualization (SR-IOV) that enables the hypervisor to share a SR-IOV-capable PCI-Express adapter across multiple partitions. Two Ethernet adapters are supported with the SR-IOV NIC capability, when placed in the Power E880/E870:
    •    PCIe2 LP 4-port (10Gb FCoE and 1GbE) SR&RJ45 Adapter (#EN0L)
    •    PCIe2 LP 4-port (10Gb FCoE and 1GbE) SFP+Copper and RJ45 Adapter (#EN0J)
    These adapters each have four ports, and all four ports are enabled with SR-IOV function. The entire adapter (all four ports) is configured for SR-IOV or none of the ports is.
    System firmware updates the adapter firmware level on these adapters to 10.2.252.16 when a supported adapter is placed into SR-IOV mode.
    Support for SR-IOV adapter sharing is not yet available for adapters in a PCIe Gen3 I/O Expansion Drawer.
    SR-IOV NIC on the Power E870/E880 is supported by:
    •    AIX 6.1 TL9 SP4 and APAR IV63331, or later
    •    AIX 7.1 TL3 SP4 and APAR IV63332, or later
    •    IBM i 7.1 TR9, or later
    •    IBM i 7.2 TR1, or later
    •    Red Hat Enterprise Linux 6.5, or later
    •    Red Hat Enterprise Linux 7, or later
    •    SUSE Linux Enterprise Server 11 SP3, or later
    •    VIOS 2.2.3.4 with interim fix IV63331, or later

System firmware changes that affect all systems

  • HIPER/Pervasive:  A problem was fixed for a processor clock failover with SRC B158CC62 that caused a system checkstop when the backup clock oscillator did not initialize fast enough.
  • A problem was fixed for the iptables process consuming all available memory, causing an out of memory dump and reset/reload of the service processor.
  • A problem was fixed for a PowerVM hypervisor hang after a processor core and system checkstop.  The failed processor core was not put into a guarded state and the hypervisor hung when it tried to use the failed core.
  • A problem was fixed for an oscillator error caused by a power line disturbance that logged an UE SRC B150CC62 with no FRU call outs.  The error SRC was changed from unrecoverable to informational as no service action is required.
  • A problem was fixed for the NEBS DC power supply showing up in the part inventories for the CEC as "IBM AC PS".  The description string has been changed to "IBM PS" as power supplies can be of DC or AC type.
  • A problem was fixed for the power supplies to add a monitor process for the second rotor in each power supply that was not being monitored.  This will improve fault isolation for power supply problems.  A fix for the second rotor in an earlier service pack release provided the monitor infrastructure but was missing the monitor process.
  • A problem was fixed for a FSI link heartbeat surveillance fault with SRC B1504813 logged that has no FRU call outs.  The FRU call outs have been added.
  • A problem was fixed with the Advanced System Management Interface (ASMI) VPD menu where the Generic External Connector (GC) FRU was displayed as an unknown FRU type.  The "Unknown" has been replaced with "Generic External Connector".
  • A problem was fixed for a system fan identify LED not being able to light after a Digital Power Systems Sweep (DPSS) chip failover.  The fan LED ownership was not transferred to the new primary DPSS chip, so it was unable to light the LED under fan fault conditions.
  • A problem was fixed for SRC B1104800 having duplicate FRU call outs for the PNOR flash FRU.
  • A problem was fixed to prevent the Advanced System Management Interface (ASMI) "System Service Aids/Factory Configuration" panel option from restoring to factory configuration for FSP or ALL if one boot side of the service processor is marked invalid.  The following informational message is issued:  "The request cannot be performed because a firmware boot side is marked invalid.  This state may have been caused by a previous firmware update failure."
  • A problem was fixed for an error log with SRC B150DA19,  created on the backup service processor for a PSI link failure detected on the primary,  not being visible in the error logs on the primary service processor.
  • A problem was fixed in the hardware server to prevent a UE B181BA07 abort when a host boot dump collection is in progress.
  • A problem was fixed for an LED fault with SRC B181A734 that occurred during a normal rebuild of the LED tables, resulting in the LED not being lit.  The problem has been fixed using retries for LEDs that are in a busy state.
  • A problem was fixed for a PSI link failure with SRC B1517212 that resulted in a service processor stop state.  The correct state for a system with broken PSI links is the terminate state so the problem can be resolved with a call home service event.
  • A problem was fixed to prevent false oscillator error logs of SRC B150CC62 for errors unrelated to clock failures.
  • A security problem was fixed in OpenSSL for padding-oracle attacks known as Padding Oracle On Downgraded Legacy Encryption (POODLE).  This attack allows a man-in-the-middle attacker to obtain a plain text version of the encrypted session data. The Common Vulnerabilities and Exposures issue number is CVE-2014-3566.  The service processor POODLE fix is implemented by disabling SSL protocol SSLv3 and requiring TLSv1.2 protocol on all secured connections.  The Hardware Management Console (HMC) also requires a POODLE fix for APAR MB03867 (FIX FOR CVE-2014-3566 FOR HMC V8 R8.2.0 SP1 with PTF MH01455).  This HMC minimum requirement is enforced by the firmware update process for this defect.
  • A problem was fixed for firmware updates that caused the primary service processor to be guarded and SRC B152E6D0 and SRCs of form B181XXXX to be logged.
  • A problem was fixed for intermittent firmware database errors that logged an UE SRC of B1818611 and had a fwdbServer core dump.
  • A problem was fixed to enable the redundant Vital Product Data (VPD) SEEPROM for processors and voltage regulator modules (VRMs).  Previously, only the primary SEEPROM was programmed with the FRU data with no backup protection.
  • A problem was fixed for vague error text for SRC B1504922 for a bad SMP cable.  It was made more specific to state that an incorrect cable length was detected.
  • A problem was fixed for an intermittent reset/reload of the service processor during the early part of an IPL with SRC B1814616 logged.
  • A problem was fixed for hardware presence detection and local clock card (LCC) failover.  The system could not detect critical system hardware with the default LCC missing, causing an error when failing over to the backup LCC.
  • A problem was fixed for non-optimal voltage levels from the power supplies.  Having the power supply output voltages meet the exact specifications will help prevent stress-related hardware failures.
  • A problem was fixed for an error in the "Enlarged IO Capacity Slot Count" that caused more memory than expected to be consumed by the hypervisor.  If the "Enlarged IO Capacity Slot Count" was not a "1", it was wrongly changed to an "8" by the IPL process, increasing the amount of memory that needs to be reserved for I/O buffers.  Retain tip H213684 tells how to reduce the hypervisor memory consumption when this problem happens as the fix will not change the value automatically:
    With the system at the "Power Off" state, take the following actions to free up some memory from the hypervisor:
    - Log into ASMI and then select "System Configuration" menu    
    - Select  "I/O Adapter Enlarged Capacity" option                
    - Use the pulldown to select "1" as the new value for all nodes
    - After changing the value click on the "Save" setting. The change will be active on the next IPL of the system.
  • A problem was fixed for the PCIe reset line (PERST) to keep it active during the IPL until both system power and clocks are stable.  Keeping the PCIe devices in reset until the environment is stable prevents PCIe device lockup.
  • A problem was fixed to prevent a hypervisor task failure if multiple resource dumps running concurrently run out of dump buffer space.  The failed hypervisor task could prevent basic logical partition operations from working.
  • On systems using the Virtual I/O Server (VIOS) to share physical I/O resources among client logical partitions, a problem was fixed for memory relocation errors during page migrations for the virtual control blocks.  These errors caused a CEC termination with SRC B700F103.  The memory relocation could be part of the processing for the Dynamic Platform Optimizer (DPO), Active Memory Sharing (AMS) between partitions, mirrored memory defragmentation, or a concurrent FRU repair.
  • A problem was fixed that could result in unpredictable behavior if a memory UE is encountered while relocating the contents of a logical memory block during one of these operations:
    - Reducing the size of an Active Memory Sharing (AMS) pool.
    - On systems using mirrored memory, using the memory mirroring optimization tool.
    - Performing a Dynamic Platform Optimizer (DPO) operation.
  • A problem was fixed for PCIe link width faults on the  I/O expansion drawer (F/C #EMX0) to only log the SRC B7006A8B once for each FRU instead of having multiple SRCs and call outs for the same part.
  • A problem was fixed for a wrong state for the PCIe link LEDs (lit when link has failed) to the I/O expansion drawer with feature code #EMX0.  The fix ensures that the link operational LEDs are not lit when the link to the I/O drawer has failed.
  • A problem was fixed for an incorrect SRC of B7006A9F logged for I/O drawer VPD mismatch during an enclosure serial number update of the I/O drawer (F/C #EMX0).  The incorrect SRC was logged if the non-primary service path module (right bay) was in a failed state.
  • A problem was fixed for a SRC B7006A84 PCIe link down event not being reported as a failed link for the I/O expansion drawer (F/C #EMX0) in the PCIe topology status in the Advanced System Management Interface (ASMI) or on the management console.
  • A problem was fixed for the Live Partition Mobility (LPM) migration of virtual devices to a Power8 system to update each virtual device location code correctly to reflect the location code in the target system instead of the location code in the source system.  This problem prevented the management console from being able to look up AIX Object Data Manager (ODM) names for the virtual devices, so operations such as remove on the device could not be performed.
  • A problem was fixed for PCIe adapters requesting PCI I/O space that triggers a SRC BA1800007 error log.  This SRC should not have been logged since PCI I/O space is not supported by Power8 systems.  The SRC log is now suppressed.
  • A problem was fixed for a processor core unit being deconfigured but not guarded for a SRC B113E504 processor error in Hostboot with fault isolation register (FIR) code "RC_PMPROC_CHKSLW_NOT_IN_ETR" that caused the CEC to go to termination.  By guarding the failed processor core, the fix ensures the core is not used on the re-IPL of the CEC.
  • A security problem was fixed in OpenSSL for memory leaks that allowed remote attackers to cause a denial of service (out of memory on the service processor). The Common Vulnerabilities and Exposures issue numbers are CVE-2014-3513 and CVE-2014-3567.
  • A security problem in GNU Bash was fixed to prevent arbitrary commands hidden in environment variables from being run during the start of a Bash shell.  Although GNU Bash is not actively used on the service processor, it does exist in a library so it has been fixed.  This is IBM Product Security Incident Response Team (PSIRT) issue #2211.  The Common Vulnerabilities and Exposures issue numbers for this problem are CVE-2014-6271, CVE-2014-7169, CVE-2014-7186, and CVE-2014-7187.
  • A problem was fixed to add failure recovery in the early boot of the service processor so that the boot is retried on failure instead of the service processor going unresponsive with SRC B1817212 on the operations panel.
  • A problem was fixed for isolating and repairing DIMM memory failures at the byte level without affecting other ranks of memory. This fix substantially reduces the FRU call outs of DIMMS for memory problems.
  • A security problem was fixed in OpenSSL where the service processor would, under certain conditions, accept Diffie-Hellman client certificates without the use of a private key, allowing a user to falsely authenticate.  The Common Vulnerabilities and Exposures issue number is CVE-2015-0205.
  • A security problem was fixed in OpenSSL to prevent a denial of service when handling certain Datagram Transport Layer Security (DTLS) messages.  A specially crafted DTLS message could exhaust all available memory and cause the service processor to reset.  The Common Vulnerabilities and Exposures issue number is CVE-2015-0206.
  • A security problem was fixed in OpenSSL to prevent a denial of service when handling certain Datagram Transport Layer Security (DTLS) messages.  A specially crafted DTLS message could cause a null pointer dereference and reset the service processor.  The Common Vulnerabilities and Exposures issue number is CVE-2014-3571.
  • A security problem was fixed in OpenSSL to fix multiple flaws in the parsing of X.509 certificates.  These flaws could be used to modify an X.509 certificate to produce a certificate with a different fingerprint without invalidating its signature, and possibly bypass fingerprint-based blacklisting.  The Common Vulnerabilities and Exposures issue number is CVE-2014-8275.
  • A security vulnerability, commonly referred to as GHOST, was fixed in the service processor glibc functions gethostbyname() and gethostbyname2(), which allowed remote users of the functions to cause a buffer overflow and execute arbitrary code with the permissions of the server application (an illustrative sketch of the call pattern appears after this list).  There is no way to exploit this vulnerability on the service processor, but it has been fixed to remove the vulnerability from the firmware.  The Common Vulnerabilities and Exposures issue number is CVE-2015-0235.
  • A problem was fixed for an incorrect SRC logged for an unplugged cable to the PCIe I/O expansion drawer (F/C #EMX0).  A B7006A88 SRC was errantly logged, calling out the cable as bad hardware that needs to be replaced.  This is replaced with SRC B7006A82, which indicates that a cable to a PCIe fanout module in the I/O expansion drawer is unplugged.
  • A problem was fixed for missing dump data for cores and L3 cache memory when there is a core checkstop and deconfiguration of the core.
  • A problem was fixed for a false power supply fan failure with SRC 1100152F.  If AC power to the power supply was interrupted, the SRC 11001525 would have been logged for a bad fan with a call out of the power supply for replacement.
  • A problem was fixed for a partition deletion error on the management console with error code 0x4000E002 and message "...insufficient memory for PHYP".  The partition delete operation has been adjusted to accommodate the temporary increase in memory usage caused by memory fragmentation, allowing the delete operation to be successful.
  • A problem was fixed for disruptive firmware updates to prevent false reference clock failures with SRC B1814805 and a hang in the IPL for the CEC.
  • A problem was fixed for a memory leak associated with the logging of SRC B1561311 for a bad voltage regulator module (VRM).
  • A problem was fixed for the processor module replacement process to prevent VPD corruption on the primary and redundant VPD chips on the new processor module.  This corruption resulted in the processor being unusable with HostBoot failing with unrecoverable errors (UEs) of SRCs BC8A090F and BC8A1701.
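  The following is a minimal, illustrative C sketch (not part of the firmware) of how the Bash flaw noted above (CVE-2014-6271, "Shellshock") was triggered: a value shaped like a bash function definition is exported through the environment with an extra command appended after the closing brace, and an unpatched bash executed that appended command while importing the function at startup.  The variable name and messages are hypothetical; on a patched bash only the intended command runs.

    /* Illustrative sketch only, assuming a C toolchain and bash are available. */
    #include <stdlib.h>

    int main(void)
    {
        /* Export a value that looks like a bash function definition, with an
         * extra command appended after the closing brace. */
        setenv("EXPLOIT_TEST", "() { :; }; echo appended command ran", 1);

        /* Starting bash imports exported functions; an unpatched bash also
         * executed the appended command during startup. */
        int rc = system("bash -c 'echo bash started'");
        return (rc == -1) ? 1 : 0;
    }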
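  Similarly, the GHOST issue noted above (CVE-2015-0235) was reached through glibc name-resolution calls on host names built only from digits and dots and long enough to overflow an internal buffer.  The sketch below only illustrates the call pattern involved; it is not a proof of concept, the name length is arbitrary, and on a fixed glibc the lookup simply fails to resolve.

    /* Illustrative sketch only; not a proof of concept. */
    #include <netdb.h>
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        /* A long host name consisting solely of digits, the class of input
         * handled by the glibc code path that contained the flaw. */
        char name[1024];
        memset(name, '0', sizeof(name) - 1);
        name[sizeof(name) - 1] = '\0';

        struct hostent *he = gethostbyname(name);
        printf("lookup result: %s\n", he ? "resolved" : "not resolved");
        return 0;
    }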
System firmware changes that affect certain systems
  • HIPER/Pervasive: DEFERRED:  On a system configured for a large number of PCIe adapters across multiple PCIe I/O expansion drawers (F/C #EMX0), a problem was fixed so that the PCIe adapters work correctly in the system.  Previously, the PCIe interrupt servicing could deadlock, causing the PCIe adapter cards to become unresponsive.
  • For a system with Virtual Trusted Platform Module (VTPM) partitions, a problem was fixed for a management console error that occurred while restoring a backup profile and caused the system to go to the management console "Incomplete" state.  The failed system had a suspended VTPM partition and a B7000602 SRC logged.
  • For systems with IBM i partitions, a problem was fixed for the "5250 Application Capable" capability so that it is passed to the IBM i partition as "True" if purchased.  Previously, the capability was not sent to the partition, which could prevent the performance benefit of the "Fast Green Screen Performance" feature in IBM i.  There is a delay of up to 15 minutes after this fix is installed before it becomes active on the system.  If the updated capability property does not show as "True" in the management console CEC properties, this is a slowness in the refresh of the capability properties to the management console and not a problem with the fix.  To resolve this display issue, rebuild the managed system on the management console and then wait up to one hour for the CEC property capability "5250 Application Capable" to be updated to "True".
  • On a system with a Linux partition, a problem was fixed for the Linux "lsslot" command so that it is able to find an F/C #EC41 or #EC42 PCIe 3D graphics adapter installed in the CEC, instead of showing the slot as "empty".  The graphics adapter worked correctly even though its slot showed as "empty".
  • On systems with a PCIe 3D graphics adapter (F/C #EC41 or #EC42) in a partition, a problem was fixed for a partition hang or BA21xxxx error conditions during partition initialization.
  • A problem was fixed for certain workloads that caused the system to enter safe mode (a mode for running at minimum processor frequencies) when the On-chip controllers (OCCs) did not get the Analog Power Subsystem Sweep (APSS) frequency control data within the OCC timeout period.  The timeout for an OCC update has been increased so the OCC can tolerate periods of high bus use that slow down the APSS communication.
  • On a system with redundant service processors, a problem was fixed for a bad pointer reference in the mailbox function during data synchronization between the two service processors.  Dereferencing the bad pointer caused a core dump, a reset/reload, and a fail-over to the backup service processor.
SC820_051_047 / FW820.03

01/27/15
Impact: Serviceability         Severity:  SPE

System firmware changes that affect all systems

  • A problem was fixed in concurrent firmware update to prevent the secondary service processor from going to a failed state.
  • A problem was fixed for the power supply fans to monitor both rotors instead of one to prevent a failure in one rotor from shutting down the power supply.
  • A problem was fixed for firmware updates to reduce the number of informational B181A85E SRCs for an expected SQL lock condition during a database transaction.  Previously, several thousand B181A85E SRC entries were created for the error log, slowing performance of the service processor and flooding the error log.
  • A problem was fixed for reset/reload failures caused by excessive synchronization of thermal management data with the redundant service processor.
  • A problem was fixed for failovers to the secondary service processor failing with SRC B1818601, caused by a bad database object reference.

System firmware changes that affect certain systems

  • For a system with memory mirroring activated and a memory block size of 16 MB, a problem was fixed for a system dump that caused Hypervisor Real Mode Offset (HRMO) data structure corruption in the physical memory map.  This problem could cause concurrent firmware update failures or cause subsequent system dumps to be corrupted.
SC820_048_047 / FW820.02

12/01/14
Impact:  New      Severity:  New

New Features and Functions
  • GA Level