Abstract: Additional tuning option for the PowerHA tie breaker function.

Background: In a PowerHA cluster, a partition or “split brain” can occur when all normal communications are lost between nodes or sets of nodes, and the nodes on either side of the split remain active. By default, both sides of the split will try to access applications and shared storage, which could lead to undetected data corruption. PowerHA provides features to help prevent this from happening when a partition occurs. Working in conjunction with the RSCT and CAA components, there are several options that can be configured, including the use of an external “tie breaker” which can serve as an arbiter when a partition occurs and decide which side of the partition should continue processing and which side should be shut down.

APAR IJ22512 describes a problem where the PowerHA software does not properly handle notifications from the tie breaker mechanism. If you are reading this, you will already be familiar with the problem and have access to all applicable fixes. After applying those fixes, you are encouraged to test various split conditions to ensure that everything is working as expected.

Description: These instructions apply to the optional tuning parameter which is available when using the tie breaker feature. If you are using any other solution for split handling, changing this value will have no effect.

Environment: PowerHA accesses the tie breaker function through the RSCT subsystem. RSCT has a tuning parameter known as “post reserve wait time” which can be adjusted to achieve the desired results in your environment. To decide whether you need to tune this option, you first need to understand the different factors that can affect the performance of the tie breaker function.
1) The tie breaker, whether it is disk based or NFS based, is ideally situated somewhere outside of the infrastructure used by the cluster nodes, such that a failure of that infrastructure is less likely to affect the tie breaker. Because the tie breaker is remote, the performance of the network connecting the tie breaker will affect the speed at which RSCT can contact the tie breaker, receive response(s) and deal with any errors or retries.

2) The tie breaker itself will have its own performance characteristics which may be affected by other components. With the NFS based tie breaker, the performance of NFS itself and any contention or retries will affect how fast the tie breaker arrives at a conclusion.

3) Once RSCT receives a reply from the tie breaker, it will send notifications to the cluster node(s) on each side of the split. The node(s) on one side of the split will be informed that they “lost” the contest to access the tie breaker. Once this notification is received, it is up to those node(s) to react by either halting or taking some other orderly action to quiesce any applications that may try to access the shared storage.

As you can see, many factors may affect how fast RSCT can interact with the tie breaker and inform both sides of the results, and how fast PowerHA can then react to the results by shutting down, or by initiating takeover of resource groups, starting applications, etc. Because of this potential delay, the winning side of the partition may not want to immediately initiate takeover, as the losing side may not have yet received or fully processed the notification. In order to avoid any premature attempt at recovery by the winning side, the RSCT tie breaker feature has a tunable named “post reserve wait time” which can delay the notifications sent to the winning side.
If your testing demonstrates that this wait time is a factor, or could be a factor in the operation of the overall solution, you should continue reading to find out how to set this tunable. If your tie breaker solution is not producing the expected results for other reasons, changing this tunable is unlikely to have any effect. In this case you can always contact IBM support to see if there are any other options that may help address your specific problem.

By default the value of the post reserve wait time is 30 seconds. That is, the notifications sent to the winning side of the partition will be delayed up to 30 seconds so that the losing side has a chance to receive its notification and fully process the necessary actions.

As with many features in PowerHA, you configure the option(s) through PowerHA, but the function itself is provided by some other component like AIX or RSCT. In this case the post reserve wait time is configured through PowerHA while the run time execution occurs at the RSCT component level.

The post reserve wait time can be changed by adding the RSCT_TB_WAITTIME tunable to /etc/environment. This will make the new value applicable every time cluster services are restarted cluster wide, or when a Dynamic Automatic Reconfiguration Event (DARE) is run. If you make any changes to this value you should test thoroughly to make sure the desired results are achieved.

The following command can be used to query the active tie breaker for testing purposes:

    # lsrsrc -c IBM.PeerNode OpQuorumTieBreaker

Sample output will look like this:

    Resource Class Persistent Attributes for IBM.PeerNode
    resource 1:
            OpQuorumTieBreaker = "PowerHA_NFS_TieBreaker"

The "-c" option shows the class wide values instead of the per-node values. The "OpQuorumTieBreaker" attribute can only be set at the class level.

The default value for the post reserve wait time attribute is 30 seconds.
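For scripted split testing it can be convenient to extract just the tie breaker name from the lsrsrc output. The following is a minimal sketch and not part of PowerHA or RSCT: the function name tb_name is hypothetical, and it assumes the output follows the layout of the sample shown above.

```shell
#!/bin/sh
# tb_name: hypothetical helper, not an IBM-supplied command.
# Reads lsrsrc output on standard input and prints the quoted value of
# the OpQuorumTieBreaker attribute, if such a line is present.
tb_name() {
    sed -n 's/^[[:space:]]*OpQuorumTieBreaker[[:space:]]*=[[:space:]]*"\(.*\)"[[:space:]]*$/\1/p'
}

# Intended use on a cluster node (assumes lsrsrc is in PATH):
#   lsrsrc -c IBM.PeerNode OpQuorumTieBreaker | tb_name
```

Reading from standard input keeps the helper independent of how the output was obtained, so saved output from an earlier test run can be checked the same way.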
To change the post reserve wait time, you must add the following line to /etc/environment on every cluster node:

    RSCT_TB_WAITTIME=60

The value of 60 is only an example; set the value to whatever number works in your environment.

PowerHA configures the RSCT tie breaker whenever a DARE is run. With cluster services active you can use "Verify and Synchronize" to initiate the DARE. PowerHA also configures the RSCT tie breaker when the first node in the cluster is started, so if you do not use DARE after changing the value, you must stop cluster services on all nodes first, then start cluster services. You can use the lsrsrc command shown earlier to verify that your change has taken effect.
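Because the line must be present on every cluster node, and a repeated edit should not leave duplicate entries behind, the change to /etc/environment can be scripted. The following is a minimal sketch under stated assumptions: the function name set_tb_waittime is hypothetical, the script must be run as root on each node, and it does not itself trigger a DARE or restart cluster services.

```shell
#!/bin/sh
# set_tb_waittime VALUE [FILE]
# Hypothetical helper: sets RSCT_TB_WAITTIME=VALUE in FILE
# (default /etc/environment), replacing any existing entry.
set_tb_waittime() {
    value=$1
    file=${2:-/etc/environment}

    # Accept only a whole number of seconds.
    case $value in
        ''|*[!0-9]*)
            echo "set_tb_waittime: value must be a whole number of seconds" >&2
            return 1
            ;;
    esac

    # Refuse to create the file if it does not already exist.
    [ -f "$file" ] || { echo "set_tb_waittime: $file not found" >&2; return 1; }

    tmp="$file.tb.$$"
    # Copy every line except an existing RSCT_TB_WAITTIME entry,
    # then append the new entry.
    grep -v '^RSCT_TB_WAITTIME=' "$file" > "$tmp"
    echo "RSCT_TB_WAITTIME=$value" >> "$tmp"

    # Overwrite the original contents in place, then remove the work file.
    cp "$tmp" "$file" && rm -f "$tmp"
}

# Example (run on every cluster node):
#   set_tb_waittime 60
```

Taking the file name as an optional second argument allows the helper to be tried against a copy of /etc/environment before it is used on the real file.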