Failover issues in Always-on AG
- What is failover in always-on AG in SQL server.?
In SQL Server Always On Availability Groups (AG), failover is the process of switching the primary role of an availability replica from one instance of SQL Server to another. This ensures high availability and disaster recovery for databases within the availability group. During a failover, the current primary replica transitions to a secondary role, and a designated secondary replica assumes the primary role. This new primary replica then brings its databases online to accept client connections and transactions.
2. How many types of failovers are there in AG.?
Automatic Failover: This occurs automatically in response to a failure of the primary replica, provided the availability group is configured for synchronous-commit mode and automatic failover, and the target secondary replica is synchronized.
Planned Manual Failover: This is initiated by a database administrator for maintenance or other planned events. It requires the primary and target secondary replicas to be in synchronous-commit mode and synchronized, ensuring no data loss.
Forced Manual Failover (with possible data loss): This is a drastic measure used in disaster scenarios when the primary replica is unavailable and data loss might be acceptable to restore service quickly. It does not require synchronization
3. What are the issues for failover in AG.?
One of the issue with failover in AG is slowness or failing failover, failovers can be prolonged or fail entirely if synchronization is incomplete, or if there are issues with the Windows Server Failover Clustering (WSFC) cluster or network.
4. How to resolve slowness or failing failover in AG.?
Resolving slowness or failing failover in SQL Server Always On Availability Groups or Failover Cluster Instances involves identifying the root cause, which can range from network issues to resource contention or misconfigurations.
Network and Connectivity:
Network Latency: Need to maintain high network latency between replicas can significantly impact synchronization and failover speed.
Firewall Rules: Make sure necessary ports are open for communication between SQL Server instances and cluster nodes.
Client Protocols: Verify that Shared Memory or TCP/IP are enabled under Client Protocols for SQL Native Client Configuration.
Force Protocol Encryption: Check if this setting is enabled and if it’s impacting performance.
Resource Contention and Performance:
Disk I/O Latency: High I/O latency on shared storage or local disks can delay database recovery during failover. Use tools like IO Meter to test I/O performance.
CPU Usage: High CPU usage on the primary or secondary replica can slow down failover. Investigate and optimize resource-intensive queries or processes.
Transaction Log Management: Large or fragmented transaction logs can lead to longer recovery times during failover. Ensure proper transaction log shrinking and growth.
VLF Count: A high number of Virtual Log Files (VLFs) in the transaction log can also impact recovery time.
Cluster and Availability Group Configuration:
Maximum Failures in Specified Period: Increase this value in the cluster resource properties to allow for more transient failures before a complete failover.
Quorum Configuration: Ensure the cluster quorum is healthy and correctly configured.
Conclusion:
It is difficult to maintain network latency and drives performance but should keep monitoring and take care of issues before giving impact during the failovers needed. Try to facilitate environment based on business load on database server especially during peak level of business loads.