Lease, cluster, & health check timeouts

Differences in hardware, software, and cluster configuration, together with differing application requirements for uptime and performance, call for specific lease, cluster, and health check timeout values. Certain applications and workloads require more aggressive monitoring to limit downtime following hard failures. Others need more tolerance for transient network issues and waits caused by high resource usage, and can accept slower failovers.

Multiple services on each node work to detect failures. The cluster service could detect quorum loss, the resource DLL could detect an issue surfaced by Always On health detection, or a manual failover might be initiated directly on the primary instance. The cluster service, the resource host, and the SQL Server instance synchronize with each other via RPC, shared memory, and T-SQL. In most scenarios, these services communicate successfully. However, this communication isn't perfectly reliable, even between services on the same machine. Furthermore, the availability group (AG) needs to withstand system-wide events like network and disk failures, which might prevent communication or interrupt functionality. With many failure cases and without fully dependable communication between services, the AG depends on various failover detection mechanisms that detect and respond to failures independently of each other, so the cluster state is always consistent for all nodes.

Resource constraints such as high CPU, disk latency, or memory exhaustion can trigger an Always On availability group lease timeout. When a lease timeout occurs, the most recent performance monitor data for CPU utilization, memory utilization, and disk read and write latency is reported in the Windows Failover Cluster Log along with the lease timeout.

Likewise, resource constraints can also trigger a health check timeout. Starting with SQL Server 2025 (17.x) Preview, the same performance monitor counters are now reported in the Windows Failover Cluster Log when a health check timeout is detected, similar to the lease timeout diagnostic output.
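When reviewing a cluster log for these events, it can help to filter the availability group resource entries programmatically. The following Python snippet is a minimal sketch, not an official tool. It assumes each log entry sits on a single line in the exported log file and follows the layout of the sample output in this section (process and thread ID, timestamp, level, `[RES] SQL Server Availability Group: [<AG name>] <message>`). The function name `find_health_check_failures` and the marker strings are illustrative choices based on that sample, and other builds might word the messages differently.

```python
import re
from datetime import datetime

# Assumed entry layout, based on the sample output in this article:
# <pid>.<tid>::<yyyy/mm/dd-hh:mm:ss.fff> <LEVEL> [RES] SQL Server Availability Group: [<ag>] <message>
ENTRY = re.compile(
    r"(?P<pid>[0-9a-fA-F]{8})\.(?P<tid>[0-9a-fA-F]{8})::"
    r"(?P<ts>\d{4}/\d{2}/\d{2}-\d{2}:\d{2}:\d{2}\.\d{3})\s+"
    r"(?P<level>[A-Z]+)\s+\[RES\]\s+"
    r"SQL\s+Server\s+Availability\s+Group:\s+"
    r"\[(?P<ag>[^\]]+)\]\s+(?P<msg>.*)"
)

# Message fragments taken from the sample output; treat them as assumptions.
HEALTH_CHECK_MARKERS = (
    "diagnostics heartbeat is lost",
    "AG health check failed",
)


def find_health_check_failures(log_text):
    """Return a list of dicts describing health check failure entries."""
    events = []
    for line in log_text.splitlines():
        match = ENTRY.search(line)
        if match is None:
            continue  # Not an AG resource entry in the assumed layout.
        message = match.group("msg").strip()
        if any(marker in message for marker in HEALTH_CHECK_MARKERS):
            events.append({
                "time": datetime.strptime(match.group("ts"), "%Y/%m/%d-%H:%M:%S.%f"),
                "level": match.group("level"),
                "ag": match.group("ag"),
                "message": message,
            })
    return events


# Two entries copied from the sample output, each joined onto one line.
SAMPLE_LOG = (
    "[Verbose] 000035b8.00001a64::2024/04/18-23:56:35.536 ERR [RES] "
    "SQL Server Availability Group: [hadrag] Failure detected, "
    "diagnostics heartbeat is lost\n"
    "[Verbose] 000035b8.00001a64::2024/04/18-23:56:35.536 WARN [RES] "
    "SQL Server Availability Group: [hadrag] AG health check failed, "
    "logging perf counter data collected so far\n"
)

if __name__ == "__main__":
    for event in find_health_check_failures(SAMPLE_LOG):
        print(event["time"], event["level"], event["ag"], event["message"])
```

To use the sketch against a real log, read the exported cluster log file into a string and pass it to `find_health_check_failures`. The performance counter lines that follow a match can then be inspected by timestamp.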

The following is a sample of the improved Windows Failover Cluster Log output for a health check timeout:

Output

2025 improved health check timeout

[Verbose] 000035b8.00001a64::2024/04/18-23:56:35.536 ERR [RES] SQL Server Availability
Group: [hadrag] Failure detected, diagnostics heartbeat is lost
[Verbose] 000035b8.00001a64::2024/04/18-23:56:35.536 WARN [RES] SQL Server Availability
Group: [hadrag] AG health check failed, logging perf counter data collected so far
[Verbose] 000035b8.00001a64::2024/04/18-23:56:35.536 WARN [RES] SQL Server Availability