Troubleshoot: Availability group exceeded RTO

After an automatic failover or a planned manual failover without data loss on an availability group, you may find that the failover time exceeds your recovery time objective (RTO). Or, when you estimate the failover time of a synchronous-commit secondary replica (such as an automatic failover partner) using the method in Monitor performance for Always On Availability Groups, you find that it exceeds your RTO.

If your automatic failover still has not completed, see Troubleshooting automatic failover problems in SQL Server 2012 Always On environments.

The following sections describe the common causes for a failover time that exceeds RTO.

Reporting workload blocks the redo thread from running

Redo thread falls behind due to resource contention

The redo thread on the secondary replica is blocked from making data definition language (DDL) changes by a long-running read-only query.

On the secondary replica, the read-only queries acquire schema stability (Sch-S) locks. These Sch-S locks can block the redo thread from acquiring the schema modification (Sch-M) locks it needs to make DDL changes. A blocked redo thread cannot apply log records until it is unblocked. Once unblocked, it can continue to catch up to the end of the log and allow the subsequent undo and failover process to proceed.

When the redo thread is blocked, an extended event called sqlserver.lock_redo_blocked is generated. Additionally, you can query the DMV sys.dm_exec_requests on the secondary replica to find out which session is blocking the redo thread, and then you can take corrective action.
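
As a minimal sketch of the DMV check, a query along the following lines can be run on the secondary replica to list redo requests that are currently blocked, along with the session blocking them. It assumes that redo work appears in sys.dm_exec_requests with the command DB STARTUP, or with a command starting with PARALLEL REDO on versions that use parallel redo. Verify these command values in your own environment before relying on the filter.

```sql
-- Run on the secondary replica.
-- Assumption: the redo thread is reported with command = 'DB STARTUP',
-- or with a 'PARALLEL REDO...' command where parallel redo is in use.
SELECT r.session_id,
       r.command,
       r.blocking_session_id,               -- session holding the conflicting lock
       r.wait_type,                         -- a Sch-M lock wait appears as LCK_M_SCH_M
       r.wait_time,                         -- milliseconds spent waiting so far
       r.wait_resource,
       DB_NAME(r.database_id) AS database_name
FROM sys.dm_exec_requests AS r
WHERE (r.command = N'DB STARTUP' OR r.command LIKE N'PARALLEL REDO%')
  AND r.blocking_session_id <> 0;           -- keep only requests that are blocked
```

If the query returns a row, blocking_session_id identifies the read-only session to investigate. An empty result suggests that the redo thread is not blocked at that moment.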