Chapter 8 — Crash Recovery
8.1 Failure Classification
Database systems must handle three categories of failures, each requiring a different recovery strategy:
| Failure Type | Cause | What is Lost | Recovery Method |
|---|
| Transaction failure | Logical error (overflow, division by zero, constraint violation, user cancel, deadlock) | Only that transaction's uncommitted changes | Rollback the single transaction using undo log |
| System crash | Power failure, OS kernel panic, hardware fault — but disk survives intact | Everything in RAM (entire buffer pool contents) | Redo/undo using the write-ahead log from the last checkpoint |
| Disk failure (media failure) | Disk head crash, corrupt sectors, catastrophic hardware failure | Contents of the failed disk | Restore from backup + replay archived log records |
Stable Storage
Stable storage is storage that survives any single failure. Approximated in practice by:
- Writing data to multiple physically independent disks (RAID 1, RAID 10) before acknowledging the write
- The probability of ALL copies failing simultaneously is engineered to be negligibly small
Block-level operations (important for understanding WAL):
- Input(B): transfer disk block B from disk to a RAM buffer frame
- Output(B): transfer RAM buffer block B to disk (write to stable storage)
8.2 Write-Ahead Log (WAL) and Log-Based Recovery
The Fundamental WAL Rule
Before a modified data page is written to disk (Output), the corresponding log record describing that modification MUST already be in stable storage.
This rule ensures that even if the system crashes immediately after writing a data page, the log record already exists on disk — enabling recovery by undo or redo.
Log Record Format
Each log record describes a single operation by a single transaction:
| Record Type | Format | Meaning |
|---|
| Start | <Ti, start> | Transaction Ti has begun |
| Update | <Ti, Xj, old_value, new_value> | Ti updated item Xj from old_value to new_value |
| Commit | <Ti, commit> | Ti has committed; all changes made durable |
| Abort | <Ti, abort> | Ti has been rolled back; changes undone |
| Checkpoint | <checkpoint, [T1, T2, ...]> | Checkpoint taken with active transactions listed |
Undo: use old_value to restore Xj before Ti's modification
Redo: use new_value to reapply Ti's modification to Xj
Immediate vs Deferred Modification
| Immediate Modification | Deferred Modification |
|---|
| When data written to disk | Before the transaction commits (as soon as WRITE executes) | Only AFTER the transaction commits |
| Recovery operations needed | Both UNDO (uncommitted) AND REDO (committed but not yet flushed) | REDO only (nothing reaches disk before commit, so no undo needed) |
| Log records needed | Must log both old AND new values | Only new values needed |
| Performance | Better write concurrency | Simpler recovery logic |
Recovery Algorithm (with Checkpointing)
Recovery uses two phases. Both phases start from the most recent checkpoint, not from the beginning of the entire log.
Phase 1 — Redo Phase (forward scan from last checkpoint to end of log):
- Begin scanning forward from the last checkpoint record
- Redo every update found in the log (even from transactions that will ultimately be undone — simpler to redo everything, then undo)
- While scanning, build the undo-list:
- Add to undo-list: transactions that have a
<Ti, start> record but no <Ti, commit> record
Phase 2 — Undo Phase (backward scan from end of log):
- Scan the log backward from the end
- For each transaction Ti in the undo-list:
- Apply undo: set Xj = old_value for each update record of Ti
- Write a Compensation Log Record (CLR):
<Ti, Xj, new_val, old_val, CLR>
- When you encounter
<Ti, start>: write <Ti, abort> and remove Ti from undo-list
- Continue until the undo-list is empty
Checkpointing
A checkpoint reduces recovery time by establishing a known-good point:
- Write
<checkpoint, [active_transactions]> to log
- Flush all dirty (modified) buffer pages to disk
- Flush the log to stable storage
Recovery shortcut: During recovery, only scan the log from the most recent checkpoint. All transactions committed BEFORE the checkpoint are already on disk — no need to redo them.
Comprehensive Checkpoint Recovery — Fully Worked Example
LOG contents (in order):
<T0, start>
<T0, A, 1000, 900> ← T0: A changes from 1000 to 900
<T1, start>
── CHECKPOINT, active = {T0, T1} ── ← last checkpoint point
<T1, B, 2000, 2100> ← T1: B changes from 2000 to 2100
<T1, commit> ← T1 commits successfully
<T2, start>
<T2, C, 500, 600> ← T2: C changes from 500 to 600
** SYSTEM CRASH ** ← no commit for T0 or T2
────────────────────────────────────────────────────────
Recovery Procedure:
Step 0: Locate last checkpoint → "CHECKPOINT, active = {T0, T1}"
Initial undo-list = {T0, T1} (transactions active at checkpoint time)
────────────────────────────────────────────────────────
REDO PHASE — scan FORWARD from checkpoint to end of log:
Process <T1, B, 2000, 2100>: REDO → set B = 2100 ✓
Process <T1, commit>: T1 committed → REMOVE T1 from undo-list
undo-list = {T0}
Process <T2, start>: T2 not seen before → ADD T2 to undo-list
undo-list = {T0, T2}
Process <T2, C, 500, 600>: REDO → set C = 600 ✓ (will be undone in next phase)
After redo phase:
State: A=900, B=2100, C=600
undo-list = {T0, T2} (T0 started before checkpoint, never committed;
T2 started after, never committed)
────────────────────────────────────────────────────────
UNDO PHASE — scan BACKWARD from end of log:
Process <T2, C, 500, 600>: UNDO T2 → set C = 500; write <T2,C,600,500,CLR>
Process <T2, start>: write <T2, abort>; REMOVE T2 from undo-list
undo-list = {T0}
Process <T1, commit>: T1 NOT in undo-list → skip
Process <T1, B, 2000, 2100>: T1 NOT in undo-list → skip
(checkpoint record)
Process <T0, A, 1000, 900>: UNDO T0 → set A = 1000; write <T0,A,900,1000,CLR>
Process <T0, start>: write <T0, abort>; REMOVE T0 from undo-list
undo-list = {} → DONE
Final state after recovery:
A = 1000 (T0 rolled back — never committed)
B = 2100 (T1 redone — committed before crash)
C = 500 (T2 rolled back — never committed) ✓
8.3 Shadow Paging
Shadow paging is an alternative recovery technique that avoids using a traditional log. It uses two page tables instead.
Mechanism
| Component | Description |
|---|
| Shadow page table | The stable, committed page table pointing to data as of the last commit — never modified during an active transaction |
| Current page table | The working copy used during the transaction — updated as writes happen |
How it works step-by-step:
- At the start of a transaction, copy the current page table to use as the shadow (or the existing shadow remains as-is)
- During the transaction: all writes go to NEW disk pages. The current page table is updated to point to these new pages. The shadow table still points to the old pre-transaction pages.
- On COMMIT: Atomically write a single pointer (the DB root pointer) to the current page table. This single atomic write makes the current state the new committed state.
- On ABORT or CRASH (before commit): Discard the current page table. The shadow table is still intact, so the database is exactly as it was before the transaction started.
How Atomicity and Durability Are Ensured
- Atomicity: If crash before commit, shadow page table is untouched → automatic rollback to last committed state. No undo log is needed.
- Durability: After commit, the new root pointer is durably written. If crash after commit, shadow table now points to the new pages → the committed state persists.
Shadow Paging vs Log-Based Recovery
| Aspect | Shadow Paging | Log-Based Recovery |
|---|
| Undo logging | Not needed (just discard current page table) | Required (must log old values) |
| Redo logging | Not needed (keep new pages) | Required (must log new values) |
| Concurrency | Poor — very difficult to support multiple concurrent transactions sharing page tables | Excellent — log is append-only |
| Storage fragmentation | Yes — data pages spread across disk as new pages are allocated | No — pages updated in place |
| Garbage collection | Required (must free old shadow pages) | Not required |
| Practical adoption | Limited (SQLite uses a variant; mostly historical) | Universal (PostgreSQL, MySQL, Oracle, SQL Server) |
Exam answer — "Describe shadow paging for atomicity and durability, and state its main advantage and disadvantage over log-based recovery:"
Shadow paging maintains two page tables: the shadow (last committed state) and the current (in-progress). All writes go to new disk pages and the current table is updated. On commit, a single atomic pointer swap replaces the shadow with the current table — achieving durability. On crash before commit, the shadow is intact and no undo is needed — achieving atomicity.
Main advantage over log-based: No undo/redo log records are needed — simpler recovery.
Primary disadvantage: Very poor concurrency support (multiple concurrent transactions are difficult to manage) and storage fragmentation as data pages scatter across disk.
8.4 Remote Backup Systems (High Availability)
A remote backup system maintains a copy of the database at a geographically separate site to ensure availability when the primary site fails.
Architecture: The primary site writes to its WAL; log records are shipped to the backup site continuously or periodically. The backup site applies these log records to stay synchronized.
Types of Standby Systems
| Standby Type | Replication Mode | Failover Time | Data Loss Risk | Description |
|---|
| Hot standby | Synchronous — primary waits for backup to acknowledge every write | Near-zero (seconds) | Near-zero | Backup is always fully up-to-date; highest cost |
| Warm standby | Semi-synchronous — background replication with some lag | Minutes | Seconds to minutes | Good cost/availability balance for most systems |
| Cold standby | Periodic log shipping (every few hours or days) | Hours | Minutes to hours | Cheapest; acceptable for non-critical, low-RPO systems |
Transfer of control (failover procedure):
- Detect primary site failure (via heartbeat timeout or manual trigger)
- Backup site applies any remaining log records it has already received
- Backup site opens its database for client connections
- Clients are redirected to the backup site (via DNS change, load balancer update, or VIP failover)
Exam Quick-Reference — Chapter 8
- WAL rule: log record MUST reach stable storage BEFORE the corresponding data page is written to disk
- Log record format:
<Ti, Xj, old_val, new_val> for update; <Ti, commit> for commit; <Ti, start> for begin
- Immediate modification: data written before commit → need both UNDO and REDO
- Deferred modification: data written only after commit → REDO only (no undo)
- Checkpoint shortcut: start recovery from last checkpoint, not from the beginning of the entire log
- Redo phase: scan FORWARD from checkpoint; redo all updates; build undo-list of uncommitted transactions
- Undo phase: scan BACKWARD; undo all transactions in undo-list (those with no commit record)
- Shadow paging commit: atomic pointer swap from current page table to shadow → no undo log needed
- Shadow paging disadvantage: poor concurrency + storage fragmentation
- Hot standby: synchronous; near-zero failover + data loss. Cold standby: periodic log shipping; hours to recover