17. Availability, Backups, and Disaster Recovery #

17.1 Objectives and tiers #

Applications must meet defined availability objectives and be recoverable within acceptable recovery‑time (RTO) and data‑loss (RPO) limits. Each solution is assigned a tier, with explicit targets and dependencies documented at design time.

  • PENDING: Tier definitions and targets — Tier 1 RTO: __, RPO: __; Tier 2 RTO: __, RPO: __; Tier 3 RTO: __, RPO: __
  • Uptime targets (if applicable) must align to the assigned tier and contractual SLAs.

17.2 Architecture for availability #

Design to meet the tier: eliminate single points of failure, use health checks and graceful degradation, and protect critical dependencies (databases, message queues, object storage) with appropriate redundancy. Internet‑facing services must be protected with WAF/DDoS defenses (per Section 10/15). Planned maintenance must follow change windows with notice periods and rollback plans. PENDING: Change windows/notice by risk class

17.3 Backup scope and coverage #

Backups must cover all data and configuration required to restore service to the declared RPO:

  • Application data (databases, object stores, search indexes, queues as applicable).
  • Configuration state (schemas, environment config, secrets references, not raw secrets).
  • Artifacts and infrastructure definitions (release images, IaC templates) sufficient to recreate the environment.

Backups must be encrypted at rest and in transit, cataloged with retention periods, and validated for integrity after creation.
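
The integrity-validation requirement can be sketched as a checksum recorded in the catalog at backup time and re-verified afterward. This is an illustrative minimal implementation, not a prescribed tool; the catalog fields shown are assumptions.

```python
import hashlib
from pathlib import Path

def sha256_of(path: Path) -> str:
    """Stream a file in chunks and return its SHA-256 hex digest."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def catalog_backup(path: Path, retention_days: int) -> dict:
    """Record a backup in the catalog with its checksum and retention period."""
    return {"file": path.name, "sha256": sha256_of(path),
            "retention_days": retention_days}

def verify_backup(path: Path, entry: dict) -> bool:
    """Re-hash the stored backup and compare against the cataloged checksum."""
    return sha256_of(path) == entry["sha256"]
```

Running `verify_backup` immediately after creation satisfies the "validated for integrity after creation" clause; the same check can be reused during restore tests (Section 17.6).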

17.4 Immutability and isolation #

For Tier 1/critical datasets, use immutability/object‑lock or write‑once policies where supported to reduce ransomware risk. Keep at least one logically isolated backup copy (different account/tenant or provider region in Canada). PENDING: Immutability requirement — Yes/No per tier
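
The isolation and immutability rules above lend themselves to automated checks. A hedged sketch, assuming a simple inventory of backup copies (the field names and region strings are hypothetical):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class BackupCopy:
    account: str         # storage account/tenant holding the copy
    region: str          # e.g. "ca-central-1" (hypothetical region name)
    object_locked: bool  # immutability/object-lock or write-once enabled

def has_isolated_copy(copies: list[BackupCopy], primary_account: str) -> bool:
    """True if at least one copy lives outside the primary account/tenant."""
    return any(c.account != primary_account for c in copies)

def tier1_immutability_ok(copies: list[BackupCopy]) -> bool:
    """For Tier 1 datasets: every copy should be object-locked where supported."""
    return bool(copies) and all(c.object_locked for c in copies)
```

Checks like these can run in CI against IaC definitions so a misconfigured copy is caught before it matters, rather than during an incident.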

17.5 Key management for recoverability #

Backup encryption keys must be managed in a vault/KMS with dual control for deletion/rotation, and backed up or escrowed so restores are possible during key service outages. Do not co‑locate keys and encrypted backups in the same blast radius without additional controls.
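
The dual-control rule for key deletion/rotation can be expressed as a small gate: at least two distinct approvers, neither of whom is the requester. This is a sketch of the policy logic only; real enforcement would live in the vault/KMS with role checks and audit logging.

```python
def dual_control_approved(approvers: set[str], requester: str) -> bool:
    """Dual control: require two or more approvers independent of the requester.

    A sketch of the policy above; role validation and audit logging are omitted.
    """
    independent = approvers - {requester}
    return len(independent) >= 2
```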

17.6 Restore testing and evidence #

Perform restore tests at a cadence aligned to tier to demonstrate RTO/RPO feasibility and data integrity.

  • PENDING: Restore test cadence — Tier 1: [Monthly/Quarterly], Tier 2: [Quarterly/Semi‑annual], Tier 3: [Semi‑annual/Annual]
  • Evidence must include what was restored, from what timestamp/snapshot, duration (start–finish), validation steps (checksums/app verification), and any issues found with remediation actions.
  • Store evidence with the release/operations record. PENDING: Repository/path for storing restore evidence
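
The evidence fields listed above map naturally onto a structured record. A minimal sketch, assuming the field names shown (they are illustrative, not a mandated schema):

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta

@dataclass
class RestoreEvidence:
    """Evidence record for a restore test (field names are illustrative)."""
    restored_items: list[str]     # what was restored
    snapshot_timestamp: datetime  # from what timestamp/snapshot
    started: datetime             # restore start
    finished: datetime            # restore finish
    validation_steps: list[str]   # checksums, app-level verification
    issues: list[str] = field(default_factory=list)
    remediation: list[str] = field(default_factory=list)

    @property
    def duration(self) -> timedelta:
        """Observed restore duration, for comparison against the tier's RTO."""
        return self.finished - self.started
```

Serializing such records to the designated evidence repository makes RTO/RPO feasibility auditable release over release.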

17.7 Runbooks, rollback, and DR procedures #

Maintain concise runbooks for: (a) rollback from a failed deployment; (b) point‑in‑time restore; (c) full environment rebuild using IaC and release artifacts. Runbooks must list pre‑checks, execution steps, validation tests, and decision points. Tabletop these procedures at least annually and after major changes.
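
The runbook structure above (pre-checks, execution steps, validation tests) can be sketched as a simple executor. Decision points are reduced to boolean callables here for illustration; a real runbook would pause for human confirmation at each one.

```python
from typing import Callable

def run_runbook(pre_checks: list[Callable[[], bool]],
                steps: list[Callable[[], None]],
                validations: list[Callable[[], bool]]) -> str:
    """Execute a runbook: abort on any failed pre-check, run steps in order,
    then confirm validation tests before declaring success."""
    if not all(check() for check in pre_checks):
        return "aborted: pre-check failed"
    for step in steps:
        step()
    if not all(v() for v in validations):
        return "failed: validation did not pass"
    return "success"
```

Encoding runbooks this way also makes the annual tabletop exercise partially automatable: pre-checks and validations can be executed for real even when the steps are only walked through.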

17.8 Monitoring and failover drills #

Monitor availability, latency, error rates, and queue depths against SLOs, and alert on threshold breaches with on‑call targets. Exercise failover/fallback for critical components where feasible without customer impact. PENDING: On‑call acknowledgment/response targets (Sev1 ≤ __ min; Sev2 ≤ __ min)
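
Threshold-breach alerting over the metrics named above can be sketched as a comparison of current values against per-metric SLO thresholds (metric names and values here are hypothetical):

```python
def breached_slos(metrics: dict[str, float],
                  thresholds: dict[str, float]) -> list[str]:
    """Return the names of metrics whose current value exceeds its threshold."""
    return [name for name, value in metrics.items()
            if name in thresholds and value > thresholds[name]]
```

In practice this logic lives in the monitoring platform; the point is that every SLO in the tier definition should have a corresponding alert threshold and an on-call route.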

17.9 Change safety for platform updates #

Platform/security patches with potential availability impact must follow change control with staging validation, canary/blue‑green (for higher risk), and a documented rollback criterion. Communicate customer‑visible impact and maintenance windows in advance per Section 18.
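
A documented rollback criterion for a canary rollout can be as simple as comparing the canary's error rate to the baseline. The thresholds below (`max_ratio`, `min_absolute`) are illustrative defaults, not mandated values:

```python
def should_roll_back(canary_error_rate: float, baseline_error_rate: float,
                     max_ratio: float = 2.0, min_absolute: float = 0.01) -> bool:
    """Rollback criterion sketch: canary errors materially above baseline.

    The min_absolute floor avoids rolling back on noise when both rates
    are near zero; thresholds are illustrative, not policy-mandated.
    """
    if canary_error_rate < min_absolute:
        return False
    return canary_error_rate > baseline_error_rate * max_ratio
```

Writing the criterion down as code (or as explicit numbers in the change record) removes ambiguity during the change window, when the rollback decision must be fast.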

17.10 Vendor and subprocessor dependencies #

Document external dependencies (e.g., email/SMS providers, shipping APIs) and their impact on availability. For each, define local mitigations or queuing during outages and the notification/coordination protocol with the vendor. Ensure subprocessors meet comparable backup/DR expectations and provide evidence upon request.
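
The "queuing during outages" mitigation can be sketched as a local outbox: messages that fail to send are queued in order and flushed when the vendor recovers. This is a minimal in-memory illustration; a production outbox would be durable.

```python
from collections import deque
from typing import Callable

class OutboxSender:
    """Queue messages locally when a vendor call fails; flush on recovery."""

    def __init__(self, send: Callable[[str], bool]):
        self._send = send  # returns False while the vendor is unavailable
        self._pending: deque[str] = deque()

    def submit(self, message: str) -> None:
        # Preserve ordering: if anything is already queued, queue this too.
        if self._pending or not self._send(message):
            self._pending.append(message)

    def flush(self) -> int:
        """Retry queued messages in order; stop at the first failure."""
        sent = 0
        while self._pending and self._send(self._pending[0]):
            self._pending.popleft()
            sent += 1
        return sent
```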

17.11 Residency for DR #

DR and backup storage must comply with residency constraints (Canada by default). Cross‑border DR sites are not permitted without an approved residency exception per Section 9.2. Ensure runbooks and configurations do not fail open to non‑approved regions.
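
The fail-closed residency check can be sketched as an allowlist comparison: every configured DR/backup region must be approved, and an empty configuration is treated as a failure rather than a pass. The region names below are hypothetical; the real allowlist comes from the residency policy referenced in Section 9.2.

```python
# Hypothetical approved Canadian regions; the authoritative allowlist is
# defined by the residency policy (Section 9.2), not this sketch.
APPROVED_REGIONS = {"ca-central-1", "ca-west-1"}

def residency_ok(configured_regions: set[str],
                 approved: set[str] = APPROVED_REGIONS) -> bool:
    """Fail closed: every configured region must be in the approved set,
    and an empty configuration does not pass."""
    return bool(configured_regions) and configured_regions <= approved
```

Running this against rendered IaC output in CI helps ensure configurations cannot silently "fail open" to non-approved regions.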

17.12 Notification and status updates #

For incidents threatening RTO/RPO targets or causing customer‑visible downtime, notify CLD promptly and provide status updates on a regular cadence until recovery. PENDING: Notification window (e.g., immediate + written within __ hours) and update cadence

17.13 Exceptions #

Any deviation from availability targets, backup scope, restore cadence, or DR procedures requires a documented exception with scope, rationale, risk assessment, compensating controls (e.g., increased snapshot frequency, additional monitoring), owner, approval (ISO + Executive Sponsor), and expiry.