20. Operations Handover and Support #

20.1 Purpose #

Operations handover ensures CLD and the vendor can operate, monitor, support, and restore the Application reliably after Go‑Live. This section defines the minimum artifacts, responsibilities, and support expectations.

20.2 Handover package (minimum) #

Before Go‑Live, the vendor must deliver a concise handover package accessible to CLD:

  • Runbooks: start/stop, health checks, common issues with remedies, dependency checks (DB/queues/object storage), and escalation paths (a scripted dependency‑check sketch follows this list).
  • Operational diagrams: high‑level architecture and data flows with dependencies and trust boundaries; ports/protocols; identity/roles overview.
  • Monitoring and alerts: list of dashboards, alert rules (names, thresholds), and how to acknowledge/escalate.
  • Backup/restore: what is backed up, where, retention, restore procedures and validation steps.
  • Access control: admin/support roles and permissions; break‑glass process; request/approval workflow for elevation.
  • Contact matrix: vendor and subprocessor support contacts (L1/L2/L3), hours of coverage, ticketing channels, and escalation order.
  • Change calendar/maintenance windows: planned maintenance slots and blackout periods.
    PENDING: Repository/path for storing the handover package
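
As an illustration of the depth expected from the runbook dependency checks, here is a minimal scripted sketch; every endpoint, host, and port in it is a hypothetical placeholder for the Application's actual dependencies.

```python
"""Minimal dependency health-check sketch for a runbook.

All endpoints, hosts, and ports below are hypothetical placeholders;
a real runbook would substitute the Application's actual dependencies.
"""
import socket
from urllib.request import urlopen

# Hypothetical dependency list: (name, kind, target)
DEPENDENCIES = [
    ("app-http",    "http", "https://app.example.internal/healthz"),
    ("database",    "tcp",  ("db.example.internal", 5432)),
    ("queue",       "tcp",  ("mq.example.internal", 5672)),
    ("objectstore", "http", "https://storage.example.internal/ping"),
]

def check_http(url: str, timeout: float = 5.0) -> bool:
    """Return True if the endpoint answers with HTTP 2xx."""
    try:
        with urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except OSError:  # URLError/HTTPError are OSError subclasses
        return False

def check_tcp(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def main() -> int:
    failures = []
    for name, kind, target in DEPENDENCIES:
        ok = check_http(target) if kind == "http" else check_tcp(*target)
        print(f"{name:12s} {'OK' if ok else 'FAIL'}")
        if not ok:
            failures.append(name)
    return 1 if failures else 0  # non-zero exit -> escalate per runbook

if __name__ == "__main__":
    raise SystemExit(main())
```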

20.3 Support model and hours #

Define the support model for the Application (vendor L1/L2/L3 vs CLD L1 with vendor L2/L3) and the hours of coverage. Specify the primary contact channel (ticketing portal, email, or phone) and who triages and owns an incident when multiple parties are involved. PENDING: Support hours (e.g., 24x7/Business hours); primary contact channel; ownership split

20.4 On‑call and escalation targets #

Incidents must route to an on‑call engineer with explicit acknowledgment and response targets. At minimum:

  • PENDING: Sev1 acknowledgment ≤ __ minutes; initial response ≤ __ minutes; update cadence (e.g., every __ minutes) until mitigation
  • PENDING: Sev2 acknowledgment ≤ __ minutes; update cadence (e.g., every __ hours)

Document the escalation chain (L1→L2→L3→Vendor SME→Executive) with time thresholds for handoff. Include phone numbers or paging mechanisms that bypass email if needed.
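
Once the PENDING targets are agreed, they can be encoded so tooling enforces them consistently. The sketch below illustrates one way to do that; every timing and handoff threshold in it is an illustrative placeholder, not an agreed value.

```python
"""Sketch of severity-based acknowledgment and escalation timers.

All minute/hour values are illustrative placeholders only; the agreed
targets are still PENDING in this section.
"""
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

@dataclass(frozen=True)
class SeverityPolicy:
    ack_within: timedelta      # acknowledgment target
    respond_within: timedelta  # initial-response target
    update_every: timedelta    # status-update cadence until mitigation

# Placeholder targets -- replace with the agreed PENDING values.
POLICIES = {
    "Sev1": SeverityPolicy(timedelta(minutes=15), timedelta(minutes=30),
                           timedelta(minutes=30)),
    "Sev2": SeverityPolicy(timedelta(minutes=30), timedelta(hours=1),
                           timedelta(hours=2)),
}

# Escalation chain with handoff thresholds (illustrative).
ESCALATION_CHAIN = [
    ("L1 on-call",        timedelta(minutes=0)),
    ("L2 engineer",       timedelta(minutes=30)),
    ("L3 engineer",       timedelta(hours=1)),
    ("Vendor SME",        timedelta(hours=2)),
    ("Executive contact", timedelta(hours=4)),
]

def current_escalation_level(opened_at: datetime, now: datetime) -> str:
    """Return who should own the incident given elapsed unmitigated time."""
    elapsed = now - opened_at
    owner = ESCALATION_CHAIN[0][0]
    for level, threshold in ESCALATION_CHAIN:
        if elapsed >= threshold:
            owner = level
    return owner

if __name__ == "__main__":
    opened = datetime(2024, 1, 1, 3, 0, tzinfo=timezone.utc)
    now = opened + timedelta(minutes=95)
    print(POLICIES["Sev1"].ack_within)            # 0:15:00
    print(current_escalation_level(opened, now))  # L3 engineer
```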

20.5 Monitoring and health #

Operate and maintain dashboards that expose availability, latency, error rates, queue depth, WAF blocks, auth/MFA failures, upload scan results, and external dependency health. Validate alert thresholds in staging and re‑validate after major releases. Periodically test alert delivery end‑to‑end (e.g., quarterly) and document results and fixes. PENDING: Dashboard list and alert test cadence
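
Alert delivery tests are most useful when scripted so they can run on the agreed cadence and leave a record. A minimal sketch follows, assuming a hypothetical webhook that accepts synthetic alerts and a hypothetical acknowledgment endpoint; neither corresponds to a specific monitoring product.

```python
"""Sketch of an end-to-end alert delivery test.

Fires a clearly-labeled synthetic alert into the paging webhook and
fails loudly if no acknowledgment is observed within the window. The
webhook URL, payload shape, and ack endpoint are all hypothetical.
"""
import json
import time
import urllib.request

ALERT_WEBHOOK = "https://alerts.example.internal/test-hook"  # placeholder
ACK_TIMEOUT_S = 300  # how long to wait for acknowledgment

def fire_synthetic_alert() -> str:
    """Send a test alert; return its correlation id (assumed field)."""
    payload = {
        "source": "quarterly-alert-test",
        "severity": "test",
        "summary": "SYNTHETIC ALERT - delivery test, no action required",
    }
    req = urllib.request.Request(
        ALERT_WEBHOOK,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return json.load(resp)["correlation_id"]  # assumed response field

def wait_for_ack(correlation_id: str) -> bool:
    """Poll the (hypothetical) ack endpoint until timeout."""
    deadline = time.monotonic() + ACK_TIMEOUT_S
    url = f"https://alerts.example.internal/acks/{correlation_id}"
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                if json.load(resp).get("acknowledged"):
                    return True
        except OSError:
            pass  # transient failure; keep polling
        time.sleep(15)
    return False

if __name__ == "__main__":
    cid = fire_synthetic_alert()
    ok = wait_for_ack(cid)
    # Record the result and any fixes per 20.5.
    print(f"alert delivery test: {'PASS' if ok else 'FAIL'} ({cid})")
```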

20.6 Ticketing and incident workflow #

Use a single ticket per incident with linked subtasks as needed. Tickets must include timeline, impact, suspected cause, actions taken, decisions (including rollback), and resolution details. For Sev1/Sev2, open the bridge (conference/war‑room) and keep a live log. After mitigation, convert the ticket to a problem record as appropriate for root cause analysis (RCA) and corrective actions. Provide a post‑incident report to CLD within PENDING: __ business days for Sev1 and Sev2 incidents.
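
For illustration, the minimum ticket contents above can be modeled as a structured record. The sketch below is not tied to any particular ticketing product; all field names are illustrative.

```python
"""Sketch of the minimum record a single incident ticket should carry,
mirroring the fields required in 20.6. Field names are illustrative."""
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class TimelineEntry:
    at: datetime
    note: str  # action taken, decision made, or observation

@dataclass
class IncidentTicket:
    ticket_id: str
    severity: str                     # e.g. "Sev1", "Sev2"
    impact: str                       # customer/business impact statement
    suspected_cause: str
    timeline: list[TimelineEntry] = field(default_factory=list)
    decisions: list[str] = field(default_factory=list)  # incl. rollback
    resolution: str = ""
    linked_subtasks: list[str] = field(default_factory=list)
    bridge_open: bool = False         # Sev1/Sev2: conference/war-room

    def log(self, note: str) -> None:
        """Append to the live log kept while the bridge is open."""
        self.timeline.append(TimelineEntry(datetime.now(timezone.utc), note))
```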

20.7 Change and configuration control #

All operational changes (config toggles, WAF rules, rate limits, scaling profiles) must follow change control appropriate to risk class (Section 18). Emergency changes must be recorded with rationale, impact, and rollback steps and reviewed in the PIR. Maintain versioned configuration (in code or parameter store) and avoid manual drift.
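
Manual drift is easiest to catch when the versioned configuration can be compared mechanically against what is actually deployed. A minimal sketch, assuming the configuration is rendered as JSON at hypothetical paths:

```python
"""Sketch of a drift check between the versioned configuration (source
of truth, e.g. in the repo or a parameter-store export) and what the
runtime actually reads. Both paths are hypothetical placeholders."""
import hashlib
import json
import sys
from pathlib import Path

VERSIONED = Path("config/production.json")    # committed, change-controlled
DEPLOYED = Path("/etc/app/production.json")   # what the runtime reads

def canonical_digest(path: Path) -> str:
    """Hash a canonical JSON rendering so key order cannot mask drift."""
    data = json.loads(path.read_text())
    canon = json.dumps(data, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canon.encode()).hexdigest()

def main() -> int:
    v, d = canonical_digest(VERSIONED), canonical_digest(DEPLOYED)
    if v == d:
        print("no drift: deployed config matches versioned config")
        return 0
    print(f"DRIFT DETECTED: versioned={v[:12]} deployed={d[:12]}")
    print("raise a change record or revert per the Section 18 risk class")
    return 1

if __name__ == "__main__":
    sys.exit(main())
```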

20.8 Capacity and cost management #

Track key capacity indicators (throughput, storage, concurrency) and set thresholds that trigger scale‑up or housekeeping actions. Provide monthly usage and cost summaries for cloud resources tied to the Application upon request, highlighting anomalies and forecasted needs. PENDING: Reporting cadence and format if required
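
Capacity thresholds are easiest to review and audit when expressed as data rather than prose. The sketch below uses illustrative indicator names and limits only; the real values depend on the Application's sizing.

```python
"""Sketch of capacity-indicator thresholds that trigger scale-up or
housekeeping actions. Indicator names and limits are illustrative."""

# (indicator, threshold, action when breached) -- placeholders only
THRESHOLDS = [
    ("requests_per_min",   5000, "scale out app tier"),
    ("storage_used_pct",     80, "run housekeeping / expand volume"),
    ("concurrent_sessions", 900, "raise session pool / scale out"),
    ("queue_depth",        1000, "add consumers / investigate backlog"),
]

def evaluate(metrics: dict[str, float]) -> list[str]:
    """Return the actions whose thresholds the current metrics breach."""
    actions = []
    for name, limit, action in THRESHOLDS:
        value = metrics.get(name)
        if value is not None and value >= limit:
            actions.append(f"{name}={value} >= {limit}: {action}")
    return actions

if __name__ == "__main__":
    sample = {"requests_per_min": 5200, "storage_used_pct": 61,
              "concurrent_sessions": 940, "queue_depth": 120}
    for line in evaluate(sample):
        print(line)
```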

20.9 Knowledge transfer and access #

Provide brief orientation for CLD stakeholders (Ops, ISO, Product Owner) covering runbooks, dashboards, and escalation. Ensure CLD has read access to necessary consoles or portals for visibility (or provide alternate reports) without granting excessive privileges. PENDING: Required read‑only access list or reporting alternatives

20.10 Maintenance windows and communications #

Adhere to agreed maintenance windows for planned work and communicate customer‑visible impact at least PENDING: __ business days in advance. For unplanned/emergency maintenance, notify CLD as soon as possible with reason, scope, expected duration, and rollback plan. Maintain a change calendar visible to CLD.

20.11 Subprocessor coordination #

If a subprocessor is involved during incidents or maintenance, the vendor remains the single point of accountability: coordinate on CLD’s behalf, hold the bridge, and ensure timelines and updates meet the agreed targets. Share subprocessor post‑incident statements or advisories relevant to CLD’s environment.

20.12 Documentation upkeep #

Keep runbooks, contact matrices, dashboards/alert rule lists, and dependency diagrams current. Update the handover package upon significant changes (e.g., new subprocessor, new dependency, materially changed workflow) and notify CLD. PENDING: Review cadence for documentation (e.g., semi‑annual)

20.13 Exceptions #

Any deviation from these operational requirements (e.g., temporary reduced coverage) requires a documented exception stating scope, rationale, compensating controls (e.g., enhanced monitoring, standby engineer), owner, ISO recommendation, Executive Sponsor approval, and expiry.