Forms Portal — Disaster Recovery Playbook
Append to: mkdocs-portal/docs/response-plans/drp.md
Classification: CONFIDENTIAL — Internal Use Only
Version: 1.0 · 2026-04-18
RTO / RPO targets
| Metric | Target | Basis |
|---|---|---|
| Recovery Time Objective (RTO) | 4 hours | Can operate on legacy PHP as fallback during restore |
| Recovery Point Objective (RPO) | 15 minutes | Cloud SQL PITR granularity |
| Attachment RPO | 0 (no loss) | GCS versioning + retention lock |
Asset inventory for DR
| Asset | Location | Recovery source |
|---|---|---|
| Cloud SQL instance gpus-forms-db | gpus-infra us-central1 | Automated backups (7d) + PITR |
| KMS keys gpus-forms-cmek, gpus-forms-dek-wrapper | gpus-forms keyring us-central1 | CANNOT be recreated if fully destroyed; re-encrypt data if lost |
| GCS gpus-forms-attachments | us-central1 | Versioning + retention lock |
| Cloud Run gpus-forms-backend | us-central1 | Container image in Artifact Registry; redeploys in <2 min |
| Secret Manager (HappyFox creds) | gpus-infra | 6-version history; restore previous version |
| YAML form definitions | gpus-infra-portals repo main branch | Git history |
DR-FP-01 — Cloud SQL instance lost (deletion, region outage)
Scenario: Instance gpus-forms-db is gone or unreachable for > 30 min
Step 1 — Confirm the outage
gcloud sql instances list --filter="name:gpus-forms-db"
gcloud sql operations list --instance=gpus-forms-db --limit=5
If the instance exists but is in a FAILED or MAINTENANCE state, wait or open a GCP support case. If the instance is gone, proceed to Step 2.
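The decision in Step 1 can be sketched as a small shell helper. This is a minimal illustration, not part of the established procedure: the function name `next_action_for_state` is hypothetical, and it assumes an empty state string means `gcloud sql instances describe` found no instance (an authentication failure would also yield an empty string, so check credentials first).

```shell
# Sketch: map a Cloud SQL instance state to the next action in DR-FP-01.
# next_action_for_state is a hypothetical helper name, not an existing tool.
# Assumption: an empty argument means the instance could not be found.
next_action_for_state() {
  case "$1" in
    RUNNABLE)           echo "none: instance is serving" ;;
    FAILED|MAINTENANCE) echo "wait or open a GCP support case" ;;
    "")                 echo "proceed to Step 2 (restore from backup)" ;;
    *)                  echo "unrecognised state '$1': investigate before acting" ;;
  esac
}

# Usage (needs gcloud credentials, so it is left commented out here):
#   state=$(gcloud sql instances describe gpus-forms-db --format='value(state)' 2>/dev/null)
#   next_action_for_state "$state"
```

Keeping the mapping in one function makes it easy to reuse the same logic in the quarterly cold-restore test.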
Step 2 — Restore from backup
List available backups:
gcloud sql backups list --instance=gpus-forms-db-old 2>/dev/null || \
gcloud sql backups list --filter="instance:gpus-forms-db"
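To choose a `<BACKUP_ID>` from that listing, one option is to filter for the newest successful run. The sketch below assumes the listing is produced with `--format='value(id,status,windowStartTime)'`, which emits tab-separated fields; the helper name `latest_successful_backup` is hypothetical.

```shell
# Sketch: print the id of the most recent SUCCESSFUL backup.
# Input (stdin): tab-separated lines "id<TAB>status<TAB>windowStartTime",
# as assumed from: gcloud sql backups list ... --format='value(id,status,windowStartTime)'
# RFC 3339 UTC timestamps sort correctly as plain strings, so a string
# comparison is enough to find the latest one.
latest_successful_backup() {
  awk -F '\t' '
    BEGIN { best = "" }
    $2 == "SUCCESSFUL" && $3 > best { best = $3; id = $1 }
    END { if (id != "") print id }
  '
}

# Usage (needs gcloud credentials, so it is left commented out here):
#   gcloud sql backups list --instance=gpus-forms-db \
#     --format='value(id,status,windowStartTime)' | latest_successful_backup
```

The function prints nothing when no successful backup exists, which the caller should treat as a stop condition rather than a default.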
If the original instance is fully deleted, you can still restore the backup into a new instance:
gcloud sql instances create gpus-forms-db \
--database-version=POSTGRES_15 \
--tier=db-f1-micro \
--region=us-central1 \
--availability-type=zonal \
--storage-size=10 \
--no-assign-ip \
--network=projects/gpus-infra/global/networks/gpus-vpc \
--database-flags=cloudsql.iam_authentication=on \
--backup-start-time=07:00 \
--retained-backups-count=7 \
--enable-point-in-time-recovery \
--deletion-protection \
--disk-encryption-key=projects/gpus-infra/locations/us-central1/keyRings/gpus-forms/cryptoKeys/gpus-forms-cmek
Restore the backup into it:
gcloud sql backups restore <BACKUP_ID> --restore-instance=gpus-forms-db --backup-instance=<BACKUP_SOURCE_INSTANCE>
Step 3 — Reattach IAM DB users
gcloud sql users create gpus-forms-backend@gpus-infra.iam --instance=gpus-forms-db --type=cloud_iam_service_account
gcloud sql users create rajesh.chhetry@greenpeace.us --instance=gpus-forms-db --type=cloud_iam_user
Step 4 — Verify data integrity
SSH to MAPLE, start the Cloud SQL Auth Proxy, connect with psql, and run the verification queries from Phase 1 Step 6 (Q1-Q4 for RLS/policies).
Step 5 — Redirect backend to restored instance
If the instance connection name changed:
gcloud run services update gpus-forms-backend --region=us-central1 \
--update-env-vars=CLOUDSQL_INSTANCE=gpus-infra:us-central1:gpus-forms-db
Step 6 — Smoke test
curl -sS -H "Authorization: Bearer $(gcloud auth print-identity-token)" \
$(gcloud run services describe gpus-forms-backend --region=us-central1 --format='value(status.url)')/health/deep
Expected: {"status":"ok","db":true,"kms":true}
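The expected response can be checked mechanically instead of by eye. The following is a rough sketch under stated assumptions: `health_ok` is a hypothetical helper, and it does plain substring matching on the three fields shown above rather than real JSON parsing (a JSON-aware tool would be more robust if one is available on the host).

```shell
# Sketch: succeed only if the health body reports status ok, db true, kms true.
# health_ok is a hypothetical helper; it strips whitespace and then looks for
# the three key/value pairs from the expected response as literal substrings.
health_ok() {
  health_body=$(printf '%s' "$1" | tr -d ' \t\r\n')
  for health_needle in '"status":"ok"' '"db":true' '"kms":true'; do
    case "$health_body" in
      *"$health_needle"*) ;;
      *) echo "health check missing $health_needle" >&2; return 1 ;;
    esac
  done
  return 0
}

# Usage: pass the curl output from Step 6 as the single argument, e.g.
#   health_ok "$response" && echo "smoke test passed"
```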
DR-FP-02 — Point-in-time recovery (logical corruption, bad data)
Scenario: Bug or malicious action caused bad data to land in DB; need to roll back to a specific timestamp
Step 1 — Identify target recovery timestamp
From audit_log:
Step 2 — PITR clone to new instance
gcloud sql instances clone gpus-forms-db gpus-forms-db-pitr \
--point-in-time="2026-MM-DDTHH:MM:SSZ"
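Because the timestamp in the clone command is a placeholder, it is worth validating the real value before running the clone. The sketch below only checks that the string has an RFC 3339 UTC shape; it does not confirm that the time falls inside the PITR retention window. The helper name `valid_pitr_timestamp` is hypothetical.

```shell
# Sketch: accept only RFC 3339 UTC timestamps such as 2026-04-18T14:30:00Z
# (optional fractional seconds allowed). valid_pitr_timestamp is a
# hypothetical helper; it checks shape only, not calendar validity.
valid_pitr_timestamp() {
  printf '%s\n' "$1" |
    grep -Eq '^[0-9]{4}-[0-9]{2}-[0-9]{2}T[0-9]{2}:[0-9]{2}:[0-9]{2}(\.[0-9]+)?Z$'
}

# Usage:
#   valid_pitr_timestamp "$TARGET_TS" || { echo "bad timestamp" >&2; exit 1; }
```

An unfilled placeholder such as `2026-MM-DDTHH:MM:SSZ` is rejected, which guards against pasting the template command unchanged.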
Step 3 — Validate clone (proxy + psql from MAPLE)
Step 4 — Cut over traffic
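Option A (recommended): dump the clone's affected rows and apply them to the live instance as a targeted fix
Option B (aggressive): redirect the backend to the clone by updating the CLOUDSQL_INSTANCE env var. Cloud SQL instances cannot be renamed, so the clone keeps its own name and connection name as the new primary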
Step 5 — Verify + document
DR-FP-03 — KMS key destroyed (catastrophic)
Scenario: The primary version of gpus-forms-dek-wrapper is destroyed and the scheduled-destruction grace period has expired
Reality check: This is near-unrecoverable. Encrypted submission fields can no longer be decrypted. Attachments are unaffected (separate CMEK).
Step 1 — Confirm
gcloud kms keys versions list --key=gpus-forms-dek-wrapper --keyring=gpus-forms --location=us-central1
Look for state: DESTROYED.
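The listing can also be filtered so that destroyed versions stand out from versions that are only scheduled for destruction, which can still be brought back with `gcloud kms keys versions restore`. This sketch assumes the listing is produced with `--format='value(name,state)'` (tab-separated); the helper name `flag_key_versions` and its output labels are hypothetical.

```shell
# Sketch: classify key versions from "name<TAB>state" lines on stdin.
# Prints "<version><TAB>unrecoverable" for DESTROYED versions and
# "<version><TAB>restorable" for DESTROY_SCHEDULED ones; other states are skipped.
# The version is taken as the last path component of the resource name.
flag_key_versions() {
  awk -F '\t' '
    $2 == "DESTROYED"         { n = split($1, p, "/"); print p[n] "\tunrecoverable" }
    $2 == "DESTROY_SCHEDULED" { n = split($1, p, "/"); print p[n] "\trestorable" }
  '
}

# Usage (needs gcloud credentials, so it is left commented out here):
#   gcloud kms keys versions list --key=gpus-forms-dek-wrapper \
#     --keyring=gpus-forms --location=us-central1 \
#     --format='value(name,state)' | flag_key_versions
```

How the version is recorded in `submission_fields.kms_key_version` (bare number or full resource name) is not stated here, so confirm the format before using this output in the Step 3 query.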
Step 2 — Determine blast radius
All submission_fields rows where kms_key_version matches the destroyed version are unrecoverable.
Step 3 — Mark affected submissions
UPDATE submissions SET status = 'purged',
error_message = 'DR-FP-03 KMS key destroyed, data unrecoverable'
WHERE id IN (SELECT DISTINCT submission_id FROM submission_fields WHERE kms_key_version = '<destroyed>');
Step 4 — Notify submitters
Using submissions.submitter_email, send notification requesting re-submission.
Step 5 — Lessons learned
Review why the key was destroyed. Was the grace period missed? Who had the cloudkms.cryptoKeyVersions.destroy permission? Remove that permission from all humans and restrict it to a break-glass service account only.
DR-FP-04 — Region-wide us-central1 outage
Scenario: Entire us-central1 is down for > 1 hour
Reality check: Forms portal is zonal (db-f1-micro doesn't support regional HA at that tier). Full region outage means forms portal is offline until region recovers. Legacy PHP app at forms.us.gl3 (on-prem Meraki WDC) continues to function and can serve as fallback.
Step 1 — Redirect DNS to legacy
- DNS: change forms.greenpeace.us CNAME to point at legacy Apache (requires DNS admin action)
- Legacy PHP forms remain functional for the outage duration
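After the DNS admin makes the change, the cutover can be confirmed by comparing the published CNAME with the intended target. The sketch below is illustrative only: `cname_matches` is a hypothetical helper, and `LEGACY_TARGET` stands in for the legacy Apache hostname, which is not specified in this playbook.

```shell
# Sketch: compare two hostnames, ignoring case and a single trailing dot
# (dig prints fully qualified names with a trailing dot).
# cname_matches is a hypothetical helper; an empty lookup result never matches.
cname_matches() {
  cname_a=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]'); cname_a=${cname_a%.}
  cname_b=$(printf '%s' "$2" | tr '[:upper:]' '[:lower:]'); cname_b=${cname_b%.}
  [ -n "$cname_a" ] && [ "$cname_a" = "$cname_b" ]
}

# Usage (needs network access; LEGACY_TARGET is an assumed variable):
#   cname_matches "$(dig +short forms.greenpeace.us CNAME)" "$LEGACY_TARGET" \
#     && echo "DNS now points at legacy"
```

The same check, with the Cloud Run target instead, can confirm the flip back once us-central1 returns.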
Step 2 — Communicate
- Slack #it-ops + staff email: "forms.greenpeace.us degraded, using legacy interim"
Step 3 — Recovery
- When us-central1 returns, DNS CNAME flips back
- Any submissions made through legacy during outage stay in the legacy DB (document in IAR as known gap; manual migration post-recovery if needed)
Step 4 — Consider upgrade
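- If outages recur, consider upgrading gpus-forms-db to regional HA (--availability-type=regional, ~$50/mo additional). Note that regional HA keeps the standby in a second zone of the same region, so it protects against zonal failures only; surviving a full us-central1 outage would require a cross-region read replica that can be promoted. Justify via incident count.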
DR-FP-05 — Legacy MySQL in_formfeed compromise or loss
Scenario: Legacy read-only reference DB at 34.171.123.238 is compromised or lost
Reality check: If Phase 1.5 migration completed, this is low-impact — form definitions already live in YAML in the repo. The legacy DB is cold reference only.
Step 1 — Assess
- Is the new forms portal running with all YAML loaded? Confirm via boot.yaml_load count.
- If yes: no action required. Legacy loss is a non-event.
Step 2 — If still mid-migration
- All migration data came from category, form, field, pulldown, template tables — 260KB total
- Re-migrate from latest backup of legacy instance (automated backups should be enabled)
Step 3 — Document
- Note legacy loss in IAR; close legacy MySQL line item
Testing schedule
Per existing DR pattern at GPUS:
- Quarterly — DR-FP-01 cold-restore test (scheduled via Cowork recurring task — already created 2026-04-18)
- Annually — DR-FP-03 + DR-FP-04 tabletop exercise
- On change — re-verify after any major infra change (DB tier change, KMS rotation policy change, region migration)
Document each test result in mkdocs-portal/docs/infrastructure/runbooks/backup-restore.md.