Forms Portal — Disaster Recovery Playbook

Append to: mkdocs-portal/docs/response-plans/drp.md
Classification: CONFIDENTIAL — Internal Use Only
Version: 1.0 · 2026-04-18

RTO / RPO targets

| Metric | Target | Basis |
| --- | --- | --- |
| Recovery Time Objective (RTO) | 4 hours | Can operate on legacy PHP as fallback during restore |
| Recovery Point Objective (RPO) | 15 minutes | Cloud SQL PITR granularity |
| Attachment RPO | 0 (no loss) | GCS versioning + retention lock |

Asset inventory for DR

| Asset | Location | Recovery source |
| --- | --- | --- |
| Cloud SQL instance `gpus-forms-db` | `gpus-infra` us-central1 | Automated backups (7d) + PITR |
| KMS keys `gpus-forms-cmek`, `gpus-forms-dek-wrapper` | `gpus-forms` keyring us-central1 | Cannot be recreated once fully destroyed; data encrypted under a destroyed key is unrecoverable (see DR-FP-03) |
| GCS `gpus-forms-attachments` | us-central1 | Versioning + retention lock |
| Cloud Run `gpus-forms-backend` | us-central1 | Container image in Artifact Registry; redeploys in <2min |
| Secret Manager (HappyFox creds) | gpus-infra | 6-version history; restore previous version |
| YAML form definitions | `gpus-infra-portals` repo main branch | Git history |

DR-FP-01 — Cloud SQL instance lost (deletion, region outage)

Scenario: Instance gpus-forms-db is gone or unreachable for > 30 min

Step 1 — Confirm the outage

gcloud sql instances list --filter="name:gpus-forms-db"
gcloud sql operations list --instance=gpus-forms-db --limit=5

If the instance exists but is in a FAILED or MAINTENANCE state, wait or open a GCP support case. If the instance is gone, proceed to Step 2.
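
As a minimal sketch of that decision, the helper below maps the state reported by `gcloud sql instances describe` to the next action. The function name `next_action` and the wording of its messages are illustrative assumptions, not part of gcloud or of the existing tooling.

```shell
# Map a Cloud SQL instance state to the next playbook action.
# next_action is a hypothetical helper; the messages are illustrative.
next_action() {
  case "$1" in
    RUNNABLE)           echo "instance is up: check the network path and proxy" ;;
    FAILED|MAINTENANCE) echo "wait or open a GCP support case" ;;
    "")                 echo "instance gone: proceed to Step 2" ;;
    *)                  echo "unexpected state $1: investigate before restoring" ;;
  esac
}

# Usage (describe prints nothing on stdout when the instance cannot be found):
#   state=$(gcloud sql instances describe gpus-forms-db --format='value(state)' 2>/dev/null)
#   next_action "$state"
```

An empty state can also mean that the describe call itself failed (for example, expired credentials), so it is worth confirming with the list command above before starting a restore.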

Step 2 — Restore from backup

List available backups:

gcloud sql backups list --instance=gpus-forms-db-old 2>/dev/null || \
  gcloud sql backups list --filter="instance:gpus-forms-db"
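
When several backups are listed, the newest successful one is usually the restore candidate. The sketch below picks it out; `latest_backup_id` is a hypothetical helper, and it assumes the listing exposes the `id`, `status` and `windowStartTime` fields of the Cloud SQL backup run resource.

```shell
# Print the ID of the most recent SUCCESSFUL backup for an instance.
# latest_backup_id is a hypothetical helper, not a gcloud command.
latest_backup_id() {
  # value() output is tab-separated: id, status, windowStartTime.
  gcloud sql backups list --instance="$1" \
    --format='value(id,status,windowStartTime)' |
    awk -F'\t' '
      # RFC 3339 timestamps sort correctly as plain strings.
      $2 == "SUCCESSFUL" && $3 > latest { latest = $3; id = $1 }
      END { if (id != "") print id; else exit 1 }'
}

# Usage:
#   BACKUP_ID=$(latest_backup_id gpus-forms-db) || echo "no successful backup found"
```

The printed ID is the value that goes in place of `<BACKUP_ID>` in the restore command later in this step.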

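If the original instance is fully deleted, you can still restore a backup into a new instance, provided the backup outlived the instance; confirm that the listing above returns at least one entry before creating the replacement: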

gcloud sql instances create gpus-forms-db \
  --database-version=POSTGRES_15 \
  --tier=db-f1-micro \
  --region=us-central1 \
  --availability-type=zonal \
  --storage-size=10 \
  --no-assign-ip \
  --network=projects/gpus-infra/global/networks/gpus-vpc \
  --database-flags=cloudsql.iam_authentication=on \
  --backup-start-time=07:00 \
  --retained-backups-count=7 \
  --enable-point-in-time-recovery \
  --deletion-protection \
  --disk-encryption-key=projects/gpus-infra/locations/us-central1/keyRings/gpus-forms/cryptoKeys/gpus-forms-cmek

Restore the backup into it:

gcloud sql backups restore <BACKUP_ID> --restore-instance=gpus-forms-db --backup-instance=<BACKUP_SOURCE_INSTANCE>

Step 3 — Reattach IAM DB users

gcloud sql users create gpus-forms-backend@gpus-infra.iam --instance=gpus-forms-db --type=cloud_iam_service_account
gcloud sql users create rajesh.chhetry@greenpeace.us --instance=gpus-forms-db --type=cloud_iam_user

Step 4 — Verify data integrity

SSH to MAPLE, start the Cloud SQL Auth Proxy, connect with psql, and run the verification queries from Phase 1 Step 6 (Q1-Q4, covering RLS and policies).

Step 5 — Redirect backend to restored instance

If instance connection name changed:

gcloud run services update gpus-forms-backend --region=us-central1 \
  --update-env-vars=CLOUDSQL_INSTANCE=gpus-infra:us-central1:gpus-forms-db

Step 6 — Smoke test

curl -sS -H "Authorization: Bearer $(gcloud auth print-identity-token)" \
  "$(gcloud run services describe gpus-forms-backend --region=us-central1 --format='value(status.url)')/health/deep"

Expected: {"status":"ok","db":true,"kms":true}
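
For scripted runs, such as the quarterly cold-restore test, the response can be checked mechanically. The sketch below is one way to do that without extra tooling; `health_ok` is a hypothetical helper, and it assumes the flat JSON shape shown above.

```shell
# Return 0 only when the deep health check reports status, db and kms as OK.
# health_ok is a hypothetical helper. It matches substrings after stripping
# whitespace, which suits this flat JSON shape but is not a JSON parser.
health_ok() {
  body=$(printf '%s' "$1" | tr -d ' \t\r\n')
  case "$body" in *'"status":"ok"'*) ;; *) return 1 ;; esac
  case "$body" in *'"db":true'*)     ;; *) return 1 ;; esac
  case "$body" in *'"kms":true'*)    ;; *) return 1 ;; esac
  return 0
}

# Usage:
#   health_ok "$(curl -sS ... /health/deep)" && echo "smoke test passed"
```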

DR-FP-02 — Point-in-time recovery (logical corruption, bad data)

Scenario: Bug or malicious action caused bad data to land in DB; need to roll back to a specific timestamp

Step 1 — Identify target recovery timestamp

From audit_log:

SELECT MIN(occurred_at) FROM audit_log WHERE details->>'suspicious' = 'true';
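
The query returns the moment the first suspicious event was recorded, so the recovery target generally needs to sit slightly before it. The sketch below subtracts a safety margin and prints the result in the format the clone command expects. `pitr_target` is a hypothetical helper, the 60-second margin is an arbitrary assumption, and the function relies on GNU `date`.

```shell
# Print the UTC timestamp N seconds before the given one, in RFC 3339 form.
# pitr_target is a hypothetical helper; it requires GNU date (-d and @epoch).
pitr_target() {
  epoch=$(date -u -d "$1" +%s) || return 1
  date -u -d "@$((epoch - $2))" +%Y-%m-%dT%H:%M:%SZ
}

# Usage: first suspicious event at 10:15:30 UTC, 60-second margin.
#   pitr_target "2026-04-18T10:15:30Z" 60
```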

Step 2 — PITR clone to new instance

gcloud sql instances clone gpus-forms-db gpus-forms-db-pitr \
  --point-in-time="2026-MM-DDTHH:MM:SSZ"

Step 3 — Validate clone (proxy + psql from MAPLE)

Step 4 — Cut over traffic

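- Option A (recommended): dump the affected rows from the clone and apply them to the live instance as a targeted fix.
- Option B (aggressive): redirect the backend to the clone by updating the CLOUDSQL_INSTANCE env var. Cloud SQL instances cannot be renamed, so the clone keeps its new name and becomes the live instance.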

Step 5 — Verify + document

DR-FP-03 — KMS key destroyed (catastrophic)

Scenario: gpus-forms-dek-wrapper primary version is destroyed and grace period expired

Reality check: This is near-unrecoverable. Encrypted submission fields can no longer be decrypted. Attachments are unaffected (separate CMEK).

Step 1 — Confirm

gcloud kms keys versions list --key=gpus-forms-dek-wrapper --keyring=gpus-forms --location=us-central1

Look for state: DESTROYED.
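
A version in DESTROY_SCHEDULED is still inside its grace period and can be brought back with `gcloud kms keys versions restore`, so it helps to separate that state from DESTROYED. The sketch below filters the listing for both; `at_risk_versions` is a hypothetical helper built around the command above.

```shell
# List key versions that are DESTROYED or DESTROY_SCHEDULED, state first.
# at_risk_versions is a hypothetical helper, not a gcloud command.
at_risk_versions() {
  # value() output is tab-separated: name, state.
  gcloud kms keys versions list --key="$1" --keyring=gpus-forms \
    --location=us-central1 --format='value(name,state)' |
    awk -F'\t' '$2 == "DESTROYED" || $2 == "DESTROY_SCHEDULED" { print $2 "\t" $1 }'
}

# Usage:
#   at_risk_versions gpus-forms-dek-wrapper
```

Empty output means no version of the key is destroyed or pending destruction.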

Step 2 — Determine blast radius

All submission_fields rows where kms_key_version matches the destroyed version are unrecoverable.

SELECT kms_key_version, COUNT(*) FROM submission_fields GROUP BY kms_key_version;

Step 3 — Mark affected submissions

UPDATE submissions SET status = 'purged',
  error_message = 'DR-FP-03 KMS key destroyed, data unrecoverable'
WHERE id IN (SELECT DISTINCT submission_id FROM submission_fields WHERE kms_key_version = '<destroyed>');

Step 4 — Notify submitters

Using submissions.submitter_email, send notification requesting re-submission.

Step 5 — Lessons learned

Review why the key was destroyed. Was the grace period missed? Who had the cloudkms.cryptoKeyVersions.destroy permission? Remove that permission from all humans and restrict it to a break-glass service account only.

DR-FP-04 — Region-wide us-central1 outage

Scenario: Entire us-central1 is down for > 1 hour

Reality check: Forms portal is zonal (db-f1-micro doesn't support regional HA at that tier). Full region outage means forms portal is offline until region recovers. Legacy PHP app at forms.us.gl3 (on-prem Meraki WDC) continues to function and can serve as fallback.

Step 1 — Redirect DNS to legacy

- DNS: change the forms.greenpeace.us CNAME to point at the legacy Apache host (requires DNS admin action)
- Legacy PHP forms remain functional for the outage duration
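
Once the DNS admin confirms the change, the record can be checked from any workstation. A minimal sketch, assuming `dig` is installed; `cname_points_to` is a hypothetical helper, and the legacy target hostname has to be supplied by the DNS admin because it is not recorded in this playbook.

```shell
# Return 0 when the CNAME for a host resolves to the expected target.
# cname_points_to is a hypothetical helper; it assumes dig is available.
cname_points_to() {
  actual=$(dig +short CNAME "$1" | head -n 1)
  # dig prints fully qualified names with a trailing dot; strip it on both sides.
  [ "${actual%.}" = "${2%.}" ]
}

# Usage:
#   cname_points_to forms.greenpeace.us <legacy-cname-target> && echo "cutover visible"
```

Resolvers may keep serving the old answer until the record's TTL expires, so a failed check shortly after the change does not necessarily mean the change was missed.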

Step 2 — Communicate

- Slack #it-ops + staff email: "forms.greenpeace.us degraded, using legacy interim"

Step 3 — Recovery

- When us-central1 returns, flip the DNS CNAME back
- Any submissions made through legacy during the outage stay in the legacy DB (document in IAR as a known gap; migrate manually post-recovery if needed)

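Step 4 — Consider upgrade

- If outages recur, consider upgrading gpus-forms-db to regional HA (--availability-type=regional, ~$50/mo additional). Regional HA fails over between zones inside us-central1, so it covers zonal failures only; surviving a full-region outage would need a cross-region read replica. Justify via incident count.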

DR-FP-05 — Legacy MySQL `in_formfeed` compromise or loss

Scenario: Legacy read-only reference DB at 34.171.123.238 is compromised or lost

Reality check: If Phase 1.5 migration completed, this is low-impact — form definitions already live in YAML in the repo. The legacy DB is cold reference only.

Step 1 — Assess

- Is the new forms portal running with all YAML loaded? Confirm via the boot.yaml_load count.
- If yes: no action required. Legacy loss is a non-event.

Step 2 — If still mid-migration

- All migration data came from the category, form, field, pulldown, and template tables (260KB total)
- Re-migrate from the latest backup of the legacy instance (automated backups should be enabled)

Step 3 — Document

- Note the legacy loss in the IAR; close the legacy MySQL line item

Testing schedule

Per existing DR pattern at GPUS:

Quarterly — DR-FP-01 cold-restore test (scheduled via Cowork recurring task — already created 2026-04-18)
Annually — DR-FP-03 + DR-FP-04 tabletop exercise
On change — re-verify after any major infra change (DB tier change, KMS rotation policy change, region migration)

Document each test result in mkdocs-portal/docs/infrastructure/runbooks/backup-restore.md.