Incident Response Plan

Classification: CONFIDENTIAL — Internal Use Only
Document: response-plans/irp.md · v1.0 · 2026-03-14 · GPUS-IT


1. Purpose & Scope

This Incident Response Plan (IRP) defines the process for detecting, containing, eradicating, and recovering from security incidents affecting the GPUS-IT infrastructure. It covers all four WDC on-premises servers (SKY, RAIN, SUN, WIND), GCP cloud services, the Cloud VPN tunnel, and all hosted applications.

Compliance alignment:

| Framework | Reference |
|---|---|
| CIS Controls v8 | CIS 17.1 — Designate personnel to manage incident handling |
| CIS Controls v8 | CIS 17.2 — Establish and maintain contact information |
| CIS Controls v8 | CIS 17.3 — Establish and maintain an enterprise process for reporting incidents |
| CIS Controls v8 | CIS 17.4 — Establish and maintain an incident response process |
| CIS Controls v8 | CIS 17.7 — Conduct routine incident response exercises |
| CIS Controls v8 | CIS 17.9 — Establish and maintain security incident thresholds |
| PCI-DSS | Requirement 12.10 — Implement an incident response plan |
| NIST SP 800-53 | IR-1 through IR-10 — Incident Response control family |
| NIST CSF | Respond (RS) and Recover (RC) functions |

2. Incident Severity Classification

| Severity | Description | Response Time | Examples |
|---|---|---|---|
| P1 — Critical | Complete service unavailability or confirmed active compromise | Immediate / 15 min | DNS/DHCP fully down, root-level compromise, ransomware, zone data tampering, all servers unreachable |
| P2 — High | Partial service degradation or active attack in progress | < 1 hour | DHCP failover failure, DDoS against resolvers, failed login flood, VPN tunnel down, single server compromise |
| P3 — Medium | Security anomaly detected, no confirmed service impact | < 4 hours | Unauthorized zone transfer attempt, AIDE integrity alert, config drift detected, Fail2ban threshold breach |
| P4 — Low | Informational alert requiring investigation | < 24 hours | Single failed login, DHCP lease pool warning, certificate expiry notice, minor log anomaly |
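
The response times in the severity table can be encoded so that alerting or ticketing scripts use the same numbers as this plan. The following is a minimal sketch; the function name `response_window_minutes` is hypothetical, and "Immediate / 15 min" is read here as an upper bound of 15 minutes.

```shell
# Map a severity label (P1..P4) to its maximum response window in minutes.
# Values mirror the severity table; the function name is illustrative only.
response_window_minutes() {
    case "$1" in
        P1) echo 15 ;;
        P2) echo 60 ;;
        P3) echo 240 ;;
        P4) echo 1440 ;;
        *)  echo "unknown severity: $1" >&2; return 1 ;;
    esac
}
```

For example, `response_window_minutes P2` prints `60`, and an unrecognized label returns a non-zero status so callers can fail closed.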

3. Incident Response Team

| Role | Responsibilities | Contact |
|---|---|---|
| Incident Commander (Director of Cyber Security) | Activate IRP, coordinate response, external communications, escalation decisions | On-call phone |
| DNS/DHCP Responder | SKY/RAIN containment, DNSSEC integrity, DNS/DHCP service restoration | On-call phone + SSH |
| Monitoring/Logging Responder | SUN/WIND containment, log collection, forensic preservation | On-call phone + SSH |
| Security Operations | Threat analysis, IOC identification, containment recommendations, AIDE verification | SOC hotline |
| Network Operations | Meraki MX100 firewall rules, VPN tunnel, network-layer isolation | NOC hotline |
| GCP Responder | Cloud Run, VPC, IAM investigation and remediation | GCP Console + CLI |
| Legal / Compliance | Data breach assessment, regulatory notification (if required) | Legal counsel |

4. Incident Response Phases

Phase 1 — Preparation

Ongoing activities that enable effective response:

  • IR runbooks maintained and tested quarterly.
  • All four servers: Fail2ban active, AIDE baseline current, audit logging to WIND via rsyslog.
  • Grafana alerting configured on SUN for threshold breaches.
  • Elasticsearch alerts on WIND for security event patterns.
  • Out-of-band communication path maintained: Signal group + phone tree.
  • Offline copies of all configs, zone files, and DNSSEC keys in /backup/ and GCS.
  • ESXi snapshots current (daily schedule verified).
  • This document reviewed and exercised annually.

Phase 2 — Detection & Analysis

Detection sources:

| Source | What it detects | Location |
|---|---|---|
| Fail2ban | SSH brute force, repeated auth failures | All 4 servers |
| AIDE | File integrity violations, unexpected config changes | All 4 servers |
| Prometheus/Grafana | Resource anomalies, service down, threshold breaches | SUN |
| Elasticsearch/Kibana | Log pattern analysis, security event correlation | WIND |
| NTA cron | Network traffic anomalies | SKY, RAIN (hourly) |
| Manual | Admin observation, user reports, external notification | All |

Initial analysis checklist:

# 1. Identify affected server(s)
ping 192.168.120.1  # SKY
ping 192.168.120.2  # RAIN
ping 192.168.120.3  # SUN
ping 192.168.120.4  # WIND

# 2. Check service status on affected server
systemctl status named dhcpd fail2ban    # SKY/RAIN (AIDE runs on demand, not as a daemon; see step 5)
systemctl status prometheus grafana-server    # SUN
systemctl status elasticsearch logstash kibana  # WIND

# 3. Check recent auth failures
grep "Failed password" /var/log/secure | tail -50
grep "authentication failure" /var/log/secure | tail -50

# 4. Check Fail2ban bans
fail2ban-client status
fail2ban-client status sshd

# 5. Run AIDE check
sudo aide --check 2>&1 | grep -v "^$"

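# 6. Check for unexpected processes
# Caution: this filter drops every line containing "root", so root-owned processes are hidden.
# Review the full `ps auxf` output when compromise is suspected.
ps aux | grep -v -E "(named|dhcpd|sshd|systemd|root|fail2ban|aide|prometheus|node_exp|grafana|elasticsearch|logstash|kibana)"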

# 7. Check listening ports
ss -tlnp

# 8. Check recent logins
last | head -20
lastb | head -20

# 9. Review Elasticsearch for correlated events
curl -s "http://192.168.120.4:9200/logstash-*/_search?q=severity:high&size=20" | python3 -m json.tool
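
Step 1 of the checklist can be run as a single sweep instead of four separate pings. The sketch below assumes the server names and addresses listed in step 1; the `PROBE` override is a hypothetical hook added only so the loop can be dry-run without touching the network.

```shell
# Server names and addresses are taken from step 1 of the checklist.
HOSTS="SKY=192.168.120.1 RAIN=192.168.120.2 SUN=192.168.120.3 WIND=192.168.120.4"

# Reachability command; one ICMP echo with a 2-second timeout by default.
# Overriding PROBE is an illustrative assumption, not part of the plan.
PROBE="${PROBE:-ping -c 1 -W 2}"

# Print one line per server: "<name> <address> up" or "<name> <address> DOWN".
sweep_hosts() {
    for entry in $HOSTS; do
        name=${entry%%=*}
        addr=${entry#*=}
        if $PROBE "$addr" >/dev/null 2>&1; then
            echo "$name $addr up"
        else
            echo "$name $addr DOWN"
        fi
    done
}
```

Running `sweep_hosts` from a management host gives a quick picture of which servers are unreachable. A host that blocks ICMP will show as DOWN even when its services are healthy, so treat the result as a starting point rather than a verdict.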

Phase 3 — Containment

Immediate containment (P1/P2 — do not wait for full analysis):

# Option A: Network isolate affected server via firewalld (preferred)
# Blocks all traffic except management subnet
firewall-cmd --zone=drop --add-source=0.0.0.0/0 --permanent
firewall-cmd --zone=trusted --add-source=192.168.124.0/24 --permanent
firewall-cmd --reload

# Option B: ESXi-level isolation (if server compromise is severe)
# Via vSphere Client → VM → Disconnect network adapters

# Option C: Block specific attacking IP
firewall-cmd --add-rich-rule='rule family="ipv4" source address="<ATTACKER_IP>" drop' --permanent
firewall-cmd --reload

# Preserve evidence BEFORE any remediation
# Capture running process list, network connections, and memory if possible
TS=$(date +%Y%m%d-%H%M%S)   # One timestamp so every file from this capture shares a prefix
ps auxf > /tmp/ir-$TS-procs.txt
ss -tlnp > /tmp/ir-$TS-netstat.txt
ss -anp > /tmp/ir-$TS-connections.txt   # All sockets; netstat may be absent on minimal installs
cp -p /var/log/secure /tmp/ir-$TS-secure.log
cp -p /var/log/audit/audit.log /tmp/ir-$TS-audit.log

# Copy evidence to WIND for preservation
scp /tmp/ir-* monitadmin@192.168.120.4:/var/log/ir-evidence/
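
Evidence copied to WIND is easier to defend later if its integrity can be shown. One possible approach, sketched below, is to write a SHA-256 manifest next to the captured files before copying and to verify it on the receiving side. The function names and the `MANIFEST.sha256` file name are hypothetical; the sketch assumes GNU coreutils `sha256sum`.

```shell
# Write a SHA-256 manifest covering every ir-* file in the given directory.
# MANIFEST.sha256 is an illustrative name; it does not match ir-*, so the
# manifest never hashes itself.
write_manifest() {
    dir=$1
    ( cd "$dir" && sha256sum ir-* > MANIFEST.sha256 )
}

# Re-check the files against the manifest; non-zero status means a file
# is missing or its content has changed since the manifest was written.
verify_manifest() {
    dir=$1
    ( cd "$dir" && sha256sum -c --quiet MANIFEST.sha256 )
}
```

In practice this would mean running `write_manifest /tmp` on the affected server, copying `MANIFEST.sha256` along with the `ir-*` files, and running `verify_manifest /var/log/ir-evidence` on WIND.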

DNS/DHCP-specific containment (SKY/RAIN):

# If zone tampering suspected — freeze zone and verify integrity
rndc freeze wdc.us.gl3
# Compare zone file checksum against last known-good backup
sha256sum /var/named/wdc.us.gl3.db /backup/dns-dhcp/wdc.us.gl3.db.last
# Verify DNSSEC signatures (-o sets the zone origin)
dnssec-verify -o wdc.us.gl3 /var/named/wdc.us.gl3.db

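# If SKY is compromised, RAIN takes over automatically; stop RAIN from accepting zone transfers from SKY
# On RAIN: temporarily remove SKY as a notify source (for example, from the zone's masters/allow-notify settings in named.conf)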
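
The checksum comparison against the last known-good backup can be wrapped in a helper that returns a status instead of relying on visual comparison of two digests. This is a sketch under the assumption that the backup copy lives at the path used in the DNS zone tampering runbook; the function name `same_digest` is hypothetical.

```shell
# Succeed only when both files exist and have identical SHA-256 digests.
# Returns 2 if either file is missing, 1 if the digests differ.
same_digest() {
    [ -f "$1" ] && [ -f "$2" ] || return 2
    a=$(sha256sum "$1" | awk '{print $1}')
    b=$(sha256sum "$2" | awk '{print $1}')
    [ "$a" = "$b" ]
}

# Example, using paths that appear in this plan:
# same_digest /var/named/wdc.us.gl3.db /backup/dns-dhcp/wdc.us.gl3.db.last \
#     || echo "zone file differs from last known-good backup"
```

A mismatch is not proof of tampering on its own: a zone that accepts dynamic updates will legitimately drift from its backup, so a difference should prompt the `diff` and DNSSEC checks rather than an automatic escalation.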

Phase 4 — Eradication

After containment and evidence preservation:

# 1. Identify root cause from logs and AIDE report
# 2. Remove malicious artifacts (files, accounts, cron jobs, authorized_keys)

# Check for unauthorized accounts
awk -F: '$3 >= 1000 {print $1, $3}' /etc/passwd

# Check authorized_keys on all accounts
find /root /home -name "authorized_keys" -print -exec cat {} \;  # -print shows which file each key came from

# Check crontabs
crontab -l
ls -la /etc/cron.* /var/spool/cron/

# Check for SUID binaries added since baseline
find / -perm /4000 -newer /var/lib/aide/aide.db.gz -ls 2>/dev/null

# 3. Patch the exploited vulnerability
dnf update --security -y

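# 4. Rotate credentials
# SSH keys: regenerate host keys, rotate admin authorized_keys
# ssh-keygen -A only creates host keys that are missing, so remove the old ones first
rm -f /etc/ssh/ssh_host_*_key /etc/ssh/ssh_host_*_key.pub
ssh-keygen -A  # Regenerate host keys
systemctl restart sshd  # Load the new host keys
# Rotate all admin SSH keys per access control policy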

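# 5. Update AIDE baseline after clean state confirmed
# aide --update exits 1-7 when it detects changes, so chaining with "&&" would skip the move; treat only 8+ as failure
sudo aide --update; rc=$?
[ "$rc" -lt 8 ] && sudo mv /var/lib/aide/aide.db.new.gz /var/lib/aide/aide.db.gz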

Phase 5 — Recovery

# Restore from known-good snapshot (preferred) OR rebuild from backup
# See DR Plan sections 6.1–6.4 for per-server recovery procedures

# After restoration:
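# 1. AIDE baseline update
# aide --update exits 1-7 when it detects changes, so chaining with "&&" would skip the move; treat only 8+ as failure
sudo aide --update; rc=$?
[ "$rc" -lt 8 ] && sudo mv /var/lib/aide/aide.db.new.gz /var/lib/aide/aide.db.gz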

# 2. Re-sign DNSSEC (SKY/RAIN)
rndc sign wdc.us.gl3
rndc sign 120.168.192.in-addr.arpa

# 3. Verify all services operational
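# 4. Remove network isolation (if applied in containment)
# Option A: remove the blanket drop source (and the trusted source, if it was added only for containment)
firewall-cmd --zone=drop --remove-source=0.0.0.0/0 --permanent
# Option C: remove the per-IP rich rule
firewall-cmd --remove-rich-rule='...' --permanent
firewall-cmd --reload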

# 5. Monitor intensively for 48 hours post-recovery
# Grafana: watch for re-compromise indicators
# AIDE: daily checks for 1 week
# Fail2ban: review ban list
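
Step 3 of the recovery procedure ("Verify all services operational") has no commands attached. A minimal sketch of one way to do it follows, using the unit names from the initial analysis checklist. The function name and the `SYSTEMCTL` override are hypothetical; the override exists only so the loop can be exercised without systemd.

```shell
# Command used to query unit state; overriding it is an illustrative assumption.
SYSTEMCTL="${SYSTEMCTL:-systemctl}"

# Print every unit from the argument list that is not active.
# The exit status is the number of inactive units (0 means all are running).
inactive_units() {
    failed=0
    for unit in "$@"; do
        if ! $SYSTEMCTL is-active --quiet "$unit"; then
            echo "$unit"
            failed=$((failed + 1))
        fi
    done
    return "$failed"
}

# Example per server role:
# inactive_units named dhcpd fail2ban              # SKY/RAIN
# inactive_units prometheus grafana-server         # SUN
# inactive_units elasticsearch logstash kibana     # WIND
```

An active unit is a necessary check but not a sufficient one; a DNS query against SKY and RAIN and a test log entry arriving on WIND would confirm that the services actually work end to end.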

Phase 6 — Post-Incident Activity

Within 5 business days of incident closure:

  1. Incident After-Action Report — complete the IAR template at response-plans/incident-after-action-review.md.
  2. Root cause analysis — documented and shared with Director of Cyber Security.
  3. Lessons learned — update this IRP and/or DR Plan if procedures were found insufficient.
  4. Compliance evidence — log incident in the risk register; document remediation.
  5. Asset inventory update — if any hardware/software changed, update wdc-hostregistry.csv.
  6. AIDE baseline — confirm final clean baseline on all affected servers.

5. Incident-Specific Runbooks

5.1 SSH Brute Force / Credential Attack

# 1. Identify attacking IPs
fail2ban-client status sshd
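# Extract the address after "from"; its field position differs for valid and invalid users
grep "Failed password" /var/log/secure | grep -oE 'from [0-9]{1,3}(\.[0-9]{1,3}){3}' | awk '{print $2}' | sort | uniq -c | sort -rn | head -20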

# 2. Confirm Fail2ban has auto-banned
# If not banned (new attack pattern):
firewall-cmd --add-rich-rule='rule family="ipv4" source address="<IP>" drop' --permanent
firewall-cmd --reload

# 3. Check for successful logins from same IP range
grep "Accepted" /var/log/secure | tail -50

# 4. If successful login detected → escalate to P1, begin full containment
# 5. Review authorized_keys — ensure no new keys were added
find /root /home -name "authorized_keys" -print -exec cat {} \;  # -print shows which file each key came from
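
In sshd's "Failed password" lines the source address sits at a different field position depending on whether the targeted user exists, so counting by a fixed field can mix usernames with addresses. The sketch below keys on the text between "from" and "port" instead. The function name `top_failed_ips` is hypothetical, and the pattern handles IPv4 addresses only.

```shell
# Read sshd log lines on stdin and print "<count> <ip>" for each source of
# failed password attempts, busiest source first. IPv4 only.
top_failed_ips() {
    grep "Failed password" \
        | sed -n 's/.* from \([0-9][0-9.]*\) port .*/\1/p' \
        | sort | uniq -c | sort -rn \
        | awk '{print $1, $2}'
}

# Example:
# top_failed_ips < /var/log/secure | head -20
```

Because the function reads from stdin, the same helper works on rotated logs or on copies preserved under the evidence directory.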

5.2 AIDE Integrity Alert

# 1. Review AIDE report
sudo aide --check 2>&1 | tee /tmp/aide-report-$(date +%Y%m%d).txt

# 2. Categorize changes:
#    - Expected (post-change): update baseline → not an incident
#    - Unexpected: investigate immediately

# 3. For unexpected changes, identify what changed:
# File added: potential malware drop
# File modified: potential config tampering or binary replacement
# File deleted: potential evidence destruction

# 4. Cross-reference with /var/log/asset-inventory.log
# If change not in log → treat as unauthorized → escalate to P2/P1

# 5. If binary modified — assume compromise; escalate to P1
# Isolate server immediately (see Phase 3 containment)
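
When AIDE checks run from cron or a wrapper script, the exit status already says what kind of change was found. According to the AIDE manual, `--check` and `--update` return a bitmask (1 for added, 2 for removed, 4 for changed files) and use values from 14 upward for errors. The decoder below is a sketch of how that status could feed the categorization in step 3; the function name is hypothetical.

```shell
# Decode the exit status of `aide --check` or `aide --update`.
# 0 = no differences; 1-7 = bitmask of added (1), removed (2), changed (4);
# anything from 8 upward is reported as an error (the manual lists 14-19).
describe_aide_rc() {
    rc=$1
    if [ "$rc" -eq 0 ]; then echo "no changes"; return 0; fi
    if [ "$rc" -ge 8 ]; then echo "aide error ($rc)"; return 0; fi
    out=""
    [ $((rc & 1)) -ne 0 ] && out="$out added"
    [ $((rc & 2)) -ne 0 ] && out="$out removed"
    [ $((rc & 4)) -ne 0 ] && out="$out changed"
    echo "files:$out"
}

# Example:
# sudo aide --check >/tmp/aide-report.txt 2>&1; describe_aide_rc $?
```

A status that includes "removed" or "changed" maps to the evidence-destruction and tampering cases above and is a reasonable trigger for paging a responder rather than waiting for the report to be read.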

5.3 DNS Zone Tampering

# 1. Verify current zone against last backup
diff /var/named/wdc.us.gl3.db /backup/dns-dhcp/wdc.us.gl3.db.last

# 2. Check DNSSEC signatures
dnssec-verify -o wdc.us.gl3 /var/named/wdc.us.gl3.db

# 3. Check named journal files for recent changes
ls -la /var/named/*.jnl

# 4. Freeze zone immediately
rndc freeze wdc.us.gl3

# 5. If tampering confirmed:
#    - Restore zone from last known-good backup
#    - Re-sign DNSSEC
#    - Check TSIG keys (rotate if compromised)
#    - Escalate to P1

# 6. Alert all DNS clients if records were poisoned

5.4 Unauthorized Access / Privilege Escalation

# 1. Identify unauthorized account or privilege change
grep "sudo" /var/log/secure | tail -50
grep "useradd\|usermod\|passwd" /var/log/secure | tail -50

# 2. Check for new accounts
awk -F: '$3 >= 1000 {print $1, $3, $6}' /etc/passwd

# 3. Lock unauthorized account immediately
passwd -l <USERNAME>
usermod -s /sbin/nologin <USERNAME>

# 4. Kill active sessions
pkill -u <USERNAME>

# 5. Review what the account accessed
grep <USERNAME> /var/log/secure /var/log/audit/audit.log | tail -100

# 6. Rotate all admin credentials on affected server
# 7. Escalate to P1 — full containment
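
Step 2 lists every account with a UID of 1000 or higher, which still leaves the responder to spot the intruder by eye. A possible refinement, sketched below, is to compare that list against the accounts expected on the server. The function name and the approved-list argument are hypothetical; `monitadmin` appears elsewhere in this plan, while any other name in the example is a placeholder.

```shell
# Print accounts with UID >= 1000 from a passwd-format file that are not in
# the approved list. Usage: unexpected_accounts PASSWD_FILE "user1 user2 ..."
unexpected_accounts() {
    passwd_file=$1
    approved=" $2 "
    awk -F: '$3 >= 1000 {print $1}' "$passwd_file" | while read -r user; do
        case "$approved" in
            *" $user "*) ;;            # expected account, skip
            *) echo "$user" ;;         # not on the list, report it
        esac
    done
}

# Example (the approved list here is a placeholder):
# unexpected_accounts /etc/passwd "monitadmin nobody"
```

System accounts with high UIDs, such as `nobody` on many distributions, will be reported unless they are included in the approved list.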

5.5 GCP / Cloud Incident

# 1. Check Cloud Audit Logs for unauthorized API calls
gcloud logging read \
    'protoPayload.authorizationInfo.granted=true AND protoPayload.authenticationInfo.principalEmail!~"@greenpeace.us"' \
    --project=gpus-infra \
    --limit=50 \
    --format=json

# 2. Check IAM for unauthorized bindings
gcloud projects get-iam-policy gpus-infra --format=json

# 3. Check for unauthorized Cloud Run deployments
gcloud run services list --region=us-central1 --project=gpus-infra

# 4. If unauthorized service account activity detected:
gcloud iam service-accounts disable <SA_EMAIL> --project=gpus-infra

# 5. If VPN tunnel compromised — rotate PSK via Terraform
# Update var.vpn_shared_secret in terraform.tfvars
# terraform apply -target=google_compute_vpn_tunnel.wdc_tunnel
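
The IAM policy dump in step 2 can be long, and the members that matter most are those outside the organization's domain. The sketch below filters the JSON output with standard text tools, assuming the `@greenpeace.us` domain used in the audit log filter in step 1. The function name is hypothetical, and the pattern is a plain-text match rather than a JSON parser.

```shell
# Read IAM policy JSON on stdin and print each distinct user, group, or
# service account member whose identity does not contain "@<domain>".
members_outside_domain() {
    domain=$1
    grep -oE '"(user|group|serviceAccount):[^"]+"' \
        | tr -d '"' \
        | sort -u \
        | grep -v -F -- "@$domain"
}

# Example:
# gcloud projects get-iam-policy gpus-infra --format=json \
#     | members_outside_domain greenpeace.us
```

Service accounts will always appear in the output because their addresses end in a Google-managed domain; the point is to reduce the list to entries that need a human decision, not to flag every line as hostile.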

6. Communication Templates

P1 Initial Alert (Director of Cyber Security → Stakeholders)

SUBJECT: [P1 SECURITY INCIDENT] GPUS-IT — <Brief Description>

Time detected: <HH:MM UTC>
Systems affected: <SERVER LIST>
Service impact: <DNS/DHCP/Monitoring/Logging/GCP>
Current status: Containment in progress

Incident Commander: <NAME>
Next update: <HH:MM UTC>

Actions taken so far:
- <ACTION 1>
- <ACTION 2>

Incident Closure Notification

SUBJECT: [RESOLVED] GPUS-IT Security Incident — <Incident ID>

Incident: <DESCRIPTION>
Duration: <START TIME> → <END TIME>
Root cause: <SUMMARY>
Systems affected: <LIST>
Service impact: <DESCRIPTION>

Remediation completed:
- <ACTION 1>
- <ACTION 2>

After-action review scheduled: <DATE>

7. IR Testing Schedule

| Exercise | Frequency | Owner | Type |
|---|---|---|---|
| Tabletop exercise — P1 DNS failure | Quarterly | Director of Cyber Security | Discussion-based |
| Live failover drill — SKY to RAIN | Quarterly | DNS Admin | Live test |
| Containment procedure walkthrough | Semi-annually | Security Ops | Walkthrough |
| Full IR simulation — active compromise | Annually | Director of Cyber Security + Full Team | Simulation |
| GCP incident response drill | Annually | GCP Admin | Simulation |

8. Evidence Retention

| Evidence Type | Retention Period | Storage Location |
|---|---|---|
| IR evidence archives | 1 year | WIND: /var/log/ir-evidence/ + GCS |
| AIDE reports from incidents | 1 year | WIND: /var/log/aide/ |
| Auth logs (/var/log/secure) | 90 days rolling | WIND (rsyslog) |
| Audit logs (/var/log/audit/) | 90 days rolling | WIND (rsyslog) |
| Incident After-Action Reports | Indefinite | response-plans/incident-after-action-review.md + GCS |
| Change log entries | Indefinite | /var/log/asset-inventory.log |
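
Retention periods are only useful if something checks them. As a sketch, the helper below lists files that have passed a given age so they can be reviewed before removal; it deletes nothing. The function name is hypothetical, and the one-year period is expressed as 365 days.

```shell
# List regular files under DIR whose last modification is more than DAYS
# days old. Listing only; removal is left as a deliberate manual step.
expired_evidence() {
    find "$1" -type f -mtime +"$2" -print
}

# Example, using a path and retention period from the table above:
# expired_evidence /var/log/ir-evidence 365
```

Evidence tied to an open legal or compliance matter may need to be kept beyond the table's retention period, so the output should be reviewed before any deletion.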

9. Regulatory Notification Thresholds

Greenpeace US handles supporter data and payment card data (PCI-DSS). The following incidents require regulatory or legal review:

| Incident Type | Notification Requirement | Threshold |
|---|---|---|
| Cardholder data exposure | PCI-DSS — notify acquiring bank + card brands | Any confirmed exposure |
| Supporter PII breach | State breach notification laws | Any confirmed exposure of PII |
| Unauthorized access to supporter data | Internal escalation to Legal | Any unauthorized access |

In all cases, notify the Director of Cyber Security immediately, then Legal Counsel within 1 hour of P1 declaration.


10. Plan Maintenance

This plan is reviewed and tested:

  • Annually — full review by Director of Cyber Security + full IR team
  • After any security incident — updated within 5 business days
  • After any infrastructure change — reviewed for impact on detection/response capabilities

Document version: v1.0 · 2026-03-14 · GPUS-IT · Classification: CONFIDENTIAL — Internal Use Only


See also

  • Forms Portal — IR Playbooks — FP-IR-01 through FP-IR-05: HappyFox credential leak, Okta token compromise, submission-field decrypt failure, DB IAM-role misuse, attachment malware scan fail