Incident Response Plan
Classification: CONFIDENTIAL — Internal Use Only
Document: response-plans/irp.md · v1.0 · 2026-03-14 · GPUS-IT
1. Purpose & Scope
This Incident Response Plan (IRP) defines the process for detecting, containing, eradicating, and recovering from security incidents affecting the GPUS-IT infrastructure. It covers all four WDC on-premises servers (SKY, RAIN, SUN, WIND), GCP cloud services, the Cloud VPN tunnel, and all hosted applications.
Compliance alignment:
| Framework | Reference |
|---|---|
| CIS Controls v8 | CIS 17.1 — Designate personnel to manage incident handling |
| CIS Controls v8 | CIS 17.2 — Establish and maintain contact information |
| CIS Controls v8 | CIS 17.3 — Establish and maintain an enterprise process for reporting incidents |
| CIS Controls v8 | CIS 17.4 — Establish and maintain an incident response process |
| CIS Controls v8 | CIS 17.7 — Conduct routine incident response exercises |
| CIS Controls v8 | CIS 17.9 — Establish and maintain security incident thresholds |
| PCI-DSS | Requirement 12.10 — Implement an incident response plan |
| NIST SP 800-53 | IR-1 through IR-10 — Incident Response control family |
| NIST CSF | Respond (RS) and Recover (RC) functions |
2. Incident Severity Classification
| Severity | Description | Response Time | Examples |
|---|---|---|---|
| P1 — Critical | Complete service unavailability or confirmed active compromise | Immediate / 15 min | DNS/DHCP fully down, root-level compromise, ransomware, zone data tampering, all servers unreachable |
| P2 — High | Partial service degradation or active attack in progress | < 1 hour | DHCP failover failure, DDoS against resolvers, failed login flood, VPN tunnel down, single server compromise |
| P3 — Medium | Security anomaly detected, no confirmed service impact | < 4 hours | Unauthorized zone transfer attempt, AIDE integrity alert, config drift detected, Fail2ban threshold breach |
| P4 — Low | Informational alert requiring investigation | < 24 hours | Single failed login, DHCP lease pool warning, certificate expiry notice, minor log anomaly |
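The response-time column can be encoded so that alerting or ticketing scripts compute deadlines the same way every time. The following is a minimal sketch, assuming the detection time is available as a Unix epoch; the function names `response_minutes` and `response_due_epoch` are illustrative and not part of any existing GPUS-IT tooling.

```shell
# Map a severity label to its response window in minutes (values from the table above).
response_minutes() {
    case "$1" in
        P1) echo 15 ;;
        P2) echo 60 ;;
        P3) echo 240 ;;
        P4) echo 1440 ;;
        *)  echo "unknown severity: $1" >&2; return 1 ;;
    esac
}

# Print the epoch second by which a response is due.
# $1 = severity, $2 = detection time as a Unix epoch (assumed input format).
response_due_epoch() {
    minutes=$(response_minutes "$1") || return 1
    echo $(( $2 + minutes * 60 ))
}
```

With GNU date, a readable deadline could then be printed with `date -u -d "@$(response_due_epoch P1 "$(date +%s)")"`.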
3. Incident Response Team
| Role | Responsibilities | Contact |
|---|---|---|
| Incident Commander (Director of Cyber Security) | Activate IRP, coordinate response, external communications, escalation decisions | On-call phone |
| DNS/DHCP Responder | SKY/RAIN containment, DNSSEC integrity, DNS/DHCP service restoration | On-call phone + SSH |
| Monitoring/Logging Responder | SUN/WIND containment, log collection, forensic preservation | On-call phone + SSH |
| Security Operations | Threat analysis, IOC identification, containment recommendations, AIDE verification | SOC hotline |
| Network Operations | Meraki MX100 firewall rules, VPN tunnel, network-layer isolation | NOC hotline |
| GCP Responder | Cloud Run, VPC, IAM investigation and remediation | GCP Console + CLI |
| Legal / Compliance | Data breach assessment, regulatory notification (if required) | Legal counsel |
4. Incident Response Phases
Phase 1 — Preparation
Ongoing activities that enable effective response:
- IR runbooks maintained and tested quarterly.
- All four servers: Fail2ban active, AIDE baseline current, audit logging to WIND via rsyslog.
- Grafana alerting configured on SUN for threshold breaches.
- Elasticsearch alerts on WIND for security event patterns.
- Out-of-band communication path maintained: Signal group + phone tree.
- Offline copies of all configs, zone files, and DNSSEC keys in /backup/ and GCS.
- ESXi snapshots current (daily schedule verified).
- This document reviewed and exercised annually.
Phase 2 — Detection & Analysis
Detection sources:
| Source | What it detects | Location |
|---|---|---|
| Fail2ban | SSH brute force, repeated auth failures | All 4 servers |
| AIDE | File integrity violations, unexpected config changes | All 4 servers |
| Prometheus/Grafana | Resource anomalies, service down, threshold breaches | SUN |
| Elasticsearch/Kibana | Log pattern analysis, security event correlation | WIND |
| NTA cron | Network traffic anomalies | SKY, RAIN (hourly) |
| Manual | Admin observation, user reports, external notification | All |
Initial analysis checklist:
# 1. Identify affected server(s)
# -c 3 sends three probes and exits; without it, ping runs until interrupted
ping -c 3 192.168.120.1 # SKY
ping -c 3 192.168.120.2 # RAIN
ping -c 3 192.168.120.3 # SUN
ping -c 3 192.168.120.4 # WIND
# 2. Check service status on affected server
systemctl status named dhcpd fail2ban aide # SKY/RAIN
systemctl status prometheus grafana-server # SUN
systemctl status elasticsearch logstash kibana # WIND
# 3. Check recent auth failures
grep "Failed password" /var/log/secure | tail -50
grep "authentication failure" /var/log/secure | tail -50
# 4. Check Fail2ban bans
fail2ban-client status
fail2ban-client status sshd
# 5. Run AIDE check
sudo aide --check 2>&1 | grep -v "^$"
# 6. Check for unexpected processes
ps aux | grep -v -E "(named|dhcpd|sshd|systemd|root|fail2ban|aide|prometheus|node_exp|grafana|elasticsearch|logstash|kibana)"
# 7. Check listening ports
ss -tlnp
# 8. Check recent logins
last | head -20
lastb | head -20
# 9. Review Elasticsearch for correlated events
curl -s "http://192.168.120.4:9200/logstash-*/_search?q=severity:high&size=20" | python3 -m json.tool
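Step 1 of the checklist can be run as a single sweep so that the responder gets one status line per server. This is a sketch only: the host list comes from the checklist above, while the `probe` and `sweep_hosts` names are hypothetical. `probe` is a separate function so that it can be overridden for a dry run.

```shell
# Server names and addresses as listed in the analysis checklist.
HOSTS="SKY=192.168.120.1 RAIN=192.168.120.2 SUN=192.168.120.3 WIND=192.168.120.4"

# Reachability test; redefine this function to dry-run the sweep without network access.
probe() { ping -c 2 -W 2 "$1" >/dev/null 2>&1; }

# Print "<name> <ip> UP|DOWN" for each server.
sweep_hosts() {
    for entry in $HOSTS; do
        name=${entry%%=*}
        ip=${entry#*=}
        if probe "$ip"; then
            echo "$name $ip UP"
        else
            echo "$name $ip DOWN"
        fi
    done
}
```

A DOWN result only shows that ICMP is unanswered; the server may be isolated by firewalld rather than offline.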
Phase 3 — Containment
Immediate containment (P1/P2 — do not wait for full analysis):
# Option A: Network isolate affected server via firewalld (preferred)
# Blocks all traffic except management subnet
firewall-cmd --zone=drop --add-source=0.0.0.0/0 --permanent
firewall-cmd --zone=trusted --add-source=192.168.124.0/24 --permanent
firewall-cmd --reload
# Option B: ESXi-level isolation (if server compromise is severe)
# Via vSphere Client → VM → Disconnect network adapters
# Option C: Block specific attacking IP
firewall-cmd --add-rich-rule='rule family="ipv4" source address="<ATTACKER_IP>" drop' --permanent
firewall-cmd --reload
# Preserve evidence BEFORE any remediation
# Capture running process list, network connections, and memory if possible
TS=$(date +%Y%m%d-%H%M%S) # one timestamp so every file from this capture shares a prefix
ps auxf > /tmp/ir-${TS}-procs.txt
ss -tlnp > /tmp/ir-${TS}-netstat.txt
ss -anp > /tmp/ir-${TS}-connections.txt # ss is used because netstat is absent where net-tools is not installed
cp /var/log/secure /tmp/ir-${TS}-secure.log
cp /var/log/audit/audit.log /tmp/ir-${TS}-audit.log
# Copy evidence to WIND for preservation
scp /tmp/ir-* monitadmin@192.168.120.4:/var/log/ir-evidence/
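Evidence is easier to defend later if its integrity can be shown. The sketch below writes a SHA-256 manifest next to the collected files and re-checks it on demand. The function names and the `SHA256SUMS` file name are assumptions, not an existing convention in this plan.

```shell
# Write a SHA-256 manifest covering every file in an evidence directory.
# $1 = directory holding the collected ir-* files.
seal_evidence() (
    cd "$1" || exit 1
    # Hash all regular files except the manifest itself.
    find . -type f ! -name SHA256SUMS -exec sha256sum {} + | sort -k 2 > SHA256SUMS
)

# Re-check the manifest; returns non-zero if any file changed or is missing.
verify_evidence() (
    cd "$1" || exit 1
    sha256sum --check --quiet SHA256SUMS
)
```

Running `seal_evidence` on the source host before the copy and `verify_evidence` on WIND afterwards would show that the files arrived unmodified.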
DNS/DHCP-specific containment (SKY/RAIN):
# If zone tampering suspected — freeze zone and verify integrity
rndc freeze wdc.us.gl3
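# Compare zone file checksum against last known-good backup (the two hashes must match)
sha256sum /var/named/wdc.us.gl3.db /backup/dns-dhcp/wdc.us.gl3.db.last
# Verify DNSSEC signatures (-o sets the zone origin; -z takes no argument)
dnssec-verify -o wdc.us.gl3 /var/named/wdc.us.gl3.db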
# If SKY compromised — RAIN auto-assumes; disable zone transfers FROM SKY
# On RAIN: temporarily remove SKY as notify source
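The checksum comparison can be wrapped in a small helper that returns a distinct status when a file is unreadable, so a missing backup is not mistaken for a clean result. This is a sketch; `zone_matches_backup` is a hypothetical name.

```shell
# Compare a live zone file with its backup by SHA-256 digest.
# $1 = live zone file, $2 = last known-good backup copy.
# Returns 0 if identical, 1 if different, 2 if either file is unreadable.
zone_matches_backup() {
    [ -r "$1" ] && [ -r "$2" ] || return 2
    live=$(sha256sum < "$1")
    good=$(sha256sum < "$2")
    [ "$live" = "$good" ]
}
```

For example: `zone_matches_backup /var/named/wdc.us.gl3.db /backup/dns-dhcp/wdc.us.gl3.db.last || echo "zone differs or backup unreadable"`. A difference is not proof of tampering on its own, since legitimate updates made after the backup also change the digest.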
Phase 4 — Eradication
After containment and evidence preservation:
# 1. Identify root cause from logs and AIDE report
# 2. Remove malicious artifacts (files, accounts, cron jobs, authorized_keys)
# Check for unauthorized accounts
awk -F: '$3 >= 1000 {print $1, $3}' /etc/passwd
# Check authorized_keys on all accounts
find /root /home -name "authorized_keys" -exec cat {} \;
# Check crontabs
crontab -l
ls -la /etc/cron.* /var/spool/cron/
# Check for SUID binaries added since baseline
find / -perm /4000 -newer /var/lib/aide/aide.db.gz -ls 2>/dev/null
# 3. Patch the exploited vulnerability
dnf update --security -y
# 4. Rotate credentials
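# SSH keys: regenerate host keys, rotate admin authorized_keys
# ssh-keygen -A only creates host keys that are missing, so remove the old ones first
rm -f /etc/ssh/ssh_host_*_key /etc/ssh/ssh_host_*_key.pub
ssh-keygen -A # Regenerate host keys
systemctl restart sshd
# Rotate all admin SSH keys per access control policy
# 5. Update AIDE baseline after clean state confirmed
# aide --update exits non-zero (1-7) when it records changes, so chaining the mv with && would skip it
sudo aide --update; rc=$?
[ "$rc" -lt 14 ] && sudo mv /var/lib/aide/aide.db.new.gz /var/lib/aide/aide.db.gz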
Phase 5 — Recovery
# Restore from known-good snapshot (preferred) OR rebuild from backup
# See DR Plan sections 6.1–6.4 for per-server recovery procedures
# After restoration:
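# 1. AIDE baseline update (aide --update exits non-zero when it records changes, so && would skip the mv)
sudo aide --update; rc=$?
[ "$rc" -lt 14 ] && sudo mv /var/lib/aide/aide.db.new.gz /var/lib/aide/aide.db.gz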
# 2. Re-sign DNSSEC (SKY/RAIN)
rndc sign wdc.us.gl3
rndc sign 120.168.192.in-addr.arpa
# 3. Verify all services operational
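# 4. Remove network isolation (if applied in containment)
# If Option A was used, remove the source bindings that were added
firewall-cmd --zone=drop --remove-source=0.0.0.0/0 --permanent
firewall-cmd --zone=trusted --remove-source=192.168.124.0/24 --permanent
# If Option C was used, remove the rich rule using the same rule string it was added with
firewall-cmd --remove-rich-rule='...' --permanent
firewall-cmd --reload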
# 5. Monitor intensively for 48 hours post-recovery
# Grafana: watch for re-compromise indicators
# AIDE: daily checks for 1 week
# Fail2ban: review ban list
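Step 3 ("verify all services operational") can be made repeatable with a per-server service list. The sketch below takes the service names from the Phase 2 checklist and adds fail2ban on every server, as stated in Phase 1. AIDE is left out on the assumption that it runs on demand or from a scheduler rather than as a daemon. The function names are hypothetical.

```shell
# Services expected on each server.
expected_services() {
    case "$1" in
        SKY|RAIN) echo "named dhcpd fail2ban" ;;
        SUN)      echo "prometheus grafana-server fail2ban" ;;
        WIND)     echo "elasticsearch logstash kibana fail2ban" ;;
        *)        echo "unknown server: $1" >&2; return 1 ;;
    esac
}

# Asks systemd by default; redefine this function for a dry run.
is_active() { systemctl is-active --quiet "$1"; }

# Print one line per service. Returns 1 if any service is not active, 2 for an unknown server.
verify_services() {
    services=$(expected_services "$1") || return 2
    failed=0
    for svc in $services; do
        if is_active "$svc"; then
            echo "$svc OK"
        else
            echo "$svc FAILED"
            failed=1
        fi
    done
    return "$failed"
}
```

On SKY this would be run as `verify_services SKY`; the non-zero return makes it usable as a gate before isolation is lifted.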
Phase 6 — Post-Incident Activity
Within 5 business days of incident closure:
- Incident After-Action Report — complete the IAR template at response-plans/incident-after-action-review.md.
- Root cause analysis — documented and shared with Director of Cyber Security.
- Lessons learned — update this IRP and/or DR Plan if procedures were found insufficient.
- Compliance evidence — log incident in the risk register; document remediation.
- Asset inventory update — if any hardware/software changed, update wdc-hostregistry.csv.
- AIDE baseline — confirm final clean baseline on all affected servers.
5. Incident-Specific Runbooks
5.1 SSH Brute Force / Credential Attack
# 1. Identify attacking IPs
fail2ban-client status sshd
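# Print the field after "from"; a fixed $11 returns the username on "invalid user" lines
grep "Failed password" /var/log/secure | awk '{for (i = 1; i < NF; i++) if ($i == "from") print $(i+1)}' | sort | uniq -c | sort -rn | head -20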
# 2. Confirm Fail2ban has auto-banned
# If not banned (new attack pattern):
firewall-cmd --add-rich-rule='rule family="ipv4" source address="<IP>" drop' --permanent
firewall-cmd --reload
# 3. Check for successful logins from same IP range
grep "Accepted" /var/log/secure | tail -50
# 4. If successful login detected → escalate to P1, begin full containment
# 5. Review authorized_keys — ensure no new keys were added
find /root /home -name "authorized_keys" -exec cat {} \;
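Steps 3 and 4 depend on spotting an address that failed repeatedly and then logged in successfully. The sketch below does that correlation in one pass. It assumes the standard OpenSSH log wording ("Failed password ... from <ip>", "Accepted ... from <ip>"), and `suspect_ips` is a hypothetical name.

```shell
# Print "<ip> <failed_count>" for every source address that has at least one
# failed password attempt AND at least one accepted login in the given log file(s).
suspect_ips() {
    awk '
        /Failed password/ || /Accepted / {
            ip = ""
            # The address follows the word "from", wherever that falls on the line.
            for (i = 1; i < NF; i++) if ($i == "from") { ip = $(i + 1); break }
            if (ip == "") next
            if ($0 ~ /Failed password/) { failed[ip]++ } else { accepted[ip]++ }
        }
        END { for (ip in accepted) if (ip in failed) print ip, failed[ip] }
    ' "$@" | sort
}
```

Any output from `suspect_ips /var/log/secure` is a candidate for the P1 escalation in step 4. A single mistyped password from an administrator will also appear, so the count needs a human judgment.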
5.2 AIDE Integrity Alert
# 1. Review AIDE report
sudo aide --check 2>&1 | tee /tmp/aide-report-$(date +%Y%m%d).txt
# 2. Categorize changes:
# - Expected (post-change): update baseline → not an incident
# - Unexpected: investigate immediately
# 3. For unexpected changes, identify what changed:
# File added: potential malware drop
# File modified: potential config tampering or binary replacement
# File deleted: potential evidence destruction
# 4. Cross-reference with /var/log/asset-inventory.log
# If change not in log → treat as unauthorized → escalate to P2/P1
# 5. If binary modified — assume compromise; escalate to P1
# Isolate server immediately (see Phase 3 containment)
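Categorizing an AIDE result is quicker when the exit status is decoded first. The sketch below assumes the convention described in the aide(1) manual page: the status is a sum of 1 (new files), 2 (removed files), and 4 (changed files), and values of 14 and above indicate errors. `describe_aide_rc` is a hypothetical helper name.

```shell
# Translate an `aide --check` or `aide --update` exit status into a short description.
# $1 = exit status.
describe_aide_rc() {
    rc=$1
    if [ "$rc" -eq 0 ]; then
        echo "no changes"
        return 0
    fi
    if [ "$rc" -gt 7 ]; then
        echo "aide error or unexpected status ($rc)"
        return 0
    fi
    desc=""
    [ $(( rc & 1 )) -ne 0 ] && desc="$desc added"
    [ $(( rc & 2 )) -ne 0 ] && desc="$desc removed"
    [ $(( rc & 4 )) -ne 0 ] && desc="$desc changed"
    echo "files:$desc"
}
```

A typical call would be `sudo aide --check >/tmp/aide-report.txt 2>&1; describe_aide_rc $?`. A "removed" result maps to the evidence-destruction case in step 3, and "changed" on a binary maps to step 5.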
5.3 DNS Zone Tampering
# 1. Verify current zone against last backup
diff /var/named/wdc.us.gl3.db /backup/dns-dhcp/wdc.us.gl3.db.last
# 2. Check DNSSEC signatures
dnssec-verify -o wdc.us.gl3 /var/named/wdc.us.gl3.db
# 3. Check named journal files for recent changes
ls -la /var/named/*.jnl
# 4. Freeze zone immediately
rndc freeze wdc.us.gl3
# 5. If tampering confirmed:
# - Restore zone from last known-good backup
# - Re-sign DNSSEC
# - Check TSIG keys (rotate if compromised)
# - Escalate to P1
# 6. Alert all DNS clients if records were poisoned
5.4 Unauthorized Access / Privilege Escalation
# 1. Identify unauthorized account or privilege change
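# Syslog lines carry the hostname, so filtering on $(hostname) can hide every entry; match sudo command records instead
grep "sudo" /var/log/secure | grep "COMMAND=" | tail -50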
grep "useradd\|usermod\|passwd" /var/log/secure | tail -50
# 2. Check for new accounts
awk -F: '$3 >= 1000 {print $1, $3, $6}' /etc/passwd
# 3. Lock unauthorized account immediately
passwd -l <USERNAME>
usermod -s /sbin/nologin <USERNAME>
# 4. Kill active sessions
pkill -u <USERNAME>
# 5. Review what the account accessed
grep <USERNAME> /var/log/secure /var/log/audit/audit.log | tail -100
# 6. Rotate all admin credentials on affected server
# 7. Escalate to P1 — full containment
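Step 2 lists every login account but leaves the comparison against known administrators to the reader. The sketch below automates that comparison. The approved-accounts file and the `unexpected_accounts` name are hypothetical; no such file is defined elsewhere in this plan.

```shell
# List login accounts (UID >= 1000) that do not appear in an approved list.
# $1 = passwd-format file, $2 = approved usernames, one per line (hypothetical file).
unexpected_accounts() {
    awk -F: -v allow="$2" '
        BEGIN { while ((getline name < allow) > 0) ok[name] = 1 }
        # "nobody" has a high UID on most distributions but is not a login account.
        $3 >= 1000 && $1 != "nobody" && !($1 in ok) { print $1 }
    ' "$1"
}
```

For example, `unexpected_accounts /etc/passwd /root/approved-admins.txt` would print only the names that need the lock and kill actions in steps 3 and 4. If the approved list is missing, every login account is reported, which fails safe.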
5.5 GCP / Cloud Incident
# 1. Check Cloud Audit Logs for unauthorized API calls
gcloud logging read \
'protoPayload.authorizationInfo.granted=true AND protoPayload.authenticationInfo.principalEmail!~"@greenpeace.us"' \
--project=gpus-infra \
--limit=50 \
--format=json
# 2. Check IAM for unauthorized bindings
gcloud projects get-iam-policy gpus-infra --format=json
# 3. Check for unauthorized Cloud Run deployments
gcloud run services list --region=us-central1 --project=gpus-infra
# 4. If unauthorized service account activity detected:
gcloud iam service-accounts disable <SA_EMAIL> --project=gpus-infra
# 5. If VPN tunnel compromised — rotate PSK via Terraform
# Update var.vpn_shared_secret in terraform.tfvars
# terraform apply -target=google_compute_vpn_tunnel.wdc_tunnel
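The IAM policy in step 2 is long JSON, and the members that matter are those outside the organisation. The sketch below filters a flat member list. It assumes the list is produced with gcloud's `--flatten` and `--format` options as shown in the usage note, and that the only legitimate principals are `@greenpeace.us` users and the project's own service accounts. `external_members` is a hypothetical name.

```shell
# Read IAM members (one per line) on stdin and print those that are neither
# in the organisation domain nor service accounts of the gpus-infra project.
# Anchoring at the end of the address avoids matching look-alike domains.
external_members() {
    grep -v -E '@greenpeace\.us$|@gpus-infra\.iam\.gserviceaccount\.com$' | sort -u
}
```

Assumed usage: `gcloud projects get-iam-policy gpus-infra --flatten="bindings[].members" --format="value(bindings.members)" | external_members`. Google-managed service agents will also appear in the output and need to be recognised manually.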
6. Communication Templates
P1 Initial Alert (Director of Cyber Security → Stakeholders)
SUBJECT: [P1 SECURITY INCIDENT] GPUS-IT — <Brief Description>
Time detected: <HH:MM UTC>
Systems affected: <SERVER LIST>
Service impact: <DNS/DHCP/Monitoring/Logging/GCP>
Current status: Containment in progress
Incident Commander: <NAME>
Next update: <HH:MM UTC>
Actions taken so far:
- <ACTION 1>
- <ACTION 2>
Incident Closure Notification
SUBJECT: [RESOLVED] GPUS-IT Security Incident — <Incident ID>
Incident: <DESCRIPTION>
Duration: <START TIME> → <END TIME>
Root cause: <SUMMARY>
Systems affected: <LIST>
Service impact: <DESCRIPTION>
Remediation completed:
- <ACTION 1>
- <ACTION 2>
After-action review scheduled: <DATE>
7. IR Testing Schedule
| Exercise | Frequency | Owner | Type |
|---|---|---|---|
| Tabletop exercise — P1 DNS failure | Quarterly | Director of Cyber Security | Discussion-based |
| Live failover drill — SKY to RAIN | Quarterly | DNS Admin | Live test |
| Containment procedure walkthrough | Semi-annually | Security Ops | Walkthrough |
| Full IR simulation — active compromise | Annually | Director of Cyber Security + Full Team | Simulation |
| GCP incident response drill | Annually | GCP Admin | Simulation |
8. Evidence Retention
| Evidence Type | Retention Period | Storage Location |
|---|---|---|
| IR evidence archives | 1 year | WIND: /var/log/ir-evidence/ + GCS |
| AIDE reports from incidents | 1 year | WIND: /var/log/aide/ |
| Auth logs (/var/log/secure) | 90 days rolling | WIND (rsyslog) |
| Audit logs (/var/log/audit/) | 90 days rolling | WIND (rsyslog) |
| Incident After-Action Reports | Indefinite | response-plans/incident-after-action-review.md + GCS |
| Change log entries | Indefinite | /var/log/asset-inventory.log |
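The fixed retention periods can be checked with a listing of files that have aged out. The sketch below only lists candidates and deliberately does not delete anything. `expired_evidence` is a hypothetical name, and using file modification time as the age is an assumption.

```shell
# List files whose modification time is older than the retention period.
# $1 = directory, $2 = retention in days.
expired_evidence() {
    find "$1" -type f -mtime +"$2" | sort
}
```

For the one-year rows above this would be `expired_evidence /var/log/ir-evidence 365` on WIND. Anything tied to an open legal or compliance review should be held regardless of age.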
9. Regulatory Notification Thresholds
Greenpeace US handles supporter data and payment card data (PCI-DSS). The following incidents require regulatory or legal review:
| Incident Type | Notification Requirement | Threshold |
|---|---|---|
| Cardholder data exposure | PCI-DSS — notify acquiring bank + card brands | Any confirmed exposure |
| Supporter PII breach | State breach notification laws | Any confirmed exposure of PII |
| Unauthorized access to supporter data | Internal escalation to Legal | Any unauthorized access |
In all cases, notify the Director of Cyber Security immediately, then Legal Counsel within 1 hour of P1 declaration.
10. Plan Maintenance
This plan is reviewed and tested:
- Annually — full review by Director of Cyber Security + full IR team
- After any security incident — updated within 5 business days
- After any infrastructure change — reviewed for impact on detection/response capabilities
See also
- Forms Portal — IR Playbooks — FP-IR-01 through FP-IR-05: HappyFox credential leak, Okta token compromise, submission-field decrypt failure, DB IAM-role misuse, attachment malware scan fail