Incident Response Plan

Classification: CONFIDENTIAL — Internal Use Only Document: response-plans/irp.md · v1.0 · 2026-03-14 · GPUS-IT

1. Purpose & Scope

This Incident Response Plan (IRP) defines the process for detecting, containing, eradicating, and recovering from security incidents affecting the GPUS-IT infrastructure. It covers all four WDC on-premises servers (SKY, RAIN, SUN, WIND), GCP cloud services, the Cloud VPN tunnel, and all hosted applications.

Compliance alignment:

Framework	Reference
CIS Controls v8	CIS 17.1 — Designate personnel to manage incident handling
CIS Controls v8	CIS 17.2 — Establish and maintain contact information
CIS Controls v8	CIS 17.3 — Establish and maintain an enterprise process for reporting incidents
CIS Controls v8	CIS 17.4 — Establish and maintain an incident response process
CIS Controls v8	CIS 17.7 — Conduct routine incident response exercises
CIS Controls v8	CIS 17.9 — Establish and maintain security incident thresholds
PCI-DSS	Requirement 12.10 — Implement an incident response plan
NIST SP 800-53	IR-1 through IR-10 — Incident Response control family
NIST CSF	Respond (RS) and Recover (RC) functions

2. Incident Severity Classification

Severity	Description	Response Time	Examples
P1 — Critical	Complete service unavailability or confirmed active compromise	Immediate / 15 min	DNS/DHCP fully down, root-level compromise, ransomware, zone data tampering, all servers unreachable
P2 — High	Partial service degradation or active attack in progress	< 1 hour	DHCP failover failure, DDoS against resolvers, failed login flood, VPN tunnel down, single server compromise
P3 — Medium	Security anomaly detected, no confirmed service impact	< 4 hours	Unauthorized zone transfer attempt, AIDE integrity alert, config drift detected, Fail2ban threshold breach
P4 — Low	Informational alert requiring investigation	< 24 hours	Single failed login, DHCP lease pool warning, certificate expiry notice, minor log anomaly
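
The response-time column above can be encoded once so that alerting or ticketing scripts apply the same deadlines. This is a minimal sketch: the function name `response_minutes` is hypothetical, and P1 is mapped to the 15-minute upper bound of "Immediate / 15 min".

```shell
#!/usr/bin/env bash
# Map a severity label to its maximum response window in minutes.
# Values mirror the severity table; P1 uses the 15-minute upper bound.
response_minutes() {
    case "$1" in
        P1) echo 15 ;;
        P2) echo 60 ;;
        P3) echo 240 ;;
        P4) echo 1440 ;;
        *)  echo "unknown severity: $1" >&2; return 1 ;;
    esac
}
```

For example, `response_minutes P2` prints `60`, and an unrecognized label returns a non-zero status so a calling script can fail loudly.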

3. Incident Response Team

Role	Responsibilities	Contact
Incident Commander (Director of Cyber Security)	Activate IRP, coordinate response, external communications, escalation decisions	On-call phone
DNS/DHCP Responder	SKY/RAIN containment, DNSSEC integrity, DNS/DHCP service restoration	On-call phone + SSH
Monitoring/Logging Responder	SUN/WIND containment, log collection, forensic preservation	On-call phone + SSH
Security Operations	Threat analysis, IOC identification, containment recommendations, AIDE verification	SOC hotline
Network Operations	Meraki MX100 firewall rules, VPN tunnel, network-layer isolation	NOC hotline
GCP Responder	Cloud Run, VPC, IAM investigation and remediation	GCP Console + CLI
Legal / Compliance	Data breach assessment, regulatory notification (if required)	Legal counsel

4. Incident Response Phases

Phase 1 — Preparation

Ongoing activities that enable effective response:

IR runbooks maintained and tested quarterly.
All four servers: Fail2ban active, AIDE baseline current, audit logging to WIND via rsyslog.
Grafana alerting configured on SUN for threshold breaches.
Elasticsearch alerts on WIND for security event patterns.
Out-of-band communication path maintained: Signal group + phone tree.
Offline copies of all configs, zone files, and DNSSEC keys in /backup/ and GCS.
ESXi snapshots current (daily schedule verified).
This document reviewed and exercised annually.

Phase 2 — Detection & Analysis

Detection sources:

Source	What it detects	Location
Fail2ban	SSH brute force, repeated auth failures	All 4 servers
AIDE	File integrity violations, unexpected config changes	All 4 servers
Prometheus/Grafana	Resource anomalies, service down, threshold breaches	SUN
Elasticsearch/Kibana	Log pattern analysis, security event correlation	WIND
NTA cron	Network traffic anomalies	SKY, RAIN (hourly)
Manual	Admin observation, user reports, external notification	All

Initial analysis checklist:

# 1. Identify affected server(s)
ping 192.168.120.1  # SKY
ping 192.168.120.2  # RAIN
ping 192.168.120.3  # SUN
ping 192.168.120.4  # WIND

# 2. Check service status on affected server
systemctl status named dhcpd fail2ban aide    # SKY/RAIN
systemctl status prometheus grafana-server    # SUN
systemctl status elasticsearch logstash kibana  # WIND

# 3. Check recent auth failures
grep "Failed password" /var/log/secure | tail -50
grep "authentication failure" /var/log/secure | tail -50

# 4. Check Fail2ban bans
fail2ban-client status
fail2ban-client status sshd

# 5. Run AIDE check
sudo aide --check 2>&1 | grep -v "^$"

# 6. Check for unexpected processes
# (do not filter on "root": that would hide every root-owned process)
ps aux | grep -v -E "(named|dhcpd|sshd|systemd|fail2ban|aide|prometheus|node_exp|grafana|elasticsearch|logstash|kibana)"

# 7. Check listening ports
ss -tlnp

# 8. Check recent logins
last | head -20
lastb | head -20

# 9. Review Elasticsearch for correlated events
curl -s "http://192.168.120.4:9200/logstash-*/_search?q=severity:high&size=20" | python3 -m json.tool
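
Step 7 is easier to act on if the listening ports are compared against what each server is expected to expose. This is a rough sketch: the helper name `unexpected_ports` and the allowlist passed to it are assumptions, not part of the current runbook.

```shell
#!/usr/bin/env bash
# Print listening TCP ports that are not in an allowlist.
# Reads `ss -tln` output on stdin; the allowlist is a space-separated string
# that each server would need to define for itself (hypothetical values).
unexpected_ports() {
    local allowed=" $1 "
    # Field 4 is "address:port"; the port is the text after the last colon.
    awk '$1 == "LISTEN" { n = split($4, a, ":"); print a[n] }' \
        | sort -un \
        | while read -r port; do
              case "$allowed" in
                  *" $port "*) ;;        # expected port, skip
                  *) echo "$port" ;;     # not in the allowlist
              esac
          done
}
```

On SKY or RAIN this might be run as `ss -tln | unexpected_ports "22 53 953"`. An empty result means nothing outside the list is listening.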

Phase 3 — Containment

Immediate containment (P1/P2 — do not wait for full analysis):

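# Option A: Network isolate affected server via firewalld (preferred)
# Intended effect: block all traffic except the management subnet.
# Run this from the ESXi console rather than over SSH, then confirm that
# management-subnet access still works; if the drop binding matches first,
# the server ends up fully isolated.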
firewall-cmd --zone=drop --add-source=0.0.0.0/0 --permanent
firewall-cmd --zone=trusted --add-source=192.168.124.0/24 --permanent
firewall-cmd --reload

# Option B: ESXi-level isolation (if server compromise is severe)
# Via vSphere Client → VM → Disconnect network adapters

# Option C: Block specific attacking IP
firewall-cmd --add-rich-rule='rule family="ipv4" source address="<ATTACKER_IP>" drop' --permanent
firewall-cmd --reload

# Preserve evidence BEFORE any remediation
# Capture running process list, network connections, and memory if possible
# Use one timestamp so every file from this capture shares the same prefix
TS=$(date +%Y%m%d-%H%M%S)
ps auxf > /tmp/ir-$TS-procs.txt
ss -tlnp > /tmp/ir-$TS-netstat.txt
netstat -an > /tmp/ir-$TS-connections.txt
cp /var/log/secure /tmp/ir-$TS-secure.log
cp /var/log/audit/audit.log /tmp/ir-$TS-audit.log

# Copy evidence to WIND for preservation
scp /tmp/ir-* monitadmin@192.168.120.4:/var/log/ir-evidence/
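
Evidence is easier to defend later if it is hashed at collection time. The sketch below copies files into one timestamped directory and writes a SHA-256 manifest. The function name `collect_evidence` and the `SHA256SUMS` file name are illustrative assumptions.

```shell
#!/usr/bin/env bash
# Copy files into one timestamped evidence directory and record SHA-256 hashes.
# Usage: collect_evidence <base_dir> <file>...
collect_evidence() {
    local base="$1"; shift
    local dir f
    dir="$base/ir-$(date +%Y%m%d-%H%M%S)"
    mkdir -p "$dir" || return 1
    for f in "$@"; do
        # -p keeps timestamps and mode; unreadable or missing files are skipped
        [ -r "$f" ] && cp -p "$f" "$dir/"
    done
    # The manifest lets a reviewer confirm nothing changed after collection
    ( cd "$dir" && find . -type f ! -name SHA256SUMS -exec sha256sum {} + > SHA256SUMS ) || return 1
    echo "$dir"
}
```

A call such as `collect_evidence /tmp /var/log/secure /var/log/audit/audit.log` prints the directory it created. After copying that directory to WIND, running `sha256sum -c SHA256SUMS` inside it confirms the transfer was intact.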

DNS/DHCP-specific containment (SKY/RAIN):

# If zone tampering suspected — freeze zone and verify integrity
rndc freeze wdc.us.gl3
# Compare zone file checksum against last known-good backup
sha256sum /var/named/wdc.us.gl3.db
# Verify DNSSEC signatures (-o sets the zone origin; -z is a flag and takes no argument)
dnssec-verify -o wdc.us.gl3 /var/named/wdc.us.gl3.db

# If SKY is compromised, RAIN takes over DNS/DHCP automatically; stop RAIN from accepting zone transfers FROM SKY
# On RAIN: temporarily remove SKY as notify source
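
The checksum comparison above only helps if the result is checked against a recorded value rather than read by eye. This is a small sketch with a hypothetical helper name, `verify_checksum`. It assumes a known-good hash is available, for example one computed from the last backup copy.

```shell
#!/usr/bin/env bash
# Compare a file's SHA-256 against a recorded known-good value.
# Returns 0 on match, 1 on mismatch, 2 if the file cannot be read.
verify_checksum() {
    local file="$1" expected="$2" actual
    [ -r "$file" ] || return 2
    actual=$(sha256sum -- "$file" | awk '{print $1}')
    [ "$actual" = "$expected" ]
}
```

One possible use is `verify_checksum /var/named/wdc.us.gl3.db "$(sha256sum /backup/dns-dhcp/wdc.us.gl3.db.last | awk '{print $1}')"`. A mismatch is expected if the zone has taken legitimate dynamic updates since the backup, so treat a non-zero result as a prompt to diff the files, not as proof of tampering.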

Phase 4 — Eradication

After containment and evidence preservation:

# 1. Identify root cause from logs and AIDE report
# 2. Remove malicious artifacts (files, accounts, cron jobs, authorized_keys)

# Check for unauthorized accounts
awk -F: '$3 >= 1000 {print $1, $3}' /etc/passwd

# Check authorized_keys on all accounts
find /root /home -name "authorized_keys" -print -exec cat {} \;

# Check crontabs
crontab -l
ls -la /etc/cron.* /var/spool/cron/

# Check for SUID binaries added since baseline
find / -perm /4000 -newer /var/lib/aide/aide.db.gz -ls 2>/dev/null

# 3. Patch the exploited vulnerability
dnf update --security -y

# 4. Rotate credentials
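# SSH keys: regenerate host keys, rotate admin authorized_keys
# ssh-keygen -A only creates host keys that are missing, so remove the old ones first
rm -f /etc/ssh/ssh_host_*_key /etc/ssh/ssh_host_*_key.pub
ssh-keygen -A  # Regenerate host keys
systemctl restart sshd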
# Rotate all admin SSH keys per access control policy

# 5. Update AIDE baseline after clean state confirmed
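# aide --update exits with 1-7 when it reports differences, so "&&" would skip the move
sudo aide --update; [ $? -lt 8 ] && sudo mv /var/lib/aide/aide.db.new.gz /var/lib/aide/aide.db.gz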
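
AIDE reports what it found through its exit status, which matters when a script decides whether to promote a new baseline. This sketch decodes that status as I understand it from the aide(1) manual: values 1 to 7 are a bitmask of new, removed, and changed entries, and values of 14 and above are errors. The function name `aide_status` is hypothetical.

```shell
#!/usr/bin/env bash
# Decode the exit status of `aide --check` or `aide --update`.
# 1 = new entries, 2 = removed entries, 4 = changed entries (combined as a
# bitmask); 14 and above indicate an AIDE error.
aide_status() {
    local rc="$1" out=""
    if [ "$rc" -eq 0 ]; then echo "no-differences"; return 0; fi
    if [ "$rc" -ge 14 ]; then echo "error"; return 0; fi
    if [ "$rc" -gt 7 ]; then echo "unknown"; return 0; fi
    [ $(( rc & 1 )) -ne 0 ] && out="$out new"
    [ $(( rc & 2 )) -ne 0 ] && out="$out removed"
    [ $(( rc & 4 )) -ne 0 ] && out="$out changed"
    echo "${out# }"
}
```

For instance, `sudo aide --check; aide_status $?` prints `new changed` when AIDE exits with status 5.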

Phase 5 — Recovery

# Restore from known-good snapshot (preferred) OR rebuild from backup
# See DR Plan sections 6.1–6.4 for per-server recovery procedures

# After restoration:
# 1. AIDE baseline update
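# aide --update exits with 1-7 when it reports differences, so "&&" would skip the move
sudo aide --update; [ $? -lt 8 ] && sudo mv /var/lib/aide/aide.db.new.gz /var/lib/aide/aide.db.gz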

# 2. Re-sign DNSSEC (SKY/RAIN)
rndc sign wdc.us.gl3
rndc sign 120.168.192.in-addr.arpa

# 3. Verify all services operational
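# 4. Remove network isolation (if applied in containment)
# Option A source bindings (remove the trusted binding only if it was added for containment):
firewall-cmd --zone=drop --remove-source=0.0.0.0/0 --permanent
firewall-cmd --zone=trusted --remove-source=192.168.124.0/24 --permanent
# Option C rich rule (repeat the exact rule text that was added):
firewall-cmd --remove-rich-rule='...' --permanent
firewall-cmd --reload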

# 5. Monitor intensively for 48 hours post-recovery
# Grafana: watch for re-compromise indicators
# AIDE: daily checks for 1 week
# Fail2ban: review ban list

Phase 6 — Post-Incident Activity

Within 5 business days of incident closure:

Incident After-Action Report — complete the IAR template at response-plans/incident-after-action-review.md.
Root cause analysis — documented and shared with Director of Cyber Security.
Lessons learned — update this IRP and/or DR Plan if procedures were found insufficient.
Compliance evidence — log incident in the risk register; document remediation.
Asset inventory update — if any hardware/software changed, update wdc-hostregistry.csv.
AIDE baseline — confirm final clean baseline on all affected servers.

5. Incident-Specific Runbooks

5.1 SSH Brute Force / Credential Attack

# 1. Identify attacking IPs
fail2ban-client status sshd
grep "Failed password" /var/log/secure | grep -oE 'from [^ ]+' | awk '{print $2}' | sort | uniq -c | sort -rn | head -20

# 2. Confirm Fail2ban has auto-banned
# If not banned (new attack pattern):
firewall-cmd --add-rich-rule='rule family="ipv4" source address="<IP>" drop' --permanent
firewall-cmd --reload

# 3. Check for successful logins from same IP range
grep "Accepted" /var/log/secure | tail -50

# 4. If successful login detected → escalate to P1, begin full containment
# 5. Review authorized_keys — ensure no new keys were added
find /root /home -name "authorized_keys" -print -exec cat {} \;
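
Step 3 can be narrowed by cross-referencing successful logins against the attacking addresses found in step 1. This is a sketch under stated assumptions: the helper name `accepted_from_attackers` is hypothetical, it takes a file with one IP per line, and it matches exact addresses only, not ranges.

```shell
#!/usr/bin/env bash
# Print "Accepted" sshd log lines whose source address appears in a list of
# attacking IPs. Usage: accepted_from_attackers <ip_list_file> < logfile
accepted_from_attackers() {
    local ips="$1"
    awk -v ipfile="$ips" '
        # Load the attacker list into a lookup table
        BEGIN { while ((getline ip < ipfile) > 0) if (ip != "") bad[ip] = 1 }
        / Accepted / {
            # The source address is the token that follows "from"
            for (i = 1; i < NF; i++)
                if ($i == "from" && ($(i + 1) in bad)) { print; break }
        }'
}
```

With the attacker addresses saved to a file, `accepted_from_attackers /tmp/attackers.txt < /var/log/secure` prints any successful login from one of them. Any output here is the condition that step 4 escalates to P1.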

5.2 AIDE Integrity Alert

# 1. Review AIDE report
sudo aide --check 2>&1 | tee /tmp/aide-report-$(date +%Y%m%d).txt

# 2. Categorize changes:
#    - Expected (post-change): update baseline → not an incident
#    - Unexpected: investigate immediately

# 3. For unexpected changes, identify what changed:
# File added: potential malware drop
# File modified: potential config tampering or binary replacement
# File deleted: potential evidence destruction

# 4. Cross-reference with /var/log/asset-inventory.log
# If change not in log → treat as unauthorized → escalate to P2/P1

# 5. If binary modified — assume compromise; escalate to P1
# Isolate server immediately (see Phase 3 containment)

5.3 DNS Zone Tampering

# 1. Verify current zone against last backup
diff /var/named/wdc.us.gl3.db /backup/dns-dhcp/wdc.us.gl3.db.last

# 2. Check DNSSEC signatures (-o sets the zone origin; -z is a flag and takes no argument)
dnssec-verify -o wdc.us.gl3 /var/named/wdc.us.gl3.db

# 3. Check named journal files for recent changes
ls -la /var/named/*.jnl

# 4. Freeze zone immediately
rndc freeze wdc.us.gl3

# 5. If tampering confirmed:
#    - Restore zone from last known-good backup
#    - Re-sign DNSSEC
#    - Check TSIG keys (rotate if compromised)
#    - Escalate to P1

# 6. Alert all DNS clients if records were poisoned

5.4 Unauthorized Access / Privilege Escalation

# 1. Identify unauthorized account or privilege change
grep "sudo" /var/log/secure | tail -50
grep "useradd\|usermod\|passwd" /var/log/secure | tail -50

# 2. Check for new accounts
awk -F: '$3 >= 1000 {print $1, $3, $6}' /etc/passwd

# 3. Lock unauthorized account immediately
passwd -l <USERNAME>
usermod -s /sbin/nologin <USERNAME>

# 4. Kill active sessions
pkill -u <USERNAME>

# 5. Review what the account accessed
grep <USERNAME> /var/log/secure /var/log/audit/audit.log | tail -100

# 6. Rotate all admin credentials on affected server
# 7. Escalate to P1 — full containment
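
Step 2 lists every account with a UID of 1000 or higher, which still leaves the responder to spot the unexpected one. This sketch compares that list against approved names. The helper name `unapproved_accounts` and the approved list are assumptions, and each server would keep its own list.

```shell
#!/usr/bin/env bash
# List accounts with UID >= 1000 that are not in an approved list.
# Usage: unapproved_accounts <passwd_file> <approved_user>...
unapproved_accounts() {
    local passwd="$1"; shift
    local approved=" $* "
    # "nobody" has a high UID on many systems but is not a login account
    awk -F: '$3 >= 1000 && $1 != "nobody" { print $1 }' "$passwd" \
        | while read -r user; do
              case "$approved" in
                  *" $user "*) ;;        # approved account, skip
                  *) echo "$user" ;;     # not in the approved list
              esac
          done
}
```

For example, `unapproved_accounts /etc/passwd monitadmin` prints every regular account other than `monitadmin`. Anything it prints is a candidate for the lock and kill steps above.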

5.5 GCP / Cloud Incident

# 1. Check Cloud Audit Logs for unauthorized API calls
gcloud logging read \
    'protoPayload.authorizationInfo.granted=true AND protoPayload.authenticationInfo.principalEmail!~"@greenpeace.us"' \
    --project=gpus-infra \
    --limit=50 \
    --format=json

# 2. Check IAM for unauthorized bindings
gcloud projects get-iam-policy gpus-infra --format=json

# 3. Check for unauthorized Cloud Run deployments
gcloud run services list --region=us-central1 --project=gpus-infra

# 4. If unauthorized service account activity detected:
gcloud iam service-accounts disable <SA_EMAIL> --project=gpus-infra

# 5. If VPN tunnel compromised — rotate PSK via Terraform
# Update var.vpn_shared_secret in terraform.tfvars
# terraform apply -target=google_compute_vpn_tunnel.wdc_tunnel
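
The IAM policy from step 2 is long JSON, so flattening it to one binding per line can make outside principals easier to spot. This is a sketch, assuming input lines of the form `role<TAB>member`. The helper name `external_iam_users` is hypothetical, and it flags only `user:` members and leaves service accounts and groups for manual review.

```shell
#!/usr/bin/env bash
# Print IAM bindings whose member is a user outside the expected domain.
# Reads "role<TAB>member" lines on stdin. Usage: external_iam_users <domain>
external_iam_users() {
    local domain="$1"
    awk -F'\t' -v suffix="@$domain" '
        $2 ~ /^user:/ {
            n = length($2); s = length(suffix)
            # Flag the binding unless the member ends with "@<domain>"
            if (n < s || substr($2, n - s + 1) != suffix) print $1 "\t" $2
        }'
}
```

One way to produce that input is `gcloud projects get-iam-policy gpus-infra --flatten="bindings[].members" --format="value(bindings.role,bindings.members)" | external_iam_users greenpeace.us`. Check the `--flatten` and `--format` output against the installed gcloud version before relying on it during an incident.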

6. Communication Templates

P1 Initial Alert (Director of Cyber Security → Stakeholders)

SUBJECT: [P1 SECURITY INCIDENT] GPUS-IT — <Brief Description>

Time detected: <HH:MM UTC>
Systems affected: <SERVER LIST>
Service impact: <DNS/DHCP/Monitoring/Logging/GCP>
Current status: Containment in progress

Incident Commander: <NAME>
Next update: <HH:MM UTC>

Actions taken so far:
- <ACTION 1>
- <ACTION 2>

Incident Closure Notification

SUBJECT: [RESOLVED] GPUS-IT Security Incident — <Incident ID>

Incident: <DESCRIPTION>
Duration: <START TIME> → <END TIME>
Root cause: <SUMMARY>
Systems affected: <LIST>
Service impact: <DESCRIPTION>

Remediation completed:
- <ACTION 1>
- <ACTION 2>

After-action review scheduled: <DATE>

7. IR Testing Schedule

Exercise	Frequency	Owner	Type
Tabletop exercise — P1 DNS failure	Quarterly	Director of Cyber Security	Discussion-based
Live failover drill — SKY to RAIN	Quarterly	DNS Admin	Live test
Containment procedure walkthrough	Semi-annually	Security Ops	Walkthrough
Full IR simulation — active compromise	Annually	Director of Cyber Security + Full Team	Simulation
GCP incident response drill	Annually	GCP Admin	Simulation

8. Evidence Retention

Evidence Type	Retention Period	Storage Location
IR evidence archives	1 year	WIND: `/var/log/ir-evidence/` + GCS
AIDE reports from incidents	1 year	WIND: `/var/log/aide/`
Auth logs (`/var/log/secure`)	90 days rolling	WIND (rsyslog)
Audit logs (`/var/log/audit/`)	90 days rolling	WIND (rsyslog)
Incident After-Action Reports	Indefinite	`response-plans/incident-after-action-review.md` + GCS
Change log entries	Indefinite	`/var/log/asset-inventory.log`
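
The retention periods above could be enforced with a periodic review of what has aged out. This sketch only lists candidates and deliberately deletes nothing. The helper name `expired_evidence` is hypothetical.

```shell
#!/usr/bin/env bash
# List evidence files older than a retention period, without deleting them.
# Usage: expired_evidence <dir> <retention_days>
expired_evidence() {
    local dir="$1" days="$2"
    # -mtime +N matches files last modified more than N whole days ago
    find "$dir" -type f -mtime +"$days" -print | sort
}
```

For the one-year archives on WIND this would be `expired_evidence /var/log/ir-evidence 365`. Because the output is a list, it can be reviewed before anything is removed, for example to confirm that no open incident or legal hold still depends on a file.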

9. Regulatory Notification Thresholds

Greenpeace US handles supporter data and payment card data (PCI-DSS). The following incidents require regulatory or legal review:

Incident Type	Notification Requirement	Threshold
Cardholder data exposure	PCI-DSS — notify acquiring bank + card brands	Any confirmed exposure
Supporter PII breach	State breach notification laws	Any confirmed exposure of PII
Unauthorized access to supporter data	Internal escalation to Legal	Any unauthorized access

In all cases: notify Director of Cyber Security immediately → Legal Counsel within 1 hour of P1 declaration.

10. Plan Maintenance

This plan is reviewed and tested:

Annually — full review by Director of Cyber Security + full IR team
After any security incident — updated within 5 business days
After any infrastructure change — reviewed for impact on detection/response capabilities

Document version: v1.0 · 2026-03-14 · GPUS-IT · Classification: CONFIDENTIAL — Internal Use Only