Incident Response 101: From Detection to Recovery
Key Takeaways
- Incident response is the structured process of detecting, containing, eradicating, and recovering from a security incident.
- NIST SP 800-61r2 defines four phases, which the industry generally expands to six.
- A practical IR plan covers confirmed or suspected unauthorized access, malware infection, data exfiltration, DoS attacks, ransomware, and insider threats.
In February 2024, Change Healthcare — a subsidiary of UnitedHealth Group processing 50% of US medical insurance claims — was hit by ALPHV/BlackCat ransomware. The time between initial access and the encryption event: 9 days. The time between encryption and full service restoration: months. The direct cost as disclosed to the SEC: $872 million in Q1 2024 alone.
The 9-day dwell window is where incident response either works or fails. Attackers moved laterally through Change Healthcare's network, mapped their infrastructure, exfiltrated data, and then encrypted everything — all while existing security monitoring failed to generate actionable alerts. The initial entry vector: Citrix credentials with no multi-factor authentication.
This post is not about compliance theater. It is about building the operational capability to detect active intrusions faster than attackers can complete their objectives, contain damage once detected, and recover with the fidelity required to not face the same incident six months later.
What Incident Response Actually Is
Incident response is the structured process of detecting, containing, eradicating, and recovering from a security incident. The word "structured" is load-bearing — the difference between a two-hour containment and a two-week crisis almost always traces back to decisions made before the incident occurred: who does what, with what authority, using which tools, and through what communications channel.
A security incident is any event that actually or potentially jeopardizes confidentiality, integrity, or availability of information systems. The NIST definition (SP 800-61r2) distinguishes events (observable occurrences) from incidents (events with adverse security implications). Triage discipline — the ability to accurately classify events and escalate actual incidents without drowning in false positives — is one of the most underdeveloped skills in most security teams.
The NIST SP 800-61 Lifecycle
NIST SP 800-61r2 defines four phases, which the industry generally expands to six. The phases are not fully sequential — recovery can reveal new indicators that send you back to containment, and lessons from one incident should feed directly into the next cycle's preparation.
Preparation
↓
Detection & Analysis ←────────────────────┐
↓ │
Containment ──────────────────────────────┤
↓ │ (loop if new indicators found)
Eradication ──────────────────────────────┤
↓ │
Recovery ─────────────────────────────────┘
↓
Post-Incident Review (Lessons Learned)
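The loop-back arrows in the diagram can be read as a small transition table. The sketch below is illustrative only: `Phase`, `NEXT_PHASE`, `LOOP_BACK_FROM`, and `advance` are hypothetical names chosen for this example, not anything defined by NIST SP 800-61r2.

```python
from enum import Enum


class Phase(Enum):
    PREPARATION = "Preparation"
    DETECTION = "Detection & Analysis"
    CONTAINMENT = "Containment"
    ERADICATION = "Eradication"
    RECOVERY = "Recovery"
    REVIEW = "Post-Incident Review"


# Normal forward path through the lifecycle.
NEXT_PHASE = {
    Phase.PREPARATION: Phase.DETECTION,
    Phase.DETECTION: Phase.CONTAINMENT,
    Phase.CONTAINMENT: Phase.ERADICATION,
    Phase.ERADICATION: Phase.RECOVERY,
    Phase.RECOVERY: Phase.REVIEW,
    Phase.REVIEW: Phase.PREPARATION,  # lessons feed the next cycle's preparation
}

# Phases from which new indicators send the team back to Detection & Analysis.
LOOP_BACK_FROM = {Phase.CONTAINMENT, Phase.ERADICATION, Phase.RECOVERY}


def advance(current: Phase, new_indicators_found: bool = False) -> Phase:
    """Return the next phase, looping back to analysis when new indicators appear."""
    if new_indicators_found and current in LOOP_BACK_FROM:
        return Phase.DETECTION
    return NEXT_PHASE[current]
```

Calling `advance(Phase.RECOVERY, new_indicators_found=True)` returns `Phase.DETECTION`, which mirrors the loop drawn on the right-hand side of the diagram.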
Phase 1: Preparation
Preparation is the only phase that happens before the incident. It determines the ceiling of your response capability. No response effort can exceed what was built during preparation.
IR Plan Structure:
An IR plan does not need to be a 200-page compliance document. Under stress, people read checklists, not essays. A practical IR plan has these components:
# INCIDENT RESPONSE PLAN — [ORGANIZATION NAME]
Version: 2.4
Last Updated: 2026-01-15
Owner: CISO
## 1. Scope
This plan covers: confirmed or suspected unauthorized access, malware infection,
data exfiltration, DoS attacks, ransomware, and insider threats.
## 2. IR Team Roles
| Role | Primary | Backup | Contact |
|---|---|---|---|
| IR Lead | J. Smith | K. Chen | jsmith@... / +1-555-... |
| Communications Lead | L. Park | M. Rodriguez | lpark@... |
| Legal Counsel | A. Williams (outside: [Firm]) | — | +1-555-... |
| CISO | — | — | +1-555-... |
| IT Systems Owner | — | — | +1-555-... |
| HR (insider threat) | — | — | +1-555-... |
## 3. External Contacts
| Party | Contact | When to Engage |
|---|---|---|
| Cyber Insurance | [Carrier], [Claim #] | Within 24h of confirmed incident |
| IR Retainer | [Firm], [Engagement #] | Immediately upon major incident |
| Outside Legal | [Firm], [Attorney] | Any incident with legal exposure |
| CISA | 1-888-282-0870 | Major incidents, critical infrastructure |
| FBI Cyber Division | IC3.gov | Ransomware, nation-state |
| Local FBI Field Office | [Number] | — |
## 4. Severity Classification
| Level | Definition | Response SLA |
|---|---|---|
| P1 — Critical | Active breach, ransomware, major data exfil | Immediate, 24/7 response |
| P2 — High | Suspected breach, malware on critical system | Response within 2 hours |
| P3 — Medium | Phishing compromise, isolated malware | Response within 8 hours |
| P4 — Low | Policy violation, suspicious but unconfirmed | Response within 24 hours |
## 5. Communication Protocols
- P1/P2 incidents: use out-of-band channel (Signal group "IR Emergency")
NOT corporate email or Slack if those may be in scope
- Executive notification: P1 within 1 hour of confirmation; P2 within 4 hours
- Customer notification: per legal/PR guidance, never without legal review
- Regulatory notification: 72 hours under GDPR; 60 days under HIPAA; varies by state
## 6. Evidence Collection Standards
- Do not power off systems before memory capture
- Use forensic write blockers for disk imaging
- Document chain of custody for all evidence
- Store evidence in designated IR network share [\\IRSERVER\evidence\]
- Hash all collected evidence files: sha256sum file > file.sha256
## 7. Runbooks (see appendix)
- Ransomware Response Runbook
- Phishing Account Compromise Runbook
- Insider Threat Runbook
- Cloud Incident Runbook (AWS/Azure/GCP)
- Data Exfiltration Runbook
Pre-staged Jump Kit:
A jump kit is a pre-configured toolkit maintained separately from the production environment. When primary systems are compromised, the jump kit provides untainted tools and credentials.
Jump Kit Contents:
□ Laptop (offline-capable, never domain-joined, fresh OS install annually)
□ USB drives with:
- Bootable Kali Linux or TSURUGI Linux (forensics distro)
- WinPmem (memory capture)
- FTK Imager (disk imaging)
- Volatility 3 + plugins
- KAPE (triage collection)
- CyberChef (data analysis)
- Wireshark
- Eric Zimmerman tools (Windows forensics)
□ Network tap (hardware passive tap)
□ 1TB+ external hard drive (evidence storage)
□ Out-of-band communications:
- Separate SIM / mobile hotspot
- Signal installed with IR team contacts
□ Printed IR plan (laminated)
□ Printed out-of-band contact list
□ Offline credentials:
- Root/local admin passwords for critical systems (encrypted, separate safe)
- VPN configs for secondary management plane
□ Hardware security keys (for authenticating to critical systems without domain auth)
Logging Infrastructure:
The most common failure mode in incident response is discovering that the logs needed to understand what happened either don't exist, have rotated, or were disabled by the attacker.
Minimum logging requirements:
# Windows Event Log configuration (via GPO)
# Computer Configuration → Windows Settings → Security Settings → Advanced Audit Policy
Required audit categories:
Account Logon:
- Credential Validation: Success, Failure
- Kerberos Authentication Service: Success, Failure
Account Management:
- User Account Management: Success, Failure
- Security Group Management: Success
DS Access:
- Directory Service Access: Success, Failure (on DCs)
Logon/Logoff:
- Logon: Success, Failure
- Special Logon: Success
Object Access:
- File System: Failure (critical file servers)
- Registry: Failure
- Filtering Platform Connection: Success (noisy but valuable)
Policy Change:
- Audit Policy Change: Success, Failure
Privilege Use:
- Sensitive Privilege Use: Success, Failure
Process Tracking:
- Process Creation: Success (critical — requires Sysmon for full detail)
System:
- Security State Change: Success, Failure
# Critical Event IDs to alert on:
4624 - Successful logon
4625 - Failed logon
4648 - Logon with explicit credentials
4672 - Admin logon (special privileges)
4688 - Process creation (with command line logging enabled)
4698 - Scheduled task created
4720 - User account created
4732 - Member added to privileged group
4768 - Kerberos TGT request
4769 - Kerberos service ticket request (Kerberoasting detection)
4776 - NTLM authentication
7045 - New service installed
1102 - Event log cleared (!)
Sysmon deployment provides dramatically better process telemetry than default Windows Event Logs:
<!-- Sysmon configuration (SwiftOnSecurity config as baseline) -->
<Sysmon schemaversion="4.82">
<HashAlgorithms>md5,sha256,IMPHASH</HashAlgorithms>
<CheckRevocation/>
<EventFiltering>
<!-- Event ID 1: Process Create -->
<ProcessCreate onmatch="exclude">
<!-- Exclude noisy legitimate processes -->
<Image condition="is">C:\Windows\System32\wbem\WmiPrvSE.exe</Image>
</ProcessCreate>
<!-- Event ID 3: Network Connect -->
<NetworkConnect onmatch="include">
<!-- Monitor connections from command shells -->
<Image condition="end with">cmd.exe</Image>
<Image condition="end with">powershell.exe</Image>
<Image condition="end with">pwsh.exe</Image>
</NetworkConnect>
<!-- Event ID 10: Process Access (LSASS dumping detection) -->
<ProcessAccess onmatch="include">
<TargetImage condition="end with">lsass.exe</TargetImage>
</ProcessAccess>
<!-- Event ID 11: File Create -->
<!-- Event ID 12/13/14: Registry operations -->
<!-- Event ID 22: DNS Query -->
<DnsQuery onmatch="exclude">
<QueryName condition="end with">.microsoft.com</QueryName>
</DnsQuery>
</EventFiltering>
</Sysmon>
Log retention: CISA recommends 12 months log retention, with at least 3 months immediately accessible. DFIR investigations routinely require logs from 6-12 months prior to the detection event because attackers establish persistence long before the visible incident.
Phase 2: Detection and Analysis
Detection is where most programs struggle. The mean time to identify a breach was 194 days in IBM's 2024 Cost of a Data Breach Report. The companies that detect fastest have one thing in common: they invested in detection engineering, not just detection tools.
Triage Framework:
Not every alert is an incident. The triage process:
Alert fires
↓
Initial Triage (5-15 minutes)
- Is this a known false positive pattern? → Document, tune, resolve
- Is there corroborating evidence from other sources? → Escalate
- Is the affected asset low-value/non-production? → Lower priority
↓
Preliminary Investigation (15-60 minutes)
- Collect: process tree, network connections, authentication history
- Answer: What happened? When? Which systems? Which accounts?
- Classify: Event vs. Incident
↓
If Incident:
- Declare severity (P1-P4)
- Notify IR lead
- Begin documentation in IR ticketing
- Proceed to Containment
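The three initial-triage questions above can be sketched as a single decision function. This is a minimal illustration: the `Alert` fields, the outcome strings, and the order in which the questions are checked are assumptions made for this example, not a prescribed standard.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    known_false_positive: bool  # matches a documented benign pattern
    corroborated: bool          # supporting evidence from another source
    low_value_asset: bool       # non-production or low-value system


def initial_triage(alert: Alert) -> str:
    """Map the initial-triage questions to an outcome (precedence is assumed)."""
    if alert.known_false_positive:
        return "resolve: document and tune the rule"
    if alert.corroborated:
        return "escalate: begin preliminary investigation"
    if alert.low_value_asset:
        return "queue: investigate at lower priority"
    return "investigate: standard priority"
```

Corroboration is checked before asset value here so that a corroborated alert on a low-value host is still escalated; a team could reasonably choose a different precedence.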
Initial scoping queries:
# Splunk: Find all activity from a compromised host in the past 72 hours
index=windows host=COMPROMISED_HOST earliest=-72h
| table _time, EventCode, user, src_ip, dest_ip, CommandLine, message
| sort _time
# Splunk: Find all accounts the compromised host authenticated with
index=windows host=COMPROMISED_HOST EventCode=4624
| stats values(user) as accounts, count by host
| mvexpand accounts
# Splunk: Find all hosts that communicated with the compromised host
index=network (src=COMPROMISED_HOST_IP OR dest=COMPROMISED_HOST_IP)
| stats count by src, dest, dest_port
| sort -count
# KQL (Sentinel): Scope analysis for potentially compromised account
let suspect_account = "jsmith";
let lookback_days = 7d;
union
(SigninLogs | where UserPrincipalName contains suspect_account | where TimeGenerated > ago(lookback_days)),
(AuditLogs | where InitiatedBy has suspect_account | where TimeGenerated > ago(lookback_days)),
(OfficeActivity | where UserId contains suspect_account | where TimeGenerated > ago(lookback_days))
| project TimeGenerated, Type, ActivityDisplayName, IPAddress, Location, ResultType, ResultDescription
| order by TimeGenerated asc
Volatile evidence collection priority:
Before containment actions are taken, capture volatile evidence that will be lost on network disconnection or shutdown:
# Windows — collect volatile data (run from SYSTEM or elevated context)
# Create evidence directory
$evidenceDir = "C:\IR_Evidence_$(Get-Date -Format 'yyyyMMdd_HHmmss')"
New-Item -ItemType Directory -Path $evidenceDir
# Running processes (command lines are captured by the Win32_Process query below)
Get-Process | Select-Object Id, Name, Path, StartTime, CPU, WS |
Export-Csv "$evidenceDir\processes.csv"
# Get-WmiObject for more detail including parent PIDs
Get-WmiObject Win32_Process |
Select-Object ProcessId, ParentProcessId, Name, CommandLine, ExecutablePath |
Export-Csv "$evidenceDir\process_details.csv"
# Active network connections
netstat -anob > "$evidenceDir\netstat.txt"
Get-NetTCPConnection | Export-Csv "$evidenceDir\tcp_connections.csv"
# DNS cache (reveals recent DNS resolutions)
Get-DnsClientCache | Export-Csv "$evidenceDir\dns_cache.csv"
# Scheduled tasks
Get-ScheduledTask | Export-Csv "$evidenceDir\scheduled_tasks.csv"
schtasks /query /fo CSV /v > "$evidenceDir\schtasks_verbose.csv"
# Services
Get-Service | Export-Csv "$evidenceDir\services.csv"
# Call sc.exe explicitly: in Windows PowerShell, "sc" is an alias for Set-Content
sc.exe query type= all state= all > "$evidenceDir\sc_query.txt"
# Autostart locations (abbreviated; Autoruns is more complete)
reg export "HKLM\SOFTWARE\Microsoft\Windows\CurrentVersion\Run" "$evidenceDir\run_keys.reg"
reg export "HKCU\SOFTWARE\Microsoft\Windows\CurrentVersion\Run" "$evidenceDir\user_run_keys.reg"
# Recent file activity
Get-ChildItem C:\Users -Recurse -File |
Where-Object {$_.LastWriteTime -gt (Get-Date).AddDays(-7)} |
Sort-Object LastWriteTime -Descending |
Select-Object FullName, LastWriteTime, Length |
Export-Csv "$evidenceDir\recent_files.csv"
# Hash the evidence directory
Get-ChildItem $evidenceDir -File |
ForEach-Object {
$hash = (Get-FileHash $_.FullName -Algorithm SHA256).Hash
"$hash $($_.FullName)"
} > "$evidenceDir\EVIDENCE_HASHES.txt"
Write-Host "Evidence collected to: $evidenceDir"
# Linux volatile evidence collection
EVIDENCE_DIR="/tmp/ir_$(hostname)_$(date +%Y%m%d_%H%M%S)"
mkdir -p "$EVIDENCE_DIR"
# Running processes
ps auxf > "$EVIDENCE_DIR/ps_tree.txt"
ls -la /proc/*/exe 2>/dev/null > "$EVIDENCE_DIR/process_paths.txt"
# Network connections with processes
ss -tupn > "$EVIDENCE_DIR/network_connections.txt"
netstat -tulnp 2>/dev/null >> "$EVIDENCE_DIR/network_connections.txt"
# Active connections
ss -tnp state established > "$EVIDENCE_DIR/established_connections.txt"
# Listening services
ss -tlnp > "$EVIDENCE_DIR/listening_services.txt"
# ARP cache
arp -n > "$EVIDENCE_DIR/arp_cache.txt"
# Routing table
ip route > "$EVIDENCE_DIR/routing.txt"
ip neigh > "$EVIDENCE_DIR/ip_neighbors.txt"
# Loaded kernel modules
lsmod > "$EVIDENCE_DIR/loaded_modules.txt"
# Cron jobs
crontab -l 2>/dev/null > "$EVIDENCE_DIR/user_cron.txt"
ls -la /etc/cron* > "$EVIDENCE_DIR/system_cron.txt"
systemctl list-timers --all > "$EVIDENCE_DIR/systemd_timers.txt"
# Systemd services
systemctl list-units --type=service --all > "$EVIDENCE_DIR/systemd_services.txt"
# Recently modified files (last 7 days)
find / -xdev -type f -mtime -7 -ls 2>/dev/null > "$EVIDENCE_DIR/recently_modified.txt"
# Users and login history
w > "$EVIDENCE_DIR/logged_in_users.txt"
last > "$EVIDENCE_DIR/login_history.txt"
lastb 2>/dev/null > "$EVIDENCE_DIR/failed_logins.txt"
cat /etc/passwd > "$EVIDENCE_DIR/passwd.txt"
cat /etc/shadow 2>/dev/null > "$EVIDENCE_DIR/shadow.txt"
# Hash evidence
sha256sum "$EVIDENCE_DIR"/* > "$EVIDENCE_DIR/HASHES.txt" 2>/dev/null
echo "Evidence directory: $EVIDENCE_DIR"
Do not power off a compromised host before collecting memory. Running processes, network connections, encryption keys in RAM (critical for ransomware), and injected shellcode in process memory are all lost at power-off. Use WinPmem on Windows and LiME on Linux. Capture memory first, then image the disk, then contain the host via network isolation.
Memory acquisition:
# Windows — WinPmem
# Download: github.com/Velocidex/WinPmem/releases
winpmem_mini_x64_rc2.exe memdump.raw
# Alternative: Magnet RAM Capture (GUI)
# Linux — LiME (Loadable Kernel Module)
# Must compile for specific kernel version
# Build on matching system:
make -C /lib/modules/$(uname -r)/build M=$(pwd) modules
# Load module and dump:
insmod lime-$(uname -r).ko "path=/tmp/memdump.lime format=lime"
# macOS — osxpmem (for Intel Macs)
# M1/M2: memory acquisition requires different approaches; use external tools
Memory analysis with Volatility 3:
# Install Volatility 3
pip3 install volatility3
# Basic system info (identifies OS version for symbol matching)
vol -f memdump.raw windows.info
# Process list
vol -f memdump.raw windows.pslist
# Process tree (shows parent-child relationships — crucial for spotting injection)
vol -f memdump.raw windows.pstree
# Network connections at time of dump
vol -f memdump.raw windows.netscan
# Detect injected code and hollowed processes
vol -f memdump.raw windows.malfind
# DLLs loaded by specific process
vol -f memdump.raw windows.dlllist --pid 1234
# Command line of each process as it was launched (not interactive console history)
vol -f memdump.raw windows.cmdline
# Check handles
vol -f memdump.raw windows.handles --pid 1234
# Dump files cached in memory for a suspicious process
# (windows.pslist --pid 1234 --dump extracts the executable image itself)
vol -f memdump.raw windows.dumpfiles --pid 1234
# Scan for registry hives in memory
vol -f memdump.raw windows.registry.hivelist
# Extract specific registry key
vol -f memdump.raw windows.registry.printkey \
--key "SOFTWARE\Microsoft\Windows\CurrentVersion\Run"
# Linux memory analysis
vol -f memdump.lime linux.pslist
vol -f memdump.lime linux.bash # Bash history from memory
vol -f memdump.lime linux.netfilter # Network filter rules
Phase 3: Containment
Containment stops the spread without destroying evidence. The judgment call: how much business disruption is acceptable to stop the threat from spreading?
Short-term containment preserves evidence while limiting damage:
# Windows — isolate host via network without power-off
# Method 1: Disable network adapters
Get-NetAdapter | Disable-NetAdapter -Confirm:$false
# Method 2: Apply blocking firewall rules
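# Windows Firewall block rules override allow rules, and New-NetFirewallRule has no
# exception parameter, so block every address range except the management network
# (10.0.0.0/8 here; narrow it to the management IP where possible)
$blockRanges = "0.0.0.0-9.255.255.255", "11.0.0.0-255.255.255.255"
New-NetFirewallRule -DisplayName "IR_ISOLATION_BLOCK_INBOUND" -Direction Inbound -Action Block -RemoteAddress $blockRanges
New-NetFirewallRule -DisplayName "IR_ISOLATION_BLOCK_OUTBOUND" -Direction Outbound -Action Block -RemoteAddress $blockRanges
# The management range stays reachable for remote forensic access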
# Method 3: EDR host isolation (preferred — reversible, maintains EDR telemetry)
# CrowdStrike: Host Management → select host → Contain
# SentinelOne: Management → select endpoint → Isolate
# Microsoft Defender for Endpoint: Device page → Isolate device
# Linux: network isolation
# Drop all traffic except management IP
iptables -I INPUT -s MGMT_IP -j ACCEPT
iptables -I OUTPUT -d MGMT_IP -j ACCEPT
iptables -A INPUT -j DROP
iptables -A OUTPUT -j DROP
iptables -A FORWARD -j DROP
# Save rules to persist across service restart
iptables-save > /etc/iptables/rules.v4
Long-term containment allows business to continue while full eradication proceeds:
Actions during long-term containment:
□ Block C2 domains and IPs at firewall and DNS
- Add IOCs to firewall deny list
- Sinkhole C2 domains via DNS RPZ (Response Policy Zone)
□ Rotate credentials for all potentially compromised accounts
- All accounts on affected hosts
- All accounts used on affected hosts (check logon history)
- Service accounts with access to affected systems
- Domain admin accounts (all of them)
□ Patch the initial access vulnerability
- If phishing: no patch, but enable/enhance MFA
- If VPN CVE: apply patch, update firmware
- If misconfiguration: remediate configuration
□ Deploy additional monitoring on potentially affected hosts
- Enable enhanced Sysmon logging
- Deploy EDR sensor if not present
- Increase SIEM alert sensitivity for affected host range
□ Implement temporary access controls
- Restrict lateral movement paths identified during investigation
- Require step-up authentication for privileged operations
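The DNS RPZ sinkholing step in the checklist above can be sketched as a generator for zone records. This is a minimal illustration under stated assumptions: `rpz_records` is a hypothetical helper, the `.example` names are placeholders, and the output contains only the policy records (the zone's SOA and NS header and the resolver configuration are omitted). In RPZ, `CNAME .` returns NXDOMAIN for a name, while a CNAME to another host redirects queries to that host.

```python
def rpz_records(domains, sinkhole=None):
    """Build RPZ policy records for each C2 domain and its subdomains.

    With no sinkhole target the policy is NXDOMAIN ("CNAME ."); otherwise
    matching queries are redirected to the sinkhole host.
    """
    target = "." if sinkhole is None else sinkhole.rstrip(".") + "."
    # Normalize case and trailing dots, and drop duplicates and blank entries.
    cleaned = sorted({d.strip().lower().rstrip(".") for d in domains if d.strip()})
    lines = []
    for domain in cleaned:
        lines.append(f"{domain} CNAME {target}")    # the domain itself
        lines.append(f"*.{domain} CNAME {target}")  # every subdomain
    return lines


if __name__ == "__main__":
    # Placeholder IOC list; in practice this comes from the investigation.
    for record in rpz_records(["c2.badactor.example"], sinkhole="sinkhole.corp.example"):
        print(record)
```

Redirecting to an internal sinkhole instead of returning NXDOMAIN keeps infected hosts visible: any machine that connects to the sinkhole is a candidate for the affected-host list.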
Phase 4: Eradication
Eradication removes the threat completely. Partial eradication is operationally indistinguishable from no eradication — if one persistence mechanism survives, the attacker returns.
The eradication principle: assume full compromise, rebuild rather than clean.
Root cause identification checklist:
□ How did the attacker gain initial access?
- Phishing → identify affected accounts, check for mail rules/forwarding
- VPN vulnerability → identify what was accessed via VPN during the exploitation window
- Brute force → identify all accounts with successful logins from external IPs
- Supply chain → identify scope of access the compromised vendor/tool had
□ How long did they have access? (dwell time)
- Earliest indicator of compromise in logs
- Creation timestamps on malicious files/artifacts
□ What persistence mechanisms were established?
- All scheduled tasks created in incident window
- All services installed in incident window
- All registry run keys added
- All new user accounts or modified accounts
- SSH authorized_keys additions (Linux)
- cron job additions (Linux)
- WMI event subscriptions (Windows)
- Browser extensions installed in incident window
□ What credentials were potentially compromised?
- All accounts logged into affected systems
- All accounts whose credentials were stored on affected systems
- All service accounts with access to affected systems
- Kerberos tickets that may have been stolen (assess if golden ticket attack occurred)
□ Was data exfiltrated?
- Network logs for large outbound transfers
- DLP alerts
- Cloud sync tool activity (OneDrive, Dropbox, rclone)
- USB device insertion events
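The dwell-time question in the checklist above reduces to simple date arithmetic once the earliest indicator is known. The sketch below assumes timestamps have already been normalized to one time zone; `dwell_time` is a hypothetical helper, and the sample timestamps are illustrative.

```python
from datetime import datetime, timedelta


def dwell_time(indicator_times, detected_at) -> timedelta:
    """Return the time between the earliest indicator of compromise and detection."""
    earliest = min(indicator_times)
    if earliest > detected_at:
        raise ValueError("earliest indicator is later than the detection time")
    return detected_at - earliest


if __name__ == "__main__":
    indicators = [
        datetime(2026, 1, 15, 15, 17),  # e.g. first LSASS access
        datetime(2026, 1, 15, 14, 32),  # e.g. first C2 beacon
    ]
    detected = datetime(2026, 2, 14, 2, 14)
    print(dwell_time(indicators, detected))  # 29 days, 11:42:00
```

Every new artifact found during eradication should be fed back into this calculation, because an earlier indicator widens the window in which credentials and data must be assumed compromised.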
Persistence hunting on Windows:
# Autoruns — the most comprehensive persistence check
# Download Sysinternals Suite
# Run Autoruns64.exe as Administrator
# Options → Scan Options → Check VirusTotal.com
# View → Hide Microsoft Entries (reduces noise)
# Export: File → Save → autoruns_baseline.arn
# PowerShell persistence hunt
# Scheduled tasks
Get-ScheduledTask | Where-Object {$_.State -ne "Disabled"} |
Select-Object TaskName, TaskPath, State,
@{N="LastRun";E={($_ | Get-ScheduledTaskInfo).LastRunTime}},
@{N="Actions";E={$_.Actions.Execute}} |
Format-Table -AutoSize
# Services with unusual paths
Get-WmiObject Win32_Service |
Where-Object {$_.PathName -notmatch '^"?C:\\(Windows|Program Files)'} |
Select-Object Name, DisplayName, State, StartMode, PathName
# Registry autorun keys
$autorun_paths = @(
"HKLM:\SOFTWARE\Microsoft\Windows\CurrentVersion\Run",
"HKLM:\SOFTWARE\Microsoft\Windows\CurrentVersion\RunOnce",
"HKCU:\SOFTWARE\Microsoft\Windows\CurrentVersion\Run",
"HKLM:\SYSTEM\CurrentControlSet\Services"
)
foreach ($path in $autorun_paths) {
if (Test-Path $path) {
Get-ItemProperty $path | Select-Object * -ExcludeProperty PS*
}
}
# WMI event subscriptions (common attacker persistence)
Get-WMIObject -Namespace root\subscription -Class __EventFilter
Get-WMIObject -Namespace root\subscription -Class __EventConsumer
Get-WMIObject -Namespace root\subscription -Class __FilterToConsumerBinding
# Linux persistence hunting
# Cron jobs
crontab -l 2>/dev/null
sudo crontab -l 2>/dev/null
for user in $(cut -f1 -d: /etc/passwd); do
echo "--- Crontab for $user ---"
crontab -u "$user" -l 2>/dev/null
done
ls -la /etc/cron* /var/spool/cron/
# Systemd units
systemctl list-units --type=service --state=enabled --no-pager
find /etc/systemd /lib/systemd /usr/lib/systemd -name "*.service" \
-newer /var/log/dpkg.log 2>/dev/null # Recently modified
# SSH authorized keys
find / -name "authorized_keys" -exec ls -la {} \; 2>/dev/null
find / -name "authorized_keys" -exec cat {} \; 2>/dev/null
# SUID/SGID binaries (attacker may have added these)
find / -perm /4000 -type f -exec ls -la {} \; 2>/dev/null
find / -perm /2000 -type f -exec ls -la {} \; 2>/dev/null
# Recently modified files in sensitive locations
find /etc /usr/bin /usr/sbin /bin /sbin -type f -mtime -7 -ls 2>/dev/null  # widen -mtime to cover the incident window
# Loaded kernel modules (potential rootkit)
lsmod | sort
# Compare against known-good baseline
# Check /proc for hidden processes (discrepancy with ps)
ls /proc/ | grep -E '^[0-9]+$' | sort -n > /tmp/proc_pids.txt
ps ax | awk '{print $1}' | grep -E '^[0-9]+$' | sort -n > /tmp/ps_pids.txt
diff /tmp/proc_pids.txt /tmp/ps_pids.txt
Rebuild, don't clean:
For endpoint-level compromises, reimaging from a known-good base is dramatically more reliable than attempting to clean infected systems:
Decision matrix for rebuild vs. clean:
- Any evidence of kernel-level rootkit → mandatory rebuild
- Ransomware deployment → mandatory rebuild
- Cobalt Strike or similar C2 confirmed → mandatory rebuild
- Attacker had SYSTEM/root access for any duration → recommended rebuild
- Phishing compromise, malware contained at user level → cleaning may suffice
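The decision matrix above can be encoded so that the rebuild call is made consistently under pressure. This is a sketch, not a policy engine: the `Findings` fields and the `remediation_decision` function are hypothetical names that map one-to-one onto the rows of the matrix.

```python
from dataclasses import dataclass


@dataclass
class Findings:
    kernel_rootkit: bool = False
    ransomware_deployed: bool = False
    c2_framework_confirmed: bool = False  # Cobalt Strike or similar
    privileged_access: bool = False       # attacker held SYSTEM/root at any point


def remediation_decision(findings: Findings) -> str:
    """Apply the matrix top-down: the most severe matching row wins."""
    if (findings.kernel_rootkit
            or findings.ransomware_deployed
            or findings.c2_framework_confirmed):
        return "mandatory rebuild"
    if findings.privileged_access:
        return "recommended rebuild"
    return "cleaning may suffice"
```

For example, `remediation_decision(Findings(privileged_access=True))` returns `"recommended rebuild"`, while any confirmed C2 framework forces `"mandatory rebuild"` regardless of the other fields.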
Rebuild process:
1. Capture forensic image BEFORE wiping (legal evidence, investigation value)
2. Hash the image for chain of custody
3. Boot from trusted media (USB with clean OS installer)
4. Wipe drives securely (DoD 5220.22-M or single pass zero — crypto-erase for SSDs)
5. Reinstall OS from original media or trusted deployment image
6. Apply all patches BEFORE reconnecting to network
7. Restore only necessary data files — not executable binaries from compromised host
8. Reconnect to network in monitored VLAN for 30-day observation period
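Steps 1 and 2 of the rebuild process hinge on being able to prove later that the forensic image has not changed. A minimal sketch of that check follows; `sha256_file` and `verify_image` are hypothetical helper names, and the chunk size is an arbitrary choice.

```python
import hashlib


def sha256_file(path, chunk_size=1024 * 1024):
    """Hash a file in fixed-size chunks so large images never load fully into RAM."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_image(path, expected_hex):
    """Return True when the image still matches the hash recorded at acquisition."""
    return sha256_file(path) == expected_hex.strip().lower()
```

Record the output of `sha256_file` in the chain-of-custody log at acquisition time, then call `verify_image` whenever the image changes hands or is opened for analysis.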
Phase 5: Recovery
Recovery returns affected systems to operational status in a state that is verified clean and more resistant to recurrence.
Recovery success criteria (define these before starting recovery, not while doing it):
System cleared for production when:
□ All forensic images captured and stored
□ Host rebuilt from clean baseline (not cleaned in-place, for serious compromises)
□ All outstanding patches applied before reconnection
□ EDR sensor deployed and reporting healthy
□ Enhanced logging enabled (Sysmon, command-line logging)
□ All credentials associated with the compromised system rotated
□ Initial access vector confirmed remediated (patch applied, config fixed, MFA enabled)
□ 72 hours of post-reconnection monitoring shows no anomalous behavior
□ Backup verification complete: last clean backup restored successfully in test environment
Post-recovery monitoring — 30-day enhanced observation:
# Splunk: Enhanced monitoring for recently recovered host
# Flag any authentication anomalies
index=windows (host=RECOVERED_HOST_01 OR host=RECOVERED_HOST_02)
| search (EventCode=4624 OR EventCode=4625)
| eval status=case(EventCode==4624,"Success",EventCode==4625,"Failure",1==1,"Unknown")
| bin _time span=15m
| stats count by _time, host, user, src_ip, status
| where count > 5 AND status="Failure"
# Monitor for recurrence of attacker TTPs
index=windows host IN (RECOVERED_HOST_01, RECOVERED_HOST_02)
| search (CommandLine="*vssadmin*" OR CommandLine="*mimikatz*" OR
CommandLine="*procdump*" OR CommandLine="*rundll32*comsvcs*")
| table _time, host, user, CommandLine
Regulatory notification obligations:
Recovery planning must account for mandatory notification timelines:
| Regulation | Who It Covers | Breach Notification Window |
|---|---|---|
| GDPR | EU resident data | 72 hours to supervisory authority |
| HIPAA | US healthcare/PHI | 60 days to HHS; 60 days to patients (more than 500 residents of a state: media notice within the same 60 days) |
| SEC Rules (2023) | Public companies | 4 business days after materiality determination |
| PCI DSS | Payment card data | Immediately to card brands + acquirer |
| CCPA/CPRA | CA resident data | "Expedient"; no specific window, but delay increases penalty exposure |
| NY SHIELD Act | NY resident data | "Expedient" notice |
| Various state laws | Varies | 30-90 days depending on state |
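The fixed windows in the table above can be turned into concrete deadlines as soon as an incident is confirmed. The sketch below is a simplification and not legal advice: it applies one trigger time to every regime (in practice each clock starts on a different event, such as awareness, discovery, or a materiality determination), it ignores public holidays, and `add_business_days` and `notification_deadlines` are hypothetical helpers.

```python
from datetime import datetime, timedelta


def add_business_days(start, days):
    """Advance by whole business days, skipping Saturdays and Sundays (holidays ignored)."""
    current = start
    remaining = days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # Monday=0 ... Friday=4
            remaining -= 1
    return current


def notification_deadlines(trigger_time):
    """Return {regime: latest notification time}, soonest deadline first."""
    deadlines = {
        "GDPR (supervisory authority)": trigger_time + timedelta(hours=72),
        "SEC (public companies)": add_business_days(trigger_time, 4),
        "HIPAA (HHS and individuals)": trigger_time + timedelta(days=60),
    }
    return dict(sorted(deadlines.items(), key=lambda item: item[1]))


if __name__ == "__main__":
    for regime, deadline in notification_deadlines(datetime(2026, 3, 5, 9, 0)).items():
        print(f"{deadline:%Y-%m-%d %H:%M}  {regime}")
```

Sorting by deadline puts the tightest obligation at the top of the list, which is usually the one that drives the communications plan.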
# Notification template structure for customer/user notification:
# (Review with legal before sending)
Subject: Important Security Notice Regarding [Company Name]
Dear [Customer Name],
We are writing to notify you of a security incident that may have affected
your information.
WHAT HAPPENED:
On [date], we detected unauthorized access to [systems/data]. The intrusion
occurred between [date] and [date].
INFORMATION INVOLVED:
The following types of information may have been accessed:
- [List specific data types exposed]
WHAT WE ARE DOING:
We have [specific remediation actions taken]. We have also [enhanced security
measures implemented].
WHAT YOU SHOULD DO:
We recommend you [specific actionable steps: change password, monitor credit, etc.]
For more information, contact our dedicated response line at [phone/email].
We sincerely regret this occurred. [Signature block]
Phase 6: Post-Incident Review
The PIR (Post-Incident Review) transforms a painful incident into organizational security improvement. Without a systematic PIR, the same mistakes recur.
PIR template:
# POST-INCIDENT REVIEW
Incident ID: IR-2026-0037
Incident Type: Ransomware / Lateral Movement
Date of Detection: 2026-02-14
Date of Containment: 2026-02-15
Date of Recovery: 2026-02-28
PIR Meeting Date: 2026-03-14
Facilitator: [Name]
Attendees: [Names and roles]
## Incident Summary
[2-3 paragraph summary of what happened, impact, and timeline]
## Timeline Reconstruction
| Time | Event | Evidence Source |
|---|---|---|
| 2026-01-15 14:32 | First IOC: Cobalt Strike beacon from WS-042 | Sysmon EID 3 |
| 2026-01-15 15:17 | LSASS access from WS-042 | Sysmon EID 10 |
| ... | ... | ... |
| 2026-02-14 02:11 | VSS deletion commands | Security EID 4688 |
| 2026-02-14 02:14 | Encryptor executed | Sysmon EID 1 |
## Root Cause Analysis
Initial Access: [How the attacker got in]
Primary Failure: [The specific control failure that enabled the breach]
Contributing Factors: [2-5 additional factors]
## What Worked
- [Detection controls that fired]
- [Response actions that were effective]
## What Failed
- [Detection gaps]
- [Response delays and their causes]
- [Missing runbooks/procedures]
## Impact Assessment
- Systems affected: [count/names]
- Data potentially accessed: [description]
- Downtime: [hours/days]
- Estimated recovery cost: [range]
- Regulatory obligations triggered: [Y/N, which]
## Action Items
| # | Action | Owner | Due Date | Status |
|---|---|---|---|---|
| 1 | Enable MFA on all VPN accounts | [Name] | 2026-03-28 | Open |
| 2 | Deploy Sysmon to remaining 40% of endpoints | [Name] | 2026-04-15 | Open |
| 3 | Establish quarterly backup restore testing | [Name] | 2026-05-01 | Open |
Track PIR action items in your ticketing system (Jira, ServiceNow, GitHub Issues, or whatever your organization uses) with assigned owners and deadlines. A PIR action item in a shared Google Doc that nobody revisits is not an action item; it is documentation of intent.
Building a Detection Engineering Practice
Effective incident response depends on detection engineering — the discipline of writing, testing, and maintaining detection rules that fire on real attacks.
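What "testing a detection rule" means can be shown with a toy example: a rule, a set of labelled samples, and a check that the rule's verdict matches each label. The sketch below is not a Sigma engine. It only imitates case-insensitive "contains all" keyword matching, and the selections, sample command lines, and function names are assumptions made for illustration.

```python
def contains_all(command_line, keywords):
    """Case-insensitive check that every keyword appears in the command line."""
    lowered = command_line.lower()
    return all(keyword.lower() in lowered for keyword in keywords)


# Hypothetical rule: each inner list is one selection; any matching selection fires.
VSS_DELETION_SELECTIONS = [
    ["vssadmin", "delete", "shadows"],
    ["wmic", "shadowcopy", "delete"],
]


def rule_fires(command_line, selections=VSS_DELETION_SELECTIONS):
    return any(contains_all(command_line, selection) for selection in selections)


# Labelled samples: (command line, should the rule fire?)
TEST_CASES = [
    ("vssadmin.exe Delete Shadows /All /Quiet", True),
    ("wmic shadowcopy delete", True),
    ("vssadmin list shadows", False),   # benign enumeration must not alert
    ("notepad.exe report.txt", False),
]


def failing_cases(cases=TEST_CASES):
    """Return the samples where the rule's verdict differs from the label."""
    return [(cmd, expected) for cmd, expected in cases if rule_fires(cmd) != expected]
```

An empty list from `failing_cases()` means the rule agrees with every label. Keeping benign look-alikes in the sample set matters as much as the malicious ones, because false positives are what erode trust in an alert.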
Detection rule template (Sigma format):
# Sigma rule: ransomware precursor (VSS deletion)
title: Volume Shadow Copy Deletion via WMI or VSSAdmin
id: e5b33f7d-98f1-4c15-b9a0-fb35c91abcde
status: stable
description: |
  Detects deletion of volume shadow copies, a common pre-encryption step
  in ransomware attacks. Alert should trigger immediate host isolation.
references:
  - https://attack.mitre.org/techniques/T1490/
  - https://www.cisa.gov/stopransomware
author: Security Team
date: 2026/01/01
tags:
  - attack.impact
  - attack.t1490
logsource:
  category: process_creation
  product: windows
detection:
  selection_vssadmin:
    CommandLine|contains|all:
      - 'vssadmin'
      - 'delete'
      - 'shadows'
  selection_wmic:
    CommandLine|contains|all:
      - 'wmic'
      - 'shadowcopy'
      - 'delete'
  selection_ps_wmi:
    CommandLine|contains|all:
      - 'Get-WmiObject'
      - 'Win32_ShadowCopy'
      - 'Delete()'
  condition: selection_vssadmin or selection_wmic or selection_ps_wmi
falsepositives:
  - Legitimate backup software (verify against known backup processes)
level: critical
# Sigma rule: Kerberoasting detection
title: Kerberoasting - RC4 Kerberos Ticket Requests
id: a8c3d2f1-7b4e-4d5c-9f2a-12345678abcd
status: stable
description: |
  Detects potential Kerberoasting by monitoring for RC4-encrypted Kerberos
  service ticket requests (etype 23) which are used for offline password cracking.
  RC4 tickets are weak; modern environments should use AES.
logsource:
  product: windows
  service: security
detection:
  selection:
    EventID: 4769
    TicketEncryptionType: '0x17' # RC4-HMAC
  # Exclude computer accounts (names ending in $) and krbtgt, which are not Kerberoasting targets
  filter_machine_or_krbtgt:
    - ServiceName|endswith: '$'
    - ServiceName: 'krbtgt'
  condition: selection and not filter_machine_or_krbtgt
falsepositives:
  - Legacy systems that only support RC4
  - Some older service accounts
level: high
Detection testing with Atomic Red Team:
# Install and load Invoke-AtomicRedTeam (run only on an isolated test host).
# The test definitions (the "atomics" folder) must also be present on the host.
Install-Module -Name invoke-atomicredteam -Scope CurrentUser
Import-Module invoke-atomicredteam
# Test VSS deletion detection
Invoke-AtomicTest T1490 -TestNumbers 1,2
# Test LSASS access detection
Invoke-AtomicTest T1003.001 -TestNumbers 1
# Test scheduled task persistence detection
Invoke-AtomicTest T1053.005 -TestNumbers 1,2,3
# After each test, verify that the alert fired
# Good detection: alert fires, provides enough context to investigate
# Bad detection: no alert, OR alert fires but provides insufficient context

Essential IR Toolkit
| Category | Tool | Use Case |
|---|---|---|
| Memory acquisition | WinPmem | Windows memory capture |
| Memory acquisition | LiME | Linux kernel memory capture |
| Memory acquisition | Magnet RAM Capture | Windows GUI memory capture |
| Memory analysis | Volatility 3 | Memory forensics (cross-platform) |
| Disk imaging | FTK Imager | Forensic disk imaging (GUI) |
| Disk imaging | dd / dcfldd | Command-line disk imaging |
| Triage collection | KAPE | Windows forensic artifact collection |
| Process/registry | Sysinternals Suite | Process Explorer, Autoruns, TCPView |
| Process/registry | Process Monitor | Real-time file/registry/process monitoring |
| Log analysis | Chainsaw | Fast Windows event log hunting |
| Log analysis | Hayabusa | Windows EVTX threat hunting (Sigma rules) |
| SIEM | Elastic SIEM / Splunk | Central log aggregation and alerting |
| EDR | CrowdStrike / SentinelOne / MDE | Endpoint telemetry and isolation |
| Threat intel | MISP | IOC sharing and management |
| Threat intel | VirusTotal | Hash/IP/domain reputation |
| Threat intel | Shodan | Internet-facing asset discovery |
| Network forensics | Zeek | Network protocol analysis |
| Network forensics | Wireshark / tshark | Packet capture and analysis |
| Network forensics | NetworkMiner | Passive network forensics |
| Malware analysis | Any.run / Hybrid Analysis | Dynamic malware sandbox |
| Malware analysis | CAPE Sandbox | Advanced malware analysis |
| Timeline | Plaso / log2timeline | Forensic timeline generation |
| Forensic analysis | Autopsy | Disk forensic analysis (GUI) |
Tabletop Exercises: Testing Before the Crisis
A plan untested under stress is a document, not a capability. Tabletop exercises simulate an incident scenario through discussion without requiring live systems.
Sample tabletop scenario (ransomware discovered on a Saturday morning):
# TABLETOP EXERCISE: WEEKEND RANSOMWARE
Facilitator: [Name]
Participants: CISO, IR Lead, IT Director, Legal, Communications, Finance
SCENARIO INJECT 1 (07:30 Saturday):
Your on-call engineer receives an automated alert:
"Multiple hosts in the CORP network have triggered EDR ransomware alerts.
Files matching pattern *.locked are being created on file server FS-001."
Discussion questions:
1. Who do you call first? What is the notification chain?
2. What is your immediate containment action? What authority do you need?
3. What evidence do you preserve before isolation?
4. The CEO calls asking for a status update. Who handles this? What do you say?
SCENARIO INJECT 2 (09:00 Saturday):
Investigation reveals the following timeline:
- 3 days ago: phishing email opened by finance user on WS-047
- 2 days ago: Cobalt Strike beacon established from WS-047
- Yesterday: lateral movement to all finance hosts, DC accessed
- This morning: VSS deleted, encryption started
30% of file servers are encrypted. Backup server is also encrypted.
Last clean backup: 3 days ago.
Discussion questions:
1. You have no clean backup covering the past 3 days. What are your options?
2. Your cyber insurance policy requires notification within 24 hours, and Legal is unavailable. Who notifies the insurer, and on whose authority?
3. Do you pay the ransom? Who makes this decision? What factors influence it?
4. HR flags that the phishing victim is a recent new hire. How does this affect your response?
SCENARIO INJECT 3 (Monday morning):
The attacker contacts you via the ransom note portal and offers a "goodwill" file decryption to prove their decryptor works. They are asking $4.5M in Bitcoin with a 72-hour deadline.
Discussion questions:
1. How do you verify whether a decryptor actually works before committing to payment?
2. GDPR applies (EU customer data was in scope). What are your notification obligations?
3. How do you communicate to customers? When? Drafted by whom?
4. What is your recovery timeline estimate? Who do you communicate this to?

Run tabletops at least twice annually. After each exercise, document the gaps discovered and track remediation. Common findings: decision authority was unclear, legal couldn't be reached after hours, out-of-band communications weren't tested, backups weren't verified, and the IR plan was outdated.
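The scenario puts two notification clocks in play: the insurer's 24-hour window and the 72-hour window GDPR Article 33 sets for notifying the supervisory authority after becoming aware of a personal data breach. A facilitator can make the pressure concrete by working out the deadlines. The sketch below is illustrative only. It assumes both clocks start at Inject 1 (when "awareness" begins is a legal determination), and the calendar dates and the 09:00 time for Inject 3 are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Hours allowed between becoming aware of the incident and notifying each party.
NOTIFICATION_WINDOWS_HOURS = {
    "cyber insurer": 24,               # from the scenario's policy terms
    "GDPR supervisory authority": 72,  # GDPR Article 33
}


def notification_deadlines(aware_at):
    """Map each party to the latest time it can be notified."""
    return {party: aware_at + timedelta(hours=hours)
            for party, hours in NOTIFICATION_WINDOWS_HOURS.items()}


def hours_remaining(deadline, now):
    """Hours left until the deadline; negative means it has already passed."""
    return (deadline - now).total_seconds() / 3600


# Hypothetical dates: Inject 1 on Saturday 2026-06-06 at 07:30 UTC,
# Inject 3 on the following Monday at 09:00 UTC.
aware_at = datetime(2026, 6, 6, 7, 30, tzinfo=timezone.utc)
now = datetime(2026, 6, 8, 9, 0, tzinfo=timezone.utc)

for party, deadline in notification_deadlines(aware_at).items():
    left = hours_remaining(deadline, now)
    state = "MISSED" if left < 0 else "open"
    print(f"{party}: due {deadline:%a %Y-%m-%d %H:%M} UTC, {left:+.1f} h ({state})")
```

Under these assumptions the insurer deadline has already passed by the time Inject 3 begins, which is exactly the kind of gap a tabletop is meant to surface.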
IR Plan Checklist: Minimum Viable Documentation
- [ ] IR plan documented, approved by leadership, accessible offline
- [ ] CSIRT roles and responsibilities defined with named individuals and backups
- [ ] Out-of-band contact list current (tested quarterly)
- [ ] Severity classification matrix defined with response SLAs
- [ ] Logging infrastructure covers: endpoint (Sysmon), network (Zeek/NetFlow), identity (AD/Azure AD/Okta), cloud
- [ ] Log retention: 12 months total, with the most recent 3 months immediately searchable
- [ ] Jump kit assembled and tested (tools load, credentials work, communications functional)
- [ ] IR retainer established (or internal team trained and equipped)
- [ ] Cyber insurance policy reviewed, including whether it requires pre-approved response vendors
- [ ] Runbooks: ransomware, account compromise, data exfiltration, insider threat, cloud incident
- [ ] Backup strategy: 3-2-1-1-0, restores tested quarterly
- [ ] Regulatory notification obligations documented (GDPR, HIPAA, PCI, SEC, state laws)
- [ ] Tabletop exercise completed in the past 12 months
- [ ] PIR action items from last incident: all completed or scheduled
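Several checklist items are recurring rather than one-time, so a ticked box goes stale. A minimal sketch of a freshness check, assuming the date each recurring item was last completed is recorded somewhere; the item names, the `stale_checks` helper, and the sample dates are hypothetical, with "quarterly" approximated as 90 days and "12 months" as 365 days:

```python
from datetime import date, timedelta

# Maximum allowed age for each recurring check, taken from the checklist above.
MAX_AGE = {
    "Out-of-band contact list tested": timedelta(days=90),
    "Backup restore tested": timedelta(days=90),
    "Tabletop exercise completed": timedelta(days=365),
}


def stale_checks(last_done, today):
    """Return checks that were never done or were last done too long ago."""
    stale = []
    for name, max_age in MAX_AGE.items():
        done = last_done.get(name)
        if done is None or today - done > max_age:
            stale.append(name)
    return stale


# Hypothetical completion records; a missing entry means "never done".
last_done = {
    "Out-of-band contact list tested": date(2026, 4, 15),
    "Backup restore tested": date(2026, 1, 10),
}

for name in stale_checks(last_done, today=date(2026, 6, 1)):
    print(f"STALE: {name}")
```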
Build the capability before you need it. The alternative is building it during the incident, under pressure, with adversarial time constraints, while the clock on regulatory notifications is running.