MSP Incident Management: How It Actually Works
Incident management is the backbone of MSP operations: it is how you turn chaos into a repeatable process. In most MSPs, though, the reality looks very different from the ITIL textbook version.
Priority Classifications
P1 — Critical (Response: 15 min, Resolution: 4 hours)
What qualifies as P1:
- Complete business down (all users affected)
- Core infrastructure failure (Exchange, AD, file server)
- Security breach or active attack
- Data loss event
- Compliance-critical system failure
Example scenarios:
- Ransomware encryption spreading across the network
- Primary domain controller offline
- Complete internet connectivity loss
- Financial system compromised
What happens:
- Immediate page to on-call engineer
- All other work stops
- Client notified within 15 minutes
- Bridge call established
- Status updates every 30 minutes
- Post-incident review mandatory
P2 — High (Response: 30 min, Resolution: 8 hours)
What qualifies as P2:
- Major business function impaired
- Multiple users affected (>20%)
- Workaround not available
- Performance degradation on critical systems
Example scenarios:
- Email delivery delays
- CRM system slow or unavailable
- Printer fleet down office-wide
- VPN connectivity issues affecting remote workers
What happens:
- Assigned to available engineer
- Client notified within 30 minutes
- Status updates every 2 hours
- Escalation to P1 if not resolved in 4 hours
P3 — Medium (Response: 2 hours, Resolution: 24 hours)
What qualifies as P3:
- Single user or small group affected
- Workaround available
- Non-critical system issue
- Convenience functions impaired
Example scenarios:
- Individual user can't print
- One user's Outlook not syncing
- Non-critical application slow
- Mobile device issues
What happens:
- Normal ticket queue processing
- Client notified within 2 hours
- Status updates at resolution
- Escalation to P2 if impact increases
P4 — Low (Response: 8 hours, Resolution: 5 days)
What qualifies as P4:
- Cosmetic issues
- Feature requests
- Information requests
- Scheduled maintenance items
Example scenarios:
- User wants a software change
- Password reset request
- New user setup (non-urgent)
- Documentation update request
What happens:
- Normal ticket queue processing
- Resolved during standard business hours
- No escalation required unless deadline-driven
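The four priority levels can be captured as a small lookup table plus a triage function. The following is a minimal sketch, not a definitive rule set: the `classify` function and its yes/no inputs are hypothetical simplifications of the criteria above, and real triage usually involves judgment that a few booleans cannot express.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class SlaTarget:
    response: timedelta
    resolution: timedelta


# Targets copied from the priority headings above.
# P4's "5 days" is treated as calendar days here; the source does not
# say whether business days are meant.
SLA_TARGETS = {
    "P1": SlaTarget(timedelta(minutes=15), timedelta(hours=4)),
    "P2": SlaTarget(timedelta(minutes=30), timedelta(hours=8)),
    "P3": SlaTarget(timedelta(hours=2), timedelta(hours=24)),
    "P4": SlaTarget(timedelta(hours=8), timedelta(days=5)),
}


def classify(*, business_down: bool, security_incident: bool,
             data_loss: bool, affected_fraction: float,
             workaround_available: bool, is_request: bool) -> str:
    """Map simplified triage answers to a priority (hypothetical inputs)."""
    if business_down or security_incident or data_loss:
        return "P1"
    if is_request:
        # Feature, information, and maintenance requests.
        return "P4"
    if affected_fraction > 0.20 and not workaround_available:
        return "P2"
    return "P3"
```

Keeping the targets in one table means the response and resolution clocks for any ticket can be looked up with `SLA_TARGETS[priority]` rather than being repeated across scripts.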
Escalation Paths
Functional Escalation
L1 Service Desk
→ Triage and basic troubleshooting
→ If not resolved within 30 min → L2
L2 Systems Engineer
→ Advanced troubleshooting
→ If not resolved within 2 hours → L3
L3 Senior Engineer / Architect
→ Complex root cause analysis
→ If not resolved → Vendor support / External specialist
L4 Vendor Support
→ Microsoft, Cisco, etc.
→ Escalation through partner channels
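The time limits in the tier list above can be expressed as a simple check. This is a sketch under one assumption the source does not confirm: each limit is measured as time spent at the current tier, not time since the ticket was opened. The names `TIER_LIMITS` and `escalation_target` are illustrative.

```python
from datetime import timedelta
from typing import Optional

# Time limits per tier, from the list above. L3 has no fixed limit in the
# source: it hands off to vendor support when it cannot resolve the issue.
TIER_LIMITS = {
    "L1": (timedelta(minutes=30), "L2"),
    "L2": (timedelta(hours=2), "L3"),
}


def escalation_target(tier: str, time_at_tier: timedelta) -> Optional[str]:
    """Return the tier to escalate to, or None if no timer has expired."""
    limit = TIER_LIMITS.get(tier)
    if limit is None:
        # L3 / L4: escalation is a judgment call, not a timer.
        return None
    max_time, next_tier = limit
    return next_tier if time_at_tier >= max_time else None
```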
Management Escalation
15 min — Team Lead notified
30 min — Service Delivery Manager notified
1 hour — Operations Manager notified
2 hours — Director / VP notified
4 hours — Executive briefing (for P1)
Client Escalation
15 min — Client primary contact notified (email + phone)
30 min — Client IT manager notified
1 hour — Client executive sponsor notified (P1 only)
2 hours — Client leadership briefing (P1 only)
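Both notification schedules above are time-triggered, so they can be modeled as data and queried by elapsed time. The sketch below assumes that entries not marked "P1 only" apply to every priority, which the source does not state explicitly; the function name `due_notifications` is hypothetical.

```python
from datetime import timedelta

# (elapsed time, who to notify, P1 only?) taken from the two lists above.
MANAGEMENT = [
    (timedelta(minutes=15), "Team Lead", False),
    (timedelta(minutes=30), "Service Delivery Manager", False),
    (timedelta(hours=1), "Operations Manager", False),
    (timedelta(hours=2), "Director / VP", False),
    (timedelta(hours=4), "Executive briefing", True),
]
CLIENT = [
    (timedelta(minutes=15), "Client primary contact", False),
    (timedelta(minutes=30), "Client IT manager", False),
    (timedelta(hours=1), "Client executive sponsor", True),
    (timedelta(hours=2), "Client leadership briefing", True),
]


def due_notifications(schedule, elapsed: timedelta, priority: str) -> list[str]:
    """List everyone who should have been notified by now."""
    return [
        who
        for after, who, p1_only in schedule
        if elapsed >= after and (priority == "P1" or not p1_only)
    ]
```

Comparing the result against the notifications already logged on the ticket shows which escalations are overdue.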
Communication Templates
Initial Notification (P1)
Subject: [P1] [Client] - [Issue Summary] - OUTAGE
Severity: P1 - Critical
Status: Investigating
Impact: [X users affected] - [Business function] unavailable
Start Time: [Time]
Assigned To: [Engineer name]
Next update: [Time + 30 min]
Current Actions:
- Investigating root cause
- [Specific action taken]
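Filling in the template by hand under pressure invites mistakes, especially in the next-update time. A minimal sketch of rendering it programmatically follows; the function and parameter names are hypothetical, and it assumes "Time + 30 min" means 30 minutes after the notification is sent.

```python
from datetime import datetime, timedelta

TEMPLATE = """Subject: [P1] {client} - {summary} - OUTAGE
Severity: P1 - Critical
Status: Investigating
Impact: {users} users affected - {function} unavailable
Start Time: {start:%Y-%m-%d %H:%M}
Assigned To: {engineer}
Next update: {next_update:%Y-%m-%d %H:%M}
Current Actions:
- Investigating root cause
- {action}"""


def initial_notification(*, client: str, summary: str, users: int,
                         function: str, start: datetime, sent_at: datetime,
                         engineer: str, action: str) -> str:
    """Render the P1 initial notification with the next update time filled in."""
    return TEMPLATE.format(
        client=client,
        summary=summary,
        users=users,
        function=function,
        start=start,
        engineer=engineer,
        # P1 status updates are due every 30 minutes.
        next_update=sent_at + timedelta(minutes=30),
        action=action,
    )
```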
Status Update (P1)
Subject: [P1] [Client] - [Issue Summary] - UPDATE [X]
Status: [Investigating/Identified/Monitoring/Resolved]
Impact: [Current impact]
Root Cause: [If identified]
Actions Taken:
- [Action 1]
- [Action 2]
Next Steps:
- [Next action]
- ETA for resolution: [Time]
Next update: [Time + 30 min]
Resolution Notification
Subject: [P1] [Client] - [Issue Summary] - RESOLVED
Status: Resolved
Resolution Time: [X hours X minutes]
Root Cause: [Brief description]
Actions Taken:
- [Resolution steps]
Follow-up:
- Post-incident review scheduled for [Date]
- [Any ongoing monitoring]
Please confirm normal operations from your end.
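The "Resolution Time" field is simple arithmetic on the incident start and resolution timestamps. As an illustration, a small helper (the name `format_resolution_time` is hypothetical) could produce the "X hours X minutes" string used in the template:

```python
from datetime import datetime


def format_resolution_time(start: datetime, resolved: datetime) -> str:
    """Format the elapsed time in the 'X hours X minutes' style of the template."""
    if resolved < start:
        raise ValueError("resolved time is earlier than start time")
    total_minutes = int((resolved - start).total_seconds() // 60)
    hours, minutes = divmod(total_minutes, 60)
    return f"{hours} hours {minutes} minutes"
```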
SLA Implications
How SLAs Affect Your Work
Response Time SLA:
- The clock starts when the ticket is created
- If you're on-call, your phone should never be on silent
- Missed response SLAs lead to client credits, which can hit your bonus
Resolution Time SLA:
- Complex issues may legitimately exceed SLA
- But poor documentation and communication make it look worse
- Escalate early if you think you'll miss resolution SLA
Escalation SLAs:
- If you don't escalate per process, you own the delay
- Document every escalation and response
- Cover yourself: "Escalated to L2 at [time], awaiting response"
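Because the clock starts at ticket creation, "escalate early" can be turned into a concrete check on elapsed time. The sketch below flags a ticket as at risk once a set fraction of the SLA window has passed. The 75% threshold and the name `sla_status` are assumptions for illustration, not values from any particular PSA tool.

```python
from datetime import datetime, timedelta


def sla_status(created: datetime, now: datetime, target: timedelta,
               warn_fraction: float = 0.75) -> str:
    """Classify a ticket's SLA clock as 'ok', 'at risk', or 'breached'."""
    elapsed = now - created  # the clock starts when the ticket is created
    if elapsed >= target:
        return "breached"
    if elapsed >= target * warn_fraction:
        # Hypothetical early-warning threshold: time to escalate.
        return "at risk"
    return "ok"
```

The same function works for either clock: pass the response target to check the response SLA and the resolution target to check the resolution SLA.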
SLA Breach Consequences
For the MSP:
- Financial penalties (contractual credits)
- Client churn risk
- Reputation damage
- Partner status impact (Microsoft, etc.)
For you:
- Performance reviews
- Bonus impact
- On-call rotation changes
- Increased scrutiny
On-Call: The Reality
What On-Call Actually Involves
- Phone must be on 24/7 during your rotation
- Response within 15 minutes of page (most MSPs use PagerDuty or similar)
- Resolve or escalate — don't sit on issues during on-call
- Document everything — on-call work must be logged in the PSA
- Handover — brief the next on-call engineer on any open issues
Surviving On-Call
- Prepare before your rotation — Know the current state of all clients
- Keep tools accessible — VPN, RMM, PSA should be on your phone
- Set up alerts properly — Don't get paged for every low-priority ticket
- Sleep when you can — But don't miss pages
- Track your hours — On-call should be compensated (standby + call-in rates)
- Request TOIL — Time off in lieu for after-hours work
On-Call Compensation (Australia)
Under most awards and agreements:
- Standby allowance: paid for being available (typically $3-5/hour)
- Call-in rate: minimum 3 hours at overtime rates when called in
- Weekend/holiday rates: higher standby and call-in rates
[!WARNING] If your MSP doesn't compensate on-call work, check your award and employment contract. Under the Professional Employees Award, additional hours beyond 38/week must be "reasonable" and may attract penalty rates.
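To sanity-check a payslip against the structure described above, the arithmetic can be sketched as follows. Every rate passed in is a placeholder you must supply from your own award or agreement: the base hourly rate and overtime multiplier in particular are hypothetical parameters, not figures from the source. The sketch also assumes standby is paid for the whole rotation and that each call-in is billed separately with its own 3-hour minimum.

```python
def on_call_pay(standby_hours: float, standby_rate: float,
                call_in_hours: list[float], base_hourly: float,
                overtime_multiplier: float,
                minimum_call_in_hours: float = 3.0) -> float:
    """Estimate on-call pay: standby allowance plus call-ins at overtime rates."""
    standby = standby_hours * standby_rate
    # Each call-in is paid for at least the minimum number of hours.
    billable = sum(max(h, minimum_call_in_hours) for h in call_in_hours)
    call_in = billable * base_hourly * overtime_multiplier
    return standby + call_in
```

Under these assumptions a 30-minute call-in is paid as 3 hours, which is why logging every call-in separately in the PSA matters.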
Related Guides
- Fair Work Rights — Know your legal rights around on-call compensation
- MSP Onboarding Checklist — Your first 90 days guide
- MSP Burnout Guide — Warning signs and how to recover
- Essential 8 Implementation — Security incident response
- PowerShell Automation — Automate repetitive tasks