MSP Incident Management: How It Actually Works - MSP Guide Australia

Operations, 2026-06-10

Incident management is the backbone of MSP operations. It's how you turn chaos into a process. But in most MSPs, the reality is very different from the ITIL textbook version.

Priority Classifications

P1 — Critical (Response: 15 min, Resolution: 4 hours)

What qualifies as P1:
- Complete business down (all users affected)
- Core infrastructure failure (Exchange, AD, file server)
- Security breach or active attack
- Data loss event
- Compliance-critical system failure

Example scenarios:
- Ransomware encryption spreading across the network
- Primary domain controller offline
- Complete internet connectivity loss
- Financial system compromised

What happens:
- Immediate page to on-call engineer
- All other work stops
- Client notified within 15 minutes
- Bridge call established
- Status updates every 30 minutes
- Post-incident review mandatory

P2 — High (Response: 30 min, Resolution: 8 hours)

What qualifies as P2:
- Major business function impaired
- Multiple users affected (>20%)
- Workaround not available
- Performance degradation on critical systems

Example scenarios:
- Email delivery delays
- CRM system slow or unavailable
- Printer fleet down office-wide
- VPN connectivity issues affecting remote workers

What happens:
- Assigned to available engineer
- Client notified within 30 minutes
- Status updates every 2 hours
- Escalation to P1 if not resolved in 4 hours

P3 — Medium (Response: 2 hours, Resolution: 24 hours)

What qualifies as P3:
- Single user or small group affected
- Workaround available
- Non-critical system issue
- Convenience functions impaired

Example scenarios:
- Individual user can't print
- One user's Outlook not syncing
- Non-critical application slow
- Mobile device issues

What happens:
- Normal ticket queue processing
- Client notified within 2 hours
- Status updates at resolution
- Escalation to P2 if impact increases

P4 — Low (Response: 8 hours, Resolution: 5 days)

What qualifies as P4:
- Cosmetic issues
- Feature requests
- Information requests
- Scheduled maintenance items

Example scenarios:
- User wants a software change
- Password reset request
- New user setup (non-urgent)
- Documentation update request

What happens:
- Normal ticket queue processing
- Resolved during standard business hours
- No escalation required unless deadline-driven
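
The classification rules above can be sketched as a small triage function. This is a minimal illustration, not a standard: the function name, its parameters, and the way the P2 criteria are combined (broad impact, or a major function impaired with no workaround) are assumptions made for the example. Real triage still needs human judgment.

```python
def classify(*, business_down=False, security_incident=False, data_loss=False,
             major_function_impaired=False, affected_fraction=0.0,
             workaround_available=True, is_request=False):
    """Return a priority label from a simplified reading of the criteria above.

    All parameter names are hypothetical; affected_fraction is the share of
    the client's users affected, from 0.0 to 1.0.
    """
    # P1: any critical trigger is enough on its own.
    if business_down or security_incident or data_loss:
        return "P1"
    # P2 (assumed combination): more than 20% of users affected, or a major
    # business function impaired with no workaround.
    if affected_fraction > 0.20 or (major_function_impaired and not workaround_available):
        return "P2"
    # P4: requests, cosmetic issues, scheduled items.
    if is_request:
        return "P4"
    # Everything else is a single-user or small-group issue.
    return "P3"


print(classify(security_incident=True))   # P1
print(classify(affected_fraction=0.35))   # P2
print(classify(is_request=True))          # P4
```

Checking the P1 triggers first means a security breach affecting one user is still treated as critical, which matches the P1 list above.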

Escalation Paths

Functional Escalation

L1 Service Desk
    → Triage and basic troubleshooting
    → If can't resolve in 30 min → L2

L2 Systems Engineer
    → Advanced troubleshooting
    → If can't resolve in 2 hours → L3

L3 Senior Engineer / Architect
    → Complex root cause analysis
    → If can't resolve → Vendor support / External specialist

L4 Vendor Support
    → Microsoft, Cisco, etc.
    → Escalation through partner channels
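
As a rough sketch, the functional escalation ladder can be expressed as a lookup with time limits. The `TIERS` table and `escalation_target` function are hypothetical names. The text gives no fixed time limit for L3, so the sketch assumes the hand-off to vendor support happens only when the engineer flags that they are stuck.

```python
from datetime import timedelta

# (tier name, time allowed before handing up). None means the text gives no limit.
TIERS = [
    ("L1 Service Desk", timedelta(minutes=30)),
    ("L2 Systems Engineer", timedelta(hours=2)),
    ("L3 Senior Engineer / Architect", None),
    ("L4 Vendor Support", None),
]


def escalation_target(tier_index, time_at_tier, engineer_stuck=False):
    """Return the tier to hand the ticket to, or None to keep it where it is."""
    if tier_index + 1 >= len(TIERS):
        return None  # already with the vendor, nowhere further to go
    _, limit = TIERS[tier_index]
    timed_out = limit is not None and time_at_tier >= limit
    if timed_out or engineer_stuck:
        return TIERS[tier_index + 1][0]
    return None


print(escalation_target(0, timedelta(minutes=35)))  # L2 Systems Engineer
print(escalation_target(1, timedelta(hours=1)))     # None
```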

Management Escalation

15 min — Team Lead notified
30 min — Service Delivery Manager notified
1 hour — Operations Manager notified
2 hours — Director / VP notified
4 hours — Executive briefing (for P1)

Client Escalation

15 min — Client primary contact notified (email + phone)
30 min — Client IT manager notified
1 hour — Client executive sponsor notified (P1 only)
2 hours — Client leadership briefing (P1 only)
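
The management and client timelines above are both "who should already know by minute N" schedules, so one helper can serve both. The sketch below is illustrative only: the tuple layout and the `due_notifications` name are assumptions. The management executive briefing is treated as P1-only, following the "(for P1)" note.

```python
# (minutes since incident start, who is notified, P1 only?)
MANAGEMENT = [
    (15, "Team Lead", False),
    (30, "Service Delivery Manager", False),
    (60, "Operations Manager", False),
    (120, "Director / VP", False),
    (240, "Executive briefing", True),
]
CLIENT = [
    (15, "Client primary contact", False),
    (30, "Client IT manager", False),
    (60, "Client executive sponsor", True),
    (120, "Client leadership briefing", True),
]


def due_notifications(schedule, elapsed_minutes, priority):
    """List everyone who should have been notified by this point in the incident."""
    return [
        who
        for after, who, p1_only in schedule
        if elapsed_minutes >= after and (priority == "P1" or not p1_only)
    ]


print(due_notifications(CLIENT, 45, "P2"))
# ['Client primary contact', 'Client IT manager']
```

Comparing this list against who has actually been contacted gives a quick way to spot a missed escalation step during a long incident.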

Communication Templates

Initial Notification (P1)

Subject: [P1] [Client] - [Issue Summary] - OUTAGE

Severity: P1 - Critical
Status: Investigating
Impact: [X users affected] - [Business function] unavailable
Start Time: [Time]
Assigned To: [Engineer name]

Next update: [Time + 30 min]

Current Actions:
- Investigating root cause
- [Specific action taken]

Status Update (P1)

Subject: [P1] [Client] - [Issue Summary] - UPDATE [X]

Status: [Investigating/Identified/Monitoring/Resolved]
Impact: [Current impact]
Root Cause: [If identified]

Actions Taken:
- [Action 1]
- [Action 2]

Next Steps:
- [Next action]
- ETA for resolution: [Time]

Next update: [Time + 30 min]

Resolution Notification

Subject: [P1] [Client] - [Issue Summary] - RESOLVED

Status: Resolved
Resolution Time: [X hours X minutes]
Root Cause: [Brief description]

Actions Taken:
- [Resolution steps]

Follow-up:
- Post-incident review scheduled for [Date]
- [Any ongoing monitoring]

Please confirm normal operations from your end.
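
Templates like these are easy to fill in programmatically so that the next-update time is never miscalculated under pressure. The following is a minimal sketch for the initial P1 notification only. The function name, parameter names, and the sample values are hypothetical, and times are printed as HH:MM for brevity.

```python
from datetime import datetime, timedelta

INITIAL_TEMPLATE = """\
Subject: [P1] {client} - {summary} - OUTAGE

Severity: P1 - Critical
Status: Investigating
Impact: {users_affected} users affected - {business_function} unavailable
Start Time: {start:%H:%M}
Assigned To: {engineer}

Next update: {next_update:%H:%M}

Current Actions:
- Investigating root cause
- {action}
"""


def render_initial_notification(client, summary, users_affected,
                                business_function, start, engineer, action):
    """Fill the initial P1 template; the next update is start time + 30 minutes."""
    return INITIAL_TEMPLATE.format(
        client=client,
        summary=summary,
        users_affected=users_affected,
        business_function=business_function,
        start=start,
        engineer=engineer,
        next_update=start + timedelta(minutes=30),
        action=action,
    )


# Sample values are invented for illustration.
print(render_initial_notification(
    "Acme", "File server offline", 40, "File sharing",
    datetime(2026, 6, 10, 9, 0), "J. Smith", "Restarting storage service",
))
```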

SLA Implications

How SLAs Affect Your Work

Response Time SLA:
- The clock starts when the ticket is created
- If you're on-call, your phone should never be on silent
- Missed response SLAs = client credits = your bonus takes a hit

Resolution Time SLA:
- Complex issues may legitimately exceed SLA
- But poor documentation and communication make it look worse
- Escalate early if you think you'll miss resolution SLA

Escalation SLAs:
- If you don't escalate per process, you own the delay
- Document every escalation and response
- Cover yourself: "Escalated to L2 at [time], awaiting response"
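
Because the clock starts at ticket creation, both deadlines can be computed the moment a ticket is logged. The sketch below uses the response and resolution targets from the priority headings earlier in this guide. The `sla_status` name, the wall-clock (not business-hours) timing, and the 75% "escalate early" threshold are assumptions for illustration, not contractual values.

```python
from datetime import datetime, timedelta

RESPONSE = {
    "P1": timedelta(minutes=15), "P2": timedelta(minutes=30),
    "P3": timedelta(hours=2), "P4": timedelta(hours=8),
}
RESOLUTION = {
    "P1": timedelta(hours=4), "P2": timedelta(hours=8),
    "P3": timedelta(hours=24), "P4": timedelta(days=5),
}


def sla_status(priority, created, now, responded_at=None, warn_fraction=0.75):
    """Report deadlines and breaches for an open ticket.

    Assumes a wall-clock SLA; warn_fraction is a hypothetical early-warning
    threshold expressed as a share of the resolution window.
    """
    response_due = created + RESPONSE[priority]
    resolution_due = created + RESOLUTION[priority]
    # If nobody has responded yet, measure the response clock against "now".
    first_touch = responded_at if responded_at is not None else now
    return {
        "response_due": response_due,
        "resolution_due": resolution_due,
        "response_breached": first_touch > response_due,
        "resolution_breached": now > resolution_due,
        "escalate_now": now >= created + RESOLUTION[priority] * warn_fraction,
    }


created = datetime(2026, 6, 10, 9, 0)
status = sla_status("P1", created, now=datetime(2026, 6, 10, 12, 0),
                    responded_at=datetime(2026, 6, 10, 9, 10))
print(status["response_breached"], status["escalate_now"])  # False True
```

In the usage example the P1 ticket was answered within 15 minutes, but three of its four resolution hours are gone, so the early-warning flag is set.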

SLA Breach Consequences

For the MSP:
- Financial penalties (contractual credits)
- Client churn risk
- Reputation damage
- Partner status impact (Microsoft, etc.)

For you:
- Performance reviews
- Bonus impact
- On-call rotation changes
- Increased scrutiny

On-Call: The Reality

What On-Call Actually Involves

  • Phone must be on 24/7 during your rotation
  • Response within 15 minutes of page (most MSPs use PagerDuty or similar)
  • Resolve or escalate — don't sit on issues during on-call
  • Document everything — on-call work must be logged in the PSA
  • Handover — brief the next on-call engineer on any open issues

Surviving On-Call

  1. Prepare before your rotation — Know the current state of all clients
  2. Keep tools accessible — VPN, RMM, PSA should be on your phone
  3. Set up alerts properly — Don't get paged for every low-priority ticket
  4. Sleep when you can — But don't miss pages
  5. Track your hours — On-call should be compensated (standby + call-in rates)
  6. Request TOIL — Time off in lieu for after-hours work

On-Call Compensation (Australia)

Under most awards and agreements:
- Standby allowance: paid for being available (typically $3-5/hour)
- Call-in rate: minimum 3 hours at overtime rates when called in
- Weekend/holiday rates: higher standby and call-in rates

Warning: If your MSP doesn't compensate on-call work, check your award and employment contract. Under the Professional Employees Award, additional hours beyond 38/week must be "reasonable" and may attract penalty rates.
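
To sanity-check a payslip, the standby and call-in components can be worked through as simple arithmetic. This is only a sketch: the base hourly rate, the 1.5x overtime multiplier, and the assumption that standby is paid for the whole rotation are hypothetical, and your award or contract sets the real figures. The 3-hour minimum and the $4/hour standby rate in the example come from the ranges quoted above.

```python
def on_call_pay(standby_hours, standby_rate, call_in_hours, base_hourly_rate,
                overtime_multiplier=1.5, minimum_call_in_hours=3.0):
    """Estimate gross on-call pay for one rotation.

    call_in_hours is a list with the actual hours worked on each call-in.
    Each call-in is paid for at least minimum_call_in_hours.
    """
    standby = standby_hours * standby_rate
    paid_hours = sum(max(hours, minimum_call_in_hours) for hours in call_in_hours)
    call_in = paid_hours * base_hourly_rate * overtime_multiplier
    return standby + call_in


# 100 standby hours at $4/hour, plus a 30-minute call-in and a 4-hour call-in
# at a hypothetical $50/hour base rate:
# 100 * 4 + (3 + 4) * 50 * 1.5 = 400 + 525
print(on_call_pay(100, 4.0, [0.5, 4.0], 50.0))  # 925.0
```

Note how the 30-minute call-in is paid as 3 hours: short calls are where the minimum call-in rule makes the biggest difference.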

Frequently Asked Questions

How should my MSP handle IT incidents?
MSPs should have clear incident classification, response procedures, communication protocols, and post-incident reviews. See our Incident Management guide for best practices.

What SLAs should I expect for incident response?
Critical: 15-minute response, 4-hour resolution. High: 30-minute response, 8-hour resolution. Medium: 2-hour response, 24-hour resolution. Low: 8-hour response, 5-day resolution. See our MSP Contract Checklist for SLA templates.
