ROI Framework
How to Build a Business Case for Reliability Investment
Updated April 2026 · Template for SRE leads, VP Engineering, CTO
The ROI Formula
Annual Savings = Expected Annual Downtime Cost x Outage Reduction %
ROI = Annual Savings / Annual Investment Cost
Payback (months) = 12 / ROI
Start with the downtime cost calculator to get your Expected Annual Downtime Cost.
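The three formulas can be sketched as small Python functions. This is a minimal illustration, not a finished model: the function names and the round input figures ($1,000,000 downtime cost, 50% reduction, $250,000 investment) are assumptions chosen to make the arithmetic easy to follow. Note that ROI here is the ratio of gross savings to investment, as the formula defines it.

```python
def annual_savings(expected_annual_downtime_cost: float, outage_reduction: float) -> float:
    """Annual Savings = Expected Annual Downtime Cost x Outage Reduction %.

    outage_reduction is a fraction, e.g. 0.50 for 50%.
    """
    return expected_annual_downtime_cost * outage_reduction


def roi(savings: float, annual_investment_cost: float) -> float:
    """ROI = Annual Savings / Annual Investment Cost (a ratio, not a percentage)."""
    return savings / annual_investment_cost


def payback_months(roi_ratio: float) -> float:
    """Payback (months) = 12 / ROI."""
    return 12 / roi_ratio


# Illustrative inputs only (assumed, not taken from any real company).
savings = annual_savings(1_000_000, 0.50)
ratio = roi(savings, 250_000)
months = payback_months(ratio)

print(f"Annual savings: ${savings:,.0f}")
print(f"ROI: {ratio:.1f}x")
print(f"Payback: {months:.1f} months")
```

Replace the illustrative inputs with the output of the downtime cost calculator and your own investment figure.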
Worked Example: Justifying SRE Investment
Company profile: $50M ARR SaaS company, 200 employees, currently experiencing 4 significant outages per year averaging 3 hours each. Per-hour cost calculated at $180,000 (using our calculator with SaaS defaults).
| Input | Value | Notes |
|---|---|---|
| Per-hour downtime cost | $180,000/hr | From calculator |
| Outages per year | 4 | Historical average |
| Average duration | 3 hours | From post-mortems |
| Expected annual cost | $2,160,000 | 4 x 3h x $180K |
| Proposed investment | $350,000/yr | 1 senior SRE salary + overhead |
| Expected reduction | 60% | Industry benchmark for SRE hire |
| Annual savings | $1,296,000 | $2.16M x 60% |
| Net benefit | $946,000/yr | $1.296M - $350K |
| Payback period | ~3 months | $350K / $1.296M x 12 |
CFO takeaway: Hiring one SRE at $350,000/year (all-in) produces $946,000 in net annual benefit. Payback in approximately 3 months. This is a stronger ROI than most product features.
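A question that often follows is how far the 60% assumption can slip before the hire stops paying for itself. The sketch below derives a break-even reduction from the same inputs as the table. The break-even step is an addition for illustration, not part of the original framework, and the variable names are assumptions.

```python
# Inputs from the worked example table.
per_hour_cost = 180_000
outages_per_year = 4
avg_duration_hours = 3
investment = 350_000
expected_reduction = 0.60

expected_annual_cost = per_hour_cost * outages_per_year * avg_duration_hours

# Net benefit at the expected reduction, as in the table.
net_benefit = expected_annual_cost * expected_reduction - investment

# Break-even: the reduction at which annual savings equal the investment.
break_even_reduction = investment / expected_annual_cost

print(f"Expected annual cost: ${expected_annual_cost:,.0f}")
print(f"Net benefit at {expected_reduction:.0%}: ${net_benefit:,.0f}")
print(f"Break-even outage reduction: {break_even_reduction:.1%}")
```

With these inputs the investment covers its own cost once outage cost falls by roughly 16%, which leaves a wide margin below the 60% assumption.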
Reliability Investment Options
SRE Headcount (1 senior SRE)
Reduces mean time to detection and recovery; enables proactive reliability work; handles on-call rotation properly.
Observability Tooling (Datadog/Grafana Cloud)
Faster detection reduces outage duration. Each minute cut from detection time is a minute of downtime cost avoided. See monitoringcost.com for tool pricing.
Multi-AZ Architecture Upgrade
Eliminates single-AZ as a failure domain. Reduces severe outage frequency significantly. One-time design cost plus ongoing infra delta.
Incident Management Platform (PagerDuty/Incident.io)
Faster escalation, better runbooks, and automated incident coordination reduce MTTR. See pagerdutypricing.com for current rates.
Multi-Region Active-Active
Eliminates single-region dependency. Typically required for 99.999% SLA targets. Significant engineering investment and ongoing operational complexity.
5-Slide CFO Deck Structure
Slide 1: The Current Cost
Show 12 months of outage incidents: date, duration, estimated cost. Total to an annual figure. Source: your internal post-mortems or SRE report. Use ITIC 2024 as a floor if internal data is incomplete.
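The Slide 1 total can be produced from a simple incident log. The sketch below is one way to do it, assuming a flat per-hour cost. The `Incident` class, the dates, and the durations are hypothetical placeholders; the durations were chosen to sum to the 12 hours in the worked example (4 outages averaging 3 hours).

```python
from dataclasses import dataclass


@dataclass
class Incident:
    date: str              # ISO date of the outage
    duration_hours: float  # from the post-mortem


PER_HOUR_COST = 180_000  # per-hour figure from the worked example

# Hypothetical placeholder rows; replace with your own post-mortem data.
incidents = [
    Incident("2025-02-11", 2.5),
    Incident("2025-05-03", 4.0),
    Incident("2025-08-19", 1.5),
    Incident("2025-11-27", 4.0),
]

for inc in incidents:
    cost = inc.duration_hours * PER_HOUR_COST
    print(f"{inc.date}  {inc.duration_hours:>4.1f} h  ${cost:>10,.0f}")

total_hours = sum(inc.duration_hours for inc in incidents)
annual_cost = total_hours * PER_HOUR_COST
print(f"Total       {total_hours:>4.1f} h  ${annual_cost:>10,.0f}")
```

A flat per-hour cost is a simplification; if post-mortems record a per-incident cost, sum those figures directly instead.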
Slide 2: Industry Benchmark Comparison
Show whether your per-hour cost and outage frequency are above or below the ITIC 2024 benchmark for your sector. This shows that your calculation methodology is credible, not just self-reported.
Slide 3: The Proposed Investment
Describe the specific investment: headcount, tooling, infrastructure upgrade. Show the annual cost and what you expect it to change (metric: MTTR reduction, outage frequency reduction).
Slide 4: The ROI Calculation
Walk through the formula: Expected Annual Cost x Reduction % - Investment Cost = Net Benefit. Show three scenarios: conservative (40%), expected (60%), optimistic (80%). Show payback months for each.
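The three scenarios can be generated in one pass. This sketch reuses the worked example's figures ($2,160,000 expected annual cost, $350,000 investment); the dictionary and function names are assumptions for illustration.

```python
expected_annual_cost = 2_160_000  # from the worked example
investment = 350_000              # proposed annual investment

scenarios = {"conservative": 0.40, "expected": 0.60, "optimistic": 0.80}


def scenario_row(reduction: float) -> tuple[float, float, float]:
    """Return (annual savings, net benefit, payback in months) for one scenario."""
    savings = expected_annual_cost * reduction
    net_benefit = savings - investment
    payback = investment / savings * 12
    return savings, net_benefit, payback


results = {name: scenario_row(r) for name, r in scenarios.items()}

for name, (savings, net_benefit, payback) in results.items():
    print(
        f"{name:<12} reduction {scenarios[name]:.0%}  "
        f"savings ${savings:,.0f}  net ${net_benefit:,.0f}  "
        f"payback {payback:.1f} mo"
    )
```

With these inputs, the net benefit stays positive even in the conservative case, which is usually the line a CFO reads first.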
Slide 5: Risk of Inaction
Reference a real incident from /case-studies that is comparable to your industry and company size. CrowdStrike for SaaS/enterprise. Healthcare examples for healthcare. Frame it: one of these events is the downside case if no investment is made.
Handling CFO Objections
"We haven't had an outage in 6 months. Why invest now?"
Absence of outages is not evidence of absence of risk. Show the SLA allowance math: a 99.9% SLA permits about 8.76 hours of downtime per year (0.1% of 8,760 hours). The 6 months of stability may reflect luck or the absence of a triggering event. Reference CrowdStrike: every affected organization was fine until July 19, 2024 at 04:09 UTC.
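The SLA allowance math is a one-line calculation. The sketch below assumes a 365-day year (8,760 hours), under which a 99.9% SLA permits about 8.76 hours of downtime. Figures quoted elsewhere may differ slightly with rounding or a different year length. The function name is an assumption.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760; assumes a non-leap year


def allowed_downtime_hours(sla_percent: float) -> float:
    """Hours of downtime per year permitted by an availability SLA."""
    return HOURS_PER_YEAR * (1 - sla_percent / 100)


for sla in (99.0, 99.9, 99.99, 99.999):
    hours = allowed_downtime_hours(sla)
    print(f"{sla}% SLA -> {hours:.3f} h/yr ({hours * 60:.1f} min/yr)")
```

Setting the allowance beside the last 6 months of actual downtime shows how much of the budget is unspent and how little the quiet period says about what could still happen.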
"$300K/hr seems too high. Our revenue is only $20M/year."
The per-hour figure includes productivity loss (your employees not working), recovery cost, and reputation/churn - not just revenue. For a $20M ARR company with 50 employees, the productivity cost alone is approximately $4,250/hr. Full calculation with all components is in the /how-to-calculate page.
"We use AWS, they have 99.99% SLA - that covers us."
The SLA covers your monthly AWS bill, not your revenue loss. See /sla-credits for the exact math. If your AWS spend is $5,000/month and a breach of the 99.99% SLA earns a 10% service credit, you receive $500. If the outage costs $50,000 in lost revenue, the credit covers 1% of the loss.
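The credit-versus-loss comparison can be checked with a few lines. In this sketch the 10% credit rate is an assumption inferred from the $500 credit on a $5,000 monthly bill; actual credit tiers depend on the provider's SLA terms and on how far availability fell. The function name is hypothetical.

```python
def credit_coverage(monthly_spend: float, credit_rate: float, outage_loss: float) -> tuple[float, float]:
    """Return (service credit in dollars, fraction of the outage loss it covers)."""
    credit = monthly_spend * credit_rate
    return credit, credit / outage_loss


# Figures from the example; credit_rate=0.10 is an assumed tier.
credit, coverage = credit_coverage(monthly_spend=5_000, credit_rate=0.10, outage_loss=50_000)

print(f"Service credit: ${credit:,.0f}")
print(f"Share of loss covered: {coverage:.0%}")
```

Even at a far more generous credit rate, the credit is capped by the cloud bill, while the loss scales with revenue.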