Why Use This This skill provides specialized capabilities for aiskillstore's codebase.
Use Cases Developing new features in the aiskillstore repository Refactoring existing code to follow aiskillstore standards Understanding and working with aiskillstore's codebase structure
Install Guide 2 steps 1 2 Install inside Ananke
Click Install Skill, paste the link below, then press Install.
https://github.com/aiskillstore/marketplace/tree/main/skills/asmayaseen/operating-production-services Skill Snapshot Auto scan of skill assets. Informational only.
Valid SKILL.md Checks against SKILL.md specification
Source & Community
Updated At Jan 19, 2026, 04:39 AM
Skill Stats
SKILL.md 198 Lines
Total Files 1
Total Size 0 B
License NOASSERTION
---
name: operating-production-services
description: |
SRE patterns for production service reliability: SLOs, error budgets, postmortems, and incident response.
Use when defining reliability targets, writing postmortems, implementing SLO alerting, or establishing
on-call practices. NOT for initial service development (use scaffolding skills instead).
---
# Operating Production Services
Production reliability patterns: measure what matters, learn from failures, improve systematically.
## Quick Reference
| Need | Go To |
|------|-------|
| Define reliability targets | [SLOs & Error Budgets](#slos--error-budgets) |
| Write incident report | [Postmortem Templates](#postmortem-templates) |
| Set up SLO alerting | [references/slo-alerting.md](references/slo-alerting.md) |
---
## SLOs & Error Budgets
### The Hierarchy
```
SLA (Contract) → SLO (Target) → SLI (Measurement)
```
### Common SLIs
```promql
# Availability: successful requests / total requests
sum(rate(http_requests_total{status!~"5.."}[28d]))
/
sum(rate(http_requests_total[28d]))
# Latency: requests below threshold / total requests
sum(rate(http_request_duration_seconds_bucket{le="0.5"}[28d]))
/
sum(rate(http_request_duration_seconds_count[28d]))
```
### SLO Targets Reality Check
| SLO % | Downtime/Month | Downtime/Year |
|-------|----------------|---------------|
| 99% | 7.2 hours | 3.65 days |
| 99.9% | 43 minutes | 8.76 hours |
| 99.95% | 22 minutes | 4.38 hours |
| 99.99% | 4.3 minutes | 52 minutes |
**Don't aim for 100%.** Each nine costs exponentially more.
### Error Budget
```
Error Budget = 1 - SLO Target
```
**Example:** 99.9% SLO = 0.1% error budget = 43 minutes/month
**Policy:**
| Budget Remaining | Action |
|------------------|--------|
| > 50% | Normal velocity |
| 10-50% | Postpone risky changes |
| < 10% | Freeze non-critical changes |
| 0% | Feature freeze, fix reliability |
See [references/slo-alerting.md](references/slo-alerting.md) for Prometheus recording rules and multi-window burn rate alerts.
---
## Postmortem Templates
### The Blameless Principle
| Blame-Focused | Blameless |
|---------------|-----------|
| "Who caused this?" | "What conditions allowed this?" |
| Punish individuals | Improve systems |
| Hide information | Share learnings |
### When to Write Postmortems
- SEV1/SEV2 incidents
- Customer-facing outages > 15 minutes
- Data loss or security incidents
- Near-misses that could have been severe
- Novel failure modes
### Standard Template
```markdown
# Postmortem: [Incident Title]
**Date**: YYYY-MM-DD | **Duration**: X min | **Severity**: SEVX
## Executive Summary
One paragraph: what happened, impact, root cause, resolution.
## Timeline (UTC)
| Time | Event |
|------|-------|
| HH:MM | First alert fired |
| HH:MM | On-call acknowledged |
| HH:MM | Root cause identified |
| HH:MM | Fix deployed |
| HH:MM | Service recovered |
## Root Cause Analysis
### 5 Whys
1. Why did service fail? → [Answer]
2. Why did [1] happen? → [Answer]
3. Why did [2] happen? → [Answer]
4. Why did [3] happen? → [Answer]
5. Why did [4] happen? → [Root cause]
## Impact
- Customers affected: X
- Duration: X minutes
- Revenue impact: $X
- Support tickets: X
## Action Items
| Priority | Action | Owner | Due | Ticket |
|----------|--------|-------|-----|--------|
| P0 | [Immediate fix] | @name | Date | XXX-123 |
| P1 | [Prevent recurrence] | @name | Date | XXX-124 |
| P2 | [Improve detection] | @name | Date | XXX-125 |
```
### Quick Template (Minor Incidents)
```markdown
# Quick Postmortem: [Title]
**Date**: YYYY-MM-DD | **Duration**: X min | **Severity**: SEV3
## What Happened
One sentence description.
## Timeline
- HH:MM - Trigger
- HH:MM - Detection
- HH:MM - Resolution
## Root Cause
One sentence.
## Fix
- Immediate: [What was done]
- Long-term: [Ticket XXX-123]
```
---
## Postmortem Meeting Guide
### Structure (60 min)
1. **Opening (5 min)** - Remind: "We're here to learn, not blame"
2. **Timeline (15 min)** - Walk through events chronologically
3. **Analysis (20 min)** - What failed? Why? What allowed it?
4. **Action Items (15 min)** - Prioritize, assign owners, set dates
5. **Closing (5 min)** - Summarize learnings, confirm owners
### Facilitation Tips
- Redirect blame to systems: "What made this mistake possible?"
- Time-box tangents
- Document dissenting views
- Encourage quiet participants
---
## Anti-Patterns
| Don't | Do Instead |
|-------|------------|
| Aim for 100% SLO | Accept error budget exists |
| Skip small incidents | Small incidents reveal patterns |
| Orphan action items | Every item needs owner + date + ticket |
| Blame individuals | Ask "what conditions allowed this?" |
| Create busywork actions | Actions should prevent recurrence |
---
## Verification
Run: `python scripts/verify.py`
## References
- [references/slo-alerting.md](references/slo-alerting.md) - Prometheus rules, burn rate alerts, Grafana dashboards