2. What's Monitored and Why
- CPU, memory, and disk usage of servers and containers.
- Application response times and error rates.
- Availability of network endpoints (ping, HTTP checks); see the probe sketch after this list.
- Log ingestion for error patterns or anomalies.
- Security-related metrics like failed logins and suspicious access.
- Database health including replication lag, slow queries, and size growth.
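To make the endpoint-availability item concrete, here is a minimal sketch of an HTTP health probe in Python. The endpoint URLs and thresholds are illustrative placeholders, and the latency budget mirrors the <500 ms target in section 4; production setups would normally use a dedicated monitoring agent rather than an ad-hoc script.

```python
import requests  # third-party; pip install requests

# Hypothetical endpoints and thresholds for illustration only.
ENDPOINTS = {
    "api": "https://api.example.com/healthz",
    "web": "https://www.example.com/",
}
TIMEOUT_S = 5             # hard timeout for the probe itself
LATENCY_BUDGET_MS = 500   # matches the <500 ms response-time threshold in section 4

def check_endpoint(name: str, url: str) -> dict:
    """Probe one endpoint and return a small status record."""
    try:
        resp = requests.get(url, timeout=TIMEOUT_S)
        latency_ms = resp.elapsed.total_seconds() * 1000
        healthy = resp.ok and latency_ms < LATENCY_BUDGET_MS
        return {"name": name, "healthy": healthy,
                "status": resp.status_code, "latency_ms": round(latency_ms, 1)}
    except requests.RequestException as exc:
        # A network failure or timeout counts as an availability miss.
        return {"name": name, "healthy": False, "error": str(exc)}

if __name__ == "__main__":
    for name, url in ENDPOINTS.items():
        print(check_endpoint(name, url))
```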
3. Alert Escalation Matrix
- Level 1: Non-critical alerts sent to on-call engineers via Slack/email.
- Level 2: Repeated or unresolved alerts escalate to team leads via SMS or call.
- Level 3: Critical system failure triggers company-wide alerts and disaster recovery protocol.
- Use tools like PagerDuty, Opsgenie, or VictorOps for incident routing.
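As a rough sketch of how the three-level matrix above could be encoded, the snippet below maps an alert to an escalation level and a notification target. The repeat threshold and notifier names are assumptions; in practice, escalation policies are usually configured inside PagerDuty, Opsgenie, or VictorOps rather than hand-rolled.

```python
from dataclasses import dataclass

@dataclass
class Alert:
    name: str
    critical: bool
    unresolved_count: int  # how many times this alert has re-fired without resolution

def escalation_level(alert: Alert) -> int:
    """Map an alert to the escalation matrix above."""
    if alert.critical:
        return 3                        # Level 3: critical failure, company-wide + DR protocol
    if alert.unresolved_count >= 3:     # hypothetical repeat threshold
        return 2                        # Level 2: repeated/unresolved, team lead via SMS or call
    return 1                            # Level 1: non-critical, on-call engineer via Slack/email

def route(alert: Alert) -> str:
    level = escalation_level(alert)
    targets = {1: "on-call engineer (Slack/email)",
               2: "team lead (SMS/call)",
               3: "company-wide + disaster recovery"}
    return f"[L{level}] {alert.name} -> {targets[level]}"

print(route(Alert("disk_usage_high", critical=False, unresolved_count=4)))
```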
4. SLA Thresholds
Service Level Agreements (SLAs) define acceptable uptime and performance targets. Monitoring must continuously evaluate:
- 99.9% uptime (maximum ~43 min downtime per month)
- Mean Time to Detect (MTTD) and Mean Time to Resolve (MTTR) targets
- Response time thresholds per endpoint (e.g., <500ms)
- Error rate ceilings (e.g., <1%)
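To make the 99.9% figure concrete, the short sketch below derives the monthly downtime budget from the uptime target and checks sample measurements against the latency and error-rate ceilings above. The sample numbers are illustrative, not real measurements.

```python
# Downtime budget implied by an uptime target (30-day month assumed).
UPTIME_TARGET = 0.999
MINUTES_PER_MONTH = 30 * 24 * 60                      # 43,200 minutes
downtime_budget_min = (1 - UPTIME_TARGET) * MINUTES_PER_MONTH
print(f"Allowed downtime: {downtime_budget_min:.1f} min/month")   # ~43.2 min

# Illustrative SLA evaluation against the thresholds in this section.
ERROR_RATE_CEILING = 0.01       # <1% errors
LATENCY_CEILING_MS = 500        # <500 ms per endpoint

def sla_breached(error_rate: float, p95_latency_ms: float) -> bool:
    """True if either the error-rate or latency ceiling is exceeded."""
    return error_rate >= ERROR_RATE_CEILING or p95_latency_ms >= LATENCY_CEILING_MS

print(sla_breached(error_rate=0.004, p95_latency_ms=320))   # False: within SLA
print(sla_breached(error_rate=0.02,  p95_latency_ms=320))   # True: error budget exceeded
```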
5. Response Checklist
- Verify the alert is legitimate and reproducible.
- Review logs and metrics to locate the root cause.
- Notify stakeholders and document ongoing status.
- Mitigate impact through rollbacks, scaling, or patches.
- Log the incident in a tracking system (e.g., Jira, ServiceNow); a sketch follows this checklist.
- Conduct post-incident review to prevent recurrence.
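For the incident-logging step, here is a hedged sketch of filing a ticket through Jira's REST API (`POST /rest/api/2/issue`), assuming Jira is the tracking system in use. The base URL, project key, issue type, and credentials are placeholders and would come from your own instance and secret store.

```python
import requests  # pip install requests

# Placeholders: substitute your own Jira instance, credentials, and project.
JIRA_BASE = "https://your-company.atlassian.net"
AUTH = ("oncall@example.com", "<api-token>")     # never hard-code real credentials

def log_incident(summary: str, description: str) -> str:
    """Create a Jira issue for the incident and return its key."""
    payload = {
        "fields": {
            "project": {"key": "OPS"},            # hypothetical project key
            "summary": summary,
            "description": description,
            "issuetype": {"name": "Incident"},    # issue type must exist in the project
        }
    }
    resp = requests.post(f"{JIRA_BASE}/rest/api/2/issue", json=payload, auth=AUTH)
    resp.raise_for_status()
    return resp.json()["key"]                     # e.g., "OPS-123"

# Example: log_incident("API latency breach", "p95 > 500 ms on checkout endpoint for 15 min")
```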