SRE Failure Scenarios

This document contains 17 real-world SRE incident scenarios designed for training, testing, and RCA (Root Cause Analysis) practice. Each scenario simulates cascading failures common in Kubernetes, GitOps, and cloud-native environments.


Table of Contents

  0. Broken Image -> ImagePullBackOff (Beginner)
  1. Stale ConfigMap -> Argo CD Drift -> Application CrashLoopBackOff
  2. Expired Secret Rotation -> Database Auth Failures -> API Downtime
  3. Node Pressure + HPA Misconfiguration -> Evictions -> Argo Rollback
  4. NetworkPolicy Change -> Service Mesh Timeout -> API Chain Failure
  5. Misconfigured Autoscaler -> Cost Spike -> Cluster Autoscaler Backoff
  6. ArgoCD Image Updater -> Wrong Tag Match -> Rollout Regression
  7. Redis Failover -> Connection Leaks -> Node Resource Pressure
  8. Argo Rollout Canary + Wrong Weighting -> Full Outage
  9. Cloud DNS TTL + Config Drift -> Multi-Cluster Routing Blackhole
  10. Throttled API Rate Limits -> Prometheus Scrape Failures -> HPA Misfires
  11. ArgoCD Drift -> Secret Mismatch -> DB Connection Leak -> Node Pressure -> Prometheus Throttle -> Alert Delays
  12. Misconfigured HPA -> Cost Spike -> Cluster Autoscaler -> Throttled API -> ArgoCD Sync Failure -> Alertmanager Storm
  13. NetworkPolicy Restriction -> Service Mesh Retry Storm -> DB Saturation -> Prometheus Lag -> Alert Suppression -> Partial Outage
  14. Postgres Schema Drift -> ORM Migration Failure -> API Crash -> Prometheus Missing Series -> Alert Flap -> Argo Rollback Loop
  15. Prometheus High Cardinality -> TSDB Corruption -> Metrics Drop -> Alert Delay -> Argo Rollout Overshoot -> DB Overload
  16. Kube-API Slowdown -> Prometheus Scrape Failures -> Alert Silencing -> Cost Anomaly -> Cluster Node Eviction -> App Downtime
  Bonus: Combined Multi-Layer Scenarios

0. Broken Image -> ImagePullBackOff

Beginner Scenario - The simplest Kubernetes failure. Perfect for learning basic troubleshooting.

Primary Trigger

Pod references a container image that doesn't exist or is unreachable.

Propagation Path

  1. Pod Created: Kubernetes scheduler assigns pod to a node
  2. Image Pull Attempt: Kubelet tries to pull the container image
  3. Pull Failure: Image registry is unreachable or image doesn't exist
  4. Backoff Loop: Kubelet enters exponential backoff retry → ImagePullBackOff

Impact

  • Pod never starts
  • Application unavailable
  • Deployment stuck in pending state
  • No container logs available (container never created)

Detection Signals

  • ImagePullBackOff or ErrImagePull pod status
  • Events showing "Failed to pull image" or "rpc error"
  • Container status: waiting with reason ImagePullBackOff

Commands to observe:

# View pod status showing ImagePullBackOff
kubectl get pods -n sre-demo

# View events showing image pull failures
kubectl get events -n sre-demo --sort-by='.lastTimestamp'

# Describe pod for detailed error messages
kubectl describe pod broken-image-demo -n sre-demo

# Check container status
kubectl get pod broken-image-demo -n sre-demo -o jsonpath='{.status.containerStatuses[0].state}'

Mitigation Steps

  1. Identify the incorrect image reference
  2. Verify the image exists in the registry
  3. Check registry authentication (imagePullSecrets)
  4. Update pod/deployment with correct image
  5. Verify pod starts successfully

Prevention

  • Use image digest (SHA) instead of mutable tags
  • Implement CI/CD image validation
  • Use private registry with proper access controls
  • Set up image scanning and vulnerability checks
  • Pre-pull images to nodes for critical workloads

Common Causes

  1. Typo in image name or tag
  2. Image deleted from registry
  3. Private registry without imagePullSecrets
  4. Registry rate limiting (Docker Hub)
  5. Network policies blocking registry access
  6. Registry authentication expired

1. Stale ConfigMap -> Argo CD Drift -> Application CrashLoopBackOff

Primary Trigger

A ConfigMap is updated manually in the cluster but never committed to Git, so the live state diverges from what ArgoCD tracks.

Propagation Path

  1. Manual Hotfix: DevOps engineer hotfixes app config directly in the cluster (bypassing GitOps)
  2. Drift Detection: ArgoCD detects drift but auto-sync is disabled
  3. Forced Reconciliation: A later manual or forced sync reverts the live objects to the Git state, undoing the hotfix
  4. Configuration Mismatch: The app restarts with the stale Git config, which is missing the new environment variable, and fails on startup

Impact

  • Pod crash loops
  • Health checks fail
  • SLO violations
  • Application unavailability

Detection Signals

  • CrashLoopBackOff pod status
  • ArgoCD drift warnings
  • Application logs showing missing configuration (ERROR: NEW_FEATURE environment variable is missing!)
  • Increased restart count in pods
  • Pod events showing BackOff and container restarts

Commands to observe:

# View pod status showing CrashLoopBackOff
kubectl get pods -n demo-app

# View events showing crashes
kubectl get events -n demo-app --sort-by='.lastTimestamp'

# View pod logs showing the error
kubectl logs -n demo-app -l app=demo-app --tail=20

# Describe pod for detailed restart info
kubectl describe pod -n demo-app -l app=demo-app | grep -A5 "State\|Restart"

Mitigation Steps

  1. Review ArgoCD drift reports
  2. Identify the missing configuration in Git
  3. Update the Git repository with the correct configuration
  4. Sync ArgoCD application
  5. Verify pod health and application functionality

Prevention

  • Enforce GitOps workflows with admission controllers
  • Enable ArgoCD auto-sync with caution
  • Implement configuration validation in CI/CD
  • Use policy engines (OPA/Kyverno) to prevent manual changes

2. Expired Secret Rotation -> Database Auth Failures -> API Downtime

Primary Trigger

Database credentials in a Kubernetes Secret expired.

Propagation Path

  1. Silent Failure: Secret auto-rotation job failed silently (CronJob misconfigured)
  2. Cached Credentials: Apps kept using old secret cached by kubelet
  3. Connection Rejection: Database rejects new connections
  4. Pool Exhaustion: Connection pool exhaustion -> API pods crash -> 502s on ingress

Impact

  • Complete API outage
  • Elevated latency
  • Multiple alert cascades
  • Customer-facing service disruption

Detection Signals

  • Database connection errors in application logs
  • 502 Bad Gateway responses
  • Connection pool exhaustion metrics
  • Failed CronJob executions
  • Authentication errors in database audit logs
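
Commands to observe (a sketch of how these signals might be checked; the namespace, CronJob, Secret, and label names are assumptions):

# Check whether the rotation CronJob has been running and completing
kubectl get cronjob secret-rotation -n demo-app
kubectl get jobs -n demo-app --sort-by=.metadata.creationTimestamp

# Check the Secret's creation timestamp (a stale date can hint that rotation stopped)
kubectl get secret db-credentials -n demo-app -o jsonpath='{.metadata.creationTimestamp}'

# Look for authentication and connection-pool errors in the API pods
kubectl logs -n demo-app -l app=api --tail=50 | grep -i "auth\|connection"

# Check pod restarts behind the 502s
kubectl get pods -n demo-app -l app=api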

Mitigation Steps

  1. Identify the expired credentials
  2. Manually rotate secrets if automation failed
  3. Restart affected pods to pick up new secrets
  4. Verify database connectivity
  5. Clear connection pools if necessary

Prevention

  • Monitor CronJob execution status
  • Implement secret rotation alerts
  • Use external secret management (Vault, AWS Secrets Manager)
  • Set up pre-expiration warnings
  • Test secret rotation in staging environments

3. Node Pressure + HPA Misconfiguration -> Evictions -> Argo Rollback

Primary Trigger

Memory leak + wrong HPA thresholds.

Propagation Path

  1. Memory Leak: App starts leaking memory
  2. Scale-Up Block: HPA scale-up blocked (maxReplicas set too low)
  3. Node Pressure: Node pressure causes Kubernetes to evict pods
  4. Auto-Rollback: ArgoCD auto-rolls back last working version due to failing health checks
  5. Broken Rollback: The rollback image had a deprecated dependency -> app becomes partially broken

Impact

  • Availability degraded
  • Stability issues
  • Recovery delayed due to Argo rollback loop
  • Inconsistent application state

Detection Signals

  • Node memory pressure warnings
  • Pod eviction events
  • HPA unable to scale warnings
  • ArgoCD rollback events
  • OOMKilled containers
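
Commands to observe (a sketch; the namespace, HPA, and application names are assumptions, and kubectl top requires metrics-server):

# Check node conditions and memory headroom
kubectl describe nodes | grep -A7 "Conditions:"
kubectl top nodes

# Look for evictions and OOM kills across the cluster
kubectl get events -A --sort-by='.lastTimestamp' | grep -i "evict\|oom"

# See why the HPA is not scaling as expected
kubectl get hpa -n demo-app
kubectl describe hpa demo-app -n demo-app

# Review recent ArgoCD rollbacks (argocd CLI, if installed)
argocd app history demo-app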

Mitigation Steps

  1. Identify memory leak source
  2. Correct HPA configuration (minReplicas, maxReplicas, target utilization)
  3. Add or scale nodes to relieve pressure
  4. Fix the memory leak in application code
  5. Deploy patched version through proper GitOps workflow

Prevention

  • Set proper resource requests and limits
  • Configure HPA with realistic thresholds
  • Implement memory profiling and leak detection
  • Use Vertical Pod Autoscaler (VPA) for recommendations
  • Test autoscaling behavior under load

4. NetworkPolicy Change -> Service Mesh Timeout -> API Chain Failure

Primary Trigger

Updated NetworkPolicy restricts inter-namespace communication.

Propagation Path

  1. Policy Update: Security engineer tightens network policy to isolate staging from prod namespaces
  2. Connection Timeout: Service mesh (Istio/Linkerd) sidecars timeout on cross-namespace calls
  3. Service Isolation: Frontend services can't reach backend auth microservice
  4. Retry Storm: Retry storms cause cascading latency and load on ingress

Impact

  • Partial outage with high error rates
  • Customers see random 504 Gateway Timeout errors
  • Service dependency failures
  • Increased ingress load

Detection Signals

  • 504 Gateway Timeout errors
  • Service mesh timeout metrics
  • NetworkPolicy applied events
  • Increased retry attempts in logs
  • Distributed tracing showing broken service chains
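
Commands to observe (a sketch; the namespace, policy, service, and deployment names are assumptions):

# List NetworkPolicies and inspect the recently changed one
kubectl get networkpolicies -A
kubectl describe networkpolicy restrict-cross-namespace -n backend

# Check the frontend's sidecar for upstream timeouts (Istio's sidecar container is istio-proxy)
kubectl logs -n frontend deploy/frontend -c istio-proxy --tail=100 | grep -c "504"

# Probe the auth service from inside the cluster (the test pod may itself be subject to the policy)
kubectl run netcheck -n frontend --rm -it --image=curlimages/curl --restart=Never -- \
  curl -sS -m 3 http://auth.backend.svc.cluster.local/healthz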

Mitigation Steps

  1. Review recent NetworkPolicy changes
  2. Identify blocked communication paths
  3. Update NetworkPolicy to allow required traffic
  4. Verify service mesh configuration
  5. Test inter-service connectivity

Prevention

  • Test NetworkPolicy changes in staging
  • Use network policy visualization tools
  • Implement gradual rollout of security policies
  • Document service dependencies and communication patterns
  • Use service mesh observability for impact analysis

5. Misconfigured Autoscaler -> Cost Spike -> Cluster Autoscaler Backoff

Primary Trigger

Wrong HPA target CPU utilization set to 10%.

Propagation Path

  1. Aggressive Scaling: Autoscaler aggressively scales replicas from 3 -> 200
  2. Node Explosion: Cluster Autoscaler spins up >100 nodes in AWS/GCP
  3. Quota Exhaustion: Cost monitoring tool throttles API due to quota exhaustion
  4. Emergency Shutdown: Budget alert triggers emergency shutdown policy -> cluster scaled down abruptly

Impact

  • Production instability
  • Massive billing spike
  • Delayed reconciliation
  • Potential quota limits hit
  • Service disruption from emergency shutdown

Detection Signals

  • Abnormal replica count increase
  • Node count spike
  • Cloud provider quota warnings
  • Cost anomaly alerts
  • API throttling errors
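
Commands to observe (a sketch; the namespace, HPA, and deployment names are assumptions, and the Cluster Autoscaler deployment name varies by install):

# Check the HPA target utilization and current/desired replicas
kubectl get hpa -n demo-app
kubectl describe hpa demo-app -n demo-app

# Watch replica and node counts grow
kubectl get deployment demo-app -n demo-app -o jsonpath='{.status.replicas}'
kubectl get nodes | wc -l

# Review Cluster Autoscaler decisions and any backoff messages
kubectl logs -n kube-system deploy/cluster-autoscaler --tail=100 | grep -i "scale\|backoff"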

Mitigation Steps

  1. Immediately correct HPA target thresholds
  2. Manually scale down excess replicas
  3. Drain and remove unnecessary nodes
  4. Review and adjust budget policies
  5. Implement gradual scale-down to avoid disruption

Prevention

  • Set realistic HPA metrics and thresholds (typically 70-80% CPU)
  • Configure maxReplicas limits
  • Implement cost guardrails and alerts
  • Use cluster autoscaler limits (min/max nodes)
  • Regular autoscaling configuration reviews

6. ArgoCD Image Updater -> Wrong Tag Match -> Rollout Regression

Primary Trigger

Automated image updater matches wrong semantic tag.

Propagation Path

  1. Wrong Tag Selection: Image updater regex picks v1.2 instead of v1.2.1-hotfix
  2. Auto-Sync: ArgoCD syncs and deploys the wrong container version
  3. Missing Configuration: Newly deployed image lacks environment variable introduced in configmap
  4. Readiness Failure: Application fails readiness probe, rollouts paused

Impact

  • Half of pods stuck in Pending/CrashLoopBackOff
  • Inconsistent application state
  • Degraded service availability
  • Potential data inconsistency

Detection Signals

  • ArgoCD sync events with unexpected image tags
  • Failed readiness probes
  • Pod status showing CrashLoopBackOff
  • Image tag mismatch in deployment vs expected version
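
Commands to observe (a sketch; the application, namespace, and deployment names are assumptions, and the Image Updater deployment name varies by install):

# Compare the deployed image tag with the version you expected
kubectl get deployment demo-app -n demo-app -o jsonpath='{.spec.template.spec.containers[0].image}'

# Check ArgoCD sync status and recent revisions (argocd CLI, if installed)
argocd app get demo-app
argocd app history demo-app

# See which tag the Image Updater matched
kubectl logs -n argocd deploy/argocd-image-updater --tail=100 | grep -i "tag"

# Inspect readiness probe failures on the new pods
kubectl describe pod -n demo-app -l app=demo-app | grep -A3 "Readiness"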

Mitigation Steps

  1. Identify incorrect image tag
  2. Update ArgoCD application to use correct image
  3. Fix image updater regex pattern
  4. Sync to correct version
  5. Verify all pods are healthy

Prevention

  • Use strict semantic versioning patterns
  • Implement image tag validation
  • Require manual approval for production deployments
  • Use immutable tags or SHA digests
  • Test image updater patterns in staging

7. Redis Failover -> Connection Leaks -> Node Resource Pressure

Primary Trigger

Redis master node restarted due to zone failure.

Propagation Path

  1. Redis Failover: Clients failover to replica but connection pool not reinitialized properly
  2. Connection Leak: Old connections keep retrying -> file descriptor leak
  3. Resource Pressure: Node memory + FD pressure increases
  4. Evictions: Kubelet OOMKills non-critical pods -> observability stack degraded

Impact

  • Partial observability loss
  • Rising latency
  • Delayed RCA visibility
  • Node instability
  • Potential cascade to other services

Detection Signals

  • Redis connection errors
  • File descriptor exhaustion warnings
  • Node resource pressure events
  • OOMKilled containers
  • Missing metrics/logs from observability stack
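
Commands to observe (a sketch; the namespaces, pod, deployment, and label names are assumptions):

# Check Redis client connection counts after the failover
kubectl exec -n cache redis-replica-0 -- redis-cli info clients

# Look for reconnect loops in the application logs
kubectl logs -n demo-app -l app=api --tail=100 | grep -ci "redis\|reconnect"

# Count open file descriptors inside an app container (requires a shell in the image)
kubectl exec -n demo-app deploy/api -- sh -c 'ls /proc/1/fd | wc -l'

# Check node pressure, OOM kills, and evictions
kubectl top nodes
kubectl get events -A --sort-by='.lastTimestamp' | grep -i "oom\|evict"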

Mitigation Steps

  1. Restart affected application pods to reset connection pools
  2. Fix connection pool initialization logic
  3. Scale observability stack back up
  4. Add or scale nodes if needed
  5. Review Redis client configuration

Prevention

  • Implement proper connection pool management
  • Use Redis Sentinel or Redis Cluster for HA
  • Monitor file descriptor usage
  • Set connection pool limits and timeouts
  • Test failover scenarios regularly

8. Argo Rollout Canary + Wrong Weighting -> Full Outage

Primary Trigger

Canary weight misconfigured as 100% instead of 10%.

Propagation Path

  1. Full Traffic Shift: Argo Rollout shifts 100% of traffic to canary
  2. Schema Incompatibility: Canary connects to new DB schema incompatible with production traffic
  3. Validation Failures: All API calls fail schema validation
  4. Slow Rollback: Rollback takes minutes due to controller stuck waiting for metrics provider (Prometheus) sync

Impact

  • Complete outage
  • Severe customer impact
  • Delayed metrics collection
  • Extended recovery time
  • Data validation errors

Detection Signals

  • 100% error rate spike
  • Schema validation errors in logs
  • Argo Rollout events showing unexpected weights
  • Prometheus metrics showing full canary deployment
  • Database schema mismatch errors
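
Commands to observe (a sketch; the rollout and namespace names are assumptions, and the Argo Rollouts kubectl plugin is assumed to be installed):

# Inspect the rollout status and the weight actually applied
kubectl argo rollouts get rollout demo-app -n demo-app

# Check the canary steps defined in the Rollout spec
kubectl get rollout demo-app -n demo-app -o jsonpath='{.spec.strategy.canary.steps}'

# Look for schema validation errors in the canary pods
kubectl logs -n demo-app -l app=demo-app --tail=100 | grep -i "schema\|validation"

# Abort and return traffic to the stable ReplicaSet (first mitigation step)
kubectl argo rollouts abort demo-app -n demo-app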

Mitigation Steps

  1. Immediately abort rollout
  2. Manually shift traffic back to stable version
  3. Rollback canary deployment
  4. Fix canary weight configuration
  5. Verify database schema compatibility

Prevention

  • Validate rollout configurations before applying
  • Use progressive delivery with small initial weights (5-10%)
  • Implement automated rollback based on error rates
  • Test canary deployments in staging with production-like schemas
  • Use analysis templates with short intervals

9. Cloud DNS TTL + Config Drift -> Multi-Cluster Routing Blackhole

Primary Trigger

Ingress IP change in cluster A not propagated to cluster B.

Propagation Path

  1. IP Change: The ingress IP in cluster A changes (multi-cluster setup with GSLB or ExternalDNS)
  2. DNS Update Failure: The DNS record (TTL 10m) is not updated because the cloud provider credentials have expired
  3. Stale Routing: Clients keep resolving to the old IP, which now points at a decommissioned node pool
  4. No Failover: Failover policy not triggered due to missing health checks

Impact

  • Random region unavailability
  • Partial outage with difficult traceability
  • Intermittent connectivity issues
  • Customer experience degradation

Detection Signals

  • DNS resolution to incorrect IPs
  • Connection timeout to specific regions
  • ExternalDNS errors in logs
  • Cloud provider authentication failures
  • Health check failures not triggering failover
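
Commands to observe (a sketch; the hostname, namespaces, and ingress/ExternalDNS names are assumptions):

# Compare what public DNS returns with the ingress address the cluster reports
dig +short app.example.com
kubectl get ingress -n demo-app -o wide

# Check ExternalDNS for update and credential errors
kubectl logs -n external-dns deploy/external-dns --tail=100 | grep -i "error\|credential\|denied"

# Confirm the TTL clients are caching
dig app.example.com | grep -A2 "ANSWER SECTION"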

Mitigation Steps

  1. Update cloud provider credentials
  2. Manually update DNS records
  3. Verify health check configuration
  4. Lower DNS TTL temporarily for faster propagation
  5. Test failover mechanisms

Prevention

  • Monitor ExternalDNS operation status
  • Use shorter DNS TTLs (1-5 minutes)
  • Implement credential rotation automation
  • Configure proper health checks for GSLB
  • Test multi-cluster failover regularly

10. Throttled API Rate Limits -> Prometheus Scrape Failures -> HPA Misfires

Primary Trigger

Prometheus scraping throttled by kube-apiserver rate limits.

Propagation Path

  1. Rate Limiting: API rate limit exceeded due to surge in metric queries
  2. Missed Scrapes: Prometheus misses several scrape intervals
  3. Missing Metrics: HPA based on those metrics sees no load -> scales down pods incorrectly
  4. Cascading Failure: Underprovisioned app starts dropping requests; latency alerts trigger too late

Impact

  • Latency spike
  • Error rate increase
  • Incorrect scaling decisions
  • Missing metrics for monitoring
  • Delayed incident detection

Detection Signals

  • Prometheus scrape failure errors
  • Kube-apiserver throttling logs
  • HPA showing stale/missing metrics
  • Unexpected scale-down events
  • 429 Too Many Requests errors from API server
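
Commands to observe (a sketch; the namespace, HPA, and Prometheus service names are assumptions):

# Check whether the HPA is working from stale or missing metrics
kubectl get hpa -n demo-app
kubectl describe hpa demo-app -n demo-app | grep -i "unable\|failed"

# Check apiserver-side throttling (API Priority and Fairness rejections)
kubectl get --raw /metrics | grep apiserver_flowcontrol_rejected_requests_total

# Check Prometheus target health through a port-forward
kubectl port-forward -n monitoring svc/prometheus-operated 9090:9090
# In another terminal: count targets currently down
curl -s 'http://localhost:9090/api/v1/targets?state=any' | grep -o '"health":"down"' | wc -l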

Mitigation Steps

  1. Increase kube-apiserver rate limits
  2. Reduce Prometheus scrape frequency or cardinality
  3. Manually scale up affected workloads
  4. Add Prometheus federation or sharding
  5. Review metric collection efficiency

Prevention

  • Monitor API server request rates
  • Optimize Prometheus metric collection
  • Use metric relabeling to reduce cardinality
  • Implement Prometheus sharding for large clusters
  • Set appropriate HPA evaluation intervals
  • Use custom metrics from external sources

11. ArgoCD Drift -> Secret Mismatch -> DB Connection Leak -> Node Pressure -> Prometheus Throttle -> Alert Delays

Primary Trigger

Manual patch to Deployment in the cluster bypassed ArgoCD sync.

Propagation Path

  1. ArgoCD Drift: Manual hotfix introduces a config mismatch (DB_PASSWORD changed in Secret)
  2. Secret Mismatch: App restarts, can't connect to DB (stale connection credentials)
  3. DB Connection Leak: Connection pool retries infinitely -> Postgres starts refusing new connections
  4. Node Pressure: App pods consume CPU/memory on retry loop -> Kubelet starts OOM killing other pods
  5. Prometheus Scrapes Fail: Kubelet metrics endpoint throttled; /metrics returns 500s
  6. Alertmanager Delay: Alert thresholds missed; high latency alerts arrive 15 min late

Impact

  • Prometheus dashboards show stale data
  • Delayed alerting and incident detection
  • Database instability and connection exhaustion
  • Application latency spike
  • False sense of cluster health
  • Node resource exhaustion affecting multiple workloads

Detection Signals

  • ArgoCD drift warnings
  • Database connection pool exhaustion errors
  • Application authentication failures
  • OOMKilled containers
  • Node resource pressure events
  • Prometheus scrape failures
  • Alert delivery delays in Alertmanager
  • Kubelet /metrics endpoint errors
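
Commands to observe (a sketch; the application, namespace, and node names are assumptions):

# Inspect the drift and exactly what differs from Git (argocd CLI, if installed)
argocd app get demo-app
argocd app diff demo-app

# Look for authentication failures and connection retries in the app logs
kubectl logs -n demo-app -l app=demo-app --tail=100 | grep -i "auth\|connection"

# Check node pressure and OOM kills caused by the retry loop
kubectl top nodes
kubectl get events -A --sort-by='.lastTimestamp' | grep -i "oom\|evict"

# Verify the kubelet metrics endpoint is still answering scrapes
kubectl get --raw /api/v1/nodes/<node-name>/proxy/metrics | head -5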

Mitigation Steps

  1. Identify and revert manual changes in cluster
  2. Sync proper configuration through ArgoCD
  3. Update Secret with correct database credentials
  4. Restart affected application pods to reset connection pools
  5. Scale up nodes if resource pressure persists
  6. Verify Prometheus scrape health
  7. Review and flush Alertmanager queue

Prevention

  • Enforce GitOps-only workflows with admission controllers (OPA/Kyverno)
  • Enable ArgoCD drift detection with automated notifications
  • Implement proper Secret rotation workflows
  • Configure connection pool limits and timeouts
  • Set appropriate resource requests/limits
  • Monitor Prometheus scrape success rates
  • Use PodDisruptionBudgets to protect critical workloads
  • Implement chaos engineering to test cascading failure scenarios

12. Misconfigured HPA -> Cost Spike -> Cluster Autoscaler -> Throttled API -> ArgoCD Sync Failure -> Alertmanager Storm

Primary Trigger

HPA target CPU utilization incorrectly set to 5%.

Propagation Path

  1. HPA Misfires: Pods scale from 3 -> 500 in 10 min due to low CPU threshold
  2. Cluster Autoscaler Expansion: Adds 100+ nodes in AWS/GCP to accommodate pods
  3. Cloud Billing Surge: Cost monitoring agent hits cloud API rate limit
  4. K8s API Throttled: Controller-manager and ArgoCD syncs fail due to QPS throttling
  5. ArgoCD Drift Detected: Sync status shows "Unknown," leading to partial rollouts
  6. Alertmanager Storm: Every HPA, cost, and Argo alert fires simultaneously

Impact

  • Massive overnight billing spike ($10K+)
  • Alertmanager overload and alert fatigue
  • ArgoCD unable to maintain desired state
  • API server performance degradation
  • Engineers silencing alerts without identifying root cause
  • Production instability from partial deployments
  • Cloud provider quota exhaustion

Detection Signals

  • Abnormal replica count increase (3 -> 500+)
  • Node count explosion (100+ nodes)
  • Cloud API throttling errors (429 responses)
  • Kube-apiserver high latency and QPS throttling
  • ArgoCD sync failures and "Unknown" status
  • Cost anomaly alerts
  • Alert storm in Alertmanager (hundreds of firing alerts)
  • HPA events showing aggressive scaling
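
Commands to observe (a sketch; the namespace, deployment, and monitoring service names are assumptions, and the Cluster Autoscaler and Alertmanager names vary by install):

# Confirm the bad HPA target and the replica explosion
kubectl get hpa -A
kubectl get deployment demo-app -n demo-app -o jsonpath='{.status.replicas}'

# Watch node growth and Cluster Autoscaler scale-up decisions
kubectl get nodes | wc -l
kubectl logs -n kube-system deploy/cluster-autoscaler --tail=100 | grep -i "scale"

# Check apiserver throttling and ArgoCD application state
kubectl get --raw /metrics | grep apiserver_flowcontrol_rejected_requests_total
kubectl get applications -n argocd

# Count alerts currently firing in Alertmanager
kubectl port-forward -n monitoring svc/alertmanager-operated 9093:9093
# In another terminal:
curl -s http://localhost:9093/api/v2/alerts | grep -o '"state":"active"' | wc -l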

Mitigation Steps

  1. Immediately correct HPA target threshold to realistic value (70-80%)
  2. Manually scale down excess replicas
  3. Drain and remove unnecessary nodes gradually
  4. Increase kube-apiserver QPS limits temporarily
  5. Pause ArgoCD auto-sync until API stability restored
  6. Clear Alertmanager alert queue
  7. Review and adjust cost monitoring thresholds
  8. Implement emergency budget controls

Prevention

  • Set realistic HPA metrics (typically 70-80% CPU utilization)
  • Configure maxReplicas limits on all HPAs
  • Implement HPA configuration validation in CI/CD
  • Use cluster autoscaler limits (min/max nodes per node group)
  • Set up cost guardrails and budget alerts
  • Monitor kube-apiserver request rates and QPS
  • Implement rate limiting on cost monitoring tools
  • Regular review of autoscaling configurations
  • Test autoscaling behavior in staging environments
  • Use policy engines to validate HPA configurations

13. NetworkPolicy Restriction -> Service Mesh Retry Storm -> DB Saturation -> Prometheus Lag -> Alert Suppression -> Partial Outage

Primary Trigger

Security team tightened NetworkPolicy to block cross-namespace traffic.

Propagation Path

  1. NetworkPolicy Restriction: API gateway pods can't reach auth service in different namespace
  2. Istio Sidecar Retries: Each failed call retries 5x, flooding the service mesh with traffic
  3. DB Saturation: Auth service database hit by redundant requests; connection pool full
  4. Prometheus Metrics Lag: /metrics endpoint times out; scraping delayed 1-2m
  5. Alert Suppression: Prometheus alerting rules rely on for: 2m thresholds -> no alert fires in time
  6. Partial Outage: Users intermittently unable to log in, while monitoring appears "green"

Impact

  • High real-user impact with minimal alerting
  • Intermittent authentication failures
  • Database connection pool exhaustion
  • Service mesh traffic explosion
  • Observability blind spot
  • Extended MTTR due to delayed detection
  • Customer-facing login issues

Detection Signals

  • NetworkPolicy applied events
  • Service mesh timeout errors
  • Increased retry attempts in sidecar logs
  • Database connection pool exhaustion
  • Auth service elevated error rates
  • Prometheus scrape timeout warnings
  • Distributed tracing showing broken service chains
  • User-reported login issues before alerts fire
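
Commands to observe (a sketch; the namespaces, policy, gateway, database, and Prometheus names are assumptions, and database auth settings vary):

# Review the NetworkPolicies that were just tightened
kubectl get networkpolicies -A
kubectl describe networkpolicy deny-cross-namespace -n auth

# Gauge the error volume in the gateway sidecars (Istio's sidecar container is istio-proxy)
kubectl logs -n gateway deploy/api-gateway -c istio-proxy --tail=200 | grep -c "503\|504"

# Check auth database connection saturation
kubectl exec -n auth deploy/auth-db -- psql -U postgres -c "SELECT count(*) FROM pg_stat_activity;"

# Check Prometheus for scrape errors on the auth targets
kubectl port-forward -n monitoring svc/prometheus-operated 9090:9090
# In another terminal:
curl -s 'http://localhost:9090/api/v1/targets' | grep -o '"lastError":"[^"]*"' | head -5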

Mitigation Steps

  1. Review recent NetworkPolicy changes
  2. Identify blocked communication paths using service mesh observability
  3. Update NetworkPolicy to allow required cross-namespace traffic
  4. Restart affected sidecars to clear retry queues
  5. Scale auth service and database if needed
  6. Verify Prometheus scrape health recovery
  7. Review and adjust alert timing thresholds

Prevention

  • Test NetworkPolicy changes in staging first
  • Use network policy visualization tools (e.g., Cilium Hubble)
  • Implement gradual rollout of security policies
  • Document service dependencies and communication patterns
  • Configure appropriate retry budgets in service mesh
  • Set connection pool limits and circuit breakers
  • Lower alert evaluation intervals for critical services
  • Use synthetic monitoring to detect issues before users
  • Implement real-user monitoring (RUM)
  • Create NetworkPolicy templates validated by CI/CD

14. Postgres Schema Drift -> ORM Migration Failure -> API Crash -> Prometheus Missing Series -> Alert Flap -> Argo Rollback Loop

Primary Trigger

Database schema manually altered (a nullable constraint added outside migration control).

Propagation Path

  1. Schema Drift: DB table altered outside migration workflow
  2. ORM Migration Fails: Next app deploy from ArgoCD fails startup migration check
  3. API Crash: App enters CrashLoopBackOff; readiness probe fails
  4. Prometheus Missing Time-Series: Metrics labels for API latency disappear from TSDB
  5. Alertmanager Flap: Missing metrics cause alerts to resolve/fire intermittently
  6. ArgoCD Rollback Loop: Argo repeatedly rolls back + redeploys same version due to failing probes

Impact

  • Endless rollout loop preventing recovery
  • Alert flapping causing confusion
  • API service unavailability
  • No clear root cause in monitoring
  • Development team blames CI/CD pipeline
  • Database schema inconsistency
  • Extended outage duration

Detection Signals

  • Database migration failure errors in application logs
  • CrashLoopBackOff pod status
  • ArgoCD sync/rollback events in rapid succession
  • Missing Prometheus time-series (metrics disappearing)
  • Alert state flapping (firing -> resolved -> firing)
  • Readiness probe failures
  • Schema validation errors
  • Database audit logs showing manual schema changes
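
Commands to observe (a sketch; the application, namespace, database, and table names are assumptions):

# Look for migration failures in the crashing API pods
kubectl logs -n demo-app -l app=api --previous --tail=100 | grep -i "migration\|schema"

# See how fast ArgoCD is cycling between rollback and redeploy (argocd CLI, if installed)
argocd app history demo-app

# Inspect the live table definition to spot the manual change
kubectl exec -n db deploy/postgres -- psql -U postgres -d appdb -c "\d+ users"

# Check for flapping alerts in Alertmanager
kubectl port-forward -n monitoring svc/alertmanager-operated 9093:9093
# In another terminal:
curl -s 'http://localhost:9093/api/v2/alerts?active=true' | head -c 500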

Mitigation Steps

  1. Identify schema drift in database
  2. Manually revert unauthorized schema changes or apply proper migration
  3. Pause ArgoCD auto-sync to break rollback loop
  4. Fix application migration scripts to handle current schema state
  5. Deploy corrected version manually
  6. Verify Prometheus metrics recovery
  7. Resume ArgoCD auto-sync after stability confirmed
  8. Implement schema change detection

Prevention

  • Enforce database schema change controls
  • Use migration tools exclusively (Flyway, Liquibase, Alembic)
  • Implement database change approval workflows
  • Enable database audit logging
  • Test migrations in staging environments
  • Use schema validation in CI/CD pipeline
  • Configure migration failure alerts
  • Implement database GitOps workflows
  • Use read-only database users for application runtime
  • Add pre-deployment schema compatibility checks
  • Disable ArgoCD auto-rollback for database-dependent services

15. Prometheus High Cardinality -> TSDB Corruption -> Metrics Drop -> Alert Delay -> Argo Rollout Overshoot -> DB Overload

Primary Trigger

Unbounded metric labels from dynamic pod names or user IDs.

Propagation Path

  1. High-Cardinality Metric: /metrics includes user_id label -> millions of time-series created
  2. Prometheus TSDB Corruption: WAL write queue overflows, partial block compaction fails
  3. Metrics Drop: CPU/memory metrics become stale; HPA scales incorrectly based on old data
  4. Alert Delay: Alertmanager backlog increases; firing delayed by 10+ min
  5. Argo Rollout Overshoot: Rollout controller reads outdated metrics, increases canary weight to 100%
  6. DB Overload: New app version hits DB with schema bug; DB CPU pegged at 100%

Impact

  • Monitoring system silent during critical failure
  • Massive query load on database
  • Complete canary rollout of buggy version
  • False positives after recovery
  • TSDB storage exhaustion
  • Incorrect autoscaling decisions
  • Extended outage with delayed detection

Detection Signals

  • Prometheus TSDB corruption errors
  • WAL write failures in Prometheus logs
  • Metrics cardinality explosion (series count spike)
  • Prometheus storage usage spike
  • Stale metrics in queries
  • HPA showing outdated metric values
  • Argo Rollout events showing unexpected weight changes
  • Database CPU saturation
  • Alert delivery delays
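
Commands to observe (a sketch; the Prometheus pod and service names assume a prometheus-operator install, and the rollout and database namespaces are assumptions):

# Look for WAL and compaction errors in Prometheus itself
kubectl logs -n monitoring prometheus-k8s-0 -c prometheus --tail=200 | grep -i "wal\|compaction\|corrupt"

# Inspect head series count and TSDB status
kubectl port-forward -n monitoring svc/prometheus-operated 9090:9090
# In another terminal:
curl -s http://localhost:9090/api/v1/status/tsdb | head -c 1000
curl -s 'http://localhost:9090/api/v1/query?query=prometheus_tsdb_head_series'

# Check the canary weight the rollout controller actually applied (Argo Rollouts plugin assumed)
kubectl argo rollouts get rollout demo-app -n demo-app

# Check database saturation
kubectl top pod -n db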

Mitigation Steps

  1. Identify and remove high-cardinality metrics
  2. Restart Prometheus to clear WAL corruption
  3. Implement metric relabeling to drop problematic labels
  4. Manually rollback Argo Rollout to stable version
  5. Scale database to handle load
  6. Fix application code generating high-cardinality metrics
  7. Clear Alertmanager backlog
  8. Verify metrics collection recovery

Prevention

  • Monitor Prometheus cardinality and series count
  • Implement metric relabeling rules to limit cardinality
  • Set cardinality limits in instrumentation libraries
  • Use metric label allowlists
  • Regular Prometheus performance reviews
  • Implement TSDB storage monitoring and alerts
  • Use metric naming conventions avoiding dynamic labels
  • Configure Prometheus retention policies
  • Implement progressive delivery safeguards (small initial canary weights)
  • Add Argo Rollout analysis templates with strict thresholds
  • Use database connection pooling and query timeouts
  • Test new instrumentation in staging first

16. Kube-API Slowdown -> Prometheus Scrape Failures -> Alert Silencing -> Cost Anomaly -> Cluster Node Eviction -> App Downtime

Primary Trigger

etcd I/O latency spike due to cloud disk throttling.

Propagation Path

  1. etcd Latency: API responses slow down; kube-apiserver call latency >2s
  2. Prometheus Scrape Failures: Kube-state-metrics times out; missing pod/node metrics
  3. Alert Silencing: Alertmanager silences triggered alerts since the data is incomplete
  4. Autoscaler Fails: HPA sees "no metrics," scales pods down to minReplicas
  5. Cost Anomaly Detector: Cost monitoring agent retries repeatedly -> hits cloud API quota
  6. Node Eviction Chaos: Node pressure rises; pods evicted mid-transaction; service downtime

Impact

  • Incident triggered by infrastructure bottleneck
  • Misdiagnosed as HPA regression
  • Metrics and alerts unreliable throughout incident
  • Inappropriate scale-down during high load
  • Cloud API quota exhaustion
  • Pod evictions causing data loss
  • Service disruption with poor visibility

Detection Signals

  • etcd high latency warnings (fsync duration >100ms)
  • Kube-apiserver slow request logs
  • Prometheus scrape timeout errors
  • Kube-state-metrics unavailability
  • HPA showing "unable to fetch metrics"
  • Alertmanager silence events
  • Unexpected scale-down events
  • Cloud API throttling (429 errors)
  • Node pressure events
  • Pod eviction events
  • Disk I/O throttling metrics
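
Commands to observe (a sketch; the node, namespace, and workload names are assumptions, and the etcd pod name applies to kubeadm-style static pods):

# Check apiserver-side etcd and request latency metrics
kubectl get --raw /metrics | grep '^etcd_request_duration_seconds' | head -5
kubectl get --raw /metrics | grep '^apiserver_request_duration_seconds_count' | head -5

# Check etcd's own logs for slow fsync / apply warnings
kubectl logs -n kube-system etcd-<control-plane-node> --tail=200 | grep -i "took too long\|slow"

# Check whether the HPA is failing to fetch metrics and scaling down
kubectl describe hpa demo-app -n demo-app | grep -i "unable\|FailedGetResourceMetric"
kubectl get events -n demo-app --sort-by='.lastTimestamp' | grep -i "scaled down"

# Check node pressure and evictions
kubectl top nodes
kubectl get events -A --sort-by='.lastTimestamp' | grep -i "evict"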

Mitigation Steps

  1. Identify etcd disk I/O bottleneck
  2. Increase etcd disk IOPS or migrate to faster storage
  3. Reduce kube-apiserver load (throttle clients if needed)
  4. Manually scale up underprovisioned workloads
  5. Disable cost monitoring agent temporarily
  6. Add or scale nodes to relieve pressure
  7. Verify Prometheus scrape recovery
  8. Review and restore alerts from silence
  9. Implement etcd performance tuning

Prevention

  • Monitor etcd performance metrics (disk latency, fsync duration)
  • Use high-performance storage for etcd (SSD, provisioned IOPS)
  • Implement etcd disk I/O alerts
  • Optimize kube-apiserver configuration and rate limits
  • Use Prometheus federation or sharding to reduce API load
  • Configure appropriate HPA evaluation intervals and fallback behavior
  • Implement rate limiting on cost monitoring tools
  • Set PodDisruptionBudgets to prevent excessive evictions
  • Use node affinity to isolate critical workloads
  • Regular etcd performance testing and capacity planning
  • Monitor cloud resource quotas
  • Implement graceful degradation for metrics unavailability

Bonus: Combined Multi-Layer Scenarios

For advanced RCA training, combine multiple scenarios to simulate complex, cascading failures:

Scenario A: ConfigMap Drift + NetworkPolicy + Autoscaler Misconfig

Trigger Chain:

  1. Stale ConfigMap causes app instability
  2. NetworkPolicy change blocks service communication during troubleshooting
  3. Misconfigured HPA scales up aggressively trying to handle errors
  4. Cost spike triggers emergency response while debugging

Complexity: Multiple teams involved (Dev, Security, Platform), competing priorities, cost pressure


Scenario B: Secret Expiry + ArgoCD Sync Delay + DB Connection Exhaustion

Trigger Chain:

  1. Database credentials expire silently
  2. ArgoCD sync delayed due to configuration drift
  3. Apps exhaust connection pools trying to reconnect
  4. New deployments fail due to inability to verify DB connectivity

Complexity: Authentication layer, GitOps workflow, connection management, deployment pipeline


Scenario C: DNS TTL + Prometheus Throttling + Canary Rollout Failure

Trigger Chain:

  1. DNS propagation delay causes traffic routing issues
  2. Prometheus throttled by API server, missing metrics
  3. Canary rollout proceeds without proper metrics validation
  4. Faulty canary deployed to 100% traffic before metrics appear

Complexity: Multi-cluster, observability gaps, progressive delivery, timing dependencies


Using These Scenarios

For Training

  • Walk through each propagation path step-by-step
  • Identify detection points and signals
  • Practice incident response procedures
  • Document lessons learned

For Testing

  • Implement chaos engineering experiments
  • Test monitoring and alerting coverage
  • Validate incident response playbooks
  • Verify automated remediation

For RCA Practice

  • Present scenarios without showing propagation paths
  • Practice hypothesis formation and testing
  • Use distributed tracing and logging to piece together timeline
  • Identify multiple contributing factors

For Prevention

  • Review scenarios against current architecture
  • Identify gaps in observability and automation
  • Implement guardrails and policy enforcement
  • Regular game day exercises

Contributing

To add new scenarios or improve existing ones:

  1. Follow the established format (Trigger -> Propagation -> Impact -> Detection -> Mitigation -> Prevention)
  2. Base scenarios on real incidents or realistic failure modes
  3. Include specific tools and technologies
  4. Provide actionable detection signals and mitigation steps

Last Updated: 2025-11-05
Maintainer: SRE Team
Version: 1.1.0