SRE Failure Scenarios
This document contains 17 real-world SRE incident scenarios designed for training, testing, and RCA (Root Cause Analysis) practice. Each scenario simulates cascading failures common in Kubernetes, GitOps, and cloud-native environments.
Table of Contents
- Broken Image -> ImagePullBackOff ⭐ Beginner
- Stale ConfigMap -> Argo CD Drift -> Application CrashLoopBackOff
- Expired Secret Rotation -> Database Auth Failures -> API Downtime
- Node Pressure + HPA Misconfiguration -> Evictions -> Argo Rollback
- NetworkPolicy Change -> Service Mesh Timeout -> API Chain Failure
- Misconfigured Autoscaler -> Cost Spike -> Cluster Autoscaler Backoff
- ArgoCD Image Updater -> Wrong Tag Match -> Rollout Regression
- Redis Failover -> Connection Leaks -> Node Resource Pressure
- Argo Rollout Canary + Wrong Weighting -> Full Outage
- Cloud DNS TTL + Config Drift -> Multi-Cluster Routing Blackhole
- Throttled API Rate Limits -> Prometheus Scrape Failures -> HPA Misfires
- ArgoCD Drift -> Secret Mismatch -> DB Connection Leak -> Node Pressure -> Prometheus Throttle -> Alert Delays
- Misconfigured HPA -> Cost Spike -> Cluster Autoscaler -> Throttled API -> ArgoCD Sync Failure -> Alertmanager Storm
- NetworkPolicy Restriction -> Service Mesh Retry Storm -> DB Saturation -> Prometheus Lag -> Alert Suppression -> Partial Outage
- Postgres Schema Drift -> ORM Migration Failure -> API Crash -> Prometheus Missing Series -> Alert Flap -> Argo Rollback Loop
- Prometheus High Cardinality -> TSDB Corruption -> Metrics Drop -> Alert Delay -> Argo Rollout Overshoot -> DB Overload
- Kube-API Slowdown -> Prometheus Scrape Failures -> Alert Silencing -> Cost Anomaly -> Cluster Node Eviction -> App Downtime
- Combined Multi-Layer Scenarios
0. Broken Image -> ImagePullBackOff
⭐ Beginner Scenario - The simplest Kubernetes failure. Perfect for learning basic troubleshooting.
Primary Trigger
Pod references a container image that doesn't exist or is unreachable.
Propagation Path
- Pod Created: Kubernetes scheduler assigns pod to a node
- Image Pull Attempt: Kubelet tries to pull the container image
- Pull Failure: Image registry is unreachable or image doesn't exist
- Backoff Loop: Kubelet enters exponential backoff retry → ImagePullBackOff
Impact
- Pod never starts
- Application unavailable
- Deployment stuck in pending state
- No container logs available (container never created)
Detection Signals
- ImagePullBackOff or ErrImagePull pod status
- Events showing "Failed to pull image" or "rpc error"
- Container status: waiting with reason ImagePullBackOff
Commands to observe:
# View pod status showing ImagePullBackOff
kubectl get pods -n sre-demo
# View events showing image pull failures
kubectl get events -n sre-demo --sort-by='.lastTimestamp'
# Describe pod for detailed error messages
kubectl describe pod broken-image-demo -n sre-demo
# Check container status
kubectl get pod broken-image-demo -n sre-demo -o jsonpath='{.status.containerStatuses[0].state}'
Mitigation Steps
- Identify the incorrect image reference
- Verify the image exists in the registry
- Check registry authentication (imagePullSecrets)
- Update pod/deployment with correct image
- Verify pod starts successfully
Prevention
- Use image digest (SHA) instead of mutable tags
- Implement CI/CD image validation
- Use private registry with proper access controls
- Set up image scanning and vulnerability checks
- Pre-pull images to nodes for critical workloads
Common Causes
- Typo in image name or tag
- Image deleted from registry
- Private registry without imagePullSecrets
- Registry rate limiting (Docker Hub)
- Network policies blocking registry access
- Registry authentication expired
1. Stale ConfigMap -> Argo CD Drift -> Application CrashLoopBackOff
Primary Trigger
A ConfigMap updated manually in the cluster but not synced in ArgoCD.
Propagation Path
- Manual Hotfix: DevOps engineer hotfixes app config directly in the cluster (bypassing GitOps)
- Drift Detection: ArgoCD detects drift but auto-sync is disabled
- Forced Reconciliation: On the next reconciliation, ArgoCD re-syncs and reverts the manual change, rolling the Deployment back to the state in Git
- Configuration Mismatch: App now uses stale config missing new environment variable -> fails on startup
Impact
- Pod crash loops
- Health checks fail
- SLO violations
- Application unavailability
Detection Signals
- CrashLoopBackOff pod status
- ArgoCD drift warnings
- Application logs showing missing configuration (ERROR: NEW_FEATURE environment variable is missing!)
- Increased restart count in pods
- Pod events showing BackOff and container restarts
Commands to observe:
# View pod status showing CrashLoopBackOff
kubectl get pods -n demo-app
# View events showing crashes
kubectl get events -n demo-app --sort-by='.lastTimestamp'
# View pod logs showing the error
kubectl logs -n demo-app -l app=demo-app --tail=20
# Describe pod for detailed restart info
kubectl describe pod -n demo-app -l app=demo-app | grep -A5 "State\|Restart"
Mitigation Steps
- Review ArgoCD drift reports
- Identify the missing configuration in Git
- Update the Git repository with the correct configuration
- Sync ArgoCD application
- Verify pod health and application functionality
Prevention
- Enforce GitOps workflows with admission controllers
- Enable ArgoCD auto-sync with caution
- Implement configuration validation in CI/CD
- Use policy engines (OPA/Kyverno) to prevent manual changes
2. Expired Secret Rotation -> Database Auth Failures -> API Downtime
Primary Trigger
Database credentials in a Kubernetes Secret expired.
Propagation Path
- Silent Failure: Secret auto-rotation job failed silently (CronJob misconfigured)
- Cached Credentials: Apps kept using old secret cached by kubelet
- Connection Rejection: Database rejects new connections
- Pool Exhaustion: Connection pool exhaustion -> API pods crash -> 502s on ingress
Impact
- Complete API outage
- Elevated latency
- Multiple alert cascades
- Customer-facing service disruption
Detection Signals
- Database connection errors in application logs
- 502 Bad Gateway responses
- Connection pool exhaustion metrics
- Failed CronJob executions
- Authentication errors in database audit logs
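Commands to observe (a sketch; the CronJob, Secret, label, and namespace names are illustrative assumptions):
# Check whether the rotation CronJob has run recently and whether its jobs succeeded
kubectl get cronjob secret-rotation -n demo-app
kubectl get jobs -n demo-app --sort-by='.status.startTime'
# Check the Secret's age (a stale creation timestamp suggests rotation stopped)
kubectl get secret db-credentials -n demo-app -o jsonpath='{.metadata.creationTimestamp}'
# Look for authentication failures in application logs
kubectl logs -n demo-app -l app=demo-app --tail=50 | grep -iE "auth|password|connection refused"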
Mitigation Steps
- Identify the expired credentials
- Manually rotate secrets if automation failed
- Restart affected pods to pick up new secrets
- Verify database connectivity
- Clear connection pools if necessary
Prevention
- Monitor CronJob execution status
- Implement secret rotation alerts
- Use external secret management (Vault, AWS Secrets Manager)
- Set up pre-expiration warnings
- Test secret rotation in staging environments
3. Node Pressure + HPA Misconfiguration -> Evictions -> Argo Rollback
Primary Trigger
Memory leak + wrong HPA thresholds.
Propagation Path
- Memory Leak: App starts leaking memory
- Scale-Up Block: HPA scale-up blocked (maxReplicas set too low)
- Node Pressure: Node pressure causes Kubernetes to evict pods
- Auto-Rollback: ArgoCD automatically rolls back to the last working version due to failing health checks
- Broken Rollback: The rollback image had a deprecated dependency -> app becomes partially broken
Impact
- Availability degraded
- Stability issues
- Recovery delayed due to Argo rollback loop
- Inconsistent application state
Detection Signals
- Node memory pressure warnings
- Pod eviction events
- HPA unable to scale warnings
- ArgoCD rollback events
- OOMKilled containers
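Commands to observe (a sketch; the HPA and ArgoCD application names are illustrative assumptions):
# Check node conditions for MemoryPressure
kubectl describe nodes | grep -A7 "Conditions:"
# Inspect HPA limits, current utilization, and scaling events
kubectl describe hpa demo-app -n demo-app
# List recent eviction events
kubectl get events -A --field-selector reason=Evicted --sort-by='.lastTimestamp'
# Review ArgoCD rollback history (requires the argocd CLI)
argocd app history demo-app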
Mitigation Steps
- Identify memory leak source
- Correct HPA configuration (minReplicas, maxReplicas, target utilization)
- Add or scale nodes to relieve pressure
- Fix the memory leak in application code
- Deploy patched version through proper GitOps workflow
Prevention
- Set proper resource requests and limits
- Configure HPA with realistic thresholds
- Implement memory profiling and leak detection
- Use Vertical Pod Autoscaler (VPA) for recommendations
- Test autoscaling behavior under load
4. NetworkPolicy Change -> Service Mesh Timeout -> API Chain Failure
Primary Trigger
Updated NetworkPolicy restricts inter-namespace communication.
Propagation Path
- Policy Update: Security engineer tightens network policy to isolate staging from prod namespaces
- Connection Timeout: Service mesh (Istio/Linkerd) sidecars timeout on cross-namespace calls
- Service Isolation: Frontend services can't reach backend auth microservice
- Retry Storm: Retry storms cause cascading latency and load on ingress
Impact
- Partial outage with high error rates
- Customers see random 504 Gateway Timeout errors
- Service dependency failures
- Increased ingress load
Detection Signals
- 504 Gateway Timeout errors
- Service mesh timeout metrics
- NetworkPolicy applied events
- Increased retry attempts in logs
- Distributed tracing showing broken service chains
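Commands to observe (a sketch; the namespaces, service name, port, and ingress controller location are illustrative assumptions):
# List NetworkPolicies, sorted by age, to spot the recent change
kubectl get networkpolicy -A --sort-by='.metadata.creationTimestamp'
# Test the blocked path from a frontend pod (the image must include curl)
kubectl exec -n frontend deploy/frontend -- curl -sS --max-time 5 http://auth.backend.svc.cluster.local:8080/healthz
# Check for 504s at the ingress controller
kubectl logs -n ingress-nginx deploy/ingress-nginx-controller --tail=200 | grep " 504 "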
Mitigation Steps
- Review recent NetworkPolicy changes
- Identify blocked communication paths
- Update NetworkPolicy to allow required traffic
- Verify service mesh configuration
- Test inter-service connectivity
Prevention
- Test NetworkPolicy changes in staging
- Use network policy visualization tools
- Implement gradual rollout of security policies
- Document service dependencies and communication patterns
- Use service mesh observability for impact analysis
5. Misconfigured Autoscaler -> Cost Spike -> Cluster Autoscaler Backoff
Primary Trigger
Wrong HPA target CPU utilization set to 10%.
Propagation Path
- Aggressive Scaling: Autoscaler aggressively scales replicas from 3 -> 200
- Node Explosion: Cluster Autoscaler spins up >100 nodes in AWS/GCP
- Quota Exhaustion: Cost monitoring tool throttles API due to quota exhaustion
- Emergency Shutdown: Budget alert triggers emergency shutdown policy -> cluster scaled down abruptly
Impact
- Production instability
- Massive billing spike
- Delayed reconciliation
- Potential quota limits hit
- Service disruption from emergency shutdown
Detection Signals
- Abnormal replica count increase
- Node count spike
- Cloud provider quota warnings
- Cost anomaly alerts
- API throttling errors
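Commands to observe (a sketch; the Cluster Autoscaler deployment name and namespace are common defaults, not guaranteed):
# Inspect HPA targets and replica counts across the cluster
kubectl get hpa -A
# Track how many nodes have been added
kubectl get nodes --no-headers | wc -l
# Check Cluster Autoscaler status and backoff conditions
kubectl -n kube-system get configmap cluster-autoscaler-status -o yaml
kubectl -n kube-system logs deploy/cluster-autoscaler --tail=100 | grep -iE "backoff|scale|quota"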
Mitigation Steps
- Immediately correct HPA target thresholds
- Manually scale down excess replicas
- Drain and remove unnecessary nodes
- Review and adjust budget policies
- Implement gradual scale-down to avoid disruption
Prevention
- Set realistic HPA metrics and thresholds (typically 70-80% CPU)
- Configure maxReplicas limits
- Implement cost guardrails and alerts
- Use cluster autoscaler limits (min/max nodes)
- Regular autoscaling configuration reviews
6. ArgoCD Image Updater -> Wrong Tag Match -> Rollout Regression
Primary Trigger
Automated image updater matches wrong semantic tag.
Propagation Path
- Wrong Tag Selection: Image updater regex picks v1.2 instead of v1.2.1-hotfix
- Auto-Sync: ArgoCD syncs and deploys the wrong container version
- Missing Configuration: Newly deployed image lacks environment variable introduced in configmap
- Readiness Failure: Application fails readiness probe, rollouts paused
Impact
- Half of pods stuck in Pending/CrashLoopBackOff
- Inconsistent application state
- Degraded service availability
- Potential data inconsistency
Detection Signals
- ArgoCD sync events with unexpected image tags
- Failed readiness probes
- Pod status showing CrashLoopBackOff
- Image tag mismatch in deployment vs expected version
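Commands to observe (a sketch; the deployment, namespace, and ArgoCD application names are illustrative assumptions):
# Check which image tag is actually running
kubectl get deploy demo-app -n demo-app -o jsonpath='{.spec.template.spec.containers[0].image}'
# Compare against what ArgoCD deployed and when (requires the argocd CLI)
argocd app get demo-app
argocd app history demo-app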
Mitigation Steps
- Identify incorrect image tag
- Update ArgoCD application to use correct image
- Fix image updater regex pattern
- Sync to correct version
- Verify all pods are healthy
Prevention
- Use strict semantic versioning patterns
- Implement image tag validation
- Require manual approval for production deployments
- Use immutable tags or SHA digests
- Test image updater patterns in staging
7. Redis Failover -> Connection Leaks -> Node Resource Pressure
Primary Trigger
Redis master node restarted due to zone failure.
Propagation Path
- Redis Failover: Clients failover to replica but connection pool not reinitialized properly
- Connection Leak: Old connections keep retrying -> file descriptor leak
- Resource Pressure: Node memory + FD pressure increases
- Evictions: Kubelet OOMKills non-critical pods -> observability stack degraded
Impact
- Partial observability loss
- Rising latency
- Delayed RCA visibility
- Node instability
- Potential cascade to other services
Detection Signals
- Redis connection errors
- File descriptor exhaustion warnings
- Node resource pressure events
- OOMKilled containers
- Missing metrics/logs from observability stack
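Commands to observe (a sketch; the workload and namespace names are illustrative assumptions):
# Count open file descriptors in an affected pod
kubectl exec -n demo-app deploy/demo-app -- sh -c 'ls /proc/1/fd | wc -l'
# Look for OOM and eviction activity
kubectl get events -A --sort-by='.lastTimestamp' | grep -iE "oom|evict"
# Check node memory headroom (requires metrics-server)
kubectl top nodes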
Mitigation Steps
- Restart affected application pods to reset connection pools
- Fix connection pool initialization logic
- Scale observability stack back up
- Add or scale nodes if needed
- Review Redis client configuration
Prevention
- Implement proper connection pool management
- Use Redis Sentinel or Redis Cluster for HA
- Monitor file descriptor usage
- Set connection pool limits and timeouts
- Test failover scenarios regularly
8. Argo Rollout Canary + Wrong Weighting -> Full Outage
Primary Trigger
Canary weight misconfigured as 100% instead of 10%.
Propagation Path
- Full Traffic Shift: Argo Rollout shifts 100% of traffic to canary
- Schema Incompatibility: Canary connects to new DB schema incompatible with production traffic
- Validation Failures: All API calls fail schema validation
- Slow Rollback: Rollback takes minutes because the controller is stuck waiting for the metrics provider (Prometheus) to sync
Impact
- Complete outage
- Severe customer impact
- Delayed metrics collection
- Extended recovery time
- Data validation errors
Detection Signals
- 100% error rate spike
- Schema validation errors in logs
- Argo Rollout events showing unexpected weights
- Prometheus metrics showing full canary deployment
- Database schema mismatch errors
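Commands to observe (a sketch; requires the kubectl argo rollouts plugin; the Rollout name and namespace are illustrative assumptions):
# Inspect the current canary weight, step, and analysis status
kubectl argo rollouts get rollout demo-app -n demo-app
# Check the configured setWeight steps
kubectl get rollout demo-app -n demo-app -o jsonpath='{.spec.strategy.canary.steps}'
# Abort and return all traffic to the stable ReplicaSet
kubectl argo rollouts abort demo-app -n demo-app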
Mitigation Steps
- Immediately abort rollout
- Manually shift traffic back to stable version
- Rollback canary deployment
- Fix canary weight configuration
- Verify database schema compatibility
Prevention
- Validate rollout configurations before applying
- Use progressive delivery with small initial weights (5-10%)
- Implement automated rollback based on error rates
- Test canary deployments in staging with production-like schemas
- Use analysis templates with short intervals
9. Cloud DNS TTL + Config Drift -> Multi-Cluster Routing Blackhole
Primary Trigger
Ingress IP change in cluster A not propagated to cluster B.
Propagation Path
- IP Change: Multi-cluster setup with GSLB or ExternalDNS
- DNS Update Failure: DNS record TTL 10m; update fails due to expired cloud provider credentials
- Stale Routing: Clients route to the old IP (a node pool that no longer exists)
- No Failover: Failover policy not triggered due to missing health checks
Impact
- Random region unavailability
- Partial outage with difficult traceability
- Intermittent connectivity issues
- Customer experience degradation
Detection Signals
- DNS resolution to incorrect IPs
- Connection timeout to specific regions
- ExternalDNS errors in logs
- Cloud provider authentication failures
- Health check failures not triggering failover
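Commands to observe (a sketch; the hostname and ExternalDNS deployment location are illustrative assumptions):
# Compare what DNS returns with the current ingress IP
dig +short app.example.com
kubectl get ingress -A -o wide
# Check ExternalDNS for credential or record-update errors
kubectl logs -n external-dns deploy/external-dns --tail=200 | grep -iE "error|denied|credential"
# Confirm the TTL clients are seeing
dig app.example.com | grep -A2 "ANSWER SECTION"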
Mitigation Steps
- Update cloud provider credentials
- Manually update DNS records
- Verify health check configuration
- Lower DNS TTL temporarily for faster propagation
- Test failover mechanisms
Prevention
- Monitor ExternalDNS operation status
- Use shorter DNS TTLs (1-5 minutes)
- Implement credential rotation automation
- Configure proper health checks for GSLB
- Test multi-cluster failover regularly
10. Throttled API Rate Limits -> Prometheus Scrape Failures -> HPA Misfires
Primary Trigger
Prometheus scraping throttled by kube-apiserver rate limits.
Propagation Path
- Rate Limiting: API rate limit exceeded due to surge in metric queries
- Missed Scrapes: Prometheus misses several scrape intervals
- Missing Metrics: HPA based on those metrics sees no load -> scales down pods incorrectly
- Cascading Failure: Underprovisioned app starts dropping requests; latency alerts trigger too late
Impact
- Latency spike
- Error rate increase
- Incorrect scaling decisions
- Missing metrics for monitoring
- Delayed incident detection
Detection Signals
- Prometheus scrape failure errors
- Kube-apiserver throttling logs
- HPA showing stale/missing metrics
- Unexpected scale-down events
- 429 Too Many Requests errors from API server
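Commands to observe (a sketch; the HPA name and Prometheus StatefulSet location are kube-prometheus-style defaults, not guaranteed):
# Check whether the HPA can still fetch metrics
kubectl describe hpa demo-app -n demo-app | grep -iE "unable|FailedGetResourceMetric"
# Look for throttling and timeouts in Prometheus logs
kubectl logs -n monitoring statefulset/prometheus-k8s --tail=200 | grep -iE "429|throttl|context deadline"
# Inspect apiserver saturation from its metrics endpoint
kubectl get --raw /metrics | grep apiserver_current_inflight_requests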
Mitigation Steps
- Increase kube-apiserver rate limits
- Reduce Prometheus scrape frequency or cardinality
- Manually scale up affected workloads
- Add Prometheus federation or sharding
- Review metric collection efficiency
Prevention
- Monitor API server request rates
- Optimize Prometheus metric collection
- Use metric relabeling to reduce cardinality
- Implement Prometheus sharding for large clusters
- Set appropriate HPA evaluation intervals
- Use custom metrics from external sources
11. ArgoCD Drift -> Secret Mismatch -> DB Connection Leak -> Node Pressure -> Prometheus Throttle -> Alert Delays
Primary Trigger
Manual patch to Deployment in the cluster bypassed ArgoCD sync.
Propagation Path
- ArgoCD Drift: Manual hotfix introduces a config mismatch (DB_PASSWORD changed in Secret)
- Secret Mismatch: App restarts, can't connect to DB (stale connection credentials)
- DB Connection Leak: Connection pool retries infinitely -> Postgres starts refusing new connections
- Node Pressure: App pods consume CPU/memory on retry loop -> Kubelet starts OOM killing other pods
- Prometheus Scrapes Fail: Kubelet metrics endpoint throttled; /metrics returns 500s
- Alertmanager Delay: Alert thresholds missed; high latency alerts arrive 15 min late
Impact
- Prometheus dashboards show stale data
- Delayed alerting and incident detection
- Database instability and connection exhaustion
- Application latency spike
- False sense of cluster health
- Node resource exhaustion affecting multiple workloads
Detection Signals
- ArgoCD drift warnings
- Database connection pool exhaustion errors
- Application authentication failures
- OOMKilled containers
- Node resource pressure events
- Prometheus scrape failures
- Alert delivery delays in Alertmanager
- Kubelet /metrics endpoint errors
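Commands to observe (a sketch; the ArgoCD application, database pod, credentials, and Alertmanager URL are illustrative assumptions):
# Check for drift between Git and the live state (requires the argocd CLI)
argocd app diff demo-app
# Count active Postgres connections
kubectl exec -n db statefulset/postgres -- psql -U postgres -c "SELECT count(*) FROM pg_stat_activity;"
# Verify a node's kubelet metrics endpoint still responds (substitute a real node name)
kubectl get --raw "/api/v1/nodes/<node-name>/proxy/metrics" | head -n 5
# Check Alertmanager for alerts stuck in the queue (requires amtool)
amtool alert query --alertmanager.url=http://alertmanager.monitoring:9093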
Mitigation Steps
- Identify and revert manual changes in cluster
- Sync proper configuration through ArgoCD
- Update Secret with correct database credentials
- Restart affected application pods to reset connection pools
- Scale up nodes if resource pressure persists
- Verify Prometheus scrape health
- Review and flush Alertmanager queue
Prevention
- Enforce GitOps-only workflows with admission controllers (OPA/Kyverno)
- Enable ArgoCD drift detection with automated notifications
- Implement proper Secret rotation workflows
- Configure connection pool limits and timeouts
- Set appropriate resource requests/limits
- Monitor Prometheus scrape success rates
- Use PodDisruptionBudgets to protect critical workloads
- Implement chaos engineering to test cascading failure scenarios
12. Misconfigured HPA -> Cost Spike -> Cluster Autoscaler -> Throttled API -> ArgoCD Sync Failure -> Alertmanager Storm
Primary Trigger
HPA target CPU utilization incorrectly set to 5%.
Propagation Path
- HPA Misfires: Pods scale from 3 -> 500 in 10 min due to low CPU threshold
- Cluster Autoscaler Expansion: Adds 100+ nodes in AWS/GCP to accommodate pods
- Cloud Billing Surge: Cost monitoring agent hits cloud API rate limit
- K8s API Throttled: Controller-manager and ArgoCD syncs fail due to QPS throttling
- ArgoCD Drift Detected: Sync status shows "Unknown," leading to partial rollouts
- Alertmanager Storm: Every HPA, cost, and Argo alert fires simultaneously
Impact
- Massive overnight billing spike ($10K+)
- Alertmanager overload and alert fatigue
- ArgoCD unable to maintain desired state
- API server performance degradation
- Engineers silencing alerts without identifying root cause
- Production instability from partial deployments
- Cloud provider quota exhaustion
Detection Signals
- Abnormal replica count increase (3 -> 500+)
- Node count explosion (100+ nodes)
- Cloud API throttling errors (429 responses)
- Kube-apiserver high latency and QPS throttling
- ArgoCD sync failures and "Unknown" status
- Cost anomaly alerts
- Alert storm in Alertmanager (hundreds of firing alerts)
- HPA events showing aggressive scaling
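Commands to observe (a sketch; the ArgoCD controller and Alertmanager locations are common install defaults, not guaranteed):
# Confirm the HPA CPU targets and current replica counts
kubectl get hpa -A
# Track node growth
kubectl get nodes --no-headers | wc -l
# Check for client-side throttling in the ArgoCD application controller logs
kubectl logs -n argocd statefulset/argocd-application-controller --tail=200 | grep -iE "throttl|rate limit"
# Count currently firing alerts (requires amtool)
amtool alert query --alertmanager.url=http://alertmanager.monitoring:9093 | wc -l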
Mitigation Steps
- Immediately correct HPA target threshold to realistic value (70-80%)
- Manually scale down excess replicas
- Drain and remove unnecessary nodes gradually
- Increase kube-apiserver QPS limits temporarily
- Pause ArgoCD auto-sync until API stability restored
- Clear Alertmanager alert queue
- Review and adjust cost monitoring thresholds
- Implement emergency budget controls
Prevention
- Set realistic HPA metrics (typically 70-80% CPU utilization)
- Configure maxReplicas limits on all HPAs
- Implement HPA configuration validation in CI/CD
- Use cluster autoscaler limits (min/max nodes per node group)
- Set up cost guardrails and budget alerts
- Monitor kube-apiserver request rates and QPS
- Implement rate limiting on cost monitoring tools
- Regular review of autoscaling configurations
- Test autoscaling behavior in staging environments
- Use policy engines to validate HPA configurations
13. NetworkPolicy Restriction -> Service Mesh Retry Storm -> DB Saturation -> Prometheus Lag -> Alert Suppression -> Partial Outage
Primary Trigger
Security team tightened NetworkPolicy to block cross-namespace traffic.
Propagation Path
- NetworkPolicy Restriction: API gateway pods can't reach auth service in different namespace
- Istio Sidecars Retries: Each failed call retries 5x, flooding service mesh with traffic
- DB Saturation: Auth service database hit by redundant requests; connection pool full
- Prometheus Metrics Lag: /metrics endpoint times out; scraping delayed 1-2m
- Alert Suppression: Alertmanager rules rely on for: 2m thresholds -> no alert fires in time
- Partial Outage: Users intermittently unable to log in, while monitoring appears "green"
Impact
- High real-user impact with minimal alerting
- Intermittent authentication failures
- Database connection pool exhaustion
- Service mesh traffic explosion
- Observability blind spot
- Extended MTTR due to delayed detection
- Customer-facing login issues
Detection Signals
- NetworkPolicy applied events
- Service mesh timeout errors
- Increased retry attempts in sidecar logs
- Database connection pool exhaustion
- Auth service elevated error rates
- Prometheus scrape timeout warnings
- Distributed tracing showing broken service chains
- User-reported login issues before alerts fire
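Commands to observe (a sketch; the namespaces, workload names, and database credentials are illustrative assumptions):
# Review the NetworkPolicies that now apply to the auth namespace
kubectl describe networkpolicy -n auth
# Check the Envoy sidecar for retries and upstream failures
kubectl logs -n gateway deploy/api-gateway -c istio-proxy --tail=200 | grep -iE "retry|503|UF|URX"
# Check auth database connection saturation
kubectl exec -n auth statefulset/auth-db -- psql -U postgres -c "SELECT count(*) FROM pg_stat_activity;"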
Mitigation Steps
- Review recent NetworkPolicy changes
- Identify blocked communication paths using service mesh observability
- Update NetworkPolicy to allow required cross-namespace traffic
- Restart affected sidecars to clear retry queues
- Scale auth service and database if needed
- Verify Prometheus scrape health recovery
- Review and adjust alert timing thresholds
Prevention
- Test NetworkPolicy changes in staging first
- Use network policy visualization tools (e.g., Cilium Hubble)
- Implement gradual rollout of security policies
- Document service dependencies and communication patterns
- Configure appropriate retry budgets in service mesh
- Set connection pool limits and circuit breakers
- Lower alert evaluation intervals for critical services
- Use synthetic monitoring to detect issues before users
- Implement real-user monitoring (RUM)
- Create NetworkPolicy templates validated by CI/CD
14. Postgres Schema Drift -> ORM Migration Failure -> API Crash -> Prometheus Missing Series -> Alert Flap -> Argo Rollback Loop
Primary Trigger
Database schema manually altered (added nullable constraint outside migration control).
Propagation Path
- Schema Drift: DB table altered outside migration workflow
- ORM Migration Fails: Next app deploy from ArgoCD fails startup migration check
- API Crash: App enters CrashLoopBackOff; readiness probe fails
- Prometheus Missing Time-Series: Metrics labels for API latency disappear from TSDB
- Alertmanager Flap: Missing metrics cause alerts to resolve/fire intermittently
- ArgoCD Rollback Loop: Argo repeatedly rolls back + redeploys same version due to failing probes
Impact
- Endless rollout loop preventing recovery
- Alert flapping causing confusion
- API service unavailability
- No clear root cause in monitoring
- Development team blames CI/CD pipeline
- Database schema inconsistency
- Extended outage duration
Detection Signals
- Database migration failure errors in application logs
- CrashLoopBackOff pod status
- ArgoCD sync/rollback events in rapid succession
- Missing Prometheus time-series (metrics disappearing)
- Alert state flapping (firing -> resolved -> firing)
- Readiness probe failures
- Schema validation errors
- Database audit logs showing manual schema changes
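Commands to observe (a sketch; the labels, ArgoCD application, database pod, and table name are illustrative assumptions):
# Check application logs for the failing migration
kubectl logs -n demo-app -l app=demo-app --tail=100 | grep -i migrat
# See how often ArgoCD has rolled back and redeployed (requires the argocd CLI)
argocd app history demo-app
# Inspect the live table definition to spot the manual change
kubectl exec -n db statefulset/postgres -- psql -U postgres -d app -c "\d+ users"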
Mitigation Steps
- Identify schema drift in database
- Manually revert unauthorized schema changes or apply proper migration
- Pause ArgoCD auto-sync to break rollback loop
- Fix application migration scripts to handle current schema state
- Deploy corrected version manually
- Verify Prometheus metrics recovery
- Resume ArgoCD auto-sync after stability confirmed
- Implement schema change detection
Prevention
- Enforce database schema change controls
- Use migration tools exclusively (Flyway, Liquibase, Alembic)
- Implement database change approval workflows
- Enable database audit logging
- Test migrations in staging environments
- Use schema validation in CI/CD pipeline
- Configure migration failure alerts
- Implement database GitOps workflows
- Use read-only database users for application runtime
- Add pre-deployment schema compatibility checks
- Disable ArgoCD auto-rollback for database-dependent services
15. Prometheus High Cardinality -> TSDB Corruption -> Metrics Drop -> Alert Delay -> Argo Rollout Overshoot -> DB Overload
Primary Trigger
Unbounded metric labels from dynamic pod names or user IDs.
Propagation Path
- High-Cardinality Metric: /metrics includes user_id label -> millions of time-series created
- Prometheus TSDB Corruption: WAL write queue overflows, partial block compaction fails
- Metrics Drop: CPU/memory metrics become stale; HPA scales incorrectly based on old data
- Alert Delay: Alertmanager backlog increases; firing delayed by 10+ min
- Argo Rollout Overshoot: Rollout controller reads outdated metrics, increases canary weight to 100%
- DB Overload: New app version hits DB with schema bug; DB CPU pegged at 100%
Impact
- Monitoring system silent during critical failure
- Massive query load on database
- Complete canary rollout of buggy version
- False positives after recovery
- TSDB storage exhaustion
- Incorrect autoscaling decisions
- Extended outage with delayed detection
Detection Signals
- Prometheus TSDB corruption errors
- WAL write failures in Prometheus logs
- Metrics cardinality explosion (series count spike)
- Prometheus storage usage spike
- Stale metrics in queries
- HPA showing outdated metric values
- Argo Rollout events showing unexpected weight changes
- Database CPU saturation
- Alert delivery delays
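Commands to observe (a sketch; the Prometheus URL, StatefulSet location, and Rollout name are illustrative assumptions):
# List the metric names with the most series to find the cardinality offender
curl -s http://prometheus.monitoring:9090/api/v1/status/tsdb | jq '.data.seriesCountByMetricName[:10]'
# Look for WAL and compaction errors in Prometheus logs
kubectl logs -n monitoring statefulset/prometheus-k8s --tail=200 | grep -iE "wal|compact|corrupt"
# Check the rollout's current weight (requires the kubectl argo rollouts plugin)
kubectl argo rollouts get rollout demo-app -n demo-app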
Mitigation Steps
- Identify and remove high-cardinality metrics
- Restart Prometheus to clear WAL corruption
- Implement metric relabeling to drop problematic labels
- Manually rollback Argo Rollout to stable version
- Scale database to handle load
- Fix application code generating high-cardinality metrics
- Clear Alertmanager backlog
- Verify metrics collection recovery
Prevention
- Monitor Prometheus cardinality and series count
- Implement metric relabeling rules to limit cardinality
- Set cardinality limits in instrumentation libraries
- Use metric label allowlists
- Regular Prometheus performance reviews
- Implement TSDB storage monitoring and alerts
- Use metric naming conventions avoiding dynamic labels
- Configure Prometheus retention policies
- Implement progressive delivery safeguards (small initial canary weights)
- Add Argo Rollout analysis templates with strict thresholds
- Use database connection pooling and query timeouts
- Test new instrumentation in staging first
16. Kube-API Slowdown -> Prometheus Scrape Failures -> Alert Silencing -> Cost Anomaly -> Cluster Node Eviction -> App Downtime
Primary Trigger
etcd I/O latency spike due to cloud disk throttling.
Propagation Path
- etcd Latency: API responses slow down; kube-apiserver call latency >2s
- Prometheus Scrape Failures: Kube-state-metrics times out; missing pod/node metrics
- Alert Silencing: Alertmanager silences triggered alerts since data incomplete
- Autoscaler Fails: HPA sees "no metrics," scales pods down to minReplicas
- Cost Anomaly Detector: Cost monitoring agent retries repeatedly -> hits cloud API quota
- Node Eviction Chaos: Node pressure rises; pods evicted mid-transaction; service downtime
Impact
- Incident triggered by infrastructure bottleneck
- Misdiagnosed as HPA regression
- Metrics and alerts unreliable throughout incident
- Inappropriate scale-down during high load
- Cloud API quota exhaustion
- Pod evictions causing data loss
- Service disruption with poor visibility
Detection Signals
- etcd high latency warnings (fsync duration >100ms)
- Kube-apiserver slow request logs
- Prometheus scrape timeout errors
- Kube-state-metrics unavailability
- HPA showing "unable to fetch metrics"
- Alertmanager silence events
- Unexpected scale-down events
- Cloud API throttling (429 errors)
- Node pressure events
- Pod eviction events
- Disk I/O throttling metrics
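Commands to observe (a sketch; the label selector and Alertmanager URL assume a kubeadm-style control plane and a default monitoring install):
# Check etcd request latency as seen by the apiserver
kubectl get --raw /metrics | grep etcd_request_duration_seconds_count | head
# Look for slow-request traces in the apiserver logs
kubectl logs -n kube-system -l component=kube-apiserver --tail=200 | grep -i trace
# Confirm which HPAs have lost their metrics
kubectl describe hpa -A | grep -iE "unable|FailedGetResourceMetric"
# List active silences that may be hiding alerts (requires amtool)
amtool silence query --alertmanager.url=http://alertmanager.monitoring:9093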
Mitigation Steps
- Identify etcd disk I/O bottleneck
- Increase etcd disk IOPS or migrate to faster storage
- Reduce kube-apiserver load (throttle clients if needed)
- Manually scale up underprovisioned workloads
- Disable cost monitoring agent temporarily
- Add or scale nodes to relieve pressure
- Verify Prometheus scrape recovery
- Review and restore alerts from silence
- Implement etcd performance tuning
Prevention
- Monitor etcd performance metrics (disk latency, fsync duration)
- Use high-performance storage for etcd (SSD, provisioned IOPS)
- Implement etcd disk I/O alerts
- Optimize kube-apiserver configuration and rate limits
- Use Prometheus federation or sharding to reduce API load
- Configure appropriate HPA evaluation intervals and fallback behavior
- Implement rate limiting on cost monitoring tools
- Set PodDisruptionBudgets to prevent excessive evictions
- Use node affinity to isolate critical workloads
- Regular etcd performance testing and capacity planning
- Monitor cloud resource quotas
- Implement graceful degradation for metrics unavailability
Bonus: Combined Multi-Layer Scenarios
For advanced RCA training, combine multiple scenarios to simulate complex, cascading failures:
Scenario A: ConfigMap Drift + NetworkPolicy + Autoscaler Misconfig
Trigger Chain:
- Stale ConfigMap causes app instability
- NetworkPolicy change blocks service communication during troubleshooting
- Misconfigured HPA scales up aggressively trying to handle errors
- Cost spike triggers emergency response while debugging
Complexity: Multiple teams involved (Dev, Security, Platform), competing priorities, cost pressure
Scenario B: Secret Expiry + ArgoCD Sync Delay + DB Connection Exhaustion
Trigger Chain:
- Database credentials expire silently
- ArgoCD sync delayed due to configuration drift
- Apps exhaust connection pools trying to reconnect
- New deployments fail due to inability to verify DB connectivity
Complexity: Authentication layer, GitOps workflow, connection management, deployment pipeline
Scenario C: DNS TTL + Prometheus Throttling + Canary Rollout Failure
Trigger Chain:
- DNS propagation delay causes traffic routing issues
- Prometheus throttled by API server, missing metrics
- Canary rollout proceeds without proper metrics validation
- Faulty canary deployed to 100% traffic before metrics appear
Complexity: Multi-cluster, observability gaps, progressive delivery, timing dependencies
Using These Scenarios
For Training
- Walk through each propagation path step-by-step
- Identify detection points and signals
- Practice incident response procedures
- Document lessons learned
For Testing
- Implement chaos engineering experiments
- Test monitoring and alerting coverage
- Validate incident response playbooks
- Verify automated remediation
For RCA Practice
- Present scenarios without showing propagation paths
- Practice hypothesis formation and testing
- Use distributed tracing and logging to piece together timeline
- Identify multiple contributing factors
For Prevention
- Review scenarios against current architecture
- Identify gaps in observability and automation
- Implement guardrails and policy enforcement
- Regular game day exercises
Contributing
To add new scenarios or improve existing ones:
- Follow the established format (Trigger -> Propagation -> Impact -> Detection -> Mitigation -> Prevention)
- Base scenarios on real incidents or realistic failure modes
- Include specific tools and technologies
- Provide actionable detection signals and mitigation steps
Additional Resources
- Kubernetes Troubleshooting Guide
- ArgoCD Documentation
- Prometheus Best Practices
- SRE Workbook - Google
- Chaos Engineering Principles
Last Updated: 2025-11-05
Maintainer: SRE Team
Version: 1.1.0