Running Scenarios
This guide explains how to execute SRE-bench scenarios and what to expect during each run.
Prerequisites
Before running scenarios, ensure you have:
- kubectl - Kubernetes command-line tool
- kind - Kubernetes in Docker (for creating test clusters)
- Docker - Container runtime (required by Kind)
- Git - For cloning the repository
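A quick way to verify all of these are installed is to probe each tool with `command -v`; a minimal sketch (the tool list is just an example):

```shell
# check_tools: report whether each named tool is on the PATH
check_tools() {
  missing=0
  for tool in "$@"; do
    if command -v "$tool" >/dev/null 2>&1; then
      echo "✓ $tool found"
    else
      echo "✗ $tool missing"
      missing=1
    fi
  done
  return "$missing"
}

# Example:
# check_tools kubectl kind docker git
```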
Installing Prerequisites
macOS:
# Install Homebrew if not already installed
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
# Install tools
brew install kubectl kind
Linux:
# kubectl
curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl"
sudo install -o root -g root -m 0755 kubectl /usr/local/bin/kubectl
# kind
curl -Lo ./kind https://kind.sigs.k8s.io/dl/v0.20.0/kind-linux-amd64
chmod +x ./kind
sudo mv ./kind /usr/local/bin/kind
Windows (PowerShell):
# Using Chocolatey
choco install kubernetes-cli kind
# Or using Scoop
scoop install kubectl kind
Quick Start
The simplest way to run a scenario:
# Clone the repository
git clone https://github.com/siddhantprateek/SRE-bench.git
cd SRE-bench
# Make script executable (if needed)
chmod +x scripts/1_scenerio.sh
# Run scenario 1
./scripts/1_scenerio.sh
This will:
- Create a new Kind cluster named scenario-1-cluster
- Install ArgoCD
- Deploy the application
- Trigger the ConfigMap drift failure
- Show you the cascading failure
- Offer to clean up the cluster
Execution Modes
Each scenario script supports multiple execution modes:
Mode 1: New Kind Cluster (Default)
Creates a fresh Kind cluster for the scenario:
./scripts/1_scenerio.sh
When to use:
- First time running the scenario
- Want complete isolation
- Testing from scratch
- Developing/debugging scenarios
Cleanup: The script will offer to delete the cluster at the end.
Mode 2: Existing Cluster
Use your existing Kubernetes cluster:
./scripts/1_scenerio.sh --cluster
When to use:
- Testing on existing infrastructure
- Running multiple scenarios sequentially
- Using managed Kubernetes (EKS, GKE, AKS)
- Want to preserve the cluster for investigation
Cleanup: The script will offer to delete only the namespace.
Mode 3: Custom Kubeconfig
Target a specific cluster with custom kubeconfig:
./scripts/1_scenerio.sh --cluster --kubeconfig ~/.kube/my-cluster-config
When to use:
- Multiple kubeconfig files
- Testing on remote clusters
- CI/CD environments
- Multi-cluster setups
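The scripts' actual argument handling isn't reproduced here, but a typical POSIX sketch of parsing `--cluster` and `--kubeconfig` looks roughly like this (variable names are assumptions, not the repository's code):

```shell
# Hypothetical sketch of scenario-script flag parsing
USE_EXISTING_CLUSTER=false
KUBECONFIG_PATH=""

parse_args() {
  while [ $# -gt 0 ]; do
    case "$1" in
      --cluster) USE_EXISTING_CLUSTER=true ;;
      --kubeconfig) KUBECONFIG_PATH="$2"; shift ;;
      *) echo "Unknown flag: $1" >&2; return 1 ;;
    esac
    shift
  done
}
```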
Available Scenarios
Here's a quick reference of all scenarios:
| Script | Scenario | Components | Duration |
|---|---|---|---|
| 1_scenerio.sh | ConfigMap Drift → CrashLoopBackOff | ArgoCD | ~5 min |
| 2_scenerio.sh | Secret Rotation → Database Auth Failure | None | ~4 min |
| 3_scenerio.sh | Node Pressure + HPA → Evictions | ArgoCD, Metrics Server | ~8 min |
| 4_scenerio.sh | NetworkPolicy → Service Mesh Timeout | ArgoCD | ~6 min |
| 5_scenerio.sh | Autoscaler Cost Spike | ArgoCD, Metrics Server | ~7 min |
| 6_scenerio.sh | Image Updater Wrong Tag | ArgoCD | ~5 min |
| 7_scenerio.sh | Redis Failover → Connection Leaks | None | ~6 min |
| 8_scenerio.sh | Argo Rollout Canary Misconfiguration | ArgoCD, Argo Rollouts | ~8 min |
| 10_scenerio.sh | API Rate Limit → HPA Misfire | Metrics Server | ~6 min |
Step-by-Step Walkthrough
Let's walk through running Scenario 1 in detail:
Step 1: Start the Scenario
./scripts/1_scenerio.sh
You'll see:
========================================
Scenario 1: ConfigMap Drift
========================================
========================================
Checking Prerequisites
========================================
✓ All prerequisites satisfied
Step 2: Cluster Creation
The script creates a Kind cluster:
========================================
Creating Kind Cluster
========================================
Creating cluster "scenario-1-cluster" ...
✓ Ensuring node image (kindest/node:v1.27.3)
✓ Preparing nodes
✓ Writing configuration
✓ Starting control-plane
✓ Installing CNI
✓ Installing StorageClass
✓ Cluster created successfully
Step 3: Component Installation
ArgoCD is installed automatically:
========================================
Installing ArgoCD
========================================
ℹ Waiting for ArgoCD to be ready...
deployment.apps/argocd-server condition met
✓ ArgoCD installed successfully
Step 4: Initial Deployment
The stable application is deployed:
========================================
Deploying Application via ArgoCD
========================================
application.argoproj.io/demo-app created
ℹ Waiting for ArgoCD to sync...
✓ Application deployed via ArgoCD
Step 5: Observe Stable State
The script shows the working application:
========================================
Observing Initial State
========================================
ℹ Application Status:
NAME SYNC STATUS HEALTH STATUS
demo-app Synced Healthy
ℹ Pod Status:
NAME READY STATUS RESTARTS AGE
demo-app-7d4b8c9f5d-abcde 1/1 Running 0 30s
demo-app-7d4b8c9f5d-fghij 1/1 Running 0 30s
Step 6: Interactive Pause
Press Enter to apply hotfix (this will create drift)...
Press Enter to continue and trigger the failure.
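The pause itself is just a blocking read on stdin; a minimal sketch of how such a prompt might be implemented (not necessarily the script's exact code):

```shell
# pause: print a prompt and wait for the user to press Enter
pause() {
  printf '%s' "${1:-Press Enter to continue...}"
  read -r _ || true
}
```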
Step 7: Failure Demonstration
The script shows the cascading failure:
========================================
Demonstrating Drift and Rollback
========================================
⚠ Hotfix applied directly to cluster (bypassing GitOps)
✓ ConfigMap updated with new.feature key
ℹ ArgoCD detecting drift in 30 seconds...
⚠ DRIFT DETECTED
⚠ ArgoCD will auto-sync and remove the hotfix!
ℹ Watching pod status during rollback...
NAME READY STATUS RESTARTS AGE
demo-app-7d4b8c9f5d-abcde 0/1 CreateContainerConfigError 0 2m
demo-app-7d4b8c9f5d-fghij 0/1 CrashLoopBackOff 1 2m
✗ Pods are now crashing!
✗ Missing required environment variable: new.feature
Step 8: Impact Summary
The script shows the incident details:
========================================
Impact Summary
========================================
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
PRODUCTION INCIDENT
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Severity: P1 - Service Degradation
Error Rate: 100% (all pods failing)
Root Cause: ConfigMap drift - manual hotfix reverted by ArgoCD
Trigger: Engineer applied config change bypassing GitOps
Timeline:
T+0s: Manual hotfix applied to ConfigMap
T+30s: Pods start using new configuration
T+60s: ArgoCD detects drift
T+90s: ArgoCD auto-sync reverts ConfigMap
T+120s: Pods restart and fail (missing new.feature)
Detection Signals:
• CrashLoopBackOff pod status
• ArgoCD drift warnings
• Application health checks failing
• Missing environment variable errors in logs
Prevention:
1. Always commit changes to Git first
2. Use ArgoCD self-heal carefully
3. Set up drift alerts
4. Document emergency hotfix procedures
5. Use ConfigMap validation
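Prevention item 3 (drift alerts) can be approximated even without ArgoCD by diffing the Git copy of a manifest against a live export; a minimal sketch using plain `diff` (file paths are placeholders, and in practice the live copy would come from something like `kubectl get configmap app-config -n demo-app -o yaml`):

```shell
# detect_drift: compare the desired (Git) and live (cluster) manifests
detect_drift() {
  desired="$1"
  live="$2"
  if diff -u "$desired" "$live" >/dev/null; then
    echo "in sync"
  else
    echo "DRIFT DETECTED"
    return 1
  fi
}
```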
Step 9: Cleanup
========================================
Cleanup
========================================
Delete cluster scenario-1-cluster? (y/N)
Type y to delete the cluster or N to keep it for investigation.
Observing Scenarios
While scenarios run, you can open additional terminals to observe:
Watch Pods
# In another terminal
kind export kubeconfig --name scenario-1-cluster
watch kubectl get pods -A
View Logs
# Watch application logs
kubectl logs -n demo-app -l app=demo-app -f
# Watch ArgoCD application status
kubectl get applications -n argocd -w
Check ArgoCD UI
# Port-forward ArgoCD server
kubectl port-forward svc/argocd-server -n argocd 8080:443
# Get admin password
kubectl -n argocd get secret argocd-initial-admin-secret \
-o jsonpath="{.data.password}" | base64 -d
# Open browser to https://localhost:8080
# Username: admin
# Password: (from command above)
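When watching a rollback (or any resource settling), a generic retry helper avoids hand-timed sleeps; a sketch (the kubectl command in the trailing comment is illustrative):

```shell
# wait_for: retry a command until it succeeds or attempts run out
wait_for() {
  attempts="$1"; shift
  i=0
  while [ "$i" -lt "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    i=$((i + 1))
    sleep 1
  done
  echo "timed out after $attempts attempts" >&2
  return 1
}

# Example (illustrative):
# wait_for 30 kubectl wait pod -n demo-app -l app=demo-app --for=condition=Ready --timeout=1s
```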
Troubleshooting
Scenario Fails to Start
Issue: kind: command not found
Solution:
# Install Kind
brew install kind # macOS
# or follow installation instructions for your OS
Issue: Docker daemon not running
Solution:
# Start Docker Desktop (macOS/Windows)
# or start Docker daemon (Linux)
sudo systemctl start docker
Cluster Creation Hangs
Issue: Cluster creation stuck at "Preparing nodes"
Solution:
# Delete and retry
kind delete cluster --name scenario-1-cluster
./scripts/1_scenerio.sh
ArgoCD Installation Times Out
Issue: waiting for ArgoCD to be ready... timeout
Solution:
# Check pod status
kubectl get pods -n argocd
# If pods are pending, check node resources
kubectl describe nodes
# Restart the scenario with more resources
kind delete cluster --name scenario-1-cluster
# Edit scripts/kind.yaml to increase resources if needed
./scripts/1_scenerio.sh
Port Conflicts
Issue: port is already allocated
Solution:
# Find and kill process using the port
lsof -ti:8080 | xargs kill -9
# Or use a different port
kubectl port-forward svc/argocd-server -n argocd 8888:443
Running Multiple Scenarios
You can run scenarios sequentially on the same cluster:
# Create a persistent cluster
kind create cluster --name sre-bench-cluster
# Run scenarios using the existing cluster
./scripts/1_scenerio.sh --cluster
./scripts/3_scenerio.sh --cluster
./scripts/4_scenerio.sh --cluster
# Each scenario will:
# - Use the existing cluster
# - Install any missing components
# - Create its own namespace
# - Clean up its namespace when done
# Delete the cluster when finished
kind delete cluster --name sre-bench-cluster
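The sequential runs above can be wrapped in a loop that stops on the first failure; a sketch (pass the script paths from the repository):

```shell
# run_scenarios: run each given script with --cluster, stop on first failure
run_scenarios() {
  for script in "$@"; do
    echo "=== Running $script ==="
    if ! "$script" --cluster; then
      echo "✗ $script failed" >&2
      return 1
    fi
  done
  echo "✓ all scenarios passed"
}

# Example:
# run_scenarios ./scripts/1_scenerio.sh ./scripts/3_scenerio.sh ./scripts/4_scenerio.sh
```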
CI/CD Integration
Scenarios can be used in automated testing:
# .github/workflows/test-scenarios.yml
name: Test SRE Scenarios
on: [push, pull_request]
jobs:
  test-scenarios:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Install Kind
        run: |
          curl -Lo ./kind https://kind.sigs.k8s.io/dl/v0.20.0/kind-linux-amd64
          chmod +x ./kind
          sudo mv ./kind /usr/local/bin/kind
      - name: Run Scenario 1
        run: ./scripts/1_scenerio.sh
      - name: Run Scenario 7
        run: ./scripts/7_scenerio.sh
Agent Testing
To test an autonomous agent against scenarios:
# Start scenario but don't trigger failure yet
# (modify script to pause before failure injection)
# Let your agent diagnose and remediate
your-agent diagnose --namespace demo-app
# Compare agent actions against expected remediation
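One simple way to do that comparison is to diff the agent's action log against a file of expected remediation steps; a sketch (file names and contents are hypothetical):

```shell
# grade_agent: pass only if every expected action appears in the agent's log
grade_agent() {
  expected="$1"
  actual="$2"
  while IFS= read -r action; do
    if ! grep -qF "$action" "$actual"; then
      echo "missing action: $action"
      return 1
    fi
  done < "$expected"
  echo "all expected actions performed"
}

# Example:
# grade_agent expected-remediation.txt agent-actions.log
```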
Next Steps
- Architecture Overview - Understand how scenarios are structured
- Contributing - Create your own scenarios
- Scenario Details - Deep dive into each scenario