Contributing to SRE-bench
Thank you for your interest in contributing to SRE-bench! This guide will help you create and submit new scenarios to the benchmark.
Why Contribute?
- Share Knowledge - Help others learn from real-world incidents you've experienced
- Improve Agent Testing - Add diverse scenarios to make agent benchmarks more comprehensive
- Build Community - Collaborate with SRE and AI practitioners
- Test Your Own Agents - Create scenarios that reflect your specific use cases
What Makes a Good Scenario?
A good SRE-bench scenario should be:
- Realistic - Based on actual production incidents or common failure patterns
- Reproducible - Runs consistently across different environments
- Educational - Teaches something valuable about Kubernetes or SRE practices
- Observable - Has clear detection signals and symptoms
- Cascading - Shows how a single issue can propagate through the system
- Remediable - Has a clear path to resolution
Scenario Template
Each scenario needs four components:
- Kubernetes Manifests - Resources that reproduce the failure
- Executable Script - Bash script that orchestrates the scenario
- Documentation - Detailed description in the scenario README
- Testing - Verification that it works in both new and existing clusters
Step-by-Step Guide
Step 1: Choose Your Scenario
Think about:
- What real-world incidents have you experienced?
- What failure modes are commonly misunderstood?
- What agent capabilities do you want to test?
Example Scenarios:
- PersistentVolume storage exhaustion causing pod evictions
- Ingress misconfiguration causing routing loops
- Service mesh mTLS cert expiration causing connection failures
- etcd performance degradation causing API server slowness
Step 2: Create Manifest Directory
# Create directory for your scenario
mkdir -p manifests/scenario-N/
# Where N is the next available scenario number
# Check existing scenarios first:
ls manifests/
Step 3: Write Kubernetes Manifests
Create the resources needed to reproduce the failure:
# manifests/scenario-N/namespace.yaml
apiVersion: v1
kind: Namespace
metadata:
name: your-scenario-namespace
labels:
scenario: "N"
# manifests/scenario-N/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: your-app
namespace: your-scenario-namespace
spec:
replicas: 3
selector:
matchLabels:
app: your-app
template:
metadata:
labels:
app: your-app
spec:
containers:
- name: app
image: busybox:latest
command: ["/bin/sh", "-c"]
args:
- |
# Your application logic here
# Include the failure condition
echo "Application starting..."
while true; do
# Simulate work
sleep 5
done
Key Principles:
- Use lightweight images (busybox, alpine) for faster startup
- Include comments explaining the intentional misconfiguration
- Add logging to demonstrate the failure
- Use realistic resource requests/limits
- Include health checks (readiness/liveness probes)
Step 4: Create Scenario Script
Copy an existing script as a template:
# Use scenario 1 as a template
cp scripts/1_scenerio.sh scripts/N_scenerio.sh
# Make it executable
chmod +x scripts/N_scenerio.sh
Script Structure:
#!/bin/bash
###############################################################################
# Scenario N: [Brief Title]
#
# This scenario demonstrates:
# 1. [Key point 1]
# 2. [Key point 2]
# 3. [Key point 3]
#
# Primary Trigger: [What initiates the failure]
# Propagation: [How it cascades]
# Impact: [Service disruption]
###############################################################################
set -e
# Colors for output
RED='\033[0;31m'
GREEN='\033[0;32m'
YELLOW='\033[1;33m'
BLUE='\033[0;34m'
NC='\033[0m'
# Configuration
CLUSTER_NAME="scenario-N-cluster"
NAMESPACE="your-scenario-namespace"
GIT_REPO="https://github.com/siddhantprateek/SRE-bench"
GIT_BRANCH="main"
MANIFEST_PATH="manifests/scenario-N"
# Parse command line arguments
USE_EXISTING_CLUSTER=false
KUBECONFIG_PATH=""
while [[ $# -gt 0 ]]; do
case $1 in
--cluster)
USE_EXISTING_CLUSTER=true
shift
;;
--kubeconfig)
KUBECONFIG_PATH="$2"
shift 2
;;
*)
echo "Unknown option: $1"
echo "Usage: $0 [--cluster] [--kubeconfig PATH]"
exit 1
;;
esac
done
if [ -n "$KUBECONFIG_PATH" ]; then
export KUBECONFIG="$KUBECONFIG_PATH"
fi
# Utility functions
print_header() {
echo -e "\n${BLUE}========================================${NC}"
echo -e "${BLUE}$1${NC}"
echo -e "${BLUE}========================================${NC}\n"
}
print_success() { echo -e "${GREEN}✓ $1${NC}"; }
print_error() { echo -e "${RED}✗ $1${NC}"; }
print_warning() { echo -e "${YELLOW}⚠ $1${NC}"; }
print_info() { echo -e "${BLUE}ℹ $1${NC}"; }
# Main functions
check_prerequisites() {
print_header "Checking Prerequisites"
# Check for required tools
print_success "All prerequisites satisfied"
}
create_cluster() {
if $USE_EXISTING_CLUSTER; then
print_header "Using Existing Cluster"
kubectl cluster-info
return
fi
print_header "Creating Kind Cluster"
# Create Kind cluster
print_success "Cluster created successfully"
}
deploy_initial_state() {
print_header "Deploying Initial State"
# Deploy application in healthy state
print_success "Initial deployment complete"
}
trigger_failure() {
print_header "Triggering Failure Condition"
# Apply misconfiguration or inject fault
print_error "Failure condition triggered"
}
observe_impact() {
print_header "Observing Impact"
# Show logs, pod status, metrics
print_error "Demonstrating cascading failure"
}
show_impact() {
print_header "Impact Summary"
echo -e "${RED}━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━${NC}"
echo -e "${RED} PRODUCTION INCIDENT ${NC}"
echo -e "${RED}━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━${NC}"
# Incident details, timeline, prevention tips
}
cleanup() {
print_header "Cleanup"
# Offer to delete cluster or namespace
}
main() {
print_header "Scenario N: [Your Scenario Title]"
check_prerequisites
create_cluster
deploy_initial_state
echo ""
read -p "Press Enter to trigger failure condition..."
echo ""
trigger_failure
observe_impact
show_impact
cleanup
print_header "Scenario Complete"
}
main
Important Considerations:
- Support both
--clusterand--kubeconfigflags - Use color-coded output for readability
- Include interactive pauses at key moments
- Show clear detection signals
- Provide actionable remediation steps
- Offer cleanup options
Step 5: Document the Scenario
Add your scenario to scenerio/README.md:
## N. [Scenario Title]
### Primary Trigger
[What initiates the failure - be specific]
### Propagation Path
1. **[Step 1]**: [Description]
2. **[Step 2]**: [Description]
3. **[Step 3]**: [Description]
4. **[Final Impact]**: [Description]
### Impact
- [Impact 1]
- [Impact 2]
- [Impact 3]
### Detection Signals
- [Signal 1]
- [Signal 2]
- [Signal 3]
### Mitigation Steps
1. [Step 1]
2. [Step 2]
3. [Step 3]
### Prevention
- [Best practice 1]
- [Best practice 2]
- [Best practice 3]
### Related Scenarios
- Scenario X: [Related scenario]
### Real-World Examples
- [Optional: Link to postmortem or incident report]
Step 6: Test Your Scenario
Test thoroughly in multiple modes:
# Test with new Kind cluster
./scripts/N_scenerio.sh
# Test with existing cluster
kind create cluster --name test-cluster
./scripts/N_scenerio.sh --cluster
kind delete cluster --name test-cluster
# Test with custom kubeconfig
./scripts/N_scenerio.sh --cluster --kubeconfig ~/.kube/config
Checklist:
- Script runs without errors
- Failure condition is clearly demonstrated
- Detection signals are observable
- Cleanup works properly
- Works in both new and existing clusters
- All required components are installed
- Logs are informative and color-coded
- Interactive pauses make sense
- Impact summary is comprehensive
Step 7: Submit Pull Request
# Create a new branch
git checkout -b scenario-N-your-scenario-name
# Add your files
git add manifests/scenario-N/
git add scripts/N_scenerio.sh
git add scenerio/README.md
# Commit with descriptive message
git commit -m "Add Scenario N: [Your Scenario Title]
- Demonstrates [key failure mode]
- Includes [components used]
- Tests [agent capability]
"
# Push to your fork
git push origin scenario-N-your-scenario-name
# Open PR on GitHub
PR Description Template:
## Scenario N: [Your Scenario Title]
### Description
[Brief description of the scenario and what it demonstrates]
### Failure Mode
- **Trigger**: [What causes it]
- **Impact**: [Service disruption]
- **Components**: [ArgoCD, HPA, etc.]
### Testing
- [x] Tested in new Kind cluster
- [x] Tested in existing cluster
- [x] Tested with custom kubeconfig
- [x] Documentation complete
- [x] Script follows conventions
### Related Issues
Closes #[issue-number] (if applicable)
### Screenshots/Logs
[Optional: Include example output]
Scenario Design Patterns
Pattern 1: GitOps Drift Scenario
Use Case: Configuration or deployment issues
# Deploy via ArgoCD
cat <<EOF | kubectl apply -f -
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
name: your-app
namespace: argocd
spec:
source:
repoURL: ${GIT_REPO}
targetRevision: ${GIT_BRANCH}
path: ${MANIFEST_PATH}
destination:
server: https://kubernetes.default.svc
namespace: ${NAMESPACE}
EOF
When to Use:
- Deployment strategy issues
- Configuration drift
- Image update problems
- Rollout failures
Pattern 2: Runtime Failure Scenario
Use Case: Operational or infrastructure issues
# Direct kubectl apply
kubectl apply -f manifests/scenario-N/
# Trigger runtime failure
kubectl scale deployment app --replicas=0 -n ${NAMESPACE}
When to Use:
- Connection pool issues
- Resource pressure
- Monitoring failures
- Network problems
Pattern 3: Resource Pressure Scenario
Use Case: Scaling or resource issues
# Install metrics-server
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml
# Create HPA with misconfiguration
kubectl apply -f manifests/scenario-N/hpa.yaml
When to Use:
- HPA misconfiguration
- Node pressure
- OOMKilled pods
- Cluster autoscaler issues
Best Practices
Manifest Design
- Use Comments: Clearly mark intentional misconfigurations
- Realistic Resources: Use production-like resource requests/limits
- Health Checks: Include probes that will fail during the incident
- Logging: Add verbose logging to show the failure progression
- Labels: Use consistent labels for easier querying
Script Design
- Prerequisites: Check for all required tools
- Error Handling: Use
set -eand handle failures gracefully - Progress Indicators: Show what's happening at each step
- Timeouts: Use reasonable timeouts for kubectl wait commands
- Cleanup: Always offer cleanup options
Documentation
- Clear Title: Describe the failure mode concisely
- Propagation Path: Show how the failure cascades
- Detection Signals: List observable symptoms
- Prevention: Include actionable best practices
- Related Scenarios: Link to similar scenarios
Getting Help
- Questions: Open a GitHub Discussion
- Bugs: Open a GitHub Issue
- Ideas: Start with a GitHub Discussion before implementing
- Review: Tag maintainers in your PR for review
Code of Conduct
Please be:
- Respectful - Be kind to other contributors
- Constructive - Provide helpful feedback
- Collaborative - Work together to improve scenarios
- Patient - Reviews take time
Recognition
Contributors will be:
- Listed in the repository contributors
- Credited in the scenario documentation
- Mentioned in release notes
Thank you for contributing to SRE-bench!