Operational Runbook
This runbook covers day-to-day operations for Caracal Core.
Common Operations
Create Agent
caracal agent create \
--name "my-agent" \
--description "My AI agent" \
--owner "user@example.com"
Create Budget Policy
caracal policy create \
--agent-id <agent-id> \
--limit 100.00 \
--currency USD \
--time-window daily \
--window-type calendar \
--change-reason "Initial budget allocation"
Query Agent Spending
# Spending over an explicit time range
caracal ledger query \
--agent-id <agent-id> \
--start-time "2024-01-01T00:00:00Z" \
--end-time "2024-01-31T23:59:59Z"
# Current daily spending
caracal ledger query \
--agent-id <agent-id> \
--time-window daily
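The timestamps for a range query can be generated instead of typed by hand. The following is a minimal sketch, assuming GNU date (for the -d "@<epoch>" form); last_days_range is a hypothetical helper name and not part of the Caracal CLI.

```shell
# Hypothetical helper: print "<start> <end>" as UTC ISO 8601 timestamps
# covering the last N days. Assumes GNU date, which accepts -d "@<epoch>".
last_days_range() {
  local days="$1" now start_epoch
  now="$(date -u +%s)"
  start_epoch=$((now - days * 86400))
  printf '%s %s\n' \
    "$(date -u -d "@${start_epoch}" +%Y-%m-%dT%H:%M:%SZ)" \
    "$(date -u -d "@${now}" +%Y-%m-%dT%H:%M:%SZ)"
}
```

One way to use it: run `set -- $(last_days_range 7)` and pass "$1" and "$2" as the --start-time and --end-time values of caracal ledger query.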
Create Resource Allowlist
# Regex pattern
caracal allowlist create \
--agent-id <agent-id> \
--pattern "^https://api\\.openai\\.com/.*$" \
--pattern-type regex
# Glob pattern
caracal allowlist create \
--agent-id <agent-id> \
--pattern "https://api.anthropic.com/*" \
--pattern-type glob
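Before creating a regex allowlist entry, it can help to try the pattern against a few sample URLs. The sketch below uses grep -E as a stand-in matcher; it assumes the pattern behaves the same under POSIX extended regular expressions as under whatever regex engine Caracal uses, which the runbook does not specify. url_matches is a hypothetical helper. Note that the single-quoted pattern here is what the shell actually passes for the double-quoted "\\." form used in the command above.

```shell
# Hypothetical pre-check: succeed if URL ($2) matches the extended regex ($1).
# Assumption: grep -E and Caracal's regex engine agree on simple patterns.
url_matches() {
  printf '%s\n' "$2" | grep -Eq -- "$1"
}

pattern='^https://api\.openai\.com/.*$'
url_matches "$pattern" 'https://api.openai.com/v1/chat/completions' && echo "allowed"
url_matches "$pattern" 'https://api-openai.com.example.net/v1' || echo "blocked"
```

The second URL is rejected because the escaped dots only match a literal ".", which is the reason for escaping them in the first place.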
Verify Ledger Integrity
# Verify specific batch
caracal merkle verify-batch --batch-id <batch-id>
# Verify time range
caracal merkle verify-range \
--start-time "2024-01-01T00:00:00Z" \
--end-time "2024-01-31T23:59:59Z"
Create Ledger Snapshot
# Create a snapshot
caracal snapshot create
# List existing snapshots
caracal snapshot list
# Restore from a snapshot
caracal snapshot restore --snapshot-id <snapshot-id>
Monitor Dead Letter Queue
# List events in the dead letter queue
caracal dlq list
# Inspect a single event
caracal dlq get --event-id <event-id>
Troubleshooting
Gateway Returns 503
Diagnosis:
curl http://gateway:8443/health
kubectl logs -f deployment/caracal-gateway --tail=100
caracal db health-check
Common Causes:
- Database unavailable
- Kafka unavailable
- Redis unavailable
Resolution:
kubectl rollout restart deployment/<component>
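After a restart, the health endpoint may take a while to respond. A small retry helper avoids re-running the check by hand. This is a sketch, not part of Caracal; retry is a hypothetical function name.

```shell
# Hypothetical helper: run a command until it succeeds, at most ATTEMPTS times,
# sleeping DELAY seconds between failed attempts.
# Usage: retry ATTEMPTS DELAY COMMAND [ARGS...]
retry() {
  local attempts="$1" delay="$2" i=1
  shift 2
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  return 1
}
```

For example, wait for the rollout with `kubectl rollout status deployment/<component> --timeout=120s`, then run `retry 10 3 curl -fsS -o /dev/null http://gateway:8443/health`. The -f flag makes curl exit non-zero on HTTP errors such as 503, so the loop continues until the gateway reports healthy or the attempts run out.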
Kafka Consumer Lag Increasing
Diagnosis:
caracal kafka consumer-lag --consumer-group ledger-writer-group
kubectl logs -f deployment/caracal-ledger-writer --tail=100
Resolution:
# Scale consumers
kubectl scale deployment/caracal-ledger-writer --replicas=5
# Increase resources
kubectl set resources deployment/caracal-ledger-writer \
--limits=cpu=2,memory=4Gi \
--requests=cpu=1,memory=2Gi
Merkle Verification Failures
CRITICAL: This indicates potential data tampering.
Immediate Actions:
- Stop all writes to affected batches
- Preserve evidence (database dumps, logs)
- Notify security team
- Investigate root cause
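When preserving evidence, recording checksums at collection time makes it possible to show later that the dumps and logs were not altered. The sketch below assumes GNU coreutils sha256sum and an evidence directory of your choosing; write_manifest and check_manifest are hypothetical helper names.

```shell
# Hypothetical helper: record SHA-256 checksums of every file under DIR in
# DIR/SHA256SUMS. Assumes GNU coreutils sha256sum.
write_manifest() {
  local dir="$1"
  (
    cd "$dir" || exit 1
    find . -type f ! -name SHA256SUMS -exec sha256sum {} + > SHA256SUMS
  )
}

# Re-check the files against the recorded manifest.
# Exits non-zero if any file is missing or has changed.
check_manifest() {
  ( cd "$1" && sha256sum --quiet -c SHA256SUMS )
}
```

Store a copy of SHA256SUMS separately from the evidence itself, since a manifest kept only next to the files can be rewritten along with them.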
High Memory Usage
Diagnosis:
kubectl top pods
curl http://gateway:9090/metrics | grep memory
Resolution:
kubectl set resources deployment/<component> \
--limits=memory=8Gi \
--requests=memory=4Gi
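To find which pods are worth resizing, the output of kubectl top pods can be filtered by memory. A sketch, assuming the usual three-column output (name, CPU, memory) with Ki, Mi, or Gi suffixes; pods_over_mem is a hypothetical helper.

```shell
# Hypothetical helper: read `kubectl top pods --no-headers` output on stdin and
# print pods whose memory usage exceeds THRESHOLD_MI (in Mi).
# Lines with an unrecognized memory unit are skipped.
pods_over_mem() {
  awk -v limit="$1" '
    {
      mem = $3
      if (mem ~ /Gi$/)      { sub(/Gi$/, "", mem); mem *= 1024 }
      else if (mem ~ /Mi$/) { sub(/Mi$/, "", mem) }
      else if (mem ~ /Ki$/) { sub(/Ki$/, "", mem); mem /= 1024 }
      else next
      if (mem + 0 > limit + 0) print $1, $3
    }'
}
```

For example, `kubectl top pods --no-headers | pods_over_mem 4096` lists pods using more than 4Gi.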
Backup and Recovery
PostgreSQL Backup
# Manual backup (no -t: allocating a TTY can corrupt the piped dump)
kubectl exec postgresql-0 -- \
pg_dump -U caracal caracal | gzip > backup.sql.gz
# Using CLI
caracal backup create --type postgresql
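With the manual pipeline, gzip still writes a valid (but empty) archive if pg_dump fails, so it is worth checking the file before relying on it. A sketch with hypothetical helper names; the file naming scheme is an assumption, not a Caracal convention.

```shell
# Hypothetical helper: timestamped file name for a manual dump (UTC).
backup_filename() {
  printf 'caracal-%s.sql.gz\n' "$(date -u +%Y%m%dT%H%M%SZ)"
}

# Check that a compressed dump is a complete gzip stream and is not empty once
# decompressed. This does not prove that the SQL itself restores cleanly.
verify_backup() {
  gzip -t "$1" 2>/dev/null || return 1
  [ "$(gunzip -c "$1" | head -c 1 | wc -c)" -gt 0 ]
}
```

For example, write the dump to "$(backup_filename)" instead of backup.sql.gz and run verify_backup on the result before deleting older backups.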
PostgreSQL Restore
# Stop consumers
kubectl scale deployment/caracal-ledger-writer --replicas=0
kubectl scale deployment/caracal-metrics-aggregator --replicas=0
# Restore
gunzip -c backup.sql.gz | \
kubectl exec -i postgresql-0 -- psql -U caracal caracal
# Verify and restart both consumers
caracal db health-check
kubectl scale deployment/caracal-ledger-writer --replicas=3
kubectl scale deployment/caracal-metrics-aggregator --replicas=<replica-count>
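Hard-coded replica counts are easy to get wrong after a restore. One option is to record the current counts before scaling to zero and replay them afterwards. The sketch below uses only standard kubectl calls; scale_down, scale_restore, and the state file are hypothetical.

```shell
# Hypothetical helper: save each deployment's replica count to STATE_FILE,
# then scale it to zero.
# Usage: scale_down STATE_FILE DEPLOYMENT...
scale_down() {
  local state_file="$1" d replicas
  shift
  : > "$state_file"
  for d in "$@"; do
    replicas="$(kubectl get deployment "$d" -o jsonpath='{.spec.replicas}')" || return 1
    printf '%s %s\n' "$d" "$replicas" >> "$state_file"
    kubectl scale "deployment/$d" --replicas=0 || return 1
  done
}

# Scale every deployment listed in STATE_FILE back to its saved count.
scale_restore() {
  local d replicas
  while read -r d replicas; do
    # stdin is redirected so kubectl cannot consume the state file
    kubectl scale "deployment/$d" --replicas="$replicas" < /dev/null || return 1
  done < "$1"
}
```

For the restore procedure this would be `scale_down /tmp/consumers.state caracal-ledger-writer caracal-metrics-aggregator` before the restore and `scale_restore /tmp/consumers.state` after the health check passes.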
Event Replay Recovery
# Stop consumers
kubectl scale deployment/caracal-ledger-writer --replicas=0
# Reset database
caracal db reset --confirm
# Restore from snapshot
caracal snapshot restore --snapshot-id <snapshot-id>
# Replay events
caracal replay start --from-snapshot <snapshot-id>
# Monitor and verify
caracal replay status
caracal merkle verify-range --start-time <timestamp> --end-time now
Scaling
Horizontal Scaling
# Gateway
kubectl scale deployment/caracal-gateway --replicas=5
# Consumers
kubectl scale deployment/caracal-ledger-writer --replicas=10
# Auto-scaling
kubectl autoscale deployment/caracal-gateway \
--min=3 --max=10 --cpu-percent=70
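To reason about what the autoscaler will do, the Kubernetes Horizontal Pod Autoscaler computes desired = ceil(currentReplicas * currentUtilization / targetUtilization), where utilization is measured against the pods' CPU requests. The sketch below reproduces that arithmetic with the limits from the command above; it ignores the autoscaler's tolerance band and stabilization window, and hpa_desired_replicas is a hypothetical helper.

```shell
# Desired replica count per the HPA formula, clamped to [MIN, MAX]:
#   desired = ceil(current_replicas * current_cpu_percent / target_cpu_percent)
# Integer arithmetic only; tolerance and stabilization are not modeled.
hpa_desired_replicas() {
  local current="$1" cpu="$2" target="$3" min="$4" max="$5" desired
  desired=$(( (current * cpu + target - 1) / target ))
  if [ "$desired" -lt "$min" ]; then desired="$min"; fi
  if [ "$desired" -gt "$max" ]; then desired="$max"; fi
  echo "$desired"
}

# 3 replicas averaging 90% CPU against a 70% target: ceil(270 / 70) = 4
hpa_desired_replicas 3 90 70 3 10   # prints 4
```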
Vertical Scaling
kubectl set resources deployment/caracal-gateway \
--limits=cpu=4,memory=8Gi \
--requests=cpu=2,memory=4Gi
Database Scaling
database:
  pool_size: 50
  max_overflow: 100
  read_replica_url: "postgresql://replica:5432/caracal"
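Pool settings interact with horizontal scaling, because every replica opens its own connections. The sketch below estimates the worst case, assuming pool_size and max_overflow follow SQLAlchemy-style semantics (up to pool_size + max_overflow connections per process) and that each replica runs one process; the runbook does not confirm either point.

```shell
# Upper bound on PostgreSQL connections opened by REPLICAS pods, assuming each
# pod holds at most pool_size + max_overflow connections (an assumption).
max_db_connections() {
  local replicas="$1" pool_size="$2" max_overflow="$3"
  echo $(( replicas * (pool_size + max_overflow) ))
}

# 5 gateway replicas with the settings above
max_db_connections 5 50 100   # prints 750
```

Compare the result with the server's limit (SHOW max_connections; in psql). PostgreSQL defaults to 100, so values like these usually call for a raised limit or a connection pooler in front of the database.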
Monitoring
Key Metrics
- caracal_gateway_requests_total: Total requests
- caracal_gateway_request_duration_seconds: Request latency
- caracal_kafka_consumer_lag: Consumer lag
- caracal_merkle_verification_failures_total: Verification failures
- caracal_dlq_size: Dead letter queue size
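For a quick look without Grafana, the key metrics can be filtered straight from a metrics endpoint. A sketch, assuming the endpoint serves the Prometheus text format; key_metrics is a hypothetical helper.

```shell
# Keep only sample lines for the key metrics from Prometheus text-format input
# on stdin. Prefix matching also keeps derived series such as
# caracal_gateway_request_duration_seconds_bucket. Comment lines are dropped.
key_metrics() {
  grep -E '^caracal_(gateway_requests_total|gateway_request_duration_seconds|kafka_consumer_lag|merkle_verification_failures_total|dlq_size)'
}
```

For example: `curl -s http://gateway:9090/metrics | key_metrics`.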
Grafana Dashboards
kubectl apply -f monitoring/grafana/dashboards/
Emergency Procedures
Complete System Outage
- Check infrastructure:
kubectl get pods --all-namespaces
- Check recent changes:
kubectl rollout history deployment/<component>
- Restart in order: Database > Kafka > Redis > Gateway > Consumers
- Verify health:
curl http://gateway:8443/health
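The ordered restart can be scripted so that each component is fully rolled out before the next one starts. A sketch using standard kubectl rollout commands; restart_in_order is a hypothetical helper and the 300-second timeout is an arbitrary choice.

```shell
# Restart workloads one at a time, waiting for each rollout to finish before
# starting the next. Arguments are kubectl resource names in restart order.
# Stops at the first resource that fails to restart or become ready.
restart_in_order() {
  local resource
  for resource in "$@"; do
    kubectl rollout restart "$resource" || return 1
    kubectl rollout status "$resource" --timeout=300s || return 1
  done
}
```

A call might look like `restart_in_order statefulset/postgresql statefulset/kafka statefulset/redis deployment/caracal-gateway deployment/caracal-ledger-writer deployment/caracal-metrics-aggregator`. The three StatefulSet names are assumptions; substitute the names used in your cluster.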
Data Corruption Detected
- STOP ALL WRITES:
kubectl scale deployment/caracal-gateway --replicas=0
kubectl scale deployment/caracal-ledger-writer --replicas=0
- Preserve Evidence:
caracal backup create --type postgresql --tag "corruption-evidence"
caracal backup create --type kafka --tag "corruption-evidence"
- Notify Security Team
- Investigate and Recover