Operational Runbook

This runbook covers day-to-day operations for Caracal Core.

Common Operations

Create Agent

caracal agent create \
  --name "my-agent" \
  --description "My AI agent" \
  --owner "user@example.com"

Create Budget Policy

caracal policy create \
  --agent-id <agent-id> \
  --limit 100.00 \
  --currency USD \
  --time-window daily \
  --window-type calendar \
  --change-reason "Initial budget allocation"

Query Agent Spending

# Spending over a fixed time range
caracal ledger query \
  --agent-id <agent-id> \
  --start-time "2024-01-01T00:00:00Z" \
  --end-time "2024-01-31T23:59:59Z"

# Current daily spending
caracal ledger query \
  --agent-id <agent-id> \
  --time-window daily
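
The time-range query above needs explicit start and end timestamps. The following is a minimal sketch of a helper that derives both for one calendar month in UTC. The function name `month_range` is hypothetical and not part of the Caracal CLI.

```shell
# Print "<start> <end>" UTC timestamps covering one calendar month.
# Argument: the month as YYYY-MM.
month_range() {
  local ym="$1" year month days
  year="${ym%-*}"
  month="${ym#*-}"
  case "$month" in
    01|03|05|07|08|10|12) days=31 ;;
    04|06|09|11) days=30 ;;
    02)
      # Gregorian leap-year rule
      if [ $((year % 4)) -eq 0 ] && [ $((year % 100)) -ne 0 ] || [ $((year % 400)) -eq 0 ]; then
        days=29
      else
        days=28
      fi
      ;;
    *)
      echo "invalid month: $ym" >&2
      return 1
      ;;
  esac
  printf '%s-%s-01T00:00:00Z %s-%s-%sT23:59:59Z\n' \
    "$year" "$month" "$year" "$month" "$days"
}

# Example (not run here): split the output into $1 and $2, then query.
#   set -- $(month_range 2024-01)
#   caracal ledger query --agent-id <agent-id> --start-time "$1" --end-time "$2"
```

Computing the last day in the shell avoids hand-typed end dates such as February 30.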

Create Resource Allowlist

# Regex pattern
caracal allowlist create \
  --agent-id <agent-id> \
  --pattern "^https://api\\.openai\\.com/.*$" \
  --pattern-type regex

# Glob pattern
caracal allowlist create \
  --agent-id <agent-id> \
  --pattern "https://api.anthropic.com/*" \
  --pattern-type glob
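
Before creating an allowlist entry, it can help to try the pattern against a few sample URLs locally. The sketch below approximates this with `grep -E` and shell `case` globbing. The dialects Caracal actually uses for `regex` and `glob` are not stated in this runbook, so treat the result as a sanity check only. Both function names are hypothetical.

```shell
# Return 0 if the URL matches the POSIX extended regex.
# Assumption: Caracal's regex dialect behaves like ERE for simple patterns.
matches_regex() {
  printf '%s\n' "$2" | grep -Eq -- "$1"
}

# Return 0 if the URL matches the shell glob.
# Assumption: "*" matches any characters, including "/".
matches_glob() {
  case "$2" in
    $1) return 0 ;;
    *) return 1 ;;
  esac
}

# Example (not run here). Single quotes pass the regex through unchanged,
# which is the same string the double-quoted form above produces.
#   matches_regex '^https://api\.openai\.com/.*$' 'https://api.openai.com/v1/chat' && echo allowed
#   matches_glob 'https://api.anthropic.com/*' 'https://api.anthropic.com/v1/messages' && echo allowed
```

A quick local check also shows why the dots are escaped: an unescaped `.` in the regex would match any character in the hostname.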

Verify Ledger Integrity

# Verify specific batch
caracal merkle verify-batch --batch-id <batch-id>

# Verify time range
caracal merkle verify-range \
  --start-time "2024-01-01T00:00:00Z" \
  --end-time "2024-01-31T23:59:59Z"

Create Ledger Snapshot

# Create a snapshot
caracal snapshot create

# List available snapshots
caracal snapshot list

# Restore from a snapshot
caracal snapshot restore --snapshot-id <snapshot-id>

Monitor Dead Letter Queue

# List events in the dead letter queue
caracal dlq list

# Inspect a single event
caracal dlq get --event-id <event-id>

Troubleshooting

Gateway Returns 503

Diagnosis:

# Check the gateway health endpoint
curl http://gateway:8443/health

# Tail recent gateway logs
kubectl logs -f deployment/caracal-gateway --tail=100

# Check database health
caracal db health-check

Common Causes:

Database unavailable
Kafka unavailable
Redis unavailable

Resolution:

kubectl rollout restart deployment/<component>
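
After a restart, the gateway may take a while to report healthy again. The following is a minimal sketch of a generic retry wrapper for polling the health endpoint. The function name `retry` and the attempt and delay values in the example are assumptions, not Caracal defaults.

```shell
# Run a command until it succeeds.
# Usage: retry <attempts> <delay-seconds> <command> [args...]
retry() {
  local attempts="$1" delay="$2" i=1
  shift 2
  while true; do
    "$@" && return 0
    # Give up once the attempt budget is spent
    [ "$i" -ge "$attempts" ] && return 1
    i=$((i + 1))
    sleep "$delay"
  done
}

# Example (not run here): poll the health endpoint for up to 30 attempts,
# 5 seconds apart. curl -f makes HTTP errors such as 503 count as failures.
#   retry 30 5 curl -fsS http://gateway:8443/health
```

`kubectl rollout status deployment/<component>` can be used first to wait for the rollout itself to finish.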

Kafka Consumer Lag Increasing

Diagnosis:

# Check lag for the ledger writer consumer group
caracal kafka consumer-lag --consumer-group ledger-writer-group

# Tail recent ledger writer logs
kubectl logs -f deployment/caracal-ledger-writer --tail=100

Resolution:

# Scale consumers
kubectl scale deployment/caracal-ledger-writer --replicas=5

# Increase resources
kubectl set resources deployment/caracal-ledger-writer \
  --limits=cpu=2,memory=4Gi \
  --requests=cpu=1,memory=2Gi
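
Scaling consumers only helps up to the partition count of the topic: within a Kafka consumer group, each partition is read by at most one consumer, so extra replicas sit idle. The sketch below caps a desired replica count accordingly. The function name is hypothetical, and the partition count has to be looked up for your topic because this runbook does not state it.

```shell
# Print the replica count that can actually do work.
# Usage: effective_replicas <desired-replicas> <topic-partitions>
effective_replicas() {
  local desired="$1" partitions="$2"
  if [ "$desired" -gt "$partitions" ]; then
    echo "$partitions"
  else
    echo "$desired"
  fi
}

# Example (not run here), assuming the topic has 6 partitions:
#   kubectl scale deployment/caracal-ledger-writer \
#     --replicas="$(effective_replicas 10 6)"
```

If lag persists at the cap, adding partitions or raising per-pod resources is the remaining lever.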

Merkle Verification Failures

CRITICAL: A Merkle verification failure indicates potential data tampering.

Immediate Actions:

Stop all writes to affected batches
Preserve evidence (database dumps, logs)
Notify security team
Investigate root cause
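
Preserved evidence is more useful if it can later be shown to be unmodified. One possible approach, sketched below, is to record SHA-256 checksums of the collected dumps and logs right after gathering them. The function names and the `MANIFEST.sha256` filename are assumptions for illustration. The sketch relies on the coreutils `sha256sum` tool.

```shell
# Write a SHA-256 manifest covering every file in an evidence directory.
evidence_manifest() {
  local dir="$1"
  (
    cd "$dir" &&
      find . -type f ! -name MANIFEST.sha256 -exec sha256sum {} + |
      sort -k 2 > MANIFEST.sha256
  )
}

# Re-check the files against the manifest; non-zero exit means a mismatch.
verify_evidence() {
  ( cd "$1" && sha256sum -c MANIFEST.sha256 > /dev/null )
}

# Example (not run here), with a hypothetical evidence directory:
#   evidence_manifest /var/tmp/caracal-evidence
#   verify_evidence /var/tmp/caracal-evidence && echo "evidence intact"
```

Storing a copy of the manifest away from the affected systems keeps it trustworthy if those systems are compromised.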

High Memory Usage

Diagnosis:

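# Per-pod CPU and memory usage
kubectl top pods

# Memory-related metrics from the gateway (-s hides curl's progress meter)
curl -s http://gateway:9090/metrics | grep memory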

Resolution:

kubectl set resources deployment/<component> \
  --limits=memory=8Gi \
  --requests=memory=4Gi

Backup and Recovery

PostgreSQL Backup

# Manual backup (no -t: an allocated TTY can corrupt the piped dump)
kubectl exec postgresql-0 -- \
  pg_dump -U caracal caracal | gzip > backup.sql.gz

# Using CLI
caracal backup create --type postgresql
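
A backup is only useful if it is complete. The sketch below checks a manual `backup.sql.gz` in three ways: the file is non-empty, it is a valid gzip stream, and it ends with the completion marker that `pg_dump` writes at the end of a plain-text dump. The function name `check_backup` is hypothetical, and the check does not apply to backups made with `caracal backup create`, whose format is not described here.

```shell
# Sanity-check a gzip-compressed plain-text pg_dump file.
check_backup() {
  local file="$1"
  if [ ! -s "$file" ]; then
    echo "backup missing or empty: $file" >&2
    return 1
  fi
  if ! gzip -t "$file" 2>/dev/null; then
    echo "not a valid gzip stream: $file" >&2
    return 1
  fi
  # A dump that stopped partway lacks the trailing completion comment
  if ! gzip -dc "$file" | tail -n 10 | grep -q 'PostgreSQL database dump complete'; then
    echo "dump looks truncated: $file" >&2
    return 1
  fi
}

# Example (not run here):
#   check_backup backup.sql.gz && echo "backup looks complete"
```

This does not replace a periodic test restore into a scratch database.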

PostgreSQL Restore

# Stop consumers
kubectl scale deployment/caracal-ledger-writer --replicas=0
kubectl scale deployment/caracal-metrics-aggregator --replicas=0

# Restore
gunzip -c backup.sql.gz | \
  kubectl exec -i postgresql-0 -- psql -U caracal caracal

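# Verify and restart both consumers stopped above
caracal db health-check
kubectl scale deployment/caracal-ledger-writer --replicas=3
kubectl scale deployment/caracal-metrics-aggregator --replicas=<previous-replicas>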

Event Replay Recovery

# Stop consumers
kubectl scale deployment/caracal-ledger-writer --replicas=0

# Reset database
caracal db reset --confirm

# Restore from snapshot
caracal snapshot restore --snapshot-id <snapshot-id>

# Replay events
caracal replay start --from-snapshot <snapshot-id>

# Monitor and verify
caracal replay status
caracal merkle verify-range --start-time <timestamp> --end-time now

Scaling

Horizontal Scaling

# Gateway
kubectl scale deployment/caracal-gateway --replicas=5

# Consumers
kubectl scale deployment/caracal-ledger-writer --replicas=10

# Auto-scaling
kubectl autoscale deployment/caracal-gateway \
  --min=3 --max=10 --cpu-percent=70
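
To get a feel for how the autoscaler above will react, the Kubernetes documentation gives the core calculation as desiredReplicas = ceil(currentReplicas * currentMetricValue / desiredMetricValue), bounded by the min and max. The sketch below reproduces that arithmetic. It ignores the autoscaler's tolerance band and stabilization behavior, and the function name is hypothetical.

```shell
# Estimate the replica count an HPA would target.
# Usage: hpa_desired <current-replicas> <current-cpu-%> <target-cpu-%> <min> <max>
hpa_desired() {
  local current="$1" usage="$2" target="$3" min="$4" max="$5" desired
  # Integer ceiling of current * usage / target
  desired=$(( (current * usage + target - 1) / target ))
  [ "$desired" -lt "$min" ] && desired="$min"
  [ "$desired" -gt "$max" ] && desired="$max"
  echo "$desired"
}

# Example (not run here): 3 replicas averaging 90% CPU against the 70% target.
#   hpa_desired 3 90 70 3 10
```

Note that CPU-based autoscaling needs CPU requests set on the pods, since utilization is measured relative to the request.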

Vertical Scaling

kubectl set resources deployment/caracal-gateway \
  --limits=cpu=4,memory=8Gi \
  --requests=cpu=2,memory=4Gi

Database Scaling

database:
  pool_size: 50
  max_overflow: 100
  read_replica_url: "postgresql://replica:5432/caracal"
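
Pool settings multiply across replicas, so it is worth checking the worst case against the PostgreSQL `max_connections` setting before scaling out. The sketch below assumes `pool_size` and `max_overflow` are per-process limits that add up, as in SQLAlchemy-style pools. This runbook does not confirm that, so verify it for your deployment. The function name is hypothetical.

```shell
# Worst-case connection count across all replicas of one component.
# Usage: max_db_connections <replicas> <pool_size> <max_overflow>
max_db_connections() {
  local replicas="$1" pool_size="$2" max_overflow="$3"
  echo $(( replicas * (pool_size + max_overflow) ))
}

# Example (not run here): 5 gateway replicas with the settings above.
#   max_db_connections 5 50 100
# Compare the result with the server limit:
#   kubectl exec postgresql-0 -- psql -U caracal caracal -c 'SHOW max_connections;'
```

If the worst case exceeds the server limit, lower the pool settings or put a connection pooler in front of the database.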

Monitoring

Key Metrics

caracal_gateway_requests_total - Total requests
caracal_gateway_request_duration_seconds - Request latency
caracal_kafka_consumer_lag - Consumer lag
caracal_merkle_verification_failures_total - Verification failures
caracal_dlq_size - Dead letter queue size
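
For a quick check without a dashboard, these metrics can be read straight from a metrics endpoint. The sketch below sums all samples of one metric from Prometheus text-format output. It assumes sample lines carry no trailing timestamp and that label values contain no spaces. The function name `metric_sum` is hypothetical.

```shell
# Sum every sample of one metric read from stdin (Prometheus text format).
# Exits non-zero if the metric does not appear at all.
metric_sum() {
  awk -v name="$1" '
    /^#/ { next }                    # skip HELP and TYPE comment lines
    {
      metric = $1
      sub(/\{.*/, "", metric)        # drop the label set, keep the name
      if (metric == name) { sum += $NF; found = 1 }
    }
    END { if (!found) exit 1; print sum }
  '
}

# Example (not run here), using the metrics endpoint from High Memory Usage:
#   curl -s http://gateway:9090/metrics | metric_sum caracal_dlq_size
```

Summing across label sets gives one number to compare against an alert threshold.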

Grafana Dashboards

kubectl apply -f monitoring/grafana/dashboards/

Emergency Procedures

Complete System Outage

Check infrastructure: kubectl get pods --all-namespaces
Check recent changes: kubectl rollout history deployment/<component>
Restart in order: Database > Kafka > Redis > Gateway > Consumers
Verify health: curl http://gateway:8443/health
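
The ordered restart above can be scripted so that each component is confirmed ready before the next one starts. The following is a minimal sketch. The function name, the `KUBECTL` override variable, and the 300-second timeout are assumptions, and the resource names for the database, Kafka, and Redis are left as placeholders because this runbook does not name them.

```shell
# Restart resources one at a time, waiting for each rollout to finish.
# Stops at the first failure. Set KUBECTL=echo for a dry run.
restart_in_order() {
  local kubectl="${KUBECTL:-kubectl}" resource
  for resource in "$@"; do
    "$kubectl" rollout restart "$resource" || return 1
    "$kubectl" rollout status "$resource" --timeout=300s || return 1
  done
}

# Example (not run here); replace the placeholders with real resource names.
#   restart_in_order <database-resource> <kafka-resource> <redis-resource> \
#     deployment/caracal-gateway deployment/caracal-ledger-writer
```

Stopping at the first failure keeps dependent components from starting against a backend that is still down.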

Data Corruption Detected

STOP ALL WRITES:

kubectl scale deployment/caracal-gateway --replicas=0
kubectl scale deployment/caracal-ledger-writer --replicas=0

Preserve Evidence:

caracal backup create --type postgresql --tag "corruption-evidence"
caracal backup create --type kafka --tag "corruption-evidence"

Notify Security Team
Investigate and Recover
