Backup and Restore Guide¶

This guide covers backup and restore procedures for SCP deployments.

Overview¶

SCP includes automated daily backups with the following characteristics:

Frequency: Daily at 2 AM (configurable)
Local retention: 7 days
S3 retention: 30 days (via lifecycle policy)
Contents: PostgreSQL database + configuration

What's Backed Up¶

Component	Included	Notes
PostgreSQL database	Yes	All tables (agents, bundles, etc.)
Configuration (.env)	Yes	Including secrets
Docker volumes	No	Ephemeral, rebuilt from DB
Logs	No	Available in CloudWatch

Automated Backups¶

Automated backups run daily via cron:

# View cron schedule
crontab -l
# 0 2 * * * /opt/scp/backup.sh >> /var/log/scp-backup.log 2>&1

Verify Backups¶

# Check recent backup logs
tail -50 /var/log/scp-backup.log

# List local backups
ls -la /opt/scp/backups/

# List S3 backups
aws s3 ls s3://${BACKUP_BUCKET}/backups/ --recursive
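
The manual checks above can be turned into a scripted freshness test. The following is a minimal sketch, not part of SCP itself: the function name `has_recent_backup` is hypothetical, and the `scp-backup-*.tar.gz` pattern is assumed from the file names shown elsewhere in this guide.

```shell
# Succeed if DIR holds at least one backup archive modified within the
# last MAX_HOURS hours (default 26, i.e. one daily run plus some slack).
has_recent_backup() {
  local dir="$1" max_hours="${2:-26}"
  local recent
  # -mmin -N matches files modified less than N minutes ago.
  recent=$(find "$dir" -maxdepth 1 -type f -name 'scp-backup-*.tar.gz' \
             -mmin "-$((max_hours * 60))" -print | head -n 1)
  [ -n "$recent" ]
}

# Example:
#   has_recent_backup /opt/scp/backups 26 || echo "No backup in the last 26 hours" >&2
```

A check like this could run from cron or a monitoring agent a few hours after the scheduled backup time.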

Manual Backup¶

Create Immediate Backup¶

# Full backup (local + S3)
/opt/scp/backup.sh

# Local only (no S3 upload)
/opt/scp/backup.sh --local

Backup Before Major Changes¶

Always back up before:

- Version updates
- Configuration changes
- Database migrations
- Bundle imports

# Create backup before update
/opt/scp/backup.sh
# Proceed with changes...
/opt/scp/update.sh 0.3.1
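
To make sure a change never proceeds after a failed backup, the two steps can be chained. This is a sketch under one assumption that this guide does not confirm: that `backup.sh` exits with a non-zero status on failure. The helper name `backup_then` is hypothetical.

```shell
# Run the backup command first; run the remaining command only if it succeeded.
backup_then() {
  local backup_cmd="$1"
  shift
  if ! "$backup_cmd"; then
    echo "Backup failed; aborting before making changes." >&2
    return 1
  fi
  "$@"
}

# Example:
#   backup_then /opt/scp/backup.sh /opt/scp/update.sh 0.3.1
```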

Restore Procedures¶

Restore from Local Backup¶

# List available backups
ls -la /opt/scp/backups/

# Restore from local file
/opt/scp/backup.sh --restore /opt/scp/backups/scp-backup-20260201_020000.tar.gz

Restore from S3¶

# List S3 backups
aws s3 ls s3://${BACKUP_BUCKET}/backups/ --recursive

# Restore from S3 (downloads automatically)
/opt/scp/backup.sh --restore s3://${BACKUP_BUCKET}/backups/2026/02/scp-backup-20260201_020000.tar.gz

What Restore Does¶

The restore process:

Downloads backup from S3 (if needed)
Stops all SCP services
Starts PostgreSQL only
Restores database from dump
Restores configuration
Restarts all services

Warning: Restore is destructive. It replaces current data with backup contents.

Point-in-Time Recovery¶

For recovery to a point between daily backups, consider PostgreSQL continuous archiving through one of the following options:

Option 1: RDS (Managed)¶

Switch to RDS for PostgreSQL with automated backups:

- Point-in-time recovery up to 35 days
- Automated snapshots
- Multi-AZ for high availability

Option 2: WAL Archiving (Self-Managed)¶

Enable WAL archiving to S3:

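# postgresql.conf
wal_level = replica   # minimum level required for archiving (the default since PostgreSQL 10)
archive_mode = on     # changing this setting requires a server restart
archive_command = 'aws s3 cp %p s3://bucket/wal/%f'   # replace "bucket" with your bucket name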

Disaster Recovery¶

Complete Instance Loss¶

If the EC2 instance is lost:

Deploy new stack using CloudFormation

Restore from S3 backup:

/opt/scp/backup.sh --restore s3://${BACKUP_BUCKET}/backups/latest.tar.gz

Update DNS to point to new ALB (if needed)
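
If you want to restore a specific dated archive instead, the newest key can be picked out of the bucket listing. The sketch below assumes the `aws s3 ls --recursive` line format of date, time, size, key, and the `scp-backup-*.tar.gz` naming used in this guide. `latest_backup_key` is a hypothetical helper name.

```shell
# Read `aws s3 ls --recursive` output on stdin and print the key of the
# most recently modified backup archive. Prints nothing if none match.
latest_backup_key() {
  grep 'scp-backup-.*\.tar\.gz$' | sort -k1,2 | tail -n 1 | awk '{print $4}'
}

# Example:
#   key=$(aws s3 ls "s3://${BACKUP_BUCKET}/backups/" --recursive | latest_backup_key)
#   /opt/scp/backup.sh --restore "s3://${BACKUP_BUCKET}/${key}"
```

Sorting on the first two fields orders the objects by last-modified timestamp, so the last line is the newest upload.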

Database Corruption¶

If the database becomes corrupted:

Stop services: docker compose down
Remove corrupted data: sudo rm -rf /opt/scp/data/postgres/*
Restore from backup: /opt/scp/backup.sh --restore <backup-file>

Configuration Loss¶

If .env is lost but the database is intact:

Retrieve secrets from SSM:

aws ssm get-parameter --name /scp/production/db-password --with-decryption
aws ssm get-parameter --name /scp/production/jwt-secret --with-decryption

Run first-boot script: /opt/scp/first-boot.sh

Backup Monitoring¶

CloudWatch Metrics¶

Monitor backup health with custom metrics:

# In backup.sh, add:
aws cloudwatch put-metric-data \
  --namespace SCP \
  --metric-name BackupSuccess \
  --value 1 \
  --dimensions Environment=production
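
Publishing the metric only on success means a failed run looks the same as a run that never started. One option is to publish 1 or 0 depending on the exit status. This is a sketch, assuming `backup.sh` returns non-zero on failure. `run_and_report`, `publish_metric`, and the `PUBLISH` override are hypothetical names, the last added only so the logic can be tried without AWS access.

```shell
# Default publisher: the same put-metric-data call shown above,
# with the metric value passed as the first argument.
publish_metric() {
  aws cloudwatch put-metric-data \
    --namespace SCP \
    --metric-name BackupSuccess \
    --value "$1" \
    --dimensions Environment=production
}

# Run the given command, publish 1 on success or 0 on failure,
# and return the command's original exit status.
run_and_report() {
  local status=0 value=1
  "$@" || status=$?
  if [ "$status" -ne 0 ]; then
    value=0
  fi
  "${PUBLISH:-publish_metric}" "$value"
  return "$status"
}

# Example:
#   run_and_report /opt/scp/backup.sh
```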

Alerts¶

Set up CloudWatch alarms for backup failures:

Create SNS topic for alerts
Create alarm on BackupSuccess metric
Alert if no successful backup in 26 hours
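
The alarm itself can be created from the AWS CLI. The sketch below is one possible configuration, not a prescribed one. The alarm name `scp-backup-missing` is hypothetical and the SNS topic ARN must be supplied by you. It evaluates hourly periods and treats missing data as breaching, so the alarm fires when no period in the window recorded a successful backup.

```shell
# Create (or, with DRY_RUN=1, just print) an alarm that fires when the
# BackupSuccess metric has summed to less than 1 for HOURS consecutive hours.
create_backup_alarm() {
  local topic_arn="$1" hours="${2:-26}"
  set -- aws cloudwatch put-metric-alarm \
    --alarm-name scp-backup-missing \
    --namespace SCP \
    --metric-name BackupSuccess \
    --dimensions Name=Environment,Value=production \
    --statistic Sum \
    --period 3600 \
    --evaluation-periods "$hours" \
    --threshold 1 \
    --comparison-operator LessThanThreshold \
    --treat-missing-data breaching \
    --alarm-actions "$topic_arn"
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Example:
#   DRY_RUN=1 create_backup_alarm "<sns-topic-arn>" 26
```

Running with `DRY_RUN=1` first lets you review the exact command before anything is created in your account.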

Verification¶

Periodically verify backups can be restored:

# Run on a separate test instance, never on production: restore overwrites existing data
/opt/scp/backup.sh --restore <backup-file>

# Verify services
curl http://localhost:8000/health
curl http://localhost:8001/health
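
Services may need some time to come up after a restore, so a single `curl` can fail even when the restore worked. A small retry helper covers that case. This is a generic sketch: `retry` is a hypothetical helper, and the attempt count and delay in the example are arbitrary.

```shell
# Run a command up to ATTEMPTS times, sleeping DELAY seconds between tries.
# Succeeds as soon as the command succeeds; fails if every attempt fails.
retry() {
  local attempts="$1" delay="$2" i=1
  shift 2
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  return 1
}

# Example (-f makes curl fail on HTTP error statuses):
#   retry 12 5 curl -fsS http://localhost:8000/health
#   retry 12 5 curl -fsS http://localhost:8001/health
```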

Retention Policy¶

Location	Retention	Managed By
Local	7 days	backup.sh cleanup
S3	30 days	S3 lifecycle policy
SSM	Indefinite	Manual cleanup
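
The cleanup logic inside `backup.sh` is not reproduced in this guide. As a rough illustration of what a local retention rule of this kind can look like, the sketch below assumes archives are named `scp-backup-*.tar.gz`. `prune_local_backups` is a hypothetical name.

```shell
# Delete backup archives in DIR older than DAYS days and print what was removed.
# find rounds file age down to whole days, so "-mtime +7" matches files
# that are at least 8 full days old.
prune_local_backups() {
  local dir="$1" days="${2:-7}"
  find "$dir" -maxdepth 1 -type f -name 'scp-backup-*.tar.gz' \
    -mtime "+$days" -print -delete
}

# Example:
#   prune_local_backups /opt/scp/backups 7
```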

Modify Retention¶

To change S3 retention, update the CloudFormation template:

LifecycleConfiguration:
  Rules:
    - Id: DeleteOldBackups
      Status: Enabled
      ExpirationInDays: 90  # Change from 30

Troubleshooting¶

Backup Fails¶

Symptoms: ERROR: Database backup failed

Solutions:

1. Check PostgreSQL is running: docker compose ps postgres
2. Check disk space: df -h
3. Verify database credentials in .env
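
The disk-space check can also run as a scripted preflight step before a backup. This is a minimal sketch: the helper names and the 1 GiB threshold in the example are hypothetical, so pick a threshold that fits your database size.

```shell
# Print the free space, in KiB, on the filesystem that holds the given path.
free_kb() {
  df -Pk "$1" | awk 'NR==2 {print $4}'
}

# Succeed only if at least MIN_KB KiB are free on the filesystem holding DIR.
has_free_space() {
  local dir="$1" min_kb="$2"
  [ "$(free_kb "$dir")" -ge "$min_kb" ]
}

# Example (1048576 KiB = 1 GiB):
#   has_free_space /opt/scp/backups 1048576 || echo "Less than 1 GiB free" >&2
```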

S3 Upload Fails¶

Symptoms: WARNING: S3 upload failed

Solutions:

1. Check IAM role permissions
2. Verify bucket exists and is accessible
3. Check network connectivity (NAT Gateway)

Restore Fails¶

Symptoms: ERROR: pg_restore completed with errors

Solutions:

- Minor warnings are often OK (e.g., "role already exists")
- Check for actual errors in output
- Verify backup file isn't corrupted: tar -tzf backup.tar.gz
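
Because restore is destructive, it can be worth running the archive integrity check before starting one, not only after a failure. This is a small sketch. `verify_archive` is a hypothetical helper that wraps the same `tar -tzf` check.

```shell
# Succeed only if the file exists and can be read end to end
# as a gzip-compressed tar archive.
verify_archive() {
  tar -tzf "$1" >/dev/null 2>&1
}

# Example:
#   verify_archive /opt/scp/backups/scp-backup-20260201_020000.tar.gz \
#     || echo "Archive is missing or corrupted" >&2
```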

Large Backup Size¶

If backups are growing too large:

Check for unnecessary data
Consider archiving old events/logs
Implement data retention policies in application

Best Practices¶

Test restores regularly - Don't wait for an emergency
Monitor backup completion - Set up alerts
Keep multiple copies - Local + S3 + periodic offsite
Document recovery procedures - Keep this guide updated
Practice recovery - Run quarterly DR drills