Backup and Restore Guide¶
This guide covers backup and restore procedures for SCP deployments.
Overview¶
SCP includes automated daily backups with the following characteristics:
- Frequency: Daily at 2 AM (configurable)
- Local retention: 7 days
- S3 retention: 30 days (via lifecycle policy)
- Contents: PostgreSQL database + configuration
What's Backed Up¶
| Component | Included | Notes |
|---|---|---|
| PostgreSQL database | Yes | All tables (agents, bundles, etc.) |
| Configuration (.env) | Yes | Including secrets |
| Docker volumes | No | Ephemeral, rebuilt from DB |
| Logs | No | Available in CloudWatch |
Automated Backups¶
Automated backups run daily via cron:
# View cron schedule
crontab -l
# 0 2 * * * /opt/scp/backup.sh >> /var/log/scp-backup.log 2>&1
Verify Backups¶
# Check recent backup logs
tail -50 /var/log/scp-backup.log
# List local backups
ls -la /opt/scp/backups/
# List S3 backups
aws s3 ls s3://${BACKUP_BUCKET}/backups/ --recursive
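The manual checks above can also be scripted. The following is a minimal sketch, not part of SCP itself: the function name has_recent_backup is hypothetical, the scp-backup-*.tar.gz pattern is inferred from the example filenames in this guide, and the 26-hour default mirrors the alert window suggested under Alerts.

```shell
# Sketch: succeed only if a recent backup archive exists in the given directory.
# usage: has_recent_backup DIR [MAX_AGE_MINUTES]
#   has_recent_backup /opt/scp/backups || echo "No backup in the last 26 hours" >&2
has_recent_backup() {
    dir=$1
    max_age=${2:-1560}   # 26 hours, expressed in minutes
    # -mmin -N matches files modified less than N minutes ago (GNU and BSD find)
    recent=$(find "$dir" -type f -name 'scp-backup-*.tar.gz' -mmin "-$max_age" | head -n 1)
    [ -n "$recent" ]
}
```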
Manual Backup¶
Create Immediate Backup¶
# Full backup (local + S3)
/opt/scp/backup.sh
# Local only (no S3 upload)
/opt/scp/backup.sh --local
Backup Before Major Changes¶
Always back up before:
- Version updates
- Configuration changes
- Database migrations
- Bundle imports
# Create backup before update
/opt/scp/backup.sh
# Proceed with changes...
/opt/scp/update.sh 0.3.1
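To make sure an update never runs after a failed backup, the two commands can be chained. This sketch assumes backup.sh exits with a non-zero status on failure, which this guide does not confirm; the function name and the BACKUP_CMD and UPDATE_CMD overrides are hypothetical and exist only to make the sketch testable.

```shell
# Sketch: run the update only if the backup succeeded.
# Assumes backup.sh returns a non-zero exit status when it fails.
# usage: backup_then_update 0.3.1
backup_then_update() {
    version=$1
    backup_cmd=${BACKUP_CMD:-/opt/scp/backup.sh}
    update_cmd=${UPDATE_CMD:-/opt/scp/update.sh}
    if ! "$backup_cmd"; then
        echo "Backup failed; skipping update to $version" >&2
        return 1
    fi
    "$update_cmd" "$version"
}
```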
Restore Procedures¶
Restore from Local Backup¶
# List available backups
ls -la /opt/scp/backups/
# Restore from local file
/opt/scp/backup.sh --restore /opt/scp/backups/scp-backup-20260201_020000.tar.gz
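If you usually want the newest local backup, its path can be looked up rather than typed. This is a sketch under the assumption that all archives follow the scp-backup-YYYYMMDD_HHMMSS.tar.gz naming seen in the example above; latest_local_backup is a hypothetical helper name.

```shell
# Sketch: print the path of the newest local backup, or nothing if none exist.
# usage: latest_local_backup [DIR]
#   newest=$(latest_local_backup)
#   [ -n "$newest" ] && /opt/scp/backup.sh --restore "$newest"
latest_local_backup() {
    backup_dir=${1:-/opt/scp/backups}
    # YYYYMMDD_HHMMSS timestamps sort chronologically as plain text
    ls "$backup_dir"/scp-backup-*.tar.gz 2>/dev/null | sort | tail -n 1
}
```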
Restore from S3¶
# List S3 backups
aws s3 ls s3://${BACKUP_BUCKET}/backups/ --recursive
# Restore from S3 (downloads automatically)
/opt/scp/backup.sh --restore s3://${BACKUP_BUCKET}/backups/2026/02/scp-backup-20260201_020000.tar.gz
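The same lookup can be done against S3. The sketch below assumes the backups/YYYY/MM/scp-backup-*.tar.gz key layout shown in the example above, so that sorting keys as text also sorts them by date; latest_s3_backup is a hypothetical helper name.

```shell
# Sketch: print the S3 URL of the newest backup in the bucket.
# Returns non-zero if no matching object is found.
# usage: latest_s3_backup "$BACKUP_BUCKET"
latest_s3_backup() {
    bucket=$1
    # "aws s3 ls --recursive" prints: date, time, size, key
    key=$(aws s3 ls "s3://$bucket/backups/" --recursive \
        | awk '$4 ~ /scp-backup-[0-9_]+\.tar\.gz$/ { print $4 }' \
        | sort | tail -n 1)
    [ -n "$key" ] && printf 's3://%s/%s\n' "$bucket" "$key"
}
```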
What Restore Does¶
The restore process:
- Downloads backup from S3 (if needed)
- Stops all SCP services
- Starts PostgreSQL only
- Restores database from dump
- Restores configuration
- Restarts all services
Warning: Restore is destructive. It replaces current data with backup contents.
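Because a restore overwrites current data, one precaution is to take a local backup of the current state immediately beforehand. The sketch below uses only the --local and --restore flags documented in this guide; the function name safe_restore and the BACKUP_CMD override are hypothetical, and it assumes backup.sh exits non-zero on failure.

```shell
# Sketch: snapshot the current state locally, then restore the requested backup.
# usage: safe_restore /opt/scp/backups/scp-backup-20260201_020000.tar.gz
safe_restore() {
    target=$1
    backup_cmd=${BACKUP_CMD:-/opt/scp/backup.sh}
    # Keep a copy of the current data so the restore itself can be undone
    if ! "$backup_cmd" --local; then
        echo "Pre-restore backup failed; not restoring" >&2
        return 1
    fi
    "$backup_cmd" --restore "$target"
}
```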
Point-in-Time Recovery¶
Daily backups can lose up to 24 hours of data. For more granular recovery, consider PostgreSQL point-in-time recovery using one of the following options:
Option 1: RDS (Managed)¶
Switch to RDS for PostgreSQL with automated backups:
- Point-in-time recovery up to 35 days
- Automated snapshots
- Multi-AZ for high availability
Option 2: WAL Archiving (Self-Managed)¶
Enable WAL archiving to S3:
# postgresql.conf
archive_mode = on
archive_command = 'aws s3 cp %p s3://bucket/wal/%f'
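The PostgreSQL documentation recommends that an archive command refuse to overwrite a segment that already exists in the archive, which a plain aws s3 cp does not do. A wrapper along the following lines is one way to add that check. It is a sketch only: the WAL_BUCKET variable, the function name, and the script path in the comment are hypothetical, and it assumes the AWS CLI is available wherever PostgreSQL runs the archive command.

```shell
# Sketch: archive one WAL segment to S3 without overwriting an existing copy.
# Saved as a script that ends with:  archive_wal "$@"
# postgresql.conf (hypothetical path): archive_command = '/opt/scp/archive-wal.sh %p %f'
archive_wal() {
    wal_path=$1   # %p: path of the segment to archive
    wal_name=$2   # %f: file name of the segment
    key="wal/$wal_name"
    # head-object succeeds only if this exact key already exists
    if aws s3api head-object --bucket "$WAL_BUCKET" --key "$key" >/dev/null 2>&1; then
        echo "WAL segment $wal_name is already archived; refusing to overwrite" >&2
        return 1
    fi
    aws s3 cp "$wal_path" "s3://$WAL_BUCKET/$key"
}
```

A non-zero exit tells PostgreSQL that archiving failed, so it keeps the segment and retries later.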
Disaster Recovery¶
Complete Instance Loss¶
If the EC2 instance is lost:
- Deploy new stack using CloudFormation
- Restore from S3 backup:
/opt/scp/backup.sh --restore s3://${BACKUP_BUCKET}/backups/latest.tar.gz
- Update DNS to point to new ALB (if needed)
Database Corruption¶
If database becomes corrupted:
- Stop services:
docker compose down
- Remove corrupted data:
sudo rm -rf /opt/scp/data/postgres/*
- Restore from backup:
/opt/scp/backup.sh --restore <backup-file>
Configuration Loss¶
If .env is lost but database is intact:
- Retrieve secrets from SSM:
aws ssm get-parameter --name /scp/production/db-password --with-decryption
aws ssm get-parameter --name /scp/production/jwt-secret --with-decryption
- Run first-boot script:
/opt/scp/first-boot.sh
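By default, get-parameter prints a JSON document. If you need only the secret value, for example to capture it in a variable, the standard --query and --output options can extract it. The helper name get_secret is hypothetical, and the /scp/production/ prefix is taken from the commands above.

```shell
# Sketch: print only the decrypted value of one SSM parameter.
# usage: db_password=$(get_secret db-password)
get_secret() {
    aws ssm get-parameter \
        --name "/scp/production/$1" \
        --with-decryption \
        --query 'Parameter.Value' \
        --output text
}
```

Capturing the value in a variable, rather than printing it, keeps the secret out of terminal scrollback.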
Backup Monitoring¶
CloudWatch Metrics¶
Monitor backup health with custom metrics:
# In backup.sh, add:
aws cloudwatch put-metric-data \
--namespace SCP \
--metric-name BackupSuccess \
--value 1 \
--dimensions Environment=production
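The command above always reports success. A variant that reports 1 or 0 depending on the backup's exit status might look like the sketch below. The function name is hypothetical, and where it is called from (the end of backup.sh or a wrapper around it) depends on how your backup.sh is structured.

```shell
# Sketch: publish BackupSuccess=1 on success and 0 on failure.
# usage: /opt/scp/backup.sh; report_backup_status $?
report_backup_status() {
    if [ "$1" -eq 0 ]; then value=1; else value=0; fi
    aws cloudwatch put-metric-data \
        --namespace SCP \
        --metric-name BackupSuccess \
        --value "$value" \
        --dimensions Environment=production
}
```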
Alerts¶
Set up CloudWatch alarms for backup failures:
- Create SNS topic for alerts
- Create alarm on BackupSuccess metric
- Alert if no successful backup in 26 hours
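These steps could be carried out with the AWS CLI roughly as follows. This is a sketch: the topic name scp-backup-alerts and alarm name scp-backup-missing are hypothetical, and the alarm expresses the 26-hour window as 26 one-hour periods whose Sum must stay below 1, with missing data treated as breaching. You still need to subscribe an endpoint (such as an email address) to the topic.

```shell
# Sketch: create an SNS topic and an alarm that fires when no BackupSuccess=1
# data point has been recorded in the last 26 hours.
create_backup_alarm() {
    # create-topic is idempotent and returns the ARN of an existing topic
    topic_arn=$(aws sns create-topic --name scp-backup-alerts \
        --query 'TopicArn' --output text) || return 1
    aws cloudwatch put-metric-alarm \
        --alarm-name scp-backup-missing \
        --namespace SCP \
        --metric-name BackupSuccess \
        --dimensions Name=Environment,Value=production \
        --statistic Sum \
        --period 3600 \
        --evaluation-periods 26 \
        --threshold 1 \
        --comparison-operator LessThanThreshold \
        --treat-missing-data breaching \
        --alarm-actions "$topic_arn"
}
```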
Verification¶
Periodically verify backups can be restored:
# Restore to test instance
/opt/scp/backup.sh --restore <backup-file>
# Verify services
curl http://localhost:8000/health
curl http://localhost:8001/health
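For unattended restore tests, the two health checks can be wrapped so that the script's exit status reflects the result. This sketch assumes that a healthy service answers /health with a 2xx status, which the guide implies but does not state; check_health is a hypothetical name.

```shell
# Sketch: return non-zero if either health endpoint fails to answer successfully.
check_health() {
    failed=0
    for port in 8000 8001; do
        # -f: treat HTTP errors as failures; -sS: quiet, but still print errors
        if curl -fsS --max-time 10 "http://localhost:$port/health" >/dev/null; then
            echo "port $port: healthy"
        else
            echo "port $port: UNHEALTHY" >&2
            failed=1
        fi
    done
    return "$failed"
}
```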
Retention Policy¶
| Location | Retention | Managed By |
|---|---|---|
| Local | 7 days | backup.sh cleanup |
| S3 | 30 days | S3 lifecycle policy |
| SSM | Indefinite | Manual cleanup |
Modify Retention¶
To change S3 retention, update the CloudFormation template:
LifecycleConfiguration:
  Rules:
    - Id: DeleteOldBackups
      Status: Enabled
      ExpirationInDays: 90  # Change from 30
Troubleshooting¶
Backup Fails¶
Symptoms: ERROR: Database backup failed
Solutions:
1. Check PostgreSQL is running: docker compose ps postgres
2. Check disk space: df -h
3. Verify database credentials in .env
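The disk space check can be turned into a number that a script can compare against a threshold. This is a small sketch; disk_usage_percent is a hypothetical helper, and the 90 percent threshold in the usage comment is only an example.

```shell
# Sketch: print the used-space percentage of the filesystem holding PATH.
# usage: [ "$(disk_usage_percent /opt/scp)" -lt 90 ] || echo "Disk nearly full" >&2
disk_usage_percent() {
    # -P forces one line per filesystem; field 5 is the capacity column (e.g. "42%")
    df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}
```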
S3 Upload Fails¶
Symptoms: WARNING: S3 upload failed
Solutions:
1. Check IAM role permissions
2. Verify bucket exists and is accessible
3. Check network connectivity (NAT Gateway)
Restore Fails¶
Symptoms: ERROR: pg_restore completed with errors
Solutions:
- Minor warnings are often OK (e.g., "role already exists")
- Check for actual errors in output
- Verify backup file isn't corrupted: tar -tzf backup.tar.gz
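The tar check can be wrapped in a function that also catches missing or empty files and reports a clear result. verify_archive is a hypothetical helper name.

```shell
# Sketch: check that a backup archive exists, is non-empty, and is readable.
# usage: verify_archive /opt/scp/backups/scp-backup-20260201_020000.tar.gz
verify_archive() {
    archive=$1
    if [ ! -s "$archive" ]; then
        echo "$archive: missing or empty" >&2
        return 1
    fi
    # Listing the contents forces tar and gzip to read the whole archive
    if tar -tzf "$archive" >/dev/null 2>&1; then
        echo "$archive: OK"
    else
        echo "$archive: corrupted" >&2
        return 1
    fi
}
```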
Large Backup Size¶
If backups are growing too large:
- Check for unnecessary data
- Consider archiving old events/logs
- Implement data retention policies in application
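Before deciding what to trim, it helps to see how backup size has changed over time. The sketch below lists each local archive with its size in bytes, oldest first, again assuming the scp-backup-*.tar.gz naming used in this guide; backup_sizes is a hypothetical helper name.

```shell
# Sketch: print "<bytes><TAB><file name>" for each local backup, oldest first.
# usage: backup_sizes [DIR]
backup_sizes() {
    # The glob expands in sorted order, which is chronological for these names
    for f in "${1:-/opt/scp/backups}"/scp-backup-*.tar.gz; do
        [ -f "$f" ] || continue
        # wc -c reports the size in bytes on any POSIX system
        printf '%s\t%s\n' "$(wc -c < "$f" | tr -d ' ')" "$(basename "$f")"
    done
}
```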
Best Practices¶
- Test restores regularly: don't wait for an emergency
- Monitor backup completion: set up alerts
- Keep multiple copies: local + S3 + periodic offsite
- Document recovery procedures: keep this guide updated
- Practice recovery: run quarterly DR drills