Backup and Restore Guide

This guide covers backup and restore procedures for SCP deployments.

Overview

SCP includes automated daily backups with the following characteristics:

  • Frequency: Daily at 2 AM (configurable)
  • Local retention: 7 days
  • S3 retention: 30 days (via lifecycle policy)
  • Contents: PostgreSQL database + configuration

What's Backed Up

Component             Included  Notes
PostgreSQL database   Yes       All tables (agents, bundles, etc.)
Configuration (.env)  Yes       Including secrets
Docker volumes        No        Ephemeral, rebuilt from DB
Logs                  No        Available in CloudWatch

Automated Backups

Automated backups run daily via cron:

# View cron schedule
crontab -l
# 0 2 * * * /opt/scp/backup.sh >> /var/log/scp-backup.log 2>&1

Verify Backups

# Check recent backup logs
tail -50 /var/log/scp-backup.log

# List local backups
ls -la /opt/scp/backups/

# List S3 backups
aws s3 ls s3://${BACKUP_BUCKET}/backups/ --recursive
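
The manual checks above can be turned into a pass/fail test. The following is a minimal sketch, not part of SCP: the function name backup_is_fresh is hypothetical, the scp-backup-*.tar.gz pattern is taken from the file names used elsewhere in this guide, and the 26-hour default mirrors the alert window suggested under Alerts. It assumes a find that supports -maxdepth and -mmin (GNU or BSD).

```shell
# Hypothetical helper: succeeds if DIR contains a backup archive modified
# within the last MAX_HOURS hours (default 26).
backup_is_fresh() {
  fresh_dir="$1"
  fresh_hours="${2:-26}"
  # -mmin -N matches files modified less than N minutes ago.
  fresh_hit=$(find "$fresh_dir" -maxdepth 1 -type f -name 'scp-backup-*.tar.gz' \
    -mmin "-$((fresh_hours * 60))" 2>/dev/null | head -n 1)
  [ -n "$fresh_hit" ]
}
```

For example, backup_is_fresh /opt/scp/backups returns a non-zero status when no archive from the last 26 hours exists, which makes it easy to call from a monitoring script.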

Manual Backup

Create Immediate Backup

# Full backup (local + S3)
/opt/scp/backup.sh

# Local only (no S3 upload)
/opt/scp/backup.sh --local

Backup Before Major Changes

Always back up before:

  • Version updates
  • Configuration changes
  • Database migrations
  • Bundle imports

# Create backup before update
/opt/scp/backup.sh
# Proceed with changes...
/opt/scp/update.sh 0.3.1
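
Run one after the other as above, the update would still start if the backup failed. A small wrapper can make the change conditional on a successful backup. This is a sketch under the assumption that backup.sh exits with a non-zero status on failure; the function name run_after_backup is hypothetical.

```shell
# Hypothetical wrapper: run BACKUP_CMD first, then the remaining arguments
# as the change command, but only if the backup exited successfully.
run_after_backup() {
  rab_backup="$1"
  shift
  if ! "$rab_backup"; then
    echo "Backup failed; not running: $*" >&2
    return 1
  fi
  "$@"
}
```

Usage with the paths from this guide: run_after_backup /opt/scp/backup.sh /opt/scp/update.sh 0.3.1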

Restore Procedures

Restore from Local Backup

# List available backups
ls -la /opt/scp/backups/

# Restore from local file
/opt/scp/backup.sh --restore /opt/scp/backups/scp-backup-20260201_020000.tar.gz
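
Because a restore replaces current data, it can be worth confirming that the archive is readable before starting. The sketch below uses the same tar -tzf check mentioned under Troubleshooting; the function name archive_is_readable is hypothetical. Listing an archive verifies its compression and tar structure only, not the validity of the database dump inside.

```shell
# Hypothetical pre-flight check: succeeds only if FILE exists and is a
# gzip-compressed tar archive that can be listed from start to end.
archive_is_readable() {
  [ -f "$1" ] && tar -tzf "$1" >/dev/null 2>&1
}
```

For example: archive_is_readable /opt/scp/backups/scp-backup-20260201_020000.tar.gz && /opt/scp/backup.sh --restore /opt/scp/backups/scp-backup-20260201_020000.tar.gz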

Restore from S3

# List S3 backups
aws s3 ls s3://${BACKUP_BUCKET}/backups/ --recursive

# Restore from S3 (downloads automatically)
/opt/scp/backup.sh --restore s3://${BACKUP_BUCKET}/backups/2026/02/scp-backup-20260201_020000.tar.gz

What Restore Does

The restore process:

  1. Downloads backup from S3 (if needed)
  2. Stops all SCP services
  3. Starts PostgreSQL only
  4. Restores database from dump
  5. Restores configuration
  6. Restarts all services

Warning: Restore is destructive. It replaces current data with backup contents.
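
When restores are run by hand, an explicit confirmation step reduces the risk of overwriting data by accident. The following guard is an illustrative sketch, not something backup.sh is known to provide; the function name confirm_restore and the confirmation word are hypothetical.

```shell
# Hypothetical guard: read one line from stdin and succeed only if it is
# exactly the word RESTORE.
confirm_restore() {
  printf 'This replaces current data. Type RESTORE to continue: ' >&2
  IFS= read -r confirm_answer || return 1
  [ "$confirm_answer" = "RESTORE" ]
}
```

Usage: confirm_restore && /opt/scp/backup.sh --restore <backup-file>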

Point-in-Time Recovery

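A daily backup can only restore to the moment the backup was taken. For more granular recovery, consider one of the following options, both of which rely on PostgreSQL continuous archiving: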

Option 1: RDS (Managed)

Switch to RDS for PostgreSQL with automated backups:

  • Point-in-time recovery up to 35 days
  • Automated snapshots
  • Multi-AZ for high availability

Option 2: WAL Archiving (Self-Managed)

Enable WAL archiving to S3:

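# postgresql.conf
wal_level = replica   # minimum level required for archiving
archive_mode = on     # changing this requires a server restart
archive_command = 'aws s3 cp %p s3://bucket/wal/%f'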
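
PostgreSQL treats a zero exit status from archive_command as confirmation that the segment is archived, and retries the segment on any non-zero status. A wrapper script makes that contract explicit. The sketch below is an assumption-laden example: the function name archive_wal and the WAL_DEST and AWS_CLI variables are hypothetical, and s3://bucket/wal is the placeholder from the configuration above.

```shell
# Hypothetical wrapper for archive_command, called with %p and %f.
# WAL_DEST and AWS_CLI are illustrative overrides, not PostgreSQL settings.
archive_wal() {
  wal_src="$1"    # %p: path of the WAL segment to archive
  wal_name="$2"   # %f: file name of the segment
  wal_dest="${WAL_DEST:-s3://bucket/wal}"
  wal_cli="${AWS_CLI:-aws}"
  [ -f "$wal_src" ] || { echo "missing WAL segment: $wal_src" >&2; return 1; }
  # The upload's exit status becomes the function's status, so a failed
  # copy is reported to PostgreSQL and the segment is retried later.
  "$wal_cli" s3 cp "$wal_src" "$wal_dest/$wal_name"
}
```

Saved as a script that calls archive_wal "$1" "$2", it could be referenced as archive_command = '/path/to/script %p %f', where the script path is up to the deployment.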

Disaster Recovery

Complete Instance Loss

If the EC2 instance is lost:

  1. Deploy new stack using CloudFormation
  2. Restore from S3 backup:
    /opt/scp/backup.sh --restore s3://${BACKUP_BUCKET}/backups/latest.tar.gz
    
  3. Update DNS to point to new ALB (if needed)
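
If the bucket has no latest.tar.gz object, the newest backup can be picked from the S3 listing instead. This sketch assumes the scp-backup-YYYYMMDD_HHMMSS.tar.gz naming shown in this guide and keys without spaces; the function name latest_backup_key is hypothetical.

```shell
# Hypothetical helper: read "aws s3 ls --recursive" output on stdin and print
# the key of the newest backup, using the timestamp embedded in the file name.
latest_backup_key() {
  awk '{ print $4 }' \
    | grep -E '(^|/)scp-backup-[0-9]{8}_[0-9]{6}\.tar\.gz$' \
    | awk -F/ '{ print $NF "\t" $0 }' \
    | sort \
    | tail -n 1 \
    | cut -f 2
}
```

Usage: pipe aws s3 ls s3://${BACKUP_BUCKET}/backups/ --recursive into latest_backup_key, then pass s3://${BACKUP_BUCKET}/ followed by the printed key to /opt/scp/backup.sh --restore.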

Database Corruption

If the database becomes corrupted:

  1. Stop services: docker compose down
  2. Remove corrupted data: sudo rm -rf /opt/scp/data/postgres/*
  3. Restore from backup: /opt/scp/backup.sh --restore <backup-file>

Configuration Loss

If .env is lost but the database is intact:

  1. Retrieve secrets from SSM:
    aws ssm get-parameter --name /scp/production/db-password --with-decryption
    aws ssm get-parameter --name /scp/production/jwt-secret --with-decryption
    
  2. Run first-boot script: /opt/scp/first-boot.sh
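
By default the get-parameter calls in step 1 print a JSON document. When only the secret value is needed, the --query and --output options of the AWS CLI can extract it. A minimal sketch follows; the function name ssm_value and the AWS_CLI override are hypothetical.

```shell
# Hypothetical helper: print only the decrypted value of one SSM parameter.
# AWS_CLI is an illustrative override so the function can be exercised offline.
ssm_value() {
  "${AWS_CLI:-aws}" ssm get-parameter \
    --name "$1" \
    --with-decryption \
    --query 'Parameter.Value' \
    --output text
}
```

Usage: ssm_value /scp/production/db-password. Avoid redirecting the output to log files or other locations where secrets should not be stored.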

Backup Monitoring

CloudWatch Metrics

Monitor backup health with custom metrics:

# In backup.sh, add:
aws cloudwatch put-metric-data \
  --namespace SCP \
  --metric-name BackupSuccess \
  --value 1 \
  --dimensions Environment=production
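
Publishing a value of 1 only on success leaves failures visible solely as missing data. One option is to publish 0 when the backup fails, so that failed runs and runs that never started can be told apart. This is a sketch: the function name report_backup_result and the AWS_CLI override are hypothetical, and how the exit status is obtained depends on how backup.sh is structured.

```shell
# Hypothetical helper: publish BackupSuccess=1 when STATUS is 0, otherwise 0.
# AWS_CLI is an illustrative override so the function can be exercised offline.
report_backup_result() {
  if [ "$1" -eq 0 ]; then metric_value=1; else metric_value=0; fi
  "${AWS_CLI:-aws}" cloudwatch put-metric-data \
    --namespace SCP \
    --metric-name BackupSuccess \
    --value "$metric_value" \
    --dimensions Environment=production
}
```

Usage: call report_backup_result with the exit status of the backup step, for example report_backup_result $? immediately after it.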

Alerts

Set up CloudWatch alarms for backup failures:

  1. Create SNS topic for alerts
  2. Create alarm on BackupSuccess metric
  3. Alert if no successful backup in 26 hours
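
Steps 2 and 3 could be implemented with aws cloudwatch put-metric-alarm. The sketch below evaluates 26 one-hour periods and treats missing data as breaching, so the alarm fires only when no BackupSuccess value of 1 has arrived in that window. The alarm name, the function name create_backup_alarm, and the AWS_CLI override are hypothetical; check the current CloudWatch limits on period multiplied by evaluation periods before relying on it.

```shell
# Hypothetical helper: create an alarm that fires when no BackupSuccess=1
# datapoint has arrived for 26 consecutive one-hour periods.
# $1 is the ARN of the SNS topic created in step 1.
create_backup_alarm() {
  "${AWS_CLI:-aws}" cloudwatch put-metric-alarm \
    --alarm-name scp-backup-missing \
    --namespace SCP \
    --metric-name BackupSuccess \
    --dimensions Name=Environment,Value=production \
    --statistic Maximum \
    --period 3600 \
    --evaluation-periods 26 \
    --threshold 1 \
    --comparison-operator LessThanThreshold \
    --treat-missing-data breaching \
    --alarm-actions "$1"
}
```

Note that put-metric-alarm expects dimensions in Name=...,Value=... form, which differs from the Environment=production shorthand accepted by put-metric-data.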

Verification

Periodically verify backups can be restored:

# Restore to test instance
/opt/scp/backup.sh --restore <backup-file>

# Verify services
curl http://localhost:8000/health
curl http://localhost:8001/health
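
Services may need some time to come up after a restore, so a single curl call can fail even when the restore worked. A retry loop avoids that false negative. This is a generic sketch; the function name wait_until is hypothetical, and suitable retry counts depend on how long the services take to start.

```shell
# Hypothetical retry helper: run a command up to TRIES times, sleeping DELAY
# seconds between attempts, and succeed as soon as the command does.
wait_until() {
  wu_tries="$1"
  wu_delay="$2"
  shift 2
  wu_n=1
  while [ "$wu_n" -le "$wu_tries" ]; do
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    wu_n=$((wu_n + 1))
    # Skip the sleep once every attempt has been used.
    if [ "$wu_n" -le "$wu_tries" ]; then
      sleep "$wu_delay"
    fi
  done
  return 1
}
```

Usage: wait_until 10 3 curl -fsS http://localhost:8000/health. The -f option makes curl return a non-zero status on HTTP error responses, so an unhealthy endpoint counts as a failed attempt.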

Retention Policy

Location  Retention   Managed By
Local     7 days      backup.sh cleanup
S3        30 days     S3 lifecycle policy
SSM       Indefinite  Manual cleanup
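
For reference, a local cleanup step of the kind listed above is commonly written with find. The sketch below is an illustration only and may differ from what backup.sh actually does; the function name prune_local_backups is hypothetical, and the file pattern is taken from the backup names used in this guide.

```shell
# Hypothetical cleanup: delete backup archives in DIR whose age, counted in
# whole 24-hour periods, is greater than DAYS (default 7).
prune_local_backups() {
  find "$1" -maxdepth 1 -type f -name 'scp-backup-*.tar.gz' \
    -mtime "+${2:-7}" -exec rm -f {} +
}
```

Usage: prune_local_backups /opt/scp/backups 7. Files that do not match the backup pattern are left untouched.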

Modify Retention

To change S3 retention, update the CloudFormation template:

LifecycleConfiguration:
  Rules:
    - Id: DeleteOldBackups
      Status: Enabled
      ExpirationInDays: 90  # Change from 30

Troubleshooting

Backup Fails

Symptoms: ERROR: Database backup failed

Solutions:

  1. Check PostgreSQL is running: docker compose ps postgres
  2. Check disk space: df -h
  3. Verify database credentials in .env
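
The disk space check can also be scripted so that a backup is not attempted on a nearly full volume. This is a minimal sketch; the function name has_free_space is hypothetical, and the threshold to use depends on the size of your backups.

```shell
# Hypothetical check: succeed if the filesystem holding DIR has at least
# MIN_KB kilobytes available.
has_free_space() {
  # df -P prints one POSIX-format line per filesystem; with -k, column 4 is
  # the available space in 1024-byte blocks.
  free_kb=$(df -Pk "$1" | awk 'NR == 2 { print $4 }')
  [ -n "$free_kb" ] && [ "$free_kb" -ge "$2" ]
}
```

Usage: has_free_space /opt/scp/backups 1048576 succeeds when at least 1 GiB is available.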

S3 Upload Fails

Symptoms: WARNING: S3 upload failed

Solutions:

  1. Check IAM role permissions
  2. Verify bucket exists and is accessible
  3. Check network connectivity (NAT Gateway)

Restore Fails

Symptoms: ERROR: pg_restore completed with errors

Solutions:

  • Minor warnings are often OK (e.g., "role already exists")
  • Check for actual errors in output
  • Verify backup file isn't corrupted: tar -tzf backup.tar.gz

Large Backup Size

If backups are growing too large:

  1. Check for unnecessary data
  2. Consider archiving old events/logs
  3. Implement data retention policies in application

Best Practices

  1. Test restores regularly - Don't wait for an emergency
  2. Monitor backup completion - Set up alerts
  3. Keep multiple copies - Local + S3 + periodic offsite
  4. Document recovery procedures - Keep this guide updated
  5. Practice recovery - Run quarterly DR drills