Skip to main content

Overview

The RDS Module deploys an Amazon Aurora PostgreSQL database cluster optimized for the Artos platform. Aurora provides a fully-managed, highly available database with automated backups, point-in-time recovery, and read scaling capabilities. The module handles secure credential management, network isolation, parameter optimization, and monitoring integration.

Key Features

  • High Availability: Multi-AZ deployment with automatic failover
  • Secure Credentials: Automated password generation and Secrets Manager storage
  • Encryption: Data encrypted at rest using KMS and in transit using SSL
  • Automated Backups: Continuous backups with configurable retention
  • Performance Insights: Built-in query performance monitoring
  • Read Scaling: Support for multiple reader instances
  • Serverless Option: Aurora Serverless v2 for variable workloads

Core Components

1. Database Credentials Management

The module automatically generates and securely stores database credentials.

Random Password Generation

Purpose: Creates a cryptographically secure password for the database master user. Configuration:
  • Length: 32 characters
  • Character Set: Letters, numbers, and special characters
  • Special Characters: !#$%&*()-_=+[]{}|;:,.<>?~ (excludes problematic characters)
Security Benefits:
  • No hardcoded passwords in code or configuration
  • Meets complexity requirements for enterprise security policies
  • Unique per deployment

Secrets Manager Storage

Purpose: Securely stores database credentials for application access. Secret Name: {db_identifier}-password Secret Content:
{
  "username": "artos_admin",
  "password": "<generated-password>"
}
Recovery Window:
  • Production: 30 days (allows recovery from accidental deletion)
  • Non-Production: 0 days (immediate deletion for faster iteration)
Lifecycle Management: The secret version has ignore_changes on secret_string to prevent Terraform from updating the password after initial creation. This means:
  • Password rotations are managed outside Terraform
  • State file doesn’t contain password updates
  • Manual rotation doesn’t trigger Terraform changes
Accessing Credentials: Applications retrieve credentials from Secrets Manager using IAM authentication. The IAM module grants the necessary permissions to application pods.

2. Aurora PostgreSQL Cluster

The Aurora cluster is the logical database container that manages replication, backups, and endpoints. Engine: Aurora PostgreSQL (MySQL-compatible API on PostgreSQL engine) Engine Version: 15.4 (default, configurable) Database Name: artos (default, configurable) Key Configurations:

Storage and Encryption

  • Storage Type: Aurora Serverless storage (auto-scaling)
  • Encryption: Enabled by default using KMS
  • KMS Key: Customer-managed key for compliance requirements
Benefits:
  • Storage automatically grows from 10GB to 128TB
  • Only pay for storage actually used
  • KMS encryption enables fine-grained access control

Backup Configuration

Automated Backups:
  • Retention Period: 7 days (default), configurable up to 35 days
  • Backup Window: 03:00-04:00 UTC (default, configurable)
  • Continuous Backups: Transaction logs continuously backed up to S3
Point-in-Time Recovery:
  • Restore to any second within retention period
  • No impact on production performance
  • Fast restore times (minutes vs hours for snapshots)
Final Snapshot:
  • skip_final_snapshot: false (default)
  • Creates final snapshot when cluster is deleted
  • Enables data recovery after accidental deletion
Production Protection: The module sets deletion_protection = true by default. You must explicitly disable deletion protection before destroying the cluster to prevent accidental data loss.

Maintenance Configuration

Preferred Maintenance Window: sun:04:00-sun:05:00 (default) What Happens During Maintenance:
  • Minor version patches applied
  • Security updates installed
  • Parameter changes requiring reboot take effect
Best Practices:
  • Schedule during low-traffic periods
  • Coordinate with application maintenance windows
  • Monitor CloudWatch alarms during maintenance

CloudWatch Logs Integration

Enabled Logs: PostgreSQL logs Log Group: /aws/rds/cluster/{db_identifier}/postgresql Retention: 7 days (default, configurable) Log Contents:
  • SQL queries (if logging enabled)
  • Connection events
  • Error messages
  • Slow query logs
  • Database startup/shutdown events

3. Cluster Instances

Aurora uses a cluster architecture with separate compute instances. Instance Configuration:
ComponentDescription
Instance Count2 (default) - 1 writer, 1 reader
Instance Classdb.r6g.large (default)
Writer InstanceHandles all write operations
Reader InstancesHandle read operations for scaling
High Availability: Multi-AZ Deployment:
  • Instances automatically deployed across multiple availability zones
  • Aurora maintains 6 copies of data across 3 AZs
  • Automatic failover in case of AZ failure (typically < 30 seconds)
Failover Process:
  1. Primary instance becomes unavailable
  2. Aurora promotes a reader instance to writer
  3. DNS endpoint automatically updates
  4. Applications reconnect transparently
Instance Classes: General Purpose (R6g - AWS Graviton2):
  • db.r6g.large: 2 vCPU, 16 GB RAM
  • db.r6g.xlarge: 4 vCPU, 32 GB RAM
  • db.r6g.2xlarge: 8 vCPU, 64 GB RAM
When to Use:
  • Standard production workloads
  • Predictable traffic patterns
  • Consistent performance requirements
Serverless v2:
  • db.serverless: Auto-scaling compute
  • ACUs (Aurora Capacity Units): 0.5 to 4 (default)
  • Scales in 0.5 ACU increments
When to Use:
  • Variable workloads with unpredictable traffic
  • Development/staging environments
  • Applications with long idle periods
Serverless Configuration:
instance_class = "db.serverless"
serverless_min_capacity = 0.5  # 1 GB RAM
serverless_max_capacity = 4     # 8 GB RAM

4. Parameter Group

The parameter group defines database configuration settings. Family: aurora-postgresql16 Configured Parameters:

max_connections

Default: 1000 connections Purpose: Maximum number of concurrent database connections. Formula: Based on instance memory
max_connections = (DBInstanceClassMemory / 9531392)
Tuning Recommendations:
  • Connection Pooling: Use PgBouncer or application-level pooling
  • Monitor: Track connection usage via CloudWatch metric DatabaseConnections
  • Adjust: Increase if seeing “too many connections” errors
Common Values:
  • Development: 100-200
  • Staging: 500-1000
  • Production: 1000-2000

shared_buffers

Default: 262144 KB (256 MB) Purpose: Memory used for caching data pages. Best Practice: Set to 25% of available memory for dedicated database server. Formula for Aurora:
shared_buffers = {DBInstanceClassMemory / 10922} KB
Examples:
  • db.r6g.large (16 GB): ~4 GB shared_buffers
  • db.r6g.xlarge (32 GB): ~8 GB shared_buffers
Impact:
  • Higher values: Better cache hit ratio, fewer disk reads
  • Too high: Less memory for other operations
  • Requires reboot to apply changes
Apply Method: Both parameters use apply_method = "pending-reboot", meaning changes take effect after cluster restart. Plan parameter changes during maintenance windows.

5. Network Configuration

DB Subnet Group

Purpose: Defines which subnets the RDS cluster can use. Requirements:
  • Minimum 2 subnets in different availability zones
  • Subnets must be in database tier (isolated from application tier)
  • All subnets in same VPC
Why Multiple AZs:
  • Required for Multi-AZ deployments
  • Enables automatic failover
  • Aurora distributes replicas across AZs

Security Group

Purpose: Controls network access to the database. Ingress Rules:
PortProtocolSourceDescription
5432TCPAllowed CIDR blocksPostgreSQL from specified networks
5432TCPAllowed security groupsPostgreSQL from EKS nodes
Egress Rules:
  • All traffic to VPC CIDR (enables communication within VPC)
Typical Configuration:
allowed_security_groups = [
  module.eks.eks_nodes_security_group_id
]
allowed_cidr_blocks = []  # Rely on security groups for access control
Security Best Practices:
  • Prefer security group references over CIDR blocks
  • Never use 0.0.0.0/0 for database access
  • Use bastion host for administrative access
  • Enable VPC Flow Logs to monitor connection attempts

6. Cluster Endpoints

Aurora provides multiple endpoints for different access patterns.

Writer Endpoint

Format: {cluster-identifier}.cluster-xxxxx.{region}.rds.amazonaws.com Purpose: All write operations and consistent reads Use Cases:
  • INSERT, UPDATE, DELETE operations
  • Schema changes (DDL)
  • Transactions requiring consistency
  • Administrative operations
Connection String:
DATABASE_URL = "postgresql://artos_admin:<password>@artos-production.cluster-xxxxx.us-east-1.rds.amazonaws.com:5432/artos"

Reader Endpoint

Format: {cluster-identifier}.cluster-ro-xxxxx.{region}.rds.amazonaws.com Purpose: Load-balanced read operations across reader instances Use Cases:
  • SELECT queries for reporting
  • Analytics workloads
  • Read-only API endpoints
  • Background processing tasks
Load Balancing:
  • Aurora automatically distributes connections across readers
  • Round-robin with session-level stickiness
  • Unhealthy readers automatically removed from rotation

Module Configuration

Basic Configuration

module "rds" {
  source = "./modules/rds"

  db_identifier = "artos-production"
  database_name = "artos"
  db_username   = "artos_admin"
  
  # Instance configuration
  instance_class = "db.r6g.large"
  instance_count = 2  # 1 writer + 1 reader
  engine_version = "15.4"
  
  # Network configuration
  vpc_id     = module.networking.vpc_id
  vpc_cidr   = module.networking.vpc_cidr_block
  subnet_ids = module.networking.database_subnet_ids
  
  # Security
  allowed_security_groups = [
    module.networking.eks_nodes_security_group_id
  ]
  kms_key_arn = module.kms.key_arn
  
  # Backups
  backup_retention_period = 7
  
  # Environment
  environment = "production"
  
  tags = {
    Environment = "production"
  }
}

Production Configuration

module "rds_production" {
  source = "./modules/rds"

  db_identifier = "artos-production"
  database_name = "artos"
  db_username   = "artos_admin"
  
  # Larger instances for production
  instance_class = "db.r6g.xlarge"
  instance_count = 3  # 1 writer + 2 readers for read scaling
  engine_version = "15.4"
  
  # Network configuration
  vpc_id     = module.networking.vpc_id
  vpc_cidr   = module.networking.vpc_cidr_block
  subnet_ids = module.networking.database_subnet_ids
  
  # Security
  allowed_security_groups = [
    module.networking.eks_nodes_security_group_id
  ]
  kms_key_arn = module.kms.key_arn
  
  # Extended backup retention for production
  backup_retention_period      = 30
  preferred_backup_window      = "03:00-04:00"
  preferred_maintenance_window = "sun:04:00-sun:05:00"
  
  # Protection settings
  skip_final_snapshot  = false
  deletion_protection  = true
  
  # Performance tuning
  max_connections = "2000"
  shared_buffers  = "524288"  # 512 MB
  
  # Monitoring
  performance_insights_enabled = true
  monitoring_interval          = 60
  log_retention_days           = 30
  
  environment = "production"
  
  tags = {
    Environment = "production"
    Backup      = "daily"
    Compliance  = "required"
  }
}

Serverless v2 Configuration

module "rds_dev" {
  source = "./modules/rds"

  db_identifier = "artos-dev"
  database_name = "artos"
  db_username   = "artos_admin"
  
  # Serverless v2 for variable workloads
  instance_class = "db.serverless"
  instance_count = 1
  engine_version = "15.4"
  
  # Serverless scaling
  serverless_min_capacity = 0.5   # 1 GB RAM
  serverless_max_capacity = 4     # 8 GB RAM
  
  # Network configuration
  vpc_id     = module.networking.vpc_id
  vpc_cidr   = module.networking.vpc_cidr_block
  subnet_ids = module.networking.database_subnet_ids
  
  # Security
  allowed_security_groups = [
    module.networking.eks_nodes_security_group_id
  ]
  kms_key_arn = module.kms.key_arn
  
  # Minimal backups for development
  backup_retention_period = 1
  skip_final_snapshot     = true
  deletion_protection     = false
  
  # Development settings
  performance_insights_enabled = false
  monitoring_interval          = 0  # Disable enhanced monitoring
  log_retention_days           = 3
  
  environment = "development"
  
  tags = {
    Environment  = "development"
    AutoShutdown = "true"
  }
}

Connecting to the Database

From Application Pods

Applications access the database using credentials from Secrets Manager: Python (using psycopg2):
import json
import boto3
import psycopg2

# Retrieve credentials from Secrets Manager
secrets_client = boto3.client('secretsmanager')
secret_value = secrets_client.get_secret_value(SecretId='artos-production-password')
credentials = json.loads(secret_value['SecretString'])

# Connect to database
conn = psycopg2.connect(
    host='artos-production.cluster-xxxxx.us-east-1.rds.amazonaws.com',
    port=5432,
    database='artos',
    user=credentials['username'],
    password=credentials['password'],
    sslmode='require'
)

From Bastion Host

For administrative access and troubleshooting: SSH to Bastion:
aws ssm start-session --target <bastion-instance-id> --region us-east-1
or use the connect-bastion.sh script:
chmod +x ./scripts/connect-bastion.sh
./scripts/connect-bastion.sh
Connect to Database:
# Retrieve password from Secrets Manager
export DB_PASSWORD=$(aws secretsmanager get-secret-value \
  --secret-id artos-production-password \
  --query SecretString \
  --output text | jq -r '.password')

# Connect using psql
psql -h artos-production.cluster-xxxxx.us-east-1.rds.amazonaws.com \
     -U artos_admin \
     -d artos \
     -p 5432

# Or export PGPASSWORD to avoid prompt
export PGPASSWORD=$DB_PASSWORD
psql -h artos-production.cluster-xxxxx.us-east-1.rds.amazonaws.com \
     -U artos_admin \
     -d artos

Connection Pooling

For production applications, use connection pooling to optimize database connections.

Database Maintenance

Backup and Restore

Manual Snapshot

Create a manual snapshot for specific points in time:
aws rds create-db-cluster-snapshot \
  --db-cluster-identifier artos-production \
  --db-cluster-snapshot-identifier artos-production-manual-2024-01-15

Point-in-Time Restore

Restore to any second within the retention period:
aws rds restore-db-cluster-to-point-in-time \
  --source-db-cluster-identifier artos-production \
  --db-cluster-identifier artos-production-restored \
  --restore-to-time 2024-01-15T10:30:00Z \
  --vpc-security-group-ids sg-xxxxx \
  --db-subnet-group-name artos-production-subnet-group

Restore from Snapshot

aws rds restore-db-cluster-from-snapshot \
  --db-cluster-identifier artos-production-restored \
  --snapshot-identifier artos-production-manual-2024-01-15 \
  --engine aurora-postgresql \
  --vpc-security-group-ids sg-xxxxx \
  --db-subnet-group-name artos-production-subnet-group

Parameter Changes

Applying parameter changes:
# Modify parameter group via Terraform
# Then reboot cluster for changes to take effect

# Reboot writer instance
aws rds reboot-db-instance --db-instance-identifier artos-production-1

# Reboot reader instances
aws rds reboot-db-instance --db-instance-identifier artos-production-2
Reboot Impact: Rebooting the writer instance causes a brief downtime (typically < 30 seconds). Aurora automatically promotes a reader to writer during reboot to minimize downtime.

Version Upgrades

Minor version upgrades are automatic during maintenance windows. For major version upgrades:
# Create snapshot before upgrade
aws rds create-db-cluster-snapshot \
  --db-cluster-identifier artos-production \
  --db-cluster-snapshot-identifier artos-production-before-upgrade

# Modify cluster to new version
aws rds modify-db-cluster \
  --db-cluster-identifier artos-production \
  --engine-version 16.1 \
  --apply-immediately

Monitoring and Troubleshooting

CloudWatch Metrics

Key metrics to monitor:
MetricDescriptionAlarm Threshold
CPUUtilizationCPU usage percentage> 80%
DatabaseConnectionsActive connections> 80% of max_connections
FreeableMemoryAvailable memory< 1 GB
ReadLatencyRead operation latency> 20ms
WriteLatencyWrite operation latency> 50ms
CommitLatencyTransaction commit time> 100ms

Common Issues

Too Many Connections

Symptoms: FATAL: sorry, too many clients already Solutions:
  1. Increase max_connections parameter
  2. Implement connection pooling (PgBouncer)
  3. Review application connection management
  4. Close idle connections

Connection Timeouts

Symptoms: Applications cannot connect to database Troubleshooting:
# Verify security group rules
aws ec2 describe-security-groups --group-ids <rds-sg-id>

# Check if cluster is available
aws rds describe-db-clusters --db-cluster-identifier artos-production \
  --query 'DBClusters[0].Status'

# Test connectivity from EKS pod
kubectl run -it --rm debug --image=postgres:15 --restart=Never -- \
  psql -h artos-production.cluster-xxxxx.us-east-1.rds.amazonaws.com \
       -U artos_admin -d artos

Security

Descriptions of security around the RDS instances.

1. Network Isolation

  • Deploy in private database subnets (no internet access)
  • Use security groups to restrict access to EKS nodes only
  • Enable VPC Flow Logs for connection audit trail

2. Encryption

At Rest:
  • Always use KMS encryption with customer-managed keys
  • Enables fine-grained access control via key policies
  • Meets compliance requirements (HIPAA, PCI-DSS)
In Transit:
  • Enforce SSL/TLS connections
  • Set rds.force_ssl = 1 parameter (optional)
  • Verify SSL in application connection strings

3. Credential Management

  • Never hardcode database passwords
  • Rotate passwords regularly (outside Terraform)
  • Use Secrets Manager for automatic rotation
  • Grant minimal IAM permissions for secret access

Module Maintenance: This module is compatible with Terraform 1.0+ and AWS Provider 5.x. Aurora PostgreSQL version 15.4 is the default. Review AWS Aurora release notes before upgrading to newer versions. Always test database upgrades in non-production environments first.