
Overview

The Monitoring Module provides comprehensive observability for your Artos deployment using AWS CloudWatch. It creates centralized log aggregation, performance metrics, automated alerting, and visualization dashboards to help you monitor application health, diagnose issues, and maintain system reliability.

Key Features

  • Centralized Logging: Aggregated logs from all application components
  • Performance Metrics: CPU and memory utilization tracking
  • Automated Alerts: Proactive notifications for anomalies and threshold breaches
  • Visual Dashboards: Real-time insights into cluster and application health
  • Query Templates: Pre-configured CloudWatch Insights queries for common investigations
  • Event Tracking: Capture and respond to EKS cluster state changes

Core Components

1. CloudWatch Log Groups

Log groups provide centralized storage and organization for application logs from different components.

Application Log Group

Log Group Name: /aws/eks/{cluster_name}/application
Purpose: Stores logs from the main Artos backend API server.
Typical Log Contents:
  • HTTP request/response logs
  • Business logic execution traces
  • Error stack traces and exceptions
  • Database query logs
  • Authentication events
  • API performance metrics
Example Log Entry:
{
  "timestamp": "2024-01-15T10:30:45Z",
  "level": "INFO",
  "message": "POST /api/documents completed in 234ms",
  "user_id": "user_12345",
  "status_code": 200
}

Celery Log Group

Log Group Name: /aws/eks/{cluster_name}/celery
Purpose: Stores logs from Celery background workers processing asynchronous tasks.
Typical Log Contents:
  • Task execution start/completion
  • Task failures and retries
  • Worker health status
  • Queue processing metrics
  • Long-running job progress
Example Log Entry:
{
  "timestamp": "2024-01-15T10:31:20Z",
  "level": "INFO",
  "task": "process_document",
  "task_id": "abc-123-def",
  "status": "SUCCESS",
  "duration": "45.2s"
}

Nginx Log Group

Log Group Name: /aws/eks/{cluster_name}/nginx
Purpose: Stores logs from the Nginx ingress controller or reverse proxy.
Typical Log Contents:
  • HTTP access logs
  • Request routing decisions
  • SSL/TLS handshake events
  • Upstream connection errors
  • Rate limiting events
Example Log Entry:
192.168.1.10 - - [15/Jan/2024:10:30:45 +0000] "GET /api/health HTTP/1.1" 200 145 "-" "ALB-HealthChecker/2.0"
Log Retention: All log groups use a configurable retention period (the log_retention_days variable) to balance observability needs with storage costs. Logs older than the retention period are automatically deleted. If no retention is configured, it defaults to the maximum value CloudWatch allows.
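If you need to verify or adjust a log group's retention outside of Terraform, the CloudWatch Logs API exposes it directly. A minimal boto3 sketch, assuming the production application log group and a 30-day policy (adjust both to your deployment):
import boto3

logs = boto3.client("logs", region_name="us-east-1")

# Apply a 30-day retention policy to the application log group
# (replace the group name and value to match your cluster and policy)
logs.put_retention_policy(
    logGroupName="/aws/eks/artos-production/application",
    retentionInDays=30,
)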

2. CloudWatch Alarms

Alarms monitor metrics and automatically trigger notifications when thresholds are breached.

High CPU Alarm

Alarm Name: {cluster_name}-high-cpu
Trigger Condition:
  • Metric: Average CPU Utilization
  • Threshold: > 80%
  • Evaluation: 2 consecutive periods of 5 minutes
  • Total duration: 10 minutes above threshold
What It Means: When CPU utilization exceeds 80% for 10 consecutive minutes, this alarm triggers. High CPU can indicate:
  • Increased application load requiring more compute capacity
  • CPU-intensive operations (AI model inference, data processing)
  • Inefficient code or infinite loops
  • Need to scale node groups or add more pods
Response Actions:
  1. Check CloudWatch metrics to identify which pods are consuming CPU (see the sketch after this list)
  2. Review application logs for unusual activity
  3. Consider horizontal scaling (more pod replicas) or vertical scaling (larger instance types)
  4. Investigate for potential performance bottlenecks
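A sketch for step 1, assuming Container Insights is enabled on the cluster (it publishes pod-level metrics under the ContainerInsights namespace); it lists pod CPU metrics for the cluster and prints the most recent 5-minute average for each:
import boto3
from datetime import datetime, timedelta, timezone

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")
end = datetime.now(timezone.utc)
start = end - timedelta(minutes=30)

# Container Insights publishes pod-level CPU as ContainerInsights/pod_cpu_utilization
pod_metrics = cloudwatch.list_metrics(
    Namespace="ContainerInsights",
    MetricName="pod_cpu_utilization",
    Dimensions=[{"Name": "ClusterName", "Value": "artos-production"}],
)["Metrics"]

for metric in pod_metrics:
    stats = cloudwatch.get_metric_statistics(
        Namespace="ContainerInsights",
        MetricName="pod_cpu_utilization",
        Dimensions=metric["Dimensions"],
        StartTime=start,
        EndTime=end,
        Period=300,
        Statistics=["Average"],
    )
    # Print the most recent 5-minute average for each pod-level dimension set
    datapoints = stats["Datapoints"]
    if datapoints:
        latest = max(datapoints, key=lambda p: p["Timestamp"])
        print(metric["Dimensions"], f"{latest['Average']:.1f}%")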

High Memory Alarm

Alarm Name: {cluster_name}-high-memory
Trigger Condition:
  • Metric: Average Memory Utilization
  • Threshold: > 80%
  • Evaluation: 2 consecutive periods of 5 minutes
  • Total duration: 10 minutes above threshold
What It Means: When memory utilization exceeds 80% for 10 consecutive minutes, this alarm triggers. High memory usage can indicate:
  • Memory leaks in application code
  • Large dataset processing
  • Caching layers consuming excessive memory
  • Insufficient memory allocation for workloads
Response Actions:
  1. Identify memory-intensive pods using kubectl top pods
  2. Review application logs for memory errors or OOM events
  3. Analyze heap dumps if available
  4. Consider increasing memory limits or adding more nodes
  5. Investigate memory leaks or optimize data structures
Critical Threshold: If CPU or memory reaches 90%+, pods may be throttled or evicted, causing application instability. Consider setting additional alarms at 90% for critical alerts.
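Such an additional critical-severity alarm can be created alongside the module's alarms. A boto3 sketch, assuming the cluster CPU metric comes from Container Insights (node_cpu_utilization) and reusing the module's SNS topic; substitute the namespace, metric, and topic ARN your deployment actually uses:
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

# Critical-severity alarm at 90% CPU, alongside the module's 80% alarm.
# The namespace, metric, and topic ARN below are assumptions; match them to your setup.
cloudwatch.put_metric_alarm(
    AlarmName="artos-production-critical-cpu",
    AlarmDescription="EKS cluster CPU above 90% for 10 minutes",
    Namespace="ContainerInsights",
    MetricName="node_cpu_utilization",
    Dimensions=[{"Name": "ClusterName", "Value": "artos-production"}],
    Statistic="Average",
    Period=300,
    EvaluationPeriods=2,
    Threshold=90,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:artos-production-alerts"],
)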

3. CloudWatch Dashboard

The dashboard provides a unified view of cluster metrics and application logs.
Dashboard Name: {cluster_name}-dashboard
Widgets:

EKS Cluster Metrics Widget

  • Type: Time series graph
  • Metrics: CPU and Memory Utilization
  • Period: 5 minutes
  • View: Line graph with both metrics overlaid
Use Cases:
  • Quick health check of cluster resource utilization
  • Identify trends and patterns over time
  • Correlate CPU/memory spikes with application events
  • Capacity planning insights

Application Logs Widget

  • Type: Log query table
  • Source: Application log group
  • Query: Latest 20 log entries, sorted by timestamp
  • Refresh: Auto-refresh every 1 minute
Use Cases:
  • Real-time log streaming for debugging
  • Immediate visibility into recent errors
  • Monitor application activity during deployments
  • Quick access to latest log events
Accessing the Dashboard: The dashboard URL is available as a Terraform output:
https://{region}.console.aws.amazon.com/cloudwatch/home?region={region}#dashboards:name={cluster_name}-dashboard

4. CloudWatch Insights Queries

Pre-configured query definitions for common log analysis tasks.

Error Logs Query

Query Name: {cluster_name}-error-logs
Log Groups: Application and Celery
Query:
fields @timestamp, @message
| filter @message like /ERROR/
| sort @timestamp desc
| limit 100
Use Cases:
  • Quickly find all error messages across application and workers
  • Troubleshoot failures and exceptions
  • Identify error patterns and frequencies
  • Generate error reports for analysis
Running the Query:
  1. Navigate to CloudWatch Console → Insights
  2. Select the error-logs query definition
  3. Choose time range
  4. Click “Run query”

Performance Logs Query

Query Name: {cluster_name}-performance-logs
Log Groups: Application
Query:
fields @timestamp, @message
| filter @message like /performance/ or @message like /slow/
| sort @timestamp desc
| limit 100
Use Cases:
  • Identify slow API endpoints or database queries
  • Monitor performance degradation over time
  • Find operations exceeding performance SLAs
  • Prioritize optimization efforts
Custom Queries: You can create additional queries for specific needs.
Count Errors by Type:
fields @message
| filter @message like /ERROR/
| parse @message /ERROR: (?<error_type>.*?) -/
| stats count() by error_type
| sort count() desc
Request Duration Statistics:
fields @timestamp, @message
| filter @message like /completed in/
| parse @message /completed in (?<duration>\d+)ms/
| stats avg(duration), max(duration), min(duration), count()
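Saved and ad-hoc queries can also be run programmatically. A minimal boto3 sketch that runs the error filter over the last hour of application logs and prints the matching rows:
import time
import boto3

logs = boto3.client("logs", region_name="us-east-1")

# Run the error filter over the last hour of application logs
query_id = logs.start_query(
    logGroupName="/aws/eks/artos-production/application",
    startTime=int(time.time()) - 3600,
    endTime=int(time.time()),
    queryString="fields @timestamp, @message | filter @message like /ERROR/ | limit 20",
)["queryId"]

# Poll until the query completes, then print each result row as a dict
response = logs.get_query_results(queryId=query_id)
while response["status"] in ("Scheduled", "Running"):
    time.sleep(1)
    response = logs.get_query_results(queryId=query_id)

for row in response["results"]:
    print({field["field"]: field["value"] for field in row})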

5. SNS Topic for Alerts

The SNS (Simple Notification Service) topic distributes alarm notifications to configured endpoints.
Topic Name: {cluster_name}-alerts
Encryption: KMS-encrypted for security
Subscribers: Configure email, SMS, HTTP/HTTPS endpoints, or Lambda functions to receive alerts.
How It Works:
  1. CloudWatch alarm enters ALARM state
  2. Alarm publishes message to SNS topic
  3. SNS delivers notification to all subscribers
  4. Team members receive alerts via configured channels
Example Alert Message:
{
  "AlarmName": "artos-production-high-cpu",
  "AlarmDescription": "This metric monitors EKS cluster CPU utilization",
  "NewStateValue": "ALARM",
  "NewStateReason": "Threshold Crossed: 2 datapoints [85.4, 87.2] were greater than the threshold (80.0).",
  "StateChangeTime": "2024-01-15T10:35:00.000Z",
  "Region": "us-east-1"
}
Adding Subscribers:
# Subscribe email to SNS topic
aws sns subscribe \
  --topic-arn arn:aws:sns:us-east-1:123456789012:artos-production-alerts \
  --protocol email \
  --notification-endpoint ops-team@example.com

# Confirm subscription from email
# Check inbox for confirmation link and click to confirm
Integration Options:
  • Email: Direct email notifications
  • SMS: Text message alerts for critical events
  • Slack/Teams: Webhook integration for team channels
  • PagerDuty: Incident management system integration
  • Lambda: Custom processing and routing logic
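For the Lambda option, the handler receives the alarm notification wrapped in an SNS envelope. A minimal sketch that reposts alarms to a chat webhook; the SLACK_WEBHOOK_URL environment variable name is hypothetical and used only for illustration:
import json
import os
import urllib.request

# Hypothetical webhook URL supplied via an environment variable
SLACK_WEBHOOK_URL = os.environ["SLACK_WEBHOOK_URL"]

def handler(event, context):
    # Each SNS record carries the CloudWatch alarm notification as a JSON string
    for record in event["Records"]:
        alarm = json.loads(record["Sns"]["Message"])
        text = (
            f"{alarm['AlarmName']} is {alarm['NewStateValue']}: "
            f"{alarm['NewStateReason']}"
        )
        req = urllib.request.Request(
            SLACK_WEBHOOK_URL,
            data=json.dumps({"text": text}).encode("utf-8"),
            headers={"Content-Type": "application/json"},
        )
        urllib.request.urlopen(req)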

6. CloudWatch Events (EventBridge)

Captures EKS cluster state change events for automated responses.
Event Rule Name: {cluster_name}-events
Event Pattern:
{
  "source": ["aws.eks"],
  "detail-type": ["EKS Cluster State Change"],
  "detail": {
    "name": ["{cluster_name}"]
  }
}
Captured Events:
  • Cluster creation or deletion
  • Cluster version updates
  • Node group scaling events
  • Add-on installation or updates
  • Cluster configuration changes
Event Target: SNS topic (alerts are sent to subscribers)
Use Cases:
  • Automated notification of cluster changes
  • Audit trail for infrastructure modifications
  • Trigger automated workflows on cluster events
  • Compliance logging for change management
Example Event:
{
  "version": "0",
  "id": "abc-123-def",
  "detail-type": "EKS Cluster State Change",
  "source": "aws.eks",
  "time": "2024-01-15T10:40:00Z",
  "region": "us-east-1",
  "detail": {
    "name": "artos-production",
    "status": "UPDATING",
    "resourcesVpcConfig": {
      "endpointPublicAccess": false,
      "endpointPrivateAccess": true
    }
  }
}
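To trigger automated workflows on these events, you could add a Lambda function as an additional target of the event rule (the module's default target is the SNS topic). A minimal handler sketch that reacts to selected status transitions:
def handler(event, context):
    # EventBridge delivers the EKS state change event directly as the payload
    detail = event.get("detail", {})
    cluster = detail.get("name")
    status = detail.get("status")

    # Act only on transitions worth paging on; log everything else
    if status in ("UPDATING", "DELETING"):
        print(f"Cluster {cluster} entered {status}; notify the on-call channel")
    else:
        print(f"Cluster {cluster} state change: {status}")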

Module Configuration

Basic Configuration

module "monitoring" {
  source = "./modules/monitoring"

  cluster_name = "artos-production"
  aws_region   = "us-east-1"
  kms_key_arn  = module.kms.key_arn
  
  # Log retention
  log_retention_days = 30
  
  # Enable all features
  enable_alarms      = true
  enable_dashboard   = true
  enable_sns_alerts  = true
  enable_events      = true
  
  # Alarm actions (empty = no notifications)
  alarm_actions = []
  
  tags = {
    Environment = "production"
  }
}

Production Configuration with SNS Alerts

module "monitoring_production" {
  source = "./modules/monitoring"

  cluster_name = "artos-production"
  aws_region   = "us-east-1"
  kms_key_arn  = module.kms.key_arn
  
  # Extended log retention for production
  log_retention_days = 90
  
  # Enable all monitoring features
  enable_alarms      = true
  enable_dashboard   = true
  enable_sns_alerts  = true
  enable_events      = true
  
  # Send alarms to SNS topic
  alarm_actions = [
    module.monitoring.sns_topic_arn  # Reference SNS topic created by module
  ]
  
  tags = {
    Environment = "production"
    AlertLevel  = "critical"
  }
}

# Subscribe operations team to alerts
resource "aws_sns_topic_subscription" "ops_team" {
  topic_arn = module.monitoring_production.sns_topic_arn
  protocol  = "email"
  endpoint  = "[email protected]"
}

resource "aws_sns_topic_subscription" "on_call" {
  topic_arn = module.monitoring_production.sns_topic_arn
  protocol  = "sms"
  endpoint  = "+1234567890"
}

Development Configuration

module "monitoring_dev" {
  source = "./modules/monitoring"

  cluster_name = "artos-dev"
  aws_region   = "us-east-1"
  kms_key_arn  = module.kms.key_arn
  
  # Minimal log retention for development
  log_retention_days = 3
  
  # Enable dashboard only
  enable_alarms      = false  # Disable alarms in dev
  enable_dashboard   = true
  enable_sns_alerts  = false  # No alerts needed
  enable_events      = false  # No event tracking
  
  alarm_actions = []
  
  tags = {
    Environment = "development"
  }
}

Configuring Application Logging

Application Code Configuration

Configure your applications to send logs to CloudWatch Log Groups: Python (using watchtower):
import logging
from watchtower import CloudWatchLogHandler

# Configure CloudWatch handler
handler = CloudWatchLogHandler(
    log_group='/aws/eks/artos-production/application',
    stream_name='backend-api',
    use_queues=True
)

# Add to logger
logger = logging.getLogger(__name__)
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# Log messages
logger.info("Application started")
logger.error("Failed to process document", extra={
    "document_id": "doc_123",
    "user_id": "user_456"
})
Node.js (using winston-cloudwatch):
const winston = require('winston');
const WinstonCloudWatch = require('winston-cloudwatch');

const logger = winston.createLogger({
  transports: [
    new WinstonCloudWatch({
      logGroupName: '/aws/eks/artos-production/application',
      logStreamName: 'backend-api',
      awsRegion: 'us-east-1'
    })
  ]
});

logger.info('Application started');
logger.error('Failed to process document', {
  document_id: 'doc_123',
  user_id: 'user_456'
});

Kubernetes Configuration

Alternatively, use a log forwarder like Fluent Bit or Fluentd: Fluent Bit ConfigMap:
apiVersion: v1
kind: ConfigMap
metadata:
  name: fluent-bit-config
  namespace: logging
data:
  fluent-bit.conf: |
    [OUTPUT]
        Name cloudwatch_logs
        Match application.*
        region us-east-1
        log_group_name /aws/eks/artos-production/application
        log_stream_prefix backend-
        auto_create_group true
    
    [OUTPUT]
        Name cloudwatch_logs
        Match celery.*
        region us-east-1
        log_group_name /aws/eks/artos-production/celery
        log_stream_prefix worker-
        auto_create_group true

Accessing Monitoring Resources

CloudWatch Console

View Dashboard:
  1. Navigate to AWS Console → CloudWatch
  2. Select “Dashboards” from left menu
  3. Click on {cluster_name}-dashboard
Query Logs:
  1. Navigate to CloudWatch → Logs → Insights
  2. Select log groups to query
  3. Choose saved query or write custom query
  4. Adjust time range and run query
View Alarms:
  1. Navigate to CloudWatch → Alarms
  2. Filter by cluster name
  3. View alarm state and history

AWS CLI

Query Recent Logs:
# Get latest application logs
aws logs tail /aws/eks/artos-production/application --follow

# Search for errors
aws logs filter-log-events \
  --log-group-name /aws/eks/artos-production/application \
  --filter-pattern "ERROR" \
  --start-time $(date -u -d '1 hour ago' +%s)000
Check Alarm Status:
# List all alarms
aws cloudwatch describe-alarms \
  --alarm-name-prefix artos-production

# Get alarm history
aws cloudwatch describe-alarm-history \
  --alarm-name artos-production-high-cpu \
  --max-records 10
Run Insights Query:
# Start query
QUERY_ID=$(aws logs start-query \
  --log-group-name /aws/eks/artos-production/application \
  --start-time $(date -u -d '1 hour ago' +%s) \
  --end-time $(date +%s) \
  --query-string 'fields @timestamp, @message | filter @message like /ERROR/ | limit 20' \
  --query 'queryId' \
  --output text)

# Get results
aws logs get-query-results --query-id $QUERY_ID

Best Practices

1. Structured Logging

Use structured (JSON) logging instead of plain text for better query capabilities: Good - Structured:
{
  "timestamp": "2024-01-15T10:30:45Z",
  "level": "ERROR",
  "message": "Database connection failed",
  "error_code": "DB_CONN_TIMEOUT",
  "database": "postgres-prod",
  "retry_count": 3
}
Avoid - Unstructured:
ERROR: Database connection failed - postgres-prod timed out after 3 retries
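One way to produce entries in the structured shape above from Python is a small JSON formatter. A minimal sketch; the field names mirror the example above rather than any required schema:
import json
import logging

class JsonFormatter(logging.Formatter):
    # Emit each record as a single JSON object so Insights can parse fields
    def format(self, record):
        entry = {
            "timestamp": self.formatTime(record, "%Y-%m-%dT%H:%M:%SZ"),
            "level": record.levelname,
            "message": record.getMessage(),
        }
        entry.update(getattr(record, "extra_fields", {}))
        return json.dumps(entry)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("artos")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.error(
    "Database connection failed",
    extra={"extra_fields": {"error_code": "DB_CONN_TIMEOUT", "retry_count": 3}},
)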

2. Log Levels

Use appropriate log levels to control verbosity:
  • ERROR: Application errors requiring immediate attention
  • WARN: Potential issues that don’t halt execution
  • INFO: Important business events and milestones
  • DEBUG: Detailed diagnostic information (dev/staging only)
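A simple way to apply these levels per environment is to read the level from configuration rather than hard-coding it. A minimal sketch; the LOG_LEVEL variable name is an assumption:
import logging
import os

# Default to INFO; allow DEBUG (or any other level) via an environment variable
level_name = os.environ.get("LOG_LEVEL", "INFO")
logging.getLogger("artos").setLevel(getattr(logging, level_name.upper(), logging.INFO))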

3. Alarm Threshold Tuning

Adjust alarm thresholds based on your application’s normal behavior:
  • Start with conservative thresholds (80%)
  • Monitor false positive rate
  • Adjust based on actual capacity needs
  • Set multiple severity levels (warning at 70%, critical at 90%)

4. Dashboard Customization

Extend the default dashboard with application-specific metrics:
# Add custom widget to dashboard
resource "aws_cloudwatch_dashboard" "custom" {
  dashboard_name = "${var.cluster_name}-custom-dashboard"
  
  dashboard_body = jsonencode({
    widgets = [
      # Include default widgets...
      {
        type = "metric"
        properties = {
          metrics = [
            ["Artos", "DocumentsProcessed", "ClusterName", var.cluster_name],
            [".", "APIRequestRate", ".", "."]
          ]
          title = "Business Metrics"
        }
      }
    ]
  })
}

5. Log Sampling

For high-volume logs, consider sampling to reduce storage costs while maintaining visibility:
import logging
import random

def log_sampled(logger, level, message):
    # Sample 10% of INFO logs, keep all ERROR logs
    if level == logging.INFO and random.random() > 0.1:
        return  # Skip logging
    logger.log(level, message)

Module Maintenance: This module is compatible with Terraform 1.0+ and AWS Provider 5.x. CloudWatch Logs are retained according to configured retention periods and automatically deleted after expiration. Review retention settings periodically to balance observability needs with storage requirements.