
Overview

The Monitoring Module provides comprehensive observability for your Artos deployment using AWS CloudWatch. It creates centralized log aggregation, performance metrics, automated alerting, and visualization dashboards to help you monitor application health, diagnose issues, and maintain system reliability.

Key Features

  • Centralized Logging: Aggregated logs from all application components
  • Performance Metrics: CPU and memory utilization tracking
  • Automated Alerts: Proactive notifications for anomalies and threshold breaches
  • Visual Dashboards: Real-time insights into cluster and application health
  • Query Templates: Pre-configured CloudWatch Insights queries for common investigations
  • Event Tracking: Capture and respond to EKS cluster state changes

Core Components

1. CloudWatch Log Groups

Log groups provide centralized storage and organization for application logs from different components.

Application Log Group

Log Group Name: /aws/eks/{cluster_name}/application
Purpose: Stores logs from the main Artos backend API server.
Typical Log Contents:
  • HTTP request/response logs
  • Business logic execution traces
  • Error stack traces and exceptions
  • Database query logs
  • Authentication events
  • API performance metrics
Example Log Entry:
{
  "timestamp": "2024-01-15T10:30:45Z",
  "level": "INFO",
  "message": "POST /api/documents completed in 234ms",
  "user_id": "user_12345",
  "status_code": 200
}

Celery Log Group

Log Group Name: /aws/eks/{cluster_name}/celery
Purpose: Stores logs from Celery background workers processing asynchronous tasks.
Typical Log Contents:
  • Task execution start/completion
  • Task failures and retries
  • Worker health status
  • Queue processing metrics
  • Long-running job progress
Example Log Entry:
{
  "timestamp": "2024-01-15T10:31:20Z",
  "level": "INFO",
  "task": "process_document",
  "task_id": "abc-123-def",
  "status": "SUCCESS",
  "duration": "45.2s"
}

Nginx Log Group

Log Group Name: /aws/eks/{cluster_name}/nginx
Purpose: Stores logs from the Nginx ingress controller or reverse proxy.
Typical Log Contents:
  • HTTP access logs
  • Request routing decisions
  • SSL/TLS handshake events
  • Upstream connection errors
  • Rate limiting events
Example Log Entry:
192.168.1.10 - - [15/Jan/2024:10:30:45 +0000] "GET /api/health HTTP/1.1" 200 145 "-" "ALB-HealthChecker/2.0"
Log Retention: All log groups use a configurable retention period (the log_retention_days variable) to balance observability needs with storage costs. Logs older than the retention period are automatically deleted. If no retention is configured, it defaults to the maximum value CloudWatch allows.
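If you need to verify or adjust a log group's retention outside of Terraform, the CloudWatch Logs API exposes it directly. A minimal boto3 sketch, assuming the production application log group and a 30-day policy (adjust both to your deployment):
import boto3

logs = boto3.client("logs", region_name="us-east-1")

# Apply a 30-day retention policy to the application log group
# (replace the group name and value to match your cluster and policy)
logs.put_retention_policy(
    logGroupName="/aws/eks/artos-production/application",
    retentionInDays=30,
)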

2. CloudWatch Alarms

Alarms monitor metrics and automatically trigger notifications when thresholds are breached.

High CPU Alarm

Alarm Name: {cluster_name}-high-cpu
Trigger Condition:
  • Metric: Average CPU Utilization
  • Threshold: > 80%
  • Evaluation: 2 consecutive periods of 5 minutes
  • Total duration: 10 minutes above threshold
What It Means: When CPU utilization exceeds 80% for 10 consecutive minutes, this alarm triggers. High CPU can indicate:
  • Increased application load requiring more compute capacity
  • CPU-intensive operations (AI model inference, data processing)
  • Inefficient code or infinite loops
  • Need to scale node groups or add more pods
Response Actions:
  1. Check CloudWatch metrics to identify which pods are consuming CPU (see the sketch after this list)
  2. Review application logs for unusual activity
  3. Consider horizontal scaling (more pod replicas) or vertical scaling (larger instance types)
  4. Investigate for potential performance bottlenecks
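A sketch for step 1, assuming Container Insights is enabled on the cluster (it publishes pod-level metrics under the ContainerInsights namespace); it lists pod CPU metrics for the cluster and prints the most recent 5-minute average for each:
import boto3
from datetime import datetime, timedelta, timezone

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")
end = datetime.now(timezone.utc)
start = end - timedelta(minutes=30)

# Container Insights publishes pod-level CPU as ContainerInsights/pod_cpu_utilization
pod_metrics = cloudwatch.list_metrics(
    Namespace="ContainerInsights",
    MetricName="pod_cpu_utilization",
    Dimensions=[{"Name": "ClusterName", "Value": "artos-production"}],
)["Metrics"]

for metric in pod_metrics:
    stats = cloudwatch.get_metric_statistics(
        Namespace="ContainerInsights",
        MetricName="pod_cpu_utilization",
        Dimensions=metric["Dimensions"],
        StartTime=start,
        EndTime=end,
        Period=300,
        Statistics=["Average"],
    )
    # Print the most recent 5-minute average for each pod-level dimension set
    datapoints = stats["Datapoints"]
    if datapoints:
        latest = max(datapoints, key=lambda p: p["Timestamp"])
        print(metric["Dimensions"], f"{latest['Average']:.1f}%")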

High Memory Alarm

Alarm Name: {cluster_name}-high-memory
Trigger Condition:
  • Metric: Average Memory Utilization
  • Threshold: > 80%
  • Evaluation: 2 consecutive periods of 5 minutes
  • Total duration: 10 minutes above threshold
What It Means: When memory utilization exceeds 80% for 10 consecutive minutes, this alarm triggers. High memory usage can indicate:
  • Memory leaks in application code
  • Large dataset processing
  • Caching layers consuming excessive memory
  • Insufficient memory allocation for workloads
Response Actions:
  1. Identify memory-intensive pods using kubectl top pods
  2. Review application logs for memory errors or OOM events
  3. Analyze heap dumps if available
  4. Consider increasing memory limits or adding more nodes
  5. Investigate memory leaks or optimize data structures
Critical Threshold: If CPU or memory reaches 90%+, pods may be throttled or evicted, causing application instability. Consider setting additional alarms at 90% for critical alerts.
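Such an additional critical-severity alarm can be created alongside the module's alarms. A boto3 sketch, assuming the cluster CPU metric comes from Container Insights (node_cpu_utilization) and reusing the module's SNS topic; substitute the namespace, metric, and topic ARN your deployment actually uses:
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

# Critical-severity alarm at 90% CPU, alongside the module's 80% alarm.
# The namespace, metric, and topic ARN below are assumptions; match them to your setup.
cloudwatch.put_metric_alarm(
    AlarmName="artos-production-critical-cpu",
    AlarmDescription="EKS cluster CPU above 90% for 10 minutes",
    Namespace="ContainerInsights",
    MetricName="node_cpu_utilization",
    Dimensions=[{"Name": "ClusterName", "Value": "artos-production"}],
    Statistic="Average",
    Period=300,
    EvaluationPeriods=2,
    Threshold=90,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:artos-production-alerts"],
)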

3. CloudWatch Dashboard

The dashboard provides a unified view of cluster metrics and application logs.
Dashboard Name: {cluster_name}-dashboard
Widgets:

EKS Cluster Metrics Widget

  • Type: Time series graph
  • Metrics: CPU and Memory Utilization
  • Period: 5 minutes
  • View: Line graph with both metrics overlaid
Use Cases:
  • Quick health check of cluster resource utilization
  • Identify trends and patterns over time
  • Correlate CPU/memory spikes with application events
  • Capacity planning insights

Application Logs Widget

  • Type: Log query table
  • Source: Application log group
  • Query: Latest 20 log entries, sorted by timestamp
  • Refresh: Auto-refresh every 1 minute
Use Cases:
  • Real-time log streaming for debugging
  • Immediate visibility into recent errors
  • Monitor application activity during deployments
  • Quick access to latest log events
Accessing the Dashboard: The dashboard URL is available as a Terraform output:
https://{region}.console.aws.amazon.com/cloudwatch/home?region={region}#dashboards:name={cluster_name}-dashboard

4. CloudWatch Insights Queries

Pre-configured query definitions for common log analysis tasks.

Error Logs Query

Query Name: {cluster_name}-error-logs
Log Groups: Application and Celery
Query:
fields @timestamp, @message
| filter @message like /ERROR/
| sort @timestamp desc
| limit 100
Use Cases:
  • Quickly find all error messages across application and workers
  • Troubleshoot failures and exceptions
  • Identify error patterns and frequencies
  • Generate error reports for analysis
Running the Query:
  1. Navigate to CloudWatch Console → Insights
  2. Select the error-logs query definition
  3. Choose time range
  4. Click “Run query”

Performance Logs Query

Query Name: {cluster_name}-performance-logs
Log Groups: Application
Query:
fields @timestamp, @message
| filter @message like /performance/ or @message like /slow/
| sort @timestamp desc
| limit 100
Use Cases:
  • Identify slow API endpoints or database queries
  • Monitor performance degradation over time
  • Find operations exceeding performance SLAs
  • Prioritize optimization efforts
Custom Queries: You can create additional queries for specific needs.
Count Errors by Type:
fields @message
| filter @message like /ERROR/
| parse @message /ERROR: (?<error_type>.*?) -/
| stats count() by error_type
| sort count() desc
Request Duration Statistics:
fields @timestamp, @message
| filter @message like /completed in/
| parse @message /completed in (?<duration>\d+)ms/
| stats avg(duration), max(duration), min(duration), count()
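Saved and ad-hoc queries can also be run programmatically. A minimal boto3 sketch that runs the error filter over the last hour of application logs and prints the matching rows:
import time
import boto3

logs = boto3.client("logs", region_name="us-east-1")

# Run the error filter over the last hour of application logs
query_id = logs.start_query(
    logGroupName="/aws/eks/artos-production/application",
    startTime=int(time.time()) - 3600,
    endTime=int(time.time()),
    queryString="fields @timestamp, @message | filter @message like /ERROR/ | limit 20",
)["queryId"]

# Poll until the query completes, then print each result row as a dict
response = logs.get_query_results(queryId=query_id)
while response["status"] in ("Scheduled", "Running"):
    time.sleep(1)
    response = logs.get_query_results(queryId=query_id)

for row in response["results"]:
    print({field["field"]: field["value"] for field in row})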

5. SNS Topic for Alerts

The SNS (Simple Notification Service) topic distributes alarm notifications to configured endpoints.
Topic Name: {cluster_name}-alerts
Encryption: KMS-encrypted for security
Subscribers: Configure email, SMS, HTTP/HTTPS endpoints, or Lambda functions to receive alerts.
How It Works:
  1. CloudWatch alarm enters ALARM state
  2. Alarm publishes message to SNS topic
  3. SNS delivers notification to all subscribers
  4. Team members receive alerts via configured channels
Example Alert Message:
{
  "AlarmName": "artos-production-high-cpu",
  "AlarmDescription": "This metric monitors EKS cluster CPU utilization",
  "NewStateValue": "ALARM",
  "NewStateReason": "Threshold Crossed: 2 datapoints [85.4, 87.2] were greater than the threshold (80.0).",
  "StateChangeTime": "2024-01-15T10:35:00.000Z",
  "Region": "us-east-1"
}
Adding Subscribers:
# Subscribe email to SNS topic
aws sns subscribe \
  --topic-arn arn:aws:sns:us-east-1:123456789012:artos-production-alerts \
  --protocol email \
  --notification-endpoint ops-team@example.com

# Confirm subscription from email
# Check inbox for confirmation link and click to confirm
Integration Options:
  • Email: Direct email notifications
  • SMS: Text message alerts for critical events
  • Slack/Teams: Webhook integration for team channels
  • PagerDuty: Incident management system integration
  • Lambda: Custom processing and routing logic
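For the Lambda option, the handler receives the alarm notification wrapped in an SNS envelope. A minimal sketch that reposts alarms to a chat webhook; the SLACK_WEBHOOK_URL environment variable name is hypothetical and used only for illustration:
import json
import os
import urllib.request

# Hypothetical webhook URL supplied via an environment variable
SLACK_WEBHOOK_URL = os.environ["SLACK_WEBHOOK_URL"]

def handler(event, context):
    # Each SNS record carries the CloudWatch alarm notification as a JSON string
    for record in event["Records"]:
        alarm = json.loads(record["Sns"]["Message"])
        text = (
            f"{alarm['AlarmName']} is {alarm['NewStateValue']}: "
            f"{alarm['NewStateReason']}"
        )
        req = urllib.request.Request(
            SLACK_WEBHOOK_URL,
            data=json.dumps({"text": text}).encode("utf-8"),
            headers={"Content-Type": "application/json"},
        )
        urllib.request.urlopen(req)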

6. CloudWatch Events (EventBridge)

Captures EKS cluster state change events for automated responses.
Event Rule Name: {cluster_name}-events
Event Pattern:
{
  "source": ["aws.eks"],
  "detail-type": ["EKS Cluster State Change"],
  "detail": {
    "name": ["{cluster_name}"]
  }
}
Captured Events:
  • Cluster creation or deletion
  • Cluster version updates
  • Node group scaling events
  • Add-on installation or updates
  • Cluster configuration changes
Event Target: SNS topic (alerts are sent to subscribers)
Use Cases:
  • Automated notification of cluster changes
  • Audit trail for infrastructure modifications
  • Trigger automated workflows on cluster events
  • Compliance logging for change management
Example Event:
{
  "version": "0",
  "id": "abc-123-def",
  "detail-type": "EKS Cluster State Change",
  "source": "aws.eks",
  "time": "2024-01-15T10:40:00Z",
  "region": "us-east-1",
  "detail": {
    "name": "artos-production",
    "status": "UPDATING",
    "resourcesVpcConfig": {
      "endpointPublicAccess": false,
      "endpointPrivateAccess": true
    }
  }
}
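To trigger automated workflows on these events, you could add a Lambda function as an additional target of the event rule (the module's default target is the SNS topic). A minimal handler sketch that reacts to selected status transitions:
def handler(event, context):
    # EventBridge delivers the EKS state change event directly as the payload
    detail = event.get("detail", {})
    cluster = detail.get("name")
    status = detail.get("status")

    # Act only on transitions worth paging on; log everything else
    if status in ("UPDATING", "DELETING"):
        print(f"Cluster {cluster} entered {status}; notify the on-call channel")
    else:
        print(f"Cluster {cluster} state change: {status}")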

Module Configuration

Basic Configuration

module "monitoring" {
  source = "./modules/monitoring"

  cluster_name = "artos-production"
  aws_region   = "us-east-1"
  kms_key_arn  = module.kms.key_arn
  
  # Log retention
  log_retention_days = 30
  
  # Enable all features
  enable_alarms      = true
  enable_dashboard   = true
  enable_sns_alerts  = true
  enable_events      = true
  
  # Alarm actions (empty = no notifications)
  alarm_actions = []
  
  tags = {
    Environment = "production"
  }
}

Production Configuration with SNS Alerts

module "monitoring_production" {
  source = "./modules/monitoring"

  cluster_name = "artos-production"
  aws_region   = "us-east-1"
  kms_key_arn  = module.kms.key_arn
  
  # Extended log retention for production
  log_retention_days = 90
  
  # Enable all monitoring features
  enable_alarms      = true
  enable_dashboard   = true
  enable_sns_alerts  = true
  enable_events      = true
  
  # Send alarms to SNS topic
  alarm_actions = [
    module.monitoring.sns_topic_arn  # Reference SNS topic created by module
  ]
  
  tags = {
    Environment = "production"
    AlertLevel  = "critical"
  }
}

# Subscribe operations team to alerts
resource "aws_sns_topic_subscription" "ops_team" {
  topic_arn = module.monitoring_production.sns_topic_arn
  protocol  = "email"
  endpoint  = "[email protected]"
}

resource "aws_sns_topic_subscription" "on_call" {
  topic_arn = module.monitoring_production.sns_topic_arn
  protocol  = "sms"
  endpoint  = "+1234567890"
}

Development Configuration

module "monitoring_dev" {
  source = "./modules/monitoring"

  cluster_name = "artos-dev"
  aws_region   = "us-east-1"
  kms_key_arn  = module.kms.key_arn
  
  # Minimal log retention for development
  log_retention_days = 3
  
  # Enable dashboard only
  enable_alarms      = false  # Disable alarms in dev
  enable_dashboard   = true
  enable_sns_alerts  = false  # No alerts needed
  enable_events      = false  # No event tracking
  
  alarm_actions = []
  
  tags = {
    Environment = "development"
  }
}

Configuring Application Logging

Application Code Configuration

Configure your applications to send logs to CloudWatch Log Groups: Python (using watchtower):
import logging
from watchtower import CloudWatchLogHandler

# Configure CloudWatch handler
handler = CloudWatchLogHandler(
    log_group='/aws/eks/artos-production/application',
    stream_name='backend-api',
    use_queues=True
)

# Add to logger
logger = logging.getLogger(__name__)
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# Log messages
logger.info("Application started")
logger.error("Failed to process document", extra={
    "document_id": "doc_123",
    "user_id": "user_456"
})
Node.js (using winston-cloudwatch):
const winston = require('winston');
const WinstonCloudWatch = require('winston-cloudwatch');

const logger = winston.createLogger({
  transports: [
    new WinstonCloudWatch({
      logGroupName: '/aws/eks/artos-production/application',
      logStreamName: 'backend-api',
      awsRegion: 'us-east-1'
    })
  ]
});

logger.info('Application started');
logger.error('Failed to process document', {
  document_id: 'doc_123',
  user_id: 'user_456'
});

Kubernetes Configuration

Alternatively, use a log forwarder like Fluent Bit or Fluentd: Fluent Bit ConfigMap:
apiVersion: v1
kind: ConfigMap
metadata:
  name: fluent-bit-config
  namespace: logging
data:
  fluent-bit.conf: |
    [OUTPUT]
        Name cloudwatch_logs
        Match application.*
        region us-east-1
        log_group_name /aws/eks/artos-production/application
        log_stream_prefix backend-
        auto_create_group true
    
    [OUTPUT]
        Name cloudwatch_logs
        Match celery.*
        region us-east-1
        log_group_name /aws/eks/artos-production/celery
        log_stream_prefix worker-
        auto_create_group true

Accessing Monitoring Resources

CloudWatch Console

View Dashboard:
  1. Navigate to AWS Console → CloudWatch
  2. Select “Dashboards” from left menu
  3. Click on {cluster_name}-dashboard
Query Logs:
  1. Navigate to CloudWatch → Logs → Insights
  2. Select log groups to query
  3. Choose saved query or write custom query
  4. Adjust time range and run query
View Alarms:
  1. Navigate to CloudWatch → Alarms
  2. Filter by cluster name
  3. View alarm state and history

AWS CLI

Query Recent Logs:
# Get latest application logs
aws logs tail /aws/eks/artos-production/application --follow

# Search for errors
aws logs filter-log-events \
  --log-group-name /aws/eks/artos-production/application \
  --filter-pattern "ERROR" \
  --start-time $(date -u -d '1 hour ago' +%s)000
Check Alarm Status:
# List all alarms
aws cloudwatch describe-alarms \
  --alarm-name-prefix artos-production

# Get alarm history
aws cloudwatch describe-alarm-history \
  --alarm-name artos-production-high-cpu \
  --max-records 10
Run Insights Query:
# Start query
QUERY_ID=$(aws logs start-query \
  --log-group-name /aws/eks/artos-production/application \
  --start-time $(date -u -d '1 hour ago' +%s) \
  --end-time $(date +%s) \
  --query-string 'fields @timestamp, @message | filter @message like /ERROR/ | limit 20' \
  --query 'queryId' \
  --output text)

# Get results
aws logs get-query-results --query-id $QUERY_ID

Best Practices

1. Structured Logging

Use structured (JSON) logging instead of plain text for better query capabilities: Good - Structured:
{
  "timestamp": "2024-01-15T10:30:45Z",
  "level": "ERROR",
  "message": "Database connection failed",
  "error_code": "DB_CONN_TIMEOUT",
  "database": "postgres-prod",
  "retry_count": 3
}
Avoid - Unstructured:
ERROR: Database connection failed - postgres-prod timed out after 3 retries
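One way to produce entries in the structured shape above from Python is a small JSON formatter. A minimal sketch; the field names mirror the example above rather than any required schema:
import json
import logging

class JsonFormatter(logging.Formatter):
    # Emit each record as a single JSON object so Insights can parse fields
    def format(self, record):
        entry = {
            "timestamp": self.formatTime(record, "%Y-%m-%dT%H:%M:%SZ"),
            "level": record.levelname,
            "message": record.getMessage(),
        }
        entry.update(getattr(record, "extra_fields", {}))
        return json.dumps(entry)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("artos")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.error(
    "Database connection failed",
    extra={"extra_fields": {"error_code": "DB_CONN_TIMEOUT", "retry_count": 3}},
)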

2. Log Levels

Use appropriate log levels to control verbosity:
  • ERROR: Application errors requiring immediate attention
  • WARN: Potential issues that don’t halt execution
  • INFO: Important business events and milestones
  • DEBUG: Detailed diagnostic information (dev/staging only)
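A simple way to apply these levels per environment is to read the level from configuration rather than hard-coding it. A minimal sketch; the LOG_LEVEL variable name is an assumption:
import logging
import os

# Default to INFO; allow DEBUG (or any other level) via an environment variable
level_name = os.environ.get("LOG_LEVEL", "INFO")
logging.getLogger("artos").setLevel(getattr(logging, level_name.upper(), logging.INFO))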

3. Alarm Threshold Tuning

Adjust alarm thresholds based on your application’s normal behavior:
  • Start with conservative thresholds (80%)
  • Monitor false positive rate
  • Adjust based on actual capacity needs
  • Set multiple severity levels (warning at 70%, critical at 90%)

4. Dashboard Customization

Extend the default dashboard with application-specific metrics:
# Add custom widget to dashboard
resource "aws_cloudwatch_dashboard" "custom" {
  dashboard_name = "${var.cluster_name}-custom-dashboard"
  
  dashboard_body = jsonencode({
    widgets = [
      # Include default widgets...
      {
        type = "metric"
        properties = {
          metrics = [
            ["Artos", "DocumentsProcessed", "ClusterName", var.cluster_name],
            [".", "APIRequestRate", ".", "."]
          ]
          title = "Business Metrics"
        }
      }
    ]
  })
}

5. Log Sampling

For high-volume logs, consider sampling to reduce storage costs while maintaining visibility:
import logging
import random

def log_sampled(logger, level, message):
    # Sample 10% of INFO logs, keep all ERROR logs
    if level == logging.INFO and random.random() > 0.1:
        return  # Skip logging
    logger.log(level, message)

Module Maintenance: This module is compatible with Terraform 1.0+ and AWS Provider 5.x. CloudWatch Logs are retained according to configured retention periods and automatically deleted after expiration. Review retention settings periodically to balance observability needs with storage requirements.