Overview
The Monitoring Module provides comprehensive observability for your Artos deployment using AWS CloudWatch. It creates centralized log aggregation, performance metrics, automated alerting, and visualization dashboards to help you monitor application health, diagnose issues, and maintain system reliability.
Key Features
- Centralized Logging: Aggregated logs from all application components
- Performance Metrics: CPU and memory utilization tracking
- Automated Alerts: Proactive notifications for anomalies and threshold breaches
- Visual Dashboards: Real-time insights into cluster and application health
- Query Templates: Pre-configured CloudWatch Insights queries for common investigations
- Event Tracking: Capture and respond to EKS cluster state changes
Core Components
1. CloudWatch Log Groups
Log groups provide centralized storage and organization for application logs from different components.
Application Log Group
Log Group Name: /aws/eks/{cluster_name}/application
Purpose: Stores logs from the main Artos backend API server.
Typical Log Contents:
- HTTP request/response logs
- Business logic execution traces
- Error stack traces and exceptions
- Database query logs
- Authentication events
- API performance metrics
Celery Log Group
Log Group Name: /aws/eks/{cluster_name}/celery
Purpose: Stores logs from Celery background workers processing asynchronous tasks.
Typical Log Contents:
- Task execution start/completion
- Task failures and retries
- Worker health status
- Queue processing metrics
- Long-running job progress
Nginx Log Group
Log Group Name: /aws/eks/{cluster_name}/nginx
Purpose: Stores logs from Nginx ingress controller or reverse proxy.
Typical Log Contents:
- HTTP access logs
- Request routing decisions
- SSL/TLS handshake events
- Upstream connection errors
- Rate limiting events
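The module provisions all three log groups for you. For orientation, a standalone log group with a retention policy looks roughly like the sketch below; the name interpolation and retention value are illustrative, not the module's exact defaults.

```hcl
# Illustrative only: the monitoring module already manages these log groups.
resource "aws_cloudwatch_log_group" "application" {
  name              = "/aws/eks/${var.cluster_name}/application"
  retention_in_days = 30 # assumed value; tune to your cost and compliance needs
}
```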
2. CloudWatch Alarms
Alarms monitor metrics and automatically trigger notifications when thresholds are breached.
High CPU Alarm
Alarm Name: {cluster_name}-high-cpu
Trigger Condition:
- Metric: Average CPU Utilization
- Threshold: > 80%
- Evaluation: 2 consecutive periods of 5 minutes
- Total duration: 10 minutes above threshold
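For reference, an alarm with these settings corresponds roughly to the following Terraform resource. The metric namespace, metric name, and SNS topic reference are assumptions; the module may source metrics differently.

```hcl
# Sketch of the high-CPU alarm described above; namespace and metric name are assumptions.
resource "aws_cloudwatch_metric_alarm" "high_cpu" {
  alarm_name          = "${var.cluster_name}-high-cpu"
  comparison_operator = "GreaterThanThreshold"
  threshold           = 80
  evaluation_periods  = 2
  period              = 300 # 5 minutes
  statistic           = "Average"
  namespace           = "ContainerInsights"    # assumed metric source
  metric_name         = "node_cpu_utilization" # assumed metric name
  dimensions          = { ClusterName = var.cluster_name }
  alarm_actions       = [aws_sns_topic.alerts.arn] # assumes the module's SNS topic
}
```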
Possible Causes:
- Increased application load requiring more compute capacity
- CPU-intensive operations (AI model inference, data processing)
- Inefficient code or infinite loops
- Need to scale node groups or add more pods
Recommended Actions:
- Check CloudWatch metrics to identify which pods are consuming CPU
- Review application logs for unusual activity
- Consider horizontal scaling (more pod replicas) or vertical scaling (larger instance types)
- Investigate for potential performance bottlenecks
High Memory Alarm
Alarm Name: {cluster_name}-high-memory
Trigger Condition:
- Metric: Average Memory Utilization
- Threshold: > 80%
- Evaluation: 2 consecutive periods of 5 minutes
- Total duration: 10 minutes above threshold
Possible Causes:
- Memory leaks in application code
- Large dataset processing
- Caching layers consuming excessive memory
- Insufficient memory allocation for workloads
Recommended Actions:
- Identify memory-intensive pods using kubectl top pods
- Review application logs for memory errors or OOM events
- Analyze heap dumps if available
- Consider increasing memory limits or adding more nodes
- Investigate memory leaks or optimize data structures
3. CloudWatch Dashboard
The dashboard provides a unified view of cluster metrics and application logs.
Dashboard Name: {cluster_name}-dashboard
Widgets:
EKS Cluster Metrics Widget
- Type: Time series graph
- Metrics: CPU and Memory Utilization
- Period: 5 minutes
- View: Line graph with both metrics overlaid
Use Cases:
- Quick health check of cluster resource utilization
- Identify trends and patterns over time
- Correlate CPU/memory spikes with application events
- Capacity planning insights
Application Logs Widget
- Type: Log query table
- Source: Application log group
- Query: Latest 20 log entries, sorted by timestamp
- Refresh: Auto-refresh every 1 minute
Use Cases:
- Real-time log streaming for debugging
- Immediate visibility into recent errors
- Monitor application activity during deployments
- Quick access to latest log events
4. CloudWatch Insights Queries
Pre-configured query definitions for common log analysis tasks.
Error Logs Query
Query Name: {cluster_name}-error-logs
Log Groups: Application and Celery
Query:
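The module's exact saved query string is not reproduced here; a typical error-filtering CloudWatch Logs Insights query looks like this:

```
fields @timestamp, @message, @logStream
| filter @message like /ERROR/ or @message like /Exception/
| sort @timestamp desc
| limit 100
```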
Use Cases:
- Quickly find all error messages across application and workers
- Troubleshoot failures and exceptions
- Identify error patterns and frequencies
- Generate error reports for analysis
How to Run:
- Navigate to CloudWatch Console → Insights
- Select the error-logs query definition
- Choose time range
- Click “Run query”
Performance Logs Query
Query Name: {cluster_name}-performance-logs
Log Groups: Application
Query:
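Again, the exact saved query is not shown here. Assuming your application emits a numeric duration field (for example duration_ms) in structured logs, a representative query is:

```
fields @timestamp, @message, duration_ms
| filter ispresent(duration_ms) and duration_ms > 1000
| sort duration_ms desc
| limit 50
```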
Use Cases:
- Identify slow API endpoints or database queries
- Monitor performance degradation over time
- Find operations exceeding performance SLAs
- Prioritize optimization efforts
5. SNS Topic for Alerts
The SNS (Simple Notification Service) topic distributes alarm notifications to configured endpoints.
Topic Name: {cluster_name}-alerts
Encryption: KMS-encrypted for security
Subscribers: Configure email, SMS, HTTP/HTTPS endpoints, or Lambda functions to receive alerts.
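The module creates the topic itself; subscriptions are typically added alongside it. A minimal email subscription sketch follows, where the module output name and address are placeholders:

```hcl
# Hypothetical subscription; replace the topic ARN reference and email address.
resource "aws_sns_topic_subscription" "oncall_email" {
  topic_arn = module.monitoring.alerts_topic_arn # assumed module output name
  protocol  = "email"
  endpoint  = "oncall@example.com"
}
```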
How It Works:
- CloudWatch alarm enters ALARM state
- Alarm publishes message to SNS topic
- SNS delivers notification to all subscribers
- Team members receive alerts via configured channels
Integration Options:
- Email: Direct email notifications
- SMS: Text message alerts for critical events
- Slack/Teams: Webhook integration for team channels
- PagerDuty: Incident management system integration
- Lambda: Custom processing and routing logic
6. CloudWatch Events (EventBridge)
Captures EKS cluster state change events for automated responses.
Event Rule Name: {cluster_name}-events
Event Pattern:
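The rule's exact pattern is not reproduced here; a sketch that matches EKS-sourced events (the pattern contents are an assumption) looks like:

```hcl
# Sketch only; the module defines the actual rule and pattern.
resource "aws_cloudwatch_event_rule" "eks_events" {
  name = "${var.cluster_name}-events"
  event_pattern = jsonencode({
    source = ["aws.eks"]
  })
}
```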
Captured Events:
- Cluster creation or deletion
- Cluster version updates
- Node group scaling events
- Add-on installation or updates
- Cluster configuration changes
Use Cases:
- Automated notification of cluster changes
- Audit trail for infrastructure modifications
- Trigger automated workflows on cluster events
- Compliance logging for change management
Module Configuration
Basic Configuration
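A minimal invocation might look like the following; the module source path and input names (cluster_name, aws_region) are assumptions, so check the module's variables.tf for the real interface.

```hcl
module "monitoring" {
  source = "./modules/monitoring" # adjust to where the module lives in your repo

  cluster_name = "artos-prod" # hypothetical input name
  aws_region   = "us-east-1"  # hypothetical input name
}
```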
Production Configuration with SNS Alerts
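A production-oriented invocation would add alert recipients and longer retention; again, these input names are illustrative:

```hcl
module "monitoring" {
  source = "./modules/monitoring"

  cluster_name       = "artos-prod"
  aws_region         = "us-east-1"
  log_retention_days = 90                     # hypothetical input
  alert_emails       = ["oncall@example.com"] # hypothetical input
}
```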
Development Configuration
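For development, shorter retention keeps costs down; the inputs shown are hypothetical:

```hcl
module "monitoring" {
  source = "./modules/monitoring"

  cluster_name       = "artos-dev"
  aws_region         = "us-east-1"
  log_retention_days = 7     # hypothetical input
  enable_alarms      = false # hypothetical input; skip paging for dev clusters
}
```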
Configuring Application Logging
Application Code Configuration
Configure your applications to send logs to CloudWatch Log Groups:
Python (using watchtower):
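The original snippet is not included here; a minimal watchtower handler setup looks roughly like this (the log group name is a placeholder, and argument names differ slightly across watchtower versions):

```python
import logging
import watchtower

logger = logging.getLogger("artos")
logger.setLevel(logging.INFO)

# Ship log records to the module's application log group (name assumed here).
logger.addHandler(
    watchtower.CloudWatchLogHandler(
        log_group_name="/aws/eks/artos-prod/application",
        log_stream_name="api",
    )
)

logger.info("application started")
```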
Kubernetes Configuration
Alternatively, use a log forwarder like Fluent Bit or Fluentd:
Fluent Bit ConfigMap:
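The full ConfigMap is not reproduced here; a trimmed sketch using Fluent Bit's cloudwatch_logs output plugin might look like this, with paths, region, and log group name as placeholders:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: fluent-bit-config
  namespace: logging
data:
  fluent-bit.conf: |
    [INPUT]
        Name  tail
        Path  /var/log/containers/*.log
        Tag   kube.*

    [OUTPUT]
        Name              cloudwatch_logs
        Match             kube.*
        region            us-east-1
        log_group_name    /aws/eks/artos-prod/application
        log_stream_prefix eks-
        auto_create_group true
```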
Accessing Monitoring Resources
CloudWatch Console
View Dashboard:
- Navigate to AWS Console → CloudWatch
- Select “Dashboards” from left menu
- Click on {cluster_name}-dashboard
Run Insights Queries:
- Navigate to CloudWatch → Logs → Insights
- Select log groups to query
- Choose saved query or write custom query
- Adjust time range and run query
View Alarms:
- Navigate to CloudWatch → Alarms
- Filter by cluster name
- View alarm state and history
AWS CLI
Query Recent Logs:
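The original command is not shown; with AWS CLI v2, tailing or filtering the application log group (the group name below is a placeholder) can be done like this:

```bash
# Stream the last hour of application logs and follow new events
aws logs tail /aws/eks/artos-prod/application --since 1h --follow

# Or pull recent ERROR events without following
aws logs filter-log-events \
  --log-group-name /aws/eks/artos-prod/application \
  --start-time "$(($(date +%s) - 3600))000" \
  --filter-pattern "ERROR"
```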
Best Practices
1. Structured Logging
Use structured (JSON) logging instead of plain text for better query capabilities.
Good - Structured:
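The original example is not included; a structured log line serialized as JSON from Python's standard logging might look like this (the field names are illustrative):

```python
import json
import logging

logging.basicConfig(level=logging.INFO, format="%(message)s")
logger = logging.getLogger("artos")

# Emit one JSON object per line so CloudWatch Logs Insights can parse the fields.
logger.info(json.dumps({
    "level": "INFO",
    "event": "document_processed",
    "document_id": "doc-123",  # illustrative field
    "duration_ms": 842,        # numeric field usable in Insights filters
}))
```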
2. Log Levels
Use appropriate log levels to control verbosity:
| Level | When to Use |
|---|---|
| ERROR | Application errors requiring immediate attention |
| WARN | Potential issues that don’t halt execution |
| INFO | Important business events and milestones |
| DEBUG | Detailed diagnostic information (dev/staging only) |
3. Alarm Threshold Tuning
Adjust alarm thresholds based on your application’s normal behavior:
- Start with conservative thresholds (80%)
- Monitor false positive rate
- Adjust based on actual capacity needs
- Set multiple severity levels (warning at 70%, critical at 90%)
4. Dashboard Customization
Extend the default dashboard with application-specific metrics:
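One low-risk approach is to leave the module-managed dashboard untouched and add a supplemental dashboard for application metrics. The namespace and metric name below are placeholders for your own custom metrics:

```hcl
# Hypothetical supplemental dashboard; does not modify the module-managed one.
resource "aws_cloudwatch_dashboard" "app_custom" {
  dashboard_name = "${var.cluster_name}-app-custom"
  dashboard_body = jsonencode({
    widgets = [{
      type   = "metric"
      width  = 12
      height = 6
      properties = {
        title   = "Documents processed"
        region  = "us-east-1"
        period  = 300
        stat    = "Sum"
        metrics = [["Artos/Application", "DocumentsProcessed"]] # placeholder custom metric
      }
    }]
  })
}
```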
5. Log Sampling
For high-volume logs, consider sampling to reduce storage costs while maintaining visibility:
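One way to sample is a logging filter that drops a fraction of low-severity records while always keeping warnings and errors. This is a generic sketch, not something the module provides:

```python
import logging
import random

class SamplingFilter(logging.Filter):
    """Pass WARNING and above through; keep only a sample of lower-severity records."""

    def __init__(self, sample_rate: float = 0.1):
        super().__init__()
        self.sample_rate = sample_rate

    def filter(self, record: logging.LogRecord) -> bool:
        if record.levelno >= logging.WARNING:
            return True
        return random.random() < self.sample_rate

logger = logging.getLogger("artos")
logger.addHandler(logging.StreamHandler())
logger.addFilter(SamplingFilter(sample_rate=0.1))  # keep roughly 10% of INFO/DEBUG logs
```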
Related Modules
- EKS Module - Cluster metrics and control plane logs
- IAM Module - IAM permissions for CloudWatch access
- Bastion Module - Access logs and session monitoring
Module Maintenance: This module is compatible with Terraform 1.0+ and AWS Provider 5.x. CloudWatch Logs are retained according to configured retention periods and automatically deleted after expiration. Review retention settings periodically to balance observability needs with storage requirements.