This document describes the monitoring, alerting, and notification infrastructure for NHS HomeTest.
CloudWatch Alarms → SNS Topics (tiered) → AWS Chatbot → Slack (#hometest-ops-alerts)
→ Email subscriptions
GitHub Actions → Slack Webhook (secret) → #hometest-ops-alerts

All CloudWatch alarms publish to one of three tiered SNS topics in shared_services. AWS Chatbot subscribes to each topic and forwards formatted messages to the configured Slack channel.
GitHub Actions deployment notifications use a separate webhook stored as a repository secret.
| Topic Suffix | Severity | Purpose | Examples |
|---|---|---|---|
| alerts-critical | P1 | Service-impacting issues requiring immediate attention | Lambda errors, DLQ messages, 5XX spikes, DB deadlocks, SQS age threshold |
| alerts-warning | P2 | Capacity/performance degradation | High latency, NAT port allocation errors, Aurora capacity approaching limits |
| alerts-security | P3 | Security events | WAF blocked request spikes, SQL injection attempts, rate limiting triggers |
All topics also send to the email subscriptions configured in sns_alerts_email_subscriptions (currently england.HomeTestInfraAdmins@nhs.net).
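As a rough sketch of how one tiered topic and its email fan-out could be expressed (the resource and variable names here are assumptions for illustration, not the actual module code):

```hcl
# Sketch only: one tiered SNS topic plus email subscriptions driven by a list variable.
variable "sns_alerts_email_subscriptions" {
  type    = list(string)
  default = ["england.HomeTestInfraAdmins@nhs.net"]
}

resource "aws_sns_topic" "alerts_critical" {
  name = "hometest-alerts-critical" # assumed name; the real topic name comes from the module
}

resource "aws_sns_topic_subscription" "alerts_critical_email" {
  for_each  = toset(var.sns_alerts_email_subscriptions)
  topic_arn = aws_sns_topic.alerts_critical.arn
  protocol  = "email"
  endpoint  = each.value # each address must confirm the subscription via the email AWS sends
}
```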
- Module: `infrastructure/modules/slack-alerts`
- Deployed in: `shared_services` layer
- Mechanism: AWS Chatbot Slack channel configurations subscribe to SNS topics and post formatted alarm messages to Slack
- Slack channel: `#hometest-ops-alerts` (all tiers currently routed to one channel)
AWS Chatbot natively renders CloudWatch alarm details including alarm name, state, reason, metric, and account context.
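The module's internals aren't reproduced in this document, but a minimal sketch of the kind of Chatbot channel configuration it manages might look like the following. This assumes the `aws_chatbot_slack_channel_configuration` resource available in recent AWS provider versions; the names, IAM role, and topic wiring are placeholders.

```hcl
# Sketch only: the slack-alerts module may structure this differently.
resource "aws_chatbot_slack_channel_configuration" "critical" {
  configuration_name = "hometest-alerts-critical"         # placeholder name
  iam_role_arn       = var.chatbot_role_arn                # role Chatbot assumes (placeholder)
  slack_team_id      = var.slack_workspace_id              # T0XXXXXXX
  slack_channel_id   = var.slack_channel_id_critical       # C0XXXXXXX
  sns_topic_arns     = [var.sns_alerts_critical_topic_arn] # subscribe the tiered topic
  logging_level      = "ERROR"                             # needed for troubleshooting logs
}
```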
- Authorize the Slack workspace in the AWS Chatbot console — this is a one-time manual step per AWS account + Slack workspace
- Obtain the Slack workspace ID (`T0XXXXXXX`) and channel ID(s) (`C0XXXXXXX`) — right-click the channel in Slack → View channel details → copy the ID at the bottom
```hcl
# infrastructure/environments/<account>/core/shared_services/terragrunt.hcl
inputs = {
  enable_slack_alerts       = true
  slack_workspace_id        = "T0XXXXXXX" # Slack workspace (team) ID
  slack_channel_id_critical = "C0XXXXXXX" # Channel for critical alerts
  slack_channel_id_warning  = "C0XXXXXXX" # Channel for warning alerts (same channel for now)
  slack_channel_id_security = "C0XXXXXXX" # Channel for security alerts (same channel for now)
}
```

To disable Slack alerts for an environment, set `enable_slack_alerts = false`. Email alerts continue independently.
All three severity tiers currently point to the same channel (#hometest-ops-alerts). To split by severity, create additional Slack channels and set distinct channel IDs for each slack_channel_id_* variable.
- Action: `.github/actions/notify-slack`
- Secret: `SLACK_WEBHOOK_URL` (GitHub Actions repository secret)
- Triggers: End of `cicd-deploy-poc`, `cicd-deploy-dev`, `deploy-hometest-app`, and `deploy-demo` workflows
- Content: Deployment status (success/failure/cancelled), environment, module, actor, and a link to the run
WAF Alarms (modules/waf-alarms)
| Alarm | Metric | Threshold | Severity |
|---|---|---|---|
| Blocked Request Spike | BlockedRequests | > 100 in 5 min | Security |
| Rate Limit Triggered | RateLimitRule count | > 0 in 5 min | Security |
| SQLi Detected | SQLiRule count | > 0 in 5 min | Security |
Applied to both the regional WAF (API Gateway/ALB) and CloudFront WAF.
Note: CloudFront WAF alarms are created in us-east-1 via `providers = { aws = aws.us_east_1 }` because CloudFront WAF metrics are published there.
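As an illustration of that pattern, a CloudFront-scoped instantiation might pass the alias like this (the module inputs shown are placeholders; check the module's variables for the real names):

```hcl
# Sketch: the CloudFront WAF alarms are pinned to the us-east-1 provider alias.
provider "aws" {
  alias  = "us_east_1"
  region = "us-east-1"
}

module "cloudfront_waf_alarms" {
  source = "../../modules/waf-alarms" # assumed relative path

  providers = {
    aws = aws.us_east_1 # CloudFront WAF metrics are only published in us-east-1
  }

  # Illustrative inputs: the real module may expect different variables.
  web_acl_name                  = var.cloudfront_web_acl_name
  sns_alerts_security_topic_arn = var.sns_alerts_security_topic_arn
}
```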
Network Alarms (modules/network-alarms)
| Alarm | Metric | Threshold | Severity |
|---|---|---|---|
| NAT GW Port Allocation Errors | ErrorPortAllocation | > 0 in 5 min | Warning |
| NAT GW Packets Dropped | PacketsDropCount | > 100 in 5 min | Warning |
| Network Firewall Dropped Packets | DroppedPackets | > 100 in 5 min | Warning |
NAT Gateway alarms are created per gateway via for_each.
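A minimal sketch of that per-gateway pattern (the variable names are assumptions):

```hcl
# Sketch: one port-allocation-error alarm per NAT gateway, keyed by a map of gateway IDs.
resource "aws_cloudwatch_metric_alarm" "nat_port_allocation_errors" {
  for_each = var.nat_gateway_ids # assumed map, e.g. { "az-a" = "nat-0abc...", "az-b" = "nat-0def..." }

  alarm_name          = "nat-gw-${each.key}-port-allocation-errors"
  namespace           = "AWS/NATGateway"
  metric_name         = "ErrorPortAllocation"
  dimensions          = { NatGatewayId = each.value }
  statistic           = "Sum"
  period              = 300 # 5 minutes
  evaluation_periods  = 1
  threshold           = 0
  comparison_operator = "GreaterThanThreshold" # > 0 in 5 min, per the table above
  alarm_actions       = [var.sns_alerts_warning_topic_arn]
}
```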
Aurora Alarms (modules/aurora-alarms)
| Alarm | Metric | Threshold | Severity |
|---|---|---|---|
| High CPU | CPUUtilization | > 80% avg over 5 min | Critical |
| Low Freeable Memory | FreeableMemory | < 256 MB avg over 5 min | Critical |
| High Connections | DatabaseConnections | > 100 avg over 5 min | Critical |
| Deadlocks | Deadlocks | > 0 sum in 5 min | Critical |
| Replica Lag | AuroraReplicaLag | > 100 ms avg over 5 min | Critical |
| High Serverless Capacity | ServerlessDatabaseCapacity | > max ACU × 80% | Critical |
| Low Free Storage | FreeLocalStorage | < 5 GB avg over 5 min | Critical |
Thresholds are configurable via the module variables.
Lambda Alarms (built into modules/lambda)
Each Lambda function gets these alarms automatically:
| Alarm | Metric | Threshold | Severity |
|---|---|---|---|
| Errors | Errors | ≥ 1 sum in 1 min | Critical |
| Throttles | Throttles | ≥ 1 sum in 1 min | Critical |
| Duration (p99) | Duration (p99) | ≥ timeout × 80% | Critical |
| Concurrent Executions | ConcurrentExecutions | ≥ reserved × 80% | Critical |
| Logged Errors | Custom metric (CloudWatch Logs metric filter) | ≥ 5 sum in 5 min | Critical |
The concurrency alarm is only created when reserved_concurrent_executions > 0.
The Logged Errors alarm uses a CloudWatch Logs metric filter to catch errors that are logged (e.g. console.error, caught exceptions) but don't fail the Lambda invocation. The filter pattern `?ERROR ?Error ?Exception ?errorType` matches any of those terms in a log line.
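A hedged sketch of that filter-plus-alarm pattern (the function name, log group, and metric namespace are placeholders rather than the module's actual resources):

```hcl
# Sketch: the metric filter turns matching log lines into a custom metric; the alarm watches it.
resource "aws_cloudwatch_log_metric_filter" "logged_errors" {
  name           = "example-fn-logged-errors"            # placeholder
  log_group_name = "/aws/lambda/example-fn"              # placeholder log group
  pattern        = "?ERROR ?Error ?Exception ?errorType" # any of these terms in a log line

  metric_transformation {
    name          = "LoggedErrors"
    namespace     = "HomeTest/Lambda" # assumed custom namespace
    value         = "1"
    default_value = "0"
  }
}

resource "aws_cloudwatch_metric_alarm" "logged_errors" {
  alarm_name          = "example-fn-logged-errors"
  namespace           = "HomeTest/Lambda"
  metric_name         = "LoggedErrors"
  statistic           = "Sum"
  period              = 300
  evaluation_periods  = 1
  threshold           = 5
  comparison_operator = "GreaterThanOrEqualToThreshold" # >= 5 sum in 5 min, per the table above
  alarm_actions       = [var.sns_alerts_critical_topic_arn]
}
```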
SQS Alarms (built into modules/sqs)
Each SQS queue gets an ApproximateAgeOfOldestMessage alarm. The following queues are monitored:
- `order-placement`
- `order-result`
- `order-notification`
- `order-eviction`
- `supplier-notification`
Each queue also has a dead-letter queue (DLQ) with its own age alarm.
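As a sketch of what one of those age alarms could look like (the queue name and threshold are placeholders; the real thresholds come from the module variables):

```hcl
# Sketch: alarm when the oldest message on a queue exceeds an age threshold.
resource "aws_cloudwatch_metric_alarm" "queue_message_age" {
  alarm_name          = "order-placement-message-age"     # placeholder
  namespace           = "AWS/SQS"
  metric_name         = "ApproximateAgeOfOldestMessage"
  dimensions          = { QueueName = "order-placement" } # placeholder queue name
  statistic           = "Maximum"
  period              = 300
  evaluation_periods  = 1
  threshold           = 600 # assumed 10-minute threshold for illustration
  comparison_operator = "GreaterThanThreshold"
  alarm_actions       = [var.sns_alerts_critical_topic_arn]
}
```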
API Gateway Alarms (modules/api-gateway-alarms)
| Alarm | Metric | Threshold | Severity |
|---|---|---|---|
| 5XX Error Rate | 5XXError / Count × 100 | > 1% in 5 min | Critical |
| 4XX Error Rate | 4XXError / Count × 100 | > 10% in 5 min | Critical |
| Latency (p99) | Latency (p99) | > 3000 ms in 5 min | Critical |
| Integration Latency (p99) | IntegrationLatency (p99) | > 2000 ms in 5 min | Critical |
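Because the error-rate alarms are percentages of total requests, they imply CloudWatch metric math rather than a raw metric. A sketch of the 5XX rate alarm under that assumption (the API name, dimensions, and overall structure are illustrative):

```hcl
# Sketch: 5XX errors as a percentage of request count, alarming above 1% over 5 minutes.
resource "aws_cloudwatch_metric_alarm" "api_5xx_error_rate" {
  alarm_name          = "example-api-5xx-error-rate" # placeholder
  evaluation_periods  = 1
  threshold           = 1
  comparison_operator = "GreaterThanThreshold"
  alarm_actions       = [var.sns_alerts_critical_topic_arn]

  metric_query {
    id          = "error_rate"
    expression  = "IF(requests > 0, errors / requests * 100, 0)" # avoid divide-by-zero
    label       = "5XX error rate (%)"
    return_data = true
  }

  metric_query {
    id = "errors"
    metric {
      namespace   = "AWS/ApiGateway"
      metric_name = "5XXError"
      period      = 300
      stat        = "Sum"
      dimensions  = { ApiName = "example-api" } # placeholder
    }
  }

  metric_query {
    id = "requests"
    metric {
      namespace   = "AWS/ApiGateway"
      metric_name = "Count"
      period      = 300
      stat        = "Sum"
      dimensions  = { ApiName = "example-api" } # placeholder
    }
  }
}
```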
CloudFront Alarms (modules/cloudfront-alarms)
| Alarm | Metric | Threshold | Severity |
|---|---|---|---|
| 5XX Error Rate | 5xxErrorRate | > 1% in 5 min | Critical |
| 4XX Error Rate | 4xxErrorRate | > 10% in 5 min | Critical |
| Origin Latency (p99) | OriginLatency (p99) | > 5000 ms in 5 min | Critical |
Note: CloudFront alarms are created in us-east-1 via `providers = { aws = aws.us_east_1 }` because CloudFront metrics are published there.
All alarm thresholds are configurable via module variables. Default values are chosen for a typical workload and should be reviewed per environment:
- POC/Dev: Defaults are usually fine; consider relaxing to avoid noise during development.
- Production: Review thresholds against actual traffic patterns. Tighten error rate thresholds and shorten evaluation periods.
To override a threshold, set the corresponding variable in the terragrunt inputs for the relevant layer.
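For example, a production layer might tighten a couple of thresholds like this (the variable names below are illustrative; check each module's variables for the exact names):

```hcl
# infrastructure/environments/<account>/core/<layer>/terragrunt.hcl (sketch)
inputs = {
  api_gateway_5xx_error_rate_threshold = 0.5 # illustrative variable name
  aurora_cpu_utilization_threshold     = 70  # illustrative variable name
}
```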
By default, alarms do not send notifications when returning to OK state (enable_ok_actions = false). This reduces noise in dev/POC environments where alarms may fire on first deploy before metrics exist.
To enable recovery notifications (recommended for production):
```hcl
# In terragrunt inputs for each layer (shared_services, hometest-app, aurora-postgres)
enable_ok_actions = true
```

This is configured per-layer so you can enable it selectively (e.g. prod only).
- Determine the layer: Which Terraform source deploys the resource? (`shared_services`, `hometest-app`, `aurora-postgres`)
- Choose the severity: Critical (P1), Warning (P2), or Security (P3)
- Create or extend a module: Add the CloudWatch alarm with `alarm_actions = [var.sns_alerts_<severity>_topic_arn]`
- Wire in the source: Reference the module in the appropriate `src/*/alarms.tf` file
- Pass the SNS topic ARN: Ensure the terragrunt config passes the topic ARN from `shared_services` outputs (see the sketch below)
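A rough sketch of that wiring in terragrunt (the dependency path and output name are assumptions):

```hcl
# Sketch: pull the critical topic ARN from the shared_services layer's outputs.
dependency "shared_services" {
  config_path = "../shared_services" # assumed relative path
}

inputs = {
  sns_alerts_critical_topic_arn = dependency.shared_services.outputs.sns_alerts_critical_topic_arn
}
```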
- Verify the Slack workspace is authorized in the AWS Chatbot console
- Check the Chatbot channel configuration exists: `aws chatbot describe-slack-channel-configurations`
- Verify the workspace ID and channel ID match your Slack workspace/channel
- Check SNS subscriptions: `aws sns list-subscriptions-by-topic --topic-arn <topic-arn>` — Chatbot should appear as a subscriber
- Check the Chatbot CloudWatch log group for errors (logging level must be `INFO` or `ERROR`)
- Test with a manual publish:

  ```bash
  aws sns publish --topic-arn <topic-arn> --subject "Test Alert" \
    --message '{"AlarmName":"test","NewStateValue":"ALARM","NewStateReason":"Manual test"}'
  ```
- Verify the `SLACK_WEBHOOK_URL` repository secret is set in GitHub → Settings → Secrets and variables → Actions
- Check the workflow run logs for the `notify-slack` step output
- Test the webhook manually: `curl -X POST -H 'Content-type: application/json' --data '{"text":"test"}' <webhook-url>`
This typically means the metric has no data points yet. For Lambda metrics, trigger the function at least once. For API Gateway metrics, send a request. The alarm will transition to OK once data appears.