Production checklist

For the service to be ready for end users, it must be reliable, performant and sustainable.

StatusCake

This is the most essential monitoring: if it alerts, users cannot access the site. It monitors both uptime and the SSL certificate. Use the terraform module to configure it.

Ask the infra team for help with these steps:

  • Create the dev team contact group if necessary. Add the team email, developer emails and phone numbers if desired
  • Get the dev team contact group id from the URL
  • Configure credentials
  • Fill in enable_monitoring, external_url and statuscake_contact_groups variables in the environment tfvars.json file. Example:
    "enable_monitoring": true,
    "external_url": "https://calculate-teacher-pay.education.gov.uk/healthcheck",
    "statuscake_contact_groups": [195955]
  • For production, add the infra team contact group id: 282453
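
The settings above can be sanity-checked before applying them. The following is a minimal sketch in Python, assuming the environment tfvars.json file has been loaded as a dictionary. The function name `check_statuscake` and the `production` flag are hypothetical and are not part of the template.

```python
import json

# Infra team contact group id, as listed in the checklist above.
INFRA_TEAM_CONTACT_GROUP = 282453


def check_statuscake(tfvars: dict, production: bool = False) -> list:
    """Return a list of problems found in the StatusCake settings (empty if none)."""
    problems = []
    if tfvars.get("enable_monitoring") is not True:
        problems.append("enable_monitoring must be true")
    url = str(tfvars.get("external_url", ""))
    if not url.startswith("https://"):
        problems.append("external_url must be an https URL")
    groups = tfvars.get("statuscake_contact_groups", [])
    if not groups:
        problems.append("statuscake_contact_groups must list at least one contact group id")
    if production and INFRA_TEAM_CONTACT_GROUP not in groups:
        problems.append("production must include the infra team contact group")
    return problems


# Same values as the example fragment above, wrapped in braces to form valid JSON.
example = json.loads("""
{
  "enable_monitoring": true,
  "external_url": "https://calculate-teacher-pay.education.gov.uk/healthcheck",
  "statuscake_contact_groups": [195955]
}
""")

print(check_statuscake(example))                   # non-production environment
print(check_statuscake(example, production=True))  # infra team group is missing
```

A check like this could run in CI before terraform plan, but the real validation remains the terraform module itself.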

Multiple replicas

By default the template deploys only 1 replica for each kubernetes deployment. This is not sufficient for production: if the container is unavailable, there is no other replica to serve requests. It may be unavailable because of high usage, or simply because the cluster is moving the container to another node, which happens whenever the cluster version is updated.

Use at least 2 replicas or as many as required by performance testing.

Database plan

The template deploys a default plan for postgres and redis.

It may be sufficient for the test environments, but it may not offer enough CPU, memory or network bandwidth for production. Performance testing will help determine the right plans.

Note that for redis, azure_family, azure_sku_name and azure_capacity must all be changed together, as only certain combinations are valid. Check terraform postgres documentation for the allowed values.
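
To illustrate why the three redis values go together, here is a minimal sketch in Python. The table of combinations reflects our understanding of the Azure terraform provider documentation (family C for Basic and Standard, family P for Premium) and should be checked against it. The function name `redis_plan_is_consistent` is hypothetical.

```python
# Assumed allowed combinations: sku name -> (family, allowed capacities).
# Verify against the provider documentation before relying on this table.
ALLOWED_REDIS_PLANS = {
    "Basic": ("C", range(0, 7)),     # capacities 0 to 6
    "Standard": ("C", range(0, 7)),  # capacities 0 to 6
    "Premium": ("P", range(1, 6)),   # capacities 1 to 5
}


def redis_plan_is_consistent(azure_sku_name: str, azure_family: str, azure_capacity: int) -> bool:
    """Return True if the sku name, family and capacity form a valid combination."""
    if azure_sku_name not in ALLOWED_REDIS_PLANS:
        return False
    family, capacities = ALLOWED_REDIS_PLANS[azure_sku_name]
    return azure_family == family and azure_capacity in capacities


print(redis_plan_is_consistent("Standard", "C", 1))  # valid combination
print(redis_plan_is_consistent("Premium", "C", 1))   # sku changed, family left behind
```

The second call shows the typical mistake: changing azure_sku_name to Premium without also changing azure_family.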

High availability

Each Azure region provides multiple availability zones. The kubernetes cluster is deployed across 3 zones, so if one fails, the workload continues on the other 2.

The same should be applied to database clusters. For postgres, set azure_enable_high_availability to true. For redis, use a Premium plan.

Note the cost is doubled for postgres, and much higher for redis, so this should be used carefully.

Performance testing

Simulate load from user traffic to determine the right number of instances and the database plan. This should cover the most typical user journeys. We recommend K6 as it can be deployed to the cluster to minimise latency. Check the example in teacher pay calculator.

If time is short or user traffic is expected to be low, make sure to monitor the application and database usage after launch and every time a significant new feature is released, and be ready to scale up.

Postgres backups to Azure storage

Azure postgres provides an automatic backup with a 7-day retention period. It can be restored from a point in time to a new database server.

In case there is a major issue and the above doesn't work, we strongly suggest taking an additional backup every night and storing it in Azure storage. Set the azure_enable_backup_storage variable to true to create the storage account. Then create a workflow using the backup-postgres github action and schedule it nightly.

Postgres and redis monitoring

Set azure_enable_monitoring to true to enable logging, monitoring and alerting. It will alert the infrastructure team by email by default.

Front door monitoring

Set azure_enable_monitoring to true in the domains/infrastructure module to enable logging on front door. It is verbose and costly, so it should not be enabled by default, but it can be extremely useful for troubleshooting.

Custom domain

The default web application domain in production is teacherservices.cloud, and the application domain is <application_name>.teacherservices.cloud. It should not be used by end users. Instead, we normally create a subdomain of either education.gov.uk or service.gov.uk. Here is the process:

If an apex domain is used, make sure to configure StatusCake SSL monitoring as the certificate must be regenerated manually every 180 days.

Pin all versions

The infrastructure code should pin the versions of all components so that a build never picks up a different version unexpectedly. The build must be predictable between environments and over time. We should upgrade versions frequently, but only deliberately and once the new version is fully tested.

Components with versions:

  • Terraform (in application, domains infrastructure and environment_domains)
  • Terraform providers (azure, kubernetes, StatusCake)
  • Postgres
  • Redis
  • Terrafile binary
  • Terrafile environment files: each one should point at either main, testing or stable according to the terraform modules release process
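
As an illustration of what "pinned" means for Terraform and provider version constraints, here is a minimal sketch in Python. It treats only an exact version, with or without the "=" operator, as pinned. Range operators such as "~>" or ">=" are not. The function name `is_pinned` is hypothetical, and pre-release suffixes are deliberately not handled.

```python
import re

# An exact pin is a bare version ("1.5.4") or one using the "=" operator ("= 1.5.4").
_EXACT_VERSION = re.compile(r"^\s*=?\s*\d+\.\d+\.\d+\s*$")


def is_pinned(constraint: str) -> bool:
    """Return True if the version constraint allows exactly one version."""
    return bool(_EXACT_VERSION.match(constraint))


for constraint in ["1.5.4", "= 3.116.0", "~> 3.0", ">= 1.5.4", ">= 1.0, < 2.0"]:
    print(f"{constraint!r}: pinned={is_pinned(constraint)}")
```

The version numbers in the loop are arbitrary examples, not the versions the template uses.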

Maintenance window

Azure applies patches and minor updates to postgres and redis. Since this may cause a minor disruption, use the azure_maintenance_window and azure_patch_schedule variables to schedule them at a convenient time, when the service receives less traffic.

Note the postgres patches will always be applied first to environments where the maintenance window is not set.

Service offering

The new service template uses the default "Teacher services cloud" value for the Product tag. This tag is used to identify the service in the Azure finance reporting. Each service must register a new service offering and product and replace "Teacher services cloud" with the right name so that Azure costs are allocated accordingly.

Maintenance page

Optional but recommended for user-facing services. See Maintenance page for more details.

Lock critical resources

Add a lock to critical Azure resources, such as production databases, to prevent accidental deletion. Members of the s189-teacher-services-cloud-ResLock Admin Entra ID group (infra team) can manage locks.

  • Open the resource in the Azure portal
  • Settings > Locks > + Add > Lock name: Delete, Lock type: Delete > OK