All notable changes to the AIStore K8s Operator project are documented in this file, starting with version v2.2.0.
Note: Changes to Helm charts, Ansible playbooks, and other deployment tools are not included.
We structure this changelog in accordance with Keep a Changelog guidelines, and this project follows Semantic Versioning.
- Restore rebalance config before decommissioning targets during scale-down, as it may still be disabled from a prior rollout.
- Separated rollout (pod template updates) and scaling (replica count changes) into explicitly guarded operations that cannot overlap
- Target decommission now checks pod status when a node is absent from the cluster map, skipping pods that are NotFound, Unschedulable, or in CrashLoopBackOff and waiting for others to register.
- Replaced `ClusterScalingCR` state with predicates inferred from StatefulSet status fields; reconciliation decisions are no longer driven by CR state
- Admin client reconciliation
  - Skips externally managed deployments (e.g., deployed via Helm) to avoid conflicts
  - Uses the K8s patch API to avoid update conflicts
  - Fixed a bug causing API calls on every reconcile due to K8s default fields
- AuthN support for the operator-managed admin client when `spec.auth.usernamePassword` is configured
- Restricted security context for init and logSidecar containers
- Native support for arm64 hosts with multi-arch container image build targets
- Added `operator_state.md` documenting the cluster lifecycle states
- `spec.proxySpec.pvcRetentionPolicy` and `spec.targetSpec.pvcRetentionPolicy` for configuring retention policies for persistent volume claims
- Default init container resources set to 1 CPU / 1Gi memory (requests == limits) to support Guaranteed QoS
  - Applied on spec sync to avoid a forced rollout
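For illustration, a minimal sketch of how the PVC retention policy fields mentioned above might sit in an AIStore CR spec — the `Delete` value is an assumed example, not confirmed by this changelog:

```yaml
# Fragment of an AIStore CR spec (values are illustrative assumptions)
spec:
  proxySpec:
    pvcRetentionPolicy: Delete   # assumed enum value
  targetSpec:
    pvcRetentionPolicy: Delete
```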
- Reconciliation for volumes and priority class name.
- Sync primary container securityContext from spec
- `spec.proxySpec.probes` and `spec.targetSpec.probes` for configuring health probe timing parameters (liveness, readiness, startup) per daemon role
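A hedged sketch of these probe options; the sub-field names and values below are assumptions about the schema, not taken from this changelog:

```yaml
spec:
  targetSpec:
    probes:
      liveness:                    # assumed sub-field structure
        initialDelaySeconds: 60
        periodSeconds: 10
      readiness:
        periodSeconds: 5
      startup:
        failureThreshold: 30
```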
- Helm chart is now maintained persistently in the repository at `operator/helm` and allows additional customization and overrides of manifests
- Updated all tool versions and minor dependencies
- Updated target pod rollout strategy to search for the lowest pod ordinal not on the new revision instead of relying on `UpdatedReplicas`
- Fixed a bug where the readiness check against target count would deadlock when scaling up by more than 1 replica
- `spec.logSidecar` option to consolidate log sidecar configuration
- `namespaceScope` helm value takes an optional list of namespaces the operator can access and watch
  - Disables the ClusterRoleBinding for the operator ServiceAccount to the ClusterRole
  - Templates a RoleBinding for each namespace provided to the operator ServiceAccount
  - Templates a ClusterRole and binding for specific cluster-wide resource watches (nodes, storage classes)
  - Adds the list of namespaces as a comma-separated arg `--watch-namespaces` to the manager command to restrict watched namespaces
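As a sketch, restricting the operator to two namespaces via this helm value might look like the following (namespace names are examples):

```yaml
# values.yaml for the operator helm chart
namespaceScope:
  - ais-prod
  - ais-staging
```

With these values, the chart would pass `--watch-namespaces=ais-prod,ais-staging` to the manager command.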
- `spec.tls` option to consolidate TLS configuration under a single field
  - `tls.secretName` for using an existing Kubernetes TLS secret
  - `tls.certificate` for cert-manager managed certificates with `mode: secret` (Certificate CR) or `mode: csi` (CSI driver)
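A minimal sketch of the two `spec.tls` variants described above; the secret name is an example, and the `certificate` block likely accepts additional fields not shown here:

```yaml
# Variant 1: use an existing Kubernetes TLS secret
spec:
  tls:
    secretName: ais-tls
---
# Variant 2: cert-manager managed certificate via the CSI driver
spec:
  tls:
    certificate:
      mode: csi   # or "secret" to use a Certificate CR
```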
- `spec.tlsCertificate` - use `spec.tls.certificate` instead
- `spec.tlsSecretName` - use `spec.tls.secretName` instead
- `spec.tlsCertManagerIssuerName` - use `spec.tls.certificate` with `mode: csi` instead
- `spec.logSidecarResources` - use `spec.logSidecar.resources` instead
- `spec.logSidecarImage` - use `spec.logSidecar.image` instead
- Cleanup code for legacy cluster-specific ClusterRoles and ClusterRoleBindings, along with associated permissions
- TLS options (`tlsCertificate`, `tlsSecretName`, `tlsCertManagerIssuerName`) now automatically set AIS TLS config to use auto-mounted certificates at `/var/certs`
- Fixed a missing RBAC rule for job creation, required to run host path cleanup jobs
- Updated `ConfigToUpdate` to bring config options to parity with AIS version 4.2
- `spec.tlsCertificate` option to generate TLS certificates via cert-manager
- Operator deployments will now create a ServiceAccount `ais-operator-controller-manager` and bind only that ServiceAccount to the `ais-operator-manager-role` ClusterRole
- `priorityClassName` spec option to set a priority class for AIS pods
- `logSidecarResources` spec option to set K8s resource options for the logging sidecar
- Updated helm chart generation scripts to use the new `helm` overlay for better templating
- Operator now correctly detects endpoint changes in AIS clusters and recreates the client when the URL (including ports) changes
- Removed limitations on path length for HostPath volume mounts by using a constant, indexed volume name for HostPath volumes within pods.
- Fixed a bug where, if all cluster nodes and state were lost but the CR persisted in K8s, the `clusterID` in status would not update to the new cluster
- Reduced scope of ClusterRole `manager-role`
- Tightened default securityContext for manager pod and container
- `spec.targetSpec.pdb` option to configure a PodDisruptionBudget for target pods
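For illustration, assuming the `pdb` block accepts standard PodDisruptionBudget fields (an assumption, not confirmed by this changelog):

```yaml
spec:
  targetSpec:
    pdb:
      maxUnavailable: 1   # assumed standard PDB field
```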
- `spec.adminClient` option to deploy an AIS admin client auto-configured to connect to the cluster
  - Supports custom CA trust via `caConfigMap` for TLS-enabled clusters
- Host cleanup jobs use all tolerations provided to proxy and target pods.
- `AIS_AUTHN_CM` operator deployment environment variable and kustomize patch
- All references to deprecated ConfigMap-based cluster authentication info for the operator. Use the `auth` option from spec
- Deprecated option `enableNodeNameHost`. Use `publicNetDNSMode: Node` for equivalent behavior
- Add support for OAuth-compatible 3rd-party auth services with password-based login. Set `auth.serviceURL` to the token login endpoint and configure `auth.usernamePassword.loginConf.clientID`
- Add `publicNetDNSMode` option. Supports `IP`, `Node`, or `Pod` values to determine what AIS uses for public network DNS. `IP` is the current default and matches existing deployments. `Pod` can be used with host networking to use pod DNS to resolve the host IP, allowing for more granular TLS certificates
- Add `HOST_IPS` env var to init containers with field ref `status.hostIPs`, allowing future init containers to determine the public host from other options without an explicit variable from the operator
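A sketch of the `publicNetDNSMode` option described above; its placement directly under `spec` is an assumption:

```yaml
spec:
  publicNetDNSMode: Node   # one of IP (default), Node, Pod
```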
- Deprecate `enableNodeNameHost` added in v2.9.1 in favor of `publicNetDNSMode: Node`
- Updated dependencies including AIS to the latest `v4.1-rc1` commit to include the latest `cluster_key` config
- Avoid checking for the removed env var `AIS_PUBLIC_HOSTNAME` for the AIS container, which would cause a rollout on upgrade
- Add `enableNodeNameHost` to allow using K8s node hostnames for the public interface. Uses the K8s field `spec.nodeName` instead of `status.hostIP` if enabled
- Fixed a bug where an empty `net.http.client_auth_tls` in the AIS spec would cause an exception if TLS was enabled
- Auth
  - Support for new AIS config options under `configToUpdate.auth.cluster_key` for configuring target validation of proxy-signed requests
  - TLS support for operator-to-auth service communication with `spec.auth.tls.caCertPath` configuration
  - Fallback to the default CA bundle path (`/etc/ssl/certs/auth-ca/ca.crt`) when `spec.auth.tls.caCertPath` is not configured
  - TLS config caching (6-hour TTL) to minimize disk I/O when loading CA certificates
  - `truststore` package for CA certificate loading and TLS configuration management
  - TLS certificate verification for the auth service can be disabled via `spec.auth.tls.insecureSkipVerify` (not recommended for production)
  - Operator mounts the `ais-operator-auth-ca` ConfigMap to `/etc/ssl/certs/auth-ca` for Auth CA certificates when `authCAConfigmapName` is specified in the helm chart
  - OIDC issuer CA configuration via `spec.issuerCAConfigMap` for automatic certificate mounting and `auth.oidc.issuer_ca_bundle` configuration
- Autoscaling cluster size can now be limited by `spec.proxySpec.autoScale.sizeLimit` and `spec.targetSpec.autoScale.sizeLimit`
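A sketch of these size limits, using example values:

```yaml
spec:
  proxySpec:
    autoScale:
      sizeLimit: 6    # cap auto-scaled proxies at 6, even if more nodes match
  targetSpec:
    autoScale:
      sizeLimit: 12
```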
- Auth
  - TLS configuration only applied for HTTPS URLs; HTTP connections skip TLS setup
  - Return errors on TLS failures instead of silently falling back to insecure connections
  - Operator uses required audiences from the AIStore cluster's `spec.configToUpdate.auth.required_claims.aud` to request tokens with matching audiences during token exchange
- Configurable Helm values (`authCAConfigmapName` and `aisCAConfigmapName`) for auth service and AIStore custom CA bundle configmaps
- Fixed a bug where resuming from shutdown state would become stuck on target scale up due to failing API calls
- Build: `mockgen` now installed to `LOCALBIN` with a versioned suffix to prevent version mismatches that cause unnecessary diffs in generated mock files
- Use a common statefulset ready check for better enforcement of proxy rollout before starting target rollout
- Removed deprecation notice for the `hostPathPrefix` option, with `stateStorageClass` still recommended for easier host cleanup
- Defining the location of the admin credentials secret via the `AIS_AUTHN_CM` ConfigMap
  - Use `spec.auth.usernamePassword.secretName` and `spec.auth.usernamePassword.secretNamespace` for static secrets
  - Use `spec.auth.tokenExchange` options for token exchange
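As a sketch of the replacement `spec.auth` options (the endpoint, secret name, and namespace below are hypothetical):

```yaml
spec:
  auth:
    serviceURL: https://authn.example.com/login   # hypothetical token login endpoint
    usernamePassword:
      secretName: ais-admin-creds
      secretNamespace: ais
```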
- Helm chart: AIStore CRD now includes the `helm.sh/resource-policy: keep` annotation to prevent CRD deletion during `helm uninstall`, protecting AIStore clusters from cascade deletion
- Add `clusterID` field in AIStore status to track the unique identifier for the cluster
- Operator checks that cluster map proxy/target counts match the spec before setting the CR `Ready` condition to `True`
- Auth
  - RFC 8693 OAuth 2.0 Token Exchange support for AuthN token exchange with proper form-encoded requests
  - `TokenInfo` struct to track both the token and expiration time returned from AuthN
  - Support for RFC 8693 standard response fields (`access_token`, `issued_token_type`, `token_type`, `expires_in`)
  - Backward compatibility with the legacy `token` field in token exchange responses
  - AuthN configuration can now be specified directly in the AIStore CRD via the `spec.auth` field, with support for multiple authentication methods (username/password and token exchange) using CEL validation
- Updated Go version to 1.25 and updated direct dependencies
- Reduced requeues and set specific requeue delays instead of exponential backoff
- Set `publishNotReadyAddresses: true` on headless SVCs for proxies and targets to enable pre-readiness peer discovery
- Auth
  - Token exchange now implements the RFC 8693 specification with `application/x-www-form-urlencoded` content type
  - AuthN client methods now return `*TokenInfo` instead of a plain string to include expiration metadata
  - Token exchange requests use RFC 8693 required fields: `grant_type`, `subject_token`, and `subject_token_type`
  - Operator AuthN configuration prioritizes CRD `spec.auth` over ConfigMap (the ConfigMap approach is deprecated but supported for backward compatibility)
  - TLS configuration is now only applied for HTTPS URLs (HTTP connections no longer attempt TLS setup)
- Updated CRD `configToUpdate.auth` options to match the latest changes to AIS config, including RSA key support, required audience claims, and OIDC issuer lookup
- Auto-scaling mode: Set `size: -1` to automatically scale proxy/target pods based on node selectors and tolerations
- Host path mounting: Use `useHostPath: true` in the mount spec to bypass PV/PVC provisioning for direct host storage
- Auto-scale status tracking: New `AutoScaleStatus` field in cluster status tracks expected nodes for autoScaling clusters
- Token exchange authentication support to allow operators to exchange tokens (e.g., Kubernetes service account tokens or OIDC tokens) with authentication services for AIS JWT tokens
- Token expiration tracking: Support for the OAuth `expires_in` field in token exchange responses with automatic token refresh
- Efficient token refresh: In-place token updates without rebuilding HTTP clients when tokens expire
- Size validation: Allow `size: -1` for autoScaling mode
- Update target StatefulSet update strategy from RollingUpdate to OnDelete.
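A hedged sketch of the auto-scaling and host path mount options described above; the exact placement of `size` and the structure of the `mounts` list are assumptions:

```yaml
spec:
  targetSpec:
    size: -1                    # auto-scale targets from node selectors/tolerations
    mounts:
      - path: /ais/disk0        # example host path
        useHostPath: true       # bypass PV/PVC provisioning
```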
- Set maintenance mode before pod deletion during target rollouts.
- Reverse target rollout order to start from lowest ordinal (0 to N-1).
- Read authN configuration and secret location from the ConfigMap defined by `AIS_AUTHN_CM`
- Fix cleanup job loop to skip deleted jobs and avoid unnecessary requeues during cluster cleanup.
- Enforce primary proxy reassignment, if required, during proxy scaledown (was previously best-effort).
- Refactor proxy scaledown handling to improve primary proxy reassignment and node decommissioning with better logging.
- All AuthN environment variables from the operator deployment
  - `AIS_AUTHN_SU_NAME`
  - `AIS_AUTHN_SU_PASS`
  - `AIS_AUTHN_SERVICE_HOST`
  - `AIS_AUTHN_SERVICE_PORT`
  - `AIS_AUTHN_USE_HTTPS`
- Allow configured cloud backends via `spec.configToUpdate.backend` in the absence of K8s secrets; supports alternative secret injection
- Use statefulset status to simplify proxy rollout
- Update direct dependencies including AIS
- Add support for `labels` in the `AIStore` CRD to allow users to specify custom labels that will be applied to either proxy or target Pods.
- Add support for the `AIS_TEST_API_MODE` environment variable to specify API mode for non-external LB clusters.
- Add support for the `TEST_EPHEMERAL_CLUSTER` environment variable to skip cleanup/teardown when testing on ephemeral clusters (e.g., in CI).
- Add optional mount for `operator-tls` for supplying the operator with a certificate for client authentication.
- Add `ais-client-cert-path` for defining a specific location of operator AIS client certificates.
- Add `ais-client-cert-per-cluster` to support separate certificate locations for each AIS cluster.
- Add `OPERATOR_SKIP_VERIFY_CRT` option to deployment, which will initially default to `True` to match previous deployments.
- Add TLS configuration to AIS API client, supporting additional CA trust and client Auth.
- Add kustomize patch to mount a configMap `ais-operator-ais-ca` for trusting specific AIS CAs.
- Apply `imagePullSecrets` for image pull authentication to the service account instead of individual proxy/target pod specs.
- Update RBAC rule in the AIS service account to remove access to secrets and configmaps.
- Update kustomize structure to support overlays with different patch options on top of a common base.
- Remove kustomize usage of deprecated 'vars'.
- Support for the following env vars for testing
  - `AIS_TEST_NODE_IMAGE`
  - `AIS_TEST_PREV_NODE_IMAGE`
  - `AIS_TEST_INIT_IMAGE`
  - `AIS_TEST_PREV_INIT_IMAGE`
- COMPATIBILITY CHANGE: Removed creation of a ServiceAccount with ClusterRole. ClusterRoles for existing clusters will be deleted.
- Since ClusterRole is not namespaced, this allowed a creation of a namespaced AIS custom resource to result in a service account with cluster-wide access, which could then be impersonated.
- This operator version will ONLY support AIS versions v3.28 or later.
- Older AIS versions will error when trying to use the removed ClusterRole.
- Related AIS change: https://github.com/NVIDIA/aistore/commit/160626c8fa44fc43ba7e9d42561dfbfe4216745e