Conversation
Walkthrough

This pull request introduces a Kubernetes-native operator to orchestrate OpenFGA database migrations, replacing the existing Helm hook-based approach. The implementation includes a controller-runtime operator in Go, a Helm subchart for operator deployment, integration into the main OpenFGA chart, comprehensive testing, and architectural documentation via ADRs.

Changes

Sequence Diagram

```mermaid
sequenceDiagram
    participant User as GitOps/User
    participant Helm as Helm Install
    participant K8s as Kubernetes
    participant Op as Migration Operator
    participant DB as Database
    User->>Helm: helm install (operator.enabled=true)
    Helm->>K8s: Create Deployment (replicas=0, annotations)
    Helm->>K8s: Create MigrationReconciler
    Helm->>K8s: Return immediately
    Note over K8s: Operator starts watching...
    K8s->>Op: Detect Deployment with migration enabled
    Op->>Op: Read current migration version from ConfigMap
    Op->>Op: Extract desired version from image tag
    Note over Op: Versions differ → Migration needed
    Op->>K8s: Scale Deployment replicas → 0
    Op->>K8s: Create migration Job (runs openfga migrate)
    K8s->>DB: Job executes migration SQL
    DB-->>K8s: Migration succeeds
    K8s->>Op: Job marked as Complete
    Op->>K8s: Update migration-status ConfigMap (new version)
    Op->>K8s: Scale Deployment replicas → desired count
    K8s->>K8s: Deployment pods start (migration complete)
    K8s-->>User: OpenFGA ready
    Note over Op,User: On migration failure:
    Note over Op: Set MigrationFailed condition
    Note over Op: Delete failed Job after backoff
    Note over Op: Requeue with 60s cooldown
```
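The version check at the heart of this flow (ConfigMap-recorded version vs. Deployment image tag) can be sketched in stdlib-only Go. This is an illustrative sketch, not the operator's actual API; `desiredVersionFromImage` and `migrationNeeded` are hypothetical names:

```go
package main

import (
	"fmt"
	"strings"
)

// desiredVersionFromImage extracts the tag from an image reference,
// e.g. "openfga/openfga:v1.14.0" -> "v1.14.0". Returns "" for untagged
// references (a ":" belonging to a registry host:port is not a tag).
func desiredVersionFromImage(image string) string {
	i := strings.LastIndex(image, ":")
	if i < 0 || strings.Contains(image[i:], "/") {
		return ""
	}
	return image[i+1:]
}

// migrationNeeded compares the version recorded in the migration-status
// ConfigMap with the version derived from the Deployment's image tag.
func migrationNeeded(currentFromConfigMap, image string) bool {
	desired := desiredVersionFromImage(image)
	return desired != "" && currentFromConfigMap != desired
}

func main() {
	fmt.Println(migrationNeeded("v1.13.0", "openfga/openfga:v1.14.0")) // true
	fmt.Println(migrationNeeded("v1.14.0", "openfga/openfga:v1.14.0")) // false
}
```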
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes

🚥 Pre-merge checks: ✅ 4 passed | ❌ 1 failed (1 warning)
Review the following changes in direct dependencies. Learn more about Socket for GitHub.
Pull request overview
This PR introduces an OpenFGA migration operator and integrates it into the existing Helm chart to replace Helm hook/initContainer-based migration orchestration (improving compatibility with GitOps tools and helm --wait).
Changes:
- Added a new Go-based operator (controller-runtime) that orchestrates migrations via Kubernetes Jobs and a migration-status ConfigMap.
- Added an `openfga-operator` Helm subchart and integrated it as an optional dependency of the parent `openfga` chart (`operator.enabled`).
- Added ADR documentation and local integration test values/instructions for validating operator-driven migration behavior.
Reviewed changes
Copilot reviewed 37 out of 39 changed files in this pull request and generated 13 comments.
Show a summary per file
| File | Description |
|---|---|
| operator/tests/values-no-db.yaml | Values for permanent DB failure integration scenario |
| operator/tests/values-happy-path.yaml | Values + embedded Postgres manifests for happy-path scenario |
| operator/tests/values-db-outage.yaml | Values + Postgres scaled-to-0 outage/recovery scenario |
| operator/tests/README.md | Step-by-step local integration test instructions |
| operator/README.md | Operator overview, dev commands, and configuration docs |
| operator/Makefile | Basic build/test/vet/lint and docker targets for the operator |
| operator/internal/controller/migration_controller.go | Core reconciliation logic for scaling, Job orchestration, ConfigMap status |
| operator/internal/controller/migration_controller_test.go | Unit tests for reconcile paths and image tag parsing |
| operator/internal/controller/helpers.go | Job builder + scaling + ConfigMap helpers |
| operator/go.mod | Operator module definition and dependencies |
| operator/go.sum | Dependency lockfile for operator module |
| operator/Dockerfile | Multi-stage build producing distroless runtime image |
| operator/cmd/main.go | Operator entrypoint, flags, manager setup, health endpoints |
| operator/.dockerignore | Docker context exclusions for faster builds |
| docs/adr/README.md | ADR index and ADR process documentation |
| docs/adr/004-operator-deployment-model.md | ADR for deploying operator as Helm subchart dependency |
| docs/adr/003-declarative-store-lifecycle-crds.md | ADR for future CRD-based store/model/tuples management |
| docs/adr/002-operator-managed-migrations.md | ADR describing migration flow moved from Helm hooks to operator |
| docs/adr/001-adopt-openfga-operator.md | ADR establishing the operator approach and staged roadmap |
| docs/adr/000-template.md | ADR template for future architectural decisions |
| charts/openfga/values.yaml | Added operator + migration values surface |
| charts/openfga/values.schema.json | Schema updates for new operator/migration values and subchart passthrough |
| charts/openfga/templates/serviceaccount.yaml | Adds optional dedicated migration ServiceAccount in operator mode |
| charts/openfga/templates/rbac.yaml | Skips legacy hook/init-container RBAC when operator is enabled |
| charts/openfga/templates/job.yaml | Skips legacy Helm hook migration Job when operator is enabled |
| charts/openfga/templates/deployment.yaml | Operator mode annotations, replicas=0 behavior, disables hook init-container path |
| charts/openfga/templates/_helpers.tpl | Helper for migration ServiceAccount naming |
| charts/openfga/Chart.yaml | Adds openfga-operator dependency gated by operator.enabled |
| charts/openfga/Chart.lock | Lockfile updated to include the new subchart dependency |
| charts/openfga-operator/values.yaml | Operator chart configuration (image, leader election, watch scope, resources) |
| charts/openfga-operator/templates/serviceaccount.yaml | Operator ServiceAccount manifest |
| charts/openfga-operator/templates/NOTES.txt | Post-install notes for operator chart |
| charts/openfga-operator/templates/deployment.yaml | Operator Deployment manifest and CLI args wiring |
| charts/openfga-operator/templates/clusterrolebinding.yaml | ClusterRoleBinding for operator permissions |
| charts/openfga-operator/templates/clusterrole.yaml | ClusterRole for operator (deployments/jobs/configmaps/etc.) |
| charts/openfga-operator/templates/_helpers.tpl | Operator chart naming/labels helpers |
| charts/openfga-operator/crds/README.md | Placeholder note for future CRDs (Stage 2+) |
| charts/openfga-operator/Chart.yaml | Operator chart metadata/versioning |
| .github/workflows/operator.yml | CI workflow for operator tests + multi-arch build/push |
Actionable comments posted: 17
🧹 Nitpick comments (1)
operator/tests/values-no-db.yaml (1)
18-18: Avoid inline credentials in test fixture URIs. Line 18 embeds basic-auth credentials, which triggers secret scanners and normalizes plaintext credential patterns. Prefer a redacted placeholder for this failure-path fixture.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@operator/tests/values-no-db.yaml` at line 18, Replace the inline basic-auth credentials embedded in the YAML "uri" value (the failing PostgreSQL URI) with a redacted placeholder or environment-variable reference (e.g., remove "openfga:changeme@" and replace with a token like "<REDACTED_CREDENTIALS>" or use an env var reference) so the test fixture no longer contains plaintext credentials; update the "uri" key's value accordingly and ensure any test code reading this fixture can handle the placeholder or env var.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In @.github/workflows/operator.yml:
- Around line 7-11: The workflow path filters are too narrow; update the
operator workflow (operator.yml) to also trigger on changes that affect image
publishing and self-validation by adding the chart and workflow files to the
paths filter — include at minimum "charts/openfga-operator/Chart.yaml" and
".github/workflows/operator.yml" (or broader "charts/**" and
".github/workflows/**") so that functions relying on Chart.yaml-derived tags and
workflow self-changes will cause the workflow to run.
In `@charts/openfga-operator/templates/clusterrole.yaml`:
- Around line 20-25: Remove the unused Secret and ServiceAccount permissions
from the operator ClusterRole template: delete the rules that list resources
"secrets" with verb "get" and "serviceaccounts" with verbs "get","list","create"
in the charts/openfga-operator/templates/clusterrole.yaml so the ClusterRole
only grants permissions actually used by the controller (referenced by
migration_controller.go and helpers.go which do not access Secrets or create
ServiceAccounts); ensure no other ClusterRole rules depend on those resource
entries and run a quick helm template to validate the manifest.
In `@charts/openfga-operator/templates/deployment.yaml`:
- Around line 27-36: The deployment template leaves podSecurityContext and
container securityContext empty (podSecurityContext and securityContext blocks)
which allows runtime-default privileges; update the chart defaults in
values.yaml and the deployment template to enable hardened defaults: set
podSecurityContext with runAsNonRoot: true and runAsUser (non-root uid), and set
container securityContext to runAsNonRoot: true, runAsUser, seccompProfile: {
type: RuntimeDefault }, capabilities: { drop: ["ALL"] }, and
readOnlyRootFilesystem: true; ensure these keys are present and not commented
out so the operator container and pod inherit the hardened defaults (refer to
podSecurityContext and the container securityContext entries in the deployment
template and the commented hardening directives in values.yaml).
In `@charts/openfga-operator/templates/NOTES.txt`:
- Around line 3-5: Replace the hard assertion in the Helm NOTES template that
the operator image "({{ .Values.image.repository }}:{{ .Values.image.tag |
default .Chart.AppVersion }}) does not exist yet" with a neutral, conditional
message: detect if a developer flag or a value (e.g., .Values.image.isDev or by
checking if .Values.image.tag equals .Chart.AppVersion) indicates a development
build and only then warn that the image may not be pushed; otherwise emit a
neutral info line that the operator uses the specified image without asserting
non-existence. Update the template text in NOTES.txt around the referenced
symbols (.Values.image.repository, .Values.image.tag, .Chart.AppVersion) to
reflect this conditional/neutral wording.
In `@charts/openfga/templates/_helpers.tpl`:
- Around line 80-85: The helper openfga.migrationServiceAccountName currently
always returns a generated name; change it to first check
.Values.migration.serviceAccount.create and
.Values.migration.serviceAccount.name, and if create is false and name is not
provided, return an empty string (so operator-created migration Jobs won't
reference a non-existent SA). Update the logic in the
openfga.migrationServiceAccountName function to: if
.Values.migration.serviceAccount.name use that (trunc/trimSuffix), else if
.Values.migration.serviceAccount.create is false return "" immediately,
otherwise generate and return the fallback printf "%s-migration" (include
"openfga.fullname" .) with trunc/trimSuffix.
In `@charts/openfga/templates/deployment.yaml`:
- Around line 8-10: The template is honoring only .Values.operator.enabled but
must also respect .Values.migration.enabled; update the conditional logic
wherever migration-related annotations/fields are set (e.g., the blocks that set
openfga.dev/desired-replicas, openfga.dev/migration-service-account, and any
migration init-container toggles) to require both .Values.operator.enabled and
.Values.migration.enabled (e.g., change conditionals to check both flags before
emitting those annotations/fields), and apply the same fix to the other
occurrences noted (lines ~16-23 and ~49) so operator-mode branches do not inject
migration annotations or set replicas to 0 when migration.enabled is false.
In `@docs/adr/002-operator-managed-migrations.md`:
- Around line 119-125: The ADR currently states "exponential backoff" for
database connectivity retries but the controller implements a fixed retry
cadence; update the ADR text in the "Database unreachable" row to document the
actual retry policy used by the controller (replace "exponential backoff" with
the fixed 60s retry), and mention how this interacts with Job reschedules and
operator restart behavior (keep references to MigrationFailed condition,
activeDeadlineSeconds, ConfigMap re-read and Job status resume semantics to
preserve context).
In `@docs/adr/004-operator-deployment-model.md`:
- Around line 36-39: Update the ADR text and all examples to reflect that the
operator is disabled by default: change the description of the "Operator as a
conditional subchart dependency (selected)" option from "Enabled by default" to
"Disabled by default", update the examples to show how to enable it by setting
operator.enabled: true, and adjust the other occurrences that claim it is
enabled by default (mentions of openfga-operator and any numbered sections
referenced around lines 74-75 and 127-128) so they align with the actual default
in values.schema.json where operator.enabled is false.
In `@docs/adr/README.md`:
- Around line 33-39: The fenced code block in docs/adr/README.md is missing a
language label which triggers markdownlint; edit that block (the diagram block
containing "Proposed → Accepted → (optionally) Superseded or Deprecated" and the
feedback loop arrows) to add a language specifier (e.g., use "text") immediately
after the opening backticks so the block becomes a labeled fenced code block.
In `@operator/cmd/main.go`:
- Around line 40-70: Default flags leave the operator watching all namespaces;
ensure it falls back to the pod's namespace by reading the POD_NAMESPACE env var
when watchNamespace is empty and watchAllNamespaces is false, and set
cacheOpts.DefaultNamespaces accordingly before calling ctrl.NewManager. In
practice, add an os.LookupEnv("POD_NAMESPACE") check after flag.Parse() and, if
watchNamespace == "" && !watchAllNamespaces && POD_NAMESPACE is present, set
cacheOpts.DefaultNamespaces = map[string]cache.Config{podNamespace: {}}
(referencing watchNamespace, watchAllNamespaces, cacheOpts.DefaultNamespaces,
and ctrl.NewManager), and also ensure the chart/deployment injects POD_NAMESPACE
into the operator pod env.
In `@operator/go.mod`:
- Line 3: The Docker builder image is using a floating patch tag (FROM
golang:1.25) while operator/go.mod pins go 1.25.6; update the Dockerfile's base
image to the exact patch version to match go.mod by changing the FROM
golang:1.25 line to FROM golang:1.25.6 so builds are reproducible and aligned
with the declared go 1.25.6 in operator/go.mod.
In `@operator/internal/controller/helpers.go`:
- Around line 120-142: The Job pod template currently copies
NodeSelector/Tolerations/Affinity but omits imagePullSecrets and security
contexts; update the PodTemplateSpec in the migration Job (where Template:
corev1.PodTemplateSpec is built) to also set Spec.ImagePullSecrets =
deployment.Spec.Template.Spec.ImagePullSecrets, Spec.SecurityContext =
deployment.Spec.Template.Spec.SecurityContext, and copy the container-level
SecurityContext into the "migrate-database" container
(Containers[0].SecurityContext =
deployment.Spec.Template.Spec.Containers[i].SecurityContext for the matching
mainContainer.Image/name), ensuring the migration container keeps the same
Image/Args/Env while inheriting the source Deployment's imagePullSecrets and
security settings.
In `@operator/internal/controller/migration_controller.go`:
- Around line 49-55: The code incorrectly assumes the main OpenFGA container is
Containers[0]; update the logic to locate the OpenFGA container by name (e.g.,
look for container.Name == "openfga" or the configured main container name)
instead of indexing 0, retrieve its Image and env once (store the resolved core
container object or its index) and pass or reuse that reference when calling
extractImageTag and buildMigrationJob so both use the same resolved container;
update any checks (previous len/Containers[0] usages) to handle missing named
container and return/skip with a clear log if not found.
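The lookup-by-name suggestion, sketched with a minimal stand-in struct for `corev1.Container`; the helper names are hypothetical and the real controller would operate on the Deployment's pod spec directly:

```go
package main

import (
	"fmt"
	"strings"
)

// container is a minimal stand-in for corev1.Container.
type container struct {
	Name  string
	Image string
}

// findMainContainer locates the OpenFGA container by name instead of
// assuming it is Containers[0], so injected sidecars (e.g. a service
// mesh proxy) don't break image-tag detection.
func findMainContainer(containers []container, name string) (container, bool) {
	for _, c := range containers {
		if c.Name == name {
			return c, true
		}
	}
	return container{}, false
}

// extractImageTag pulls the version from an image reference; a ":" that
// belongs to a registry host:port is not treated as a tag.
func extractImageTag(image string) string {
	if i := strings.LastIndex(image, ":"); i >= 0 && !strings.Contains(image[i:], "/") {
		return image[i+1:]
	}
	return ""
}

func main() {
	pod := []container{
		{Name: "istio-proxy", Image: "istio/proxyv2:1.22.0"}, // sidecar first
		{Name: "openfga", Image: "openfga/openfga:v1.14.0"},
	}
	if c, ok := findMainContainer(pod, "openfga"); ok {
		fmt.Println(extractImageTag(c.Image)) // v1.14.0
	}
}
```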
- Around line 108-121: The success branch handling job.Status.Succeeded >= 1
never clears a prior MigrationFailed condition; update the logic so that after
logging success and before returning it clears/removes the MigrationFailed
condition in the migration status ConfigMap. Specifically, in the
job.Status.Succeeded block (around updateMigrationStatus and
ensureDeploymentScaled), modify or extend updateMigrationStatus (or call a new
helper) to set the MigrationFailed condition to false/remove it for
desiredVersion/jobName so kubectl describe no longer shows a stale failure; keep
the existing replica scaling flow (ensureDeploymentScaled) and return only after
the status has been updated.
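A minimal sketch of clearing the stale condition on the success path, using a plain struct in place of `appsv1.DeploymentCondition` (the helper name is illustrative, not the controller's actual function):

```go
package main

import "fmt"

// condition is a stand-in for appsv1.DeploymentCondition.
type condition struct {
	Type   string
	Status string
	Reason string
}

// clearCondition returns the condition list with every entry of the given
// type removed; calling it with "MigrationFailed" after a successful Job
// keeps `kubectl describe` from showing a stale failure.
func clearCondition(conds []condition, condType string) []condition {
	out := make([]condition, 0, len(conds))
	for _, c := range conds {
		if c.Type != condType {
			out = append(out, c)
		}
	}
	return out
}

func main() {
	conds := []condition{
		{Type: "Available", Status: "True"},
		{Type: "MigrationFailed", Status: "True", Reason: "MigrationJobFailed"},
	}
	conds = clearCondition(conds, "MigrationFailed")
	fmt.Println(len(conds)) // 1
}
```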
- Around line 129-149: The 60s backoff is being bypassed because deleting the
failed Job immediately requeues the owning Deployment; fix by persisting a
retry-until timestamp on the Deployment before deleting the Job and checking
that timestamp during job-creation logic. Specifically: in the failure branch
where you call setMigrationFailedCondition(deployment, desiredVersion) and
r.Status().Update(...) (and before r.Delete(...)), set an annotation or a field
on deployment.Status (e.g., "migration.retryUntil" =
time.Now().Add(60*time.Second).UTC().Format(time.RFC3339)) and persist it
(handle update conflicts); then in the reconcile path that decides to create a
new migration Job (the code that currently creates the Job when IsNotFound),
read and parse that "migration.retryUntil" value and skip Job creation until the
timestamp has passed (clear the annotation/status field when creating the Job).
This ensures the controller honors the cooldown even though the owned Job watch
will requeue the Deployment.
In `@operator/Makefile`:
- Around line 5-6: The build target fails when bin/ is missing; update the
Makefile's build target (the "build:" target) to ensure the output directory
exists by creating bin before running go build (use mkdir -p bin) so that go
build -o bin/operator ./cmd/ never errors on fresh clones or after make clean.
In `@operator/README.md`:
- Around line 9-15: Update the README text that currently references the
hard-coded ConfigMap name "openfga-migration-status" to use the
deployment-scoped ConfigMap name pattern used elsewhere in the PR (e.g.,
"<release>-migration-status"); specifically, edit the line describing version
comparison so it no longer mentions "openfga-migration-status" but instead
references the deployment-scoped ConfigMap name pattern
("<release>-migration-status" or equivalent placeholder) so docs and tests point
to the correct resource.
---
Nitpick comments:
In `@operator/tests/values-no-db.yaml`:
- Line 18: Replace the inline basic-auth credentials embedded in the YAML "uri"
value (the failing PostgreSQL URI) with a redacted placeholder or
environment-variable reference (e.g., remove "openfga:changeme@" and replace
with a token like "<REDACTED_CREDENTIALS>" or use an env var reference) so the
test fixture no longer contains plaintext credentials; update the "uri" key's
value accordingly and ensure any test code reading this fixture can handle the
placeholder or env var.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 8f9a57a5-18a8-4395-92d9-b4eae0a8affc
⛔ Files ignored due to path filters (2)
- `charts/openfga/Chart.lock` is excluded by `!**/*.lock`
- `operator/go.sum` is excluded by `!**/*.sum`
📒 Files selected for processing (37)
- `.github/workflows/operator.yml`
- `charts/openfga-operator/Chart.yaml`
- `charts/openfga-operator/crds/README.md`
- `charts/openfga-operator/templates/NOTES.txt`
- `charts/openfga-operator/templates/_helpers.tpl`
- `charts/openfga-operator/templates/clusterrole.yaml`
- `charts/openfga-operator/templates/clusterrolebinding.yaml`
- `charts/openfga-operator/templates/deployment.yaml`
- `charts/openfga-operator/templates/serviceaccount.yaml`
- `charts/openfga-operator/values.yaml`
- `charts/openfga/Chart.yaml`
- `charts/openfga/templates/_helpers.tpl`
- `charts/openfga/templates/deployment.yaml`
- `charts/openfga/templates/job.yaml`
- `charts/openfga/templates/rbac.yaml`
- `charts/openfga/templates/serviceaccount.yaml`
- `charts/openfga/values.schema.json`
- `charts/openfga/values.yaml`
- `docs/adr/000-template.md`
- `docs/adr/001-adopt-openfga-operator.md`
- `docs/adr/002-operator-managed-migrations.md`
- `docs/adr/003-declarative-store-lifecycle-crds.md`
- `docs/adr/004-operator-deployment-model.md`
- `docs/adr/README.md`
- `operator/.dockerignore`
- `operator/Dockerfile`
- `operator/Makefile`
- `operator/README.md`
- `operator/cmd/main.go`
- `operator/go.mod`
- `operator/internal/controller/helpers.go`
- `operator/internal/controller/migration_controller.go`
- `operator/internal/controller/migration_controller_test.go`
- `operator/tests/README.md`
- `operator/tests/values-db-outage.yaml`
- `operator/tests/values-happy-path.yaml`
- `operator/tests/values-no-db.yaml`
- Harden pod security (runAsNonRoot, seccompProfile, drop ALL caps)
- Find container by name instead of index to handle sidecars
- Skip migration for memory datastore
- Persist retry-after annotation before Job deletion to survive re-enqueue
- Clear MigrationFailed condition on success
- Propagate imagePullSecrets and securityContext to migration Jobs
- Remove unused RBAC rules (secrets, serviceaccounts)
- Add POD_NAMESPACE downward API for namespace-scoped watch default
- Remove no-op migration values (timeout, backoffLimit, resources)
- Fix migration SA helper to require name when create=false
- Guard operator logic on both operator.enabled and migration.enabled
- Build and load operator image into kind for chart-testing CI
- Add path filters to operator workflow
- Fix ADR inaccuracies (retry strategy, default-enabled wording)
- Pin Dockerfile base image to golang:1.26.2
@coderabbitai review

✅ Actions performed: Review triggered.
Propose adopting a Kubernetes operator for OpenFGA lifecycle management, covering migration handling, declarative CRDs for stores/models/tuples, and the operator deployment model as a Helm subchart. Also adds the ADR process documentation, template, and chart analysis.

- ADR-001: Adopt a Kubernetes Operator
- ADR-002: Operator-Managed Migrations
- ADR-003: Declarative Store Lifecycle CRDs
- ADR-004: Operator Deployment as Helm Subchart
Force-pushed `24897df` to `f149c60`.
@coderabbitai review

✅ Actions performed: Review triggered.
@copilot review
Pull request overview
Copilot reviewed 38 out of 40 changed files in this pull request and generated 5 comments.
Comments suppressed due to low confidence (1)
charts/openfga/templates/deployment.yaml:65
- When `operator.enabled=true` but `migration.enabled=false`, the chart disables the hook-based migration Job (templates/job.yaml) and RBAC (templates/rbac.yaml), but the Deployment can still render the legacy `wait-for-migration` initContainer because `$operatorMigration := and operator.enabled migration.enabled` evaluates to false. With the default `datastore.migrationType: job` + `waitForMigrations: true`, this will deadlock (initContainer waits for a Job that the chart no longer creates). Consider changing the initContainer gating to disable legacy migration orchestration whenever `operator.enabled` is true (or explicitly fail fast for the unsupported combination).
```yaml
{{- $operatorMigration := and .Values.operator.enabled .Values.migration.enabled }}
{{ if or (and (not $operatorMigration) (or (and (has .Values.datastore.engine (list "postgres" "mysql")) .Values.datastore.applyMigrations .Values.datastore.waitForMigrations))) .Values.extraInitContainers }}
      initContainers:
        {{- if not $operatorMigration }}
        {{- if and (has .Values.datastore.engine (list "postgres" "mysql")) .Values.datastore.applyMigrations .Values.datastore.waitForMigrations (eq .Values.datastore.migrationType "job") }}
        - name: wait-for-migration
          securityContext:
            {{- toYaml .Values.securityContext | nindent 12 }}
          image: "{{ .Values.initContainer.repository }}:{{ .Values.initContainer.tag }}"
          imagePullPolicy: {{ .Values.initContainer.pullPolicy }}
          args: ["job-wr", '{{ include "openfga.fullname" . }}-migrate']
          resources:
            {{- toYaml .Values.datastore.migrations.resources | nindent 12 }}
        {{- end }}
        {{- if and (has .Values.datastore.engine (list "postgres" "mysql")) (eq .Values.datastore.migrationType "initContainer") }}
```
🧹 Nitpick comments (7)
operator/README.md (1)
90-102: Add language specifier to the project structure block. The directory tree is missing a language identifier. Add `text` after the opening backticks to resolve the markdownlint warning.

📝 Proposed fix

````diff
-```
+```text
 operator/
 ├── cmd/
 │   └── main.go    # Entry point, manager setup
````

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@operator/README.md` around lines 90 - 102, Update the fenced code block in operator/README.md that contains the project tree (the block starting with "operator/" and listing cmd/, internal/, Dockerfile, etc.) to include a language specifier by changing the opening ``` to ```text so the block becomes a text-formatted code fence; this will resolve the markdownlint warning without altering the block contents.

docs/adr/004-operator-deployment-model.md (1)
69-90: Add language specifier to this fenced code block. The directory structure block is missing a language identifier, which triggers the markdownlint warning. Add `text` after the opening backticks.

📝 Proposed fix

````diff
-```
+```text
 helm-charts/
 ├── charts/
 │   ├── openfga/    # Main chart (existing)
````

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/adr/004-operator-deployment-model.md` around lines 69 - 90, The fenced code block showing the Helm chart directory structure lacks a language specifier; update that block by adding "text" immediately after the opening triple backticks (i.e., change ``` to ```text) so the directory tree is treated as plain text and markdownlint warning is resolved (refer to the directory structure block containing "helm-charts/" and the nested "openfga" / "openfga-operator" entries).

operator/cmd/main.go (1)
31-43: Consider validating that Job configuration flags are non-negative. The flags `--backoff-limit`, `--active-deadline-seconds`, and `--ttl-seconds-after-finished` are parsed as `int` but later cast to `int32`/`int64`. Negative values would produce unexpected behavior. The Kubernetes API would reject invalid values, but an early validation with a clear error message would improve the operator experience.

🛡️ Suggested validation after flag.Parse()

```diff
 flag.Parse()
+
+if backoffLimit < 0 || activeDeadline < 0 || ttlAfterFinished < 0 {
+	fmt.Fprintln(os.Stderr, "error: job configuration flags must be non-negative")
+	os.Exit(1)
+}
+
 ctrl.SetLogger(zap.New(zap.UseFlagOptions(&opts)))
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@operator/cmd/main.go` around lines 31 - 43, After parsing flags, validate the three Job-related flag variables backoffLimit, activeDeadline, and ttlAfterFinished are non-negative; in main.go (after flag.Parse()) add checks that return/log a clear fatal error if any value < 0 (e.g., using log.Fatalf or klog.Fatalf) so the operator exits with a descriptive message instead of proceeding to cast to int32/int64 and creating invalid Kubernetes Job specs; reference the flag variables backoffLimit, activeDeadline, ttlAfterFinished and the controller.Default* constants when adding the validation.

operator/internal/controller/migration_controller.go (1)
79-90: Status update error handling could be improved. When clearing the `MigrationFailed` condition, errors from `r.Status().Update()` are logged but execution continues. If this update fails due to a conflict, the condition remains stale. Consider returning a requeue to ensure the condition is eventually cleared.

♻️ Suggested improvement

```diff
 if currentVersion == desiredVersion {
 	logger.V(1).Info("migration up to date", "version", desiredVersion)
 	clearMigrationFailedCondition(deployment)
 	if patchErr := r.Status().Update(ctx, deployment); patchErr != nil {
-		logger.Error(patchErr, "failed to clear MigrationFailed condition")
+		if apierrors.IsConflict(patchErr) {
+			return ctrl.Result{Requeue: true}, nil
+		}
+		logger.Error(patchErr, "failed to clear MigrationFailed condition")
 	}
 	if _, scaleErr := ensureDeploymentScaled(ctx, r.Client, deployment); scaleErr != nil {
```

Apply the same pattern at lines 146-148.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@operator/internal/controller/migration_controller.go` around lines 79 - 90, The status update that clears MigrationFailed via clearMigrationFailedCondition(deployment) calls r.Status().Update(ctx, deployment) but only logs errors and proceeds; change this so that if r.Status().Update returns an error (e.g., a conflict) you return that error (or return ctrl.Result{Requeue: true} with the error) instead of continuing, mirroring the pattern used at lines 146-148; locate the block that checks currentVersion == desiredVersion, handle patchErr from r.Status().Update(ctx, deployment) by returning the error or a requeue result so the condition will be retried before calling ensureDeploymentScaled and returning success.

operator/internal/controller/migration_controller_test.go (1)
166-231: Test lacks verification of MigrationFailed condition clearing. The `TestReconcile_JobSucceeded_UpdatesConfigMapAndScalesUp` test verifies ConfigMap creation and Deployment scaling, but doesn't verify that a prior `MigrationFailed` condition gets cleared. Consider adding a test case where the Deployment starts with a `MigrationFailed` condition and verifying it's cleared after success.

🧪 Suggested additional test case

```go
func TestReconcile_JobSucceeded_ClearsMigrationFailedCondition(t *testing.T) {
	// Given: a Deployment with a MigrationFailed condition and a succeeded Job.
	dep := newTestDeployment("openfga", "default", "openfga/openfga:v1.14.0", 0)
	dep.Annotations[AnnotationDesiredReplicas] = "3"
	dep.Status.Conditions = []appsv1.DeploymentCondition{
		{
			Type:   "MigrationFailed",
			Status: corev1.ConditionTrue,
			Reason: "MigrationJobFailed",
		},
	}
	job := &batchv1.Job{
		// ... same as existing test ...
		Status: batchv1.JobStatus{Succeeded: 1},
	}
	r := newReconciler(dep, job)
	_, err := r.Reconcile(context.Background(), ctrl.Request{
		NamespacedName: types.NamespacedName{Name: "openfga", Namespace: "default"},
	})
	if err != nil {
		t.Fatalf("unexpected error: %v", err)
	}
	// Verify MigrationFailed condition is cleared.
	updated := &appsv1.Deployment{}
	_ = r.Get(context.Background(), types.NamespacedName{Name: "openfga", Namespace: "default"}, updated)
	for _, c := range updated.Status.Conditions {
		if c.Type == "MigrationFailed" && c.Status == corev1.ConditionTrue {
			t.Error("expected MigrationFailed condition to be cleared")
		}
	}
}
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@operator/internal/controller/migration_controller_test.go` around lines 166 - 231: add a test that ensures a pre-existing MigrationFailed condition on the Deployment is removed after a successful Job. Create a Deployment via newTestDeployment, set dep.Annotations[AnnotationDesiredReplicas] and seed dep.Status.Conditions with a DeploymentCondition having Type "MigrationFailed" and Status corev1.ConditionTrue, create a succeeded batchv1.Job as in the existing TestReconcile_JobSucceeded_UpdatesConfigMapAndScalesUp, call newReconciler(...).Reconcile(...), then Get the updated Deployment and assert that no Status.Condition remains with Type "MigrationFailed" && Status corev1.ConditionTrue. Name the new test TestReconcile_JobSucceeded_ClearsMigrationFailedCondition and reuse the same Job/Reconciler setup symbols.

operator/internal/controller/helpers.go (1)
258-285: Scaling logic has a subtle edge case when replicas annotation already exists.

If the `AnnotationDesiredReplicas` annotation already exists (e.g., from a previous scale-down), the function correctly avoids overwriting it (line 272). However, if the annotation contains "0" from a previous interrupted reconciliation where the Deployment was manually set to 0 replicas before the operator took over, `ensureDeploymentScaled` would restore to 0 replicas.

This is documented behavior per the ADR ("Deployment starts at replicas: 0"), but consider logging a warning when the stored desired replicas is 0 to aid debugging.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@operator/internal/controller/helpers.go` around lines 258 - 285: in scaleDeploymentToZero, detect when deployment.Annotations[AnnotationDesiredReplicas] already exists and equals "0" and emit a warning with the Deployment identity to aid debugging. Update the function (around the annotation-check block) to check whether the stored value == "0" and call your controller logger (e.g., klog.Warningf or ctrl.Log.WithValues("deployment", deployment.Name, "namespace", deployment.Namespace).V(0).Info/Warning) to record that the saved desired replicas is zero before returning/continuing, leaving the existing behavior of not overwriting the annotation intact.

docs/adr/002-operator-managed-migrations.md (1)
76-94: Consider adding language specifiers to code fences.

The ASCII diagrams and sequence blocks lack language specifiers. Adding `text` or `plaintext` as the language would satisfy markdown linters and improve rendering consistency across platforms.

📝 Suggested fix

````diff
-```
+```text
 ┌────────────────────────────────────────────────────────┐
 │              Operator Reconciliation                   │
````

Apply similarly to the code fences at lines 130 and 147.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/adr/002-operator-managed-migrations.md` around lines 76 - 94, The ASCII diagrams and sequence blocks (for example the block starting with "Operator Reconciliation" and the other two visual blocks later in the file) are missing Markdown language specifiers; update each triple-backtick fence that contains ASCII art/sequence steps to use a plain text specifier (e.g., ```text or ```plaintext) so markdown linters/renderers handle them consistently.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@docs/adr/002-operator-managed-migrations.md`:
- Around line 76-94: The ASCII diagrams and sequence blocks (for example the
block starting with "Operator Reconciliation" and the other two visual blocks
later in the file) are missing Markdown language specifiers; update each
triple-backtick fence that contains ASCII art/sequence steps to use a plain text
specifier (e.g., ```text or ```plaintext) so markdown linters/renderers handle
them consistently.
In `@docs/adr/004-operator-deployment-model.md`:
- Around line 69-90: The fenced code block showing the Helm chart directory
structure lacks a language specifier; update that block by adding "text"
immediately after the opening triple backticks (i.e., change ``` to ```text) so
the directory tree is treated as plain text and markdownlint warning is resolved
(refer to the directory structure block containing "helm-charts/" and the nested
"openfga" / "openfga-operator" entries).
In `@operator/cmd/main.go`:
- Around line 31-43: After parsing flags, validate the three Job-related flag
variables backoffLimit, activeDeadline, and ttlAfterFinished are non-negative;
in main.go (after flag.Parse()) add checks that return/log a clear fatal error
if any value < 0 (e.g., using log.Fatalf or klog.Fatalf) so the operator exits
with a descriptive message instead of proceeding to cast to int32/int64 and
creating invalid Kubernetes Job specs; reference the flag variables
backoffLimit, activeDeadline, ttlAfterFinished and the controller.Default*
constants when adding the validation.
In `@operator/internal/controller/helpers.go`:
- Around line 258-285: In scaleDeploymentToZero, detect when
deployment.Annotations[AnnotationDesiredReplicas] already exists and equals "0"
and emit a warning with the Deployment identity to aid debugging; update the
function (around the annotation-check block) to check the stored value string ==
"0" and call your controller logger (e.g., klog.Warningf or
ctrl.Log.WithValues("deployment", deployment.Name, "namespace",
deployment.Namespace).V(0).Info/Warning) to record that the saved desired
replicas is zero before returning/continuing, leaving the existing behavior of
not overwriting the annotation intact.
In `@operator/internal/controller/migration_controller_test.go`:
- Around line 166-231: Add a test that ensures a pre-existing MigrationFailed
condition on the Deployment is removed after a successful Job: create a
Deployment via newTestDeployment, set dep.Annotations[AnnotationDesiredReplicas]
and seed dep.Status.Conditions with a DeploymentCondition having Type
"MigrationFailed" and Status corev1.ConditionTrue, create a succeeded
batchv1.Job as in the existing
TestReconcile_JobSucceeded_UpdatesConfigMapAndScalesUp, call
newReconciler(...).Reconcile(...), then Get the updated Deployment and assert
that no Status.Condition remains with Type "MigrationFailed" && Status
corev1.ConditionTrue; name the new test
TestReconcile_JobSucceeded_ClearsMigrationFailedCondition and reuse the same
Job/Reconciler setup symbols.
In `@operator/internal/controller/migration_controller.go`:
- Around line 79-90: The status update that clears MigrationFailed via
clearMigrationFailedCondition(deployment) calls r.Status().Update(ctx,
deployment) but only logs errors and proceeds; change this so that if
r.Status().Update returns an error (e.g., a conflict) you return that error (or
return ctrl.Result{Requeue: true} with the error) instead of continuing,
mirroring the pattern used at lines 146-148; locate the block that checks
currentVersion == desiredVersion, handle patchErr from r.Status().Update(ctx,
deployment) by returning the error or a requeue result so the condition will be
retried before calling ensureDeploymentScaled and returning success.
In `@operator/README.md`:
- Around line 90-102: Update the fenced code block in operator/README.md that
contains the project tree (the block starting with "operator/" and listing cmd/,
internal/, Dockerfile, etc.) to include a language specifier by changing the
opening ``` to ```text so the block becomes a text-formatted code fence; this
will resolve the markdownlint warning without altering the block contents.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 5c414505-2633-4683-95be-cd73ce829f78
⛔ Files ignored due to path filters (2)
- `charts/openfga/Chart.lock` is excluded by `!**/*.lock`
- `operator/go.sum` is excluded by `!**/*.sum`
📒 Files selected for processing (38)
- `.github/workflows/operator.yml`
- `.github/workflows/test.yml`
- `charts/openfga-operator/Chart.yaml`
- `charts/openfga-operator/crds/README.md`
- `charts/openfga-operator/templates/NOTES.txt`
- `charts/openfga-operator/templates/_helpers.tpl`
- `charts/openfga-operator/templates/clusterrole.yaml`
- `charts/openfga-operator/templates/clusterrolebinding.yaml`
- `charts/openfga-operator/templates/deployment.yaml`
- `charts/openfga-operator/templates/serviceaccount.yaml`
- `charts/openfga-operator/values.yaml`
- `charts/openfga/Chart.yaml`
- `charts/openfga/templates/_helpers.tpl`
- `charts/openfga/templates/deployment.yaml`
- `charts/openfga/templates/job.yaml`
- `charts/openfga/templates/rbac.yaml`
- `charts/openfga/templates/serviceaccount.yaml`
- `charts/openfga/values.schema.json`
- `charts/openfga/values.yaml`
- `docs/adr/000-template.md`
- `docs/adr/001-adopt-openfga-operator.md`
- `docs/adr/002-operator-managed-migrations.md`
- `docs/adr/003-declarative-store-lifecycle-crds.md`
- `docs/adr/004-operator-deployment-model.md`
- `docs/adr/README.md`
- `operator/.dockerignore`
- `operator/Dockerfile`
- `operator/Makefile`
- `operator/README.md`
- `operator/cmd/main.go`
- `operator/go.mod`
- `operator/internal/controller/helpers.go`
- `operator/internal/controller/migration_controller.go`
- `operator/internal/controller/migration_controller_test.go`
- `operator/tests/README.md`
- `operator/tests/values-db-outage.yaml`
- `operator/tests/values-happy-path.yaml`
- `operator/tests/values-no-db.yaml`
✅ Files skipped from review due to trivial changes (11)
- operator/.dockerignore
- charts/openfga-operator/crds/README.md
- charts/openfga/Chart.yaml
- operator/go.mod
- operator/Dockerfile
- charts/openfga-operator/templates/NOTES.txt
- docs/adr/001-adopt-openfga-operator.md
- charts/openfga-operator/Chart.yaml
- charts/openfga-operator/templates/_helpers.tpl
- charts/openfga/templates/_helpers.tpl
- charts/openfga-operator/values.yaml
🚧 Files skipped from review as they are similar to previous changes (4)
- charts/openfga/templates/serviceaccount.yaml
- charts/openfga/values.schema.json
- .github/workflows/operator.yml
- charts/openfga/templates/deployment.yaml
Replace Helm hook-based migrations with a lightweight Kubernetes operator that watches OpenFGA Deployments, detects version changes, and runs migrations as regular Jobs.

- Go operator using controller-runtime (no CRDs)
- Helm subchart with opt-in via operator.enabled (default false)
- Dedicated migration ServiceAccount (separate from runtime)
- Auto-recovery on database failure (delete/retry cycle)
- GitHub Actions workflow for multi-arch image builds
- Integration test values for local Kubernetes clusters

Resolves #211, #107, #120, #100, #126
- Harden pod security (runAsNonRoot, seccompProfile, drop ALL caps)
- Find container by name instead of index to handle sidecars
- Skip migration for memory datastore
- Persist retry-after annotation before Job deletion to survive re-enqueue
- Clear MigrationFailed condition on success
- Propagate imagePullSecrets and securityContext to migration Jobs
- Remove unused RBAC rules (secrets, serviceaccounts)
- Add POD_NAMESPACE downward API for namespace-scoped watch default
- Remove no-op migration values (timeout, backoffLimit, resources)
- Fix migration SA helper to require name when create=false
- Guard operator logic on both operator.enabled and migration.enabled
- Build and load operator image into kind for chart-testing CI
- Add path filters to operator workflow
- Fix ADR inaccuracies (retry strategy, default-enabled wording)
- Pin Dockerfile base image to golang:1.26.2
- Render extraInitContainers in operator mode (previously skipped)
- Add version label to migration Jobs and delete stale Jobs on image change
- Use namespaced Role/RoleBinding when watchAllNamespaces is false
- Replace Status().Update with Status().Patch to avoid write conflicts
- Fix logger.Error(nil, ...) to logger.Info for expected failure state
- Wire desiredVersion param into buildMigrationJob for version tracking
- Update RBAC: deployments/status verb from update to patch
- Don't force replicas: 0 for memory engine in operator mode
- Guard migration SA creation on migration.enabled
- Document both required labels and mutable tag limitation in README
- Add opt-in annotation (openfga.dev/migration-enabled) so the operator only manages migrations for explicitly opted-in Deployments
- Propagate volumes, volumeMounts, and envFrom from the Deployment to migration Jobs for TLS certs and file-based credentials
- Remove watchAllNamespaces option; operator is now always namespace-scoped
- Update ADR-004 dependency example to match actual file:// reference
- Add test for migration-not-enabled skip behavior
The migration Job should only receive explicitly filtered OPENFGA_DATASTORE_* env vars, not the full EnvFrom from the source Deployment which could leak non-datastore secrets.
- Wrap deployment annotations in conditional to avoid emitting an empty annotations: field, which produces an invalid manifest
- Store full version in annotation (openfga.dev/desired-version) and truncate label to 63 chars to support digest-pinned images
- Align operator image default to ghcr.io/openfga/openfga-operator to match CI publishing target
- Return error on retry-after annotation patch failure to prevent Job churn that bypasses the 60s cooldown
- Add test for stale-Job version mismatch deletion path
…values overflowing)
When operator.enabled=true but migration.enabled=false, the legacy wait-for-migration initContainer could render and hang waiting for a Job that will never be created. Gate on operator.enabled instead so the legacy path is fully disabled when the operator is installed.
…essions The double-quoted annotation values contained inner double quotes (e.g. "memory") which produced invalid YAML that IDEs flagged as errors. Switch to single-quote wrappers so the inner Go template strings don't conflict.
- Pin Dockerfile base images by digest for reproducible builds
- Fix version label fallback comparison for digest-pinned images by sanitizing desiredVersion before comparing to the label value
…y migration-specific volumes
Prevents the operator from being evicted during node drains, which could leave the OpenFGA Deployment stuck at 0 replicas with no controller to scale it back up. Disabled by default; supports both minAvailable and maxUnavailable modes.
Replace checks on job.Status.Failed >= backoffLimit with isJobConditionTrue(job, batchv1.JobFailed), and job.Status.Succeeded with batchv1.JobComplete. The Job controller sets conditions atomically when it makes its final decision, avoiding races where the operator acts on intermediate counter states before Kubernetes has finished cleaning up.
20 new tests across 4 files covering the operator-enabled code paths that previously had no unit test coverage:

- deployment: annotations, replicas=0, autoscaling conflict, initContainers gating
- job: template not rendered when operator enabled
- serviceaccount: migration SA creation, custom names, IRSA annotations
- rbac: legacy Role/RoleBinding not rendered when operator enabled
The old Helm-templated migration Job uses datastore.migrations.resources for resource limits, but the operator-built Job had none. Inherit the main container's Resources to maintain parity and prevent unbounded resource consumption during migrations.
Four new tests covering previously untested code paths:

- StaleJob_LabelOnlyFallback: version-mismatch detection when Job has only a label (no annotation), exercising the sanitized-label fallback
- JobSucceeded_UpdatesExistingConfigMap: ConfigMap update path when a prior version's ConfigMap already exists
- ScaleToZero_NilAnnotationsMap: scaleDeploymentToZero correctly stores desired-replicas when the annotation was not previously set
- JobInProgress_Requeues: in-progress Job triggers 10s requeue without scaling up or modifying the Deployment
The README documented these flags as configurable via the subchart values.yaml, but only --leader-elect and --watch-namespace were actually wired. Add migrationJob.backoffLimit, activeDeadlineSeconds, and ttlSecondsAfterFinished values with matching args in the operator Deployment template.
The _helpers.tpl namespace template already supported namespaceOverride but the value was undeclared in values.yaml, making it undiscoverable.
The test validates that scaleDeploymentToZero stores the current replica count in the desired-replicas annotation before zeroing, not that it handles a nil annotations map.
Falling back to the "default" ServiceAccount would silently grant operator RBAC permissions (Deployment patch, Job create/delete) to a shared SA. Require an explicit name so the user makes a deliberate choice.
When concurrent reconciles race between the GET and CREATE, the second create returns AlreadyExists. Treat this as benign and requeue to poll the existing Job instead of returning a hard error that produces noisy reconcile failures in the controller logs.
Address review findings: add allowPrivilegeEscalation: false for restricted PSS compliance, set default resource requests/limits, add values.schema.json validation, use stable selectorLabels on pod template to prevent spurious rollouts, add .helmignore, and improve Chart.yaml metadata and NOTES.txt with migration commands.
The operator was scaling Deployments to 0 replicas during every migration, causing a full outage on every helm upgrade — a regression from the existing rolling update behavior. OpenFGA already gates readiness on schema version (MinimumSupportedDatastoreSchemaRevision in sqlcommon.IsReady), so new pods naturally block until migration completes while old pods keep serving. Use Helm's lookup function to preserve the live replica count on upgrade (falling back to replicas: 0 on fresh install where no Deployment exists). Remove scaleDeploymentToZero from the operator reconcile loop. Update ADR-002 to document the rationale and the readiness gate dependency.
Force-pushed from 5ea7c09 to dfe3180
Description
What problem is being solved?
The OpenFGA Helm chart uses Helm hooks (`post-install`, `post-upgrade`) and a `k8s-wait-for` init container to orchestrate database migrations. This approach breaks in multiple ways:

- `helm install --wait` deadlocks — the init container waits for the hook Job, but the hook only runs after the Deployment is ready ("hen egg" problem when used with helm deploy #120)
- `hook-delete-policy` conflict — the Job is cleaned up before FluxCD can confirm the Deployment is healthy (Hooked job deletion incompatible with Helm `--wait` and FluxCD by default #100)
- `k8s-wait-for` is unmaintained — 2+ years stale, known CVEs, pinned by tag not digest (Reuse k8s-wait-for image repo and tag across migrate job and initContainer #126, `k8s-wait-for` image is inactively maintained and contains many vulnerabilities #132, An external dependency to groundnuty/k8s-wait-for is pinned using tag. #144)

These represent the single biggest pain point for users of the chart.
How is it being solved?
A lightweight Kubernetes operator moves migration orchestration from deploy-time Helm hooks into the runtime control plane. The operator watches OpenFGA Deployments, detects version changes, and runs migrations as regular Kubernetes Jobs — no hooks, no init containers.
This is Stage 1 of the operator, focused solely on migration orchestration. It does not introduce any CRDs. The operator watches native Deployments by label selector (`app.kubernetes.io/part-of: openfga`, `app.kubernetes.io/component: authorization-controller`).
What changes are made to solve it?
New: Go operator (`operator/`)

- Tracks migration state in a `{name}-migration-status` ConfigMap: scales Deployment to 0, creates a migration Job running `openfga migrate`, waits for completion, updates the ConfigMap, scales Deployment back up
- On failure: sets a `MigrationFailed` condition on the Deployment, deletes the failed Job, retries after 60s — self-heals when the database comes back
- Passes only filtered `OPENFGA_DATASTORE_*` env vars to the migration Job (least privilege)

New: Operator Helm subchart (`charts/openfga-operator/`)

- Configurable via `values.yaml`: leader election, watch namespace, resource limits, Job backoff/TTL settings

Modified: Parent chart (`charts/openfga/`)

- Operator subchart wired in behind an `operator.enabled` condition (defaults to `false` — fully backward compatible)
- In operator mode the Deployment carries `openfga.dev/desired-replicas` and `openfga.dev/migration-service-account` annotations; init containers, hook Job, and hook RBAC are skipped
- Dedicated migration ServiceAccount gated on `operator.enabled && migration.serviceAccount.create`
- `values.schema.json` updated for new `operator` and `migration` properties

New: GitHub Actions workflow (`.github/workflows/operator.yml`)

- Builds multi-arch images published to `ghcr.io/{owner}/openfga-operator`, triggered by changes under `operator/**`

New: Integration test values (`operator/tests/`)

How each issue is resolved:

- `helm install --wait` deadlock: the hook Job and init container are gone, so there is nothing to deadlock on (#120)
- `hook-delete-policy` conflict: the operator manages Job lifecycle with TTL and explicit cleanup (#100)
- `k8s-wait-for`: dependency removed entirely (#126, #132, #144)

References
- Hooked job deletion incompatible with Helm `--wait` and FluxCD by default #100
- `k8s-wait-for` image is inactively maintained and contains many vulnerabilities #132
- An external dependency to groundnuty/k8s-wait-for is pinned using tag. #144

Review Checklist
- main

Summary by CodeRabbit
New Features
- `operator.enabled` configuration option with backward-compatible defaults

Documentation
Tests