Conversation
Walkthrough

This pull request introduces a Kubernetes-native operator to orchestrate OpenFGA database migrations, replacing the existing Helm hook-based approach. The implementation includes a controller-runtime operator in Go, a Helm subchart for operator deployment, integration into the main OpenFGA chart, comprehensive testing, and architectural documentation via ADRs.

Changes

Sequence Diagram

```mermaid
sequenceDiagram
    participant User as GitOps/User
    participant Helm as Helm Install
    participant K8s as Kubernetes
    participant Op as Migration Operator
    participant DB as Database
    User->>Helm: helm install (operator.enabled=true)
    Helm->>K8s: Create Deployment (replicas=0, annotations)
    Helm->>K8s: Create MigrationReconciler
    Helm->>K8s: Return immediately
    Note over K8s: Operator starts watching...
    K8s->>Op: Detect Deployment with migration enabled
    Op->>Op: Read current migration version from ConfigMap
    Op->>Op: Extract desired version from image tag
    Note over Op: Versions differ → Migration needed
    Op->>K8s: Scale Deployment replicas → 0
    Op->>K8s: Create migration Job (runs openfga migrate)
    K8s->>DB: Job executes migration SQL
    DB-->>K8s: Migration succeeds
    K8s->>Op: Job marked as Complete
    Op->>K8s: Update migration-status ConfigMap (new version)
    Op->>K8s: Scale Deployment replicas → desired count
    K8s->>K8s: Deployment pods start (migration complete)
    K8s-->>User: OpenFGA ready
    Note over Op,User: On migration failure:
    Note over Op: Set MigrationFailed condition
    Note over Op: Delete failed Job after backoff
    Note over Op: Requeue with 60s cooldown
```
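The version check at the heart of this flow (ConfigMap-recorded version vs. Deployment image tag) can be sketched in stdlib-only Go. This is an illustrative sketch, not the operator's actual API; `desiredVersionFromImage` and `migrationNeeded` are hypothetical names:

```go
package main

import (
	"fmt"
	"strings"
)

// desiredVersionFromImage extracts the tag from an image reference,
// e.g. "openfga/openfga:v1.14.0" -> "v1.14.0". Returns "" for untagged
// references (a ":" belonging to a registry host:port is not a tag).
func desiredVersionFromImage(image string) string {
	i := strings.LastIndex(image, ":")
	if i < 0 || strings.Contains(image[i:], "/") {
		return ""
	}
	return image[i+1:]
}

// migrationNeeded compares the version recorded in the migration-status
// ConfigMap with the version derived from the Deployment's image tag.
func migrationNeeded(currentFromConfigMap, image string) bool {
	desired := desiredVersionFromImage(image)
	return desired != "" && currentFromConfigMap != desired
}

func main() {
	fmt.Println(migrationNeeded("v1.13.0", "openfga/openfga:v1.14.0")) // true
	fmt.Println(migrationNeeded("v1.14.0", "openfga/openfga:v1.14.0")) // false
}
```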
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes

🚥 Pre-merge checks: ✅ 4 passed | ❌ 1 failed (1 warning)
Review the following changes in direct dependencies. Learn more about Socket for GitHub.
Pull request overview
This PR introduces an OpenFGA migration operator and integrates it into the existing Helm chart to replace Helm hook/initContainer-based migration orchestration (improving compatibility with GitOps tools and helm --wait).
Changes:
- Added a new Go-based operator (controller-runtime) that orchestrates migrations via Kubernetes Jobs and a migration-status ConfigMap.
- Added an `openfga-operator` Helm subchart and integrated it as an optional dependency of the parent `openfga` chart (`operator.enabled`).
- Added ADR documentation and local integration test values/instructions for validating operator-driven migration behavior.
Reviewed changes
Copilot reviewed 37 out of 39 changed files in this pull request and generated 13 comments.
Show a summary per file
| File | Description |
|---|---|
| operator/tests/values-no-db.yaml | Values for permanent DB failure integration scenario |
| operator/tests/values-happy-path.yaml | Values + embedded Postgres manifests for happy-path scenario |
| operator/tests/values-db-outage.yaml | Values + Postgres scaled-to-0 outage/recovery scenario |
| operator/tests/README.md | Step-by-step local integration test instructions |
| operator/README.md | Operator overview, dev commands, and configuration docs |
| operator/Makefile | Basic build/test/vet/lint and docker targets for the operator |
| operator/internal/controller/migration_controller.go | Core reconciliation logic for scaling, Job orchestration, ConfigMap status |
| operator/internal/controller/migration_controller_test.go | Unit tests for reconcile paths and image tag parsing |
| operator/internal/controller/helpers.go | Job builder + scaling + ConfigMap helpers |
| operator/go.mod | Operator module definition and dependencies |
| operator/go.sum | Dependency lockfile for operator module |
| operator/Dockerfile | Multi-stage build producing distroless runtime image |
| operator/cmd/main.go | Operator entrypoint, flags, manager setup, health endpoints |
| operator/.dockerignore | Docker context exclusions for faster builds |
| docs/adr/README.md | ADR index and ADR process documentation |
| docs/adr/004-operator-deployment-model.md | ADR for deploying operator as Helm subchart dependency |
| docs/adr/003-declarative-store-lifecycle-crds.md | ADR for future CRD-based store/model/tuples management |
| docs/adr/002-operator-managed-migrations.md | ADR describing migration flow moved from Helm hooks to operator |
| docs/adr/001-adopt-openfga-operator.md | ADR establishing the operator approach and staged roadmap |
| docs/adr/000-template.md | ADR template for future architectural decisions |
| charts/openfga/values.yaml | Added operator + migration values surface |
| charts/openfga/values.schema.json | Schema updates for new operator/migration values and subchart passthrough |
| charts/openfga/templates/serviceaccount.yaml | Adds optional dedicated migration ServiceAccount in operator mode |
| charts/openfga/templates/rbac.yaml | Skips legacy hook/init-container RBAC when operator is enabled |
| charts/openfga/templates/job.yaml | Skips legacy Helm hook migration Job when operator is enabled |
| charts/openfga/templates/deployment.yaml | Operator mode annotations, replicas=0 behavior, disables hook init-container path |
| charts/openfga/templates/_helpers.tpl | Helper for migration ServiceAccount naming |
| charts/openfga/Chart.yaml | Adds openfga-operator dependency gated by operator.enabled |
| charts/openfga/Chart.lock | Lockfile updated to include the new subchart dependency |
| charts/openfga-operator/values.yaml | Operator chart configuration (image, leader election, watch scope, resources) |
| charts/openfga-operator/templates/serviceaccount.yaml | Operator ServiceAccount manifest |
| charts/openfga-operator/templates/NOTES.txt | Post-install notes for operator chart |
| charts/openfga-operator/templates/deployment.yaml | Operator Deployment manifest and CLI args wiring |
| charts/openfga-operator/templates/clusterrolebinding.yaml | ClusterRoleBinding for operator permissions |
| charts/openfga-operator/templates/clusterrole.yaml | ClusterRole for operator (deployments/jobs/configmaps/etc.) |
| charts/openfga-operator/templates/_helpers.tpl | Operator chart naming/labels helpers |
| charts/openfga-operator/crds/README.md | Placeholder note for future CRDs (Stage 2+) |
| charts/openfga-operator/Chart.yaml | Operator chart metadata/versioning |
| .github/workflows/operator.yml | CI workflow for operator tests + multi-arch build/push |
Actionable comments posted: 17
🧹 Nitpick comments (1)
operator/tests/values-no-db.yaml (1)
18-18: Avoid inline credentials in test fixture URIs. Line 18 embeds basic-auth credentials, which triggers secret scanners and normalizes plaintext credential patterns. Prefer a redacted placeholder for this failure-path fixture.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@operator/tests/values-no-db.yaml` at line 18, Replace the inline basic-auth credentials embedded in the YAML "uri" value (the failing PostgreSQL URI) with a redacted placeholder or environment-variable reference (e.g., remove "openfga:changeme@" and replace with a token like "<REDACTED_CREDENTIALS>" or use an env var reference) so the test fixture no longer contains plaintext credentials; update the "uri" key's value accordingly and ensure any test code reading this fixture can handle the placeholder or env var.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In @.github/workflows/operator.yml:
- Around line 7-11: The workflow path filters are too narrow; update the
operator workflow (operator.yml) to also trigger on changes that affect image
publishing and self-validation by adding the chart and workflow files to the
paths filter — include at minimum "charts/openfga-operator/Chart.yaml" and
".github/workflows/operator.yml" (or broader "charts/**" and
".github/workflows/**") so that functions relying on Chart.yaml-derived tags and
workflow self-changes will cause the workflow to run.
In `@charts/openfga-operator/templates/clusterrole.yaml`:
- Around line 20-25: Remove the unused Secret and ServiceAccount permissions
from the operator ClusterRole template: delete the rules that list resources
"secrets" with verb "get" and "serviceaccounts" with verbs "get","list","create"
in the charts/openfga-operator/templates/clusterrole.yaml so the ClusterRole
only grants permissions actually used by the controller (referenced by
migration_controller.go and helpers.go which do not access Secrets or create
ServiceAccounts); ensure no other ClusterRole rules depend on those resource
entries and run a quick helm template to validate the manifest.
In `@charts/openfga-operator/templates/deployment.yaml`:
- Around line 27-36: The deployment template leaves podSecurityContext and
container securityContext empty (podSecurityContext and securityContext blocks)
which allows runtime-default privileges; update the chart defaults in
values.yaml and the deployment template to enable hardened defaults: set
podSecurityContext with runAsNonRoot: true and runAsUser (non-root uid), and set
container securityContext to runAsNonRoot: true, runAsUser, seccompProfile: {
type: RuntimeDefault }, capabilities: { drop: ["ALL"] }, and
readOnlyRootFilesystem: true; ensure these keys are present and not commented
out so the operator container and pod inherit the hardened defaults (refer to
podSecurityContext and the container securityContext entries in the deployment
template and the commented hardening directives in values.yaml).
In `@charts/openfga-operator/templates/NOTES.txt`:
- Around line 3-5: Replace the hard assertion in the Helm NOTES template that
the operator image "({{ .Values.image.repository }}:{{ .Values.image.tag |
default .Chart.AppVersion }}) does not exist yet" with a neutral, conditional
message: detect if a developer flag or a value (e.g., .Values.image.isDev or by
checking if .Values.image.tag equals .Chart.AppVersion) indicates a development
build and only then warn that the image may not be pushed; otherwise emit a
neutral info line that the operator uses the specified image without asserting
non-existence. Update the template text in NOTES.txt around the referenced
symbols (.Values.image.repository, .Values.image.tag, .Chart.AppVersion) to
reflect this conditional/neutral wording.
In `@charts/openfga/templates/_helpers.tpl`:
- Around line 80-85: The helper openfga.migrationServiceAccountName currently
always returns a generated name; change it to first check
.Values.migration.serviceAccount.create and
.Values.migration.serviceAccount.name, and if create is false and name is not
provided, return an empty string (so operator-created migration Jobs won't
reference a non-existent SA). Update the logic in the
openfga.migrationServiceAccountName function to: if
.Values.migration.serviceAccount.name use that (trunc/trimSuffix), else if
.Values.migration.serviceAccount.create is false return "" immediately,
otherwise generate and return the fallback printf "%s-migration" (include
"openfga.fullname" .) with trunc/trimSuffix.
In `@charts/openfga/templates/deployment.yaml`:
- Around line 8-10: The template is honoring only .Values.operator.enabled but
must also respect .Values.migration.enabled; update the conditional logic
wherever migration-related annotations/fields are set (e.g., the blocks that set
openfga.dev/desired-replicas, openfga.dev/migration-service-account, and any
migration init-container toggles) to require both .Values.operator.enabled and
.Values.migration.enabled (e.g., change conditionals to check both flags before
emitting those annotations/fields), and apply the same fix to the other
occurrences noted (lines ~16-23 and ~49) so operator-mode branches do not inject
migration annotations or set replicas to 0 when migration.enabled is false.
In `@docs/adr/002-operator-managed-migrations.md`:
- Around line 119-125: The ADR currently states "exponential backoff" for
database connectivity retries but the controller implements a fixed retry
cadence; update the ADR text in the "Database unreachable" row to document the
actual retry policy used by the controller (replace "exponential backoff" with
the fixed 60s retry), and mention how this interacts with Job reschedules and
operator restart behavior (keep references to MigrationFailed condition,
activeDeadlineSeconds, ConfigMap re-read and Job status resume semantics to
preserve context).
In `@docs/adr/004-operator-deployment-model.md`:
- Around line 36-39: Update the ADR text and all examples to reflect that the
operator is disabled by default: change the description of the "Operator as a
conditional subchart dependency (selected)" option from "Enabled by default" to
"Disabled by default", update the examples to show how to enable it by setting
operator.enabled: true, and adjust the other occurrences that claim it is
enabled by default (mentions of openfga-operator and any numbered sections
referenced around lines 74-75 and 127-128) so they align with the actual default
in values.schema.json where operator.enabled is false.
In `@docs/adr/README.md`:
- Around line 33-39: The fenced code block in docs/adr/README.md is missing a
language label which triggers markdownlint; edit that block (the diagram block
containing "Proposed → Accepted → (optionally) Superseded or Deprecated" and the
feedback loop arrows) to add a language specifier (e.g., use "text") immediately
after the opening backticks so the block becomes a labeled fenced code block.
In `@operator/cmd/main.go`:
- Around line 40-70: Default flags leave the operator watching all namespaces;
ensure it falls back to the pod's namespace by reading the POD_NAMESPACE env var
when watchNamespace is empty and watchAllNamespaces is false, and set
cacheOpts.DefaultNamespaces accordingly before calling ctrl.NewManager. In
practice, add an os.LookupEnv("POD_NAMESPACE") check after flag.Parse() and, if
watchNamespace == "" && !watchAllNamespaces && POD_NAMESPACE is present, set
cacheOpts.DefaultNamespaces = map[string]cache.Config{podNamespace: {}}
(referencing watchNamespace, watchAllNamespaces, cacheOpts.DefaultNamespaces,
and ctrl.NewManager), and also ensure the chart/deployment injects POD_NAMESPACE
into the operator pod env.
In `@operator/go.mod`:
- Line 3: The Docker builder image is using a floating patch tag (FROM
golang:1.25) while operator/go.mod pins go 1.25.6; update the Dockerfile's base
image to the exact patch version to match go.mod by changing the FROM
golang:1.25 line to FROM golang:1.25.6 so builds are reproducible and aligned
with the declared go 1.25.6 in operator/go.mod.
In `@operator/internal/controller/helpers.go`:
- Around line 120-142: The Job pod template currently copies
NodeSelector/Tolerations/Affinity but omits imagePullSecrets and security
contexts; update the PodTemplateSpec in the migration Job (where Template:
corev1.PodTemplateSpec is built) to also set Spec.ImagePullSecrets =
deployment.Spec.Template.Spec.ImagePullSecrets, Spec.SecurityContext =
deployment.Spec.Template.Spec.SecurityContext, and copy the container-level
SecurityContext into the "migrate-database" container
(Containers[0].SecurityContext =
deployment.Spec.Template.Spec.Containers[i].SecurityContext for the matching
mainContainer.Image/name), ensuring the migration container keeps the same
Image/Args/Env while inheriting the source Deployment's imagePullSecrets and
security settings.
In `@operator/internal/controller/migration_controller.go`:
- Around line 49-55: The code incorrectly assumes the main OpenFGA container is
Containers[0]; update the logic to locate the OpenFGA container by name (e.g.,
look for container.Name == "openfga" or the configured main container name)
instead of indexing 0, retrieve its Image and env once (store the resolved core
container object or its index) and pass or reuse that reference when calling
extractImageTag and buildMigrationJob so both use the same resolved container;
update any checks (previous len/Containers[0] usages) to handle missing named
container and return/skip with a clear log if not found.
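The lookup-by-name suggestion, sketched with a minimal stand-in struct for `corev1.Container`; the helper names are hypothetical and the real controller would operate on the Deployment's pod spec directly:

```go
package main

import (
	"fmt"
	"strings"
)

// container is a minimal stand-in for corev1.Container.
type container struct {
	Name  string
	Image string
}

// findMainContainer locates the OpenFGA container by name instead of
// assuming it is Containers[0], so injected sidecars (e.g. a service
// mesh proxy) don't break image-tag detection.
func findMainContainer(containers []container, name string) (container, bool) {
	for _, c := range containers {
		if c.Name == name {
			return c, true
		}
	}
	return container{}, false
}

// extractImageTag pulls the version from an image reference; a ":" that
// belongs to a registry host:port is not treated as a tag.
func extractImageTag(image string) string {
	if i := strings.LastIndex(image, ":"); i >= 0 && !strings.Contains(image[i:], "/") {
		return image[i+1:]
	}
	return ""
}

func main() {
	pod := []container{
		{Name: "istio-proxy", Image: "istio/proxyv2:1.22.0"}, // sidecar first
		{Name: "openfga", Image: "openfga/openfga:v1.14.0"},
	}
	if c, ok := findMainContainer(pod, "openfga"); ok {
		fmt.Println(extractImageTag(c.Image)) // v1.14.0
	}
}
```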
- Around line 108-121: The success branch handling job.Status.Succeeded >= 1
never clears a prior MigrationFailed condition; update the logic so that after
logging success and before returning it clears/removes the MigrationFailed
condition in the migration status ConfigMap. Specifically, in the
job.Status.Succeeded block (around updateMigrationStatus and
ensureDeploymentScaled), modify or extend updateMigrationStatus (or call a new
helper) to set the MigrationFailed condition to false/remove it for
desiredVersion/jobName so kubectl describe no longer shows a stale failure; keep
the existing replica scaling flow (ensureDeploymentScaled) and return only after
the status has been updated.
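A minimal sketch of clearing the stale condition on the success path, using a plain struct in place of `appsv1.DeploymentCondition` (the helper name is illustrative, not the controller's actual function):

```go
package main

import "fmt"

// condition is a stand-in for appsv1.DeploymentCondition.
type condition struct {
	Type   string
	Status string
	Reason string
}

// clearCondition returns the condition list with every entry of the given
// type removed; calling it with "MigrationFailed" after a successful Job
// keeps `kubectl describe` from showing a stale failure.
func clearCondition(conds []condition, condType string) []condition {
	out := make([]condition, 0, len(conds))
	for _, c := range conds {
		if c.Type != condType {
			out = append(out, c)
		}
	}
	return out
}

func main() {
	conds := []condition{
		{Type: "Available", Status: "True"},
		{Type: "MigrationFailed", Status: "True", Reason: "MigrationJobFailed"},
	}
	conds = clearCondition(conds, "MigrationFailed")
	fmt.Println(len(conds)) // 1
}
```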
- Around line 129-149: The 60s backoff is being bypassed because deleting the
failed Job immediately requeues the owning Deployment; fix by persisting a
retry-until timestamp on the Deployment before deleting the Job and checking
that timestamp during job-creation logic. Specifically: in the failure branch
where you call setMigrationFailedCondition(deployment, desiredVersion) and
r.Status().Update(...) (and before r.Delete(...)), set an annotation or a field
on deployment.Status (e.g., "migration.retryUntil" =
time.Now().Add(60*time.Second).UTC().Format(time.RFC3339)) and persist it
(handle update conflicts); then in the reconcile path that decides to create a
new migration Job (the code that currently creates the Job when IsNotFound),
read and parse that "migration.retryUntil" value and skip Job creation until the
timestamp has passed (clear the annotation/status field when creating the Job).
This ensures the controller honors the cooldown even though the owned Job watch
will requeue the Deployment.
In `@operator/Makefile`:
- Around line 5-6: The build target fails when bin/ is missing; update the
Makefile's build target (the "build:" target) to ensure the output directory
exists by creating bin before running go build (use mkdir -p bin) so that go
build -o bin/operator ./cmd/ never errors on fresh clones or after make clean.
In `@operator/README.md`:
- Around line 9-15: Update the README text that currently references the
hard-coded ConfigMap name "openfga-migration-status" to use the
deployment-scoped ConfigMap name pattern used elsewhere in the PR (e.g.,
"<release>-migration-status"); specifically, edit the line describing version
comparison so it no longer mentions "openfga-migration-status" but instead
references the deployment-scoped ConfigMap name pattern
("<release>-migration-status" or equivalent placeholder) so docs and tests point
to the correct resource.
---
Nitpick comments:
In `@operator/tests/values-no-db.yaml`:
- Line 18: Replace the inline basic-auth credentials embedded in the YAML "uri"
value (the failing PostgreSQL URI) with a redacted placeholder or
environment-variable reference (e.g., remove "openfga:changeme@" and replace
with a token like "<REDACTED_CREDENTIALS>" or use an env var reference) so the
test fixture no longer contains plaintext credentials; update the "uri" key's
value accordingly and ensure any test code reading this fixture can handle the
placeholder or env var.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 8f9a57a5-18a8-4395-92d9-b4eae0a8affc
⛔ Files ignored due to path filters (2)
- `charts/openfga/Chart.lock` is excluded by `!**/*.lock`
- `operator/go.sum` is excluded by `!**/*.sum`
📒 Files selected for processing (37)
- `.github/workflows/operator.yml`
- `charts/openfga-operator/Chart.yaml`
- `charts/openfga-operator/crds/README.md`
- `charts/openfga-operator/templates/NOTES.txt`
- `charts/openfga-operator/templates/_helpers.tpl`
- `charts/openfga-operator/templates/clusterrole.yaml`
- `charts/openfga-operator/templates/clusterrolebinding.yaml`
- `charts/openfga-operator/templates/deployment.yaml`
- `charts/openfga-operator/templates/serviceaccount.yaml`
- `charts/openfga-operator/values.yaml`
- `charts/openfga/Chart.yaml`
- `charts/openfga/templates/_helpers.tpl`
- `charts/openfga/templates/deployment.yaml`
- `charts/openfga/templates/job.yaml`
- `charts/openfga/templates/rbac.yaml`
- `charts/openfga/templates/serviceaccount.yaml`
- `charts/openfga/values.schema.json`
- `charts/openfga/values.yaml`
- `docs/adr/000-template.md`
- `docs/adr/001-adopt-openfga-operator.md`
- `docs/adr/002-operator-managed-migrations.md`
- `docs/adr/003-declarative-store-lifecycle-crds.md`
- `docs/adr/004-operator-deployment-model.md`
- `docs/adr/README.md`
- `operator/.dockerignore`
- `operator/Dockerfile`
- `operator/Makefile`
- `operator/README.md`
- `operator/cmd/main.go`
- `operator/go.mod`
- `operator/internal/controller/helpers.go`
- `operator/internal/controller/migration_controller.go`
- `operator/internal/controller/migration_controller_test.go`
- `operator/tests/README.md`
- `operator/tests/values-db-outage.yaml`
- `operator/tests/values-happy-path.yaml`
- `operator/tests/values-no-db.yaml`
- Harden pod security (runAsNonRoot, seccompProfile, drop ALL caps)
- Find container by name instead of index to handle sidecars
- Skip migration for memory datastore
- Persist retry-after annotation before Job deletion to survive re-enqueue
- Clear MigrationFailed condition on success
- Propagate imagePullSecrets and securityContext to migration Jobs
- Remove unused RBAC rules (secrets, serviceaccounts)
- Add POD_NAMESPACE downward API for namespace-scoped watch default
- Remove no-op migration values (timeout, backoffLimit, resources)
- Fix migration SA helper to require name when create=false
- Guard operator logic on both operator.enabled and migration.enabled
- Build and load operator image into kind for chart-testing CI
- Add path filters to operator workflow
- Fix ADR inaccuracies (retry strategy, default-enabled wording)
- Pin Dockerfile base image to golang:1.26.2
@coderabbitai review

✅ Actions performed: Review triggered.
Propose adopting a Kubernetes operator for OpenFGA lifecycle management, covering migration handling, declarative CRDs for stores/models/tuples, and the operator deployment model as a Helm subchart. Also adds the ADR process documentation, template, and chart analysis.

- ADR-001: Adopt a Kubernetes Operator
- ADR-002: Operator-Managed Migrations
- ADR-003: Declarative Store Lifecycle CRDs
- ADR-004: Operator Deployment as Helm Subchart
Force-pushed `24897df` to `f149c60`.
@coderabbitai review

✅ Actions performed: Review triggered.
@copilot review
Pull request overview
Copilot reviewed 38 out of 40 changed files in this pull request and generated 5 comments.
Comments suppressed due to low confidence (1)
charts/openfga/templates/deployment.yaml:65
- When `operator.enabled=true` but `migration.enabled=false`, the chart disables the hook-based migration Job (templates/job.yaml) and RBAC (templates/rbac.yaml), but the Deployment can still render the legacy `wait-for-migration` initContainer because `$operatorMigration := and operator.enabled migration.enabled` evaluates to false. With the default `datastore.migrationType: job` + `waitForMigrations: true`, this will deadlock (initContainer waits for a Job that the chart no longer creates). Consider changing the initContainer gating to disable legacy migration orchestration whenever `operator.enabled` is true (or explicitly fail fast for the unsupported combination).
```yaml
{{- $operatorMigration := and .Values.operator.enabled .Values.migration.enabled }}
{{ if or (and (not $operatorMigration) (or (and (has .Values.datastore.engine (list "postgres" "mysql")) .Values.datastore.applyMigrations .Values.datastore.waitForMigrations))) .Values.extraInitContainers }}
      initContainers:
        {{- if not $operatorMigration }}
        {{- if and (has .Values.datastore.engine (list "postgres" "mysql")) .Values.datastore.applyMigrations .Values.datastore.waitForMigrations (eq .Values.datastore.migrationType "job") }}
        - name: wait-for-migration
          securityContext:
            {{- toYaml .Values.securityContext | nindent 12 }}
          image: "{{ .Values.initContainer.repository }}:{{ .Values.initContainer.tag }}"
          imagePullPolicy: {{ .Values.initContainer.pullPolicy }}
          args: ["job-wr", '{{ include "openfga.fullname" . }}-migrate']
          resources:
            {{- toYaml .Values.datastore.migrations.resources | nindent 12 }}
        {{- end }}
        {{- if and (has .Values.datastore.engine (list "postgres" "mysql")) (eq .Values.datastore.migrationType "initContainer") }}
```
🧹 Nitpick comments (7)
operator/README.md (1)
90-102: Add language specifier to the project structure block. The directory tree is missing a language identifier. Add `text` after the opening backticks to resolve the markdownlint warning.

📝 Proposed fix

````diff
-```
+```text
 operator/
 ├── cmd/
 │   └── main.go    # Entry point, manager setup
````

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@operator/README.md` around lines 90 - 102, Update the fenced code block in operator/README.md that contains the project tree (the block starting with "operator/" and listing cmd/, internal/, Dockerfile, etc.) to include a language specifier by changing the opening ``` to ```text so the block becomes a text-formatted code fence; this will resolve the markdownlint warning without altering the block contents.

docs/adr/004-operator-deployment-model.md (1)
69-90: Add language specifier to this fenced code block. The directory structure block is missing a language identifier, which triggers the markdownlint warning. Add `text` after the opening backticks.

📝 Proposed fix

````diff
-```
+```text
 helm-charts/
 ├── charts/
 │   ├── openfga/    # Main chart (existing)
````

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/adr/004-operator-deployment-model.md` around lines 69 - 90, The fenced code block showing the Helm chart directory structure lacks a language specifier; update that block by adding "text" immediately after the opening triple backticks (i.e., change ``` to ```text) so the directory tree is treated as plain text and markdownlint warning is resolved (refer to the directory structure block containing "helm-charts/" and the nested "openfga" / "openfga-operator" entries).

operator/cmd/main.go (1)
31-43: Consider validating that Job configuration flags are non-negative. The flags `--backoff-limit`, `--active-deadline-seconds`, and `--ttl-seconds-after-finished` are parsed as `int` but later cast to `int32`/`int64`. Negative values would produce unexpected behavior. The Kubernetes API would reject invalid values, but an early validation with a clear error message would improve the operator experience.

🛡️ Suggested validation after flag.Parse()

```diff
 flag.Parse()
+
+if backoffLimit < 0 || activeDeadline < 0 || ttlAfterFinished < 0 {
+	fmt.Fprintln(os.Stderr, "error: job configuration flags must be non-negative")
+	os.Exit(1)
+}
+
 ctrl.SetLogger(zap.New(zap.UseFlagOptions(&opts)))
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@operator/cmd/main.go` around lines 31 - 43, After parsing flags, validate the three Job-related flag variables backoffLimit, activeDeadline, and ttlAfterFinished are non-negative; in main.go (after flag.Parse()) add checks that return/log a clear fatal error if any value < 0 (e.g., using log.Fatalf or klog.Fatalf) so the operator exits with a descriptive message instead of proceeding to cast to int32/int64 and creating invalid Kubernetes Job specs; reference the flag variables backoffLimit, activeDeadline, ttlAfterFinished and the controller.Default* constants when adding the validation.

operator/internal/controller/migration_controller.go (1)
79-90: Status update error handling could be improved. When clearing the `MigrationFailed` condition, errors from `r.Status().Update()` are logged but execution continues. If this update fails due to a conflict, the condition remains stale. Consider returning a requeue to ensure the condition is eventually cleared.

♻️ Suggested improvement

```diff
 if currentVersion == desiredVersion {
 	logger.V(1).Info("migration up to date", "version", desiredVersion)
 	clearMigrationFailedCondition(deployment)
 	if patchErr := r.Status().Update(ctx, deployment); patchErr != nil {
-		logger.Error(patchErr, "failed to clear MigrationFailed condition")
+		if apierrors.IsConflict(patchErr) {
+			return ctrl.Result{Requeue: true}, nil
+		}
+		logger.Error(patchErr, "failed to clear MigrationFailed condition")
 	}
 	if _, scaleErr := ensureDeploymentScaled(ctx, r.Client, deployment); scaleErr != nil {
```

Apply the same pattern at lines 146-148.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@operator/internal/controller/migration_controller.go` around lines 79 - 90, The status update that clears MigrationFailed via clearMigrationFailedCondition(deployment) calls r.Status().Update(ctx, deployment) but only logs errors and proceeds; change this so that if r.Status().Update returns an error (e.g., a conflict) you return that error (or return ctrl.Result{Requeue: true} with the error) instead of continuing, mirroring the pattern used at lines 146-148; locate the block that checks currentVersion == desiredVersion, handle patchErr from r.Status().Update(ctx, deployment) by returning the error or a requeue result so the condition will be retried before calling ensureDeploymentScaled and returning success.

operator/internal/controller/migration_controller_test.go (1)
166-231: Test lacks verification of MigrationFailed condition clearing. The `TestReconcile_JobSucceeded_UpdatesConfigMapAndScalesUp` test verifies ConfigMap creation and Deployment scaling, but doesn't verify that a prior `MigrationFailed` condition gets cleared. Consider adding a test case where the Deployment starts with a `MigrationFailed` condition and verifying it's cleared after success.

🧪 Suggested additional test case

```go
func TestReconcile_JobSucceeded_ClearsMigrationFailedCondition(t *testing.T) {
	// Given: a Deployment with a MigrationFailed condition and a succeeded Job.
	dep := newTestDeployment("openfga", "default", "openfga/openfga:v1.14.0", 0)
	dep.Annotations[AnnotationDesiredReplicas] = "3"
	dep.Status.Conditions = []appsv1.DeploymentCondition{
		{
			Type:   "MigrationFailed",
			Status: corev1.ConditionTrue,
			Reason: "MigrationJobFailed",
		},
	}
	job := &batchv1.Job{
		// ... same as existing test ...
		Status: batchv1.JobStatus{Succeeded: 1},
	}
	r := newReconciler(dep, job)
	_, err := r.Reconcile(context.Background(), ctrl.Request{
		NamespacedName: types.NamespacedName{Name: "openfga", Namespace: "default"},
	})
	if err != nil {
		t.Fatalf("unexpected error: %v", err)
	}
	// Verify MigrationFailed condition is cleared.
	updated := &appsv1.Deployment{}
	_ = r.Get(context.Background(), types.NamespacedName{Name: "openfga", Namespace: "default"}, updated)
	for _, c := range updated.Status.Conditions {
		if c.Type == "MigrationFailed" && c.Status == corev1.ConditionTrue {
			t.Error("expected MigrationFailed condition to be cleared")
		}
	}
}
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@operator/internal/controller/migration_controller_test.go` around lines 166 - 231: add a test that ensures a pre-existing MigrationFailed condition on the Deployment is removed after a successful Job. Create a Deployment via newTestDeployment, set dep.Annotations[AnnotationDesiredReplicas] and seed dep.Status.Conditions with a DeploymentCondition having Type "MigrationFailed" and Status corev1.ConditionTrue, create a succeeded batchv1.Job as in the existing TestReconcile_JobSucceeded_UpdatesConfigMapAndScalesUp, call newReconciler(...).Reconcile(...), then Get the updated Deployment and assert that no Status.Condition remains with Type "MigrationFailed" && Status corev1.ConditionTrue. Name the new test TestReconcile_JobSucceeded_ClearsMigrationFailedCondition and reuse the same Job/Reconciler setup symbols.

operator/internal/controller/helpers.go (1)
258-285: Scaling logic has a subtle edge case when replicas annotation already exists.

If the `AnnotationDesiredReplicas` annotation already exists (e.g., from a previous scale-down), the function correctly avoids overwriting it (line 272). However, if the annotation contains "0" from a previous interrupted reconciliation where the Deployment was manually set to 0 replicas before the operator took over, `ensureDeploymentScaled` would restore to 0 replicas.

This is documented behavior per the ADR ("Deployment starts at replicas: 0"), but consider logging a warning when the stored desired replicas is 0 to aid debugging.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@operator/internal/controller/helpers.go` around lines 258 - 285: in scaleDeploymentToZero, detect when deployment.Annotations[AnnotationDesiredReplicas] already exists and equals "0" and emit a warning with the Deployment identity to aid debugging. Update the function (around the annotation-check block) to check whether the stored value == "0" and call your controller logger (e.g., klog.Warningf or ctrl.Log.WithValues("deployment", deployment.Name, "namespace", deployment.Namespace).V(0).Info/Warning) to record that the saved desired replicas is zero before returning/continuing, leaving the existing behavior of not overwriting the annotation intact.

docs/adr/002-operator-managed-migrations.md (1)
76-94: Consider adding language specifiers to code fences.

The ASCII diagrams and sequence blocks lack language specifiers. Adding `text` or `plaintext` as the language would satisfy markdown linters and improve rendering consistency across platforms.

📝 Suggested fix

````diff
-```
+```text
 ┌────────────────────────────────────────────────────────┐
 │              Operator Reconciliation                   │
````

Apply similarly to the code fences at lines 130 and 147.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/adr/002-operator-managed-migrations.md` around lines 76 - 94, The ASCII diagrams and sequence blocks (for example the block starting with "Operator Reconciliation" and the other two visual blocks later in the file) are missing Markdown language specifiers; update each triple-backtick fence that contains ASCII art/sequence steps to use a plain text specifier (e.g., ```text or ```plaintext) so markdown linters/renderers handle them consistently.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@docs/adr/002-operator-managed-migrations.md`:
- Around line 76-94: The ASCII diagrams and sequence blocks (for example the
block starting with "Operator Reconciliation" and the other two visual blocks
later in the file) are missing Markdown language specifiers; update each
triple-backtick fence that contains ASCII art/sequence steps to use a plain text
specifier (e.g., ```text or ```plaintext) so markdown linters/renderers handle
them consistently.
In `@docs/adr/004-operator-deployment-model.md`:
- Around line 69-90: The fenced code block showing the Helm chart directory
structure lacks a language specifier; update that block by adding "text"
immediately after the opening triple backticks (i.e., change ``` to ```text) so
the directory tree is treated as plain text and markdownlint warning is resolved
(refer to the directory structure block containing "helm-charts/" and the nested
"openfga" / "openfga-operator" entries).
In `@operator/cmd/main.go`:
- Around line 31-43: After parsing flags, validate the three Job-related flag
variables backoffLimit, activeDeadline, and ttlAfterFinished are non-negative;
in main.go (after flag.Parse()) add checks that return/log a clear fatal error
if any value < 0 (e.g., using log.Fatalf or klog.Fatalf) so the operator exits
with a descriptive message instead of proceeding to cast to int32/int64 and
creating invalid Kubernetes Job specs; reference the flag variables
backoffLimit, activeDeadline, ttlAfterFinished and the controller.Default*
constants when adding the validation.
In `@operator/internal/controller/helpers.go`:
- Around line 258-285: In scaleDeploymentToZero, detect when
deployment.Annotations[AnnotationDesiredReplicas] already exists and equals "0"
and emit a warning with the Deployment identity to aid debugging; update the
function (around the annotation-check block) to check the stored value string ==
"0" and call your controller logger (e.g., klog.Warningf or
ctrl.Log.WithValues("deployment", deployment.Name, "namespace",
deployment.Namespace).V(0).Info/Warning) to record that the saved desired
replicas is zero before returning/continuing, leaving the existing behavior of
not overwriting the annotation intact.
In `@operator/internal/controller/migration_controller_test.go`:
- Around line 166-231: Add a test that ensures a pre-existing MigrationFailed
condition on the Deployment is removed after a successful Job: create a
Deployment via newTestDeployment, set dep.Annotations[AnnotationDesiredReplicas]
and seed dep.Status.Conditions with a DeploymentCondition having Type
"MigrationFailed" and Status corev1.ConditionTrue, create a succeeded
batchv1.Job as in the existing
TestReconcile_JobSucceeded_UpdatesConfigMapAndScalesUp, call
newReconciler(...).Reconcile(...), then Get the updated Deployment and assert
that no Status.Condition remains with Type "MigrationFailed" && Status
corev1.ConditionTrue; name the new test
TestReconcile_JobSucceeded_ClearsMigrationFailedCondition and reuse the same
Job/Reconciler setup symbols.
In `@operator/internal/controller/migration_controller.go`:
- Around line 79-90: The status update that clears MigrationFailed via
clearMigrationFailedCondition(deployment) calls r.Status().Update(ctx,
deployment) but only logs errors and proceeds; change this so that if
r.Status().Update returns an error (e.g., a conflict) you return that error (or
return ctrl.Result{Requeue: true} with the error) instead of continuing,
mirroring the pattern used at lines 146-148; locate the block that checks
currentVersion == desiredVersion, handle patchErr from r.Status().Update(ctx,
deployment) by returning the error or a requeue result so the condition will be
retried before calling ensureDeploymentScaled and returning success.
In `@operator/README.md`:
- Around line 90-102: Update the fenced code block in operator/README.md that
contains the project tree (the block starting with "operator/" and listing cmd/,
internal/, Dockerfile, etc.) to include a language specifier by changing the
opening ``` to ```text so the block becomes a text-formatted code fence; this
will resolve the markdownlint warning without altering the block contents.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 5c414505-2633-4683-95be-cd73ce829f78
⛔ Files ignored due to path filters (2)
- `charts/openfga/Chart.lock` is excluded by `!**/*.lock`
- `operator/go.sum` is excluded by `!**/*.sum`
📒 Files selected for processing (38)
- `.github/workflows/operator.yml`
- `.github/workflows/test.yml`
- `charts/openfga-operator/Chart.yaml`
- `charts/openfga-operator/crds/README.md`
- `charts/openfga-operator/templates/NOTES.txt`
- `charts/openfga-operator/templates/_helpers.tpl`
- `charts/openfga-operator/templates/clusterrole.yaml`
- `charts/openfga-operator/templates/clusterrolebinding.yaml`
- `charts/openfga-operator/templates/deployment.yaml`
- `charts/openfga-operator/templates/serviceaccount.yaml`
- `charts/openfga-operator/values.yaml`
- `charts/openfga/Chart.yaml`
- `charts/openfga/templates/_helpers.tpl`
- `charts/openfga/templates/deployment.yaml`
- `charts/openfga/templates/job.yaml`
- `charts/openfga/templates/rbac.yaml`
- `charts/openfga/templates/serviceaccount.yaml`
- `charts/openfga/values.schema.json`
- `charts/openfga/values.yaml`
- `docs/adr/000-template.md`
- `docs/adr/001-adopt-openfga-operator.md`
- `docs/adr/002-operator-managed-migrations.md`
- `docs/adr/003-declarative-store-lifecycle-crds.md`
- `docs/adr/004-operator-deployment-model.md`
- `docs/adr/README.md`
- `operator/.dockerignore`
- `operator/Dockerfile`
- `operator/Makefile`
- `operator/README.md`
- `operator/cmd/main.go`
- `operator/go.mod`
- `operator/internal/controller/helpers.go`
- `operator/internal/controller/migration_controller.go`
- `operator/internal/controller/migration_controller_test.go`
- `operator/tests/README.md`
- `operator/tests/values-db-outage.yaml`
- `operator/tests/values-happy-path.yaml`
- `operator/tests/values-no-db.yaml`
✅ Files skipped from review due to trivial changes (11)
- operator/.dockerignore
- charts/openfga-operator/crds/README.md
- charts/openfga/Chart.yaml
- operator/go.mod
- operator/Dockerfile
- charts/openfga-operator/templates/NOTES.txt
- docs/adr/001-adopt-openfga-operator.md
- charts/openfga-operator/Chart.yaml
- charts/openfga-operator/templates/_helpers.tpl
- charts/openfga/templates/_helpers.tpl
- charts/openfga-operator/values.yaml
🚧 Files skipped from review as they are similar to previous changes (4)
- charts/openfga/templates/serviceaccount.yaml
- charts/openfga/values.schema.json
- .github/workflows/operator.yml
- charts/openfga/templates/deployment.yaml
Replace Helm hook-based migrations with a lightweight Kubernetes operator that watches OpenFGA Deployments, detects version changes, and runs migrations as regular Jobs.

- Go operator using controller-runtime (no CRDs)
- Helm subchart with opt-in via operator.enabled (default false)
- Dedicated migration ServiceAccount (separate from runtime)
- Auto-recovery on database failure (delete/retry cycle)
- GitHub Actions workflow for multi-arch image builds
- Integration test values for local Kubernetes clusters

Resolves #211, #107, #120, #100, #126
- Harden pod security (runAsNonRoot, seccompProfile, drop ALL caps)
- Find container by name instead of index to handle sidecars
- Skip migration for memory datastore
- Persist retry-after annotation before Job deletion to survive re-enqueue
- Clear MigrationFailed condition on success
- Propagate imagePullSecrets and securityContext to migration Jobs
- Remove unused RBAC rules (secrets, serviceaccounts)
- Add POD_NAMESPACE downward API for namespace-scoped watch default
- Remove no-op migration values (timeout, backoffLimit, resources)
- Fix migration SA helper to require name when create=false
- Guard operator logic on both operator.enabled and migration.enabled
- Build and load operator image into kind for chart-testing CI
- Add path filters to operator workflow
- Fix ADR inaccuracies (retry strategy, default-enabled wording)
- Pin Dockerfile base image to golang:1.26.2
- Render extraInitContainers in operator mode (previously skipped)
- Add version label to migration Jobs and delete stale Jobs on image change
- Use namespaced Role/RoleBinding when watchAllNamespaces is false
- Replace Status().Update with Status().Patch to avoid write conflicts
- Fix logger.Error(nil, ...) to logger.Info for expected failure state
- Wire desiredVersion param into buildMigrationJob for version tracking
- Update RBAC: deployments/status verb from update to patch
- Don't force replicas: 0 for memory engine in operator mode
- Guard migration SA creation on migration.enabled
- Document both required labels and mutable tag limitation in README
- Add opt-in annotation (openfga.dev/migration-enabled) so the operator only manages migrations for explicitly opted-in Deployments
- Propagate volumes, volumeMounts, and envFrom from the Deployment to migration Jobs for TLS certs and file-based credentials
- Remove watchAllNamespaces option; operator is now always namespace-scoped
- Update ADR-004 dependency example to match actual file:// reference
- Add test for migration-not-enabled skip behavior
The migration Job should only receive explicitly filtered OPENFGA_DATASTORE_* env vars, not the full EnvFrom from the source Deployment which could leak non-datastore secrets.
- Wrap deployment annotations in conditional to avoid emitting an empty annotations: field, which produces an invalid manifest
- Store full version in annotation (openfga.dev/desired-version) and truncate label to 63 chars to support digest-pinned images
- Align operator image default to ghcr.io/openfga/openfga-operator to match CI publishing target
- Return error on retry-after annotation patch failure to prevent Job churn that bypasses the 60s cooldown
- Add test for stale-Job version mismatch deletion path
…values overflowing)
When operator.enabled=true but migration.enabled=false, the legacy wait-for-migration initContainer could render and hang waiting for a Job that will never be created. Gate on operator.enabled instead so the legacy path is fully disabled when the operator is installed.
…essions The double-quoted annotation values contained inner double quotes (e.g. "memory") which produced invalid YAML that IDEs flagged as errors. Switch to single-quote wrappers so the inner Go template strings don't conflict.
- Pin Dockerfile base images by digest for reproducible builds
- Fix version label fallback comparison for digest-pinned images by sanitizing desiredVersion before comparing to the label value
…y migration-specific volumes
Prevents the operator from being evicted during node drains, which could leave the OpenFGA Deployment stuck at 0 replicas with no controller to scale it back up. Disabled by default; supports both minAvailable and maxUnavailable modes.
Replace checks on job.Status.Failed >= backoffLimit with isJobConditionTrue(job, batchv1.JobFailed), and job.Status.Succeeded with batchv1.JobComplete. The Job controller sets conditions atomically when it makes its final decision, avoiding races where the operator acts on intermediate counter states before Kubernetes has finished cleaning up.
20 new tests across 4 files covering the operator-enabled code paths that previously had no unit test coverage:

- deployment: annotations, replicas=0, autoscaling conflict, initContainers gating
- job: template not rendered when operator enabled
- serviceaccount: migration SA creation, custom names, IRSA annotations
- rbac: legacy Role/RoleBinding not rendered when operator enabled
The old Helm-templated migration Job uses datastore.migrations.resources for resource limits, but the operator-built Job had none. Inherit the main container's Resources to maintain parity and prevent unbounded resource consumption during migrations.
Four new tests covering previously untested code paths:

- StaleJob_LabelOnlyFallback: version-mismatch detection when Job has only a label (no annotation), exercising the sanitized-label fallback
- JobSucceeded_UpdatesExistingConfigMap: ConfigMap update path when a prior version's ConfigMap already exists
- ScaleToZero_NilAnnotationsMap: scaleDeploymentToZero correctly stores desired-replicas when the annotation was not previously set
- JobInProgress_Requeues: in-progress Job triggers 10s requeue without scaling up or modifying the Deployment
The README documented these flags as configurable via the subchart values.yaml, but only --leader-elect and --watch-namespace were actually wired. Add migrationJob.backoffLimit, activeDeadlineSeconds, and ttlSecondsAfterFinished values with matching args in the operator Deployment template.
The _helpers.tpl namespace template already supported namespaceOverride but the value was undeclared in values.yaml, making it undiscoverable.
The test validates that scaleDeploymentToZero stores the current replica count in the desired-replicas annotation before zeroing, not that it handles a nil annotations map.
Falling back to the "default" ServiceAccount would silently grant operator RBAC permissions (Deployment patch, Job create/delete) to a shared SA. Require an explicit name so the user makes a deliberate choice.
When concurrent reconciles race between the GET and CREATE, the second create returns AlreadyExists. Treat this as benign and requeue to poll the existing Job instead of returning a hard error that produces noisy reconcile failures in the controller logs.
Address review findings: add allowPrivilegeEscalation: false for restricted PSS compliance, set default resource requests/limits, add values.schema.json validation, use stable selectorLabels on pod template to prevent spurious rollouts, add .helmignore, and improve Chart.yaml metadata and NOTES.txt with migration commands.
The operator was scaling Deployments to 0 replicas during every migration, causing a full outage on every helm upgrade — a regression from the existing rolling update behavior. OpenFGA already gates readiness on schema version (MinimumSupportedDatastoreSchemaRevision in sqlcommon.IsReady), so new pods naturally block until migration completes while old pods keep serving. Use Helm's lookup function to preserve the live replica count on upgrade (falling back to replicas: 0 on fresh install where no Deployment exists). Remove scaleDeploymentToZero from the operator reconcile loop. Update ADR-002 to document the rationale and the readiness gate dependency.
Force-pushed from 5ea7c09 to dfe3180
Description
What problem is being solved?
The OpenFGA Helm chart uses Helm hooks (`post-install`, `post-upgrade`) and a `k8s-wait-for` init container to orchestrate database migrations. This approach breaks in multiple ways:

- `helm install --wait` deadlocks — the init container waits for the hook Job, but the hook only runs after the Deployment is ready ("hen egg" problem when used with helm deploy #120)
- `hook-delete-policy` conflict — the Job is cleaned up before FluxCD can confirm the Deployment is healthy (Hooked job deletion incompatible with Helm `--wait` and FluxCD by default #100)
- `k8s-wait-for` is unmaintained — 2+ years stale, known CVEs, pinned by tag not digest (Reuse k8s-wait-for image repo and tag across migrate job and initContainer #126, `k8s-wait-for` image is inactively maintained and contains many vulnerabilities #132, An external dependency to groundnuty/k8s-wait-for is pinned using tag. #144)

These represent the single biggest pain point for users of the chart.
How is it being solved?
A lightweight Kubernetes operator moves migration orchestration from deploy-time Helm hooks into the runtime control plane. The operator watches OpenFGA Deployments, detects version changes, and runs migrations as regular Kubernetes Jobs — no hooks, no init containers.
This is Stage 1 of the operator, focused solely on migration orchestration. It does not introduce any CRDs. The operator watches native Deployments by label selector (`app.kubernetes.io/part-of: openfga`, `app.kubernetes.io/component: authorization-controller`).
What changes are made to solve it?
New: Go operator (`operator/`)

- Tracks migration state in a `{name}-migration-status` ConfigMap: scales Deployment to 0, creates a migration Job running `openfga migrate`, waits for completion, updates the ConfigMap, scales Deployment back up
- On failure: sets a `MigrationFailed` condition on the Deployment, deletes the failed Job, retries after 60s — self-heals when the database comes back
- Passes only filtered `OPENFGA_DATASTORE_*` env vars to the migration Job (least privilege)

New: Operator Helm subchart (`charts/openfga-operator/`)

- Configurable via `values.yaml`: leader election, watch namespace, resource limits, Job backoff/TTL settings

Modified: Parent chart (`charts/openfga/`)

- Operator subchart wired in behind an `operator.enabled` condition (defaults to `false` — fully backward compatible)
- In operator mode the Deployment carries `openfga.dev/desired-replicas` and `openfga.dev/migration-service-account` annotations; init containers, hook Job, and hook RBAC are skipped
- Dedicated migration ServiceAccount gated on `operator.enabled && migration.serviceAccount.create`
- `values.schema.json` updated for new `operator` and `migration` properties

New: GitHub Actions workflow (`.github/workflows/operator.yml`)

- Builds multi-arch images published to `ghcr.io/{owner}/openfga-operator`, triggered by changes under `operator/**`

New: Integration test values (`operator/tests/`)

How each issue is resolved:

- `helm install --wait` deadlock: the hook Job and init container are gone, so there is nothing to deadlock on (#120)
- `hook-delete-policy` conflict: the operator manages Job lifecycle with TTL and explicit cleanup (#100)
- `k8s-wait-for`: dependency removed entirely (#126, #132, #144)

References
- Hooked job deletion incompatible with Helm `--wait` and FluxCD by default #100
- `k8s-wait-for` image is inactively maintained and contains many vulnerabilities #132
- An external dependency to groundnuty/k8s-wait-for is pinned using tag. #144

Review Checklist
- main

Summary by CodeRabbit
New Features
- `operator.enabled` configuration option with backward-compatible defaults

Documentation
Tests