fix: wire egress gateway self-enrollment #570

Merged: vitramir merged 5 commits into main from noa/issue-153-bootstrap-enrollment on Jun 18, 2026

@casey-brooks (Contributor) commented Jun 12, 2026

Summary

  • Replaces the bootstrap shim direction of the closed PR #569 (fix(egress): align gateway ziti identity) with the existing provider-created enrollment Secret contract.
  • Keeps ziti_identity.egress_gateway / egress_gateway_enrollment_token / kubernetes_secret_v1.egress_gateway_enrollment as the bootstrap path.
  • Mounts the enrollment Secret and a writable /var/lib/ziti volume for the egress-gateway runtime to self-enroll.
  • Pins egress-compatible runtime chart defaults.
  • Builds terraform-provider-agyn PR #81 from noa/issue-153-continue during full-apply and passes the built provider binary into e2e, so bootstrap CI recognizes agyn_egress_rule and agyn_egress_rule_attachment until provider PR #81 lands.

Linked to agynio/architecture#153 and follows up on #569 review feedback.

Runtime/chart support PRs:

Test & Lint Summary

  • terraform -chdir=stacks/platform fmt -check -diff — passed.
  • terraform -chdir=stacks/platform init -backend=false — passed.
  • terraform -chdir=stacks/platform validate — passed.
  • actionlint .github/workflows/bootstrap.yml — passed with no errors.
  • git diff --check — passed with no whitespace errors.
  • go build -o /tmp/terraform-provider-agyn . from terraform-provider-agyn PR #81 branch — passed.
  • go test -run '^$' -tags 'e2e svc_gateway tf_provider_agyn' ./tests from e2e/suites/go-terraform — passed: 1 package, failed: 0, skipped: 0; no tests run in compile-only mode.
  • go vet -tags 'e2e svc_gateway tf_provider_agyn' ./tests from e2e/suites/go-terraform — passed with no errors.

Notes

  • The provider binary is staged at provider-bin/terraform-provider-agyn rather than the repository root, because go build -o ../terraform-provider-agyn . would create a binary inside the existing checkout directory and would not satisfy the e2e action's file check.
  • The egress backend default-port fix is in egress#14 (fix(egress): default Ziti service ports), and the e2e workflow pins that fixed runtime image until it is released.

@casey-brooks requested a review from a team as a code owner June 12, 2026 08:05

@casey-brooks (Contributor Author)

Status update for #153:

  • Pushed 2752a9a fix(platform): pin egress runtime charts.
  • Kept the existing self-enrollment wiring and pinned platform defaults to the released egress-compatible runtime charts: secrets_chart_version = 0.2.2 and egress_gateway_chart_version = 0.1.2.

Local validation:

  • terraform -chdir=stacks/platform fmt -check -diff — passed.
  • terraform -chdir=stacks/platform init -backend=false — passed.
  • terraform -chdir=stacks/platform validate — passed.

Note: the previous full-apply failure was caused by the deployed Secrets chart lacking ResolveSecretExists; the chart pin update is intended to address that runtime mismatch.

@casey-brooks (Contributor Author)

Follow-up fix for CI failure:

  • Pushed be58db2 fix(platform): pass single secrets database source.
  • The secrets chart 0.2.2 defaults database.existingSecret.name to agyn-platform-database-urls; bootstrap was also passing database.url, so Argo CD/Helm failed with "set only one of database.url or database.existingSecret.name".
  • Updated local.secrets_values to explicitly clear database.existingSecret.name / key when passing the direct database.url, so only one database source is rendered.

Local validation:

  • terraform -chdir=stacks/platform fmt -check -diff — passed.
  • terraform -chdir=stacks/platform init -backend=false — passed.
  • terraform -chdir=stacks/platform validate — passed.
  • helm template secrets oci://ghcr.io/agynio/charts/secrets --version 0.2.2 --namespace platform -f /tmp/bootstrap-secrets-values.yaml — passed; rendered DATABASE_URL as a direct value and no secretKeyRef for the database URL.
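
The rendered-manifest check in the last validation step could be automated along these lines. This is a sketch under assumptions, not the check that was actually run: db_url_is_direct is a hypothetical helper, and it assumes the chart renders the env entry with name: before value: or valueFrom:, which is common in Helm templates but is not confirmed in this thread.

```shell
# db_url_is_direct: reads a rendered manifest on stdin.
# Returns 0 only if the DATABASE_URL env entry carries a literal value,
# and 1 if it is missing or resolved through valueFrom/secretKeyRef.
# Hypothetical helper; assumes "name:" precedes "value:" in the env entry.
db_url_is_direct() {
  awk '
    /name:[ \t]*DATABASE_URL[ \t]*$/ { seen = 1; next }
    seen && /valueFrom:|secretKeyRef:/ { bad = 1; seen = 0 }
    seen && /value:/                   { ok = 1; seen = 0 }
    seen && /- name:/                  { seen = 0 }
    END { if (ok && !bad) exit 0; exit 1 }
  '
}

# Example usage (values file path as used in the validation step above):
#   helm template secrets oci://ghcr.io/agynio/charts/secrets --version 0.2.2 \
#     --namespace platform -f /tmp/bootstrap-secrets-values.yaml | db_url_is_direct
```

Piping the helm template output through such a check makes the "only one database source" expectation fail fast locally, before Argo CD sees the values.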

@noa-lucent left a comment

Starting review.

noa-lucent previously approved these changes Jun 13, 2026

@noa-lucent left a comment

Review complete. Bootstrap now wires the enrollment Secret plus writable identity volume and pins the runtime chart versions needed for the chain. Verified terraform fmt -check and terraform validate for stacks/platform after init.

@casey-brooks (Contributor Author)

Follow-up fixes pushed for the full-apply failures.

Changes:

  • Added commit 9bd0b1c ci(bootstrap): use egress provider branch on noa/issue-153-bootstrap-enrollment.
  • The bootstrap full-apply workflow now builds agynio/terraform-provider-agyn from PR #81 branch noa/issue-153-continue and passes that binary into the e2e runner via provider-binary, so the Terraform acceptance tests see the new agyn_egress_rule and agyn_egress_rule_attachment resource types until provider PR #81 lands.
  • The workflow also runs e2e from noa/issue-206, which now pins the fixed egress runtime image.

Root cause on the Ziti port failure:

Validation run for bootstrap changes:

| Command | Result |
| --- | --- |
| terraform -chdir=stacks/platform fmt -check -diff | passed |
| terraform -chdir=stacks/platform init -backend=false | passed |
| terraform -chdir=stacks/platform validate | passed |
| git diff --check | passed |
| go build -o /tmp/terraform-provider-agyn . (from terraform-provider-agyn PR #81 branch) | passed |
| go test -run '^$' -tags 'e2e svc_gateway tf_provider_agyn' ./tests (from e2e/suites/go-terraform) | passed: 1 package, failed: 0, skipped: 0; no tests run in compile-only mode |
| go vet -tags 'e2e svc_gateway tf_provider_agyn' ./tests (from e2e/suites/go-terraform) | passed |

@casey-brooks (Contributor Author)

Follow-up CI wiring fix pushed: 2a8440e fix(ci): stage provider binary file.

What changed:

  • The workflow now builds terraform-provider-agyn PR #81 into provider-bin/terraform-provider-agyn.
  • It passes provider-binary: provider-bin/terraform-provider-agyn to the e2e action.
  • This avoids the previous go build -o ../terraform-provider-agyn . path ambiguity, where Go would place the binary inside the existing checkout directory rather than at a file path accepted by the e2e action.
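
The staging fix above can be sketched as a small guard around the build. This is a minimal sketch, not the actual workflow step: stage_binary is a hypothetical helper, and the build command is passed in as arguments so the file check can be exercised without a Go toolchain. It relies on documented go build behaviour: when the -o target is an existing directory, the executable is written inside that directory instead of at that path.

```shell
# stage_binary OUT_PATH BUILD_CMD...
# Runs BUILD_CMD with OUT_PATH appended as its last argument, then verifies
# that OUT_PATH is a regular file. Refuses to run when OUT_PATH is already a
# directory, because "go build -o DIR" would place the binary inside DIR.
# Hypothetical helper for illustration only.
stage_binary() {
  local out="$1"
  shift
  if [ -d "$out" ]; then
    echo "refusing to build: $out is an existing directory" >&2
    return 1
  fi
  mkdir -p "$(dirname "$out")"
  "$@" "$out" || return 1
  if [ ! -f "$out" ]; then
    echo "build did not produce a regular file at $out" >&2
    return 1
  fi
  echo "$out"
}

# Example usage from inside the provider checkout (paths are assumptions):
#   stage_binary "$GITHUB_WORKSPACE/provider-bin/terraform-provider-agyn" go build -o
```

Failing early when the output path is a directory surfaces the ambiguity at build time rather than later in the e2e action's file check.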

Test & lint summary:

| Command | Result |
| --- | --- |
| actionlint .github/workflows/bootstrap.yml | passed with no errors |
| terraform -chdir=stacks/platform fmt -check -diff | passed |
| terraform -chdir=stacks/platform init -backend=false | passed |
| terraform -chdir=stacks/platform validate | passed |
| git diff --check | passed |
| go build -o /tmp/terraform-provider-agyn . (from terraform-provider-agyn PR #81 branch) | passed |
| go test -run '^$' -tags 'e2e svc_gateway tf_provider_agyn' ./tests (from e2e/suites/go-terraform) | passed: 1 package, failed: 0, skipped: 0; no tests run in compile-only mode |
| go vet -tags 'e2e svc_gateway tf_provider_agyn' ./tests (from e2e/suites/go-terraform) | passed with no errors |

noa-lucent previously approved these changes Jun 13, 2026

@noa-lucent left a comment

Re-review complete. The bootstrap workflow provider staging fix looks correct: it builds terraform-provider-agyn PR #81 into a real file path and passes that file to the e2e action. Terraform validation and diff checks pass locally. The remaining egress image availability issue is tracked on egress#14/e2e#207 rather than bootstrap.

@casey-brooks (Contributor Author)

Follow-up for downstream image availability pushed: a003ef8 ci(bootstrap): wait for egress PR image.

Although this PR was already approved, I added the same guard here so bootstrap full-apply also cannot race the egress PR image publication:

  • Waits for ghcr.io/agynio/egress:pr-14-75d68527f7f2ac9e69f7e036c39fe9d0af956e19 before provisioning.
  • Exports TF_VAR_egress_image_tag=pr-14-75d68527f7f2ac9e69f7e036c39fe9d0af956e19 only after the image is visible.
  • Times out clearly before Kubernetes deployment if the image is missing.
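
The guard described above amounts to a bounded polling loop. The following is a sketch under assumptions rather than the step that was committed: wait_for_image, its argument order, and the optional probe command are hypothetical; the default probe uses docker manifest inspect, which exits non-zero while a tag is not yet published.

```shell
# wait_for_image IMAGE TIMEOUT_SECONDS INTERVAL_SECONDS [PROBE_CMD...]
# Polls PROBE_CMD IMAGE until it succeeds or TIMEOUT_SECONDS have elapsed.
# Returns 0 once the image is visible, 1 on timeout.
# Hypothetical helper; the default probe is "docker manifest inspect".
wait_for_image() {
  local image="$1" timeout="$2" interval="$3"
  shift 3
  if [ "$#" -eq 0 ]; then
    set -- docker manifest inspect
  fi
  local deadline
  deadline=$(( $(date +%s) + timeout ))
  until "$@" "$image" >/dev/null 2>&1; do
    if [ "$(date +%s)" -ge "$deadline" ]; then
      echo "timed out after ${timeout}s waiting for $image" >&2
      return 1
    fi
    sleep "$interval"
  done
}

# Example usage: export the tag only after the image is visible.
#   tag="pr-14-75d68527f7f2ac9e69f7e036c39fe9d0af956e19"
#   wait_for_image "ghcr.io/agynio/egress:${tag}" 600 15 \
#     && export TF_VAR_egress_image_tag="$tag"
```

In a GitHub Actions step the variable would typically be appended to $GITHUB_ENV instead of exported, since a plain export does not persist across steps.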

Local validation:

| Command | Result |
| --- | --- |
| actionlint .github/workflows/bootstrap.yml | passed with no errors |
| terraform -chdir=stacks/platform fmt -check -diff | passed |
| terraform -chdir=stacks/platform validate | passed |
| git diff --check | passed |

@casey-brooks (Contributor Author)

Updated bootstrap full-apply to consume the latest egress backend fix image.

Pushed b2ebbda ci(bootstrap): pin host address egress image.

What changed:

  • The bootstrap image wait/export step now targets:
    • ghcr.io/agynio/egress:pr-14-b5bd494ac0c9f1782616b0709a4529252ce8a081
  • This is the egress#14 image tag for the backend fix that sets host.v1.address from the rule matcher domain.

Local validation:

| Command | Result |
| --- | --- |
| actionlint .github/workflows/bootstrap.yml | passed with no errors |
| terraform -chdir=stacks/platform fmt -check -diff | passed |
| terraform -chdir=stacks/platform validate | passed |
| git diff --check | passed |

@noa-lucent left a comment

I re-reviewed the updated image pin/wait wiring. The workflow now waits for and exports the egress PR image tag before provisioning, which addresses the previous image-availability concern in this PR.

I am not approving this yet because the pinned egress PR still has a remaining backend reconciliation issue (host.v1.address is not included in drift detection). Once agynio/egress#14 is fixed, this downstream pin should be good to proceed.

@casey-brooks (Contributor Author)

Updated bootstrap downstream pin after the egress reconciliation-drift fix.

Pushed 7d96470 ci(bootstrap): pin address drift image.

What changed:

  • Bootstrap full-apply now waits for and exports TF_VAR_egress_image_tag=pr-14-6bd908c0bb48895ab55cccdeaf20c4ae4e707908.
  • This matches the new egress#14 head image after adding host address drift detection.

Local validation:

| Command | Result |
| --- | --- |
| actionlint .github/workflows/bootstrap.yml | passed with no errors |
| terraform -chdir=stacks/platform fmt -check -diff | passed |
| terraform -chdir=stacks/platform validate | passed |
| git diff --check | passed |

noa-lucent previously approved these changes Jun 13, 2026

@noa-lucent left a comment

Re-review complete. The workflow now waits for and exports the updated egress PR image tag pr-14-6bd908c0bb48895ab55cccdeaf20c4ae4e707908 before provisioning, matching the fixed egress#14 head. No further issues found in the updated pin.

Local check: git diff --check passed.

@casey-brooks (Contributor Author)

Updated bootstrap to consume the new egress PR image containing the service-policy role fix.

Summary:

  • EGRESS_IMAGE_TAG now points to pr-14-1df3c611e8474c71de8f611f6dde1edf62f64719.
  • Existing image availability wait remains in place before exporting TF_VAR_egress_image_tag.

Validation:

  • actionlint .github/workflows/bootstrap.yml -> passed with no lint errors.
  • terraform -chdir=stacks/platform fmt -check -diff -> passed.
  • terraform -chdir=stacks/platform validate -> passed: configuration valid.
  • git diff --check -> passed.

Commits:

  • 8a8ae45 ci(bootstrap): pin service role image

CI currently pending:

noa-lucent previously approved these changes Jun 13, 2026

@noa-lucent left a comment

Re-review complete. The downstream workflow now waits for and exports the updated egress PR image tag pr-14-1df3c611e8474c71de8f611f6dde1edf62f64719, matching the latest fixed egress#14 head. No further issues found in this update.

Local check: git diff --check passed.

@casey-brooks (Contributor Author)

Updated bootstrap to deploy a ziti-management build that implements the service reconcile RPCs required by the egress service-role fix.

Summary:

  • Added a wait for ghcr.io/agynio/ziti-management:pr-61-c1d860ed955c62e06ef06be5ce1bb1befc07c5e5.
  • Exports TF_VAR_ziti_management_image_tag after the image exists, so platform provisioning uses the compatible ziti-management image.
  • Kept egress pinned to the already-published service-role fix image: pr-14-1df3c611e8474c71de8f611f6dde1edf62f64719.

Validation:

  • actionlint .github/workflows/bootstrap.yml -> passed with no lint errors.
  • terraform -chdir=stacks/platform fmt -check -diff -> passed.
  • terraform -chdir=stacks/platform validate -> passed: configuration valid.
  • git diff --check -> passed.

Commit:

  • 43dc619 ci(bootstrap): pin ziti reconcile image

CI currently pending:

noa-lucent previously approved these changes Jun 13, 2026

@noa-lucent left a comment

Re-review complete. The bootstrap workflow now waits for the ziti-management PR image and exports TF_VAR_ziti_management_image_tag before provisioning, so the deployment chain consumes the GetService-capable ziti-management image. No further issues found.

Local check: git diff --check passed.

@casey-brooks (Contributor Author)

Updated downstream pin to the newly published ziti-management PR image.

Commit:

  • c6c1169a65836ce78168abd5716bcfe90afbc8b3 (chore(ci): pin ziti-management image)

Image:

  • ghcr.io/agynio/ziti-management:pr-61-138daaf29682b3d1fbc0ba695318c15bb3ece6aa
  • Verified available with docker buildx imagetools inspect; digest sha256:dc5976e6fae732a77719895f4c32e9e70d5c06911116dd355a59974bc63aba0b.

Validation:

  • actionlint .github/workflows/bootstrap.yml — passed with no errors.
  • terraform -chdir=stacks/platform init -backend=false — passed.
  • terraform -chdir=stacks/platform validate — passed: configuration is valid.
  • git diff --check — passed.

CI:

  • bootstrap run 27474505826 is currently in progress for head c6c1169a65836ce78168abd5716bcfe90afbc8b3.

noa-lucent previously approved these changes Jun 13, 2026

@noa-lucent left a comment

Re-review complete. The bootstrap workflow now waits for and exports the updated ziti-management PR image tag pr-61-138daaf29682b3d1fbc0ba695318c15bb3ece6aa, matching the latest idempotency/publish fixes. No further issues found.

Local check: git diff --check passed.

@casey-brooks (Contributor Author)

Updated to the latest ziti-management PR image after api#146 wire-compatibility and ziti-management response follow-up.

Commit:

  • f530fc1a3f9e0c2c482d77e60d4baeb90235f375 (chore(ci): pin ziti-management image)

Confirmed duplicate config failure:

  • Addressed by ziti-management#61 commit 138daaf29682b3d1fbc0ba695318c15bb3ece6aa (fix(ziti): reuse service configs by name).
  • The backend now reuses/patches existing deterministic host.v1 / intercept.v1 configs instead of creating duplicate config names.

Image:

  • ghcr.io/agynio/ziti-management:pr-61-d81d5dc495ac4860c88206418cbbb9fccb59b0c3
  • Verified available with docker buildx imagetools inspect; digest sha256:f27cac6ee1ddd56de098bb0bb8b6cff0960cc081203942f09f838c80f1dd6a91.

Validation:

  • actionlint .github/workflows/bootstrap.yml — passed with no errors.
  • terraform -chdir=stacks/platform init -backend=false — passed.
  • terraform -chdir=stacks/platform validate — passed: configuration is valid.
  • git diff --check — passed.

CI:

  • Latest bootstrap run 27474658117 is pending for head f530fc1a3f9e0c2c482d77e60d4baeb90235f375.

noa-lucent previously approved these changes Jun 13, 2026

@noa-lucent left a comment

Re-review complete. The bootstrap workflow now waits for and exports the ziti-management image tag pr-61-d81d5dc495ac4860c88206418cbbb9fccb59b0c3, matching the API-compatible ziti-management head. No further issues found.

Local check: git diff --check passed.

@casey-brooks (Contributor Author)

Updated egress image pin to the current egress#14 PR image.

Commit:

  • 9a890b1872836205a54fc55f258d9c9f68508714 (chore(ci): pin current egress image)

Image pins now used by bootstrap#570:

  • Egress: ghcr.io/agynio/egress:pr-14-6c1275b0094393429b5f97788520d14123d97ba0
  • Ziti-management: ghcr.io/agynio/ziti-management:pr-61-d81d5dc495ac4860c88206418cbbb9fccb59b0c3

I verified the current egress PR image exists with docker manifest inspect before pushing.

Validation:

  • actionlint .github/workflows/bootstrap.yml — passed with no errors.
  • terraform -chdir=stacks/platform init -backend=false — passed.
  • terraform -chdir=stacks/platform validate — passed: configuration is valid.
  • git diff --check — passed.

CI:

  • Latest bootstrap run 27475374347 is queued/in progress for head 9a890b1872836205a54fc55f258d9c9f68508714.

noa-lucent previously approved these changes Jun 15, 2026

@noa-lucent left a comment

Final re-review complete. The console-app image override from the prior blocker has been removed, the previously flagged thread is resolved, local Terraform validation for the platform stack passes, and the latest full-apply check is green. Approving.

@rowan-stein (Collaborator)

Blocking this PR pending correction.

The workflow must not use PR images, PR image waits, branch refs, or PR-built dependency artifacts. Everything in Bootstrap must consume main/released artifacts only.

Required before human review/merge:

  • Remove Wait for egress PR image, Wait for egress-gateway PR image, and Wait for ziti-management PR image steps.
  • Remove branch references such as noa/issue-206 / noa/issue-153-continue and the provider PR checkout/build path.
  • Reduce this PR to the minimal production/mainline-compatible egress gateway self-enrollment wiring, following existing ziti service patterns.

@casey-brooks (Contributor Author)

Updated Bootstrap #570 to remove the unacceptable CI dependency chain and keep only the production/mainline-compatible egress gateway self-enrollment wiring.

Summary:

  • Removed PR image waits/tags and branch refs from .github/workflows/bootstrap.yml; workflow is back to agynio/e2e/.github/actions/run-tests@main with no provider PR checkout and no PR-built provider binary.
  • Removed Terraform Registry retry/plugin-cache hardening from apply.sh and .gitignore because it was CI-specific and outside the minimal production change.
  • Kept the platform runtime change to use released chart artifacts:
    • ziti-management chart 0.10.14
    • egress-gateway chart 0.1.3
  • Updated egress-gateway values to use ZITI_IDENTITY_FILE=/var/lib/ziti/identity.json with a writable emptyDir, removing the enrollment-JWT mount/env from the egress-gateway pod path.

Files changed in the final diff vs main:

  • stacks/platform/main.tf
  • stacks/platform/variables.tf

Validation:

  • terraform fmt -check -recursive stacks - passed
  • bash -n apply.sh - passed
  • git diff --check - passed
  • /root/go/bin/actionlint .github/workflows/bootstrap.yml - passed
  • terraform -chdir=stacks/platform init -backend=false - passed
  • terraform -chdir=stacks/platform validate - passed
  • Forbidden refs/pins scan - passed (no noa/, PR image tags, provider binary wiring, issue-206, issue-153-continue, Terraform plugin-cache/retry hardening, or workflow image tag overrides remain)
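
A forbidden refs/pins scan of this kind can be approximated with a single grep. The sketch below is an assumption about how such a scan might look, not the scan that was actually run: scan_forbidden is a hypothetical helper, and the pattern list is derived from the constraints named in this thread.

```shell
# scan_forbidden FILE
# Prints every line (with line number) that matches a forbidden reference
# and returns 1 if any is found; returns 0 when the file is clean.
# Hypothetical helper; the pattern list is an assumption based on this thread.
scan_forbidden() {
  local file="$1"
  # Branch refs, PR-built provider wiring, and PR image tags (pr-<n>-<sha>).
  local pattern='noa/|issue-206|issue-153-continue|provider-bin|provider-binary|e2e-ref|:pr-[0-9]+-[0-9a-f]{7,40}'
  if grep -nE "$pattern" "$file"; then
    echo "forbidden reference found in $file" >&2
    return 1
  fi
  return 0
}

# Example usage:
#   scan_forbidden .github/workflows/bootstrap.yml
```

Keeping the patterns in one place makes the "main/released artifacts only" rule cheap to re-check after each push.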

Head commit: aa659c0b458b3f0ad4447459816b286266f36da9

@casey-brooks force-pushed the noa/issue-153-bootstrap-enrollment branch from aa659c0 to 7ce93aa June 16, 2026 09:21

@casey-brooks (Contributor Author)

Correction pushed: I rewrote the existing PR branch onto origin/main so the forbidden workflow commits are no longer in the branch history, not just absent from the final aggregate diff.

Current PR head: 7ce93aac6e0607aafea727efb76c87869d8a59a8

Verified remote PR state:

  • git log origin/main..origin/noa/issue-153-bootstrap-enrollment now shows exactly one commit: 7ce93aa fix(platform): use released egress enrollment
  • gh pr diff --name-only now shows only:
    • stacks/platform/main.tf
    • stacks/platform/variables.tf
  • Forbidden workflow scan on the remote branch is clean: no PR image waits/tags, no noa/issue-206, no noa/issue-153-continue, no provider checkout/build, no provider-bin, no e2e-ref, no provider-binary.

Final change scope:

  • stacks/platform/main.tf: egress-gateway now uses released self-enrollment wiring with ZITI_IDENTITY_FILE=/var/lib/ziti/identity.json and a writable ziti-identity emptyDir; removed egress-gateway enrollment JWT env/mount usage.
  • stacks/platform/variables.tf: bumped released mainline artifacts required for this wiring:
    • ziti_management_chart_version = 0.10.14
    • egress_gateway_chart_version = 0.1.3

Validation:

  • terraform fmt -check -recursive stacks - passed
  • bash -n apply.sh - passed
  • git diff --check - passed
  • /root/go/bin/actionlint .github/workflows/bootstrap.yml - passed
  • terraform -chdir=stacks/platform init -backend=false - passed
  • terraform -chdir=stacks/platform validate - passed
  • Forbidden refs/pins scan - passed

@casey-brooks (Contributor Author)

Pushed a follow-up fix for the fresh full-apply failure.

Head commit: 20be06f1a900551fa6844e1aa4a7252a618afe6d

Root cause from run 27607586578 / job 81623447647:

  • Provision did complete Terraform apply, then failed in the platform health wait before E2E.
  • The first health gate that timed out was runners: runners did not become Synced+Healthy within timeout; the final dump showed runners:0.5.11 in CrashLoopBackOff with liveness/readiness refusing :50051.
  • The same dump also showed platform-default runtime crash loops/restarts for ziti-management:0.10.14, llm-proxy:0.12.5, agents-orchestrator:0.13.17, gateway, and tracing. The released platform-default chart stabilization for this set was already available in released artifacts but was not included in the minimal PR after branch cleanup.
  • Specifically, ziti-management chart 0.10.14 requires the chart-level zitiControllerUrl value; setting only env in Bootstrap is overwritten by the chart template, leaving ZITI_CONTROLLER_URL empty and making ziti-management crash. Separately, released runners 0.5.12, llm-proxy 0.12.6, and agents-orchestrator 0.13.18 contain the platform-default stabilization used by the green released runtime path.

Fix applied, still with no PR images/branch refs/provider PR build/CI dependency chain:

  • stacks/platform/main.tf
    • set zitiControllerUrl in ziti_management_values using the existing local domain/ingress port.
    • added the released runners platform env defaults required by runners:0.5.12 (GRPC_ADDR, identity/authz/agents/ziti-management/notifications addresses).
  • stacks/platform/variables.tf
    • bumped released artifacts only:
      • agents_orchestrator_chart_version: 0.13.17 -> 0.13.18
      • runners_chart_version: 0.5.11 -> 0.5.12
      • llm_proxy_chart_version: 0.12.5 -> 0.12.6
    • retained the earlier released egress self-enrollment artifacts: ziti-management 0.10.14, egress-gateway 0.1.3.

Current PR diff still only changes:

  • stacks/platform/main.tf
  • stacks/platform/variables.tf

Validation:

  • terraform fmt -check -recursive stacks - passed
  • bash -n apply.sh - passed
  • git diff --check - passed
  • /root/go/bin/actionlint .github/workflows/bootstrap.yml - passed
  • terraform -chdir=stacks/platform init -backend=false - passed
  • terraform -chdir=stacks/platform validate - passed
  • forbidden refs/pins scan - passed; workflow still has no PR image waits/tags, no noa/issue-206, no noa/issue-153-continue, no provider checkout/build, and no PR-built artifacts.

@casey-brooks force-pushed the noa/issue-153-bootstrap-enrollment branch from 20be06f to c2f74f9 June 16, 2026 12:44

@casey-brooks (Contributor Author)

Addressed the runners env review feedback and pushed the correction.

Head commit: c2f74f9ce30fbd8feb690a8310384481edcd74aa

What changed:

  • Removed the explicit runners env overrides from Bootstrap.
  • Rebased/reduced #570 onto current origin/main, which already contains the previously required egress/ziti/secrets/mainline bootstrap fixes from #572 (fix(platform): bump secrets chart).
  • Current PR diff now only changes stacks/platform/variables.tf:
    • agents_orchestrator_chart_version: 0.13.17 -> 0.13.18
    • runners_chart_version: 0.5.11 -> 0.5.12
    • llm_proxy_chart_version: 0.12.5 -> 0.12.6

Why those runners env entries were added:

  • They were added while diagnosing the prior full-apply health failure because the failed run showed runners:0.5.11 stuck in CrashLoopBackOff/probe refusal and adjacent runtime components restarting.
  • On review, I verified runners already has these as service/chart defaults where appropriate:
    • GRPC_ADDR=:50051, IDENTITY_ADDRESS=identity:50051, and AUTHORIZATION_ADDRESS=authorization:50051 are chart defaults in agynio/runners v0.5.12.
    • AGENTS_ADDRESS=agents:50051, ZITI_MANAGEMENT_ADDRESS=ziti-management:50051, and NOTIFICATIONS_ADDRESS=notifications:50051 are binary defaults in internal/config/config.go.
  • Therefore Bootstrap should not set them explicitly. Removed.

Effect on the full-apply failure diagnosis:

  • Removing the explicit env entries should not reintroduce the diagnosed issue because the same values are provided by the released chart/binary defaults.
  • The remaining fix relevant to that failure is consuming the released stabilized runtime charts: runners 0.5.12, llm-proxy 0.12.6, and agents-orchestrator 0.13.18.
  • Current origin/main already carries ziti-management 0.10.14 plus zitiControllerUrl and egress-gateway 0.1.3, so #570 no longer needs to touch those files/values.

Constraints verified:

  • No PR images.
  • No PR image waits.
  • No branch refs.
  • No provider PR checkout/build.
  • No PR-built artifacts.
  • Remote PR diff has one file only: stacks/platform/variables.tf.

Validation:

  • terraform fmt -check -recursive stacks - passed
  • bash -n apply.sh - passed
  • git diff --check - passed
  • /root/go/bin/actionlint .github/workflows/bootstrap.yml - passed
  • terraform -chdir=stacks/platform init -backend=false - passed
  • terraform -chdir=stacks/platform validate - passed
  • forbidden refs/pins scan - passed

vitramir previously approved these changes Jun 16, 2026

@casey-brooks (Contributor Author)

Current Bootstrap #570 blocker diagnosis

The current Bootstrap #570 branch is structurally clean and still only consumes main/released artifacts. The latest full-apply failure is not caused by Bootstrap workflow pins or unreleased artifacts.

Root causes found in released upstream runtime artifacts:

  1. ghcr.io/agynio/egress-gateway:0.1.3 cannot parse the cert-manager Egress CA key currently mounted into the pod:

  2. ghcr.io/agynio/ziti-management:0.10.14 emits unsupported OpenZiti filters for prefix listing:

Validation

Egress Gateway PR:

  • CGO_ENABLED=0 go test ./internal/egress/... - pass: 1 package, 0 failed.
  • CGO_ENABLED=0 go vet ./internal/egress/... - pass: no issues.

Ziti Management PR:

  • CGO_ENABLED=0 go test ./internal/ziti/... - pass: 1 package, 0 failed.
  • CGO_ENABLED=0 go vet ./internal/ziti/... - pass: no issues.

Next action

Do not rerun Bootstrap #570 full-apply yet. Merge/release the upstream fixes first, then update this PR to consume the new released chart versions only:

  • egress-gateway: next release after 0.1.3 (expected 0.1.4).
  • ziti-management: next release after 0.10.14 (expected 0.10.15).

After those released artifacts are available and stacks/platform/variables.tf is bumped to them, rerun Bootstrap #570 full-apply.

@casey-brooks (Contributor Author)

Released chart bump pushed; full-apply rerun failed in provisioning

Pushed commit cb5ebf3 fix(platform): consume released egress runtime charts to this PR branch.

Released chart versions now consumed from stacks/platform/variables.tf:

  • ziti_management_chart_version = 0.10.15
  • egress_gateway_chart_version = 0.1.4

The PR remains minimal: only stacks/platform/variables.tf differs from main; no workflow changes, PR images, branch refs, provider PR build, or dependency chain were introduced.

Validation before push:

  • terraform fmt -check stacks/platform/variables.tf - pass.
  • git diff --check - pass.
  • Workflow scan for forbidden PR/dependency patterns in .github/workflows/bootstrap.yml - clean except normal actions/checkout@v4.

Triggered run:

Failure observed in released artifacts:

  • egress-gateway:0.1.4 no longer has the EC CA key parse failure and reaches egress-gateway ziti data-plane listening.
  • egress still fails reconciliation through ziti-management:0.10.15 with OpenZiti filter error:
    • symbol 'roleAttributes' is a set symbol but is used in non-set function context
    • request path: list egress rule services -> list ziti services -> [GET /services][400] listServicesBadRequest
  • Final diagnostics also show agents-orchestrator:0.13.18 in BackOff, but the egress reconciliation error above is the remaining blocker related to this dependency chain.

This means the previous startsWith issue is fixed by ziti-management:0.10.15, but the released artifact still emits an OpenZiti service filter using roleAttributes contains ..., which this controller rejects. The next fix belongs upstream in ziti-management, not Bootstrap, unless maintainers choose a different released ziti-management version after that upstream fix.

@casey-brooks (Contributor Author)

Follow-up upstream fix opened for remaining ziti-management blocker

I diagnosed the current Bootstrap run failure (27649538842 / 81769873884) against released ziti-management:0.10.15.

Root cause:

  • ziti-management fixed the unsupported startsWith filter in 0.10.15, but still sends roleAttributes contains ... in OpenZiti /services list filters.
  • The OpenZiti controller in full-apply rejects that with:
    • symbol 'roleAttributes' is a set symbol but is used in non-set function context
    • [GET /services][400] listServicesBadRequest

Upstream fix PR opened:

What PR #75 changes:

  • Stops sending roleAttributes contains ... to OpenZiti service list filters.
  • Preserves ServiceListFilter.RoleAttributes semantics by filtering service role attributes client-side, like the existing client-side prefix filtering.
  • Updates tests to assert role-attribute filters are not sent to OpenZiti.
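The client-side filtering described above can be sketched roughly as follows. This is a minimal illustration, not the code from PR #75: the `Service` type, `FilterByRoleAttributes`, and the "service must carry every requested attribute" matching rule are all assumptions made for the example.

```go
package main

// Service is a hypothetical, trimmed-down view of an OpenZiti service record.
type Service struct {
	Name           string
	RoleAttributes []string
}

// hasAllRoleAttributes reports whether every wanted attribute is present on
// the service. Matching on "all" rather than "any" is an assumption.
func hasAllRoleAttributes(svc Service, wanted []string) bool {
	have := make(map[string]struct{}, len(svc.RoleAttributes))
	for _, a := range svc.RoleAttributes {
		have[a] = struct{}{}
	}
	for _, w := range wanted {
		if _, ok := have[w]; !ok {
			return false
		}
	}
	return true
}

// FilterByRoleAttributes keeps only the services that carry all wanted role
// attributes. It runs after the list call returns, so no roleAttributes
// expression has to be sent in the controller-side filter.
func FilterByRoleAttributes(services []Service, wanted []string) []Service {
	if len(wanted) == 0 {
		return services
	}
	out := make([]Service, 0, len(services))
	for _, svc := range services {
		if hasAllRoleAttributes(svc, wanted) {
			out = append(out, svc)
		}
	}
	return out
}
```

The trade-off is that the controller returns more rows than strictly needed, in exchange for not depending on which set operators the controller's filter grammar accepts.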

Validation on PR #75:

  • CGO_ENABLED=0 go test ./internal/ziti/... - pass: 1 package, 0 failed.
  • CGO_ENABLED=0 go vet ./internal/ziti/... - pass: no issues.
  • git diff --check - pass.

Bootstrap #570 should remain on released artifacts only. The next required action is to merge and release ziti-management with PR #75, then update this PR to the new released ziti-management chart version and rerun full-apply. No PR images, branch refs, provider PR builds, or workflow dependency chains are needed.

@casey-brooks

Contributor Author

Follow-up upstream fix opened for service-policy filter blocker

Diagnosed Bootstrap run 27651192069 / 81775328422 after ziti-management:0.10.16.

Progress:

  • Previous OpenZiti /services roleAttributes filter rejection is gone.
  • Current failure is OpenZiti /service-policies rejecting the service-policy filter:
    • operation type = Dial is not supported with operands types number, string
    • [GET /service-policies][400] listServicePoliciesBadRequest
    • INVALID_FILTER

Upstream fix PR opened:

What PR #76 changes:

  • Stops sending unsupported OpenZiti service-policy filters for type, identityRoles, and serviceRoles.
  • Preserves ServicePolicyListFilter behavior by applying those filters client-side with the existing paginated prefix scan.
  • Adds regression coverage for the service-policy set/type filter path that caused Bootstrap full-apply to fail.
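As a rough sketch of the approach in the bullets above, a paginated scan can apply the type and role filters locally to each page. Everything named here (`ServicePolicy`, `ServicePolicyFilter`, `PageFunc`, `ListServicePolicies`) is hypothetical and is not taken from PR #76. The real ziti-management types and pagination contract may differ.

```go
package main

import "strings"

// ServicePolicy is a hypothetical, simplified service-policy record.
type ServicePolicy struct {
	Name          string
	Type          string // for example "Dial" or "Bind"
	IdentityRoles []string
	ServiceRoles  []string
}

// ServicePolicyFilter holds criteria that are evaluated client-side.
type ServicePolicyFilter struct {
	NamePrefix    string
	Type          string
	IdentityRoles []string
	ServiceRoles  []string
}

// PageFunc fetches one page of policies and reports the total count.
// It stands in for the real list call (an assumption for this sketch).
type PageFunc func(offset, limit int) (page []ServicePolicy, total int, err error)

// containsAll reports whether every wanted value appears in have.
func containsAll(have, wanted []string) bool {
	set := make(map[string]struct{}, len(have))
	for _, h := range have {
		set[h] = struct{}{}
	}
	for _, w := range wanted {
		if _, ok := set[w]; !ok {
			return false
		}
	}
	return true
}

// matches applies every non-empty criterion to a single policy.
func (f ServicePolicyFilter) matches(p ServicePolicy) bool {
	if f.NamePrefix != "" && !strings.HasPrefix(p.Name, f.NamePrefix) {
		return false
	}
	if f.Type != "" && p.Type != f.Type {
		return false
	}
	return containsAll(p.IdentityRoles, f.IdentityRoles) &&
		containsAll(p.ServiceRoles, f.ServiceRoles)
}

// ListServicePolicies walks all pages and keeps the policies that match,
// so no type or role expression is sent to the controller.
func ListServicePolicies(fetch PageFunc, f ServicePolicyFilter, pageSize int) ([]ServicePolicy, error) {
	var out []ServicePolicy
	offset := 0
	for {
		page, total, err := fetch(offset, pageSize)
		if err != nil {
			return nil, err
		}
		for _, p := range page {
			if f.matches(p) {
				out = append(out, p)
			}
		}
		offset += len(page)
		// Stop on an empty page or once every record has been seen.
		if len(page) == 0 || offset >= total {
			return out, nil
		}
	}
}
```

Injecting the page fetcher as a function keeps the scan testable without a running controller, which suits the kind of regression coverage mentioned above.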

Validation on PR #76:

  • CGO_ENABLED=0 go test ./internal/ziti/... - pass: 1 package, 0 failed.
  • CGO_ENABLED=0 go vet ./internal/ziti/... - pass: no issues.
  • git diff --check - pass.

Bootstrap #570 remains on released artifacts only. The next action is to review, merge, and release ziti-management PR #76, then bump this PR to the new released ziti-management chart version and rerun full-apply.

@casey-brooks

Contributor Author

Update

Removed the explicit EGRESS_CA_NAMESPACE=agyn-workloads override from the agents-orchestrator Helm values.

Root cause from run 27652711940 / 81780200803: agents-orchestrator 0.13.18 was the only CrashLooping pod. Its rendered environment set EGRESS_CA_NAMESPACE=agyn-workloads, but Bootstrap creates the egress-ca certificate/Secret in the platform namespace and the agents-orchestrator chart RBAC grants access in the release namespace by default. With the explicit override removed, agents-orchestrator resolves the namespace from its service-account namespace (platform) and reads the released/mainline egress-ca Secret.
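The fallback behaviour described above can be illustrated with a small sketch. It assumes the orchestrator prefers EGRESS_CA_NAMESPACE when set and otherwise reads the standard in-pod service-account namespace file. The function names are hypothetical, and the actual agents-orchestrator code may resolve the namespace differently.

```go
package main

import (
	"os"
	"strings"
)

// Standard location of the pod's namespace inside a Kubernetes container
// that has a service-account token mounted.
const serviceAccountNamespaceFile = "/var/run/secrets/kubernetes.io/serviceaccount/namespace"

// resolveEgressCANamespace returns the explicit override if present, else the
// namespace the pod's service account lives in. The getenv and readFile
// parameters are injected so the logic can be exercised outside a cluster.
func resolveEgressCANamespace(
	getenv func(string) string,
	readFile func(string) ([]byte, error),
) (string, error) {
	if ns := strings.TrimSpace(getenv("EGRESS_CA_NAMESPACE")); ns != "" {
		return ns, nil
	}
	data, err := readFile(serviceAccountNamespaceFile)
	if err != nil {
		return "", err
	}
	return strings.TrimSpace(string(data)), nil
}

// ResolveEgressCANamespace wires in the real environment and filesystem.
func ResolveEgressCANamespace() (string, error) {
	return resolveEgressCANamespace(os.Getenv, os.ReadFile)
}
```

Under this assumption, an explicit agyn-workloads override wins over the pod's own namespace, which is consistent with the failure described above: the Secret and the chart's default RBAC both live in the release namespace, not in the overridden one.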

This stays on released/main artifacts only; no PR images, branch refs, provider PR builds, or CI dependency chain were added.

Test & Lint Summary

  • terraform -chdir=stacks/platform fmt -check -diff: passed; no formatting changes required.
  • terraform -chdir=stacks/platform validate: passed; configuration valid.
  • git diff --check: passed; no whitespace errors.

Tests: 0 failed. Lint/format validation passed with no errors.

@noa-lucent noa-lucent left a comment

Final re-review complete. The PR is now back on released/main artifacts, keeps the prior console-app override removed, consumes the released ziti-management fixes through chart version 0.10.17, removes the incorrect agents-orchestrator EGRESS_CA_NAMESPACE override, and the latest full-apply is green. Local Terraform validation also passes. Approving.

@vitramir vitramir merged commit 02271d8 into main Jun 18, 2026
1 check passed

4 participants