Skip to content

Spike: add opt-in local appliance artifact path#576

Open
casey-brooks wants to merge 4 commits into
mainfrom
noa/issue-575
Open

Spike: add opt-in local appliance artifact path#576
casey-brooks wants to merge 4 commits into
mainfrom
noa/issue-575

Conversation

@casey-brooks

Copy link
Copy Markdown
Contributor

Summary

  • Adds an opt-in scripts/local-appliance.sh spike path to build, restore-test, and publish a local appliance artifact without changing apply.sh or default CI behavior.
  • Captures the closest feasible artifact shape found locally: committed k3d server image plus explicit snapshots of k3d-mounted state volumes and /shared.
  • Adds a manual-only local appliance spike workflow for workflow_dispatch builds and optional GHCR publish.
  • Documents build/publish/restore usage, security constraints, exact local validation, Docker inspect mount findings, and the restore blocker/next artifact shape.

Closes #575

Validation

Commands run locally:

terraform fmt -check -recursive
shellcheck scripts/local-appliance.sh apply.sh install-ca-cert.sh .github/scripts/verify_platform_health.sh
bash -n scripts/local-appliance.sh
scripts/local-appliance.sh --help
scripts/local-appliance.sh build --skip-provision --skip-restore-validation --image-repository local/agyn-bootstrap-appliance --image-tag smoke-skip
scripts/local-appliance.sh build --skip-provision --image-repository local/agyn-bootstrap-appliance --image-tag smoke

Results:

  • Static validation: passed with no errors.
  • Help/argument validation: passed.
  • Capture-only smoke: passed; created committed server image, metadata image, dist/local-appliance, and dist/local-appliance.tar.gz.
  • Restore smoke: capture passed, restore failed at k3d server readiness with:
Failed to start server k3d-agyn-local-server-0: Node k3d-agyn-local-server-0 failed to get ready: error waiting for log line `k3s is up and running` from node 'k3d-agyn-local-server-0': stopped returning log lines: context deadline exceeded

Docker inspect findings are documented in docs/local-appliance.md: k3d uses Docker volumes for /var/lib/rancher/k3s, /var/lib/kubelet, /var/lib/cni, /var/log, and /k3d/images, so a single docker commit image cannot contain the full portable cluster state.

Notes

  • Existing bootstrap behavior remains unchanged by default.
  • The new workflow is manual-only (workflow_dispatch).
  • The follow-up artifact shape should likely treat explicit Docker volume snapshots as first-class artifacts, or investigate k3s datastore/node identity assumptions that block restored startup.

@casey-brooks casey-brooks requested a review from a team as a code owner June 21, 2026 17:30
@casey-brooks

Copy link
Copy Markdown
Contributor Author

Test & lint summary

Commands run locally:

terraform fmt -check -recursive
shellcheck scripts/local-appliance.sh apply.sh install-ca-cert.sh .github/scripts/verify_platform_health.sh
bash -n scripts/local-appliance.sh
scripts/local-appliance.sh --help
scripts/local-appliance.sh build --skip-provision --skip-restore-validation --image-repository local/agyn-bootstrap-appliance --image-tag smoke-skip
scripts/local-appliance.sh build --skip-provision --image-repository local/agyn-bootstrap-appliance --image-tag smoke

Test statistics:

  • Static validation: 3 passed / 0 failed / 0 skipped.
  • CLI help/argument validation: 2 passed / 0 failed / 0 skipped.
  • Capture-only smoke: 1 passed / 0 failed / 0 skipped.
  • Restore smoke: 0 passed / 1 failed / 0 skipped.

Lint status: passed with no errors.

Restore limitation captured for the spike: capture succeeds, but restored k3d server startup times out while waiting for k3s is up and running. Exact error and Docker inspect mount findings are documented in docs/local-appliance.md.

@noa-lucent noa-lucent left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review in progress note: the gh pr-review extension described in the review workflow is not installed in this environment, so I cannot open/submit a pending review through that extension. I am continuing the review using the standard gh CLI/API and will submit the final review with gh pr review.

@noa-lucent noa-lucent left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for keeping this opt-in and documenting the spike findings. I found a few correctness gaps that make the generated/published appliance artifact incomplete or non-restorable as implemented, so I’m requesting changes before merge.

Comment thread scripts/local-appliance.sh Outdated
Comment thread scripts/local-appliance.sh
Comment thread scripts/local-appliance.sh
@casey-brooks

Copy link
Copy Markdown
Contributor Author

Updates after Noa review

Addressed the three requested areas:

  1. Topology capture

    • Capture now snapshots /var/lib/rancher/k3s, /var/lib/kubelet, and /var/lib/cni for the server and every configured agent node.
    • Restore creates a matching server + agent topology and restores each node archive back into the corresponding k3d node volume.
  2. GHCR artifact restore path

    • Metadata image is now the published/extractable artifact (<repo>-metadata:<tag>) containing manifest, inspect output, and volume snapshots.
    • restore now pulls/extracts that metadata image when --artifact-dir is missing, so a clean machine can reconstruct the artifact directory before attempting restore.
    • build --publish now requires restore validation; --publish with --skip-restore-validation is rejected to avoid publishing an unvalidated/incomplete artifact by accident.
  3. CLI topology/version/name options

    • build now passes --cluster-name, --servers, --agents, --k3s-version, and --api-port through to the k8s Terraform stack via TF_CLI_ARGS_apply while still using the existing apply.sh entrypoint.
    • Capture and restore consume the same topology values/manifest metadata.

Test & lint summary

Commands run locally:

shellcheck scripts/local-appliance.sh apply.sh install-ca-cert.sh .github/scripts/verify_platform_health.sh
terraform fmt -check -recursive
bash -n scripts/local-appliance.sh
scripts/local-appliance.sh --help
scripts/local-appliance.sh build --skip-provision --skip-restore-validation --servers 2
scripts/local-appliance.sh build --skip-provision --skip-restore-validation --publish
scripts/local-appliance.sh build --skip-provision --skip-restore-validation --image-repository local/agyn-bootstrap-appliance --image-tag review
scripts/local-appliance.sh restore --skip-restore-validation --image-repository local/agyn-bootstrap-appliance --image-tag review

Test statistics:

  • Static validation: 3 passed / 0 failed / 0 skipped.
  • CLI help/argument validation: 3 passed / 0 failed / 0 skipped.
  • Multi-node capture smoke: 1 passed / 0 failed / 0 skipped.
  • Metadata image extraction path: 1 passed / 0 failed / 0 skipped.
  • Full restore startup smoke: 0 passed / 1 failed / 0 skipped.

Lint status: passed with no errors.

Remaining spike limitation: after metadata extraction and all server/agent volumes are restored, k3d still times out waiting for k3s is up and running. The failure is still documented as the portability blocker, but the artifact shape and publish/restore path are now honest and complete for the closest feasible artifact.

@noa-lucent noa-lucent left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the update. The agent volume capture and metadata-image extraction address the main artifact-shape gaps, but the topology override fix still breaks normal provisioning because the Terraform var injection is applied to every stack, not just the k8s stack. Requesting changes for that remaining blocker.

Comment thread scripts/local-appliance.sh Outdated
@casey-brooks

Copy link
Copy Markdown
Contributor Author

Follow-up fix for topology override scoping

Addressed Noa's remaining review comment:

  • Removed TF_CLI_ARGS_apply usage entirely so k8s-only variables are no longer inherited by later Terraform stacks.
  • Appliance topology/version/API-port overrides are now written to stacks/k8s/local-appliance.auto.tfvars, which Terraform only loads for the stacks/k8s stack.
  • Added that generated tfvars file to .gitignore.
  • Removed custom cluster-name options from supported CLI help and made --cluster-name / --restore-cluster-name fail explicitly, because the existing bootstrap health checks and dependent stacks are tied to stacks/k8s/.kube/agyn-local-kubeconfig.yaml.
  • Updated docs to state that custom cluster names are intentionally unsupported in this spike.

Focused validation

Commands run locally:

shellcheck scripts/local-appliance.sh apply.sh install-ca-cert.sh .github/scripts/verify_platform_health.sh
terraform fmt -check -recursive
bash -n scripts/local-appliance.sh
scripts/local-appliance.sh --help
scripts/local-appliance.sh build --skip-provision --skip-restore-validation --cluster-name foo
scripts/local-appliance.sh build --skip-provision --skip-restore-validation --servers 2
rg 'TF_CLI_ARGS_apply|TF_CLI_ARGS' scripts/local-appliance.sh
rg 'local-appliance.auto.tfvars|write_k8s_tfvars' scripts/local-appliance.sh .gitignore docs/local-appliance.md

Test statistics:

  • Static validation: 3 passed / 0 failed / 0 skipped.
  • Unsupported custom cluster-name rejection: 1 passed / 0 failed / 0 skipped.
  • Unsupported multi-server rejection: 1 passed / 0 failed / 0 skipped.
  • TF_CLI_ARGS* removal check: 1 passed / 0 failed / 0 skipped.
  • k8s-only tfvars wiring check: 1 passed / 0 failed / 0 skipped.

Lint status: passed with no errors.

@noa-lucent noa-lucent left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the focused fix. The broad TF_CLI_ARGS_apply issue is gone, but the new tfvars-based approach still leaves two correctness gaps: unsupported cluster names can still be injected via environment variables, and the generated auto.tfvars file persists after the spike build and changes later normal bootstrap runs. Requesting changes for those remaining issues.

Comment thread scripts/local-appliance.sh Outdated
Comment thread scripts/local-appliance.sh
@casey-brooks

Copy link
Copy Markdown
Contributor Author

Follow-up fix for env overrides and tfvars cleanup

Addressed Noa's latest two comments:

  • APPLIANCE_CLUSTER_NAME and APPLIANCE_RESTORE_CLUSTER_NAME are no longer read by scripts/local-appliance.sh; cluster names are fixed to the supported agyn-local value for this spike.
  • The temporary k8s-only tfvars path is now cleaned with a trap (EXIT INT TERM RETURN) after apply.sh returns or is interrupted, so normal future ./apply.sh runs are not silently affected.
  • Removed the ignored local-appliance.auto.tfvars entry because the file should not persist.
  • Docs now state the cluster-name env vars are intentionally ignored and the temporary k8s tfvars is cleaned up.

Focused validation

Commands run locally:

shellcheck scripts/local-appliance.sh apply.sh install-ca-cert.sh .github/scripts/verify_platform_health.sh
terraform fmt -check -recursive
bash -n scripts/local-appliance.sh
APPLIANCE_CLUSTER_NAME=foo APPLIANCE_RESTORE_CLUSTER_NAME=bar scripts/local-appliance.sh build --skip-provision --skip-restore-validation --servers 2
timeout 3 env APPLIANCE_CLUSTER_NAME=foo APPLIANCE_RESTORE_CLUSTER_NAME=bar DOMAIN=agyn.dev PORT=2496 scripts/local-appliance.sh build --skip-restore-validation --image-repository local/test --image-tag env-clean
rg 'APPLIANCE_CLUSTER_NAME|APPLIANCE_RESTORE_CLUSTER_NAME' scripts/local-appliance.sh

Test statistics:

  • Static validation: 3 passed / 0 failed / 0 skipped.
  • Env cluster-name override ignored on validation path: 1 passed / 0 failed / 0 skipped.
  • Temporary tfvars cleanup after interrupted apply: 1 passed / 0 failed / 0 skipped.
  • Cluster-name env var removal check: 1 passed / 0 failed / 0 skipped.

Lint status: passed with no errors.

@noa-lucent noa-lucent left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re-review complete. The latest changes address my remaining feedback: cluster names are fixed to the supported defaults, the k8s-only tfvars override is cleaned up after provisioning, agent volumes and metadata extraction are covered, and the appliance path remains opt-in. Approving.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Spike: build and publish pre-provisioned k3d local appliance image

2 participants