Spike: add opt-in local appliance artifact path #576
Test & lint summary
Commands run locally:
terraform fmt -check -recursive
shellcheck scripts/local-appliance.sh apply.sh install-ca-cert.sh .github/scripts/verify_platform_health.sh
bash -n scripts/local-appliance.sh
scripts/local-appliance.sh --help
scripts/local-appliance.sh build --skip-provision --skip-restore-validation --image-repository local/agyn-bootstrap-appliance --image-tag smoke-skip
scripts/local-appliance.sh build --skip-provision --image-repository local/agyn-bootstrap-appliance --image-tag smoke
Test statistics:
Lint status: passed with no errors. Restore limitation captured for the spike: capture succeeds, but restored k3d server startup times out while waiting for |
noa-lucent
left a comment
Review in progress note: the gh pr-review extension described in the review workflow is not installed in this environment, so I cannot open/submit a pending review through that extension. I am continuing the review using the standard gh CLI/API and will submit the final review with gh pr review.
noa-lucent
left a comment
Thanks for keeping this opt-in and documenting the spike findings. I found a few correctness gaps that make the generated/published appliance artifact incomplete or non-restorable as implemented, so I’m requesting changes before merge.
Updates after Noa review
Addressed the three requested areas:
Test & lint summary
Commands run locally:
shellcheck scripts/local-appliance.sh apply.sh install-ca-cert.sh .github/scripts/verify_platform_health.sh
terraform fmt -check -recursive
bash -n scripts/local-appliance.sh
scripts/local-appliance.sh --help
scripts/local-appliance.sh build --skip-provision --skip-restore-validation --servers 2
scripts/local-appliance.sh build --skip-provision --skip-restore-validation --publish
scripts/local-appliance.sh build --skip-provision --skip-restore-validation --image-repository local/agyn-bootstrap-appliance --image-tag review
scripts/local-appliance.sh restore --skip-restore-validation --image-repository local/agyn-bootstrap-appliance --image-tag review
Test statistics:
Lint status: passed with no errors. Remaining spike limitation: after metadata extraction and all server/agent volumes are restored, k3d still times out waiting for |
noa-lucent
left a comment
Thanks for the update. The agent volume capture and metadata-image extraction address the main artifact-shape gaps, but the topology override fix still breaks normal provisioning because the Terraform var injection is applied to every stack, not just the k8s stack. Requesting changes for that remaining blocker.
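To illustrate the scoping concern: an exported TF_CLI_ARGS_apply is picked up by every terraform apply that inherits the environment, whereas a *.auto.tfvars file is only loaded from the directory Terraform runs in. The following is a minimal sketch of the file-based approach, not the script's actual implementation. The function name write_k8s_tfvars and the file name local-appliance.auto.tfvars appear in the validation commands later in this thread; the directory layout (one directory per stack, with a k8s subdirectory) and the variable name server_count are assumptions made for illustration.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Write the topology override into the k8s stack directory only.
# Assumed layout: <stacks_dir>/k8s, <stacks_dir>/<other stacks>...
# "server_count" is a hypothetical Terraform variable name.
write_k8s_tfvars() {
  local stacks_dir="$1" servers="$2"
  local tfvars="${stacks_dir}/k8s/local-appliance.auto.tfvars"
  printf 'server_count = %s\n' "$servers" > "$tfvars"
  # Print the path so the caller can clean the file up later.
  printf '%s\n' "$tfvars"
}
```

Because no environment variable is exported, other stacks applied in the same shell session see no extra arguments.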
Follow-up fix for topology override scoping
Addressed Noa's remaining review comment:
Focused validation
Commands run locally:
shellcheck scripts/local-appliance.sh apply.sh install-ca-cert.sh .github/scripts/verify_platform_health.sh
terraform fmt -check -recursive
bash -n scripts/local-appliance.sh
scripts/local-appliance.sh --help
scripts/local-appliance.sh build --skip-provision --skip-restore-validation --cluster-name foo
scripts/local-appliance.sh build --skip-provision --skip-restore-validation --servers 2
rg 'TF_CLI_ARGS_apply|TF_CLI_ARGS' scripts/local-appliance.sh
rg 'local-appliance.auto.tfvars|write_k8s_tfvars' scripts/local-appliance.sh .gitignore docs/local-appliance.md
Test statistics:
Lint status: passed with no errors.
noa-lucent
left a comment
Thanks for the focused fix. The broad TF_CLI_ARGS_apply issue is gone, but the new tfvars-based approach still leaves two correctness gaps: unsupported cluster names can still be injected via environment variables, and the generated auto.tfvars file persists after the spike build and changes later normal bootstrap runs. Requesting changes for those remaining issues.
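Both gaps can be sketched in a few lines of shell. This is one possible shape, not the fix that landed in the PR (which may simply ignore the environment variables rather than reject them). APPLIANCE_CLUSTER_NAME is taken from the validation commands in this thread; the default name appliance-default, the helper names check_cluster_name and with_temp_tfvars, and the server_count variable are hypothetical.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Assumed default; the real supported name is not stated in this thread.
SUPPORTED_CLUSTER_NAME="appliance-default"

# Fail fast when an environment override requests an unsupported name.
check_cluster_name() {
  local requested="${APPLIANCE_CLUSTER_NAME:-$SUPPORTED_CLUSTER_NAME}"
  if [[ "$requested" != "$SUPPORTED_CLUSTER_NAME" ]]; then
    echo "unsupported cluster name: $requested" >&2
    return 1
  fi
}

# Run a command with a temporary auto.tfvars file in place, then remove
# the file whether the command succeeded or failed, so later normal
# bootstrap runs do not pick it up.
with_temp_tfvars() {
  local tfvars="$1"
  shift
  printf 'server_count = 1\n' > "$tfvars"
  local status=0
  "$@" || status=$?
  rm -f "$tfvars"
  return "$status"
}
```

A caller would wrap the provisioning step, for example with_temp_tfvars path/to/k8s/local-appliance.auto.tfvars followed by the provisioning command, so the override never outlives that one invocation.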
Follow-up fix for env overrides and tfvars cleanup
Addressed Noa's latest two comments:
Focused validation
Commands run locally:
shellcheck scripts/local-appliance.sh apply.sh install-ca-cert.sh .github/scripts/verify_platform_health.sh
terraform fmt -check -recursive
bash -n scripts/local-appliance.sh
APPLIANCE_CLUSTER_NAME=foo APPLIANCE_RESTORE_CLUSTER_NAME=bar scripts/local-appliance.sh build --skip-provision --skip-restore-validation --servers 2
timeout 3 env APPLIANCE_CLUSTER_NAME=foo APPLIANCE_RESTORE_CLUSTER_NAME=bar DOMAIN=agyn.dev PORT=2496 scripts/local-appliance.sh build --skip-restore-validation --image-repository local/test --image-tag env-clean
rg 'APPLIANCE_CLUSTER_NAME|APPLIANCE_RESTORE_CLUSTER_NAME' scripts/local-appliance.sh
Test statistics:
Lint status: passed with no errors.
noa-lucent
left a comment
Re-review complete. The latest changes address my remaining feedback: cluster names are fixed to the supported defaults, the k8s-only tfvars override is cleaned up after provisioning, agent volumes and metadata extraction are covered, and the appliance path remains opt-in. Approving.
Summary
scripts/local-appliance.sh spike path to build, restore-test, and publish a local appliance artifact without changing apply.sh or default CI behavior.
/shared.
local appliance spike workflow for workflow_dispatch builds and optional GHCR publish.
Closes #575
Validation
Commands run locally:
Results:
dist/local-appliance, and dist/local-appliance.tar.gz.
Docker inspect findings are documented in docs/local-appliance.md: k3d uses Docker volumes for /var/lib/rancher/k3s, /var/lib/kubelet, /var/lib/cni, /var/log, and /k3d/images, so a single docker commit image cannot contain the full portable cluster state.
Notes
workflow_dispatch).
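The Docker volume finding in the results above is the reason a committed image is not enough: docker commit does not include data stored in mounted volumes, so each volume has to be archived separately. A common generic pattern is to mount the volume read-only into a throwaway container and tar its contents. The sketch below shows that pattern only and is not taken from scripts/local-appliance.sh; the function name archive_volume, the alpine helper image, and the /volume and /backup mount points are assumptions.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Allow the docker binary to be swapped out (for example for a dry run).
DOCKER="${DOCKER:-docker}"

# Archive one named Docker volume into <out_dir>/<volume>.tar.gz.
# out_dir must be an absolute path so Docker treats it as a bind mount.
archive_volume() {
  local volume="$1" out_dir="$2"
  "$DOCKER" run --rm \
    -v "${volume}:/volume:ro" \
    -v "${out_dir}:/backup" \
    alpine tar -czf "/backup/${volume}.tar.gz" -C /volume .
}
```

Setting DOCKER=echo before calling archive_volume prints the arguments that would be passed instead of running a container, which is useful for checking the mounts before touching a real cluster.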