diff --git a/.gitignore b/.gitignore
index 85e6d1d..c3c804d 100644
--- a/.gitignore
+++ b/.gitignore
@@ -177,6 +177,11 @@ media/csv_files/*
 media/text_files/*
 env
 static/
+
+# Helm packaged sub-chart tarballs (regenerated by `helm dependency update`)
+helm/charts/*.tgz
+# Helm values backup files (created by some local tooling)
+helm/values.yaml.backup.*
 helm/values.local.yaml
 
 # Local copy of debug toolkit for testing
diff --git a/helm/README.md b/helm/README.md
new file mode 100644
index 0000000..b5b61cc
--- /dev/null
+++ b/helm/README.md
@@ -0,0 +1,190 @@
+# DRD VPC Agent — Helm Chart
+
+This chart deploys the Doctor Droid VPC Agent (celery-beat + celery-worker + redis) and an optional restart CronJob into your cluster.
+
+## Prerequisites
+
+- `kubectl` and `helm` v3+ installed locally
+- Cluster access with permission to create the namespace, ServiceAccounts, ClusterRole, ClusterRoleBinding, Deployments, and (optionally) a CronJob
+- A `DRD_CLOUD_API_TOKEN` from
+- A populated `credentials/secrets.yaml` describing the connectors you want the agent to poll (see `credentials/credentials_template.yaml`)
+
+## Quick start
+
+From the repository root:
+
+```bash
+# 1. Create the namespace
+kubectl create namespace drdroid
+
+# 2. Apply your connector credentials (edit credentials/secrets.yaml first)
+kubectl -n drdroid apply -f helm/credentials-secret.yaml
+
+# 3. Install the chart
+helm dependency update helm/
+helm upgrade --install drd-vpc-agent helm/ \
+  -n drdroid \
+  --set global.DRD_CLOUD_API_TOKEN=
+```
+
+Verify:
+
+```bash
+kubectl -n drdroid get pods
+# drd-vpc-agent-celery-beat-… 1/1 Running
+# drd-vpc-agent-celery-worker-… 3/3 Running
+# redis-… 1/1 Running
+```
+
+## Configuration via `values.yaml`
+
+The chart is driven entirely by `helm/values.yaml`. Three things are now configurable per component:
+
+1. **Image** — `repository`, `tag`, `pullPolicy`
+2. **Image pull secrets** — global and/or per component, merged together
+3. **Security context** — pod-level (`podSecurityContext`) and container-level (`securityContext`)
+
+### Components
+
+| Key | What it controls |
+|---|---|
+| `celery-beat` | Beat scheduler pod (1 main container + 1 init container) |
+| `celery-worker` | Worker pod (3 main containers: scheduler, task-executor, asset-extractor + 1 init container) |
+| `redis` | Redis broker pod |
+| `autoUpdate` | The kubectl rollout-restart CronJob (only rendered when `autoUpdate.enabled=true`) |
+| `global` | Settings shared across all components: `DRD_CLOUD_API_TOKEN`, `DRD_CLOUD_API_HOST`, `nodeSelector`, `tolerations`, `imagePullSecrets` |
+
+### Using a private registry
+
+You can mirror or self-host any of the four images. Point each component at your registry and provide a pull secret.
+
+```bash
+kubectl -n drdroid create secret docker-registry my-registry-pull \
+  --docker-server=my-registry.example.com \
+  --docker-username=… --docker-password=…
+```
+
+```yaml
+# values.override.yaml
+global:
+  imagePullSecrets:
+    - name: my-registry-pull # applied to every pod in the chart
+
+celery-beat:
+  image:
+    repository: my-registry.example.com/drd/drd-vpc-agent
+    tag: 1.0.6
+    pullPolicy: IfNotPresent
+  initContainer:
+    image:
+      repository: my-registry.example.com/drd/busybox
+      tag: "1.36"
+
+celery-worker:
+  image:
+    repository: my-registry.example.com/drd/drd-vpc-agent
+    tag: 1.0.6
+    pullPolicy: IfNotPresent
+  initContainer:
+    image:
+      repository: my-registry.example.com/drd/busybox
+      tag: "1.36"
+
+redis:
+  image:
+    repository: my-registry.example.com/drd/redis
+    tag: 8-alpine
+  imagePullSecrets:
+    - name: dockerhub-mirror # additional secret only for redis; merged with global
+
+autoUpdate:
+  image:
+    repository: my-registry.example.com/drd/kubectl
+    tag: latest
+```
+
+```bash
+helm upgrade --install drd-vpc-agent helm/ \
+  -n drdroid \
+  -f helm/values.yaml \
+  -f values.override.yaml \
+  --set global.DRD_CLOUD_API_TOKEN=
+```
+
+### Security context (PSP / Gatekeeper / Pod Security Standards)
+
+The chart ships defaults that satisfy the common "must run as non-root" and "no privilege escalation" policies:
+
+| Component | Default `runAsUser` | Reason |
+|---|---|---|
+| `celery-beat`, `celery-worker` | `33` | matches the `www-data` user the agent image chowns `/code` to |
+| `redis` | `999` | matches the `redis` user baked into `redis:8-alpine` |
+| `autoUpdate` (kubectl CronJob) | `1000` | non-root, no filesystem requirements |
+
+If your policy is stricter (e.g. requires `runAsUser` inside a specific UID range), override per-component:
+
+```yaml
+celery-worker:
+  podSecurityContext:
+    runAsNonRoot: true
+    runAsUser: 10001
+    runAsGroup: 10001
+    fsGroup: 10001
+  securityContext:
+    allowPrivilegeEscalation: false
+    runAsNonRoot: true
+    runAsUser: 10001
+    readOnlyRootFilesystem: true
+    capabilities:
+      drop: [ALL]
+```
+
+If you build your own image with a different baked-in user, change `runAsUser` to match the UID that owns `/code` in your image. You can probe it with:
+
+```bash
+kubectl run uid-probe --rm -it --restart=Never --namespace drdroid \
+  --image= --command -- sh -c 'id www-data; ls -ld /code'
+```
+
+## Upgrades
+
+`values.yaml` and any override files are the source of truth. To roll out a change:
+
+```bash
+helm upgrade drd-vpc-agent helm/ \
+  -n drdroid \
+  -f helm/values.yaml \
+  -f values.override.yaml
+```
+
+The chart's pod template includes a `rollme` annotation pinned to deploy-time, so every `helm upgrade` triggers a rolling restart of the agent pods even when the image tag is `latest`. The `autoUpdate` CronJob (default: daily at 00:00 UTC) issues `kubectl rollout restart` against both deployments to pick up new `latest` images between releases.
+
+## Troubleshooting
+
+**Gatekeeper denies pod admission with `psp-pods-allowed-user-ranges`**
+The chart's defaults set `runAsNonRoot: true` and a non-zero `runAsUser`. If your policy still denies, your cluster likely enforces a UID *range* — set `runAsUser` (and `runAsGroup` / `fsGroup`) to a value inside the allowed range under each component's `podSecurityContext` and `securityContext`.
+
+**Container exits with `unable to open database file` / permission errors**
+The `runAsUser` you've set doesn't match the UID that owns `/code` in the image. Probe the image (see snippet above) and adjust `runAsUser` to that UID.
+
+**Image pull fails with `ImagePullBackOff`**
+Either the image isn't present in your registry, or the pull secret isn't reachable from the pod's namespace. Confirm:
+```bash
+kubectl -n drdroid get secrets | grep -i pull
+kubectl -n drdroid describe pod # check the Events section
+```
+Make sure the secret named in `imagePullSecrets` exists in the same namespace as the release.
+
+**Old pods stuck in CrashLoopBackOff after upgrade, blocking the new pod from scheduling**
+Rolling-update strategy keeps the old pod alive until the new one is Ready. If the cluster is CPU-tight and the new pod is Pending, scale the deployment to 0 and back to 1:
+```bash
+kubectl -n drdroid scale deployment drd-vpc-agent-celery-worker --replicas=0
+kubectl -n drdroid scale deployment drd-vpc-agent-celery-worker --replicas=1
+```
+
+## Uninstall
+
+```bash
+helm -n drdroid uninstall drd-vpc-agent
+kubectl delete namespace drdroid # only if you want to remove the credentials secret too
+```
diff --git a/helm/charts/celery_beat/templates/deployment.yaml b/helm/charts/celery_beat/templates/deployment.yaml
index 89e4efb..0ea8174 100644
--- a/helm/charts/celery_beat/templates/deployment.yaml
+++ b/helm/charts/celery_beat/templates/deployment.yaml
@@ -17,6 +17,14 @@ spec:
         rollme: "{{ now | unixEpoch }}"
     spec:
       serviceAccountName: drd-vpc-agent
+      {{- with (concat (default (list) .Values.global.imagePullSecrets) (default (list) .Values.imagePullSecrets)) }}
+      imagePullSecrets:
+        {{- toYaml . | nindent 8 }}
+      {{- end }}
+      {{- with .Values.podSecurityContext }}
+      securityContext:
+        {{- toYaml . | nindent 8 }}
+      {{- end }}
       {{- if .Values.global.nodeSelector }}
       nodeSelector:
         {{- toYaml .Values.global.nodeSelector | nindent 8 }}
@@ -27,7 +35,12 @@ spec:
       {{- end }}
       initContainers:
         - name: wait-for-redis
-          image: busybox:1.36
+          image: "{{ .Values.initContainer.image.repository }}:{{ .Values.initContainer.image.tag }}"
+          imagePullPolicy: {{ .Values.initContainer.image.pullPolicy }}
+          {{- with .Values.securityContext }}
+          securityContext:
+            {{- toYaml . | nindent 12 }}
+          {{- end }}
           command:
             - sh
             - -c
@@ -47,12 +60,16 @@ spec:
               memory: "32Mi"
       containers:
         - name: celery-beat
-          image: {{ .Values.image.repository }}:{{ .Values.image.tag }}
+          image: "{{ .Values.image.repository }}:{{ .Values.image.tag }}"
           imagePullPolicy: {{ .Values.image.pullPolicy }}
+          {{- with .Values.securityContext }}
+          securityContext:
+            {{- toYaml . | nindent 12 }}
+          {{- end }}
           command: ["./start-celery-beat.sh"]
           env:
             - name: DJANGO_DEBUG
-              value: "True"
+              value: "False"
             - name: CELERY_BROKER_URL
               value: "redis://redis-service:6379/0"
             - name: CELERY_RESULT_BACKEND
@@ -89,9 +106,9 @@ spec:
                 - /bin/sh
                 - -c
                 - "test -f /code/celerybeat.pid && ps -p $(cat /code/celerybeat.pid) > /dev/null"
-            initialDelaySeconds: 30
+            initialDelaySeconds: 60
             periodSeconds: 30
-            timeoutSeconds: 5
+            timeoutSeconds: 10
             failureThreshold: 3
           startupProbe:
             exec:
@@ -99,11 +116,11 @@ spec:
                 - /bin/sh
                 - -c
                 - "test -f /code/celerybeat.pid"
-            initialDelaySeconds: 15
-            periodSeconds: 5
-            timeoutSeconds: 3
-            failureThreshold: 12
+            initialDelaySeconds: 30
+            periodSeconds: 10
+            timeoutSeconds: 5
+            failureThreshold: 30
       volumes:
         - name: credentials-volume
           secret:
-            secretName: credentials-secret
\ No newline at end of file
+            secretName: credentials-secret
diff --git a/helm/charts/celery_worker/templates/deployment.yaml b/helm/charts/celery_worker/templates/deployment.yaml
index 284905d..e32efb5 100644
--- a/helm/charts/celery_worker/templates/deployment.yaml
+++ b/helm/charts/celery_worker/templates/deployment.yaml
@@ -17,6 +17,14 @@ spec:
         rollme: "{{ now | unixEpoch }}"
     spec:
       serviceAccountName: drd-vpc-agent
+      {{- with (concat (default (list) .Values.global.imagePullSecrets) (default (list) .Values.imagePullSecrets)) }}
+      imagePullSecrets:
+        {{- toYaml . | nindent 8 }}
+      {{- end }}
+      {{- with .Values.podSecurityContext }}
+      securityContext:
+        {{- toYaml . | nindent 8 }}
+      {{- end }}
       {{- if .Values.global.nodeSelector }}
       nodeSelector:
         {{- toYaml .Values.global.nodeSelector | nindent 8 }}
@@ -27,7 +35,12 @@ spec:
       {{- end }}
       initContainers:
         - name: wait-for-redis
-          image: busybox:1.36
+          image: "{{ .Values.initContainer.image.repository }}:{{ .Values.initContainer.image.tag }}"
+          imagePullPolicy: {{ .Values.initContainer.image.pullPolicy }}
+          {{- with .Values.securityContext }}
+          securityContext:
+            {{- toYaml . | nindent 12 }}
+          {{- end }}
           command:
             - sh
             - -c
@@ -47,12 +60,16 @@ spec:
               memory: "32Mi"
       containers:
         - name: celery-worker-scheduler # Lightweight task scheduler
-          image: {{ .Values.image.repository }}:{{ .Values.image.tag }}
+          image: "{{ .Values.image.repository }}:{{ .Values.image.tag }}"
           imagePullPolicy: {{ .Values.image.pullPolicy }}
+          {{- with .Values.securityContext }}
+          securityContext:
+            {{- toYaml . | nindent 12 }}
+          {{- end }}
           command: ["./start-celery-worker.sh"]
           env:
             - name: DJANGO_DEBUG
-              value: "True"
+              value: "False"
             - name: CELERY_BROKER_URL
               value: "redis://redis-service:6379/0"
             - name: CELERY_RESULT_BACKEND
@@ -92,29 +109,33 @@ spec:
               command:
                 - /bin/sh
                 - -c
-                - "celery -A agent inspect ping -d celery@$HOSTNAME -t 5"
-            initialDelaySeconds: 30
-            periodSeconds: 10
-            timeoutSeconds: 5
-            failureThreshold: 12
+                - "celery -A agent inspect ping -d celery@$HOSTNAME -t 20"
+            initialDelaySeconds: 60
+            periodSeconds: 15
+            timeoutSeconds: 25
+            failureThreshold: 20
           livenessProbe:
             exec:
               command:
                 - /bin/sh
                 - -c
-                - "celery -A agent inspect ping -d celery@$HOSTNAME -t 5"
-            initialDelaySeconds: 30
-            periodSeconds: 30
-            timeoutSeconds: 10
+                - "celery -A agent inspect ping -d celery@$HOSTNAME -t 20"
+            initialDelaySeconds: 60
+            periodSeconds: 60
+            timeoutSeconds: 25
             failureThreshold: 3
 
         - name: celery-worker-task-executor # Task executor for high-priority tasks
-          image: {{ .Values.image.repository }}:{{ .Values.image.tag }}
+          image: "{{ .Values.image.repository }}:{{ .Values.image.tag }}"
           imagePullPolicy: {{ .Values.image.pullPolicy }}
+          {{- with .Values.securityContext }}
+          securityContext:
+            {{- toYaml . | nindent 12 }}
+          {{- end }}
           command: ["./start-celery-worker.sh"]
           env:
             - name: DJANGO_DEBUG
-              value: "True"
+              value: "False"
             - name: CELERY_BROKER_URL
               value: "redis://redis-service:6379/0"
             - name: CELERY_RESULT_BACKEND
@@ -154,29 +175,33 @@ spec:
               command:
                 - /bin/sh
                 - -c
-                - "celery -A agent inspect ping -d celery@$HOSTNAME -t 5"
-            initialDelaySeconds: 30
-            periodSeconds: 10
-            timeoutSeconds: 5
-            failureThreshold: 12
+                - "celery -A agent inspect ping -d celery@$HOSTNAME -t 20"
+            initialDelaySeconds: 60
+            periodSeconds: 15
+            timeoutSeconds: 25
+            failureThreshold: 20
           livenessProbe:
             exec:
               command:
                 - /bin/sh
                 - -c
-                - "celery -A agent inspect ping -d celery@$HOSTNAME -t 5"
-            initialDelaySeconds: 30
-            periodSeconds: 30
-            timeoutSeconds: 10
+                - "celery -A agent inspect ping -d celery@$HOSTNAME -t 20"
+            initialDelaySeconds: 60
+            periodSeconds: 60
+            timeoutSeconds: 25
             failureThreshold: 3
 
         - name: celery-worker-asset-extractor # Task executor for asset extraction tasks, which run rarely and are long-running
-          image: {{ .Values.image.repository }}:{{ .Values.image.tag }}
+          image: "{{ .Values.image.repository }}:{{ .Values.image.tag }}"
           imagePullPolicy: {{ .Values.image.pullPolicy }}
+          {{- with .Values.securityContext }}
+          securityContext:
+            {{- toYaml . | nindent 12 }}
+          {{- end }}
           command: ["./start-celery-worker.sh"]
           env:
             - name: DJANGO_DEBUG
-              value: "True"
+              value: "False"
             - name: CELERY_BROKER_URL
               value: "redis://redis-service:6379/0"
             - name: CELERY_RESULT_BACKEND
@@ -216,23 +241,23 @@ spec:
               command:
                 - /bin/sh
                 - -c
-                - "celery -A agent inspect ping -d celery@$HOSTNAME -t 5"
-            initialDelaySeconds: 30
-            periodSeconds: 10
-            timeoutSeconds: 5
-            failureThreshold: 12
+                - "celery -A agent inspect ping -d celery@$HOSTNAME -t 20"
+            initialDelaySeconds: 60
+            periodSeconds: 15
+            timeoutSeconds: 25
+            failureThreshold: 20
           livenessProbe:
             exec:
               command:
                 - /bin/sh
                 - -c
-                - "celery -A agent inspect ping -d celery@$HOSTNAME -t 5"
-            initialDelaySeconds: 30
-            periodSeconds: 30
-            timeoutSeconds: 10
+                - "celery -A agent inspect ping -d celery@$HOSTNAME -t 20"
+            initialDelaySeconds: 60
+            periodSeconds: 60
+            timeoutSeconds: 25
             failureThreshold: 3
 
       volumes:
         - name: credentials-volume
           secret:
-            secretName: credentials-secret
\ No newline at end of file
+            secretName: credentials-secret
diff --git a/helm/charts/redis/templates/deployment.yaml b/helm/charts/redis/templates/deployment.yaml
index 35f6a67..f70654d 100644
--- a/helm/charts/redis/templates/deployment.yaml
+++ b/helm/charts/redis/templates/deployment.yaml
@@ -16,6 +16,14 @@ spec:
     spec:
       restartPolicy: Always
       serviceAccountName: drd-vpc-agent
+      {{- with (concat (default (list) .Values.global.imagePullSecrets) (default (list) .Values.imagePullSecrets)) }}
+      imagePullSecrets:
+        {{- toYaml . | nindent 8 }}
+      {{- end }}
+      {{- with .Values.podSecurityContext }}
+      securityContext:
+        {{- toYaml . | nindent 8 }}
+      {{- end }}
       {{- if .Values.global.nodeSelector }}
       nodeSelector:
         {{- toYaml .Values.global.nodeSelector | nindent 8 }}
@@ -26,7 +34,12 @@ spec:
       {{- end }}
       containers:
         - name: redis
-          image: redis:8-alpine
+          image: "{{ .Values.image.repository }}:{{ .Values.image.tag }}"
+          imagePullPolicy: {{ .Values.image.pullPolicy }}
+          {{- with .Values.securityContext }}
+          securityContext:
+            {{- toYaml . | nindent 12 }}
+          {{- end }}
           ports:
             - containerPort: 6379
           resources:
@@ -63,4 +76,4 @@ spec:
           emptyDir: {}
         - name: redis-config
           configMap:
-            name: redis-config
\ No newline at end of file
+            name: redis-config
diff --git a/helm/templates/cronjob-restart.yaml b/helm/templates/cronjob-restart.yaml
index 2d4315a..577af9e 100644
--- a/helm/templates/cronjob-restart.yaml
+++ b/helm/templates/cronjob-restart.yaml
@@ -10,9 +10,30 @@ spec:
       template:
         spec:
           serviceAccountName: drd-vpc-agent-restart
+          {{- with (concat (default (list) .Values.global.imagePullSecrets) (default (list) .Values.autoUpdate.imagePullSecrets)) }}
+          imagePullSecrets:
+            {{- toYaml . | nindent 12 }}
+          {{- end }}
+          {{- with .Values.autoUpdate.podSecurityContext }}
+          securityContext:
+            {{- toYaml . | nindent 12 }}
+          {{- end }}
+          {{- if .Values.global.nodeSelector }}
+          nodeSelector:
+            {{- toYaml .Values.global.nodeSelector | nindent 12 }}
+          {{- end }}
+          {{- if .Values.global.tolerations }}
+          tolerations:
+            {{- toYaml .Values.global.tolerations | nindent 12 }}
+          {{- end }}
           containers:
             - name: kubectl
-              image: ghcr.io/drdroidlab/drd-vpc-agent/kubectl:latest
+              image: "{{ .Values.autoUpdate.image.repository }}:{{ .Values.autoUpdate.image.tag }}"
+              imagePullPolicy: {{ .Values.autoUpdate.image.pullPolicy }}
+              {{- with .Values.autoUpdate.securityContext }}
+              securityContext:
+                {{- toYaml . | nindent 16 }}
+              {{- end }}
               command:
                 - /bin/sh
                 - -c
diff --git a/helm/values.yaml b/helm/values.yaml
index c3de4db..20d307b 100644
--- a/helm/values.yaml
+++ b/helm/values.yaml
@@ -10,7 +10,6 @@ global:
   # "worker-type": "aux-pool"
   # "kubernetes.io/os": "linux"
   nodeSelector: {}
-
   # Example tolerations:
   # tolerations:
   # - key: "dedicated"
@@ -19,6 +18,14 @@ global:
   # effect: "NoSchedule"
   tolerations: []
 
+  # Global image pull secrets applied to every pod (celery-beat, celery-worker,
+  # redis, init containers, restart cronjob). Per-component imagePullSecrets are
+  # merged on top of this list.
+  # Example:
+  # imagePullSecrets:
+  # - name: jfrog-pull-secret
+  imagePullSecrets: []
+
 # Network Mapper Configuration
 # Controls whether the network mapper components are deployed
 # The network mapper provides service topology visibility and network insights
@@ -33,18 +40,92 @@ networkMapper:
 autoUpdate:
   enabled: true
   schedule: "0 0 * * *" # every day at 00:00 UTC
+  image:
+    repository: ghcr.io/drdroidlab/drd-vpc-agent/kubectl
+    tag: latest
+    pullPolicy: Always
+  # Image pull secrets for the kubectl image used by the restart CronJob.
+  # These are merged with global.imagePullSecrets.
+  # Example:
+  # imagePullSecrets:
+  # - name: ghcr-pull-secret
+  imagePullSecrets: []
+  podSecurityContext:
+    runAsNonRoot: true
+    runAsUser: 1000
+    runAsGroup: 1000
+    fsGroup: 1000
+  securityContext:
+    allowPrivilegeEscalation: false
+    runAsNonRoot: true
+    runAsUser: 1000
+    capabilities:
+      drop:
+        - ALL
 
 celery-beat:
   image:
     repository: ghcr.io/drdroidlab/drd-vpc-agent/drd-vpc-agent
     tag: latest
     pullPolicy: Always
+  # Image pull secrets for the celery-beat pod (merged with global.imagePullSecrets).
+  # Example:
+  # imagePullSecrets:
+  # - name: jfrog-pull-secret
+  imagePullSecrets: []
+  # Init container that waits for redis. Image is fully configurable.
+  initContainer:
+    image:
+      repository: busybox
+      tag: "1.36"
+      pullPolicy: IfNotPresent
+  # Pod-level security context (applies to all containers in the pod).
+  # UID/GID 33 = www-data (the user the drd-vpc-agent image chowns /code to).
+  # Defaults satisfy PSP/Gatekeeper "runAsNonRoot or runAsUser != 0" policies.
+  podSecurityContext:
+    runAsNonRoot: true
+    runAsUser: 33
+    runAsGroup: 33
+    fsGroup: 33
+  # Container-level security context (applied to celery-beat container and
+  # init container). Override runAsUser to match the UID baked into your image.
+  securityContext:
+    allowPrivilegeEscalation: false
+    runAsNonRoot: true
+    runAsUser: 33
+    capabilities:
+      drop:
+        - ALL
 
 celery-worker:
   image:
     repository: ghcr.io/drdroidlab/drd-vpc-agent/drd-vpc-agent
     tag: latest
     pullPolicy: Always
+  # Image pull secrets for the celery-worker pod (merged with global.imagePullSecrets).
+  # Applies to all worker containers (scheduler, task-executor, asset-extractor).
+  # Example:
+  # imagePullSecrets:
+  # - name: jfrog-pull-secret
+  imagePullSecrets: []
+  initContainer:
+    image:
+      repository: busybox
+      tag: "1.36"
+      pullPolicy: IfNotPresent
+  # UID/GID 33 = www-data (the user the drd-vpc-agent image chowns /code to).
+  podSecurityContext:
+    runAsNonRoot: true
+    runAsUser: 33
+    runAsGroup: 33
+    fsGroup: 33
+  securityContext:
+    allowPrivilegeEscalation: false
+    runAsNonRoot: true
+    runAsUser: 33
+    capabilities:
+      drop:
+        - ALL
   # Resource configuration for celery worker containers
   resources:
     scheduler:
@@ -73,4 +154,25 @@ celery-worker:
         memory: "1536Mi"
 
 redis:
-  image: redis:8-alpine
+  image:
+    repository: redis
+    tag: 8-alpine
+    pullPolicy: IfNotPresent
+  # Image pull secrets for the redis pod (merged with global.imagePullSecrets).
+  # Example:
+  # imagePullSecrets:
+  # - name: dockerhub-pull-secret
+  imagePullSecrets: []
+  # redis:8-alpine ships a `redis` user with UID/GID 999.
+  podSecurityContext:
+    runAsNonRoot: true
+    runAsUser: 999
+    runAsGroup: 999
+    fsGroup: 999
+  securityContext:
+    allowPrivilegeEscalation: false
+    runAsNonRoot: true
+    runAsUser: 999
+    capabilities:
+      drop:
+        - ALL