This document serves as the knowledge base for troubleshooting the Open Data Hub Operator. More information can be found at https://github.com/opendatahub-io/opendatahub-operator/wiki
These upgrade steps also apply to any local build deployed from the "main" branch.
To upgrade, follow these steps:
- Disable the component(s) in your DSC instance.
- Delete both the DSC and DSCI instances.
- Uninstall the Open Data Hub operator.
- If they are exposed on v1alpha1, delete the DSC and DSCI CRDs.
All of the above steps can be performed either through the console UI or via the oc/kubectl CLI.
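For example, a minimal CLI sequence (the CRD names below assume a stock install; verify them on your cluster before deleting anything):
# delete the custom resources first, then their CRDs
oc delete datasciencecluster --all
oc delete dscinitialization --all
oc delete crd datascienceclusters.datasciencecluster.opendatahub.io
oc delete crd dscinitializations.dscinitialization.opendatahub.io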
After completing these steps, please refer to the installation guide to proceed with a clean installation of the v2.2+ operator.
A component is enabled only if managementState is explicitly set to "Managed" at the component level; the two DSC CR configurations below for component "X" have the same effect (the component is not deployed):
spec:
  components:
    X:
      managementState: Removed
spec:
  components:
    X: {}
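The same setting can be toggled from the CLI with a merge patch; the DSC name default-dsc and component name dashboard below are illustrative placeholders:
# explicitly enable a component by setting managementState to Managed
oc patch datasciencecluster default-dsc --type merge \
  -p '{"spec":{"components":{"dashboard":{"managementState":"Managed"}}}}'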
This is a loose list of tools to install on your Linux box in order to compile, test, and deploy the operator.
ssh-keygen -t ed25519 -C "<email-registered-on-github-account>"
# upload public key to github
sudo dnf makecache --refresh
sudo dnf install -y git-all
sudo dnf install -y golang
sudo dnf install -y podman
sudo dnf install -y cri-o kubernetes-kubeadm kubernetes-node kubernetes-client cri-tools
sudo dnf install -y operator-sdk
sudo dnf install -y wget
wget https://mirror.openshift.com/pub/openshift-v4/clients/oc/latest/linux/oc.tar.gz
mkdir -p bin ; cd bin/ ; tar -xzvf ../oc.tar.gz ; cd .. ; rm oc.tar.gz
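# verify the extracted client works (path assumes the archive was unpacked into ~/bin)
~/bin/oc version --client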
sudo dnf install -y zsh
# update PATH
echo 'export PATH=${PATH}:~/bin' >> ~/.zshrc
echo 'export GOPROXY=https://proxy.golang.org' >> ~/.zshrc
To let developers customize Makefile execution for their environment, you can create a local.mk file in the root of this repo to specify custom values that match your environment.
$ cat local.mk
VERSION=9.9.9
IMAGE_TAG_BASE=quay.io/my-dev-env/opendatahub-operator
IMG_TAG=my-dev-tag
OPERATOR_NAMESPACE=my-dev-odh-operator-system
IMAGE_BUILD_FLAGS=--build-arg USE_LOCAL=true
E2E_TEST_FLAGS="--deletion-policy=never" -timeout 15m
DEFAULT_MANIFESTS_PATH=./opt/manifests
PLATFORM=linux/amd64,linux/ppc64le,linux/s390x
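To confirm make is picking up your local.mk overrides, a generic GNU make debugging idiom is to inject a throwaway print rule from the command line (this is plain make, not a target shipped by the repo):
# print the effective value of any variable, e.g. VERSION and IMAGE_TAG_BASE
make --eval='print-%: ; @echo $*=$($*)' print-VERSION print-IMAGE_TAG_BASE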
- Operator pod keeps crashing: Ensure that only one namespace in your cluster has the label opendatahub.io/application-namespace=true. This is similar to case (3).
- Error "DSCI must used the same namespace which has opendatahub.io/application-namespace=true label": one namespace in the cluster has the label opendatahub.io/application-namespace=true, but it is not the one set in the DSCI's .spec.applicationsNamespace. Solutions (any one of the below should work; see the commands after this list):
  - delete the existing DSCI and re-create it pointing at the namespace that already has the label opendatahub.io/application-namespace=true
  - move the label opendatahub.io/application-namespace=true from the other namespace to the one specified in the DSCI, then wait a couple of minutes to allow the DSCI reconciliation to continue.
- Error "only support max. one namespace with label: opendatahub.io/application-namespace=true": refer to (1).
If running with the make run or make run-nowebhook commands, pprof is enabled.
When pprof is enabled, you can explore collected pprof profiles using commands such as:
go tool pprof -http : http://localhost:6060/debug/pprof/heap
go tool pprof -http : http://localhost:6060/debug/pprof/profile
go tool pprof -http : http://localhost:6060/debug/pprof/block
You can also save a pprof file for use in other tools or offline analysis as follows:
curl -s "http://127.0.0.1:6060/debug/pprof/profile" > ./cpu-profile.out
This is disabled by default outside local development, but can be enabled by setting the PPROF_BIND_ADDRESS env var:
- name: PPROF_BIND_ADDRESS
  value: "0.0.0.0:6060"
This can be set in an existing opendatahub-operator-controller-manager deployment, or on the operator subscription per https://github.com/operator-framework/operator-lifecycle-manager/blob/master/doc/design/subscription-config.md#env
See https://github.com/google/pprof/blob/main/doc/README.md for more details on how to use pprof.
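For example, to inspect a profile saved with the curl command above:
# top functions from the saved CPU profile
go tool pprof -top ./cpu-profile.out
# or browse it in the interactive web UI on a local port
go tool pprof -http=:8081 ./cpu-profile.out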
Alert: OperatorPodRestartingFrequently
Severity: Warning
Description: The operator pod has restarted more than 3 times in a 5-minute period.
- Prometheus alert OperatorPodRestartingFrequently is firing
- Operator pod restart count is high
- Components may not be reconciling properly
- Operator logs show crash or restart messages
oc get pods -n redhat-ods-operator
oc describe pod <operator-pod-name> -n redhat-ods-operator
kubectl get pod <operator-pod-name> -n redhat-ods-operator -o jsonpath='{.status.containerStatuses[0].restartCount}'
# Current logs
oc logs -n redhat-ods-operator <operator-pod-name> -c rhods-operator --tail=100
# Previous crashed container logs
oc logs -n redhat-ods-operator <operator-pod-name> -c rhods-operator --previous
# Check resource limits
oc get pod <operator-pod-name> -n redhat-ods-operator -o jsonpath='{.spec.containers[0].resources}'
# Check events
oc get events -n redhat-ods-operator --sort-by='.lastTimestamp' | grep <operator-pod-name>
# Check for OOM kills
oc get pod <operator-pod-name> -n redhat-ods-operator -o jsonpath='{.status.containerStatuses[0].lastState}'
- Symptom: lastState.terminated.reason: OOMKilled
- Solution: Increase memory limits (a command to verify the applied limits follows this list)
oc patch deployment rhods-operator-controller-manager -n redhat-ods-operator \
  --type='json' -p='[{"op": "replace", "path": "/spec/template/spec/containers/0/resources/limits/memory", "value": "2Gi"}]'
- Symptom: High CPU usage, slow reconciliation
- Solution: Increase CPU limits
oc patch deployment rhods-operator-controller-manager -n redhat-ods-operator \
  --type='json' -p='[{"op": "replace", "path": "/spec/template/spec/containers/0/resources/limits/cpu", "value": "1000m"}]'
- Symptom: Logs show certificate errors
- Solution: Check certificate secrets
oc get secret -n redhat-ods-operator | grep webhook
oc get validatingwebhookconfiguration
oc get mutatingwebhookconfiguration
- Symptom: Logs show panic stack traces or fatal errors
- Solution: Review logs for the root cause; a code fix may be needed
oc logs -n redhat-ods-operator <operator-pod-name> -c rhods-operator --previous | grep -A 20 "panic\|fatal"
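After applying either of the resource patches above, confirm the new limits actually landed in the pod template:
oc get deployment rhods-operator-controller-manager -n redhat-ods-operator \
  -o jsonpath='{.spec.template.spec.containers[0].resources}'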
- Immediate Action: If the operator is non-functional, restart it:
oc rollout restart deployment rhods-operator -n redhat-ods-operator
- Increase Resources (if OOM/CPU throttling):
oc patch deployment rhods-operator-controller-manager -n redhat-ods-operator \
  --type='json' -p='[
    {"op": "replace", "path": "/spec/template/spec/containers/0/resources/limits/memory", "value": "2Gi"},
    {"op": "replace", "path": "/spec/template/spec/containers/0/resources/requests/memory", "value": "512Mi"},
    {"op": "replace", "path": "/spec/template/spec/containers/0/resources/limits/cpu", "value": "1000m"}
  ]'
- Check for Cluster-Wide Issues:
# Check node resources
oc top nodes
# Check if other operators are also restarting
oc get pods --all-namespaces | grep -E 'CrashLoop|Error'
- Verify Operator Configuration:
# Check DSCI
oc get dsci -o yaml
# Check DSC
oc get dsc -o yaml
# Validate no circular dependencies or misconfigurations
- Collect Debug Information:
# Get full operator state
oc get deployment rhods-operator-controller-manager -n redhat-ods-operator -o yaml > operator-deployment.yaml
oc get pods -n redhat-ods-operator -o yaml > operator-pods.yaml
oc logs -n redhat-ods-operator <operator-pod-name> --all-containers --previous > operator-previous-logs.txt
oc logs -n redhat-ods-operator <operator-pod-name> --all-containers > operator-current-logs.txt
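Beyond the per-resource dumps above, oc adm inspect can snapshot the whole namespace in one pass (the --dest-dir path is just an example):
# gather resources, logs, and events for the operator namespace
oc adm inspect ns/redhat-ods-operator --dest-dir=./operator-inspect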
Escalate to the development team if:
- Restarts continue after resource increases
- Logs show unhandled panics or fatal errors
- Issue correlates with specific DSC/DSCI configuration
- Problem started after operator upgrade
- Multiple restarts with no clear cause in logs
Bug Report: Include all debug information collected above and steps taken.