
What

This document serves as the knowledge base for troubleshooting the Open Data Hub Operator. More information can be found at https://github.com/opendatahub-io/opendatahub-operator/wiki

Troubleshooting

Upgrade from Operator v2.0/v2.1 to v2.2+

This also applies to any local build deployment from the "main" branch.

To upgrade, follow these steps:

  • Disable the component(s) in your DSC instance.
  • Delete both the DSC instance and the DSCI instance.
  • Uninstall the Open Data Hub operator.
  • If the DSC and DSCI CRDs are still exposed at v1alpha1, delete both CRDs.

All of the above steps can be performed either through the console UI or via the oc/kubectl CLI. After completing these steps, please refer to the installation guide to proceed with a clean installation of the v2.2+ operator.
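
For reference, a minimal CLI sketch of the resource-cleanup steps (the CRD names below assume the datasciencecluster.opendatahub.io and dscinitialization.opendatahub.io API groups; verify them first with oc get crd | grep opendatahub):

# delete the custom resources first
oc delete datasciencecluster --all
oc delete dscinitialization --all

# then, if they still expose v1alpha1, delete the CRDs themselves
oc delete crd datascienceclusters.datasciencecluster.opendatahub.io
oc delete crd dscinitializations.dscinitialization.opendatahub.io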

Why is a component's managementState set to {} and not Removed?

Unless managementState is explicitly set to "Managed" at the component level, the two DSC CR configurations below have the same effect for component "X":

spec:
  components:
    X:
      managementState: Removed

spec:
  components:
    X: {}
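
For contrast, component "X" is only actively managed when its managementState is explicitly set to Managed:

spec:
  components:
    X:
      managementState: Managed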

Setting up a Fedora-based development environment

This is a loose list of tools to install on your Linux box in order to compile, test, and deploy the operator.

ssh-keygen -t ed25519 -C "<email-registered-on-github-account>"
# upload public key to github

sudo dnf makecache --refresh
sudo dnf install -y git-all
sudo dnf install -y golang
sudo dnf install -y podman
sudo dnf install -y cri-o kubernetes-kubeadm kubernetes-node kubernetes-client cri-tools
sudo dnf install -y operator-sdk
sudo dnf install -y wget
wget https://mirror.openshift.com/pub/openshift-v4/clients/oc/latest/linux/oc.tar.gz
mkdir -p ~/bin ; tar -xzvf oc.tar.gz -C ~/bin ; rm oc.tar.gz
sudo dnf install -y zsh

# update PATH
echo 'export PATH=${PATH}:~/bin' >> ~/.zshrc
echo 'export GOPROXY=https://proxy.golang.org' >> ~/.zshrc
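
As an optional sanity check, confirm the tool chain is installed and on the PATH (open a new shell first so the PATH change takes effect):

# verify the installed tools
go version
podman --version
operator-sdk version
oc version --client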

Using a local.mk file to override Makefile variables for your development environment

To customize the Makefile execution for your development environment, create a local.mk file in the root of this repo and set the variables that should override the defaults:

$ cat local.mk
VERSION=9.9.9
IMAGE_TAG_BASE=quay.io/my-dev-env/opendatahub-operator
IMG_TAG=my-dev-tag
OPERATOR_NAMESPACE=my-dev-odh-operator-system
IMAGE_BUILD_FLAGS=--build-arg USE_LOCAL=true
E2E_TEST_FLAGS="--deletion-policy=never" -timeout 15m
DEFAULT_MANIFESTS_PATH=./opt/manifests
PLATFORM=linux/amd64,linux/ppc64le,linux/s390x
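
Make variables can also be overridden for a single invocation on the command line; command-line assignments take precedence over both local.mk and the Makefile defaults. For example (assuming the standard operator-sdk deploy target):

make deploy VERSION=9.9.9 IMG_TAG=my-dev-tag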

When I try to use my own application namespace, I get different errors:

  1. The operator pod keeps crashing. Ensure that exactly one namespace in the cluster carries the label opendatahub.io/application-namespace=true. This is similar to case (3).

  2. Error "DSCI must used the same namespace which has opendatahub.io/application-namespace=true label". One namespace in the cluster carries the label opendatahub.io/application-namespace=true, but it is not the one set in the DSCI's .spec.applicationsNamespace. Either of the following fixes should work (see the commands after this list):

  • delete the existing DSCI and re-create it with the namespace that already carries the label opendatahub.io/application-namespace=true
  • move the label opendatahub.io/application-namespace=true from the other namespace to the one specified in the DSCI, then wait a couple of minutes for the DSCI reconciliation to continue.

  3. Error "only support max. one namespace with label: opendatahub.io/application-namespace=true". Refer to (1).

Profiling with pprof

If you run the operator with the make run or make run-nowebhook targets, pprof is enabled.

When pprof is enabled, you can explore collected pprof profiles using commands such as:

  • go tool pprof -http : http://localhost:6060/debug/pprof/heap
  • go tool pprof -http : http://localhost:6060/debug/pprof/profile
  • go tool pprof -http : http://localhost:6060/debug/pprof/block

You can also save a pprof file for use in other tools or offline analysis as follows:

curl -s "http://127.0.0.1:6060/debug/pprof/profile" > ./cpu-profile.out

This is disabled by default outside local development, but can be enabled by setting the PPROF_BIND_ADDRESS env var:

  - name: PPROF_BIND_ADDRESS
    value: 0.0.0.0:6060

This can be set in an existing opendatahub-operator-controller-manager deployment, or on the operator subscription per https://github.com/operator-framework/operator-lifecycle-manager/blob/master/doc/design/subscription-config.md#env
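
For example, one quick way to set it on the running deployment (adjust the namespace to wherever the operator is installed):

oc set env deployment/opendatahub-operator-controller-manager -n <operator-namespace> PPROF_BIND_ADDRESS=0.0.0.0:6060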

See https://github.com/google/pprof/blob/main/doc/README.md for more details on how to use pprof

Operator Pod Restarting Frequently

Alert: OperatorPodRestartingFrequently
Severity: Warning
Description: The operator pod has restarted more than 3 times in a 5-minute period.

Symptoms

  • Prometheus alert OperatorPodRestartingFrequently is firing
  • Operator pod restart count is high
  • Components may not be reconciling properly
  • Operator logs show crash or restart messages

Investigation Steps

1. Check operator pod status:

oc get pods -n redhat-ods-operator
oc describe pod <operator-pod-name> -n redhat-ods-operator

2. Check restart count:

oc get pod <operator-pod-name> -n redhat-ods-operator -o jsonpath='{.status.containerStatuses[0].restartCount}'

3. Get operator logs (current and previous):

# Current logs
oc logs -n redhat-ods-operator <operator-pod-name> -c rhods-operator --tail=100

# Previous crashed container logs
oc logs -n redhat-ods-operator <operator-pod-name> -c rhods-operator --previous

4. Check for common issues:

# Check resource limits
oc get pod <operator-pod-name> -n redhat-ods-operator -o jsonpath='{.spec.containers[0].resources}'

# Check events
oc get events -n redhat-ods-operator --sort-by='.lastTimestamp' | grep <operator-pod-name>

# Check for OOM kills
oc get pod <operator-pod-name> -n redhat-ods-operator -o jsonpath='{.status.containerStatuses[0].lastState}'

Common Causes & Solutions

1. Out of Memory (OOM)

  • Symptom: lastState.terminated.reason: OOMKilled
  • Solution: Increase memory limits
    oc patch deployment rhods-operator-controller-manager -n redhat-ods-operator \
      --type='json' -p='[{"op": "replace", "path": "/spec/template/spec/containers/0/resources/limits/memory", "value": "2Gi"}]'

2. CPU Throttling

  • Symptom: High CPU usage, slow reconciliation
  • Solution: Increase CPU limits
    oc patch deployment rhods-operator-controller-manager -n redhat-ods-operator \
      --type='json' -p='[{"op": "replace", "path": "/spec/template/spec/containers/0/resources/limits/cpu", "value": "1000m"}]'

3. Webhook Certificate Issues

  • Symptom: Logs show certificate errors
  • Solution: Check certificate secrets
    oc get secret -n redhat-ods-operator | grep webhook
    oc get validatingwebhookconfiguration
    oc get mutatingwebhookconfiguration

4. Panic or Fatal Errors

  • Symptom: Logs show panic stack traces or fatal errors
  • Solution: Review logs for root cause, may need code fix
    oc logs -n redhat-ods-operator <operator-pod-name> -c rhods-operator --previous | grep -A 20 "panic\|fatal"

Resolution Steps

  1. Immediate Action: If the operator is non-functional, restart it:

    oc rollout restart deployment rhods-operator -n redhat-ods-operator
  2. Increase Resources (if OOM/CPU throttling):

    oc patch deployment rhods-operator-controller-manager -n redhat-ods-operator \
      --type='json' -p='[
        {"op": "replace", "path": "/spec/template/spec/containers/0/resources/limits/memory", "value": "2Gi"},
        {"op": "replace", "path": "/spec/template/spec/containers/0/resources/requests/memory", "value": "512Mi"},
        {"op": "replace", "path": "/spec/template/spec/containers/0/resources/limits/cpu", "value": "1000m"}
      ]'
  3. Check for Cluster-Wide Issues:

    # Check node resources
    oc top nodes
    
    # Check if other operators are also restarting
    oc get pods --all-namespaces | grep -E 'CrashLoop|Error'
  4. Verify Operator Configuration:

    # Check DSCI
    oc get dsci -o yaml
    
    # Check DSC
    oc get dsc -o yaml
    
    # Validate no circular dependencies or misconfigurations
  5. Collect Debug Information:

    # Get full operator state
    oc get deployment rhods-operator-controller-manager -n redhat-ods-operator -o yaml > operator-deployment.yaml
    oc get pods -n redhat-ods-operator -o yaml > operator-pods.yaml
    oc logs -n redhat-ods-operator <operator-pod-name> --all-containers --previous > operator-previous-logs.txt
    oc logs -n redhat-ods-operator <operator-pod-name> --all-containers > operator-current-logs.txt
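
Optionally, oc adm inspect can capture the whole namespace state in one pass (the --dest-dir path below is just an example):

oc adm inspect ns/redhat-ods-operator --dest-dir=./operator-debug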

When to Escalate

Escalate to the development team if:

  • Restarts continue after resource increases
  • Logs show unhandled panics or fatal errors
  • Issue correlates with specific DSC/DSCI configuration
  • Problem started after operator upgrade
  • Multiple restarts with no clear cause in logs

Bug Report: Include all debug information collected above and steps taken.