Merged
10 changes: 5 additions & 5 deletions README.md
@@ -6,9 +6,9 @@ A scheduler which aims to distribute OpenShift clusters among a pool of vCenters
number of vCenters, datacenters, and clusters, ensuring that an OpenShift cluster is installed into an environment with sufficient
capacity.

![overview](/doc/vSphere%20Resource%20Manager.png)
User-focused diagrams and walkthroughs: [doc/README.md](doc/README.md) (see [How it works](doc/how-it-works.md)).

## Teminology
## Terminology

### Pools

@@ -35,7 +35,7 @@ metadata:
  labels:
    boskos-lease-id: "test-id"
spec:
  requiredPool: <optional: name of the required pool>
  required-pool: <optional: metadata name of the pool>
  vcpus: 24
  memory: 96
  networks: 1
@@ -175,9 +175,9 @@
A pool can be excluded from consideration unless a lease specifically requests it. A pool may have a
unique environment, or configuration, which warrants intentional scheduling to the pool. To exclude a pool from scheduling, set
`spec.exclude` to true.

To request a specific pool, a Lease must set `spec.requiredPool` to the name of the pool.
To request a specific pool, a Lease must set `spec.required-pool` to the **metadata name** of the pool.

TO-DO: implement a poolSelector paradigm
To restrict scheduling by **labels**, use `spec.poolSelector` (key/value map on pool labels). To restrict by **dedicated hardware or queues**, use **taints** on the Pool and **tolerations** on the Lease. See [doc/scheduling.md](doc/scheduling.md).
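
A sketch combining these mechanisms on one Lease. The `vcpus`/`memory`/`networks` values mirror the example earlier in this README; the `poolSelector` values are illustrative, and the `tolerations` shape follows the Kubernetes convention and is an assumption, not confirmed against the CRD:

```yaml
apiVersion: vspherecapacitymanager.splat.io/v1
kind: Lease
metadata:
  name: example-lease
  labels:
    boskos-lease-id: "test-id"
spec:
  vcpus: 24
  memory: 96
  networks: 1
  # restrict placement to pools carrying these labels (illustrative values)
  poolSelector:
    region: us-east
  # tolerate a taint on a dedicated pool (shape assumed; see doc/scheduling.md)
  tolerations:
    - key: dedicated
      operator: Equal
      value: ci-special
```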

## Networks

22 changes: 22 additions & 0 deletions doc/README.md
@@ -0,0 +1,22 @@
# vSphere Capacity Manager — user documentation

The vSphere Capacity Manager is a Kubernetes operator that tracks **capacity** (vCPU, memory, networks) per vSphere failure domain and **fulfills Leases** by choosing a **Pool** and **Network** that satisfy each request.

## Contents

| Document | Audience |
|----------|----------|
| [Concepts](concepts.md) | What Pool, Lease, and Network mean |
| [How it works](how-it-works.md) | Reconciliation flow and diagrams |
| [Scheduling](scheduling.md) | `poolSelector`, taints, tolerations, exclude / noSchedule |
| [Purpose-built networks](networks-purpose-built.md) | Adding a Network CR and wiring it to a Pool |
| [CLI](cli.md) | `oc` / `kubectl` and the optional `oc-vcm` plugin |
| [Pools and networks inventory](inventory-pools-networks.md) | Snapshot of CRs in one environment (refresh manually) |
| [openshift/release and vsphere-elastic](ci-openshift-release.md) | Boskos, ci-operator `cluster_profile`, step-registry `-vcm` chains |
| [CI / Prow / vsphere-elastic](doc.md) | Job env vars, `SHARED_DIR` files, Vault, step pairs |

Developer build and test commands remain in the [repository README](../README.md).

## API group

All custom resources use API version `vspherecapacitymanager.splat.io/v1`. They are **namespaced**; examples in this repo often use `vsphere-infra-helpers` — use the namespace where your operator runs.
85 changes: 85 additions & 0 deletions doc/ci-openshift-release.md
@@ -0,0 +1,85 @@
# vsphere-elastic, ci-operator, and openshift/release

This page maps how **OpenShift CI** (Prow, **ci-operator**, and the **step-registry**) reaches the vSphere Capacity Manager. Source of truth for paths below is the **[openshift/release](https://github.com/openshift/release)** repository (for example a local clone at `~/Development/release`).

## End-to-end flow

1. **Boskos** hands out an abstract quota slice (names like `vsphere-elastic-0`, `vsphere-elastic-1`, …). Types and resources are defined in [`core-services/prow/02_config/_boskos.yaml`](https://github.com/openshift/release/blob/master/core-services/prow/02_config/_boskos.yaml).
2. **ci-operator** turns a test that declares **`cluster_profile: vsphere-elastic`** in [`ci-operator/config`](https://github.com/openshift/release/tree/master/ci-operator/config) into a ProwJob annotated with **`ci-operator.openshift.io/cloud-cluster-profile: vsphere-elastic`**. Pods for that job see **`CLUSTER_PROFILE_NAME=vsphere-elastic`**.
3. **`ipi-conf-vsphere-check-vcm`** runs only when `CLUSTER_PROFILE_NAME` **is** `vsphere-elastic` (otherwise it exits immediately). It creates **`Lease`** resources (`apiVersion: vspherecapacitymanager.splat.io/v1`) in **`vsphere-infra-helpers`** using **`oc`** and **`SA_KUBECONFIG`** (default in the script: `/var/run/vault/vsphere-ibmcloud-ci/vsphere-capacity-manager-kubeconfig`). It waits until **`status.phase=Fulfilled`**, then writes install metadata under **`${SHARED_DIR}`** (`vsphere_context.sh`, `govc.sh`, `platform.yaml`, `subnets.json`, `LEASE_*.json`, `NETWORK_*.json`, etc.).
4. Other **`*-vcm`** steps read those files. **Legacy** steps (no `-vcm` suffix) do the opposite: they exit early when the profile **is** `vsphere-elastic`, so one workflow can serve both modes.

```mermaid
sequenceDiagram
    participant Prow as Prow and Boskos
    participant Step as check-vcm step
    participant API as Lease API
    participant Op as VCM operator
    participant Shared as SHARED_DIR
    Prow->>Step: LEASED_RESOURCE and secrets
    Step->>API: create Leases
    Op->>API: fulfill Leases
    Step->>API: wait for Fulfilled
    Step->>Shared: write platform and context
```

Scripts: [`ipi-conf-vsphere-check-vcm-commands.sh`](https://github.com/openshift/release/blob/master/ci-operator/step-registry/ipi/conf/vsphere/check/vcm/ipi-conf-vsphere-check-vcm-commands.sh), legacy sibling [`ipi-conf-vsphere-check-commands.sh`](https://github.com/openshift/release/blob/master/ci-operator/step-registry/ipi/conf/vsphere/check/ipi-conf-vsphere-check-commands.sh).

## Step-registry chains

Chains intentionally list **both** legacy and **`-vcm`** steps. Only the branch that matches **`CLUSTER_PROFILE_NAME`** does real work.

**Standard IPI configure chain** — [`ipi/conf/vsphere/ipi-conf-vsphere-chain.yaml`](https://github.com/openshift/release/blob/master/ci-operator/step-registry/ipi/conf/vsphere/ipi-conf-vsphere-chain.yaml):

- `ipi-conf-vsphere-check` then `ipi-conf-vsphere-check-vcm`
- `ipi-conf-vsphere-vips` then `ipi-conf-vsphere-vips-vcm`
- `ipi-conf-vsphere-dns`, `ipi-conf`, `ipi-conf-telemetry`, `ipi-conf-vsphere`, `ipi-conf-vsphere-vcm`, …

**Multi–vCenter IPI configure chain** — [`ipi/conf/vsphere/multi-vcenter/ipi-conf-vsphere-multi-vcenter-chain.yaml`](https://github.com/openshift/release/blob/master/ci-operator/step-registry/ipi/conf/vsphere/multi-vcenter/ipi-conf-vsphere-multi-vcenter-chain.yaml): same check/vips split, then `ipi-conf-vsphere-multi-vcenter` and `ipi-conf-vsphere-vcm`.

## Job configuration (ci-operator)

Set the cluster profile on the test (exact YAML shape depends on repo and file):

```yaml
tests:
- as: example-e2e
  steps:
    cluster_profile: vsphere-elastic
    env:
      POOLS: ""                        # optional: space-separated pool metadata names
      POOL_COUNT: "1"                  # pools to request when not using POOLS
      POOL_SELECTOR: "region=us-east"  # optional: comma-separated key=value → Lease poolSelector
      NETWORK_TYPE: single-tenant      # or multi-tenant, nested-multi-tenant, …
      OPENSHIFT_REQUIRED_CORES: "24"
      OPENSHIFT_REQUIRED_MEMORY: "96"
    workflow: openshift-e2e-vsphere-…
```

Authoritative env list and defaults for the check step: [`ipi-conf-vsphere-check-vcm-ref.yaml`](https://github.com/openshift/release/blob/master/ci-operator/step-registry/ipi/conf/vsphere/check/vcm/ipi-conf-vsphere-check-vcm-ref.yaml). Additional behavior (multi-NIC, multi-network failure domains, Vault-driven defaults) is described in comments at the top of `ipi-conf-vsphere-check-vcm-commands.sh`.

## Selectors and tolerations from CI

| Mechanism | In CI today |
|-----------|-------------|
| **`POOL_SELECTOR`** | Implemented: comma-separated `key=value` pairs are turned into **`spec.poolSelector`** on each created **Lease**. |
| **Tolerations** | Supported on the **Lease** API; see [scheduling.md](scheduling.md). The **check-vcm** script does **not** set tolerations from an environment variable. To use pool taints from Prow, you would extend the step or apply a custom manifest. |
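
To illustrate the `POOL_SELECTOR` contract (this is a hypothetical sketch, not the actual check-vcm code), the comma-separated pairs can be rendered into the map that would populate `spec.poolSelector`:

```sh
# Hypothetical sketch: turn POOL_SELECTOR ("k=v,k=v") into a JSON map.
# The real ipi-conf-vsphere-check-vcm-commands.sh may do this differently.
POOL_SELECTOR="region=us-east,zone=us-east-1a"
selector=$(printf '%s' "$POOL_SELECTOR" | awk -F, '{
  for (i = 1; i <= NF; i++) {
    split($i, kv, "=")                                  # split each key=value pair
    printf "%s\"%s\":\"%s\"", (i > 1 ? "," : ""), kv[1], kv[2]
  }
}')
echo "{$selector}"   # {"region":"us-east","zone":"us-east-1a"}
```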

## Other `-vcm` steps (step-registry)

All under **`ci-operator/step-registry/`**:

| Step | Role (high level) |
|------|---------------------|
| [`ipi/conf/vsphere/check/vcm`](https://github.com/openshift/release/tree/master/ci-operator/step-registry/ipi/conf/vsphere/check/vcm) | Create/wait on **Leases**; populate **`SHARED_DIR`**. |
| [`ipi/conf/vsphere/vips/vcm`](https://github.com/openshift/release/tree/master/ci-operator/step-registry/ipi/conf/vsphere/vips/vcm) | VIP handling for VCM-derived config. |
| [`ipi/conf/vsphere/vcm`](https://github.com/openshift/release/tree/master/ci-operator/step-registry/ipi/conf/vsphere/vcm) | Build **`install-config.yaml`** from VCM context. |
| [`upi/conf/vsphere/vcm`](https://github.com/openshift/release/tree/master/ci-operator/step-registry/upi/conf/vsphere/vcm) | UPI configure path when profile is `vsphere-elastic`. |
| [`upi/conf/vsphere/ova/vcm`](https://github.com/openshift/release/tree/master/ci-operator/step-registry/upi/conf/vsphere/ova/vcm) | OVA step sibling for elastic profile. |
| [`ipi/deprovision/vsphere/diags/vcm`](https://github.com/openshift/release/tree/master/ci-operator/step-registry/ipi/deprovision/vsphere/diags/vcm) | Diagnostics deprovision for VCM workflows. |

## See also

- [doc.md](doc.md) — tables of **`SHARED_DIR`** files, VCM vs legacy step pairs, multi-tenant Vault lists, and job examples.
- [scheduling.md](scheduling.md) — `poolSelector`, taints, tolerations on the CRs themselves.
- [concepts.md](concepts.md) — Pool, Lease, Network.
33 changes: 33 additions & 0 deletions doc/cli.md
@@ -0,0 +1,33 @@
# CLI reference

Replace the namespace if yours differs from `vsphere-infra-helpers`.

## List and inspect CRs

```sh
NS=vsphere-infra-helpers

oc get pools.vspherecapacitymanager.splat.io -n "$NS" -o wide
oc get leases.vspherecapacitymanager.splat.io -n "$NS" -o wide
oc get networks.vspherecapacitymanager.splat.io -n "$NS"
```

Describe a single object:

```sh
oc describe pool.vspherecapacitymanager.splat.io/<name> -n "$NS"
```

## Optional `oc-vcm` plugin

The repo ships a helper script — see [repository README](../README.md#oc-plugin-installation). After installing:

```sh
oc vcm
```

Subcommands include `status`, `networks`, pool cordon/uncordon, exclude/include, VLAN helpers, etc.

## Inventory snapshot

To regenerate the tables in [inventory-pools-networks.md](inventory-pools-networks.md), use the refresh section at the bottom of that file.
39 changes: 39 additions & 0 deletions doc/concepts.md
@@ -0,0 +1,39 @@
# Concepts

## Pool

A **Pool** is one schedulable slice of vSphere capacity: vCenter connection, datacenter / cluster / datastore topology, total vCPU and memory, and the list of **port group paths** that may be used for installs.

- **Status** fields (`vcpus-available`, `memory-available`, `network-available`, `lease-count`) reflect what the operator thinks is still free after fulfilled leases.
- **exclude**: pool is skipped by default scheduling; a lease can still target it with `spec.required-pool` (or match via labels/tolerations as documented in [scheduling](scheduling.md)).
- **noSchedule**: like cordoning a node — existing leases stay; **new** leases are not placed here.
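
A condensed Pool sketch. `exclude` and `noSchedule` are the documented flags and the status field names come from the list above; the capacity fields under `spec` are illustrative, not the exact CRD schema:

```yaml
apiVersion: vspherecapacitymanager.splat.io/v1
kind: Pool
metadata:
  name: example-pool
  namespace: vsphere-infra-helpers
spec:
  exclude: false      # true: skipped unless a lease requires or selects this pool
  noSchedule: false   # true: cordoned — existing leases stay, new ones go elsewhere
  # capacity/topology field names below are illustrative
  vcpus: 240
  memory: 960
status:
  vcpus-available: 120
  memory-available: 480
  network-available: 3
  lease-count: 4
```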

## Lease

A **Lease** is a request for resources: vCPU, memory, number of networks (today **`spec.networks` is 1**), optional storage, and optional **network type** (single-tenant, multi-tenant, etc.).

When the operator can place the lease, **status.phase** becomes **Fulfilled** and status carries failure-domain and env information consumers use to build install configs.

For **multiple failure domains** (e.g. multiple vSphere clusters), create **one Lease per domain**.
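
A minimal Lease, following the example in the repository README (the `network-type` value is one of the types named on this page):

```yaml
apiVersion: vspherecapacitymanager.splat.io/v1
kind: Lease
metadata:
  name: my-job-0
  labels:
    boskos-lease-id: "test-id"   # groups related leases for one job
spec:
  vcpus: 24
  memory: 96
  networks: 1                    # today this is always 1
  network-type: single-tenant
  required-pool: ""              # optional: metadata name of a specific pool
```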

## Network

A **Network** CR describes one vSphere **port group** at a given **pod** / **datacenter**: VLAN, machine CIDR, gateways, etc. Only networks that are both **listed on a Pool** and **not already owned by another lease** can be assigned.

See [Purpose-built networks](networks-purpose-built.md) for how to add one.
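
A Network sketch covering the attributes named above (all field names are illustrative; see [Purpose-built networks](networks-purpose-built.md) for the authoritative shape):

```yaml
apiVersion: vspherecapacitymanager.splat.io/v1
kind: Network
metadata:
  name: ci-vlan-1287
spec:
  # illustrative field names for the attributes this page describes
  portGroup: /DC0/network/ci-vlan-1287
  vlanId: 1287
  machineNetworkCidr: 192.168.128.0/25
  gateway: 192.168.128.1
```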

## How they connect

```mermaid
flowchart LR
subgraph resources [Custom resources]
L[Lease]
P[Pool]
N[Network]
end
L -->|scheduled onto| P
P -->|topology lists port groups| N
L -->|claims| N
```

The operator’s job is to pick a **Pool** with enough free capacity, then a **Network** that matches the lease’s **network-type** and is tied to that pool.
4 changes: 4 additions & 0 deletions doc/doc.md
@@ -1,5 +1,9 @@
# Overview

Operator concepts, scheduling, and CLI: see [doc/README.md](README.md).

Where this fits in **openshift/release** (Boskos, chains, `ipi-conf-vsphere-check-vcm`): [ci-openshift-release.md](ci-openshift-release.md).

vSphere CI requires multiple environments to support the number of jobs and the various required configurations. Historically, this has been handled by creating purpose-targeted lease pools. While this has worked, some environments are overutilized while others sit idle. The VCM schedules jobs to the most appropriate environment based on each job's requirements and the utilization of the environments.

# Job Configuration
31 changes: 31 additions & 0 deletions doc/how-it-works.md
@@ -0,0 +1,31 @@
# How it works

## High-level flow

1. A **Lease** is created (or updated) with CPU, memory, network count, and optional scheduling constraints.
2. The operator finds **Pool**(s) that fit capacity and policy ([scheduling](scheduling.md)).
3. For each pool, it looks for a free **Network** compatible with `spec.network-type`.
4. When successful, it updates **Lease status** (phase **Fulfilled**, pool info, env snippets) and records ownership so the network is not double-booked.

```mermaid
stateDiagram-v2
[*] --> Pending
Pending --> Partial: partial allocation
Pending --> Fulfilled: all requirements met
Partial --> Fulfilled: remaining work done
Pending --> Failed: unrecoverable error
Fulfilled --> [*]: lease released
Failed --> [*]: lease released
```

Phases are defined in the API (for example `Pending`, `Partial`, `Fulfilled`, `Failed`). Conditions on the Lease give more detail while work is in progress.

## Related leases and networks

When several leases share the same **boskos-lease-id** label and the **same vCenter**, the operator tries to give them a **consistent network** story so multi–failure-domain jobs can coordinate. (See [repository README](../README.md) for the short bullet list.)

## Where to go next

- [Scheduling](scheduling.md) — labels, taints, `required-pool`
- [CLI](cli.md) — inspect Pools, Leases, Networks
- [CI-focused detail](doc.md) — Prow, `vsphere-elastic`, files under `SHARED_DIR`