Docs: Add job dispatch and resource tiers documentation#271

Open
erick-GeGe wants to merge 2 commits into main from docs/job-dispatch

Conversation

@erick-GeGe
Contributor

Description

Please include a summary of the changes, relevant motivation and context.

Issue

  • Github Issue ID.

Checklist before requesting a review

  • I have performed a self-review of my code.
  • My code follows the style guidelines of this project.
  • I have made corresponding changes to the documentation.
  • New and existing tests pass locally with my changes.
  • If this change is a core feature, I have added thorough tests.
  • If this change affects or depends on the behavior of other estela repositories, I have created pull requests with the relevant changes in the affected repositories. Please, refer to our official documentation.
  • I understand that my pull request may be closed if it becomes obvious that I did not perform all of the steps above.

@erick-GeGe erick-GeGe requested a review from joaquingx March 27, 2026 14:01
Contributor

@joaquingx joaquingx left a comment


Good coverage of the dispatch system, but I think this reads more like internal code notes than documentation. A few suggestions:

Too much implementation detail

  • Things like the Redis lock command (SET spider_jobs_lock 1 NX EX 120), internal function names (_get_cluster_resources(), _dispatch_single_job()), and the "Key Files" table are implementation details that will go stale as code changes. These belong in code comments or docstrings, not in docs.

Suggestion: Simplify to focus on what users need to know (tiers, statuses, config options) and drop the code-level details. The code should document itself.
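For context on the lock command quoted in the comment above, the semantics of `SET spider_jobs_lock 1 NX EX 120` (set only if the key is absent, expire after 120 seconds) can be sketched with a minimal in-memory stand-in. The key name and TTL mirror the command in the bullet; the class itself is purely illustrative, not estela code:

```python
import time

class FakeRedis:
    """Minimal stand-in that models Redis SET ... NX EX semantics."""
    def __init__(self):
        self._store = {}  # key -> (value, expiry timestamp or None)

    def set(self, key, value, nx=False, ex=None):
        now = time.monotonic()
        entry = self._store.get(key)
        if entry and entry[1] is not None and entry[1] <= now:
            entry = None  # key has expired; treat it as absent
        if nx and entry is not None:
            return None  # NX: refuse to overwrite a live key
        self._store[key] = (value, now + ex if ex else None)
        return True

r = FakeRedis()
first = r.set("spider_jobs_lock", 1, nx=True, ex=120)   # acquires the lock
second = r.set("spider_jobs_lock", 1, nx=True, ex=120)  # a second dispatcher is refused
print(first, second)  # True None
```

With the real redis-py client the equivalent call is `r.set("spider_jobs_lock", 1, nx=True, ex=120)`, which returns `True` on acquisition and `None` when the lock is already held, which is what lets a second dispatch cycle exit immediately.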

exits immediately.

2. **Fetch queued jobs**: Queries jobs with `IN_QUEUE` status, ordered by creation
date (FIFO), limited to `RUN_JOBS_PER_LOT` (default 100, 1000 in production).
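The fetch step quoted above (oldest `IN_QUEUE` jobs first, capped at `RUN_JOBS_PER_LOT`) amounts to a filtered, sorted, limited query. A plain-Python sketch of that selection logic, with hypothetical job records standing in for the ORM rows:

```python
from datetime import datetime

RUN_JOBS_PER_LOT = 100  # default in base.py; prod.py raises it to 1000 (per this thread)

# Hypothetical job records standing in for queued spider jobs.
jobs = [
    {"id": 3, "status": "IN_QUEUE", "created": datetime(2026, 3, 27, 12, 5)},
    {"id": 1, "status": "IN_QUEUE", "created": datetime(2026, 3, 27, 12, 0)},
    {"id": 2, "status": "RUNNING",  "created": datetime(2026, 3, 27, 12, 1)},
]

def fetch_queued(jobs, limit=RUN_JOBS_PER_LOT):
    """Oldest IN_QUEUE jobs first (FIFO), capped at `limit`."""
    queued = [j for j in jobs if j["status"] == "IN_QUEUE"]
    queued.sort(key=lambda j: j["created"])
    return queued[:limit]

print([j["id"] for j in fetch_queued(jobs)])  # [1, 3]
```

In Django ORM terms this would correspond to something like `SpiderJob.objects.filter(status=IN_QUEUE).order_by("created")[:RUN_JOBS_PER_LOT]`; the model and field names there are assumptions for illustration, not taken from the estela source.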
Contributor


Is 1000 the recommended value for production?

Contributor Author


It's not a recommendation — it's the current value in config/settings/prod.py. The default in base.py is 100, but prod.py overrides it to 1000.
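A settings fragment illustrating the override described in this reply. The file paths come from the reply itself; the exact file contents are an assumption:

```python
# config/settings/base.py
RUN_JOBS_PER_LOT = 100   # default lot size per dispatch cycle

# config/settings/prod.py
from .base import *  # noqa: F401,F403
RUN_JOBS_PER_LOT = 1000  # current production value, not a documented recommendation
```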

## Cluster Resource Checking

The `_get_cluster_resources()` function queries the K8s API to determine available
capacity on worker nodes. Nodes are selected by label (`role=<SPIDER_NODE_ROLE>`,
Contributor


What happens if DEDICATED_SPIDER_NODES is set to true?

Contributor Author


When DEDICATED_SPIDER_NODES=True, spider pods are scheduled only on nodes labeled with role=<SPIDER_NODE_ROLE>, and _get_cluster_resources() checks capacity only on those nodes. When set to False, pods have no nodeSelector and can land on any node — but the capacity check won't work accurately since it doesn't know which nodes to measure.
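The effect described in this reply can be sketched as the pod-spec construction below. This is a hypothetical helper, not estela's actual code; it assumes only the Kubernetes `nodeSelector` field and the `role=<SPIDER_NODE_ROLE>` label mentioned in the quoted docs:

```python
SPIDER_NODE_ROLE = "spider"  # assumed label value, for illustration only

def build_pod_spec(dedicated_spider_nodes: bool) -> dict:
    """Sketch: attach a nodeSelector only when dedicated spider nodes are enabled."""
    spec = {
        "containers": [{"name": "spider", "image": "estela-spider:latest"}],
    }
    if dedicated_spider_nodes:
        # Pods land only on nodes labeled role=<SPIDER_NODE_ROLE>, so a
        # capacity check restricted to those nodes stays accurate.
        spec["nodeSelector"] = {"role": SPIDER_NODE_ROLE}
    # When disabled, no nodeSelector is set: pods can land on any node,
    # and a label-based capacity check has no defined node set to measure.
    return spec

print(build_pod_spec(True).get("nodeSelector"))   # {'role': 'spider'}
print(build_pod_spec(False).get("nodeSelector"))  # None
```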

Comment on lines +128 to +133
- **`MULTI_NODE_MODE` must be `"True"`**: This is **critical**. When `MULTI_NODE_MODE`
is enabled, spider pods are scheduled with a `nodeSelector` matching `SPIDER_NODE_ROLE`,
and `_get_cluster_resources()` queries only those labeled nodes. If `MULTI_NODE_MODE`
is `"False"`, pods have no `nodeSelector` and the capacity check has no way to
accurately measure available resources. The sequential dispatch system is designed
to work with `MULTI_NODE_MODE=True`.
Contributor


If this is mandatory, why is there an option to deactivate it?

Contributor Author


It's not strictly mandatory — it's the recommended setup for production or any infrastructure running many spiders at scale. With DEDICATED_SPIDER_NODES=True, you get accurate capacity checking and isolation between spider workloads and system components. However, for smaller setups where everything runs on one or a few nodes, you may want spiders to be scheduled on any available node, so the option exists for that flexibility.
