
6 scheduling #36

Merged

wvengen merged 28 commits into q-m:main from vlerkin:6-scheduling
Apr 1, 2025

Conversation

@vlerkin
Collaborator

@vlerkin vlerkin commented Nov 7, 2024

What happens in the PR:

  1. The event-watching logic was separated into an observer class; the log-watching logic stayed in a log handler class, but its initialization was changed to subscribe to the event watcher when the joblogs feature is configured;
  2. A new class, KubernetesScheduler, was created to decide when jobs must be unsuspended and in what order;
  3. The schedule endpoint was modified, and logic was added to set a value for the start_suspended parameter;
  4. The schedule method of the k8s launcher has a new start_suspended parameter whose value is passed when it is called inside the API. New methods were also added: unsuspend_job patches an existing suspended job with suspend=False, get_running_jobs_count returns the number of jobs that are currently running, list_suspended_jobs returns the list of jobs where spec.suspend is true, and _get_job_name extracts the job name from the metadata for use by unsuspend_job;
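The unsuspend step in point 4 boils down to a single Job patch. A minimal sketch of the call shape, using the official kubernetes Python client's `patch_namespaced_job`; the job name and namespace are made up, and a mock stands in for a real cluster connection:

```python
from unittest.mock import MagicMock

def unsuspend_job(batch_api, job_name, namespace="default"):
    # Clearing spec.suspend lets the Job controller create the job's pods.
    return batch_api.patch_namespaced_job(
        name=job_name, namespace=namespace, body={"spec": {"suspend": False}}
    )

# With a real cluster, batch_api would be kubernetes.client.BatchV1Api();
# a MagicMock records the call so the shape can be shown without a cluster.
batch_api = MagicMock()
unsuspend_job(batch_api, "scrapyd-job-example")
```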

The big picture:
The event watcher connects to the k8s API and receives the stream of events; when a new event arrives, it notifies the subscribers by passing the event to the callback they provided. The subscriber, KubernetesScheduler, receives the event in its handle_pod_event method, which reacts to changes in job statuses: if a job completed or failed, it calls check_and_unsuspend_jobs, which checks capacity and unsuspends jobs until the number of allowed parallel jobs is reached. While doing this, it relies on get_next_suspended_job_id to unsuspend the earliest-scheduled suspended job, preserving the order in which jobs were initially scheduled.
When a job is scheduled, based on the number of currently active jobs and the max_proc value provided in the config (default is 4), the job either runs or goes to the queue of suspended jobs (a native k8s queue). Events that change the number of active jobs then trigger the KubernetesScheduler logic, which unsuspends jobs until the desired number of parallel jobs is reached.
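The flow above can be sketched roughly as follows. This is not the actual scrapyd-k8s code: the method names follow the PR description, but the data shapes and the in-memory launcher stand-in are invented for illustration:

```python
class KubernetesScheduler:
    def __init__(self, launcher, max_proc=4):
        self.launcher = launcher
        self.max_proc = max_proc

    def handle_pod_event(self, event):
        # Only a pod that completed or failed frees capacity.
        phase = event.get("object", {}).get("status", {}).get("phase")
        if phase in ("Succeeded", "Failed"):
            self.check_and_unsuspend_jobs()

    def get_next_suspended_job_id(self):
        # Earliest-scheduled job first, to preserve the original order.
        jobs = sorted(self.launcher.list_suspended_jobs(),
                      key=lambda j: j["metadata"]["creation_timestamp"])
        return jobs[0]["metadata"]["name"] if jobs else None

    def check_and_unsuspend_jobs(self):
        # Unsuspend until max_proc jobs run or the queue is empty.
        while self.launcher.get_running_jobs_count() < self.max_proc:
            job_id = self.get_next_suspended_job_id()
            if job_id is None:
                break
            self.launcher.unsuspend_job(job_id)

# Tiny in-memory launcher stand-in, just to show the flow end to end.
class FakeLauncher:
    def __init__(self):
        self.running = 3
        self.suspended = [{"metadata": {"name": "b", "creation_timestamp": 2}},
                          {"metadata": {"name": "a", "creation_timestamp": 1}}]
    def get_running_jobs_count(self):
        return self.running
    def list_suspended_jobs(self):
        return self.suspended
    def unsuspend_job(self, name):
        self.suspended = [j for j in self.suspended
                          if j["metadata"]["name"] != name]
        self.running += 1

launcher = FakeLauncher()
scheduler = KubernetesScheduler(launcher, max_proc=4)
scheduler.handle_pod_event({"object": {"status": {"phase": "Succeeded"}}})
# One free slot, so only the earliest suspended job ("a") is unsuspended.
```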

@vlerkin vlerkin requested a review from wvengen November 7, 2024 17:25
Member

@wvengen wvengen left a comment


Ah, nice you were able to come up with something so quickly already!
I looked at it from a high level and noticed that this is currently implemented for Kubernetes only (which makes sense), but set up in such a way that it will need refactoring for Docker. I would think of the scheduler as something that could work for both Docker and Kubernetes, especially for the scheduling decisions. Also, there is now k8s-specific code in the main file (e.g. the import) as well as in the kubernetes scheduler, which makes the code somewhat spaghetti: there are implementation-specific classes to which that responsibility is meant to be delegated. If you need to access the scheduler in the main file, use a generic scheduler, and mark the Docker-based parts as not implemented. I think that would give a much cleaner design.

Also, I would consider making the launcher responsible for scheduling. And then have the scheduler talk to the launcher to actually start jobs.

I'm not yet sure if we should allow running without the scheduler, or if it would always be active.
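The generic-scheduler delegation suggested above could look roughly like this. A hypothetical sketch, not the project's actual classes: the shared decision logic lives in a backend-agnostic base class, and the Docker hooks start out unimplemented:

```python
from abc import ABC, abstractmethod

class Scheduler(ABC):
    """Backend-agnostic scheduler: the main file only sees this interface,
    while k8s/Docker subclasses implement the backend-specific hooks."""
    def __init__(self, max_proc=4):
        self.max_proc = max_proc

    @abstractmethod
    def running_count(self):
        """Number of jobs currently running on this backend."""

    @abstractmethod
    def unsuspend_next(self):
        """Start the next queued job; return False when the queue is empty."""

    def check_and_unsuspend(self):
        # The scheduling decision itself is shared across backends.
        while self.running_count() < self.max_proc:
            if not self.unsuspend_next():
                break

class DockerScheduler(Scheduler):
    # The Docker parts start out not implemented, as suggested above.
    def running_count(self):
        raise NotImplementedError("scheduling not implemented for Docker yet")

    def unsuspend_next(self):
        raise NotImplementedError("scheduling not implemented for Docker yet")
```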

@wvengen
Member

wvengen commented Nov 8, 2024

Hope my feedback was at an angle that helps you at this stage. In any case, well done, keep it going!

p.s. the CI error looks like it could be caused by Kubernetes-specific things having entered the main api code, which wouldn't work when running with Docker.

@vlerkin
Collaborator Author

vlerkin commented Nov 11, 2024

Working on Docker implementation to be added to this PR

Member

@wvengen wvengen left a comment


Great to see a working version! Quite readable :)
I think it needs a little cleanup, but you're getting there, I think.

@vlerkin
Collaborator Author

vlerkin commented Nov 20, 2024

I have problems because I partially split off this PR, and now I have multiple diverging versions that I need to reconcile into a single source of truth. This is going to take an uncertain amount of time.

@wvengen
Member

wvengen commented Nov 21, 2024

The way I would do this:

  1. Continue working on this PR until you need the functionality developed in the other PR (or until it has been merged).
  2. Interactively rebase on the branch of the other PR. Filter out the commits you had here that you rewrote in the other branch.
  3. There may be little or much work to do in resolving conflicts. If there are really many, in various commits, you may consider another route (see below).
  4. Test, done.

If this is much work in many commits, you may consider first doing an interactive rebase of this PR, to simplify it and reduce the number of commits (that each may need amending).

Yes, this is a bit of work, but something I come across now and then, in various projects.
Sorry for the complexity!

@vlerkin
Collaborator Author

vlerkin commented Nov 21, 2024

Thank you for the advice!
I was thinking of dropping the commit that merged main into this branch, then making the code work so the tests run, if needed. Then merging with the other branch that refactored the observer further, making the code of both branches work together, and then checking for conflicts with main and resolving those. This is a slightly longer route than simply redoing the merge with the main branch, but I messed up the last merge because I lost track of changes, so gradually rebuilding this branch is a bit easier for me.

No worries, this is me who messed up merging, complexity is part of the job:D Learning to make more granular commits and cleaner PRs the hard way:D

@vlerkin vlerkin force-pushed the 6-scheduling branch 2 times, most recently from c8b35ad to 6394633 on November 21, 2024 17:38
@vlerkin
Collaborator Author

vlerkin commented Nov 27, 2024

I modified one of the methods in the scheduler (get_next_suspended_job_id) to handle the case where a job does not have a creation_timestamp. This is not expected, but if someone used a custom resource and forgot to add this field, or made any other error, the job gets a timestamp assigned and is processed like the other jobs in the queue.
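The fallback described here might look like the following sketch. The dict shapes and field names are assumed for illustration, not taken from the actual scrapyd-k8s code:

```python
from datetime import datetime, timezone

def get_next_suspended_job_id(suspended_jobs):
    """Pick the earliest-scheduled suspended job; a job missing
    creation_timestamp gets one assigned (now), so it sorts like a freshly
    scheduled job instead of crashing the scheduler."""
    def creation_key(job):
        meta = job.get("metadata", {})
        if meta.get("creation_timestamp") is None:
            # Fallback for malformed jobs, e.g. from a hand-made resource.
            meta["creation_timestamp"] = datetime.now(timezone.utc)
        return meta["creation_timestamp"]

    ordered = sorted(suspended_jobs, key=creation_key)
    return ordered[0]["metadata"]["name"] if ordered else None
```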

Also, there are now unit tests that cover different scenarios for the scheduler.

If you have any other comments for improvements, let me know!

@wvengen
Member

wvengen commented Jan 21, 2025

Could you please resolve conflicts on main, so that I can see what this PR specifically changes?

Member

@wvengen wvengen left a comment


At first glance, well done! Seems to do the job (though haven't tested it).
An integration test would increase my confidence that it performs well (as it is, I'd have to run it locally to see if it actually works - I think it would, but I would still feel obliged to check).

Also, I think the scheduling logic is now implemented in K8s and Docker separately. Would it make sense to have a single piece of code decide when to schedule, and let e.g. the launcher and listener be the interface to K8s/Docker? Haven't thought this fully through, but the question comes up.

Some questions and notes remain, otherwise it's well on the way.

@vlerkin
Collaborator Author

vlerkin commented Jan 28, 2025

The launcher is already sort of an interface, since every implementation uses its own launcher. The listener cannot be an interface because the event watcher is a native Kubernetes thing, if I understand your comment correctly. So for k8s we still use the watcher to start suspended jobs when at least one job is done or deleted, and for Docker we have a background thread that checks the state of existing containers, as we discussed previously.

vlerkin added 12 commits March 4, 2025 14:25
…logic for observer in a ResourceWatcher class; added method to stop a thread gracefully
…that handles the logic to unsuspend jobs and get the next in order according to the creation timestamp; modify schedule endpoint to start jobs suspended if there are already enough jobs running; modify corresponding function in k8s launcher; add to k8s launcher methods to unsuspend job, to get current number of running jobs, to list suspended jobs and a private method to get job name to be used for unsuspend function
…source watcher instance to enable_joblogs to subscribe to the event watcher if the log feature is configured; delete logic about event watcher from main; pass container for list objects function instead of container name; remove start method from log handler class; modify joblogs init to subscribe to event watcher
…rs and run more from the queue of created jobs when capacity is available; add background thread that sleeps for 5 sec and triggers the function that starts additional containers up to capacity; add a method to gracefully stop the background thread that might be used in the future to stop the thread when app stops; encapsulate k8s and docker related schedule functionality in corresponding launchers and keep api.py launcher agnostic; add max_proc to config for docker
…nnect loop for event watcher; make number of reconnect attempts, backoff time and a coefficient for exponential growth configurable via config; add backoff_time, reconnection_attempts and backoff_coefficient as attributes to the resource watcher init; add resource_version as a param to w.stream so a failed stream can read from the last resource it was able to catch; add urllib3.exceptions.ProtocolError and handle reconnection after some exponential backoff time to avoid api flooding; add config as a param for init for resource watcher; modify config in kubernetes.yaml and k8s config to contain add backoff_time, reconnection_attempts and backoff_coefficient
…and a label selector to make the code in listjobs, get_running_jobs and list_suspended_jobs DRY; refactor listjobs to use the helper function with the existing _parse_job as a filter_func parameter
…unction because list jobs uses a different logic
… connection to the k8s was achieved so only sequential failures detected; add exception handling to watch_pods to handle failure in urllib3, when source version is old and not available anymore, and when stream is ended; remove k8s resource watcher initialization from run function in api.py and move it to k8s.py launcher as _init_resource_watcher; refactor existing logic from joblogs/__init__.py to keep it in _init_resource_watcher and enable_joblogs in k8s launcher
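The reconnect behaviour described in these commits can be sketched as follows. The parameter names are taken from the config keys mentioned above; `connect` is a stand-in for opening the k8s watch stream, and for brevity this sketch does not reset the attempt counter after a successful connection, which the commits above do so that only sequential failures count:

```python
import time

def watch_with_backoff(connect, reconnection_attempts=5,
                       backoff_time=1.0, backoff_coefficient=2.0):
    """Retry opening the watch stream with exponentially growing delays,
    to avoid flooding the k8s API after a dropped connection."""
    for attempt in range(reconnection_attempts):
        try:
            return connect()
        except ConnectionError:
            delay = backoff_time * backoff_coefficient ** attempt
            time.sleep(delay)  # 1s, 2s, 4s, ... with the defaults
    raise RuntimeError(
        f"watch stream failed after {reconnection_attempts} reconnection attempts")
```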
@wvengen
Member

wvengen commented Mar 4, 2025

Rebased on main, adapted integration tests to setup with different configuration files.

@wvengen wvengen force-pushed the 6-scheduling branch 9 times, most recently from 9a9e29b to f8418ea on March 5, 2025 09:33
@wvengen
Member

wvengen commented Mar 10, 2025

I'm not fully happy with the current integration tests, which use a shell script to patch the k8s setup. It might be cleaner to add a YAML file with the desired k8s manifest (in this case, for the role), and let the CI script save the cluster state on cluster setup and restore it before running a test (kubectl apply -f first the pristine cluster state, then the test-specific resource; that would even save a scale down, as it is part of the pristine cluster state - but not the waiting on it, so that part perhaps remains useful to keep).
The downside would be that the full role state is still necessary, so any change to scrapyd-k8s's required roles would need to be included in the test-specific role manifest as well; in that respect, a patch is actually more to the point.
So perhaps it is fine as it is, just wanted to share my thoughts here.

@vlerkin
Collaborator Author

vlerkin commented Mar 10, 2025

Wait until the second is finished too

listjobs_wait(jobid2, 'finished', max_wait=STATIC_SLEEP+MAX_WAIT)

Just curious why would you wait until the second job is done? If it was scheduled after the first job is done then the feature works properly, I am not sure what we are testing here with the waiting for the second one.

@vlerkin
Collaborator Author

vlerkin commented Mar 10, 2025

I would group the unit tests under 4 classes:

TestKubernetesSchedulerInitialization: test_k8s_scheduler_init, test_k8s_scheduler_init_invalid_max_proc

TestPodEventHandling: test_handle_pod_event_with_non_dict_event, test_handle_pod_event_pod_missing_status, etc.

TestJobSuspensionManagement: test_check_and_unsuspend_jobs_with_capacity_and_suspended_jobs, test_check_and_unsuspend_jobs_no_suspended_jobs, etc.

TestSuspendedJobSelection: test_get_next_suspended_job_id_with_suspended_jobs, test_get_next_suspended_job_id_no_suspended_jobs, etc.

This is much more readable and makes it easier to add or change tests. What do you think?

@vlerkin
Collaborator Author

vlerkin commented Mar 10, 2025

And to make the code more compact, we could use @pytest.mark.parametrize to avoid code duplication.
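For example, with a hypothetical `has_capacity` helper standing in for the scheduler logic under test:

```python
import pytest

# Hypothetical helper standing in for the scheduler's capacity check.
def has_capacity(running_jobs, max_proc):
    return running_jobs < max_proc

@pytest.mark.parametrize("running_jobs,max_proc,expected", [
    (0, 4, True),   # empty cluster
    (3, 4, True),   # one slot left
    (4, 4, False),  # at capacity
    (7, 4, False),  # over capacity
])
def test_has_capacity(running_jobs, max_proc, expected):
    # One test function covers all four cases, each reported separately.
    assert has_capacity(running_jobs, max_proc) == expected
```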

@wvengen
Member

wvengen commented Mar 10, 2025

Hi 👋 Thanks for your input!

Just curious why would you wait until the second job is done? If it was scheduled after the first job is done then the feature works properly, I am not sure what we are testing here with the waiting for the second one.

As a general note, the integration tests are, well, integration tests, meant to test the system as a whole. Here that means it is not strictly necessary to wait on the second job, but it is still part of the expected flow, and it doesn't hurt to test it. These tests are more like what a user would expect when using the system; they don't handle specific edge cases in isolation, but may include edge cases in the integrated flow.

TestKubernetesSchedulerInitialization, TestPodEventHandling, TestJobSuspensionManagement, TestSuspendedJobSelection

These tests sound Kubernetes-specific, and are not really about testing a full interaction cycle with scrapyd-k8s using its API only. Therefore they don't really belong in the (current) integration tests, I think.

There are probably specific cases to cover, as you write. Very useful to know about. Maybe we could express them as full integration tests that trigger a certain corner case and should work in a certain way, regardless of backend (k8s/docker).

If we want to test the launchers (incl. the possible schedulers), then we'd need another kind of tests, perhaps docker- and k8s-specific tests, that check how a cluster/node responds to launcher commands, and vice versa. Here we might test the surface API of the launcher (instead of the REST API). This is a kind of test we don't have yet (all tests are now backend-agnostic).

In the early stages of this project, to keep the testing work manageable, no backend-specific tests were created. There may come a time when this project grows and needs backend-specific tests, but I see the overhead as a bit too much as of now - as long as we can cover enough ground with the backend-independent integration tests.

test_k8s_scheduler_init_invalid_max_proc

This seems like it would be a separate integration test with a config file having an invalid value. I think the API does not expose this, so either the daemon does not start at all, or it runs with a reduced feature set. Neither is reported through the API, so this cannot be well tested now. It requires a different kind of testing setup, currently out of scope; I think it belongs in a different issue, on revising the testing infrastructure.

Note that I'd really like to keep a distinction between running actual tests, and setting up the daemon under test in an environment.

@vlerkin
Collaborator Author

vlerkin commented Mar 11, 2025

These tests sound Kubernetes-specific, and are not really about testing a full interaction cycle with scrapyd-k8s using its API only. Therefore they don't really belong in the (current) integration tests, I think.

They are unit tests; I just looked at them again and suggested possible improvements. It was probably a bit confusing :)

@vlerkin
Collaborator Author

vlerkin commented Mar 11, 2025

I don't really see any improvements for the integration tests; they look good as they are.

@vlerkin
Collaborator Author

vlerkin commented Mar 27, 2025

Hey Willem, is there anything you would like to add to this PR?

Member

@wvengen wvengen left a comment


Some small notes, but I think it is mostly ready to merge! 🎉

@vlerkin
Collaborator Author

vlerkin commented Mar 28, 2025

I made the changes you mentioned; could you please check if this is what you expected?

Member

@wvengen wvengen left a comment


Almost there!

It's always good to take a fresh look at the PR yourself after you haven't looked at it for a while. And then you're still bound to miss some things that have become too familiar. Thanks for taking the time to finish the last bits!

Member

@wvengen wvengen left a comment


Thank you!

@wvengen wvengen merged commit 078c57c into q-m:main Apr 1, 2025
5 checks passed