Support max-allowed Pods for nodes #3
Conversation
@shivramsrivastava The PR's ready.
    string core_id_substr = label.substr(idx + 4, label.size() - idx - 4);
    uint32_t core_id = strtoul(core_id_substr.c_str(), 0, 10);
    float available_cpu_cores =
        latest_stats.cpus_stats(core_id).cpu_capacity() *
I feel a bit confused about why we need to touch the CPU stats?
Here we read the latest machine stats sample for a machine, fetch a particular core's CPU utilization and available CPU, and store it in the PU's resource descriptor. We then accumulate the combined CPU stats of all PUs (PU = core; a machine may have more than one PU) at the machine's resource descriptor. When deciding costs and drawing arcs, we use the accumulated available CPU from the machine's resource descriptor.
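To make that data flow concrete, here is a minimal, self-contained sketch of the per-PU computation and machine-level accumulation. The struct names (PuStats, ResourceDescriptor) are simplified stand-ins for the Firmament protobufs, not the real types.

```cpp
#include <iostream>
#include <vector>

struct PuStats {
  float cpu_capacity;     // total cores-worth of capacity on this PU
  float cpu_utilization;  // fraction currently in use, 0.0 .. 1.0
};

struct ResourceDescriptor {
  float available_cpu_cores = 0.0f;
};

int main() {
  // Latest machine stats sample: one entry per PU (core).
  std::vector<PuStats> latest_stats = {{1.0f, 0.25f}, {1.0f, 0.50f}};

  // Per-PU resource descriptors plus the machine-level descriptor that
  // accumulates the combined available CPU of all PUs.
  std::vector<ResourceDescriptor> pu_rds(latest_stats.size());
  ResourceDescriptor machine_rd;

  for (size_t core_id = 0; core_id < latest_stats.size(); ++core_id) {
    // Available CPU on this PU = capacity * (1 - utilization).
    float available = latest_stats[core_id].cpu_capacity *
                      (1.0f - latest_stats[core_id].cpu_utilization);
    pu_rds[core_id].available_cpu_cores = available;
    machine_rd.available_cpu_cores += available;  // accumulate at machine RD
  }

  // The cost model would then use machine_rd.available_cpu_cores when drawing arcs.
  std::cout << "machine available cpu: " << machine_rd.available_cpu_cores << std::endl;
  return 0;
}
```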
    }
    // Running/idle task count
    rd_ptr->set_num_running_tasks_below(rd_ptr->current_running_tasks_size());
    rd_ptr->set_num_slots_below(FLAGS_max_tasks_per_pu);
Shouldn't we set rd.num_slots_below()?
The current Firmament code fixes the maximum number of pods that can be scheduled on a PU via the constant FLAGS_max_tasks_per_pu. Because of this, when the number of running tasks on a PU equals FLAGS_max_tasks_per_pu, the capacity of the arc from Machine to PU is set to zero, so we are restricted to scheduling only FLAGS_max_tasks_per_pu tasks/pods on the PU. To remove this restriction, we need to set rd.num_slots_below() to the maximum number of pods that can be scheduled on that PU, and only for PU nodes.
The code changes could look like this:
1) Add a new field 'max_pods' to ResourceDescriptor, which gets its value from the kubelet parameter 'max-pods' only once, when the machine is added.
2) For PU nodes only, while updating the resource descriptor, update num_slots_below to max_pods, like below:
rd.set_num_slots_below(machine_rd.max_pods());
This way we can schedule max_pods pods on that PU, not just FLAGS_max_tasks_per_pu (see the sketch below).
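As a rough illustration of steps 1) and 2), here is a minimal sketch. The field name max_pods and the UpdateSlots helper are hypothetical stand-ins mirroring the proposal above, not the actual Firmament protobufs or code.

```cpp
#include <cstdint>
#include <iostream>

enum class ResourceType { MACHINE, PU };

struct ResourceDescriptor {
  ResourceType type = ResourceType::MACHINE;
  uint64_t max_pods = 0;         // new persistent field, set once from kubelet --max-pods
  uint64_t num_slots_below = 0;  // recomputed every scheduling round
};

// Called while updating resource statistics each scheduling round.
void UpdateSlots(ResourceDescriptor* rd, const ResourceDescriptor& machine_rd) {
  if (rd->type == ResourceType::PU) {
    // For PU nodes, allow up to the machine's max_pods instead of the
    // fixed FLAGS_max_tasks_per_pu constant.
    rd->num_slots_below = machine_rd.max_pods;
  }
}

int main() {
  ResourceDescriptor machine;
  machine.max_pods = 110;  // e.g. the kubelet default for --max-pods

  ResourceDescriptor pu;
  pu.type = ResourceType::PU;
  UpdateSlots(&pu, machine);
  std::cout << "PU slots: " << pu.num_slots_below << std::endl;  // prints 110
  return 0;
}
```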
Currently max-pods is passed to num_slots_below. Do we need a new field?
Since num_slots_below gets recomputed every scheduling round, this value does not persist, so it is better to add a new field.
    rd_ptr->mutable_available_resources()->set_cpu_cores(
        available_cpu_cores);
    }
    // Running/idle task count
353 to 324 is already done at 322 to 324, so it should be removed. When accumulator->type_ is PU, both code sections get executed, because 'other' is 'SINK' when the accumulator is 'PU'.
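A tiny illustrative sketch of why both sections run for the same node pair; the GatherStats function and NodeType enum here are simplified stand-ins for the real traversal callback, not Firmament code.

```cpp
#include <iostream>

enum class NodeType { MACHINE, PU, SINK };

void GatherStats(NodeType accumulator, NodeType other) {
  if (accumulator == NodeType::PU && other == NodeType::SINK) {
    // First section: update the PU's running-task count and slots.
    std::cout << "first section: update PU running tasks/slots\n";
  }
  if (accumulator == NodeType::PU) {
    // Duplicated section: also runs for the same (PU, SINK) pair,
    // which is why the earlier section makes this one redundant.
    std::cout << "second section: update PU running tasks/slots (redundant)\n";
  }
}

int main() {
  GatherStats(NodeType::PU, NodeType::SINK);  // both sections execute
  return 0;
}
```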
5ba6239 to 972b86e
@shivramsrivastava PTAL
TL;DR: Schedulers should be cognizant about
Priority: Instead of discrete logic for the MAX permissible Pods on a Node, would you encourage us to have more condition-tree-based scheduling?
Schedulers could honor Pod Priority [
Predicate Context: It's a good start to have a
Suggestion: If we abstract the rate limit to be a type of Predicate for scheduling, we could have an elegant predicate-based Poseidon scheduling. Reasons:
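As a rough sketch of what treating the max-pods limit as one predicate among others could look like: all names here (Predicate, Node, Pod) are illustrative only, not Poseidon or Kubernetes APIs.

```cpp
#include <cstdint>
#include <functional>
#include <iostream>
#include <string>
#include <vector>

struct Node {
  std::string name;
  uint64_t running_pods;
  uint64_t max_pods;
  double free_cpu_cores;
};

struct Pod {
  double cpu_request;
};

// A predicate is any hard filter over a (pod, node) pair.
using Predicate = std::function<bool(const Pod&, const Node&)>;

int main() {
  std::vector<Predicate> predicates = {
      // Max-pods expressed as a predicate rather than special-cased logic.
      [](const Pod&, const Node& n) { return n.running_pods < n.max_pods; },
      // CPU-fit predicate.
      [](const Pod& p, const Node& n) { return p.cpu_request <= n.free_cpu_cores; },
  };

  Node node{"node-1", 110, 110, 4.0};
  Pod pod{0.5};

  bool feasible = true;
  for (const auto& pred : predicates) {
    feasible = feasible && pred(pod, node);
  }
  // node-1 already runs max_pods pods, so the max-pods predicate filters it out.
  std::cout << (feasible ? "feasible" : "filtered out") << std::endl;
  return 0;
}
```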
Thanks for your feedback. This is exactly what we currently support as part of the Poseidon/Firmament scheduler: Node Filtering (Hard Constraints) -> Priority/Scoring (Soft Constraints) -> Taints -> Volume dependency. We are about to incorporate the max. pods/node capability to go along with all of the above, if this is what you meant.
@deepak-vij thanks for the comment. Could I request a few reference hyperlinks to the documentation for Hard & Soft constraints?
Can you please elaborate on the Rate-Limiting predicate?
@shivramsrivastava Love the links!
Rate Limiting:

xref: