Support max-allowed Pods for nodes #3
Conversation
@shivramsrivastava The PR's ready.
    string core_id_substr = label.substr(idx + 4, label.size() - idx - 4);
    uint32_t core_id = strtoul(core_id_substr.c_str(), 0, 10);
    float available_cpu_cores =
        latest_stats.cpus_stats(core_id).cpu_capacity() *
I feel a bit confused about why we need to touch the CPU stats?
Here we read the latest machine stats sample for a machine, fetch a particular core's CPU utilization and available CPU, and store it in the PU's resource descriptor. We then accumulate the combined CPU stats of all PUs (PU = core; a machine may have more than one PU) at the machine's resource descriptor. When deciding costs and drawing arcs, we use the accumulated available CPU from the machine's resource descriptor.
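To make that data flow concrete, here is a minimal, self-contained sketch of the per-PU computation and machine-level accumulation. The struct names (PuStats, ResourceDescriptor) are simplified stand-ins for the Firmament protobufs, not the real types.

```cpp
#include <iostream>
#include <vector>

struct PuStats {
  float cpu_capacity;     // total cores-worth of capacity on this PU
  float cpu_utilization;  // fraction currently in use, 0.0 .. 1.0
};

struct ResourceDescriptor {
  float available_cpu_cores = 0.0f;
};

int main() {
  // Latest machine stats sample: one entry per PU (core).
  std::vector<PuStats> latest_stats = {{1.0f, 0.25f}, {1.0f, 0.50f}};

  // Per-PU resource descriptors plus the machine-level descriptor that
  // accumulates the combined available CPU of all PUs.
  std::vector<ResourceDescriptor> pu_rds(latest_stats.size());
  ResourceDescriptor machine_rd;

  for (size_t core_id = 0; core_id < latest_stats.size(); ++core_id) {
    // Available CPU on this PU = capacity * (1 - utilization).
    float available = latest_stats[core_id].cpu_capacity *
                      (1.0f - latest_stats[core_id].cpu_utilization);
    pu_rds[core_id].available_cpu_cores = available;
    machine_rd.available_cpu_cores += available;  // accumulate at machine RD
  }

  // The cost model would then use machine_rd.available_cpu_cores when drawing arcs.
  std::cout << "machine available cpu: " << machine_rd.available_cpu_cores << std::endl;
  return 0;
}
```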
    }
    // Running/idle task count
    rd_ptr->set_num_running_tasks_below(rd_ptr->current_running_tasks_size());
    rd_ptr->set_num_slots_below(FLAGS_max_tasks_per_pu);
Shouldn't we set rd.num_slots_below()?
The current Firmament code fixes the maximum number of pods that can be scheduled on a PU via the constant FLAGS_max_tasks_per_pu. Because of this, when the number of running tasks on a PU equals FLAGS_max_tasks_per_pu, the capacity of the arc from Machine to PU is set to zero, so we are restricted to scheduling only FLAGS_max_tasks_per_pu tasks/pods on the PU. To remove this restriction, we need to set rd.num_slots_below() to the maximum number of pods that can be scheduled on that PU, and only for PU nodes.
The code changes could look like this:
1) Add a new field 'max_pods' to ResourceDescriptor, which gets its value from the kubelet parameter 'max-pods' only once, when the machine is added.
2) For PU nodes only, while updating the resource descriptor, update num_slots_below to max_pods, like below:
rd.set_num_slots_below(machine_rd.max_pods());
This way we can schedule max_pods pods on that PU, not just FLAGS_max_tasks_per_pu (see the sketch below).
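As a rough illustration of steps 1) and 2), here is a minimal sketch. The field name max_pods and the UpdateSlots helper are hypothetical stand-ins mirroring the proposal above, not the actual Firmament protobufs or code.

```cpp
#include <cstdint>
#include <iostream>

enum class ResourceType { MACHINE, PU };

struct ResourceDescriptor {
  ResourceType type = ResourceType::MACHINE;
  uint64_t max_pods = 0;         // new persistent field, set once from kubelet --max-pods
  uint64_t num_slots_below = 0;  // recomputed every scheduling round
};

// Called while updating resource statistics each scheduling round.
void UpdateSlots(ResourceDescriptor* rd, const ResourceDescriptor& machine_rd) {
  if (rd->type == ResourceType::PU) {
    // For PU nodes, allow up to the machine's max_pods instead of the
    // fixed FLAGS_max_tasks_per_pu constant.
    rd->num_slots_below = machine_rd.max_pods;
  }
}

int main() {
  ResourceDescriptor machine;
  machine.max_pods = 110;  // e.g. the kubelet default for --max-pods

  ResourceDescriptor pu;
  pu.type = ResourceType::PU;
  UpdateSlots(&pu, machine);
  std::cout << "PU slots: " << pu.num_slots_below << std::endl;  // prints 110
  return 0;
}
```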
Currently max-pods is passed to num_slots_below. Do we need a new field?
Since num_slots_below gets recomputed every scheduling round, this value does not persist, so it is better to add a new field.
    rd_ptr->mutable_available_resources()->set_cpu_cores(
        available_cpu_cores);
    }
    // Running/idle task count
353 to 324 is already done at 322 to 324, so it should be removed. When accumulator->type_ is PU, both code sections get executed, because 'other' is 'SINK' when the accumulator is 'PU'.
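A tiny illustrative sketch of why both sections run for the same node pair; the GatherStats function and NodeType enum here are simplified stand-ins for the real traversal callback, not Firmament code.

```cpp
#include <iostream>

enum class NodeType { MACHINE, PU, SINK };

void GatherStats(NodeType accumulator, NodeType other) {
  if (accumulator == NodeType::PU && other == NodeType::SINK) {
    // First section: update the PU's running-task count and slots.
    std::cout << "first section: update PU running tasks/slots\n";
  }
  if (accumulator == NodeType::PU) {
    // Duplicated section: also runs for the same (PU, SINK) pair,
    // which is why the earlier section makes this one redundant.
    std::cout << "second section: update PU running tasks/slots (redundant)\n";
  }
}

int main() {
  GatherStats(NodeType::PU, NodeType::SINK);  // both sections execute
  return 0;
}
```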
5ba6239 to 972b86e
@shivramsrivastava PTAL
TL;DR: Schedulers should be cognizant about
Priority: Instead of discrete logic for the MAX permissible Pods on a Node, would you encourage us to have more condition-tree-based scheduling?
Schedulers could honor Pod Priority [
Predicate Context: It's a good start to have a
Suggestion: If we abstract the rate limit to be a type of Predicate for scheduling, we could have an elegant predicate-based Poseidon scheduling. Reasons:
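As a rough sketch of what treating the max-pods limit as one predicate among others could look like: all names here (Predicate, Node, Pod) are illustrative only, not Poseidon or Kubernetes APIs.

```cpp
#include <cstdint>
#include <functional>
#include <iostream>
#include <string>
#include <vector>

struct Node {
  std::string name;
  uint64_t running_pods;
  uint64_t max_pods;
  double free_cpu_cores;
};

struct Pod {
  double cpu_request;
};

// A predicate is any hard filter over a (pod, node) pair.
using Predicate = std::function<bool(const Pod&, const Node&)>;

int main() {
  std::vector<Predicate> predicates = {
      // Max-pods expressed as a predicate rather than special-cased logic.
      [](const Pod&, const Node& n) { return n.running_pods < n.max_pods; },
      // CPU-fit predicate.
      [](const Pod& p, const Node& n) { return p.cpu_request <= n.free_cpu_cores; },
  };

  Node node{"node-1", 110, 110, 4.0};
  Pod pod{0.5};

  bool feasible = true;
  for (const auto& pred : predicates) {
    feasible = feasible && pred(pod, node);
  }
  // node-1 already runs max_pods pods, so the max-pods predicate filters it out.
  std::cout << (feasible ? "feasible" : "filtered out") << std::endl;
  return 0;
}
```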
Thanks for your feedback. This is exactly what we currently support as part of the Poseidon/Firmament scheduler: Node Filtering (Hard Constraints) -> Priority/Scoring (Soft Constraints) -> Taints -> Volume dependency. We are about to incorporate the max. pods/node capability to go along with all of the above, if this is what you meant.
@deepak-vij thanks for the comment. Could I request a few reference hyperlinks to the documentation for Hard & Soft constraints?
Can you please elaborate on the Rate-Limiting predicate?
@shivramsrivastava Love the links!
Rate Limiting:

xref: