Conversation
If resource starvation is a concern (so many reads that writes never run), a simple idea would be to process N writes for every batch of reads.
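The suggestion could look roughly like this. This is a sketch, not code from the PR; `run_one_cycle` and `WRITES_PER_BATCH` are hypothetical names, assuming tasks arrive on two separate queues:

```python
import queue

# Hypothetical quota: after serving a batch of reads, drain up to N queued
# writes so that writes cannot be starved indefinitely.
WRITES_PER_BATCH = 2

def run_one_cycle(read_q: queue.Queue, write_q: queue.Queue) -> int:
    """Process all currently queued reads, then at most N writes.

    Returns the number of tasks executed this cycle.
    """
    done = 0
    # Serve the latency-critical reads first.
    while True:
        try:
            task = read_q.get_nowait()
        except queue.Empty:
            break
        task()
        done += 1
    # Guarantee forward progress for writes: take up to N per cycle.
    for _ in range(WRITES_PER_BATCH):
        try:
            task = write_q.get_nowait()
        except queue.Empty:
            break
        task()
        done += 1
    return done
```

The quota bounds how long any write can wait: at most one batch of reads plus `N - 1` earlier writes.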
Thanks for this contribution! We've seen this is a real pain point.
I need to test it; I'll do it ASAP.
@dannyharnik this will be a good idea |
After implementing it, […] but it is still way better than what I saw with main (…).
So if all workers are dedicated to reads we get 0.05s, and with 75% of the workers it takes 0.081s. This seems reasonable. Can you try a couple more test points (say 90% readers and 50% readers)? If the results make sense, then I guess you can remove the assert (or relax it...).
Results reported from the test: it looks like 75% and 90% perform the same, probably due to the low number of threads and the limitations of the current test. Note: the test case changed completely, so it's not possible to compare with the other results discussed up until now.
kfirtoledo left a comment:
overall looks good, some minor comments
Add ThreadPool priority using a dual queue approach with shared workers. Signed-off-by: Alberto Perdomo <aperdomo@redhat.com>
Force-pushed from b506db9 to e2654fa.
I found some missing changes that probably got lost during the rebase, and added a new test for write starvation, as suggested.
Summary
The FS backend's storage connector uses a single thread pool for both read (GET: Storage->GPU) and write (PUT: GPU->Storage) operations. When write-heavy workloads fill the task queue, read operations, which are latency-critical for serving inference requests, get stuck behind writes in the FIFO queue.
This PR introduces a dual-queue design with shared workers: we keep a single pool of workers but maintain two internal queues, a high-priority read queue and a normal-priority write queue. Workers check the read queue first; if it is empty, they fall through to the write queue.
Other approaches, such as separate ThreadPool instances, are simpler but double the thread-local resource usage. With this approach, thread-local resources (staging buffers, CUDA streams, ...) are not affected, since we only change how tasks are queued and dequeued, not how they execute.

Test plan
To get a sense of how this affects performance, three new test cases were added.
The tests pass, showing that prioritization works as expected.
Related issues