add device data in resourceclaim#128
add device data in resourceclaim#128guptaNswati wants to merge 7 commits intokubernetes-sigs:mainfrom
Conversation
|
Welcome @guptaNswati! |
|
cc @nojnhuh for review. |
|
FYI @guptaNswati I was hoping to get this PR merged before #129 to keep you from having to deal with those changes here. There are other changes in the works that are based on that PR though, so I may elect to merge that one first. If #129 merges first, I'm happy to help resolve conflicts with your PR here! |
Oh i see. thank you for the headsup. i will see if i can update this tomorrow or else you can go ahead and merge #129. Main blocker is the e2e, right? we can't merge this without the tests? |
|
@guptaNswati Yes, I would like to include something in the tests for this. #128 (comment) is the other comment I would like to resolve before merging. |
|
@guptaNswati Have you pushed your recent changes? The most recent commit I see here is from when the PR was first opened. |
sorry pushing them rn. |
|
this is the quick test with updated changes. fixing e2e test |
|
e2e test |
a03108c to
b77e105
Compare
There was a problem hiding this comment.
I don't think we need to address this now, but we should think about how to handle errors here besides only logging them. Retries will help in some cases (HTTP 409), but we should prevent blocking subsequent NodePrepareResources calls on retries that may never succeed and I'm not sure what an appropriate amount of time would be to spend on that. I wonder if a workqueue running async would be the most appropriate way to handle status updates?
@guptaNswati Could you please open an issue to track improvements to error handling here?
There was a problem hiding this comment.
You also need this to de-couple the allocation times from apiserver update latency. May be as a hacky first implementation you can do this in a goroutine?
I guess the main question to ask is, is this update required for the driver to declare allocation as complete or is the information in this update auxiliary to allocation?
|
sorry was out for holidays. looking at it now. |
5c158c0 to
cc7f77e
Compare
|
Quick test after the rebase: |
|
@nojnhuh i made the suggested changes. Its ready to review again.
|
|
/assign @nojnhuh |
There was a problem hiding this comment.
Could we call this something like gpu-device-status instead to avoid confusing with #125?
| return resultConfigs, nil | ||
| } | ||
|
|
||
| func (s *DeviceState) buildDeviceStatus(res resourceapi.DeviceRequestAllocationResult) resourceapi.AllocatedDeviceStatus { |
There was a problem hiding this comment.
Let's move this logic into the gpu profile so we can implement this differently for different kinds of devices. See #129 for more about the "profile" concept.
Ideally we shouldn't have to implement this for every kind of device.
There was a problem hiding this comment.
okay. looking to it. need sometime to understand the refractor. will be back at it.
| checkpointManager: checkpointManager, | ||
| configDecoder: decoder, | ||
| configHandler: configHandler, | ||
| config: config, |
There was a problem hiding this comment.
Let's keep NewDeviceState where we pull everything out of the Config that we need vs. storing the whole Config on the DeviceState. Config contains user-facing config like command line flags, so let's not plumb that too far down.
|
Ack. I will work on it next week. |
Signed-off-by: Swati Gupta <swatig@nvidia.com> address review comment: fix pointer ref Signed-off-by: Swati Gupta <swatig@nvidia.com>
Signed-off-by: Swati Gupta <swatig@nvidia.com>
Signed-off-by: Swati Gupta <swatig@nvidia.com>
Signed-off-by: Swati Gupta <swatig@nvidia.com>
Signed-off-by: Swati Gupta <swatig@nvidia.com>
c52ea98 to
9f5e69d
Compare
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: guptaNswati The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
Signed-off-by: Swati Gupta <swatig@nvidia.com>
Addressing #101 to add device attributes to demonstrate how to do resourceclaim status update.
Test run