nvmeof: Add mount cache and locking for safe nvme disconnect #6192

gadididi wants to merge 3 commits into ceph:devel
Conversation
Pull request overview
Adds an in-memory mount cache for NVMe-oF staging mounts so NodeUnstage can avoid repeated findmnt --list calls and decide when it’s safe to disconnect NVMe controllers.
Changes:
- Changed `GetAllNVMeMountedDevices()` to return `device -> stagingTarget` mappings (instead of a `bool` set).
- Introduced `MountCache` (bidirectional device/path cache) and initialized it at NodeServer startup.
- Updated the NodeUnstage disconnect decision flow to use the in-memory cache snapshot.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| internal/nvmeof/util/mounter.go | Changes mounted-device discovery to return device→mountpoint mapping. |
| internal/nvmeof/util/mounter_cache.go | Adds new mutex-protected mount cache with bidirectional maps. |
| internal/nvmeof/nvmeof_initiator.go | Updates disconnect interfaces/helpers to accept the new mounted-devices map type. |
| internal/nvmeof/nodeserver/nodeserver.go | Initializes cache on startup; updates stage/unstage flows to maintain and consult the cache for disconnect decisions. |
```go
transaction.isMounted = true

// Resolve real device and update cache
realDevice, err := nvmeutil.GetDeviceFromMountpoint(ctx, stagingTargetPath)
if err != nil {
	log.WarningLog(ctx, "failed to resolve device: %v", err)
	// Fallback - try to continue
	realDevice = devicePath
}

ns.mountCache.Add(realDevice, stagingTargetPath)
```
Between the successful mount (transaction.isMounted = true) and the later ns.mountCache.Add call, there is a window where the namespace is mounted in the system but missing from mountCache. Because NodeStage/NodeUnstage are only locked per-volume (volumeLocks), a concurrent unstage of a different namespace in the same subsystem could take a snapshot that omits this mount and incorrectly disconnect shared controllers. To avoid false negatives, consider updating mountCache immediately after the mount succeeds (using the best-available device identity) or adding a safety validation (e.g., re-check mounts via findmnt) before disconnect decisions.
Suggested change:

```diff
-transaction.isMounted = true
-// Resolve real device and update cache
-realDevice, err := nvmeutil.GetDeviceFromMountpoint(ctx, stagingTargetPath)
-if err != nil {
-	log.WarningLog(ctx, "failed to resolve device: %v", err)
-	// Fallback - try to continue
-	realDevice = devicePath
-}
-ns.mountCache.Add(realDevice, stagingTargetPath)
+// Immediately record the mount in the cache using the best-available device identity
+ns.mountCache.Add(devicePath, stagingTargetPath)
+// Resolve real device and refine cache entry if possible
+realDevice, err := nvmeutil.GetDeviceFromMountpoint(ctx, stagingTargetPath)
+if err != nil {
+	log.WarningLog(ctx, "failed to resolve device: %v", err)
+	// Fallback - keep using devicePath for cache
+} else if realDevice != "" && realDevice != devicePath {
+	// Update cache to use the canonical device identifier for this mountpoint
+	ns.mountCache.RemoveByMountPoint(stagingTargetPath)
+	ns.mountCache.Add(realDevice, stagingTargetPath)
+}
+transaction.isMounted = true
```
Because NodeStage/NodeUnstage are only locked per-volume (volumeLocks), a concurrent unstage of a different namespace in the same subsystem could take a snapshot that omits this mount and incorrectly disconnect shared controllers
As I said in the PR description, the next PR will add a group lock (which locks stage/unstage per group), once this PR is merged: #6183
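The group lock is planned for a follow-up PR, so it is not in this diff. A minimal sketch of what a keyed lock serializing stage/unstage per group could look like (all names here are illustrative, not the PR's actual implementation):

```go
package main

import (
	"fmt"
	"sync"
)

// groupLocks serializes NodeStage/NodeUnstage calls that share a key
// (e.g. an NVMe subsystem NQN); different groups proceed in parallel.
type groupLocks struct {
	mu    sync.Mutex
	locks map[string]*sync.Mutex
}

func newGroupLocks() *groupLocks {
	return &groupLocks{locks: map[string]*sync.Mutex{}}
}

// Lock acquires the mutex for the given group key, creating it on first use.
func (g *groupLocks) Lock(key string) {
	g.mu.Lock()
	m, ok := g.locks[key]
	if !ok {
		m = &sync.Mutex{}
		g.locks[key] = m
	}
	g.mu.Unlock()
	m.Lock()
}

// Unlock releases the mutex for the given group key.
func (g *groupLocks) Unlock(key string) {
	g.mu.Lock()
	m := g.locks[key]
	g.mu.Unlock()
	m.Unlock()
}

func main() {
	gl := newGroupLocks()
	var wg sync.WaitGroup
	counter := 0
	for i := 0; i < 4; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			gl.Lock("nqn.2024-01.example:subsys1")
			counter++ // safe: increments are serialized per group
			gl.Unlock("nqn.2024-01.example:subsys1")
		}()
	}
	wg.Wait()
	fmt.Println(counter) // 4
}
```

With such a lock, a stage of one namespace and an unstage of another namespace in the same subsystem can no longer interleave, which closes the snapshot race described in the comment above.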
```go
// MountCache is a thread-safe cache that maintains the mapping between staging paths and devices.
// Note: this map is 1:1 - each staging path corresponds to exactly one device,
// and a device can only be mounted to one staging path at a time.
type MountCache struct {
```
Can you provide a MountCache interface and mountCache struct that contains the actual implementation?
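The split the reviewer asks for could look roughly like the sketch below. The method names `Add`, `RemoveByMountPoint`, and `GetCopyAllDevices` appear elsewhere in this PR; the interface/struct split, constructor, and field names are illustrative assumptions, not the PR's actual code:

```go
package main

import (
	"fmt"
	"sync"
)

// MountCache describes the operations the node server needs; callers depend
// on this interface rather than on the concrete implementation.
type MountCache interface {
	Add(device, stagingPath string)
	RemoveByMountPoint(stagingPath string)
	GetCopyAllDevices() map[string]string
}

// mountCache is the unexported, mutex-protected implementation.
type mountCache struct {
	mu           sync.Mutex
	pathToDevice map[string]string // staging path -> device
	deviceToPath map[string]string // device -> staging path
}

// NewMountCache returns an empty cache behind the MountCache interface.
func NewMountCache() MountCache {
	return &mountCache{
		pathToDevice: map[string]string{},
		deviceToPath: map[string]string{},
	}
}

func (c *mountCache) Add(device, stagingPath string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.pathToDevice[stagingPath] = device
	c.deviceToPath[device] = stagingPath
}

func (c *mountCache) RemoveByMountPoint(stagingPath string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if dev, ok := c.pathToDevice[stagingPath]; ok {
		delete(c.deviceToPath, dev)
		delete(c.pathToDevice, stagingPath)
	}
}

func (c *mountCache) GetCopyAllDevices() map[string]string {
	c.mu.Lock()
	defer c.mu.Unlock()
	out := make(map[string]string, len(c.deviceToPath))
	for dev, path := range c.deviceToPath {
		out[dev] = path
	}
	return out
}

func main() {
	c := NewMountCache()
	c.Add("/dev/nvme0n1", "/staging/vol-1")
	fmt.Println(c.GetCopyAllDevices()["/dev/nvme0n1"]) // /staging/vol-1
}
```

Hiding the struct behind an interface also makes it easy to substitute a fake cache in node-server unit tests.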
```go
}

// Initialize mounted devices cache on startup to ensure we have an accurate view
mountedDevices, err := getNVMeMountedDevices(context.Background())
```
Make getNVMeMountedDevices() a function of NodeServer (call it initNVMeMountedDevices()?) and add the path/device directly in that function. This removes the for-loop a little below and makes this function a little smaller.
```diff
 // so we should try to disconnect if this was the last mount
 if transaction.devicePath != "" {
-	mountedDevices, err := getNVMeMountedDevices(ctx)
+	mountedDevices := ns.mountCache.GetCopyAllDevices()
```
This no longer returns an err, so the error check on the next line no longer makes sense.
Add MountCache to track nvme device mounts in memory, eliminating repeated `findmnt` system calls during unstage operations. The cache maintains a bidirectional map between staging paths and device paths, allowing O(1) lookups in both directions; the get-device-from-staging-path lookup also avoids a `findmnt` syscall.

Structure:
- pathToDevice: staging path -> device (/staging/vol-123 → /dev/nvme0n1)
- deviceToPath: device -> staging path (reverse lookup)
- Protected by sync.Mutex for thread-safe concurrent access

The cache will be initialized at node server startup by scanning existing mounts, then updated on each stage/unstage operation. This improves performance by replacing expensive `findmnt --list` calls (which scan all mounts) with fast in-memory map lookups.

Signed-off-by: gadi-didi <gadi.didi@ibm.com>
Change the map from (staging path -> true/false) to a more informative map: staging path -> device. This function will be used at node server init to scan the current state of mounted nvme devices.

Signed-off-by: gadi-didi <gadi.didi@ibm.com>
Add `mountCache`, which saves the `findmnt` calls. Also change every place in the code that calls `findmnt` to read from the cache instead.

Signed-off-by: gadi-didi <gadi.didi@ibm.com>
```go
// Initialize mounted devices cache on startup to ensure we have an accurate view
if err := ns.initNVMeMountedDevices(context.Background()); err != nil {
	log.WarningLog(context.Background(),
		"failed to initialize mounted devices cache on startup: %v (mounted devices cache will be empty)", err)
```
NewNodeServer continues when initNVMeMountedDevices() fails, leaving mountCache empty. In that state, NodeUnstageVolume will make disconnect decisions using an incomplete snapshot and may disconnect shared controllers while other NVMe namespaces remain mounted (especially after node-plugin restart with pre-existing mounts). Consider either failing NodeServer startup on init error, or tracking an "initialized" flag and falling back to an authoritative findmnt scan for disconnect decisions until initialization succeeds (or retry initialization).
Suggested change:

```diff
-// Initialize mounted devices cache on startup to ensure we have an accurate view
-if err := ns.initNVMeMountedDevices(context.Background()); err != nil {
-	log.WarningLog(context.Background(),
-		"failed to initialize mounted devices cache on startup: %v (mounted devices cache will be empty)", err)
+// Initialize mounted devices cache on startup to ensure we have an accurate view.
+// Failing to initialize this cache can lead to incorrect disconnect decisions, so
+// treat initialization failures as fatal and fail NodeServer construction.
+if err := ns.initNVMeMountedDevices(context.Background()); err != nil {
+	return nil, fmt.Errorf("failed to initialize mounted devices cache on startup: %w", err)
```
Describe what this PR does
Adds an in-memory cache to track which nvme devices are currently mounted at staging paths. This eliminates expensive `findmnt --list` system calls on every unstage operation.

Structure:
The cache maintains bidirectional maps for fast lookups:
- pathToDevice: staging path -> device path
- deviceToPath: device path -> staging path (reverse)

Initialization:
Cache is populated at node server startup by scanning existing mounts, then kept up to date as volumes are staged and unstaged.

Thread Safety:
Protected by `sync.Mutex`. We use a simple mutex instead of `sync.RWMutex` because:

Why copying the map is safe:
When `GetAllDevices()` returns a copy of the cache, callers get a consistent snapshot at that moment. The copy happens after cache updates (first the pod unmounts, then deletes its entry, and last fetches the copy of the cache), so concurrent unstage operations each see the current state. Even if two unstages run simultaneously, each removes its device from the cache first, then gets a fresh snapshot showing the remaining devices. So the last of them to call `GetAllDevices()` will get the updated list and will know whether to disconnect or not. (NodeStage cannot run at this time!)

Related PRs
After this PR is merged, #6183 will add group locking (between NodeStage/NodeUnstage).
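The remove-then-snapshot ordering behind the "why copying the map is safe" argument can be sketched as follows. The trimmed `cache` type and method names here are illustrative stand-ins for the PR's MountCache, not its actual code:

```go
package main

import (
	"fmt"
	"sync"
)

// cache is a trimmed stand-in: device -> staging path, guarded by a plain mutex.
type cache struct {
	mu           sync.Mutex
	deviceToPath map[string]string
}

// RemoveDevice drops this unstage's own entry before any snapshot is taken.
func (c *cache) RemoveDevice(dev string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	delete(c.deviceToPath, dev)
}

// GetCopyAllDevices returns a snapshot; callers can inspect it without
// holding the lock, and later cache updates do not mutate the copy.
func (c *cache) GetCopyAllDevices() map[string]string {
	c.mu.Lock()
	defer c.mu.Unlock()
	out := make(map[string]string, len(c.deviceToPath))
	for k, v := range c.deviceToPath {
		out[k] = v
	}
	return out
}

func main() {
	c := &cache{deviceToPath: map[string]string{
		"/dev/nvme0n1": "/staging/vol-1",
		"/dev/nvme0n2": "/staging/vol-2",
	}}
	// Unstage of vol-1: remove its own entry first, then snapshot.
	c.RemoveDevice("/dev/nvme0n1")
	snapshot := c.GetCopyAllDevices()
	// Disconnect only if no other namespace of the subsystem remains mounted.
	fmt.Println(len(snapshot)) // 1 -> another mount remains, keep controllers connected
}
```

Because each unstage removes its own entry before snapshotting, the last unstage to run observes an empty set for the subsystem and is the one that disconnects.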
Future concerns
The next step is to add a group lock after #6183 is merged.
Checklist:
- Commit messages follow the guidelines in the developer guide.
- Reviewed the developer guide on Submitting a Pull Request.
- Pending release notes updated with breaking and/or notable changes for the next major release.
Show available bot commands

These commands are normally not required, but in case of issues, leave any of the following bot commands in an otherwise empty comment in this PR:

- `/retest ci/centos/<job-name>`: retest the `<job-name>` after unrelated failure (please report the failure too!)