 Although the self-healing policy can control whether a rebuild will occur,
 occasionally an administrator may need to stop a rebuild already started and
-restart it later with a "rebuild start" command (or, perform an alternate
+restart it later with a `rebuild start` command (or, perform an alternate
 action/rebuild such as direct reintegration). For this purpose, DAOS provides
 the following interactive rebuild control command-line interfaces:

@@ -13,14 +13,14 @@ the following interactive rebuild control command-line interfaces:
 * `dmg system rebuild stop [--force]`
 * `dmg system rebuild start`

-The system commands will apply to all pools in the DAOS system (i.e., have the
-same effect as if multiple `pool rebuild stop` commands are issued, one per
-pool).
+The system-level commands (e.g., `dmg system rebuild stop`) apply to all pools in the DAOS
+system (i.e., they have the same effect as issuing one `dmg pool rebuild stop` command
+per pool).

 Upon stopping a pool's rebuild, its rebuild state as reported by `dmg pool query`
-will be an idle state, and an error status=-2027 (`-DER_OP_CANCELED` DAOS error code).
+will be idle, with error status=`-2027` (`-DER_OP_CANCELED` DAOS error code).

-The effect of a rebuild stop command is "one shot", meaning only a pool's
+The effect of a `rebuild stop` command is "one shot", meaning only a pool's
 currently-running rebuild is stopped and there is no persistent effect on future
 operations. Subsequent self-healing automatic recovery, or administrator command
 (e.g., system stop, system/pool exclude, reintegrate, drain, pool extend) can
@@ -59,16 +59,16 @@ For a **failed rebuild**, the sequence is:

 1. Run `op:Rebuild`
 2. Run `op:Fail_reclaim` to clean up
-    * If `Fail_reclaim` failed, retry `Fail_reclaim`
+    * If `Fail_reclaim` *itself* failed, retry `Fail_reclaim`
     * If `Fail_reclaim` succeeded, retry the original `op:Rebuild`

-The "rebuild stop" commands are not typically allowed to terminate a rebuild in
+The `rebuild stop` commands are not typically allowed to terminate a rebuild in
 the `op:Reclaim` and `op:Fail_reclaim` phases — instead the command must be
 issued during the `op:Rebuild` execution. An exception is available with the
-`--force` option to "rebuild stop", intended to be applied for rebuilds that
+`--force` option to `rebuild stop`, intended to be applied for rebuilds that
 repeatedly fail and possibly may even be looping `Fail_reclaim` operations.

-Because of these details, carefully timing the execution of "rebuild stop"
+Because of these details, carefully timing the execution of `rebuild stop`
 commands is needed, which can be facilitated with pool rebuild state querying
 with `dmg pool query`. See the section
 [Rebuild Stop Command Errors](#rebuild-stop-command-errors) for examples of
@@ -77,15 +77,18 @@ errors returned by "rebuild stop" in different timing circumstances.

 ## Example Usage: Stop a Single Pool Rebuild, then Direct Reintegration

-A system has detected the loss of an engine (rank 3) and has launched a
-corresponding rebuild on pool `p1` for that exclusion. The administrator decides
-in this case that it is preferable to stop the exclude rebuild rather than let
-it finish (perhaps it is possible to quickly remedy the rank 3 engine issue,
-restart it, and directly reintegrate it back into the system and pool — rather
-than perform 2 separate rebuilds for the exclusion and the reintegration).
+A system has detected the loss of an engine (rank 3) that has 8 storage targets.
+A corresponding rebuild has launched on pool `p1` after the engine's exclusion
+from the pool map. The administrator decides to stop this rebuild
+(perhaps because it is possible to quickly remedy the rank 3 engine issue, restart it,
+and reintegrate it into the system and pool). This will result in a single
+rebuild for the reintegration, rather than two rebuilds (for the initial exclusion,
+and later for the reintegration).

 **1.** Observe the pool `p1` is rebuilding after the fault is detected.

+The pool state reflects 8 disabled targets (`disabled=8`) corresponding to the exclusion
+of engine rank 3. Rebuild is underway (`busy` rebuild state).
 ```bash
 $ dmg pool query --health-only p1
 Pool cdf27ec1-ed97-4aa6-a766-39a2ed2136a1, ntarget=64, disabled=8, leader=6, version=77, state=TargetsExcluded
@@ -95,37 +98,42 @@ Pool health info:
 - Data redundancy: degraded
 ```

-**2.** Stop the rebuild and monitor with pool query until the stop is confirmed
-("stopped", state=idle, status=-2027). Notice some pool query rebuild state
-changes while waiting.
+**2.** Stop the rebuild and monitor with multiple `pool query` commands until the stop is confirmed
+(`stopped, state=idle, status=-2027`). Notice some rebuild state changes while waiting.

+A single-pool command prints no output if the request is successfully
+sent to the storage system.
 ```bash
 $ dmg pool rebuild stop p1
 ```

-Run `dmg pool query` in a loop (with short sleeps between commands):
+Run `dmg pool query` in a loop (with short delays between commands).

+The rebuild state output `stopping` along with `state=busy` and `status=-2027` indicates
+that the stop command is being processed and the rebuild has not fully stopped yet.
 ```bash
-# confirmation that stop command is being handled
 $ dmg pool query --health-only p1
 Pool cdf27ec1-ed97-4aa6-a766-39a2ed2136a1, ntarget=64, disabled=8, leader=6, version=77, state=TargetsExcluded
 Pool health info:
 - Disabled ranks: 3
 - Rebuild stopping (state=busy, status=-2027)
 ```

+The rebuild state has temporarily transitioned to `busy`, reflecting that the `op:Rebuild` is no
+longer running, but a reclaim phase is now running (in this case, `op:Fail_reclaim`, since a stopped
+rebuild is processed in the same way as a failed rebuild).
 ```bash
-# state temporarily changes back to busy, representing the Fail_reclaim stage
-# for the stopped operation is now running
 $ dmg pool query --health-only p1
 Pool cdf27ec1-ed97-4aa6-a766-39a2ed2136a1, ntarget=64, disabled=8, leader=6, version=77, state=TargetsExcluded
 Pool health info:
 - Disabled ranks: 3
 - Rebuild busy, 0 objs, 0 recs
 ```

+The `op:Fail_reclaim` phase has finished, and the pool query output now shows the
+final state of the stopped rebuild
+(`state=idle, status=-2027`).
 ```bash
-# Fail_reclaim has finished, and now the pool shows its rebuild has fully stopped
 $ dmg pool query --health-only p1
 Pool cdf27ec1-ed97-4aa6-a766-39a2ed2136a1, ntarget=64, disabled=8, leader=6, version=77, state=TargetsExcluded
 Pool health info:
@@ -134,16 +142,33 @@ Pool health info:
 - Data redundancy: degraded
 ```
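
The polling described in step 2 could be scripted along the following lines. This is a minimal sketch rather than anything shipped with DAOS: `classify_rebuild`, `wait_for_stop`, and the `DMG` override variable are hypothetical names, and the match on `state=idle, status=-2027` assumes the final rebuild line is printed in the form quoted in step 2.

```bash
# Sketch only: classify the rebuild line of `dmg pool query --health-only` output.
# Reads the query output on stdin and prints: stopped, stopping, busy, or other.
classify_rebuild() {
    local out
    out=$(cat)
    case "$out" in
        *"state=idle, status=-2027"*) echo stopped ;;   # assumed final form
        *"Rebuild stopping"*)         echo stopping ;;  # stop request in progress
        *"Rebuild busy"*)             echo busy ;;      # Rebuild or reclaim running
        *)                            echo other ;;
    esac
}

# Poll the pool until the stop is confirmed or the attempts run out.
# DMG is a hypothetical override so a stub can replace the real dmg binary.
wait_for_stop() {
    local pool=$1 attempts=${2:-30} delay=${3:-2} state i
    for ((i = 0; i < attempts; i++)); do
        state=$("${DMG:-dmg}" pool query --health-only "$pool" | classify_rebuild)
        echo "attempt $((i + 1)): rebuild $state"
        [ "$state" = stopped ] && return 0
        sleep "$delay"
    done
    return 1
}
```

For example, `wait_for_stop p1 60 5` would poll pool `p1` up to 60 times, 5 seconds apart, and return non-zero if the stop was never confirmed.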

-**3.** Directly reintegrate the engine back into pool `p1` and monitor progress
-with multiple pool query commands until the reintegration rebuild is
-successfully finished.
+**3.** Restart engine rank 3, wait for it to join the system, directly reintegrate it
+back into pool `p1`, and wait for the rebuild to finish successfully.

 ```bash
+$ dmg system start --ranks=3
+# Repeat dmg system query commands until engine rank 3 shown in the joined state
+$ dmg system query
+Rank      State
+----      -----
+[0-2,4-7] Joined
+3         Stopped
+
+$ dmg system query
+Rank  State
+----  -----
+[0-7] Joined
+
+# Now, reintegrate engine rank 3 into pool p1
 $ dmg pool reintegrate --ranks=3 p1
 ```
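
The "repeat `dmg system query` until rank 3 is joined" step could be automated with a small parser for the rank-set notation shown above (`[0-2,4-7]`, `3`). The sketch below is illustrative only: `rank_in_set`, `rank_state`, `wait_for_joined`, and the `DMG` override are hypothetical names, and the code assumes the two-column `Rank` / `State` layout from the example output.

```bash
# Succeeds if rank $1 is a member of rank set $2, e.g. "[0-2,4-7]" or "3".
rank_in_set() {
    local rank=$1 spec=$2 part lo hi parts
    spec=${spec#\[}
    spec=${spec%\]}
    IFS=, read -ra parts <<< "$spec"
    for part in "${parts[@]}"; do
        lo=${part%-*}            # "4-7" -> 4, "3" -> 3
        hi=${part#*-}            # "4-7" -> 7, "3" -> 3
        if ((rank >= lo && rank <= hi)); then
            return 0
        fi
    done
    return 1
}

# Reads `dmg system query` output on stdin and prints the state of rank $1.
rank_state() {
    local rank=$1 ranks state
    while read -r ranks state; do
        case "$ranks" in
            [0-9]*|\[*) ;;       # looks like a rank or a rank set
            *) continue ;;       # header, separator, or blank line
        esac
        if rank_in_set "$rank" "$ranks"; then
            echo "$state"
            return 0
        fi
    done
    return 1
}

# Poll until the rank reports Joined; DMG is a hypothetical override for stubbing.
wait_for_joined() {
    local rank=$1 attempts=${2:-30} delay=${3:-2} i
    for ((i = 0; i < attempts; i++)); do
        [ "$("${DMG:-dmg}" system query | rank_state "$rank")" = Joined ] && return 0
        sleep "$delay"
    done
    return 1
}
```

With these helpers, `wait_for_joined 3 && dmg pool reintegrate --ranks=3 p1` would issue the reintegration only after rank 3 has joined.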

-Run `dmg pool query` in a loop (with short sleeps between commands):
+Run `dmg pool query` in a loop (with short delays between commands).

+First, the pool map may already show the update for the reintegrating engine rank 3
+(pool map `version=85` instead of 77, and targets `disabled=0` instead of 8). The pool
+rebuild may not have started yet, in which case the rebuild state is still the one left by
+the previous (stopped) rebuild (`state=idle, status=-2027`).
 ```bash
 $ dmg pool query --health-only p1
 Pool cdf27ec1-ed97-4aa6-a766-39a2ed2136a1, ntarget=64, disabled=0, leader=6, version=85, state=Ready
@@ -152,6 +177,7 @@ Pool health info:
 - Data redundancy: normal
 ```

+Rebuild is now `busy` (performing the reintegration) according to rebuild state.
 ```bash
 $ dmg pool query --health-only p1
 Pool cdf27ec1-ed97-4aa6-a766-39a2ed2136a1, ntarget=64, disabled=0, leader=6, version=85, state=Ready
@@ -160,6 +186,7 @@ Pool health info:
 - Data redundancy: normal
 ```

+Rebuild is still `busy`, showing increasing object / record counts.
 ```bash
 $ dmg pool query --health-only p1
 Pool cdf27ec1-ed97-4aa6-a766-39a2ed2136a1, ntarget=64, disabled=0, leader=6, version=85, state=Ready
@@ -168,15 +195,20 @@ Pool health info:
 - Data redundancy: normal
 ```

+Rebuild is still `busy`, though object / record counts have been reset. Also, the pool map version
+has increased to 93 (previously 85). This indicates the `op:Rebuild` has completed, and
+the rebuild is cleaning up in the `op:Reclaim` phase. The pool map was updated to promote the
+engine rank 3 targets from `UP` (during reintegration) to `UP_IN` (reintegration complete,
+ready for client I/O).
 ```bash
-# Notice zero objs, recs corresponds to a transition from Rebuild done to Reclaim phase now running.
 $ dmg pool query --health-only p1
 Pool cdf27ec1-ed97-4aa6-a766-39a2ed2136a1, ntarget=64, disabled=0, leader=6, version=93, state=Ready
 Pool health info:
 - Rebuild busy, 0 objs, 0 recs
 - Data redundancy: normal
 ```

+Rebuild is still `busy` in the `op:Reclaim` phase.
 ```bash
 $ dmg pool query --health-only p1
 Pool cdf27ec1-ed97-4aa6-a766-39a2ed2136a1, ntarget=64, disabled=0, leader=6, version=93, state=Ready
@@ -185,6 +217,8 @@ Pool health info:
 - Data redundancy: normal
 ```

+The rebuild (including reclaim) has finished, as indicated by the `done` state. Engine rank 3
+has been reintegrated into the pool.
 ```bash
 # all done
 $ dmg pool query --health-only p1
@@ -214,9 +248,25 @@ Rank State
 3 Stopped
 ```

-This shows that pool `p2` has started its rebuild (state "busy"). `p1` is
-"done" from a prior rebuild, and will start rebuilding soon.
+Attempting to stop pool rebuilds too soon (before they have actually started) will produce an error.
+The `dmg system rebuild stop` command reports how many pools had the request successfully
+issued (in this case, 0 successful pools).
+```bash
+$ dmg system rebuild stop
+System-rebuild stop request succeeded on 0 pools
+ERROR: dmg: system rebuild stop failed: pool-rebuild stop failed on pool p1: pool-rebuild stop failed: DER_NONEXIST(-1005): The specified entity does not exist, pool-rebuild stop failed on pool p2: pool-rebuild stop failed: DER_NONEXIST(-1005): The specified entity does not exist
+```
+
+The following output shows that pool `p2` has excluded targets and rebuild has started:
+- `State` column shows `TargetsExcluded`
+- `Disabled` column reports 8 out of 64 targets are disabled, corresponding here to the lost
+  engine rank 3 targets.
+- `Rebuild State` column shows `busy`

+Also, pool `p1` has not excluded targets yet, and has not started rebuild:
+- `State` column is `Ready`
+- `Disabled` column reflects 0 targets disabled.
+- `Rebuild State` column is `done`, reflecting state from a previously-completed rebuild.
 ```bash
 $ dmg system list-pools -v
 Label UUID State SvcReps SCM Size SCM Used SCM Imbalance NVME Size NVME Used NVME Imbalance Disabled UpgradeNeeded? Rebuild State
-The rebuild stop command may return errors when it is issued at a time that it
-may not be able to handle the request. For example:
+The `rebuild stop` command may return errors when it is issued at a time when
+the request cannot be handled. The following subsections show examples.

 ### No Rebuild Currently Running

 When no rebuild is currently running, the command will report a "nonexist"
 error:
-
 ```bash
-# system command applying to 2 pools
+# system-level command applying to two pools, neither of which has started rebuilding yet
 $ dmg system rebuild stop
 System-rebuild stop request succeeded on 0 pools
 ERROR: dmg: system rebuild stop failed: pool-rebuild stop failed on pool p1: pool-rebuild stop failed: DER_NONEXIST(-1005): The specified entity does not exist, pool-rebuild stop failed on pool p2: pool-rebuild stop failed: DER_NONEXIST(-1005): The specified entity does not exist

-# single pool command
+# single-pool command applied to one pool that is not currently rebuilding
 $ dmg pool rebuild stop p2
 ERROR: dmg: pool-rebuild stop failed: DER_NONEXIST(-1005): The specified entity does not exist
 ```
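
When a stop is wanted as soon as a rebuild begins, the `DER_NONEXIST` case above can be treated as "not started yet, try again". The following is a rough sketch under two assumptions that are not confirmed by the text: that `dmg` exits with a non-zero status whenever it prints an `ERROR:` line, and that retrying is acceptable for the situation at hand. `stop_when_running` and the `DMG` override are hypothetical names.

```bash
# Issue `pool rebuild stop` and retry while dmg reports DER_NONEXIST,
# i.e. while no rebuild is running yet for the pool.
# Returns 0 once the request is accepted, 1 if attempts run out,
# and 2 on any other error (which is printed to stderr).
stop_when_running() {
    local pool=$1 attempts=${2:-10} delay=${3:-2} out i
    for ((i = 0; i < attempts; i++)); do
        # Assumption: dmg exits non-zero when the stop request fails.
        if out=$("${DMG:-dmg}" pool rebuild stop "$pool" 2>&1); then
            return 0
        fi
        case "$out" in
            *DER_NONEXIST*)
                sleep "$delay"              # nothing to stop yet, retry
                ;;
            *)
                printf '%s\n' "$out" >&2    # different error, give up
                return 2
                ;;
        esac
    done
    return 1
}
```

The function deliberately gives up on any error other than `DER_NONEXIST`, since the other errors described in this section call for waiting on a different rebuild phase rather than blind retries.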

-### Rebuild Finished, Reclaim in Progress
+### Rebuild Finished Successfully, Reclaim in Progress

-When the rebuild stage has successfully finished and is in its reclaim cleanup
-stage, `dmg` will report a busy error. For example when pools `p1` and `p2` are
+When the rebuild stage has successfully finished and is in its `op:Reclaim` cleanup
+stage, `dmg` will report a (generic) busy error. For example, when pools `p1` and `p2` are
 both done rebuilding and in the reclaim stage:
-
 ```bash
 $ dmg system rebuild stop
 System-rebuild stop request succeeded on 0 pools
@@ -316,8 +395,7 @@ When the rebuild stage has finished (but failed), and is in its `Fail_reclaim`
 cleanup stage, `dmg` will report a no permissions error, `-DER_NO_PERM`.

 In this scenario, the admin can wait for the rebuild to be retried, and then
-reissue the "rebuild stop" command:
-
+reissue the `rebuild stop` command:
 ```bash
 # multiple command invocations to query pool rebuild status