Hello,
We would like to file an issue regarding abnormal checkpoint/oplog reader behavior observed in MongoShake.
Environment
- MongoShake version: v2.8.7
- Sync mode: sync_mode = all
- Source MongoDB version: 4.2.18
- Target MongoDB version: 7.0.28
- Incremental sync fetch method: oplog
- Source connect mode: secondaryPreferred
- Checkpoint storage: mongoshake.ckpt_default
- Checkpoint interval: 5000
- Typical oplog window on source: around 4.5 hours
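For context, the ~4.5 hour window above is the spread between the oldest and newest oplog entry timestamps on the source (as reported by, e.g., rs.printReplicationInfo()). A minimal sketch of that arithmetic; the sample timestamps are hypothetical, not taken from our cluster:

```python
def oplog_window_hours(t_first: int, t_last: int) -> float:
    """Oplog retention window in hours, given the epoch-second
    timestamps of the oldest and newest oplog entries."""
    return (t_last - t_first) / 3600.0

if __name__ == "__main__":
    # Hypothetical sample values spanning exactly 4.5 hours.
    print(oplog_window_hours(1774407306, 1774423506))  # 4.5
```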
Observed Behavior
- MongoShake checkpoint and /repl status values appeared to stop advancing after 2026-03-25.
- In particular, lsn_ckpt and lsn_ack remained at old values.
- However, from the service/application perspective, actual syncing continued to work normally until 2026-04-12.
- In other words, checkpoint/status values appeared stale, while actual data synchronization seemed to continue normally.
- Later, the oplog collection capped error started repeating.
- The process itself stayed alive and ensure network logs kept appearing, but lsn_ack, lsn_ckpt, write_success, and tps no longer changed.
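To catch this condition earlier next time, one idea is to poll /repl periodically and alert when lsn_ckpt/lsn_ack stop moving even though the process is alive. A minimal sketch, assuming the /repl endpoint returns JSON containing the lsn_ckpt and lsn_ack fields shown in this report (the URL and JSON format are assumptions; adjust to your incr_sync HTTP port):

```python
import json
import urllib.request

def is_stale(prev: dict, curr: dict) -> bool:
    """True if neither lsn_ckpt nor lsn_ack advanced between two samples."""
    return (curr.get("lsn_ckpt") == prev.get("lsn_ckpt")
            and curr.get("lsn_ack") == prev.get("lsn_ack"))

def fetch_repl(url: str) -> dict:
    # Hypothetical endpoint URL, e.g. "http://127.0.0.1:9100/repl".
    with urllib.request.urlopen(url) as resp:
        return json.loads(resp.read())

if __name__ == "__main__":
    # Two consecutive samples with frozen checkpoint fields.
    a = {"lsn_ckpt": 1774423506, "lsn_ack": 1774423524}
    b = {"lsn_ckpt": 1774423506, "lsn_ack": 1774423524}
    print(is_stale(a, b))  # True
```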
Key Timeline
1. 2026-03-25 20:44:13 JST
First occurrence of:
oplog collection capped may happen: (CappedPositionLost) CollectionScan died due to position in capped collection being deleted.
2. After that, checkpoints still advanced for a short period:
- 20:44:18 → checkpoint success [1774423477]
- 20:44:23 → checkpoint success [1774423480]
- 20:44:28 → checkpoint success [1774423484]
- …
- 20:45:06 → last checkpoint success [1774423506]
3. After 2026-03-25 20:44:52 JST
The following error started repeating roughly every 6 seconds:
Syncer[mongod_replset_1] oplog collection capped error, users should fix it manually
4. 2026-03-25 20:57:55 JST
The following logs kept repeating, but the status values stayed unchanged:
oplogReader[...] ensure network
[name=mongod_replset_1, stage=incr, get=69383925, filter=34741929, write_success=34641955, tps=0,
lsn_ckpt={..., 2026-03-25 16:25:06}, lsn_ack={..., 2026-03-25 16:25:24}]
5. However, based on our operational observation, actual syncing still continued normally until 2026-04-12.
Important Notes
- The time fields in /repl appear to be in UTC. Based on the unix timestamps:
lsn_ckpt = 1774423506 → 2026-03-25 20:25:06 JST
lsn_ack = 1774423524 → 2026-03-25 20:25:24 JST
- Therefore, the gap between the last acknowledged oplog position and the first CappedPositionLost event is only about 19 minutes.
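When cross-checking these values, it helps to convert the raw unix timestamps explicitly rather than reading the rendered time fields, to rule out timezone ambiguity between UTC-looking /repl output and JST wall-clock logs. A standard-library sketch (the fixed +9:00 offset for JST is an assumption of this sketch, not something MongoShake provides):

```python
from datetime import datetime, timezone, timedelta

JST = timezone(timedelta(hours=9))  # Japan Standard Time, fixed UTC+9

def epoch_to_jst(ts: int) -> str:
    """Render an epoch-seconds timestamp as a JST wall-clock string."""
    return datetime.fromtimestamp(ts, tz=JST).strftime("%Y-%m-%d %H:%M:%S")

if __name__ == "__main__":
    print(epoch_to_jst(0))  # 1970-01-01 09:00:00
```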
- Also, there was no replication lag at 2026-03-25 20:44 JST.
- Therefore, this does not match a simple explanation that MongoShake exceeded the oplog window because of replication lag.
Additional Observation
- There were duplicate key (E11000) logs, but they were concentrated mostly around 2026-03-25 11:29 ~ 11:41 JST.
- Checkpoints continued to advance normally until 20:45 JST, so duplicate key handling does not appear to be the direct cause of this capped error incident.
Our Current Interpretation
Our current understanding is as follows:
- The oplog reader cursor was lost when CappedPositionLost occurred.
- However, MongoShake still had some oplog data already buffered in internal workers/buffers, so the checkpoint continued to move slightly forward until 20:45:06 JST.
- After that point, it could no longer read new oplog entries, and only the oplog collection capped error kept repeating.
- The process itself remained alive and ensure network kept running, but the actual reader/status remained stale.
- Nevertheless, actual data syncing continued to work normally until 2026-04-12.
- As a result, this looks like a case where the checkpoint and /repl status became inconsistent with the actual syncing behavior.
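The buffered-progress interpretation above can be illustrated with a toy model: once the reader cursor dies, the writer keeps draining whatever was already prefetched, so the checkpoint advances briefly and then freezes. This is a simplified sketch, not MongoShake's actual internals:

```python
from collections import deque

def drain_after_reader_death(buffered):
    """Simulate checkpoint movement from already-buffered oplog timestamps
    after the reader cursor has died (no new entries arrive)."""
    queue = deque(buffered)
    checkpoints = []
    while queue:                             # writer drains the prefetched data...
        checkpoints.append(queue.popleft())  # ...and the checkpoint follows each ack
    return checkpoints                       # after this, the checkpoint freezes

if __name__ == "__main__":
    # Toy oplog timestamps already in memory when the cursor was lost.
    print(drain_after_reader_death([101, 102, 103]))  # [101, 102, 103]
```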
We can provide /repl and /conf outputs if needed.
Thank you.