Inconsistent checkpoint/status and syncing behavior after CappedPositionLost (MongoShake v2.8.7) #956

@syKim12

Hello,

We would like to report abnormal checkpoint/oplog reader behavior observed in MongoShake.

Environment

  • MongoShake version: v2.8.7
  • Sync mode: sync_mode = all
  • Source MongoDB version: 4.2.18
  • Target MongoDB version: 7.0.28
  • Incremental sync fetch method: oplog
  • Source connect mode: secondaryPreferred
  • Checkpoint storage: mongoshake.ckpt_default
  • Checkpoint interval: 5000
  • Typical oplog window on source: around 4.5 hours

Observed Behavior

  • MongoShake checkpoint and /repl status values appeared to stop advancing after 2026-03-25.
  • In particular, lsn_ckpt and lsn_ack remained at old values.
  • However, from the service/application perspective, actual syncing continued to work normally until 2026-04-12.
  • In other words, checkpoint/status values appeared stale, while actual data synchronization seemed to continue normally.
  • Later, the oplog collection capped error message started repeating.
  • The process itself stayed alive, and ensure network logs kept appearing, but lsn_ack, lsn_ckpt, write_success, and tps no longer changed.

Key Timeline

1. 2026-03-25 20:44:13 JST
First occurrence of:

oplog collection capped may happen: (CappedPositionLost) CollectionScan died due to position in capped collection being deleted.

2. After that, checkpoints still advanced for a short period:

  • 20:44:18 → checkpoint success [1774423477]
  • 20:44:23 → checkpoint success [1774423480]
  • 20:44:28 → checkpoint success [1774423484]
  • 20:45:06 → last checkpoint success [1774423506]

3. After 2026-03-25 20:44:52 JST
The following error started repeating roughly every 6 seconds:

Syncer[mongod_replset_1] oplog collection capped error, users should fix it manually

4. 2026-03-25 20:57:55 JST
The following logs kept repeating, but the status values stayed unchanged:

oplogReader[...] ensure network
[name=mongod_replset_1, stage=incr, get=69383925, filter=34741929, write_success=34641955, tps=0,
 lsn_ckpt={..., 2026-03-25 16:25:06}, lsn_ack={..., 2026-03-25 16:25:24}]

5. However, based on our operational observation, actual syncing still continued normally until 2026-04-12.
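The stale state in step 4 (log lines still appearing while the counters stay frozen) could be detected by comparing two successive status samples. The following is a minimal sketch, not part of MongoShake: the helper names `parse_counters` and `looks_stalled` are hypothetical, and the field names are taken only from the log line quoted above.

```python
import re

# Counters that appear in the quoted status line; the pattern assumes
# the "key=integer" layout shown there.
COUNTER_RE = re.compile(r"\b(get|filter|write_success|tps)=(\d+)")


def parse_counters(status_line: str) -> dict[str, int]:
    """Extract the integer counters from one status log line."""
    return {key: int(value) for key, value in COUNTER_RE.findall(status_line)}


def looks_stalled(previous: dict[str, int], current: dict[str, int]) -> bool:
    """Return True when no progress counter moved between two samples and tps is 0."""
    progress_keys = ("get", "filter", "write_success")
    unchanged = all(current.get(k) == previous.get(k) for k in progress_keys)
    return unchanged and current.get("tps", 0) == 0


# Status line as quoted in the timeline (elided fields omitted).
line = (
    "[name=mongod_replset_1, stage=incr, get=69383925, filter=34741929, "
    "write_success=34641955, tps=0,"
)
sample = parse_counters(line)
```

Calling `looks_stalled` on two samples taken some minutes apart would flag the situation described here, where the process is alive but nothing advances.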

Important Notes

  • The time fields in /repl appear to be in UTC. Based on the unix timestamps:
    • lsn_ckpt = 1774423506 → 2026-03-25 20:25:06 JST
    • lsn_ack = 1774423524 → 2026-03-25 20:25:24 JST
  • Therefore, the gap between the last acknowledged oplog position and the first CappedPositionLost event is only about 19 minutes.
  • Also, there was no replication lag at 2026-03-25 20:44 JST.
  • Therefore, this does not match a simple explanation that MongoShake exceeded the oplog window because of replication lag.
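Because the reasoning above depends on which time zone the /repl fields use, the conversion is worth cross-checking independently. The following is a minimal sketch that renders the two unix timestamps from this report in both UTC and JST (UTC+9); the helper name `describe` is hypothetical.

```python
from datetime import datetime, timedelta, timezone

# JST is a fixed UTC+9 offset with no daylight saving time.
JST = timezone(timedelta(hours=9), name="JST")
FORMAT = "%Y-%m-%d %H:%M:%S"


def describe(unix_seconds: int) -> tuple[str, str]:
    """Return the timestamp formatted as (UTC text, JST text)."""
    utc = datetime.fromtimestamp(unix_seconds, tz=timezone.utc)
    return utc.strftime(FORMAT), utc.astimezone(JST).strftime(FORMAT)


# Timestamps reported for lsn_ckpt and lsn_ack in this issue.
for label, ts in (("lsn_ckpt", 1774423506), ("lsn_ack", 1774423524)):
    utc_text, jst_text = describe(ts)
    print(f"{label}: {ts} -> {utc_text} UTC / {jst_text} JST")
```

Comparing the printed values against the lsn_ckpt and lsn_ack times shown by /repl indicates which zone the status endpoint reports, and therefore how large the gap to the first CappedPositionLost event really is.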

Additional Observation

  • There were duplicate key (E11000) logs, but they were concentrated mostly around 2026-03-25 11:29 ~ 11:41 JST.
  • Checkpoints continued to advance normally until 20:45 JST, so duplicate key handling does not appear to be the direct cause of this capped error incident.

Our Current Interpretation

Our current understanding is as follows:

  1. The oplog reader cursor was lost when CappedPositionLost occurred.
  2. However, MongoShake still had some oplog data already buffered in internal workers/buffers, so the checkpoint continued to move slightly forward until 20:45:06 JST.
  3. After that point, it could no longer read new oplog entries, and only the oplog collection capped error kept repeating.
  4. The process itself remained alive and ensure network kept running, but the actual reader/status remained stale.
  5. Nevertheless, actual data syncing continued to work normally until 2026-04-12.
  6. As a result, this looks like a case where checkpoint / /repl status became inconsistent with the actual syncing behavior.
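Point 1 of this interpretation hinges on whether the reader's resume position still exists in the capped oplog. The following is a minimal sketch of that check, assuming the timestamp of the oldest oplog entry has been read separately from the source (for example from the first document in local.oplog.rs). The function names and the sample numbers are hypothetical and are not measurements from this incident.

```python
def headroom_seconds(resume_ts: int, oldest_oplog_ts: int) -> int:
    """Seconds between the resume position and the oldest retained oplog entry.

    A negative value means the resume position has already been rolled out
    of the capped collection.
    """
    return resume_ts - oldest_oplog_ts


def can_resume(resume_ts: int, oldest_oplog_ts: int) -> bool:
    """Return True when the resume position is still inside the oplog window."""
    return headroom_seconds(resume_ts, oldest_oplog_ts) >= 0


# Hypothetical values for illustration only (unix seconds).
example_resume_ts = 1_700_000_600
example_oldest_ts = 1_700_000_000
print(can_resume(example_resume_ts, example_oldest_ts),
      headroom_seconds(example_resume_ts, example_oldest_ts))
```

Logging the headroom next to each checkpoint would show whether the reader was drifting toward the edge of the oplog window before the first CappedPositionLost error.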

We can provide /repl and /conf outputs if needed.

Thank you.

Labels: bug
