Inconsistent checkpoint/status and syncing behavior after CappedPositionLost (MongoShake v2.8.7)

Hello,

We would like to report abnormal checkpoint/oplog reader behavior that we observed in MongoShake.

### **Environment**
* **MongoShake version:** v2.8.7
* **Sync mode:** `sync_mode = all`
* **Source MongoDB version:** 4.2.18
* **Target MongoDB version:** 7.0.28
* **Incremental sync fetch method:** oplog
* **Source connect mode:** secondaryPreferred
* **Checkpoint storage:** `mongoshake.ckpt_default`
* **Checkpoint interval:** 5000
* **Typical oplog window on source:** around 4.5 hours

### **Observed Behavior**
* MongoShake checkpoint and `/repl` status values appeared to stop advancing after 2026-03-25.
* In particular, `lsn_ckpt` and `lsn_ack` remained at old values.
* However, from the service/application perspective, actual syncing continued to work normally until 2026-04-12.
* In other words, checkpoint/status values appeared stale, while actual data synchronization seemed to continue normally.
* Later, the `oplog collection capped error` message started repeating.
* The process itself stayed alive, and `ensure network` logs kept appearing, but `lsn_ack`, `lsn_ckpt`, `write_success`, and `tps` no longer changed.

### **Key Timeline**
**1. 2026-03-25 20:44:13 JST**
First occurrence of:
```text
oplog collection capped may happen: (CappedPositionLost) CollectionScan died due to position in capped collection being deleted.
```

**2. After that, checkpoints still advanced for a short period:**
* 20:44:18 → checkpoint success [1774423477]
* 20:44:23 → checkpoint success [1774423480]
* 20:44:28 → checkpoint success [1774423484]
* …
* 20:45:06 → last checkpoint success [1774423506]

**3. After 2026-03-25 20:44:52 JST**
The following error started repeating roughly every 6 seconds:
```text
Syncer[mongod_replset_1] oplog collection capped error, users should fix it manually
```

**4. 2026-03-25 20:57:55 JST**
The following logs kept repeating, but the status values stayed unchanged:
```text
oplogReader[...] ensure network
[name=mongod_replset_1, stage=incr, get=69383925, filter=34741929, write_success=34641955, tps=0,
 lsn_ckpt={..., 2026-03-25 16:25:06}, lsn_ack={..., 2026-03-25 16:25:24}]
```
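One way to confirm that the status line is frozen is to compare its counters between two log samples. The following is a minimal sketch, assuming the counters keep the `key=number` form shown in the log above; `parse_counters` and `changed_counters` are hypothetical helper names and not part of MongoShake.

```python
import re

# Matches integer counters such as "get=69383925" or "tps=0".
# Non-numeric fields (name=..., stage=..., lsn_ckpt={...}) are skipped.
COUNTER_RE = re.compile(r"\b(\w+)=(\d+)\b")


def parse_counters(line):
    """Extract integer key=value pairs from one status log line."""
    return {key: int(value) for key, value in COUNTER_RE.findall(line)}


def changed_counters(prev_line, curr_line):
    """Return {counter: (old, new)} for counters that differ between two lines."""
    prev, curr = parse_counters(prev_line), parse_counters(curr_line)
    return {
        key: (prev[key], curr[key])
        for key in prev.keys() & curr.keys()
        if prev[key] != curr[key]
    }


# Sample taken from the status line quoted in this issue.
sample = (
    "[name=mongod_replset_1, stage=incr, get=69383925, filter=34741929, "
    "write_success=34641955, tps=0,"
)
print(parse_counters(sample))
# An empty dict means no counter moved between the two samples.
print(changed_counters(sample, sample))
```

An empty result across samples taken minutes apart matches the frozen state described here.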

**5. However, based on our operational observation, actual syncing still continued normally until 2026-04-12.**

### **Important Notes**
* The unix timestamps show that the time fields in `/repl` are in local time (JST, UTC+9), not UTC:
    * `lsn_ckpt = 1774423506` → 2026-03-25 07:25:06 UTC = **2026-03-25 16:25:06 JST**
    * `lsn_ack = 1774423524` → 2026-03-25 07:25:24 UTC = **2026-03-25 16:25:24 JST**
* Therefore, the gap between the last acknowledged oplog position and the first `CappedPositionLost` event (20:44:13 JST) is about 4 hours 19 minutes, which is close to the typical oplog window of around 4.5 hours.
* However, **there was no replication lag at 2026-03-25 20:44 JST.**
* So a simple explanation that MongoShake exceeded the oplog window because of replication lag does not match what we observed, even though the acknowledged position was already near the edge of the window.
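The conversion of the two unix timestamps, and the gap to the first `CappedPositionLost` event, can be cross-checked with a short script. This is a minimal sketch: `OPLOG_WINDOW` is only the approximate 4.5-hour figure from the Environment section, and the helper name `to_utc_and_jst` is hypothetical.

```python
from datetime import datetime, timedelta, timezone

JST = timezone(timedelta(hours=9))
OPLOG_WINDOW = timedelta(hours=4.5)  # approximate value from the Environment section


def to_utc_and_jst(unix_seconds):
    """Convert a unix timestamp to timezone-aware UTC and JST datetimes."""
    utc = datetime.fromtimestamp(unix_seconds, tz=timezone.utc)
    return utc, utc.astimezone(JST)


lsn_ckpt = 1774423506
lsn_ack = 1774423524
first_capped_error = datetime(2026, 3, 25, 20, 44, 13, tzinfo=JST)

for name, ts in (("lsn_ckpt", lsn_ckpt), ("lsn_ack", lsn_ack)):
    utc, jst = to_utc_and_jst(ts)
    print(name, utc.isoformat(), jst.isoformat())
# lsn_ckpt 2026-03-25T07:25:06+00:00 2026-03-25T16:25:06+09:00
# lsn_ack 2026-03-25T07:25:24+00:00 2026-03-25T16:25:24+09:00

gap = first_capped_error - to_utc_and_jst(lsn_ack)[1]
print(gap)                 # 4:18:49
print(gap < OPLOG_WINDOW)  # True
```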

### **Additional Observation**
* There were duplicate key (`E11000`) logs, but they were concentrated mostly around 2026-03-25 11:29 ~ 11:41 JST.
* Checkpoints continued to advance normally until 20:45 JST, so duplicate key handling does not appear to be the direct cause of this capped error incident.

### **Our Current Interpretation**
Our current understanding is as follows:
1.  The oplog reader cursor was lost when `CappedPositionLost` occurred.
2.  However, MongoShake still had some oplog data already buffered in internal workers/buffers, so the checkpoint continued to move slightly forward until 20:45:06 JST.
3.  After that point, it could no longer read new oplog entries, and only the `oplog collection capped error` kept repeating.
4.  The process itself remained alive and `ensure network` kept running, but the actual reader/status remained stale.
5.  Nevertheless, actual data syncing continued to work normally until 2026-04-12.
6.  **As a result, this looks like a case where checkpoint / `/repl` status became inconsistent with the actual syncing behavior.**

We can provide `/repl` and `/conf` outputs if needed.

Thank you.
