Skip to content

fix(core): recurse into submodule files when crawling git repos#4596

Open
he-yufeng wants to merge 2 commits into
QwenLM:mainfrom
he-yufeng:fix/recurse-submodules-crawler
Open

fix(core): recurse into submodule files when crawling git repos#4596
he-yufeng wants to merge 2 commits into
QwenLM:mainfrom
he-yufeng:fix/recurse-submodules-crawler

Conversation

@he-yufeng
Copy link
Copy Markdown
Contributor

Summary

Fixes #4568.

git ls-files --cached -t reports a submodule as the gitlink path, so the crawler only saw vendor/lib and missed files tracked inside that submodule. This adds --recurse-submodules to the tracked-file listing path so submodule contents are crawled the same way regular tracked files are.

I also added a real git fixture that creates a parent repo, adds a local submodule, and verifies the crawler returns a file from inside the submodule.

Verification

  • npx vitest run src/utils/filesearch/crawler.test.ts -t "recurse into tracked submodules"
  • npx vitest run src/utils/filesearch/crawler.test.ts
  • npm run typecheck --workspace=@qwen-code/qwen-code-core
  • npm run lint --workspace=@qwen-code/qwen-code-core -- src/utils/filesearch/crawler.ts src/utils/filesearch/crawler.test.ts
  • npm run build --workspace=@qwen-code/qwen-code-core
  • git diff --check

// versions; newline-delimited output is fine here (index paths cannot contain newlines).
const trackedArgs = ['--literal-pathspecs', 'ls-files', '--cached'];
const trackedArgs = [
'--literal-pathspecs',
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Critical] git ls-files --deleted --recurse-submodules is unsupported by git (fatal: "unsupported mode"). This creates an asymmetry: when a tracked file inside an initialized submodule is deleted from the working tree, --cached --recurse-submodules still lists it (from the submodule's index), but listDeletedTrackedFiles (--deleted) cannot detect the deletion. The deletedSet check in processTrackedFile therefore does not filter it, and the crawler returns a phantom path that doesn't exist on disk.

Impact: File search results include ghost entries for deleted submodule files. Downstream consumers (read file, @-mention completion) will get ENOENT errors.

Suggested change
'--literal-pathspecs',
const trackedArgs = [
'--literal-pathspecs',
'ls-files',
'--cached',
'--recurse-submodules',
];
// NOTE: git does not support `--deleted --recurse-submodules` (fatal error).
// Deleted files inside submodules won't appear in deletedSet.
// Consider adding an fs.access() guard for paths under submodule mount points,
// or use `git submodule foreach --recursive 'git ls-files -z --deleted'`
// to collect submodule-internal deleted files.

Also: When a submodule is registered but not initialized (common after plain git clone), --cached --recurse-submodules -t emits the gitlink mount point (e.g., H vendor/lib) as a cached entry. processTrackedFile only skips status S, but gitlinks appear as H — so a directory path ends up in the file list. Consider filtering gitlink entries or adding a stat guard.

— qwen3.7-max via Qwen Code /review

@he-yufeng
Copy link
Copy Markdown
Contributor Author

Addressed the submodule edge cases from review in c806c2a.

Changes:

  • kept deleted-file filtering on the top-level git index path instead of relying on unsupported git ls-files --deleted --recurse-submodules
  • check each cached tracked entry against the working tree and skip entries that are missing
  • skip cached directory entries, which covers uninitialized submodule gitlinks

Validation:

  • npm run test --workspace=packages/core -- crawler.test.ts
  • npm run typecheck --workspace=packages/core
  • npx eslint packages/core/src/utils/filesearch/crawler.ts packages/core/src/utils/filesearch/crawler.test.ts --max-warnings 0
  • git diff --check

Copy link
Copy Markdown
Collaborator

@wenshao wenshao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No review findings at this commit. Downgraded from Approve to Comment: CI still running. The implementation correctly addresses the previously raised Critical (deleted-file filtering asymmetry handled via lstatSync guard) and Suggestion (gitlink directory entries filtered by isDirectory check). All 47 crawler tests pass. — qwen3.7-max via Qwen Code /review

@wenshao
Copy link
Copy Markdown
Collaborator

wenshao commented May 29, 2026

Verification Report — PR #4596 (Recurse into Submodule Files)

Tested by: wenshao
Branch: fix/recurse-submodules-crawler
Commit: c806c2a5 (latest)
Environment: macOS Darwin 25.4.0, Node.js, real git submodule fixtures


1. Build / Typecheck / Lint

Check Result
npm run build --workspace=packages/core PASS (0 errors)
npm run typecheck --workspace=packages/core PASS (0 TS errors)
npm run lint --workspace=packages/core -- crawler.ts crawler.test.ts PASS (0 lint issues)
git diff --check PASS (no whitespace issues)

2. Unit Tests — crawler.test.ts

Run Tests Result Notes
Default timeout (5s) 47 46 pass, 1 timeout should recurse into tracked submodules takes ~5.4s, exceeds default 5s
Extended timeout (30s) 47 47 pass All tests pass including the 3 new submodule tests

New tests added by this PR:

Test Type Result Time
should recurse into tracked submodules on the git path Real git fixture PASS ~5.3s
should skip missing tracked paths from submodule indexes Mock PASS 4ms
should skip cached gitlink directories from uninitialized submodules Mock PASS 2ms

3. Live Submodule Integration (manual verification)

Created a real parent repo with a submodule at vendor/lib containing files at two depth levels.

Before fix (git ls-files --cached -t without --recurse-submodules):

H .gitmodules
H root.txt
H vendor/lib          ← gitlink directory only, files inside invisible

After fix (git ls-files --cached -t --recurse-submodules):

H .gitmodules
H root.txt
H vendor/lib/deep/nested/deep-file.txt   ← now visible
H vendor/lib/inner.txt                    ← now visible

Also verified:

  • lstatSync filter: vendor/lib gitlink entry correctly identified as directory and filtered out
  • Stale index entries: lstatSync try/catch correctly filters paths that exist in the index but not on disk

4. Findings

  • ⚠️ Test timeout (non-blocking): The real submodule fixture test (should recurse into tracked submodules) takes ~5.3s due to git init + git submodule add + git commit overhead. It marginally exceeds the default 5s Vitest timeout. CI passes (macOS/Ubuntu/Windows all green) because CI machines are faster, but locally on macOS the test can flap. Consider adding { timeout: 15_000 } to this specific test for margin.
  • The code change is minimal and surgical: 1 flag added (--recurse-submodules) + 1 stat guard (8 lines) to filter stale/directory entries.
  • No regressions in the existing 44 crawler tests.

5. Verdict

PASS — ready to merge. The core fix is correct and well-tested. The timeout finding is cosmetic (CI is green) but worth a quick fix for local dev experience.


Tested on 2026-05-29 via tmux with real git submodule fixtures and full test suite.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Bug: @ file completion shows submodule folder but no files inside it

2 participants