
Fix: resolved issue #279 by improving docs_crawler #280

Open
Ash-934 wants to merge 3 commits into jenkinsci:main from Ash-934:fix-issue-279

Conversation

@Ash-934 commented Mar 12, 2026

Improve docs_crawler with sitemap seeding and async parallel fetching

Fixes #279

Description

The current docs_crawler.py uses synchronous, sequential HTTP requests (stack-based DFS) to crawl Jenkins documentation pages. This makes the crawl slow and may miss pages that aren't linked from other doc pages but are listed in the sitemap.

This PR replaces the synchronous crawler with a hybrid approach that:

  1. Seeds from sitemap.xml — Fetches the sitemap and extracts all /doc/ URLs upfront, ensuring comprehensive page coverage.
  2. Follows in-page links — Still discovers additional URLs by parsing links on each fetched page, catching anything the sitemap might miss.
  3. Uses async parallel fetching — Uses aiohttp with an async worker pool for concurrent page fetching, significantly improving crawl speed.
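The hybrid approach above can be sketched roughly as follows. This is a minimal, self-contained illustration, not the PR's actual code: `CrawlState`, the stub `fetch`, and the inline `SITE`/`SITEMAP_URLS` data are stand-ins (the real crawler fetches pages with `aiohttp`), but the sitemap seeding, in-page link discovery, and async worker pool match the steps described:

```python
import asyncio
from dataclasses import dataclass, field

# Illustrative stand-ins: the real crawler would fetch sitemap.xml and pages
# with aiohttp; an in-memory "site" keeps this sketch runnable offline.
SITE = {
    "/doc/": ["/doc/book/", "/doc/pipeline/"],  # links found on each page
    "/doc/book/": ["/doc/pipeline/"],
    "/doc/pipeline/": [],
    "/doc/orphan/": [],                         # only reachable via the sitemap
}
SITEMAP_URLS = ["/doc/", "/doc/orphan/"]        # step 1: seeded upfront

@dataclass
class CrawlState:
    visited: set = field(default_factory=set)
    pages: dict = field(default_factory=dict)

async def fetch(url):
    await asyncio.sleep(0)                      # simulate async I/O
    return SITE.get(url, [])

async def worker(queue, state):
    while True:
        url = await queue.get()
        try:
            if url not in state.visited:
                state.visited.add(url)
                links = await fetch(url)
                state.pages[url] = links
                for link in links:              # step 2: follow in-page links
                    if link not in state.visited:
                        queue.put_nowait(link)
        finally:
            queue.task_done()

async def crawl(seed_urls, num_workers=3):      # step 3: async worker pool
    state = CrawlState()
    queue = asyncio.Queue()
    for url in seed_urls:
        queue.put_nowait(url)
    workers = [asyncio.create_task(worker(queue, state))
               for _ in range(num_workers)]
    await queue.join()                          # wait until every URL is processed
    for w in workers:
        w.cancel()
    await asyncio.gather(*workers, return_exceptions=True)
    return state

state = asyncio.run(crawl(SITEMAP_URLS))
print(sorted(state.visited))
# ['/doc/', '/doc/book/', '/doc/orphan/', '/doc/pipeline/']
```

Note that `/doc/orphan/` is reached only because the sitemap seeded it; a pure link-following DFS would never discover it.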

Testing

  • Ran the updated crawler end-to-end and verified the output jenkins_docs.json is produced correctly.
  • Compared output against the original synchronous crawler to verify content parity; the new crawler also fetched additional HTML pages that were missing from the original crawler's output.

@Ash-934 Ash-934 requested a review from a team as a code owner March 12, 2026 21:19
@sharma-sugurthi (Contributor) left a comment

Good direction: sitemap seeding + async parallel fetching is a real improvement over sequential DFS. A few suggestions regarding the blockers, though:

  • from bs4 import BeautifulSoup is imported twice at the top; remove the duplicate.
  • visited_urls, page_content, and non_canonic_content_urls are module-level globals. If crawl() is called more than once (tests, retry scenarios), they carry over stale state. These should be initialized inside crawl() and passed down to the workers.
  • fetch_sitemap_urls() still uses synchronous requests, while all page fetching uses aiohttp. Since the crawler is now async, the sitemap fetch could also use aiohttp inside the same event loop, dropping the requests dependency for this module entirely.
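For reference, the /doc/ filtering side of the sitemap seeding can be done with the stdlib alone. A minimal sketch, assuming the standard sitemap namespace (the inline XML here is illustrative, and the real fetch_sitemap_urls() would of course download sitemap.xml first):

```python
import xml.etree.ElementTree as ET

# Illustrative sitemap snippet; the real one comes from jenkins.io/sitemap.xml.
SITEMAP_XML = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://www.jenkins.io/doc/book/</loc></url>
  <url><loc>https://www.jenkins.io/blog/some-post/</loc></url>
  <url><loc>https://www.jenkins.io/doc/pipeline/</loc></url>
</urlset>"""

def extract_doc_urls(xml_text):
    # Sitemaps live in this namespace, so findall needs a prefix mapping.
    ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
    root = ET.fromstring(xml_text)
    # Keep only documentation pages, matching the PR's /doc/ filter.
    return [loc.text for loc in root.findall(".//sm:loc", ns)
            if "/doc/" in loc.text]

print(extract_doc_urls(SITEMAP_XML))
# ['https://www.jenkins.io/doc/book/', 'https://www.jenkins.io/doc/pipeline/']
```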

The new async logic (fetch_and_process_page, worker, crawl) has no unit tests. Even basic mocking of aiohttp.ClientSession to verify retry behaviour and link discovery would help catch regressions.
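A sketch of what such a test could look like, using unittest.mock.AsyncMock so it runs without a network. fetch_html and its retry signature here are hypothetical stand-ins for the PR's _fetch_html, and the mock is shaped like aiohttp's async API rather than importing aiohttp itself; a real suite would wire this up with pytest-asyncio against the actual functions:

```python
import asyncio
from unittest.mock import AsyncMock, MagicMock

# Hypothetical helper shaped like the crawler's _fetch_html: ask the session
# for a page, retrying on connection failure. Not the actual PR code.
async def fetch_html(session, url, retries=2):
    for attempt in range(retries + 1):
        try:
            resp = await session.get(url)
            return await resp.text()
        except ConnectionError:
            if attempt == retries:
                raise
    return None

def test_retry_then_success():
    # Mock the session: the first get() raises, the second returns a response
    # whose .text() is itself awaitable, mirroring aiohttp's interface.
    ok = MagicMock()
    ok.text = AsyncMock(return_value="<html>docs</html>")
    session = MagicMock()
    session.get = AsyncMock(side_effect=[ConnectionError(), ok])

    html = asyncio.run(fetch_html(session, "https://www.jenkins.io/doc/"))
    assert html == "<html>docs</html>"
    assert session.get.await_count == 2  # one failure + one successful retry

test_retry_then_success()
print("retry test passed")
```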

@Ash-934 (Author) commented Mar 14, 2026

@sharma-sugurthi Thanks for the review! Here's what I've addressed:

1. Module-level globals - Replaced with a CrawlState dataclass created fresh in start_crawl() and passed down to all functions. No more stale state.

2. Synchronous sitemap fetch - Keeping requests for this. It's a single call before the async loop starts - no concurrency to gain.

3. Unit tests - Added test_docs_crawler.py covering _fetch_html, fetch_and_process_page, and worker along with other functions. All use mocked aiohttp.ClientSession with pytest-asyncio.

Additionally fixed pylint warnings.

@berviantoleo berviantoleo added bug For changelog: Minor bug. Will be listed after features enhancement For changelog: Minor enhancement. use `major-rfe` for changes to be highlighted and removed bug For changelog: Minor bug. Will be listed after features labels Mar 18, 2026