
Fix: resolved issue #279 by improving docs_crawler #280

Open
Ash-934 wants to merge 3 commits into jenkinsci:main from Ash-934:fix-issue-279

Conversation

@Ash-934 commented Mar 12, 2026

Improve docs_crawler with sitemap seeding and async parallel fetching

Fixes #279

Description

The current docs_crawler.py uses synchronous, sequential HTTP requests (stack-based DFS) to crawl Jenkins documentation pages. This makes the crawl slow and may miss pages that aren't linked from other doc pages but are listed in the sitemap.

This PR replaces the synchronous crawler with a hybrid approach that:

  1. Seeds from sitemap.xml — Fetches the sitemap and extracts all /doc/ URLs upfront, ensuring comprehensive page coverage.
  2. Follows in-page links — Still discovers additional URLs by parsing links on each fetched page, catching anything the sitemap might miss.
  3. Uses async parallel fetching — Uses aiohttp with an async worker pool for concurrent page fetching, significantly improving crawl speed.
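The hybrid approach above can be sketched roughly as follows. This is a minimal, self-contained illustration, not the PR's actual code: `CrawlState`, the stub `fetch`, and the inline `SITE`/`SITEMAP_URLS` data are stand-ins (the real crawler fetches pages with `aiohttp`), but the sitemap seeding, in-page link discovery, and async worker pool match the steps described:

```python
import asyncio
from dataclasses import dataclass, field

# Illustrative stand-ins: the real crawler would fetch sitemap.xml and pages
# with aiohttp; an in-memory "site" keeps this sketch runnable offline.
SITE = {
    "/doc/": ["/doc/book/", "/doc/pipeline/"],  # links found on each page
    "/doc/book/": ["/doc/pipeline/"],
    "/doc/pipeline/": [],
    "/doc/orphan/": [],                         # only reachable via the sitemap
}
SITEMAP_URLS = ["/doc/", "/doc/orphan/"]        # step 1: seeded upfront

@dataclass
class CrawlState:
    visited: set = field(default_factory=set)
    pages: dict = field(default_factory=dict)

async def fetch(url):
    await asyncio.sleep(0)                      # simulate async I/O
    return SITE.get(url, [])

async def worker(queue, state):
    while True:
        url = await queue.get()
        try:
            if url not in state.visited:
                state.visited.add(url)
                links = await fetch(url)
                state.pages[url] = links
                for link in links:              # step 2: follow in-page links
                    if link not in state.visited:
                        queue.put_nowait(link)
        finally:
            queue.task_done()

async def crawl(seed_urls, num_workers=3):      # step 3: async worker pool
    state = CrawlState()
    queue = asyncio.Queue()
    for url in seed_urls:
        queue.put_nowait(url)
    workers = [asyncio.create_task(worker(queue, state))
               for _ in range(num_workers)]
    await queue.join()                          # wait until every URL is processed
    for w in workers:
        w.cancel()
    await asyncio.gather(*workers, return_exceptions=True)
    return state

state = asyncio.run(crawl(SITEMAP_URLS))
print(sorted(state.visited))
# ['/doc/', '/doc/book/', '/doc/orphan/', '/doc/pipeline/']
```

Note that `/doc/orphan/` is reached only because the sitemap seeded it; a pure link-following DFS would never discover it.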

Testing

  • Ran the updated crawler end-to-end and verified the output jenkins_docs.json is produced correctly.
  • Compared output against the original synchronous crawler to verify content parity; the new crawler also fetched additional HTML pages that were missing from the original crawler's output.

@Ash-934 Ash-934 requested a review from a team as a code owner March 12, 2026 21:19
@sharma-sugurthi (Contributor) left a comment

Good direction: sitemap seeding + async parallel fetching is a real improvement over sequential DFS. A few suggestions regarding the blockers, though:

  • from bs4 import BeautifulSoup is imported twice at the top; remove the duplicate.
  • visited_urls, page_content, and non_canonic_content_urls are module-level globals. If crawl() is called more than once (tests, retry scenarios), they carry over stale state. These should be initialized inside crawl() and passed down to the workers.
  • fetch_sitemap_urls() still uses synchronous requests, while all page fetching uses aiohttp. Since the crawler is now async, the sitemap fetch could also use aiohttp inside the same event loop, dropping the requests dependency for this module entirely.
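For reference, the /doc/ filtering side of the sitemap seeding can be done with the stdlib alone. A minimal sketch, assuming the standard sitemap namespace (the inline XML here is illustrative, and the real fetch_sitemap_urls() would of course download sitemap.xml first):

```python
import xml.etree.ElementTree as ET

# Illustrative sitemap snippet; the real one comes from jenkins.io/sitemap.xml.
SITEMAP_XML = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://www.jenkins.io/doc/book/</loc></url>
  <url><loc>https://www.jenkins.io/blog/some-post/</loc></url>
  <url><loc>https://www.jenkins.io/doc/pipeline/</loc></url>
</urlset>"""

def extract_doc_urls(xml_text):
    # Sitemaps live in this namespace, so findall needs a prefix mapping.
    ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
    root = ET.fromstring(xml_text)
    # Keep only documentation pages, matching the PR's /doc/ filter.
    return [loc.text for loc in root.findall(".//sm:loc", ns)
            if "/doc/" in loc.text]

print(extract_doc_urls(SITEMAP_XML))
# ['https://www.jenkins.io/doc/book/', 'https://www.jenkins.io/doc/pipeline/']
```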

The new async logic (fetch_and_process_page, worker, crawl) has no unit tests. Even basic mocking of aiohttp.ClientSession to verify retry behaviour and link discovery would help catch regressions.
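A sketch of what such a test could look like, using unittest.mock.AsyncMock so it runs without a network. fetch_html and its retry signature here are hypothetical stand-ins for the PR's _fetch_html, and the mock is shaped like aiohttp's async API rather than importing aiohttp itself; a real suite would wire this up with pytest-asyncio against the actual functions:

```python
import asyncio
from unittest.mock import AsyncMock, MagicMock

# Hypothetical helper shaped like the crawler's _fetch_html: ask the session
# for a page, retrying on connection failure. Not the actual PR code.
async def fetch_html(session, url, retries=2):
    for attempt in range(retries + 1):
        try:
            resp = await session.get(url)
            return await resp.text()
        except ConnectionError:
            if attempt == retries:
                raise
    return None

def test_retry_then_success():
    # Mock the session: the first get() raises, the second returns a response
    # whose .text() is itself awaitable, mirroring aiohttp's interface.
    ok = MagicMock()
    ok.text = AsyncMock(return_value="<html>docs</html>")
    session = MagicMock()
    session.get = AsyncMock(side_effect=[ConnectionError(), ok])

    html = asyncio.run(fetch_html(session, "https://www.jenkins.io/doc/"))
    assert html == "<html>docs</html>"
    assert session.get.await_count == 2  # one failure + one successful retry

test_retry_then_success()
print("retry test passed")
```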

@Ash-934 (Author) commented Mar 14, 2026

@sharma-sugurthi Thanks for the review! Here's what I've addressed:

1. Module-level globals - Replaced with a CrawlState dataclass created fresh in start_crawl() and passed down to all functions. No more stale state.

2. Synchronous sitemap fetch - Keeping requests for this. It's a single call before the async loop starts - no concurrency to gain.

3. Unit tests - Added test_docs_crawler.py covering _fetch_html, fetch_and_process_page, and worker along with other functions. All use mocked aiohttp.ClientSession with pytest-asyncio.

Additionally fixed pylint warnings.

@berviantoleo berviantoleo added bug For changelog: Minor bug. Will be listed after features enhancement For changelog: Minor enhancement. use `major-rfe` for changes to be highlighted and removed bug For changelog: Minor bug. Will be listed after features labels Mar 18, 2026