Fix: resolved issue #279 by improving docs_crawler #280
Ash-934 wants to merge 3 commits into jenkinsci:main from
Conversation
sharma-sugurthi
left a comment
Good direction: sitemap seeding + async parallel fetching is a real improvement over sequential DFS. Some suggestions regarding the blockers:
- `from bs4 import BeautifulSoup` is imported twice at the top; remove the duplicate.
- `visited_urls`, `page_content`, and `non_canonic_content_urls` are module-level globals. If `crawl()` is called more than once (tests, retry scenarios), they carry over stale state. They should be initialized inside `crawl()` and passed down to the workers.
- `fetch_sitemap_urls()` still uses synchronous `requests`, while all page fetching uses `aiohttp`. Since the crawler is now async, the sitemap fetch could also use `aiohttp` inside the same event loop, dropping the `requests` dependency for this module entirely.
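The sitemap point could look roughly like this. A minimal sketch, assuming the sitemap is plain XML; `parse_doc_urls` is a hypothetical helper, and `session` is duck-typed where the real module would pass an `aiohttp.ClientSession`:

```python
import re


def parse_doc_urls(sitemap_xml: str) -> set[str]:
    # Pull every <loc> entry out of the sitemap and keep only /doc/ pages.
    # A real implementation might prefer an XML parser over a regex.
    locs = re.findall(r"<loc>\s*(.*?)\s*</loc>", sitemap_xml)
    return {url for url in locs if "/doc/" in url}


async def fetch_sitemap_urls(session, sitemap_url: str) -> set[str]:
    # `session` is meant to be an aiohttp.ClientSession, so the sitemap is
    # fetched in the same event loop as the page workers, with no `requests`.
    async with session.get(sitemap_url) as resp:
        resp.raise_for_status()
        return parse_doc_urls(await resp.text())
```

Keeping the URL filtering in a small synchronous helper also makes it trivially unit-testable without any network or mocking.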
The new async logic (`fetch_and_process_page`, `worker`, `crawl`) has no unit tests. Even basic mocking of `aiohttp.ClientSession` to verify retry behaviour and link discovery would help catch regressions.
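Such a retry test needs no network: because `aiohttp.ClientSession` is only used through `session.get(...)` as an async context manager, `unittest.mock` can stand in for it via duck typing. `fetch_with_retry` below is a hypothetical simplification of the PR's retry loop, not its actual code:

```python
import asyncio
from unittest.mock import AsyncMock, MagicMock


async def fetch_with_retry(session, url, retries=3):
    # Hypothetical stand-in for fetch_and_process_page's retry behaviour.
    last_exc = None
    for _ in range(retries):
        try:
            async with session.get(url) as resp:
                resp.raise_for_status()
                return await resp.text()
        except Exception as exc:  # real code would catch aiohttp.ClientError
            last_exc = exc
    raise last_exc


def make_mock_session(outcomes):
    # Each outcome is either an Exception (a failed attempt) or page text.
    session = MagicMock()
    remaining = list(outcomes)

    def get(url):
        outcome = remaining.pop(0)
        resp = MagicMock()
        if isinstance(outcome, Exception):
            resp.raise_for_status.side_effect = outcome
        else:
            resp.text = AsyncMock(return_value=outcome)
        # MagicMock pre-configures __aenter__/__aexit__, so the result
        # works directly as an async context manager.
        ctx = MagicMock()
        ctx.__aenter__.return_value = resp
        return ctx

    session.get = get
    return session


def test_retries_then_succeeds():
    session = make_mock_session([ConnectionError("boom"), "<html>ok</html>"])
    body = asyncio.run(fetch_with_retry(session, "https://www.jenkins.io/doc/"))
    assert body == "<html>ok</html>"
```

The same mock-session pattern would extend to link-discovery tests by returning HTML fixtures and asserting which URLs get queued.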
@sharma-sugurthi Thanks for the review! Here's what I've addressed:
1. Module-level globals - Replaced with a
2. Synchronous sitemap fetch - Keeping
3. Unit tests - Added

Additionally fixed pylint warnings.
Improve docs_crawler with sitemap seeding and async parallel fetching
Fixes #279
Description
The current docs_crawler.py uses synchronous, sequential HTTP requests (stack-based DFS) to crawl Jenkins documentation pages. This makes the crawl slow and may miss pages that aren't linked from other doc pages but are listed in the sitemap.
This PR replaces the synchronous crawler with a hybrid approach that:
- `sitemap.xml` seeding: fetches the sitemap and extracts all `/doc/` URLs upfront, ensuring comprehensive page coverage.
- `aiohttp` with an async worker pool for concurrent page fetching, significantly improving crawl speed.

Testing
`jenkins_docs.json` is produced correctly.
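The worker-pool half of the approach described above can be sketched with an `asyncio.Queue` shared by several workers. Names and structure here are illustrative, not the PR's exact code; `fetch` is assumed to be an async callable wrapping the per-page `aiohttp` request and BeautifulSoup parsing:

```python
import asyncio


async def worker(queue, fetch, visited, results):
    # Each worker pulls URLs until cancelled; concurrency comes from several
    # workers draining one shared queue.
    while True:
        url = await queue.get()
        try:
            if url not in visited:
                visited.add(url)  # mark before awaiting so peers skip it
                results[url] = await fetch(url)
        finally:
            queue.task_done()


async def crawl(seed_urls, fetch, num_workers=8):
    # All crawl state is local to crawl(), so repeated calls start clean.
    visited, results = set(), {}
    queue = asyncio.Queue()
    for url in seed_urls:  # sitemap-derived seeds go in upfront
        queue.put_nowait(url)
    workers = [asyncio.create_task(worker(queue, fetch, visited, results))
               for _ in range(num_workers)]
    await queue.join()   # blocks until every queued URL is processed
    for task in workers:
        task.cancel()    # workers loop forever; stop them explicitly
    await asyncio.gather(*workers, return_exceptions=True)
    return results
```

In the real crawler, `fetch` would also enqueue newly discovered in-scope links back onto the queue, which `queue.join()` handles naturally since `task_done()` is only called after each URL is fully processed.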