Add Mermaid-based architecture diagram to README #6
alizahh-7 wants to merge 2 commits into dbpedia:main from
Conversation
📝 Walkthrough

Documentation update to README.md introducing an architecture overview diagram of the Wikimedia Dumps automation pipeline workflow. Significant refactoring of wikimedia_publish.py, converting the production API workflow into a test-centric implementation with new helper functions for SHA-256 computation and payload assembly.
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related issues
🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (2 warnings)
✅ Passed checks (1 passed)
This PR addresses #5.
Actionable comments posted: 3
🤖 Fix all issues with AI agents
In `@wikimedia_publish.py`:
- Around line 102-135: The Databus payloads use databus_id, title and
description that ignore the loop variable variant, causing different variants to
share the same `@id` and overwrite each other; update how databus_id, title and
description are built inside the for variant, file_list in file_groups.items()
loop to incorporate the variant (e.g. append or embed variant into the
databus_id string and into title/description) before appending to payloads so
each payload uses a unique identifier per variant (refer to databus_id, title,
description, payloads and file_groups in the diff). See the first sketch after this list.
- Around line 167-207: The main issue is that main is hardcoded to run in test
mode (main(test_mode=True)) causing production runs to always use
test_dumpstatus.json; update the function signature and invocation so production
default is disabled and real runs use crawled_urls.txt: change async def
main(test_mode=True) to async def main(test_mode=False) (or remove the default
and require an explicit flag), and update the __main__ block from
asyncio.run(main(test_mode=True)) to asyncio.run(main()) (or pass a value based
on an environment variable or CLI arg). Ensure references to
process_single_job(session, ...) remain correct when test_mode is False so real
HTTP requests execute. See the second sketch after this list.
- Around line 76-109: The compute_sha256 call in create_api_payload is passing
None for the HTTP session which causes session.get to fail and the broad except
to return the dummy hash; change the call to await compute_sha256(session,
download_url) in create_api_payload and update compute_sha256 to narrow its
exception handling (catch only network/HTTP errors such as aiohttp.ClientError
or response.raise_for_status failures), log the real exception, and only return
the test/dummy hash under an explicit test-mode flag or when a specific
recoverable network error occurs so genuine failures aren't silently masked. See the third sketch after this list.
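First, a minimal sketch of the variant-unique payload fix. The helper name `build_payloads`, the payload fields, and the `databus_base`/`dump_date` parameters are illustrative assumptions, not the actual names used in wikimedia_publish.py; the point is only that the loop variable `variant` flows into the identifier, title and description:

```python
def build_payloads(file_groups, databus_base, dump_date):
    """Build one Databus payload per variant so the @id values never collide."""
    payloads = []
    for variant, file_list in file_groups.items():
        # Embed the variant in the identifier, title and description so each
        # payload is unique and later variants no longer overwrite earlier ones.
        databus_id = f"{databus_base}/wikimedia-dumps/{dump_date}-{variant}"
        payloads.append({
            "@id": databus_id,
            "title": f"Wikimedia dump {dump_date} ({variant})",
            "description": f"Files for the {variant} variant of the {dump_date} dump.",
            "files": file_list,
        })
    return payloads


# Two variants now yield two distinct identifiers.
groups = {"pages-articles": ["a.xml.bz2"], "abstract": ["b.xml.gz"]}
for payload in build_payloads(groups, "https://databus.example.org/user", "20240601"):
    print(payload["@id"])
```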
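Second, a sketch of the entry-point fix: production becomes the default and test mode must be requested explicitly via a CLI flag or environment variable. The `--test-mode` flag, the `WM_PUBLISH_TEST_MODE` variable and the stubbed body of `main` are hypothetical, not the script's actual interface:

```python
import argparse
import asyncio
import os


async def main(test_mode: bool = False):
    # Stub: the real function would read test_dumpstatus.json in test mode
    # and crawled_urls.txt in production, then hand jobs to process_single_job.
    source = "test_dumpstatus.json" if test_mode else "crawled_urls.txt"
    print(f"reading dump listing from {source}")


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    # Opt in to test mode explicitly; production stays the default.
    parser.add_argument(
        "--test-mode",
        action="store_true",
        default=os.environ.get("WM_PUBLISH_TEST_MODE") == "1",
    )
    args = parser.parse_args()
    asyncio.run(main(test_mode=args.test_mode))
```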
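Third, a sketch of the hashing fix: pass the live session into compute_sha256, catch only aiohttp.ClientError (which also covers raise_for_status failures), log the real cause, and reserve the dummy hash for explicit test mode or recoverable network errors. The constant name, payload field names and the create_api_payload signature shown here are assumptions:

```python
import hashlib
import logging

import aiohttp

logger = logging.getLogger(__name__)

TEST_DUMMY_SHA256 = "0" * 64  # placeholder digest, used only in test mode or on network errors


async def compute_sha256(session: aiohttp.ClientSession, url: str, test_mode: bool = False) -> str:
    if test_mode:
        return TEST_DUMMY_SHA256
    try:
        async with session.get(url) as response:
            response.raise_for_status()
            digest = hashlib.sha256()
            # Stream in 1 MiB chunks so large dump files are never held in memory.
            async for chunk in response.content.iter_chunked(1 << 20):
                digest.update(chunk)
            return digest.hexdigest()
    except aiohttp.ClientError as exc:
        # Narrow catch: only network/HTTP errors fall back to the dummy hash,
        # and the real cause is logged instead of being silently swallowed.
        logger.warning("SHA-256 computation failed for %s: %s", url, exc)
        return TEST_DUMMY_SHA256


async def create_api_payload(session, download_url, test_mode=False):
    # Pass the live session instead of None so session.get actually works.
    sha = await compute_sha256(session, download_url, test_mode=test_mode)
    return {"downloadURL": download_url, "sha256sum": sha}
```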
🧹 Nitpick comments (1)
wikimedia_publish.py (1)
140-145: Silence unused-arg warnings in test stub.
`session` and `api_key` aren't used in this test-only stub; consider prefixing them with `_` or removing them to satisfy linting.

♻️ Minimal lint-friendly tweak
```diff
-async def make_api_request(session, payload, api_key):
+async def make_api_request(_session, payload, _api_key):
```
What does this PR do?
Why is this change needed?
Changes included
Summary by CodeRabbit

- Documentation
  - Added an architecture overview diagram of the Wikimedia Dumps automation pipeline to the README.
- Refactor
  - Reworked wikimedia_publish.py into a test-centric workflow with new helper functions for SHA-256 computation and payload assembly.