
Add Mermaid-based architecture diagram to README #6

Open
alizahh-7 wants to merge 2 commits into dbpedia:main from alizahh-7:add-architecture-diagram

Conversation


alizahh-7 commented Jan 30, 2026

What does this PR do?

  • Adds a detailed Mermaid flowchart to the README.
  • The diagram explains the end-to-end workflow (a rough code sketch follows this list) for:
    • Crawling Wikimedia dump pages
    • Detecting new dumps
    • Storing discovered URLs
    • Validating Databus configuration
    • Generating RDF metadata
    • Publishing data to the Databus Knowledge Graph
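
For readers who prefer code to diagrams, here is a minimal, runnable Python sketch of the same flow. Every function below is a hypothetical stand-in for illustration, not this repository's actual API:

```python
# Hypothetical sketch of the workflow shown in the diagram. Every name
# below is an illustrative stand-in, not this repository's actual API.
import asyncio

async def crawl_dump_pages() -> list[str]:
    # Stand-in for fetching the Wikimedia dump index pages.
    return ["https://dumps.wikimedia.org/enwiki/20260101/"]

def detect_new_dumps(urls: list[str], seen: frozenset = frozenset()) -> list[str]:
    # Keep only dumps we have not seen in a previous run.
    return [u for u in urls if u not in seen]

def store_urls(urls: list[str], path: str = "discovered_urls.txt") -> None:
    # Persist discovered URLs for later processing.
    with open(path, "a", encoding="utf-8") as fh:
        fh.writelines(u + "\n" for u in urls)

def validate_databus_config() -> dict:
    # Stand-in for checking account/group/artifact settings.
    return {"account": "example", "group": "wikimedia"}

def build_rdf_metadata(url: str, config: dict) -> dict:
    # Stand-in for generating the RDF metadata for one dump.
    return {"account": config["account"], "source": url}

async def publish_to_databus(payload: dict) -> None:
    # Stand-in for the POST to the Databus API.
    print("would publish:", payload)

async def pipeline() -> None:
    urls = await crawl_dump_pages()          # crawl dump pages
    new_dumps = detect_new_dumps(urls)       # detect new dumps
    store_urls(new_dumps)                    # store discovered URLs
    config = validate_databus_config()       # validate Databus configuration
    for url in new_dumps:                    # generate RDF metadata and publish
        await publish_to_databus(build_rdf_metadata(url, config))

if __name__ == "__main__":
    asyncio.run(pipeline())
```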

Why is this change needed?

  • Improves project documentation and onboarding.
  • Makes the system architecture and data flow easier to understand.
  • Helps contributors quickly grasp how crawling and publishing components interact.

Changes included

  • Updated README with an architecture diagram rendered using Mermaid.

Summary by CodeRabbit

  • Documentation

    • Added Architecture Overview section with a workflow diagram illustrating the Wikimedia Dumps automation pipeline, detailing data flow from crawling through metadata validation to publication.
  • Refactor

    • Restructured the core publishing pipeline with simplified helpers for content handling and payload assembly.
    • Streamlined processing logic and removed legacy functions.



coderabbitai bot commented Jan 30, 2026

📝 Walkthrough


Documentation update to README.md introduces an architecture overview diagram of the Wikimedia Dumps automation pipeline. wikimedia_publish.py is significantly refactored, converting the production API workflow to a test-centric implementation with new helper functions for SHA-256 computation and payload assembly.

Changes

Documentation: README.md
  • Added an "Architecture Overview" section with a Mermaid flowchart visualizing the data flow from Wikimedia dumps crawling through Databus publication, including the metadata validation and RDF generation steps. Added a "Project Setup Guide" continuation header.

Workflow Refactoring: wikimedia_publish.py
  • Replaced the production API workflow with a test-centric implementation. Added compute_sha256() for hashing remote URLs, create_api_payload() for grouping files by variant and assembling Version graphs, and process_single_job() returning mock status markers. Introduced a main() entry point with a test_mode switch. Removed extensive production functions and simplified the content-variant, file-extension, and filename-parsing helpers. Configuration filters (ALLOWED_JOBS, BLOCKED_JOBS) are retained but simplified.
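
As context for reviewers, a streaming SHA-256 over a remote URL, which is what compute_sha256() is described as doing, could look roughly like this. This is a sketch assuming aiohttp, not the PR's actual implementation:

```python
# Sketch only: a chunked SHA-256 over a remote file, assuming aiohttp.
# The real compute_sha256() in this PR may be structured differently.
import asyncio
import hashlib

import aiohttp

async def compute_sha256(session: aiohttp.ClientSession, url: str) -> str:
    sha = hashlib.sha256()
    async with session.get(url) as response:
        response.raise_for_status()
        # Hash in 64 KiB chunks so large dump files never sit in memory.
        async for chunk in response.content.iter_chunked(1 << 16):
            sha.update(chunk)
    return sha.hexdigest()

async def demo() -> None:
    async with aiohttp.ClientSession() as session:
        # Placeholder URL; substitute a real dump file to try it out.
        print(await compute_sha256(session, "https://example.org/"))

if __name__ == "__main__":
    asyncio.run(demo())
```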

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes


🚥 Pre-merge checks: 1 passed, 2 failed

❌ Failed checks (2 warnings)
  • Title check ⚠️: The PR title describes adding a Mermaid diagram to the README, which matches the stated README.md changes, but the PR also contains substantial refactoring of wikimedia_publish.py that is not reflected in the title. Resolution: update the title to cover both major changes, e.g. "Refactor wikimedia_publish.py and add architecture diagram to README", or split the work into focused commits.
  • Docstring Coverage ⚠️: Docstring coverage is 0.00%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (1)
  • Description Check: check skipped because CodeRabbit's high-level summary is enabled.




alizahh-7 (Author) commented

This PR addresses #5.

coderabbitai bot left a comment

Actionable comments posted: 3

In `wikimedia_publish.py`:
- Around lines 102-135: The Databus payloads build databus_id, title, and description without using the loop variable variant, so different variants share the same `@id` and overwrite each other. Inside the for variant, file_list in file_groups.items() loop, embed the variant in the databus_id string and in the title and description before appending to payloads, so each payload gets an identifier unique to its variant. A sketch follows.
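
A minimal sketch of the suggested fix; the file_groups contents and the base identifier below are illustrative stand-ins, not values from the codebase:

```python
# Sketch of the fix: embed `variant` in the @id, title, and description so
# payloads no longer collide. The data below is an illustrative stand-in.
file_groups = {
    "pages-articles": ["enwiki-pages-articles.xml.bz2"],
    "pages-meta-history": ["enwiki-pages-meta-history.xml.bz2"],
}
base_id = "https://databus.example.org/account/group/artifact"  # assumed base

payloads = []
for variant, file_list in file_groups.items():
    payloads.append({
        "@id": f"{base_id}_{variant}",                # unique per variant
        "title": f"Wikimedia dump ({variant})",
        "description": f"Wikimedia dump files, variant {variant}.",
        "files": file_list,
    })

for p in payloads:
    print(p["@id"])  # two distinct @ids instead of one overwritten entry
```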
- Around lines 167-207: main is hardcoded to run in test mode (main(test_mode=True)), so production runs always read test_dumpstatus.json instead of crawled_urls.txt. Change async def main(test_mode=True) to async def main(test_mode=False) (or remove the default and require an explicit flag), and update the __main__ block from asyncio.run(main(test_mode=True)) to asyncio.run(main()), or derive the value from an environment variable or CLI argument. Make sure the process_single_job(session, ...) calls remain correct when test_mode is False so real HTTP requests execute. A sketch follows.
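
One way to wire this up, sketched with an environment variable; TEST_MODE and the stubbed process_single_job below are illustrative, not the file's exact shape:

```python
# Sketch: production-safe default with an explicit opt-in to test mode.
import asyncio
import os

import aiohttp

async def process_single_job(session: aiohttp.ClientSession, job: str) -> None:
    # Stand-in for the real per-job logic; receives a live session so real
    # HTTP requests can execute when test_mode is False.
    print("processing", job)

async def main(test_mode: bool = False) -> None:
    source = "test_dumpstatus.json" if test_mode else "crawled_urls.txt"
    async with aiohttp.ClientSession() as session:
        await process_single_job(session, source)

if __name__ == "__main__":
    # e.g. TEST_MODE=1 python wikimedia_publish.py for a dry run
    asyncio.run(main(test_mode=os.getenv("TEST_MODE") == "1"))
```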
- Around lines 76-109: create_api_payload passes None as the HTTP session to compute_sha256, so session.get fails and the broad except returns the dummy hash. Change the call to await compute_sha256(session, download_url), narrow compute_sha256's exception handling to network/HTTP errors (e.g. aiohttp.ClientError or a response.raise_for_status failure), log the real exception, and return the test/dummy hash only under an explicit test-mode flag or for a specific recoverable network error, so genuine failures aren't silently masked. A sketch follows.
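
A sketch of the narrowed handling, building on the compute_sha256 sketch shown under the walkthrough above; DUMMY_SHA256 and the test_mode flag are assumed names:

```python
# Sketch of narrowed error handling; DUMMY_SHA256 and test_mode are assumed
# names. compute_sha256 is the function sketched earlier in this PR.
import logging

import aiohttp

DUMMY_SHA256 = "0" * 64  # test-only placeholder

async def hash_or_placeholder(session: aiohttp.ClientSession, url: str,
                              test_mode: bool = False) -> str:
    try:
        # Pass the live session instead of None so session.get can succeed.
        return await compute_sha256(session, url)  # defined in earlier sketch
    except aiohttp.ClientError:
        logging.exception("hashing %s failed", url)
        if test_mode:
            return DUMMY_SHA256  # only mask errors in explicit test mode
        raise  # genuine production failures must surface
```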
🧹 Nitpick comments (1)
wikimedia_publish.py (1)

140-145: Silence unused-arg warnings in test stub.

session and api_key aren't used in this test-only stub; consider prefixing them with _ or removing them to satisfy linting.

♻️ Minimal lint-friendly tweak:

```diff
-async def make_api_request(session, payload, api_key):
+async def make_api_request(_session, payload, _api_key):
```
