
Add Mermaid-based architecture diagram to README #6

Open
alizahh-7 wants to merge 2 commits into dbpedia:main from alizahh-7:add-architecture-diagram

Conversation


alizahh-7 commented Jan 30, 2026

What does this PR do?

  • Adds a detailed Mermaid flowchart to the README.
  • The diagram explains the end-to-end workflow (a rough code sketch follows this list) for:
    • Crawling Wikimedia dump pages
    • Detecting new dumps
    • Storing discovered URLs
    • Validating Databus configuration
    • Generating RDF metadata
    • Publishing data to the Databus Knowledge Graph
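
For readers who prefer code to diagrams, here is a minimal, runnable Python sketch of the same flow. Every function below is a hypothetical stand-in for illustration, not this repository's actual API:

```python
# Hypothetical sketch of the workflow shown in the diagram. Every name
# below is an illustrative stand-in, not this repository's actual API.
import asyncio

async def crawl_dump_pages() -> list[str]:
    # Stand-in for fetching the Wikimedia dump index pages.
    return ["https://dumps.wikimedia.org/enwiki/20260101/"]

def detect_new_dumps(urls: list[str], seen: frozenset = frozenset()) -> list[str]:
    # Keep only dumps we have not seen in a previous run.
    return [u for u in urls if u not in seen]

def store_urls(urls: list[str], path: str = "discovered_urls.txt") -> None:
    # Persist discovered URLs for later processing.
    with open(path, "a", encoding="utf-8") as fh:
        fh.writelines(u + "\n" for u in urls)

def validate_databus_config() -> dict:
    # Stand-in for checking account/group/artifact settings.
    return {"account": "example", "group": "wikimedia"}

def build_rdf_metadata(url: str, config: dict) -> dict:
    # Stand-in for generating the RDF metadata for one dump.
    return {"account": config["account"], "source": url}

async def publish_to_databus(payload: dict) -> None:
    # Stand-in for the POST to the Databus API.
    print("would publish:", payload)

async def pipeline() -> None:
    urls = await crawl_dump_pages()          # crawl dump pages
    new_dumps = detect_new_dumps(urls)       # detect new dumps
    store_urls(new_dumps)                    # store discovered URLs
    config = validate_databus_config()       # validate Databus configuration
    for url in new_dumps:                    # generate RDF metadata and publish
        await publish_to_databus(build_rdf_metadata(url, config))

if __name__ == "__main__":
    asyncio.run(pipeline())
```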

Why is this change needed?

  • Improves project documentation and onboarding.
  • Makes the system architecture and data flow easier to understand.
  • Helps contributors quickly grasp how crawling and publishing components interact.

Changes included

  • Updated README with an architecture diagram rendered using Mermaid.

Summary by CodeRabbit

  • Documentation

    • Added Architecture Overview section with a workflow diagram illustrating the Wikimedia Dumps automation pipeline, detailing data flow from crawling through metadata validation to publication.
  • Refactor

    • Restructured the core publishing pipeline with simplified helpers for content handling and payload assembly.
    • Streamlined processing logic and removed legacy functions.



coderabbitai bot commented Jan 30, 2026

📝 Walkthrough


Documentation update to README.md introduces an architecture overview diagram of the Wikimedia Dumps automation pipeline. wikimedia_publish.py is significantly refactored, converting the production API workflow to a test-centric implementation with new helper functions for SHA-256 computation and payload assembly.

Changes

Documentation: README.md
  • Added an "Architecture Overview" section with a Mermaid flowchart visualizing the data flow from Wikimedia dumps crawling through Databus publication, including the metadata validation and RDF generation steps. Added a "Project Setup Guide" continuation header.

Workflow Refactoring: wikimedia_publish.py
  • Replaced the production API workflow with a test-centric implementation. Added compute_sha256() for hashing remote URLs, create_api_payload() for grouping files by variant and assembling Version graphs, and process_single_job() returning mock status markers. Introduced a main() entry point with a test_mode switch. Removed extensive production functions and simplified the content-variant, file-extension, and filename-parsing helpers. Configuration filters (ALLOWED_JOBS, BLOCKED_JOBS) are retained but simplified.
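
As context for reviewers, a streaming SHA-256 over a remote URL, which is what compute_sha256() is described as doing, could look roughly like this. This is a sketch assuming aiohttp, not the PR's actual implementation:

```python
# Sketch only: a chunked SHA-256 over a remote file, assuming aiohttp.
# The real compute_sha256() in this PR may be structured differently.
import asyncio
import hashlib

import aiohttp

async def compute_sha256(session: aiohttp.ClientSession, url: str) -> str:
    sha = hashlib.sha256()
    async with session.get(url) as response:
        response.raise_for_status()
        # Hash in 64 KiB chunks so large dump files never sit in memory.
        async for chunk in response.content.iter_chunked(1 << 16):
            sha.update(chunk)
    return sha.hexdigest()

async def demo() -> None:
    async with aiohttp.ClientSession() as session:
        # Placeholder URL; substitute a real dump file to try it out.
        print(await compute_sha256(session, "https://example.org/"))

if __name__ == "__main__":
    asyncio.run(demo())
```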

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes


🚥 Pre-merge checks: 1 passed, 2 failed

❌ Failed checks (2 warnings)
  • Title check ⚠️: The PR title describes adding a Mermaid diagram to the README, which matches the stated README.md changes, but the PR also contains substantial refactoring of wikimedia_publish.py that is not reflected in the title. Resolution: update the title to cover both major changes, e.g. "Refactor wikimedia_publish.py and add architecture diagram to README", or split the work into focused commits.
  • Docstring Coverage ⚠️: Docstring coverage is 0.00%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (1)
  • Description Check: check skipped because CodeRabbit's high-level summary is enabled.




alizahh-7 (Author) commented

This PR addresses #5.

coderabbitai bot left a comment

Actionable comments posted: 3

In `wikimedia_publish.py`:
- Around lines 102-135: The Databus payloads build databus_id, title, and description without using the loop variable variant, so different variants share the same `@id` and overwrite each other. Inside the for variant, file_list in file_groups.items() loop, embed the variant in the databus_id string and in the title and description before appending to payloads, so each payload gets an identifier unique to its variant. A sketch follows.
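
A minimal sketch of the suggested fix; the file_groups contents and the base identifier below are illustrative stand-ins, not values from the codebase:

```python
# Sketch of the fix: embed `variant` in the @id, title, and description so
# payloads no longer collide. The data below is an illustrative stand-in.
file_groups = {
    "pages-articles": ["enwiki-pages-articles.xml.bz2"],
    "pages-meta-history": ["enwiki-pages-meta-history.xml.bz2"],
}
base_id = "https://databus.example.org/account/group/artifact"  # assumed base

payloads = []
for variant, file_list in file_groups.items():
    payloads.append({
        "@id": f"{base_id}_{variant}",                # unique per variant
        "title": f"Wikimedia dump ({variant})",
        "description": f"Wikimedia dump files, variant {variant}.",
        "files": file_list,
    })

for p in payloads:
    print(p["@id"])  # two distinct @ids instead of one overwritten entry
```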
- Around lines 167-207: main is hardcoded to run in test mode (main(test_mode=True)), so production runs always read test_dumpstatus.json instead of crawled_urls.txt. Change async def main(test_mode=True) to async def main(test_mode=False) (or remove the default and require an explicit flag), and update the __main__ block from asyncio.run(main(test_mode=True)) to asyncio.run(main()), or derive the value from an environment variable or CLI argument. Make sure the process_single_job(session, ...) calls remain correct when test_mode is False so real HTTP requests execute. A sketch follows.
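
One way to wire this up, sketched with an environment variable; TEST_MODE and the stubbed process_single_job below are illustrative, not the file's exact shape:

```python
# Sketch: production-safe default with an explicit opt-in to test mode.
import asyncio
import os

import aiohttp

async def process_single_job(session: aiohttp.ClientSession, job: str) -> None:
    # Stand-in for the real per-job logic; receives a live session so real
    # HTTP requests can execute when test_mode is False.
    print("processing", job)

async def main(test_mode: bool = False) -> None:
    source = "test_dumpstatus.json" if test_mode else "crawled_urls.txt"
    async with aiohttp.ClientSession() as session:
        await process_single_job(session, source)

if __name__ == "__main__":
    # e.g. TEST_MODE=1 python wikimedia_publish.py for a dry run
    asyncio.run(main(test_mode=os.getenv("TEST_MODE") == "1"))
```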
- Around lines 76-109: create_api_payload passes None as the HTTP session to compute_sha256, so session.get fails and the broad except returns the dummy hash. Change the call to await compute_sha256(session, download_url), narrow compute_sha256's exception handling to network/HTTP errors (e.g. aiohttp.ClientError or a response.raise_for_status failure), log the real exception, and return the test/dummy hash only under an explicit test-mode flag or for a specific recoverable network error, so genuine failures aren't silently masked. A sketch follows.
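
A sketch of the narrowed handling, building on the compute_sha256 sketch shown under the walkthrough above; DUMMY_SHA256 and the test_mode flag are assumed names:

```python
# Sketch of narrowed error handling; DUMMY_SHA256 and test_mode are assumed
# names. compute_sha256 is the function sketched earlier in this PR.
import logging

import aiohttp

DUMMY_SHA256 = "0" * 64  # test-only placeholder

async def hash_or_placeholder(session: aiohttp.ClientSession, url: str,
                              test_mode: bool = False) -> str:
    try:
        # Pass the live session instead of None so session.get can succeed.
        return await compute_sha256(session, url)  # defined in earlier sketch
    except aiohttp.ClientError:
        logging.exception("hashing %s failed", url)
        if test_mode:
            return DUMMY_SHA256  # only mask errors in explicit test mode
        raise  # genuine production failures must surface
```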
🧹 Nitpick comments (1)
wikimedia_publish.py (1)

140-145: Silence unused-arg warnings in test stub.

session and api_key aren't used in this test-only stub; consider prefixing them with _ or removing them to satisfy linting.

♻️ Minimal lint-friendly tweak:

```diff
-async def make_api_request(session, payload, api_key):
+async def make_api_request(_session, payload, _api_key):
```
