This file provides guidance to Claude Code when working with code in this repository.
Note: Generic coding standards apply across all projects.
Curator Backend is a FastAPI service that manages CuratorBot jobs for uploading media to Wikimedia Commons. It provides a REST API and WebSocket endpoint for the frontend, with Celery workers handling background upload tasks.
Tech Stack: Python 3.13, FastAPI, SQLModel, Redis, Celery, MediaWikiClient (requests + mwoauth) | Deployment: Wikimedia Toolforge
poetry run web # FastAPI server (port 8000) - DO NOT RUN
poetry run worker # Celery worker - DO NOT RUN
poetry run pytest -q # Run tests
poetry run ty check # Type check
poetry run isort . # Sort imports
poetry run ruff format # Format code
poetry run ruff check # Run linter
poetry run alembic upgrade head # Apply migrations

Project-Specific Development Workflow: After completing backend tasks, always run in order:

poetry run isort . && poetry run ruff format && poetry run ruff check && poetry run pytest -q && poetry run ty check

To type-check specific files only: `poetry run ty check path/to/file.py`
main.py (Routes) → handler.py (Business Logic) → dal.py (Database) → models.py (SQLModel)

- Routes (`main.py`) → validate request, get user session
- Handlers (`handler*.py`) → business logic, orchestration
- DAL (`dal.py`) → database queries and persistence
- Models (`models.py`) → SQLModel definitions
Database access goes through DAL functions using the get_session() dependency.
src/curator/
├── main.py # FastAPI application entry point
├── auth.py # OAuth1 authentication with Wikimedia
├── ws.py # WebSocket endpoint
├── protocol.py # AsyncAPI WebSocket protocol (union types)
├── admin.py # Admin endpoints
├── frontend_utils.py # Frontend asset management
├── core/ # Business logic and application services
│ ├── config.py # Configuration from environment
│ ├── handler.py # WebSocket business logic
│ ├── auth.py, crypto.py, wcqs.py
│ ├── errors.py # DuplicateUploadError, HashLockError
│ ├── geocoding.py # Reverse geocoding for location enrichment
│ ├── task_enqueuer.py # Upload task enqueueing with rate limiting
│ ├── rate_limiter.py # Upload rate limiting with privileged user handling
│ └── recovery.py # Startup recovery for uploads stuck in queued state
├── db/ # Database layer
│ ├── engine.py # SQLAlchemy engine and get_session()
│ ├── commons_engine.py # Wikimedia Commons replica DB connection
│ ├── models.py # SQLModel models
│ ├── dal_batches.py # Batch queries
│ ├── dal_uploads.py # Upload request queries
│ ├── dal_presets.py # Preset queries
│ └── dal_users.py # User queries
├── mediawiki/ # Wikimedia Commons integration
│ ├── client.py # MediaWiki API client
│ ├── commons.py # Upload workflow (download, hash, upload, SDC)
│ ├── sdc_v2.py, sdc_merge.py
├── handlers/ # Image source handlers
│ ├── interfaces.py # Abstract Handler class
│ ├── mapillary_handler.py
│ └── flickr_handler.py
├── asyncapi/ # Auto-generated AsyncAPI models (do not edit)
└── workers/ # Celery workers
├── tasks.py # Background tasks
├── celery.py # Celery config
└── ingest.py # Ingestion worker
Abstract Handler class in handlers/interfaces.py defines the contract for image sources:
- `fetch_collection()` - Get all images from a collection
- `fetch_image_metadata()` - Get single image metadata
- `fetch_existing_pages()` - Check which images already exist on Commons
- `fetch_collection_ids()` - Get all image IDs in a collection
- `fetch_images_batch()` - Batch fetch images by ID
Implementations: MapillaryHandler, FlickrHandler. Used by both WebSocket handler and ingestion worker.
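The contract above can be sketched as an abstract base class. The method names come from this document; the signatures and return types below are illustrative assumptions, not the real interface in `handlers/interfaces.py`:

```python
from abc import ABC, abstractmethod
from typing import Any


class Handler(ABC):
    """Sketch of the image-source contract (signatures are assumptions)."""

    @abstractmethod
    def fetch_collection(self, collection_id: str) -> list[dict[str, Any]]:
        """Get all images from a collection."""

    @abstractmethod
    def fetch_image_metadata(self, image_id: str) -> dict[str, Any]:
        """Get single image metadata."""

    @abstractmethod
    def fetch_existing_pages(self, image_ids: list[str]) -> set[str]:
        """Check which images already exist on Commons."""

    @abstractmethod
    def fetch_collection_ids(self, collection_id: str) -> list[str]:
        """Get all image IDs in a collection."""

    @abstractmethod
    def fetch_images_batch(self, image_ids: list[str]) -> list[dict[str, Any]]:
        """Batch fetch images by ID."""
```

A concrete handler (e.g. `MapillaryHandler`) implements all five methods; instantiating the ABC itself raises `TypeError`.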
recovery.py handles uploads stuck in queued state after a Redis restart. Uses a sentinel key (curator:started) written to Redis on successful startup — missing key means Redis was restarted and recovery is needed. On recovery, queued uploads are grouped by (userid, edit_group_id), the OAuth token validated against MediaWiki, then re-enqueued via enqueue_uploads(). Invalid tokens mark the group's uploads as failed with a session-expired message in a single DB call. Called from main.py lifespan after Alembic migrations.
- The sentinel key is set atomically using `redis_client.set(SENTINEL_KEY, "1", nx=True)` — if it returns `None`, another instance already claimed recovery and the function returns immediately. This prevents double-enqueuing on concurrent startups.
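The atomic claim can be sketched as follows. The key name `curator:started` and the `nx=True` call come from this document; the function name and the in-memory `FakeRedis` stand-in are illustration-only assumptions:

```python
SENTINEL_KEY = "curator:started"  # sentinel key name from the docs above


def claim_recovery(redis_client) -> bool:
    """Sketch: SET ... NX returns None when the key already exists,
    i.e. another instance has already claimed recovery."""
    return redis_client.set(SENTINEL_KEY, "1", nx=True) is not None


class FakeRedis:
    """Minimal in-memory stand-in for a Redis client (illustration only)."""

    def __init__(self):
        self._data = {}

    def set(self, key, value, nx=False):
        if nx and key in self._data:
            return None  # mirrors redis-py: NX set on an existing key
        self._data[key] = value
        return True
```

First startup claims recovery and proceeds; a concurrent instance sees `None` and backs off.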
Retry functionality allows users and admins to retry failed uploads. The current implementation creates new UploadRequest objects (copies) in a new batch rather than modifying the originals. This preserves the original failed uploads in the original batch for history/audit purposes.
- `dal.reset_failed_uploads_to_new_batch()` - User retry, creates copies of failed uploads in a new batch
- `dal.retry_selected_uploads_to_new_batch()` - Admin retry, same pattern
- After enqueueing Celery tasks, `update_celery_task_id()` is called to enable cancellation
- `admin_retry_uploads` in `admin.py` calls `process_upload.apply_async()` directly, bypassing `enqueue_uploads()` and rate limiting — admin retries are not rate limited
Redis serves as both the Celery broker (task queue) and result backend. A Redis restart destroys all in-flight task data — tasks sitting in the broker queue are gone, but the database retains status="queued" records. The startup recovery system (recovery.py) reconciles this.
- Rate limits are fetched from `action=query&meta=userinfo&uiprop=ratelimits|rights` via `MediaWikiClient.get_user_rate_limits()`, which returns `(ratelimits, rights)`
- Each upload costs 2 edit API calls (SDC apply + null edit), so the edit limit is halved before comparing against the upload limit — the more restrictive of the two is used
- Users with the `noratelimit` right are exempt and receive `_NO_RATE_LIMIT`; MediaWiki returns `"ratelimits": {}` for exempt users (sysops, bots) — this is expected, not an error
- Uses Redis to track the next available upload slot per user with key `ratelimit:{userid}:next_available`
- Rate limit keys have no TTL — stale past-timestamp values are handled correctly by `max(0.0, next_available - current_time)`, and a TTL would incorrectly reset the slot for large batches
- All uploads go to `QUEUE_NORMAL`
- `get_user_groups()` in `recovery.py` is used only to validate OAuth tokens on startup recovery — the returned groups are not used
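The slot-tracking logic above can be sketched like this. The key pattern matches the docs; the function name, the interval parameter, and the redis-like client are assumptions for illustration:

```python
import time


def seconds_until_slot(redis_client, userid: int, interval: float) -> float:
    """Sketch of the per-user upload slot logic (not the real rate_limiter.py).

    No TTL is needed: a stale past timestamp is clamped to "available now"
    by max(0.0, ...), exactly as described above.
    """
    key = f"ratelimit:{userid}:next_available"
    now = time.time()
    raw = redis_client.get(key)
    next_available = float(raw) if raw is not None else now
    wait = max(0.0, next_available - now)
    # Reserve the slot after this upload's scheduled time.
    redis_client.set(key, str(max(now, next_available) + interval))
    return wait
```

The first call for a user waits 0 seconds; an immediate second call waits roughly one interval.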
commons.py contains both the upload-to-Commons workflow (upload_file_chunked) and the image download function (download_file) which fetches from external CDNs (e.g., Mapillary/Facebook). Both download and upload chunk retries use config.HTTP_RETRY_DELAYS for escalating backoff but have separate retry loops — they differ in error handling and response processing.
- `MediaWikiClient` class handles all Wikimedia Commons operations
- Instantiate directly: `MediaWikiClient(access_token=access_token)`
- `_api_request()` always retries with exponential backoff (3 attempts: 0s, 1s, 3s delays)
- `requests.exceptions.RequestException`, `badtoken` CSRF errors, and OAuth "Nonce already used" errors trigger retries; other exceptions propagate immediately
- API-level error responses (e.g. `{"error": {"code": "..."}}`) are returned as-is — `_api_request` does not raise on them; callers must check for expected keys (e.g. `"query"`) before indexing
- Client instances are passed where needed (no global state)
- Async methods are used where available, or `asyncio.to_thread` for synchronous calls
- Clients are closed after use (e.g., using `try...finally`)
- `_client` is the underlying `requests.Session` — close it explicitly with `client._client.close()` in a `finally` block when the client is short-lived (not managed by a context manager)
- `upload_file()` accepts `file_path: str` (not `file_content: bytes`) for memory efficiency
- `commons.py:upload_file_chunked()` provides the complete upload workflow (download, hash, duplicate check, upload, SDC). `MediaWikiClient.upload_file()` is a low-level method that only performs chunked upload.
- Chunked upload flow: chunks are uploaded with `stash=1`, then a final commit publishes the file
- Per-chunk retryable errors: substring match on `error.code` for `UploadStashFileException`, `uploadstash-exception` (bare variant from `handleStashException` default case — also covers `UploadChunkFileException` falling through), `UploadChunkFileException`, `JobQueueError`; also matches `backend-fail-internal` in `error.info` (handles `stashfailed` code with Swift storage errors) — transient infrastructure errors that retry up to 4 attempts with 3/5/10s delays
- Final commit retryable errors (substring match on `error.code`): `backend-fail-internal`, `JobQueueError` — retried with the same 4-attempt / 3/5/10s logic. The final commit has its own separate retry loop in `upload_file()`.
- `uploadstash-file-not-found`, `uploadstash-bad-path`, and `stashfailed` (with "No chunked upload session with this key") are NOT retried inside `_upload_chunk` — all mean the entire stash is gone, so retrying the same chunk is useless. Instead, they surface as `UploadResult(success=False)` and `_upload_with_retry` in `ingest.py` restarts the full upload from scratch (re-download + re-upload all chunks). `MAX_UPLOADSTASH_TRIES` controls how many full-restart attempts are allowed. All three are detected by `_is_uploadstash_gone_error`.
- Chunk `UploadResult.error` format: always `f"{error_code}: {error_info}"` — omitting the error code means any code path that checks for a specific error code string (e.g. `_is_uploadstash_gone_error`) will silently fail to match.
- Duplicate detection: `warnings.duplicate` and `warnings.exists` are handled during the stash phase (chunk responses with `stash=1`). `warnings.exists` (same filename, content may differ) fetches the remote SHA1 and raises `DuplicateUploadError` only if hashes match; otherwise it falls through to a generic warning failure. The final commit can also return a `Warning` result — `warnings.nochange` (content identical, confirmed by MediaWiki) raises `DuplicateUploadError` using the `warnings.exists` title if present. `upload_file()` accepts `file_sha1` to enable the exists comparison; `upload_file_chunked()` in `commons.py` passes the hash computed during download.
sites=commonswiki&titles=File:Example.jpginstead ofids=M12345to avoid an extra API call to fetch page ID - When using
sites/titlesin wbgetentities, the entity is keyed by entity ID, not title - extracted withnext(iter(entities)) - Entity ID "-1" with site/title keys means non-existent file (raises error); positive entity ID with "missing" key means file exists but has no SDC (returns None)
commons.py:upload_file_chunkedacquires a Redis hash lock (keyed by file hash) after duplicate check; always released intry/finally— the TTL is a safety net, not the primary release mechanismupload_fileinmediawiki_client.py: all result processing (if "upload" in data, error logging) lives inside theif file_key:block —datais only assigned there
- Auto-generated AsyncAPI models use kebab-case aliases (e.g., `entity-type`, `numeric-id`)
- The DAL's `_fix_sdc_keys()` function recursively maps snake_case to kebab-case for database storage
- The mapping is defined in `dal.py` and is updated when the AsyncAPI schema changes
db/commons_engine.py provides a read-only connection to the Wikimedia Commons replica database, available from Toolforge. Uses the same TOOL_TOOLSDB_USER/TOOL_TOOLSDB_PASSWORD credentials as the app's own DB, but connects to commonswiki.analytics.db.svc.wikimedia.cloud / commonswiki_p. Falls back to COMMONS_DB_URL env var for local development.
Local development: When TOOL_TOOLSDB_USER is absent, the engine auto-starts an SSH tunnel through login.toolforge.org (override with TOOLFORGE_SSH_HOST env var) on the first request. Credentials are read from the Toolforge bastion via ssh login.toolforge.org cat ~/replica.my.cnf — no local .cnf file needed. The tunnel uses start_new_session=True so it survives poetry file-watch restarts. PyMySQL reads .cnf files natively via connect_args={"read_default_file": path} — no manual parsing needed.
The replica schema uses linktarget as a normalised title store — cl_to, pl_title, tl_title columns were removed in MediaWiki 1.43–1.45 and replaced with cl_target_id/pl_target_id/tl_target_id foreign keys to linktarget(lt_id, lt_namespace, lt_title).
Commons replica query patterns (benchmarked against the live replica):
- `NOT EXISTS` subqueries are catastrophically slow on large tables like `categorylinks` (~4.5 min for 100 rows) — the optimizer cannot short-circuit them. Always use `LEFT JOIN ... WHERE p_target.page_id IS NULL` instead.
- `EXPLAIN` is not available on the replica (insufficient privileges) — benchmark with `time sql commonswiki`.
- `categorylinks` does not have a `cl_to` column (removed in 1.43–1.45) — join via `cl_target_id` → `linktarget(lt_id, lt_namespace, lt_title)`.
- For categorylinks-based wanted-categories queries (discarded approach), filtering `p_from.page_namespace IN (0, 6, 14)` halves query time (3.7s → 1.7s). Not applicable to the accepted `category` table query, which has no `p_from`.
- `DISTINCT` with `LIMIT` does not allow early termination — the DB must find all distinct rows before applying the limit. Avoid `DISTINCT` on high-cardinality scans where possible; use narrow filters instead to reduce the working set.
category table for wanted-categories counts:
The category table stores pre-computed per-category statistics maintained by MediaWiki in real time — including for categories that don't have a page yet. Columns: cat_title, cat_pages (total members), cat_subcats, cat_files. Use cat_pages - cat_subcats - cat_files to get regular pages (galleries/templates). This is how SpecialWantedCategories.php fetches live counts in preprocessResults().
Query: `SELECT c.cat_title, c.cat_subcats, c.cat_files, (c.cat_pages - c.cat_subcats - c.cat_files) AS pages, c.cat_pages AS total FROM category c LEFT JOIN page p ON p.page_namespace = 14 AND p.page_title = c.cat_title WHERE p.page_id IS NULL ORDER BY c.cat_pages DESC LIMIT 100`
Performance findings:
- The `querypage=WantedCategories` MediaWiki API is disabled on Commons — cannot use the pre-computed cache via API.
- No index on `cat_pages` — `ORDER BY cat_pages DESC` always requires a full table sort (~7–16s).
- Threshold filters (`cat_pages >= N`) do not help — full-scan cost is fixed regardless.
- Removing `ORDER BY` brings `LIMIT 20` to 0.22s, but results are unsorted and the full-scan cost returns with larger LIMITs.
- Multiple-query approach ruled out: the top 10,000 categories by `cat_pages` contain only 1 missing one — missing categories are NOT correlated with the largest categories, so no indexed shortcut exists.
- GROUP BY aggregation directly on `categorylinks` takes several minutes — do not use.
- Accepted approach: run the ~7s query directly (admin-only, infrequent use, loading state shown). Redis caching with a Celery periodic task is the known improvement path if latency becomes a problem.
Only search indexed columns - When implementing text search/filter functionality, only include columns that have database indexes. Searching unindexed columns (especially JSONB fields) will be very slow on large datasets. Check model definitions for index=True before adding search.
- TypedDict for query return types — when a query returns rows with mixed key types (e.g., `str` title + `int` counts), define a private `TypedDict` (prefix `_`) in the DAL file. `dict[str, int | str]` is too broad — ty flags callers that expect a specific type per key
- `ImageHandler` enum - use `str(handler)` when passing to functions expecting `str` type
- Pydantic objects (e.g., `Label`) - use `model_dump(mode="json")` to convert to a dict for database storage
- Optional asyncapi booleans - add `or False`/`or True` when passing to functions expecting non-optional `bool`
- When AsyncAPI fields become required with defaults (e.g., `Field(default=False)`), remove the fallback pattern
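The private-TypedDict pattern above might look like this; the type and field names are hypothetical examples, not real DAL code:

```python
from typing import TypedDict


class _WantedCategoryRow(TypedDict):
    """Hypothetical per-query TypedDict: mixed str/int value types that
    dict[str, int | str] would blur together."""

    title: str
    pages: int
    total: int


def to_row(title: str, pages: int, total: int) -> _WantedCategoryRow:
    # Callers now get a precise type per key instead of int | str.
    return {"title": title, "pages": pages, "total": total}
```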
- Union types `ClientMessage` and `ServerMessage` defined in `protocol.py`
- 50+ auto-generated message types in `asyncapi/`
- Two-phase upload process: creation phase (slices via `UPLOAD_SLICE`) → subscription phase (`SUBSCRIBE_BATCH`)
- `UploadSliceAck` provides immediate item status updates to the client
Backend models are auto-generated from ../frontend/asyncapi.json. When updating schema:
- Update `../frontend/asyncapi.json` with schema changes
- Run `cd ../frontend && bun generate` from the frontend directory — generates Python models to `src/curator/asyncapi/` and auto-formats the generated code
- Update all code that constructs or accesses the modified models
- Run tests: `poetry run pytest -q`
Design Patterns:
- Group related fields into nested objects (e.g., `MediaImage.urls`, `MediaImage.camera`)
- Use short names without redundant prefixes (e.g., `original` not `url_original`)
- Boolean flags should be required with defaults, not Optional
- Required boolean fields generate as `Field(default=False)` in Python models
When adding new server messages, update all 4 locations in asyncapi.json (alphabetical order):
- `components/messages/` - Message definition (`"RetryUploadsResponse": {"payload": {...}}`)
- `components/schemas/` - Schema definition with type, data, nonce properties
- `channels/wsChannel/messages/` - Channel message reference
- `operations/ServerMessage/messages/` - Server operation reference (alphabetical order)
Use Alembic CLI auto-generator:
poetry run alembic revision --autogenerate -m "description"
poetry run alembic upgrade head

When dropping a column that has both a foreign key and an index, the autogenerated migration drops the index first — MySQL requires the FK to be dropped first. Always verify the order: `drop_constraint` (FK) → `drop_index` → `drop_column`.
- `pytest` with tests in `tests/`
- All imports must be at the top of test files (no inline imports)
- Use `patch()` from `unittest.mock`, not `pytest.mock.patch`
- NO nested function definitions in tests - avoid the `def func(): def inner():` pattern
- For complex mock behavior, use module-level helper functions (prefix with `_`) passed to `side_effect`
- BDD tests in `tests/bdd/`, async tests with pytest-asyncio
- pytest timeout is configured to `0.25` seconds in `pytest.ini`
- Mock objects match the actual return type structure (e.g., `UploadRequest` needs `id`, `key`, `status` attributes)
- When mocking `process_upload.apply_async()`, the queue is checked via `call[1]["queue"]` and args via `call[1]["args"]`
- AsyncMock assertions use `assert_called_once_with()` for keyword arguments
When mocking file operations or other builtins, use a single `with` statement with comma-separated patches:

```python
# Wrong (nested with statements)
with patch("os.path.getsize", return_value=1000):
    with patch("builtins.open", mock_open(read_data=b"data")):
        result = func()

# Correct (single with statement)
with patch("os.path.getsize", return_value=1000), patch(
    "builtins.open", mock_open(read_data=b"data")
):
    result = func()
```

When adding new async send methods to protocol.py:
- Add `sender.send_new_method = AsyncMock()` to the `mock_sender` fixture in `tests/fixtures.py`
- This is required for tests that use WebSocket handlers (`mock_sender` is autouse for BDD tests)
The tests/fixtures.py file contains an autouse fixture mock_external_calls that patches many external dependencies. This fixture is designed for BDD tests but can cause issues with other tests. If tests fail with strange errors:
- Check if the test needs to be isolated from the autouse fixture
- The fixture patches `curator.app.handler.encrypt_access_token` and other common dependencies
- Some tests may need to run without this fixture or use `@pytest.mark.usefixtures("mock_external_calls")` explicitly
- The fixture skips for modules with "mediawiki", "geocoding", or "test_download" in the name, avoiding a slow Celery import chain that causes timeouts. New test files that don't need BDD mocks should be added to this skip list in `fixtures.py`.
- Use `@given(..., target_fixture="fixture_name")` to make fixture values available to other steps
- Use `global` variables to track state across BDD step definitions (e.g., `_use_default_mock` flag)
- Mock `time.sleep` in tests involving retry logic to avoid pytest-timeout (default 0.25s)
- Use `@when(..., target_fixture="result")` to make return values available to `@then` steps
- Handle expected exceptions in `@when` steps, return them in a result dict for `@then` verification
When writing a failing test before the implementation exists, `mocker.patch("module.symbol")` raises `AttributeError` if the symbol doesn't exist yet. Use `create=True` to allow patching symbols that don't exist in the target module:

`mocker.patch("curator.core.handler.get_redlinks", return_value=[...], create=True)`

This lets the test fail for the right reason (missing method on Handler) rather than a patching error.
- When patching instance methods, use `side_effect` with functions accepting `*args, **kwargs`
- This avoids "got multiple values for argument" errors when the method is called with keyword arguments
- Example: `mocker.patch("path.to.Class.method", side_effect=lambda *args, **kwargs: {...})`
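A self-contained version of the pattern using stdlib `unittest.mock` (the `Service` class is a hypothetical stand-in, not project code):

```python
from unittest.mock import patch


class Service:
    """Hypothetical class standing in for a real instance method target."""

    def lookup(self, key, *, default=None):
        raise RuntimeError("real implementation not wanted in tests")


# *args/**kwargs absorbs both positional and keyword call styles, avoiding
# "got multiple values for argument" errors.
with patch.object(Service, "lookup", side_effect=lambda *args, **kwargs: {"ok": True}):
    svc = Service()
    assert svc.lookup("k") == {"ok": True}
    assert svc.lookup("k", default=1) == {"ok": True}
```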
- `@app.task(bind=True)` with `autoretry_for` stores the pre-autoretry function as `task._orig_run` (a bound method)
- To call with a mock `self`: `process_upload._orig_run.__func__(mock_self, upload_id, edit_group_id)` — `__func__` gives the unbound function
- Calling `process_upload(mock_self, ...)` or `process_upload._orig_run(mock_self, ...)` both fail: the former because Celery's task machinery conflicts with keyword args, the latter because `_orig_run` is already bound to the task instance
- When using `MagicMock()` as the task `self`, set integer attributes explicitly: `mock_self.max_retries = 3` — MagicMock attributes default to MagicMock instances, causing `TypeError` on numeric comparisons
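The `__func__` trick is plain Python, independent of Celery. A minimal illustration with a stand-in class (not a real Celery task):

```python
class Task:
    """Stand-in for a Celery task instance (illustration only)."""

    def run(self, upload_id):
        return (self, upload_id)


task = Task()
bound = task.run              # bound method: `task` is baked in as self
unbound = task.run.__func__   # plain function: caller supplies self


class MockSelf:
    max_retries = 3  # set numeric attributes explicitly on mock selves


mock_self = MockSelf()
result = unbound(mock_self, 42)
assert result == (mock_self, 42)  # our mock self was used, not `task`
```

Calling `bound(mock_self, 42)` would instead fail with a TypeError, because `self` is already bound — the same reason `process_upload._orig_run(mock_self, ...)` fails.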
- pytest-timeout may enforce a 0.25s default even with `timeout = 0` in pytest.ini
- Tests with `time.sleep()` must mock the `time` module to avoid timeouts: `mocker.patch("time.sleep")`
To verify `selectinload` is applied to a query, inspect `_with_options` on the captured query:

`option_keys = [opt.path.path[1].key for opt in query._with_options]` — `path[0]` is the Mapper, `path[1]` is the RelationshipProperty (has `.key`).

Passing `Model.relationship` directly to `selectinload` causes a ty type error — the attribute resolves as the related model type, not `QueryableAttribute`. Use `class_mapper` instead: `class_mapper(Model).relationships["name"].class_attribute`.
- Phabricator tasks must be linked by full URL in PR descriptions: `https://phabricator.wikimedia.org/T123456`
- Use `gh api repos/{owner}/{repo}/pulls/{number}/comments` to get line-by-line review comments with file paths and line numbers
- `gh pr view --json reviews` only shows high-level review summaries, not specific line comments
- Functions that always return or raise an exception use `raise AssertionError("Unreachable")` to satisfy the type checker
- AsyncAPI models are auto-generated from `frontend/asyncapi.json`
- Large file uploads use `NamedTemporaryFile()` with streaming downloads (see `commons.py`)
- Database sessions use the `get_session()` dependency
- Code follows the layered architecture: routes → handlers → DAL → models
When adding imports between core modules (`commons.py`, `mediawiki_client.py`, etc.), be aware of circular dependencies:

- `commons.py` imports from `mediawiki_client.py`
- `mediawiki_client.py` should NOT import from `commons.py` directly
- For shared exceptions like `DuplicateUploadError`, use a dedicated `errors.py` module that only imports from `asyncapi` (which has no dependencies on other app modules)
- Import exceptions inside functions if needed to avoid circular imports: `from curator.app.errors import DuplicateUploadError`
- `list[str] | None = None` without `Query()` always resolves to `None` — repeated URL params like `?status=queued&status=failed` are silently ignored
- Use `Query(default=None)` for any list-typed query parameter: `status: list[str] | None = Query(default=None)`
- Tests that call endpoint functions directly bypass HTTP query string parsing, so list param issues won't be caught — use `TestClient` to verify HTTP-level behavior
- `session.exec(select(col(Table.column))).all()` returns `list[value]`, not `list[Row]` (SQLModel-specific)
- Use `session.exec()` for all queries; `session.execute()` is deprecated in SQLModel
- Multi-column queries (e.g. GROUP BY with `sa_select`): `session.exec()` returns rows that unpack as tuples — `bid, count = row`
- `sa_select(...)` (raw SQLAlchemy multi-column select) doesn't match SQLModel's `session.exec()` overloads — use `# type: ignore` on those lines (known SQLModel limitation: fastapi/sqlmodel#909)
- `# type: ignore[error-code]` does NOT suppress errors — ty only honors bare `# type: ignore`
- When a DAL function calls `session.exec()` multiple times, tests must use `mock_session.exec.side_effect = [result1, result2]` — a single `return_value` only handles one call
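The multi-call mocking pattern can be sketched with stdlib mocks; `summarize` is a hypothetical DAL function, not project code:

```python
from unittest.mock import MagicMock


def summarize(session):
    """Hypothetical DAL function that calls session.exec() twice."""
    batches = session.exec("select-batches").all()
    uploads = session.exec("select-uploads").all()
    return len(batches), len(uploads)


mock_session = MagicMock()
r1, r2 = MagicMock(), MagicMock()
r1.all.return_value = ["b1", "b2"]
r2.all.return_value = ["u1"]
# side_effect list: each exec() call consumes the next result in order.
mock_session.exec.side_effect = [r1, r2]

assert summarize(mock_session) == (2, 1)
```

With a single `return_value` instead, both calls would get the same result object and the test would silently conflate the two queries.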