feat(self-host): add local workspace and OCR flow#46

Merged
willamhou merged 1 commit into main from self-host-pr-clean
Apr 23, 2026
@willamhou
Member

What

Add the self-host workspace and OCR milestone on top of current main, without reintroducing the SDK conflict set.

Included:

  • local assets and collections with /api/v2/assets compatibility
  • multi-workspace local-owner flow without auth
  • OCR paper import with local dataset materialization and Volcengine normalization
  • route-level and browser-level self-host smoke coverage
  • CI coverage for both self-host OCR smoke paths
  • Hermes integration design note for the next runtime track

Excluded on purpose:

  • SDK tree changes that currently conflict with main
  • unrelated local-only files from the working branch

Why

The previous PR branch hit merge conflicts against main, which prevented GitHub Actions pull_request workflows from running at all.

This branch keeps the self-host product changes but drops the overlapping SDK surface so CI can run normally.

How

  • start from current main
  • bring over only the required web/, docker/, docs/self-hosting, and workflow changes
  • leave sdk/* out of this PR
  • preserve the self-host browser flow: workspace -> Reader -> Paper Library import -> Asset Browser reopen

Checklist

  • npm run lint passes (in web/)
  • npm test passes (in web/)
  • Documentation updated (if applicable)
  • Screenshots attached (for UI changes)

Type of Change

  • New feature
  • Breaking change
  • Documentation
  • Agent template
  • Docker / infrastructure
  • Refactor

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 4ce2285b27

Comment on lines +140 to +141
const candidate = path.join(root, paperId);
if (await pathExists(candidate)) {

P1: Constrain OCR paper directory lookups to root

findOCRPaperDirectory joins an untrusted paperId directly onto each OCR root and only checks existence, so values like %2e%2e can resolve to the parent directory of the configured OCR root. Because /api/ocr/[paperId]/[...path] uses this helper before reading files, a crafted request can make the API serve files outside data/ocr whenever the target file exists under that parent tree. Resolve and validate that the computed path stays within the configured root before accepting it.
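A minimal sketch of the containment check this comment asks for, assuming a Node.js runtime. `resolveWithinRoot` is a hypothetical helper name, not code from this PR, and the extra `decodeURIComponent` pass is a defensive assumption in case the route parameter arrives still percent-encoded.

```typescript
import * as path from "node:path";

// Hypothetical helper (not from the PR): resolve paperId under root and
// return null when the result would not be a direct child of root.
export function resolveWithinRoot(root: string, paperId: string): string | null {
  let decoded: string;
  try {
    // Assumption: the value may still be percent-encoded (e.g. "%2e%2e").
    decoded = decodeURIComponent(paperId);
  } catch {
    return null; // malformed percent-encoding
  }

  // Reject empty names, dot segments, path separators, and NUL bytes.
  if (decoded === "" || decoded === "." || decoded === ".." || /[\\/\0]/.test(decoded)) {
    return null;
  }

  const resolvedRoot = path.resolve(root);
  const candidate = path.resolve(resolvedRoot, decoded);

  // Belt and braces: the relative path from root must not climb upward.
  const rel = path.relative(resolvedRoot, candidate);
  if (rel === "" || rel === ".." || rel.startsWith(".." + path.sep) || path.isAbsolute(rel)) {
    return null;
  }
  return candidate;
}
```

Under these assumptions, the lookup would call the helper in place of the bare `path.join(root, paperId)` and skip any root for which it returns null, before the existence check runs. Symlinks inside the OCR root are not covered by this sketch.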


Comment on lines +57 to +58
async function fetchRemotePdf(sourceUrl: string): Promise<Buffer> {
const response = await fetch(sourceUrl);

P1: Block SSRF in source URL PDF imports

The upload flow fetches arbitrary user-provided sourceUrl values server-side without restricting scheme, hostname, or private-address targets. On any deployment where /api/papers/upload is reachable, an attacker can trigger requests to internal services (for example cloud metadata or intranet hosts) and persist the response into local OCR storage, which is then retrievable via /api/ocr/.... Add URL validation (http/https only, deny private/link-local/loopback ranges, and consider allowlisting) before calling fetch.
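The validation described above could be sketched roughly as follows, assuming a Node.js runtime. `validateSourceUrl`, `isBlockedIPv4`, and `isBlockedIPv6` are hypothetical names, not code from this PR. The sketch only inspects the URL string: it does not resolve DNS names, so a hostname that points at a private address would still pass, and it does not follow or inspect redirects.

```typescript
import { isIP } from "node:net";

// Hypothetical helper: true for unspecified, private, loopback,
// and link-local IPv4 literals.
export function isBlockedIPv4(address: string): boolean {
  const [a, b] = address.split(".").map(Number);
  return (
    a === 0 ||
    a === 10 ||
    a === 127 ||
    (a === 169 && b === 254) ||
    (a === 172 && b >= 16 && b <= 31) ||
    (a === 192 && b === 168)
  );
}

// Hypothetical helper: true for unspecified, loopback, link-local (fe80::/10),
// unique-local (fc00::/7), and IPv4-mapped IPv6 literals (blocked conservatively).
export function isBlockedIPv6(address: string): boolean {
  const lower = address.toLowerCase();
  return (
    lower === "::" ||
    lower === "::1" ||
    /^fe[89ab]/.test(lower) ||
    /^f[cd]/.test(lower) ||
    lower.startsWith("::ffff:")
  );
}

// Hypothetical helper: parse sourceUrl and throw if it should not be fetched.
export function validateSourceUrl(sourceUrl: string): URL {
  const url = new URL(sourceUrl); // throws TypeError on malformed input
  if (url.protocol !== "http:" && url.protocol !== "https:") {
    throw new Error(`Unsupported URL scheme: ${url.protocol}`);
  }

  // URL.hostname keeps the brackets around IPv6 literals; strip them.
  const host = url.hostname.replace(/^\[|\]$/g, "");
  if (host === "localhost" || host.endsWith(".localhost")) {
    throw new Error("Loopback hostnames are not allowed");
  }

  const family = isIP(host);
  if ((family === 4 && isBlockedIPv4(host)) || (family === 6 && isBlockedIPv6(host))) {
    throw new Error(`Blocked address: ${host}`);
  }
  return url;
}
```

Under these assumptions, `fetchRemotePdf` would call `validateSourceUrl(sourceUrl)` before `fetch`. A fuller fix would also resolve the hostname and check the resulting addresses, re-validate every redirect hop, and possibly restrict imports to an allowlist of hosts, as the comment suggests.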


@willamhou willamhou merged commit a2b57ef into main Apr 23, 2026
9 checks passed