feat(self-host): add local workspace and OCR flow #46
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 4ce2285b27
const candidate = path.join(root, paperId);
if (await pathExists(candidate)) {
Constrain OCR paper directory lookups to root
findOCRPaperDirectory joins an untrusted paperId directly onto each OCR root and only checks existence, so values like %2e%2e can resolve to the parent directory of the configured OCR root. Because /api/ocr/[paperId]/[...path] uses this helper before reading files, a crafted request can make the API serve files outside data/ocr whenever the target file exists under that parent tree. Resolve and validate that the computed path stays within the configured root before accepting it.
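One way to apply this suggestion is a lexical containment check before the existence test. The sketch below is illustrative only: `resolveInsideRoot` is a hypothetical helper name, not something from this PR, and it assumes `paperId` has already been percent-decoded by the router before it reaches the helper.

```typescript
import * as path from "node:path";

/**
 * Hypothetical helper (not part of the PR): resolve `paperId` under `root`
 * and return the absolute path only if it stays strictly inside `root`.
 * Returns null when the resolved path is the root itself or escapes it.
 */
function resolveInsideRoot(root: string, paperId: string): string | null {
  const resolvedRoot = path.resolve(root);
  const candidate = path.resolve(resolvedRoot, paperId);
  const relative = path.relative(resolvedRoot, candidate);

  const escapes =
    relative === "" || // candidate is the root itself
    relative === ".." || // candidate is the parent directory
    relative.startsWith(".." + path.sep) || // candidate is above the root
    path.isAbsolute(relative); // e.g. a different drive on Windows

  return escapes ? null : candidate;
}
```

In `findOCRPaperDirectory`, the call to `path.join(root, paperId)` could be replaced by this helper, skipping any root for which it returns null. The check is purely lexical, so it does not account for symlinks inside the OCR root; comparing `fs.realpath` results would be needed if that matters for the deployment.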
async function fetchRemotePdf(sourceUrl: string): Promise<Buffer> {
const response = await fetch(sourceUrl);
Block SSRF in source URL PDF imports
The upload flow fetches arbitrary user-provided sourceUrl values server-side without restricting scheme, hostname, or private-address targets. On any deployment where /api/papers/upload is reachable, an attacker can trigger requests to internal services (for example cloud metadata or intranet hosts) and persist the response into local OCR storage, which is then retrievable via /api/ocr/.... Add URL validation (http/https only, deny private/link-local/loopback ranges, and consider allowlisting) before calling fetch.
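A minimal sketch of the validation described above, assuming a Node.js runtime. The names `isBlockedAddress` and `assertSafeSourceUrl` are hypothetical and do not come from this PR, and the blocked ranges listed are a starting point under that assumption, not an exhaustive deny list.

```typescript
import { BlockList, isIP } from "node:net";
import { lookup } from "node:dns/promises";

// Ranges a server-side fetch should normally not reach (assumed, not exhaustive).
const blocked = new BlockList();
blocked.addSubnet("0.0.0.0", 8, "ipv4");
blocked.addSubnet("10.0.0.0", 8, "ipv4"); // private
blocked.addSubnet("100.64.0.0", 10, "ipv4"); // carrier-grade NAT
blocked.addSubnet("127.0.0.0", 8, "ipv4"); // loopback
blocked.addSubnet("169.254.0.0", 16, "ipv4"); // link-local, incl. cloud metadata
blocked.addSubnet("172.16.0.0", 12, "ipv4"); // private
blocked.addSubnet("192.168.0.0", 16, "ipv4"); // private
blocked.addAddress("::", "ipv6"); // unspecified
blocked.addAddress("::1", "ipv6"); // loopback
blocked.addSubnet("fc00::", 7, "ipv6"); // unique local
blocked.addSubnet("fe80::", 10, "ipv6"); // link-local

/** Returns true when the address is not an IP literal or falls in a blocked range. */
function isBlockedAddress(address: string): boolean {
  const family = isIP(address); // 4, 6, or 0 when not an IP literal
  if (family === 0) return true;
  return blocked.check(address, family === 6 ? "ipv6" : "ipv4");
}

/** Hypothetical guard to run before fetch(sourceUrl); throws on unsafe input. */
async function assertSafeSourceUrl(sourceUrl: string): Promise<URL> {
  const url = new URL(sourceUrl); // throws on malformed input
  if (url.protocol !== "http:" && url.protocol !== "https:") {
    throw new Error(`Unsupported URL scheme: ${url.protocol}`);
  }
  // URL.hostname keeps the brackets around IPv6 literals; strip them.
  const host = url.hostname.replace(/^\[|\]$/g, "");
  // IP literals are checked directly; hostnames are resolved first.
  const addresses = isIP(host)
    ? [host]
    : (await lookup(host, { all: true })).map((entry) => entry.address);
  if (addresses.length === 0 || addresses.some((a) => isBlockedAddress(a))) {
    throw new Error(`Refusing to fetch from ${url.hostname}`);
  }
  return url;
}
```

`fetchRemotePdf` could call `await assertSafeSourceUrl(sourceUrl)` before `fetch`. This guard alone leaves two gaps: the hostname is resolved again when `fetch` connects, so DNS rebinding remains possible, and redirects are not re-validated. Passing `redirect: "manual"` to `fetch` and validating each hop, or using an allowlist of hosts, would narrow those gaps.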
What
Add the self-host workspace and OCR milestone on top of current main, without reintroducing the SDK conflict set.
Included:
- /api/v2/assets compatibility
Excluded on purpose:
- main
Why
The previous PR branch hit merge conflicts against main, which prevented GitHub Actions pull_request workflows from running at all.
This branch keeps the self-host product changes but drops the overlapping SDK surface so CI can run normally.
How
- main
- web/, docker/, docs/self-hosting, and workflow changes
- sdk/* out of this PR
- workspace -> Reader -> Paper Library import -> Asset Browser reopen
Checklist
- npm run lint passes (in web/)
- npm test passes (in web/)
Type of Change