fix(onboarding): run brand analysis as a background job (avoid reverse-proxy 504s) #385
Open
melalj wants to merge 2 commits into
analyzeBrand ran synchronously inside the onboarding server function and takes ~1 minute (LLM + web search). Reverse proxies (nginx/CapRover) kill the request at their read timeout, so users get a 504 even though the analysis finishes server-side.

Move it onto a pg-boss `analyze-brand` queue handled by the worker:

- worker: new analyze-brand job + queue; the handler returns the suggestion, which pg-boss stores as the job output (batchSize 1 keeps output mapped 1:1 to a single job).
- web: startAnalyzeBrandFn enqueues and returns a jobId immediately; getAnalyzeBrandStatusFn reads job state/output via getJobById.
- wizard: enqueue then poll every 2s (up to ~6 min) instead of holding a single long-lived request open.

No DB migration: the result rides in the pg-boss job output.
Someone is attempting to deploy a commit to the Blue Whale Labs Team on Vercel. A member of the Team first needs to authorize it.
Thanks for the PR! At a glance it looks good, I'll do a deeper look soon and get it merged!
First off — thank you for building and open-sourcing Elmo 🙏. A self-hostable, fully-auditable AEO/GEO tracker is genuinely valuable, and the codebase was a pleasure to work in — the existing pg-boss worker pattern made this fix straightforward to slot in.
Problem
When self-hosting behind a reverse proxy (CapRover/nginx in my case), the onboarding "Analyze brand" step returns a `504 Gateway Time-out`, even though the analysis actually succeeds server-side. The worker log tells the story.

`analyzeBrand()` runs synchronously inside the `analyzeBrandFn` server function and takes ~1 minute (LLM + web search). nginx's default `proxy_read_timeout` is 60s, so the proxy gives up before the response comes back. Raising the proxy timeout is only a band-aid: a ~1-minute synchronous request is fragile by design.

Fix
Move brand analysis onto the existing pg-boss worker and let the wizard poll for the result, so the HTTP request returns in milliseconds.
- Worker: `analyze-brand` queue + handler (`apps/worker/src/jobs/analyze-brand.ts`). The handler returns the `OnboardingSuggestion`, which pg-boss stores as the job output. Registered with `batchSize: 1` (see note).
- Web: `startAnalyzeBrandFn` enqueues and returns a `jobId` immediately; `getAnalyzeBrandStatusFn` reads job state/output via `getJobById`.
- Wizard (`prompt-wizard.tsx`): enqueues, then polls every 2s (up to ~6 min), reusing the existing "Analyzing brand…" UI. On worker/DB trouble it surfaces a clean "timed out, please try again" instead of a 504.

No DB migration: the result rides in the pg-boss job output.
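For reviewers, the web-side flow above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `JobQueue` is a hypothetical two-method stand-in for the queue client, `InMemoryQueue` replaces Postgres, the job states are a simplified subset, and the `OnboardingSuggestion` fields and function names are assumptions.

```typescript
// Hypothetical shape; the real OnboardingSuggestion type lives in the repo.
type OnboardingSuggestion = { brandName: string; competitors: string[] };

// Simplified subset of job states (assumption, for illustration only).
type JobState = "created" | "active" | "completed" | "failed";

interface JobRecord {
  id: string;
  state: JobState;
  output?: unknown;
}

// Hypothetical stand-in for the two queue calls the web side needs.
interface JobQueue {
  send(queue: string, data: object): Promise<string | null>;
  getJobById(queue: string, id: string): Promise<JobRecord | null>;
}

type AnalyzeBrandStatus =
  | { status: "pending" }
  | { status: "completed"; suggestion: OnboardingSuggestion }
  | { status: "failed"; message: string };

const QUEUE = "analyze-brand";

// Enqueue and return immediately, so the HTTP request finishes in milliseconds.
async function startAnalyzeBrand(queue: JobQueue, website: string): Promise<{ jobId: string }> {
  const jobId = await queue.send(QUEUE, { website });
  if (!jobId) throw new Error("Failed to enqueue analyze-brand job");
  return { jobId };
}

// Map the stored job state/output onto a small status union the wizard can poll.
async function getAnalyzeBrandStatus(queue: JobQueue, jobId: string): Promise<AnalyzeBrandStatus> {
  const job = await queue.getJobById(QUEUE, jobId);
  if (!job) return { status: "failed", message: "Job not found" };
  if (job.state === "completed") {
    return { status: "completed", suggestion: job.output as OnboardingSuggestion };
  }
  if (job.state === "failed") {
    const output = job.output as { message?: string } | undefined;
    return { status: "failed", message: output?.message ?? "Analysis failed" };
  }
  return { status: "pending" };
}

// In-memory queue used only to make the sketch runnable without a database.
class InMemoryQueue implements JobQueue {
  private jobs = new Map<string, JobRecord>();
  private nextId = 1;

  async send(_queue: string, _data: object): Promise<string> {
    const id = `job-${this.nextId++}`;
    this.jobs.set(id, { id, state: "created" });
    return id;
  }

  async getJobById(_queue: string, id: string): Promise<JobRecord | null> {
    return this.jobs.get(id) ?? null;
  }

  // Simulates the worker finishing (or failing) a job and storing its output.
  finish(id: string, state: "completed" | "failed", output: unknown): void {
    const job = this.jobs.get(id);
    if (!job) throw new Error(`Unknown job ${id}`);
    job.state = state;
    job.output = output;
  }
}
```

The point of the status union is that the wizard never has to interpret raw job rows: it only sees pending, completed with a suggestion, or failed with a readable message.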
Design notes
- `batchSize: 1` is load-bearing, not cosmetic. In pg-boss v12 (`manager.js`), the handler's return value is persisted as the job output only for single-job batches, hence `batchSize: 1`.
- The `/api/v1/tools/analyze` endpoint and the admin re-run still call `analyzeBrand` directly; they are different consumers with their own timeout expectations. Happy to convert those the same way if you'd like.
- `withSentry` is now generic over the handler's return type (was hard-coded to `Promise<void>`); existing handlers are unaffected.

Testing
- `pnpm --filter @workspace/worker check-types` and `--filter @workspace/web check-types`: both clean.
- `policies.test.ts`: 85/85 (updated the representative path from `/_server/analyzeBrandFn` → `/_server/startAnalyzeBrandFn`).
- Tested against `postgres:16-alpine` with the pinned `pg-boss@12.19.1`, replicating the exact wiring:
  - `getJobById` output deep-equals the suggestion ✅
  - `state=failed`, `output.message` readable (the failed branch of `getAnalyzeBrandStatusFn`) ✅

Not yet exercised end-to-end: the full HTTP + React polling path with the real LLM call (existing, unchanged code).
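The polling contract (every 2s, up to ~6 min) could be exercised without the real LLM call by injecting both the status function and the sleep. The sketch below is only an illustration of that idea, not the code in `prompt-wizard.tsx`: `pollUntilDone`, `PollStatus`, `PollOptions`, the attempt count of 180, and the error wording are all assumptions.

```typescript
// Hypothetical status union returned by the status endpoint.
type PollStatus<T> =
  | { status: "pending" }
  | { status: "completed"; result: T }
  | { status: "failed"; message: string };

interface PollOptions {
  intervalMs: number; // 2_000 would match "every 2s"
  maxAttempts: number; // 180 attempts x 2s = 6 min (assumed split)
  sleep: (ms: number) => Promise<void>; // injectable so tests need no real timers
}

// Poll until the job completes, fails, or the attempt budget runs out.
async function pollUntilDone<T>(
  getStatus: () => Promise<PollStatus<T>>,
  opts: PollOptions,
): Promise<T> {
  for (let attempt = 0; attempt < opts.maxAttempts; attempt++) {
    const current = await getStatus();
    if (current.status === "completed") return current.result;
    if (current.status === "failed") throw new Error(current.message);
    await opts.sleep(opts.intervalMs);
  }
  // Surfaced to the user instead of a proxy-level 504.
  throw new Error("Brand analysis timed out, please try again");
}

// Real-timer sleep for production use; tests can pass a recording fake instead.
const realSleep = (ms: number): Promise<void> =>
  new Promise((resolve) => setTimeout(resolve, ms));

const defaultPollOptions: PollOptions = {
  intervalMs: 2_000,
  maxAttempts: 180,
  sleep: realSleep,
};
```

With a fake `sleep` that only records its arguments, a test can script a sequence such as pending, pending, completed and assert on both the result and the number of waits, all without touching the network.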
Notes / open questions
`melalj` added to `.github/contributors.txt` in this PR.

Thanks again for the project! 🐳