fix(onboarding): run brand analysis as a background job (avoid reverse-proxy 504s) by melalj · Pull Request #385 · elmohq/elmo

melalj · 2026-06-25T22:18:42Z

First off — thank you for building and open-sourcing Elmo 🙏. A self-hostable, fully-auditable AEO/GEO tracker is genuinely valuable, and the codebase was a pleasure to work in — the existing pg-boss worker pattern made this fix straightforward to slot in.

Problem

When self-hosting behind a reverse proxy (CapRover/nginx in my case), the onboarding "Analyze brand" step returns a 504 Gateway Time-out, even though the analysis actually succeeds server-side. The worker log tells the story:

[onboarding] analyzeBrand done: https://2sync.com/ in 55915ms (brand="2sync", competitors=10, prompts=24)

analyzeBrand() runs synchronously inside the analyzeBrandFn server function and takes ~1 minute (LLM + web search). nginx's default proxy_read_timeout is 60s, so the proxy gives up before the response comes back. Raising the proxy timeout is only a band-aid — a ~1-minute synchronous request is fragile by design.

Fix

Move brand analysis onto the existing pg-boss worker and let the wizard poll for the result, so the HTTP request returns in milliseconds.

worker — new analyze-brand queue + handler (apps/worker/src/jobs/analyze-brand.ts). The handler returns the OnboardingSuggestion, which pg-boss stores as the job output. Registered with batchSize: 1 (see note).
web — startAnalyzeBrandFn enqueues and returns a jobId immediately; getAnalyzeBrandStatusFn reads job state/output via getJobById.
wizard (prompt-wizard.tsx) — enqueues, then polls every 2s (up to ~6 min), reusing the existing "Analyzing brand…" UI. On worker/DB trouble it surfaces a clean "timed out, please try again" instead of a 504.
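
The polling side of the wizard can be sketched roughly as follows. This is a minimal illustration, not the code in prompt-wizard.tsx: `pollForResult`, the `AnalyzeStatus` shape, and the option names are hypothetical, and only the 2s cadence, the ~6 minute cap, and the "timed out, please try again" message come from the description above.

```typescript
// Hypothetical status shape; the real getAnalyzeBrandStatusFn may return something different.
type AnalyzeStatus<T> =
  | { state: "pending" }
  | { state: "completed"; result: T }
  | { state: "failed"; message: string };

interface PollOptions {
  intervalMs: number; // delay between polls (the PR uses 2000)
  maxAttempts: number; // 180 attempts at 2s is roughly 6 minutes
  sleep?: (ms: number) => Promise<void>; // injectable so tests need not wait
}

const defaultSleep = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Calls getStatus until the job completes, fails, or the attempt budget runs out.
async function pollForResult<T>(
  getStatus: () => Promise<AnalyzeStatus<T>>,
  { intervalMs, maxAttempts, sleep = defaultSleep }: PollOptions,
): Promise<T> {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const status = await getStatus();
    if (status.state === "completed") return status.result;
    if (status.state === "failed") throw new Error(status.message);
    await sleep(intervalMs); // still pending: wait, then ask again
  }
  throw new Error("Brand analysis timed out, please try again.");
}
```

Each poll is a short request, so no single HTTP call comes anywhere near the proxy's read timeout, regardless of how long the analysis itself takes.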

No DB migration — the result rides in the pg-boss job output.

Design notes

batchSize: 1 is load-bearing, not cosmetic. In pg-boss v12 (manager.js), the handler's return value is persisted as the job output only for single-job batches:
```
await this.complete(name, jobIds, jobIds.length === 1 ? result : undefined);
```
With a larger batch the output would be dropped (and the wizard would poll until timeout), so the queue is registered with batchSize: 1.
Scope kept intentionally tight. The public API route (/api/v1/tools/analyze) and the admin re-run still call analyzeBrand directly — different consumers with their own timeout expectations. Happy to convert those the same way if you'd like.
withSentry is now generic over the handler's return type (was hard-coded to Promise<void>); existing handlers are unaffected.
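
To make the batchSize note concrete, here is a small in-memory model of the completion rule quoted above. It is not pg-boss code: `completeBatch` and the `Map`-based store are illustrative stand-ins, and only the `jobIds.length === 1 ? result : undefined` condition is taken from the quoted line.

```typescript
// Simplified model of the quoted rule; names here are hypothetical, not pg-boss APIs.
type Output = unknown;

function completeBatch(
  store: Map<string, Output>,
  jobIds: string[],
  result: Output,
): void {
  // Mirrors: jobIds.length === 1 ? result : undefined
  const output = jobIds.length === 1 ? result : undefined;
  for (const id of jobIds) store.set(id, output);
}

const store = new Map<string, Output>();

// Single-job batch: the handler's return value is kept as the job output.
completeBatch(store, ["job-a"], { brand: "2sync" });

// Two-job batch: both jobs complete, but the return value is dropped.
completeBatch(store, ["job-b", "job-c"], { brand: "2sync" });

console.log(store.get("job-a")); // the suggestion object
console.log(store.get("job-b")); // undefined
```

In this model a poller waiting on `job-b` would see a completed job with no output, which is the failure mode that registering the queue with batchSize: 1 avoids.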

Testing

pnpm --filter @workspace/worker check-types and --filter @workspace/web check-types — both clean.
policies.test.ts — 85/85 (updated the representative path from /_server/analyzeBrandFn → /_server/startAnalyzeBrandFn).
Live smoke test against a real postgres:16-alpine with the pinned pg-boss@12.19.1, replicating the exact wiring:
- enqueue → worker returns suggestion → getJobById output deep-equals the suggestion ✅
- handler throws → state=failed, output.message readable (the failed branch of getAnalyzeBrandStatusFn) ✅
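
The two cases above correspond to a state-to-status mapping along these lines. This is a sketch under assumptions: `toStatusResponse`, `JobRecord`, and `StatusResponse` are hypothetical names, the state list is assumed to match pg-boss job states, and the actual getAnalyzeBrandStatusFn may handle missing or cancelled jobs differently.

```typescript
// Assumed job states; treat this list as an assumption rather than the library's definition.
type JobState = "created" | "retry" | "active" | "completed" | "cancelled" | "failed";

// Hypothetical subset of the record returned by getJobById.
interface JobRecord {
  state: JobState;
  output?: unknown;
}

type StatusResponse =
  | { status: "pending" }
  | { status: "completed"; suggestion: unknown }
  | { status: "failed"; message: string };

function toStatusResponse(job: JobRecord | null): StatusResponse {
  if (job === null) return { status: "failed", message: "Job not found" };
  switch (job.state) {
    case "completed":
      // The handler's return value, stored as the job output.
      return { status: "completed", suggestion: job.output };
    case "failed":
    case "cancelled": {
      // On failure the output is expected to carry the error message.
      const out = job.output as { message?: unknown } | null | undefined;
      const message =
        typeof out?.message === "string" ? out.message : "Brand analysis failed";
      return { status: "failed", message };
    }
    default:
      // created, retry, active: keep polling.
      return { status: "pending" };
  }
}
```

For example, `toStatusResponse({ state: "failed", output: { message: "boom" } })` yields a failed status carrying the message "boom", which the wizard can show instead of polling until timeout.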

Not yet exercised end-to-end: the full HTTP + React polling path with the real LLM call (existing, unchanged code).

Notes / open questions

Glad to adjust polling cadence/UX, add a worker unit test (there are no worker tests yet, so I didn't want to set a pattern without your steer), or wire the public API route the same way. If you'd prefer an issue first for a change of this size, just say so.
CLA: signed — melalj added to .github/contributors.txt in this PR.

Thanks again for the project! 🐳

analyzeBrand ran synchronously inside the onboarding server function and took ~1 minute (LLM + web search). Reverse proxies (nginx/CapRover) kill the request at their read timeout, so users got a 504 even though the analysis finished server-side.

Move it onto a pg-boss `analyze-brand` queue handled by the worker:

- worker: new analyze-brand job + queue; the handler returns the suggestion, which pg-boss stores as the job output (batchSize 1 keeps output mapped 1:1 to a single job).
- web: startAnalyzeBrandFn enqueues and returns a jobId immediately; getAnalyzeBrandStatusFn reads job state/output via getJobById.
- wizard: enqueue, then poll every 2s (up to ~6 min) instead of holding a single long-lived request open.

No DB migration: the result rides in the pg-boss job output.

vercel · 2026-06-25T22:18:45Z

Someone is attempting to deploy a commit to the Blue Whale Labs Team on Vercel.

A member of the Team first needs to authorize it.

jrhizor · 2026-06-25T22:43:58Z

Thanks for the PR! At a glance it looks good, I'll do a deeper look soon and get it merged!

chore: sign the CLA

0d7c625

melalj marked this pull request as ready for review June 25, 2026 22:27

melalj requested a review from jrhizor as a code owner June 25, 2026 22:27

melalj wants to merge 2 commits into elmohq:main from melalj:fix/async-brand-analysis
