
fix(onboarding): run brand analysis as a background job (avoid reverse-proxy 504s) #385

Open
melalj wants to merge 2 commits into elmohq:main from melalj:fix/async-brand-analysis

melalj commented Jun 25, 2026

First off — thank you for building and open-sourcing Elmo 🙏. A self-hostable, fully-auditable AEO/GEO tracker is genuinely valuable, and the codebase was a pleasure to work in — the existing pg-boss worker pattern made this fix straightforward to slot in.

Problem

When self-hosting behind a reverse proxy (CapRover/nginx in my case), the onboarding "Analyze brand" step returns a 504 Gateway Time-out, even though the analysis actually succeeds server-side. The worker log tells the story:

[onboarding] analyzeBrand done: https://2sync.com/ in 55915ms (brand="2sync", competitors=10, prompts=24)

analyzeBrand() runs synchronously inside the analyzeBrandFn server function and takes ~1 minute (LLM + web search). nginx's default proxy_read_timeout is 60s, so the proxy gives up before the response comes back. Raising the proxy timeout is only a band-aid — a ~1-minute synchronous request is fragile by design.

Fix

Move brand analysis onto the existing pg-boss worker and let the wizard poll for the result, so the HTTP request returns in milliseconds.

  • worker — new analyze-brand queue + handler (apps/worker/src/jobs/analyze-brand.ts). The handler returns the OnboardingSuggestion, which pg-boss stores as the job output. Registered with batchSize: 1 (see note).
  • web: startAnalyzeBrandFn enqueues and returns a jobId immediately; getAnalyzeBrandStatusFn reads job state/output via getJobById.
  • wizard (prompt-wizard.tsx) — enqueues, then polls every 2s (up to ~6 min), reusing the existing "Analyzing brand…" UI. On worker/DB trouble it surfaces a clean "timed out, please try again" instead of a 504.
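
For readers who want to picture the wizard side, here is a minimal sketch of an enqueue-then-poll loop. It is not the code from prompt-wizard.tsx: the `AnalyzeStatus` shape, `pollUntilDone`, and the injectable `sleep` are hypothetical names chosen for illustration. Only the 2s interval, the ~6 min ceiling, and the "timed out, please try again" message come from the description above.

```typescript
// Hypothetical status shape; the PR's real return type is not shown here.
type AnalyzeStatus<T> =
  | { state: "pending" }
  | { state: "completed"; output: T }
  | { state: "failed"; message: string };

interface PollOptions {
  intervalMs: number; // 2000 in the PR description
  maxAttempts: number; // 180 attempts at 2s each is about 6 minutes
  sleep?: (ms: number) => Promise<void>; // injectable so tests need not wait
}

const defaultSleep = (ms: number): Promise<void> =>
  new Promise((resolve) => setTimeout(resolve, ms));

// Calls getStatus until the job completes, fails, or the attempt budget runs out.
async function pollUntilDone<T>(
  getStatus: () => Promise<AnalyzeStatus<T>>,
  { intervalMs, maxAttempts, sleep = defaultSleep }: PollOptions,
): Promise<T> {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const status = await getStatus();
    if (status.state === "completed") return status.output;
    if (status.state === "failed") throw new Error(status.message);
    await sleep(intervalMs); // still pending: wait, then ask again
  }
  throw new Error("timed out, please try again");
}
```

Each status call is a short request, so no single HTTP round trip comes close to a proxy read timeout, however long the analysis itself takes.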

No DB migration — the result rides in the pg-boss job output.

Design notes

  • batchSize: 1 is load-bearing, not cosmetic. In pg-boss v12 (manager.js), the handler's return value is persisted as the job output only for single-job batches:
    await this.complete(name, jobIds, jobIds.length === 1 ? result : undefined);
    With a larger batch the output would be dropped (and the wizard would poll until timeout), so the queue is registered with batchSize: 1.
  • Scope kept intentionally tight. The public API route (/api/v1/tools/analyze) and the admin re-run still call analyzeBrand directly — different consumers with their own timeout expectations. Happy to convert those the same way if you'd like.
  • withSentry is now generic over the handler's return type (was hard-coded to Promise<void>); existing handlers are unaffected.
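
To make the first and third notes concrete, here is a small sketch of why both matter for getting the suggestion back out of the job. `withReporting` is a hypothetical stand-in for withSentry (its `report` callback replaces the real Sentry call, and the rethrow is an assumption), and `persistedOutput` mirrors only the single expression quoted from manager.js. Neither is the actual PR or pg-boss code.

```typescript
type Handler<TJob, TResult> = (jobs: TJob[]) => Promise<TResult>;

// Hypothetical wrapper: generic over TResult so the handler's return value
// passes through. A wrapper typed as Promise<void> would hide that value.
function withReporting<TJob, TResult>(
  handler: Handler<TJob, TResult>,
  report: (err: unknown) => void,
): Handler<TJob, TResult> {
  return async (jobs) => {
    try {
      return await handler(jobs);
    } catch (err) {
      report(err);
      throw err; // assumption: rethrow so the job is still marked failed
    }
  };
}

// Models the quoted rule: output is kept only when the batch holds one job.
function persistedOutput<T>(jobIds: string[], result: T): T | undefined {
  return jobIds.length === 1 ? result : undefined;
}
```

With a batch of two or more, `persistedOutput` yields `undefined` whatever the handler returned. That is the case batchSize: 1 rules out.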

Testing

  • pnpm --filter @workspace/worker check-types and --filter @workspace/web check-types — both clean.
  • policies.test.ts — 85/85 (updated the representative path from /_server/analyzeBrandFn to /_server/startAnalyzeBrandFn).
  • Live smoke test against a real postgres:16-alpine with the pinned pg-boss@12.19.1, replicating the exact wiring:
    • enqueue → worker returns suggestion → getJobById output deep-equals the suggestion ✅
    • handler throws → state=failed, output.message readable (the failed branch of getAnalyzeBrandStatusFn) ✅
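
As an illustration of the two branches checked above, the mapping from a stored job to a status response could look roughly like this. It is a sketch under assumptions, not getAnalyzeBrandStatusFn itself: `JobRecord` is a hypothetical minimal view of what getJobById returns, and the fallback messages, the treatment of a missing job, and the handling of cancelled jobs are illustrative choices.

```typescript
// Hypothetical minimal view of a stored job; real pg-boss jobs carry more fields.
type JobState = "created" | "retry" | "active" | "completed" | "cancelled" | "failed";
interface JobRecord {
  state: JobState;
  output?: unknown;
}

type BrandStatus<T> =
  | { state: "pending" }
  | { state: "completed"; output: T }
  | { state: "failed"; message: string };

// Reads output.message defensively, since output is untyped JSON.
function readMessage(output: unknown): string {
  if (typeof output === "object" && output !== null && "message" in output) {
    const message = (output as { message?: unknown }).message;
    if (typeof message === "string") return message;
  }
  return "Brand analysis failed"; // illustrative fallback text
}

function toStatus<T>(job: JobRecord | null): BrandStatus<T> {
  if (job === null) return { state: "failed", message: "Job not found" };
  switch (job.state) {
    case "completed":
      // Assumes the handler's return value was stored as the job output.
      return { state: "completed", output: job.output as T };
    case "failed":
    case "cancelled":
      return { state: "failed", message: readMessage(job.output) };
    default:
      return { state: "pending" }; // created, retry, active
  }
}
```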

Not yet exercised end-to-end: the full HTTP + React polling path with the real LLM call (existing, unchanged code).

Notes / open questions

  • Glad to adjust polling cadence/UX, add a worker unit test (there are no worker tests yet, so I didn't want to set a pattern without your steer), or wire the public API route the same way. If you'd prefer an issue first for a change of this size, just say so.
  • CLA: signed — melalj added to .github/contributors.txt in this PR.

Thanks again for the project! 🐳

analyzeBrand ran synchronously inside the onboarding server function and
takes ~1 minute (LLM + web search). Reverse proxies (nginx/CapRover) kill
the request at their read timeout, so users get a 504 even though the
analysis finishes server-side.

Move it onto a pg-boss `analyze-brand` queue handled by the worker:
- worker: new analyze-brand job + queue; the handler returns the
  suggestion, which pg-boss stores as the job output (batchSize 1 keeps
  output mapped 1:1 to a single job).
- web: startAnalyzeBrandFn enqueues and returns a jobId immediately;
  getAnalyzeBrandStatusFn reads job state/output via getJobById.
- wizard: enqueue then poll every 2s (up to ~6 min) instead of holding a
  single long-lived request open.

No DB migration — the result rides in the pg-boss job output.

vercel Bot commented Jun 25, 2026

Someone is attempting to deploy a commit to the Blue Whale Labs Team on Vercel.

A member of the Team first needs to authorize it.

melalj marked this pull request as ready for review June 25, 2026 22:27
melalj requested a review from jrhizor as a code owner June 25, 2026 22:27

jrhizor (Contributor) commented Jun 25, 2026

Thanks for the PR! At a glance it looks good. I'll take a deeper look soon and get it merged!
