feat: refactoring and enabling diagnostics logger and traces#3300
feat: refactoring and enabling diagnostics logger and traces#3300ommeirelles wants to merge 42 commits into
Conversation
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
WalkthroughMigrate diagnostics from per-package maps to global singletons; add OTLP traces/log exporter modules, console proxying, and refactor telemetry bootstrap, exports, OTEL context injection, and prod config/deps. ChangesGlobal Singleton Diagnostics Architecture
Sequence DiagramsequenceDiagram
participant App as Application
participant Start as getTelemetryClient()
participant Logger as getLogger()
participant Tracer as getTracesClient()
participant Console as globalThis.console
participant OTLP as OTLP Endpoint
App->>Start: Initialize telemetry
Start->>Logger: getLogger(client, opts)
Logger->>OTLP: setupLogsExporter()
OTLP-->>Logger: Exporter ready
Logger->>Logger: Create LogsClient
Logger->>Console: Replace with Proxy
Logger-->>Start: Logger cached
Start->>Tracer: getTracesClient(telemetry)
Tracer->>OTLP: setupTracesExporter()
OTLP-->>Tracer: Exporter ready
Tracer->>Tracer: Create TraceClient
Tracer-->>Start: Traces cached
Start-->>App: TelemetryClient ready
rect rgba(100, 150, 200, 0.5)
Note over Console,OTLP: Runtime console override
App->>Console: console.info("message")
Console->>Logger: logger.info(formatted)
Logger->>OTLP: Send log
Console->>Console: Original console.info()
end
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Possibly related PRs
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
@faststore/api
@faststore/cli
@faststore/components
@faststore/core
@faststore/diagnostics
@faststore/lighthouse
@faststore/sdk
@faststore/ui
commit: |
There was a problem hiding this comment.
Actionable comments posted: 4
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
packages/diagnostics/configs/prod.json (1)
4-8:⚠️ Potential issue | 🟠 Major | ⚡ Quick winReduce the production sampling jump.
Sampling every request at
0.3turns a 1%-style baseline into a 30x higher trace volume for all traffic. That is a meaningful cost and backpressure increase in production. Please gate this behind an env flag or narrow it to specific rules/endpoints first.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@packages/diagnostics/configs/prod.json` around lines 4 - 8, The production diagnostics config currently raises global sampling from defaultRate to 0.3 and defines a broad rule "trace_all" with sampleRate 0.3 which will massively increase trace volume; change this by lowering defaultRate and/or sampleRate on the "trace_all" rule, and gate the higher sampling behind an environment flag (e.g., DIAGNOSTICS_HIGH_SAMPLING) or restrict the rule to specific endpoints/tags instead of "trace_all" so you only sample targeted traffic when the flag is set; update the JSON keys defaultRate, rules[*].name "trace_all", and rules[*].sampleRate accordingly and add logic to read the env flag before enabling the high-sampling rule.
🧹 Nitpick comments (1)
packages/diagnostics/package.json (1)
30-31: ⚡ Quick winPlease capture the bundle-size impact for these new runtime deps.
This adds two runtime dependencies, but there’s no accompanying budget/update note to show the downstream size impact. Please attach the delta or update the package budget check with this change.
As per coding guidelines, "Maintain strict performance budgets by monitoring bundle size when adding new dependencies".
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@packages/diagnostics/package.json` around lines 30 - 31, The PR adds two runtime deps "@vtex/diagnostics-nodejs" and "@vtex/diagnostics-semconv" but lacks a bundle-size impact report or budget update; run your project's bundle-size/size-budget tool (or local webpack/rollup build + source-map-explorer) to measure the delta introduced by these packages, attach the resulting size delta to the PR, and update the package budget/check file (the project's size budget config) to account for the added bytes if acceptable; reference the dependency entries "@vtex/diagnostics-nodejs" and "@vtex/diagnostics-semconv" when updating the budget or PR description so reviewers can verify the change.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@packages/diagnostics/src/globals.ts`:
- Around line 3-5: The globals TELEMETRY_CLIENT, TRACE_CLIENT, and LOGGER_CLIENT
are currently singletons which makes getTelemetryClient(serviceName, clientName,
account, packageName) sticky to the first caller’s metadata; change these
globals back to keyed caches (e.g., Map or object) and update
getTelemetryClient, getTraceClient, and getLoggerClient to compute a composite
key from serviceName/clientName/account/packageName and look up/create clients
per-key instead of using a single process-wide value; alternatively, if
singleton behavior is intended, remove per-call metadata parameters from the
public API (getTelemetryClient signature) so callers cannot pass differing
metadata that would be ignored. Ensure creation paths set the keyed cache entry
and retrieval paths use the same composite key.
In `@packages/diagnostics/src/logger.ts`:
- Around line 15-18: The logs exporter is forcing plaintext transport by passing
insecure: true to Exporters.CreateLogsExporterConfig with OTLP_LOGGER_ENDPOINT;
change this to be environment-aware (e.g. derive insecure from IS_DEV or an env
var) so insecure is true only in development/testing and false in production,
updating the call in logger.ts (the Exporters.CreateLogsExporterConfig
invocation) to compute insecure dynamically and preserve current behavior for
traces which already use IS_DEV.
In `@packages/diagnostics/src/start.ts`:
- Around line 17-18: getTelemetryClient() currently returns the existing
global.fsDiagnostics.TELEMETRY_CLIENT synchronously before getLogger() and
getTracesClient() finish, causing tracing/logger setup race conditions; change
the function to await completion of getLogger() and getTracesClient() before
returning the client and promote the singleton to an initialization promise
(e.g. store a TELEMETRY_CLIENT_INIT promise on global.fsDiagnostics or make
TELEMETRY_CLIENT hold the promise) so concurrent first calls reuse the same
initialization flow and do not create duplicate clients or surface unhandled
rejections.
In `@packages/diagnostics/src/tracer.ts`:
- Around line 18-24: Move the cache check to the top of getTracesClient so we
don't create an exporter when a trace client already exists: check
global.fsDiagnostics.TRACE_CLIENT first and return it if present, and only call
setupTracesExporter() and telemetryClient.newTracesClient(...) when the cache is
empty; update getTracesClient to set global.fsDiagnostics.TRACE_CLIENT after
creating the exporter/client (use the existing setupTracesExporter and
telemetryClient.newTracesClient symbols).
---
Outside diff comments:
In `@packages/diagnostics/configs/prod.json`:
- Around line 4-8: The production diagnostics config currently raises global
sampling from defaultRate to 0.3 and defines a broad rule "trace_all" with
sampleRate 0.3 which will massively increase trace volume; change this by
lowering defaultRate and/or sampleRate on the "trace_all" rule, and gate the
higher sampling behind an environment flag (e.g., DIAGNOSTICS_HIGH_SAMPLING) or
restrict the rule to specific endpoints/tags instead of "trace_all" so you only
sample targeted traffic when the flag is set; update the JSON keys defaultRate,
rules[*].name "trace_all", and rules[*].sampleRate accordingly and add logic to
read the env flag before enabling the high-sampling rule.
---
Nitpick comments:
In `@packages/diagnostics/package.json`:
- Around line 30-31: The PR adds two runtime deps "@vtex/diagnostics-nodejs" and
"@vtex/diagnostics-semconv" but lacks a bundle-size impact report or budget
update; run your project's bundle-size/size-budget tool (or local webpack/rollup
build + source-map-explorer) to measure the delta introduced by these packages,
attach the resulting size delta to the PR, and update the package budget/check
file (the project's size budget config) to account for the added bytes if
acceptable; reference the dependency entries "@vtex/diagnostics-nodejs" and
"@vtex/diagnostics-semconv" when updating the budget or PR description so
reviewers can verify the change.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 6d2e341d-d5bb-4397-aaea-bb6035adb74b
⛔ Files ignored due to path filters (4)
.github/workflows/ci.ymlis excluded by none and included by none.github/workflows/packages-preview.ymlis excluded by none and included by none.npmrcis excluded by none and included by nonepnpm-lock.yamlis excluded by!**/pnpm-lock.yamland included by none
📒 Files selected for processing (8)
packages/diagnostics/@types/global.d.tspackages/diagnostics/configs/prod.jsonpackages/diagnostics/package.jsonpackages/diagnostics/src/globals.tspackages/diagnostics/src/index.tspackages/diagnostics/src/logger.tspackages/diagnostics/src/start.tspackages/diagnostics/src/tracer.ts
6c7ef8e to
4d4ac3a
Compare
There was a problem hiding this comment.
Actionable comments posted: 6
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (3)
packages/core/src/instrumentation.ts (2)
18-20:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winSwallowed error loses the actual failure reason.
The
catchdiscardserror, so on-call will see “Failed to initialize OTEL Instrumentation” with zero context. Please include the error in the log so misconfigurations are debuggable.🪵 Suggested fix
- } catch (error) { - console.error('Failed to initialize OTEL Instrumentation') - } + } catch (error) { + console.error('Failed to initialize OTEL Instrumentation', error) + }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@packages/core/src/instrumentation.ts` around lines 18 - 20, The catch in packages/core/src/instrumentation.ts currently swallows the error; update the catch block in the OTEL initialization code (the try/catch around instrumentation setup in instrumentation.ts) to include the thrown error when logging (e.g., pass the caught error or its message/stack to console.error or your logger) so the log reads the failure message plus the actual error details.
9-9:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winTypo in log:
Instrumemtation.Tiny but it will show up in every node boot log.
✏️ Suggested fix
- console.log('Instrumemtation.ts: Getting telemetry client') + console.log('Instrumentation.ts: Getting telemetry client')🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@packages/core/src/instrumentation.ts` at line 9, Fix the typo in the console.log message that reads 'Instrumemtation.ts: Getting telemetry client' by updating the string to the correct spelling 'Instrumentation.ts: Getting telemetry client' in the file where that log is emitted (look for the exact log string to locate it).packages/api/src/observability/telemetry.ts (1)
21-24:⚠️ Potential issue | 🔴 Critical | 🏗️ Heavy liftAsync resolvers will get truncated spans and uncaught rejections.
GraphQL resolvers very often return a
Promise. With the new try/catch/finally:
finally { span.end() }runs the momentfn(...)returns the Promise, not when it resolves — so span duration measures Promise construction, not the actual work, and any child spans created during await chains have no live parent.catchonly catches synchronous throws. A rejected Promise fromfnwalks past this handler entirely, sosetStatus(ERROR)/recordException/ theconsole.errorlog never fire for the most common failure mode.You'll want to branch on the return value (or
awaitit) so spans only end on settle and rejections are recorded. Since the wrapper returnsTReturn(which may legitimately be a Promise), keeping the sync fast path is fine:🧵 Sketch — handle sync and async paths
- try { - return fn(source, vars, graphqlContext, info) - } catch (error: any) { - span?.setStatus({ code: OTELAPI.SpanStatusCode.ERROR }) - span?.recordException(error) - console.error( - `Error when executing resolver: ${resolverName}: \n %o`, - error - ) - throw error - } finally { - span?.end() - } + const recordError = (error: any) => { + span?.setStatus({ code: OTELAPI.SpanStatusCode.ERROR }) + span?.recordException(error) + console.error(`Error when executing resolver: ${resolverName}: \n %o`, error) + } + + try { + const result = fn(source, vars, graphqlContext, info) + if (result && typeof (result as any).then === 'function') { + return (result as any).then( + (v: any) => { span?.end(); return v }, + (err: any) => { recordError(err); span?.end(); throw err } + ) as TReturn + } + span?.end() + return result + } catch (error: any) { + recordError(error) + span?.end() + throw error + }Also applies to: 46-58
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@packages/api/src/observability/telemetry.ts` around lines 21 - 24, The wrapper currently ends spans and catches errors synchronously which truncates async spans and misses promise rejections; change the wrapper around fn(...) so it examines the return value and handles both sync and async paths: call const result = fn(source, vars, graphqlContext, info) inside your try block, then if result is a Promise (e.g., result && typeof result.then === 'function') attach result.then(...) to end the span and return the resolved value, and attach a .catch(...) that calls span.setStatus/recordException/console.error as needed and rethrows after ending the span; for non-Promise results keep the existing try/catch/finally behavior (ending the span in finally). Ensure you reference and update the span.end(), span.setStatus(ERROR), span.recordException(...) and console.error(...) usage so they run on promise rejection as well; preserve the function signature that returns TReturn (which may be a Promise).
🧹 Nitpick comments (3)
packages/components/src/molecules/Toast/Toast.tsx (1)
8-8: ⚡ Quick winConsider using
ReturnType<typeof setTimeout>for automatic type inference.The union type
number | NodeJS.Timeouthandles both browser and Node.js environments correctly, butReturnType<typeof setTimeout>would automatically adapt to the current environment's type definitions without manual union types.♻️ Alternative approach using ReturnType
- const timeoutRef = useRef<number | NodeJS.Timeout>() + const timeoutRef = useRef<ReturnType<typeof setTimeout>>()🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@packages/components/src/molecules/Toast/Toast.tsx` at line 8, Replace the explicit union type on the timeout ref with an environment-agnostic inferred type: update the timeoutRef declaration in Toast.tsx (the useRef for the timeout) to use ReturnType<typeof setTimeout> so the type automatically matches the runtime (browser or Node) instead of manually using number | NodeJS.Timeout.packages/api/src/observability/telemetry.ts (1)
26-26: 💤 Low valueNit: tracer name
'Graphql'→'GraphQL'.Standard capitalization; will read better in trace UIs and matches the wider ecosystem (
@opentelemetry/instrumentation-graphqlusesGraphQL).🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@packages/api/src/observability/telemetry.ts` at line 26, Update the tracer name passed to OTELAPI.trace.getTracer from 'Graphql' to the standard 'GraphQL' to match ecosystem capitalization; locate the getTracer call (const tracer = OTELAPI.trace.getTracer('Graphql')) and change the literal to 'GraphQL' so trace UIs and instrumentation naming are consistent.packages/core/src/instrumentation.ts (1)
4-22: ⚡ Quick winUnconditional OTEL bootstrap removes the only opt-out.
With
config.analytics.otelEnabledgone, every nodejs runtime — including local dev, CI, and self-hosted deployments without an OTLP collector — will attempt to initialize the diagnostics client and emit traces/logs. Consider keeping a single env-based kill switch (e.g.OTEL_SDK_DISABLEDor aFASTSTORE_OTEL_ENABLEDenv) so consumers can opt out without forking the package. Otherwise misconfigured collectors will spew connection errors on every request.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@packages/core/src/instrumentation.ts` around lines 4 - 22, The register() function unconditionally initializes OTEL in Node runtimes; add an opt-out check (e.g. respect a FASTSTORE_OTEL_ENABLED or OTEL_SDK_DISABLED env var or a config.analytics.otelEnabled flag) and return early when disabled to avoid attempting to import/@faststore/diagnostics; update the top of register() to bail out if the env/config indicates OTEL is disabled, and only proceed to import getTelemetryClient and call it when enabled (preserve existing try/catch and include the same return behavior using pkgJSON, config.api.storeId, and getTelemetryClient).
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@packages/api/src/observability/telemetry.ts`:
- Around line 38-44: The current code discards the Context returned by
OTELAPI.trace.setSpan and never activates it, so child operations don't inherit
the parent span; capture the returned Context (e.g., newCtx =
OTELAPI.trace.setSpan(...)) and run the wrapped work via
OTELAPI.context.with(newCtx, () => ... ) so the extracted parent is actually
activated; also ensure span?.end() runs after fn completes by awaiting the
result if fn returns a Promise (e.g., run fn inside the context, await its
result in a try/finally, then end the span in finally) and return the awaited
result so callers receive the original return value.
In `@packages/core/discovery.config.default.js`:
- Line 169: The change to the serviceName entry (serviceName: 'faststore') will
fragment historical telemetry keyed by service.name; either revert this value
back to the previous identity (e.g. 'faststore-proxy') in
discovery.config.default.js or, if the rename is intentional, coordinate with
observability owners and update dashboards/alerts/saved queries to use the new
name and add a clear migration note/release-log entry so historical and new
telemetry are reconciled; locate the serviceName assignment to make the revert
or to add the migration comment and notify owners.
In `@packages/core/src/server/options.ts`:
- Around line 2-6: The import list in options.ts includes an unused symbol
getTraceClient which triggers lint errors; edit the import from
'@faststore/diagnostics' to remove getTraceClient and keep only
getTelemetryClient and OTELAPI so the file only imports used symbols (ensure any
other references to getTraceClient in this file are also removed or replaced if
present).
- Around line 36-71: The withTraceClient function currently starts and ends a
span inside initialization (span in withTraceClient) which dies before request
handling (so child spans from GraphqlVtexContextFactory lose their parent), has
a try/catch that can fall through without returning in strict mode, and narrows
options.OTEL.__otelContext to {} which breaks propagation typing; fix by: stop
starting/ending a request-scoped span inside withTraceClient (move
startSpan/span.end to the HTTP/graphql request handler so the span wraps the
full request lifecycle), ensure the function always returns or rethrows in the
catch (e.g., remove the try/catch or rethrow the error after recording it in
span), and change the OTEL carrier declaration to an explicit
Record<string,string> (set options.OTEL.__otelContext type to
Record<string,string>) so OTELAPI.propagation.inject accepts it.
In `@packages/diagnostics/src/logger.ts`:
- Around line 27-38: The current getLogger implementation caches a single logger
in global.fsDiagnostics.LOGGER_CLIENT which captures and reuses the first call's
opt.serviceName/opt.client for all subsequent callers; change this to cache
loggers per unique caller (e.g., key = `${opt.serviceName}:${opt.client}`)
instead of a single global value so each service/client pair gets its own
LogsClient and console proxy. Update getLogger to look up and return from a map
(or object) stored on global.fsDiagnostics (instead of LOGGER_CLIENT), create a
new logger and call overrideConsole(logger, opt) only for the missing key, and
ensure overrideConsole is passed the current opt so metadata is not closed over
by the first initialization. Ensure references to getLogger,
global.fsDiagnostics.LOGGER_CLIENT, overrideConsole, opt.serviceName and
opt.client are updated accordingly.
- Around line 31-40: Multiple concurrent callers can pass the initial
global.fsDiagnostics.LOGGER_CLIENT check and each start creating
exporters/clients; fix this by caching the in-flight initialization promise in
global.fsDiagnostics.LOGGER_CLIENT before awaiting client.newLogsClient so other
callers await the same promise. Concretely: when entering the init, assign a
promise (that calls setupLogsExporter(), client.newLogsClient(...), then calls
overrideConsole(logger) and returns logger) to
global.fsDiagnostics.LOGGER_CLIENT immediately; await that promise to get the
final logger; on rejection ensure you clear global.fsDiagnostics.LOGGER_CLIENT
so future attempts can retry. Use the existing symbols client.newLogsClient,
setupLogsExporter, overrideConsole and global.fsDiagnostics.LOGGER_CLIENT to
implement this change.
---
Outside diff comments:
In `@packages/api/src/observability/telemetry.ts`:
- Around line 21-24: The wrapper currently ends spans and catches errors
synchronously which truncates async spans and misses promise rejections; change
the wrapper around fn(...) so it examines the return value and handles both sync
and async paths: call const result = fn(source, vars, graphqlContext, info)
inside your try block, then if result is a Promise (e.g., result && typeof
result.then === 'function') attach result.then(...) to end the span and return
the resolved value, and attach a .catch(...) that calls
span.setStatus/recordException/console.error as needed and rethrows after ending
the span; for non-Promise results keep the existing try/catch/finally behavior
(ending the span in finally). Ensure you reference and update the span.end(),
span.setStatus(ERROR), span.recordException(...) and console.error(...) usage so
they run on promise rejection as well; preserve the function signature that
returns TReturn (which may be a Promise).
In `@packages/core/src/instrumentation.ts`:
- Around line 18-20: The catch in packages/core/src/instrumentation.ts currently
swallows the error; update the catch block in the OTEL initialization code (the
try/catch around instrumentation setup in instrumentation.ts) to include the
thrown error when logging (e.g., pass the caught error or its message/stack to
console.error or your logger) so the log reads the failure message plus the
actual error details.
- Line 9: Fix the typo in the console.log message that reads
'Instrumemtation.ts: Getting telemetry client' by updating the string to the
correct spelling 'Instrumentation.ts: Getting telemetry client' in the file
where that log is emitted (look for the exact log string to locate it).
---
Nitpick comments:
In `@packages/api/src/observability/telemetry.ts`:
- Line 26: Update the tracer name passed to OTELAPI.trace.getTracer from
'Graphql' to the standard 'GraphQL' to match ecosystem capitalization; locate
the getTracer call (const tracer = OTELAPI.trace.getTracer('Graphql')) and
change the literal to 'GraphQL' so trace UIs and instrumentation naming are
consistent.
In `@packages/components/src/molecules/Toast/Toast.tsx`:
- Line 8: Replace the explicit union type on the timeout ref with an
environment-agnostic inferred type: update the timeoutRef declaration in
Toast.tsx (the useRef for the timeout) to use ReturnType<typeof setTimeout> so
the type automatically matches the runtime (browser or Node) instead of manually
using number | NodeJS.Timeout.
In `@packages/core/src/instrumentation.ts`:
- Around line 4-22: The register() function unconditionally initializes OTEL in
Node runtimes; add an opt-out check (e.g. respect a FASTSTORE_OTEL_ENABLED or
OTEL_SDK_DISABLED env var or a config.analytics.otelEnabled flag) and return
early when disabled to avoid attempting to import/@faststore/diagnostics; update
the top of register() to bail out if the env/config indicates OTEL is disabled,
and only proceed to import getTelemetryClient and call it when enabled (preserve
existing try/catch and include the same return behavior using pkgJSON,
config.api.storeId, and getTelemetryClient).
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: bc7f7cf8-7fc1-4ed9-95c3-eeff6f2c33b8
⛔ Files ignored due to path filters (1)
docs/observability.mdis excluded by none and included by none
📒 Files selected for processing (13)
packages/api/src/observability/telemetry.tspackages/api/src/typings/globals.tspackages/components/src/molecules/Toast/Toast.tsxpackages/core/discovery.config.default.jspackages/core/next.config.jspackages/core/src/instrumentation.tspackages/core/src/server/options.tspackages/diagnostics/@types/global.d.tspackages/diagnostics/package.jsonpackages/diagnostics/src/globals.tspackages/diagnostics/src/logger.tspackages/diagnostics/src/start.tspackages/diagnostics/src/tracer.ts
💤 Files with no reviewable changes (1)
- packages/api/src/typings/globals.ts
🚧 Files skipped from review as they are similar to previous changes (5)
- packages/diagnostics/src/tracer.ts
- packages/diagnostics/src/globals.ts
- packages/diagnostics/package.json
- packages/diagnostics/src/start.ts
- packages/diagnostics/@types/global.d.ts
| OTELAPI.trace.setSpan( | ||
| OTELAPI.propagation.extract( | ||
| OTELAPI.context.active(), | ||
| graphqlContext.OTEL.__otelContext | ||
| ), | ||
| span | ||
| ) |
There was a problem hiding this comment.
🧩 Analysis chain
🌐 Web query:
@opentelemetry/api context.with vs trace.setSpan immutable context pattern
💡 Result:
In OpenTelemetry JS, the “immutable context” pattern applies to both choices: 1) api.context.with(ctx, callback) - context.with makes the provided Context the active one for the duration of callback execution, and returns the callback’s return value. [1] - It is scoped (so you don’t permanently change the global active context), and it requires a ContextManager to actually propagate the active context. [1] 2) trace.setSpan(context, span) - trace.setSpan (and the related context setValue pattern) returns a new Context instance that includes the given span; it does not mutate the input Context. This matches the general rule that Context is immutable and write operations create a new Context. [2][3] - The usual pattern is: create a new context with trace.setSpan(existingContext, span) and then make that new context active via context.with. (This is how to run code “in context of some span.”) [1][4] So what’s the practical difference? - trace.setSpan answers “which Context should contain this span?”—it produces a new (immutable) Context. [2][3] - context.with answers “for this code region (sync block), which Context is active?”—it activates the provided Context only during callback execution. [1] A canonical pattern (JS) - Create/obtain a Context (often context.active). - Call trace.setSpan(ctx, span) to get a new Context that contains that span (immutable update). [2] - Call context.with(newCtx, => {... }) so that context.active inside the callback is the new context (scoped activation). [1] Key takeaway - There is no conflict between “context.with” and “trace.setSpan immutable context pattern”: trace.setSpan is the immutable update step; context.with is the scoped activation step. [1][2] Notes on “startActiveSpan” - The SDK also provides tracer.startActiveSpan, which “starts a new Span and calls the given function … Additionally the new span gets set in context and this context is activated for the duration of the function call.” [5] This is essentially a convenience wrapper around the same two concerns (set span into a context, then activate it for the callback duration). [5]
Citations:
- 1: https://opentelemetry.io/docs/instrumentation/js/api/context/
- 2: https://opentelemetry.io/docs/specs/otel/context/
- 3: https://opentelemetry.io/docs/languages/js/context/
- 4: how to set the current active context and current active span without callbacks open-telemetry/opentelemetry-js#3558
- 5: https://open-telemetry.github.io/opentelemetry-js/interfaces/_opentelemetry_sdk-node._opentelemetry_api.Tracer.html
🏁 Script executed:
find . -type f -name "telemetry.ts" -path "*/observability/*" | head -5Repository: vtex/faststore
Length of output: 104
🏁 Script executed:
cat -n ./packages/api/src/observability/telemetry.ts | sed -n '30,65p'Repository: vtex/faststore
Length of output: 1041
🏁 Script executed:
cat -n ./packages/api/src/observability/telemetry.ts | sed -n '1,60p'Repository: vtex/faststore
Length of output: 2101
setSpan() return value is discarded — propagated context is never activated.
setSpan returns a new immutable Context; the extracted parent context is computed and immediately thrown away. As written, fn(...) runs under the pre-existing active context, not the one carrying the parent span. Use context.with(newCtx, () => fn(...)) to activate the context so child operations inherit the parent.
Additionally, if fn() can return a Promise, span?.end() fires immediately in the finally block before resolution completes—the span will close before async operations finish.
🔧 Suggested fix
- OTELAPI.trace.setSpan(
- OTELAPI.propagation.extract(
- OTELAPI.context.active(),
- graphqlContext.OTEL.__otelContext
- ),
- span
- )
-
- try {
- return fn(source, vars, graphqlContext, info)
- } catch (error: any) {
+ const parentCtx = OTELAPI.propagation.extract(
+ OTELAPI.context.active(),
+ graphqlContext.OTEL.__otelContext
+ )
+ const ctx = OTELAPI.trace.setSpan(parentCtx, span)
+
+ try {
+ return OTELAPI.context.with(ctx, () => fn(source, vars, graphqlContext, info))
+ } catch (error: any) {🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@packages/api/src/observability/telemetry.ts` around lines 38 - 44, The
current code discards the Context returned by OTELAPI.trace.setSpan and never
activates it, so child operations don't inherit the parent span; capture the
returned Context (e.g., newCtx = OTELAPI.trace.setSpan(...)) and run the wrapped
work via OTELAPI.context.with(newCtx, () => ... ) so the extracted parent is
actually activated; also ensure span?.end() runs after fn completes by awaiting
the result if fn returns a Promise (e.g., run fn inside the context, await its
result in a try/finally, then end the span in finally) and return the awaited
result so callers receive the original return value.
| import { | ||
| getTelemetryClient, | ||
| getTraceClient, | ||
| OTELAPI, | ||
| } from '@faststore/diagnostics' |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
rg -nP "\bgetTraceClient\b" --type=ts -C2Repository: vtex/faststore
Length of output: 825
Remove unused import getTraceClient.
It's imported but never used in this file, which will trigger Biome/lint. Safe cleanup.
Suggested fix
import {
getTelemetryClient,
- getTraceClient,
OTELAPI,
} from '@faststore/diagnostics'📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| import { | |
| getTelemetryClient, | |
| getTraceClient, | |
| OTELAPI, | |
| } from '@faststore/diagnostics' | |
| import { | |
| getTelemetryClient, | |
| OTELAPI, | |
| } from '@faststore/diagnostics' |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@packages/core/src/server/options.ts` around lines 2 - 6, The import list in
options.ts includes an unused symbol getTraceClient which triggers lint errors;
edit the import from '@faststore/diagnostics' to remove getTraceClient and keep
only getTelemetryClient and OTELAPI so the file only imports used symbols
(ensure any other references to getTraceClient in this file are also removed or
replaced if present).
| export async function withTraceClient<T extends APIOptions = typeof apiOptions>( | ||
| apiOptions: T | ||
| ): Promise<T> { | ||
| const OTEL = {} | ||
| getTraceClient( | ||
| apiOptions?.discoveryConfig?.analytics?.serviceName ?? name | ||
| )?.inject(OTEL) | ||
| // Safe guard in dev mode to prevent the | ||
| // global scope to be erased in hot-module-reload. | ||
| await getTelemetryClient({ | ||
| serviceName: storeConfig.analytics?.serviceName ?? 'faststore', | ||
| version, | ||
| account: storeConfig.api.storeId, | ||
| clientName: storeConfig.api.storeId, | ||
| packageName: name, | ||
| }) | ||
|
|
||
| return { | ||
| const options = { | ||
| ...apiOptions, | ||
| OTEL: { | ||
| ...OTEL, | ||
| enabled: storeConfig.analytics?.otelEnabled?.toString() === 'true', | ||
| __otelContext: {}, | ||
| enabled: true, | ||
| }, | ||
| } as T | ||
| } satisfies T | ||
|
|
||
| const tracer = OTELAPI.trace.getTracer('@faststore/core') | ||
| const span = tracer.startSpan('@faststore/core graphql') | ||
|
|
||
| const context = OTELAPI.trace.setSpan(OTELAPI.context.active(), span) | ||
| OTELAPI.propagation.inject(options.OTEL.__otelContext, context) | ||
|
|
||
| try { | ||
| return options as T | ||
| } catch (error) { | ||
| span?.setStatus({ code: OTELAPI.SpanStatusCode.ERROR }) | ||
| span?.recordException(error) | ||
| } finally { | ||
| span?.end() | ||
| } | ||
| } |
There was a problem hiding this comment.
🧩 Analysis chain
🌐 Web query:
@opentelemetry/api propagation.inject carrier type and recommended pattern for request-scoped parent spans
💡 Result:
In @opentelemetry/api, propagation.inject takes: 1) context: @opentelemetry/api.Context 2) carrier: a generic “Carrier” type (depends on the propagator); for TextMap propagators it’s typically a mutable key/value map (commonly an object used like HTTP headers) 3) optional setter: TextMapSetter (defaults to defaultTextMapSetter) So the “carrier type” is not a single fixed TS type in the API—it is whatever Carrier type the configured TextMapPropagator expects; the API treats it as a generic parameter and passes it to the propagator’s inject method. [1][2][3] Recommended pattern for request-scoped parent spans (i.e., for making child spans use the inbound parent): A) Extract on the receiving service at the request entry point, using the incoming request metadata (for HTTP: headers). This yields an extracted Context; spans started/created within that context will use the extracted SpanContext as parent. [4][5][6] B) For outbound calls from that request scope, inject the current active context into the outbound carrier (for HTTP: headers) immediately before the call. This propagates trace/span IDs so the next service can extract them and continue the same trace. [7][5] Concrete JS-style pattern (matches the OpenTelemetry JS docs examples): // Sending side: inject current context into a mutable carrier (e.g., headers object) propagation.inject(context.active, output); [7][1] // Receiving side: extract context from incoming carrier (e.g., parsed headers) const activeContext = propagation.extract(context.active, req.headers); [7][5] // Then start spans using that context as current so they become children of the extracted parent. context.with(activeContext, => { // startSpan(...) }); [8] Notes: - Carriers used at inject time are expected to be mutable. [9][2] - The TextMapPropagator injects/extracts string key/value pairs (encoded like HTTP headers). [1][2] - Instrumentation should call inject on outbound requests and extract on inbound requests; the propagators API is the manual mechanism when you must do it yourself. [10][5] - For request-scoped parent spans across async code in JS, ensure you have a proper Context Manager; otherwise api.context.active may always be ROOT_CONTEXT. [8]
Citations:
- 1: https://open-telemetry.github.io/opentelemetry-js/interfaces/_opentelemetry_api._opentelemetry_api.TextMapPropagator.html
- 2: https://opentelemetry.io/docs/specs/otel/context/api-propagators/
- 3: https://github.com/open-telemetry/opentelemetry-js/blob/08f597f3a3d71a4852b0afbba120af15ca038121/api/src/api/propagation.ts
- 4: https://opentelemetry.io/docs/concepts/context-propagation/
- 5: http://opentelemetry.io/docs/concepts/context-propagation/
- 6: https://oneuptime.com/blog/post/2026-01-24-opentelemetry-span-context-propagation/view
- 7: https://opentelemetry.io/docs/languages/js/propagation/
- 8: https://opentelemetry.io/docs/instrumentation/js/api/context/
- 9: https://github.com/open-telemetry/opentelemetry-specification/blob/v1.6.1/specification/context/api-propagators.md
- 10: http://opentelemetry.io/docs/specs/otel/context/api-propagators/
🏁 Script executed:
# First, verify the file exists and check the code at the specified lines
fd -type f -name "options.ts" -path "*/packages/core/src/server/*"Repository: vtex/faststore
Length of output: 229
🏁 Script executed:
# Read the file to verify the code matches the review snippet
# File: packages/core/src/server/options.ts, lines 36-71
cat -n packages/core/src/server/options.ts | sed -n '30,80p'Repository: vtex/faststore
Length of output: 1409
🏁 Script executed:
# Check if there's any usage of withTraceClient to understand context
rg "withTraceClient" --type ts --type tsx -B 2 -A 5Repository: vtex/faststore
Length of output: 86
🏁 Script executed:
# Fix the rg command - search for withTraceClient usage
rg "withTraceClient" -B 2 -A 5Repository: vtex/faststore
Length of output: 1594
🏁 Script executed:
# Also check imports at the top of the file to understand OTELAPI
head -30 packages/core/src/server/options.tsRepository: vtex/faststore
Length of output: 1091
🏁 Script executed:
# Check tsconfig.json for strict mode setting
find . -name "tsconfig.json" -type f | head -5Repository: vtex/faststore
Length of output: 228
🏁 Script executed:
# Look at tsconfig in the packages/core directory
cat packages/tsconfig.json 2>/dev/null || cat tsconfig.json 2>/dev/null | head -50Repository: vtex/faststore
Length of output: 634
🏁 Script executed:
# Check how GraphqlVtexContextFactory uses the options
rg "GraphqlVtexContextFactory" -B 3 -A 10Repository: vtex/faststore
Length of output: 12467
Span lifecycle ends prematurely, catch block lacks return in strict mode, and carrier type needs explicit definition.
Three stacked problems in this function:
-
Span ends before work begins.
span.end()fires in thefinallyblock immediately afterreturn optionsresolves, but the actual GraphQL work (including the propagation context extraction inGraphqlVtexContextFactory) happens later. Child spans created from__otelContextwon't have a live parent, defeating distributed tracing. The span should wrap the entire request lifecycle, not this initialization step. -
Catch block missing return violates
noImplicitReturns. With strict mode enabled, all code paths must explicitly return a value matchingPromise<T>. If an error is caught, the function falls through without returning, yieldingundefinedinstead. Either rethrow or remove the try/catch—there's no error recovery logic here. -
satisfies Tnarrows__otelContextto{}. When you mutate it withOTELAPI.propagation.inject(options.OTEL.__otelContext, context), TypeScript loses type info. Explicitly type it asRecord<string, string>to match what the propagation API expects.
🩹 Recommended fix (propagation only)
- const options = {
- ...apiOptions,
- OTEL: {
- __otelContext: {},
- enabled: true,
- },
- } satisfies T
-
- const tracer = OTELAPI.trace.getTracer('@faststore/core')
- const span = tracer.startSpan('@faststore/core graphql')
-
- const context = OTELAPI.trace.setSpan(OTELAPI.context.active(), span)
- OTELAPI.propagation.inject(options.OTEL.__otelContext, context)
-
- try {
- return options as T
- } catch (error) {
- span?.setStatus({ code: OTELAPI.SpanStatusCode.ERROR })
- span?.recordException(error)
- } finally {
- span?.end()
- }
+ const carrier: Record<string, string> = {}
+ OTELAPI.propagation.inject(OTELAPI.context.active(), carrier)
+
+ return {
+ ...apiOptions,
+ OTEL: {
+ __otelContext: carrier,
+ enabled: true,
+ },
+ } as TIf a request-scoped parent span is genuinely needed, lift startSpan and span.end() to the caller (e.g., the HTTP route handler) so it wraps the full await graphqlHandler(...) execution.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| export async function withTraceClient<T extends APIOptions = typeof apiOptions>( | |
| apiOptions: T | |
| ): Promise<T> { | |
| const OTEL = {} | |
| getTraceClient( | |
| apiOptions?.discoveryConfig?.analytics?.serviceName ?? name | |
| )?.inject(OTEL) | |
| // Safe guard in dev mode to prevent the | |
| // global scope to be erased in hot-module-reload. | |
| await getTelemetryClient({ | |
| serviceName: storeConfig.analytics?.serviceName ?? 'faststore', | |
| version, | |
| account: storeConfig.api.storeId, | |
| clientName: storeConfig.api.storeId, | |
| packageName: name, | |
| }) | |
| return { | |
| const options = { | |
| ...apiOptions, | |
| OTEL: { | |
| ...OTEL, | |
| enabled: storeConfig.analytics?.otelEnabled?.toString() === 'true', | |
| __otelContext: {}, | |
| enabled: true, | |
| }, | |
| } as T | |
| } satisfies T | |
| const tracer = OTELAPI.trace.getTracer('@faststore/core') | |
| const span = tracer.startSpan('@faststore/core graphql') | |
| const context = OTELAPI.trace.setSpan(OTELAPI.context.active(), span) | |
| OTELAPI.propagation.inject(options.OTEL.__otelContext, context) | |
| try { | |
| return options as T | |
| } catch (error) { | |
| span?.setStatus({ code: OTELAPI.SpanStatusCode.ERROR }) | |
| span?.recordException(error) | |
| } finally { | |
| span?.end() | |
| } | |
| } | |
| export async function withTraceClient<T extends APIOptions = typeof apiOptions>( | |
| apiOptions: T | |
| ): Promise<T> { | |
| // Safe guard in dev mode to prevent the | |
| // global scope to be erased in hot-module-reload. | |
| await getTelemetryClient({ | |
| serviceName: storeConfig.analytics?.serviceName ?? 'faststore', | |
| version, | |
| account: storeConfig.api.storeId, | |
| clientName: storeConfig.api.storeId, | |
| packageName: name, | |
| }) | |
| const carrier: Record<string, string> = {} | |
| OTELAPI.propagation.inject(OTELAPI.context.active(), carrier) | |
| return { | |
| ...apiOptions, | |
| OTEL: { | |
| __otelContext: carrier, | |
| enabled: true, | |
| }, | |
| } as T | |
| } |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@packages/core/src/server/options.ts` around lines 36 - 71, The
withTraceClient function currently starts and ends a span inside initialization
(span in withTraceClient) which dies before request handling (so child spans
from GraphqlVtexContextFactory lose their parent), has a try/catch that can fall
through without returning in strict mode, and narrows options.OTEL.__otelContext
to {} which breaks propagation typing; fix by: stop starting/ending a
request-scoped span inside withTraceClient (move startSpan/span.end to the
HTTP/graphql request handler so the span wraps the full request lifecycle),
ensure the function always returns or rethrows in the catch (e.g., remove the
try/catch or rethrow the error after recording it in span), and change the OTEL
carrier declaration to an explicit Record<string,string> (set
options.OTEL.__otelContext type to Record<string,string>) so
OTELAPI.propagation.inject accepts it.
| export async function getLogger( | ||
| client: TelemetryClient, | ||
| opt: { serviceName: string; client: string } | ||
| ) { | ||
| if (global.fsDiagnostics.LOGGER_CLIENT) | ||
| return global.fsDiagnostics.LOGGER_CLIENT | ||
|
|
||
| const logger = await client.newLogsClient({ | ||
| exporter: await setupLogsExporter(), | ||
| }) | ||
|
|
||
| overrideConsole(logger, opt) |
There was a problem hiding this comment.
The singleton console proxy hard-codes the first caller’s metadata.
The proxy closes over the first opt.serviceName / opt.client, and later getLogger(...) calls just return the cached singleton. If different services initialize diagnostics in the same process, every forwarded console entry is attributed to the first one.
Also applies to: 45-76
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@packages/diagnostics/src/logger.ts` around lines 27 - 38, The current
getLogger implementation caches a single logger in
global.fsDiagnostics.LOGGER_CLIENT which captures and reuses the first call's
opt.serviceName/opt.client for all subsequent callers; change this to cache
loggers per unique caller (e.g., key = `${opt.serviceName}:${opt.client}`)
instead of a single global value so each service/client pair gets its own
LogsClient and console proxy. Update getLogger to look up and return from a map
(or object) stored on global.fsDiagnostics (instead of LOGGER_CLIENT), create a
new logger and call overrideConsole(logger, opt) only for the missing key, and
ensure overrideConsole is passed the current opt so metadata is not closed over
by the first initialization. Ensure references to getLogger,
global.fsDiagnostics.LOGGER_CLIENT, overrideConsole, opt.serviceName and
opt.client are updated accordingly.
| if (global.fsDiagnostics.LOGGER_CLIENT) | ||
| return global.fsDiagnostics.LOGGER_CLIENT | ||
|
|
||
| const logger = await client.newLogsClient({ | ||
| exporter: await setupLogsExporter(), | ||
| }) | ||
|
|
||
| overrideConsole(logger, opt) | ||
|
|
||
| global.fsDiagnostics.LOGGER_CLIENT ??= logger |
There was a problem hiding this comment.
Cache the in-flight logger initialization.
Concurrent callers can all pass the LOGGER_CLIENT check before it is assigned, so this can create multiple exporters/clients and wrap globalThis.console more than once. That leads to duplicated log writes and double-formatted console output.
Suggested change
+let loggerPromise: Promise<LogClient> | undefined
+
export async function getLogger(
client: TelemetryClient,
opt: { serviceName: string; client: string }
) {
if (global.fsDiagnostics.LOGGER_CLIENT)
return global.fsDiagnostics.LOGGER_CLIENT
-
- const logger = await client.newLogsClient({
- exporter: await setupLogsExporter(),
- })
-
- overrideConsole(logger, opt)
-
- global.fsDiagnostics.LOGGER_CLIENT ??= logger
-
- return logger
+ if (!loggerPromise) {
+ loggerPromise = (async () => {
+ const logger = await client.newLogsClient({
+ exporter: await setupLogsExporter(),
+ })
+
+ overrideConsole(logger, opt)
+ global.fsDiagnostics.LOGGER_CLIENT = logger
+
+ return logger
+ })().catch((error) => {
+ loggerPromise = undefined
+ throw error
+ })
+ }
+
+ return loggerPromise
}🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@packages/diagnostics/src/logger.ts` around lines 31 - 40, Multiple concurrent
callers can pass the initial global.fsDiagnostics.LOGGER_CLIENT check and each
start creating exporters/clients; fix this by caching the in-flight
initialization promise in global.fsDiagnostics.LOGGER_CLIENT before awaiting
client.newLogsClient so other callers await the same promise. Concretely: when
entering the init, assign a promise (that calls setupLogsExporter(),
client.newLogsClient(...), then calls overrideConsole(logger) and returns
logger) to global.fsDiagnostics.LOGGER_CLIENT immediately; await that promise to
get the final logger; on rejection ensure you clear
global.fsDiagnostics.LOGGER_CLIENT so future attempts can retry. Use the
existing symbols client.newLogsClient, setupLogsExporter, overrideConsole and
global.fsDiagnostics.LOGGER_CLIENT to implement this change.
Co-authored-by: Fanny Chien <fanny.chien@vtex.com>
92e3cc5 to
ffde67e
Compare
| - `@faststore_account_name: required_trace_account_name` | ||
| - `@faststore_environment:production|development` | ||
|
|
||
|
|
Co-authored-by: Fanny Chien <fanny.chien@vtex.com>

Summary by CodeRabbit
New Features
Improvements
Dependencies