Tooling: call-site allocation profiler for the rt_mmap allocator #618

Description

@ehartford

Motivation

Root-causing a frontend memory blowup (with check src/main.w peaked at 4.76 GB; a trivial file peaks at ~25 MB) needed a way to attribute allocations to a line of code, and none of the existing tooling worked at scale.

The bug was ultimately localized by committed-bytes-delta instrumentation: a temporary with_alloc_committed_bytes() accessor bracketing each comptime-eval phase (read committed bytes before and after, accumulate the deltas plus a call count, print). That pinned the growth to prepare_comptime_eval_copy (src/Sema.w), which deep-cloned the full ~8.5 MB compiler source text on every one of ~239 comptime evals: ~8.5 MB × ~239 ≈ 2 GB of dead copies. Fixed by sharing the read-only source vectors; check src/main.w now peaks at 2.51 GB, fixpoint byte-identical.

That manual bracketing works but is per-investigation and only localizes to a phase, not a call site. with_alloc_committed_bytes() is now a permanent rt primitive (see docs/debug-allocator.md) so the interim technique needs no throwaway runtime edit — but we still want a real call-site profiler.
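
For reference, the bracketing technique described above can be sketched roughly as follows in C. This is a minimal illustration, not the code used in the investigation: committed_bytes() and the fake_committed counter are stand-ins for the runtime accessor (whose real signature is not shown here), and phase_stat, phase_begin, phase_end and phase_print are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Stand-in for the runtime's committed-bytes accessor. The real accessor's
   signature may differ; this counter is purely illustrative. */
uint64_t fake_committed = 0;
uint64_t committed_bytes(void) { return fake_committed; }

/* Running totals for one manually chosen phase. */
typedef struct {
    const char *name; /* label printed in the report */
    uint64_t calls;   /* completed begin/end pairs */
    int64_t delta;    /* net committed-bytes growth over all calls */
    uint64_t start;   /* committed bytes at the latest phase_begin */
} phase_stat;

/* Read the committed-bytes counter on entry to the phase. */
void phase_begin(phase_stat *p) {
    assert(p != NULL);
    p->start = committed_bytes();
}

/* Read it again on exit and accumulate the difference and a call count. */
void phase_end(phase_stat *p) {
    assert(p != NULL);
    p->delta += (int64_t)(committed_bytes() - p->start);
    p->calls += 1;
}

/* Print the accumulated totals, e.g. once at exit. */
void phase_print(const phase_stat *p) {
    fprintf(stderr, "%s: calls=%llu delta=%lld bytes\n", p->name,
            (unsigned long long)p->calls, (long long)p->delta);
}
```

A phase is bracketed by calling phase_begin before it and phase_end after it, then phase_print once at the end of the run. The limitation noted above is visible in the sketch: the growth is attributed to whatever the brackets enclose, so narrowing it down means moving the brackets by hand.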

What exists and why each is inadequate

  • --debug-alloc / debug allocator — coarse origin tags + leak detection, but a fixed-size ledger that overflows at this scale (debug-alloc: ledger full, tracking truncated). Reports leaks, not a call-site/bytes breakdown.
  • committed-bytes-delta bracketing (the interim that cracked this bug) — localizes to a manually-chosen phase, not automatically to a call site; requires hand-editing the phase brackets each time.
  • macOS sample — CPU sampling, not allocation.
  • --stats — no memory/allocation reporting.
  • lldb breakpoint per allocation — too slow at high frequency (timed out at 80k ignore-count).
  • OS allocator tools (Instruments, malloc_history, leaks) — can't see rt_mmap; our allocator bypasses malloc.
  • Constraint: the runtime deliberately avoids in-process frame-pointer backtraces; at -O1 frame pointers may be omitted, so a naive runtime fp-walk is unreliable.

Ask

A call-site allocation profiler for the rt_mmap/rt_alloc allocator:

  • Aggregates bytes + counts by call site into a bounded hashtable (survives millions of allocations — unlike the per-allocation ledger).
  • Gated by an env var (e.g. WITH_ALLOC_PROFILE=1) so it is zero-cost when off.
  • Dumps the top allocators by total bytes at exit, symbolized to file:line.
  • Call-site capture approaches to evaluate: DWARF-based unwinding, a sampled statistical allocation profiler, or a reliable bounded frame walk that stops at the first non-runtime frame.
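
To make the aggregation part of the ask concrete, here is a rough C sketch of a bounded, fixed-size table keyed by call-site address and gated by the env var. It is an assumption-laden illustration, not a proposed implementation: every prof_* name, the table size and the probe limit are hypothetical; the call-site address is passed in by the caller, so the open question of how to capture it is left untouched; and it is not thread-safe. When the gate is off the cost is one cached branch per allocation rather than strictly zero.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define PROF_SLOTS 4096u   /* hypothetical capacity; must be a power of two */
#define PROF_MAX_PROBE 64u /* bound on linear probing before giving up */

typedef struct {
    uintptr_t site; /* call-site address; 0 marks an empty slot */
    uint64_t bytes; /* total bytes requested from this site */
    uint64_t count; /* number of allocations from this site */
} prof_slot;

static prof_slot prof_table[PROF_SLOTS];
static uint64_t prof_dropped; /* allocations that found no slot */
static int prof_state = -1;   /* -1 not yet checked, 0 off, 1 on */

/* Check the env var once and cache the answer. */
int prof_enabled(void) {
    if (prof_state < 0) {
        const char *v = getenv("WITH_ALLOC_PROFILE");
        prof_state = (v != NULL && strcmp(v, "1") == 0);
    }
    return prof_state;
}

/* Add one allocation to its call site's totals. Memory use stays fixed:
   when no slot is found within the probe bound, the allocation is only
   counted as dropped. */
void prof_record(uintptr_t site, uint64_t bytes) {
    if (!prof_enabled()) return;
    assert(site != 0);
    uint64_t h = (uint64_t)site * 0x9E3779B97F4A7C15ull;
    size_t i = (size_t)(h >> 40) & (PROF_SLOTS - 1);
    for (unsigned probe = 0; probe < PROF_MAX_PROBE; probe++) {
        prof_slot *s = &prof_table[(i + probe) & (PROF_SLOTS - 1)];
        if (s->site == 0) s->site = site; /* claim an empty slot */
        if (s->site == site) {
            s->bytes += bytes;
            s->count += 1;
            return;
        }
    }
    prof_dropped += 1;
}

static int by_bytes_desc(const void *a, const void *b) {
    const prof_slot *x = a, *y = b;
    return (x->bytes < y->bytes) - (x->bytes > y->bytes);
}

/* Copy up to n call sites, largest total bytes first, into out. */
size_t prof_top(prof_slot *out, size_t n) {
    static prof_slot tmp[PROF_SLOTS];
    size_t used = 0;
    for (size_t i = 0; i < PROF_SLOTS; i++)
        if (prof_table[i].site != 0) tmp[used++] = prof_table[i];
    qsort(tmp, used, sizeof tmp[0], by_bytes_desc);
    if (n > used) n = used;
    memcpy(out, tmp, n * sizeof tmp[0]);
    return n;
}

/* Print the top n (at most 32) sites as raw addresses. */
void prof_dump(size_t n) {
    prof_slot top[32];
    if (n > 32) n = 32;
    n = prof_top(top, n);
    for (size_t i = 0; i < n; i++)
        fprintf(stderr, "%#llx bytes=%llu count=%llu\n",
                (unsigned long long)top[i].site,
                (unsigned long long)top[i].bytes,
                (unsigned long long)top[i].count);
    if (prof_dropped)
        fprintf(stderr, "dropped=%llu\n", (unsigned long long)prof_dropped);
}
```

The dump prints raw addresses; turning them into file:line would be a separate step (in-process or offline), which this sketch does not attempt. Reporting the dropped count keeps the output honest when the table fills, in contrast to silently truncating.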

Payoff

Pinpoint allocation hot spots directly and verify memory fixes quantitatively instead of by rebuild-and-remeasure. Reusable across the compiler.

Reference: docs/debug-allocator.md, docs/deep-debugging-tools.md. Related: the source-text fix (comptime eval), the --emit-c self-compile OOM (#619).
