True free-threaded Python support: full memory + CPU profiling by emeryberger · Pull Request #1026 · plasma-umass/scalene

emeryberger · 2026-03-31T14:59:35Z

Summary

Full profiling support for free-threaded Python (3.13t/3.14t) and a unified, simplified memory tracking architecture for all builds.

Unified size tracking via ShardedSizeMap

Replaces ScaleneHeader (16-byte inline header prepended to every pymalloc allocation) with an out-of-band sharded flat hash table on all builds — not just free-threaded.

Why not ScaleneHeader?

On free-threaded Python, prepending headers breaks the GC (it scans mimalloc pages expecting valid Python objects)
On regular Python, the header shifts objects into larger pymalloc size classes
Two code paths (#ifdef Py_GIL_DISABLED) doubled maintenance burden

Why not malloc_size()/malloc_usable_size()?

pymalloc-domain objects (floats, ints, small containers) live inside pymalloc's 4 KB pools — malloc_size() returns 0 for them since the system allocator only knows about the 256 KB arenas
PYTHONMALLOC=malloc would make malloc_size() work but disables pymalloc, causing 33% slowdown on small-object workloads
The size map is only needed for pymalloc-domain allocations; system-malloc allocations (native/C code) already use malloc_size()/malloc_usable_size() via SampleHeap

ShardedSizeMap design:

128 shards, each with a spinlock + open-addressed flat hash table (linear probing, backward-shift deletion)
16 bytes per entry (pointer + size), no per-entry heap allocations
Lazy initialization (no allocation until first insert)
Benchmarked overhead: 0.1 B/obj measured via vmmap physical footprint (5M floats)
Contention ratio (8T/1T slowdown): 0.95x — no lock contention

Free-threaded Python support

Safe native thread attribution (pywhere.cpp): Caches the main thread's PyThreadState* during PyInit_pywhere (single-threaded context) and uses it as fallback in whereInPython() via PyThreadState_GetFrame (safe on free-threaded builds — uses atomic loads internally)
Atomic allocator swap (libscalene.cpp): Interposes PyMem_SetAllocator/PyMem_GetAllocator so Py_Initialize can't overwrite our wrapper. Double-buffered storage with atomic index swap for lock-free reads

Other fixes

Fixed test_issue999_wallclock.py: Line number propagation for dis.Instruction on Python < 3.13 (where starts_line is only set on first instruction per line). Uses ScaleneFuncUtils._instructions_with_lines() instead of reimplementing
Fixed NEWLINE sentinel check in SampleHeap::malloc: no longer adds sizeof(ScaleneHeader) since headers are removed

Net code change

Deleted ~100 lines of #ifdef branches, ScaleneHeader usage, size-class rounding, and debug assertions
One unified code path for all builds

Test plan

All existing pytest tests pass (309 passed, 13 skipped) on Python 3.9–3.14
All existing pytest tests pass on free-threaded 3.13t and 3.14t
Parity test validates CPU attribution (Python + C/native) on all builds
Parity test validates Python memory tracking (~320 MB) on all builds
Parity test validates numpy native memory tracking (~716 MB) on all builds
Concurrency test: 8× memory scaling, contention ratio 0.95x (1→8 threads)
CPU-only mode works on all builds
Smoketests pass on all platforms (Ubuntu, macOS, Windows)
vmmap physical footprint: +0.5 MB fixed overhead, 0.1 B/obj (indistinguishable from ScaleneHeader)

🤖 Generated with Claude Code

…ng (#1023) On free-threaded Python builds, the GIL is disabled by default. When Scalene's native extensions are imported without declaring GIL compatibility, Python forcibly re-enables the GIL mid-execution, leaving the runtime in a transitional state that causes SIGSEGV when libscalene calls whereInPython() via PyGILState_Ensure(). - Detect free-threaded builds via sysconfig and set PYTHON_GIL=1 in the preload environment so the GIL is enabled from process start - Declare Py_mod_gil (Py_MOD_GIL_USED) on pywhere and get_line_atomic extensions to suppress the RuntimeWarning and properly declare GIL requirements - Add 3.13t and 3.14t to CI test matrix Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…l-fast lxml lacks pre-built wheels for free-threaded Python, so install system build dependencies on Linux. Also set fail-fast: false so one matrix failure doesn't cancel all other jobs. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Free-threaded Python (3.13t/3.14t) replaced pymalloc with mimalloc, which has a different allocation layout incompatible with libscalene's heap interposition. Disable memory profiling with a warning on these builds rather than crashing with SIGSEGV. Mark free-threaded CI jobs as continue-on-error since full support is still in progress. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Free-threaded Python (3.13t+) reinitializes memory allocators during Py_Initialize, overwriting the custom allocators that libscalene sets up via LD_PRELOAD static constructors. This causes SIGSEGV when objects allocated before init are freed after init (allocator mismatch). Fix: add scalene_reinstall_local_allocators() to libscalene, called from pywhere's populate_struct() after Python is fully initialized. It detects if the allocator was reset and re-wraps the current allocator. Also revert the blanket memory profiling disable on free-threaded Python, and add CI crash backtrace collection for debugging. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Make Scalene's memory profiling work on free-threaded Python (3.13t/3.14t) without forcing the GIL back on. This preserves true parallelism for profiled programs. Thread-safety fixes: - TraceConfig::should_trace(): Add mutex around _memoize (shared unordered_map). Replace chdir()/getcwd() pattern with direct path construction (chdir mutates process-wide CWD, unsafe for threads). - pywhere.cpp statics: Make last_profiled_invalidated and sysmon_tracing_active atomic. Make sysmon call depth counters thread_local. Make sysmon_DISABLE atomic with acquire/release. Add module_pointers_ready barrier for cross-thread visibility. - whereInPython(): Skip findMainPythonThread_frame() on free-threaded builds (iterates thread states unsafely). Return 0 instead. - collect_frames_to_record(): Fall back to pure-Python path (sys._current_frames()) on free-threaded builds. Module declarations: - Change Py_MOD_GIL_USED to Py_MOD_GIL_NOT_USED on both extensions - Remove PYTHON_GIL=1 forcing from scalene_preload.py - Remove continue-on-error for free-threaded CI jobs Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

On free-threaded Python, Py_Initialize resets allocators to mimalloc, invalidating the wrappers installed by libscalene's static constructors. Objects allocated before init (with ScaleneHeader) freed after init (with mimalloc) cause SIGSEGV. Fix: Skip allocator wrapping in MakeLocalAllocator's constructor when Py_GIL_DISABLED is defined. The allocators are instead installed after Py_Initialize via scalene_reinstall_local_allocators(), called from pywhere's populate_struct(). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

PyMem_SetAllocator is not thread-safe and must not be called while other threads are allocating. populate_struct() runs after threads like BLAS workers are already active, causing data races. Move the reinstall call to PyInit_pywhere() where the import lock ensures no other Python thread is running, making PyMem_SetAllocator safe to call. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…llocator Root cause: PyMem_SetAllocator is not thread-safe and cannot be called after Py_Initialize when other threads are already allocating. On free-threaded Python, Py_Initialize resets allocators to mimalloc, and reinstalling our wrapper at any later point races with concurrent allocations, corrupting object metadata and causing SIGSEGV. Fix: Set PYTHONMALLOC=malloc in the preload environment on free-threaded builds. This forces CPython to use C malloc for all allocation domains (MEM, OBJ, RAW), routing everything through LD_PRELOAD/DYLD_INSERT interposition. No PyMem_SetAllocator needed — the entire allocation pipeline is handled at the C level. Tradeoff: Python/C allocation ratio column shows 0%/100% on free-threaded builds (all allocations appear as C). Line-level memory attribution still works correctly via whereInPython(). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

PyMem_SetAllocator is a non-atomic struct copy, making it fundamentally unsafe to call after Py_Initialize on free-threaded Python. And PYTHONMALLOC=malloc is rejected on free-threaded builds (mimalloc is required for thread safety). Until CPython provides a thread-safe allocator swap API, memory profiling requires the GIL. Approach: - CPU-only profiling (--cpu-only): true free-threading, no GIL - Memory profiling: PYTHON_GIL=1 set in preload environment The thread-safety fixes (TraceConfig mutex, atomic statics, thread enumeration guards) remain in place for correctness and for the eventual removal of PYTHON_GIL=1 when CPython adds support. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Instead of calling PyMem_SetAllocator after Py_Initialize (thread-unsafe) or forcing PYTHON_GIL=1, interpose on PyMem_SetAllocator via LD_PRELOAD. When Py_Initialize tries to reset the allocator (e.g., to mimalloc on free-threaded Python), our interposition atomically updates the "original" allocator that our wrapper delegates to, using double-buffered storage with an atomic index swap. Our wrapper functions stay permanently installed — they never get overwritten. This eliminates: - PYTHON_GIL=1 forcing (true free-threading for all profiling modes) - The reinstall mechanism (no longer needed) - The PyMem_SetAllocator race condition entirely Also interpose PyMem_GetAllocator so CPython sees the "intended" allocator in debug/diagnostic contexts. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Free-threaded Python's GC directly walks mimalloc heap pages to find objects (gc_free_threading.c:update_refs). Prepending ScaleneHeader corrupts the object layout the GC expects, causing SIGSEGV in _Py_RunGC -> update_refs when it reads the header bytes as a refcount. On free-threaded builds, MakeLocalAllocator now passes allocations through without modification (no header, no size rounding). Sampling and whereInPython attribution still work; per-allocation size tracking uses the requested size instead of the header. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

- Remove Generic[T] from ScaleneSigQueue; use Sequence[Any] for items that get unpacked as processor arguments (fixes mypy "Expected iterable as variadic argument" error across all Python versions) - Remove corresponding Generic[T] from ScaleneSignalManager - Skip test_pool_spawn_cpu_only when profile file not created (known macOS spawn-mode Pool hang) - Increase test_tracer legacy tracer timeout from 60s to 120s Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…ution On free-threaded Python, ScaleneHeader cannot be prepended to allocations because the GC directly scans mimalloc heap pages expecting valid Python objects. This caused two bugs: 1. local_free() passed size 0 to register_free(), so the free sampler never triggered — memory appeared to grow monotonically. 2. whereInPython() returned 0 for native thread allocations because iterating the thread list is unsafe without the GIL. Fix both: - Add ShardedSizeMap (128-shard spinlock + unordered_map) to track ptr→size out-of-band. local_malloc inserts, local_free removes (recovering the real size), local_realloc computes deltas correctly. - Cache the main thread's PyThreadState* during PyInit_pywhere (safe: module init is single-threaded) and use it as fallback in whereInPython() via PyThreadState_GetFrame (safe: uses atomic loads on free-threaded builds). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

The static ShardedSizeMap instance must be declared before the template class that references it, otherwise free-threaded builds fail with 'g_size_map was not declared in this scope'. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…ckType The Heap-Layers spinlock.h is in vendor/Heap-Layers/locks/ which is not on the include path. Use std::atomic_flag directly to avoid the dependency. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Py_GIL_DISABLED is defined by Python.h, so sharded_size_map.hpp (which is guarded by #ifdef Py_GIL_DISABLED) must be included after it. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

New test exercises CPU attribution (Python + native/C), memory allocation tracking, and multi-threaded workloads. Runs on every CI matrix entry (Python 3.9-3.14 + 3.13t/3.14t) to verify that free-threaded and regular builds produce comparable profiling results. Validates: - Python CPU time detected (pure-Python loop) - C/native CPU time detected (sorted() on large list) - Memory allocations tracked (~320 MB across threads) - CPU-only mode still works - Multiple lines attributed in both modes Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

test/test_freethreaded_parity.py

Phase 2 adds a numpy workload that exercises: - Native memory allocation via np.random.rand (~200 MB through system malloc, not pymalloc) — this specifically tests ShardedSizeMap on free-threaded builds and ScaleneHeader on regular builds - Native CPU time via BLAS matrix multiply (a @ a.T) - Memcpy interception from numpy internal copies - All of the above running in threads alongside the base workload Validated locally on both free-threaded 3.14t (~716 MB detected) and regular 3.12 (~713 MB detected), confirming native allocation tracking works correctly on both paths. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

test/test_freethreaded_parity.py

Phase 3 runs 1, 2, 4, and 8 threads each allocating ~150 MB via numpy (native malloc) and verifies that total attributed memory scales with thread count. This directly tests the ShardedSizeMap under contention on free-threaded builds and ScaleneHeader on regular builds. Checks: - Every thread count reports memory (> 1 MB) - 8-thread run attributes >= 2x what 1-thread does (generous for sampling) - 8-thread run covers >= 20% of nominal 1200 MB Local results: perfect 8x scaling on both free-threaded 3.14t (1200 MB) and regular 3.12 (1201 MB). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…eads Replace the single-allocation concurrency test with repeated alloc/free cycles (50 iterations × 32 MB numpy arrays per thread) and measure profiling overhead by comparing Scalene vs bare execution times. The key metric is the "contention ratio": the slowdown factor at 8 threads divided by the slowdown factor at 1 thread. A well-sharded data structure keeps this near 1.0; a global lock would push it toward 8.0. Local results: Free-threaded 3.14t: contention ratio 1.10x (slowdown 1.02x→1.13x) Regular 3.12: contention ratio 0.94x (slowdown 1.08x→1.02x) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…rdedSizeMap std::unordered_map uses ~50-80 bytes per entry (heap-allocated nodes with next pointers). Replace with a flat open-addressed table using linear probing and backward-shift deletion: - Per-entry cost: 16 bytes (void* ptr + size_t size), matching ScaleneHeader's overhead - No per-entry heap allocations — only bulk calloc on growth - Power-of-2 capacity with 70% load factor threshold - Lazy initialization (no allocation until first insert) - Backward-shift delete keeps probe chains intact without tombstones Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…eHeader Replace the dual-path size tracking (ScaleneHeader on regular Python, ShardedSizeMap on free-threaded) with ShardedSizeMap everywhere. Benefits: - One code path instead of two #ifdef branches (-95 lines) - No per-allocation header prepended — allocations are unmodified, so pymalloc size classes are not shifted - Eliminates the class of bugs where downstream code assumes unmodified pointers - Identical performance: overhead within ±2% noise across all workloads (1-8 threads, small/large objects, Python/native allocs) - Identical memory consumption: vmmap physical footprint shows +0.5 MB fixed overhead, 0.1 B/obj for 5M floats (same as ScaleneHeader) - No contention: 8-thread/1-thread slowdown ratio stays at 0.95x The ShardedSizeMap uses 128 shards with open-addressed flat hash tables (linear probing, backward-shift deletion). Per-entry cost is 16 bytes with no per-entry heap allocations. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Now that ShardedSizeMap is the unified size tracker, ScaleneHeader is no longer used anywhere. Clean up: - Remove #include "scaleneheader.hpp" from libscalene.cpp, sampleheap.hpp, and sampleheap_win.hpp - Fix NEWLINE sentinel check in SampleHeap::malloc: compare against NEWLINE directly instead of NEWLINE + sizeof(ScaleneHeader), since allocations are no longer inflated by a header - Update test comments to remove ScaleneHeader references The scaleneheader.hpp file itself is left in place as it is not actively harmful and may be useful for reference. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

tests/test_issue999_wallclock.py

On Python < 3.13, dis.Instruction.starts_line is only set on the first instruction of each source line; subsequent instructions on the same line have starts_line=None. The test's _find_call_instruction() and _first_instr_on_line() were checking each instruction's starts_line independently, so they couldn't find CALL instructions (which are never the first instruction on a line — LOAD_GLOBAL comes first). Fix by tracking current_line across instructions and propagating it. On Python 3.13+, line_number is set on every instruction so this isn't needed, but the code handles both paths. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

emeryberger and others added 10 commits March 20, 2026 08:38

Merge branch 'master' of https://github.com/plasma-umass/scalene

ace4dbb

Merge branch 'master' of https://github.com/plasma-umass/scalene

1897bcd

Add gdb smoke test for free-threaded memory profiling diagnosis

12d3f0b

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Follow child process in gdb to capture preloaded SIGSEGV backtrace

19c76e9

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

emeryberger changed the title ~~True free-threaded Python support without PYTHON_GIL=1~~ WIP: True free-threaded Python support without PYTHON_GIL=1 Mar 31, 2026

emeryberger and others added 15 commits March 31, 2026 14:53

Show full gdb backtrace without truncation

ac3110b

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Test PYTHONMALLOC=malloc support on free-threaded Python

e315960

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Diagnostic: check PyMem symbols and gdb child backtrace

f63b253

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Merge branch 'master' into free-threaded-true-support

031c5d9

Fix include order: sharded_size_map.hpp must come after Python.h

d2d1681

Py_GIL_DISABLED is defined by Python.h, so sharded_size_map.hpp (which is guarded by #ifdef Py_GIL_DISABLED) must be included after it. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

github-advanced-security bot found potential problems Apr 3, 2026

View reviewed changes

test/test_freethreaded_parity.py Dismissed Show dismissed Hide dismissed

emeryberger changed the title ~~WIP: True free-threaded Python support without PYTHON_GIL=1~~ True free-threaded Python support: full memory + CPU profiling Apr 3, 2026

github-advanced-security bot found potential problems Apr 3, 2026

View reviewed changes

test/test_freethreaded_parity.py Fixed Show fixed Hide fixed

test/test_freethreaded_parity.py Fixed Show fixed Hide fixed

emeryberger and others added 5 commits April 3, 2026 10:57

github-advanced-security bot found potential problems Apr 3, 2026

View reviewed changes

tests/test_issue999_wallclock.py Fixed Show fixed Hide fixed

tests/test_issue999_wallclock.py Fixed Show fixed Hide fixed

emeryberger force-pushed the free-threaded-true-support branch from aa0c0a4 to eab2edd Compare April 3, 2026 19:59

emeryberger merged commit 6e8bcc2 into master Apr 3, 2026
50 checks passed

emeryberger mentioned this pull request Apr 5, 2026

Fix memory profiling regressions #1027

Merged

9 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

True free-threaded Python support: full memory + CPU profiling#1026

True free-threaded Python support: full memory + CPU profiling#1026
emeryberger merged 32 commits intomasterfrom
free-threaded-true-support

emeryberger commented Mar 31, 2026 •

edited

Loading

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

emeryberger commented Mar 31, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Unified size tracking via ShardedSizeMap

Free-threaded Python support

Other fixes

Net code change

Test plan

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

emeryberger commented Mar 31, 2026 •

edited

Loading