True free-threaded Python support: full memory + CPU profiling#1026
Merged
emeryberger merged 32 commits intomasterfrom Apr 3, 2026
Merged
True free-threaded Python support: full memory + CPU profiling#1026emeryberger merged 32 commits intomasterfrom
emeryberger merged 32 commits intomasterfrom
Conversation
…ng (#1023) On free-threaded Python builds, the GIL is disabled by default. When Scalene's native extensions are imported without declaring GIL compatibility, Python forcibly re-enables the GIL mid-execution, leaving the runtime in a transitional state that causes SIGSEGV when libscalene calls whereInPython() via PyGILState_Ensure(). - Detect free-threaded builds via sysconfig and set PYTHON_GIL=1 in the preload environment so the GIL is enabled from process start - Declare Py_mod_gil (Py_MOD_GIL_USED) on pywhere and get_line_atomic extensions to suppress the RuntimeWarning and properly declare GIL requirements - Add 3.13t and 3.14t to CI test matrix Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…l-fast lxml lacks pre-built wheels for free-threaded Python, so install system build dependencies on Linux. Also set fail-fast: false so one matrix failure doesn't cancel all other jobs. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Free-threaded Python (3.13t/3.14t) replaced pymalloc with mimalloc, which has a different allocation layout incompatible with libscalene's heap interposition. Disable memory profiling with a warning on these builds rather than crashing with SIGSEGV. Mark free-threaded CI jobs as continue-on-error since full support is still in progress. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Free-threaded Python (3.13t+) reinitializes memory allocators during Py_Initialize, overwriting the custom allocators that libscalene sets up via LD_PRELOAD static constructors. This causes SIGSEGV when objects allocated before init are freed after init (allocator mismatch). Fix: add scalene_reinstall_local_allocators() to libscalene, called from pywhere's populate_struct() after Python is fully initialized. It detects if the allocator was reset and re-wraps the current allocator. Also revert the blanket memory profiling disable on free-threaded Python, and add CI crash backtrace collection for debugging. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Make Scalene's memory profiling work on free-threaded Python (3.13t/3.14t) without forcing the GIL back on. This preserves true parallelism for profiled programs. Thread-safety fixes: - TraceConfig::should_trace(): Add mutex around _memoize (shared unordered_map). Replace chdir()/getcwd() pattern with direct path construction (chdir mutates process-wide CWD, unsafe for threads). - pywhere.cpp statics: Make last_profiled_invalidated and sysmon_tracing_active atomic. Make sysmon call depth counters thread_local. Make sysmon_DISABLE atomic with acquire/release. Add module_pointers_ready barrier for cross-thread visibility. - whereInPython(): Skip findMainPythonThread_frame() on free-threaded builds (iterates thread states unsafely). Return 0 instead. - collect_frames_to_record(): Fall back to pure-Python path (sys._current_frames()) on free-threaded builds. Module declarations: - Change Py_MOD_GIL_USED to Py_MOD_GIL_NOT_USED on both extensions - Remove PYTHON_GIL=1 forcing from scalene_preload.py - Remove continue-on-error for free-threaded CI jobs Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
On free-threaded Python, Py_Initialize resets allocators to mimalloc, invalidating the wrappers installed by libscalene's static constructors. Objects allocated before init (with ScaleneHeader) freed after init (with mimalloc) cause SIGSEGV. Fix: Skip allocator wrapping in MakeLocalAllocator's constructor when Py_GIL_DISABLED is defined. The allocators are instead installed after Py_Initialize via scalene_reinstall_local_allocators(), called from pywhere's populate_struct(). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
PyMem_SetAllocator is not thread-safe and must not be called while other threads are allocating. populate_struct() runs after threads like BLAS workers are already active, causing data races. Move the reinstall call to PyInit_pywhere() where the import lock ensures no other Python thread is running, making PyMem_SetAllocator safe to call. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…llocator Root cause: PyMem_SetAllocator is not thread-safe and cannot be called after Py_Initialize when other threads are already allocating. On free-threaded Python, Py_Initialize resets allocators to mimalloc, and reinstalling our wrapper at any later point races with concurrent allocations, corrupting object metadata and causing SIGSEGV. Fix: Set PYTHONMALLOC=malloc in the preload environment on free-threaded builds. This forces CPython to use C malloc for all allocation domains (MEM, OBJ, RAW), routing everything through LD_PRELOAD/DYLD_INSERT interposition. No PyMem_SetAllocator needed — the entire allocation pipeline is handled at the C level. Tradeoff: Python/C allocation ratio column shows 0%/100% on free-threaded builds (all allocations appear as C). Line-level memory attribution still works correctly via whereInPython(). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
PyMem_SetAllocator is a non-atomic struct copy, making it fundamentally unsafe to call after Py_Initialize on free-threaded Python. And PYTHONMALLOC=malloc is rejected on free-threaded builds (mimalloc is required for thread safety). Until CPython provides a thread-safe allocator swap API, memory profiling requires the GIL. Approach: - CPU-only profiling (--cpu-only): true free-threading, no GIL - Memory profiling: PYTHON_GIL=1 set in preload environment The thread-safety fixes (TraceConfig mutex, atomic statics, thread enumeration guards) remain in place for correctness and for the eventual removal of PYTHON_GIL=1 when CPython adds support. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Instead of calling PyMem_SetAllocator after Py_Initialize (thread-unsafe) or forcing PYTHON_GIL=1, interpose on PyMem_SetAllocator via LD_PRELOAD. When Py_Initialize tries to reset the allocator (e.g., to mimalloc on free-threaded Python), our interposition atomically updates the "original" allocator that our wrapper delegates to, using double-buffered storage with an atomic index swap. Our wrapper functions stay permanently installed — they never get overwritten. This eliminates: - PYTHON_GIL=1 forcing (true free-threading for all profiling modes) - The reinstall mechanism (no longer needed) - The PyMem_SetAllocator race condition entirely Also interpose PyMem_GetAllocator so CPython sees the "intended" allocator in debug/diagnostic contexts. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Free-threaded Python's GC directly walks mimalloc heap pages to find objects (gc_free_threading.c:update_refs). Prepending ScaleneHeader corrupts the object layout the GC expects, causing SIGSEGV in _Py_RunGC -> update_refs when it reads the header bytes as a refcount. On free-threaded builds, MakeLocalAllocator now passes allocations through without modification (no header, no size rounding). Sampling and whereInPython attribution still work; per-allocation size tracking uses the requested size instead of the header. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Remove Generic[T] from ScaleneSigQueue; use Sequence[Any] for items that get unpacked as processor arguments (fixes mypy "Expected iterable as variadic argument" error across all Python versions) - Remove corresponding Generic[T] from ScaleneSignalManager - Skip test_pool_spawn_cpu_only when profile file not created (known macOS spawn-mode Pool hang) - Increase test_tracer legacy tracer timeout from 60s to 120s Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ution On free-threaded Python, ScaleneHeader cannot be prepended to allocations because the GC directly scans mimalloc heap pages expecting valid Python objects. This caused two bugs: 1. local_free() passed size 0 to register_free(), so the free sampler never triggered — memory appeared to grow monotonically. 2. whereInPython() returned 0 for native thread allocations because iterating the thread list is unsafe without the GIL. Fix both: - Add ShardedSizeMap (128-shard spinlock + unordered_map) to track ptr→size out-of-band. local_malloc inserts, local_free removes (recovering the real size), local_realloc computes deltas correctly. - Cache the main thread's PyThreadState* during PyInit_pywhere (safe: module init is single-threaded) and use it as fallback in whereInPython() via PyThreadState_GetFrame (safe: uses atomic loads on free-threaded builds). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The static ShardedSizeMap instance must be declared before the template class that references it, otherwise free-threaded builds fail with 'g_size_map was not declared in this scope'. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ckType The Heap-Layers spinlock.h is in vendor/Heap-Layers/locks/ which is not on the include path. Use std::atomic_flag directly to avoid the dependency. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Py_GIL_DISABLED is defined by Python.h, so sharded_size_map.hpp (which is guarded by #ifdef Py_GIL_DISABLED) must be included after it. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
New test exercises CPU attribution (Python + native/C), memory allocation tracking, and multi-threaded workloads. Runs on every CI matrix entry (Python 3.9-3.14 + 3.13t/3.14t) to verify that free-threaded and regular builds produce comparable profiling results. Validates: - Python CPU time detected (pure-Python loop) - C/native CPU time detected (sorted() on large list) - Memory allocations tracked (~320 MB across threads) - CPU-only mode still works - Multiple lines attributed in both modes Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Phase 2 adds a numpy workload that exercises: - Native memory allocation via np.random.rand (~200 MB through system malloc, not pymalloc) — this specifically tests ShardedSizeMap on free-threaded builds and ScaleneHeader on regular builds - Native CPU time via BLAS matrix multiply (a @ a.T) - Memcpy interception from numpy internal copies - All of the above running in threads alongside the base workload Validated locally on both free-threaded 3.14t (~716 MB detected) and regular 3.12 (~713 MB detected), confirming native allocation tracking works correctly on both paths. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Phase 3 runs 1, 2, 4, and 8 threads each allocating ~150 MB via numpy (native malloc) and verifies that total attributed memory scales with thread count. This directly tests the ShardedSizeMap under contention on free-threaded builds and ScaleneHeader on regular builds. Checks: - Every thread count reports memory (> 1 MB) - 8-thread run attributes >= 2x what 1-thread does (generous for sampling) - 8-thread run covers >= 20% of nominal 1200 MB Local results: perfect 8x scaling on both free-threaded 3.14t (1200 MB) and regular 3.12 (1201 MB). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…eads Replace the single-allocation concurrency test with repeated alloc/free cycles (50 iterations × 32 MB numpy arrays per thread) and measure profiling overhead by comparing Scalene vs bare execution times. The key metric is the "contention ratio": the slowdown factor at 8 threads divided by the slowdown factor at 1 thread. A well-sharded data structure keeps this near 1.0; a global lock would push it toward 8.0. Local results: Free-threaded 3.14t: contention ratio 1.10x (slowdown 1.02x→1.13x) Regular 3.12: contention ratio 0.94x (slowdown 1.08x→1.02x) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…rdedSizeMap std::unordered_map uses ~50-80 bytes per entry (heap-allocated nodes with next pointers). Replace with a flat open-addressed table using linear probing and backward-shift deletion: - Per-entry cost: 16 bytes (void* ptr + size_t size), matching ScaleneHeader's overhead - No per-entry heap allocations — only bulk calloc on growth - Power-of-2 capacity with 70% load factor threshold - Lazy initialization (no allocation until first insert) - Backward-shift delete keeps probe chains intact without tombstones Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…eHeader Replace the dual-path size tracking (ScaleneHeader on regular Python, ShardedSizeMap on free-threaded) with ShardedSizeMap everywhere. Benefits: - One code path instead of two #ifdef branches (-95 lines) - No per-allocation header prepended — allocations are unmodified, so pymalloc size classes are not shifted - Eliminates the class of bugs where downstream code assumes unmodified pointers - Identical performance: overhead within ±2% noise across all workloads (1-8 threads, small/large objects, Python/native allocs) - Identical memory consumption: vmmap physical footprint shows +0.5 MB fixed overhead, 0.1 B/obj for 5M floats (same as ScaleneHeader) - No contention: 8-thread/1-thread slowdown ratio stays at 0.95x The ShardedSizeMap uses 128 shards with open-addressed flat hash tables (linear probing, backward-shift deletion). Per-entry cost is 16 bytes with no per-entry heap allocations. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Now that ShardedSizeMap is the unified size tracker, ScaleneHeader is no longer used anywhere. Clean up: - Remove #include "scaleneheader.hpp" from libscalene.cpp, sampleheap.hpp, and sampleheap_win.hpp - Fix NEWLINE sentinel check in SampleHeap::malloc: compare against NEWLINE directly instead of NEWLINE + sizeof(ScaleneHeader), since allocations are no longer inflated by a header - Update test comments to remove ScaleneHeader references The scaleneheader.hpp file itself is left in place as it is not actively harmful and may be useful for reference. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
On Python < 3.13, dis.Instruction.starts_line is only set on the first instruction of each source line; subsequent instructions on the same line have starts_line=None. The test's _find_call_instruction() and _first_instr_on_line() were checking each instruction's starts_line independently, so they couldn't find CALL instructions (which are never the first instruction on a line — LOAD_GLOBAL comes first). Fix by tracking current_line across instructions and propagating it. On Python 3.13+, line_number is set on every instruction so this isn't needed, but the code handles both paths. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
aa0c0a4 to
eab2edd
Compare
9 tasks
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Full profiling support for free-threaded Python (3.13t/3.14t) and a unified, simplified memory tracking architecture for all builds.
Unified size tracking via ShardedSizeMap
Replaces ScaleneHeader (16-byte inline header prepended to every pymalloc allocation) with an out-of-band sharded flat hash table on all builds — not just free-threaded.
Why not ScaleneHeader?
#ifdef Py_GIL_DISABLED) doubled maintenance burdenWhy not
malloc_size()/malloc_usable_size()?malloc_size()returns 0 for them since the system allocator only knows about the 256 KB arenasPYTHONMALLOC=mallocwould makemalloc_size()work but disables pymalloc, causing 33% slowdown on small-object workloadsmalloc_size()/malloc_usable_size()via SampleHeapShardedSizeMap design:
vmmapphysical footprint (5M floats)Free-threaded Python support
pywhere.cpp): Caches the main thread'sPyThreadState*duringPyInit_pywhere(single-threaded context) and uses it as fallback inwhereInPython()viaPyThreadState_GetFrame(safe on free-threaded builds — uses atomic loads internally)libscalene.cpp): InterposesPyMem_SetAllocator/PyMem_GetAllocatorsoPy_Initializecan't overwrite our wrapper. Double-buffered storage with atomic index swap for lock-free readsOther fixes
test_issue999_wallclock.py: Line number propagation fordis.Instructionon Python < 3.13 (wherestarts_lineis only set on first instruction per line). UsesScaleneFuncUtils._instructions_with_lines()instead of reimplementingSampleHeap::malloc: no longer addssizeof(ScaleneHeader)since headers are removedNet code change
#ifdefbranches, ScaleneHeader usage, size-class rounding, and debug assertionsTest plan
vmmapphysical footprint: +0.5 MB fixed overhead, 0.1 B/obj (indistinguishable from ScaleneHeader)🤖 Generated with Claude Code