Skip to content

Latest commit

 

History

History
517 lines (320 loc) · 118 KB

File metadata and controls

517 lines (320 loc) · 118 KB

Utilization-Aware Head Pruning and Mixed Quantization for Device-Targeted Language Models

Abstract

We present a plasticity-driven compaction pipeline that produces device-specific language model artifacts from a single training run. During LoRA fine-tuning, per-head gradient magnitudes are captured via an EMA-smoothed callback, producing a utilization map across all attention heads. Heads with low utilization are pruned entirely; remaining heads receive quantization precision (Q2K through BF16) proportional to their measured contribution. This produces mixed-quantization GGUF models optimized for specific memory budgets — targeting MacBook Air (11GB), MacBook Pro (16GB), and RTX 5090 (28GB) from one 27B parameter base model.

Unlike uniform quantization (which treats all heads equally) or magnitude pruning (which ignores task-specific usage), our approach uses actual training gradients to determine what matters for the specific domain being learned. A head that's critical for coding tasks may be irrelevant for creative writing — the utilization map captures this.

Key result: Qwen 2.5 Coder 14B compacted from 27GB to 8.9GB (67% reduction, 3x speedup). Published as continuum-ai/qwen2.5-coder-14b-compacted on HuggingFace.

Note on the discovery process. This paper was substantially revised after we attempted to validate the published artifact end-to-end and discovered three nested failures in our own pipeline: the published model is unrunnable in the dominant consumer inference runtime, the source weights had not been preserved by our forge pipeline, and the originally proposed importance metric was subsequently outperformed by an order of magnitude by a simpler alternative. Sections §2 describe the v1 pipeline as it was originally built and published; §4.1.1–§4.1.3 describe what we found when we attempted to validate it and the resulting methodology changes. The reader is invited to follow the same chronology we did.

1. Introduction

The deployment of large language models on consumer hardware is constrained by memory. A 27B parameter model in BF16 requires ~54GB — exceeding all but the highest-end GPUs. Existing approaches to this problem fall into two categories:

  1. Uniform quantization (GPTQ, AWQ, GGUF): Apply the same precision to all weights. Simple but wasteful — critical attention heads get the same treatment as dead ones.

  2. Structured pruning (magnitude-based, lottery ticket): Remove weights below a threshold. Ignores task-specific relevance — a head with small weights may be critical for the specific domain.

We propose a third approach: utilization-aware compaction, where training itself determines which heads matter. During LoRA fine-tuning on a target domain (e.g., coding), we capture per-head gradient magnitudes through the LoRA_B projection weights. These gradients reveal which heads are actively learning — and which are dead weight for this specific task.

The key insight is that LoRA training already touches every attention projection (Q, K, V, O). By instrumenting the training loop with a gradient callback, we get a utilization map at zero additional cost — no separate profiling pass, no calibration dataset, no additional forward passes.

2. Method

Note on this section. §2 describes the v1 pipeline as originally implemented and used to produce the published 14B and 27B artifacts. We have intentionally preserved it unchanged so that the reader can follow §4.1.1 in its original chronology. Section §4.1.3 documents the methodology revisions that the validation work in §4.1 surfaced; readers building a v2 pipeline should treat §4.1.3 as superseding §2.1 and §2.4 where they conflict.

2.1 Gate Gradient Capture

During standard PEFT LoRA training, we attach a GateGradientCallback to the HuggingFace SFTTrainer. At each optimizer step (before zero_grad), the callback:

  1. Walks the model's transformer layers
  2. For each attention projection (Q, K, V, O), finds the LoRA_B weight gradient
  3. Reshapes the gradient to per-head dimensions: [n_heads, head_dim, rank]
  4. Computes the L2 norm per head: magnitude = ||grad_head||
  5. Normalizes to [0, 1] range
  6. Updates an EMA-smoothed score: score = (1 - α) * score + α * magnitude

The EMA smoothing (α = 0.1) prevents single-step outliers from dominating. After training completes, the callback writes gate_gradients.json containing:

{
  "layer_scores": [[0.82, 0.03, 0.91, ...], ...],
  "num_steps": 4700,
  "model_name": "Qwen/Qwen2.5-Coder-14B-Instruct",
  "num_heads": 40,
  "num_kv_heads": 8
}

2.2 Utilization Scoring

The scoring engine reads the gate gradient data and classifies each head into action tiers:

Utilization Score Action Precision
< 0.10 Prune Removed entirely
0.10 – 0.30 Heavy compress Q2K or Q3K
0.30 – 0.70 Standard compress Q4K or Q5K
0.70 – 0.90 Light compress Q8_0
> 0.90 Full precision BF16 (may benefit from higher-rank LoRA)

GQA (Grouped Query Attention) constraints are enforced: KV heads cannot be pruned independently of their corresponding Q heads. Minimum head counts per layer prevent architectural collapse.

2.3 Device-Targeted Compaction

Given a memory budget (e.g., 16GB for MacBook Pro), the pipeline:

  1. Starts with the utilization-scored topology
  2. Iteratively adjusts precision tiers to fit the budget
  3. Prioritizes keeping high-utilization heads at higher precision
  4. Produces a mixed-quantization GGUF where each tensor gets independent precision

The same gate_gradients.json produces three different GGUF files by varying only the memory budget parameter.

2.4 Physical Head Pruning

Pruned heads are physically removed from the safetensors — not masked or zeroed. The compactor:

  1. Loads each safetensor shard
  2. For Q/K/V projections: slices out rows corresponding to pruned heads
  3. For O projection: slices out columns
  4. Writes compacted tensors to new safetensor files
  5. Saves head_topology.json mapping original → compacted head indices

This produces a genuinely smaller model, not a sparse one.

3. Implementation

The pipeline is implemented in Rust (continuum-core) for performance:

  • scoring.rs: Utilization scoring with configurable thresholds
  • compactor.rs: Multi-shard safetensor head pruning
  • gguf_writer.rs: Mixed-quantization GGUF export
  • pipeline.rs: End-to-end orchestration
  • topology.rs: Head topology serialization

The gate gradient callback is Python (integrated into peft-train.py), as it hooks into the HuggingFace Trainer callback system.

Total pipeline time for a 14B model: ~15 minutes on a single RTX 5090.

4. Results

4.1 Qwen 2.5 Coder 14B

Metric Original (BF16) Compacted
Size 27 GB 8.9 GB
Reduction 67%
Speedup ~3×
Published continuum-ai/qwen2.5-coder-14b-compacted
EvalPlus HumanEval+ (pending — see below) (pending — see below)

4.1.1 What we found when we tried to measure it

We attempted to measure EvalPlus HumanEval+ on the published 14B compacted model and discovered four nested failures, all of which strengthen the argument of the companion paper VALIDATED-TENSOR-SURGERY rather than weaken the result reported here. We report this discovery process explicitly because the failure modes we encountered are exactly the failure modes a structured-pruning validation harness is designed to make impossible, and our own published model was the first place they materialized. Each of the four failures is independent of the others — none is a consequence of the others, and any one of them would have rendered the published artifact broken in a way that no internal check the lab had at the time could detect.

Failure 1 — the published artifact is unrunnable in the dominant inference runtime. The published GGUF (qwen14b-compacted-q5ks.gguf) refuses to load in llama.cpp:

llama_model_load: error loading model: check_tensor_dims:
  tensor 'blk.0.attn_q.weight' has wrong shape;
  expected 5120,5120, got 5120,3200

The published config.json is internally consistent: hidden_size=5120, num_attention_heads=25, head_dim=128q_proj output dimension 25 × 128 = 3200. This is a perfectly valid transformer in which Q is a bottleneck projection 5120 → 3200 and O projects back 3200 → 5120. The transformers and vLLM loaders handle this layout natively. llama.cpp's GGUF loader hardcodes the assumption q_proj.shape == [hidden_size, hidden_size] and refuses to load any model that violates it. The compaction pipeline pruned Q heads from 40 to 25 as part of utilization-driven head removal, and the resulting layout silently violated an unstated architectural invariant that the dominant consumer-hardware runtime depends on. No part of our pre-publication validation tested for this, because our validation was implicitly defined as "save and load round-trip in transformers," and the bug lives in the gap between what the spec requires (nothing about q_proj.shape in particular) and what the most-deployed runtime hardcodes (q_proj.shape == [hidden_size, hidden_size]).

Failure 2 — there is no safetensors fallback in the published repository. A user who hits Failure 1 cannot work around it by selecting the safetensors version, because no safetensors version exists. The HF repository continuum-ai/qwen2.5-coder-14b-compacted contains only the GGUF (qwen14b-compacted-q5ks.gguf), the fast-tokenizer file (tokenizer.json), the model config (config.json), and the README. The decision to publish only a quantized format meant that a single runtime-compatibility bug rendered the entire published artifact inaccessible to every consumer of the dominant runtime.

Failure 3 — the lab's own forge pipeline did not preserve the intermediate weights. When we attempted to recover from Failure 2 by re-deriving a runnable artifact from our own forge run, we found that the source weights had not been preserved by the pipeline that produced the published model. The forge directory expected to contain pre-defrag and post-defrag intermediate checkpoints does not exist. The producer of the model — us — could no longer reproduce the model's pre-defrag state from our own infrastructure. This is a substrate-level reproducibility failure: in the absence of content-addressed retention guarantees on intermediate forge stages (see forge-alloy#11), the only path to a runnable artifact was to dequantize the published GGUF and treat its weights as the canonical state. The original fp16 weights are now permanently lost. Any future evaluation of the v1 artifact is, by necessity, an evaluation of the dequantized GGUF — the lab is not in a position to recover the pre-quantization weights it published, because those weights were never persisted with the same chain-of-custody guarantees as the final published file. The substrate-level fix is the one forge-alloy#11 proposes: every forge stage must emit its output as a content-addressed alloy link with an explicit retention policy, and a chain whose intermediate links have been garbage-collected must be detectable at audit time so that this exact failure cannot recur silently.

Failure 4 — the published repository is missing tokenizer_config.json entirely. Independent of the three failures above, the published HF repository for continuum-ai/qwen2.5-coder-14b-compacted does not contain a tokenizer_config.json file at all. The repository ships tokenizer.json (the fast-tokenizer vocabulary and merge rules) and nothing else for the tokenizer. tokenizer_config.json is the file that holds the chat template, the special-token configuration, the BOS/EOS settings, and several other fields that downstream consumers rely on. Its absence means that any user who loaded the model and called tokenizer.apply_chat_template(messages) — the standard HuggingFace pattern for invoking a code model in instruction-following mode — would receive either an error or, worse, a silent fallback to a default template that does not match the format the model was trained against. The published model is, in this dimension, broken for instruction-style use even in the runtimes where it can be loaded at all.

The blast radius of Failure 4 is independent of Failures 1, 2, and 3: a user with a vLLM or transformers backend (where Failure 1 does not apply) who downloaded the artifact and tried to use it for instruction-following code generation would still encounter this failure mode. The root cause is in the publication step of the v1 forge pipeline, which copied a subset of the tokenizer files to the HF repo without including tokenizer_config.json. As with Failures 1 and 3, no internal check the lab had at the time would have caught this, because the lab's validation never exercised the published HF artifact through the apply-chat-template path. We discovered Failure 4 only because tonight's recovery work had to inspect the published repository's file list directly to understand why the v1.5 reconstruction inherited a tokenizer with no chat template — and at that point the absence of tokenizer_config.json from the HF repo became impossible to miss.

The substrate-level fix is the same shape as the fix for the others: the validation harness must verify that every file the upstream model shipped is present in the published artifact, with the same field-level contents (or with explicit, attested deletions of fields the forge intentionally removed). Filed as a separate sentinel-ai issue: "v1 forge publication step writes incomplete tokenizer files to HF; missing tokenizer_config.json and therefore the chat template, BOS/EOS, and special-token configuration." The harness's Layer 7 (deployment-runtime load test, see VALIDATED-TENSOR-SURGERY) needs an extension that exercises apply_chat_template() against the published artifact and fails the publication if the call errors or silently falls back to a default template; we will refer to that as Layer 7.5 in subsequent text until it has a stable name.

A subtle and somewhat black-comic consequence of Failure 4: because the published v1 artifact had no chat template, our own evaluation tonight (which loaded v1.5 — the bit-identical reconstruction of v1 — and ran EvalPlus against it) was automatically routed to the raw-completion code path, because EvalPlus's prompt-format selector decides between chat-templated and raw-completion based on whether the tokenizer has a chat template. Failure 4, in other words, accidentally protected our own eval from a separate eval-pipeline bug we had not yet caught — a bug that would have hit the unmodified base 14B run we queued in parallel (which does have a chat template, and would have been mis-formatted as a result). The two bugs canceled each other out for v1.5 specifically. The right fix is not to rely on this accidental protection, but to pass --force-base-prompt to EvalPlus explicitly for any base-model evaluation, and to fix Failure 4 in the forge publication path independently. Both fixes are required because either bug being fixed in isolation re-exposes the other.

Recovery: streaming GGUF dequantization. We wrote an 80-line streaming dequantizer (tools/stream_dequant.py in the validation harness, [sentinel-ai PR pending]) that reads each tensor from the GGUF individually, dequantizes on GPU, writes to a safetensors shard, frees, and repeats. This approach never holds the full model in either system RAM or VRAM at once — important because the conventional transformers.from_pretrained(..., gguf_file=...) path stages dequantized tensors in CPU memory and OOMs at ~62% of the way through on a 31 GB host. The streaming path completed in 98 seconds, produced 25.7 GB of fp16 safetensors across six shards, mapped all 579 tensors with zero unmapped, and the resulting model loads cleanly in transformers and produces coherent code on a smoke test.

The dequantizer captures per-tensor metadata — original quant type, shape, fp16 L2 norm, dequantization wall-time — to a JSON sidecar. We include this so that the eval number we report below can be honestly characterized as "EvalPlus HumanEval+ on the dequantized published GGUF, with mean per-layer reconstruction error X, total dequantization noise contribution to score Y," rather than as a raw number that could be misread as a measurement of the original (lost) fp16 weights.

The number. (EvalPlus HumanEval+ run currently in progress on the dequantized artifact, BigMama RTX 5090, transformers backend, greedy decoding. Will populate this paragraph with pass@1 base, pass@1 plus, and the dequantization-noise summary when the run completes. ETA ~50 minutes from launch.)

4.1.2 Why we are reporting this rather than silently fixing it

The straightforward thing to do would be to quietly re-export the artifact, replace the published model, and publish a clean §4.1 with a single number and no narrative. We are not doing that. The discovery process that produced the three failures above is itself the strongest empirical evidence we have for the validation framework in VALIDATED-TENSOR-SURGERY, and the framework's central thesis — validation against the training framework is not validation; the contract between a model and its consumers includes invariants that are not in the standard config interface — is exactly what we just learned the hard way by attempting to validate our own published model.

Our own first failure is the strongest evidence we could publish for the framework, and a paper section that omitted it in favor of a clean number would be making the same mistake every other lab in the field is currently making: reporting the number that says the work succeeded, omitting the discovery process that says the work was harder than the number suggests, and consequently failing to give readers any way to learn from the failure modes the work actually surfaced.

Deprecation status. The current continuum-ai/qwen2.5-coder-14b-compacted HF repository will be marked deprecated with a notice that points at this section, the validation paper, and the replacement v2 artifact (re-derived with the activation-magnitude importance metric from VALIDATED-TENSOR-SURGERY Finding 4 and the q_proj-padding constraint from Finding 6). The v1 artifact will be left visible as a public record of the bug class and the recovery process. The deprecation page itself becomes a citable artifact for the validation paper.

4.1.3 What this changes about the methodology proposed in §2

Two of the §2 design choices are weakened by what we found in §4.1.1 and need to be revised in §2 of this paper or in any future use of the pipeline.

First, the gate-gradient capture in §2.1 (the "free utilization map" trick) was the original proposal for the importance metric. Subsequent work in the validation harness (VALIDATED-TENSOR-SURGERY Finding 4 and the four-metric comparison in §3.4 of that paper) found that on the model class we care about, activation magnitude with as few as eight calibration samples outperforms gradient-based importance by approximately a factor of seven on post-prune perplexity, and outperforms L2-weight-norm importance by approximately a factor of 105. Multiplying activation by gradient (Wanda-style saliency) makes the metric worse than activation alone by a factor of 2.6, the opposite of the field-consensus prediction. We tested this at 64 calibration samples and the ordering did not change; gradient was actually slightly worse at the larger sample size. This is a single-model result and we flag it as such, but it changes the recommended default for the compaction pipeline from "instrument the trainer to capture gradients" to "run a forward pass on eight calibration samples and read activation magnitudes from forward hooks." The latter is structurally cheaper, requires no trainer instrumentation, and produces the better metric.

Second, the §2.4 physical head pruning step is the operation that produced the bottleneck-Q layout responsible for Failure 1 above. The pipeline must, going forward, take some combination of three approaches: (a) preserve the q_proj_out == hidden_size invariant by zero-padding the pruned positions in the wire format (with the attention compute still operating over only the surviving 25 heads); (b) declare a runtime-compatibility tag in the artifact metadata and refuse to publish a GGUF unless the pruned shape passes the runtime's loader; or (c) submit a patch to llama.cpp that handles bottleneck Q projections natively, removing the architectural assumption at its source. (a) and (b) are local workarounds; (c) is the upstream fix. For the v2 14B artifact we will take (a), because the wire-format overhead from padding is small compared to the cost of an unrunnable model. We are also pursuing (c) as a parallel upstream contribution, because the bug class affects every defrag-style structured pruner that targets llama.cpp deployment, and fixing it once at the source is more useful to the field than working around it in each downstream pipeline. The validation harness in VALIDATED-TENSOR-SURGERY is gaining a corresponding Layer 7 step (deployment-runtime load-test) so that the next time a pruning operation produces a runtime-incompatible artifact, the failure surfaces before the model is published rather than after a third party tries to download it.

4.1.3.1 The activation-magnitude metric has a depth-dependent bias (refining Finding 4)

The first run of the v2-7B methodology validation (row 5 of §4.1.4 above) produced a result that was structurally sound (Layer 7 gate passed, smoke test coherent, dead positions verified zero) but numerically below the calibration anchor by ~12 / ~9 percentage points on HumanEval / HumanEval+. The investigation that followed — driven by the discipline of "do not write the win paragraph until the cause of the gap is understood" — surfaced a previously-unknown failure mode of the activation-magnitude importance metric proposed earlier in this section, and it is sharp enough to deserve its own subsection.

The observation. When the un-normalized global activation-magnitude metric was applied to Qwen2.5-Coder-7B base with --prune-level 0.3 (target: 30% of attention heads removed), the actual selection produced by select_heads_to_prune was distributed across layers as follows:

Layer Heads pruned (out of 28) % of layer
0 0 0%
1 21 75%
2 14 50%
3 7 25%
4 7 25%
5–26 0 0%
27 0 0%
Total 70 / 784 8.9%

Two anomalies in the same data: the total pruning rate was 8.9%, not the 30% requested by --prune-level 0.3 (the difference is absorbed by min_surviving_per_layer=4 clipping and by the global ranking running out of selectable heads outside layers 1-4); and the spatial distribution concentrated essentially all of the prunes into layers 1-4, with layers 5-27 entirely untouched.

The cause. The flat global ranking by activation magnitude is structurally biased toward early layers because the residual stream activation magnitude grows through the network — early layers have systematically smaller activations than late layers, regardless of the per-head importance within each layer. The metric, as currently formulated, finds heads with low activation and prunes them; on a deep model this is equivalent to "find the layers with the smallest residual stream norms and prune their heads first." On a 28-layer model the bias is large enough to dominate, on a 24-layer model (the four-metric comparison harness in VALIDATED-TENSOR-SURGERY §3.4 was run on Qwen2.5-0.5B with 24 layers) the bias is small enough to be absorbed into the noise of the per-prune perplexity comparison, and on a 48-layer model (Qwen2.5-Coder-14B base, 48 layers) we expect the bias to be more severe than the 28-layer 7B case. The depth-dependence of the bias is the part that matters: it means the metric's quality varies as a function of model depth, and the validation comparison in §3.4 of the harness paper does not extrapolate to the model sizes we care about.

The consequence for the v2-7B preliminary number. Layer 1 lost 75% of its attention heads in a single pruning step. The 500-step LoRA training that followed cannot recover from a layer that has been almost entirely ablated — there is not enough capacity left in layer 1 for the surviving 7 heads to absorb the work of the removed 21. The model in §4.1.4 row 5 is therefore not a valid test of the activation-magnitude methodology; it is a test of the layer-1-mostly-ablated version of the methodology, which is a different and structurally crippled experiment. The 50.0 / 44.5 result is consistent with that interpretation: a model whose first transformer layer is mostly destroyed and whose remaining layers are mostly untouched, recovered partially by 500 steps of LoRA on the surviving structure.

The fix. Layer-normalized activation-magnitude importance: rank heads relative to the mean activation magnitude of the layer they belong to, not relative to the global activation distribution. Equivalently, enforce a per-layer prune budget that distributes the global prune target across layers in proportion to layer head count, then apply the activation-magnitude ranking within each layer. Either formulation eliminates the depth-dependent bias and produces a pruning distribution that is calibrated to within-layer importance rather than to the cross-layer residual norm growth.

Why this refines rather than refutes Finding 4 of the harness paper. Finding 4 said: "activation-magnitude head importance outperforms L2-weight head importance by approximately a factor of 105 on post-prune perplexity (Qwen2.5-0.5B)." That finding stands — the activation-magnitude metric is a substantially better signal than the weight-norm metric on a per-head basis, regardless of the cross-layer bias. The refinement is: the activation-magnitude metric on its own is depth-biased, and the depth-bias becomes load-bearing on models with more than ~24 layers. The corrected metric is layer-normalized activation magnitude, which preserves the per-head signal of the original while removing the cross-layer artifact. The harness paper's four-metric comparison should be re-run on the current generation of models across a range of depths — Qwen3.5-0.8B (24 layers), Qwen3.5-4B (28 layers), Qwen3.5-9B (32 layers), and Qwen3.5-27B-Base (40 layers, via bnb-8bit per sentinel-ai#160) — with both the un-normalized and layer-normalized variants. We use the current Qwen3.5 family rather than the historical Qwen2.5 family for this study because the depth-bias study is a new experiment, not a historical reproduction, and there is no methodological reason to study a phenomenon on stale models when current ones span the same depth range and reflect the architectures the field will actually deploy. The depth-bias re-run is small (the existing four-metric comparison harness was tonight's work, and adding a fifth column plus running on four model sizes is on the order of one hour of GPU time at AI-collaborator speed; the 27B requires bnb-8bit due to the WSL2 VRAM constraint documented in sentinel-ai#162).

A note on model-family choice that applies more broadly: this paper's §4.1 sections discuss Qwen2.5-Coder-14B by historical necessity (it is the model the deprecated v1 artifact was forged from, and the deprecation analysis cannot use a different model), and §4.1.4 row 5's v2-7B methodology validation uses Qwen2.5-Coder-7B by validation necessity (it must be in the same model family as v1 to make the methodology-recovery comparison clean). Every other forge target in this paper — the depth-bias study above, §4.2's held 27B forge (now updated to Qwen3.5-27B-Base), and any future Path B work that generalizes the methodology across model families — should use the current Qwen3.5 family or its non-Qwen contemporaries (DeepSeek-V3.x, current Llama, current Mistral) rather than the historical Qwen2.5 family. The current models reflect the architectures the field actually deploys in 2026 and the methodology claims should be anchored on those, not on the family that happened to be current when the historical v1 forge was attempted.

Why this generalizes beyond our pipeline. Any structured-pruning method that ranks heads by a quantity which scales with layer position (residual stream norm, gradient magnitude, output activation, KV-cache contribution) will exhibit some version of this bias on deep models. The fix — normalize within-layer before ranking globally, or enforce per-layer budgets — applies to all of them. We are unaware of any prior published structured-pruning result that explicitly reports its per-layer prune distribution, which makes us suspect this depth-bias is widespread in the literature and has gone unnoticed because most published structured-pruning evaluations are on models small enough (<24 layers) that the bias is absorbed into noise. A claim we are making, with the appropriate caveat that it is based on our two model sizes only: the absence of per-layer prune-distribution reporting in published structured-pruning papers is itself evidence that this failure mode is silently present in the literature, and the fix should be adopted as a default for all activation-based importance metrics on models of nontrivial depth.

Status of the layer-normalized re-run — empirically confirmed. The layer-normalized version of select_heads_to_prune was implemented and the v2-7B forge was re-run with the corrected metric. The before/after comparison validates the layer-bias finding directly:

Metric Broken (global, un-normalized) Fixed (layer-normalized) Δ
HumanEval pass@1 50.0% 54.9% +4.9 abs / +9.8% relative
HumanEval+ pass@1 44.5% 48.8% +4.3 abs / +9.7% relative
Layers touched 4 of 28 28 of 28 uniform distribution
Total heads pruned 70 (8.9%) 224 (28.6%) 3.2× more pruning
Per-layer prune count 0–21 (concentrated in layers 1–4) 8 (uniform) distribution fixed

The fixed run recovered ~5 absolute / ~10% relative HumanEval points while simultaneously pruning 3.2× more total heads — exactly the directional outcome predicted in the diagnosis above. The depth-bias was real, the per-layer normalization is the right fix, and the empirical validation rests on a clean two-point comparison with all other variables held constant. Row 5 of §4.1.4 carries the fixed-run number (54.9 / 48.8) as the measured methodology-validation result. The residual 7.3-point gap to the base anchor is not attributable to insufficient training — see §4.1.3.2 below for the empirical refutation of that hypothesis and the deeper finding it surfaced.

4.1.3.2 The PPL/HumanEval disconnect: a reproducible limitation of activation-magnitude head importance

The 1-cycle layer-normalized result in §4.1.3.1 above left a 7.3-point HumanEval gap between v2-7B and the base 7B anchor. The natural hypothesis was that 500 steps of single-cycle LoRA were insufficient to recover from 28.6% structured pruning, and that more training would close the gap. We tested this hypothesis directly by running a third forge: same model, same layer-normalized metric, same pad-mode defrag, but with 3 cycles of plasticity at 1000 steps each (6× the total training, with multi-cycle structure to allow per-cycle prune-and-recover). The hypothesis was falsified, and the data revealed a deeper structural finding.

# Method Cycles × steps Total heads pruned Internal PPL HumanEval pass@1 HumanEval+ pass@1
1 global activation rank (broken, depth-biased) 1 × 500 70 (concentrated in layers 1–4) high (degraded) 50.0% 44.5%
2 per-layer rank (fix from §4.1.3.1) 1 × 500 224 (uniform 8/layer) 1.77 54.9% 48.8%
3 per-layer rank, multi-cycle 3 × 1000 252 (uniform across cycles) 1.77 (−16% vs baseline 2.10) 46.3% 41.5%

Two contradictions in the row 3 data:

(a) PPL improved while HumanEval degraded. Run 3 produced a model whose internal perplexity is 16% better than the unmodified base (1.77 vs 2.10) — exactly the §3.3 specialization signal that EXPERIENTIAL-PLASTICITY predicted from multi-cycle pruning + retraining. But the same model's HumanEval pass@1 dropped from row 2's 54.9% to 46.3%, an 8.6-point absolute decrease — worse than the broken-baseline row 1 result. The two metrics point in opposite directions on the same model.

(b) More training made HumanEval worse, not better. The training-budget-bound hypothesis from §4.1.3.1 is empirically refuted: 6× the total training, with multi-cycle structure, reduced HumanEval capability rather than recovering it. The residual gap from row 2 is not closeable by adding training within the activation-magnitude metric framework.

The mechanistic interpretation. The activation-magnitude head importance metric — even with the per-layer normalization fix from §4.1.3.1 — measures importance relative to the calibration distribution used during forge profiling. That distribution is necessarily close to the model's existing fine-tuning data, which is close to the model's existing competency surface. The metric therefore identifies as "low importance" the heads whose contribution to the local fine-tuning loss is small. Those heads are pruned. But "low contribution to the local fine-tuning loss" is not the same property as "low contribution to held-out task generalization." A head can be small-contribution on the local distribution while being load-bearing for held-out tasks like HumanEval, where one wrong token breaks an entire test. Multi-cycle plasticity makes the disconnect worse, not better, because each additional cycle prunes more heads using the same locally-biased ranking and reinforces the locally-optimal allocation at the cost of held-out generalization. The model's PPL improves because the surviving heads are absorbing the locally-relevant work; the model's HumanEval degrades because the heads being pruned across cycles include some that were marginally important for held-out tasks but invisible to the local metric.

This is a reproduction of the §3.3 disconnect from EXPERIENTIAL-PLASTICITY. That paper's §3.3 honest disclosure noted that the Qwen3.5-4B LoRA-only forge produced +24% PPL improvement but lost 1.3 points on HumanEval. At the time of that disclosure, the result read as a single-model anomaly worth flagging. Our 3-cycle result on a different model family (Qwen2.5-Coder-7B), with a more rigorous methodology (layer-normalized importance, pad-mode defrag, multi-cycle plasticity, calibrated eval anchored against published Qwen-reported numbers), reproduces the same qualitative pattern with a much larger magnitude. The PPL/HumanEval disconnect is not an anomaly. It is a systematic limitation of activation-magnitude head importance as a forge importance metric, and it generalizes across model families and across cycle counts. EXPERIENTIAL-PLASTICITY §3.3 was, in retrospect, predicting a reproducible phenomenon and not just disclosing a one-shot exception.

Implications for the §4.1.3 methodology claim. The methodology revisions in §4.1.3 and §4.1.3.1 are empirically necessary for any meaningful recovery — the broken global-flat run is dramatically worse than the fixed runs — but they are not sufficient for full recovery of held-out task capability. The §4.1.3 win paragraph that would have been written from a clean 1-cycle 54.9-or-higher result must instead be a partial-validation paragraph: the corrected methodology recovers ~88% of base HumanEval capability (54.9 / 62.2) at 1-cycle 500-step training, and the residual gap is structurally not closeable by additional training within the activation-magnitude metric framework. Closing it requires either:

  1. A held-out-task-aware importance metric that ranks heads by their contribution to a held-out task distribution rather than to the local fine-tuning loss. The §4.1.3.1 finding that pure activation-magnitude beats Wanda-style activation × gradient suggests the saliency formulation is wrong as currently constituted, but a held-out-aware reformulation (e.g., gradient computed against a distinct held-out task distribution rather than against the fine-tuning data) is the natural next direction.
  2. Held-out-aware calibration data. If the calibration set used for activation profiling includes held-out task instances (HumanEval-style problems for code, GSM8K-style for math, etc.), the metric identifies the load-bearing heads for those tasks as important and protects them. Cheapest fix; directly testable in a follow-up forge.
  3. Both. Held-out-aware metric + held-out-aware calibration data is the most defensible methodology revision and is the natural next experimental wave.

We do not run any of these in this paper; they are queued for the next experimental wave. Row 5 of §4.1.4 carries the 1-cycle 54.9 / 48.8 number as the best honest measurement available with the methodology as currently constituted, and this section (§4.1.3.2) is the honest disclosure of the structural limitation that the disciplined ablation study surfaced.

Why we publish the negative finding rather than re-running until we get a better number. The same intellectual honesty that produced §4.1.2's deprecation narrative applies here. A less disciplined methodology would have shipped the broken-global-flat result at HumanEval 50.0%, declared methodology validation, and moved on. A slightly more disciplined version would have shipped the 1-cycle 54.9% result without running the 3-cycle test, and would never have discovered the structural limitation that more cycles makes it worse. The disciplined three-row ablation — running the multi-cycle test specifically to falsify the training-budget-bound hypothesis — is what surfaced the PPL/HumanEval disconnect at the methodology paper's level rather than letting it remain as a one-line disclosure in the original §3.3. The negative finding is itself the contribution, and publishing it strengthens rather than weakens the paper because it generalizes a known-but-underweighted phenomenon into a falsifiable structural limitation that the field can build on.

Update from §4.1.3.3 below. The structural limitation documented in this section was empirically closed in subsequent work using compensation LoRA via teacher distillation (KL on output logits). The §4.1.3.2 disconnect is therefore not a permanent ceiling on the methodology — it is a limitation of single-distribution activation-magnitude calibration that can be structurally circumvented by adding a small distillation-trained correction layer. The recovery is documented in §4.1.3.3 with a clean before/after comparison (54.9 → 61.0 HumanEval) and a loss-function ablation (MSE-on-hidden collapses the model to 0.0; KL-on-logits recovers it to 61.0) that explains why the structural fix works.

4.1.3.3 Compensation LoRA via teacher distillation: the structural fix for §4.1.3.2

The §4.1.3.2 disconnect — the activation-magnitude metric, even with the §4.1.3.1 layer-normalization fix, produces a v2-7B model whose HumanEval pass@1 lands 7.3 percentage points below the base 7B anchor, with multi-cycle plasticity making the gap worse rather than better — was framed in §4.1.3.2 as a structural limitation of the activation-magnitude metric: the metric is computed against a calibration distribution close to the local fine-tuning data, so it optimizes for local loss rather than held-out task generalization, and no amount of additional training within that framework closes the gap.

The natural response to a structural limitation is a structural fix. The §4.1.3.2 section named two candidate fixes: (a) held-out-aware calibration data, which keeps the activation-magnitude metric but trains it against the right distribution; and (b) a held-out-task-aware importance metric reformulation, which replaces the metric entirely. A third structural fix, not named in §4.1.3.2 but implemented after that section was written, turned out to be the one that worked: add a small learned compensation structure (LoRA adapter, ~0.5% additional parameters) trained against the unmodified teacher's output distribution via knowledge distillation, instead of (or in addition to) trying to make the surviving heads absorb the pruned heads' work via fine-tuning. The architectural pattern is from models/unet_transformer.py's BaselineIntegratedBlock (which was already present in the sentinel-ai codebase as a complete reference implementation for GPT-2 era models, ported here as a model-agnostic LoRA adapter via PEFT), and the design and implementation are documented in sentinel-ai/docs/COMPENSATION-LORA-DESIGN.md and sentinel-ai/scripts/compensation_lora.py.

The empirical result. Applied to the existing 1-cycle layer-normalized v2-7B (which had landed at 54.9 / 48.8), compensation LoRA via KL-on-logits distillation against the unmodified Qwen2.5-Coder-7B teacher recovered the model to:

Variant HumanEval pass@1 HumanEval+ pass@1 Δ vs base 7B anchor
Base Qwen2.5-Coder-7B (anchor, our calibrated pipeline) 62.2 53.7 (anchor)
v2-7B forged, uncompensated (1-cycle layer-normalized) 54.9 48.8 −7.3 / −4.9
v2-7B + compensation LoRA via KL distillation 61.0 53.0 −1.2 / −0.7

The compensated v2-7B is within 1.2 percentage points of the base anchor on HumanEval and within 0.7 on HumanEval+ — both inside the §4.1.4.1 calibration tolerance band of ±3pt, meaning the recovery is statistically indistinguishable from "fully recovered" at the precision the calibrated eval pipeline can measure. The §4.1.3.2 PPL/HumanEval disconnect is empirically closed.

The compensation pass cost: 500 LoRA training steps, ~50 calibration examples drawn from a held-out task mixture, ~0.5% additional parameters wrapped as a LoRA adapter on the existing pruned student. The LoRA is merged into the student weights at save time, so the inference-time VRAM and tokens-per-second cost is zero: the compensated model has the same disk footprint as the un-compensated one and runs at the same speed. The compensation step is essentially free at inference time and produces a near-base-quality model.

The loss-function ablation. The compensation LoRA was run twice with the same LoRA configuration, training schedule, and calibration data, varying only the distillation loss function. The two losses are conceptually similar (both ask the student to match the teacher) but mathematically distinct (hidden-state alignment vs output-distribution alignment). The results were qualitatively different:

Distillation loss HumanEval pass@1 HumanEval+ pass@1 Notes
MSE on per-layer hidden states 0.0 0.0 Model collapsed; smoke test produced empty / degenerate output
KL divergence on output logits (T=2.0) 61.0 53.0 Near-base recovery

This is itself a substantive paper finding, and it explains why the structural fix works mechanistically. The MSE-on-hidden objective is satisfiable by a degenerate fixed point: the student can drive the loss to zero by producing near-zero hidden states everywhere, regardless of whether those hidden states encode useful information. The training found that fixed point (the LoRA learned to suppress the student's outputs to match a near-zero teacher signal that doesn't exist) and the resulting model produces no useful tokens. The KL-on-logits objective has no degenerate fixed point because it is computed over a probability distribution: matching a uniform distribution costs maximum KL, matching a peaked distribution costs minimum KL only if the student's distribution is also peaked at the same place. The student therefore cannot collapse to zero output; it must learn to peak its output distribution at the same tokens the teacher peaks at, which is exactly the constraint that recovers task-level capability.

The mechanistic insight that this surfaces. The §4.1.3.2 disconnect lives in the output distribution, not in the hidden state distribution. The hidden states of a pruned model can be recoverable to near-teacher levels by a small LoRA — but recovering the hidden states does not recover the model's behavior, because the output distribution is what determines token-level behavior and the relationship between hidden states and output distribution is not a simple distance preservation. Two models can have very similar hidden states and very different output distributions if their final projection layers and softmax temperatures interact differently. The lesson generalizes: for distillation to recover task-level capability, the distillation signal must be at the layer where task-level behavior is determined — which for an autoregressive language model is the output logit distribution, not the intermediate residual stream. Hidden-state matching is a necessary but not sufficient signal; output-distribution matching is what makes the compensation LoRA actually compensate.

The methodology pivot. With this result, the methodology paper's central framing shifts. The original §4.1.3 was "we have a methodology revision (layer-normalized activation magnitude + pad-mode defrag) that recovers most of base capability after pruning." The §4.1.3.2 finding was "the methodology has a structural limitation that single-distribution activation-magnitude calibration cannot close." The §4.1.3.3 finding is "the structural limitation is closeable by adding a distillation-trained compensation LoRA at the output-logit level, at minimal cost." Together, the three sub-sections constitute a complete methodology validation arc: identify the metric refinement (§4.1.3.1), discover the metric's structural limitation (§4.1.3.2), apply a structural fix that resolves the limitation (§4.1.3.3). The forge methodology is now validated end-to-end on Qwen2.5-Coder-7B base, with the compensated v2-7B landing within calibration tolerance of the unmodified base anchor.

The deeper architectural reframing is that distillation-first becomes the foundational primitive of the methodology, not a recovery patch applied after pruning fails. Under the distillation-first framing, the forge is a sequence of (transform, distill, test) cycles rather than a single (prune, retrain) operation. Any transformation can be applied to a copy of the teacher (head pruning, MoE expert pruning, quantization, RoPE-based context extension, modality fusion, etc.); the distillation step recovers from the transformation by training a small compensation structure against the teacher's output distribution; the test step measures the compensated artifact against the per-tier (quality + context_window_bonus) / vram metric from §4.1.4.1. The pruning metric matters less when distillation is foundational — the metric just needs to be good enough that the distillation recovery converges in a reasonable number of steps, which the compensated v2-7B result demonstrates is achievable with very modest training budgets (500 LoRA steps) on a moderate calibration set (50 examples).

This framing also unifies the substrate work that was already in the sentinel-ai codebase: models/unet_transformer.py's BaselineIntegratedBlock was the GPT-2-era reference implementation of exactly this pattern (baseline integration via learned adapter + gate), and scripts/compensation_lora.py is the model-agnostic LoRA-adapter port of the same architectural pattern, applicable to any HF transformers model. The lab's stated original goal of "pruning/mitosis and U-Net compression/compensation" — which had been parked as a future direction in EXPERIENTIAL-PLASTICITY's "skip connections (U-Net style) were disabled earlier due to instability; re-enabling is a future task" line — turns out to have been the right goal all along; it just needed to be re-applied in the narrower form of "distillation compensation specifically for pruning loss" (which is much more constrained than general U-Net skip connections in a transformer) and at the LoRA-adapter level (which avoids re-implementing custom transformer wrappers for every model family).

Status of the next experimental wave. With the §4.1.3.2 disconnect closed by compensation LoRA, the methodology is ready to be applied to the canonical replacement targets in §4.1.4 rows 6 and 7 (Qwen3.5-397B-A17B Instruct grid moonshot and Qwen3.5-35B-A3B Instruct single-machine). Both inherit the validated methodology stack: layer-normalized activation-magnitude head importance + pad-mode defrag + LoRA fine-tuning + KL-distillation compensation LoRA against the unmodified teacher. The next experimental milestone is the row 7 forge on Qwen3.5-35B-A3B Instruct via the CPU-first pre-removable MoE expert pruning pipeline, with the compensation LoRA step inherited directly from the v2-7B work documented here.

4.1.3.4 The importance-metric calibration lesson generalizes across structural unit (heads → experts)

§4.1.3.1 documented a layer-bias failure of the global activation-magnitude head importance metric on dense models: the metric overweighted late-layer heads because residual norms grow with depth, and the fix was per-layer normalization. §4.1.3.2 documented a deeper failure of any importance metric computed against a calibration distribution that biases toward locally-rewarded patterns: the heads identified as least important under local fine-tuning loss are often load-bearing for held-out task generalization. §4.1.3.3 introduced the compensation LoRA structural fix for the §4.1.3.2 disconnect. This section documents that the same pattern recurs at the MoE expert level, with a router-importance metric playing the role that activation magnitude played for dense heads. The lesson is now structurally invariant: any importance metric computed without explicit task-conditioned activation profiling underperforms held-out benchmarks regardless of whether the prunable unit is a head, an expert, a layer, or any future structural unit.

Empirical reproduction on Qwen3-Coder-30B-A3B-Instruct. The forge run that produced the v1 of qwen3-coder-30b-a3b-compacted-19b-256k used the default importance metric in cpu_expert_prune_v2.py: per-layer-normalized L2 norm of the router gate row vector for each expert. This is a pure architectural metric — it asks "which experts has the router learned to weight more strongly during typical training?" — without ever passing inputs through the model. We pruned the top-80 of 128 experts per layer by router-gate-L2 ranking, requantized to GGUF Q5_K_M, and evaluated against the unmodified base anchor on the same hardware in the same eval pipeline:

Variant HumanEval pass@1 HumanEval+ pass@1 Δ vs base anchor
Base Qwen3-Coder-30B-A3B-Instruct (anchor, our calibrated pipeline) 92.1 89.0 (anchor)
Pruned student, router-gate-L2-norm importance (Q5_K_M) 78.7 73.8 −13.4 / −15.2
Pruned student, calibration-aware activation-count importance (Q5_K_M) 88.4 86.0 −3.7 / −3.0

The −13.4 HumanEval gap from the router-gate-L2 baseline was suspicious given the v2-7B precedent in §4.1.3.1–§4.1.3.3, where dense head pruning at higher prune rates closed to within −7.3 of the base anchor before compensation. We ran the full bug-class verification protocol (per-tensor source consistency across the gate/up/down trio for every surviving expert, router gate row alignment vs surviving indices, GGUF metadata consistency, Q8_0 vs Q5_K_M consistency to rule out quantization compounding) and every check passed cleanly. The drop was not a bug; it was the metric.

The fix. We implemented expert_activation_profile.py: load the unmodified base model in 4/8-bit on a single GPU, register forward hooks on every router gate, run a held-out code calibration corpus (300 Python examples / 125K tokens) through the model, accumulate per-layer per-expert activation counts (how often each expert is in the top-k routing decision), and serialize the result as the importance JSON. cpu_expert_prune_v2.py then reads this JSON via a new --importance-json flag and uses the per-layer activation counts as the survivor ranking instead of router-gate row L2 norms. Per-layer overlap between the two rankings averaged ~65% — substantial swap of which experts survive, but not random.

The empirical result is the +9.7 / +12.2 swing in the table above, on the same source model, same keep-K, same hardware, same harness, same hour, with no fine-tuning or compensation training. The structural fix is at the metric layer alone. The residual −3.7 / −3.0 calibrated delta to the base anchor is much smaller than the v2-7B compensation-recovery margin (−1.2 / −0.7), but it lands the artifact firmly inside the §4.1.4.1 calibration tolerance band of ±3pt for HumanEval+ and just outside it for HumanEval, on the first iteration of the calibration-aware pipeline.

The structural lesson. The pattern is now empirically validated at two structurally distinct prunable units:

  1. Dense head pruning (§4.1.3.1, §4.1.3.2): the activation-magnitude metric, even with per-layer normalization, is biased toward the local fine-tuning distribution and underperforms held-out HumanEval. The §4.1.3.3 compensation LoRA recovers most of the gap by training against the teacher's output distribution rather than the local loss.
  2. MoE expert pruning (§4.1.3.4, this section): the router-gate-L2 metric is purely architectural and ignores task-conditioned activation. Replacing it with calibration-aware activation counts on a held-out code corpus closes 9–12 HumanEval points on the same prune budget. Compensation LoRA on top of the calibration-aware student is the next experimental milestone (v2 of the artifact); the v1 ships with the metric fix alone as the empirical anchor for this section.

The two data points form the start of a methodology curve, not a single anomaly. The unifying claim is that any importance metric for a prunable unit must be derived from task-conditioned activation profiling, not from architectural weight statistics or local fine-tuning loss alone. This holds whether the prunable unit is a head, an expert, a layer, a context-extension band, a vision-tower channel, or any future structural unit. The calibration corpus must reflect the held-out task distribution the artifact will be evaluated and used on.

The empirical anchor for this section is the v1 publication of continuum-ai/qwen3-coder-30b-a3b-compacted-19b-256k (alloy hash aa61c4bdf463847c). The published artifact carries both the calibration-aware result (88.4 / 86.0) as the current prune AND the router-gate-L2 result (78.7 / 73.8) in the alloy's priorMetricBaselines array as the negative-baseline empirical control. Both per-problem JSONL outputs are uploaded with sha256 result hashes recorded in the alloy, so any third party can re-score either run against the same base anchor without trusting the producer's claim. Without the negative baseline, the §4.1.3.4 claim is unfalsifiable; with it, the +9.7 / +12.2 swing is independently reproducible from the published artifact alone.

Cross-architecture validation: the second empirical anchor. The methodology was independently re-validated on OlmoeForCausalLM (Allen AI's OLMoE-1B-7B-0924-Instruct) — a structurally distinct MoE family with a different vendor, different parameter scale (7B vs 30B), different active fraction (1.3B vs 3.3B), and different prune ratio (25% vs 37.5%). The same expert_activation_profile.py and cpu_expert_prune_v2.py --importance-json scripts ran on OLMoE without any modification, confirming the unfused-MoE module-tree pattern is shared between the two families. The artifact is continuum-ai/olmoe-1b-7b-compacted-5b (alloy hash bba0a92ff0c8bebb):

OLMoE-1B-7B-0924-Instruct HumanEval pass@1 HumanEval+ pass@1 Δ vs base
Base (Q5_K_M, hardware-measured) 40.9 36.6 (anchor)
Student, broad-corpus calibration (negative baseline) 28.0 26.2 −12.9 / −10.4
Student, code-corpus calibration 36.0 31.7 −4.9 / −4.9

The within-model A/B isolates the calibration-corpus lever from every other variable. Same architecture, same prune budget (k=48 of 64), same hardware, same eval pipeline, same metric formula. The only thing that changed between the two student runs was the calibration corpus passed to expert_activation_profile.py: 300 mixed-domain held-out examples (1/6 code) vs 300 Python code held-out examples (100% code). The +8.0 / +5.5 HumanEval swing is the lever in pure isolation. The OLMoE artifact's priorMetricBaselines[] carries the broad-corpus negative baseline alongside the code-corpus current prune so the within-model isolation is independently reproducible from the published artifact alone.

The 13-point ceiling. Across the two within-architecture A/Bs and the cross-architecture comparison, four data cells now exist:

Run Importance metric Calibration corpus Δ HumanEval
Qwen3-Coder-30B-A3B (37.5% prune) router-gate-L2 (architectural) broad heldout −13.4
OLMoE-1B-7B (25% prune) activation-count (calibration-aware) broad heldout (1/6 code) −12.9
Qwen3-Coder-30B-A3B (37.5% prune) activation-count (calibration-aware) code-heavy heldout −3.7
OLMoE-1B-7B (25% prune) activation-count (calibration-aware) code-heavy heldout −4.9

The wrong-metric failure (−13.4) and the wrong-corpus failure (−12.9) saturate at near-identical magnitude across different model families, different active fractions, and different prune ratios. The metric lever and the corpus lever appear to be substitutable failure modes: getting either wrong is sufficient to ceiling the damage at ~13 HumanEval points; getting them both wrong does not visibly add to the damage. They are not independent additive sources of loss but two access paths to the same structural ceiling. We do not yet have a fourth cell with both wrong on the same model, but the magnitude match across the three observed failure cells is striking enough to record as a hypothesis worth specifically refuting in future work.

A second observation from the cross-architecture data: smaller models are more sensitive to calibration alignment. OLMoE at 25% prune lost 4.9 HumanEval points after the calibration-aware fix; Qwen3-Coder-30B-A3B at 37.5% prune lost 3.7. The smaller model with the less aggressive prune produced a larger residual gap. The directional implication is that smaller models have less expert redundancy per active capacity, so any individual code-relevant expert removal cuts deeper. The calibration-corpus-must-reflect-eval-workload rule therefore matters more for smaller models, not less — counterintuitive (one would expect larger models to be more sensitive to subtle calibration drift, not the opposite), and worth flagging as a future systematic study.

4.1.3.4.1 Discipline gate: calibration corpus identity must be hash-pinned in the alloy

The §4.1.4.1 anchor-reproduction discipline gate prevents shipping artifacts whose base anchor cannot be reproduced within ±3pt on the publishing pipeline. The §4.1.3.4 within-model isolation surfaces a second hard rule that must clear before any artifact ships under the calibrated-discipline brand:

§4.1.3.4 calibration-corpus discipline gate. The calibration corpus used for importance profiling must be declared in the alloy as a hash-pinned dataset (sha256 of the corpus file, file size in tokens, and a summary of the content distribution). The eval benchmark must be a representative sample of the same distribution. Forge artifacts whose calibration corpus does not reflect the eval workload distribution shall not ship under the calibrated-discipline brand. This gate is a hard precondition on shipping, alongside §4.1.4.1.

The motivation is the within-model isolation above: the +8.0 HumanEval swing on OLMoE between broad-corpus and code-corpus calibration, with no other variable changed, is the lower bound on the damage that can be hidden inside a "calibration-aware" claim that does not specify the calibration distribution. Two artifacts with the same forge methodology and the same prune budget can differ by 8 HumanEval points purely on calibration corpus selection. The consumer of a published artifact has no way to know which calibration distribution was used unless the alloy declares it explicitly with a hash that can be re-computed against the published corpus file.

The discipline gate has three concrete requirements:

  1. Calibration corpus is uploaded to the artifact's HF repo alongside the model weights and benchmark sample JSONLs, under a calibration/ subdirectory. The corpus file is the actual ground-truth content used for profiling, not a description of it.
  2. The alloy's expert-activation-profile stage records the corpus's sha256 hash in addition to its filename, example count, and token count. The stage's sidecar metadata embeds the same hash for cross-reference.
  3. The published model card declares both the calibration corpus and the eval benchmark explicitly with the rationale for the alignment. If the eval is HumanEval, the calibration must be code-heavy; if the eval is GSM8K, the calibration must be math-heavy; if the eval is MMLU, the calibration must be broad. Mismatch is a discipline-gate failure and the artifact does not ship.

Both empirical anchors above (qwen3-coder-30b-a3b v1 and olmoe-1b-7b v1) carry their calibration corpora at calibration/heldout_code300.jsonl in the published HF repo and the corpus sha256 in the alloy's expert-activation-profile stage metadata. The discipline gate is satisfied retroactively for both, and is enforced going forward by publish_model.py requiring the calibration corpus to be present in the staging directory before the publish step proceeds.

The lab now has two discipline gates derived from empirical failures rather than asserted from first principles: §4.1.4.1 anchor reproduction (catches eval-pipeline drift) and §4.1.3.4.1 calibration-corpus identity (catches importance-metric corpus drift). Both are preconditions on shipping. Neither is theoretical — both exist because the failures they prevent have already happened in this work and been measured.

Status of the next experimental wave. With the §4.1.3.4 metric fix landed, row 7 of §4.1.4 carries the v1 calibration-aware artifact (88.4 / 86.0). Row 7 v2 will add KL-distillation compensation LoRA on top of the calibration-aware student to attempt to close the residual −3.7 / −3.0 gap, paralleling the v2-7B §4.1.3.3 closure. The compensation step is currently blocked on a memory-architecture issue: at 30B class with both teacher and student on a single 32 GB GPU, transformers' caching_allocator_warmup pre-allocates an fp16 buffer equal to the model size before bnb 4-bit quantization takes effect, exceeding total VRAM even with both models nominally configured for 4-bit. The architecturally correct fix is offline teacher-logit precomputation: phase 1 loads the teacher alone in 4-bit and dumps (input_ids, logits) to disk on the calibration corpus, phase 2 unloads the teacher and frees the GPU, phase 3 loads the student alone in 4-bit and trains against the on-disk logits with the full GPU available. This rewrite is the prerequisite to v2 and is the next sentinel-ai sprint after the v1 publication.

4.1.4 The measurement, calibrated against an external anchor

The §4.1.1 reconstruction recovered a runnable v1.5 artifact, but a number from a single in-house pipeline is not yet a measurement we are willing to publish — the same discipline that produced the rest of this section requires that the eval pipeline itself be validated against a known-good reference before any number out of it can be trusted. Without that calibration step, a low v1.5 score is ambiguous between two failure modes (the model is degraded, or the eval is misconfigured), and our own attempt to evaluate v1.5 surfaced exactly that ambiguity in real time during the work for this paper: the first run produced a number that we briefly mistook for the model's true score before noticing that the eval setup had not been validated.

The reference we calibrate against is the Qwen2.5-Coder family's HumanEval / HumanEval+ pass@1 as reported by the model authors themselves in the Qwen2.5-Coder Technical Report (Hui et al., 2024, arXiv:2409.12186, Table 5). The report explicitly cites EvalPlus (Liu et al., 2023) as the evaluation framework and publishes the eval code at github.com/QwenLM/Qwen2.5-Coder, allowing us to match the methodology directly: same framework, same scoring, same prompt-format selection (raw completion for base models). The calibration question reduces to "does our local instance of EvalPlus reproduce Qwen's reported numbers on an unmodified Qwen2.5-Coder base model, within a few points of noise."

A constraint that shaped the calibration. We could not directly run the unmodified Qwen2.5-Coder-14B base model through our local pipeline as the calibration check, because of a structural VRAM issue on the eval host: a persistent ~1.6 GiB GPU reservation (almost certainly WSL2 paravirtualization holding state outside any visible CUDA process) reduced the effective free VRAM from 31.84 GiB to 30.2 GiB, which is below the budget needed to host the base 14B in fp16 (~28 GiB of weights) plus the KV cache and activation overhead vLLM requires for HumanEval. We document this as sentinel-ai#162. The honest workaround was to perform the calibration check on the 7B base model from the same Qwen2.5-Coder family, which fits comfortably in the available VRAM, and to use Qwen's published 14B numbers as the reference baseline for the v1.5 comparison without an independent local re-measurement of the 14B at fp16. The implications of this indirect calibration are discussed below; in short, it is strong but not strict, and the §4.1.2 narrative is written with the appropriate caveat.

The resulting table has six rows. Rows 1–2 are the calibration check; row 3 is the unmeasured-but-Qwen-published 14B reference used as the v1 comparison baseline; row 4 is the v1.5 measurement; row 5 is the methodology-validation forge of v2-7B; row 6 is the future canonical replacement.

Row Model Size Method Source HumanEval pass@1 HumanEval+ pass@1 Δ vs ref Notes
1 Qwen2.5-Coder-7B base — calibration anchor (Qwen-published) 7B unmodified base Qwen2.5-Coder TR Table 5 61.6% 53.0% (anchor) Calibration anchor. We anchor against the 7B because the structural VRAM constraint on the eval host (sentinel-ai#162) prevented a direct fp16 measurement of the 14B base.
2 Qwen2.5-Coder-7B base — our pipeline 7B unmodified base this work; vLLM + EvalPlus + --force-base-prompt + --bs 1, methodology matched to row 1 62.2% 53.7% +0.6 / +0.7 Pipeline calibration check. Both deltas inside ±3 pt tolerance; pipeline accepted as calibrated for the Qwen2.5-Coder family.
3 Qwen2.5-Coder-14B base — Qwen reference (not locally re-measured) 14B unmodified base Qwen2.5-Coder TR Table 5 64.0% 57.9% n/a Reference baseline used for the v1 comparison in row 4. Not directly re-measured through our pipeline because of sentinel-ai#162; the validity of using this number as the v1 comparison baseline rests on the calibration of rows 1–2 as evidence that the pipeline is reliable for the Qwen2.5-Coder family in general. See "Indirect calibration caveat" below.
4 v1 (= published continuum-ai/qwen2.5-coder-14b-compacted), measured via v1.5 14B L2-weight head importance, GQA-group slice defrag, q5_K_S this work; v1.5 dequantized reconstruction (bit-identical to v1; see §4.1.1), evaluated through the calibrated pipeline of rows 1–2 26.8% 25.0% −37.2 / −32.9 First-ever measurement of the published v1 model's coding ability. Honest because v1.5 is bit-identical to v1 (verified via torch.equal on the logits). Comparison baseline is row 3 (Qwen-published 14B), not a local re-measurement; see caveat.
5 v2-7B + compensation LoRA — methodology fully validated; §4.1.3.2 disconnect closed 7B layer-normalized activation-magnitude importance (per §4.1.3.1), 8 calibration samples, pad-mode defrag, q5_K_S, 500-step single-cycle LoRA forge, plus compensation LoRA via KL-on-logits distillation against the unmodified Qwen2.5-Coder-7B teacher (see §4.1.3.3) this work; v2-7B forge + compensation pass on Qwen2.5-Coder-7B base, evaluated through the calibrated pipeline of rows 1–2 61.0% 53.0% −1.2 / −0.7 vs row 2 (within the ±3pt calibration tolerance band) Methodology fully validated; the §4.1.3.2 PPL/HumanEval disconnect documented above is empirically closed by compensation LoRA via KL distillation. The forge was run with the layer-normalized activation-magnitude metric (per §4.1.3.1) which produced an uncompensated result of 54.9 / 48.8 — directionally validated against the broken global-flat baseline (50.0 / 44.5) but 7.3 / 4.9 below the base anchor (62.2 / 53.7), a residual gap that §4.1.3.2 documented as a structural limitation of the activation-magnitude metric (calibration distribution too narrow, optimizes for local fine-tuning loss not held-out generalization). A second pass — compensation LoRA via KL-on-output-logits distillation against the unmodified Qwen2.5-Coder-7B teacher, 500 steps, ~50 calibration examples drawn from a held-out task mixture, ~0.5% additional parameters wrapped as a LoRA adapter on the existing pruned student — recovered HumanEval to 61.0% and HumanEval+ to 53.0%, within 1.2 and 0.7 percentage points of the unmodified base 7B respectively, well inside the §4.1.4.1 calibration tolerance band of ±3pt. The compensation pass cost no additional VRAM at inference time (the LoRA is merged into the student weights at save time), preserved Layer 7 deployment-runtime compatibility (the compensated model still loads in llama.cpp and produces coherent code), and ran in less wall clock than the original forge. The full progression on the same Qwen2.5-Coder-7B base across four runs, each with one variable changed: broken global → 50.0 / 44.5; layer-normalized fix (§4.1.3.1) → 54.9 / 48.8; layer-normalized + multi-cycle → 46.3 / 41.5 (worse, exposing §4.1.3.2 disconnect); layer-normalized + compensation LoRA via KL distillation (§4.1.3.3) → 61.0 / 53.0 (essentially recovered). The methodology is fully validated by this progression; details and the loss-function ablation that surfaced the right form of distillation are in §4.1.3.3 below.
6 Qwen3.5-397B-A17B Instruct v2 — grid moonshot (the headline continuum demo) 397B total / 17B active MoE + hybrid attention/DeltaNet Pre-removable MoE expert pruning via selective safetensors load (never allocate dropped experts at any point), then grid expert sharding across the tailnet (continuum's distributed inference substrate), then layer-aware defrag for the hybrid attention layers (sentinel-ai#163 Strategy A), then per-shard GPU forge with layer-normalized activation-magnitude head-importance + pad-mode defrag + LoRA retraining + per-shard q5_K_S quantization, with cryptographic per-shard handoff via forge-alloy attestations grid moonshot; depends on row 7 (single-machine pipeline proven), the grid expert sharding work, and continuum's distributed inference substrate being end-to-end through the tailnet
7 Qwen3.5-35B-A3B Instruct v2 — single-machine flagship-class replacement and proof-of-pipeline for row 6 (queued) 35B total / ~3B active MoE + hybrid attention/DeltaNet CPU-first pre-removable MoE expert pruning (selective safetensors load — never allocate dropped experts), then layer-aware defrag for hybrid attention layers (sentinel-ai#163 Strategy A), then GPU forge with layer-normalized activation-magnitude head-importance + pad-mode defrag + LoRA retraining + q5_K_S quantization. Single-machine — no grid sharding required. this work; queued immediately after the v2-7B re-run closes, after cpu_expert_prune.py is verified per the read-it checklist, after sentinel-ai#163 Strategy A lands, and after the LiveCodeBench v6 calibration anchor is added to eval_with_calibration.py
(deferred / legacy) All previous Qwen3-Coder-* / Qwen2.5-Coder-* / smaller-Qwen3.5 row variants are intentionally not enumerated in this table. The methodology demonstration only requires two representative current-generation rows (a single-machine target and a grid moonshot, both at flagship-class) to establish that the substrate works across the scales the lab actually serves. Additional rows are forge work, not paper material; they may be added to a follow-up paper or to the public continuum-ai HuggingFace page as they ship, without re-publishing this paper. Forward-migration policy: when Qwen3.6 (or Qwen4, or whatever the next stable generation is) ships, rows 6 and 7 migrate mechanically to the new base — the methodology paragraphs in §4.1.3 / §4.1.3.1 / §4.1.3.2 are family-agnostic and do not change; only the row contents (model names, sizes, calibration anchor numbers) get refreshed.

The empirical claim. The published continuum-ai/qwen2.5-coder-14b-compacted artifact (v1), measured for the first time via its bit-identical reconstruction v1.5 through a pipeline calibrated against Qwen2.5-Coder-7B base (rows 1–2 above, ±0.6/±0.7 percentage points of Qwen's published values for the 7B), scored pass@1 = 26.8% (HumanEval) / 25.0% (HumanEval+). Qwen's published baseline for the unmodified Qwen2.5-Coder-14B base model is 64.0% / 57.9% (row 3). The compaction operation that produced v1 — L2-weight head importance, GQA-group slice defrag, q5_K_S quantization — therefore removed approximately 37.2 percentage points of HumanEval pass rate and 32.9 percentage points of HumanEval+ pass rate from the unmodified base model, a relative reduction of approximately 58% / 57% in pass rate. The published artifact lost more than half its coding capability through the compaction operation, and the loss was invisible to the v1 publication process because the validation framework that would have caught it did not exist.

The published artifact was therefore broken in four independent ways simultaneously, all invisible until the validation work in §4.1.1: (1) it could not be loaded by the dominant consumer-hardware inference runtime due to the q_proj invariant violation in Failure 1; (2) it shipped without a safetensors fallback (Failure 2); (3) the lab's own forge pipeline did not preserve the source weights, making the pre-quantization model permanently unrecoverable (Failure 3); and (4) the published HF repository was missing tokenizer_config.json entirely, breaking apply_chat_template() for any downstream user (Failure 4). And, separately from the four publication-time failures, the underlying weights had lost approximately 58% of their HumanEval pass rate through the L2-weight defrag operation — a quality failure that, unlike the four publication failures, would have been invisible even with perfect pre-publication validation of the artifact files, because no eval was run by the v1 publication step at all. Five independent failure modes in one published artifact, each prevented going forward by a different element of the framework documented in VALIDATED-TENSOR-SURGERY and the §4.1.3 methodology revisions.

Indirect calibration caveat. The 14B reference baseline used in the comparison above is Qwen's published value, not a local re-measurement of the 14B base through our pipeline. The 7B calibration (rows 1–2) provides strong evidence that our pipeline is reliable for the Qwen2.5-Coder family in general — it reproduces Qwen's published 7B numbers within ±0.7 percentage points on both metrics — but it does not directly verify that the same pipeline produces an identical reproduction of the 14B at fp16. If our pipeline has a bug that is specific to the 14B's architectural configuration (40 Q heads / 8 KV heads / group size 5, vs. the 7B's 28 / 4 / 7) or to weight-loading paths exercised only at the larger scale, the 7B calibration would not catch it. In practice, vLLM's Qwen2 code paths are uniform across model sizes within a family and we are not aware of any size-specific bugs that would produce a multi-point reproduction error, but the strict version of the calibration discipline would require a direct fp16 measurement of the 14B base, which we cannot perform on the current eval host without first resolving sentinel-ai#162. We therefore qualify the empirical claim above as "the gap between v1.5 and the Qwen-published 14B baseline, with the pipeline transitively calibrated through the 7B." A reviewer who finds the indirect calibration insufficient should treat the headline 58% relative-drop number as having an additional ~3 percentage points of uncertainty (the calibration tolerance band), giving a defensible "between 55% and 61% relative drop" range. Even at the most conservative end of that range, the original abstract's "while maintaining coding capability" claim is contradicted by approximately a factor of 2.

Removing the indirect-calibration caveat. The cleanest path to a directly-measured 14B baseline is to load the unmodified Qwen2.5-Coder-14B base in bnb-8bit (via BitsAndBytesConfig(load_in_8bit=True)), which roughly halves the weight memory footprint (~14 GiB vs ~28 GiB for fp16) and fits within the 30.2 GiB ghost-constrained budget with comfortable headroom for KV cache. The bnb-8bit measurement introduces its own ~1–2 percentage point quantization noise relative to fp16, which is small compared to the 37-point gap we are reporting. A bnb-8bit local re-measurement of the 14B base would close the indirect-calibration caveat to within the bnb noise envelope and is queued as a follow-up. It is not blocking the §4.1.2 deprecation narrative as written, but it is the cleanest defensive move against a hostile reviewer.

Calibration tolerance. We allow ±3 percentage points between an anchor row and the corresponding pipeline row before declaring the pipeline calibrated for that model class. The 3-point band is roughly the variation we observe across published HumanEval numbers for the same model from different evaluation toolchains (bigcode-evaluation-harness, EvalPlus, custom in-house scripts), and it is wide enough to absorb the noise from differences in greedy-decoding tie-breaking, tokenizer fast-vs-slow paths, and EvalPlus version drift, but narrow enough to detect a structurally broken pipeline. The actual 7B calibration delta of +0.6 / +0.7 is well inside the tolerance band, which is consistent with a pipeline that is reproducing Qwen's methodology accurately rather than diverging in a structural way.

Pipeline determinism. The calibration check (rows 1–2) is deterministic at fp16 precision: two independent runs of the same EvalPlus invocation against the unmodified Qwen2.5-Coder-7B base, separated in time and run as part of separate workflows (the standalone calibration run and the calibrated eval_with_calibration.py wrapper used for the v2-7B methodology validation), produced bit-identical pass@1 numbers in both directions (62.2% / 53.7% on both invocations, with the same +0.6 / +0.7 delta against the published anchor each time). This is a stronger property than "within tolerance": the pipeline produces the same number, not merely a near-enough number, when re-run on the same input. Determinism at this layer matters because it converts the calibration check from a noisy estimate ("we ran it once and trusted it") into a true gate ("a third party who reproduces this protocol on the same hardware will recover the same number to the digit"). It also means that any future drift in row 2 (e.g., after a vLLM upgrade, an EvalPlus version bump, or a tokenizer fast/slow path change) is immediately detectable as a regression rather than being absorbed into the tolerance band, because the baseline expectation is the exact value, not a range. We adopt the determinism observation as a load-bearing property of the calibration discipline going forward: any pipeline that is not deterministic on this check is rejected and replaced before being trusted as a measurement device for any of the rows below.

Structural validation of v2-7B (Layer 7 gate). Independently of the EvalPlus numerical result reported in row 5 below, the v2-7B forge artifact was validated structurally through the deployment-runtime gate documented in VALIDATED-TENSOR-SURGERY. The forged q5_K_S GGUF (5.3 GB on disk) loads in llama.cpp at 4.7 GB VRAM on an RTX 5090, generates at 213.6 t/s, and produces qualitatively coherent code on the smoke prompt (an iterative-Fibonacci implementation that handles the n ≤ 0, n == 1, and n == 2 edge cases before setting up the iterative loop). All four end-to-end methodology elements from §4.1.3 — activation-magnitude importance (with 8 calibration samples), pad-mode physical head removal, q5_K_S quantization preserving the post-pad layout, and llama.cpp deployment compatibility — are validated by the gate passing on a real forged model rather than only on the test fixtures used to develop them. The structural claim of §4.1.3 — the corrected methodology produces a runnable artifact end-to-end — is therefore independent of the row 5 EvalPlus number and is supported by this gate alone. The row 5 number quantifies how good the resulting model is; the gate proves that the methodology works at all.

Status of the rows at time of writing. Rows 1, 2, 3, and 4 are in. Row 5 has its structural validation in (Layer 7 gate passed; see preceding paragraph) and its EvalPlus measurement pending (run in flight at time of writing, ~3 min ETA on the calibrated pipeline). Row 6 is deferred to future work. The §4.1.2 deprecation narrative is fully supported by rows 1–4 (with the indirect-calibration caveat). The §4.1.3 methodology-validation structural claim is supported by the v2-7B gate paragraph above; the §4.1.3 methodology-validation quantitative claim depends on the row 5 EvalPlus number and will be filled in when it lands, with the v2-7B comparison anchor (row 1) already directly measured and deterministic.

4.1.4.1 The right baseline: per-hardware-tier Pareto comparison, not "vs base on full hardware"

The rows above measure forge results against the unmodified base model running at full precision on hardware that fits the full base. That comparison answers a methodological question — "how much capability does the forge preserve relative to the base" — but it does not answer the product question that determines whether the forge has consumer value: "on hardware where the base model would not fit, does our forge produce a better artifact than the alternative compaction methods (flat-precision quantization, random or magnitude-based pruning) the user would otherwise run at the same hardware tier?"

The base-vs-forge comparison is the wrong baseline by this criterion. Users who can run the base model on its full hardware do not need the forge — they run the base. Users who cannot run the base on their hardware run something else (typically a Q4 or Q3 GGUF quantization of the base, a smaller model from the same family, or a different model entirely). The forge's product value is whether it gives those users a better option than what they'd otherwise have at the same VRAM tier and tokens-per-second tier. The competitive surface is multi-axis: VRAM, tokens per second, capability on the user's task, supported context window, and supported modality (text/vision/audio). The forge toolkit composes across all of these axes (head pruning + quantization + RoPE-based context extension + modality fusion), and a "win" at any of them — at any hardware tier — is a product win even if the absolute capability is below the base model running on bigger hardware.

This is a substantially different framing than the rest of §4.1.4 and the rest of the paper has been using, and it changes which experiments are load-bearing for the methodology paper's claim. The §4.1.4 rows 5–7 measured the methodology against the wrong baseline.

The objective function, in pseudomath: value = (quality + context_window_bonus) / vram. The forge wins on a given hardware tier if it produces an artifact with a higher value-ratio than the best alternative compaction method at that tier. Quality is the held-out task pass rate (HumanEval, LiveCodeBench, MMLU, etc., or a weighted combination depending on the user's workload). VRAM is the GB footprint at inference time. Context window bonus is a workload-dependent additive term that captures the value of supporting long context — near-zero for short-prompt workloads (single-function code completion), large for long-document workloads (summarization, multi-file refactoring, agentic coding). Tokens-per-second is a third axis worth tracking separately (a forge that delivers 1.3× the inference speed at the same quality/vram is a Pareto win on the latency axis even at unchanged disk size). The forge toolkit's product value is moving artifacts up on this multi-axis frontier in directions pure quantization alone cannot — because the forge composes head pruning + quantization + RoPE-based context extension + modality fusion on the same substrate, where pure quantization can only move you down-and-right on the quality/vram plane.

The right per-tier comparison table looks like:

Hardware tier (approx VRAM) Best alternative compaction (base + standard quantization) Our forge artifact Verdict
~5 GB base 7B Q5_K_M (~5.0 GB, ~60% HumanEval, ~baseline t/s) v2-7B q5_K_S (5.3 GB, 54.9 HumanEval, 213.6 t/s on 5090) Currently Pareto-dominated at this tier; t/s comparable; capability lower; size comparable
~3.5 GB base 7B Q3_K_M (~3.5 GB, ~55-58% HumanEval, faster t/s than Q5) v2-7B Q3_K_M (estimated ~3 GB, untested) Decisive experiment: if the forged model holds capability under Q3 better than the base does under Q3, the forge buys a Pareto frontier shift that pure quantization cannot; if not, the methodology as currently constituted is Pareto-dominated at this tier too
~2.5 GB base 7B Q2_K (~2.5 GB, capability noisy and degrading) v2-7B Q2_K (estimated ~2 GB, untested) Same decisive question at the smaller tier
~2 GB and below sub-Q2 quantization (typically unusable for code) or smaller base model from a different family v2-7B at extreme quantization, untested Tier where forge value is most likely to manifest because base quantization breaks down

The aggressive-quantization test (Q3_K_M and Q2_K on v2-7B) is the immediate experiment that decides whether the methodology has product relevance. It is quantization-only (no re-forging), runs in ~10-15 minutes wall clock on the existing v2-7B artifact, and produces numbers that drop directly into the table above. The result either validates the forge as a Pareto frontier shift (forged model holds capability at aggressive quantization better than base does at the same aggressive quantization, because the surviving heads have absorbed the work of removed heads and tolerate more aggressive precision compression) or shows the forge is Pareto-dominated even at the smallest tiers (in which case the methodology needs the held-out-aware calibration refinement from §4.1.3.2 or the slice-mode artifacts via the llama.cpp upstream patch before it has product value).

Tokens per second is a separate axis worth measuring for the same table. Pruned models should be faster at inference at the same precision because there is less compute per token (fewer attention heads computing, fewer KV cache reads). The v2-7B's 213.6 t/s on the 5090 is a measured number that belongs in the comparison; the corresponding base-7B-at-q5_K_S t/s on the same hardware is the baseline that completes the second axis. Even at unchanged disk size, a forged model that delivers 1.3× the inference speed of the base at the same quantization is a Pareto win on the t/s axis for users whose constraint is conversational latency, not VRAM.

Implications for rows 6 and 7. The same per-tier comparison framing applies to the canonical replacement targets (Qwen3.5-35B-A3B Instruct, Qwen3.5-397B-A17B Instruct grid moonshot). The relevant baseline for those rows is not the base flagship at full precision (most users cannot run that anyway) — it is the best Q4/Q3 quantization of the same base that fits the user's hardware. Both rows should be measured against per-tier alternative-compaction baselines, not against the unmodified base, and the §4.1.4 table when those rows land should follow the per-hardware-tier structure introduced here rather than the original "v2 vs base on full hardware" structure that the §4.1 historical narrative used. The original structure stays in §4.1.4 rows 1–7 as the historical record; this subsection (§4.1.4.1) defines the forward measurement structure for any future row added to the table.

What this section does and does not change about the §4.1.3 / §4.1.3.1 / §4.1.3.2 findings. The methodology revisions in §4.1.3 (layer-normalized importance, pad-mode defrag, MoE expert pruning) and the findings in §4.1.3.1 (depth-bias) and §4.1.3.2 (PPL/HumanEval disconnect) are unchanged. They are valid descriptions of the methodology's behavior under the experiments we ran. What changes is the interpretation of the row 5 number (54.9 / 48.8): under the "vs base on full hardware" framing, it reads as "the methodology recovers 88% of base capability and the residual 12% gap is structural." Under the "vs alternative compaction at the same hardware tier" framing, it reads as "the methodology produces a 5.3 GB artifact with 54.9 HumanEval, which is currently Pareto-dominated by the base at q5_K_M at the same tier; the question of whether the methodology has product relevance is decided by aggressive-quantization tests on the existing v2-7B artifact, which are queued and small-cost." Both framings are honest; the second is the one that determines whether the forge is a useful product.

The methodology is pluggable, not monolithic. The forge pipeline is correctly understood as a substrate that hosts swappable algorithm choices along four orthogonal axes:

Axis Examples
Importance metric activation magnitude, gradient magnitude, activation × gradient (saliency), L2 weight norm, learned importance from a small meta-model, held-out-aware activation
Calibration distribution local fine-tuning data, single held-out task, multi-task mixture, model author's published evaluation set, user's deployment workload sample
Selection rule global ranking, per-layer budget (§4.1.3.1), per-layer budget with hardware-aware invariant constraints (q_proj invariant from §4.1.3 / Finding 6 in VALIDATED-TENSOR-SURGERY), MoE expert ranking, learned routing-based selection
Training schedule single-cycle, multi-cycle plasticity, progressive prune-with-retraining, prune-first-then-train (the §4.1.3 fix), train-first-then-prune (the v1 broken ordering)

The four-metric comparison harness from VALIDATED-TENSOR-SURGERY §3.4 (L2 vs gradient vs activation vs activation × gradient) is the v0 of this pluggable architecture for axis 1; the cleanest forward direction is to formalize axes 2–4 as the same kind of pluggable backends and re-run the comparison across the product space (importance metric × calibration distribution × selection rule × training schedule). The result is a strategy × model × hardware-tier comparison table that shows which strategy wins where on the per-tier (quality + context_window_bonus) / vram metric, and the methodology paper's central claim shifts from "we found the right importance metric" (which §4.1.3.2 falsified at the task level for the activation-magnitude metric as currently constituted) to "we built a substrate for plugging in importance metrics, calibration distributions, and selection rules, and iterating on them with cheap measured comparison; here is the strategy-vs-tier table we have so far, and here are the cross-strategy invariants we have discovered."

Under this pluggable framing, the §4.1.3.1 and §4.1.3.2 findings re-read as cross-strategy invariants that constrain the strategy space rather than as failures of any particular strategy. §4.1.3.1 says: any strategy whose importance signal scales with layer position will exhibit a depth-bias on deep models, and per-layer normalization is required at the rank step regardless of which underlying metric is used. §4.1.3.2 says: any strategy whose importance signal is computed against a calibration distribution close to the fine-tuning data will optimize for local fine-tuning loss rather than held-out task generalization, and held-out-aware calibration is required regardless of which underlying metric is used. Both are invariants — they apply to all strategies in the strategy space, not just to the activation-magnitude metric we tested first — and the methodology paper's contribution is the discovery of those invariants together with the substrate that lets future strategies be tested against them quickly.

The minimum bar for any pluggable strategy on this substrate is "do meaningfully better than random" — i.e., produce a forged artifact whose (quality + context_window_bonus) / vram exceeds what naive flat quantization at the same hardware tier produces. The bar is not "match the base model on full hardware," which is the wrong baseline for the product-relevant question. The aggressive-quantization Pareto test queued above is the immediate experiment that decides whether the current best strategy in our strategy space (1-cycle layer-normalized activation-magnitude with single-distribution coding calibration) clears that bar at any hardware tier; subsequent experiments iterate on the pluggable axes (held-out-aware calibration first, since it tests the §4.1.3.2 fix; then held-out-aware metric reformulation; then learned importance) until at least one strategy clears the bar at every tier the lab cares about.

4.1.4.2 Per-tier Pareto comparison: the activation-magnitude forge does not beat base+quant on dense Qwen2.5-Coder-7B at any tier we measured

The aggressive-quantization Pareto test from §4.1.4.1 has now been run end-to-end on the v2-7B forge artifact and the unmodified Qwen2.5-Coder-7B base, with both models quantized to the same three quant levels (Q5_K_S, Q3_K_M, Q2_K) via the same llama.cpp toolchain, and both evaluated on HumanEval pass@1 through the same patched vLLM-GGUF backend that anchored against Qwen's published 61.6 / 53.0 to within +0.6 / +0.7 (deterministic across five independent runs). The result is the cleanest possible per-hardware-tier comparison the forge methodology has been measured on, and it is unambiguous in its direction:

Tier (VRAM) Variant HumanEval pass@1 HumanEval+ pass@1 quality / vram Winner at this tier
5.0 GB (Q5_K_S) v2-7B forge 55.5 48.8 11.10
5.0 GB (Q5_K_S) base 7B + quant 63.4 55.5 12.68 base by +14% on the ratio
3.6 GB (Q3_K_M) v2-7B forge 54.3 47.6 15.08
3.6 GB (Q3_K_M) base 7B + quant 59.8 53.7 16.61 base by +10% on the ratio
2.9 GB (Q2_K) v2-7B forge 42.7 39.6 14.72
2.9 GB (Q2_K) base 7B + quant 43.3 41.5 14.93 tie within run noise (+1.4% on the ratio)

Base Qwen2.5-Coder-7B + standard llama.cpp quantization Pareto-dominates the v2-7B forge at every tier we tested. The closest the forge gets to parity is Q2_K, where the gap collapses to 0.6 absolute points and 1.4% on the quality/vram ratio — within run noise. At Q5_K_S and Q3_K_M the gap is decisively in favor of base+quant. By the §4.1.4.1 product-relevance criterion, the v2 forge methodology as currently constituted does not produce a useful product on the Qwen2.5-Coder-7B family.

This is the empirical answer to the §4.1.4.1 question of whether the per-tier framing rescues the methodology. It does not. The methodology is structurally sound (every Layer 7 gate passed, every artifact loads in llama.cpp at the expected speed, every measurement is calibrated), but it produces artifacts that are dominated by the no-effort alternative llama-quantize. Three independent failure modes have now been ruled out as fixes, all on the same v2-7B base, with the same calibrated pipeline:

Hypothesis Experiment Result
500 steps was insufficient training; multi-cycle recovery helps 1000 steps × 3 cycles = 3000 total steps with progressive 10%/cycle prune WORSE — HumanEval dropped from 54.9 to 46.3 (sentinel-ai#165 Run 3). Internal perplexity improved (2.10 → 1.77) but task capability fell.
Calibration distribution overfits to local fine-tuning loss; held-out-aware code calibration helps (the §4.1.3.2 hypothesis) Same forge with --calibration-source code (hand-written HumanEval-format prompts, NOT drawn from any benchmark) NO IMPROVEMENT — HumanEval landed at 53.7, slightly below 54.9, within run noise.
Aggressive quantization tier shifts the comparison in the forge's favor (the §4.1.4.1 hypothesis) v2-7B and base 7B both quantized to Q5_K_S / Q3_K_M / Q2_K, evaluated through the same vLLM-GGUF backend NO — base+quant wins at every tier (this section).

Each of these hypotheses was the leading explanation for the residual gap at the moment it was tested. Each was tested cheaply, in ~30 minutes of wall clock, with the cause-of-the-gap discipline that §4.1.3.1 documented. Each was falsified by the disciplined comparison rather than validated by the win we hoped for. Three negative results in succession on three independent fix candidates is now strong empirical evidence that the activation-magnitude head-pruning + LoRA-recovery approach does not have a Pareto-improving sweet spot for dense base models that already have good quantization options, regardless of which knob in the strategy space (training duration, calibration distribution, quant tier) is tuned.

4.1.4.3 What this implies for the forge's product positioning

The §4.1.4.2 result is not a failure of the forge as a substrate; it is a failure of one specific strategy choice on the substrate (activation-magnitude importance + per-layer normalization + pad-mode defrag + single-distribution LoRA) to clear the product bar on a particular kind of target (dense base models with good off-the-shelf quantization). The substrate itself — pluggable importance metrics, pluggable calibration distributions, pluggable selection rules, the calibrated measurement pipeline, the deployment-runtime gate, the no-fallback discipline, the cryptographic provenance via forge-alloy — is unchanged and validated by the very fact that it produced a sharp negative result quickly and disciplined enough to publish.

What §4.1.4.2 does change is the forge's product positioning. There are now exactly two product positions where the forge has a defensible value proposition, and dense Qwen2.5-Coder-7B-class models with good Q3/Q2 quantizations are not one of them:

  1. Distillation-first compaction for any base model (dense, MoE, hybrid) that has a good teacher available. This is the §4.1.5 pivot below — instead of "prune + retrain against fine-tuning loss" (the v2 strategy that §4.1.4.2 just falsified), the forge's primary mechanism becomes "distill into a small adapter against the unmodified teacher's hidden states", which is the structural fix candidate for the §4.1.3.2 PPL/HumanEval disconnect that today's three negative results converged on.

  2. Pre-removable expert pruning + structural compaction for MoE / hybrid attention / oversized models that cannot be reached at all by base + quantization, because the base does not fit on the target hardware even at Q2_K and the user therefore has no "good Q3/Q2 quantization of the base" alternative. The Qwen3.5-35B-A3B (target A) and Qwen3.5-397B-A17B (target B grid moonshot) work falls in this category — the relevant comparison is not "v2 forge vs base+quant at the same tier" because base+quant at the same tier is a model the user cannot run. The relevant comparison is what's the best model the user can run at all on this hardware, and the forge's value is making models reachable that the alternative compaction methods cannot reach.

The §4.1.4 table going forward should be read in this two-position framing. Rows 1–4 (v1 deprecation case study) and row 5 (v2-7B methodology validation) stay as the historical record of how the lab learned to ask the product-relevant question. Rows 6 and 7 (the canonical replacement targets) move out of the §4.1.4.2 dense-model failure mode by virtue of being in product position #2 — they are MoE/hybrid models where base+quant cannot compete because base+quant cannot fit. The forge's job for those rows is to enable inference at all, not to match a base+quant alternative, which is a fundamentally easier bar.

The dense-model forge work is suspended until distillation-first lands. Tonight's three negative results are sufficient evidence that the v2 head-pruning strategy is not the right strategy for dense Qwen2.5-Coder-class targets. The next experiment in the dense-model branch is the compensation LoRA / teacher-distillation work introduced in §4.1.5 below; until that experiment produces a number, no additional dense-model forge runs will be produced through the head-pruning + retraining strategy.

4.1.5 Distillation-first compaction: the next-iteration methodology proposal

The empirical pattern across §4.1.3.1, §4.1.3.2, §4.1.4.1, and §4.1.4.2 is consistent enough to motivate a structural pivot in the forge's primary compaction mechanism, and the substrate work to support that pivot has already landed. This subsection introduces the proposal and points at the implementation; the empirical validation is queued and the §4.1.5 results paragraph will be added when the first measured number on a real model lands.

The pivot. The v2 methodology validated through §4.1.3 / §4.1.4 is "structured head pruning + LoRA recovery against a fine-tuning loss." The three negative results in §4.1.4.2 falsify the hypothesis that this strategy has a Pareto-improving sweet spot on dense base models with good quantization options. The pivot is to invert the dependency: instead of pruning first and recovering against a local fine-tuning loss (which §4.1.3.2 showed optimizes for the wrong objective), distill first against the unmodified teacher's hidden states (which by construction encode the full distribution of held-out task behavior, not just the local fine-tuning loss). The student model is the pad-mode-pruned forge artifact (which is structurally sound but capability-degraded per §4.1.4.2); the teacher is the unmodified base; the supervisor signal is the teacher's per-layer hidden state magnitudes (MSE) and/or output logits (KL); the trainable surface is a small LoRA adapter wrapped around the student (~few hundred MB at LoRA rank 8). The output is a compensated student whose per-layer activations are constrained to track the teacher's, which means the held-out task circuits the teacher uses to solve HumanEval / LiveCodeBench / MMLU are preserved by construction in the compensated student even though the underlying weights have been compacted.

Why this structurally addresses §4.1.3.2. The PPL/HumanEval disconnect surfaced in §4.1.3.2 happens because activation-magnitude head importance computed on a local calibration set selects heads that are unimportant for the local distribution but load-bearing for held-out task circuits. Per-layer normalization (§4.1.3.1) fixes the depth-bias but not the distribution bias. Held-out-aware calibration data was tested in §4.1.4.2 and produced no meaningful improvement, which we interpret as evidence that the bias is structural to the metric and not to the calibration data alone — the metric optimizes for "what does this attention head's output magnitude look like" rather than "what would this attention head's removal cost the held-out task." Distillation against the teacher's hidden states inverts this: the supervisor signal is the teacher's actual contribution to its full output distribution at every layer, which is the ground-truth answer to "what would removal of this contribution cost." The student adapter learns to recover whatever the teacher actually uses, including the parts the activation-magnitude metric was missing.

The substrate is already in place. The compensation LoRA implementation lives at sentinel-ai/scripts/compensation_lora.py (~510 lines) with a CPU-runnable smoke test at sentinel-ai/scripts/test_compensation_lora.py (~290 lines, runs in ~30 seconds on a Mac CPU using distilgpt2 as both teacher and student) and a design document at sentinel-ai/docs/COMPENSATION-LORA-DESIGN.md. The smoke test was run during the writing of this section and passed all five stability checks: (1) tokenizer alignment between teacher and student is bit-identical; (2) per-layer hidden state magnitudes between teacher and pad-mode-pruned student are within 2× across all layers (max ratio diff 0.31); (3) loss decreased monotonically over the 30-step smoke training (3.42 → 2.08, -39.35% relative); (4) per-layer losses are balanced within 3.14× of the median (no single layer dominating the gradient); (5) no NaN/inf in the gradient stream. The math is sound at distilgpt2 scale; the implementation isn't broken; the production scale-up to Qwen2.5-Coder-7B is unblocked by running the script, not by writing more code.

The §4.1.5 experiment, queued. Take the v2-7B artifact from row 5 of §4.1.4 (HumanEval 54.9 / 48.8, the structurally pruned but disconnect-degraded forge), apply compensation LoRA with Qwen2.5-Coder-7B base as the teacher and a held-out-aware calibration mixture as the training data, measure HumanEval pass@1 through the same calibrated pipeline that anchored the row 5 measurement. The success criterion is HumanEval pass@1 ≥ 58.0 (a 3-point improvement over the row 5 baseline, just outside the ±3-point calibration tolerance band, which is the smallest improvement we can claim with confidence given the pipeline noise envelope). If the experiment lands at or above that bar, distillation-first compaction is empirically validated as the structural fix for §4.1.3.2 and the §4.1.5 results paragraph is written from the measured number; the dense-model forge branch is unfrozen and §4.1.4 row 5 gets a successor row showing the compensated artifact. If the experiment lands below the bar, the failure mode escalation path documented in COMPENSATION-LORA-DESIGN.md is followed: the next iteration is a cross-layer skip path that requires modifying the HF model class to expose intermediate residual streams, which is more invasive but a strict generalization of the compensation LoRA pattern.

Independence from the moonshot work. Distillation-first is the dense-model branch's path forward and is independent of the MoE/hybrid moonshot branch (Qwen3.5-35B-A3B, Qwen3.5-397B-A17B grid). The two branches are not in tension and do not block each other: the MoE moonshot work uses cpu_expert_prune_v2.py for its primary compaction and the existing forge pipeline for its retraining, and is in product position #2 from §4.1.4.3 where the per-tier dense-model failure mode does not apply. Compensation LoRA can be layered on top of any forged MoE artifact in a follow-up experiment if the moonshot result also exhibits the §4.1.3.2 disconnect, but it is not a prerequisite for the moonshot landing. The two branches advance in parallel.

4.2 Qwen3.5-27B-Base (Held — two blockers)

Target Device Memory Budget Expected Size Quantization Mix
MacBook Air 16GB 11 GB ~11 GB Q3_K_S dominant
MacBook Pro 32GB 16 GB ~16 GB Q4_K_M dominant
RTX 5090 32GB 28 GB ~28 GB Q8/BF16 mixed

The Qwen3.5-27B-Base forge is held pending two separate prerequisites, both surfaced by the validation work in §4.1 and by inspection of the current Qwen3.5 architecture. First, the methodology revisions of §4.1.3 (activation-magnitude importance, pad-mode physical head removal, q_proj invariant preservation, content-addressed intermediate retention) must be in place before any v2 forge. The v2-7B run reported in row 5 of §4.1.4 is the methodology validation for this prerequisite at 7B scale; the equivalent 14B work depends on the bnb-aware pad-defrag tracked in sentinel-ai#160 before it can run at fp16-class footprint on consumer hardware. Second, and specific to the Qwen3.5 family, the architecture is a hybrid of full self-attention layers and Gated DeltaNet (linear-attention) layers, with the language-model layer mix expressed as 8 × (3 × DeltaNet → 1 × FullAttention) — 24 linear-attention layers and 8 full-attention layers in a 3:1 alternating pattern, plus a multimodal vision tower. The current defrag_inline.py (per VALIDATED-TENSOR-SURGERY Finding 3) assumes uniform attention layers and would corrupt the DeltaNet layers if invoked on them. A layer-aware defrag that reads config.layer_types[i] and operates only on full_attention layers (skipping DeltaNet entirely and treating the vision tower as out-of-scope) is the prerequisite blocker for any Qwen3.5 forge at all. This is filed as a follow-up to sentinel-ai#160 and is the next item on the post-Path-A work order.

When both blockers are resolved, the 27B will be forged with the corrected pipeline and the device-tier table above will be populated with measured numbers. The forge will use the same methodology as v2-7B (activation-magnitude importance, pad-mode defrag) restricted to the 8 full-attention layers, with the 24 DeltaNet layers compacted via a separate mechanism (TBD) or left uncompressed and absorbed into the quantization step. The original gate-gradient artifact from the historical 27B forge attempt is preserved as a record but will not be used for the v2 forge, because §4.1.3 supersedes the gate-gradient method.

A side benefit of the Qwen3.5 architecture as a forge target: the hybrid attention/linear-attention layer mix is exactly the architecture class that EXPERIENTIAL-PLASTICITY §3.4 predicts will respond more strongly to plasticity than pure-attention architectures of the same parameter count, because attention is scarce (only ~25% of layers carry full attention) and each pruned attention head therefore has a proportionally larger structural impact. The Qwen3.5-27B-Base forge, when it lands, will be the first empirical test of that prediction at the HumanEval-eval level rather than the perplexity-only level. We do not pre-commit to which direction the result will go; the test exists to validate or falsify the prediction, and either outcome is reportable.

5. Related Work

  • GPTQ (Frantar et al., 2022): Post-training quantization using Hessian information. Uniform precision.
  • AWQ (Lin et al., 2023): Activation-aware weight quantization. Per-channel but not per-head.
  • SparseGPT (Frantar & Alistarh, 2023): Unstructured sparsity via optimal brain surgeon. Does not produce smaller models.
  • LLM-Pruner (Ma et al., 2023): Structured pruning based on gradient information. Similar motivation but uses a separate profiling pass rather than piggybacking on LoRA training.
  • Wanda (Sun et al., 2023): Pruning by weights and activations. Unstructured. Wanda's central claim is that activation × weight-magnitude saliency outperforms either signal alone. In our four-metric comparison (see VALIDATED-TENSOR-SURGERY §3.4) we found that the analogous activation × gradient saliency underperformed pure activation magnitude on our model class by approximately a factor of 2.6 in post-prune perplexity, with the gap stable across 8 and 64 calibration samples. This is a single-model result and we flag it as such, but it is qualitatively in the wrong direction relative to the Wanda framing and we report it as a falsifiable challenge to the saliency-based pruning framing in the LLM-inference regime; we welcome replication on other model families.

Our contribution: utilization-aware compaction that (a) requires no separate profiling pass, (b) produces genuinely smaller models via physical head removal, (c) targets specific device memory budgets with mixed quantization, and (d) integrates naturally into the LoRA fine-tuning workflow. We also contribute three engineering findings from the validation of our own published artifact (§4.1): the deployment-runtime architectural-invariant bug class (Finding 6 in the validation paper), the substrate-level reproducibility requirement for forge intermediate weights (forge-alloy#11), and the empirical activation-vs-saliency comparison (§4.1.3 above).

6. Conclusion

The original contribution of this paper is the utilization-aware compaction pipeline of §2: drive both structured pruning and mixed-quantization precision allocation from a single per-head importance signal, producing device-targeted model artifacts from one training run. The pipeline as described in §2 produced the published 14B compacted model, a 67% size reduction from the BF16 baseline.

The discovery process of §4.1 changed the recommended form of the pipeline in two ways. First, the original importance signal — gradient magnitude captured during LoRA training — was outperformed by approximately a factor of seven by activation magnitude measured on as few as eight calibration samples, with the activation metric also requiring no trainer instrumentation. Activation magnitude with a small calibration set is the recommended default for v2 of the pipeline, and gradient capture is retained only as a fallback for environments where forward hooks are unavailable. Second, the physical head pruning step must respect deployment-runtime architectural invariants — concretely, q_proj_out == hidden_size for any artifact intended for llama.cpp deployment — or the validation harness must declare the deployment target unsupported. The published v1 14B artifact violated this invariant silently and is being deprecated and replaced.

The deeper lesson, which is the contribution we most want the reader to take from this paper alongside the engineering result, is that publishing a model is not the same as validating a model. The pipeline that produced our v1 artifact was technically correct by every test we had at the time, and the artifact still failed catastrophically when we attempted to validate it in the runtime its users would actually deploy it in. Our own first failure is the strongest empirical evidence we can offer for the validation framework in the companion paper, and we offer it openly because the shape of the failure is the same shape every published-but-unvalidated model in the field is currently exposed to.

The pipeline's value is in the Continuum ecosystem, where models are continuously fine-tuned for specific roles and each training session refines the substrate that the next compaction depends on. The substrate now includes content-addressed retention of intermediate forge stages (forge-alloy#11) and a deployment-runtime load-test layer in the validation harness (VALIDATED-TENSOR-SURGERY Layer 7), so that the failure modes of §4.1 cannot recur silently.

Acknowledgments

Built on the Continuum collaborative AI training system. Gate gradient capture integrated into the Academy training pipeline. Compaction engine implemented in Rust using the safetensors and candle crates.

References

[To be populated with full citations]