This project takes the Rectified Spectral Units (ReSU) primitive from Qin et al. 2026, a biologically inspired, backprop-free, self-supervised feature extractor, and stress-tests it on Atari games (Pong, Breakout, Space Invaders, and later Asteroids and Enduro). The experiments run in numbered stages, each designed as a falsifiable test of a specific claim about ReSU.
| Pong (BC clone) | Breakout (BC clone) | Asteroids (BC clone, stochastic expert) |
|---|---|---|
| ![]() | ![]() | ![]() |
The ReSU encoder is fit by closed-form SVD on past-future input covariances (~2 seconds, no backprop). A small MLP head trained on 60k frames of hand-crafted-expert demonstrations drives the policy. On Pong the clone roughly matches its -13.4 expert; on Breakout it clears the first wave of bricks. The Asteroids clone is a Stage 11 stress test: vector-style graphics, ~1% non-black pixels, and a stochastic sweep-and-fire expert whose actions the clone predicts only marginally above chance (val acc 0.25 on a 7-class output, where uniform chance is about 0.14). The agent still fires while drifting and scores points.
The headline calibrated finding after all the work: ReSU's unique technical contribution is a closed-form, deterministic, forgetting-free encoder that adapts during deployment via streaming covariance updates + SVD. Its other claimed advantages (deep stacking, beating backprop CNNs) do not generalize from the paper's 1-D Drosophila demo to Atari.
This README summarizes everything; for deeper context see PLAN.md (initial
strategy) and RESULTS.md (running record).
| # | Stage | What it tests | Result |
|---|---|---|---|
| 0 | Reproduce paper code | Math validation | Top-10 canonical correlations match reference to 1e-8 |
| 1 | Refactor as resu_core | Reusable primitive | OU + dead-leaves both reproduce paper's L1/L2/L3 story |
| 2 | 2-D conv-style encoder on Atari | Does CCA generalize from 1-D to 2-D? | Ball velocity R² jumps 0.59 (raw) → 0.92 (ReSU-L2). Encoder captures motion. |
| 3 | Action conditioning at layer 1 | Where should action enter? | Per-pixel injection crushes the action signal; conclusion: keep encoder obs-only, action goes to policy head |
| 4a | Behavior cloning on Pong | Are ReSU features good enough for control? | ReSU+MLP clone scores -12.6, basically matches the -13.4 hand-crafted expert |
| 5 | Streaming Pong→Breakout transfer | Does test-time training actually work? | Yes — streaming matches/exceeds from-scratch retraining (19.2 vs 17.4) |
| 6 | CNN BC honesty check | Is ReSU actually better than backprop CNN? | No, they're tied — both at ~-11 on Pong, ReSU is not "better" but it does fit in 2s vs 10s |
| 7 | DQN RL (from-scratch and warm-start) | Can we beat the expert via reward? | Inconclusive negative — vanilla DQN too short; warm-start catastrophic forgetting |
| 8 | Deeper ReSU stack on Space Invaders | Does stacking past 2 layers earn its keep? | No. L3 in isolation is worse than L1; L1+L2+L3 marginally above L1+L2 |
| 9 | Timescale + transfer matrix + forgetting | Where exactly does test-time training pay off? | Streaming helps when source encoder is ill-suited (≈ +5–50 points); catastrophic forgetting is essentially zero |
| 11 | Visually distinct games (Asteroids + Enduro) | Are the L1 filters truly universal? | Filter 0 specializes: even lowpass (Pong) vs flat DC integrator (Asteroids) vs sharp impulse (Enduro). Only filter 1 (derivative) generalizes. Streaming converges to native filters. |
# Setup
conda create -n resu python=3.12 && conda activate resu
pip install -r requirements.txt
# Clone the upstream paper repo for tests/test_ou_match_reference.py
# (it imports `analysis.SVD_analysis` from the released code):
git clone https://github.com/ShawnQin/ReSU.git
# Validate the core math (Stage 0–1):
python3 tests/test_ou_match_reference.py
python3 tests/test_deadleaves_drosophila.py
# Collect data and run stages:
python3 src/collect_pong_buffer.py 200000 # random-policy Pong frames
python3 src/probe_pong.py # Stage 2: linear-probe go/no-go
python3 src/probe_pong_velocity.py # Stage 2: velocity probe
python3 src/probe_pong_action.py # Stage 3: action conditioning
python3 src/bc_pong.py # Stage 4a: ReSU BC
python3 src/bc_pong_eval.py # Stage 4a in-env eval
python3 src/bc_pong_cnn.py # Stage 6: CNN BC baseline
python3 src/bc_breakout.py # Stage 5: Pong→Breakout streaming
python3 src/collect_si_buffer.py # Stage 8 prep: SI random
python3 src/collect_si_expert_demos.py # Stage 9 prep: SI expert demos
python3 src/depth_ablation_si.py # Stage 8: depth ablation
python3 src/transfer_matrix.py # Stage 9: timescale + matrix + forgetting
# Record gameplay GIFs (the ones at the top of this README):
python3 src/record_gameplay.py

Hardware: tested on an RTX 4090 with 24 GB VRAM and 62 GB RAM. Most experiments take 5–15 min; Stage 9 is the longest at ~30 min.
A Rectified Spectral Unit (Qin et al. 2026) takes a length-m past window of
its inputs p_t, projects it onto a canonical direction v_i obtained from
the SVD of the whitened past–future cross-covariance, then half-wave
rectifies:
W = C_ff^{-1/2} · C_fp · C_pp^{-1/2} # whitened past-future cross-cov
U, S, V^T = SVD(W) # canonical correlations on diag(S)
Ψ = V^T · C_pp^{-1/2} # temporal filters
z_t,i^+ = max(+Ψ[i] · p_t, 0) # ON ReSU
z_t,i^- = max(-Ψ[i] · p_t, 0) # OFF ReSU
The "learning" is the closed-form computation of Ψ from data covariances.
No backprop, no gradient steps, no learning-rate schedule. Once you have
Ψ the unit is a fixed linear filter + rectifier.
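The closed-form fit above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the code in src/resu_core.py: the function names (`inv_sqrt`, `fit_resu`, `resu_forward`, `ar1`), the window layout, the `eps` eigenvalue floor, and the toy AR(1) signal are all introduced here for illustration.

```python
import numpy as np


def inv_sqrt(C, eps=1e-6):
    """Inverse square root of a symmetric PSD matrix (eps floor is an assumption)."""
    w, Q = np.linalg.eigh(C)
    w = np.maximum(w, eps)
    return (Q / np.sqrt(w)) @ Q.T  # Q diag(w^-1/2) Q^T


def fit_resu(x, m=8, k=4):
    """Closed-form fit of k temporal filters from a 1-D signal x.

    Assumed layout: past window p_t = x[t : t+m], future window f_t = x[t+m : t+2m].
    """
    x = np.asarray(x, dtype=float)
    T = len(x) - 2 * m + 1
    P = np.stack([x[t:t + m] for t in range(T)])
    F = np.stack([x[t + m:t + 2 * m] for t in range(T)])
    P = P - P.mean(axis=0)
    F = F - F.mean(axis=0)
    C_pp, C_ff, C_fp = P.T @ P / T, F.T @ F / T, F.T @ P / T
    C_pp_is = inv_sqrt(C_pp)
    W = inv_sqrt(C_ff) @ C_fp @ C_pp_is   # whitened past-future cross-cov
    _, S, Vt = np.linalg.svd(W)           # S holds the canonical correlations
    Psi = Vt[:k] @ C_pp_is                # rows are temporal filters
    return Psi, S[:k]


def resu_forward(Psi, p):
    """ON/OFF half-wave rectified responses for one past window p."""
    drive = Psi @ p
    return np.maximum(drive, 0.0), np.maximum(-drive, 0.0)


def ar1(n, phi=0.9, seed=0):
    """Toy AR(1) signal, used only to exercise the fit."""
    rng = np.random.default_rng(seed)
    x = np.zeros(n)
    for t in range(1, n):
        x[t] = phi * x[t - 1] + rng.standard_normal()
    return x
```

Calling `Psi, S = fit_resu(ar1(5000))` returns four filters of length 8 and their canonical correlations. The ON and OFF channels are never active at the same time, and their difference recovers the linear drive `Psi @ p`.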
The follow-up paper claims you can stack ReSUs into deep networks, that they recover physiological features (Drosophila L1/L2/L3, T4 motion detection), and that they're a "path to deep brain-like networks." We tested the stacking and deep-network claims on Atari; the physiological-feature claim is only touched by the synthetic-data reproductions in Stages 0–1.
Reproduced analysis.SVD_analysis from the released paper code on the OU
process. Top-10 canonical correlations match to 1e-8 numerical precision;
top-5 filter directions match in cosine similarity > 0.999. Dead-leaves
synthetic data reproduces the paper's qualitative L1/L2/L3 story (single-lobe
lowpass + bipolar derivative) and the SNR-induced lobe-count reduction from
Fig 3C.
Per-pixel temporal CCA at layer 1 with weights shared across all 84×84 pixels (closed-form SVD on pooled covariances). Layer 2 = 3×3 spatial-temporal CCA on layer-1 features. Linear probes against ball position and velocity on random-policy Pong:
| Target | raw(t) | raw(t,t-1) | ReSU-L1 | ReSU-L2 |
|---|---|---|---|---|
| ball xy[t] | 0.787 | 0.835 | 0.976 | 0.985 |
| ball xy[t+4] | 0.746 | 0.803 | 0.946 | 0.956 |
| ball velocity | 0.483 | 0.589 | 0.900 | 0.920 |
| paddle y[t] | 0.990 | 0.990 | 0.997 | 0.996 |
| paddle y[t+4] | 0.738 | 0.733 | 0.651 | 0.570 |
Velocity R² 0.59 → 0.92 is the strongest evidence that the encoder really does capture motion. Future-paddle prediction is the natural failure mode of an obs-only encoder (paddle motion is action-driven, unpredictable from observations alone), which motivated Stage 3.
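For readers who want to reproduce this kind of number, a held-out linear-probe R² can be computed along the following lines. This is a sketch only: the README does not say how the probes in src/probe_pong.py were fit, so the ridge penalty, the bias column, and the name `probe_r2` are assumptions.

```python
import numpy as np


def probe_r2(feats_train, y_train, feats_test, y_test, ridge=1e-3):
    """Fit a ridge-regularized linear probe and return held-out R^2.

    The ridge value is an assumed default, not taken from the repo.
    """
    # Append a constant column so the probe can learn an offset.
    Xtr = np.hstack([feats_train, np.ones((len(feats_train), 1))])
    Xte = np.hstack([feats_test, np.ones((len(feats_test), 1))])
    A = Xtr.T @ Xtr + ridge * np.eye(Xtr.shape[1])
    w = np.linalg.solve(A, Xtr.T @ y_train)
    resid = y_test - Xte @ w
    ss_res = np.sum(resid ** 2)
    ss_tot = np.sum((y_test - y_test.mean(axis=0)) ** 2)
    return 1.0 - ss_res / ss_tot
```

Scoring on a held-out split matters here: with high-dimensional features, an in-sample R² would overstate how much of the target the encoder makes linearly decodable.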
Augmented past lag vector with one-hot action history. Adding action history to raw pixels jumps paddle_y[t+4] R² from 0.733 → 0.822. But the layer-1 action-conditioned encoder only got to 0.660 — the per-pixel broadcast + rectification crushes the scalar action signal.
Conclusion: action information should enter at the policy head, not the encoder. This is also how standard RL is structured.
Hand-crafted reactive expert (move paddle toward ball y) scores -13.4 vs the built-in CPU. We trained 3-class BC heads (NOOP/UP/DOWN) on 60k expert frames.
| Encoder | Head | Val acc | In-env (5 games) |
|---|---|---|---|
| raw(t, t-1) flat | linear | 0.426 | -21.0 |
| raw(t, t-1) flat | MLP 256×2 | 0.338 | -21.0 |
| ReSU-L1 pooled 1024d | linear | 0.657 | -19.0 |
| ReSU-L1 pooled 1024d | MLP 256×2 | 0.857 | -12.6 |
The ReSU+MLP clone slightly exceeds the expert it was trained on (-12.6 vs -13.4). Raw-pixel flat-MLP BC collapsed to a single action.
But this comparison was unfair — see Stage 6.
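The reactive expert used for these demonstrations ("move paddle toward ball y") is simple enough to sketch in a few lines. The version below is illustrative rather than the code in src/pong_expert.py: the class indices, the `dead_zone` parameter, and the image-coordinate convention (y grows downward) are assumptions.

```python
NOOP, UP, DOWN = 0, 1, 2  # hypothetical indices for the 3-class BC head


def reactive_expert(ball_y, paddle_y, dead_zone=2.0):
    """Move the paddle toward the ball's y position.

    Assumes image coordinates, so a smaller y means higher on screen.
    The dead zone (an assumption) avoids jitter when roughly aligned.
    """
    if ball_y < paddle_y - dead_zone:
        return UP
    if ball_y > paddle_y + dead_zone:
        return DOWN
    return NOOP
```

For example, `reactive_expert(10, 50)` returns `UP` and `reactive_expert(50, 50)` returns `NOOP`.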
A Pong-pretrained encoder, EMA-streamed on Breakout frames with
alpha=0.99. Three regimes compared (20 games per condition):
| Encoder | Val acc | In-env Breakout |
|---|---|---|
| A: frozen Pong, never adapted | 0.825 | 14.8 ± 1.7 |
| B: streamed (Pong→Breakout EMA) | 0.817 | 19.2 ± 2.5 |
| C: from-scratch Breakout fit | 0.843 | 17.4 ± 2.1 |
Streaming matches / edges out from-scratch retraining, without ever
running offline training on the new game. This is the first concrete
demonstration of ReSU's test-time-training story. A bug in cov-initialization
(zero start + slow EMA) had caused an earlier run to underperform; with
initialized=False so the first chunk fully populates covariances, results
become consistent.
The mechanism: the universal lowpass filter (filter 1 when counting from 1; Stage 11 calls it filter 0) stays at cos≈0.99 to its Pong twin throughout streaming. Only the higher-order modes (filters 3 and 4 in the same 1-based count) drift to Breakout-specific structure.
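The covariance-initialization issue described above can be illustrated with a small EMA accumulator. This is a sketch, not the code in src/resu_streaming.py: the class name `StreamingCov` and its interface are hypothetical, and it tracks a single covariance matrix where the real encoder would need C_pp, C_ff, and C_fp.

```python
import numpy as np


class StreamingCov:
    """EMA covariance estimate over chunks of centered lag vectors."""

    def __init__(self, dim, alpha=0.99, initialized=False):
        self.alpha = alpha
        self.C = np.zeros((dim, dim))
        # initialized=False: the first chunk fully populates C.
        # initialized=True with C = 0 reproduces the "zero start" behavior.
        self.initialized = initialized

    def update(self, chunk):
        """chunk: (n, dim) array; returns the updated covariance."""
        C_new = chunk.T @ chunk / len(chunk)
        if not self.initialized:
            self.C = C_new
            self.initialized = True
        else:
            self.C = self.alpha * self.C + (1 - self.alpha) * C_new
        return self.C
```

With a zero start, after n updates the estimate carries only a factor of (1 - alpha^n) of the covariance scale, which is about 0.096 after 10 updates at alpha=0.99. The sketch only shows that shrinkage; it does not attempt to explain how it affected the downstream scores.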
Stage 4a's "raw pixels can't BC" was unfair — raw was processed by a flat MLP, not a CNN. Trained a standard DQN-architecture 3-conv-layer CNN (1.69M params, Mnih 2015) end-to-end on the same Pong demos.
| Encoder | Val acc | In-env (20 games) ± SE |
|---|---|---|
| ReSU + MLP (1024d pool, 0.79M params) | 0.861 | -11.85 ± 0.86 |
| End-to-end CNN | 0.867 / 0.760 (seed var) | -10.85 ± 1.04 |
They are statistically tied (gap 1.0, combined SE 1.35). ReSU's actual edges remain: closed-form 2s fit vs CNN's 10s + seed variance. Stage 4a's "ReSU makes BC possible" should be read as "ReSU and CNN both make BC possible; flat-MLP-on-pixels doesn't."
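The "statistically tied" arithmetic can be checked directly from the table. The snippet assumes the two sets of 20 games are independent, so their standard errors add in quadrature.

```python
import math

# Means and standard errors copied from the Stage 6 table.
resu_mean, resu_se = -11.85, 0.86
cnn_mean, cnn_se = -10.85, 1.04

gap = cnn_mean - resu_mean
# Assumes independent samples, so the variances add.
combined_se = math.sqrt(resu_se ** 2 + cnn_se ** 2)
z = gap / combined_se

print(f"gap={gap:.2f} combined_se={combined_se:.2f} z={z:.2f}")
# prints: gap=1.00 combined_se=1.35 z=0.74
```

A gap of about 0.74 combined standard errors is well short of any conventional significance threshold, which is what "tied" means here.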
Two attempts to show RL improvement:
- From scratch (500k env steps, both ReSU and CNN): both stuck at -19.7 even after ε decayed to 0.10. Vanilla DQN literature needs 1-2M+ env steps for Pong; 500k is below the learning threshold.
- Warm-started from BC (200k env steps): catastrophic. Both variants' eval scores collapsed from -8/-12 BC starts to -21 flat within 50k steps. BC cross-entropy logits aren't valid Q-values; bootstrap noise scrambled them before they could become meaningful.
A clean RL comparison would require either many-hour from-scratch DQN runs or a proper offline-RL recipe (CQL/AWAC/IQL) with BC regularization. Skipped due to compute budget.
Tested 3 stacked ReSU layers on 60k random-policy SI frames. Linear probes against four targets (ship_x, formation y, alive-invader pixels, bullet pixels) at different abstraction levels.
| Target | raw | L1 | L2 | L3 | L1+L2 | L1+L2+L3 |
|---|---|---|---|---|---|---|
| ship_x | 0.833 | 0.850 | 0.939 | 0.817 | 0.948 | 0.955 |
| invader formation y | 0.991 | 0.985 | 0.986 | 0.923 | 0.992 | 0.994 |
| alive_pixels | 0.994 | 0.982 | 0.986 | 0.919 | 0.994 | 0.994 |
| bullet_pixels | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
Canonical correlation spectra:
- L1: [0.96, 0.02, 0.01, 0.01] — top mode dominant, normal
- L2: [0.9999, 0.9996, ..., 0.9958] — all 8 near unity = degenerate
- L3: [0.99997, ..., 0.99984] — even more degenerate
Findings:
- Degenerate canonical correlations ≠ useless features. L2 helps ship_x significantly (0.85 → 0.94) despite all 8 singular values being ≈ 1.
- L3 alone is consistently worse than L1. For all positional targets, L3 alone gives -3 to -7 R² points compared to L1 alone.
- L3 adds essentially nothing on top of L1+L2. ≤ 0.7 R² points across all targets.
The paper's "deep ReSU stack" claim doesn't generalize empirically. Two layers is the practical ceiling on these Atari games.
Three follow-ups to Stage 5's streaming claim.
9a — Adaptation timescale (Pong→Breakout, 10 games per cell):
| Frames streamed | Score ± SE |
|---|---|
| 0 (frozen) | 16.9 ± 3.7 |
| 10k | 19.8 ± 3.3 ← peak |
| 30k | 14.2 ± 2.1 |
| 60k | 14.3 ± 2.2 |
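Early adaptation gives a gain of about +3 points, which is within the standard errors of both cells (±3.7 and ±3.3). More adaptation does not help. Stage 5's 60k-frame result appears to have benefited from a lucky seed rather than reflecting the underlying effect.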
9b — 3-game transfer matrix (frozen / streaming, 10 games):
| | →Pong | →Breakout | →SI |
|---|---|---|---|
| Pong→ | -12.9 / -12.9 | +12.9 / +9.7 | +332.5 / +343.5 |
| Breakout→ | -12.0 / -11.7 | +16.9 / +16.9 | +283.5 / +335.0 (+52) |
| SI→ | -18.0 / -10.5 (+7.5) | +12.2 / +18.0 (+5.8) | +352.0 / +352.0 |
Two clear patterns:
- Frozen cross-game transfer is genuinely good. Pong→Breakout frozen gets +12.9 (Breakout expert: +38; random: ~1). Breakout→Pong frozen gets -12.0 (Pong expert: -13.4; random: -21). The L1 lowpass+derivative are universal features across these games — no adaptation required.
- Streaming helps proportionally to the source-target gap. SI→Pong (most visually different) gets +7.5 points from streaming; SI→Breakout +5.8; Breakout→SI +52. Within similar pairs (Pong↔Breakout, both paddle physics), streaming is a wash.
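The streaming gains quoted above are simply streamed minus frozen for each off-diagonal cell. A short script makes all six explicit, using values copied from the 9b table:

```python
# (frozen, streamed) scores for each source -> target pair, from the 9b table.
matrix = {
    ("Pong", "Breakout"): (12.9, 9.7),
    ("Pong", "SI"): (332.5, 343.5),
    ("Breakout", "Pong"): (-12.0, -11.7),
    ("Breakout", "SI"): (283.5, 335.0),
    ("SI", "Pong"): (-18.0, -10.5),
    ("SI", "Breakout"): (12.2, 18.0),
}

gains = {
    pair: round(streamed - frozen, 1)
    for pair, (frozen, streamed) in matrix.items()
}

for (src, tgt), g in sorted(gains.items(), key=lambda kv: -kv[1]):
    print(f"{src}->{tgt}: {g:+.1f}")
```

Note that the three games use different score scales, so a gain of 51.5 on Space Invaders is not directly comparable to 7.5 on Pong.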
9c — Catastrophic forgetting (10 games):
| Encoder | Pong score ± SE |
|---|---|
| Original Pong-frozen | -10.6 ± 1.2 |
| Pong→Breakout-streamed (60k frames) | -11.0 ± 1.0 |
| Δ (forgetting cost) | -0.4 points (within noise) |
This is the cleanest positive result of the entire project. After 60k frames of Breakout streaming, the encoder still plays Pong at the same level as the never-touched original. End-to-end CNN fine-tuning on Breakout would be expected to degrade Pong-specific representations through gradient drift across the whole network. By contrast, ReSU's closed-form CCA refit recomputes the spectral decomposition from the updated covariances: universal directions stay universal, and only the game-specific subspace adapts.
This property — forgetting-free test-time adaptation at SVD cost (milliseconds per refit) — is the genuine technical contribution that survives all of our stress tests.
Stage 9's "universal lowpass + derivative" framing was tested on Pong / Breakout / Space Invaders — all sprite-based games with similar visual statistics. Stage 11 tests two deliberately different cases:
- Asteroids: vector-style sprites, ~1.1% of pixels non-black. Per-pixel signal is silent most of the time.
- Enduro: scrolling racer, ~78% of pixels non-black with constant motion. Every pixel changes every frame in a structured way.
Native canonical correlations (top mode out of 4):
| Game | Top corr | Filter 0 shape |
|---|---|---|
| Pong | 0.995 | even plateau (lowpass) |
| Breakout | 0.996 | even plateau |
| Space Invaders | 0.957 | mild leaky decay |
| Asteroids | 0.868 | flat DC integrator (pull signal from any lag) |
| Enduro | 0.989 | sharp impulse on current frame (older lags stale) |
Filter 0 specializes dramatically to game statistics. With sparse signal (Asteroids), the optimal filter integrates over all time lags. With dense motion (Enduro), the optimal filter peaks on the current frame because older frames are no longer informative.
Filter 1 (the bipolar derivative) is more consistent — always a sign flip between recent and older past, across all five games. This is the truly universal direction.
Streaming converges to native. A Pong-trained encoder streamed on Asteroids ends up with Asteroids' flat filter, not Pong's even lowpass. The encoder genuinely re-shapes to new statistics, not just slightly perturbs.
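Comparing filters across encoders needs a sign-invariant similarity, because the sign of an SVD direction is arbitrary. The helpers below are a sketch of how such comparisons could be made; the function names are hypothetical, and the toy filter shapes in the usage note are made up for illustration, not measured filters.

```python
import numpy as np


def filter_cosine(a, b):
    """Absolute cosine similarity between two temporal filters.

    The absolute value removes the arbitrary sign of an SVD direction.
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return abs(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))


def has_single_sign_flip(f, tol=1e-8):
    """True if the filter changes sign exactly once along the lag axis.

    Entries with magnitude below tol are ignored (an assumed threshold).
    """
    f = np.asarray(f, dtype=float)
    signs = np.sign(f[np.abs(f) > tol])
    return int(np.sum(signs[1:] != signs[:-1])) == 1
```

On a toy bipolar filter such as `[1, 1, 1, 1, -1, -1, -1, -1]`, `has_single_sign_flip` returns `True`, while a flat filter of all ones returns `False`. The two toy filters are orthogonal, so `filter_cosine` between them is 0.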
Reward probe on Asteroids (logistic regression, AUC on "reward in next 8 frames", base rate 11%):
| Encoder | AUC |
|---|---|
| raw(t, t-1) | 0.609 |
| Pong frozen | 0.627 |
| Breakout frozen | 0.615 |
| SI frozen | 0.623 |
| Asteroids native | 0.623 |
| Pong→Asteroids streamed | 0.626 |
All AUCs cluster in 0.61-0.63. Native fit gives no edge over cross-game frozen encoders. Asteroids reward is action-driven (firing while aimed at an asteroid) and so is unrecoverable from past pixels alone. This is the Stage 3 lesson again: an obs-only encoder doesn't capture action-driven rewards.
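The probe target and metric can be sketched in plain Python. Both functions are illustrative: whether "next 8 frames" includes the current frame is not stated, so the labelling below assumes it starts at t+1, and the AUC uses the pairwise (Mann-Whitney) definition, which is quadratic in sample count and only suitable for small checks.

```python
def future_reward_labels(rewards, horizon=8):
    """label[t] = 1 if any reward in rewards[t+1 : t+1+horizon] is nonzero.

    Starting the window at t+1 is an assumption.
    """
    return [
        int(any(r != 0 for r in rewards[t + 1:t + 1 + horizon]))
        for t in range(len(rewards))
    ]


def roc_auc(scores, labels):
    """AUC as P(score_pos > score_neg), counting ties as 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))
```

An AUC of 0.5 corresponds to a probe that ranks positives and negatives at random, so values of 0.61 to 0.63 indicate only weak predictability regardless of encoder.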
Enduro reward probe: N/A — random policy never overtakes a car in 60k frames. The probe needs a hand-crafted expert. The encoder still fits cleanly (top corr 0.989); we just lack data to measure reward predictability.
Net refinement of the Stage 9 story: The "universal lowpass + derivative" framing was too strong. Only the derivative (filter 1) generalizes; the lowpass (filter 0) specializes to each game's statistics. Streaming adaptation works as advertised — filters converge to native game statistics rather than staying near the source.
See results/stage11_filter_shapes.png for a side-by-side visualization
of all 4 filters × 7 encoder variants.
What ReSU actually offers, after all the testing:
- Closed-form, deterministic encoder fit. ~2 seconds vs CNN's ~10s with high seed-to-seed variance (CNN val acc swung 0.76 ↔ 0.87 on identical data; ReSU is reproducible).
- Cheap test-time refit (one SVD, milliseconds). No SSL loss to maintain.
- Forgetting-free adaptation. Universal spectral directions are preserved during streaming; only game-specific subspaces adapt.
- Cross-game encoder transfer for free. The L1 lowpass + derivative filters generalize across Pong / Breakout / Space Invaders.
What ReSU does NOT offer (contra the paper's framing):
- No advantage over backprop CNN at fixed-task BC — they're statistically tied on Pong (-11.85 vs -10.85).
- No working "deep brain-like network" via stacking. Layer 3 doesn't earn its keep on the games we tested.
- No demonstrated path to SOTA RL. Vanilla DQN didn't show learning in our compute budget; warm-start RL catastrophically forgot the BC policy. These could probably be fixed with better RL infrastructure (CQL/AWAC), but it's not free.
If you want to use ReSU: it's a competitive (not winning) encoder for static tasks, and a genuinely unique encoder for tasks where the input distribution drifts and you can't afford full retraining or per-update gradient steps. Lifelong/continual-learning scenarios are its natural niche.
If you want to extend the paper's biology story: this work doesn't say much about that. The Drosophila L1/L2/L3 reproduction is the paper's actual scientific contribution; the Atari results don't speak to it.
- A proper offline-RL recipe (CQL/AWAC/IQL) with BC-regularized loss might enable the warm-start RL comparison that vanilla DQN failed at.
- Live test-time training inside gameplay. Stage 9 streamed over pre-collected frame buffers; the natural follow-up is per-step encoder updates during actual play episodes, with online BC or RL on top.
- Action injection at a pooled feature layer (the Stage 3 takeaway). Per-pixel layer-1 conditioning was wrong; a layer-2 action-conditioned encoder might work better.
- Drosophila-style scientific reproduction at full fidelity. The biology story is the paper's actual contribution; this project doesn't address it.
PLAN.md — Initial plan (now mostly executed)
RESULTS.md — Running stage-by-stage record
README.md — This file
src/resu_core.py — Core past/future-CCA primitive
src/resu_conv.py — 2D conv-style temporal & spatio-temporal layers
src/resu_conv_action.py — Action-conditioned layer 1
src/resu_streaming.py — Streaming EMA + SVD refits (test-time training)
src/atari_env.py — Atari preprocessing
src/pong_labels.py, pong_expert.py — Pong labeller + reactive expert
src/breakout_labels.py, breakout_expert.py — Breakout labeller + ball-tracking expert
src/si_expert.py — Space Invaders alien-tracking expert
src/collect_pong_buffer.py — Random-policy Pong frames
src/collect_si_buffer.py — Random-policy SI frames + heuristic labels
src/collect_si_expert_demos.py — SI expert demos
src/eval_expert.py, eval_breakout_expert.py — Sanity-check the experts
src/probe_pong.py — Stage 2: linear-probe go/no-go
src/probe_pong_velocity.py — Stage 2: velocity probe
src/probe_pong_action.py — Stage 3: action conditioning A/B
src/bc_pong.py, bc_pong_eval.py — Stage 4a: Pong BC offline + in-env
src/bc_pong_cnn.py — Stage 6: CNN BC baseline
src/bc_breakout.py — Stage 5: Pong→Breakout streaming
src/dqn_pong.py — Stage 7: DQN from scratch (negative)
src/dqn_pong_warmstart.py — Stage 7: BC-warm-start DQN (also negative)
src/depth_ablation_si.py — Stage 8: 3-layer stack on Space Invaders
src/transfer_matrix.py — Stage 9: timescale + matrix + forgetting
src/asteroids_enduro_stress.py — Stage 11: visually distinct games
src/asteroids_expert.py — Stochastic sweep-and-fire expert
src/collect_asteroids_demos.py
src/record_gameplay.py — Render Pong/Breakout BC-clone gameplay as GIFs
src/record_asteroids.py — Render Asteroids BC clone as GIF
media/ — Gameplay GIFs (Pong, Breakout)
tests/test_ou_match_reference.py — Numerical match against paper code
tests/test_deadleaves_drosophila.py — Qualitative L1/L2/L3 reproduction
ReSU/ — Original paper's released repo (read-only ref)
data/ — Frame buffers, expert demos, labels
results/ — Logs and saved arrays for each stage
The original paper, code, and Drosophila biology story: Qin, S.; Pughe-Sanford, J.L.; Genkin, A.; Ozdil, P.G.; Greengard, P.; Sengupta, A.M.; Chklovskii, D.B. A Network of Biologically Inspired Rectified Spectral Units (ReSUs) Learns Hierarchical Features Without Error Backpropagation. arXiv:2512.23146 (2025). Code: https://github.com/ShawnQin/ReSU


