Pathos0925/resu-atari

ReSU on Atari: A Calibrated Investigation

This project takes the Rectified Spectral Units (ReSU) primitive from Qin et al. 2025, a biologically inspired, backprop-free, self-supervised feature extractor, and stress-tests it on Atari games (Pong, Breakout, Space Invaders, and later Asteroids and Enduro). The experiments run in numbered stages, each designed as a falsifiable test of a specific claim about ReSU.

Gameplay GIFs: Pong (BC clone), Breakout (BC clone), Asteroids (BC clone, stochastic expert).

The ReSU encoder is fit by closed-form SVD on past-future input covariances (~2 seconds, no backprop). A small MLP head trained on 60k frames of hand-crafted-expert demonstrations drives the policy. On Pong the clone roughly matches its -13.4 expert; on Breakout it clears the first wave of bricks. The Asteroids clone is a Stage 11 stress test: vector-style graphics, about 1% non-black pixels, and a stochastic sweep-and-fire expert that the clone imitates only marginally better than chance (val acc 0.25 on a 7-class output, where uniform guessing gives about 0.14). Even so, the agent fires while drifting and scores points.

The headline calibrated finding after all the work: ReSU's unique technical contribution is a closed-form, deterministic, forgetting-free encoder that adapts during deployment via streaming covariance updates + SVD. Its other claimed advantages (deep stacking, beating backprop CNNs) do not generalize from the paper's 1-D Drosophila demo to Atari.

This README summarizes everything; for deeper context see PLAN.md (initial strategy) and RESULTS.md (running record).


TL;DR by stage

| # | Stage | What it tests | Result |
|---|---|---|---|
| 0 | Reproduce paper code | Math validation | Top-10 canonical correlations match reference to 1e-8 |
| 1 | Refactor as resu_core | Reusable primitive | OU + dead-leaves both reproduce paper's L1/L2/L3 story |
| 2 | 2-D conv-style encoder on Atari | Does CCA generalize from 1-D to 2-D? | Ball velocity R² jumps 0.59 (raw) → 0.92 (ReSU-L2). Encoder captures motion. |
| 3 | Action conditioning at layer 1 | Where should action enter? | Per-pixel injection crushes the action signal; conclusion: keep encoder obs-only, action goes to policy head |
| 4a | Behavior cloning on Pong | Are ReSU features good enough for control? | ReSU+MLP clone scores -12.6, basically matches the -13.4 hand-crafted expert |
| 5 | Streaming Pong→Breakout transfer | Does test-time training actually work? | Yes: streaming matches/exceeds from-scratch retraining (19.2 vs 17.4) |
| 6 | CNN BC honesty check | Is ReSU actually better than backprop CNN? | No, they're tied: both at ~-11 on Pong. ReSU is not "better", but it does fit in 2s vs 10s |
| 7 | DQN RL (from-scratch and warm-start) | Can we beat the expert via reward? | Inconclusive negative: vanilla DQN run too short; warm-start suffers catastrophic forgetting |
| 8 | Deeper ReSU stack on Space Invaders | Does stacking past 2 layers earn its keep? | No. L3 in isolation is worse than L1; L1+L2+L3 marginally above L1+L2 |
| 9 | Timescale + transfer matrix + forgetting | Where exactly does test-time training pay off? | Streaming helps when source encoder is ill-suited (≈ +5–50 points); catastrophic forgetting is essentially zero |
| 11 | Visually distinct games (Asteroids + Enduro) | Are the L1 filters truly universal? | Filter 0 specializes: even lowpass (Pong) vs flat DC integrator (Asteroids) vs sharp impulse (Enduro). Only filter 1 (derivative) generalizes. Streaming converges to native filters. |

Quick reproduction

# Setup
conda create -n resu python=3.12 && conda activate resu
pip install -r requirements.txt

# Clone the upstream paper repo for tests/test_ou_match_reference.py
# (it imports `analysis.SVD_analysis` from the released code):
git clone https://github.com/ShawnQin/ReSU.git

# Validate the core math (Stage 0–1):
python3 tests/test_ou_match_reference.py
python3 tests/test_deadleaves_drosophila.py

# Collect data and run stages:
python3 src/collect_pong_buffer.py 200000        # random-policy Pong frames
python3 src/probe_pong.py                        # Stage 2: linear-probe go/no-go
python3 src/probe_pong_velocity.py               # Stage 2: velocity probe
python3 src/probe_pong_action.py                 # Stage 3: action conditioning

python3 src/bc_pong.py                           # Stage 4a: ReSU BC
python3 src/bc_pong_eval.py                      # Stage 4a in-env eval
python3 src/bc_pong_cnn.py                       # Stage 6: CNN BC baseline

python3 src/bc_breakout.py                       # Stage 5: Pong→Breakout streaming
python3 src/collect_si_buffer.py                 # Stage 8 prep: SI random
python3 src/collect_si_expert_demos.py           # Stage 9 prep: SI expert demos
python3 src/depth_ablation_si.py                 # Stage 8: depth ablation
python3 src/transfer_matrix.py                   # Stage 9: timescale + matrix + forgetting

# Record gameplay GIFs (the ones at the top of this README):
python3 src/record_gameplay.py

Hardware: tested on an RTX 4090 with 24 GB VRAM and 62 GB RAM. Most experiments take 5–15 min; Stage 9 is the longest at ~30 min.


What is ReSU (one paragraph)

A Rectified Spectral Unit (Qin et al. 2025) takes a length-m past window p_t of its inputs, whitens it, projects it onto a canonical direction v_i obtained from the SVD of the whitened past-future cross-covariance, then half-wave rectifies:

W   = C_ff^{-1/2} · C_fp · C_pp^{-1/2}        # whitened past-future cross-cov
U, S, V^T = SVD(W)                            # canonical correlations on diag(S)
Ψ   = V^T · C_pp^{-1/2}                       # temporal filters
z_t,i^+ = max(+Ψ[i] · p_t, 0)                 # ON ReSU
z_t,i^- = max(-Ψ[i] · p_t, 0)                 # OFF ReSU

The "learning" is the closed-form computation of Ψ from data covariances. No backprop, no gradient steps, no learning-rate schedule. Once you have Ψ the unit is a fixed linear filter + rectifier.
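The formulas above can be sketched in NumPy for a single 1-D signal. This is a minimal illustration, not the code in src/resu_core.py: the function names, the window lengths `m` and `h`, the eigenvalue floor `eps`, and the "most recent sample first" window ordering are all assumptions made for the example.

```python
import numpy as np

def inv_sqrt(C, eps=1e-8):
    """Inverse matrix square root of a symmetric PSD matrix (eps floor is an assumption)."""
    w, Q = np.linalg.eigh(C)
    return (Q / np.sqrt(np.maximum(w, eps))) @ Q.T

def fit_resu(x, m=8, h=8, k=4):
    """Closed-form fit of k temporal filters from a 1-D signal x."""
    T = len(x) - m - h + 1
    # Past window (most recent sample first) and the future window that follows it.
    P = np.stack([x[t:t + m][::-1] for t in range(T)])
    F = np.stack([x[t + m:t + m + h] for t in range(T)])
    P = P - P.mean(0)
    F = F - F.mean(0)
    C_pp, C_ff, C_fp = P.T @ P / T, F.T @ F / T, F.T @ P / T
    C_pp_is = inv_sqrt(C_pp)
    W = inv_sqrt(C_ff) @ C_fp @ C_pp_is      # whitened past-future cross-cov
    _, S, Vt = np.linalg.svd(W)              # S holds the canonical correlations
    return Vt[:k] @ C_pp_is, S[:k]           # Psi has shape (k, m)

def resu_forward(Psi, p):
    """ON and OFF responses for one past window p."""
    y = Psi @ p
    return np.maximum(y, 0.0), np.maximum(-y, 0.0)

# Toy input: an AR(1) process with coefficient 0.9.
rng = np.random.default_rng(0)
x = np.zeros(5000)
for t in range(1, len(x)):
    x[t] = 0.9 * x[t - 1] + rng.standard_normal()

Psi, corr = fit_resu(x)
p_last = x[-8:][::-1]
z_on, z_off = resu_forward(Psi, p_last)
```

On this toy process the top canonical correlation should land near the AR coefficient and the remaining ones near zero. The ON and OFF channels never fire together, and their difference recovers the linear filter output.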

The follow-up paper claims you can stack ReSUs into deep networks, that they recover physiological features (Drosophila L1/L2/L3, T4 motion detection), and that they're a "path to deep brain-like networks." We tested all three claims on Atari.


Detailed findings

Stage 0–1: Math validation

Reproduced analysis.SVD_analysis from the released paper code on the OU process. Top-10 canonical correlations match to 1e-8 numerical precision; top-5 filter directions match in cosine similarity > 0.999. Dead-leaves synthetic data reproduces the paper's qualitative L1/L2/L3 story (single-lobe lowpass + bipolar derivative) and the SNR-induced lobe-count reduction from Fig 3C.

Stage 2: 2-D conv-style encoder on Atari frames

Per-pixel temporal CCA at layer 1 with weights shared across all 84×84 pixels (closed-form SVD on pooled covariances). Layer 2 = 3×3 spatial-temporal CCA on layer-1 features. Linear probes against ball position and velocity on random-policy Pong:

| Target (probe R²) | raw(t) | raw(t,t-1) | ReSU-L1 | ReSU-L2 |
|---|---|---|---|---|
| ball xy[t] | 0.787 | 0.835 | 0.976 | 0.985 |
| ball xy[t+4] | 0.746 | 0.803 | 0.946 | 0.956 |
| ball velocity | 0.483 | 0.589 | 0.900 | 0.920 |
| paddle y[t] | 0.990 | 0.990 | 0.997 | 0.996 |
| paddle y[t+4] | 0.738 | 0.733 | 0.651 | 0.570 |

Velocity R² 0.59 → 0.92 is the strongest evidence that the encoder really does capture motion. Future-paddle prediction is the natural failure mode of an obs-only encoder (paddle motion is action-driven, unpredictable from observations alone), which motivated Stage 3.
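For readers unfamiliar with linear probing, the R² numbers above come from fitting a linear map from features to a labelled target. A minimal sketch follows, under assumptions the README does not state: a ridge-regularised least-squares fit, a chronological 80/20 train/test split, and a scalar target. `probe_r2` is a hypothetical helper, not a function from src/probe_pong.py.

```python
import numpy as np

def probe_r2(feats, target, ridge=1e-3, train_frac=0.8):
    """Held-out R^2 of a ridge-regularised linear probe on a scalar target."""
    n = len(feats)
    n_tr = int(train_frac * n)
    X = np.hstack([feats, np.ones((n, 1))])          # append a bias column
    X_tr, X_te = X[:n_tr], X[n_tr:]
    y_tr, y_te = target[:n_tr], target[n_tr:]
    A = X_tr.T @ X_tr + ridge * np.eye(X.shape[1])
    w = np.linalg.solve(A, X_tr.T @ y_tr)
    resid = y_te - X_te @ w
    return 1.0 - np.mean(resid ** 2) / np.var(y_te)

# Synthetic check: a target that is linear in the features vs. pure noise.
rng = np.random.default_rng(0)
feats = rng.standard_normal((500, 5))
target = feats @ np.arange(1.0, 6.0) + 0.1 * rng.standard_normal(500)
r2_signal = probe_r2(feats, target)
r2_noise = probe_r2(feats, rng.standard_normal(500))
```

A probe can only score well if the target is linearly decodable from the features, which is why a jump in velocity R² is read as the encoder exposing motion information.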

Stage 3: Action conditioning at layer 1

Augmented past lag vector with one-hot action history. Adding action history to raw pixels jumps paddle_y[t+4] R² from 0.733 → 0.822. But the layer-1 action-conditioned encoder only got to 0.660 — the per-pixel broadcast + rectification crushes the scalar action signal.

Conclusion: action information should enter at the policy head, not the encoder. This is also how standard RL is structured.
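The augmentation tested in this stage can be pictured as appending a block of one-hot action indicators to each lag vector. The sketch below is illustrative only: `augment_with_actions`, the number of action lags, and the 3-action set are assumptions, and src/resu_conv_action.py may lay out the vector differently.

```python
import numpy as np

def augment_with_actions(past, actions, n_actions=3, n_lags=4):
    """Append a one-hot action history to each past-lag vector.

    past:    (T, m) array of lag windows
    actions: (T,) integer action ids
    Block j of the appended columns is the one-hot action taken j steps
    before row t (all zeros where no such step exists).
    """
    onehot = np.eye(n_actions)[np.asarray(actions)]
    blocks = []
    for j in range(n_lags):
        shifted = np.zeros_like(onehot)
        shifted[j:] = onehot[:len(onehot) - j]
        blocks.append(shifted)
    return np.hstack([past] + blocks)

past = np.zeros((5, 8))
actions = [0, 1, 2, 1, 0]
aug = augment_with_actions(past, actions, n_actions=3, n_lags=2)
```

In a per-pixel encoder this same small action block is broadcast to every pixel's lag vector, which is the setting where the action signal was reported to be crushed.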

Stage 4a: Pong behavior cloning

Hand-crafted reactive expert (move paddle toward ball y) scores -13.4 vs the built-in CPU. We trained 3-class BC heads (NOOP/UP/DOWN) on 60k expert frames.

| Encoder | Head | Val acc | In-env (5 games) |
|---|---|---|---|
| raw(t, t-1) flat | linear | 0.426 | -21.0 |
| raw(t, t-1) flat | MLP 256×2 | 0.338 | -21.0 |
| ReSU-L1 pooled 1024d | linear | 0.657 | -19.0 |
| ReSU-L1 pooled 1024d | MLP 256×2 | 0.857 | -12.6 |

The ReSU+MLP clone slightly exceeds the expert it was trained on (-12.6 vs -13.4). Raw-pixel flat-MLP BC collapsed to a single action.

But this comparison was unfair — see Stage 6.

Stage 5: Streaming Pong→Breakout

A Pong-pretrained encoder, EMA-streamed on Breakout frames with alpha=0.99. Three regimes compared (20 games per condition):

| Encoder | Val acc | In-env Breakout |
|---|---|---|
| A: frozen Pong, never adapted | 0.825 | 14.8 ± 1.7 |
| B: streamed (Pong→Breakout EMA) | 0.817 | 19.2 ± 2.5 |
| C: from-scratch Breakout fit | 0.843 | 17.4 ± 2.1 |

Streaming matches / edges out from-scratch retraining, without ever running offline training on the new game. This is the first concrete demonstration of ReSU's test-time-training story. A bug in cov-initialization (zero start + slow EMA) had caused an earlier run to underperform; with initialized=False so the first chunk fully populates covariances, results become consistent.
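The streaming update and the initialization fix described above can be sketched as follows. This is an assumed reconstruction, not the contents of src/resu_streaming.py: the class name, the per-chunk application of `alpha`, and the omission of mean-centering are simplifications for the example. Only `alpha=0.99` and the `initialized` flag come from the text.

```python
import numpy as np

def inv_sqrt(C, eps=1e-8):
    w, Q = np.linalg.eigh(C)
    return (Q / np.sqrt(np.maximum(w, eps))) @ Q.T

class StreamingReSU:
    """EMA covariance tracker with a closed-form refit (hypothetical sketch)."""

    def __init__(self, alpha=0.99):
        self.alpha = alpha
        self.initialized = False
        self.C_pp = self.C_ff = self.C_fp = None

    def update(self, P, F):
        """P: (n, m) past windows, F: (n, h) future windows, assumed centered."""
        n = len(P)
        batch = (P.T @ P / n, F.T @ F / n, F.T @ P / n)
        if not self.initialized:
            # First chunk fully populates the covariances, avoiding the
            # zero-start bias of a slow EMA.
            self.C_pp, self.C_ff, self.C_fp = batch
            self.initialized = True
        else:
            a = self.alpha
            self.C_pp = a * self.C_pp + (1 - a) * batch[0]
            self.C_ff = a * self.C_ff + (1 - a) * batch[1]
            self.C_fp = a * self.C_fp + (1 - a) * batch[2]

    def refit(self, k=4):
        """One SVD on the current covariances; returns filters and correlations."""
        C_pp_is = inv_sqrt(self.C_pp)
        W = inv_sqrt(self.C_ff) @ self.C_fp @ C_pp_is
        _, S, Vt = np.linalg.svd(W)
        return Vt[:k] @ C_pp_is, S[:k]

rng = np.random.default_rng(0)
P1 = rng.standard_normal((1000, 8))
F1 = P1[:, :4] + 0.1 * rng.standard_normal((1000, 4))
enc = StreamingReSU()
enc.update(P1, F1)
first_cov = enc.C_pp.copy()
Psi, corr = enc.refit()

P2 = rng.standard_normal((1000, 8))
F2 = P2[:, :4] + 0.1 * rng.standard_normal((1000, 4))
enc.update(P2, F2)
```

With a zero start and `alpha=0.99`, the estimate after one chunk would be only 1% of the chunk covariance, which is consistent with the underperformance attributed to the earlier bug.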

The mechanism: the universal lowpass filter (the top mode, called filter 0 in Stage 11's zero-based numbering) stays at cos≈0.99 to its Pong twin throughout streaming. Only the higher-order modes (the third and fourth filters) drift to Breakout-specific structure.
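A filter-to-filter cosine like the one quoted above can be computed as below. Taking the absolute value is an assumption added here because the sign of an SVD direction is arbitrary; the README does not say whether the reported cos≈0.99 is signed. The example filters are illustrative shapes, not fitted ones.

```python
import numpy as np

def filter_cosine(a, b):
    """Sign-invariant cosine similarity between two temporal filters."""
    return abs(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

lowpass = np.ones(8) / np.sqrt(8)                # illustrative even plateau
derivative = np.r_[np.ones(4), -np.ones(4)]      # illustrative sign flip

cos_same = filter_cosine(lowpass, -2.0 * lowpass)
cos_orth = filter_cosine(lowpass, derivative)
```

Tracking this quantity per filter after each refit shows which modes stay put and which ones drift during streaming.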

Stage 6: CNN BC honesty check

Stage 4a's "raw pixels can't BC" was unfair — raw was processed by a flat MLP, not a CNN. Trained a standard DQN-architecture 3-conv-layer CNN (1.69M params, Mnih 2015) end-to-end on the same Pong demos.

| Encoder | Val acc | In-env (20 games) ± SE |
|---|---|---|
| ReSU + MLP (1024d pool, 0.79M params) | 0.861 | -11.85 ± 0.86 |
| End-to-end CNN | 0.867 / 0.760 (seed var) | -10.85 ± 1.04 |

They are statistically tied (gap 1.0, combined SE 1.35). ReSU's actual edges remain: closed-form 2s fit vs CNN's 10s + seed variance. Stage 4a's "ReSU makes BC possible" should be read as "ReSU and CNN both make BC possible; flat-MLP-on-pixels doesn't."
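The "combined SE 1.35" figure is what you get by adding the two standard errors in quadrature, which assumes the two sets of 20 games are independent. A short check using the numbers from the table:

```python
import math

def compare(mean_a, se_a, mean_b, se_b):
    """Gap between two means, its standard error, and the ratio of the two."""
    gap = mean_b - mean_a
    se = math.hypot(se_a, se_b)      # sqrt(se_a**2 + se_b**2)
    return gap, se, gap / se

gap, se, z = compare(-11.85, 0.86, -10.85, 1.04)
print(f"gap={gap:.2f}  combined SE={se:.2f}  z={z:.2f}")
```

The gap is well under two combined standard errors, so the "statistically tied" reading follows directly from the reported values.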

Stage 7: DQN RL (negative result)

Two attempts to show RL improvement:

  1. From scratch (500k env steps, both ReSU and CNN): both stuck at -19.7 even after ε decayed to 0.10. Vanilla DQN literature needs 1-2M+ env steps for Pong; 500k is below the learning threshold.
  2. Warm-started from BC (200k env steps): catastrophic. Both variants' eval scores collapsed from -8/-12 BC starts to -21 flat within 50k steps. BC cross-entropy logits aren't valid Q-values; bootstrap noise scrambled them before they could become meaningful.

A clean RL comparison would require either many-hour from-scratch DQN runs or a proper offline-RL recipe (CQL/AWAC/IQL) with BC regularization. Skipped due to compute budget.
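One way to see why BC logits make a poor Q-value initialization: cross-entropy training only constrains logits up to a per-state additive constant, while the TD error depends on that constant directly. The logit values below are made up for illustration and do not come from the trained heads.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def td_error(q_s, a, r, q_next, gamma=0.99):
    """One-step TD error: r + gamma * max_a' Q(s', a') - Q(s, a)."""
    return r + gamma * q_next.max() - q_s[a]

logits_s = np.array([2.0, -1.0, 0.5])       # hypothetical BC logits at state s
logits_next = np.array([1.5, 0.0, -0.5])    # hypothetical BC logits at state s'
shift = 50.0                                # arbitrary offset, invisible to the BC loss

same_policy = np.allclose(softmax(logits_s), softmax(logits_s + shift))
td_a = td_error(logits_s, 0, 0.0, logits_next)
td_b = td_error(logits_s + shift, 0, 0.0, logits_next)
```

Both logit vectors give the same cloned policy, yet their TD errors differ by the full offset, so early bootstrapped updates can be dominated by values the BC objective never pinned down.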

Stage 8: Deeper stack on Space Invaders

Tested 3 stacked ReSU layers on 60k random-policy SI frames. Linear probes against four targets (ship_x, formation y, alive-invader pixels, bullet pixels) at different abstraction levels.

| Target | raw | L1 | L2 | L3 | L1+L2 | L1+L2+L3 |
|---|---|---|---|---|---|---|
| ship_x | 0.833 | 0.850 | 0.939 | 0.817 | 0.948 | 0.955 |
| invader formation y | 0.991 | 0.985 | 0.986 | 0.923 | 0.992 | 0.994 |
| alive_pixels | 0.994 | 0.982 | 0.986 | 0.919 | 0.994 | 0.994 |
| bullet_pixels | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |

Canonical correlation spectra:

  • L1: [0.96, 0.02, 0.01, 0.01] — top mode dominant, normal
  • L2: [0.9999, 0.9996, ..., 0.9958] — all 8 near unity = degenerate
  • L3: [0.99997, ..., 0.99984] — even more degenerate

Findings:

  1. Degenerate canonical correlations ≠ useless features. L2 helps ship_x significantly (0.85 → 0.94) despite all 8 singular values being ≈ 1.
  2. L3 alone is consistently worse than L1. For all positional targets, L3 alone gives -3 to -7 R² points compared to L1 alone.
  3. L3 adds essentially nothing on top of L1+L2. ≤ 0.7 R² points across all targets.

The paper's "deep ReSU stack" claim doesn't generalize empirically. Two layers is the practical ceiling on these Atari games.
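The deltas quoted in findings 2 and 3 can be recomputed from the table. The snippet copies the relevant R² columns and expresses differences in R² points (hundredths); the dictionary keys are just labels chosen for the example.

```python
r2 = {
    "ship_x":              {"L1": 0.850, "L3": 0.817, "L1+L2": 0.948, "L1+L2+L3": 0.955},
    "invader formation y": {"L1": 0.985, "L3": 0.923, "L1+L2": 0.992, "L1+L2+L3": 0.994},
    "alive_pixels":        {"L1": 0.982, "L3": 0.919, "L1+L2": 0.994, "L1+L2+L3": 0.994},
    "bullet_pixels":       {"L1": 1.000, "L3": 1.000, "L1+L2": 1.000, "L1+L2+L3": 1.000},
}

# L3 alone vs L1 alone, and the marginal gain from adding L3 to L1+L2.
l3_vs_l1 = {k: 100 * (v["L3"] - v["L1"]) for k, v in r2.items()}
l3_gain = {k: 100 * (v["L1+L2+L3"] - v["L1+L2"]) for k, v in r2.items()}

for name in r2:
    print(f"{name:22s} L3-L1: {l3_vs_l1[name]:+.1f}   +L3 gain: {l3_gain[name]:+.1f}")
```

The saturated bullet_pixels row contributes zeros to both columns, so the ranges in the findings are driven by the three positional targets.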

Stage 9: Test-time adaptation in depth

Three follow-ups to Stage 5's streaming claim.

9a — Adaptation timescale (Pong→Breakout, 10 games per cell):

| Frames streamed | Score ± SE |
|---|---|
| 0 (frozen) | 16.9 ± 3.7 |
| 10k | 19.8 ± 3.3 (peak) |
| 30k | 14.2 ± 2.1 |
| 60k | 14.3 ± 2.2 |

Early adaptation gives a gain of about 3 points; streaming more frames does not help. Stage 5's streamed result (60k frames, with what looks like a lucky seed) was noisier than the underlying effect.

9b — 3-game transfer matrix (frozen / streaming, 10 games):

| Source | →Pong | →Breakout | →SI |
|---|---|---|---|
| Pong→ | -12.9 / -12.9 | +12.9 / +9.7 | +332.5 / +343.5 |
| Breakout→ | -12.0 / -11.7 | +16.9 / +16.9 | +283.5 / +335.0 (+52) |
| SI→ | -18.0 / -10.5 (+7.5) | +12.2 / +18.0 (+5.8) | +352.0 / +352.0 |

Two clear patterns:

  1. Frozen cross-game transfer is genuinely good. Pong→Breakout frozen gets +12.9 (Breakout expert: +38; random: ~1). Breakout→Pong frozen gets -12.0 (Pong expert: -13.4; random: -21). The L1 lowpass+derivative are universal features across these games — no adaptation required.
  2. Streaming helps proportionally to the source-target gap. SI→Pong (most visually different) gets +7.5 points from streaming; SI→Breakout +5.8; Breakout→SI +52. Within similar pairs (Pong↔Breakout, both paddle physics), streaming is a wash.
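The streaming gains cited in these patterns are simply streamed minus frozen for each cell of the matrix. A small recomputation from the table values follows; note that raw score differences are not comparable across target games, since each game has its own score scale.

```python
# (source, target): (frozen score, streamed score), copied from the matrix above.
scores = {
    ("Pong", "Pong"): (-12.9, -12.9),
    ("Pong", "Breakout"): (12.9, 9.7),
    ("Pong", "SI"): (332.5, 343.5),
    ("Breakout", "Pong"): (-12.0, -11.7),
    ("Breakout", "Breakout"): (16.9, 16.9),
    ("Breakout", "SI"): (283.5, 335.0),
    ("SI", "Pong"): (-18.0, -10.5),
    ("SI", "Breakout"): (12.2, 18.0),
    ("SI", "SI"): (352.0, 352.0),
}

gain = {pair: streamed - frozen for pair, (frozen, streamed) in scores.items()}

for (src, dst), g in gain.items():
    print(f"{src:>8s} -> {dst:<8s} {g:+.1f}")
```

The diagonal entries are zero by construction in the table, since frozen and streamed scores coincide when source and target are the same game.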

9c — Catastrophic forgetting (10 games):

| Encoder | Pong score ± SE |
|---|---|
| Original Pong-frozen | -10.6 ± 1.2 |
| Pong→Breakout-streamed (60k frames) | -11.0 ± 1.0 |
| Δ (forgetting cost) | -0.4 points (within noise) |

This is the cleanest positive result of the entire project. After 60k frames of Breakout streaming, the encoder still plays Pong at the same level as the never-touched original. End-to-end CNN fine-tuning on Breakout would corrupt Pong-specific representations through gradient drift across the whole network. ReSU's closed-form CCA refit preserves the principled spectral decomposition — universal directions stay universal, only the truly game-specific subspace adapts.

This property — forgetting-free test-time adaptation at SVD cost (milliseconds per refit) — is the genuine technical contribution that survives all of our stress tests.

Stage 11: Visually distinct games (Asteroids + Enduro)

Stage 9's "universal lowpass + derivative" framing was tested on Pong / Breakout / Space Invaders — all sprite-based games with similar visual statistics. Stage 11 tests two deliberately different cases:

  • Asteroids: vector-style sprites, ~1.1% of pixels non-black. Per-pixel signal is silent most of the time.
  • Enduro: scrolling racer, ~78% of pixels non-black with constant motion. Every pixel changes every frame in a structured way.
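The "fraction of non-black pixels" statistic used to contrast these two games can be measured on a frame buffer along these lines. The threshold of zero and the (N, 84, 84) grayscale layout are assumptions for the sketch; the project's buffers may be stored differently.

```python
import numpy as np

def nonblack_fraction(frames, threshold=0):
    """Fraction of pixels brighter than threshold across a stack of frames."""
    return float(np.mean(np.asarray(frames) > threshold))

# Synthetic buffer: 10 black 84x84 frames with one bright row each.
frames = np.zeros((10, 84, 84), dtype=np.uint8)
frames[:, 0, :] = 255
frac = nonblack_fraction(frames)   # one row out of 84
```

A very low fraction means most per-pixel lag windows are all zeros, so the pooled covariances are dominated by a small number of active pixels.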

Native canonical correlations (top mode out of 4):

| Game | Top corr | Filter 0 shape |
|---|---|---|
| Pong | 0.995 | even plateau (lowpass) |
| Breakout | 0.996 | even plateau |
| Space Invaders | 0.957 | mild leaky decay |
| Asteroids | 0.868 | flat DC integrator (pull signal from any lag) |
| Enduro | 0.989 | sharp impulse on current frame (older lags stale) |

Filter 0 specializes dramatically to game statistics. With sparse signal (Asteroids), the optimal filter integrates over all time lags. With dense motion (Enduro), the optimal filter peaks on the current frame because older frames are no longer informative.

Filter 1 (the bipolar derivative) is more consistent — always a sign flip between recent and older past, across all five games. This is the truly universal direction.

Streaming converges to native. A Pong-trained encoder streamed on Asteroids ends up with Asteroids' flat filter, not Pong's even lowpass. The encoder genuinely reshapes to the new statistics rather than being only slightly perturbed.

Reward probe on Asteroids (logistic regression, AUC on "reward in next 8 frames", base rate 11%):

| Encoder | AUC |
|---|---|
| raw(t, t-1) | 0.609 |
| Pong frozen | 0.627 |
| Breakout frozen | 0.615 |
| SI frozen | 0.623 |
| Asteroids native | 0.623 |
| Pong→Asteroids streamed | 0.626 |

All clustered in 0.61-0.63. Native fit gives no edge over cross-game frozen. Asteroids reward is action-driven (firing while aimed at an asteroid), unrecoverable from past pixels alone. This is the Stage 3 lesson again: obs-only encoder doesn't capture action-driven rewards.
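The probe target and metric above can be sketched as follows. Two details are assumptions: the window is taken to cover frames t+1 through t+8 (excluding the current frame), and AUC is computed by direct pairwise comparison, which is fine for small arrays but memory-hungry for a full 60k-frame buffer. Both helper names are hypothetical.

```python
import numpy as np

def reward_in_next_k(rewards, k=8):
    """label[t] = 1 if any nonzero reward occurs in frames t+1 .. t+k."""
    r = (np.asarray(rewards) != 0).astype(int)
    c = np.concatenate([[0], np.cumsum(r)])        # c[i] = number of reward frames in r[:i]
    t = np.arange(len(r))
    hi = np.minimum(t + k + 1, len(r))
    lo = np.minimum(t + 1, len(r))
    return (c[hi] - c[lo] > 0).astype(int)

def auc(scores, labels):
    """Probability that a random positive outranks a random negative (ties count half)."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    pos, neg = scores[labels == 1], scores[labels == 0]
    diff = pos[:, None] - neg[None, :]
    return float(((diff > 0) + 0.5 * (diff == 0)).mean())

labels = reward_in_next_k([0, 0, 0, 1, 0, 0], k=2)
perfect = auc([0.1, 0.9, 0.8, 0.2], [0, 1, 1, 0])
uninformative = auc([0.5, 0.5, 0.5, 0.5], [0, 1, 1, 0])
```

An AUC of 0.5 means the scores carry no ranking information, so values of 0.61 to 0.63 indicate only weak reward predictability from any of the encoders.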

Enduro reward probe: N/A — random policy never overtakes a car in 60k frames. The probe needs a hand-crafted expert. The encoder still fits cleanly (top corr 0.989); we just lack data to measure reward predictability.

Net refinement of the Stage 9 story: The "universal lowpass + derivative" framing was too strong. Only the derivative (filter 1) generalizes; the lowpass (filter 0) specializes to each game's statistics. Streaming adaptation works as advertised — filters converge to native game statistics rather than staying near the source.

See results/stage11_filter_shapes.png for a side-by-side visualization of all 4 filters × 7 encoder variants.


Overall calibrated story

What ReSU actually offers, after all the testing:

  • Closed-form, deterministic encoder fit. ~2 seconds vs CNN's ~10s with high seed-to-seed variance (CNN val acc swung 0.76 ↔ 0.87 on identical data; ReSU is reproducible).
  • Cheap test-time refit (one SVD, milliseconds). No SSL loss to maintain.
  • Forgetting-free adaptation. Universal spectral directions are preserved during streaming; only game-specific subspaces adapt.
  • Cross-game encoder transfer for free. The L1 lowpass + derivative filters generalize across Pong / Breakout / Space Invaders.

What ReSU does NOT offer (contra the paper's framing):

  • No advantage over backprop CNN at fixed-task BC — they're statistically tied on Pong (-11.85 vs -10.85).
  • No working "deep brain-like network" via stacking. Layer 3 doesn't earn its keep on the games we tested.
  • No demonstrated path to SOTA RL. Vanilla DQN didn't show learning in our compute budget; warm-start RL catastrophically forgot the BC policy. These could probably be fixed with better RL infrastructure (CQL/AWAC), but it's not free.

If you want to use ReSU: it's a competitive (not winning) encoder for static tasks, and a genuinely unique encoder for tasks where the input distribution drifts and you can't afford full retraining or per-update gradient steps. Lifelong/continual-learning scenarios are its natural niche.

If you want to extend the paper's biology story: this work doesn't say much about that. The Drosophila L1/L2/L3 reproduction is the paper's actual scientific contribution; the Atari results don't speak to it.


What remains open

  • A proper offline-RL recipe (CQL/AWAC/IQL) with BC-regularized loss might enable the warm-start RL comparison that vanilla DQN failed at.
  • Live test-time training inside gameplay. Stage 9 streamed over pre-collected frame buffers; the natural follow-up is per-step encoder updates during actual play episodes, with online BC or RL on top.
  • Action injection at a pooled feature layer (the Stage 3 takeaway). Per-pixel layer-1 conditioning was wrong; a layer-2 action-conditioned encoder might work better.
  • Drosophila-style scientific reproduction at full fidelity. The biology story is the paper's actual contribution; this project doesn't address it.

File map

PLAN.md                                — Initial plan (now mostly executed)
RESULTS.md                             — Running stage-by-stage record
README.md                              — This file

src/resu_core.py                       — Core past/future-CCA primitive
src/resu_conv.py                       — 2D conv-style temporal & spatio-temporal layers
src/resu_conv_action.py                — Action-conditioned layer 1
src/resu_streaming.py                  — Streaming EMA + SVD refits (test-time training)
src/atari_env.py                       — Atari preprocessing

src/pong_labels.py, pong_expert.py     — Pong labeller + reactive expert
src/breakout_labels.py, breakout_expert.py — Breakout labeller + ball-tracking expert
src/si_expert.py                       — Space Invaders alien-tracking expert
src/collect_pong_buffer.py             — Random-policy Pong frames
src/collect_si_buffer.py               — Random-policy SI frames + heuristic labels
src/collect_si_expert_demos.py         — SI expert demos
src/eval_expert.py, eval_breakout_expert.py — Sanity-check the experts

src/probe_pong.py                      — Stage 2: linear-probe go/no-go
src/probe_pong_velocity.py             — Stage 2: velocity probe
src/probe_pong_action.py               — Stage 3: action conditioning A/B

src/bc_pong.py, bc_pong_eval.py        — Stage 4a: Pong BC offline + in-env
src/bc_pong_cnn.py                     — Stage 6: CNN BC baseline
src/bc_breakout.py                     — Stage 5: Pong→Breakout streaming

src/dqn_pong.py                        — Stage 7: DQN from scratch (negative)
src/dqn_pong_warmstart.py              — Stage 7: BC-warm-start DQN (also negative)

src/depth_ablation_si.py               — Stage 8: 3-layer stack on Space Invaders
src/transfer_matrix.py                 — Stage 9: timescale + matrix + forgetting
src/asteroids_enduro_stress.py         — Stage 11: visually distinct games
src/asteroids_expert.py                — Stochastic sweep-and-fire expert
src/collect_asteroids_demos.py
src/record_gameplay.py                 — Render Pong/Breakout BC-clone gameplay as GIFs
src/record_asteroids.py                — Render Asteroids BC clone as GIF

media/                                 — Gameplay GIFs (Pong, Breakout)

tests/test_ou_match_reference.py        — Numerical match against paper code
tests/test_deadleaves_drosophila.py     — Qualitative L1/L2/L3 reproduction

ReSU/                                   — Original paper's released repo (read-only ref)
data/                                   — Frame buffers, expert demos, labels
results/                                — Logs and saved arrays for each stage

Acknowledgements

The original paper, code, and Drosophila biology story: Qin, S.; Pughe-Sanford, J.L.; Genkin, A.; Ozdil, P.G.; Greengard, P.; Sengupta, A.M.; Chklovskii, D.B. A Network of Biologically Inspired Rectified Spectral Units (ReSUs) Learns Hierarchical Features Without Error Backpropagation. arXiv:2512.23146 (2025). Code: https://github.com/ShawnQin/ReSU
