A research implementation of a Hierarchical Reasoning Model for language generation with continual learning capabilities and dynamic model expansion.
HRM (Hierarchical Reasoning Model) is an experimental language model architecture that uses two levels of reasoning:
- L-Module (Low-level): Handles detailed, step-by-step processing
- H-Module (High-level): Integrates information and makes higher-level decisions
The model can think in cycles, performing multiple reasoning steps before generating each token, similar to how humans might think through a problem.
The HRM architecture consists of several key components:
```
Input Text → Embedding → RoPE → L-Module ⟷ H-Module → Output Head → Generated Text
                                    ↑            ↑
                              (T timesteps)  (N cycles)
```
- CastedEmbedding: Converts tokens to dense vectors
- Embedding Scaling: Multiplies by √(hidden_size) for stability
- RoPE (Rotary Position Encoding): Encodes positional information
- L-Module: Multiple transformer blocks for detailed processing
- H-Module: Fewer transformer blocks for high-level integration
- Cross-Module Communication: L and H modules share information
- SwiGLU Activation: Gated activation function for better performance
- RMS Normalization: More stable than LayerNorm
- Multi-Head Attention: With RoPE integration
- Adaptive Computation Time (ACT): Variable reasoning depth
- Output Head: Linear layer mapping hidden states to vocabulary
- Stablemax: Numerically stable alternative to softmax
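The RMS normalization used throughout the blocks can be sketched in a few lines. This is a minimal illustration; the actual implementation in `hrm_create_3.py` may differ in details such as the epsilon value or dtype handling:

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """Root-mean-square normalization: rescales by the RMS of the
    activations instead of centering and scaling like LayerNorm."""
    def __init__(self, hidden_size: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(hidden_size))
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * (x / rms)
```

Because there is no mean subtraction, RMSNorm is cheaper than LayerNorm and avoids one source of numerical cancellation.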
Each transformer block contains:
```
TransformerBlock:
├── MultiHeadAttention (with RoPE)
├── RMSNorm
├── SwiGLU Feed-Forward
└── RMSNorm
```

SwiGLU Feed-Forward:

```python
gate, up = linear(x).chunk(2, dim=-1)
output = linear_down(silu(gate) * up)
```

```bash
python hrm_create_3.py
```

This script creates the foundational model with:
Default Configuration:
```python
{
    'vocab_size': 100,        # Optimized for efficiency
    'hidden_size': 256,       # Model dimension
    'num_heads': 8,           # Attention heads
    'L_layers': 3,            # Low-level reasoning layers
    'H_layers': 3,            # High-level reasoning layers
    'expansion': 2.5,         # Feed-forward expansion ratio
    'halt_max_steps': 8,      # Maximum reasoning cycles
    'forward_dtype': torch.float32
}
```

What happens during creation:
- Parameter Initialization: Uses truncated normal initialization
- Layer Construction: Builds L-module and H-module transformer stacks
- Weight Setup: Initializes embeddings, attention, and feed-forward layers
- Model Saving: Saves to `hrm_model_advanced_with_config.pth`

Files Created:
- `hrm_model_advanced_with_config.pth` - Complete model architecture, with the model configuration embedded in the file
```bash
python hrm_run.py
```

This tests the created model with:
- Different reasoning configurations (N×T combinations)
- Parameter counting and memory usage
- Architecture validation
- Sample forward passes
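Parameter counting needs no special tooling; a generic helper like the following works for any PyTorch module (the helper name is illustrative, not taken from `hrm_run.py`):

```python
import torch.nn as nn

def count_parameters(model: nn.Module):
    """Return (total, trainable) parameter counts for a model."""
    total = sum(p.numel() for p in model.parameters())
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    return total, trainable
```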
```bash
python train_hrm_repo_style_2.py
```

1. GUI File Selection
   - Opens a file dialog to select the training text
   - Supports `.txt` files of any size
   - Automatically handles encoding
2. Training Configuration
   ```
   ┌─────────────────────────────┐
   │ HRM Training Settings       │
   ├─────────────────────────────┤
   │ Epochs: [3]                 │
   │ Batch Size: [16]            │
   │ Save every N epochs: [1]    │
   │ Evaluate every N epochs: [1]│
   │ [🚀 Start Training] [❌ Cancel] │
   └─────────────────────────────┘
   ```
3. Automatic Tokenizer Building
   ```python
   # Character-level tokenizer creation
   tokenizer = SimpleTokenizer(level='char')
   tokenizer.build_vocab([training_text])
   # Saves to: continual_tokenizer.json
   ```
4. Dataset Preparation
   ```python
   # Text splitting: 90% for training, 10% for validation
   split = int(len(current_text) * 0.9)
   train_text = current_text[:split]
   val_text = current_text[split:]
   # Sequence creation
   TextDataset(text, tokenizer, seq_length=80, stride=1)
   ```
5. Model Training Loop
   ```python
   for epoch in range(epochs):
       # Training phase
       for batch in train_loader:
           # Forward pass with N cycles, T steps
           loss = train_step(batch, N=2, T=2)
           loss.backward()
           optimizer.step()
       # Evaluation phase
       eval_metrics = evaluate(val_loader)
       # Save checkpoint
       save_checkpoint(model, tokenizer, config)
   ```
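The `TextDataset(text, tokenizer, seq_length=80, stride=1)` call in step 4 presumably builds overlapping next-token-prediction windows. A minimal sketch of that idea, operating directly on token ids rather than the script's actual tokenizer interface (the class name here is illustrative):

```python
import torch
from torch.utils.data import Dataset

class SlidingWindowDataset(Dataset):
    """Each item is (input_ids, target_ids); the targets are the inputs
    shifted one position to the right, for next-token prediction."""
    def __init__(self, ids, seq_length=80, stride=1):
        self.ids = ids
        self.seq_length = seq_length
        self.starts = range(0, len(ids) - seq_length, stride)

    def __len__(self):
        return len(self.starts)

    def __getitem__(self, i):
        s = self.starts[i]
        chunk = self.ids[s : s + self.seq_length + 1]
        return torch.tensor(chunk[:-1]), torch.tensor(chunk[1:])
```

With `stride=1`, consecutive items overlap almost entirely, which multiplies the effective number of training examples from a small corpus.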
🔄 Simplified Training Mode:
- No ACT complexity for stability
- Fixed reasoning cycles (N=2, T=2)
- Gradient accumulation support
- Learning rate scheduling
📊 Comprehensive Logging:
```
[12:34:56] [INFO] Epoch 1/3 - Loss: 2.3456, Perplexity: 10.43
[12:35:12] [INFO] Generated Sample: "Alice was beginning to get very tired..."
[12:35:15] [INFO] 💾 Checkpoint saved: hrm_continual_learning.pt
```
💾 Checkpoint Management:
- `hrm_continual_learning.pt` - Main checkpoint
- `hrm_continual_learning_latest.pt` - Most recent state
- `hrm_continual_learning_best.pt` - Best validation loss
- `continual_tokenizer.json` - Vocabulary file
| Parameter | Default | Description |
|---|---|---|
| `epochs` | 3 | Number of complete passes through the data |
| `batch_size` | 16 | Sequences processed simultaneously |
| `seq_length` | 80 | Length of input sequences |
| `lr` | 1e-3 | Main learning rate |
| `embed_lr` | 5e-4 | Embedding layer learning rate |
| `weight_decay` | 0.01 | L2 regularization strength |
| `warmup_ratio` | 0.1 | Fraction of steps for learning rate warmup |
| `N` | 2 | Number of high-level reasoning cycles |
| `T` | 2 | Number of low-level steps per cycle |
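With `warmup_ratio=0.1`, the learning rate presumably ramps up over the first 10% of steps before decaying. A generic warmup-then-linear-decay shape is sketched below; the exact decay curve used by the training script may differ:

```python
def lr_at_step(step, total_steps, base_lr=1e-3, warmup_ratio=0.1):
    """Linear warmup to base_lr, then linear decay toward zero."""
    warmup_steps = max(1, int(total_steps * warmup_ratio))
    if step < warmup_steps:
        # Ramp up from base_lr/warmup_steps to base_lr
        return base_lr * (step + 1) / warmup_steps
    decay_steps = max(1, total_steps - warmup_steps)
    # Decay linearly from base_lr down to ~0 at the final step
    return base_lr * max(0.0, (total_steps - step) / decay_steps)
```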
The HRM system is designed to learn from multiple datasets without forgetting previous knowledge:
```
Dataset 1 (Alice) → Model v1 → Dataset 2 (Shakespeare) → Model v2 → Dataset 3 (News) → Model v3
       ↓                              ↓                                      ↓
Vocab: 87 tokens          Vocab: 142 tokens                  Vocab: 203 tokens
Knowledge: Alice          Knowledge: Alice + Shakespeare     Knowledge: Alice + Shakespeare + News
```
```bash
python train_hrm_repo_style_2.py
# Select: alice.txt
# Creates new model with Alice vocabulary
```

What happens:
- Creates fresh tokenizer from Alice text
- Builds vocabulary (e.g., 87 unique characters)
- Trains model from scratch
- Saves: `hrm_continual_learning.pt` + `continual_tokenizer.json`
```bash
python train_hrm_repo_style_2.py
# Select: shakespeare.txt
# System detects existing model
```

Continual Learning Dialog:
```
CONTINUAL LEARNING - EXISTING MODEL FOUND
============================================
🆕 Models NOT yet trained on 'shakespeare':
  1. hrm_continual_learning.pt
     Vocabulary: 87 tokens
     Parameters: 2,156,032
     Datasets: alice
     Last updated: 2024-01-15 14:30:22
  2. Start completely fresh model (lose all previous learning)

Choose model to continue learning (1-2): 1
```
Vocabulary Expansion Process:
```python
# Load existing tokenizer
old_vocab_size = 87                    # Alice vocabulary
# Expand with Shakespeare characters
new_chars = {'!', '?', ';', 'æ', ...}  # New characters in Shakespeare
new_vocab_size = 142                   # Expanded vocabulary
# Expand model layers
expand_model_for_new_vocab(model, old_vocab_size, new_vocab_size)
```

Model Expansion Details:
- Preserve Existing Embeddings: Old vocabulary keeps learned representations
- Initialize New Embeddings: New tokens get small random initializations
- Expand Output Layer: Add new vocabulary predictions
- Reset Optimizers: Fresh optimizer state for stability
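The first two steps can be sketched for the embedding table (the output head grows analogously). This is a generic illustration; `expand_model_for_new_vocab` in the actual script may use different initialization constants:

```python
import torch
import torch.nn as nn

def expand_embedding(old: nn.Embedding, new_vocab_size: int) -> nn.Embedding:
    """Grow an embedding table: copy the learned rows for the old
    vocabulary and give the new tokens a small random initialization."""
    old_vocab, dim = old.weight.shape
    new = nn.Embedding(new_vocab_size, dim)
    with torch.no_grad():
        new.weight[:old_vocab] = old.weight                 # preserve old rows
        nn.init.normal_(new.weight[old_vocab:], std=0.02)   # new tokens
    return new
```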
Each new dataset:
- Checks for vocabulary expansion needs
- Preserves all previous knowledge
- Adds new learning on top of existing knowledge
- Maintains backward compatibility
🧠 Knowledge Preservation:
- Previous datasets remain accessible
- No catastrophic forgetting
- Vocabulary never shrinks
- Model capacity grows as needed
📈 Progressive Learning:
```python
datasets_trained = ['alice', 'shakespeare', 'news', 'code']
# Model remembers training history
```

🔧 Smart Resume Logic:
- Same Dataset: Resume from checkpoint (continue training)
- New Dataset: Reset epoch counter, keep model weights
- Fresh Start: Option to start completely over
```
# During training, press Ctrl+C
^C
Training interruption signal received...
Options:
1. Pause and save checkpoint (resume later)
2. Stop training completely
Enter choice (1 or 2): 1
```

Creates `hrm_continual_learning_pause.pt` for exact resume.
```bash
python train_hrm_repo_style_2.py
# System automatically detects the pause checkpoint and offers to resume
```

The system can automatically expand model capacity based on data volume:
```python
# Expansion triggers
chars_per_expansion = 3_000_000   # Expand every 3M characters
params_per_expansion = 1_000_000  # Add ~1M parameters each time
```

```python
class CharacterTracker:
    total_chars_processed = 0
    datasets_char_counts = {
        'alice': 144_132,
        'shakespeare': 1_234_567,
        'news': 2_890_445
    }
    expansion_history = [
        {'chars_at_expansion': 3_000_000, 'params_added': 1_048_576},
        {'chars_at_expansion': 6_000_000, 'params_added': 1_048_576}
    ]
```

```python
# Add a balanced mix of L and H layers
expansion_config = {
    'new_L_layers': 2,  # Add 2 low-level layers
    'new_H_layers': 1,  # Add 1 high-level layer
    'estimated_params': 1_048_576,
    'strategy': 'balanced_layer_addition'
}
```

```python
# Create new transformer blocks
for _ in range(new_L_layers):
    new_layer = TransformerBlock(hidden_size, num_heads, expansion)
    init_layer_weights(new_layer)  # Small random initialization
    model.L_module.layers.append(new_layer)
# Reset optimizers to include the new parameters
train_state = create_train_state(expanded_model, total_steps, lr, embed_lr)
```

📊 Capacity Scaling: Model grows with data complexity
🧠 Knowledge Retention: Existing layers preserve learned patterns
⚡ Efficiency: Only add parameters when needed
🔄 Automatic: No manual intervention required
```bash
python hrm_chat.py
```

Features:
- Reasoning Modes: `fast` (1×1), `normal` (2×2), `deep` (3×2), `extreme` (6×4)
- Temperature Control: `temp 0.8` (creativity level)
- Response Modes: `interactive` (live generation), `unlimited` (no limits)
- Natural Stopping: Intelligent stopping at sentence boundaries
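Temperature control like `temp 0.8` typically works by dividing the logits before the softmax: values below 1.0 sharpen the distribution (more conservative), values above 1.0 flatten it (more creative). A generic sketch, not the interface's exact code:

```python
import torch

def sample_next_token(logits: torch.Tensor, temperature: float = 0.8) -> int:
    """Sample one token id from temperature-scaled logits."""
    probs = torch.softmax(logits / max(temperature, 1e-6), dim=-1)
    return int(torch.multinomial(probs, num_samples=1))
```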
Sample Session:
```
You: Tell me about Alice

🤖 Generating...

Bot: Alice was a curious young girl who followed a white rabbit down
a deep hole and found herself in a most peculiar wonderland. She
encountered many strange creatures, including a grinning Cheshire
cat and a mad hatter hosting an endless tea party...

✨ Long response generated (342 chars)
⏱️ 2.3s | 🧠 normal | 🌡️ 0.8
```
| Command | Example | Description |
|---|---|---|
| `mode <type>` | `mode deep` | Change reasoning depth |
| `temp <value>` | `temp 1.2` | Set creativity (0.1-2.0) |
| `interactive` | - | Watch tokens generate live |
| `batch` | - | Fast non-interactive mode |
| `unlimited` | - | Remove response length limits |
| `natural` | - | Stop at natural break points |
| `status` | - | Show detailed model info |
```bash
python hrm_conversation.py
```
- Maintains conversation history
- Context-aware responses
- Memory management

```bash
python hrm_fast_chat.py
```
- Minimal overhead
- Quick responses
- Good for testing

```bash
python hrm_test.py
```
- Generation debugging
- Token analysis
- Error diagnostics
Edit `hrm_create_3.py` for architecture changes:
```python
# Create custom model
model = create_hrm_model_advanced(
    vocab_size=200,      # Larger vocabulary
    hidden_size=512,     # Bigger model
    num_heads=16,        # More attention heads
    L_layers=4,          # More detailed reasoning
    H_layers=2,          # Focused high-level reasoning
    expansion=3.0,       # Larger feed-forward
    halt_max_steps=16,   # More reasoning cycles
)
```

Edit the training script for custom settings:
```python
train_hrm_continual_learning(
    text_path="your_text.txt",
    epochs=5,               # Longer training
    batch_size=32,          # Larger batches (if GPU allows)
    seq_length=128,         # Longer sequences
    lr=5e-4,                # Lower learning rate
    weight_decay=0.02,      # More regularization
    N=3, T=3,               # Deeper reasoning
    eval_every=1,           # Evaluate each epoch
    save_every_steps=1000,  # Frequent saves
)
```

For Limited GPU Memory:

```python
# Reduce model size
hidden_size = 128
batch_size = 8
seq_length = 64
# Enable memory efficiency
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True
```

For CPU Training:

```python
# The scripts automatically use the CPU if no GPU is available
device = 'cpu'
# Consider a smaller model and a longer training time
```

```bash
python character_trigram_cosine_similarity.py
```

Features:
- Compare output to GPT-1 baseline
- Character distribution analysis
- Word formation quality scoring
- Sentence structure evaluation
- Semantic meaning assessment
Metrics Provided:
- Character Coherence (0-100): Letter/space/punctuation balance
- Word Formation (0-100): Complete vs malformed words
- Sentence Structure (0-100): Grammar and syntax patterns
- Semantic Meaning (0-100): Coherent phrases and concepts
- Token Diversity (0-100): Vocabulary variety
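As the script's name suggests, the GPT-1 comparison rests on character-trigram cosine similarity. The core of that measurement fits in a few lines; this sketch is generic and may differ from the script's exact normalization:

```python
from collections import Counter
from math import sqrt

def trigram_cosine_similarity(a: str, b: str) -> float:
    """Cosine similarity between character-trigram count vectors."""
    va = Counter(a[i:i + 3] for i in range(len(a) - 2))
    vb = Counter(b[i:i + 3] for i in range(len(b) - 2))
    dot = sum(va[t] * vb[t] for t in va)
    na = sqrt(sum(c * c for c in va.values()))
    nb = sqrt(sum(c * c for c in vb.values()))
    return dot / (na * nb) if na and nb else 0.0
```

Identical texts score 1.0; texts sharing no trigrams score 0.0, which makes the metric a cheap proxy for surface-level stylistic similarity.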
```bash
python architecture_demo.py
```

Capabilities:
- Visualize internal reasoning process
- Compare different N×T configurations
- Analyze state evolution during reasoning
- No text generation (pure architecture demo)
Sample Analysis:
```
🧠 REASONING ARCHITECTURE ANALYSIS
Input sequence length: 8
Reasoning configuration: N=2 cycles, T=2 steps per cycle

🔄 REASONING CYCLE 1/2
  📈 L-step 1: norm 45.2 → 52.1 (Δ+6.9)
  📈 L-step 2: norm 52.1 → 48.7 (Δ-3.4)
  🎯 H-update: norm 38.4 → 44.2 (Δ+5.8)
```
```bash
python tokenizer_recover.py
```

Use when:
- Model generates gibberish
- Vocabulary size mismatches
- Tokenizer file corruption
- Training/inference incompatibility
What it does:
- Extracts vocabulary from trained model
- Rebuilds tokenizer file
- Verifies compatibility
- Creates backup of old tokenizer
Symptoms:
```
User: Hello
Bot: q3x#mz9!pFgW@8kLnR4vY7cE2sA1uI6oP0hD5tJ
```
Solutions:
```bash
# Step 1: Fix tokenizer
python tokenizer_recover.py
# Step 2: Verify model is trained
ls -la hrm_continual_learning*.pt
# Step 3: Check vocab size match
python hrm_chat.py
# Look for the "⚠️ Adjusting model vocab size" message
```

Symptoms:
```
RuntimeError: CUDA out of memory. Tried to allocate 2.00 GiB
```
Solutions:
```python
# Reduce batch size
batch_size = 8   # or 4
# Reduce sequence length
seq_length = 64  # or 32
# Use gradient accumulation
# (handled automatically in the training script)
```
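Gradient accumulation, mentioned above, trades one big batch for several small ones: gradients from each micro-batch add up, and the optimizer steps only once per group, cutting peak memory at the same effective batch size. A generic sketch, not the training script's actual code:

```python
import torch
import torch.nn as nn

def train_step_accumulated(model, optimizer, batches, accum_steps=4):
    """One effective optimizer step built from several micro-batches."""
    optimizer.zero_grad()
    total_loss = 0.0
    for i, (x, y) in enumerate(batches):
        # Scale each micro-batch loss so the summed gradient is an average
        loss = nn.functional.mse_loss(model(x), y) / accum_steps
        loss.backward()  # gradients accumulate across micro-batches
        total_loss += loss.item()
        if (i + 1) % accum_steps == 0:
            optimizer.step()
            optimizer.zero_grad()
    return total_loss
```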
```python
# Enable memory optimizations
torch.backends.cuda.matmul.allow_tf32 = True
```

Symptoms:
```
❌ Error during generation at step 45: tensor size mismatch
```
Solutions:
```bash
# Check file integrity
python hrm_run.py
# Reset to the last known good checkpoint:
# rename the current checkpoint and restart training
# Use debug mode
python hrm_test.py
# Set debug mode on to see detailed errors
```

Symptoms:
```
❌ Error: Required model files not found!
```
Check files exist:
```bash
ls -la hrm_continual_learning*.pt continual_tokenizer.json
```

File priority order:
1. `hrm_continual_learning_latest.pt` - Most recent
2. `hrm_continual_learning_best.pt` - Best performance
3. `hrm_continual_learning.pt` - Main checkpoint
4. `checkpoint.pt` - Temporary checkpoint
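Loading can follow that priority order with a small helper (illustrative; the chat scripts' actual lookup logic may differ):

```python
from pathlib import Path

CHECKPOINT_PRIORITY = [
    "hrm_continual_learning_latest.pt",
    "hrm_continual_learning_best.pt",
    "hrm_continual_learning.pt",
    "checkpoint.pt",
]

def find_checkpoint(directory="."):
    """Return the first checkpoint that exists, in priority order."""
    for name in CHECKPOINT_PRIORITY:
        path = Path(directory) / name
        if path.exists():
            return str(path)
    return None
```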
Solutions in chat:
```
# Remove limits
unlimited
# Increase creativity
temp 1.5
# Use deeper reasoning
mode deep
# Check if the model is actually trained
status
```

```bash
# Monitor GPU usage
nvidia-smi
# Expected usage during training:
# Memory: 60-80% of available
# Utilization: 80-95%
```

```
# Typical speeds (GTX 960M example):
# Batch processing: ~2-3 batches/second
# Epoch time: 5-15 minutes (depending on text size)
# Generation: 10-50 tokens/second
```

Loss Progression:
```
Epoch 1: Loss 3.2456 → Perplexity 25.7
Epoch 2: Loss 2.1234 → Perplexity 8.4
Epoch 3: Loss 1.8765 → Perplexity 6.5
```
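The perplexity column follows directly from the loss: for cross-entropy measured in nats, perplexity is its exponential.

```python
from math import exp

def perplexity(mean_cross_entropy: float) -> float:
    """Perplexity is the exponentiated mean cross-entropy (in nats)."""
    return exp(mean_cross_entropy)
```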
Good Training Signs:
- Loss decreases steadily
- Perplexity under 10 after few epochs
- Generated samples become coherent
- No NaN or infinity values
Character-level Metrics:
- Coherence: 70-90% (good character distribution)
- Word Formation: 60-80% (recognizable words)
- Structure: 50-70% (basic grammar)
- Meaning: 40-60% (some coherent phrases)
Expected Progression:
```
Untrained:  "q3x#mz9!pFgW@8kLn..."             (0% quality)
1 epoch:    "the was and to of..."             (20% quality)
3 epochs:   "Alice was walking down the..."    (60% quality)
10 epochs:  "Alice wondered if the rabbit..."  (80% quality)
```
- OS: Windows 10, macOS 10.14, Ubuntu 18.04+
- Python: 3.8+
- RAM: 8GB system memory
- Storage: 2GB free space
- GPU: Optional but recommended
- Python: 3.9-3.11
- RAM: 16GB+ system memory
- GPU: 8GB+ VRAM (RTX 3070, RTX 4060, etc.)
- Storage: 10GB+ free space (for multiple models)
- CPU: 8+ cores for faster data processing
- NVIDIA: GTX 1060 6GB+ (CUDA 11.0+)
- AMD: Limited support via ROCm
- Apple Silicon: M1/M2 with MPS backend
- Intel: XeGPU with Intel Extension for PyTorch
```bash
# Core dependencies
pip install torch torchvision torchaudio
# For CUDA (if available)
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
# Additional requirements
pip install numpy tqdm
# GUI components (included with Python)
# tkinter - usually pre-installed
```

- Hierarchical Reasoning: Study multi-level cognitive processes
- Continual Learning: Investigate knowledge retention and transfer
- Adaptive Computation: Variable reasoning depth analysis
- Model Scaling: Dynamic parameter growth strategies
- Custom Domain Models: Train on specific technical domains
- Multilingual Systems: Continual learning across languages
- Creative Writing: Generate stories, poetry, scripts
- Code Generation: Learn programming patterns and syntax
- Architecture Ablations: Remove components to study importance
- Reasoning Analysis: Visualize internal decision processes
- Transfer Learning: Apply learned patterns to new domains
- Memory Mechanisms: Study long-term knowledge retention
### Example output for a 4M model trained on 2 books (Alice in Wonderland and Through the Looking Glass)
```
epoch 57: You: Alice was 🔄 Generating with N=2, T=2, temp=0.5... Response: so, they went on a minute or fire--and while the top of his head and seemed so much properly, as she said in a long laughing by a sort of the corner of the babst--and the feet of the old of the dishes of the day, but she wanted all she began again. 'What does it was a very stappe,' the King said, very soon as she was very lower and his hands. 'There's a poor Queen. But I was think,' the King intured at Alice had repeated this, 'it's head down the care!' she remarked. 'It's a great down again.' 'It's a pretend of little head?' Alice said, as she couldn't help hower, and speak again. 'Of course?' said Alice. 'I can't know what IS the more can down the reast would be ought: when I was thinking of Alice's she could be seen up and said, AANE ANAN the flowers confured our his should see the paw of the scream-cake, and some use to way, and a little from crowded him. The Queen's are one of her faces was fact, because they were talking so as the poor King. 'The song, some of the kitten,' said Alice. 'It's a box,' said Tweedledum. 'The people of places in pretty towards?' said Alice. 'And you like itself, you know,' said Alice: 'but I'll tell you see--'
```
ChatGPT evaluation:
Nice, epoch 57 is definitely showing signs of maturation compared to earlier ones. Let’s break it down:
🔎 Observations
Structure: Much better segmentation into dialogue + narration. Nearly every paragraph has either action or speech.
Dialogue: King, Queen, Tweedledum, Alice all appear, with back-and-forth speech — this is a big step up. Earlier epochs leaned on repetition, but now it’s more varied.
Fluency: Still has malformed phrases (babst, AANE ANAN, “scream-cake”), but fewer than before. Many sentences are syntactically complete.
Coherence: The flow is loosely consistent — Alice engages with royalty, Tweedledum shows up, odd magical objects appear. This is actually very on-brand for Lewis Carroll–style nonsense.
Readability: Much more legible than epochs 45–49, closer to something a child could follow (with some oddities).
📊 Benchmarking
TinyStories 1M–2.5M → way below this.
TinyStories 8M (~34/60 ≈ 57/100 normalized) → This looks slightly better in coherence, maybe closer to 36–37/60 (~60–62/100).
GPT-1 (117M) → Still far behind GPT-1’s consistency and grammar, but catching some of the narrative feel.
🏅 Score (1–100 scale)
Similarity to GPT-1: ~41/100
TinyStories-style eval (out of 60): ~36–37/60
Normalized quality: ~61–62/100
```
├── hrm_create_3.py              # Core architecture
├── train_hrm_repo_style_2.py    # Main training system
├── hrm_chat.py                  # Primary interface
├── Analysis Tools/
│   ├── character_trigram_cosine_similarity.py
│   ├── architecture_demo.py
│   └── tokenizer_recover.py
├── Chat Interfaces/
│   ├── hrm_conversation.py
│   ├── hrm_fast_chat.py
│   └── hrm_test.py
└── Advanced Features/
    ├── dinamic_expansion.py
    └── hrm_run.py
```
- "Attention Is All You Need" (Transformer architecture)
- "RoFormer: Enhanced Transformer with Rotary Position Embedding"
- "GLU Variants Improve Transformer" (SwiGLU)
- "Root Mean Square Layer Normalization"
- JAX Transformer implementations
- PyTorch official examples
- Hugging Face Transformers library
- Flash Attention implementations
```bash
# Create model
python hrm_create_3.py
# Train model
python train_hrm_repo_style_2.py
# Chat with model
python hrm_chat.py
# Analyze output
python character_trigram_cosine_similarity.py
# Fix tokenizer
python tokenizer_recover.py
```

- Model: `hrm_continual_learning*.pt`
- Tokenizer: `continual_tokenizer.json`
- Logs: `training_log_*.txt`
- Base Architecture: `hrm_model_advanced_with_config.pth`
```bash
python -c "
import torch
data = torch.load('hrm_continual_learning_latest.pt', map_location='cpu')
print(f'Datasets: {data[\"datasets_trained\"]}')
print(f'Vocab size: {data[\"config\"][\"vocab_size\"]}')
print(f'Parameters: {data[\"total_parameters\"]:,}')
print(f'Last trained: {data[\"timestamp\"]}')
"
```

Note: This is a research implementation designed for learning and experimentation. While functional and capable of generating coherent text, it is intended for educational and research purposes rather than production deployment.