Skip to content

EduardTalianu/HRM_txt

Repository files navigation

HRM: Hierarchical Reasoning Model

A research implementation of a Hierarchical Reasoning Model for language generation with continual learning capabilities and dynamic model expansion.

🧠 What is HRM?

HRM (Hierarchical Reasoning Model) is an experimental language model architecture that uses two levels of reasoning:

  • L-Module (Low-level): Handles detailed, step-by-step processing
  • H-Module (High-level): Integrates information and makes higher-level decisions

The model can think in cycles, performing multiple reasoning steps before generating each token, similar to how humans might think through a problem.

🏗️ Architecture Overview

Core Components

The HRM architecture consists of several key components:

Input Text → Embedding → RoPE → L-Module ⟷ H-Module → Output Head → Generated Text
                                    ↑           ↑
                              (T timesteps) (N cycles)

1. Input Processing

  • CastedEmbedding: Converts tokens to dense vectors
  • Embedding Scaling: Multiplies by √(hidden_size) for stability
  • RoPE (Rotary Position Encoding): Encodes positional information

2. Hierarchical Reasoning Modules

  • L-Module: Multiple transformer blocks for detailed processing
  • H-Module: Fewer transformer blocks for high-level integration
  • Cross-Module Communication: L and H modules share information

3. Advanced Components

  • SwiGLU Activation: Gated activation function for better performance
  • RMS Normalization: More stable than LayerNorm
  • Multi-Head Attention: With RoPE integration
  • Adaptive Computation Time (ACT): Variable reasoning depth

4. Output Generation

  • Output Head: Linear layer mapping hidden states to vocabulary
  • Stablemax: Numerically stable alternative to softmax

Transformer Block Details

Each transformer block contains:

TransformerBlock:
├── MultiHeadAttention (with RoPE)
├── RMSNorm
├── SwiGLU Feed-Forward
└── RMSNorm

SwiGLU Feed-Forward:

gate, up = linear(x).chunk(2, dim=-1)
output = linear_down(silu(gate) * up)

🚀 Model Creation Process

Step 1: Create Base Architecture

python hrm_create_3.py

This script creates the foundational model with:

Default Configuration:

{
    'vocab_size': 100,        # Optimized for efficiency
    'hidden_size': 256,       # Model dimension
    'num_heads': 8,           # Attention heads
    'L_layers': 3,            # Low-level reasoning layers
    'H_layers': 3,            # High-level reasoning layers
    'expansion': 2.5,         # Feed-forward expansion ratio
    'halt_max_steps': 8,      # Maximum reasoning cycles
    'forward_dtype': torch.float32
}

What happens during creation:

  1. Parameter Initialization: Uses truncated normal initialization
  2. Layer Construction: Builds L-module and H-module transformer stacks
  3. Weight Setup: Initializes embeddings, attention, and feed-forward layers
  4. Model Saving: Saves to hrm_model_advanced_with_config.pth

Files Created:

  • hrm_model_advanced_with_config.pth - Complete model architecture
  • Model configuration embedded in the file

Step 2: Verify Model Creation

python hrm_run.py

This tests the created model with:

  • Different reasoning configurations (N×T combinations)
  • Parameter counting and memory usage
  • Architecture validation
  • Sample forward passes

🎓 Training Process

Initial Training Setup

python train_hrm_repo_style_2.py

Training Workflow:

  1. GUI File Selection

    • Opens file dialog to select training text
    • Supports .txt files of any size
    • Automatically handles encoding
  2. Training Configuration

    ┌─────────────────────────────┐
    │  HRM Training Settings      │
    ├─────────────────────────────┤
    │  Epochs: [3]                │
    │  Batch Size: [16]           │
    │  Save every N epochs: [1]   │
    │  Evaluate every N epochs: [1]│
    │  [🚀 Start Training] [❌ Cancel]│
    └─────────────────────────────┘
    
  3. Automatic Tokenizer Building

    # Character-level tokenizer creation
    tokenizer = SimpleTokenizer(level='char')
    tokenizer.build_vocab([training_text])
    # Saves to: continual_tokenizer.json
  4. Dataset Preparation

    # Text splitting
    train_text = current_text[:90%]  # 90% for training
    val_text = current_text[90%:]    # 10% for validation
    
    # Sequence creation
    TextDataset(text, tokenizer, seq_length=80, stride=1)
  5. Model Training Loop

    for epoch in range(epochs):
        # Training phase
        for batch in train_loader:
            # Forward pass with N cycles, T steps
            loss = train_step(batch, N=2, T=2)
            loss.backward()
            optimizer.step()
        
        # Evaluation phase
        eval_metrics = evaluate(val_loader)
        
        # Save checkpoint
        save_checkpoint(model, tokenizer, config)

Training Features:

🔄 Simplified Training Mode:

  • No ACT complexity for stability
  • Fixed reasoning cycles (N=2, T=2)
  • Gradient accumulation support
  • Learning rate scheduling

📊 Comprehensive Logging:

[12:34:56] [INFO] Epoch 1/3 - Loss: 2.3456, Perplexity: 10.43
[12:35:12] [INFO] Generated Sample: "Alice was beginning to get very tired..."
[12:35:15] [INFO] 💾 Checkpoint saved: hrm_continual_learning.pt

💾 Checkpoint Management:

  • hrm_continual_learning.pt - Main checkpoint
  • hrm_continual_learning_latest.pt - Most recent state
  • hrm_continual_learning_best.pt - Best validation loss
  • continual_tokenizer.json - Vocabulary file

Training Parameters Explained

Parameter Default Description
epochs 3 Number of complete passes through data
batch_size 16 Sequences processed simultaneously
seq_length 80 Length of input sequences
lr 1e-3 Main learning rate
embed_lr 5e-4 Embedding layer learning rate
weight_decay 0.01 L2 regularization strength
warmup_ratio 0.1 Fraction of steps for learning rate warmup
N 2 Number of high-level reasoning cycles
T 2 Number of low-level steps per cycle

🔄 Continual Learning System

How Continual Learning Works

The HRM system is designed to learn from multiple datasets without forgetting previous knowledge:

Dataset 1 (Alice) → Model v1 → Dataset 2 (Shakespeare) → Model v2 → Dataset 3 (News) → Model v3
     ↓                            ↓                              ↓
 Vocab: 87 tokens          Vocab: 142 tokens             Vocab: 203 tokens
 Knowledge: Alice         Knowledge: Alice + Shakespeare  Knowledge: Alice + Shakespeare + News

Step-by-Step Continual Learning

1. First Dataset Training

python train_hrm_repo_style_2.py
# Select: alice.txt
# Creates new model with Alice vocabulary

What happens:

  • Creates fresh tokenizer from Alice text
  • Builds vocabulary (e.g., 87 unique characters)
  • Trains model from scratch
  • Saves: hrm_continual_learning.pt + continual_tokenizer.json

2. Second Dataset (Automatic Continuation)

python train_hrm_repo_style_2.py
# Select: shakespeare.txt
# System detects existing model

Continual Learning Dialog:

CONTINUAL LEARNING - EXISTING MODEL FOUND
============================================

🆕 Models NOT yet trained on 'shakespeare':
  1. hrm_continual_learning.pt
     Vocabulary: 87 tokens
     Parameters: 2,156,032
     Datasets: alice
     Last updated: 2024-01-15 14:30:22

  2. Start completely fresh model (lose all previous learning)

Choose model to continue learning (1-2): 1

Vocabulary Expansion Process:

# Load existing tokenizer
old_vocab_size = 87  # Alice vocabulary

# Expand with Shakespeare characters
new_chars = {'!', '?', ';', 'æ', ...}  # New characters in Shakespeare
new_vocab_size = 142  # Expanded vocabulary

# Expand model layers
expand_model_for_new_vocab(model, old_vocab_size, new_vocab_size)

Model Expansion Details:

  1. Preserve Existing Embeddings: Old vocabulary keeps learned representations
  2. Initialize New Embeddings: New tokens get small random initializations
  3. Expand Output Layer: Add new vocabulary predictions
  4. Reset Optimizers: Fresh optimizer state for stability

3. Third Dataset and Beyond

Each new dataset:

  • Checks for vocabulary expansion needs
  • Preserves all previous knowledge
  • Adds new learning on top of existing knowledge
  • Maintains backward compatibility

Continual Learning Features

🧠 Knowledge Preservation:

  • Previous datasets remain accessible
  • No catastrophic forgetting
  • Vocabulary never shrinks
  • Model capacity grows as needed

📈 Progressive Learning:

datasets_trained = ['alice', 'shakespeare', 'news', 'code']
# Model remembers training history

🔧 Smart Resume Logic:

  • Same Dataset: Resume from checkpoint (continue training)
  • New Dataset: Reset epoch counter, keep model weights
  • Fresh Start: Option to start completely over

Managing Training Sessions

Pause and Resume Training

# During training, press Ctrl+C
^C
Training interruption signal received...
Options:
1. Pause and save checkpoint (resume later)
2. Stop training completely
Enter choice (1 or 2): 1

Creates hrm_continual_learning_pause.pt for exact resume.

Resume Paused Training

python train_hrm_repo_style_2.py
# System automatically detects pause checkpoint and offers to resume

📈 Dynamic Model Expansion

Automatic Model Growth

The system can automatically expand model capacity based on data volume:

# Expansion triggers
chars_per_expansion = 3_000_000    # Expand every 3M characters
params_per_expansion = 1_000_000   # Add ~1M parameters each time

How Expansion Works

1. Character Tracking

class CharacterTracker:
    total_chars_processed: 0
    datasets_char_counts: {
        'alice': 144_132,
        'shakespeare': 1_234_567,
        'news': 2_890_445
    }
    expansion_history: [
        {'chars_at_expansion': 3_000_000, 'params_added': 1_048_576},
        {'chars_at_expansion': 6_000_000, 'params_added': 1_048_576}
    ]

2. Expansion Strategy

# Add balanced mix of L and H layers
expansion_config = {
    'new_L_layers': 2,      # Add 2 low-level layers
    'new_H_layers': 1,      # Add 1 high-level layer
    'estimated_params': 1_048_576,
    'strategy': 'balanced_layer_addition'
}

3. Layer Addition Process

# Create new transformer blocks
for _ in range(new_L_layers):
    new_layer = TransformerBlock(hidden_size, num_heads, expansion)
    init_layer_weights(new_layer)  # Small random initialization
    model.L_module.layers.append(new_layer)

# Reset optimizers to include new parameters
train_state = create_train_state(expanded_model, total_steps, lr, embed_lr)

Expansion Benefits

📊 Capacity Scaling: Model grows with data complexity 🧠 Knowledge Retention: Existing layers preserve learned patterns
⚡ Efficiency: Only add parameters when needed 🔄 Automatic: No manual intervention required

💬 Using Your Trained Model

Primary Chat Interface

python hrm_chat.py

Features:

  • Reasoning Modes: fast (1×1), normal (2×2), deep (3×2), extreme (6×4)
  • Temperature Control: temp 0.8 (creativity level)
  • Response Modes: interactive (live generation), unlimited (no limits)
  • Natural Stopping: Intelligent stopping at sentence boundaries

Sample Session:

You: Tell me about Alice
🤖 Generating...
Bot: Alice was a curious young girl who followed a white rabbit down 
a deep hole and found herself in a most peculiar wonderland. She 
encountered many strange creatures, including a grinning Cheshire 
cat and a mad hatter hosting an endless tea party...

     ✨ Long response generated (342 chars)
     ⏱️ 2.3s | 🧠 normal | 🌡️ 0.8

Chat Commands Reference

Command Example Description
mode <type> mode deep Change reasoning depth
temp <value> temp 1.2 Set creativity (0.1-2.0)
interactive - Watch tokens generate live
batch - Fast non-interactive mode
unlimited - Remove response length limits
natural - Stop at natural break points
status - Show detailed model info

Alternative Chat Interfaces

Conversation Memory Chat

python hrm_conversation.py
  • Maintains conversation history
  • Context-aware responses
  • Memory management

Fast Lightweight Chat

python hrm_fast_chat.py
  • Minimal overhead
  • Quick responses
  • Good for testing

Debug Chat

python hrm_test.py
  • Generation debugging
  • Token analysis
  • Error diagnostics

🔧 Advanced Configuration

Model Architecture Tuning

Edit hrm_create_3.py for architecture changes:

# Create custom model
model = create_hrm_model_advanced(
    vocab_size=200,           # Larger vocabulary
    hidden_size=512,          # Bigger model
    num_heads=16,             # More attention heads
    L_layers=4,               # More detailed reasoning
    H_layers=2,               # Focused high-level reasoning
    expansion=3.0,            # Larger feed-forward
    halt_max_steps=16,        # More reasoning cycles
)

Training Hyperparameters

Edit training script for custom settings:

train_hrm_continual_learning(
    text_path="your_text.txt",
    epochs=5,                 # Longer training
    batch_size=32,            # Larger batches (if GPU allows)
    seq_length=128,           # Longer sequences
    lr=5e-4,                  # Lower learning rate
    weight_decay=0.02,        # More regularization
    N=3, T=3,                 # Deeper reasoning
    eval_every=1,             # Evaluate each epoch
    save_every_steps=1000,    # Frequent saves
)

Memory Optimization

For Limited GPU Memory:

# Reduce model size
hidden_size=128
batch_size=8
seq_length=64

# Enable memory efficiency
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True

For CPU Training:

# Will automatically use CPU if no GPU
device = 'cpu'
# Consider smaller model and longer training time

🔍 Analysis and Debugging Tools

Output Quality Analysis

python character_trigram_cosine_similarity.py

Features:

  • Compare output to GPT-1 baseline
  • Character distribution analysis
  • Word formation quality scoring
  • Sentence structure evaluation
  • Semantic meaning assessment

Metrics Provided:

  • Character Coherence (0-100): Letter/space/punctuation balance
  • Word Formation (0-100): Complete vs malformed words
  • Sentence Structure (0-100): Grammar and syntax patterns
  • Semantic Meaning (0-100): Coherent phrases and concepts
  • Token Diversity (0-100): Vocabulary variety

Architecture Visualization

python architecture_demo.py

Capabilities:

  • Visualize internal reasoning process
  • Compare different N×T configurations
  • Analyze state evolution during reasoning
  • No text generation (pure architecture demo)

Sample Analysis:

🧠 REASONING ARCHITECTURE ANALYSIS
Input sequence length: 8
Reasoning configuration: N=2 cycles, T=2 steps per cycle

🔄 REASONING CYCLE 1/2
  📈 L-step 1: norm 45.2 → 52.1 (Δ+6.9)
  📈 L-step 2: norm 52.1 → 48.7 (Δ-3.4)
  🎯 H-update: norm 38.4 → 44.2 (Δ+5.8)

Tokenizer Recovery

python tokenizer_recover.py

Use when:

  • Model generates gibberish
  • Vocabulary size mismatches
  • Tokenizer file corruption
  • Training/inference incompatibility

What it does:

  • Extracts vocabulary from trained model
  • Rebuilds tokenizer file
  • Verifies compatibility
  • Creates backup of old tokenizer

🐛 Troubleshooting Guide

Common Issues and Solutions

1. "Model generates gibberish"

Symptoms:

User: Hello
Bot: q3x#mz9!pFgW@8kLnR4vY7cE2sA1uI6oP0hD5tJ

Solutions:

# Step 1: Fix tokenizer
python tokenizer_recover.py

# Step 2: Verify model is trained
ls -la hrm_continual_learning*.pt

# Step 3: Check vocab size match
python hrm_chat.py
# Look for "âš ï¸ Adjusting model vocab size" message

2. "Out of memory during training"

Symptoms:

RuntimeError: CUDA out of memory. Tried to allocate 2.00 GiB

Solutions:

# Reduce batch size
batch_size = 8  # or 4

# Reduce sequence length  
seq_length = 64  # or 32

# Use gradient accumulation
# (handled automatically in training script)

# Enable memory optimizations
torch.backends.cuda.matmul.allow_tf32 = True

3. "Training stops with errors"

Symptoms:

❌ Error during generation at step 45: tensor size mismatch

Solutions:

# Check file integrity
python hrm_run.py

# Reset to last known good checkpoint
# Rename current checkpoint and restart training

# Use debug mode
python hrm_test.py
# Set debug mode on to see detailed errors

4. "Model won't load/resume training"

Symptoms:

❌ Error: Required model files not found!

Check files exist:

ls -la hrm_continual_learning*.pt continual_tokenizer.json

File priority order:

  1. hrm_continual_learning_latest.pt - Most recent
  2. hrm_continual_learning_best.pt - Best performance
  3. hrm_continual_learning.pt - Main checkpoint
  4. checkpoint.pt - Temporary checkpoint

5. "Responses too short/repetitive"

Solutions in chat:

# Remove limits
unlimited

# Increase creativity
temp 1.5

# Use deeper reasoning
mode deep

# Check if model is actually trained
status

Performance Optimization

GPU Utilization

# Monitor GPU usage
nvidia-smi

# Expected usage during training:
# Memory: 60-80% of available
# Utilization: 80-95%

Training Speed

# Typical speeds (GTX 960M example):
# Batch processing: ~2-3 batches/second
# Epoch time: 5-15 minutes (depending on text size)
# Generation: 10-50 tokens/second

📊 Model Performance Metrics

Training Metrics

Loss Progression:

Epoch 1: Loss 3.2456 → Perplexity 25.7
Epoch 2: Loss 2.1234 → Perplexity 8.4  
Epoch 3: Loss 1.8765 → Perplexity 6.5

Good Training Signs:

  • Loss decreases steadily
  • Perplexity under 10 after few epochs
  • Generated samples become coherent
  • No NaN or infinity values

Generation Quality

Character-level Metrics:

  • Coherence: 70-90% (good character distribution)
  • Word Formation: 60-80% (recognizable words)
  • Structure: 50-70% (basic grammar)
  • Meaning: 40-60% (some coherent phrases)

Expected Progression:

Untrained: "q3x#mz9!pFgW@8kLn..." (0% quality)
1 epoch:   "the was and to of..." (20% quality)  
3 epochs:  "Alice was walking down the..." (60% quality)
10 epochs: "Alice wondered if the rabbit..." (80% quality)

📋 System Requirements

Minimum Requirements

  • OS: Windows 10, macOS 10.14, Ubuntu 18.04+
  • Python: 3.8+
  • RAM: 8GB system memory
  • Storage: 2GB free space
  • GPU: Optional but recommended

Recommended Setup

  • Python: 3.9-3.11
  • RAM: 16GB+ system memory
  • GPU: 8GB+ VRAM (RTX 3070, RTX 4060, etc.)
  • Storage: 10GB+ free space (for multiple models)
  • CPU: 8+ cores for faster data processing

GPU Compatibility

  • NVIDIA: GTX 1060 6GB+ (CUDA 11.0+)
  • AMD: Limited support via ROCm
  • Apple Silicon: M1/M2 with MPS backend
  • Intel: XeGPU with Intel Extension for PyTorch

Installation

# Core dependencies
pip install torch torchvision torchaudio

# For CUDA (if available)
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

# Additional requirements
pip install numpy tqdm

# GUI components (included with Python)
# tkinter - usually pre-installed

🔬 Research Applications

Academic Research

  • Hierarchical Reasoning: Study multi-level cognitive processes
  • Continual Learning: Investigate knowledge retention and transfer
  • Adaptive Computation: Variable reasoning depth analysis
  • Model Scaling: Dynamic parameter growth strategies

Practical Applications

  • Custom Domain Models: Train on specific technical domains
  • Multilingual Systems: Continual learning across languages
  • Creative Writing: Generate stories, poetry, scripts
  • Code Generation: Learn programming patterns and syntax

Experimental Features

  • Architecture Ablations: Remove components to study importance
  • Reasoning Analysis: Visualize internal decision processes
  • Transfer Learning: Apply learned patterns to new domains
  • Memory Mechanisms: Study long-term knowledge retention

###Example output for a 4M model trained on 2 books (Alice in Wonderland and Through the Looking Glass)

epoch 57: You: Alice was 🔄 Generating with N=2, T=2, temp=0.5... Response: so, they went on a minute or fire--and while the top of his head and seemed so much properly, as she said in a long laughing by a sort of the corner of the babst--and the feet of the old of the dishes of the day, but she wanted all she began again. 'What does it was a very stappe,' the King said, very soon as she was very lower and his hands. 'There's a poor Queen. But I was think,' the King intured at Alice had repeated this, 'it's head down the care!' she remarked. 'It's a great down again.' 'It's a pretend of little head?' Alice said, as she couldn't help hower, and speak again. 'Of course?' said Alice. 'I can't know what IS the more can down the reast would be ought: when I was thinking of Alice's she could be seen up and said, AANE ANAN the flowers confured our his should see the paw of the scream-cake, and some use to way, and a little from crowded him. The Queen's are one of her faces was fact, because they were talking so as the poor King. 'The song, some of the kitten,' said Alice. 'It's a box,' said Tweedledum. 'The people of places in pretty towards?' said Alice. 'And you like itself, you know,' said Alice: 'but I'll tell you see--'

Chat GPT evaluation:

Nice, epoch 57 is definitely showing signs of maturation compared to earlier ones. Let’s break it down:

🔎 Observations

Structure: Much better segmentation into dialogue + narration. Nearly every paragraph has either action or speech.

Dialogue: King, Queen, Tweedledum, Alice all appear, with back-and-forth speech — this is a big step up. Earlier epochs leaned on repetition, but now it’s more varied.

Fluency: Still has malformed phrases (babst, AANE ANAN, “scream-cake”), but fewer than before. Many sentences are syntactically complete.

Coherence: The flow is loosely consistent — Alice engages with royalty, Tweedledum shows up, odd magical objects appear. This is actually very on-brand for Lewis Carroll–style nonsense.

Readability: Much more legible than epochs 45–49, closer to something a child could follow (with some oddities).

📊 Benchmarking

TinyStories 1M–2.5M → way below this.

TinyStories 8M (~34/60 ≈ 57/100 normalized) → This looks slightly better in coherence, maybe closer to 36–37/60 (~60–62/100).

GPT-1 (117M) → Still far behind GPT-1’s consistency and grammar, but catching some of the narrative feel.

🏅 Score (1–100 scale)

Similarity to GPT-1: ~41/100

TinyStories-style eval (out of 60): ~36–37/60

Normalized quality: ~61–62/100

Code Structure

├── hrm_create_3.py           # Core architecture
├── train_hrm_repo_style_2.py # Main training system
├── hrm_chat.py               # Primary interface
├── Analysis Tools/
│   ├── character_trigram_cosine_similarity.py
│   ├── architecture_demo.py
│   └── tokenizer_recover.py
├── Chat Interfaces/
│   ├── hrm_conversation.py
│   ├── hrm_fast_chat.py
│   └── hrm_test.py
└── Advanced Features/
    ├── dinamic_expansion.py
    └── hrm_run.py

📚 Additional Resources

Research Papers

  • "Attention Is All You Need" (Transformer architecture)
  • "RoFormer: Enhanced Transformer with Rotary Position Embedding"
  • "GLU Variants Improve Transformer" (SwiGLU)
  • "Root Mean Square Layer Normalization"

Implementation References

  • JAX Transformer implementations
  • PyTorch official examples
  • Hugging Face Transformers library
  • Flash Attention implementations

🎯 Quick Reference

Essential Commands

# Create model
python hrm_create_3.py

# Train model  
python train_hrm_repo_style_2.py

# Chat with model
python hrm_chat.py

# Analyze output
python character_trigram_cosine_similarity.py

# Fix tokenizer
python tokenizer_recover.py

File Locations

  • Model: hrm_continual_learning*.pt
  • Tokenizer: continual_tokenizer.json
  • Logs: training_log_*.txt
  • Base Architecture: hrm_model_advanced_with_config.pth

Model Status Check

python -c "
import torch
data = torch.load('hrm_continual_learning_latest.pt', map_location='cpu')
print(f'Datasets: {data[\"datasets_trained\"]}')
print(f'Vocab size: {data[\"config\"][\"vocab_size\"]}')
print(f'Parameters: {data[\"total_parameters\"]:,}')
print(f'Last trained: {data[\"timestamp\"]}')
"

Note: This is a research implementation designed for learning and experimentation. While functional and capable of generating coherent text, it's intended for educational and research purposes rather than production deployment.

About

a text implementation of the HRM (Hierarchical reasoning) model (https://arxiv.org/abs/2506.21734)

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages