A research implementation of a Hierarchical Reasoning Model for language generation with continual learning capabilities and dynamic model expansion.
HRM (Hierarchical Reasoning Model) is an experimental language model architecture that uses two levels of reasoning:
- L-Module (Low-level): Handles detailed, step-by-step processing
- H-Module (High-level): Integrates information and makes higher-level decisions
The model can think in cycles, performing multiple reasoning steps before generating each token, similar to how humans might think through a problem.
The HRM architecture consists of several key components:
```
Input Text → Embedding → RoPE → L-Module ⟷ H-Module → Output Head → Generated Text
                                    ↑            ↑
                              (T timesteps)  (N cycles)
```
- CastedEmbedding: Converts tokens to dense vectors
- Embedding Scaling: Multiplies by √(hidden_size) for stability
- RoPE (Rotary Position Encoding): Encodes positional information
- L-Module: Multiple transformer blocks for detailed processing
- H-Module: Fewer transformer blocks for high-level integration
- Cross-Module Communication: L and H modules share information
- SwiGLU Activation: Gated activation function for better performance
- RMS Normalization: More stable than LayerNorm
- Multi-Head Attention: With RoPE integration
- Adaptive Computation Time (ACT): Variable reasoning depth
- Output Head: Linear layer mapping hidden states to vocabulary
- Stablemax: Numerically stable alternative to softmax
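The RMS normalization used throughout the blocks can be sketched in a few lines. This is a minimal illustration; the actual implementation in `hrm_create_3.py` may differ in details such as the epsilon value or dtype handling:

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """Root-mean-square normalization: rescales by the RMS of the
    activations instead of centering and scaling like LayerNorm."""
    def __init__(self, hidden_size: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(hidden_size))
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * (x / rms)
```

Because there is no mean subtraction, RMSNorm is cheaper than LayerNorm and avoids one source of numerical cancellation.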
Each transformer block contains:
```
TransformerBlock:
├── MultiHeadAttention (with RoPE)
├── RMSNorm
├── SwiGLU Feed-Forward
└── RMSNorm
```

SwiGLU Feed-Forward:

```python
gate, up = linear(x).chunk(2, dim=-1)
output = linear_down(silu(gate) * up)
```

```bash
python hrm_create_3.py
```

This script creates the foundational model with:
Default Configuration:
```python
{
    'vocab_size': 100,        # Optimized for efficiency
    'hidden_size': 256,       # Model dimension
    'num_heads': 8,           # Attention heads
    'L_layers': 3,            # Low-level reasoning layers
    'H_layers': 3,            # High-level reasoning layers
    'expansion': 2.5,         # Feed-forward expansion ratio
    'halt_max_steps': 8,      # Maximum reasoning cycles
    'forward_dtype': torch.float32
}
```

What happens during creation:
- Parameter Initialization: Uses truncated normal initialization
- Layer Construction: Builds L-module and H-module transformer stacks
- Weight Setup: Initializes embeddings, attention, and feed-forward layers
- Model Saving: Saves to `hrm_model_advanced_with_config.pth`

Files Created:
- `hrm_model_advanced_with_config.pth` - Complete model architecture, with the model configuration embedded in the file
```bash
python hrm_run.py
```

This tests the created model with:
- Different reasoning configurations (N×T combinations)
- Parameter counting and memory usage
- Architecture validation
- Sample forward passes
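Parameter counting needs no special tooling; a generic helper like the following works for any PyTorch module (the helper name is illustrative, not taken from `hrm_run.py`):

```python
import torch.nn as nn

def count_parameters(model: nn.Module):
    """Return (total, trainable) parameter counts for a model."""
    total = sum(p.numel() for p in model.parameters())
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    return total, trainable
```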
```bash
python train_hrm_repo_style_2.py
```

1. GUI File Selection
   - Opens a file dialog to select the training text
   - Supports `.txt` files of any size
   - Automatically handles encoding
2. Training Configuration
   ```
   ┌─────────────────────────────┐
   │ HRM Training Settings       │
   ├─────────────────────────────┤
   │ Epochs: [3]                 │
   │ Batch Size: [16]            │
   │ Save every N epochs: [1]    │
   │ Evaluate every N epochs: [1]│
   │ [🚀 Start Training] [❌ Cancel] │
   └─────────────────────────────┘
   ```
3. Automatic Tokenizer Building
   ```python
   # Character-level tokenizer creation
   tokenizer = SimpleTokenizer(level='char')
   tokenizer.build_vocab([training_text])
   # Saves to: continual_tokenizer.json
   ```
4. Dataset Preparation
   ```python
   # Text splitting: 90% for training, 10% for validation
   split = int(len(current_text) * 0.9)
   train_text = current_text[:split]
   val_text = current_text[split:]
   # Sequence creation
   TextDataset(text, tokenizer, seq_length=80, stride=1)
   ```
5. Model Training Loop
   ```python
   for epoch in range(epochs):
       # Training phase
       for batch in train_loader:
           # Forward pass with N cycles, T steps
           loss = train_step(batch, N=2, T=2)
           loss.backward()
           optimizer.step()
       # Evaluation phase
       eval_metrics = evaluate(val_loader)
       # Save checkpoint
       save_checkpoint(model, tokenizer, config)
   ```
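The `TextDataset(text, tokenizer, seq_length=80, stride=1)` call in step 4 presumably builds overlapping next-token-prediction windows. A minimal sketch of that idea, operating directly on token ids rather than the script's actual tokenizer interface (the class name here is illustrative):

```python
import torch
from torch.utils.data import Dataset

class SlidingWindowDataset(Dataset):
    """Each item is (input_ids, target_ids); the targets are the inputs
    shifted one position to the right, for next-token prediction."""
    def __init__(self, ids, seq_length=80, stride=1):
        self.ids = ids
        self.seq_length = seq_length
        self.starts = range(0, len(ids) - seq_length, stride)

    def __len__(self):
        return len(self.starts)

    def __getitem__(self, i):
        s = self.starts[i]
        chunk = self.ids[s : s + self.seq_length + 1]
        return torch.tensor(chunk[:-1]), torch.tensor(chunk[1:])
```

With `stride=1`, consecutive items overlap almost entirely, which multiplies the effective number of training examples from a small corpus.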
🔄 Simplified Training Mode:
- No ACT complexity for stability
- Fixed reasoning cycles (N=2, T=2)
- Gradient accumulation support
- Learning rate scheduling
📊 Comprehensive Logging:
```
[12:34:56] [INFO] Epoch 1/3 - Loss: 2.3456, Perplexity: 10.43
[12:35:12] [INFO] Generated Sample: "Alice was beginning to get very tired..."
[12:35:15] [INFO] 💾 Checkpoint saved: hrm_continual_learning.pt
```
💾 Checkpoint Management:
- `hrm_continual_learning.pt` - Main checkpoint
- `hrm_continual_learning_latest.pt` - Most recent state
- `hrm_continual_learning_best.pt` - Best validation loss
- `continual_tokenizer.json` - Vocabulary file
| Parameter | Default | Description |
|---|---|---|
| `epochs` | 3 | Number of complete passes through the data |
| `batch_size` | 16 | Sequences processed simultaneously |
| `seq_length` | 80 | Length of input sequences |
| `lr` | 1e-3 | Main learning rate |
| `embed_lr` | 5e-4 | Embedding layer learning rate |
| `weight_decay` | 0.01 | L2 regularization strength |
| `warmup_ratio` | 0.1 | Fraction of steps for learning rate warmup |
| `N` | 2 | Number of high-level reasoning cycles |
| `T` | 2 | Number of low-level steps per cycle |
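With `warmup_ratio=0.1`, the learning rate presumably ramps up over the first 10% of steps before decaying. A generic warmup-then-linear-decay shape is sketched below; the exact decay curve used by the training script may differ:

```python
def lr_at_step(step, total_steps, base_lr=1e-3, warmup_ratio=0.1):
    """Linear warmup to base_lr, then linear decay toward zero."""
    warmup_steps = max(1, int(total_steps * warmup_ratio))
    if step < warmup_steps:
        # Ramp up from base_lr/warmup_steps to base_lr
        return base_lr * (step + 1) / warmup_steps
    decay_steps = max(1, total_steps - warmup_steps)
    # Decay linearly from base_lr down to ~0 at the final step
    return base_lr * max(0.0, (total_steps - step) / decay_steps)
```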
The HRM system is designed to learn from multiple datasets without forgetting previous knowledge:
```
Dataset 1 (Alice) → Model v1 → Dataset 2 (Shakespeare) → Model v2 → Dataset 3 (News) → Model v3
       ↓                              ↓                                      ↓
Vocab: 87 tokens          Vocab: 142 tokens                  Vocab: 203 tokens
Knowledge: Alice          Knowledge: Alice + Shakespeare     Knowledge: Alice + Shakespeare + News
```
```bash
python train_hrm_repo_style_2.py
# Select: alice.txt
# Creates new model with Alice vocabulary
```

What happens:
- Creates fresh tokenizer from Alice text
- Builds vocabulary (e.g., 87 unique characters)
- Trains model from scratch
- Saves: `hrm_continual_learning.pt` + `continual_tokenizer.json`
```bash
python train_hrm_repo_style_2.py
# Select: shakespeare.txt
# System detects existing model
```

Continual Learning Dialog:
```
CONTINUAL LEARNING - EXISTING MODEL FOUND
============================================
🆕 Models NOT yet trained on 'shakespeare':
  1. hrm_continual_learning.pt
     Vocabulary: 87 tokens
     Parameters: 2,156,032
     Datasets: alice
     Last updated: 2024-01-15 14:30:22
  2. Start completely fresh model (lose all previous learning)

Choose model to continue learning (1-2): 1
```
Vocabulary Expansion Process:
```python
# Load existing tokenizer
old_vocab_size = 87                    # Alice vocabulary
# Expand with Shakespeare characters
new_chars = {'!', '?', ';', 'æ', ...}  # New characters in Shakespeare
new_vocab_size = 142                   # Expanded vocabulary
# Expand model layers
expand_model_for_new_vocab(model, old_vocab_size, new_vocab_size)
```

Model Expansion Details:
- Preserve Existing Embeddings: Old vocabulary keeps learned representations
- Initialize New Embeddings: New tokens get small random initializations
- Expand Output Layer: Add new vocabulary predictions
- Reset Optimizers: Fresh optimizer state for stability
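The first two steps can be sketched for the embedding table (the output head grows analogously). This is a generic illustration; `expand_model_for_new_vocab` in the actual script may use different initialization constants:

```python
import torch
import torch.nn as nn

def expand_embedding(old: nn.Embedding, new_vocab_size: int) -> nn.Embedding:
    """Grow an embedding table: copy the learned rows for the old
    vocabulary and give the new tokens a small random initialization."""
    old_vocab, dim = old.weight.shape
    new = nn.Embedding(new_vocab_size, dim)
    with torch.no_grad():
        new.weight[:old_vocab] = old.weight                 # preserve old rows
        nn.init.normal_(new.weight[old_vocab:], std=0.02)   # new tokens
    return new
```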
Each new dataset:
- Checks for vocabulary expansion needs
- Preserves all previous knowledge
- Adds new learning on top of existing knowledge
- Maintains backward compatibility
🧠 Knowledge Preservation:
- Previous datasets remain accessible
- No catastrophic forgetting
- Vocabulary never shrinks
- Model capacity grows as needed
📈 Progressive Learning:
```python
datasets_trained = ['alice', 'shakespeare', 'news', 'code']
# Model remembers training history
```

🔧 Smart Resume Logic:
- Same Dataset: Resume from checkpoint (continue training)
- New Dataset: Reset epoch counter, keep model weights
- Fresh Start: Option to start completely over
```
# During training, press Ctrl+C
^C
Training interruption signal received...
Options:
1. Pause and save checkpoint (resume later)
2. Stop training completely
Enter choice (1 or 2): 1
```

Creates `hrm_continual_learning_pause.pt` for exact resume.
```bash
python train_hrm_repo_style_2.py
# System automatically detects the pause checkpoint and offers to resume
```

The system can automatically expand model capacity based on data volume:
```python
# Expansion triggers
chars_per_expansion = 3_000_000   # Expand every 3M characters
params_per_expansion = 1_000_000  # Add ~1M parameters each time
```

```python
class CharacterTracker:
    total_chars_processed = 0
    datasets_char_counts = {
        'alice': 144_132,
        'shakespeare': 1_234_567,
        'news': 2_890_445
    }
    expansion_history = [
        {'chars_at_expansion': 3_000_000, 'params_added': 1_048_576},
        {'chars_at_expansion': 6_000_000, 'params_added': 1_048_576}
    ]
```

```python
# Add a balanced mix of L and H layers
expansion_config = {
    'new_L_layers': 2,  # Add 2 low-level layers
    'new_H_layers': 1,  # Add 1 high-level layer
    'estimated_params': 1_048_576,
    'strategy': 'balanced_layer_addition'
}
```

```python
# Create new transformer blocks
for _ in range(new_L_layers):
    new_layer = TransformerBlock(hidden_size, num_heads, expansion)
    init_layer_weights(new_layer)  # Small random initialization
    model.L_module.layers.append(new_layer)
# Reset optimizers to include the new parameters
train_state = create_train_state(expanded_model, total_steps, lr, embed_lr)
```

📊 Capacity Scaling: Model grows with data complexity
🧠 Knowledge Retention: Existing layers preserve learned patterns
⚡ Efficiency: Only add parameters when needed
🔄 Automatic: No manual intervention required
```bash
python hrm_chat.py
```

Features:
- Reasoning Modes: `fast` (1×1), `normal` (2×2), `deep` (3×2), `extreme` (6×4)
- Temperature Control: `temp 0.8` (creativity level)
- Response Modes: `interactive` (live generation), `unlimited` (no limits)
- Natural Stopping: Intelligent stopping at sentence boundaries
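Temperature control like `temp 0.8` typically works by dividing the logits before the softmax: values below 1.0 sharpen the distribution (more conservative), values above 1.0 flatten it (more creative). A generic sketch, not the interface's exact code:

```python
import torch

def sample_next_token(logits: torch.Tensor, temperature: float = 0.8) -> int:
    """Sample one token id from temperature-scaled logits."""
    probs = torch.softmax(logits / max(temperature, 1e-6), dim=-1)
    return int(torch.multinomial(probs, num_samples=1))
```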
Sample Session:
```
You: Tell me about Alice

🤖 Generating...

Bot: Alice was a curious young girl who followed a white rabbit down
a deep hole and found herself in a most peculiar wonderland. She
encountered many strange creatures, including a grinning Cheshire
cat and a mad hatter hosting an endless tea party...

✨ Long response generated (342 chars)
⏱️ 2.3s | 🧠 normal | 🌡️ 0.8
```
| Command | Example | Description |
|---|---|---|
| `mode <type>` | `mode deep` | Change reasoning depth |
| `temp <value>` | `temp 1.2` | Set creativity (0.1-2.0) |
| `interactive` | - | Watch tokens generate live |
| `batch` | - | Fast non-interactive mode |
| `unlimited` | - | Remove response length limits |
| `natural` | - | Stop at natural break points |
| `status` | - | Show detailed model info |
```bash
python hrm_conversation.py
```
- Maintains conversation history
- Context-aware responses
- Memory management

```bash
python hrm_fast_chat.py
```
- Minimal overhead
- Quick responses
- Good for testing

```bash
python hrm_test.py
```
- Generation debugging
- Token analysis
- Error diagnostics
Edit `hrm_create_3.py` for architecture changes:
```python
# Create custom model
model = create_hrm_model_advanced(
    vocab_size=200,      # Larger vocabulary
    hidden_size=512,     # Bigger model
    num_heads=16,        # More attention heads
    L_layers=4,          # More detailed reasoning
    H_layers=2,          # Focused high-level reasoning
    expansion=3.0,       # Larger feed-forward
    halt_max_steps=16,   # More reasoning cycles
)
```

Edit the training script for custom settings:
```python
train_hrm_continual_learning(
    text_path="your_text.txt",
    epochs=5,               # Longer training
    batch_size=32,          # Larger batches (if GPU allows)
    seq_length=128,         # Longer sequences
    lr=5e-4,                # Lower learning rate
    weight_decay=0.02,      # More regularization
    N=3, T=3,               # Deeper reasoning
    eval_every=1,           # Evaluate each epoch
    save_every_steps=1000,  # Frequent saves
)
```

For Limited GPU Memory:

```python
# Reduce model size
hidden_size = 128
batch_size = 8
seq_length = 64
# Enable memory efficiency
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True
```

For CPU Training:

```python
# The scripts automatically use the CPU if no GPU is available
device = 'cpu'
# Consider a smaller model and a longer training time
```

```bash
python character_trigram_cosine_similarity.py
```

Features:
- Compare output to GPT-1 baseline
- Character distribution analysis
- Word formation quality scoring
- Sentence structure evaluation
- Semantic meaning assessment
Metrics Provided:
- Character Coherence (0-100): Letter/space/punctuation balance
- Word Formation (0-100): Complete vs malformed words
- Sentence Structure (0-100): Grammar and syntax patterns
- Semantic Meaning (0-100): Coherent phrases and concepts
- Token Diversity (0-100): Vocabulary variety
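As the script's name suggests, the GPT-1 comparison rests on character-trigram cosine similarity. The core of that measurement fits in a few lines; this sketch is generic and may differ from the script's exact normalization:

```python
from collections import Counter
from math import sqrt

def trigram_cosine_similarity(a: str, b: str) -> float:
    """Cosine similarity between character-trigram count vectors."""
    va = Counter(a[i:i + 3] for i in range(len(a) - 2))
    vb = Counter(b[i:i + 3] for i in range(len(b) - 2))
    dot = sum(va[t] * vb[t] for t in va)
    na = sqrt(sum(c * c for c in va.values()))
    nb = sqrt(sum(c * c for c in vb.values()))
    return dot / (na * nb) if na and nb else 0.0
```

Identical texts score 1.0; texts sharing no trigrams score 0.0, which makes the metric a cheap proxy for surface-level stylistic similarity.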
```bash
python architecture_demo.py
```

Capabilities:
- Visualize internal reasoning process
- Compare different N×T configurations
- Analyze state evolution during reasoning
- No text generation (pure architecture demo)
Sample Analysis:
```
🧠 REASONING ARCHITECTURE ANALYSIS
Input sequence length: 8
Reasoning configuration: N=2 cycles, T=2 steps per cycle

🔄 REASONING CYCLE 1/2
  📈 L-step 1: norm 45.2 → 52.1 (Δ+6.9)
  📈 L-step 2: norm 52.1 → 48.7 (Δ-3.4)
  🎯 H-update: norm 38.4 → 44.2 (Δ+5.8)
```
```bash
python tokenizer_recover.py
```

Use when:
- Model generates gibberish
- Vocabulary size mismatches
- Tokenizer file corruption
- Training/inference incompatibility
What it does:
- Extracts vocabulary from trained model
- Rebuilds tokenizer file
- Verifies compatibility
- Creates backup of old tokenizer
Symptoms:
```
User: Hello
Bot: q3x#mz9!pFgW@8kLnR4vY7cE2sA1uI6oP0hD5tJ
```
Solutions:
```bash
# Step 1: Fix tokenizer
python tokenizer_recover.py
# Step 2: Verify model is trained
ls -la hrm_continual_learning*.pt
# Step 3: Check vocab size match
python hrm_chat.py
# Look for the "⚠️ Adjusting model vocab size" message
```

Symptoms:
```
RuntimeError: CUDA out of memory. Tried to allocate 2.00 GiB
```
Solutions:
```python
# Reduce batch size
batch_size = 8   # or 4
# Reduce sequence length
seq_length = 64  # or 32
# Use gradient accumulation
# (handled automatically in the training script)
```
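Gradient accumulation, mentioned above, trades one big batch for several small ones: gradients from each micro-batch add up, and the optimizer steps only once per group, cutting peak memory at the same effective batch size. A generic sketch, not the training script's actual code:

```python
import torch
import torch.nn as nn

def train_step_accumulated(model, optimizer, batches, accum_steps=4):
    """One effective optimizer step built from several micro-batches."""
    optimizer.zero_grad()
    total_loss = 0.0
    for i, (x, y) in enumerate(batches):
        # Scale each micro-batch loss so the summed gradient is an average
        loss = nn.functional.mse_loss(model(x), y) / accum_steps
        loss.backward()  # gradients accumulate across micro-batches
        total_loss += loss.item()
        if (i + 1) % accum_steps == 0:
            optimizer.step()
            optimizer.zero_grad()
    return total_loss
```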
```python
# Enable memory optimizations
torch.backends.cuda.matmul.allow_tf32 = True
```

Symptoms:
```
❌ Error during generation at step 45: tensor size mismatch
```
Solutions:
```bash
# Check file integrity
python hrm_run.py
# Reset to the last known good checkpoint:
# rename the current checkpoint and restart training
# Use debug mode
python hrm_test.py
# Set debug mode on to see detailed errors
```

Symptoms:
```
❌ Error: Required model files not found!
```
Check files exist:
```bash
ls -la hrm_continual_learning*.pt continual_tokenizer.json
```

File priority order:
1. `hrm_continual_learning_latest.pt` - Most recent
2. `hrm_continual_learning_best.pt` - Best performance
3. `hrm_continual_learning.pt` - Main checkpoint
4. `checkpoint.pt` - Temporary checkpoint
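Loading can follow that priority order with a small helper (illustrative; the chat scripts' actual lookup logic may differ):

```python
from pathlib import Path

CHECKPOINT_PRIORITY = [
    "hrm_continual_learning_latest.pt",
    "hrm_continual_learning_best.pt",
    "hrm_continual_learning.pt",
    "checkpoint.pt",
]

def find_checkpoint(directory="."):
    """Return the first checkpoint that exists, in priority order."""
    for name in CHECKPOINT_PRIORITY:
        path = Path(directory) / name
        if path.exists():
            return str(path)
    return None
```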
Solutions in chat:
```
# Remove limits
unlimited
# Increase creativity
temp 1.5
# Use deeper reasoning
mode deep
# Check if the model is actually trained
status
```

```bash
# Monitor GPU usage
nvidia-smi
# Expected usage during training:
# Memory: 60-80% of available
# Utilization: 80-95%
```

```
# Typical speeds (GTX 960M example):
# Batch processing: ~2-3 batches/second
# Epoch time: 5-15 minutes (depending on text size)
# Generation: 10-50 tokens/second
```

Loss Progression:
```
Epoch 1: Loss 3.2456 → Perplexity 25.7
Epoch 2: Loss 2.1234 → Perplexity 8.4
Epoch 3: Loss 1.8765 → Perplexity 6.5
```
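The perplexity column follows directly from the loss: for cross-entropy measured in nats, perplexity is its exponential.

```python
from math import exp

def perplexity(mean_cross_entropy: float) -> float:
    """Perplexity is the exponentiated mean cross-entropy (in nats)."""
    return exp(mean_cross_entropy)
```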
Good Training Signs:
- Loss decreases steadily
- Perplexity under 10 after few epochs
- Generated samples become coherent
- No NaN or infinity values
Character-level Metrics:
- Coherence: 70-90% (good character distribution)
- Word Formation: 60-80% (recognizable words)
- Structure: 50-70% (basic grammar)
- Meaning: 40-60% (some coherent phrases)
Expected Progression:
```
Untrained:  "q3x#mz9!pFgW@8kLn..."             (0% quality)
1 epoch:    "the was and to of..."             (20% quality)
3 epochs:   "Alice was walking down the..."    (60% quality)
10 epochs:  "Alice wondered if the rabbit..."  (80% quality)
```
- OS: Windows 10, macOS 10.14, Ubuntu 18.04+
- Python: 3.8+
- RAM: 8GB system memory
- Storage: 2GB free space
- GPU: Optional but recommended
- Python: 3.9-3.11
- RAM: 16GB+ system memory
- GPU: 8GB+ VRAM (RTX 3070, RTX 4060, etc.)
- Storage: 10GB+ free space (for multiple models)
- CPU: 8+ cores for faster data processing
- NVIDIA: GTX 1060 6GB+ (CUDA 11.0+)
- AMD: Limited support via ROCm
- Apple Silicon: M1/M2 with MPS backend
- Intel: XeGPU with Intel Extension for PyTorch
```bash
# Core dependencies
pip install torch torchvision torchaudio
# For CUDA (if available)
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
# Additional requirements
pip install numpy tqdm
# GUI components (included with Python)
# tkinter - usually pre-installed
```

- Hierarchical Reasoning: Study multi-level cognitive processes
- Continual Learning: Investigate knowledge retention and transfer
- Adaptive Computation: Variable reasoning depth analysis
- Model Scaling: Dynamic parameter growth strategies
- Custom Domain Models: Train on specific technical domains
- Multilingual Systems: Continual learning across languages
- Creative Writing: Generate stories, poetry, scripts
- Code Generation: Learn programming patterns and syntax
- Architecture Ablations: Remove components to study importance
- Reasoning Analysis: Visualize internal decision processes
- Transfer Learning: Apply learned patterns to new domains
- Memory Mechanisms: Study long-term knowledge retention
### Example output for a 4M model trained on 2 books (Alice in Wonderland and Through the Looking Glass)
```
epoch 57: You: Alice was 🔄 Generating with N=2, T=2, temp=0.5... Response: so, they went on a minute or fire--and while the top of his head and seemed so much properly, as she said in a long laughing by a sort of the corner of the babst--and the feet of the old of the dishes of the day, but she wanted all she began again. 'What does it was a very stappe,' the King said, very soon as she was very lower and his hands. 'There's a poor Queen. But I was think,' the King intured at Alice had repeated this, 'it's head down the care!' she remarked. 'It's a great down again.' 'It's a pretend of little head?' Alice said, as she couldn't help hower, and speak again. 'Of course?' said Alice. 'I can't know what IS the more can down the reast would be ought: when I was thinking of Alice's she could be seen up and said, AANE ANAN the flowers confured our his should see the paw of the scream-cake, and some use to way, and a little from crowded him. The Queen's are one of her faces was fact, because they were talking so as the poor King. 'The song, some of the kitten,' said Alice. 'It's a box,' said Tweedledum. 'The people of places in pretty towards?' said Alice. 'And you like itself, you know,' said Alice: 'but I'll tell you see--'
```
ChatGPT evaluation:
Nice, epoch 57 is definitely showing signs of maturation compared to earlier ones. Let’s break it down:
🔎 Observations
Structure: Much better segmentation into dialogue + narration. Nearly every paragraph has either action or speech.
Dialogue: King, Queen, Tweedledum, Alice all appear, with back-and-forth speech — this is a big step up. Earlier epochs leaned on repetition, but now it’s more varied.
Fluency: Still has malformed phrases (babst, AANE ANAN, “scream-cake”), but fewer than before. Many sentences are syntactically complete.
Coherence: The flow is loosely consistent — Alice engages with royalty, Tweedledum shows up, odd magical objects appear. This is actually very on-brand for Lewis Carroll–style nonsense.
Readability: Much more legible than epochs 45–49, closer to something a child could follow (with some oddities).
📊 Benchmarking
TinyStories 1M–2.5M → way below this.
TinyStories 8M (~34/60 ≈ 57/100 normalized) → This looks slightly better in coherence, maybe closer to 36–37/60 (~60–62/100).
GPT-1 (117M) → Still far behind GPT-1’s consistency and grammar, but catching some of the narrative feel.
🏅 Score (1–100 scale)
Similarity to GPT-1: ~41/100
TinyStories-style eval (out of 60): ~36–37/60
Normalized quality: ~61–62/100
```
├── hrm_create_3.py              # Core architecture
├── train_hrm_repo_style_2.py    # Main training system
├── hrm_chat.py                  # Primary interface
├── Analysis Tools/
│   ├── character_trigram_cosine_similarity.py
│   ├── architecture_demo.py
│   └── tokenizer_recover.py
├── Chat Interfaces/
│   ├── hrm_conversation.py
│   ├── hrm_fast_chat.py
│   └── hrm_test.py
└── Advanced Features/
    ├── dinamic_expansion.py
    └── hrm_run.py
```
- "Attention Is All You Need" (Transformer architecture)
- "RoFormer: Enhanced Transformer with Rotary Position Embedding"
- "GLU Variants Improve Transformer" (SwiGLU)
- "Root Mean Square Layer Normalization"
- JAX Transformer implementations
- PyTorch official examples
- Hugging Face Transformers library
- Flash Attention implementations
```bash
# Create model
python hrm_create_3.py
# Train model
python train_hrm_repo_style_2.py
# Chat with model
python hrm_chat.py
# Analyze output
python character_trigram_cosine_similarity.py
# Fix tokenizer
python tokenizer_recover.py
```

- Model: `hrm_continual_learning*.pt`
- Tokenizer: `continual_tokenizer.json`
- Logs: `training_log_*.txt`
- Base Architecture: `hrm_model_advanced_with_config.pth`
```bash
python -c "
import torch
data = torch.load('hrm_continual_learning_latest.pt', map_location='cpu')
print(f'Datasets: {data[\"datasets_trained\"]}')
print(f'Vocab size: {data[\"config\"][\"vocab_size\"]}')
print(f'Parameters: {data[\"total_parameters\"]:,}')
print(f'Last trained: {data[\"timestamp\"]}')
"
```

Note: This is a research implementation designed for learning and experimentation. While functional and capable of generating coherent text, it is intended for educational and research purposes rather than production deployment.