
Intrinsic Reward Policy Optimization


Authors

  • Minjae Cho (corresponding author) - The Grainger College of Engineering, University of Illinois Urbana-Champaign
  • Huy T. Tran - The Grainger College of Engineering, University of Illinois Urbana-Champaign

Citation

Please cite our paper if you use this code or algorithm for any part of your research or work:

@misc{cho2026intrinsicrewardpolicyoptimization,
      title={Intrinsic Reward Policy Optimization for Sparse-Reward Environments}, 
      author={Minjae Cho and Huy Trong Tran},
      year={2026},
      eprint={2601.21391},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2601.21391}, 
}

Prerequisites

In any local folder, open a terminal and run the following command to download our package into the current directory:

git clone https://github.com/Mgineer117/IRPO/
cd IRPO

We assume that you have Conda installed. If not, please refer to the Anaconda installation guide. Python 3.11.11 was used for our code.

We recommend creating a dedicated virtual environment as follows:

conda create -n IRPO python=3.11
conda activate IRPO

Then, install the required Python packages using:

pip install -r requirements.txt

Training

Our code uses the following command to train algorithms:

python3 main.py --env-name pointmaze-v1 --algo-name irpo

where all argument values should be written in lowercase.

The above command runs the IRPO algorithm (see the algorithm figure in the repository).
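As a rough conceptual sketch of how intrinsic-reward methods address sparse rewards in general (this is not the paper's exact IRPO objective; the `beta` coefficient and intrinsic values below are illustrative assumptions), the sparse extrinsic reward is augmented with a scaled intrinsic bonus so the policy receives a dense learning signal:

```python
# Generic intrinsic-reward shaping sketch -- NOT the exact IRPO update.
# `beta` and the intrinsic values below are illustrative assumptions.
def shaped_rewards(extrinsic, intrinsic, beta=0.1):
    """Combine sparse extrinsic rewards with a scaled intrinsic bonus."""
    return [r_e + beta * r_i for r_e, r_i in zip(extrinsic, intrinsic)]

# Sparse extrinsic reward: zero on every step except the goal-reaching one.
ext = [0.0, 0.0, 0.0, 1.0]
# Hypothetical intrinsic bonus, e.g. a novelty or exploration signal.
intr = [0.5, 0.3, 0.2, 0.1]
shaped = shaped_rewards(ext, intr)
print(shaped)
```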

Logging

We support three logging options—Weights & Biases (WandB), TensorBoard, and CSV—to accommodate different user preferences. Specifically, when WandB is properly configured on your local machine, all algorithmic and parameter settings, along with real-time training metrics, are automatically logged to your WandB dashboard. Simultaneously, training results are saved locally in TensorBoard format for visualization, and evaluation metrics are exported as CSV files for easy analysis. In addition, model parameters are saved throughout training, including the best-performing checkpoint.
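For illustration, CSV-style metric logging can be as simple as the following stdlib-only sketch (this is not the repo's actual logger; the file name and metric keys are made up):

```python
import csv
import os
import tempfile

# Illustrative sketch (not the repo's actual logger): append per-step
# evaluation metrics to a CSV file for easy offline analysis.
def log_metrics_csv(path, step, metrics):
    """Append one row of metrics; write the header on first use."""
    new_file = not os.path.exists(path)
    with open(path, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["step", *metrics])
        if new_file:
            writer.writeheader()
        writer.writerow({"step": step, **metrics})

# Hypothetical metric names for demonstration only.
path = os.path.join(tempfile.mkdtemp(), "eval.csv")
log_metrics_csv(path, 0, {"return": 0.0, "success": 0.0})
log_metrics_csv(path, 1000, {"return": 12.5, "success": 0.4})

with open(path) as f:
    rows = list(csv.reader(f))
print(rows[0])  # header row: step, return, success
```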

Environments

We consider the following widely used sparse-reward environments:

(See the environment overview figure in the repository.)

Note that the environments used in our experiments are included in this repository. For details of our environment configuration, see Appendix C of our paper.
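To make the sparse-reward setting concrete, here is a toy, dependency-free example (purely illustrative, not one of the benchmark environments): the agent receives a nonzero reward only on the single step that reaches the goal, so most transitions carry no learning signal.

```python
# Toy 1-D chain illustrating sparse rewards: reward is zero on every
# step except the one that reaches the goal state at the end.
class SparseChain:
    def __init__(self, length=10):
        self.length = length
        self.pos = 0

    def reset(self):
        self.pos = 0
        return self.pos

    def step(self, action):  # action in {-1, +1}
        self.pos = max(0, min(self.length, self.pos + action))
        done = self.pos == self.length
        reward = 1.0 if done else 0.0  # sparse: nonzero only at the goal
        return self.pos, reward, done

env = SparseChain(length=5)
env.reset()
total = 0.0
for _ in range(5):
    _, r, done = env.step(+1)
    total += r
print(total)  # 1.0: only the final, goal-reaching step is rewarded
```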

License

This project is licensed under the MIT License. See the LICENSE.md file for details.
