Refactor examples #7

Merged
kbno merged 22 commits into main from feat/update_examples
Mar 5, 2025
Conversation

@sayeg84
Collaborator

@sayeg84 sayeg84 commented Feb 26, 2025

PR Checklist

  • Bug fix
  • Feature addition/change
  • Documentation addition/change
  • [ ] Test addition/change
  • [ ] Black formatting

Describe your changes here:

Currently, the examples folder contains some yaml files and the notebook. While these examples are not incorrect, they are not ideal: we are all using H5s to train, and the ChignolinDataset is not widely used. This has become quite a challenge in the last year, when we've had to train a lot of people.

In this PR, I refactored the whole examples folder to give it a folder structure. To summarize:

  • examples/h5_pl now contains two folders: single_molecule and multiple_molecules. Each folder contains:

    • a small H5 dataset
    • training and partition yamls
    • prior objects
    • initial configurations and simulation yamls
    • a README file explaining how to use them
      Each folder can be used for training, merging an epoch with a prior, and simulating, and each has its own README where all of this is explained. single_molecule is the 1L2Y example from mlcg-tk and multiple_molecules is the small H5 from the demo version of the repo.
  • Added a README to examples and to every subfolder to explain the contents and where a person should start when learning.

  • Removed save_h5.py script that was meant more for internal use.

Other minor changes in this PR:

  • Removed references to folders on FU filesystem
  • Added a feature to sparsify and desparsify the parameter tensors in Dihedral and Harmonic prior objects (especially useful as the pre-specialized prior can take a lot of memory)
  • Modified the .gitignore to add the ckpt and tensorboard folders generated by small trainings.

@nec4
Collaborator

nec4 commented Feb 27, 2025

Thanks for this work! I can take a look tomorrow if needed.

@nec4
Collaborator

nec4 commented Feb 27, 2025

Would it also be worth updating the notebook here?:

https://github.com/ClementiGroup/mlcg/blob/main/examples/notebooks/Building_A_Coarse_Grain_Model.ipynb

I think I had a local draft PR for modernizing that one - I can find it if we want.

@kbno
Collaborator

kbno commented Feb 27, 2025

Thanks @sayeg84 and @nec4 , I would also like to take a look but I will only have the time next week, can you wait for me before you merge it?

@sayeg84
Collaborator Author

sayeg84 commented Feb 27, 2025

@nec4 IIRC the current notebook at examples/notebooks/Building_A_Coarse_Grain_Model.ipynb is the most up-to-date version, with the corrections from the modernizing PR.

@kbno We can wait a while before approving this, there are many changes :)

Collaborator

@nec4 nec4 left a comment


Great work! I left a few small suggestions and questions.

- h5py (install can be done with `conda install -c conda-forge h5py`)
- Bundled datasets in single (or several) HDF5 files.
- Parallelized training on multiple GPUs with distributed data parallel (DDP) and low memory footprint.
- External description of dataset partition that enables.
Collaborator


Is something missing on this line?

Collaborator Author

@sayeg84 sayeg84 Mar 2, 2025


Yes, I've pushed to finish the sentence


We will train a model with the `1L2Y_prior_tag.h5` dataset that was previously generated using `mlcg-tk`. To train this model, we need the partition file that specifies how to split the dataset into training and validation regions: the file `partition_1L2Y_prior_tag.yaml`. These two files define the dataset, its partition, and the batch size that will be used to train a model.

To go ahead to the trainng, the file `training.yaml` is a Pytorch lightning yaml defines the architecture of the model to use (`model` field), the optimizer (`optimizer`), the trainer specifications (`trainer`) and the dataset (`dataset`). Note that `data.h5_file_path` and `data.partition_options` point to the `1L2Y_prior_tag.h5` and `partition_1L2Y_prior_tag.yaml` files, respectively, that will be used as training data and
Collaborator


Suggested change
To go ahead to the trainng, the file `training.yaml` is a Pytorch lightning yaml defines the architecture of the model to use (`model` field), the optimizer (`optimizer`), the trainer specifications (`trainer`) and the dataset (`dataset`). Note that `data.h5_file_path` and `data.partition_options` point to the `1L2Y_prior_tag.h5` and `partition_1L2Y_prior_tag.yaml` files, respectively, that will be used as training data and
To go ahead to the trainng, the file `training.yaml` is a Pytorch Lightning yaml defines the architecture of the model to use (`model` field), the optimizer (`optimizer`), the trainer specifications (`trainer`) and the dataset (`dataset`). Note that `data.h5_file_path` and `data.partition_options` point to the `1L2Y_prior_tag.h5` and `partition_1L2Y_prior_tag.yaml` files, respectively, that will be used as training data and


To go ahead to the trainng, the file `training.yaml` is a Pytorch lightning yaml defines the architecture of the model to use (`model` field), the optimizer (`optimizer`), the trainer specifications (`trainer`) and the dataset (`dataset`). Note that `data.h5_file_path` and `data.partition_options` point to the `1L2Y_prior_tag.h5` and `partition_1L2Y_prior_tag.yaml` files, respectively, that will be used as training data and

To train, we can run from a terminal (that)
Collaborator


what does (that) refer to here?


In the 1L2Y example, it is possible that the simulation exits before finishing after throwing an error related to "Simulation blewup at timestep #..."

This problem is related to the fact that the prior was fitted with very little data and is not a good enough prior to avoid unphysical configurations in our system.
Collaborator


Suggested change
This problem is related to the fact that the prior was fitted with very little data and is not a good enough prior to avoid unphysical configurations in our system.
This problem is related to the fact that the prior was fitted with very little data and is not a good enough prior to avoid unphysical configurations in our system. For real production models, we recommend running prior-only simulations to help see if your prior is suitably accurate/stable against reference feature distributions.

Collaborator Author


I added something similar to, but different from, your suggestion. I don't want to mention the reference feature distributions because the relation of prior-only simulations to the reference distributions is more complicated.

Comment on lines +1 to +35
import torch
from warnings import warn
from .prior import _Prior, Dihedral, Harmonic


def sparsify_prior_module(module: _Prior) -> torch.nn.Module:
    r"""
    Converts buffer tensors to sparse tensors inplace for Harmonic and Dihedral objects
    """
    if isinstance(module, Dihedral):
        module.v_0 = module.v_0.to_sparse()
        module.k1s = module.k1s.to_sparse()
        module.k2s = module.k2s.to_sparse()
    elif issubclass(type(module), Harmonic):
        module.x_0 = module.x_0.to_sparse()
        module.k = module.k.to_sparse()
    else:
        warn(
            "Module is not supported for sparsification. It will be returned as is"
        )
    return module


def desparsify_prior_module(module: _Prior) -> torch.nn.Module:
    r"""
    Converts parameter tensors inplace to dense tensors in Harmonic and Dihedral objects
    """
    if isinstance(module, Dihedral):
        module.v_0 = module.v_0.to_dense()
        module.k1s = module.k1s.to_dense()
        module.k2s = module.k2s.to_dense()
    elif issubclass(type(module), Harmonic):
        module.x_0 = module.x_0.to_dense()
        module.k = module.k.to_dense()
    return module
Collaborator


Good stuff - not sure if it's worth writing a small unit test to make sure the prior buffers remain the same after sparsifying/desparsifying - what do you think?
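A minimal version of such a round-trip check, sketched here with plain dense tensors standing in for the prior buffers (illustrative values; the real test would operate on `Dihedral`/`Harmonic` instances), could look like:

```python
import torch

# Stand-ins for prior parameter buffers (made-up values, not real priors)
x_0 = torch.tensor([[0.0, 1.2], [0.0, 0.0]])
k = torch.tensor([[0.0, 10.0], [3.5, 0.0]])

# Round trip: sparsify, then densify, mirroring what the new helpers do
x_0_roundtrip = x_0.to_sparse().to_dense()
k_roundtrip = k.to_sparse().to_dense()

# The buffers must be unchanged by the round trip
assert torch.equal(x_0, x_0_roundtrip)
assert torch.equal(k, k_roundtrip)
```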

| Folder | Contents | Intended audience |
| :---------: | :---------: | :-------------: |
|`notebooks`|Notebook showing the training, simulation and simulation analysis of an MLCG model for Chignolin | People interested in understanding the procedure for building and testing an mlcg model but not interested in getting a good model nor applying it to a system of their own. |
|`h5_pl`| Files and input yamls for training a toy model of 1L2Y and a transferable model | People who intend to build an mlcg model to a system of their own and that have gone through the [mlcg-tk package example](https://github.com/ClementiGroup/mlcg-tk/tree/main/examples) for preparing AA data into a trainable H5 |
| `input_yamls`| Example yaml files that can be passed to the scripts | People that went trough the examples in `h5_pl` folder |
Collaborator


Suggested change
| `input_yamls`| Example yaml files that can be passed to the scripts | People that went trough the examples in `h5_pl` folder |
| `input_yamls`| Example yaml files that can be passed to the scripts | People that went through the examples in `h5_pl` folder |

```
mlcg-combine_model.py --ckpt ./ckpt/last.ckpt --prior ./prior.pt --out model_with_prior.pt
```

This command might throw some warnings related to a rank problem but this are safe to ignore.
Collaborator


Suggested change
This command might throw some warnings related to a rank problem but this are safe to ignore.
This command might throw some warnings related to a rank problem but these are safe to ignore.

sayeg84 and others added 9 commits March 2, 2025 18:00
Co-authored-by: nec4 <42926839+nec4@users.noreply.github.com>
@sayeg84
Collaborator Author

sayeg84 commented Mar 3, 2025

@nec4 @kbno Thanks for the insightful comments. I've addressed all of the remarks. Could this be reviewed again?

@nec4
Collaborator

nec4 commented Mar 4, 2025

Thanks! Will take a look shortly

@nec4
Collaborator

nec4 commented Mar 4, 2025

Thanks for making the changes and especially for adding the test. I have actually never used the torch.testing submodule before - seems there is no native torch.testing.assert_equal - could we define one as they suggest in their docs (https://pytorch.org/docs/stable/testing.html) and use it instead in the test?

```python
import functools

assert_equal = functools.partial(torch.testing.assert_close, rtol=0, atol=0)
```
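Concretely, this wrapper makes `assert_close` demand exact equality; a small illustrative check (values are made up):

```python
import functools

import torch

# Exact-equality wrapper from the torch.testing docs: with rtol=atol=0,
# assert_close only passes when the tensors match exactly.
assert_equal = functools.partial(torch.testing.assert_close, rtol=0, atol=0)

a = torch.tensor([1.0, 2.0, 3.0])
assert_equal(a, a.clone())  # identical values: passes

try:
    # any nonzero difference fails at zero tolerance
    assert_equal(a, a + 1e-6)
    raise RuntimeError("should not reach here")
except AssertionError:
    pass  # expected: tensors differ
```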

Other than that, LGTM. Awesome work.

@sayeg84
Collaborator Author

sayeg84 commented Mar 4, 2025

@nec4 thanks for the tip, the last commits implement the equality check.

@nec4
Collaborator

nec4 commented Mar 5, 2025

LGTM! Happy to merge if @kbno does not have anything further.

@kbno
Collaborator

kbno commented Mar 5, 2025

trying out a few demos now

@kbno
Collaborator

kbno commented Mar 5, 2025

LGTM, merging now, thanks @sayeg84 and @nec4 !

@kbno kbno merged commit ec362fa into main Mar 5, 2025
3 checks passed
@kbno kbno deleted the feat/update_examples branch March 5, 2025 13:40