Refactor examples #7

Merged
kbno merged 22 commits into main from feat/update_examples
Mar 5, 2025
Conversation

@sayeg84
Collaborator

@sayeg84 sayeg84 commented Feb 26, 2025

PR Checklist

  • Bug fix
  • Feature addition/change
  • Documentation addition/change
  • [ ] Test addition/change
  • [ ] Black formatting

Describe your changes here:

Currently, the examples folder contains some yaml files and the notebook. While these examples are not incorrect, they are not ideal: we are all using H5s to train, and the ChignolinDataset is not widely used. This has become quite a challenge in the last year, when we've had to train a lot of people.

In this PR, I refactored the whole examples folder to give it a folder structure. To summarize:

  • examples/h5_pl now contains two folders: single_molecule and multiple_molecules. Each folder contains:

    • a small H5 dataset
    • training and partition yamls
    • prior objects
    • initial configurations and simulation yamls
    • a README file explaining how to use them
      Each folder can be used for training, merging an epoch with a prior, and simulating, and each has its own README where all of this is explained. single_molecule is the 1L2Y example from mlcg-tk and multiple_molecules is the small H5 from the demo version of the repo.
  • Added a README to examples and to every subfolder to explain the contents and where a person should start when learning.

  • Removed save_h5.py script that was meant more for internal use.

Other minor changes in this PR:

  • Removed references to folders on FU filesystem
  • Added a feature to sparsify and desparsify the parameter tensors in Dihedral and Harmonic prior objects (especially useful as the pre-specialized prior can take a lot of memory)
  • Modified the .gitignore to add the ckpt and tensorboard folders generated by small trainings.

@nec4
Collaborator

nec4 commented Feb 27, 2025

Thanks for this work! I can take a look tomorrow if needed.

@nec4
Collaborator

nec4 commented Feb 27, 2025

Would it also be worth updating the notebook here?:

https://github.com/ClementiGroup/mlcg/blob/main/examples/notebooks/Building_A_Coarse_Grain_Model.ipynb

I think I had a local draft PR for modernizing that one - I can find it if we want.

@kbno
Collaborator

kbno commented Feb 27, 2025

Thanks @sayeg84 and @nec4 , I would also like to take a look but I will only have the time next week, can you wait for me before you merge it?

@sayeg84
Collaborator Author

sayeg84 commented Feb 27, 2025

@nec4 IIRC the current notebook at examples/notebooks/Building_A_Coarse_Grain_Model.ipynb is the most up-to-date version, with the corrections from the modernizing PR.

@kbno We can wait a while before approving this, there are many changes :)

Collaborator

@nec4 nec4 left a comment


Great work! I left a few small suggestions and questions.

- h5py (install can be done with `conda install -c conda-forge h5py`)
- Bundled datasets in single (or several) HDF5 files.
- Parallelized training on multiple GPUs with distributed data parallel (DDP) and low memory footprint.
- External description of dataset partition that enables.
Collaborator


Is something missing on this line?

Collaborator Author

@sayeg84 sayeg84 Mar 2, 2025


Yes, I've pushed to finish the sentence


We will train a model with the `1L2Y_prior_tag.h5` dataset that was previously generated using `mlcg-tk`. To train this model, we need the partition file that specifies how to split the dataset into training and validation regions: the file `partition_1L2Y_prior_tag.yaml`. These two files define the dataset, its partition, and the batch size that will be used to train a model.

To go ahead to the trainng, the file `training.yaml` is a Pytorch lightning yaml defines the architecture of the model to use (`model` field), the optimizer (`optimizer`), the trainer specifications (`trainer`) and the dataset (`dataset`). Note that `data.h5_file_path` and `data.partition_options` point to the `1L2Y_prior_tag.h5` and `partition_1L2Y_prior_tag.yaml` files, respectively, that will be used as training data and
Collaborator


Suggested change
To go ahead to the trainng, the file `training.yaml` is a Pytorch lightning yaml defines the architecture of the model to use (`model` field), the optimizer (`optimizer`), the trainer specifications (`trainer`) and the dataset (`dataset`). Note that `data.h5_file_path` and `data.partition_options` point to the `1L2Y_prior_tag.h5` and `partition_1L2Y_prior_tag.yaml` files, respectively, that will be used as training data and
To go ahead to the trainng, the file `training.yaml` is a Pytorch Lightning yaml defines the architecture of the model to use (`model` field), the optimizer (`optimizer`), the trainer specifications (`trainer`) and the dataset (`dataset`). Note that `data.h5_file_path` and `data.partition_options` point to the `1L2Y_prior_tag.h5` and `partition_1L2Y_prior_tag.yaml` files, respectively, that will be used as training data and


To go ahead to the trainng, the file `training.yaml` is a Pytorch lightning yaml defines the architecture of the model to use (`model` field), the optimizer (`optimizer`), the trainer specifications (`trainer`) and the dataset (`dataset`). Note that `data.h5_file_path` and `data.partition_options` point to the `1L2Y_prior_tag.h5` and `partition_1L2Y_prior_tag.yaml` files, respectively, that will be used as training data and

To train, we can run from a terminal (that)
Collaborator


what does (that) refer to here?


In the 1L2Y example, it is possible that the simulation exits before finishing after throwing an error related to "Simulation blewup at timestep #..."

This problem is related to the fact that the prior was fitted with very little data and is not a good enough prior to avoid unphysical configurations in our system.
Collaborator


Suggested change
This problem is related to the fact that the prior was fitted with very little data and is not a good enough prior to avoid unphysical configurations in our system.
This problem is related to the fact that the prior was fitted with very little data and is not a good enough prior to avoid unphysical configurations in our system. For real production models, we recommend running prior-only simulations to help see if your prior is suitably accurate/stable against reference feature distributions.

Collaborator Author


I added something similar to, but different from, your suggestion. I don't want to mention the reference feature distributions because the relation of prior-only simulations to the reference distributions is more complicated.

Comment on lines +1 to +35
import torch
from warnings import warn
from .prior import _Prior, Dihedral, Harmonic


def sparsify_prior_module(module: _Prior) -> torch.nn.Module:
    r"""
    Converts buffer tensors to sparse tensors inplace for Harmonic and Dihedral objects
    """
    if isinstance(module, Dihedral):
        module.v_0 = module.v_0.to_sparse()
        module.k1s = module.k1s.to_sparse()
        module.k2s = module.k2s.to_sparse()
    elif issubclass(type(module), Harmonic):
        module.x_0 = module.x_0.to_sparse()
        module.k = module.k.to_sparse()
    else:
        warn(
            "Module is not supported for sparsification. It will be returned as is"
        )
    return module


def desparsify_prior_module(module: _Prior) -> torch.nn.Module:
    r"""
    Converts parameter tensors inplace to dense tensors in Harmonic and Dihedral objects
    """
    if isinstance(module, Dihedral):
        module.v_0 = module.v_0.to_dense()
        module.k1s = module.k1s.to_dense()
        module.k2s = module.k2s.to_dense()
    elif issubclass(type(module), Harmonic):
        module.x_0 = module.x_0.to_dense()
        module.k = module.k.to_dense()
    return module
Collaborator


Good stuff - not sure if it's worth writing a small unit test to make sure the prior buffers remain the same after sparsifying/desparsifying - what do you think?
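A minimal version of such a round-trip check, sketched here with plain dense tensors standing in for the prior buffers (illustrative values; the real test would operate on `Dihedral`/`Harmonic` instances), could look like:

```python
import torch

# Stand-ins for prior parameter buffers (made-up values, not real priors)
x_0 = torch.tensor([[0.0, 1.2], [0.0, 0.0]])
k = torch.tensor([[0.0, 10.0], [3.5, 0.0]])

# Round trip: sparsify, then densify, mirroring what the new helpers do
x_0_roundtrip = x_0.to_sparse().to_dense()
k_roundtrip = k.to_sparse().to_dense()

# The buffers must be unchanged by the round trip
assert torch.equal(x_0, x_0_roundtrip)
assert torch.equal(k, k_roundtrip)
```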

| Folder | Contents | Intended audience |
| :---------: | :---------: | :-------------: |
|`notebooks`|Notebook showing the training, simulation and simulation analysis of an MLCG model for Chignolin | People interested in understanding the procedure for building and testing an mlcg model but not interested in getting a good model nor applying it to a system of their own. |
|`h5_pl`| Files and input yamls for training a toy model of 1L2Y and a transferable model | People who intend to build an mlcg model to a system of their own and that have gone through the [mlcg-tk package example](https://github.com/ClementiGroup/mlcg-tk/tree/main/examples) for preparing AA data into a trainable H5 |
| `input_yamls`| Example yaml files that can be passed to the scripts | People that went trough the examples in `h5_pl` folder |
Collaborator


Suggested change
| `input_yamls`| Example yaml files that can be passed to the scripts | People that went trough the examples in `h5_pl` folder |
| `input_yamls`| Example yaml files that can be passed to the scripts | People that went through the examples in `h5_pl` folder |

```
mlcg-combine_model.py --ckpt ./ckpt/last.ckpt --prior ./prior.pt --out model_with_prior.pt
```

This command might throw some warnings related to a rank problem but this are safe to ignore.
Collaborator


Suggested change
This command might throw some warnings related to a rank problem but this are safe to ignore.
This command might throw some warnings related to a rank problem but these are safe to ignore.

sayeg84 and others added 9 commits March 2, 2025 18:00
Co-authored-by: nec4 <42926839+nec4@users.noreply.github.com>
@sayeg84
Collaborator Author

sayeg84 commented Mar 3, 2025

@nec4 @kbno Thanks for the insightful comments. I've addressed all of the remarks. Could this be reviewed again?

@nec4
Collaborator

nec4 commented Mar 4, 2025

Thanks! Will take a look shortly

@nec4
Collaborator

nec4 commented Mar 4, 2025

Thanks for making the changes and especially for adding the test. I have actually never used the torch.testing submodule before - seems there is no native torch.testing.assert_equal - could we define one as they suggest in their docs (https://pytorch.org/docs/stable/testing.html) and use it instead in the test?

```python
import functools

assert_equal = functools.partial(torch.testing.assert_close, rtol=0, atol=0)
```
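Concretely, this wrapper makes `assert_close` demand exact equality; a small illustrative check (values are made up):

```python
import functools

import torch

# Exact-equality wrapper from the torch.testing docs: with rtol=atol=0,
# assert_close only passes when the tensors match exactly.
assert_equal = functools.partial(torch.testing.assert_close, rtol=0, atol=0)

a = torch.tensor([1.0, 2.0, 3.0])
assert_equal(a, a.clone())  # identical values: passes

try:
    # any nonzero difference fails at zero tolerance
    assert_equal(a, a + 1e-6)
    raise RuntimeError("should not reach here")
except AssertionError:
    pass  # expected: tensors differ
```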

Other than that, LGTM. Awesome work.

@sayeg84
Collaborator Author

sayeg84 commented Mar 4, 2025

@nec4 thanks for the tip, the last commits implement the equality check.

@nec4
Collaborator

nec4 commented Mar 5, 2025

LGTM! Happy to merge if @kbno does not have anything further.

@kbno
Collaborator

kbno commented Mar 5, 2025

trying out a few demos now

@kbno
Collaborator

kbno commented Mar 5, 2025

LGTM, merging now, thanks @sayeg84 and @nec4 !

@kbno kbno merged commit ec362fa into main Mar 5, 2025
3 checks passed
@kbno kbno deleted the feat/update_examples branch March 5, 2025 13:40