Conversation
Thanks for this work! I can take a look tomorrow if needed. |
Would it also be worth updating the notebook here? I think I had a local draft PR for modernizing that one; I can find it if we want. |
nec4
left a comment
Great work! I left a few small suggestions and questions.
examples/h5_pl/README.md
Outdated
| - h5py (install can be done with `conda install -c conda-forge h5py`) | ||
| - Bundled datasets in single (or several) HDF5 files. | ||
| - Parallelized training on multiple GPUs with distributed data parallel (DDP) and low memory footprint. | ||
| - External description of dataset partition that enables. |
Is something missing on this line?
Yes, I've pushed to finish the sentence
| We will train a model with the `1L2Y_prior_tag.h5` dataset that was previously generated using `mlcg-tk`. To train this model, we need the partition file that specifies how to split the dataset into a training and validation region: file `partition_1L2Y_prior_tag.yaml`. These two files define the dataset, its partition and its batch size that will be used to train a model | ||
| To go ahead to the trainng, the file `training.yaml` is a Pytorch lightning yaml defines the architecture of the model to use (`model` field), the optimizer (`optimizer`), the trainer specifications (`trainer`) and the dataset (`dataset`). Note that `data.h5_file_path` and `data.partition_options` point to the `1L2Y_prior_tag.h5` and `partition_1L2Y_prior_tag.yaml` files, respectively, that will be used as training data and |
| To go ahead to the trainng, the file `training.yaml` is a Pytorch lightning yaml defines the architecture of the model to use (`model` field), the optimizer (`optimizer`), the trainer specifications (`trainer`) and the dataset (`dataset`). Note that `data.h5_file_path` and `data.partition_options` point to the `1L2Y_prior_tag.h5` and `partition_1L2Y_prior_tag.yaml` files, respectively, that will be used as training data and | |
| To go ahead to the trainng, the file `training.yaml` is a Pytorch Lightning yaml defines the architecture of the model to use (`model` field), the optimizer (`optimizer`), the trainer specifications (`trainer`) and the dataset (`dataset`). Note that `data.h5_file_path` and `data.partition_options` point to the `1L2Y_prior_tag.h5` and `partition_1L2Y_prior_tag.yaml` files, respectively, that will be used as training data and |
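For orientation, here is a minimal sketch of the shape such a `training.yaml` might take. Only the top-level `model`, `optimizer`, `trainer`, and `data` keys and the `data.h5_file_path`/`data.partition_options` fields are named in the text above; every other key and value below is an illustrative placeholder, not the actual example file:

```yaml
# Hypothetical skeleton of a PyTorch Lightning CLI config.
# Keys beyond those named in the README (model, optimizer, trainer,
# data.h5_file_path, data.partition_options) are placeholders.
model:
  # architecture of the model to use
  class_path: <model class>
optimizer:
  class_path: torch.optim.Adam
trainer:
  max_epochs: 100
data:
  h5_file_path: 1L2Y_prior_tag.h5
  partition_options: partition_1L2Y_prior_tag.yaml
```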
| To train, we can run from a terminal (that) |
what does (that) refer to here?
| In the 1L2Y example, it is possible that the simulation exits before finishing after throwing an error related to "Simulation blewup at timestep #..." | ||
| This problem is related to the fact that the prior was fitted with very little data and is not a good enough prior to avoid unphysical configurations in our system. |
| This problem is related to the fact that the prior was fitted with very little data and is not a good enough prior to avoid unphysical configurations in our system. | |
| This problem is related to the fact that the prior was fitted with very little data and is not a good enough prior to avoid unphysical configurations in our system. For real production models, we recommend running prior-only simulations to help see if your prior is suitably accurate/stable against reference feature distributions. |
I added something similar but different to your suggestion. I don't want to reference the reference feature distributions because the relation of the prior only simulations with the reference distributions is more complicated.
```python
import torch
from warnings import warn
from .prior import _Prior, Dihedral, Harmonic


def sparsify_prior_module(module: _Prior) -> torch.nn.Module:
    r"""
    Converts buffer tensors to sparse tensors inplace for Harmonic and Dihedral objects
    """
    if isinstance(module, Dihedral):
        module.v_0 = module.v_0.to_sparse()
        module.k1s = module.k1s.to_sparse()
        module.k2s = module.k2s.to_sparse()
    elif issubclass(type(module), Harmonic):
        module.x_0 = module.x_0.to_sparse()
        module.k = module.k.to_sparse()
    else:
        warn(
            "Module is not supported for sparsification. It will be returned as is"
        )
    return module


def desparsify_prior_module(module: _Prior) -> torch.nn.Module:
    r"""
    Converts parameter tensors inplace to dense tensors in Harmonic and Dihedral objects
    """
    if isinstance(module, Dihedral):
        module.v_0 = module.v_0.to_dense()
        module.k1s = module.k1s.to_dense()
        module.k2s = module.k2s.to_dense()
    elif issubclass(type(module), Harmonic):
        module.x_0 = module.x_0.to_dense()
        module.k = module.k.to_dense()
    return module
```
Good stuff - not sure if its worth to write a small unit test or not to make sure the prior buffers remain the same after sparsifying/desparsifying - what do you think?
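A minimal round-trip check along those lines could look like the sketch below. Since the real `Harmonic`/`Dihedral` classes live in the package, this uses a hypothetical `FakeHarmonic` stand-in that registers buffers with the same names (`x_0`, `k`) as `Harmonic`; the actual test would run the helpers on real prior modules:

```python
import torch


def test_sparsify_roundtrip():
    # Hypothetical stand-in for a Harmonic prior module: registers the same
    # buffer names (x_0, k) that sparsify_prior_module touches.
    class FakeHarmonic(torch.nn.Module):
        def __init__(self):
            super().__init__()
            self.register_buffer("x_0", torch.tensor([0.0, 1.5, 0.0, 2.0]))
            self.register_buffer("k", torch.tensor([10.0, 0.0, 5.0, 0.0]))

    module = FakeHarmonic()
    x_0_before = module.x_0.clone()
    k_before = module.k.clone()

    # Sparsify (mirrors what sparsify_prior_module does for Harmonic buffers)
    module.x_0 = module.x_0.to_sparse()
    module.k = module.k.to_sparse()
    assert module.x_0.is_sparse and module.k.is_sparse

    # Desparsify and check the buffers are unchanged
    module.x_0 = module.x_0.to_dense()
    module.k = module.k.to_dense()
    assert torch.equal(module.x_0, x_0_before)
    assert torch.equal(module.k, k_before)
```

The round trip is exact for COO sparse tensors, so `torch.equal` (rather than an approximate comparison) is appropriate here.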
examples/README.md
Outdated
| | :---------: | :---------: | :-------------: | | ||
| |`notebooks`|Notebook showing the training, simulation and simulation analysis of an MLCG model for Chignolin | People interested in understanding the procedure for building and testing an mlcg model but not interested in getting a good model nor applying it to a system of their own. | | ||
| |`h5_pl`| Files and input yamls for training a toy model of 1L2Y and a transferable model | People who intend to build an mlcg model to a system of their own and that have gone through the [mlcg-tk package example](https://github.com/ClementiGroup/mlcg-tk/tree/main/examples) for preparing AA data into a trainable H5 | | ||
| | `input_yamls`| Example yaml files that can be passed to the scripts | People that went trough the examples in `h5_pl` folder | |
| | `input_yamls`| Example yaml files that can be passed to the scripts | People that went trough the examples in `h5_pl` folder | | |
| | `input_yamls`| Example yaml files that can be passed to the scripts | People that went through the examples in `h5_pl` folder | |
```
mlcg-combine_model.py --ckpt ./ckpt/last.ckpt --prior ./prior.pt --out model_with_prior.pt
```
| This command might throw some warnings related to a rank problem but this are safe to ignore. |
| This command might throw some warnings related to a rank problem but this are safe to ignore. | |
| This command might throw some warnings related to a rank problem but these are safe to ignore. |
Co-authored-by: nec4 <42926839+nec4@users.noreply.github.com>
Thanks! Will take a look shortly |
Thanks for making the changes and especially for adding the test. I have actually never used the … Other than that, LGTM. Awesome work. |
@nec4 thanks for the tip, last commits have implemented the equality check |
LGTM! Happy to merge if @kbno does not have anything further. |
trying out a few demos now |
PR Checklist
Describe your changes here:
Currently, the examples folder contains some yaml files and the notebook. While these examples are not incorrect, they are not the best, as we are all using H5s to train and the ChignolinDataset is not widely used. This has become quite a challenge in the last year when we've had to train a lot of people.
In this PR, I refactored the whole examples folder to give it a folder structure. To summarize:
- `examples/h5_pl` now contains two folders: `single_molecule` and `multiple_molecules`. Each folder can be used for training, merging an epoch with a prior and simulating, and each has its own README where all of this is explained.
- `single_molecule` is the 1L2Y example from mlcg-tk and `multiple_molecules` is the small H5 from the demo version of the repo.
- Added a README to `examples` and also to every subfolder to explain the contents and where a person should direct themselves when starting to learn.
- Removed the `save_h5.py` script that was meant more for internal use.

Other minor changes in this PR:
- `ckpt` and `tensorboard` folders generated by small trainings.