Commit 7192b5b

Fix geneformer training instability bug (#421)
See wandb runs here: https://wandb.ai/clara-discovery/geneformer_bionemo2_timing2

As the results below show, we can precisely control whether or not the grad norm instability appears by setting or unsetting the two NVTE environment variables. Adding the NVTE environment variables to our container was itself a recent change. Based on these results we are unsetting the variables for now; this change does not incur a significant performance hit.

## Old run where this was not an issue

<img width="457" alt="Screenshot 2024-11-12 at 9 42 45 AM" src="https://github.com/user-attachments/assets/7571ec4a-7bf1-4f86-901a-4dc983b53149">

## Representative new run where we see a spike in grad norm

<img width="730" alt="Screenshot 2024-11-12 at 9 43 25 AM" src="https://github.com/user-attachments/assets/c9069d1d-3cc7-43e3-93d0-1a3ff07ecfe3">

## We can make this spike go away by unsetting `NVTE_FUSED_ATTN` and `NVTE_FLASH_ATTN`

<img width="731" alt="Screenshot 2024-11-12 at 9 43 44 AM" src="https://github.com/user-attachments/assets/3883383a-e943-4d26-a12a-956f7240bd45">

## We can introduce this spike on the old image, which didn't have these env variables, by setting them

<img width="728" alt="Screenshot 2024-11-12 at 9 44 16 AM" src="https://github.com/user-attachments/assets/d5daeb16-57be-4e8e-bde6-8b275bf53a46">

## Example longer/larger-batch run that fails with these env variables set

<img width="729" alt="Screenshot 2024-11-12 at 9 45 07 AM" src="https://github.com/user-attachments/assets/00cdb307-1863-47e1-b93e-3227cbc7259b">

## We can stabilize this run by unsetting these env variables

<img width="729" alt="Screenshot 2024-11-12 at 9 45 30 AM" src="https://github.com/user-attachments/assets/2cd370e3-5cdc-4385-9294-cdab068d6a8b">

The instability seems to be relatively recent, so this PR tested several recent changes to see whether any of them caused it:

- [x] Check if the arange change is causing this
- [x] Check if the grad buffer change (should not be enabled) is causing this
- [x] bias fusions
- [x] garbage collection callback

Find out when this worked:

- [x] PR 409, right before the second perf change and the dset change
- [x] PR 410, after the first perf change, the CLI refactor, and the wandb fix
- [x] PR 404, right before the new CLI
- [x] PR 362 (2 weeks ago), but restarting the job before the gradients start to increase
- [x] PR 362 (2 weeks ago): **worked**. https://wandb.ai/clara-discovery/geneformer_bionemo2/runs/0sSIf3tl?nw=nwusernvjstjohn uses `bionemo2-pr312--136b1889fc390d9dad04f077b32b8fbecf50e25d`
- [x] `bionemo2-pr312--136b1889fc390d9dad04f077b32b8fbecf50e25d` but with `NVTE_FUSED_ATTN=1` and `NVTE_FLASH_ATTN=0` set in my script: **did not work**
- [x] `bionemo2-pr312--136b1889fc390d9dad04f077b32b8fbecf50e25d` but with `NVTE_FUSED_ATTN` and `NVTE_FLASH_ATTN` `unset` in my script: **WORKED!**
- [x] `bionemo2-pr419--f2599382e4afaf061c9948628f3f72bb8e233fd6` (the most recently merged PR) but manually unsetting `NVTE_FUSED_ATTN=1` and `NVTE_FLASH_ATTN=0`

Notes on differences between TOT and `pr312--136b1889fc390d9dad04f077b32b8fbecf50e25d`:

- `env` doesn't have the `NVTE_FUSED*` settings. It is unclear whether the slurm script adds them properly or not.
- `NVTE_FUSED_ATTN` and `NVTE_FLASH_ATTN` are set in `bionemo2-pr373--db2fe9cc240b12bfaf045654fc5350a7b985c9de`, for example.
- In slurm, `--export=ALL` is the default and passes through all env variables. Perhaps that is how they propagate, so the run where I added those env variables may fail if they are causing the issue.
- The successful run used bs=32 vs 64. I'm running a test now that has the NVTE* settings in the docker script but not in the image.
- This was a closed branch; maybe some key changes didn't make it to main.
- No `pip freeze` differences pop out that distinguish the branch that passes from the set that fail.
- NOTE: see the experiments above around `NVTE_FUSED_ATTN=1` and `NVTE_FLASH_ATTN=0`.

I am pretty sure these settings are what cause the training instability in geneformer: unsetting them fixes the old PR, and setting them makes that old PR fail with the same explosion of gradients. Currently I'm rerunning tests on a TOT branch while calling `unset` on those variables in my script, so that they are removed from the container env before the script executes. If this fixes the TOT training curve, I will feel very confident that this is what's going on, and we can focus on purging references to these variables from our docs, other than perhaps highlighting how they cause training instability.
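The fix amounts to removing the two variables from the environment before training starts. A rough illustration of the `unset` approach described above (the launcher function and names here are hypothetical sketches, not from the commit):

```python
import os
import subprocess

# The two transformer engine variables implicated in the grad norm spikes.
NVTE_VARS = ("NVTE_FUSED_ATTN", "NVTE_FLASH_ATTN")


def strip_nvte_overrides(environ: dict) -> dict:
    """Return a copy of the environment with the NVTE attention overrides removed."""
    env = dict(environ)
    for var in NVTE_VARS:
        env.pop(var, None)  # remove if present, no-op otherwise
    return env


def launch_training(cmd: list) -> subprocess.CompletedProcess:
    """Run a training command with the overrides stripped, mirroring `unset` in a slurm script."""
    return subprocess.run(cmd, env=strip_nvte_overrides(os.environ), check=True)
```

Because slurm's `--export=ALL` default forwards the whole environment, stripping the variables in the launching process (rather than in the job script) also keeps them out of the submitted job.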
1 parent 4ba3595 commit 7192b5b

File tree

7 files changed: +14 additions, -24 deletions


Dockerfile

Lines changed: 6 additions & 2 deletions
```diff
@@ -166,7 +166,9 @@ RUN <<EOF
 EOF
 
 # Transformer engine attention defaults
-ENV NVTE_FUSED_ATTN=1 NVTE_FLASH_ATTN=0
+# FIXME the following result in unstable training curves even if they are faster
+# see https://github.com/NVIDIA/bionemo-framework/pull/421
+#ENV NVTE_FUSED_ATTN=1 NVTE_FLASH_ATTN=0
 
 FROM dev AS development
 
@@ -207,4 +209,6 @@ RUN chmod 777 -R /workspace/bionemo2/
 
 # Transformer engine attention defaults
 # We have to declare this again because the devcontainer splits from the release image's base.
-ENV NVTE_FUSED_ATTN=1 NVTE_FLASH_ATTN=0
+# FIXME the following results in unstable training curves even if faster.
+# See https://github.com/NVIDIA/bionemo-framework/pull/421
+#ENV NVTE_FUSED_ATTN=1 NVTE_FLASH_ATTN=0
```
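With the `ENV` lines commented out, a quick sanity check inside the container can confirm nothing else re-introduces the overrides (an illustrative helper, not part of the commit):

```python
import os

NVTE_VARS = ("NVTE_FUSED_ATTN", "NVTE_FLASH_ATTN")


def nvte_overrides_present(environ=None) -> list:
    """Return whichever NVTE attention overrides are still set, if any."""
    environ = os.environ if environ is None else environ
    return [v for v in NVTE_VARS if v in environ]


if __name__ == "__main__":
    leaked = nvte_overrides_present()
    if leaked:
        print(f"warning: {leaked} still set; see PR #421 for the instability they caused")
    else:
        print("NVTE attention overrides are unset; transformer engine chooses its own backend")
```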

README.md

Lines changed: 0 additions & 6 deletions
````diff
@@ -186,9 +186,6 @@ export MY_DATA_SOURCE="pbss"
 
 ```bash
 # The fastest transformer engine environment variables in testing were the following two
-export NVTE_FUSED_ATTN=1
-export NVTE_FLASH_ATTN=0
-
 TEST_DATA_DIR=$(download_bionemo_data esm2/testdata_esm2_pretrain:2.0 --source $MY_DATA_SOURCE); \
 ESM2_650M_CKPT=$(download_bionemo_data esm2/650m:2.0 --source $MY_DATA_SOURCE); \
 python \
 
@@ -248,9 +245,6 @@ and DataModule types.
 > ⚠️ **Warning:** This setup does NO configuration of Weights and Biases. Edit your config JSON and populate it with your WandB details.
 
 ```
-export NVTE_FUSED_ATTN=1
-export NVTE_FLASH_ATTN=0
-
 bionemo-esm2-train \
 --data-config-t bionemo.esm2.run.config_models.ESM2DataConfig \
 --model-config-t bionemo.esm2.run.config_models.ExposedESM2PretrainConfig \
````

docs/docs/user-guide/examples/bionemo-esm2/pretrain.md

Lines changed: 0 additions & 3 deletions
````diff
@@ -280,9 +280,6 @@ llm.train(
 Or simply call `esm2_pretrain.py` directly.
 ```bash
 # Enable fused attention in transformer engine for speed-up
-export NVTE_FUSED_ATTN=1
-export NVTE_FLASH_ATTN=0
-
 DATA_DIR=$(download_bionemo_data esm2/testdata_esm2_pretrain:2.0 --source ngc)
 
 python scripts/protein/esm2/esm2_pretrain.py \
````

scripts/protein/esm2/test_esm2_pretrain.py

Lines changed: 0 additions & 7 deletions
```diff
@@ -90,8 +90,6 @@ def test_main_runs(monkeypatch, tmpdir, dummy_protein_dataset, dummy_parquet_tra
     result_dir = Path(tmpdir.mkdir("results"))
 
     with megatron_parallel_state_utils.distributed_model_parallel_state():
-        monkeypatch.setenv("NVTE_FUSED_ATTN", "1")
-        monkeypatch.setenv("NVTE_FLASH_ATTN", "0")
         main(
             train_cluster_path=train_cluster_path,
             train_database_path=dummy_protein_dataset,
 
@@ -159,8 +157,6 @@ def test_val_dataloader_in_main_runs_with_limit_val_batches(
     result_dir = Path(tmpdir.mkdir("results"))
 
     with megatron_parallel_state_utils.distributed_model_parallel_state():
-        monkeypatch.setenv("NVTE_FUSED_ATTN", "1")
-        monkeypatch.setenv("NVTE_FLASH_ATTN", "0")
         main(
             train_cluster_path=train_cluster_path,
             train_database_path=dummy_protein_dataset,
 
@@ -239,9 +235,6 @@ def test_pretrain_cli(tmpdir, dummy_protein_dataset, dummy_parquet_train_val_inp
     # a local copy of the environment
     env = dict(**os.environ)
     env["MASTER_PORT"] = str(open_port)
-    env["NVTE_FUSED_ATTN"] = "1"
-    env["NVTE_FLASH_ATTN"] = "0"
-
     cmd = shlex.split(cmd_str)
     result = subprocess.run(
         cmd,
```

sub-packages/bionemo-geneformer/tests/bionemo/geneformer/test_stop_and_go.py

Lines changed: 1 addition & 0 deletions
```diff
@@ -97,6 +97,7 @@ class TestGeneformerStopAndGo(stop_and_go.StopAndGoHarness):
     limit_val_batches: int = 2
     lr: float = 1e-4
     precision: Literal["16-mixed", "bf16-mixed", "32"] = MODEL_PRECISION
+    train_val_output_atol: float = 2e-2
 
     @override
     @classmethod
```

sub-packages/bionemo-llm/src/bionemo/llm/model/biobert/model.py

Lines changed: 1 addition & 2 deletions
```diff
@@ -525,8 +525,7 @@ def configure_model(self, tokenizer: AutoTokenizer) -> MegatronBioBertModelType:
             self.num_layers // p_size
         ) % vp_size == 0, "Make sure the number of model chunks is the same across all pipeline stages."
 
-        # The local specs all require the standard full attention mask. For transformer engine only the NVTE_FLASH_ATTN=0
-        # option requires this full attention mask.
+        # The local specs all require the standard full attention mask.
         use_full_attention_mask: bool = "transformer_engine" not in self.biobert_spec_option
         do_next_sentence = False
         if self.model_cls is None:
```

sub-packages/bionemo-testing/src/bionemo/testing/harnesses/stop_and_go.py

Lines changed: 6 additions & 4 deletions
```diff
@@ -106,6 +106,8 @@ class StopAndGoHarness(ABC):
     limit_val_batches: int
     lr: float = 1e-4
     precision: Literal["16-mixed", "bf16-mixed", "32"]
+    train_val_output_atol: float = 1e-3
+    other_output_atol: float = 1e-4
 
     # class variables that will be setup in setUpClass
     tempdir: tempfile.TemporaryDirectory
 
@@ -336,9 +338,9 @@ def test_stop_and_go_consistency(self, callback_type):
         assert interrupted_callback.data, f"No data found for {callback_type}"
 
         if callback_type == testing_callbacks.TrainOutputCallback:
-            atol = 1e-3
+            atol = self.train_val_output_atol
         else:
-            atol = 1e-4
+            atol = self.other_output_atol
 
         recursive_assert_approx_equal(interrupted_callback.data, continuous_callback.data, atol=atol)
 
@@ -388,8 +390,8 @@ def test_stop_and_go_consistency_with_uneven_validation_sizes(self, callback_typ
         interrupted_data = interrupted_callback.data[-len(continuous_callback.data) :]
 
         if callback_type == testing_callbacks.ValidOutputCallback:
-            atol = 1e-3
+            atol = self.train_val_output_atol
         else:
-            atol = 1e-4
+            atol = self.other_output_atol
 
         recursive_assert_approx_equal(interrupted_data, continuous_callback.data, atol=atol)
```
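The harness change turns the hard-coded tolerances into class attributes so subclasses can loosen them; the geneformer test overrides `train_val_output_atol` to 2e-2. A stripped-down sketch of that pattern (class names simplified, not the real harness):

```python
class Harness:
    # Defaults matching the new class attributes on StopAndGoHarness.
    train_val_output_atol: float = 1e-3
    other_output_atol: float = 1e-4

    def atol_for(self, is_train_val_output: bool) -> float:
        """Pick the tolerance the consistency checks use for a given callback type."""
        return self.train_val_output_atol if is_train_val_output else self.other_output_atol


class GeneformerHarness(Harness):
    # Looser train/val tolerance, as in TestGeneformerStopAndGo.
    train_val_output_atol: float = 2e-2
```

Because the lookup goes through `self`, each subclass only overrides the attribute it needs; the other tolerance falls back to the base class default.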
