[Compiler Toolkit] Add option for full inductor. by aditvenk · Pull Request #2150 · pytorch/torchtitan

aditvenk · 2025-12-13T01:25:50Z

Being able to compile fw/bw graphs using compile_fx_inner could help with establishing perf rooflines.

Full Inductor compilation is achieved using compile_fx_inner. However, it requires the graph to have been decomposed using Inductor's default decomposition table. We apply this decomposition as a pass on the joint graph, taking care to suitably unwrap the primals/tangents before running it.
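
A minimal sketch of the decomposition step only, assuming a toy function in place of the joint graph (`toy_fn` is hypothetical and not from this PR). It retraces the function with `make_fx`, passing the table returned by `select_decomp_table()`, which is Inductor's default decomposition table. It does not call compile_fx_inner and does not deal with primals/tangents unwrapping, both of which the PR handles separately.

```python
import torch
from torch.fx.experimental.proxy_tensor import make_fx
from torch._inductor.decomposition import select_decomp_table


def toy_fn(x: torch.Tensor) -> torch.Tensor:
    # Hypothetical stand-in for a traced graph; not the joint graph from the PR.
    return torch.nn.functional.gelu(x @ x.t()).sum(dim=-1)


torch.manual_seed(0)
x = torch.randn(4, 4)

# Retrace so that ops covered by Inductor's default table are replaced
# by their decompositions in the resulting GraphModule.
decomposed_gm = make_fx(toy_fn, decomposition_table=select_decomp_table())(x)

# The decomposed graph should match eager up to floating-point tolerance.
max_abs_diff = (decomposed_gm(x) - toy_fn(x)).abs().max().item()
print(decomposed_gm.graph)
print(f"max abs diff vs eager: {max_abs_diff:.2e}")
```

Printing the graph is a quick way to check which ATen ops remain after decomposition before handing the graph to a compiler.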

Manual testing:
NGPU=4 \
CONFIG_FILE=./torchtitan/models/llama3/train_configs/debug_model.toml \
TRAIN_FILE=torchtitan.experiments.compiler_toolkit.train \
./run_train.sh \
--model.name $MODEL_NAME \
--parallelism.data_parallel_shard_degree=2 \
--parallelism.tensor_parallel_degree=2 \
--job.custom_config_module=torchtitan.experiments.compiler_toolkit.job_config \
--compile.joint_passes inductor_decomposition \
--compile.passes full_inductor_compilation

SherlockNoMad

lgtm with comments.

aditvenk · 2025-12-23T06:00:03Z

aot_eager for llama3:

[rank0]:[titan] 2025-12-22 21:58:57,041 - root - INFO - CUDA capacity: NVIDIA H100 with 94.99GiB memory
[rank0]:[titan] 2025-12-22 21:58:57,057 - root - INFO - Model compiler_toolkit.llama3 debugmodel size: 6,163,712 total parameters
[rank0]:[titan] 2025-12-22 21:58:57,077 - root - INFO - Applied Tensor Parallelism to the model
[rank0]:[titan] 2025-12-22 21:58:57,077 - root - INFO - Applied selective activation checkpointing to the model
[rank0]:[titan] 2025-12-22 21:58:57,097 - root - INFO - Applied Data Parallel (simple_fsdp) (dp mode=fully_shard) to the model
[rank0]:[titan] 2025-12-22 21:58:57,279 - root - INFO - Peak FLOPS used for computing MFU: 9.890e+14
[rank0]:[titan] 2025-12-22 21:58:57,279 - root - INFO - CUDA memory usage for model: 0.01GiB(0.01%)
[rank0]:[titan] 2025-12-22 21:58:57,437 - root - INFO - Mixed precision training is handled by fully_shard
[rank0]:[titan] 2025-12-22 21:58:57,437 - root - INFO - Trainer is initialized with local batch size 8, global batch size 16, gradient accumulation steps 1, sequence length 2048, total steps 10 (warmup 2)
[rank0]:[titan] 2025-12-22 21:58:57,437 - root - INFO - Training starts at step 1
[rank0]:/data/users/avenkataraman/pytorch/torch/_functorch/_aot_autograd/runtime_wrappers.py:2494: UserWarning: Your compiler for AOTAutograd is returning a function that doesn't take boxed arguments. Please wrap it with functorch.compile.make_boxed_func or handle the boxed arguments yourself. See https://github.com/pytorch/pytorch/pull/83137#issuecomment-1211320670 for rationale.
[rank0]:  out = call_func_at_runtime_with_args(
[rank0]:[titan] 2025-12-22 21:59:03,495 - root - INFO - step:  1  loss:  7.9925  grad_norm:  1.4785  memory:  0.65GiB(0.69%)  tps: 1,272  tflops: 0.09  mfu: 0.01%
[rank0]:[titan] 2025-12-22 21:59:03,495 - root - INFO - Synchronizing and adjusting timeout for all ProcessGroups to 0:01:40
[rank0]:/home/avenkataraman/torchtitan/torchtitan/distributed/utils.py:396: UserWarning: Set timeout is now only supported for either nccl or gloo.
[rank0]:  torch.distributed.distributed_c10d._set_pg_timeout(timeout, group)
[rank0]:[titan] 2025-12-22 21:59:03,599 - root - INFO - step:  2  loss:  7.6803  grad_norm:  1.5720  memory:  0.68GiB(0.72%)  tps: 79,257  tflops: 5.67  mfu: 0.57%
[rank0]:[titan] 2025-12-22 21:59:03,655 - root - INFO - step:  3  loss:  6.9339  grad_norm:  1.9887  memory:  0.68GiB(0.72%)  tps: 147,082  tflops: 10.53  mfu: 1.06%
[rank0]:[titan] 2025-12-22 21:59:03,712 - root - INFO - step:  4  loss:  6.0866  grad_norm:  2.2987  memory:  0.68GiB(0.72%)  tps: 143,666  tflops: 10.28  mfu: 1.04%
[rank0]:[titan] 2025-12-22 21:59:03,768 - root - INFO - step:  5  loss:  5.2493  grad_norm:  2.4151  memory:  0.68GiB(0.72%)  tps: 148,569  tflops: 10.64  mfu: 1.08%
[rank0]:[titan] 2025-12-22 21:59:03,829 - root - INFO - step:  6  loss:  4.7912  grad_norm:  2.6229  memory:  0.68GiB(0.72%)  tps: 134,967  tflops: 9.66  mfu: 0.98%
[rank0]:[titan] 2025-12-22 21:59:03,884 - root - INFO - step:  7  loss:  4.4615  grad_norm:  2.3111  memory:  0.68GiB(0.72%)  tps: 148,651  tflops: 10.64  mfu: 1.08%
[rank0]:[titan] 2025-12-22 21:59:03,943 - root - INFO - step:  8  loss:  4.2301  grad_norm:  1.9856  memory:  0.68GiB(0.72%)  tps: 140,177  tflops: 10.03  mfu: 1.01%
[rank0]:[titan] 2025-12-22 21:59:04,005 - root - INFO - step:  9  loss:  4.4596  grad_norm:  1.7412  memory:  0.68GiB(0.72%)  tps: 132,916  tflops: 9.51  mfu: 0.96%
[rank0]:[titan] 2025-12-22 21:59:04,069 - root - INFO - step: 10  loss:  4.0634  grad_norm:  1.9408  memory:  0.68GiB(0.72%)  tps: 127,978  tflops: 9.16  mfu: 0.93%

full inductor:

[rank0]:[titan] 2025-12-22 21:59:34,941 - root - INFO - CUDA capacity: NVIDIA H100 with 94.99GiB memory
[rank0]:[titan] 2025-12-22 21:59:34,956 - root - INFO - Model compiler_toolkit.llama3 debugmodel size: 6,163,712 total parameters
[rank0]:[titan] 2025-12-22 21:59:34,974 - root - INFO - Applied Tensor Parallelism to the model
[rank0]:[titan] 2025-12-22 21:59:34,975 - root - INFO - Applied selective activation checkpointing to the model
[rank0]:[titan] 2025-12-22 21:59:34,994 - root - INFO - Applied Data Parallel (simple_fsdp) (dp mode=fully_shard) to the model
[rank0]:[titan] 2025-12-22 21:59:35,019 - root - INFO - Using joint passes from config: ['inductor_decomposition']
[rank0]:[titan] 2025-12-22 21:59:35,019 - root - WARNING - Full Inductor compilation is enabled. Note that Inductor may change numerics and does not guarantee bitwise equivalent results compared to eager mode.
[rank0]:[titan] 2025-12-22 21:59:35,019 - root - INFO - Using compiler passes from config: ['full_inductor_compilation']
[rank0]:[titan] 2025-12-22 21:59:35,159 - root - INFO - Peak FLOPS used for computing MFU: 9.890e+14
[rank0]:[titan] 2025-12-22 21:59:35,160 - root - INFO - CUDA memory usage for model: 0.01GiB(0.01%)
[rank0]:[titan] 2025-12-22 21:59:35,331 - root - INFO - Mixed precision training is handled by fully_shard
[rank0]:[titan] 2025-12-22 21:59:35,331 - root - INFO - Trainer is initialized with local batch size 8, global batch size 16, gradient accumulation steps 1, sequence length 2048, total steps 10 (warmup 2)
[rank0]:[titan] 2025-12-22 21:59:35,331 - root - INFO - Training starts at step 1
[rank0]:[titan] 2025-12-22 21:59:40,181 - root - INFO - Applying decompositions to joint graph
[rank0]:[titan] 2025-12-22 21:59:41,411 - root - INFO - Decompositions applied successfully to joint graph
[rank0]:[titan] 2025-12-22 21:59:41,921 - root - INFO - Applying pass: full_inductor_compilation_pass
[rank0]:[titan] 2025-12-22 21:59:42,538 - root - INFO - Applying pass: full_inductor_compilation_pass
[rank0]:[titan] 2025-12-22 21:59:43,240 - root - INFO - step:  1  loss:  8.2073  grad_norm:  1.3710  memory:  0.63GiB(0.67%)  tps: 989  tflops: 0.07  mfu: 0.01%
[rank0]:[titan] 2025-12-22 21:59:43,240 - root - INFO - Synchronizing and adjusting timeout for all ProcessGroups to 0:01:40
[rank0]:/home/avenkataraman/torchtitan/torchtitan/distributed/utils.py:396: UserWarning: Set timeout is now only supported for either nccl or gloo.
[rank0]:  torch.distributed.distributed_c10d._set_pg_timeout(timeout, group)
[rank0]:[titan] 2025-12-22 21:59:43,343 - root - INFO - step:  2  loss:  7.9428  grad_norm:  1.4296  memory:  0.64GiB(0.67%)  tps: 79,314  tflops: 5.68  mfu: 0.57%
[rank0]:[titan] 2025-12-22 21:59:43,399 - root - INFO - step:  3  loss:  7.2457  grad_norm:  1.8050  memory:  0.64GiB(0.67%)  tps: 147,889  tflops: 10.59  mfu: 1.07%
[rank0]:[titan] 2025-12-22 21:59:43,458 - root - INFO - step:  4  loss:  6.4076  grad_norm:  2.2389  memory:  0.64GiB(0.67%)  tps: 140,255  tflops: 10.04  mfu: 1.02%
[rank0]:[titan] 2025-12-22 21:59:43,510 - root - INFO - step:  5  loss:  5.4848  grad_norm:  2.4729  memory:  0.64GiB(0.67%)  tps: 156,290  tflops: 11.19  mfu: 1.13%
[rank0]:[titan] 2025-12-22 21:59:43,600 - root - INFO - step:  6  loss:  4.9371  grad_norm:  2.3799  memory:  0.64GiB(0.67%)  tps: 91,446  tflops: 6.55  mfu: 0.66%
[rank0]:[titan] 2025-12-22 21:59:43,652 - root - INFO - step:  7  loss:  4.6137  grad_norm:  2.3870  memory:  0.64GiB(0.67%)  tps: 158,346  tflops: 11.34  mfu: 1.15%
[rank0]:[titan] 2025-12-22 21:59:43,711 - root - INFO - step:  8  loss:  4.4112  grad_norm:  2.2359  memory:  0.64GiB(0.67%)  tps: 141,054  tflops: 10.10  mfu: 1.02%
[rank0]:[titan] 2025-12-22 21:59:43,766 - root - INFO - step:  9  loss:  4.5883  grad_norm:  1.9379  memory:  0.64GiB(0.67%)  tps: 148,920  tflops: 10.66  mfu: 1.08%
[rank0]:[titan] 2025-12-22 21:59:43,828 - root - INFO - step: 10  loss:  4.2006  grad_norm:  2.0740  memory:  0.64GiB(0.67%)  tps: 132,314  tflops: 9.47  mfu: 0.96%
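
To compare the two runs, the per-step throughput can be summarized with a few lines of Python. This is a rough sketch: the tps values are copied from the rank0 step lines above, and steps 1 and 2 are skipped because their throughput is visibly lower in both runs. The helper name `summarize` is illustrative.

```python
from statistics import mean, median

# Tokens/sec for steps 3-10, copied from the rank0 logs above.
aot_eager_tps = [147_082, 143_666, 148_569, 134_967,
                 148_651, 140_177, 132_916, 127_978]
full_inductor_tps = [147_889, 140_255, 156_290, 91_446,
                     158_346, 141_054, 148_920, 132_314]


def summarize(name: str, tps: list[int]) -> dict:
    """Collect simple summary statistics for one run."""
    return {"run": name, "mean": mean(tps), "median": median(tps), "min": min(tps)}


summary = [
    summarize("aot_eager", aot_eager_tps),
    summarize("full_inductor", full_inductor_tps),
]
for row in summary:
    print(f"{row['run']:>14}: mean={row['mean']:,.2f} "
          f"median={row['median']:,.1f} min={row['min']:,}")
```

With only eight steps on the 6M-parameter debug model, the mean tps of the two runs lands within about 1% of each other, and the full Inductor mean is pulled down by the single 91,446 tps step. These numbers are a smoke test rather than a performance comparison.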

yiming0416

Can we add commands to run this to compiler_toolkit/README?

Also you can add 8-GPU CI following existing ones in compiler_toolkit/tests/integration_tests.py

aditvenk · 2026-01-06T02:44:04Z

> Can we add commands to run this to compiler_toolkit/README?
>
> Also you can add 8-GPU CI following existing ones in compiler_toolkit/tests/integration_tests.py

Added to README and integration test.

yiming0416 · 2026-01-06T20:27:10Z

torchtitan/experiments/compiler_toolkit/job_config.py

    """

+    joint_passes: list[str] = field(default_factory=list)
    passes: list[str] = field(default_factory=list)


Non-blocking: I wonder if we should use better names here to distinguish joint_passes from the passes applied on partitioned graphs. Should we consider having fwd_passes and bwd_passes (maybe overkill for now)?

I think joint_passes is pretty descriptive for what it is.

I think passes could be renamed to post_partition_passes. We don't yet have an example of a fwd-only or bwd-only pass, so splitting it into two options for fwd and bwd would be overkill.
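
The distinction being discussed can be sketched as two lists of pass names, each resolved against its own registry. This is an illustrative sketch only: the `CompileConfig` class name, the two registries, the `resolve` helper, and the placeholder pass bodies are all assumptions, not the code in job_config.py or passes.py. Only the field names and pass names come from the PR.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class CompileConfig:  # hypothetical name for the config section
    joint_passes: list[str] = field(default_factory=list)  # run on the joint graph
    passes: list[str] = field(default_factory=list)  # run on partitioned fw/bw graphs


# Hypothetical registries; the placeholder passes just tag a list of labels.
JOINT_PASS_REGISTRY: dict[str, Callable[[list], list]] = {
    "inductor_decomposition": lambda g: g + ["decomposed"],
}
POST_PARTITION_PASS_REGISTRY: dict[str, Callable[[list], list]] = {
    "full_inductor_compilation": lambda g: g + ["inductor_compiled"],
}


def resolve(names: list[str], registry: dict, kind: str) -> list[Callable]:
    """Look up passes by name, failing early on unknown names."""
    unknown = [n for n in names if n not in registry]
    if unknown:
        raise ValueError(f"Unknown {kind} pass(es): {unknown}")
    return [registry[n] for n in names]


cfg = CompileConfig(
    joint_passes=["inductor_decomposition"],
    passes=["full_inductor_compilation"],
)

graph = ["joint_graph"]
for p in resolve(cfg.joint_passes, JOINT_PASS_REGISTRY, "joint"):
    graph = p(graph)
for p in resolve(cfg.passes, POST_PARTITION_PASS_REGISTRY, "post-partition"):
    graph = p(graph)
print(graph)
```

Keeping separate registries means a joint pass name placed in the wrong option fails at lookup time, which is one argument for a name such as post_partition_passes that makes the stage explicit.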

aditvenk requested review from SherlockNoMad and yiming0416 December 13, 2025 01:25

meta-cla bot added the CLA Signed label Dec 13, 2025