
[Codegen] Remove ROCDL index==i32; add indexIsI64 to OptimizeIntArithmetic #23948

Open
krzysz00 wants to merge 1 commit into iree-org:main from krzysz00:index-i64-rocm


@krzysz00
Contributor

Integer range analysis now handles narrowing to i32 where safe, making the --iree-rocm-index-bits option (which lowered all ROCDL indices to 32-bit) obsolete. Remove it so the ROCDL path matches NVVM (which always has 64-bit indices at the LLVM conversion level).

Add an indexIsI64 option to OptimizeIntArithmeticPass that relaxes the SAFE_INDEX_UNSIGNED_MAX_VALUE guard on signed-to-unsigned conversions for index values. On LLVMGPU targets, where index is always 64-bit, this guard is unnecessarily conservative and blocks valid optimizations. For-loop IV narrowing (NarrowSCFForIvToI32) retains its own range checks unconditionally.
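To make the guard concrete, here is a minimal Python sketch of the decision it models. It is not the pass's code: the names `SignedRange` and `can_treat_as_unsigned` are hypothetical, and the assumption that the guard corresponds to the signed 32-bit maximum is an illustration, not something stated in this PR. The idea is that a signed operation on an index value may be rewritten to its unsigned form only if the value is provably non-negative and also fits in whatever width index may later be lowered to.

```python
from dataclasses import dataclass

# Assumption for illustration: the guard is the largest value that is
# non-negative in a 32-bit signed index. The real constant lives in the pass.
ASSUMED_SAFE_INDEX_MAX = 2**31 - 1
INT64_MAX = 2**63 - 1


@dataclass(frozen=True)
class SignedRange:
    """Inclusive signed bounds inferred for an index value (hypothetical type)."""
    smin: int
    smax: int


def can_treat_as_unsigned(r: SignedRange, index_is_i64: bool = False) -> bool:
    """Return True if a signed op on this value may be rewritten as unsigned."""
    if r.smin < 0:
        # A possibly negative value never has matching signed/unsigned semantics.
        return False
    # If index might be lowered to 32 bits, values above the 32-bit signed
    # maximum would wrap to negative, so the conservative limit applies.
    limit = INT64_MAX if index_is_i64 else ASSUMED_SAFE_INDEX_MAX
    return r.smax <= limit


big = SignedRange(0, 2**40)
print(can_treat_as_unsigned(big))                     # conservative guard
print(can_treat_as_unsigned(big, index_is_i64=True))  # relaxed guard
```

Under these assumptions, a range such as [0, 2^40] is rejected by the conservative check but accepted once index is known to be 64-bit, which is the kind of optimization the new option is meant to unblock.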

Performance impact: on whole models the change is within the noise floor, as expected, since it only eliminates a few instructions. There is, however, a consistent minor trend on the torch_models CI that gives a 1.01x geometric mean speedup, so there's not much reason not to do this. Table below.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@krzysz00
Contributor Author

| Benchmark | Baseline (ms) | Test (ms) | Speedup |
| --- | --- | --- | --- |
| llama_8b_fp16/decode_benchmark_seq128_mi325 | 7.638 | 7.468 | 1.02x |
| llama_8b_fp16/decode_benchmark_seq2048_mi325 | 9.076 | 8.915 | 1.02x |
| llama_8b_fp16/prefill_benchmark_seq128_mi325 | 31.835 | 31.821 | 1.00x |
| llama_8b_fp16/prefill_benchmark_seq2048_mi325 | 279.081 | 277.750 | 1.00x |
| llama_8b_fp8/decode_benchmark_seq128_mi325 | 8.219 | 7.986 | 1.03x |
| llama_8b_fp8/decode_benchmark_seq128_mi325_data_tiling | 17.252 | 17.244 | 1.00x |
| llama_8b_fp8/decode_benchmark_seq2048_mi325 | 11.054 | 11.034 | 1.00x |
| llama_8b_fp8/decode_benchmark_seq2048_mi325_data_tiling | 19.907 | 20.085 | 0.99x |
| llama_8b_fp8/prefill_benchmark_seq128_mi325 | 25.691 | 25.748 | 1.00x |
| llama_8b_fp8/prefill_benchmark_seq128_mi325_data_tiling | 24.886 | 24.987 | 1.00x |
| llama_8b_fp8/prefill_benchmark_seq2048_mi325 | 180.207 | 180.133 | 1.00x |
| llama_8b_fp8/prefill_benchmark_seq2048_mi325_data_tiling | 197.122 | 196.607 | 1.00x |
| sdxl/clip_benchmark_mi325 | 7.266 | 7.215 | 1.01x |
| sdxl/punet_benchmark_mi325 | 46.146 | 46.054 | 1.00x |
| sdxl/punet_benchmark_mi325_v2 | 43.660 | 43.507 | 1.00x |
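
For readers who want to recompute the summary figure, the sketch below shows one way a geometric mean speedup can be derived from baseline/test timings like those in the table. It assumes speedup is defined as baseline time divided by test time. Whether this matches exactly how the CI computes its number is an assumption, and the helper name `geomean_speedup` is hypothetical.

```python
import math

# (baseline_ms, test_ms) pairs copied from the table above, in row order.
TIMINGS = [
    (7.638, 7.468), (9.076, 8.915), (31.835, 31.821), (279.081, 277.750),
    (8.219, 7.986), (17.252, 17.244), (11.054, 11.034), (19.907, 20.085),
    (25.691, 25.748), (24.886, 24.987), (180.207, 180.133),
    (197.122, 196.607), (7.266, 7.215), (46.146, 46.054), (43.660, 43.507),
]


def geomean_speedup(pairs):
    """Geometric mean of baseline/test ratios, computed via the mean of logs."""
    logs = [math.log(baseline / test) for baseline, test in pairs]
    return math.exp(sum(logs) / len(logs))


print(f"{geomean_speedup(TIMINGS):.3f}x")
```

Averaging in log space keeps a single large benchmark (such as the seq2048 prefill runs) from dominating the result, which is why a geometric mean is the usual choice for summarizing ratios.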
