
Speed up --verify-numerics sample generation for half-precision convs #1314

Merged
rkayaith merged 1 commit into iree-org:main from rkayaith:slow-verify-numerics
Feb 27, 2026


@rkayaith
Member

torch.randn on CPU is much slower for half-precision dtypes than on GPU. For a large conv config (convfp16 -n 32 -c 256 -H 100 -W 100 -k 2376 -y 3 -x 3 -p 1 -q 1 -u 1 -v 1 -l 1 -j 1 --in_layout NHWC --fil_layout NHWC --out_layout NHWC -m conv -g 1 -F 4 -t 1), sample generation was taking 18.3s of 25.1s total verification time.
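The CPU versus GPU gap described above can be checked with a small timing harness. The following is a minimal sketch, not code from this PR: `time_randn` is a hypothetical helper, the shape is a small illustrative one rather than the config quoted above, and the GPU measurement only runs when CUDA is available.

```python
import time

import torch


def time_randn(shape, dtype, device, repeats=3):
    """Return the best wall-clock time in seconds of torch.randn over `repeats` runs.

    Hypothetical helper for illustration; not part of the PR.
    """
    best = float("inf")
    for _ in range(repeats):
        if device.type == "cuda":
            # Make sure earlier GPU work does not leak into the measurement.
            torch.cuda.synchronize(device)
        start = time.perf_counter()
        torch.randn(shape, dtype=dtype, device=device)
        if device.type == "cuda":
            # GPU kernels launch asynchronously; wait for completion before stopping the clock.
            torch.cuda.synchronize(device)
        best = min(best, time.perf_counter() - start)
    return best


# Small illustrative shape (assumption); increase it to approach realistic conv inputs.
shape = (4, 16, 32, 32)

devices = [torch.device("cpu")]
if torch.cuda.is_available():
    devices.append(torch.device("cuda"))

# Map (device type, dtype name) -> best observed generation time.
timings = {
    (dev.type, str(dtype)): time_randn(shape, dtype, dev)
    for dev in devices
    for dtype in (torch.float16, torch.float32)
}

for key, seconds in timings.items():
    print(key, f"{seconds:.6f}s")
```

Absolute numbers will depend on the machine and the PyTorch build, so this is only useful for comparing the entries against each other.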

Generate sample data on GPU and transfer to CPU for the reference computation instead of the other way around. Total runtime dropped from 45.9s to 16.2s with --verify-numerics (8.3s without) on a 96-core EPYC 9454.
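The idea can be sketched roughly as follows. This is an illustration under assumptions, not the code merged in this PR: `make_samples` and its parameters are hypothetical names, and the sketch falls back to CPU generation when no CUDA device is present.

```python
import torch


def make_samples(shape, dtype, seed=0):
    """Generate random samples on the GPU when available and return (device tensor, CPU copy).

    Hypothetical helper for illustration; not the PR's actual implementation.
    """
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    # A generator bound to the target device keeps sampling reproducible per seed.
    gen = torch.Generator(device=device)
    gen.manual_seed(seed)

    # Sample directly in half precision on the device instead of on the CPU.
    on_device = torch.randn(shape, dtype=dtype, device=device, generator=gen)

    # Copy the same values to the CPU for the reference computation.
    on_cpu = on_device.cpu()
    return on_device, on_cpu


# Small illustrative shape (assumption), not the conv config from the PR.
gpu_input, cpu_input = make_samples((2, 3, 4, 4), torch.float16)
print(gpu_input.device, cpu_input.device, cpu_input.dtype)
```

Because the CPU tensor is a copy of the device tensor, both the kernel under test and the reference see identical inputs; only the place where the random numbers are produced changes.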

torch.randn on CPU is much slower for half-precision dtypes than on
GPU. Generate sample data on GPU and transfer to CPU for the reference
computation instead of the other way around.

Tested on convfp16 -n 32 -c 256 -H 100 -W 100 -k 2376 -y 3 -x 3
-F 4 (NHWC, weight backward): total runtime dropped from 45.9s to
16.2s with --verify-numerics (8.3s without).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@rkayaith rkayaith merged commit b3ddea4 into iree-org:main Feb 27, 2026
8 checks passed
@rkayaith rkayaith deleted the slow-verify-numerics branch February 27, 2026 02:37