S651852 [ctran][tests] Cover non-zero designated CTA in localReduce unaligned-tail tests #2344
Closed
dboyda wants to merge 2 commits into meta-pytorch:main from
Conversation
Contributor
@dboyda has exported this pull request. If you are a Meta employee, you can view the originating Diff in D103456130.
…TA ownership race

Summary:
Adds two DISABLED reproducer tests for the per-CTA byte-ownership invariant violation between `copyUnroll<4, T>` (`fbcode/comms/ctran/algos/DevCommon.cuh:391-442`) and `localReduceVectorized` (`fbcode/comms/ctran/algos/localReduce.cuh:174-269`).

The invariant: every byte of any buffer touched by both writer families must map to the same CTA in both writers. When the same buffer is written by `localReduce` and then by a `copyUnroll`-via-`ctranKernCopy*` writer with only per-CTA sync between them, a CTA in the second writer can read or overwrite bytes that a different CTA in the first writer hasn't yet committed — exactly the data-corruption pattern Pavan Balaji's D69774173 (2025-02-20) commit message describes for `copyUnroll`. Pavan fixed `copyUnroll`'s tail to use a single designated CTA but never propagated the fix to `localReduce.cuh`, where the tail still grid-strides across all CTAs.

In ring AllReduce this is the RS→AG transition: the last RS step writes `recvbuff` (and `tmpSendBuf`) via `localReduce` (`AllReduceRing.cuh:121`), then the AG steps touch the same buffers via `copyUnroll`-family writers (`AllReduceRing.cuh:127, 137, 268, 278`). Inter-round sync is per-CTA only (`GpeKernelSync.completeFlag[blockIdx]`), so the two tails disagreed on which CTA owns a given byte in the `[limitCount, count)` zone.

Two tests, both `DISABLED_` so CI stays green until the fix lands:

1. `DISABLED_TailOwnershipMatchesCopyUnroll` (CPU): analytically transcribes both writers' iterations and asserts per-element CTA ownership matches. Fails 5/6 cases on pre-fix code, with explicit `(count, blockDim, gridDim, sizeofT, copyUnrollOwner, localReduceOwner)` mismatch reports.
2. `DISABLED_DelayedBlockZeroExposesBlockOneStaleRead` (GPU): exercises the real `localReduceVectorized<int32, commSum, 1, 1>` and `copyUnroll<4, int32>` in a multi-writer pattern with per-CTA sync only. Block 0 is deliberately delayed via `__nanosleep` (50 ms) so block 1 races ahead to phase 2 before block 0 issues its phase-1 writes. With pre-fix code, block 1 reads `out[2048] = 0xCDCDCDCD` (the init sentinel) instead of the expected `2049`, because the broken localReduce assigned that byte to block 0 and block 0 is still asleep. The deliberate delay is needed because L2 coherency on H100/GB200 normally masks the race; widening the window produces a deterministic failure.

A third test, `OwnershipMatchesAtAlignedCounts` (CPU, ENABLED), is a sanity check at counts that are multiples of `numPerBlock` — both writers always agree there — and guards the analytical transcriptions against bugs in themselves.

Reviewed By: minsii

Differential Revision: D103324873
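The ownership disagreement can be illustrated with a small host-side model. This is a sketch only: the helper names, the chunked body split, and the grid-stride formula for the pre-fix tail are assumptions, not the real kernels' arithmetic from `DevCommon.cuh` or `localReduce.cuh`.

```python
# Host-side sketch of the two tail-ownership schemes. Assumed model:
# the body [0, limit) is split into numPerBlock-sized chunks, chunk i
# owned by CTA (i % gridDim); the schemes differ only in the tail
# [limit, count).

def owner_designated_tail(e, count, num_per_block, grid_dim):
    """copyUnroll-style (post-D69774173): one designated CTA owns the tail."""
    limit = (count // num_per_block) * num_per_block
    if e < limit:
        return (e // num_per_block) % grid_dim
    # the CTA that would follow the last full chunk owns every tail byte
    return (count // num_per_block) % grid_dim

def owner_gridstride_tail(e, count, num_per_block, grid_dim):
    """pre-fix localReduce-style: tail elements grid-stride over all CTAs."""
    limit = (count // num_per_block) * num_per_block
    if e < limit:
        return (e // num_per_block) % grid_dim
    return (e - limit) % grid_dim  # assumption: the stride restarts at CTA 0

# One unaligned element past the last full chunk: the two writers disagree.
count, npb, g = 10241, 10240, 4
print(owner_designated_tail(10240, count, npb, g))   # 1
print(owner_gridstride_tail(10240, count, npb, g))   # 0
```

With per-CTA sync only, the copy-phase CTA that owns a tail byte assumes the same CTA wrote it in the reduce phase; under the pre-fix scheme a different CTA did, which is the stale read the GPU test provokes.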
…naligned-tail tests
Summary:
The next diff in this stack changes `localReduceVectorized` and `localReduceFallback` to use a single-designated-CTA tail (matching `copyUnroll<4, T>`'s pattern). Existing `localReduceSumUnaligned` (`fbcode/comms/ctran/algos/tests/CtranAlgoDevUT.cc:298`) only exercises the `count=1041, blockDim=640, gridDim=2` configuration, where the designated tail CTA collapses to block 0 for every supported dtype — leaving the new "tail handled by a non-zero CTA" code path untested.
Adds `localReduceSumUnalignedNonZeroDesignatedCta` covering counts `{10241, 20481, 30721, 40961}` with `gridDim=4`. For T=float (sizeofT=4), `numPerBlock = blockDim * (16/sizeof(T)) * 4 = 640 * 16 = 10240`, so:
- count=10241 -> tail=1, designated=1
- count=20481 -> tail=1, designated=2
- count=30721 -> tail=1, designated=3
- count=40961 -> tail=1, designated=0 (wrap)
For other dtypes the designated index shifts proportionally with `numPerBlock`; in every combination the tail is non-empty and at least one count puts it on a non-zero block. The test passes on the current (pre-fix) code as well — both writer-tail variants must produce correct reductions in single-call usage; this only tightens regression coverage so the next diff's tail rewrite cannot silently drop bytes when the tail belongs to a CTA other than block 0.
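The tail/designated table above can be reproduced with a few lines. The formula for the designated index, `(count / numPerBlock) % gridDim`, is an assumption inferred from the wrap at count=40961; the `numPerBlock` expression is taken from the summary.

```python
# Sketch: reproduce the tail/designated table for T=float (sizeofT=4).
block_dim, grid_dim, sizeof_t = 640, 4, 4
num_per_block = block_dim * (16 // sizeof_t) * 4  # 640 * 16 = 10240

for count in (10241, 20481, 30721, 40961):
    tail = count % num_per_block                    # always 1 here
    designated = (count // num_per_block) % grid_dim
    print(count, tail, designated)
```

For a different dtype, changing `sizeof_t` rescales `num_per_block`, which shifts which count lands on which designated block, matching the "shifts proportionally" note above.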
Reviewed By: minsii
Differential Revision: D103456130
Contributor
This pull request has been merged in eaadcee.