S651852 [ctran][tests] Cover non-zero designated CTA in localReduce unaligned-tail tests #2344
Closed
dboyda wants to merge 2 commits into meta-pytorch:main from
Conversation
Contributor
@dboyda has exported this pull request. If you are a Meta employee, you can view the originating Diff in D103456130.
…TA ownership race

Summary:
Adds two DISABLED reproducer tests for the per-CTA byte-ownership invariant violation between `copyUnroll<4, T>` (`fbcode/comms/ctran/algos/DevCommon.cuh:391-442`) and `localReduceVectorized` (`fbcode/comms/ctran/algos/localReduce.cuh:174-269`).

The invariant: every byte of any buffer touched by both writer families must map to the same CTA in both writers. When the same buffer is written by `localReduce` and then by a `copyUnroll`-via-`ctranKernCopy*` writer with only per-CTA sync between them, a CTA in the second writer can read or overwrite bytes that a different CTA in the first writer hasn't yet committed — exactly the data-corruption pattern Pavan Balaji's D69774173 (2025-02-20) commit message describes for `copyUnroll`. Pavan fixed `copyUnroll`'s tail to use a single designated CTA but never propagated the fix to `localReduce.cuh`, where the tail still grid-strides across all CTAs.

In ring AllReduce this is the RS→AG transition: the last RS step writes `recvbuff` (and `tmpSendBuf`) via `localReduce` (`AllReduceRing.cuh:121`), then the AG steps touch the same buffers via `copyUnroll`-family writers (`AllReduceRing.cuh:127, 137, 268, 278`). Inter-round sync is per-CTA only (`GpeKernelSync.completeFlag[blockIdx]`), so the two tails disagreed on which CTA owns a given byte in the `[limitCount, count)` zone.

Two tests, both `DISABLED_` so CI stays green until the fix lands:

1. `DISABLED_TailOwnershipMatchesCopyUnroll` (CPU): analytically transcribes both writers' iterations and asserts per-element CTA ownership matches. Fails 5/6 cases on pre-fix code, with explicit `(count, blockDim, gridDim, sizeofT, copyUnrollOwner, localReduceOwner)` mismatch reports.
2. `DISABLED_DelayedBlockZeroExposesBlockOneStaleRead` (GPU): exercises the real `localReduceVectorized<int32, commSum, 1, 1>` and `copyUnroll<4, int32>` in a multi-writer pattern with per-CTA sync only. Block 0 is deliberately delayed via `__nanosleep` (50 ms) so block 1 races ahead to phase 2 before block 0 issues its phase-1 writes. With pre-fix code, block 1 reads `out[2048] = 0xCDCDCDCD` (the init sentinel) instead of the expected `2049`, because the broken localReduce assigned that byte to block 0 and block 0 is still asleep. The deliberate delay is needed because L2 coherency on H100/GB200 normally masks the race; widening the window produces a deterministic failure.

A third test, `OwnershipMatchesAtAlignedCounts` (CPU, ENABLED), is a sanity check at counts that are multiples of `numPerBlock` — both writers always agree there — and guards the analytical transcriptions against bugs in themselves.

Reviewed By: minsii

Differential Revision: D103324873
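The ownership disagreement can be illustrated with a small host-side model. This is a sketch only: the helper names, the chunked body split, and the grid-stride formula for the pre-fix tail are assumptions, not the real kernels' arithmetic from `DevCommon.cuh` or `localReduce.cuh`.

```python
# Host-side sketch of the two tail-ownership schemes. Assumed model:
# the body [0, limit) is split into numPerBlock-sized chunks, chunk i
# owned by CTA (i % gridDim); the schemes differ only in the tail
# [limit, count).

def owner_designated_tail(e, count, num_per_block, grid_dim):
    """copyUnroll-style (post-D69774173): one designated CTA owns the tail."""
    limit = (count // num_per_block) * num_per_block
    if e < limit:
        return (e // num_per_block) % grid_dim
    # the CTA that would follow the last full chunk owns every tail byte
    return (count // num_per_block) % grid_dim

def owner_gridstride_tail(e, count, num_per_block, grid_dim):
    """pre-fix localReduce-style: tail elements grid-stride over all CTAs."""
    limit = (count // num_per_block) * num_per_block
    if e < limit:
        return (e // num_per_block) % grid_dim
    return (e - limit) % grid_dim  # assumption: the stride restarts at CTA 0

# One unaligned element past the last full chunk: the two writers disagree.
count, npb, g = 10241, 10240, 4
print(owner_designated_tail(10240, count, npb, g))   # 1
print(owner_gridstride_tail(10240, count, npb, g))   # 0
```

With per-CTA sync only, the copy-phase CTA that owns a tail byte assumes the same CTA wrote it in the reduce phase; under the pre-fix scheme a different CTA did, which is the stale read the GPU test provokes.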
…naligned-tail tests
Summary:
The next diff in this stack changes `localReduceVectorized` and `localReduceFallback` to use a single-designated-CTA tail (matching `copyUnroll<4, T>`'s pattern). Existing `localReduceSumUnaligned` (`fbcode/comms/ctran/algos/tests/CtranAlgoDevUT.cc:298`) only exercises the `count=1041, blockDim=640, gridDim=2` configuration, where the designated tail CTA collapses to block 0 for every supported dtype — leaving the new "tail handled by a non-zero CTA" code path untested.
Adds `localReduceSumUnalignedNonZeroDesignatedCta` covering counts `{10241, 20481, 30721, 40961}` with `gridDim=4`. For T=float (sizeofT=4), `numPerBlock = blockDim * (16/sizeof(T)) * 4 = 640 * 16 = 10240`, so:
- count=10241 -> tail=1, designated=1
- count=20481 -> tail=1, designated=2
- count=30721 -> tail=1, designated=3
- count=40961 -> tail=1, designated=0 (wrap)
For other dtypes the designated index shifts proportionally with `numPerBlock`; in every combination the tail is non-empty and at least one count puts it on a non-zero block. The test passes on the current (pre-fix) code as well — both writer-tail variants must produce correct reductions in single-call usage; this only tightens regression coverage so the next diff's tail rewrite cannot silently drop bytes when the tail belongs to a CTA other than block 0.
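The tail/designated table above can be reproduced with a few lines. The formula for the designated index, `(count / numPerBlock) % gridDim`, is an assumption inferred from the wrap at count=40961; the `numPerBlock` expression is taken from the summary.

```python
# Sketch: reproduce the tail/designated table for T=float (sizeofT=4).
block_dim, grid_dim, sizeof_t = 640, 4, 4
num_per_block = block_dim * (16 // sizeof_t) * 4  # 640 * 16 = 10240

for count in (10241, 20481, 30721, 40961):
    tail = count % num_per_block                    # always 1 here
    designated = (count // num_per_block) % grid_dim
    print(count, tail, designated)
```

For a different dtype, changing `sizeof_t` rescales `num_per_block`, which shifts which count lands on which designated block, matching the "shifts proportionally" note above.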
Reviewed By: minsii
Differential Revision: D103456130
Contributor
This pull request has been merged in eaadcee.