
S651852 [ctran][tests] Cover non-zero designated CTA in localReduce unaligned-tail tests #2344

Closed

dboyda wants to merge 2 commits into meta-pytorch:main from dboyda:export-D103456130

Conversation

@dboyda (Contributor) commented May 1, 2026

Summary:
The next diff in this stack changes `localReduceVectorized` and `localReduceFallback` to use a single-designated-CTA tail (matching `copyUnroll<4, T>`'s pattern). Existing `localReduceSumUnaligned` (`fbcode/comms/ctran/algos/tests/CtranAlgoDevUT.cc:298`) only exercises the `count=1041, blockDim=640, gridDim=2` configuration, where the designated tail CTA collapses to block 0 for every supported dtype — leaving the new "tail handled by a non-zero CTA" code path untested.

Adds `localReduceSumUnalignedNonZeroDesignatedCta` covering counts `{10241, 20481, 30721, 40961}` with `gridDim=4`. For T=float (sizeofT=4), `numPerBlock = blockDim * (16/sizeof(T)) * 4 = 640 * 16 = 10240`, so:

- count=10241 -> tail=1, designated=1
- count=20481 -> tail=1, designated=2
- count=30721 -> tail=1, designated=3
- count=40961 -> tail=1, designated=0 (wrap)

For other dtypes the designated index shifts proportionally with `numPerBlock`; in every combination the tail is non-empty and at least one count puts it on a non-zero block. The test passes on the current (pre-fix) code as well — both writer-tail variants must produce correct reductions in single-call usage; this only tightens regression coverage so the next diff's tail rewrite cannot silently drop bytes when the tail belongs to a CTA other than block 0.

Reviewed By: minsii

Differential Revision: D103456130

meta-cla Bot added the CLA Signed label May 1, 2026
meta-codesync Bot (Contributor) commented May 1, 2026

@dboyda has exported this pull request. If you are a Meta employee, you can view the originating Diff in D103456130.

dboyda force-pushed the export-D103456130 branch from bb905ec to a43a4d4 on May 2, 2026 20:03
dboyda added 2 commits May 2, 2026 13:20
…TA ownership race

Summary:
Adds two DISABLED reproducer tests for the per-CTA byte-ownership invariant violation between `copyUnroll<4, T>` (`fbcode/comms/ctran/algos/DevCommon.cuh:391-442`) and `localReduceVectorized` (`fbcode/comms/ctran/algos/localReduce.cuh:174-269`).

The invariant: every byte of any buffer touched by both writer families must map to the same CTA in both writers. When the same buffer is written by `localReduce` and then by `copyUnroll`-via-`ctranKernCopy*` with only per-CTA sync between them, a CTA in the second writer can read or overwrite bytes a different CTA in the first writer hasn't yet committed — exactly the data-corruption pattern Pavan Balaji's D69774173 (2025-02-20) commit message describes for `copyUnroll`. Pavan fixed `copyUnroll`'s tail to use a single designated CTA but never propagated the fix to `localReduce.cuh`, where the tail still grid-strides across all CTAs.

In ring AllReduce this is the RS→AG transition: RS-last-step writes `recvbuff` (and `tmpSendBuf`) via `localReduce` (`AllReduceRing.cuh:121`), then AG steps touch the same buffers via `copyUnroll`-family writers (`AllReduceRing.cuh:127, 137, 268, 278`). Inter-round sync is per-CTA only (`GpeKernelSync.completeFlag[blockIdx]`). The two tails disagreed on which CTA owns a given byte in the `[limitCount, count)` zone.

Two tests, both `DISABLED_` so CI stays green until the fix lands:

1. `DISABLED_TailOwnershipMatchesCopyUnroll` (CPU): analytically transcribes both writers' iterations and asserts per-element CTA ownership matches. Fails 5/6 cases on pre-fix code with explicit `(count, blockDim, gridDim, sizeofT, copyUnrollOwner, localReduceOwner)` mismatch reports.

2. `DISABLED_DelayedBlockZeroExposesBlockOneStaleRead` (GPU): exercises the REAL `localReduceVectorized<int32, commSum, 1, 1>` and `copyUnroll<4, int32>` in a multi-writer pattern with per-CTA sync only. Block 0 is deliberately delayed via `__nanosleep` (50 ms) so block 1 races ahead to phase 2 before block 0 issues phase-1 writes. With pre-fix code, block 1 reads `out[2048] = 0xCDCDCDCD` (init sentinel) instead of expected `2049` because broken localReduce assigned that byte to block 0 and block 0 is still asleep. The hack is needed because L2 coherency on H100/GB200 normally masks the race; widening the window via deliberate delay produces deterministic failure.

`OwnershipMatchesAtAlignedCounts` (CPU, ENABLED): sanity check at counts that are multiples of `numPerBlock` — both writers always agree there. Guards the analytical transcriptions against bugs in themselves.

Reviewed By: minsii

Differential Revision: D103324873
…naligned-tail tests

dboyda force-pushed the export-D103456130 branch from a43a4d4 to f7282e6 on May 2, 2026 20:20
meta-codesync Bot closed this in eaadcee on May 3, 2026
meta-codesync Bot (Contributor) commented May 3, 2026

This pull request has been merged in eaadcee.


Labels: CLA Signed, fb-exported, Merged, meta-exported
