build: optimized-build mode + ns-3 submodule bump (Draft, blocked by aliyun/ns-3-alibabacloud#21) #278

Draft
tianhao909 wants to merge 1 commit into aliyun:master from tianhao909:pr/plan07-build-mode-and-bump

@tianhao909
Collaborator

PR-γ: default NS-3 optimized build mode + bump ns-3-alibabacloud submodule

Target repository: aliyun/SimAI


Summary

Two minimal build-system changes to aliyun/SimAI, meant to be reviewed together:

  1. Flip the default NS-3 build mode from debug to optimized in astra-sim-alibabacloud/build/astra_ns3/build.sh (1-line change). The optimized profile roughly halves NS-3 simulation wall-clock time on the H20 microbenchmark set.
  2. Bump the ns-3-alibabacloud submodule pointer from 7e3cb5b to the new upstream master HEAD produced by the companion PR-α (GCC 13 build fix + UB fix). The new hash is filled in at the moment PR-α is merged.

Together these let an out-of-the-box GCC 13 user run ./scripts/build.sh -c ns3 on aliyun/SimAI:master and get a working SimAI_simulator binary.

Key Changes

  1. astra-sim-alibabacloud/build/astra_ns3/build.sh (1 line changed)
    • ./ns3 configure -d debug → ./ns3 configure -d optimized
  2. Submodule pointer for ns-3-alibabacloud (1 line changed; filled in after PR-α merges)
    • Before: 7e3cb5b88c99abcb582c5abc3919484a4805111b
    • After: (populated at push time)
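
The bump step itself could be sketched as below. This is only an illustration under stated assumptions: `is_full_sha`, `bump_submodule`, and `SUBMODULE_PATH` are hypothetical names, the submodule is assumed to live at `ns-3-alibabacloud` relative to the repository root, and the target commit is deliberately left as a caller-supplied argument because it is not known until PR-α merges.

```shell
# Return 0 only for a full 40-character lowercase hex commit id.
is_full_sha() {
    [ "${#1}" -eq 40 ] || return 1
    case "$1" in
        *[!0-9a-f]*) return 1 ;;
    esac
    return 0
}

# Hypothetical helper: move the submodule checkout to the given commit and
# stage the new pointer. SUBMODULE_PATH is an assumption about the layout.
bump_submodule() {
    new_sha="$1"
    path="${SUBMODULE_PATH:-ns-3-alibabacloud}"
    if ! is_full_sha "$new_sha"; then
        echo "refusing abbreviated or malformed SHA: $new_sha" >&2
        return 1
    fi
    git submodule update --init -- "$path" || return 1
    git -C "$path" fetch origin || return 1
    git -C "$path" checkout --detach "$new_sha" || return 1
    git add -- "$path"
    # Show the staged pointer change (old..new) for review before committing.
    git diff --cached --submodule=short -- "$path"
}
```

Calling `bump_submodule` with the merged commit id and then committing with a `build:` message would produce the one-line pointer change described above. Requiring the full 40-character id avoids staging a pointer from an ambiguous abbreviation.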

Total diff: 2 lines across 2 files, in 2 commits once complete (the build-mode commit now, the bump commit after PR-α merges).

Testing

Fingerprint (inline):

gcc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0
Linux kernel 5.10.134-16.3.al8.x86_64 x86_64 GNU/Linux
Python 3.13.11

Baseline (upstream aliyun/SimAI:master @ f5efb5a + submodule at 7e3cb5b, GCC 13):

./scripts/build.sh -c ns3
# Expected: build FAILS with
#   bit-serializer.h:NN:NN: error: 'uint8_t' does not name a type
# This motivates the bump in this PR.
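
On GCC 13 this error usually means a header uses fixed-width integer types without including `<cstdint>` itself. A quick way to check a header is sketched below. It assumes the include added by PR-α is `<cstdint>` (or `<stdint.h>`), which this PR description does not state explicitly, and `has_stdint_include` is a hypothetical helper name.

```shell
# Succeed if the given header explicitly includes <cstdint> or <stdint.h>.
# Assumption: this is the include whose absence triggers the GCC 13 error.
has_stdint_include() {
    grep -Eq '^[[:space:]]*#[[:space:]]*include[[:space:]]*<(cstdint|stdint\.h)>' "$1"
}
```

Running it against the copy of `bit-serializer.h` in the submodule before and after the bump should flip the result from failure to success, if the assumption about the include holds.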

With this PR's HEAD (submodule pointing to PR-α merged HEAD, build.sh using -d optimized):

./scripts/build.sh -c ns3
file bin/SimAI_simulator
# Expected: build rc=0
# Expected: ELF 64-bit LSB executable ... not stripped
# The top-level scripts/build.sh rm -rf's extern/ and re-copies from
# the (now-patched) submodule on every clean build, so the GCC 13 fix
# is picked up automatically once the submodule pointer lands upstream.
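
The two expectations above can be turned into a scripted check. The following is a minimal sketch: `is_unstripped_elf64` and `check_simulator_binary` are hypothetical helper names, and the default path is the one used in the test notes above.

```shell
# Classify the text printed by `file`: succeed only for an ELF 64-bit
# binary that still carries its symbol table ("not stripped").
is_unstripped_elf64() {
    case "$1" in
        *"ELF 64-bit"*"not stripped"*) return 0 ;;
        *) return 1 ;;
    esac
}

# Hypothetical wrapper around file(1); the default path is an assumption
# taken from the commands above.
check_simulator_binary() {
    bin="${1:-bin/SimAI_simulator}"
    if [ ! -x "$bin" ]; then
        echo "missing or not executable: $bin" >&2
        return 1
    fi
    # -b prints the description without the leading file name.
    is_unstripped_elf64 "$(file -b "$bin")"
}
```

A CI step could run `./scripts/build.sh -c ns3 && check_simulator_binary` and fail the job on a non-zero status.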

End-to-end cross-branch smoke (all three PRs merged into a temporary integration branch, plan07 allreduce 16m case):

  • RUN_EXIT=0
  • SimAI binary sha256 recorded inline at push time in pr_drafts/task4_gamma_local_bump.txt
  • Note: until aliyun/ns-3-alibabacloud merges PR-α, pre-push verification runs against a local checkout whose submodule points at PR-α's local branch HEAD. This is documented in the internal test log and kept clearly separate from the real submodule pointer that ships with the PR commit.

Known Limitations

c_build — GCC 13 / Ubuntu 24.04 only. The bump is motivated and verified specifically for gcc-13.3.0 / Ubuntu 24.04; older toolchains (gcc-9, gcc-11) were not re-run against the new submodule HEAD in this work. The new submodule commits are additive (an include and a compile-flag pair), so compatibility is expected, but a CI run on the older matrix would be welcome.

Behavioural note on optimized default. Switching the default from debug to optimized is a behavioural change for downstream consumers who rely on unstripped, -O0 builds for in-process gdb sessions. -O2 -g preserves gdb attach-ability, but optimization can inline functions and optimize away local variables, so backtraces and single-stepping will not match a debug build. If maintainers prefer, this PR can be split into:

  • γ1 — build.sh default mode only (1 line, no submodule change)
  • γ2 — submodule bump only (1 line, no build.sh change)

The two commits are intentionally independent so either can be dropped.

Scope Disclaimer

English: This result does NOT serve as a proof of SimAI's generalization accuracy on H20/NVLink across all scenarios. It only reproduces a specific microbenchmark set under a specific calibration. Generalization to other models, sizes, topologies, or fused ops is not established.

中文:本结果不构成对 SimAI 在 H20/NVLink 全场景泛化准确性的证明。它仅复现特定 microbenchmark 集合在特定 calibration 下的精度。对其他模型、规模、拓扑、融合算子的泛化能力均未建立。

Notes

  • Base: aliyun/SimAI:master (currently f5efb5a)
  • Head: tianhao909/SimAI:pr/plan07-build-mode-and-bump
  • Blocked by: aliyun/ns-3-alibabacloud#21 (PR-α, GCC 13 + UB fix). The submodule-bump commit is added only after PR-α merges, at which point this PR transitions from Draft to Ready.
  • Independent of PR-β (topology/config), which can merge first.
  • No changes to rdma-hw.cc or any CMakeLists, and no new runtime dependency.

Checklist

  • Conventional commit messages (build: …)
  • Diff restricted to build.sh + submodule pointer
  • Scope Disclaimer inline (English + Chinese, > blockquote)
  • c_build limitation called out in Known Limitations
  • Clear "Blocked by PR-α" note in Notes
  • Split-into-γ1/γ2 option documented if maintainers prefer single-topic PRs

Reviewer FAQ

| Question | Answer |
| --- | --- |
| Why bump the submodule at all? Can I just use the latest aliyun/ns-3-alibabacloud HEAD? | The target HEAD is the one from PR-α, which contains the minimum patch set (4 files, +7 lines) verified against this repo's build. Newer ns-3 commits would pull in unreviewed behaviour. |
| Why change default to optimized? | It halves wall-clock time on the microbenchmark set this project actually runs. -g is preserved, so gdb still attaches. If preferred, I can leave the default at debug and add a SIMAI_BUILD_MODE env-var override instead. |
| Can I keep the default debug? | Yes: split this PR into γ1 (build.sh only) and γ2 (submodule only), and drop γ1. See Known Limitations. |
| Does this affect astra-sim-alibabacloud/extern/ tracked files? | No. extern/ is .gitignore'd in this repo, and scripts/build.sh -c ns3 (lines 17 and 22) rm -rf extern/ then cp -r from the submodule. Nothing tracked under extern/ is preserved or affected. |
| Why not fix GCC 13 inside this repo directly? | The build actually compiles sources copied from the ns-3 submodule, so the fix has to live in aliyun/ns-3-alibabacloud. That's PR-α. |
| What happens if PR-α is rejected? | This PR cannot merge; the submodule-bump commit is withheld. γ1 (build-mode only) could still merge stand-alone if there is agreement on the optimized default. |
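
For reference, the SIMAI_BUILD_MODE override offered as an alternative in the FAQ might look roughly like the sketch below. It is not part of this PR's diff: `resolve_build_mode` is a hypothetical helper, the fallback value is the default this PR proposes, and only the two profiles discussed here are accepted.

```shell
# Hypothetical helper: pick the NS-3 build profile from SIMAI_BUILD_MODE,
# falling back to the default proposed in this PR when the variable is unset.
resolve_build_mode() {
    mode="${SIMAI_BUILD_MODE:-optimized}"
    case "$mode" in
        debug|optimized)
            printf '%s\n' "$mode"
            ;;
        *)
            echo "unsupported SIMAI_BUILD_MODE: $mode" >&2
            return 1
            ;;
    esac
}

# In build.sh the configure call would then read something like:
#   ./ns3 configure -d "$(resolve_build_mode)"
```

With this in place, `SIMAI_BUILD_MODE=debug ./scripts/build.sh -c ns3` would restore the previous behaviour without editing the script. Rejecting unknown values keeps a typo from silently selecting an unintended profile.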
