A benchmark/verification application that uses the AuroraFlow library across all three execution modes (hw, hw_emu, sw_emu). This also serves as a reference for how to use the library from both Make and CMake.
Two simple HLS kernels (hls/send.cpp, hls/recv.cpp) send and receive data over the Aurora link.
The host program (host/host_aurora_flow_test.cpp) is MPI-based and uses up to three FPGAs per node, in one of three topologies:
- Loopback (
-m 0): each Aurora core loops back to itself - Pair (
-m 1): ch0 and ch1 of the same FPGA talk to each other - Ring (
-m 2): every FPGA's ch1 talks to the next FPGA's ch0
For each (topology × message size × frame size) combination, the host measures latency and throughput, reads back the AuroraFlow status counters via the library's C++ API, and writes one line per repetition to results.csv.
With the known topology, the data can be verified on the host. This can be disabled to test arbitrary topologies using -m 3.
The host binary is identical across all three execution modes. It picks up XCL_EMULATION_MODE at runtime and the library code in AuroraFlow.hpp routes accordingly.
test/
├── Makefile Make build (via aurora_flow.mk)
├── CMakeLists.txt CMake build (via find_package(AuroraFlow))
├── hls/
│ ├── send.cpp HLS kernel: reads DDR, writes AXI-Stream
│ └── recv.cpp HLS kernel: reads AXI-Stream, writes DDR
├── host/
│ ├── host_aurora_flow_test.cpp
│ ├── Configuration.hpp (CLI parsing, ExecutionMode enum)
│ ├── Kernel.hpp (SendKernel / RecvKernel wrappers)
│ └── Results.hpp (result aggregation + csv writing)
├── cfg/
│ ├── aurora_flow_test_hw.cfg v++ link config (hw)
│ ├── aurora_flow_test_hw_emu.cfg v++ link config (hw_emu)
│ └── aurora_flow_test_sw_emu.cfg v++ link config (sw_emu)
├── scripts/
│ ├── configure_loopback_emu.sh pipe topology setup (loopback)
│ ├── configure_pair_emu.sh pipe topology setup (pair)
│ ├── configure_ring_emu.sh pipe topology setup (ring)
│ ├── hw_emu_rank_wrapper.sh xsim per-rank isolation (hw_emu + MPI)
│ ├── run_*_emu.sh emulation launchers (sw_emu / hw_emu)
│ ├── run_*_hw.sh hardware launchers (Noctua 2)
│ ├── configure_*_hw.sh QSFP link configuration
│ └── reset.sh, synth.sh, ... assorted helpers
├── eval/
│ └── eval.jl Julia script for post-processing results.csv
└── cxxopts.hpp Vendored single-header CLI parser
make host # host_aurora_flow_test
make xclbin TARGET=sw_emu # aurora_flow_test_sw_emu.xclbin
make xclbin TARGET=hw_emu # aurora_flow_test_hw_emu.xclbin
make xclbin TARGET=hw # aurora_flow_test_hw.xclbin (synthesis, slow)The test/Makefile includes ../aurora_flow.mk, which locates the library. Variables like FIFO_WIDTH=32 and USE_FRAMING=1 are forwarded to the library Makefile automatically.
make xclbin TARGET=hw_emu FIFO_WIDTH=32 USE_FRAMING=1mkdir build && cd build
cmake ..
cmake --build . --target host_aurora_flow_test
cmake --build . --target xclbin_sw_emu
cmake --build . --target xclbin_hw_emu
cmake --build . --target xclbin_hw
cmake --build . --target emconfigCache variables can be passed at configure time:
cmake .. -DPLATFORM=xilinx_u280_gen3x16_xdma_1_202211_1 -DFIFO_WIDTH=64The CMake build uses the same library artifacts. find_package(AuroraFlow) delegates kernel packaging back to the library Makefile, and CMake invokes v++ for the HLS kernel compile and xclbin link step.
All build artifacts (host binary, xclbin, emconfig.json) live in test/build/.
The run scripts cd into that directory automatically before invoking mpirun,
so you can launch them from test/. Override with BUILD_DIR=... if you built
somewhere else.
./scripts/run_loopback_emu.sh sw_emu 3 -r 2 -i 10 # sw_emu, loopback, 3 ranks
./scripts/run_pair_emu.sh hw_emu 3 -r 2 -i 2 -b 256 # hw_emu, pair, 3 ranks
./scripts/run_ring_emu.sh sw_emu 2 -r 1 -i 100 # sw_emu, ring, 2 ranksArguments: <emu_mode> [num_ranks] [host_args...]. All trailing arguments are forwarded to host_aurora_flow_test.
Each run_*_emu.sh script does three things:
- Creates
AURORA_PIPE_DIRand invokes the matchingconfigure_*_emu.shtomkfifothe per-rank pipes and symlink them into the chosen topology. The pipe naming follows the library contract documented in the root README. - For
sw_emu, launchesmpirun -n N ./host_aurora_flow_test ...directly. - For
hw_emu, launchesmpirun -n N ./hw_emu_rank_wrapper.sh ...so each rank gets its own working directory (xsim uses fixed socket/lock filenames that would collide between processes on the same host).
./scripts/run_loopback_hw.sh -p aurora_flow_test_hw.xclbin
./scripts/run_pair_hw.sh -p aurora_flow_test_hw.xclbin -l -i 10
./scripts/run_ring_hw.sh -p aurora_flow_test_hw.xclbinThese scripts source env.sh (Noctua 2 module loads), reset the FPGAs, configure QSFP links for the requested topology, and run the host with the correct -m flag. Additional arguments are passed through to the host binary.
On Noctua 2, there's also a helper that iterates over all FPGA nodes:
./scripts/for_every_node.sh ./scripts/run_over_all_configs.shTest program for AuroraFlow
Usage:
host_aurora_flow_test [OPTION...]
-d, --device_id arg Device ID according to linkscript (default: 0)
-p, --xclbin_path arg Path to xclbin file (default:
aurora_flow_test_hw.xclbin)
-r, --repetitions arg Repetitions. Will be discarded, when used with -l
(default: 1)
-i, --iterations arg Iterations in one repetition. Will be scaled up,
when used with -l (default: 1)
-f, --frame_size arg Maximum frame size. In multiple of the input width
(default: 128)
-b, --num_bytes arg Maximum number of bytes transferred per iteration.
Must be a multiple of the input width (default:
1048576)
-m, --test_mode arg Topology. 0 for loopback, 1 for pair and 2 for
ring (default: 0)
-c, --check_status Check if the link is up and exit
-n, --nfc_test NFC Test. Recv Kernel will be started 3 seconds
later then the Send kernel.
-l, --latency_test Creates one repetition for every message size, up
to the maximum
-s, --semaphore Locks the results file. Needed for parallel
evaluation
-t, --timeout_ms arg Timeout in ms (default: 10000)
-w, --wait Wait for enter after loading bitstream. Needed for
chipscope
-h, --help Print usage
Results are printed to stdout and appended to results.csv. The -s flag serializes writes when running the test in parallel across nodes (the file must exist beforehand, otherwise the application will wait forever on the lock).
Sweeps through message sizes in powers of two from the channel width up to --num_bytes.
The number of iterations per message size is adjusted so every repetition has roughly the same duration.
The recv kernel acknowledges every iteration so that the actual transfer time is observable.
This requires the loopback or pair topology, for which an acknowledge stream in one FPGA exists.
Example, largest message size 128 MB:
./host_aurora_flow_test -l -i 10 -f 128
Repetition Bytes Iterations Frame Size
------------------------------------------------
0 64 1405867 1
1 128 1198316 2
2 256 995838 4
...
21 134217728 20 128
22 268435456 10 128
Starts the recv kernel three seconds after the send kernel, forcing the RX FIFO to fill and the NFC flow control to trigger XOFF.
The test verifies that the link does not lose data under backpressure.
The host reads the nfc_full_trigger_count, nfc_empty_trigger_count, and nfc_latency_count counters from the AuroraFlow core afterwards.
Works in both hw and hw_emu modes.
./scripts/run_pair_hw.sh -p aurora_flow_test_hw.xclbin -n
./scripts/run_pair_emu.sh hw_emu 2 -r 1 -i 1 -b 65536 -nCopyright© 2023-2026 Gerrit Pape (gerrit.pape@uni-paderborn.de)