CAMTA is a 16-bit reconfigurable multi-region activation unit for nonlinear function approximation in machine learning workloads. The unit approximates functions such as GeLU, tanh, sigmoid, Swish, and the exponential stage used in Softmax by reusing the same scalar datapath with runtime-configurable thresholds, coefficients, polynomial degrees, and region modes.
The current repository contains:
- a Vitis HLS implementation of the scalar CAMTA unit and Alveo kernel wrapper;
- an XRT host application for FPGA runtime and accuracy validation;
- manually isolated Verilog RTL for the scalar core used for ASIC-oriented synthesis;
- simple C++ and Verilog testbenches for functional checks.
CAMTA is intended to be a reusable activation-function macro rather than a function-specific evaluator. Its main design trade-off is accepting some area and power overhead in exchange for runtime configurability and hardware reuse across multiple nonlinear functions, with no hardware resynthesis required.
- Carlos Soto-Porras
- Jose Fonseca-Cruz
- Pablo Ramirez-Morera
- Erick Obregon-Fonseca
- Luis G. Leon-Vega
- Jorge Castro-Godinez
School of Electronics Engineering
Costa Rica Institute of Technology
CAMTA partitions the input domain into three programmable regions using two independent thresholds:

    r0: x < L_left
    r1: L_left <= x <= L_right
    r2: x > L_right
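As a rough illustration of the three-way decode above, the following C++ sketch maps an input to a region index. It uses double rather than the repository's fixed-point types, and the names Region and select_region are hypothetical, not taken from src/camta.cpp.

```cpp
#include <cassert>

// Hypothetical region index type; r0/r1/r2 follow the naming used in the text.
enum class Region { R0 = 0, R1 = 1, R2 = 2 };

// Two-threshold region decode. Assumes l_left <= l_right.
Region select_region(double x, double l_left, double l_right) {
    assert(l_left <= l_right);
    if (x < l_left)  return Region::R0;  // x < L_left
    if (x > l_right) return Region::R2;  // x > L_right
    return Region::R1;                   // L_left <= x <= L_right (both bounds inclusive)
}
```

Both thresholds belong to the middle region, so an input exactly equal to L_left or L_right is evaluated with the r1 settings.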
For each region, CAMTA selects:
- four polynomial coefficients: a3, a2, a1, a0;
- a polynomial degree: 0, 1, 2, or 3;
- an execution mode: HORNER, CONST, ZERO, or IDENTITY.
In HORNER mode, the selected region is evaluated as a polynomial of degree at most 3 (a full cubic when the degree is set to 3):

    P(x) = a0 + a1*x + a2*x^2 + a3*x^3

using Horner's rule:

    P(x) = (((a3*x + a2)*x + a1)*x + a0)
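A minimal floating-point sketch of a degree-controlled Horner evaluation is shown below. The function name horner and the way the degree setting is applied (starting the chain at the highest enabled coefficient and ignoring the rest) are assumptions for illustration; the actual behavior is defined in src/horner_core.cpp.

```cpp
#include <array>
#include <cassert>

// coeffs = {a0, a1, a2, a3}; degree must be in [0, 3].
// Assumption: coefficients above `degree` are simply ignored.
double horner(const std::array<double, 4>& coeffs, int degree, double x) {
    assert(degree >= 0 && degree <= 3);
    double acc = coeffs[degree];           // start from the highest enabled term
    for (int k = degree - 1; k >= 0; --k) {
        acc = acc * x + coeffs[k];         // one multiply-add per remaining term
    }
    return acc;
}
```

With degree 3 this performs three multiply-add steps, exactly the nested form of Horner's rule given above.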
The region mode controls the final output behavior:
| Mode | Output behavior |
|---|---|
| HORNER | Returns the Horner polynomial result. |
| CONST | Returns the selected a0 coefficient. |
| ZERO | Returns zero. |
| IDENTITY | Returns the input x. |
This mode-controlled bypass avoids unnecessary polynomial evaluation in regions where the function can be represented as constant, zero, or identity-like behavior. For example, tanh and sigmoid can use constant tails, while GeLU and Swish can use residual/identity-like tails.
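The output selection can be pictured as a four-way multiplexer. The sketch below is a behavioral illustration only: the enum Mode, its ordering, and the function apply_mode are hypothetical, and the real encodings live in src/camta_types.h.

```cpp
#include <cassert>

// Hypothetical mode enumeration; numeric encodings are not taken from the repository.
enum class Mode { HORNER, CONST, ZERO, IDENTITY };

// Selects the region output from the Horner result, the a0 coefficient, zero, or the input.
double apply_mode(Mode mode, double horner_result, double a0, double x) {
    switch (mode) {
        case Mode::HORNER:   return horner_result;
        case Mode::CONST:    return a0;
        case Mode::ZERO:     return 0.0;
        case Mode::IDENTITY: return x;
    }
    assert(false && "unknown mode");
    return 0.0;
}
```

For a tanh tail, for example, a region could use CONST with a0 set to the saturation value, so no polynomial fit is needed there.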
The HLS implementation uses fixed-point arithmetic:

    typedef ap_fixed<16,6> data_t;
    typedef ap_fixed<24,10> acc_t;

| Type | Description |
|---|---|
| data_t | 16-bit Q6.10 fixed-point type used for inputs, outputs, thresholds, and coefficients. |
| acc_t | 24-bit internal accumulator used inside the Horner evaluation chain. |
The external datapath uses a compact 16-bit Q6.10 representation, while the Horner chain uses a wider accumulator to reduce intermediate quantization effects before casting the result back to data_t.
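To make the Q6.10 format concrete, the sketch below emulates it with a plain 16-bit integer, without the Vitis HLS headers. The helper names to_q6_10 and from_q6_10 are hypothetical. The truncation toward negative infinity is assumed to mirror the default ap_fixed quantization mode, and overflow is not modelled, so inputs must already lie in [-32, 32).

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

constexpr int kFracBits = 10;                // Q6.10: 6 integer bits (sign included), 10 fractional
constexpr double kScale = 1 << kFracBits;    // 1024 steps per unit, resolution 2^-10

// Real value -> raw Q6.10 word, truncating toward negative infinity (assumed AP_TRN-like).
std::int16_t to_q6_10(double v) {
    assert(v >= -32.0 && v < 32.0);          // representable range; overflow is not modelled
    return static_cast<std::int16_t>(std::floor(v * kScale));
}

// Raw Q6.10 word -> real value.
double from_q6_10(std::int16_t raw) {
    return static_cast<double>(raw) / kScale;
}
```

One least-significant bit corresponds to 2^-10, roughly 0.00098, which sets a floor on the achievable approximation error at the 16-bit interface.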
    camta-non-linear-unit/
    ├── README.md
    ├── LICENSE
    ├── run_hls.tcl
    ├── HW/
    │   ├── Makefile
    │   ├── camta.cpp
    │   ├── camta.h
    │   ├── camta_types.h
    │   ├── horner_core.cpp
    │   └── horner_core.h
    ├── SW/
    │   ├── Makefile
    │   └── camta.cpp
    ├── src/
    │   ├── camta.cpp
    │   ├── camta.h
    │   ├── camta_types.h
    │   ├── horner_core.cpp
    │   ├── horner_core.h
    │   ├── camta_unit_core.v
    │   ├── camta_horner_seq.v
    │   ├── camta_region_select.v
    │   └── camta_mode_bypass.v
    ├── tb/
    │   ├── camta_tb.cpp
    │   └── tb_camta_unit_core.v
    └── scripts/
        └── compare_hls_solutions.py
| Path | Description |
|---|---|
| src/camta.cpp | HLS source for camta_unit and scalar array wrapper camta. Used by run_hls.tcl. |
| src/horner_core.cpp | Degree-controlled Horner polynomial evaluator. |
| src/camta_types.h | Fixed-point types and mode definitions. |
| src/*.v | Isolated Verilog RTL modules for scalar core-level simulation and ASIC-oriented synthesis. |
| HW/ | Alveo/Vitis kernel source and Makefile for generating the FPGA binary. |
| SW/camta.cpp | XRT host application used for runtime and numerical validation. |
| tb/camta_tb.cpp | Simple C++ functional testbench for the scalar HLS wrapper. |
| tb/tb_camta_unit_core.v | Verilog testbench for the isolated scalar RTL core. |
| run_hls.tcl | Vitis HLS script for synthesizing the scalar camta_unit. |
| scripts/compare_hls_solutions.py | Utility script for extracting and comparing HLS synthesis reports. |
The project assumes a Xilinx/AMD FPGA development environment with:
- Vitis HLS / Vitis;
- XRT runtime and development headers;
- a compatible Alveo platform, configured by default for xilinx_u55c_gen3x16_xdma_3_202210_1 in HW/Makefile;
- a C++17 compiler;
- Python 3 for the report comparison utility.
Before building on a server, load the corresponding Vitis and XRT environment scripts according to the local installation.
From the repository root:

    vitis_hls -f run_hls.tcl

The script synthesizes the scalar top function:

    set top_name "camta_unit"

By default, it targets a 4 ns clock period. The target FPGA part and clock period can be overridden through environment variables:

    export PART=xck26-sfvc784-2LV-c
    export CLOCK_PERIOD=4
    vitis_hls -f run_hls.tcl

The HLS script runs csynth_design and then attempts to execute scripts/compare_hls_solutions.py to summarize synthesis reports across available solutions.
From the HW/ directory:

    cd HW
    make build

Important Makefile settings:

    TARGET := hw
    PLATFORM ?= xilinx_u55c_gen3x16_xdma_3_202210_1
    KERNEL_NAME := camta
    KERNEL_FREQ := 250
    HLS_FILES := camta.cpp horner_core.cpp

The expected output binary is:

    HW/package.hw/kernels.xclbin

To clean generated files:

    make cleanall

Build the host application from SW/:

    cd SW
    make

Run all timing and functional tests:

    ./camta ../HW/package.hw/kernels.xclbin --mode all

Supported modes:

    ./camta ../HW/package.hw/kernels.xclbin --mode all
    ./camta ../HW/package.hw/kernels.xclbin --mode timing
    ./camta ../HW/package.hw/kernels.xclbin --mode functional

Equivalent short options are also supported:

    ./camta ../HW/package.hw/kernels.xclbin -all
    ./camta ../HW/package.hw/kernels.xclbin -timing
    ./camta ../HW/package.hw/kernels.xclbin -functional

The host executes one warm-up run and five measured runs. It reports detailed per-test timing, throughput, MSE, RMSE, MAE, maximum absolute error, and final summary tables.
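For reference, the four reported error metrics can be computed from a reference vector and the kernel output as in the sketch below. This is a generic formulation using the standard definitions; the struct ErrorStats and the function compute_error_stats are hypothetical names, and the host code in SW/camta.cpp may organize the calculation differently.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

struct ErrorStats {
    double mse;          // mean squared error
    double rmse;         // root mean squared error
    double mae;          // mean absolute error
    double max_abs_err;  // maximum absolute error
};

// Compares kernel outputs against a double-precision reference, element by element.
ErrorStats compute_error_stats(const std::vector<double>& ref,
                               const std::vector<double>& out) {
    assert(!ref.empty() && ref.size() == out.size());
    double sum_sq = 0.0, sum_abs = 0.0, max_abs = 0.0;
    for (std::size_t i = 0; i < ref.size(); ++i) {
        const double err = std::fabs(out[i] - ref[i]);
        sum_sq += err * err;
        sum_abs += err;
        max_abs = std::max(max_abs, err);
    }
    const double n = static_cast<double>(ref.size());
    const double mse = sum_sq / n;
    return {mse, std::sqrt(mse), sum_abs / n, max_abs};
}
```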
The host validates the following nonlinear functions using N = 10000 samples:
| Function | Evaluation range | Best configuration used in paper |
|---|---|---|
| GeLU | [-8, 8] | residual symmetry, Horner/Horner/Identity |
| tanh | [-4, 4] | odd symmetry, Horner/Horner/Const |
| sigmoid | [-8, 8] | complement symmetry, Horner/Horner/Const |
| Swish | [-8, 8] | residual symmetry, Horner/Horner/Identity |
| Softmax-assisted exp | [-8, 0] | Const/Horner/Horner |
The Softmax experiment is reported as CAMTA-assisted Softmax. CAMTA approximates only the exponential stage over shifted inputs in [-8, 0]; summation and normalization are performed by the host.
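The division of labor described above can be sketched as follows: the host shifts the logits by their maximum, passes the shifted values to an exponential approximator, and performs the sum and normalization itself. The function softmax_assisted is a hypothetical name, exp_reference is a software stand-in for the CAMTA exponential stage, and clamping the shifted input to -8 is an assumption made here to keep the argument inside the approximated range.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <functional>
#include <vector>

// Stand-in for the CAMTA exponential stage (exact exp, for illustration only).
double exp_reference(double v) { return std::exp(v); }

std::vector<double> softmax_assisted(const std::vector<double>& logits,
                                     const std::function<double(double)>& exp_approx) {
    assert(!logits.empty());
    const double max_logit = *std::max_element(logits.begin(), logits.end());
    std::vector<double> out(logits.size());
    double sum = 0.0;
    for (std::size_t i = 0; i < logits.size(); ++i) {
        // Shifted inputs are always <= 0; clamp at -8 (assumption) to stay in [-8, 0].
        const double shifted = std::max(logits[i] - max_logit, -8.0);
        out[i] = exp_approx(shifted);   // stage that CAMTA would approximate
        sum += out[i];                  // reduction performed by the host
    }
    for (double& v : out) v /= sum;     // normalization performed by the host
    return out;
}
```

Because the maximum logit always maps to a shifted input of 0, the sum is dominated by a term close to 1, which keeps the normalization well conditioned.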
The following configurations correspond to the best reported N = 10000 validation cases:
| Function | Range | L_left | L_right | Symmetry mode | Region modes | Degrees |
|---|---|---|---|---|---|---|
| GeLU | [-8, 8] | 1.50 | 3.00 | residual | H/H/I | 3/3/1 |
| tanh | [-4, 4] | 1.25 | 3.50 | odd | H/H/C | 3/3/0 |
| sigmoid | [-8, 8] | 2.50 | 4.50 | complement | H/H/C | 3/3/0 |
| Swish | [-8, 8] | 2.00 | 6.50 | residual | H/H/I | 3/3/1 |
| Softmax-assisted exp | [-8, 0] | -4.00 | -1.00 | none | C/H/H | 0/3/3 |
Legend:
- H: HORNER
- C: CONST
- I: IDENTITY
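The symmetry modes in the table are not spelled out in this README. One plausible reading, consistent with the positive thresholds used for the symmetric functions, is that the approximation is evaluated on |x| and mapped back using a mathematical identity of the target function. The sketch below illustrates that reading only; the enum Symmetry and the function eval_with_symmetry are hypothetical, and where the folding actually happens (host or hardware) is not stated here. The identities themselves are exact: tanh(-t) = -tanh(t), sigmoid(-t) = 1 - sigmoid(t), and f(-t) = f(t) - t for GeLU and Swish.

```cpp
#include <cassert>
#include <cmath>
#include <functional>

// Hypothetical symmetry selector mirroring the table's "Symmetry mode" column.
enum class Symmetry { NONE, ODD, COMPLEMENT, RESIDUAL };

// Evaluates f at x using only evaluations of f_pos on non-negative inputs
// (except in NONE mode, where x is passed through unchanged).
double eval_with_symmetry(Symmetry sym, double x,
                          const std::function<double(double)>& f_pos) {
    assert(static_cast<bool>(f_pos));
    if (sym == Symmetry::NONE || x >= 0.0) return f_pos(x);
    const double t = std::fabs(x);      // fold the negative input onto the positive side
    const double y = f_pos(t);
    switch (sym) {
        case Symmetry::ODD:        return -y;        // f(-t) = -f(t)
        case Symmetry::COMPLEMENT: return 1.0 - y;   // f(-t) = 1 - f(t)
        case Symmetry::RESIDUAL:   return y - t;     // f(-t) = f(t) - t
        default:                   return y;
    }
}
```

Under this reading, only the non-negative half of the range needs fitted coefficients, which would explain why three regions suffice for functions evaluated over [-8, 8].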
The following results correspond to the final N = 10000 validation run on the Alveo flow:
| Function | Kernel time [us] | Throughput [MSamples/s] | RMSE | MaxAbsErr |
|---|---|---|---|---|
| GeLU | 183.59 | 54.47 | 0.00152 | 0.00501 |
| tanh | 185.74 | 53.84 | 0.00162 | 0.00582 |
| sigmoid | 185.89 | 53.79 | 0.00207 | 0.00803 |
| Swish | 185.42 | 53.93 | 0.00389 | 0.01344 |
| Softmax-assisted | 185.24 | 53.98 | 3.60e-6 | 1.08e-5 |
These timings correspond to the scalar FPGA wrapper execution and include kernel launch and memory-mapped execution overhead. They should not be interpreted as the intrinsic datapath latency of the CAMTA scalar unit.
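The throughput column follows directly from the sample count and the kernel time. The helper below, with the hypothetical name throughput_msamples_per_s, shows the arithmetic: samples per microsecond is numerically equal to MSamples/s.

```cpp
#include <cassert>
#include <cstddef>

// Throughput in MSamples/s from a sample count and a kernel time in microseconds.
double throughput_msamples_per_s(std::size_t n_samples, double kernel_time_us) {
    assert(kernel_time_us > 0.0);
    return static_cast<double>(n_samples) / kernel_time_us;
}
```

For the GeLU row, 10000 samples in 183.59 us gives about 54.47 MSamples/s, matching the table.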
The scalar HLS synthesis result used in the paper reports:
| Metric | CAMTA scalar unit |
|---|---|
| Target clock | 4.00 ns |
| Estimated clock | 2.465 ns |
| Datapath latency | 11 cycles |
| BRAM | 0 |
| DSP | 3 |
| FF | 802 |
| LUT | 1756 |
A CORDIC exponential core used as a comparison point reports 2 DSPs, 757 FFs, and 2676 LUTs. That comparison should be interpreted carefully because the CORDIC result corresponds only to an exponential evaluator, while CAMTA supports multiple nonlinear functions through runtime-configurable thresholds, coefficients, degrees, and region modes.
The manually isolated scalar RTL core was synthesized using a TSMC 65 nm standard-cell flow at 250 MHz. The HLS-generated interface and wrapper logic were excluded so that the ASIC evaluation reflects only the arithmetic and control core.
| Metric | Value |
|---|---|
| Clock period | 4.0 ns |
| Frequency | 250 MHz |
| Critical path | 2.62 ns |
| Total cell area | 6632.40 um^2 |
| Combinational area | 4555.60 um^2 |
| Noncombinational area | 2076.80 um^2 |
| Total cells | 1582 |
| Combinational cells | 1343 |
| Sequential cells | 236 |
| Horner core area | 4753.60 um^2 |
| Region-selection area | 164.00 um^2 |
| Mode-bypass area | 83.20 um^2 |
| Dynamic power | 1.1743 mW |
| Leakage power | 189.16 uW |
| Total power | 1.3634 mW |
The Horner core dominates the implementation area, accounting for about 72% of the total cell area (4753.60 of 6632.40 um^2), while the region-selection and mode-bypass logic together account for under 4% (247.20 um^2). This supports the interpretation that most of the hardware cost is associated with the shared arithmetic datapath, and that the additional region configurability introduces limited area overhead relative to the polynomial evaluator.
The src/*.v files provide a scalar RTL version of the CAMTA core:
| File | Description |
|---|---|
| camta_unit_core.v | Top-level scalar RTL core with start/done handshake. |
| camta_horner_seq.v | Sequential Horner evaluator using a Q10.14 accumulator. |
| camta_region_select.v | Region decoder for L_left and L_right. |
| camta_mode_bypass.v | Output mux for HORNER, CONST, ZERO, and IDENTITY modes. |
A simple Verilog testbench is provided in tb/tb_camta_unit_core.v.
- CAMTA is a configurable approximation unit, not a function-specific exact evaluator.
- Approximation quality depends on the selected coefficients, thresholds, degrees, modes, and input range.
- The Softmax result is CAMTA-assisted; CAMTA approximates the exponential stage, while the host performs reduction and normalization.
- The run_hls.tcl script synthesizes the scalar camta_unit, not the Alveo memory wrapper.
- The HW/Makefile builds the Alveo kernel wrapper named camta.
- The ASIC-oriented synthesis results correspond to the isolated scalar RTL core, excluding HLS-generated AXI/interface logic.
Planned extensions include:
- integrating CAMTA into a RISC-V-based accelerator architecture;
- adding input gating to reduce unnecessary Horner-core switching when CONST, ZERO, or IDENTITY modes are selected;
- improving backend timing robustness and hold-time closure for ASIC implementation;
- refining coefficient-generation and quantization-aware fitting flows for new nonlinear functions;
- evaluating multi-lane replication strategies for throughput-oriented accelerator designs.
This repository is distributed under the license included in LICENSE.