@@ -55,6 +55,30 @@ Simple multiply-accumulate (MAC) units that form the foundation of systolic arra
 - **Pipeline stages**: Each PE includes RegNext for horizontal and vertical data forwarding
 - Optimized for neural network dense layer operations with configurable matrix sizes
 
+##### 2.3.1 GEMM FMA (Fused Multiply-Add) Unit
+- Enhanced matrix multiplication using fused multiply-add operations for improved precision and performance
+- **GEMMFMATotal**: Complete matrix multiplication module supporting both fixed-point and floating-point operations
+  - Configurable matrix dimensions (m × k × n)
+  - Uses FMA units for efficient computation
+  - Decoupled interfaces for input and output data flow
+- **GEMMFMASingle**: Single-row matrix multiplication for streaming applications
+  - Processes one row at a time to reduce memory requirements
+  - Returns results row by row
+  - Supports a configurable PE count for parallel processing
+- **MultiFMA**: Multiple FMA units operating in parallel for increased throughput
+  - Configurable number of FMA units (peCount)
+  - Supports both fixed-point and floating-point operations
+  - Pipelined design for continuous data processing
+- **GEMMSingleQueue**: Queued single-row matrix multiplication with streaming support
+  - Maintains an internal queue for multiple matrix operations
+  - Supports flushing pending operations for dynamic workloads
+  - Optimized for attention-mechanism computations
+- **Key features**:
+  - Improved numerical precision through FMA operations
+  - Configurable data types (GEMMDataType.Fxp, GEMMDataType.Fp32, GEMMDataType.Fp64)
+  - Pipelined architecture for high throughput
+  - Memory-efficient designs for different use cases
+
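The two GEMM flavours above can be sketched as a software reference model. This is a hedged illustration of the dataflow only, not the Chisel implementation; the function names `gemm_fma_total` and `gemm_fma_single` are hypothetical, and the per-term multiply-add is what the hardware fuses into a single-rounding FMA.

```python
def gemm_fma_total(a, b):
    """Full m x k by k x n multiply. Each accumulation step is a
    multiply-add; the hardware fuses it into one FMA, avoiding an
    intermediate rounding per term."""
    m, k, n = len(a), len(b), len(b[0])
    c = [[0.0] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            acc = 0.0
            for p in range(k):
                acc = a[i][p] * b[p][j] + acc  # one FMA per term in hardware
            c[i][j] = acc
    return c

def gemm_fma_single(a_row, b):
    """Single-row variant mirroring the GEMMFMASingle description:
    streams one row of A at a time, so only one result row is live."""
    k, n = len(b), len(b[0])
    return [sum(a_row[p] * b[p][j] for p in range(k)) for j in range(n)]
```

Calling `gemm_fma_single` once per row of A reproduces `gemm_fma_total` row by row, which is the memory-saving trade-off the single-row module makes.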
 ##### 2.4 GEMV (General Matrix-Vector Multiply) Unit
 - Reduction tree algorithm for vector-matrix multiplication using VecDotVec components
 - **VecDotVec**:
@@ -78,6 +102,36 @@ Simple multiply-accumulate (MAC) units that form the foundation of systolic arra
 - Implements numerically stable softmax with the max-subtraction technique to prevent overflow
 - Supports a configurable array size (default 4 elements)
 
+##### 2.5.1 Attention Mechanism Components
+- **AttnScores**: Attention score computation module for transformer models
+  - **QKGenWithReg**: Generates Query and Key matrices with internal register storage
+    - Uses two GEMMFMATotal units to compute the Q and K matrices simultaneously
+    - Stores the computed Q and K in internal registers for reuse
+    - Input: inputToken (m × k), weightQ (k × n), weightK (k × n)
+    - Output: Query (m × n), Key (m × n)
+  - **QKGen**: Generates Query and Key matrices without internal storage
+    - Uses two GEMMFMATotal units for Q and K computation
+    - Streams results directly without internal storage
+    - Controlled by a state machine (idle, gen, done states)
+  - **QKMulFMA**: Computes attention scores via Q×K^T multiplication using FMA units
+    - Produces single-row attention scores for memory efficiency
+    - Uses FMA units for improved precision
+    - Streaming input/output with Decoupled interfaces
+- **OutValue**: Output value computation using attention scores and the value matrix
+  - **OutValueTotal**: Complete output computation module
+    - Multiplies attention scores with the value matrix (Scores × Value)
+    - Computes the full attention output in one operation
+    - Uses GEMM-style computation for the attention mechanism
+  - **OutValueSingle**: Single-row output computation for streaming applications
+    - Processes one row of attention scores at a time
+    - Memory efficient for large attention matrices
+    - Maintains state between rows for continuous processing
+- **Key advantages**:
+  - Optimized for transformer attention mechanisms
+  - Memory-efficient designs with configurable storage options
+  - Streaming interfaces for continuous processing
+  - FMA-based computation for improved numerical accuracy
+
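The attention pipeline described above (Q/K generation, row-wise score computation, then Scores × Value) can be summarized with a small software reference model. This is an illustrative sketch, not the hardware modules themselves; the function names are hypothetical, and the max-subtraction softmax follows the stable-softmax technique mentioned in section 2.5.

```python
import math

def matmul(a, b):
    """Plain reference GEMM used by the Q/K generation step."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def qk_gen(input_token, weight_q, weight_k):
    """QKGen-style step: Q = X * Wq and K = X * Wk
    (computed by two GEMM units in parallel in hardware)."""
    return matmul(input_token, weight_q), matmul(input_token, weight_k)

def attn_out_single(q_row, key, value):
    """One row of scores (q . K^T, as in QKMulFMA) followed by one
    output row (as in OutValueSingle), with max-subtraction softmax."""
    scores = [sum(qe * ke for qe, ke in zip(q_row, k_row)) for k_row in key]
    mx = max(scores)
    exps = [math.exp(s - mx) for s in scores]  # numerically stable
    total = sum(exps)
    probs = [e / total for e in exps]
    n = len(value[0])
    return [sum(p * value[r][j] for r, p in enumerate(probs)) for j in range(n)]
```

Processing one Q row at a time is exactly the memory trade-off the single-row modules make: only one row of scores and one output row are ever buffered.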
 ##### 2.6 Obsolete Components
 - **SPMM (Sparse-Dense Matrix Multiplication)**: Performs sparse-dense matrix multiplication using MAC units, with mask vectors to select the needed elements
   - Uses mask vectors to select specific elements from the input matrices
@@ -184,6 +238,40 @@ sbt "testOnly vitisrtlkernel.VitisRTLKernelTest"
   - Output: Hardware produces 16 result elements (the 4×4 result matrix C)
   - Validation: Each output element is compared against the software-calculated value C[i][j] = Σ A[i][k] * B[k][j]
 - **Test Configuration**: Supports different matrix sizes (n parameter), fixed-point configurations (I integer bits, F fractional bits), and data types
+- **GEMM FMA Tests**:
+  - **GEMMFMATotal Test**: Validates complete matrix multiplication using FMA operations
+    - Tests configurable matrix dimensions (m × k × n)
+    - Verifies both fixed-point and floating-point operations
+    - Uses a fork-join testing methodology for concurrent input/output validation
+  - **GEMMFMASingle Test**: Validates single-row matrix multiplication
+    - Tests row-by-row processing for streaming applications
+    - Verifies proper indexing and continuous processing
+    - Supports a configurable PE count for different parallelization levels
+  - **MultiFMA Test**: Validates multiple FMA units operating in parallel
+    - Tests a configurable number of FMA units (peCount)
+    - Verifies pipelined processing of different matrix blocks
+    - Checks the precision improvements from FMA operations
+  - **GEMMSingleQueue Test**: Validates queued matrix operations
+    - Tests internal queue management for multiple operations
+    - Verifies the flush functionality for dynamic workloads
+    - Supports attention-mechanism computations with streaming inputs
+- **AttnScores Tests**:
+  - **QKGenWithReg Test**: Validates Query and Key matrix generation with internal storage
+    - Tests simultaneous Q and K computation using two GEMMFMATotal units
+    - Verifies internal register storage and reuse
+  - **QKGen Test**: Validates Query and Key matrix generation without storage
+    - Tests state machine control (idle, gen, done states)
+    - Verifies direct streaming of results
+  - **QKMulFMA Test**: Validates attention score computation via Q×K^T
+    - Tests single-row attention computation for memory efficiency
+    - Verifies FMA-based precision improvements
+- **OutValue Tests**:
+  - **OutValueTotal Test**: Validates complete output computation
+    - Tests attention score × value matrix multiplication
+    - Verifies the full attention output computation
+  - **OutValueSingle Test**: Validates single-row output computation
+    - Tests memory-efficient streaming of large attention matrices
+    - Verifies state maintenance between rows
 - **GEMV Test**: Vector-matrix multiplication validation
   - Tests the reduction tree algorithm implementation
   - Verifies the `log(n) + 2` cycle timing
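All of the tests above follow the same golden-model pattern: drive the hardware with known inputs and compare every output element against a software-computed reference (C[i][j] = Σ A[i][k] * B[k][j]). A minimal sketch of that checking logic, with illustrative names rather than the real ChiselTest harness:

```python
def golden_gemm(a, b):
    """Software reference: C[i][j] = sum over k of A[i][k] * B[k][j]."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def check_against_golden(hw_outputs, a, b, tol=0.0):
    """Compare a flat, row-major stream of hardware results (as the 4x4
    GEMM test collects them) element-wise against the reference model."""
    expected = [x for row in golden_gemm(a, b) for x in row]
    assert len(hw_outputs) == len(expected), "output count mismatch"
    for idx, (got, want) in enumerate(zip(hw_outputs, expected)):
        assert abs(got - want) <= tol, f"mismatch at element {idx}: {got} != {want}"
    return True
```

For fixed-point configurations the `tol` parameter would absorb quantization error; for floating-point FMA results it can stay at zero when the reference accumulates in the same order.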
@@ -446,6 +534,40 @@ sbt "testOnly vitisrtlkernel.VitisRTLKernelTest"
   - 输出: 硬件产生16个结果元素(4×4结果矩阵C)
   - 验证: 每个输出元素与软件计算值 C[i][j] = Σ A[i][k] * B[k][j] 进行比较
 - **测试配置**: 支持不同矩阵大小(n参数)、定点配置(I整数位,F小数位)和数据类型
+- **GEMM FMA测试**:
+  - **GEMMFMATotal测试**: 验证使用FMA操作的完整矩阵乘法
+    - 测试可配置矩阵维度 (m × k × n)
+    - 验证定点和浮点操作
+    - 使用fork-join测试方法进行并发输入/输出验证
+  - **GEMMFMASingle测试**: 验证单行矩阵乘法
+    - 测试流式应用的逐行处理
+    - 验证正确的索引和连续处理
+    - 支持不同并行化级别的可配置PE数量
+  - **MultiFMA测试**: 验证并行运行的多个FMA单元
+    - 测试可配置的FMA单元数量 (peCount)
+    - 验证不同矩阵块的流水线处理
+    - 确认通过FMA操作获得的精度提升
+  - **GEMMSingleQueue测试**: 验证队列化矩阵操作
+    - 测试多个操作的内部队列管理
+    - 验证用于动态工作负载的刷新功能
+    - 支持带流式输入的注意力机制计算
+- **AttnScores测试**:
+  - **QKGenWithReg测试**: 验证带内部存储的查询和键矩阵生成
+    - 测试使用两个GEMMFMATotal单元同时进行Q和K计算
+    - 验证内部寄存器存储和重用
+  - **QKGen测试**: 验证无内部存储的查询和键矩阵生成
+    - 测试状态机控制 (idle, gen, done 状态)
+    - 验证结果的直接流式输出
+  - **QKMulFMA测试**: 验证通过 Q×K^T 进行的注意力分数计算
+    - 测试提高内存效率的单行注意力计算
+    - 验证基于FMA的精度改进
+- **OutValue测试**:
+  - **OutValueTotal测试**: 验证完整输出计算
+    - 测试注意力分数×值矩阵乘法
+    - 验证完整的注意力输出计算
+  - **OutValueSingle测试**: 验证单行输出计算
+    - 测试大注意力矩阵的内存高效流式处理
+    - 验证行间状态维持
 - **GEMV测试**: 向量-矩阵乘法验证
   - 测试归约树算法实现
   - 验证 `log(n) + 2` 周期时序
@@ -687,6 +809,30 @@ make clean_emu_vpp
 - **流水线阶段**: 每个PE包含RegNext用于水平和垂直数据转发
 - 针对神经网络密集层操作优化,具有可配置矩阵大小
 
+##### 2.3.1 GEMM FMA (融合乘加) 单元
+- 使用融合乘加操作增强矩阵乘法,提高精度和性能
+- **GEMMFMATotal**: 支持定点和浮点操作的完整矩阵乘法模块
+  - 可配置矩阵维度 (m × k × n)
+  - 使用FMA单元进行高效计算
+  - 输入和输出数据流的解耦接口
+- **GEMMFMASingle**: 适用于流式应用的单行矩阵乘法
+  - 逐行处理以减少内存需求
+  - 按行返回结果
+  - 支持可配置PE数量以实现并行处理
+- **MultiFMA**: 并行运行的多个FMA单元,以提高吞吐量
+  - 可配置FMA单元数量 (peCount)
+  - 支持定点和浮点操作
+  - 流水线设计用于连续数据处理
+- **GEMMSingleQueue**: 支持流式处理的队列化单行矩阵乘法
+  - 为多个矩阵运算维护内部队列
+  - 支持刷新待处理操作以应对动态工作负载
+  - 针对注意力机制计算优化
+- **主要特点**:
+  - 通过FMA操作提高数值精度
+  - 可配置数据类型 (GEMMDataType.Fxp, GEMMDataType.Fp32, GEMMDataType.Fp64)
+  - 流水线架构实现高吞吐量
+  - 针对不同用例的内存高效设计
+
 ##### 2.4 GEMV (通用矩阵-向量乘法) 单元
 - 使用VecDotVec组件的归约树算法进行向量-矩阵乘法
 - **VecDotVec**:
@@ -710,6 +856,36 @@ make clean_emu_vpp
 - 实现数值稳定的softmax,采用最大值减法技巧防止溢出
 - 支持可配置数组大小(默认4个元素)
 
+##### 2.5.1 注意力机制组件
+- **AttnScores**: Transformer模型的注意力分数计算模块
+  - **QKGenWithReg**: 生成带内部寄存器存储的查询和键矩阵
+    - 使用两个GEMMFMATotal单元同时计算Q和K矩阵
+    - 在内部寄存器中存储计算出的Q和K以供重用
+    - 输入: inputToken (m × k), weightQ (k × n), weightK (k × n)
+    - 输出: Query (m × n), Key (m × n)
+  - **QKGen**: 生成无内部存储的查询和键矩阵
+    - 使用两个GEMMFMATotal单元计算Q和K
+    - 直接流式输出结果,无内部存储
+    - 由状态机控制 (idle, gen, done 状态)
+  - **QKMulFMA**: 使用FMA单元通过 Q×K^T 乘法计算注意力分数
+    - 生成单行注意力分数以提高内存效率
+    - 使用FMA单元提高精度
+    - 使用Decoupled接口的流式输入/输出
+- **OutValue**: 使用注意力分数和值矩阵计算输出值
+  - **OutValueTotal**: 完整输出计算模块
+    - 将注意力分数与值矩阵相乘 (Scores × Value)
+    - 一次操作计算完整的注意力输出
+    - 使用GEMM风格的计算实现注意力机制
+  - **OutValueSingle**: 适用于流式应用的单行输出计算
+    - 每次处理一行注意力分数
+    - 对大注意力矩阵内存高效
+    - 在行间维持状态以进行连续处理
+- **主要优势**:
+  - 针对Transformer模型注意力机制优化
+  - 具有可配置存储选项的内存高效设计
+  - 用于连续处理的流式接口
+  - 基于FMA的计算以提高数值精度
+
 ##### 2.6 已废弃组件
 - **SPMM (稀疏-稠密矩阵乘法)**: 使用MAC单元和掩码向量执行稀疏-稠密矩阵乘法,通过掩码选择所需元素
   - 使用掩码向量从输入矩阵中选择特定元素