Commit af015e5
renew readme with new pull request
1 parent ae1b604 commit af015e5

1 file changed: README.md (+176 -0 lines)
@@ -55,6 +55,30 @@ Simple multiply-accumulate (MAC) units that form the foundation of systolic arrays
 - **Pipeline stages**: Each PE includes RegNext for horizontal and vertical data forwarding
 - Optimized for neural network dense layer operations with configurable matrix sizes
 
+##### 2.3.1 GEMM FMA (Fused Multiply-Add) Unit
+- Enhanced matrix multiplication using Fused Multiply-Add operations for improved precision and performance
+- **GEMMFMATotal**: Complete matrix multiplication module supporting both fixed-point and floating-point operations
+  - Configurable matrix dimensions (m × k × n)
+  - Uses FMA units for efficient computation
+  - Decoupled interfaces for input and output data flow
+- **GEMMFMASingle**: Single-row matrix multiplication for streaming applications
+  - Processes one row at a time to reduce memory requirements
+  - Returns results in row-by-row fashion
+  - Supports configurable PE count for parallel processing
+- **MultiFMA**: Multiple FMA units operating in parallel for increased throughput
+  - Configurable number of FMA units (peCount)
+  - Supports both fixed-point and floating-point operations
+  - Pipelined design for continuous data processing
+- **GEMMSingleQueue**: Queued single-row matrix multiplication with streaming support
+  - Maintains internal queue for multiple matrix operations
+  - Supports flushing of operations for dynamic workloads
+  - Optimized for attention mechanism computations
+- **Key features**:
+  - Improved numerical precision through FMA operations
+  - Configurable data types (GEMMDataType.Fxp, GEMMDataType.Fp32, GEMMDataType.Fp64)
+  - Pipelined architecture for high throughput
+  - Memory-efficient designs for different use cases
+
 ##### 2.4 GEMV (General Matrix-Vector Multiply) Unit
 - Reduction tree algorithm for vector-matrix multiplication using VecDotVec components
 - **VecDotVec**:
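The FMA-based GEMM flow described in the section above can be illustrated with a small Python reference model (a software stand-in for the Chisel hardware; function names are illustrative, not the modules' actual interfaces). `gemm_fma_total` mirrors the all-at-once GEMMFMATotal style, while `gemm_fma_single` mirrors the row-streaming GEMMFMASingle style:

```python
import math

def fma(a, b, c):
    """Fused multiply-add: a*b + c, rounded once when math.fma is available
    (Python 3.13+); models the single-rounding behavior of a hardware FMA unit."""
    return math.fma(a, b, c) if hasattr(math, "fma") else a * b + c

def gemm_fma_total(A, B):
    """Full (m x k)(k x n) matrix multiply built from chained FMA operations."""
    m, k, n = len(A), len(B), len(B[0])
    C = [[0.0] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            acc = 0.0
            for p in range(k):
                acc = fma(A[i][p], B[p][j], acc)  # one FMA per accumulation step
            C[i][j] = acc
    return C

def gemm_fma_single(A, B):
    """Row-streaming variant: yields one result row at a time, so only a
    single row of C is ever live (the memory-saving idea behind GEMMFMASingle)."""
    for row in A:
        yield [sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
```

The accumulator is updated with one FMA per step, which is where the hardware's precision benefit comes from: the product is not rounded before being added.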
@@ -78,6 +102,36 @@ Simple multiply-accumulate (MAC) units that form the foundation of systolic arrays
 - Implements numerically stable softmax with max-subtraction technique to prevent overflow
 - Supports configurable array sizes (default 4 elements, configurable)
 
+##### 2.5.1 Attention Mechanism Components
+- **AttnScores**: Attention score computation module for transformer models
+  - **QKGenWithReg**: Generates Query and Key matrices with internal register storage
+    - Uses two GEMMFMATotal units to compute Q and K matrices simultaneously
+    - Stores computed Q and K in internal registers for reuse
+    - Input: inputToken (m × k), weightQ (k × n), weightK (k × n)
+    - Output: Query (m × n), Key (m × n)
+  - **QKGen**: Generates Query and Key matrices without internal storage
+    - Uses two GEMMFMATotal units for Q and K computation
+    - Direct streaming of results without internal storage
+    - State machine controlled (idle, gen, done states)
+  - **QKMulFMA**: Computes attention scores via Q×K^T multiplication using FMA units
+    - Single-row attention scores for memory efficiency
+    - Uses FMA units for improved precision
+    - Streaming input/output with Decoupled interfaces
+- **OutValue**: Output value computation using attention scores and value matrix
+  - **OutValueTotal**: Complete output computation module
+    - Multiplies attention scores with value matrix (Scores × Value)
+    - Computes full attention output in one operation
+    - Uses GEMM-style computation for attention mechanism
+  - **OutValueSingle**: Single-row output computation for streaming applications
+    - Processes one row of attention scores at a time
+    - Memory efficient for large attention matrices
+    - Maintains state between rows for continuous processing
+- **Key advantages**:
+  - Optimized for transformer model attention mechanisms
+  - Memory-efficient designs with configurable storage options
+  - Streaming interfaces for continuous processing
+  - FMA-based computation for improved numerical accuracy
+
 ##### 2.6 Obsolete Components
 - **SPMM (Sparse-Dense Matrix Multiplication)**: Performs sparse-dense matrix multiplication using MAC units with mask vectors to select needed elements
 - Uses mask vectors to select specific numbers from input matrices
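The attention path added above (QKGen → QKMulFMA → OutValue) computes one row of scores q·kⱼ, normalizes it with the max-subtraction softmax mentioned in the surrounding context, and weights the value matrix. A Python sketch of one streamed row (illustrative names only, not the Chisel interfaces):

```python
import math

def stable_softmax(xs):
    """Max-subtraction softmax: subtracting max(xs) before exp prevents overflow,
    the same stability trick described for the Softmax unit."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention_row(q_row, K, V):
    """Single-row attention, mirroring the QKMulFMA / OutValueSingle streaming
    style: only one row of the score matrix is live at a time."""
    scores = [sum(q * k for q, k in zip(q_row, k_row)) for k_row in K]  # one Q×K^T row
    weights = stable_softmax(scores)
    n = len(V[0])
    return [sum(w * v_row[j] for w, v_row in zip(weights, V)) for j in range(n)]
```

Processing row by row is what keeps OutValueSingle memory-efficient: the full m × m score matrix never has to be stored.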
@@ -184,6 +238,40 @@ sbt "testOnly vitisrtlkernel.VitisRTLKernelTest"
 - Output: Hardware produces 16 result elements (4×4 result matrix C)
 - Validation: Each output element compared against software-calculated value C[i][j] = Σ A[i][k] * B[k][j]
 - **Test Configuration**: Supports different matrix sizes (n parameter), fixed-point configurations (I integer bits, F fractional bits), and data types
+- **GEMM FMA Tests**:
+  - **GEMMFMATotal Test**: Validates complete matrix multiplication using FMA operations
+    - Tests configurable matrix dimensions (m × k × n)
+    - Verifies both fixed-point and floating-point operations
+    - Uses fork-join testing methodology for concurrent input/output validation
+  - **GEMMFMASingle Test**: Validates single-row matrix multiplication
+    - Tests row-by-row processing for streaming applications
+    - Verifies proper indexing and continuous processing
+    - Supports configurable PE count for different parallelization levels
+  - **MultiFMA Test**: Validates multiple FMA units operating in parallel
+    - Tests configurable number of FMA units (peCount)
+    - Verifies pipelined processing of different matrix blocks
+    - Ensures precision improvements through FMA operations
+  - **GEMMSingleQueue Test**: Validates queued matrix operations
+    - Tests internal queue management for multiple operations
+    - Verifies flush functionality for dynamic workloads
+    - Supports attention mechanism computations with streaming inputs
+- **AttnScores Tests**:
+  - **QKGenWithReg Test**: Validates Query and Key matrix generation with internal storage
+    - Tests simultaneous Q and K computation using two GEMMFMATotal units
+    - Verifies internal register storage and reuse
+  - **QKGen Test**: Validates Query and Key matrix generation without storage
+    - Tests state machine control (idle, gen, done states)
+    - Verifies direct streaming of results
+  - **QKMulFMA Test**: Validates attention score computation via Q×K^T
+    - Tests single-row attention computation for memory efficiency
+    - Verifies FMA-based precision improvements
+- **OutValue Tests**:
+  - **OutValueTotal Test**: Validates complete output computation
+    - Tests attention score × value matrix multiplication
+    - Verifies full attention output computation
+  - **OutValueSingle Test**: Validates single-row output computation
+    - Tests memory-efficient streaming of large attention matrices
+    - Verifies state maintenance between rows
 - **GEMV Test**: Vector-matrix multiplication validation
 - Tests reduction tree algorithm implementation
 - Verifies `log(n) + 2` cycle timing
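The element-wise validation described in this hunk (each hardware output checked against C[i][j] = Σ A[i][k] * B[k][j], under an (I, F) fixed-point configuration) can be sketched as a golden-model comparison. The helper names below are hypothetical, not the test bench's actual code:

```python
def to_fixed(x, F):
    """Quantize a real value to an integer with F fractional bits."""
    return round(x * (1 << F))

def from_fixed(v, F):
    """Convert a fixed-point integer back to a real value."""
    return v / (1 << F)

def expected_c(A, B):
    """Software golden model: C[i][j] = sum_k A[i][k] * B[k][j]."""
    k = len(B)
    return [[sum(A[i][p] * B[p][j] for p in range(k))
             for j in range(len(B[0]))] for i in range(len(A))]

def check_fixed_point(A, B, hw_result, F, tol=None):
    """Compare hardware output to the golden model within quantization error:
    by default, one LSB per accumulation step."""
    tol = tol if tol is not None else 1.0 / (1 << F)
    golden = expected_c(A, B)
    for grow, hrow in zip(golden, hw_result):
        for g, h in zip(grow, hrow):
            assert abs(g - h) <= tol * len(B), f"mismatch: {h} vs {g}"
```

The tolerance scales with the reduction length k because each fixed-point accumulation step can contribute up to one LSB of rounding error.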
@@ -446,6 +534,40 @@ sbt "testOnly vitisrtlkernel.VitisRTLKernelTest"
 - Output: the hardware produces 16 result elements (the 4×4 result matrix C)
 - Validation: each output element is compared against the software-calculated value C[i][j] = Σ A[i][k] * B[k][j]
 - **Test configuration**: supports different matrix sizes (n parameter), fixed-point configurations (I integer bits, F fractional bits), and data types
+- **GEMM FMA tests**:
+  - **GEMMFMATotal test**: validates complete matrix multiplication using FMA operations
+    - Tests configurable matrix dimensions (m × k × n)
+    - Verifies fixed-point and floating-point operations
+    - Uses a fork-join testing methodology for concurrent input/output validation
+  - **GEMMFMASingle test**: validates single-row matrix multiplication
+    - Tests row-by-row processing for streaming applications
+    - Verifies proper indexing and continuous processing
+    - Supports a configurable PE count for different parallelization levels
+  - **MultiFMA test**: validates multiple FMA units running in parallel
+    - Tests a configurable number of FMA units (peCount)
+    - Verifies pipelined processing of different matrix blocks
+    - Ensures precision improvements through FMA operations
+  - **GEMMSingleQueue test**: validates queued matrix operations
+    - Tests internal queue management for multiple operations
+    - Verifies the flush functionality for dynamic workloads
+    - Supports attention mechanism computations with streaming inputs
+- **AttnScores tests**:
+  - **QKGenWithReg test**: validates Query and Key matrix generation with internal storage
+    - Tests simultaneous Q and K computation using two GEMMFMATotal units
+    - Verifies internal register storage and reuse
+  - **QKGen test**: validates Query and Key matrix generation without storage
+    - Tests state machine control (idle, gen, done states)
+    - Verifies direct streaming of results
+  - **QKMulFMA test**: validates attention score computation via Q×K^T
+    - Tests single-row attention computation for memory efficiency
+    - Verifies FMA-based precision improvements
+- **OutValue tests**:
+  - **OutValueTotal test**: validates complete output computation
+    - Tests attention score × value matrix multiplication
+    - Verifies full attention output computation
+  - **OutValueSingle test**: validates single-row output computation
+    - Tests memory-efficient streaming of large attention matrices
+    - Verifies state maintenance between rows
 - **GEMV test**: vector-matrix multiplication validation
 - Tests the reduction tree algorithm implementation
 - Verifies `log(n) + 2` cycle timing
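The `log(n) + 2` cycle count checked by the GEMV test is consistent with a pairwise reduction tree: one multiply stage, ceil(log2(n)) adder levels, and an output stage. A Python sketch of the tree (helper names are illustrative, not the VecDotVec API):

```python
def vec_dot_vec_tree(a, b):
    """Dot product via a pairwise reduction tree, in the style of VecDotVec.
    Returns (result, adder_depth); depth == ceil(log2(n)) for n products."""
    level = [x * y for x, y in zip(a, b)]  # all multiplies happen in parallel
    depth = 0
    while len(level) > 1:
        # sum adjacent pairs; one adder level corresponds to one cycle
        level = [level[i] + level[i + 1] for i in range(0, len(level) - 1, 2)] \
                + ([level[-1]] if len(level) % 2 else [])
        depth += 1
    return level[0], depth

def gemv(A, x):
    """Matrix-vector product: one reduction-tree dot product per matrix row."""
    return [vec_dot_vec_tree(row, x)[0] for row in A]
```

For n = 4 inputs the tree needs 2 adder levels, so with the multiply and output stages the latency matches the `log(n) + 2` figure.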
@@ -687,6 +809,30 @@ make clean_emu_vpp
 - **Pipeline stages**: each PE includes RegNext for horizontal and vertical data forwarding
 - Optimized for neural network dense layer operations, with configurable matrix sizes
 
+##### 2.3.1 GEMM FMA (Fused Multiply-Add) Unit
+- Enhanced matrix multiplication using fused multiply-add operations for improved precision and performance
+- **GEMMFMATotal**: complete matrix multiplication module supporting fixed-point and floating-point operations
+  - Configurable matrix dimensions (m × k × n)
+  - Uses FMA units for efficient computation
+  - Decoupled interfaces for input and output data flow
+- **GEMMFMASingle**: single-row matrix multiplication for streaming applications
+  - Processes data row by row to reduce memory requirements
+  - Returns results row by row
+  - Supports a configurable PE count for parallel processing
+- **MultiFMA**: multiple FMA units running in parallel for higher throughput
+  - Configurable number of FMA units (peCount)
+  - Supports fixed-point and floating-point operations
+  - Pipelined design for continuous data processing
+- **GEMMSingleQueue**: queued single-row matrix multiplication with streaming support
+  - Maintains an internal queue for multiple matrix operations
+  - Supports flushing of operations for dynamic workloads
+  - Optimized for attention mechanism computations
+- **Key features**:
+  - Improved numerical precision through FMA operations
+  - Configurable data types (GEMMDataType.Fxp, GEMMDataType.Fp32, GEMMDataType.Fp64)
+  - Pipelined architecture for high throughput
+  - Memory-efficient designs for different use cases
+
 ##### 2.4 GEMV (General Matrix-Vector Multiply) Unit
 - Reduction tree algorithm for vector-matrix multiplication using VecDotVec components
 - **VecDotVec**
@@ -710,6 +856,36 @@ make clean_emu_vpp
 - Implements numerically stable softmax with the max-subtraction technique to prevent overflow
 - Supports configurable array sizes (default 4 elements, configurable)
 
+##### 2.5.1 Attention Mechanism Components
+- **AttnScores**: attention score computation module for transformer models
+  - **QKGenWithReg**: generates Query and Key matrices with internal register storage
+    - Uses two GEMMFMATotal units to compute the Q and K matrices simultaneously
+    - Stores the computed Q and K in internal registers for reuse
+    - Input: inputToken (m × k), weightQ (k × n), weightK (k × n)
+    - Output: Query (m × n), Key (m × n)
+  - **QKGen**: generates Query and Key matrices without internal storage
+    - Uses two GEMMFMATotal units to compute Q and K
+    - Streams results directly, with no internal storage
+    - State machine controlled (idle, gen, done states)
+  - **QKMulFMA**: computes attention scores via Q×K^T multiplication using FMA units
+    - Single-row attention scores for memory efficiency
+    - Uses FMA units for improved precision
+    - Streaming input/output with Decoupled interfaces
+- **OutValue**: computes output values using attention scores and the value matrix
+  - **OutValueTotal**: complete output computation module
+    - Multiplies attention scores with the value matrix (Scores × Value)
+    - Computes the full attention output in one operation
+    - Uses GEMM-style computation for the attention mechanism
+  - **OutValueSingle**: single-row output computation for streaming applications
+    - Processes one row of attention scores at a time
+    - Memory efficient for large attention matrices
+    - Maintains state between rows for continuous processing
+- **Key advantages**:
+  - Optimized for transformer model attention mechanisms
+  - Memory-efficient designs with configurable storage options
+  - Streaming interfaces for continuous processing
+  - FMA-based computation for improved numerical accuracy
+
 ##### 2.6 Obsolete Components
 - **SPMM (Sparse-Dense Matrix Multiplication)**: performs sparse-dense matrix multiplication using MAC units with mask vectors to select needed elements
 - Uses mask vectors to select specific numbers from input matrices
