Releases: fla-org/flash-linear-attention

v0.4.2

12 Mar 14:45
ca910f8

What's Changed

  • [Misc] Use autopep8 to keep style by @zhiyuan1i in #697
  • [Misc] Reduce D2H/H2D Sync by @zhiyuan1i in #698
  • [Conv] Support mixed mode (Triton fwd and CUDA bwd) by @zhiyuan1i in #699
  • [Test] Add memory guard fixtures for CUDA memory safety testing by @zhiyuan1i in #700
  • [KDA] Add lowerbound gate function by @zhiyuan1i in #701
  • [KDA] Remove deprecated head_first in kda gate func by @zhiyuan1i in #702
  • [KDA] Speed up chunk_kda by introducing lowerbound gate by @zhiyuan1i in #703
  • [NSA] fix varlen related logic in cmp dkv kernel by @yibozhong in #707
  • [DPLR] Speed up DPLR by lowerbound gate by @zhiyuan1i in #709
  • [Misc] Enhance non-CUDA platform CI by @zhiyuan1i in #708
  • [Conv] Add non-contiguous tensor support for convolution ops by @zhiyuan1i in #712
  • [Conv] Fix corner case by @zhiyuan1i in #714
  • [Conv] Refactor dh0 into separate Triton kernel and add gradient tests by @zhiyuan1i in #717
  • [Misc] Wrap exp/log math ops with @triton.jit and enforce float32 pre… by @zhiyuan1i in #720
  • [Conv] Refactor into packages by @zhiyuan1i in #722
  • [Conv] Clean up duplicate chunk_indices calculation by @zhiyuan1i in #724
  • [DPLR] Add disable_recompute support for DPLR chunk op by @zhiyuan1i in #726
  • [CP] fuse fwd/bwd kernels and fix IMA in long context by @zhiyuan1i in #733
  • [KCP] add KCP.md; fix fp32 precision in M matrix chain; cleanup CP tests by @zhiyuan1i in #740
  • [Backend] Introduce dispatch system by @zhiyuan1i in #741
  • [Backend] Select from available backends based on priority order by @zhiyuan1i in #742
  • [MAMBA2] fix initialization for mamba2 by @mayank31398 in #739
  • [KDA] Refactor interface by @zhiyuan1i in #744
  • [Deltarule] Added intra-card context parallel optimization for KDA and GDN by @zhiyuan1i in #743
  • [Misc] centralize reference implementations in naive.py by @zhiyuan1i in #746
  • [Cache] Fix get_seq_length to return per-layer length by @zhiyuan1i in #748
  • fewer l2norm recompilations by @tyler-romero in #745
  • [OJA] Integrate Gated OJA Rule by @AwesomeSeq in #730
  • [Misc] remove redundant dot precision param in KDA recompute_w_u by @KevinZeng08 in #750
  • Fix shared memory guards for AMD RDNA GPUs (64KB shared mem) by @Gildoniel in #751
  • [Fix] Guard A_log and dt_bias re-initialization against loaded checkpoint values in GatedDeltaNet, Comba, and KDA by @ljxw88 in #754
  • [MAMBA-2] fix mamba-2 init for FSDP-2 with DTensors by @mayank31398 in #753
  • [Deltarule] Add cache for intra CP by @zhiyuan1i in #755
  • Update Windows warning to detect triton-windows by @erm14254 in #757
  • [Misc] Reduce recompile by @zhiyuan1i in #764
  • [PaTH] Prevent int32 index overflow for long sequences by @zhixuan-lin in #769
  • [Test] Add regression tests for cache seen-token bug (GH-766) by @zhiyuan1i in #768
  • [Misc] removes all @torch.jit.script decorators from the codebase by @zhiyuan1i in #767
  • [Conv] Fix invalid memory access by @zhiyuan1i in #774
  • [KDA][GDN] Support transpose_state_layout for [V,K] state memory layout by @zhiyuan1i in #776
  • [GDN] Enhance Triton 3.2 compatibility by @winglet0996 in #773
  • [Model] Unify cache function by @zhiyuan1i in #777

New Contributors

Full Changelog: v0.4.1...v0.4.2

🎄 v0.4.1

24 Dec 18:07
3a904f0

What's Changed

New Contributors

Full Changelog: v0.4.0...v0.4.1

v0.4.0

27 Oct 08:18

🧠 New Models

What's Changed

New Contributors

Full Changelog: v0.3.2...v0.4.0

v0.3.2

10 Sep 07:43
f7d95fa

📣 Highlights

Starting with this release, every time we ship a new version of flash-linear-attention, we will simultaneously publish fla-core: a minimal-dependency subset of the main repo that contains only the essentials.
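
For reference, below is a minimal sketch of how the slim package is expected to be used. It assumes fla-core is published on PyPI under that exact distribution name and exposes the same fla import namespace as the full flash-linear-attention package; the packaging details are assumptions drawn from the announcement above, not confirmed specifics.

# Minimal sketch (assumed packaging): install the slim distribution with
#   pip install fla-core
# and import it under the usual `fla` namespace.
import importlib.metadata

import fla  # assumed to expose the same namespace as flash-linear-attention

# The fla-core build is assumed to track the main release version.
print(importlib.metadata.version("fla-core"))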

🧠 New Models

What's Changed

Full Changelog: v0.3.1...v0.3.2

v0.3.1

26 Aug 20:28
80acaeb

What's Changed

New Contributors

Full Changelog: v0.3.0...v0.3.1

v0.3.0

14 Jul 09:49
17dd566

Highlights

🧠 New Models

We are excited to expand our model library with four powerful new architectures.

What's Changed

New Contributors

Full Changelog: v0.2.2...v0.3.0

v0.2.2

05 Jun 16:50

What's Changed

New Contributors

Full Changelog: v0.2.1...v0.2.2

v0.2.1

23 Apr 17:08
a670dff

Highlights

🚀 Performance Boost for DeltaNet

We've achieved a notable performance improvement for (Gated) DeltaNet models. The optimization focused on the fused LayerNormGated layer, particularly for small head dimensions, resulting in a 1.1x speedup.

Below are the benchmarks for 1B-parameter models, tested on 4k sequences in varlen mode on a single H100 GPU:

Model                TPS (K tokens/s)
Transformer++        53.8
DeltaNet (before)    48.6
DeltaNet (after)     54.0

The results were obtained by running:

python -m benchmarks.benchmark_training_throughput \
  --name delta_net \
  --batch_size 1 \
  --seq_len 32768 \
  --context_len 4096 \
  --varlen \
  --steps 512
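
For context, here is a minimal sketch of instantiating a DeltaNet model through the Hugging Face interface, assuming fla is installed and fla.models exposes DeltaNetConfig registered with transformers' Auto classes; the default config below is illustrative and not the exact 1B benchmark setup.

import torch
from transformers import AutoModelForCausalLM

from fla.models import DeltaNetConfig  # assumed to be registered with the Auto classes

config = DeltaNetConfig()  # defaults only; the benchmarked ~1B configuration may differ
model = AutoModelForCausalLM.from_config(config).to(torch.bfloat16).cuda()

input_ids = torch.randint(0, config.vocab_size, (1, 4096), device="cuda")
out = model(input_ids=input_ids, labels=input_ids)
print(out.loss)  # sanity-check the loss on a random batch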

What's Changed

New Contributors

Full Changelog: v0.2.0...v0.2.1

v0.2.0

11 Apr 20:31
6bfd5e6

What's Changed

  • [Attn] Delete V reduction & Enable 256 headdim tests by @yzhangcs in #273
  • [RWKV7] Add more elementwise kernels by @zhiyuan1i in #271
  • [CI] Remove cache and disable full test on Arc GPU by @zhiyuan1i in #274
  • [Fox] Add model/layer/kernel impls w/ varlen support by @yzhangcs in #275
  • [FoX] Simplify some tests and enhance tiling by @zhiyuan1i in #277
  • [Test] Remove some warnings and correct condition checks by @zhiyuan1i in #278
  • [CI] auto-cancel workflows on PR merge via concurrency group by @zhiyuan1i in #280
  • [Test] use tl.float16 instead of tl.bfloat16 by @zhiyuan1i in #281
  • [OP] replace tl.exp, tl.log, tl.log2 with fast ops when FLA_USE_FAST_OPS=1 by @zhiyuan1i in #276
  • [FoX] Rename fox to forgetting_attn by @yzhangcs in #282
  • [DeltaNet] WY repr speedup by @yzhangcs in #279
  • [README] Add --no-use-pep517 flag for faster installation by @zhiyuan1i in #286
  • [FoX] Skip test D>128 on RTX4090 by @zhiyuan1i in #287
  • [FoX] Test different forget gate initialization ranges by @zhixuan-lin in #291
  • [FoX] Fix class inheritance for ForgettingTransformerForCausalLM by @zhixuan-lin in #293
  • [CI] use latest stable triton by @zhiyuan1i in #294
  • [Triton] use tl.gather to enhance performance by @zhiyuan1i in #270
  • [WY representation] Faster lower triangle inverse by @sustcsonglin in #289
  • [GroupNorm] Add argument is_rms_norm to GroupNorm by @zhixuan-lin in #295
  • [GroupNorm] Return correct residual in reference implementation by @zhixuan-lin in #297
  • [CI] Don't show Triton autotune logs in CI by @zhiyuan1i in #298
  • [FoX] Use GroupNorm for QK-norm implementation in FoX by @zhixuan-lin in #299
  • [Utils] Update H100 and A100 configs by @zhiyuan1i in #306
  • Pass shifted labels and add a warning to RWKV-7 initialization. by @Triang-jyed-driung in #304
  • [Misc.] Update imports for GatedDeltaProduct by @yzhangcs in #309
  • [FAQ] Rewrite the nightly installation instructions by @zhiyuan1i in #305
  • Add unit tests for model forward and variable-length checks by @yzhangcs in #310
  • [Test] Improve path handling and test file detection by @zhiyuan1i in #311
  • [ShortConv] Adjust input shape according to cu_seqlens by @yzhangcs in #316
  • [Tests] Add unit tests for generation with padding by @yzhangcs in #312
  • [Testing] Update testing.py by @zhiyuan1i in #320
  • [DeltaNet] optimize chunk_delta_h by @sustcsonglin in #315
  • [CI] Only cancel in-progress CI for pull requests by @zhiyuan1i in #321
  • [Test] Skip some tests on arcA770 by @zhiyuan1i in #322
  • [API] Update head_first parameter default to False by @yzhangcs in #324
  • [Rotary] Remove max_seqlen parameter and adjust related logic by @yzhangcs in #326
  • [DeltaProduct] Remove unnecessary config parameter. by @JulienSiems in #325
  • fix the training problem of GatedDeltaProduct by @ridgerchu in #327
  • [Linear Attn] Fix head_first tests by @yzhangcs in #330
  • [Deprecated] Remove head_first option in gla variants by @yzhangcs in #337
  • [Test] Ensure most tests on Triton 3.2.0 and add 4096 seq_length in tests [skip test] by @zhiyuan1i in #300
  • [FoX] Merge code to FlashAttention | support batch inference by @sustcsonglin in #333
  • [DeltaNet] Delete head_first option for all by @yzhangcs in #338
  • [WIP] Remove head_first option by @yzhangcs in #339
  • [RWKV7] add input_precision param [skip test] by @zhiyuan1i in #335
  • [Testing] Add recursive dependency finding for test discovery by @zhiyuan1i in #341
  • [WIP] Delete head_first option for cumsum by @yzhangcs in #342
  • [WIP] Delete head_first tests for DeltaNet/GLA by @yzhangcs in #344
  • [Attn] Remove head_first & rename offsets to cu_seqlens by @yzhangcs in #345
  • [RWKV7] Drop some kernels to enhance speed by @zhiyuan1i in #346
  • Remove the head_first arg from several token mixing layer fns. by @yzhangcs in #347

New Contributors

Full Changelog: v0.1.2...v0.2.0

v0.1.2

31 Mar 06:30
53b3ac7

What's Changed

  • [RWKV7] fix RWKV7Attention.__init__ by @exhyy in #238
  • fix(triton): remove num_warps=8 in bwd_prepare_wy_repr_kernel to avoid MMA layout assertion on non-Ampere GPUs. by @kugwzk in #240
  • [Fix]: reshape o before o_proj in linear_attn layer. by @Luther-Sparks in #243
  • [CI] Separate tests into compile, normal, and varlen by @zhiyuan1i in #247
  • [ABC] Add use_rope parameter to ABCAttention and ABCConfig & Fix compiler bugs in kernels by @yzhangcs in #248
  • [CI] trigger GPU workflow only on pull_request events by @zhiyuan1i in #249
  • Create test_linearatten.py by @kangyiyang in #250
  • [CI] Fix all errors and enable testing for PRs by @zhiyuan1i in #251
  • [CI] add H100 GPU by @zhiyuan1i in #254
  • [Gated DeltaNet] fix gdn kernel bugs on h100 when vdim=64 by @kugwzk in #256
  • [Test] Enhance support for NVIDIA Hopper GPU by @zhiyuan1i in #257
  • [FAQ] Update triton-nightly links by @yzhangcs in #259
  • [Attn] Add triton impls for MHA/GQA by @yzhangcs in #260
  • [Attn] Use larger block size for hopper devices by @yzhangcs in #261
  • [Attn] Enable test for attn by @zhiyuan1i in #262
  • [CI] fix a syntax error in triton-nightly by @zhiyuan1i in #263
  • Bump fla to v0.1.2 by @yzhangcs in #264

New Contributors

Full Changelog: v0.1.1...v0.1.2