[Core][Ref] PagedAttention reference implementation by PiotrKrzem · Pull Request #28815 · openvinotoolkit/openvino

PiotrKrzem · 2025-02-04T11:50:08Z

Details:

Add PagedAttention reference implementation
Add testing suite

Tickets:

158135

Relevant files:

Paged Attention - input for spec.pptx
ReasoningTokenEviction (1).pptx

Notes:

For the Paged Attention - input for spec:

Slides 14-17 of the presentation showcase a cache management system that uses 'free_*' outputs, which are not present in the reference. Those inputs/outputs can easily be obtained from the CacheManager if needed. Currently, all eviction is instead done automatically by the CacheManager.
The presentation shows alibi_slopes with size [num_kv_heads]. I believe this is an error and the size should be [query_heads] (this matches executor_pa.cpp line 2060 in the CPU implementation).
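To illustrate why a per-query-head size is the natural fit, here is a minimal sketch of the textbook ALiBi bias applied to one row of attention scores. It is not taken from the reference or from executor_pa.cpp; the function name `add_alibi_bias` and its parameters are hypothetical, and the assumption that `alibi_slopes` holds one slope per query head is exactly the point under discussion.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Adds the ALiBi bias to one row of attention scores for query head `q_head`.
// scores[j] is the raw score between the query at position `q_pos` and key j.
// Assumption: alibi_slopes holds one slope per *query* head, so the lookup
// alibi_slopes[q_head] is valid for every query head, including under GQA.
void add_alibi_bias(std::vector<float>& scores,
                    const std::vector<float>& alibi_slopes,
                    std::size_t q_head,
                    std::size_t q_pos) {
    assert(q_head < alibi_slopes.size());
    // Only keys at or before the query position receive a distance penalty.
    for (std::size_t j = 0; j < scores.size() && j <= q_pos; ++j) {
        scores[j] -= alibi_slopes[q_head] * static_cast<float>(q_pos - j);
    }
}
```

With a buffer sized [num_kv_heads] instead, the same lookup would need an extra query-head-to-KV-head mapping before indexing.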

For the ReasoningTokenEviction:

Output 2 data is correct
However, the model currently lacks the inputs that the presentation defines as necessary to accurately compute adaptive RKV, so I added these inputs as hardcoded parameters to the CacheManager.

vshampor · 2025-02-20T15:22:48Z

You should also be adding the cache rotation-related functionality here since it's part of the op and has both CPU and GPU implementations now.

PiotrKrzem · 2026-03-05T20:46:28Z

New checkpoint: at commit d506010, the 25-input version passes all the CI tests with the FIFO eviction policy enabled for the CacheManager.

PiotrKrzem · 2026-03-06T09:39:52Z

And finally, with commit b88e744, we have a fully working version with all 3 cache eviction modes.

mitruska

Great work revealing the PagedAttention algorithm.

Currently this PR introduces not only the reference implementation but also changes to common component code shared between plugins, including the common PagedAttentionExtension class (for example an additional attribute for cache management, and shape_infer), a transformation, and CPU plugin PA kernel adjustments.

To minimize risk, I would recommend shrinking this PR to the reference implementation only, covered by tests, and then submitting changes to the common plugin code separately if needed.
The template plugin / reference implementation requirements shouldn't motivate additional changes in the existing code base shared with other plugins.

(I reviewed the scope of the PR changes, without a deep dive into the reference code itself yet.)

mitruska · 2026-03-06T12:29:24Z

src/plugins/intel_cpu/src/nodes/paged_attn.cpp

Please explain the CPU changes and the motivation for introducing them in this PR. If there are failing tests for the CPU case, I would recommend marking them as xfail and proposing a fix for CPU separately.

mitruska · 2026-03-06T13:08:11Z

...common/transformations/src/transformations/common_optimizations/convert_pagedattn_inputs.cpp

            OPENVINO_DEBUG("PagedAttn ",
                           pa_op->get_friendly_name(),
                           " doesn't have rtinfo for num_k_heads/k_head_size/num_v_heads/num_v_heads");
-            status = false;


OK, removing this assignment doesn't change the logic, because the "status" variable is already initialized to false. At the same time, each such "nice to have" change makes this PR harder to review and riskier, since the changes are applied to common code shared between plugins.

mitruska · 2026-03-06T13:12:39Z

...common/transformations/src/transformations/common_optimizations/convert_pagedattn_inputs.cpp

+        // Propagate updated cache types to PA outputs so consumers see the correct element type
+        pa_op->validate_and_infer_types();


This updates the common flow. How did the CPU/GPU plugins handle this before, such that it wasn't a problem?

mitruska · 2026-03-06T13:21:18Z

src/core/dev_api/openvino/op/paged_attention.hpp

    std::shared_ptr<ov::Node> clone_with_new_inputs(const ov::OutputVector& new_args) const override;

+    /// \brief Gets the output element type at the specified index.
+    const ov::element::Type get_out_type(int index) const;


That's a valid question; why was it resolved without a response?

mitruska · 2026-03-06T13:24:41Z

src/core/dev_api/openvino/op/paged_attention.hpp

+    using PagedCacheManagerHandle = std::shared_ptr<void>;  // Void handle to avoid inconsistent linkage
+    PagedCacheManagerHandle get_cache_manager() const;
+    void set_cache_manager(PagedCacheManagerHandle cache_manager);
+
 protected:
+    PagedCacheManagerHandle m_cache_manager = nullptr;
    std::vector<ov::element::Type> m_output_type = {ov::element::dynamic, ov::element::dynamic, ov::element::dynamic};
 };

+// Exported function for transformations to construct the manager; avoids C4273 in some build configs
+OPENVINO_API PagedAttentionExtension::PagedCacheManagerHandle make_paged_cache_handle(ov::element::Type et);


Changes to the common op shouldn't be motivated by the reference implementation. If there is anything really specific to the template plugin, it would be better to use the common PagedAttentionExtension as a base class instead of adding a new attribute like m_cache_manager that is not used by any production plugin such as CPU/GPU/NPU.
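A rough sketch of the base-class approach suggested above, using stand-in types so it compiles on its own. `PagedAttentionExtensionStub`, `TemplateCacheManager`, and `TemplatePagedAttention` are hypothetical names for illustration only; the real shared op is PagedAttentionExtension, and nothing here reflects its actual interface.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>

// Stand-in for the shared op class (hypothetical, not the OpenVINO API).
class PagedAttentionExtensionStub {
public:
    virtual ~PagedAttentionExtensionStub() = default;
    virtual std::string type_name() const { return "PagedAttentionExtension"; }
};

// Hypothetical cache manager that only the template plugin knows about.
struct TemplateCacheManager {
    int evicted_blocks = 0;
};

// Hypothetical template-plugin op: the cache manager lives in the derived
// class, so the shared base class carries no template-specific state.
class TemplatePagedAttention : public PagedAttentionExtensionStub {
public:
    explicit TemplatePagedAttention(std::shared_ptr<TemplateCacheManager> manager)
        : m_cache_manager(std::move(manager)) {}

    std::string type_name() const override { return "TemplatePagedAttention"; }

    const std::shared_ptr<TemplateCacheManager>& get_cache_manager() const {
        return m_cache_manager;
    }

private:
    std::shared_ptr<TemplateCacheManager> m_cache_manager;
};
```

In this arrangement the production plugins keep working against the unchanged base type, while the template plugin can hold a typed manager pointer rather than a void handle.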

mitruska · 2026-03-06T13:35:55Z

src/core/shape_inference/include/paged_attention_shape_inference.hpp

+namespace ov {
+namespace op {
+template <class TShape, class TRShape = result_shape_t<TShape>>
+std::vector<TRShape> shape_infer(const PagedAttentionExtension* op,
+                                 const std::vector<TShape>& input_shapes,
+                                 const ITensorAccessor& ta = make_tensor_accessor()) {
+    NODE_VALIDATION_CHECK(op, input_shapes.size() == 25, "Expected exactly 25 inputs but got ", input_shapes.size());
+    auto output_shapes = std::vector<TRShape>(3);


This PR moves the PA shape_infer to a separate function. Is there any change in the shape validation or output shape calculation logic?
Is this shape_infer code covered by tests to ensure there is no regression for the common PagedAttentionExtension?

p-wysocki · 2026-03-09T12:10:14Z

src/core/reference/include/openvino/reference/paged_attention.hpp

+                        // Causal: only keep blocks where k_block <= q_block
+                        if (vals[i].second > qb)
+                            keep = false;


I understand that vals[i].second > qb is the causal rejection, but isn't the rejected value still added to cumsum by the line above, cumsum += vals[i - 1].first;, in the next loop iteration?
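If the intended behaviour is that causally rejected blocks contribute nothing to the running sum, one possible shape for the loop is sketched below. This is not the PR's code: `select_blocks`, the descending-score ordering of `vals`, and the `threshold` stopping rule are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Returns the key-block indices kept for query block `qb`.
// `vals` holds (score, key-block index) pairs, assumed sorted by descending score.
// Blocks are kept until the cumulative score of the *kept* blocks reaches `threshold`.
std::vector<std::size_t> select_blocks(const std::vector<std::pair<float, std::size_t> >& vals,
                                       std::size_t qb,
                                       float threshold) {
    std::vector<std::size_t> kept;
    float cumsum = 0.f;
    for (std::size_t i = 0; i < vals.size(); ++i) {
        const float score = vals[i].first;
        const std::size_t kb = vals[i].second;
        if (kb > qb) {
            continue;  // causal rejection: the score is never added to cumsum
        }
        if (cumsum >= threshold) {
            break;  // enough attention mass has already been collected
        }
        kept.push_back(kb);
        cumsum += score;  // only kept blocks reach this line
    }
    return kept;
}
```

For example, with vals = {(0.5, 3), (0.3, 0), (0.2, 1)}, qb = 1 and threshold = 0.4, block 3 is rejected and blocks 0 and 1 are kept. Had the rejected 0.5 been counted, the threshold would already be met and nothing would be kept.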

p-wysocki · 2026-03-09T12:22:35Z

src/core/reference/include/openvino/reference/paged_attention.hpp

+    if (has_sinks) {
+        sink_vals.resize(q_heads, 0.f);
+        for (std::size_t h = 0; h < q_heads; ++h) {
+            sink_vals[h] = detail::read_at_as_f32(sinks, sinks_et, h);


Is H in shape [1,H,1,1] guaranteed by the spec to be the number of heads? My concern is about GQA: if H matched the number of KV heads, this line would read q_heads_count elements from memory of size kv_heads_count.
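A defensive variant of the read, sketched under assumptions: whether the spec allows H to equal the KV head count is the open question here, so this version simply accepts either size and rejects anything else. `read_sinks` is a hypothetical helper operating on an already-converted float buffer, not the reference's `detail::read_at_as_f32` path.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Reads one sink value per query head from a buffer laid out as [1, H, 1, 1].
// Accepts H == q_heads (one value per query head) or H == kv_heads (one value
// shared by every query head in a GQA group); any other size is an error.
std::vector<float> read_sinks(const std::vector<float>& sinks,
                              std::size_t q_heads,
                              std::size_t kv_heads) {
    if (kv_heads == 0 || q_heads % kv_heads != 0) {
        throw std::invalid_argument("q_heads must be a positive multiple of kv_heads");
    }
    const std::size_t h = sinks.size();
    if (h != q_heads && h != kv_heads) {
        throw std::invalid_argument("sinks size must equal q_heads or kv_heads");
    }
    const std::size_t group = q_heads / kv_heads;  // query heads per KV head
    std::vector<float> out(q_heads);
    for (std::size_t q = 0; q < q_heads; ++q) {
        // Index by query head when H == q_heads, otherwise by its KV group.
        out[q] = sinks[h == q_heads ? q : q / group];
    }
    return out;
}
```

Either way, the size check turns a potential out-of-bounds read into an explicit error.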

mmikolajcz · 2026-03-09T16:05:46Z

src/common/transformations/include/transformations/attach_cache_manager_to_paged_attention.hpp

+class ov::pass::AttachCacheManagerToPagedAttention : public ov::pass::ModelPass {
+public:
+    OPENVINO_MODEL_PASS_RTTI("AttachCacheManagerToPagedAttention");
+    bool run_on_model(const std::shared_ptr<ov::Model>& model) override;


I don't see any tests that cover this transformation other than full template tests. Can this pass be used with non-template implementations?

mmikolajcz · 2026-03-09T16:15:31Z

...ins/intel_cpu/tests/functional/shared_tests_instances/single_layer_tests/paged_attention.cpp

+// 0) The default test (same as in the CPU plugin)
+INSTANTIATE_TEST_SUITE_P(smoke_PagedAttentionLayerTest,
+                         PagedAttentionLayerTest,
+                         ::testing::Combine(::testing::Values(ElementType::f32),


Shouldn't we cover other input precisions, like bf16 and fp16? I saw some old comment mentioning it

mmikolajcz · 2026-03-09T16:24:14Z

src/core/tests/type_prop/paged_attention.cpp

 namespace testing {

+namespace {
+ov::OutputVector make_valid_pa_args(const element::Type& t = element::f32) {


Beyond simple PA args, shouldn't type_prop also cover more complex cases like GQA?

Also, please parametrize the tests to cover the other requested dtypes.

mmikolajcz · 2026-03-09T16:25:49Z

src/core/tests/type_prop/paged_attention.cpp


    const auto op = std::make_shared<op::PagedAttentionExtension>(args);
    EXPECT_EQ(op->get_output_element_type(0), element::f32);
    EXPECT_EQ(op->get_output_partial_shape(0), (PartialShape{3, 4}));


Are outputs 1 and 2 validated?

mmikolajcz · 2026-03-09T16:29:53Z

src/core/shape_inference/include/paged_attention_shape_inference.hpp

+            const auto q = out_ps[1].get_length();
+            const auto k = key_ps[1].get_length();
+            const auto v = value_ps[1].get_length();
+            NODE_VALIDATION_CHECK(op,


Are all failure scenarios validated?

mmikolajcz · 2026-03-09T16:31:06Z

src/core/tests/type_prop/paged_attention.cpp

-
+    auto args = make_valid_pa_args();
+    args[9] = std::make_shared<op::v0::Parameter>(element::i32, PartialShape{});  // scale must be real type
    EXPECT_THROW(std::ignore = std::make_shared<op::PagedAttentionExtension>(args), ov::NodeValidationFailure);


Please use OV_EXPECT_THROW with an error substring instead of EXPECT_THROW, to make it clear which failure scenario is tested and whether it fails with the expected error.
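The idea behind matching on an error substring can be shown with a self-contained stand-in that uses neither gtest nor the real OV_EXPECT_THROW macro. `throws_with_substring` and `validate_scale_type` are hypothetical helpers, and the error text is invented for the example.

```cpp
#include <cassert>
#include <functional>
#include <stdexcept>
#include <string>

// Returns true only if `fn` throws an exception of type E whose what()
// message contains `needle`. A different exception type, a different
// message, or no exception at all yields false.
template <class E>
bool throws_with_substring(const std::function<void()>& fn, const std::string& needle) {
    try {
        fn();
    } catch (const E& e) {
        return std::string(e.what()).find(needle) != std::string::npos;
    } catch (...) {
        return false;
    }
    return false;
}

// Hypothetical validator standing in for the op constructor's type check.
void validate_scale_type(bool is_real) {
    if (!is_real) {
        throw std::invalid_argument("scale input must have a real element type");
    }
}
```

A check that only looks at the exception type passes for any validation failure. Matching on "real element type" ties the test to the scale check specifically, so an unrelated failure elsewhere no longer makes it pass by accident.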

github-actions bot added category: Core OpenVINO Core (aka ngraph) category: IE Tests OpenVINO Test: plugins and common category: TEMPLATE OpenVINO Template plugin category: CPP API OpenVINO CPP API bindings labels Feb 4, 2025

[FIX] Remove staged artifacts

69d4796

github-actions bot added category: CPU OpenVINO CPU plugin category: transformations OpenVINO Runtime library - Transformations category: NPU OpenVINO NPU plugin category: NPUW NPUW plugin labels Feb 27, 2025

PiotrKrzem added 2 commits February 28, 2025 02:41

Merge branch 'master' into feature/paged_reference

d3b31fc

[FIX] Add RoPE to testing suite

b8c9742

github-actions bot removed category: CPU OpenVINO CPU plugin category: transformations OpenVINO Runtime library - Transformations category: NPU OpenVINO NPU plugin category: NPUW NPUW plugin labels Feb 28, 2025

mlukasze requested a review from vshampor February 28, 2025 05:56

PiotrKrzem added 4 commits February 28, 2025 06:00

[FIX] Add missing test cases:

687d6aa

[FIX] Test build

0986e6d

[FIX] Single op graph

dfe28b0

[FIX] Visitor test

c582c82

PiotrKrzem marked this pull request as ready for review February 28, 2025 06:44

PiotrKrzem requested review from a team as code owners February 28, 2025 06:44

PiotrKrzem added 3 commits February 28, 2025 06:53

[FIX] Use reference_tests::Tensor in tests

abbea22

[FIX] test case name:

6ccf30d

[FIX] Separate extension, update dependencies

a9ebf4f

PiotrKrzem requested a review from a team as a code owner February 28, 2025 07:26

PiotrKrzem and others added 3 commits March 4, 2026 19:42

[FIX] Clang

a73ba1c

Merge branch 'master' into feature/paged_reference

9030ac0

[FIX] Unused variable left from cleanup

076474a

PiotrKrzem and others added 2 commits March 5, 2026 10:15

Update src/plugins/template/backend/ops/paged_attention.cpp

52f18e5

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

[FIX] Copilot review suggestions

e14bb5d

[FIX] Copilot review, score aggreg window mini fix, template registration fix

d302718

PiotrKrzem mentioned this pull request Mar 5, 2026

[SPEC] PagedAttention #30879

Open

PiotrKrzem added 3 commits March 5, 2026 11:59

[ADD] Tests for the edge case of 0 aggregation

d506010

[FIX] Re-add CacheManager Score & adaptive RKV policies

7191a26

[FIX] Minor adjustments and notes for future reference

aa81ffa

PiotrKrzem added 2 commits March 5, 2026 21:05

[FIX] Build for CM tests

23a313c

[FIX] Unused variable

b88e744

[FIX] ARKV tests, comments cleanup from past outdated methodologies

2b9991f

mitruska requested changes Mar 6, 2026

PiotrKrzem added 2 commits March 6, 2026 14:55

[FIX] Relocate the pooling and adaptive rkv parameters, add tests with detailed descriptions

eb84fb0

[FIX] Clang

783fd33

p-wysocki reviewed Mar 9, 2026

mmikolajcz reviewed Mar 9, 2026

[FIX] Separate CPU and Template plugin, PR review fixes

9a13bee

PiotrKrzem mentioned this pull request Mar 23, 2026

[CPU][Transformations] Adds mini fixes for CPU and Transformations for PagedAttention #34860

Draft

PiotrKrzem added 4 commits March 23, 2026 17:33

[FIX] Build error nr 2

d5fcbd5

[FIX] Sync with 26 input version

5238c55

Merge branch 'master' into feature/paged_reference

902ad88

Merge branch 'master' into feature/paged_reference

5ec9f67

17 participants
