Skip to content

[Core][Ref] PagedAttention reference implementation #28815

Open
PiotrKrzem wants to merge 255 commits intoopenvinotoolkit:masterfrom
PiotrKrzem:feature/paged_reference
Open

[Core][Ref] PagedAttention reference implementation #28815
PiotrKrzem wants to merge 255 commits intoopenvinotoolkit:masterfrom
PiotrKrzem:feature/paged_reference

Conversation

@PiotrKrzem
Copy link
Copy Markdown
Contributor

@PiotrKrzem PiotrKrzem commented Feb 4, 2025

Details:

  • Add PagedAttention reference implementation
  • Add testing suite

Tickets:

  • 158135

Relevant files:

Paged Attention - input for spec.pptx
ReasoningTokenEviction (1).pptx

Notes:

For the Paged Attention - input for spec:

  • The presentation slides 14-17 showcase the cache management system using 'free_*' outputs, that is not present in the reference. Those inputs/outputs can be easily obtained if needed, using the CacheManager. Currently all the eviction is done automatically by the CacheManager instead.
  • The presentation shows alibi_slopes to have a size [num_kv_heads], I believe it to be an error, should be [query_heads] (matches the executor_pa.cpp line 2060 in CPU implementation)

For the ReasoningTokenEviction:

  • Output 2 data is correct
  • However, the model currently lacks all the necessary inputs defined in the presentation to accurately compute adaptive RKV. I added these inputs as hardcoded parameters to the CacheManager

@github-actions github-actions bot added category: Core OpenVINO Core (aka ngraph) category: IE Tests OpenVINO Test: plugins and common category: TEMPLATE OpenVINO Template plugin category: CPP API OpenVINO CPP API bindings labels Feb 4, 2025
@vshampor
Copy link
Copy Markdown
Contributor

You should also be adding the cache rotation-related functionality here since it's part of the op and has both CPU and GPU implementations now.

@github-actions github-actions bot added category: CPU OpenVINO CPU plugin category: transformations OpenVINO Runtime library - Transformations category: NPU OpenVINO NPU plugin category: NPUW NPUW plugin labels Feb 27, 2025
@github-actions github-actions bot removed category: CPU OpenVINO CPU plugin category: transformations OpenVINO Runtime library - Transformations category: NPU OpenVINO NPU plugin category: NPUW NPUW plugin labels Feb 28, 2025
@mlukasze mlukasze requested a review from vshampor February 28, 2025 05:56
@PiotrKrzem PiotrKrzem marked this pull request as ready for review February 28, 2025 06:44
@PiotrKrzem PiotrKrzem requested review from a team as code owners February 28, 2025 06:44
@PiotrKrzem PiotrKrzem requested a review from a team as a code owner February 28, 2025 07:26

This comment was marked as resolved.

PiotrKrzem and others added 2 commits March 5, 2026 10:15
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

This comment was marked as resolved.

This comment was marked as resolved.

@PiotrKrzem
Copy link
Copy Markdown
Contributor Author

New checkpoint, at the commit d506010 the 25-input version passes the all the CI tests with FIFO eviction policy enabled for CacheManager

@PiotrKrzem
Copy link
Copy Markdown
Contributor Author

And finally with the commit b88e744 we have a fully working version with all 3 modes of cache eviction

Copy link
Copy Markdown
Contributor

@mitruska mitruska left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work revealing PagedAttention algorithm.

Currently this PR introduce not only reference implementation, but also changes to common components code, shared between plugins - including common PagedAttentionExtension class (like additional attribute for cache management, shape_infer), transformation, CPU plugin PA kernel adjustments.

To minimize risk, I would recommend to shrink this PR for reference implementation only covered by tests - then changes for common plugins code separately if needed.
The template plugin / reference implementation requirements shouldn't motivate additional changes in the existing code base shared with other plugins.

(Reviewed scope of the PR changes - without deep dive into reference code itself yet)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please explain CPU changes and motivation to introduce them in this PR, if there are failing tests for CPU case, I would recommend to mark them as xfail, and propose fix for CPU separately.

OPENVINO_DEBUG("PagedAttn ",
pa_op->get_friendly_name(),
" doesn't have rtinfo for num_k_heads/k_head_size/num_v_heads/num_v_heads");
status = false;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, removal of this assignment doesn't change the logic, because the "status" variable is already initialized with false, but at the same time each such "nice to have" change, makes this PR harder to review and more risky when changes are applied to the common code shared between plugins.

Comment on lines +151 to +152
// Propagate updated cache types to PA outputs so consumers see the correct element type
pa_op->validate_and_infer_types();
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This updates common flow, how CPU/GPU plugins resolve that it wasn't a problem before?

std::shared_ptr<ov::Node> clone_with_new_inputs(const ov::OutputVector& new_args) const override;

/// \brief Gets the output element type at the specified index.
const ov::element::Type get_out_type(int index) const;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's valid question, why resolved without response?

Comment on lines +74 to +84
using PagedCacheManagerHandle = std::shared_ptr<void>; // Void handle to avoid inconsistent linkage
PagedCacheManagerHandle get_cache_manager() const;
void set_cache_manager(PagedCacheManagerHandle cache_manager);

protected:
PagedCacheManagerHandle m_cache_manager = nullptr;
std::vector<ov::element::Type> m_output_type = {ov::element::dynamic, ov::element::dynamic, ov::element::dynamic};
};

// Exported function for transformations to construct the manager; avoids C4273 in some build configs
OPENVINO_API PagedAttentionExtension::PagedCacheManagerHandle make_paged_cache_handle(ov::element::Type et);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The changes in common op shouldn't be motivated by reference implementation. If there is anything really specific for template plugin, it would be better to use common PagedAttentionExtension as a base class instead of adding new attribute like m_cache_manager not used by any production plugin like CPU/GPU/NPU.

Comment on lines +11 to +18
namespace ov {
namespace op {
template <class TShape, class TRShape = result_shape_t<TShape>>
std::vector<TRShape> shape_infer(const PagedAttentionExtension* op,
const std::vector<TShape>& input_shapes,
const ITensorAccessor& ta = make_tensor_accessor()) {
NODE_VALIDATION_CHECK(op, input_shapes.size() == 25, "Expected exactly 25 inputs but got ", input_shapes.size());
auto output_shapes = std::vector<TRShape>(3);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR moves PA shape_infer to a separate function, is there any change in the shape validation/output shape calculation logic?
Is this shape_infer code covered by tests to ensure no regression for common PagedAttentionExtension?

Comment on lines +513 to +515
// Causal: only keep blocks where k_block <= q_block
if (vals[i].second > qb)
keep = false;
Copy link
Copy Markdown
Contributor

@p-wysocki p-wysocki Mar 9, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand that vals[i].second > qb is causal rejection, but isn't rejected value being added to cumsum above in the next loop by running cumsum += vals[i - 1].first;?

if (has_sinks) {
sink_vals.resize(q_heads, 0.f);
for (std::size_t h = 0; h < q_heads; ++h) {
sink_vals[h] = detail::read_at_as_f32(sinks, sinks_et, h);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is H in shape [1,H,1,1] guaranteed to be number of heads by the spec? My concern is about GQA, where if H would be consistent with number of KV heads, this line would read q_heads_count elements from memory of size kv_heads_count.

class ov::pass::AttachCacheManagerToPagedAttention : public ov::pass::ModelPass {
public:
OPENVINO_MODEL_PASS_RTTI("AttachCacheManagerToPagedAttention");
bool run_on_model(const std::shared_ptr<ov::Model>& model) override;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see any tests that cover this transformation other than full template tests. Can this pass be used with non-template implementations?

// 0) The default test (same as in the CPU plugin)
INSTANTIATE_TEST_SUITE_P(smoke_PagedAttentionLayerTest,
PagedAttentionLayerTest,
::testing::Combine(::testing::Values(ElementType::f32),
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't we cover other input precisions, like bf16 and fp16? I saw some old comment mentioning it

namespace testing {

namespace {
ov::OutputVector make_valid_pa_args(const element::Type& t = element::f32) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Outside simple PA args, shouldn't type_prop also cover more complex cases like GQA?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also please parametrize tests to test other requested dtypes


const auto op = std::make_shared<op::PagedAttentionExtension>(args);
EXPECT_EQ(op->get_output_element_type(0), element::f32);
EXPECT_EQ(op->get_output_partial_shape(0), (PartialShape{3, 4}));
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are outputs 1 and 2 validated?

const auto q = out_ps[1].get_length();
const auto k = key_ps[1].get_length();
const auto v = value_ps[1].get_length();
NODE_VALIDATION_CHECK(op,
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are all failure scenarios validated?


auto args = make_valid_pa_args();
args[9] = std::make_shared<op::v0::Parameter>(element::i32, PartialShape{}); // scale must be real type
EXPECT_THROW(std::ignore = std::make_shared<op::PagedAttentionExtension>(args), ov::NodeValidationFailure);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please instead EXPECT_THROW use OV_EXPECT_THROW with error substring to make it clear what failure scenario is tested and whether it fails with expected error

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

category: Core OpenVINO Core (aka ngraph) category: CPU OpenVINO CPU plugin category: IE Tests OpenVINO Test: plugins and common category: TEMPLATE OpenVINO Template plugin

Projects

None yet

Development

Successfully merging this pull request may close these issues.