Thanks for your great work!
I have a question about how position information is distributed to the attention heads in Qwen2-7B.
Qwen2-7B uses grouped-query attention (GQA), and its number of key-value heads is set to 4.
However, there are 5 types of position information in P (1 for the reading order and 4 for the bounding box coordinates).
In this case, how did you distribute the 5 types of position information across the 4 key-value heads?
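To make the mismatch I'm asking about concrete, here is a small sketch. The variable names and the breakdown of the bounding box into (x_min, y_min, x_max, y_max) are my own assumptions, not taken from your code:

```python
# Sketch of the head/position mismatch (names are my assumptions).
num_key_value_heads = 4          # Qwen2-7B GQA setting
position_types = [
    "reading_order",             # 1 type for 1D reading order
    "bbox_x_min", "bbox_y_min",  # 4 types for the bounding box
    "bbox_x_max", "bbox_y_max",  # (my assumed coordinate split)
]

# A naive one-type-per-KV-head assignment covers only 4 of the 5 types:
assignment = dict(zip(range(num_key_value_heads), position_types))
leftover = position_types[num_key_value_heads:]
print(assignment)  # heads 0..3 each get one type
print(leftover)    # one type is left without a KV head under this scheme
```

So under a naive one-to-one assignment, one of the 5 types has no dedicated key-value head, which is why I'm curious how the distribution was handled in practice (e.g., repeating a type, sharing heads, or summing embeddings).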