[2606.04032] Do Transformers Need Three Projections? Systematic Study of QKV Variants

Posted on June 5, 2026 by safdargal12

[Submitted on 1 Jun 2026]

Authors: Ali Kayyam and 2 other authors


Abstract: Transformers have become the standard solution for a wide range of AI tasks, with the query, key, and value (QKV) attention formulation playing a central role. However, the individual contribution of these three projections, and the impact of omitting some of them, remain poorly understood. We systematically evaluate three projection-sharing constraints: a) Q-K=V (shared key-value), b) Q=K-V (shared query-key), and c) Q=K=V (single projection). The last two variants produce symmetric attention maps; to address this, we also explore asymmetric attention via 2D positional encodings. Through experiments spanning synthetic tasks, vision (MNIST, CIFAR, TinyImageNet, anomaly detection), and language modeling (300M and 1.2B parameter models on 10B tokens), we find that the projection-sharing transformers perform on par with, or occasionally better than, the standard QKV transformer. In language modeling, Q-K=V projection sharing achieves a 50% KV cache reduction with only 3.1% perplexity degradation. Crucially, projection sharing is complementary to head sharing (GQA/MQA): combining Q-K=V with GQA-4 yields an 87.5% cache reduction, while Q-K=V + MQA achieves 96.9%, enabling practical on-device inference. We show that Q-K=V preserves quality because keys and values can occupy similar representational spaces and attention operates in a low-rank regime, whereas Q=K-V breaks attention directionality. Our results systematically characterize projection sharing as an underexplored instance of weight tying in attention, with direct, quantifiable inference memory benefits that are particularly valuable for edge deployment. The code is publicly available at this https URL
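The cache-reduction percentages quoted in the abstract can be checked with a few lines of arithmetic. The sketch below is not taken from the paper's code. It assumes a model with 16 attention heads (the abstract does not state the head count) and reads "GQA-4" as four query heads sharing one key-value head. The function name `kv_cache_reduction` and its parameters are hypothetical, introduced only for illustration.

```python
def kv_cache_reduction(num_heads: int, num_kv_heads: int, share_kv: bool) -> float:
    """Fraction of KV cache saved relative to standard multi-head attention.

    The baseline caches one key tensor and one value tensor per query head.
    """
    if num_heads % num_kv_heads != 0:
        raise ValueError("num_heads must be divisible by num_kv_heads")
    # With K=V sharing, a single tensor serves as both key and value.
    tensors_per_kv_head = 1 if share_kv else 2
    baseline = 2 * num_heads
    cached = tensors_per_kv_head * num_kv_heads
    return 1.0 - cached / baseline


NUM_HEADS = 16  # assumption: the head count is not given in the abstract

reductions = {
    # K=V sharing alone: every head keeps its own cache entry.
    "Q-K=V": kv_cache_reduction(NUM_HEADS, NUM_HEADS, share_kv=True),
    # Assumed reading of GQA-4: groups of 4 query heads per KV head.
    "Q-K=V + GQA-4": kv_cache_reduction(NUM_HEADS, NUM_HEADS // 4, share_kv=True),
    # MQA: a single KV head shared by all query heads.
    "Q-K=V + MQA": kv_cache_reduction(NUM_HEADS, 1, share_kv=True),
}

for name, fraction in reductions.items():
    print(f"{name}: {fraction:.1%}")
```

Under these assumptions the three values come out to 50%, 87.5%, and 96.875%, matching the abstract's 50%, 87.5%, and 96.9%. The head-sharing factor (1/4 or 1/16) and the projection-sharing factor (1/2) multiply, which is the sense in which the two techniques are complementary.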

Submission history

From: Anusha Madan Gopal
[v1] Mon, 1 Jun 2026 20:59:05 UTC (2,017 KB)


