[2606.04032] Do Transformers Need Three Projections? Systematic Study of QKV Variants


Posted on June 5, 2026 by safdargal12


[Submitted on 1 Jun 2026]

By Ali Kayyam and 2 other authors

Abstract: Transformers have become the standard solution for various AI tasks, with the query, key, and value (QKV) attention formulation playing a central role. However, the individual contribution of these three projections and the impact of omitting some remain poorly understood. We systematically evaluate three projection sharing constraints: a) Q-K=V (shared key-value), b) Q=K-V (shared query-key), and c) Q=K=V (single projection). The last two variants produce symmetric attention maps; to address this, we also explore asymmetric attention via 2D positional encodings. Through experiments spanning synthetic tasks, vision (MNIST, CIFAR, TinyImageNet, anomaly), and language modeling (300M and 1.2B parameter models on 10B tokens), we find that the projection-sharing variants perform on par with, or occasionally better than, the standard QKV transformer. In language modeling, Q-K=V projection sharing achieves 50% KV cache reduction with only 3.1% perplexity degradation. Crucially, projection sharing is complementary to head sharing (GQA/MQA): combining Q-K=V with GQA-4 yields 87.5% cache reduction, while Q-K=V + MQA achieves 96.9%, enabling practical on-device inference. We show that Q-K=V preserves quality because keys and values can occupy similar representational spaces and attention operates in a low-rank regime, whereas Q=K-V breaks attention directionality. Our results systematically characterize projection sharing as an underexplored instance of weight tying in attention, with direct, quantifiable inference memory benefits, particularly valuable for edge deployment. The code is publicly available.
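The cache-reduction figures in the abstract can be checked with simple arithmetic. The sketch below is not the authors' code. It assumes the KV cache scales with the number of key/value heads times the number of tensors cached per head (two for separate K and V, one when K=V). It also assumes 16 attention heads and reads GQA-4 as a 4x reduction in key/value heads. The abstract states neither; 16 heads is chosen because it reproduces the 96.9% figure for MQA. The function and variable names are illustrative.

```python
from fractions import Fraction


def kv_cache_fraction(num_heads: int, num_kv_heads: int, share_kv: bool) -> Fraction:
    """Fraction of the baseline cache (multi-head attention, separate K and V) that remains."""
    if num_heads % num_kv_heads != 0:
        raise ValueError("num_heads must be divisible by num_kv_heads")
    # Head sharing (GQA/MQA) keeps only num_kv_heads of the num_heads K/V heads.
    head_factor = Fraction(num_kv_heads, num_heads)
    # Tying K and V means one tensor is cached per head instead of two.
    tensor_factor = Fraction(1, 2) if share_kv else Fraction(1)
    return head_factor * tensor_factor


def reduction_percent(fraction: Fraction) -> float:
    """Cache reduction relative to the baseline, in percent."""
    return float((1 - fraction) * 100)


NUM_HEADS = 16  # assumption: the abstract does not state the head count

configs = {
    "Q-K=V": kv_cache_fraction(NUM_HEADS, NUM_HEADS, share_kv=True),
    "Q-K=V + GQA-4": kv_cache_fraction(NUM_HEADS, NUM_HEADS // 4, share_kv=True),
    "Q-K=V + MQA": kv_cache_fraction(NUM_HEADS, 1, share_kv=True),
}

for name, fraction in configs.items():
    print(f"{name}: {reduction_percent(fraction):.1f}% cache reduction")
```

Under these assumptions the two savings multiply, which is why the abstract calls projection sharing complementary to head sharing: a 4x head reduction combined with K=V leaves one eighth of the baseline cache.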

Submission history

From: Anusha Madan Gopal
[v1] Mon, 1 Jun 2026 20:59:05 UTC (2,017 KB)


