https://abc.microfintool.com/

ABC Tool

Google’s Gemma 4 AI models get 3x speed boost by predicting future tokens


Posted on May 6, 2026 by safdargal12


Google launched its Gemma 4 open models this spring, promising a new level of power and performance for local AI. Google’s take on edge AI could be getting even faster already with the release of Multi-Token Prediction (MTP) drafters for Gemma. Google says these experimental models leverage a form of speculative decoding to take a guess at future tokens, which can speed up generation compared to the way models generate tokens on their own.

The latest Gemma models are built on the same underlying technology that powers Google’s frontier Gemini AI, but they’re tuned to run locally. Gemini is optimized to run on Google’s custom TPU chips, which operate in enormous clusters with super-fast interconnects and memory. A single high-power AI accelerator can run the largest Gemma 4 model at full precision, and quantizing will let it run on a consumer GPU.
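To see why quantizing matters, here is a rough back-of-the-envelope sketch of weight memory alone. The 26-billion-parameter count and the 16-bit and 4-bit widths are illustrative assumptions, not published figures for any specific Gemma 4 configuration, and the estimate ignores activations and the KV cache.

```python
def weight_memory_gb(params_billion: float, bits_per_param: int) -> float:
    """Approximate memory needed for the weights alone, in gigabytes (10^9 bytes)."""
    bytes_per_param = bits_per_param / 8
    return params_billion * bytes_per_param

# Hypothetical 26B-parameter model at two assumed precisions.
full_precision_gb = weight_memory_gb(26, 16)  # 16-bit weights (assumed "full precision")
quantized_gb = weight_memory_gb(26, 4)        # 4-bit quantized weights (assumed)

print(f"16-bit: {full_precision_gb:.0f} GB, 4-bit: {quantized_gb:.0f} GB")
```

Under these assumptions the 16-bit weights need roughly four times the memory of the 4-bit version, which is the gap between a data center accelerator and a consumer card.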

Gemma allows users to tinker with AI on their hardware rather than sharing all their data with a cloud AI system from Google or someone else. Google also changed the license for Gemma 4 to Apache 2.0, which is much more permissive than the custom Gemma license Google employed for previous releases. However, there are inherent limitations in the hardware most people have to run local AI models. That’s where MTP comes in.

LLMs like Gemma (or Gemini) generate tokens autoregressively: they produce one token at a time, each based on all the tokens that came before it. Each one takes just as much computing work as the last one, regardless of whether the token is just a filler word in an output or a key piece of information in a complex logical problem.
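A minimal sketch of that loop, assuming a toy stand-in for the model: `toy_next_token` is a hypothetical function invented for illustration, not any real Gemma API. What matters is the shape of the loop, which makes one full model call per new token.

```python
from typing import Callable, List

def toy_next_token(context: List[int]) -> int:
    """Hypothetical stand-in for a full forward pass over the whole context."""
    return (sum(context) * 31 + len(context)) % 100

def generate(next_token: Callable[[List[int]], int],
             prompt: List[int], n_new: int) -> List[int]:
    """Plain autoregressive decoding: every new token costs one model call."""
    tokens = list(prompt)
    for _ in range(n_new):
        # The model sees everything generated so far and emits one token.
        tokens.append(next_token(tokens))
    return tokens

output = generate(toy_next_token, [1, 2, 3], 5)
print(output)
```

The cost per step is the same whether the token is trivial or important, because the loop has no way to know in advance which is which.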

The problem with rolling your own AI is that the memory in consumer hardware is probably much slower than the high-bandwidth memory (HBM) used in enterprise hardware. As a result, the processor spends a lot of time moving parameters from VRAM to its compute units for each token, and compute cycles go unused while it waits.
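The bottleneck can be approximated with simple arithmetic. The sketch below assumes a dense model whose weights are all read from memory once per generated token, and it uses made-up round numbers (52 GB and 13 GB of weights, 1,000 GB/s of bandwidth) rather than the specifications of any real GPU.

```python
def max_tokens_per_second(model_gb: float, bandwidth_gb_s: float) -> float:
    """Upper bound on decode speed when memory bandwidth is the only limit.

    Assumes every weight is streamed from memory exactly once per token.
    """
    return bandwidth_gb_s / model_gb

# Illustrative numbers only, not measured results.
for label, model_gb in [("52 GB of weights", 52.0), ("13 GB of weights", 13.0)]:
    rate = max_tokens_per_second(model_gb, bandwidth_gb_s=1000.0)
    print(f"{label}: at most {rate:.1f} tokens/s")
```

Because this ceiling depends on bytes moved rather than arithmetic performed, the compute units sit idle for much of each step. That idle time is what a drafter can put to use.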

Gemma 4 26B on an NVIDIA RTX PRO 6000. Standard Inference (left) vs. MTP Drafter (right) in tokens per second. Same output quality, half the wait time.


MTP uses that time to bypass the heavy model and generate speculative tokens with the lightweight drafter. While the draft models are smaller (just 74 million parameters for Gemma 4 E2B), they’re also optimized in several ways to speed up speculative token generation. For example, the drafter shares the key-value (KV) cache (essentially the LLM’s active memory) so it doesn’t need to recalculate context the main model has already worked out. The E2B and E4B drafters also use a sparse decoding technique to narrow down clusters of likely tokens.
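The general draft-then-verify idea behind speculative decoding can be sketched with toy models. This is a simplified greedy variant, not Google's implementation: `target_next` and `draft_next` are hypothetical stand-ins, and the sketch does not model KV cache sharing or sparse decoding. A real system would score all drafted positions in one batched forward pass of the heavy model. Here the toy target is called repeatedly and each round is counted as one pass.

```python
from typing import Callable, List, Tuple

Model = Callable[[List[int]], int]

def target_next(context: List[int]) -> int:
    """Hypothetical heavy model: counts upward modulo 10."""
    return (context[-1] + 1) % 10

def draft_next(context: List[int]) -> int:
    """Hypothetical drafter: usually agrees, but guesses wrong after a 4."""
    return 0 if context[-1] == 4 else (context[-1] + 1) % 10

def speculative_generate(target: Model, draft: Model, prompt: List[int],
                         n_new: int, k: int = 4) -> Tuple[List[int], int]:
    tokens = list(prompt)
    goal = len(prompt) + n_new
    target_passes = 0
    while len(tokens) < goal:
        # 1. The cheap drafter proposes k tokens in a row.
        proposal: List[int] = []
        for _ in range(k):
            proposal.append(draft(tokens + proposal))
        # 2. The target checks the proposal (counted as one verification pass).
        target_passes += 1
        accepted: List[int] = []
        for guess in proposal:
            correct = target(tokens + accepted)
            accepted.append(correct)  # always keep the target's own token
            if guess != correct:
                break  # first mismatch: discard the remaining guesses
        else:
            # Every guess matched, so the pass also yields one bonus token.
            accepted.append(target(tokens + accepted))
        tokens.extend(accepted)
    return tokens[:goal], target_passes

result, passes = speculative_generate(target_next, draft_next, [2], n_new=12)
print(result, passes)
```

Only tokens the target itself would have produced are kept, so the output matches plain greedy decoding with the heavy model. The speedup comes from needing fewer heavy-model passes than tokens whenever the drafter guesses well.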
