ABC Tool

Google’s latest trick gets Gemma 4 running 3x faster right on your phone

Posted on May 6, 2026 by safdargal12

TL;DR

  • Google has introduced new assistant models, called “drafters,” that could significantly speed up Gemma 4.
  • Drafters work by guessing several upcoming words of the response at once, so the main model can verify them in one batch instead of generating them one at a time.
  • This lets the model use memory bandwidth and compute more efficiently.

Google’s recently launched Gemma 4 edge AI models are designed specifically to run locally on consumer hardware. While that is favorable from a privacy standpoint, local models can easily hog resources and return results slowly, which limits their usefulness. Google is now offering a potential solution, which it claims can speed up Gemma 4 models by up to three times.

Google recently released Multi-Token Prediction (MTP) drafters for Gemma 4. These drafters are essentially smaller, assistive models that help the primary model by “predicting” the next few words of its response. They run alongside the main model so that the available compute is used more effectively.

How does MTP improve Gemma 4?

The process uses a technique called “speculative decoding,” in which the drafter model predicts the next several words of the response before the main Gemma model has generated them. The main model then verifies the whole predicted set in a single pass, while the drafter moves on to the next sequence of words.

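If the main model accepts the drafted words, it moves on to verify the next set. If it disagrees at some point, it keeps the words up to that point, substitutes its own choice for the first incorrect word, and discards the rest of the draft.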
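The accept-or-replace loop described above can be sketched with two toy functions standing in for the models. This is a minimal illustration of greedy speculative decoding in general, not Google's MTP implementation: `target_next`, `draft_next`, and the draft length `k` are all hypothetical, and the toy "models" are simple arithmetic rules chosen so that the drafter is usually, but not always, right.

```python
from typing import Callable, List, Tuple

Token = int
NextToken = Callable[[List[Token]], Token]


def target_next(context: List[Token]) -> Token:
    """Hypothetical stand-in for the main model: a fixed arithmetic rule."""
    return (context[-1] * 3 + len(context)) % 7


def draft_next(context: List[Token]) -> Token:
    """Hypothetical drafter: matches the target except at every 4th position."""
    guess = target_next(context)
    return (guess + 1) % 7 if len(context) % 4 == 0 else guess


def greedy_decode(prompt: List[Token], n_new: int, target: NextToken) -> List[Token]:
    """Baseline: the main model generates one token per pass."""
    out = list(prompt)
    for _ in range(n_new):
        out.append(target(out))
    return out


def speculative_decode(
    prompt: List[Token], n_new: int, k: int, target: NextToken, draft: NextToken
) -> Tuple[List[Token], int]:
    """Draft k tokens, verify them, keep the agreed prefix. Returns (tokens, passes)."""
    out = list(prompt)
    goal = len(prompt) + n_new
    verify_passes = 0
    while len(out) < goal:
        # Step 1: the drafter proposes k tokens, one after another.
        proposal: List[Token] = []
        for _ in range(k):
            proposal.append(draft(out + proposal))
        # Step 2: the main model checks the proposal. A real model would score
        # all k positions in one batched forward pass; here we count it as one.
        verify_passes += 1
        accepted: List[Token] = []
        for tok in proposal:
            expected = target(out + accepted)
            if tok == expected:
                accepted.append(tok)
            else:
                accepted.append(expected)  # replace the first wrong token
                break                      # and discard the rest of the draft
        else:
            # Every drafted token was accepted, so the same pass also yields
            # the main model's own next token.
            accepted.append(target(out + accepted))
        out.extend(accepted)
    return out[:goal], verify_passes


if __name__ == "__main__":
    tokens, passes = speculative_decode([1, 2], 12, 4, target_next, draft_next)
    print(tokens == greedy_decode([1, 2], 12, target_next), passes)
```

Because every token that is kept is one the main model would have chosen anyway, the output matches plain one-token-at-a-time decoding; only the number of main-model passes changes.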

The extra work may sound counterproductive, but it is not. Let me give you an oversimplified explanation of why MTP works.

Processing speed is determined not just by the compute hardware (typically GPU cores) but also by memory bandwidth, that is, how quickly data can be read from memory such as VRAM. That’s because the model’s weights have to be read from memory for every word it generates. So, when multiple words are verified as a single chunk, the weights are read only once rather than once per word, which shifts the load from memory to the processing unit.
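A back-of-the-envelope calculation makes the bandwidth argument concrete. The sketch below assumes generation is purely memory-bound, so one pass over the model takes about as long as reading all of its weights once. The function name and every number in it (a 4 GB model, 50 GB/s of bandwidth, three accepted words per pass) are hypothetical and are not figures from Google; the drafter's own cost is ignored.

```python
def tokens_per_second(
    weight_bytes: float,
    bandwidth_bytes_per_s: float,
    tokens_per_pass: float = 1.0,
) -> float:
    """Estimate throughput when each pass must read all weights from memory once."""
    seconds_per_pass = weight_bytes / bandwidth_bytes_per_s
    return tokens_per_pass / seconds_per_pass


if __name__ == "__main__":
    weights = 4e9      # hypothetical: 4 GB of model weights
    bandwidth = 50e9   # hypothetical: 50 GB/s of memory bandwidth

    baseline = tokens_per_second(weights, bandwidth)
    # Hypothetical: on average 3 drafted words are accepted per verification pass.
    speculative = tokens_per_second(weights, bandwidth, tokens_per_pass=3.0)

    print(f"baseline:    {baseline:.1f} tokens/s")
    print(f"speculative: {speculative:.1f} tokens/s")
    print(f"speedup:     {speculative / baseline:.1f}x")
```

Under these assumptions the speedup equals the average number of words accepted per pass, which is why the drafter's accuracy matters: the more often its guesses are rejected, the closer throughput falls back toward the baseline.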

In addition to making these changes, Google says it is also working to optimize Gemma 4 models of different sizes for specific hardware, such as Apple Silicon or the popular Nvidia A100.

The MTP drafters for Gemma 4, alongside the primary model, are available through platforms such as Hugging Face or Kaggle, tools like Ollama, or Google’s own AI Edge Gallery on Android or iOS.
