
Google Releases Gemma 4: Frontier AI on Your Phone, Laptop, and RTX GPU — No Cloud Required

On April 2, 2026, Google DeepMind released Gemma 4 under Apache 2.0 — four open-weight models from E2B to 31B that run fully offline on smartphones, laptops, and consumer GPUs. The 31B model ranks #3 among all open models globally. The E2B runs on any modern phone.

April 3, 2026 · 6 min read

Google DeepMind released Gemma 4 on April 2, 2026, under the Apache 2.0 license. Four models, four deployment targets, one consistent architecture — and every single one runs entirely offline, on hardware most people already own.

The 31B dense model currently ranks third among all open-weight language models on the Arena AI text leaderboard, beating models 20 times its parameter count. The 26B Mixture-of-Experts variant sits sixth — while activating only 3.8 billion parameters during inference, running faster than most 4B models. At the small end, the E2B model runs on any modern smartphone with under 1.5 GB of memory. The E4B fits on an 8GB laptop.

Available today. Free. Weights downloadable from Hugging Face, Kaggle, and Ollama.


Four Models, Four Hardware Targets

Model   | Architecture      | Active Params  | Context | Best For
E2B     | Dense             | 2.3B effective | 128K    | Smartphones, offline
E4B     | Dense             | 4.5B effective | 128K    | 8GB laptops, edge
26B A4B | MoE (128 experts) | 3.8B active    | 256K    | Consumer GPU, fast
31B     | Dense             | 31B            | 256K    | Workstation, max quality

The "E" in E2B and E4B stands for "effective parameters" — these models use Per-Layer Embeddings to give a model with 2.3B active parameters the representational depth of 5.1B parameters, fitting in minimal memory without sacrificing the quality you'd expect at that size.

The 26B A4B MoE activates only 3.8B of 26B total parameters per forward pass, achieving about 97% of the dense 31B model's quality while running nearly as fast as a 4B model. For anyone who wants near-flagship performance on a single 24GB consumer GPU — an RTX 3090 or 4090 — this is the model.

Hardware requirements (4-bit quantized):

  • E2B: 1.5 GB RAM — runs on most smartphones
  • E4B: 8 GB RAM — runs on most modern laptops
  • 26B A4B: 18 GB RAM — single 24GB GPU
  • 31B: 20 GB RAM (4-bit) or single 80GB H100 (full precision)
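The figures above can be sanity-checked with back-of-envelope arithmetic: at 4-bit quantization, each parameter costs half a byte. A minimal sketch, assuming a ~20% runtime overhead factor (that factor, and the function itself, are illustrative — the listed requirements also include KV cache and framework overhead, which is why they run a few GB higher):

```python
def quantized_footprint_gb(total_params_billions: float, bits: int = 4,
                           overhead: float = 1.2) -> float:
    """Rough weight-only memory estimate for a quantized model.

    bits/8 bytes per parameter, plus ~20% assumed runtime overhead.
    KV cache for long contexts adds more on top, so the published
    requirements above sit a few GB above this lower bound.
    """
    return total_params_billions * (bits / 8) * overhead

# 26B total parameters at 4-bit: roughly 15.6 GB of weights
weights_26b = round(quantized_footprint_gb(26), 1)
weights_31b = round(quantized_footprint_gb(31), 1)
```

For the 26B MoE this lands around 15.6 GB, and around 18.6 GB for the dense 31B — consistent with the 18 GB and 20 GB figures once cache and activations are counted.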

What Gemma 4 Can Do

Every model in the family handles text and images. The two smaller models — E2B and E4B — also process audio natively.

Reasoning with thinking mode. All Gemma 4 models have a configurable chain-of-thought reasoning mode. Activate it for complex logic, math, and multi-step problems. Disable it for fast conversational responses. The 31B model scored 89.2% on AIME 2026 mathematics competition problems — compared to 20.8% for Gemma 3 27B. That is not an incremental improvement; it is a different generation.
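In practice, a per-request toggle looks something like the sketch below. Note the `reasoning` field name is a placeholder of ours, not a documented parameter — the actual switch (a chat-template flag or sampler option) depends on the inference runtime you use:

```python
def build_chat_request(prompt: str, thinking: bool) -> dict:
    """Build a chat request body with reasoning toggled per call.

    The "reasoning" key is hypothetical: the real mechanism for
    enabling thinking mode varies by runtime and is documented
    per framework, not fixed by the model weights.
    """
    return {
        "messages": [{"role": "user", "content": prompt}],
        "reasoning": thinking,  # hypothetical field name
    }

fast = build_chat_request("What's the capital of Norway?", thinking=False)
deep = build_chat_request("Prove that sqrt(2) is irrational.", thinking=True)
```

The point of the pattern: one deployed model, two latency/quality profiles, chosen request by request.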

Native function calling. All models support structured JSON function calls across all modalities. You can show the model an image and ask it to call an API based on what it sees. This works in every size from E2B to 31B — no separate fine-tuning required.
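Since Gemma 4 is served through OpenAI-compatible endpoints (see the llama.cpp section below), a reasonable sketch uses the OpenAI-style tool schema and tool-call response shape — treat both as assumptions about your runtime rather than a Gemma-specific format. The `get_weather` tool is invented for illustration:

```python
import json

# Hypothetical tool definition in the OpenAI-style schema many
# local servers accept; the exact format may differ by runtime.
WEATHER_TOOL = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}

def parse_tool_call(raw: str) -> tuple[str, dict]:
    """Parse a model-emitted tool call of the common shape
    {"name": ..., "arguments": "<JSON-encoded string>"}."""
    call = json.loads(raw)
    return call["name"], json.loads(call["arguments"])

# Illustrative response shape only — not captured model output:
name, args = parse_tool_call(
    '{"name": "get_weather", "arguments": "{\\"city\\": \\"Oslo\\"}"}'
)
```

The multimodal twist the article describes — "call an API based on what it sees" — just means the user message carrying the image precedes a tool call like this one.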

Long context. E2B and E4B handle 128K tokens. The 26B and 31B handle 256K — enough to process entire codebases, long legal documents, or hours of meeting transcript in a single prompt.
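A quick way to gauge whether a document fits: English text averages roughly 4 characters per token (a rule of thumb, not a tokenizer guarantee — code and non-Latin scripts often tokenize less efficiently):

```python
def fits_in_context(num_chars: int, context_tokens: int = 256_000,
                    chars_per_token: float = 4.0) -> bool:
    """Back-of-envelope check: does a text blob fit in the window?

    Assumes ~4 chars/token, a common heuristic for English prose;
    run the real tokenizer for anything near the limit.
    """
    return num_chars / chars_per_token <= context_tokens

# A ~1 MB text dump (~250K tokens) just fits in a 256K window;
# the same dump would overflow the 128K window of E2B/E4B.
big_doc_fits = fits_in_context(1_000_000)
fits_small_window = fits_in_context(1_000_000, context_tokens=128_000)
```

So "entire codebases" is literal for mid-sized projects: a megabyte of source is within reach of the 256K models in one prompt.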

Video and audio. All models can analyze video by processing frames. E2B and E4B process audio directly — up to 30 seconds per input — enabling speech recognition and audio understanding at the edge without a cloud call.
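The 30-second cap means longer recordings must be windowed before transcription. A minimal chunking sketch (the function and its defaults are ours, matching only the per-input limit stated above):

```python
def chunk_audio(duration_s: float, max_chunk_s: float = 30.0):
    """Split a recording into consecutive (start, end) windows of
    at most 30 seconds each, matching the per-input audio limit."""
    chunks, start = [], 0.0
    while start < duration_s:
        end = min(start + max_chunk_s, duration_s)
        chunks.append((start, end))
        start = end
    return chunks

# A 75-second clip becomes three model inputs:
windows = chunk_audio(75)  # [(0.0, 30.0), (30.0, 60.0), (60.0, 75.0)]
```

For speech recognition you would feed each window in order and concatenate the transcripts; overlapping windows slightly can help avoid clipping words at boundaries.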

140+ languages. Trained natively on 140 languages with a January 2025 data cutoff.


Running Gemma 4 Locally

Google shipped day-one support across every major local inference framework.

For laptop and desktop users: download from Ollama with ollama pull gemma4:e4b (or your preferred size), or load in LM Studio with a few clicks. For Apple Silicon Mac users: MLX with mlx-vlm supports all four sizes including full multimodal capability.

For developers who want the full API surface: llama.cpp serves Gemma 4 through an OpenAI-compatible server. This means any application already built for OpenAI's API can switch to a local Gemma 4 instance by changing the base URL — no other code changes.
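The base-URL switch can be sketched with nothing but the standard library — build the same `/v1/chat/completions` request an OpenAI client would send, pointed at localhost. The port and model tag below are assumptions (llama.cpp's server defaults vary by how you launch it):

```python
import json

# Assumed local server address; replace with wherever your
# llama.cpp (or other OpenAI-compatible) server is listening.
BASE_URL = "http://localhost:8080/v1"

def chat_request(base_url: str, model: str, prompt: str):
    """Build the URL and JSON body for an OpenAI-compatible chat
    completion. Send with urllib.request or any HTTP client; only
    the base URL differs from a hosted OpenAI call."""
    url = f"{base_url}/chat/completions"
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    })
    return url, body

url, body = chat_request(BASE_URL, "gemma4:e4b", "Summarize this file.")
```

An existing OpenAI-based application does the equivalent by changing one constructor argument — the client's `base_url` — and nothing else.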

For Android development: the E2B and E4B models are available today through the AICore Developer Preview and are the foundation for the next-generation Gemini Nano 4 on-device experience that will ship on production Android devices later this year.

NVIDIA has optimized Gemma 4 for RTX GPUs specifically, with the 26B A4B running on a single RTX 3090 and the 31B fitting on an RTX 4090 with 4-bit quantization. NVIDIA distributes these through the RTX AI Garage. AMD provides day-one support across Radeon and Ryzen AI hardware through llama.cpp and LM Studio.


Why This Matters Beyond the Benchmarks

The practical shift Gemma 4 represents is not just about benchmark scores. It is about what becomes possible when a frontier-tier multimodal reasoning model is free, offline, and fits on hardware that hundreds of millions of people already own.

Privacy-sensitive workflows — medical, legal, personal — can now use a model that never sends data to any cloud. Consumer applications can ship with on-device intelligence that does not require an API key or an internet connection. Developers in markets with limited internet access can build production AI applications without dependency on external infrastructure.

Google's stated framing is removing the "token tax" — the per-query cost that makes cloud AI economically prohibitive for high-volume use cases. When the model runs locally, the marginal cost of each inference is zero. This changes what is economically viable to build.

The Gemma family has also surpassed 200 million downloads — a milestone Google noted just before the Gemma 4 launch. The developer community building on and fine-tuning Gemma models is significant, and Gemma 4's Apache 2.0 license allows commercial use, fine-tuning, and redistribution without restriction.


Get Started

Model weights: Hugging Face — google/gemma-4-31B-it, google/gemma-4-26B-A4B-it, google/gemma-4-E4B-it, google/gemma-4-E2B-it

Quick install via Ollama: ollama run gemma4:e4b

Android Developer Preview: AICore Developer Preview program at developer.android.com

Technical documentation: ai.google.dev/gemma

