The Gemma family crossed 200 million downloads sometime in the week before this announcement. Google noted the milestone in the Gemma 4 launch post, not as a celebration but as context - the open-weight model ecosystem that Gemma competes in is large, active, and choosing between increasingly capable alternatives. The question Gemma 4 is trying to answer is whether Google's open models can stay competitive with the ones from Meta, Alibaba, and the growing Chinese open-source community while also delivering something they cannot: a model specifically optimized for on-device deployment across the hardware people actually own.
Gemma 4 launched April 2, 2026 under Apache 2.0. Four models, free to download, free to use commercially, free to fine-tune and redistribute. The weight files are on Hugging Face, Kaggle, and Ollama. Day-one inference support across every major local framework: llama.cpp, Ollama, LM Studio, MLX, vLLM, Hugging Face Transformers, and mistral.rs. Day-one Android integration through the AICore Developer Preview. NVIDIA optimization for RTX GPUs. AMD optimization for Radeon and Ryzen AI hardware.
This is available today, on hardware most people already own.
The Four Models and Who They Are For
Understanding Gemma 4 requires understanding that the four models are not variations on a single design - they represent genuinely different architectural approaches optimized for different deployment environments.
E2B (Effective 2B) is built for smartphones. The "E" stands for effective parameters: the model uses Per-Layer Embeddings (PLE) to achieve the representational depth of a 5.1B model while activating only 2.3B parameters during inference. With 4-bit quantization, it fits in under 1.5 GB of memory - within reach of most modern smartphones. Battery usage in Google's internal testing on a Pixel 9 Pro was 0.75% per 25 conversations. This is not a compromised mobile experience: E2B handles text, images, and audio (audio is unique to the E2B and E4B tier in the Gemma 4 family), with a 128K token context window and full function calling support. Any app that needs a capable multimodal AI operating entirely offline, without an API key, without internet, fits here.
E4B (Effective 4B) is built for laptops. 4.5B effective parameters, 8 GB RAM at 4-bit quantization - the memory requirement of a standard modern laptop. E4B adds slightly more reasoning depth than E2B while maintaining the audio input and the 128K context window. For developers building applications on laptops, for AI assistants that work offline, for any use case where a phone is too constrained but a server is overkill, E4B is the appropriate tier.
26B A4B (MoE) is built for consumer GPU workstations. The "A" stands for active parameters: this is a Mixture-of-Experts model with 128 specialized expert networks, of which only 8 plus 1 shared expert activate for any given token. In practice, only 3.8 billion parameters fire during each forward pass, despite the model containing 26 billion total parameters. This means it runs nearly as fast as a 4B model while achieving approximately 97% of the 31B Dense model's quality. At 4-bit quantization, it fits in 18 GB of RAM - a single RTX 3090 or equivalent. On the Arena AI text leaderboard, it ranks 6th among all open models globally.
31B Dense is the quality ceiling. Maximum reasoning depth, maximum accuracy, the full 31B parameters engaged for every forward pass. 20 GB RAM at 4-bit quantization, 34 GB at 8-bit. Fits on a single RTX 4090 at 4-bit, or on a single 80GB H100 at full precision. Currently ranks 3rd among all open models globally on the Arena AI text leaderboard, ahead of models from Meta, Alibaba, and Mistral that are 20 times its parameter count. On AIME 2026 mathematics competition problems, it scores 89.2% - compared to Gemma 3 27B's 20.8% on the same benchmark.
The Benchmark That Tells the Story
The AIME 2026 comparison is worth dwelling on because it captures something important that benchmark tables often obscure.
Gemma 3 27B scored 20.8% on AIME 2026 mathematics problems. Gemma 4 31B scored 89.2%. These are successive models from the same company, and the difference is not a marginal improvement - it is a generational shift in reasoning capability. AIME (American Invitational Mathematics Examination) problems require multi-step logical reasoning, not pattern matching. A score of 89.2% indicates that the model can reliably work through problems that require constructing a solution path rather than retrieving a memorized answer.
For AI image and video creators, high-level mathematics scores are not directly relevant. But reasoning capability is. The same architectural properties that enable 89.2% on AIME - the thinking mode, the hybrid attention mechanism that balances local and global context, the function calling that can chain multiple operations together - are also what enables the model to understand complex, multi-step generation instructions, to plan a creative workflow from a high-level description, and to generate structured output that can be integrated into production pipelines.
On Codeforces (competitive programming), Gemma 4 31B scored an Elo of 2150 - a score that represents genuinely competitive performance against human programmers at intermediate level. Gemma 3 27B had scored 110 on the same metric. Again, these are successive models from the same company.
The improvements in Gemma 4 are not incremental. They reflect a generation-level architectural change.
What All Four Models Can Do
Every Gemma 4 model shares a common capability floor, with the smaller models adding audio and the larger models adding extended context.
Reasoning mode. All four models support configurable chain-of-thought thinking. Activate it for complex multi-step tasks. Disable it for fast conversational responses. When thinking mode is active, the model generates an internal reasoning trace before producing the final answer - visible as thought blocks in the output. For creative production workflows that involve complex multi-step planning, this mode produces meaningfully better results.
Visual input. All four models process images and video frames natively. Variable resolution support, with a configurable visual token budget that trades detail for processing speed. Video processing up to 60 seconds at 1 frame per second for the 26B and 31B models.
Audio input. E2B and E4B process audio natively, up to 30 seconds per input, with automatic speech recognition and audio understanding. The larger 26B and 31B models do not have native audio - they process text and images only.
Function calling. Native function calling with structured JSON output across all four models and all modalities. Show the model an image and ask it to call a weather API for the location depicted. Ask it to extract structured data from a document and format it as JSON. Ask it to chain multiple tool calls to complete a complex task. This works in every size from E2B to 31B.
140+ languages. The model is trained natively on 140+ languages, not fine-tuned from English. This means non-English performance is a primary capability rather than a secondary adaptation.
Running Gemma 4 Locally: What You Actually Need
The practical hardware requirements at 4-bit quantization:
- E2B: 5 GB total memory - any modern smartphone or any laptop made in the last 5 years
- E4B: 8 GB total memory - any modern laptop with 8GB RAM
- 26B A4B: 18 GB total memory - RTX 3090, RTX 4080 Super, or equivalent
- 31B Dense: 20 GB total memory (4-bit) - RTX 4090, or two GPUs with NVLink
On Apple Silicon, all four models run via MLX with full multimodal support including audio for E2B and E4B. On Windows, llama.cpp and LM Studio handle all four. On Linux, vLLM for server deployments, llama.cpp for local inference.
The quickest start: ollama run gemma4:e4b downloads and serves the E4B model through an OpenAI-compatible API on localhost. Any application already built for OpenAI's API can switch to local Gemma 4 by changing the base URL and removing the API key requirement. No other code changes.
What This Means Beyond the Benchmarks
The practical shift Gemma 4 represents goes beyond benchmark scores. It is about what becomes economically and logistically viable when a frontier-tier multimodal reasoning model is free, offline-capable, and fits on hardware that hundreds of millions of people already own.
Privacy-sensitive workflows - medical, legal, personal - can now use a model that never sends data to any cloud. Applications can ship with on-device intelligence that does not require an API key, an internet connection, or a per-query cost. Developers in markets with limited or expensive internet access can build production AI applications without dependency on external infrastructure. Organizations with strict data residency requirements can run capable models within their controlled environments without compromise.
Google's stated framing is eliminating the "token tax" - the per-query cost that makes cloud AI economically prohibitive at high volume. When the model runs locally, the marginal cost of each inference approaches zero. This changes what is economically viable to build, not just technically possible.
The 200 million download figure for the Gemma family is a meaningful number. The developer ecosystem that has been building on and fine-tuning Gemma models is real, active, and growing. Gemma 4's Apache 2.0 license - which allows commercial use, fine-tuning, and redistribution without restriction - means that everything built on Gemma 3 by that community can be rebuilt on Gemma 4 without licensing complications.
The model weights are available now: google/gemma-4-31B-it, google/gemma-4-26B-A4B-it, google/gemma-4-E4B-it, and google/gemma-4-E2B-it on Hugging Face. Android Developer Preview enrollment is open at developer.android.com. The technical documentation is at ai.google.dev/gemma.
For the broader context of where Gemma 4 fits alongside Nano Banana Pro, GPT Image 1.5, and the rest of the current AI model landscape, the AI image generation guide for 2026 covers the full picture. The Gemini Flash image models guide covers Google's broader open and API-available model stack.
