Building with Gemma 4 on Android: What I Learned Running a Multimodal Model On-Device

Your phone is no longer just a client. It is becoming the compute layer.

No API calls. No cloud inference. No data leaving the device.

When Gemma 4 dropped in April 2026, I wanted to go beyond demos and understand what on-device AI actually feels like in production. So I built a real app. One that scans, translates, and structures recipes in 140+ languages, entirely offline.

Here's what broke, what worked, and what building truly private AI on Android looks like.

What Is Gemma 4?

Gemma 4 is an AI model small enough to run entirely inside your phone. No internet connection needed. No subscription. Everything happens on the device.

Google released it on April 2, 2026, under a fully open licence (Apache 2.0) for the first time in the Gemma family. That means anyone can use it, build products with it, or modify it, with few conditions beyond keeping the licence and attribution notices.

The biggest change for this project: Gemma 4 can see. A text-only model, such as the smallest Gemma 3 variants, can only read and respond to text. Gemma 4 can look at a photo and understand what is in it. Some variants can even process audio. This capability of handling more than one type of input at once is called being "multimodal".

It works in over 140 languages and runs fully offline.

What Does Multimodal Mean?

Imagine you send a friend a photo of a handwritten note and ask them to summarise it. They use two things at once: their eyes to look at the image and their brain to understand and respond. That is multimodal.

A model that only reads text is like a friend who is blindfolded. You have to read the note out loud to them first, then they respond. A multimodal model can just look at the photo directly.

Simple as that.

How I Used It: Recipe Scanner

I built an app that does one thing. You point your camera at a recipe, any recipe, in any language, handwritten or printed, and it gives you back a clean structured recipe in English. Ingredients, quantities, steps, cooking time. Everything. And nothing ever leaves your phone.

Rather than describe it, here it is running live:

Watch on YouTube

If you want to dig into the full project, what went into building it, and how it is all wired together, the complete breakdown is here: Recipe Scanner — full project page.

The app first reads the text out of the photo with on-device text recognition (OCR), the same way your phone can scan a document. It runs two text recognizers at the same time and keeps whichever result contains more text. That is the trick behind covering 140+ languages without any special setup for each one.
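Here is a minimal sketch of the "run two recognizers, keep the longer result" idea in plain Java. It is not the app's actual code: `OcrPicker`, `Recognizer`, and `pickLongest` are hypothetical names, and each recognizer is modelled as a simple function from image bytes to text so the example runs without any Android dependency.

```java
import java.util.concurrent.CompletableFuture;
import java.util.function.Function;

final class OcrPicker {
    // Hypothetical stand-in for a real text recognizer: image bytes in, recognized text out.
    interface Recognizer extends Function<byte[], String> {}

    // Runs both recognizers concurrently and returns the longer trimmed result.
    // On a tie, the first recognizer's result wins.
    static String pickLongest(byte[] image, Recognizer first, Recognizer second) {
        CompletableFuture<String> a =
                CompletableFuture.supplyAsync(() -> clean(first.apply(image)));
        CompletableFuture<String> b =
                CompletableFuture.supplyAsync(() -> clean(second.apply(image)));
        String textA = a.join();
        String textB = b.join();
        return textB.length() > textA.length() ? textB : textA;
    }

    // Treats a missing result as empty text so the length comparison never fails.
    private static String clean(String text) {
        return text == null ? "" : text.trim();
    }
}
```

Character count is a crude signal, but it matches the heuristic described above: a recognizer that does not understand the script tends to return little or nothing.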

Then Gemma 4 takes that raw text and figures out what is a name, what is an ingredient, what is a step, and returns everything neatly structured. A few seconds. On the phone. No internet required.
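To make the structuring step concrete, here is a rough sketch of one way to do it. Everything in it is an assumption for illustration: the `NAME:` / `INGREDIENT:` / `STEP:` / `TIME:` line format, the prompt wording, and the `RecipeParser` and `Recipe` names are hypothetical, and the call to the model itself is left out. The idea is to send a short prompt that asks for a rigid line format, then parse the reply into a plain data object.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

final class RecipeParser {
    // Hypothetical structured result: the fields mirror what the app shows.
    static final class Recipe {
        final String name;
        final List<String> ingredients;
        final List<String> steps;
        final String cookingTime;

        Recipe(String name, List<String> ingredients, List<String> steps, String cookingTime) {
            this.name = name;
            this.ingredients = Collections.unmodifiableList(ingredients);
            this.steps = Collections.unmodifiableList(steps);
            this.cookingTime = cookingTime;
        }
    }

    // Builds a short prompt (assumed wording) around the raw OCR text.
    static String buildPrompt(String ocrText) {
        return "Rewrite this recipe in English. One item per line, prefixed with "
                + "NAME:, INGREDIENT:, STEP: or TIME:.\n\n" + ocrText;
    }

    // Parses the assumed line format; lines without a known prefix are ignored.
    static Recipe parse(String modelOutput) {
        String name = "";
        String time = "";
        List<String> ingredients = new ArrayList<>();
        List<String> steps = new ArrayList<>();
        for (String line : modelOutput.split("\\R")) {
            String t = line.trim();
            if (t.startsWith("NAME:")) name = after(t, "NAME:");
            else if (t.startsWith("INGREDIENT:")) ingredients.add(after(t, "INGREDIENT:"));
            else if (t.startsWith("STEP:")) steps.add(after(t, "STEP:"));
            else if (t.startsWith("TIME:")) time = after(t, "TIME:");
        }
        return new Recipe(name, ingredients, steps, time);
    }

    // Returns the trimmed text that follows the given prefix.
    private static String after(String line, String prefix) {
        return line.substring(prefix.length()).trim();
    }
}
```

A rigid, line-based format like this is easy to validate: if the parsed recipe has no name or no ingredients, the app can retry or fall back to showing the raw text.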

The model file is around 2.58 GB on disk.

A Quick Word on Gemma 3

Before Gemma 4, I built a Post-Crash AI Assistant using Gemma 3. That app detects a crash, speaks to the driver, listens for a response, and sends emergency SMS if no reply comes. Gemma 3 was perfect for that because it only needed to understand and generate text.

The difference between the two apps is simple. The crash assistant only needed a model that could read. For a recipe app where the whole point is processing a photo, you need one that can also see.

What I Actually Learned

Keep your instructions short. I gave the model long detailed instructions at first. It froze. Short and direct works far better, especially when you want consistent, predictable output.
One request at a time. The model cannot handle two requests at the same moment. If two parts of the app try to use it simultaneously, things break silently with no useful error message. I had to make everything queue up and wait its turn. Debugging this was a joy. Truly. (It was not.)
Camera photos need a conversion step. Photos from the camera come in a format that the text reader cannot use directly. You have to convert them to a standard format first. The app gave me zero indication this was the problem. It just quietly failed until I figured it out.
The documentation is still catching up. Most code examples online still show the older version of the library. For the newer one, I ended up reading the actual source code more than once. A reliable sign you have arrived as a developer: when the docs give up before you do.
On-device AI teaches you to care about memory. When everything runs on the phone, you have to think about what is sitting in memory at any moment. The model, the photo, the text reader, all running together. Cloud services hide this from you completely. On-device does not. And I think that makes you a more careful developer.
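The "one request at a time" lesson can be sketched with a single-threaded executor. This is a generic pattern, not the app's code: `SerializedModel` is a hypothetical wrapper, and the model is represented by a plain `Function<String, String>` standing in for whatever inference call the real library exposes. Every caller submits work to the same one-thread queue, so two requests can never reach the model at the same moment.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Function;

final class SerializedModel implements AutoCloseable {
    // One worker thread means requests run strictly one after another, in submission order.
    private final ExecutorService worker = Executors.newSingleThreadExecutor(runnable -> {
        Thread t = new Thread(runnable, "inference-worker");
        t.setDaemon(true); // do not keep the process alive just for the queue
        return t;
    });

    // Hypothetical stand-in for the real inference call: prompt in, response out.
    private final Function<String, String> model;

    SerializedModel(Function<String, String> model) {
        this.model = model;
    }

    // Queues a request; the future completes when this request's turn has finished.
    CompletableFuture<String> submit(String prompt) {
        return CompletableFuture.supplyAsync(() -> model.apply(prompt), worker);
    }

    // Stops accepting new requests; already queued requests still run.
    @Override
    public void close() {
        worker.shutdown();
    }
}
```

Callers from different parts of the app can all hold the same `SerializedModel` instance and call `submit` freely. The waiting happens inside the queue instead of surfacing as a silent failure.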

We are at the beginning of AI moving off the cloud and onto the device. Not eventually. Right now. The phone in your pocket can already run models that would have required a server farm five years ago. I am still figuring out the details, and I suspect I will be for a while. But that is exactly what makes it worth writing about.

Everything I am building is at ashishmehrotra.com. If you are exploring the same space, or just curious, reach out. The more people building at the edge, the better.