The AI Model That Turns Photos, PDFs, Audio, and Text Into the Same Language

Millions of people search Google Photos for "beach" every day. None of them labelled a single photo. The AI understood the pixels. It understood the meaning. That capability is built on a concept called embeddings. Google just released Gemini Embedding 2, a new embedding model that works across photos, text, PDFs, and audio. And it is now available to any developer. I built a search engine with it from scratch to understand exactly how it works.

What is an Embedding?

Imagine you could translate every photo, every PDF, every voice note, and every text message into the same secret language. A language made of numbers.

A beach photo becomes a list of 3,072 numbers. A text note that says "sunny tropical holiday" becomes a different list of 3,072 numbers. But because they are about the same thing, those two lists of numbers are very close to each other.

A mountain photo? Its numbers look very different from both of those.

This translation is called an embedding. The model that does the translating is called an embedding model. And the list of numbers that comes out is called a vector.

Here is the part that matters: once everything is a vector, you can search by meaning. You type "beach vacation", that phrase gets converted to a vector, and the system finds whatever vectors are closest to it. The photo of a beach and the text note about holidays both end up near your search. The mountain photo does not.
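
To make "closest" concrete, here is a minimal sketch of that ranking step using cosine similarity. The three-number vectors and file names are made up for illustration; real vectors from the model have 3,072 numbers, and the query vector here just stands in for the embedding of "beach vacation".

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy 3-number vectors, invented for this example (not real model output).
items = {
    "beach_photo.jpg": [0.9, 0.1, 0.2],
    "holiday_note.txt": [0.8, 0.2, 0.3],
    "mountain_photo.jpg": [0.1, 0.9, 0.3],
}

# Stand-in for the embedding of the query "beach vacation".
query_vector = [0.85, 0.15, 0.25]

def rank(query, stored):
    """Return (name, score) pairs, most similar first."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in stored.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

ranked = rank(query_vector, items)
for name, score in ranked:
    print(f"{name}: {score:.3f}")
```

The beach photo and the holiday note land at the top and the mountain photo lands last, purely because of where the vectors point.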

No labels. No keywords. Just the numbers, and how close they are.

What Makes Gemini Embedding 2 Different

Most embedding models before this handled one thing at a time. Text models understood text. Image models understood images. If you wanted to search across both, you usually needed two separate models and some way to bridge them.

Gemini Embedding 2 (gemini-embedding-2) is the first model that natively handles all of them together: text, images, audio, PDFs, and video. One model. One number space. You pass it a JPEG of a beach, it gives you 3,072 numbers. You pass it a text phrase describing the same scene, it gives you 3,072 different but nearby numbers. Same space. Same model. Completely different input formats.

This is not two models stitched together. It was trained on all modalities at once. That is the key distinction. The numbers mean the same thing whether they came from a photo or a sentence.

How Gemini Embedding 2 Works

This is the core idea. Every format goes in. One vector in a single shared space comes out.

There is no intermediate step. Images are not converted to captions first, and no second model sits behind the scenes. A single model reads every format directly and outputs numbers that mean the same thing whether they came from a sentence or a photo.

How I Used It: MemoryVault

I built MemoryVault, a full-stack web app that stores anything you upload as a vector, then searches across everything by meaning.

Upload three beach photos and a text note about a beach holiday. Upload two mountain photos. Never label any of them.

Search "warm beach vacation". The three beach photos and the text note appear at the top. The mountain photos score lower and stay out of the results.

Search "snowy mountain trek". Only the mountain photos appear.

That is the demo. The same model converted a JPEG and a text sentence into vectors, placed them in the same space, and matched them by proximity. The source format is irrelevant. Meaning is everything.
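
The store-then-search flow above can be sketched in a few lines. This is not the MemoryVault code: TinyVault, toy_embed, and the word lists are hypothetical names invented for this example. The embedding function is passed in, so a real app would plug in a call to the embedding model and hand it the raw file. The stand-in below only counts a few hand-picked words in a short text description, so it understands nothing; it exists to make the flow runnable.

```python
import math
from typing import Any, Callable, List, Tuple

Vector = List[float]

def _cosine(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms if norms else 0.0

class TinyVault:
    """Hypothetical in-memory store: keeps (name, vector) pairs, ranks by cosine."""

    def __init__(self, embed_fn: Callable[[Any], Vector]):
        self.embed_fn = embed_fn  # assumption: maps any input to a vector
        self.items: List[Tuple[str, Vector]] = []

    def add(self, name: str, content: Any) -> None:
        # Only the vector is needed for search; the source format is forgotten.
        self.items.append((name, self.embed_fn(content)))

    def search(self, query: Any, top_k: int = 3) -> List[Tuple[str, float]]:
        q = self.embed_fn(query)
        scored = [(name, _cosine(q, vec)) for name, vec in self.items]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        return scored[:top_k]

# Stand-in embedder: counts words from two hand-picked lists.
BEACH = {"beach", "sand", "palm", "tropical", "warm", "sea"}
MOUNTAIN = {"mountain", "snow", "snowy", "trek", "peak"}

def toy_embed(text: str) -> Vector:
    words = text.lower().split()
    return [
        float(sum(w in BEACH for w in words)),
        float(sum(w in MOUNTAIN for w in words)),
        1.0,  # constant component so no vector is all zeros
    ]

vault = TinyVault(toy_embed)
vault.add("beach1.jpg", "beach sand sea")  # description stands in for the photo
vault.add("note.txt", "warm tropical beach holiday")
vault.add("mountain1.jpg", "snowy mountain peak")

beach_hits = vault.search("warm beach vacation", top_k=2)
mountain_hits = vault.search("snowy mountain trek", top_k=1)
print(beach_hits)
print(mountain_hits)
```

Keeping the embedder as a parameter means the ranking logic never needs to know whether a vector came from a photo or a sentence, which is the point of the demo.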

The full project breakdown, the code, and the architecture are on the project page here.

What I Actually Learned

Descriptive queries work far better than single words. Search "beach" and you get reasonable results. Search "warm tropical beach with palm trees and clear water" and the gap between relevant and irrelevant results jumps dramatically. The model rewards specificity.

Text is more precise than images. In my tests, a text file containing "beach" scored higher against beach queries than an actual beach photo did. Text is explicit. Images require the model to infer meaning from pixels. Both work, but text wins on raw score. This has implications for how you design what you store.

The model is API-only. When I built this, Gemini Embedding 2 did not appear anywhere in the GCP Console or Vertex AI Model Garden. It is an API endpoint you call directly with a key from Google AI Studio. The Vertex AI regional path also did not support India. If you are building with this model, the Gemini API at ai.google.dev was the only route I found.

The cosine similarity range is narrower than you expect. In the real world, nothing is ever truly "unrelated" in the embedding space. In my tests, all scores landed between 60% and 90%. A 70% score is not "70% similar" in everyday terms. It means roughly "moderately related". You need to design your thresholds with this in mind.
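
One way to design around that narrow band is to stretch it before applying a cutoff. The sketch below is an assumption on my part, not something the model or the API provides: the 0.60 and 0.90 bounds come from the range observed above, and the 0.5 cutoff and the function names are arbitrary choices you would tune on your own data.

```python
def rescale(score, low=0.60, high=0.90):
    """Map a raw cosine score from the observed [low, high] band onto [0, 1], clamped."""
    if high <= low:
        raise ValueError("high must be greater than low")
    return min(1.0, max(0.0, (score - low) / (high - low)))

def label(score, cutoff=0.5):
    """Hypothetical relevance label based on the rescaled score."""
    return "relevant" if rescale(score) >= cutoff else "not relevant"

for raw in (0.65, 0.70, 0.85):
    print(f"raw={raw:.2f} rescaled={rescale(raw):.2f} -> {label(raw)}")
```

After rescaling, a raw 70% sits about a third of the way up the usable range, which matches the "moderately related" reading better than the raw number does.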

Search has worked on keywords for thirty years. You put words in the filename. You write a caption. You add a tag. The machine finds what you labelled.

Embeddings flip that entirely. The machine understands what something is about. You do not label anything. You just search, and it finds what is closest in meaning.

We are early in this shift. But the model that can convert a photo, a PDF, a voice note, and a sentence into the same number space already exists. I built this to understand it. The more I dug in, the clearer it became that keyword search is the past.

Everything I am building is at ashishmehrotra.com. If you are exploring the same space, reach out.