RecipeScanner: Scan Any Recipe in Any Language, Fully On-Device
Android app that photographs recipes in 140+ languages and uses Gemma 4 via LiteRT-LM to translate and structure them into ingredients, steps, and cooking time. Everything runs on the phone with no internet needed.
What it does
Point your camera at any recipe. A handwritten card in Japanese, a printed page in French, a screenshot in Hindi. The app reads it, translates it, and gives you back a fully structured recipe in English: name, cuisine, ingredients with quantities, step-by-step instructions, cooking time, and difficulty level. Everything happens on the phone. Nothing leaves the device.
You can also import from your gallery. Every scan is saved to a local SQLite database so you can revisit it anytime.
Phase 1 (shipped): Live camera, gallery import, OCR, translation, structured output, scan history, delete.
Phase 2 (planned): Nutrition calculation, unit converter, ingredient substitution, store finder.
Architecture
Camera / Gallery (CameraX / ImageDecoder)
↓
ML Kit OCR (Latin + Chinese, runs in parallel)
↓
Gemma 4 E2B via LiteRT-LM (on-device inference)
↓
JSON parsing → RecipeAnalysis data class
↓
Room Database (SQLite) → UI (Jetpack Compose)
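End to end, the pipeline above can be sketched as one repository-level function. All the stage names here are hypothetical; in the real app the injected lambdas are wired to ML Kit, LiteRT-LM, the JSON parser, and Room:

```kotlin
// One scan, end to end: OCR -> LLM -> parse -> persist.
// Each stage is injected so the pipeline itself stays pure and testable;
// the concrete implementations (ML Kit, LiteRT-LM, Room) live elsewhere.
class ScanPipeline(
    private val ocr: suspend (ByteArray) -> String,
    private val llm: suspend (String) -> String,
    private val parse: (String) -> List<String>,
    private val save: suspend (List<String>) -> Unit,
) {
    suspend fun scan(imageBytes: ByteArray): List<String> {
        val text = ocr(imageBytes)
        val json = llm("Extract a structured recipe from:\n" + text)
        return parse(json).also { save(it) }
    }
}
```

Keeping each arrow in the diagram behind a function boundary is what lets the layers below swap implementations without touching the flow.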
MVVM with Clean Architecture:
- UI Layer: Jetpack Compose screens (MainScreen, CameraScreen, ResultScreen, SavedRecipesScreen)
- ViewModel Layer: RecipeViewModel manages UiState (sealed class) and ModelStatus
- Repository Layer: RecipeRepository coordinates Gemma4Manager and the Room DAO
- Data Layer: RecipeDatabase, RecipeEntity, RecipeDao
All state lives in the ViewModel. Composables are pure rendering functions. DI via Hilt throughout.
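The sealed UiState could look like the following sketch. Only the names UiState and RecipeAnalysis come from this README; the individual variants are illustrative:

```kotlin
// Sealed hierarchy for screen state. A sealed class makes `when` branches
// exhaustive, so the compiler flags any unhandled state in the composables.
sealed class UiState {
    object Idle : UiState()
    object Analyzing : UiState()
    data class Success(val recipeJson: String) : UiState() // RecipeAnalysis in the real app
    data class Error(val message: String) : UiState()
}
```

The ViewModel would hold this in a MutableStateFlow<UiState> and expose it as StateFlow, which composables collect and render without owning any state themselves.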
Tech Stack
| Layer | Technology |
|---|---|
| Language | Kotlin 2.2.0 |
| UI | Jetpack Compose + Material 3 |
| On-device LLM | Gemma 4 E2B via LiteRT-LM (litertlm-android) |
| OCR | ML Kit Text Recognition (Latin + Chinese) |
| Database | Room 2.7.1 |
| DI | Hilt 2.56 |
| Camera | CameraX 1.3.4 |
| Image Loading | Coil 2.7.0 |
| Async | Kotlin Coroutines + Flow |
AI Pipeline
OCR Phase: Both Latin and Chinese ML Kit recognizers run in parallel. Whichever returns more characters wins. Latin covers most scripts, Chinese handles CJK, which is how the app reaches 140+ languages without special-casing each one.
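The recognizer race can be sketched framework-agnostically. The two lambdas stand in for ML Kit's Latin and Chinese clients (in the real code, each would call TextRecognition.getClient(...).process(image).await()):

```kotlin
import kotlinx.coroutines.*

// Run two recognizers concurrently and keep whichever extracts more characters.
// The suspend lambdas stand in for the Latin and Chinese ML Kit recognizers.
suspend fun raceRecognizers(
    latin: suspend () -> String,
    chinese: suspend () -> String,
): String = coroutineScope {
    val a = async(Dispatchers.IO) { latin() }
    val b = async(Dispatchers.IO) { chinese() }
    val (latinText, chineseText) = a.await() to b.await()
    if (latinText.length >= chineseText.length) latinText else chineseText
}
```

Character count is a crude but effective confidence proxy: the recognizer that does not match the script returns little or nothing, so the longer result almost always corresponds to the correct script.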
LLM Inference Phase:
- Model: Gemma 4 E2B instruction-tuned (quantized, ~2.6 GB on disk, ~1.5 GB in RAM)
- Framework: LiteRT-LM (Google's on-device inference engine, successor to MediaPipe tasks-genai)
- Temperature: 0.1 (near-deterministic rather than creative; structured JSON output is required)
- Prompt asks for JSON with these fields: recipe_name, cuisine, ingredients[], steps[], original_language, cooking_time, difficulty
- 2-minute timeout guard
- Thread-safe via Mutex (LiteRT-LM sessions are not reentrant)
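The mutex and timeout guards above combine into a small wrapper. This is a sketch: InferenceGuard is a hypothetical name, and the runInference lambda stands in for the actual LiteRT-LM session call:

```kotlin
import kotlinx.coroutines.sync.Mutex
import kotlinx.coroutines.sync.withLock
import kotlinx.coroutines.withTimeout

// Serializes access to a non-reentrant inference session and bounds latency.
// `runInference` stands in for the real LiteRT-LM session call.
class InferenceGuard(private val runInference: suspend (String) -> String) {
    private val mutex = Mutex() // LiteRT-LM sessions are not reentrant

    suspend fun generate(prompt: String): String = mutex.withLock {
        withTimeout(120_000) { runInference(prompt) } // 2-minute guard
    }
}
```

Mutex.withLock suspends rather than blocks, so a second scan request queues cheaply instead of tying up a thread while the first inference runs.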
JSON Parsing: If Gemma 4 returns malformed JSON (rare), the raw response is shown as a single instruction step. Graceful degradation, no crashes.
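The degradation path can be isolated behind the parser, so a decode failure never propagates. A sketch, with the concrete JSON decoder (org.json, kotlinx.serialization, etc.) injected as a lambda; parseOrDegrade and the fallback name are illustrative:

```kotlin
data class RecipeAnalysis(val name: String, val steps: List<String>)

// `parse` is the real JSON decoder. If it throws on malformed output,
// fall back to showing the raw model response as a single instruction step.
fun parseOrDegrade(raw: String, parse: (String) -> RecipeAnalysis): RecipeAnalysis =
    runCatching { parse(raw) }
        .getOrElse { RecipeAnalysis(name = "Unparsed recipe", steps = listOf(raw)) }
```

runCatching keeps the happy path and the fallback in one expression, and the caller never needs to know whether degradation happened.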
Model Status Machine
NotInitialized → Copying (with progress %) → Ready
→ NeedsSetup (model not found, shows ADB command)
→ Error (corrupt model, IO failure)
All buttons in the UI are disabled until ModelStatus.Ready.
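The status machine maps naturally onto a sealed class, with the UI gating derived from it. The variant names follow the diagram above; the payload fields and buttonsEnabled helper are a sketch:

```kotlin
// One variant per state in the diagram; payloads carry what the UI shows
// (copy progress, the ADB setup command, or the failure cause).
sealed class ModelStatus {
    object NotInitialized : ModelStatus()
    data class Copying(val percent: Int) : ModelStatus()
    object Ready : ModelStatus()
    data class NeedsSetup(val adbCommand: String) : ModelStatus()
    data class Error(val cause: String) : ModelStatus()
}

// UI gating: every scan button is enabled only in the Ready state.
fun buttonsEnabled(status: ModelStatus): Boolean = status == ModelStatus.Ready
```

Because the check is a plain equality on the sealed type, no screen needs its own enable/disable logic; they all derive it from the one ModelStatus flow.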
Key Implementation Details
- Lazy model loading: Gemma 4 initializes on first use (not app startup), which avoids ANR on launch
- Model copy: Copies from external to internal storage once, reuses on every subsequent run
- Hardware bitmap conversion: ML Kit cannot process hardware-backed bitmaps. CameraX returns them by default, so every bitmap is converted to software-backed before inference
- Memory: android:largeHeap="true" in the manifest handles holding the large image and the model in memory simultaneously
- Image storage: JPEGs saved to app-private storage at quality 85, with paths stored in Room
- Material 3 dynamic color: Adapts to the wallpaper on Android 12+ (deep orange primary, forest green secondary)
- Parallel OCR: Latin and Chinese recognizers run concurrently on Dispatchers.IO
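The hardware-bitmap conversion mentioned above can be as small as this sketch (the toSoftware name is illustrative; Bitmap.Config.HARDWARE exists from API 26):

```kotlin
import android.graphics.Bitmap
import android.os.Build

// ML Kit rejects hardware-backed bitmaps, so copy them to a software
// config before OCR. Non-hardware bitmaps pass through untouched.
fun Bitmap.toSoftware(): Bitmap =
    if (Build.VERSION.SDK_INT >= Build.VERSION_CODES.O && config == Bitmap.Config.HARDWARE)
        copy(Bitmap.Config.ARGB_8888, /* isMutable = */ false)
    else this
```

Calling this once at the boundary where CameraX hands off the frame keeps the hardware/software distinction out of the rest of the pipeline.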
What I Learned
- Prompt length matters. Longer system prompts caused hanging in LiteRT-LM. Short and direct prompts are more reliable for structured output.
- Mutex is non-negotiable with LiteRT-LM. Without it, concurrent inference requests corrupt session state silently, with no obvious error.
- Hardware bitmaps are a footgun. CameraX returns hardware-backed bitmaps by default. ML Kit rejects them. Always convert.
- LiteRT-LM is newer than MediaPipe tasks-genai. Documentation is sparse, and most examples still reference the older API, so you are mostly reading source code.
- Hilt + Coroutines + Room is a very clean stack once wired up. All IO on Dispatchers.IO, UI state as StateFlow, composables just render.