RecipeScanner: Scan Any Recipe in Any Language, Fully On-Device
Android app that photographs recipes in 140+ languages and uses Gemma 4 via LiteRT-LM to translate and structure them into ingredients, steps, and cooking time. Everything runs on the phone with no internet needed.
What it does
Point your camera at any recipe. A handwritten card in Japanese, a printed page in French, a screenshot in Hindi. The app reads it, translates it, and gives you back a fully structured recipe in English: name, cuisine, ingredients with quantities, step-by-step instructions, cooking time, and difficulty level. Everything happens on the phone. Nothing leaves the device.
You can also import from your gallery. Every scan is saved to a local SQLite database so you can revisit it anytime.
Phase 1 (shipped): Live camera, gallery import, OCR, translation, structured output, scan history, delete.
Phase 2 (planned): Nutrition calculation, unit converter, ingredient substitution, store finder.
Architecture
Camera / Gallery (CameraX / ImageDecoder)
↓
ML Kit OCR (Latin + Chinese, runs in parallel)
↓
Gemma 4 E2B via LiteRT-LM (on-device inference)
↓
JSON parsing → RecipeAnalysis data class
↓
Room Database (SQLite) → UI (Jetpack Compose)
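End to end, the pipeline above can be sketched as one repository-level function. All the stage names here are hypothetical; in the real app the injected lambdas are wired to ML Kit, LiteRT-LM, the JSON parser, and Room:

```kotlin
// One scan, end to end: OCR -> LLM -> parse -> persist.
// Each stage is injected so the pipeline itself stays pure and testable;
// the concrete implementations (ML Kit, LiteRT-LM, Room) live elsewhere.
class ScanPipeline(
    private val ocr: suspend (ByteArray) -> String,
    private val llm: suspend (String) -> String,
    private val parse: (String) -> List<String>,
    private val save: suspend (List<String>) -> Unit,
) {
    suspend fun scan(imageBytes: ByteArray): List<String> {
        val text = ocr(imageBytes)
        val json = llm("Extract a structured recipe from:\n" + text)
        return parse(json).also { save(it) }
    }
}
```

Keeping each arrow in the diagram behind a function boundary is what lets the layers below swap implementations without touching the flow.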
MVVM with Clean Architecture:
- UI Layer: Jetpack Compose screens (MainScreen, CameraScreen, ResultScreen, SavedRecipesScreen)
- ViewModel Layer: RecipeViewModel manages UiState (sealed class) and ModelStatus
- Repository Layer: RecipeRepository coordinates Gemma4Manager and the Room DAO
- Data Layer: RecipeDatabase, RecipeEntity, RecipeDao
All state lives in the ViewModel. Composables are pure rendering functions. DI via Hilt throughout.
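The sealed UiState could look like the following sketch. Only the names UiState and RecipeAnalysis come from this README; the individual variants are illustrative:

```kotlin
// Sealed hierarchy for screen state. A sealed class makes `when` branches
// exhaustive, so the compiler flags any unhandled state in the composables.
sealed class UiState {
    object Idle : UiState()
    object Analyzing : UiState()
    data class Success(val recipeJson: String) : UiState() // RecipeAnalysis in the real app
    data class Error(val message: String) : UiState()
}
```

The ViewModel would hold this in a MutableStateFlow<UiState> and expose it as StateFlow, which composables collect and render without owning any state themselves.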
Tech Stack
| Layer | Technology |
|---|---|
| Language | Kotlin 2.2.0 |
| UI | Jetpack Compose + Material 3 |
| On-device LLM | Gemma 4 E2B via LiteRT-LM (litertlm-android) |
| OCR | ML Kit Text Recognition (Latin + Chinese) |
| Database | Room 2.7.1 |
| DI | Hilt 2.56 |
| Camera | CameraX 1.3.4 |
| Image Loading | Coil 2.7.0 |
| Async | Kotlin Coroutines + Flow |
AI Pipeline
OCR Phase: Both Latin and Chinese ML Kit recognizers run in parallel. Whichever returns more characters wins. Latin covers most scripts, Chinese handles CJK, which is how the app reaches 140+ languages without special-casing each one.
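The recognizer race can be sketched framework-agnostically. The two lambdas stand in for ML Kit's Latin and Chinese clients (in the real code, each would call TextRecognition.getClient(...).process(image).await()):

```kotlin
import kotlinx.coroutines.*

// Run two recognizers concurrently and keep whichever extracts more characters.
// The suspend lambdas stand in for the Latin and Chinese ML Kit recognizers.
suspend fun raceRecognizers(
    latin: suspend () -> String,
    chinese: suspend () -> String,
): String = coroutineScope {
    val a = async(Dispatchers.IO) { latin() }
    val b = async(Dispatchers.IO) { chinese() }
    val (latinText, chineseText) = a.await() to b.await()
    if (latinText.length >= chineseText.length) latinText else chineseText
}
```

Character count is a crude but effective confidence proxy: the recognizer that does not match the script returns little or nothing, so the longer result almost always corresponds to the correct script.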
LLM Inference Phase:
- Model: Gemma 4 E2B instruction-tuned (quantized, ~2.6 GB on disk, ~1.5 GB in RAM)
- Framework: LiteRT-LM (Google's on-device inference engine, successor to MediaPipe tasks-genai)
- Temperature: 0.1 (near-deterministic rather than creative; structured JSON output is required)
- Prompt asks for JSON with these fields: recipe_name, cuisine, ingredients[], steps[], original_language, cooking_time, difficulty
- 2-minute timeout guard
- Thread-safe via Mutex (LiteRT-LM sessions are not reentrant)
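The mutex and timeout guards above combine into a small wrapper. This is a sketch: InferenceGuard is a hypothetical name, and the runInference lambda stands in for the actual LiteRT-LM session call:

```kotlin
import kotlinx.coroutines.sync.Mutex
import kotlinx.coroutines.sync.withLock
import kotlinx.coroutines.withTimeout

// Serializes access to a non-reentrant inference session and bounds latency.
// `runInference` stands in for the real LiteRT-LM session call.
class InferenceGuard(private val runInference: suspend (String) -> String) {
    private val mutex = Mutex() // LiteRT-LM sessions are not reentrant

    suspend fun generate(prompt: String): String = mutex.withLock {
        withTimeout(120_000) { runInference(prompt) } // 2-minute guard
    }
}
```

Mutex.withLock suspends rather than blocks, so a second scan request queues cheaply instead of tying up a thread while the first inference runs.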
JSON Parsing: If Gemma 4 returns malformed JSON (rare), the raw response is shown as a single instruction step. Graceful degradation, no crashes.
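The degradation path can be isolated behind the parser, so a decode failure never propagates. A sketch, with the concrete JSON decoder (org.json, kotlinx.serialization, etc.) injected as a lambda; parseOrDegrade and the fallback name are illustrative:

```kotlin
data class RecipeAnalysis(val name: String, val steps: List<String>)

// `parse` is the real JSON decoder. If it throws on malformed output,
// fall back to showing the raw model response as a single instruction step.
fun parseOrDegrade(raw: String, parse: (String) -> RecipeAnalysis): RecipeAnalysis =
    runCatching { parse(raw) }
        .getOrElse { RecipeAnalysis(name = "Unparsed recipe", steps = listOf(raw)) }
```

runCatching keeps the happy path and the fallback in one expression, and the caller never needs to know whether degradation happened.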
Model Status Machine
NotInitialized → Copying (with progress %) → Ready
→ NeedsSetup (model not found, shows ADB command)
→ Error (corrupt model, IO failure)
All buttons in the UI are disabled until ModelStatus.Ready.
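The status machine maps naturally onto a sealed class, with the UI gating derived from it. The variant names follow the diagram above; the payload fields and buttonsEnabled helper are a sketch:

```kotlin
// One variant per state in the diagram; payloads carry what the UI shows
// (copy progress, the ADB setup command, or the failure cause).
sealed class ModelStatus {
    object NotInitialized : ModelStatus()
    data class Copying(val percent: Int) : ModelStatus()
    object Ready : ModelStatus()
    data class NeedsSetup(val adbCommand: String) : ModelStatus()
    data class Error(val cause: String) : ModelStatus()
}

// UI gating: every scan button is enabled only in the Ready state.
fun buttonsEnabled(status: ModelStatus): Boolean = status == ModelStatus.Ready
```

Because the check is a plain equality on the sealed type, no screen needs its own enable/disable logic; they all derive it from the one ModelStatus flow.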
Key Implementation Details
- Lazy model loading: Gemma 4 initializes on first use (not app startup), which avoids ANR on launch
- Model copy: Copies from external to internal storage once, reuses on every subsequent run
- Hardware bitmap conversion: ML Kit cannot process hardware-backed bitmaps. CameraX returns them by default, so every bitmap is converted to software-backed before inference
- Memory: android:largeHeap="true" in the manifest handles holding the large image and the model in memory simultaneously
- Image storage: JPEGs saved to app-private storage at quality 85, with paths stored in Room
- Material 3 dynamic color: Adapts to the wallpaper on Android 12+ (deep orange primary, forest green secondary)
- Parallel OCR: Latin and Chinese recognizers run concurrently on Dispatchers.IO
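The hardware-bitmap conversion mentioned above can be as small as this sketch (the toSoftware name is illustrative; Bitmap.Config.HARDWARE exists from API 26):

```kotlin
import android.graphics.Bitmap
import android.os.Build

// ML Kit rejects hardware-backed bitmaps, so copy them to a software
// config before OCR. Non-hardware bitmaps pass through untouched.
fun Bitmap.toSoftware(): Bitmap =
    if (Build.VERSION.SDK_INT >= Build.VERSION_CODES.O && config == Bitmap.Config.HARDWARE)
        copy(Bitmap.Config.ARGB_8888, /* isMutable = */ false)
    else this
```

Calling this once at the boundary where CameraX hands off the frame keeps the hardware/software distinction out of the rest of the pipeline.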
What I Learned
- Prompt length matters. Longer system prompts caused hanging in LiteRT-LM. Short and direct prompts are more reliable for structured output.
- Mutex is non-negotiable with LiteRT-LM. Without it, concurrent inference requests corrupt session state silently, with no obvious error.
- Hardware bitmaps are a footgun. CameraX returns hardware-backed bitmaps by default. ML Kit rejects them. Always convert.
- LiteRT-LM is newer than MediaPipe tasks-genai. Documentation is sparse, and most examples still reference the older API, so you are mostly reading source code.
- Hilt + Coroutines + Room is a very clean stack once wired up. All IO on Dispatchers.IO, UI state as StateFlow, composables just render.