AIDrivingHelper: On-Device RAG Coaching for Driving Apps
Android module that uses retrieval-augmented generation (RAG) entirely on-device to answer natural-language questions about past trips and deliver personalized driving coaching. Built to plug into apps that already record trip data, with voice input and voice output.
What this is
This is an add-on coaching layer, not a standalone trip tracking app.
Apps like DriveKit (DriveQuant), Zendrive, eDriving Mentor, and Cambridge Mobile Telematics already record every trip and collect driving events: speeding, hard braking, phone usage, rapid acceleration. They have the data. What they often lack is a natural-language interface that lets the driver actually have a conversation about their behavior.
That is what this module adds. Once the host app has trip data, this layer lets the driver ask questions like "Was I speeding more at night?" or "How was my braking this week?" and get back a real, context-aware coaching response grounded in their actual trips.
What it does
The entire pipeline runs on-device. Semantic search over stored trips, retrieval of the most relevant ones, inference via a local LLM, and voice output. No cloud. No trip data ever leaves the phone.
Architecture
User voice query (SpeechRecognizer)
↓
GeckoEmbeddingModel (on-device, 768-dim)
↓
SqliteVectorStore (semantic search over stored trips)
↓
Top-4 most relevant trips retrieved
↓
Gemma 3 1B IT (on-device LLM) + retrieved context
↓
Coaching response (text + TTS output)
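The retrieval step in the middle of this pipeline can be sketched in plain Kotlin. This is a minimal, self-contained illustration of what SqliteVectorStore does conceptually (cosine similarity over stored embeddings, top-k selection); the `TripRecord` type and tiny vectors are assumptions for the example, not the app's real schema, and real Gecko embeddings are 768-dimensional.

```kotlin
import kotlin.math.sqrt

// A stored trip summary plus its embedding vector (768-dim in the real app;
// tiny vectors here purely for illustration).
data class TripRecord(val summary: String, val embedding: FloatArray)

// Cosine similarity between two equal-length vectors.
fun cosine(a: FloatArray, b: FloatArray): Double {
    var dot = 0.0; var na = 0.0; var nb = 0.0
    for (i in a.indices) {
        dot += a[i] * b[i]
        na += a[i] * a[i]
        nb += b[i] * b[i]
    }
    return dot / (sqrt(na) * sqrt(nb))
}

// Return the k trips most semantically similar to the query embedding,
// mirroring the vector store's top-k = 4 retrieval.
fun retrieveTopK(query: FloatArray, trips: List<TripRecord>, k: Int = 4): List<TripRecord> =
    trips.sortedByDescending { cosine(query, it.embedding) }.take(k)
```

The retrieved summaries are then concatenated into the LLM's context, so the model only ever sees the trips that matched the question.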
MVVM:
- LLMViewModel manages LLM state and trip insertion
- LLMInferenceChain sets up the full RAG chain (embedder, vector store, LLM)
- TripEntity / TripDao handle Room persistence
- RAGChatUI is the Compose chat interface with voice input
Tech Stack
| Layer | Technology |
|---|---|
| Language | Kotlin |
| UI | Jetpack Compose + Material 3 |
| On-device LLM | MediaPipe tasks-genai 0.10.23 (Gemma 3 1B IT int4) |
| RAG Framework | Google AI Edge local-agents 0.2.0 |
| Embeddings | GeckoEmbeddingModel (768-dim, on-device) |
| Vector Store | SqliteVectorStore (local semantic search) |
| Database | Room 2.6.1 |
| Voice I/O | Android SpeechRecognizer + TextToSpeech API |
| Async | Kotlin Coroutines + Guava bridge |
RAG Setup
GeckoEmbeddingModel (768-dim semantic embeddings)
+
SqliteVectorStore (persistent vector DB, on-device)
+
DefaultSemanticTextMemory
↓
RetrievalAndInferenceChain (top-k=4, question-answering task)
↓
MediaPipeLlmBackend (Gemma 3 1B IT, CPU)
Retrieval is configured with top-k = 4: the four most semantically similar trips are retrieved and injected as context before inference. This keeps the context window focused and reduces the risk of the model hallucinating about trips it was not given.
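A minimal sketch of the context-injection step, assuming the retrieved trips are available as plain summary strings. The `buildPrompt` helper is illustrative, not the local-agents API; the chain performs this assembly internally.

```kotlin
// Hypothetical prompt assembly: the retrieved trip summaries are concatenated
// into a context block that precedes the user's question, so the model can
// only ground its answer in the trips it was given.
fun buildPrompt(systemPrompt: String, retrieved: List<String>, question: String): String =
    buildString {
        appendLine(systemPrompt)
        appendLine("Relevant trips:")
        retrieved.forEachIndexed { i, trip -> appendLine("${i + 1}. $trip") }
        appendLine("Question: $question")
    }
```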
Prompt Engineering
The system prompt instructs the model to:
- Act as a driving coach
- Use second-person language ("you were speeding", not "the driver was speeding")
- Only reference information from the retrieved trips, not general knowledge
- Not guess or fill in gaps
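The four rules above can live in a single system-prompt constant. The exact wording here is illustrative (the app's real prompt may differ), but it encodes each rule:

```kotlin
// Illustrative system prompt covering all four coaching rules.
val SYSTEM_PROMPT = """
    You are a driving coach. Address the driver directly in the second person
    ("you were speeding", never "the driver was speeding"). Base every statement
    only on the retrieved trips provided as context, not on general knowledge.
    If the trips do not contain the answer, say so; do not guess or fill in gaps.
""".trimIndent()
```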
Temperature: 1.0 / Top-P: 0.95 / Top-K: 64
Demo Trip Data
Seven synthetic trips are pre-loaded so the system works out of the box:
- Speeding events on highway and in residential areas
- Phone usage events
- Sudden braking events
- Hard acceleration events
Each trip is embedded and stored in SqliteVectorStore on first run.
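The seed data can be as simple as a list of summaries, one per trip, covering the event types above. Field names and summaries here are assumptions for illustration, not the app's real TripEntity schema:

```kotlin
// Illustrative demo data: seven synthetic trips covering the four event types.
data class DemoTrip(val id: Int, val summary: String)

val demoTrips = listOf(
    DemoTrip(1, "Highway trip with two speeding events above the limit"),
    DemoTrip(2, "Residential trip with speeding in a 25 mph zone"),
    DemoTrip(3, "Evening commute with several minutes of phone usage"),
    DemoTrip(4, "Morning trip with one sudden braking event"),
    DemoTrip(5, "Night drive with repeated sudden braking"),
    DemoTrip(6, "Short trip with hard acceleration from a stoplight"),
    DemoTrip(7, "Weekend trip with hard acceleration on a merge ramp")
)
```

On first run, each summary would be embedded and inserted into the vector store.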
Voice Interaction
- Input: Android SpeechRecognizer via microphone button in the chat UI
- Output: Android TextToSpeech API reads the coaching response aloud
- Flow: Speak, recognize, embed query, retrieve trips, run LLM, speak response
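The flow above can be expressed as a pure function over injected steps, which keeps the orchestration testable without Android. On device, `recognize` would wrap SpeechRecognizer, `answer` would run the RAG chain, and `speak` would wrap TextToSpeech; the function shape is an assumption, not the module's actual API.

```kotlin
// One voice turn: recognize speech, answer via RAG, speak the response.
// The three steps are injected so this logic has no Android dependency.
fun voiceTurn(
    recognize: () -> String,        // SpeechRecognizer result
    answer: (String) -> String,     // embed -> retrieve -> LLM
    speak: (String) -> Unit         // TextToSpeech output
): String {
    val query = recognize()
    val response = answer(query)
    speak(response)
    return response
}
```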
What I Learned
- RAG on Android is completely practical. The Google AI Edge local-agents library handles the hardest parts: embedding, vector search, chain orchestration. It is more capable than I expected for a mobile SDK.
- Semantic search beats keyword search. Gecko embeddings surface contextually relevant trips even when the driver's phrasing does not match stored data word for word.
- The Guava-Coroutines bridge is the one tricky dependency. kotlinx-coroutines-guava is needed because local-agents uses ListenableFuture internally. Miss this and nothing compiles.
- Voice and RAG are a natural pair. Spoken queries are vague and conversational. That is exactly the kind of input where semantic retrieval is strongest.
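For reference, the bridge is a single Gradle dependency. This is a Kotlin DSL sketch; the version shown is illustrative, so match it to the Coroutines version your project already uses:

```kotlin
dependencies {
    // Bridges ListenableFuture (used internally by local-agents) to suspend
    // functions via .await(); without it the chain calls do not compile.
    implementation("org.jetbrains.kotlinx:kotlinx-coroutines-guava:1.8.1")
}
```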