May 12, 2025 · essay
The Anatomy of a Desktop AI Assistant: Our Tech Stack & Feature Deep Dive
Explore the tech powering our 'Desktop AI Assistant' (Electron, Vue, Python/FastAPI, Docker, local Ollama & Kokoro TTS) for a truly local, multi-modal AI experience—all coded by AI!
The short version
- **LLM:** This post dissects the comprehensive technology stack and feature set of our 'Desktop AI Assistant'. Notably, Google Gemini 2.5 Pro Preview coded 100% of the application, enabling a rich, local-first, multi-modal AI experience.
- **Why:** To showcase how a stack of Electron, Vue 3, local Ollama (for text & vision), local Kokoro TTS (via FastAPI), and Docker can create powerful, private, and interactive desktop AI assistants.
- **Challenge:** Orchestrating a suite of local AI models (Ollama text/vision, Kokoro TTS) and services within a cohesive desktop application, while ensuring seamless multi-modal interactions (text, voice, image) – all AI-coded under human direction.
- **Outcome:** A 'Desktop AI Assistant' where 100% of the code was AI-generated, featuring robust local capabilities: text chat, image understanding with local vision models, voice-to-text, local text-to-speech (Kokoro TTS) across features, and a sophisticated RAG system. A true multi-modal powerhouse on your desktop.
- **AI approach:** Google Gemini 2.5 Pro Preview acted as the sole coder, implementing all local AI integrations (Ollama, Kokoro TTS via FastAPI), the multi-modal UI features, and the backend RAG pipeline, all based on my architectural guidance and detailed prompts.
- **Learnings:** Local AI (Ollama, Kokoro TTS) significantly enhances user privacy and control. FastAPI excels at serving these Python-based AI models. Electron/Vue provides the ideal canvas for a multi-modal UI. Docker ensures even complex dependencies are tamed.
Introduction: Deconstructing a Modern AI Powerhouse for Your Desktop – Locally!
Our "Desktop AI Assistant" isn't just another application; it's an integrated ecosystem of AI tools designed to run efficiently and privately on your local machine, offering a truly multi-modal experience. In a previous post, I highlighted how 100% of its code was generated by Google Gemini 2.5 Pro Preview. Now, let's pull back the curtain and dissect the anatomy of this assistant – the technologies that power its diverse capabilities, the features it offers, and how these components work in concert to deliver rich interactions across text, voice, and vision.
Building such a comprehensive tool, with a strong emphasis on local processing for privacy and control, required a carefully selected stack. This stack needed to handle everything from a sleek user interface and real-time AI interactions to complex backend processing and isolated, resource-intensive tasks, all while bringing powerful AI models directly to the user's desktop.
Core Technology Stack: The Pillars of Our Local-First, Multi-Modal Assistant
The foundation of the Desktop AI Assistant is built upon several key technologies, each coded by Gemini, to create a powerful local AI experience:
Desktop Application Shell & UI
| Category | Technology/Tool | Role & Purpose |
|---|---|---|
| Shell | Electron (with Electron Forge) | Cross-platform desktop app creation and packaging. |
| Frontend UI | Vue 3 (Composition API, Composables) | Reactive user interface development for multi-modal interactions. |
| Frontend Build | Vite | Fast development server and build tool for Vue. |
| Language | TypeScript | Static typing for entire codebase. |
| Styling | Tailwind CSS | Utility-first CSS framework. |
| Routing | Vue Router | SPA navigation within Electron. |
Local AI & Backend Services
| Category | Technology/Tool | Role & Purpose |
|---|---|---|
| Local LLMs (Text & Vision) | Ollama Server (External) | Runs local language and vision models for chat, summarization, and image analysis. |
| Local Text-to-Speech | Kokoro TTS (via FastAPI) | Provides high-quality, local speech synthesis across various app features. |
| Primary Backend | Python, FastAPI, Uvicorn | Serves Kokoro TTS, orchestrates RAG, interfaces with AI libs. |
| Crawler Service | Docker, Python, FastAPI (minimal) | Isolated web crawling using crawl4ai & Playwright for RAG data ingestion. |
Cloud AI, Data Handling, Communication & Storage
| Category | Technology/Tool | Role & Purpose |
|---|---|---|
| Cloud Transcription | OpenAI Whisper API (External) | Audio-to-text conversion (user-provided API key). |
| Vector DB (RAG) | ChromaDB | Stores text embeddings for RAG. |
| Embeddings (RAG) | Sentence Transformers | Generates embeddings for RAG. |
| HTTP Clients | Axios (Electron), HTTPX (FastAPI), Node HTTP | Inter-service and external API communication. |
| App Settings | electron-store | Persists user preferences for API keys, default models. |
| Chat History | Local Filesystem (JSON) | Stores chat conversations locally. |
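Because chat history is plain JSON on the local filesystem, the persistence layer needs nothing beyond the standard library. The directory layout and message schema below are illustrative assumptions, not the app's actual format:

```python
# Sketch of JSON-file chat persistence; the file layout and message
# schema are illustrative assumptions, not the app's actual format.
import json
from pathlib import Path

def save_chat(history_dir: Path, chat_id: str, messages: list[dict]) -> Path:
    history_dir.mkdir(parents=True, exist_ok=True)
    path = history_dir / f"{chat_id}.json"
    path.write_text(json.dumps({"id": chat_id, "messages": messages}, indent=2))
    return path

def load_chat(history_dir: Path, chat_id: str) -> list[dict]:
    path = history_dir / f"{chat_id}.json"
    if not path.exists():
        return []
    return json.loads(path.read_text())["messages"]
```

Keeping one file per conversation makes histories easy to back up, inspect, or delete without any database tooling.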
Core Features of the Desktop AI Assistant: A Multi-Modal Tour
This robust tech stack enables a rich set of features, all designed to bring AI power directly to the user's desktop with a focus on multi-modal interaction:
| Feature Area | Key Functionalities | Core Modalities & Local AI |
|---|---|---|
| LLM Interaction | Text Chat: streaming, model selection, system prompt, history. Vision Analysis: image upload/URL, text prompts for image queries, model status, explicit loading. Summarization: for transcripts and live sessions. | Text, Image -> Text (Ollama local models); Text-to-Speech (Kokoro TTS for responses) |
| Speech & Audio Processing | Recording: one-off & continuous segmented. Transcription: OpenAI Whisper for speech-to-text. Text-to-Speech: Kokoro TTS for speaking assistant messages, vision descriptions, and dedicated TTS testing; voice and speed selection. | Voice -> Text (Whisper); Text -> Voice (local Kokoro TTS via FastAPI) |
| Live Session | Continuous audio recording, real-time Whisper transcription, progressive Ollama re-summarization of full transcript, live UI updates, performance stats. Session persistence & AI-generated titles. | Voice -> Text -> Summarized Text (Whisper, Ollama Local Models) |
| Retrieval Augmented Generation (RAG) | Web data ingestion (via Dockerized crawl4ai), collection management, chunking, embedding (Sentence Transformers), ChromaDB storage, natural language querying against custom knowledge, display of contextual answers & retrieved sources. | Text (Web Content) -> Structured Knowledge -> Text Query -> Text Answer (Ollama Local Models for RAG answers) |
| Application Settings | Configuration for OpenAI API Key, Default Ollama Chat Model, Default Kokoro TTS Voice & Speed. | User Preferences (electron-store) |
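Before ingested web content can be embedded for RAG, it has to be split into overlapping chunks. A minimal word-based chunker in the spirit of the pipeline above might look like this (chunk size and overlap values are illustrative; the real pipeline feeds these chunks to Sentence Transformers and stores the vectors in ChromaDB):

```python
# Word-based chunking with overlap, as performed before embedding in a
# RAG pipeline. Sizes are illustrative; the app's parameters may differ.
def chunk_text(text: str, chunk_words: int = 200, overlap: int = 40) -> list[str]:
    words = text.split()
    if not words:
        return []
    step = max(chunk_words - overlap, 1)
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_words]))
        if start + chunk_words >= len(words):
            break
    return chunks
```

The overlap ensures that a sentence straddling a chunk boundary still appears intact in at least one chunk, which improves retrieval recall.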
A key aspect is the pervasive integration of local Kokoro TTS. Whether it's an LLM chat response or a vision model's description of an image, the assistant can voice it, making interactions more natural and accessible. Similarly, leveraging local Ollama vision models means image analysis happens on the user's machine, aligning with our local-first philosophy.
Key Integrations & Architectural Patterns Enabling Multi-Modality
The AI-coded architecture ensures these diverse technologies work together seamlessly:
- Electron's Multi-Process Model: Crucial for separating the UI (renderer) from backend tasks and native interactions (main process). Secure IPC via contextBridge is the lifeline.
- Modular Services Everywhere: From Electron's main process services and preload APIs to FastAPI's routers and the rag_core business logic, modularity (a core instruction to Gemini) was key to managing this multi-faceted system.
- FastAPI as the Python AI Hub: This backend serves as the central point for Python-based AI capabilities. It exposes Kokoro TTS as an API and orchestrates the complex RAG pipeline, including communication with the Dockerized crawler. This keeps heavy Python dependencies out of Electron's direct main process.
- Hidden Renderer for Web Audio: Reliably captures audio using standard Web APIs, feeding into both Whisper transcription and potentially other future local audio processing.
- Docker for Stable AI Environments: The crawl4ai and Playwright setup, critical for RAG's data ingestion, runs in a predictable Docker container, eliminating platform-specific headaches.
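The "Python AI Hub" role is largely orchestration: call the crawler, chunk the text, embed the chunks, store the vectors. Stripped of the real services, that control flow can be sketched with injected stage functions (all function names and signatures here are hypothetical placeholders, which also makes the flow testable without the crawler, embedder, or vector store):

```python
# Sketch of the RAG ingestion control flow the FastAPI backend orchestrates.
# Each stage is injected as a callable; names and signatures are hypothetical.
from typing import Callable

def ingest_url(
    url: str,
    crawl: Callable[[str], str],                            # Dockerized crawler service
    chunk: Callable[[str], list[str]],                      # text splitter
    embed: Callable[[list[str]], list[list[float]]],        # Sentence Transformers
    store: Callable[[list[str], list[list[float]]], int],   # ChromaDB upsert
) -> int:
    text = crawl(url)
    chunks = chunk(text)
    if not chunks:
        return 0
    vectors = embed(chunks)
    return store(chunks, vectors)
```

Separating the orchestration from the stages mirrors the modularity instruction given to Gemini: each service can be swapped or mocked independently.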
The Next Frontier: AI-Powered Local Computer Control & Automation
The true differentiating power of a desktop AI assistant, especially one built with Electron and integrated with local LLMs like Ollama, lies in its potential to securely interact with and control the user's local computer environment. This is a domain where web-based cloud applications inherently cannot venture due to browser sandboxing and security restrictions.
Electron, by its very nature, allows the main process to access Node.js APIs, which in turn can interact with the operating system, file system, and other local applications. When combined with a powerful local LLM capable of understanding natural language commands and generating structured output (like scripts or command sequences), the possibilities become immense:
- Automated Tasks: Imagine instructing your Desktop AI Assistant: "Organize all my screenshots from the last week into a new folder named 'May Week 3 Shots' and then open that folder." The assistant could parse this, use Electron's main process to interact with the file system, and execute these commands.
- Application Control: "Open my code editor, launch the 'PM Desktop App' project, and start the development server." While complex, an LLM could generate the necessary shell commands, which Electron's main process could then execute.
- Headless Browser Automation for Local Tasks: Beyond web crawling for RAG, a locally controlled headless browser (perhaps via Playwright managed directly by Electron's main process or a dedicated local utility script) could perform tasks on the user's behalf that require browser interaction but aren't about public web scraping. This could be interacting with local network devices with web UIs, or automating tasks on internal company web portals that aren't publicly accessible. The Dockerized crawler we use for RAG showcases the power of headless browsers; bringing this control locally (and securely) opens new avenues.
- Content Generation & File Manipulation: "Draft an email to John about our meeting, save it as 'meeting_notes.txt' in my Documents, and then open it for review." The LLM generates the text, and Electron handles file creation and opening.
Security is Paramount: Electron's architecture, with its separation of the renderer (UI) and main (Node.js/OS access) processes, is fundamental here. Any powerful local control features must be meticulously designed:
- All OS-level interactions happen exclusively in the Electron main process.
- The sandboxed renderer process (Vue UI) would send high-level, validated commands via IPC to the main process.
- The main process would carefully sanitize and execute these commands, potentially with user confirmation for sensitive actions.
- Local LLMs (Ollama) ensure that the "intent understanding" part of these commands also happens locally, reducing external dependencies for core control logic.
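The sanitize-and-allowlist step described above is language-agnostic. In this app it would live in Electron's TypeScript main process; the sketch below uses Python for consistency with the backend examples, and the command names and allowlist are illustrative assumptions only:

```python
# Sketch of allowlist validation for LLM-proposed local commands.
# In the actual app this check belongs in Electron's main process
# (TypeScript); the allowlist here is illustrative only.
import shlex

ALLOWED_COMMANDS = {"ls", "mkdir", "open"}  # hypothetical allowlist

def validate_command(raw: str) -> list[str]:
    """Parse a shell-like command and reject anything not allowlisted."""
    try:
        parts = shlex.split(raw)
    except ValueError as exc:  # unbalanced quotes, etc.
        raise ValueError(f"unparseable command: {exc}")
    if not parts:
        raise ValueError("empty command")
    if parts[0] not in ALLOWED_COMMANDS:
        raise ValueError(f"command not allowed: {parts[0]}")
    # Reject metacharacters that could chain or substitute extra commands.
    if any(token in raw for token in (";", "|", "&", "`", "$(")):
        raise ValueError("shell metacharacters are not permitted")
    return parts
```

The key design choice is that the LLM only ever proposes a command; a deterministic validator, not the model, decides whether it runs, and sensitive actions can additionally require user confirmation.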
This deeper integration with the local computer, powered by private and local LLMs, represents a significant step towards a truly personal and powerful AI assistant—one that understands your context not just from the web, but from your own machine.
Conclusion: A Local, Multi-Modal AI Future, Built by AI Today
The "Desktop AI Assistant" stands as a compelling example of how modern development, even when fully coded by an AI like Google Gemini 2.5 Pro Preview, can yield sophisticated, local-first, and truly multi-modal applications. By strategically combining Electron, Vue 3, local powerhouses like Ollama for LLMs (text and vision) and Kokoro TTS for speech, a flexible FastAPI backend, and the environmental stability of Docker, we've created a tool that is both powerful and private.
The ability to chat, see, hear, and be heard by your desktop assistant—and looking ahead, to have it securely assist with tasks on your local machine—all while keeping sensitive interactions and data largely local, is no longer a far-off concept. It’s a reality forged by careful architectural planning and, in this case, an extraordinary AI coding partner. The journey underscores a future where human developers define the vision and guide AI to bring complex, integrated systems to life with remarkable efficiency.