SWF | Michaell Alavedra

As a software architect, I faced the challenge of building a system capable of interpreting complex visual information (specifically, geometric symbologies and wave frequencies in the context of Dakila technologies) and cross-referencing it against a highly specialized knowledge base. The fundamental problem was processing unstructured visual artifacts and correlating them accurately with extensive documentary records without the language models hallucinating.

To solve this, I designed a progressive web application (PWA) that bridges computer vision and Retrieval-Augmented Generation (RAG). The system is not a simple chatbot; it is an inference engine that orchestrates multimodal models to extract features from images (geometric patterns, lines, curves, colors) and uses those features as queries against a specialized vector database.

Main Operating Flow

From an operational perspective, the system executes a deterministic sequence in three main phases:

Analytical Ingestion: The user provides an image and, optionally, textual context. The system processes and temporarily stores the visual file using Vercel Blob, preparing it for multimodal analysis.
Extraction and Vector Search (RAG): The AI agent (powered by Mastra and foundation models) executes an initial visual analysis to extract metadata from the image. Immediately after, the system vectorizes these findings and queries the internal knowledge base. This step ensures that any subsequent statements are strictly anchored in official literature.
Synthesis and State Return: The engine consolidates the visual findings with the retrieved records, generating a structured response. The conversation maintains state throughout the session using a dedicated memory system, allowing follow-up iterations on the same visual artifact.
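
The retrieval and synthesis phases above can be sketched in a few lines. This is a minimal in-memory illustration, not the project's code: the `KnowledgeChunk` shape, the `retrieveTopK` and `buildGroundedPrompt` helpers, and the use of cosine similarity are all assumptions, and a real deployment would delegate ranking to the vector store instead of scanning an array.

```typescript
// Hypothetical shape of a stored knowledge chunk (field names are assumptions).
interface KnowledgeChunk {
  chunkId: string;
  content: string;
  source: string;
  embedding: number[];
}

interface ScoredChunk {
  chunk: KnowledgeChunk;
  score: number;
}

// Cosine similarity between two vectors of equal length.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error(`Dimension mismatch: ${a.length} vs ${b.length}`);
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    const x = a[i] as number;
    const y = b[i] as number;
    dot += x * y;
    normA += x * x;
    normB += y * y;
  }
  if (normA === 0 || normB === 0) return 0; // a zero vector matches nothing
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Phase 2 reduced to its core: score every chunk against the query
// embedding, drop weak matches, and keep the k best.
function retrieveTopK(
  query: number[],
  chunks: KnowledgeChunk[],
  k: number,
  minScore = 0,
): ScoredChunk[] {
  return chunks
    .map((chunk) => ({ chunk, score: cosineSimilarity(query, chunk.embedding) }))
    .filter((scored) => scored.score >= minScore)
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}

// Phase 3: combine the visual findings with the retrieved records so the
// model is asked to answer only from numbered, attributable evidence.
function buildGroundedPrompt(findings: string[], retrieved: ScoredChunk[]): string {
  const evidence = retrieved
    .map((r, i) => `[${i + 1}] (${r.chunk.source}) ${r.chunk.content}`)
    .join("\n");
  return [
    "Visual findings:",
    findings.map((f) => `- ${f}`).join("\n"),
    "Evidence (answer only from these records):",
    evidence.length > 0 ? evidence : "(no matching records)",
  ].join("\n");
}
```

Numbering the evidence lines gives the model something concrete to cite, which is one common way to keep generated statements traceable to their source records.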

Architectural Dissection

To build this system, I opted for a decoupled modular architecture, prioritizing performance and a clear separation of responsibilities between the presentation layer and cognitive orchestration.

General Architecture

The project follows a Decoupled Service-Oriented Architecture pattern within an Astro-based ecosystem. The technical reason for this decision is to isolate the computational complexity of the AI agent from user interface rendering, so that each side can be scaled and maintained independently.

Presentation Layer (Frontend): Built with Astro and React. I opted for Astro because of its “islands” philosophy, which allows hydrating complex interactions (such as 3D visualizers or the chat interface) only when necessary. The interface is orchestrated by high-level components that delegate rendering to specialized subcomponents.
Orchestration and API Layer: Exposed through secure API routes acting as middleware. This layer handles file uploads, verifies the distributed configuration in real time, and delegates streaming requests to the underlying cognitive engine.
Cognitive Engine (Agentic Layer): Represents the core of the domain. I configured an autonomous Agent equipped with vector query tools and multimodal capabilities, completely encapsulating the inference logic and RAG interaction.
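
As an illustration of the checks the API layer performs before delegating to the cognitive engine, here is a minimal sketch of an upload gate. The `RuntimeConfig` flags, the `checkUpload` helper, and the specific limits are hypothetical; they stand in for whatever values the real route reads from its distributed configuration.

```typescript
// Hypothetical runtime flags, standing in for values read from a
// distributed configuration store.
interface RuntimeConfig {
  analysisEnabled: boolean;
  maxUploadBytes: number;
  allowedMimeTypes: string[];
}

// The metadata the gate needs; the file body itself is not inspected here.
interface UploadCandidate {
  fileName: string;
  mimeType: string;
  sizeBytes: number;
}

type GateResult =
  | { ok: true }
  | { ok: false; status: number; reason: string };

// Decide whether a request may proceed to upload and analysis.
// Each rejection carries the HTTP status the route would return.
function checkUpload(file: UploadCandidate, config: RuntimeConfig): GateResult {
  if (!config.analysisEnabled) {
    return { ok: false, status: 503, reason: "Analysis is temporarily disabled" };
  }
  if (!config.allowedMimeTypes.includes(file.mimeType)) {
    return { ok: false, status: 415, reason: `Unsupported type: ${file.mimeType}` };
  }
  if (file.sizeBytes <= 0) {
    return { ok: false, status: 400, reason: "Empty file" };
  }
  if (file.sizeBytes > config.maxUploadBytes) {
    return { ok: false, status: 413, reason: "File exceeds the upload limit" };
  }
  return { ok: true };
}
```

Keeping the gate a pure function of the file metadata and the configuration makes it testable without a running server, and keeps policy decisions out of the agent.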

graph TD
    %% Architecture Diagram
    User([User]) --> |Uploads Image / Message| UI[UI Presentation Layer\nReact / Astro]
    UI --> |FormData| API[API Middleware\n/api/analyze.ts]

    subgraph Edge Infrastructure
        API --> |Upload| BlobStore[(Vercel Blob)]
        API --> |State Verification| EdgeConfig[(Edge Config)]
    end

    API --> |Stream Request\nContext + Image URL| Agent[Cognitive Engine\nMastra Agent]

    subgraph Cognitive Layer
        Agent --> |Entity Extraction| LLM[Google Gemini Multimodal]
        Agent --> |Vector Query| VectorTool[RAG Search Tool]
        VectorTool --> |text-embedding-004 Embeddings| VectorDB[(LibSQL Vector Store)]
        Agent --> |Context Management| Memory[(LibSQL Store\nMastra Memory)]
    end

    LLM --> |Synthesized Response| Agent
    Agent --> |Server-Sent Events| API
    API --> |Stream| UI
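
The diagram's return path relies on Server-Sent Events. The following sketch shows how text chunks from the agent could be framed for that transport. The helper names, the JSON payload shape, and the terminal `done` event are assumptions made for illustration; only the `event:`/`data:` line format and the blank-line terminator come from the SSE wire format itself.

```typescript
// Encode one payload as a Server-Sent Events frame.
// A frame is a set of "field: value" lines terminated by a blank line.
function toSseFrame(data: string, event?: string): string {
  const lines: string[] = [];
  if (event !== undefined) lines.push(`event: ${event}`);
  // A payload containing line breaks must be sent as one "data:" line each.
  for (const line of data.split(/\r\n|\r|\n/)) {
    lines.push(`data: ${line}`);
  }
  return lines.join("\n") + "\n\n";
}

// Lazily convert agent text chunks into frames, then emit a terminal
// "done" event (a hypothetical convention) so the client can stop reading.
function* framesFromChunks(chunks: Iterable<string>): Generator<string> {
  for (const chunk of chunks) {
    // JSON-encoding keeps line breaks inside the text on a single data line.
    yield toSseFrame(JSON.stringify({ text: chunk }));
  }
  yield toSseFrame("{}", "done");
}
```

On the client, each frame without an `event:` field arrives as a default `message` event, so the UI can append `text` to the current reply as frames come in.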

Data Modeling and State Management

The application demands strict control over the conversational state and the processed artifacts. To achieve this, I implemented a persistence model based on execution threads.

Frontend State Management: Handled centrally through React hooks and context containers, maintaining a strict unidirectional flow of data to the rendering components (chat, interactive visualizer, visual feedback).
Backend State Management: The agentic framework uses transactional memory modules to persist the message history autonomously in an embedded database, linking each message to its thread through unique identifiers.
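
The unidirectional flow on the frontend can be illustrated with a pure reducer of the kind React's `useReducer` hook accepts. The state shape, action names, and status values below are hypothetical; the sketch only shows the pattern in which components dispatch actions and never mutate state directly.

```typescript
type Role = "user" | "assistant";

interface ChatMessage {
  id: string;
  role: Role;
  content: string;
}

interface ChatState {
  threadId: string | null;
  messages: ChatMessage[];
  status: "idle" | "uploading" | "streaming" | "error";
  error: string | null;
}

// Every state change is described by one of these actions (names assumed).
type ChatAction =
  | { type: "submit"; message: ChatMessage }
  | { type: "streamStart"; threadId: string; messageId: string }
  | { type: "streamChunk"; messageId: string; text: string }
  | { type: "streamEnd" }
  | { type: "fail"; error: string };

const initialChatState: ChatState = {
  threadId: null,
  messages: [],
  status: "idle",
  error: null,
};

// Pure reducer: returns a new state object and never mutates its input.
function chatReducer(state: ChatState, action: ChatAction): ChatState {
  switch (action.type) {
    case "submit":
      return {
        ...state,
        status: "uploading",
        error: null,
        messages: [...state.messages, action.message],
      };
    case "streamStart":
      // Open an empty assistant message that later chunks will fill.
      return {
        ...state,
        threadId: action.threadId,
        status: "streaming",
        messages: [...state.messages, { id: action.messageId, role: "assistant", content: "" }],
      };
    case "streamChunk":
      return {
        ...state,
        messages: state.messages.map((m) =>
          m.id === action.messageId ? { ...m, content: m.content + action.text } : m,
        ),
      };
    case "streamEnd":
      return { ...state, status: "idle" };
    case "fail":
      return { ...state, status: "error", error: action.error };
  }
}
```

Because the reducer is pure, the chat, the visualizer, and the feedback components can all render from the same state object without coordinating with one another.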

erDiagram
    %% Data Model
    THREAD {
        string threadId PK
        datetime createdAt
    }
    MESSAGE {
        string messageId PK
        string role "user | assistant"
        text content
        string threadId FK
    }
    RESOURCE {
        string resourceId PK
        string publicUrl "Blob Storage Access URL"
    }
    KNOWLEDGE_CHUNK {
        string chunkId PK
        vector embedding "Dimension: 768"
        text content
        string source
    }

    THREAD ||--o{ MESSAGE : contains
    THREAD ||--o| RESOURCE : contextualizes
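
To make the thread-based relationships concrete, the sketch below mirrors the THREAD, MESSAGE, and RESOURCE entities as TypeScript types with a small in-memory store. It is an illustration only: `InMemoryThreadStore` and its methods are hypothetical and do not represent the memory API of the agentic framework, which persists to an embedded database.

```typescript
interface Thread {
  threadId: string;
  createdAt: Date;
}

interface Message {
  messageId: string;
  role: "user" | "assistant";
  content: string;
  threadId: string; // foreign key to Thread
}

interface Resource {
  resourceId: string;
  publicUrl: string;
}

// Hypothetical in-memory stand-in for the persistence layer.
class InMemoryThreadStore {
  private threads = new Map<string, Thread>();
  private messages = new Map<string, Message[]>();
  private resources = new Map<string, Resource>();
  private counter = 0;

  createThread(threadId: string, createdAt: Date = new Date()): Thread {
    if (this.threads.has(threadId)) {
      throw new Error(`Thread already exists: ${threadId}`);
    }
    const thread: Thread = { threadId, createdAt };
    this.threads.set(threadId, thread);
    this.messages.set(threadId, []);
    return thread;
  }

  // THREAD ||--o| RESOURCE: a thread has at most one resource, so a second
  // call replaces the first.
  attachResource(threadId: string, resource: Resource): void {
    this.requireThread(threadId);
    this.resources.set(threadId, resource);
  }

  // THREAD ||--o{ MESSAGE: messages are appended in arrival order.
  appendMessage(threadId: string, role: Message["role"], content: string): Message {
    this.requireThread(threadId);
    this.counter += 1;
    const message: Message = { messageId: `m${this.counter}`, role, content, threadId };
    (this.messages.get(threadId) as Message[]).push(message);
    return message;
  }

  // Returns a copy so callers cannot mutate the stored history.
  history(threadId: string): Message[] {
    this.requireThread(threadId);
    return [...(this.messages.get(threadId) as Message[])];
  }

  resourceFor(threadId: string): Resource | undefined {
    return this.resources.get(threadId);
  }

  private requireThread(threadId: string): void {
    if (!this.threads.has(threadId)) {
      throw new Error(`Unknown thread: ${threadId}`);
    }
  }
}
```

Keying both the message history and the resource by `threadId` is what lets follow-up questions refer back to the same uploaded image within a session.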

Technology Stack

The tools in this project were selected to balance innovation, development speed, and runtime efficiency.

Layer / Domain	Technology	Technical Justification and Role
Core Framework	Astro + React 19	Astro provides efficient routing and an optimized rendering model. React manages the reactive state in interactive islands.
Styles and Interface	Tailwind CSS v4, shadcn/ui, Framer Motion	Utility-first design system, accessible components without tight coupling to a component library, and high-performance declarative animations.
3D Rendering	react-three-fiber, @react-three/postprocessing	Declarative abstraction of WebGL to render complex scenes and visual effects linked to the analyzed features.
Cognitive Engine	Mastra Framework (@mastra/core)	Orchestrator of AI agents and workflows. Defines the tool schemas, memory management, and base analysis instructions.
AI Models	Google Gemini (2.0 Flash / 3.0 Pro)	Foundation models responsible for visual analysis, high-dimensional embedding generation, and natural language synthesis.
Inference and RAG	Mistral AI (OCR)	Documentary ingestion processing using advanced optical recognition for structuring PDFs prior to vectorization.
Data Persistence	LibSQL / Vercel Edge Config / Vercel Blob	LibSQL serves as both the transactional store and the vector store. Edge Config manages distributed configuration, and Blob stores the uploaded binary files.

Architectural Impact

This design yields a cohesive system in which multimodal analysis and retrieval operate together, with responses grounded in the knowledge base through Retrieval-Augmented Generation. By moving the inference logic into an autonomous agent environment and abstracting storage behind edge-hosted and embedded databases, I have isolated the presentation layer from the computational bottlenecks typical of generative AI systems. The result is a system whose layers can scale independently while every data flow follows the same architectural pattern.