# WalkXR Technical Architecture & Engineering Plan

**Version:** 3.0 (July 2025) | **Status:** Phase 1


## 1. Guiding Philosophy: A Phased Approach to Agentic AI

This document outlines a focused, iterative, and experience-centric development process. Our strategy is to build a robust, multi-layered agentic system by perfecting components in distinct, sequential phases. Each phase corresponds to a series of epics in our project backlog and builds directly upon the last.

The Development Lifecycle:

  1. Phase 1: Foundational Systems (Epics 1-3): Construct the core components for a single, stateful agent. This includes establishing the RAG pipeline, agent memory, evaluation frameworks, and a basic end-to-end testing environment with FastAPI and Streamlit.
  2. Phase 2: Autonomous Agent Capabilities (Epics 4-6): Evolve the single agent into a more autonomous entity. This involves implementing persistent cross-session memory, agentic tool use, self-correction loops, and our first fine-tuning cycle (DPO) to create a custom, aligned model. We will also build the multi-agent orchestrator.
  3. Phase 3: Avatar Embodiment & Unreal Integration (Epic 7): Give our agents a physical, interactive presence. This phase is dedicated to integrating the agent backend with NVIDIA R2X (Riva, Audio2Face) and Unreal Engine to create a real-time, expressive NPC avatar.
  4. Phase 4: Full Walk Experience (Epics 8-10): Assemble the complete cohort of specialized agents required for the 'Small Moments' walk. This involves developing new agents for narrative, ritual, and play, all managed by the master orchestrator.
  5. Phase 5: Generalization & The WalkXR OS (Epics 11-13): Once the first walk is perfected, refactor the architecture into a reusable "Walk Factory." Abstract core components and build a dynamic orchestration engine that can support multiple walks and continuous learning.

This approach mitigates risk by proving out each layer of complexity before adding the next, ensuring we build a stable and scalable foundation for the WalkXR Emotional OS.


## 2. Technical Stack & Phased Implementation Plan

Our architecture is designed as a series of layers, each introduced at a specific phase of the project. This ensures we use the right tool for the right job at the right time.


### Phase 1: Foundational Systems (Epics 1-3)

**Goal:** Build a testable, end-to-end prototype of a single, stateful agent.

| Component | Technology & Rationale | Alternatives Considered |
| --- | --- | --- |
| Agent Logic & State | LangGraph will be used from Epic 3 onward to manage the agent's state and internal reasoning steps. Why: It provides the explicit, cyclical graph structure necessary for complex, multi-step agentic behaviors like tool use and self-reflection, which simpler LangChain Expression Language (LCEL) chains cannot express. | LCEL Chains: Good for simple, linear flows, but they lack the robust state management and conditional routing needed for our long-term vision. |
| Backend API | FastAPI will serve our agent. Why: It's a high-performance Python framework that is easy to use, supports async operations (critical for streaming), and automatically generates interactive API documentation. | Flask: A solid choice, but FastAPI's Pydantic integration for data validation and native async support make it better suited for modern AI applications. |
| Deployment Container | Docker & Docker Compose will be used to containerize the FastAPI application. Why: This creates a consistent, portable, and isolated environment for development, testing, and future cloud deployment, solving the "it works on my machine" problem. | Bare Metal / venv: Lacks portability and can lead to dependency conflicts between different environments. |
| Internal Testing UI | Streamlit will be used to create a simple chat interface. Why: It allows for extremely rapid UI development in pure Python, enabling the entire team (including non-developers) to test the agent's logic without needing a complex frontend setup. | Gradio: Similar to Streamlit, but Streamlit's ecosystem and component library are slightly more mature for our chat-based needs. |
| Knowledge Retrieval (RAG) | ChromaDB will serve as our vector store. Why: It's open-source, easy to set up locally, and offers powerful metadata filtering, which is critical for our advanced chunking strategies. We will use nomic-embed-text via Ollama as our initial, high-performance, open-source embedding model. | FAISS: Very fast for simple vector search but lacks the rich metadata filtering and API layer of ChromaDB. |
| Observability & Debugging | LangSmith will be integrated from day one. Why: It provides invaluable, detailed traces of every step in our agent's execution, making it essential for debugging complex prompts, evaluating performance, and understanding failures. | Manual Logging: Prone to error, lacks visualization, and cannot easily track the complex, nested calls common in agentic systems. |
| Model Abstraction | LiteLLM will be used for all foundation model calls. Why: It provides a unified API interface for over 100 LLMs (OpenAI, Anthropic, local models via Ollama, etc.). This makes our agent model-agnostic, allowing us to switch models easily for cost, performance, or capability reasons without refactoring code. | Direct SDK Calls (OpenAI, etc.): Locks us into a single vendor and requires significant code changes to experiment with different models. |
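
To make the Phase 1 wiring concrete, here is a minimal sketch of a FastAPI endpoint that streams a reply through LiteLLM. It is illustrative only: the module name, route, request schema, and model id are placeholders rather than the project's actual implementation.

```python
# minimal_agent_api.py -- illustrative sketch only; route, schema, and model id are placeholders.
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from pydantic import BaseModel
import litellm

app = FastAPI(title="WalkXR Agent API (prototype)")

class ChatRequest(BaseModel):
    session_id: str   # will key per-session state once the LangGraph agent is wired in
    message: str

@app.post("/chat")
async def chat(req: ChatRequest) -> StreamingResponse:
    """Stream tokens back to the caller as they arrive from the model."""

    async def token_stream():
        # LiteLLM gives one interface over OpenAI, Anthropic, Ollama, etc., so swapping
        # models is a config change rather than a refactor.
        response = await litellm.acompletion(
            model="ollama/llama3",   # placeholder; any LiteLLM-supported model id works
            messages=[{"role": "user", "content": req.message}],
            stream=True,
        )
        async for chunk in response:
            yield chunk.choices[0].delta.content or ""

    return StreamingResponse(token_stream(), media_type="text/plain")
```

Run locally with `uvicorn minimal_agent_api:app --reload`; the Docker Compose setup would simply wrap that same command in a container, and the Streamlit test UI can post to `/chat` directly.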
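
The retrieval layer can be smoke-tested in isolation along the same lines. The sketch below assumes a local Ollama server providing nomic-embed-text and an invented `walkxr_knowledge` collection; the metadata fields exist only to show why ChromaDB's filtering matters for our chunking strategies.

```python
# rag_smoke_test.py -- illustrative sketch; collection name and metadata fields are invented.
import chromadb
from chromadb.utils import embedding_functions

# nomic-embed-text served by a local Ollama instance; the exact URL form expected by
# OllamaEmbeddingFunction differs slightly across chromadb versions.
embed_fn = embedding_functions.OllamaEmbeddingFunction(
    url="http://localhost:11434/api/embeddings",
    model_name="nomic-embed-text",
)

client = chromadb.PersistentClient(path="./chroma_db")
collection = client.get_or_create_collection("walkxr_knowledge", embedding_function=embed_fn)

# Index a few chunks with metadata so retrieval can later be filtered by walk and module.
collection.add(
    ids=["sm-001", "sm-002"],
    documents=[
        "Small Moments opens with a grounding breath ritual.",
        "The reflection module invites users to name one feeling out loud.",
    ],
    metadatas=[
        {"walk": "small_moments", "module": "opening"},
        {"walk": "small_moments", "module": "reflection"},
    ],
)

# Metadata filtering via the `where` clause is the main reason ChromaDB was chosen over FAISS.
results = collection.query(
    query_texts=["How does the walk begin?"],
    n_results=2,
    where={"walk": "small_moments"},
)
print(results["documents"])
```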

### Phase 2: Autonomous Agent Capabilities (Epics 4-6)

**Goal:** Evolve the agent from a simple prototype to an autonomous entity with long-term memory and the ability to learn.

| Component | Technology & Rationale | Alternatives Considered |
| --- | --- | --- |
| Persistent Memory | A dedicated ChromaDB collection will be used for persistent, cross-session user memory. Why: This separates general knowledge from user-specific memories, which is crucial for privacy and personalization. The agent will use an LLM-powered "summarization" or "reflection" chain to distill conversational history into storable insights. | Standard Chat History: Only provides short-term memory and lacks the ability to recall key facts or user preferences across sessions. |
| Agentic Tool Use | We will use the native tool-calling capabilities of models like GPT-4o or Claude 3.5 Sonnet, integrated via LangGraph. Why: This is the industry-standard approach for building reliable tool-using agents. LangGraph's conditional routing is perfectly suited for managing the "call tool -> get observation -> continue" loop. | Custom Parsers / ReAct: Brittle and less reliable than native tool-calling, which is now a standard feature in state-of-the-art models. |
| Self-Correction Loop | A "Reflection" node will be added to our LangGraph, creating a critique-and-refine loop. Why: This "Constitutional AI" pattern, where the agent evaluates its own output against a set of principles before responding, dramatically improves response quality, safety, and alignment without requiring expensive fine-tuning. | Prompt Engineering Alone: Relies solely on the initial prompt, with no mechanism to catch or fix a bad response before it's sent to the user. |
| Model Fine-Tuning | We will use QLoRA (Quantized Low-Rank Adaptation) via Hugging Face's TRL library to fine-tune an open-source model (e.g., Llama 3, Mistral). Why: QLoRA allows us to efficiently fine-tune large models on a single GPU, making custom model development accessible and cost-effective. | Full Fine-Tuning: Requires immense computational resources (multiple high-end GPUs) and is prohibitively expensive for our current stage. |
| Alignment Technique | We will use Direct Preference Optimization (DPO). Why: DPO is a more stable and efficient alternative to traditional RLHF. It allows us to directly train the model on a preference dataset (pairs of "good" vs. "bad" responses), making the model intrinsically better aligned with our desired conversational style. | RLHF with a Reward Model: More complex to implement, requires training a separate reward model, and can be unstable. DPO achieves similar results with less complexity. |
| Multi-Agent Orchestration | A master LangGraph will be used as a router. Why: It can use an LLM to classify user intent and then conditionally route the request to the appropriate specialized agent's subgraph. This is a scalable and powerful pattern for building complex, multi-agent systems. | If/Else Logic: Brittle and not scalable. Cannot handle the nuance of natural language and would require constant maintenance as new agents are added. |
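
The sketch below shows a minimal version of the "call tool -> get observation -> continue" loop from the tool-use row above, built with LangGraph's prebuilt ToolNode and conditional routing. The `log_mood_checkin` tool and the model choice are placeholders for illustration.

```python
# tool_loop_sketch.py -- illustrative; the tool and model are placeholders.
from langchain_core.tools import tool
from langchain_openai import ChatOpenAI
from langgraph.graph import StateGraph, MessagesState, START
from langgraph.prebuilt import ToolNode, tools_condition

@tool
def log_mood_checkin(mood: str) -> str:
    """Record the user's self-reported mood for this session."""
    # A real implementation would write to the persistent user-memory collection.
    return f"Recorded mood: {mood}"

tools = [log_mood_checkin]
llm = ChatOpenAI(model="gpt-4o").bind_tools(tools)   # native tool calling

def agent(state: MessagesState) -> dict:
    # The model either answers directly or emits a tool call.
    return {"messages": [llm.invoke(state["messages"])]}

graph = StateGraph(MessagesState)
graph.add_node("agent", agent)
graph.add_node("tools", ToolNode(tools))
graph.add_edge(START, "agent")
# tools_condition routes to the "tools" node when the last message contains a tool call,
# otherwise it ends the turn -- this is the call-tool -> observe -> continue loop.
graph.add_conditional_edges("agent", tools_condition)
graph.add_edge("tools", "agent")
app = graph.compile()
```

Calling `app.invoke({"messages": [("user", "I keep replaying an awkward conversation")]})` lets the model decide whether to record a mood check-in before it replies.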
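
On the fine-tuning side, the central artifact is a preference dataset of prompt/chosen/rejected triples, which TRL's DPOTrainer consumes directly. The sketch below is schematic only: the base model, the single hand-written pair, and the hyperparameters are placeholders, and exact argument names shift between TRL, PEFT, and Transformers releases.

```python
# dpo_sketch.py -- schematic only; TRL/PEFT/Transformers argument names vary across versions.
from datasets import Dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from trl import DPOConfig, DPOTrainer

# DPO trains directly on preference pairs: one prompt, a preferred reply, a rejected reply.
pairs = Dataset.from_list([
    {
        "prompt": "I snapped at my partner this morning and feel awful.",
        "chosen": "That sounds heavy. What was happening for you right before you snapped?",
        "rejected": "Everyone argues sometimes, don't worry about it.",
    },
])

base = "meta-llama/Meta-Llama-3-8B-Instruct"          # placeholder base model
model = AutoModelForCausalLM.from_pretrained(
    base,
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),  # the "Q" in QLoRA
)
tokenizer = AutoTokenizer.from_pretrained(base)

peft_config = LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM")  # low-rank adapters

trainer = DPOTrainer(
    model=model,
    args=DPOConfig(output_dir="./dpo-small-moments", beta=0.1),
    train_dataset=pairs,
    processing_class=tokenizer,   # called `tokenizer` in older TRL releases
    peft_config=peft_config,
)
trainer.train()
```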

### Phase 3: Avatar Embodiment & Unreal Integration (Epic 7)

**Goal:** Give our agents a voice and a face, creating a fully immersive, real-time interactive experience.

| Component | Technology & Rationale | Alternatives Considered |
| --- | --- | --- |
| Speech-to-Text (ASR) | NVIDIA Riva (ASR) will be used inside Unreal Engine. Why: It is optimized for real-time, low-latency streaming transcription, which is essential for a natural-feeling conversation. | OpenAI Whisper: Excellent accuracy but can have higher latency and is not designed primarily for real-time streaming, which can break immersion. |
| Text-to-Speech (TTS) | NVIDIA Riva (TTS) will generate the agent's voice. Why: Riva's TTS provides high-quality, natural-sounding voices with the ultra-low latency required to sync with facial animations. | Coqui TTS / Piper: Powerful open-source options, but they may not achieve the same level of low-latency performance required for believable, real-time lip-sync. |
| Facial Animation | NVIDIA Audio2Face will be used to generate real-time facial animations from the TTS audio stream. Why: This is the industry-leading solution for this problem and the most difficult component to replicate. It provides stunningly realistic lip-sync and emotional expression out of the box, saving us a massive R&D effort. | Building a Custom Model: A major research project in itself. Would require creating a large dataset and training a complex audio-to-blendshape model, distracting from our core focus on agent cognition. |
| Game Engine | Unreal Engine 5 will be the client. Why: Its high-fidelity rendering capabilities (especially with Metahumans) are unmatched for creating the realistic, emotionally resonant avatars our experience requires. It also has mature plugins for the NVIDIA ecosystem. | Unity: A strong game engine, but Unreal's focus on cinematic quality and realism makes it the better choice for our specific aesthetic and technical goals. |
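
The Riva and Audio2Face integration itself lives inside Unreal and NVIDIA's SDKs, but the agent backend still needs a low-latency transport into the engine. One plausible shape for that bridge, assumed here rather than specified above, is a WebSocket endpoint on the existing FastAPI service that receives Riva ASR transcripts from Unreal and streams back sentence-sized chunks of agent text for Riva TTS and Audio2Face to speak. The JSON frame format is an assumption, not a settled spec.

```python
# avatar_bridge_sketch.py -- hypothetical transport between the agent backend and Unreal.
from fastapi import FastAPI, WebSocket

app = FastAPI()

async def run_agent(user_text: str):
    """Placeholder async generator; the real version would stream from the agent graph."""
    yield f"(echo) {user_text}"

@app.websocket("/avatar-session")
async def avatar_session(ws: WebSocket) -> None:
    await ws.accept()
    while True:
        # Unreal sends the user's transcribed utterance (from Riva ASR) as plain text.
        user_text = await ws.receive_text()
        async for sentence in run_agent(user_text):
            # Sentence-sized chunks let Riva TTS and Audio2Face start speaking before
            # the full reply is generated, keeping perceived latency low.
            await ws.send_json({"type": "agent_text", "text": sentence})
        await ws.send_json({"type": "end_of_turn"})
```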

### Phase 4: Full Walk Experience (Epics 8-10)

**Goal:** Assemble and orchestrate the full cohort of specialized agents to deliver a cohesive, emotionally resonant user journey for the 'Small Moments' walk.

| Component | Technology & Rationale | Alternatives Considered |
| --- | --- | --- |
| Specialized Agent Design | This phase is less about new tech and more about Cognitive Design. We will use LangGraph to build new agents (Narrative, Ritual, Play) as modular subgraphs. The primary work is in prompt engineering, persona development, and defining the agent's unique knowledge and tools. | Monolithic Agent: A single, large agent would be less maintainable, harder to debug, and less effective than a team of specialized experts. |
| Simulation Framework | A custom Python Simulation Harness will be built. Why: This allows us to run automated tests at scale. The harness will use pre-defined user personas and scripts to interact with the FastAPI endpoint, logging entire conversations to evaluate the full user journey against our desired emotional arcs. | Manual Testing: Not scalable, prone to bias, and cannot provide the volume of data needed to rigorously test a complex multi-agent system. |
| Advanced Evaluation | We will build Custom LangSmith Evaluators. Why: Standard metrics (e.g., correctness) are insufficient. We will write Python code to evaluate conversations for specific criteria like: 1) Emotional Arc Adherence, 2) Insight Generation, and 3) Narrative Cohesion. This provides a much deeper signal of quality. | Relying on LLM-as-Judge: Can be a good starting point, but custom evaluators provide more reliable, targeted, and explainable feedback based on our specific project goals. |
| Interaction Data Logging | All simulated and real user interactions will be captured and stored as structured JSONL files in an S3 bucket. Why: This raw data is our most valuable asset. It will be the foundation for all future fine-tuning efforts, creating a data flywheel where every conversation makes the system smarter. | Storing in a SQL DB: Less flexible for the semi-structured, nested nature of conversational data. S3 + JSONL is a standard, scalable pattern for ML data lakes. |
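
A first cut of the simulation harness can be a short script that replays persona scripts against the FastAPI endpoint and appends every exchange to a JSONL log, the same shape of record that would later be synced to S3 for fine-tuning. The endpoint path, persona schema, and file locations below are assumptions for illustration.

```python
# sim_harness_sketch.py -- illustrative; endpoint, persona schema, and paths are invented.
import json
import pathlib
import httpx

API_URL = "http://localhost:8000/chat"     # the Phase 1 FastAPI endpoint
LOG_PATH = pathlib.Path("runs/small_moments_sim.jsonl")

personas = [
    {"id": "rushed-parent", "script": ["I only have five minutes.", "Okay, what do I do first?"]},
    {"id": "skeptic", "script": ["I doubt this will help.", "Fine, walk me through it."]},
]

LOG_PATH.parent.mkdir(parents=True, exist_ok=True)
with LOG_PATH.open("a") as log, httpx.Client(timeout=60) as client:
    for persona in personas:
        for turn, message in enumerate(persona["script"]):
            reply = client.post(API_URL, json={"session_id": persona["id"], "message": message})
            record = {
                "persona": persona["id"],
                "turn": turn,
                "user": message,
                "agent": reply.text,
            }
            # One JSON object per line: easy to bulk-load into S3 / fine-tuning pipelines later.
            log.write(json.dumps(record) + "\n")
```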
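
Custom LangSmith evaluators are, at their simplest, plain Python functions that score a run and return a keyed result. The example below sketches a deliberately crude "Emotional Arc Adherence" check against an assumed dataset name; the real criteria would be far richer, and the evaluator signature shown follows the current langsmith SDK, which may differ across versions.

```python
# custom_evaluators_sketch.py -- illustrative; dataset name and scoring heuristic are placeholders.
from langsmith.evaluation import evaluate
from langsmith.schemas import Example, Run

def emotional_arc_adherence(run: Run, example: Example) -> dict:
    """Score 1.0 if the agent acknowledges feelings before offering any guidance."""
    reply = str((run.outputs or {}).get("output", "")).lower()
    feeling_markers = ("feel", "feeling", "sounds like", "that's hard")
    score = 1.0 if any(marker in reply for marker in feeling_markers) else 0.0
    return {"key": "emotional_arc_adherence", "score": score}

def agent_under_test(inputs: dict) -> dict:
    # Stand-in for calling the real FastAPI endpoint or compiled LangGraph app.
    return {"output": "That sounds like a hard morning. What felt heaviest?"}

results = evaluate(
    agent_under_test,
    data="small-moments-sim-transcripts",   # placeholder LangSmith dataset name
    evaluators=[emotional_arc_adherence],
)
```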

### Phase 5: Generalization & The WalkXR OS (Epics 11-13)

**Goal:** Refactor the bespoke 'Small Moments' pipeline into a reusable "Walk Factory," implement a continuous learning loop, and harden the system for production.

| Component | Technology & Rationale | Alternatives Considered |
| --- | --- | --- |
| Configuration as Code | We will refactor the system to be Configuration-Driven. Agent personas, prompts, tool access, and orchestration logic will be defined in YAML files. Why: This separates the "what" from the "how," allowing non-engineers to design new walks and agents by editing simple config files instead of Python code. | Hardcoded Logic: Fast for prototyping but impossible to scale. Every new walk would require a new deployment and code changes. |
| Continuous Learning Loop | We will implement a Reinforcement Learning from AI Feedback (RLAIF) pipeline. Why: This automates the DPO process. User feedback (ratings, etc.) is programmatically converted into preference pairs, which are then used to automatically trigger fine-tuning jobs (e.g., weekly) to continuously improve the model. | Manual Curation: Too slow and labor-intensive to keep up with incoming user data. An automated pipeline is the only way to achieve true continuous learning. |
| Managed Vector DB | We will migrate from local ChromaDB to a managed service like Pinecone. Why: As we scale, we need a production-grade vector database that offers high availability, low latency, and handles the complexities of indexing and scaling for us. | Self-Hosted ChromaDB/Milvus: Requires significant operational overhead to manage, scale, and maintain, distracting from our core product development. |
| Advanced Safety Layer | We will integrate NVIDIA NeMo Guardrails. Why: This provides a dedicated, configurable safety layer that operates independently of the LLM. We can define precise rules (e.g., "the agent must not give medical advice") that are enforced on both user inputs and agent outputs, providing a critical layer of protection. | Prompt-Based Safety: Important but not sufficient. A dedicated guardrails system is more robust and cannot be as easily bypassed by clever prompt engineering. |
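
To make the configuration-driven "Walk Factory" idea concrete, the sketch below shows one possible YAML agent spec and a loader that turns it into constructor arguments for a generic agent builder. The schema (name, persona, model, tools, routing) is an assumption; the real fields will be defined during the refactor.

```python
# walk_factory_sketch.py -- hypothetical config schema and loader.
import yaml

AGENT_SPEC = """
name: ritual_guide
persona: >
  A calm, grounded guide who leads short rituals and never gives medical advice.
model: gpt-4o
tools: [log_mood_checkin]
routing:
  intents: [ritual, grounding, breathwork]
"""

def build_agent(spec_text: str) -> dict:
    """Turn a YAML agent spec into the kwargs a generic agent constructor would need."""
    spec = yaml.safe_load(spec_text)
    return {
        "name": spec["name"],
        "system_prompt": spec["persona"],
        "model": spec["model"],
        "tools": spec.get("tools", []),
        "routes_on": spec["routing"]["intents"],
    }

if __name__ == "__main__":
    print(build_agent(AGENT_SPEC))
```

Designing new walks then becomes a matter of adding YAML files; the orchestrator reads the `routing.intents` block to decide which specialized agent a request should reach.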
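
NeMo Guardrails wraps the model behind a separate, rule-enforcing layer. A minimal integration, assuming a `./guardrails_config` directory that holds the rails definitions (for example a "no medical advice" flow written in Colang), looks roughly like this:

```python
# guardrails_sketch.py -- minimal NeMo Guardrails wiring; the config directory is assumed to exist.
from nemoguardrails import LLMRails, RailsConfig

# ./guardrails_config holds config.yml plus Colang flows such as the
# "the agent must not give medical advice" rule mentioned above.
config = RailsConfig.from_path("./guardrails_config")
rails = LLMRails(config)

response = rails.generate(messages=[
    {"role": "user", "content": "What dosage of medication should I take for anxiety?"}
])
print(response["content"])   # the rails-approved reply (or a refusal)
```

Because the rails run on both the user input and the drafted reply, a rule like the medical-advice prohibition holds even if a clever prompt slips past the agent's own system prompt.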