System Architecture

Meeting Intelligence Agent - Technical Architecture Documentation

This document provides a high-level overview of the system architecture, design decisions, and component relationships for the Meeting Intelligence Agent.


πŸ“‹ Table of Contents


🎯 System Overview

The Meeting Intelligence Agent is a conversational AI system built on LangGraph that orchestrates meeting video processing, transcription, storage, and intelligent querying through natural language interaction.

Core Capabilities

  1. Video Processing Pipeline: Upload β†’ Transcription β†’ Speaker Diarization β†’ Metadata Extraction β†’ Vector Storage
  2. Semantic Search: RAG-based querying across meeting transcripts using natural language
  3. External Integrations: MCP (Model Context Protocol) servers for Notion and time-aware queries
  4. Conversational Interface: Gradio-based chat UI with file upload support

Design Philosophy

  • Conversational-First: All functionality accessible through natural language
  • Modular Architecture: Clear separation between UI, agent, tools, and services
  • Extensible: MCP protocol enables easy addition of new capabilities
  • Async-Ready: Supports long-running operations (transcription, MCP calls)
  • Production-Ready: Docker support, error handling, graceful degradation

πŸ—οΈ Architecture Diagram

graph TB
    subgraph "Frontend Layer"
        UI[Gradio Interface]
        Chat[Chat Component]
        Upload[File Upload]
        Editor[Transcript Editor]
    end
    
    subgraph "Agent Layer (LangGraph)"
        Agent[Conversational Agent]
        StateMachine[State Machine]
        ToolRouter[Tool Router]
    end
    
    subgraph "Tool Layer"
        VideoTools[Video Processing Tools]
        QueryTools[Meeting Query Tools]
        MCPTools[MCP Integration Tools]
    end
    
    subgraph "Processing Layer"
        WhisperX[WhisperX Transcription]
        Pyannote[Speaker Diarization]
        MetadataExtractor[GPT-4o-mini Metadata]
        Embeddings[OpenAI Embeddings]
    end
    
    subgraph "Storage Layer"
        Pinecone[(Pinecone Vector DB)]
        LocalState[Local State Cache]
    end
    
    subgraph "External Services"
        OpenAI[OpenAI API]
        NotionMCP[Notion MCP Server]
        TimeMCP[Time MCP Server]
        ZoomMCP[Zoom MCP Server<br/>In Development]
    end
    
    UI --> Agent
    Agent --> StateMachine
    StateMachine --> ToolRouter
    ToolRouter --> VideoTools
    ToolRouter --> QueryTools
    ToolRouter --> MCPTools
    
    VideoTools --> WhisperX
    VideoTools --> Pyannote
    VideoTools --> MetadataExtractor
    VideoTools --> Embeddings
    
    QueryTools --> Embeddings
    QueryTools --> Pinecone
    
    MCPTools --> NotionMCP
    MCPTools --> TimeMCP
    MCPTools -.-> ZoomMCP
    
    Embeddings --> Pinecone
    MetadataExtractor --> OpenAI
    Agent --> OpenAI
    
    classDef frontend fill:#e1f5fe,stroke:#01579b,stroke-width:2px,color:#000000
    classDef agent fill:#e8f5e9,stroke:#2e7d32,stroke-width:2px,color:#000000   
    classDef tools fill:#fff3e0,stroke:#ef6c00,stroke-width:2px,color:#000000
    classDef processing fill:#f3e5f5,stroke:#7b1fa2,stroke-width:2px,color:#000000
    classDef storage fill:#fff9c4,stroke:#fbc02d,stroke-width:2px,color:#000000
    classDef external fill:#ffebee,stroke:#c62828,stroke-width:2px,color:#000000
    
    class UI,Chat,Upload,Editor frontend
    class Agent,StateMachine,ToolRouter agent
    class VideoTools,QueryTools,MCPTools tools
    class WhisperX,Pyannote,MetadataExtractor,Embeddings processing
    class Pinecone,LocalState storage
    class OpenAI,NotionMCP,TimeMCP,ZoomMCP external

🧩 Core Components

1. Frontend Layer (Gradio)

Purpose: User interface for interaction and file management

Components:

  • Chat Interface: Primary conversational UI using gr.ChatInterface
  • File Upload: Video file upload widget
  • Transcript Editor: Editable text area for manual corrections
  • State Display: Real-time feedback on processing status

Technology: Gradio 5.x with async support

Key Files:

  • src/ui/gradio_app.py - UI component definitions and event handlers

2. Agent Layer (LangGraph)

Purpose: Orchestrates the entire workflow through conversational AI

Architecture: State machine with three nodes:

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”     β”Œβ”€β”€β”€β”€β”€β”€β”€β”     β”Œβ”€β”€β”€β”€β”€β”€β”€β”
β”‚ PREPARE β”‚ --> β”‚ AGENT β”‚ --> β”‚ TOOLS β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜     β””β”€β”€β”€β”€β”€β”€β”€β”˜     β””β”€β”€β”€β”€β”€β”€β”€β”˜
                    ↑             β”‚
                    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Components:

  1. Prepare Node: Converts chat history to LangChain messages
  2. Agent Node: LLM decides which tools to call
  3. Tools Node: Executes selected tools
  4. Conditional Router: Determines if more tool calls are needed
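
As a rough sketch, wiring these nodes with LangGraph might look like the following (prepare_node, agent_node, tools_node, and should_continue stand in for the node functions above; the names are illustrative, and the actual code lives in src/agents/conversational.py):

from langgraph.graph import StateGraph, END

# Hypothetical wiring of the three-node state machine described above
graph = StateGraph(ConversationalAgentState)
graph.add_node("prepare", prepare_node)  # chat history -> LangChain messages
graph.add_node("agent", agent_node)      # LLM decides which tools to call
graph.add_node("tools", tools_node)      # executes the selected tools

graph.set_entry_point("prepare")
graph.add_edge("prepare", "agent")
# Conditional router: loop through tools until the LLM stops requesting them
graph.add_conditional_edges("agent", should_continue, {"tools": "tools", "end": END})
graph.add_edge("tools", "agent")

app = graph.compile()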

State Structure:

{
    "message": str,              # Current user query
    "history": List[List[str]],  # Conversation history
    "llm_messages": List[Message], # LangChain message format
    "response": str,             # Generated response
    "error": Optional[str]       # Error tracking
}

Key Files:

  • src/agents/conversational.py - LangGraph agent implementation (570 lines)

3. Tool Layer

Purpose: Provides discrete capabilities that the agent can invoke

Categories:

Video Processing Tools (8 tools)

  • File upload management
  • Transcription orchestration
  • Speaker name mapping
  • Transcript editing
  • Pinecone upload

Meeting Query Tools (6 tools)

  • Semantic search
  • Metadata retrieval
  • Meeting listing
  • Text upsert
  • Notion import/export

MCP Integration Tools (6+ tools)

  • Notion API operations
  • Time queries
  • Future: Zoom RTMS

Design Pattern: LangChain @tool decorator for automatic schema generation
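
For illustration, a tool defined with this pattern could look like the sketch below (the function, its body, and the placeholder data are hypothetical, not actual code from src/tools/):

from langchain_core.tools import tool

@tool
def list_meetings(limit: int = 10) -> str:
    """List the most recent meetings stored in the vector index."""
    # LangChain derives the tool name, argument schema, and description
    # from the signature and docstring; the agent sees them at bind time.
    meetings = ["Q4 Planning", "Sprint Review"]  # placeholder data
    return "\n".join(meetings[:limit])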

Key Files:

  • src/tools/video.py - Video processing tools (528 lines)
  • src/tools/general.py - Query and integration tools (577 lines)
  • src/tools/mcp/ - MCP client wrappers

4. Processing Layer

Purpose: Handles compute-intensive operations

Components:

WhisperX Transcription

  • Model: Configurable (tiny/small/medium/large)
  • Features: Word-level timestamps, language detection
  • Performance: GPU-accelerated when available

Pyannote Speaker Diarization

  • Model: pyannote/speaker-diarization-3.1
  • Output: Speaker segments with timestamps
  • Integration: Aligned with WhisperX word timestamps
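
Taken together, the transcription and diarization steps might reduce to a sketch like this (exact WhisperX entry points vary between versions; the file path and HF token are placeholders):

import whisperx

device = "cuda"  # fall back to "cpu" when no GPU is available
audio = whisperx.load_audio("meeting.mp4")  # placeholder path

# 1. Transcribe with WhisperX (configurable model size)
model = whisperx.load_model("small", device)
result = model.transcribe(audio, batch_size=16)

# 2. Align words to audio for word-level timestamps
align_model, metadata = whisperx.load_align_model(result["language"], device)
result = whisperx.align(result["segments"], align_model, metadata, audio, device)

# 3. Diarize and attach speaker labels (SPEAKER_00, SPEAKER_01, ...)
diarizer = whisperx.DiarizationPipeline(use_auth_token="hf_...", device=device)
result = whisperx.assign_word_speakers(diarizer(audio), result)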

Metadata Extraction

  • Model: GPT-4o-mini (cost-optimized)
  • Extracts: Title, date, summary, speaker mapping
  • Format: Structured JSON output
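
A minimal sketch of such an extraction call using OpenAI's JSON mode (prompt wording and function name are illustrative):

import json
from openai import OpenAI

client = OpenAI()

def extract_metadata(transcript: str) -> dict:
    # JSON mode guarantees a parseable structured response
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        response_format={"type": "json_object"},
        messages=[
            {"role": "system", "content": (
                "Extract meeting metadata as JSON with keys: "
                "title, date, summary, speaker_mapping."
            )},
            {"role": "user", "content": transcript},
        ],
    )
    return json.loads(resp.choices[0].message.content)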

Embeddings

  • Model: OpenAI text-embedding-3-small
  • Dimension: 1536
  • Usage: Query and document embedding

Key Files:

  • src/processing/transcription.py - WhisperX + Pyannote pipeline
  • src/processing/metadata_extractor.py - GPT-4o-mini extraction

5. Storage Layer

Purpose: Persistent and temporary data storage

Pinecone Vector Database

  • Type: Serverless
  • Index: meeting-transcripts-1-dev
  • Namespace: Environment-based (development/production)
  • Metadata: Rich metadata for filtering (title, date, source, speakers)

Schema:

{
    "id": "meeting_abc12345_chunk_001",
    "values": [1536-dim embedding],
    "metadata": {
        "meeting_id": "meeting_abc12345",
        "meeting_title": "Q4 Planning",
        "meeting_date": "2024-12-07",
        "summary": "...",
        "speaker_mapping": {...},
        "source": "video",
        "chunk_index": 1,
        "text": "actual transcript chunk"
    }
}
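
An upsert matching this schema might look like the sketch below (note that nested values such as speaker_mapping would need to be JSON-serialized, since Pinecone metadata only supports flat types):

from pinecone import Pinecone

pc = Pinecone(api_key="...")  # PINECONE_API_KEY in practice
index = pc.Index("meeting-transcripts-1-dev")

embedding = [0.0] * 1536  # placeholder for a real text-embedding-3-small vector

index.upsert(
    vectors=[{
        "id": "meeting_abc12345_chunk_001",
        "values": embedding,
        "metadata": {
            "meeting_id": "meeting_abc12345",
            "meeting_title": "Q4 Planning",
            "meeting_date": "2024-12-07",
            "source": "video",
            "chunk_index": 1,
            "text": "actual transcript chunk",
        },
    }],
    namespace="development",  # environment-based namespace isolation
)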

Local State Cache

  • Purpose: Temporary storage for video processing workflow
  • Scope: In-memory, per-session
  • Contents: Uploaded video path, transcription text, timing info

Key Files:

  • src/retrievers/pinecone.py - Vector database manager

6. External Services

Purpose: Third-party APIs and custom MCP servers

OpenAI API

  • Models: GPT-3.5-turbo (agent), GPT-4o-mini (metadata)
  • Usage: Agent reasoning, metadata extraction, embeddings

Notion MCP Server

  • Type: Official @notionhq/notion-mcp-server
  • Transport: stdio (local subprocess)
  • Capabilities: Search, read, create, update pages

Time MCP Server (Custom)

  • Type: Gradio-based MCP server
  • Transport: SSE (Server-Sent Events)
  • Deployment: HuggingFace Spaces
  • URL: https://gfiamon-date-time-mpc-server-tool.hf.space/gradio_api/mcp/sse
  • Purpose: Time-aware query support

Zoom RTMS Server (In Development)

  • Type: FastAPI + Gradio hybrid
  • Transport: stdio + webhooks
  • Status: Prototype, API integration pending
  • Purpose: Live meeting transcription

Key Files:

  • src/tools/mcp/mcp_manager.py - Multi-server MCP client
  • external_mcp_servers/time_mcp_server/ - Custom time server
  • external_mcp_servers/zoom_mcp/ - Zoom RTMS prototype
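
Assuming mcp_manager.py wraps the langchain-mcp-adapters package, connecting to both servers might look roughly like this (config keys follow that library's conventions; the exact API differs across adapter versions):

from langchain_mcp_adapters.client import MultiServerMCPClient

client = MultiServerMCPClient({
    "notion": {
        "command": "npx",
        "args": ["-y", "@notionhq/notion-mcp-server"],
        "transport": "stdio",
    },
    "time": {
        "url": "https://gfiamon-date-time-mpc-server-tool.hf.space/gradio_api/mcp/sse",
        "transport": "sse",
    },
})

# Inside an async context: returns LangChain tools ready to bind to the LLM
tools = await client.get_tools()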

πŸ”„ Data Flow

Video Upload Flow

User uploads video.mp4
    ↓
Gradio saves to temp directory
    ↓
Agent calls transcribe_uploaded_video(path)
    ↓
WhisperX extracts audio + transcribes
    ↓
Pyannote identifies speakers
    ↓
Alignment: Match speakers to transcript
    ↓
Format: SPEAKER_00, SPEAKER_01, etc.
    ↓
Return formatted transcript to agent
    ↓
Agent shows transcript to user
    ↓
User optionally edits or updates speaker names
    ↓
Agent calls upload_transcription_to_pinecone()
    ↓
GPT-4o-mini extracts metadata
    ↓
Text chunked into semantic segments
    ↓
OpenAI embeddings generated
    ↓
Upsert to Pinecone with metadata
    ↓
Return meeting_id to user

Query Flow

User asks: "What action items were assigned last Tuesday?"
    ↓
Agent receives query
    ↓
Agent calls get_time_for_city("Berlin") [Time MCP]
    ↓
Time server returns: "2024-12-07"
    ↓
Agent calculates: "Last Tuesday = 2024-12-03"
    ↓
Agent calls search_meetings(query="action items", date_filter="2024-12-03")
    ↓
Query embedded via OpenAI
    ↓
Pinecone vector search
    ↓
Top-k chunks retrieved with metadata
    ↓
Results returned to agent
    ↓
Agent synthesizes answer from chunks
    ↓
Response streamed to user
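
The embedding-plus-search step in this flow might reduce to a sketch like the following (filter syntax and defaults are illustrative; the real tool is search_meetings in src/tools/general.py):

from openai import OpenAI
from pinecone import Pinecone

openai_client = OpenAI()
index = Pinecone(api_key="...").Index("meeting-transcripts-1-dev")

def search_meetings(query: str, date_filter: str | None = None, top_k: int = 5):
    # Embed the query with the same model used when indexing documents
    emb = openai_client.embeddings.create(
        model="text-embedding-3-small", input=query
    ).data[0].embedding

    # Optional metadata filter narrows results to one meeting date
    flt = {"meeting_date": {"$eq": date_filter}} if date_filter else None
    return index.query(
        vector=emb, top_k=top_k, filter=flt,
        include_metadata=True, namespace="development",
    )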

Notion Integration Flow

User: "Import 'Meeting 3' from Notion"
    ↓
Agent calls import_notion_to_pinecone(query="Meeting 3")
    ↓
Tool calls Notion MCP: API-post-search(query="Meeting 3")
    ↓
Notion returns page_id
    ↓
Tool calls API-retrieve-a-page(page_id) β†’ metadata
    ↓
Tool calls API-get-block-children(page_id) β†’ content blocks
    ↓
Recursive extraction of nested blocks
    ↓
Full text assembled
    ↓
GPT-4o-mini extracts metadata
    ↓
Text chunked and embedded
    ↓
Upsert to Pinecone
    ↓
Return success message with meeting_id

🎨 Key Design Decisions

1. Why LangGraph?

Decision: Use LangGraph instead of LangChain's AgentExecutor or other frameworks

Rationale:

  • βœ… Explicit state management: Full control over conversation state
  • βœ… Async support: Required for MCP tools (Notion API)
  • βœ… Debugging: Clear visibility into state transitions
  • βœ… Flexibility: Easy to add custom nodes and conditional routing
  • βœ… Streaming: Native support for response streaming

Alternative Considered: LangChain AgentExecutor (rejected due to limited async support)


2. Why Separate MCP Servers?

Decision: Deploy custom MCP servers in external_mcp_servers/ as standalone applications

Rationale:

  • βœ… Independent scaling: Time server can handle multiple agents
  • βœ… Deployment flexibility: Update servers without redeploying agent
  • βœ… Development isolation: Test MCP servers independently
  • βœ… Reusability: Other projects can use the same MCP servers
  • βœ… Transport options: HTTP (SSE) for remote, stdio for local

Architecture:

Main Agent (HF Space 1)
    ├── HTTP/SSE ──> Time MCP Server (HF Space 2)
    └── HTTP/SSE ──> Zoom MCP Server (HF Space 3)

Alternative Considered: Embed MCP servers in main app (rejected due to coupling)


3. Why Pinecone Serverless?

Decision: Use Pinecone serverless for vector storage

Rationale:

  • βœ… No infrastructure management: Fully managed
  • βœ… Cost-effective: Pay per usage, no idle costs
  • βœ… Scalability: Auto-scales with demand
  • βœ… Metadata filtering: Rich filtering capabilities
  • βœ… Namespaces: Environment isolation (dev/prod)

Alternative Considered: Chroma (rejected due to self-hosting requirements)


4. Why GPT-3.5-turbo for Agent?

Decision: Use GPT-3.5-turbo instead of GPT-4 for agent reasoning

Rationale:

  • βœ… Cost: 10x cheaper than GPT-4
  • βœ… Speed: Faster response times
  • βœ… Sufficient: Tool calling works well with 3.5-turbo
  • βœ… Budget: GPT-4o-mini used for metadata extraction (specialized task)

Cost Comparison (per 1M tokens):

  • GPT-3.5-turbo: $0.50 input / $1.50 output
  • GPT-4: $30 input / $60 output
  • GPT-4o-mini: $0.15 input / $0.60 output

5. Why Async Patterns?

Decision: Use async/await throughout the agent

Rationale:

  • βœ… MCP requirement: Notion MCP tools are async
  • βœ… Long operations: Transcription can take minutes
  • βœ… Streaming: Gradio async streaming for better UX
  • βœ… Concurrency: Handle multiple tool calls efficiently

Implementation:

async def generate_response(self, message: str, history: list):
    initial_state = {"message": message, "history": history}
    # Stream state updates as the LangGraph state machine executes
    async for event in self.graph.astream(initial_state):
        # Process each event and yield partial output for streaming UX
        yield response_chunk

πŸ—‚οΈ State Management

LangGraph State

Structure: TypedDict with annotated message list

from typing import Annotated, Any, List, Optional, TypedDict
from langgraph.graph.message import add_messages

class ConversationalAgentState(TypedDict):
    message: str                          # Current query
    history: List[List[str]]              # Gradio format
    llm_messages: Annotated[List[Any], add_messages]  # LangChain format
    response: str                         # Generated response
    error: Optional[str]                  # Error tracking

State Transitions:

  1. Prepare: history β†’ llm_messages (format conversion)
  2. Agent: llm_messages β†’ llm_messages (append AI response)
  3. Tools: llm_messages β†’ llm_messages (append tool results)
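
The Prepare transition, for example, might look like this sketch (the message-building details are illustrative):

from langchain_core.messages import AIMessage, HumanMessage

def prepare_node(state: ConversationalAgentState) -> dict:
    # Convert Gradio's [user, assistant] history pairs into LangChain messages
    msgs = []
    for user_msg, ai_msg in state["history"]:
        msgs.append(HumanMessage(content=user_msg))
        if ai_msg:
            msgs.append(AIMessage(content=ai_msg))
    msgs.append(HumanMessage(content=state["message"]))
    return {"llm_messages": msgs}  # add_messages appends these to state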

Persistence: In-memory only; conversation state lives for the duration of a session and is not written to a database


Video Processing State

Purpose: Track video upload workflow across multiple tool calls

Storage: Global dictionary in src/tools/video.py

_video_state = {
    "uploaded_video_path": None,
    "transcription_text": None,
    "transcription_segments": None,
    "timing_info": None,
    "show_video_upload": False,
    "show_transcription_editor": False,
    "transcription_in_progress": False
}

Lifecycle:

  1. request_video_upload() β†’ sets show_video_upload = True
  2. transcribe_uploaded_video() β†’ stores transcript
  3. upload_transcription_to_pinecone() β†’ clears state

Reset: Automatic after successful upload or manual via cancel_video_workflow()


UI State Synchronization

Challenge: Keep Gradio UI in sync with agent state

Solution: Tools return UI state changes via get_video_state()

# Tool returns state
state = get_video_state()
return {
    "show_upload": state["show_video_upload"],
    "show_editor": state["show_transcription_editor"],
    "transcript": state["transcription_text"]
}

Gradio Integration: UI components update based on returned state
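
On the Gradio side, the returned flags would typically be mapped onto component updates, e.g. (a hypothetical handler; gr.update usage may differ in Gradio 5):

import gradio as gr

def sync_ui(state: dict):
    # Map agent-side flags onto Gradio component visibility/content updates
    return (
        gr.update(visible=state["show_upload"]),
        gr.update(visible=state["show_editor"], value=state["transcript"]),
    )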


⚑ Scalability & Performance

Concurrency

Current: Single-user sessions (Gradio default)

Scalability:

  • βœ… Stateless agent (can handle multiple sessions)
  • βœ… Pinecone auto-scales
  • βœ… MCP servers deployed independently
  • ⚠️ WhisperX requires GPU (bottleneck for concurrent transcriptions)

Future Improvements:

  • Queue system for transcription jobs
  • Separate transcription service (microservice)
  • Redis for shared state across instances

Caching

Current Caching:

  • ❌ No LLM response caching
  • ❌ No embedding caching
  • βœ… Pinecone handles vector index caching

Future Improvements:

  • Cache frequent queries (e.g., "list meetings")
  • Cache embeddings for repeated text
  • LangChain cache for LLM responses

Performance Bottlenecks

  1. Transcription: 2-5 minutes for a typical meeting (GPU-dependent)
  2. Metadata Extraction: 5-10 seconds (GPT-4o-mini API call)
  3. Embedding: 1-2 seconds per chunk (OpenAI API)
  4. Pinecone Upsert: 1-3 seconds for a typical meeting

Optimization Strategies:

  • Parallel embedding generation
  • Batch Pinecone upserts
  • Async MCP calls
  • Streaming responses to user
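
Parallel embedding generation, for instance, could be sketched with asyncio.gather (a hypothetical helper, not yet in the codebase):

import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI()

async def embed_chunks(chunks: list[str]) -> list[list[float]]:
    async def embed(text: str) -> list[float]:
        # One request per chunk, issued concurrently rather than sequentially
        resp = await client.embeddings.create(
            model="text-embedding-3-small", input=text
        )
        return resp.data[0].embedding

    return list(await asyncio.gather(*(embed(c) for c in chunks)))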

πŸ”’ Security Architecture

API Key Management

Storage: Environment variables via .env file

OPENAI_API_KEY=sk-...
PINECONE_API_KEY=...
NOTION_TOKEN=secret_...

Access: Loaded via python-dotenv in src/config/settings.py
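
A typical python-dotenv settings module follows this pattern (a sketch; the actual src/config/settings.py may differ):

import os
from dotenv import load_dotenv

load_dotenv()  # reads key=value pairs from .env into the process environment

OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
PINECONE_API_KEY = os.getenv("PINECONE_API_KEY")
NOTION_TOKEN = os.getenv("NOTION_TOKEN")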

Best Practices:

  • βœ… Never commit .env to git (.gitignore configured)
  • βœ… Use HuggingFace Spaces secrets for deployment
  • βœ… Rotate keys regularly

Data Privacy

User Data:

  • Video files: Stored temporarily, deleted after processing
  • Transcripts: Stored in Pinecone (user-controlled index)
  • Conversation history: In-memory only, not persisted

Third-Party Data Sharing:

  • OpenAI: Transcripts sent for embedding/metadata extraction
  • Pinecone: Encrypted at rest and in transit
  • Notion: Only accessed with user's token

Compliance:

  • GDPR: User can delete Pinecone index
  • Data retention: No long-term storage of raw videos

MCP Server Security

Notion MCP:

  • Authentication: User's Notion token
  • Permissions: Limited to token's access scope
  • Transport: stdio (local process, no network exposure)

Time MCP:

  • Authentication: None required (public API)
  • Transport: HTTPS (TLS encrypted)
  • Rate limiting: HuggingFace Spaces default limits

Zoom MCP (planned):

  • Authentication: OAuth 2.0
  • Webhook validation: HMAC-SHA256 signature
  • Transport: HTTPS + WebSocket (TLS)
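
HMAC-SHA256 webhook validation generally reduces to a constant-time digest comparison, e.g. (a generic sketch, not Zoom-specific code):

import hashlib
import hmac

def verify_webhook(secret: str, payload: str, signature: str) -> bool:
    # Constant-time comparison of the expected HMAC-SHA256 digest
    expected = hmac.new(secret.encode(), payload.encode(), hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signature)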

πŸ› οΈ Technology Stack

Core Framework

  • Python: 3.11+
  • LangGraph: Agent orchestration
  • LangChain: Tool abstractions, message handling
  • Gradio: Web UI framework

AI/ML Models

  • OpenAI GPT-3.5-turbo: Agent reasoning
  • OpenAI GPT-4o-mini: Metadata extraction
  • OpenAI text-embedding-3-small: Vector embeddings
  • WhisperX: Speech-to-text transcription
  • Pyannote: Speaker diarization

Storage & Databases

  • Pinecone: Vector database (serverless)
  • Local filesystem: Temporary video storage

External Integrations

  • Notion API: Via MCP server
  • Custom Time API: Via Gradio MCP server
  • Zoom API (planned): Via custom MCP server

Development Tools

  • Docker: Containerization
  • FFmpeg: Audio extraction
  • pytest: Testing (planned)
  • LangSmith: Tracing and debugging (optional)

Deployment

  • HuggingFace Spaces: Primary deployment platform
  • Docker: Container runtime
  • Environment Variables: Configuration management

πŸ“š Related Documentation


πŸ”„ Version History

  • v4.0 (Current): LangGraph-based conversational agent with MCP integration
  • v3.0: Experimental agent patterns
  • v2.0: Basic agent with video processing
  • v1.0: Initial prototype

Last Updated: December 5, 2025
Maintained By: Meeting Intelligence Agent Team