System Architecture

Meeting Intelligence Agent - Technical Architecture Documentation

This document provides a high-level overview of the system architecture, design decisions, and component relationships for the Meeting Intelligence Agent.


πŸ“‹ Table of Contents


🎯 System Overview

The Meeting Intelligence Agent is a conversational AI system built on LangGraph that orchestrates meeting video processing, transcription, storage, and intelligent querying through natural language interaction.

Core Capabilities

  1. Video Processing Pipeline: Upload β†’ Transcription β†’ Speaker Diarization β†’ Metadata Extraction β†’ Vector Storage
  2. Semantic Search: RAG-based querying across meeting transcripts using natural language
  3. External Integrations: MCP (Model Context Protocol) servers for Notion and time-aware queries
  4. Conversational Interface: Gradio-based chat UI with file upload support

Design Philosophy

  • Conversational-First: All functionality accessible through natural language
  • Modular Architecture: Clear separation between UI, agent, tools, and services
  • Extensible: MCP protocol enables easy addition of new capabilities
  • Async-Ready: Supports long-running operations (transcription, MCP calls)
  • Production-Ready: Docker support, error handling, graceful degradation

πŸ—οΈ Architecture Diagram

graph TB
    subgraph "Frontend Layer"
        UI[Gradio Interface]
        Chat[Chat Component]
        Upload[File Upload]
        Editor[Transcript Editor]
    end
    
    subgraph "Agent Layer (LangGraph)"
        Agent[Conversational Agent]
        StateMachine[State Machine]
        ToolRouter[Tool Router]
    end
    
    subgraph "Tool Layer"
        VideoTools[Video Processing Tools]
        QueryTools[Meeting Query Tools]
        MCPTools[MCP Integration Tools]
    end
    
    subgraph "Processing Layer"
        WhisperX[WhisperX Transcription]
        Pyannote[Speaker Diarization]
        MetadataExtractor[GPT-4o-mini Metadata]
        Embeddings[OpenAI Embeddings]
    end
    
    subgraph "Storage Layer"
        Pinecone[(Pinecone Vector DB)]
        LocalState[Local State Cache]
    end
    
    subgraph "External Services"
        OpenAI[OpenAI API]
        NotionMCP[Notion MCP Server]
        TimeMCP[Time MCP Server]
        ZoomMCP[Zoom MCP Server<br/>In Development]
    end
    
    UI --> Agent
    Agent --> StateMachine
    StateMachine --> ToolRouter
    ToolRouter --> VideoTools
    ToolRouter --> QueryTools
    ToolRouter --> MCPTools
    
    VideoTools --> WhisperX
    VideoTools --> Pyannote
    VideoTools --> MetadataExtractor
    VideoTools --> Embeddings
    
    QueryTools --> Embeddings
    QueryTools --> Pinecone
    
    MCPTools --> NotionMCP
    MCPTools --> TimeMCP
    MCPTools -.-> ZoomMCP
    
    Embeddings --> Pinecone
    MetadataExtractor --> OpenAI
    Agent --> OpenAI
    
    classDef frontend fill:#e1f5fe,stroke:#01579b,stroke-width:2px,color:#000000
    classDef agent fill:#e8f5e9,stroke:#2e7d32,stroke-width:2px,color:#000000   
    classDef tools fill:#fff3e0,stroke:#ef6c00,stroke-width:2px,color:#000000
    classDef processing fill:#f3e5f5,stroke:#7b1fa2,stroke-width:2px,color:#000000
    classDef storage fill:#fff9c4,stroke:#fbc02d,stroke-width:2px,color:#000000
    classDef external fill:#ffebee,stroke:#c62828,stroke-width:2px,color:#000000
    
    class UI,Chat,Upload,Editor frontend
    class Agent,StateMachine,ToolRouter agent
    class VideoTools,QueryTools,MCPTools tools
    class WhisperX,Pyannote,MetadataExtractor,Embeddings processing
    class Pinecone,LocalState storage
    class OpenAI,NotionMCP,TimeMCP,ZoomMCP external

🧩 Core Components

1. Frontend Layer (Gradio)

Purpose: User interface for interaction and file management

Components:

  • Chat Interface: Primary conversational UI using gr.ChatInterface
  • File Upload: Video file upload widget
  • Transcript Editor: Editable text area for manual corrections
  • State Display: Real-time feedback on processing status

Technology: Gradio 5.x with async support

Key Files:

  • src/ui/gradio_app.py - UI component definitions and event handlers

2. Agent Layer (LangGraph)

Purpose: Orchestrates the entire workflow through conversational AI

Architecture: State machine with three nodes:

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”     β”Œβ”€β”€β”€β”€β”€β”€β”€β”     β”Œβ”€β”€β”€β”€β”€β”€β”€β”
β”‚ PREPARE β”‚ --> β”‚ AGENT β”‚ --> β”‚ TOOLS β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜     β””β”€β”€β”€β”€β”€β”€β”€β”˜     β””β”€β”€β”€β”€β”€β”€β”€β”˜
                    ↑             β”‚
                    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Components:

  1. Prepare Node: Converts chat history to LangChain messages
  2. Agent Node: LLM decides which tools to call
  3. Tools Node: Executes selected tools
  4. Conditional Router: Determines if more tool calls are needed
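
As a rough sketch, wiring these nodes with LangGraph might look like the following (prepare_node, agent_node, tools_node, and should_continue stand in for the node functions above; the names are illustrative, and the actual code lives in src/agents/conversational.py):

from langgraph.graph import StateGraph, END

# Hypothetical wiring of the three-node state machine described above
graph = StateGraph(ConversationalAgentState)
graph.add_node("prepare", prepare_node)  # chat history -> LangChain messages
graph.add_node("agent", agent_node)      # LLM decides which tools to call
graph.add_node("tools", tools_node)      # executes the selected tools

graph.set_entry_point("prepare")
graph.add_edge("prepare", "agent")
# Conditional router: loop through tools until the LLM stops requesting them
graph.add_conditional_edges("agent", should_continue, {"tools": "tools", "end": END})
graph.add_edge("tools", "agent")

app = graph.compile()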

State Structure:

{
    "message": str,              # Current user query
    "history": List[List[str]],  # Conversation history
    "llm_messages": List[Message], # LangChain message format
    "response": str,             # Generated response
    "error": Optional[str]       # Error tracking
}

Key Files:

  • src/agents/conversational.py - LangGraph agent implementation (570 lines)

3. Tool Layer

Purpose: Provides discrete capabilities that the agent can invoke

Categories:

Video Processing Tools (8 tools)

  • File upload management
  • Transcription orchestration
  • Speaker name mapping
  • Transcript editing
  • Pinecone upload

Meeting Query Tools (6 tools)

  • Semantic search
  • Metadata retrieval
  • Meeting listing
  • Text upsert
  • Notion import/export

MCP Integration Tools (6+ tools)

  • Notion API operations
  • Time queries
  • Future: Zoom RTMS

Design Pattern: LangChain @tool decorator for automatic schema generation
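
For illustration, a tool defined with this pattern could look like the sketch below (the function, its body, and the placeholder data are hypothetical, not actual code from src/tools/):

from langchain_core.tools import tool

@tool
def list_meetings(limit: int = 10) -> str:
    """List the most recent meetings stored in the vector index."""
    # LangChain derives the tool name, argument schema, and description
    # from the signature and docstring; the agent sees them at bind time.
    meetings = ["Q4 Planning", "Sprint Review"]  # placeholder data
    return "\n".join(meetings[:limit])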

Key Files:

  • src/tools/video.py - Video processing tools (528 lines)
  • src/tools/general.py - Query and integration tools (577 lines)
  • src/tools/mcp/ - MCP client wrappers

4. Processing Layer

Purpose: Handles compute-intensive operations

Components:

WhisperX Transcription

  • Model: Configurable (tiny/small/medium/large)
  • Features: Word-level timestamps, language detection
  • Performance: GPU-accelerated when available

Pyannote Speaker Diarization

  • Model: pyannote/speaker-diarization-3.1
  • Output: Speaker segments with timestamps
  • Integration: Aligned with WhisperX word timestamps
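
Taken together, the transcription and diarization steps might reduce to a sketch like this (exact WhisperX entry points vary between versions; the file path and HF token are placeholders):

import whisperx

device = "cuda"  # fall back to "cpu" when no GPU is available
audio = whisperx.load_audio("meeting.mp4")  # placeholder path

# 1. Transcribe with WhisperX (configurable model size)
model = whisperx.load_model("small", device)
result = model.transcribe(audio, batch_size=16)

# 2. Align words to audio for word-level timestamps
align_model, metadata = whisperx.load_align_model(result["language"], device)
result = whisperx.align(result["segments"], align_model, metadata, audio, device)

# 3. Diarize and attach speaker labels (SPEAKER_00, SPEAKER_01, ...)
diarizer = whisperx.DiarizationPipeline(use_auth_token="hf_...", device=device)
result = whisperx.assign_word_speakers(diarizer(audio), result)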

Metadata Extraction

  • Model: GPT-4o-mini (cost-optimized)
  • Extracts: Title, date, summary, speaker mapping
  • Format: Structured JSON output
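
A minimal sketch of such an extraction call using OpenAI's JSON mode (prompt wording and function name are illustrative):

import json
from openai import OpenAI

client = OpenAI()

def extract_metadata(transcript: str) -> dict:
    # JSON mode guarantees a parseable structured response
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        response_format={"type": "json_object"},
        messages=[
            {"role": "system", "content": (
                "Extract meeting metadata as JSON with keys: "
                "title, date, summary, speaker_mapping."
            )},
            {"role": "user", "content": transcript},
        ],
    )
    return json.loads(resp.choices[0].message.content)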

Embeddings

  • Model: OpenAI text-embedding-3-small
  • Dimension: 1536
  • Usage: Query and document embedding

Key Files:

  • src/processing/transcription.py - WhisperX + Pyannote pipeline
  • src/processing/metadata_extractor.py - GPT-4o-mini extraction

5. Storage Layer

Purpose: Persistent and temporary data storage

Pinecone Vector Database

  • Type: Serverless
  • Index: meeting-transcripts-1-dev
  • Namespace: Environment-based (development/production)
  • Metadata: Rich metadata for filtering (title, date, source, speakers)

Schema:

{
    "id": "meeting_abc12345_chunk_001",
    "values": [1536-dim embedding],
    "metadata": {
        "meeting_id": "meeting_abc12345",
        "meeting_title": "Q4 Planning",
        "meeting_date": "2024-12-07",
        "summary": "...",
        "speaker_mapping": {...},
        "source": "video",
        "chunk_index": 1,
        "text": "actual transcript chunk"
    }
}
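
An upsert matching this schema might look like the sketch below (note that nested values such as speaker_mapping would need to be JSON-serialized, since Pinecone metadata only supports flat types):

from pinecone import Pinecone

pc = Pinecone(api_key="...")  # PINECONE_API_KEY in practice
index = pc.Index("meeting-transcripts-1-dev")

embedding = [0.0] * 1536  # placeholder for a real text-embedding-3-small vector

index.upsert(
    vectors=[{
        "id": "meeting_abc12345_chunk_001",
        "values": embedding,
        "metadata": {
            "meeting_id": "meeting_abc12345",
            "meeting_title": "Q4 Planning",
            "meeting_date": "2024-12-07",
            "source": "video",
            "chunk_index": 1,
            "text": "actual transcript chunk",
        },
    }],
    namespace="development",  # environment-based namespace isolation
)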

Local State Cache

  • Purpose: Temporary storage for video processing workflow
  • Scope: In-memory, per-session
  • Contents: Uploaded video path, transcription text, timing info

Key Files:

  • src/retrievers/pinecone.py - Vector database manager

6. External Services

Purpose: Third-party APIs and custom MCP servers

OpenAI API

  • Models: GPT-3.5-turbo (agent), GPT-4o-mini (metadata)
  • Usage: Agent reasoning, metadata extraction, embeddings

Notion MCP Server

  • Type: Official @notionhq/notion-mcp-server
  • Transport: stdio (local subprocess)
  • Capabilities: Search, read, create, update pages

Time MCP Server (Custom)

  • Type: Gradio-based MCP server
  • Transport: SSE (Server-Sent Events)
  • Deployment: HuggingFace Spaces
  • URL: https://gfiamon-date-time-mpc-server-tool.hf.space/gradio_api/mcp/sse
  • Purpose: Time-aware query support

Zoom RTMS Server (In Development)

  • Type: FastAPI + Gradio hybrid
  • Transport: stdio + webhooks
  • Status: Prototype, API integration pending
  • Purpose: Live meeting transcription

Key Files:

  • src/tools/mcp/mcp_manager.py - Multi-server MCP client
  • external_mcp_servers/time_mcp_server/ - Custom time server
  • external_mcp_servers/zoom_mcp/ - Zoom RTMS prototype
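
Assuming mcp_manager.py wraps the langchain-mcp-adapters package, connecting to both servers might look roughly like this (config keys follow that library's conventions; the exact API differs across adapter versions):

from langchain_mcp_adapters.client import MultiServerMCPClient

client = MultiServerMCPClient({
    "notion": {
        "command": "npx",
        "args": ["-y", "@notionhq/notion-mcp-server"],
        "transport": "stdio",
    },
    "time": {
        "url": "https://gfiamon-date-time-mpc-server-tool.hf.space/gradio_api/mcp/sse",
        "transport": "sse",
    },
})

# Inside an async context: returns LangChain tools ready to bind to the LLM
tools = await client.get_tools()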

πŸ”„ Data Flow

Video Upload Flow

User uploads video.mp4
    ↓
Gradio saves to temp directory
    ↓
Agent calls transcribe_uploaded_video(path)
    ↓
WhisperX extracts audio + transcribes
    ↓
Pyannote identifies speakers
    ↓
Alignment: Match speakers to transcript
    ↓
Format: SPEAKER_00, SPEAKER_01, etc.
    ↓
Return formatted transcript to agent
    ↓
Agent shows transcript to user
    ↓
User optionally edits or updates speaker names
    ↓
Agent calls upload_transcription_to_pinecone()
    ↓
GPT-4o-mini extracts metadata
    ↓
Text chunked into semantic segments
    ↓
OpenAI embeddings generated
    ↓
Upsert to Pinecone with metadata
    ↓
Return meeting_id to user

Query Flow

User asks: "What action items were assigned last Tuesday?"
    ↓
Agent receives query
    ↓
Agent calls get_time_for_city("Berlin") [Time MCP]
    ↓
Time server returns: "2024-12-07"
    ↓
Agent calculates: "Last Tuesday = 2024-12-03"
    ↓
Agent calls search_meetings(query="action items", date_filter="2024-12-03")
    ↓
Query embedded via OpenAI
    ↓
Pinecone vector search
    ↓
Top-k chunks retrieved with metadata
    ↓
Results returned to agent
    ↓
Agent synthesizes answer from chunks
    ↓
Response streamed to user
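
The embedding-plus-search step in this flow might reduce to a sketch like the following (filter syntax and defaults are illustrative; the real tool is search_meetings in src/tools/general.py):

from openai import OpenAI
from pinecone import Pinecone

openai_client = OpenAI()
index = Pinecone(api_key="...").Index("meeting-transcripts-1-dev")

def search_meetings(query: str, date_filter: str | None = None, top_k: int = 5):
    # Embed the query with the same model used when indexing documents
    emb = openai_client.embeddings.create(
        model="text-embedding-3-small", input=query
    ).data[0].embedding

    # Optional metadata filter narrows results to one meeting date
    flt = {"meeting_date": {"$eq": date_filter}} if date_filter else None
    return index.query(
        vector=emb, top_k=top_k, filter=flt,
        include_metadata=True, namespace="development",
    )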

Notion Integration Flow

User: "Import 'Meeting 3' from Notion"
    ↓
Agent calls import_notion_to_pinecone(query="Meeting 3")
    ↓
Tool calls Notion MCP: API-post-search(query="Meeting 3")
    ↓
Notion returns page_id
    ↓
Tool calls API-retrieve-a-page(page_id) β†’ metadata
    ↓
Tool calls API-get-block-children(page_id) β†’ content blocks
    ↓
Recursive extraction of nested blocks
    ↓
Full text assembled
    ↓
GPT-4o-mini extracts metadata
    ↓
Text chunked and embedded
    ↓
Upsert to Pinecone
    ↓
Return success message with meeting_id

🎨 Key Design Decisions

1. Why LangGraph?

Decision: Use LangGraph instead of LangChain's AgentExecutor or other frameworks

Rationale:

  • βœ… Explicit state management: Full control over conversation state
  • βœ… Async support: Required for MCP tools (Notion API)
  • βœ… Debugging: Clear visibility into state transitions
  • βœ… Flexibility: Easy to add custom nodes and conditional routing
  • βœ… Streaming: Native support for response streaming

Alternative Considered: LangChain AgentExecutor (rejected due to limited async support)


2. Why Separate MCP Servers?

Decision: Deploy custom MCP servers in external_mcp_servers/ as standalone applications

Rationale:

  • βœ… Independent scaling: Time server can handle multiple agents
  • βœ… Deployment flexibility: Update servers without redeploying agent
  • βœ… Development isolation: Test MCP servers independently
  • βœ… Reusability: Other projects can use the same MCP servers
  • βœ… Transport options: HTTP (SSE) for remote, stdio for local

Architecture:

Main Agent (HF Space 1)
    ├── HTTP/SSE ──> Time MCP Server (HF Space 2)
    └── HTTP/SSE ──> Zoom MCP Server (HF Space 3)

Alternative Considered: Embed MCP servers in main app (rejected due to coupling)


3. Why Pinecone Serverless?

Decision: Use Pinecone serverless for vector storage

Rationale:

  • βœ… No infrastructure management: Fully managed
  • βœ… Cost-effective: Pay per usage, no idle costs
  • βœ… Scalability: Auto-scales with demand
  • βœ… Metadata filtering: Rich filtering capabilities
  • βœ… Namespaces: Environment isolation (dev/prod)

Alternative Considered: Chroma (rejected due to self-hosting requirements)


4. Why GPT-3.5-turbo for Agent?

Decision: Use GPT-3.5-turbo instead of GPT-4 for agent reasoning

Rationale:

  • βœ… Cost: 10x cheaper than GPT-4
  • βœ… Speed: Faster response times
  • βœ… Sufficient: Tool calling works well with 3.5-turbo
  • βœ… Budget: GPT-4o-mini used for metadata extraction (specialized task)

Cost Comparison (per 1M tokens):

  • GPT-3.5-turbo: $0.50 input / $1.50 output
  • GPT-4: $30 input / $60 output
  • GPT-4o-mini: $0.15 input / $0.60 output

5. Why Async Patterns?

Decision: Use async/await throughout the agent

Rationale:

  • βœ… MCP requirement: Notion MCP tools are async
  • βœ… Long operations: Transcription can take minutes
  • βœ… Streaming: Gradio async streaming for better UX
  • βœ… Concurrency: Handle multiple tool calls efficiently

Implementation:

async def generate_response(self, message: str, history: list):
    initial_state = {"message": message, "history": history}
    # Stream state updates as the LangGraph state machine executes
    async for event in self.graph.astream(initial_state):
        # Process each event and yield partial output for streaming UX
        yield response_chunk

πŸ—‚οΈ State Management

LangGraph State

Structure: TypedDict with annotated message list

from typing import Annotated, Any, List, Optional, TypedDict
from langgraph.graph.message import add_messages

class ConversationalAgentState(TypedDict):
    message: str                          # Current query
    history: List[List[str]]              # Gradio format
    llm_messages: Annotated[List[Any], add_messages]  # LangChain format
    response: str                         # Generated response
    error: Optional[str]                  # Error tracking

State Transitions:

  1. Prepare: history β†’ llm_messages (format conversion)
  2. Agent: llm_messages β†’ llm_messages (append AI response)
  3. Tools: llm_messages β†’ llm_messages (append tool results)
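
The Prepare transition, for example, might look like this sketch (the message-building details are illustrative):

from langchain_core.messages import AIMessage, HumanMessage

def prepare_node(state: ConversationalAgentState) -> dict:
    # Convert Gradio's [user, assistant] history pairs into LangChain messages
    msgs = []
    for user_msg, ai_msg in state["history"]:
        msgs.append(HumanMessage(content=user_msg))
        if ai_msg:
            msgs.append(AIMessage(content=ai_msg))
    msgs.append(HumanMessage(content=state["message"]))
    return {"llm_messages": msgs}  # add_messages appends these to state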

Persistence: In-memory only; conversation state lives for the duration of a session and is not written to a database


Video Processing State

Purpose: Track video upload workflow across multiple tool calls

Storage: Global dictionary in src/tools/video.py

_video_state = {
    "uploaded_video_path": None,
    "transcription_text": None,
    "transcription_segments": None,
    "timing_info": None,
    "show_video_upload": False,
    "show_transcription_editor": False,
    "transcription_in_progress": False
}

Lifecycle:

  1. request_video_upload() β†’ sets show_video_upload = True
  2. transcribe_uploaded_video() β†’ stores transcript
  3. upload_transcription_to_pinecone() β†’ clears state

Reset: Automatic after successful upload or manual via cancel_video_workflow()


UI State Synchronization

Challenge: Keep Gradio UI in sync with agent state

Solution: Tools return UI state changes via get_video_state()

# Tool returns state
state = get_video_state()
return {
    "show_upload": state["show_video_upload"],
    "show_editor": state["show_transcription_editor"],
    "transcript": state["transcription_text"]
}

Gradio Integration: UI components update based on returned state
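
On the Gradio side, the returned flags would typically be mapped onto component updates, e.g. (a hypothetical handler; gr.update usage may differ in Gradio 5):

import gradio as gr

def sync_ui(state: dict):
    # Map agent-side flags onto Gradio component visibility/content updates
    return (
        gr.update(visible=state["show_upload"]),
        gr.update(visible=state["show_editor"], value=state["transcript"]),
    )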


⚑ Scalability & Performance

Concurrency

Current: Single-user sessions (Gradio default)

Scalability:

  • βœ… Stateless agent (can handle multiple sessions)
  • βœ… Pinecone auto-scales
  • βœ… MCP servers deployed independently
  • ⚠️ WhisperX requires GPU (bottleneck for concurrent transcriptions)

Future Improvements:

  • Queue system for transcription jobs
  • Separate transcription service (microservice)
  • Redis for shared state across instances

Caching

Current Caching:

  • ❌ No LLM response caching
  • ❌ No embedding caching
  • βœ… Pinecone handles vector index caching

Future Improvements:

  • Cache frequent queries (e.g., "list meetings")
  • Cache embeddings for repeated text
  • LangChain cache for LLM responses

Performance Bottlenecks

  1. Transcription: 2-5 minutes for a typical meeting (GPU-dependent)
  2. Metadata Extraction: 5-10 seconds (GPT-4o-mini API call)
  3. Embedding: 1-2 seconds per chunk (OpenAI API)
  4. Pinecone Upsert: 1-3 seconds for a typical meeting

Optimization Strategies:

  • Parallel embedding generation
  • Batch Pinecone upserts
  • Async MCP calls
  • Streaming responses to user
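
Parallel embedding generation, for instance, could be sketched with asyncio.gather (a hypothetical helper, not yet in the codebase):

import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI()

async def embed_chunks(chunks: list[str]) -> list[list[float]]:
    async def embed(text: str) -> list[float]:
        # One request per chunk, issued concurrently rather than sequentially
        resp = await client.embeddings.create(
            model="text-embedding-3-small", input=text
        )
        return resp.data[0].embedding

    return list(await asyncio.gather(*(embed(c) for c in chunks)))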

πŸ”’ Security Architecture

API Key Management

Storage: Environment variables via .env file

OPENAI_API_KEY=sk-...
PINECONE_API_KEY=...
NOTION_TOKEN=secret_...

Access: Loaded via python-dotenv in src/config/settings.py
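
A typical python-dotenv settings module follows this pattern (a sketch; the actual src/config/settings.py may differ):

import os
from dotenv import load_dotenv

load_dotenv()  # reads key=value pairs from .env into the process environment

OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
PINECONE_API_KEY = os.getenv("PINECONE_API_KEY")
NOTION_TOKEN = os.getenv("NOTION_TOKEN")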

Best Practices:

  • βœ… Never commit .env to git (.gitignore configured)
  • βœ… Use HuggingFace Spaces secrets for deployment
  • βœ… Rotate keys regularly

Data Privacy

User Data:

  • Video files: Stored temporarily, deleted after processing
  • Transcripts: Stored in Pinecone (user-controlled index)
  • Conversation history: In-memory only, not persisted

Third-Party Data Sharing:

  • OpenAI: Transcripts sent for embedding/metadata extraction
  • Pinecone: Encrypted at rest and in transit
  • Notion: Only accessed with user's token

Compliance:

  • GDPR: User can delete Pinecone index
  • Data retention: No long-term storage of raw videos

MCP Server Security

Notion MCP:

  • Authentication: User's Notion token
  • Permissions: Limited to token's access scope
  • Transport: stdio (local process, no network exposure)

Time MCP:

  • Authentication: None required (public API)
  • Transport: HTTPS (TLS encrypted)
  • Rate limiting: HuggingFace Spaces default limits

Zoom MCP (planned):

  • Authentication: OAuth 2.0
  • Webhook validation: HMAC-SHA256 signature
  • Transport: HTTPS + WebSocket (TLS)
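
HMAC-SHA256 webhook validation generally reduces to a constant-time digest comparison, e.g. (a generic sketch, not Zoom-specific code):

import hashlib
import hmac

def verify_webhook(secret: str, payload: str, signature: str) -> bool:
    # Constant-time comparison of the expected HMAC-SHA256 digest
    expected = hmac.new(secret.encode(), payload.encode(), hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signature)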

πŸ› οΈ Technology Stack

Core Framework

  • Python: 3.11+
  • LangGraph: Agent orchestration
  • LangChain: Tool abstractions, message handling
  • Gradio: Web UI framework

AI/ML Models

  • OpenAI GPT-3.5-turbo: Agent reasoning
  • OpenAI GPT-4o-mini: Metadata extraction
  • OpenAI text-embedding-3-small: Vector embeddings
  • WhisperX: Speech-to-text transcription
  • Pyannote: Speaker diarization

Storage & Databases

  • Pinecone: Vector database (serverless)
  • Local filesystem: Temporary video storage

External Integrations

  • Notion API: Via MCP server
  • Custom Time API: Via Gradio MCP server
  • Zoom API (planned): Via custom MCP server

Development Tools

  • Docker: Containerization
  • FFmpeg: Audio extraction
  • pytest: Testing (planned)
  • LangSmith: Tracing and debugging (optional)

Deployment

  • HuggingFace Spaces: Primary deployment platform
  • Docker: Container runtime
  • Environment Variables: Configuration management

πŸ“š Related Documentation


πŸ”„ Version History

  • v4.0 (Current): LangGraph-based conversational agent with MCP integration
  • v3.0: Experimental agent patterns
  • v2.0: Basic agent with video processing
  • v1.0: Initial prototype

Last Updated: December 5, 2025
Maintained By: Meeting Intelligence Agent Team