System Architecture
Meeting Intelligence Agent - Technical Architecture Documentation
This document provides a high-level overview of the system architecture, design decisions, and component relationships for the Meeting Intelligence Agent.
Table of Contents
- System Overview
- Architecture Diagram
- Core Components
- Data Flow
- Key Design Decisions
- State Management
- Scalability & Performance
- Security Architecture
- Technology Stack
System Overview
The Meeting Intelligence Agent is a conversational AI system built on LangGraph that orchestrates meeting video processing, transcription, storage, and intelligent querying through natural language interaction.
Core Capabilities
- Video Processing Pipeline: Upload → Transcription → Speaker Diarization → Metadata Extraction → Vector Storage
- Semantic Search: RAG-based querying across meeting transcripts using natural language
- External Integrations: MCP (Model Context Protocol) servers for Notion and time-aware queries
- Conversational Interface: Gradio-based chat UI with file upload support
Design Philosophy
- Conversational-First: All functionality accessible through natural language
- Modular Architecture: Clear separation between UI, agent, tools, and services
- Extensible: MCP protocol enables easy addition of new capabilities
- Async-Ready: Supports long-running operations (transcription, MCP calls)
- Production-Ready: Docker support, error handling, graceful degradation
Architecture Diagram
graph TB
subgraph "Frontend Layer"
UI[Gradio Interface]
Chat[Chat Component]
Upload[File Upload]
Editor[Transcript Editor]
end
subgraph "Agent Layer (LangGraph)"
Agent[Conversational Agent]
StateMachine[State Machine]
ToolRouter[Tool Router]
end
subgraph "Tool Layer"
VideoTools[Video Processing Tools]
QueryTools[Meeting Query Tools]
MCPTools[MCP Integration Tools]
end
subgraph "Processing Layer"
WhisperX[WhisperX Transcription]
Pyannote[Speaker Diarization]
MetadataExtractor[GPT-4o-mini Metadata]
Embeddings[OpenAI Embeddings]
end
subgraph "Storage Layer"
Pinecone[(Pinecone Vector DB)]
LocalState[Local State Cache]
end
subgraph "External Services"
OpenAI[OpenAI API]
NotionMCP[Notion MCP Server]
TimeMCP[Time MCP Server]
ZoomMCP[Zoom MCP Server<br/>In Development]
end
UI --> Agent
Agent --> StateMachine
StateMachine --> ToolRouter
ToolRouter --> VideoTools
ToolRouter --> QueryTools
ToolRouter --> MCPTools
VideoTools --> WhisperX
VideoTools --> Pyannote
VideoTools --> MetadataExtractor
VideoTools --> Embeddings
QueryTools --> Embeddings
QueryTools --> Pinecone
MCPTools --> NotionMCP
MCPTools --> TimeMCP
MCPTools -.-> ZoomMCP
Embeddings --> Pinecone
MetadataExtractor --> OpenAI
Agent --> OpenAI
classDef frontend fill:#e1f5fe,stroke:#01579b,stroke-width:2px,color:#000000
classDef agent fill:#e8f5e9,stroke:#2e7d32,stroke-width:2px,color:#000000
classDef tools fill:#fff3e0,stroke:#ef6c00,stroke-width:2px,color:#000000
classDef processing fill:#f3e5f5,stroke:#7b1fa2,stroke-width:2px,color:#000000
classDef storage fill:#fff9c4,stroke:#fbc02d,stroke-width:2px,color:#000000
classDef external fill:#ffebee,stroke:#c62828,stroke-width:2px,color:#000000
class UI,Chat,Upload,Editor frontend
class Agent,StateMachine,ToolRouter agent
class VideoTools,QueryTools,MCPTools tools
class WhisperX,Pyannote,MetadataExtractor,Embeddings processing
class Pinecone,LocalState storage
class OpenAI,NotionMCP,TimeMCP,ZoomMCP external
Core Components
1. Frontend Layer (Gradio)
Purpose: User interface for interaction and file management
Components:
- Chat Interface: Primary conversational UI using gr.ChatInterface
- File Upload: Video file upload widget
- Transcript Editor: Editable text area for manual corrections
- State Display: Real-time feedback on processing status
Technology: Gradio 5.x with async support
Key Files:
src/ui/gradio_app.py - UI component definitions and event handlers
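A minimal sketch of how these pieces could be wired in Gradio; handler and component names here are illustrative, not the actual ones in src/ui/gradio_app.py:

import gradio as gr

async def respond(message, history):
    # In the real app this delegates to the LangGraph agent and streams chunks back.
    yield f"(agent reply to) {message}"

with gr.Blocks() as demo:
    gr.ChatInterface(fn=respond)                                   # conversational UI
    video = gr.File(label="Upload meeting video", file_types=["video"], visible=False)  # upload widget
    editor = gr.Textbox(label="Transcript editor", lines=12, visible=False)             # manual corrections

if __name__ == "__main__":
    demo.launch()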
2. Agent Layer (LangGraph)
Purpose: Orchestrates the entire workflow through conversational AI
Architecture: State machine with three nodes:
┌─────────┐     ┌─────────┐     ┌─────────┐
│ PREPARE │ --> │  AGENT  │ --> │  TOOLS  │
└─────────┘     └─────────┘     └─────────┘
                     ▲               │
                     └───────────────┘
Components:
- Prepare Node: Converts chat history to LangChain messages
- Agent Node: LLM decides which tools to call
- Tools Node: Executes selected tools
- Conditional Router: Determines if more tool calls are needed
State Structure:
{
  "message": str,                 # Current user query
  "history": List[List[str]],     # Conversation history
  "llm_messages": List[Message],  # LangChain message format
  "response": str,                # Generated response
  "error": Optional[str]          # Error tracking
}
Key Files:
src/agents/conversational.py - LangGraph agent implementation (570 lines)
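A sketch of how such a three-node graph is typically wired in LangGraph; the node names, sample tool, and routing function below are assumptions for illustration, and the real graph lives in src/agents/conversational.py:

from typing import Annotated, Any, List, Optional, TypedDict
from langchain_core.messages import AIMessage, HumanMessage
from langchain_core.tools import tool
from langchain_openai import ChatOpenAI
from langgraph.graph import END, StateGraph
from langgraph.graph.message import add_messages
from langgraph.prebuilt import ToolNode

class ConversationalAgentState(TypedDict):
    message: str
    history: List[List[str]]
    llm_messages: Annotated[List[Any], add_messages]
    response: str
    error: Optional[str]

@tool
def list_meetings() -> str:
    """List stored meetings (placeholder tool for this sketch)."""
    return "No meetings stored yet."

tools = [list_meetings]
llm = ChatOpenAI(model="gpt-3.5-turbo").bind_tools(tools)

def prepare(state: ConversationalAgentState) -> dict:
    # Prepare node: convert Gradio [user, assistant] pairs into LangChain messages.
    msgs: List[Any] = []
    for user, assistant in state["history"]:
        msgs.append(HumanMessage(content=user))
        if assistant:
            msgs.append(AIMessage(content=assistant))
    msgs.append(HumanMessage(content=state["message"]))
    return {"llm_messages": msgs}

def agent(state: ConversationalAgentState) -> dict:
    # Agent node: the LLM either answers directly or emits tool calls.
    reply = llm.invoke(state["llm_messages"])
    return {"llm_messages": [reply], "response": reply.content or ""}

def route(state: ConversationalAgentState) -> str:
    # Conditional router: go to the tools node only if the last message requested tools.
    last = state["llm_messages"][-1]
    return "tools" if getattr(last, "tool_calls", None) else "end"

graph = StateGraph(ConversationalAgentState)
graph.add_node("prepare", prepare)
graph.add_node("agent", agent)
graph.add_node("tools", ToolNode(tools))
graph.set_entry_point("prepare")
graph.add_edge("prepare", "agent")
graph.add_conditional_edges("agent", route, {"tools": "tools", "end": END})
graph.add_edge("tools", "agent")
app = graph.compile()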
3. Tool Layer
Purpose: Provides discrete capabilities that the agent can invoke
Categories:
Video Processing Tools (8 tools)
- File upload management
- Transcription orchestration
- Speaker name mapping
- Transcript editing
- Pinecone upload
Meeting Query Tools (6 tools)
- Semantic search
- Metadata retrieval
- Meeting listing
- Text upsert
- Notion import/export
MCP Integration Tools (6+ tools)
- Notion API operations
- Time queries
- Future: Zoom RTMS
Design Pattern: LangChain @tool decorator for automatic schema generation
Key Files:
src/tools/video.py - Video processing tools (528 lines)
src/tools/general.py - Query and integration tools (577 lines)
src/tools/mcp/ - MCP client wrappers
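The pattern looks roughly like this; the tool body below is a placeholder, while the real implementations live in src/tools/video.py and src/tools/general.py:

from langchain_core.tools import tool

@tool
def search_meetings(query: str, top_k: int = 5) -> str:
    """Semantic search across stored meeting transcripts.

    The docstring and type hints become the tool's description and argument
    schema, which is what the agent's LLM sees when deciding what to call.
    """
    # Placeholder: the real tool embeds the query and searches Pinecone.
    return f"Top {top_k} results for: {query}"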
4. Processing Layer
Purpose: Handles compute-intensive operations
Components:
WhisperX Transcription
- Model: Configurable (tiny/small/medium/large)
- Features: Word-level timestamps, language detection
- Performance: GPU-accelerated when available
Pyannote Speaker Diarization
- Model: pyannote/speaker-diarization-3.1
- Output: Speaker segments with timestamps
- Integration: Aligned with WhisperX word timestamps
Metadata Extraction
- Model: GPT-4o-mini (cost-optimized)
- Extracts: Title, date, summary, speaker mapping
- Format: Structured JSON output
Embeddings
- Model: OpenAI text-embedding-3-small
- Dimension: 1536
- Usage: Query and document embedding
Key Files:
src/processing/transcription.py - WhisperX + Pyannote pipeline
src/processing/metadata_extractor.py - GPT-4o-mini extraction
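A sketch of the transcription + diarization steps following the published WhisperX API; the model size, compute type, and HF_TOKEN variable are assumptions, and exact module paths vary across WhisperX versions, so src/processing/transcription.py may differ:

import os
import whisperx

device = "cuda"  # use "cpu" and a smaller compute type when no GPU is available
model = whisperx.load_model("small", device, compute_type="float16")

audio = whisperx.load_audio("meeting.mp4")   # FFmpeg extracts the audio track
result = model.transcribe(audio)             # segments + detected language

# Word-level alignment
align_model, metadata = whisperx.load_align_model(language_code=result["language"], device=device)
result = whisperx.align(result["segments"], align_model, metadata, audio, device)

# Speaker diarization (pyannote/speaker-diarization-3.1 under the hood)
diarize_model = whisperx.DiarizationPipeline(use_auth_token=os.environ["HF_TOKEN"], device=device)
diarize_segments = diarize_model(audio)
result = whisperx.assign_word_speakers(diarize_segments, result)

for segment in result["segments"]:
    print(segment.get("speaker", "UNKNOWN"), segment["text"])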
5. Storage Layer
Purpose: Persistent and temporary data storage
Pinecone Vector Database
- Type: Serverless
- Index: meeting-transcripts-1-dev
- Namespace: Environment-based (development/production)
- Metadata: Rich metadata for filtering (title, date, source, speakers)
Schema:
{
  "id": "meeting_abc12345_chunk_001",
  "values": [1536-dim embedding],
  "metadata": {
    "meeting_id": "meeting_abc12345",
    "meeting_title": "Q4 Planning",
    "meeting_date": "2024-12-07",
    "summary": "...",
    "speaker_mapping": {...},
    "source": "video",
    "chunk_index": 1,
    "text": "actual transcript chunk"
  }
}
Local State Cache
- Purpose: Temporary storage for video processing workflow
- Scope: In-memory, per-session
- Contents: Uploaded video path, transcription text, timing info
Key Files:
src/retrievers/pinecone.py - Vector database manager
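A sketch of how a chunk matching this schema would be upserted with the Pinecone client; index and namespace names follow the values above, and the placeholder embedding stands in for a real 1536-dim vector:

import os
from pinecone import Pinecone

pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"])
index = pc.Index("meeting-transcripts-1-dev")

embedding = [0.0] * 1536  # placeholder; real values come from text-embedding-3-small

index.upsert(
    vectors=[{
        "id": "meeting_abc12345_chunk_001",
        "values": embedding,
        "metadata": {
            "meeting_id": "meeting_abc12345",
            "meeting_title": "Q4 Planning",
            "meeting_date": "2024-12-07",
            "source": "video",
            "chunk_index": 1,
            "text": "actual transcript chunk",
        },
    }],
    namespace="development",
)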
6. External Services
Purpose: Third-party APIs and custom MCP servers
OpenAI API
- Models: GPT-3.5-turbo (agent), GPT-4o-mini (metadata)
- Usage: Agent reasoning, metadata extraction, embeddings
Notion MCP Server
- Type: Official @notionhq/notion-mcp-server
- Transport: stdio (local subprocess)
- Capabilities: Search, read, create, update pages
Time MCP Server (Custom)
- Type: Gradio-based MCP server
- Transport: SSE (Server-Sent Events)
- Deployment: HuggingFace Spaces
- URL: https://gfiamon-date-time-mpc-server-tool.hf.space/gradio_api/mcp/sse
- Purpose: Time-aware query support
Zoom RTMS Server (In Development)
- Type: FastAPI + Gradio hybrid
- Transport: stdio + webhooks
- Status: Prototype, API integration pending
- Purpose: Live meeting transcription
Key Files:
src/tools/mcp/mcp_manager.py - Multi-server MCP client
external_mcp_servers/time_mcp_server/ - Custom time server
external_mcp_servers/zoom_mcp/ - Zoom RTMS prototype
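A sketch of a multi-server MCP configuration, assuming the langchain-mcp-adapters client; the Notion env-var name and the exact adapter API are assumptions and may differ from what src/tools/mcp/mcp_manager.py actually does:

import asyncio
import os
from langchain_mcp_adapters.client import MultiServerMCPClient

client = MultiServerMCPClient({
    "notion": {  # official server, spawned locally over stdio
        "command": "npx",
        "args": ["-y", "@notionhq/notion-mcp-server"],
        "env": {"NOTION_TOKEN": os.environ["NOTION_TOKEN"]},
        "transport": "stdio",
    },
    "time": {  # custom Gradio MCP server on HuggingFace Spaces, over SSE
        "url": "https://gfiamon-date-time-mpc-server-tool.hf.space/gradio_api/mcp/sse",
        "transport": "sse",
    },
})

async def main():
    tools = await client.get_tools()  # LangChain-compatible tools for the agent
    print([t.name for t in tools])

asyncio.run(main())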
Data Flow
Video Upload Flow
User uploads video.mp4
↓
Gradio saves to temp directory
↓
Agent calls transcribe_uploaded_video(path)
↓
WhisperX extracts audio + transcribes
↓
Pyannote identifies speakers
↓
Alignment: Match speakers to transcript
↓
Format: SPEAKER_00, SPEAKER_01, etc.
↓
Return formatted transcript to agent
↓
Agent shows transcript to user
↓
User optionally edits or updates speaker names
↓
Agent calls upload_transcription_to_pinecone()
↓
GPT-4o-mini extracts metadata
↓
Text chunked into semantic segments
↓
OpenAI embeddings generated
↓
Upsert to Pinecone with metadata
↓
Return meeting_id to user
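The chunking and embedding steps in this flow might look roughly like this; the chunk size and overlap are assumptions rather than values taken from the codebase:

from langchain_text_splitters import RecursiveCharacterTextSplitter
from openai import OpenAI

transcript_text = "SPEAKER_00: ... full formatted transcript ..."  # placeholder input

splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=150)
chunks = splitter.split_text(transcript_text)   # semantic-ish segments of the transcript

client = OpenAI()
response = client.embeddings.create(model="text-embedding-3-small", input=chunks)
embeddings = [item.embedding for item in response.data]  # one 1536-dim vector per chunk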
Query Flow
User asks: "What action items were assigned last Tuesday?"
↓
Agent receives query
↓
Agent calls get_time_for_city("Berlin") [Time MCP]
↓
Time server returns: "2024-12-07"
↓
Agent calculates: "Last Tuesday = 2024-12-03"
↓
Agent calls search_meetings(query="action items", date_filter="2024-12-03")
↓
Query embedded via OpenAI
↓
Pinecone vector search
↓
Top-k chunks retrieved with metadata
↓
Results returned to agent
↓
Agent synthesizes answer from chunks
↓
Response streamed to user
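The retrieval step of this flow, sketched with the OpenAI and Pinecone clients; the date-filter field name follows the metadata schema shown earlier, and the real query logic lives in src/retrievers/pinecone.py:

import os
from openai import OpenAI
from pinecone import Pinecone

openai_client = OpenAI()
index = Pinecone(api_key=os.environ["PINECONE_API_KEY"]).Index("meeting-transcripts-1-dev")

# Embed the query, then run a metadata-filtered vector search
query_vector = openai_client.embeddings.create(
    model="text-embedding-3-small", input="action items"
).data[0].embedding

results = index.query(
    vector=query_vector,
    top_k=5,
    namespace="development",
    filter={"meeting_date": {"$eq": "2024-12-03"}},
    include_metadata=True,
)
for match in results.matches:
    print(match.score, match.metadata["text"])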
Notion Integration Flow
User: "Import 'Meeting 3' from Notion"
↓
Agent calls import_notion_to_pinecone(query="Meeting 3")
↓
Tool calls Notion MCP: API-post-search(query="Meeting 3")
↓
Notion returns page_id
↓
Tool calls API-retrieve-a-page(page_id) → metadata
↓
Tool calls API-get-block-children(page_id) → content blocks
↓
Recursive extraction of nested blocks
↓
Full text assembled
↓
GPT-4o-mini extracts metadata
↓
Text chunked and embedded
↓
Upsert to Pinecone
↓
Return success message with meeting_id
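A sketch of the recursive block-extraction step, written against a generic async call_tool wrapper around the Notion MCP client; the wrapper and its call convention are assumptions, while the tool name and block structure follow the flow above and the Notion block format:

async def extract_page_text(call_tool, block_id: str) -> str:
    """Recursively collect plain text from a Notion page's block tree."""
    parts = []
    children = await call_tool("API-get-block-children", {"block_id": block_id})
    for block in children.get("results", []):
        # Most block types keep their text under <type>.rich_text
        rich_text = block.get(block.get("type", ""), {}).get("rich_text", [])
        parts.append("".join(rt.get("plain_text", "") for rt in rich_text))
        if block.get("has_children"):
            parts.append(await extract_page_text(call_tool, block["id"]))
    return "\n".join(p for p in parts if p)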
Key Design Decisions
1. Why LangGraph?
Decision: Use LangGraph instead of LangChain's AgentExecutor or other frameworks
Rationale:
- ✅ Explicit state management: Full control over conversation state
- ✅ Async support: Required for MCP tools (Notion API)
- ✅ Debugging: Clear visibility into state transitions
- ✅ Flexibility: Easy to add custom nodes and conditional routing
- ✅ Streaming: Native support for response streaming
Alternative Considered: LangChain AgentExecutor (rejected due to limited async support)
2. Why Separate MCP Servers?
Decision: Deploy custom MCP servers in external_mcp_servers/ as standalone applications
Rationale:
- ✅ Independent scaling: Time server can handle multiple agents
- ✅ Deployment flexibility: Update servers without redeploying agent
- ✅ Development isolation: Test MCP servers independently
- ✅ Reusability: Other projects can use the same MCP servers
- ✅ Transport options: HTTP (SSE) for remote, stdio for local
Architecture:
Main Agent (HF Space 1)
↓ HTTP/SSE
Time MCP Server (HF Space 2)
↓ HTTP/SSE
Zoom MCP Server (HF Space 3)
Alternative Considered: Embed MCP servers in main app (rejected due to coupling)
3. Why Pinecone Serverless?
Decision: Use Pinecone serverless for vector storage
Rationale:
- ✅ No infrastructure management: Fully managed
- ✅ Cost-effective: Pay per usage, no idle costs
- ✅ Scalability: Auto-scales with demand
- ✅ Metadata filtering: Rich filtering capabilities
- ✅ Namespaces: Environment isolation (dev/prod)
Alternative Considered: Chroma (rejected due to self-hosting requirements)
4. Why GPT-3.5-turbo for Agent?
Decision: Use GPT-3.5-turbo instead of GPT-4 for agent reasoning
Rationale:
- ✅ Cost: Roughly 60x cheaper than GPT-4 on input tokens (see comparison below)
- ✅ Speed: Faster response times
- ✅ Sufficient: Tool calling works well with 3.5-turbo
- ✅ Budget: GPT-4o-mini used for metadata extraction (specialized task)
Cost Comparison (per 1M tokens):
- GPT-3.5-turbo: $0.50 input / $1.50 output
- GPT-4: $30 input / $60 output
- GPT-4o-mini: $0.15 input / $0.60 output
5. Why Async Patterns?
Decision: Use async/await throughout the agent
Rationale:
- ✅ MCP requirement: Notion MCP tools are async
- ✅ Long operations: Transcription can take minutes
- ✅ Streaming: Gradio async streaming for better UX
- ✅ Concurrency: Handle multiple tool calls efficiently
Implementation:
async def generate_response(self, message, history):
    initial_state = {"message": message, "history": history,
                     "llm_messages": [], "response": "", "error": None}
    async for event in self.graph.astream(initial_state):
        # Each event carries the latest state update; stream new response text to the UI
        if isinstance(event, dict) and event.get("response"):
            yield event["response"]
State Management
LangGraph State
Structure: TypedDict with annotated message list
class ConversationalAgentState(TypedDict):
    message: str                                        # Current query
    history: List[List[str]]                            # Gradio format
    llm_messages: Annotated[List[Any], add_messages]    # LangChain format
    response: str                                       # Generated response
    error: Optional[str]                                # Error tracking
State Transitions:
- Prepare: history → llm_messages (format conversion)
- Agent: llm_messages → llm_messages (append AI response)
- Tools: llm_messages → llm_messages (append tool results)
Persistence: In-memory only, no database persistence (stateless per session)
Video Processing State
Purpose: Track video upload workflow across multiple tool calls
Storage: Global dictionary in src/tools/video.py
_video_state = {
    "uploaded_video_path": None,
    "transcription_text": None,
    "transcription_segments": None,
    "timing_info": None,
    "show_video_upload": False,
    "show_transcription_editor": False,
    "transcription_in_progress": False
}
Lifecycle:
1. request_video_upload() → sets show_video_upload = True
2. transcribe_uploaded_video() → stores transcript
3. upload_transcription_to_pinecone() → clears state
Reset: Automatic after successful upload or manual via cancel_video_workflow()
UI State Synchronization
Challenge: Keep Gradio UI in sync with agent state
Solution: Tools return UI state changes via get_video_state()
# Tool returns state
state = get_video_state()
return {
    "show_upload": state["show_video_upload"],
    "show_editor": state["show_transcription_editor"],
    "transcript": state["transcription_text"]
}
Gradio Integration: UI components update based on returned state
Scalability & Performance
Concurrency
Current: Single-user sessions (Gradio default)
Scalability:
- ✅ Stateless agent (can handle multiple sessions)
- ✅ Pinecone auto-scales
- ✅ MCP servers deployed independently
- ⚠️ WhisperX requires GPU (bottleneck for concurrent transcriptions)
Future Improvements:
- Queue system for transcription jobs
- Separate transcription service (microservice)
- Redis for shared state across instances
Caching
Current Caching:
- ❌ No LLM response caching
- ❌ No embedding caching
- ✅ Pinecone handles vector index caching
Future Improvements:
- Cache frequent queries (e.g., "list meetings")
- Cache embeddings for repeated text
- LangChain cache for LLM responses (see the sketch below)
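For the LLM-response caching item, LangChain's global cache hook is one likely route; this is a sketch of a possible future improvement, not something currently wired into the agent:

from langchain_core.caches import InMemoryCache
from langchain_core.globals import set_llm_cache

# Identical prompts (e.g. repeated "list meetings" turns) are then answered from cache
set_llm_cache(InMemoryCache())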
Performance Bottlenecks
- Transcription: 2-5 minutes for typical meeting (GPU-dependent)
- Metadata Extraction: 5-10 seconds (GPT-4o-mini API call)
- Embedding: 1-2 seconds per chunk (OpenAI API)
- Pinecone Upsert: 1-3 seconds for typical meeting
Optimization Strategies:
- Parallel embedding generation
- Batch Pinecone upserts
- Async MCP calls
- Streaming responses to user
Security Architecture
API Key Management
Storage: Environment variables via .env file
OPENAI_API_KEY=sk-...
PINECONE_API_KEY=...
NOTION_TOKEN=secret_...
Access: Loaded via python-dotenv in src/config/settings.py
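A sketch of that loading step with python-dotenv; the variable names match the example above, though the actual settings module may wrap them differently:

import os
from dotenv import load_dotenv

load_dotenv()  # reads key=value pairs from .env into the process environment

OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
PINECONE_API_KEY = os.getenv("PINECONE_API_KEY")
NOTION_TOKEN = os.getenv("NOTION_TOKEN")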
Best Practices:
- ✅ Never commit .env to git (.gitignore configured)
- ✅ Use HuggingFace Spaces secrets for deployment
- ✅ Rotate keys regularly
Data Privacy
User Data:
- Video files: Stored temporarily, deleted after processing
- Transcripts: Stored in Pinecone (user-controlled index)
- Conversation history: In-memory only, not persisted
Third-Party Data Sharing:
- OpenAI: Transcripts sent for embedding/metadata extraction
- Pinecone: Encrypted at rest and in transit
- Notion: Only accessed with user's token
Compliance:
- GDPR: User can delete Pinecone index
- Data retention: No long-term storage of raw videos
MCP Server Security
Notion MCP:
- Authentication: User's Notion token
- Permissions: Limited to token's access scope
- Transport: stdio (local process, no network exposure)
Time MCP:
- Authentication: None required (public API)
- Transport: HTTPS (TLS encrypted)
- Rate limiting: HuggingFace Spaces default limits
Zoom MCP (planned):
- Authentication: OAuth 2.0
- Webhook validation: HMAC-SHA256 signature (sketched below)
- Transport: HTTPS + WebSocket (TLS)
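A sketch of the planned HMAC-SHA256 webhook validation; the function name, header extraction, and secret handling are assumptions since the Zoom integration is still in development:

import hashlib
import hmac

def verify_webhook(secret: str, raw_body: bytes, received_signature: str) -> bool:
    """Return True if the raw webhook body matches the signature sent with the request."""
    expected = hmac.new(secret.encode(), raw_body, hashlib.sha256).hexdigest()
    # Constant-time comparison avoids timing side channels
    return hmac.compare_digest(expected, received_signature)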
Technology Stack
Core Framework
- Python: 3.11+
- LangGraph: Agent orchestration
- LangChain: Tool abstractions, message handling
- Gradio: Web UI framework
AI/ML Models
- OpenAI GPT-3.5-turbo: Agent reasoning
- OpenAI GPT-4o-mini: Metadata extraction
- OpenAI text-embedding-3-small: Vector embeddings
- WhisperX: Speech-to-text transcription
- Pyannote: Speaker diarization
Storage & Databases
- Pinecone: Vector database (serverless)
- Local filesystem: Temporary video storage
External Integrations
- Notion API: Via MCP server
- Custom Time API: Via Gradio MCP server
- Zoom API (planned): Via custom MCP server
Development Tools
- Docker: Containerization
- FFmpeg: Audio extraction
- pytest: Testing (planned)
- LangSmith: Tracing and debugging (optional)
Deployment
- HuggingFace Spaces: Primary deployment platform
- Docker: Container runtime
- Environment Variables: Configuration management
Related Documentation
- TECHNICAL_IMPLEMENTATION.md - Detailed tool reference and code examples
- DEPLOYMENT_GUIDE.md - Step-by-step deployment instructions
- README.md - Project overview and quick start
Version History
- v4.0 (Current): LangGraph-based conversational agent with MCP integration
- v3.0: Experimental agent patterns
- v2.0: Basic agent with video processing
- v1.0: Initial prototype
Last Updated: December 5, 2025
Maintained By: Meeting Intelligence Agent Team