# System Architecture
> **Meeting Intelligence Agent - Technical Architecture Documentation**
This document provides a high-level overview of the system architecture, design decisions, and component relationships for the Meeting Intelligence Agent.
---
## 📋 Table of Contents
- [System Overview](#-system-overview)
- [Architecture Diagram](#-architecture-diagram)
- [Core Components](#-core-components)
- [Data Flow](#-data-flow)
- [Key Design Decisions](#-key-design-decisions)
- [State Management](#-state-management)
- [Scalability & Performance](#-scalability--performance)
- [Security Architecture](#-security-architecture)
- [Technology Stack](#-technology-stack)
---
## 🎯 System Overview
The Meeting Intelligence Agent is a **conversational AI system** built on LangGraph that orchestrates meeting video processing, transcription, storage, and intelligent querying through natural language interaction.
### Core Capabilities
1. **Video Processing Pipeline**: Upload → Transcription → Speaker Diarization → Metadata Extraction → Vector Storage
2. **Semantic Search**: RAG-based querying across meeting transcripts using natural language
3. **External Integrations**: MCP (Model Context Protocol) servers for Notion and time-aware queries
4. **Conversational Interface**: Gradio-based chat UI with file upload support
### Design Philosophy
- **Conversational-First**: All functionality accessible through natural language
- **Modular Architecture**: Clear separation between UI, agent, tools, and services
- **Extensible**: MCP protocol enables easy addition of new capabilities
- **Async-Ready**: Supports long-running operations (transcription, MCP calls)
- **Production-Ready**: Docker support, error handling, graceful degradation
---
## 🏗️ Architecture Diagram
```mermaid
graph TB
subgraph "Frontend Layer"
UI[Gradio Interface]
Chat[Chat Component]
Upload[File Upload]
Editor[Transcript Editor]
end
subgraph "Agent Layer (LangGraph)"
Agent[Conversational Agent]
StateMachine[State Machine]
ToolRouter[Tool Router]
end
subgraph "Tool Layer"
VideoTools[Video Processing Tools]
QueryTools[Meeting Query Tools]
MCPTools[MCP Integration Tools]
end
subgraph "Processing Layer"
WhisperX[WhisperX Transcription]
Pyannote[Speaker Diarization]
MetadataExtractor[GPT-4o-mini Metadata]
Embeddings[OpenAI Embeddings]
end
subgraph "Storage Layer"
Pinecone[(Pinecone Vector DB)]
LocalState[Local State Cache]
end
subgraph "External Services"
OpenAI[OpenAI API]
NotionMCP[Notion MCP Server]
TimeMCP[Time MCP Server]
ZoomMCP[Zoom MCP Server<br/>In Development]
end
UI --> Agent
Agent --> StateMachine
StateMachine --> ToolRouter
ToolRouter --> VideoTools
ToolRouter --> QueryTools
ToolRouter --> MCPTools
VideoTools --> WhisperX
VideoTools --> Pyannote
VideoTools --> MetadataExtractor
VideoTools --> Embeddings
QueryTools --> Embeddings
QueryTools --> Pinecone
MCPTools --> NotionMCP
MCPTools --> TimeMCP
MCPTools -.-> ZoomMCP
Embeddings --> Pinecone
MetadataExtractor --> OpenAI
Agent --> OpenAI
classDef frontend fill:#e1f5fe,stroke:#01579b,stroke-width:2px,color:#000000
classDef agent fill:#e8f5e9,stroke:#2e7d32,stroke-width:2px,color:#000000
classDef tools fill:#fff3e0,stroke:#ef6c00,stroke-width:2px,color:#000000
classDef processing fill:#f3e5f5,stroke:#7b1fa2,stroke-width:2px,color:#000000
classDef storage fill:#fff9c4,stroke:#fbc02d,stroke-width:2px,color:#000000
classDef external fill:#ffebee,stroke:#c62828,stroke-width:2px,color:#000000
class UI,Chat,Upload,Editor frontend
class Agent,StateMachine,ToolRouter agent
class VideoTools,QueryTools,MCPTools tools
class WhisperX,Pyannote,MetadataExtractor,Embeddings processing
class Pinecone,LocalState storage
class OpenAI,NotionMCP,TimeMCP,ZoomMCP external
```
---
## 🧩 Core Components
### 1. Frontend Layer (Gradio)
**Purpose**: User interface for interaction and file management
**Components**:
- **Chat Interface**: Primary conversational UI using `gr.ChatInterface`
- **File Upload**: Video file upload widget
- **Transcript Editor**: Editable text area for manual corrections
- **State Display**: Real-time feedback on processing status
**Technology**: Gradio 5.x with async support
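A minimal sketch of how this chat surface might be wired, assuming the agent exposes the async `generate_response` generator described in the Agent Layer section (the `agent` object and parameter choices are illustrative, not the project's exact code):
```python
import gradio as gr

# `agent` is a stand-in for the project's conversational agent instance;
# generate_response is assumed to be an async generator (see Agent Layer).
async def respond(message, history):
    async for chunk in agent.generate_response(message, history):
        yield chunk

demo = gr.ChatInterface(
    fn=respond,
    multimodal=True,  # enables file (video) upload in the chat box
    title="Meeting Intelligence Agent",
)
demo.launch()
```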
**Key Files**:
- `src/ui/gradio_app.py` - UI component definitions and event handlers
---
### 2. Agent Layer (LangGraph)
**Purpose**: Orchestrates the entire workflow through conversational AI
**Architecture**: State machine with three nodes plus a conditional router:
```
┌─────────┐     ┌───────┐     ┌───────┐
│ PREPARE │ --> │ AGENT │ --> │ TOOLS │
└─────────┘     └───────┘     └───────┘
                    ▲             │
                    └─────────────┘
```
**Components**:
1. **Prepare Node**: Converts chat history to LangChain messages
2. **Agent Node**: LLM decides which tools to call
3. **Tools Node**: Executes selected tools
4. **Conditional Router**: Determines if more tool calls are needed
**State Structure**:
```python
{
"message": str, # Current user query
"history": List[List[str]], # Conversation history
"llm_messages": List[Message], # LangChain message format
"response": str, # Generated response
"error": Optional[str] # Error tracking
}
```
**Key Files**:
- `src/agents/conversational.py` - LangGraph agent implementation (570 lines)
---
### 3. Tool Layer
**Purpose**: Provides discrete capabilities that the agent can invoke
**Categories**:
#### Video Processing Tools (8 tools)
- File upload management
- Transcription orchestration
- Speaker name mapping
- Transcript editing
- Pinecone upload
#### Meeting Query Tools (6 tools)
- Semantic search
- Metadata retrieval
- Meeting listing
- Text upsert
- Notion import/export
#### MCP Integration Tools (6+ tools)
- Notion API operations
- Time queries
- Future: Zoom RTMS
**Design Pattern**: LangChain `@tool` decorator for automatic schema generation
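As a hedged illustration of that pattern, the sketch below decorates a hypothetical search function; the argument names, docstring, and `run_semantic_search` helper are assumptions, not the project's actual signature:
```python
from langchain_core.tools import tool

@tool
def search_meetings(query: str, date_filter: str = "") -> str:
    """Semantically search stored meeting transcripts.

    The docstring and type hints are turned into the tool's JSON
    schema, which the LLM uses to decide when and how to call it.
    """
    # Embed the query, run a Pinecone similarity search, and return
    # the matching chunks as formatted text for the LLM to read.
    return run_semantic_search(query, date_filter)  # hypothetical helper
```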
**Key Files**:
- `src/tools/video.py` - Video processing tools (528 lines)
- `src/tools/general.py` - Query and integration tools (577 lines)
- `src/tools/mcp/` - MCP client wrappers
---
### 4. Processing Layer
**Purpose**: Handles compute-intensive operations
**Components**:
#### WhisperX Transcription
- **Model**: Configurable (tiny/small/medium/large)
- **Features**: Word-level timestamps, language detection
- **Performance**: GPU-accelerated when available
#### Pyannote Speaker Diarization
- **Model**: `pyannote/speaker-diarization-3.1`
- **Output**: Speaker segments with timestamps
- **Integration**: Aligned with WhisperX word timestamps
#### Metadata Extraction
- **Model**: GPT-4o-mini (cost-optimized)
- **Extracts**: Title, date, summary, speaker mapping
- **Format**: Structured JSON output
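A minimal sketch of this extraction step, assuming the standard OpenAI Python client and JSON mode (the exact prompt and field names used by the project may differ):
```python
import json
from openai import OpenAI

def extract_metadata(transcript: str) -> dict:
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        response_format={"type": "json_object"},  # force valid JSON output
        messages=[
            {"role": "system",
             "content": "Extract meeting_title, meeting_date, summary, "
                        "and speaker_mapping from the transcript as JSON."},
            {"role": "user", "content": transcript},
        ],
    )
    return json.loads(resp.choices[0].message.content)
```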
#### Embeddings
- **Model**: OpenAI `text-embedding-3-small`
- **Dimension**: 1536
- **Usage**: Query and document embedding
**Key Files**:
- `src/processing/transcription.py` - WhisperX + Pyannote pipeline
- `src/processing/metadata_extractor.py` - GPT-4o-mini extraction
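For orientation, a condensed sketch of how these stages typically compose with the WhisperX API (function names vary slightly across WhisperX versions, and the project's `transcription.py` may structure this differently):
```python
import whisperx

def transcribe_with_speakers(audio_path: str, hf_token: str, device: str = "cpu"):
    # 1. Transcribe (model size is configurable: tiny/small/medium/large)
    model = whisperx.load_model("small", device)
    audio = whisperx.load_audio(audio_path)
    result = model.transcribe(audio)

    # 2. Align words to precise timestamps
    align_model, metadata = whisperx.load_align_model(result["language"], device)
    result = whisperx.align(result["segments"], align_model, metadata, audio, device)

    # 3. Diarize with Pyannote and attach speaker labels to each word
    diarizer = whisperx.DiarizationPipeline(use_auth_token=hf_token, device=device)
    speaker_segments = diarizer(audio)
    return whisperx.assign_word_speakers(speaker_segments, result)
```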
---
### 5. Storage Layer
**Purpose**: Persistent and temporary data storage
#### Pinecone Vector Database
- **Type**: Serverless
- **Index**: `meeting-transcripts-1-dev`
- **Namespace**: Environment-based (`development`/`production`)
- **Metadata**: Rich metadata for filtering (title, date, source, speakers)
**Schema**:
```python
{
"id": "meeting_abc12345_chunk_001",
"values": [1536-dim embedding],
"metadata": {
"meeting_id": "meeting_abc12345",
"meeting_title": "Q4 Planning",
"meeting_date": "2024-12-07",
"summary": "...",
"speaker_mapping": {...},
"source": "video",
"chunk_index": 1,
"text": "actual transcript chunk"
}
}
```
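Writing a record with this schema is a one-call operation in the modern Pinecone client; a sketch, assuming `embedding` holds a 1536-dim vector produced upstream:
```python
from pinecone import Pinecone

pc = Pinecone()  # reads PINECONE_API_KEY from the environment
index = pc.Index("meeting-transcripts-1-dev")

index.upsert(
    vectors=[{
        "id": "meeting_abc12345_chunk_001",
        "values": embedding,  # 1536-dim list[float] from text-embedding-3-small
        "metadata": {"meeting_title": "Q4 Planning", "chunk_index": 1},
    }],
    namespace="development",  # environment-based isolation (dev/prod)
)
```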
#### Local State Cache
- **Purpose**: Temporary storage for video processing workflow
- **Scope**: In-memory, per-session
- **Contents**: Uploaded video path, transcription text, timing info
**Key Files**:
- `src/retrievers/pinecone.py` - Vector database manager
---
### 6. External Services
**Purpose**: Third-party APIs and custom MCP servers
#### OpenAI API
- **Models**: GPT-3.5-turbo (agent), GPT-4o-mini (metadata)
- **Usage**: Agent reasoning, metadata extraction, embeddings
#### Notion MCP Server
- **Type**: Official `@notionhq/notion-mcp-server`
- **Transport**: stdio (local subprocess)
- **Capabilities**: Search, read, create, update pages
#### Time MCP Server (Custom)
- **Type**: Gradio-based MCP server
- **Transport**: SSE (Server-Sent Events)
- **Deployment**: HuggingFace Spaces
- **URL**: `https://gfiamon-date-time-mpc-server-tool.hf.space/gradio_api/mcp/sse`
- **Purpose**: Time-aware query support
#### Zoom RTMS Server (In Development)
- **Type**: FastAPI + Gradio hybrid
- **Transport**: stdio + webhooks
- **Status**: Prototype, API integration pending
- **Purpose**: Live meeting transcription
**Key Files**:
- `src/tools/mcp/mcp_manager.py` - Multi-server MCP client
- `external_mcp_servers/time_mcp_server/` - Custom time server
- `external_mcp_servers/zoom_mcp/` - Zoom RTMS prototype
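One plausible shape for the multi-server client, assuming the `langchain-mcp-adapters` package (the project's `mcp_manager.py` may wrap the MCP SDK differently):
```python
from langchain_mcp_adapters.client import MultiServerMCPClient

client = MultiServerMCPClient({
    "notion": {  # official server, spawned as a local subprocess
        "command": "npx",
        "args": ["-y", "@notionhq/notion-mcp-server"],
        "transport": "stdio",
    },
    "time": {  # custom Gradio server on HuggingFace Spaces
        "url": "https://gfiamon-date-time-mpc-server-tool.hf.space/gradio_api/mcp/sse",
        "transport": "sse",
    },
})
tools = await client.get_tools()  # call from within an async context
```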
---
## 🔄 Data Flow
### Video Upload Flow
```
User uploads video.mp4
  ↓
Gradio saves to temp directory
  ↓
Agent calls transcribe_uploaded_video(path)
  ↓
WhisperX extracts audio + transcribes
  ↓
Pyannote identifies speakers
  ↓
Alignment: Match speakers to transcript
  ↓
Format: SPEAKER_00, SPEAKER_01, etc.
  ↓
Return formatted transcript to agent
  ↓
Agent shows transcript to user
  ↓
User optionally edits or updates speaker names
  ↓
Agent calls upload_transcription_to_pinecone()
  ↓
GPT-4o-mini extracts metadata
  ↓
Text chunked into semantic segments
  ↓
OpenAI embeddings generated
  ↓
Upsert to Pinecone with metadata
  ↓
Return meeting_id to user
```
### Query Flow
```
User asks: "What action items were assigned last Tuesday?"
  ↓
Agent receives query
  ↓
Agent calls get_time_for_city("Berlin") [Time MCP]
  ↓
Time server returns: "2024-12-07"
  ↓
Agent calculates: "Last Tuesday = 2024-12-03"
  ↓
Agent calls search_meetings(query="action items", date_filter="2024-12-03")
  ↓
Query embedded via OpenAI
  ↓
Pinecone vector search
  ↓
Top-k chunks retrieved with metadata
  ↓
Results returned to agent
  ↓
Agent synthesizes answer from chunks
  ↓
Response streamed to user
```
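The embed-and-search step in the middle of this flow reduces to a few calls; a sketch, assuming the standard OpenAI and Pinecone clients and the metadata fields shown earlier:
```python
from openai import OpenAI
from pinecone import Pinecone

def search_chunks(query: str, date_filter: str | None = None, top_k: int = 5):
    # Embed the query with the same model used at index time
    emb = OpenAI().embeddings.create(model="text-embedding-3-small", input=query)
    index = Pinecone().Index("meeting-transcripts-1-dev")
    return index.query(
        vector=emb.data[0].embedding,
        top_k=top_k,
        namespace="development",
        filter={"meeting_date": {"$eq": date_filter}} if date_filter else None,
        include_metadata=True,  # return titles, dates, and raw chunk text
    )
```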
### Notion Integration Flow
```
User: "Import 'Meeting 3' from Notion"
  ↓
Agent calls import_notion_to_pinecone(query="Meeting 3")
  ↓
Tool calls Notion MCP: API-post-search(query="Meeting 3")
  ↓
Notion returns page_id
  ↓
Tool calls API-retrieve-a-page(page_id) → metadata
  ↓
Tool calls API-get-block-children(page_id) → content blocks
  ↓
Recursive extraction of nested blocks
  ↓
Full text assembled
  ↓
GPT-4o-mini extracts metadata
  ↓
Text chunked and embedded
  ↓
Upsert to Pinecone
  ↓
Return success message with meeting_id
```
---
## 🎨 Key Design Decisions
### 1. Why LangGraph?
**Decision**: Use LangGraph instead of LangChain's AgentExecutor or other frameworks
**Rationale**:
- ✅ **Explicit state management**: Full control over conversation state
- ✅ **Async support**: Required for MCP tools (Notion API)
- ✅ **Debugging**: Clear visibility into state transitions
- ✅ **Flexibility**: Easy to add custom nodes and conditional routing
- ✅ **Streaming**: Native support for response streaming
**Alternative Considered**: LangChain AgentExecutor (rejected due to limited async support)
---
### 2. Why Separate MCP Servers?
**Decision**: Deploy custom MCP servers in `external_mcp_servers/` as standalone applications
**Rationale**:
- ✅ **Independent scaling**: Time server can handle multiple agents
- ✅ **Deployment flexibility**: Update servers without redeploying agent
- ✅ **Development isolation**: Test MCP servers independently
- ✅ **Reusability**: Other projects can use the same MCP servers
- ✅ **Transport options**: HTTP (SSE) for remote, stdio for local
**Architecture**:
```
Main Agent (HF Space 1)
  ↓ HTTP/SSE
Time MCP Server (HF Space 2)
  ↓ HTTP/SSE
Zoom MCP Server (HF Space 3)
```
**Alternative Considered**: Embed MCP servers in main app (rejected due to coupling)
---
### 3. Why Pinecone Serverless?
**Decision**: Use Pinecone serverless for vector storage
**Rationale**:
- ✅ **No infrastructure management**: Fully managed
- ✅ **Cost-effective**: Pay per usage, no idle costs
- ✅ **Scalability**: Auto-scales with demand
- ✅ **Metadata filtering**: Rich filtering capabilities
- ✅ **Namespaces**: Environment isolation (dev/prod)
**Alternative Considered**: Chroma (rejected due to self-hosting requirements)
---
### 4. Why GPT-3.5-turbo for Agent?
**Decision**: Use GPT-3.5-turbo instead of GPT-4 for agent reasoning
**Rationale**:
- ✅ **Cost**: Far cheaper than GPT-4 (see the comparison below)
- ✅ **Speed**: Faster response times
- ✅ **Sufficient**: Tool calling works well with 3.5-turbo
- ✅ **Budget**: GPT-4o-mini reserved for metadata extraction (a specialized task)
**Cost Comparison** (per 1M tokens):
- GPT-3.5-turbo: $0.50 input / $1.50 output
- GPT-4: $30 input / $60 output
- GPT-4o-mini: $0.15 input / $0.60 output
---
### 5. Why Async Patterns?
**Decision**: Use `async/await` throughout the agent
**Rationale**:
- ✅ **MCP requirement**: Notion MCP tools are async
- ✅ **Long operations**: Transcription can take minutes
- ✅ **Streaming**: Gradio async streaming for better UX
- ✅ **Concurrency**: Handle multiple tool calls efficiently
**Implementation**:
```python
async def generate_response(self, message, history):
    initial_state = {"message": message, "history": history}
    async for event in self.graph.astream(initial_state):
        # Each event is a node's state update; extract the partial
        # response and stream it to the UI
        yield response_chunk
```
---
## 🗄️ State Management
### LangGraph State
**Structure**: TypedDict with annotated message list
```python
from typing import Annotated, Any, List, Optional
from typing_extensions import TypedDict

from langgraph.graph.message import add_messages

class ConversationalAgentState(TypedDict):
    message: str                                      # Current query
    history: List[List[str]]                          # Gradio format
    llm_messages: Annotated[List[Any], add_messages]  # LangChain format
    response: str                                     # Generated response
    error: Optional[str]                              # Error tracking
```
**State Transitions**:
1. **Prepare**: `history` β `llm_messages` (format conversion)
2. **Agent**: `llm_messages` β `llm_messages` (append AI response)
3. **Tools**: `llm_messages` β `llm_messages` (append tool results)
**Persistence**: In-memory only, no database persistence (stateless per session)
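The three-node loop can be wired up in a few lines of LangGraph; in this sketch, `prepare_node`, `agent_node`, `tools_node`, and the `should_call_tools` predicate are illustrative names, not the project's actual identifiers:
```python
from langgraph.graph import StateGraph, END

graph = StateGraph(ConversationalAgentState)
graph.add_node("prepare", prepare_node)  # history -> llm_messages
graph.add_node("agent", agent_node)      # LLM answers or requests tools
graph.add_node("tools", tools_node)      # execute the requested tools
graph.set_entry_point("prepare")
graph.add_edge("prepare", "agent")
graph.add_conditional_edges(
    "agent",
    should_call_tools,                   # router: more tool calls needed?
    {"tools": "tools", "end": END},
)
graph.add_edge("tools", "agent")         # loop back with tool results
app = graph.compile()
```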
---
### Video Processing State
**Purpose**: Track video upload workflow across multiple tool calls
**Storage**: Global dictionary in `src/tools/video.py`
```python
_video_state = {
"uploaded_video_path": None,
"transcription_text": None,
"transcription_segments": None,
"timing_info": None,
"show_video_upload": False,
"show_transcription_editor": False,
"transcription_in_progress": False
}
```
**Lifecycle**:
1. `request_video_upload()` β sets `show_video_upload = True`
2. `transcribe_uploaded_video()` β stores transcript
3. `upload_transcription_to_pinecone()` β clears state
**Reset**: Automatic after successful upload or manual via `cancel_video_workflow()`
---
### UI State Synchronization
**Challenge**: Keep Gradio UI in sync with agent state
**Solution**: Tools return UI state changes via `get_video_state()`
```python
# Tool returns state
state = get_video_state()
return {
"show_upload": state["show_video_upload"],
"show_editor": state["show_transcription_editor"],
"transcript": state["transcription_text"]
}
```
**Gradio Integration**: UI components update based on returned state
---
## ⚡ Scalability & Performance
### Concurrency
**Current**: Single-user sessions (Gradio default)
**Scalability**:
- ✅ Stateless agent (can handle multiple sessions)
- ✅ Pinecone auto-scales
- ✅ MCP servers deployed independently
- ⚠️ WhisperX requires GPU (bottleneck for concurrent transcriptions)
**Future Improvements**:
- Queue system for transcription jobs
- Separate transcription service (microservice)
- Redis for shared state across instances
---
### Caching
**Current Caching**:
- ❌ No LLM response caching
- ❌ No embedding caching
- ✅ Pinecone handles vector index caching
**Future Improvements**:
- Cache frequent queries (e.g., "list meetings")
- Cache embeddings for repeated text
- LangChain cache for LLM responses
---
### Performance Bottlenecks
1. **Transcription**: 2-5 minutes for typical meeting (GPU-dependent)
2. **Metadata Extraction**: 5-10 seconds (GPT-4o-mini API call)
3. **Embedding**: 1-2 seconds per chunk (OpenAI API)
4. **Pinecone Upsert**: 1-3 seconds for typical meeting
**Optimization Strategies**:
- Parallel embedding generation (see the sketch below)
- Batch Pinecone upserts
- Async MCP calls
- Streaming responses to user
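A sketch of the first two strategies combined, assuming the standard OpenAI and Pinecone clients (chunk IDs and metadata are illustrative):
```python
from openai import OpenAI

def embed_and_upsert(index, chunks: list[str], namespace: str, batch_size: int = 100):
    client = OpenAI()
    for i in range(0, len(chunks), batch_size):
        batch = chunks[i:i + batch_size]
        # One API call embeds the whole batch instead of one call per chunk
        resp = client.embeddings.create(model="text-embedding-3-small", input=batch)
        vectors = [
            {"id": f"chunk_{i + j}", "values": d.embedding, "metadata": {"text": t}}
            for j, (d, t) in enumerate(zip(resp.data, batch))
        ]
        index.upsert(vectors=vectors, namespace=namespace)  # one upsert per batch
```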
---
## 🔐 Security Architecture
### API Key Management
**Storage**: Environment variables via `.env` file
```bash
OPENAI_API_KEY=sk-...
PINECONE_API_KEY=...
NOTION_TOKEN=secret_...
```
**Access**: Loaded via `python-dotenv` in `src/config/settings.py`
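A minimal sketch of that loading pattern (variable handling in the actual `settings.py` may differ):
```python
import os
from dotenv import load_dotenv

load_dotenv()  # reads .env from the project root, if present
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
PINECONE_API_KEY = os.getenv("PINECONE_API_KEY")
NOTION_TOKEN = os.getenv("NOTION_TOKEN")
```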
**Best Practices**:
- ✅ Never commit `.env` to git (`.gitignore` configured)
- ✅ Use HuggingFace Spaces secrets for deployment
- ✅ Rotate keys regularly
---
### Data Privacy
**User Data**:
- Video files: Stored temporarily, deleted after processing
- Transcripts: Stored in Pinecone (user-controlled index)
- Conversation history: In-memory only, not persisted
**Third-Party Data Sharing**:
- OpenAI: Transcripts sent for embedding/metadata extraction
- Pinecone: Encrypted at rest and in transit
- Notion: Only accessed with user's token
**Compliance**:
- GDPR: User can delete Pinecone index
- Data retention: No long-term storage of raw videos
---
### MCP Server Security
**Notion MCP**:
- Authentication: User's Notion token
- Permissions: Limited to token's access scope
- Transport: stdio (local process, no network exposure)
**Time MCP**:
- Authentication: None required (public API)
- Transport: HTTPS (TLS encrypted)
- Rate limiting: HuggingFace Spaces default limits
**Zoom MCP** (planned):
- Authentication: OAuth 2.0
- Webhook validation: HMAC-SHA256 signature
- Transport: HTTPS + WebSocket (TLS)
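The webhook-validation step follows the generic HMAC pattern sketched below; Zoom's exact header names and signing payload are not shown here:
```python
import hashlib
import hmac

def verify_webhook(secret: str, payload: bytes, signature: str) -> bool:
    # Recompute the HMAC-SHA256 digest of the raw request body and
    # compare in constant time to defeat timing attacks
    expected = hmac.new(secret.encode(), payload, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signature)
```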
---
## 🛠️ Technology Stack
### Core Framework
- **Python**: 3.11+
- **LangGraph**: Agent orchestration
- **LangChain**: Tool abstractions, message handling
- **Gradio**: Web UI framework
### AI/ML Models
- **OpenAI GPT-3.5-turbo**: Agent reasoning
- **OpenAI GPT-4o-mini**: Metadata extraction
- **OpenAI text-embedding-3-small**: Vector embeddings
- **WhisperX**: Speech-to-text transcription
- **Pyannote**: Speaker diarization
### Storage & Databases
- **Pinecone**: Vector database (serverless)
- **Local filesystem**: Temporary video storage
### External Integrations
- **Notion API**: Via MCP server
- **Custom Time API**: Via Gradio MCP server
- **Zoom API** (planned): Via custom MCP server
### Development Tools
- **Docker**: Containerization
- **FFmpeg**: Audio extraction
- **pytest**: Testing (planned)
- **LangSmith**: Tracing and debugging (optional)
### Deployment
- **HuggingFace Spaces**: Primary deployment platform
- **Docker**: Container runtime
- **Environment Variables**: Configuration management
---
## 📚 Related Documentation
- [TECHNICAL_IMPLEMENTATION.md](TECHNICAL_IMPLEMENTATION.md) - Detailed tool reference and code examples
- [DEPLOYMENT_GUIDE.md](DEPLOYMENT_GUIDE.md) - Step-by-step deployment instructions
- [README.md](../README.md) - Project overview and quick start
---
## 📝 Version History
- **v4.0** (Current): LangGraph-based conversational agent with MCP integration
- **v3.0**: Experimental agent patterns
- **v2.0**: Basic agent with video processing
- **v1.0**: Initial prototype
---
**Last Updated**: December 5, 2025
**Maintained By**: Meeting Intelligence Agent Team