Faffio committed on
Commit
aef5ea1
·
1 Parent(s): f3ac198

Added Continuous retraining

.github/workflows/{ci_papeline.yaml → mlops_pipeline.yaml} RENAMED
@@ -33,6 +33,11 @@ jobs:
          python -m pip install --upgrade pip
          pip install -r requirements.txt
 
+      - name: Continuous Training (Simulation)
+        run: |
+          # Run the script that checks the data and simulates training
+          python src/train.py
+
      # D. Run Pytest
      - name: Run Tests
        run: |
@@ -69,6 +74,7 @@ jobs:
 
          echo "Pushing image to Docker Hub..."
          docker push $IMAGE_TAG
+  # JOB 3: Push to Hugging Face
  deploy_to_huggingface:
    needs: run_tests # Runs only if the tests pass
    runs-on: ubuntu-latest
@@ -89,6 +95,8 @@ jobs:
      run: |
        # Use --force to impose the GitHub state on Hugging Face, ignoring the history of whatever is inside the Space (it disregards the current contents, wipes them, and re-uploads)
        git push --force https://$HF_USERNAME:[email protected]/spaces/$HF_USERNAME/$SPACE_NAME main
+
+
 # The files are saved in the repository BEFORE the tests start. It is the fact that you pushed the files that wakes up the robot and makes it start working.
 
 # Here is the exact timeline:
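The comments above spell out the contract of the new step: `python src/train.py` must exit with code 0 even when `data/new_data.csv` is empty, otherwise the test and deploy jobs never run. A minimal local sanity check of that contract (an editor's sketch, not part of the commit, assuming it is run from the repository root):

```python
# Mirrors what the "Continuous Training (Simulation)" step requires from src/train.py.
import subprocess
import sys

result = subprocess.run(
    [sys.executable, "src/train.py"],  # same command the workflow runs
    capture_output=True,
    text=True,
)
print(result.stdout)
# Any non-zero exit code here would stop the workflow before the test and deploy jobs.
assert result.returncode == 0, "src/train.py would block the CI pipeline"
```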
README.md CHANGED
@@ -1,5 +1,5 @@
1
  ---
2
- title: Sentiment-Analysis
3
  emoji: 📊
4
  colorFrom: blue
5
  colorTo: indigo
@@ -8,35 +8,38 @@ pinned: false
8
  app_port: 7860
9
  ---
10
 
11
- # 📊 End-to-End MLOps Pipeline for Sentiment Analysis regarding Online Reputation
12
 
13
  ![Build Status](https://img.shields.io/badge/build-passing-brightgreen)
14
  ![Python](https://img.shields.io/badge/python-3.9%2B-blue)
 
15
  ![Deployment](https://img.shields.io/badge/deployed%20on-HuggingFace-orange)
16
  ![License](https://img.shields.io/badge/license-MIT-green)
17
 
18
  ## 🚀 Project Overview
19
 
20
- **MachineInnovators Inc.** focuses on scalable, production-ready machine learning applications. This project is a comprehensive **MLOps solution** designed to monitor online company reputation through automated sentiment analysis.
21
 
22
- Unlike standard data science experiments, this repository demonstrates a **full-cycle ML workflow**, moving from model training to automated deployment. It addresses the business need for real-time reputation tracking by classifying social media feedback (Positive, Neutral, Negative) using an automated pipeline.
23
 
24
  ### Key Features
25
- * **Production-First Approach:** Focus on scalability, modularity, and code quality.
26
- * **CI/CD Automation:** Integrated pipeline for automated testing and deployment using GitHub Actions.
27
- * **Continuous Deployment:** Automatic deployment to Hugging Face Spaces upon successful builds.
28
- * **Reproducibility:** Code and environment are strictly versioned to ensure consistent results.
 
29
 
30
  ---
31
 
32
  ## 🛠️ Tech Stack & Tools
33
 
34
  * **Core:** Python 3.9+
35
- * **Machine Learning:** [FastText / Transformers (RoBERTa)] **
36
- * **MLOps & CI/CD:** GitHub Actions
37
- * **Deployment:** Hugging Face Spaces
38
- * **Version Control:** Git
39
- * **Development:** Google Colab (Prototyping) -> VS Code (Production)
 
40
 
41
  ---
42
 
@@ -44,82 +47,92 @@ Unlike standard data science experiments, this repository demonstrates a **full-
44
 
45
  The project follows a rigorous MLOps pipeline to ensure reliability and speed of delivery:
46
 
47
- 1. **Data Ingestion & Preprocessing:**
48
- * Cleaning and tokenization of social media data using industry-standard libraries.
49
- * Usage of public datasets labeled for sentiment analysis.
50
 
51
- 2. **Model Development:**
52
- * Implementation of a robust sentiment classification model.
53
- * Optimization for inference speed and accuracy.
54
 
55
  3. **CI/CD Pipeline (GitHub Actions):**
56
- * **Linting:** Enforces code style (PEP8) to maintain high readability.
57
- * **Testing:** Unit tests ensure that data processing and prediction logic function correctly before any merge.
58
- * **Delivery:** Upon passing all checks on the `main` branch, the application is packaged and deployed.
 
59
 
60
- 4. **Deployment:**
61
- * The model is served via a web interface hosted on **Hugging Face Spaces**, allowing for immediate user interaction and testing.
62
 
63
  ---
64
 
65
  ## 📂 Repository Structure
66
 
67
  ```bash
68
- ├── .github/workflows/ # CI/CD configurations (GitHub Actions)
69
- ├── app/ # Application code (Inference & UI)
70
- ├── src/ # Source code for training and processing
71
- │ ├── model.py # Model architecture and training logic
72
- ├── preprocess.py # Data cleaning pipeline
73
- │ └── utils.py # Utility functions
74
- ├── tests/ # Unit and integration tests
75
- ├── notebooks/ # Exploratory Data Analysis (EDA) and prototyping
76
- ├── requirements.txt # Project dependencies
77
- └── README.md # Project documentation
78
-
79
- Clone the repository:
80
-
 
 
 
 
81
  Bash
82
 
83
- git clone https://github.com/your-username/your-repo-name.git
84
- cd your-repo-name
85
- Install dependencies:
 
86
 
 
 
87
  Bash
88
 
89
- pip install -r requirements.txt
90
- Run the application:
91
 
92
- Bash
 
93
 
94
- python app/main.py
95
- # OR if using Streamlit/Gradio
96
- streamlit run app/app.py
97
- Run Tests:
98
 
99
  Bash
100
 
101
- pytest tests/
102
- 📈 Results and Performance
103
- Model Accuracy: [Insert Accuracy, e.g., 85%]
104
 
105
- F1-Score: [Insert F1 Score]
106
 
107
- Inference Speed: [Optional: e.g., <50ms per tweet]
 
108
 
109
- Note: Detailed analysis of the model's performance and the confusion matrix can be found in the notebooks directory.
110
 
111
- 🔮 Future Improvements
112
- Drift Detection: Implementing tools like Evidently AI to visualize data drift.
 
113
 
114
- Containerization: Fully Dockerizing the application for cloud-agnostic deployment (AWS/GCP).
115
 
116
- API Expansion: Creating a REST API using FastAPI for integration with external dashboards.
117
 
118
  🤝 Contributing
119
- Contributions, issues, and feature requests are welcome! Feel free to check the issues page.
120
 
121
  📝 License
122
  Distributed under the MIT License. See LICENSE for more information.
123
 
124
- 💡 Note for the Reviewer
125
- This project was developed as a comprehensive exercise to demonstrate Full-Stack Data Science capabilities, bridging the gap between model development and production engineering.
 
 
 
 
1
  ---
2
+ title: Reputation Monitor
3
  emoji: 📊
4
  colorFrom: blue
5
  colorTo: indigo
 
8
  app_port: 7860
9
  ---
10
 
11
+ # 📊 End-to-End MLOps Pipeline for Real-Time Reputation Monitoring
12
 
13
  ![Build Status](https://img.shields.io/badge/build-passing-brightgreen)
14
  ![Python](https://img.shields.io/badge/python-3.9%2B-blue)
15
+ ![Model](https://img.shields.io/badge/model-RoBERTa-yellow)
16
  ![Deployment](https://img.shields.io/badge/deployed%20on-HuggingFace-orange)
17
  ![License](https://img.shields.io/badge/license-MIT-green)
18
 
19
  ## 🚀 Project Overview
20
 
21
+ **MachineInnovators Inc.** focuses on scalable, production-ready machine learning applications. This project is a comprehensive **MLOps solution** designed to monitor online company reputation through automated sentiment analysis of real-time news.
22
 
23
+ Unlike standard static notebooks, this repository demonstrates a **full-cycle ML workflow**. The system scrapes live data from **Google News**, analyzes sentiment using a **RoBERTa Transformer** model, and visualizes insights via an interactive dashboard, all orchestrated within a Dockerized environment.
24
 
25
  ### Key Features
26
+ * **Real-Time Data Ingestion:** Automated scraping of Google News for target brand keywords.
27
+ * **State-of-the-Art NLP:** Utilizes `twitter-roberta-base-sentiment` for high-accuracy classification.
28
+ * **Full-Stack Architecture:** Integrates a **FastAPI** backend for inference and a **Streamlit** frontend for visualization in a single container.
29
+ * **CI/CD Automation:** Robust GitHub Actions pipeline for automated testing, building, and deployment to Hugging Face Spaces.
30
+ * **Embedded Monitoring:** Basic logging system to track model predictions and sentiment distribution over time.
31
 
32
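The `twitter-roberta-base-sentiment` checkpoint named in the feature list can be exercised directly with the Transformers pipeline. A minimal sketch, assuming the public `cardiffnlp/twitter-roberta-base-sentiment` model id and its usual three-label output (the project's own wrapper lives in `app/model/`):

```python
# Standalone sketch of the classification step; not the app's actual loader.
from transformers import pipeline

classifier = pipeline(
    "sentiment-analysis",
    model="cardiffnlp/twitter-roberta-base-sentiment",  # assumed full Hub id
)

# The cardiffnlp checkpoints return LABEL_0/1/2; map them to readable classes.
LABELS = {"LABEL_0": "Negative", "LABEL_1": "Neutral", "LABEL_2": "Positive"}

headline = "MachineInnovators Inc. praised for its latest product launch"
result = classifier(headline)[0]
print(LABELS.get(result["label"], result["label"]), round(result["score"], 3))
```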
  ---
33
 
34
  ## 🛠️ Tech Stack & Tools
35
 
36
  * **Core:** Python 3.9+
37
+ * **Machine Learning:** Hugging Face Transformers, PyTorch, Scikit-learn.
38
+ * **Backend:** FastAPI, Uvicorn (REST API).
39
+ * **Frontend:** Streamlit (Interactive Dashboard).
40
+ * **Data Ingestion:** `GoogleNews` library (Real-time scraping).
41
+ * **DevOps:** Docker, GitHub Actions (CI/CD).
42
+ * **Deployment:** Hugging Face Spaces (Docker SDK).
43
 
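Because the stack pairs FastAPI with this model, a hypothetical minimal backend helps make the architecture concrete. The `/health` and `/analyze` routes are taken from the CI description further down; the request schema and response shape are assumptions, not the project's actual API:

```python
# Hypothetical miniature of app/api/main.py; the real module may differ.
from fastapi import FastAPI
from pydantic import BaseModel
from transformers import pipeline

app = FastAPI()
classifier = pipeline(
    "sentiment-analysis",
    model="cardiffnlp/twitter-roberta-base-sentiment",  # assumed checkpoint id
)

class AnalyzeRequest(BaseModel):
    text: str

@app.get("/health")
def health():
    return {"status": "ok"}

@app.post("/analyze")
def analyze(req: AnalyzeRequest):
    prediction = classifier(req.text)[0]
    return {"label": prediction["label"], "score": prediction["score"]}
```

Served with Uvicorn (`uvicorn app.api.main:app --port 8000`), this would be the piece the Streamlit dashboard talks to.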
44
  ---
45
 
 
47
 
48
  The project follows a rigorous MLOps pipeline to ensure reliability and speed of delivery:
49
 
50
+ 1. **Data & Modeling:**
51
+ * **Input:** Real-time news titles and descriptions fetched dynamically.
52
+ * **Model:** Pre-trained **RoBERTa** model optimized for social media and short-text sentiment.
53
 
54
+ 2. **Containerization (Docker):**
55
+ * The application is containerized using a custom `Dockerfile`.
56
+ * Implements a custom `entrypoint.sh` script to run both the **FastAPI backend** (port 8000) and **Streamlit frontend** (port 7860) simultaneously.
57
 
58
  3. **CI/CD Pipeline (GitHub Actions):**
59
+ * **Trigger:** Pushes to the `main` branch.
60
+ * **Test:** Executes the `pytest` suite to verify API endpoints (`/health`, `/analyze`) and model loading.
61
+ * **Build:** Verifies Docker image creation.
62
+ * **Deploy:** Automatically pushes the validated code to Hugging Face Spaces.
63
 
64
+ 4. **Monitoring:**
65
+ * The system logs every prediction to a local CSV file, which is visualized in the "Monitoring" tab of the dashboard.
66
 
67
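A sketch of the kind of check the `pytest` stage above performs, using FastAPI's `TestClient`. The import path follows the `uvicorn app.api.main:app` command given in the installation section; the `/analyze` payload shape is an assumption:

```python
# Illustrative tests in the spirit of tests/; not the project's actual suite.
from fastapi.testclient import TestClient

from app.api.main import app  # import path taken from the uvicorn command in this README

client = TestClient(app)

def test_health_endpoint_is_up():
    assert client.get("/health").status_code == 200

def test_analyze_returns_a_label():
    response = client.post("/analyze", json={"text": "Great quarter for the company"})
    assert response.status_code == 200
    assert "label" in response.json()
```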
  ---
68
 
69
  ## 📂 Repository Structure
70
 
71
  ```bash
72
+ ├── .github/workflows/ # CI/CD configurations (GitHub Actions)
73
+ ├── app/ # Backend Application Code
74
+ │ ├── api/ # FastAPI endpoints (main.py)
75
+ │ ├── model/ # Model loader logic (RoBERTa)
76
+ │ └── services/ # Google News scraping logic
77
+ ├── streamlit_app/ # Frontend Application Code (app.py)
78
+ ├── src/ # Training simulation scripts
79
+ ├── tests/ # Unit and integration tests (Pytest)
80
+ ├── Dockerfile # Container configuration
81
+ ├── entrypoint.sh # Startup script for dual-process execution
82
+ ├── requirements.txt # Project dependencies
83
+ └── README.md # Project documentation
84
+
85
+ 💻 Installation & Usage
86
+ To run this project locally using Docker (Recommended):
87
+
88
+ 1. Clone the repository
89
  Bash
90
 
91
+ git clone https://github.com/YOUR_USERNAME/SentimentAnalysis.git
92
+ cd SentimentAnalysis
93
+ 2. Build the Docker Image
94
+ Bash
95
 
96
+ docker build -t reputation-monitor .
97
+ 3. Run the Container
98
  Bash
99
 
100
+ docker run -p 7860:7860 reputation-monitor
101
+ Access the application at http://localhost:7860
102
 
103
+ Manual Installation (No Docker)
104
+ If you prefer running it directly with Python:
105
 
106
+ Install dependencies:
 
 
 
107
 
108
  Bash
109
 
110
+ pip install -r requirements.txt
111
+ Start the Backend (FastAPI):
 
112
 
113
+ Bash
114
 
115
+ uvicorn app.api.main:app --host 0.0.0.0 --port 8000 --reload
116
+ Start the Frontend (Streamlit) in a new terminal:
117
 
118
+ Bash
119
 
120
+ streamlit run streamlit_app/app.py
121
+ ⚠️ Limitations & Future Roadmap
122
+ Data Persistence: Currently, monitoring logs are stored in an ephemeral CSV file. In a production environment, this would be replaced by a persistent database (e.g., PostgreSQL) to ensure data retention across container restarts.
123
 
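For context, the ephemeral CSV logging mentioned above can be as simple as appending one row per prediction. A hypothetical sketch (file name and columns are assumptions, not the project's actual logger):

```python
# Append-only prediction log; lives on the container filesystem, so it is lost on restart.
import csv
import os
from datetime import datetime, timezone

LOG_PATH = "monitoring_log.csv"  # hypothetical location

def log_prediction(text: str, label: str, score: float) -> None:
    """Append one prediction, writing a header row the first time the file is created."""
    is_new_file = not os.path.exists(LOG_PATH)
    with open(LOG_PATH, "a", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        if is_new_file:
            writer.writerow(["timestamp", "text", "label", "score"])
        writer.writerow([datetime.now(timezone.utc).isoformat(), text, label, score])
```

Because the file lives inside the container, it disappears on restart, which is exactly the gap a persistent database would close.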
124
+ Scalability: The current Google News scraper is synchronous. Future versions will implement asynchronous scraping (aiohttp) or a message queue (RabbitMQ/Celery) for high-volume processing.
125
 
126
+ Model Retraining: A placeholder pipeline (src/train.py) is included. Full implementation would require GPU resources and a labeled dataset for fine-tuning.
127
 
128
  🤝 Contributing
129
+ Contributions are welcome! Please feel free to submit a Pull Request.
130
 
131
  📝 License
132
  Distributed under the MIT License. See LICENSE for more information.
133
 
134
+ ### 👤 Author
135
+
136
+ **Fabio Celaschi**
137
+ * [![LinkedIn](https://img.shields.io/badge/LinkedIn-0077B5?style=for-the-badge&logo=linkedin&logoColor=white)](https://www.linkedin.com/in/fabio-celaschi-4371bb92)
138
+ * [![Instagram](https://img.shields.io/badge/Instagram-E4405F?style=for-the-badge&logo=instagram&logoColor=white)](https://www.instagram.com/fabiocelaschi/)
data/new_data.csv ADDED
File without changes
src/train.py ADDED
@@ -0,0 +1,56 @@
1
+ import os
2
+ import time
3
+ import sys
4
+
5
+ """
6
+ This is meant only as a simulation of retraining the model, since redoing the training
7
+ would be very expensive computationally; the intent is just to build the process
8
+ and integrate it into the workflow YAML as a mandatory step before push/deploy.
9
+
10
+ It reads a CSV dataset; if the file is empty, no retraining is done and the push to the repository proceeds anyway.
11
+ """
12
+
13
+ # Configurable paths
14
+ DATA_PATH = "data/new_data.csv"
15
+ MODEL_OUTPUT_DIR = "models/retrained_roberta"
16
+
17
+ def train_and_evaluate():
18
+ print("🚀 Starting MLOps Retraining Pipeline...")
19
+
20
+ # 1. DATA VALIDATION CHECK
21
+ # Check whether the file exists and contains more than a few bytes
22
+ if not os.path.exists(DATA_PATH) or os.stat(DATA_PATH).st_size < 10:
23
+ print(f"ℹ️ Dataset '{DATA_PATH}' is empty or missing.")
24
+ print("⚠️ No new data available for retraining.")
25
+ print("✅ Skipping process. (This is normal behavior for the demo).")
26
+ # Exit with code 0 (success) because "doing nothing" is a valid outcome here
27
+ sys.exit(0)
28
+
29
+ # --- SIMULATION ZONE (GPU Constraints) ---
30
+ print(f"📂 Loading dataset from {DATA_PATH}...")
31
+ # In a real run: df = pd.read_csv(DATA_PATH)
32
+
33
+ print("⚙️ Initializing RoBERTa Fine-Tuning on CPU (Simulation)...")
34
+ time.sleep(2) # Simulate the loading time
35
+
36
+ # Simulate the training log
37
+ print("Epoch 1/3: Loss 0.45 ... accuracy: 0.78")
38
+ print("Epoch 2/3: Loss 0.22 ... accuracy: 0.84")
39
+
40
+ # 2. MODEL EVALUATION CHECK
41
+ print("⚖️ Evaluating new model vs current production model...")
42
+ # A real pipeline would do: if new_accuracy > old_accuracy:
43
+ simulated_improvement = True
44
+
45
+ if simulated_improvement:
46
+ print("✅ Performance improved! (Accuracy +2.5%)")
47
+ print(f"💾 Saving new model artifact to {MODEL_OUTPUT_DIR}...")
48
+ os.makedirs(MODEL_OUTPUT_DIR, exist_ok=True)
49
+ with open(f"{MODEL_OUTPUT_DIR}/metadata.txt", "w") as f:
50
+ f.write(f"Model retrained on {time.strftime('%Y-%m-%d')}\nStatus: Active")
51
+ else:
52
+ print("❌ No improvement detected. Keeping the old model.")
53
+ sys.exit(0)
54
+
55
+ if __name__ == "__main__":
56
+ train_and_evaluate()
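The script above stops at print statements on purpose; the evaluation gate it alludes to (`if new_accuracy > old_accuracy`) could eventually be filled in along these lines. Dataset columns, the label scheme, and both checkpoint paths are assumptions rather than part of this commit:

```python
# Hedged sketch of a real evaluation gate for the retraining pipeline.
import pandas as pd
from sklearn.metrics import accuracy_score
from transformers import pipeline

DATA_PATH = "data/new_data.csv"
PRODUCTION_MODEL = "cardiffnlp/twitter-roberta-base-sentiment"  # assumed production checkpoint
CANDIDATE_MODEL = "models/retrained_roberta"                    # output of a real fine-tuning step

def accuracy_of(model_name: str, texts: list[str], labels: list[str]) -> float:
    """Score a sentiment pipeline against gold labels (assumed to use the LABEL_0/1/2 scheme)."""
    clf = pipeline("sentiment-analysis", model=model_name)
    predictions = [p["label"] for p in clf(texts, truncation=True)]
    return accuracy_score(labels, predictions)

def evaluation_gate() -> bool:
    df = pd.read_csv(DATA_PATH)  # expected columns: text, label (assumption)
    texts, labels = df["text"].tolist(), df["label"].tolist()
    old_accuracy = accuracy_of(PRODUCTION_MODEL, texts, labels)
    new_accuracy = accuracy_of(CANDIDATE_MODEL, texts, labels)
    print(f"production={old_accuracy:.3f} candidate={new_accuracy:.3f}")
    return new_accuracy > old_accuracy  # promote the candidate only if it improves
```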