Added Continuous retraining
Files changed:
- .github/workflows/{ci_papeline.yaml → mlops_pipeline.yaml} +8 -0
- README.md +73 -60
- data/new_data.csv +0 -0
- src/train.py +56 -0
.github/workflows/{ci_papeline.yaml → mlops_pipeline.yaml}
RENAMED

@@ -33,6 +33,11 @@ jobs:
           python -m pip install --upgrade pip
           pip install -r requirements.txt

+      - name: Continuous Training (Simulation)
+        run: |
+          # Run the script that checks the data and simulates the training
+          python src/train.py
+
       # D. Run Pytest
       - name: Run Tests
         run: |
@@ -69,6 +74,7 @@ jobs:

           echo "Pushing image to Docker Hub..."
           docker push $IMAGE_TAG
+  # JOB 3: Push to Hugging Face
   deploy_to_huggingface:
     needs: run_tests  # Runs only if the tests pass
     runs-on: ubuntu-latest
@@ -89,6 +95,8 @@ jobs:
         run: |
           # Use --force to impose the GitHub update on Hugging Face, ignoring the history of whatever is inside the Space (it disregards the existing contents, deletes them, and re-updates)
           git push --force https://$HF_USERNAME:[email protected]/spaces/$HF_USERNAME/$SPACE_NAME main
+
+
       # The files are saved to the repository BEFORE the test starts. It is the fact that you "pushed" the files that wakes up the robot and makes it start working.

       # Here is the exact timeline:
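Because the new `Continuous Training (Simulation)` step runs before `Run Tests`, `src/train.py` must exit with code 0 even when `data/new_data.csv` is empty, or the whole pipeline stops before the tests. A minimal local sketch of that contract (the gate logic is condensed from `src/train.py`; the scratch-directory layout is an assumption for illustration):

```python
import os
import subprocess
import sys
import tempfile

# Condensed copy of the data gate in src/train.py: a missing or near-empty
# CSV means "skip retraining" and exit 0, so the later CI jobs still run.
GATE = (
    "import os, sys\n"
    "DATA_PATH = 'data/new_data.csv'\n"
    "if not os.path.exists(DATA_PATH) or os.stat(DATA_PATH).st_size < 10:\n"
    "    print('No new data; skipping retrain.')\n"
    "    sys.exit(0)\n"
    "print('Would retrain here.')\n"
)

with tempfile.TemporaryDirectory() as repo:
    os.makedirs(os.path.join(repo, "data"))
    # Empty placeholder, like the data/new_data.csv added in this commit
    open(os.path.join(repo, "data", "new_data.csv"), "w").close()
    script = os.path.join(repo, "train.py")
    with open(script, "w") as f:
        f.write(GATE)
    result = subprocess.run([sys.executable, script], cwd=repo,
                            capture_output=True, text=True)

print(result.returncode)  # 0 → the CI pipeline proceeds to "Run Tests"
```

Exit code 0 here is the whole point of the design: "no new data" is treated as a valid success, not a failure.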
README.md
CHANGED

---
title: Reputation Monitor
emoji: 📊
colorFrom: blue
colorTo: indigo
app_port: 7860
---

# 📊 End-to-End MLOps Pipeline for Real-Time Reputation Monitoring







## 🚀 Project Overview

**MachineInnovators Inc.** focuses on scalable, production-ready machine learning applications. This project is a comprehensive **MLOps solution** designed to monitor online company reputation through automated sentiment analysis of real-time news.

Unlike standard static notebooks, this repository demonstrates a **full-cycle ML workflow**. The system scrapes live data from **Google News**, analyzes sentiment using a **RoBERTa Transformer** model, and visualizes insights via an interactive dashboard, all orchestrated within a Dockerized environment.

### Key Features

* **Real-Time Data Ingestion:** Automated scraping of Google News for target brand keywords.
* **State-of-the-Art NLP:** Utilizes `twitter-roberta-base-sentiment` for high-accuracy classification.
* **Full-Stack Architecture:** Integrates a **FastAPI** backend for inference and a **Streamlit** frontend for visualization in a single container.
* **CI/CD Automation:** Robust GitHub Actions pipeline for automated testing, building, and deployment to Hugging Face Spaces.
* **Embedded Monitoring:** Basic logging system to track model predictions and sentiment distribution over time.

---

## 🛠️ Tech Stack & Tools

* **Core:** Python 3.9+
* **Machine Learning:** Hugging Face Transformers, PyTorch, Scikit-learn.
* **Backend:** FastAPI, Uvicorn (REST API).
* **Frontend:** Streamlit (Interactive Dashboard).
* **Data Ingestion:** `GoogleNews` library (Real-time scraping).
* **DevOps:** Docker, GitHub Actions (CI/CD).
* **Deployment:** Hugging Face Spaces (Docker SDK).

---

The project follows a rigorous MLOps pipeline to ensure reliability and speed of delivery:

1. **Data & Modeling:**
   * **Input:** Real-time news titles and descriptions fetched dynamically.
   * **Model:** Pre-trained **RoBERTa** model optimized for social media and short-text sentiment.

2. **Containerization (Docker):**
   * The application is containerized using a custom `Dockerfile`.
   * Implements a custom `entrypoint.sh` script to run both the **FastAPI backend** (port 8000) and the **Streamlit frontend** (port 7860) simultaneously.

3. **CI/CD Pipeline (GitHub Actions):**
   * **Trigger:** Pushes to the `main` branch.
   * **Test:** Executes the `pytest` suite to verify API endpoints (`/health`, `/analyze`) and model loading.
   * **Build:** Verifies Docker image creation.
   * **Deploy:** Automatically pushes the validated code to Hugging Face Spaces.

4. **Monitoring:**
   * The system logs every prediction to a local CSV file, which is visualized in the "Monitoring" tab of the dashboard.

---

## 📂 Repository Structure

```bash
├── .github/workflows/   # CI/CD configurations (GitHub Actions)
├── app/                 # Backend application code
│   ├── api/             # FastAPI endpoints (main.py)
│   ├── model/           # Model loader logic (RoBERTa)
│   └── services/        # Google News scraping logic
├── streamlit_app/       # Frontend application code (app.py)
├── src/                 # Training simulation scripts
├── tests/               # Unit and integration tests (Pytest)
├── Dockerfile           # Container configuration
├── entrypoint.sh        # Startup script for dual-process execution
├── requirements.txt     # Project dependencies
└── README.md            # Project documentation
```

## 💻 Installation & Usage

To run this project locally using Docker (recommended):

1. Clone the repository:

   ```bash
   git clone https://github.com/YOUR_USERNAME/SentimentAnalysis.git
   cd SentimentAnalysis
   ```

2. Build the Docker image:

   ```bash
   docker build -t reputation-monitor .
   ```

3. Run the container:

   ```bash
   docker run -p 7860:7860 reputation-monitor
   ```

Access the application at http://localhost:7860.

### Manual Installation (No Docker)

If you prefer running it directly with Python:

1. Install dependencies:

   ```bash
   pip install -r requirements.txt
   ```

2. Start the backend (FastAPI):

   ```bash
   uvicorn app.api.main:app --host 0.0.0.0 --port 8000 --reload
   ```

3. Start the frontend (Streamlit) in a new terminal:

   ```bash
   streamlit run streamlit_app/app.py
   ```

## ⚠️ Limitations & Future Roadmap

* **Data Persistence:** Currently, monitoring logs are stored in an ephemeral CSV file. In a production environment, this would be replaced by a persistent database (e.g., PostgreSQL) to ensure data retention across container restarts.
* **Scalability:** The current Google News scraper is synchronous. Future versions will implement asynchronous scraping (`aiohttp`) or a message queue (RabbitMQ/Celery) for high-volume processing.
* **Model Retraining:** A placeholder pipeline (`src/train.py`) is included. Full implementation would require GPU resources and a labeled dataset for fine-tuning.

## 🤝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

## 📝 License

Distributed under the MIT License. See LICENSE for more information.

### 👤 Author

**Fabio Celaschi**

* [](https://www.linkedin.com/in/fabio-celaschi-4371bb92)
* [](https://www.instagram.com/fabiocelaschi/)
data/new_data.csv
ADDED
(empty file: the placeholder dataset that src/train.py checks)
src/train.py
ADDED
"""
This is meant only as a simulation of the model retrain, since redoing the training
would be very expensive computationally; the intent is just to build the process for it
and integrate it into the YAML file as a mandatory step before the push/deploy.

The script reads a CSV dataset; if it is empty, no retrain is performed and the push
to the repository goes ahead anyway.
"""

import os
import sys
import time

# Configurable paths
DATA_PATH = "data/new_data.csv"
MODEL_OUTPUT_DIR = "models/retrained_roberta"

def train_and_evaluate():
    print("🚀 Starting MLOps Retraining Pipeline...")

    # 1. DATA VALIDATION CHECK
    # Check whether the file exists and is larger than a few bytes
    if not os.path.exists(DATA_PATH) or os.stat(DATA_PATH).st_size < 10:
        print(f"ℹ️ Dataset '{DATA_PATH}' is empty or missing.")
        print("⚠️ No new data available for retraining.")
        print("✅ Skipping process. (This is normal behavior for the demo).")
        # Exit with code 0 (success) because "doing nothing" is a valid outcome
        sys.exit(0)

    # --- SIMULATION ZONE (GPU constraints) ---
    print(f"📂 Loading dataset from {DATA_PATH}...")
    # In a real pipeline: df = pd.read_csv(DATA_PATH)

    print("⚙️ Initializing RoBERTa Fine-Tuning on CPU (Simulation)...")
    time.sleep(2)  # Simulate the loading time

    # Simulated training log
    print("Epoch 1/3: Loss 0.45 ... accuracy: 0.78")
    print("Epoch 2/3: Loss 0.22 ... accuracy: 0.84")

    # 2. MODEL EVALUATION CHECK
    print("⚖️ Evaluating new model vs current production model...")
    # In a real pipeline: if new_accuracy > old_accuracy:
    simulated_improvement = True

    if simulated_improvement:
        print("✅ Performance improved! (Accuracy +2.5%)")
        print(f"💾 Saving new model artifact to {MODEL_OUTPUT_DIR}...")
        os.makedirs(MODEL_OUTPUT_DIR, exist_ok=True)
        with open(f"{MODEL_OUTPUT_DIR}/metadata.txt", "w") as f:
            f.write(f"Model retrained on {time.strftime('%Y-%m-%d')}\nStatus: Active")
    else:
        print("❌ No improvement detected. Keeping the old model.")
        sys.exit(0)

if __name__ == "__main__":
    train_and_evaluate()