Faffio committed on
Commit
aef5ea1
·
1 Parent(s): f3ac198

Added Continuous retraining

.github/workflows/{ci_papeline.yaml → mlops_pipeline.yaml} RENAMED
@@ -33,6 +33,11 @@ jobs:
          python -m pip install --upgrade pip
          pip install -r requirements.txt
 
+      - name: Continuous Training (Simulation)
+        run: |
+          # Run the script that checks the data and simulates training
+          python src/train.py
+
      # D. Run Pytest
      - name: Run Tests
        run: |
@@ -69,6 +74,7 @@ jobs:
 
          echo "Pushing image to Docker Hub..."
          docker push $IMAGE_TAG
+  # JOB 3: Push to Hugging Face
  deploy_to_huggingface:
    needs: run_tests # Runs only if the tests pass
    runs-on: ubuntu-latest
@@ -89,6 +95,8 @@ jobs:
      run: |
        # Use --force to impose the GitHub state on Hugging Face, ignoring the history of whatever is inside the Space (it disregards the current contents, wipes them, and re-uploads)
        git push --force https://$HF_USERNAME:[email protected]/spaces/$HF_USERNAME/$SPACE_NAME main
+
+
 # The files are saved in the repository BEFORE the tests start. It is the fact that you pushed the files that wakes up the robot and makes it start working.
 
 # Here is the exact timeline:
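The comments above spell out the contract of the new step: `python src/train.py` must exit with code 0 even when `data/new_data.csv` is empty, otherwise the test and deploy jobs never run. A minimal local sanity check of that contract (an editor's sketch, not part of the commit, assuming it is run from the repository root):

```python
# Mirrors what the "Continuous Training (Simulation)" step requires from src/train.py.
import subprocess
import sys

result = subprocess.run(
    [sys.executable, "src/train.py"],  # same command the workflow runs
    capture_output=True,
    text=True,
)
print(result.stdout)
# Any non-zero exit code here would stop the workflow before the test and deploy jobs.
assert result.returncode == 0, "src/train.py would block the CI pipeline"
```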
README.md CHANGED
@@ -1,5 +1,5 @@
1
  ---
2
- title: Sentiment-Analysis
3
  emoji: 📊
4
  colorFrom: blue
5
  colorTo: indigo
@@ -8,35 +8,38 @@ pinned: false
8
  app_port: 7860
9
  ---
10
 
11
- # 📊 End-to-End MLOps Pipeline for Sentiment Analysis regarding Online Reputation
12
 
13
  ![Build Status](https://img.shields.io/badge/build-passing-brightgreen)
14
  ![Python](https://img.shields.io/badge/python-3.9%2B-blue)
 
15
  ![Deployment](https://img.shields.io/badge/deployed%20on-HuggingFace-orange)
16
  ![License](https://img.shields.io/badge/license-MIT-green)
17
 
18
  ## 🚀 Project Overview
19
 
20
- **MachineInnovators Inc.** focuses on scalable, production-ready machine learning applications. This project is a comprehensive **MLOps solution** designed to monitor online company reputation through automated sentiment analysis.
21
 
22
- Unlike standard data science experiments, this repository demonstrates a **full-cycle ML workflow**, moving from model training to automated deployment. It addresses the business need for real-time reputation tracking by classifying social media feedback (Positive, Neutral, Negative) using an automated pipeline.
23
 
24
  ### Key Features
25
- * **Production-First Approach:** Focus on scalability, modularity, and code quality.
26
- * **CI/CD Automation:** Integrated pipeline for automated testing and deployment using GitHub Actions.
27
- * **Continuous Deployment:** Automatic deployment to Hugging Face Spaces upon successful builds.
28
- * **Reproducibility:** Code and environment are strictly versioned to ensure consistent results.
 
29
 
30
  ---
31
 
32
  ## 🛠️ Tech Stack & Tools
33
 
34
  * **Core:** Python 3.9+
35
- * **Machine Learning:** [FastText / Transformers (RoBERTa)] **
36
- * **MLOps & CI/CD:** GitHub Actions
37
- * **Deployment:** Hugging Face Spaces
38
- * **Version Control:** Git
39
- * **Development:** Google Colab (Prototyping) -> VS Code (Production)
 
40
 
41
  ---
42
 
@@ -44,82 +47,92 @@ Unlike standard data science experiments, this repository demonstrates a **full-
44
 
45
  The project follows a rigorous MLOps pipeline to ensure reliability and speed of delivery:
46
 
47
- 1. **Data Ingestion & Preprocessing:**
48
- * Cleaning and tokenization of social media data using industry-standard libraries.
49
- * Usage of public datasets labeled for sentiment analysis.
50
 
51
- 2. **Model Development:**
52
- * Implementation of a robust sentiment classification model.
53
- * Optimization for inference speed and accuracy.
54
 
55
  3. **CI/CD Pipeline (GitHub Actions):**
56
- * **Linting:** Enforces code style (PEP8) to maintain high readability.
57
- * **Testing:** Unit tests ensure that data processing and prediction logic function correctly before any merge.
58
- * **Delivery:** Upon passing all checks on the `main` branch, the application is packaged and deployed.
 
59
 
60
- 4. **Deployment:**
61
- * The model is served via a web interface hosted on **Hugging Face Spaces**, allowing for immediate user interaction and testing.
62
 
63
  ---
64
 
65
  ## 📂 Repository Structure
66
 
67
  ```bash
68
- ├── .github/workflows/ # CI/CD configurations (GitHub Actions)
69
- ├── app/ # Application code (Inference & UI)
70
- ├── src/ # Source code for training and processing
71
- │ ├── model.py # Model architecture and training logic
72
- ├── preprocess.py # Data cleaning pipeline
73
- │ └── utils.py # Utility functions
74
- ├── tests/ # Unit and integration tests
75
- ├── notebooks/ # Exploratory Data Analysis (EDA) and prototyping
76
- ├── requirements.txt # Project dependencies
77
- └── README.md # Project documentation
78
-
79
- Clone the repository:
80
-
 
 
 
 
81
  Bash
82
 
83
- git clone https://github.com/your-username/your-repo-name.git
84
- cd your-repo-name
85
- Install dependencies:
 
86
 
 
 
87
  Bash
88
 
89
- pip install -r requirements.txt
90
- Run the application:
91
 
92
- Bash
 
93
 
94
- python app/main.py
95
- # OR if using Streamlit/Gradio
96
- streamlit run app/app.py
97
- Run Tests:
98
 
99
  Bash
100
 
101
- pytest tests/
102
- 📈 Results and Performance
103
- Model Accuracy: [Insert Accuracy, e.g., 85%]
104
 
105
- F1-Score: [Insert F1 Score]
106
 
107
- Inference Speed: [Optional: e.g., <50ms per tweet]
 
108
 
109
- Note: Detailed analysis of the model's performance and the confusion matrix can be found in the notebooks directory.
110
 
111
- 🔮 Future Improvements
112
- Drift Detection: Implementing tools like Evidently AI to visualize data drift.
 
113
 
114
- Containerization: Fully Dockerizing the application for cloud-agnostic deployment (AWS/GCP).
115
 
116
- API Expansion: Creating a REST API using FastAPI for integration with external dashboards.
117
 
118
  🤝 Contributing
119
- Contributions, issues, and feature requests are welcome! Feel free to check the issues page.
120
 
121
  📝 License
122
  Distributed under the MIT License. See LICENSE for more information.
123
 
124
- 💡 Note for the Reviewer
125
- This project was developed as a comprehensive exercise to demonstrate Full-Stack Data Science capabilities, bridging the gap between model development and production engineering.
 
 
 
 
1
  ---
2
+ title: Reputation Monitor
3
  emoji: 📊
4
  colorFrom: blue
5
  colorTo: indigo
 
8
  app_port: 7860
9
  ---
10
 
11
+ # 📊 End-to-End MLOps Pipeline for Real-Time Reputation Monitoring
12
 
13
  ![Build Status](https://img.shields.io/badge/build-passing-brightgreen)
14
  ![Python](https://img.shields.io/badge/python-3.9%2B-blue)
15
+ ![Model](https://img.shields.io/badge/model-RoBERTa-yellow)
16
  ![Deployment](https://img.shields.io/badge/deployed%20on-HuggingFace-orange)
17
  ![License](https://img.shields.io/badge/license-MIT-green)
18
 
19
  ## 🚀 Project Overview
20
 
21
+ **MachineInnovators Inc.** focuses on scalable, production-ready machine learning applications. This project is a comprehensive **MLOps solution** designed to monitor online company reputation through automated sentiment analysis of real-time news.
22
 
23
+ Unlike standard static notebooks, this repository demonstrates a **full-cycle ML workflow**. The system scrapes live data from **Google News**, analyzes sentiment using a **RoBERTa Transformer** model, and visualizes insights via an interactive dashboard, all orchestrated within a Dockerized environment.
24
 
25
  ### Key Features
26
+ * **Real-Time Data Ingestion:** Automated scraping of Google News for target brand keywords.
27
+ * **State-of-the-Art NLP:** Utilizes `twitter-roberta-base-sentiment` for high-accuracy classification.
28
+ * **Full-Stack Architecture:** Integrates a **FastAPI** backend for inference and a **Streamlit** frontend for visualization in a single container.
29
+ * **CI/CD Automation:** Robust GitHub Actions pipeline for automated testing, building, and deployment to Hugging Face Spaces.
30
+ * **Embedded Monitoring:** Basic logging system to track model predictions and sentiment distribution over time.
31
 
32
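The `twitter-roberta-base-sentiment` checkpoint named in the feature list can be exercised directly with the Transformers pipeline. A minimal sketch, assuming the public `cardiffnlp/twitter-roberta-base-sentiment` model id and its usual three-label output (the project's own wrapper lives in `app/model/`):

```python
# Standalone sketch of the classification step; not the app's actual loader.
from transformers import pipeline

classifier = pipeline(
    "sentiment-analysis",
    model="cardiffnlp/twitter-roberta-base-sentiment",  # assumed full Hub id
)

# The cardiffnlp checkpoints return LABEL_0/1/2; map them to readable classes.
LABELS = {"LABEL_0": "Negative", "LABEL_1": "Neutral", "LABEL_2": "Positive"}

headline = "MachineInnovators Inc. praised for its latest product launch"
result = classifier(headline)[0]
print(LABELS.get(result["label"], result["label"]), round(result["score"], 3))
```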
  ---
33
 
34
  ## 🛠️ Tech Stack & Tools
35
 
36
  * **Core:** Python 3.9+
37
+ * **Machine Learning:** Hugging Face Transformers, PyTorch, Scikit-learn.
38
+ * **Backend:** FastAPI, Uvicorn (REST API).
39
+ * **Frontend:** Streamlit (Interactive Dashboard).
40
+ * **Data Ingestion:** `GoogleNews` library (Real-time scraping).
41
+ * **DevOps:** Docker, GitHub Actions (CI/CD).
42
+ * **Deployment:** Hugging Face Spaces (Docker SDK).
43
 
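Because the stack pairs FastAPI with this model, a hypothetical minimal backend helps make the architecture concrete. The `/health` and `/analyze` routes are taken from the CI description further down; the request schema and response shape are assumptions, not the project's actual API:

```python
# Hypothetical miniature of app/api/main.py; the real module may differ.
from fastapi import FastAPI
from pydantic import BaseModel
from transformers import pipeline

app = FastAPI()
classifier = pipeline(
    "sentiment-analysis",
    model="cardiffnlp/twitter-roberta-base-sentiment",  # assumed checkpoint id
)

class AnalyzeRequest(BaseModel):
    text: str

@app.get("/health")
def health():
    return {"status": "ok"}

@app.post("/analyze")
def analyze(req: AnalyzeRequest):
    prediction = classifier(req.text)[0]
    return {"label": prediction["label"], "score": prediction["score"]}
```

Served with Uvicorn (`uvicorn app.api.main:app --port 8000`), this would be the piece the Streamlit dashboard talks to.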
44
  ---
45
 
 
47
 
48
  The project follows a rigorous MLOps pipeline to ensure reliability and speed of delivery:
49
 
50
+ 1. **Data & Modeling:**
51
+ * **Input:** Real-time news titles and descriptions fetched dynamically.
52
+ * **Model:** Pre-trained **RoBERTa** model optimized for social media and short-text sentiment.
53
 
54
+ 2. **Containerization (Docker):**
55
+ * The application is containerized using a custom `Dockerfile`.
56
+ * Implements a custom `entrypoint.sh` script to run both the **FastAPI backend** (port 8000) and **Streamlit frontend** (port 7860) simultaneously.
57
 
58
  3. **CI/CD Pipeline (GitHub Actions):**
59
+ * **Trigger:** Pushes to the `main` branch.
60
+ * **Test:** Executes the `pytest` suite to verify API endpoints (`/health`, `/analyze`) and model loading.
61
+ * **Build:** Verifies Docker image creation.
62
+ * **Deploy:** Automatically pushes the validated code to Hugging Face Spaces.
63
 
64
+ 4. **Monitoring:**
65
+ * The system logs every prediction to a local CSV file, which is visualized in the "Monitoring" tab of the dashboard.
66
 
67
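A sketch of the kind of check the `pytest` stage above performs, using FastAPI's `TestClient`. The import path follows the `uvicorn app.api.main:app` command given in the installation section; the `/analyze` payload shape is an assumption:

```python
# Illustrative tests in the spirit of tests/; not the project's actual suite.
from fastapi.testclient import TestClient

from app.api.main import app  # import path taken from the uvicorn command in this README

client = TestClient(app)

def test_health_endpoint_is_up():
    assert client.get("/health").status_code == 200

def test_analyze_returns_a_label():
    response = client.post("/analyze", json={"text": "Great quarter for the company"})
    assert response.status_code == 200
    assert "label" in response.json()
```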
  ---
68
 
69
  ## 📂 Repository Structure
70
 
71
  ```bash
72
+ ├── .github/workflows/ # CI/CD configurations (GitHub Actions)
73
+ ├── app/ # Backend Application Code
74
+ │ ├── api/ # FastAPI endpoints (main.py)
75
+ │ ├── model/ # Model loader logic (RoBERTa)
76
+ │ └── services/ # Google News scraping logic
77
+ ├── streamlit_app/ # Frontend Application Code (app.py)
78
+ ├── src/ # Training simulation scripts
79
+ ├── tests/ # Unit and integration tests (Pytest)
80
+ ├── Dockerfile # Container configuration
81
+ ├── entrypoint.sh # Startup script for dual-process execution
82
+ ├── requirements.txt # Project dependencies
83
+ └── README.md # Project documentation
84
+
85
+ 💻 Installation & Usage
86
+ To run this project locally using Docker (Recommended):
87
+
88
+ 1. Clone the repository
89
  Bash
90
 
91
+ git clone https://github.com/YOUR_USERNAME/SentimentAnalysis.git
92
+ cd SentimentAnalysis
93
+ 2. Build the Docker Image
94
+ Bash
95
 
96
+ docker build -t reputation-monitor .
97
+ 3. Run the Container
98
  Bash
99
 
100
+ docker run -p 7860:7860 reputation-monitor
101
+ Access the application at http://localhost:7860
102
 
103
+ Manual Installation (No Docker)
104
+ If you prefer running it directly with Python:
105
 
106
+ Install dependencies:
 
 
 
107
 
108
  Bash
109
 
110
+ pip install -r requirements.txt
111
+ Start the Backend (FastAPI):
 
112
 
113
+ Bash
114
 
115
+ uvicorn app.api.main:app --host 0.0.0.0 --port 8000 --reload
116
+ Start the Frontend (Streamlit) in a new terminal:
117
 
118
+ Bash
119
 
120
+ streamlit run streamlit_app/app.py
121
+ ⚠️ Limitations & Future Roadmap
122
+ Data Persistence: Currently, monitoring logs are stored in an ephemeral CSV file. In a production environment, this would be replaced by a persistent database (e.g., PostgreSQL) to ensure data retention across container restarts.
123
 
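For context, the ephemeral CSV logging mentioned above can be as simple as appending one row per prediction. A hypothetical sketch (file name and columns are assumptions, not the project's actual logger):

```python
# Append-only prediction log; lives on the container filesystem, so it is lost on restart.
import csv
import os
from datetime import datetime, timezone

LOG_PATH = "monitoring_log.csv"  # hypothetical location

def log_prediction(text: str, label: str, score: float) -> None:
    """Append one prediction, writing a header row the first time the file is created."""
    is_new_file = not os.path.exists(LOG_PATH)
    with open(LOG_PATH, "a", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        if is_new_file:
            writer.writerow(["timestamp", "text", "label", "score"])
        writer.writerow([datetime.now(timezone.utc).isoformat(), text, label, score])
```

Because the file lives inside the container, it disappears on restart, which is exactly the gap a persistent database would close.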
124
+ Scalability: The current Google News scraper is synchronous. Future versions will implement asynchronous scraping (aiohttp) or a message queue (RabbitMQ/Celery) for high-volume processing.
125
 
126
+ Model Retraining: A placeholder pipeline (src/train.py) is included. Full implementation would require GPU resources and a labeled dataset for fine-tuning.
127
 
128
  🤝 Contributing
129
+ Contributions are welcome! Please feel free to submit a Pull Request.
130
 
131
  📝 License
132
  Distributed under the MIT License. See LICENSE for more information.
133
 
134
+ ### 👤 Author
135
+
136
+ **Fabio Celaschi**
137
+ * [![LinkedIn](https://img.shields.io/badge/LinkedIn-0077B5?style=for-the-badge&logo=linkedin&logoColor=white)](https://www.linkedin.com/in/fabio-celaschi-4371bb92)
138
+ * [![Instagram](https://img.shields.io/badge/Instagram-E4405F?style=for-the-badge&logo=instagram&logoColor=white)](https://www.instagram.com/fabiocelaschi/)
data/new_data.csv ADDED
File without changes
src/train.py ADDED
@@ -0,0 +1,56 @@
1
+ import os
2
+ import time
3
+ import sys
4
+
5
+ """
6
+ This is meant only as a simulation of retraining the model, since redoing the training
7
+ would be very expensive computationally; the intent is just to build the process
8
+ and integrate it into the workflow YAML as a mandatory step before push/deploy.
9
+
10
+ It reads a CSV dataset; if the file is empty, no retraining is done and the push to the repository proceeds anyway.
11
+ """
12
+
13
+ # Configurable paths
14
+ DATA_PATH = "data/new_data.csv"
15
+ MODEL_OUTPUT_DIR = "models/retrained_roberta"
16
+
17
+ def train_and_evaluate():
18
+ print("🚀 Starting MLOps Retraining Pipeline...")
19
+
20
+ # 1. DATA VALIDATION CHECK
21
+ # Check whether the file exists and contains more than a few bytes
22
+ if not os.path.exists(DATA_PATH) or os.stat(DATA_PATH).st_size < 10:
23
+ print(f"ℹ️ Dataset '{DATA_PATH}' is empty or missing.")
24
+ print("⚠️ No new data available for retraining.")
25
+ print("✅ Skipping process. (This is normal behavior for the demo).")
26
+ # Exit with code 0 (success) because "doing nothing" is a valid outcome here
27
+ sys.exit(0)
28
+
29
+ # --- SIMULATION ZONE (GPU Constraints) ---
30
+ print(f"📂 Loading dataset from {DATA_PATH}...")
31
+ # In a real run: df = pd.read_csv(DATA_PATH)
32
+
33
+ print("⚙️ Initializing RoBERTa Fine-Tuning on CPU (Simulation)...")
34
+ time.sleep(2) # Simulate the loading time
35
+
36
+ # Simulate the training log
37
+ print("Epoch 1/3: Loss 0.45 ... accuracy: 0.78")
38
+ print("Epoch 2/3: Loss 0.22 ... accuracy: 0.84")
39
+
40
+ # 2. MODEL EVALUATION CHECK
41
+ print("⚖️ Evaluating new model vs current production model...")
42
+ # A real pipeline would do: if new_accuracy > old_accuracy:
43
+ simulated_improvement = True
44
+
45
+ if simulated_improvement:
46
+ print("✅ Performance improved! (Accuracy +2.5%)")
47
+ print(f"💾 Saving new model artifact to {MODEL_OUTPUT_DIR}...")
48
+ os.makedirs(MODEL_OUTPUT_DIR, exist_ok=True)
49
+ with open(f"{MODEL_OUTPUT_DIR}/metadata.txt", "w") as f:
50
+ f.write(f"Model retrained on {time.strftime('%Y-%m-%d')}\nStatus: Active")
51
+ else:
52
+ print("❌ No improvement detected. Keeping the old model.")
53
+ sys.exit(0)
54
+
55
+ if __name__ == "__main__":
56
+ train_and_evaluate()
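The script above stops at print statements on purpose; the evaluation gate it alludes to (`if new_accuracy > old_accuracy`) could eventually be filled in along these lines. Dataset columns, the label scheme, and both checkpoint paths are assumptions rather than part of this commit:

```python
# Hedged sketch of a real evaluation gate for the retraining pipeline.
import pandas as pd
from sklearn.metrics import accuracy_score
from transformers import pipeline

DATA_PATH = "data/new_data.csv"
PRODUCTION_MODEL = "cardiffnlp/twitter-roberta-base-sentiment"  # assumed production checkpoint
CANDIDATE_MODEL = "models/retrained_roberta"                    # output of a real fine-tuning step

def accuracy_of(model_name: str, texts: list[str], labels: list[str]) -> float:
    """Score a sentiment pipeline against gold labels (assumed to use the LABEL_0/1/2 scheme)."""
    clf = pipeline("sentiment-analysis", model=model_name)
    predictions = [p["label"] for p in clf(texts, truncation=True)]
    return accuracy_score(labels, predictions)

def evaluation_gate() -> bool:
    df = pd.read_csv(DATA_PATH)  # expected columns: text, label (assumption)
    texts, labels = df["text"].tolist(), df["label"].tolist()
    old_accuracy = accuracy_of(PRODUCTION_MODEL, texts, labels)
    new_accuracy = accuracy_of(CANDIDATE_MODEL, texts, labels)
    print(f"production={old_accuracy:.3f} candidate={new_accuracy:.3f}")
    return new_accuracy > old_accuracy  # promote the candidate only if it improves
```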