---
language:
- en
tags:
- text-classification
- physics
- science
- roberta
- data-cleaning
license: mit
metrics:
- accuracy
- f1
- precision
- recall
base_model: roberta-base
---

![image](https://cdn-uploads.huggingface.co/production/uploads/6691fb6571836231e29eb5fb/eeksAj0wC_vlwzCITr3Oo.png)

# RobertaPhysics: Physics Content Classifier

This model is a fine-tuned version of [roberta-base](https://huggingface.co/roberta-base) designed to distinguish between **Physics-related content** and **General/Non-Physics text**. It was developed specifically for **data cleaning pipelines**, with the goal of curating high-quality scientific datasets by filtering irrelevant noise out of raw text collections.

## 📊 Model Performance

The model was trained for 3 epochs and achieved the following results on the validation set (2,191 samples):

| Metric | Value | Interpretation |
| :--- | :--- | :--- |
| **Accuracy** | **94.44%** | Overall correct classification rate. |
| **Precision** | **70.00%** | Reliability when predicting the "Physics" class. |
| **Recall** | **62.30%** | Ability to detect Physics content within the dataset. |
| **F1-Score** | **65.93%** | Harmonic mean of precision and recall. |
| **Validation Loss** | **0.1574** | Low error on the held-out validation set. |

![image](https://cdn-uploads.huggingface.co/production/uploads/6691fb6571836231e29eb5fb/THTeXtbx41e9jxs_UGeJI.png)

## 🏷️ Label Mapping

The model uses the following mapping for inference:

* **LABEL_0 (0):** `General` (Non-Physics content, noise, or other topics)
* **LABEL_1 (1):** `Physics` (Scientific or educational content related to physics)

## ⚙️ Training Details

* **Dataset:** Approximately 11,000 processed text samples (8,762 training / 2,191 validation).
* **Architecture:** RoBERTa Base (Sequence Classification).
* **Batch Size:** 16 (Train) / 64 (Eval).
* **Optimizer:** AdamW (weight decay 0.01).
* **Loss Function:** CrossEntropyLoss.

A hedged sketch of this configuration is included at the end of this card.

## 🚀 Quick Start

You can use this model directly with the Hugging Face `pipeline`:

```python
from transformers import pipeline

# Load the classifier
classifier = pipeline("text-classification", model="Madras1/RobertaPhysics")

# Example 1: Physics Content
text_physics = "Quantum entanglement describes a phenomenon where linked particles remain connected."
result_physics = classifier(text_physics)
print(result_physics)
# Expected Output: [{'label': 'Physics', 'score': 0.93}]

# Example 2: General Content
text_general = "The quarterly earnings report will be released to investors next Tuesday."
result_general = classifier(text_general)
print(result_general)
# Expected Output: [{'label': 'General', 'score': 0.86}]
```

![image](https://cdn-uploads.huggingface.co/production/uploads/6691fb6571836231e29eb5fb/ba2PpICAPZaZmZAAOdZKu.png)

## ⚠️ Intended Use

* **Primary Use:** Filtering datasets to retain physics-domain text, as in the sketch below.
* **Limitations:** The model favors precision over recall (Precision: 70% vs. Recall: 62%). It is "conservative": it minimizes false positives (non-physics text labeled as Physics) but may miss some valid physics texts. This trade-off is intentional for high-quality dataset curation.
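Since the primary use is dataset filtering, here is a minimal batch-filtering sketch. The sample `texts`, the batch size, and the 0.5 score threshold are illustrative assumptions, not a prescribed pipeline.

```python
from transformers import pipeline

# Placeholder corpus to be cleaned; replace with your raw text collection
texts = [
    "Quantum entanglement describes a phenomenon where linked particles remain connected.",
    "The quarterly earnings report will be released to investors next Tuesday.",
]

classifier = pipeline("text-classification", model="Madras1/RobertaPhysics")

# Score all samples in batches; batch size and threshold are illustrative choices
predictions = classifier(texts, batch_size=32, truncation=True)
physics_texts = [
    text
    for text, pred in zip(texts, predictions)
    if pred["label"] == "Physics" and pred["score"] >= 0.5
]

print(f"Kept {len(physics_texts)} of {len(texts)} samples")
```

Because the model is tuned to favor precision, raising the threshold keeps the curated set cleaner at the cost of discarding more borderline physics text.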
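## 🧪 Training Configuration Sketch

The following is a minimal, hedged reproduction of the setup described in **Training Details**, using the Hugging Face `Trainer`. Only the hyperparameters (3 epochs, batch sizes 16/64, AdamW with weight decay 0.01) and the label mapping come from this card; the tiny placeholder dataset, tokenization settings, and `output_dir` are illustrative assumptions.

```python
from datasets import Dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

# Tiny placeholder dataset; the card reports ~8,762 train / 2,191 validation samples
raw = {
    "text": [
        "Quantum entanglement describes a phenomenon where linked particles remain connected.",
        "The quarterly earnings report will be released to investors next Tuesday.",
    ],
    "label": [1, 0],  # 1 = Physics, 0 = General
}
dataset = Dataset.from_dict(raw)

tokenizer = AutoTokenizer.from_pretrained("roberta-base")

def tokenize(batch):
    # Illustrative tokenization settings; the card does not specify max length
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128)

dataset = dataset.map(tokenize, batched=True)

model = AutoModelForSequenceClassification.from_pretrained(
    "roberta-base",
    num_labels=2,
    id2label={0: "General", 1: "Physics"},
    label2id={"General": 0, "Physics": 1},
)

# Hyperparameters from this card: 3 epochs, batch sizes 16/64, weight decay 0.01
training_args = TrainingArguments(
    output_dir="roberta-physics",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=64,
    weight_decay=0.01,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    eval_dataset=dataset,  # placeholder; use a held-out validation split in practice
)
trainer.train()
trainer.evaluate()
```

`AutoModelForSequenceClassification` applies `CrossEntropyLoss` by default for this two-label head, and `Trainer` uses AdamW as its default optimizer, matching the configuration listed above.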