---
license: apache-2.0
language:
- en
tags:
- moe
- olmo
- olmoe
co2_eq_emissions: 1
datasets:
- allenai/OLMoE-mix-0924
library_name: transformers
---

# OLMoE with Adapters

This repository contains an extension of the OLMoE model with adapter layers for parameter-efficient fine-tuning. By adding small adapter modules to the model, we can fine-tune it on downstream tasks while keeping most of the original parameters frozen, resulting in much more efficient training.

## Model Architecture

The `OlmoEWithAdaptersForCausalLM` model extends the original OLMoE architecture by:

1. Adding small adapter (bottleneck) layers to each MLP block
2. Allowing selective freezing of the base model's parameters
3. Training only the adapter parameters (~0.1-1% of the total parameter count)

Key components:

- `OlmoEWithAdaptersMLP`: MLP layer with additional adapter modules
- `OlmoEWithAdaptersDecoderLayer`: Decoder layer incorporating adapter MLPs
- `OlmoEWithAdaptersModel`: Full model with adapter-based decoder layers
- `OlmoEWithAdaptersForCausalLM`: Causal language model with adapters

## Training Script

The `train_olmoe_adapters.py` script provides a complete workflow for fine-tuning the model.

### Features

- Parameter-efficient fine-tuning using adapters
- Support for a wide range of datasets through the Hugging Face `datasets` library
- Customizable adapter size
- Option to freeze/unfreeze different components
- Training with the AdamW optimizer and learning rate scheduling
- Evaluation with perplexity metrics
- Model checkpointing and saving

### Usage

```bash
python train_olmoe_adapters.py \
  --model_name_or_path allenai/OLMoE-1B-7B-0924 \
  --adapter_size 64 \
  --freeze_base_model True \
  --dataset_name wikitext \
  --dataset_config_name wikitext-2-raw-v1 \
  --output_dir ./olmoe-adapter-finetuned \
  --num_train_epochs 3 \
  --per_device_train_batch_size 4 \
  --per_device_eval_batch_size 4 \
  --learning_rate 5e-5 \
  --warmup_steps 100 \
  --logging_steps 100 \
  --save_steps 1000 \
  --seed 42
```

## Benefits of Adapter-Based Fine-Tuning

1. **Efficiency**: Train only ~0.1-1% of the parameters, dramatically reducing GPU memory requirements
2. **Storage**: Store only adapter weights rather than full fine-tuned models
3. **Composability**: Multiple adapters can be trained for different tasks and swapped at inference time
4. **Reduced overfitting**: The lower trainable-parameter count helps prevent overfitting on small datasets

## How to Use the Fine-Tuned Model

```python
from transformers import AutoTokenizer
from modeling_olmoe import OlmoEWithAdaptersForCausalLM

# Load the fine-tuned model and tokenizer
model = OlmoEWithAdaptersForCausalLM.from_pretrained("./olmoe-adapter-finetuned")
tokenizer = AutoTokenizer.from_pretrained("./olmoe-adapter-finetuned")

# Generate text
inputs = tokenizer("Once upon a time", return_tensors="pt")
outputs = model.generate(**inputs, max_length=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

## Adapter Size Recommendations

The adapter size controls the trade-off between parameter efficiency and performance:

- **Small datasets**: 16-32 dimensions
- **Medium datasets**: 64-128 dimensions
- **Large datasets**: 128-256 dimensions

For most fine-tuning scenarios, an adapter size of 64 provides a good balance between efficiency and performance.
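As a rough illustration of the bottleneck design described under Model Architecture, the sketch below shows how `adapter_size` enters a minimal adapter module. The class name `AdapterModule`, the residual connection, and the near-zero initialization are assumptions made for illustration, not the repository's actual implementation.

```python
import torch
import torch.nn as nn


class AdapterModule(nn.Module):
    """Minimal bottleneck adapter: down-project, apply a non-linearity,
    up-project, and add a residual connection so the layer starts out
    close to the identity function."""

    def __init__(self, hidden_size: int, adapter_size: int = 64):
        super().__init__()
        self.down_proj = nn.Linear(hidden_size, adapter_size)
        self.activation = nn.GELU()
        self.up_proj = nn.Linear(adapter_size, hidden_size)
        # Zero-initialize the up-projection so the adapter initially passes
        # hidden states through almost unchanged (assumed design choice).
        nn.init.zeros_(self.up_proj.weight)
        nn.init.zeros_(self.up_proj.bias)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        return hidden_states + self.up_proj(self.activation(self.down_proj(hidden_states)))
```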
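To see where the ~0.1-1% trainable-parameter figure comes from, the back-of-the-envelope calculation below counts the parameters such an adapter adds per layer. The hidden size, adapter size, and layer count are assumed example values; the exact numbers depend on the base model's configuration.

```python
def adapter_param_count(hidden_size: int, adapter_size: int) -> int:
    """Parameters in one bottleneck adapter (weights plus biases)."""
    down = hidden_size * adapter_size + adapter_size
    up = adapter_size * hidden_size + hidden_size
    return down + up


# Assumed example configuration: hidden_size=2048, adapter_size=64, 16 layers.
per_layer = adapter_param_count(hidden_size=2048, adapter_size=64)
total = per_layer * 16
print(f"{per_layer:,} adapter params per layer, {total:,} in total")
# Roughly 0.26M per layer and ~4.2M overall -- a small fraction of a
# multi-billion-parameter base model.
```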
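Similarly, the `--freeze_base_model` option corresponds to logic along these lines. This is only a sketch: the substring check assumes adapter parameters contain `adapter` in their names, which may not match the actual code.

```python
import torch.nn as nn


def freeze_base_model(model: nn.Module, train_adapters_only: bool = True) -> None:
    """Freeze every parameter except those belonging to adapter modules
    (assumed here to have 'adapter' in their parameter name)."""
    for name, param in model.named_parameters():
        param.requires_grad = ("adapter" in name) if train_adapters_only else True

    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    print(f"Trainable: {trainable:,} / {total:,} ({100 * trainable / total:.2f}%)")
```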