---
license: apache-2.0
language:
- en
tags:
- moe
- olmo
- olmoe
co2_eq_emissions: 1
datasets:
- allenai/OLMoE-mix-0924
library_name: transformers
---

# OLMoE with Adapters

This repository contains an extension of the OLMoE model with adapter layers for parameter-efficient fine-tuning. By adding small adapter modules to the model, we can fine-tune it on downstream tasks while keeping most of the original parameters frozen, resulting in much more efficient training.

## Model Architecture

The `OlmoEWithAdaptersForCausalLM` model extends the original OLMoE architecture by:

1. Adding small adapter (bottleneck) layers to each MLP block
2. Allowing selective freezing of the base model's parameters
3. Training only the adapter parameters (~0.1-1% of the total parameter count)

Key components:

- `OlmoEWithAdaptersMLP`: MLP layer with additional adapter modules
- `OlmoEWithAdaptersDecoderLayer`: Decoder layer incorporating adapter MLPs
- `OlmoEWithAdaptersModel`: Full model with adapter-based decoder layers
- `OlmoEWithAdaptersForCausalLM`: Causal language model with adapters

## Training Script

The `train_olmoe_adapters.py` script provides a complete workflow for fine-tuning the model.

### Features

- Parameter-efficient fine-tuning using adapters
- Support for a wide range of datasets through the Hugging Face `datasets` library
- Customizable adapter size
- Option to freeze/unfreeze different components
- Training with the AdamW optimizer and learning rate scheduling
- Evaluation with perplexity metrics
- Model checkpointing and saving

### Usage

```bash
python train_olmoe_adapters.py \
  --model_name_or_path allenai/OLMoE-1B-7B-0924 \
  --adapter_size 64 \
  --freeze_base_model True \
  --dataset_name wikitext \
  --dataset_config_name wikitext-2-raw-v1 \
  --output_dir ./olmoe-adapter-finetuned \
  --num_train_epochs 3 \
  --per_device_train_batch_size 4 \
  --per_device_eval_batch_size 4 \
  --learning_rate 5e-5 \
  --warmup_steps 100 \
  --logging_steps 100 \
  --save_steps 1000 \
  --seed 42
```

## Benefits of Adapter-Based Fine-Tuning

1. **Efficiency**: Train only ~0.1-1% of the parameters, dramatically reducing GPU memory requirements
2. **Storage**: Store only adapter weights rather than full fine-tuned models
3. **Composability**: Multiple adapters can be trained for different tasks and swapped at inference time
4. **Reduced overfitting**: The lower trainable-parameter count helps prevent overfitting on small datasets

## How to Use the Fine-Tuned Model

```python
from transformers import AutoTokenizer
from modeling_olmoe import OlmoEWithAdaptersForCausalLM

# Load the fine-tuned model and tokenizer
model = OlmoEWithAdaptersForCausalLM.from_pretrained("./olmoe-adapter-finetuned")
tokenizer = AutoTokenizer.from_pretrained("./olmoe-adapter-finetuned")

# Generate text
inputs = tokenizer("Once upon a time", return_tensors="pt")
outputs = model.generate(**inputs, max_length=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

## Adapter Size Recommendations

The adapter size controls the trade-off between parameter efficiency and performance:

- **Small datasets**: 16-32 dimensions
- **Medium datasets**: 64-128 dimensions
- **Large datasets**: 128-256 dimensions

For most fine-tuning scenarios, an adapter size of 64 provides a good balance between efficiency and performance.
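As a rough illustration of the bottleneck design described under Model Architecture, the sketch below shows how `adapter_size` enters a minimal adapter module. The class name `AdapterModule`, the residual connection, and the near-zero initialization are assumptions made for illustration, not the repository's actual implementation.

```python
import torch
import torch.nn as nn


class AdapterModule(nn.Module):
    """Minimal bottleneck adapter: down-project, apply a non-linearity,
    up-project, and add a residual connection so the layer starts out
    close to the identity function."""

    def __init__(self, hidden_size: int, adapter_size: int = 64):
        super().__init__()
        self.down_proj = nn.Linear(hidden_size, adapter_size)
        self.activation = nn.GELU()
        self.up_proj = nn.Linear(adapter_size, hidden_size)
        # Zero-initialize the up-projection so the adapter initially passes
        # hidden states through almost unchanged (assumed design choice).
        nn.init.zeros_(self.up_proj.weight)
        nn.init.zeros_(self.up_proj.bias)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        return hidden_states + self.up_proj(self.activation(self.down_proj(hidden_states)))
```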
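To see where the ~0.1-1% trainable-parameter figure comes from, the back-of-the-envelope calculation below counts the parameters such an adapter adds per layer. The hidden size, adapter size, and layer count are assumed example values; the exact numbers depend on the base model's configuration.

```python
def adapter_param_count(hidden_size: int, adapter_size: int) -> int:
    """Parameters in one bottleneck adapter (weights plus biases)."""
    down = hidden_size * adapter_size + adapter_size
    up = adapter_size * hidden_size + hidden_size
    return down + up


# Assumed example configuration: hidden_size=2048, adapter_size=64, 16 layers.
per_layer = adapter_param_count(hidden_size=2048, adapter_size=64)
total = per_layer * 16
print(f"{per_layer:,} adapter params per layer, {total:,} in total")
# Roughly 0.26M per layer and ~4.2M overall -- a small fraction of a
# multi-billion-parameter base model.
```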
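Similarly, the `--freeze_base_model` option corresponds to logic along these lines. This is only a sketch: the substring check assumes adapter parameters contain `adapter` in their names, which may not match the actual code.

```python
import torch.nn as nn


def freeze_base_model(model: nn.Module, train_adapters_only: bool = True) -> None:
    """Freeze every parameter except those belonging to adapter modules
    (assumed here to have 'adapter' in their parameter name)."""
    for name, param in model.named_parameters():
        param.requires_grad = ("adapter" in name) if train_adapters_only else True

    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    print(f"Trainable: {trainable:,} / {total:,} ({100 * trainable / total:.2f}%)")
```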