base model inconsistent with architecture claims

#17
by travisking - opened

This model lists mistralai/Mistral-Small-3.1-24B-Base-2503 as its base, which is the same base model listed for Devstral Small 1.1 (which, of course, has the confusing model id of mistralai/Devstral-Small-2507; one mustn't be too predictable, amirite?):

base_model:
- mistralai/Mistral-Small-3.1-24B-Base-2503

But on this model's (v2) model card you claim the attention is different:

> Attention Softmax Temperature: Devstral Small 2 uses the same architecture as Ministral 3 using rope-scaling as introduced by Llama 4 and Scalable-Softmax Is Superior for Attention.

Can you explain what is going on here? You can't fine-tune from the same base and end up with a different attention mechanism.
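For concreteness, here is what the quoted attention change would mean. The Scalable-Softmax paper scales the attention logits by s · log(n) before the softmax, where n is the context length and s is a learned scalar. A minimal sketch follows; the fixed s and the toy shapes are illustrative, and nothing here is Devstral's actual implementation:

```python
import math

import torch
import torch.nn.functional as F

def ssmax_attention(q, k, v, s: float = 0.43):
    """Scaled dot-product attention with Scalable-Softmax (SSMax).

    Plain attention computes softmax(q @ k^T / sqrt(d)). SSMax multiplies
    those logits by s * log(n), with n the key length, so the distribution
    stays peaked as the context grows. In the paper s is learned per head;
    a fixed value is used here purely for illustration.
    """
    d = q.size(-1)
    n = k.size(-2)
    scores = (q @ k.transpose(-2, -1)) / math.sqrt(d)
    scores = scores * (s * math.log(n))  # the SSMax "softmax temperature" term
    return F.softmax(scores, dim=-1) @ v

# Toy usage: batch of 1, 4 heads, 128 tokens, head dim 64.
q = k = v = torch.randn(1, 4, 128, 64)
out = ssmax_attention(q, k, v)
```

The point being: this is a change to the forward pass itself, not something a plain fine-tune of the listed base would produce.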

Devstral Small 1.1 doesn't use YaRN.
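For reference, when a model does use YaRN it shows up as an explicit rope_scaling entry in the published config rather than in the weights, so this is easy to verify. A quick sketch using the standard transformers API (gated repos may need an auth token):

```python
from transformers import AutoConfig

# YaRN, when used, is declared in config.json as a rope_scaling entry;
# if the attribute is absent or None, the model ships with plain RoPE.
cfg = AutoConfig.from_pretrained("mistralai/Devstral-Small-2507")
print(getattr(cfg, "rope_scaling", None))  # expect None for Devstral Small 1.1
```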

patrickvonplaten changed discussion status to closed
