base model inconsistent with architecture claims

#17
by travisking - opened

This model lists mistralai/Mistral-Small-3.1-24B-Base-2503 as its base, which is the same base model listed for Devstral Small 1.1 (which, of course, has the confusing model id of mistralai/Devstral-Small-2507; one mustn't be too predictable, amirite?):

base_model:
- mistralai/Mistral-Small-3.1-24B-Base-2503

But on this model's (v2) model card you claim the attention is different:

> Attention Softmax Temperature: Devstral Small 2 uses the same architecture as Ministral 3 using rope-scaling as introduced by Llama 4 and Scalable-Softmax Is Superior for Attention.

Can you explain what is going on here? You can't fine-tune from the same base and end up with a different attention mechanism.
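For concreteness, here is what the quoted attention change would mean. The Scalable-Softmax paper scales the attention logits by s · log(n) before the softmax, where n is the context length and s is a learned scalar. A minimal sketch follows; the fixed s and the toy shapes are illustrative, and nothing here is Devstral's actual implementation:

```python
import math

import torch
import torch.nn.functional as F

def ssmax_attention(q, k, v, s: float = 0.43):
    """Scaled dot-product attention with Scalable-Softmax (SSMax).

    Plain attention computes softmax(q @ k^T / sqrt(d)). SSMax multiplies
    those logits by s * log(n), with n the key length, so the distribution
    stays peaked as the context grows. In the paper s is learned per head;
    a fixed value is used here purely for illustration.
    """
    d = q.size(-1)
    n = k.size(-2)
    scores = (q @ k.transpose(-2, -1)) / math.sqrt(d)
    scores = scores * (s * math.log(n))  # the SSMax "softmax temperature" term
    return F.softmax(scores, dim=-1) @ v

# Toy usage: batch of 1, 4 heads, 128 tokens, head dim 64.
q = k = v = torch.randn(1, 4, 128, 64)
out = ssmax_attention(q, k, v)
```

The point being: this is a change to the forward pass itself, not something a plain fine-tune of the listed base would produce.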

Devstral Small 1.1 doesn't use YaRN.
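For reference, when a model does use YaRN it shows up as an explicit rope_scaling entry in the published config rather than in the weights, so this is easy to verify. A quick sketch using the standard transformers API (gated repos may need an auth token):

```python
from transformers import AutoConfig

# YaRN, when used, is declared in config.json as a rope_scaling entry;
# if the attribute is absent or None, the model ships with plain RoPE.
cfg = AutoConfig.from_pretrained("mistralai/Devstral-Small-2507")
print(getattr(cfg, "rope_scaling", None))  # expect None for Devstral Small 1.1
```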

patrickvonplaten changed discussion status to closed
