base model inconsistent with architecture claims
#17
by travisking - opened
This model points to mistralai/Mistral-Small-3.1-24B-Base-2503 as its base, which is the same base model listed for Devstral Small 1.1 (which, of course, has the confusing model id of mistralai/Devstral-Small-2507; one must not be too predictable, amirite?):
base_model:
- mistralai/Mistral-Small-3.1-24B-Base-2503
But on this model's (v2) model card you claim the attention is different:
Attention Softmax Temperature: Devstral Small 2 uses the same architecture as Ministral 3 using rope-scaling as introduced by Llama 4 and Scalable-Softmax Is Superior for Attention.
Can you explain what is going on here? You can't fine-tune from the same base and end up with a different attention mechanism.
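For reference, the Scalable-Softmax (SSMax) change the card points at amounts to multiplying the attention logits by s · log(n) before the softmax, so the distribution stays peaked as the context length n grows. Here is a minimal sketch assuming standard scaled dot-product attention; the scalar s below is an illustrative value, not one taken from the model card:

```python
import math
import torch

def ssmax_attention(q, k, v, s=0.43):
    # q, k, v: (..., n, d) tensors; n is the number of keys, d the head dim.
    n, d = k.shape[-2], k.shape[-1]
    logits = q @ k.transpose(-2, -1) / math.sqrt(d)
    # SSMax: scale logits by s * log(n) so the softmax stays peaked as
    # the context grows; s = 0.43 is a made-up illustrative value.
    logits = logits * (s * math.log(n))
    weights = torch.softmax(logits, dim=-1)
    return weights @ v
```

Note that this adds no new weights; it only rescales the logits, which is why it can in principle be toggled on a fine-tune of an existing base.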
Devstral Small 1.1 doesn't use YaRN.
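In other words, whether YaRN-style rope scaling is active is a config-level setting, not a change to the weights' lineage, so two fine-tunes of the same base can ship different rope_scaling entries. A quick way to check, as a sketch (only the Devstral Small 1.1 repo id is confirmed in this thread; rope_scaling is the standard transformers config field):

```python
from transformers import AutoConfig

# Inspect the rope_scaling entry of the published config; None means
# plain RoPE with no YaRN/Llama-4-style scaling applied.
cfg = AutoConfig.from_pretrained("mistralai/Devstral-Small-2507")
print(getattr(cfg, "rope_scaling", None))
```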
patrickvonplaten changed discussion status to closed