# Mistral-44B-MoE-Patched-MLX-4bit-G64
This repository contains a patched and quantized version of the DavidAU/Mistral-2x24B-MOE-Magistral-2506-Devstral-2507-1.1-Coder-Reasoning-Ultimate-44B model, converted to the MLX format and optimized for Apple Silicon.
## Model Details
- Base Model: DavidAU/Mistral-2x24B-MOE-Magistral-2506-Devstral-2507-1.1-Coder-Reasoning-Ultimate-44B
- Framework: MLX
- Quantization: 4-bit with a group size of 64 (4-bit G64)
- Original Model Size: 44 Billion Parameters (2x24B Mixture of Experts)
## Patching and Conversion Process
The original Hugging Face model was converted to the MLX format using a custom Python script, `convert_model.py`. The script performs several steps to make the model compatible with, and optimized for, MLX on Apple Silicon. The key methods and "patches" applied during conversion are:
- MLX Weight Conversion: The model's weights, originally in PyTorch format, were converted into MLX's native `mx.array` format (from `mlx.core`). This involves iterating through all model tensors and transforming them for efficient processing on Apple Silicon.
- Attention Dimension Patching (Zero-padding): A crucial patching step was applied to reconcile potential dimension mismatches in the attention layers, which are particularly common in Mixture of Experts (MoE) models with "narrow" attention heads. This involved (see the first sketch after this list):
  - Analyzing the model's `config.json` and the shapes of the `q_proj.weight` tensors to detect whether the query projection output dimension was smaller than the model's `hidden_size`.
  - If a mismatch was detected, calculating new `num_attention_heads` and `num_key_value_heads` values that align with `hidden_size` while preserving the Grouped Query Attention (GQA) ratio.
  - Zero-padding the `q_proj`, `k_proj`, `v_proj`, and `o_proj` tensors to their new, expanded dimensions. This keeps the model's architecture consistent and correctly interpreted by MLX, preventing runtime shape errors and enabling proper functionality.
- Explicit Model Type Setting: The `config.json` file was updated to explicitly set `"model_type"` to `"mistral"`. This ensures that the MLX framework correctly identifies and loads the model architecture, which is required for proper model instantiation.
- 4-bit Quantization: The model was quantized to 4 bits with a group size of 64 (4-bit G64). This significantly reduces the model's memory footprint, making it more accessible on devices with limited memory, and can also speed up inference. The quantization parameters (`"group_size": 64`, `"bits": 4`) are recorded in `config.json` for transparency and reproducibility (see the second sketch after this list).
## Usage
To use this model with MLX, load it with the `mlx-lm` package:
```python
from mlx_lm import load, generate

model, tokenizer = load("BennyDaBall/Mistral-44B-MoE-Patched-MLX-4bit-G64")

# Example for text generation
prompt = "Hello, my name is"
response = generate(model, tokenizer, prompt=prompt, max_tokens=256)
print(response)
```
## License
This model is based on the original model by DavidAU. Please refer to the original model's license for details.