Mistral-44B-MoE-Patched-MLX-4bit-G64

This repository contains a patched and quantized version of the DavidAU/Mistral-2x24B-MOE-Magistral-2506-Devstral-2507-1.1-Coder-Reasoning-Ultimate-44B model, converted to the MLX format and optimized for Apple Silicon.

Model Details

Patching and Conversion Process

The original Hugging Face model was converted to the MLX format using a custom Python script, convert_model.py. This script performs several critical steps to ensure compatibility and optimization for MLX on Apple Silicon. The key methods and "patches" applied during this conversion are:

  1. MLX Weight Conversion: The model's weights, originally stored as PyTorch tensors, were converted to MLX's native array type (mlx.core.array, commonly used as mx.array). The script iterates over every tensor and transforms it for efficient processing on Apple Silicon.
  2. Attention Dimension Patching (Zero-padding): A patching step reconciles dimension mismatches in the attention layers, which are common in Mixture of Experts (MoE) models with "narrow attention heads" (see the sketch after this list). This involved:
    • Analyzing the model's config.json and the shapes of the q_proj.weight tensors to detect whether the query projection output dimension (q_proj_outdim) was smaller than the model's hidden_size.
    • If a mismatch was detected, calculating new num_attention_heads and num_key_value_heads values that align with the hidden_size while preserving the Grouped Query Attention (GQA) ratio.
    • Zero-padding the q_proj, k_proj, v_proj, and o_proj tensors to their new, expanded dimensions, so the architecture declared in config.json matches the stored weights and MLX can load the model without runtime shape errors.
  3. Explicit Model Type Setting: The config.json file was updated to set "model_type" to "mistral" explicitly, so that the MLX framework identifies and instantiates the correct model architecture.
  4. 4-bit Quantization: The model was quantized to 4 bits with a group size of 64 (4-bit G64). This significantly reduces the memory footprint, making the model usable on machines with limited unified memory, and can also improve inference speed. The quantization parameters ("group_size": 64, "bits": 4) are recorded in config.json for transparency and reproducibility.
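
The convert_model.py script itself is not included in this repository; the code below is a minimal sketch of how the zero-padding patch from step 2 (plus the config edits from steps 3 and 4) could look, assuming the standard Mistral weight naming (model.layers.N.self_attn.{q,k,v,o}_proj.weight). Function and variable names are illustrative, not the script's actual code.

import mlx.core as mx

def pad_attention_weights(weights: dict, config: dict) -> None:
    """Zero-pad attention projections so q_proj spans hidden_size (illustrative sketch)."""
    hidden_size = config["hidden_size"]
    n_heads = config["num_attention_heads"]
    n_kv_heads = config["num_key_value_heads"]

    # Detect "narrow attention heads": q_proj output dim smaller than hidden_size.
    q_out_dim = weights["model.layers.0.self_attn.q_proj.weight"].shape[0]
    if q_out_dim == hidden_size:
        return  # nothing to patch

    head_dim = q_out_dim // n_heads      # per-head width actually stored in the weights
    gqa_ratio = n_heads // n_kv_heads    # query heads per key/value head

    # Recompute head counts so new_n_heads * head_dim == hidden_size, keeping the GQA ratio.
    new_n_heads = hidden_size // head_dim
    new_n_kv_heads = new_n_heads // gqa_ratio
    config["num_attention_heads"] = new_n_heads
    config["num_key_value_heads"] = new_n_kv_heads

    new_q_dim = new_n_heads * head_dim   # == hidden_size
    new_kv_dim = new_n_kv_heads * head_dim

    def pad_rows(w, target):
        extra = target - w.shape[0]
        return mx.concatenate([w, mx.zeros((extra, w.shape[1]), dtype=w.dtype)]) if extra > 0 else w

    def pad_cols(w, target):
        extra = target - w.shape[1]
        return mx.concatenate([w, mx.zeros((w.shape[0], extra), dtype=w.dtype)], axis=1) if extra > 0 else w

    for name, w in weights.items():
        if name.endswith("q_proj.weight"):
            weights[name] = pad_rows(w, new_q_dim)
        elif name.endswith(("k_proj.weight", "v_proj.weight")):
            weights[name] = pad_rows(w, new_kv_dim)
        elif name.endswith("o_proj.weight"):
            weights[name] = pad_cols(w, new_q_dim)  # o_proj consumes the padded attention output

    # Steps 3 and 4 are plain config edits; mlx_lm records quantization settings in config.json.
    config["model_type"] = "mistral"
    config["quantization"] = {"group_size": 64, "bits": 4}

Because the padded o_proj columns are zero, the extra head slots contribute nothing to the layer output; the patch adjusts the declared geometry without discarding any of the original weights.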

Usage

To use this model with MLX, load it with the mlx_lm package:

from mlx_lm import load, generate

model, tokenizer = load("BennyDaBall/Mistral-44B-MoE-Patched-MLX-4bit-G64")

# Example: plain text completion
prompt = "Hello, my name is"
response = generate(model, tokenizer, prompt=prompt, max_tokens=256)
print(response)
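
For chat- or reasoning-style use, it is usually better to run the prompt through the tokenizer's chat template first. A minimal sketch, assuming the bundled tokenizer ships a Mistral chat template (the message content is just an example):

messages = [{"role": "user", "content": "Write a Python function that reverses a string."}]
chat_prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
response = generate(model, tokenizer, prompt=chat_prompt, max_tokens=512)
print(response)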

License

This model is based on the original model by DavidAU. Please refer to the original model's license for details.
