What is Mixtral 8x7B?

Mistral AI's Mixtral 8x7B architecture - 8 experts with top-2 routing, sliding window attention, multilingual training, performance vs. Llama 2 70B, and serving requirements.

Mixtral 8x7B - Architecture Deep Dive

The Paper That Proved MoE Belonged in the Open

December 2023. Mistral AI releases Mixtral 8x7B without fanfare - just a tweet and a model card. The technical report followed on January 8, 2024. Within days of the release, it was clear this was not just another open-source model. It matched or outperformed GPT-3.5 and Llama 2 70B on most benchmarks, while running at roughly the inference cost of a 13B model. The open-source community had its first serious MoE model, and the weights were free to download.

Mistral had spent the previous year building a reputation for efficient, high-quality open models. Their first model, Mistral 7B, had shocked the community by outperforming Llama 2 13B. Now they were scaling the efficiency further with a MoE architecture that delivered 70B-level quality at 13B-level compute.

The technical report was sparse (Mistral tends to release fewer technical details than labs like Meta or Google), but the model weights were public and the community reverse-engineered much of what wasn't documented. This lesson covers what we know about Mixtral's architecture and why it works.

Why This Exists - The Open-Source Quality Gap

In late 2023, there was a clear quality gap between open-source models and closed frontier models. Llama 2 70B was the best open model, but it lagged behind GPT-3.5 on many tasks. For organizations that needed strong performance without API dependencies - for privacy, cost, or customization reasons - the options were limited.

Mixtral bridged a significant portion of this gap. Not by training a bigger dense model (which would have been expensive to both train and serve), but by using the efficiency of MoE to deliver more capability per active parameter.

For the open-source community, Mixtral's release was also significant because it provided:

A concrete reference implementation of transformer-based MoE
Open weights that could be fine-tuned, quantized, and studied
Proof that you could build a competitive MoE model without Google-scale resources

Architecture Specifications

Mixtral 8x7B is built on the Mistral 7B architecture as its base, extending it with a MoE layer structure. The key specifications:

Parameter	Value
Architecture	Transformer with MoE FFN layers
Number of layers	32
Hidden dimension ( $d_{\text{model}}$ )	4096
Number of attention heads	32
KV heads (Grouped Query Attention)	8
FFN dimension ( $d_{\text{ff}}$ )	14336
Number of experts	8
Active experts (top-k)	2
Context length	32768 tokens
Vocabulary size	32000
Total parameters	~47B
Active parameters per token	~13B

Every FFN layer in Mixtral is a MoE layer - there are no interleaved dense FFN layers. The attention layers are dense.
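The total and active parameter counts in the table can be reproduced approximately from its other rows. The sketch below is a back-of-the-envelope count, not an official breakdown: it assumes separate (untied) input and output embedding matrices and ignores the small normalization weights. The function name is ours.

```python
def mixtral_param_counts(
    n_layers=32, d_model=4096, n_heads=32, n_kv_heads=8,
    d_ff=14336, n_experts=8, top_k=2, vocab_size=32000,
):
    """Approximate (total, active-per-token) parameter counts."""
    d_head = d_model // n_heads
    # Attention: Q and O are d_model x d_model; K and V are narrower under GQA.
    attn = 2 * d_model * d_model + 2 * d_model * n_kv_heads * d_head
    # One SwiGLU expert has three weight matrices (w1, w2, w3).
    expert = 3 * d_model * d_ff
    # Linear router: one score per expert.
    router = d_model * n_experts
    # Assumption: input embedding and output projection are not tied.
    embeddings = 2 * vocab_size * d_model

    shared = n_layers * (attn + router) + embeddings
    total = shared + n_layers * n_experts * expert
    active = shared + n_layers * top_k * expert
    return total, active


total, active = mixtral_param_counts()
print(f"total  ~ {total / 1e9:.1f}B")   # total  ~ 46.7B
print(f"active ~ {active / 1e9:.1f}B")  # active ~ 12.9B
```

Under these assumptions almost all of the parameters sit in the expert FFNs, which is why activating 2 of 8 experts cuts per-token compute so sharply.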

Grouped Query Attention

Mixtral inherits the Grouped Query Attention (GQA) architecture from Mistral 7B. GQA is a technique where multiple query heads share a single key-value head, reducing KV cache memory requirements during inference.

With 32 attention heads and 8 KV heads, 4 query heads share each KV head. This shrinks the KV cache by 4x compared to Multi-Head Attention. The cache still grows linearly with sequence length, but the smaller constant leaves room for longer contexts or larger batches within the same memory budget.
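To put a number on the 4x reduction, here is a minimal sketch of the KV cache size for one sequence, assuming FP16 cache entries (2 bytes each) and a head dimension of 128 (4096 / 32). The helper name is hypothetical.

```python
def kv_cache_bytes(seq_len, batch_size=1, n_layers=32, n_kv_heads=8,
                   d_head=128, bytes_per_value=2):
    """Keys and values stored for every layer, KV head, and position."""
    return (2 * n_layers * n_kv_heads * d_head * bytes_per_value
            * seq_len * batch_size)


gqa = kv_cache_bytes(32768, n_kv_heads=8)    # Mixtral's GQA layout
mha = kv_cache_bytes(32768, n_kv_heads=32)   # hypothetical full MHA layout
print(gqa / 2**30, mha / 2**30)  # 4.0 16.0 (GiB)
```

The same function scales linearly with `batch_size`, which is why the KV cache, not the weights, often becomes the limiting factor at high concurrency.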

import torch
import torch.nn as nn
import torch.nn.functional as F
import math


class GroupedQueryAttention(nn.Module):
    """
    Grouped Query Attention (GQA) as used in Mixtral 8x7B.

    32 query heads, 8 KV heads.
    Each KV head is shared by 4 query heads (32 / 8 = 4).
    """

    def __init__(
        self,
        d_model: int = 4096,
        n_heads: int = 32,
        n_kv_heads: int = 8,
        max_seq_len: int = 32768,
    ):
        super().__init__()
        assert n_heads % n_kv_heads == 0, "n_heads must be divisible by n_kv_heads"

        self.n_heads = n_heads
        self.n_kv_heads = n_kv_heads
        self.n_groups = n_heads // n_kv_heads  # = 4 for Mixtral
        self.d_head = d_model // n_heads

        # Q has n_heads heads, K and V have n_kv_heads heads
        self.q_proj = nn.Linear(d_model, n_heads * self.d_head, bias=False)
        self.k_proj = nn.Linear(d_model, n_kv_heads * self.d_head, bias=False)
        self.v_proj = nn.Linear(d_model, n_kv_heads * self.d_head, bias=False)
        self.o_proj = nn.Linear(n_heads * self.d_head, d_model, bias=False)

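        # Rotary positional embeddings (RoPE): the cos/sin tables are
        # precomputed here. This simplified sketch does not apply the rotation
        # to q and k in forward(); a full implementation would do so before
        # computing attention scores.
        self.register_buffer(
            "rope_freqs",
            self._compute_rope_freqs(max_seq_len, self.d_head)
        )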

    def forward(
        self,
        x: torch.Tensor,
        attention_mask: torch.Tensor = None,
        kv_cache: tuple = None,
    ) -> tuple:
        """
        Args:
            x: [batch, seq_len, d_model]
            attention_mask: Optional mask
            kv_cache: Cached (k, v) from previous steps for autoregressive generation

        Returns:
            output: [batch, seq_len, d_model]
            updated_kv_cache: Updated (k, v) tensors
        """
        B, T, _ = x.shape

        # Project to Q, K, V
        q = self.q_proj(x)  # [B, T, n_heads * d_head]
        k = self.k_proj(x)  # [B, T, n_kv_heads * d_head]
        v = self.v_proj(x)  # [B, T, n_kv_heads * d_head]

        # Reshape to separate heads
        q = q.view(B, T, self.n_heads, self.d_head).transpose(1, 2)
        # [B, n_heads, T, d_head]
        k = k.view(B, T, self.n_kv_heads, self.d_head).transpose(1, 2)
        # [B, n_kv_heads, T, d_head]
        v = v.view(B, T, self.n_kv_heads, self.d_head).transpose(1, 2)
        # [B, n_kv_heads, T, d_head]

        # Append to KV cache if provided
        if kv_cache is not None:
            k_cache, v_cache = kv_cache
            k = torch.cat([k_cache, k], dim=2)
            v = torch.cat([v_cache, v], dim=2)
        new_kv_cache = (k, v)

        # Expand KV heads to match Q heads (GQA key operation)
        # [B, n_kv_heads, seq_len, d_head] -> [B, n_heads, seq_len, d_head]
        k = k.repeat_interleave(self.n_groups, dim=1)
        v = v.repeat_interleave(self.n_groups, dim=1)

        # Scaled dot-product attention
        scale = 1.0 / math.sqrt(self.d_head)
        attn_weights = torch.matmul(q, k.transpose(-2, -1)) * scale
        # [B, n_heads, T, T_total]

        if attention_mask is not None:
            attn_weights = attn_weights + attention_mask

        attn_weights = F.softmax(attn_weights, dim=-1)

        # Apply attention to values
        output = torch.matmul(attn_weights, v)  # [B, n_heads, T, d_head]

        # Reshape and project
        output = output.transpose(1, 2).contiguous().view(B, T, -1)
        output = self.o_proj(output)

        return output, new_kv_cache

    def _compute_rope_freqs(self, max_seq_len: int, d_head: int) -> torch.Tensor:
        """Compute rotary positional embedding frequencies."""
        theta = 1.0 / (10000 ** (torch.arange(0, d_head, 2).float() / d_head))
        positions = torch.arange(max_seq_len).float()
        freqs = torch.outer(positions, theta)
        return torch.cat([freqs.cos(), freqs.sin()], dim=-1)
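The class above precomputes rotary frequencies but its forward pass never rotates the queries and keys. As a rough illustration of what that rotation does, the pure-Python sketch below rotates a single vector. It assumes the "rotate halves" pairing (element i is paired with element i + d/2); other implementations pair adjacent elements instead, and the function names are ours.

```python
import math


def rope_rotate(vec, position, base=10000.0):
    """Rotate each (vec[i], vec[i + d/2]) pair by position * theta_i."""
    d = len(vec)
    half = d // 2
    out = [0.0] * d
    for i in range(half):
        theta = base ** (-2 * i / d)      # same frequencies as in the class above
        angle = position * theta
        c, s = math.cos(angle), math.sin(angle)
        x1, x2 = vec[i], vec[i + half]
        out[i] = x1 * c - x2 * s
        out[i + half] = x1 * s + x2 * c
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


q = [1.0, 2.0, 3.0, 4.0]
k = [0.5, -1.0, 2.0, 0.0]

# The attention score depends only on the relative offset (2 in both cases).
score_a = dot(rope_rotate(q, 5), rope_rotate(k, 3))
score_b = dot(rope_rotate(q, 12), rope_rotate(k, 10))
print(abs(score_a - score_b) < 1e-9)  # True
```

This relative-position property is the reason RoPE is applied to queries and keys but not to values.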

Sliding Window Attention

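Mistral 7B introduced Sliding Window Attention (SWA) for handling long sequences, and Mixtral builds on the same architecture. Standard attention is $O(T^2)$ in sequence length - for a 32K context window, that's a massive computation and memory cost. One caveat: the Mixtral technical report describes a fully dense 32K context, so treat the window size used in this section (4096, the Mistral 7B setting) as an illustration of the technique rather than a confirmed Mixtral serving configuration.

SWA restricts each token to attend only to the $W$ most recent tokens, where $W$ is the window size (e.g., 4096). This makes attention $O(T \cdot W)$ instead of $O(T^2)$ , dramatically reducing cost for long sequences.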

The key insight: stacking layers lets information propagate further than one window. At layer 1, the token at position 12000 attends to roughly positions 8000–12000. At layer 2, it attends to the layer-1 representations of those same positions, and the representation at position 8000 already incorporates information from roughly 4000–8000. Each additional layer extends the reach by another window, so after $k$ layers the effective receptive field is approximately $k \times W$ .

For Mixtral's 32 layers and window size 4096: effective receptive field ≈ 131,072 tokens - larger than the 32K context window.
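The window rule and the receptive-field arithmetic can be checked with a few lines of plain Python. This is an illustrative sketch with hypothetical helper names; it models only which positions are visible, not the attention computation itself.

```python
def attended_positions(t: int, window: int) -> range:
    """Key positions a query at index t may attend to (causal + window)."""
    return range(max(0, t - window + 1), t + 1)


def receptive_field(n_layers: int, window: int) -> int:
    """Approximate upper bound on how far back information can flow."""
    return n_layers * window


# Tiny example: 6 tokens, window of 3. 'x' marks an allowed (query, key) pair.
T, W = 6, 3
for t in range(T):
    allowed = set(attended_positions(t, W))
    print("".join("x" if j in allowed else "." for j in range(T)))

print(receptive_field(32, 4096))  # 131072
```

Each printed row is one query position; the band of `x` marks is the sliding window, and it never extends to the right of the diagonal because the mask is also causal.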

class SlidingWindowAttention(nn.Module):
    """
    Sliding Window Attention for long context efficiency.
    Each token attends to the W most recent tokens.
    """

    def __init__(
        self,
        d_model: int = 4096,
        n_heads: int = 32,
        n_kv_heads: int = 8,
        window_size: int = 4096,
    ):
        super().__init__()
        self.window_size = window_size
        self.gqa = GroupedQueryAttention(d_model, n_heads, n_kv_heads)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, D = x.shape

        # Create sliding window causal mask
        # Token at position t can attend to positions max(0, t-W+1) to t
        positions = torch.arange(T, device=x.device)
        # offset[i, j] = i - j: how far key j lies behind query i
        offset = positions.unsqueeze(1) - positions.unsqueeze(0)
        allowed = (offset >= 0) & (offset < self.window_size)

        # Additive mask: 0 where attention is allowed, -inf elsewhere
        mask = torch.zeros(T, T, device=x.device, dtype=x.dtype)
        mask = mask.masked_fill(~allowed, float('-inf'))

        output, _ = self.gqa(x, attention_mask=mask.unsqueeze(0).unsqueeze(0))
        return output

The MoE Layer in Mixtral

Mixtral's MoE layer follows the standard top-2 routing approach with a simple linear router. The key differences from some other MoE implementations:

SwiGLU activation: uses the gated activation function $\text{FFN}(x) = (\text{SiLU}(xW_1) \odot xW_3) W_2$ , where $\odot$ is element-wise multiplication; this empirically outperforms ReLU and GELU for language models
No bias terms: all linear layers in the expert FFNs omit bias terms (this is standard for large LLMs, reducing memory and parameter count with minimal quality impact)
All layers are MoE: every one of the 32 transformer layers uses a MoE FFN, no dense FFN layers
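Putting the router and the experts together, the layer output for a token $x$ can be written as follows. This is a sketch in our own notation: $W_g$ denotes the router weights and $W_{1,i}, W_{2,i}, W_{3,i}$ the weights of expert $i$; these symbols are assumptions introduced here, not taken from the technical report.

```latex
\begin{aligned}
\mathcal{T}(x) &= \operatorname{Top2}\bigl(x W_g\bigr)
  && \text{indices of the two largest router logits} \\
g_i(x) &= \frac{\exp\bigl((x W_g)_i\bigr)}{\sum_{j \in \mathcal{T}(x)} \exp\bigl((x W_g)_j\bigr)}
  && \text{softmax restricted to the selected experts} \\
E_i(x) &= \bigl(\operatorname{SiLU}(x W_{1,i}) \odot x W_{3,i}\bigr) W_{2,i}
  && \text{SwiGLU expert } i \\
y &= \sum_{i \in \mathcal{T}(x)} g_i(x)\, E_i(x)
  && \text{weighted sum of two expert outputs}
\end{aligned}
```

Only the two experts in $\mathcal{T}(x)$ are evaluated, so the other six contribute neither compute nor output for that token.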

class MixtralMoELayer(nn.Module):
    """
    Mixtral's MoE layer implementation.
    8 experts, top-2 routing, SwiGLU activation.
    """

    def __init__(
        self,
        d_model: int = 4096,
        d_ff: int = 14336,
        n_experts: int = 8,
        top_k: int = 2,
    ):
        super().__init__()
        self.n_experts = n_experts
        self.top_k = top_k

        # Router
        self.gate = nn.Linear(d_model, n_experts, bias=False)

        # 8 experts, each is an independent SwiGLU FFN
        self.experts = nn.ModuleList([
            self._make_expert(d_model, d_ff)
            for _ in range(n_experts)
        ])

    def _make_expert(self, d_model: int, d_ff: int) -> nn.Module:
        """Create a single SwiGLU expert."""
        class SwiGLUExpert(nn.Module):
            def __init__(self):
                super().__init__()
                self.w1 = nn.Linear(d_model, d_ff, bias=False)
                self.w2 = nn.Linear(d_ff, d_model, bias=False)
                self.w3 = nn.Linear(d_model, d_ff, bias=False)

            def forward(self, x):
                # SwiGLU: SiLU(xW1) * xW3, projected by W2
                return self.w2(F.silu(self.w1(x)) * self.w3(x))

        return SwiGLUExpert()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """
        Args:
            x: [batch, seq_len, d_model]

        Returns:
            output: [batch, seq_len, d_model]
        """
        B, T, D = x.shape
        x_flat = x.view(-1, D)  # [B*T, D]

        # Router: compute expert scores and select top-2
        router_logits = self.gate(x_flat)  # [B*T, n_experts]
        routing_weights, selected_experts = torch.topk(
            F.softmax(router_logits, dim=-1),
            self.top_k,
            dim=-1,
        )  # Both [B*T, 2]

        # Normalize routing weights
        routing_weights /= routing_weights.sum(dim=-1, keepdim=True)

        # Compute expert outputs
        output = torch.zeros_like(x_flat)

        for expert_idx in range(self.n_experts):
            # Find tokens routed to this expert (at any of the k positions)
            expert_mask = (selected_experts == expert_idx)  # [B*T, k]
            token_mask = expert_mask.any(dim=-1)           # [B*T]

            if not token_mask.any():
                continue

            # Route tokens to expert
            expert_input = x_flat[token_mask]
            expert_output = self.experts[expert_idx](expert_input)

            # Get routing weights for this expert: each token selects an
            # expert at most once, so summing over the k slots picks out the
            # matching weight (or 0 where the expert was not selected)
            weights = (routing_weights * expert_mask).sum(dim=-1)  # [B*T]

            expert_weights = weights[token_mask].unsqueeze(-1)

            # Accumulate weighted outputs
            output[token_mask] += expert_weights * expert_output

        return output.view(B, T, D)
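The routing step in the layer above applies a softmax over all 8 experts, keeps the top 2, and renormalizes. A small pure-Python sketch, using made-up router logits, shows that this yields the same weights as a softmax over just the two selected logits:

```python
import math


def softmax(xs):
    m = max(xs)                            # subtract the max for stability
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]


def top2_route(logits):
    """Softmax over all experts, keep the two largest, renormalize."""
    probs = softmax(logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:2]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]


logits = [0.1, 2.0, -1.0, 0.5, 1.5, -0.3, 0.0, 0.9]   # illustrative values only
routes = top2_route(logits)
direct = softmax([logits[i] for i, _ in routes])

print([i for i, _ in routes])                      # [1, 4]
print(abs(routes[0][1] - direct[0]) < 1e-12)       # True
```

The equivalence holds because the softmax denominator cancels during renormalization, so either ordering of the two steps is a valid implementation choice.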

Performance vs. Llama 2 70B

The key Mixtral benchmarks from the technical report:

Benchmark	Llama 2 70B	Mixtral 8x7B	Category
MMLU	69.8%	70.6%	Knowledge
HellaSwag	87.3%	89.1%	Commonsense
WinoGrande	83.7%	81.2%	Commonsense
ARC	52.9%	59.7%	Reasoning
TriviaQA	87.3%	88.0%	Knowledge
HumanEval	29.3%	40.2%	Code
GSM8K	56.8%	60.4%	Math

The headline: Mixtral 8x7B matches or exceeds Llama 2 70B on most benchmarks, while requiring only 13B active parameters per token (vs. Llama's full 70B). The compute efficiency advantage is roughly 5x.
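The "roughly 5x" figure follows directly from the active parameter counts. A minimal sketch, assuming the common rule of thumb of about 2 FLOPs per active parameter per token for a forward pass (an approximation that ignores the attention-score computation):

```python
def forward_flops_per_token(active_params: float) -> float:
    """Rule-of-thumb forward cost: ~2 FLOPs per active parameter."""
    return 2 * active_params


mixtral = forward_flops_per_token(13e9)   # ~13B active parameters
llama = forward_flops_per_token(70e9)     # dense: all 70B are active
ratio = llama / mixtral
print(f"{ratio:.1f}x")  # 5.4x
```

Memory is a different story: all 47B parameters must stay resident, which is why the serving section sizes hardware by total rather than active parameters.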

Where Mixtral excels: code (HumanEval +10.9 percentage points), reasoning (ARC +6.8 points), and multilingual tasks (French, German, and Spanish are significantly better than Llama 2 70B).

Where it's weaker: WinoGrande (commonsense pronoun resolution) and certain long-form instruction-following tasks. One possible explanation is that expert routing handles some inference patterns less well than a fully dense model.

Mixtral 8x22B - Higher Capability

Mistral followed up with Mixtral 8x22B in April 2024. The key specs:

Parameter	Value
Number of layers	56
Hidden dimension	6144
Number of attention heads	48
KV heads	8
Number of experts	8
Active experts	2
Total parameters	~141B
Active parameters	~39B
Context length	65536 tokens

8x22B significantly outperforms 8x7B on every benchmark in the table below. It trails Llama 3 70B on each of them, most narrowly on ARC, while activating roughly 39B parameters per token versus Llama 3's dense 70B. At the time of release, it was among the best open-source models available.

Benchmark	Mixtral 8x7B	Mixtral 8x22B	Llama 3 70B
MMLU	70.6%	77.8%	82.0%
HumanEval	40.2%	75.0%	81.7%
GSM8K	60.4%	86.1%	93.0%
ARC	59.7%	65.7%	66.4%

Serving Mixtral - Hardware Requirements

Mixtral 8x7B

Memory requirements:

FP16 (half precision): ~94 GB (47B × 2 bytes)
4-bit quantized (GPTQ/AWQ): ~24 GB
8-bit quantized: ~47 GB

Hardware configurations:

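Single A100 80GB: FP16 doesn't fit. 8-bit fits; 4-bit fits with room to spare.
Dual A100 80GB: FP16 fits with tensor parallelism. Recommended for quality. (Dual A100 40GB totals only 80 GB, less than the ~94 GB of FP16 weights, so that setup is limited to 8-bit or 4-bit.)
Single RTX 4090 (24 GB): 4-bit weights alone are ~24 GB, so this is borderline at best; expect to need partial CPU offload or lower-bit quantization. Community option.
Multiple A40 48GB: Dual A40s work with 4-bit or 8-bit; triple A40s for FP16.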

Mixtral 8x22B

Memory requirements:

FP16: ~282 GB - needs 4+ A100 80GB
4-bit: ~70 GB - fits on 2x A100 80GB
Serving: Typically 4x A100 80GB or equivalent

import math


def estimate_mixtral_serving_requirements(
    model: str = "8x7B",  # or "8x22B"
    quantization: str = "fp16",  # "fp16", "int8", "int4"
    batch_size: int = 32,
    max_seq_len: int = 4096,
) -> dict:
    """
    Estimate hardware requirements for serving Mixtral.

    Counts model weights plus an FP16 KV cache only. Activations and
    framework overhead are ignored, so treat the result as a lower bound.
    Sizes are in decimal GB (1e9 bytes), matching the figures above.
    """
    configs = {
        "8x7B": {"total_params_B": 47, "active_params_B": 13, "n_layers": 32},
        "8x22B": {"total_params_B": 141, "active_params_B": 39, "n_layers": 56},
    }

    bytes_per_param = {
        "fp16": 2.0,
        "int8": 1.0,
        "int4": 0.5,
    }

    config = configs[model]
    bpp = bytes_per_param[quantization]

    # Model weights memory: (parameters in billions) * (bytes per parameter)
    weights_GB = config["total_params_B"] * bpp

    # KV cache memory (simplified; FP16 cache regardless of weight quantization)
    # KV cache = 2 (K and V) * n_layers * n_kv_heads * d_head * batch_size * seq_len * 2 bytes
    n_kv_heads = 8
    d_head = 128  # 4096 / 32 for 8x7B, 6144 / 48 for 8x22B
    kv_cache_GB = (
        2 * config["n_layers"] * n_kv_heads * d_head * batch_size * max_seq_len * 2
        / 1e9
    )

    total_GB = weights_GB + kv_cache_GB

    # GPU recommendation
    a100_80gb_count = math.ceil(total_GB / 80)
    a100_40gb_count = math.ceil(total_GB / 40)

    # Small int4 deployments can also run on a pair of 24 GB consumer GPUs
    if quantization == "int4" and total_GB < 40:
        recommended = "1x A100 80GB (or 2x RTX 4090)"
    else:
        recommended = f"{a100_80gb_count}x A100 80GB"

    return {
        "model": model,
        "quantization": quantization,
        "weights_GB": round(weights_GB, 1),
        "kv_cache_GB": round(kv_cache_GB, 1),
        "total_GB": round(total_GB, 1),
        "a100_80gb_needed": a100_80gb_count,
        "a100_40gb_needed": a100_40gb_count,
        "recommended_config": recommended,
    }


for model in ["8x7B", "8x22B"]:
    for q in ["fp16", "int8", "int4"]:
        r = estimate_mixtral_serving_requirements(model, q)
        print(f"Mixtral {model} {q}: {r['total_GB']}GB - {r['recommended_config']}")

Production Serving with vLLM

vLLM supports Mixtral natively. The command below shards the model across two GPUs with tensor parallelism:

# vLLM serving command (run as a shell command).
# --tensor-parallel-size 2 splits the model across 2 GPUs.
# python -m vllm.entrypoints.openai.api_server \
#     --model mistralai/Mixtral-8x7B-Instruct-v0.1 \
#     --tensor-parallel-size 2 \
#     --max-model-len 32768 \
#     --gpu-memory-utilization 0.90 \
#     --max-num-seqs 128

# Python client usage
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed-for-local",
)

def mixtral_generate(
    prompt: str,
    max_tokens: int = 2048,
    temperature: float = 0.7,
    system_prompt: str = "You are a helpful AI assistant.",
) -> str:
    """Generate text with Mixtral via vLLM's OpenAI-compatible API."""
    # Mixtral Instruct's chat template is built around alternating
    # user/assistant turns, and some versions of it reject a separate
    # "system" role. Folding the instructions into the user turn works
    # either way.
    user_content = f"{system_prompt}\n\n{prompt}" if system_prompt else prompt
    response = client.chat.completions.create(
        model="mistralai/Mixtral-8x7B-Instruct-v0.1",
        messages=[
            {"role": "user", "content": user_content},
        ],
        max_tokens=max_tokens,
        temperature=temperature,
    )
    return response.choices[0].message.content

:::danger Common Mistake: Using Mixtral Base Model for Instruction Following

Mixtral 8x7B base is a foundation model trained on raw text, not instruction-tuned. Using it without a chat template or system prompt will produce continuation-style completions, not helpful assistant responses. Always use Mixtral-8x7B-Instruct-v0.1 for assistant tasks, and apply the correct chat template: [INST] {user_message} [/INST].

:::
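To make the template concrete, here is a minimal sketch of how a multi-turn conversation could be laid out by hand. The `[INST] ... [/INST]` markers come from the note above; the `<s>` and `</s>` sequence markers and the exact whitespace are assumptions, and the function name is ours. In practice, prefer the tokenizer's own chat template (for example `tokenizer.apply_chat_template` in Hugging Face Transformers) over hand-built strings.

```python
def format_mixtral_prompt(turns, bos="<s>", eos="</s>"):
    """
    Build an instruct-style prompt string.

    turns: list of (user_message, assistant_reply) pairs. The last pair may
    use None as the reply, which leaves the prompt open for generation.
    """
    prompt = bos
    for user, assistant in turns:
        prompt += f"[INST] {user} [/INST]"
        if assistant is not None:
            prompt += f" {assistant}{eos}"
    return prompt


print(format_mixtral_prompt([("What is MoE?", None)]))
# <s>[INST] What is MoE? [/INST]
```

When serving through an OpenAI-compatible chat endpoint, this formatting is normally applied server-side, so the client only sends role-tagged messages.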

:::warning Quantization Quality Degradation

4-bit GPTQ/AWQ quantization of Mixtral 8x7B reduces quality noticeably on some tasks - especially code generation and mathematical reasoning. Benchmarks that show 40%+ on HumanEval at FP16 typically drop to 35–37% at INT4. For production systems where code quality matters, use 8-bit quantization (higher memory but much less quality loss) or the full FP16 model if hardware allows.

:::

:::tip Router Temperature Tuning

Mixtral's router can be made more or less sharp by adjusting the softmax temperature. At lower temperatures (dividing the router logits by T < 1 before the softmax), the mixing weights become more decisive - the top-ranked expert takes most of the weight. At higher temperatures, the weights spread more evenly. Temperature scaling preserves the ranking of the logits, so the top-2 selection itself does not change; only the weights assigned to the two selected experts do. For most applications the default training temperature works well, but if you fine-tune Mixtral, consider annealing router temperature during fine-tuning to prevent collapse.

:::
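The effect of router temperature can be seen on a toy example. The sketch below uses made-up logits and a hypothetical helper; it is not how any particular serving stack exposes this setting.

```python
import math


def top2_weights(logits, temperature=1.0):
    """Top-2 mixing weights after dividing the logits by `temperature`."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)                                  # for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:2]
    norm = sum(probs[i] for i in top)
    return {i: probs[i] / norm for i in top}


logits = [0.1, 2.0, -1.0, 0.5, 1.5, -0.3, 0.0, 0.9]  # illustrative values only
for t in (0.5, 1.0, 2.0):
    weights = top2_weights(logits, temperature=t)
    print(t, {i: round(w, 3) for i, w in weights.items()})
```

The same two experts (indices 1 and 4) are selected at every temperature; lowering the temperature only shifts more of the weight onto the higher-scoring one.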

Interview Questions and Answers

Q1: Why does Mixtral 8x7B have 47B parameters but only 13B active per token?

Mixtral has 8 expert FFNs per transformer layer. The "8x7B" name suggests 8 × 7B = 56B, but only the FFN blocks are replicated; attention, embeddings, and routers are shared, so the total is about 47B. Each SwiGLU expert has 3 × 4096 × 14336 ≈ 176M parameters, so the experts account for 8 × 32 × 176M ≈ 45B. The shared, always-active parameters (attention, embeddings, routers) add roughly 1.6B. For any given token, only 2 of the 8 experts are activated (top-2 routing), so the active count is about 1.6B (shared) + 2/8 × 45B (expert FFNs) ≈ 13B parameters per token. This is the source of Mixtral's efficiency: 13B active compute for a 47B capacity model.

Q2: What is Sliding Window Attention and why does Mixtral use it?

Standard attention is $O(T^2)$ in sequence length - for a 32K context, each token attends to all previous tokens, requiring a 32K × 32K attention matrix. Sliding Window Attention restricts each token to attend only to the W most recent tokens (W = 4096 in Mistral 7B), making attention $O(T \cdot W)$ . Through multiple transformer layers, information propagates beyond the window: after $k$ layers, the effective receptive field is roughly $k \times W$ . For 32 layers with W = 4096, that is 131,072 tokens - larger than the 32K context window. SWA enables long-context processing at much lower memory and compute cost than full attention. Mixtral builds on the Mistral 7B architecture that introduced SWA, although its technical report describes a fully dense 32K context, so a strong answer mentions both.

Q3: How does Grouped Query Attention (GQA) help Mixtral at inference time?

GQA reduces KV cache memory by sharing key-value heads across multiple query heads. Mixtral has 32 query heads but only 8 KV heads - each KV head is shared by 4 query heads. The KV cache (which stores key and value tensors from all previous tokens for autoregressive generation) grows proportionally to the number of KV heads. With 8 KV heads instead of 32, the KV cache is 4x smaller, enabling longer sequences or larger batch sizes within the same memory budget. The reduction is substantial: for a single 32K-token sequence in FP16, the cache is about 4 GiB with 8 KV heads versus about 16 GiB with 32 (2 × 32 layers × KV heads × 128 dimensions × 2 bytes per token). At batch size 32, those figures scale to roughly 128 GiB versus 512 GiB.

Q4: Compare Mixtral's performance vs. Llama 2 70B. How does it achieve similar quality with less compute?

Mixtral matches or slightly exceeds Llama 2 70B on most benchmarks (MMLU: 70.6% vs 69.8%, HumanEval: 40.2% vs 29.3%) while requiring only 13B active parameters vs Llama's 70B. Two factors enable this: (1) Total capacity advantage - with 47B parameters total, Mixtral can store more knowledge than a 13B dense model while paying only 13B in compute. The experts can specialize in different domains, and the routing ensures each token is processed by the most relevant experts. (2) Training efficiency - Mixtral's MoE structure allows it to learn more diverse representations across its many experts, making better use of its pre-training data.

Q5: What are the hardware requirements for serving Mixtral 8x7B in production, and what are the options?

At FP16 precision: 47B parameters × 2 bytes = ~94 GB VRAM, requiring 2+ A100 80GB GPUs. At INT8: ~47 GB, fits on 1 A100 80GB with room for KV cache. At INT4 (GPTQ or AWQ): ~24 GB, fits on a single A100 40GB; a 24 GB consumer GPU (RTX 4090) is borderline because the weights alone fill it, so it usually needs partial CPU offload or lower-bit quantization. For production serving: vLLM with tensor-parallel-size=2 on dual A100 80GB is the recommended configuration for full-quality (FP16) serving. For cost-optimized serving: AWQ 4-bit on a single A100 40GB. Throughput considerations: Mixtral benefits strongly from large batch sizes - at batch size 32, throughput per dollar is much better than at batch size 1.
