The scaling problem in AI has always been straightforward: bigger models perform better, but they cost more to run. For years, the dominant approach was to simply make every model parameter participate in every computation — a strategy that works until you hit the wall of economics and physics. Training a dense model with a trillion parameters means activating a trillion parameters for every single token you process, whether you’re generating a haiku or analyzing a legal contract.
Mixture of Experts (MoE) architectures offer an elegant escape from this dilemma. Instead of activating every parameter for every input, MoE models route each token to a small subset of specialized “expert” sub-networks. The result: models with massive total parameter counts — and the knowledge capacity that comes with them — but with inference costs comparable to models a fraction of their size.
This is not a new idea. The concept of conditional computation dates back to Jacobs et al. (1991), who proposed learning to route inputs to specialized modules. But MoE remained a niche technique until three papers brought it into the modern deep learning era. Shazeer et al. (2017) demonstrated that MoE layers could be inserted into LSTM and transformer networks, scaling to 137 billion parameters with sublinear increases in compute. GShard (Lepikhin et al., 2020) extended MoE to 600 billion parameters for multilingual translation, using top-2 gating and showing that sparse models could be distributed efficiently across thousands of accelerators. And Switch Transformers (Fedus et al., 2022) simplified the routing to top-1, scaled to 1.6 trillion parameters with 2,048 experts, and demonstrated that sparse models could match or exceed dense model quality with dramatically lower training compute.
Today, MoE has become the default architecture for frontier-scale language models. Understanding how it works — the gating mechanisms, the training challenges, the inference tradeoffs — is essential for anyone building or evaluating AI systems. This guide explains the full picture.
Dense vs. Sparse Models
To understand why MoE matters, you first need to understand the distinction between dense and sparse computation.
What “Dense” Means
In a dense model, every parameter is active during every forward pass. When a dense model like Llama 3.1 405B processes a token, it performs matrix multiplications involving all of its weights. The computational cost scales linearly with total parameter count.
This is simple and effective. Dense models are straightforward to train because gradient signals flow through every weight on every step. They are straightforward to deploy because there are no routing decisions to make: load all parameters into memory and run the same computation for every token.
The problem is that this approach is wildly inefficient. Consider a concrete example: when a dense 70B model generates a response about Python programming, every parameter in the model participates in every operation — including the billions of parameters that encode knowledge about medieval history, organic chemistry, Mandarin grammar, and thousands of other domains that are completely irrelevant to the current query. The compute is wasted. The energy is spent. The latency is incurred. And none of it contributes to the quality of the output.
This waste becomes especially acute as models scale. Moving from a 7B to a 70B to a 700B dense model means 10x and 100x increases in per-token compute cost. The quality improvements at each step are real but diminishing — the scaling curves flatten even as the cost curves stay linear. At some point, the economics break down: you cannot justify 10x more GPU-hours per token for 15% better benchmark scores.
What “Sparse” Means
A sparse model has many more total parameters than it activates for any single input. The key abstraction is the expert — a self-contained sub-network (typically a feed-forward block) that can process a token independently. A gating mechanism examines each incoming token and decides which experts should process it. Only the selected experts perform computation; the rest remain idle.
This creates a critical distinction:
- Total parameters: The full count of all weights across all experts. This determines the model’s total knowledge capacity and memory footprint.
- Active parameters: The subset of weights actually used for any given token. This determines the computational cost per token (FLOPs) and effective inference speed.
A model with 1 trillion total parameters but 32 billion active parameters stores as much knowledge as a 1T dense model but computes each token at roughly the cost of a 32B dense model.
This distinction is the key insight behind MoE. The total parameter count determines the model’s knowledge capacity — how much information about the world it can store. The active parameter count determines the computational cost — how many FLOPs are needed per token. MoE decouples these two quantities, allowing models to have massive knowledge capacity with modest computational requirements.
Another way to think about it: a dense model is like a company where every employee attends every meeting. An MoE model is like a company where a smart receptionist (the router) directs each question to the right team of specialists. Both companies have the same total headcount and institutional knowledge, but the MoE company processes requests far more efficiently because it does not waste expert time on irrelevant tasks.
The Current Landscape
The following table shows how major MoE models compare on total vs. active parameters:
| Model | Total Parameters | Active Parameters | Experts (Routed) | Experts Active | Shared Experts |
|---|---|---|---|---|---|
| DeepSeek V3 | 671B | 37B | 256 | 8 | 1 |
| Kimi K2.5 | 1T (1,040B) | 32B | 384 | 8 | 1 |
| MiniMax M2.5 | 230B | 10B | — | — | — |
| Qwen 3.5 | 397B | 17B | 512 | 10 | 1 |
Notice the pattern: active parameters typically range from 3% to 6% of total parameters. This extreme sparsity is what makes MoE economically viable at trillion-parameter scales. The knowledge is there, but you only pay to access the fraction you need for each token.
For comparison, consider that a dense 70B model like Llama 3.1 70B activates all 70 billion parameters for every token. Kimi K2.5 has 14 times more total parameters but activates fewer than half the parameters per token. The implications for cost and throughput are profound.
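The sparsity ratios in the table can be computed directly. A quick sketch (parameter counts in billions, as published; treat them as approximate):

```python
# Total vs. active parameters for the MoE models in the table above.
models = {
    "DeepSeek V3": (671, 37),
    "Kimi K2.5": (1040, 32),
    "MiniMax M2.5": (230, 10),
    "Qwen 3.5": (397, 17),
}

def sparsity(total_b, active_b):
    """Fraction of parameters active per token."""
    return active_b / total_b

for name, (total, active) in models.items():
    print(f"{name}: {sparsity(total, active):.1%} active")
```

Running this confirms the 3% to 6% range: Kimi K2.5 sits at the sparse end (~3.1%) and DeepSeek V3 at the dense end (~5.5%).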
Gating and Routing Mechanisms
The core innovation of MoE architectures is the gating network (also called the router) — the mechanism that decides which experts process each token. Getting routing right is arguably the hardest part of MoE design. A bad router wastes capacity, creates bottlenecks, and can cause entire experts to go unused.
The Gating Function
The standard gating function takes a token's hidden representation as input and produces a sparse probability distribution over experts. Formally, for a token representation x:

G(x) = TopK(softmax(W_g · x + ε), k)

where:

- W_g is a learned weight matrix that projects the token representation into a space with one dimension per expert
- ε is optional noise (often Gaussian) added during training to encourage exploration
- softmax converts the raw scores into a probability distribution
- TopK selects only the k highest-scoring experts and zeros out the rest

The output G(x) is a sparse vector of weights. The token is sent to the selected experts, each expert processes it independently, and the results are combined using the gating weights as a weighted sum:

y = Σ_{i ∈ TopK} G(x)_i · E_i(x)

where E_i(x) is the output of expert i when given input x.
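A minimal NumPy sketch of this gating function and the weighted combination. It renormalizes the kept top-k weights to sum to one, which many implementations do but the formula above leaves open — treat that detail as an assumption:

```python
import numpy as np

def top_k_gating(x, W_g, k, noise_std=0.0, rng=None):
    """Sparse gating: softmax over expert logits, keep top-k, renormalize.
    x: (d,) token representation; W_g: (num_experts, d) router weights."""
    logits = W_g @ x
    if noise_std > 0:  # optional exploration noise, used during training only
        logits = logits + (rng or np.random.default_rng()).normal(0.0, noise_std, logits.shape)
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                        # softmax over all experts
    top = np.argsort(probs)[-k:]                # indices of the k largest scores
    gates = np.zeros_like(probs)
    gates[top] = probs[top] / probs[top].sum()  # renormalize the kept weights
    return gates                                # sparse weight vector over experts

def moe_forward(x, W_g, experts, k):
    """Combine the selected experts' outputs with the gating weights.
    Assumes each expert maps x to an output of the same shape."""
    gates = top_k_gating(x, W_g, k)
    y = np.zeros_like(x, dtype=float)
    for i in np.nonzero(gates)[0]:              # only selected experts run
        y = y + gates[i] * experts[i](x)
    return y
```

Only k of the experts are ever called per token — the rest contribute neither FLOPs nor output, which is the entire point of the architecture.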
Token Routing Flow
The following diagram illustrates how tokens flow through a single MoE layer. Each token enters the layer, passes through the router, gets dispatched to its selected experts, and the expert outputs are recombined:
┌─────────────────────────────┐
│ INPUT TOKENS │
│ [tok_1] [tok_2] [tok_3] │
└──────────┬──────────────────┘
│
▼
┌─────────────────────────────┐
│ GATING NETWORK │
│ G(x) = TopK(softmax( │
│ W_g · x + ε), k) │
└──────────┬──────────────────┘
│
┌───────────────┼───────────────┐
│ │ │
▼ ▼ ▼
┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ Router │ │ Router │ │ Router │
│ scores for │ │ scores for │ │ scores for │
│ tok_1: │ │ tok_2: │ │ tok_3: │
│ E1=0.7 │ │ E2=0.5 │ │ E1=0.4 │
│ E3=0.3 │ │ E4=0.5 │ │ E2=0.6 │
└──────┬───────┘ └──────┬───────┘ └──────┬───────┘
│ │ │
┌────────────┼───────────────┼───────────────┼──────────┐
│ │ │ │ │
▼ ▼ ▼ ▼ ▼
┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐
│Expert 1 │ │Expert 2 │ │Expert 3 │ │Expert 4 │ │Expert N │
│ (FFN) │ │ (FFN) │ │ (FFN) │ │ (FFN) │ │ (FFN) │
│receives │ │receives │ │receives │ │receives │ │ ... │
│tok_1, │ │tok_2, │ │tok_1 │ │tok_2 │ │(idle) │
│tok_3 │ │tok_3 │ │ │ │ │ │ │
└────┬────┘ └────┬────┘ └────┬────┘ └────┬────┘ └─────────┘
│ │ │ │
└────────────┼───────────────┼───────────────┘
│ │
▼ ▼
┌─────────────────────────────────────────┐
│ WEIGHTED COMBINATION │
│ y_i = Σ G(x)_j · E_j(x_i) │
│ for each token's selected experts │
└──────────────┬──────────────────────────┘
│
▼
┌─────────────────────┐
│ OUTPUT TOKENS │
│ [out_1][out_2][out_3]│
└─────────────────────┘
In this example with k = 2, each token is routed to exactly two experts. Token 1 goes to Experts 1 and 3, Token 2 goes to Experts 2 and 4, and Token 3 goes to Experts 1 and 2. Expert N sits idle because no tokens were routed to it. Notice how different tokens can share experts (Experts 1 and 2 each serve two tokens) while some experts receive no tokens at all in this batch.
Top-K Routing
Top-k routing is the most common approach. The value of k controls the tradeoff between computational cost and representational capacity:
- Top-1 (Switch Transformer): Each token goes to exactly one expert. Minimizes compute and communication overhead. The risk is that a single expert must handle the full computation, which may limit the model’s ability to combine different types of knowledge for a single token.
- Top-2 (GShard, most modern MoE): Each token goes to two experts. Provides a balance between efficiency and expressiveness. The two experts can specialize differently — one might handle syntactic structure while another handles domain knowledge.
- Top-8 (DeepSeek V3, Kimi K2.5): Selecting more experts per token increases computational cost but allows for finer-grained specialization. When you have hundreds of experts, activating 8 still represents extreme sparsity (8 out of 256 is 3.1%).
Expert Choice Routing
Zhou et al. (2022) proposed inverting the routing paradigm: instead of tokens choosing experts, experts choose tokens. Each expert selects the top-k tokens it is most suited to process from the entire batch.
This approach has a natural load-balancing property — each expert processes exactly its fixed quota of tokens, eliminating the imbalance problem entirely. However, it introduces a new challenge: tokens might not be selected by any expert (dropped tokens) or might be selected by too many (requiring duplication). It also requires knowledge of the full batch at routing time, which complicates certain parallelism strategies.
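The inverted selection can be sketched in a few lines, assuming a precomputed token-by-expert affinity matrix (names are illustrative):

```python
import numpy as np

def expert_choice_routing(affinity, capacity):
    """Experts choose tokens: each expert takes its top-`capacity`
    tokens from the whole batch by affinity score.
    affinity: (num_tokens, num_experts) router scores.
    Returns {expert_id: array of chosen token indices}."""
    num_experts = affinity.shape[1]
    return {
        e: np.argsort(affinity[:, e])[-capacity:]  # this expert's favorite tokens
        for e in range(num_experts)
    }
```

Every expert ends up with exactly `capacity` tokens, but nothing guarantees every token is chosen — the dropped-token problem noted above.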
Hash Routing and Fixed Routing
Not all routing needs to be learned. Hash routing assigns tokens to experts based on a deterministic hash function — for example, routing based on the token’s vocabulary ID or position. This eliminates the gating network’s parameters and computation entirely, and guarantees perfect load balancing.
The tradeoff is obvious: hash routing cannot learn to group semantically related tokens together. Fixed routing patterns can be surprisingly effective for certain tasks but generally underperform learned routing on complex language modeling. They serve as useful baselines and can be valuable in resource-constrained settings where the overhead of a learned router is unacceptable.
A middle ground is random routing with learned combination weights: tokens are assigned to experts randomly (or by hash), but the output combination weights are still learned. This preserves some of the benefits of learned routing — the model can learn which expert outputs to trust for which inputs — while eliminating the routing network entirely.
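A sketch of deterministic hash routing by vocabulary ID (the function name and hashing scheme are illustrative, not from any particular system):

```python
import hashlib

def hash_route(token_id, num_experts, k=1):
    """Route a token to k experts using only a hash of its vocabulary
    ID: no learned router, uniform load in expectation, and the same
    token always visits the same experts."""
    digest = hashlib.sha256(str(token_id).encode()).hexdigest()
    base = int(digest, 16)
    # k consecutive experts derived from one hash (distinct for k <= num_experts)
    return [(base + i) % num_experts for i in range(k)]
```

Because the assignment depends only on the token ID, two occurrences of the same word always hit the same experts regardless of context — exactly the semantic blindness described above.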
Shared Experts
A refinement adopted by DeepSeek V3, Kimi K2.5, and Qwen 3.5 is the shared expert — one or more experts that process every token regardless of routing decisions. Shared experts capture universal patterns (basic syntax, common semantic operations) that apply across all inputs, while routed experts specialize in narrower domains. This decomposition improves model quality because the routed experts can afford to specialize more aggressively when they don’t also need to carry universal knowledge.
Expert Specialization
One of the most intriguing properties of MoE models is that experts develop genuine specializations during training — without being explicitly told what to specialize in. The gating network and experts co-evolve: the router learns to send certain types of tokens to certain experts, and those experts learn to become good at processing exactly those types of tokens.
Evidence of Specialization
Researchers have observed several forms of emergent specialization:
Linguistic specialization. Some experts become preferentially activated for specific parts of speech, syntactic constructions, or discourse functions. An expert might specialize in processing noun phrases, another in handling verb conjugations, and another in managing paragraph transitions.
Domain specialization. In models trained on diverse data, experts can develop affinities for specific domains — code, legal text, mathematical notation, conversational language. This is the MoE analog of how different attention heads in transformer architectures specialize for different relationship types.
Positional specialization. Some experts are preferentially activated for tokens at certain sequence positions — beginning-of-sequence tokens, tokens near punctuation, or tokens in the middle of long spans. This suggests that the routing has learned something about the different computational needs at different positions.
Language specialization. In multilingual models, experts can develop language-specific processing paths. GShard observed this directly: certain experts were activated almost exclusively for specific language pairs, effectively creating language-specialized sub-networks within the larger model. This has practical implications for multilingual deployment: an MoE model serving users in 50 languages might route each language’s tokens through a largely disjoint set of experts, effectively running 50 semi-independent language models within a single architecture.
How Specialization Emerges
Expert specialization is not programmed — it emerges from the interaction between the gating network’s optimization and the experts’ gradient updates. The process works roughly as follows:
1. Early training: The router assigns tokens to experts nearly randomly. All experts receive diverse gradient signals and learn general-purpose representations. Specialization is minimal.

2. Differentiation phase: Small random differences between experts mean some are slightly better at processing certain token types. The router begins to preferentially route those tokens to those experts. The experts, receiving more of those tokens, improve further on exactly those types. A positive feedback loop begins.

3. Stable specialization: After sufficient training, each expert has carved out a distinct niche. The router has learned a stable mapping from token features to expert assignments. The positive feedback loop has reached equilibrium — each expert is the best option for its assigned tokens, and the router has no incentive to change assignments.
This process is analogous to how attention heads in transformer layers develop functional specializations (syntax heads, positional heads, rare-word heads) without explicit supervision. The key difference is that MoE specialization operates at the sub-network level rather than the attention-pattern level, allowing for much coarser-grained and more interpretable specialization.
Capacity Factor and Expert Sizing
The capacity factor determines how many tokens each expert can process per batch. If n is the total number of tokens in a batch and E is the number of experts, perfect load balancing would send n/E tokens to each expert. The capacity factor C sets the actual buffer size:

Expert buffer size = C · (n / E)
A capacity factor of 1.0 means each expert has exactly enough space for perfect load balance. In practice, routing is imperfect, so capacity factors of 1.25 to 1.5 are common. Tokens routed to an expert that has already reached capacity are either dropped (their representation passes through unchanged) or rerouted to a less loaded expert.
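The buffer-size formula and the drop behavior can be sketched as follows (top-1 routing for simplicity; helper names are illustrative):

```python
import math

def expert_buffer_size(n_tokens, n_experts, capacity_factor):
    """Tokens each expert can accept: C * (n / E), rounded up."""
    return math.ceil(capacity_factor * n_tokens / n_experts)

def dispatch_with_capacity(assignments, n_experts, buffer_size):
    """Drop tokens routed to a full expert (their representations pass
    through the layer unchanged via the residual connection).
    assignments: list of expert ids, one per token (top-1 routing)."""
    load = [0] * n_experts
    kept, dropped = [], []
    for tok, e in enumerate(assignments):
        if load[e] < buffer_size:
            load[e] += 1
            kept.append((tok, e))
        else:
            dropped.append(tok)  # expert full: token skips the FFN this layer
    return kept, dropped
```

With n = 1024 tokens, E = 8 experts, and C = 1.25, each expert gets a buffer of 160 slots — 25% of headroom over the perfectly balanced 128.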
Expert sizing is another design dimension. Most MoE models use identically-sized experts (each expert is a standard feed-forward block with the same hidden dimensions), but there is no fundamental requirement for this. Research into heterogeneous expert sizes — where some experts are larger and handle more complex operations — is an active area.
The relationship between expert count, expert size, and total parameters is governed by a simple equation. If you have E experts, each with P parameters in its FFN, the total expert parameters are E · P. The design question is how to distribute a fixed parameter budget across experts:
- Fewer large experts (e.g., 64 experts, each 1B parameters = 64B total): Each expert is individually powerful but specialization is coarse-grained. Routing is easier because there are fewer options, but the model may not be able to develop highly specific domain expertise.
- Many small experts (e.g., 512 experts, each 125M parameters = 64B total): Same total parameters, but much finer-grained specialization. Each expert is individually weaker, but the combination of multiple small experts per token can be highly expressive. Routing is harder because the combinatorial space is larger.
Current trends favor more, smaller experts. DeepSeek V3 uses 256, Kimi K2.5 uses 384, and Qwen 3.5 uses 512. The evidence suggests that the combinatorial richness of selecting from a large expert pool outweighs the reduced capacity of each individual expert.
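One way to quantify that combinatorial richness is to count the distinct expert subsets a token can activate. Assuming top-2 routing for the coarse configuration and top-10 for the fine one (the k values here are illustrative, not from the bullet points above):

```python
from math import comb

# Distinct expert combinations available per token (ignoring shared experts).
coarse = comb(64, 2)    # 64 experts, top-2 routing
fine = comb(512, 10)    # 512 experts, top-10 routing (Qwen 3.5-style counts)

# Same parameter budget, but the fine-grained pool offers astronomically
# more ways to assemble a per-token computation path.
```

The coarse pool offers about two thousand combinations; the fine-grained one offers more than 10^20.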
Training Challenges
MoE architectures introduce several training challenges that don’t exist in dense models. The most fundamental is load balancing — ensuring that the gating network distributes tokens roughly evenly across experts.
The Load Balancing Problem
Without explicit encouragement, the gating network tends to converge to routing most tokens to a small subset of experts. This is a positive feedback loop: experts that receive more tokens get more gradient signal, become better at processing those tokens, and attract even more tokens. Meanwhile, underutilized experts receive sparse gradients, learn slowly, and become increasingly irrelevant.
This phenomenon — sometimes called expert collapse — can result in a model that has 256 experts but effectively uses only a handful. The total parameter count is large, but most parameters are dead weight.
Auxiliary Loss for Load Balancing
The standard solution is to add an auxiliary loss term that penalizes uneven expert utilization. The most common formulation (from Switch Transformers) takes a scaled dot product of two per-expert vectors:

L_balance = α · N · Σ_{i=1}^{N} f_i · P_i

where:

- N is the number of experts
- f_i is the fraction of tokens routed to expert i (the actual load)
- P_i is the average routing probability assigned to expert i (the intended load)
- α is a hyperparameter controlling the strength of the balancing penalty

This loss is minimized when f_i = P_i = 1/N for all experts — perfect uniform distribution. The coefficient α is typically small (0.01 to 0.1) because the balancing objective must not overwhelm the primary language modeling loss.

The product f_i · P_i is a clever formulation. If you just penalized the variance of P_i, the router could satisfy the constraint by assigning uniform probabilities while still concentrating tokens on few experts. The product formulation couples the actual routing decisions (f_i) with the probability estimates (P_i), making it harder for the router to "cheat."
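A sketch of this loss for top-1 routing (shapes and names are illustrative):

```python
import numpy as np

def load_balancing_loss(router_probs, expert_assignment, alpha=0.01):
    """Switch Transformer-style auxiliary loss: alpha * N * sum_i(f_i * P_i).
    router_probs: (n_tokens, N) softmax outputs of the gate.
    expert_assignment: (n_tokens,) chosen expert per token (top-1)."""
    n_tokens, N = router_probs.shape
    # f_i: fraction of tokens actually routed to expert i
    f = np.bincount(expert_assignment, minlength=N) / n_tokens
    # P_i: mean routing probability the gate assigns to expert i
    P = router_probs.mean(axis=0)
    return alpha * N * float(np.sum(f * P))
```

With perfectly uniform routing the loss evaluates to α; concentrating all tokens and probability mass on one expert drives it up toward α · N, which is exactly the gradient pressure that spreads the load.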
Auxiliary-Loss-Free Balancing
DeepSeek V3 pioneered a different approach: achieving load balance without any auxiliary loss. Instead of adding a penalty term, DeepSeek V3 introduces a bias term to the gating scores that is dynamically adjusted based on recent routing history. If an expert has been receiving too many tokens, its bias is decreased, making it less likely to be selected. If an expert is underutilized, its bias is increased.
This approach has the advantage of not distorting the training objective — the model is always optimizing purely for language modeling quality. The balancing mechanism operates as a separate control loop that adjusts routing without interfering with gradient flow. Early results suggest this produces better-quality models, and the approach has been adopted by several subsequent architectures.
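The control loop can be sketched as a simple nudge rule — a simplified illustration in the spirit of DeepSeek V3's method, not its exact update (the real system tunes step sizes and applies the bias only when selecting experts, not when weighting their outputs):

```python
def update_balance_bias(bias, loads, target_load, step=0.001):
    """One control-loop step: overloaded experts get a lower selection
    bias, underloaded experts a higher one. No gradient flows through
    this update — it never touches the training loss."""
    new_bias = []
    for b, load in zip(bias, loads):
        if load > target_load:
            new_bias.append(b - step)   # discourage selection
        elif load < target_load:
            new_bias.append(b + step)   # encourage selection
        else:
            new_bias.append(b)
    return new_bias
```

Because the bias only shifts which experts are selected, the language modeling gradients stay untouched — the key contrast with the auxiliary-loss approach.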
Communication Overhead
In distributed training, MoE introduces a unique communication pattern. In a standard dense model, each GPU processes its slice of the model (via tensor or pipeline parallelism) and communicates activations and gradients to its neighbors. The communication topology is predictable and can be highly optimized.
MoE adds all-to-all communication: because any token on any GPU might need to be routed to any expert on any other GPU, the routing step requires a global shuffle of token representations. This all-to-all communication is often the bottleneck in MoE training, especially as the number of experts and GPUs grows.
Mitigation strategies include:
- Expert parallelism: Distributing experts across devices so that each device hosts a subset of experts. Combined with data parallelism, this limits the number of devices each token needs to communicate with.
- Capacity constraints: Limiting the number of tokens an expert can receive reduces the worst-case communication volume.
- Topology-aware routing: DeepSeek V3 constrains each token to be sent to at most 4 nodes, explicitly limiting communication fan-out. This is a hard architectural constraint that trades some routing flexibility for dramatically reduced communication overhead.
- Overlapping communication and computation: Modern frameworks pipeline the all-to-all communication with expert computation, hiding latency behind useful work.
Training Instability
MoE models are more prone to training instability than dense models, particularly at large scales. The router creates a discrete decision (which experts to select) within an otherwise continuous optimization landscape. This can lead to:
- Oscillatory routing: Experts swap roles rapidly, preventing stable specialization.
- Sudden expert collapse: A well-functioning expert suddenly stops receiving tokens due to a routing shift, losing its learned representations.
- Gradient spikes: The combination of sparse computation and load-balancing dynamics can produce sudden large gradients.
Mitigation strategies include:
- Router z-loss: An additional regularization term that penalizes large logits in the gating network, keeping the router’s outputs in a numerically stable range.
- Dropout on routing: Randomly dropping routing decisions during training forces the model to be robust to imperfect routing.
- Gradient clipping: Aggressive gradient clipping (lower thresholds than for dense models) prevents catastrophic updates.
- Warmup: Gradually increasing the number of active experts during early training, allowing the router to develop stable patterns before the full expert pool is available.
- CISPO (MiniMax): MiniMax developed a custom algorithm called CISPO specifically for stabilizing MoE training at scale, which was critical for the M2.5 series.
Inference Efficiency
The economic argument for MoE is clearest at inference time. When serving a model to millions of users, the cost per token dominates the total cost of ownership. MoE architectures reduce the per-token compute cost dramatically while maintaining the quality benefits of large parameter counts.
Why MoE Is Cheaper Per Token
The FLOPs required to process a token through a transformer layer are dominated by the feed-forward network (FFN), which typically accounts for roughly two-thirds of per-layer computation. In a dense model, the full FFN is executed for every token. In an MoE model, the FFN is replaced by multiple expert FFNs, and only k of them execute per token.

If a dense model's FFN has parameter count P and an MoE model has N experts, each with parameter count P/N (the same total parameters), activating k experts costs:

FLOPs_MoE = (k / N) · FLOPs_dense

For DeepSeek V3 with N = 256 and k = 8, this is 3.1% of the dense FFN cost. When you add the shared expert and attention layers (which are not sparse), the total per-token compute is roughly 5.5% (37B of 671B parameters active) of what a 671B dense model would require.
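In numbers (a sketch; the helper ignores attention, the router, and shared experts):

```python
def moe_ffn_fraction(k, num_experts):
    """FLOPs of k active experts relative to a dense FFN with the
    same total parameters."""
    return k / num_experts

# DeepSeek V3-style routing: 8 of 256 routed experts per token
deepseek_v3 = moe_ffn_fraction(8, 256)   # 0.03125, i.e. 3.1%
```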
Memory Requirements
Here is the critical catch: while MoE models are computationally cheap per token, they are memory-expensive. All experts must be loaded into memory (GPU VRAM or system RAM) even though only a fraction are active at any given moment. This is because routing decisions are made dynamically — you cannot predict which experts will be needed for the next token, so all must be resident and ready.
For a 671B parameter model at FP16 precision, the model weights alone require approximately:
671 × 10⁹ parameters × 2 bytes ≈ 1.34 TB
This far exceeds the memory of any single GPU (the largest available have 80-192 GB of VRAM), necessitating multi-GPU deployment. For comparison, a dense 70B model requires roughly 140 GB at FP16 — feasible on a single high-end GPU or a pair of consumer GPUs with quantization.
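The arithmetic, as a sketch (weights only — KV-cache and activation memory come on top):

```python
import math

def weight_memory_gb(params_billions, bits_per_param):
    """Model-weight footprint in GB (decimal): params * bits / 8 bytes."""
    return params_billions * bits_per_param / 8

def min_gpus(memory_gb, gpu_gb=80):
    """GPUs needed just to hold the weights, with no headroom."""
    return math.ceil(memory_gb / gpu_gb)

fp16 = weight_memory_gb(671, 16)   # 1342 GB, i.e. ~1.34 TB
int4 = weight_memory_gb(671, 4)    # 335.5 GB after INT4 quantization
```

At 80 GB per GPU, the FP16 weights alone need 17 GPUs; INT4 brings that down to 5 — the same gap discussed in the quantization section below.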
Inference Comparison
The following table compares a hypothetical dense 70B model against an MoE model with 400B total parameters and 40B active parameters for a typical inference workload:
| Metric | Dense 70B | MoE 400B (40B active) |
|---|---|---|
| FLOPs per token | ~140 GFLOPs | ~80 GFLOPs |
| Model weight memory (FP16) | ~140 GB | ~800 GB |
| Min. GPUs (80GB each) | 2 | 10 |
| Tokens/second (single request) | ~50-80 | ~60-100 |
| Throughput (batch) | Moderate | High (expert parallelism) |
| Quality | Strong | Stronger (more knowledge capacity) |
The MoE model is actually faster per token (fewer FLOPs) despite having 5.7 times more total parameters. But it requires 5 times more GPUs to serve. The economic calculation depends on utilization: at high request volumes, the MoE model’s higher throughput and better quality per FLOP make it more cost-effective. At low volumes, the dense model’s lower hardware requirements win.
Expert Parallelism
Expert parallelism is a deployment strategy where different experts are placed on different GPUs. Combined with the token-level parallelism used during inference, this allows MoE models to achieve high throughput even on large GPU clusters.
The key insight is that expert parallelism converts sequential computation into parallel communication. Instead of one GPU computing all 8 active experts sequentially, 8 GPUs each compute one expert in parallel, then exchange results. The bottleneck shifts from computation to communication — but modern NVLink and InfiniBand interconnects can handle this efficiently at the latencies required for real-time inference.
Expert Offloading
For deployment on limited hardware, expert offloading moves inactive experts to CPU RAM or even disk, loading them into GPU memory only when needed. This dramatically reduces GPU memory requirements at the cost of latency:
- GPU-only: All experts in VRAM. Lowest latency, highest cost.
- GPU + CPU offloading: Active experts in VRAM, inactive experts in CPU RAM. Load on demand. Adds microseconds to milliseconds of latency per expert swap.
- GPU + disk offloading: Most experts on NVMe SSD. Adds milliseconds of latency per expert swap. Only viable for batch processing, not real-time serving.
Several open-source inference frameworks (vLLM, TensorRT-LLM, SGLang) have implemented sophisticated expert offloading strategies that predict which experts will be needed based on routing patterns, pre-fetching them into GPU memory before they are actually required. This predictive offloading can hide much of the latency penalty in practice.
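A toy version of the simplest offloading policy — a plain LRU cache over expert weights, without the predictive pre-fetching those frameworks add (all names hypothetical):

```python
from collections import OrderedDict

class ExpertCache:
    """LRU cache over expert weights: keep the most recently used
    experts resident in (simulated) GPU memory, evict the stalest
    when a newly routed expert must be loaded from host memory."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.resident = OrderedDict()   # expert_id -> weights, LRU order
        self.loads = 0                  # host->GPU transfers (the cost metric)

    def fetch(self, expert_id, load_fn):
        if expert_id in self.resident:              # hit: refresh recency
            self.resident.move_to_end(expert_id)
        else:                                       # miss: load, evict LRU if full
            self.loads += 1
            if len(self.resident) >= self.capacity:
                self.resident.popitem(last=False)
            self.resident[expert_id] = load_fn(expert_id)
        return self.resident[expert_id]
```

Because routing patterns are skewed toward a hot subset of experts, even this naive policy avoids many transfers; predictive pre-fetching improves on it by loading experts before the router asks for them.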
Quantization and MoE
Quantization — reducing the precision of model weights from FP16 (16 bits per parameter) to INT8 (8 bits) or INT4 (4 bits) — is even more impactful for MoE models than for dense models, precisely because the memory bottleneck is more severe.
A 671B parameter model at FP16 requires ~1.34 TB of memory. At INT4 quantization, this drops to ~335 GB — potentially fitting on 4-5 high-end GPUs instead of 17+. The key question is whether quantization disproportionately harms MoE models. Early evidence suggests that MoE models are reasonably robust to quantization, potentially because each expert operates on a narrower distribution of inputs than a dense model’s FFN, making the weight distributions more quantization-friendly.
However, the router weights are typically kept at full precision even when expert weights are quantized. The routing decisions are critical — a quantization-induced error in the router that sends a token to the wrong expert is far more damaging than a small precision loss in the expert computation itself.
Real-World MoE Systems
The general principles above manifest differently in each production system. Let’s examine how three current frontier models implement MoE and what tradeoffs they make.
Kimi K2.5: Trillion-Parameter Scale
Kimi K2.5 from Moonshot AI is one of the largest open-weight MoE models, with 1.04 trillion total parameters and 32 billion active per token. Its architecture makes several notable choices:
Expert count and routing: K2.5 uses 384 routed experts per layer with top-8 routing, plus 1 shared expert that processes every token. The sparsity ratio of 8/384 (2.1%) is among the most aggressive in production models. The 384-expert configuration is 50% more granular than DeepSeek V3’s 256 experts, allowing for finer-grained specialization.
Layer topology: The model has 61 layers, but they are not all MoE layers. The first layer is a dense layer that standardizes input embeddings into a consistent latent representation before the data enters the sparse network. The remaining 60 layers implement MoE. This dense-first design ensures that the routing network receives clean, normalized representations rather than raw embedding vectors.
Multi-head Latent Attention (MLA): Inherited from the DeepSeek lineage, MLA compresses key and value representations into a low-rank latent space, dramatically reducing the memory cost of the KV-cache during inference. This is critical at the context lengths K2.5 supports (256K tokens) — standard multi-head attention would require terabytes of KV-cache memory at this scale.
Training approach: K2.5 was trained natively on 15 trillion mixed visual and text tokens. This multimodal training means the routing network has learned to handle both modalities, with some experts likely specializing in visual token processing and others in text.
The result is a model that achieves frontier-tier performance on benchmarks while being available as open weights. Its economic positioning is interesting: the 32B active parameter count means inference costs are comparable to running a 32B dense model, but with the knowledge capacity of a model 30 times larger.
MiniMax M2.5: Efficiency at Scale
MiniMax M2.5 takes a different approach to the efficiency question, achieving frontier performance with 230B total parameters and only 10B active — the smallest active parameter count among current frontier MoE models.
Architecture philosophy: While specific expert counts and configurations have not been fully disclosed in the same detail as DeepSeek V3 or Kimi K2.5, MiniMax has emphasized two key innovations in their MoE design. First, the use of shared experts that handle universal computation, allowing routed experts to specialize more aggressively. Second, the CISPO training algorithm that maintains training stability as MoE models scale.
M2.5 Lightning: MiniMax released two versions of M2.5 — the standard model and M2.5 Lightning. They are architecturally identical but differ in serving configuration. Lightning sustains a throughput of 100 tokens per second, approximately twice the speed of comparable frontier models, achieved through aggressive expert parallelism and inference optimization rather than architectural changes.
Cost positioning: At $2.40 per million output tokens, M2.5 is roughly 1/20th the cost of the most expensive frontier models. The 10B active parameter count is the foundation of this pricing — fewer active parameters mean fewer FLOPs per token, which means less GPU time per request.
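The FLOPs argument can be made concrete with the common rule of thumb that a forward pass costs roughly 2 FLOPs per active parameter per token (a rough estimate that ignores attention and serving overhead):

```python
# Rough per-token inference cost from active parameter counts, using
# the ~2 FLOPs per active parameter per token rule of thumb.

def flops_per_token(active_params: float) -> float:
    return 2 * active_params

m2_5 = flops_per_token(10e9)   # MiniMax M2.5: 10B active parameters
k2_5 = flops_per_token(32e9)   # Kimi K2.5: 32B active parameters
print(f"M2.5: {m2_5:.1e} FLOPs/token; K2.5 costs {k2_5 / m2_5:.1f}x more")
```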
Domain specialization: MiniMax has noted that specific experts within M2.5 have developed fine-grained domain specializations: some experts are tuned for Python code generation, others for mathematical proofs, and others for musical theory. This level of specialization is enabled by the large number of total parameters relative to active parameters — there is enough total capacity for narrow specialization without compromising general capability.
Qwen 3.5: Hybrid Architecture Innovation
Qwen 3.5 from Alibaba takes perhaps the most architecturally novel approach, with 397B total parameters and 17B active. Several design choices distinguish it from the DeepSeek/Kimi lineage:
Expert granularity: Qwen 3.5 uses 512 routed experts per layer — the highest count among current frontier models — with top-10 routing and 1 shared expert. The combination of many small experts and a relatively high k creates a distinctive routing dynamic where each token accesses a more diverse set of specialized knowledge.
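A minimal sketch of this kind of gating, assuming a 512-expert pool with top-10 selection (NumPy only; production implementations run fused on-device and add load balancing, which this omits):

```python
# Minimal top-k gating sketch over a pool of routed experts, in the
# spirit of the 512-expert / top-10 configuration above. Illustrative
# only: no load balancing, no capacity limits, single token.
import numpy as np

def route(x, router_w, k=10):
    """Return (expert_ids, weights): top-k experts and their softmax weights."""
    logits = x @ router_w                       # one score per expert
    top = np.argsort(logits)[-k:]               # indices of the k largest scores
    w = np.exp(logits[top] - logits[top].max())
    return top, w / w.sum()                     # renormalize over selected experts

rng = np.random.default_rng(0)
d_model, num_experts = 64, 512
router_w = rng.standard_normal((d_model, num_experts))
x = rng.standard_normal(d_model)

ids, weights = route(x, router_w)
# Token output would be: shared_expert(x) + sum_i weights[i] * experts[ids[i]](x)
assert len(ids) == 10 and abs(weights.sum() - 1.0) < 1e-6
```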
Hybrid attention architecture: Qwen 3.5 incorporates Gated Delta Networks, a form of linear attention, alongside the standard softmax attention mechanism. This hybrid approach addresses the quadratic complexity of standard attention for long sequences. At 256K context lengths, Qwen 3.5 decodes 19 times faster than Qwen3-Max and 7.2 times faster than the previous Qwen3 235B MoE model.
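The efficiency gain comes from the recurrence form of linear attention: decoding carries a fixed-size state instead of a growing KV-cache. The sketch below shows plain (ungated) linear attention — the recurrence that Gated Delta Networks extend with a learned gated update rule — as an illustration, not Qwen's implementation:

```python
# Why linear attention decodes in O(1) per token: the model carries a
# fixed-size state S (and normalizer z) instead of a growing KV-cache.
# Plain linear attention sketch; gated variants add learned decay.
import numpy as np

def linear_attn_decode_step(S, z, q, k, v):
    """One decode step: fold (k, v) into the state, then read with q."""
    S = S + np.outer(k, v)           # accumulate key-value associations
    z = z + k                        # normalizer accumulator
    out = (q @ S) / (q @ z + 1e-6)   # read cost is independent of sequence length
    return S, z, out

d = 8
rng = np.random.default_rng(1)
S, z = np.zeros((d, d)), np.zeros(d)
for _ in range(1000):                # 1000 decode steps; state never grows
    q, k, v = np.exp(rng.standard_normal((3, d)))  # positive feature maps
    S, z, out = linear_attn_decode_step(S, z, q, k, v)
assert S.shape == (d, d)             # state stays d x d regardless of length
```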
Expert scaling from Qwen 3: The jump from 128 experts in Qwen 3 to 512 in Qwen 3.5 (a 4x increase), alongside a total parameter count lower than Qwen3-Max's, suggests that expert granularity is a more efficient scaling axis than raw parameter count: more, smaller experts with appropriate routing can outperform fewer, larger experts.
1M token context: Qwen 3.5 supports up to 1 million tokens of context, enabled by the linear attention hybrid. The MoE architecture contributes to this capability — because per-token compute is low (17B active parameters), the model can process extremely long sequences without the compute costs becoming prohibitive.
Comparative Routing Approaches
Each of these models addresses the routing/balancing challenge differently:
| Challenge | DeepSeek V3 | Kimi K2.5 | MiniMax M2.5 | Qwen 3.5 |
|---|---|---|---|---|
| Load balancing | Auxiliary-loss-free (bias adjustment) | Auxiliary loss | CISPO algorithm | Auxiliary loss |
| Communication | Max 4 nodes per token | — | Optimized for throughput | — |
| Shared experts | 1 per layer | 1 per layer | Yes | 1 per layer |
| Expert count | 256 | 384 | — | 512 |
| Routing k | 8 | 8 | — | 10 |
The trend is clear: expert counts are increasing (256 to 384 to 512), routing is becoming more sophisticated (from simple auxiliary loss to loss-free balancing), and shared experts have become standard. Each generation of MoE models incorporates lessons from previous designs.
The Future of Sparse Computation
MoE architectures are evolving rapidly, with several research directions likely to shape the next generation of sparse models.
Expert Merging
One emerging technique is post-training expert merging — identifying pairs of experts that have learned similar functions and combining them into a single expert. This reduces the total parameter count (and thus memory requirements) without significantly impacting quality, because the merged expert retains the capabilities of both originals.
Expert merging can also be applied selectively: merge experts for deployment on resource-constrained hardware while keeping the full expert set for high-end serving. This creates a spectrum of deployment options from a single trained model, similar to how quantization provides size-quality tradeoffs for dense models.
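A toy version of the idea: find the most redundant pair of experts by cosine similarity of their flattened weights and average them. Real merging methods are more careful (e.g. aligning neuron permutations or matching activations); this sketch only illustrates the mechanics:

```python
# Toy post-training expert merging: locate the most cosine-similar pair
# of expert weight matrices and replace them with their average.
# Illustrative only; real methods account for permutation symmetry.
import numpy as np

def most_similar_pair(experts):
    """Return indices (i, j) of the most cosine-similar expert pair."""
    flat = [e.ravel() / np.linalg.norm(e) for e in experts]
    best, pair = -1.0, (0, 1)
    for i in range(len(flat)):
        for j in range(i + 1, len(flat)):
            sim = float(flat[i] @ flat[j])
            if sim > best:
                best, pair = sim, (i, j)
    return pair

def merge(experts, i, j):
    """Replace experts i and j with their average, shrinking the pool by one."""
    merged = 0.5 * (experts[i] + experts[j])
    return [e for idx, e in enumerate(experts) if idx not in (i, j)] + [merged]

rng = np.random.default_rng(2)
experts = [rng.standard_normal((16, 16)) for _ in range(4)]
experts.append(experts[0] + 0.01 * rng.standard_normal((16, 16)))  # near-duplicate
i, j = most_similar_pair(experts)
assert 0 in (i, j) and 4 in (i, j)   # the near-duplicate pair is detected
experts = merge(experts, i, j)
assert len(experts) == 4             # pool shrank by one with little quality loss
```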
Dynamic Expert Allocation
Current MoE models use a fixed number of experts per layer. But different inputs have different computational needs — a simple greeting requires less processing than a complex mathematical derivation. Dynamic expert allocation would allow the model to activate more experts for harder inputs and fewer for easy ones, optimizing the compute-quality tradeoff at the token level.
This is conceptually related to “early exit” techniques in dense models, where easy inputs exit the network after fewer layers. In an MoE context, the gating network could be extended to also decide how many experts to consult, not just which ones.
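One simple hypothetical mechanism: select experts in descending order of router probability until a cumulative probability mass is covered, capped at a maximum k. No current frontier model is known to use exactly this rule; it is a sketch of the concept:

```python
# Hypothetical adaptive-k routing: activate the smallest set of experts
# whose cumulative router probability exceeds a mass threshold.
# A sketch of dynamic expert allocation, not a published mechanism.
import numpy as np

def adaptive_route(logits, mass=0.5, max_k=10):
    """Select experts by descending probability until `mass` is covered."""
    p = np.exp(logits - logits.max())
    p /= p.sum()
    order = np.argsort(p)[::-1]                       # most probable first
    cum = np.cumsum(p[order])
    k = min(int(np.searchsorted(cum, mass)) + 1, max_k)
    return order[:k]

# A confident router (one dominant logit) activates a single expert...
assert len(adaptive_route(np.array([8.0, 0.0, 0.0, 0.0]))) == 1
# ...while an uncertain router activates more for the same mass threshold.
assert len(adaptive_route(np.zeros(4))) == 2
```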
Hardware Co-Design
Current GPUs and TPUs were designed primarily for dense matrix operations. MoE workloads have fundamentally different characteristics: sparse, irregular memory access patterns (loading different expert weights for different tokens) and all-to-all communication requirements. Custom hardware designed for MoE workloads could dramatically improve efficiency.
Research directions include:
- Expert-aware memory hierarchies: Caching frequently-used experts in fast memory while less-used experts reside in slower tiers.
- Sparse matrix accelerators: Hardware units optimized for the specific sparsity patterns that MoE routing creates.
- Communication-optimized interconnects: Network topologies designed for the all-to-all traffic pattern that expert parallelism requires.
Groq’s LPU architecture and Cerebras’s wafer-scale chips are early examples of hardware that breaks from the GPU paradigm. As MoE becomes the default architecture, we can expect more hardware innovation specifically targeting sparse workloads.
MoE Beyond Transformers
While current MoE research focuses on transformer FFN layers, the principle of conditional computation is more general. Research is exploring:
- MoE attention: Using routing to select among multiple attention mechanisms with different characteristics (local, global, linear, sliding window).
- MoE in state-space models: Applying sparse expert routing to SSM architectures like Mamba, which have their own efficiency advantages.
- Vision MoE: Sparse expert routing for vision transformer patches, where different image regions activate different visual processing experts.
- Multi-modal routing: Models like Kimi K2.5 already route both visual and text tokens through the same expert pool. Future architectures might use modality-aware routing that directs visual tokens to vision-specialized experts and text tokens to language-specialized experts within a single unified model.
Scaling Projections
The trajectory of MoE scaling is striking. In 2022, Switch Transformer reached 1.6T parameters with 2,048 experts. By 2025, DeepSeek V3 had reached 671B parameters with 256 experts and demonstrated that auxiliary-loss-free training was viable. In early 2026, Kimi K2.5 crossed 1T parameters with 384 experts, and Qwen 3.5 pushed to 512 experts.
If these trends continue, we can expect models in the 2-5T total parameter range within the next year, with active parameter counts remaining in the 20-50B range. The ratio of total to active parameters — currently around 20:1 to 30:1 — may increase further as routing mechanisms become more sophisticated and hardware becomes better at handling extremely sparse workloads.
From Research to Production Standard
The speed at which MoE has gone from research novelty to production default is remarkable. In 2020, GShard was a research demonstration. By 2024, DeepSeek V3 proved that MoE could match proprietary dense models at a fraction of the training cost. In early 2026, virtually every new frontier model announcement uses MoE.
This rapid adoption was driven by convergence on a set of best practices: shared experts for universal knowledge, hundreds of small routed experts for fine-grained specialization, top-k routing with k between 2 and 10, and dynamic load balancing (whether through auxiliary loss or loss-free methods). These practices have been validated at scales from 10B to over 1T parameters, giving practitioners confidence that MoE designs are robust and predictable.
For organizations evaluating AI models, the practical takeaway is straightforward: MoE is not an exotic architecture requiring special consideration. It is the standard architecture for frontier models, and the operational considerations — higher memory requirements, expert parallelism for serving, quantization strategies — are well-understood and supported by mainstream inference frameworks.
The fundamental insight of MoE remains as relevant as it was in 1991: not every part of a model needs to participate in every computation. What has changed is the scale at which this principle operates and the sophistication of the routing mechanisms that make it work. As models continue to grow, sparse computation is not just an optimization — it is the only viable path to systems that are both capable and economically deployable at scale.
