What Is Mixture of Experts (MoE)?
Mixture of Experts is an architecture where only a subset of a model's parameters activate per token, making very large models faster and cheaper to run.
Mixture of Experts (MoE) is an AI architecture where a model contains many specialized sub-networks called "experts," but only a small fraction of them activate for any given input token - making the model far more efficient to run than its total parameter count would suggest.
That single idea - conditional computation - is why Mixtral 8x7B can match GPT-3.5's quality while using roughly the same compute per token as a 13B dense model. It's likely part of why GPT-4's inference costs are lower than you'd expect from a model at that capability level, although OpenAI has never confirmed the architecture. And it's why MoE has quietly become one of the most important architectural decisions in frontier AI.
I've spent the last year running MoE and dense models side by side across writing, coding, and reasoning tasks, and the practical tradeoffs are more nuanced than most explainers let on. This article covers what MoE actually does, how the routing mechanism works, where it wins and where it struggles, and when the architecture actually matters for how you choose your tools.
What Is Mixture of Experts?
Mixture of Experts is an architecture in which a large language model routes each input token through only a small subset of its available neural network modules (the "experts"), rather than passing every token through all parameters in the network.
The concept predates modern deep learning. The original MoE framework was described by Jacobs, Jordan, Nowlan, and Hinton in 1991 - the idea being that different experts specialize in different regions of the input space, and a gating network learns to select the right expert for each input. What changed in the last few years is applying this idea inside transformer feed-forward layers at enormous scale, which turned a 30-year-old idea into the architecture powering some of the most capable models available.
Here's the core mechanic. In a standard transformer architecture, every token passes through every parameter in the feed-forward layers - the network is "dense." In an MoE model, the feed-forward layers are replaced with a collection of parallel expert networks (each structurally identical, but with separate learned weights) plus a small routing network.
For each token, the router computes a score for every expert, selects the top-K (usually 2), sends the token through just those experts, and combines the outputs in proportion to the routing weights. The other N-K experts don't run at all for that token - their parameters exist in memory, but contribute zero floating-point operations.
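A model described as "8x7B" (like Mixtral) has 8 experts per MoE layer, and the name suggests 8 x 7B = 56B total parameters. The real total is lower, about 46-47B, because only the feed-forward blocks are replicated per expert; attention layers and embeddings are shared. With top-2 routing, only about 12-13B parameters activate per token forward pass. You get the expressivity of a roughly 46B model at something closer to the inference cost of a 13B model.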
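A back-of-the-envelope count makes the total-versus-active gap concrete. The sketch below assumes a Mixtral-8x7B-style configuration; the layer sizes are recalled from the published model config and should be treated as assumptions, and small terms such as normalization weights are ignored.

```python
# Rough parameter count for a Mixtral-8x7B-style model.
# Every constant here is an assumed config value used for illustration.
HIDDEN = 4096        # model (hidden) dimension
FFN_HIDDEN = 14336   # inner dimension of each expert's feed-forward block
LAYERS = 32
N_HEADS = 32
N_KV_HEADS = 8       # grouped-query attention: fewer K/V heads than Q heads
VOCAB = 32000
N_EXPERTS = 8
TOP_K = 2

head_dim = HIDDEN // N_HEADS

# One expert is a gated feed-forward block with three weight matrices.
expert_params = 3 * HIDDEN * FFN_HIDDEN

# Attention is shared by all experts: Q and O are HIDDEN x HIDDEN,
# K and V are smaller because of grouped-query attention.
attn_params = 2 * HIDDEN * HIDDEN + 2 * HIDDEN * (N_KV_HEADS * head_dim)

router_params = HIDDEN * N_EXPERTS   # one linear layer per MoE block
embed_params = 2 * VOCAB * HIDDEN    # input embeddings + output projection


def model_params(experts_counted: int) -> int:
    """Parameters touched when `experts_counted` experts are counted per layer."""
    per_layer = attn_params + router_params + experts_counted * expert_params
    return LAYERS * per_layer + embed_params


total = model_params(N_EXPERTS)   # everything that must sit in memory
active = model_params(TOP_K)      # what one token actually runs through

print(f"naive 8 x 7B : {8 * 7.0:.1f}B")
print(f"total        : {total / 1e9:.1f}B")
print(f"active/token : {active / 1e9:.1f}B")
```

Under these assumed values the count lands near 46.7B total and 12.9B active, which is why the "8x7B" label overstates the model's size: the shared attention and embedding weights are counted once, not eight times.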
This is the reason MoE has attracted so much engineering attention: it's one of the few architectural moves that expands model capacity without a proportional increase in inference cost.
How the Router Decides Which Experts to Activate
The router is a small linear layer that maps each token's hidden representation to a score for each expert, and those scores determine which experts process that token.
In practice, the routing mechanism looks like this. Given a token's hidden state vector, the router applies a learned weight matrix to produce a logit per expert. Those logits pass through a softmax to get probabilities. Then the model takes the top-K experts by probability, normalizes those K weights so they sum to 1, computes each selected expert's output, and returns their weighted sum.
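The steps above can be sketched in a few lines of plain Python. This is a toy illustration, not any model's actual implementation: the function names (`route`, `moe_layer`), the tiny router matrix, and the "experts" that simply scale their input are all hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]


def route(hidden, router_weights, k=2):
    """Return [(expert_index, gate_weight), ...] for one token."""
    # One logit per expert: dot product of the hidden state with that expert's row.
    logits = [sum(w * h for w, h in zip(row, hidden)) for row in router_weights]
    probs = softmax(logits)
    # Keep the k most probable experts, then renormalize so their weights sum to 1.
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]


def moe_layer(hidden, router_weights, experts, k=2):
    """Weighted sum of the selected experts' outputs; other experts never run."""
    out = [0.0] * len(hidden)
    for idx, gate in route(hidden, router_weights, k):
        expert_out = experts[idx](hidden)
        out = [o + gate * v for o, v in zip(out, expert_out)]
    return out


# Toy setup: 4 experts, hidden size 3. Each "expert" just scales its input.
experts = [lambda x, s=s: [s * v for v in x] for s in (1.0, 2.0, 3.0, 4.0)]
router_weights = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [-1.0, -1.0, -1.0],
]
token = [2.0, 1.0, 0.0]

selected = route(token, router_weights)
output = moe_layer(token, router_weights, experts)
print("selected experts and gates:", selected)
print("layer output:", output)
```

For this token the router logits are 2, 1, 0, and -3, so experts 0 and 1 are selected and experts 2 and 3 contribute no computation at all.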
The elegance here is that the router trains end-to-end by gradient descent like any other transformer component. The top-K selection itself is a discrete choice, but gradients flow through the gate weights of the experts that were selected, which is enough for the router to learn. The model learns both what each expert should specialize in and when to route to it, purely from training signal.
That said, there's a well-known instability problem: without constraints, the router tends to collapse toward always using the same one or two experts, leaving the others undertrained. This is called "expert collapse" or "load imbalance," and every MoE implementation has to solve it.
Google's Switch Transformer paper (2021) - available on arXiv - popularized a simplified auxiliary load-balancing loss that penalizes the router for routing too many tokens to the same expert, building on earlier sparsely-gated MoE work. During training, a fraction of the total loss comes from this auxiliary term, which encourages the router to spread tokens more evenly across experts. Widely used open implementations of Mixtral include a similar auxiliary term, and most production MoE implementations follow the same pattern.
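As a rough illustration, the auxiliary term in the Switch Transformer formulation multiplies, for each expert, the fraction of tokens dispatched to it by the mean router probability it received, then sums and scales the result. The sketch below assumes top-1 dispatch; the function name and the coefficient `alpha=0.01` are illustrative choices, not values taken from this article.

```python
def load_balancing_loss(router_probs, assignments, n_experts, alpha=0.01):
    """Switch-style auxiliary loss (sketch).

    router_probs: per-token list of probabilities over experts.
    assignments:  per-token index of the expert the token was sent to (top-1).
    """
    n_tokens = len(assignments)
    # f[i]: fraction of tokens actually dispatched to expert i.
    f = [assignments.count(i) / n_tokens for i in range(n_experts)]
    # p[i]: mean router probability given to expert i across the batch.
    p = [sum(tok[i] for tok in router_probs) / n_tokens for i in range(n_experts)]
    return alpha * n_experts * sum(fi * pi for fi, pi in zip(f, p))


# Evenly spread routing: every expert gets a quarter of the tokens.
balanced_probs = [[0.25, 0.25, 0.25, 0.25] for _ in range(8)]
balanced_assign = [0, 1, 2, 3, 0, 1, 2, 3]

# Collapsed routing: expert 0 takes everything.
collapsed_probs = [[0.97, 0.01, 0.01, 0.01] for _ in range(8)]
collapsed_assign = [0] * 8

balanced = load_balancing_loss(balanced_probs, balanced_assign, 4)
collapsed = load_balancing_loss(collapsed_probs, collapsed_assign, 4)
print(f"balanced:  {balanced:.4f}")
print(f"collapsed: {collapsed:.4f}")
```

The loss is smallest when tokens are spread uniformly and grows as routing concentrates on one expert, which is the pressure that keeps the router from collapsing.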
Another technique is adding small random noise to the routing logits during training. This breaks symmetry early in training before the router has a chance to establish a dominant-expert habit.
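A minimal sketch of the noise idea, assuming plain Gaussian noise with a fixed standard deviation (real systems often learn the noise scale; `noise_std=0.1` and the function name are illustrative assumptions):

```python
import random


def noisy_logits(logits, noise_std=0.1, training=True, rng=random):
    """Add Gaussian noise to router logits during training only."""
    if not training:
        return list(logits)
    return [x + rng.gauss(0.0, noise_std) for x in logits]


# With perfectly tied logits, a plain argmax always picks expert 0.
# Noise spreads the wins across all experts instead.
rng = random.Random(0)
logits = [1.0, 1.0, 1.0, 1.0]
wins = [0, 0, 0, 0]
for _ in range(1000):
    noisy = noisy_logits(logits, rng=rng)
    wins[noisy.index(max(noisy))] += 1

print("wins per expert with noise:", wins)
print("inference logits unchanged:", noisy_logits(logits, training=False))
```

At inference time the noise is switched off, so routing stays deterministic for a given input.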
Some newer MoE variants also use "expert choice" routing, where instead of each token choosing its top-K experts, each expert chooses its top-K tokens from the batch. This guarantees perfect load balance by construction, at the cost of some routing flexibility.
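To make the contrast with token-choice routing concrete, here is a toy version of expert choice, assuming a fixed per-expert capacity and a hypothetical score matrix. It is a sketch of the idea rather than any published implementation.

```python
def expert_choice(scores, capacity):
    """scores[t][e] is the affinity of token t for expert e.

    Each expert picks its own top-`capacity` tokens, so every expert
    processes exactly the same number of tokens.
    """
    n_tokens = len(scores)
    n_experts = len(scores[0])
    picks = {}
    for e in range(n_experts):
        ranked = sorted(range(n_tokens), key=lambda t: scores[t][e], reverse=True)
        picks[e] = ranked[:capacity]
    return picks


# 4 tokens, 2 experts, each expert takes 2 tokens.
scores = [
    [0.9, 0.8],  # token 0: attractive to both experts
    [0.8, 0.1],  # token 1
    [0.1, 0.7],  # token 2
    [0.2, 0.3],  # token 3: low affinity everywhere
]
picks = expert_choice(scores, capacity=2)
chosen = {t for tokens in picks.values() for t in tokens}
print("picks per expert:", picks)
print("tokens no expert picked:", sorted(set(range(len(scores))) - chosen))
```

Load is balanced by construction, but the flexibility cost is visible: token 0 is processed by both experts while token 3 is processed by neither.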
One thing that often surprises people: the learned routing patterns are not as interpretable as you'd hope. You can visualize which experts get routed which tokens, and there are some broad patterns (different experts do tend toward different token types or domains), but it's not a clean "Expert 3 handles math, Expert 7 handles code" split. The specialization is subtler and distributed.
MoE vs Dense Models - The Efficiency Trade-off
In an MoE model, the ratio of total parameters to active parameters per token is the key efficiency metric - and that ratio determines both the quality ceiling and the deployment cost.
Dense models activate all their parameters for every token. A 70B dense model runs 70B parameters per forward pass per token - that sets the compute cost, and the same 70B weights also have to sit in accelerator memory for inference (in practice, weight quantization shrinks that footprint). An MoE model with 70B total parameters and top-2 routing across 8 experts activates roughly 17-18B parameters per token if nearly all of its parameters sit in the experts, cutting inference FLOPs by about 75%. Shared components such as attention always run, so real models land somewhat above that floor.
The trade-off is not symmetric, though. Total parameters still matter for memory at inference time. To run Mixtral 8x7B, you still need to load all ~46B parameters into GPU VRAM - even though only ~13B will activate for each token. The unused experts can't practically be left on disk during inference, because the routing decision is made at runtime and any expert may be needed for the next token; offloading schemes exist, but they trade away most of the speed. This is what makes MoE models difficult to self-host on consumer hardware despite their competitive inference speeds in cloud settings.
For a dense model like Llama 3 70B at half-precision (FP16), you need about 140GB of VRAM. Mixtral 8x7B at FP16 needs roughly 90GB - still substantial, though you get the effective capacity of a much larger model.
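The arithmetic behind those figures is just parameter count times bytes per parameter. The sketch below covers weights only (KV cache and activations add more on top), and the 46.7B figure for Mixtral 8x7B is an assumed total used for illustration.

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(total_params_billions, precision):
    """Weights-only memory in GB: billions of params x bytes per param."""
    return total_params_billions * BYTES_PER_PARAM[precision]


models = [
    ("Llama 3 70B (dense)", 70.0),
    ("Mixtral 8x7B (MoE, all experts)", 46.7),  # assumed total parameter count
]

for name, params in models:
    for precision in ("fp16", "int8", "int4"):
        gb = weight_memory_gb(params, precision)
        print(f"{name:34s} {precision:5s} {gb:6.1f} GB")
```

Note that the MoE row uses the total parameter count, not the active count: every expert has to be resident, whichever ones a given token ends up using.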
Here's the quality story. On most standard benchmarks, Mixtral 8x7B outperformed Llama 2 70B despite using far fewer active parameters per token. Why? Because during training, every expert's weights are updated on the tokens routed to it - different experts develop different statistical strengths, and the routing mechanism learns to combine them appropriately. The model has effectively learned from a larger parameter space even if it uses only part of it for any single token.
This is the key insight that doesn't show up in a simple FLOP count comparison. MoE models can punch above their active-parameter weight during inference because their training was more parameter-rich.
The tradeoff starts to go the other way in latency-sensitive, single-user scenarios. If you're running one query at a time on local hardware, expert routing adds overhead. The real throughput advantage of MoE shows up in high-concurrency server deployments where you're batching many requests simultaneously - in that setting, the reduced active-parameter count translates directly to higher tokens-per-second and lower cost per token.
Real MoE Models in 2026
The clearest confirmation that MoE works at the frontier is that many of the most capable models currently available use it - even if the companies behind them haven't always officially confirmed the architecture.
Here's what we know and what we can reasonably infer as of mid-2026.
Mixtral 8x22B and 8x7B (Mistral AI) are the most architecturally transparent MoE models available. Mistral released both with documentation confirming the number of experts, top-K routing (K=2), and the load-balancing approach. Mixtral 8x22B has approximately 141B total parameters, activating around 39B per token. It represents one of the clearest proof points that MoE scales well - it outperforms many dense models two to three times its active-parameter size on coding and reasoning tasks.
GPT-4 is widely believed to be an MoE model based on information that leaked through various channels in 2023. The commonly cited numbers - 8 experts of roughly 220B parameters each, for about 1.8T total - have never been confirmed by OpenAI, and should be treated as informed estimates rather than facts. Architecture can't be read off an API's behavior, but GPT-4's capability relative to its serving cost is at least consistent with MoE giving you frontier-quality output at a fraction of the parameter activation cost.
Gemini 1.5 has been publicly described by Google as built on a mixture-of-experts architecture, though without expert counts or parameter figures. For Gemini Ultra and other members of the family, MoE is inferred rather than confirmed, based on Google's research trajectory: the Switch Transformer paper came from Google Brain, as did follow-up work on GLaM and other MoE systems. It would be architecturally surprising if Gemini's frontier models didn't incorporate MoE. Withholding the specifics is normal for closed frontier systems.
DeepSeek-V3 is arguably the most architecturally interesting openly documented MoE deployment. It uses 256 "fine-grained" routed experts with top-8 routing, plus one "shared expert" that always activates regardless of routing scores. This split between routed and shared experts is a newer design choice that helps maintain a consistent base capability while still getting specialization benefits. With 671B total parameters but only 37B active per token, it achieves near-frontier performance at a fraction of the inference cost.
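The routed-plus-shared split can be sketched as follows. This is a generic illustration of the pattern, not DeepSeek's implementation: the function name, the pre-computed routing scores, and the toy experts are all hypothetical.

```python
def shared_plus_routed(x, shared_experts, routed_experts, scores, k):
    """Shared experts always run; only the top-k routed experts run."""
    out = [0.0] * len(x)

    # Shared experts: unconditional, no gating.
    for expert in shared_experts:
        out = [o + v for o, v in zip(out, expert(x))]

    # Routed experts: pick the k highest-scoring ones and renormalize their gates.
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    norm = sum(scores[i] for i in top)
    for i in top:
        gate = scores[i] / norm
        out = [o + gate * v for o, v in zip(out, routed_experts[i](x))]
    return out, top


# Toy setup: one shared expert (identity) and four routed experts (scalers).
shared = [lambda x: list(x)]
routed = [lambda x, s=s: [s * v for v in x] for s in (10.0, 20.0, 30.0, 40.0)]
scores = [0.1, 0.6, 0.2, 0.1]  # assumed router scores for this token

out, top = shared_plus_routed([1.0], shared, routed, scores, k=2)
print("routed experts used:", top)
print("output:", out)
```

The shared path gives every token the same baseline transformation, while the routed path adds the token-specific specialization on top.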
The broader trend is clear. If you're comparing models for your business use case, most frontier models are now MoE under the hood - which means understanding MoE helps you understand why inference costs vary so much between providers even when capability is similar.
The Catch: MoE Models Are Harder to Deploy
MoE architecture introduces three deployment challenges that don't exist with dense models: memory requirements that exceed what the active-parameter count suggests, load balancing complexity in production, and communication overhead in multi-GPU setups.
Let me go through each one, because they're why you can't just assume "MoE = cheaper to run" in all contexts.
Memory problem. As covered above, all expert parameters stay in VRAM the entire time. A useful mental model: MoE reduces compute cost (FLOPs), not memory cost (VRAM). If you're evaluating whether a model fits on a given GPU cluster, you need to think about total parameters, not active parameters. For providers running at scale, this often means using expert parallelism across multiple nodes, adding infrastructure complexity that dense model deployments don't need.
Load balancing at inference. The auxiliary load-balancing loss during training helps ensure reasonably even expert utilization, but it's not a guarantee. At inference time, when you're processing real-world request distributions, you can still see hot experts - experts that get routed most tokens for a given input type. If experts are spread across different GPUs (as they are in large deployments), a hot expert creates a compute bottleneck on its GPU while other GPUs sit idle. Production MoE serving systems often implement capacity buffers and overflow mechanisms to handle this.
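A capacity buffer can be sketched as a per-expert token limit derived from a capacity factor, in the style the Switch Transformer paper describes. The function name and the factor of 1.25 are illustrative assumptions, and what happens to overflow tokens (dropping, re-routing, or passing them through unchanged) varies by system.

```python
import math


def dispatch_with_capacity(assignments, n_experts, capacity_factor=1.25):
    """Assign tokens to experts, diverting tokens beyond each expert's capacity.

    assignments[t] is the expert the router picked for token t.
    """
    n_tokens = len(assignments)
    # Even share per expert, padded by the capacity factor.
    capacity = math.ceil(capacity_factor * n_tokens / n_experts)

    buckets = {e: [] for e in range(n_experts)}
    overflow = []
    for token_idx, expert in enumerate(assignments):
        if len(buckets[expert]) < capacity:
            buckets[expert].append(token_idx)
        else:
            overflow.append(token_idx)  # hot expert is full
    return buckets, overflow, capacity


# 8 tokens, 4 experts, and a hot expert 0 that the router picked five times.
assignments = [0, 0, 0, 0, 0, 1, 2, 3]
buckets, overflow, capacity = dispatch_with_capacity(assignments, n_experts=4)
print("capacity per expert:", capacity)
print("buckets:", buckets)
print("overflow tokens:", overflow)
```

A larger capacity factor means fewer overflow tokens but more padding and wasted compute on the experts that are not hot, which is the tuning trade-off serving systems have to make.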
Cross-GPU communication. In large-scale deployment, different experts typically live on different GPUs. That means every MoE layer requires an all-to-all communication step: tokens that got routed to Expert 5 need to be sent from wherever they originated to the GPU that holds Expert 5, then the result needs to come back. This all-to-all communication is expensive in terms of network bandwidth and adds latency. The Switch Transformer paper documented this overhead as one of the main engineering challenges at scale.
These challenges are solvable - Mixtral runs fine in production at Mistral and through the major API providers - but they explain why "MoE is more efficient" isn't a universal statement. It's more efficient when you have the infrastructure to handle it.
For most people choosing tools rather than building them, the practical implication is: MoE models you access via API are priced competitively and run fast in the cloud. MoE models you try to self-host on local hardware are harder to run than their active-parameter count suggests. If you're thinking through when to use cloud AI vs local AI, MoE architecture is a genuine factor in that decision.
What I Noticed Testing MoE vs Dense Models Side-by-Side
I've been running structured side-by-side tests comparing MoE and dense models across the same task categories since early 2025, and there are a few patterns I've noticed that most benchmark comparisons miss.
The short version: MoE models are not just "cheaper dense models." They have a distinct behavioral fingerprint, and once you know what to look for, it's recognizable.
The most consistent pattern I noticed was in long-form writing. When I gave Mixtral 8x22B a 3,000-word essay to write with a specific voice and style, it would drift more than a comparably capable dense model like Llama 3 70B. The middle sections would subtly shift tone - sometimes toward more formal language, sometimes losing the thread of the introduced metaphors. My working hypothesis is that different sections of the text were routing through different expert configurations, and the stylistic consistency wasn't fully maintained across those switches.
I was wrong to assume this would be a universal problem. When I tested the same models on factual question-answering across diverse domains - law, biology, software engineering, history in the same session - the MoE model handled the domain shifts more smoothly. The routing that created inconsistency in long-form writing seemed to actually help when the task required broad domain knowledge across sections.
The instruction following observation surprised me more. On complex multi-step instructions with many constraints ("write X but avoid Y, in style Z, with format W"), the dense models I tested were more reliably compliant on all constraints simultaneously. My guess - and it's a guess - is that multi-constraint compliance benefits from all parameters working on the same representation, rather than specialist sub-networks that may optimize differently on different constraints.
For the use cases I actually care about day-to-day - coding assistance, structured writing, research tasks - the MoE vs dense distinction became a non-factor at the frontier level. GPT-4 (likely MoE) and Claude Opus 4 (architecture undisclosed) felt different in practice, but the difference wasn't primarily explainable by architecture. Training data, RLHF, and post-training alignment matter at least as much.
Where the architecture difference showed up clearly was in cost and throughput. Running Mixtral 8x22B for batch processing tasks via API cost me roughly 40-60% less than comparable dense models at similar quality, and throughput was noticeably higher. For AI tool stacks built around high-volume automation, that cost difference compounds significantly.
The honest summary: MoE models are better value for money at medium-to-large scale, occasionally weaker on stylistic consistency tasks, and roughly comparable on most benchmark tasks to similarly capable dense models. Understanding this helped me stop treating architecture as the primary variable and start focusing on the right evaluation criteria for each use case. Our AI output quality evaluation guide covers how to build those criteria systematically.
When MoE Architecture Matters for Choosing Your AI Tool
MoE architecture directly affects your tool choice in four specific situations, and matters very little in most others.
Let me be specific about when it's actually a relevant variable.
Situation 1: High-volume API usage with cost constraints. If you're making millions of API calls per month - for content generation, data extraction, RAG pipelines, or AI agent workflows - the MoE efficiency advantage translates directly into your invoice. Mistral's Mixtral models are priced significantly below comparable dense models on a per-token basis. DeepSeek-V3 is even more aggressively priced. If you're building something where inference cost scales with usage, MoE-based models are worth evaluating carefully on your specific tasks before defaulting to a more expensive dense model.
Situation 2: Self-hosting vs cloud deployment decisions. If you're evaluating open-source vs closed AI and considering running models locally, MoE architecture significantly affects hardware requirements. A 46B-parameter MoE model needs as much VRAM as a 46B dense model - not as much as a 13B model - even though inference compute is closer to a 13B model. For on-premises deployments, the memory requirement matters more than the compute efficiency. This is a real consideration in AI privacy decisions where you need to run models locally.
Situation 3: Diverse multi-domain workloads. If your use case spans dramatically different domains - technical writing, legal analysis, code debugging, and creative tasks in the same pipeline - there's a credible theoretical argument that MoE models handle this better due to expert specialization. My testing supports this hypothesis partially: MoE models did better on cross-domain breadth tasks than on single-domain depth tasks. It's not a guaranteed win, but worth testing.
Situation 4: Evaluating a model's capabilities vs its cost. When you look at a model like Mixtral 8x22B scoring competitively against much larger dense models on benchmarks, understanding MoE explains why that's possible. Without that context, you might incorrectly conclude either that the parameter count is misleading or that the benchmarks are gamed. Neither is the case: the model has full large-model capacity because all 141B parameters were trained, but it runs only about 39B of them for any given token. This affects how you interpret model comparison data.
For the vast majority of people using AI tools - whether for coding, writing, research, or AI assistant tasks - the MoE vs dense distinction will not be the most important variable in your decision. Prompt engineering quality, context window size, and whether the model has been fine-tuned for your domain will all matter more in practice.
Where MoE knowledge pays off concretely is when you're reading model cards, evaluating API pricing, or making infrastructure decisions for a product. Understanding that a model's "total parameters" and "effective compute per query" can diverge by anywhere from about 3.6x (Mixtral) to about 18x (DeepSeek-V3) means you're less likely to be misled by parameter count as a quality proxy.
You can explore how different models compare on this dimension using our model comparison tool, or if you're not sure what architecture tier fits your use case, the AI tool quiz can narrow it down based on your actual requirements.
MoE is one of the concepts that separates someone who reads AI headlines from someone who can evaluate AI tools with real technical grounding. It's not complicated once the core mechanic clicks - only some parameters activate per token - but that one insight unlocks a whole layer of understanding about why frontier models are priced the way they are and why some models punch above their inference cost.
FAQ
What is Mixture of Experts in simple terms?
Mixture of Experts is an AI architecture where a model contains many specialized sub-networks ("experts"), but only activates a small fraction of them for each token it processes. The rest sit idle. This means a very large model can answer questions while using far less compute than its total size would suggest.
How is MoE different from a regular (dense) neural network?
In a dense model, every token passes through every parameter. In an MoE model, a small routing network first decides which few "expert" sub-networks (typically between 1 and 8) are most appropriate for each token, then only those experts run. The parameters of all other experts exist in memory but contribute zero compute for that token. The difference is called "sparse activation."
Why does MoE save compute but not memory?
The routing decision happens at inference time, so all experts must be loaded into VRAM before any token is processed - you can't know in advance which experts will be needed. Compute savings come from only running chosen experts. Memory savings don't materialize because you still need all parameters available at a moment's notice.
What is top-K routing in MoE?
Top-K routing means each token is sent to the K experts with the highest routing scores. K=2 (top-2) is the most common setting, used in Mixtral. Some models use K=1 for maximum efficiency or K=8 for more capacity. Higher K means more experts activate per token, more compute, but potentially better quality.
Does MoE cause hallucinations more often?
There's no strong evidence that MoE architecture causes more hallucination than dense models at equivalent capability levels. Hallucination is primarily a function of training data, post-training alignment, and model size relative to task difficulty. MoE models that score higher on benchmarks than smaller dense models tend to hallucinate less, as you'd expect from the capability difference.
Is GPT-4 an MoE model?
OpenAI has never officially confirmed GPT-4's architecture. Leaked information and technical analysis strongly suggest it uses a mixture-of-experts design with approximately 8 experts, but these numbers should be treated as estimates rather than facts. The performance-to-cost ratio of GPT-4 via API is consistent with MoE efficiency, but so are other architectural choices.
What is "expert collapse" in MoE training?
Expert collapse is when the routing network learns to always send tokens to the same one or two experts, leaving the remaining experts with almost no gradient signal and therefore undertrained. This defeats the purpose of having multiple experts. It's mitigated with auxiliary load-balancing losses that penalize the router for uneven token distribution, and by adding random noise to routing logits early in training.
Can I run MoE models locally?
Yes, but you need as much VRAM as the total parameter count requires - not just the active-parameter count. Mixtral 8x7B (~46B total parameters) requires about 90GB VRAM in FP16, similar to a dense 46B model. Quantized versions (4-bit or 8-bit) reduce this substantially: a Q4 quantization of Mixtral 8x7B can run on around 26GB, which fits on a high-end consumer GPU or a pair of standard ones. See our cloud AI vs local AI guide for more on the infrastructure decision.
How do I know if a model uses MoE architecture?
Open-weight models like Mixtral usually document their architecture in the model card. Closed models like GPT-4 rarely confirm architecture officially. Indirect signals include: unusually competitive performance relative to reported active parameters, pricing below what the total parameter count would normally justify, and technical blog posts from the company referencing sparse activation or conditional computation. The absence of confirmation doesn't mean dense - most closed frontier models are cagey about architecture details.
Does MoE affect how I should prompt a model?
Not significantly. The routing happens at the sub-layer level and is opaque to users. Your prompts interact with the model the same way regardless of whether it's MoE or dense. The practical difference shows up in quality patterns (as described in the testing section above) rather than in how you need to write prompts. Understanding prompt engineering is the same skill regardless of the underlying architecture.
How does MoE relate to the transformer architecture?
MoE is an extension to the transformer architecture, not a replacement for it. In a standard transformer, each layer contains an attention sub-layer and a feed-forward network (FFN) sub-layer. In an MoE transformer, the FFN sub-layer is replaced by a collection of FFN experts plus a router. The attention mechanism, residual connections, layer normalization, and everything else about the transformer stays the same. MoE is best understood as an upgrade to one component of the transformer, not a different architecture altogether.
What's the difference between MoE and fine-tuning?
Fine-tuning is a training technique that updates a model's parameters on a specific dataset. MoE is an architectural design choice about how parameters are organized and activated. They're independent - you can fine-tune an MoE model, and fine-tuning a dense model doesn't make it an MoE model. Some fine-tuning approaches (like mixture-of-experts LoRA adapters) borrow MoE ideas, but that's a separate pattern.